# TF-IDF Code Analysis

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical measure that highlights the words most important to one document relative to a collection of documents. I applied it at three levels: across individual reports, across banks, and across years.
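
To make the weighting concrete, here is a minimal sketch of the calculation in plain Python. It assumes raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which to my understanding matches the default settings of scikit-learn's `TfidfVectorizer` apart from tokenisation. The `tfidf` helper and the toy corpus are hypothetical and are not part of the original analysis.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1,
    and L2 normalisation of each document vector.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the vector to unit length so long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({term: w / norm for term, w in raw.items()})
    return weights


# Hypothetical toy corpus, not taken from the reports.
toy_docs = [
    "family office impact impact",
    "family office allocation",
    "family office survey",
]
toy_weights = tfidf(toy_docs)
for doc_weights in toy_weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

In the toy corpus, "family" and "office" appear in every document and so receive the lowest weights, while the term that is unique to each document ranks first.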

Although I did not end up using the TF-IDF analysis in my final research, the code for it is kept here in a separate notebook.

## Data Preprocessing
This follows the same process as the main notebook.

In [12]:
import os
import string
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import PyPDF2
import spacy
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Small English spaCy pipeline used for tokenisation and stop-word removal
nlp = spacy.load('en_core_web_sm')

In [2]:

pdf_folder = 'reports'
pdf_texts = {}

for filename in os.listdir(pdf_folder):
    if filename.endswith('.pdf'):
        report_name = filename.replace('.pdf', '').lower()
        with open(os.path.join(pdf_folder, filename), 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ""
            for page in reader.pages:
                content = page.extract_text()
                if content:
                    text += content + "\n"
            pdf_texts[report_name] = text

In [3]:
ubs24 = pdf_texts['ubs_24']
ubs23 = pdf_texts['ubs_23']
ubs22 = pdf_texts['ubs_22']
pwc24 = pdf_texts['pwc_24']
pwc23 = pdf_texts['pwc_23']
pwc22 = pdf_texts['pwc_22']
citi24 = pdf_texts['citi_24']
citi23 = pdf_texts['citi_23']
citi22 = pdf_texts['citi_22']
campden24 = pdf_texts['campden_24']
campden23 = pdf_texts['campden_23']
campden22 = pdf_texts['campden_22']

In [4]:
def clean_text_spacy(text):
    """Lowercase the text and keep only alphabetic, non-stop-word tokens."""
    doc = nlp(text)

    # token.is_alpha already excludes punctuation, so no separate is_punct check is needed
    cleaned_tokens = [
        token.text.lower() for token in doc if token.is_alpha and not token.is_stop
    ]

    return ' '.join(cleaned_tokens)

In [5]:
ubs24_basic = clean_text_spacy(ubs24)
ubs23_basic = clean_text_spacy(ubs23)
ubs22_basic = clean_text_spacy(ubs22)
pwc24_basic = clean_text_spacy(pwc24)
pwc23_basic = clean_text_spacy(pwc23)
pwc22_basic = clean_text_spacy(pwc22)
citi24_basic = clean_text_spacy(citi24)
citi23_basic = clean_text_spacy(citi23)
citi22_basic = clean_text_spacy(citi22)
campden24_basic = clean_text_spacy(campden24)
campden23_basic = clean_text_spacy(campden23)
campden22_basic = clean_text_spacy(campden22)

The goal is to see which words are distinctive or uniquely important at three levels of aggregation:

1. Each report as its own document 
2. Each bank as its own document 
3. Each year as its own document
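
The three groupings can also be derived from the report labels instead of list positions. The sketch below is one hedged way to do that, assuming every label has the form `bank_yy` (for example `ubs_22`). The `group_documents` helper and the placeholder texts are hypothetical and are not used in the cells that follow.

```python
from collections import defaultdict


def group_documents(texts_by_label, level):
    """Concatenate report texts per report, per bank, or per year.

    texts_by_label maps labels such as 'ubs_22' to cleaned text.
    level is one of 'report', 'bank', or 'year'.
    """
    grouped = defaultdict(list)
    for label, text in texts_by_label.items():
        # Split on the last underscore: 'ubs_22' -> ('ubs', '22')
        bank, year = label.rsplit("_", 1)
        key = {"report": label, "bank": bank, "year": year}[level]
        grouped[key].append(text)
    return {key: " ".join(parts) for key, parts in grouped.items()}


# Hypothetical placeholder texts standing in for the cleaned reports.
toy_texts = {
    "ubs_22": "alpha", "ubs_23": "beta",
    "citi_22": "gamma", "citi_23": "delta",
}
by_bank = group_documents(toy_texts, "bank")
by_year = group_documents(toy_texts, "year")
print(by_bank)
print(by_year)
```

Grouping by label keeps the bank and year assignments correct even if the order of the document list changes.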

## 1. Document Level

In [6]:
def ensure_string(doc):
    return ' '.join(doc) if isinstance(doc, list) else doc

documents = [ensure_string(doc) for doc in [
    ubs22_basic, ubs23_basic, ubs24_basic,
    citi22_basic, citi23_basic, citi24_basic,
    pwc22_basic, pwc23_basic, pwc24_basic,
    campden22_basic, campden23_basic, campden24_basic
]]

In [7]:
doc_labels = ['ubs_22', 'ubs_23', 'ubs_24',
              'citi_22', 'citi_23', 'citi_24',
              'pwc_22', 'pwc_23', 'pwc_24',
              'campden_22', 'campden_23', 'campden_24']

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85)
X = vectorizer.fit_transform(documents)
tfidf_df = pd.DataFrame(X.toarray(), index=doc_labels, columns=vectorizer.get_feature_names_out())

# Top terms per report
for doc in tfidf_df.index:
    print(f"\nTop terms in {doc}:")
    print(tfidf_df.loc[doc].sort_values(ascending=False).head(10))


Top terms in ubs_22:
ubs            0.608322
gfo            0.566757
costs          0.113209
western        0.111199
allocation     0.097119
staff          0.096075
income         0.093690
distributed    0.083037
ledger         0.082498
latin          0.082350
Name: ubs_22, dtype: float64

Top terms in ubs_23:
ubs            0.738285
allocations    0.138755
ag             0.134886
se             0.116817
bank           0.111607
offi           0.102933
ces            0.096070
authority      0.094777
german         0.088978
geopolitics    0.088500
Name: ubs_23, dtype: float64

Top terms in ubs_24:
ubs            0.699203
income         0.189426
allocations    0.130658
operating      0.121750
rates          0.112557
ag             0.104832
se             0.101200
bank           0.097994
allocation     0.092055
plan           0.086116
Name: ubs_24, dtype: float64

Top terms in citi_22:
survey        0.299535
citibank      0.275708
index         0.246738
aum           0.169736
citigroup     
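
A note on the `max_df=0.85` setting, which is reused at the bank and year levels below. As I understand scikit-learn's behaviour, a fractional `max_df` drops every term whose document frequency is strictly greater than `max_df` times the number of documents. The sketch below only does that arithmetic for the three corpus sizes used in this notebook. The helper name is hypothetical.

```python
def dropped_document_frequencies(n_docs, max_df=0.85):
    """List the document frequencies that a fractional max_df would exclude."""
    threshold = max_df * n_docs
    return [df for df in range(1, n_docs + 1) if df > threshold]


# Corpus sizes used in this notebook: 12 reports, 4 banks, 3 years.
for level, n_docs in [("report", 12), ("bank", 4), ("year", 3)]:
    print(level, n_docs, dropped_document_frequencies(n_docs))
```

Under that assumption, the report level drops terms found in 11 or 12 reports, while the bank and year levels drop any term that appears in all 4 banks or all 3 years. With so few documents at the year level, most shared vocabulary is removed, which may partly explain why the remaining top terms there look idiosyncratic.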

## 2. Bank Level

In [8]:
bank_docs = {
    "UBS": ' '.join(documents[0:3]),
    "Citi": ' '.join(documents[3:6]),
    "PwC": ' '.join(documents[6:9]),
    "Campden": ' '.join(documents[9:12]),
}

In [9]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85)
X = vectorizer.fit_transform(bank_docs.values())
tfidf_df_bank = pd.DataFrame(X.toarray(), index=bank_docs.keys(), columns=vectorizer.get_feature_names_out())

# Top terms per bank
for bank in tfidf_df_bank.index:
    print(f"\nTop terms for {bank}:")
    print(tfidf_df_bank.loc[bank].sort_values(ascending=False).head(10))


Top terms for UBS:
ubs            0.787555
allocations    0.155048
gfo            0.146243
allocation     0.128152
ag             0.115495
bank           0.113913
document       0.102838
plan           0.101256
operating      0.098091
unsplash       0.094190
Name: UBS, dtype: float64

Top terms for Citi:
survey         0.463460
view           0.368759
mm             0.272006
citibank       0.238287
vs             0.206815
respondents    0.189402
latin          0.163574
index          0.149226
citi           0.146119
citigroup      0.137127
Name: Citi, dtype: float64

Top terms for PwC:
pwc             0.669527
ofﬁces          0.354279
volume          0.254877
ofﬁce           0.249196
study           0.243379
club            0.175165
june            0.172473
transactions    0.122648
pitchbook       0.114090
ﬁrst            0.108085
Name: PwC, dtype: float64

Top terms for Campden:
percent       0.482354
campden       0.450367
figure        0.369318
fig           0.201520
hsbc          

## 3. Year Level

In [10]:
year_docs = {
    "2022": ' '.join([documents[0], documents[3], documents[6], documents[9]]),
    "2023": ' '.join([documents[1], documents[4], documents[7], documents[10]]),
    "2024": ' '.join([documents[2], documents[5], documents[8], documents[11]]),
}

In [11]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85)
X = vectorizer.fit_transform(year_docs.values())
tfidf_df_year = pd.DataFrame(X.toarray(), index=year_docs.keys(), columns=vectorizer.get_feature_names_out())

# Top terms per year
for year in tfidf_df_year.index:
    print(f"\nTop terms for {year}:")
    print(tfidf_df_year.loc[year].sort_values(ascending=False).head(10))


Top terms for 2022:
gfo            0.580323
apply          0.321662
inhouse        0.196720
refinitiv      0.118032
overweight     0.112208
yesno          0.108196
underweight    0.097247
em             0.088524
learn          0.082286
fo             0.078688
Name: 2022, dtype: float64

Top terms for 2023:
ofﬁces      0.624058
ofﬁce       0.438956
fig         0.225240
ﬁrst        0.190391
amily       0.100484
ups         0.088487
controls    0.080443
offi        0.079329
ces         0.074041
bullish     0.072399
Name: 2023, dtype: float64

Top terms for 2024:
hsbc          0.589395
fig           0.373284
pitchbook     0.176819
infralogic    0.142080
disagree      0.122791
chart         0.122706
display       0.116248
volumedeal    0.109789
generative    0.109789
likelier      0.093321
Name: 2024, dtype: float64


TF-IDF surfaces the terms most distinctive to each document, but its usefulness for this analysis is limited. The top terms it highlights are often company names or context-light words such as “ubs,” “campden,” or “survey,” which carry little semantic depth. Because TF-IDF does not account for the meaning of words or how they are used in context, it fails to capture the underlying values, ideologies, or discursive strategies central to understanding impact investing. For an analysis focused on the narratives and intentional language used by financial actors, semantic similarity and named entity recognition offered richer and more relevant insights.
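
Some of the noise in the top terms could probably be reduced before vectorising. The sketch below is an assumption about how that might look and was not part of the analysis above. It folds ligature characters (as in “ofﬁces” and “ﬁrst”) into plain letters with Unicode NFKC normalisation and removes a hypothetical list of brand names taken from the output above. It would not repair words that the PDF extraction split into fragments such as “offi” and “ces”.

```python
import unicodedata

# Hypothetical exclusion list; the names come from the top-term output above.
BRAND_TERMS = {"ubs", "citi", "citibank", "citigroup", "pwc", "campden"}


def normalize_extracted_text(text):
    """Fold compatibility characters such as the 'fi' ligature into plain letters."""
    return unicodedata.normalize("NFKC", text)


def drop_brand_terms(cleaned_text, brand_terms=BRAND_TERMS):
    """Remove brand names from an already lower-cased, space-separated text."""
    return " ".join(t for t in cleaned_text.split() if t not in brand_terms)


# '\ufb01' is the single-character 'fi' ligature seen in the extracted PDFs.
sample = "ubs family of\ufb01ces \ufb01rst survey"
cleaned_sample = drop_brand_terms(normalize_extracted_text(sample))
print(cleaned_sample)
```

Even with these steps, the core limitation described above remains: the weights still reflect word frequency rather than meaning.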