## TF-IDF

<a href="https://colab.research.google.com/github/febse/ta2025/blob/main/02-03-TF-IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

Until now we have looked at the term frequency matrix (counts of words in documents). However, raw counts do not
reflect how informative a word is. For example, the word "the" is likely to appear in most documents, so it says
little about any one of them. Term frequency-inverse document frequency (TF-IDF) is a weighting scheme that scales
each count down when the word occurs in many documents of the corpus.

**Term frequency** ($TF(i, d)$) is the number of occurrences of word (token) $i$ in document $d$. It depends strongly
on how general a word is (e.g. "has" vs. "cosine" in general literature) and also on the length of the document.

**Document frequency** ($DF(i)$) is the number of documents that contain word $i$.

**Inverse document frequency** ($IDF(i)$) is simply the inverse of the relative document frequency of the word in the
set of documents. With $N$ documents, the relative document frequency and the IDF are given by:

$$
    DF_{rel}(i) = \frac{DF(i)}{N}
$$

$$
    IDF(i) = \frac{1}{DF_{rel}(i)} = \frac{N}{DF(i)}
$$

It is small for words that occur in many documents, and large for words that appear in only a few documents.

A problem with this definition is that the IDF of rare words becomes very large in large corpora (large $N$), so it is
commonly replaced by its logarithm:

$$
\text{IDF}(i) = 1 + \log\left(\frac{N}{DF(i)}\right)
$$
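
To see the effect of the logarithm, the sketch below compares the raw and the log-scaled IDF of a word that occurs in ten documents while the corpus grows. The function names `raw_idf` and `log_idf` and the corpus sizes are illustrative choices, not part of any library.

```python
import math

def raw_idf(n_docs, doc_freq):
    """Raw inverse document frequency: N / DF."""
    return n_docs / doc_freq

def log_idf(n_docs, doc_freq):
    """Log-scaled inverse document frequency: 1 + ln(N / DF)."""
    return 1 + math.log(n_docs / doc_freq)

# A word that appears in 10 documents, in corpora of increasing size
for n_docs in (100, 10_000, 1_000_000):
    print(f"N={n_docs:>9}: raw={raw_idf(n_docs, 10):>9.1f}  log={log_idf(n_docs, 10):.3f}")
```

The raw IDF grows in proportion to $N$, while the log-scaled version only increases by a constant each time the corpus becomes a hundred times larger.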

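The addition of 1 in the logarithmic IDF ensures that words that occur in all documents are not discarded entirely: for such words $\log(N / DF(i)) = \log(1) = 0$, so without the constant their weight would be zero. Here $\log$ is the natural logarithm. The default IDF used in `TfidfVectorizer` (with `smooth_idf=True`) additionally adds 1 to the numerator and to the denominator, as if the corpus contained one extra document with every word in it, which prevents division by zero: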

$$
\text{IDF}(i) = 1 + \log\left(\frac{N + 1}{DF(i) + 1}\right)
$$

The TF-IDF weight of word $i$ in document $d$ is the product of its term frequency and its IDF:

$$
\text{TF-IDF}(i, d) = TF(i, d) \times \text{IDF}(i)
$$

Let's calculate it for the toy corpus with just three documents:

```
    "the quick brown fox",
    "the fast brown dog",
    "the quick red fox"
```


In [51]:
import math
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

corpus = [
    'the quick brown fox',
    'the fast brown dog',
    'the quick red fox'
]

c_vect = CountVectorizer()

# Sparse document-term matrix of raw counts, converted to a dense array for display
term_matrix = c_vect.fit_transform(corpus)
term_matrix_dense = term_matrix.toarray()

pd.DataFrame(term_matrix_dense, columns=c_vect.get_feature_names_out())

| | brown | dog | fast | fox | quick | red | the |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |


In [31]:
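# norm=None switches off row normalization, so the output is the plain TF * IDF product
tfidf_vect = TfidfVectorizer(smooth_idf=True, use_idf=True, norm=None)
tfidf_term_matrix = tfidf_vect.fit_transform(corpus)
pd.DataFrame(
    tfidf_term_matrix.toarray(),
    # Take the column names from the fitted TF-IDF vectorizer itself
    columns=tfidf_vect.get_feature_names_out(),
    index=[f"doc{i}" for i in range(1, len(corpus) + 1)]
)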

| | brown | dog | fast | fox | quick | red | the |
|---|---|---|---|---|---|---|---|
| doc1 | 1.287682 | 0.0 | 0.0 | 1.287682 | 1.287682 | 0.0 | 1.0 |
| doc2 | 1.287682 | 1.693147 | 1.693147 | 0.0 | 0.0 | 0.0 | 1.0 |
| doc3 | 0.0 | 0.0 | 0.0 | 1.287682 | 1.287682 | 1.693147 | 1.0 |
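
The weights above are computed with `norm=None`, so they are plain TF-IDF products. With the default `norm="l2"`, `TfidfVectorizer` additionally divides each row by its Euclidean length, so that every document vector has length 1. A minimal pure-Python sketch of that step for the `doc1` row, where the helper name `l2_normalize` is an assumption made for illustration:

```python
import math

def l2_normalize(row):
    """Divide a vector by its Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

# Unnormalized TF-IDF weights of doc1 (columns: brown, dog, fast, fox, quick, red, the)
doc1 = [1.287682, 0.0, 0.0, 1.287682, 1.287682, 0.0, 1.0]
doc1_unit = l2_normalize(doc1)

print([round(x, 4) for x in doc1_unit])
print(math.sqrt(sum(x * x for x in doc1_unit)))  # length of the normalized vector
```

Normalization makes documents of different lengths comparable, because a long review no longer gets larger weights just for containing more words.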


In [32]:
# Get the inverse document frequency of each term
# (same order as the columns returned by get_feature_names_out())
tfidf_vect.idf_

array([1.28768207, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
       1.69314718, 1.        ])

In [33]:
# Homework exercise: check the IDF calculations manually and compare
# these with the values from the TfidfVectorizer

# Smoothed IDF for "the": N = 3 documents, DF = 3
idf_the = 1 + math.log((3 + 1) / (3 + 1))
print(idf_the)

# Smoothed IDF for "brown": N = 3 documents, DF = 2
# (the IDF belongs to the term and is the same in every document)
idf_brown = 1 + math.log((3 + 1) / (2 + 1))
print(idf_brown)

1.0
1.2876820724517808
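
The manual check can be extended to the whole vocabulary. The following is a minimal sketch that recomputes the document frequencies, the smoothed IDF and the unnormalized TF-IDF weights for the three-document toy corpus using only the standard library. It assumes whitespace tokenization, which is sufficient for this corpus but is not how `TfidfVectorizer` tokenizes in general; the variable names are illustrative.

```python
import math
from collections import Counter

corpus = [
    "the quick brown fox",
    "the fast brown dog",
    "the quick red fox",
]

# Whitespace tokenization (an assumption of this sketch)
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF: 1 + ln((N + 1) / (DF + 1))
idf = {term: 1 + math.log((n_docs + 1) / (count + 1)) for term, count in doc_freq.items()}

# Unnormalized TF-IDF: term count in the document times the IDF of the term
tfidf = [
    {term: tf * idf[term] for term, tf in Counter(doc).items()}
    for doc in docs
]

for term in sorted(idf):
    print(f"{term:>6}: DF={doc_freq[term]}  IDF={idf[term]:.6f}")
```

The printed IDF values can be compared term by term with `tfidf_vect.idf_`.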


In [52]:
# Read in the data
df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/reviews.csv")
df.head()


| | Unnamed: 0 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|---|
| 0 | 394349 | Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat... | | 244.95 | 5 | Very good one! Better than Samsung S and iphon... | 0.0 |
| 1 | 34377 | Apple iPhone 5c 8GB (Pink) - Verizon Wireless | Apple | 194.99 | 1 | The phone needed a SIM card, would have been n... | 1.0 |
| 2 | 248521 | Motorola Droid RAZR MAXX XT912 M Verizon Smart... | Motorola | 174.99 | 5 | I was 3 months away from my upgrade and my Str... | 3.0 |
| 3 | 167661 | CNPGD [U.S. Office Extended Warranty] Smartwat... | CNPGD | 49.99 | 1 | an experience i want to forget | 0.0 |
| 4 | 73287 | Apple iPhone 7 Unlocked Phone 256 GB - US Vers... | Apple | 922.0 | 5 | GREAT PHONE WORK ACCORDING MY EXPECTATIONS. | 1.0 |


In [None]:
# Most common words (excluding stopwords)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import re

nltk.download("punkt")
nltk.download("stopwords")

# Convert reviews to strings (missing values become the string "nan")
texts = df["Reviews"].astype(str).tolist()

# Build stopword set
sw = set(stopwords.words("english"))

tokens = []
for text in texts:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    words = word_tokenize(text)
    words = [w for w in words if w.isalpha() and w not in sw and len(w) > 1]
    tokens.extend(words)

fdist = FreqDist(tokens)
top_n = 30

for word, count in fdist.most_common(top_n):
    print(f"{word}: {count}")




phone: 54119
great: 12358
good: 12214
one: 8051
like: 7601
screen: 7329
use: 7026
battery: 6800
works: 6189
get: 5872
would: 5853
love: 5658
new: 5422
work: 5015
really: 4710
camera: 4661
time: 4634
price: 4395
product: 4200
sim: 4131
well: 4059
bought: 3986
phones: 3912
card: 3906
buy: 3788
got: 3632
back: 3584
even: 3577
iphone: 3460
nice: 3386


In [None]:
# Define the vectorizer

vectorizer = TfidfVectorizer(    
    strip_accents="unicode",
    lowercase=True, # Default is True
    # stop_words=list(sw),
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.95
)

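# Remove rows that have a missing value in any column
df = df.dropna()

# Drop reviews with neutral ratings; copy so that the new column is added
# to an independent DataFrame (avoids pandas' SettingWithCopyWarning)
df = df[df['Rating'] != 3].copy()

# Binary target: 1 for ratings above 3, 0 for ratings below 3
df["positive"] = np.where(df['Rating'] > 3, 1, 0)

# Split the data, then create TF-IDF features
# (the vectorizer is fitted on the training set only)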

X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],
                                                    df['positive'],
                                                    random_state=0)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f'Number of features: {X_train_vec.shape[1]}')
print(f'Training set shape: {X_train_vec.shape}')
print(f'Test set shape: {X_test_vec.shape}')

Number of features: 53505
Training set shape: (27662, 53505)
Test set shape: (9221, 53505)
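
The feature count is the result of two settings working against each other: `ngram_range=(1, 2)` adds every pair of adjacent words to the vocabulary, while `min_df` and `max_df` remove terms that are too rare or too common. The sketch below imitates that behaviour on the toy corpus in plain Python. It assumes whitespace tokenization, and the helpers `ngrams` and `build_vocabulary` are hypothetical names, not scikit-learn functions.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(corpus, min_df=1, max_df=1.0):
    """Keep terms with min_df <= DF <= max_df * N (min_df absolute, max_df a share)."""
    doc_terms = [set(ngrams(doc.split())) for doc in corpus]
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    max_count = max_df * len(corpus)
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)

corpus = ["the quick brown fox", "the fast brown dog", "the quick red fox"]

print(len(build_vocabulary(corpus)))                 # unigrams and bigrams, no filtering
print(build_vocabulary(corpus, min_df=2, max_df=0.95))
# ['brown', 'fox', 'quick', 'the quick']
```

Here "the" is dropped because it occurs in all three documents (3 > 0.95 × 3), and every term that occurs in a single document is dropped by `min_df=2`.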


In [83]:
# Train logistic regression classifier
logreg = LogisticRegression(max_iter=1000, random_state=0)
logreg.fit(X_train_vec, y_train)

# Make predictions
y_pred = logreg.predict(X_test_vec)
y_pred_proba = logreg.predict_proba(X_test_vec)[:, 1]

# Calculate accuracy and AUC
accuracy = logreg.score(X_test_vec, y_test)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f'Accuracy: {accuracy:.4f}')
print(f'AUC Score: {auc_score:.4f}')

Accuracy: 0.9373
AUC Score: 0.9784
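
The two metrics measure different things. Accuracy compares the predicted labels (probability above 0.5) with the true labels, whereas the AUC only looks at how the predicted probabilities rank the reviews: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. A minimal sketch of that definition, with `pairwise_auc` as a hypothetical helper and made-up scores:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and scores (not taken from the reviews data)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

The double loop is quadratic in the number of examples, so it is only meant to illustrate the definition; `roc_auc_score` should be used on real data.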


In [80]:
example = "I'm not happy"

# Transform the example with the vocabulary and IDF weights learned on the training set
example_vec = vectorizer.transform([example])

example_pred_proba = logreg.predict_proba(example_vec)[:, 1]
print(f'Predicted probability of positive review for example "{example}": {example_pred_proba[0]:.4f}')

Predicted probability of positive review for example "I'm not happy": 0.0854
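
To understand which features the example is mapped to, it helps to reproduce the analysis step by hand. The sketch below assumes the vectorizer's default `token_pattern`, which keeps runs of two or more word characters, and ignores the `strip_accents` step; `analyze` is a hypothetical helper. Whether a resulting term is actually used depends on whether it survived the `min_df` and `max_df` filters in the fitted vocabulary.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def analyze(text, ngram_range=(1, 2)):
    """Lowercase, tokenize and build word n-grams (unigrams first, then bigrams)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    n_min, n_max = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(analyze("I'm not happy"))  # ['not', 'happy', 'not happy']
print(analyze("didn't work"))    # ['didn', 'work', 'didn work']
```

The apostrophe splits "I'm" into two single-character pieces, and both are discarded. The same rule turns "didn't" into the token `didn`.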


In [81]:
# Get feature names and coefficients
feature_names = np.array(vectorizer.get_feature_names_out())
coefficients = logreg.coef_[0]

# Create a dataframe with features and their coefficients
feature_coef = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient', ascending=False)

# Top 20 features associated with positive sentiment
print("Top 20 features most associated with POSITIVE sentiment:")
print(feature_coef.head(20).to_string(index=False))

print("\n" + "="*60 + "\n")

# Top 20 features associated with negative sentiment
print("Top 20 features most associated with NEGATIVE sentiment:")
print(feature_coef.tail(20).to_string(index=False))

Top 20 features most associated with POSITIVE sentiment:
   feature  coefficient
     great     9.961724
      love     7.786146
      good     6.453228
 excellent     6.308221
   perfect     5.622785
      best     4.901446
   awesome     4.560858
 excelente     4.237203
   amazing     4.182690
     price     4.063163
     works     3.955867
      nice     3.926389
   not bad     3.809278
  excelent     3.744317
      fast     3.741633
 perfectly     3.660565
        my     3.656253
everything     3.569094
 love this     3.402239
       far     3.354284


Top 20 features most associated with NEGATIVE sentiment:
     feature  coefficient
       after    -3.164644
        back    -3.167097
        junk    -3.242642
         off    -3.278766
        didn    -3.323891
       sucks    -3.359536
        work    -3.416015
      broken    -3.465499
      return    -3.554480
         bad    -3.650224
      months    -3.675169
        poor    -3.718554
    horrible    -3.769665
    terrible    