# TF-IDF and Bag of Words

Bag of Words (BoW) and Term Frequency - Inverse Document Frequency (TF-IDF) are methods for representing text documents as fixed-length numerical vectors.

These vectors can then be used as features in traditional (tabular) machine learning models, such as logistic regression, SVMs, or random forests.

In [2]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv("../data/IMDB.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


#### Text cleaning

Lowercasing, removing punctuation and stopwords, and stemming each word.

In [5]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Requires the NLTK stopword list: run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))
st = SnowballStemmer('english')

# Clean the text in df[col] and store the result in df[clean_col] (modifies df in place)
def clean_data(df, col, clean_col):

    # lowercase and strip leading/trailing whitespace
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())

    # collapse runs of spaces into a single space
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))

    # replace every non-letter character (punctuation, digits) with a space
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

    # drop stopwords and stem the remaining words
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join(st.stem(word) for word in x.split() if word not in stop_words))

    return df

df = clean_data(df, 'review', 'clean_review')
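
The reviews contain HTML line breaks (`<br />`), and because the cleaning step only replaces non-letter characters, the tag name survives as a `br` token in the cleaned text. The notebook does not handle this; the following is a minimal sketch of one possible fix, assuming it is acceptable to drop anything that looks like an HTML tag before cleaning. The helper name `strip_html_tags` is hypothetical.

```python
import re

# Matches anything that looks like an HTML tag, e.g. "<br />"
TAG_RE = re.compile(r"<[^>]+>")

def strip_html_tags(text):
    """Replace HTML tags with a space so neighbouring words stay separated."""
    return TAG_RE.sub(" ", text)

sample = "A wonderful little production. <br /><br />The filming technique"

# Remove the tags, then collapse the leftover whitespace
cleaned = re.sub(r"\s+", " ", strip_html_tags(sample)).strip()
print(cleaned)
```

If used, this would be applied to the `review` column before `clean_data`, and the accuracies reported further down would likely change slightly.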

In [7]:
df["label"] = df["sentiment"].map({"negative": 0, "positive": 1})
df

Unnamed: 0,review,sentiment,clean_review,label
0,One of the other reviewers has mentioned that ...,positive,one review mention watch oz episod hook right ...,1
1,A wonderful little production. <br /><br />The...,positive,wonder littl product br br film techniqu unass...,1
2,I thought this was a wonderful way to spend ti...,positive,thought wonder way spend time hot summer weeke...,1
3,Basically there's a family where a little boy ...,negative,basic famili littl boy jake think zombi closet...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visual stun film...,1
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,thought movi right good job creativ origin fir...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,bad plot bad dialogu bad act idiot direct anno...,0
49997,I am a Catholic taught in parochial elementary...,negative,cathol taught parochi elementari school nun ta...,0
49998,I'm going to have to disagree with the previou...,negative,go disagre previous comment side maltin one se...,0


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_review"], df["label"], test_size=0.2, random_state=42
)

## Bag of Words

Bag of Words turns each document (review) into a vector. Each dimension of the vector corresponds to a word in the vocabulary, and its value is the number of times that word appears in the document. Here the vocabulary is limited to the 25,000 most frequent terms (`max_features=25000`).

**Example:**

The documents "great movie", "bad movie", and "bad bad movie" (with the vocabulary in alphabetical order: "bad", "great", "movie") become, respectively,

[0, 1, 1], [1, 0, 1], [2, 0, 1]
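
To make the counting explicit, here is a small from-scratch sketch of the example above using only the standard library. It is an illustration of the idea, not how `CountVectorizer` is implemented; the function name `bow_vector` is made up for this example.

```python
from collections import Counter

docs = ["great movie", "bad movie", "bad bad movie"]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    # Counter returns 0 for words that do not occur in the document
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['bad', 'great', 'movie']
print(vectors)  # [[0, 1, 1], [1, 0, 1], [2, 0, 1]]
```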

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vec = CountVectorizer(max_features=25000)
X_train_bow = bow_vec.fit_transform(X_train)
X_test_bow = bow_vec.transform(X_test)

In [10]:
# Slice first, then densify: converting the full sparse matrix would need several GB of memory
print(X_train_bow[1000:1005].toarray())
print(X_train_bow.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(40000, 25000)


#### Classifying with Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=500)
clf.fit(X_train_bow, y_train)

LogisticRegression(max_iter=500)

In [12]:
from sklearn.metrics import accuracy_score

preds = clf.predict(X_test_bow)
print("Accuracy with Bag of Words:", accuracy_score(y_test, preds))

Accuracy with Bag of Words: 0.8811


The 20 words with the largest positive and the largest negative coefficients:

In [13]:
feature_names = bow_vec.get_feature_names_out()
coef = clf.coef_[0]

top_pos = np.argsort(coef)[-20:]
top_neg = np.argsort(coef)[:20]

print("Top positive words:")
for idx in reversed(top_pos):
    print(feature_names[idx], coef[idx])

print("\nTop negative words:")
for idx in top_neg:
    print(feature_names[idx], coef[idx])

Top positive words:
refresh 1.914388087657759
squirrel 1.7725634267045898
sequit 1.5303566008067233
underr 1.49123240407683
temporari 1.462939344200386
eisenstein 1.4585745612150671
delici 1.3807455890383258
wipe 1.3739621670330164
funniest 1.3516067215830976
bloodbath 1.3376609852503765
kersey 1.3297950118125632
superb 1.3265665127009376
flawless 1.3212403787514653
brilliant 1.3198744677909748
xavier 1.3182588117236183
hotti 1.3076926989492115
occurr 1.30740423598856
adr 1.296624098719423
shemp 1.287105241001195
rudi 1.2854595981763737

Top negative words:
uninterest -1.9278998029724983
worst -1.890884527215714
implement -1.8673971054478795
mst -1.81649331682519
boredom -1.804393267367129
wast -1.790139575685424
baldwin -1.7779213304100618
lifeless -1.7446205959695855
forgett -1.7193109171326535
disgrac -1.6366690355863183
lousi -1.6364939555970497
downhil -1.616874174876399
alright -1.615413288629475
aw -1.6089093735981572
unremark -1.6046177451793815
cortes -1.5898595736930603
obnox
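
One way to read these numbers: a logistic regression coefficient is a change in log-odds, so with count features each additional occurrence of a word multiplies the odds of a positive prediction by $e^{coef}$, holding everything else fixed. A short sketch, using two coefficients copied from the output above:

```python
import math

# Coefficients copied from the printed output above
coefs = {"refresh": 1.914388087657759, "worst": -1.890884527215714}

# exp(coef) is the factor applied to the odds of the positive class
# for each additional occurrence of the word
odds_factors = {word: math.exp(coef) for word, coef in coefs.items()}

for word, factor in odds_factors.items():
    print(f"{word}: odds multiplied by {factor:.2f}")
```

A factor above 1 pushes the prediction toward positive, a factor below 1 toward negative. Rare stems such as names can receive large coefficients from only a few reviews, so the ranking is best read as a rough guide.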

## TF-IDF

#### Term frequency $\times$ inverse document frequency

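$TF(t, d)$ is what we computed with BoW: the number of times term $t$ appears in document $d$.

$IDF(t) = \log\left(\frac{N}{df(t)}\right)$,

where $N$ is the number of documents in the corpus and $df(t)$ is the number of documents that contain $t$. IDF measures how rare, and therefore how informative, a term is across the entire corpus. The TF-IDF weight of a term in a document is the product $TF(t, d) \cdot IDF(t)$.

Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant, $\ln\left(\frac{1 + N}{1 + df(t)}\right) + 1$, and L2-normalizes each document vector.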
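
As a worked example, the sketch below applies the plain $\log(N / df)$ formula to the three toy documents from the Bag of Words section. It uses only the standard library and is meant to illustrate the arithmetic; the values will not match what `TfidfVectorizer` produces. The name `tfidf_vector` is made up for this example.

```python
import math
from collections import Counter

docs = ["great movie", "bad movie", "bad bad movie"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
N = len(docs)

# df(t): number of documents that contain term t at least once
df = {t: sum(t in tokens for tokens in tokenized) for t in vocab}

# idf(t) = log(N / df(t)); a term that occurs in every document gets 0
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf_vector(tokens):
    """Multiply each raw term count by the term's IDF."""
    tf = Counter(tokens)
    return [tf[t] * idf[t] for t in vocab]

vectors = [tfidf_vector(tokens) for tokens in tokenized]

for t in vocab:
    print(t, df[t], round(idf[t], 3))
```

"movie" appears in all three documents, so its IDF is $\log(3/3) = 0$ and it contributes nothing to any vector, while "great", which appears in only one document, gets the largest weight.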

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer(max_features=25000)
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

In [15]:
# Slice first, then densify: converting the full sparse matrix would need several GB of memory
print(X_train_tfidf[1000:1005].toarray())
print(X_train_tfidf.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(40000, 25000)


In [16]:
clf2 = LogisticRegression(max_iter=500)
clf2.fit(X_train_tfidf, y_train)

LogisticRegression(max_iter=500)

In [17]:
preds2 = clf2.predict(X_test_tfidf)
print("Accuracy with TF-IDF:", accuracy_score(y_test, preds2))

Accuracy with TF-IDF: 0.8909


Overall, TF-IDF performs slightly better than Bag of Words on this test set (0.8909 vs. 0.8811 accuracy), likely because it downweights very common words and emphasizes more informative ones.