# Term Frequency - Inverse Document Frequency Vectorization
---

Comparing CountVectorizer with Tf-idf

In [13]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [2]:
reddit_df = pd.read_csv('../data/reddit_df.csv')

In [3]:
reddit_df.head()

| Unnamed: 0 | subreddit | author | locked | num_comments | selftext | title | timestamp | full_text | full_text_clean |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AutoModerator | 0 | 84 | [Previous](/r/AskHistorians/search?q=title%3A"... | Sunday Digest \| Interesting &amp; Overlooked P... | 2019-07-07 14:04:52 | Sunday Digest \| Interesting &amp; Overlooked P... | sunday digest interesting amp overlooked post ... |
| 1 | 1 | AutoModerator | 0 | 1 | [Previous weeks!](/r/AskHistorians/search?sort... | Short Answers to Simple Questions \| July 10, 2019 | 2019-07-10 14:05:16 | Short Answers to Simple Questions \| July 10, 2... | short answer simple question july 10 2019 prev... |
| 2 | 1 | tiikerinsilma | 0 | 26 | I'm asking this partially because the atrociti... | (WW2) Did Japan have genocidal plans for Asia,... | 2019-07-10 09:09:07 | (WW2) Did Japan have genocidal plans for Asia,... | ww2 japan genocidal plan asia war asking parti... |
| 3 | 1 | Mr_Quinn | 0 | 10 |  | In 1627 the last aurochs, or wild cow, died in... | 2019-07-10 14:00:39 | In 1627 the last aurochs, or wild cow, died in... | 1627 last aurochs wild cow died jaktor w fores... |
| 4 | 1 | Erezen | 0 | 14 | Moreover, how was the movie received in South ... | "The Gods Must Be Crazy" is a beloved South Af... | 2019-07-09 20:36:09 | "The Gods Must Be Crazy" is a beloved South Af... | god must crazy beloved south african movie rel... |


### Model with only text and tfidf

In [4]:
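# Features: the cleaned post text; target: the subreddit label
X = reddit_df['full_text_clean']
y = reddit_df['subreddit']

# Stratify so both splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=22)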

In [5]:
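tfi = TfidfVectorizer(ngram_range=(1, 2),  # Same n-gram range as the CountVectorizer (cvec) model
                      max_df=.98,          # Drop terms that appear in more than 98% of documents
                      min_df=2)            # Drop terms that appear in fewer than 2 documents

# Fit the vocabulary and idf weights on the training set only, then reuse them on the test set
X_train_tfi = tfi.fit_transform(X_train)
X_test_tfi = tfi.transform(X_test)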
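
To make the weighting concrete, here is a minimal pure-Python sketch of the tf-idf calculation as I understand scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`): raw term count times a smoothed idf, then each document vector scaled to unit length. The toy corpus and the `tfidf` helper are hypothetical, and the sketch handles unigrams only, so it illustrates the formula rather than reimplementing `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenised by whitespace (unigrams only)
docs = [
    "rome empire fall",
    "rome republic senate",
    "civil war union",
]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [d.split() for d in docs]
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)                       # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}  # tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})  # L2 normalisation
    return rows

rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the first document, "rome" gets a lower weight than "empire" because it also appears in the second document. That down-weighting of shared terms is the only difference from plain counts.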

In [6]:
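# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
X_train_df = pd.DataFrame(X_train_tfi.toarray(), columns=tfi.get_feature_names_out())
X_test_df = pd.DataFrame(X_test_tfi.toarray(), columns=tfi.get_feature_names_out())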

In [7]:
lr = LogisticRegression()

lr.fit(X_train_df, y_train)
lr.score(X_train_df, y_train)

0.9749498997995992

In [8]:
lr.score(X_test_df, y_test)

0.8667334669338678

**Train Accuracy** = $0.975$

**Test Accuracy** = $0.867$

In [9]:
# Predict on the same DataFrame representation the model was fit and scored on
y_pred = lr.predict(X_test_df)

In [10]:
# For binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

**Specificity and Sensitivity**

In [11]:
print(f'Specificity: {round(tn / (tn + fp), 2)}')
print(f'Sensitivity: {round(tp / (tp + fn), 2)}')

Specificity: 0.92
Sensitivity: 0.82
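
As a sanity check on the two formulas, the sketch below counts the four confusion-matrix cells by hand for a small set of made-up labels and derives both rates from them. The labels and the `binary_confusion` helper are hypothetical, and class 1 is assumed to be the positive class.

```python
def binary_confusion(y_true, y_pred):
    """Count (tn, fp, fn, tp) for 0/1 labels, treating 1 as the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, not taken from the Reddit data
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

toy_tn, toy_fp, toy_fn, toy_tp = binary_confusion(toy_true, toy_pred)

specificity = toy_tn / (toy_tn + toy_fp)  # share of actual negatives predicted negative
sensitivity = toy_tp / (toy_tp + toy_fn)  # share of actual positives predicted positive

print(f'Specificity: {specificity}')  # 0.75
print(f'Sensitivity: {sensitivity}')  # 0.5
```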


**F1 Score**

In [12]:
round(f1_score(y_test, y_pred), 3)

0.86
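
For reference, the F1 score is the harmonic mean of precision and recall, where recall is the same quantity as the sensitivity reported above. Written in terms of the confusion-matrix counts (with $TP$, $FP$, $FN$ standing for `tp`, `fp`, `fn`):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

True negatives do not appear in the formula, so F1 ignores the specificity side of the confusion matrix.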

**Cross Val Score**

In [18]:
cross_val_score(lr, X_train_tfi, y_train, cv=20, n_jobs=4).mean()

0.8750540540540539
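
One caveat about the cell above: `X_train_tfi` was vectorized on the full training set, so each validation fold has already contributed to the vocabulary and idf weights. A possible alternative, sketched below on a hypothetical toy corpus, is to wrap the vectorizer and the classifier in a pipeline so the vectorizer is refit inside every fold. The texts, labels, and `cv=2` are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 4 posts per class
texts = [
    "roman empire fall cause",
    "roman republic senate reform",
    "civil war union army",
    "american revolution colonial militia",
    "best pizza recipe dough",
    "sourdough bread starter recipe",
    "slow cooker chicken soup",
    "chocolate cake baking tip",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits TfidfVectorizer on the training part of each fold,
# so the validation fold never influences the vocabulary or idf weights.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)

scores = cross_val_score(pipe, texts, labels, cv=2)
print(scores.mean())
```

On the real data the same idea would mean passing `X_train` (the raw text) rather than `X_train_tfi` to `cross_val_score`, with the notebook's `max_df`, `min_df`, and `cv` settings.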

### Observations

The tf-idf model (test accuracy $0.867$) did significantly worse than the CountVectorizer model. My assumption is that reweighting terms by their frequency across documents has a negligible effect here: each post concerns a very specific event and time period, so what identifies a post is its topic as a whole (e.g. American Revolution/Civil War or Ancient Rome) rather than a weighted average of individual terms.