# 01-TF-IDF

Here we will compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [2]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Import the data from the file *headlines.csv*:

In [3]:
# TODO: Load the dataset
hl = pd.read_csv('headlines.csv')

As usual, check the dataset's basic information.

In [4]:
# TODO: Have a look at the data
print(hl.head(5))
hl.info()

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


We will now preprocess this text data: tokenization, punctuation removal, stop-word removal, and stemming.

Hint: to do so, use NLTK, *pandas*'s method *apply*, lambda functions and list comprehension

In [5]:
# TODO: Perform preprocessing
# import needed modules
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Tokenize each headline with the imported word_tokenize
hl['tokens'] = hl['headline_text'].apply(word_tokenize)

# Remove punctuation (isalpha also drops tokens that contain digits)
hl['alpha'] = hl['tokens'].apply(lambda x: [item for item in x if item.isalpha()])

# Remove stop words (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
hl['stop'] = hl['alpha'].apply(lambda x: [item for item in x if item not in stop_words])

# Stem
stemmer = PorterStemmer()
hl['stemmed'] = hl['stop'].apply(lambda x: [stemmer.stem(item) for item in x])
hl['stemmed']


0         [algorithm, make, decis, behalf, feder, minist]
1       [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                                 [rural, mural, thallan]
3                  [australia, church, risk, becom, abus]
4       [australian, compani, usgfx, embroil, shanghai...
                              ...                        
1994    [constitut, avenu, win, top, prize, act, archi...
1995                         [dark, mofo, number, crunch]
1996    [david, petraeu, say, australia, must, firm, s...
1997    [driverless, car, australia, face, challeng, r...
1998               [drug, compani, criticis, price, hike]
Name: stemmed, Length: 1999, dtype: object
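
The four steps can also be folded into one function, which makes them easy to try on a single headline. The sketch below is only an illustration, not the notebook's method: `preprocess` and `toy_stop_words` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and a three-word stop list stands in for NLTK's English list so that no NLTK data download is needed.

```python
from nltk.stem import PorterStemmer

# Hypothetical stand-in for stopwords.words('english'); a set gives fast lookups.
toy_stop_words = {"can", "on", "of"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop non-alphabetic tokens and stop words, then stem."""
    tokens = text.split()  # assumption: whitespace split instead of word_tokenize
    alpha = [t for t in tokens if t.isalpha()]
    kept = [t for t in alpha if t not in toy_stop_words]
    return [stemmer.stem(t) for t in kept]

example = preprocess("algorithms can make decisions on behalf")
print(example)
```

With the real tokenizer and stop list, the same function could be applied to the whole column with `hl['headline_text'].apply(preprocess)`.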

Now compute the bag of words (BOW) for our data, using scikit-learn.

Warning: since the headlines are already tokenized by our own preprocessing, you have to bypass the vectorizer's analyzer by passing it an identity function.

In [6]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer

# The documents are already token lists, so the analyzer returns them unchanged
v = CountVectorizer(lowercase=False, analyzer=lambda c: c)
BOW = v.fit_transform(hl['stemmed']).toarray()
BOW.shape

(1999, 4165)

You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the term frequency (TF) and then the inverse document frequency (IDF), and check that the values are not all zeros.

In [8]:
# TODO: Compute the TF using the BOW
TF = pd.DataFrame(data=BOW, columns=v.get_feature_names_out())
TF = TF.divide(TF.sum(axis=1), axis=0)
np.unique(TF)

array([0.        , 0.08333333, 0.09090909, 0.1       , 0.11111111,
       0.125     , 0.14285714, 0.16666667, 0.18181818, 0.2       ,
       0.22222222, 0.25      , 0.28571429, 0.33333333, 0.4       ,
       0.5       , 1.        ])

In [10]:
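# TODO: Compute the IDF
# Note: as written, this stores the raw count matrix (the BOW) under the name IDF;
# no inverse document frequency is computed here.
IDF = pd.DataFrame(data=BOW, columns=v.get_feature_names_out())
print(IDF)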

      aardman  aaron  ab  aback  abbott  abc  abel  abil  ablett  aborigin  \
0           0      0   0      0       0    0     0     0       0         0   
1           0      0   0      0       0    0     0     0       0         0   
2           0      0   0      0       0    0     0     0       0         0   
3           0      0   0      0       0    0     0     0       0         0   
4           0      0   0      0       0    0     0     0       0         0   
...       ...    ...  ..    ...     ...  ...   ...   ...     ...       ...   
1994        0      0   0      0       0    0     0     0       0         0   
1995        0      0   0      0       0    0     0     0       0         0   
1996        0      0   0      0       0    0     0     0       0         0   
1997        0      0   0      0       0    0     0     0       0         0   
1998        0      0   0      0       0    0     0     0       0         0   

      ...  youtub  zambian  zealand  zedd  zinc  zion  zombi  z
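
The printout above is the raw count matrix, since the cell assigns the BOW to `IDF`. For comparison, one common textbook definition of the IDF of a term is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below applies that definition to a hypothetical toy count matrix; all `toy_*` names and values are illustrative assumptions, and other IDF variants (for example smoothed ones) exist.

```python
import numpy as np
import pandas as pd

# Hypothetical toy count matrix: 3 documents, 3 terms.
toy_counts = pd.DataFrame(
    [[1, 0, 2],
     [0, 1, 1],
     [0, 0, 1]],
    columns=["church", "mural", "rural"],
)

n_docs = toy_counts.shape[0]
# Document frequency: number of documents in which each term occurs at least once.
doc_freq = (toy_counts > 0).sum(axis=0)
# Unsmoothed IDF; a term present in every document gets log(1) = 0.
toy_idf = np.log(n_docs / doc_freq)

# Term frequency: counts divided by document length, as in the TF cell above.
toy_tf = toy_counts.divide(toy_counts.sum(axis=1), axis=0)
# Multiplying a DataFrame by a Series aligns the Series index with the columns.
toy_tfidf = toy_tf * toy_idf

print(toy_idf)
print(toy_tfidf)
```

On the real data, the same steps would start from `BOW` and `v.get_feature_names_out()` instead of `toy_counts`.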

Finally, compute the TF-IDF.

In [11]:
# TODO: compute the TF-IDF
tfidf = TF * IDF
tfidf

Unnamed: 0,aardman,aaron,ab,aback,abbott,abc,abel,abil,ablett,aborigin,...,youtub,zambian,zealand,zedd,zinc,zion,zombi,zone,zonta,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


What are the 10 words with the highest and lowest TF-IDF on average?

In [17]:
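# TODO: Print the 10 words with the highest and lowest TF-IDF on average
# Note: max(axis=0) ranks each word by its highest per-document score;
# mean(axis=0) would rank by the average score instead.
print('lowest word: ', tfidf.max(axis=0).sort_values()[:10])
print('highest word: ', tfidf.max(axis=0).sort_values(ascending=False)[:10])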

lowest word:  adel        0.083333
haw         0.083333
melb        0.083333
coll        0.083333
gcfc        0.083333
nmfc        0.083333
geel        0.083333
syd         0.083333
gw          0.083333
pacquaio    0.090909
dtype: float64
highest word:  peacemak     1.000000
pump         1.000000
mongolian    1.000000
travel       0.800000
employ       0.800000
record       0.800000
murder       0.666667
arsen        0.666667
water        0.666667
tourist      0.571429
dtype: float64
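
The ranking above uses `max(axis=0)`, so it orders words by their best single-document score rather than by their average. The two orderings can differ, as this small sketch on a hypothetical score matrix shows (the values and the names `toy_scores`, `by_max` and `by_mean` are made up for illustration).

```python
import pandas as pd

# Hypothetical toy score matrix: rows are documents, columns are words.
toy_scores = pd.DataFrame(
    {"pump": [1.0, 0.0, 0.0, 0.0],
     "travel": [0.5, 0.5, 0.5, 0.0],
     "water": [0.25, 0.0, 0.0, 0.0]}
)

# Highest single-document score per word.
by_max = toy_scores.max(axis=0).sort_values(ascending=False)
# Average score per word over all documents.
by_mean = toy_scores.mean(axis=0).sort_values(ascending=False)

print(by_max.index.tolist())
print(by_mean.index.tolist())
```

A word that scores highly in one document leads the `max` ranking, while a word with moderate scores in many documents leads the `mean` ranking.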


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [20]:
# TODO: Compute the TF-IDF using scikit learn
# Import the module
from sklearn.feature_extraction.text import TfidfVectorizer
 
# Instantiate the TF-IDF vectorizer
v = TfidfVectorizer(lowercase=False, analyzer=lambda x:x)

# Compute the TF-IDF
tf_idf = v.fit_transform(hl['stemmed']).toarray()
tf_idf = pd.DataFrame(data=tf_idf, columns=v.get_feature_names_out())
tf_idf


Unnamed: 0,aardman,aaron,ab,aback,abbott,abc,abel,abil,ablett,aborigin,...,youtub,zambian,zealand,zedd,zinc,zion,zombi,zone,zonta,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Compare the 10 words with the highest and lowest TF-IDF on average to the ones you computed yourself.

In [21]:
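# TODO: Print the 10 words with the highest and lowest TF-IDF on average
# Note: max(axis=0) ranks each word by its highest per-document score;
# mean(axis=0) would rank by the average score instead.
print('lowest word: ', tf_idf.max(axis=0).sort_values()[:10])
print('highest word: ', tf_idf.max(axis=0).sort_values(ascending=False)[:10])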

lowest word:  coll     0.305258
gw       0.305258
nmfc     0.305258
adel     0.305258
melb     0.305258
syd      0.305258
haw      0.305258
geel     0.305258
gcfc     0.305258
fabio    0.322574
dtype: float64
highest word:  peacemak     1.000000
pump         1.000000
mongolian    0.831769
financ       0.803629
employ       0.795060
aquapon      0.794899
date         0.794899
travel       0.788050
rig          0.786813
mosul        0.779137
dtype: float64


Do you have the same words? How do you explain it?

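No, I don't get exactly the same words. With its default settings, `TfidfVectorizer` weights the raw counts by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row, so every headline vector has unit length. My manual version multiplies the TF by the raw count matrix rather than by an IDF and applies no normalization, so the two sets of scores are on different scales and rank the words differently.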
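
To make the comparison concrete, the sketch below rebuilds the default `TfidfVectorizer` weighting by hand on a hypothetical toy corpus and checks it against scikit-learn's output. It assumes the default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the names `toy_docs`, `identity`, `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus of pre-tokenized documents.
toy_docs = [["rural", "mural"], ["rural", "church", "rural"], ["church"]]

def identity(tokens):
    """Return the token list unchanged (documents are already tokenized)."""
    return tokens

counts = CountVectorizer(analyzer=identity).fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, as used when smooth_idf=True.
smooth_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
# Raw counts (not length-normalized TF) are weighted by the IDF...
weighted = counts * smooth_idf
# ...and each row is then scaled to unit Euclidean length (norm='l2').
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(analyzer=identity).fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

Because of the row normalization, a headline with a single token always gets a score of 1 for that token, which is consistent with `peacemak` and `pump` reaching 1.0 in the scikit-learn ranking.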