In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report
import numpy as np
import pandas as pd

# TFIDF and sklearn

In this notebook we will guide you through using TF-IDF with sklearn. First we define a dataset to work with. The dataset X consists of 10 strings: the first 5 sentences are about 'Obama', the last 5 about 'lion'. The label of each sentence is stored in y.

It is important to check that X and y have the same length; otherwise there might be strings without a label.

In [2]:
X = ["Obama was born on August 4.",
     "Barack is the only president born outside the contiguous 48 states.",
     "Barack was born to an American mother of European descent and an African father.",
     'Barack and his mother moved to the University of Washington in Seattle, where they lived for a year.',
     'Obama described his struggles as a young adult to reconcile social perceptions of his multiracial heritage.',
     'Typically, the lion inhabits grasslands and savannas, but is absent in dense forests.',
     'adult male lions have a prominent mane.', 
     'It is a social species, forming groups called prides.',
     'A lion pride consists of a few adult males, related females and cubs.',
     'Groups of female lions usually hunt together, preying mostly on large ungulates.']

y = ['obama', 'obama', 'obama', 'obama', 'obama', 'lion', 'lion', 'lion', 'lion', 'lion']
print(len(X), len(y))

10 10


## 1. Split the data

We want to make separate training and test datasets. The training dataset is the corpus on which we fit the TF-IDF vectorizer; only the words present in this corpus will be taken into account. The test set will also be transformed to a TF-IDF matrix, but words that do not occur in the training corpus are ignored.

For this you can use train_test_split() from sklearn. Please look up how it works. We want 25% of the data in X to end up in the test set; with 10 sentences this is rounded up to 3 test sentences.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))

7 3


By printing y_test we see how many sentences with the label 'obama' or 'lion' are in the test set. Ideally we have at least one of each.

In [4]:
print(y_test)

['lion', 'obama', 'lion']
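If you do not want to rely on luck to get both labels into the test set, train_test_split() accepts a `stratify` argument that keeps the class proportions roughly equal in both splits. The sketch below uses placeholder sentences (`X_demo`, `y_demo` are made-up names, not part of this notebook) with the same 5/5 label balance.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data with the same 5/5 class balance as the notebook.
X_demo = [f"sentence {i}" for i in range(10)]
y_demo = ["obama"] * 5 + ["lion"] * 5

# stratify=y_demo asks sklearn to preserve the label proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(Counter(y_tr))
print(Counter(y_te))
```

With only 3 test sentences the split cannot be perfectly balanced, but both labels should be present.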


## 2. TF-IDF vectorizer

First we'll define some functions:

In [9]:
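def tfidf_features(txt, flag):
    """
    Transform the list of strings in txt into a TF-IDF matrix.
    With flag == "train" we fit the vectorizer and transform (fit_transform);
    otherwise we only transform with the already fitted vectorizer.
    Note: this uses the global variable tfidf, which must be defined before calling.
    """
    if flag == "train":
        # Learn the vocabulary and idf weights from txt, then transform it
        x = tfidf.fit_transform(txt)
    else:
        # Reuse the vocabulary and idf weights learned earlier
        x = tfidf.transform(txt)
    return x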


def matrix_to_pandas(X, tfidf):
    """
    Transform the sparse TF-IDF matrix into a pandas DataFrame in order to see the column names above the features.
    At this stage you don't need to understand what is happening here. It is only for visualisation purposes.
    """
    X = X.toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
    df = pd.DataFrame(X, columns=tfidf.get_feature_names_out())
    return df

In the next step we define the TfidfVectorizer. Please have a close look at the documentation, as there are quite a lot of interesting arguments we can use. We will also demonstrate a couple of them in this notebook.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

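Then we fit the vectorizer on the training data. If we print X_tfidf_train.shape we see the dimensions of the resulting matrix. Every row is a string and every column is a word from the vocabulary. The first number is the number of rows (7 training strings), the second is the number of columns (58 distinct words).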

In [10]:
tfidf = TfidfVectorizer()
X_tfidf_train = tfidf_features(X_train, flag="train")

print(X_tfidf_train.shape)

(7, 58)


If you now look at X_tfidf_train, it looks a little odd. You see entries in the following format:

(0, 6)	0.515973461359075

Read it as follows: in document 0, the word with column index 6 (counting from 0; here that is 'august') has importance 0.515973461359075.

It is written this way because a TF-IDF matrix contains many zero values and we don't want to waste space on storing or showing them. If you would like to see it in dense matrix form you can do:

<code>
print(X_tfidf_train.toarray())
</code>

In [14]:
print(X_tfidf_train)

  (0, 6)	0.515973461359075
  (0, 36)	0.42830228436617496
  (0, 8)	0.42830228436617496
  (0, 53)	0.42830228436617496
  (0, 34)	0.42830228436617496
  (1, 39)	0.3681528517608115
  (1, 9)	0.3681528517608115
  (1, 17)	0.3055984836695788
  (1, 16)	0.3681528517608115
  (1, 44)	0.3681528517608115
  (1, 43)	0.3055984836695788
  (1, 23)	0.3681528517608115
  (1, 24)	0.3681528517608115
  (2, 13)	0.27391477876166354
  (2, 1)	0.27391477876166354
  (2, 4)	0.2273727899808243
  (2, 10)	0.27391477876166354
  (2, 12)	0.27391477876166354
  (2, 35)	0.16873681866084675
  (2, 31)	0.2273727899808243
  (2, 2)	0.27391477876166354
  (2, 3)	0.5478295575233271
  (2, 48)	0.19435072341886617
  (2, 7)	0.2273727899808243
  (2, 8)	0.2273727899808243
  :	:
  (4, 43)	0.22421482608304552
  (4, 34)	0.22421482608304552
  (5, 56)	0.26136237371472965
  (5, 15)	0.26136237371472965
  (5, 27)	0.26136237371472965
  (5, 47)	0.26136237371472965
  (5, 55)	0.26136237371472965
  (5, 42)	0.26136237371472965
  (5, 22)	0.2613623737147296

If you execute the following cell you will understand this more clearly. The column names show the words the vectorizer has learned, and there is one row per string in the training dataset. Each cell holds the importance of that word for that string.

In [11]:
matrix_to_pandas(X_tfidf_train, tfidf)

Unnamed: 0,adult,african,american,an,and,as,august,barack,born,called,...,to,together,ungulates,university,usually,was,washington,where,year,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.515973,0.0,0.428302,0.0,...,0.0,0.0,0.0,0.0,0.0,0.428302,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.368153,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.273915,0.273915,0.54783,0.227373,0.0,0.0,0.227373,0.227373,0.0,...,0.194351,0.0,0.0,0.0,0.0,0.227373,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.309394,0.309394,0.0,0.309394,0.0,0.0,0.0,0.0,0.0
4,0.224215,0.0,0.0,0.0,0.0,0.27011,0.0,0.0,0.0,0.0,...,0.191651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.27011
5,0.0,0.0,0.0,0.0,0.216953,0.0,0.0,0.216953,0.0,0.0,...,0.185444,0.0,0.0,0.261362,0.0,0.0,0.261362,0.261362,0.261362,0.0
6,0.357939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
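The scores in row 0 can be reproduced by hand. The sketch below assumes the default settings described in the scikit-learn documentation (smoothed idf, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, followed by L2 normalisation of each row) and assumes the training split contains the seven sentences listed, which is inferred from the outputs in this notebook. The tokenizer is an approximation of the vectorizer's default (lowercase, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

# Assumed training sentences, inferred from the outputs shown in this notebook.
train_docs = [
    "Obama was born on August 4.",
    "It is a social species, forming groups called prides.",
    "Barack was born to an American mother of European descent and an African father.",
    "Groups of female lions usually hunt together, preying mostly on large ungulates.",
    "Obama described his struggles as a young adult to reconcile social perceptions of his multiracial heritage.",
    "Barack and his mother moved to the University of Washington in Seattle, where they lived for a year.",
    "adult male lions have a prominent mane.",
]

def tokenize(text):
    # Lowercase and keep tokens of at least two word characters (so "4" and "a" are dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in train_docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Raw score = term count * smoothed idf.
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in tf.items()}
    # L2-normalise so the row has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf_row(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term:8s}{w:.6f}")
```

'august' occurs in only one training sentence, while 'obama', 'was', 'born' and 'on' each occur in two, which is why 'august' receives the highest score in that row.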


Now let us experiment with some arguments we could give to the TfidfVectorizer. We will start by adding max_features=15. The vocabulary is now limited to the 15 terms that occur most often across the training corpus, so only 15 columns remain.

In [17]:
tfidf = TfidfVectorizer(max_features= 15) 
matrix_to_pandas(tfidf_features(X_train, flag="train"), tfidf)

Unnamed: 0,adult,an,and,barack,born,groups,his,lions,mother,obama,of,on,social,to,was
0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.5
1,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0
2,0.0,0.693037,0.28764,0.28764,0.28764,0.0,0.0,0.0,0.28764,0.0,0.213462,0.0,0.0,0.245865,0.28764
3,0.0,0.0,0.0,0.0,0.0,0.53069,0.0,0.53069,0.0,0.0,0.393833,0.53069,0.0,0.0,0.0
4,0.347495,0.0,0.0,0.0,0.0,0.0,0.694991,0.0,0.0,0.347495,0.257882,0.0,0.347495,0.297028,0.0
5,0.0,0.0,0.435138,0.435138,0.0,0.0,0.435138,0.0,0.435138,0.0,0.322923,0.0,0.0,0.371942,0.0
6,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next we add stop_words='english'. All English stop words are now removed before the vectorizer is fitted. We see indeed that there are no more stop words in the column names.

In [18]:
tfidf = TfidfVectorizer(max_features= 15, stop_words='english') 
matrix_to_pandas(tfidf_features(X_train, flag="train"), tfidf)

Unnamed: 0,adult,barack,born,groups,heritage,lions,mother,multiracial,obama,perceptions,reconcile,social,struggles,university,young
0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0
2,0.0,0.57735,0.57735,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.707107,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.292256,0.0,0.0,0.0,0.352079,0.0,0.0,0.352079,0.292256,0.352079,0.352079,0.292256,0.352079,0.0,0.352079
5,0.0,0.538281,0.0,0.0,0.0,0.0,0.538281,0.0,0.0,0.0,0.0,0.0,0.0,0.648465,0.0
6,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Finally we add ngram_range=(1,3). This means that combinations of up to 3 consecutive words are also taken into account. An example where this is useful is 'New York'.

You see indeed that there are now columns that consist of more than one word.

In [21]:
tfidf = TfidfVectorizer(max_features= 15, stop_words='english', ngram_range=(1,3)) 
matrix_to_pandas(tfidf_features(X_train, flag="train"), tfidf)

Unnamed: 0,adult,adult reconcile social,barack,born,groups,lions,mother,multiracial heritage,obama,obama described struggles,reconcile social perceptions,social,social perceptions multiracial,struggles young adult,young adult reconcile
0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0
2,0.0,0.0,0.57735,0.57735,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.275669,0.332097,0.0,0.0,0.0,0.0,0.0,0.332097,0.275669,0.332097,0.332097,0.275669,0.332097,0.332097,0.332097
5,0.0,0.0,0.707107,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
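To get a feeling for which column candidates ngram_range=(1,3) produces, here is a small pure-Python sketch of word n-gram generation. It is not the vectorizer's own code; `word_ngrams` is a hypothetical helper, and the token list is what we assume remains of 'adult male lions have a prominent mane.' after stop-word removal.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Assumed tokens after lowercasing and removing stop words.
tokens = ["adult", "male", "lions", "prominent", "mane"]
grams = word_ngrams(tokens)
print(len(grams))
print(grams)
```

Because stop words are removed first, an n-gram can join words that were not adjacent in the original sentence, such as 'obama described struggles' in the table above.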


## 3. Difference between fit_transform() and transform()

Let's assume you have a training dataset to start with and you want to train a tfidf model on the corpus within this dataset. You perform a fit_transform() on the dataset.

A couple of days later you receive an extra dataset you want to transform (cf. the test set). Now you only want to transform it, because we want this new dataset to have the same columns and vocabulary as the training dataset. Let's see what would happen if we used fit_transform again on the test dataset.

In [22]:
matrix_to_pandas(tfidf_features(X_test, flag="train"), tfidf)

Unnamed: 0,48 states,absent,barack president born,born outside,born outside contiguous,contiguous 48,contiguous 48 states,grasslands,inhabits,lion,outside contiguous,outside contiguous 48,president born outside,savannas,typically
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.333333,0.0,0.333333,0.333333,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.333333,0.333333,0.333333,0.0,0.0
2,0.0,0.423394,0.0,0.0,0.0,0.0,0.0,0.423394,0.423394,0.322002,0.0,0.0,0.0,0.423394,0.423394


As you can see, the columns do not correspond at all to the columns of the training dataset we had earlier. This means there is no way to compare the new transformation with the old one. We don't want this. That is why we only transform() the test set.

In [28]:
tfidf = TfidfVectorizer(max_features= 15, stop_words='english', ngram_range=(1,3)) 
df = matrix_to_pandas(tfidf_features(X_train, flag="train"), tfidf)
df

Unnamed: 0,adult,adult reconcile social,barack,born,groups,lions,mother,multiracial heritage,obama,obama described struggles,reconcile social perceptions,social,social perceptions multiracial,struggles young adult,young adult reconcile
0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0
2,0.0,0.0,0.57735,0.57735,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.275669,0.332097,0.0,0.0,0.0,0.0,0.0,0.332097,0.275669,0.332097,0.332097,0.275669,0.332097,0.332097,0.332097
5,0.0,0.0,0.707107,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
matrix_to_pandas(tfidf_features(X_test, flag="test"), tfidf)

Unnamed: 0,adult,adult reconcile social,barack,born,groups,lions,mother,multiracial heritage,obama,obama described struggles,reconcile social perceptions,social,social perceptions multiracial,struggles young adult,young adult reconcile
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


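Yes, it worked! The two matrices above now have the same column names, so we can compare them with each other. Note that the last test row contains only zeros: none of its terms occur in the 15-term training vocabulary.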

## 4. Extract important words from each piece of text

In [44]:
# For which document would you like to see the most important terms? Choose a number between 0 and 6.
doc = 0

print('You have chosen the following document: ')
print(' ')
print(X_train[doc])
print(' ')
print(' ')
print('The five highest-scoring terms for this document are listed below, next to their score. The lower the score, the less relevant the term.')
print(' ')
print(df.iloc[doc].nlargest(5))


You have chosen the following document: 
 
Obama was born on August 4.
 
 
The five highest-scoring terms for this document are listed below, next to their score. The lower the score, the less relevant the term.
 
born                      0.707107
obama                     0.707107
adult                     0.000000
adult reconcile social    0.000000
barack                    0.000000
Name: 0, dtype: float64
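In the output above, three of the five terms have a score of 0, meaning they do not occur in the document at all. One possible refinement is to filter out zero scores before taking the largest values. The sketch below rebuilds a small part of that row by hand as a pandas Series (the values are copied from the output above) so that it runs on its own.

```python
import pandas as pd

# A few entries of row 0, copied from the output above.
row = pd.Series(
    {"adult": 0.0, "adult reconcile social": 0.0, "barack": 0.0,
     "born": 0.707107, "obama": 0.707107},
    name=0,
)

# Keep only terms that actually occur in the document, then take at most 5.
top_terms = row[row > 0].nlargest(5)
print(top_terms)
```

In the notebook itself the same idea would apply to `df.iloc[doc]` instead of the hand-built `row`.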
