# Vectorization, TF-IDF, and Text Classification

Most machine learning models expect numeric input, and raw text is not numeric. Fortunately, we can convert text data into numeric representations through vectorization.

In [1]:
import pandas as pd
import numpy as np
import re

# Custom preprocessing function
from utils import preprocess_text

# Data
from sklearn.datasets import fetch_20newsgroups

# Vectorization methods
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Compile train/test DataFrames using scikit-learn's [`fetch_20newsgroups`](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

In [2]:
# Number of documents to keep from each split
n_docs = 100
# Restrict the data to a single category
categories = ['sci.space']
# categories = ['misc.forsale', 'sci.electronics', 'comp.sys.ibm.pc.hardware', 'rec.autos']

# Load the train and test splits from scikit-learn's fetch_20newsgroups
news_train = fetch_20newsgroups(subset="train",
                                remove=('headers', 'footers', 'quotes'),
                                categories=categories)
news_test = fetch_20newsgroups(subset="test",
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)

# Convert to pandas DataFrames
train_df = pd.DataFrame({"body": news_train['data'], 
                         "category": news_train['target']})
test_df = pd.DataFrame({"body": news_test['data'], 
                        "category": news_test['target']})

# Limit the number of documents, for the sake of the demonstration
train_df = train_df.iloc[:n_docs]
test_df = test_df.iloc[:n_docs]

# View the shapes of our datasets
print(f"Train Shape: {train_df.shape}")
print(f"Test Shape: {test_df.shape}")

Train Shape: (100, 2)
Test Shape: (100, 2)


### Apply preprocessing

The custom `preprocess_text()` function returns a list of clean tokens, but the vectorizers expect each document as a single string. In the cell below, we apply `preprocess_text()` and then join the tokens back together with a space between each word.

In [7]:
# preprocess_text returns a list of tokens; ' '.join(...) turns that list back into one string
train_df['body'] = train_df['body'].apply(lambda x : ' '.join(preprocess_text(x, min_word_length=4)))
test_df['body'] = test_df['body'].apply(lambda x : ' '.join(preprocess_text(x, min_word_length=4)))
train_df.head()

| | body | category |
|---|---|---|
| 0 | lunar satellite need fuel regular orbit correc... | 0 |
| 1 | glad griffin spend time engineer rather ritual... | 0 |
| 2 | spite great respect people speak cost estimate... | 0 |
| 3 | early fighter german work wwii military first ... | 0 |
| 4 | army signal corp intelligence photointelligenc... | 0 |
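The `preprocess_text()` helper lives in a local `utils` module that is not shown here. The sketch below is a hypothetical stand-in, not the author's implementation: it assumes the helper lowercases the text, keeps alphabetic tokens, drops a small illustrative stop-word list, and treats `min_word_length` as an inclusive lower bound. The real helper also appears to lemmatize tokens (note "need" and "engineer" in the output above), which this sketch does not attempt.

```python
import re

# Hypothetical stand-in for utils.preprocess_text (the real helper is not shown).
# The stop-word list is a tiny illustrative sample, not a complete one.
STOP_WORDS = {"the", "and", "that", "this", "with", "from", "have", "for", "are", "was"}

def preprocess_text(text, min_word_length=3):
    """Lowercase, keep alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_word_length and t not in STOP_WORDS]

sample = "The lunar satellite needs fuel for regular orbit corrections."
clean = " ".join(preprocess_text(sample, min_word_length=4))
print(clean)  # lunar satellite needs fuel regular orbit corrections
```

The final `" ".join(...)` mirrors the cell above: the token list becomes a single space-separated string, which is the format the vectorizers expect.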


## CountVectorizer

[`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a simple tool that turns raw text into feature vectors. We vectorize the text in two steps:
1. First, we `fit` the vectorizer on the training data to learn the vocabulary (the feature set).
2. Then, we `transform` both the train and test text to count the number of occurrences of each vocabulary word in every document.

The output of `CountVectorizer`'s `transform` method is a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), which stores only the non-zero values instead of a large number of zeros.

In [4]:
vectorizer = CountVectorizer()

# Learn the vocabulary from the 'body' field of the training set
vectorizer.fit(train_df['body'])

# Count how often each vocabulary term occurs in every train and test document
train_vecs = vectorizer.transform(train_df['body'])
test_vecs = vectorizer.transform(test_df['body'])
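To make the fit/transform split concrete, here is a minimal sketch on a made-up three-sentence corpus (the `toy_*` names and sentences are illustrative, not part of the dataset). It shows that the vocabulary comes only from the data passed to `fit`, so a word that appears only in the test text is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["rocket launch orbit", "orbit orbit satellite"]
toy_test = ["rocket telescope"]  # "telescope" never appears in the training text

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train)  # learn the vocabulary from the training text only

vocab = sorted(toy_vectorizer.vocabulary_)  # column order is alphabetical
train_counts = toy_vectorizer.transform(toy_train).toarray()
test_counts = toy_vectorizer.transform(toy_test).toarray()

print(vocab)         # ['launch', 'orbit', 'rocket', 'satellite']
print(train_counts)  # [[1 1 1 0]
                     #  [0 2 0 1]]
print(test_counts)   # [[0 0 1 0]]  -> "telescope" has no column
```

This is why the vectorizer above is fit on `train_df` only and then reused, unchanged, for `test_df`.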

#### What is the size of our vocabulary?

In [5]:
print(f"Number of documents: {train_vecs.shape[0]}")
print(f"Size of vocabulary: {train_vecs.shape[1]}")
# train_vecs is returned as a sparse matrix, a data type that saves memory
# by storing only the non-zero entries
# Number of distinct words in the vocabulary: 3142

Number of documents: 100
Size of vocabulary: 3142


#### How much of our feature set is just zeros?

As mentioned above, our vectorizer's `transform` method returns a sparse matrix. The `nnz` attribute of a sparse matrix gives the number of stored values, which here is the number of non-zero entries.

In [6]:
# Train
print(f"Number of TRAINING non-zero features: {train_vecs.nnz}")
print(f"Number of TRAINING zero features: {(train_vecs.shape[0]*train_vecs.shape[1])-train_vecs.nnz}")

# Test
print(f"Number of TEST non-zero features: {test_vecs.nnz}")
print(f"Number of TEST zero features: {(test_vecs.shape[0]*test_vecs.shape[1])-test_vecs.nnz}")

# Compare how many entries are zero vs. non-zero

Number of TRAINING non-zero features: 6436
Number of TRAINING zero features: 307764
Number of TEST non-zero features: 3601
Number of TEST zero features: 310599
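The same bookkeeping can be expressed as a single sparsity ratio. The sketch below uses a small hand-made matrix (an illustrative assumption, not the notebook's data) so the arithmetic is easy to check by eye.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny 3 x 4 count matrix with only three non-zero entries
dense = np.array([[0, 2, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 3]])
sparse = csr_matrix(dense)

n_cells = sparse.shape[0] * sparse.shape[1]  # total number of entries
n_nonzero = sparse.nnz                       # entries actually stored
sparsity = 1 - n_nonzero / n_cells           # fraction of entries that are zero

print(f"{n_nonzero} of {n_cells} cells are non-zero; sparsity = {sparsity:.2%}")
# 3 of 12 cells are non-zero; sparsity = 75.00%
```

Applied to the training matrix above, 6,436 of 100 x 3,142 = 314,200 entries are non-zero, so roughly 98% of the matrix is zeros.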


### Display a few terms and their counts for a few documents

This cell is for demonstration only and has no impact on the actual execution of our task. It is also only useful when the number of documents is small (<100); otherwise it will likely display mostly zeros.

In [8]:
# Transpose the DataFrame (.T) so that documents are the columns and word features are the rows.
# This is only to help with visualizing the features.

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_counts = pd.DataFrame(train_vecs.toarray(), 
                         columns=vectorizer.get_feature_names_out())[:30].T
df_counts.tail(25).style.background_gradient()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
world,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
worry,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
worth,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
worthwhile,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
wrap,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
wring,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
wrist,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
write,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0
wrong,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
wwii,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Term Frequency-Inverse Document Frequency (TF-IDF)

Tf-idf is a statistic that reflects how relevant a word is to a particular document within a corpus. _Relevance_, in this scenario, can be defined as how much information a word provides about the context of one document vs all other documents in the corpus. 

In short, tf-idf is calculated by comparing the number of times that a particular term occurs in a given document vs the number of documents in the corpus that contain that term. A word that occurs frequently in one document, but in only a very small number of other documents, will have a high tf-idf score for that document.

The calculation for tf-idf is the product of two smaller calculations:

$$TF_{i,j} = \frac{\text{Number of times word}_{i}\text{ occurs in document}_{j}}{\text{Total number of words in document}_{j}}$$


$$IDF_{i} = \log_{10}\left(\frac{\text{Total number of documents in corpus}}{\text{Number of documents that contain word}_{i}}\right)$$

##### Example: 

Let's say we have 10,000 documents about the solar system. If we were to take one single document with 200 terms and see that _Europa_ (one of Jupiter's moons) was mentioned 5 times, then _Europa's_ term frequency (tf) for that document would be: 

$$TF_{Europa, document} = \frac{5}{200}=0.025$$


Now if we were to see that _Europa_ only occurs in 50 of the total 10,000 documents, then the inverse document frequency (idf) would be: 

$$IDF_{Europa} = \log_{10}\left(\frac{10{,}000}{50}\right) = \log_{10}(200) \approx 2.3$$

Therefore our tf-idf score for _Europa_ for that given document would be:

$$ 0.025 \times 2.3 = 0.0575 $$
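The arithmetic in this example can be checked with a few lines of standard-library Python. This is a minimal sketch of the textbook formulas above; the variable names are illustrative, and the base-10 logarithm is an assumption inferred from the idf value of 2.3.

```python
import math

n_docs_total = 10_000    # documents in the corpus
n_docs_with_term = 50    # documents that mention "Europa"
term_count = 5           # times "Europa" occurs in our document
doc_length = 200         # total terms in our document

tf = term_count / doc_length
idf = math.log10(n_docs_total / n_docs_with_term)
tfidf = tf * idf

print(f"tf = {tf:.3f}, idf = {idf:.3f}, tf-idf = {tfidf:.4f}")
# tf = 0.025, idf = 2.301, tf-idf = 0.0575
```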

### TF-IDF Vectorization

Because it weighs a term's frequency against how common the term is across the corpus, the tf-idf score tends to be more informative than a simple count of occurrences. Below, we'll vectorize our data using tf-idf and then compare baseline classification results.

In [9]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(train_df['body'])
train_tfidf_vecs = tfidf_vectorizer.transform(train_df['body'])
test_tfidf_vecs = tfidf_vectorizer.transform(test_df['body'])
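Note that `TfidfVectorizer` does not use the textbook formulas exactly. With its default settings (`smooth_idf=True`, `norm='l2'`), it uses the raw count as the term frequency, computes the idf as `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit length. The sketch below reproduces that calculation by hand on a made-up three-document corpus (the `toy_docs` sentences are illustrative) and checks it against the vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["orbit orbit moon", "moon rocket", "rocket launch rocket"]

# Step 1: raw term counts (this is the "tf" part under the default settings)
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# Step 2: smoothed idf with the natural log, as used when smooth_idf=True
n = counts.shape[0]                  # number of documents
df = (counts > 0).sum(axis=0)        # number of documents containing each term
idf = np.log((1 + n) / (1 + df)) + 1

# Step 3: weight the counts, then L2-normalize each document (row)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, from_sklearn))  # True
```

Because of the smoothing and normalization, the scores produced by the vectorizer will not match a hand calculation like the _Europa_ example, even though the underlying idea is the same.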

### Display a few terms and their tf-idf scores for a few documents

This cell is for demonstration only and has no impact on the actual execution of our task. It is also only useful when the number of documents is small (<100); otherwise it will likely display mostly zeros.

In [10]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_tfidf = pd.DataFrame(train_tfidf_vecs.toarray(), 
                         columns=tfidf_vectorizer.get_feature_names_out())[:30].T
df_tfidf.tail(25).style.background_gradient()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
world,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.084434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.207875,0.0,0.0,0.0,0.0
worry,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
worth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
worthwhile,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103127,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wrap,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wring,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wrist,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
write,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.204128,0.0,0.0,0.011233,0.0,0.0
wrong,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wwii,0.0,0.0,0.0,0.558401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Comparison of the representation of the word "space" between the two vectorizers

In [11]:
pd.DataFrame({"TF-IDF: Space":df_tfidf.loc['space'], "CountVectorizer: Space":df_counts.loc['space']})

| | TF-IDF: Space | CountVectorizer: Space |
|---|---|---|
| 0 | 0.0 | 0 |
| 1 | 0.0 | 0 |
| 2 | 0.0 | 0 |
| 3 | 0.0 | 0 |
| 4 | 0.100583 | 1 |
| 5 | 0.0 | 0 |
| 6 | 0.014304 | 1 |
| 7 | 0.0 | 0 |
| 8 | 0.0 | 0 |
| 9 | 0.0 | 0 |
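Documents 4 and 6 both contain "space" exactly once, yet their tf-idf scores differ by a factor of about seven. One likely reason is the per-document L2 normalization that `TfidfVectorizer` applies by default: in a long document with many weighted terms, each individual term receives a smaller share of the unit-length vector. The sketch below illustrates the effect on two made-up documents (the sentences are illustrative, not taken from the dataset).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

short_doc = "space probe"
long_doc = "space probe launch orbit moon rocket fuel engine crew mission"

toy = TfidfVectorizer().fit([short_doc, long_doc])
col = toy.vocabulary_["space"]  # column index of the term "space"

# "space" occurs exactly once in each document
scores = toy.transform([short_doc, long_doc]).toarray()[:, col]
print(scores)  # the short document gets the larger score
```

So a count of 1 carries more weight in a short, focused document than in a long one, which is information the raw `CountVectorizer` output does not capture.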
