Welcome to the second part of the NLP series. In this notebook, we will touch on the subject of text classification. Let's start with what a typical text classification pipeline looks like.

# Building a Text Classification Pipeline

A generic text classification pipeline contains the following steps:

* Read the data and filter out the irrelevant words (called stopwords) and symbols (punctuation, sometimes emojis, etc.).

* Tokenize the data.

* Vectorize the data.

* Classify the data.

While each step has its own nuances, this pipeline applies to most text classification tasks. Here is an introduction to each step. As we code, we will dive deeper into the details.

-> **Filtering step:** For the base models, the filtering step is important and can significantly improve performance. On the other hand, more advanced models prefer to keep the stopwords because they may contain useful information about the context. For example, *like* and *not like* have opposite meanings. If we drop the stopword *not*, the output may suffer.

-> **Vectorization step:** Vectorization can be done in several ways. Sometimes the basic methods are more than enough, and other times we need more advanced ones.


-> **Classifier selection:** We have a variety of models for text classification. Smaller ones offer faster convergence, low computational requirements, and often good accuracy. The complex ones, on the flip side, can handle more complex tasks and often come with their own data preprocessors.

The strategy of an NLP engineer should be to start with small models and then increase the complexity.
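
To make the filtering caveat above concrete, here is a minimal sketch of a stopword filter. The stopword list `STOPWORDS` and the helper `filter_text` are hypothetical and written only for this illustration; real stopword lists are much longer.

```python
import re

# Hypothetical, hand-picked stopword list for illustration only
STOPWORDS = {"i", "do", "the", "a", "this", "not"}

def filter_text(text, stopwords=STOPWORDS):
    """Lowercase the text, keep only word-like tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]

positive = filter_text("I like this movie!")
negative = filter_text("I do not like this movie!")

print(positive)  # ['like', 'movie']
print(negative)  # ['like', 'movie']
```

After filtering, the two opposite reviews are identical, which is why dropping *not* can hurt a model that relies on these tokens.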

For this notebook we will use the SMS Spam Collection dataset and perform a text classification task in several ways.

## 1st Mini Project - TFIDF Vectorization & NB Classifier

For the first mini project we will use these tools:

**Tokenization:** Our vectorization algorithm has a built-in tokenizer, so I won't define a new one.

**Vectorization:** TFIDF Vectorization. Here is a brief explanation:

**TfIDFVectorization:**


After tokenizing the text, we need to feed these words to our model somehow. Since models only understand numerical inputs, we need to convert the words into a numerical form. We could use one-hot encoding, but then every word in the vocabulary gets its own dimension, so the vectors become huge and sparse and we run into the curse of dimensionality. We need a different approach. What about vectorizing the words such that *similar* or *relevant* ones have similar vectors? That's appealing. To define vector relatedness we need a metric. What about taking the dot product of two vectors? As vectors become more similar to each other, their dot product tends to increase:

$(1,0,2) · (1,2,0) = 1$

and

$(1,0,2) · (1,0,2) = 5$

We can fill the vector dimensions with the number of times each word occurs in each document. Similar words tend to occur at similar rates across the documents, so this should work.

One catch of this intuition is that the dot product (or similarity measure) rewards longer vectors. Therefore we need to normalize the word-count vectors:

 $v ⟵ \frac {v} {|v|}$

 $w ⟵ \frac {w} {|w|}$

And we get nothing but the cosine of the angle between these vectors:

$\cos θ = \frac {v·w} {|v||w|}$
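
As a quick check of the dot-product and cosine reasoning above, here is a small pure-Python sketch. The helper names (`dot`, `norm`, `cosine_similarity`) and the vector `big_v` are made up for this illustration.

```python
import math

def dot(v, w):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v, w))

def norm(v):
    """Euclidean length |v|."""
    return math.sqrt(dot(v, v))

def cosine_similarity(v, w):
    """cos(theta) = v.w / (|v||w|)"""
    return dot(v, w) / (norm(v) * norm(w))

v = (1, 0, 2)
w = (1, 2, 0)
big_v = (10, 0, 20)  # same direction as v, ten times longer

print(dot(v, w))      # 1
print(dot(v, v))      # 5
print(dot(big_v, w))  # 10, the raw dot product rewards the longer vector

# After normalization, v and big_v are equally similar to w (about 0.2)
print(round(cosine_similarity(v, w), 3))
print(round(cosine_similarity(big_v, w), 3))
```

The raw dot product changes when a vector is scaled, while the cosine stays the same, which is the motivation for normalizing.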

Ok, we solved the normalization problem, but raw word counts are still a crude representation. This is where the TF-IDF scoring system comes into play:


Term Frequency (as you may guess from its name) refers to how often a word occurs in a document, relative to the total number of words in that document.

$TF_{word_i} = \frac {number\,of\,occurrences\,of\,word\,i\,in\,the\,document} {total\,number\, of\, words\, in\, the\, document} $

The importance of TF is obvious. If a word occurs more often, then it is probably more important.
While the idea behind TF is useful, it would be wrong to trust TF alone. For example, consider the word *the*. Its frequency is probably the highest in most datasets. Does this make it the most important word? No; in fact, it carries very little meaning on its own. Therefore, we need another tool to make the scoring more robust. This tool is Inverse Document Frequency (IDF). Let's look at this second component:

$IDF_{word_i} = \log\left(\frac {number\,of\,documents\,(sentences)} {number\,of\,documents\, containing\, word\,i}\right)$

\

and TF-IDF score is calculated by:

\

$TFIDF = TF × IDF$

\
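
The formulas above can be tried by hand on a toy corpus. This is only a sketch: the corpus and the helper names `tf`, `idf`, and `tfidf` are made up, and TF is computed per document, which is the usual convention. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each string is one document
docs = [
    "the movie was great",
    "the movie was boring",
    "the plot was great",
]
tokenized = [doc.split() for doc in docs]

def tf(word, tokens):
    """Share of the document's tokens that are `word`."""
    return Counter(tokens)[word] / len(tokens)

def idf(word, corpus):
    """log(number of documents / number of documents containing `word`)."""
    n_containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

# Score every word of the second document
scores = {w: round(tfidf(w, tokenized[1], tokenized), 3) for w in tokenized[1]}
print(scores)  # {'the': 0.0, 'movie': 0.101, 'was': 0.0, 'boring': 0.275}
```

Words that appear in every document (*the*, *was*) score zero, while the rare word *boring* gets the highest score.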



In [1]:
#import the libraries
import zipfile
import pandas as pd
import numpy as np
import re
import textblob
import random
import requests
import io
import nltk

nltk.download('punkt')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report #used later to evaluate the classifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#seed the process
SEED = 42
np.random.seed(SEED)
random.seed(SEED)


[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data]   Package punkt is already up-to-date!


In [2]:
#read and extract the dataset from the source
URL = 'https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip'
r = requests.get(URL, stream=True)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
z.close()

In [3]:
#read the dataset
with open("/content/SMSSpamCollection",'r') as t:
  lines = t.readlines()

#display an instance
print(lines[0])

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...




In [4]:
#initialize a container
labels = []
texts = []

#split the labels and texts
for line in lines:
  line_split = line.split("\t")
  label,text = line_split[0],line_split[1][:-2] #note: [:-2] drops the trailing \n and also the last character of the message
  labels.append(label)
  texts.append(text)

#show the instances
print("Text samples:",texts[0:3])
print("Label samples:",labels[0:3])

Text samples: ['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..', 'Ok lar... Joking wif u oni..', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18'"]

Label samples: ['ham', 'ham', 'spam']


In [5]:
#create train test datasets
X_train,X_test,y_train,y_test = train_test_split(texts,labels, random_state = SEED,stratify = labels)

vectorizer = TfidfVectorizer() #has a built-in tokenizer so we don't necessarily define a tokenizer separately
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [6]:
X_train_vectorized

<4180x7757 sparse matrix of type '<class 'numpy.float64'>'
	with 55444 stored elements in Compressed Sparse Row format>

**Classifier selection:** Our first model will be Naive Bayes Classifier. Let's recall the logic behind the algorithm first.

Recall the Bayes Theorem:

$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$

If you haven't noticed, the theorem is nothing but writing the same thing in two different forms:

$P(A|B) \times P(B) = P(A\cap B) = P(B|A) \times P(A)$

The NB algorithm itself uses this theorem. It is called *naive* because it assumes that the features (here, the words) are independent of each other given the class. Mathematically speaking:

Objective function: $\hat{c} = \underset{c∈C}{\arg\max}\, P(c|d) = \underset{c∈C}{\arg\max}\, \frac{P(d|c) \times P(c)}{P(d)}$

where

* $c$ : Class c
* $d$: Document d


Generally we omit the denominator because it is the same for every class.


Given the words $X_1,X_2,...,X_n$ in document $d$, we calculate the following for each class $c$:

\

$P(c|X_1,X_2,...,X_n) ∝ {P(X_1,X_2,...,X_n|c) \times P(c)}$

\

and

\

$P(X_1,X_2,...,X_n|c) ≈ P(X_1|c) \times P(X_2|c) \times ... \times P(X_n|c)$ - Naive Assumption

\

If you haven't done so before, I highly recommend practising the algorithm on a small scale by running it manually on paper. [Here](https://gauthamsanthosh.medium.com/understanding-naive-bayes-in-real-world-3c4da612a0cf) you can see an example of the algorithm.

**One last note:** To speed up the algorithm and prevent underflow, we carry out these calculations on a logarithmic scale, where the products become sums.
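
To make the log-scale note concrete, here is a minimal hand-rolled Naive Bayes sketch on a made-up toy dataset. It is not how `MultinomialNB` is implemented internally; the add-one smoothing (`alpha`) and all names (`train`, `log_posterior`, `predict`) are assumptions chosen for this illustration.

```python
import math
from collections import Counter

# Toy training data (made up for illustration)
train = [
    ("spam", "win free prize now"),
    ("spam", "free entry win cash"),
    ("ham", "see you at lunch"),
    ("ham", "are you free at noon"),
]

# Count documents per class and word occurrences per class
class_docs = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_docs}
for label, text in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(c) + sum_i log P(X_i|c), with add-alpha smoothing (an assumption)."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        p = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(text):
    """Pick the class with the highest log-posterior score."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("win free cash"))    # spam
print(predict("see you at noon"))  # ham

# Why logs: multiplying many small probabilities underflows to 0.0,
# while summing their logs stays representable.
probs = [1e-5] * 100
print(math.prod(probs))                 # 0.0
print(sum(math.log(p) for p in probs))  # about -1151.29
```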



In [7]:
nb = MultinomialNB()

nb.fit(X_train_vectorized,y_train)

preds = nb.predict(X_test_vectorized)

results = classification_report(y_test,preds)

In [8]:
print(results)

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1207
        spam       1.00      0.71      0.83       187

    accuracy                           0.96      1394
   macro avg       0.98      0.85      0.90      1394
weighted avg       0.96      0.96      0.96      1394




## 2nd Mini Project - Word2Vec and Logistic Regression

For the second project, we will change both our vectorization and classification methods. Let's start with Word2Vec.

**Word2Vec**

One of the biggest downsides of TF-IDF is that it does not consider the position of a word. If you have heard the term Bag-of-Words before, this is what it refers to: split a phrase into words and throw them into a bag, ignoring their order.

We know that *context* is important in languages. Words can have multiple meanings or can be used figuratively. One of the earliest vectorization methods that takes context into account is Word2Vec. It is a shallow Neural Network (NN) architecture trained on pairs of nearby words taken from a sliding context window, together with randomly drawn negative samples (word pairs that do not occur together). After training, the NN gives us a set of vectors called embeddings.
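
To give a feel for the training data such a model sees, here is a sketch that generates (target, context) pairs from a sliding window, as in the skip-gram variant of Word2Vec. The function `skipgram_pairs` is a hypothetical helper, not part of gensim, and the sketch covers only the positive pairs; negative samples would be additional pairs with randomly drawn words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "free entry to win cash".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
# ('free', 'entry')
# ('entry', 'free')
# ('entry', 'to')
# ('to', 'entry')
# ('to', 'win')
# ('win', 'to')
# ('win', 'cash')
# ('cash', 'win')
```

Words that share many context words end up with similar embeddings, which is how the context information enters the vectors.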

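Difference between embeddings and vectors:

An embedding is a particular kind of word vector. Embeddings are short and dense, usually with 50-100 dimensions, whereas count-based vectors such as TF-IDF have one dimension per vocabulary word and are mostly zeros.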



In [9]:
#import the libraries
from sklearn.linear_model import LogisticRegression
import gensim

#set seed
SEED = 42
np.random.seed(SEED)
random.seed(SEED)


In [10]:
#tokenize the training instance one by one
tokenized_training_words = [nltk.word_tokenize(text) for text in X_train]
tokenized_test_words = [nltk.word_tokenize(text) for text in X_test]

#fit a Word2Vec model on the training tokens
vectorizer2 = gensim.models.Word2Vec(tokenized_training_words,min_count = 2)

Texts have variable lengths; however, logistic regression expects a fixed input shape. To overcome this issue, we will take the mean of the word vectors for each document.

In [11]:
def get_mean_vector(word2vec_model, tokens):

    #collect the embedding vector of every token that is in the Word2Vec vocabulary
    vector_list = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv.key_to_index]

    #if none of the tokens are in the vocabulary, return a zero vector
    if len(vector_list) == 0:
        return np.zeros(word2vec_model.vector_size)

    #otherwise return the mean of the vector list
    return np.mean(vector_list, axis=0)

In [12]:
#get the word2vec vector
train_token_means = [get_mean_vector(vectorizer2, tokens) for tokens in tokenized_training_words]
test_token_means = [get_mean_vector(vectorizer2, tokens) for tokens in tokenized_test_words]

In [13]:
#get a sample
print(train_token_means[0])

[-0.21084648  0.445389    0.05559346  0.04506045  0.0531324  -0.6494761

  0.1931807   0.90254235 -0.43405607 -0.28759435 -0.19669372 -0.659014

 -0.20870633  0.32652533  0.09727406 -0.26316154  0.13595487 -0.36503133

 -0.07884301 -1.1389375   0.290794    0.16797522  0.29810777 -0.12711196

 -0.13246979  0.03279702 -0.3087925  -0.23891045 -0.27761534 -0.01645306

  0.42346478 -0.02779581  0.16372022 -0.6320245  -0.08380583  0.5589235

  0.12493669 -0.18918139 -0.33667296 -0.6767805   0.14252342 -0.43321875

 -0.24966075  0.03969884  0.4288661  -0.2743494  -0.37856305  0.05805048

  0.2928564   0.41777518  0.1469517  -0.3204251  -0.08994652  0.1990933

 -0.11836198  0.13418795  0.25779444 -0.04896947 -0.44929236  0.11987397

  0.1331235   0.17214713  0.03372408 -0.02635223 -0.475709    0.40393704

  0.0297726   0.2655269  -0.54664475  0.62844396 -0.1733785   0.4054786

  0.45923078 -0.09581389  0.4676061   0.02291599  0.1467095  -0.08983044

 -0.22206588  0.02735656 -0.3057152  -0.1068

**Classifier Selection:** Logistic Regression

Logistic Regression is a simple yet powerful classification algorithm. It has low computational requirements and generally performs well on binary classification tasks.

I won't dive into the details of logistic regression because there are plenty of resources on it. I will leave a [link](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) for those who are not familiar with it.
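
For a quick intuition only, the core of a logistic regression prediction is a weighted sum of the features passed through a sigmoid. The weights, bias, and feature values below are hypothetical numbers chosen for illustration, not values learned in this notebook.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and features, chosen only for illustration
weights = [2.0, -1.0]
bias = -0.5
p = predict_proba([1.0, 0.5], weights, bias)

print(round(p, 3))  # 0.731
```

A probability above 0.5 is typically mapped to the positive class; training consists of finding the weights and bias that fit the labelled data.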

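Which one performs better, LR or NB?

The naive independence assumption rarely holds for real text, especially for long documents, so LR is often the stronger model when both classifiers use the same features. The comparison in this notebook is not like-for-like, though, because the two classifiers are fed different features (TF-IDF versus averaged Word2Vec vectors).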

In [14]:
#define and train a default LR
model = LogisticRegression()
model.fit(train_token_means,y_train)

In [15]:
model.score(test_token_means,y_test)

0.8637015781922525