# IMDB sentiment classification task using BOW

Inspired by fast.ai: https://course18.fast.ai/ml

A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Reviews with a score of 5 or 6 are considered neutral and are therefore not included in the dataset.

In [1]:
from glob import glob
import pandas as pd
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import itertools
import re
import os
import string
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Activation
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.regularizers import l2
from tensorflow.keras.models import load_model
from sklearn.model_selection import train_test_split
from IPython.display import Image
from math import ceil

## Preparing the dataset
The aclImdb folder contains two folders: train and test. Each of these folders contains the neg and pos folders with movie reviews.
A negative review is labeled with 0, and a positive review with 1.


### Parsing

In [2]:
def parse_folder(name):
    data = []
    for verdict in ('neg', 'pos'):
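        for file in glob(os.path.join(name, verdict, '*.txt')):
            # Read the review and close the file handle right away
            with open(file, encoding='utf8') as f:
                text = f.read()
            data.append({
                'text': text,
                'verdict': verdict == 'pos'  # True (1) for positive, False (0) for negative
            })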
    return pd.DataFrame(data)

In [3]:
df_train = parse_folder('../input/aclimdb/aclImdb/train/')
df_test = parse_folder('../input/aclimdb/aclImdb/test/')

In [4]:
df_train.iloc[0].text, df_train.iloc[0].verdict

("I rented this movie tonight because it looked like a fun movie. I figured that you really couldn't go wrong with a concept of Ex Girlfriend with super powers. <br /><br />... but the movie was confused and pointless ...<br /><br />it seemed that at every turn the writer kept throwing junk in. Also the writer kept throwing in way too much toilet humor and sexual situations that only a teenage boy could love.<br /><br />It seems that it could have been so simple to draw a story out of Fatal Attraction Super hero .. but I guess not. <br /><br />This is not a fun romantic comedy it was advertised to be. You could not take a child to see it and you would be embarrassed seeing it a date. <br /><br />If the writer could have done a basic story around the high concept and cleaned it up - the movie might have a fighting chance. <br /><br />A serious waste of time.<br /><br />B",
 False)

### BOW approach

A Bag Of Words (BOW) approach represents text in a form that machine learning models can use, which is usually a tensor (i.e., a vector or an array). It is fairly easy to implement and can be quite effective, especially when dealing with short texts, such as movie reviews.  
For illustration, suppose there are only three reviews in the dataset.

In [5]:
reviews = ['This movie is good', 'The movie is bad', 'Bad this movie was']

The first thing to do is to define the *vocabulary*, which is the set of all words. Then the *document-term matrix* is created, which can be seen as a matrix interpretation of the dataset.

In [6]:
vocabulary = set()
for review in reviews:
    for word in review.split(' '):
        vocabulary.add(word)
print(vocabulary)

{'The', 'good', 'bad', 'Bad', 'This', 'is', 'this', 'was', 'movie'}


The *document-term matrix* or *term-document matrix* is obtained by viewing every word in the vocabulary as a column, indicating its presence or number of occurrences in every document. In this context a document is a movie review. This matrix is the *bag of words* representation.

In [7]:
doc_term = []
for document in reviews:
    row = {'!document': document}
    row.update({word: document.split(' ').count(word) for word in sorted(vocabulary)})  # sorted for a stable column order
    doc_term.append(row)
doc_term = pd.DataFrame(doc_term)
display(doc_term)

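| | !document | Bad | The | This | bad | good | is | movie | this | was |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | This movie is good | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | The movie is bad | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | Bad this movie was | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |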


The BOW representation works on the principle that if the word *good* appears a lot in a movie review, the review is very likely a positive one. Unfortunately, this approach doesn't take into account the order in which the words appear in a sentence, which is fundamental to its meaning.  
**The movie is shit!** and **The movie is the shit!** look very similar when considering their BOW representation, but these two sentences have very different meanings.
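The loss of word order can be checked directly. The sketch below is a simplified illustration that uses only `collections.Counter` and two made-up sentences (they are assumptions, not reviews from the dataset).

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a simplified BOW)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Different sentences, identical bags: the word order is lost.
print(a == b)
```

Both sentences map to the same counts, so any model that only sees the BOW representation cannot tell them apart.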

### Tokenization

Tokenization refers to the process of turning a piece of text into a list of tokens or symbols, dealing with punctuation. For example, the text **You call this a "movie"?! It isn't good at all!** could be tokenized like this: **You call this a " movie " ?! It is n't good at all !**. The text is first separated into sentences, then each sentence is separated into tokens.

A common way to do tokenization is with regular expressions, but [nltk](https://www.nltk.org/api/nltk.tokenize.html) contains lots of tokenizers to choose from.

In [8]:
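# Escape the ASCII punctuation so that characters such as '[', ']' and '\' are safe inside the character class
re_tok = re.compile(f'([{re.escape(string.punctuation)}“”¨«»®´·º½¾¿¡§£₤‘’])')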
def tokenize_re(text):
    return re_tok.sub(r' \1 ', text).split()
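To see what the regex approach produces, here is a self-contained sketch that handles ASCII punctuation only. The names `punct` and `tokenize_simple` are hypothetical and exist just for this illustration.

```python
import re
import string

# Simplified pattern: ASCII punctuation only, escaped so that characters
# such as ']' or '\' are safe inside the character class.
punct = re.compile(f"([{re.escape(string.punctuation)}])")

def tokenize_simple(text):
    # Surround every punctuation character with spaces, then split on whitespace.
    return punct.sub(r" \1 ", text).split()

tokens = tokenize_simple('You call this a "movie"?! It isn\'t good at all!')
print(tokens)
```

Note that this splits *isn't* into `isn`, `'` and `t`, because every punctuation character becomes its own token. A dedicated tokenizer can treat contractions more carefully.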

In [9]:
def tokenize_nltk(text):
    return list(itertools.chain.from_iterable(word_tokenize(sentence) for sentence in sent_tokenize(text)))    

### Stemming

Words such as gaming, gamed, games are replaced with game. Only the stems are kept.

In [10]:
stemmer = PorterStemmer()
def tokenize(text):
    return [stemmer.stem(word) for word in tokenize_nltk(text)]

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is used to create the matrix of token counts. Instead of single words, it is possible to use n-grams. An n-gram is a sequence of n consecutive tokens.
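As a rough illustration of what an n-gram is, the sketch below builds them by hand from a token list. The helper `ngrams` is a hypothetical name; CountVectorizer builds its n-grams internally.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

The bigram `('not', 'good')` carries information that the separate unigrams *not* and *good* do not, which is one reason n-grams can help a BOW model.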

In [11]:
vectorizer = CountVectorizer(ngram_range=(1, 3), tokenizer=tokenize, max_features=1000000)

In [12]:
train_doc_term = vectorizer.fit_transform(df_train.text)
test_doc_term = vectorizer.transform(df_test.text)
train_doc_term

<25000x1000000 sparse matrix of type '<class 'numpy.int64'>'
	with 12632304 stored elements in Compressed Sparse Row format>

Most entries of this matrix are 0, because only a small percentage of the vocabulary appears in each document. In order to keep the matrix from exploding in memory, 0s are not stored.
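The idea behind the Compressed Sparse Row format can be sketched in plain Python. This is a simplified illustration under the usual CSR layout (three arrays: values, column indices, row pointers), not the actual SciPy implementation. The helper name `to_csr` is hypothetical.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# The toy document-term matrix of the three example reviews.
dense = [
    [0, 0, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 1],
]
data, indices, indptr = to_csr(dense)
print(len(data), "stored values instead of", 3 * 9)
```

Row `i` is recovered from the slice `indptr[i]:indptr[i + 1]` of `data` and `indices`, so only the non-zero entries take up memory.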

The vocabulary can be seen with *get_feature_names* (*get_feature_names_out* in scikit-learn 1.0 and later).

## Naive Bayes
Using the term-document matrix it is possible to infer the probability of a review being positive or negative.
https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes

### Basic probability rules

Probability of C occurring, given D: $P(C|D)=\frac{P(C \wedge D)}{P(D)}$  
Probability of both C and D occurring: $P(C \wedge D)=P(C|D)*P(D)$
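A small numeric check of these two rules, using exact fractions. All counts below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

total = 100          # hypothetical number of reviews
n_d = 40             # reviews containing the word "great" (event D)
n_c_and_d = 30       # reviews that contain "great" and are positive (C and D)

p_d = Fraction(n_d, total)
p_c_and_d = Fraction(n_c_and_d, total)

# First rule: P(C|D) = P(C and D) / P(D)
p_c_given_d = p_c_and_d / p_d
print(p_c_given_d)

# Second rule: P(C and D) = P(C|D) * P(D)
print(p_c_given_d * p_d == p_c_and_d)
```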

In [13]:
Image(url='https://ibin.co/4hxqDwhJnxCE.png')

### Inferring probabilities using the term-document matrix
Probability of class C being 1 (positive review), given a document D:
$P(C=1|D)=\frac{P(D \wedge C=1)}{P(D)}=\frac{P(D|C=1)*P(C=1)}{P(D)}$  
By computing $\frac{P(C=1|D)}{P(C=0|D)}$ we obtain $\frac{P(D|C=1)}{P(D|C=0)}*\frac{P(C=1)}{P(C=0)}$. If this number is greater than 1, the probability of the review being a positive one is greater than the probability of it being negative.  
For each term we compute the probabilities of it appearing in a negative and in a positive review.

Learn more about probabilistic inference from [here](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/) (lectures 21 and 22).
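The decision rule based on the ratio can be sketched as follows. The function name `posterior_odds` and all the probabilities are hypothetical, picked only to make the arithmetic easy to follow.

```python
def posterior_odds(p_d_given_pos, p_d_given_neg, p_pos, p_neg):
    """Return P(C=1|D) / P(C=0|D); P(D) cancels out of the ratio."""
    return (p_d_given_pos / p_d_given_neg) * (p_pos / p_neg)

# Hypothetical numbers, chosen only for illustration.
odds = posterior_odds(p_d_given_pos=0.02, p_d_given_neg=0.005, p_pos=0.5, p_neg=0.5)

# A ratio above 1 means the positive class is the more probable one.
label = 1 if odds > 1 else 0
print(odds, label)
```

Because $P(D)$ appears in both the numerator and the denominator, it never has to be computed.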

In [14]:
classes = np.array([1, 0, 0])  # labels of the three toy reviews: positive, negative, negative
doc_term_mat = doc_term.drop('!document', axis='columns').values
display(doc_term_mat)
p_c = np.array([(classes == 0).mean(), (classes == 1).mean()])  # class priors P(C=0), P(C=1)
p_dc = np.ones((2, doc_term_mat.shape[1])) # use ones because by default every term can appear once in every class
for col in range(doc_term_mat.shape[1]):
    for row in range(doc_term_mat.shape[0]):
        # add the term count to the class of the current document (row)
        p_dc[classes[row]][col] += doc_term_mat[row][col]
for c in (0, 1):
    p_dc[c] = p_dc[c] / p_dc[c].sum()
display(p_c, p_dc)

array([[0, 0, 1, 0, 1, 1, 1, 0, 0],
       [0, 1, 0, 1, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 1]])

array([0.66666667, 0.33333333])

array([[0.11764706, 0.11764706, 0.05882353, 0.11764706, 0.05882353,
        0.11764706, 0.17647059, 0.11764706, 0.11764706],
       [0.07692308, 0.07692308, 0.15384615, 0.07692308, 0.15384615,
        0.15384615, 0.15384615, 0.07692308, 0.07692308]])

To score a document against a class, we multiply the probabilities of all its terms given that class and then multiply by the probability of the class itself. This gives $P(D|C)*P(C)$, which is proportional to $P(C|D)$. For example, for the first review and the negative class, the score is approximately $0.059*0.059*0.118*0.176*0.67$. The first four factors are the probabilities of *This*, *good*, *is* and *movie* given the negative class, and the last 0.67 is the probability of the negative class.

### Why is it naive?

It assumes that the terms of a document are independent of each other given the class. In reality, the probability of the term *awful* appearing in a negative review is not independent of the probability of *bad* appearing in it.

In [15]:
clf = MultinomialNB()
clf.fit(train_doc_term, df_train.verdict)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
clf.score(test_doc_term, df_test.verdict)

0.87084

### How similar is it to Logistic Regression?

In order to convert multiplication into addition we can work with logarithms.  
  
$\log(\frac{P(D|C=1)}{P(D|C=0)}*\frac{P(C=1)}{P(C=0)})=\log(\frac{P(D|C=1)}{P(D|C=0)})+\log(\frac{P(C=1)}{P(C=0)})$.  
  
Under the naive assumption, $\log(\frac{P(D|C=1)}{P(D|C=0)})$ is the sum, over the terms $t$ of the document, of $\log(\frac{P(t|C=1)}{P(t|C=0)})$ weighted by the count of $t$.  
  
Y = predicted values  
  
X = term-doc matrix  
  
W = $\log(\frac{P(t|C=1)}{P(t|C=0)})$, one weight per term $t$  
  
B = $\log(\frac{P(C=1)}{P(C=0)})$  
  
$Y = X * W + B$

Instead of learning these coefficients, we approximated them using a theoretical model.

In [17]:
weights = np.log(p_dc[1] / p_dc[0])
bias = np.log(p_c[1] / p_c[0])
display(weights, bias, doc_term_mat @ weights + bias)

array([-0.42488319, -0.42488319,  0.96141117, -0.42488319,  0.96141117,
        0.26826399, -0.13720112, -0.42488319, -0.42488319])

-0.6931471805599453

array([ 1.36073802, -1.4118507 , -2.10499788])

The threshold is now 0 instead of 1, because $\log 1 = 0$.
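The equivalence of the two thresholds can be checked numerically. The ratios below are hypothetical values, not taken from the dataset. The second half of the sketch shows a practical side effect of working with logarithms: a long product of small probabilities underflows to zero, while the sum of their logs stays finite.

```python
import math

# Hypothetical per-term likelihood ratios P(term|C=1) / P(term|C=0) for one review.
ratios = [2.0, 0.5, 3.0]
prior_ratio = 0.5  # hypothetical P(C=1) / P(C=0)

odds = prior_ratio * math.prod(ratios)
log_odds = math.log(prior_ratio) + sum(math.log(r) for r in ratios)

# Both decision rules agree: odds > 1 exactly when log_odds > 0.
print(odds > 1, log_odds > 0)

# With many terms, the plain product underflows to 0.0,
# while the sum of logs remains a finite number.
tiny = [1e-5] * 100
product = math.prod(tiny)
log_sum = sum(math.log(p) for p in tiny)
print(product, log_sum)
```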

In [18]:
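# Binarize: True if the term appears in the document at all
train_doc_term_bool = train_doc_term > 0
# Smoothed fraction of negative / positive documents that contain each term
r_neg = (train_doc_term_bool[df_train.verdict.values == 0].sum(0) + 1) / (sum(df_train.verdict == 0) + 1)
r_pos = (train_doc_term_bool[df_train.verdict.values == 1].sum(0) + 1) / (sum(df_train.verdict == 1) + 1)
# Log-ratio per term, flattened to a 1-D array of coefficients
coef = np.log(np.asarray(r_pos / r_neg).flatten())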

## Logistic Regression

By doing logistic regression, we try to fit a hyperplane in the n-dimensional feature space that separates positive reviews from negative reviews.
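A minimal sketch of how such a model turns a document vector into a prediction. The two-term vocabulary, the weights and the bias are hypothetical, chosen only for illustration.

```python
import math

def predict_proba(x, w, b):
    """Sigmoid of the linear score w.x + b, read as P(positive | x)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for a two-term vocabulary: ['good', 'bad'].
w, b = [1.5, -2.0], 0.1

p_good = predict_proba([1, 0], w, b)  # review containing only 'good'
p_bad = predict_proba([0, 1], w, b)   # review containing only 'bad'
print(round(p_good, 3), round(p_bad, 3))
```

Points where the score `w.x + b` is exactly 0 lie on the separating hyperplane and get a probability of 0.5.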

In [19]:
lreg = LogisticRegression(C=0.2, solver='liblinear', max_iter=500, dual=True) # C is the inverse of the regularization strength
lreg.fit(train_doc_term, df_train.verdict)
lreg.score(test_doc_term, df_test.verdict)

0.90092

Learning the coefficients seems to yield better results than the theoretical model.

## Combining the two approaches

We can initialize the coefficients with those obtained from Naive Bayes and start optimizing from there, instead of randomly initializing them.
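The effect of such a warm start can be sketched on a toy problem with plain gradient descent. Everything here is an assumption made for illustration: the data, the initial weights `w_init` (standing in for Naive Bayes log-ratios), the learning rate, and the bias-free model. It is not the training setup used in this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, w):
    """Mean binary cross-entropy of a bias-free logistic model."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def gradient_step(X, y, w, lr=0.5):
    """One full-batch gradient descent step on the log loss."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
        for j, xj in enumerate(xi):
            grad[j] += err * xj / len(X)
    return [wj - lr * gj for wj, gj in zip(w, grad)]

def train(X, y, w, steps=20):
    for _ in range(steps):
        w = gradient_step(X, y, w)
    return w

# Toy data: columns are counts of ['good', 'bad']; label 1 = positive.
X = [[1, 0], [2, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]

w_init = [0.7, -0.7]               # hypothetical Naive Bayes log-ratios
w_warm = train(X, y, list(w_init)) # start from the Naive Bayes estimate
w_cold = train(X, y, [0.0, 0.0])   # start from zeros for comparison

print(log_loss(X, y, w_init), log_loss(X, y, w_warm), log_loss(X, y, w_cold))
```

On this toy problem the warm-started weights reach a lower loss than the zero-initialized ones after the same number of steps, because they begin closer to a good solution.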

In [20]:
def batch_generator(X, y, batch_size, shuffle=False):
    number_of_batches = ceil(X.shape[0]/batch_size)
    counter = 0
    sample_index = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(sample_index)
    while True:
        batch_index = sample_index[batch_size*counter:batch_size*(counter+1)]
        X_batch = X[batch_index,:].toarray()
        y_batch = y[batch_index]
        counter += 1
        yield X_batch, y_batch
        if (counter == number_of_batches):
            if shuffle:
                np.random.shuffle(sample_index)
            counter = 0

In [21]:
net = Sequential([
    Dense(1, activation='sigmoid', input_dim=train_doc_term.shape[1], kernel_regularizer=l2(0.1)),
])
net.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
callbacks = [
    ModelCheckpoint('nn_best.h5', monitor='val_acc', verbose=0, save_weights_only=False, save_best_only=True, mode='max'),
    EarlyStopping(monitor='val_loss', min_delta=0, patience=1, verbose=0, mode='auto')
]



In [22]:
net.set_weights([coef.reshape(-1, 1), np.zeros(1)])  # kernel from Naive Bayes, bias starts at zero

In [23]:
X_train, X_valid, Y_train, Y_valid = train_test_split(train_doc_term, df_train.verdict.values, test_size=0.2, stratify=df_train.verdict.values)

In [24]:
batch_size = 64
net.fit_generator(
    generator=batch_generator(X_train, Y_train, batch_size, shuffle=True),
    validation_data=batch_generator(X_valid, Y_valid, batch_size), validation_steps=ceil(len(Y_valid) / batch_size),
    epochs=5, steps_per_epoch=ceil(len(Y_train) / batch_size), callbacks=callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f0226be3da0>

In [25]:
net = load_model('nn_best.h5')

In [26]:
net.evaluate_generator(batch_generator(test_doc_term, df_test.verdict, batch_size=batch_size), steps=ceil(len(df_test.verdict) / batch_size))

[144.91532070618456, 0.90224]