# Bag of Words

I will be building a simple bag-of-words (BoW) text classifier using PyTorch. It will be trained on the IMDB movie reviews dataset. So, I start by importing the necessary libraries and defining key variables...

In [110]:
import pandas as pd
import torch
import os


ROOT_DIR = os.getcwd()
DATA_DIR = os.path.join(ROOT_DIR,"data")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(ROOT_DIR,device)

D:\PROJECTS\Github\nlp-basics cuda


In [10]:
df = pd.read_csv(os.path.join(DATA_DIR,"imdb_reviews.csv"))
df.head()

Unnamed: 0,review,label
0,Once again Mr. Costner has dragged out a movie...,0
1,This is an example of why the majority of acti...,0
2,"First of all I hate those moronic rappers, who...",0
3,Not even the Beatles could write songs everyon...,0
4,Brass pictures (movies is not a fitting word f...,0


In [20]:
df.label.unique()

array([0, 1], dtype=int64)

## Data description

We are provided with a dataset of 62155 textual reviews, each labelled positive (1) or negative (0). There are no NaNs, so it is safe to proceed. There are 31285 negative and 30870 positive reviews.
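The figures above can be checked with a couple of pandas calls. The sketch below runs on a tiny stand-in frame (the `toy` DataFrame and its rows are made up for illustration); on the real data the same calls would presumably be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real reviews frame (same column names)
toy = pd.DataFrame({
    "review": ["loved it", "hated it", "what a waste", "a classic"],
    "label": [1, 0, 0, 1],
})

# Total number of NaN cells across all columns
n_missing = int(toy.isna().sum().sum())

# Number of reviews per label
class_counts = toy.label.value_counts().to_dict()

print(len(toy), n_missing, class_counts)
```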


## Bag of Words Representation

So, we have our dataset loaded, but all the data is text. To convert it into a numeric tensor, we need some kind of "representation". An easy way of representing text as numbers is to:

1. Split the text into individual words
2. Compare it against a list of known words
3. Count the number of times a particular word occurs

This is the basic idea behind BoW: we split texts into words, compare them against our "Vocabulary" (the set of known words) and finally count the occurrences of each known word. So, if our Vocabulary has 5 words, say,

{"alpha","dog","jumps","quick","the"}

then the sentence:
"The quick brown fox jumps over the lazy dog"
should generate the following 5-element vector (as our vocab is size 5):

1. ["the" , "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]  (also called tokenization)
2. { 0 alphas , 1 dog , 1 jumps , 1 quick , 2 the } (also called numericalization)
3. thus, input vector for sentence = [ 0 , 1 , 1 , 1 , 2 ]

This is the basic version. Words that are not in the vocabulary ("brown", "fox", "over", "lazy") are simply ignored. Refinements to this scheme, such as dropping stop words or very rare words, can change the model's behaviour and accuracy significantly. It should also be noted that the vocab size directly affects the memory needed, as each input row has as many features as there are words in the vocabulary. Thus, the vocabulary should be kept as small as is practical.
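The three steps can be sketched in plain Python. This is only an illustration of the worked example above, using a deliberately naive whitespace tokenizer; it is not the vectorizer used later in this notebook.

```python
from collections import Counter

vocab = ["alpha", "dog", "jumps", "quick", "the"]   # the 5-word vocabulary from the text
sentence = "The quick brown fox jumps over the lazy dog"

# 1. Tokenization: lowercase and split on whitespace
tokens = sentence.lower().split()

# 2. Numericalization: count how often each token occurs
counts = Counter(tokens)

# 3. Read off one count per vocabulary word; words outside the vocab are ignored
vector = [counts[word] for word in vocab]

print(vector)  # [0, 1, 1, 1, 2]
```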

## Making our own Datasets

We define a class that holds our inputs and targets as instance attributes. We also need to override "\__getitem__()" and "\__len__()" to return the i-th item and the length of our dataset respectively.

Then we define a DataLoader object that takes a dataset and a batch size and serves the data in batches. As seen from the print statements below, getting the 6-th item from the dataset (index 5) returns a tuple (X,y) where X is a feature vector and y is its label (0 or 1).

In [26]:
from torch.utils.data import Dataset,DataLoader
from sklearn.feature_extraction.text import CountVectorizer

class Sequences(Dataset):
    def __init__(self,path):
        """
            Create a dataset by reading a CSV from "path"
            Args:
            path (Path or os.path) : the location of the CSV file to be read, including the ".csv" suffix.
        """
        
        try:
            df = pd.read_csv(path)
            self.vector = CountVectorizer(stop_words='english',max_df=0.99,min_df=0.005)
            self.seq = self.vector.fit_transform(df.review.tolist())
            self.labels = df.label.tolist()
            self.token2idx = self.vector.vocabulary_
            self.idx2token = {idx:token for token,idx in self.token2idx.items()}
            
        except FileNotFoundError:
            print(f"FileNotFoundError: {path} doesn't specify a CSV file.\n Please check the pathname again\n")
            print("Cannot create dataset!")
            # Re-raise so that a half-initialised dataset is never handed back...
            raise
    
    
    def __getitem__(self,i):
        """
            Gets the i-th input and its label in the dataset
            i (int): the index number
        """        
        return self.seq[i,:].toarray(),self.labels[i]
    
    def __len__(self):
        """
            Allows the use of len() function on our object
        """
        return self.seq.shape[0]        
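The "min_df" and "max_df" arguments passed to CountVectorizer above are document-frequency cut-offs: given as floats, they are fractions of the number of documents. A minimal sketch on a made-up four-review corpus (the `docs` list is hypothetical and much smaller than the real data) shows how "min_df" shrinks the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-review corpus, only for illustration
docs = ["good movie", "bad movie", "good plot", "boring movie"]

# min_df=0.5 drops words that appear in fewer than half of the documents
vec = CountVectorizer(min_df=0.5)
X = vec.fit_transform(docs)          # sparse matrix of shape (n_docs, vocab_size)

vocab = sorted(vec.vocabulary_)      # words that survived the filter
print(vocab, X.shape)
print(X.toarray())
```

Here "bad", "plot" and "boring" each occur in only one of the four documents, so they are dropped and each review becomes a 2-element count vector.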

In [42]:
ds = Sequences(os.path.join(DATA_DIR,"imdb_reviews.csv"))
train_dl = DataLoader(ds,batch_size=4096)
print(ds[5][0][:20])
print(ds[5][1])

[[0 0 0 ... 0 0 0]]
0
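Because each item is a 2-D array of shape (1, vocab size), the DataLoader stacks a batch into a 3-D tensor. The sketch below uses a made-up `ToyRows` dataset (hypothetical, with all-zero rows) to show the shapes; it suggests why the model later calls "squeeze(1)" on its inputs.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyRows(Dataset):
    """Hypothetical dataset that mimics the item shapes of Sequences."""
    def __init__(self, n_rows=10, vocab_size=6):
        self.rows = np.zeros((n_rows, vocab_size), dtype=np.int64)
        self.labels = [i % 2 for i in range(n_rows)]

    def __getitem__(self, i):
        # Keep the leading axis so each item has shape (1, vocab_size)
        return self.rows[i:i + 1, :], self.labels[i]

    def __len__(self):
        return len(self.labels)

inputs, target = next(iter(DataLoader(ToyRows(), batch_size=4)))

print(inputs.shape)             # (batch, 1, vocab_size)
print(inputs.squeeze(1).shape)  # (batch, vocab_size)
print(target.shape)             # (batch,)
```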


## Understanding the matrices

The dataset has stored the CSV data in "self.seq", which is our input matrix, and "self.labels", which is our list of targets. The input matrix has a size of (62155 , 3028), meaning that we have 62155 reviews and our vocabulary size is 3028.

In [43]:
print(ds.seq.shape)
print(len(ds.labels))

(62155, 3028)
62155
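A quick back-of-the-envelope calculation shows why the counts are kept as a sparse matrix and only one row at a time is converted with "toarray()". The 8 bytes per entry is an assumption (64-bit integer counts); the actual figure depends on the dtype used.

```python
n_reviews, vocab_size = 62155, 3028    # shape of the input matrix
bytes_per_count = 8                    # assumption: 64-bit integer counts

# Number of cells and memory footprint if every cell were stored explicitly
dense_entries = n_reviews * vocab_size
dense_gib = dense_entries * bytes_per_count / 2**30

print(f"{dense_entries:,} entries, about {dense_gib:.2f} GiB if stored densely")
```

Most of those entries are zeros, since a single review uses only a small part of the vocabulary.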


## Model building

With our dataloader set up, we now define our architecture. The first model is a simple one to see if everything works or not. The model looks like:

Layer 1 : h1 = ReLU(W1.X + b1)

Layer 2 : h2 = ReLU(W2.h1 + b2)

Layer 3 : x3 = W3.h2 + b3

Output : o = sigmoid(x3)

This gives us the probability of our review being positive. To improve our model (or specifically, W1,b1,W2,b2,W3,b3), we define a loss function and minimise it using an optimizer. We will be using binary cross-entropy loss, which is the NLL (Negative Log Likelihood) of a two-class output, and the Adam optimizer. In the code, the sigmoid is applied inside the loss ("BCEWithLogitsLoss") during training and explicitly at prediction time, so "forward()" returns the raw x3.

We start with importing all the necessary libraries...

In [46]:
# The "f" in functional is lower-case although imported as upper case "F"
# Keep this in mind to save yourself some "moduleNotFound" headache.

import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim

In [70]:
class BoWClassifier(nn.Module):
    def __init__(self,vocabSize,hiddenSizes):
        """
            Define the model architecture by overriding nn.Module method
            Remember to call the super function first 
            else everything goes haywire
        """
    
        super(BoWClassifier,self).__init__()
        self.layer1 = nn.Linear(vocabSize,hiddenSizes[0])
        self.layer2 = nn.Linear(hiddenSizes[0],hiddenSizes[1])
        self.layer3 = nn.Linear(hiddenSizes[1],1)
        
    
    def forward(self,inputs):
        """
            Defines how forward prop occurs
            Overridden from nn.Module
        """
        
        # Pass inputs to successive layers...
        x = F.relu(self.layer1(inputs.squeeze(1).float()))
        x = F.relu(self.layer2(x))
        
        # And return final output...
        return self.layer3(x)    

In [72]:
# myModel is back from previous repositories!...
myModel = BoWClassifier(vocabSize=len(ds.token2idx),hiddenSizes=[128,64])
myModel

BoWClassifier(
  (layer1): Linear(in_features=3028, out_features=128, bias=True)
  (layer2): Linear(in_features=128, out_features=64, bias=True)
  (layer3): Linear(in_features=64, out_features=1, bias=True)
)

In [79]:
# Binary Cross Entropy Loss...
lossFunc = nn.BCEWithLogitsLoss()

# An optimizer updates the parameters given to it, using the gradients computed by loss.backward()...
optimizer = optim.Adam([p for p in myModel.parameters() if p.requires_grad])
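To make the link between this loss and the sigmoid output concrete, the sketch below compares "nn.BCEWithLogitsLoss" with the same quantity written out by hand. The logits and targets are made-up values, not outputs of the model above.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])    # made-up raw outputs of the last linear layer
targets = torch.tensor([1.0, 0.0, 1.0])    # made-up labels

# What nn.BCEWithLogitsLoss computes in one numerically stable step
fused = nn.BCEWithLogitsLoss()(logits, targets)

# The same quantity written out: sigmoid, then mean negative log likelihood
probs = torch.sigmoid(logits)
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

print(fused.item(), manual.item())
```

The two numbers agree up to floating-point error, which is why the model can return raw logits and leave the sigmoid to the loss during training.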

## Model Training

This part deals with training our model. As our architecture is set up with a loss function and an optimizer, we move forward and let PyTorch handle the rest...

Training has the following steps:

For every epoch,
    1. Get batches of data through the dataloader (each in the form (X,Y))
    2. For every batch:
        a. Zero out the gradients
        b. Find the outputs for the given inputs
        c. Find the loss b/w outputs and targets
        d. Calculate gradients (also called backward or backprop)
        e. Clip the gradient norm
        f. Update parameters
        g. Store the batch loss in a list
    3. Store the mean loss of the epoch in another list
    4. Print out the loss at the end of each epoch

To do some of the above printing, we use "tqdm". It shows a progress bar and can display a few other things (It looks good!) 

In [111]:
#tqdm is used to show a progress bar (eye-candy)...
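from tqdm import tqdm
# The top-level tqdm_notebook is deprecated; the notebook progress bar lives in tqdm.notebook...
from tqdm.notebook import tqdm as tqdm_notebook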



myModel.train()
trainLosses = []

for epoch in range(5):
    progressBar = tqdm_notebook(train_dl,leave=False)
    losses = []
    n = 0
    
    for inputs,target in progressBar:
        # Zero out all gradients...
        myModel.zero_grad()
        
        # Calculate the current outputs (logits)...
        output = myModel(inputs)
        
        #Calculate losses...
        loss = lossFunc(output.squeeze(),target.float())
        
        
        #backward prop...
        loss.backward()
        nn.utils.clip_grad_norm_(myModel.parameters(), 3)
        
        #update parameters...
        optimizer.step()
        
        #setting up some eye-candy!...
        progressBar.set_description(f"Loss: {loss.item():.3f}")
        
        #store it in a list...
        losses.append(loss.item())
        n+=1
    
    batchLoss = sum(losses)/n
    trainLosses.append(batchLoss)
    
    tqdm.write(f"Epoch : {epoch+1}\tTrain Loss : {batchLoss:.3f}")

Epoch : 1	Train Loss : 0.273
Epoch : 2	Train Loss : 0.266
Epoch : 3	Train Loss : 0.260
Epoch : 4	Train Loss : 0.256
Epoch : 5	Train Loss : 0.252
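The training loop calls "nn.utils.clip_grad_norm_" with a maximum norm of 3 after each backward pass. A minimal sketch of what that call does, on a single made-up parameter `w` (hypothetical, not part of the model):

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])            # made-up gradient with L2 norm 5

# Returns the norm measured before clipping and rescales w.grad in place
norm_before = nn.utils.clip_grad_norm_([w], max_norm=3)
norm_after = w.grad.norm()

print(norm_before.item(), norm_after.item())
```

The gradient keeps its direction but is scaled down so that its norm is roughly the maximum; gradients that are already smaller are left untouched.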


## Predicting stuff

The model is trained, so now we come to the part where we use it to predict whether a given blob of text is positive or negative...

In [112]:
def predict(text):
    myModel.eval()
    with torch.no_grad():
        # Apply all the required transforms onto the text...
        ip = torch.LongTensor(ds.vector.transform([text]).toarray())
        
        #forward prop on the input...
        op = myModel(ip)
        pred = torch.sigmoid(op).item()
        
        if pred > 0.5:
            print(f"{pred:.3f} : Positive")
        else:
            print(f"{pred:.3f} : Negative")

In [113]:
# Lets grab a few reviews...

# First, the reeeallllyyy loooonnngggg IRISHMAN...
# It was seriously too long...

irishman = """Martin Scorsese’s THE IRISHMAN is meant to be the director’s last word on the gangster film, being a genre he’s the undisputed master of.
While MEAN STREETS, GOODFELLAS and CASINO emphasized the excitement and rock n’ roll aspect of the lifestyle, before the inevitable fall from grace, THE IRISHMAN tackles the real, human cost of such a lifestyle.
Your only outs are: wind up dead or rot into old age with no one around to care about you, and that’s provided you somehow avoid prison.
It’s a melancholy fate, and appropriately, so is the film, being perhaps more in the vein of RAGING BULL than GOODFELLAS or CASINO. 
It’s a contemplative epic, but also among the most vital films in recent memory. 
If anyone deserves to have the last word on gangsterdom, it’s Scorsese.
It probably could have only ever been made by Netflix, with them giving him a budget somewhere in the neighborhood of $160 million.
No traditional studio would ever finance a character-driven drama in such a way, much less allow him to put out a version that runs 3.5 hours. 
While lengthy, every frame of THE IRISHMAN is packed to the gills with substance. 
There’s not a moment when it drags, and it’s probably the fastest 3.5 hours you’re likely to ever spend on a film.
The price tag, of course, can be attributed up to the CGI de-aging effects. 
For the film to work, De Niro has to be able to convincingly play a man from his late thirties to middle age. 
While yes, he never really looks anything less than middle-aged, you honestly forget all about the CGI after fifteen minutes. 
Rather, you get sucked into the story regardless of the effects. 
For those wondering why they took so long to make it, I can only point towards the last half hour of the film, which delivers a gut punch meditation on aging, would have been impossible to convey had De Niro himself not been approaching Sheehan’s on-screen age.
While De Niro is in virtually every frame of the film, the supporting cast is exceptional, even by Scorsese’s standards. 
In some ways, it’s old home week, with Joe Pesci re-emerging from retirement to play Bufalino, while Harvey Keitel appears, as do newer Scorsese regulars (via his HBO shows) Stephen Graham, Domenick Lombardozzi, Bobby Cannavale, Ray Romano and Jack Huston (playing Robert Kennedy) put in memorable turns. 
Pesci hasn’t lost a beat despite largely being absent from the screen, with Bufalino a change of pace from his iconically live wire parts in GOODFELLAS and CASINO. 
Bufalino is a quieter sort of person, one who’s not quick to anger and even something of a peacemaker, even if he’s inevitably the deadliest of enemies.
Of everyone though, the best role is no doubt, Jimmy Hoffa, with Al Pacino adding another one to his pantheon of great portrayals, sinking his teeth into the part like he hasn’t in years. 
Much of it relies on his banter with De Niro, and truly they are a great pair. 
Pacino gives the film it’s warmth, especially through his unexpectedly touching friendship with Sheehan’s daughter, Peggy (played as an adult by Anna Paquin), who’s terrified of her father and his cronies, but falls for this charismatic gent in a big way.
Despite the more melancholy tone, THE IRISHMAN, like all of Scorsese’s films, is also often hilarious, from the way it depicts the minutia of mob life (everyone dumps their guns in the same place), to the absurdity of names and more. 
One of the better recurring motifs is how every time a gangster is introduced, they reveal the man’s ultimate fate, which involve a gruesome death or prison. 
In another departure, Scorsese emphasizes Robbie Robertson’s score over period tunes, with a few notable exceptions, including repeated, haunting use of The Five Satin’s “In the Still of the Night.”
Truly, THE IRISHMAN feels like it could be the perfect capper to Scorsese/De Niro/Pacino and Pesci’s careers, although everyone is so perfect here I wouldn’t be surprised if they’re not lured into one last project together. 
If not though, no one could have gone out on a better film. 
THE IRISHMAN, or as it’s called on-screen, I HEARD YOU PAINT HOUSES, is a legitimate masterpiece.
"""

In [114]:
predict(irishman)

1.000 : Positive


That was....well, kinda weird but okay...
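A printed 1.000 need not mean the model is literally certain: the sigmoid saturates quickly, and the output is rounded to three decimals. The sketch below uses made-up logits (not taken from the trained model) to show where the rounding kicks in. One plausible reading, not verified here, is that a long and strongly worded review simply accumulates a large logit.

```python
import math

def sigmoid(x):
    """Logistic function, the same squashing applied in predict()."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits, not taken from the trained model
for logit in (2.0, 5.0, 8.0):
    print(f"logit {logit:>4} -> {sigmoid(logit):.3f}")
```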

In [115]:
# I think it's cheating, but I couldn't think of anything worse than
# GoT S8E3 (given its history) so here goes...
# This review was taken from Metacritic and is uncensored...
# Read at your own risk...

got = """They stole Jon Snow's destiny in this episode. 
Jon was always written to fight against the white walkers. 
To have someone else steal the kill shot from him, just to fit in with 2019 political agenda is disgusting. 
I've been waiting for this for over 20 years. 
This is not what was promised. 
The battle scene was completely stupid as well. 
Walls are meant to be hidden behind or mounted upon. 
Hopefully they haven't neutered GRRM as they did D&D. 
Ever since the show runners took over the story this show has been total crap."""


In [116]:
predict(got)

0.144 : Negative


It said GoT S8E3 was terrible...it's working :')

Time to save this model :D

In [118]:
SAVE_DIR = os.path.join(ROOT_DIR,"models")

# Create the folder if it doesn't exist yet (other OS errors are no longer silenced)...
os.makedirs(SAVE_DIR,exist_ok=True)

In [119]:
torch.save(myModel.state_dict(), os.path.join(SAVE_DIR,"BoWClassifier-128,64-0.25trainLoss.pt"))

In [121]:
SAVE_FILE = os.path.join(SAVE_DIR,"BoWClassifier-128,64-0.25trainLoss.pt")

myModel = BoWClassifier(vocabSize=len(ds.token2idx),hiddenSizes=[128,64])
myModel.load_state_dict(torch.load(SAVE_FILE))
myModel.eval()

BoWClassifier(
  (layer1): Linear(in_features=3028, out_features=128, bias=True)
  (layer2): Linear(in_features=128, out_features=64, bias=True)
  (layer3): Linear(in_features=64, out_features=1, bias=True)
)
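The save/load round trip can be checked on a small scale. The sketch below uses a tiny "nn.Linear" as a stand-in for BoWClassifier and a temporary file (both are assumptions made for illustration) and confirms that the restored model gives the same output.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 1)                       # tiny stand-in for BoWClassifier

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pt")        # hypothetical file name
    torch.save(model.state_dict(), path)

    # A fresh model has different random weights until the state dict is loaded
    restored = nn.Linear(4, 1)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

x = torch.ones(1, 4)
same = torch.equal(model(x), restored(x))
print(same)
```

Note that the state dict holds only the layer weights. The fitted vectorizer ("ds.vector") is not part of it, so predicting in a fresh session would also need the same vocabulary to be rebuilt or stored separately.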