# Text classifier using word vector features


In this notebook we will look at how to train a neural network text classifier in PyTorch. For this we will be using word vectors as the input features for our neural network. 

First, let's do some imports:

In [None]:
#In case it's not installed
!pip install torchvision

In [None]:
import torch

import numpy as np
import pandas as pd
import torch.nn as nn
import torch.optim as optim
import torchtext.vocab as vocab
import torch.nn.functional as F

from torchvision import transforms
from torch.distributions import Categorical
from torchtext.data.utils import get_tokenizer
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, classification_report


#### Define hyperparameters

Now let's define our hyperparameters. It is important to set `num_classes` to the number of classes in your dataset.

`max_tokens` defines how many tokens the model looks at at once. If a text document has more tokens than this, it is clipped to the first `max_tokens` tokens. If it has fewer, the remaining rows of the model's input matrix are padded with zeros.

In [None]:
device = 'cpu'
max_tokens = 200
hidden_dim = 256
num_classes = 2
batch_size = 64
num_epochs = 101
learning_rate = 0.002
load_chk = False    # load in pre-trained checkpoint for training
save_path = "wordvec_classifier_model.pt"
# load_path = "wordvec_classifier_model.pt"    # uncomment this line if load_chk is True
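
To make the clipping and padding behaviour of `max_tokens` concrete, here is a minimal sketch in plain Python. The helper `clip_or_pad` and the `<pad>` placeholder are hypothetical names used only for illustration. In this notebook the padding is done with rows of zeros in a tensor rather than with a placeholder token.

```python
def clip_or_pad(tokens, max_tokens, pad_token="<pad>"):
    """Return exactly max_tokens items: clip long inputs, pad short ones."""
    clipped = tokens[:max_tokens]
    return clipped + [pad_token] * (max_tokens - len(clipped))


# A document shorter than the window is padded up to max_tokens
short_doc = clip_or_pad(["hello", "world"], max_tokens=4)
# A document longer than the window loses everything after max_tokens
long_doc = clip_or_pad(["a", "b", "c", "d", "e", "f"], max_tokens=4)

print(short_doc)  # ['hello', 'world', '<pad>', '<pad>']
print(long_doc)   # ['a', 'b', 'c', 'd']
```

Either way, every document ends up the same length, which is what a model with a fixed input size needs.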

#### (optional) pre-process data

You don't need to run the next cell, as it has already been done for you. 

It takes the raw Myers Briggs dataset `data/myers_briggs_comments.tsv`, pre-processes it with stop word removal and lemmatisation (the same preprocessing steps as in the Week 6 notebooks), and assigns class labels to the personality type codes.

Here we divide the data into 2 classes. If you want to change this, edit the file `data-util/preprocess_myersbriggs.py` to create a different class label mapping for the data, then run this cell.

In [None]:
# !python data-util/preprocess_myersbriggs.py

#### Load word vectors

Now let's load our word vectors. If this is taking too long to download in class, you can change the first line to:

```word_vectors = vocab.Vectors(name = '../data/glove.6B.100d.top30k.txt')```

In [None]:
word_vectors = vocab.GloVe(name="6B",dim=100) 
tokenizer = get_tokenizer("basic_english")
wordvec_embeddings = nn.Embedding.from_pretrained(word_vectors.vectors)
embedding_dim = wordvec_embeddings.weight.shape[1]
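
As a small illustration of what `nn.Embedding.from_pretrained` does, here is a sketch that uses a made-up 4-word vocabulary with 3-dimensional vectors instead of the GloVe download. The tensor values and the `toy_` names are assumptions for the example only.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: 4 words, each with a 3-dimensional vector
toy_vectors = torch.tensor([[0.1, 0.2, 0.3],
                            [0.4, 0.5, 0.6],
                            [0.7, 0.8, 0.9],
                            [1.0, 1.1, 1.2]])

# Wrap the vectors in an embedding layer; rows are looked up by word index
toy_embeddings = nn.Embedding.from_pretrained(toy_vectors)

# The embedding dimension is the number of columns of the weight matrix
toy_embedding_dim = toy_embeddings.weight.shape[1]

# Looking up indices 2 and 0 returns rows 2 and 0 of toy_vectors
looked_up = toy_embeddings(torch.tensor([2, 0]))
print(toy_embedding_dim, tuple(looked_up.shape))
```

By default `from_pretrained` freezes the weights, so the word vectors are not updated during training.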

#### Create dataset class

Now let's create a dataset class for our tab-separated values (TSV) files. This will tell PyTorch how to load and sample our dataset during training.

In [None]:
class TSVDataset(Dataset):
    def __init__(self, tsv_file, transform=None):
        self.data = pd.read_csv(tsv_file, sep='\t')
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx, 0]
        label = self.data.iloc[idx, 1]

        if self.transform:
            text = self.transform(text)

        return text, label
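
To see how a dataset like this behaves, here is a self-contained sketch that reads a tiny in-memory TSV instead of a file on disk. The class `ToyTSVDataset`, the column names and the two rows are made up for illustration; the real files may be laid out differently.

```python
import io

import pandas as pd
from torch.utils.data import Dataset


class ToyTSVDataset(Dataset):
    """Same structure as the dataset class above: text in column 0, label in column 1."""

    def __init__(self, tsv_file, transform=None):
        # read_csv treats the first row as a header by default
        self.data = pd.read_csv(tsv_file, sep="\t")
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx, 0]
        label = self.data.iloc[idx, 1]
        if self.transform:
            text = self.transform(text)
        return text, label


# Hypothetical two-row TSV with a header line
toy_tsv = io.StringIO("text\tlabel\nfirst comment\t0\nsecond comment\t1\n")
toy_set = ToyTSVDataset(toy_tsv, transform=str.upper)

toy_text, toy_label = toy_set[1]
print(len(toy_set), toy_text, toy_label)
```

The `transform` argument is applied to the text only, which is where the word vector conversion will be plugged in later.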

#### Convert text to word vector matrix

This function takes a text string and returns a matrix containing a sequence of word vectors. This is what will be fed into our model. Our model takes inputs of a fixed length, so we need to normalise our data so that every input matrix is the same size, regardless of the length of the original text.

We start with a matrix of zeros and then copy our word vectors into its rows, stopping once we reach the length of the matrix (defined by `max_tokens`).

At the end we have a matrix of shape (`max_tokens`, `embedding_dim`):

In [None]:
def extract_wordvec_tensor(input_str, max_tokens, embedding_dim):
    # Create empty tensor of zeros 
    output_tensor = torch.zeros(max_tokens, embedding_dim)

    # Get tokens
    tokens = tokenizer(str(input_str))

    # Only fill the tensor if the text produced at least one token
    if tokens:
        # Clip tokens to the token window length
        tokens = tokens[:max_tokens]

        # Get word vectors from tokens
        wordvec_seq = word_vectors.get_vecs_by_tokens(tokens)

        # Copy the word vectors into the first rows of output_tensor
        output_tensor[:wordvec_seq.shape[0], :wordvec_seq.shape[1]] = wordvec_seq
        
    return output_tensor
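
The same idea can be tried out without downloading any word vectors. The sketch below uses a hypothetical dictionary of 2-dimensional vectors and a simple whitespace tokenizer (both are assumptions for the example) and maps unknown words to zeros.

```python
import torch

# Hypothetical 2-dimensional "word vectors" for a two-word vocabulary
toy_vectors = {
    "good": torch.tensor([1.0, 0.0]),
    "bad": torch.tensor([0.0, 1.0]),
}


def toy_wordvec_tensor(text, max_tokens, embedding_dim):
    # Start from zeros so that unused rows act as padding
    output = torch.zeros(max_tokens, embedding_dim)
    # Whitespace tokenizer, clipped to the token window
    tokens = text.lower().split()[:max_tokens]
    for row, token in enumerate(tokens):
        # Words outside the toy vocabulary are represented by zeros
        output[row] = toy_vectors.get(token, torch.zeros(embedding_dim))
    return output


short_matrix = toy_wordvec_tensor("good bad", max_tokens=4, embedding_dim=2)
long_matrix = toy_wordvec_tensor("good bad good bad good", max_tokens=4, embedding_dim=2)
print(short_matrix)
print(tuple(long_matrix.shape))
```

Both inputs produce a matrix of the same shape: the short text is padded with zero rows and the long text is clipped.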

Now let's create our train and test datasets:

In [None]:
transform = transforms.Compose([lambda x: extract_wordvec_tensor(x, max_tokens, embedding_dim)])

train_set = TSVDataset('../data/mb_processed_train.tsv', transform=transform)
test_set = TSVDataset('../data/mb_processed_test.tsv', transform=transform)

#### Define the classifier network

Let's define our text classification model. Here we are just using two fully connected layers (defined by `nn.Linear`):

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, token_length, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc_1 = nn.Linear(token_length * embedding_dim, hidden_dim)
        self.fc_2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = self.fc_1(x)
        x = self.fc_2(x)
        return x
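
To check the shapes flowing through a network like this, here is a minimal sketch with random input. The sizes match the hyperparameters defined earlier (200 tokens, 100-dimensional vectors, 256 hidden units, 2 classes); the batch size of 8 is an arbitrary choice for the example.

```python
import torch
import torch.nn as nn

batch, tokens, dim, hidden, classes = 8, 200, 100, 256, 2

# A random batch of word vector matrices: (batch, tokens, dim)
x = torch.randn(batch, tokens, dim)

# Flatten each matrix into one long vector: (batch, tokens * dim)
flat = torch.flatten(x, start_dim=1)

fc_1 = nn.Linear(tokens * dim, hidden)
fc_2 = nn.Linear(hidden, classes)

# One score (logit) per class for each item in the batch: (batch, classes)
logits = fc_2(fc_1(flat))

# Count the trainable parameters in both layers
n_params = sum(p.numel() for layer in (fc_1, fc_2) for p in layer.parameters())
print(tuple(flat.shape), tuple(logits.shape), n_params)
```

Almost all of the parameters sit in the first layer, because the flattened input is so wide. Note also that with no activation between the two layers the stack is still a linear function of its input. Putting a non-linearity such as `F.relu` between them is one common change to try when adding layers.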

#### Setting up network and optimiser

Here we create an instance (`model`) of our classifier network class, as well as our loss function (`criterion`) and our optimiser (`optimizer`).

We also define data loaders for our test and training sets, which give us objects we can iterate on in training.

In [None]:
model = TextClassifier(max_tokens, embedding_dim, hidden_dim, num_classes).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(),lr=learning_rate)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

if load_chk:
    checkpoint = torch.load(load_path)
    model.load_state_dict(checkpoint['state_dict'])
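
One detail worth knowing about `nn.CrossEntropyLoss` is that it applies log-softmax internally, so it expects the raw model outputs (logits) rather than probabilities. The sketch below uses made-up logits and labels to show this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits for 2 samples and 2 classes, with their true labels
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])

criterion = nn.CrossEntropyLoss()

# Intended usage: pass the logits directly
loss_from_logits = criterion(logits, labels)

# The same value computed by hand: log-softmax followed by negative log-likelihood
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Passing probabilities applies softmax twice and gives a different loss
loss_from_probs = criterion(F.softmax(logits, dim=1), labels)

print(loss_from_logits.item(), manual_loss.item(), loss_from_probs.item())
```

The first two values agree, while the third differs because the softmax has been applied twice.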

#### Train the model

Here we train the classifier. We cycle through the dataset in its entirety (an epoch) for however many training epochs we have defined in `num_epochs`.

Every 10 epochs, we evaluate the model on the test set. When the model starts to perform worse on the test set (even if the training loss is still going down), we can assume the model is **overfitting** to the training data, and we should probably stop training there. The code here checks for this automatically and only saves the model weights when the test loss improves on the previous best checkpoint. However, training will continue.

As an educational exercise it is interesting to observe how the training and test loss increasingly diverge as the model starts to overfit. 

In [None]:
best_test_score = float('inf')    # lowest test loss seen so far
is_saved = False
for epoch in range(num_epochs):
    running_loss = 0.0
    for text_batch, label_batch in train_loader:
        text_batch, label_batch = text_batch.to(device), label_batch.to(device)
        optimizer.zero_grad()
        pred = model(text_batch)
        # CrossEntropyLoss applies log-softmax internally, so pass the raw logits
        loss = criterion(pred, label_batch)
        loss.backward()
        optimizer.step()
        # .item() gives a plain float, so the autograd graph is not kept alive
        running_loss += loss.item() / len(train_loader)
    print(f'loss after epoch {epoch:03} is {running_loss:.3f}')

    # Evaluate on the test set every 10 epochs
    if epoch % 10 == 0 and epoch > 0:
        with torch.no_grad():
            test_loss = 0.0
            for text_batch, label_batch in test_loader:
                text_batch, label_batch = text_batch.to(device), label_batch.to(device)
                pred = model(text_batch)
                loss = criterion(pred, label_batch)
                test_loss += loss.item() / len(test_loader)
        print(f'test loss after epoch {epoch:03} is {test_loss:.3f}')
        # Save a checkpoint only when the test loss improves (or nothing is saved yet)
        if (test_loss < best_test_score) or (not is_saved):
            best_test_score = test_loss
            torch.save({'state_dict': model.state_dict()}, save_path)
            is_saved = True
        else:
            print('test loss is getting worse, you may want to stop training now')
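
The checkpointing rule can be looked at on its own, without a model. Here is a minimal sketch in plain Python that takes a list of test losses and reports at which evaluations a new best would have been saved. The function `best_checkpoint_steps` and the loss values are made up for illustration.

```python
def best_checkpoint_steps(test_losses):
    """Return the evaluation steps where the test loss reached a new minimum."""
    best = float("inf")
    saved_at = []
    for step, loss in enumerate(test_losses):
        if loss < best:
            # A new best: this is where a checkpoint would be written
            best = loss
            saved_at.append(step)
    return saved_at, best


# Hypothetical test losses: improving at first, then getting worse (overfitting)
toy_losses = [0.69, 0.61, 0.58, 0.60, 0.66]
saved_at, best_loss = best_checkpoint_steps(toy_losses)
print(saved_at, best_loss)  # [0, 1, 2] 0.58
```

After the third evaluation the loss only gets worse, so no further checkpoints are written and the saved weights stay at the best point.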
        

#### Load best checkpoint

Now let's reload our best checkpoint to evaluate our best model with the sklearn classification report that we used in Week 6. This way we can get a direct comparison between the two models.

In [None]:
#This cell will give you an error if the state_dict dictionary hasn't been saved,
#i.e. if training has not yet reached the first test evaluation (epoch 10) above
checkpoint = torch.load(save_path)
model.load_state_dict(checkpoint['state_dict'])
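
As a quick sanity check of the save and load pattern, here is a sketch that round-trips the weights of a small stand-in layer through a temporary file. The layer sizes and the file name `toy_checkpoint.pt` are arbitrary choices for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Two layers with the same shape but different random initial weights
source = nn.Linear(4, 2)
restored = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_checkpoint.pt")
    # Save the weights inside a dictionary, as in the training cell
    torch.save({"state_dict": source.state_dict()}, path)
    # Load the dictionary back and copy the weights into the second layer
    toy_checkpoint = torch.load(path)
    restored.load_state_dict(toy_checkpoint["state_dict"])

# After loading, both layers hold identical parameters
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(source.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

Loading only works if the receiving model has the same layer names and shapes as the one that was saved.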

#### Load test set dataframes

Here we need to load the test set as a pandas dataframe to get our variables `y_test` and `class_names` needed for the classification report:

In [None]:
test_set_df = pd.read_csv('../data/mb_processed_test.tsv', sep='\t')
y_test = test_set_df['1'].to_numpy()

class_names_df = pd.read_csv('../data/mb_class_labels.tsv', sep='\t')
class_names = list(class_names_df['class_names'])

#### Get classification report

This code will iterate over the test set one last time, this time getting predictions for each data sample and using the [scikit-learn classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to give us a breakdown of the model performance.

In [None]:
with torch.no_grad():
    preds = []
    for text_batch, label_batch in test_loader:
        pred = model(text_batch.to(device))
        # Pick the most probable class for each sample (argmax over the class logits)
        pred_classes = pred.argmax(dim=1)
        preds += pred_classes.cpu().numpy().tolist()

y_pred = np.array(preds)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=class_names)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
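
To see how model outputs become class predictions, here is a minimal sketch with made-up logits and labels. Because softmax keeps the ordering of the scores, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 3 samples and 2 classes, with their true labels
toy_logits = torch.tensor([[2.0, -1.0], [0.2, 0.3], [-0.5, 1.5]])
toy_labels = torch.tensor([0, 0, 1])

toy_probs = F.softmax(toy_logits, dim=1)

# The most probable class per sample, computed in two equivalent ways
pred_from_logits = toy_logits.argmax(dim=1)
pred_from_probs = toy_probs.argmax(dim=1)

# Fraction of samples where the prediction matches the label
toy_accuracy = (pred_from_logits == toy_labels).float().mean().item()
print(pred_from_logits.tolist(), toy_accuracy)
```

Here the second sample is misclassified, so two out of three predictions are correct.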

## Bonus tasks

- **Task A:** Try changing some of [the hyperparameters](#define-hyperparameters), like `max_tokens`, `hidden_dim`, `learning_rate` or `batch_size` and retraining the network. Has that improved the classification accuracy?
- **Task B:** Try adding more layers to the network in [the cell that defines the classifier](#define-the-classifier-network). Does that improve the classification accuracy?
- **Task C:** Try changing the preprocessing of the dataset to classify all 16 personality types in `data-util/preprocess_myersbriggs.py` and running the code in [the data pre-processing cell](#optional-pre-process-data). How does the classification accuracy look now?
- **Task D:** Adapt this code to use a different dataset. You may need to write your own preprocessing script for this. 