## Spam classification using LSTMs: Demo 

### Imports 

In [1]:
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

In [2]:
import torchtext
from torchtext.legacy.data import (
    BucketIterator,
    Field,
    Iterator,
    LabelField,
    TabularDataset,
)

### RNN architectures

- We have seen before that a number of RNN architectures are possible. 

<img src="img/RNN_architectures.png" height="1500" width="1500">     

[source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

### LSTM text classification 

- An example of many-to-one architecture is **text classification**. 
- In this notebook we are going to build an LSTM for spam classification. 

<center>
<img src="img/lstm-text-classification.png" height="800" width="800">
</center>

### Data

We'll use [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). 

In [3]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "label", "v2": "sms"})
sms_df.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Let's create train/test/validation CSVs

In [4]:
from sklearn.model_selection import train_test_split

train_df, valid_test_df = train_test_split(sms_df, test_size=0.2, random_state=123)
valid_df, test_df = train_test_split(valid_test_df, test_size=0.5, random_state=123)

In [5]:
import os

data_path = "./data/sms_split/"
if not os.path.exists(data_path):
    os.mkdir(data_path)

In [6]:
cols = ["sms", "label"]
data_dir = "data/sms/"
train_df.to_csv(data_path + "train.csv", columns=cols, index=False)
valid_df.to_csv(data_path + "valid.csv", columns=cols, index=False)
test_df.to_csv(data_path + "test.csv", columns=cols, index=False)

You should now have `train.csv`, `valid.csv`, and `test.csv` files written under the `data_path` folder. 

### Steps involved

- Text preprocessing 
- Defining the network architecture and the forward method for the network   
- Training the model 

## Preprocessing with `torchtext`

- We'll be using `torchtext` to carry out preprocessing. 

<center>
<img src="img/lstm-preprocess.png" height="600" width="600"> 
</center>    

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [8]:
def tokenize_nltk(text):
    """
    Simple tokenization on white spaces.
    """
    return word_tokenize(text)

In [9]:
tokenize_nltk("This is a test! ")

['This', 'is', 'a', 'test', '!']

### Defining `TEXT` and `LABEL` fields

In [10]:
# treat it as sequential data
TEXT = Field(sequential=True, tokenize=tokenize_nltk, lower=True)

# Don not treat labels as sequential data
LABEL = Field(sequential=False, unk_token=None)

### Loading the data

In [11]:
fields = [("sms", TEXT), ("label", LABEL)]
train, valid, test = TabularDataset.splits(
    path=data_path,  # the root directory where the data lies
    train="train.csv",
    validation="valid.csv",
    test="test.csv",
    format="csv",
    skip_header=True,
    fields=fields,
)

### Build vocabulary 

In [12]:
# It would be better to use glove.twitter.27B.100d here. I'm using the following to save time.
TEXT.build_vocab(train, min_freq=3, vectors="glove.6B.100d")
LABEL.build_vocab(train)

print("Size of vocab: ", len(TEXT.vocab))
print("Number of classes: ", len(LABEL.vocab))

Size of vocab:  2557
Number of classes:  2


### Create data iterators

In [13]:
train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(32, 32, 32),
    sort_key=lambda x: len(x.sms),
    sort=True,
    sort_within_batch=True,
)

### Examining batches

In [14]:
count = 0
for batch in train_iter:
    if count == 4:
        messages_b4 = batch.sms
        labels_b4 = batch.label

    if count == 10:
        messages_b10 = batch.sms
        labels_b10 = batch.label
        break
    count += 1

In [15]:
messages_b4.shape  # sequence_len, batch_size

torch.Size([5, 32])

In [16]:
messages_b10.shape  # sequence_len, batch_size

torch.Size([6, 32])

In [17]:
def print_preprocessed_examples(messages, labels, n=4):
    print("preprocessed corpus:")
    df_data = defaultdict(list)
    for j in range(n):  # sample loop
        df_data["tokens"].append(
            [TEXT.vocab.itos[messages[i, j]] for i in range(messages.shape[0])]
        )
        df_data["example"].append(j)
        df_data["label"] = labels[j].item()
    return pd.DataFrame(df_data)

In [18]:
print_preprocessed_examples(messages_b4, labels_b4)

preprocessed corpus:


Unnamed: 0,tokens,example,label
0,"[what, happen, dear, tell, me]",0,0
1,"[<unk>, between, 10am-7pm, cost, 10p]",1,0
2,"[he, is, a, <unk>, <unk>]",2,0
3,"[wat, u, doing, there, ?]",3,0


In [19]:
print_preprocessed_examples(messages_b10, labels_b10)

preprocessed corpus:


Unnamed: 0,tokens,example,label
0,"[what, about, this, one, then, .]",0,0
1,"[prepare, to, be, <unk>, :, )]",1,0
2,"[sorry, ,, i, 'll, call, later]",2,0
3,"[its, a, part, of, checking, iq]",3,0


### Embedding layer

- The embedding layer in `Pytorch` is where we pass our vocabulary to get back a word vector for each word in the vocabulary. 
- This is our embedding lookup table.

<center>
<img src="img/embedding_layer.png" height="700" width="700"> 
</center>    

[Source](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)

### Creating an embedding lookup table

In [20]:
WORD_VEC_SIZE = 100
VOCAB_SIZE = len(TEXT.vocab)
VOCAB_SIZE

2557

In [21]:
embedding = nn.Embedding(VOCAB_SIZE, WORD_VEC_SIZE)
print("Embedding lookup table shape = ", embedding.weight.shape)

Embedding lookup table shape =  torch.Size([2557, 100])


In [22]:
embedding.weight

Parameter containing:
tensor([[ 9.8933e-01,  1.0475e+00, -4.8583e-01,  ..., -1.4443e+00,
         -1.5130e+00,  1.0711e+00],
        [ 5.6870e-01, -8.6652e-01, -1.4143e+00,  ..., -1.2739e+00,
         -1.3763e+00, -9.3116e-01],
        [ 1.0930e+00,  6.1339e-01,  5.4294e-01,  ...,  1.0198e-01,
         -4.8129e-01, -1.5260e-04],
        ...,
        [-1.3053e+00,  1.4037e-01, -4.6418e-01,  ...,  2.7488e-01,
         -3.0211e-02,  1.3917e-01],
        [-1.6779e-01,  1.5651e+00,  3.7909e-01,  ..., -8.1619e-02,
         -1.1620e+00,  3.1804e-01],
        [ 5.9278e-02,  2.3898e-03, -6.1134e-01,  ...,  1.2081e+00,
         -8.1040e-01,  4.1732e-01]], requires_grad=True)

- `PyTorch` initializes word vectors with a normal distribution. 
- The word embedding weights are by default learnable parameters. 

### Initializing word vectors

- But they could be initialized with external pre-trained word embeddings such as `GloVe` or `fasttext`. 
- The weights could be frozen (`freeze=True`) or we could choose to keep learning them with the training data. 

When we define our model, we will initialize embedding weights with pre-trained embeddings. 

### Input to LSTM

In [23]:
messages_b4_embeddings = embedding(messages_b4)
messages_b4_embeddings.shape  # sequence_length, batch_size, embedding_size

torch.Size([5, 32, 100])

In [24]:
messages_b10_embeddings = embedding(messages_b10)
messages_b10_embeddings.shape  # sequence_length, batch_size, embedding_size
messages_b10_embeddings[0, 0, :]

tensor([ 0.7803, -0.7371, -0.4452, -0.1246, -0.6809, -0.4757,  1.0876,  1.1386,
        -1.7562, -0.8781, -0.2473, -0.8505,  1.5113,  0.6343, -0.9098, -1.1002,
         0.3118,  1.8158,  0.2263, -0.7970, -0.5171, -0.4518, -0.2540,  0.8129,
        -1.7383, -1.5922,  0.0507, -0.7480, -0.0856, -1.8469,  0.0423, -0.1457,
         0.2489,  0.0788,  0.8069,  1.4356, -0.3759, -1.3385, -0.7030, -0.2304,
         0.2911, -0.0935, -0.6701, -1.4933, -0.3423,  0.2331,  2.9337, -1.5512,
         0.4270,  0.0139,  0.6805,  1.1590, -0.6142,  1.7147, -1.0021, -0.7155,
         0.2496, -0.4726, -0.4040, -1.7886, -0.4634,  0.7034, -0.4215, -0.6943,
         2.1463,  1.3065, -0.2575, -1.5041,  0.8911,  0.4946, -0.3457, -0.3844,
         0.4186, -1.4241,  0.9184,  0.4148,  2.2295, -1.3660, -0.9637, -0.0531,
        -2.0992, -0.5671, -1.0233,  0.6194,  0.0986, -0.5512, -0.2059,  1.7479,
         0.9942,  0.9426, -0.2746,  0.5260,  0.5577, -0.0053,  0.2977,  0.3086,
         0.1106, -1.2561,  1.0544, -0.74

<br><br>

## Defining LSTM architecture

Our network will have the following layers. 
- Embedding layer ([`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html))
- One or more unidirectional LSTM layers ([`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html))
- An activation function layer for non linearity ([`nn.Tanh`](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html) or [`nn.ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html))
- A linear layer ([`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html))
- A `LogSoftmax` layer on top of the output of linear layer ([`nn.LogSoftmax`](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html)).

Return the output of `LogSoftmax` layer.

In [25]:
class LSTMModel(nn.Module):
    def __init__(
        self, embedding_size, vocab_size, output_size, hidden_size, num_layers, dropout
    ):
        super(LSTMModel, self).__init__()

        self.embedding = nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=embedding_size
        )
        
        self.lstm_rnn = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
        )
        self.activation_fn = nn.ReLU()
        self.linear_layer = nn.Linear(hidden_size, output_size)
        self.softmax_layer = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        out = self.embedding(x)
        out, (h_state, c_state) = self.lstm_rnn(
            out
        )  # c_0 and h_0 initialized to zeros by default

        # classify based on the hidden representation at the last token
        out = out[-1]
        out = self.activation_fn(out)
        out = self.linear_layer(out)
        out = self.softmax_layer(out)
        return out

### An embedding layer (`nn.Embedding`)

- Creates a lookup table 
- input dimension equal to the size of your `TEXT` vocabulary 
- the output as a vector of size `embedding_size` 
- We could initialize embeddings with random weights and learn as part of the training process. This way we get task-specific embeddings. 
- For example, the parameters of this embedding layer could be randomly initialized with numbers sampled from a normal distribution. 
- We could also initialize these weights with pre-trained embeddings (e.g., `Glove` and `fasttext`). 

### LSTM layers (`nn.LSTM`)

- We pass input size, hidden size, number of layers, dropout etc. 
- We can have one or more `LSTM` layers. 
- In `PyTorch` we'll define more than one layers by setting `num_layers` parameter. 

### Activation functions (`nn.Tanh` or `nn.ReLU`)

- We pass the last hidden layer into an activation function before feeding it into a linear layer.  

### Linear layer (`nn.Linear`)

- The dimensionality of this layer is equal to the number of classes. 

### Softmax layer (`nn.LogSoftmax`)

- The `LogSoftmax` layer gives the log probabilities of each class. 
- We can pick the class with maximum log probability. 

<br><br><br><br>

## Training the model 

Rest of the pipeline looks similar to feedforward neural networks code (except that we are using `torchtext` instead of `DataLoader`). 

In [26]:
manual_seed = 123
torch.manual_seed(manual_seed)  # set the seed (for reproducibility)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

Given a data iterator, the `train` and `evaluate` functions below train and evaluate the model for all batches. 

In [27]:
def train(dataloader):
    total_loss = 0.0
    # iterate throught the data loader
    num_samples = 0

    for batch in dataloader:
        # load the current batch
        batch_input = batch.sms
        batch_output = batch.label

        batch_input = batch_input.to(device)
        batch_output = batch_output.to(device)

        # forward pass
        model_outputs = model(batch_input)

        # compute the loss
        cur_loss = criterion(model_outputs, batch_output)
        total_loss += cur_loss.item()

        # backward propagation (compute the gradients and update the model)
        # clear the buffer
        optimizer.zero_grad()

        # compute the gradients
        cur_loss.backward()

        # update the weights
        optimizer.step()

        num_samples += batch_output.shape[0]

    return total_loss / num_samples

In [28]:
from sklearn.metrics import accuracy_score

def evaluate(dataloader):
    preds = []
    labels = []
    with torch.no_grad():  # for efficiency
        for batch in dataloader:
            # load the current batch
            try:
                batch_input = batch.sms
                batch_output = batch.label

                batch_input = batch_input.to(device)
                batch_output = batch_output.to(device)

                # forward propagation
                model_outputs = model(batch_input)

                # identify the predicted class for each example in the batch
                probabilities, predicted = torch.max(model_outputs.cpu().data, 1)

                preds.extend(predicted)
                labels.extend(batch_output)
            except:
                print("Error calculating predictions")
                print(batch)

    accuracy = accuracy_score(preds, labels)
    return accuracy

### Instantiate the model

In [29]:
HIDDEN_SIZE = 128  # number of units in the hidden layer
NUM_LAYERS = 2  # number of hidden layers
MAX_EPOCHS = 4  # number of passes over the training data
LEARNING_RATE = 0.3  # learning rate for the weight update rule
NUM_CLASSES = 2  # number of classes for the problem
EMBEDDING_SIZE = 100  # size of the word embedding
DROPOUT = 0.2

In [30]:
model = None

model = LSTMModel(
    EMBEDDING_SIZE, VOCAB_SIZE, NUM_CLASSES, HIDDEN_SIZE, NUM_LAYERS, DROPOUT
)
print(model)

LSTMModel(
  (embedding): Embedding(2557, 100)
  (lstm_rnn): LSTM(100, 128, num_layers=2, dropout=0.2)
  (activation_fn): ReLU()
  (linear_layer): Linear(in_features=128, out_features=2, bias=True)
  (softmax_layer): LogSoftmax(dim=-1)
)


In [31]:
model.to(device);  # ship the  to the right device

In [32]:
# Create a directory for writing models.
import os

CHECKPOINT_PATH = "./checkpoint"
if not os.path.exists(CHECKPOINT_PATH):
    os.mkdir(CHECKPOINT_PATH)

### Defining the loss function 

In [33]:
criterion = nn.NLLLoss()  # define the loss function (last node of the network)

### Initializing embedding weights with pretrained embeddings

In [34]:
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
print(pretrained_embeddings.shape)

torch.Size([2557, 100])


### Carrying out one forward pass

In [35]:
print(messages_b4.shape)

torch.Size([5, 32])


In [36]:
preds = model(messages_b4)
loss = criterion(preds, labels_b4)
print("loss = ", loss.item())

loss =  0.7440012693405151


<br><br>

In [37]:
# create an instance of SGD with required hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

epochs, losses, train_accs, valid_accs = [], [], [], []

# Train the model
for epoch in range(MAX_EPOCHS):
    # train the model for one pass over the data
    train_loss = train(train_iter)
    losses.append(train_loss)

    # compute the training accuracy
    train_acc = evaluate(train_iter)
    train_accs.append(train_acc)

    # compute the validation accuracy
    valid_acc = evaluate(valid_iter)
    valid_accs.append(valid_acc)

    epochs.append(epoch + 1)

    # print the loss for every epoch
    print(
        "Epoch %d, Loss %0.4f, Train accuracy %0.4f; Validation accuracy %0.4f"
        % (epoch + 1, train_loss, train_acc, valid_acc)
    )

    # save model, optimizer, and number of epoch to a dictionary
    model_save = {
        "epoch": epoch,  # number of epoch
        "model_state_dict": model.state_dict(),  # model parameters
        "optimizer_state_dict": optimizer.state_dict(),  # save optimizer
        "loss": train_loss,  # training loss
    }
    torch.save(model_save, CHECKPOINT_PATH + "/model_{}.pt".format(epoch))

Epoch 1, Loss 0.0093, Train accuracy 0.8622; Validation accuracy 0.8869
Epoch 2, Loss 0.0067, Train accuracy 0.8618; Validation accuracy 0.8851
Epoch 3, Loss 0.0049, Train accuracy 0.9349; Validation accuracy 0.9461
Epoch 4, Loss 0.0041, Train accuracy 0.9435; Validation accuracy 0.9551
