1. Make sure you fill in all cells contain YOUR CODE HERE or YOUR ANSWER HERE.
2. After you finished, Restart the kernel & run all cell in order.

# Project II: Text Classification Using LSTM Network
## Deadline: Nov 14, 11:59 pm

You have learned about the basics of neural network training and testing during the class. Now let's move forward to the text classification tasks using simple LSTM networks! In this project, you need to implement two parts:

- **Part I: Building vocabulary for LSTM network**
    - Get familiar with discrete text data processing for neural networks. Building vocabulary by yourself.


- **Part II: Implementing your own LSTM Neural Network**
    - Learn to implement your own LSTM network and aims for a strong performance on the given text classification task.
    - Note that you need to implement the LSTM network manually, any kind of integrated package invoking will get 0 points.
    - Your LSTM network can be 2-4 layers.
    - Expected Accuracy: >=65%.
    ![](./LSTM.png)
    

Let's get started!

In [1]:
import torch
import pandas as pd
import torch.nn as nn # for checking

# nlp library of Pytorch
from torchtext import data

import warnings as wrn
wrn.filterwarnings('ignore')
SEED = 2021

torch.manual_seed(SEED)
torch.backends.cuda.deterministic = True

In [2]:
data_ = pd.read_csv('./sms_spam.csv')
data_.head()
data_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    5574 non-null   object
 1   text    5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [3]:
# Field is a normal column 
# LabelField is the label column.

import spacy
nlp = spacy.load("en_core_web_sm")

def tokenizer(text):
    return [ tok.text for tok in nlp.tokenizer(text) ]

TEXT = data.Field(tokenize=tokenizer,batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)

In [4]:
fields = [("type",LABEL),('text',TEXT)]

In [5]:
training_data = data.TabularDataset(path="./sms_spam.csv",
                                    format="csv",
                                    fields=fields,
                                    skip_header=True
                                   )

print(vars(training_data.examples[0]))

{'type': 'ham', 'text': ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']}


In [6]:
import random
# train and validation splitting
train_data,valid_data = training_data.split(split_ratio=0.75,
                                            random_state=random.seed(SEED))

#### Question 1 (5 points)
Implement the vocabulary building and the text to label part for training.

In [7]:
#implement Question1 here:
#Building vocabularies => (Token to integer)
#you can use the data package built-in function to build the vocabulary, check the 'torchtext data' doc.
TEXT.build_vocab(train_data, min_freq = 3)
LABEL.build_vocab(train_data, min_freq= 3)

In [8]:
print("Size of text vocab:",len(TEXT.vocab))
print("Size of label vocab:",len(LABEL.vocab))
TEXT.vocab.freqs.most_common(10)

Size of text vocab: 2820
Size of label vocab: 2


[('.', 3658),
 ('to', 1615),
 ('I', 1478),
 (',', 1461),
 ('you', 1383),
 ('?', 1086),
 ('!', 1019),
 ('a', 1003),
 ('the', 882),
 ('...', 869)]

In [9]:
device = torch.device("cpu")
# device = torch.device("cuda") # had problems running on gpu


BATCH_SIZE = 64

# We'll create iterators to get batches of data when we want to use them
"""
This BucketIterator batches the similar length of samples and reduces the need of 
padding tokens. This makes our future model more stable
"""

train_iterator,validation_iterator = data.BucketIterator.splits(
    (train_data,valid_data),
    batch_size = BATCH_SIZE,
    # Sort key is how to sort the samples
    sort_key = lambda x:len(x.text),
    sort_within_batch = True,
    device = device
)

#### Question 2 (25 points)
You need to implement the embedding layer and the LSTM cell according to the given architecture, but you are not allowed to use any integrated package!
LSTM tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
![](./LSTM_CELL.png)

In [10]:
import torch.nn as nn
from torch.autograd import Variable

In [11]:
# implementing a single GRU cell as given in the tutorial
class GRULayer(nn.Module):
    
    def __init__(self, input_dim, hidden_dim):
       
        super(GRULayer, self).__init__()
        self.hidden_dim = hidden_dim
        self.input_dim = input_dim
        self.W_z = nn.Linear(self.input_dim + self.hidden_dim , self.hidden_dim)
        self.W_r = nn.Linear(self.input_dim + self.hidden_dim, self.hidden_dim)
        self.W = nn.Linear(self.input_dim   + self.hidden_dim, self.hidden_dim)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        # self.bidirectional = bidrectional

    def forward(self, x_input, input_cell):

        batch, sequence, _  = x_input.shape
        hidden_sequence = []
        
        if input_cell is None:
            h_t_minus1 = torch.zeros(batch, self.hidden_dim).to(x_input.device)
        else:
            h_t_minus1 = input_cell 
        
        for t in range(sequence):
            x_t = x_input[:, t, :]
            z_t = self.sigmoid(self.W_z((torch.cat((h_t_minus1, x_t), dim=-1) ))) # W_z dot [h_t-1, x_t]
            r_t = self.sigmoid(self.W_r((torch.cat((h_t_minus1, x_t), dim=-1) ))) # W_r dot [h_t-1, x_t]
            h_tilda_t = self.tanh(self.W(torch.cat(((r_t * h_t_minus1), x_t), dim=-1)))  # W dot [r_t * h_t-1, x_t]
            h_t = (1-z_t) * h_t_minus1 + z_t * h_tilda_t
            h_t_minus1 = h_t # update h_t to be the new h_t_minus1 
            h_t = h_t.unsqueeze(0)
            hidden_sequence.append(h_t)
        
        hidden_sequence = torch.cat(hidden_sequence, dim = 0)
        hidden_sequence = hidden_sequence.transpose(0,1).contiguous()
        
        return hidden_sequence, h_t

In [12]:
class LSTMNet(nn.Module):
    
    def __init__(self,vocab_size,embedding_dim,hidden_dim,output_dim,n_layers,bidirectional,dropout):
        
        super(LSTMNet,self).__init__()
        # In this class, you need to implement the architecture of an LSTM network, the architecture should include:
        self.vocab_size = vocab_size
        self.input_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_layers = n_layers
        self.output_dim = output_dim
        self.embedding_dim = embedding_dim
        self.bidirectional = bidirectional
        self.dropout = dropout
        
        # 1. Embedding layer converts integer sequences to vector sequences
        self.embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        
        # 2. LSTM layer process the vector sequences 
        self.layered_gru = nn.ModuleList()
        
        for i in range(0, self.num_layers):
            if i == 0:
                self.layered_gru.append( GRULayer(input_dim = self.input_dim, hidden_dim= self.hidden_dim) )
            else:
                self.layered_gru.append( GRULayer(input_dim = self.hidden_dim, hidden_dim= self.hidden_dim) )
        
        # 3. Dense layer to predict 
        self.fc = nn.Linear(hidden_dim, output_dim)
    
        # 4. Prediction activation function (you can choose your own activate function e.g., ReLU, Sigmoid, Tanh)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self,text,text_lengths):
        text_embedding = self.embedding_layer(text)             #embedding input
        
        if self.bidirectional is True:
            
            for (l, gru) in enumerate(self.layered_gru):
                input_cell = None
                if l == 0:
                    output_forward, input_cell_forward = gru.forward(x_input = text_embedding, input_cell = input_cell)
                    output_backward, input_cell_backward = gru.forward(x_input = text_embedding.flip([0,1]), input_cell = input_cell)
                    
                    output = output_forward + output_backward
                else:
                    output_forward, input_cell_forward = gru.forward(x_input = output, input_cell = input_cell)
                    output_backward, input_cell_backward = gru.forward(x_input = output_forward.flip([0,1]), input_cell = input_cell)

        else:
            for (l, gru) in enumerate(self.layered_gru):
                input_cell = None
                if l == 0:
                    output, input_cell = gru.forward(x_input = text_embedding, input_cell = input_cell)
                else:
                    output, input_cell = gru.forward(x_input = output, input_cell = input_cell)
    
        output = self.fc(output[:, -1, :])
        output = self.sigmoid(output)
        return output

#### Training

In [13]:
SIZE_OF_VOCAB = len(TEXT.vocab)
EMBEDDING_DIM = 300
NUM_HIDDEN_NODES = 64
NUM_OUTPUT_NODES = 1
NUM_LAYERS = 2
BIDIRECTION = True
DROPOUT = 0.1

In [14]:
model = LSTMNet(SIZE_OF_VOCAB,
                EMBEDDING_DIM,
                NUM_HIDDEN_NODES,
                NUM_OUTPUT_NODES,
                NUM_LAYERS,
                BIDIRECTION,
                DROPOUT
               )

In [15]:
import torch.optim as optim
model = model.to(device)
optimizer = optim.Adam(model.parameters(),lr=1e-4)
criterion = nn.BCELoss()
criterion = criterion.to(device)

In [16]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc

In [17]:
def train(model,iterator,optimizer,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    model.train()
    
    for batch in iterator:
        
        # cleaning the cache of optimizer
        optimizer.zero_grad()
        
        text,text_lengths = batch.text
        
        # forward propagation and squeezing
        predictions = model(text,text_lengths).squeeze()
        
        # computing loss / backward propagation
        loss = criterion(predictions,batch.type)
        loss.backward()
        
        # accuracy
        acc = binary_accuracy(predictions,batch.type)
        
        # updating params
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # It'll return the means of loss and accuracy
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
        

In [18]:
def evaluate(model,iterator,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    # deactivate the dropouts
    model.eval()
    
    # Sets require_grad flat False
    with torch.no_grad():
        for batch in iterator:
            text,text_lengths = batch.text
            
            predictions = model(text,text_lengths).squeeze()
              
            #compute loss and accuracy
            loss = criterion(predictions, batch.type)
            acc = binary_accuracy(predictions, batch.type)
            
            #keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [19]:
EPOCH_NUMBER = 15

for epoch in range(1,EPOCH_NUMBER+1):
    
    train_loss,train_acc = train(model,train_iterator,optimizer,criterion)
    
    valid_loss,valid_acc = evaluate(model,validation_iterator,criterion)
    
    # Showing statistics
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    print()

	Train Loss: 0.544 | Train Acc: 81.13%
	 Val. Loss: 0.413 |  Val. Acc: 88.12%

	Train Loss: 0.417 | Train Acc: 86.32%
	 Val. Loss: 0.370 |  Val. Acc: 88.26%

	Train Loss: 0.375 | Train Acc: 86.58%
	 Val. Loss: 0.342 |  Val. Acc: 88.33%

	Train Loss: 0.344 | Train Acc: 86.67%
	 Val. Loss: 0.320 |  Val. Acc: 88.62%

	Train Loss: 0.321 | Train Acc: 87.69%
	 Val. Loss: 0.295 |  Val. Acc: 88.76%

	Train Loss: 0.292 | Train Acc: 88.28%
	 Val. Loss: 0.272 |  Val. Acc: 89.18%

	Train Loss: 0.275 | Train Acc: 88.90%
	 Val. Loss: 0.250 |  Val. Acc: 90.39%

	Train Loss: 0.250 | Train Acc: 89.75%
	 Val. Loss: 0.228 |  Val. Acc: 91.60%

	Train Loss: 0.234 | Train Acc: 90.93%
	 Val. Loss: 0.217 |  Val. Acc: 92.24%

	Train Loss: 0.206 | Train Acc: 92.09%
	 Val. Loss: 0.190 |  Val. Acc: 93.02%

	Train Loss: 0.191 | Train Acc: 93.30%
	 Val. Loss: 0.167 |  Val. Acc: 94.51%

	Train Loss: 0.167 | Train Acc: 94.48%
	 Val. Loss: 0.151 |  Val. Acc: 95.65%

	Train Loss: 0.149 | Train Acc: 94.89%
	 Val. Loss: 