## Embedding layer for word embedding

In [66]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import torch
from torch import nn 
import torch.nn.functional as F
from torch import utils

torch.manual_seed(0)
np.random.seed(0)

**Read data**

In [67]:
# first let us load a dataset called spam.csv
# Downloaded from https://www.kaggle.com/team-ai/spam-text-message-classification?select=SPAM+text+message+20170820+-+Data.csv

data = pd.read_csv("spam.csv")
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


This dataset collects text messages and the task is to classify whether a message is a spam or not.

In [68]:
# some basic information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [69]:
# This dataset is imbalanced
data['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

In [70]:
# preprocessing data
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() 
target = le.fit_transform(data['Category']) # convert target into integers
data['Category'] = target
print(le.classes_) # this shows which index maps to which class

data.head()

['ham' 'spam']


Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


**Preprocessing**

Recall in Task 1, we construct a Torch dataset by passing in X and y. Here we cannot do that because features are not numerical. The feature now is text and we cannot convert them to a torch Tensor. Thus, we may need to define our own dataset class, which should inherit from the parent class `utils.data.Dataset`. 

Before that, we should construct training, validation and test sets. Here we don't split them into X and y.

In [71]:
np.random.seed(0)

index = list(range(data.shape[0])) # an list of indices
np.random.shuffle(index) # shuffle the index in-place

p_val = 0.2
p_test = 0.2
N_test = int(data.shape[0] * p_test)
N_val = int(data.shape[0] * p_val)

# get training, val and test sets
test_data = data.iloc[ index[:N_test] ,:]
val_data = data.iloc[ index[N_test: (N_test+N_val)], :]
train_data = data.iloc[ index[(N_test+N_val):], :]

print(test_data.shape)
print(val_data.shape)
print(train_data.shape)

(1114, 2)
(1114, 2)
(3344, 2)


In [72]:
# define our own torch dataset
# for a torch dataset, we need to define two functions: 
#     __len__: return the length of dataset
#     __getitem__: given a index (integer), return the corresponding sample, both y and X

class SpamDataset(utils.data.Dataset):
    def __init__(self, myData):
        """
        myData should be a dataframe object containing both y (first col) and X (second col)
        """
        super().__init__()
        self.data = myData
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        
        return (self.data.iloc[idx,0], self.data.iloc[idx,1]) # (target, text)

In [73]:
# now we can build our torch dataset 
train_torch = SpamDataset(train_data)
val_torch = SpamDataset(val_data)
test_torch = SpamDataset(test_data)

In [74]:
# check
train_torch.__getitem__(2)

(0, 'I not free today i haf 2 pick my parents up tonite...')

**Tokenization**

Before we build a network, we need to tokenize words from these messages. We will use `torchtext` package, which is a package compatible with pytorch for natural language processing. `torchtext` has an inbuilt tokenizer for English words. To install the package, run
```python
pip install torchtext 
```

There are two steps in tokenization. 1) build a vocabulary. 2) transform your text based on vocabulary

In [75]:
#!pip install torchtext

In [76]:
from torchtext.data.utils import get_tokenizer

# build a tokenizer with basic_english
tokenizer = get_tokenizer('basic_english')

In [77]:
tokenizer('I am happy')

['i', 'am', 'happy']

In [78]:
from torchtext.vocab import vocab
from collections import Counter

# ===== step 1: build vocabulary =====
# In pytorch, we need to use the Vocab object to store the vocabulary. The Vocab builds on a Counter object.
# Counter object keep tracks of number of occurences of each word
# thus you may specify the min_freq to filter out infrequent words
counter = Counter() 
for msg in data['Message']:
    counter.update(tokenizer(msg))
vocabulary = vocab(counter, min_freq = 3) # filter out all words that appear less than three times

# set default index = 0 for words that are not in covabulary
vocabulary.set_default_index(0)

In [79]:
len(vocabulary)

2961

In [80]:
# The vocab object maps a word to an idx (an integer)
print(vocabulary['my'])
print(vocabulary['sun'])
print(vocabulary['iertuei']) # something not in vocab will be mapped to default_index = 0

100
1823
0


In [81]:
# define a function that converts a document into tokens (represented by index)
def doc_tokenizer(doc):
    return torch.tensor([vocabulary[token] for token in tokenizer(doc)], dtype=torch.long)

In [82]:
doc_tokenizer('I love music')

tensor([  61,  307, 1086])

Step 2 is to transform documents into tokens based on the vocabulary. Note in real applications where the number of documents is huge, converting all documents before training is very inefficient. As an alternative, we can read in and convert a batch of documents every iteration. This can be easily achieved by defining a `collate_batch` function that will be passed in DataLoader. 

This basic idea is to read in a batch, apply `collate_batch` and return the processed texts.

In [83]:
# ========= Step 2 ==============
# Notice in a corpus, each document can have different size. Thus, we usually pad zeros to the maximum length of document.
# Alternatively, you can concat all documents into a long vector 
# and the starting index of each document is identified in the variable called offsets.

def collate_batch(batch):
    
    target_list, text_list, offsets = [], [], [] 
    
    # loop through all samples in batch
    for idx in range(len(batch)):
        
        _label = batch[idx][0]
        _text = batch[idx][1]
        
        target_list.append( _label )
        tokens = doc_tokenizer( _text )
        text_list.append(tokens)
        
        if idx == 0:
            offsets.append(0)  # the first document starts from idx 0
        else:
            offsets.append(offsets[-1] + tokens.size(0)) # the next document starts from (offsets[-1] + tokens.size(0))
    
    # convert to torch tensor
    target_list = torch.tensor(target_list, dtype=torch.int64)
    offsets = torch.tensor(offsets)
    text_list = torch.cat(text_list) # concat into a long vector
    
    return target_list, text_list, offsets

**Dataloader with customized collating function**

We can now build our dataloader by passing in our defined torch dataset and the collate_batch function. Note we also build data loader for validation and test sets. This is because for real datasets, validation and test data can be very large in size. Thus sometimes it may be difficult to test on every sample in validation set. 

Another reason is that we have `collate_batch` function that process data on the go. If we don't build data loader for validation and test sets, we may need to process them before training or define a different function for that purpose.

In [84]:
torch.manual_seed(0)

batchSize = 8
train_loader = utils.data.DataLoader(train_torch, batch_size=batchSize, shuffle=True, collate_fn=collate_batch)
val_loader = utils.data.DataLoader(val_torch, batch_size=batchSize, shuffle=True, collate_fn=collate_batch)
test_loader = utils.data.DataLoader(test_torch, batch_size=batchSize, shuffle=False, collate_fn=collate_batch)

In [85]:
# check for the first batch
list(train_loader)[0]

(tensor([0, 0, 0, 0, 0, 0, 0, 0]),
 tensor([  60,   75,   50,   51,  972,    3,  235,   89,  275,  278, 1439, 1780,
          146, 1781,  146,  537,  254,   50,   51, 1402,   28,  804,  147,    0,
         1022,  972,  175,   61,  401,    8,  295,  157,  151,  237,   31, 2669,
          220,   19,    5,    5,   18,   50,   51,  246, 1814,   93,  483,  581,
            3,  100,  307,   84,  584,  554, 2475,   93,  296,   89,  848,  106,
           93,  317,  306,   89,  306,   93,  296,   89,  323,  797,   80,    0,
            3,  876,   93,  146,  446,   80,    0,  105,  100,  307,   93,  235,
         1414,   32,  146,    0,  477,  582,  166,  102, 1269,   48,   61,   50,
          158,  159,    0,  331,  100,    0,    5, 2675,   61,  211,   31, 1429,
          147, 1588, 1763,   48, 2036,    0, 1056,  401,  881,  157,  435,    5,
          402,  961,  887,  160,   16,    8,    0,   59,  309,  312,    0,   31,
            0, 1395,    5,  184,  106,  551,  566,   31,  161,    5]),
 te

**Model building**

Now we can build our model. There are two choices for embedding layer. One is `nn.Embedding()` and the other is `nn.EmbeddingBag()`. See https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html. Basically, `nn.EmbeddingBag()` combines `nn.Embedding()` with an aggregation step, either 'sum', 'mean' or 'mode'. 

*Why you need an aggregation step?* 

This is because for every word in one document, we have one vector. Thus, each document is a matrix of size $N_w \times d$, where $N_w$ is the number of words in this document and $d$ is the vector dimension. Usually, we can just take the sum/mean/max of these word vectors in one document and use the resulting vector as representation.


In [86]:
# ====== Step 1 ========= 
class SpamClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean') # embedding layer
        self.Linear1 = nn.Linear(embed_dim, 1)
        self.Dropout = nn.Dropout(p=0.1)
    
    def forward(self, text, offsets):
        # note we need offsets to indicate which document we have
        out = self.embedding(text, offsets)
        out = self.Dropout(out)
        out = self.Linear1(out)
        return out
        # for the last layer, we don't apply activation because we can use BCEWithLogitsLoss to combine sigmoid with BCELoss
        
# model initalization
embed_dim = 8
model = SpamClassifier(len(vocabulary), embed_dim)

In [87]:
# ======= Step 2 ==========
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

For step 3, notice we build data loader for both validation and test sets. It is better to define a function for evaluation.

In [88]:
def evaluate(dataloader):
    
    y_pred = torch.tensor([]) # store prediction
    y_true = torch.tensor([]) # store true label
    
    model.eval()
    with torch.no_grad():
        for label, text, offsets in dataloader:
            y_pred_batch = model(text, offsets)
            
            y_pred = torch.cat((y_pred, y_pred_batch.squeeze()))
            y_true = torch.cat((y_true, label.squeeze()))
            
    return y_pred, y_true

In [89]:
# ======== Step 3 ==============
epochs = 30
for epoch in range(epochs):
    
    for y_train, text, offsets in train_loader:
        # zero the parameter gradients
        optimizer.zero_grad()

        # calulate output and loss 
        y_pred_train = model(text, offsets)
        loss = loss_fn(y_pred_train.squeeze(), y_train.float())

        # backprop and take a step
        loss.backward()
        optimizer.step()
    
    # evaluate on validation set
    y_pred_val, y_val = evaluate(val_loader)
    loss_val = loss_fn(y_pred_val.squeeze(), y_val.float())
    
    # note when making prediction, do add sigmoid activation
    pred_label = (torch.sigmoid(y_pred_val) > 0.5).long() # find out the class prediction
    acc = (pred_label == y_val).float().sum()/y_val.shape[0]
    
    model.train() # because when evaluating we change mode to eval mode
    
    print('Epoch {}: {:.4f} (train), {:.4f} (val), {:.4f} (val acc)'.format(epoch, loss, loss_val, acc))

Epoch 0: 0.1313 (train), 0.3079 (val), 0.8824 (val acc)
Epoch 1: 0.0855 (train), 0.2587 (val), 0.8968 (val acc)
Epoch 2: 0.0300 (train), 0.2490 (val), 0.9031 (val acc)
Epoch 3: 0.2668 (train), 0.2574 (val), 0.8941 (val acc)
Epoch 4: 0.7698 (train), 0.2591 (val), 0.8950 (val acc)
Epoch 5: 0.7047 (train), 0.2283 (val), 0.9111 (val acc)
Epoch 6: 0.3694 (train), 0.2283 (val), 0.9129 (val acc)
Epoch 7: 0.5019 (train), 0.2073 (val), 0.9201 (val acc)
Epoch 8: 0.4560 (train), 0.2304 (val), 0.9111 (val acc)
Epoch 9: 0.2780 (train), 0.2333 (val), 0.9031 (val acc)
Epoch 10: 0.0453 (train), 0.2315 (val), 0.9093 (val acc)
Epoch 11: 0.0934 (train), 0.2432 (val), 0.9048 (val acc)
Epoch 12: 0.3313 (train), 0.2261 (val), 0.9084 (val acc)
Epoch 13: 0.2777 (train), 0.2264 (val), 0.9174 (val acc)
Epoch 14: 0.6968 (train), 0.2428 (val), 0.8977 (val acc)
Epoch 15: 0.1649 (train), 0.2218 (val), 0.9156 (val acc)
Epoch 16: 0.0262 (train), 0.2391 (val), 0.9156 (val acc)
Epoch 17: 0.1703 (train), 0.2429 (val), 0

In [90]:
# prediction on test data
y_pred_test, y_true_test = evaluate(test_loader)
y_pred_test = torch.sigmoid(y_pred_test) > 0.5

print(confusion_matrix(y_true_test, y_pred_test))
print(classification_report(y_true_test, y_pred_test))

[[922  32]
 [ 95  65]]
              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94       954
         1.0       0.67      0.41      0.51       160

    accuracy                           0.89      1114
   macro avg       0.79      0.69      0.72      1114
weighted avg       0.87      0.89      0.87      1114



## Going further: More advanced network structure for word embedding (optional)

In Task 2, we have seen how we can include an embedding layer in our neural network model. But we simply take the mean of all word vectors in one document and use the resulting vector as representation for this document. This can result in some information loss. Rather, we would like to take all word vectors in one document into account. There are many advanced models that perform this task, such as Recurrent Neural Network (RNN), Long Short Term Memory (LSTM) and so many. Natural language processing is a very exciting area of research.

Below we show you an example of using LSTM. But notice, we need to redefine collate batch and evaluate functions. If you want to know more about LSTM. Check this tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [55]:
# Note in Task 2, we store a batch of documents as a long vector and use offsets to indicate the starting index of each document
# Now we need to store the documents as a list of tensors of varying size 
# And we will pad them to the same lengths using 0

from torch.nn.utils.rnn import pad_sequence

def collate_batch_advanced(batch):
    
    target_list, text_list = [], []
    
    # loop through all samples in batch
    for idx in range(len(batch)):
        
        _label = batch[idx][0]
        _text = batch[idx][1]
        
        target_list.append( _label )
        tokens = doc_tokenizer( _text )
        text_list.append(tokens)
            
    # convert to torch tensor
    target_list = torch.tensor(target_list, dtype=torch.int64)
    
    return target_list, text_list


# define the evaluate function, notice we need to pad each document with 0
def evaluate_adv(dataloader):
    
    y_pred = torch.tensor([]) # store prediction
    y_true = torch.tensor([]) # store true label
    
    model.eval()
    with torch.no_grad():
        for label, text in dataloader:
            
            # we will need to pad the sequences with zero, batch_first means to organize batch size to be the first dim
            # and we use 0 to pad them to the same lengths
            text = pad_sequence(text, batch_first=True, padding_value=0)
            
            y_pred_batch = model(text)
            
            y_pred = torch.cat((y_pred, y_pred_batch.squeeze()))
            y_true = torch.cat((y_true, label.squeeze()))
            
    return y_pred, y_true

In [56]:
torch.manual_seed(0)

batchSize = 64
train_loader = utils.data.DataLoader(train_torch, batch_size=batchSize, shuffle=True, collate_fn=collate_batch_advanced)
val_loader = utils.data.DataLoader(val_torch, batch_size=batchSize, shuffle=True, collate_fn=collate_batch_advanced)
test_loader = utils.data.DataLoader(test_torch, batch_size=batchSize, shuffle=False, collate_fn=collate_batch_advanced)

In [57]:
# ====== Step 1 ========= 
# The following code is modified from https://towardsdatascience.com/text-classification-with-pytorch-7111dae111a6
class SpamAdvClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_LstmLayer):
        super().__init__()
        
        self.hidden_dim = hidden_dim
        self.num_LstmLayer = num_LstmLayer
        
        # switch to embedding layer: padding_idx = 0 means we treat index=0 as padding and don't train its embedding
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0) 
        
        # LSTM layer: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, num_layers = num_LstmLayer, batch_first=True) 
        
        self.Linear1 = nn.Linear(hidden_dim, 1)
        self.Dropout = nn.Dropout(p=0.1)
    
    def forward(self, x):
        # Hidden and cell state definion
        h = torch.zeros((self.num_LstmLayer, x.size(0), self.hidden_dim))
        c = torch.zeros((self.num_LstmLayer, x.size(0), self.hidden_dim))
        
        # Initialization fo hidden and cell states
        torch.nn.init.xavier_normal_(h)
        torch.nn.init.xavier_normal_(c)
        
        # embedding layer 
        out = self.embedding(x)
        # lstm layer
        out, (hidden, cell) = self.lstm(out, (h,c))
        out = self.Dropout(out)
        # The last hidden state is taken
        out = self.Linear1(out[:,-1,:])
        
        return out
        
# model initalization
embed_dim = 16
hidden_dim = 16
num_LstmLayer = 2
model = SpamAdvClassifier(len(vocabulary), embed_dim, hidden_dim, num_LstmLayer)

In [58]:
# ======= Step 2 ==========
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

The following code may take some time to run.

In [60]:
# ======== Step 3 ==============
epochs = 30
for epoch in range(epochs):
    
    for y_train, text in train_loader:
        # zero the parameter gradients
        optimizer.zero_grad()
        
        # we will need to pad the sequences with zero
        text = pad_sequence(text, batch_first=True, padding_value=0)

        # calulate output and loss 
        y_pred_train = model(text)
        loss = loss_fn(y_pred_train.squeeze(), y_train.float())

        # backprop and take a step
        loss.backward()
        optimizer.step()
    
    # evaluate on validation set
    y_pred_val, y_val = evaluate_adv(val_loader)
    loss_val = loss_fn(y_pred_val.squeeze(), y_val.float())
    
    # note when making prediction, do add sigmoid activation
    pred_label = (torch.sigmoid(y_pred_val) > 0.5).long() # find out the class prediction
    acc = (pred_label == y_val).float().sum()/y_val.shape[0]
    
    model.train() # because when evaluating we change mode to eval mode
    
    print('Epoch {}: {:.4f} (train), {:.4f} (val), {:.4f} (val acc)'.format(epoch, loss, loss_val, acc))

Epoch 0: 0.5960 (train), 0.3699 (val), 0.8797 (val acc)
Epoch 1: 0.5203 (train), 0.3674 (val), 0.8797 (val acc)
Epoch 2: 0.2446 (train), 0.3671 (val), 0.8797 (val acc)
Epoch 3: 0.4836 (train), 0.3571 (val), 0.8797 (val acc)
Epoch 4: 0.1948 (train), 0.2268 (val), 0.8797 (val acc)
Epoch 5: 0.1212 (train), 0.1414 (val), 0.9497 (val acc)
Epoch 6: 0.0059 (train), 0.1505 (val), 0.9443 (val acc)
Epoch 7: 0.1620 (train), 0.1370 (val), 0.9596 (val acc)
Epoch 8: 0.1431 (train), 0.1300 (val), 0.9506 (val acc)
Epoch 9: 0.0246 (train), 0.1189 (val), 0.9560 (val acc)
Epoch 10: 0.0545 (train), 0.1038 (val), 0.9704 (val acc)
Epoch 11: 0.0140 (train), 0.1062 (val), 0.9731 (val acc)
Epoch 12: 0.2050 (train), 0.1083 (val), 0.9686 (val acc)
Epoch 13: 0.1420 (train), 0.1338 (val), 0.9623 (val acc)
Epoch 14: 0.0123 (train), 0.0813 (val), 0.9767 (val acc)
Epoch 15: 0.0476 (train), 0.0752 (val), 0.9785 (val acc)
Epoch 16: 0.0150 (train), 0.0785 (val), 0.9776 (val acc)
Epoch 17: 0.0051 (train), 0.0780 (val), 0

In [61]:
# prediction on test data
y_pred_test, y_true_test = evaluate_adv(test_loader)
y_pred_test = torch.sigmoid(y_pred_test) > 0.5

print(confusion_matrix(y_true_test, y_pred_test))
print(classification_report(y_true_test, y_pred_test))

[[940  14]
 [ 10 150]]
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99       954
         1.0       0.91      0.94      0.93       160

    accuracy                           0.98      1114
   macro avg       0.95      0.96      0.96      1114
weighted avg       0.98      0.98      0.98      1114

