# Programming for Data Science and Artificial Intelligence

## Deep Learning - NLP + TorchText + Embedding

Here we improve the previous model by adding the following.

Improve the learning
- pre-trained word embeddings (**) (improves accuracy by around 20%)
- the Adam optimizer in place of SGD (learns faster)
- orthogonal initialization (no significant improvement here, but a common choice for RNN/LSTM, and even CNN, weights)

Improve efficiency
- packed padded sequences in the RNN, which save computation and make the RNN ignore padding (++) (this is the crucial one; without it, my accuracy stays at 50%) (https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch)
- `padding_idx` in the embedding layer to save computation (no effect on accuracy, but good practice)

In [None]:
import torchtext
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#make results reproducible when the kernel is restarted
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Loading the dataset

In [None]:
#proxy settings for puffer; comment these lines out if you are not using puffer
import os
os.environ['http_proxy'] = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))

### Tokenizing

In [None]:
#pip install spacy
#python -m spacy download en_core_web_sm
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
tokens = tokenizer("We are learning torchtext in U.K.!")  #some test
tokens

### Text to integers

In [None]:
from torchtext.vocab import build_vocab_from_iterator
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>', '<pad>', '<bos>', '<eos>'])
vocab.set_default_index(vocab["<unk>"])

In [None]:
#see some example
vocab(['here', 'is', 'a', 'unknownword', 'a'])

In [None]:
len(vocab)

### ** FastText Embeddings **

We will first download the pre-trained vectors; here I am using FastText.  Then we will keep only the FastText embeddings for the tokens that exist in our vocab.
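
The idea of keeping only the vectors for tokens in our vocab can be sketched as follows. This is a minimal sketch, not the notebook's actual cell: `pretrained` is a hypothetical toy lookup table standing in for the downloaded FastText vectors, `itos` is a hypothetical index-to-token list standing in for the vocab, and giving a zero vector to tokens without a pre-trained vector is an assumption.

```python
import torch

# Hypothetical stand-in for downloaded pre-trained vectors (token -> vector).
emb_dim = 4
pretrained = {
    "good": torch.tensor([0.1, 0.2, 0.3, 0.4]),
    "movie": torch.tensor([0.5, 0.6, 0.7, 0.8]),
}

# Hypothetical index-to-token list, in the same order as the vocab indices.
itos = ["<unk>", "<pad>", "good", "movie", "chaky"]

def build_embedding_matrix(itos, pretrained, emb_dim):
    """Return a (len(itos), emb_dim) matrix whose row i is the vector of itos[i].

    Tokens missing from `pretrained` keep an all-zero row (an assumption).
    """
    matrix = torch.zeros(len(itos), emb_dim)
    for idx, token in enumerate(itos):
        if token in pretrained:
            matrix[idx] = pretrained[token]
    return matrix

embedding_matrix = build_embedding_matrix(itos, pretrained, emb_dim)
print(embedding_matrix.shape)  # one row per vocab entry
```

A matrix of this shape can then be loaded into an embedding layer, for example with `nn.Embedding.from_pretrained`.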

**Small Intro to Embeddings**

1. *Word2Vec* - the first efficient word embedding, trained with either the Continuous Bag-of-Words (CBOW) or the Skip-gram (SG) architecture.  Its limitations are that it (1) works only with local window information, not the whole document, (2) uses no subword information (prefix, suffix, etc.), (3) cannot handle out-of-vocabulary (OOV) words, and (4) does not handle context.

The first three problems were addressed by GloVe and FastText, and the last one by ELMo and BERT.

<img src = "../../figures/word2vec.png" width=300>

2. *GloVe* - addresses problem no. 1 in particular by using co-occurrence statistics of the whole corpus.  For example, given the words i=ice and j=steam, we study the ratio of their co-occurrence probabilities with some probe word k, such as k=solid, as in this figure:

<img src = "../../figures/glove.png" width=300>

3. *FastText* - addresses problems no. 2 and 3.  It trains with the skip-gram architecture but with the following improvements: (1) it is faster and simpler to train, (2) it treats subwords as character n-grams (for the word "what" with n=3, i.e. tri-grams, the word is represented by the character n-grams `<wh`, `wha`, `hat`, `at>`, where < and > are special symbols added at the start and end of each word), and (3) it can generate embeddings for OOV words thanks to the n-grams: an OOV word vector can be built as the average of the vector representations of its n-grams.  Its big disadvantage is its high memory requirements.

4. *ELMo* - addresses problem no. 4: the same word, e.g. "stick", can have different meanings depending on context.  By using a bi-directional LSTM, ELMo takes into account not only the next words but also the preceding ones.  Like FastText, it works on subword (character-level) information and so does not suffer from the OOV problem.
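
The character n-gram idea behind FastText can be illustrated in a few lines. This is a simplified sketch: the helper `char_ngrams` and the toy vectors in `ngram_vectors` are made up for illustration, and the real FastText uses a range of n-gram sizes together with hashing, which is not shown here.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

trigrams = char_ngrams("what", n=3)
print(trigrams)  # ['<wh', 'wha', 'hat', 'at>']

# Toy n-gram vectors (made-up numbers) to illustrate averaging for an OOV word.
ngram_vectors = {
    "<wh": [1.0, 0.0],
    "wha": [0.0, 1.0],
    "hat": [1.0, 1.0],
    "at>": [2.0, 2.0],
}

# The OOV word vector is the element-wise mean of its n-gram vectors.
columns = zip(*(ngram_vectors[g] for g in trigrams))
oov_vector = [sum(col) / len(trigrams) for col in columns]
print(oov_vector)  # [1.0, 1.0]
```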

### Defining hyperparameters

### Batch Iterator

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: 1 if x == 'pos' else 0

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence #++

def collate_batch(batch):
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        length_list.append(processed_text.size(0))  #++<-----packed padded sequences require length
    #criterion expects float labels (float32, to match the dtype of the model output)
    labels = torch.tensor(label_list, dtype=torch.float32)
    texts = pad_sequence(text_list, padding_value=pad_idx, batch_first=True)
    lengths = torch.tensor(length_list, dtype=torch.int64)
    return labels, texts, lengths
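
To see what the padding step in a collate function of this kind produces, here is a small standalone sketch on toy data. The value `pad_idx = 1` is an assumption (it matches `<pad>` being the second entry in the specials list), and the token ids are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_idx = 1  # assumed index of '<pad>'

# Three toy sequences of token ids with different lengths.
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8]), torch.tensor([9, 10])]

# Record the true lengths before padding; packing needs them later.
lengths = torch.tensor([s.size(0) for s in seqs], dtype=torch.int64)

# Pad every sequence to the length of the longest one in the batch.
padded = pad_sequence(seqs, padding_value=pad_idx, batch_first=True)

print(padded)   # shape (batch_size, max_len)
print(lengths)  # shape (batch_size,)
```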

In [None]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

train_iter, test_iter = IMDB()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_loader = DataLoader(split_train_, batch_size=batch_size,
                              shuffle=True, collate_fn=collate_batch)
valid_loader = DataLoader(split_valid_, batch_size=batch_size,
                              shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=batch_size,
                             shuffle=True, collate_fn=collate_batch)

### ++ About pack padded sequence ++
By packing the padded sequence, the recurrent layer (RNN, LSTM, or GRU) skips the padded positions and so avoids unnecessary computations.
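
The saving is visible on a tiny example. This is a sketch on toy values (the numbers and the choice of 0 as padding are assumptions): the packed representation holds only the real elements, and `batch_sizes` records how many sequences are still active at each time step.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# A padded batch of 3 sequences with true lengths 3, 1 and 2 (0.0 is padding here).
padded = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 0.0, 0.0],
                       [5.0, 6.0, 0.0]])
lengths = torch.tensor([3, 1, 2])  # must be a CPU tensor

# enforce_sorted=False lets PyTorch sort the batch by length internally.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

print(packed.data)         # only the 6 real elements, no padding
print(packed.batch_sizes)  # number of active sequences per time step
```

Out of 9 positions in the padded batch, the RNN only has to process 6.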

### Build the model

An addition to this model is that we are not going to learn the embedding for the `<pad>` token. This is because we want to explicitly tell our model that padding tokens are irrelevant to determining the sentiment of a sentence. This means the embedding for the pad token will remain at what it is initialized to (we initialize it to all zeros later). We do this by passing the index of our pad token as the `padding_idx` argument to the `nn.Embedding` layer.
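
The effect of `padding_idx` can be checked directly. In this small sketch the sizes and `pad_idx = 1` are assumptions chosen for illustration: the pad row receives a zero gradient, so an optimizer step leaves it unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
pad_idx = 1  # assumed index of '<pad>'
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=pad_idx)

# With padding_idx set, the pad row starts as all zeros.
pad_row = embedding.weight[pad_idx].detach().clone()
print(pad_row)

# After a backward pass the pad row has zero gradient, while real tokens do not.
tokens = torch.tensor([[2, 3, pad_idx, pad_idx]])
embedding(tokens).sum().backward()
pad_grad = embedding.weight.grad[pad_idx]
token_grad = embedding.weight.grad[2]
print(pad_grad)
print(token_grad)
```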

Before we pass our embeddings to the RNN, we need to pack them, which we do with `nn.utils.rnn.pack_padded_sequence`. This will cause our RNN to only process the non-padded elements of our sequence. The RNN will then return `packed_output` (a packed sequence) as well as the `hidden` state (the hidden state is a tensor, while the output is in packed form). Without packed padded sequences, `hidden` is the tensor from the last element in the sequence, which will most probably be a pad token; with packed padded sequences, however, it comes from the last non-padded element in the sequence.

We then unpack the output sequence, with `nn.utils.rnn.pad_packed_sequence`, to transform it from a packed sequence to a tensor. The elements of `output` from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren't in this case, we still unpack the sequence just to show how it is done.
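
The pack, run, unpack round trip can be sketched on random data. All sizes below are toy values chosen for illustration, and `embedded` stands in for the output of the embedding layer. The sketch checks the two claims above: outputs at padded positions are zero, and `hidden` equals the output at the last non-padded step rather than at the last time step.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch_size, max_len, emb_dim, hid_dim = 3, 4, 5, 6    # toy sizes (assumptions)
embedded = torch.randn(batch_size, max_len, emb_dim)  # stands in for embeddings
lengths = torch.tensor([4, 2, 3])                     # true lengths, on the CPU

rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)

packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
packed_output, hidden = rnn(packed)  # hidden: (1, batch_size, hid_dim)
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)

# Output at padded positions is all zeros (sequence 1 has length 2).
print(output[1, 2:])

# hidden matches the output at the last non-padded step of each sequence ...
last_real = output[torch.arange(batch_size), lengths - 1]
matches_last_real = torch.allclose(last_real, hidden.squeeze(0))

# ... but not the output at the last time step, which is padding for short sequences.
matches_last_step = torch.allclose(output[:, -1, :], hidden.squeeze(0))
print(matches_last_real, matches_last_step)
```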

Here is the code we will be using for our RNN.  Orthogonal initialization is applied by simply using <code>nn.init.orthogonal_</code>.
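
A minimal sketch of what such a model might look like is given here. This is not the notebook's own model: the class name `SimpleRNN`, the layer sizes, and the choice to zero the biases are all assumptions made for illustration. It combines the three ideas from this section, namely `padding_idx`, packing, and orthogonal initialization.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

class SimpleRNN(nn.Module):
    """Hypothetical sketch: embedding -> RNN -> linear, with orthogonal RNN weights."""

    def __init__(self, input_dim, emb_dim, hid_dim, output_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx=pad_idx)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
        self.fc = nn.Linear(hid_dim, output_dim)
        # Orthogonal initialization of the RNN weight matrices; biases set to zero.
        for name, param in self.rnn.named_parameters():
            if 'weight' in name:
                nn.init.orthogonal_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)

    def forward(self, text, text_lengths):
        # text: (batch_size, seq_len); text_lengths: (batch_size,), kept on the CPU
        embedded = self.embedding(text)
        packed = pack_padded_sequence(embedded, text_lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, hidden = self.rnn(packed)       # hidden: (1, batch_size, hid_dim)
        return self.fc(hidden.squeeze(0))  # (batch_size, output_dim)

# Toy sizes and token ids, chosen only for illustration.
model_sketch = SimpleRNN(input_dim=20, emb_dim=8, hid_dim=16, output_dim=1, pad_idx=1)
toy_text = torch.tensor([[4, 5, 6], [7, 1, 1]])
toy_lengths = torch.tensor([3, 1])
logits = model_sketch(toy_text, toy_lengths)
print(logits.shape)

# For the square hidden-to-hidden matrix, orthogonal means W @ W.T is the identity.
w_hh = model_sketch.rnn.weight_hh_l0.detach()
is_orthogonal = torch.allclose(w_hh @ w_hh.T, torch.eye(16), atol=1e-4)
print(is_orthogonal)
```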

### See how the shape changes

In [None]:
#embedded

In [None]:
#assert torch.equal(output[:, -1, :], packed_hn.squeeze(0)) #they will not be equal

### Training

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=lr) #<----changed to Adam
criterion = nn.BCEWithLogitsLoss() #combine sigmoid with binary cross entropy
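
As a quick check of what "combine sigmoid with binary cross entropy" means, the sketch below compares `nn.BCEWithLogitsLoss` on raw logits with `nn.BCELoss` on sigmoid outputs. The logits and labels are made-up values.

```python
import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.5])  # raw model outputs (made-up values)
labels = torch.tensor([1.0, 0.0, 1.0])   # float labels, as the criterion expects

# One step: the sigmoid is applied inside the loss.
loss_fused = nn.BCEWithLogitsLoss()(logits, labels)

# Two steps: explicit sigmoid, then plain binary cross entropy.
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), labels)

print(loss_fused.item(), loss_two_step.item())
```

The two values agree up to floating-point error; the fused version is the numerically safer choice, which is why the model outputs logits rather than probabilities.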

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [None]:
def train(model, loader, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train() #useful for batchnorm and dropout
    for i, (label, text, text_length) in enumerate(loader): 
        label = label.to(device) #(batch_size, )
        text = text.to(device) #(batch_size, seq len)
                
        #predict
        predictions = model(text, text_length).squeeze(1) #output by the fc is (batch_size, 1), thus need to remove this 1
        
        #calculate loss
        loss = criterion(predictions, label)
        acc = binary_accuracy(predictions, label)
        
        #backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
                
    return epoch_loss / len(loader), epoch_acc / len(loader)

In [None]:
def evaluate(model, loader, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for i, (label, text, text_length) in enumerate(loader): 
            label = label.to(device) #(batch_size, )
            text = text.to(device) #(batch_size, seq len)

            predictions = model(text, text_length).squeeze(1) 
            
            loss = criterion(predictions, label)
            acc = binary_accuracy(predictions, label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(loader), epoch_acc / len(loader)

### Putting everything together

In [None]:
best_valid_loss = float('inf')

train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

for epoch in range(num_epochs):

    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_loader, criterion)
    
    #for plotting
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    valid_losses.append(valid_loss)
    valid_accs.append(valid_acc)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'models/fasttext.pt')
    
    print(f'Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(train_losses, label = 'train loss')
ax.plot(valid_losses, label = 'valid loss')
plt.legend()
ax.set_xlabel('epochs')  #one value is recorded per epoch
ax.set_ylabel('loss')

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(train_accs, label = 'train acc')
ax.plot(valid_accs, label = 'valid acc')
plt.legend()
ax.set_xlabel('epochs')  #one value is recorded per epoch
ax.set_ylabel('acc')

In [None]:
model.load_state_dict(torch.load('models/fasttext.pt'))
test_loss, test_acc = evaluate(model, test_loader, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

### Test on some random reviews

In [None]:
test_str = "This is Chaky.  This movie is really good good good"
text = torch.tensor(text_pipeline(test_str)).to(device)
text

In [None]:
text = text.reshape(1, -1)  #because batch_size is 1

In [None]:
text_length = torch.tensor([text.size(1)]).to(dtype=torch.int64)

In [None]:
text.shape

In [None]:
text_length.shape

In [None]:
def predict(text, text_length):
    model.eval()
    with torch.no_grad():
        output = model(text, text_length).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(output))
        return rounded_preds

In [None]:
predict(text, text_length)  #accurate!

### Practice

- Try turning off the FastText embeddings and see how the accuracy changes.  For me, the accuracy dropped by around 10 to 20%.
- Try changing Adam back to SGD.
- Try not packing the sequences and see what happens.
- Try your own review text and see whether the model does well on it.

### Trivia

If you don't like to pad, you can either use batch_size=1, or group samples by length.
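
Grouping by length can be sketched in plain Python. The helper `group_by_length` and the toy sequences are hypothetical; the point is only that every batch contains sequences of one length, so no padding is needed.

```python
from collections import defaultdict

def group_by_length(sequences, batch_size):
    """Return batches (lists of sequences) in which all sequences share one length."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    batches = []
    for same_length in buckets.values():
        # Split each bucket into chunks of at most batch_size sequences.
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    return batches

sequences = [[1, 2, 3], [4, 5], [6, 7, 8], [9], [10, 11], [12, 13, 14]]
batches = group_by_length(sequences, batch_size=2)
for batch in batches:
    print(batch)
```

The trade-off is that some batches end up smaller than `batch_size`, and samples are no longer mixed freely across lengths.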

Next class, let's try the LSTM, a better variant of the RNN, and see whether the accuracy improves.