# Introduction to Python and Natural Language Technologies

__Laboratory 08, Deep learning and NLP__

__November 05, 2020__

__Ádám Kovács__


## Data loading

PyTorch has a dedicated library for working with textual data, [TorchText](https://pytorch.org/text/), which provides many useful built-in methods and datasets.

The Field class is one of the main concepts of TorchText.

The parameters of a Field specify how the data should be processed.

The TEXT field describes how the articles are processed, and the LABEL field describes how their labels are processed.

We can pass tokenize='spacy' to the Field to use spaCy's tokenizer. The default tokenizer splits on whitespace.

LABEL is defined by a LabelField, a subclass of the Field class used specifically for handling labels.

For more on Fields, see the [TorchText data documentation](https://torchtext.readthedocs.io/en/latest/data.html).

Random seeds are used for reproducibility.
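
As a small illustration of why seeding matters, the sketch below uses only Python's random module rather than TorchText. It assumes a hypothetical helper, shuffled_indices, and shows that the same seed reproduces the same shuffle, which is what makes a random train/dev split repeatable.

```python
import random

SEED = 1234


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order determined by the seed."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, SEED)
second = shuffled_indices(10, SEED)
print(first == second)  # same seed, same order
```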

__NOTE: it is advised to use Google Colab for this laboratory. If you have completed the exercises, you can download the notebook and upload it to the repository__

In [None]:
!pip install torchtext==0.4

In [None]:
import torch
from torchtext import data
from torchtext.datasets import text_classification
import os

Download and extract the dataset we produced during the class.

In [None]:
import os

data_dir = os.getenv("data")
if data_dir is None:
    data_dir = ""

ml_path = os.path.join(data_dir, "data.zip")

if not os.path.exists(ml_path):
    print("Download data")
    # urllib.request must be imported explicitly; urlretrieve replaces the deprecated URLopener
    import urllib.request
    urllib.request.urlretrieve("http://sandbox.hlt.bme.hu/~adaamko/dataset.zip", ml_path)

unzip_path = os.path.join(data_dir, "data")

if not os.path.exists(unzip_path):
    print("Unzip data")
    from zipfile import ZipFile
    with ZipFile(ml_path) as myzip:
        myzip.extractall(data_dir)

data_dir = unzip_path

In [None]:
import random

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.long)

## 1. Use data.TabularDataset to load in the dataset and split the train file into train and dev. Use random.seed(SEED) when splitting the dataset

In [None]:
def load_dataset(TEXT, LABEL):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
train, valid, test = load_dataset(TEXT, LABEL)

In [None]:
print(f'Number of training examples: {len(train)}')
print(f'Number of validation examples: {len(valid)}')
print(f'Number of testing examples: {len(test)}')

In [None]:
assert len(train) == 84000

Building a vocabulary is an essential step in handling textual data. The vocabulary is a lookup table in which every unique word has a corresponding index.

This is done so our machine learning model can operate on numbers instead of strings. The indexes are then used to construct embeddings for our words.
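
The lookup table can be sketched in plain Python. This is an illustration on a made-up toy corpus, not TorchText's implementation; the names stoi and itos and the tie-breaking order are assumptions of this sketch.

```python
from collections import Counter

# Toy tokenized corpus; the sentences are made up for illustration.
corpus = [
    ["the", "film", "is", "great"],
    ["the", "plot", "is", "thin"],
]

# Count tokens, then order them by decreasing frequency (ties broken alphabetically).
freqs = Counter(token for sentence in corpus for token in sentence)
itos = ["<unk>", "<pad>"] + sorted(freqs, key=lambda tok: (-freqs[tok], tok))
stoi = {token: index for index, token in enumerate(itos)}

# Convert a sentence into the integer indices a model can consume.
indices = [stoi[token] for token in corpus[0]]
print(indices)
```

Each index would then select one row of an embedding matrix.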

In [None]:
TEXT.build_vocab(train)  
LABEL.build_vocab(train)

In [None]:
print(len(TEXT.vocab))

In [None]:
print(TEXT.vocab.freqs.most_common(20))

We have a special unknown token, __< unk >__, for words that are not in the vocabulary. For example, if the sentence is "This film is great and I love it" but the word "love" is not in the vocab, it becomes "This film is great and I __< unk >__ it".
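
A minimal sketch of this fallback, assuming a hypothetical hand-written vocabulary in which index 0 is reserved for the unknown token:

```python
# Hypothetical toy vocabulary; index 0 is reserved for the unknown token.
stoi = {"<unk>": 0, "This": 1, "film": 2, "is": 3, "great": 4, "and": 5, "I": 6, "it": 7}
itos = {index: token for token, index in stoi.items()}

sentence = "This film is great and I love it".split()

# dict.get falls back to the <unk> index for out-of-vocabulary words.
indices = [stoi.get(token, stoi["<unk>"]) for token in sentence]
decoded = " ".join(itos[i] for i in indices)

print(indices)
print(decoded)
```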
    
We feed the data into our model one batch at a time. Within a batch all sentences must have the same length, so shorter sentences are padded to the length of the longest one.

![pad](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment6.png)

_(image from bentrevett)_
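
Padding can be sketched without any library. The helper name pad_batch and the example sentences are assumptions made for illustration. BucketIterator performs the padding for us and additionally groups sentences of similar length to keep the amount of padding small.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every token list in the batch to the length of the longest one."""
    max_len = max(len(sentence) for sentence in batch)
    return [sentence + [pad_token] * (max_len - len(sentence)) for sentence in batch]


batch = [
    ["I", "hate", "this", "film"],
    ["This", "film", "is", "great", "and", "I", "love", "it"],
    ["Meh"],
]

padded = pad_batch(batch)
for sentence in padded:
    print(len(sentence), sentence)
```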

## 1.1 Use data.BucketIterator to construct iterators on the training, dev and test data.

In [None]:
def construct_iterators(train, dev, test):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = construct_iterators(
    train, valid, test)

# 2. Build an LSTM model that predicts the label

Hints:
Each batch, text, is a tensor of size [sentence length, batch size].

The input batch is passed through the embedding layer to get word embeddings. The embedded batch is then fed into the LSTM.

The LSTM returns output of size [sentence length, batch size, hidden dim] together with a tuple holding the final hidden state and cell state, each of size [1, batch size, hidden dim]. We take the last time step of the output (for a single-layer LSTM this equals the final hidden state).

Finally, we feed the output of the LSTM through the linear layer, fc, to produce a prediction.
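
The shapes mentioned in the hints can be checked on random data. The sketch below is not the classifier itself; the sizes are arbitrary toy values chosen for illustration.

```python
import torch
import torch.nn as nn

# Toy sizes chosen only for illustration.
vocab_size, embedding_dim, hidden_dim = 50, 8, 16
sentence_length, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)  # batch_first=False by default

# A fake batch of token indices with shape [sentence length, batch size].
text = torch.randint(0, vocab_size, (sentence_length, batch_size))

embedded = embedding(text)               # [sentence length, batch size, embedding dim]
output, (hidden, cell) = lstm(embedded)  # output: [sentence length, batch size, hidden dim]

print(embedded.shape, output.shape, hidden.shape)

# For a single-layer, unidirectional LSTM the last time step of output equals hidden.
print(torch.allclose(output[-1], hidden[0]))
```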

In [None]:
import torch.nn as nn
from torch import autograd


class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        # YOUR CODE HERE
        raise NotImplementedError()

    def forward(self, text):
        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
INPUT_DIM = len(TEXT.vocab)

EMBEDDING_DIM = 100
HIDDEN_DIM = 100
OUTPUT_DIM = 4

model = LSTMClassifier(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [None]:
import torch.optim as optim

# Define the optimizer; you can experiment with other optimizers: https://pytorch.org/docs/stable/optim.html
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# NLLLoss expects log-probabilities, so the model should return log_softmax outputs
criterion = nn.NLLLoss()

# Send the tensors to GPU if available
model = model.to(device)
criterion = criterion.to(device)
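
Since nn.NLLLoss expects log-probabilities, the model output should go through log_softmax. The sketch below, on random toy tensors, shows that log_softmax followed by NLLLoss gives the same value as nn.CrossEntropyLoss applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(6, 4)            # raw scores for 6 examples and 4 classes
targets = torch.randint(0, 4, (6,))   # gold class indices

log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)
cross_entropy = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll, cross_entropy))
```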

In [None]:
from sklearn.metrics import classification_report


def class_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    rounded_preds = preds.argmax(1)
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

### 2.1 Implement the train and the evaluate functions.

train should:
- iterate through the dataset with the given iterator
- get the output from the model
- calculate the loss and the accuracy
- backpropagate the loss and update the parameters with the optimizer
- accumulate the epoch loss and accuracy
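
The order of calls within a single training step can be sketched as follows. The model and data are stand-ins (a linear layer and random tensors), not the LSTM classifier or the lab's iterator.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in model and data; a real run would use the LSTM classifier and the iterator.
model = nn.Sequential(nn.Linear(10, 4), nn.LogSoftmax(dim=1))
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

inputs = torch.randn(8, 10)
labels = torch.randint(0, 4, (8,))

model.train()                         # enable training-time behaviour such as dropout
weights_before = model[0].weight.detach().clone()

optimizer.zero_grad()                 # clear gradients left over from the previous step
predictions = model(inputs)
loss = criterion(predictions, labels)
loss.backward()                       # backpropagate the loss
optimizer.step()                      # update the parameters

print(loss.item())
```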

In [None]:
from sklearn.metrics import accuracy_score
import torch.nn.functional as F


def train(model, iterator, optimizer, criterion):
    # YOUR CODE HERE
    raise NotImplementedError()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

evaluate should:
- set the model to .eval() mode
- with torch.no_grad(), iterate on the iterator
- calculate the prediction and the loss on the validation dataset
- accumulate the epoch loss and accuracy
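
The effect of .eval() and torch.no_grad() can be seen on a stand-in model. The dropout layer is an assumption added to this sketch so that the two modes actually differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model with dropout so that train and eval modes behave differently.
model = nn.Sequential(nn.Linear(10, 4), nn.Dropout(p=0.5), nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()

inputs = torch.randn(8, 10)
labels = torch.randint(0, 4, (8,))

model.eval()                  # dropout is disabled in eval mode
with torch.no_grad():         # no computation graph is recorded inside this block
    first = model(inputs)
    second = model(inputs)
    loss = criterion(first, labels)

print(torch.equal(first, second), first.requires_grad, loss.item())
```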

In [None]:
def evaluate(model, iterator, criterion):
    # YOUR CODE HERE
    raise NotImplementedError()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time


def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 15

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

## 2.2 Use pretrained embeddings in the model instead of training them from scratch.

- Attach pretrained vectors to the vocabulary (https://pytorch.org/text/vocab.html)
- Set the embedding weights in `__init__`
- Turn off training for the embedding layer
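
One possible way to load and freeze embedding weights is sketched below. The weight matrix is a made-up stand-in; in the lab it would instead come from the vectors attached to the vocabulary.

```python
import torch
import torch.nn as nn

# Stand-in for pretrained vectors: 5 words with 3-dimensional embeddings.
pretrained_vectors = torch.arange(15, dtype=torch.float).view(5, 3)

# from_pretrained builds the layer from the given matrix;
# freeze=True (the default) excludes the weights from training.
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

token_indices = torch.tensor([0, 4])
print(embedding(token_indices))
print(embedding.weight.requires_grad)
```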

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## -------------------------------------------PASSING LEVEL---------------------------------------------------------

# 3. After a few epochs, the model starts to overfit (the accuracy goes down in dev and up in train). Add Dropout layers to the model.

## 3.1 Change LSTM to Bi-LSTM
- See: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- Be aware that a Bi-LSTM concatenates the outputs of the forward and backward directions, so the last dimension of the output is 2*hidden_dim
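
The shape change can be verified on random data. This is a sketch with arbitrary toy sizes; concatenating the two final hidden states is one common choice for the classifier input, not the only one.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 8, 16
sentence_length, batch_size = 5, 3

bilstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
embedded = torch.randn(sentence_length, batch_size, embedding_dim)

output, (hidden, cell) = bilstm(embedded)
print(output.shape)   # last dimension is 2 * hidden_dim
print(hidden.shape)   # first dimension is 2: one final state per direction

# A linear layer on top therefore needs 2 * hidden_dim input features.
fc = nn.Linear(2 * hidden_dim, 4)
final_states = torch.cat((hidden[0], hidden[1]), dim=1)
print(fc(final_states).shape)
```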

## -------------------------------------------EXTRA LEVEL-----------------------------------------------------------------