The `en_core_web_md` model from spaCy [contains pretrained GloVe vectors](https://github.com/explosion/spacy-models/releases//tag/en_core_web_md-2.2.5).

We also cache the computed vectors to speed up preprocessing on successive runs.
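
The caching idea can be sketched as a small helper. `load_or_compute` and `expensive_vectors` are hypothetical names used only for illustration; the notebook itself inlines the same logic with `CACHE_FILE`.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return (result, cache_hit): load from cache_path if it exists, else compute and store."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result, False


def expensive_vectors():
    # stand-in for the spaCy preprocessing step (made-up values)
    return [[0.1, 0.2], [0.3, 0.4]], [0, 4]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "demo.cache")
    first, hit_first = load_or_compute(cache, expensive_vectors)    # computes and writes
    second, hit_second = load_or_compute(cache, expensive_vectors)  # reads from the cache
```

The second call skips the computation entirely, which is where the speed-up on successive runs comes from.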

In [None]:
import numpy as np
import pandas as pd
import torch
import pickle
import random
import copy
from time import time
from tqdm import tqdm
from os import path
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence

CACHE_FILE = "vectors.cache"
MODEL_FILE = "best.model"
MODEL_INFO_FILE = "best.model.info"
SEED = 42
torch.manual_seed(SEED)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_EPOCH = 100
EARLY_STOPPING_EPOCHS = 3
EARLY_STOPPING_TRESHOLD = 0.01  # diff of NLL

In [None]:
# create or load vectors from cache file
if path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        x_all, y_all = pickle.loads(f.read())
    print("Loaded vectors and labels from cache")
else:
    import spacy
    nlp = spacy.load("en_core_web_md")
    dataset = pd.read_csv('sst5.data.txt')  # .head(1000) to only load 1000
    documents = nlp.pipe(dataset["text"].to_list())
    # word.vector is the pretrained word vector; word.tensor would be the pipeline's context tensor
    x_all = [[word.vector for word in doc] for doc in documents]
    y_all = (dataset["label"] + 2).to_list()  # shift the labels from -2..2 to 0..4
    with open(CACHE_FILE, "wb") as f:
        f.write(pickle.dumps((x_all, y_all)))

# split first into 60:40, then split the held-out 40 into 50:50 -> 60:20:20 (train:val:test)
x_train, x_rest, y_train, y_rest = train_test_split(x_all, y_all, test_size=0.4, random_state=SEED)
x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=0.5, random_state=SEED)

# custom dataloader-dataset for variable length sequences, as TensorDataset only works for equal length sequences
class MyDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]


# padding collate-function to receive batch-processable vectors
def pad(batch):
    x = [torch.FloatTensor(xx).to(DEVICE) for xx, _ in batch]
    y = torch.LongTensor([yy for _, yy in batch]).to(DEVICE)
    return pad_sequence(x, batch_first=True).to(DEVICE), y


# build dataloaders
train_loader = DataLoader(MyDataset(x_train, y_train), batch_size=128, shuffle=True, collate_fn=pad)
test_loader = DataLoader(MyDataset(x_test, y_test), batch_size=1000, collate_fn=pad)
val_loader = DataLoader(MyDataset(x_val, y_val), batch_size=1000, collate_fn=pad)

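The 60:20:20 split described in the comment above can be checked with plain Python. This is only a sketch: `split` is a hypothetical stand-in for `train_test_split`, and the data is a list of integers.

```python
import random


def split(items, test_size, seed):
    """Hypothetical stand-in for train_test_split: shuffle, then cut off the test share."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)
    return items[n_test:], items[:n_test]


data = list(range(1000))
train, rest = split(data, test_size=0.4, seed=42)  # 60 : 40
val, test = split(rest, test_size=0.5, seed=42)    # the held-out 40 becomes 20 : 20

sizes = (len(train), len(val), len(test))
```

The second split has to be applied to the held-out 40%. Splitting the training part a second time would give 30:30:40 instead.
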
## Processing and optimizations

We investigate whether the following changes affect performance:

* Increase/decrease the dimension of the hidden state of the RNN
* Increase/decrease dropout rates
* Remove early stopping and, instead, train the model for a fixed number of epochs.
* Use a bidirectional LSTM, and define the document embedding as the concatenation of the last state of the forward LSTM with the last state of the backward LSTM.

Additionally, we performed a hyperparameter search over values that were not given explicitly, such as the number of layers, the batch size and the learning rate.
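
A random search over such values can be sketched as follows. The search space and the `fake_objective` scoring function are made up for illustration; in the notebook, the score comes from actually training a model.

```python
import random

search_space = {
    "layers": [1, 2, 3],
    "batch_size": [32, 64, 128],
    "lr_exponent": [2, 3, 4],  # learning rate = 10 ** -lr_exponent
}


def fake_objective(params):
    """Made-up score (lower is better) so that the sketch runs without any training."""
    return abs(params["layers"] - 1) + abs(params["lr_exponent"] - 3) + params["batch_size"] / 1000


def random_search(space, objective, n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:  # keep the best configuration seen so far
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(search_space, fake_objective, n_trials=20)
```

Unlike a full grid search, this samples configurations until it is stopped, so the coverage of the search space depends on how long it runs.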

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size, num_classes, hidden_size, num_layers, dropout, bidirectional):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0, bidirectional=bool(bidirectional))
        self.activation = nn.Linear(hidden_size * (bidirectional + 1) * num_layers, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        """ Forward pass. The batch already contains the word embeddings; the LSTM calculates the
        hidden states of the given sequences (documents). The last hidden state of the LSTM is used
        as the document embedding. From the document embedding, the model predicts the probability
        distribution of the output classes through the decoder (a linear projection) and a
        softmax layer. """
        # nn.LSTM returns (output, (h_n, c_n)); h_n has shape (num_layers * num_directions, batch, hidden_size)
        _, (h_n, _) = self.lstm(x)
        # concatenate the final hidden states of all layers and directions for each document
        hidden = h_n.permute((1, 0, 2)).reshape(len(x), -1)
        hidden = self.activation(hidden)  # decode the document embedding into class scores
        return self.softmax(hidden)


def validate(model, loader, criterion):
    model.eval()  # disables dropout
    with torch.no_grad():  # disables gradient tracking
        res = np.mean([float(criterion(model(x), y.long())) for x, y in loader])
    model.train()
    return res

def train(x_train, y_train, val_loader, test_loader, nr_samples, batch_s,
          bidir, hidden, layers, lr_e, drop, estop):
    start = time()
    train_loader = DataLoader(MyDataset(x_train, y_train), batch_size=batch_s, shuffle=True, collate_fn=pad)
    # our LSTM model
    model = LSTM(input_size=len(x_train[0][0]), num_classes=5, hidden_size=hidden,
                 num_layers=layers, dropout=drop, bidirectional=bidir).to(DEVICE)
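    # Loss function: nn.NLLLoss. Note that the model outputs softmax probabilities rather than
    # log-probabilities, so the loss is the negative mean probability of the correct class (in [-1, 0]).
    criterion = nn.NLLLoss()
    # Optimization: Adam with default parameters, except for the learning rate 10^(-lr_e).
    optimizer = optim.Adam(model.parameters(), lr=0.1 ** lr_e)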
    tq = tqdm(range(MAX_EPOCH), leave=False)
    vlh = [0]  # validation loss history
    for _ in tq:
        train_loss, correct = 0.0, 0  # reset the per-epoch statistics

        # run training iteration
        for xt, yt in train_loader:
            # train_loader yields a padded batch_size x seq_len x embedding_size tensor plus the labels
            optimizer.zero_grad()
            output = model(xt)
            # calculate the loss and count the correct predictions
            loss = criterion(output, yt)
            train_loss += float(loss)
            pred = output.max(1, keepdim=True)[1]
            correct += int(pred.eq(yt.view_as(pred)).sum().item())
            loss.backward()
            optimizer.step()

        # run validation
        train_loss /= len(train_loader)
        val_loss = validate(model, val_loader, criterion)
        vlh.append(val_loss)

        # statistics
        correct /= nr_samples
        stats = {"corr%": f"{correct * 100:.2f}", "tr_lo%": f"{train_loss * 100:.2f}", "va_lo%": f"{val_loss * 100:.2f}"}
        tq.set_postfix_str(str(stats)[1:-1].replace("'", ""))

        if estop:
            # Early Stopping. After each epoch or after a certain number of batches (defined as a hyperparameter),
            # evaluate the model on the validation set. If the evaluation result improves, save the
            # model as the best performing model so far.
            if vlh[-1] < min(vlh[0: -1]):
                best_model = copy.deepcopy(model)
            # If the results are not improving after a certain number
            # of evaluations (given as another hyper-parameter), terminate training.
            if vlh[-1] > min(vlh[-EARLY_STOPPING_EPOCHS-1 : -1]):
                break
    final_model = best_model if estop else model
    test_loss = validate(final_model, test_loader, criterion)  # evaluate the model that is returned
    print("\r", end="")  # remove tqdm
    duration = time() - start
    stats.update({"te_lo%": f"{test_loss*100:.2f}", "duration": f"{duration:.1f}"})
    return stats, final_model


if path.exists(MODEL_INFO_FILE):
    with open(MODEL_INFO_FILE, "rb") as f:
        best_info = pickle.loads(f.read())
else:
    best_info = {"te_lo%": "-30"} # baseline: 30%
hyper_params = {"batch_s": [512, 256, 128, 64, 32], "bidir": [0, 1], "hidden": [32, 64, 128, 256],
                "layers": [1, 2, 3], "lr_e": [2, 3, 4], "drop": [0.0, 0.2, 0.4], "estop": [1]}
extra_params = ["corr%", "tr_lo%", "va_lo%", "te_lo%", "duration"]
print(*hyper_params.keys(), *extra_params, sep="\t")
while True:
    hyper = {k: random.choice(v) for k, v in hyper_params.items()}
    # print(*hyper.values(), sep="\t\t")  # print before training for debugging purposes
    stats, model = train(x_train, y_train, val_loader, test_loader, len(x_train), **hyper)
    print(*hyper.values(), *stats.values(), sep="\t\t")
    # if the model performs better (lower test loss) than the best so far, save it!
    if float(stats["te_lo%"]) < float(best_info["te_lo%"]):
        torch.save(model, MODEL_FILE)
        best_info = stats
        with open(MODEL_INFO_FILE, "wb") as f:
            pickle.dump(best_info, f)

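The early-stopping rule used in `train` (stop as soon as the newest validation loss is worse than the best of the previous `EARLY_STOPPING_EPOCHS` values) can be isolated into a small function to see how it behaves. This is a sketch; `should_stop`, `stopping_epoch` and the loss values are made up for illustration. The losses are negative, as in the search log.

```python
def should_stop(history, patience):
    """True if the newest loss is worse than the best of the `patience` losses before it."""
    previous = history[-patience - 1:-1]
    return bool(previous) and history[-1] > min(previous)


def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    history = [0]  # same sentinel start value as in the training loop
    for epoch, loss in enumerate(val_losses, start=1):
        history.append(loss)
        if should_stop(history, patience):
            return epoch
    return None


improving = [-0.30, -0.35, -0.40, -0.45]           # keeps getting better
overfitting = [-0.30, -0.40, -0.45, -0.44, -0.43]  # gets worse from epoch 4 on

stop_improving = stopping_epoch(improving)
stop_overfitting = stopping_epoch(overfitting)
```

Note that this rule stops at the first epoch that is worse than the recent best. Tolerating several non-improving epochs would need an explicit patience counter.
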
# Experimenting with Model Variations
>In this task, we experiment with variations of the baseline architecture, defined in the previous
>task. In general, at least 3 variations should be implemented, trained, and evaluated. In each
>variation, only one change should be applied to the baseline architecture. In that way, the effect
>of the change can be traced by comparing the evaluation results.

To do this, we sort the output log of the hyperparameter search by various factors and look for
the effects of individual hyperparameters.
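
One way to do this is to group the logged runs by a single hyperparameter and compare the average test loss per value. The sketch below uses made-up rows, not real results, and a hypothetical `mean_by` helper; only the column names `bidir`, `layers` and `te_lo%` are taken from the search log.

```python
from collections import defaultdict
from statistics import mean

# made-up example rows in the shape of the search log (not real results)
runs = [
    {"bidir": 0, "layers": 1, "te_lo%": -41.0},
    {"bidir": 1, "layers": 1, "te_lo%": -47.0},
    {"bidir": 0, "layers": 3, "te_lo%": -38.0},
    {"bidir": 1, "layers": 3, "te_lo%": -44.0},
]


def mean_by(rows, key, metric="te_lo%"):
    """Average `metric` for every distinct value of the hyperparameter `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[metric])
    return {value: mean(scores) for value, scores in groups.items()}


by_bidir = mean_by(runs, "bidir")
by_layers = mean_by(runs, "layers")
ranked = sorted(runs, key=lambda row: row["te_lo%"])  # lowest (best) loss first
```

Averaging over all other settings is a rough way to isolate one factor, since the random search does not vary only one hyperparameter at a time.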

# Hyperparameter analysis and results

We can see that without early stopping, the model seems to overfit within 100 epochs:
the negative training loss (roughly equivalent to the accuracy) is over 80% for the
best runs, while the best validation losses (`va_lo%`) only correspond to around 50% accuracy.

This indicates that running for a fixed number of epochs instead of using early stopping could be a bad idea.
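
Reading the negative loss as an accuracy-like number follows from how the loss is computed: the model ends in a softmax layer, and `nn.NLLLoss` averages the negated model output at the true class index. Applied to probabilities instead of log-probabilities, this gives the negative mean probability assigned to the correct class. The plain-Python sketch below mimics that computation with made-up probabilities; `nll_on` is a hypothetical helper, not a PyTorch function.

```python
import math


def nll_on(outputs, targets):
    """Mean of -output[target], the computation behind NLLLoss with mean reduction."""
    return -sum(row[t] for row, t in zip(outputs, targets)) / len(targets)


# made-up softmax outputs for three documents and five classes
probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
targets = [0, 1, 4]

# on probabilities: -(0.7 + 0.6 + 0.2) / 3, a value between -1 and 0
loss_on_probs = nll_on(probs, targets)

# on log-probabilities: a true negative log-likelihood, never negative
log_probs = [[math.log(p) for p in row] for row in probs]
loss_on_log_probs = nll_on(log_probs, targets)
```

The first value is a soft counterpart of the accuracy (mean confidence in the correct class), which is why it is only roughly equivalent to it.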

Findings in hyperparameters:
* Bidirectional processing seems to have the highest impact by far, as the best 20 results all have it enabled.
* Lower numbers of layers also seem to be beneficial.
* The effects of dropout (which only applies with more than one layer) seem to be very minor or not detectable at all.
* A smaller learning rate seems to limit the overfitting issue, as it produces better test accuracies but lower train accuracies.
* The training duration seems to be roughly 260-500 seconds regardless of the parameters. By debugging, we found out that
 padding and transferring the vectors to the GPU seem to make up a significant portion of the time, and the benefits
 of bigger batch sizes seem to be canceled out by the batches being longer on average. Obviously, early stopping strongly
 impacts training times.
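
The interaction between batch size and padding can be illustrated with a small simulation. Every batch is padded to its longest sequence, so a larger batch is more likely to contain a very long document. The sequence lengths below are randomly generated and `padded_tokens` is a hypothetical helper, so the numbers only illustrate the trend, not the measured durations.

```python
import random


def padded_tokens(lengths, batch_size):
    """Total number of positions processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


rng = random.Random(42)
lengths = [rng.randint(5, 60) for _ in range(2048)]  # made-up document lengths in tokens

real_tokens = sum(lengths)
small_batches = padded_tokens(lengths, batch_size=32)
large_batches = padded_tokens(lengths, batch_size=512)
# for comparison only: grouping documents of similar length reduces the padding
sorted_batches = padded_tokens(sorted(lengths), batch_size=512)
```

Comparing `small_batches` and `large_batches` with `real_tokens` shows how much of the work is spent on padding in each setting.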

**Results**: None of the models perform as well as the models from previous exercises (up to 70% test accuracy)
when validated, as they tend to overfit. However, the best overfitted accuracies indicate that there is potential for
80%+ accuracies if more data were available.