# Introduction to Python and Natural Language Technologies

__Laboratory 07, Deep learning and NLP__

__March 25, 2021__

__Ádám Kovács__


During this laboratory we are going to use the same classification dataset that we used last time: SemEval 2019 - Task 6. 
The dataset is about Identifying and Categorizing Offensive Language in Social Media.

__Preparation:__
- You will need the SemEval dataset (we will have code to download it)
- You will need to install PyTorch:
    - pip install torch 
- You will also need to have pandas, torchtext, numpy and scikit-learn installed. You can find the instructions for them in the lecture notebook.

We are going to use [PyTorch](https://pytorch.org/docs/stable/index.html), an open source library for building optimized deep learning models that can be run on GPUs. It is one of the most widely used libraries for building neural networks/deep learning models.

__NOTE: If your notebook/PC is not powerful enough, we advise using Google Colab for this laboratory, which gives free access to GPUs. Once you have completed the exercises, you can download the notebook and upload it to the repository.__

In [None]:
!pip install torch

In [None]:
# Import the needed libraries
import pandas as pd
import numpy as np

## 0. Download the dataset and load it into a pandas DataFrame

__Note: you can reuse your code from the previous lab!__

In [None]:
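# First we download the data using the code from last week
import os
import urllib.request  # a plain `import urllib` does not reliably load the request submodule

# Create the target directory if it does not exist yet
os.makedirs('./data', exist_ok=True)

# urlretrieve replaces the deprecated URLopener class
urllib.request.urlretrieve("http://sandbox.hlt.bme.hu/~adaamko/offenseval.tsv",
                           "data/offenseval.tsv")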

## 0.1 Read in the dataset into a Pandas DataFrame
Use `pd.read_csv` with the correct parameters to read in the dataset. If done correctly, the resulting `DataFrame` should have 3 columns: 
`id`, `tweet`, `subtask_a`.
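
As a minimal sketch of the kind of call involved, the snippet below parses a small tab-separated sample from memory. The sample rows and the names `sample_tsv` and `sample_df` are made up for illustration; the real file may need additional `pd.read_csv` parameters (for example `quoting`) that this toy input does not require.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for data/offenseval.tsv
sample_tsv = (
    "id\ttweet\tsubtask_a\n"
    "1\t@USER you are great\tNOT\n"
    "2\t@USER you are awful\tOFF\n"
)

# sep="\t" tells pandas that the columns are separated by tabs
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_df.columns.tolist())
```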

In [None]:
import pandas as pd
import numpy as np

In [None]:
def read_dataset():
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
train_data_unprocessed = read_dataset()

assert type(train_data_unprocessed) == pd.core.frame.DataFrame
assert len(train_data_unprocessed.columns) == 3
assert (train_data_unprocessed.columns == ['id', 'tweet', 'subtask_a']).all()

## 0.2 Convert `subtask_a` into a binary label
The task is to classify the given tweets into two categories: _offensive (OFF)_ and _not offensive (NOT)_. Machine learning algorithms need integer labels instead of strings. Add a new column called `label` to the dataframe by transforming the `subtask_a` column into a binary integer label.
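
One possible way to do such a conversion is a dictionary lookup with `Series.map`, sketched here on a toy frame. The choice of `OFF -> 1` and `NOT -> 0` is an assumption; the assertions in this notebook only require the values to be 0 or 1.

```python
import pandas as pd

# Toy frame with the same string labels as the real dataset
toy = pd.DataFrame({"subtask_a": ["NOT", "OFF", "NOT"]})

# Assumed encoding: offensive tweets become 1, the rest 0
toy["label"] = toy["subtask_a"].map({"NOT": 0, "OFF": 1})
print(toy)
```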

In [None]:
def transform(train_data):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
from pandas.api.types import is_numeric_dtype

train_data = transform(train_data_unprocessed)

assert "label" in train_data
assert is_numeric_dtype(train_data.label)
assert (train_data.label.isin([0,1])).all()

## 1. Train a simple neural network on this dataset

__HINT: you can reuse the code from the lecture! Most of the code will be very similar to what we used there!__

In [None]:
#Import pytorch and set a fixed random seed for reproducibility
import torch

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### 1.1 Split the dataset into a train and a validation dataset
Use the random seed for splitting. You should split the dataset into 70% training data and 30% validation data.
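
A small sketch of a seeded 70/30 split on made-up data follows. The toy frame and the names `toy_train` and `toy_val` are illustrative only; `test_size` and `random_state` are the `train_test_split` parameters that control the ratio and the seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows, so a 70/30 split is easy to check by eye
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# test_size=0.3 keeps 30% for validation; random_state fixes the shuffle
toy_train, toy_val = train_test_split(toy, test_size=0.3, random_state=1234)
print(len(toy_train), len(toy_val))
```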

In [None]:
from sklearn.model_selection import train_test_split as split

def split_data(train_data, random_seed):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
tr_data, val_data = split_data(train_data, SEED)
assert len(tr_data) == 9268

### 1.2 Use CountVectorizer to prepare the features for the sentences
You should fit a `CountVectorizer` limited to _10000_ features (see its `max_features` parameter).
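
To illustrate how a feature limit behaves, the sketch below fits a vectorizer on a three-sentence toy corpus with `max_features=3`, which keeps only the most frequent terms. The corpus and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus: "great" occurs 3 times, "you" and "are" twice each
toy_corpus = ["you are great", "you are awful", "great great day"]

# max_features keeps only the top terms ordered by corpus frequency
toy_vectorizer = CountVectorizer(max_features=3)
toy_vectorizer.fit(toy_corpus)
print(sorted(toy_vectorizer.vocabulary_))
```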

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def prepare_vectorizer(tr_data):
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
word_to_ix = prepare_vectorizer(tr_data)
VOCAB_SIZE = len(word_to_ix.vocabulary_)
assert VOCAB_SIZE == 10000

### 1.3 Prepare the DataLoader for batch processing

The __prepare_dataloader(..)__ function will take the training and the validation dataset and convert them to bag-of-words count vectors with the help of the fitted CountVectorizer.

You should prepare two FloatTensors, one for the converted tweets of the training data and one for those of the validation data.

Then zip the vectors together with the labels into a list of tuples!

__Hint: look at the lecture (but be careful, we had different types of labels there!)__
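
The conversion can be sketched on toy data as follows. This is only an illustration under assumed names (`toy_tweets`, `toy_pairs`), and it builds the float tensor with `torch.tensor(..., dtype=torch.float32)`, which is one of several ways to obtain one.

```python
import torch
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ["you are great", "you are awful"]
toy_labels = [0, 1]  # plain Python ints
toy_vectorizer = CountVectorizer().fit(toy_tweets)

# transform() returns a sparse matrix; toarray() makes it dense
dense_counts = toy_vectorizer.transform(toy_tweets).toarray()
toy_vectors = torch.tensor(dense_counts, dtype=torch.float32)

# Pair every vector with its label: a list of (tensor, int) tuples
toy_pairs = list(zip(toy_vectors, toy_labels))
print(toy_pairs[0])
```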

In [None]:
def prepare_dataloader(tr_data, val_data, word_to_ix):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
tr_data_loader, val_data_loader = prepare_dataloader(tr_data, val_data, word_to_ix)
assert type(tr_data_loader[0][0]) == torch.Tensor
assert len(tr_data_loader) == 9268
assert type(tr_data_loader[0][1]) == int

- __We have the correct lists now, it is time to initialize the DataLoader objects!__
- __Create two DataLoader objects with the lists we have created__
- __Shuffle the training data but not the validation data!__
- __Set a BATCH_SIZE, experiment with different sized batches to see if it improves the performance__
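
The points above can be sketched with a hand-made list of `(tensor, label)` pairs. The data and the loader names are hypothetical; `batch_size` and `shuffle` are the relevant `DataLoader` arguments.

```python
import torch
from torch.utils.data import DataLoader

# Three made-up (vector, label) pairs in the same shape as the real lists
toy_pairs = [(torch.zeros(4), 0), (torch.ones(4), 1), (torch.zeros(4), 0)]

# Shuffle only the training loader so validation order stays fixed
toy_train_loader = DataLoader(toy_pairs, batch_size=2, shuffle=True)
toy_val_loader = DataLoader(toy_pairs, batch_size=2, shuffle=False)

for vectors, labels in toy_val_loader:
    # Each batch stacks the vectors and collects the labels into a tensor
    print(vectors.shape, labels)
```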

In [None]:
from torch.utils.data import DataLoader

def create_dataloader_iterators(tr_data_loader, val_data_loader, BATCH_SIZE):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Try to experiment with different sized batches and see if changing this will improve the performance or not!
BATCH_SIZE = 64

In [None]:
train_iterator, valid_iterator = create_dataloader_iterators(tr_data_loader, val_data_loader, BATCH_SIZE)
assert type(train_iterator) == torch.utils.data.dataloader.DataLoader

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### 1.4 Build the model
At first, the model should contain only a single Linear layer that takes the bag-of-words vectors and transforms them into the dimension of __NUM_LABELS__ (the number of classes we are trying to predict). Then pass the output through a log-softmax activation to produce log-probabilities of the classes; the `nn.NLLLoss` criterion used below expects log-probabilities rather than raw softmax probabilities.
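
A rough sketch of such a single-layer model is shown below on random input. The class name `TinyLinearClassifier` and the dimensions are made up. It applies `F.log_softmax` because `nn.NLLLoss`, which this notebook uses as the criterion, expects log-probabilities; exponentiating the output recovers the class probabilities.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyLinearClassifier(nn.Module):
    """Hypothetical one-layer bag-of-words classifier."""

    def __init__(self, num_labels, vocab_size):
        super().__init__()
        # Maps a vocab_size-dimensional vector to one score per label
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # dim=1 normalises over the label dimension of each row
        return F.log_softmax(self.linear(bow_vec), dim=1)


toy_model = TinyLinearClassifier(num_labels=2, vocab_size=5)
toy_out = toy_model(torch.rand(3, 5))
print(toy_out.exp().sum(dim=1))  # each row of probabilities sums to ~1
```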

In [None]:
from torch import nn

class BoWClassifier(nn.Module):  # inheriting from nn.Module!
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# SET THE CORRECT INPUT AND OUTPUT DIMENSIONS!
#INPUT_DIM = ...
#OUTPUT_DIM = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)

In [None]:
# Set the optimizer and the loss function!
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

In [None]:
model = model.to(device)
criterion = criterion.to(device)

In [None]:
assert model.linear.in_features == 10000
assert model.linear.out_features == 2

### Implement the following functions:
- __calculate_performance__: This should calculate the batch-wise accuracy of your model!
- __train__ - Train your model on the training data! This function should set the model to training mode, then use the given iterator to iterate through the training samples and make predictions using the provided model. You should then backpropagate the error using the loss function and update the weights with the optimizer. Finally, return the average epoch loss and accuracy!
- __evaluate__ - Evaluate your model on the validation dataset. This function is essentially the same as the training function, but you should set your model to eval mode and not backpropagate the errors to your weights!
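
The difference between a training step and an evaluation step can be sketched on a single random batch. Everything here (`batch_accuracy`, `toy_model`, the random data) is a stand-in for illustration, not the expected solution; the real functions loop over an iterator and accumulate the per-batch values.

```python
import torch
from torch import nn


def batch_accuracy(log_probs, y):
    # The predicted class is the one with the highest log-probability
    predictions = log_probs.argmax(dim=1)
    return (predictions == y).float().mean().item()


toy_model = nn.Sequential(nn.Linear(4, 2), nn.LogSoftmax(dim=1))
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)
toy_criterion = nn.NLLLoss()
x = torch.rand(8, 4)
y = torch.randint(0, 2, (8,))

# Training step: gradients are computed and the weights are updated
toy_model.train()
toy_optimizer.zero_grad()
loss = toy_criterion(toy_model(x), y)
loss.backward()
toy_optimizer.step()

# Evaluation step: eval mode and no gradient tracking, weights untouched
toy_model.eval()
with torch.no_grad():
    val_log_probs = toy_model(x)
    val_loss = toy_criterion(val_log_probs, y)
    val_acc = batch_accuracy(val_log_probs, y)
print(val_loss.item(), val_acc)
```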

In [None]:
def calculate_performance(preds, y):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
import torch.nn.functional as F
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    # YOUR CODE HERE
    raise NotImplementedError()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    # YOUR CODE HERE
    raise NotImplementedError()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### 1.5 Training loop!
Below is the training loop of our model! Try to set an EPOCH number that will correctly train your model :) (so that it is neither underfitted nor overfitted!)

In [None]:
# Set an EPOCH number!
N_EPOCHS = 15

In [None]:
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_score = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_score = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Fscore: {train_score*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Fscore: {valid_score*100:.2f}%')

### 1.6 Change calculate_performance to calculate FScore instead of accuracy

Our dataset is very imbalanced. We have twice as many NOT offensive tweets as offensive ones. Accuracy is not a good measure for this.

See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html for fscore calculation.

You should expect a heavy drop in performance when you calculate fscore instead of accuracy!
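
A toy comparison shows why the two metrics can diverge on imbalanced labels. The label lists are made up; `f1_score` with its default settings scores the positive class (label 1).

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: four negatives, two positives, one positive missed
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)  # 5 of 6 predictions are correct
f1 = f1_score(y_true, y_pred)         # precision 1.0, recall 0.5 on class 1
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

When the predictions are PyTorch tensors, they would first need to be turned into class indices (for example with `argmax`) and moved to the CPU before being passed to scikit-learn.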

__NOTE: DON'T FORGET TO RERUN THE MODEL INITIALIZATION WHEN YOU RUN THE MODEL MULTIPLE TIMES. IF YOU DON'T REINITIALIZE THE MODEL, IT WILL CONTINUE THE TRAINING WHERE IT STOPPED LAST TIME INSTEAD OF STARTING FROM SCRATCH!__

Rerunning these lines will reinitialize the model:

    model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.NLLLoss()
    model = model.to(device)
    criterion = criterion.to(device)

In [None]:
from sklearn.metrics import f1_score

def calculate_performance(preds, y):
    # YOUR CODE HERE
    raise NotImplementedError()

## 2. Add more linear layers to your model and experiment with other hyperparameters

### 2.1 More layers

Currently we only have a single linear layer in our model. Try to add one or more additional linear layers to the model.
You should introduce a HIDDEN_SIZE parameter that will be the size of the intermediate representation between the linear layers. Also add a ReLU activation function between the linear layers.

See more:
- https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
- https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_nn.html
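
The layer stacking can be sketched compactly with `nn.Sequential`. This is an illustration with made-up sizes, not the class skeleton the exercise asks for; the same layers would go into `__init__` and be chained in `forward`.

```python
import torch
from torch import nn

toy_hidden_size = 8  # hypothetical size of the intermediate representation

toy_deep = nn.Sequential(
    nn.Linear(5, toy_hidden_size),   # vocabulary size -> hidden size
    nn.ReLU(),                       # non-linearity between the layers
    nn.Linear(toy_hidden_size, 2),   # hidden size -> number of labels
    nn.LogSoftmax(dim=1),            # log-probabilities for nn.NLLLoss
)

toy_out = toy_deep(torch.rand(3, 5))
print(toy_out.shape)
```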

In [None]:
from torch import nn

class BoWDeepClassifier(nn.Module):  # inheriting from nn.Module!
    def __init__(self, num_labels, vocab_size, hidden_size):
        # YOUR CODE HERE
        raise NotImplementedError()

    def forward(self, bow_vec):
        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Write down your experiences with changing the parameters in the cell below

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
HIDDEN_SIZE = 200
learning_rate = 0.001
BATCH_SIZE = 64
N_EPOCHS = 15

In [None]:
model = BoWDeepClassifier(OUTPUT_DIM, INPUT_DIM, HIDDEN_SIZE)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()

model = model.to(device)
criterion = criterion.to(device)

In [None]:
# TRAINING LOOP HERE!
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_score = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_score = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Fscore: {train_score*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Fscore: {valid_score*100:.2f}%')

# ================ PASSING LEVEL ====================

## 3. Implement automatic early-stopping in the training loop
Early stopping is a very easy method to avoid the overfitting of your model.

You should:
- Save the training and the validation loss of the last two epochs (if you are at least in the third epoch)
- If over the last two epochs the loss decreased on the training data but increased or stagnated on the validation data, you should stop the training automatically!
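
One way to express such a check is a small helper that looks at the recorded loss histories. The sketch below assumes the usual overfitting signal (training loss still falling while validation loss no longer improves) and compares the last three recorded epochs; the function name `should_stop` and the exact comparison rule are assumptions, not a prescribed solution.

```python
def should_stop(train_losses, valid_losses):
    """Return True when the last two epochs look like overfitting."""
    # Two epoch-to-epoch comparisons need at least three recorded epochs
    if len(train_losses) < 3 or len(valid_losses) < 3:
        return False
    t0, t1, t2 = train_losses[-3:]
    v0, v1, v2 = valid_losses[-3:]
    train_improving = t1 < t0 and t2 < t1
    valid_not_improving = v1 >= v0 and v2 >= v1
    return train_improving and valid_not_improving


print(should_stop([0.9, 0.7, 0.5], [0.8, 0.85, 0.9]))
print(should_stop([0.9, 0.7, 0.5], [0.8, 0.7, 0.6]))
```

Inside a training loop, the losses of each epoch would be appended to the two lists and the loop would `break` once the helper returns `True`.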

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## 4. Handling class imbalance
Our data is imbalanced: the first class has twice the population of the second class.

One way of handling imbalanced data is to weight the loss function, so it penalizes errors on the smaller class.

Look at the documentation of the loss function: https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html

Set the weights based on the inverse population of the classes (so the fewer samples a class has, the more its errors will be penalized!)
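
A sketch of inverse-frequency weighting with made-up class counts follows. The counts and the normalisation step are assumptions for illustration; the real counts come from the training data, and the weight tensor has to live on the same device as the model.

```python
import torch
from torch import nn

# Hypothetical counts: class 0 has twice as many samples as class 1
toy_counts = torch.tensor([6.0, 3.0])

# Inverse frequency, normalised so that the weights sum to 1
toy_weights = 1.0 / toy_counts
toy_weights = toy_weights / toy_weights.sum()

# Errors on the rarer class now contribute more to the loss
weighted_criterion = nn.NLLLoss(weight=toy_weights)

toy_log_probs = torch.log_softmax(torch.rand(4, 2), dim=1)
toy_loss = weighted_criterion(toy_log_probs, torch.tensor([0, 1, 0, 1]))
print(toy_weights, toy_loss.item())
```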

In [None]:
tr_data.groupby("label").size()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# ================ EXTRA LEVEL ====================