### Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, and test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [152]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F


data = pd.read_csv("comments.tsv", sep='\t')
texts = data['comment_text'].values
target = data['should_ban'].values
data

Unnamed: 0,should_ban,comment_text
0,0,The picture on the article is not of the actor...
1,1,"Its madness. Shes of Chinese heritage, but JAP..."
2,1,Fuck You. Why don't you suck a turd out of my ...
3,1,God is dead\nI don't mean to startle anyone bu...
4,1,THIS USER IS A PLANT FROM BRUCE PERENS AND GRO...
...,...,...
995,0,rowspan=9 colspan=8|Did Not Qualify
996,0,"""== Disputed and under-referenced ==\n\nI have..."
997,0,Why?\nWhy does this event have its own page? 1...
998,0,"Que? \n\nWas this fat fingers? If not, can yo..."


In [153]:
from nltk.tokenize import TweetTokenizer
from gensim.models import Word2Vec


tokenizer = TweetTokenizer()
sentences = []
for i in range(len(texts)):
    sentences.append(tokenizer.tokenize(texts[i].lower()))
model = Word2Vec(sentences=sentences, min_count=1).wv

In [154]:
max_len = 0
for s in sentences:
    max_len = max(max_len, len(s))
for s in sentences:
    while len(s) < max_len:
        s.append('.')

In [155]:
# Stack word vectors into a [n_texts, emb_dim, max_len] tensor
# (channels-first layout, which is what nn.Conv1d expects).
data = torch.empty((len(sentences), 100, max_len))
for index, s in enumerate(sentences):
    # model[s] looks up every token of the padded sentence at once: [max_len, 100]
    data[index] = torch.from_numpy(model[s]).T

In [156]:
new_target = torch.empty((1000, 2))
new_target[:, 0] = torch.tensor(target == 1, dtype=int)
new_target[:, 1] = torch.tensor(target == 0, dtype=int)
target = new_target

In [162]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

In [170]:
class Data(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.len = self.X.shape[0]
       
    def __getitem__(self, index):
        return self.X[index], self.y[index]
   
    def __len__(self):
        return self.len

In [291]:
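batch_size = 50
# NOTE: both loaders wrap the full (data, target) set rather than the
# X_train/X_test split created above, so the accuracy printed below is
# measured on examples the model has also been trained on.
train_dataloader = torch.utils.data.DataLoader(dataset=Data(data, target), batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=Data(data, target), batch_size=batch_size, shuffle=True)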

In [184]:
class mynn(nn.Module):
    def __init__(self):
        super(mynn, self).__init__()
        self.conv = nn.Conv1d(100, 10, 3)
        self.lin1 = nn.Linear(10, 4)
        self.lin2 = nn.Linear(4, 2)
        
        
    def forward(self, X):
        X = F.max_pool1d(F.relu(self.conv(X)), 1245)
        X = X.view(X.size(0), -1)
        X = F.relu(self.lin1(X))
        X = self.lin2(X)
        return X

In [185]:
model = mynn()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [186]:
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()

In [187]:
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.756


In [320]:
class mynn(nn.Module):
    def __init__(self):
        super(mynn, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [2, 3, 4, 5]])
        self.lin1 = nn.Linear(40, 10)
        self.lin2 = nn.Linear(10, 2)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 2) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = F.relu(self.lin1(X))
        X = self.lin2(X)
        return X

In [321]:
model = mynn()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

In [322]:
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()

In [323]:
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.777


In [283]:
class mynn(nn.Module):
    def __init__(self):
        super(mynn, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [2, 3, 4, 5]])
        self.lin1 = nn.Linear(40, 10)
        self.lin2 = nn.Linear(10, 2)
        self.norm = nn.BatchNorm1d(40)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 2) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm(X)
        X = F.relu(self.lin1(X))
        X = self.lin2(X)
        return X

In [284]:
model = mynn()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.814


In [347]:
batch_size = 50
# NOTE: as before, both loaders wrap the full (data, target) set, not the train/test split.
train_dataloader = torch.utils.data.DataLoader(dataset=Data(data, target), batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=Data(data, target), batch_size=batch_size, shuffle=True)

In [318]:
class nn_batch(nn.Module):
    def __init__(self):
        super(nn_batch, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [3, 4, 5]])
        self.lin1 = nn.Linear(30, 10)
        self.lin2 = nn.Linear(10, 2)
        self.norm = nn.BatchNorm1d(30)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 3) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm(X)
        X = F.relu(self.lin1(X))
        X = self.lin2(X)
        return X

In [319]:
model = nn_batch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.872


In [329]:
class nn_batch(nn.Module):
    def __init__(self):
        super(nn_batch, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [3, 4, 5]])
        self.lin1 = nn.Linear(30, 10)
        self.lin2 = nn.Linear(10, 4)
        self.lin3 = nn.Linear(4, 2)
        self.norm1 = nn.BatchNorm1d(30)
        self.norm2 = nn.BatchNorm1d(10)
        self.norm3 = nn.BatchNorm1d(4)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 3) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm1(X)
        X = F.relu(self.lin1(X))
        X = self.norm2(X)
        X = F.relu(self.lin2(X))
        X = self.norm3(X)
        X = self.lin3(X)
        return X

In [330]:
model = nn_batch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.915


In [331]:
class nn_batch(nn.Module):
    def __init__(self):
        super(nn_batch, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [3, 4, 5]])
        self.lin1 = nn.Linear(30, 10)
        self.lin2 = nn.Linear(10, 4)
        self.lin3 = nn.Linear(4, 2)
        self.norm1 = nn.BatchNorm1d(30)
        self.norm2 = nn.BatchNorm1d(10)
        self.norm3 = nn.BatchNorm1d(4)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.avg_pool1d(x_conv, 1248 - i - 3) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm1(X)
        X = F.relu(self.lin1(X))
        X = self.norm2(X)
        X = F.relu(self.lin2(X))
        X = self.norm3(X)
        X = self.lin3(X)
        return X

In [332]:
model = nn_batch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.715


In [341]:
class nn_batch(nn.Module):
    def __init__(self):
        super(nn_batch, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [3, 4, 5, 6]])
        self.lin1 = nn.Linear(40, 10)
        self.lin2 = nn.Linear(10, 4)
        self.lin3 = nn.Linear(4, 2)
        self.norm1 = nn.BatchNorm1d(40)
        self.norm2 = nn.BatchNorm1d(10)
        self.norm3 = nn.BatchNorm1d(4)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 3) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm1(X)
        X = F.relu(self.lin1(X))
        X = self.norm2(X)
        X = F.relu(self.lin2(X))
        X = self.norm3(X)
        X = self.lin3(X)
        return X

In [342]:
model = nn_batch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.941


In [348]:
class nn_batch(nn.Module):
    def __init__(self):
        super(nn_batch, self).__init__()
        self.conv = nn.ModuleList([nn.Conv1d(100, 10, kernel) for kernel in [4, 5, 6, 7]])
        self.lin1 = nn.Linear(40, 10)
        self.lin2 = nn.Linear(10, 4)
        self.lin3 = nn.Linear(4, 2)
        self.norm1 = nn.BatchNorm1d(40)
        self.norm2 = nn.BatchNorm1d(10)
        self.norm3 = nn.BatchNorm1d(4)
        
        
    def forward(self, X):
        X_conv = [F.relu(conv1d(X)) for conv1d in self.conv]
        X_pool = [F.max_pool1d(x_conv, 1248 - i - 4) for i, x_conv in enumerate(X_conv)]
        X = torch.cat([x_pool.squeeze(dim=2) for x_pool in X_pool], dim=1)
        X = self.norm1(X)
        X = F.relu(self.lin1(X))
        X = self.norm2(X)
        X = F.relu(self.lin2(X))
        X = self.norm3(X)
        X = self.lin3(X)
        return X

In [349]:
model = nn_batch()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        pred = model(X.to(torch.float32))
        loss = criterion(pred, y.to(float))
        loss.backward()
        optimizer.step()
# Accuracy: argmax of the logits vs. the column holding the 1 in each one-hot target row.
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_dataloader:
        outputs = model(X.to(torch.float32))
        _, predictions = torch.max(outputs, 1)
        _, y_true = torch.where(y == 1)
        correct += int((y_true == predictions).sum())
        total += y_true.shape[0]


print(correct / total)

0.952


### Steps:
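* One convolutional layer with 10 output channels (kernel size 3), accuracy = 0.756
* Four parallel convolutions with kernel sizes 2-5, accuracy = 0.777
* Add batch norm after pooling, accuracy = 0.814
* Three parallel convolutions (kernel sizes 3-5) with batch norm, accuracy = 0.872
* Add more linear and batch norm layers, accuracy = 0.915
* Replace max pooling with average pooling, accuracy = 0.715
* Add more kernels: sizes 3-6 give accuracy = 0.941, sizes 4-7 give accuracy = 0.952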

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...
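
The bullets above can be combined in a single module. The following is a minimal sketch, not a tuned architecture: the class name `ParallelConvNet`, the dropout rate and the layer sizes are illustrative assumptions. It takes the maximum over the time axis directly, so the pooling window does not have to be computed from the sequence length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelConvNet(nn.Module):
    """Parallel Conv1d branches -> max over time -> BatchNorm -> Dropout -> linear head."""

    def __init__(self, emb_dim=100, channels=10, kernels=(3, 4, 5), n_classes=2, p_drop=0.3):
        super().__init__()
        # One convolution per kernel size, all applied to the same embeddings.
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, channels, k) for k in kernels])
        hidden = channels * len(kernels)
        self.norm = nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(p_drop)  # p_drop is an illustrative value
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: [batch, emb_dim, time]
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)  # [batch, channels * n_kernels]
        return self.head(self.drop(self.norm(features)))


net = ParallelConvNet()
net.eval()  # dropout off, batch norm uses running statistics
logits = net(torch.randn(8, 100, 50))
```

Calling `net.train()` before the training loop and `net.eval()` before validation matters here, since both dropout and batch norm behave differently in the two modes.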


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).
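
As a rough illustration of these poolings in PyTorch, here is a sketch that assumes activations `h` of shape `[batch, units, time]` and an optional boolean `mask` of shape `[batch, time]` that is `True` on real tokens. The names `masked_mean_pool`, `softmax_pool` and `AttentivePool` are made up for this example.

```python
import torch
import torch.nn as nn


def masked_mean_pool(h, mask):
    """Average over time, ignoring PAD positions (mask is False on PAD)."""
    m = mask.unsqueeze(1).to(h.dtype)                       # [batch, 1, time]
    return (h * m).sum(dim=2) / m.sum(dim=2).clamp(min=1)   # [batch, units]


def softmax_pool(h):
    """out_i = sum_t h_{i,t} * softmax_t(h_{i,t}), independently for each feature i."""
    return (h * torch.softmax(h, dim=2)).sum(dim=2)


class AttentivePool(nn.Module):
    """out_i = sum_t h_{i,t} * Attn(h_t), where Attn is a softmax over dense-layer scores."""

    def __init__(self, units):
        super().__init__()
        self.attn = nn.Linear(units, 1)  # NN_attn: one score per time step

    def forward(self, h, mask=None):
        scores = self.attn(h.transpose(1, 2)).squeeze(-1)   # [batch, time]
        if mask is not None:
            # PAD positions get zero attention weight
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)              # [batch, time]
        return (h * weights.unsqueeze(1)).sum(dim=2)        # [batch, units]
```

Several `AttentivePool` instances applied to the same `h` and concatenated with `torch.cat(..., dim=1)` would give the multi-headed variant mentioned above.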

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You can download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer
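
In PyTorch, the fine-tuning variant might look like the sketch below. The `pretrained` matrix is a random stand-in so that the example runs offline; in practice its rows would be copied from real vectors (for instance a model returned by `gensim.downloader.load`). The toy `vocab` and the `<pad>`/`<unk>` tokens are assumptions of this example as well.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical vocabulary; index 0 is padding, index 1 is "unknown token".
vocab = ["<pad>", "<unk>", "the", "cat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Stand-in for real pre-trained vectors (random here, purely for illustration).
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(len(vocab), 100)).astype(np.float32)
pretrained[0] = 0.0  # keep the PAD vector at zero

# freeze=False lets gradient descent fine-tune the vectors;
# freeze=True would keep them fixed.
emb = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=False, padding_idx=0)

# Out-of-vocabulary tokens ("dog") fall back to the <unk> row.
ids = torch.tensor([[token_to_id.get(t, 1) for t in ["the", "dog", "<pad>"]]])
vectors = emb(ids)  # [batch=1, time=3, emb_dim=100]
```

Sharing the embedding matrix between two text inputs then amounts to calling the same `emb` module on both.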


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification either. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
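
A bidirectional LSTM with max-over-time pooling could be sketched as follows. The class name `BiLSTMClassifier` and the hidden size are illustrative assumptions. Note that `nn.LSTM` with `batch_first=True` expects `[batch, time, emb_dim]`, so a channels-first tensor of shape `[batch, emb_dim, time]` would need `x.transpose(1, 2)` first.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM -> max over time -> linear head."""

    def __init__(self, emb_dim=100, hidden=32, n_classes=2):
        super().__init__()
        # bidirectional=True runs a left-to-right and a right-to-left LSTM
        # and concatenates their outputs along the last axis.
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: [batch, time, emb_dim]
        out, _ = self.lstm(x)              # [batch, time, 2 * hidden]
        pooled = out.max(dim=1).values     # fixed-size vector: [batch, 2 * hidden]
        return self.head(pooled)


rnn = BiLSTMClassifier()
rnn_logits = rnn(torch.randn(4, 12, 100))
```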


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at the [early stopping callback (keras)](https://keras.io/callbacks/#earlystopping) or its counterpart in [pytorch (lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that the validation score has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
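
In plain PyTorch, the early-stopping loop can be written by hand. The sketch below is one possible version under stated assumptions: the function name `train_with_early_stopping`, the `patience` parameter and the toy regression data are all made up for illustration, and the best snapshot is kept in memory with `state_dict()` rather than saved to a file.

```python
import copy
import torch
import torch.nn as nn


def train_with_early_stopping(model, train_batches, val_batches, loss_fn, optimizer,
                              max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs; restore the best weights."""
    best_val, best_state, bad_epochs = float("inf"), None, 0
    history = []  # validation loss per epoch, handy for learning curves
    for epoch in range(max_epochs):
        model.train()
        for X, y in train_batches:
            optimizer.zero_grad()
            loss_fn(model(X), y).backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(X), y).item() for X, y in val_batches) / len(val_batches)
        history.append(val_loss)

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())  # best-on-validation snapshot
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return history


# Toy usage: a linear model trained with MAE (L1) loss on synthetic data.
torch.manual_seed(0)
X = torch.randn(64, 3)
y = X.sum(dim=1, keepdim=True)
train_batches, val_batches = [(X[:48], y[:48])], [(X[48:], y[48:])]
net = nn.Linear(3, 1)
history = train_with_early_stopping(net, train_batches, val_batches, nn.L1Loss(),
                                    torch.optim.SGD(net.parameters(), lr=0.1),
                                    max_epochs=50, patience=5)
```

The returned `history` can be passed straight to `plt.plot(history)` to get a validation learning curve.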
  
Good luck! And may the force be with you!