### Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("./Train_rev1.zip", compression='zip', index_col=None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

In [3]:
import nltk

tokenizer = nltk.tokenize.WordPunctTokenizer()

def preprocess(text):
    return ' '.join(tokenizer.tokenize(str(text).lower()))

data["FullDescription"] = data["FullDescription"].apply(preprocess)
data["Title"] = data["Title"].apply(preprocess)
from collections import Counter

all_words_dataset = (data['Title'] + ' ' + data['FullDescription']).str.split()
all_words = (word for words in all_words_dataset for word in words)

token_counts = Counter(all_words)

min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = [token for token, count in token_counts.items() if count >= min_count]

# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens
token_to_id = {token: idx for idx, token in enumerate(tokens)}

UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

In [4]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors


def make_batch(data, max_len=None, word_dropout=0, device=device, **kw):
    """
    Creates a dict of torch tensors from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title' : int64[batch, title_max_len], ...}
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values
    
    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])
    dropout_mask &= matrix != pad_ix
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])



def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=device, **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch
        
        if not cycle: break

from tqdm.auto import tqdm


def print_metrics(model, data, batch_size=64, epoch=None, name="", device=torch.device('cpu'), writer=None, **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, shuffle=False, device=device, **kw):
            batch_pred = model(batch)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    if writer:
        writer.add_scalar('MSE/test', mse, epoch)
        writer.add_scalar('MAE/test', mae, epoch)
    return mse, mae

In [79]:
class ConvEncoder(nn.Module):
    def __init__(self, hid_size, filter_count, conv_kernel_size=3, **kw):
        super().__init__()

        self.conv = nn.Conv1d(hid_size, filter_count, conv_kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, batch):
        conv = self.conv(batch)
        pool = self.pool(conv).squeeze(-1)

        return pool

import gensim.downloader as api
class TrainedEmbedding(nn.Module):
    def __init__(self, tokens, **kw):
        super().__init__()
        
        model = api.load('glove-twitter-100')
        unk_vector = model.get_vector('unknown')
        def get_vector(token):
            try:
                return model.get_vector(token)
            except KeyError:
                return unk_vector
        weight = [get_vector(token) for idx, token in enumerate(tokens)]
        self.emb = nn.Embedding.from_pretrained(torch.FloatTensor(weight))
        self.emb_size = self.emb.weight.size(1)

    def forward(self, batch):
        return self.emb(batch)

# class LSTMEncoder(nn.Module):
#     def __init__(self, **kw):
#         super().__init__()

#         self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

#     def forward(self, batch):
#         pass

class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64, out_dim=16, drop_p=0.2, **kw):
        super().__init__()

        # Pre-trained GloVe embeddings (frozen by default); they define the embedding size,
        # so the hid_size argument is overridden here.
        self.emb = TrainedEmbedding(tokens)
        hid_size = self.emb.emb_size
        #title_filter_count = out_dim        
        # self.title_conv = nn.Conv1d(hid_size, title_filter_count, 4)
        # self.title_pool = nn.AdaptiveMaxPool1d(1)
        self.title_enc = ConvEncoder(hid_size, out_dim, **kw)
        

        #desc_filter_count = out_dim
        # self.desc_conv = nn.Conv1d(hid_size, desc_filter_count, 4)
        # self.desc_pool = nn.AdaptiveMaxPool1d(1)
        self.desc_enc = ConvEncoder(hid_size, out_dim, **kw)

        cat_out_dim = out_dim
        self.cat_encoder = nn.Linear(n_cat_features, cat_out_dim)

        fc_in_dim = out_dim + out_dim + cat_out_dim
        self.norm = nn.BatchNorm1d(fc_in_dim)
        self.drop = nn.Dropout(drop_p)
        self.fc_output1 = nn.Linear(fc_in_dim, 10)
        self.fc_output2 = nn.Linear(10, 1)
        
    def forward(self, batch):
        title_enc = batch['Title']
        title_enc = self.emb(title_enc).permute(0, 2, 1)
        
        title_enc = self.title_enc(title_enc)

        desc_enc = batch['FullDescription']
        desc_enc = self.emb(desc_enc).permute(0, 2, 1)
        
        desc_enc = self.desc_enc(desc_enc)

        cat_enc = batch['Categorical']
        cat_enc = self.cat_encoder(cat_enc)

        full_cat = torch.cat((title_enc, desc_enc, cat_enc), 1)
        full_cat = self.norm(full_cat)
        full_cat = self.drop(full_cat)

        
        fc = self.fc_output1(full_cat)
        fc = self.fc_output2(fc)
        
        return fc.squeeze(-1)

In [80]:
config = {
    'epochs': 30,
    'batch_size': 128,
    'opt_lr': 1e-4,
    'hid_size': 64,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'out_dim': 32,
    'drop_p': 0.2,
    'conv_kernel_size': 3
}

In [81]:
model = SalaryPredictor(**config).to(device)

In [82]:
from torch.utils.tensorboard import SummaryWriter

EPOCHS = config['epochs']
BATCH_SIZE = config['batch_size']

criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=config['opt_lr'])

writer = SummaryWriter(log_dir='logs')
writer.add_graph(model, make_batch(data_train[:5], **config))

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm(enumerate(
            iterate_minibatches(data_train, **config)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    mse, mae = print_metrics(model, data_val, epoch=epoch, writer=writer, **config)

writer.add_hparams(config, {'mse': mse, 'mae': mae})

writer.close()

epoch: 0
 results:
Mean square error: 4.17268
Mean absolute error: 1.64713
epoch: 1
 results:
Mean square error: 1.01455
Mean absolute error: 0.79577
epoch: 2
 results:
Mean square error: 0.49968
Mean absolute error: 0.55981
epoch: 3
 results:
Mean square error: 0.28921
Mean absolute error: 0.42386
epoch: 4
 results:
Mean square error: 0.20690
Mean absolute error: 0.35731
epoch: 5
 results:
Mean square error: 0.15337
Mean absolute error: 0.30474
epoch: 6
 results:
Mean square error: 0.13077
Mean absolute error: 0.28033
epoch: 7
 results:
Mean square error: 0.11922
Mean absolute error: 0.26615
epoch: 8
 results:
Mean square error: 0.10704
Mean absolute error: 0.25133
epoch: 9
 results:
Mean square error: 0.09940
Mean absolute error: 0.24111
epoch: 10
 results:
Mean square error: 0.09557
Mean absolute error: 0.23491
epoch: 11
 results:
Mean square error: 0.09201
Mean absolute error: 0.23033
epoch: 12
 results:
Mean square error: 0.08731
Mean absolute error: 0.22360
epoch: 13
 results:
Mean square error: 0.08534
Mean absolute error: 0.22011
epoch: 14
 results:
Mean square error: 0.08261
Mean absolute error: 0.21619
epoch: 15
 results:
Mean square error: 0.08047
Mean absolute error: 0.21343
epoch: 16
 results:
Mean square error: 0.07972
Mean absolute error: 0.21167
epoch: 17
 results:
Mean square error: 0.07766
Mean absolute error: 0.20853
epoch: 18
 results:
Mean square error: 0.07596
Mean absolute error: 0.20585
epoch: 19
 results:
Mean square error: 0.07457
Mean absolute error: 0.20377
epoch: 20
 results:
Mean square error: 0.07382
Mean absolute error: 0.20310
epoch: 21
 results:
Mean square error: 0.07279
Mean absolute error: 0.20139
epoch: 22
 results:
Mean square error: 0.07222
Mean absolute error: 0.19982
epoch: 23
 results:
Mean square error: 0.07151
Mean absolute error: 0.19892
epoch: 24
 results:
Mean square error: 0.07095
Mean absolute error: 0.19820
epoch: 25
 results:
Mean square error: 0.07010
Mean absolute error: 0.19701
epoch: 26
 results:
Mean square error: 0.07069
Mean absolute error: 0.19714
epoch: 27
 results:
Mean square error: 0.06940
Mean absolute error: 0.19620
epoch: 28
 results:
Mean square error: 0.06891
Mean absolute error: 0.19473
epoch: 29
 results:
Mean square error: 0.06977
Mean absolute error: 0.19576


In [43]:
def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok
                                                   for i, tok in enumerate(sample_col_tokens)) 

    *predictions_drop_one_token, baseline_pred = model(make_batch(data_drop_one_token, device=device)).detach().cpu()
    diffs = baseline_pred - torch.Tensor(predictions_drop_one_token)
    return list(zip(sample_col_tokens, diffs))

from IPython.display import HTML, display_html


def draw_html(tokens_and_weights, cmap=plt.get_cmap("bwr"), display=True,
              token_template="""<span style="background-color: {color_hex}">{token}</span>""",
              font_style="font-size:14px;"
             ):
    
    def get_color_hex(weight):
        rgba = cmap(1. / (1 + np.exp(float(weight))), bytes=True)
        return '#%02X%02X%02X' % rgba[:3]
    
    tokens_html = [
        token_template.format(token=token, color_hex=get_color_hex(weight))
        for token, weight in tokens_and_weights
    ]
    
    
    raw_html = """<p style="{}">{}</p>""".format(font_style, ' '.join(tokens_html))
    if display:
        display_html(HTML(raw_html))
        
    return raw_html

In [44]:
i = np.random.randint(len(data))
print("Index:", i)
print("Salary (gbp):", np.expm1(model(make_batch(data.iloc[i: i+1], device=device)).detach().cpu()))

tokens_and_weights = explain(model, data.loc[i], "Title")
draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');

tokens_and_weights = explain(model, data.loc[i], "FullDescription")
draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);

Index: 10190
Salary (gbp): tensor([31246.4883])


### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `nn.Conv1d` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...
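The parallel-convolutions idea could look roughly like the sketch below. `ParallelConvEncoder`, its kernel sizes and the ReLU are illustrative assumptions, not part of the seminar code. It expects embeddings in the same `[batch, emb_size, time]` layout that `ConvEncoder` above consumes.

```python
import torch
import torch.nn as nn


class ParallelConvEncoder(nn.Module):
    """Hypothetical encoder: several Conv1d branches over the same embeddings."""

    def __init__(self, emb_size, filter_count, kernel_sizes=(2, 3, 5)):
        super().__init__()
        # padding=k - 1 keeps every branch valid even for sequences shorter than k
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_size, filter_count, k, padding=k - 1) for k in kernel_sizes
        )
        self.out_dim = filter_count * len(kernel_sizes)

    def forward(self, emb):
        # emb: [batch, emb_size, time]
        # max over time for each branch, then concatenate along the channel axis
        pooled = [torch.relu(conv(emb)).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=1)  # [batch, filter_count * n_branches]


encoder = ParallelConvEncoder(emb_size=100, filter_count=16)
features = encoder(torch.randn(4, 100, 7))
print(features.shape)  # torch.Size([4, 48])
```

If you swap this in for `ConvEncoder`, remember that the input size of the following dense layer has to use `out_dim` of the new encoder.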


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The best score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).
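Here is a minimal PyTorch sketch of three of these poolings. The function and class names are made up for illustration. It assumes hidden states shaped `[batch, time, units]` (so a conv output would need `.permute(0, 2, 1)` first), a boolean mask such as `batch['Title'] != PAD_IX`, and at least one real token per row, since a fully padded row would give NaN in the softmax.

```python
import torch
import torch.nn as nn


def masked_mean_pool(h, mask):
    """Average over time, excluding PAD. h: [batch, time, units], mask: [batch, time] bool."""
    mask = mask.unsqueeze(-1).to(h.dtype)
    return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


def softmax_pool(h, mask):
    """Each unit weights its own time steps with a softmax over time."""
    scores = h.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    weights = torch.softmax(scores, dim=1)  # PAD positions get weight 0
    return (h * weights).sum(dim=1)


class AttentivePool(nn.Module):
    """One attention head: NN_attn is a single dense layer producing a score per time step."""

    def __init__(self, units):
        super().__init__()
        self.attn = nn.Linear(units, 1)

    def forward(self, h, mask):
        scores = self.attn(h).squeeze(-1)                  # [batch, time]
        scores = scores.masked_fill(~mask, float("-inf"))  # ignore PAD
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)
        return (h * weights).sum(dim=1)


h = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
pooled = torch.cat(
    [masked_mean_pool(h, mask), softmax_pool(h, mask), AttentivePool(8)(h, mask)], dim=1
)
print(pooled.shape)  # torch.Size([2, 24])
```

Several `AttentivePool` instances applied to the same `h` and concatenated would give the multi-headed variant.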

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer
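For the PyTorch route, fine-tuning could be sketched as below. The random `pretrained` matrix is only a stand-in for real vectors (e.g. the GloVe weights built in `TrainedEmbedding`), and the separate, smaller learning rate for the embeddings is an assumption worth testing, not a recommendation from the seminar.

```python
import torch
import torch.nn as nn

# Stand-in for a real pre-trained matrix of shape [n_tokens, emb_size]
pretrained = torch.randn(1000, 100)

# freeze=False makes the weights trainable; clone() keeps the original matrix intact
emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
head = nn.Linear(100, 1)  # toy regression head, just for the demo

# Parameter groups: gentle updates for embeddings, normal updates for the rest
optimizer = torch.optim.Adam([
    {"params": emb.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

token_ids = torch.randint(0, 1000, (4, 6))            # fake batch of token indices
loss = head(emb(token_ids).mean(dim=1)).pow(2).mean()  # dummy loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(emb.weight.requires_grad)  # True
```

Sharing the matrix between title and description simply means calling the same `emb` module on both inputs, which the `SalaryPredictor` above already does.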


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out they're not useless for classification either. With some tricks, of course...

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
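A rough sketch of the bidirectional recipe above, assuming embeddings shaped `[batch, time, emb_size]` (no `.permute` needed, unlike the conv encoders) and a boolean mask of real tokens. `BiLSTMEncoder` is a hypothetical name. Note that this simple version lets the right-to-left LSTM read PAD tokens first; `torch.nn.utils.rnn.pack_padded_sequence` is the usual way to avoid that.

```python
import torch
import torch.nn as nn


class BiLSTMEncoder(nn.Module):
    """Hypothetical encoder: bidirectional LSTM followed by masked max-over-time pooling."""

    def __init__(self, emb_size, hidden_size):
        super().__init__()
        # bidirectional=True runs both directions and concatenates them along dim=-1
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True, bidirectional=True)
        self.out_dim = 2 * hidden_size

    def forward(self, emb, mask):
        # emb: [batch, time, emb_size]; mask: [batch, time], True for real tokens
        outputs, _ = self.lstm(emb)  # [batch, time, 2 * hidden_size]
        outputs = outputs.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return outputs.max(dim=1).values  # assumes at least one real token per row


encoder = BiLSTMEncoder(emb_size=100, hidden_size=32)
emb = torch.randn(3, 7, 100)
mask = torch.ones(3, 7, dtype=torch.bool)
mask[0, 4:] = False  # pretend the first sequence has only 4 real tokens
encoded = encoder(emb, mask)
print(encoded.shape)  # torch.Size([3, 64])
```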


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that the validation error has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
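If you stay in plain PyTorch, early stopping fits in a few lines. The `EarlyStopping` helper below is a hypothetical sketch, and the MAE values are made up purely to exercise it. The snapshot is kept in memory via `copy.deepcopy` rather than saved to a file.

```python
import copy


class EarlyStopping:
    """Hypothetical helper: tracks the best validation MAE and a snapshot of the weights."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # how many non-improving epochs to tolerate
        self.min_delta = min_delta  # minimal decrease that counts as an improvement
        self.best_mae = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_mae, state):
        """Records one epoch; returns True when training should stop."""
        if val_mae < self.best_mae - self.min_delta:
            self.best_mae, self.best_epoch = val_mae, epoch
            self.best_state = copy.deepcopy(state)  # e.g. model.state_dict()
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation MAE values, only to exercise the helper
fake_val_mae = [0.30, 0.25, 0.22, 0.23, 0.24, 0.225, 0.21]
stopper = EarlyStopping(patience=3)
for epoch, mae in enumerate(fake_val_mae):
    if stopper.step(epoch, mae, state={"epoch": epoch}):
        break

print(stopper.best_epoch, stopper.best_mae)  # 2 0.22
```

In the training loop above, this would mean calling `stopper.step(epoch, mae, model.state_dict())` right after `print_metrics`, and `model.load_state_dict(stopper.best_state)` once the loop ends.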
  
Good luck! And may the force be with you!