### Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)
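
A minimal sketch of what "learning curves in MAE/epoch" could look like. The helper name `plot_learning_curves` and the `histories` dict are assumptions made for illustration; the only numbers used are the validation MAE values that the baseline run logs further down in this notebook.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(histories, ax=None):
    """Plot validation MAE per epoch for several runs; return the best MAE of each run."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    best = {}
    for name, maes in histories.items():
        ax.plot(range(len(maes)), maes, marker="o", label=name)
        best[name] = min(maes)  # compare runs by their best epoch, not their last one
    ax.set_xlabel("epoch")
    ax.set_ylabel("validation MAE")
    ax.legend()
    return best

# One entry per experiment; add a new key for every single change you test.
histories = {"baseline": [0.26378, 0.25240, 0.24821, 0.23713, 0.23155]}
best = plot_learning_curves(histories)
print(best)
```

Keeping one list per experiment makes it easy to overlay runs and to report the best epoch of each.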

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [1]:
# < A whole lot of your code > - models, charts, analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [41]:
import torch

In [2]:
data = pd.read_csv("./Train_rev1.zip", compression='zip', index_col=None)
data.shape

(244768, 12)

In [3]:
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

In [4]:
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

data.sample(3)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,Log1pSalary
220653,72373179,Business Intelligence Analyst,Business Intelligence/BI/SAP BW/Analyst Our ho...,Gloucestershire Tetbury GL8 8,Long Furlong,,permanent,Reed Technology Bristol Permanents,IT Jobs,35000.00 - 45000.00 GBP Annual,40000,jobserve.com,10.59666
156028,71087949,Resoucing Manager East London,Resourcing Manager required for large establis...,East London London South East,South East London,,permanent,Amida Recruitment Limited,Trade & Construction Jobs,25000 - 30000 per annum + Package,27500,careerstructure.com,10.221977
166242,71296561,Sponsorship Sales IT & Tech Conferences,We are looking for a consultative sponsorship ...,North London,North London,full_time,permanent,Gemini Search,Sales Jobs,25000 - 28000 per annum + Benefits + Great Com...,26500,mediaweekjobs.co.uk,10.184937


In [5]:
data[text_columns] = data[text_columns].ffill()  # same forward fill; fillna(method=...) is deprecated in recent pandas

In [6]:
import nltk
#TODO YOUR CODE HERE

tokenizer = nltk.tokenize.WordPunctTokenizer()

data["FullDescription"] = data["FullDescription"].apply(lambda descrp: ' '.join(tokenizer.tokenize(descrp.lower())))
data["Title"] = data["Title"].apply(lambda title: ' '.join(tokenizer.tokenize(title.lower())))


In [8]:
from collections import Counter
token_counts = Counter()

# Count how many times does each token occur in both "Title" and "FullDescription" in total
for index, row in data.iterrows():
    token_counts.update(row['Title'].split())
    token_counts.update(row['FullDescription'].split())

In [9]:
min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = sorted(t for t, c in token_counts.items() if c >= min_count)#TODO<YOUR CODE HERE>

# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens

In [13]:
token_to_id = {token: idx for idx, token in enumerate(tokens)}

In [54]:
UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None, min_len=10):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    max_len = max(max_len, min_len)
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

In [15]:
len(token_to_id)

34158

In [16]:
import gensim.downloader 
embeddings = gensim.downloader.load("fasttext-wiki-news-subwords-300")



In [19]:
counter = 0
for token in tokens:
    if token not in embeddings:
        counter += 1
        
print(counter)

9945


In [25]:
from gensim.models import KeyedVectors
glove_embs = KeyedVectors.load_word2vec_format('glove.6B.50d.txt', binary=False, no_header=True)

In [27]:
counter = 0
for token in tokens:
    if token not in glove_embs:
        counter += 1
        
print(counter)

7935


In [34]:
emb_util_lambda = lambda idx: glove_embs[tokens[idx]] if tokens[idx] in glove_embs else list(np.random.normal(size=50))

In [42]:
embeddings_tensor_glove = torch.FloatTensor([emb_util_lambda(idx) for idx in range(len(tokens))])



In [50]:
embeddings_tensor_glove.shape

torch.Size([34158, 50])

In [51]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

In [52]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))

Train size =  195814
Validation size =  48954


In [75]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'


def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors


def make_batch(data, max_len=None, word_dropout=0.1, device=device):
    """
    Creates a dict of torch tensors from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title': int64[batch, title_max_len], 'FullDescription': int64[batch, desc_max_len], ...}
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len=20)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values
    
    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])
    dropout_mask &= matrix != pad_ix
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])

In [56]:
make_batch(data_train[:3], max_len=10)

{'Title': tensor([[27645, 29893, 33674,     1,     1,     1,     1,     1,     1,     1],
         [29239,   197, 19175, 20042, 15554, 23162,  4051,     1,     1,     1],
         [10609, 30412, 17746,    33,  8705, 29157,    65,     1,     1,     1]],
        device='cuda:0'),
 'FullDescription': tensor([[27645, 29893, 33674, 32939,   982, 27645, 29893, 33674, 16451, 32939],
         [29239,   197, 19175, 20042, 15554, 23162,  4051, 25511,   907,    82],
         [30746, 21956, 20601,  6409, 16451,  8165, 27493,   982, 30412, 17746]],
        device='cuda:0'),
 'Categorical': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'),
 'Log1pSalary': tensor([ 9.7115, 10.4631, 10.7144], device='cuda:0')}

In [58]:
class ConvNextLikeBlock(nn.Module):
    def __init__(self, dim, drop_rate=0.1):
        super().__init__()
        
        self.depthwise_conv = nn.Conv1d(in_channels=dim, out_channels=dim, kernel_size=5, padding=2, groups=dim)  # depthwise conv 5x5, padding 2, dim->dim
        
        self.norm = nn.BatchNorm1d(dim)
        
        self.pointwise_conv1 = nn.Conv1d(in_channels=dim, out_channels=dim*4, kernel_size=1)  # 1x1 conv, dim -> dim*4  YOUR CODE

        self.activation = nn.GELU()
        
        self.pointwise_conv2 = nn.Conv1d(in_channels=dim*4, out_channels=dim, kernel_size=1)  # 1x1 conv, 4*dim -> dim YOUR CODE

        self.dropout = nn.Dropout(drop_rate)

    def forward(self, x):
        input = x
        # sequentially apply to x: depthwise_conv + norm + pointwise_conv1 + activation + pointwise_conv2
        # (layer scale is not implemented in this block)
        x = self.depthwise_conv(x)
        x = self.norm(x)
        x = self.pointwise_conv1(x)
        x = self.activation(x)
        x = self.pointwise_conv2(x)

        x = input + self.dropout(x)
        return x

In [70]:
def create_stem(out_channels, in_channels=50):
    return nn.Sequential(
        nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=2, stride=2, padding=0), # YOUR CODE; conv 2x2, stride 2, padding 0
        nn.BatchNorm1d(out_channels)
    )

def create_downscale_block(in_channels, out_channels):
    return nn.Sequential(
        nn.BatchNorm1d(in_channels),
        nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=2, stride=2, padding=0)  # YOUR CODE: conv 2x2, stride 2, padding 0
    )

In [94]:
class GlobalAveragePool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        
    def forward(self, x):
        return torch.mean(x, dim=-1)

class GlobalMaxPool(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        return x.max(dim=-1).values
    
class GlobalSoftmaxPool(nn.Module):
    def __init__(self):
        super().__init__()
        self.sm = nn.Softmax(dim=-1)

    def forward(self, x):
        # softmax pooling is a weighted sum over time (the weights already sum to 1), not a mean
        return torch.sum(x * self.sm(x), dim=-1)

In [100]:
class TextEncoder(nn.Module):
    def __init__(self, initial_embedding_weights=embeddings_tensor_glove, hid_size=50):
        super().__init__()
        self.main_part = nn.Sequential()
        assert hid_size == initial_embedding_weights.shape[1]
        self.emb = nn.Embedding.from_pretrained(initial_embedding_weights)
        
        self.main_part.add_module('stem', create_stem(64))
        self.main_part.add_module('convnext1', ConvNextLikeBlock(64))
        self.main_part.add_module('downscale1', create_downscale_block(64, 128))
        
        self.main_part.add_module('convnext2', ConvNextLikeBlock(128))
        self.main_part.add_module('downscale2', create_downscale_block(128, 256))
        
        self.main_part.add_module('convnext3', ConvNextLikeBlock(256))
        
        self.maxpool = GlobalMaxPool()
        self.softmaxpool = GlobalSoftmaxPool()

        self.maxpool_norm = nn.BatchNorm1d(256)
        self.final_norm = nn.BatchNorm1d(512)
        
        self.linear = nn.Linear(512, 256)
        self.relu = nn.ReLU()

        
    def forward(self, x):
        x = self.emb(x)
        x.swapaxes_(1, 2)
        x = self.main_part(x)
        maxpool_normed = self.maxpool_norm(self.maxpool(x))
        softmaxpooled = self.softmaxpool(x)
        x = self.final_norm(torch.cat((maxpool_normed, softmaxpooled), dim=1))
        return self.relu(self.linear(x))

    
    

In [101]:
class CategoricalEncoder(nn.Module):
    def __init__(self, features_num, last_size=256):
        super().__init__()
        self.mlp_block = nn.Sequential()
        self.mlp_block.add_module('linear_1', nn.Linear(features_num, 1024))
        self.mlp_block.add_module('relu_1', nn.ReLU())
        self.mlp_block.add_module('linear_2', nn.Linear(1024, 512))
        self.mlp_block.add_module('relu_2', nn.ReLU())
        self.mlp_block.add_module('linear_3', nn.Linear(512, last_size))
        self.mlp_block.add_module('relu_3', nn.ReLU())
        
    def forward(self, x):
        return self.mlp_block(x)

In [102]:
class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64, last_size_cat=256):
        super().__init__()
        #  YOUR CODE HERE
        self.title_enc = TextEncoder()
        self.full_desc_enc = TextEncoder()
        self.cat_enc = CategoricalEncoder(n_cat_features)
        self.mlp_block = nn.Sequential(
            nn.Linear(512 + last_size_cat, 1024),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 1),
        )

        
    def forward(self, batch):
        # YOUR CODE HERE
        title_processed = self.title_enc(batch['Title'])
        full_desc_processed = self.full_desc_enc(batch['FullDescription'])
        cat_features_processed = self.cat_enc(batch['Categorical'])
        concated = torch.cat((title_processed, full_desc_processed, cat_features_processed), 1)
        return self.mlp_block(concated)

        
        

In [103]:
model = SalaryPredictor()

In [104]:
model = SalaryPredictor().to(device)
batch = make_batch(data_train[:100], device=device)
criterion = nn.MSELoss()

dummy_pred = model(batch)
dummy_loss = criterion(dummy_pred.squeeze_(1), batch[TARGET_COLUMN])
print(dummy_pred.shape)
assert dummy_pred.shape == torch.Size([100])
assert len(torch.unique(dummy_pred)) > 20, "model returns suspiciously few unique outputs. Check your initialization"
assert dummy_loss.ndim == 0 and 0. <= dummy_loss <= 250., "make sure you minimize MSE"

torch.Size([100])


In [105]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=device, **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch
        
        if not cycle: break

In [106]:
from tqdm.auto import tqdm

BATCH_SIZE = 16
EPOCHS = 5

In [107]:
def print_metrics(model, data, batch_size=BATCH_SIZE, name="", device=torch.device('cpu'), **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, shuffle=False, device=device, **kw):
            batch_pred = model(batch)
            batch_pred.squeeze_(1)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae


In [108]:
model = SalaryPredictor().to(device)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=device)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        pred.squeeze_(1)

        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val, device=device)

      

epoch: 0


12239it [05:21, 38.02it/s]                           


 results:
Mean square error: 0.11732
Mean absolute error: 0.26378
epoch: 1


12239it [05:19, 38.25it/s]                           


 results:
Mean square error: 0.10971
Mean absolute error: 0.25240
epoch: 2


12239it [05:18, 38.37it/s]                           


 results:
Mean square error: 0.10554
Mean absolute error: 0.24821
epoch: 3


12239it [05:19, 38.25it/s]                           


 results:
Mean square error: 0.09817
Mean absolute error: 0.23713
epoch: 4


12239it [05:19, 38.28it/s]                           


 results:
Mean square error: 0.09462
Mean absolute error: 0.23155


### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...
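
The parallel-convolution idea from the list above could be sketched roughly as follows. The class name `ParallelConvs`, the kernel sizes, and the channel counts are assumptions chosen for illustration, not part of the seminar code.

```python
import torch
import torch.nn as nn

class ParallelConvs(nn.Module):
    """Apply several Conv1d layers with different kernel sizes to the same input
    and concatenate their outputs along the channel axis."""
    def __init__(self, in_channels, out_channels_per_branch=32, kernel_sizes=(2, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, out_channels_per_branch, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        # x: [batch, channels, time]
        seq_len = x.shape[-1]
        # even kernel sizes yield one extra time step with this padding, so crop to seq_len
        outs = [branch(x)[..., :seq_len] for branch in self.branches]
        return torch.cat(outs, dim=1)  # [batch, n_branches * out_channels_per_branch, time]

x = torch.randn(4, 50, 20)  # e.g. 50-dimensional embeddings, 20 tokens
out = ParallelConvs(in_channels=50)(x)
print(out.shape)
```

Each branch sees the same embeddings with a different receptive field, and the concatenated output can be fed to any pooling layer.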


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).
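
In PyTorch, the PAD-aware average and the attentive pooling described above might look roughly like this. It is a sketch under assumptions: the class names `MaskedMeanPool` and `AttentivePool` are made up here, `mask` is a boolean `[batch, time]` tensor that is `True` for real tokens, and every row is assumed to contain at least one real token.

```python
import torch
import torch.nn as nn

class MaskedMeanPool(nn.Module):
    """Average over time, ignoring padded positions."""
    def forward(self, h, mask):
        # h: [batch, channels, time]; mask: [batch, time], True for real tokens
        mask = mask.unsqueeze(1).to(h.dtype)
        return (h * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)

class AttentivePool(nn.Module):
    """Weighted sum over time; the weights come from a dense layer followed by softmax."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Linear(channels, 1)  # plays the role of NN_attn

    def forward(self, h, mask=None):
        scores = self.attn(h.transpose(1, 2)).squeeze(-1)     # [batch, time]
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # PAD gets zero weight
        weights = torch.softmax(scores, dim=-1)
        return (h * weights.unsqueeze(1)).sum(dim=-1)          # [batch, channels]

h = torch.randn(2, 8, 5)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
mean_pool, attn_pool = MaskedMeanPool(), AttentivePool(channels=8)
# concatenating several poolings gives one fixed-size vector per example
pooled = torch.cat([mean_pool(h, mask), attn_pool(h, mask)], dim=1)
print(pooled.shape)
```

Note that if the encoder contains strided convolutions, the mask has to be downsampled to the same length as `h` before pooling.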

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer
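
For PyTorch, fine-tuning pre-trained vectors and sharing one embedding matrix between the two text encoders could be sketched as below. `TinyEncoder` is a hypothetical stand-in for a real text encoder, and the random `pretrained` matrix stands in for something like a GloVe tensor; `padding_idx=1` assumes PAD has index 1, as in this notebook's vocabulary.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained matrix (random here so that the sketch runs on its own).
pretrained = torch.randn(1000, 50)

# freeze=False keeps the pre-trained values as initialization but lets the optimizer update them.
shared_emb = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=1)

class TinyEncoder(nn.Module):
    """Hypothetical encoder that receives an embedding layer instead of creating its own."""
    def __init__(self, emb, out_channels=64):
        super().__init__()
        self.emb = emb
        self.conv = nn.Conv1d(emb.embedding_dim, out_channels, kernel_size=3, padding=1)

    def forward(self, ix):
        x = self.emb(ix).transpose(1, 2)         # [batch, emb_dim, time]
        return self.conv(x).max(dim=-1).values   # [batch, out_channels]

# Both encoders hold the same nn.Embedding object, so its weights are shared and trained jointly.
title_enc = TinyEncoder(shared_emb)
desc_enc = TinyEncoder(shared_emb)

ix = torch.randint(0, 1000, (3, 12))
print(title_enc(ix).shape, desc_enc(ix).shape)
```

A common variation is to keep the embeddings frozen for the first epochs and unfreeze them later, but whether that helps here is something to check on validation.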


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
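
A rough PyTorch sketch of the bidirectional recipe above. The class name `BiLSTMEncoder` and the sizes are assumptions; `nn.LSTM(bidirectional=True)` already runs both directions and concatenates their outputs along the last axis, and max-over-time is just one of the possible poolings.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bidirectional LSTM over embedded tokens, pooled into a fixed-size vector."""
    def __init__(self, emb_dim=50, hid_size=64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hid_size, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: [batch, time, emb_dim], i.e. the output of an embedding layer
        out, _ = self.lstm(x)          # [batch, time, 2 * hid_size], both directions concatenated
        return out.max(dim=1).values   # max over time -> [batch, 2 * hid_size]

encoder = BiLSTMEncoder()
vec = encoder(torch.randn(3, 12, 50))
print(vec.shape)
```

Unlike the convolutional encoder, the LSTM takes `[batch, time, features]` when `batch_first=True`, so no axis swap is needed after the embedding layer.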


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that the validation metric has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
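
Without a callback library, early stopping in plain PyTorch can be written by hand. The sketch below is one possible shape, not the seminar's code: `train_with_early_stopping`, `train_step`, `evaluate`, and `patience` are names assumed for illustration, where `train_step(model)` runs one epoch and `evaluate(model)` returns the validation MAE.

```python
import copy
import torch
import torch.nn as nn

def train_with_early_stopping(model, train_step, evaluate, max_epochs=100, patience=3):
    """Train until validation MAE has not improved for `patience` epochs in a row,
    then restore the best-on-validation weights."""
    best_mae, best_state, bad_epochs, history = float("inf"), None, 0, []
    for epoch in range(max_epochs):
        train_step(model)
        mae = evaluate(model)
        history.append(mae)
        if mae < best_mae:
            best_mae, bad_epochs = mae, 0
            # in-memory snapshot; torch.save(model.state_dict(), path) would persist it to disk
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best_mae, history

# Toy usage with a fake validation curve instead of real training.
fake_maes = iter([0.30, 0.25, 0.26, 0.27, 0.28, 0.20])
toy_model = nn.Linear(2, 1)
best_mae, history = train_with_early_stopping(
    toy_model, train_step=lambda m: None, evaluate=lambda m: next(fake_maes), patience=3
)
print(best_mae, len(history))
```

The returned `history` is exactly what you need for the MAE/epoch learning curve.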
  
Good luck! And may the force be with you!