### The grand quest: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ " plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either pure __tensorflow__ or __keras__. Feel free to adapt the seminar code for your needs.


In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [45]:
data = pd.read_csv("./data/Train_rev1.csv", index_col=None).sample(frac=1)

In [6]:
rubbish_columns = ['Id', 'LocationRaw', 'SalaryRaw', 'SourceName', 'SalaryNormalized']
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]

# Preprocessing

In [47]:
from sklearn.base import TransformerMixin
import sklearn.preprocessing
import nltk
from collections import Counter

In [48]:
def preprocess_general(data):
    new_data = pd.DataFrame(index=data.index)
    for column in text_columns + categorical_columns:
        new_data[column] = data[column]
    new_data = new_data.fillna('NaN')
    return new_data, np.log1p(data['SalaryNormalized']).astype('float32')

## Categorical data

In [49]:
class PreprocessCategorical(TransformerMixin):
    def __init__(self, min_company=40, min_city=20):
        self.le_category = sklearn.preprocessing.LabelEncoder()
        self.le_type = sklearn.preprocessing.LabelEncoder()
        self.city_counter = Counter()
        self.company_counter = Counter()
        self.min_company = min_company
        self.min_city = min_city
        self.mean = {}
        
    def _fit_category(self, data):
        self.le_category.fit(np.concatenate([data['Category'], ['other']]))
        
    def _fit_type(self, data):
        self.le_type.fit(data['ContractTime'] + data['ContractType'])
        
    def _fit_city_company(self, data, what, minimum, target):
        keys = {value if count >= minimum else 'other' for value, count in Counter(data[what]).most_common()}
        tmp_data = pd.DataFrame(index=data.index)
        tmp_data[what] = data[what].apply(lambda value: value if value in keys else 'other')
        tmp_data['Salary'] = target
        self.mean[what] = tmp_data.groupby(what)['Salary'].mean()
        
    def fit(self, data, target):
        self._fit_category(data)
        self._fit_city_company(data, 'LocationNormalized', self.min_city, target)
        self._fit_city_company(data, 'Company', self.min_company, target)
        self._fit_type(data)
        return self
    
    def _transform_category(self, data, new_data):
        new_data['Category'] = self.le_category.transform(data['Category'])
    
    def _transform_type(self, data, new_data):
        new_data['Contract'] = self.le_type.transform(data['ContractTime'] + data['ContractType'])
        
        
    def _transform_city_company(self, data, what, new_data):
        new_data[what] = data[what].apply(lambda value: self.mean[what].get(value, self.mean[what]['other']))
        
    def transform(self, data):
        new_data = pd.DataFrame(index=data.index)
        self._transform_category(data, new_data)
        self._transform_type(data, new_data)
        self._transform_city_company(data, 'LocationNormalized', new_data)
        self._transform_city_company(data, 'Company', new_data)
        for column in data:
            if column not in categorical_columns:
                new_data[column] = data[column]
        return new_data
        

## Text data

In [50]:
class PreprocessText(TransformerMixin):
    def __init__(self, min_count = 10):
        self.tokenizer = nltk.tokenize.WordPunctTokenizer()
        self.min_count = min_count
        
    def fit(self, data, target=None):
        token_counts = Counter(' '.join(data['Title']).split())
        token_counts += Counter(' '.join(data['FullDescription']).split())
        self.tokens = ["PAD", "UNK"] + [token for token, count in token_counts.items() if count >= self.min_count]
        self.token_to_id = {token: index for index, token in enumerate(self.tokens)}
        return self
    
    def transform(self, data):
        new_data = pd.DataFrame(index=data.index)
        for column in data:
            new_data[column] = data[column].apply(
                lambda text: list(map(lambda word: self.token_to_id.get(word, 0), 
                                      self.tokenizer.tokenize(str(text).lower()))
                                 )
            ) if column in text_columns else data[column]
        return new_data

## Applying preprocessing

In [3]:
from sklearn.model_selection import train_test_split

In [52]:
data, target = preprocess_general(data)
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.3, random_state=325)

In [53]:
prep_text = PreprocessText()
prep_cat = PreprocessCategorical()
prep_text.fit(data_train)
prep_cat.fit(data_train, target_train)

<__main__.PreprocessCategorical at 0x7feea23c3c18>

In [54]:
data_train = prep_text.transform(prep_cat.transform(data_train))
data_test = prep_text.transform(prep_cat.transform(data_test))
vocab = prep_text.tokens

In [55]:
data_train.to_csv('data/data_prep_train.csv')
data_test.to_csv('data/data_prep_test.csv')
target_train.to_csv('data/target_train.csv')
target_test.to_csv('data/target_test.csv')
pd.DataFrame(vocab).to_csv('data/vocab.csv')

In [7]:
import pandas as pd
import json

data_train = pd.read_csv('data/data_prep_train.csv', index_col=0)
data_test = pd.read_csv('data/data_prep_test.csv', index_col=0)
for data in data_train, data_test:
    for column in text_columns:
        data[column] = data[column].apply(json.loads)
target_train = pd.read_csv('data/target_train.csv', index_col=0)
target_test = pd.read_csv('data/target_test.csv', index_col=0)
vocab = pd.read_csv('data/vocab.csv', index_col=0).values[:, 0]

In [108]:
data_train.index

Int64Index([ 28275, 146657,  73853, 158965, 188818,  63645, 223710, 148653,
             49666,  88847,
            ...
            241556, 135564,   3920,  37822, 214359,  97894, 114385,  98588,
            170890, 165969],
           dtype='int64', length=171337)

In [1]:
index = [145870, 60044, 84190, 145699, 19529, 78946, 6452, 97752, 119591, 123033, 58918, 19663, 114791, 15335, 39073, 1725, 52149, 59793, 115934, 110450, 141889, 139812, 50591, 77724, 74148, 72997, 86634, 50457, 136237, 167056, 111, 78692, 132171, 34982, 21299, 3690, 26901, 60926, 35092, 134985, 115368, 144064, 99252, 129396, 128598, 22202, 142663, 131428, 66701, 44634, 37555, 6947, 62270, 107257, 158569, 36195, 87638, 125, 8300, 69535, 81779, 37398, 57211, 31072, 26863, 126242, 60237, 43676, 97142, 122028, 30173, 120316, 13048, 155096, 88460, 121406, 157502, 12709, 161977, 1584, 30009, 143710, 110077, 69880, 142797, 81611, 125105, 17696, 62001, 87536, 40470, 137905, 136452, 143030, 95182, 164850, 31441, 72614, 41581, 113562]

In [11]:
target_train.iloc[index]

Unnamed: 0_level_0,10.532123
28275,Unnamed: 1_level_1
177397,10.463132
75929,9.564583
4766,10.021315
204734,10.463132
12920,10.915107
52646,10.221977
140153,10.308986
120476,9.113939
127723,9.952325
3697,10.571342


In [9]:
target_train[index]

KeyError: '[145870  60044  84190 145699  19529  78946   6452  97752 119591 123033\n  58918  19663 114791  15335  39073   1725  52149  59793 115934 110450\n 141889 139812  50591  77724  74148  72997  86634  50457 136237 167056\n    111  78692 132171  34982  21299   3690  26901  60926  35092 134985\n 115368 144064  99252 129396 128598  22202 142663 131428  66701  44634\n  37555   6947  62270 107257 158569  36195  87638    125   8300  69535\n  81779  37398  57211  31072  26863 126242  60237  43676  97142 122028\n  30173 120316  13048 155096  88460 121406 157502  12709 161977   1584\n  30009 143710 110077  69880 142797  81611 125105  17696  62001  87536\n  40470 137905 136452 143030  95182 164850  31441  72614  41581 113562] not in index'

# Batches iteration

In [13]:
def make_input(data):
    result = {}
    for column in ["Title", "FullDescription"]:
        max_len = max(map(len, data[column]))
        result[column] = np.zeros([data.shape[0], max_len])
        for index, line in enumerate(data[column]):
            result[column][index][:len(line)] = np.array(line)
            
    for col in ["Category", "Contract", "LocationNormalized", "Company"]:
        result[col] = np.array(data[col].values).reshape(data.shape[0])
    return result

def iter_batches(data, target=None, batch_size=100, shuffle=True, cycle=False):
    
    while True:
        indices = np.arange(data.shape[0])
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(data), batch_size):
            index = indices[start:start+batch_size]
            if target is not None:
                yield make_input(data.iloc[index]), target.iloc[index]
            else:
                yield make_input(data.iloc[index])
                
        if not cycle: break


# Model

In [20]:
import keras
import keras.layers as L

Using TensorFlow backend.


In [27]:
def create_categorical_model(hid_size=32, category_size=15, contract_size=6):
    category = L.Input(shape=(1,), name="Category")
    contract = L.Input(shape=(1,), name="Contract")
    location = L.Input(shape=(1,), name="LocationNormalized")
    company = L.Input(shape=(1,), name="Company")
    emb_category = L.Reshape((category_size,))(L.Embedding(29, category_size)(category))
    emb_contract = L.Reshape((contract_size,))(L.Embedding(9, contract_size)(contract))
    hidden_0 = L.Concatenate()([emb_category, emb_contract, location, company])
    hidden_1 = L.Dense(units=hid_size, activation='relu')(hidden_0)
    hidden_2 = L.Dense(units=hid_size, activation='relu')(hidden_1)
    output = L.Dense(units=1)(hidden_2)
    model = keras.models.Model(inputs=[category, contract, location, company], outputs=output)
    return model

def create_textual_model(emb_size=128):
    vocab_size=len(vocab)
    
    title = L.Input(shape=(None,), name='Title')
    descr = L.Input(shape=(None,), name='FullDescription')
    emb_title = L.Embedding(vocab_size, emb_size)(title)
    emb_descr = L.Embedding(vocab_size, emb_size)(descr)
    conv_title = L.Conv1D(kernel_size=(2,), filters=32, activation='relu')(emb_title)
    conv_descr = L.Conv1D(kernel_size=(4,), filters=64, activation='relu')(emb_descr)
    pool_title = L.GlobalMaxPool1D()(conv_title)
    pool_descr = L.GlobalMaxPool1D()(conv_descr)
    recur_title_forward = L.LSTM(units=16, activation='relu')(emb_title)
    recur_descr_forward = L.LSTM(units=32, activation='relu')(emb_descr)
    recur_title_backward = L.LSTM(units=16, go_backwards=True, activation='relu')(emb_title)
    recur_descr_backward = L.LSTM(units=32, go_backwards=True, activation='relu')(emb_descr)
    hidden_0 = L.Concatenate()([
        pool_title, pool_descr, 
        recur_title_forward, recur_descr_forward, 
        recur_title_backward, recur_descr_backward
    ])
    hidden_1 = L.Dense(emb_size, activation='relu')(hidden_0)
    output = L.Dense(1)(hidden_1)
    model = keras.models.Model(inputs=[title, descr], outputs=output)
    return model


In [18]:
def create_testing_model(emb_size=10):
    vocab_size=len(vocab)
    title = L.Input(shape=(None,), name='Title')
    emb_title = L.Embedding(vocab_size, emb_size)(title)
    conv_title = L.Conv1D(kernel_size=(2, ), filters=64, activation='relu')(emb_title)
    pool_title = L.GlobalMaxPool1D()(conv_title)
    output = L.Dense(1)(pool_title)
    model = keras.models.Model(inputs=title, outputs=output)
    return model

In [28]:
model = create_textual_model(emb_size=10)
model.summary()
model.compile(optimizer='adam', loss='mean_squared_error')

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Title (InputLayer)              (None, None)         0                                            
__________________________________________________________________________________________________
FullDescription (InputLayer)    (None, None)         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 10)     630430      Title[0][0]                      
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 10)     630430      FullDescription[0][0]            
__________________________________________________________________________________________________
conv1d_2 (

In [30]:
steps_per_epoch = 100
model.fit_generator(
    iter_batches(data_train, target_train),
    validation_data=iter_batches(data_test, target_test),
    steps_per_epoch=steps_per_epoch,
    validation_steps=10,
    epochs=10
)

Epoch 1/10

KeyboardInterrupt: 

In [377]:
next(iterate_minibatches(data_prep_train, target_train))

TypeError: 'Series' object cannot be interpreted as an integer

### A short report

Please tell us what you did and how did it work.

`<YOUR_TEXT_HERE>`, i guess...

In [397]:


model.fit(make_input(data_prep_train), 
          target_train, 
          validation_data=(make_input(data_prep_test), target_test), 
          epochs=3
         )

KeyError: 233077

In [206]:
for i, target in zip(data_prep_train.index, target_train):
    print(model.predict(make_input(data_prep_train.loc[[i]]))[0][0], target)
    input()

9.920802 9.784760475158691

10.574045 10.96959400177002

10.624446 10.915106773376465

10.527648 10.389025688171387

10.514307 9.615805625915527

10.631734 10.95955753326416

10.673228 10.505094528198242


KeyboardInterrupt: 

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time - our `L.GlobalMaxPool1D`
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i, t} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{j, \tau}} } }$$

* Attentive pooling
$$ out_{i, t} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

, where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not inlude those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use a pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not want to use __`.get_keras_embedding()`__ method for word2vec
* Use the same embedding matrix in title and desc vectorizer


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks of course..

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be good idea to mix convolutions and recurrent layers differently for title and description


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback](https://keras.io/callbacks/#earlystopping).
  * In short, train until you notice that validation
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
  
Good luck! And may the force be with you!