### The grand quest: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ " plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either pure __tensorflow__ or __keras__. Feel free to adapt the seminar code for your needs.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("./data/Train_rev1.csv", index_col=None).sample(frac=1)

In [3]:
rubbish_columns = ['Id', 'LocationRaw', 'SalaryRaw', 'SourceName', 'SalaryNormalized']
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]

# Preprocessing

In [4]:
from sklearn.base import TransformerMixin
import sklearn.preprocessing
import nltk
from collections import Counter

In [5]:
def preprocess_general(data):
    new_data = pd.DataFrame(index=data.index)
    for column in text_columns + categorical_columns:
        new_data[column] = data[column]
    new_data = new_data.fillna('NaN')
    return new_data, np.log1p(data['SalaryNormalized']).astype('float32')

## Categorical data

In [6]:
class PreprocessCategorical(TransformerMixin):
    def __init__(self, min_company=40, min_city=20):
        self.le_category = sklearn.preprocessing.LabelEncoder()
        self.le_type = sklearn.preprocessing.LabelEncoder()
        self.city_counter = Counter()
        self.company_counter = Counter()
        self.min_company = min_company
        self.min_city = min_city
        self.mean = {}
        
    def _fit_category(self, data):
        self.le_category.fit(np.concatenate([data['Category'], ['other']]))
        
    def _fit_type(self, data):
        self.le_type.fit(data['ContractTime'] + data['ContractType'])
        
    def _fit_city_company(self, data, what, minimum, target):
        keys = {value if count >= minimum else 'other' for value, count in Counter(data[what]).most_common()}
        tmp_data = pd.DataFrame(index=data.index)
        tmp_data[what] = data[what].apply(lambda value: value if value in keys else 'other')
        tmp_data['Salary'] = target
        self.mean[what] = tmp_data.groupby(what)['Salary'].mean()
        
    def fit(self, data, target):
        self._fit_category(data)
        self._fit_city_company(data, 'LocationNormalized', self.min_city, target)
        self._fit_city_company(data, 'Company', self.min_company, target)
        self._fit_type(data)
        return self
    
    def _transform_category(self, data, new_data):
        new_data['Category'] = self.le_category.transform(data['Category'])
    
    def _transform_type(self, data, new_data):
        new_data['Contract'] = self.le_type.transform(data['ContractTime'] + data['ContractType'])
        
        
    def _transform_city_company(self, data, what, new_data):
        new_data[what] = data[what].apply(lambda value: self.mean[what].get(value, self.mean[what]['other']))
        
    def transform(self, data):
        new_data = pd.DataFrame(index=data.index)
        self._transform_category(data, new_data)
        self._transform_type(data, new_data)
        self._transform_city_company(data, 'LocationNormalized', new_data)
        self._transform_city_company(data, 'Company', new_data)
        for column in data:
            if column not in categorical_columns:
                new_data[column] = data[column]
        return new_data
        

## Text data

In [7]:
class PreprocessText(TransformerMixin):
    def __init__(self, min_count = 10):
        self.tokenizer = nltk.tokenize.WordPunctTokenizer()
        self.min_count = min_count
        
    def fit(self, data, target=None):
        token_counts = Counter(' '.join(data['Title']).split())
        token_counts += Counter(' '.join(data['FullDescription']).split())
        self.tokens = ["PAD", "UNK"] + [token for token, count in token_counts.items() if count >= self.min_count]
        self.token_to_id = {token: index for index, token in enumerate(self.tokens)}
        return self
    
    def transform(self, data):
        new_data = pd.DataFrame(index=data.index)
        for column in data:
            new_data[column] = data[column].apply(
                lambda text: list(map(lambda word: self.token_to_id.get(word, 0), 
                                      self.tokenizer.tokenize(str(text).lower()))
                                 )
            ) if column in text_columns else data[column]
        return new_data

## Applying preprocessing

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
data, target = preprocess_general(data)
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.3, random_state=325)

In [10]:
prep_text = PreprocessText()
prep_cat = PreprocessCategorical()
prep_text.fit(data_train)
prep_cat.fit(data_train, target_train)

<__main__.PreprocessCategorical at 0x7fb94049fe10>

In [11]:
data_prep_train = prep_text.transform(prep_cat.transform(data_train))
data_prep_test = prep_text.transform(prep_cat.transform(data_test))

In [12]:
data_prep_train.to_csv('data/data_prep_train.csv')
data_prep_test.to_csv('data/data_prep_test.csv')
target_train.to_csv('data/target_train.csv')
target_test.to_csv('data/target_test.csv')

In [20]:
data_prep_train.head(5)

Unnamed: 0,Category,Contract,LocationNormalized,Company,Title,FullDescription
208221,15,1,10.206763,10.349338,"[53046, 39461, 58514]","[53046, 39461, 40384, 47312, 30875, 39461, 585..."
108637,11,3,10.109263,10.72516,"[2560, 43962, 23460, 29577]","[2560, 43962, 23460, 0, 16588, 11192, 23826, 1..."
86363,14,6,10.350903,10.618861,"[18976, 30453, 37004, 34174, 56318, 0]","[18976, 30453, 37004, 34174, 56318, 0, 12746, ..."
21821,8,6,10.534687,10.274894,"[3221, 2321, 6533]","[3221, 6046, 46209, 62312, 34174, 21891, 4038,..."
156983,8,7,10.315285,10.482319,"[0, 51220]","[38991, 30453, 2091, 44206, 32096, 58837, 5815..."


In [344]:
data.iloc[0:10]

Unnamed: 0,Title,FullDescription,Category,Company,LocationNormalized,ContractType,ContractTime
377,Band 7 Paediatric Occupational Therapist North...,Pulse are urgently looking to recruit a Band 7...,Healthcare & Nursing Jobs,,London,full_time,
175325,Senior Mechanical Engineer Water Operations,Senior Mechanical Engineer Water Operations B...,Engineering Jobs,Executive Recruitment Services,Bristol,,permanent
53815,Sales and Marketing Manager,Sales and Marketing Manager Profile Yolk Recru...,Travel Jobs,Yolk Recruitment,UK,,permanent
218504,C++ Developer FX Derivatives,"Job Role: C++ Developer Location: London, City...",IT Jobs,Client Server Ltd.,London,,permanent
199858,Retail Store Manager,Topps Tiles is the No.**** Tiling Specialist i...,Customer Services Jobs,TOPPS TILES,Bodmin,,permanent
30349,Shea Gas Joiners,must have valid tickets **** 6 weeks work This...,Trade & Construction Jobs,RMF Construction Services Ltd.,Lymm,,contract
150701,Registration Services Advisor,This successful company based in Woking are se...,Admin Jobs,Faith Recruitment,Woking,part_time,
117366,Care Worker St Ives and Surrounding villages,"Carer ‘I love being able to help others, no ma...",Customer Services Jobs,Allied Healthcare,UK,,permanent
164432,Principle Design Engineer,This World leading manufacturer of test machin...,IT Jobs,,High Wycombe,,permanent
161058,Application Sales Engineer Capital Equipment,Application Sales Engineer Capital Equipment ...,Sales Jobs,ATA,North East England,,permanent


# Batches iteration

In [403]:
def make_input(data):
    result = {}
    for column in ["Title", "FullDescription"]:
        max_len = max(map(len, data[column]))
        result[column] = np.zeros([data.shape[0], max_len])
        for index, line in enumerate(data[column]):
            result[column][index][:len(line)] = line
            
    for col in ["Category", "Contract", "LocationNormalized", "Company"]:
        result[col] = np.array(data[col].values).reshape(data.shape[0])
    return result

def iter_batches(data, target=None, batch_size=100, shuffle=True, cycle=False):
    
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            index = indices[start:start+batch_size]
            if target is not None:
                yield make_input(data.iloc[index]), target[index]
            else:
                yield make_input(data.iloc[index])
                
        if not cycle: break


# Model

In [371]:
import keras
import keras.layers as L

In [372]:
def create_categorical_model(hid_size=32, category_size=15, contract_size=6):
    category = L.Input(shape=(1,), name="Category")
    contract = L.Input(shape=(1,), name="Contract")
    location = L.Input(shape=(1,), name="LocationNormalized")
    company = L.Input(shape=(1,), name="Company")
    emb_category = L.Reshape((category_size,))(L.Embedding(29, category_size)(category))
    emb_contract = L.Reshape((contract_size,))(L.Embedding(9, contract_size)(contract))
    hidden_0 = L.Concatenate()([emb_category, emb_contract, location, company])
    hidden_1 = L.Dense(units=hid_size, activation='relu')(hidden_0)
    hidden_2 = L.Dense(units=hid_size, activation='relu')(hidden_1)
    output = L.Dense(units=1)(hidden_2)
    model = keras.models.Model(inputs=[category, contract, location, company], outputs=output)
    return model

def create_textual_model(emb_size=128):
    vocab_size=len(prep_text.tokens)
    
    title = L.Input(shape=(None,), name='Title')
    descr = L.Input(shape=(None,), name='FullDescription')
    emb_title = L.Embedding(vocab_size, emb_size)(title)
    emb_descr = L.Embedding(vocab_size, emb_size)(descr)
    conv_title = L.Conv1D(kernel_size=(2,), filters=32, activation='relu')(emb_title)
    conv_descr = L.Conv1D(kernel_size=(4,), filters=64, activation='relu')(emb_descr)
    pool_title = L.GlobalMaxPool1D()(conv_title)
    pool_descr = L.GlobalMaxPool1D()(conv_descr)
    print(title.shape)
    recur_title_forward = L.LSTM(units=16, activation='relu')(emb_title)
    recur_descr_forward = L.LSTM(units=32, activation='relu')(emb_descr)
    recur_title_backward = L.LSTM(units=16, go_backwards=True, activation='relu')(emb_title)
    recur_descr_backward = L.LSTM(units=32, go_backwards=True, activation='relu')(emb_descr)
    hidden_0 = L.Concatenate()([
        pool_title, pool_descr, 
        recur_title_forward, recur_descr_forward, 
        recur_title_backward, recur_descr_backward
    ])
    hidden_1 = L.Dense(emb_size, activation='relu')(hidden_0)
    output = L.Dense(1)(hidden_1)
    model = keras.models.Model(inputs=[title, descr], outputs=output)
    return model


In [373]:
def create_testing_model(emb_size=10):
    vocab_size=len(prep_text.tokens)
    title = L.Input(shape=(None,), name='Title')
    emb_title = L.Embedding(vocab_size, emb_size)(title)
    conv_title = L.Conv1D(kernel_size=(2, ), filters=64, activation='relu')(emb_title)
    pool_title = L.GlobalMaxPool1D()(conv_title)
    output = L.Dense(1)(pool_title)
    model = keras.models.Model(inputs=title, outputs=output)
    return model

In [380]:
model = create_testing_model()
model.summary()
model.compile(optimizer='adam', loss='mean_squared_error')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Title (InputLayer)           (None, None)              0         
_________________________________________________________________
embedding_118 (Embedding)    (None, None, 10)          630960    
_________________________________________________________________
conv1d_46 (Conv1D)           (None, None, 64)          1344      
_________________________________________________________________
global_max_pooling1d_17 (Glo (None, 64)                0         
_________________________________________________________________
dense_77 (Dense)             (None, 1)                 65        
Total params: 632,369
Trainable params: 632,369
Non-trainable params: 0
_________________________________________________________________


In [384]:
next(iter_batches(data, train_test_split))

AttributeError: 'function' object has no attribute 'shape'

In [404]:
steps_per_epoch = 100
model.fit_generator(
    iter_batches(data_prep_train, target_train),
    validation_data=iter_batches(data_prep_test, target_test),
    steps_per_epoch=steps_per_epoch,
    validation_steps=10,
    epochs=3
)

Epoch 1/3


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]


Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb9336ede80>

In [377]:
next(iterate_minibatches(data_prep_train, target_train))

TypeError: 'Series' object cannot be interpreted as an integer

In [310]:
model = create_textual_model()
model.summary()
# model.compile(optimizer='adam', loss='mean_squared_error')

(?, ?)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Title (InputLayer)              (None, None)         0                                            
__________________________________________________________________________________________________
FullDescription (InputLayer)    (None, None)         0                                            
__________________________________________________________________________________________________
embedding_110 (Embedding)       (None, None, 128)    8076288     Title[0][0]                      
__________________________________________________________________________________________________
embedding_111 (Embedding)       (None, None, 128)    8076288     FullDescription[0][0]            
__________________________________________________________________________________________________
con

In [397]:


model.fit(make_input(data_prep_train), 
          target_train, 
          validation_data=(make_input(data_prep_test), target_test), 
          epochs=3
         )

KeyError: 233077

In [206]:
for i, target in zip(data_prep_train.index, target_train):
    print(model.predict(make_input(data_prep_train.loc[[i]]))[0][0], target)
    input()

9.920802 9.784760475158691

10.574045 10.96959400177002

10.624446 10.915106773376465

10.527648 10.389025688171387

10.514307 9.615805625915527

10.631734 10.95955753326416

10.673228 10.505094528198242


KeyboardInterrupt: 

### A short report

Please tell us what you did and how did it work.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time - our `L.GlobalMaxPool1D`
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i, t} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{j, \tau}} } }$$

* Attentive pooling
$$ out_{i, t} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

, where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not inlude those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use a pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not want to use __`.get_keras_embedding()`__ method for word2vec
* Use the same embedding matrix in title and desc vectorizer


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks of course..

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be good idea to mix convolutions and recurrent layers differently for title and description


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback](https://keras.io/callbacks/#earlystopping).
  * In short, train until you notice that validation
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
  
Good luck! And may the force be with you!