# Notebook 5: Model Training
Although a Novelty Score can be calculated with the white-box method detailed in the previous Jupyter Notebook, we would like to predict a paper's novelty as close to real time as possible.

This notebook contains the code used to train several machine learning models for that task.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd

from src import preprocessing

from tqdm import tqdm, tqdm_notebook
tqdm.pandas()

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

## Load training and validation data

In [3]:
train = pd.read_pickle('data/dataframes/main_df_scored.pkl')
valid = (
    pd
    .read_pickle('data/dataframes/valid_scored.pkl')
    .pipe(lambda d: d[d.notna().all(axis=1)])
)
display(train.head(), valid.head())

Unnamed: 0,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics,topic_score,author_score,score
0,Learned feature representations and sub-phonem...,"[Fred Richardson, Douglas Reynolds, Najim Dehak]",3,4,"[cs.CL, cs.CV, cs.LG, cs.NE, stat.ML]",A Unified Deep Neural Network for Speaker and ...,2015,2015-03-04,A Unified Deep Neural Network for Speaker and ...,"(3, 12, 19)",0.864535,0.301551,0.583043
1,We propose a simple neural network model to de...,"[Muhammad Ghifary, W. Kleijn, Mengjie Zhang]",21,9,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Domain Adaptive Neural Networks for Object Rec...,2014,2014-09-21,Domain Adaptive Neural Networks for Object Rec...,"(3, 4, 9)",1.021247,0.338184,0.679716
2,Recent studies have demonstrated the power of ...,"[Lionel Pigou, Aäron Oord, Sander Dieleman, Mi...",5,6,"[cs.CV, cs.AI, cs.LG, cs.NE, stat.ML]",Beyond Temporal Pooling: Recurrence and Tempor...,2015,2015-05-06,Beyond Temporal Pooling: Recurrence and Tempor...,"(3, 9, 12)",0.556297,0.245878,0.401087
3,"In this paper, we address the task of Optical ...","[Rakesh Achanta, Trevor Hastie]",20,9,"[stat.ML, cs.AI, cs.CV, cs.LG, cs.NE]",Telugu OCR Framework using Deep Learning,2015,2015-09-20,Telugu OCR Framework using Deep Learning. In ...,"(1, 3, 19)",0.760365,0.739321,0.749843
4,Recent progress in using recurrent neural netw...,"[Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas...",27,2,"[stat.ML, cs.AI, cs.CL, cs.CV, cs.LG]",Describing Videos by Exploiting Temporal Struc...,2015,2015-02-27,Describing Videos by Exploiting Temporal Struc...,"(3, 9, 19)",0.477121,0.449995,0.463558


Unnamed: 0,abstract,authors,day,month,tags,title,year,publish_date,title+abstract,top_topics,author_score,score
0,"Real-time, accurate, and robust pupil detectio...","[Thiago Santini, Wolfgang Fuhl, Enkelejda Kasn...",24,12,"[cs.CV, cs.HC]",PuRe: Robust pupil detection for real-time per...,2017,2017-12-24,PuRe: Robust pupil detection for real-time per...,"(0, 9, 16)",0.289215,0.252405
2,"The Stixel World is a medium-level, compact re...","[Daniel Hernandez-Juarez, Antonio Espinosa, Da...",13,10,[cs.CV],GPU-accelerated real-time stixel computation,2016,2016-10-13,GPU-accelerated real-time stixel computation. ...,"(1, 9, 16)",0.390351,0.736584
3,Endowing a chatbot with personality or an iden...,"[Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfa...",9,6,[cs.CL],Assigning personality/identity to a chatting m...,2017,2017-09-06,Assigning personality/identity to a chatting m...,"(4, 6, 15)",0.86882,0.33494
4,We present a matrix factorization algorithm th...,"[Arthur Mensch, Julien Mairal, Gaël Varoquaux,...",30,11,"[math.OC, cs.LG, stat.ML]",Subsampled online matrix factorization with co...,2016,2016-11-30,Subsampled online matrix factorization with co...,"(0, 10, 11)",0.705347,0.695043
5,We present an approximation algorithm that tak...,[Matthew Streeter],21,2,"[cs.LG, cs.AI, cs.NE]",Approximation Algorithms for Cascading Predict...,2018,2018-02-21,Approximation Algorithms for Cascading Predict...,"(1, 4, 16)",0.517207,0.459156


---
# Model Training
I trained six common machine learning models, using linear regression as the benchmark. Because the goal is to predict a paper's novelty as close to real time as possible, the forecast horizon (how many months of subsequent papers are taken into account) is treated as a hyperparameter.

With a smaller forecast horizon, the model is trained to score papers that were published relatively recently; with a larger forecast horizon, it scores papers that have been out for longer.

### Predictors
I have chosen three features to predict the Novelty Score (label). They are:
1. Author Score: We are able to reuse this metric since the information is readily available.
2. Topic Combinations: This categorical feature is useful because documents were grouped by Topic Combination to find the Topic Score.
3. Modified Topic Score: The papers used to calculate this modified Topic Score are a subset of all the papers in a topic combination, in that papers published beyond the forecast horizon are ignored. For example, for a paper $p$ published at time $t$ in a topic combination, the modified Topic Score for $p$ includes all papers that preceded it, as well as papers that came after $p$, so long as their publish date is within the forecast horizon (i.e. $\leq t + h$, where $h$ is the forecast horizon).
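The paper selection described in item 3 can be sketched as follows. This is only an illustration and not the implementation in `preprocessing.featurize_publish_date`: the helper name `lookahead_counts` and the toy data are hypothetical, and it assumes the quantity of interest is the number of papers in the same topic combination published on or before $t$ plus the forecast horizon (counting the paper itself).

```python
import numpy as np
import pandas as pd


def lookahead_counts(df, n_months):
    """Hypothetical helper: for each paper, count the papers in the same
    topic combination published on or before publish_date + n_months."""
    horizon = pd.DateOffset(months=n_months)
    counts = pd.Series(0, index=df.index, dtype=int)
    for _, group in df.groupby('top_topics'):
        # Sorted publish dates of every paper in this topic combination
        dates = np.sort(group['publish_date'].values)
        # Cutoff for each paper: its own publish date plus the horizon
        cutoffs = (group['publish_date'] + horizon).values
        # side='right' gives the number of dates <= cutoff
        counts.loc[group.index] = np.searchsorted(dates, cutoffs, side='right')
    return counts


# Toy data (made up for illustration only)
papers = pd.DataFrame({
    'top_topics': ['A', 'A', 'A', 'B'],
    'publish_date': pd.to_datetime(
        ['2015-01-01', '2015-02-15', '2015-06-01', '2015-01-10']),
})
print(papers.assign(lookahead_1m=lookahead_counts(papers, 1),
                    lookahead_3m=lookahead_counts(papers, 3)))
```

A larger horizon can only keep or increase each count, which is why the horizon trades off how soon a prediction is available against how much information it uses.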

### Metric
Since this is a regression problem, the mean squared error (MSE) metric was used to evaluate the models' performances.
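Because the labels are standardized before training (see `standardize` in the training cell further down), a useful reference point for reading the MSE values is a trivial model that always predicts the mean. The sketch below uses made-up scores, not the project's data, to show that such a baseline has an MSE of 1 on standardized labels.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Made-up scores, used only to illustrate the scale of the metric
rng = np.random.default_rng(0)
raw_scores = rng.uniform(0, 1, size=(100, 1))

# Standardize the labels to zero mean and unit variance
labels = StandardScaler().fit_transform(raw_scores)

# A trivial model that always predicts the mean of the standardized labels (zero)
baseline_pred = np.zeros_like(labels)
baseline_mse = mean_squared_error(y_true=labels, y_pred=baseline_pred)
print(f'Baseline MSE: {baseline_mse:.3f}')
```

An MSE below 1 therefore means a model beats the constant predictor, and an MSE above 1 means it does worse.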

In [4]:
models = [LinearRegression(), 
          Ridge(alpha=1), 
          SVR(kernel='rbf'), 
          KNeighborsRegressor(n_neighbors=5, weights='distance'), 
          DecisionTreeRegressor(), 
          MLPRegressor(hidden_layer_sizes=(50, 50))]

In [9]:
def standardize(x):
    return StandardScaler().fit_transform(x)


months = (1, 3, 6, 12)
train_summary = pd.DataFrame(columns=[type(model).__name__ for model in models], index=months)
valid_summary = pd.DataFrame(columns=[type(model).__name__ for model in models], index=months)

for n_month in tqdm_notebook(months, desc='Tuning hyperparameter'):
    train = preprocessing.featurize_publish_date(train, n_month)
    valid = preprocessing.featurize_publish_date(valid, n_month, prior=train[['top_topics', 'publish_date']])
    
    # Prepare and standardize the training data
    numeric_features = standardize(train[['author_score', 'lookahead']])
    # One-hot encode topic combinations over train + valid so both share the same columns
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    all_topics = pd.get_dummies(pd.concat([train['top_topics'], valid['top_topics']]))
    categorical_features = all_topics[:len(train)].values
    features = np.concatenate([numeric_features, categorical_features], axis=1)
    labels = standardize(train[['score']])

    # Prepare and standardize the validation data
    # (note: standardize() fits a fresh scaler, so valid is scaled by its own statistics)
    v_numeric_features = standardize(valid[['author_score', 'lookahead']])
    v_categorical_features = all_topics[-len(valid):].values
    v_features = np.concatenate([v_numeric_features, v_categorical_features], axis=1)
    v_labels = standardize(valid[['score']])
    
    for model in tqdm_notebook(models, desc='Training models', leave=False):
        model.fit(features, labels if type(model).__name__ not in ['MLPRegressor', 'SVR'] else labels.ravel())
        y_train_pred = model.predict(features)
        y_valid_pred = model.predict(v_features)
        
        train_summary.at[n_month, type(model).__name__] = mean_squared_error(y_true=labels, y_pred=y_train_pred)
        valid_summary.at[n_month, type(model).__name__] = mean_squared_error(y_true=v_labels, y_pred=y_valid_pred)

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.




In [10]:
train_summary.to_csv('data/training_metrics/train.csv')
valid_summary.to_csv('data/training_metrics/valid.csv')
display(train_summary, valid_summary)

Training MSE:

| Forecast horizon (months) | LinearRegression | Ridge | SVR | KNeighborsRegressor | DecisionTreeRegressor | MLPRegressor |
|---|---|---|---|---|---|---|
| 1 | 0.495284 | 0.496746 | 0.629875 | 0 | 1.26677e-09 | 0.105235 |
| 3 | 0.484228 | 0.485536 | 0.608289 | 0 | 1.20592e-09 | 0.0956046 |
| 6 | 0.467248 | 0.468383 | 0.580721 | 0 | 1.29126e-09 | 0.10631 |
| 12 | 0.431757 | 0.432676 | 0.527678 | 0 | 1.2935e-09 | 0.0865826 |


Validation MSE:

| Forecast horizon (months) | LinearRegression | Ridge | SVR | KNeighborsRegressor | DecisionTreeRegressor | MLPRegressor |
|---|---|---|---|---|---|---|
| 1 | 2.74628e+20 | 1.52646 | 1.57104 | 1.59534 | 1.70524 | 1.71758 |
| 3 | 2.29761e+21 | 1.53858 | 1.57487 | 1.59217 | 1.72501 | 1.63245 |
| 6 | 6.38724e+18 | 1.55564 | 1.57904 | 1.62778 | 1.72583 | 1.71562 |
| 12 | 5.57115e+20 | 1.59641 | 1.59404 | 1.60942 | 1.72097 | 1.70309 |


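The results above show that every model overfits: the validation MSE is well above the training MSE in every case. KNeighborsRegressor (with distance weighting) and DecisionTreeRegressor reach a training MSE of essentially zero, meaning they memorize the training set, yet their validation MSEs are roughly 1.6 and 1.7. Because the labels are standardized, a model that always predicts the mean would have an MSE of 1, so every model does worse than that trivial baseline on the validation set. The LinearRegression validation MSE (on the order of $10^{18}$ to $10^{21}$) points to a numerical problem rather than ordinary overfitting, plausibly caused by the collinear one-hot topic columns; Ridge, which penalizes large coefficients, does not show it.

The underlying reason for the poor performance is that the Topic Score component of the label depends on the number of papers that came before a given paper. Papers in the validation and test sets are more recent, and as such would have lower Topic Scores. The models above could not capture this.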
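One preprocessing detail worth ruling out when revisiting these results is that the training cell standardizes the validation set with its own statistics. A common alternative is to fit the scaler and the encoder on the training data only and reuse them on the validation data. The sketch below illustrates that pattern on made-up data; `build_features` is a hypothetical helper, not part of `src.preprocessing`, and the results in this notebook were not produced this way.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_features(train_df, valid_df, numeric_cols, categorical_col):
    """Hypothetical helper: fit preprocessing on train, apply it to both sets."""
    scaler = StandardScaler().fit(train_df[numeric_cols])
    # handle_unknown='ignore' encodes categories unseen in training as all zeros;
    # astype(str) turns tuple-valued topic combinations into plain strings
    encoder = OneHotEncoder(handle_unknown='ignore').fit(
        train_df[[categorical_col]].astype(str))

    def transform(df):
        numeric = scaler.transform(df[numeric_cols])
        onehot = encoder.transform(df[[categorical_col]].astype(str)).toarray()
        return np.hstack([numeric, onehot])

    return transform(train_df), transform(valid_df)


# Toy data (made up for illustration only)
train_demo = pd.DataFrame({
    'author_score': [0.0, 1.0, 2.0],
    'lookahead': [1.0, 2.0, 3.0],
    'top_topics': ['(1, 2, 3)', '(1, 2, 4)', '(1, 2, 3)'],
})
valid_demo = pd.DataFrame({
    'author_score': [1.0, 3.0],
    'lookahead': [2.0, 5.0],
    'top_topics': ['(1, 2, 4)', '(5, 6, 7)'],  # second combination is unseen
})
X_train, X_valid = build_features(
    train_demo, valid_demo, ['author_score', 'lookahead'], 'top_topics')
print(X_train.shape, X_valid.shape)
```

With this setup the validation features are expressed on the same scale as the training features, and a topic combination that never appears in training contributes no one-hot signal instead of adding a column the model never saw during fitting.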

The next logical step in improving this modelling work is to reformulate the training procedure as a time series problem and use a Long Short-Term Memory (LSTM) network to capture the sequential dependencies in the data. Unfortunately, I was not able to develop and train such a model in TensorFlow in the time available, so this task is left as Future Work.

---
# Future Work
- Utilize textual information from the articles themselves to create embeddings for each paper
- Use an LSTM-like network to model the sequential dependencies within the data