# Model Building

The aim here is to build a model for this prediction task. One goal is to decide on a suitable evaluation metric and use it to compare the final model against a dummy baseline model.

### Import necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
import tensorflow as tf
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

### Read training and test datasets

A preliminary visual inspection of the data gave a few pointers for reading it into a dataframe correctly:
- Separator used is ';'
- Decimal character is ','
- Missing values are represented as 'NA'

In [3]:
data = {
    'train': pd.read_csv('train.csv', sep=';', decimal=',', na_values=['NA',]),
    'test': pd.read_csv('validation.csv', sep=';', decimal=',', na_values=['NA',]),
}

### Choice of evaluation metric

Usually for classification problems, **accuracy** is used as the evaluation metric if the dataset is balanced (meaning the output classes are roughly uniformly distributed).

However, in this use case there is a high degree of class imbalance (almost 13:1), so **accuracy** is not a suitable metric. In such cases we usually use other metrics such as **precision**, **recall** or **f1 score**, often as a weighted combination of all of them. The weights depend on the problem statement and on whether we care more about not misclassifying the positive class or the negative class. For example, in a cancer prediction problem we care most about avoiding false negatives, because a false negative means a person with cancer is diagnosed as negative and may go untreated.

For this problem, since we are unsure of the impact of false positives and false negatives, we will analyse all 3 metrics together, giving more importance to precision and f1 score than to recall (recall for the positive class will tend to be high because of the large number of positives).
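To make the point about accuracy concrete, here is a minimal sketch in plain Python. The helper `binary_metrics` and the 130:10 toy labels are illustrative assumptions (chosen to mimic a 13:1 imbalance), not data from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and f1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels with a 13:1 imbalance (assumed for illustration only).
y_true = [1] * 130 + [0] * 10
y_pred = [1] * 140  # a classifier that always predicts the majority class

pos = binary_metrics(y_true, y_pred, positive=1)
neg = binary_metrics(y_true, y_pred, positive=0)
print(pos)
print(neg)
```

The always-positive classifier reaches about 93% accuracy on these toy labels while its precision, recall and f1 for the minority class are all zero, which is why accuracy alone is misleading here.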

### Build preprocessing pipeline

Since the final goal is to build an online learning system and we do not know whether new data will come from the same distribution, we will not impose any assumptions and will consider all features in the learning process. We will also not remove outliers; the data is used as is.

In other words, we sacrifice some of the strength of the predictor (how well it can predict on the given data) for adaptability (robustness to changes in the data distribution at prediction time).

*Note: Initially, I used scikit-learn for model building and therefore had to perform all the preprocessing steps done during EDA. Here, however, I will use TensorFlow's estimator API to build the final model, so preprocessing steps such as encoding are no longer needed. Only imputation of missing values and scaling of numeric features will be done. Also, because of the presence of outliers, we will scale using the median and IQR (interquartile range) instead of the mean and standard deviation, as these are less sensitive to outliers.*
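The median/IQR scaling described above can be sketched with the standard library alone. The helper name `robust_scale` and the sample values are illustrative assumptions; the quartiles use the `inclusive` method, which may differ slightly from the method used to obtain the constants in the next cell.

```python
import statistics

def robust_scale(values):
    """Scale each value as (x - median) / IQR."""
    median = statistics.median(values)
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return [(x - median) / iqr for x in values]

sample = [1, 2, 3, 4, 100]  # 100 plays the role of an outlier
scaled = robust_scale(sample)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]

# The median ignores the outlier, while the mean is dragged towards it.
print(statistics.median(sample), statistics.mean(sample))  # 3 22
```

The four ordinary values stay within [-1, 0.5] after scaling, and only the outlier ends up far from zero, whereas the mean of the same sample is pulled up to 22.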

In [4]:
def preprocess(data, subset):
    df = data[subset].copy()  # work on a copy so the raw dataframe is not modified in place
    df['v19'] = df['v19'].apply(str)
    median_values = {
        'v2': 28.67,
        'v3': 0.000425,
        'v8': 1.75,
        'v11': 2.0,
        'v14': 120.0,
        'v15': 113.0,
        'v17': 1200000.0}
    iqr_values = {
        'v2': 17.83,
        'v3': 0.0008125000000000001,
        'v8': 4.5,
        'v11': 6.0,
        'v14': 280.0,
        'v15': 1059.75,
        'v17': 2800000.0}
    df = df.fillna(median_values)
    for column in df.select_dtypes(include=np.number):
        df[column] = (df[column] - median_values[column]) / iqr_values[column]
    for column in df.select_dtypes(exclude=np.number):
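        df[column] = df[column].fillna('missing')
        categories = set(df[column].unique()).union({'missing',})
        # astype('category', categories=...) is no longer supported by pandas; pass a CategoricalDtype instead
        df[column] = df[column].astype(pd.api.types.CategoricalDtype(categories=list(categories), ordered=False))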
    df['classLabel'] = df['classLabel'].map({
        'no.': 0,
        'yes.': 1
    })
    df['classLabel'] = df['classLabel'].astype('int64')
    return df

for subset in ('train', 'test'):
    data[subset] = preprocess(data, subset)

### Train-test split

Since this is a class imbalance problem, we have to split the data into training and validation sets in a stratified way to preserve the distribution of the class labels.
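As a rough illustration of what stratification does, the sketch below splits indices class by class so that both parts keep the same label ratio. The function `stratified_split` and the 104:8 toy labels are hypothetical; the notebook itself relies on the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, val_indices) with the class ratio preserved in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels with a 13:1 imbalance (assumed for illustration only).
labels = [1] * 104 + [0] * 8
train_idx, val_idx = stratified_split(labels)

val_pos = sum(labels[i] for i in val_idx)
train_pos = sum(labels[i] for i in train_idx)
print(val_pos, len(val_idx) - val_pos)        # 26 2
print(train_pos, len(train_idx) - train_pos)  # 78 6
```

Both parts keep the 13:1 ratio, so the validation scores are computed on the same class distribution the model was trained on.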

In [5]:
data['train'], data['val'] = train_test_split(data['train'], random_state=0, stratify=data['train']['classLabel'])

### Baseline model

Since we are dealing with class imbalance, the baseline model will be a dummy classifier that always predicts the most frequent class, which is *yes.* in our case.

In [6]:
y_true = data['val']['classLabel']
y_pred = np.ones_like(y_true)
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        69
           1       0.93      1.00      0.96       856

   micro avg       0.93      0.93      0.93       925
   macro avg       0.46      0.50      0.48       925
weighted avg       0.86      0.93      0.89       925



In [7]:
y_true = data['test']['classLabel']
y_pred = np.ones_like(y_true)
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       107
           1       0.47      1.00      0.63        93

   micro avg       0.47      0.47      0.47       200
   macro avg       0.23      0.50      0.32       200
weighted avg       0.22      0.47      0.30       200



As expected, taking **1** as the positive class, the recall is 100% because there are no false negatives. However, the precision suffers at 47% because of the high number of false positives.
The dummy baseline model has an f1 score of 63% on the test set.

### Modelling the classifier

Let's now build the classifier using TensorFlow. The starting point is a logistic regression classifier with a weighted cross-entropy loss; we keep the classifier simple because our ultimate goal is to build an online learning system. (The estimator configured below does not set a `weight_column`, so its loss is the unweighted default.)

However, we will not assume a linear relationship between the features and the response. Instead we will use a multilayer perceptron, which, being a universal function approximator, should serve well here.
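For reference, a class-weighted cross-entropy can be sketched as follows. This is a plain-Python illustration, not the loss used by the estimator; the function name and the weight of 13 for the negative class are assumptions based on the roughly 13:1 imbalance.

```python
import math

def weighted_cross_entropy(y_true, p_pred, class_weights):
    """Mean binary cross-entropy with each example scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weights[y] * loss
    return total / len(y_true)

# Hypothetical weights: errors on the rare negative class count 13 times as much.
weights = {0: 13.0, 1: 1.0}
uniform = {0: 1.0, 1: 1.0}

loss_neg = weighted_cross_entropy([0], [0.5], weights)
loss_pos = weighted_cross_entropy([1], [0.5], weights)
print(loss_neg / loss_pos)  # 13.0
```

With uniform weights this reduces to the ordinary cross-entropy; with the weights above, the same prediction error costs 13 times more on a negative example than on a positive one.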

In [8]:
tf.enable_eager_execution()

In [9]:
def make_input_fn(subset):
    df = data[subset]
    def input_fn():
        label = df['classLabel']
        return tf.data.Dataset.from_tensor_slices((dict(df),label)).batch(32)
    return input_fn

train_input_fn = make_input_fn('train')
val_input_fn = make_input_fn('val')
test_input_fn = make_input_fn('test')

In [10]:
numeric_columns = [tf.feature_column.numeric_column(column) 
                   for column in data['train'].select_dtypes(include=np.number)
                   if column != 'classLabel']
categorical_columns = [tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(
    column, vocabulary_list=data['train'][column].cat.categories
)) for column in data['train'].select_dtypes(exclude=np.number)
                   if column != 'classLabel']

In [11]:
classifier = tf.estimator.DNNClassifier(
    hidden_units=[4, 4],
    feature_columns=numeric_columns+categorical_columns,
    model_dir='model',
    n_classes=2,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001)
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f33e6e26390>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


Note: The model specification above is the one I finally settled on. Initially I went with a LinearClassifier and did not get good performance. I then experimented for a while to find a good set of hyperparameters. The goal was to keep the model simple so that it does not overfit, and to set a learning rate that lets the model adapt to new data (but not too quickly).

In [12]:
epochs = 10
for i in range(epochs):
    print('Epoch {}'.format(i+1))
    classifier.train(train_input_fn)

Epoch 1
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into model/model.ckpt.
INFO:tensorflow:loss = 35.237076, step = 0
INFO:tensorflow:Saving checkpoints for 87 into model/model.ckpt.
INFO:tensorflow:Loss for final step: 5.5571694.
Epoch 2
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/model.ckpt-87
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 87 into model/model.ckpt.
INFO:tensorflow:loss = 8.330797, step = 87
INFO:tensorflow:Saving checkpoints for 174 into model/model.ckpt.
INFO:tensorflow:Loss for final step: 3.189396.
Epoch 3

In [13]:
y_pred = [pred['class_ids'][0] for pred in list(classifier.predict(val_input_fn))]
y_true = data['val']['classLabel']
print(classification_report(y_true, y_pred))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/model.ckpt-870
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
              precision    recall  f1-score   support

           0       1.00      0.96      0.98        69
           1       1.00      1.00      1.00       856

   micro avg       1.00      1.00      1.00       925
   macro avg       1.00      0.98      0.99       925
weighted avg       1.00      1.00      1.00       925



In [14]:
y_pred = [pred['class_ids'][0] for pred in list(classifier.predict(test_input_fn))]
y_true = data['test']['classLabel']
print(classification_report(y_true, y_pred))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/model.ckpt-870
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
              precision    recall  f1-score   support

           0       0.69      0.48      0.56       107
           1       0.56      0.75      0.64        93

   micro avg       0.60      0.60      0.60       200
   macro avg       0.62      0.61      0.60       200
weighted avg       0.63      0.60      0.60       200



So, the scores returned by the model seem better in terms of the precision-recall tradeoff, and the scores on the validation set (which was split from the training set) are better overall.

However, the sharp difference between the scores on the validation set and the test set indicates that the test data might come from a separate distribution.

In [15]:
data['test']['classLabel'].value_counts()

0    107
1     93
Name: classLabel, dtype: int64

Now we see that the test dataset actually has a fairly balanced distribution of class labels. This might explain the sharp difference in scores.

So now, two things could be done:
- Attempt to reduce the class imbalance in the training data and make it more balanced, which may introduce bias in the model
- Train incrementally on the test data and see whether the final score improves over time.
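The first option is not pursued in this notebook, but one common way to reduce class imbalance is random oversampling of the minority class. The sketch below is only an illustration under that assumption; `oversample_minority` and the toy rows are hypothetical.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy rows of (feature, label) with a 13:1 imbalance (assumed for illustration only).
rows = [('a', 1)] * 13 + [('b', 0)]
balanced = oversample_minority(rows, label_of=lambda row: row[1])
print(len(balanced))  # 26
```

Because the minority rows are repeated rather than newly observed, this can bias the model towards those specific examples, which is the risk mentioned in the first bullet.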

### An attempt to train incrementally

For this, we will treat every single point of the test data as a new dataset and run a gradient descent step on it, in the hope that the final model will have adapted better to the test data.

Before we attempt this, we need to split the test data into training and validation sets to avoid reporting a biased score at the end. Since it would be difficult to report the scores after every step, we will instead save the precision, recall and f1 scores evaluated on the validation part of the test data into a DataFrame.
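The idea of one gradient step per incoming example can be illustrated with a much smaller model. The sketch below uses a plain-Python logistic regression rather than the estimator from this notebook; `online_step`, the learning rate and the toy stream are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def online_step(weights, bias, x, y, lr=0.1):
    """One stochastic gradient descent step of logistic regression on a single example."""
    error = predict_proba(weights, bias, x) - y  # gradient of the log loss w.r.t. the logit
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

weights, bias = [0.0], 0.0
before = predict_proba(weights, bias, [1.0])

# A hypothetical stream in which every new example is positive.
stream = [([1.0], 1)] * 50
for x, y in stream:
    weights, bias = online_step(weights, bias, x, y)

after = predict_proba(weights, bias, [1.0])
print(before, after)
```

Each example nudges the parameters slightly, so the predicted probability for this input moves from 0.5 towards 1 as the stream is consumed. The learning rate controls how quickly the model follows the new data.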

In [16]:
online_train, online_test = train_test_split(data['test'], test_size=0.1, random_state=0, stratify=data['test']['classLabel'])

In [17]:
scores_df = pd.DataFrame(columns=['precision', 'recall', 'f1'])

In [18]:
def input_fn(df):
    label = df['classLabel']
    return tf.data.Dataset.from_tensor_slices((dict(df),label)).batch(1)

Now, let's train incrementally.

In [19]:
for index in range(online_train.shape[0]):
    print('Epoch {}'.format(index+1))
    classifier.train(input_fn=lambda: input_fn(online_train.iloc[[index]]))
    y_pred = [pred['class_ids'][0] for pred in list(classifier.predict(lambda: input_fn(online_test)))]
    y_true = online_test['classLabel']
    # DataFrame.append was removed in pandas 2.0, so add the new row by label instead
    scores_df.loc[len(scores_df)] = [
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    ]  # same order as the columns: precision, recall, f1

Epoch 1
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/model.ckpt-870
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 870 into model/model.ckpt.
INFO:tensorflow:loss = 2.586345, step = 870
INFO:tensorflow:Saving checkpoints for 871 into model/model.ckpt.
INFO:tensorflow:Loss for final step: 2.586345.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/model.ckpt-871
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Epoch 2
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from model/mod

In [20]:
scores_df[:5]

|     | precision | recall   | f1       |
|-----|-----------|----------|----------|
| 0   | 0.666667  | 0.888889 | 0.761905 |
| 1   | 0.666667  | 0.888889 | 0.761905 |
| 2   | 0.666667  | 0.888889 | 0.761905 |
| 3   | 0.666667  | 0.888889 | 0.761905 |
| 4   | 0.666667  | 0.888889 | 0.761905 |


In [21]:
scores_df[-5:]

|     | precision | recall   | f1       |
|-----|-----------|----------|----------|
| 175 | 0.875     | 0.777778 | 0.823529 |
| 176 | 0.875     | 0.777778 | 0.823529 |
| 177 | 0.875     | 0.777778 | 0.823529 |
| 178 | 0.875     | 0.777778 | 0.823529 |
| 179 | 0.875     | 0.777778 | 0.823529 |


Comparing the first 5 and last 5 rows suggests that our online learning works well. In the first 5 iterations, precision is low (0.67) and recall is high (0.89) because of the high fraction of positive labels encountered during the original training, whereas the test data is much more balanced. As we train incrementally, precision rises to 0.88 and the f1 score from 0.76 to 0.82, with a modest decrease in recall to 0.78.

At this point, we can be reasonably convinced that the online learning is working as expected and that the model parameters are being updated on every step. Given enough new data, we expect the model to adapt to the changed distribution, so we can declare the online training a success.
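As a quick consistency check on the two tables, the f1 score is the harmonic mean of precision and recall. The fractions below are my reading of the rounded decimals in the tables (for example 0.666667 as 2/3), which is an assumption rather than something the notebook states.

```python
from fractions import Fraction

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# First rows: precision 0.666667, recall 0.888889 (read as 2/3 and 8/9).
first = f1(Fraction(2, 3), Fraction(8, 9))
# Last rows: precision 0.875, recall 0.777778 (read as 7/8 and 7/9).
last = f1(Fraction(7, 8), Fraction(7, 9))

print(round(float(first), 6))  # 0.761905
print(round(float(last), 6))   # 0.823529
```

Both values match the f1 column, so the reported improvement follows directly from the large gain in precision outweighing the smaller loss in recall.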

**Effectively this completes part 1 and part 3 of the assignment. However, I plan to make the online learning process a separate microservice which will be invoked by the prediction microservice. I will not focus much on speed, but rather on successfully building the microservice architecture. In practical use, I am sure both services can be tuned to be faster and more scalable.**