# Loyal Software Engineering, ML assessment

The following is a technical assessment for the Machine Learning Software Engineer role at Loyal Health. In this assessment, you will be expected to (1) prepare training, validation, and test data, (2) train an intent classification model with the architecture of your choice, and (3) evaluate the performance of your model. You are free to use any external libraries; just be sure to add them to requirements.txt. Please do not use any external datasets or knowledge bases. We expect this assessment to take around three hours. The goal is simply to test your ability to go quickly from limited data to a working model. We don't expect your model to be perfect, so please calibrate your efforts accordingly. More broadly, we will evaluate your submission on code structure and readability, model performance, model evaluation procedure, system design explanations, and creativity. There are many acceptable ways to approach this assessment, so feel free to take some creative liberties. Best of luck!

## Setup

First, be sure to have conda installed on your machine. If you do not have conda installed, please follow the instructions __[here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)__. With conda installed, create a conda environment using Python 3.7 as follows:

<code><center>conda create --name LoyalEnv python=3.7</center></code>

With the environment (LoyalEnv) created, activate it by running the following command in your terminal:

<code><center>conda activate LoyalEnv</center></code>

Next, to run this notebook in Jupyter, install ipykernel from your terminal with the environment activated:

<code><center>conda install ipykernel</center></code>

Now install the required libraries, either by running <code>pip install -r requirements.txt</code> from the terminal or by uncommenting and running the first code block of the Imports section. If you have trouble getting conda working in Jupyter notebooks, __[this](https://towardsdatascience.com/get-your-conda-environment-to-show-in-jupyter-notebooks-the-easy-way-17010b76e874)__ article may be helpful. You are not required to use a conda environment as specified for your model development, but be aware that we will run your code in a conda environment created from your requirements.txt file.


## Imports

In [1]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install tensorflow_hub

Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install tensorflow_text

Note: you may need to restart the kernel to use updated packages.


In [229]:
"""Provided Imports"""
from datasets import load_dataset

"""Your Imports"""
from transformers import BertTokenizer
from sklearn import preprocessing
from sklearn.preprocessing import LabelBinarizer
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
import torch
import torch.nn as nn
from tensorflow.keras import datasets, layers, models, callbacks
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

## Data Preprocessing

Here we have provided you with code to load the train, validation, and test data from the Huggingface datasets library. We have also provided you with several thousand unlabeled data points from the same distribution. Feel free to leverage these in any way you like. If you so desire, you can create your own dataset object from what we have provided, using any library you like. We just ask that you do not do any hand labeling of the data. Using as many blocks as you like, do any data preprocessing you wish to prepare the data for ingestion into your model. Briefly rationalize any data preparation steps in a markdown block or comments.

In [8]:
labeled_data = load_dataset("christianloyal/loyal_clinc_MLE")
print(labeled_data)

Using custom data configuration christianloyal--loyal_clinc_MLE-0661c385abbfe8c6
Reusing dataset csv (/Users/m31418/.cache/huggingface/datasets/csv/christianloyal--loyal_clinc_MLE-0661c385abbfe8c6/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2956
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 4500
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 3000
    })
})


In [207]:
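#Create pandas dataframes containing the training, validation, and test sets
train = pd.DataFrame.from_dict(labeled_data["train"])
val = pd.DataFrame.from_dict(labeled_data["validation"])
test = pd.DataFrame.from_dict(labeled_data["test"])

#Target training size: 75% of the combined training and validation rows
correct_size = int(((len(train.index)+len(val.index))*3)/4)
#Sample the rows to move out of validation; the fixed seed (an arbitrary choice) keeps the split reproducible
to_move = val.sample(correct_size - len(train.index), random_state=42)

#Append the sampled rows to train, then drop them from validation by their original index
train = pd.concat([train, to_move], ignore_index=True)
val = val.drop(to_move.index)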

1489
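
The rebalancing arithmetic in the cell above can be checked by hand. The sketch below redoes it with plain integers, taking the split sizes from the `DatasetDict` printout. It assumes that the `1489` shown above is the number of rows left in the validation set after the move, which the cell itself does not state.

```python
# Split sizes as reported by the DatasetDict printout
n_train, n_val = 2956, 3000

# Target training size: three quarters of the combined train + validation rows
correct_size = int((n_train + n_val) * 3 / 4)

# Rows sampled out of validation and appended to train
n_moved = correct_size - n_train

new_train = n_train + n_moved
new_val = n_val - n_moved

print(correct_size, n_moved, new_val)        # 4467 1511 1489
print(new_train / (new_train + new_val))     # 0.75
```

Under that assumption, 1,511 rows move from validation to training, giving 4,467 training rows and 1,489 validation rows.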


In [10]:
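#Check the distribution of classes across the labeled data
#(all_data is not defined in the cells shown; it is assumed here to be the three labeled splits combined)
all_data = pd.concat([train, val, test], ignore_index=True)
all_data['label'].value_counts()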

card_declined         89
sync_device           89
next_song             89
apr                   89
shopping_list         88
                      ..
directions            53
credit_score          52
distance              52
cancel_reservation    52
share_location        52
Name: label, Length: 150, dtype: int64

In [11]:
unlabeled_data = load_dataset("christianloyal/loyal_clinc_MLE_unlabeled")

Using custom data configuration christianloyal--loyal_clinc_MLE_unlabeled-25971229c07b1588
Reusing dataset csv (/Users/m31418/.cache/huggingface/datasets/csv/christianloyal--loyal_clinc_MLE_unlabeled-25971229c07b1588/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

In [216]:
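#One-hot encode the intent labels; the binarizer is fit on the training labels only
#and reused for the test and validation sets so that all three share one column order
binarizer=LabelBinarizer()

#Split each dataframe into an array of text features and a one-hot label matrix
trainfeatures=train.copy()
trainlabels=trainfeatures.pop("label")
trainfeatures=trainfeatures.values
trainlabels=binarizer.fit_transform(trainlabels.values)

testfeatures=test.copy()
testlabels=testfeatures.pop("label")
testfeatures=testfeatures.values
testlabels=binarizer.transform(testlabels.values)

validfeatures=val.copy()
validlabels=validfeatures.pop("label")
validfeatures=validfeatures.values
validlabels=binarizer.transform(validlabels.values)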


In [13]:
#Initialize the tokenizer and set the max length of input samples
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
MAX_LENGTH = 32

The distribution between the training, validation, and test sets seemed off to me: as provided, the test set (4,500 rows) is larger than both the training set (2,956 rows) and the validation set (3,000 rows), whereas something closer to a 7:2:1 split between training, validation, and test samples seemed more appropriate. For consistency across assessments, I decided to leave the test set alone. However, I moved samples from the validation set into the training set so that the two better reflect a 75:25 split. While perusing the data I also noticed that the classes are imbalanced, with between 52 and 89 examples per class in the label counts above. This may not present a large problem as long as each class is still represented in the training data and the test data reflects the training data. If it does, it could be addressed with minority oversampling, majority undersampling, or synthetic data generation. Furthermore, the unlabeled data could be used to continue pretraining a model such as BERT to make it more domain-specific, or could be used along with the labeled data to train a GAN whose discriminator is ultimately used to make predictions. Due to time constraints, these solutions were not explored further in this iteration of the project.
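
Minority oversampling, one of the options mentioned above, was not implemented in this notebook. As a minimal sketch of the idea only, the plain-Python helper below duplicates randomly chosen rows of each smaller class until every class is as large as the largest one. The function name `oversample` and the toy rows are hypothetical, invented for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)  # local generator so the result is reproducible

    # Group the (text, label) rows by label
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())

    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Invented toy rows; the label names are borrowed from the class counts above
toy_rows = [
    ("what is my apr", "apr"),
    ("tell me the apr on my card", "apr"),
    ("apr for my visa", "apr"),
    ("how far is the airport", "distance"),
]

balanced_rows = oversample(toy_rows)
print(Counter(label for _, label in balanced_rows))
```

If something like this were used, it would be applied to the training split only, so that duplicated rows never leak into the validation or test data.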

## Model Training

Here you may use as many code blocks as you please for the training of your model. As a tip, if you wish to run your training on a hardware-accelerated device, feel free to run this notebook on one of __[Google Colab](https://colab.research.google.com/)__'s free machines. We have set up our environment for this assignment to replicate the environment of Colab machines as closely as possible, so that using these resources takes minimal effort. Feel free to use the validation set for any tuning, but be sure that none of the validation data is used directly as training data in addition to the training set. Also, refrain from using any of the test data for model tuning. It may also be helpful to set a random seed or use anything else that will improve model reproducibility. In writing, briefly rationalize any architectural or procedural decisions via a markdown block or comments.
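
One way to act on the random-seed suggestion is sketched below. It is an illustration rather than part of the submitted solution: `set_seed` is a hypothetical helper, the seed value is arbitrary, and the TensorFlow call is skipped when TensorFlow is not installed.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed the Python, NumPy, and (if installed) TensorFlow random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        # TensorFlow is optional for this sketch
        pass


set_seed(42)
first = (random.random(), np.random.rand())

set_seed(42)
second = (random.random(), np.random.rand())

# Re-seeding reproduces the same draws
print(first == second)
```

Seeding makes runs more repeatable, but it does not by itself guarantee bit-for-bit identical training on a GPU, where some operations are non-deterministic.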

In [223]:
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/2'

In [224]:
def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.2)(net)
    net = tf.keras.layers.Dense(300, activation='sigmoid', name='Dense_Layer_1')(net)
    net = tf.keras.layers.Dense(150, activation='softmax', name='Dense_Layer_2')(net)
    return tf.keras.Model(text_input, net)

In [225]:
classifier_model = build_classifier_model()
classifier_model.summary()

Model: "model_20"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128)}                                               

In [226]:
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
metrics = [tf.metrics.CategoricalAccuracy(), f1_score, precision, recall]

In [None]:
from tensorflow.keras import backend as K

#Batch-wise metrics computed on one-hot targets and softmax outputs;
#rounding counts a class as predicted only when its probability exceeds 0.5
def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_score(y_true, y_pred):
    #Local names must not shadow the precision and recall functions
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2*((p*r)/(p+r+K.epsilon()))

In [231]:
epochs=10
optimizer=tf.keras.optimizers.Adam()
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics,
                        )

earlystopping = callbacks.EarlyStopping(monitor ="val_loss", 
                                        mode ="min", patience = 5, 
                                        restore_best_weights = True)

In [232]:
history = classifier_model.fit(x=trainfeatures, y=trainlabels,
                               validation_data=(validfeatures, validlabels),
                               batch_size=128,
                               epochs=epochs,
                               callbacks=[earlystopping])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
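
The training log above ends at epoch 7 of 10. Assuming the log is complete, that is what the `EarlyStopping` callback with `patience=5` would produce if the validation loss last improved at epoch 2. The sketch below mimics the patience rule in plain Python. `stopping_epoch` is a hypothetical helper, and the loss values are invented for illustration, not taken from the actual run.

```python
def stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch

    # Patience never ran out: training uses every epoch
    return len(val_losses), best_epoch


# Invented validation losses for a 10-epoch budget
hypothetical_losses = [4.9, 4.5, 4.6, 4.7, 4.6, 4.8, 4.9, 5.0, 5.1, 5.2]

stopped_at, best_at = stopping_epoch(hypothetical_losses, patience=5)
print(stopped_at, best_at)  # 7 2
```

With `restore_best_weights=True`, as configured above, the model would then be rolled back to the weights from the best epoch rather than keeping those from the last one.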


Preprocessing of the data occurs within the model itself: the first layer is the BERT preprocessing model from TF Hub, which produces token IDs along with the attention mask and token type IDs. These inputs are then encoded by BERT, and the pooled output is taken. Dropout with a rate of 0.2 is applied to reduce overfitting before the output is fed through two dense layers: a 300-unit layer with sigmoid activation, followed by a 150-unit softmax layer with one unit per intent class. The second dense layer was added after experimentation showed that a single dense layer was unable to capture the nuanced differences between inputs, especially given the large number of classes. In another version, a convolutional layer could be used to capture information specific to the sequences of tokens. Because the true classes have been converted to one-hot vectors, categorical cross-entropy is used as the loss function.
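
To make the pairing of one-hot targets, a softmax output, and categorical cross-entropy concrete, here is a small worked example in plain Python. It is a sketch of the math only, with three invented classes instead of 150, and it is not how Keras implements the loss internally.

```python
import math


def one_hot(index, num_classes):
    """Vector with 1.0 at the true class and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """-sum(t * log(p)); with a one-hot target only the true class contributes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


probs = softmax([2.0, 1.0, 0.1])           # invented scores for 3 classes
loss_if_class_0 = categorical_cross_entropy(one_hot(0, 3), probs)
loss_if_class_2 = categorical_cross_entropy(one_hot(2, 3), probs)

print(probs)
print(loss_if_class_0, loss_if_class_2)
```

The loss reduces to the negative log of the probability assigned to the true class, so it is small when the model is confident and correct (class 0 here) and large when the true class received little probability (class 2).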

## Model Evaluation

Here we ask that you evaluate the performance of your model on the provided test set. The metric that you choose to use is up to you, but please explain your choice. When evaluating your model, keep in mind that we have a hold out test set of our own, so overfitting to the test set should be avoided. Briefly explain your choices in evaluating and assess the performance of your model. Explain the good and the bad and talk about how the model could be improved in the future.

In [None]:
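#evaluate returns the loss followed by the compiled metrics, in order;
#the test_ prefix avoids overwriting the metric functions defined above
test_loss, test_accuracy, test_f1, test_precision, test_recall = classifier_model.evaluate(testfeatures, testlabels)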
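
As a complementary check outside Keras, precision, recall, and F1 can also be computed per class from hard predictions (the arg-max class for each example) and then averaged over classes. The sketch below does this in plain Python. `macro_prf` is a hypothetical helper and the labels are invented, so the numbers say nothing about the actual model.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all classes that appear."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Invented labels for illustration
y_true = ["apr", "apr", "distance", "directions"]
y_pred = ["apr", "distance", "distance", "directions"]

macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))  # 0.8333 0.8333 0.7778
```

Macro averaging weights every intent equally, which makes weak performance on the smaller classes visible instead of letting the larger classes dominate the score.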

## Final Steps

Congrats on completing the assessment. Now, all we ask is that you create a private repository on GitHub with your Assessment.ipynb and requirements.txt files and share it with us. Feel free to add any additional details you like in the README file. We would also appreciate it if you linked us to any relevant coding projects on GitHub via the README file or in the provided markdown block below. Thanks for your interest in Loyal!

*** Add any additional notes or external links here. *** 