# CAmper Initiation
This notebook describes the model training procedure.
The model.py script trains the model and writes its output file following the steps in this notebook, and it can be run in batch.

In [2]:
# import libraries
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

import pickle

#### Model Selection
We have preprocessed the file and done some data cleaning and exploration on the text. Nothing too in-depth, but for the purpose of this exercise we can move ahead with the modelling procedure.

We have two requirements.
1. Predict the top 2 class labels.
2. Output the relevance measure for each of the labels.

Given that we are looking for a relevance score, logistic regression is a good choice here, since it directly estimates the probability that an instance belongs to class k. We won't write the algorithm ourselves, since it would take time to tune it and get the learning rates right. For the purpose of this exercise, let's just call sklearn's linear model module.

I'm using a stochastic gradient descent optimiser here, mainly for extensibility, since it supports minibatch descent. So if we decide to scale to bigger data sets, we can just change the optimiser itself instead of the whole model.

Our relevance measure will be the estimated probabilities that the logistic function returns for an instance i belonging to classes k and j, the two top-ranked labels.

Let's get on with it.

#### Feature extractor / tokenizer
In the modelling pipeline, we will be feeding in an already cleaned dataset, which means we can feed it straight into the feature extractor.
The requirements say that the application needs to process a **full string sentence**. This means we first need to feed it into a tokenizer to generate tokens. As the data is cleaned, we can use a simple whitespace tokenizer. From our EDA, we found some bigrams and trigrams that may be significant. The signal is not very strong, but let's include them in this case.

Also, given that we are going to feed the runtime feature extractor one instance at a time, a tf-idf vectoriser isn't going to be efficient (we would have to feed in the training set for each instance and then log each seen instance in the document list, and so on). Instead, a simple count vectoriser will do the job of generating the feature vector.

I will use sklearn's built-in CountVectorizer, which does the word tokenization and the bigram and trigram generation implicitly. We should also keep in mind that the cleaned dataset is stemmed using a pretty aggressive Lancaster stemmer.
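To see what the vectoriser actually produces, here is a small sketch that fits a `CountVectorizer` with `ngram_range=(2, 3)` on a single stemmed sentence. The variable names are illustrative only. Two things are worth noticing: with this range only bigrams and trigrams are emitted (no single words), and the default token pattern drops one-character tokens such as "i".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as used in this notebook, fitted on one example sentence.
vec = CountVectorizer(analyzer="word", ngram_range=(2, 3))
vec.fit(["i understand what is expect of me"])

# vocabulary_ maps each extracted n-gram to its column index.
ngrams = sorted(vec.vocabulary_)
for gram in ngrams:
    print(gram)
```

If single words should also become features, `ngram_range=(1, 3)` would include them; that is a design choice rather than something this notebook does.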

In [3]:
# Read in the data
_data = pd.read_csv("CLEANED_LABELED.csv")
    
    
# Tokenize and vectorise.
# Note: ngram_range=(2, 3) extracts bigrams and trigrams only (no unigrams).
_cVectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 3))

#### Training
We will split the data into training and test sets, roughly 70-30, which should suffice. Then we will use the vectoriser we created to vectorise both the training and test sets.

After training on the vectorised training set, we can predict and see how our model went.

##### Parameters
I'm leaving the logistic regression classifier's hyperparameters at their defaults. Although we could do a bit of a hyperparameter search to optimise our model, for the purpose of this exercise the defaults will do. The one setting we do pass to the API is the loss function, which is log loss.

In [12]:

# Make test and training data.
x_train, x_test, y_train, y_test = train_test_split(
    _data['question_text'], _data['code'], test_size=0.3, random_state=300
)

# Fit the vectoriser using training data.
_cVectorizer.fit(x_train)

# Generate feature vectors
fv_train = _cVectorizer.transform(x_train)
fv_test = _cVectorizer.transform(x_test)

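# Logistic regression model with default parameters.
# Note: loss="log" was renamed to "log_loss" in scikit-learn 1.1 and removed in 1.3;
# on older versions, use loss="log" instead.
_SGDModel = SGDClassifier(loss="log_loss").fit(fv_train, y_train)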

# Predict, then print the first 10 test examples as: text, predicted label, true label
y_pred = _SGDModel.predict(fv_test)
for i, (x, y, z) in enumerate(zip(x_test, y_pred, y_test)):
    print(x, y, z)
    if i == 9:
        break

the inform i nee to do my job is ready avail to me ENA.3 ENA.3
i know what is expect from me to be success in my rol ALI.5 ALI.5
my team memb hold each oth account for high qual TEA.2 TEA.2
we wer enco to be innov ev though som of our in may not hav success INN.2 INN.2
we held ourselv and our team memb account for result TEA.2 TEA.2
i understand what is expect of me ALI.5 ALI.5
the inform resourc i nee to do my job effect ar ready avail ENA.3 ENA.3
at pied we hold ourselv and our team memb account for result TEA.2 TEA.2
within our branch we hold ourselv account for result TEA.2 TEA.2
we wer enco to be innov ev though som of our in may not hav been success INN.2 INN.2




Seems to be doing pretty well. Let's have a look at the accuracy.

In [14]:
# check accuracy
print(accuracy_score(y_test, y_pred))

0.987654320988


This gives roughly 98.8% accuracy on the test set, though the model might be overfitting. We won't play around with this too much for now. To use this model in our application, we need to serialize both the model and the vectoriser and load them there. We can do that using pickle; that part of the code can be found in the model.py file.