
In this experiment we build a simple pipeline that uses the training set and the handful of binary features engineered during the project initiation stage to train the following classifiers:

- Naive Bayes
- Elastic Net
- Light Gradient Boosting
- Deep Neural Network

We will use `matthews_corrcoef` as the model-selection score when training the classifiers, and then document `classification_report` for both the training and validation sets.
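For reference, the multiclass Matthews correlation coefficient can be computed directly from a confusion matrix. The sketch below is a plain-Python illustration of the standard multiclass formula, not the scikit-learn implementation; the function name `mcc_from_confusion` and the toy matrices are made up for illustration.

```python
from math import sqrt

def mcc_from_confusion(cm):
    """Multiclass MCC from a square confusion matrix (rows = true, columns = predicted)."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)                  # s: number of samples
    correct = sum(cm[k][k] for k in range(n_classes))     # c: correctly classified samples
    true_counts = [sum(row) for row in cm]                # t_k: samples per true class
    pred_counts = [sum(cm[i][k] for i in range(n_classes))
                   for k in range(n_classes)]             # p_k: samples per predicted class

    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Degenerate case (e.g. every sample predicted as one class): report 0.
    return numerator / denominator if denominator else 0.0

# Toy matrices (hypothetical, for illustration only)
perfect = mcc_from_confusion([[2, 0], [0, 2]])     # all predictions correct
one_class = mcc_from_confusion([[2, 0], [2, 0]])   # everything predicted as class 0
inverted = mcc_from_confusion([[0, 2], [2, 0]])    # every prediction wrong
print(perfect, one_class, inverted)
```

A score of 1 means perfect agreement, 0 means no better than chance, and negative values mean systematic disagreement. Unlike plain accuracy, a classifier that always predicts the majority class scores 0.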

## Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB # Note that all features we will use here are binary
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef,make_scorer, balanced_accuracy_score
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures
from sklearn.feature_selection import SelectKBest,chi2

import numpy as np
import json
import pandas as pd

# load the configuration file
with open("config.json",'rb') as f:
    config = json.load(f)
feature_store_dir = config['feature_store_dir']    

# Load data sets, prepare x_train and y_train
analytical_data = pd.read_csv(feature_store_dir + 'analytical_data.csv')
train_knumber = pd.read_csv(feature_store_dir + 'train_KNUMBER.csv')
dummy_features = pd.read_csv(feature_store_dir + "features_dummy.csv")

x_train = pd.merge(train_knumber[['KNUMBER','DATASET']],
    analytical_data[['KNUMBER','COMPLEXITY']], how='left', on = "KNUMBER")
x_train = pd.merge(x_train,dummy_features, how = 'left', on='KNUMBER')
# Encode target labels:
# 0 = L, 1 = M, 2 = H
# Build the labels in one step instead of chained indexing, which assigns to
# a temporary copy and triggers pandas' SettingWithCopyWarning.
# Anything that is not 'M' or 'H' (i.e. 'L') is encoded as 0, as before.
y_train = x_train['COMPLEXITY'].map({'M': 1, 'H': 2}).fillna(0).astype('int64').values

mcc = make_scorer(matthews_corrcoef)

def prepare_input(train_set):
    train_set = train_set.drop(['KNUMBER','DATASET', 'COMPLEXITY'],axis = 1)
    return train_set

clf_pipeline = Pipeline(steps=[
    ("prepare_input",FunctionTransformer(prepare_input)),
    ("add_poly_int", PolynomialFeatures()),
    ("feature_selection",SelectKBest(chi2)),
    ("clf",BernoulliNB())
])

param_grid = {
                "clf__alpha":np.linspace(1,1000,100),
                "clf__fit_prior":[True,False],
                "add_poly_int__degree":[1,2,3],
                "feature_selection__k":['all',10,20,30]
             }

clf_search = GridSearchCV(clf_pipeline,
                          param_grid = param_grid,
                          scoring = mcc,
                          cv = 5,
                          n_jobs=7,
                          verbose=2
                          )
clf_search.fit(X = x_train,y = y_train)


Fitting 5 folds for each of 2400 candidates, totalling 12000 fits
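
The candidate count in the log follows from the size of the parameter grid. As a quick check, the sketch below multiplies the number of values per parameter; the counts are copied from `param_grid` above, and the variable names are illustrative.

```python
from math import prod

# Number of values tried for each hyperparameter in param_grid
grid_sizes = {
    "clf__alpha": 100,           # np.linspace(1, 1000, 100)
    "clf__fit_prior": 2,         # True, False
    "add_poly_int__degree": 3,   # 1, 2, 3
    "feature_selection__k": 4,   # 'all', 10, 20, 30
}
n_folds = 5  # cv = 5

n_candidates = prod(grid_sizes.values())  # one candidate per combination
n_fits = n_candidates * n_folds           # each candidate is fitted once per fold
print(n_candidates, n_fits)  # prints: 2400 12000
```

Most of the cost comes from the 100 `clf__alpha` values, so a coarser or log-spaced grid for that parameter would be the first place to cut the number of fits.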


In [45]:
clf_search.best_estimator_.classes_

array([0, 1, 2], dtype=int64)

In [43]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true = y_train, y_pred = clf_search.predict(x_train)))
print(confusion_matrix(y_true = y_train, y_pred = clf_search.predict(x_train)))



              precision    recall  f1-score   support

           0       0.46      0.25      0.32      6668
           1       0.60      0.85      0.70     14307
           2       0.35      0.10      0.16      3951

    accuracy                           0.57     24926
   macro avg       0.47      0.40      0.39     24926
weighted avg       0.52      0.57      0.51     24926

[[ 1651  4828   189]
 [ 1624 12120   563]
 [  325  3220   406]]
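
The per-class figures in the report can be reproduced from the confusion matrix, where rows are true classes and columns are predicted classes. The following is a minimal plain-Python sketch using the matrix printed above; the helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix printed above (rows = true class, columns = predicted class)
cm = [
    [1651, 4828, 189],
    [1624, 12120, 563],
    [325, 3220, 406],
]

def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

metrics = per_class_metrics(cm)
for label, (p, r, f) in enumerate(metrics):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Class 1 (M) absorbs most predictions: 4828 of the 6668 class-0 samples and 3220 of the 3951 class-2 samples are predicted as class 1, which explains the low recall for classes 0 and 2.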


In [44]:
balanced_accuracy_score(y_true = y_train, y_pred = clf_search.predict(x_train))

0.39916567995876057
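
Balanced accuracy is the unweighted mean of the per-class recalls, so it matches the macro-average recall in the report (0.40). A small sketch, assuming the confusion matrix printed earlier (rows = true class, columns = predicted class):

```python
# Confusion matrix from the training-set evaluation above
cm = [
    [1651, 4828, 189],
    [1624, 12120, 563],
    [325, 3220, 406],
]

# Recall of class k = correctly predicted samples of class k / all true samples of class k
recalls = [row[k] / sum(row) for k, row in enumerate(cm)]

# Balanced accuracy = unweighted mean of the per-class recalls
balanced_accuracy = sum(recalls) / len(recalls)
print(round(balanced_accuracy, 4))  # prints: 0.3992
```

With three classes, a classifier that always predicts a single class scores 1/3 on this metric, so 0.399 is only modestly above that baseline.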

In [61]:
x_train.head()

Unnamed: 0,KNUMBER,DATASET,COMPLEXITY,PRODUCTCODE_IYE_x,PRODUCTCODE_IYN_x,PRODUCTCODE_JJX_x,PRODUCTCODE_LYZ_x,PRODUCTCODE_NBW_x,CLASSADVISECOMM_AN_x,CLASSADVISECOMM_HE_x,...,PRODUCTCODE_IYN_y,PRODUCTCODE_JJX_y,PRODUCTCODE_LYZ_y,PRODUCTCODE_NBW_y,CLASSADVISECOMM_AN_y,CLASSADVISECOMM_HE_y,CLASSADVISECOMM_IM_y,CLASSADVISECOMM_MI_y,CLASSADVISECOMM_RA_y,CLASSADVISECOMM_TX_y
0,K033484,train,H,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,K043617,train,L,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,K053310,train,H,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,K053527,train,H,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,K060387,train,H,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
y_train

array(['H', 'L', 'H', ..., 'L', 'L', 'L'], dtype=object)

In [16]:
train_knumber.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24926 entries, 0 to 24925
Data columns (total 4 columns):
KNUMBER         24926 non-null object
DECISIONDATE    24926 non-null object
DECISIONYEAR    24926 non-null int64
DATASET         24926 non-null object
dtypes: int64(1), object(3)
memory usage: 779.0+ KB
