# Step 3 - Build Model

### Domain and Data

Here I am working with the Madelon data set, a synthetic data set with many variables and a high degree of non-linearity.  Each model is trained on a limited number of features chosen with the SelectKBest function, using the values kBest = [2, 8, 30] determined in the previous step.

### Problem Statement

Here I intend to perform a series of grid searches to obtain a better predictor for the Madelon data set.  I will consider not only the Logistic Regression method used in the previous step, but also k-nearest neighbors and SVC classifiers.  Given the highly non-linear behavior of the Madelon data set, these classifiers may provide better accuracy than the linear logistic regression classifier.  I hope to significantly improve upon the low accuracies found in the previous sections. 

### Solution Statement

Beyond the pipeline used in the previous section, here the data set must be further split.  The SelectKBest transformation is applied in a loop, each time working on a copy of the original data dictionary.  Each model is then wrapped in a GridSearchCV object with a range of chosen parameters.  The cross-validated accuracy of the best model from each grid search on the training set is displayed below, and the best of these was re-fit and tested on the validation set.
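
The helper functions in `lib/project_5.py` are not reproduced in this notebook, so the following is only a minimal standalone sketch of the loop described above, written with plain scikit-learn calls. The synthetic data from `make_classification`, the reduced parameter grid, and the names `results` and `X_train_s` are assumptions for illustration; they are not the Madelon data or the project helpers.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: NOT the Madelon set).
X, y = make_classification(n_samples=400, n_features=40, n_informative=5,
                           random_state=742)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=742)

# Scale using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

results = []
for k in [2, 8, 30]:
    # Keep the k highest-scoring features, then grid-search a classifier on them.
    selector = SelectKBest(k=k).fit(X_train_s, y_train)
    X_k = selector.transform(X_train_s)
    grid = GridSearchCV(KNeighborsClassifier(),
                        param_grid={'n_neighbors': [1, 5, 11],
                                    'weights': ['uniform', 'distance']},
                        cv=3)
    grid.fit(X_k, y_train)
    results.append({'k': k,
                    'cv_accuracy': grid.best_score_,
                    'params': grid.best_params_})

for row in results:
    print(row)
```

The validation split is deliberately left untouched inside the loop, so it can still serve as an honest check of whichever model is finally chosen.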

In [28]:
from lib.project_5 import load_data_from_database, make_data_dict, general_model, general_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

data = load_data_from_database()
data_dict = make_data_dict(data, rand_seed = 742)
data_dict = general_transformer(StandardScaler(), data_dict)

kBest = [2,8,30]

data_dicts = []

for k in kBest:
    l2_weights = [1,0.5,0.2,0.1,0.05,0.04,0.03,0.02,0.01]
    k_dict = general_transformer(SelectKBest(k = k), data_dict.copy())

    logistic_params = {'C' : l2_weights}
    logistic_grid = GridSearchCV(LogisticRegression(penalty='l2',solver='liblinear',fit_intercept=True),
                                 param_grid=logistic_params)
    this_dict = general_model(k_dict.copy(), logistic_grid)
    this_dict['name'] = "Logistic_"+str(k)+"_Features"
    data_dicts.append(this_dict)

    knn_params = {'n_neighbors' : [1,3,5,11,21,51],'weights':['uniform','distance'],'p':[1,2]}
    knn_grid = GridSearchCV(KNeighborsClassifier(),param_grid=knn_params)
    this_dict = general_model(k_dict.copy(), knn_grid)
    this_dict['name'] = "KNN_"+str(k)+"_Features"
    data_dicts.append(this_dict)

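    # NOTE: svc_penalty is defined but never used; the grid below searches C
    # over l2_weights, which is the grid that produced the reported results.
    svc_penalty = [1e3,3e2,1e2,30,10,3,1,3e-1,1e-1,3e-2,1e-2,3e-3,1e-3]
    svc_params = {'C' : l2_weights,'kernel':['rbf','sigmoid']}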
    svc_grid = GridSearchCV(SVC(),param_grid=svc_params)
    this_dict = general_model(k_dict.copy(), svc_grid)
    this_dict['name'] = "SVC_"+str(k)+"_Features"
    data_dicts.append(this_dict)

In [29]:
import pandas as pd

pd.DataFrame([{"Name":data_dicts[i]['name'],
               "Accuracy":data_dicts[i]['metrics'][0].iloc[0,1]}
              for i in range(len(data_dicts))])[["Name","Accuracy"]]

|   | Name | Accuracy |
|---|------|----------|
| 0 | Logistic_2_Features | 0.619287 |
| 1 | KNN_2_Features | 0.600705 |
| 2 | SVC_2_Features | 0.606486 |
| 3 | Logistic_8_Features | 0.599976 |
| 4 | KNN_8_Features | 0.820713 |
| 5 | SVC_8_Features | 0.752839 |
| 6 | Logistic_30_Features | 0.632772 |
| 7 | KNN_30_Features | 0.7579 |
| 8 | SVC_30_Features | 0.711482 |


In [30]:
print("Best Model:\n{}".format(data_dicts[4]['models'][0].best_estimator_))

Best Model:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance')
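
The index `4` used above was read off the accuracy table by hand. As a small sketch, the same row can be located programmatically; the accuracies below are copied from that table, and the names `scores` and `best_index` are introduced here only for illustration.

```python
import pandas as pd

# Cross-validation accuracies copied from the results table above.
scores = pd.DataFrame({
    "Name": ["Logistic_2_Features", "KNN_2_Features", "SVC_2_Features",
             "Logistic_8_Features", "KNN_8_Features", "SVC_8_Features",
             "Logistic_30_Features", "KNN_30_Features", "SVC_30_Features"],
    "Accuracy": [0.619287, 0.600705, 0.606486,
                 0.599976, 0.820713, 0.752839,
                 0.632772, 0.7579, 0.711482],
})

# idxmax returns the index label of the row with the highest accuracy.
best_index = scores["Accuracy"].idxmax()
print("{} {}".format(best_index, scores.loc[best_index, "Name"]))
# prints: 4 KNN_8_Features
```

Looking the index up this way avoids a stale hard-coded value if the list of models or the grid is later changed.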


In [31]:
k_dict = general_transformer(SelectKBest(k = 8), data_dict.copy())
best_dict = general_model(k_dict.copy(), data_dicts[4]['models'][0].best_estimator_,test_scores=True)
best_dict['metrics'][0]

|   | Score | Cross-Validation | Validation |
|---|-------|------------------|------------|
| 0 | accuracy | 0.822126 | 0.795 |
| 1 | roc_auc | 0.88916 | 0.79487 |
| 2 | precision | 0.835556 | 0.818841 |
| 3 | recall | 0.806016 | 0.755853 |
| 4 | f1 | 0.819776 | 0.786087 |


### Metric

I use accuracy as the metric, for the same reasons described in the previous steps.  By that measure, the best model is a k-Nearest Neighbors model using 8 features, which considers the 5 closest neighbors and weights each by the inverse of its Euclidean distance from the query point.  It has a cross-validated accuracy of 82.2% on the training set, and 79.5% on the validation set.  The minor discrepancy with the 82.1% reported in the grid-search results is probably due to a mismatch of random seeds.
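
To make the effect of `weights='distance'` concrete, here is a small sketch of inverse-distance voting on a made-up one-dimensional example. The function `weighted_knn_predict` and the toy points are hypothetical and exist only for this illustration; scikit-learn's own implementation is more general.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def weighted_knn_predict(X_train, y_train, x, n_neighbors=5):
    """Predict one label by inverse-Euclidean-distance weighted voting."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:n_neighbors]
    d = distances[nearest]
    if np.any(d == 0):
        # Exact matches take the whole vote, avoiding division by zero.
        weights = (d == 0).astype(float)
    else:
        weights = 1.0 / d
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# Toy data: two class-0 points close to the query, three class-1 points far away.
X_toy = np.array([[0.0], [0.1], [1.0], [1.1], [1.2]])
y_toy = np.array([0, 0, 1, 1, 1])
query = np.array([0.3])

manual = weighted_knn_predict(X_toy, y_toy, query)
uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform') \
    .fit(X_toy, y_toy).predict([query])[0]
weighted = KNeighborsClassifier(n_neighbors=5, weights='distance') \
    .fit(X_toy, y_toy).predict([query])[0]
print("manual={} uniform={} weighted={}".format(manual, uniform, weighted))
```

With all five points voting, a uniform majority favors class 1 (three votes to two), while inverse-distance weighting favors class 0 because the two class-0 points lie much closer to the query.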

The other metrics behave similarly: cross-validation scores fall between 0.8 and 0.9, with only a moderate reduction when the model is applied to the validation set.
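
The internals of `general_model(..., test_scores=True)` are not shown in this notebook. As a rough sketch of how such a table of cross-validation and validation scores could be produced with scikit-learn alone, assuming synthetic data and a 3-fold split (both assumptions, as is the name `rows`):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with 8 features (assumption: NOT the Madelon set).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=742)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=742)

# Same hyperparameters as the best estimator reported above.
model = KNeighborsClassifier(n_neighbors=5, weights='distance', p=2)
model.fit(X_train, y_train)

rows = []
for name in ['accuracy', 'roc_auc', 'precision', 'recall', 'f1']:
    # cross_val_score clones the model, so the fitted copy is left untouched.
    cv_score = cross_val_score(model, X_train, y_train,
                               scoring=name, cv=3).mean()
    # A scorer is called as scorer(estimator, X, y) on the held-out split.
    val_score = get_scorer(name)(model, X_val, y_val)
    rows.append({'Score': name,
                 'Cross-Validation': cv_score,
                 'Validation': val_score})

for row in rows:
    print(row)
```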

### Benchmark

Using the full pipeline and a broader selection of classification models than a simple logistic regression, I have identified a model which improves significantly on the baseline accuracy.  A naive logistic regression on the Madelon data set produced only 53.6% accuracy on the training set itself, only marginally better than the baseline accuracy of 50%.  In contrast, a k-Nearest Neighbors model selected by grid search and using only 8 of the features improved upon this significantly, reaching 79.5% accuracy on the validation set.

On a conceptual level, the Madelon data set has significant non-linearity, so it is unsurprising that the logistic regression models performed poorly.  SVC and k-Nearest Neighbors classifiers had significantly better performance not only for 8 features, but also for 30, with k-Nearest Neighbors the more accurate of the two in both cases.  This may imply that the data are clustered in a way that both SVC and k-Nearest Neighbors can exploit, but that the classes are not separated widely enough for a support vector classifier to perform as well.
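
The gap between a linear and a neighborhood-based classifier on non-linear data can be illustrated with a toy XOR-style problem. This is only a sketch on made-up two-dimensional data, not a claim about the exact structure of Madelon; the names `log_acc` and `knn_acc` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy XOR-style data: the label depends on the sign of the product of the
# two coordinates, so no single straight line separates the classes.
rng = np.random.RandomState(742)
X = rng.uniform(-1, 1, size=(600, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

log_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()

print("logistic regression: {:.3f}".format(log_acc))
print("k-nearest neighbors: {:.3f}".format(knn_acc))
```

On data like this the logistic regression should stay close to chance, while k-Nearest Neighbors can follow the local structure of the classes.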

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/build_model.png" width="600px">