In [1]:
# Enter your name(s) here
# Alan Tran and John Smith

# Assignment 4 : Using `scikit-learn`

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. In this assignment you'll explore how to train various classifiers using the `scikit-learn` library. The scikit-learn documentation can be found [here](http://scikit-learn.org/stable/documentation.html).

In this assignment we'll attempt to classify patients as either having or not having diabetic retinopathy, using the same Diabetic Retinopathy data set from your previous assignments. Recall that this dataset contains 1151 records and 20 attributes (some categorical, some continuous). You can find additional details about the dataset [here](http://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set).

In [2]:
#You may add additional imports
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time


# Custom
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier


In [3]:
%matplotlib inline

In [4]:
# Read the data from csv file
# Build the 20 column names; the ma* and exudate* names carry their column index
col_names = (
    ['quality', 'prescreen']
    + ['ma' + str(i) for i in range(2, 8)]
    + ['exudate' + str(i) for i in range(8, 16)]
    + ['euDist', 'diameter', 'amfm_class', 'label']
)

data = pd.read_csv("messidor_features.txt", names = col_names)
print(data.shape)
data.head(10)

(1150, 20)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class,label
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0,1
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0,1
5,1,1,44,43,41,41,37,29,28.3564,6.935636,2.305771,0.323724,0.0,0.0,0.0,0.0,0.502831,0.126741,0,1
6,1,0,29,29,29,27,25,16,15.448398,9.113819,1.633493,0.0,0.0,0.0,0.0,0.0,0.541743,0.139575,0,1
7,1,1,6,6,6,6,2,1,20.679649,9.497786,1.22366,0.150382,0.0,0.0,0.0,0.0,0.576318,0.071071,1,0
8,1,1,22,21,18,15,13,10,66.691933,23.545543,6.151117,0.496372,0.0,0.0,0.0,0.0,0.500073,0.116793,0,1
9,1,1,79,75,73,71,64,47,22.141784,10.054384,0.874633,0.09978,0.023386,0.0,0.0,0.0,0.560959,0.109134,0,1


### A. Data prep

Q1. All of the classifiers in `scikit-learn` require that you separate the feature columns from the class label column, so go ahead and do that first. You should end up with two separate data frames: one that contains all of the feature values and one that contains the class labels. 

Note: Later in this assignment, you may get a warning stating "a column-vector was passed when a 1d array was expected." This indicates that some function wants a _flat array_ of labels, rather than a 2D DataFrame of labels. You can go ahead and transform the labels into a flat array here by doing either `labels.values.ravel()` or `labels.iloc[:,0]`. And you can just use that flat array for everything.
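As a small illustration of the note above, the sketch below uses a hypothetical three-row frame (not the assignment data) to show how a one-column DataFrame of labels differs in shape from the flat versions produced by the two suggested conversions.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show the shapes involved
df = pd.DataFrame({'feature': [0.1, 0.2, 0.3], 'label': [0, 1, 0]})

labels_2d = df[['label']]               # DataFrame with one column: shape (3, 1)
labels_flat = labels_2d.values.ravel()  # NumPy array: shape (3,)
labels_series = labels_2d.iloc[:, 0]    # pandas Series: shape (3,)

print(labels_2d.shape, labels_flat.shape, labels_series.shape)
```

Either flat form can be passed wherever a 1d array of labels is expected.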

Print the `shape` of your features data frame, the shape or len of your labels dataframe or array, and the `head` of the features data frame.

In [5]:
# Keep the labels as a flat 1-D array, as the note above recommends
labels = data['label'].values.ravel()
features = data.drop('label', axis=1)

print("Shape of features dataframe: ", features.shape)
print("Shape of labels: ", labels.shape)
features.head()

Shape of features dataframe:  (1150, 19)
Shape of labels:  (1150,)


Unnamed: 0,quality,prescreen,ma2,ma3,ma4,ma5,ma6,ma7,exudate8,exudate9,exudate10,exudate11,exudate12,exudate13,exudate14,exudate15,euDist,diameter,amfm_class
0,1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1
1,1,1,24,24,22,18,16,13,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0
2,1,1,62,60,59,54,47,33,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0
3,1,1,55,53,53,50,43,31,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0
4,1,1,44,44,44,41,39,27,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0


### B. Decision Trees (DT) & Cross Validation

**Train/Test Split**

Q2. You can train a classifier using the holdout method by splitting your data into a training set and a test set, then evaluating the classifier on the held-out test set. 

Let's try this with a decision tree classifier. 

* Use `sklearn.model_selection.train_test_split` to split your dataset into training and test sets (do an 80%-20% split). Display how many records are in the training set and how many are in the test set.
* Use `sklearn.tree.DecisionTreeClassifier` to fit a decision tree classifier on the training set. Use entropy as the split criterion. 
* Now that the tree has been learned from the training data, we can run the test data through and predict classes for the test data. Use the `predict` method of `DecisionTreeClassifier` to classify the test data. 
* Then use `sklearn.metrics.accuracy_score` to print out the accuracy of the classifier on the test set.

In [6]:

X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2, random_state=100)
dt = DecisionTreeClassifier(criterion='entropy', random_state=100)

dt = dt.fit(X_train, Y_train)

predicted_val = dt.predict(X_test)
acc = accuracy_score(Y_test, predicted_val)
print("Accuracy: ", acc)

Accuracy:  0.6217391304347826


Q3. Note that the DecisionTree classifier has many parameters that can be set. Try tweaking parameters like split criterion, max_depth, min_impurity_decrease, min_samples_leaf, min_samples_split, etc. to see how they affect accuracy. Print the accuracy of a few different variations.

In [7]:
dt = DecisionTreeClassifier(criterion='gini', max_depth=5, min_impurity_decrease=0.05, min_samples_leaf=2, min_samples_split=2, random_state=100)
dt = dt.fit(X_train, Y_train)
predicted_val = dt.predict(X_test)
acc = accuracy_score(Y_test, predicted_val)
print("Accuracy: ", acc)
# Changing the impurity decrease to a value above 0 causes the accuracy to tank.

dt = DecisionTreeClassifier(criterion='entropy', max_depth=None, min_impurity_decrease=0.05, min_samples_leaf=2, min_samples_split=2, random_state=100)
dt = dt.fit(X_train, Y_train)
predicted_val = dt.predict(X_test)
acc = accuracy_score(Y_test, predicted_val)
print("Accuracy: ", acc)
# Switching the criterion to entropy and removing the depth limit does not help: with min_impurity_decrease=0.05 the accuracy drops slightly.

dt = DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=100)
dt = dt.fit(X_train, Y_train)
predicted_val = dt.predict(X_test)
acc = accuracy_score(Y_test, predicted_val)
print("Accuracy: ", acc)
# Limiting max_depth to 10 yields slightly better accuracy than the unrestricted tree in Q2, probably because the shallower tree overfits less.

dt = DecisionTreeClassifier(criterion='gini', max_depth=5, min_impurity_decrease=0, min_samples_leaf=5, min_samples_split=200, random_state=100)
dt = dt.fit(X_train, Y_train)
predicted_val = dt.predict(X_test)
acc = accuracy_score(Y_test, predicted_val)
print("Accuracy: ", acc)
# Setting min_samples_split to a high number (200) yielded the best accuracy of these variations.

Accuracy:  0.5608695652173913
Accuracy:  0.5478260869565217
Accuracy:  0.6260869565217392
Accuracy:  0.7


**Cross Validation**

Q4. You have now built a decision tree and tested its accuracy using the "holdout" method. But as discussed in class, this is not sufficient for estimating generalization accuracy. Instead, we should use Cross Validation to get a better estimate of accuracy. 

Use `sklearn.model_selection.cross_val_score` to perform 10-fold cross validation on a decision tree. You will pass the FULL dataset into `cross_val_score` which will automatically divide it into the number of folds you tell it to, train a decision tree model on the training set for each fold, and test it on the test set for each fold. It will return a numpy array with the accuracy out of each fold. Average these accuracies to print out the generalization accuracy of the model.

In [8]:
duplicate_dt = DecisionTreeClassifier()
accuracies = cross_val_score(duplicate_dt, features, labels, cv=10)
print("Accuracy from CV: ", accuracies.mean())

Accuracy from CV:  0.6104347826086955


**Nested Cross Validation**

Q5. Now we want to tune our model to use the best parameters to avoid overfitting to our training data. Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters (hyperparameters) specified in a grid. 
* Use `sklearn.model_selection.GridSearchCV` to find the best `max_depth`, `max_features`, and `min_samples_leaf` for your tree. Use a 5-fold-CV and 'accuracy' for the scoring criteria.
* Try the values [5,10,15,20] for `max_depth` and `min_samples_leaf`. Try [5,10,15] for `max_features`. 
* Print out the best value for each of the tested parameters (`best_params_`).
* Print out the accuracy of the model with these best values (`best_score_`).
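To get a feel for how much work the grid search does, the sketch below enumerates the grid described above with `sklearn.model_selection.ParameterGrid`. The variable names are illustrative. It only counts combinations; it does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the bullets above
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'max_features': [5, 10, 15],
    'min_samples_leaf': [5, 10, 15, 20],
}

combinations = list(ParameterGrid(param_grid))
n_folds = 5

print("Combinations:", len(combinations))
print("Model fits during the search:", len(combinations) * n_folds)
print("First combination:", combinations[0])
```

With 4 x 3 x 4 = 48 combinations and 5 folds, the search trains 240 trees before refitting the best one on all of the data it was given.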

In [9]:
params = {'max_depth':[5, 10, 15, 20], 'max_features': [5, 10, 15], 'min_samples_leaf':[5, 10, 15, 20]}
grid_cv_score = GridSearchCV(duplicate_dt, param_grid=params, cv=5)
grid_cv_score.fit(features, labels)
print("Best parameters: ", grid_cv_score.best_params_)
print("Best accuracy: ", grid_cv_score.best_score_)

Best parameters:  {'max_depth': 15, 'max_features': 15, 'min_samples_leaf': 20}
Best accuracy:  0.6478260869565217


Q6. What you did in Q5 performed the _inner_ loop of a nested CV (no test set was held out). What you did in Q4 performed an _outer_ loop of CV (holds out a test set). Now we need to combine them to perform the nested cross-validation that we discussed in class. To do this, you'll need to pass a `GridSearchCV` into a `cross_val_score`. 

What this does is: the `cross_val_score` splits the data into train and test sets for the first outer fold, and it passes the train set into `GridSearchCV`. `GridSearchCV` then splits that set into train and validation sets for k number of folds (the inner CV loop). The hyper-parameters for which the average score over all inner iterations is best are reported as the `best_params_`, `best_score_`, and `best_estimator_` (best decision tree). This best decision tree is then evaluated with the test set from the `cross_val_score` (the outer CV loop). And this whole thing is repeated for the remaining k folds of the `cross_val_score` (the outer CV loop). 
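For intuition, the process described above can be unrolled into an explicit loop. This is only a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the retinopathy data, a deliberately small grid, and 3 folds for both loops so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'max_depth': [3, 5], 'min_samples_leaf': [5, 10]}
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune hyperparameters using only the outer training fold
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the refit best estimator on the held-out outer test fold
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("Accuracy per outer fold:", outer_scores)
print("Nested CV accuracy:", np.mean(outer_scores))
```

Note that each outer fold may select different hyperparameters; the reported number estimates the accuracy of the whole tune-then-train procedure.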

That is a lot of explanation for a very complex (but IMPORTANT) process, which can all be performed with a single line of code!

Be patient for this one to run. The nested cross-validation loop can take some time. A [ * ] next to the cell indicates that it is still running.

Print the accuracy of your tuned, cross-validated model. This is the official accuracy that you would report for your model.

In [10]:
# cv is not set here, so cross_val_score falls back to its default for the outer loop (5-fold in scikit-learn 0.22 and later)
accuracies = cross_val_score(grid_cv_score, features, labels)
print("Accuracy from CV: ", accuracies.mean())

Accuracy from CV:  0.5921739130434783


### C. Naive Bayes (NB) & Evaluation Metrics

`sklearn.naive_bayes.GaussianNB` implements the Gaussian Naive Bayes algorithm for classification. This means that the likelihood of continuous features is estimated using a Gaussian distribution. (Refer to slide 13 of the Naive Bayes powerpoint notes.)

Q7. Create a `sklearn.naive_bayes.GaussianNB` classifier. Use `sklearn.model_selection.cross_val_score` to do a 10-fold cross validation on the classifier. Display the accuracy.

In [11]:
nb_classifier = GaussianNB()
accuracies = cross_val_score(nb_classifier, features, labels, cv=10)
print("Accuracy:", accuracies.mean())

Accuracy: 0.5947826086956522


Q8. `cross_val_score` returns the scores of every test fold. There is another function called `cross_val_predict` that returns predicted y values for every record in the test fold. In other words, for each element in the input, `cross_val_predict` returns the prediction that was obtained for that element when it was in the test set. 

* Use `cross_val_predict` and `sklearn.metrics.confusion_matrix` to print the confusion matrix for the classifier.

* Scikit-learn also provides a useful function `sklearn.metrics.classification_report` for evaluating the classifier on a per-class basis. It is a summary of the precision, recall, and F1 score for each class (and support is just the actual class count). Display the classification report for your Naive Bayes classifier.

In [12]:
cv_predict = cross_val_predict(nb_classifier, features, labels, cv=10)
cm = sk.metrics.confusion_matrix(labels, cv_predict)
print(cm)

class_report = sk.metrics.classification_report(labels, cv_predict)
print(class_report)

[[500  39]
 [427 184]]
              precision    recall  f1-score   support

           0       0.54      0.93      0.68       539
           1       0.83      0.30      0.44       611

    accuracy                           0.59      1150
   macro avg       0.68      0.61      0.56      1150
weighted avg       0.69      0.59      0.55      1150
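
The numbers in the report can be reproduced by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and recomputes precision, recall, F1, and support for each class; in scikit-learn's confusion matrix, rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows are actual classes, columns are predicted classes
cm = [[500, 39],
      [427, 184]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for c in (0, 1):
    true_pos = cm[c][c]
    predicted_as_c = cm[0][c] + cm[1][c]  # column sum
    actually_c = sum(cm[c])               # row sum (the "support")
    precision = true_pos / predicted_as_c
    recall = true_pos / actually_c
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = (precision, recall, f1, actually_c)
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actually_c}")

print(f"accuracy={accuracy:.2f}")
```

The low recall for class 1 comes from the 427 actual positives that the classifier labeled as negative.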



### D. k-Nearest Neighbor (KNN) & Pipelines 

For some classification algorithms, scaling of the data is critical (like KNN, SVM, Neural Nets). For other classification algorithms, data scaling is not necessary (like Naive Bayes and Decision Trees). _Take a minute to think about why this is the case!!_ But using scaled data with an algorithm that doesn't explicitly need it to be scaled does not hurt the results of that algorithm.

Q10. The distance calculation method is central to the KNN algorithm. By default, `KNeighborsClassifier` uses Euclidean distance as its metric (but this can be changed). Because of the distance calculations, it is critical to scale the data before running Nearest Neighbor!
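
A small sketch of why scaling matters for distance calculations. The four records below are hypothetical (made up for illustration): one column lives on a 0-1 scale and the other is a count in the tens. On the raw data the count column dominates the Euclidean distance, so the nearest neighbor of the first record changes once both columns are standardized.

```python
import numpy as np

# Hypothetical records: column 0 is on a 0-1 scale, column 1 is a count in the tens
X = np.array([[0.50, 20.0],
              [0.90, 22.0],
              [0.52, 45.0],
              [0.88, 58.0]])

def nearest_to_first(points):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

# Standardize each column to mean 0 and variance 1 (what StandardScaler does)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nearest = nearest_to_first(X)
scaled_nearest = nearest_to_first(X_scaled)
print("Nearest neighbor of row 0 on raw data:   ", raw_nearest)
print("Nearest neighbor of row 0 on scaled data:", scaled_nearest)
```

On the raw data, row 1 looks closest even though it sits at the opposite end of the 0-1 column; after scaling, both columns contribute and row 2 becomes the nearest neighbor.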

We discussed why dimensionality reduction may also be needed with KNN because of the curse of dimensionality. So we may want to also perform a dimensionality reduction with PCA before running KNN. PCA should only be performed on scaled data! (Remember that you can also reduce dimensionality by performing feature selection and feature engineering.) 

An important note about scaling data and dimensionality reduction is that they should only be performed on the **training** data, then you transform the test data into the scaled, PCA space that was found on the training data. (Refer to the concept of [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/).)

So when you are doing cross-validation, the scaling and PCA need to happen *inside of your CV loop*. This way, they are performed on the training set for the first fold, then the test set is put into that space. On the second fold, they are performed on the training set for the second fold, and the test set is put into that space. And so on for the remaining folds. 
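
For a single train/test split, the procedure described above looks roughly like the sketch below. It assumes synthetic data from `make_classification` and an arbitrary choice of 5 PCA components; the key point is that `fit` is only ever called on the training portion, while the test portion is only transformed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the retinopathy dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Learn the scaling and the PCA projection from the training data only
scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=5).fit(scaler.transform(X_tr))

# Apply the already-fitted transforms to both sets (no refitting on test data)
X_tr_p = pca.transform(scaler.transform(X_tr))
X_te_p = pca.transform(scaler.transform(X_te))

knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr_p, y_tr)
print("Test accuracy:", knn.score(X_te_p, y_te))
```

Repeating this by hand for every fold is tedious and error-prone, which is the motivation for bundling the steps into a single estimator.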

In order to do this with scikit-learn, you must create what's called a `Pipeline` and pass that in to the cross validation. This is a very important concept for Data Mining and Machine Learning, so let's practice it here.

Do the following:
* Create a `sklearn.preprocessing.StandardScaler` object to standardize the dataset’s features (mean = 0 and variance = 1). (Do not call `fit` on it yet. Just create the `StandardScaler` object.)
* Create a `sklearn.decomposition.PCA` object to perform PCA dimensionality reduction. (Do not call `fit` on it yet. Just create the `PCA` object.)
* Create a `sklearn.neighbors.KNeighborsClassifier`. The number of neighbors defaults to 5 (k=5). Go ahead and change it to 7. (Do not call `fit` on it yet. Just create the `KNeighborsClassifier` object.)
* Create a `sklearn.pipeline.Pipeline` object and set the `steps` to the scaler, the PCA, and the KNN objects that you just created. 
* Pass the `pipeline` object in to a `cross_val_score` as the estimator, along with the features and the labels, and use a 5-fold-CV. 

In each fold of the cross validation, the training phase will use _only_ the training data for scaling, PCA, and training the model. Then the testing phase will scale & transform the test data into the PCA space (found on the training data) and run the test data through the trained classifier, to return an accuracy measurement for each fold. Print the average accuracy across all 5 folds. 

In [51]:
scaler = StandardScaler() 
pca = PCA()
knn = KNeighborsClassifier(n_neighbors=7)
pipeline = Pipeline(steps=[('scaler', scaler), ('pcal', pca), ('knnl', knn)])

scores = cross_val_score(estimator=pipeline, X=features, y=labels, cv=5)
print(scores.mean())

0.6182608695652174


Q11. Another important part of KNN is choosing the best number of neighbors (tuning the hyperparameter, k). We can use nested cross validation to do this. Let's try k values from 1-25 to find the best one. 

We _also_ want to find the best number of dimensions to project down onto using PCA. We can use nested cross validation to do this as well. Let's try from 5-19 dimensions.

* Starter code is provided to create the "parameter grid" to search. You will need to change this code! Where I have "knn__n_neighbors", this indicates that I want to tune the "n_neighbors" parameter in the "knn" part of the pipeline. When you created your pipeline above, you named the KNN part of the pipeline with a string. You should replace "knn" in the param_grid below with whatever you named your KNN part of the pipeline: **<replace_this>__n_neighbors.** Do the same for the PCA part of the pipeline.
* Create a `sklearn.model_selection.GridSearchCV` and pass in the pipeline, the param_grid, and set it to a 5-fold-CV.
* Now, on that `GridSearchCV` object, call `fit` and pass in the features and labels.
* Show the best number of dimensions and best number of neighbors for this dataset by printing the `best_params_` from the `GridSearchCV`.
* Also print the accuracy when using this best number of dimensions and neighbors by printing the `best_score_` from the `GridSearchCV`.

Be patient, this can take some time to run. It is trying every combination of dimensions from 5-19 with every k from 1-25! A [ * ] next to the cell indicates that it is still running.
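
If you are unsure which parameter names a pipeline accepts, you can list them. The sketch below uses hypothetical step names (`scale`, `reduce`, `clf`) to show how each step name becomes the prefix before the double underscore.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names here are hypothetical; use whatever names you gave your own steps
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       ('reduce', PCA()),
                       ('clf', KNeighborsClassifier())])

# get_params() lists every tunable parameter as <step name>__<parameter name>
tunable = sorted(pipe.get_params().keys())
print([name for name in tunable if name.startswith(('reduce__', 'clf__'))])
```

Any of the printed names can be used as a key in the `param_grid` passed to `GridSearchCV`.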

In [52]:
'''
On the "pca" part of the pipeline, 
tune the n_components parameter,
by trying the values 5-19.

On the "knn" part of the pipeline, 
tune the n_neighbors parameter,
by trying the values 1-25.
'''
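param_grid = {
    # range() excludes its stop value, so these lists cover 5-18 components
    # and 1-24 neighbors; range(5, 20) and range(1, 26) would include 19 and 25.
    'pcal__n_components': list(range(5, 19)),
    'knnl__n_neighbors': list(range(1, 25))
}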
knn_tune = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
knn_tune.fit(features, labels)

print("Best dimensions:", knn_tune.best_params_)
print("Best score:", knn_tune.best_score_)

Best dimensions: {'knnl__n_neighbors': 23, 'pcal__n_components': 14}
Best score: 0.6617391304347826


Q12. In Q11, we did not hold out a test set. The accuracy reported out is on the _validation_ set. So now we need to wrap the whole process in another cross-validation to perform a nested cross-validation and report the accuracy of this KNN model on unseen test data. This is the official accuracy you would report on this model.

You'll need to pass the `GridSearchCV` into a `cross_val_score`, just as you did with the decision tree. Use a 5-fold-CV for the outer loop. 

Again, be patient for this one to run. The nested cross-validation loop can take some time. It is doing what it did above in Q11 five times. A [ * ] next to the cell indicates that it is still running. (Just for comparison, mine takes about 2 mins to run and the fan revs up so it sounds like my computer is going to explode. All computers are different, so yours could take shorter or longer...)

<img src="model_is_training.png" width="250">

In [15]:
accuracies = cross_val_score(knn_tune, features, labels, cv=5)
print("Accuracy from CV: ", accuracies.mean())

Accuracy from CV:  0.6417391304347827


### E. Support Vector Machines (SVM)

Q13. Now put it all together with an SVM. 
* Create a `pipeline` that includes scaling, PCA, and an `sklearn.svm.SVC`.
* Create a parameter grid that tries number of dimensions from 5-19 and SVM kernels `linear`, `rbf` and `poly`.
* Create a `GridSearchCV` for the inner CV loop. Use a 5-fold CV.
* Run a `cross_val_predict` with a 10-fold CV for the outer loop. 
* Print out the accuracy and the classification report of using an SVM classifier on this data.

In [16]:

scaler_svm = StandardScaler()
pca_svm = PCA()
svc = SVC()
pipeline_svm = Pipeline(steps=[('scaler', scaler_svm), ('pca', pca_svm), ('svc', svc)])


param_grid_svm = {
    # range(5, 19) covers 5-18 components; range(5, 20) would also include 19
    'pca__n_components': list(range(5, 19)), 
    'svc__kernel': ('linear', 'rbf', 'poly')
}

svc_tune = GridSearchCV(pipeline_svm, param_grid_svm, cv=5)
predicted_labels = cross_val_predict(svc_tune, features, labels, cv=10)

accuracy = accuracy_score(labels, predicted_labels)

print("Accuracy:", accuracy)
print(classification_report(labels, predicted_labels))

Accuracy: 0.72
              precision    recall  f1-score   support

           0       0.66      0.84      0.74       539
           1       0.81      0.62      0.70       611

    accuracy                           0.72      1150
   macro avg       0.74      0.73      0.72      1150
weighted avg       0.74      0.72      0.72      1150



### F. Neural Networks (NN)

Q14. Train a multi-layer perceptron with a single hidden layer using `sklearn.neural_network.MLPClassifier`. 
* Create a pipeline with scaling and a neural net. (No PCA on this one. But scaling is critical to neural nets.)
* Use `GridSearchCV` with 5 fold cross validation to find the best hidden layer size and the best activation function. 
* Try values of `hidden_layer_sizes` ranging from `(30,)` to `(60,)` by increments of 10.
* Try activation functions `logistic`, `tanh`, `relu`.
* Wrap your `GridSearchCV` in a 5-fold `cross_val_score` and report the accuracy of your neural net.

Be patient, as this can take a few minutes to run. You may get ConvergenceWarnings as it runs - that is fine.

In [17]:
scaler_nn = StandardScaler()
mlp = MLPClassifier()
pipeline_nn = Pipeline([('scaler', scaler_nn),('mlp', mlp)])
ranges = [(i,) for i in range(30,70,10)]
param_grid_nn = {
    'mlp__hidden_layer_sizes': ranges, 
    'mlp__activation': ['logistic', 'tanh', 'relu']
}

nn_tune = GridSearchCV(pipeline_nn, param_grid_nn, cv=5)
nn_score = cross_val_score(nn_tune, features, labels, cv=5)
print(nn_score.mean())

0.7226086956521739


### G. Ensemble Classifiers

Ensemble classifiers combine the predictions of multiple base estimators to improve the accuracy of the predictions. One of the key assumptions that ensemble classifiers make is that the base estimators are built independently (so they are diverse).
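
To see why independence matters, the sketch below computes the error of a majority vote over `n` base classifiers under the idealized assumption that each one errs independently with the same probability. The function name and the example error rate of 0.35 are illustrative choices, not values from this assignment.

```python
from math import comb

def majority_vote_error(n_estimators, base_error):
    """Probability that more than half of n independent base classifiers are wrong.

    Assumes an odd number of estimators (no ties) and fully independent errors.
    """
    majority = n_estimators // 2 + 1
    return sum(
        comb(n_estimators, k) * base_error**k * (1 - base_error)**(n_estimators - k)
        for k in range(majority, n_estimators + 1)
    )

for n in (1, 5, 25):
    print(n, "estimators -> ensemble error", majority_vote_error(n, 0.35))
```

When the base error is below 0.5, the ensemble error shrinks as estimators are added; real base estimators are correlated, so the gain in practice is smaller than this idealized calculation suggests.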

**Random Forests**

Q15. Use `sklearn.ensemble.RandomForestClassifier` to classify the data. Scaling the data is not necessary for Decision Trees (take a minute to think about why). So, no need for a pipeline here.

Use a `GridSearchCV` with a 5-fold CV to tune the hyperparameters to get the best results. 
* Try `max_depth` ranging from 35-45
* Try `min_samples_leaf` of 8, 10, 12
* Try `max_features` of `"sqrt"` and `"log2"`

Wrap your GridSearchCV in a cross_val_score with 5-fold CV to report the accuracy of the model.

Be patient, this can take a few minutes to run.

In [18]:
rf = RandomForestClassifier()
param_grid_rf = {
    # range(35, 45) covers depths 35-44; range(35, 46) would also include 45
    'max_depth': range(35,45), 
    'min_samples_leaf': [8, 10, 12],
    'max_features':['sqrt', 'log2']
}

rf_tune = GridSearchCV(rf, param_grid_rf, cv=5)
rf_score = cross_val_score(rf_tune, features, labels, cv=5)
print(rf_score.mean())

0.68


**AdaBoost**

Random Forests are a kind of averaging ensemble classifier, where several estimators are built independently and their predictions are then averaged (by taking a vote). There is another method of training ensemble classifiers called *boosting*. Here the classifiers are trained sequentially, and each time the sampling of the training set depends on the performance of previously generated models.

Q16. Evaluate a `sklearn.ensemble.AdaBoostClassifier` on the data. By default, `AdaBoostClassifier` uses decision stumps as the base classifiers (but this can be changed). Use 150 base classifiers to make an `AdaBoostClassifier` and evaluate its accuracy with a 5-fold-CV.

In [19]:
ada_classifier = AdaBoostClassifier(n_estimators = 150)

ada_score = cross_val_score(ada_classifier, features, labels, cv=5)

print(ada_score.mean())

0.6991304347826087


### H. Build your final model

Now you have tested all kinds of classifiers on this data. Some have performed better than others. 

Q17. We may not want to deploy any of these models in the real world to actually diagnose patients because the accuracies are not high enough. What can we do to improve the accuracy rates? Answer as a comment:

In [20]:
'''
Answer here as a comment.
- Feature engineering to expose more informative patterns in the dataset
    * Feature selection to keep the features that carry those patterns
- Lower the dimensionality for the KNN model.
- Search over a wider range of hyperparameters
- Continue using ensemble methods, such as multiclass partitioning or stacking, that might add more features to the dataset 
'''

Q18. Let's say we *did* get to the point where we had a model with very high accuracy and we want to deploy that model and use it for real-world predictions.

* Let's say we're going to deploy our SVM classifier.
* We need to make one final version of this model, where we use ALL of our available data for training (we do not hold out a test set this time, so no outer cross-validation loop). 
* We need to tune the parameters of the model on the FULL dataset, so copy the code you entered for Q13, but remove the outer cross validation loop (remove `cross_val_predict`). Just run the `GridSearchCV` by calling `fit` on it and passing in the full dataset. This results in the final trained model with the best parameters for the full dataset. You can print out `best_params_` to see what they are.
* The accuracy of this model is what you assessed and reported in Q13.


* Use the `pickle` package to save your model. We have provided the lines of code for you, just make sure your final model gets passed in to `pickle.dump()`. This will save your model to a file called finalized_model.sav in your current working directory. 

In [21]:
import pickle

final_model = None

scaler_svm = StandardScaler()
pca_svm = PCA()
svc = SVC()
pipeline_svm = Pipeline(steps=[('scaler', scaler_svm), ('pca', pca_svm), ('svc', svc)])

param_grid_svm = {
    # Same grid as Q13: range(5, 19) covers 5-18 components and stops short of 19
    'pca__n_components': list(range(5, 19)), 
    'svc__kernel': ('linear', 'rbf', 'poly')
}

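svc_tune = GridSearchCV(pipeline_svm, param_grid_svm, cv=5)
svc_tune.fit(features, labels)
print(svc_tune.best_params_)

# The fitted GridSearchCV, refit on the full dataset with the best parameters, is the final model
final_model = svc_tune

filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(final_model, f)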

{'pca__n_components': 18, 'svc__kernel': 'linear'}


Q19. Now if someone wants to use your trained, saved classifier to classify a new record, they can load the saved model and just call predict on it. 
* Given this new record, classify it with your saved model and print out either "Negative for disease" or "Positive for disease."

In [22]:
# use this as the new record to classify
record = [ 0.05905386, 0.2982129, 0.68613149, 0.75078865, 0.87119216, 0.88615694,
  0.93600623, 0.98369184, -0.47426472, -0.57642756, -0.53115361, -0.42789774,
 -0.21907738, -0.20090532, -0.21496782, -0.2080998, 0.06692373, -2.81681183,
 -0.7117194 ]

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# predict expects a 2D array, so wrap the single record in a list
if loaded_model.predict([record])[0] == 1:
    print("Positive for disease.")
else:
    print("Negative for disease")

Positive for disease.
