# Assignment: Ensemble Methods: DSEI210-Fall-2019: Michael Grossberg

In this assignment we are going to get comfortable combining and leveraging ML algorithms with ensemble methods. To get respectable results we would almost certainly need to spend much more time on data normalization and cleaning than we will be able to. We are also having you subsample the data severely (so it will run in a reasonable amount of time), which will have a strong negative effect on classification.

**Background:** Suppose you are given a collection of high-importance dermatoscopic images of skin lesions. With a high degree of certainty, you want to determine whether a particular lesion is malignant or benign.

**Objective:** Experiment with different ensemble paradigms combined with different classifiers. See whether ensembles help, and how. How do they affect accuracy, precision, recall, f1, and variance vs. bias?

------------------------------------------------------------------------

## Introductory Data Analysis and Preprocessing

The path here will be to go from data and metadata to a standard matrix `X` where the rows are observations and the columns are features, in this case pixels. We also have a `y`, which is a binary label: `0` for a benign lesion and `1` for a malignant one.

### My Imports

Here is the extensive list of imports I had for the whole notebook. You may have some differences in your solution. There are other ways to resize and import images. If you prefer those you are free to use them, but you should at least know how to use these libraries. It is a very good idea to keep constants that are frequently used in your code at the TOP of the file. This is standard across software development. In particular, don't hide data paths deep in your code. When you get the repo there will be a zip file for the data. Unzip that file wherever you want (usually outside your repo) but use the path variable to point to that location (here called `DATA_PATH`). You will also come up with a location for `temp` data, which we called `PROCESSED_DATA_PATH`. This is where you save out your preprocessed images and the data for your `X` and `y`. This is important because you may find you have to kill your kernel and restart some other time, and you want to save your work so you can resume.

<div></div>
<hr/>
<div style="background-color:azure;color:red"> <em>Tip:</em> Some of the algorithms run really slow. For debugging purposes only, run the algorithms on 1/4 or 1/8 of the data. Then double it and measure the running times. Before you run the bagging algorithms you should have some idea how long it will take to run the whole algorithm. Keep that in mind when you change parameters and explore better performance. Also I have included an import for FloatProgress. Learn how to use that, particularly in the preprocessing, so you know how far you are at any time.</div>
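
The subsample-and-double advice in the tip can be sketched as follows. This is a minimal illustration, not part of the original notebook: `time_on_fractions` and `slow_step` are hypothetical names, and `slow_step` is only a stand-in for whatever fit or preprocessing call you want to time.

```python
import time

def time_on_fractions(func, data, fractions=(0.125, 0.25, 0.5)):
    """Run func on growing prefixes of data and return {fraction: seconds}."""
    timings = {}
    for frac in fractions:
        subset = data[: max(1, int(len(data) * frac))]
        start = time.perf_counter()
        func(subset)
        timings[frac] = time.perf_counter() - start
    return timings

def slow_step(rows):
    # hypothetical stand-in for an expensive fit: quadratic in len(rows)
    return sum(a * b for a in rows for b in rows)

timings = time_on_fractions(slow_step, list(range(400)))
for frac, seconds in timings.items():
    print(f"{frac:>5.3f} of the data: {seconds:.4f} s")
```

Comparing how the time grows each time the fraction doubles gives a rough basis for estimating the full run before you commit to it.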

In [13]:
# Standard Stuff and Pre-processing
import numpy as np
import pandas as pd
import os
import os.path as op
import json
from skimage.io import imread, imsave
from skimage.transform import resize
from ipywidgets import FloatProgress
from matplotlib import pyplot as plt
%matplotlib inline

# Fix imbalanced data sets
from imblearn.over_sampling import SMOTE

#Machine Learning
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Absolute paths where the raw and processed data are stored; change these to match your machine
DATA_PATH = "/Users/carlostavarez/Desktop/DSE/Machine_Learning/data/SampledImages"
PROCESSED_DATA_PATH = "/Users/carlostavarez/Desktop/DSE/Machine_Learning/data/processed_images"

## Unzip the dataset

Unzip the data set from the repo or download it. Put the SampledImages directory in some location and change the data path accordingly. You will see two kinds of files: one kind ends with ".jpeg" and the other has no extension.

* Flip through the images manually
* We will create a directory to hold processed images, which we will write to disk at the location stored in `PROCESSED_DATA_PATH`.
* Use the os and the os.path python library to check if the directory exists and if not create it
* You can loop through the `DATA_PATH` by reading the contents of the directory with `os.listdir` and running a for loop. We will treat the `jpeg` files and the metadata files separately. Keep in mind that we will build our `X` matrix of classification input from our images and our `y` classification outputs from our metadata.
* Let's focus on the metadata first. This metadata is in JSON form, so you can parse it using the python json library imported above. You can either read each file in as text and then use the json library to parse and convert it to a dictionary, or read it directly in as a JSON file. For each file we will look for a sub-key with name `benign_malignant`. If that key is present and has the value `malignant`, the corresponding target value in `y` should be 1; for any other value it should be 0. A small number of rows are missing this key. In that case check the `diagnosis` key: if the `diagnosis` is `basal cell carcinoma` then `y` should also be 1, otherwise `y` should be 0. Finally, if there is neither a `benign_malignant` nor a `diagnosis` value, the metadata and the corresponding image must be ignored.
* In order to make the problem tractable you will read in an image using imread, resize it to 25x25 pixels, and store it as a row in a matrix which will have one row for each image and 25x25x3 columns, with 3 being the R-G-B color channels. It is also worthwhile to save out the images using imsave with a slightly different file name (like tacking on `small` before the `.jpeg`) in your `PROCESSED_DATA_PATH` directory. When you have created a matrix `X` with one row for each image for which we have a metadata value 0 or 1 in `y`, save both `X` and `y` in the numpy file format (.npy) using the command `np.save`. Keep in mind that you should check whether these files exist before processing the images each time you run the notebook. This way you can start and stop your analysis, loading from the files if you have already processed the data and recomputing if you have not. You can manually override the check when you need to, so that it computes everything again.
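
The labeling rule in the metadata bullet can be sketched as a small function. This is only an illustration: `label_from_metadata` is a hypothetical helper, the nesting under `meta` and `clinical` is assumed from the solution code in this notebook, and the `nevus` diagnosis is a made-up example value.

```python
def label_from_metadata(record):
    """Return 1 (malignant), 0 (benign), or None when the record must be skipped."""
    clinical = record.get("meta", {}).get("clinical", {})
    if "benign_malignant" in clinical:
        return int(clinical["benign_malignant"] == "malignant")
    if "diagnosis" in clinical:
        return int(clinical["diagnosis"] == "basal cell carcinoma")
    return None  # neither key present: ignore this file and its image

# made-up records covering each branch of the rule
examples = [
    {"meta": {"clinical": {"benign_malignant": "malignant"}}},
    {"meta": {"clinical": {"benign_malignant": "benign"}}},
    {"meta": {"clinical": {"diagnosis": "basal cell carcinoma"}}},
    {"meta": {"clinical": {"diagnosis": "nevus"}}},
    {"meta": {"clinical": {}}},
]
labels = [label_from_metadata(r) for r in examples]
print(labels)
```

Keeping the rule in one function means the image is only read and resized once the label is known to be usable.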


In [159]:
# fill in

# list every file in DATA_PATH, then split the names into metadata and image files
filenames = os.listdir(DATA_PATH)

# metadata names look like 'ISIC_0000004' (12 characters);
# image names add '.jpeg' (17 characters)
metadata_filenames = [i for i in filenames if len(i)==12]
images_filenames = [i for i in filenames if len(i)==17]

metadata_filenames = sorted(metadata_filenames)
images_filenames = sorted(images_filenames)

print("Length of metadata names: {}\nLength of images names: {}\nLength of all files names: {}".format(len(metadata_filenames), 
                                                            len(images_filenames), 
                                                            len(filenames)))

Length of metadata names: 1768
Length of images names: 1768
Length of all files names: 3536


In [160]:
# load one metadata file from DATA_PATH and parse it as JSON
def jsonopener(file):
    with open(op.join(DATA_PATH, file), 'r') as fp:
        return json.load(fp)

In [161]:
metadata_filenames[:10], images_filenames[:10]

(['ISIC_0000004',
  'ISIC_0000016',
  'ISIC_0000049',
  'ISIC_0000059',
  'ISIC_0000063',
  'ISIC_0000080',
  'ISIC_0000082',
  'ISIC_0000105',
  'ISIC_0000109',
  'ISIC_0000146'],
 ['ISIC_0000004.jpeg',
  'ISIC_0000016.jpeg',
  'ISIC_0000049.jpeg',
  'ISIC_0000059.jpeg',
  'ISIC_0000063.jpeg',
  'ISIC_0000080.jpeg',
  'ISIC_0000082.jpeg',
  'ISIC_0000105.jpeg',
  'ISIC_0000109.jpeg',
  'ISIC_0000146.jpeg'])

In [225]:
# X collects the resized images and y the matching 0/1 labels
y = []
X = []

# JSON keys that determine the value of y
bm = 'benign_malignant'
diag = 'diagnosis'

# make sure the output directory exists before writing to it
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

for file in metadata_filenames:

    jsonfile = jsonopener(file)
    clinical = jsonfile['meta']['clinical']

    # label from benign_malignant if present, otherwise from diagnosis;
    # files with neither key are skipped together with their image
    if bm in clinical:
        label = int(clinical[bm] == 'malignant')
    elif diag in clinical:
        label = int(clinical[diag] == 'basal cell carcinoma')
    else:
        continue

    # shortened names for the processed copies (drops the 'ISIC_00' prefix)
    jsonname = op.join(PROCESSED_DATA_PATH, file[7:])
    imgname = op.join(PROCESSED_DATA_PATH, file[7:] + '.jpeg')

    with open(jsonname, 'w') as outfile:
        json.dump(jsonfile, outfile)

    # load the matching image and shrink it to 25x25 pixels
    img = resize(imread(op.join(DATA_PATH, file + '.jpeg')), (25, 25))
    imsave(imgname, img)

    X.append(img)
    y.append(label)
    


In [226]:
del metadata_filenames
del images_filenames

In [14]:
# check whether X and y have already been saved
X_savefile = op.join(PROCESSED_DATA_PATH, 'X.npy')
y_savefile = op.join(PROCESSED_DATA_PATH, 'y.npy')

X_not_cached = not op.exists(X_savefile)
y_not_cached = not op.exists(y_savefile)

print(X_not_cached, y_not_cached)

False False


In [15]:
if X_not_cached:
    np.save(X_savefile, X)
else:
    X = np.load(X_savefile)

In [16]:
if y_not_cached:
    np.save(y_savefile, y)
else:
    y = np.load(y_savefile)
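
The check-then-load pattern, including the manual override mentioned in the instructions, could be wrapped in one helper. This is a sketch under assumptions: `load_or_compute`, its `force` flag, and `expensive` are hypothetical names, and a temporary directory stands in for `PROCESSED_DATA_PATH`.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute, force=False):
    """Load an array from path if cached; otherwise compute it and save it."""
    if os.path.exists(path) and not force:
        return np.load(path)
    array = np.asarray(compute())
    np.save(path, array)
    return array

calls = []

def expensive():
    # hypothetical stand-in for the image-processing loop
    calls.append(1)
    return np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "X.npy")
    first = load_or_compute(cache, expensive)              # computes and saves
    second = load_or_compute(cache, expensive)             # loads from disk
    third = load_or_compute(cache, expensive, force=True)  # recomputes on request
print("compute calls:", len(calls))
```

Passing `force=True` is one way to recompute everything without deleting the cached files by hand.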

### Setting up our training and testing sets

At this point you should have `X` and `y` loaded. I stored them, combined, in a single file called `data_array.npy`. Compute the count of malignant and benign by counting and printing the frequency of the two classes. You will see that the benign class is much more frequent. As a result we are going to consider using synthetic minority oversampling to deal with the imbalanced classes. You need to install the `imbalanced-learn` module with `conda install`. We will use the *synthetic minority over-sampling* (SMOTE) within that library for that task. Below is our train test split and resampling application: our usual train test split, followed by a resampled train test split using SMOTE.



In [17]:
# fill in
random_state = 42
np.random.seed(random_state)
print("Count of malignant: {}\nCount of benign: {}\nTotal: {}".format(sum(y), len(y)-sum(y), len(y)))

Count of malignant: 239
Count of benign: 1528
Total: 1767


In [18]:
nsample, d1, d2, d3 = X.shape
X = X.reshape(nsample, d1*d2*d3)
X.shape

(1767, 1875)

In [19]:
sm = SMOTE(random_state=42)

X_bal, y_bal = sm.fit_resample(X, y)

In [20]:
print("Count of malignant: {}\nCount of benign: {}\nTotal: {}".format(sum(y_bal),
                                                            len(y_bal)-sum(y_bal),
                                                            len(y_bal)))

Count of malignant: 1528
Count of benign: 1528
Total: 3056


Note: For reproducible results, throughout we will include `random_state=42` as a parameter to all functions that accept it.
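
To give a feel for what synthetic minority over-sampling does, here is a simplified sketch of its core step: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the segment between them. This is not the `imbalanced-learn` implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor."""
    rng = rng or random.Random(42)
    base = rng.choice(minority)
    # sort the other minority points by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]

# toy minority class: the four corners of the unit square
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every new point is an interpolation, the synthetic samples stay inside the region already spanned by the minority class.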

## Fitting and Evaluation

In this section we will first apply some basic classifiers to understand the speed and baseline performance results we should expect. We will start out with basic train/test splits. Note that because we are not using cross-validation at the start, our testing results might be overly optimistic.

### 1. How well does the resampling work?

For our evaluation we will focus on 4 metrics. We note that we don't really have to worry about multiple classes since this is a binary classification problem. Thus we only look at the four metrics:

 * f1
 * accuracy
 * precision
 * recall
 
We look at this for training and test. You may find it convenient to create boilerplate code for this within a loop because we will be doing this kind of evaluation many times. For our first experiment just look at

* k-Nearest Neighbors

In this case we will use k=3. Now compute the four metrics for the unresampled X_train, y_train and X_test, y_test.  For the first set of numbers you *might* see similar results to the following:

    Regular KNN N=3 neighbors
    accuracy_score on train  fit: 	     0.882
    f1_score  on train  fit: 	         0.61
    precision_score on train  fit: 	     0.884
    recall_score on train  fit: 	     0.466
    accuracy_score on test  fit: 	     0.805
    f1_score  on test  fit: 	         0.368
    precision_score on test  fit: 	     0.51
    recall_score on test  fit: 	         0.287

Consider recall and review the score. A score of 0.287 means that of all the truly malignant cases, **only** about 29% are flagged as malignant. While a higher than desirable rate of false alarms (a lower precision score) might be tolerable because patients may be retested, low recall is not acceptable. Next do this again, separately, where training takes place on X_resamp_train, y_resamp_train but you **test** on X_test, y_test, and evaluate with the 4 metrics. With this mixed training, how does KNN with N=3 perform across the training and testing?
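
To make the relationship between the four metrics concrete, the sketch below computes them by hand from true and predicted labels. The labels are a toy example, not results from the lesion data, and `binary_metrics` is a hypothetical helper; in the notebook itself the `sklearn.metrics` functions do this work.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# toy labels: 4 malignant cases of which 1 is caught, plus 1 false alarm
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 0, 0, 0] + [1] + [0] * 15
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In this toy case accuracy is 0.8 even though recall is only 0.25, which is why accuracy alone is a poor guide on imbalanced classes.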


In [21]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.25)

In [13]:
# fill in

knn_ubal = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2, weights='uniform')
knn_pipe = make_pipeline(StandardScaler(), knn_ubal)
knn_pipe.fit(x_train, y_train)

y_tr_pred = knn_pipe.predict(x_train)
y_ts_pred = knn_pipe.predict(x_test)

print("Accuracy score on train: {}\nAccuracy score on test: {}\n".format(accuracy_score(y_train, y_tr_pred), 
                                                                         accuracy_score(y_test, y_ts_pred)))

print("f1 score on train: {}\nf1 score on test: {}\n".format(f1_score(y_train, y_tr_pred, average='binary'), 
                                                             f1_score(y_test, y_ts_pred, average='binary')))

print("Precision score on train: {}\nPrecision score on test: {}\n".format(precision_score(y_train, y_tr_pred, average='binary'), 
                                                                           precision_score(y_test, y_ts_pred, average='binary')))

print("Recall score on train: {}\nRecall score on test: {}\n".format(recall_score(y_train, y_tr_pred, average='binary'), 
                                                                     recall_score(y_test, y_ts_pred, average='binary')))


Accuracy score on train: 0.8913207547169811
Accuracy score on test: 0.8484162895927602

f1 score on train: 0.3739130434782608
f1 score on test: 0.028985507246376812

Precision score on train: 0.8431372549019608
Precision score on test: 0.1111111111111111

Recall score on train: 0.24022346368715083
Recall score on test: 0.016666666666666666



In [22]:
x_tr_bal, x_ts_bal, y_tr_bal, y_ts_bal = train_test_split(X_bal, y_bal, 
                                                          random_state=42, stratify=y_bal, test_size=0.25)

In [16]:
# fill in
knn_bal = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
knn_bal_pipe = make_pipeline(StandardScaler(), knn_bal)
knn_bal_pipe.fit(x_tr_bal, y_tr_bal)

y_tr_bal_pred = knn_bal_pipe.predict(x_tr_bal)
y_ts_bal_pred = knn_bal_pipe.predict(x_test)

print("Accuracy score on train: {}\nAccuracy score on test: {}\n".format(accuracy_score(y_tr_bal, y_tr_bal_pred), 
                                                                         accuracy_score(y_test, y_ts_bal_pred)))

print("f1 score on train: {}\nf1 score on test: {}\n".format(f1_score(y_tr_bal, y_tr_bal_pred, average='binary'), 
                                                             f1_score(y_test, y_ts_bal_pred, average='binary')))

print("Precision score on train: {}\nPrecision score on test: {}\n".format(
    precision_score(y_tr_bal, y_tr_bal_pred, average='binary'), 
    precision_score(y_test, y_ts_bal_pred, average='binary')))

print("Recall score on train: {}\nRecall score on test: {}\n".format(
    recall_score(y_tr_bal, y_tr_bal_pred, average='binary'), 
    recall_score(y_test, y_ts_bal_pred, average='binary')))

Accuracy score on train: 0.9223385689354275
Accuracy score on test: 0.7737556561085973

f1 score on train: 0.9278768233387358
f1 score on test: 0.5454545454545454

Precision score on train: 0.8661119515885023
Precision score on test: 0.375

Recall score on train: 0.9991273996509599
Recall score on test: 1.0
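
One thing to watch in the cells above: SMOTE was applied to the full `X` before splitting, so rows of `x_test` can also appear in `x_tr_bal` (exactly, or as endpoints of interpolated points), which may inflate the test scores. A minimal sketch of the safer order, splitting first and then oversampling only the training part, is below. It uses plain random duplication as a stand-in for SMOTE so that it runs without extra packages; `split_then_oversample` is a hypothetical helper, and class 1 is assumed to be the minority.

```python
import random

def split_then_oversample(labels, test_fraction=0.25, seed=42):
    """Stratified split of indices, then oversample class 1 in train only."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = int(len(members) * test_fraction)
        test_idx += members[:n_test]
        train_idx += members[n_test:]
    minority = [i for i in train_idx if labels[i] == 1]
    majority = [i for i in train_idx if labels[i] == 0]
    # duplicate random minority rows until both classes are the same size
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra, test_idx

labels = [1] * 10 + [0] * 90
train_idx, test_idx = split_then_oversample(labels)
print(len(train_idx), len(test_idx))
```

With `imbalanced-learn` the same order would be `train_test_split` first, followed by `fit_resample` on the training split only, so that the test rows never influence the synthetic samples.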



### 2. Baseline with training on resampled data and testing with test data

We are now going to consider a sequence of algorithms and then ensemble versions of these algorithms. Because we will be using a number of algorithms and comparing their results, you will want to write a loop. Strictly minimize copy-paste code: instead, use template strings and set up your classifiers in an array so you can loop through them. In this part you need to use `cross_validate` from `sklearn.model_selection`. You are going to be calling this on the resampled data, which has balanced classes. You should do 5-fold cross-validation, make sure that it computes all the scoring functions (`scoring=['f1','accuracy','precision','recall']`), that it returns the training scores (`return_train_score=True`), and that the folds are stratified. Note that with an integer `cv` and a classifier, `cross_validate` already uses stratified folds; the `groups` optional argument is only consulted by group-aware splitters. You are making your own reports here, where you print out the training and testing scores for each of the 4 metrics. Because `cross_validate` returns a score for each fold of the cross-validation, what you should show is the mean and the standard deviation. We basically assume that the standard deviation is the "error" in our estimate, so for training accuracy it might print `0.92 +/- 0.01`, where the first number is the mean of the 5 accuracies for that classifier and the second number is the stdev. Do this for these 4 classifiers:

1. Random Forest
2. k-Nearest Neighbors 
3. Extra Trees
4. Support Vector Machine
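
The mean and standard deviation report described above can be produced by a single helper instead of repeated print blocks. The sketch below is one possible shape, assuming the `train_<metric>` and `test_<metric>` keys that `cross_validate` returns for a list of scorers; `format_report` is a hypothetical name and the fold scores are made up.

```python
from statistics import mean, pstdev

METRICS = ("f1", "accuracy", "precision", "recall")

def format_report(name, cv_results, metrics=METRICS):
    """Format mean +/- standard deviation per metric for train and test folds."""
    lines = [name]
    for metric in metrics:
        for split in ("train", "test"):
            scores = cv_results[f"{split}_{metric}"]
            # pstdev is the population standard deviation, like np.std's default
            lines.append(f"{metric} {split}: {mean(scores):.2f} +/- {pstdev(scores):.2f}")
    return "\n".join(lines)

# made-up fold scores in the shape cross_validate returns
fake_cv = {f"{split}_{m}": [0.90, 0.92, 0.94, 0.92, 0.92]
           for split in ("train", "test") for m in METRICS}
report = format_report("Example classifier", fake_cv)
print(report)
```

The same function can then be reused unchanged for the bagged and boosted classifiers in the later sections.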



In [18]:
# fill in
rfc = RandomForestClassifier(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
etc = ExtraTreeClassifier(random_state=42)
svc = SVC(random_state=42)

cls = [rfc, knn, etc, svc]
cls_names = ['Random Forest', 'K-Nearest Neighbors', 'Extra Trees', 'Support Vector Machine']

stds = StandardScaler()

x_ts_bal_std = stds.fit_transform(x_ts_bal)

for name, cl in zip(cls_names, cls):
    cv = cross_validate(cl, x_ts_bal_std, y_ts_bal, groups=y_ts_bal, cv=5, 
                        scoring=['f1', 'accuracy', 'precision', 'recall'], return_train_score=True, n_jobs=-1)
    
    train_f1_mn = np.mean(cv['train_f1'])
    train_f1_sd = np.std(cv['train_f1'])
    test_f1_mn = np.mean(cv['test_f1'])
    test_f1_sd = np.std(cv['test_f1'])
    
    train_accuracy_mn = np.mean(cv['train_accuracy'])
    train_accuracy_sd = np.std(cv['train_accuracy'])
    test_accuracy_mn = np.mean(cv['test_accuracy'])
    test_accuracy_sd = np.std(cv['test_accuracy'])
    
    train_precision_mn = np.mean(cv['train_precision'])
    train_precision_sd = np.std(cv['train_precision'])
    test_precision_mn = np.mean(cv['test_precision'])
    test_precision_sd = np.std(cv['test_precision'])
    
    train_recall_mn = np.mean(cv['train_recall'])
    train_recall_sd = np.std(cv['train_recall'])
    test_recall_mn = np.mean(cv['test_recall'])
    test_recall_sd = np.std(cv['test_recall'])
    
    print("%s\n\nf1 score train: %0.2f +/- %0.2f\nf1 score test: %0.2f +/- %0.2f\n"%(name, 
                                                                                   train_f1_mn, 
                                                                                   train_f1_sd, 
                                                                                   test_f1_mn, 
                                                                                   test_f1_sd))
    
    print("Accuracy score train: %0.2f +/- %0.2f\nAccuracy score test: %0.2f +/- %0.2f\n"%(train_accuracy_mn, 
                                                                                            train_accuracy_sd, 
                                                                                            test_accuracy_mn, 
                                                                                            test_accuracy_sd))
    
    print("Precission score train: %0.2f +/- %0.2f\nPrecission score test: %0.2f +/- %0.2f\n"%(train_precision_mn, 
                                                                                               train_precision_sd, 
                                                                                               test_precision_mn, 
                                                                                               test_precision_sd))
    
    print("Recall score train: %0.2f +/- %0.2f\nRecall score test: %0.2f +/- %0.2f\n\n\n"%(train_recall_mn, 
                                                                                           train_recall_sd, 
                                                                                           test_recall_mn, 
                                                                                           test_recall_sd))

Random Forest

f1 score train: 1.00 +/- 0.00
f1 score test: 0.78 +/- 0.04

Accuracy score train: 1.00 +/- 0.00
Accuracy score test: 0.78 +/- 0.04

Precission score train: 1.00 +/- 0.00
Precission score test: 0.78 +/- 0.04

Recall score train: 0.99 +/- 0.01
Recall score test: 0.78 +/- 0.06



K-Nearest Neighbors

f1 score train: 0.89 +/- 0.00
f1 score test: 0.80 +/- 0.03

Accuracy score train: 0.88 +/- 0.00
Accuracy score test: 0.77 +/- 0.03

Precission score train: 0.82 +/- 0.00
Precission score test: 0.72 +/- 0.02

Recall score train: 0.98 +/- 0.00
Recall score test: 0.90 +/- 0.05



Extra Trees

f1 score train: 1.00 +/- 0.00
f1 score test: 0.73 +/- 0.03

Accuracy score train: 1.00 +/- 0.00
Accuracy score test: 0.72 +/- 0.03

Precission score train: 1.00 +/- 0.00
Precission score test: 0.70 +/- 0.02

Recall score train: 1.00 +/- 0.00
Recall score test: 0.76 +/- 0.06



Support Vector Machine

f1 score train: 0.88 +/- 0.01
f1 score test: 0.80 +/- 0.04

Accuracy score train: 0.87 +/- 0.

### 3. Bagging the classifiers

So here we are going to see if bagging the classifiers can help. You are going to run the same 4 classifiers through, but wrap them in a `BaggingClassifier`, which will use bootstrap resampling. You should start by trying 60% (.6) of the samples and only 70% (.7) of the features, to make each of the ensemble learners more independent. Start out with 20 estimators. Since you need to pass this config to each of the constructors, you can make a dictionary of config variables and then use the python `**` to unpack it. So if the config variables are like this:

`bagged_config = dict(n_estimators=20, max_samples=.6, max_features=.7, random_state=42)`

Then you use them like this:

`BaggingClassifier(KNeighborsClassifier(n_neighbors=3), **bagged_config)`

Write this as a loop over the classifiers like above, but for debugging purposes you can use the python `break` statement to only do one loop until you get it working. Each time through it should print the report like you made above, showing training and testing of each of the four metrics, using 5-fold cross-validation, which gives you a mean and a standard deviation just like above. Compare the performance. Try adjusting the parameters such as max_samples or max_features to improve the performance.

Explain for each of the 4 bagged versions of your classifiers:

1. Random Forest
2. k-Nearest Neighbors 
3. Extra Trees
4. Support Vector Machine

explain why the results make sense or not.
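
To see why `max_samples` and `max_features` make the ensemble members more independent, the sketch below draws the rows and columns that a few bagged estimators would each be trained on. It is a simplified illustration, not scikit-learn's internal code: `draw_bag` is a hypothetical helper, the counts are rounded for simplicity, and it assumes rows are drawn with replacement and features without, which matches the `BaggingClassifier` defaults.

```python
import random

def draw_bag(n_samples, n_features, max_samples=0.6, max_features=0.7, rng=None):
    """Pick the rows and columns one bagged estimator would be trained on."""
    rng = rng or random.Random(42)
    n_rows = max(1, round(max_samples * n_samples))
    n_cols = max(1, round(max_features * n_features))
    rows = [rng.randrange(n_samples) for _ in range(n_rows)]  # with replacement
    cols = sorted(rng.sample(range(n_features), n_cols))      # without replacement
    return rows, cols

rng = random.Random(42)
bags = [draw_bag(n_samples=10, n_features=10, rng=rng) for _ in range(3)]
for rows, cols in bags:
    print("rows:", rows, "| features:", cols)
```

Each estimator sees a different, partly overlapping slice of the data, so their errors are less correlated and averaging their votes can reduce variance.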


In [30]:
# fill in
bagged_config = dict(n_estimators=20, max_samples=0.6, max_features=0.7, random_state=42, n_jobs=-1)

rfc_b = RandomForestClassifier(random_state=42)
knn_b = KNeighborsClassifier(n_neighbors=3)
etc_b = ExtraTreeClassifier(random_state=42)
svc_b = SVC(random_state=42)

clb = [rfc_b, knn_b, etc_b, svc_b]
clb_names = ['Random Forest', 'K-Nearest Neighbors', 'Extra Trees', 'Support Vector Machine']

for name, cl in zip(clb_names, clb):
    
    bagcl = BaggingClassifier(cl, **bagged_config)
    
    stdsc = StandardScaler()
    x_ts_bal_sd = stdsc.fit_transform(x_ts_bal)
    
    cv = cross_validate(bagcl, x_ts_bal_sd, y_ts_bal, groups=y_ts_bal, cv=5, 
                        scoring=['f1', 'accuracy', 'precision', 'recall'], return_train_score=True, n_jobs=-1)
    
    
    train_f1_mn = np.mean(cv['train_f1'])
    train_f1_sd = np.std(cv['train_f1'])
    test_f1_mn = np.mean(cv['test_f1'])
    test_f1_sd = np.std(cv['test_f1'])
    
    train_accuracy_mn = np.mean(cv['train_accuracy'])
    train_accuracy_sd = np.std(cv['train_accuracy'])
    test_accuracy_mn = np.mean(cv['test_accuracy'])
    test_accuracy_sd = np.std(cv['test_accuracy'])
    
    train_precision_mn = np.mean(cv['train_precision'])
    train_precision_sd = np.std(cv['train_precision'])
    test_precision_mn = np.mean(cv['test_precision'])
    test_precision_sd = np.std(cv['test_precision'])
    
    train_recall_mn = np.mean(cv['train_recall'])
    train_recall_sd = np.std(cv['train_recall'])
    test_recall_mn = np.mean(cv['test_recall'])
    test_recall_sd = np.std(cv['test_recall'])
    
    print("%s\n\nf1 score train: %0.2f +/- %0.2f\nf1 score test: %0.2f +/- %0.2f\n"%(name, 
                                                                                   train_f1_mn, 
                                                                                   train_f1_sd, 
                                                                                   test_f1_mn, 
                                                                                   test_f1_sd))
    
    print("Accuracy score train: %0.2f +/- %0.2f\nAccuracy score test: %0.2f +/- %0.2f\n"%(train_accuracy_mn, 
                                                                                            train_accuracy_sd, 
                                                                                            test_accuracy_mn, 
                                                                                            test_accuracy_sd))
    
    print("Precission score train: %0.2f +/- %0.2f\nPrecission score test: %0.2f +/- %0.2f\n"%(train_precision_mn, 
                                                                                               train_precision_sd, 
                                                                                               test_precision_mn, 
                                                                                               test_precision_sd))
    
    print("Recall score train: %0.2f +/- %0.2f\nRecall score test: %0.2f +/- %0.2f\n\n\n"%(train_recall_mn, 
                                                                                           train_recall_sd, 
                                                                                           test_recall_mn, 
                                                                                           test_recall_sd))
    




Random Forest

f1 score train: 0.96 +/- 0.00
f1 score test: 0.78 +/- 0.02

Accuracy score train: 0.96 +/- 0.00
Accuracy score test: 0.76 +/- 0.02

Precission score train: 0.95 +/- 0.00
Precission score test: 0.73 +/- 0.02

Recall score train: 0.97 +/- 0.01
Recall score test: 0.84 +/- 0.04



K-Nearest Neighbors

f1 score train: 0.85 +/- 0.01
f1 score test: 0.77 +/- 0.03

Accuracy score train: 0.83 +/- 0.01
Accuracy score test: 0.74 +/- 0.03

Precission score train: 0.78 +/- 0.01
Precission score test: 0.70 +/- 0.02

Recall score train: 0.93 +/- 0.01
Recall score test: 0.86 +/- 0.04



Extra Trees

f1 score train: 0.99 +/- 0.00
f1 score test: 0.79 +/- 0.03

Accuracy score train: 0.99 +/- 0.00
Accuracy score test: 0.78 +/- 0.04

Precission score train: 0.98 +/- 0.00
Precission score test: 0.77 +/- 0.05

Recall score train: 0.99 +/- 0.01
Recall score test: 0.81 +/- 0.03



Support Vector Machine

f1 score train: 0.84 +/- 0.01
f1 score test: 0.77 +/- 0.05

Accuracy score train: 0.83 +/- 0.

### 4. Adaptive Boosting

Here we will consider the adaptive boosting classifier. Not all learners can be boosted (unlike bagging). We will only consider two boosted algorithms here.

1. Extra Tree
2. Decision Tree

In the case of the decision tree use max_depth=2 and max_features=.7. Here try 100 estimators in the `AdaBoostClassifier` wrapping the two algorithms. Just like above use 5-fold cross-validation, `random_state=42` everywhere, and print out your reports to be able to compare. Play with the parameters a bit to try to get the best results possible.
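
A minimal sketch of the decision tree setup described above follows. It runs on synthetic data from `make_classification`, not on the lesion images, so the scores say nothing about this assignment; `X_demo` and `y_demo` are hypothetical names. It also checks which base learners accept `sample_weight` in `fit`, which is what scikit-learn's AdaBoost needs in order to reweight samples between rounds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

# a learner can only be boosted here if its fit() accepts sample_weight
tree = DecisionTreeClassifier(max_depth=2, max_features=0.7, random_state=42)
tree_ok = has_fit_parameter(tree, "sample_weight")
knn_ok = has_fit_parameter(KNeighborsClassifier(n_neighbors=3), "sample_weight")
print("tree boostable:", tree_ok, "| knn boostable:", knn_ok)

# synthetic stand-in data, not the lesion images
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# base learner passed positionally, as in the bagging example above
ada = AdaBoostClassifier(tree, n_estimators=100, random_state=42)
cv = cross_validate(ada, X_demo, y_demo, cv=5,
                    scoring=["f1", "accuracy", "precision", "recall"],
                    return_train_score=True)
print("test recall per fold:", cv["test_recall"].round(2))
```

The resulting `cv` dictionary has the same keys as in the earlier sections, so the same reporting code applies.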


In [36]:
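# fill in

# note: the base learner below is a shallow random forest, not the single
# decision tree (DecisionTreeClassifier) that the prompt above describes
rfc_b = RandomForestClassifier(random_state=42, max_depth=2, max_features=0.7)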
etc_b = ExtraTreeClassifier(random_state=42)


clb = [rfc_b, etc_b]
clb_names = ['Random Forest', 'Extra Trees']

for name, cl in zip(clb_names, clb):
    
    ada = AdaBoostClassifier(cl, n_estimators=100, random_state=42)
    
    stdsc = StandardScaler()
    x_ts_bal_sd = stdsc.fit_transform(x_ts_bal)
    
    cv = cross_validate(ada, x_ts_bal_sd, y_ts_bal, groups=y_ts_bal, cv=5, 
                        scoring=['f1', 'accuracy', 'precision', 'recall'], return_train_score=True, n_jobs=-1)
    
    
    train_f1_mn = np.mean(cv['train_f1'])
    train_f1_sd = np.std(cv['train_f1'])
    test_f1_mn = np.mean(cv['test_f1'])
    test_f1_sd = np.std(cv['test_f1'])
    
    train_accuracy_mn = np.mean(cv['train_accuracy'])
    train_accuracy_sd = np.std(cv['train_accuracy'])
    test_accuracy_mn = np.mean(cv['test_accuracy'])
    test_accuracy_sd = np.std(cv['test_accuracy'])
    
    train_precision_mn = np.mean(cv['train_precision'])
    train_precision_sd = np.std(cv['train_precision'])
    test_precision_mn = np.mean(cv['test_precision'])
    test_precision_sd = np.std(cv['test_precision'])
    
    train_recall_mn = np.mean(cv['train_recall'])
    train_recall_sd = np.std(cv['train_recall'])
    test_recall_mn = np.mean(cv['test_recall'])
    test_recall_sd = np.std(cv['test_recall'])
    
    print("%s\n\nf1 score train: %0.2f +/- %0.2f\nf1 score test: %0.2f +/- %0.2f\n"%(name, 
                                                                                   train_f1_mn, 
                                                                                   train_f1_sd, 
                                                                                   test_f1_mn, 
                                                                                   test_f1_sd))
    
    print("Accuracy score train: %0.2f +/- %0.2f\nAccuracy score test: %0.2f +/- %0.2f\n"%(train_accuracy_mn, 
                                                                                            train_accuracy_sd, 
                                                                                            test_accuracy_mn, 
                                                                                            test_accuracy_sd))
    
    print("Precision score train: %0.2f +/- %0.2f\nPrecision score test: %0.2f +/- %0.2f\n"%(train_precision_mn, 
                                                                                             train_precision_sd, 
                                                                                             test_precision_mn, 
                                                                                             test_precision_sd))
    
    print("Recall score train: %0.2f +/- %0.2f\nRecall score test: %0.2f +/- %0.2f\n\n\n"%(train_recall_mn, 
                                                                                           train_recall_sd, 
                                                                                           test_recall_mn, 
                                                                                           test_recall_sd))

Random Forest

f1 score train: 1.00 +/- 0.00
f1 score test: 0.82 +/- 0.03

Accuracy score train: 1.00 +/- 0.00
Accuracy score test: 0.81 +/- 0.03

Precision score train: 1.00 +/- 0.00
Precision score test: 0.78 +/- 0.03

Recall score train: 1.00 +/- 0.00
Recall score test: 0.87 +/- 0.04



Extra Trees

f1 score train: 1.00 +/- 0.00
f1 score test: 0.70 +/- 0.08

Accuracy score train: 1.00 +/- 0.00
Accuracy score test: 0.70 +/- 0.07

Precision score train: 1.00 +/- 0.00
Precision score test: 0.69 +/- 0.06

Recall score train: 1.00 +/- 0.00
Recall score test: 0.72 +/- 0.11





### 5. Gradient Boosting

Below, do exactly as you did above with adaptive boosting, but now with gradient boosting. Unfortunately, you will not be able to use ExtraTrees here, because `GradientBoostingClassifier` does not accept a custom base estimator, so you will end up gradient boosting its built-in decision trees. Again, do 5-fold cross-validation and report the results.
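To see how boosting behaves as trees are added, the following sketch tracks train and test accuracy stage by stage with `staged_predict`. It is only an illustration: the data comes from `make_classification` as a stand-in for the lesion pixels, and the names `X_demo`, `y_demo`, and `gb_demo` are hypothetical, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced pixel matrix (an assumption, not the lesion data)
X_demo, y_demo = make_classification(n_samples=400, n_features=30, n_informative=8,
                                     flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=42)

# Same hyperparameters as the solution cell in this section
gb_demo = GradientBoostingClassifier(random_state=42, max_depth=2,
                                     max_features=0.7, n_estimators=100)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees
train_acc = [accuracy_score(y_tr, p) for p in gb_demo.staged_predict(X_tr)]
test_acc = [accuracy_score(y_te, p) for p in gb_demo.staged_predict(X_te)]

for n in (1, 10, 50, 100):
    print("trees=%3d  accuracy train: %0.2f  accuracy test: %0.2f"
          % (n, train_acc[n - 1], test_acc[n - 1]))
```

A training curve that keeps rising while the test curve flattens is the usual sign that additional trees are fitting noise.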


In [12]:
# fill in


# Label used in the printed report (the model here is gradient boosting)
clb_names = 'Gradient Boosting'

gb_cl = GradientBoostingClassifier(random_state=42, max_depth=2, max_features=0.7, n_estimators=100)

cv = cross_validate(gb_cl, x_ts_bal, y_ts_bal, groups=y_ts_bal, cv=5, 
                    scoring=['f1', 'accuracy', 'precision', 'recall'], return_train_score=True)


train_f1_mn = np.mean(cv['train_f1'])
train_f1_sd = np.std(cv['train_f1'])
test_f1_mn = np.mean(cv['test_f1'])
test_f1_sd = np.std(cv['test_f1'])

train_accuracy_mn = np.mean(cv['train_accuracy'])
train_accuracy_sd = np.std(cv['train_accuracy'])
test_accuracy_mn = np.mean(cv['test_accuracy'])
test_accuracy_sd = np.std(cv['test_accuracy'])

train_precision_mn = np.mean(cv['train_precision'])
train_precision_sd = np.std(cv['train_precision'])
test_precision_mn = np.mean(cv['test_precision'])
test_precision_sd = np.std(cv['test_precision'])

train_recall_mn = np.mean(cv['train_recall'])
train_recall_sd = np.std(cv['train_recall'])
test_recall_mn = np.mean(cv['test_recall'])
test_recall_sd = np.std(cv['test_recall'])

print("%s\n\nf1 score train: %0.2f +/- %0.2f\nf1 score test: %0.2f +/- %0.2f\n"%(clb_names, 
                                                                                   train_f1_mn, 
                                                                                   train_f1_sd, 
                                                                                   test_f1_mn, 
                                                                                   test_f1_sd))

print("Accuracy score train: %0.2f +/- %0.2f\nAccuracy score test: %0.2f +/- %0.2f\n"%(train_accuracy_mn, 
                                                                                         train_accuracy_sd, 
                                                                                         test_accuracy_mn, 
                                                                                         test_accuracy_sd))

print("Precision score train: %0.2f +/- %0.2f\nPrecision score test: %0.2f +/- %0.2f\n"%(train_precision_mn, 
                                                                                           train_precision_sd, 
                                                                                           test_precision_mn, 
                                                                                           test_precision_sd))

print("Recall score train: %0.2f +/- %0.2f\nRecall score test: %0.2f +/- %0.2f\n\n\n"%(train_recall_mn, 
                                                                                         train_recall_sd, 
                                                                                         test_recall_mn, 
                                                                                         test_recall_sd))

Gradient Boosting

f1 score train: 0.98 +/- 0.01
f1 score test: 0.80 +/- 0.04

Accuracy score train: 0.98 +/- 0.01
Accuracy score test: 0.79 +/- 0.04

Precision score train: 0.97 +/- 0.01
Precision score test: 0.77 +/- 0.04

Recall score train: 0.99 +/- 0.00
Recall score test: 0.84 +/- 0.05
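
The mean and standard deviation block is repeated for every model in this notebook. One way to avoid the repetition is a small helper like the sketch below. The function name `summarize_cv` and the `fake_cv` scores are hypothetical and made up for illustration; only the `train_<metric>` / `test_<metric>` key layout comes from `cross_validate`.

```python
import numpy as np

def summarize_cv(name, cv, metrics=('f1', 'accuracy', 'precision', 'recall')):
    """Print and return (train mean, train sd, test mean, test sd) per metric."""
    summary = {}
    print(name + "\n")
    for metric in metrics:
        train = np.asarray(cv['train_' + metric])
        test = np.asarray(cv['test_' + metric])
        summary[metric] = (train.mean(), train.std(), test.mean(), test.std())
        print("%s score train: %0.2f +/- %0.2f\n%s score test: %0.2f +/- %0.2f\n"
              % ((metric,) + summary[metric][:2] + (metric,) + summary[metric][2:]))
    return summary

# Made-up fold scores, shaped like the dict returned by cross_validate
fake_cv = {'train_f1': [1.0, 0.9], 'test_f1': [0.8, 0.6]}
demo_summary = summarize_cv('Demo model', fake_cv, metrics=('f1',))
```

With a real result, the call would be `summarize_cv(clb_names, cv)` right after `cross_validate`.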





### Hard Voting classifier

Here you are going to implement a hard voting classifier. You will pick three diverse learners of your choice. For a group of learners to be diverse means that they make different errors on the same data. In other words, the errors made by the classifiers are largely uncorrelated.

Use sklearn's VotingClassifier to bind them together and again use cross_validate to score. 
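
Diversity can be checked rather than assumed. The sketch below collects out-of-fold predictions for three learners with `cross_val_predict`, marks where each one is wrong, and correlates those error indicators. It runs on synthetic data from `make_classification` (an assumption standing in for the lesion data), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data with some label noise so every learner makes errors
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=6,
                                     flip_y=0.1, random_state=0)

learners = {
    'gnb': GaussianNB(),
    'svm': SVC(),
    'gbc': GradientBoostingClassifier(random_state=42, max_depth=2, n_estimators=50),
}

# One row per learner: 1.0 where its out-of-fold prediction is wrong, else 0.0
errors = np.array([
    cross_val_predict(clf, X_demo, y_demo, cv=5) != y_demo
    for clf in learners.values()
]).astype(float)

# Pairwise correlation of the error indicators; lower off-diagonal values
# mean the learners tend to fail on different samples
error_corr = np.corrcoef(errors)

names = list(learners)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print("%s vs %s error correlation: %0.2f" % (names[i], names[j], error_corr[i, j]))
```

Pairs with a correlation close to 1 add little to a vote, since they mostly fail together.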

In [23]:
# fill in


sm = StandardScaler()

x_ts_bal_sd = sm.fit_transform(x_ts_bal)

In [28]:
gnb = GaussianNB()
svm = SVC()
gbc = GradientBoostingClassifier(random_state=42, max_depth=2, max_features=0.7, n_estimators=100)

vhc = VotingClassifier(estimators=[('gnb', gnb), ('svm', svm), ('gbc', gbc)], voting='hard', n_jobs=-1)

cv = cross_validate(vhc, x_ts_bal_sd, y_ts_bal, groups=y_ts_bal, cv=5, 
                    scoring=['f1', 'accuracy', 'precision', 'recall'], return_train_score=True)
    
    
train_f1_mn = np.mean(cv['train_f1'])
train_f1_sd = np.std(cv['train_f1'])
test_f1_mn = np.mean(cv['test_f1'])
test_f1_sd = np.std(cv['test_f1'])
    
train_accuracy_mn = np.mean(cv['train_accuracy'])
train_accuracy_sd = np.std(cv['train_accuracy'])
test_accuracy_mn = np.mean(cv['test_accuracy'])
test_accuracy_sd = np.std(cv['test_accuracy'])
    
train_precision_mn = np.mean(cv['train_precision'])
train_precision_sd = np.std(cv['train_precision'])
test_precision_mn = np.mean(cv['test_precision'])
test_precision_sd = np.std(cv['test_precision'])
    
train_recall_mn = np.mean(cv['train_recall'])
train_recall_sd = np.std(cv['train_recall'])
test_recall_mn = np.mean(cv['test_recall'])
test_recall_sd = np.std(cv['test_recall'])
    
print("%s\n\nf1 score train: %0.2f +/- %0.2f\nf1 score test: %0.2f +/- %0.2f\n"%('Voting Hard Classifier', 
                                                                                   train_f1_mn, 
                                                                                   train_f1_sd, 
                                                                                   test_f1_mn, 
                                                                                   test_f1_sd))
    
print("Accuracy score train: %0.2f +/- %0.2f\nAccuracy score test: %0.2f +/- %0.2f\n"%(train_accuracy_mn, 
                                                                                            train_accuracy_sd, 
                                                                                            test_accuracy_mn, 
                                                                                            test_accuracy_sd))
    
print("Precision score train: %0.2f +/- %0.2f\nPrecision score test: %0.2f +/- %0.2f\n"%(train_precision_mn, 
                                                                                           train_precision_sd, 
                                                                                           test_precision_mn, 
                                                                                           test_precision_sd))

print("Recall score train: %0.2f +/- %0.2f\nRecall score test: %0.2f +/- %0.2f\n\n\n"%(train_recall_mn, 
                                                                                         train_recall_sd, 
                                                                                         test_recall_mn, 
                                                                                         test_recall_sd))



Voting Hard Classifier

f1 score train: 0.92 +/- 0.01
f1 score test: 0.79 +/- 0.04

Accuracy score train: 0.92 +/- 0.01
Accuracy score test: 0.78 +/- 0.03

Precision score train: 0.89 +/- 0.01
Precision score test: 0.74 +/- 0.03

Recall score train: 0.95 +/- 0.01
Recall score test: 0.85 +/- 0.07
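
In the cells above, `StandardScaler` is fit on the full data set before cross-validation, so each validation fold contributes to the scaling statistics. A possible alternative is to put the scaler and the voting classifier in one pipeline, so the scaler is refit on the training part of every fold. This is a sketch on synthetic data (`make_classification` is an assumption, as are the `*_demo` names), not the assignment's required solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the balanced pixel matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=6,
                                     flip_y=0.1, random_state=0)

vote_demo = VotingClassifier(
    estimators=[('gnb', GaussianNB()),
                ('svm', SVC()),
                ('gbc', GradientBoostingClassifier(random_state=42, max_depth=2,
                                                   n_estimators=50))],
    voting='hard')

# The scaler is now fit inside each training fold only
pipe_demo = make_pipeline(StandardScaler(), vote_demo)

cv_demo = cross_validate(pipe_demo, X_demo, y_demo, cv=5,
                         scoring=['f1', 'accuracy', 'precision', 'recall'],
                         return_train_score=True)

print("f1 score test: %0.2f +/- %0.2f" % (cv_demo['test_f1'].mean(),
                                          cv_demo['test_f1'].std()))
```

On the real data the same idea would mean passing the unscaled `x_ts_bal` to `cross_validate` together with the pipeline.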





## Conclusion

Write a conclusion explaining what we should take away from the results.

**Acknowledgement: Many thanks to Toma Suciu for the first draft**

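Every model overfits: training scores sit at or near 1.00 for the boosted ensembles, while test scores are roughly 0.2 to 0.3 lower. The hard voting classifier has the lowest training scores (F1 of 0.92, against 0.98 to 1.00 for the others), yet its test F1 (0.79 +/- 0.04) is comparable to gradient boosting (0.80 +/- 0.04) and the Random Forest run (0.82 +/- 0.03). Voting therefore narrows the gap between training and test performance without a clear gain on the test set. The Extra Trees run is the exception, with a lower and noisier test F1 (0.70 +/- 0.08).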
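
The overfitting noted above can be put into numbers as the gap between mean train and mean test F1. The short sketch below uses only the F1 means printed in this notebook; the dictionary names are hypothetical.

```python
# (mean train F1, mean test F1) as printed in the outputs of this notebook
reported_f1 = {
    'Random Forest': (1.00, 0.82),
    'Extra Trees': (1.00, 0.70),
    'Gradient Boosting': (0.98, 0.80),   # output of the gradient boosting section
    'Voting Hard Classifier': (0.92, 0.79),
}

# Train minus test: a larger gap suggests more overfitting
f1_gap = {name: round(train - test, 2) for name, (train, test) in reported_f1.items()}

for name, gap in sorted(f1_gap.items(), key=lambda item: item[1]):
    print("%-24s train-test F1 gap: %0.2f" % (name, gap))
```

The gap alone does not rank the models: a small gap with a low test score is not better than a larger gap with a higher test score.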