# Gradient Boosting - Lab

## Introduction

In this lab, we'll learn how to use both the AdaBoost and Gradient Boosting classifiers from scikit-learn!

## Objectives

You will be able to:

* Compare and contrast AdaBoost and Gradient Boosting
* Use AdaBoost to make predictions on a dataset
* Use Gradient Boosting to make predictions on a dataset

## Getting Started

In this lab, we'll use boosting algorithms to classify patients in the [Pima Indians Dataset](http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names). The data is stored in the file `pima-indians-diabetes.csv`. Our goal is to classify each person as having or not having diabetes. Let's get started!

We'll begin by importing everything we need for this lab. In the cell below:

* Import `numpy`, `pandas`, and `matplotlib.pyplot`, and set the standard alias for each. Also set matplotlib visualizations to display inline. 
* Set a random seed of `0` by using `np.random.seed(0)`
* Import `train_test_split` and `cross_val_score` from `sklearn.model_selection`
* Import `StandardScaler` from `sklearn.preprocessing`
* Import `AdaBoostClassifier` and `GradientBoostingClassifier` from `sklearn.ensemble`
* Import `accuracy_score`, `f1_score`, `confusion_matrix`, and `classification_report` from `sklearn.metrics`

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Seed NumPy's global random state so the results are reproducible
np.random.seed(0)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

Now, use pandas to read in the data stored in `pima-indians-diabetes.csv` and store it in a DataFrame. Display the head to inspect the data we've imported and ensure everything loaded correctly. 

In [9]:
df = pd.read_csv('pima-indians-diabetes.csv')
df.head()

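| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |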


## Cleaning, Exploration, and Preprocessing

The target we're trying to predict is the `'Outcome'` column. A `1` denotes a patient with diabetes. 

By now, you're quite familiar with exploring and preprocessing a dataset, so we won't hold your hand for this step. 

In the following cells:

* Store our target column in a separate variable and remove it from the dataset
* Check for null values and deal with them as you see fit (if any exist)
* Check the distribution of our target
* Scale the dataset
* Split the dataset into training and testing sets, with a `test_size` of `0.25`
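
As a rough illustration of the scaling step, the sketch below standardizes a handful of values by hand: subtract the mean, then divide by the population standard deviation, which is the calculation `StandardScaler` applies to each column by default. The helper name `standardize` is hypothetical, and the five values are the `Glucose` entries from the head of the DataFrame shown earlier.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]

# First five Glucose values from the head of the dataset
glucose = [148, 85, 183, 89, 137]
scaled = standardize(glucose)

# After scaling, the values have mean 0 and standard deviation 1
print([round(z, 3) for z in scaled])
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

In the lab itself, `StandardScaler().fit_transform(X)` does this for every column at once.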

In [17]:
y = df['Outcome']
df = df.drop('Outcome',axis=1)

In [44]:
X = df
columns = X.columns

In [39]:
y.shape

(768,)

In [56]:
scaler = StandardScaler() 
scaled_df = scaler.fit_transform(X)
scaled_df
X = pd.DataFrame(scaled_df, columns = columns)


In [67]:
X.shape
y.shape

(768,)

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .25) 

## Training the Models

Now that we've cleaned and preprocessed our dataset, we're ready to fit some models!

In the cell below:

* Create an `AdaBoostClassifier`
* Create a `GradientBoostingClassifier`

In [58]:
adaboost_clf = AdaBoostClassifier()
gbt_clf = GradientBoostingClassifier()

Now, train each of the classifiers using the training data.

In [59]:
ada_model = adaboost_clf.fit(X_train, y_train) 

In [60]:
gbt_model = gbt_clf.fit(X_train, y_train) 

Now, let's create some predictions using each model so that we can calculate the training and testing accuracy for each.

In [61]:
adaboost_train_preds = ada_model.predict(X_train)
adaboost_test_preds = ada_model.predict(X_test)
gbt_clf_train_preds = gbt_model.predict(X_train) 
gbt_clf_test_preds = gbt_model.predict(X_test) 

Now, complete the following function and use it to calculate the training and testing accuracy and F1-score for each model.

In [62]:
def display_acc_and_f1_score(true, preds, model_name):
    acc = accuracy_score(true,preds) 
    f1 = f1_score(true,preds)
    print("Model: {}".format(model_name))
    print("Accuracy: {}".format(acc))
    print("F1-Score: {}".format(f1))
    
print("Training Metrics")
display_acc_and_f1_score(y_train, adaboost_train_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_train, gbt_clf_train_preds, model_name='Gradient Boosted Trees')
print("")
print("Testing Metrics")
display_acc_and_f1_score(y_test, adaboost_test_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_test, gbt_clf_test_preds, model_name='Gradient Boosted Trees')

Training Metrics
Model: AdaBoost
Accuracy: 0.8506944444444444
F1-Score: 0.7760416666666667

Model: Gradient Boosted Trees
Accuracy: 0.9288194444444444
F1-Score: 0.8923884514435695

Testing Metrics
Model: AdaBoost
Accuracy: 0.7083333333333334
F1-Score: 0.5882352941176471

Model: Gradient Boosted Trees
Accuracy: 0.7395833333333334
F1-Score: 0.6268656716417911
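
To make these two metrics concrete, here is a minimal sketch that computes accuracy and F1-score by hand for a binary problem. The function name `accuracy_and_f1` and the small label lists are made up for illustration; in the lab, `accuracy_score` and `f1_score` do this work for us.

```python
def accuracy_and_f1(true, preds):
    """Compute accuracy and F1 for binary labels, where 1 is the positive class."""
    pairs = list(zip(true, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1

# Hypothetical labels and predictions, not taken from the lab's data
true = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(true, preds)
print("Accuracy: {}".format(acc))
print("F1-Score: {}".format(f1))
```

Because F1 ignores true negatives, it can sit well below accuracy when the positive class is the minority, which is the pattern in the metrics above.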


Let's go one step further and create a confusion matrix and classification report for each. Do so in the cell below.

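In [None]:
adaboost_confusion_matrix = confusion_matrix(y_test, adaboost_test_preds)
adaboost_confusion_matrix

In [None]:
gbt_confusion_matrix = confusion_matrix(y_test, gbt_clf_test_preds)
gbt_confusion_matrix

In [None]:
adaboost_classification_report = classification_report(y_test, adaboost_test_preds)
print(adaboost_classification_report)

In [None]:
gbt_classification_report = classification_report(y_test, gbt_clf_test_preds)
print(gbt_classification_report)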
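
When reading a confusion matrix, it helps to know how its cells are laid out. The sketch below builds a 2x2 matrix by hand using the same convention as scikit-learn's `confusion_matrix`: rows are true labels and columns are predicted labels. The function name `binary_confusion_matrix` and the label lists are hypothetical.

```python
def binary_confusion_matrix(true, preds):
    """Count outcomes with rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(true, preds):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels and predictions, not taken from the lab's data
true = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]

cm = binary_confusion_matrix(true, preds)
(tn, fp), (fn, tp) = cm
print(cm)
print("TN={} FP={} FN={} TP={}".format(tn, fp, fn, tp))
```

For this dataset, false negatives (bottom-left cell) are patients with diabetes that the model missed, so that cell deserves particular attention.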

**_Question:_** How did the models perform? Interpret the evaluation metrics above to answer this question.

Write your answer below this line:
_______________________________________________________________________________________________________________________________

 
 
As a final performance check, let's calculate the `cross_val_score` for each model! Do so now in the cells below. 

Recall that to compute the cross validation score, we need to pass in:

* A classifier
* All of the (scaled) data
* All of the labels
* The number of folds we want to use for cross validation

Since we're computing cross validation score, we'll want to pass in the entire (scaled) dataset, as well as all of the labels. We don't need to give it data that has been split into training and testing sets because it will handle this step during the cross validation. 
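
To see what "handling the split" means, here is a simplified sketch of how 5-fold cross validation partitions 768 rows: each fold takes one turn as the test set while the remaining rows are used for training. The generator `k_fold_indices` is a hypothetical helper, and it uses plain contiguous folds; for classifiers, scikit-learn additionally stratifies the folds by class when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(768, 5))
for i, (train_idx, test_idx) in enumerate(folds):
    print("fold {}: train={} test={}".format(i, len(train_idx), len(test_idx)))
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training four times.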

In the cells below, compute the mean cross validation score for each model. For the data, use our `scaled_df` variable. The corresponding labels are in the variable `y`. Also set `cv=5`.

In [69]:
print('Mean AdaBoost Cross-Val Score (k=5):')
# cross_val_score expects the estimator first, then the data and labels
print(cross_val_score(adaboost_clf, scaled_df, y, cv=5).mean())
# Expected Output: 0.7631270690094218

In [None]:
print('Mean GBT Cross-Val Score (k=5):')
print(cross_val_score(gbt_clf, scaled_df, y, cv=5).mean())
# Expected Output: 0.7591715474068416

These models didn't do poorly, but we could probably do a bit better by tuning some of the important parameters, such as the **_learning rate_** (the `learning_rate` argument of both classifiers).
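
One possible way to explore this is to compare the mean cross validation accuracy across a few learning rates, as in the sketch below. It assumes synthetic stand-in data from `make_classification` so that it runs on its own. The candidate rates and `n_estimators=50` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption); in the lab you would use scaled_df and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for rate in [0.01, 0.1, 0.5]:
    clf = GradientBoostingClassifier(learning_rate=rate, n_estimators=50, random_state=0)
    # Mean accuracy over 5 folds (accuracy is the default score for classifiers)
    scores[rate] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

for rate, score in scores.items():
    print("learning_rate={}: mean CV accuracy = {:.3f}".format(rate, score))
```

A smaller learning rate usually needs more estimators to reach the same fit, so the two parameters are best tuned together.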

## Summary

In this lab, we learned how to use scikit-learn's implementations of popular boosting algorithms such as AdaBoost and Gradient Boosted Trees to make classification predictions on a real-world dataset!