# ISLR - Python Ch8 Applied 11

- [Import Caravan Dataset](#Import-Caravan-Dataset)
- [Split the data](#Split-the-data)
- [Build a Boosting Model](#Build-a-Boosting-Model)
- [Predict with Boosting Model](#Predict-with-Boosting-Model)
- [Build Confusion Matrix](#Build-Confusion-Matrix)
- [Compare with KNN Model](#Compare-with-KNN-Model)

## Import Caravan Dataset

In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline

In [5]:
df = pd.read_csv('../../../data/Caravan.csv', index_col=0)

In [11]:
df.head()

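|  | MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 33 | 1 | 3 | 2 | 8 | 0 | 5 | 1 | 3 | 7 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 37 | 1 | 2 | 2 | 8 | 1 | 4 | 1 | 4 | 6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 37 | 1 | 2 | 2 | 8 | 0 | 4 | 2 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 9 | 1 | 3 | 3 | 3 | 2 | 3 | 2 | 4 | 5 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 5 | 40 | 1 | 4 | 2 | 10 | 1 | 4 | 1 | 4 | 7 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |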


## Split the data

In [14]:
# Define the predictors and the response variables
predictors = df.columns.tolist()
predictors.remove('Purchase')

X = df[predictors].values
y = df['Purchase'].values

# use the first 1000 as training and the remainder for testing
X_train = X[0:1000]
X_test = X[1000:]
y_train = y[0:1000]
y_test = y[1000:]

## Build a Boosting Model

In [97]:
# build and fit a boosting model to the training data
booster = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1000, max_depth=3, 
                                     random_state=0)
boost_est = booster.fit(X_train, y_train)

In [98]:
# get the variable importance
Importances = pd.DataFrame(boost_est.feature_importances_, index=predictors, 
             columns=['Importance']).sort_values(by='Importance', ascending=False)
Importances.head(8)

| Predictor | Importance |
|---|---|
| MOSTYPE | 0.073911 |
| MGODGE | 0.045734 |
| MOPLHOOG | 0.044637 |
| PPERSAUT | 0.040973 |
| MKOOPKLA | 0.040659 |
| ABRAND | 0.039156 |
| PLEVEN | 0.035447 |
| PWAPART | 0.033261 |


The dataframe above lists the 8 predictors with the highest feature importance for classifying the response variable 'Purchase'.
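
As a rough gauge of how concentrated the importance is, the eight values printed above can be summed. This is a minimal sketch that assumes scikit-learn's usual normalisation, in which `feature_importances_` sums to 1 across all predictors; the name `top8` is only illustrative.

```python
# Importances copied from the table above (top 8 predictors only).
top8 = {
    "MOSTYPE": 0.073911,
    "MGODGE": 0.045734,
    "MOPLHOOG": 0.044637,
    "PPERSAUT": 0.040973,
    "MKOOPKLA": 0.040659,
    "ABRAND": 0.039156,
    "PLEVEN": 0.035447,
    "PWAPART": 0.033261,
}

# Assumption: the full importance vector sums to 1, so this sum is a share.
top8_share = sum(top8.values())
print(f"Top 8 predictors carry {top8_share:.1%} of the total importance")
```

Under that assumption, the top 8 predictors account for a bit over a third of the total importance, so the remaining predictors still contribute substantially.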

## Predict with Boosting Model

In [91]:
y_pred = boost_est.predict_proba(X_test)
print(y_pred)

[[ 0.97420521  0.02579479]
 [ 0.95516316  0.04483684]
 [ 0.984695    0.015305  ]
 ..., 
 [ 0.95235931  0.04764069]
 [ 0.89500081  0.10499919]
 [ 0.97384333  0.02615667]]


The output above gives the class probabilities, in the order [No, Yes], for each observation in the test set.
Following the problem statement, the predicted class is 'Yes' if the 'Yes' probability exceeds 20% and 'No' otherwise.

In [92]:
# label an observation 'Yes' when its predicted purchase probability is at least 0.2
pred_purchase = ['No' if row[1] < 0.2 else 'Yes' for row in y_pred]
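
The same thresholding can be done without a Python loop. The sketch below uses a small made-up probability array (`proba` is illustrative, not the notebook's output) and assumes the columns are ordered as `['No', 'Yes']`; on the fitted model that order can be checked with `boost_est.classes_`.

```python
import numpy as np

# Illustrative probabilities in [P(No), P(Yes)] order; not real model output.
proba = np.array([[0.97, 0.03],
                  [0.75, 0.25],
                  [0.80, 0.20]])

threshold = 0.2
# Same rule as the list comprehension: 'No' below the threshold, 'Yes' otherwise.
labels = np.where(proba[:, 1] < threshold, "No", "Yes")
print(labels)
```

Note that a probability of exactly 0.2 is labelled 'Yes' under this rule, matching the comparison used in the cell above.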

## Build Confusion Matrix

In [93]:
cm = confusion_matrix(y_true = y_test, y_pred=pred_purchase, labels=['No', 'Yes'])
print(cm)

[[4335  198]
 [ 251   38]]


In [94]:
# The confusion matrix is laid out as [[NN, NY],
#                                      [YN, YY]],
# where C_ij is the number of observations known to be in group i but
# predicted to be in group j.

# So the fraction of predicted-Yes observations that are actually Yes (the precision) is
cm[1,1]/(cm[1,1]+cm[0,1])

0.16101694915254236

So about 16% of the observations predicted to be in class Yes actually are Yes. Next, fit a KNN model to the same data for comparison.
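
To put the 16% in context, it helps to compare it with the share of purchasers in the test set, which can be read off the row sums of the same confusion matrix. The sketch below hard-codes the boosting matrix printed above; the variable names are illustrative.

```python
# Boosting confusion matrix printed above: rows are true classes, columns are predictions.
boost_cm = [[4335, 198],
            [251, 38]]

# Precision for 'Yes': correct Yes predictions over all Yes predictions.
precision_yes = boost_cm[1][1] / (boost_cm[1][1] + boost_cm[0][1])

# Base rate: true Yes observations over all test observations.
n_test = sum(sum(row) for row in boost_cm)
base_rate_yes = (boost_cm[1][0] + boost_cm[1][1]) / n_test

print(f"precision = {precision_yes:.3f}, base rate = {base_rate_yes:.3f}")
```

On these numbers the precision is roughly 2.7 times the base rate of about 6%. The same precision should also be obtainable with `sklearn.metrics.precision_score(y_test, pred_purchase, pos_label='Yes')`.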

## Compare with KNN Model

In [95]:
# Build KNN classifier
knn_est = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# make predictions
predicted_class = knn_est.predict_proba(X_test)
# label an observation 'Yes' when its predicted purchase probability is at least 0.2
knn_pred = ['No' if row[1] < 0.2 else 'Yes' for row in predicted_class]
# build confusion matrix
cm = confusion_matrix(y_true = y_test, y_pred=knn_pred, labels=['No', 'Yes'])
print(cm)

[[3386 1147]
 [ 174  115]]
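
With `n_neighbors=5` and the default uniform weights, each predicted probability is the fraction of the five nearest neighbours labelled 'Yes', so it can only take the values 0, 0.2, ..., 1. That makes the boundary case matter: the comparison `row[1] < 0.2` labels a point 'Yes' when a single neighbour is a purchaser, whereas a strict "exceeds 0.2" rule would require two. A small sketch of the difference, using only the possible vote fractions (no model involved):

```python
k = 5  # number of neighbours, as in the cell above

rows = []
for yes_votes in range(k + 1):
    p_yes = yes_votes / k
    at_least = "No" if p_yes < 0.2 else "Yes"  # rule used in the cell above
    strict = "Yes" if p_yes > 0.2 else "No"    # literal "exceeds 0.2" rule
    rows.append((yes_votes, at_least, strict))
    print(yes_votes, p_yes, at_least, strict)
```

The two rules disagree only when exactly one of the five neighbours is a purchaser, but for KNN that case can cover many test points, so the choice can change the confusion matrix noticeably.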


In [96]:
# So the fraction of predicted-Yes observations that are actually Yes (the precision) is
cm[1,1]/(cm[1,1]+cm[0,1])

0.091125198098256741

So boosting achieves roughly 1.8 times the precision of KNN (16.1% versus 9.1%) on this hard classification problem.
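
The comparison can be made explicit from the two confusion matrices printed above. The sketch below also reports recall, which the notebook does not compute, to show what each model gives up; `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(cm):
    """Return (precision, recall) for the 'Yes' class of a 2x2 matrix
    laid out as [[NN, NY], [YN, YY]]."""
    tp, fp, fn = cm[1][1], cm[0][1], cm[1][0]
    return tp / (tp + fp), tp / (tp + fn)

# Confusion matrices copied from the outputs above.
boost_cm = [[4335, 198], [251, 38]]
knn_cm = [[3386, 1147], [174, 115]]

boost_prec, boost_rec = precision_recall(boost_cm)
knn_prec, knn_rec = precision_recall(knn_cm)

print(f"boosting: precision={boost_prec:.3f}, recall={boost_rec:.3f}")
print(f"KNN:      precision={knn_prec:.3f}, recall={knn_rec:.3f}")
print(f"precision ratio = {boost_prec / knn_prec:.2f}")
```

On these matrices boosting has the higher precision, while KNN at this threshold flags far more customers and so recovers more of the actual purchasers. Which trade-off is preferable depends on the cost of contacting a customer who does not buy.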