In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set()
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
%matplotlib inline

# Outline

## Section 1: Decision Trees
#### Section 1.1 Intuition Underlying Tree Based Models
#### Section 1.2 Applying Tree Based Models
#### Section 1.3 Shortcomings & Limitations

## Section 2: Ensemble Methods & Random Forests
#### Section 2.1 Intuition underlying ensemble approaches
#### Section 2.2 Toy Classification Example
#### Section 2.3 Time series Regression Example
#### Section 2.4 MNIST Dataset Classification

# Section 1.1 Intuition Underlying Tree Based Models
### I Spy With My Little Eye...
Children on road trips often play a game called "I Spy With My Little Eye". One player picks an object they can see, and the other has to work out what it is by asking a series of yes/no questions.
![alt text](Errata/Fig1.pdf "I Spy With My Little Eye")

During this game, the guessing player partitions the space of all possible objects. Each question defines a new split, until only one candidate is left in the accessible space.

For instance, the first question takes away the inanimate objects in the space.
![alt text](Errata/Fig2b.pdf "I Spy With My Little Eye")

The second question removes the 4-legged friends from the picture.
![alt text](Errata/Fig2c.pdf "I Spy With My Little Eye")

The final question removes the terrestrial creatures, in favor of our feathered friends.
![alt text](Errata/Fig2d.pdf "I Spy With My Little Eye")

In mathematical parlance, the little girl is solving a multi-class classification problem: she splits the space of possible objects into disjoint partitions, and her sequence of questions forms a directed acyclic graph (DAG). DAGs used this way are Classification And Regression Trees.
![alt text](Errata/Fig3a.pdf "I Spy With My Little Eye")

Classification & Regression Trees (CARTs) are non-parametric models that learn such "splits" of the feature space from data. Each resulting partition of the feature space is mapped to a value in the target space.
![alt text](Errata/Fig4.pdf "I Spy With My Little Eye")
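
The learned splits can be inspected directly. The following is a minimal sketch, assuming scikit-learn's `export_text` helper and a small synthetic dataset; the names `X_demo`, `y_demo`, `demo_tree` and the feature labels `X1`, `X2` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (assumption): 4 clusters in a 2-D feature space
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Each inner line is a threshold test on one feature; each leaf carries a predicted class
rules = export_text(demo_tree, feature_names=["X1", "X2"])
print(rules)
```

Every path from the root to a leaf in the printed text corresponds to one rectangular region of the feature space, together with the class assigned to that region.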

# Section 1.2 Applying Tree Based Models
In this section, we apply tree based models to datasets and analyze their performance to visualize their strengths and weaknesses. But first, we generate some data for classification.

In [None]:
X, y = make_blobs(n_samples=300, centers=4,random_state=0, cluster_std=1)
plt.figure(figsize=(7,7))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='gist_rainbow')
plt.xlabel("$X_1$")
plt.ylabel("$X_2$")

In [None]:
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

In the last code cell, we first instantiate the decision tree model ("tree = DecisionTreeClassifier(max_depth=3)"). 
The second step fits the model to the data. Fitting evaluates a series of candidate splits of the feature space and, following a greedy approach, picks the best one at each step.

A tree defines a partition on the space, and every node of the tree defines a split of the dataset. The tree is a sequence of splits, chosen based on the data. At every step, possible splits are evaluated based on increase in homogeneity of data due to the split.

![alt text](Errata/Fig5.pdf "I Spy With My Little Eye")
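
To make "increase in homogeneity" concrete, here is a small sketch of how one candidate split could be scored with Gini impurity. It is a simplified illustration on a hypothetical one-dimensional feature, not scikit-learn's actual implementation; the helper names `gini` and `split_impurity` are made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, labels, threshold):
    """Size-weighted Gini impurity of the two children produced by x <= threshold."""
    left = labels[x <= threshold]
    right = labels[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

# Hypothetical 1-D feature (already sorted) with two classes
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Candidate thresholds: midpoints between consecutive feature values
candidates = (x[:-1] + x[1:]) / 2
scores = [split_impurity(x, labels, t) for t in candidates]

# Greedy choice: keep the threshold with the lowest weighted child impurity
best_threshold = candidates[int(np.argmin(scores))]

print("parent impurity:", gini(labels))
print("best threshold:", best_threshold)
```

Here the parent node has impurity 0.5 (two equally frequent classes), and the threshold 3.5 yields two pure children, so it is the split a greedy procedure would pick at this node.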

There are various arguments that can be passed to the classifier during instantiation. The primary ones include:

a) max_depth=n: This sets the maximum depth of the tree, i.e. the largest number of successive splits between the root and any leaf.

b) criterion="gini" or "entropy": This determines the criterion used to evaluate a split.
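
As a rough illustration of what these arguments control, the sketch below fits trees with different settings and reports the size of each fitted tree via `get_depth()` and `get_n_leaves()`. The dataset and the names `X_demo`, `y_demo`, `results` are assumptions made for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumption): 4 clusters in a 2-D feature space
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
        model = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=0)
        model.fit(X_demo, y_demo)
        # Record the depth and number of leaves the fitted tree actually reached
        results[(criterion, max_depth)] = (model.get_depth(), model.get_n_leaves())
        print(criterion, max_depth, results[(criterion, max_depth)])
```

A tree limited to depth d can have at most 2^d leaves, so `max_depth` bounds the number of splits only indirectly.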


We shall experiment with the maximum depth parameter later. For the moment, let's keep it fixed at 3.

In [None]:
plt.figure(figsize=(4,4))
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='gist_rainbow',clim=(y.min(), y.max()), zorder=3)
xx, yy = np.meshgrid(np.linspace(np.min(X[:,0]),np.max(X[:,0]), num=200),np.linspace(np.min(X[:,1]),np.max(X[:,1]), num=200))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
n_classes = len(np.unique(y))
contours = plt.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap='gist_rainbow', zorder=1)
#plt.savefig('Fig6f.pdf', bbox_inches='tight')

In this step, we vary the maximum depth of the tree from 3 to 6 to 10. As we see in the ensuing figures, the tree's predictions improve a tad, but then quickly devolve to overfitting.
![alt text](Fig6a3.pdf "I Spy With My Little Eye")
![alt text](Fig6c5.pdf "I Spy With My Little Eye")
![alt text](Fig6e10.pdf "I Spy With My Little Eye")
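
One way to check for overfitting numerically, rather than by eye, is to compare accuracy on the training points with accuracy on points held out from fitting. The following is a sketch under assumptions: the 70/30 split, the seeds and the names `X_tr`, `X_te`, `scores` are illustrative choices, not taken from the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of toy data as above, with part of it held out from fitting
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

scores = {}
for depth in (3, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, held-out={scores[depth][1]:.3f}")
```

A widening gap between training and held-out accuracy as the depth grows is the usual numerical signature of overfitting.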

#### Solved Exercise: Classification on the Pima Indian Diabetes Dataset

Adult-onset (type 2) diabetes is a condition where the body develops a resistance to insulin. The Pima Indians of Arizona have been studied for decades and were found to be extremely prone to adult-onset diabetes. In essence, the tribe changed its dietary habits (to modern processed foods). For some strange reason, this caused them to have the highest incidence of diabetes in the entire country.

Here, we'll go through a dataset released by the government, in which almost 800 tribe members from Arizona were tested. Their blood glucose, blood pressure and other vitals were measured, along with their diabetes status. Let's see if we can use this data with a DecisionTree model to build a classifier that predicts whether a tribesperson has diabetes based on their vitals.

In [None]:
colnames=['Pregnancy','Glucose','BP','Skin','Insulin','BMI','Pedigree','Age','Label']
pima=pd.read_csv("Errata/pima-indians-diabetes.csv",header=None,names=colnames)
pima.head(10)

In [None]:
feature_column_names=['Pregnancy','Glucose','BP','Skin','Insulin','BMI','Pedigree','Age']
X=pima[feature_column_names]
y=pima.Label

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,random_state=1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
clf=DecisionTreeClassifier(max_depth=3)
clf.fit(X_train,y_train)

In [None]:
ypred=clf.predict(X_test)

In [None]:
print("Decision Tree Classifier Accuracy:", metrics.accuracy_score(y_test,ypred))

In [None]:
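# Fix random_state so the train/validation/test split is reproducible
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=0.6, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size=0.5, random_state=1)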

X_train.shape,X_val.shape,X_test.shape

In [None]:
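depths = range(1, 16)
e = []  # validation accuracy for each depth

for d in depths:
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train, y_train)
    ytemp = clf.predict(X_val)
    e.append(metrics.accuracy_score(y_val, ytemp))

# Plot against the actual depth values rather than the list index
plt.plot(depths, e)
plt.xlabel("max_depth")
plt.ylabel("Validation accuracy")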
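
A validation curve like this is typically used to pick a depth, after which the chosen model is scored once on the untouched test set. The sketch below shows that last step. So that it runs on its own, it uses scikit-learn's bundled breast cancer dataset as a stand-in for the Pima CSV (an assumption); the names `val_acc`, `best_depth` and `final_model` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replaces the Pima CSV so the example is self-contained
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_all, y_all, train_size=0.6, random_state=1)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, train_size=0.5, random_state=1)

depths = list(range(1, 16))
val_acc = []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=1).fit(X_tr, y_tr)
    val_acc.append(model.score(X_va, y_va))

# Pick the depth with the highest validation accuracy (the smallest such depth on ties)
best_depth = depths[int(np.argmax(val_acc))]

# Refit at that depth and report accuracy on the test set, which was never used for tuning
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=1).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)
print("best max_depth:", best_depth, "test accuracy:", round(test_acc, 3))
```

Keeping the test set out of the depth search means the final accuracy is not biased by the tuning itself.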

### Exercise solutions