<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Visualizing CARTs with admissions data

_Authors: Kiefer Katovich (SF)_

---

Using the admissions data from earlier in the course, build classification and regression trees (CARTs), look at how they work visually, and compare their performance to more standard parametric models.


---

### 1. Load the admissions data and Python packages

In [29]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [30]:
admit = pd.read_csv('../../data/admissions.csv')

---

### 2. Create regression and classification X, y data

The regression data will be:

    Xr = [admit, gre, prestige]
    yr = gpa
    
The classification data will be:

    Xc = [gre, gpa, prestige]
    yc = admit

In [31]:
admit = admit.dropna()

In [32]:
admit.head()

| | admit | gre | gpa | prestige |
|---|---|---|---|---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


In [33]:
Xr = admit[['admit','gre','prestige']]
yr = admit.gpa

Xc = admit[['gpa','gre','prestige']]
yc = admit.admit

---

### 3. Cross-validate regression and logistic regression on the data

Fit a linear regression for the regression problem and a logistic regression for the classification problem. Cross-validate the $R^2$ scores of the former and the accuracy scores of the latter.
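As a hedged sketch of where those two metrics come from: when `scoring` is not given, `cross_val_score` falls back to the estimator's own `.score` method, which is $R^2$ for regressors and accuracy for classifiers. The example below uses synthetic data from `make_regression` and `make_classification` as stand-ins (an assumption for illustration, not the admissions data), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for (Xr, yr) and (Xc, yc); NOT the admissions data.
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10.0,
                               random_state=0)
X_cls, y_cls = make_classification(n_samples=200, n_features=3,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)

# With no `scoring` argument, the estimator's own .score method is used.
r2_default = cross_val_score(LinearRegression(), X_reg, y_reg, cv=4)
acc_default = cross_val_score(LogisticRegression(), X_cls, y_cls, cv=4)

# Asking for the metric by name gives the same per-fold values.
r2_named = cross_val_score(LinearRegression(), X_reg, y_reg, cv=4,
                           scoring='r2')
acc_named = cross_val_score(LogisticRegression(), X_cls, y_cls, cv=4,
                            scoring='accuracy')

print(np.allclose(r2_default, r2_named), np.allclose(acc_default, acc_named))
```

Naming the metric explicitly can make a notebook easier to read later, since the reader does not have to remember each estimator's default.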

In [34]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

In [35]:
reg_scores = cross_val_score(LinearRegression(), Xr, yr, cv=4)
cls_scores = cross_val_score(LogisticRegression(), Xc, yc, cv=4)

print(reg_scores, np.mean(reg_scores))
print(cls_scores, np.mean(cls_scores))

linreg = LinearRegression().fit(Xr, yr)
logreg = LogisticRegression().fit(Xc, yc)

[0.22470964 0.08296819 0.03204903 0.16434809] 0.12601873539032846
[0.71       0.72727273 0.68686869 0.70707071] 0.7078030303030304


---

### 4. Building regression trees

With `DecisionTreeRegressor`:

1. Build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the $R^2$ scores of each of the models and compare them to the linear regression from earlier.
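To get a feel for what `max_depth` controls, the sketch below fits the same four depths on a small synthetic one-feature dataset (an assumption for illustration, not the admissions data) and records the number of leaves and the training $R^2$ of each tree. The names `X_demo`, `y_demo`, and `summary` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data; NOT the admissions data.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

summary = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    summary[depth] = {
        'n_leaves': tree.get_n_leaves(),         # number of terminal nodes
        'train_r2': tree.score(X_demo, y_demo),  # R^2 on the training data
    }
    print(depth, summary[depth])
```

A tree of depth `d` has at most `2**d` leaves, and each leaf predicts a single constant, so shallow trees can only output a handful of distinct values. Training $R^2$ can only go up as the depth limit is relaxed, which is why it is the cross-validated score, not the training score, that should be compared across depths.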

In [36]:
from sklearn.tree import DecisionTreeRegressor

In [37]:
dtr1 = DecisionTreeRegressor(max_depth=1)
dtr2 = DecisionTreeRegressor(max_depth=2)
dtr3 = DecisionTreeRegressor(max_depth=3)
dtrN = DecisionTreeRegressor(max_depth=None)

In [38]:
dtr1.fit(Xr, yr)
dtr2.fit(Xr, yr)
dtr3.fit(Xr, yr)
dtrN.fit(Xr, yr)

DecisionTreeRegressor()

In [39]:
dtr1_scores = cross_val_score(dtr1, Xr, yr, cv=4)
dtr2_scores = cross_val_score(dtr2, Xr, yr, cv=4)
dtr3_scores = cross_val_score(dtr3, Xr, yr, cv=4)
dtrN_scores = cross_val_score(dtrN, Xr, yr, cv=4)

print(dtr1_scores, np.mean(dtr1_scores))
print(dtr2_scores, np.mean(dtr2_scores))
print(dtr3_scores, np.mean(dtr3_scores))
print(dtrN_scores, np.mean(dtrN_scores))

[0.16618105 0.1535036  0.03860296 0.10081223] 0.1147749611042107
[0.20722899 0.14179888 0.04112242 0.11836674] 0.1271292579858377
[0.15422529 0.123802   0.05252648 0.08070045] 0.10281355747790263
[-0.14933234 -0.15835462 -0.4649248  -0.17945056] -0.23801557770560355


---

### 5. Building classification trees

With `DecisionTreeClassifier`:

1. Again build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the accuracy scores of each of the models and compare to the logistic regression earlier.

Note that now you'll be using the classification task where we are predicting `admit`.
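One lightweight way to look at a fitted classification tree is to print its rules as text with `sklearn.tree.export_text`. The sketch below does this for a depth-2 tree on synthetic data (an assumption for illustration, not the admissions data); the feature names `f0`, `f1`, `f2` are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-feature binary problem; NOT the admissions data.
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
feature_names = ['f0', 'f1', 'f2']  # hypothetical names for illustration

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_demo, y_demo)

# Each indented line is a split threshold; each "class:" line is a leaf.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

The same call should work on the trees fitted to `Xc` and `yc`, passing `list(Xc.columns)` as `feature_names`.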

In [24]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
dtc1 = DecisionTreeClassifier(max_depth=1)
dtc2 = DecisionTreeClassifier(max_depth=2)
dtc3 = DecisionTreeClassifier(max_depth=3)
dtcN = DecisionTreeClassifier(max_depth=None)

In [26]:
dtc1.fit(Xc, yc)
dtc2.fit(Xc, yc)
dtc3.fit(Xc, yc)
dtcN.fit(Xc, yc)


DecisionTreeClassifier()

In [27]:
dtc1_scores = cross_val_score(dtc1, Xc, yc, cv=4)
dtc2_scores = cross_val_score(dtc2, Xc, yc, cv=4)
dtc3_scores = cross_val_score(dtc3, Xc, yc, cv=4)
dtcN_scores = cross_val_score(dtcN, Xc, yc, cv=4)

print(dtc1_scores, np.mean(dtc1_scores))
print(dtc2_scores, np.mean(dtc2_scores))
print(dtc3_scores, np.mean(dtc3_scores))
print(dtcN_scores, np.mean(dtcN_scores))

[0.68       0.68686869 0.66666667 0.67676768] 0.6775757575757576
[0.69       0.76767677 0.63636364 0.61616162] 0.6775505050505051
[0.77       0.76767677 0.61616162 0.6969697 ] 0.7127020202020202
[0.62       0.67676768 0.58585859 0.56565657] 0.6120707070707071


---

### 6. Print out the "feature importances"

A fitted tree has an attribute called `.feature_importances_` that tells us how important each feature was relative to the others. Each value lies between 0 and 1, and the values sum to 1; a larger value means a more important feature.

An easy way to think about feature importance is how much that particular variable was used to make decisions. More precisely, it measures how much the splits on that feature reduced the class impurity (for classification) or the variance (for regression).

A feature with higher feature importance reduced the criterion (impurity) more than the other features.
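The impurity-reduction idea can be checked by hand. The sketch below walks the nodes of a fitted tree, adds up the weighted impurity decrease of every split per feature, normalizes the totals to sum to 1, and compares the result with `.feature_importances_`. It uses synthetic data (an assumption for illustration, not the admissions data), and the names `clf` and `manual` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-feature binary problem; NOT the admissions data.
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

t = clf.tree_
manual = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # -1 marks a leaf, which has no split to credit
        continue
    # Impurity decrease of this split, weighted by samples in each node.
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # normalize so the importances sum to 1
print(manual)
print(clf.feature_importances_)
```

Because the values are normalized, they describe each feature's share of the total impurity reduction within one tree, not an absolute effect size.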

Below, show the feature importances for each variable when predicting `admit`, sorted from most important to least.

In [41]:
fi = pd.DataFrame({
        'feature':Xc.columns,
        'importance':dtc3.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
fi

| | feature | importance |
|---|---|---|
| 2 | prestige | 0.402035 |
| 0 | gpa | 0.355654 |
| 1 | gre | 0.242311 |
