<a href="https://colab.research.google.com/github/ckalibsnelson/HackCville---Node-A/blob/master/07_Decision_Trees_and_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree Classifiers

In [0]:
import pandas as pd

# the two models we will be working with today
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

# scikit-learn's bundled datasets (not used below; iris is read from a CSV instead)
from sklearn import datasets

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv', header=0)
df.shape

(150, 5)

### Train/Test Sets

We'll predict `species` throughout this notebook, using the four measurement columns as features: `sepal_length`, `sepal_width`, `petal_length`, and `petal_width`.

In [0]:
X = df.drop(columns='species')
Y = df['species']

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
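
The split above is random, so the scores later in the notebook change from run to run. As a minimal sketch of a reproducible, stratified split, the cell below assumes scikit-learn's bundled copy of iris (`load_iris`) instead of the CSV, and the seed `42` is an arbitrary choice; the names `X_demo`, `y_demo`, `X_tr`, and so on are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of iris (assumption: used here so the sketch runs offline)
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

# random_state fixes the shuffle; stratify keeps class proportions equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index().tolist())
```

With 50 samples per species and `test_size=0.2`, stratifying puts 10 samples of each species in the test set.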

# Decision Tree

In [0]:
# model
tree = DecisionTreeClassifier()

# train
tree.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [0]:
# score

print(tree.score(X_train, Y_train))
print(tree.score(X_test, Y_test))

1.0
1.0


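The tree scores 1.0 on both the training and the test set. A perfect training score is expected from an unpruned tree, and the test score comes from only 30 samples, so it will vary with the random split.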

In [0]:
# predict
Y_predict = tree.predict(X_test)

### Feature Importances

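The `feature_importances_` attribute of a fitted `DecisionTreeClassifier` gives a measure of each feature's importance: the total impurity reduction from the splits on that feature (Gini impurity here, since `criterion='gini'`), normalized so that the values sum to 1.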

In [0]:
tree.feature_importances_

array([0.03334028, 0.        , 0.04381865, 0.92284107])

Looks like the decision tree mainly used `petal_width` to predict the species.

In [0]:
# Put the feature names and their importances in a data frame for readability.
pd.DataFrame({'Gain': tree.feature_importances_}, index = X_train.columns).sort_values('Gain', ascending = False)

| | Gain |
|---|---|
| petal_width | 0.922841 |
| petal_length | 0.043819 |
| sepal_length | 0.033340 |
| sepal_width | 0.000000 |
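
To make the importance numbers less opaque, here is a small sketch of the quantity they are built from when `criterion='gini'`: the drop in Gini impurity produced by one split. The helper names `gini` and `impurity_decrease` and the toy labels are hypothetical, written for illustration; they are not scikit-learn functions.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy labels (hypothetical, not taken from the iris data)
parent = np.array(["a", "a", "a", "b", "b", "b"])
left, right = parent[:3], parent[3:]  # a split that separates the classes perfectly

print(gini(parent))
print(impurity_decrease(parent, left, right))
```

A feature whose splits remove a lot of impurity across the tree ends up with a high importance; a feature that is never split on, like `sepal_width` above, gets 0.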


### Confusion Matrix

A confusion matrix tabulates actual classes against predicted classes, so it shows not only how often the model is wrong but also which classes it confuses.

We build one with the pandas `crosstab()` function:

In [0]:
pd.crosstab(Y_test, Y_predict, rownames=['Actual'], colnames = ['Predicted:'], margins=True)

| Actual (rows) / Predicted (columns) | setosa | versicolor | virginica | All |
|---|---|---|---|---|
| setosa | 9 | 0 | 0 | 9 |
| versicolor | 0 | 11 | 0 | 11 |
| virginica | 0 | 0 | 10 | 10 |
| All | 9 | 11 | 10 | 30 |
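
scikit-learn can produce the same table, plus per-class metrics, through `sklearn.metrics`. The sketch below is one possible variant, not the notebook's method: it assumes the bundled `load_iris` data and a fixed seed of `0` so that it runs on its own, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall and F1 score
print(classification_report(y_te, y_hat, target_names=iris.target_names))
```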


# Random Forest

In [0]:
# model
forest = RandomForestClassifier(criterion = 'entropy')

# train
forest.fit(X_train, Y_train)

# predict
forest_predictions = forest.predict(X_test)

# feature importances
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = X_train.columns).sort_values('Importance', ascending = False))

# confusion matrix
pd.crosstab(Y_test, forest_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)

              Importance
petal_width     0.511902
petal_length    0.366303
sepal_length    0.087671
sepal_width     0.034124




| Actual (rows) / Predicted (columns) | setosa | versicolor | virginica | All |
|---|---|---|---|---|
| setosa | 9 | 0 | 0 | 9 |
| versicolor | 0 | 11 | 0 | 11 |
| virginica | 0 | 1 | 9 | 10 |
| All | 9 | 12 | 9 | 30 |


In [0]:
# One of the 30 test samples was misclassified, so on this split the forest
# scores slightly below the single tree. A one-sample gap is within noise.
forest.score(X_test, Y_test)

0.9666666666666667
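
A single 30-sample test set is a noisy basis for ranking two models. One common way to get a steadier comparison is k-fold cross-validation; the sketch below assumes the bundled `load_iris` data, 5 folds, and a seed of `0`, all of which are illustrative choices rather than part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(criterion="entropy", random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold CV: each sample is held out for testing exactly once
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

Comparing the mean accuracies alongside their spread across folds gives a better sense of whether a difference between the two models is real.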

# On your own

In [0]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

In [0]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Recode the `quality` column (the target we want to predict) into a categorical variable with three classes: good, average, and bad.
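
One way to do the recoding is `pd.cut`, which bins a numeric column into labeled intervals. The sketch below uses a short hypothetical series in place of `data['quality']`, and the cut points (4 and below is bad, 5 to 6 is average, 7 and above is good) are an assumption; the exercise does not specify thresholds, so pick ones that suit the distribution you see.

```python
import pandas as pd

# Hypothetical quality scores standing in for data['quality']
quality = pd.Series([3, 4, 5, 5, 6, 7, 8])

# Assumed cut points; pd.cut bins are right-inclusive: (0, 4], (4, 6], (6, 10]
quality_label = pd.cut(
    quality,
    bins=[0, 4, 6, 10],
    labels=["bad", "average", "good"],
)

print(quality_label.tolist())
```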

Create a decision tree, bagging classifier, and random forest to predict quality. Compare results.
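
As a starting point, here is a rough sketch of how the three models can be fit and compared side by side. It runs on synthetic data from `make_classification` (an assumption, standing in for the recoded wine data), and the settings such as `n_estimators=50` and the seed `0` are arbitrary; swap in your own `X` and recoded `quality` labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption: stands in for the recoded wine data)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each fit on a bootstrap sample of the training rows
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Random forest: bagging plus a random subset of features at each split
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```

Printing the training score next to the test score makes overfitting visible: a large gap between the two suggests the model has memorized the training set.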