In [20]:
from patsy import dmatrices, build_design_matrices
import patsy

In [21]:
import pandas as pd
df = pd.read_csv('./assets/datasets/car.csv')
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


This time we will encode the features with a one-hot encoding scheme, i.e. we will treat them as categorical variables. We also need to encode the label as integers, for which we use the LabelEncoder.

In [42]:
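from sklearn.preprocessing import LabelEncoder
# Encode the class labels as integers
y = LabelEncoder().fit_transform(df['acceptability'])
# One-hot encode the string-valued features. Numeric columns pass through unchanged,
# which includes 'Unnamed: 0' if it is only the saved row index; drop it if unintended.
X = pd.get_dummies(df.drop('acceptability', axis=1))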
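
To make the two encodings concrete, here is a minimal plain-Python sketch of what they do conceptually. The tiny `rows` sample and the helper names `one_hot` and `label_encode` are made up for illustration. The `column_value` naming and the sorted class ordering mirror what `pd.get_dummies` and `LabelEncoder` produce, but this is not their actual implementation.

```python
# Toy sample (hypothetical rows in the style of the car dataset)
rows = [
    {"safety": "low", "lug_boot": "small", "acceptability": "unacc"},
    {"safety": "med", "lug_boot": "small", "acceptability": "acc"},
    {"safety": "high", "lug_boot": "med", "acceptability": "vgood"},
]

def one_hot(rows, columns):
    """Turn each categorical column into one 0/1 indicator per observed value."""
    # Collect the sorted distinct values of every column
    levels = {c: sorted({r[c] for r in rows}) for c in columns}
    encoded = []
    for r in rows:
        encoded.append({
            "{}_{}".format(c, v): int(r[c] == v)
            for c in columns for v in levels[c]
        })
    return encoded

def label_encode(values):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes

X_toy = one_hot(rows, ["safety", "lug_boot"])
y_toy, classes = label_encode([r["acceptability"] for r in rows])

print(X_toy[0])
print(y_toy, classes)
```

Each row ends up with exactly one active indicator per original column, which is why the tree models can treat the categories without assuming any ordering between them.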

We would like to compare the performance of the following 4 algorithms:
Decision Trees
Bagging + Decision Trees
Random Forest
Extra Trees

Note that for the results to be comparable, every model must be evaluated with exactly the same cross-validation scheme. Let's start by initializing it.

In [44]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

# One shared splitter, so every model is scored on the same three stratified folds
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
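
As a quick check of why one seeded splitter gives consistent comparisons, the sketch below uses the `sklearn.model_selection` API found in current scikit-learn releases. The arrays `X_demo` and `y_demo` are hypothetical toy data, not the car dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 12 samples of class 0, 6 of class 1
y_demo = np.array([0] * 12 + [1] * 6)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

# Calling split() twice on the same seeded splitter yields the same folds,
# so every model passed this cv object is tested on identical rows.
first = [test.tolist() for _, test in cv_demo.split(X_demo, y_demo)]
second = [test.tolist() for _, test in cv_demo.split(X_demo, y_demo)]
print(first == second)

# Stratification: each test fold keeps the 2:1 class ratio of the full data
for test in first:
    print(np.bincount(y_demo[test]))
```

Without a fixed `random_state`, shuffled folds would differ between calls, and score differences between models could partly reflect different splits instead of different models.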

Now let's initialize a Decision Tree classifier, together with a bagged version of it, and evaluate their performance:

In [49]:
dt = DecisionTreeClassifier(class_weight='balanced')
bdt = BaggingClassifier(DecisionTreeClassifier())
# Score both models on the shared folds and report mean ± standard deviation
for name, model in [("Decision Tree", dt), ("Bagging DT", bdt)]:
    s = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("{} Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

Decision Tree Score:	0.966 ± 0.009
Bagging DT Score:	0.968 ± 0.004


Your turn now:

Initialize the following models and check their performance:

Bagging + Decision Trees
Random Forest
Extra Trees


In [50]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
bdt = BaggingClassifier(DecisionTreeClassifier())
rf = RandomForestClassifier(class_weight='balanced', n_jobs=-1)
et = ExtraTreesClassifier(class_weight='balanced', n_jobs=-1)

def score(model, name):
    # Cross-validate on the shared folds and print mean ± standard deviation
    s = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("{} Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

score(dt, "Decision Tree")
score(bdt, "Bagging DT")
score(rf, "Random Forest")
score(et, "Extra Trees")


Decision Tree Score:	0.966 ± 0.009
Bagging DT Score:	0.968 ± 0.004
Random Forest Score:	0.941 ± 0.008
Extra Trees Score:	0.957 ± 0.002


In this case the bagged Decision Tree still has the highest score, although its lead over the single Decision Tree is within the fold-to-fold error, so the two are compatible. Random Forest and Extra Trees score somewhat lower here. On other datasets they may well perform better, so they are worth testing.
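
One rough way to judge whether two scores are "compatible within the error" is to divide the difference of the means by the combined spread. The sketch below applies this to the numbers printed above. The helper `separation` is a hypothetical name, and with only three folds that share training data this is a rule of thumb, not a formal significance test.

```python
from math import sqrt

# Mean and standard deviation across the 3 folds, copied from the output above
scores = {
    "Decision Tree": (0.966, 0.009),
    "Bagging DT": (0.968, 0.004),
    "Random Forest": (0.941, 0.008),
    "Extra Trees": (0.957, 0.002),
}

def separation(a, b):
    """Difference of the means in units of the combined spread."""
    (mean_a, std_a), (mean_b, std_b) = a, b
    return abs(mean_a - mean_b) / sqrt(std_a ** 2 + std_b ** 2)

best = max(scores, key=lambda name: scores[name][0])
for name, result in scores.items():
    if name != best:
        print("{} vs {}: {:.1f}".format(best, name, separation(scores[best], result)))
```

A value well below 1 suggests the gap is within the noise between folds. With these numbers the gap to the single Decision Tree is far below 1, while the gap to the Random Forest is roughly 3.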



Test the performance of the AdaBoostClassifier and GradientBoostingClassifier models on the car dataset. Use the code you developed above as starter code.

In [48]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
ab = AdaBoostClassifier()
gb = GradientBoostingClassifier()
bdt = BaggingClassifier(DecisionTreeClassifier())
rf = RandomForestClassifier(class_weight='balanced', n_jobs=-1)
et = ExtraTreesClassifier(class_weight='balanced', n_jobs=-1)

def score(model, name):
    # Cross-validate on the shared folds and print mean ± standard deviation
    s = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("{} Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

score(dt, "Decision Tree")
score(bdt, "Bagging DT")
score(rf, "Random Forest")
score(et, "Extra Trees")
score(ab, "AdaBoost")
score(gb, "Gradient Boosting Classifier")

Decision Tree Score:	0.966 ± 0.009
Bagging DT Score:	0.968 ± 0.004
Random Forest Score:	0.941 ± 0.008
Extra Trees Score:	0.957 ± 0.002
AdaBoost Score:	0.811 ± 0.002
Gradient Boosting Classifier Score:	0.982 ± 0.006


In this class we learned about Random Forests, Extremely Randomized Trees (Extra Trees) and Boosting. They are different ways to improve the performance of a weak learner.
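
To illustrate the idea of improving a weak learner, the sketch below compares a single decision stump with AdaBoost built on the same stump. It uses synthetic data from `make_classification`, which is an assumption of this example and not the car dataset, so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for this sketch, not the car dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=10,
                                   n_informative=5, random_state=0)

# A deliberately weak learner: a decision tree with a single split
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
# Boosting fits many such stumps in sequence, reweighting misclassified samples
boosted = AdaBoostClassifier(stump, n_estimators=50, random_state=0)

results = {}
for name, model in [("Single stump", stump), ("AdaBoost of stumps", boosted)]:
    s = cross_val_score(model, X_syn, y_syn, cv=3)
    results[name] = s
    print("{} Score:\t{:.3f} ± {:.3f}".format(name, s.mean(), s.std()))
```

The base learner is passed positionally because the keyword for it has changed name between scikit-learn versions.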

No single method wins everywhere: some perform better in some cases, others in other cases. For example, Decision Trees are lightweight and easy to explain, but have a tendency to overfit. Ensemble methods, on the other hand, perform better in more complex scenarios, but can become very complicated and harder to explain. Have a look here for a couple of examples from the real-world startup Wise.io.
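
As a small illustration of why a single tree is easy to communicate, the sketch below prints the rules of a shallow tree with `sklearn.tree.export_text` (available from scikit-learn 0.21). It uses the built-in iris data purely as a stand-in. An ensemble of many trees has no comparably compact summary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()  # small built-in dataset, used only for illustration
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```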

Check: Can you think of what the limitations of these methods could be?

Answer:

They don't scale very well to large datasets, Boosting in particular.
They are black boxes.