In [17]:
import pandas as pd
df = pd.read_csv('./assets/datasets/car.csv')

In [18]:
df.doors.value_counts()

3        432
5more    432
4        432
2        432
Name: doors, dtype: int64

In [19]:
df.buying.value_counts()

med      432
high     432
low      432
vhigh    432
Name: buying, dtype: int64

In [20]:
df.acceptability.value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: acceptability, dtype: int64

In [12]:
df.head()

Unnamed: 0  buying  maint  doors  persons  lug_boot  safety  acceptability
0           vhigh   vhigh  2      2        small     low     unacc
1           vhigh   vhigh  2      2        small     med     unacc
2           vhigh   vhigh  2      2        small     high    unacc
3           vhigh   vhigh  2      2        med       low     unacc
4           vhigh   vhigh  2      2        med       med     unacc


Since most of the features are categorical text, we need to encode them as numbers. A first attempt uses the LabelEncoder:

In [13]:
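from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Every column except the target is treated as a feature (this includes 'Unnamed: 0')
features = [c for c in df.columns if c != 'acceptability']
# Refit the encoder on each column in turn (the target included),
# replacing the text labels with integer codes
for c in df.columns:
    df[c] = le.fit_transform(df[c])

X = df[features]
y = df['acceptability']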

Check: Is it correct to use the label encoder blindly like this?

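Answer: No, it's not correct. Several of the categorical features are ordinal (their values have a natural scale), but LabelEncoder assigns integer codes in alphabetical order of the labels, so the codes do not respect that scale. It would be more appropriate to do one of the following:

use pd.get_dummies to encode them as binary indicator columns, or
use a map that assigns a numerical scale matching the order of the values, e.g. one where med > small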
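
As a rough illustration of the difference, the sketch below compares the blind LabelEncoder codes with the two alternatives. The toy frame and the scale values in buying_scale and lug_boot_scale are assumptions made up for the example, not part of the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hand-made frame (hypothetical; it only mimics two columns of the car data).
toy = pd.DataFrame({
    "buying": ["low", "med", "high", "vhigh"],
    "lug_boot": ["small", "med", "big", "small"],
})

# LabelEncoder sorts the labels alphabetically, so the codes ignore the real order.
le = LabelEncoder()
blind_codes = le.fit_transform(toy["buying"])
print(list(le.classes_))
print(list(blind_codes))

# Option 1: explicit maps that follow the natural order (assumed scales).
buying_scale = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
lug_boot_scale = {"small": 0, "med": 1, "big": 2}
ordinal = pd.DataFrame({
    "buying": toy["buying"].map(buying_scale),
    "lug_boot": toy["lug_boot"].map(lug_boot_scale),
})
print(ordinal)

# Option 2: one binary indicator column per category value.
dummies = pd.get_dummies(toy, columns=["buying", "lug_boot"])
print(dummies.columns.tolist())
```

The explicit map keeps one column per feature and preserves the order. get_dummies drops the order but avoids imposing a wrong one, at the cost of more columns.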

The next step is to calculate the cross_val_score for the two classifiers:

In [15]:
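# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Plain KNN versus a bagged ensemble of KNNs, each trained on a random
# draw of half the samples and half the features
knn = KNeighborsClassifier()
bagging = BaggingClassifier(knn, max_samples=0.5, max_features=0.5)

# Mean accuracy over 5 cross-validation folds
print("KNN Score:\t%s" % cross_val_score(knn, X, y, cv=5, n_jobs=-1).mean())
print("Bagging Score:\t%s" % cross_val_score(bagging, X, y, cv=5, n_jobs=-1).mean())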

KNN Score:	0.643070305149
Bagging Score:	0.699082537976


Check: Does bagging interfere with cross-validation? Are we leaking data and thus inflating the cross-validation score?

Answer: No, we are not leaking data. Within each fold, bagging draws its samples only from the training portion, so it never sees the data in the test fold. You can convince yourself of this by doing a simple train/test split.
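
One way to run that check is sketched below. It uses synthetic data from make_classification as a stand-in (an assumption, so the snippet runs on its own); the names X_demo, y_demo and bagging_demo are hypothetical, and the encoded X and y from above could be substituted:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; replace with the encoded X and y).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Hold out a test set before the ensemble is ever fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

bagging_demo = BaggingClassifier(
    KNeighborsClassifier(), max_samples=0.5, max_features=0.5, random_state=0
)

# fit only receives the training rows, so every bootstrap sample
# is drawn from X_train and none from X_test.
bagging_demo.fit(X_train, y_train)

# Accuracy on rows the ensemble has never seen.
test_score = bagging_demo.score(X_test, y_test)
print("Held-out accuracy:", test_score)
```

cross_val_score does the same thing once per fold: it clones the estimator, fits it on the training part of the fold, and scores it on the held-out part.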

max_samples is the number of samples to draw from X to train each base estimator. It can be given as an absolute number (int) or as a fraction of the total (float).
max_features is the number of features to draw from X to train each base estimator. It can also be given as an absolute number or as a fraction.
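
The two ways of specifying these parameters can be compared with a small sketch like the following. The data is synthetic (an assumption), and as_fraction and as_count are hypothetical names; with 200 rows and 8 columns, a fraction of 0.5 should correspond to 100 samples and 4 features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 200 rows and 8 columns (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Fractions: half of the rows and half of the columns per base estimator.
as_fraction = BaggingClassifier(
    KNeighborsClassifier(), max_samples=0.5, max_features=0.5, random_state=0
).fit(X_demo, y_demo)

# Absolute counts: 100 rows and 4 columns per base estimator.
as_count = BaggingClassifier(
    KNeighborsClassifier(), max_samples=100, max_features=4, random_state=0
).fit(X_demo, y_demo)

# estimators_features_ holds the column indices used by each base estimator.
frac_sizes = [len(f) for f in as_fraction.estimators_features_]
count_sizes = [len(f) for f in as_count.estimators_features_]
print(frac_sizes)
print(count_sizes)
```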