# Random Forest 

This notebook evaluates several Random Forest classifier configurations on the preprocessed Airbnb dataset.

## Part 1 : Preprocessed data with 'date_first_booking'

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import make_scorer

data = pd.read_csv('preprocessed_airbnb_train.csv')
labels = data.loc[:,'country_destination']
data = data.drop(['country_destination'], axis=1)

def folds_to_split(data,targets,train,test):
    data_tr = pd.DataFrame(data).iloc[train]
    data_te = pd.DataFrame(data).iloc[test]
    labels_tr = pd.DataFrame(targets).iloc[train]
    labels_te = pd.DataFrame(targets).iloc[test]
    return [data_tr, data_te, labels_tr, labels_te]

def dcg_score(y_true, y_score, k=5):
    # Discounted cumulative gain over the k highest-scored classes.
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    # The class at 0-based position i is discounted by log2(i + 2).
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)

def ndcg_score(te_labels, predict, k):
    # Mean nDCG@k over all samples. This assumes the labels are integer codes
    # that line up with the column order of `predict`.
    lb = LabelBinarizer()
    lb.fit(range(len(predict) + 1))
    T = lb.transform(te_labels)
    scores = []

    for y_true, y_score in zip(T, predict):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        score = float(actual) / float(best)
        scores.append(score)

    return np.mean(scores)

ndcg_scorer = make_scorer(ndcg_score, needs_proba=True, k=5)
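
Because every user has exactly one true destination, the metric above reduces to a simple rule: a sample scores 1 / log2(rank + 1) when the true class appears at position `rank` among the top k predictions, and 0 otherwise. The sketch below illustrates this on a made-up probability vector; `ndcg_at_k_single` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
import numpy as np

def ndcg_at_k_single(true_class, class_scores, k=5):
    """nDCG@k for one sample that has exactly one correct class."""
    # Class indices ordered from highest to lowest predicted score, cut at k.
    ranking = np.argsort(class_scores)[::-1][:k]
    hits = np.where(ranking == true_class)[0]
    if len(hits) == 0:
        return 0.0  # the true class is not among the top k
    rank = hits[0] + 1  # 1-based position of the true class
    return 1.0 / np.log2(rank + 1)

# Made-up scores for four classes; class 1 is ranked first, class 2 second.
scores = np.array([0.10, 0.60, 0.25, 0.05])

first = ndcg_at_k_single(1, scores)        # true class ranked first
second = ndcg_at_k_single(2, scores)       # true class ranked second
missed = ndcg_at_k_single(3, scores, k=2)  # true class outside the top 2

print(first, second, missed)
```

A correct first guess scores 1.0, a correct second guess scores 1 / log2(3), roughly 0.63, and a miss scores 0, so the metric rewards putting the true destination as high in the list as possible.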


### Varying the max-depth of the Random Forest 

In [2]:
results = pd.DataFrame()
foldnum=0

for train, test in cross_validation.KFold(len(data), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data, labels, train, test)

    for max_depth in range(1, 5):
        rf = RandomForestClassifier(n_estimators=600,criterion='gini', max_depth=max_depth)
        rf = rf.fit(tr_data, tr_labels.values.ravel())
        rf_predict = rf.predict_proba(te_data) 
        score = ndcg_score(te_labels.values, rf_predict, k=5)
        print('Fold : {}, Max-depth : {}, Score : {}'.format( foldnum, max_depth, score ))
        results.loc[foldnum, 'max_depth=%d'%max_depth ] = score

Fold : 1, Max-depth : 1, Score : 0.914630585321
Fold : 1, Max-depth : 2, Score : 0.914630585321
Fold : 1, Max-depth : 3, Score : 0.920932777486
Fold : 1, Max-depth : 4, Score : 0.922777094909
Fold : 2, Max-depth : 1, Score : 0.916515064953
Fold : 2, Max-depth : 2, Score : 0.916515064953
Fold : 2, Max-depth : 3, Score : 0.922745835777
Fold : 2, Max-depth : 4, Score : 0.924148418009
Fold : 3, Max-depth : 1, Score : 0.915203504403
Fold : 3, Max-depth : 2, Score : 0.915203504403
Fold : 3, Max-depth : 3, Score : 0.921028620066
Fold : 3, Max-depth : 4, Score : 0.922495212253
Fold : 4, Max-depth : 1, Score : 0.914359339211
Fold : 4, Max-depth : 2, Score : 0.914359339211
Fold : 4, Max-depth : 3, Score : 0.920640176937
Fold : 4, Max-depth : 4, Score : 0.923650754396
Fold : 5, Max-depth : 1, Score : 0.914549312078
Fold : 5, Max-depth : 2, Score : 0.914549312078
Fold : 5, Max-depth : 3, Score : 0.921183034291
Fold : 5, Max-depth : 4, Score : 0.922889579245
Fold : 6, Max-depth : 1, Score : 0.91455

In [4]:
results.mean()

max_depth=1    0.914617
max_depth=2    0.914617
max_depth=3    0.920844
max_depth=4    0.922857
dtype: float64

The mean nDCG score increases with the max-depth, from 0.9146 at depths 1 and 2 to 0.9229 at depth 4.

### Without setting the max-depth 

In [5]:
foldnum=0
results = pd.DataFrame()
for train, test in cross_validation.KFold(len(data), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data, labels, train, test)
    rf = RandomForestClassifier(n_estimators=600,criterion='gini')
    rf = rf.fit(tr_data, tr_labels.values.ravel())
    rf_predict = rf.predict_proba(te_data) 
    score = ndcg_score(te_labels.values, rf_predict, k=5)
    results.loc[foldnum, 'no max-depth' ] = score

In [6]:
results.mean()

no max-depth    0.922605
dtype: float64

The nDCG score of this model (0.9226) is comparable to that of max-depth = 4 (0.9229). Scores this high are an artifact of 'date_first_booking': the field will be NULL for all the new users whose destination needs to be predicted, so a model that relies on it will not carry over to them. A good model will be one trained without the 'date_first_booking' attribute.

## Part 2 : Preprocessed data without 'date_first_booking'

As seen during the exploratory analysis, 58% of the data has the country destination 'NDF' (No Destination Found). It was also found that there is a 1-to-1 correspondence between a NaN 'date_first_booking' and 'NDF'. That is, the date_first_booking is NaN for all the instances whose country destination is 'NDF'. The high scores seen in the section above can be attributed to this factor. So, in order to get a more realistic model, in this section I will build models on the preprocessed data without the date_first_booking information.
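
A quick way to check this kind of leakage is to cross-tabulate "booking date is missing" against "destination is NDF". The sketch below does so on a handful of made-up rows; the column names follow the raw Airbnb data as described above, but the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    'date_first_booking': ['2014-01-05', None, None, '2014-03-12', None, '2014-06-30'],
    'country_destination': ['US', 'NDF', 'NDF', 'FR', 'NDF', 'US'],
})

booking_missing = toy['date_first_booking'].isnull()
is_ndf = toy['country_destination'] == 'NDF'

# Rows: is the booking date missing? Columns: is the destination NDF?
table = pd.crosstab(booking_missing, is_ndf)

# Fraction of rows where the two indicators agree.
agreement = (booking_missing == is_ndf).mean()

print(table)
print('agreement: %.2f' % agreement)
```

When all the counts sit on the diagonal of the table, the missing-date indicator alone identifies the 'NDF' class, which is why the feature has to be dropped before judging a model.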

### Without setting the max-depth

In [7]:
data_nodfb = data.drop(['dfb_year', 'dfb_month', 'dfb_day'], axis=1)
data_nodfb = pd.DataFrame(preprocessing.StandardScaler().fit_transform(data_nodfb))
foldnum=0
results = pd.DataFrame()
for train, test in cross_validation.KFold(len(data_nodfb), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data_nodfb, labels, train, test)
    rf = RandomForestClassifier(n_estimators=600,criterion='gini')
    rf = rf.fit(tr_data, tr_labels.values.ravel())
    rf_predict = rf.predict_proba(te_data) 
    score = ndcg_score(te_labels.values, rf_predict, k=5)
    results.loc[foldnum, 'no max-depth, no dfb' ] = score

In [8]:
results.mean()

no max-depth, no dfb    0.798816
dtype: float64

### Varying the max-depth of the Random Forest

In [9]:
results = pd.DataFrame()
foldnum=0
for train, test in cross_validation.KFold(len(data_nodfb), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data_nodfb, labels, train, test)

    for max_depth in range(3, 8):
        rf = RandomForestClassifier(n_estimators=600,criterion='gini', max_depth=max_depth)
        rf = rf.fit(tr_data, tr_labels.values.ravel())
        rf_predict = rf.predict_proba(te_data) 
        score = ndcg_score(te_labels.values, rf_predict, k=5)
        print('Fold : {}, Max-depth : {}, Score : {}'.format( foldnum, max_depth, score ))
        results.loc[foldnum, 'max_depth=%d'%max_depth ] = score

Fold : 1, Max-depth : 3, Score : 0.806603278112
Fold : 1, Max-depth : 4, Score : 0.807644004479
Fold : 1, Max-depth : 5, Score : 0.813657558096
Fold : 1, Max-depth : 6, Score : 0.815575904147
Fold : 1, Max-depth : 7, Score : 0.817798973216
Fold : 2, Max-depth : 3, Score : 0.809882411297
Fold : 2, Max-depth : 4, Score : 0.810416757163
Fold : 2, Max-depth : 5, Score : 0.815790087715
Fold : 2, Max-depth : 6, Score : 0.818261826421
Fold : 2, Max-depth : 7, Score : 0.819612470421
Fold : 3, Max-depth : 3, Score : 0.80805296249
Fold : 3, Max-depth : 4, Score : 0.809195815682
Fold : 3, Max-depth : 5, Score : 0.815075575493
Fold : 3, Max-depth : 6, Score : 0.81708537894
Fold : 3, Max-depth : 7, Score : 0.818048298655
Fold : 4, Max-depth : 3, Score : 0.807295250859
Fold : 4, Max-depth : 4, Score : 0.808037918371
Fold : 4, Max-depth : 5, Score : 0.814331568223
Fold : 4, Max-depth : 6, Score : 0.815774110584
Fold : 4, Max-depth : 7, Score : 0.817337687929
Fold : 5, Max-depth : 3, Score : 0.8079010

In [10]:
results.mean()

max_depth=3    0.807206
max_depth=4    0.808191
max_depth=5    0.814303
max_depth=6    0.816443
max_depth=7    0.818067
dtype: float64

As in Part 1, the mean nDCG score increases with the max-depth, from 0.8072 at depth 3 to 0.8181 at depth 7.
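
The cells in this notebook rely on `sklearn.cross_validation`, which was removed in scikit-learn 0.20. As a rough sketch, the same depth sweep could be written against `sklearn.model_selection.KFold` as shown below. The synthetic data, the small forest size, the reduced fold count and the `ndcg_at_k` helper are all assumptions chosen so the example runs quickly on its own; they are not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def ndcg_at_k(y_true, proba, classes, k=5):
    """Mean nDCG@k when each sample has exactly one correct class (hypothetical helper)."""
    # Labels of the k most probable classes per row, best first.
    top_k = classes[np.argsort(proba, axis=1)[:, ::-1][:, :k]]
    hit = top_k == np.asarray(y_true)[:, None]
    discounts = 1.0 / np.log2(np.arange(top_k.shape[1]) + 2)
    return float((hit * discounts).sum(axis=1).mean())

# Synthetic stand-in for the preprocessed features (assumption, not the Airbnb data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

rows = []
kf = KFold(n_splits=3, shuffle=True, random_state=20160217)
for foldnum, (train, test) in enumerate(kf.split(X), start=1):
    for max_depth in range(3, 6):
        rf = RandomForestClassifier(n_estimators=20, max_depth=max_depth,
                                    random_state=0)
        rf.fit(X[train], y[train])
        proba = rf.predict_proba(X[test])
        score = ndcg_at_k(y[test], proba, rf.classes_, k=3)
        rows.append({'fold': foldnum, 'max_depth': max_depth, 'ndcg': score})

results = pd.DataFrame(rows)
summary = results.groupby('max_depth')['ndcg'].mean()
print(summary)
```

Collecting one row per (fold, depth) pair and grouping at the end gives the same per-depth means as filling a table cell by cell, and it also keeps the per-fold values available for later comparison.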

### Varying min_samples_leaf

In [11]:
results = pd.DataFrame()
foldnum=0
for train, test in cross_validation.KFold(len(data_nodfb), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data_nodfb, labels, train, test)

    for min_sample_leafs in [5, 20, 50]:
        rf = RandomForestClassifier(n_estimators=600,criterion='gini', min_samples_leaf=min_sample_leafs)
        rf = rf.fit(tr_data, tr_labels.values.ravel())
        rf_predict = rf.predict_proba(te_data) 
        score = ndcg_score(te_labels.values, rf_predict, k=5)
        print('Fold : {}, Min_S_Leafs : {}, Score : {}'.format( foldnum, min_sample_leafs, score ))
        results.loc[foldnum, 'min_s_leafs=%d'%min_sample_leafs ] = score

Fold : 1, Min_S_Leafs : 5, Score : 0.820407923331
Fold : 1, Min_S_Leafs : 20, Score : 0.82328392834
Fold : 1, Min_S_Leafs : 50, Score : 0.823381799044
Fold : 2, Min_S_Leafs : 5, Score : 0.820952535189
Fold : 2, Min_S_Leafs : 20, Score : 0.82424557707
Fold : 2, Min_S_Leafs : 50, Score : 0.824723487734
Fold : 3, Min_S_Leafs : 5, Score : 0.820148906935
Fold : 3, Min_S_Leafs : 20, Score : 0.82330198963
Fold : 3, Min_S_Leafs : 50, Score : 0.823316477575
Fold : 4, Min_S_Leafs : 5, Score : 0.820739348248
Fold : 4, Min_S_Leafs : 20, Score : 0.823767843772
Fold : 4, Min_S_Leafs : 50, Score : 0.823351386668
Fold : 5, Min_S_Leafs : 5, Score : 0.821490008094
Fold : 5, Min_S_Leafs : 20, Score : 0.82444853059
Fold : 5, Min_S_Leafs : 50, Score : 0.824486743072
Fold : 6, Min_S_Leafs : 5, Score : 0.819904565882
Fold : 6, Min_S_Leafs : 20, Score : 0.821756134553
Fold : 6, Min_S_Leafs : 50, Score : 0.82249002024
Fold : 7, Min_S_Leafs : 5, Score : 0.821301910698
Fold : 7, Min_S_Leafs : 20, Score : 0.82464

In [12]:
results.mean()

min_s_leafs=5     0.820377
min_s_leafs=20    0.823197
min_s_leafs=50    0.823331
dtype: float64

min_samples_leaf values of 20 and 50 give almost the same mean nDCG score (0.8232 and 0.8233), and both improve on min_samples_leaf = 5 (0.8204).
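
One way to judge whether such small gaps matter is to compare the settings fold by fold, since every setting was scored on the same splits. The sketch below uses only the six complete folds from the printed output above (the log is cut off in fold 7); `paired_summary` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Per-fold nDCG copied from the printed output above (folds 1 to 6 only).
leaf_5 = np.array([0.820407923331, 0.820952535189, 0.820148906935,
                   0.820739348248, 0.821490008094, 0.819904565882])
leaf_20 = np.array([0.82328392834, 0.82424557707, 0.82330198963,
                    0.823767843772, 0.82444853059, 0.821756134553])
leaf_50 = np.array([0.823381799044, 0.824723487734, 0.823316477575,
                    0.823351386668, 0.824486743072, 0.82249002024])

def paired_summary(a, b):
    """Mean and sample std of the per-fold differences b - a, and the number of folds where b wins."""
    diff = b - a
    return diff.mean(), diff.std(ddof=1), int((diff > 0).sum())

gain_20_vs_5 = paired_summary(leaf_5, leaf_20)
gain_50_vs_20 = paired_summary(leaf_20, leaf_50)

print('20 vs 5 : mean diff %.5f, std %.5f, folds improved %d/6' % gain_20_vs_5)
print('50 vs 20: mean diff %.5f, std %.5f, folds improved %d/6' % gain_50_vs_20)
```

On these six folds, moving from 5 to 20 helps in every fold, while the gain from 20 to 50 is smaller than its fold-to-fold spread and reverses in one fold, which fits the reading that the two larger settings are practically equivalent.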

### Varying max_leaf_nodes

In [13]:
results = pd.DataFrame()
foldnum=0
for train, test in cross_validation.KFold(len(data_nodfb), shuffle=True, n_folds=10,
                                           random_state=20160217):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(data_nodfb, labels, train, test)

    for max_leaf_nodes in range(5, 12):
        rf = RandomForestClassifier(n_estimators=600, criterion='gini', max_depth=7, 
                                    max_leaf_nodes=max_leaf_nodes )
        rf = rf.fit(tr_data, tr_labels.values.ravel())
        rf_predict = rf.predict_proba(te_data) 
        score = ndcg_score(te_labels.values, rf_predict, k=5)
        print('Fold : {}, max_leaf_nodes : {}, Score : {}'.format( foldnum, max_leaf_nodes, score ))
        results.loc[foldnum, 'max_leaf_nodes=%d'%max_leaf_nodes ] = score

Fold : 1, max_leaf_nodes : 5, Score : 0.806015421447
Fold : 1, max_leaf_nodes : 6, Score : 0.806015421447
Fold : 1, max_leaf_nodes : 7, Score : 0.806015421447
Fold : 1, max_leaf_nodes : 8, Score : 0.806568698308
Fold : 1, max_leaf_nodes : 9, Score : 0.806707017524
Fold : 1, max_leaf_nodes : 10, Score : 0.80712197517
Fold : 1, max_leaf_nodes : 11, Score : 0.807191134777
Fold : 2, max_leaf_nodes : 5, Score : 0.809520139448
Fold : 2, max_leaf_nodes : 6, Score : 0.809520139448
Fold : 2, max_leaf_nodes : 7, Score : 0.809726794889
Fold : 2, max_leaf_nodes : 8, Score : 0.809916992721
Fold : 2, max_leaf_nodes : 9, Score : 0.810194477214
Fold : 2, max_leaf_nodes : 10, Score : 0.810418423367
Fold : 2, max_leaf_nodes : 11, Score : 0.810971726149
Fold : 3, max_leaf_nodes : 5, Score : 0.807707148251
Fold : 3, max_leaf_nodes : 6, Score : 0.807707148251
Fold : 3, max_leaf_nodes : 7, Score : 0.807707148251
Fold : 3, max_leaf_nodes : 8, Score : 0.808070253202
Fold : 3, max_leaf_nodes : 9, Score : 0.808

In [14]:
results.mean()

max_leaf_nodes=5     0.806765
max_leaf_nodes=6     0.806765
max_leaf_nodes=7     0.806886
max_leaf_nodes=8     0.807199
max_leaf_nodes=9     0.807457
max_leaf_nodes=10    0.807744
max_leaf_nodes=11    0.807945
dtype: float64

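Varying max_leaf_nodes between 5 and 11 gave lower nDCG scores (0.8068 to 0.8079) than the max_depth and min_samples_leaf settings tried above. A depth-7 binary tree can have up to 2^7 = 128 leaves, so a cap of at most 11 leaves restricts each tree far more than max_depth = 7 alone, which scored 0.8181.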
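
How much a leaf cap shrinks a tree can be checked directly on a single decision tree. The following is a minimal sketch on synthetic data (an assumption, not the Airbnb features), using `get_n_leaves()` and `get_depth()`, which are available in scikit-learn 0.21 and later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data used purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Same depth limit, with and without a cap on the number of leaves.
capped = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=11,
                                random_state=0).fit(X, y)
uncapped = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X, y)

print('capped   : %d leaves, depth %d' % (capped.get_n_leaves(), capped.get_depth()))
print('uncapped : %d leaves, depth %d' % (uncapped.get_n_leaves(), uncapped.get_depth()))
```

The capped tree can never exceed 11 leaves, whereas the depth limit alone allows up to 128, so each tree in the capped forest can separate far fewer groups of users.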

### AdaBoost Classifier on Random Forest

In [15]:
from sklearn.ensemble import AdaBoostClassifier

[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(data_nodfb, labels, random_state=20160202, test_size=0.33)

rf = AdaBoostClassifier(RandomForestClassifier(n_estimators=600, criterion='gini', min_samples_leaf=50),
                        random_state=20160202)
rf = rf.fit(tr_data, tr_labels.values.ravel())
rf_predict = rf.predict_proba(te_data)
score = ndcg_score(te_labels.values, rf_predict, k=5)
print(score)

0.810542490776


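I was hoping that applying AdaBoost on top of the best Random Forest would improve the nDCG score. But as can be seen above, the score decreased to 0.8105. Note that this run uses a single 67/33 train/test split rather than the 10-fold cross-validation used in the earlier cells, so the two numbers are not strictly comparable.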

The best nDCG score among all the models trained without 'date_first_booking' is 0.823331, which was obtained with min_samples_leaf = 50.
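
Since nDCG@5 scores a ranked list of five destinations per user, the class probabilities from the chosen model still have to be turned into such a list. The sketch below shows one way to do that, assuming synthetic data and a much smaller forest than the 600 trees used above so that it runs quickly; only `min_samples_leaf=50` is taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic six-class data standing in for the preprocessed features (assumption).
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=6, random_state=0)
X_train, y_train, X_new = X[:300], y[:300], X[300:]

rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=50, random_state=0)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_new)

# Column indices sorted by descending probability, mapped back to class labels.
top5 = rf.classes_[np.argsort(proba, axis=1)[:, ::-1][:, :5]]

print(top5[:3])
```

Each row of `top5` lists five distinct class labels from most to least probable, which is the ordering that the nDCG computation evaluates.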