In this notebook, I apply a random forest (with parameter class_weight = 'balanced'). Let's see whether this gives a 'balanced' classification result, meaning that the percentages of correctly recognized NDF users and non-NDF users are more or less the same.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
#from xgboost.sklearn import XGBClassifier

np.random.seed(0)

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import sklearn
import sklearn.grid_search
#from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import KFold

import math
from sklearn.metrics import confusion_matrix
from NDCG_score_func import ndcg_score

Load X, y and X_test.

In [2]:
X= np.load('X.npy')
y= np.load('y.npy')
X_test= np.load('X_test.npy')

To build a confusion matrix, we need training data, test data, the true labels and the predicted labels.

Let's first split the original training data into a training set and a test set. Note that this split overwrites the X_test loaded above.

In [3]:
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X,y, test_size=0.25, random_state=42)
print X_train.shape
print X_test.shape
print y_train.shape
print y_test.shape

(160088L, 226L)
(53363L, 226L)
(160088L,)
(53363L,)


In [4]:
y_test.astype(int)

array([10,  7, 11, ...,  7, 10,  7])

In [5]:
print np.bincount(y_test.astype(int))

[  141   388   280   525  1245   613   687 31039   214    52 15635  2544]


### Random Forest

Fit a random forest classifier on the training set and evaluate its performance on the test set.

In [6]:
# other settings noted by the author: n_jobs=2, max_features=12
clf = RandomForestClassifier(max_features=40, n_estimators=120, n_jobs=1,
                             min_samples_split=5, class_weight='balanced')
clf.fit(X_train,y_train)
y_test_pred = clf.predict(X_test)

In [7]:
#get the labels
Labels = ['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']
confusion= pd.DataFrame(data = confusion_matrix(y_test, y_test_pred),index = Labels, columns=Labels )

In [8]:
np.bincount(y_test.astype(int)).astype(np.float64)

array([   141.,    388.,    280.,    525.,   1245.,    613.,    687.,
        31039.,    214.,     52.,  15635.,   2544.])

In [9]:
len(y_test)

53363

In [10]:
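# Note: this expression scales the class counts rather than inverting them, so the
# ratios printed below are class frequencies relative to NDF, not reciprocal frequencies.
recip_freq = len(y_test) / 12 *np.bincount(y_test.astype(int)).astype(np.float64)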
print recip_freq/recip_freq[7]

[ 0.00454267  0.0125004   0.00902091  0.0169142   0.04011083  0.01974935
  0.02213345  1.          0.00689455  0.00167531  0.50372113  0.0819614 ]
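For reference, scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below applies that formula to the test-set class counts shown earlier. In the fitted model the weights are derived from y_train, so these numbers are only illustrative.

```python
import numpy as np

# Class counts of y_test as printed earlier (order: AU, CA, ..., US, other).
counts = np.array([141, 388, 280, 525, 1245, 613, 687, 31039, 214, 52, 15635, 2544],
                  dtype=np.float64)

n_samples = counts.sum()
n_classes = len(counts)

# 'balanced' heuristic: rare classes get large weights, frequent ones small weights.
balanced_weights = n_samples / (n_classes * counts)

# Each class then carries the same total weight, n_samples / n_classes.
weighted_mass = balanced_weights * counts

# Weights relative to NDF (index 7).
relative_to_ndf = balanced_weights / balanced_weights[7]
print(relative_to_ndf)
```

With these weights, a class as rare as PT counts far more per sample than NDF, which is what the 'balanced' option relies on to offset the skewed class distribution.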


__Confusion Matrix__

In the following confusion matrix, the row index represents the true booking class, and the column index represents the predicted booking class.

In [11]:
print confusion

       AU  CA  DE   ES   FR   GB   IT    NDF  NL  PT    US  other
AU      0   0   0    0    1    0    0     50   1   0    87      2
CA      0   0   1    4    5    0    2    137   0   0   230      9
DE      0   0   0    0    1    0    0    116   0   0   157      6
ES      0   0   1    1    6    0    2    210   1   0   297      7
FR      0   2   2    2    8   10   10    478   1   0   717     15
GB      0   2   2    3    3    5    1    228   1   1   359      8
IT      1   1   1    2   10    2    3    289   1   0   358     19
NDF    21  60  52  110  272  150  106  23051  39   3  6719    456
NL      0   0   1    2    6    1    0     75   0   0   120      9
PT      0   0   0    0    0    0    0     18   0   0    33      1
US     11  21  28   48  114   42   65   5900  19   5  9145    237
other   0   8   3   12   17    7    8   1056   1   1  1385     46


In [12]:
total = confusion.sum(axis =1) #the total number of users for each class
NDF = total[7]
non_NDF = total.sum()-NDF
print 'There are %d non-NDF users and %d NDF users in this synthetic test set.'%(non_NDF, NDF)

There are 22324 non-NDF users and 31039 NDF users in this synthetic test set.


In [13]:
recognized = pd.Series(np.diagonal(confusion), index = Labels)
rec_perc = recognized/total *100
rec_perc

AU        0.000000
CA        0.000000
DE        0.000000
ES        0.190476
FR        0.642570
GB        0.815661
IT        0.436681
NDF      74.264635
NL        0.000000
PT        0.000000
US       58.490566
other     1.808176
dtype: float64
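The ratio above (diagonal count divided by row total) is the per-class recall. Here is a minimal sketch of the same computation on made-up labels, using plain numpy instead of the notebook's data; `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np

# Hypothetical toy labels for three classes (0, 1, 2); not the notebook's data.
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 1])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the true class, columns the predicted class.
np.add.at(cm, (y_true_toy, y_pred_toy), 1)

# Per-class recall: correctly predicted samples divided by samples of that class.
row_totals = cm.sum(axis=1)
recall_pct = 100.0 * np.diagonal(cm) / row_totals
print(cm)
print(recall_pct)
```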

__Binarized reading__: To make the results clearer, we build a table of recognized NDF and recognized non-NDF users. Notice that a user who actually booked Italy but is predicted as booking France is counted as NOT recognized.

In [14]:
reco_bina = pd.Series(data = [recognized.sum()-recognized['NDF'], recognized['NDF']],index = ['Non-NDF','NDF'])
tot_bina = pd.Series(data = [total.sum()-total['NDF'], total['NDF']],index = ['Non-NDF','NDF'])
rec_perc_bina = reco_bina/tot_bina *100
# rec_perc_bina

In [15]:
print 'The percentage of recognized non-NDF users is %d'%rec_perc_bina[0]+'%,'+ ' while the percentage of recognized NDF users is %d'\
                                                          %rec_perc_bina[1]+'%.'

The percentage of recognized non-NDF users is 41%, while the percentage of recognized NDF users is 74%.


The expected balanced result is not obtained. This is partly because a user who actually booked Italy but is predicted as booking France is counted as NOT recognized. To remove this effect, let's apply the 'balanced' random forest to the whole data set with binarized labels.
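A quick complementary check is to collapse the 12-class confusion matrix printed above into a 2x2 NDF versus non-NDF table, so that confusing one destination with another no longer counts as an error. This is only a sketch of that idea, not the binarized model itself; the matrix values are copied from the output above.

```python
import numpy as np

labels = ['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']

# Confusion matrix copied from the printed output (rows: true, columns: predicted).
cm = np.array([
    [ 0,  0,  0,   0,   1,   0,   0,    50,  1, 0,   87,   2],
    [ 0,  0,  1,   4,   5,   0,   2,   137,  0, 0,  230,   9],
    [ 0,  0,  0,   0,   1,   0,   0,   116,  0, 0,  157,   6],
    [ 0,  0,  1,   1,   6,   0,   2,   210,  1, 0,  297,   7],
    [ 0,  2,  2,   2,   8,  10,  10,   478,  1, 0,  717,  15],
    [ 0,  2,  2,   3,   3,   5,   1,   228,  1, 1,  359,   8],
    [ 1,  1,  1,   2,  10,   2,   3,   289,  1, 0,  358,  19],
    [21, 60, 52, 110, 272, 150, 106, 23051, 39, 3, 6719, 456],
    [ 0,  0,  1,   2,   6,   1,   0,    75,  0, 0,  120,   9],
    [ 0,  0,  0,   0,   0,   0,   0,    18,  0, 0,   33,   1],
    [11, 21, 28,  48, 114,  42,  65,  5900, 19, 5, 9145, 237],
    [ 0,  8,  3,  12,  17,   7,   8,  1056,  1, 1, 1385,  46],
])

# Boolean mask that is True only at the NDF position.
is_ndf = np.arange(len(labels)) == labels.index('NDF')

# 2x2 table with order [non-NDF, NDF] on both axes.
binary = np.array([
    [cm[~is_ndf][:, ~is_ndf].sum(), cm[~is_ndf][:, is_ndf].sum()],
    [cm[is_ndf][:, ~is_ndf].sum(),  cm[is_ndf][:, is_ndf].sum()],
])

# Share of each group assigned to the correct side of the NDF / non-NDF split.
binary_recall_pct = 100.0 * np.diagonal(binary) / binary.sum(axis=1)
print(binary)
print(binary_recall_pct)
```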

__Accuracy__ and __NDCG__ (just for the record)

In [16]:
sklearn.metrics.accuracy_score(y_test, y_test_pred)

0.60451998575792221

In [17]:
ground_truth = y_test
predictions = clf.predict_proba(X_test)
ndcg_score(ground_truth, predictions, k=5)

0.80414514213477373
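The `ndcg_score` helper comes from a local module that is not shown here. As a rough sketch of what NDCG@k measures when each user has exactly one relevant class, assume the usual definition: gain 1 for the true class and discount 1/log2(rank + 1). The name `ndcg_at_k_single_label` is hypothetical, and the helper's actual implementation may differ.

```python
import numpy as np

def ndcg_at_k_single_label(y_true, proba, k=5):
    """Mean NDCG@k when each sample has exactly one relevant class.

    y_true: integer class indices, shape (n_samples,)
    proba:  predicted scores, shape (n_samples, n_classes)
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    # Indices of the k highest-scoring classes per sample, best first.
    top_k = np.argsort(-proba, axis=1)[:, :k]
    # hits[i, j] is True where the true class of sample i sits at 0-based rank j.
    hits = top_k == y_true[:, None]
    # Discount for 1-based rank r is 1 / log2(r + 1).
    discounts = 1.0 / np.log2(np.arange(2, top_k.shape[1] + 2))
    # The ideal DCG is 1 (true class ranked first), so DCG equals NDCG here.
    return float((hits * discounts).sum(axis=1).mean())

# Hypothetical toy scores for three samples and three classes.
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3],
                      [0.5, 0.3, 0.2]])
toy_true = np.array([0, 2, 2])
toy_score = ndcg_at_k_single_label(toy_true, toy_proba, k=2)
print(toy_score)
```

Under this definition a sample scores 1 when the true class is ranked first, less when it appears lower in the top k, and 0 when it is missing from the top k, so the metric rewards a good ranking even when the top prediction is wrong.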