## Model Selection

<a id='up'></a>

Classification trees are appropriate for this problem, as they successively determine decision criteria based on subsets of the initial variables. This corresponds to an intuitive representation of the consumers, each one being associated with a cluster linked to its credit profile. We chose to use three different models: GradientBoostingRegressor, KNeighborsRegressor, and XGBRegressor.

In this chapter, we choose the best model and the best parameters to predict the number of visitors.

* 0. [Load libraries](#load-libraries)
* 1. [Data import](#data-import)
* 2. [Model evaluation - hyperparameters](#hyperparameters)
* 3. [Finding p-value](#pval)
* 4. [Imbalance correction](#imbalance-correction)
* 4.1. [RandomForestClassifier](#rfc)
* 5. [Models](#models)
* 5.1 [Fit Models](#fit-model)
* 5.2 [Predict models on train](#pmtr)
* 5.3 [RMSLE - Root Mean Squared Logarithmic Error](#rsmle)
* 6. [Predict Test](#test)
* 7. [Sub](#sub)

### <a id='load-libraries'>0. Load libraries</a>

In [23]:
# Load libraries
import numpy as np 
import pandas as pd 
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns

In [24]:
import glob, re
from sklearn import *
from datetime import datetime
from xgboost import XGBRegressor

In [25]:
np.random.seed(10)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

In [26]:
#pip install -U imbalanced-learn
# Install a pip package in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip install imbalanced-learn

In [27]:
# For section 4
#from sklearn.ensemble import RandomForestClassifier # Random forest, section 4 (Imbalance correction)
#from sklearn import utils # for section 4 in order to avoid ValueError: Unknown label type: 'continuous'
#from sklearn.metrics import accuracy_score #
#from sklearn.metrics import roc_auc_score

#from collections import Counter
#from sklearn.datasets import make_classification

#from imblearn.over_sampling import SMOTE # doctest: +NORMALIZE_WHITESPACE

In [28]:
from importlib import reload
import pyMechkar as mechkar
reload(mechkar)

<module 'pyMechkar' from 'C:\\Users\\gali\\git\\Recruit-Restaurant-Visitor-Forecasting1\\pyMechkar.py'>

In [29]:
#from collections import Counter
#from sklearn.datasets import make_classification
#from imblearn.over_sampling import SMOTE

In [30]:
# Install a pip package in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip uninstall tableone

In [31]:
#import tableone as TableOne

[Up to the header](#up)

### <a id='data-import'>1. Data import</a>

In [32]:
# Data Aggregation
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

In [33]:
print(len(train['visit_date'].unique())) # 478
print(len(train['air_store_id'].unique())) # 812
print(train['visit_date'].min()) # 2016-01-01
print(train['visit_date'].max()) #2017-04-22

478
812
2016-01-01
2017-04-22


In [34]:
print(821*39) #32019
print(len(test['visit_date'].unique())) #
print(test['visit_date'].min()) #2017-04-23
print(test['visit_date'].max()) # 2017-05-31
print(len(test['air_store_id'].unique())) #821

32019
39
2017-04-23
2017-05-31
821


In [35]:
#nt = []
#test[['air_store_id']] - train[['air_store_id']]
#train[['air_store_id']]

In [36]:
train.head(1)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,visit_date,visitors,air_store_id,latitude,longitude,month,date,dw,...,Ōsaka-fu,Hyōgo-ken,Hokkaidō,Shizuoka-ken,Fukuoka-ken,Hiroshima-ken,Niigata-ken,Miyagi-ken,reserve_visitors_air_1,air_date_diff_1
0,0,0,2016-01-13,25,air_ba937bf13d40fb24,35.658068,139.751599,1,13,2,...,0,0,0,0,0,0,0,0,,


In [37]:
test.head(1)

Unnamed: 0.1,Unnamed: 0,id,visitors,visit_date,air_store_id,dw,dy,date,month,holiday_flg,...,Saturday,Tōkyō-to,Ōsaka-fu,Hyōgo-ken,Hiroshima-ken,Fukuoka-ken,Hokkaidō,Miyagi-ken,Niigata-ken,Shizuoka-ken
0,0,air_00a91d42b08b08d9_2017-04-23,0,2017-04-23,air_00a91d42b08b08d9,6,2017,23,4,0,...,0,1,0,0,0,0,0,0,0,0


In [38]:
train = train.drop(['Unnamed: 0' , 'Unnamed: 0.1'], axis=1)
test = test.drop(['Unnamed: 0'], axis=1)

[Up to the header](#up)

### <a id='hyperparameters'>2. Model evaluation - hyperparameters</a>

In [39]:
col = [c for c in train if c not in ['id', 'air_store_id', 'visit_date','visitors']] 

###### For missing values

In [40]:
train = train.fillna(-99)
test = test.fillna(-99)

### RMSLE - Root Mean Squared Logarithmic Error

In [41]:
def RMSLE(y, pred):
    # Plain RMSE; it equals RMSLE only because y and pred are passed on the log1p scale
    return metrics.mean_squared_error(y, pred)**0.5
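
The helper above is a plain RMSE, so it yields RMSLE only when both arguments are already `np.log1p`-transformed. The following is a minimal sketch of that equivalence on toy numbers; the function names `rmsle_from_counts` and `rmse` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmsle_from_counts(y_true, y_pred):
    """RMSLE computed directly from raw, non-negative counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse(y, pred):
    """Plain RMSE; matches RMSLE when both inputs are on the log1p scale."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Toy visitor counts (illustrative values only)
actual = [10, 25, 0, 7]
predicted = [12, 20, 1, 7]

direct = rmsle_from_counts(actual, predicted)
via_log = rmse(np.log1p(actual), np.log1p(predicted))
print(direct, via_log)  # the two values agree up to floating-point error
```

Passing raw counts to the plain RMSE helper would instead report an error in visitor units, which is not comparable with the RMSLE figures printed later.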

In [42]:
X=train[col]
X_train = train[train.visit_date<'2017-03-01'][col]
X_test = train[train.visit_date>='2017-03-01'][col]  # >= so rows dated 2017-03-01 are not dropped


y_train = np.log1p(train[train.visit_date<'2017-03-01']['visitors'].values)
y_test = np.log1p(train[train.visit_date>='2017-03-01']['visitors'].values)

X_dev = X_train.copy()
y_dev = y_train.copy()
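
A date-cutoff split like the one above can be sketched on a toy frame. The example assumes `visit_date` is an ISO-8601 string column (as `pd.read_csv` produces without `parse_dates`); the frame `toy` and its values are hypothetical. It shows how the choice of comparison operators decides whether rows dated exactly on the cutoff are kept.

```python
import pandas as pd

# Toy data (hypothetical values); visit_date is an ISO-8601 string
toy = pd.DataFrame({
    "visit_date": ["2017-02-27", "2017-02-28", "2017-03-01", "2017-03-02"],
    "visitors": [12, 30, 8, 21],
})
cutoff = "2017-03-01"

# ISO dates compare correctly as strings because they sort lexicographically
before = toy[toy["visit_date"] < cutoff]
strictly_after = toy[toy["visit_date"] > cutoff]
on_or_after = toy[toy["visit_date"] >= cutoff]

# '<' paired with '>' drops the cutoff day; '<' paired with '>=' keeps every row
print(len(before) + len(strictly_after), "of", len(toy), "rows kept with < and >")
print(len(before) + len(on_or_after), "of", len(toy), "rows kept with < and >=")
```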

In [21]:
X_test['visitors'] = y_test
X_dev['visitors'] = y_dev
X_train['visitors'] = y_train

In [43]:
df1 = X_train.copy()
df1['outcome'] = y_train
df1['dataset'] = 'train'

df2 = X_dev.copy()
df2['outcome'] = y_dev
df2['dataset'] = 'dev'

df3 = X_test.copy()
df3['outcome'] = y_test
df3['dataset'] = 'test'

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df1 = pd.concat([df1, df2, df3], ignore_index=True)

vn = df1.columns.tolist()
tab1 = mechkar.pyMechkar().Table1(x=vn,y="dataset",data=df1)
#tt = mechkar.pyMechkar().TABLE1(x=vn,y="dataset",data=df1).getTable1()

Factorizing... please wait
[********************************************************
[]
*********************************************************
*********************************************************
*********************************************************
['Unable to calcualte the Fisher exact test for variables sat/sun/hol and dataset... The p-value may be incorrect']
------ Finished in 83.97333002090454econds -----


AttributeError: 'DataFrame' object has no attribute 'getTable1'

In [20]:
X1_train['visitors'] = y1_train

[Up to the header](#up)

### <a id='pval'>3. Finding p-value</a>

In [21]:
#cat = ['Dining bar', 'Izakaya', 'Other', 'Italian/French','Cafe/Sweets',\
#      'Japanese food', 'Bar/Cocktail', 'Creative cuisine','Western food',\
#     'Yakiniku/Korean food', 'Asian','International cuisine','Okonomiyaki/Monja/Teppanyaki',\
#    'Karaoke/Party','Wednesday', 'Thursday', 'Friday', 'Saturday','Monday',\
#     'Tuesday', 'Sunday','Tōkyō-to','Ōsaka-fu','Hyōgo-ken','Hokkaidō',\
#       'Shizuoka-ken', 'Fukuoka-ken', 'Hiroshima-ken','Niigata-ken', 'Miyagi-ken',\
#       'sunday','saturday','holiday_flg']

# non-normal variables
#nonnormal = ['visitors']


#TableOne(data, columns, categorical, groupby, nonnormal, label_suffix=True, pval = True)
# create tableone with the input arguments
#mytable = TableOne(data=X1_train,columns=col, categorical=cat,nonnormal=nonnormal,pval = True,label_suffix=True)
#mytable

#overall_table = TableOne(train, label_suffix=True)
#X1_train.dtypes[X1_train.dtypes=='float64']

In [13]:
#mtrain3= mechkar.pyMechkar().Table1(data=X1_train, y='visitors')
#mtrain3
#mtest = mechkar.pyMechkar().exploreData(data=X1_test, y=y1_test)

#X_test.date

[Up to the header](#up)

### <a id='imbalance-correction'>4. Imbalance correction</a>

##### Use Tree-Based Algorithms
Decision trees often perform well on imbalanced datasets,<br> 
because their hierarchical structure allows them to learn signals from both classes.<br><br>

In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.)<br>
usually outperform single decision trees.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score 
from sklearn.metrics import roc_auc_score
```

In [23]:
Xtra = X1_train.copy()
ytra = y1_train.copy()

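LabelEncoder converts string or float values into integer classes 0 .. n-1.

Passing <code>Xtra</code> and <code>ytra</code> directly to the classifier's <code>fit</code> method raises <code>ValueError: Unknown label type: 'continuous'</code>, because <code>ytra</code> is continuous. To avoid this, we first encode the labels.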
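
As a small illustration of the encoding step described above, the sketch below applies `LabelEncoder` to a hypothetical continuous target. The array `y_cont` and its values are assumptions made for the example; the notebook's own target is `ytra`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical continuous target values (e.g. log1p-transformed counts)
y_cont = np.array([1.5, 0.5, 1.5, 2.0])

enc = LabelEncoder()
y_enc = enc.fit_transform(y_cont)  # each distinct value becomes one integer class

print(enc.classes_)  # sorted distinct values
print(y_enc)         # index of each value in enc.classes_
```

Note that this treats every distinct value as its own class, so the classifier no longer sees how far apart two values are. Whether that is acceptable depends on the task; for a count target it may be more natural to stay with regressors.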

In [24]:
#lab_enc = preprocessing.LabelEncoder()
#training_scores_encoded = lab_enc.fit_transform(ytra)
#print(training_scores_encoded)
#print(utils.multiclass.type_of_target(ytra))
#print(utils.multiclass.type_of_target(ytra.astype('int')))
#print(utils.multiclass.type_of_target(training_scores_encoded))


### <a id='rfc'>4.1 RandomForestClassifier</a>

In [78]:
# Separate input features (Xtra) and target variable (ytra)
#Xtra = Xtra.drop('visitors', axis=1)


#clf = RandomForestClassifier()
#clf.fit(Xtra, training_scores_encoded)
#print("RandomForestClassifier")

# Predict on training set
#pred_y = clf.predict(Xtra)

# Is our model still predicting just one class?
#print(np.unique(pred_y))

# The accuracy 51%
#print(accuracy_score(training_scores_encoded, pred_y)) #0.5102562512817334

# The AUROC
#prob_y = clf.predict_proba(Xtra)
#prob_y = [p[1] for p in prob_y]
#print(roc_auc_score(training_scores_encoded,prob_y) )


#### <a id='smote'>4.1.1 Using SMOTE</a>

[Up to the header](#up)

### <a id='models'>5. Models</a>

##### Model GradientBoostingRegressor

In [66]:
model1 = ensemble.GradientBoostingRegressor(learning_rate=0.2, random_state=3, n_estimators=200, subsample=0.8, 
                      max_depth =10)

##### Model KNeighborsRegressor

In [67]:
model2 = neighbors.KNeighborsRegressor(n_jobs=-1, n_neighbors=4)

##### Model XGBRegressor

In [68]:
model3 = XGBRegressor(learning_rate=0.2, n_estimators=280, subsample=0.8, 
                      colsample_bytree=0.8, max_depth =12)
#random_state=3

[Up to the header](#up)

### <a id='fit-model'>5.1 Fit Models</a>

##### fit model GradientBoostingRegressor

In [69]:
model1.fit(train[col], np.log1p(train['visitors'].values))

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.2, loss='ls', max_depth=10, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=200, n_iter_no_change=None, presort='auto',
             random_state=3, subsample=0.8, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

##### fit model KNeighborsRegressor

In [70]:
model2.fit(train[col], np.log1p(train['visitors'].values))

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=4, p=2,
          weights='uniform')

##### fit model XGBRegressor

In [71]:
model3.fit(train[col], np.log1p(train['visitors'].values))

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.2, max_delta_step=0,
       max_depth=12, min_child_weight=1, missing=None, n_estimators=280,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.8)

[Up to the header](#up)

### <a id='pmtr'>5.2 Predict models on train</a>

#####  predict model GradientBoostingRegressor on train

In [72]:
preds1 = model1.predict(train[col])

#####  predict model KNeighborsRegressor on train

In [73]:
preds2 = model2.predict(train[col])

#####  predict model XGBRegressor on train

In [74]:
preds3 = model3.predict(train[col])

[Up to the header](#up)

### <a id='rsmle'>5.3 RMSLE - Root Mean Squared Logarithmic Error</a>

In [75]:
print('RMSLE GradientBoostingRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds1))

RMSLE GradientBoostingRegressor:  0.6047027150514401


In [76]:
print('RMSLE KNeighborsRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds2))

RMSLE KNeighborsRegressor:  0.6654282036972176


In [77]:
print('RMSLE XGBRegressor: ', RMSLE(np.log1p(train['visitors'].values), preds3))

RMSLE XGBRegressor:  0.5751243395584047


[Up to the header](#up)

### <a id='test'>6. Predict Test</a>

#### predict GradientBoostingRegressor

In [79]:
preds1 = model1.predict(test[col])

#### predict KNeighborsRegressor

In [80]:
preds2 = model2.predict(test[col])

#### predict XGBRegressor

In [81]:
preds3 = model3.predict(test[col])

In [82]:
test['visitors'] = 0.3*preds1+0.3*preds2+0.4*preds3
test['visitors'] = np.expm1(test['visitors']).clip(lower=0.)
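
The blend in the cell above averages the three models on the log1p scale and only then converts back to visitor counts. A minimal sketch of the same arithmetic on toy arrays follows; the prediction values `p1`, `p2`, `p3` are hypothetical, and only the weights 0.3, 0.3, 0.4 come from the notebook.

```python
import numpy as np

# Hypothetical log1p-scale predictions from three models (illustrative values)
p1 = np.array([2.30, 3.10, -0.05])
p2 = np.array([2.10, 3.00, 0.00])
p3 = np.array([2.40, 3.20, -0.02])

weights = np.array([0.3, 0.3, 0.4])  # same weights as the cell above; they sum to 1
stacked = np.vstack([p1, p2, p3])    # shape: (3 models, n rows)

blend_log = weights @ stacked        # weighted average, still on the log1p scale
visitors = np.clip(np.expm1(blend_log), 0.0, None)  # back to counts, floored at 0
print(visitors)
```

The clip matters because a slightly negative log-scale prediction maps to a negative count under `expm1`, which is not a valid number of visitors.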

[Up to the header](#up)

### <a id='sub'>7. Sub</a>

In [83]:
sub = test[['id','visitors']].copy()

In [84]:
sub.head()

Unnamed: 0,id,visitors
0,air_00a91d42b08b08d9_2017-04-23,9.921163
1,air_00a91d42b08b08d9_2017-04-24,17.659213
2,air_00a91d42b08b08d9_2017-04-25,16.76825
3,air_00a91d42b08b08d9_2017-04-26,21.974205
4,air_00a91d42b08b08d9_2017-04-27,23.721075


In [36]:
#sub.to_csv(r'C:\Users\sergey\Documents\Recruit Restaurant Visitor_2\submission.csv')

[Up to the header](#up)