# Finding the right classifiers for direction types

In [1]:
import pandas as pd
import numpy as np

# custom scripts
from data_preparation import get_X_y_type
from model_fitting import *

# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib

# handling warnings
import warnings
warnings.simplefilter("ignore")

## 1 &emsp; Pipeline description

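The overall pipeline for each type is:

- load the train, validation, and test datasets for the type,

- train each candidate model on the training set, tuning hyperparameters with cross-validation,

- validate it on the smaller validation sample,

- score it on the test set,

- create a dataframe that summarizes the scores of all models,

- find the best model and save it.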
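
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `model_fitting.py`: the synthetic data, the two candidate models, their parameter grids, and the default accuracy scoring are all assumptions made for the example, and saving the selected model is omitted.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one direction type (the real data comes from get_X_y_type).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Hypothetical candidates and grids; the real ones are defined elsewhere.
candidates = {
    "LogReg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 4, 8]}),
}

scores = {"model": [], "cross-val": [], "validation": [], "test": []}
fitted = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search on the training set only.
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_
    scores["model"].append(name)
    scores["cross-val"].append(search.best_score_)
    # score() uses the refitted best estimator.
    scores["validation"].append(search.score(X_valid, y_valid))
    scores["test"].append(search.score(X_test, y_test))

# One row per model, one column per evaluation stage.
summary = pd.DataFrame.from_dict(scores)
best_name = summary.loc[summary["test"].idxmax(), "model"]
best_model = fitted[best_name]
print(summary)
```

Keeping the scores in a dict of lists means the summary dataframe can be built with a single `pd.DataFrame.from_dict` call, which matches how the per-type dictionaries are used later in this notebook.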

This applies to the types _setting_, _business_, _delivery_, and _location_.

The types _entrance_ and _exit_ are fitted in another notebook because they have to be compared with the semantic rule-based model.

__NB:__ the `joblib.dump(...);` calls end with a semicolon, which is un-pythonic. It is there on purpose: in a notebook, the trailing semicolon keeps the cell from displaying the function's return value (the path to the saved model).
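
As a small illustration of what the semicolon hides, the sketch below calls `joblib.dump` outside a notebook and inspects its return value. The placeholder object and the temporary file name are assumptions for the example; a fitted estimator would be saved the same way.

```python
import os
import tempfile

import joblib

# Placeholder object standing in for a fitted model.
model = {"name": "placeholder"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_final.pkl")
    # dump() returns the list of file names it wrote; a notebook would
    # echo this list if the call were the last expression in a cell.
    written = joblib.dump(model, path)
    restored = joblib.load(path)
```

Assigning the return value to a variable, as done here, suppresses the echoed output just as the trailing semicolon does.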


### Index

[Business](#business)

[Delivery](#delivery)

[Location](#location)

[Setting](#setting)

## 2 &emsp; Running models

#### <div id="business">2.1 &emsp; Business</div>

Loading data:

In [2]:
X_train, y_train = get_X_y_type("business", "train")
X_valid, y_valid = get_X_y_type("business", "val")
X_test, y_test = get_X_y_type("business", "test")

Fitting models:

In [3]:
business_dict, fitted_models = models_for_type(X_train, y_train, 
                                              X_valid, y_valid, 
                                              X_test, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   31.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  8.0min finished


Model LogReg scored 0.891841 on cross-validation with params:
{'C': 0.1}
Model LogReg scored 0.906188 on validation set
Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 990 out of 990 | elapsed:  6.3min finished


Model Decision Tree scored 0.869313 on cross-validation with params:
{'criterion': 'gini', 'max_depth': 8}
Model Decision Tree scored 0.922636 on validation set
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   57.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  6.1min finished


Model Random Forest scored 0.900220 on cross-validation with params:
{'n_estimators': 46}
Model Random Forest scored 0.960000 on validation set


Choosing the best model based on test-set performance:

In [39]:
business_df = pd.DataFrame.from_dict(business_dict)
business_df

|   | model | cross-val | validation | test |
|---|---|---|---|---|
| 0 | LogReg | 0.891841 | 0.906188 | 0.884453 |
| 1 | Decision Tree | 0.869313 | 0.922636 | 0.868333 |
| 2 | Random Forest | 0.90022 | 0.96 | 0.905702 |
| 3 | SVC | 0.526805 | 1.0 | 0.575758 |


In [40]:
best_model_name = business_df.iloc[business_df["test"].argmax()]["model"]
print(best_model_name)

Random Forest
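
The selection step can also be written with `idxmax`, which returns the index label of the highest score and so does not rely on positions and labels coinciding. The sketch below rebuilds only the `model` and `test` columns from the business table above; it is an alternative formulation, not the code that produced the result.

```python
import pandas as pd

# Test scores copied from the business table in this notebook.
business_scores = {
    "model": ["LogReg", "Decision Tree", "Random Forest", "SVC"],
    "test": [0.884453, 0.868333, 0.905702, 0.575758],
}
business_df = pd.DataFrame.from_dict(business_scores)

# idxmax() gives the row label of the best test score; .loc looks it up by label.
best_model_name = business_df.loc[business_df["test"].idxmax(), "model"]
print(best_model_name)  # Random Forest
```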


In [6]:
best_model = fitted_models[best_model_name]
joblib.dump(best_model, "./data/models/business_final.pkl");

#### <div id="delivery">2.2 &emsp; Delivery</div>

Loading data:

In [53]:
X_train, y_train = get_X_y_type("delivery", "train")
X_valid, y_valid = get_X_y_type("delivery", "val")
X_test, y_test = get_X_y_type("delivery", "test")

Fitting models:

In [8]:
delivery_dict, fitted_models = models_for_type(X_train, y_train, 
                                              X_valid, y_valid, 
                                              X_test, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  9.6min finished


Model LogReg scored 0.732372 on cross-validation with params:
{'C': 0.1}
Model LogReg scored 0.775000 on validation set
Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 990 out of 990 | elapsed:  6.9min finished


Model Decision Tree scored 0.693793 on cross-validation with params:
{'criterion': 'gini', 'max_depth': 13}
Model Decision Tree scored 0.830000 on validation set
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  6.4min finished


Model Random Forest scored 0.707697 on cross-validation with params:
{'n_estimators': 75}
Model Random Forest scored 0.886256 on validation set


Choosing the best model based on test-set performance:

In [59]:
delivery_df = pd.DataFrame.from_dict(delivery_dict)
delivery_df

|   | model | cross-val | validation | test |
|---|---|---|---|---|
| 0 | LogReg | 0.732372 | 0.775 | 0.721578 |
| 1 | Decision Tree | 0.693793 | 0.83 | 0.711785 |
| 2 | Random Forest | 0.707697 | 0.886256 | 0.732673 |
| 3 | SVC | 0.721186 | 0.874419 | 0.720742 |


In [60]:
best_model_name = delivery_df.iloc[delivery_df["test"].argmax()]["model"]
print(best_model_name)

Random Forest


In [61]:
best_model = fitted_models[best_model_name]
joblib.dump(best_model, "./data/models/delivery_final.pkl");

#### <div id="location">2.3 &emsp; Location</div>

Loading data:

In [62]:
X_train, y_train = get_X_y_type("location", "train")
X_valid, y_valid = get_X_y_type("location", "val")
X_test, y_test = get_X_y_type("location", "test")

Fitting models:

In [13]:
location_dict, fitted_models = models_for_type(X_train, y_train, 
                                              X_valid, y_valid, 
                                              X_test, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 13.2min finished


Model LogReg scored 0.342365 on cross-validation with params:
{'C': 1.1090909090909091}
Model LogReg scored 0.470588 on validation set
Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 990 out of 990 | elapsed:  6.2min finished


Model Decision Tree scored 0.316568 on cross-validation with params:
{'criterion': 'gini', 'max_depth': 12}
Model Decision Tree scored 0.861538 on validation set
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 225 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 475 tasks      | elapsed:  4.6min


Model Random Forest scored 0.352769 on cross-validation with params:
{'n_estimators': 5}
Model Random Forest scored 0.825397 on validation set


[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  5.0min finished


Choosing the best model based on test-set performance:

In [65]:
location_df = pd.DataFrame.from_dict(location_dict)
location_df

|   | model | cross-val | validation | test |
|---|---|---|---|---|
| 0 | LogReg | 0.342365 | 0.470588 | 0.272727 |
| 1 | Decision Tree | 0.316568 | 0.861538 | 0.231884 |
| 2 | Random Forest | 0.352769 | 0.825397 | 0.232558 |
| 3 | SVC | 0.372803 | 0.911765 | 0.25 |


In [66]:
best_model_name = location_df.iloc[location_df["test"].argmax()]["model"]
print(best_model_name)

LogReg


In [16]:
best_model = fitted_models[best_model_name]
joblib.dump(best_model, "./data/models/location_final.pkl");

#### <div id="setting">2.4 &emsp; Setting</div>

Loading data:

In [67]:
X_train, y_train = get_X_y_type("setting", "train")
X_valid, y_valid = get_X_y_type("setting", "val")
X_test, y_test = get_X_y_type("setting", "test")

Fitting models:

In [18]:
setting_dict, fitted_models = models_for_type(X_train, y_train, 
                                              X_valid, y_valid, 
                                              X_test, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  3.3min finished


Model LogReg scored 0.536318 on cross-validation with params:
{'C': 1.1090909090909091}
Model LogReg scored 1.000000 on validation set
Fitting 5 folds for each of 198 candidates, totalling 990 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 990 out of 990 | elapsed:  5.5min finished


Model Decision Tree scored 0.472220 on cross-validation with params:
{'criterion': 'gini', 'max_depth': 10}
Model Decision Tree scored 0.750000 on validation set
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  73 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 223 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 473 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  4.9min finished


Model Random Forest scored 0.291410 on cross-validation with params:
{'n_estimators': 23}
Model Random Forest scored 1.000000 on validation set


Choosing the best model based on test-set performance:

In [73]:
setting_df = pd.DataFrame.from_dict(setting_dict)
setting_df

|   | model | cross-val | validation | test |
|---|---|---|---|---|
| 0 | LogReg | 0.536318 | 1.0 | 0.6 |
| 1 | Decision Tree | 0.47222 | 0.75 | 0.466667 |
| 2 | Random Forest | 0.29141 | 1.0 | 0.45 |
| 3 | SVC | 0.613119 | 1.0 | 0.642857 |


In [77]:
best_model_name = setting_df.iloc[setting_df["test"].argmax()]["model"]
best_model_name

'SVC'

In [78]:
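# The SVC is saved from `svm_classifier` instead of `fitted_models`.
# `svm_classifier` is not defined in this notebook; presumably it is
# exposed by the wildcard import from `model_fitting`.
# best_model = fitted_models[best_model_name]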
joblib.dump(svm_classifier, "./data/models/setting_final.pkl");