# Training files
We are going to use several columns to predict `KM_Travelled`.

In [76]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from pathlib import Path

In [77]:
df = pd.read_csv("Cab_Data.csv")
df.head()

Unnamed: 0,Transaction ID,Date of Travel,Company,City,KM Travelled,Price Charged,Cost of Trip
0,10000011,42377,Pink Cab,ATLANTA GA,30.45,370.95,313.635
1,10000012,42375,Pink Cab,ATLANTA GA,28.62,358.52,334.854
2,10000013,42371,Pink Cab,ATLANTA GA,9.04,125.2,97.632
3,10000014,42376,Pink Cab,ATLANTA GA,33.17,377.4,351.602
4,10000015,42372,Pink Cab,ATLANTA GA,8.73,114.62,97.776


In [78]:
df = df[["Date of Travel", "Company", "City", "KM Travelled", "Price Charged", "Cost of Trip"]]
df.head()

Unnamed: 0,Date of Travel,Company,City,KM Travelled,Price Charged,Cost of Trip
0,42377,Pink Cab,ATLANTA GA,30.45,370.95,313.635
1,42375,Pink Cab,ATLANTA GA,28.62,358.52,334.854
2,42371,Pink Cab,ATLANTA GA,9.04,125.2,97.632
3,42376,Pink Cab,ATLANTA GA,33.17,377.4,351.602
4,42372,Pink Cab,ATLANTA GA,8.73,114.62,97.776


In [79]:
df.columns = ["Date_of_Travel", "Company", "City", "KM_Travelled", "Price_Charged", "Cost_of_Trip"]
df.head()

Unnamed: 0,Date_of_Travel,Company,City,KM_Travelled,Price_Charged,Cost_of_Trip
0,42377,Pink Cab,ATLANTA GA,30.45,370.95,313.635
1,42375,Pink Cab,ATLANTA GA,28.62,358.52,334.854
2,42371,Pink Cab,ATLANTA GA,9.04,125.2,97.632
3,42376,Pink Cab,ATLANTA GA,33.17,377.4,351.602
4,42372,Pink Cab,ATLANTA GA,8.73,114.62,97.776


There are some caveats to mention. `Date_of_Travel` is stored as a plain integer, which isn't the best representation for a date feature. However, we are going to keep it in this format, since deployment is the focus here and training quality matters less. Alternatively, we could drop the column altogether.

For categorical variables like `Company` and `City`, we are going to categorify them. We are not going to use one-hot encoding, though. Since we plan to use a Random Forest or Decision Tree, which split on thresholds, simply mapping each category to an integer code should be sufficient.

We shall leave `KM_Travelled` as it is. 

In [80]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from IPython.display import Image, display_svg, SVG
from fastai.tabular.all import *

In [81]:
path = Path(".")

What processes would we like our pipeline to go through? We want to `Categorify` our data, and `FillMissing` any NaN values. 

In [82]:
procs = [Categorify, FillMissing]

In [83]:
g = df["Date_of_Travel"]
g.min(), g.max()

(42371, 43465)
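
The range above is consistent with spreadsheet-style day serials. As a minimal sketch, assuming the column uses the Excel 1900 date system (an assumption; the data source does not state it), the serials can be converted to calendar dates and back with the standard library. The helper names `serial_to_date` and `date_to_serial` are hypothetical.

```python
from datetime import date, timedelta

# Assumption: Date_of_Travel is an Excel-style serial (1900 date system).
# For modern dates that is the number of days since 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)

def serial_to_date(serial: int) -> date:
    """Convert a day serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

def date_to_serial(d: date) -> int:
    """Convert a calendar date back to a day serial."""
    return (d - EXCEL_EPOCH).days

print(serial_to_date(42371))  # 2016-01-02, earliest value in the column
print(serial_to_date(43465))  # 2018-12-31, latest value in the column
```

Under that assumption the data would span 2016-01-02 to 2018-12-31, and `date_to_serial` gives the integer a deployed app would have to feed the model for a user-supplied date.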

Split the training and validation sets with `KFold`, since we're not worried about time series here, just predicting `KM_Travelled`.

In [84]:
from sklearn.model_selection import KFold
X = df.Date_of_Travel.to_numpy()  # placeholder: KFold only needs the number of rows to build the indices.

kf = KFold(n_splits=10)
train_idx, valid_idx = next(kf.split(X))  # a single fold is sufficient.

splits = (L(list(train_idx)), L(list(valid_idx)))

Note that `L` is a fastai list-like object. We use it here because its printed output is truncated (like a NumPy array, and unlike a Python list, which prints everything). The code below still works without it; just be sure not to print `splits`, or it will flood the output. (I haven't checked whether a NumPy array works in place of `L` for the code below.)

In [85]:
splits

((#323452) [35940,35941,35942,35943,35944,35945,35946,35947,35948,35949...],
 (#35940) [0,1,2,3,4,5,6,7,8,9...])
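
The sizes above follow from how an unshuffled K-fold split is laid out. A minimal sketch, not scikit-learn's code: the helper `first_fold_indices` is hypothetical and only mirrors the documented rule that the first `n_samples % n_splits` folds get one extra sample.

```python
def first_fold_indices(n_samples: int, n_splits: int):
    """Index ranges (train, valid) for the first fold of an unshuffled K-fold split."""
    # The first fold gets one extra sample whenever the rows do not divide evenly.
    fold_size = n_samples // n_splits + (1 if n_samples % n_splits else 0)
    valid = range(0, fold_size)          # first contiguous block of rows
    train = range(fold_size, n_samples)  # everything after that block
    return train, valid

n_rows = 323452 + 35940  # total rows, taken from the split sizes shown above
train, valid = first_fold_indices(n_rows, 10)
print(len(train), len(valid), train[0], valid[0])  # 323452 35940 35940 0
```

Because the validation rows are one contiguous block at the top of the file, they may not be representative if the CSV is sorted by city or company. `KFold(n_splits=10, shuffle=True, random_state=42)` (the seed value is arbitrary) would draw a random fold instead.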

In [86]:
df.nunique()

Date_of_Travel     1095
Company               2
City                 19
KM_Travelled        874
Price_Charged     99176
Cost_of_Trip      16291
dtype: int64

We then use the helper function `cont_cat_split` (continuous variable / categorical variable split). It returns the continuous column names and the categorical column names. The second argument is a cardinality threshold: as far as I can tell from fastai's implementation, an integer column with more unique values than the threshold is treated as continuous, float columns are always continuous, and everything else (including integer columns at or below the threshold) is categorical.

In [87]:
dep_var = "KM_Travelled"   # dependent variable
cont, cat = cont_cat_split(df, 19, dep_var=dep_var)
cont, cat

(['Date_of_Travel', 'Price_Charged', 'Cost_of_Trip'], ['Company', 'City'])
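
To make the splitting rule concrete, here is a rough pure-Python sketch of it. This is an approximation based on my reading of fastai's behaviour, not its actual code; `split_cont_cat` is a hypothetical name, and the toy threshold of 3 is chosen only because the sample has five rows.

```python
def split_cont_cat(columns: dict, max_card: int, dep_var=None):
    """Split column names into (continuous, categorical) by dtype and cardinality."""
    cont, cat = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the target is neither a cont nor a cat feature
        is_float = all(isinstance(v, float) for v in values)
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        # Floats are always continuous; integers only above the cardinality threshold.
        if is_float or (is_int and len(set(values)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

# The five rows shown by df.head() earlier in the notebook.
sample = {
    "Date_of_Travel": [42377, 42375, 42371, 42376, 42372],
    "Company": ["Pink Cab"] * 5,
    "City": ["ATLANTA GA"] * 5,
    "KM_Travelled": [30.45, 28.62, 9.04, 33.17, 8.73],
    "Price_Charged": [370.95, 358.52, 125.2, 377.4, 114.62],
    "Cost_of_Trip": [313.635, 334.854, 97.632, 351.602, 97.776],
}
cont, cat = split_cont_cat(sample, max_card=3, dep_var="KM_Travelled")
print(cont, cat)
```

On the full data the same logic would explain the output above: `Date_of_Travel` is an integer column with 1095 unique values (more than 19), the price and cost columns are floats, and `Company` and `City` are strings.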

And we convert them to a `TabularPandas` object for training. 

In [88]:
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)

(323452, 35940)

In [89]:
to.show(5)  # human-readable view; this is not what is actually passed to training.

Unnamed: 0,Company,City,Date_of_Travel,Price_Charged,Cost_of_Trip,KM_Travelled
35940,Yellow Cab,BOSTON MA,42540,598.450012,366.911987,29.120001
35941,Yellow Cab,BOSTON MA,42539,458.630005,333.720001,25.75
35942,Yellow Cab,BOSTON MA,42540,174.070007,131.651993,9.54
35943,Yellow Cab,BOSTON MA,42540,383.089996,253.627197,20.52
35944,Yellow Cab,BOSTON MA,42548,439.829987,344.9664,25.440001


In [90]:
to.train.items.head(5)  # this is what's passed to training. 

Unnamed: 0,Date_of_Travel,Company,City,KM_Travelled,Price_Charged,Cost_of_Trip
35940,42540,2,3,29.120001,598.450012,366.911987
35941,42539,2,3,25.75,458.630005,333.720001
35942,42540,2,3,9.54,174.070007,131.651993
35943,42540,2,3,20.52,383.089996,253.627197
35944,42548,2,3,25.440001,439.829987,344.9664


In [91]:
to.classes["Company"]

['#na#', 'Pink Cab', 'Yellow Cab']

In [121]:
to.classes["City"]

['#na#', 'ATLANTA GA', 'AUSTIN TX', 'BOSTON MA', 'CHICAGO IL', 'DALLAS TX', 'DENVER CO', 'LOS ANGELES CA', 'MIAMI FL', 'NASHVILLE TN', 'NEW YORK NY', 'ORANGE COUNTY', 'PHOENIX AZ', 'PITTSBURGH PA', 'SACRAMENTO CA', 'SAN DIEGO CA', 'SEATTLE WA', 'SILICON VALLEY', 'TUCSON AZ', 'WASHINGTON DC']

In [126]:
"pink cab".title()

'Pink Cab'

The reason `Company` is encoded as 0, 1, and 2 rather than 0 and 1 is that fastai adds an extra `#na#` class at index 0. It is never used in the training data, but the test data may contain companies outside what we expect, so the class is reserved for that case.
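
For deployment, the same mapping can be reproduced by hand. A minimal sketch, assuming the class list is copied from `to.classes["Company"]`; the helper `encode` is hypothetical and is not part of fastai.

```python
# Copied from the to.classes["Company"] output above.
company_classes = ["#na#", "Pink Cab", "Yellow Cab"]

def encode(value: str, classes: list) -> int:
    """Return the integer code for value, or 0 (the '#na#' slot) if it is unseen."""
    try:
        return classes.index(value)
    except ValueError:
        return 0  # unknown categories fall back to the reserved '#na#' class

print(encode("Yellow Cab", company_classes))        # 2
print(encode("pink cab".title(), company_classes))  # 1, after normalising the case
print(encode("Green Cab", company_classes))         # 0, not seen in training
```

Note that `str.title()` only suits `Company`; the `City` classes are upper case, so user input for that column would need `str.upper()` instead.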

In [92]:
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

OK, let's just use `DecisionTreeRegressor` to keep things simple, and save the model as a pickle afterwards.

In [93]:
len(to.train) / 1000

323.452

In [94]:
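import math  # explicit import; otherwise this relies on fastai's star import

# root mean squared error, rounded to 6 decimals
def r_mse(pred, y): return round(math.sqrt(((pred - y) ** 2).mean()), 6)
# RMSE of model m's predictions on features xs against targets y
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)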

In [95]:
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)

m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)

(1.135953, 1.300814)
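
For reference, the metric is the square root of the mean squared difference between predictions and targets, so the values above are in kilometres. A standalone sketch with plain lists; the function name `rmse` and the prediction values are made up for illustration.

```python
import math

def rmse(pred, y):
    """Root mean squared error over two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, y)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions against three KM_Travelled values from the table above.
print(round(rmse([29.0, 26.0, 10.0], [29.12, 25.75, 9.54]), 6))
```

The gap between the training and validation scores (about 1.14 vs 1.30) is what to watch when tuning `min_samples_leaf`.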

Originally we tried to predict `Price_Charged` but got very bad results, so we switched to predicting `KM_Travelled` instead and got better results. We'll use this and continue, although the data still aren't ideal.

In [97]:
m.get_n_leaves()

9929

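Check that prediction works on a hand-made row. The values must be in the same column order as `xs` (check `xs.columns`); the `to.show(5)` output above lists the categorical columns before `Date_of_Travel`, so the order used below is worth double-checking.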

In [124]:
m.predict([[42540., 2., 3., 460.8, 380.4]])[0]



29.123513247515703
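
Passing features by position is easy to get wrong, so one option is to build the row from named values. A minimal sketch: `FEATURE_ORDER` is an assumption taken from the `to.show(5)` column order and should be verified against `list(xs.columns)`, and `make_row` is a hypothetical helper.

```python
# Assumed column order of xs: categorical columns first, then continuous.
# Verify against list(xs.columns) before relying on it.
FEATURE_ORDER = ["Company", "City", "Date_of_Travel", "Price_Charged", "Cost_of_Trip"]

def make_row(features: dict, order=FEATURE_ORDER) -> list:
    """Arrange named feature values into the positional order the model expects."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in order]

row = make_row({
    "Date_of_Travel": 42540, "Company": 2, "City": 3,
    "Price_Charged": 460.8, "Cost_of_Trip": 380.4,
})
print(row)  # [2.0, 3.0, 42540.0, 460.8, 380.4]
```

The prediction call would then be `m.predict([row])[0]`.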

Ok, working. Let's export it. 

In [110]:
save_pickle(path/"model.pkl", m)
save_pickle(path/"to.pkl", to)
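
On the deployment side the files have to be read back. A minimal sketch using only the standard library, assuming `save_pickle` writes an ordinary pickle file (my understanding of the fastai helper); `load_artifact` is a hypothetical name and the demo uses a stand-in object rather than the real model.

```python
import pickle
import tempfile
from pathlib import Path

def load_artifact(path):
    """Load one pickled object from disk (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a stand-in object instead of the real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    stand_in = {"classes": ["#na#", "Pink Cab", "Yellow Cab"]}
    with open(demo_path, "wb") as f:
        pickle.dump(stand_in, f)
    restored = load_artifact(demo_path)

print(restored == stand_in)  # True
```

Unpickling `model.pkl` requires scikit-learn in the loading environment (ideally the same version used for training), and `to.pkl` additionally requires fastai.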

## Self experimentation to see what works

In [118]:
from datetime import datetime
g = datetime.strptime("01/01/2016", "%d/%m/%Y")

In [120]:
(datetime.strptime("02/01/2016", "%d/%m/%Y") - g).days

1

In [114]:
np.where(np.array(to.classes["City"]) == "BOSTON MA")[0][0]

3