# Modelling all the Data

In the notebook `model_sample.ipynb` we used `taxi_model_sample001.csv` to find the best-suited model and its hyperparameters. Here we fit that model on the full dataset, `taxi_model.csv`, and check its RMSE on the test set (RMSE_test).

## Chicago Coordinates 
Longitude, latitude: -87.6244212, 41.8755616

## 1 Load the libraries

In [1]:
# We load the libraries we are going to use in our analysis

# Libraries to manage our dataset
import pandas as pd
import numpy as np
from datetime import datetime
from math import sqrt
pd.options.display.max_columns = None

# Libraries for modelling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from xgboost import XGBRegressor

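# Metric used to measure how good our model is.
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error between the true and the predicted fares."""
    return sqrt(mean_squared_error(y_true, y_pred))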

## 2 Bagging + Xgboost with depth 5 and 120 estimators full dataset

Below we measure how good our model is by predicting the fares of the test data that we hold out before training, this time using the full dataset rather than the sample.
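As a reminder of what the metric measures, here is a minimal sketch of RMSE on a handful of made-up fares. The helper name `rmse_numpy` and the numbers are assumptions chosen for illustration; they are not taken from `taxi_model.csv`.

```python
import numpy as np


def rmse_numpy(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical fares in dollars, chosen so the arithmetic is easy to follow.
true_fares = [10.0, 20.0, 30.0, 40.0]
predicted_fares = [11.0, 13.0, 31.0, 47.0]

# Errors are -1, 7, -1, -7; their squares are 1, 49, 1, 49; the mean is 25.
example_rmse = rmse_numpy(true_fares, predicted_fares)
print(example_rmse)  # 5.0
```

Because the errors are squared before averaging, the two 7-dollar misses dominate the result. The result is in the same unit as the fare, which is why the RMSE values in this notebook are reported in dollars.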

In [7]:
%%time

print('Starting to read the data at %s' %(datetime.now()))
chicago_trips = pd.read_csv('../Data/taxi_model.csv')
chicago_trips = chicago_trips.drop(['trip_id','trip_seconds','taxi_id_ind','payment_type_ind'],axis=1)
print('Finished read the data at %s' %(datetime.now()))

print('Starting to split the data in train, validation and test at %s' %(datetime.now()))
chicago_trips = pd.get_dummies(chicago_trips)
chicago_train = chicago_trips[(chicago_trips['year']<2016) |
                             ((chicago_trips['year']==2016) & (chicago_trips['month']<=1))]

chicago_test_val = chicago_trips[(chicago_trips['year']==2017) |
                             ((chicago_trips['year']==2016) & (chicago_trips['month']>1))]

print('The train dataset shape is: %d,%d' %(chicago_train.shape[0],chicago_train.shape[1]))
print('The validation/test dataset shape is: %d,%d' %(chicago_test_val.shape[0],chicago_test_val.shape[1]))


train_target = np.ravel(chicago_train[['fare']])
train_predictors = chicago_train.drop(['fare'],axis=1)

chicago_test_val_target = np.ravel(chicago_test_val[['fare']])
chicago_test_val_predictors = chicago_test_val.drop(['fare'],axis=1)

validation_predictors,test_predictors, validation_target, test_target = train_test_split(chicago_test_val_predictors, 
                                                                                         chicago_test_val_target, 
                                                                                         test_size=0.38,
                                                                                         random_state=42)

print('The train predictors shape is %d,%d and the train target shape is %d' %(train_predictors.shape[0],
                                                                                  train_predictors.shape[1],
                                                                                  train_target.shape[0]))

print('The validation predictors shape is %d,%d and the validation target shape is %d' 
      %(validation_predictors.shape[0],validation_predictors.shape[1],
        validation_target.shape[0]))

print('The test predictors shape is %d,%d and the test target shape is %d' %(test_predictors.shape[0],
                                                                          test_predictors.shape[1],
                                                                          test_target.shape[0]))

del chicago_test_val, chicago_test_val_predictors, chicago_test_val_target, chicago_train, chicago_trips

# validation_predictors and validation_target are kept: they are needed below to compute RMSE_val.
print('Finished split the data in train, validation and test at %s' %(datetime.now()))

depth = 5
estimator = 120

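# Only the estimator is built here; the actual training happens in fit() below.
print('Starting to train at %s' %(datetime.now()))
taxi_model_bg_xgb = BaggingRegressor(XGBRegressor(max_depth=depth,
                                                  learning_rate=0.1,
                                                  n_estimators=estimator,
                                                  n_jobs=-1,
                                                  # random_state replaces the deprecated seed argument
                                                  random_state=13),
                                     n_jobs= -1,
                                     random_state=13)
print('Finished train at %s' %(datetime.now()))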

print('Starting to fit at %s' %(datetime.now()))
taxi_model_bg_xgb.fit(train_predictors, train_target)
print('Finished fit at %s' %(datetime.now()))

print('Starting to predict train at %s' %(datetime.now()))
train_predictions = taxi_model_bg_xgb.predict(train_predictors)
val_predictions = taxi_model_bg_xgb.predict(validation_predictors)
test_predictions = taxi_model_bg_xgb.predict(test_predictors)
print('Finished predict at %s' %(datetime.now()))

error_train = rmse(train_target, train_predictions)
error_val = rmse(validation_target, val_predictions)
error_test = rmse(test_target, test_predictions)
difference = error_val - error_train
    

print ('For a %d depth and %d estimators the RMSE_train is %.5f $, the RMSE_val is %.5f $ and the difference %.5f $'
       % (depth,estimator, error_train,error_val,difference))

print ('For a %d depth and %d estimators the RMSE_test is %.5f $'
        % (depth, estimator, error_test))


Starting to read the data at 2019-05-15 21:35:18.208675
Finished read the data at 2019-05-15 21:35:21.061187
Starting to split the data in train, validation and test at 2019-05-15 21:35:21.061297
The train dataset shape is: 0,22
The valitation/test dataset shape is: 0,22
The train predictors shape is 0,21 and the train target shape is 0
The validation predictors shape is 0,21 and the validation target shape is 0
The test predictors shape is 0,21 and the test target shape is 0
Finished split the data in train, validation and test at 2019-05-15 21:35:21.556523
Starting to train at 2019-05-15 21:35:21.556681
Finished train at 2019-05-15 21:35:21.556736
Starting to fit at 2019-05-15 21:35:21.556759


ValueError: Found array with 0 sample(s) (shape=(0, 21)) while a minimum of 1 is required.
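The output above shows that both the train and the validation/test frames had 0 rows, so `fit` received an empty array. The cause cannot be determined from this output alone. One way to catch it earlier is to check the split before building the model. The sketch below is only an illustration under stated assumptions: `chronological_split`, `require_rows` and the toy data are hypothetical, and the later set is taken as the complement of the train filter, which differs slightly from the original `year == 2017` condition.

```python
import pandas as pd


def chronological_split(trips, cutoff_year=2016, cutoff_month=1):
    """Split trips into (train, later) around a year/month cutoff."""
    is_train = (trips["year"] < cutoff_year) | (
        (trips["year"] == cutoff_year) & (trips["month"] <= cutoff_month)
    )
    return trips[is_train], trips[~is_train]


def require_rows(frame, name):
    """Raise a descriptive error instead of letting an empty frame reach fit()."""
    if len(frame) == 0:
        raise ValueError(f"{name} is empty; check the 'year' and 'month' values")
    return frame


# Hypothetical toy data, not taken from taxi_model.csv.
toy_trips = pd.DataFrame({
    "year": [2015, 2016, 2016, 2017],
    "month": [12, 1, 2, 3],
    "fare": [8.5, 12.0, 7.25, 30.0],
})

train, later = chronological_split(toy_trips)
require_rows(train, "train")
require_rows(later, "validation/test")

# Counting rows per (year, month) shows which periods are actually present.
print(toy_trips.groupby(["year", "month"]).size())
```

Running the same `groupby` on the real frame right after `pd.read_csv` would show whether the file contains any rows at all and whether its `year` and `month` values fall in the ranges the filters expect.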

## 3 Conclusion