#### Creating the submission file

The submission file must follow the format accepted by the automated scoring system. A sample of the final format is provided with the datasets on the challenge page.

In [None]:
import os
BASE_PATH_KAGGLE_SUBMISISON = './out'

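def create_txt_file_for_submission(df_data, file_name):
    """Write the predictions to BASE_PATH_KAGGLE_SUBMISISON/<file_name>.txt as CSV."""
    # Create the output folder if it does not exist yet.
    os.makedirs(BASE_PATH_KAGGLE_SUBMISISON, exist_ok=True)
    final_path = os.path.abspath(os.path.join(BASE_PATH_KAGGLE_SUBMISISON, file_name))
    final_path += '.txt'
    df_data.index.name = 'id'
    print(final_path)
    # na_rep expects a string, so the median used for missing values is converted.
    df_data.to_csv(final_path,
                   sep=',',
                   header=True,
                   na_rep=str(df_data['trip_duration'].quantile(0.5))
                  )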
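
As a quick check of the resulting layout, the sketch below writes a tiny two-row frame with the same `to_csv` options and reads it back. The ids and durations are made-up values, and a temporary directory stands in for `./out`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids and durations, used only to illustrate the file layout.
df_demo = pd.DataFrame({'trip_duration': [455.0, 1200.0]},
                       index=['id_a', 'id_b'])
df_demo.index.name = 'id'

# A temporary directory stands in for the notebook's output folder.
out_dir = tempfile.mkdtemp()
demo_path = os.path.join(out_dir, 'demo.txt')
df_demo.to_csv(demo_path, sep=',', header=True)

# The first line of the file is the header row.
with open(demo_path) as f:
    first_line = f.readline().strip()

df_back = pd.read_csv(demo_path, index_col='id')
print(first_line)
print(df_back)
```

The header row should read `id,trip_duration`, matching the two columns the function above produces.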
   

#### Predicting the trip duration for the original test dataset

Since the best models were stored, we now clone each of them (an unfitted copy with the same parameters) and retrain it on the whole original training dataset to get a better model for the final submission.

After predicting the trip durations, one submission file is prepared for each model.

In [None]:
for model_name, model_specs in best_models.items():
    # Unfitted copy of the tuned estimator, retrained on the full training set.
    clf = sklearn.base.clone(model_specs['clf'])
    clf.fit(df_X_train, df_y_train)
    y_predict = clf.predict(df_test)
    # Undo the log(x + 1) transform that was applied to the target.
    y_predict = np.expm1(y_predict)
    df_result = pd.DataFrame(y_predict, columns=['trip_duration'], index=df_test.index.values)
    create_txt_file_for_submission(df_result, model_name)
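
The models are trained on `log(trip_duration + 1)`, so their predictions have to be mapped back to seconds before the file is written. A minimal sketch of that round trip, using illustrative durations rather than values from the dataset:

```python
import numpy as np

# Illustrative durations in seconds (not taken from the dataset).
durations = np.array([1.0, 60.0, 455.0, 7199.0])

log_target = np.log1p(durations)   # equivalent to np.log(durations + 1)
recovered = np.expm1(log_target)   # equivalent to np.exp(log_target) - 1

# The inverse transform returns the original durations.
print(np.allclose(recovered, durations))
```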

## The final results

After submitting the predictions for each model, the final scores are as follows:

1. XGBoost: 0.49443
1. Gradient Boost: 0.49671
1. Ridge: 0.51110

## Achievements and Improvements

Throughout the project, many topics related to data manipulation and visualization were covered: opening the files, describing each of the columns, checking for null data, plotting the distributions, checking for outliers, creating new features by combining multiple columns, building and evaluating simple models, and fine-tuning the best ones.

The results achieved are consistent with the proposal, whose goal was a final score of 0.5: XGBoost and Gradient Boost finished just below it and Ridge slightly above.

As for improvements, the first step would be to explore the hyperparameters further and fine-tune them even more. Another would be to better understand in which areas traffic is more intense, for example by using clustering. It would also be a good idea to add more data sources, such as weather, which has a big impact on trip duration.
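
As a rough illustration of the clustering idea, the sketch below groups synthetic pickup coordinates with scikit-learn's `KMeans` and keeps the cluster label as a candidate feature. The two centers, the number of clusters, and the name `pickup_cluster` are assumptions made for the example; nothing like this was run in the project.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic pickup points scattered around two made-up (lat, lon) centers.
centers = np.array([[40.75, -73.99], [40.65, -73.78]])
points = np.vstack([c + rng.normal(scale=0.01, size=(100, 2)) for c in centers])

# Hypothetical choice of two clusters; real data would need this tuned.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# The label of each point could be added to the frame as a categorical feature.
pickup_cluster = kmeans.labels_
print(np.bincount(pickup_cluster))
```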

### Ridge Hyperparameter Tuning

The score for Ridge using the default parameters was 0.4283718167519953.

The fine-tuning tested combinations of the following parameters, using 3-fold cross-validation to evaluate the best result:

* alpha: 0.5, 1.0, 3.0, 4.0, 5.0, 10.0
* solver: *auto*, *least-squares* (lsqr), *Stochastic Average Gradient descent* (SAG), *Singular Value Decomposition* (SVD)

The final parameters chosen were *alpha* = 5.0 and *solver* = *svd*, which brought no practical improvement: the score was 0.4283718279198734.
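
A search over the grid above could be set up roughly as follows. This is a sketch on synthetic data, not the code used in the project: the data, the `scoring` choice, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Same values as the list above.
param_grid = {
    'alpha': [0.5, 1.0, 3.0, 4.0, 5.0, 10.0],
    'solver': ['auto', 'lsqr', 'sag', 'svd'],
}

# Assumption: RMSE is used as the selection metric; scikit-learn maximizes
# scores, hence the negated variant.
search = GridSearchCV(Ridge(), param_grid, cv=3,
                      scoring='neg_root_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)
```

With 6 values of *alpha* and 4 solvers, the search evaluates 24 candidates, each fitted 3 times.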

### Gradient Boosting Regressor Hyperparameter Tuning

The score for Gradient Boosting Regressor using the default parameters was 0.40940629859357985.

The fine-tuning tested combinations of the following parameters, using 2-fold cross-validation to evaluate the best result:

* max_depth: 3, 5
* n_estimators: 100, 200
* min_samples_split: 2, 6
* learning_rate : 0.1, 1.0

The final parameters chosen were *learning_rate*: 0.1, *max_depth*: 5, *min_samples_split*: 6, *n_estimators*: 200, which improved the score to 0.39898042677040685.
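
To get a feel for the cost of this search, the number of candidates and model fits can be counted with scikit-learn's `ParameterGrid`. The variable names are chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the list above.
gbr_grid = {
    'max_depth': [3, 5],
    'n_estimators': [100, 200],
    'min_samples_split': [2, 6],
    'learning_rate': [0.1, 1.0],
}
n_folds = 2

n_candidates = len(ParameterGrid(gbr_grid))  # 2 * 2 * 2 * 2 combinations
n_fits = n_candidates * n_folds              # one fit per candidate and fold
print(n_candidates, n_fits)  # 16 32
```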

### XGBoost Hyperparameter Tuning

The score for XGBoost using the default parameters was 0.409344079071674.

The fine-tuning tested combinations of the following parameters, using 2-fold cross-validation to evaluate the best result:

* max_depth: 5, 8, 10
* n_estimators: 200, 300
* learning_rate: 0.05, 0.1
* reg_lambda: 1.0, 5

The final parameters chosen were *learning_rate*: 0.1, *max_depth*: 5, *n_estimators*: 300, *reg_lambda*: 5, which improved the score to 0.39852413102654594.
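
To put the three tuning results side by side, the relative change of each score can be computed from the values reported above (lower is better). The dictionary layout is just for the example.

```python
# (default score, tuned score) as reported in the sections above.
scores = {
    'Ridge': (0.4283718167519953, 0.4283718279198734),
    'Gradient Boosting': (0.40940629859357985, 0.39898042677040685),
    'XGBoost': (0.409344079071674, 0.39852413102654594),
}

improvement_pct = {}
for name, (default, tuned) in scores.items():
    # Positive values mean the tuned model scored lower (better).
    improvement_pct[name] = 100 * (default - tuned) / default
    print(f'{name}: {improvement_pct[name]:.4f}%')
```

Tuning lowered the score by roughly 2.5% for Gradient Boosting and 2.6% for XGBoost, while the Ridge score is unchanged for practical purposes.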

In [1]:
import pandas as pd
import numpy as np
import csv
import sklearn
from datetime import datetime, timedelta
import time
import xgboost as xgb
import scipy
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from sklearn import preprocessing



PATH_TRAIN_DATASET = './data/train.csv'
PATH_TEST_DATASET = './data/test.csv'
PATH_SAMPLE_SUMBISSION = './data/sample_submission.csv'

NYC_DEGREE_KM = 111.05938787411571
NYC_BOUNDING_BOX = [(40.4774,-74.2589), ( 40.9176, -73.7004)]

def calculate_city_block_distance(df_data):
    # Approximation: the same km-per-degree factor is used for latitude and longitude.
    delta_lat = np.absolute(df_data.pickup_latitude - df_data.dropoff_latitude) * NYC_DEGREE_KM
    delta_lon = np.absolute(df_data.pickup_longitude - df_data.dropoff_longitude) * NYC_DEGREE_KM
    return delta_lat + delta_lon

def kaggle_score(y_true_exp, y_pred_exp):
    """RMSLE between two arrays that hold log(x + 1)-transformed durations."""
    # Map both arrays back to seconds before applying the RMSLE formula.
    y_pred_exp = np.expm1(y_pred_exp)
    y_true_exp = np.expm1(y_true_exp)
    e_log_square = np.square(np.log1p(y_pred_exp) - np.log1p(y_true_exp))
    score = np.sqrt(np.mean(e_log_square))
    return score

# infer_datetime_format is deprecated in recent pandas versions, so it is omitted.
df_test = pd.read_csv(PATH_TEST_DATASET, parse_dates=['pickup_datetime'], index_col='id')
df_train = pd.read_csv(PATH_TRAIN_DATASET, parse_dates=['pickup_datetime'], index_col='id')

df_train.drop('dropoff_datetime', axis=1, inplace=True)
df_train['pickup_datetime'] = df_train['pickup_datetime'].dt.to_pydatetime()
df_test['pickup_datetime'] = df_test['pickup_datetime'].dt.to_pydatetime()

Q1 = df_train['trip_duration'].quantile(0.25)
Q3 = df_train['trip_duration'].quantile(0.75)
IQR = Q3 - Q1
df_train = df_train[~((df_train['trip_duration'] < (Q1 - 1.5 * IQR)) |(df_train['trip_duration'] > (Q3 + 1.5 * IQR)))]

df_train = df_train[df_train['trip_duration'] > 1]
df_train = df_train[df_train['trip_duration'] < 7200]

filter_lat_long = df_train['pickup_latitude'] < NYC_BOUNDING_BOX[1][0]
filter_lat_long &= df_train['pickup_latitude'] > NYC_BOUNDING_BOX[0][0]
filter_lat_long &= df_train['pickup_longitude'] < NYC_BOUNDING_BOX[1][1]
filter_lat_long &= df_train['pickup_longitude'] > NYC_BOUNDING_BOX[0][1]

filter_lat_long &= df_train['dropoff_latitude'] < NYC_BOUNDING_BOX[1][0]
filter_lat_long &= df_train['dropoff_latitude'] > NYC_BOUNDING_BOX[0][0]
filter_lat_long &= df_train['dropoff_longitude'] < NYC_BOUNDING_BOX[1][1]
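filter_lat_long &= df_train['dropoff_longitude'] > NYC_BOUNDING_BOX[0][1]

# Keep only the trips that start and end inside the bounding box.
df_train = df_train[filter_lat_long]
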
df_train['distance'] = calculate_city_block_distance(df_train)
df_train = df_train[df_train['distance'] > .1]
df_train['avg_speed'] = df_train['distance']/(df_train['trip_duration']/3600)
df_train = df_train[df_train['avg_speed'] < 100]
df_train = df_train[df_train['avg_speed'] > 1]

df_train.drop('avg_speed', axis=1, inplace=True)

df_train['pickup_date'] = df_train['pickup_datetime'].dt.date
df_train['pickup_hour'] = df_train['pickup_datetime'].dt.hour
df_train['pickup_weekday'] = df_train['pickup_datetime'].dt.day_name()

holidays = [day.date() for day in calendar().holidays(start=df_train['pickup_date'].min(),
                                                      end=df_train['pickup_date'].max())]
df_train['holiday'] = df_train['pickup_date'].isin(holidays)
df_train.drop('pickup_date', axis=1, inplace=True)

df_train = df_train[df_train['passenger_count']>0]

cols = ['vendor_id', 'passenger_count','store_and_fwd_flag',
        'pickup_weekday', 'pickup_hour', 'holiday']
df_train = pd.get_dummies(df_train, columns=cols)

# 'pickup_date' was already dropped after the holiday flag was built.
cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
        'dropoff_latitude', 'pickup_datetime']
df_train.drop(cols, axis=1, inplace=True)
# df_train[cols] = df_train[cols].round(3)


df_train['trip_duration'] = np.log(df_train['trip_duration'] + 1)
df_train['distance'] = np.log(df_train['distance'] + 1)

from sklearn.model_selection import train_test_split

df_y_train = df_train['trip_duration']
df_X_train = df_train.drop(columns=['trip_duration'])

X_train, X_test, y_train, y_test = train_test_split(df_X_train,
                                                    df_y_train,
                                                    test_size = 0.3,
                                                    random_state = 3)

