# New York City Taxi Trip Duration Prediction

This project predicts taxi trip durations in New York City. The model is selected and trained with automated machine learning (AutoML), using PyCaret.

**Benefits of AutoML:**

AutoML (Automated Machine Learning) automates model selection, hyperparameter tuning, and parts of feature engineering. This lets users with varying levels of expertise build competitive models without hand-coding every experiment. It can shorten the development cycle of a machine learning project, reduce manual intervention, and lower the risk of human error. It also makes it practical to try a broader range of algorithms and configurations than would be feasible by hand, which can lead to more accurate and robust models.

In [1]:
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the read-only Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/nyc-taxi-trip-duration/train.zip
/kaggle/input/nyc-taxi-trip-duration/test.zip
/kaggle/input/nyc-taxi-trip-duration/sample_submission.zip


In [2]:
import zipfile

# Extract each competition archive into its own folder under the working directory.
input_dir = "/kaggle/input/nyc-taxi-trip-duration"
archives = {"ss": "sample_submission.zip", "train": "train.zip", "test": "test.zip"}
for label, archive in archives.items():
    with zipfile.ZipFile(f"{input_dir}/{archive}", "r") as zip_ref:
        zip_ref.extractall(label)

In [3]:
ss=pd.read_csv("/kaggle/working/ss/sample_submission.csv")
train=pd.read_csv("/kaggle/working/train/train.csv")
test=pd.read_csv("/kaggle/working/test/test.csv")

In [4]:
ss.head()

Unnamed: 0,id,trip_duration
0,id3004672,959
1,id3505355,959
2,id1217141,959
3,id2150126,959
4,id1598245,959


In [5]:
train.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [6]:
test.head()

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag
0,id3004672,1,2016-06-30 23:59:58,1,-73.988129,40.732029,-73.990173,40.75668,N
1,id3505355,1,2016-06-30 23:59:53,1,-73.964203,40.679993,-73.959808,40.655403,N
2,id1217141,1,2016-06-30 23:59:47,1,-73.997437,40.737583,-73.98616,40.729523,N
3,id2150126,2,2016-06-30 23:59:41,1,-73.95607,40.7719,-73.986427,40.730469,N
4,id1598245,1,2016-06-30 23:59:33,1,-73.970215,40.761475,-73.96151,40.75589,N


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


In [8]:
abs(train.corr(numeric_only=True)["trip_duration"]).sort_values(ascending=False)

trip_duration        1.000000
pickup_latitude      0.029204
pickup_longitude     0.026542
dropoff_latitude     0.020677
vendor_id            0.020304
dropoff_longitude    0.014678
passenger_count      0.008471
Name: trip_duration, dtype: float64

In [9]:
train.drop(["id","vendor_id","store_and_fwd_flag","pickup_datetime","dropoff_datetime"],axis=1,inplace=True)
test.drop(["id","vendor_id","store_and_fwd_flag","pickup_datetime"],axis=1,inplace=True)

In [10]:
import math

def haversine(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude):
    """Return the great-circle distance in kilometres between pickup and dropoff."""
    R = 6371.0  # mean Earth radius in km

    # Latitudes and coordinate differences, converted from degrees to radians
    phi1 = math.radians(pickup_latitude)
    phi2 = math.radians(dropoff_latitude)
    delta_phi = math.radians(dropoff_latitude - pickup_latitude)
    delta_lambda = math.radians(dropoff_longitude - pickup_longitude)

    # Haversine formula: a is the squared half-chord length, c the central angle
    a = math.sin(delta_phi / 2.0)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2.0)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    return R * c
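
For reference, the `haversine` function corresponds to the standard haversine formula. Written out with $\varphi$ for latitude and $\lambda$ for longitude (both in radians), and with $R = 6371$ km as in the code:

$$
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
d &= R\,c
\end{aligned}
$$

Here index 1 is the pickup point and index 2 the dropoff point. The result $d$ is a straight-line ("as the crow flies") distance in kilometres, so it is a lower bound on the distance actually driven.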

In [11]:
train["distance"]=train.apply(lambda row: haversine(row['pickup_longitude'], row['pickup_latitude'], row['dropoff_longitude'], row['dropoff_latitude']), axis=1)
test["distance"]=test.apply(lambda row: haversine(row['pickup_longitude'], row['pickup_latitude'], row['dropoff_longitude'], row['dropoff_latitude']), axis=1)
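
The row-wise `apply` calls the Python function once per row, which is roughly 1.46 million calls for `train`. A vectorized variant is usually much faster. The sketch below assumes NumPy is available (it is imported in the first cell); `haversine_np` and `EARTH_RADIUS_KM` are hypothetical names that do not appear in the original notebook. The formula is the same, only applied to whole arrays at once.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as the scalar haversine()


def haversine_np(pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude):
    """Great-circle distance in km for scalars, NumPy arrays, or pandas Series.

    Hypothetical helper, not part of the original notebook.
    """
    # Convert every input from degrees to radians in one step.
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(values, dtype=float))
        for values in (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)
    )
    a = (
        np.sin((lat2 - lat1) / 2.0) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    )
    return EARTH_RADIUS_KM * 2.0 * np.arctan2(np.sqrt(a), np.sqrt(1.0 - a))


# First two rows of train.head(), as displayed (coordinates rounded to 6 decimals).
sample_km = haversine_np(
    [-73.982155, -73.980415],
    [40.767937, 40.738564],
    [-73.964630, -73.999481],
    [40.765602, 40.731152],
)
print(sample_km)  # should be close to the 1.498521 and 1.805507 shown by train.head()
```

Under these assumptions the whole column could be computed in one call, for example `train["distance"] = haversine_np(train["pickup_longitude"], train["pickup_latitude"], train["dropoff_longitude"], train["dropoff_latitude"])`.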

In [12]:
test.head()

Unnamed: 0,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,distance
0,1,-73.988129,40.732029,-73.990173,40.75668,2.746426
1,1,-73.964203,40.679993,-73.959808,40.655403,2.759239
2,1,-73.997437,40.737583,-73.98616,40.729523,1.306155
3,1,-73.95607,40.7719,-73.986427,40.730469,5.269088
4,1,-73.970215,40.761475,-73.96151,40.75589,0.960842


In [13]:
train.head()

Unnamed: 0,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,distance
0,1,-73.982155,40.767937,-73.96463,40.765602,455,1.498521
1,1,-73.980415,40.738564,-73.999481,40.731152,663,1.805507
2,1,-73.979027,40.763939,-74.005333,40.710087,2124,6.385098
3,1,-74.01004,40.719971,-74.012268,40.706718,429,1.485498
4,1,-73.973053,40.793209,-73.972923,40.78252,435,1.188588


In [14]:
train.drop(["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],axis=1,inplace=True)
test.drop(["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],axis=1,inplace=True)

In [15]:
train.isnull().sum()

passenger_count    0
trip_duration      0
distance           0
dtype: int64

In [16]:
test.isnull().sum()

passenger_count    0
distance           0
dtype: int64

In [17]:
%pip install pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl (486 kB)
Collecting pmdarima>=2.0.4
  Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.1 MB)
Collecting deprecation>=2.1.0
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting ipywidgets>=7.6.5
  Downloading ipywidgets-8.1.3-py3-none-any.whl (139 kB)
Collecting sktime==0.26.0
  Downloading sktime-0.26.0-py3-none-any.whl (21.8 MB)

Collecting widgetsnbextension~=4.0.11
  Downloading widgetsnbextension-4.0.11-py3-none-any.whl (2.3 MB)
Collecting jupyterlab-widgets~=3.0.11
  Downloading jupyterlab_widgets-3.0.11-py3-none-any.whl (214 kB)
Collecting tsdownsample>=0.1.3
  Downloading tsdownsample-0.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting orjson<4.0.0,>=3.8.0
  Downloading orjson-3.10.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
Collecting dash>=2.9.0
  Downloading das

    Uninstalling joblib-1.4.2:
      Successfully uninstalled joblib-1.4.2
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.5.0
    Uninstalling scikit-learn-1.5.0:
      Successfully uninstalled scikit-learn-1.5.0
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.9.0
    Uninstalling matplotlib-3.9.0:
      Successfully uninstalled matplotlib-3.9.0
Successfully installed Cython-3.0.10 Flask-3.0.3 blinker-1.8.2 category-encoders-2.6.3 dash-2.17.0 dash-core-components-2.0.0 dash-html-components-2.0.0 dash-table-5.0.0 deprecation-2.1.0 imbalanced-learn-0.12.3 ipywidgets-8.1.3 itsdangerous-2.2.0 joblib-1.3.2 jupyterlab-widgets-3.0.11 kaleido-0.2.1 lightgbm-4.3.0 matplotlib-3.7.5 orjson-3.10.4 pandas-2.1.4 patsy-0.5.6 plotly-5.22.0 plotly-resampler-0.10.0 pmdarima-2.0.4 pyca

In [18]:
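from pycaret.regression import setup, compare_models

# Initialise the experiment; PyCaret holds out 30% of the rows as a test set by default.
setup(data=train, target='trip_duration')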

Unnamed: 0,Description,Value
0,Session id,2585
1,Target,trip_duration
2,Target type,Regression
3,Original data shape,"(1458644, 3)"
4,Transformed data shape,"(1458644, 3)"
5,Transformed train set shape,"(1021050, 3)"
6,Transformed test set shape,"(437594, 3)"
7,Numeric features,2
8,Preprocess,True
9,Imputation type,simple


<pycaret.regression.oop.RegressionExperiment at 0x7f7a4c77ecb0>

In [19]:
best_model=compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,435.5218,34369355.6558,5156.8666,0.0185,0.5993,0.7948,290.377
lr,Linear Regression,462.1447,34428171.8037,5162.6485,0.0159,0.6927,1.0144,0.431
lasso,Lasso Regression,462.1466,34428166.854,5162.6484,0.0159,0.6927,1.0147,0.233
en,Elastic Net,463.7848,34427920.5012,5162.6538,0.0159,0.697,1.0286,0.235
ridge,Ridge Regression,462.1447,34428171.8347,5162.6485,0.0159,0.6927,1.0144,0.241
llar,Lasso Least Angle Regression,462.1466,34428166.9358,5162.6484,0.0159,0.6927,1.0147,0.244
br,Bayesian Ridge,462.1613,34428168.0555,5162.6485,0.0159,0.6927,1.0146,0.245
lar,Least Angle Regression,462.1447,34428171.8446,5162.6485,0.0159,0.6927,1.0144,0.239
omp,Orthogonal Matching Pursuit,461.3927,34429623.2372,5162.8171,0.0158,0.6928,1.0193,0.248
huber,Huber Regressor,401.7473,34466825.1342,5167.1743,0.0136,0.5549,0.588,0.384
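
The comparison table includes an RMSLE column. As a minimal sketch of what that metric measures, the snippet below computes the root mean squared logarithmic error by hand using only the standard library. The function name `rmsle` is hypothetical, and the sample values are the five `trip_duration` values from `train.head()` compared against the constant 959 used in the sample submission. It is an illustration of the formula, not PyCaret's own implementation.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values.

    Hypothetical helper, not part of the original notebook.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so a duration of 0 seconds is still valid input.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# trip_duration values shown by train.head(), in seconds.
durations = [455, 663, 2124, 429, 435]
# Constant prediction taken from the sample submission file.
constant_prediction = [959] * len(durations)

baseline_rmsle = rmsle(durations, constant_prediction)
print(f"RMSLE of the constant 959 s prediction on these five rows: {baseline_rmsle:.4f}")
```

Because the error is taken on a log scale, RMSLE penalises relative rather than absolute deviations. That makes it far less sensitive than RMSE to a handful of extremely long trips, which may explain why the models in the table differ more on RMSLE than on RMSE.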


In [20]:
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
print(loaded_model)

LGBMRegressor(n_jobs=-1, random_state=2585)


In [21]:
pred=best_model.predict(test)
ss["trip_duration"]=pred
ss.to_csv("submission.csv",index=False)

AttributeError: 'Index' object has no attribute '_format_native_types'

**Likely cause of the error:**

The install log for PyCaret shows pandas 2.2.2 being uninstalled and pandas 2.1.4 being installed, after pandas had already been imported in the first cell. The running session therefore probably mixes modules from both versions, and `to_csv` fails when code from one version calls a private `Index` method that the other version does not provide. Restarting the kernel after the install, or installing PyCaret before the first `import pandas`, should allow `ss.to_csv("submission.csv", index=False)` to complete.