# New York City Taxi Trip Duration - [link](https://www.kaggle.com/competitions/nyc-taxi-trip-duration)
- type: Regression
- score: RMSLE (Root Mean Squared Logarithmic Error)

[EDA notebook](https://www.kaggle.com/code/headsortails/nyc-taxi-eda-update-the-fast-the-curious#external-data)
___

**File descriptions**
- `train.csv` - the training set (contains 1458644 trip records)
- `test.csv` - the testing set (contains 625134 trip records)
- `sample_submission.csv` - a sample submission file in the correct format

**Features**
- `id` - a unique identifier for each trip
- `vendor_id` - a code indicating the provider associated with the trip record
- `pickup_datetime` - date and time when the meter was engaged
- `dropoff_datetime` - date and time when the meter was disengaged
- `passenger_count` - the number of passengers in the vehicle (driver entered value)
- `pickup_longitude` - the longitude where the meter was engaged
- `pickup_latitude` - the latitude where the meter was engaged
- `dropoff_longitude` - the longitude where the meter was disengaged
- `dropoff_latitude` - the latitude where the meter was disengaged
- `store_and_fwd_flag` - whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)

**Target**
- `trip_duration` - duration of the trip in seconds
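
The Kaggle evaluation page for this competition specifies Root Mean Squared Logarithmic Error (RMSLE) on `trip_duration`. The sketch below shows one way to compute it locally for validation; the function name `rmsle` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two arrays of durations (seconds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(x + 1), which keeps zero-second durations finite.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Perfect predictions give an error of zero.
perfect_score = rmsle([455, 663, 2124], [455, 663, 2124])
```

Because the error is computed on the log scale, a common design choice is to train on `np.log1p(trip_duration)` and transform predictions back with `np.expm1`.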


In [1]:
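import os
import random
import shutil

import numpy as np
import pandas as pd
from IPython.display import display

# Show every column and print floats with three decimals.
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.3f}'.format)

DEFAULT_RANDOM_SEED = 2021

def set_all_seeds(seed=DEFAULT_RANDOM_SEED):
    """Seed Python's and NumPy's random generators for reproducible runs."""
    random.seed(seed)
    # Setting PYTHONHASHSEED here only affects subprocesses started later,
    # not the hash randomization of the already running interpreter.
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

set_all_seeds(seed=DEFAULT_RANDOM_SEED)

# Local project root; adjust this path to your own checkout.
os.chdir('C:/_Github repositories/New-York-City-Taxi-Trip-Duration')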

## 1. Loading data + checking

In [2]:
train = pd.read_csv('data/train.csv')
display(train.head())
train.describe()

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982 | 40.768 | -73.965 | 40.766 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.98 | 40.739 | -73.999 | 40.731 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979 | 40.764 | -74.005 | 40.71 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.01 | 40.72 | -74.012 | 40.707 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973 | 40.793 | -73.973 | 40.783 | N | 435 |


| | vendor_id | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration |
|---|---|---|---|---|---|---|---|
| count | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 | 1458644.0 |
| mean | 1.535 | 1.665 | -73.973 | 40.751 | -73.973 | 40.752 | 959.492 |
| std | 0.499 | 1.314 | 0.071 | 0.033 | 0.071 | 0.036 | 5237.432 |
| min | 1.0 | 0.0 | -121.933 | 34.36 | -121.933 | 32.181 | 1.0 |
| 25% | 1.0 | 1.0 | -73.992 | 40.737 | -73.991 | 40.736 | 397.0 |
| 50% | 2.0 | 1.0 | -73.982 | 40.754 | -73.98 | 40.755 | 662.0 |
| 75% | 2.0 | 2.0 | -73.967 | 40.768 | -73.963 | 40.77 | 1075.0 |
| max | 2.0 | 9.0 | -61.336 | 51.881 | -61.336 | 43.921 | 3526282.0 |


In [3]:
test = pd.read_csv('data/test.csv')
display(test.head())
test.describe()

| | id | vendor_id | pickup_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id3004672 | 1 | 2016-06-30 23:59:58 | 1 | -73.988 | 40.732 | -73.99 | 40.757 | N |
| 1 | id3505355 | 1 | 2016-06-30 23:59:53 | 1 | -73.964 | 40.68 | -73.96 | 40.655 | N |
| 2 | id1217141 | 1 | 2016-06-30 23:59:47 | 1 | -73.997 | 40.738 | -73.986 | 40.73 | N |
| 3 | id2150126 | 2 | 2016-06-30 23:59:41 | 1 | -73.956 | 40.772 | -73.986 | 40.73 | N |
| 4 | id1598245 | 1 | 2016-06-30 23:59:33 | 1 | -73.97 | 40.761 | -73.962 | 40.756 | N |


| | vendor_id | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|
| count | 625134.0 | 625134.0 | 625134.0 | 625134.0 | 625134.0 | 625134.0 |
| mean | 1.535 | 1.662 | -73.974 | 40.751 | -73.973 | 40.752 |
| std | 0.499 | 1.311 | 0.073 | 0.03 | 0.073 | 0.036 |
| min | 1.0 | 0.0 | -121.933 | 37.39 | -121.933 | 36.601 |
| 25% | 1.0 | 1.0 | -73.992 | 40.737 | -73.991 | 40.736 |
| 50% | 2.0 | 1.0 | -73.982 | 40.754 | -73.98 | 40.755 |
| 75% | 2.0 | 2.0 | -73.967 | 40.768 | -73.963 | 40.77 |
| max | 2.0 | 9.0 | -69.249 | 42.815 | -67.497 | 48.858 |


- **2 vendor IDs in total (maybe analyze them separately)**
- **coordinate features**
- **passenger count (max is 9 in both train and test)**
- **time feature (requires work)**
- **flag (whether the record was sent immediately or stored first) - could help find "bad internet areas"**
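
As a rough illustration of the "requires work" note on the time feature and of what the coordinate columns can yield, here is a minimal sketch. The helper names `add_time_features` and `haversine_km` and the derived column names are assumptions made for this example; the notebook may engineer these features differently.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def add_time_features(frame):
    """Return a copy with hour, weekday and month derived from pickup_datetime."""
    out = frame.copy()
    pickup = pd.to_datetime(out['pickup_datetime'])
    out['pickup_hour'] = pickup.dt.hour
    out['pickup_weekday'] = pickup.dt.dayofweek  # Monday = 0
    out['pickup_month'] = pickup.dt.month
    return out


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; accepts scalars or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First training row from the preview above.
sample = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55'],
    'pickup_longitude': [-73.982], 'pickup_latitude': [40.768],
    'dropoff_longitude': [-73.965], 'dropoff_latitude': [40.766],
})
sample = add_time_features(sample)
sample['distance_km'] = haversine_km(
    sample['pickup_latitude'], sample['pickup_longitude'],
    sample['dropoff_latitude'], sample['dropoff_longitude'],
)
```

The haversine distance is a straight-line approximation; actual driving distance in Manhattan's street grid will be longer.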

`test` lacks two columns that `train` has: `trip_duration` and `dropoff_datetime`.
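
The column difference can also be confirmed programmatically instead of by eye. A minimal sketch, using small stand-in frames with the same column names as the previews above (the helper `missing_columns` is hypothetical):

```python
import pandas as pd


def missing_columns(reference, other):
    """Columns present in `reference` but absent from `other`, sorted by name."""
    return sorted(set(reference.columns) - set(other.columns))


# Empty stand-ins that only carry the column names of the real files.
train_sample = pd.DataFrame(columns=[
    'id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count',
    'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
    'store_and_fwd_flag', 'trip_duration',
])
test_sample = pd.DataFrame(columns=[
    'id', 'vendor_id', 'pickup_datetime', 'passenger_count',
    'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
    'store_and_fwd_flag',
])

only_in_train = missing_columns(train_sample, test_sample)
```

With the real data, `missing_columns(train, test)` would be the equivalent call.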

### Missing values check
- **no missing values in `train` or `test`**

In [None]:
train.isna().sum()

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [9]:
test.isna().sum()

id                    0
vendor_id             0
pickup_datetime       0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
dtype: int64

### Combining `test` and `train` (for consistency checks)

In [14]:
df = pd.concat([train, test], ignore_index=True)
display(df.head())
df.isna().sum()

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982 | 40.768 | -73.965 | 40.766 | N | 455.0 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.98 | 40.739 | -73.999 | 40.731 | N | 663.0 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979 | 40.764 | -74.005 | 40.71 | N | 2124.0 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.01 | 40.72 | -74.012 | 40.707 | N | 429.0 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973 | 40.793 | -73.973 | 40.783 | N | 435.0 |


id                         0
vendor_id                  0
pickup_datetime            0
dropoff_datetime      625134
passenger_count            0
pickup_longitude           0
pickup_latitude            0
dropoff_longitude          0
dropoff_latitude           0
store_and_fwd_flag         0
trip_duration         625134
dtype: int64
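
The 625134 missing values in `dropoff_datetime` and `trip_duration` are simply the test rows. After shared preprocessing the combined frame has to be split again; one possible approach, sketched here on tiny stand-in frames, is to tag each row before concatenating. The `is_train` column is an assumption for illustration and is not used in the notebook cell above.

```python
import pandas as pd

# Tiny stand-ins for the real train/test frames.
train_sample = pd.DataFrame({'id': ['id2875421', 'id2377394'],
                             'trip_duration': [455, 663]})
test_sample = pd.DataFrame({'id': ['id3004672']})

# Tag the origin of every row before stacking the frames.
combined = pd.concat(
    [train_sample.assign(is_train=True), test_sample.assign(is_train=False)],
    ignore_index=True,
)

# Split back after shared preprocessing; test rows never had a target.
train_back = combined[combined['is_train']].drop(columns='is_train')
test_back = combined[~combined['is_train']].drop(columns=['is_train', 'trip_duration'])
```

Alternatively, rows can be separated with `trip_duration.notna()`, which works here because the training target has no missing values.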

### Feature type change
- dates to `datetime`
- `vendor_id` and `passenger_count` to categorical

In [22]:
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])
df['vendor_id'] = df['vendor_id'].astype('category')
df['passenger_count'] = df['passenger_count'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2083778 entries, 0 to 2083777
Data columns (total 11 columns):
 #   Column              Dtype         
---  ------              -----         
 0   id                  object        
 1   vendor_id           category      
 2   pickup_datetime     datetime64[ns]
 3   dropoff_datetime    datetime64[ns]
 4   passenger_count     category      
 5   pickup_longitude    float64       
 6   pickup_latitude     float64       
 7   dropoff_longitude   float64       
 8   dropoff_latitude    float64       
 9   store_and_fwd_flag  object        
 10  trip_duration       float64       
dtypes: category(2), datetime64[ns](2), float64(5), object(2)
memory usage: 147.1+ MB
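
Note that `store_and_fwd_flag` is still `object` after this step. If a numeric or boolean form is wanted later, one option is a simple comparison, sketched below; the name `is_store_and_fwd` is hypothetical.

```python
import pandas as pd

# Stand-in for df['store_and_fwd_flag'], which holds 'Y' or 'N'.
flags = pd.Series(['N', 'Y', 'N'], name='store_and_fwd_flag')

# True where the record was stored in the vehicle before being forwarded.
is_store_and_fwd = flags.eq('Y')
```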


### Checking whether `trip_duration` matches the pickup and dropoff timestamps
- **everything is alright: no mismatches found**

In [27]:
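# Elapsed seconds between pickup and dropoff; NaN for test rows (no dropoff time).
interval_seconds = (df['dropoff_datetime'] - df['pickup_datetime']).dt.total_seconds()
# Count rows where the recorded duration differs; NaN comparisons evaluate to
# False, so only train rows can be flagged here.
print((np.abs(interval_seconds - df['trip_duration']) > 0).sum())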

0
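
A slightly more explicit variant of this check restricts it to rows that actually have a target and compares with a floating-point tolerance. This is a sketch on three stand-in rows (two train rows from the preview above and one test-like row), not the notebook's own code.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(
        ['2016-03-14 17:24:55', '2016-06-12 00:43:35', '2016-06-30 23:59:58']),
    'dropoff_datetime': pd.to_datetime(
        ['2016-03-14 17:32:30', '2016-06-12 00:54:38', None]),
    'trip_duration': [455.0, 663.0, np.nan],
})

# Only rows with a known target can be verified.
has_target = sample['trip_duration'].notna()
known = sample[has_target]

elapsed = (known['dropoff_datetime'] - known['pickup_datetime']).dt.total_seconds()
mismatches = int((~np.isclose(elapsed, known['trip_duration'])).sum())
```

Filtering first makes it explicit that the test rows are skipped rather than silently passing because of NaN comparisons.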


## 2. Features visualization