
## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC), then sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)
* #### trip_duration - duration of the trip in seconds

In [1]:
# Importing libraries
import pandas as pd
import seaborn as sns

In [2]:
# Importing the necessary data
data = pd.read_csv('/content/drive/MyDrive/NYC Taxi/NYC Taxi Data.csv')

In [3]:
# Viewing the first 5 rows of the data
data.head()

,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435
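
As a quick sanity check on the field definitions, the sketch below recomputes the duration of the first row shown above from its two timestamps, using only the standard library. The variable names are illustrative and not part of the notebook.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'  # timestamp layout seen in the sample rows

# Values copied from the first row of the output above
pickup = datetime.strptime('2016-03-14 17:24:55', FMT)
dropoff = datetime.strptime('2016-03-14 17:32:30', FMT)

# trip_duration should equal the elapsed time between the two, in seconds
elapsed_seconds = int((dropoff - pickup).total_seconds())
print(elapsed_seconds)  # 455, matching the trip_duration column for that row
```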


In [4]:
# Understanding the statistics of the columns
data.describe()

,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration
count,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0
mean,1.53495,1.66453,-73.97349,40.75092,-73.97342,40.7518,959.4923
std,0.4987772,1.314242,0.07090186,0.03288119,0.07064327,0.03589056,5237.432
min,1.0,0.0,-121.9333,34.3597,-121.9333,32.18114,1.0
25%,1.0,1.0,-73.99187,40.73735,-73.99133,40.73588,397.0
50%,2.0,1.0,-73.98174,40.7541,-73.97975,40.75452,662.0
75%,2.0,2.0,-73.96733,40.76836,-73.96301,40.76981,1075.0
max,2.0,9.0,-61.33553,51.88108,-61.33553,43.92103,3526282.0


In [5]:
# Brief description of the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


In [6]:
# Checking for null values
data.isna().sum()

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [7]:
# Dropping the 'id' column
data.drop('id', axis = 1, inplace = True)

In [8]:
# Understanding the 'vendor id' column
data['vendor_id'].value_counts()

2    780302
1    678342
Name: vendor_id, dtype: int64

In [9]:
# Unique items of 'passenger count'
data['passenger_count'].unique()

array([1, 6, 4, 2, 3, 5, 0, 7, 9, 8])

In [10]:
# Number of empty trips or trips with no passengers
len(data[data['passenger_count'] == 0])

60

In [11]:
# Dropping rows with a passenger count of zero
data.drop(data[data['passenger_count'] == 0].index, axis = 0, inplace= True)

In [12]:
# Importing datetime to help convert strings to datetime objects
from datetime import datetime

In [13]:
# Creating a function to convert string to datetime
def str_to_datetime(rows):
  d = datetime.strptime(rows, '%Y-%m-%d %H:%M:%S')
  return d

In [14]:
# Applying the function to columns
data['dropoff_datetime'] = data['dropoff_datetime'].apply(str_to_datetime)
data['pickup_datetime'] = data['pickup_datetime'].apply(str_to_datetime)
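
Applying a Python function to each value works, but on roughly 1.46 million rows a vectorized conversion is usually faster. The sketch below shows one possible alternative with `pd.to_datetime` on a two-row sample copied from the output above; the `sample` frame is an illustrative stand-in for `data`.

```python
import pandas as pd

# Two rows copied from the sample output; stands in for the full DataFrame
sample = pd.DataFrame({
    'pickup_datetime': ['2016-03-14 17:24:55', '2016-06-12 00:43:35'],
    'dropoff_datetime': ['2016-03-14 17:32:30', '2016-06-12 00:54:38'],
})

# Convert both columns in one vectorized call each; an explicit format avoids guessing
for col in ['pickup_datetime', 'dropoff_datetime']:
    sample[col] = pd.to_datetime(sample[col], format='%Y-%m-%d %H:%M:%S')

# The .dt accessor is available once the dtype is datetime64
pickup_hours = sample['pickup_datetime'].dt.hour.tolist()
pickup_weekdays = sample['pickup_datetime'].dt.dayofweek.tolist()  # Monday = 0
```

The same loop could be run on `data` in place of the two `.apply(str_to_datetime)` calls, assuming every value follows the format shown.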

In [15]:
# Creating new columns from the pickup date and time
data['trip_month'] = data['pickup_datetime'].dt.month
data['trip_date'] = data['pickup_datetime'].dt.day  # day of the month
data['trip_day_of_week'] = data['pickup_datetime'].dt.dayofweek  # Monday = 0, Sunday = 6
data['trip_hour'] = data['pickup_datetime'].dt.hour

In [16]:
# Installing haversine package
!pip install haversine



In [17]:
# Importing haversine
import haversine as hs

In [18]:
# Combining coordinates into (latitude, longitude) tuples, the order haversine expects
data['pickup_point'] = list(zip(data['pickup_latitude'], data['pickup_longitude']))
data['dropoff_point'] = list(zip(data['dropoff_latitude'], data['dropoff_longitude']))
data['points'] = list(zip(data['pickup_point'],data['dropoff_point']))

In [19]:
# Creating a function that returns the distance in km between 2 points
def distance(rows):
  # rows is a (pickup_point, dropoff_point) pair of (latitude, longitude) tuples
  kms = hs.haversine(rows[0], rows[1])
  return kms

In [20]:
# New column with the haversine distance (in km) between pickup and dropoff
data['trip_distance'] = data['points'].apply(distance)
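
For reference, the haversine formula itself is short enough to write out with NumPy, which also lets it run on whole columns at once instead of row by row. This is a minimal sketch, not the `haversine` package's implementation; the function name `haversine_km` and the radius constant are assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the first row in the sample output (a short trip of roughly 1.5 km)
d = haversine_km(40.767937, -73.982155, 40.765602, -73.96463)
```

On the full data this could be called with the four coordinate columns directly, for example `haversine_km(data['pickup_latitude'], data['pickup_longitude'], data['dropoff_latitude'], data['dropoff_longitude'])`, without building the intermediate tuple columns.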

In [21]:
# Creating a dataframe with dummy variables for 'store and fwd flag' column
flag_df = pd.get_dummies(data= data['store_and_fwd_flag'], prefix = 'store_and_fwd_flag')

In [22]:
# Concatenating flag_df to original data
data = pd.concat([data,flag_df], axis =1)
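
To make the result of the encoding concrete, the sketch below runs `pd.get_dummies` on a three-value toy Series. The `flags` Series is invented for illustration; only the column name and the `Y`/`N` values come from the data description.

```python
import pandas as pd

# Toy stand-in for data['store_and_fwd_flag']
flags = pd.Series(['N', 'Y', 'N'], name='store_and_fwd_flag')

# One indicator column is created per distinct value, named prefix_value
dummies = pd.get_dummies(flags, prefix='store_and_fwd_flag')
print(list(dummies.columns))  # ['store_and_fwd_flag_N', 'store_and_fwd_flag_Y']
```

Because the flag has only two values, the two indicator columns carry the same information (each is the complement of the other). If that redundancy matters for the model used later, passing `drop_first=True` to `pd.get_dummies` is one way to keep a single column.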

In [23]:
# Creating a function to remove outliers
def outlier_removal(df, column):
  # Keep only rows strictly between the 2.5th and 97.5th percentiles of the column
  ul = df[column].quantile(0.975)
  ll = df[column].quantile(0.025)
  return df[(df[column] > ll) & (df[column] < ul)]
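
A small usage sketch may help show what the percentile trimming does. The function is repeated so the block runs on its own, and the `toy` frame with values 1 to 100 is made up for illustration.

```python
import pandas as pd

def outlier_removal(df, column):
    # Same logic as the notebook function: keep rows strictly inside the 2.5%-97.5% band
    ll = df[column].quantile(0.025)
    ul = df[column].quantile(0.975)
    return df[(df[column] > ll) & (df[column] < ul)]

# Toy data: the integers 1..100 in a single column
toy = pd.DataFrame({'trip_duration': range(1, 101)})
trimmed = outlier_removal(toy, 'trip_duration')

print(len(trimmed))                    # 94 rows remain
print(trimmed['trip_duration'].min())  # 4
print(trimmed['trip_duration'].max())  # 97
```

Note that the function returns a filtered copy rather than modifying `df`, so the result has to be assigned back, and that the strict inequalities also drop any rows exactly equal to a percentile bound.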