# Taxi Fare Prediction Model: Feature Engineering & EDA

## Introduction
This project aims to develop a predictive model for taxi fares in NYC. Initially, we will create a model for NYC and adjust parameters to align with domain knowledge from Tbilisi. We hypothesize that factors such as time of day, seasonality, and holidays impact taxi demand and fare prices.



### Notebook Aim and Feature Development

#### **Objective**
The primary objective of our project is to develop a predictive model that accurately forecasts taxi fares. Initially focusing on New York City (NYC), we aim to expand and adapt the model to incorporate Tbilisi, employing localized domain knowledge to tailor our approach.

#### **Influence of Demand on Pricing**
The fare prices in the taxi industry are predominantly influenced by demand dynamics, which can fluctuate based on various factors including:

- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

#### **Additional Influential Features**
Beyond temporal and periodic factors, several other elements could influence fare pricing:

- **Passenger Count**: Exploring whether vehicles accommodating more passengers have different fare structures, similar to practices in ride-sharing applications.
- **Trip Distance and Duration**: Both metrics are crucial for pricing. While trip distance is a direct influencer, the duration might also affect costs, especially in varying traffic conditions.
- **Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.
- **Taxi Zones**: With the NYC taxi zone dataset, we can analyze whether specific pickup and dropoff locations impact fare prices due to their geographical significance.
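
As a rough illustration of the velocity feature described above, the sketch below derives average speed in miles per hour from distance and duration. It assumes `trip_distance` is in miles and `trip_duration` is in minutes; the helper name `add_average_speed` and the output column `avg_speed_mph` are hypothetical names chosen for this example.

```python
import pandas as pd


def add_average_speed(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an average-speed column in mph (hypothetical helper)."""
    out = trips.copy()
    hours = out["trip_duration"] / 60.0
    # Non-positive durations have no meaningful speed, so they become NaN
    out["avg_speed_mph"] = out["trip_distance"] / hours.where(hours > 0)
    return out


demo = pd.DataFrame({
    "trip_distance": [1.8, 17.55, 2.0],   # miles
    "trip_duration": [13.33, 24.5, 0.0],  # minutes
})
result = add_average_speed(demo)
print(result)
```

Masking zero or negative durations before dividing avoids infinite or negative speeds, which would otherwise distort summary statistics.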


### Summary
This comprehensive approach not only allows us to understand the multifaceted dynamics of taxi fare pricing in NYC but also sets a foundation for adapting the model to Tbilisi, ensuring that both city-specific and universal factors are considered for effective fare prediction.

## Data Loading

In this section we install the necessary packages, import the required libraries, and load the dataset.

In [1]:
!pip install pyarrow
!pip install fastparquet
!pip install geopandas



This notebook requires pyarrow version 16. If the code does not run, install or update pyarrow from the terminal.
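
A quick way to see which pyarrow version is installed, sketched with the standard library only. The helper names `installed_version` and `major_version` are hypothetical, and treating "16 or newer" as acceptable is an assumption, since the text only names version 16.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string: str) -> int:
    """Extract the leading integer from a version string such as '16.1.0'."""
    return int(version_string.split(".")[0])


pyarrow_version = installed_version("pyarrow")
if pyarrow_version is None:
    print("pyarrow is not installed")
elif major_version(pyarrow_version) < 16:
    print(f"pyarrow {pyarrow_version} found; this notebook expects version 16")
else:
    print(f"pyarrow {pyarrow_version} found")
```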

In [2]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
import geopandas as gpd

pd.set_option('display.float_format', lambda x: '%.2f' % x)
from pandas.tseries.holiday import USFederalHolidayCalendar


In [3]:
# Replace the path below with the location of your cleaned Parquet file
df_original = pd.read_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/cleaned_taxi_data.parquet', engine='pyarrow')  # or engine='fastparquet' if you prefer
df = df_original
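
Note that `df = df_original` binds a second name to the same DataFrame rather than copying it, so columns added through `df` also appear in `df_original` until `df` is reassigned. The toy frame below is a minimal sketch of that behaviour; all names in it are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"fare_amount": [9.3, 13.5]})

alias = original               # same object, no data copied
independent = original.copy()  # separate object with its own data

alias["flag"] = 1              # also visible through `original`
independent["other"] = 2       # not visible through `original`

print(alias is original)
print(list(original.columns))
```

For a dataset of this size, skipping the copy saves memory. Use `.copy()` only if the untouched original is needed later.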

Let's verify that the data is accurate and that the cleaned dataset still contains all the columns produced by the data preparation in the previous notebook.

## Initial Data View and Cleaning

Because we are working with data that was already cleaned in the previous notebook, we do not need to handle nulls, duplicates, or outliers again. To confirm that the data is consistent and clean, we take an initial look at the loaded dataset below.

In [3]:
df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee
count,32449058.0,32449058,32449058,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0
mean,1.75,2023-07-02 05:28:03.922574,2023-07-02 05:45:37.417722,1.39,3.57,1.58,165.37,164.2,1.2,19.73,1.61,0.5,3.59,0.6,1.0,28.91,2.5,0.14,0.01
min,1.0,2023-01-01 00:00:05,2023-01-01 00:04:16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.5,0.0,0.0,1.0,0.01,2.5,0.0,0.0
25%,2.0,2023-04-02 00:56:01.250000,2023-04-02 01:10:36,1.0,1.09,1.0,132.0,114.0,1.0,9.3,0.0,0.5,1.0,0.0,1.0,15.96,2.5,0.0,0.0
50%,2.0,2023-06-26 15:42:35.500000,2023-06-26 16:02:16.500000,1.0,1.8,1.0,162.0,162.0,1.0,13.5,1.0,0.5,2.88,0.0,1.0,21.0,2.5,0.0,0.0
75%,2.0,2023-10-06 10:59:01.750000,2023-10-06 11:17:46,1.0,3.42,1.0,234.0,234.0,1.0,21.9,2.5,0.5,4.48,0.0,1.0,30.72,2.5,0.0,0.0
max,2.0,2023-12-31 23:57:45,2023-12-31 23:59:56,6.0,30.0,99.0,265.0,265.0,4.0,300.0,50.0,0.6,500.0,50.0,1.0,500.0,2.75,1.75,1.25
std,0.43,,,0.88,4.46,7.13,63.57,69.75,0.46,17.92,1.83,0.0,4.04,2.18,0.0,22.74,0.0,0.46,0.11


In [5]:
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                   int64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag             category
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
General_Airport_Fee             float64
JFK_LGA_Pickup_Fee              float64
distance_bins                  category
dtype: object

In [6]:
df.isnull().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
congestion_surcharge     0
General_Airport_Fee      0
JFK_LGA_Pickup_Fee       0
distance_bins            0
dtype: int64

### Additional Dataset For Taxi Zones Loading

In [4]:
zones = pd.read_csv("/Users/md/Desktop/python_project/parquet_files/cleaned/taxi_zones.csv", sep=';')
zones.describe()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,LocationID
count,263.0,263.0,263.0,263.0
mean,132.0,0.09,0.0,131.98
std,76.07,0.05,0.0,76.07
min,1.0,0.01,0.0,1.0
25%,66.5,0.05,0.0,66.5
50%,132.0,0.08,0.0,132.0
75%,197.5,0.12,0.0,197.5
max,263.0,0.43,0.0,263.0


In [5]:
zones.sample(5)

Unnamed: 0,OBJECTID,Shape_Leng,the_geom,Shape_Area,zone,LocationID,borough
17,15,0.14,MULTIPOLYGON (((-73.7774039129087 40.796598241...,0.0,Bay Terrace/Fort Totten,15,Queens
174,174,0.07,MULTIPOLYGON (((-73.87772817699982 40.88345419...,0.0,Norwood,174,Bronx
259,259,0.13,MULTIPOLYGON (((-73.85107116191898 40.91037152...,0.0,Woodlawn/Wakefield,259,Bronx
257,255,0.06,MULTIPOLYGON (((-73.96176070375392 40.72522879...,0.0,Williamsburg (North Side),255,Brooklyn
117,117,0.17,MULTIPOLYGON (((-73.7763584369479 40.609655838...,0.0,Hammels/Arverne,117,Queens


The zones file has 263 rows, while the trip data uses location IDs up to 265. When joining the two, we will therefore have to deal with missing values, either by identifying the missing zones or by removing the affected rows.

# Feature Engineering

Below, based on our domain knowledge and literature review, we create new features or adjust existing ones to gain more insight into the data and build the best possible predictive model.

## Seasonal and Time Features


- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Duration**: Measuring how long each trip lasted, in minutes.

In [6]:
# Convert the pickup and dropoff datetime to pandas datetime format if not already
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Time of day segmentation
df['pickup_time_of_day'] = df['tpep_pickup_datetime'].dt.hour.apply(lambda x: 'morning' if 5 <= x <= 11
                                                                           else 'afternoon' if 12 <= x <= 17
                                                                           else 'evening' if 18 <= x <= 23
                                                                           else 'night')

# Seasons segmentation
df['pickup_season'] = df['tpep_pickup_datetime'].dt.month.apply(lambda x: 'spring' if 3 <= x <= 5
                                                                       else 'summer' if 6 <= x <= 8
                                                                       else 'autumn' if 9 <= x <= 11
                                                                       else 'winter')

# Passenger count categories
df['passenger_count_category'] = pd.cut(df['passenger_count'], bins=[0, 1, 4, 6], include_lowest=True, 
                                        labels=['low', 'medium', 'high'])

# Weekday/Weekend segmentation
df['pickup_day_type'] = df['tpep_pickup_datetime'].dt.day_name().apply(lambda x: 'weekend' if x in ['Saturday', 'Sunday'] else 'weekday')


# Calendar components of the pickup timestamp (year, month, day, hour)
df['transaction_year'] = df['tpep_pickup_datetime'].dt.year
df['transaction_month'] = df['tpep_pickup_datetime'].dt.month
df['transaction_day'] =  df['tpep_pickup_datetime'].dt.day
df['transaction_hour'] = df['tpep_pickup_datetime'].dt.hour

#trip duration is another interesting feature to analyze 


# Calculate the trip duration and convert it to minutes
df['trip_duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
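
The `.apply` calls above run a Python function once per row, which can be slow on tens of millions of rows. The sketch below shows one possible vectorized alternative using the same hour, month, and weekday boundaries. It is an illustration on a handful of timestamps, not the code the notebook ran, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

pickup = pd.Series(pd.to_datetime([
    "2023-01-20 04:59:59",
    "2023-01-20 05:00:00",
    "2023-07-19 12:00:00",
    "2023-10-17 18:30:00",
    "2023-12-31 23:59:59",
]))

# Time of day: same inclusive hour ranges as the lambda version
hour = pickup.dt.hour
time_of_day = np.select(
    [hour.between(5, 11), hour.between(12, 17), hour.between(18, 23)],
    ["morning", "afternoon", "evening"],
    default="night",
)

# Season: month-to-season lookup applied with a single map call
season_by_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
season = pickup.dt.month.map(season_by_month)

# Day type: dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
day_type = np.where(pickup.dt.dayofweek >= 5, "weekend", "weekday")

print(list(time_of_day))
print(season.tolist())
print(list(day_type))
```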


Let's look at the adjusted dataset and the newly created features. We sample the dataset to check that the features were created correctly.

In [7]:
df.sample(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
17904517,2,2023-01-20 14:04:00,2023-01-20 14:17:20,1,1.8,1.0,N,246,50,2,...,1-2 miles,afternoon,winter,low,weekday,2023,1,20,14,13.33
4196376,2,2023-07-19 10:55:06,2023-07-19 10:58:52,1,1.0,1.0,N,107,170,1,...,1-2 miles,morning,summer,low,weekday,2023,7,19,10,3.77
20425903,2,2023-10-17 12:52:31,2023-10-17 13:15:49,1,4.49,1.0,N,231,68,1,...,2-5 miles,afternoon,autumn,low,weekday,2023,10,17,12,23.3
16016993,1,2023-11-28 13:11:22,2023-11-28 13:36:38,1,3.1,1.0,N,143,186,2,...,2-5 miles,afternoon,autumn,low,weekday,2023,11,28,13,25.27
22315601,1,2023-09-06 19:11:10,2023-09-06 19:42:49,1,1.3,1.0,N,161,186,1,...,1-2 miles,evening,autumn,low,weekday,2023,9,6,19,31.65
15904090,2,2023-11-27 08:00:14,2023-11-27 08:07:42,1,1.0,1.0,N,186,234,1,...,1-2 miles,morning,autumn,low,weekday,2023,11,27,8,7.47
8308596,1,2023-04-01 11:56:39,2023-04-01 11:59:46,1,1.0,1.0,N,100,170,1,...,1-2 miles,morning,spring,low,weekend,2023,4,1,11,3.12
19194519,1,2023-10-04 10:52:07,2023-10-04 11:06:25,1,2.5,1.0,N,142,166,1,...,2-5 miles,morning,autumn,low,weekday,2023,10,4,10,14.3
14780647,2,2023-11-14 09:39:05,2023-11-14 09:53:40,1,1.91,1.0,N,140,142,1,...,1-2 miles,morning,autumn,low,weekday,2023,11,14,9,14.58
6581768,2,2023-05-14 07:01:10,2023-05-14 07:25:40,1,17.55,2.0,N,132,162,2,...,10-20 miles,morning,spring,low,weekend,2023,5,14,7,24.5


Below we check whether the trip duration calculations are correct.

In [8]:
# Display a random sample of rows to confirm the new 'trip_duration' column
print(df[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']].sample(10))

         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
18954181  2023-10-01 14:56:51   2023-10-01 15:14:58          18.12
9101451   2023-04-10 11:54:27   2023-04-10 12:05:41          11.23
13364641  2023-08-30 17:25:40   2023-08-30 17:34:37           8.95
30595479  2023-03-12 00:04:34   2023-03-12 00:17:37          13.05
27276929  2023-02-03 12:14:34   2023-02-03 12:24:25           9.85
29380016  2023-02-27 08:03:29   2023-02-27 08:18:25          14.93
23530505  2023-09-19 16:30:02   2023-09-19 16:42:01          11.98
10827775  2023-04-28 12:08:36   2023-04-28 12:16:04           7.47
14741836  2023-11-13 18:09:46   2023-11-13 18:16:10           6.40
12737121  2023-08-22 11:33:11   2023-08-22 11:51:19          18.13
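
The sampled values can also be checked programmatically. The sketch below recomputes the duration for two of the rows shown above and compares it with the displayed, two-decimal values. The 0.005 tolerance is an assumption that accounts for display rounding.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-10-01 14:56:51", "2023-04-10 11:54:27"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-10-01 15:14:58", "2023-04-10 12:05:41"]),
    "trip_duration": [18.12, 11.23],  # values as displayed in the output above
})

# Recompute the duration in minutes from the two timestamps
recomputed = (
    sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Allow for the two-decimal rounding of the displayed values
matches = np.isclose(recomputed, sample["trip_duration"], atol=0.005)
print(matches)
```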


To check if newly added features have correct values we will use descriptive statistics and adjust accordingly if needed.

In [9]:
df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,improvement_surcharge,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,32449058.0,32449058,32449058,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,...,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0
mean,1.75,2023-07-02 05:28:03.922574,2023-07-02 05:45:37.417722,1.39,3.57,1.58,165.37,164.2,1.2,19.73,...,1.0,28.91,2.5,0.14,0.01,2023.0,6.52,15.55,14.31,17.56
min,1.0,2023-01-01 00:00:05,2023-01-01 00:04:16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.01,2.5,0.0,0.0,2023.0,1.0,1.0,0.0,-476.25
25%,2.0,2023-04-02 00:56:01.250000,2023-04-02 01:10:36,1.0,1.09,1.0,132.0,114.0,1.0,9.3,...,1.0,15.96,2.5,0.0,0.0,2023.0,4.0,8.0,11.0,7.72
50%,2.0,2023-06-26 15:42:35.500000,2023-06-26 16:02:16.500000,1.0,1.8,1.0,162.0,162.0,1.0,13.5,...,1.0,21.0,2.5,0.0,0.0,2023.0,6.0,15.0,15.0,12.65
75%,2.0,2023-10-06 10:59:01.750000,2023-10-06 11:17:46,1.0,3.42,1.0,234.0,234.0,1.0,21.9,...,1.0,30.72,2.5,0.0,0.0,2023.0,10.0,23.0,19.0,20.63
max,2.0,2023-12-31 23:57:45,2023-12-31 23:59:56,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,1.0,500.0,2.75,1.75,1.25,2023.0,12.0,31.0,23.0,7053.62
std,0.43,,,0.88,4.46,7.13,63.57,69.75,0.46,17.92,...,0.0,22.74,0.0,0.46,0.11,0.0,3.46,8.72,5.78,41.75


The year, month, day, and hour features all fall within their expected ranges. Trip duration, however, has a minimum of -476.25 minutes, so some trips have negative durations.

Negative trip durations may have occurred because of data entry issues, for example pickup and dropoff times being mixed up. We investigate how many negative values there are, and then either drop the corrupted rows or adjust them accordingly.

In [10]:
# Display cases with negative trip_duration
negative_durations = df[df['trip_duration'] < 0]
negative_durations[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]

negative_durations.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,improvement_surcharge,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,727.0,727,727,727.0,727.0,727.0,727.0,727.0,727.0,727.0,...,727.0,727.0,727.0,727.0,727.0,727.0,727.0,727.0,727.0,727.0
mean,1.69,2023-10-19 09:42:54.950481,2023-10-19 09:01:44.551581,1.36,3.82,11.27,150.67,151.1,1.11,21.06,...,1.0,29.11,2.5,0.02,0.0,2023.0,10.4,6.55,2.38,-41.17
min,1.0,2023-01-20 13:35:00,2023-01-20 13:05:48,1.0,1.0,1.0,3.0,4.0,1.0,4.4,...,1.0,9.4,2.5,0.0,0.0,2023.0,1.0,1.0,1.0,-476.25
25%,1.0,2023-11-05 01:43:57.500000,2023-11-05 01:01:55,1.0,1.7,1.0,90.0,87.0,1.0,12.8,...,1.0,19.68,2.5,0.0,0.0,2023.0,11.0,5.0,1.0,-49.08
50%,2.0,2023-11-05 01:51:51,2023-11-05 01:05:36,1.0,3.06,1.0,148.0,148.0,1.0,17.7,...,1.0,25.4,2.5,0.0,0.0,2023.0,11.0,5.0,1.0,-43.82
75%,2.0,2023-11-05 01:56:14.500000,2023-11-05 01:10:16,2.0,4.96,1.0,222.5,231.0,1.0,26.5,...,1.0,35.1,2.5,0.0,0.0,2023.0,11.0,5.0,1.0,-35.22
max,2.0,2023-12-31 09:20:00,2023-12-31 09:10:02,4.0,19.7,99.0,264.0,265.0,4.0,80.0,...,1.0,103.36,2.5,1.75,0.0,2023.0,12.0,31.0,21.0,-0.03
std,0.46,,,0.7,2.95,30.0,67.36,74.31,0.42,11.67,...,0.0,13.6,0.0,0.16,0.0,0.0,1.94,5.14,3.91,21.34


In [11]:
# Check for possible datetime swaps or errors
swapped_cases = df[df['tpep_pickup_datetime'] > df['tpep_dropoff_datetime']]
print(swapped_cases[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']])


         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
432181    2023-06-05 15:30:00   2023-06-05 15:15:19         -14.68
434497    2023-06-05 15:07:00   2023-06-05 14:02:04         -64.93
1177236   2023-06-13 11:35:00   2023-06-13 11:26:42          -8.30
1493285   2023-06-16 11:47:00   2023-06-16 11:18:02         -28.97
1500379   2023-06-16 12:01:00   2023-06-16 11:32:43         -28.28
...                       ...                   ...            ...
31363333  2023-03-20 11:30:00   2023-03-20 11:00:24         -29.60
31473492  2023-03-21 17:50:35   2023-03-21 17:00:25         -50.17
31624151  2023-03-23 11:00:00   2023-03-23 10:30:28         -29.53
32248964  2023-03-30 08:14:00   2023-03-30 08:00:35         -13.42
32412057  2023-03-31 18:01:00   2023-03-31 17:15:20         -45.67

[727 rows x 3 columns]
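
The quartiles of the negative-duration rows cluster between 01:00 and 02:00 on 2023-11-05, the night daylight saving time ended in the US. One possible explanation, offered here as an assumption the notebook does not verify, is that these trips started before the clocks were set back and ended after. The sketch below uses made-up timestamps to show how such a trip looks negative when the times are timezone-naive.

```python
import pandas as pd

NY = "America/New_York"

# Hypothetical wall-clock times on the night the clocks were set back
pickup_naive = pd.Timestamp("2023-11-05 01:50:00")   # still on daylight time
dropoff_naive = pd.Timestamp("2023-11-05 01:05:00")  # already on standard time

naive_minutes = (dropoff_naive - pickup_naive).total_seconds() / 60

# ambiguous=True selects the daylight-time reading, False the standard-time one
pickup_local = pickup_naive.tz_localize(NY, ambiguous=True)
dropoff_local = dropoff_naive.tz_localize(NY, ambiguous=False)
true_minutes = (dropoff_local - pickup_local).total_seconds() / 60

print(naive_minutes)  # -45.0
print(true_minutes)   # 15.0
```

Because the raw timestamps carry no UTC offset, the correct reading of an ambiguous hour cannot be recovered from the data alone, which is one argument for dropping these rows rather than repairing them.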


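There are only 727 negative values, which is a very small number compared to the full dataset. Instead of going through more than 30 million records to swap the timestamps, we drop all rows with a trip duration less than or equal to 0. The row count falls from 32,449,058 to 32,447,704, so the filter removes 1,354 rows in total: the 727 negative durations plus 627 trips with a duration of exactly zero.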

In [12]:
df = df[df['trip_duration']>0]
df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,improvement_surcharge,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,32447704.0,32447704,32447704,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,...,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0,32447704.0
mean,1.75,2023-07-02 05:24:40.256013,2023-07-02 05:42:13.850476,1.39,3.57,1.58,165.37,164.2,1.2,19.73,...,1.0,28.91,2.5,0.14,0.01,2023.0,6.52,15.55,14.31,17.56
min,1.0,2023-01-01 00:00:05,2023-01-01 00:04:16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.01,2.5,0.0,0.0,2023.0,1.0,1.0,0.0,0.02
25%,2.0,2023-04-02 00:54:16,2023-04-02 01:08:53,1.0,1.09,1.0,132.0,114.0,1.0,9.3,...,1.0,15.96,2.5,0.0,0.0,2023.0,4.0,8.0,11.0,7.72
50%,2.0,2023-06-26 15:39:06,2023-06-26 15:58:38,1.0,1.8,1.0,162.0,162.0,1.0,13.5,...,1.0,21.0,2.5,0.0,0.0,2023.0,6.0,15.0,15.0,12.65
75%,2.0,2023-10-06 10:52:41.250000,2023-10-06 11:11:23,1.0,3.42,1.0,234.0,234.0,1.0,21.9,...,1.0,30.72,2.5,0.0,0.0,2023.0,10.0,23.0,19.0,20.63
max,2.0,2023-12-31 23:57:45,2023-12-31 23:59:56,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,1.0,500.0,2.75,1.75,1.25,2023.0,12.0,31.0,23.0,7053.62
std,0.43,,,0.88,4.46,7.13,63.57,69.75,0.46,17.92,...,0.0,22.74,0.0,0.46,0.11,0.0,3.46,8.72,5.78,41.75


## Taxi Zones Feature
Taxi zone IDs alone do not tell us where a passenger was picked up, and neighbourhoods are thought to affect pricing, at least when hailing a cab. We therefore merge the taxi zone dataset with the NYC trip data on zone IDs and identify the pickup and dropoff boroughs for each trip.

In [13]:
# Merge the zone data into the main taxi trip dataset for pickup locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='PULocationID', right_on='LocationID', how='left')
df.rename(columns={'zone': 'PUzone', 'borough': 'PUborough'}, inplace=True)

# Merge the zone data for dropoff locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='DOLocationID', right_on='LocationID', how='left', suffixes=('', '_drop'))
df.rename(columns={'zone': 'DOzone', 'borough': 'DOborough'}, inplace=True)

# Drop the extra LocationID columns if they are not needed
df.drop(['LocationID', 'LocationID_drop'], axis=1, inplace=True)
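
A left merge adds rows whenever the lookup table repeats a key. The zones file has 263 rows, but its printed `LocationID` list skips some values, and the row count reported later for `is_holiday` (32,458,221) is higher than the 32,447,704 rows left after the duration filter. Both observations are consistent with repeated `LocationID` values in `taxi_zones.csv`, although the notebook does not check this directly. The sketch below, on a made-up lookup table, shows how `validate="many_to_one"` detects the problem and how dropping duplicate keys first keeps the row count stable.

```python
import pandas as pd

trips = pd.DataFrame({"PULocationID": [56, 132, 56]})

# Hypothetical lookup in which one LocationID appears twice
zones_demo = pd.DataFrame({
    "LocationID": [56, 56, 132],
    "zone": ["Corona", "Corona", "JFK Airport"],
    "borough": ["Queens", "Queens", "Queens"],
})

# A plain left merge silently duplicates trips that match the repeated key
naive = trips.merge(zones_demo, left_on="PULocationID", right_on="LocationID", how="left")
print(len(trips), len(naive))  # 3 5

# validate="many_to_one" raises MergeError when right-hand keys are not unique
try:
    trips.merge(zones_demo, left_on="PULocationID", right_on="LocationID",
                how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# Keeping one row per LocationID preserves the number of trips
unique_zones = zones_demo.drop_duplicates(subset="LocationID")
safe = trips.merge(unique_zones, left_on="PULocationID", right_on="LocationID",
                   how="left", validate="many_to_one")
print(len(safe))  # 3
```

Comparing `len(df)` before and after each merge is a cheap additional safeguard.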


In [14]:
print(df['PUborough'].value_counts())
print(df['DOborough'].value_counts())

PUborough
Manhattan        28745483
Queens            3208548
Brooklyn           166333
Bronx               43754
Staten Island        1602
EWR                   939
Name: count, dtype: int64
DOborough
Manhattan        28838887
Queens            1663434
Brooklyn          1243582
Bronx              185603
EWR                 94314
Staten Island        8976
Name: count, dtype: int64


In [15]:
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())

PUzone       291562
PUborough    291562
DOzone       423425
DOborough    423425
dtype: int64


In [16]:
print(sorted(zones['LocationID'].unique()))
print(sorted(df['PULocationID'].unique()))
print(sorted(df['DOLocationID'].unique()))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 2

The trip data contains two zone IDs, 264 and 265, that have no specific borough and do not appear in our taxi zones dataset.

In [17]:
missing_pu = df[~df['PULocationID'].isin(zones['LocationID'])]
missing_do = df[~df['DOLocationID'].isin(zones['LocationID'])]
print(f"Missing PULocationIDs: {missing_pu['PULocationID'].unique()}")
print(f"Missing DOLocationIDs: {missing_do['DOLocationID'].unique()}")

Missing PULocationIDs: [264  57 265 105]
Missing DOLocationIDs: [265 264  57 105]


In [18]:
# Filter data for PULocationID or DOLocationID being 264 or 265
trips = df[(df['PULocationID'].isin([264, 265])) | (df['DOLocationID'].isin([264, 265]))]

# Print the filtered data summary
trips.describe(include='all')


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,PUzone,PUborough,DOzone,DOborough
count,483705.0,483705,483705,483705.0,483705.0,483705.0,483705,483705.0,483705.0,483705.0,...,483705,483705.0,483705.0,483705.0,483705.0,483705.0,192217,192217,60918,60918
unique,,,,,,,2,,,,...,2,,,,,,242,6,250,6
top,,,,,,,N,,,,...,weekday,,,,,,JFK Airport,Queens,Times Sq/Theatre District,Manhattan
freq,,,,,,,480380,,,,...,346393,,,,,,68592,95556,2322,48504
mean,1.74,2023-06-22 18:07:16.492603,2023-06-22 18:32:35.845915,1.41,7.6,1.94,,217.58,250.55,1.24,...,,2023.0,6.21,15.55,14.27,25.32,,,,
min,1.0,2023-01-01 00:02:13,2023-01-01 00:10:54,1.0,1.0,1.0,,1.0,1.0,1.0,...,,2023.0,1.0,1.0,0.0,0.02,,,,
25%,1.0,2023-03-25 23:16:02,2023-03-25 23:37:36,1.0,1.32,1.0,,138.0,264.0,1.0,...,,2023.0,3.0,8.0,10.0,9.4,,,,
50%,2.0,2023-06-17 02:05:33,2023-06-17 02:22:55,1.0,2.96,1.0,,264.0,264.0,1.0,...,,2023.0,6.0,15.0,15.0,17.27,,,,
75%,2.0,2023-09-18 16:05:24,2023-09-18 16:46:41,2.0,11.4,2.0,,264.0,265.0,1.0,...,,2023.0,9.0,23.0,19.0,31.53,,,,
max,2.0,2023-12-31 23:53:07,2023-12-31 23:59:46,6.0,30.0,99.0,,265.0,265.0,4.0,...,,2023.0,12.0,31.0,23.0,5501.72,,,,


In [19]:
# Display sample records
trips.sample(10)


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,PUzone,PUborough,DOzone,DOborough
31143160,2,2023-03-17 20:55:30,2023-03-17 21:13:57,1,2.8,1.0,N,264,264,1,...,weekday,2023,3,17,20,18.45,,,,
24568333,1,2023-12-04 08:25:28,2023-12-04 08:25:46,2,8.6,5.0,N,265,265,1,...,weekday,2023,12,4,8,0.3,,,,
31591411,2,2023-03-22 20:17:33,2023-03-22 20:38:32,1,6.76,1.0,N,132,265,2,...,weekday,2023,3,22,20,20.98,JFK Airport,Queens,,
9833818,2,2023-04-18 08:35:58,2023-04-18 08:36:03,2,1.0,5.0,N,75,264,1,...,weekday,2023,4,18,8,0.08,East Harlem South,Manhattan,,
28935806,2,2023-02-22 10:48:34,2023-02-22 11:18:43,1,1.63,1.0,N,264,264,1,...,weekday,2023,2,22,10,30.15,,,,
8584184,2,2023-04-04 14:38:42,2023-04-04 14:55:40,1,2.66,1.0,N,264,264,2,...,weekday,2023,4,4,14,16.97,,,,
15786919,2,2023-11-25 15:53:07,2023-11-25 16:27:33,1,12.88,4.0,N,132,265,1,...,weekend,2023,11,25,15,34.43,JFK Airport,Queens,,
27219153,2,2023-02-02 18:19:25,2023-02-02 18:25:57,1,1.0,1.0,N,264,264,2,...,weekday,2023,2,2,18,6.53,,,,
10163861,2,2023-04-21 13:24:11,2023-04-21 13:25:39,1,1.0,1.0,N,264,264,1,...,weekday,2023,4,21,13,1.47,,,,
23570425,2,2023-09-19 22:23:19,2023-09-19 22:44:18,1,6.18,1.0,N,264,264,1,...,weekday,2023,9,19,22,20.98,,,,


In [20]:
# Manually assign placeholder labels for IDs 264 and 265
# (the labels are this notebook's own choice; neither ID appears in the zones file)
df.loc[df['PULocationID'] == 264, ['PUzone', 'PUborough']] = ['Outside NYC', 'Unknown']
df.loc[df['DOLocationID'] == 264, ['DOzone', 'DOborough']] = ['Outside NYC', 'Unknown']
df.loc[df['PULocationID'] == 265, ['PUzone', 'PUborough']] = ['Airport Area', 'Unknown']
df.loc[df['DOLocationID'] == 265, ['DOzone', 'DOborough']] = ['Airport Area', 'Unknown']


In [21]:
# Check for null values in the updated columns
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())



PUzone        83
PUborough     83
DOzone       641
DOborough    641
dtype: int64


In [22]:
# Print rows where PUzone or PUborough is null
print("Rows with missing PUzone or PUborough:")
print(df[df['PUzone'].isnull() | df['PUborough'].isnull()][['PULocationID', 'PUzone', 'PUborough']].head())

# Print rows where DOzone or DOborough is null
print("Rows with missing DOzone or DOborough:")
print(df[df['DOzone'].isnull() | df['DOborough'].isnull()][['DOLocationID', 'DOzone', 'DOborough']].head())


Rows with missing PUzone or PUborough:
        PULocationID PUzone PUborough
8592              57    NaN       NaN
207339            57    NaN       NaN
593658            57    NaN       NaN
885396            57    NaN       NaN
906936            57    NaN       NaN
Rows with missing DOzone or DOborough:
        DOLocationID DOzone DOborough
108257            57    NaN       NaN
122526            57    NaN       NaN
242892            57    NaN       NaN
265231            57    NaN       NaN
293002            57    NaN       NaN


In [23]:
# List unique LocationIDs associated with null zones or boroughs
missing_pu_ids = df[df['PUzone'].isnull()]['PULocationID'].unique()
missing_do_ids = df[df['DOzone'].isnull()]['DOLocationID'].unique()
print(f"Missing PULocationIDs: {missing_pu_ids}")
print(f"Missing DOLocationIDs: {missing_do_ids}")


Missing PULocationIDs: [ 57 105]
Missing DOLocationIDs: [ 57 105]


In [24]:
# Manually assign zones and boroughs for LocationID 57 and 105
df.loc[df['PULocationID'] == 57, ['PUzone', 'PUborough']] = ['Corona', 'Queens']
df.loc[df['DOLocationID'] == 57, ['DOzone', 'DOborough']] = ['Corona', 'Queens']

df.loc[df['PULocationID'] == 105, ['PUzone', 'PUborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
df.loc[df['DOLocationID'] == 105, ['DOzone', 'DOborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
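
The manual assignments can also be expressed as a small supplementary lookup applied in one pass. The sketch below is one possible way to do this. The labels are copied from the cells above, while the helper `fill_missing_zones` and the frame `extra_zones` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Labels copied from the manual assignments in this notebook
extra_zones = pd.DataFrame({
    "LocationID": [57, 105, 264, 265],
    "zone": ["Corona", "Governor's Island/Ellis Island/Liberty Island",
             "Outside NYC", "Airport Area"],
    "borough": ["Queens", "Manhattan", "Unknown", "Unknown"],
})


def fill_missing_zones(trips, id_col, zone_col, borough_col, lookup):
    """Fill null zone/borough values from a LocationID lookup (hypothetical helper)."""
    out = trips.copy()
    by_id = lookup.set_index("LocationID")
    # Only null entries are filled; existing labels are left untouched
    out[zone_col] = out[zone_col].fillna(out[id_col].map(by_id["zone"]))
    out[borough_col] = out[borough_col].fillna(out[id_col].map(by_id["borough"]))
    return out


demo = pd.DataFrame({
    "PULocationID": [132, 57, 264],
    "PUzone": ["JFK Airport", None, None],
    "PUborough": ["Queens", None, None],
})
filled = fill_missing_zones(demo, "PULocationID", "PUzone", "PUborough", extra_zones)
print(filled)
```

Keeping the overrides in one table makes them easier to review and reuse for both the pickup and dropoff columns.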


In [25]:
# Verify updates for LocationID 57
print("Updated zones and boroughs for LocationID 57:")
print(df[df['PULocationID'] == 57][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 57][['DOLocationID', 'DOzone', 'DOborough']].head(2))

# Verify updates for LocationID 105
print("Updated zones and boroughs for LocationID 105:")
print(df[df['PULocationID'] == 105][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 105][['DOLocationID', 'DOzone', 'DOborough']].head(2))


Updated zones and boroughs for LocationID 57:
        PULocationID  PUzone PUborough
8592              57  Corona    Queens
207339            57  Corona    Queens
        DOLocationID  DOzone DOborough
108257            57  Corona    Queens
122526            57  Corona    Queens
Updated zones and boroughs for LocationID 105:
         PULocationID                                         PUzone  \
5825802           105  Governor's Island/Ellis Island/Liberty Island   
6088635           105  Governor's Island/Ellis Island/Liberty Island   

         PUborough  
5825802  Manhattan  
6088635  Manhattan  
          DOLocationID                                         DOzone  \
5014745            105  Governor's Island/Ellis Island/Liberty Island   
18863194           105  Governor's Island/Ellis Island/Liberty Island   

          DOborough  
5014745   Manhattan  
18863194  Manhattan  


In [26]:
# Check again for null values in the zone and borough columns
print("Null values in PUzone and PUborough after update:")
print(df[['PUzone', 'PUborough']].isnull().sum())

print("Null values in DOzone and DOborough after update:")
print(df[['DOzone', 'DOborough']].isnull().sum())


Null values in PUzone and PUborough after update:
PUzone       0
PUborough    0
dtype: int64
Null values in DOzone and DOborough after update:
DOzone       0
DOborough    0
dtype: int64


In [27]:
df.isnull().sum()

VendorID                    0
tpep_pickup_datetime        0
tpep_dropoff_datetime       0
passenger_count             0
trip_distance               0
RatecodeID                  0
store_and_fwd_flag          0
PULocationID                0
DOLocationID                0
payment_type                0
fare_amount                 0
extra                       0
mta_tax                     0
tip_amount                  0
tolls_amount                0
improvement_surcharge       0
total_amount                0
congestion_surcharge        0
General_Airport_Fee         0
JFK_LGA_Pickup_Fee          0
distance_bins               0
pickup_time_of_day          0
pickup_season               0
passenger_count_category    0
pickup_day_type             0
transaction_year            0
transaction_month           0
transaction_day             0
transaction_hour            0
trip_duration               0
PUzone                      0
PUborough                   0
DOzone                      0
DOborough 

## Holiday
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

In [28]:
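# Import the calendar class (harmless if it was already imported earlier in the notebook)
from pandas.tseries.holiday import USFederalHolidayCalendar

# Create a calendar object
calendar = USFederalHolidayCalendar()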

# Define the range for your data
start_date = '2023-01-01'
end_date = '2023-12-31'

# Generate holidays
holidays = calendar.holidays(start=start_date, end=end_date)

# Add a column to your dataframe indicating whether the trip started on a holiday
df['is_holiday'] = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)

In [29]:
df['is_holiday'].describe()

count   32458221.00
mean           0.02
std            0.15
min            0.00
25%            0.00
50%            0.00
75%            0.00
max            1.00
Name: is_holiday, dtype: float64
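
To see what the flag logic does on a small scale, the sketch below applies the same `normalize()` plus `isin()` pattern to three hand-picked pickup times. The frame `toy` and the name `federal_holidays` are illustrative assumptions, not objects from this notebook.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical three-row frame used only for illustration
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-07-04 08:50:22",  # Independence Day
        "2023-07-05 10:00:00",  # ordinary Wednesday
        "2023-12-25 23:59:57",  # Christmas Day, late evening
    ])
})

federal_holidays = USFederalHolidayCalendar().holidays(
    start="2023-01-01", end="2023-12-31"
)

# normalize() resets the time to midnight so a timestamp can match a holiday date
toy["is_holiday"] = (
    toy["tpep_pickup_datetime"].dt.normalize().isin(federal_holidays).astype(int)
)
print(toy)
```

Because the flag is 0 or 1, its mean is the share of trips that started on a holiday, so the 0.02 reported by `describe()` corresponds to roughly 2% of trips.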

In [30]:
holidays = df[df['is_holiday']==1]
holidays.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,is_holiday
count,708722.0,708722,708722,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,...,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0,708722.0
mean,1.76,2023-07-10 03:28:34.235519,2023-07-10 03:45:09.898057,1.46,4.19,1.46,163.17,161.69,1.23,21.06,...,30.23,2.5,0.18,0.03,2023.0,6.81,14.31,14.43,16.59,1.0
min,1.0,2023-01-02 00:00:03,2023-01-02 00:03:28,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.01,2.5,0.0,0.0,2023.0,1.0,2.0,0.0,0.02,1.0
25%,2.0,2023-02-20 19:08:04,2023-02-20 19:23:28,1.0,1.15,1.0,132.0,112.0,1.0,9.3,...,15.12,2.5,0.0,0.0,2023.0,2.0,9.0,11.0,7.03,1.0
50%,2.0,2023-07-04 15:34:42.500000,2023-07-04 15:51:49,1.0,1.97,1.0,161.0,162.0,1.0,13.5,...,20.16,2.5,0.0,0.0,2023.0,7.0,16.0,15.0,11.63,1.0
75%,2.0,2023-11-10 09:40:17.750000,2023-11-10 09:56:26.750000,2.0,4.2,1.0,231.0,234.0,1.0,23.5,...,32.0,2.5,0.0,0.0,2023.0,11.0,20.0,19.0,19.57,1.0
max,2.0,2023-12-25 23:59:57,2023-12-26 22:28:11,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,500.0,2.5,1.75,1.25,2023.0,12.0,29.0,23.0,6179.4,1.0
std,0.43,,,0.93,5.1,6.02,62.75,70.98,0.48,19.98,...,25.35,0.0,0.52,0.2,0.0,3.91,8.44,5.59,43.73,0.0


In [31]:
holidays.sample(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,PUzone,PUborough,DOzone,DOborough,is_holiday
28766971,2,2023-02-20 06:26:02,2023-02-20 06:30:30,1,1.0,1.0,N,164,68,1,...,2023,2,20,6,4.47,Midtown South,Manhattan,East Chelsea,Manhattan,1
28804675,2,2023-02-20 17:13:50,2023-02-20 17:33:32,2,3.26,1.0,N,237,114,2,...,2023,2,20,17,19.7,Upper East Side South,Manhattan,Greenwich Village South,Manhattan,1
14486658,2,2023-11-10 23:17:09,2023-11-10 23:25:59,1,1.99,1.0,N,230,238,1,...,2023,11,10,23,8.83,Times Sq/Theatre District,Manhattan,Upper West Side North,Manhattan,1
15666016,2,2023-11-23 16:17:24,2023-11-23 16:27:21,1,1.16,1.0,N,226,145,1,...,2023,11,23,16,9.95,Sunnyside,Queens,Long Island City/Hunters Point,Queens,1
28780647,2,2023-02-20 11:54:25,2023-02-20 12:25:48,2,16.9,2.0,N,161,132,1,...,2023,2,20,11,31.38,Midtown Center,Manhattan,JFK Airport,Queens,1
28776019,2,2023-02-20 10:29:49,2023-02-20 10:40:16,1,2.72,1.0,N,238,162,2,...,2023,2,20,10,10.45,Upper West Side North,Manhattan,Midtown East,Manhattan,1
17542856,1,2023-01-16 11:04:18,2023-01-16 11:10:20,1,1.0,1.0,N,232,79,1,...,2023,1,16,11,6.03,Two Bridges/Seward Park,Manhattan,East Village,Manhattan,1
22128914,1,2023-09-04 15:24:13,2023-09-04 15:43:26,1,8.4,1.0,N,138,93,1,...,2023,9,4,15,19.22,LaGuardia Airport,Queens,Flushing Meadows-Corona Park,Queens,1
15654806,2,2023-11-23 13:18:55,2023-11-23 13:45:52,3,5.79,1.0,N,161,88,1,...,2023,11,23,13,26.95,Midtown Center,Manhattan,Financial District South,Manhattan,1
2998976,2,2023-07-04 08:50:22,2023-07-04 09:17:47,4,12.23,1.0,N,138,186,1,...,2023,7,4,8,27.42,LaGuardia Airport,Queens,Penn Station/Madison Sq West,Manhattan,1


In [46]:
cleaned_df.sample(10)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,transaction_day,transaction_hour,trip_duration,PUzone,PUborough,DOzone,DOborough,is_holiday,trip_duration_hours,speed_mph
27444830,2,2023-02-05 05:57:55,2023-02-05 06:26:30,1,6.96,1.0,N,48,82,2,...,5,5,28.58,Clinton East,Manhattan,Elmhurst,Queens,0,0.48,14.61
17079292,1,2023-01-11 09:32:31,2023-01-11 09:41:38,1,1.0,1.0,N,229,237,1,...,11,9,9.12,Sutton Place/Turtle Bay North,Manhattan,Upper East Side South,Manhattan,0,0.15,6.58
233350,2,2023-06-03 11:48:34,2023-06-03 12:33:13,2,21.12,2.0,N,132,87,1,...,3,11,44.65,JFK Airport,Queens,Financial District North,Manhattan,0,0.74,28.38
14972870,2,2023-11-15 23:00:48,2023-11-15 23:22:39,1,5.24,1.0,N,113,43,2,...,15,23,21.85,Greenwich Village North,Manhattan,Central Park,Manhattan,0,0.36,14.39
6945087,2,2023-05-17 19:38:26,2023-05-17 19:48:55,1,1.14,1.0,N,236,237,1,...,17,19,10.48,Upper East Side North,Manhattan,Upper East Side South,Manhattan,0,0.17,6.52
1689808,2,2023-06-18 14:29:56,2023-06-18 14:54:19,2,6.24,1.0,N,230,42,1,...,18,14,24.38,Times Sq/Theatre District,Manhattan,Central Harlem North,Manhattan,0,0.41,15.35
24977894,2,2023-12-08 00:42:35,2023-12-08 01:06:29,1,5.51,1.0,N,79,193,1,...,8,0,23.9,East Village,Manhattan,Queensbridge/Ravenswood,Queens,0,0.4,13.83
20739711,2,2023-10-20 12:25:17,2023-10-20 12:28:54,1,1.0,1.0,N,262,140,1,...,20,12,3.62,Yorkville East,Manhattan,Lenox Hill East,Manhattan,0,0.06,16.59
16269801,1,2023-11-30 19:08:01,2023-11-30 19:20:45,1,1.1,1.0,N,114,107,1,...,30,19,12.73,Greenwich Village South,Manhattan,Gramercy,Manhattan,0,0.21,5.18
9990423,2,2023-04-19 18:56:19,2023-04-19 19:03:23,2,1.0,1.0,N,246,100,1,...,19,18,7.07,West Chelsea/Hudson Yards,Manhattan,Garment District,Manhattan,0,0.12,8.49


In [32]:
# Filter the DataFrame to include only the rows where 'is_holiday' is 1
holiday_trips = df[df['is_holiday'] == 1].copy()  # Adding .copy() to avoid SettingWithCopyWarning on a slice

# Use loc to safely create a new column 'month_day'
holiday_trips.loc[:, 'month_day'] = list(zip(holiday_trips['transaction_month'], holiday_trips['transaction_day']))

# Find unique (month, day) pairs
unique_holiday_dates = pd.unique(holiday_trips['month_day'])


In [33]:
print(unique_holiday_dates)


[(6, 19) (7, 4) (5, 29) (11, 10) (11, 23) (1, 2) (1, 16) (10, 9) (9, 4)
 (12, 25) (2, 20)]




1. **(6, 19)** - June 19: Juneteenth National Independence Day, a federal holiday recognizing the emancipation of enslaved African Americans.
2. **(7, 4)** - July 4: Independence Day, a major national holiday in the United States celebrating the country's declaration of independence from the British Empire.
3. **(5, 29)** - May 29: This date in 2023 was Memorial Day, observed on the last Monday of May each year, honoring the military personnel who have died in the performance of their military duties.
4. **(11, 10)** - November 10: In 2023, this was the observed day for Veterans Day (November 11), as November 11 fell on a Saturday.
5. **(11, 23)** - November 23: This date in 2023 was Thanksgiving Day, a significant U.S. holiday celebrated on the fourth Thursday of November.
6. **(1, 2)** - January 2: In 2023, this was the observed day for New Year’s Day (January 1), as January 1st fell on a Sunday.
7. **(1, 16)** - January 16: Martin Luther King Jr. Day in 2023, celebrated on the third Monday of January to honor the civil rights leader.
8. **(10, 9)** - October 9: Columbus Day/Indigenous Peoples' Day in 2023, observed on the second Monday in October.
9. **(9, 4)** - September 4: Labor Day in 2023, which is celebrated on the first Monday of September and honors the American labor movement.
10. **(12, 25)** - December 25: Christmas Day, a major holiday across many cultures, marking the celebration of the birth of Jesus Christ.
11. **(2, 20)** - February 20: Presidents' Day in 2023, observed on the third Monday of February in honor of George Washington and other presidents.
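
Some dates in this list are observed dates rather than calendar dates. A minimal sketch of the usual federal rule (a Saturday holiday is observed on the preceding Friday, a Sunday holiday on the following Monday) using only the standard library is shown below. The helper name `observed_date` is hypothetical and is not part of pandas or this notebook.

```python
from datetime import date, timedelta

def observed_date(actual: date) -> date:
    """Shift a fixed-date holiday to the nearest weekday (hypothetical helper)."""
    if actual.weekday() == 5:      # Saturday -> observed on Friday
        return actual - timedelta(days=1)
    if actual.weekday() == 6:      # Sunday -> observed on Monday
        return actual + timedelta(days=1)
    return actual

print(observed_date(date(2023, 1, 1)))    # New Year's Day fell on a Sunday
print(observed_date(date(2023, 11, 11)))  # Veterans Day fell on a Saturday
print(observed_date(date(2023, 7, 4)))    # Independence Day fell on a Tuesday
```

Flagging the observed date rather than the calendar date matters for a demand feature, since the day off work is what plausibly changes travel behaviour.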



## Velocity

**Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.
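
As a worked example of the formula, the sketch below converts a duration in minutes to hours before dividing. It assumes `trip_distance` is in miles and `trip_duration` is in minutes; the function name `speed_mph` is a hypothetical helper, not code from this notebook.

```python
def speed_mph(distance_miles: float, duration_minutes: float) -> float:
    """Average speed in mph from a distance in miles and a duration in minutes."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    return distance_miles / (duration_minutes / 60.0)

# A 6.96-mile trip lasting 28.58 minutes
print(round(speed_mph(6.96, 28.58), 2))

# An 18.99-mile trip recorded as lasting one second (1/60 of a minute)
print(round(speed_mph(18.99, 1 / 60)))
```

The first call gives 14.61 mph, a typical city speed. The second gives 68,364 mph, which shows how a near-zero recorded duration turns an ordinary distance into an impossible speed.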

In [34]:
# First, ensure your trip_duration is in hours for speed calculation
df['trip_duration_hours'] = df['trip_duration'] / 60.0

# Calculate speed
df['speed_mph'] = df['trip_distance'] / df['trip_duration_hours']

# Replace infinite values (which occur when duration is zero) with NaN, then fill NaN with 0.
# Assigning the result back avoids calling an inplace method on df['speed_mph'],
# a pattern that pandas warns will stop working in pandas 3.0.
df['speed_mph'] = df['speed_mph'].replace([np.inf, -np.inf], np.nan).fillna(0)


In [35]:
df['speed_mph'].describe()

count   32458221.00
mean          13.67
std           75.87
min            0.01
25%            7.79
50%           10.48
75%           14.75
max        68364.00
Name: speed_mph, dtype: float64

### Analyzing the Summary Statistics

1.  **Count**: Over 32 million data points are present, indicating a large dataset.
2.  **Mean**: The average speed is approximately 13.67 mph, which is plausible for urban traffic, although it sits above the median of 10.48 mph because of extreme values.
3.  **Standard Deviation**: The standard deviation of 75.87 mph is several times the mean, which points to extreme outliers rather than ordinary variation in speed.
4.  **Min and Max**: The minimum speed is 0.01 mph, effectively stationary, while the maximum of 68,364 mph is impossible for any road vehicle and indicates data errors.
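
To illustrate why a single bad record can distort the mean while leaving the median almost untouched, here is a small sketch with made-up speeds. The values in `speeds` are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical speeds in mph: five ordinary trips plus one recording error
speeds = [7.8, 9.5, 10.5, 12.0, 14.7, 68364.0]

mean_speed = statistics.mean(speeds)
median_speed = statistics.median(speeds)
print(f"mean={mean_speed:.2f}, median={median_speed:.2f}")

# Dropping the erroneous value brings the mean back in line with the median
plausible = [s for s in speeds if s <= 100]
print(f"mean without outlier={statistics.mean(plausible):.2f}")
```

This is why the quartiles are a more trustworthy description of typical trip speed than the mean until the outliers are handled.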



In [36]:
df['speed_mph'].isna().sum()

0

In [37]:
# Select the row(s) with the maximum speed without hard-coding a float literal
max_speed = df[df['speed_mph'] == df['speed_mph'].max()]

In [38]:
max_speed

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,transaction_day,transaction_hour,trip_duration,PUzone,PUborough,DOzone,DOborough,is_holiday,trip_duration_hours,speed_mph
30600861,2,2023-03-12 00:03:08,2023-03-12 00:03:09,1,18.99,2.0,N,132,114,2,...,12,0,0.02,JFK Airport,Queens,Greenwich Village South,Manhattan,0,0.0,68364.0


Speeds above 100 mph are unrealistic for taxi trips in the city, so we treat them as data errors and remove every trip with a speed above 100 mph.

In [48]:
# Define a realistic maximum speed
max_realistic_speed = 100  # mph

# Filter the DataFrame to remove highly unrealistic speeds
cleaned_df = df[df['speed_mph'] <= max_realistic_speed]

# Optionally, inspect the data points with extreme speeds before removing them
extreme_speeds = df[df['speed_mph'] > max_realistic_speed]
print(extreme_speeds[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'trip_duration', 'speed_mph']])



         tpep_pickup_datetime tpep_dropoff_datetime  trip_distance  \
846       2023-06-01 00:34:56   2023-06-01 00:35:03           1.00   
1234      2023-06-01 00:09:36   2023-06-01 00:09:40           1.00   
1361      2023-06-01 00:15:05   2023-06-01 00:15:30           1.00   
1413      2023-06-01 00:26:05   2023-06-01 00:26:21           1.00   
1414      2023-06-01 00:26:05   2023-06-01 00:26:21           1.00   
...                       ...                   ...            ...   
32456857  2023-03-31 23:23:50   2023-03-31 23:24:00           1.00   
32457751  2023-03-31 23:22:45   2023-03-31 23:22:55           1.00   
32457995  2023-03-31 23:32:03   2023-03-31 23:32:18           1.00   
32457999  2023-03-31 23:08:28   2023-03-31 23:08:55           1.00   
32458088  2023-03-31 23:05:43   2023-03-31 23:05:44           1.00   

          trip_duration  speed_mph  
846                0.12     514.29  
1234               0.07     900.00  
1361               0.42     144.00  
1413       

In [49]:
# Describe the speed statistics after removing extreme data points
new_speed_stats = cleaned_df['speed_mph'].describe()
print(new_speed_stats)


count   32387745.00
mean          12.40
std            7.26
min            0.01
25%            7.79
50%           10.47
75%           14.71
max          100.00
Name: speed_mph, dtype: float64


In [50]:
cleaned_df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,is_holiday,trip_duration_hours,speed_mph
count,32387745.0,32387745,32387745,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,...,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0,32387745.0
mean,1.75,2023-07-02 05:15:46.592766,2023-07-02 05:33:22.618237,1.39,3.57,1.59,165.37,164.18,1.2,19.7,...,0.14,0.01,2023.0,6.52,15.55,14.31,17.6,0.02,0.29,12.4
min,1.0,2023-01-01 00:00:05,2023-01-01 00:04:16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,2023.0,1.0,1.0,0.0,0.6,0.0,0.01,0.01
25%,2.0,2023-04-02 00:42:50,2023-04-02 00:56:52,1.0,1.09,1.0,132.0,114.0,1.0,9.3,...,0.0,0.0,2023.0,4.0,8.0,11.0,7.75,0.0,0.13,7.79
50%,2.0,2023-06-26 15:32:16,2023-06-26 15:51:55,1.0,1.8,1.0,162.0,162.0,1.0,13.5,...,0.0,0.0,2023.0,6.0,15.0,15.0,12.68,0.0,0.21,10.47
75%,2.0,2023-10-06 10:45:50,2023-10-06 11:03:33,1.0,3.43,1.0,234.0,234.0,1.0,21.9,...,0.0,0.0,2023.0,10.0,23.0,19.0,20.67,0.0,0.34,14.71
max,2.0,2023-12-31 23:57:45,2023-12-31 23:59:56,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,1.75,1.25,2023.0,12.0,31.0,23.0,7053.62,1.0,117.56,100.0
std,0.43,,,0.88,4.46,7.17,63.56,69.77,0.46,17.84,...,0.46,0.11,0.0,3.46,8.72,5.78,41.79,0.15,0.7,7.26


# Further Cleaning Outliers Based on New Features

Based on the descriptive statistics above, some trips have durations of over 100 hours. These are outliers, most likely trips where the meter was left running.

Based on our domain knowledge, it is illegal for NYC taxis to take on trips lasting more than 12 hours. According to the NYC government website: "If the trip would result in the driver's having to operate the taxicab for more than 12 consecutive hours, which is prohibited"

We will therefore isolate trips longer than 12 hours, review their statistics, and then remove them so that the data is in line with government regulations and free of these outliers.
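
The speed cap and the duration cap can also be expressed as one boolean mask. The sketch below is a hypothetical helper on a made-up three-row frame, not the notebook's own code; it assumes the columns `speed_mph` and `trip_duration_hours` exist.

```python
import pandas as pd

def filter_plausible_trips(trips: pd.DataFrame,
                           max_speed_mph: float = 100.0,
                           max_duration_hours: float = 12.0) -> pd.DataFrame:
    """Keep rows within both the speed cap and the duration cap (hypothetical helper)."""
    mask = (
        (trips["speed_mph"] <= max_speed_mph)
        & (trips["trip_duration_hours"] <= max_duration_hours)
    )
    # .copy() returns an independent frame, so later column assignments
    # do not trigger SettingWithCopyWarning
    return trips.loc[mask].copy()

# Made-up rows: one ordinary trip, one impossible speed, one overlong duration
toy = pd.DataFrame({
    "trip_duration_hours": [0.48, 0.0003, 23.74],
    "speed_mph": [14.61, 68364.0, 0.09],
})
kept = filter_plausible_trips(toy)
print(kept)
```

Keeping the thresholds as parameters makes it easier to revisit them later, for example when adapting the rules to another city.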

In [51]:
# Display extremely long trips
extreme_durations = cleaned_df[cleaned_df['trip_duration_hours'] > 12]  # More than 12 hours
extreme_durations.describe()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,is_holiday,trip_duration_hours,speed_mph
count,26165.0,26165,26165,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,...,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0,26165.0
mean,2.0,2023-06-21 02:56:24.131205,2023-06-22 02:05:04.260615,1.74,4.25,1.12,160.67,160.39,1.27,23.39,...,0.13,0.01,2023.0,6.16,15.43,13.64,1388.67,0.02,23.14,0.2
min,1.0,2023-01-01 00:11:01,2023-01-01 23:37:57,1.0,1.0,1.0,1.0,1.0,1.0,3.0,...,0.0,0.0,2023.0,1.0,1.0,0.0,720.13,0.0,12.0,0.01
25%,2.0,2023-03-23 13:49:35,2023-03-24 13:44:43,1.0,1.3,1.0,114.0,107.0,1.0,10.7,...,0.0,0.0,2023.0,3.0,8.0,10.0,1400.52,0.0,23.34,0.06
50%,2.0,2023-06-08 14:27:58,2023-06-09 13:58:17,1.0,2.19,1.0,161.0,161.0,1.0,16.3,...,0.0,0.0,2023.0,6.0,15.0,15.0,1424.3,0.0,23.74,0.09
75%,2.0,2023-09-19 19:24:27,2023-09-20 18:28:41,2.0,4.44,1.0,231.0,233.0,2.0,26.8,...,0.0,0.0,2023.0,9.0,23.0,19.0,1434.0,0.0,23.9,0.19
max,2.0,2023-12-31 04:44:40,2023-12-31 23:28:36,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,1.75,1.25,2023.0,12.0,31.0,23.0,7053.62,1.0,117.56,2.49
std,0.04,,,1.38,5.04,1.76,65.12,71.39,0.45,22.1,...,0.45,0.11,0.0,3.44,8.76,6.37,185.05,0.15,3.08,0.26


In [52]:
cleaned_df = cleaned_df[cleaned_df['trip_duration_hours']<=12]

In [53]:
cleaned_df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,...,General_Airport_Fee,JFK_LGA_Pickup_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,is_holiday,trip_duration_hours,speed_mph
count,32361580.0,32361580,32361580,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,...,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0,32361580.0
mean,1.75,2023-07-02 05:28:41.771842,2023-07-02 05:45:11.285101,1.39,3.57,1.59,165.38,164.18,1.2,19.69,...,0.14,0.01,2023.0,6.52,15.55,14.31,16.49,0.02,0.27,12.41
min,1.0,2023-01-01 00:00:05,2023-01-01 00:04:16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,2023.0,1.0,1.0,0.0,0.6,0.0,0.01,0.08
25%,2.0,2023-04-02 00:52:48.750000,2023-04-02 01:06:11.500000,1.0,1.09,1.0,132.0,114.0,1.0,9.3,...,0.0,0.0,2023.0,4.0,8.0,11.0,7.73,0.0,0.13,7.79
50%,2.0,2023-06-26 15:47:24,2023-06-26 16:06:46.500000,1.0,1.8,1.0,162.0,162.0,1.0,13.5,...,0.0,0.0,2023.0,6.0,15.0,15.0,12.67,0.0,0.21,10.47
75%,2.0,2023-10-06 10:54:49,2023-10-06 11:12:34,1.0,3.43,1.0,234.0,234.0,1.0,21.9,...,0.0,0.0,2023.0,10.0,23.0,19.0,20.65,0.0,0.34,14.72
max,2.0,2023-12-31 23:57:45,2023-12-31 23:59:56,6.0,30.0,99.0,265.0,265.0,4.0,300.0,...,1.75,1.25,2023.0,12.0,31.0,23.0,719.57,1.0,11.99,100.0
std,0.43,,,0.88,4.46,7.17,63.56,69.77,0.46,17.83,...,0.46,0.11,0.0,3.46,8.72,5.78,14.11,0.15,0.24,7.26
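
A quick way to confirm that each filter removed what we expect is to reconcile the row counts reported by the `describe()` outputs in this notebook. The variable names below are hypothetical; the three counts are copied from those outputs.

```python
rows_before = 32_458_221          # count of df['speed_mph']
rows_after_speed = 32_387_745     # count after speed_mph <= 100
rows_after_duration = 32_361_580  # count after trip_duration_hours <= 12

removed_by_speed = rows_before - rows_after_speed
removed_by_duration = rows_after_speed - rows_after_duration
share_removed = (rows_before - rows_after_duration) / rows_before

print(removed_by_speed, removed_by_duration, f"{share_removed:.2%}")
```

The speed filter removes 70,476 rows and the duration filter removes 26,165, which matches the count of `extreme_durations`. Together the two filters drop about 0.30% of the data.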


## Testing NEW Feature Validity

We need to check that the engineered features fall within their expected bounds and that the feature-engineering code worked as intended.

In [54]:
def test_trip_duration_positive():
    assert cleaned_df['trip_duration'].min() > 0, "Error: Non-positive trip durations present in the dataset."


In [55]:
def test_time_of_day_categories():
    hours = cleaned_df['tpep_pickup_datetime'].dt.hour
    conditions = [
        ((hours >= 5) & (hours <= 11)),
        ((hours >= 12) & (hours <= 17)),
        ((hours >= 18) & (hours <= 23)),
        (hours < 5)  # dt.hour ranges from 0 to 23, so midnight is hour 0
    ]
    categories = ['morning', 'afternoon', 'evening', 'night']
    for condition, category in zip(conditions, categories):
        assert all(cleaned_df.loc[condition, 'pickup_time_of_day'] == category), f"Error in categorizing {category}."


In [56]:
def test_passenger_count_categories():
    conditions = [
        (cleaned_df['passenger_count'] == 1),
        (cleaned_df['passenger_count'].between(2, 4)),
        (cleaned_df['passenger_count'].between(5, 6))
    ]
    categories = ['low', 'medium', 'high']
    for condition, category in zip(conditions, categories):
        assert all(cleaned_df.loc[condition, 'passenger_count_category'] == category), f"Error in categorizing passenger count {category}."


In [57]:
def test_seasonal_categories():
    months = cleaned_df['tpep_pickup_datetime'].dt.month
    conditions = [
        (months.isin([3, 4, 5])),
        (months.isin([6, 7, 8])),
        (months.isin([9, 10, 11])),
        (months.isin([12, 1, 2]))
    ]
    seasons = ['spring', 'summer', 'autumn', 'winter']
    for condition, season in zip(conditions, seasons):
        assert all(cleaned_df.loc[condition, 'pickup_season'] == season), f"Error in season categorization for {season}."


In [58]:
from pandas.tseries.holiday import USFederalHolidayCalendar

def test_holiday_flag():
    calendar = USFederalHolidayCalendar()
    pickup_dates = cleaned_df['tpep_pickup_datetime'].dt.normalize()
    holidays = calendar.holidays(start=pickup_dates.min(), end=pickup_dates.max())
    # Compare against a local Series instead of adding a helper column to cleaned_df,
    # so the test does not change the dataset that is saved later
    calculated_holiday = pickup_dates.isin(holidays).astype(int)
    assert (calculated_holiday == cleaned_df['is_holiday']).all(), "Holiday flag mismatches detected."


In [59]:
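# Run the duration, time-of-day, and passenger-count tests
# (test_seasonal_categories and test_holiday_flag are defined above but not called here)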
try:
    test_trip_duration_positive()
    test_time_of_day_categories()
    test_passenger_count_categories()
    print("All tests passed!")
except AssertionError as e:
    print("Test failed:", e)


All tests passed!


# Saving Feature Engineered Dataset For EDA and Model Development

In [60]:
cleaned_df.to_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/cleaned_feature_engineered_dataset.parquet', engine='pyarrow')
