# Taxi Fare Prediction Model: Feature Engineering

## Introduction
This project aims to develop a predictive model for taxi fares in NYC. Initially, we will create a model for NYC and adjust parameters to align with domain knowledge from Tbilisi. We hypothesize that factors such as time of day, seasonality, and holidays impact taxi demand and fare prices.



### Notebook Aim and Feature Development

#### **Objective**
The primary objective of our project is to develop a predictive model that accurately forecasts taxi fares. Initially focusing on New York City (NYC), we aim to expand and adapt the model to incorporate Tbilisi, employing localized domain knowledge to tailor our approach.

#### **Influence of Demand on Pricing**
The fare prices in the taxi industry are predominantly influenced by demand dynamics, which can fluctuate based on various factors including:

- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.
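
As a rough illustration of the holiday factor, the sketch below flags pickups that fall on a US federal holiday using pandas' `USFederalHolidayCalendar`, which this notebook imports in the data loading section. The sample timestamps and the `is_holiday` name are made up for the example. Treating federal holidays as the relevant days is an assumption; the calendar does not cover occasions such as New Year's Eve.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical sample of pickup timestamps (not taken from the dataset)
pickups = pd.Series(pd.to_datetime([
    "2023-07-04 10:15:00",  # Independence Day
    "2023-07-05 10:15:00",  # ordinary Wednesday
    "2023-12-25 08:00:00",  # Christmas Day
]))

# Federal holidays observed during 2023, returned as a DatetimeIndex of midnights
holidays_2023 = USFederalHolidayCalendar().holidays(start="2023-01-01", end="2023-12-31")

# Strip the time component so each pickup can be compared with the holiday dates
is_holiday = pickups.dt.normalize().isin(holidays_2023)
print(is_holiday.tolist())
```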

#### **Additional Influential Features**
Beyond temporal and periodic factors, several other elements could influence fare pricing:

- **Passenger Count**: Exploring whether vehicles accommodating more passengers have different fare structures, similar to practices in ride-sharing applications.
- **Trip Distance and Duration**: Both metrics are crucial for pricing. While trip distance is a direct influencer, the duration might also affect costs, especially in varying traffic conditions.
- **Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.
- **Taxi Zones**: With the NYC taxi zone dataset, we can analyze whether specific pickup and dropoff locations impact fare prices due to their geographical significance.
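
The velocity feature above can be sketched as follows. This is a minimal illustration under stated assumptions: `trip_distance` is in miles, `trip_duration` is in minutes (as computed later in this notebook), and `avg_speed_mph` is a hypothetical column name. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: distance in miles, duration in minutes
trips = pd.DataFrame({
    "trip_distance": [1.1, 13.94, 2.0],
    "trip_duration": [6.0, 21.0, 0.0],
})

# Convert minutes to hours; treat a zero duration as missing to avoid dividing by zero
hours = trips["trip_duration"].replace(0, np.nan) / 60
trips["avg_speed_mph"] = trips["trip_distance"] / hours
print(trips)
```

Rows with a zero duration end up with a missing speed rather than an infinite one, so they can be inspected or dropped explicitly.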


### Summary
This comprehensive approach not only allows us to understand the multifaceted dynamics of taxi fare pricing in NYC but also sets a foundation for adapting the model to Tbilisi, ensuring that both city-specific and universal factors are considered for effective fare prediction.

## Data Loading

In this section we install the necessary packages, import the required libraries, and load the dataset.

In [1]:
!pip install pyarrow
!pip install fastparquet
!pip install geopandas



The code requires pyarrow version 16. If it does not run, install or update pyarrow from the terminal.
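
The version requirement above can be checked programmatically. The following is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("pyarrow:", installed_version("pyarrow"))
```

If the reported version is older than 16, running `pip install --upgrade pyarrow` from the terminal updates it.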

In [1]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
import geopandas as gpd

pd.set_option('display.float_format', lambda x: '%.2f' % x)
from pandas.tseries.holiday import USFederalHolidayCalendar


In [2]:
# Replace the path below with the location of your Parquet file
df_original = pd.read_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned_taxi_data_v.1.parquet', engine='pyarrow')  # or engine='fastparquet' if you prefer
df = df_original  # note: this is a second name for the same DataFrame, not a copy

Let's verify that the cleaned data from the previous notebook loaded correctly and contains all the expected columns.
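
A column check like this can also be automated. The sketch below is one possible approach, not part of the original pipeline: the expected names are copied from the `df.head()` output in this notebook, and `missing_columns` is a hypothetical helper.

```python
import pandas as pd

# Column names copied from the df.head() output in this notebook
EXPECTED_COLUMNS = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count",
    "trip_distance", "PULocationID", "DOLocationID", "payment_type",
    "fare_amount", "extra", "tip_amount", "total_amount",
    "JFK_LGA_Pickup_Fee", "General_Airport_Fee", "distance_bins",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical two-column frame standing in for the loaded data
demo = pd.DataFrame({"fare_amount": [7.9], "trip_distance": [1.1]})
print(missing_columns(demo))
```

Calling `missing_columns(df)` on the loaded data would return an empty list when every expected column is present.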

In [4]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,distance_bins
0,2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.1,43,237,1,7.9,1.0,4.0,16.9,0.0,0.0,1-2 miles
1,2023-01-01 00:25:04,2023-01-01 00:37:49,1,2.51,48,238,1,14.9,1.0,15.0,34.9,0.0,0.0,2-5 miles
2,2023-01-01 00:10:29,2023-01-01 00:21:19,1,1.43,107,79,1,11.4,1.0,3.28,19.68,0.0,0.0,1-2 miles
3,2023-01-01 00:50:34,2023-01-01 01:02:52,1,1.84,161,137,1,12.8,1.0,10.0,27.8,0.0,0.0,1-2 miles
4,2023-01-01 00:09:22,2023-01-01 00:19:49,1,1.66,239,143,1,12.1,1.0,3.42,20.52,0.0,0.0,1-2 miles


In [5]:
len(df)

28093781

## Dataset Loading and Preparation for Feature Engineering


Because the data was already cleaned in the previous notebook, we do not need to clean it again for nulls, duplicates, or outliers. To confirm that it is consistent and clean, we take an initial look at the loaded dataset below.

We will also load the Taxi Zone dataset, downloaded from the NYC Taxi Zones page: https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc

In [3]:
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee
count,28093781,28093781,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0
mean,2023-07-01 05:46:20.028285,2023-07-01 06:06:41.749390,1.4,4.26,163.46,162.04,1.19,22.85,1.68,4.07,32.75,0.01,0.17
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0
25%,2023-04-01 18:21:12,2023-04-01 18:40:28,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.6,0.0,0.0
50%,2023-06-25 02:32:23,2023-06-25 02:47:49,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0
75%,2023-10-04 20:39:08,2023-10-04 20:56:30,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75
std,,,0.89,4.82,63.28,70.92,0.44,18.39,1.91,4.32,23.48,0.12,0.51


In [6]:
df.dtypes

tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                   int64
trip_distance                   float64
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
tip_amount                      float64
total_amount                    float64
JFK_LGA_Pickup_Fee              float64
General_Airport_Fee             float64
distance_bins                  category
dtype: object

In [7]:
df.isnull().sum()

tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
tip_amount               0
total_amount             0
JFK_LGA_Pickup_Fee       0
General_Airport_Fee      0
distance_bins            0
dtype: int64

### Loading the Additional Taxi Zones Dataset

In [8]:
zones = pd.read_csv("/Users/md/Desktop/python_project/parquet_files/cleaned/taxi_zones.csv", sep=';')
zones.describe()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,LocationID
count,263.0,263.0,263.0,263.0
mean,132.0,0.09,0.0,131.98
std,76.07,0.05,0.0,76.07
min,1.0,0.01,0.0,1.0
25%,66.5,0.05,0.0,66.5
50%,132.0,0.08,0.0,132.0
75%,197.5,0.12,0.0,197.5
max,263.0,0.43,0.0,263.0


In [9]:
zones.sample(5)

Unnamed: 0,OBJECTID,Shape_Leng,the_geom,Shape_Area,zone,LocationID,borough
10,10,0.1,MULTIPOLYGON (((-73.78326624999988 40.68999429...,0.0,Baisley Park,10,Queens
190,191,0.13,MULTIPOLYGON (((-73.73016587199996 40.72395859...,0.0,Queens Village,191,Queens
11,11,0.08,MULTIPOLYGON (((-74.00109809499993 40.60303462...,0.0,Bath Beach,11,Brooklyn
95,95,0.11,MULTIPOLYGON (((-73.84732494199989 40.73877145...,0.0,Forest Hills,95,Queens
250,248,0.06,MULTIPOLYGON (((-73.8639374809998 40.840044565...,0.0,West Farms/Bronx River,248,Bronx


We can already see that the zones file has 263 rows, while the location IDs in our trip data go up to 265. When joining the two, we will therefore have to handle unmatched IDs, either by identifying the missing zones or by removing the affected trips.

# Feature Engineering

Below, based on our domain knowledge and literature review, we create new features or adjust existing ones to gain more insight into the data and build the best possible predictive model.

## Seasonal and Time Features


- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Duration**: Measuring how long each trip lasted.

In [10]:
# Convert the pickup and dropoff datetime to pandas datetime format if not already
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Time of day segmentation
df['pickup_time_of_day'] = df['tpep_pickup_datetime'].dt.hour.apply(lambda x: 'morning' if 5 <= x <= 11
                                                                           else 'afternoon' if 12 <= x <= 17
                                                                           else 'evening' if 18 <= x <= 23
                                                                           else 'night')

# Seasons segmentation
df['pickup_season'] = df['tpep_pickup_datetime'].dt.month.apply(lambda x: 'spring' if 3 <= x <= 5
                                                                       else 'summer' if 6 <= x <= 8
                                                                       else 'autumn' if 9 <= x <= 11
                                                                       else 'winter')

# Passenger count categories
df['passenger_count_category'] = pd.cut(df['passenger_count'], bins=[0, 1, 4, 6], include_lowest=True, 
                                        labels=['low', 'medium', 'high'])

# Weekday/Weekend segmentation
df['pickup_day_type'] = df['tpep_pickup_datetime'].dt.day_name().apply(lambda x: 'weekend' if x in ['Saturday', 'Sunday'] else 'weekday')


# Calendar components of the pickup timestamp
df['transaction_year'] = df['tpep_pickup_datetime'].dt.year
df['transaction_month'] = df['tpep_pickup_datetime'].dt.month
df['transaction_day'] =  df['tpep_pickup_datetime'].dt.day
df['transaction_hour'] = df['tpep_pickup_datetime'].dt.hour

#trip duration is another interesting feature to analyze 


# Calculate the trip duration and convert it to minutes
df['trip_duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60


Let's take a look at the adjusted dataset and the newly created features. We sample the dataset to check that the features were created correctly.

In [14]:
df.sample(20)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
6481378,2023-03-25 22:04:19,2023-03-25 22:25:24,1,13.94,130,236,2,70.0,0.0,0.0,...,10-20 miles,evening,spring,low,weekend,2023,3,25,22,21.08
138659,2023-01-03 14:31:11,2023-01-03 14:38:28,1,1.5,143,68,1,9.3,2.5,1.0,...,1-2 miles,afternoon,winter,low,weekday,2023,1,3,14,7.28
13759295,2023-06-21 19:00:07,2023-06-21 19:10:22,1,1.26,211,249,1,10.7,2.5,3.44,...,1-2 miles,evening,summer,low,weekday,2023,6,21,19,10.25
3544838,2023-02-17 07:51:21,2023-02-17 07:58:33,2,1.71,142,233,1,10.0,0.0,1.5,...,1-2 miles,morning,winter,medium,weekday,2023,2,17,7,7.2
18772337,2023-09-01 17:38:51,2023-09-01 17:50:35,2,1.81,79,162,1,12.8,2.5,3.86,...,1-2 miles,afternoon,autumn,medium,weekday,2023,9,1,17,11.73
8648853,2023-04-21 18:59:29,2023-04-21 19:50:11,4,17.73,132,48,1,70.0,5.0,17.11,...,10-20 miles,evening,spring,medium,weekday,2023,4,21,18,50.7
18968833,2023-09-04 21:24:43,2023-09-04 21:28:58,3,1.4,211,261,1,7.9,3.5,2.55,...,1-2 miles,evening,autumn,medium,weekday,2023,9,4,21,4.25
4299161,2023-02-26 20:01:15,2023-02-26 20:07:58,1,1.94,237,75,2,10.0,1.0,0.0,...,1-2 miles,evening,winter,low,weekend,2023,2,26,20,6.72
5169258,2023-03-09 19:07:34,2023-03-09 19:12:22,1,1.3,239,151,1,7.9,2.5,2.88,...,1-2 miles,evening,spring,low,weekday,2023,3,9,19,4.8
17637415,2023-08-15 16:10:24,2023-08-15 16:18:00,1,1.2,237,142,1,8.6,5.0,3.75,...,1-2 miles,afternoon,summer,low,weekday,2023,8,15,16,7.6


## Analysis of Newly Added Features in the Taxi Dataset

In this analysis, we explore the enhancements made to a comprehensive taxi dataset through feature engineering, specifically focusing on new temporal and categorical attributes derived from the raw data. The following attributes were added: `pickup_time_of_day`, `pickup_season`, `passenger_count_category`, `pickup_day_type`, `transaction_year`, `transaction_month`, `transaction_day`, `transaction_hour`, and `trip_duration`. These features were engineered to facilitate deeper insights into the patterns of taxi usage.

### Overview of New Features

**Temporal Segmentation:**
- **Time of Day:** This attribute classifies each trip into one of four categories based on the pickup hour: `morning` (5:00-11:59), `afternoon` (12:00-17:59), `evening` (18:00-23:59), and `night` (0:00-4:59). Most pickups occur in the `afternoon`, accounting for about 35.4% of all trips (9,956,916 of 28,093,781). This could indicate higher taxi demand during these hours, possibly due to work-related commuting or lunchtime errands.
- **Season:** The trips are classified into seasons based on the pickup month, allowing for seasonal analysis of taxi usage. The most common season for taxi pickups was `spring`, with approximately 27% of the year's pickups, suggesting a peak in taxi usage during this season, which may correlate with tourist activity or better weather conditions.
- **Weekday/Weekend:** This feature categorizes each day as a `weekday` or `weekend`, based on the day of the week. A majority of the trips (about 72.4%, or 20,350,513) occurred on weekdays. Because weekdays account for five of the seven days of the week (about 71.4%), this is close to an even spread, so average daily usage is only slightly higher on weekdays than on weekends.
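
The same segmentation can also be expressed with vectorized operations, which tend to be faster than `.apply` with a Python lambda on tens of millions of rows. The sketch below is an alternative, not the code used in this notebook; the sample timestamps and the `season_by_month` mapping name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pickup timestamps covering each time-of-day bucket
pickups = pd.Series(pd.to_datetime([
    "2023-01-01 00:55:08",  # Sunday, hour 0
    "2023-02-17 07:51:21",  # Friday, hour 7
    "2023-08-15 16:10:24",  # Tuesday, hour 16
    "2023-03-25 22:04:19",  # Saturday, hour 22
]))

hour = pickups.dt.hour
# Right-closed bins: (-1, 4] night, (4, 11] morning, (11, 17] afternoon, (17, 23] evening
time_of_day = pd.cut(hour, bins=[-1, 4, 11, 17, 23],
                     labels=["night", "morning", "afternoon", "evening"])

# Map month number to season; December, January and February count as winter
season_by_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
season = pickups.dt.month.map(season_by_month)

# dayofweek is 0 for Monday through 6 for Sunday
day_type = np.where(pickups.dt.dayofweek >= 5, "weekend", "weekday")

print(pd.DataFrame({"time_of_day": time_of_day, "season": season, "day_type": day_type}))
```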

**Categorical Binning of Numerical Data:**
- **Passenger Count Categories:** Passengers per trip were categorized into `low` (1 passenger), `medium` (2-4 passengers), and `high` (5-6 passengers). The categorization helps in analyzing travel behavior in relation to group size. 
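
To illustrate how the bin edges translate into the categories above, here is a small check on made-up passenger counts. With `bins=[0, 1, 4, 6]`, `pd.cut` builds right-closed intervals, so each upper edge belongs to the lower category.

```python
import pandas as pd

# Hypothetical passenger counts sitting on and around the bin edges
counts = pd.Series([1, 2, 4, 5, 6])

# Same bin edges and labels as the feature engineering cell above
category = pd.cut(counts, bins=[0, 1, 4, 6], include_lowest=True,
                  labels=["low", "medium", "high"])
print(dict(zip(counts.tolist(), category.astype(str).tolist())))
```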

**Transaction Date and Time:**
- Extracted `year`, `month`, `day`, and `hour` from the pickup datetime to facilitate time-based querying and aggregation at various granularities.

**Trip Duration Calculation:**
- Calculated as the difference in minutes between pickup and dropoff times, providing insights into trip lengths and potential traffic conditions. The mean trip duration was approximately 20.36 minutes, but with a high standard deviation, indicating significant variability which could be influenced by factors such as traffic, time of day, and day of the week.


From an initial look and spot checks, I confirmed that dates are correctly categorized as weekend or weekday, that the passenger categories are correctly defined, and that the trip duration calculation is correct.

To check if newly added features have correct values we will use descriptive statistics and adjust accordingly if needed.

Below we will check if the trip duration calculations are correct

In [16]:
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,28093781,28093781,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0
mean,2023-07-01 05:46:20.028285,2023-07-01 06:06:41.749390,1.4,4.26,163.46,162.04,1.19,22.85,1.68,4.07,32.75,0.01,0.17,2023.0,6.49,15.5,14.33,20.36
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0,2023.0,1.0,1.0,0.0,-165.05
25%,2023-04-01 18:21:12,2023-04-01 18:40:28,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.6,0.0,0.0,2023.0,4.0,8.0,11.0,10.23
50%,2023-06-25 02:32:23,2023-06-25 02:47:49,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0,2023.0,6.0,15.0,15.0,15.2
75%,2023-10-04 20:39:08,2023-10-04 20:56:30,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0,2023.0,10.0,23.0,19.0,23.43
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75,2023.0,12.0,31.0,23.0,7053.62
std,,,0.89,4.82,63.28,70.92,0.44,18.39,1.91,4.32,23.48,0.12,0.51,0.0,3.45,8.7,5.9,43.02


In [18]:
categorical_columns = ['PULocationID', 'DOLocationID', 'payment_type', 'distance_bins', 'pickup_time_of_day', 'pickup_season', 'passenger_count_category', 'pickup_day_type', 'transaction_year', 'transaction_month', 'transaction_day', 'transaction_hour']
# include='object' limits the summary to string-typed columns, so the category and integer columns in the list are left out
descriptive_stats = df[categorical_columns].describe(include='object')
descriptive_stats


Unnamed: 0,pickup_time_of_day,pickup_season,pickup_day_type
count,28093781,28093781,28093781
unique,4,4,2
top,afternoon,spring,weekday
freq,9956916,7586611,20350513



### Insights from Descriptive Statistics

- **Frequency Distributions:** The top segments for the newly added categorical features indicate the most common contexts in which taxis are used (afternoon pickups during spring on weekdays).
- **Average and Variance:** The distribution of trip durations has a high variance, as indicated by its standard deviation (43.02 minutes), pointing to a wide range in trip lengths. This suggests variability in trip purposes, from short hops to longer journeys.
- **Outliers:** The maximum trip duration is exceptionally high at over 7053 minutes, suggesting potential data entry errors or exceptional cases (e.g., taxis being hired for extended periods).
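
One hedged way to gauge how much the extreme durations matter is to compare the mean with the median and to measure the share of trips above a threshold. The 180-minute threshold and the sample values below are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical durations in minutes, including one extreme value
durations = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 7053.62])

# The median is far less affected by a single extreme trip than the mean
print("mean:", durations.mean())
print("median:", durations.median())

# Share of trips longer than an assumed 3-hour threshold (180 minutes)
share_over_3h = (durations > 180).mean()
print(f"{share_over_3h:.2%} of trips exceed 180 minutes")
```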


## Outlier Handling and Data Validation For New Features

Based on the descriptive statistics provided, there are a few indicators of potential data issues that may require cleaning or further investigation:

### 1. **Trip Duration Outliers**
- **Negative Values:** The minimum value in the trip duration is -165.05 minutes. This is clearly an error since trip duration cannot be negative. Negative values could result from data entry errors or incorrect timestamp recording.
- **Excessive Maximum Value:** The maximum trip duration is over 7053 minutes (approximately 117.55 hours). Such a high duration is unusual and might indicate anomalies in data recording or entry errors.


We see that the values for the month, year, day, and season features make sense, but trip duration contains negative values.

Negative trip durations may have occurred due to data entry issues; the pickup and dropoff times might have been mixed up. We can investigate further to see how many negative values there are and either drop the corrupted rows or adjust them.

In [19]:
# Display a random sample of rows to confirm the new 'trip_duration' column
print(df[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']].sample(10))

         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
9055959   2023-04-26 18:14:12   2023-04-26 18:41:46          27.57
14349685  2023-06-29 08:07:34   2023-06-29 08:31:40          24.10
4925425   2023-03-06 21:08:00   2023-03-06 21:15:02           7.03
2990179   2023-02-10 08:16:01   2023-02-10 09:01:57          45.93
3166761   2023-02-12 09:59:07   2023-02-12 10:03:38           4.52
18298363  2023-08-25 13:36:18   2023-08-25 13:47:12          10.90
15594783  2023-07-17 21:09:05   2023-07-17 21:18:32           9.45
289346    2023-01-05 16:57:10   2023-01-05 17:54:18          57.13
21311997  2023-10-07 17:06:01   2023-10-07 17:25:21          19.33
7289224   2023-04-05 07:45:52   2023-04-05 07:58:19          12.45


In [20]:
# Display cases with negative trip_duration
negative_durations = df[df['trip_duration'] < 0]
negative_durations[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]

negative_durations.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,725,725,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0
mean,2023-10-21 16:34:57.008276,2023-10-21 15:54:57.144827,1.36,4.09,150.77,150.19,1.11,22.08,1.43,3.44,30.41,0.0,0.02,2023.0,10.48,6.4,2.17,-40.0
min,2023-01-23 10:43:58,2023-01-23 10:29:26,1.0,1.0,3.0,4.0,1.0,7.2,0.0,0.0,11.64,0.0,0.0,2023.0,1.0,1.0,1.0,-165.05
25%,2023-11-05 01:44:13,2023-11-05 01:02:16,1.0,2.0,95.0,87.0,1.0,13.5,1.0,1.0,21.12,0.0,0.0,2023.0,11.0,5.0,1.0,-47.88
50%,2023-11-05 01:51:24,2023-11-05 01:06:24,1.0,3.3,148.0,144.0,1.0,19.1,1.0,3.14,26.4,0.0,0.0,2023.0,11.0,5.0,1.0,-43.37
75%,2023-11-05 01:55:58,2023-11-05 01:10:52,2.0,5.2,229.0,231.0,1.0,27.5,1.0,5.0,36.48,0.0,0.0,2023.0,11.0,5.0,1.0,-35.1
max,2023-12-31 09:20:00,2023-12-31 09:10:02,4.0,19.7,264.0,265.0,4.0,80.0,5.0,20.67,103.36,0.0,1.75,2023.0,12.0,31.0,18.0,-0.03
std,,,0.69,2.94,67.35,74.64,0.42,11.56,1.12,3.05,13.6,0.0,0.17,0.0,1.82,4.94,3.6,13.58


In [21]:
# Check for possible datetime swaps or errors
swapped_cases = df[df['tpep_pickup_datetime'] > df['tpep_dropoff_datetime']]
print(swapped_cases[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']])


         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
1631038   2023-01-23 10:43:58   2023-01-23 10:29:26         -14.53
2397097   2023-02-02 13:02:23   2023-02-02 12:50:35         -11.80
2397098   2023-02-02 13:59:20   2023-02-02 13:15:43         -43.62
2482045   2023-02-03 13:45:00   2023-02-03 13:44:50          -0.17
2991721   2023-02-10 09:40:22   2023-02-10 09:20:58         -19.40
...                       ...                   ...            ...
24565445  2023-11-15 14:08:00   2023-11-15 14:04:33          -3.45
24970707  2023-11-20 07:55:00   2023-11-20 07:46:43          -8.28
25300720  2023-11-25 14:53:11   2023-11-25 14:53:09          -0.03
25305059  2023-11-25 15:50:50   2023-11-25 15:50:05          -0.75
28052000  2023-12-31 09:20:00   2023-12-31 09:10:02          -9.97

[725 rows x 3 columns]


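As we can see, there are only 725 rows with a negative duration, a very small share of the roughly 28 million records. Rather than trying to work out which timestamps should be swapped, we drop every row whose trip duration is less than or equal to 0. The summary below shows 28,092,619 rows remaining, so 1,162 rows are removed in total: the 725 negative durations plus 437 trips with a duration of exactly 0.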
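
In the summary of negative rows above, the 25th, 50th, and 75th percentile pickup times all fall on 2023-11-05 between 01:00 and 02:00. That is the night US daylight saving time ended in 2023, when the 01:00 hour occurs twice on local clocks, so a trip that starts before the change and ends after it can appear to finish before it began. This is an interpretation rather than something the data confirms. A sketch of how to check whether negative durations cluster on one date, using made-up rows that imitate the pattern:

```python
import pandas as pd

# Hypothetical rows imitating the pattern in the summary above
demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-11-05 01:44:13", "2023-11-05 01:51:24",
        "2023-11-05 01:55:58", "2023-02-02 13:02:23",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-11-05 01:02:16", "2023-11-05 01:06:24",
        "2023-11-05 01:10:52", "2023-02-02 12:50:35",
    ]),
})
demo["trip_duration"] = (
    demo["tpep_dropoff_datetime"] - demo["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Count negative durations per pickup date to see whether they cluster on one day
negative = demo[demo["trip_duration"] < 0]
per_day = negative["tpep_pickup_datetime"].dt.date.value_counts()
print(per_day)
```

Applied to the real data, the same `value_counts` call on `negative_durations` would show how many of the 725 rows fall on that single date.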

In [22]:
df = df[df['trip_duration']>0]
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,28092619,28092619,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0,28092619.0
mean,2023-07-01 05:42:14.805966,2023-07-01 06:02:36.639535,1.4,4.26,163.46,162.04,1.19,22.85,1.68,4.07,32.75,0.01,0.17,2023.0,6.49,15.5,14.33,20.36
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0,2023.0,1.0,1.0,0.0,0.02
25%,2023-04-01 18:19:32,2023-04-01 18:38:51,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.6,0.0,0.0,2023.0,4.0,8.0,11.0,10.23
50%,2023-06-25 02:26:07,2023-06-25 02:42:17,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0,2023.0,6.0,15.0,15.0,15.2
75%,2023-10-04 20:34:21,2023-10-04 20:51:54.500000,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0,2023.0,10.0,23.0,19.0,23.43
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75,2023.0,12.0,31.0,23.0,7053.62
std,,,0.89,4.82,63.28,70.92,0.44,18.39,1.91,4.32,23.48,0.12,0.51,0.0,3.45,8.7,5.9,43.02


## Removing Long Trip Durations

NYC regulations limit how long drivers may carry passengers. According to the TLC: "Both taxi and FHV drivers are prohibited from transporting passengers for more than 10 hours in any 24-hour period and for more than 60 hours in a calendar week (Monday-Sunday). TLC will review driver hours using the trip records it receives from FHV bases and through the TPEP and LPEP systems."

https://www.nyc.gov/site/tlc/about/fatigued-driving-prevention-frequently-asked-questions.page

However, because the dataset has no driver IDs, we cannot track hours per driver. Instead, we treat any single trip longer than 24 hours as invalid and keep only trips of 24 hours or less.
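
The duration window can be expressed as a single boolean mask. This is a minimal sketch on made-up values; the one-minute lower bound mirrors the filter applied in the cells that follow, and the constant names are hypothetical.

```python
import pandas as pd

# Hypothetical durations in minutes
demo = pd.DataFrame({"trip_duration": [0.5, 1.0, 23.43, 1439.97, 1440.0, 7053.62]})

MIN_MINUTES = 1          # lower bound: drop trips shorter than one minute
MAX_MINUTES = 24 * 60    # upper bound: 24 hours expressed in minutes

# Series.between is inclusive on both ends by default
keep = demo["trip_duration"].between(MIN_MINUTES, MAX_MINUTES)
filtered = demo[keep].copy()  # copy so later column assignments do not warn
print(filtered)
```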

In [25]:
# Convert trip duration from minutes to hours.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# direct column assignment triggers on the filtered frame.
df = df.assign(trip_duration_hours=df['trip_duration'] / 60)


In [26]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours
0,2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.1,43,237,1,7.9,1.0,4.0,...,night,winter,low,weekend,2023,1,1,0,6.32,0.11
1,2023-01-01 00:25:04,2023-01-01 00:37:49,1,2.51,48,238,1,14.9,1.0,15.0,...,night,winter,low,weekend,2023,1,1,0,12.75,0.21
2,2023-01-01 00:10:29,2023-01-01 00:21:19,1,1.43,107,79,1,11.4,1.0,3.28,...,night,winter,low,weekend,2023,1,1,0,10.83,0.18
3,2023-01-01 00:50:34,2023-01-01 01:02:52,1,1.84,161,137,1,12.8,1.0,10.0,...,night,winter,low,weekend,2023,1,1,0,12.3,0.21
4,2023-01-01 00:09:22,2023-01-01 00:19:49,1,1.66,239,143,1,12.1,1.0,3.42,...,night,winter,low,weekend,2023,1,1,0,10.45,0.17


In [27]:
# Keep trips that last at least one minute and at most 24 hours
df = df[df['trip_duration'] >= 1]
df = df[df['trip_duration_hours'] <= 24]

In [28]:
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours
count,28083844,28083844,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0,28083844.0
mean,2023-07-01 05:48:27.298996,2023-07-01 06:08:48.832929,1.4,4.26,163.47,162.04,1.19,22.85,1.68,4.07,32.74,0.01,0.17,2023.0,6.49,15.5,14.33,20.36,0.34
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0,2023.0,1.0,1.0,0.0,1.0,0.02
25%,2023-04-01 18:23:26,2023-04-01 18:42:53,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.65,0.0,0.0,2023.0,4.0,8.0,11.0,10.25,0.17
50%,2023-06-25 02:36:00.500000,2023-06-25 02:52:05,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0,2023.0,6.0,15.0,15.0,15.2,0.25
75%,2023-10-04 20:37:51,2023-10-04 20:55:18.250000,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0,2023.0,10.0,23.0,19.0,23.43,0.39
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75,2023.0,12.0,31.0,23.0,1439.97,24.0
std,,,0.89,4.82,63.28,70.92,0.44,18.38,1.91,4.31,23.47,0.12,0.51,0.0,3.45,8.7,5.9,42.6,0.71


## Taxi Zones Feature
Taxi zone IDs are informative, but on their own they do not tell us where a passenger was picked up. Neighbourhoods are thought to affect pricing, at least when hailing a cab, so we merge the Taxi Zone dataset with the NYC trip data on zone IDs and identify the pickup and dropoff zones and boroughs for each trip.
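
A left merge emits one output row for every matching row in the lookup table, so repeated `LocationID` values in the zones file would silently duplicate trips. In this notebook, the borough counts and null counts printed after the merge add up to 28,095,081 rows for both pickup and dropoff, compared with 28,083,844 rows before it, which would be consistent with repeated IDs. The sketch below shows the effect and one possible guard on made-up data; `zones_demo`, `trips_demo`, and `lookup` are hypothetical names.

```python
import pandas as pd

# Hypothetical lookup in which LocationID 56 appears twice
zones_demo = pd.DataFrame({
    "LocationID": [43, 56, 56],
    "zone": ["Central Park", "Corona", "Corona"],
    "borough": ["Manhattan", "Queens", "Queens"],
})
trips_demo = pd.DataFrame({"PULocationID": [43, 56, 264]})

# A plain left merge emits one row per matching lookup row, so the trip count grows
inflated = trips_demo.merge(zones_demo, left_on="PULocationID",
                            right_on="LocationID", how="left")
print(len(trips_demo), "->", len(inflated))

# Keeping one row per LocationID and validating the relationship prevents that
lookup = zones_demo.drop_duplicates(subset="LocationID")
merged = trips_demo.merge(lookup, left_on="PULocationID", right_on="LocationID",
                          how="left", validate="many_to_one")
print(len(trips_demo), "->", len(merged))
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the right-hand keys are not unique, so the problem surfaces at merge time instead of in later row counts.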

In [29]:
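# Merge the zone data into the main taxi trip dataset for pickup locations.
# Note: a left merge repeats a trip row for every matching row in `zones`, so
# LocationID must be unique in `zones` for the trip count to stay unchanged.
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='PULocationID', right_on='LocationID', how='left')
df.rename(columns={'zone': 'PUzone', 'borough': 'PUborough'}, inplace=True)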

# Merge the zone data for dropoff locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='DOLocationID', right_on='LocationID', how='left', suffixes=('', '_drop'))
df.rename(columns={'zone': 'DOzone', 'borough': 'DOborough'}, inplace=True)

# Drop the extra LocationID columns if they are not needed
df.drop(['LocationID', 'LocationID_drop'], axis=1, inplace=True)


In [30]:
print(df['PUborough'].value_counts())
print(df['DOborough'].value_counts())

PUborough
Manhattan        24179274
Queens            3456221
Brooklyn           160639
Bronx               43036
Staten Island        1302
EWR                   354
Name: count, dtype: int64
DOborough
Manhattan        24289383
Queens            1765124
Brooklyn          1343260
Bronx              198726
EWR                102830
Staten Island        9273
Name: count, dtype: int64


In [31]:
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())

PUzone       254255
PUborough    254255
DOzone       386485
DOborough    386485
dtype: int64


In [32]:
print(sorted(zones['LocationID'].unique()))
print(sorted(df['PULocationID'].unique()))
print(sorted(df['DOLocationID'].unique()))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 2

The zones dataset has no entries for IDs 264 and 265, so trips with these IDs have no borough. The printed zone list also skips 57, 104, and 105, so below we check exactly which IDs in the trip data are missing from the zones dataset.
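
One possible way to handle trips whose zone ID has no match, short of dropping them, is to label them explicitly so they form their own category. This is only a sketch of that option on made-up rows, not a decision taken in this notebook; the `"Unknown"` label is an assumption.

```python
import pandas as pd

# Hypothetical merged rows; None marks a LocationID with no match in the zones file
merged = pd.DataFrame({
    "PULocationID": [163, 264],
    "PUzone": ["Midtown North", None],
    "PUborough": ["Manhattan", None],
})

zone_columns = ["PUzone", "PUborough"]
# Label unmatched zones explicitly so they form their own category downstream
merged[zone_columns] = merged[zone_columns].fillna("Unknown")
print(merged)
```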

In [33]:
missing_pu = df[~df['PULocationID'].isin(zones['LocationID'])]
missing_do = df[~df['DOLocationID'].isin(zones['LocationID'])]
print(f"Missing PULocationIDs: {missing_pu['PULocationID'].unique()}")
print(f"Missing DOLocationIDs: {missing_do['DOLocationID'].unique()}")

Missing PULocationIDs: [264 265  57 105]
Missing DOLocationIDs: [264 265  57 105]


In [34]:
# Filter data for PULocationID or DOLocationID being 264 or 265
trips = df[(df['PULocationID'].isin([264, 265])) | (df['DOLocationID'].isin([264, 265]))]

# Print the filtered data summary
trips.describe(include='all')


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough
count,442934,442934,442934.0,442934.0,442934.0,442934.0,442934.0,442934.0,442934.0,442934.0,...,442934.0,442934.0,442934.0,442934.0,442934.0,442934.0,188756,188756,57134,57134
unique,,,,,,,,,,,...,,,,,,,240,6,252,6
top,,,,,,,,,,,...,,,,,,,JFK Airport,Queens,Times Sq/Theatre District,Manhattan
freq,,,,,,,,,,,...,,,,,,,71420,99658,2105,44034
mean,2023-06-22 03:12:27.957560,2023-06-22 03:40:22.040664,1.42,8.87,213.55,250.04,1.23,46.47,1.69,6.36,...,2023.0,6.19,15.5,14.29,27.9,0.47,,,,
min,2023-01-01 00:02:13,2023-01-01 00:15:27,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,...,2023.0,1.0,1.0,0.0,1.0,0.02,,,,
25%,2023-03-25 00:42:37,2023-03-25 00:59:33.250000,1.0,1.84,138.0,264.0,1.0,13.5,0.0,0.0,...,2023.0,3.0,8.0,10.0,12.38,0.21,,,,
50%,2023-06-16 22:15:28.500000,2023-06-16 22:37:35,1.0,4.3,264.0,264.0,1.0,26.1,1.0,3.5,...,2023.0,6.0,15.0,15.0,20.32,0.34,,,,
75%,2023-09-17 17:14:13.750000,2023-09-17 17:46:59,2.0,13.56,264.0,265.0,1.0,68.8,2.5,8.0,...,2023.0,9.0,23.0,19.0,34.38,0.57,,,,
max,2023-12-31 23:53:07,2023-12-31 23:59:46,6.0,50.0,265.0,265.0,4.0,300.0,11.75,280.0,...,2023.0,12.0,31.0,23.0,1439.82,24.0,,,,


In [35]:
# Display sample records
trips.sample(10)


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough
2397321,2023-02-02 13:16:49,2023-02-02 13:33:50,1,2.38,264,264,1,16.3,0.0,4.45,...,2023,2,2,13,17.02,0.28,,,,
13576769,2023-06-19 14:58:13,2023-06-19 15:14:23,1,2.62,264,264,1,16.3,0.0,2.0,...,2023,6,19,14,16.17,0.27,,,,
6311196,2023-03-23 21:08:42,2023-03-23 21:30:49,1,8.27,163,265,2,55.2,1.0,0.0,...,2023,3,23,21,22.12,0.37,Midtown North,Manhattan,,
6872243,2023-03-30 22:26:28,2023-03-30 23:47:38,2,30.64,132,265,1,100.0,0.0,24.31,...,2023,3,30,22,81.17,1.35,JFK Airport,Queens,,
22718440,2023-10-24 19:19:47,2023-10-24 19:36:47,1,2.63,264,264,1,17.0,2.5,3.0,...,2023,10,24,19,17.0,0.28,,,,
6907614,2023-03-31 12:25:47,2023-03-31 12:39:19,4,1.31,264,264,2,12.8,0.0,0.0,...,2023,3,31,12,13.53,0.23,,,,
5980722,2023-03-19 15:10:55,2023-03-19 15:39:34,1,4.13,264,264,1,26.8,0.0,3.0,...,2023,3,19,15,28.65,0.48,,,,
16889802,2023-08-04 13:40:55,2023-08-04 14:12:28,3,5.94,100,265,1,34.5,0.0,21.0,...,2023,8,4,13,31.55,0.53,Garment District,Manhattan,,
16268607,2023-07-26 21:06:30,2023-07-26 21:15:09,1,1.56,264,264,1,10.7,1.0,2.5,...,2023,7,26,21,8.65,0.14,,,,
1320162,2023-01-19 11:23:10,2023-01-19 11:34:30,1,6.06,264,132,4,79.0,0.0,0.0,...,2023,1,19,11,11.33,0.19,,,JFK Airport,Queens


In [36]:
# Manually assign zones for IDs 264 and 265
df.loc[df['PULocationID'] == 264, ['PUzone', 'PUborough']] = ['Outside NYC', 'Unknown']
df.loc[df['DOLocationID'] == 264, ['DOzone', 'DOborough']] = ['Outside NYC', 'Unknown']
df.loc[df['PULocationID'] == 265, ['PUzone', 'PUborough']] = ['Airport Area', 'Unknown']
df.loc[df['DOLocationID'] == 265, ['DOzone', 'DOborough']] = ['Airport Area', 'Unknown']


In [37]:
# Check for null values in the updated columns
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())



PUzone        86
PUborough     86
DOzone       688
DOborough    688
dtype: int64


In [38]:
# Print rows where PUzone or PUborough is null
print("Rows with missing PUzone or PUborough:")
print(df[df['PUzone'].isnull() | df['PUborough'].isnull()][['PULocationID', 'PUzone', 'PUborough']].head())

# Print rows where DOzone or DOborough is null
print("Rows with missing DOzone or DOborough:")
print(df[df['DOzone'].isnull() | df['DOborough'].isnull()][['DOLocationID', 'DOzone', 'DOborough']].head())


Rows with missing PUzone or PUborough:
         PULocationID PUzone PUborough
196727             57    NaN       NaN
464270             57    NaN       NaN
1379932            57    NaN       NaN
2099653            57    NaN       NaN
2314105            57    NaN       NaN
Rows with missing DOzone or DOborough:
       DOLocationID DOzone DOborough
1122             57    NaN       NaN
16081            57    NaN       NaN
16283            57    NaN       NaN
19910            57    NaN       NaN
73022            57    NaN       NaN


In [39]:
# List unique LocationIDs associated with null zones or boroughs
missing_pu_ids = df[df['PUzone'].isnull()]['PULocationID'].unique()
missing_do_ids = df[df['DOzone'].isnull()]['DOLocationID'].unique()
print(f"Missing PULocationIDs: {missing_pu_ids}")
print(f"Missing DOLocationIDs: {missing_do_ids}")


Missing PULocationIDs: [ 57 105]
Missing DOLocationIDs: [ 57 105]


In [40]:
# Manually assign zones and boroughs for LocationID 57 and 105
df.loc[df['PULocationID'] == 57, ['PUzone', 'PUborough']] = ['Corona', 'Queens']
df.loc[df['DOLocationID'] == 57, ['DOzone', 'DOborough']] = ['Corona', 'Queens']

df.loc[df['PULocationID'] == 105, ['PUzone', 'PUborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
df.loc[df['DOLocationID'] == 105, ['DOzone', 'DOborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
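
The manual assignments above repeat the same pattern for each ID and each side of the trip. One possible consolidation is sketched below on a toy frame; the names `manual_zones` and `patch_zones` are hypothetical helpers introduced for illustration, and the mapping simply mirrors the values assigned in the cell above.

```python
import pandas as pd

# Hypothetical mapping of LocationID -> (zone, borough), mirroring the manual values above
manual_zones = {
    57: ("Corona", "Queens"),
    105: ("Governor's Island/Ellis Island/Liberty Island", "Manhattan"),
}

def patch_zones(frame: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Fill pickup and drop-off zone/borough for the LocationIDs listed in `mapping`."""
    out = frame.copy()
    for prefix in ("PU", "DO"):
        for loc_id, (zone, borough) in mapping.items():
            mask = out[f"{prefix}LocationID"] == loc_id
            out.loc[mask, f"{prefix}zone"] = zone
            out.loc[mask, f"{prefix}borough"] = borough
    return out

# Toy frame with missing zone information (illustrative values only)
demo = pd.DataFrame({
    "PULocationID": [57, 132],
    "DOLocationID": [105, 57],
    "PUzone": [None, "JFK Airport"],
    "PUborough": [None, "Queens"],
    "DOzone": [None, None],
    "DOborough": [None, None],
})
patched = patch_zones(demo, manual_zones)
print(patched[["PUzone", "PUborough", "DOzone", "DOborough"]])
```

Keeping the overrides in one dictionary makes it easier to review which IDs were patched by hand.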


In [41]:
# Verify updates for LocationID 57
print("Updated zones and boroughs for LocationID 57:")
print(df[df['PULocationID'] == 57][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 57][['DOLocationID', 'DOzone', 'DOborough']].head(2))

# Verify updates for LocationID 105
print("Updated zones and boroughs for LocationID 105:")
print(df[df['PULocationID'] == 105][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 105][['DOLocationID', 'DOzone', 'DOborough']].head(2))


Updated zones and boroughs for LocationID 57:
        PULocationID  PUzone PUborough
196727            57  Corona    Queens
464270            57  Corona    Queens
       DOLocationID  DOzone DOborough
1122             57  Corona    Queens
16081            57  Corona    Queens
Updated zones and boroughs for LocationID 105:
         PULocationID                                         PUzone  \
3487541           105  Governor's Island/Ellis Island/Liberty Island   
9894599           105  Governor's Island/Ellis Island/Liberty Island   

         PUborough  
3487541  Manhattan  
9894599  Manhattan  
         DOLocationID                                         DOzone  \
2243969           105  Governor's Island/Ellis Island/Liberty Island   
3284279           105  Governor's Island/Ellis Island/Liberty Island   

         DOborough  
2243969  Manhattan  
3284279  Manhattan  


In [42]:
# Check again for null values in the zone and borough columns
print("Null values in PUzone and PUborough after update:")
print(df[['PUzone', 'PUborough']].isnull().sum())

print("Null values in DOzone and DOborough after update:")
print(df[['DOzone', 'DOborough']].isnull().sum())


Null values in PUzone and PUborough after update:
PUzone       0
PUborough    0
dtype: int64
Null values in DOzone and DOborough after update:
DOzone       0
DOborough    0
dtype: int64


In [43]:
df.isnull().sum()

tpep_pickup_datetime        0
tpep_dropoff_datetime       0
passenger_count             0
trip_distance               0
PULocationID                0
DOLocationID                0
payment_type                0
fare_amount                 0
extra                       0
tip_amount                  0
total_amount                0
JFK_LGA_Pickup_Fee          0
General_Airport_Fee         0
distance_bins               0
pickup_time_of_day          0
pickup_season               0
passenger_count_category    0
pickup_day_type             0
transaction_year            0
transaction_month           0
transaction_day             0
transaction_hour            0
trip_duration               0
trip_duration_hours         0
PUzone                      0
PUborough                   0
DOzone                      0
DOborough                   0
dtype: int64

## Holiday
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

In [44]:
from pandas.tseries.holiday import USFederalHolidayCalendar

# Create a calendar object
calendar = USFederalHolidayCalendar()

# Define the range for your data
start_date = '2023-01-01'
end_date = '2023-12-31'

# Generate holidays
holidays = calendar.holidays(start=start_date, end=end_date)

# Add a column to your dataframe indicating whether the trip started on a holiday
df['is_holiday'] = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)

In [45]:
df['is_holiday'].describe()

count   28095081.00
mean           0.02
std            0.15
min            0.00
25%            0.00
50%            0.00
75%            0.00
max            1.00
Name: is_holiday, dtype: float64

In [46]:
holidays = df[df['is_holiday']==1]
holidays.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,is_holiday
count,631147,631147,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0,631147.0
mean,2023-07-08 13:05:44.911030,2023-07-08 13:24:43.357978,1.47,4.93,161.23,159.49,1.21,24.14,1.38,4.16,34.06,0.04,0.21,2023.0,6.76,14.3,14.45,18.97,0.32,1.0
min,2023-01-02 00:00:03,2023-01-02 00:05:01,1.0,1.0,1.0,1.0,1.0,3.0,0.0,0.0,5.2,0.0,0.0,2023.0,1.0,2.0,0.0,1.0,0.02,1.0
25%,2023-02-20 18:42:45,2023-02-20 18:59:13,1.0,1.57,132.0,107.0,1.0,11.4,0.0,1.0,17.64,0.0,0.0,2023.0,2.0,9.0,11.0,9.18,0.15,1.0
50%,2023-07-04 14:31:59,2023-07-04 14:48:15,1.0,2.47,161.0,161.0,1.0,15.6,1.0,3.08,23.0,0.0,0.0,2023.0,7.0,16.0,15.0,13.87,0.23,1.0
75%,2023-11-10 08:20:55,2023-11-10 08:41:02.500000,2.0,5.7,230.0,233.0,1.0,28.9,2.5,5.0,38.1,0.0,0.0,2023.0,11.0,20.0,19.0,22.02,0.37,1.0
max,2023-12-25 23:59:57,2023-12-26 22:46:23,6.0,50.0,265.0,265.0,4.0,299.0,12.75,222.21,379.81,1.25,1.75,2023.0,12.0,29.0,23.0,1439.88,24.0,1.0
std,,,0.93,5.43,62.07,71.94,0.46,20.36,1.91,4.65,25.95,0.22,0.56,0.0,3.91,8.48,5.67,44.11,0.74,0.0


In [47]:
holidays.sample(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough,is_holiday
13566789,2023-06-19 12:28:29,2023-06-19 12:36:58,2,1.29,143,163,1,10.0,0.0,2.8,...,6,19,12,8.48,0.14,Lincoln Square West,Manhattan,Midtown North,Manhattan,1
14671491,2023-07-04 16:18:45,2023-07-04 16:38:52,1,8.09,132,265,1,43.6,0.0,9.52,...,7,4,16,20.12,0.34,JFK Airport,Queens,Airport Area,Unknown,1
82836,2023-01-02 15:30:08,2023-01-02 15:42:56,1,3.1,234,141,1,12.0,2.5,3.2,...,1,2,15,12.8,0.21,Union Sq,Manhattan,Lenox Hill West,Manhattan,1
21474466,2023-10-09 21:57:05,2023-10-09 22:11:15,1,1.9,79,249,1,12.1,3.5,3.4,...,10,9,21,14.17,0.24,East Village,Manhattan,West Village,Manhattan,1
24176896,2023-11-10 19:11:31,2023-11-10 19:25:28,1,2.16,162,263,2,14.2,0.0,0.0,...,11,10,19,13.95,0.23,Midtown East,Manhattan,Yorkville West,Manhattan,1
25187506,2023-11-23 11:54:55,2023-11-23 12:36:31,2,5.35,143,233,1,38.0,0.0,5.0,...,11,23,11,41.6,0.69,Lincoln Square West,Manhattan,UN/Turtle Bay South,Manhattan,1
18952795,2023-09-04 16:54:16,2023-09-04 17:11:50,1,1.71,186,233,1,16.3,0.0,2.0,...,9,4,16,17.57,0.29,Penn Station/Madison Sq West,Manhattan,UN/Turtle Bay South,Manhattan,1
24141018,2023-11-10 13:13:52,2023-11-10 13:30:18,2,1.62,170,141,1,14.9,0.0,4.72,...,11,10,13,16.43,0.27,Murray Hill,Manhattan,Lenox Hill West,Manhattan,1
13601442,2023-06-19 20:01:49,2023-06-19 20:13:41,1,2.0,114,186,1,12.8,3.5,4.45,...,6,19,20,11.87,0.2,Greenwich Village South,Manhattan,Penn Station/Madison Sq West,Manhattan,1
11845403,2023-05-29 15:55:27,2023-05-29 16:50:26,2,20.26,132,211,2,70.0,0.0,0.0,...,5,29,15,54.98,0.92,JFK Airport,Queens,SoHo,Manhattan,1


In [48]:
df.sample(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough,is_holiday
15807939,2023-07-20 17:59:54,2023-07-20 18:21:31,1,2.3,237,224,1,19.8,2.5,3.0,...,7,20,17,21.62,0.36,Upper East Side South,Manhattan,Stuy Town/Peter Cooper Village,Manhattan,0
16603348,2023-07-31 14:28:09,2023-07-31 14:44:04,1,2.7,231,186,1,17.0,0.0,4.2,...,7,31,14,15.92,0.27,TriBeCa/Civic Center,Manhattan,Penn Station/Madison Sq West,Manhattan,0
10380172,2023-05-12 09:29:47,2023-05-12 09:53:04,5,3.06,230,211,2,21.2,0.0,0.0,...,5,12,9,23.28,0.39,Times Sq/Theatre District,Manhattan,SoHo,Manhattan,0
18208137,2023-08-24 06:43:22,2023-08-24 06:46:51,1,1.12,151,239,1,7.2,0.0,2.24,...,8,24,6,3.48,0.06,Manhattan Valley,Manhattan,Upper West Side South,Manhattan,0
21275632,2023-10-07 07:17:12,2023-10-07 07:37:06,1,5.67,209,188,1,27.5,0.0,4.0,...,10,7,7,19.9,0.33,Seaport,Manhattan,Prospect-Lefferts Gardens,Brooklyn,0
26078865,2023-12-05 10:22:50,2023-12-05 10:37:30,1,1.2,170,237,1,12.8,2.5,4.2,...,12,5,10,14.67,0.24,Murray Hill,Manhattan,Upper East Side South,Manhattan,0
5788320,2023-03-17 11:24:41,2023-03-17 11:42:59,1,3.42,163,68,1,19.8,0.0,4.76,...,3,17,11,18.3,0.3,Midtown North,Manhattan,East Chelsea,Manhattan,0
12643502,2023-06-08 08:07:37,2023-06-08 08:25:55,1,9.32,140,138,1,37.3,5.0,4.0,...,6,8,8,18.3,0.3,Lenox Hill East,Manhattan,LaGuardia Airport,Queens,0
19630498,2023-09-12 21:45:42,2023-09-12 22:04:18,2,2.0,142,186,1,17.7,3.5,5.0,...,9,12,21,18.6,0.31,Lincoln Square East,Manhattan,Penn Station/Madison Sq West,Manhattan,0
12872122,2023-06-10 19:07:31,2023-06-10 19:23:34,3,2.71,211,48,1,17.0,0.0,6.3,...,6,10,19,16.05,0.27,SoHo,Manhattan,Clinton East,Manhattan,0


In [49]:
# Filter the DataFrame to include only the rows where 'is_holiday' is 1
holiday_trips = df[df['is_holiday'] == 1].copy()  # Adding .copy() to avoid SettingWithCopyWarning on a slice

# Use loc to safely create a new column 'month_day'
holiday_trips.loc[:, 'month_day'] = list(zip(holiday_trips['transaction_month'], holiday_trips['transaction_day']))

# Find unique (month, day) pairs
unique_holiday_dates = pd.unique(holiday_trips['month_day'])


In [50]:
print(unique_holiday_dates)


[(1, 2) (1, 16) (2, 20) (5, 29) (6, 19) (7, 4) (9, 4) (10, 9) (11, 23)
 (11, 10) (12, 25)]




The (month, day) pairs above correspond to the following U.S. federal holidays in 2023:

1. **(6, 19)** - June 19: Juneteenth National Independence Day, a federal holiday recognizing the emancipation of enslaved African Americans.
2. **(7, 4)** - July 4: Independence Day, a major national holiday in the United States celebrating the country's declaration of independence from the British Empire.
3. **(5, 29)** - May 29: This date in 2023 was Memorial Day, observed on the last Monday of May each year, honoring the military personnel who have died in the performance of their military duties.
4. **(11, 10)** - November 10: In 2023, this was the observed day for Veterans Day (November 11), as November 11 fell on a Saturday.
5. **(11, 23)** - November 23: This date in 2023 was Thanksgiving Day, a significant U.S. holiday celebrated on the fourth Thursday of November.
6. **(1, 2)** - January 2: In 2023, this was the observed day for New Year’s Day (January 1), as January 1st fell on a Sunday.
7. **(1, 16)** - January 16: Martin Luther King Jr. Day in 2023, celebrated on the third Monday of January to honor the civil rights leader.
8. **(10, 9)** - October 9: Columbus Day/Indigenous Peoples' Day in 2023, observed on the second Monday in October.
9. **(9, 4)** - September 4: Labor Day in 2023, which is celebrated on the first Monday of September and honors the American labor movement.
10. **(12, 25)** - December 25: Christmas Day, a major holiday across many cultures, marking the celebration of the birth of Jesus Christ.
11. **(2, 20)** - February 20: Presidents' Day in 2023, observed on the third Monday of February in honor of George Washington and other presidents.
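
`USFederalHolidayCalendar` returns observed dates, which is why pairs such as (1, 2) and (11, 10) appear in the output. A small sketch that lists the holiday names next to their observed dates, using the same 2023 date range as the notebook (the variable names are illustrative):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# return_name=True yields a Series of holiday names indexed by the observed date
named_holidays = USFederalHolidayCalendar().holidays(
    start="2023-01-01", end="2023-12-31", return_name=True
)
print(named_holidays)

# Veterans Day (November 11) fell on a Saturday in 2023,
# so the calendar lists the preceding Friday, November 10
veterans_observed = pd.Timestamp("2023-11-10") in named_holidays.index
veterans_actual = pd.Timestamp("2023-11-11") in named_holidays.index
print(veterans_observed, veterans_actual)
```

Because the flag follows observed dates, trips on the calendar date of a weekend holiday (for example January 1, 2023) are not marked as holiday trips.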



## Speed

**Speed**: By calculating the average speed of a trip (speed = distance / duration), we can examine whether faster trips are priced differently.

In [51]:
# trip_duration is in minutes; convert it to hours for the speed calculation
df['trip_duration_hours'] = df['trip_duration'] / 60.0

# Calculate average speed in miles per hour
df['speed_mph'] = df['trip_distance'] / df['trip_duration_hours']

# Replace infinite values (zero duration) with NaN, then fill NaN with 0 as a placeholder.
# Assigning the result back avoids the chained inplace call that pandas 3.0 will not support.
df['speed_mph'] = df['speed_mph'].replace([np.inf, -np.inf], np.nan).fillna(0)


In [52]:
df['speed_mph'].describe()

count   28095081.00
mean          12.26
std            7.48
min            0.04
25%            7.76
50%           10.34
75%           14.50
max         2777.14
Name: speed_mph, dtype: float64

### Analyzing the Summary Statistics

1.  **Count**: About 28.1 million data points (28,095,081) are present, indicating a large dataset.
2.  **Mean**: The average speed is approximately 12.26 mph, which seems reasonable for urban traffic.
3.  **Standard Deviation**: The standard deviation is 7.48 mph, which is moderate relative to the mean but is inflated by a small number of extreme values.
4.  **Min and Max**: The minimum speed is 0.04 mph, which is close to being stationary, but the maximum speed is 2,777.14 mph, which is unrealistic for any road vehicle and likely indicates data errors or extreme outliers.



In [53]:
df['speed_mph'].isna().sum()

0

In [59]:
max_speed = df[df['speed_mph']>500]

In [60]:
max_speed

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,transaction_day,transaction_hour,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough,is_holiday,speed_mph
30656,2023-01-01 13:28:59,2023-01-01 13:30:01,2,11.28,264,265,2,70.00,0.00,0.00,...,1,13,1.03,0.02,Outside NYC,Unknown,Airport Area,Unknown,0,654.97
40800,2023-01-01 16:54:35,2023-01-01 16:55:37,2,19.20,132,132,3,3.00,1.25,0.00,...,1,16,1.03,0.02,JFK Airport,Queens,JFK Airport,Queens,0,1114.84
81253,2023-01-02 14:31:24,2023-01-02 14:32:47,1,16.60,1,1,1,134.00,0.00,5.00,...,2,14,1.38,0.02,Newark Airport,EWR,Newark Airport,EWR,1,720.00
145055,2023-01-03 15:17:36,2023-01-03 15:18:38,2,18.90,161,161,3,70.00,2.50,0.00,...,3,15,1.03,0.02,Midtown Center,Manhattan,Midtown Center,Manhattan,0,1097.42
168383,2023-01-03 21:40:29,2023-01-03 21:41:51,1,19.50,151,151,1,3.70,1.00,10.00,...,3,21,1.37,0.02,Manhattan Valley,Manhattan,Manhattan Valley,Manhattan,0,856.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27491206,2023-12-21 14:43:48,2023-12-21 14:44:52,3,17.71,70,138,2,3.70,5.00,0.00,...,21,14,1.07,0.02,East Elmhurst,Queens,LaGuardia Airport,Queens,0,996.19
27645357,2023-12-23 16:54:40,2023-12-23 16:56:40,1,21.20,230,230,2,70.00,4.25,0.00,...,23,16,2.00,0.03,Times Sq/Theatre District,Manhattan,Times Sq/Theatre District,Manhattan,0,636.00
27695251,2023-12-24 16:33:59,2023-12-24 16:35:40,1,17.00,265,265,1,121.00,0.00,10.00,...,24,16,1.68,0.03,Airport Area,Unknown,Airport Area,Unknown,0,605.94
27750046,2023-12-26 07:28:10,2023-12-26 07:29:11,2,19.10,132,132,1,70.00,1.75,24.05,...,26,7,1.02,0.02,JFK Airport,Queens,JFK Airport,Queens,0,1127.21


A trip speed above 100 mph is unrealistic for taxi trips in the city, so we treat such trips as outliers and remove every row with a speed above 100 mph.
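
To see how such values arise, the speed of row 30656 in the table above can be recomputed by hand from its timestamps and distance. This is a worked check only; the variable names are illustrative.

```python
import pandas as pd

# Values copied from row 30656 in the table above
pickup = pd.Timestamp("2023-01-01 13:28:59")
dropoff = pd.Timestamp("2023-01-01 13:30:01")
trip_distance_miles = 11.28

# The trip lasted 62 seconds; convert to hours before dividing
duration_hours = (dropoff - pickup).total_seconds() / 3600
speed_mph = trip_distance_miles / duration_hours
print(round(speed_mph, 2))  # 654.97
```

A recorded distance of more than 11 miles covered in about one minute points to a logging error in either the distance or the timestamps rather than to a real trip.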

In [61]:
# Define a realistic maximum speed
max_realistic_speed = 100  # mph

# Filter the DataFrame to remove highly unrealistic speeds
cleaned_df = df[df['speed_mph'] <= max_realistic_speed]

# Optionally, inspect the data points with extreme speeds before removing them
extreme_speeds = df[df['speed_mph'] > max_realistic_speed]
print(extreme_speeds[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance', 'trip_duration', 'speed_mph']])



         tpep_pickup_datetime tpep_dropoff_datetime  trip_distance  \
14622     2023-01-01 03:04:43   2023-01-01 03:05:46           3.60   
30656     2023-01-01 13:28:59   2023-01-01 13:30:01          11.28   
36252     2023-01-01 15:23:49   2023-01-01 15:25:20           4.10   
40800     2023-01-01 16:54:35   2023-01-01 16:55:37          19.20   
41468     2023-01-01 16:48:59   2023-01-01 17:00:42          23.10   
...                       ...                   ...            ...   
27842689  2023-12-27 19:24:17   2023-12-27 19:33:19          30.13   
27942858  2023-12-29 14:15:48   2023-12-29 14:22:16          21.29   
27953528  2023-12-29 16:39:35   2023-12-29 16:42:13          11.80   
27980807  2023-12-29 23:06:44   2023-12-29 23:07:49          11.10   
27987447  2023-12-30 03:55:56   2023-12-30 03:57:29           3.60   

          trip_duration  speed_mph  
14622              1.05     205.71  
30656              1.03     654.97  
36252              1.52     162.20  
40800      

In [62]:
# Describe the speed statistics after removing extreme data points
new_speed_stats = cleaned_df['speed_mph'].describe()
print(new_speed_stats)


count   28093950.00
mean          12.25
std            6.88
min            0.04
25%            7.76
50%           10.34
75%           14.50
max          100.00
Name: speed_mph, dtype: float64
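
Comparing the two `describe()` outputs, the count falls from 28,095,081 to 28,093,950, so the 100 mph cap removes 1,131 rows. A minimal sketch of the same filter as a reusable helper that also reports how many rows it drops; `drop_implausible_speeds` is a hypothetical name and the demo values are illustrative.

```python
import pandas as pd

def drop_implausible_speeds(frame: pd.DataFrame, max_mph: float = 100.0):
    """Return the rows at or below `max_mph` and the number of rows dropped.

    Rows with a NaN speed fail the comparison and are counted as dropped.
    """
    keep = frame["speed_mph"] <= max_mph
    return frame.loc[keep].copy(), int((~keep).sum())

# Toy speeds, including two values taken from the extreme rows shown earlier
demo = pd.DataFrame({"speed_mph": [7.8, 12.3, 100.0, 654.97, 1114.84]})
kept, n_dropped = drop_implausible_speeds(demo)
print(len(kept), n_dropped)
```

Returning a copy avoids `SettingWithCopyWarning` if new columns are later added to the filtered frame.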


In [63]:
cleaned_df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,is_holiday,speed_mph
count,28093950,28093950,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,...,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0,28093950.0
mean,2023-07-01 05:47:54.167641,2023-07-01 06:08:15.898099,1.4,4.26,163.45,162.0,1.19,22.85,1.68,4.07,...,0.01,0.17,2023.0,6.49,15.5,14.33,20.36,0.34,0.02,12.25
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,...,0.0,0.0,2023.0,1.0,1.0,0.0,1.0,0.02,0.0,0.04
25%,2023-04-01 18:22:08.250000,2023-04-01 18:41:26,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,...,0.0,0.0,2023.0,4.0,8.0,11.0,10.25,0.17,0.0,7.76
50%,2023-06-25 02:38:31.500000,2023-06-25 02:54:31,1.0,2.27,161.0,162.0,1.0,16.3,1.0,3.28,...,0.0,0.0,2023.0,6.0,15.0,15.0,15.2,0.25,0.0,10.34
75%,2023-10-04 20:36:35.750000,2023-10-04 20:54:02.750000,1.0,4.36,231.0,234.0,1.0,25.4,2.5,5.0,...,0.0,0.0,2023.0,10.0,23.0,19.0,23.43,0.39,0.0,14.5
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,...,1.25,1.75,2023.0,12.0,31.0,23.0,1439.97,24.0,1.0,100.0
std,,,0.89,4.82,63.28,70.94,0.44,18.37,1.91,4.31,...,0.12,0.51,0.0,3.45,8.7,5.9,42.6,0.71,0.15,6.88


## Testing New Feature Validity

We need to check that the created features fall within their expected bounds and that our code worked properly.

In [64]:
def test_trip_duration_positive():
    assert cleaned_df['trip_duration'].min() > 0, "Error: Non-positive trip durations present in the dataset."


In [65]:
def test_time_of_day_categories():
    hours = cleaned_df['tpep_pickup_datetime'].dt.hour
    conditions = [
        ((hours >= 5) & (hours <= 11)),
        ((hours >= 12) & (hours <= 17)),
        ((hours >= 18) & (hours <= 23)),
        (hours < 5)  # .dt.hour ranges from 0 to 23, so no check for 24 is needed
    ]
    categories = ['morning', 'afternoon', 'evening', 'night']
    for condition, category in zip(conditions, categories):
        assert all(cleaned_df.loc[condition, 'pickup_time_of_day'] == category), f"Error in categorizing {category}."
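
The hour boundaries asserted in this test can be expressed as a single binning step. The sketch below assumes the same boundaries (night 0-4, morning 5-11, afternoon 12-17, evening 18-23) and uses `pd.cut`; it is an illustration of the bucketing, not necessarily the code that created `pickup_time_of_day` earlier in the notebook.

```python
import pandas as pd

# One hour from each edge of every bucket
hours = pd.Series([0, 4, 5, 11, 12, 17, 18, 23])

# Bins are right-inclusive: (-1, 4], (4, 11], (11, 17], (17, 23]
time_of_day = pd.cut(
    hours,
    bins=[-1, 4, 11, 17, 23],
    labels=["night", "morning", "afternoon", "evening"],
).astype(str)
print(dict(zip(hours, time_of_day)))
```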


In [66]:
def test_passenger_count_categories():
    conditions = [
        (cleaned_df['passenger_count'] == 1),
        (cleaned_df['passenger_count'].between(2, 4)),
        (cleaned_df['passenger_count'].between(5, 6))
    ]
    categories = ['low', 'medium', 'high']
    for condition, category in zip(conditions, categories):
        assert all(cleaned_df.loc[condition, 'passenger_count_category'] == category), f"Error in categorizing passenger count {category}."


In [67]:
def test_seasonal_categories():
    months = cleaned_df['tpep_pickup_datetime'].dt.month
    conditions = [
        (months.isin([3, 4, 5])),
        (months.isin([6, 7, 8])),
        (months.isin([9, 10, 11])),
        (months.isin([12, 1, 2]))
    ]
    seasons = ['spring', 'summer', 'autumn', 'winter']
    for condition, season in zip(conditions, seasons):
        assert all(cleaned_df.loc[condition, 'pickup_season'] == season), f"Error in season categorization for {season}."


In [68]:
from pandas.tseries.holiday import USFederalHolidayCalendar

def test_holiday_flag():
    calendar = USFederalHolidayCalendar()
    holidays = calendar.holidays(start=cleaned_df['tpep_pickup_datetime'].min(), end=cleaned_df['tpep_pickup_datetime'].max())
    # Compare against a local Series so the test does not add a column to cleaned_df
    calculated_holiday = cleaned_df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)
    assert (calculated_holiday == cleaned_df['is_holiday']).all(), "Holiday flag mismatches detected."


In [69]:
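# Run the duration, time-of-day, and passenger-count tests.
# Note: test_seasonal_categories() and test_holiday_flag() are defined above but are not called in this cell.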
try:
    test_trip_duration_positive()
    test_time_of_day_categories()
    test_passenger_count_categories()
    print("All tests passed!")
except AssertionError as e:
    print("Test failed:", e)


All tests passed!
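
The tests above read the global `cleaned_df`, and the `try` block stops at the first failure. A possible variant is sketched below: each check takes the frame as an argument and a small runner records the outcome of every check. All names here (`check_trip_duration_positive`, `check_speed_within_cap`, `run_checks`) are hypothetical and the demo frame is illustrative.

```python
import pandas as pd

def check_trip_duration_positive(frame: pd.DataFrame) -> None:
    assert frame["trip_duration"].min() > 0, "Non-positive trip durations present."

def check_speed_within_cap(frame: pd.DataFrame, max_mph: float = 100.0) -> None:
    assert frame["speed_mph"].max() <= max_mph, "Speeds above the cap present."

def run_checks(frame: pd.DataFrame, checks) -> dict:
    """Run every check and record 'passed' or the assertion message per check."""
    results = {}
    for check in checks:
        try:
            check(frame)
            results[check.__name__] = "passed"
        except AssertionError as exc:
            results[check.__name__] = f"failed: {exc}"
    return results

# Toy frame in which the speed check is expected to fail
demo = pd.DataFrame({"trip_duration": [1.0, 17.0], "speed_mph": [8.4, 120.0]})
results = run_checks(demo, [check_trip_duration_positive, check_speed_within_cap])
print(results)
```

Passing the frame explicitly lets the same checks run against `cleaned_df`, `sampled_df`, or any later version of the data.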


# Sampling For Further Analysis

In [71]:
# Stratify on 'pickup_season' and 'pickup_time_of_day', drawing 10% of the rows from each stratum
sampled_df = cleaned_df.groupby(['pickup_season', 'pickup_time_of_day'], group_keys=False).apply(lambda x: x.sample(frac=0.1, random_state=42))




In [73]:
sampled_df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration,trip_duration_hours,is_holiday,speed_mph
count,2809395,2809395,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,...,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0,2809395.0
mean,2023-07-01 05:36:50.292566,2023-07-01 05:57:11.048096,1.4,4.26,163.41,161.95,1.19,22.85,1.68,4.07,...,0.01,0.17,2023.0,6.49,15.5,14.33,20.35,0.34,0.02,12.25
min,2023-01-01 00:00:09,2023-01-01 00:09:24,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,...,0.0,0.0,2023.0,1.0,1.0,0.0,1.05,0.02,0.0,0.04
25%,2023-04-01 18:13:40.500000,2023-04-01 18:34:08.500000,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,...,0.0,0.0,2023.0,4.0,8.0,11.0,10.25,0.17,0.0,7.76
50%,2023-06-25 03:11:10,2023-06-25 03:27:25,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,...,0.0,0.0,2023.0,6.0,15.0,15.0,15.2,0.25,0.0,10.33
75%,2023-10-04 20:19:16.500000,2023-10-04 20:38:23,1.0,4.36,231.0,234.0,1.0,25.4,2.5,5.0,...,0.0,0.0,2023.0,10.0,23.0,19.0,23.45,0.39,0.0,14.5
max,2023-12-31 23:51:23,2023-12-31 23:59:34,6.0,49.99,265.0,265.0,4.0,300.0,13.75,250.0,...,1.25,1.75,2023.0,12.0,31.0,23.0,1439.92,24.0,1.0,99.69
std,,,0.89,4.82,63.27,70.93,0.44,18.38,1.91,4.28,...,0.12,0.51,0.0,3.45,8.7,5.9,42.33,0.71,0.15,6.89


In [74]:
categorical_columns = ['PULocationID', 'DOLocationID', 'payment_type', 'distance_bins', 'pickup_time_of_day', 'pickup_season', 'passenger_count_category', 'pickup_day_type', 'transaction_year', 'transaction_month', 'transaction_day', 'transaction_hour']
descriptive_stats = sampled_df[categorical_columns].describe(include='object')
descriptive_stats


Unnamed: 0,pickup_time_of_day,pickup_season,pickup_day_type
count,2809395,2809395,2809395
unique,4,4,2
top,afternoon,spring,weekday
freq,995638,758631,2035540


### Understanding and Implementing Stratified Sampling

#### What is Stratified Sampling?

Stratified sampling is a statistical method used to ensure that various subgroups within a dataset are adequately represented within the sample. It involves dividing a population into smaller groups, known as 'strata', that are distinct and non-overlapping. Each stratum is defined by shared characteristics or criteria, making them homogeneous within each group but heterogeneous between groups. Common stratifying criteria include demographic variables such as age, income, education level, or specific attributes relevant to the study, like seasons or time of day in the context of taxi fare analysis.


#### Why Use Stratified Sampling for Taxi Fare Prediction?

Stratified sampling is particularly beneficial for datasets where certain subgroups are expected to exhibit different behaviors or properties. For the taxi fare prediction project, several reasons underscore the choice of stratified sampling:

- **Enhanced Accuracy and Precision**: By ensuring that all critical subgroups (e.g., different times of the day, seasons) are proportionally represented, stratified sampling reduces sampling bias and improves the accuracy and reliability of the analysis results.

- **Improved Representativeness**: Taxi fare can vary significantly based on factors like the time of day (peak vs. off-peak hours), the day of the week (weekday vs. weekend), or seasonality (tourist seasons). Stratified sampling ensures that each of these factors is adequately represented in the sample, making the sample a miniaturized version of the complete dataset.

- **Efficiency in Data Use**: By focusing on ensuring representation across all key strata, this approach can often require a smaller total sample size to achieve the same level of precision as simple random sampling, especially when the variance within strata is lower than the variance across the entire population.

- **Ability to Perform Strata-specific Analysis**: Stratified sampling allows for the analysis of strata-specific data, which can provide insights into behaviors and patterns within specific subgroups that might be lost in a broader analysis.

#### Practical Example

Consider a dataset containing taxi trips over a year. Key influencing factors include:
- **Time of Day**: Morning, Afternoon, Evening, Night.
- **Seasons**: Spring, Summer, Autumn, Winter.

To ensure each time segment and season is adequately represented, the dataset can be divided into strata defined by these categories. Sampling from each of these strata then ensures that the variability and typical patterns of taxi usage and fare structures due to time-specific or seasonal factors are maintained in the sample used for analysis and model training.
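
The procedure described above can be sketched on a toy population. The example uses `DataFrameGroupBy.sample`, which draws the same fraction from every stratum, and then compares stratum shares before and after sampling. The frame `population` and its stratum sizes are assumptions made for illustration.

```python
import pandas as pd

# Toy population: 40 trips across four (season, time-of-day) strata of unequal size
population = pd.DataFrame({
    "pickup_season": ["spring"] * 20 + ["winter"] * 20,
    "pickup_time_of_day": ["morning"] * 10 + ["night"] * 10 + ["morning"] * 16 + ["night"] * 4,
})
strata = ["pickup_season", "pickup_time_of_day"]

# Draw 50% of the rows from every stratum
sample = population.groupby(strata).sample(frac=0.5, random_state=42)

# Compare each stratum's share in the population and in the sample
population_share = population.value_counts(strata, normalize=True).sort_index()
sample_share = sample.value_counts(strata, normalize=True).sort_index()
print(pd.DataFrame({"population": population_share, "sample": sample_share}))
```

With a common fraction per stratum, the sample keeps the population's proportions, which is the property the 10% sample in this notebook relies on.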


# Saving Sampled Feature Engineered Dataset for EDA and Modeling

In [70]:
sampled_df.to_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/sampled_taxi_dataset_v.0.parquet', engine='pyarrow')
