# Taxi Fare Prediction Model: Feature Engineering

## Introduction
This project aims to develop a predictive model for taxi fares in NYC. Initially, we will create a model for NYC and adjust parameters to align with domain knowledge from Tbilisi. We hypothesize that factors such as time of day, seasonality, and holidays impact taxi demand and fare prices.



### Notebook Aim and Feature Development

#### **Objective**
The primary objective of our project is to develop a predictive model that accurately forecasts taxi fares. Initially focusing on New York City (NYC), we aim to expand and adapt the model to incorporate Tbilisi, employing localized domain knowledge to tailor our approach.

#### **Influence of Demand on Pricing**
The fare prices in the taxi industry are predominantly influenced by demand dynamics, which can fluctuate based on various factors including:

- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.
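
As a rough illustration of the holiday factor, the sketch below flags pickups that fall on a US federal holiday using `USFederalHolidayCalendar` from pandas. The toy timestamps and the `is_holiday` column name are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Toy pickup timestamps (hypothetical values, for illustration only)
pickups = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-07-04 10:15:00",  # Independence Day
        "2023-07-05 10:15:00",  # ordinary weekday
        "2023-12-25 18:30:00",  # Christmas Day
    ])
})

# Federal holidays observed in 2023, returned as midnight timestamps
holidays = USFederalHolidayCalendar().holidays(start="2023-01-01", end="2023-12-31")

# Compare on the calendar date only, so the time of day does not matter
pickups["is_holiday"] = pickups["tpep_pickup_datetime"].dt.normalize().isin(holidays)
print(pickups)
```

Federal holidays are only an approximation of the days that matter for taxi demand; local events would need a separate calendar.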

#### **Additional Influential Features**
Beyond temporal and periodic factors, several other elements could influence fare pricing:

- **Passenger Count**: Exploring whether vehicles accommodating more passengers have different fare structures, similar to practices in ride-sharing applications.
- **Trip Distance and Duration**: Both metrics are crucial for pricing. While trip distance is a direct influencer, the duration might also affect costs, especially in varying traffic conditions.
- **Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.
- **Taxi Zones**: With the NYC taxi zone dataset, we can analyze whether specific pickup and dropoff locations impact fare prices due to their geographical significance.
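
The velocity feature described above could be computed along these lines. This is a minimal sketch on toy values: the `avg_speed_mph` column name is hypothetical, and it assumes distance is recorded in miles and duration in minutes.

```python
import pandas as pd

# Toy trips (hypothetical values): distance in miles, duration in minutes
trips_demo = pd.DataFrame({
    "trip_distance": [1.1, 2.51, 3.0],
    "trip_duration": [6.32, 12.75, 0.0],
})

hours = trips_demo["trip_duration"] / 60

# Zero or negative durations become NaN so the division never yields inf
safe_hours = hours.where(hours > 0)
trips_demo["avg_speed_mph"] = trips_demo["trip_distance"] / safe_hours
print(trips_demo)
```

Leaving invalid speeds as `NaN` keeps them visible for later cleaning instead of silently producing infinite values.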


### Summary
This comprehensive approach not only allows us to understand the multifaceted dynamics of taxi fare pricing in NYC but also sets a foundation for adapting the model to Tbilisi, ensuring that both city-specific and universal factors are considered for effective fare prediction.

## Data Loading

In this section we import the necessary libraries and load the dataset.

In [53]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
import geopandas as gpd
from sklearn.model_selection import StratifiedShuffleSplit
pd.set_option('display.float_format', lambda x: '%.2f' % x)
from pandas.tseries.holiday import USFederalHolidayCalendar
from sklearn.model_selection import train_test_split


In [3]:
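# Load the cleaned trip data produced by the previous notebook (adjust the path if your file lives elsewhere)
df_original = pd.read_parquet('cleaned_taxi_data_v.1.parquet', engine='pyarrow')  # or engine='fastparquet' if you prefer
# Note: `df` is a second name for the same object, not a copy; in-place changes to `df` also affect `df_original`
df = df_original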

Let's check that the cleaned data is accurate and still has all the expected columns after the data preparation in the previous notebook.

In [3]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,distance_bins
0,2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.1,43,237,1,7.9,1.0,4.0,16.9,0.0,0.0,1-2 miles
1,2023-01-01 00:25:04,2023-01-01 00:37:49,1,2.51,48,238,1,14.9,1.0,15.0,34.9,0.0,0.0,2-5 miles
2,2023-01-01 00:10:29,2023-01-01 00:21:19,1,1.43,107,79,1,11.4,1.0,3.28,19.68,0.0,0.0,1-2 miles
3,2023-01-01 00:50:34,2023-01-01 01:02:52,1,1.84,161,137,1,12.8,1.0,10.0,27.8,0.0,0.0,1-2 miles
4,2023-01-01 00:09:22,2023-01-01 00:19:49,1,1.66,239,143,1,12.1,1.0,3.42,20.52,0.0,0.0,1-2 miles


In [4]:
len(df)

28093781

## Dataset Loading And Preparation For Feature Engineering


As we are working with data already cleaned in the previous notebook, we do not need to clean it for nulls, duplicates or outliers. However, to confirm that the data is consistent and clean, we will take an initial look at the loaded dataset below.

We will also load the Taxi Zone dataset downloaded from the NYC Taxi Zones page: https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc

In [5]:
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee
count,28093781,28093781,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0
mean,2023-07-01 05:46:20.028285,2023-07-01 06:06:41.749390,1.4,4.26,163.46,162.04,1.19,22.85,1.68,4.07,32.75,0.01,0.17
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0
25%,2023-04-01 18:21:12,2023-04-01 18:40:28,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.6,0.0,0.0
50%,2023-06-25 02:32:23,2023-06-25 02:47:49,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0
75%,2023-10-04 20:39:08,2023-10-04 20:56:30,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75
std,,,0.89,4.82,63.28,70.92,0.44,18.39,1.91,4.32,23.48,0.12,0.51


In [6]:
df.dtypes

tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                   int64
trip_distance                   float64
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
tip_amount                      float64
total_amount                    float64
JFK_LGA_Pickup_Fee              float64
General_Airport_Fee             float64
distance_bins                  category
dtype: object

In [7]:
df.isnull().sum()

tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
tip_amount               0
total_amount             0
JFK_LGA_Pickup_Fee       0
General_Airport_Fee      0
distance_bins            0
dtype: int64
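
The individual checks above can be bundled into one small helper so they are easy to rerun after each transformation. The sketch below is only an illustration: `basic_checks` and the toy frame are hypothetical names and values, not part of the notebook.

```python
import pandas as pd


def basic_checks(frame: pd.DataFrame, required: list[str]) -> dict:
    """Summarise missing columns, null cells and fully duplicated rows."""
    return {
        "missing_columns": [c for c in required if c not in frame.columns],
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame with one null cell and one duplicated row
toy = pd.DataFrame({"fare_amount": [7.9, 7.9, None], "trip_distance": [1.1, 1.1, 2.5]})
report = basic_checks(toy, required=["fare_amount", "trip_distance", "PULocationID"])
print(report)
```

On the full dataset, `frame.duplicated()` compares every column of roughly 28 million rows, so it may take a while and use a fair amount of memory.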

### Additional Dataset For Taxi Zones Loading

In [4]:
zones = pd.read_csv("/Users/md/Desktop/python_project/parquet_files/cleaned/taxi_zones.csv", sep=';')
zones.describe()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,LocationID
count,263.0,263.0,263.0,263.0
mean,132.0,0.09,0.0,131.98
std,76.07,0.05,0.0,76.07
min,1.0,0.01,0.0,1.0
25%,66.5,0.05,0.0,66.5
50%,132.0,0.08,0.0,132.0
75%,197.5,0.12,0.0,197.5
max,263.0,0.43,0.0,263.0


In [9]:
zones.sample(5)

Unnamed: 0,OBJECTID,Shape_Leng,the_geom,Shape_Area,zone,LocationID,borough
10,10,0.1,MULTIPOLYGON (((-73.78326624999988 40.68999429...,0.0,Baisley Park,10,Queens
190,191,0.13,MULTIPOLYGON (((-73.73016587199996 40.72395859...,0.0,Queens Village,191,Queens
11,11,0.08,MULTIPOLYGON (((-74.00109809499993 40.60303462...,0.0,Bath Beach,11,Brooklyn
95,95,0.11,MULTIPOLYGON (((-73.84732494199989 40.73877145...,0.0,Forest Hills,95,Queens
250,248,0.06,MULTIPOLYGON (((-73.8639374809998 40.840044565...,0.0,West Farms/Bronx River,248,Bronx


We can already see that the zones file has 263 rows, while the location IDs in our trip data go up to 265. This means that when joining we will have to handle the unmatched IDs, either by identifying these zones or by removing the affected rows.

# Feature Engineering

Below, based on our domain knowledge and literature review, we will create new features or adjust existing ones to gain more insight into the data and create the best possible predictive model.

## Seasonal and Time Features


- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Duration**: How long the trip lasted, in minutes.

In [5]:
# Convert the pickup and dropoff datetime to pandas datetime format if not already
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Time of day segmentation
df['pickup_time_of_day'] = df['tpep_pickup_datetime'].dt.hour.apply(lambda x: 'morning' if 5 <= x <= 11
                                                                           else 'afternoon' if 12 <= x <= 17
                                                                           else 'evening' if 18 <= x <= 23
                                                                           else 'night')

# Seasons segmentation
df['pickup_season'] = df['tpep_pickup_datetime'].dt.month.apply(lambda x: 'spring' if 3 <= x <= 5
                                                                       else 'summer' if 6 <= x <= 8
                                                                       else 'autumn' if 9 <= x <= 11
                                                                       else 'winter')

# Passenger count categories
df['passenger_count_category'] = pd.cut(df['passenger_count'], bins=[0, 1, 4, 6], include_lowest=True, 
                                        labels=['low', 'medium', 'high'])

# Weekday/Weekend segmentation
df['pickup_day_type'] = df['tpep_pickup_datetime'].dt.day_name().apply(lambda x: 'weekend' if x in ['Saturday', 'Sunday'] else 'weekday')


#taxi_data_prepared['transaction_date'] = pd.to_datetime(taxi_data_prepared['tpep_pickup_datetime'].dt.date)
# -> we make it datetime again because it's very little use when it's just a string (can't compare, sort, etc.)
df['transaction_year'] = df['tpep_pickup_datetime'].dt.year
df['transaction_month'] = df['tpep_pickup_datetime'].dt.month
df['transaction_day'] =  df['tpep_pickup_datetime'].dt.day
df['transaction_hour'] = df['tpep_pickup_datetime'].dt.hour

#trip duration is another interesting feature to analyze 


# Calculate the trip duration and convert it to minutes
df['trip_duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
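
The `apply` calls with a lambda in the cell above run Python code once per row, which can be slow on a dataset of this size. As a possible alternative, the time-of-day segmentation can be expressed with `pd.cut`, which works on the whole column at once. This is a sketch on a toy series; the bin edges are chosen to reproduce the hour ranges used in the cell above.

```python
import pandas as pd

# Toy pickup hours covering every boundary of the four segments
hours = pd.Series([0, 4, 5, 11, 12, 17, 18, 23])

# Bins are right-inclusive: (-1, 4] night, (4, 11] morning,
# (11, 17] afternoon, (17, 23] evening
time_of_day = pd.cut(
    hours,
    bins=[-1, 4, 11, 17, 23],
    labels=["night", "morning", "afternoon", "evening"],
)
print(time_of_day.tolist())
```

A side benefit is that `pd.cut` returns a categorical column directly, so no later `astype('category')` conversion is needed for it.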


Let's take a look at the adjusted dataset and the newly created features. We will also view a few rows to check that the features were created correctly.

In [5]:
df.head(2)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
0,2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.1,43,237,1,7.9,1.0,4.0,...,1-2 miles,night,winter,low,weekend,2023,1,1,0,6.32
1,2023-01-01 00:25:04,2023-01-01 00:37:49,1,2.51,48,238,1,14.9,1.0,15.0,...,2-5 miles,night,winter,low,weekend,2023,1,1,0,12.75


In [9]:
df.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,28093781,28093781,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0,28093781.0
mean,2023-07-01 05:46:20.028285,2023-07-01 06:06:41.749390,1.4,4.26,163.46,162.04,1.19,22.85,1.68,4.07,32.75,0.01,0.17,2023.0,6.49,15.5,14.33,20.36
min,2023-01-01 00:00:05,2023-01-01 00:05:44,1.0,1.0,1.0,1.0,1.0,2.0,0.0,0.0,2.2,0.0,0.0,2023.0,1.0,1.0,0.0,-165.05
25%,2023-04-01 18:21:12,2023-04-01 18:40:28,1.0,1.5,132.0,113.0,1.0,11.4,0.0,1.26,18.6,0.0,0.0,2023.0,4.0,8.0,11.0,10.23
50%,2023-06-25 02:32:23,2023-06-25 02:47:49,1.0,2.26,161.0,162.0,1.0,16.3,1.0,3.28,23.88,0.0,0.0,2023.0,6.0,15.0,15.0,15.2
75%,2023-10-04 20:39:08,2023-10-04 20:56:30,1.0,4.35,231.0,234.0,1.0,25.4,2.5,5.0,35.0,0.0,0.0,2023.0,10.0,23.0,19.0,23.43
max,2023-12-31 23:55:17,2023-12-31 23:59:56,6.0,50.0,265.0,265.0,4.0,300.0,96.38,984.3,1000.0,1.25,1.75,2023.0,12.0,31.0,23.0,7053.62
std,,,0.89,4.82,63.28,70.92,0.44,18.39,1.91,4.32,23.48,0.12,0.51,0.0,3.45,8.7,5.9,43.02


## Analysis of Newly Added Features in the Taxi Dataset

In this analysis, we explore the enhancements made to a comprehensive taxi dataset through feature engineering, specifically focusing on new temporal and categorical attributes derived from the raw data. The following attributes were added: `pickup_time_of_day`, `pickup_season`, `passenger_count_category`, `pickup_day_type`, `transaction_year`, `transaction_month`, `transaction_day`, `transaction_hour`, and `trip_duration`. These features were engineered to facilitate deeper insights into the patterns of taxi usage.

### Overview of New Features

**Temporal Segmentation:**
- **Time of Day:** This attribute classifies each trip into one of four categories based on the pickup hour: `morning` (05:00-11:59), `afternoon` (12:00-17:59), `evening` (18:00-23:59), and `night` (00:00-04:59). Most pickups occur in the `afternoon`, accounting for about 35.4% of all trips. This could indicate higher taxi demand during these hours, possibly due to work-related travel or lunchtime errands.
- **Season:** The trips are classified into seasons based on the pickup month, allowing for seasonal analysis of taxi usage. The most common season for taxi pickups was `spring`, with approximately 27% of the year's pickups, slightly above the 25% an even split would give. This suggests a modest peak in taxi usage during this season, which may correlate with tourist activity or better weather conditions.
- **Weekday/Weekend:** This feature categorizes each day as a `weekday` or `weekend`, based on the day of the week. About 72.4% of trips occurred on weekdays. Since weekdays make up five of every seven days (roughly 71%), this is only slightly above a proportional share, so the weekday effect on trip volume appears small and is worth testing rather than assuming.

**Categorical Binning of Numerical Data:**
- **Passenger Count Categories:** Passengers per trip were categorized into `low` (1 passenger), `medium` (2-4 passengers), and `high` (5-6 passengers). The categorization helps in analyzing travel behavior in relation to group size. 

**Transaction Date and Time:**
- Extracted `year`, `month`, `day`, and `hour` from the pickup datetime to facilitate time-based querying and aggregation at various granularities.

**Trip Duration Calculation:**
- Calculated as the difference in minutes between pickup and dropoff times, providing insights into trip lengths and potential traffic conditions. The mean trip duration was approximately 20.36 minutes, but with a high standard deviation, indicating significant variability which could be influenced by factors such as traffic, time of day, and day of the week.


From an initial look and manual testing, I checked that dates have been correctly categorized as weekend or weekday, that the passenger categories are also correctly defined, and that the trip duration calculation is correct.
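
Spot checks of this kind can be written down as assertions on a few rows whose correct answers are known in advance. The sketch below is an assumed illustration, not the checks actually run: it rebuilds three of the features on two toy rows (the first mirrors the first row of the dataset) and compares them with hand-computed values.

```python
import pandas as pd

sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:55:08", "2023-01-02 08:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 01:01:27", "2023-01-02 08:30:00"]),
    "passenger_count": [1, 5],
})

# Same logic as the feature-engineering cell, applied to the toy rows
sample["pickup_day_type"] = sample["tpep_pickup_datetime"].dt.day_name().map(
    lambda d: "weekend" if d in ("Saturday", "Sunday") else "weekday"
)
sample["passenger_count_category"] = pd.cut(
    sample["passenger_count"], bins=[0, 1, 4, 6], include_lowest=True,
    labels=["low", "medium", "high"],
)
sample["trip_duration"] = (
    sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# 2023-01-01 was a Sunday and 2023-01-02 a Monday
assert sample["pickup_day_type"].tolist() == ["weekend", "weekday"]
assert sample["passenger_count_category"].astype(str).tolist() == ["low", "high"]
assert sample["trip_duration"].round(2).tolist() == [6.32, 30.0]
```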

To check if newly added features have correct values we will use descriptive statistics and adjust accordingly if needed.

Below we check the data types of all columns, including the new ones, before validating the trip duration values.

In [11]:
df.dtypes

tpep_pickup_datetime        datetime64[us]
tpep_dropoff_datetime       datetime64[us]
passenger_count                      int64
trip_distance                      float64
PULocationID                         int64
DOLocationID                         int64
payment_type                         int64
fare_amount                        float64
extra                              float64
tip_amount                         float64
total_amount                       float64
JFK_LGA_Pickup_Fee                 float64
General_Airport_Fee                float64
distance_bins                     category
pickup_time_of_day                  object
pickup_season                       object
passenger_count_category          category
pickup_day_type                     object
transaction_year                     int32
transaction_month                    int32
transaction_day                      int32
transaction_hour                     int32
trip_duration                      float64
dtype: object

In [6]:
# The categorical features are stored as int64 or object, so we convert them to the category dtype
df['payment_type'] = df['payment_type'].astype('category')
df['PULocationID'] = df['PULocationID'].astype('category')
df['DOLocationID'] = df['DOLocationID'].astype('category')
df['pickup_time_of_day'] = df['pickup_time_of_day'].astype('category')
df['pickup_season'] = df['pickup_season'].astype('category')
df['pickup_day_type'] = df['pickup_day_type'].astype('category')
df['passenger_count_category'] = df['passenger_count_category'].astype('category')


In [17]:
categorical_columns = ['PULocationID', 'DOLocationID', 'payment_type', 'distance_bins', 'pickup_time_of_day', 'pickup_season', 'passenger_count_category', 'pickup_day_type']
descriptive_stats = df[categorical_columns].describe(include='category')
descriptive_stats

Unnamed: 0,PULocationID,DOLocationID,payment_type,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type
count,28093781,28093781,28093781,28093781,28093781,28093781,28093781,28093781
unique,262,262,4,6,4,4,3,2
top,132,236,1,1-2 miles,afternoon,spring,low,weekday
freq,1867827,1146494,23283837,12204149,9956916,7586611,21308816,20350513



### Insights from Descriptive Statistics

- **Frequency Distributions:** The top segments for the newly added categorical features indicate the most common contexts in which taxis are used (afternoon pickups during spring on weekdays).
- **Average and Variance:** The distribution of trip durations has a high variance, as indicated by its standard deviation (43.02 minutes), pointing to a wide range in trip lengths. This suggests variability in trip purposes, from short hops to longer journeys.
- **Outliers:** The maximum trip duration is exceptionally high at over 7053 minutes, suggesting potential data entry errors or exceptional cases (e.g., taxis being hired for extended periods).
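
The shares behind these frequency statements can be recomputed from the `count` and `freq` rows of the summary table. The sketch below does the arithmetic directly on the reported counts; on the full DataFrame the same figures could be obtained with `value_counts(normalize=True)`.

```python
# Counts copied from the `count` and `freq` rows of the summary table
total_trips = 28_093_781
top_counts = {
    "afternoon": 9_956_916,
    "spring": 7_586_611,
    "weekday": 20_350_513,
}

# Share of all trips, in percent, rounded to two decimals
shares = {label: round(100 * count / total_trips, 2) for label, count in top_counts.items()}
print(shares)

# Reference point: weekdays are five of every seven days
weekday_baseline = round(100 * 5 / 7, 2)
print(weekday_baseline)
```

Comparing each share with its baseline (25% for a season, about 71% for weekdays) helps separate a real preference from a category that simply covers more of the calendar.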


Since our dataset covers only 2023, we do not need the transaction year as a feature, although it could be useful when handling several years of data. The transaction hour and day are also dropped: they add little for our analysis, as we are more interested in the type of day, the season, and the month.

In [7]:
df = df.drop(columns=['transaction_day', 'transaction_year','transaction_hour'])

## Outlier Handling and Data Validation For New Features

Based on the descriptive statistics provided, there are a few indicators of potential data issues that may require cleaning or further investigation:

### 1. **Trip Duration Outliers**
- **Negative Values:** The minimum value in the trip duration is -165.05 minutes. This is clearly an error since trip duration cannot be negative. Negative values could result from data entry errors or incorrect timestamp recording.
- **Excessive Maximum Value:** The maximum trip duration is over 7053 minutes (approximately 117.56 hours). Such a high duration is unusual and might indicate anomalies in data recording or entry errors.


For the month, year, day, and season features the values make sense, but for trip duration we can see negative values.


Negative trip durations may have occurred due to data entry issues, for example pickup and drop-off times being mixed up. We can investigate further to see how many negative values there are and either drop the corrupted rows or adjust them accordingly.

In [29]:
# Display a random sample of rows to confirm the new 'trip_duration' column
print(df[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']].sample(10))

         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
2355383   2023-02-01 20:05:40   2023-02-01 20:11:01           5.35
22493662  2023-10-21 22:19:37   2023-10-21 22:59:38          40.02
22869425  2023-10-26 15:01:31   2023-10-26 16:10:00          68.48
17137999  2023-08-08 10:48:05   2023-08-08 11:22:43          34.63
23881843  2023-11-07 15:54:34   2023-11-07 16:04:12           9.63
18137440  2023-08-23 05:28:29   2023-08-23 05:37:07           8.63
20389839  2023-09-26 13:34:32   2023-09-26 14:12:39          38.12
17718808  2023-08-16 18:33:46   2023-08-16 18:56:20          22.57
17808919  2023-08-17 21:46:36   2023-08-17 22:26:48          40.20
6660025   2023-03-28 13:31:56   2023-03-28 13:51:30          19.57


In [8]:
# Display cases with negative trip_duration
negative_durations = df[df['trip_duration'] < 0]
negative_durations[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_duration
1631038,2023-01-23 10:43:58,2023-01-23 10:29:26,-14.53
2397097,2023-02-02 13:02:23,2023-02-02 12:50:35,-11.80
2397098,2023-02-02 13:59:20,2023-02-02 13:15:43,-43.62
2482045,2023-02-03 13:45:00,2023-02-03 13:44:50,-0.17
2991721,2023-02-10 09:40:22,2023-02-10 09:20:58,-19.40
...,...,...,...
24565445,2023-11-15 14:08:00,2023-11-15 14:04:33,-3.45
24970707,2023-11-20 07:55:00,2023-11-20 07:46:43,-8.28
25300720,2023-11-25 14:53:11,2023-11-25 14:53:09,-0.03
25305059,2023-11-25 15:50:50,2023-11-25 15:50:05,-0.75


In [19]:

negative_durations.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_year,transaction_month,transaction_day,transaction_hour,trip_duration
count,725,725,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0,725.0
mean,2023-10-21 16:34:57.008276,2023-10-21 15:54:57.144827,1.36,4.09,22.08,1.43,3.44,30.41,0.0,0.02,2023.0,10.48,6.4,2.17,-40.0
min,2023-01-23 10:43:58,2023-01-23 10:29:26,1.0,1.0,7.2,0.0,0.0,11.64,0.0,0.0,2023.0,1.0,1.0,1.0,-165.05
25%,2023-11-05 01:44:13,2023-11-05 01:02:16,1.0,2.0,13.5,1.0,1.0,21.12,0.0,0.0,2023.0,11.0,5.0,1.0,-47.88
50%,2023-11-05 01:51:24,2023-11-05 01:06:24,1.0,3.3,19.1,1.0,3.14,26.4,0.0,0.0,2023.0,11.0,5.0,1.0,-43.37
75%,2023-11-05 01:55:58,2023-11-05 01:10:52,2.0,5.2,27.5,1.0,5.0,36.48,0.0,0.0,2023.0,11.0,5.0,1.0,-35.1
max,2023-12-31 09:20:00,2023-12-31 09:10:02,4.0,19.7,80.0,5.0,20.67,103.36,0.0,1.75,2023.0,12.0,31.0,18.0,-0.03
std,,,0.69,2.94,11.56,1.12,3.05,13.6,0.0,0.17,0.0,1.82,4.94,3.6,13.58


In [9]:
# Check for possible datetime swaps or errors
swapped_cases = df[df['tpep_pickup_datetime'] > df['tpep_dropoff_datetime']]
print(swapped_cases[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']])


         tpep_pickup_datetime tpep_dropoff_datetime  trip_duration
1631038   2023-01-23 10:43:58   2023-01-23 10:29:26         -14.53
2397097   2023-02-02 13:02:23   2023-02-02 12:50:35         -11.80
2397098   2023-02-02 13:59:20   2023-02-02 13:15:43         -43.62
2482045   2023-02-03 13:45:00   2023-02-03 13:44:50          -0.17
2991721   2023-02-10 09:40:22   2023-02-10 09:20:58         -19.40
...                       ...                   ...            ...
24565445  2023-11-15 14:08:00   2023-11-15 14:04:33          -3.45
24970707  2023-11-20 07:55:00   2023-11-20 07:46:43          -8.28
25300720  2023-11-25 14:53:11   2023-11-25 14:53:09          -0.03
25305059  2023-11-25 15:50:50   2023-11-25 15:50:05          -0.75
28052000  2023-12-31 09:20:00   2023-12-31 09:10:02          -9.97

[725 rows x 3 columns]


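As we can see, there are only 725 negative values, a very small number compared to the full dataset of about 28 million records. The quartiles in the summary above also show that most of these pickups fall on 2023-11-05 between 01:00 and 02:00, which coincides with the end of daylight saving time in the US, so many of them are likely clock-change artifacts rather than swapped timestamps. Instead of trying to repair these rows, we will drop all rows with trip durations less than or equal to 0.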

In [8]:
df = df[df['trip_duration']>0]

In [9]:
df['trip_duration'].describe()

count   28092619.00
mean          20.36
std           43.02
min            0.02
25%           10.23
50%           15.20
75%           23.43
max         7053.62
Name: trip_duration, dtype: float64
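
One plausible explanation for the negative durations clustered on 2023-11-05 is the end of daylight saving time: the hour from 01:00 to 02:00 occurs twice that night, and the timestamps carry no time zone. The sketch below uses a hypothetical trip (not a row from the dataset) to show how a trip that really lasted 15 minutes appears as -45 minutes when the wall-clock times are subtracted naively.

```python
import pandas as pd

tz = "America/New_York"

# Hypothetical trip across the fall-back hour on 2023-11-05
pickup_naive = pd.Timestamp("2023-11-05 01:51:00")
dropoff_naive = pd.Timestamp("2023-11-05 01:06:00")

# Naive subtraction, as done for the trip_duration feature
naive_minutes = (dropoff_naive - pickup_naive).total_seconds() / 60

# ambiguous=True picks the first (daylight time) occurrence of the wall-clock time,
# ambiguous=False picks the second (standard time) occurrence
pickup = pickup_naive.tz_localize(tz, ambiguous=True)
dropoff = dropoff_naive.tz_localize(tz, ambiguous=False)
aware_minutes = (dropoff - pickup).total_seconds() / 60

print(naive_minutes, aware_minutes)
```

Because the raw data does not record which occurrence of the hour a timestamp belongs to, such rows cannot be repaired with certainty, which supports dropping them.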

## Removing Long Trip Durations

NYC regulations limit how long a driver may carry passengers. According to the TLC: "Both taxi and FHV drivers are prohibited from transporting passengers for more than 10 hours in any 24-hour period and for more than 60 hours in a calendar week (Monday-Sunday). TLC will review driver hours using the trip records it receives from FHV bases and through the TPEP and LPEP systems."

Source: https://www.nyc.gov/site/tlc/about/fatigued-driving-prevention-frequently-asked-questions.page

However, as we do not have driver IDs, there is no way to track cumulative driving hours. Instead, we will cap the trip duration at 24 hours per trip.

In [10]:

# Convert trip duration from minutes to hours
df.loc[:, 'trip_duration_hours'] = df['trip_duration'] / 60


In [12]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,JFK_LGA_Pickup_Fee,General_Airport_Fee,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours
0,2023-01-01 00:55:08,2023-01-01 01:01:27,1,1.1,43,237,1,7.9,1.0,4.0,...,0.0,0.0,1-2 miles,night,winter,low,weekend,1,6.32,0.11
1,2023-01-01 00:25:04,2023-01-01 00:37:49,1,2.51,48,238,1,14.9,1.0,15.0,...,0.0,0.0,2-5 miles,night,winter,low,weekend,1,12.75,0.21
2,2023-01-01 00:10:29,2023-01-01 00:21:19,1,1.43,107,79,1,11.4,1.0,3.28,...,0.0,0.0,1-2 miles,night,winter,low,weekend,1,10.83,0.18
3,2023-01-01 00:50:34,2023-01-01 01:02:52,1,1.84,161,137,1,12.8,1.0,10.0,...,0.0,0.0,1-2 miles,night,winter,low,weekend,1,12.3,0.21
4,2023-01-01 00:09:22,2023-01-01 00:19:49,1,1.66,239,143,1,12.1,1.0,3.42,...,0.0,0.0,1-2 miles,night,winter,low,weekend,1,10.45,0.17


In [11]:
# Filter the DataFrame to remove implausibly short trips
df = df[(df['trip_duration'] >= 2)]  # keep trips of at least 2 minutes

In [12]:
df = df[(df['trip_duration_hours'] <= 24)]  # keep trips of at most 24 hours

In [14]:
df['trip_duration_hours'].describe()

count   28082179.00
mean           0.34
std            0.71
min            0.03
25%            0.17
50%            0.25
75%            0.39
max           24.00
Name: trip_duration_hours, dtype: float64
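
The two duration filters above can also be written as a single mask, which makes it easy to report how many rows each cleaning step removes. This is a sketch on toy values; the constant names `MIN_MINUTES` and `MAX_MINUTES` are hypothetical.

```python
import pandas as pd

# Toy durations in minutes (hypothetical values)
durations = pd.DataFrame({"trip_duration": [0.5, 2.0, 15.2, 1440.0, 7053.62]})

MIN_MINUTES = 2        # lower bound used in the notebook
MAX_MINUTES = 24 * 60  # 24-hour cap used in the notebook

# between() is inclusive on both ends by default
keep = durations["trip_duration"].between(MIN_MINUTES, MAX_MINUTES)

# .copy() gives an independent frame, so adding columns later raises no warning
filtered = durations[keep].copy()
removed = len(durations) - len(filtered)
print(f"kept {len(filtered)} rows, removed {removed}")
```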

## Taxi Zones Feature
Taxi zone IDs, though informative, do not tell us where a passenger was picked up, and neighbourhoods are thought to affect pricing, at least when hailing a cab. We will therefore merge the taxi zone dataset with the NYC trip data on zone IDs and identify the pickup and drop-off boroughs for each trip.

In [15]:
# Merge the zone data into the main taxi trip dataset for pickup locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='PULocationID', right_on='LocationID', how='left')
df.rename(columns={'zone': 'PUzone', 'borough': 'PUborough'}, inplace=True)

In [16]:
# Merge the zone data for dropoff locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='DOLocationID', right_on='LocationID', how='left', suffixes=('', '_drop'))
df.rename(columns={'zone': 'DOzone', 'borough': 'DOborough'}, inplace=True)

In [17]:
# Drop the extra LocationID columns if they are not needed
df.drop(['LocationID', 'LocationID_drop'], axis=1, inplace=True)
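
One thing worth checking with these merges: `zones.describe()` reports 263 rows with `LocationID` between 1 and 263, yet the notebook later finds that IDs 57 and 105 are absent from the table, so at least one `LocationID` must appear more than once. A left merge against repeated keys would duplicate trips. The sketch below, on toy data, shows one way to guard against this with the `validate` and `indicator` options of `merge`; the frame names are hypothetical.

```python
import pandas as pd

# Toy trips and a toy zone table in which LocationID 56 is repeated
trips_demo = pd.DataFrame({"PULocationID": [43, 56, 264]})
zones_demo = pd.DataFrame({
    "LocationID": [43, 56, 56],
    "zone": ["Central Park", "Corona", "Corona"],
    "borough": ["Manhattan", "Queens", "Queens"],
})

# Keep one row per LocationID so each trip matches at most one zone
lookup = zones_demo.drop_duplicates(subset="LocationID")

merged = trips_demo.merge(
    lookup,
    left_on="PULocationID",
    right_on="LocationID",
    how="left",
    validate="many_to_one",  # raises MergeError if lookup keys are not unique
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# IDs present in the trips but missing from the zone table
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "PULocationID"].tolist()
print(len(merged), unmatched_ids)
```

Comparing `len(df)` before and after each merge is a cheap additional check that no rows were duplicated.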

In [32]:
print(df['PUborough'].value_counts())
print(df['DOborough'].value_counts())

PUborough
Manhattan        24178439
Queens            3455595
Brooklyn           160578
Bronx               43009
Staten Island        1301
EWR                   338
Name: count, dtype: int64
DOborough
Manhattan        24288525
Queens            1764518
Brooklyn          1343199
Bronx              198709
EWR                102813
Staten Island        9271
Name: count, dtype: int64


In [33]:
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())

PUzone       254156
PUborough    254156
DOzone       386381
DOborough    386381
dtype: int64


In [34]:
print(sorted(zones['LocationID'].unique()))
print(sorted(df['PULocationID'].unique()))
print(sorted(df['DOLocationID'].unique()))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 2

We can see that our dataset contains two location IDs, 264 and 265, that have no specific borough and are not in the taxi zones dataset.

In [18]:
missing_pu = df[~df['PULocationID'].isin(zones['LocationID'])]
missing_do = df[~df['DOLocationID'].isin(zones['LocationID'])]
print(f"Missing PULocationIDs: {missing_pu['PULocationID'].unique()}")
print(f"Missing DOLocationIDs: {missing_do['DOLocationID'].unique()}")

Missing PULocationIDs: [264 265  57 105]
Missing DOLocationIDs: [264 265  57 105]


In [19]:
# Filter data for PULocationID or DOLocationID being 264 or 265
trips = df[(df['PULocationID'].isin([264, 265])) | (df['DOLocationID'].isin([264, 265]))]

# Print the filtered data summary
trips.describe(include='all')


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough
count,442820,442820,442820.0,442820.0,442820.0,442820.0,442820.0,442820.0,442820.0,442820.0,...,442820,442820,442820,442820.0,442820.0,442820.0,188741,188741,57123,57123
unique,,,,,,,4.0,,,,...,4,3,2,,,,240,6,252,6
top,,,,,,,1.0,,,,...,spring,low,weekday,,,,JFK Airport,Queens,Times Sq/Theatre District,Manhattan
freq,,,,,,,349866.0,,,,...,116799,327794,315618,,,,71416,99650,2105,44031
mean,2023-06-22 03:14:17.204837,2023-06-22 03:42:11.697376,1.42,8.87,213.54,250.04,,46.46,1.69,6.36,...,,,,6.19,27.91,0.47,,,,
min,2023-01-01 00:02:13,2023-01-01 00:15:27,1.0,1.0,1.0,1.0,,2.0,0.0,0.0,...,,,,1.0,2.0,0.03,,,,
25%,2023-03-25 00:45:58.500000,2023-03-25 01:10:11.750000,1.0,1.84,138.0,264.0,,13.5,0.0,0.0,...,,,,3.0,12.4,0.21,,,,
50%,2023-06-16 22:15:46.500000,2023-06-16 22:38:47,1.0,4.31,264.0,264.0,,26.1,1.0,3.5,...,,,,6.0,20.32,0.34,,,,
75%,2023-09-17 17:11:36.250000,2023-09-17 17:44:22.250000,2.0,13.56,264.0,265.0,,68.8,2.5,8.0,...,,,,9.0,34.4,0.57,,,,
max,2023-12-31 23:53:07,2023-12-31 23:59:46,6.0,50.0,265.0,265.0,,300.0,11.75,280.0,...,,,,12.0,1439.82,24.0,,,,


In [17]:
# Display sample records
trips.sample(2)


Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,tip_amount,...,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUzone,PUborough,DOzone,DOborough
4865173,2023-03-06 01:54:58,2023-03-06 02:07:11,1,2.91,264,264,1,16.3,1.0,4.08,...,spring,low,weekday,3,12.22,0.2,,,,
12927916,2023-06-11 13:54:55,2023-06-11 14:52:37,1,17.93,264,264,1,70.0,0.0,17.88,...,summer,low,weekend,6,57.7,0.96,,,,


In [20]:
# Manually assign zones for IDs 264 and 265
df.loc[df['PULocationID'] == 264, ['PUzone', 'PUborough']] = ['Outside NYC', 'Unknown']
df.loc[df['DOLocationID'] == 264, ['DOzone', 'DOborough']] = ['Outside NYC', 'Unknown']
df.loc[df['PULocationID'] == 265, ['PUzone', 'PUborough']] = ['Airport Area', 'Unknown']
df.loc[df['DOLocationID'] == 265, ['DOzone', 'DOborough']] = ['Airport Area', 'Unknown']


In [19]:
# Check for null values in the updated columns
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())

PUzone        86
PUborough     86
DOzone       687
DOborough    687
dtype: int64


In [40]:
# Print rows where PUzone or PUborough is null
print("Rows with missing PUzone or PUborough:")
print(df[df['PUzone'].isnull() | df['PUborough'].isnull()][['PULocationID', 'PUzone', 'PUborough']].head())

# Print rows where DOzone or DOborough is null
print("Rows with missing DOzone or DOborough:")
print(df[df['DOzone'].isnull() | df['DOborough'].isnull()][['DOLocationID', 'DOzone', 'DOborough']].head())


Rows with missing PUzone or PUborough:
         PULocationID PUzone PUborough
196705             57    NaN       NaN
464231             57    NaN       NaN
1379824            57    NaN       NaN
2099500            57    NaN       NaN
2313943            57    NaN       NaN
Rows with missing DOzone or DOborough:
       DOLocationID DOzone DOborough
1122             57    NaN       NaN
16079            57    NaN       NaN
16281            57    NaN       NaN
19908            57    NaN       NaN
73013            57    NaN       NaN


In [20]:
# List unique LocationIDs associated with null zones or boroughs
missing_pu_ids = df[df['PUzone'].isnull()]['PULocationID'].unique()
missing_do_ids = df[df['DOzone'].isnull()]['DOLocationID'].unique()
print(f"Missing PULocationIDs: {missing_pu_ids}")
print(f"Missing DOLocationIDs: {missing_do_ids}")


Missing PULocationIDs: [ 57 105]
Missing DOLocationIDs: [ 57 105]


In [21]:
# Manually assign zones and boroughs for LocationID 57 and 105
df.loc[df['PULocationID'] == 57, ['PUzone', 'PUborough']] = ['Corona', 'Queens']
df.loc[df['DOLocationID'] == 57, ['DOzone', 'DOborough']] = ['Corona', 'Queens']

df.loc[df['PULocationID'] == 105, ['PUzone', 'PUborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
df.loc[df['DOLocationID'] == 105, ['DOzone', 'DOborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
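
The repeated `df.loc[...]` assignments could also be driven by a small lookup dictionary, so that further IDs only need a new entry. This is a sketch under assumptions: `manual_zones`, `fill_manual`, and the `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical lookup of manual fixes: LocationID -> (zone, borough)
manual_zones = {
    57: ("Corona", "Queens"),
    105: ("Governor's Island/Ellis Island/Liberty Island", "Manhattan"),
}

# Toy frame with missing zone information
demo = pd.DataFrame({
    "PULocationID": [57, 105, 43],
    "PUzone": [None, None, "Central Park"],
    "PUborough": [None, None, "Manhattan"],
})

def fill_manual(frame, id_col, zone_col, borough_col, fixes):
    """Fill zone and borough for the given IDs; other rows are left untouched."""
    for loc_id, (zone, borough) in fixes.items():
        frame.loc[frame[id_col] == loc_id, [zone_col, borough_col]] = [zone, borough]
    return frame

demo = fill_manual(demo, "PULocationID", "PUzone", "PUborough", manual_zones)
print(demo)
```

The same helper would be called a second time with the `DO*` column names for drop-off locations.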


In [22]:
# Verify updates for LocationID 57
print("Updated zones and boroughs for LocationID 57:")
print(df[df['PULocationID'] == 57][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 57][['DOLocationID', 'DOzone', 'DOborough']].head(2))

# Verify updates for LocationID 105
print("Updated zones and boroughs for LocationID 105:")
print(df[df['PULocationID'] == 105][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 105][['DOLocationID', 'DOzone', 'DOborough']].head(2))


Updated zones and boroughs for LocationID 57:
        PULocationID  PUzone PUborough
196705            57  Corona    Queens
464231            57  Corona    Queens
       DOLocationID  DOzone DOborough
1122             57  Corona    Queens
16079            57  Corona    Queens
Updated zones and boroughs for LocationID 105:
         PULocationID                                         PUzone  \
3487299           105  Governor's Island/Ellis Island/Liberty Island   
9893957           105  Governor's Island/Ellis Island/Liberty Island   

         PUborough  
3487299  Manhattan  
9893957  Manhattan  
         DOLocationID                                         DOzone  \
2243811           105  Governor's Island/Ellis Island/Liberty Island   
3284053           105  Governor's Island/Ellis Island/Liberty Island   

         DOborough  
2243811  Manhattan  
3284053  Manhattan  


In [23]:
# Check again for null values in the zone and borough columns
print("Null values in PUzone and PUborough after update:")
print(df[['PUzone', 'PUborough']].isnull().sum())

print("Null values in DOzone and DOborough after update:")
print(df[['DOzone', 'DOborough']].isnull().sum())


Null values in PUzone and PUborough after update:
PUzone       0
PUborough    0
dtype: int64
Null values in DOzone and DOborough after update:
DOzone       0
DOborough    0
dtype: int64


In [24]:
df['PUzone'] = df['PUzone'].astype('category')
df['PUborough'] = df['PUborough'].astype('category')
df['DOzone'] = df['DOzone'].astype('category')
df['DOborough'] = df['DOborough'].astype('category')

## General Locations

Our aim is not to predict fares specifically for New York but to create a general model that can also predict fares for other cities. New York-specific zoning is therefore not useful for our modeling purposes, so we will use general categories instead:

City Center, Airport, and Suburbs, assigned across New York City's five boroughs.

In [25]:
# Define the conditions and choices for categorization
conditions = [
    df['PUzone'].isin(["JFK Airport", "LaGuardia Airport"]),
    df['PUborough'] == "Manhattan",
    df['PUborough'].isin(["Brooklyn", "Queens", "Bronx", "Staten Island"])
]

choices = ["Airport", "City Center", "Suburbs"]

# Apply the categorization for pickup locations
df['PUcategory'] = np.select(conditions, choices, default="Other")


In [26]:

# Apply the categorization for drop-off locations
conditions = [
    df['DOzone'].isin(["JFK Airport", "LaGuardia Airport"]),
    df['DOborough'] == "Manhattan",
    df['DOborough'].isin(["Brooklyn", "Queens", "Bronx", "Staten Island"])
]

df['DOcategory'] = np.select(conditions, choices, default="Other")
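
Because the pickup and drop-off logic is identical, it can be wrapped in one helper. The sketch below assumes hypothetical names (`categorize_location`, `AIRPORT_ZONES`, `SUBURB_BOROUGHS`) and a toy frame. It also shows why the order of conditions matters: `np.select` takes the first matching condition, so JFK is labeled "Airport" even though its borough is Queens.

```python
import numpy as np
import pandas as pd

AIRPORT_ZONES = ["JFK Airport", "LaGuardia Airport"]
SUBURB_BOROUGHS = ["Brooklyn", "Queens", "Bronx", "Staten Island"]

def categorize_location(zone, borough):
    """Map zone/borough Series to Airport, City Center, Suburbs, or Other."""
    conditions = [
        zone.isin(AIRPORT_ZONES),        # checked first, so airports win over their borough
        borough == "Manhattan",
        borough.isin(SUBURB_BOROUGHS),
    ]
    choices = ["Airport", "City Center", "Suburbs"]
    return np.select(conditions, choices, default="Other")

# Toy data (hypothetical rows)
demo = pd.DataFrame({
    "PUzone": ["JFK Airport", "Central Park", "Corona", "Outside NYC"],
    "PUborough": ["Queens", "Manhattan", "Queens", "Unknown"],
})
demo["PUcategory"] = categorize_location(demo["PUzone"], demo["PUborough"])
print(demo)
```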



In [70]:
# Check the first few rows to ensure the categorization
print(df[['PULocationID', 'PUzone', 'PUborough', 'PUcategory', 'DOLocationID', 'DOzone', 'DOborough', 'DOcategory']].head())


   PULocationID                 PUzone  PUborough   PUcategory  DOLocationID  \
0            43           Central Park  Manhattan  City Center           237   
1            48           Clinton East  Manhattan  City Center           238   
2           107               Gramercy  Manhattan  City Center            79   
3           161         Midtown Center  Manhattan  City Center           137   
4           239  Upper West Side South  Manhattan  City Center           143   

                  DOzone  DOborough   DOcategory  
0  Upper East Side South  Manhattan  City Center  
1  Upper West Side North  Manhattan  City Center  
2           East Village  Manhattan  City Center  
3               Kips Bay  Manhattan  City Center  
4    Lincoln Square West  Manhattan  City Center  


# Creating Map For General Categories

In [2]:
import folium

# Create map for NYC taxi zones
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Define NYC taxi zone areas with polygons and their categories
nyc_taxi_zones_polygons = {
    "Airport": [
        {"name": "JFK Airport", "coordinates": [[40.6413, -73.7781]]},
        {"name": "LaGuardia Airport", "coordinates": [[40.7769, -73.8740]]}
    ],
    "City Center": [
        {"name": "Manhattan", "coordinates": [[40.7831, -73.9712]]}
    ],
    "Suburbs": [
        {"name": "Brooklyn", "coordinates": [[40.6782, -73.9442]]},
        {"name": "Queens", "coordinates": [[40.7282, -73.7949]]},
        {"name": "Bronx", "coordinates": [[40.8448, -73.8648]]},
        {"name": "Staten Island", "coordinates": [[40.5795, -74.1502]]}
    ]
}

# Colors for categories
colors = {
    "Airport": "blue",
    "City Center": "green",
    "Suburbs": "orange"
}

# Add circle markers to NYC map (the loop variable avoids shadowing the `zones` DataFrame)
for category, category_zones in nyc_taxi_zones_polygons.items():
    for zone in category_zones:
        folium.CircleMarker(
            location=zone["coordinates"][0],
            radius=50,
            color=colors[category],
            fill=True,
            fill_color=colors[category],
            fill_opacity=0.2,
            popup=f"{zone['name']} - {category}"
        ).add_to(nyc_map)

# Create map for Tbilisi zones
tbilisi_map = folium.Map(location=[41.7151, 44.8271], zoom_start=12)

# Define Tbilisi zone areas with polygons and their categories
tbilisi_zones_polygons = {
    "Airport": [
        {"name": "Tbilisi International Airport", "coordinates": [[41.6692, 44.9547]]}
    ],
    "City Center": [
        {"name": "Rustaveli", "coordinates": [[41.7161, 44.7922]]},
        {"name": "Saburtalo", "coordinates": [[41.7209, 44.7682]]}
    ],
    "Suburbs": [
        {"name": "Dighomi", "coordinates": [[41.7707, 44.7594]]},
        {"name": "Chughureti", "coordinates": [[41.7266, 44.7874]]}
    ]
}

# Add circle markers to Tbilisi map
for category, category_zones in tbilisi_zones_polygons.items():
    for zone in category_zones:
        folium.CircleMarker(
            location=zone["coordinates"][0],
            radius=50,
            color=colors[category],
            fill=True,
            fill_color=colors[category],
            fill_opacity=0.2,
            popup=f"{zone['name']} - {category}"
        ).add_to(tbilisi_map)

# Save maps to HTML files
nyc_map.save('nyc_taxi_zones_overlay.html')
tbilisi_map.save('tbilisi_zones_overlay.html')


In [27]:
# Test the categorization
def test_general_categorization(df):
    # Check that 'JFK Airport' and 'LaGuardia Airport' zones are categorized as 'Airport'
    assert all(df[df['PUzone'].isin(["JFK Airport", "LaGuardia Airport"])]['PUcategory'] == 'Airport'), "Test Failed for Airport PUzone"
    assert all(df[df['DOzone'].isin(["JFK Airport", "LaGuardia Airport"])]['DOcategory'] == 'Airport'), "Test Failed for Airport DOzone"
    
    # Check that Manhattan borough is categorized as 'City Center'
    assert all(df[df['PUborough'] == "Manhattan"]['PUcategory'] == 'City Center'), "Test Failed for Manhattan PUborough"
    assert all(df[df['DOborough'] == "Manhattan"]['DOcategory'] == 'City Center'), "Test Failed for Manhattan DOborough"
    
    # Check that Brooklyn, Queens, Bronx, and Staten Island (excluding airport zones) are categorized as 'Suburbs'
    suburbs = ["Brooklyn", "Queens", "Bronx", "Staten Island"]
    suburban_pu = df[df['PUborough'].isin(suburbs) & ~df['PUzone'].isin(["JFK Airport", "LaGuardia Airport"])]
    suburban_do = df[df['DOborough'].isin(suburbs) & ~df['DOzone'].isin(["JFK Airport", "LaGuardia Airport"])]

    if not all(suburban_pu['PUcategory'] == 'Suburbs'):
        print("Debugging PUcategory for Suburbs:")
        print(suburban_pu[suburban_pu['PUcategory'] != 'Suburbs'][['PULocationID', 'PUzone', 'PUborough', 'PUcategory']])
    
    if not all(suburban_do['DOcategory'] == 'Suburbs'):
        print("Debugging DOcategory for Suburbs:")
        print(suburban_do[suburban_do['DOcategory'] != 'Suburbs'][['DOLocationID', 'DOzone', 'DOborough', 'DOcategory']])
    
    assert all(suburban_pu['PUcategory'] == 'Suburbs'), "Test Failed for Suburbs PUborough"
    assert all(suburban_do['DOcategory'] == 'Suburbs'), "Test Failed for Suburbs DOborough"
    
    # Check that any other zones are categorized as 'Other'
    other_zones_pu = df[~df['PUzone'].isin(["JFK Airport", "LaGuardia Airport"]) & 
                        ~df['PUborough'].isin(suburbs + ["Manhattan"])]
    other_zones_do = df[~df['DOzone'].isin(["JFK Airport", "LaGuardia Airport"]) & 
                        ~df['DOborough'].isin(suburbs + ["Manhattan"])]
    
    if not all(other_zones_pu['PUcategory'] == 'Other'):
        print("Debugging PUcategory for Other:")
        print(other_zones_pu[other_zones_pu['PUcategory'] != 'Other'][['PULocationID', 'PUzone', 'PUborough', 'PUcategory']])
    
    if not all(other_zones_do['DOcategory'] == 'Other'):
        print("Debugging DOcategory for Other:")
        print(other_zones_do[other_zones_do['DOcategory'] != 'Other'][['DOLocationID', 'DOzone', 'DOborough', 'DOcategory']])
    
    assert all(other_zones_pu['PUcategory'] == 'Other'), "Test Failed for Other PUzone"
    assert all(other_zones_do['DOcategory'] == 'Other'), "Test Failed for Other DOzone"
    
    print("All general tests passed!")




In [27]:
# Run the general test
test_general_categorization(df)

All general tests passed!


In [28]:
# Columns that are no longer needed for modeling
columns_to_drop = ['PULocationID', 'DOLocationID', 'PUzone', 'PUborough', 'DOzone', 'DOborough']

# Drop the columns
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Display the first few rows to verify the columns have been dropped
df.head()

# Optional: Display the remaining columns to verify
print("Remaining columns:", df.columns.tolist())

Remaining columns: ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount', 'extra', 'tip_amount', 'total_amount', 'JFK_LGA_Pickup_Fee', 'General_Airport_Fee', 'distance_bins', 'pickup_time_of_day', 'pickup_season', 'passenger_count_category', 'pickup_day_type', 'transaction_month', 'trip_duration', 'trip_duration_hours', 'PUcategory', 'DOcategory']


## Holiday
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

In [29]:
from pandas.tseries.holiday import USFederalHolidayCalendar

# Create a calendar object
calendar = USFederalHolidayCalendar()

# Define the range for your data
start_date = '2023-01-01'
end_date = '2023-12-31'

# Generate holidays
holidays = calendar.holidays(start=start_date, end=end_date)

# Add a column to your dataframe indicating whether the trip started on a holiday
df['is_holiday'] = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)

In [77]:
df['is_holiday'].describe()

count   28093416.00
mean           0.02
std            0.15
min            0.00
25%            0.00
50%            0.00
75%            0.00
max            1.00
Name: is_holiday, dtype: float64

In [78]:
holidays = df[df['is_holiday']==1]
holidays.describe()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,General_Airport_Fee,transaction_month,trip_duration,trip_duration_hours,is_holiday
count,631098,631098,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0,631098.0
mean,2023-07-08 13:08:13.026664,2023-07-08 13:27:11.554902,1.47,4.93,24.14,1.38,4.16,34.06,0.04,0.21,6.76,18.98,0.32,1.0
min,2023-01-02 00:00:03,2023-01-02 00:05:01,1.0,1.0,3.0,0.0,0.0,5.5,0.0,0.0,1.0,2.0,0.03,1.0
25%,2023-02-20 18:42:47.500000,2023-02-20 18:59:16.500000,1.0,1.57,11.4,0.0,1.0,17.64,0.0,0.0,2.0,9.18,0.15,1.0
50%,2023-07-04 14:32:03.500000,2023-07-04 14:48:17.500000,1.0,2.47,15.6,1.0,3.08,23.0,0.0,0.0,7.0,13.87,0.23,1.0
75%,2023-11-10 08:20:56.500000,2023-11-10 08:41:03.750000,2.0,5.7,28.9,2.5,5.0,38.1,0.0,0.0,11.0,22.02,0.37,1.0
max,2023-12-25 23:59:57,2023-12-26 22:46:23,6.0,50.0,299.0,12.75,222.21,379.81,1.25,1.75,12.0,1439.88,24.0,1.0
std,,,0.93,5.43,20.35,1.91,4.65,25.95,0.22,0.56,3.91,44.11,0.74,0.0


The holiday trips show a fairly high fare amount: a mean of 24.14 and a maximum of 299.

In [79]:
holidays.sample(4)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,...,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUcategory,DOcategory,is_holiday
3805097,2023-02-20 17:04:13,2023-02-20 17:12:38,4,1.9,1,11.4,0.0,0.1,15.5,0.0,...,afternoon,winter,medium,weekday,2,8.42,0.14,City Center,City Center,1
21449553,2023-10-09 16:02:37,2023-10-09 16:08:36,1,1.81,2,9.3,0.0,0.0,13.3,0.0,...,afternoon,autumn,low,weekday,10,5.98,0.1,City Center,City Center,1
11852337,2023-05-29 17:09:52,2023-05-29 17:20:13,4,1.69,1,11.4,0.0,1.54,16.94,0.0,...,afternoon,spring,medium,weekday,5,10.35,0.17,City Center,City Center,1
13562742,2023-06-19 11:18:29,2023-06-19 11:43:36,1,1.1,2,10.7,2.5,0.0,14.7,0.0,...,morning,summer,low,weekday,6,25.12,0.42,City Center,City Center,1


In [80]:
df.sample(4)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,payment_type,fare_amount,extra,tip_amount,total_amount,JFK_LGA_Pickup_Fee,...,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUcategory,DOcategory,is_holiday
7854077,2023-04-12 14:51:11,2023-04-12 15:20:07,1,4.34,1,28.2,0.0,6.44,38.64,0.0,...,afternoon,spring,low,weekday,4,28.93,0.48,City Center,City Center,0
21467091,2023-10-09 20:14:22,2023-10-09 20:20:23,1,1.1,2,7.2,3.5,0.0,12.2,0.0,...,evening,autumn,low,weekday,10,6.02,0.1,City Center,City Center,1
24187726,2023-11-10 22:59:43,2023-11-10 23:21:20,1,4.59,1,24.0,1.0,5.8,34.8,0.0,...,evening,autumn,low,weekday,11,21.62,0.36,City Center,Suburbs,1
17114414,2023-08-07 21:58:29,2023-08-07 22:05:03,1,1.1,1,8.6,3.5,0.0,13.6,0.0,...,evening,summer,low,weekday,8,6.57,0.11,City Center,City Center,0


In [30]:
# Filter the DataFrame to include only the rows where 'is_holiday' is 1
holiday_trips = df[df['is_holiday'] == 1].copy()  # Adding .copy() to avoid SettingWithCopyWarning on a slice

# Extract (month, day) pairs without adding new columns
holiday_trips['month_day'] = holiday_trips['tpep_pickup_datetime'].apply(lambda x: (x.month, x.day))

# Find unique (month, day) pairs
unique_holiday_dates = holiday_trips['month_day'].unique()

# Display the unique holiday dates
print(unique_holiday_dates)


[(1, 2) (1, 16) (2, 20) (5, 29) (6, 19) (7, 4) (9, 4) (10, 9) (11, 23)
 (11, 10) (12, 25)]





1. **(6, 19)** - June 19: Juneteenth National Independence Day, a federal holiday recognizing the emancipation of enslaved African Americans.
2. **(7, 4)** - July 4: Independence Day, a major national holiday in the United States celebrating the country's declaration of independence from the British Empire.
3. **(5, 29)** - May 29: This date in 2023 was Memorial Day, observed on the last Monday of May each year, honoring the military personnel who have died in the performance of their military duties.
4. **(11, 10)** - November 10: In 2023, this was the observed day for Veterans Day (November 11), as November 11 fell on a Saturday.
5. **(11, 23)** - November 23: This date in 2023 was Thanksgiving Day, a significant U.S. holiday celebrated on the fourth Thursday of November.
6. **(1, 2)** - January 2: In 2023, this was the observed day for New Year’s Day (January 1), as January 1st fell on a Sunday.
7. **(1, 16)** - January 16: Martin Luther King Jr. Day in 2023, celebrated on the third Monday of January to honor the civil rights leader.
8. **(10, 9)** - October 9: Columbus Day/Indigenous Peoples' Day in 2023, observed on the second Monday in October.
9. **(9, 4)** - September 4: Labor Day in 2023, which is celebrated on the first Monday of September and honors the American labor movement.
10. **(12, 25)** - December 25: Christmas Day, a major holiday across many cultures, marking the celebration of the birth of Jesus Christ.
11. **(2, 20)** - February 20: Presidents' Day in 2023, observed on the third Monday of February in honor of George Washington and other presidents.
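
The observed dates in this list (January 2 and November 10) come from the calendar shifting holidays that fall on a weekend to the nearest workday. The following sketch reproduces the flag logic on a few made-up pickup timestamps; `pickups` and `holidays_2023` are names assumed for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Federal holidays for 2023, including observed dates shifted off weekends
holidays_2023 = USFederalHolidayCalendar().holidays(start="2023-01-01", end="2023-12-31")

# Toy pickup timestamps (hypothetical)
pickups = pd.Series(pd.to_datetime([
    "2023-11-10 08:15:00",  # Veterans Day (observed)
    "2023-11-11 08:15:00",  # the actual November 11, a Saturday
    "2023-01-02 00:05:00",  # New Year's Day (observed)
    "2023-03-15 12:00:00",  # ordinary weekday
]))

# Same flag logic as above: strip the time of day, then test membership
is_holiday = pickups.dt.normalize().isin(holidays_2023).astype(int)
print(is_holiday.tolist())
```

Note that with this calendar the actual weekend date is not flagged, only its observed weekday.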



In [32]:
df['is_holiday'] = df['is_holiday'].astype('category')

## Speed

**Speed**: By calculating the average speed of a trip (speed = distance / duration), we can examine if faster trips result in different pricing.

In [33]:


# Calculate speed
df['speed_mph'] = df['trip_distance'] / df['trip_duration_hours']

# Handle any potential infinite or NaN values that may occur if duration is zero.
# Assign the result back: replace/fillna with inplace=True on a selected column is unreliable in recent pandas.
df['speed_mph'] = df['speed_mph'].replace([np.inf, -np.inf], np.nan).fillna(0)  # Optionally set to zero or another placeholder value


In [34]:
df['speed_mph'].describe()

count   28093416.00
mean          12.25
std            6.95
min            0.04
25%            7.76
50%           10.34
75%           14.50
max         1153.47
Name: speed_mph, dtype: float64

In [89]:
df['speed_mph'].isna().sum()

0

A trip speed above 100 mph is unrealistic for taxi trips in the city, so we treat such records as outliers and remove them.

In [35]:
# Define a realistic maximum speed
max_realistic_speed = 100  # mph

In [36]:
# Filter the DataFrame to remove highly unrealistic speeds
df = df[df['speed_mph'] <= max_realistic_speed].copy()  # .copy() avoids SettingWithCopyWarning on later column assignments

In [37]:
# Describe the speed statistics after removing extreme data points
df['speed_mph'].describe()

count   28092948.00
mean          12.25
std            6.87
min            0.04
25%            7.76
50%           10.34
75%           14.50
max           99.89
Name: speed_mph, dtype: float64
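
The speed calculation and the 100 mph cap can be combined into one step. This is a minimal sketch, assuming a hypothetical helper `add_speed` and a toy frame; it follows the same choices as above (zero-duration trips get speed 0, speeds above the cap are dropped).

```python
import numpy as np
import pandas as pd

def add_speed(frame, max_speed_mph=100):
    """Return a copy with speed_mph added and rows above max_speed_mph removed."""
    out = frame.copy()
    speed = out["trip_distance"] / out["trip_duration_hours"]
    # Division by a zero duration yields inf; treat it as missing, then as 0
    out["speed_mph"] = speed.replace([np.inf, -np.inf], np.nan).fillna(0)
    return out[out["speed_mph"] <= max_speed_mph].copy()

# Toy trips (hypothetical): the third is implausibly fast, the fourth has zero duration
demo = pd.DataFrame({
    "trip_distance": [2.0, 10.0, 50.0, 1.0],
    "trip_duration_hours": [0.2, 0.5, 0.1, 0.0],
})
demo_speed = add_speed(demo)
print(demo_speed)
```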

### Adjusting Data Types

In [38]:
df.dtypes

tpep_pickup_datetime        datetime64[us]
tpep_dropoff_datetime       datetime64[us]
passenger_count                      int64
trip_distance                      float64
payment_type                      category
fare_amount                        float64
extra                              float64
tip_amount                         float64
total_amount                       float64
JFK_LGA_Pickup_Fee                 float64
General_Airport_Fee                float64
distance_bins                     category
pickup_time_of_day                category
pickup_season                     category
passenger_count_category          category
pickup_day_type                   category
transaction_month                    int32
trip_duration                      float64
trip_duration_hours                float64
PUcategory                          object
DOcategory                          object
is_holiday                        category
speed_mph                          float64
dtype: object

In [39]:

df['transaction_month'] = df['transaction_month'].astype('category')
df['PUcategory'] = df['PUcategory'].astype('category')
df['DOcategory'] = df['DOcategory'].astype('category')
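
Several columns can be converted in a single `astype` call by passing a column-to-dtype mapping. The sketch below uses a made-up frame to illustrate this, along with the memory saving that usually motivates the `category` dtype for low-cardinality columns.

```python
import pandas as pd

# Toy frame with repeated, low-cardinality values (hypothetical)
demo = pd.DataFrame({
    "transaction_month": [1, 2, 1, 3] * 1000,
    "PUcategory": ["City Center", "Airport", "Suburbs", "City Center"] * 1000,
})

before = demo.memory_usage(deep=True).sum()
# astype accepts a column -> dtype mapping, so several columns convert in one call
demo = demo.astype({"transaction_month": "category", "PUcategory": "category"})
after = demo.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```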

## Testing NEW Feature Validity

We need to check that the created features fall within their expected bounds and that our code worked properly.

In [44]:
def test_trip_duration_positive():
    assert df['trip_duration'].min() > 0, "Error: Non-positive trip durations present in the dataset."


In [45]:
def test_time_of_day_categories():
    hours = df['tpep_pickup_datetime'].dt.hour
    conditions = [
        ((hours >= 5) & (hours <= 11)),
        ((hours >= 12) & (hours <= 17)),
        ((hours >= 18) & (hours <= 23)),
        (hours < 5)  # dt.hour ranges over 0-23, so night is simply hours 0-4
    ]
    categories = ['morning', 'afternoon', 'evening', 'night']
    for condition, category in zip(conditions, categories):
        assert all(df.loc[condition, 'pickup_time_of_day'] == category), f"Error in categorizing {category}."


In [46]:
def test_passenger_count_categories():
    conditions = [
        (df['passenger_count'] == 1),
        (df['passenger_count'].between(2, 4)),
        (df['passenger_count'].between(5, 6))
    ]
    categories = ['low', 'medium', 'high']
    for condition, category in zip(conditions, categories):
        assert all(df.loc[condition, 'passenger_count_category'] == category), f"Error in categorizing passenger count {category}."


In [47]:
def test_seasonal_categories():
    months = df['tpep_pickup_datetime'].dt.month
    conditions = [
        (months.isin([3, 4, 5])),
        (months.isin([6, 7, 8])),
        (months.isin([9, 10, 11])),
        (months.isin([12, 1, 2]))
    ]
    seasons = ['spring', 'summer', 'autumn', 'winter']
    for condition, season in zip(conditions, seasons):
        assert all(df.loc[condition, 'pickup_season'] == season), f"Error in season categorization for {season}."


In [48]:
from pandas.tseries.holiday import USFederalHolidayCalendar

def test_holiday_flag():
    calendar = USFederalHolidayCalendar()
    holidays = calendar.holidays(start=df['tpep_pickup_datetime'].min(), end=df['tpep_pickup_datetime'].max())
    # Recompute the flag locally instead of adding a helper column to df
    calculated_holiday = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)
    # is_holiday was cast to category earlier, so compare as integers
    assert (calculated_holiday == df['is_holiday'].astype(int)).all(), "Holiday flag mismatches detected."


In [49]:
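# Run the duration, time-of-day, and passenger-count tests
# (test_seasonal_categories and test_holiday_flag are defined above but not called here)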
try:
    test_trip_duration_positive()
    test_time_of_day_categories()
    test_passenger_count_categories()
    print("All tests passed!")
except AssertionError as e:
    print("Test failed:", e)


All tests passed!
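
With a single `try` block, the first failing assertion hides the outcome of the remaining tests. One possible alternative is a small runner that records a result per test. The names below (`run_checks`, `check_ok`, `check_bad`) are hypothetical stand-ins, not part of the notebook.

```python
def run_checks(checks):
    """Run each zero-argument check; return a {name: 'passed' | 'failed: ...'} dict."""
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "passed"
        except AssertionError as exc:
            results[check.__name__] = f"failed: {exc}"
    return results

# Hypothetical stand-in checks for illustration
def check_ok():
    assert 1 + 1 == 2

def check_bad():
    assert 2 < 1, "2 is not less than 1"

results = run_checks([check_ok, check_bad])
print(results)
```

In the notebook, the list passed to the runner would contain the five `test_*` functions defined above.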


# Final Check Of The Dataset

In [50]:
# Verify the data types of the columns
print(df.dtypes)


tpep_pickup_datetime        datetime64[us]
tpep_dropoff_datetime       datetime64[us]
passenger_count                      int64
trip_distance                      float64
payment_type                      category
fare_amount                        float64
extra                              float64
tip_amount                         float64
total_amount                       float64
JFK_LGA_Pickup_Fee                 float64
General_Airport_Fee                float64
distance_bins                     category
pickup_time_of_day                category
pickup_season                     category
passenger_count_category          category
pickup_day_type                   category
transaction_month                 category
trip_duration                      float64
trip_duration_hours                float64
PUcategory                        category
DOcategory                        category
is_holiday                        category
speed_mph                          float64
dtype: object

In [51]:
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print(missing_values)


tpep_pickup_datetime        0
tpep_dropoff_datetime       0
passenger_count             0
trip_distance               0
payment_type                0
fare_amount                 0
extra                       0
tip_amount                  0
total_amount                0
JFK_LGA_Pickup_Fee          0
General_Airport_Fee         0
distance_bins               0
pickup_time_of_day          0
pickup_season               0
passenger_count_category    0
pickup_day_type             0
transaction_month           0
trip_duration               0
trip_duration_hours         0
PUcategory                  0
DOcategory                  0
is_holiday                  0
speed_mph                   0
dtype: int64


In [53]:
# Select numeric columns
numeric_df = df.select_dtypes(include=[np.number])

# Calculate correlations with fare_amount
correlation_with_fare = numeric_df.corr()['fare_amount'].sort_values(ascending=False)

# Display the correlations
print(correlation_with_fare)

fare_amount           1.00
total_amount          0.98
trip_distance         0.96
speed_mph             0.60
tip_amount            0.59
General_Airport_Fee   0.59
trip_duration_hours   0.27
trip_duration         0.27
extra                 0.17
JFK_LGA_Pickup_Fee    0.16
passenger_count       0.04
Name: fare_amount, dtype: float64


- **Drop highly correlated features**: `total_amount`, which is almost a copy of the target `fare_amount` (correlation 0.98) and would leak it into the model.
- **Drop non-relevant features**: `extra`, `JFK_LGA_Pickup_Fee`, and `General_Airport_Fee`, as they are not applicable to Tbilisi.
- **Keep only one duration feature**: `trip_duration_hours`, to allow future trip duration input.
- **Retain important features**: `trip_distance`, `speed_mph`, and `tip_amount`, which have significant correlations with `fare_amount`.
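
To illustrate why a column built from the target shows up with a near-perfect correlation, the sketch below generates synthetic fares and tips (not the real data) and flags any feature whose absolute correlation with `fare_amount` exceeds a threshold. The 0.95 cut-off and the name `suspected_leaks` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): total_amount is constructed from the target
rng = np.random.default_rng(0)
n = 1_000
fare = rng.uniform(5, 60, n)
tip = rng.uniform(0, 10, n)
demo = pd.DataFrame({
    "fare_amount": fare,
    "tip_amount": tip,
    "total_amount": fare + tip,
    "passenger_count": rng.integers(1, 7, n),
})

# Correlation of every feature with the target, excluding the target itself
corr = demo.corr()["fare_amount"].drop("fare_amount")
leak_threshold = 0.95  # assumed cut-off
suspected_leaks = corr[corr.abs() > leak_threshold].index.tolist()
print(suspected_leaks)
```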

In [40]:
# Drop unnecessary columns based on correlation analysis and relevance
columns_to_drop = [
    'tpep_pickup_datetime', 'tpep_dropoff_datetime', 
    'total_amount', 'extra', 'JFK_LGA_Pickup_Fee', 'General_Airport_Fee'
]

In [45]:
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

In [46]:
df.head()

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUcategory,DOcategory,is_holiday,speed_mph
0,1,1.1,1,7.9,4.0,1-2 miles,night,winter,low,weekend,1,6.32,0.11,City Center,City Center,0,10.45
1,1,2.51,1,14.9,15.0,2-5 miles,night,winter,low,weekend,1,12.75,0.21,City Center,City Center,0,11.81
2,1,1.43,1,11.4,3.28,1-2 miles,night,winter,low,weekend,1,10.83,0.18,City Center,City Center,0,7.92
3,1,1.84,1,12.8,10.0,1-2 miles,night,winter,low,weekend,1,12.3,0.21,City Center,City Center,0,8.98
4,1,1.66,1,12.1,3.42,1-2 miles,night,winter,low,weekend,1,10.45,0.17,City Center,City Center,0,9.53


For future model testing or training, we will retain the full feature-engineered dataset.

In [49]:
#df.to_parquet('feature_engineered_full_taxi_trip_data.parquet')

In [51]:
df.head(1)

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUcategory,DOcategory,is_holiday,speed_mph
0,1,1.1,1,7.9,4.0,1-2 miles,night,winter,low,weekend,1,6.32,0.11,City Center,City Center,0,10.45


# Sampling For Further Analysis

In [25]:
df = pd.read_parquet('feature_engineered_full_taxi_trip_data.parquet')

In [26]:
df.head()

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type,transaction_month,trip_duration,trip_duration_hours,PUcategory,DOcategory,is_holiday,speed_mph
0,1,1.1,1,7.9,4.0,1-2 miles,night,winter,low,weekend,1,6.32,0.11,City Center,City Center,0,10.45
1,1,2.51,1,14.9,15.0,2-5 miles,night,winter,low,weekend,1,12.75,0.21,City Center,City Center,0,11.81
2,1,1.43,1,11.4,3.28,1-2 miles,night,winter,low,weekend,1,10.83,0.18,City Center,City Center,0,7.92
3,1,1.84,1,12.8,10.0,1-2 miles,night,winter,low,weekend,1,12.3,0.21,City Center,City Center,0,8.98
4,1,1.66,1,12.1,3.42,1-2 miles,night,winter,low,weekend,1,10.45,0.17,City Center,City Center,0,9.53


In [27]:
len(df)

28092948

### Understanding and Implementing Stratified Sampling

#### What is Stratified Sampling?

Stratified sampling is a statistical method used to ensure that various subgroups within a dataset are adequately represented within the sample. It involves dividing a population into smaller groups, known as 'strata', that are distinct and non-overlapping. Each stratum is defined by shared characteristics or criteria, making them homogeneous within each group but heterogeneous between groups. Common stratifying criteria include demographic variables such as age, income, education level, or specific attributes relevant to the study, like seasons or time of day in the context of taxi fare analysis.


#### Why Use Stratified Sampling for Taxi Fare Prediction?

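Stratified sampling is particularly beneficial for datasets where certain subgroups are expected to exhibit different behaviors or properties. In the taxi fare prediction project, fares are expected to differ by season, time of day, holiday status, and pickup category, so we draw the same fraction of trips from every combination of these variables to keep small groups (such as holiday trips) represented in the sample.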

In [46]:
# Stratify on season, time of day, holiday flag, and pickup category,
# drawing 1% of the trips from each combination
strata = ['pickup_season', 'pickup_time_of_day', 'is_holiday', 'PUcategory']
sampled_df = (
    df.groupby(strata, group_keys=False, observed=True)
      .apply(lambda x: x.sample(frac=0.01, random_state=42))
)

In [47]:
len(sampled_df)

280930
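
A quick way to confirm that stratified sampling preserves group proportions is to compare the shares before and after sampling. The sketch below uses toy data with one rare stratum. It also uses `DataFrameGroupBy.sample`, a built-in alternative to the `apply`-based call, which may select different rows than the call above for the same seed.

```python
import pandas as pd

# Toy data: 'winter' is a rare stratum (hypothetical proportions)
demo = pd.DataFrame({
    "pickup_season": ["summer"] * 900 + ["winter"] * 100,
    "fare_amount": range(1000),
})

# Sample 10% within each stratum
sampled_demo = demo.groupby("pickup_season").sample(frac=0.1, random_state=42)

# Compare the share of each stratum in the full data and in the sample
full_share = demo["pickup_season"].value_counts(normalize=True)
sample_share = sampled_demo["pickup_season"].value_counts(normalize=True)
print(pd.DataFrame({"full": full_share, "sample": sample_share}))
```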

In [48]:
sampled_df.columns

Index(['passenger_count', 'trip_distance', 'payment_type', 'fare_amount',
       'tip_amount', 'distance_bins', 'pickup_time_of_day', 'pickup_season',
       'passenger_count_category', 'pickup_day_type', 'transaction_month',
       'trip_duration', 'trip_duration_hours', 'PUcategory', 'DOcategory',
       'is_holiday', 'speed_mph'],
      dtype='object')

In [49]:
sampled_df.dtypes

passenger_count                int64
trip_distance                float64
payment_type                   int64
fare_amount                  float64
tip_amount                   float64
distance_bins               category
pickup_time_of_day          category
pickup_season               category
passenger_count_category    category
pickup_day_type             category
transaction_month              int32
trip_duration                float64
trip_duration_hours          float64
PUcategory                  category
DOcategory                  category
is_holiday                     int64
speed_mph                    float64
dtype: object

In [50]:
sampled_df[categorical_columns].describe(include='category')

Unnamed: 0,PUcategory,DOcategory,distance_bins,pickup_time_of_day,pickup_season,passenger_count_category,pickup_day_type
count,280930,280930,280930,280930,280930,280930,280930
unique,4,4,6,4,4,3,2
top,City Center,City Center,1-2 miles,afternoon,spring,low,weekday
freq,241784,243062,122142,99560,75862,212852,203622


In [54]:
sampled_df.describe()

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,transaction_month,trip_duration,trip_duration_hours,is_holiday,speed_mph
count,280930.0,280930.0,280930.0,280930.0,280930.0,280930.0,280930.0,280930.0,280930.0,280930.0
mean,1.4,4.25,1.19,22.84,4.06,6.5,20.35,0.34,0.02,12.24
std,0.89,4.81,0.44,18.39,4.27,3.45,42.45,0.71,0.15,6.87
min,1.0,1.0,1.0,3.0,0.0,1.0,2.0,0.03,0.0,0.04
25%,1.0,1.5,1.0,11.4,1.26,4.0,10.25,0.17,0.0,7.75
50%,1.0,2.26,1.0,16.3,3.28,6.0,15.18,0.25,0.0,10.32
75%,1.0,4.34,1.0,25.4,5.0,10.0,23.43,0.39,0.0,14.49
max,6.0,50.0,4.0,300.0,120.0,12.0,1439.02,23.98,1.0,75.37


In [55]:
df.describe()

Unnamed: 0,passenger_count,trip_distance,payment_type,fare_amount,tip_amount,transaction_month,trip_duration,trip_duration_hours,is_holiday,speed_mph
count,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0,28092948.0
mean,1.4,4.26,1.19,22.85,4.07,6.49,20.36,0.34,0.02,12.25
std,0.89,4.82,0.44,18.37,4.31,3.45,42.6,0.71,0.15,6.87
min,1.0,1.0,1.0,2.0,0.0,1.0,2.0,0.03,0.0,0.04
25%,1.0,1.5,1.0,11.4,1.26,4.0,10.25,0.17,0.0,7.76
50%,1.0,2.27,1.0,16.3,3.28,6.0,15.2,0.25,0.0,10.34
75%,1.0,4.36,1.0,25.4,5.0,10.0,23.43,0.39,0.0,14.5
max,6.0,50.0,4.0,300.0,984.3,12.0,1439.97,24.0,1.0,99.89


### Key Insights and Justification for Using a 1% Sample

#### Summary Statistics Comparison

To determine if the 1% sample is representative of the entire dataset, we compare the summary statistics for both the original dataset and the sampled dataset. Here are the key observations:

#### Count:
- **Original Dataset:** 28,092,948
- **Sampled Dataset:** 280,930
- The sampled dataset contains 1% of the original dataset, aligning with our sampling fraction.
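
The count comparison can be checked with a few lines of arithmetic. This sketch uses only the two row counts reported above; the variable names are illustrative. The sample size is not exactly 1% of the total, presumably because the number of rows drawn is rounded separately within each stratum.

```python
full_rows = 28_092_948    # len(df), as reported above
sample_rows = 280_930     # len(sampled_df), as reported above
frac = 0.01               # sampling fraction used per stratum

realized = sample_rows / full_rows
print(f"realized fraction:          {realized:.6%}")
print(f"rows at exactly 1%:         {full_rows * frac:,.2f}")
print(f"difference from exactly 1%: {sample_rows - full_rows * frac:+.2f} rows")
```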

#### Mean Values:
- The mean values of the numerical columns are nearly identical between the original dataset and the sampled dataset. This indicates that the sampling process has not biased the data significantly.
  - **Passenger Count:** 1.40 (both datasets)
  - **Trip Distance:** 4.25 (sampled) vs. 4.26 (original)
  - **Fare Amount:** 22.84 (sampled) vs. 22.85 (original)
  - **Tip Amount:** 4.06 (sampled) vs. 4.07 (original)
  - **Transaction Month:** 6.50 (sampled) vs. 6.49 (original)
  - **Trip Duration:** 20.35 minutes (sampled) vs. 20.36 minutes (original)
  - **Trip Duration Hours:** 0.34 (both datasets)
  - **Speed (mph):** 12.24 (sampled) vs. 12.25 (original)
  - **Is Holiday:** 0.02 (both datasets)
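
To put a number on "nearly identical", the sketch below computes the relative difference between the sampled and original means. The values are copied from the two `describe()` outputs above, which are rounded to two decimals, so the results are approximate.

```python
# (sampled mean, original mean), copied from the describe() outputs above.
means = {
    "passenger_count": (1.40, 1.40),
    "trip_distance": (4.25, 4.26),
    "fare_amount": (22.84, 22.85),
    "tip_amount": (4.06, 4.07),
    "transaction_month": (6.50, 6.49),
    "trip_duration": (20.35, 20.36),
    "trip_duration_hours": (0.34, 0.34),
    "speed_mph": (12.24, 12.25),
    "is_holiday": (0.02, 0.02),
}

# Relative difference of each sampled mean from the original mean.
rel_diff = {col: abs(s - o) / abs(o) for col, (s, o) in means.items()}

for col, diff in sorted(rel_diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<22}{diff:.3%}")

worst = max(rel_diff, key=rel_diff.get)
```

Every column differs by well under 0.5%, with `tip_amount` showing the largest gap.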

#### Standard Deviation:
- The standard deviation values are also very close between the two datasets, indicating similar variability.
  - **Passenger Count:** 0.89 (both datasets)
  - **Trip Distance:** 4.81 (sampled) vs. 4.82 (original)
  - **Fare Amount:** 18.39 (sampled) vs. 18.37 (original)
  - **Tip Amount:** 4.27 (sampled) vs. 4.31 (original)
  - **Transaction Month:** 3.45 (both datasets)
  - **Trip Duration:** 42.45 minutes (sampled) vs. 42.60 minutes (original)
  - **Trip Duration Hours:** 0.71 (both datasets)
  - **Speed (mph):** 6.87 (both datasets)
  - **Is Holiday:** 0.15 (both datasets)

#### Min, 25th Percentile, Median, 75th Percentile, and Max:
- The 25th percentile, median, and 75th percentile are nearly identical between the original and sampled datasets, which indicates that the sample preserves the bulk of the distribution. The minimum and maximum differ for a few columns, as expected when only 1% of the rows are drawn.
  - **Passenger Count:** Values are identical.
  - **Trip Distance:** Slight differences in the median (2.26 vs. 2.27) and 75th percentile (4.34 vs. 4.36); all other values match.
  - **Fare Amount:** The minimum differs (3.00 sampled vs. 2.00 original), but the quartiles and maximum are identical.
  - **Tip Amount:** The quartiles are identical, but the maximum is 120.00 in the sample vs. 984.30 in the original. The 984.30 value is likely an outlier or data-entry error in the original data that was not drawn into the sample.
  - **Transaction Month:** Identical across all percentiles.
  - **Trip Duration:** Slight differences in the median (15.18 vs. 15.20) and maximum (1439.02 vs. 1439.97); the rest are consistent.
  - **Speed (mph):** The quartiles differ by at most 0.02, but the maximum is noticeably lower in the sample (75.37 vs. 99.89), so some extreme values were not captured.
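
One way to separate negligible gaps from notable ones is to flag any statistic whose sampled value deviates from the original by more than a chosen tolerance. The sketch below does this for the values that differ in the tables above; the 5% tolerance is an arbitrary, hypothetical threshold.

```python
# (sampled, original) values copied from the describe() outputs above.
stats = {
    ("fare_amount", "min"): (3.00, 2.00),
    ("tip_amount", "max"): (120.00, 984.30),
    ("trip_duration", "max"): (1439.02, 1439.97),
    ("speed_mph", "max"): (75.37, 99.89),
    ("trip_distance", "50%"): (2.26, 2.27),
    ("trip_duration", "50%"): (15.18, 15.20),
    ("speed_mph", "50%"): (10.32, 10.34),
}

TOLERANCE = 0.05  # hypothetical threshold: flag gaps above 5% of the original value

flagged = []
for (col, stat), (sampled, original) in stats.items():
    gap = abs(sampled - original) / abs(original)
    marker = "FLAG" if gap > TOLERANCE else "ok"
    print(f"{col:<15}{stat:<5}{gap:>9.2%}  {marker}")
    if gap > TOLERANCE:
        flagged.append((col, stat))
```

Only the extremes are flagged (fare minimum, tip maximum, speed maximum), while the medians stay well inside the tolerance.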

### Justification for Using a 1% Sample

1. **Computational Efficiency**:
    - **Processing Time and Resource Usage**: Handling the entire dataset of just over 28 million records can be computationally intensive and time-consuming. A 1% sample reduces the processing time and computational resources required, making analysis more feasible and efficient.
    - **Memory Usage**: With large datasets, memory limitations can be a significant issue. Using a smaller sample ensures that memory constraints are not exceeded during data processing and analysis.

2. **Statistical Representativeness**:
    - **Distribution Consistency**: As observed from the descriptive statistics, the 1% sample maintains a very close distribution to the original dataset. The means, standard deviations, and percentiles of the sampled data are almost identical to those of the original data, indicating that the sample is statistically representative.
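    - **Stratified Random Sampling**: Each stratum was sampled at random at the same 1% rate, so every combination of season, time of day, holiday status, and pickup category appears in the sample in proportion to its size in the original data. Rare extreme values can still be missed, as the maximum tip and speed show, but the bulk of the distribution is preserved for inference and model training.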

3. **Practicality in Analysis**:
    - **Feasibility**: Analyzing a smaller dataset is more practical for exploratory data analysis, model prototyping, and initial hypothesis testing. Once a robust model or analysis method is developed, it can be scaled up or validated using the full dataset.
    - **Scalability**: Techniques and models developed on a representative sample can be easily scaled to the full dataset when needed. This stepwise approach ensures efficiency and practicality in the analysis workflow.
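
The memory argument in point 1 can be made concrete with a back-of-the-envelope estimate based on the dtypes listed earlier. This is a rough sketch. It assumes each category column is stored as one-byte integer codes (plausible here, since no column has more than 6 categories) and ignores the index and the category lookup tables, so actual usage will be somewhat higher.

```python
# Bytes per value for each dtype (category assumed to use int8 codes).
dtype_bytes = {"int64": 8, "float64": 8, "int32": 4, "category": 1}

# Number of columns of each dtype, counted from sampled_df.dtypes above.
column_counts = {"int64": 3, "float64": 6, "int32": 1, "category": 7}

bytes_per_row = sum(dtype_bytes[d] * n for d, n in column_counts.items())

full_bytes = 28_092_948 * bytes_per_row    # full dataset
sample_bytes = 280_930 * bytes_per_row     # 1% sample

print(f"bytes per row:  {bytes_per_row}")
print(f"full dataset:   {full_bytes / 1e9:.2f} GB")
print(f"1% sample:      {sample_bytes / 1e6:.2f} MB")
```

On these assumptions the full dataset needs a couple of gigabytes for the column data alone, while the sample fits in a few tens of megabytes.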

### Conclusion

Using a 1% stratified sample of the original dataset is justified by the need for computational efficiency and by the statistical representativeness of the sample. The consistency in means, standard deviations, and quartiles between the sample and the full dataset indicates that the sample closely reflects the original data, with the exception of a few extreme values. This approach balances the practical constraints of data handling with the need for robust, representative analysis.

# Saving Sampled Feature Engineered Dataset for EDA and Modeling

In [52]:
sampled_df.to_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/sampled_taxi_dataset_v.1.parquet', engine='pyarrow')
