# Taxi Fare Prediction Model: Feature Engineering & EDA

## Introduction
This project aims to develop a predictive model for taxi fares in NYC. Initially, we will create a model for NYC and adjust parameters to align with domain knowledge from Tbilisi. We hypothesize that factors such as time of day, seasonality, and holidays impact taxi demand and fare prices.



### Notebook Aim and Feature Development

#### **Objective**
The primary objective of our project is to develop a predictive model that accurately forecasts taxi fares. Initially focusing on New York City (NYC), we aim to expand and adapt the model to incorporate Tbilisi, employing localized domain knowledge to tailor our approach.

#### **Influence of Demand on Pricing**
The fare prices in the taxi industry are predominantly influenced by demand dynamics, which can fluctuate based on various factors including:

- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

#### **Additional Influential Features**
Beyond temporal and periodic factors, several other elements could influence fare pricing:

- **Passenger Count**: Exploring whether vehicles accommodating more passengers have different fare structures, similar to practices in ride-sharing applications.
- **Trip Distance and Duration**: Both metrics are crucial for pricing. While trip distance is a direct influencer, the duration might also affect costs, especially in varying traffic conditions.
- **Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.
- **Taxi Zones**: With the NYC taxi zone dataset, we can analyze whether specific pickup and dropoff locations impact fare prices due to their geographical significance.


### Summary
This comprehensive approach not only allows us to understand the multifaceted dynamics of taxi fare pricing in NYC but also sets a foundation for adapting the model to Tbilisi, ensuring that both city-specific and universal factors are considered for effective fare prediction.

## Data Loading

In this section we install the necessary packages, import the required libraries, and load the dataset.

In [None]:
!pip install pyarrow
!pip install fastparquet


In [None]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Imported under its own name because the holiday cell below calls USFederalHolidayCalendar()
from pandas.tseries.holiday import USFederalHolidayCalendar

In [None]:
# Load the cleaned trip data; replace the path below with the location of your own Parquet file
df_original = pd.read_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/cleaned_taxi_data.parquet', engine='pyarrow')  # or engine='fastparquet' if you prefer
df = df_original

Let's check that the cleaned data is accurate and still has all the expected columns after the data preparation in the previous notebook.

## Initial Data View and Cleaning

Since we are working with data that was already cleaned in the previous notebook, we do not need to clean it again for nulls, duplicates, or outliers. To confirm that the data is consistent and clean, we take an initial look at the loaded dataset below.

In [None]:
df.describe()

In [None]:
df.isnull().sum()

### Additional Dataset For Taxi Zones Loading

In [None]:
zones = pd.read_csv("/Users/md/Desktop/python_project/parquet_files/cleaned/taxi_zones.csv", sep=';')
zones.describe()

In [None]:
zones.sample(5)

The zones file contains 263 zones, while the trip data references 265 taxi zone IDs. When joining the two, we will therefore have to handle the unmatched IDs, either by identifying the missing zones or by removing the affected trips.

# Feature Engineering

Below, based on our domain knowledge and literature review, we will create new features or adjust existing ones to gain more insight into the data and create the best possible predictive model.

## Seasonal and Time Features


- **Seasons**: Investigating how seasonal changes—spring, summer, autumn, and winter—affect taxi demand and subsequently, fare prices.
- **Day of the Week**: Determining if there are variations in taxi usage and prices between weekdays and weekends.
- **Time of Day**: Analyzing how time segments (morning, afternoon, evening, and night) impact traffic conditions and fare rates, particularly during peak rush hours.
- **Duration**: How long did the trip last.

In [None]:
# Convert the pickup and dropoff datetime to pandas datetime format if not already
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Time of day segmentation
df['pickup_time_of_day'] = df['tpep_pickup_datetime'].dt.hour.apply(lambda x: 'morning' if 5 <= x <= 11
                                                                           else 'afternoon' if 12 <= x <= 17
                                                                           else 'evening' if 18 <= x <= 23
                                                                           else 'night')

# Seasons segmentation
df['pickup_season'] = df['tpep_pickup_datetime'].dt.month.apply(lambda x: 'spring' if 3 <= x <= 5
                                                                       else 'summer' if 6 <= x <= 8
                                                                       else 'autumn' if 9 <= x <= 11
                                                                       else 'winter')

# Passenger count categories
df['passenger_count_category'] = pd.cut(df['passenger_count'], bins=[0, 1, 4, 6], include_lowest=True, 
                                        labels=['low', 'medium', 'high'])

# Weekday/Weekend segmentation
df['pickup_day_type'] = df['tpep_pickup_datetime'].dt.day_name().apply(lambda x: 'weekend' if x in ['Saturday', 'Sunday'] else 'weekday')


#taxi_data_prepared['transaction_date'] = pd.to_datetime(taxi_data_prepared['tpep_pickup_datetime'].dt.date)
# -> we make it datetime again because it's very little use when it's just a string (can't compare, sort, etc.)
df['transaction_year'] = df['tpep_pickup_datetime'].dt.year
df['transaction_month'] = df['tpep_pickup_datetime'].dt.month
df['transaction_day'] =  df['tpep_pickup_datetime'].dt.day
df['transaction_hour'] = df['tpep_pickup_datetime'].dt.hour

#trip duration is another interesting feature to analyze 


# Calculate the trip duration and convert it to minutes
df['trip_duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
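
The `apply` calls with nested conditionals above run a Python function once per row, which can be slow on tens of millions of trips. Below is a minimal vectorized sketch of the same hour-to-segment mapping, assuming the same boundaries (5 to 11 morning, 12 to 17 afternoon, 18 to 23 evening, otherwise night). The `segment_hours` helper and the toy `hours` series are illustrative names, not part of the notebook.

```python
import pandas as pd

def segment_hours(hours: pd.Series) -> pd.Series:
    """Map hour of day (0-23) to a time-of-day label using right-inclusive bins."""
    return pd.cut(
        hours,
        # (-1, 4] night, (4, 11] morning, (11, 17] afternoon, (17, 23] evening
        bins=[-1, 4, 11, 17, 23],
        labels=['night', 'morning', 'afternoon', 'evening'],
    ).astype(str)

# Toy hours covering both edges of every segment
hours = pd.Series([0, 4, 5, 11, 12, 17, 18, 23])
segments = segment_hours(hours)
print(segments.tolist())
```

In the notebook this would be applied to `df['tpep_pickup_datetime'].dt.hour`. The same pattern could be used for the month-to-season mapping, although winter spans the year boundary and would need either an extra bin or a lookup dictionary.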


Let's take a look at the adjusted dataset and the newly created features. We sample the dataset to also check that the features were created correctly.

In [None]:
df.sample(10)

Below we will check if the trip duration calculations are correct

In [None]:
# Display a random sample of rows to confirm the new 'trip_duration' column
print(df[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']].sample(10))

To check if newly added features have correct values we will use descriptive statistics and adjust accordingly if needed.

In [None]:
df.describe()

We see that the month, year, day, and season features have sensible values, but the trip duration column contains negative values.

Negative trip durations may have occurred due to data entry issues, for example pickup and dropoff times being swapped. We can investigate how many negative values there are and then either drop the corrupted rows or correct them.

In [None]:
# Display cases with negative trip_duration
negative_durations = df[df['trip_duration'] < 0]
negative_durations[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']]

negative_durations.describe()

In [None]:
# Check for possible datetime swaps or errors
swapped_cases = df[df['tpep_pickup_datetime'] > df['tpep_dropoff_datetime']]
print(swapped_cases[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_duration']])


As we can see, there are only 727 negative values, a very small number compared to the full dataset of roughly 30 million records. Instead of going through the records to switch the pickup and dropoff times, we will drop all rows with a trip duration less than or equal to 0.

In [None]:
df = df[df['trip_duration']>0]
df.describe()

## Taxi Zones Feature
Taxi zone IDs, though informative, do not by themselves tell us where a passenger was picked up, and neighbourhoods are thought to affect pricing, at least when hailing a cab. We will therefore merge the taxi zone dataset with the NYC trip data on the zone IDs and identify the pickup and dropoff boroughs for each trip.

In [None]:
# Merge the zone data into the main taxi trip dataset for pickup locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='PULocationID', right_on='LocationID', how='left')
df.rename(columns={'zone': 'PUzone', 'borough': 'PUborough'}, inplace=True)

# Merge the zone data for dropoff locations
df = pd.merge(df, zones[['LocationID', 'zone', 'borough']], left_on='DOLocationID', right_on='LocationID', how='left', suffixes=('', '_drop'))
df.rename(columns={'zone': 'DOzone', 'borough': 'DOborough'}, inplace=True)

# Drop the extra LocationID columns if they are not needed
df.drop(['LocationID', 'LocationID_drop'], axis=1, inplace=True)
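
A left merge like the one above can silently change the row count if the lookup table repeats a `LocationID`, and it leaves nulls where an ID has no match. Below is a small sketch of how `validate` and `indicator` in `DataFrame.merge` can surface both problems. The `trips` and `zones_toy` frames are toy stand-ins with illustrative values, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the trip and zone tables (values are illustrative only)
trips = pd.DataFrame({'PULocationID': [1, 2, 264]})
zones_toy = pd.DataFrame({
    'LocationID': [1, 2],
    'zone': ['Newark Airport', 'Jamaica Bay'],
    'borough': ['EWR', 'Queens'],
})

merged = trips.merge(
    zones_toy,
    left_on='PULocationID',
    right_on='LocationID',
    how='left',
    validate='many_to_one',  # raises MergeError if LocationID repeats in zones_toy
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# IDs that found no partner in the zone table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PULocationID'].tolist()
print(unmatched)
print(len(merged) == len(trips))  # a many-to-one left merge keeps the row count
```

If `validate='many_to_one'` raises on the real zones file, deduplicating it on `LocationID` before merging would be one way to keep the trip count unchanged.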


In [None]:
print(df['PUborough'].value_counts())
print(df['DOborough'].value_counts())

In [None]:
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())

In [None]:
print(sorted(zones['LocationID'].unique()))
print(sorted(df['PULocationID'].unique()))
print(sorted(df['DOLocationID'].unique()))

We can see that the trip data contains two zone IDs, 264 and 265, which have no specific borough and are not in our taxi zones dataset.

In [None]:
missing_pu = df[~df['PULocationID'].isin(zones['LocationID'])]
missing_do = df[~df['DOLocationID'].isin(zones['LocationID'])]
print(f"Missing PULocationIDs: {missing_pu['PULocationID'].unique()}")
print(f"Missing DOLocationIDs: {missing_do['DOLocationID'].unique()}")

In [None]:
# Filter data for PULocationID or DOLocationID being 264 or 265
trips = df[(df['PULocationID'].isin([264, 265])) | (df['DOLocationID'].isin([264, 265]))]

# Print the filtered data summary
trips.describe(include='all')


In [None]:
# Display sample records
trips.sample(10)

In [None]:
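# Manually assign zones for IDs 264 and 265, which are absent from the zones file.
# The labels below are placeholders; verify them against the official TLC taxi zone lookup table.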
df.loc[df['PULocationID'] == 264, ['PUzone', 'PUborough']] = ['Outside NYC', 'Unknown']
df.loc[df['DOLocationID'] == 264, ['DOzone', 'DOborough']] = ['Outside NYC', 'Unknown']
df.loc[df['PULocationID'] == 265, ['PUzone', 'PUborough']] = ['Airport Area', 'Unknown']
df.loc[df['DOLocationID'] == 265, ['DOzone', 'DOborough']] = ['Airport Area', 'Unknown']


In [None]:
# Check for null values in the updated columns
print(df[['PUzone', 'PUborough', 'DOzone', 'DOborough']].isnull().sum())



In [None]:
# Print rows where PUzone or PUborough is null
print("Rows with missing PUzone or PUborough:")
print(df[df['PUzone'].isnull() | df['PUborough'].isnull()][['PULocationID', 'PUzone', 'PUborough']].head())

# Print rows where DOzone or DOborough is null
print("Rows with missing DOzone or DOborough:")
print(df[df['DOzone'].isnull() | df['DOborough'].isnull()][['DOLocationID', 'DOzone', 'DOborough']].head())


In [None]:
# List unique LocationIDs associated with null zones or boroughs
missing_pu_ids = df[df['PUzone'].isnull()]['PULocationID'].unique()
missing_do_ids = df[df['DOzone'].isnull()]['DOLocationID'].unique()
print(f"Missing PULocationIDs: {missing_pu_ids}")
print(f"Missing DOLocationIDs: {missing_do_ids}")


In [None]:
# Manually assign zones and boroughs for LocationID 57 and 105
df.loc[df['PULocationID'] == 57, ['PUzone', 'PUborough']] = ['Corona', 'Queens']
df.loc[df['DOLocationID'] == 57, ['DOzone', 'DOborough']] = ['Corona', 'Queens']

df.loc[df['PULocationID'] == 105, ['PUzone', 'PUborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']
df.loc[df['DOLocationID'] == 105, ['DOzone', 'DOborough']] = ["Governor's Island/Ellis Island/Liberty Island", 'Manhattan']


In [None]:
# Verify updates for LocationID 57
print("Updated zones and boroughs for LocationID 57:")
print(df[df['PULocationID'] == 57][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 57][['DOLocationID', 'DOzone', 'DOborough']].head(2))

# Verify updates for LocationID 105
print("Updated zones and boroughs for LocationID 105:")
print(df[df['PULocationID'] == 105][['PULocationID', 'PUzone', 'PUborough']].head(2))
print(df[df['DOLocationID'] == 105][['DOLocationID', 'DOzone', 'DOborough']].head(2))


In [None]:
# Check again for null values in the zone and borough columns
print("Null values in PUzone and PUborough after update:")
print(df[['PUzone', 'PUborough']].isnull().sum())

print("Null values in DOzone and DOborough after update:")
print(df[['DOzone', 'DOborough']].isnull().sum())


In [None]:
df.isnull().sum()

## Holiday
- **Holidays**: Assessing the effect of major holidays (e.g., Christmas, Thanksgiving) on taxi demand, given the potential increase in tourism and local activity.

In [None]:
# Create a calendar object
calendar = USFederalHolidayCalendar()

# Define the range for your data
start_date = '2023-01-01'
end_date = '2023-12-31'

# Generate holidays
holidays = calendar.holidays(start=start_date, end=end_date)

# Add a column to your dataframe indicating whether the trip started on a holiday
df['is_holiday'] = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)
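
To illustrate what the flag above marks, here is a small sketch on a handful of hand-picked 2023 timestamps (the `pickups` series is a toy example). It shows that `USFederalHolidayCalendar` returns observed dates, so a holiday falling on a weekend is flagged on the neighbouring workday, and that only federal holidays are covered.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

holidays = USFederalHolidayCalendar().holidays(start='2023-01-01', end='2023-12-31')

pickups = pd.Series(pd.to_datetime([
    '2023-01-01 10:15',  # New Year's Day, a Sunday in 2023
    '2023-01-02 08:00',  # Monday on which New Year's Day is observed
    '2023-11-23 18:30',  # Thanksgiving
    '2023-12-25 23:59',  # Christmas Day
    '2023-12-31 22:00',  # New Year's Eve, not a federal holiday
]))

# normalize() drops the time of day so timestamps compare equal to the holiday dates
is_holiday = pickups.dt.normalize().isin(holidays).astype(int)
print(is_holiday.tolist())
```

If the actual calendar date matters more for taxi demand than the observed date, a custom calendar or an explicit list of dates may be a better fit. That is a modeling choice this sketch does not settle.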

## Velocity

**Velocity**: By calculating the average speed of a trip (velocity = distance/duration), we can examine if faster trips result in different pricing.

In [None]:
# First, ensure your trip_duration is in hours for speed calculation
df['trip_duration_hours'] = df['trip_duration'] / 60.0

# Calculate average speed in miles per hour
df['speed_mph'] = df['trip_distance'] / df['trip_duration_hours']

# Handle any infinite or NaN values that would occur if a duration were zero.
# Assign the result back instead of calling inplace=True on a column selection,
# which does not reliably modify df in recent pandas versions.
df['speed_mph'] = df['speed_mph'].replace([np.inf, -np.inf], np.nan)
df['speed_mph'] = df['speed_mph'].fillna(0)  # Optionally set to zero or another placeholder value
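
As a quick check of the units, the sketch below computes speed on three toy trips (values chosen for easy arithmetic, not taken from the data). It also shows an alternative to the zero placeholder: masking zero durations before dividing, so undefined speeds stay `NaN` rather than looking like stationary trips.

```python
import pandas as pd

toy = pd.DataFrame({
    'trip_distance': [3.0, 10.0, 1.5],   # miles
    'trip_duration': [15.0, 30.0, 0.0],  # minutes
})

hours = toy['trip_duration'] / 60.0
# where() replaces non-positive durations with NaN, so the division never yields inf
toy['speed_mph'] = toy['trip_distance'] / hours.where(hours > 0)
print(toy['speed_mph'].tolist())
```

Whether `NaN` or `0` is the better placeholder depends on the downstream model; some estimators reject missing values, in which case the rows would need to be dropped or imputed.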


## Testing NEW Feature Validity

We need to check if created features are within their bounds and our code worked properly.

In [None]:
def test_trip_duration_positive():
    assert df['trip_duration'].min() > 0, "Error: Non-positive trip durations present in the dataset."


In [None]:
def test_time_of_day_categories():
    hours = df['tpep_pickup_datetime'].dt.hour
    conditions = [
        ((hours >= 5) & (hours <= 11)),
        ((hours >= 12) & (hours <= 17)),
        ((hours >= 18) & (hours <= 23)),
        (hours < 5)  # dt.hour ranges from 0 to 23, so night is simply every hour before 5
    ]
    categories = ['morning', 'afternoon', 'evening', 'night']
    for condition, category in zip(conditions, categories):
        assert all(df.loc[condition, 'pickup_time_of_day'] == category), f"Error in categorizing {category}."


In [None]:
def test_passenger_count_categories():
    conditions = [
        (df['passenger_count'] == 1),
        (df['passenger_count'].between(2, 4)),
        (df['passenger_count'].between(5, 6))
    ]
    categories = ['low', 'medium', 'high']
    for condition, category in zip(conditions, categories):
        assert all(df.loc[condition, 'passenger_count_category'] == category), f"Error in categorizing passenger count {category}."


In [None]:
def test_seasonal_categories():
    months = df['tpep_pickup_datetime'].dt.month
    conditions = [
        (months.isin([3, 4, 5])),
        (months.isin([6, 7, 8])),
        (months.isin([9, 10, 11])),
        (months.isin([12, 1, 2]))
    ]
    seasons = ['spring', 'summer', 'autumn', 'winter']
    for condition, season in zip(conditions, seasons):
        assert all(df.loc[condition, 'pickup_season'] == season), f"Error in season categorization for {season}."


In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar

def test_holiday_flag():
    calendar = USFederalHolidayCalendar()
    # Normalize the start so a holiday on the first day of the data is not cut off by its time of day
    holidays = calendar.holidays(start=df['tpep_pickup_datetime'].min().normalize(), end=df['tpep_pickup_datetime'].max())
    # Recompute the flag into a local Series so the check does not add a column to df
    calculated_holiday = df['tpep_pickup_datetime'].dt.normalize().isin(holidays).astype(int)
    assert (calculated_holiday == df['is_holiday']).all(), "Holiday flag mismatches detected."
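
The cells above only define the checks; nothing runs until the functions are called. Below is a minimal sketch of a runner that calls a list of zero-argument checks and reports failures without stopping at the first one. `run_checks` and the two stand-in checks are hypothetical; in the notebook the list would hold the `test_*` functions defined above.

```python
def run_checks(checks):
    """Call each zero-argument check and collect (name, passed, message) tuples."""
    results = []
    for check in checks:
        try:
            check()
            results.append((check.__name__, True, ''))
        except AssertionError as exc:
            results.append((check.__name__, False, str(exc)))
    return results

# Hypothetical stand-ins for the notebook's test_* functions
def check_passes():
    assert 1 + 1 == 2

def check_fails():
    assert 1 + 1 == 3, "arithmetic is off"

results = run_checks([check_passes, check_fails])
for name, passed, message in results:
    print(f"{name}: {'ok' if passed else 'FAILED ' + message}")
```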


# Selecting Needed Features 


### Conducting Exploratory Data Analysis (EDA) for NYC Taxi Trips

Now that you've cleaned the dataset and engineered relevant features, the next logical step is to conduct an Exploratory Data Analysis (EDA). This will help you to:

1. **Understand the Distribution and Relationships**:
   - **Visualize distributions** of individual variables using histograms, box plots, and density plots to understand their range, central values, and variance.
   - **Explore relationships** between variables using scatter plots, pair plots, and correlation matrices. This helps in understanding how different features such as trip distance, duration, time of day, and passenger count relate to fare amounts.

2. **Identify Patterns and Anomalies**:
   - Look for patterns that may indicate how time of day or day of the week affects taxi usage and fare costs.
   - Identify potential outliers or anomalies that weren't previously detected during the cleaning process.

3. **Statistical Analysis**:
   - Apply statistical tests to confirm hypotheses about the data, for example, testing if average fares are significantly different during holidays or specific seasons.
   - Use aggregation to see mean, median, and standard deviation of fares across different times or zones.

4. **Geographical Insights**:
   - Utilize geographical plotting to visualize trips by pickup and drop-off locations. This can highlight busy areas or zones with higher fares.
   - Analyze how the geographic factors unique to NYC might translate to Tbilisi, considering urban layout differences.

5. **Temporal Dynamics**:
   - Investigate how different times of the year, week, or day impact fares and trip frequencies.
   - Determine if there's a "rush hour" effect in NYC and predict potential similar effects in Tbilisi.
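
The aggregation step in the list above can be sketched with a single `groupby`. The fares below are made-up toy values, used only to show the shape of the output; in the notebook the same call would run on `df` with any of the engineered grouping columns.

```python
import pandas as pd

# Toy fares (illustrative values only)
toy = pd.DataFrame({
    'is_holiday': [0, 0, 0, 1, 1],
    'fare_amount': [10.0, 12.0, 14.0, 20.0, 30.0],
})

# One row per group with the count, mean, median, and standard deviation of fares
summary = toy.groupby('is_holiday')['fare_amount'].agg(['count', 'mean', 'median', 'std'])
print(summary)
```

A formal comparison, for example a two-sample test between holiday and non-holiday fares, would build on these groups; which test is appropriate depends on the fare distribution seen during EDA.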

### Drawing Conclusions for Model Development

After EDA, you should be able to draw preliminary conclusions about:

- **Key Drivers of Fare Prices**: Understanding which features are most predictive of fare changes can guide the feature selection for your predictive model.
- **Transferability of Insights**: Evaluate which aspects of the NYC taxi fare dynamics are likely to apply to Tbilisi. Consider socio-economic, geographical, and cultural differences that could influence model adjustments.
- **Data Quality and Further Needs**: Determine if additional data or further cleaning is required based on findings from EDA. For example, weather conditions or special events data could enhance the model.

### Moving Forward

With insights gained from EDA, you can proceed to:
- **Feature Selection**: Decide which features to include in your predictive model based on their relevance and impact on taxi fares.
- **Model Building**: Begin with simple models to establish a baseline before experimenting with more complex models like Random Forests or Gradient Boosting Machines.
- **Cross-Validation and Hyperparameter Tuning**: Implement these strategies to optimize model performance and ensure it generalizes well on unseen data.

### Conclusion

The comprehensive EDA will not only refine your understanding of NYC taxi trips but also equip you with the insights needed to tackle the fare prediction challenge in Tbilisi, adapting the approach based on localized conditions and data.

The plan below structures the EDA and feature refinement steps that prepare the dataset for modeling:

### 1. **Visual Analysis of Distribution and Relationships**
   - **Univariate Analysis**: Use histograms and box plots to analyze the distribution of continuous features like `fare_amount`, `trip_distance`, and `trip_duration`. Bar charts might be useful for categorical features like `pickup_day_type` and `passenger_count_category`.
   - **Bivariate Analysis**: Create scatter plots to examine relationships between fare and other numerical features. Use heatmaps to visualize correlations.

### 2. **Identify and Handle Outliers**
   - Review the plots to identify outliers. Decide if they are data errors or just extreme values. Remove or cap them as necessary, depending on their authenticity and impact on the model.
   - For example, trips with unusually long durations or distances that don't correspond to higher fares might be errors.
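
One common way to cap extreme values is the interquartile-range rule. The sketch below assumes the conventional multiplier of 1.5 and uses toy fares; `cap_iqr` is an illustrative helper, and the right threshold for the real data is a judgment call.

```python
import pandas as pd

def cap_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy fares with one extreme value (illustrative only)
fares = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 500.0])
capped = cap_iqr(fares)
print(capped.tolist())
```

Capping keeps the row and only limits its influence. Dropping the row is the alternative when the value is clearly a recording error.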

### 3. **Geographical Insights**
   - If geographical data is available, plot the pickup and drop-off locations using a scatter geo plot to identify high-traffic areas and fare patterns.
   - Consider whether geographical features relevant to NYC will be applicable in Tbilisi or if adjustments are needed.

### 4. **Temporal Analysis**
   - Examine how fares vary by time of day, day of the week, and season. This can uncover demand patterns which are crucial for dynamic pricing models.
   - Plot time-series of fares over the months to detect any trends or seasonal effects.

### 5. **Categorical and Dummy Variable Creation**
   - Convert categorical variables into dummy variables if needed, for modeling purposes. This includes variables like `pickup_day_type`, `pickup_time_of_day`, and `pickup_season`.
   - Make sure to drop one dummy variable to avoid multicollinearity.
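
A minimal sketch of the dummy encoding described above, using `pd.get_dummies` with `drop_first=True` so that one level per variable is left out as the reference category. The `toy` frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'pickup_day_type': ['weekday', 'weekend', 'weekday'],
    'pickup_season': ['winter', 'spring', 'summer'],
    'fare_amount': [12.5, 30.0, 8.0],
})

# drop_first=True removes the alphabetically first level of each encoded column
encoded = pd.get_dummies(
    toy,
    columns=['pickup_day_type', 'pickup_season'],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Dropping a level matters mainly for linear models. Tree-based models such as Random Forests are generally not affected by the redundant column.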

### 6. **Redundancy and Irrelevance Check**
   - Remove or combine features that are redundant or have little impact on the fare. For instance, if `pickup_day_type` (weekend vs. weekday) captures the essence of `transaction_day`, consider dropping one.
   - Evaluate the necessity of features like `store_and_fwd_flag` if they do not significantly impact fare.

### 7. **Feature Engineering Reassessment**
   - Revisit your engineered features based on insights gained during EDA. Adjust or create new features as needed.
   - Consider engineering velocity or efficiency features like `speed` (distance over duration).

### 8. **Final Dataset Preparation**
   - Ensure all data types are correct and that missing values are handled.
   - Normalize or scale features if necessary, depending on the chosen modeling techniques.
   - Split the dataset into training and testing sets to prepare for modeling.

### 9. **Data Export**
   - Save the cleaned and transformed dataset to a new file, ready for use in model training and validation in your next notebook.


## Univariate Analysis 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Setting aesthetic style for seaborn plots
sns.set(style="whitegrid")

# Histogram for trip distances
plt.figure(figsize=(10, 6))
sns.histplot(df['trip_distance'], bins=30, kde=True)
plt.title('Distribution of Trip Distances')
plt.xlabel('Trip Distance (miles)')
plt.ylabel('Frequency')
plt.show()

# Boxplot for fare amounts
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['fare_amount'])
plt.title('Box Plot of Fare Amounts')
plt.xlabel('Fare Amount ($)')
plt.show()

# Count plot for day type (weekday vs. weekend)
plt.figure(figsize=(10, 6))
sns.countplot(x='pickup_day_type', data=df)
plt.title('Trip Counts by Day Type')
plt.xlabel('Day Type')
plt.ylabel('Count')
plt.show()

# Count plot for time of day
plt.figure(figsize=(10, 6))
sns.countplot(x='pickup_time_of_day', data=df)
plt.title('Trip Counts by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Count')
plt.show()


## Demand Time Of Day

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='pickup_time_of_day', data=df, order=['morning', 'afternoon', 'evening', 'night'])
plt.title('Demand by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Number of Trips')
plt.show()


## Demand Day Of The Week

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='pickup_day_type', data=df, order=['weekday', 'weekend'])
plt.title('Demand by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Trips')
plt.show()


## Demand By Season

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='pickup_season', data=df, order=['spring', 'summer', 'autumn', 'winter'])
plt.title('Demand by Season')
plt.xlabel('Season')
plt.ylabel('Number of Trips')
plt.show()


## Hourly Demand 

In [None]:
# Creating a new column for the day of the week
df['pickup_day_of_week'] = df['tpep_pickup_datetime'].dt.day_name()

# Creating a pivot table of trip counts per weekday and pickup hour for the heatmap
# (transaction_hour is used so that the plot is actually hourly, as the title states)
pivot = df.pivot_table(index='pickup_day_of_week', columns='transaction_hour', values='fare_amount', aggfunc='count')

# Ordering the days correctly for the heatmap
ordered_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot = pivot.reindex(ordered_days)

plt.figure(figsize=(14, 8))
# Annotations are omitted because 24 x 7 cell labels would be unreadable
sns.heatmap(pivot, cmap="YlGnBu", linewidths=.5)
plt.title('Hourly Demand Throughout the Week')
plt.xlabel('Hour of Day')
plt.ylabel('Day of the Week')
plt.show()


## Bivariate Analysis

In [None]:
# Scatter plot for trip distance vs fare amount
plt.figure(figsize=(10, 6))
sns.scatterplot(x='trip_distance', y='fare_amount', data=df)
plt.title('Trip Distance vs. Fare Amount')
plt.xlabel('Trip Distance (miles)')
plt.ylabel('Fare Amount ($)')
plt.show()

# Boxplot of fare amount by day type (weekday vs. weekend)
plt.figure(figsize=(10, 6))
sns.boxplot(x='pickup_day_type', y='fare_amount', data=df)
plt.title('Fare Amount by Day Type')
plt.xlabel('Day Type')
plt.ylabel('Fare Amount ($)')
plt.show()

# Boxplot of fare amount by time of day
plt.figure(figsize=(10, 6))
sns.boxplot(x='pickup_time_of_day', y='fare_amount', data=df)
plt.title('Fare Amount by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Fare Amount ($)')
plt.show()

# Boxplot of fare amount by passenger count category
plt.figure(figsize=(10, 6))
sns.boxplot(x='passenger_count_category', y='fare_amount', data=df)
plt.title('Fare Amount by Passenger Count Category')
plt.xlabel('Passenger Count Category')
plt.ylabel('Fare Amount ($)')
plt.show()


In [None]:
# Check if 'pickup_season' is correctly categorized
season_counts = df['pickup_season'].value_counts()
print(season_counts)

# Check if 'pickup_time_of_day' is correctly categorized
time_of_day_counts = df['pickup_time_of_day'].value_counts()
print(time_of_day_counts)

# Check if 'pickup_day_type' is correctly categorized for weekdays and weekends
day_type_counts = df['pickup_day_type'].value_counts()
print(day_type_counts)

# To check all together
summary = df.groupby(['pickup_day_type', 'pickup_time_of_day', 'pickup_season'])['fare_amount'].count()
print(summary)


# GeoSpatial EDA

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

# Load your zones data as a GeoDataFrame
gdf_zones = gpd.read_file('path_to_your_zones_shapefile.shp')  # Adjust the path to your shapefile

# If your zones data is in CSV and contains WKT or coordinates, convert it to GeoDataFrame
# gdf_zones = gpd.GeoDataFrame(zones, geometry=gpd.points_from_xy(zones.longitude, zones.latitude))

# Calculate pickup and dropoff counts per zone
pickup_counts = df['PULocationID'].value_counts().rename_axis('LocationID').reset_index(name='pickup_count')
dropoff_counts = df['DOLocationID'].value_counts().rename_axis('LocationID').reset_index(name='dropoff_count')

# Merge these counts with the zones GeoDataFrame
gdf_zones = gdf_zones.merge(pickup_counts, on='LocationID', how='left')
gdf_zones = gdf_zones.merge(dropoff_counts, on='LocationID', how='left')


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 10))

# Plot pickup counts
gdf_zones.plot(column='pickup_count', ax=ax[0], legend=True,
               legend_kwds={'label': "Pickup Count by Zone", 'orientation': "horizontal"})
ax[0].set_title('Taxi Pickup Heatmap')
ax[0].set_axis_off()

# Plot dropoff counts
gdf_zones.plot(column='dropoff_count', ax=ax[1], legend=True,
               legend_kwds={'label': "Dropoff Count by Zone", 'orientation': "horizontal"})
ax[1].set_title('Taxi Dropoff Heatmap')
ax[1].set_axis_off()

plt.show()


In [None]:
import folium

# Create a map centered around an average location in NYC
m = folium.Map(location=[40.730610, -73.935242], zoom_start=11)

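# Create a Choropleth map for pickups
# folium expects WGS84 (EPSG:4326) coordinates, so reproject the zones first;
# this is a no-op if the shapefile is already in that CRS
zones_wgs84 = gdf_zones.to_crs(epsg=4326)
folium.Choropleth(
    geo_data=zones_wgs84,
    name='choropleth',
    data=zones_wgs84,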
    columns=['LocationID', 'pickup_count'],
    key_on='feature.properties.LocationID',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Taxi Pickup Volume'
).add_to(m)

m.save('NYC_taxi_pickup_map.html')


In [None]:
# Group by drop-off location and calculate the average fare
destination_pricing = df.groupby('DOLocationID')['fare_amount'].mean().sort_values(ascending=False)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 7))
# If the destination IDs are too many, you can limit the output for better visualization
top_destinations = destination_pricing.head(20)
sns.barplot(x=top_destinations.index, y=top_destinations.values, palette="coolwarm")
plt.title('Average Taxi Fare by Destination')
plt.xlabel('Destination Location ID')
plt.ylabel('Average Fare ($)')
plt.xticks(rotation=45)  # Rotate labels for better readability if necessary

plt.show()


In [None]:
# The trip data has no longitude/latitude columns, so the average fares are joined
# onto the zone polygons (gdf_zones, loaded above) by LocationID instead.
avg_fare = destination_pricing.rename('avg_fare').rename_axis('LocationID').reset_index()
gdf_fares = gdf_zones.merge(avg_fare, on='LocationID', how='left')

# Plotting
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
gdf_fares.plot(column='avg_fare', ax=ax, legend=True,
               legend_kwds={'label': "Average Fare ($)",
                            'orientation': "horizontal"})
ax.set_title('Heatmap of Average Taxi Fares by Destination')
ax.set_axis_off()
plt.show()


# Correlation analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
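# numeric_only=True skips the categorical and datetime columns, which would otherwise raise in pandas >= 2.0
corr = df.corr(numeric_only=True)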

# Generate a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


## Redundancy and Multicollinearity Check
Identify features that are highly correlated with each other. High multicollinearity is a problem in regression models because it makes coefficient estimates unstable: small changes in the data or in the set of included features can shift them substantially.

- **Drop redundant features**: If two features are highly correlated, consider dropping one.
- **Principal Component Analysis (PCA)**: Useful for dimensionality reduction, especially with a very high number of features.
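
One way to surface highly correlated pairs is to scan the upper triangle of the absolute correlation matrix. The sketch below assumes a threshold of 0.9, which is an arbitrary starting point. `correlated_pairs` is an illustrative helper and the `toy` frame uses made-up values.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    return stacked[stacked > threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    'trip_duration': [10.0, 20.0, 30.0, 40.0],                # minutes
    'trip_duration_hours': [10 / 60, 20 / 60, 30 / 60, 40 / 60],
    'passenger_count': [1, 3, 1, 2],
})

pairs = correlated_pairs(toy)
print(pairs)
```

In this notebook, `trip_duration` and `trip_duration_hours` are the same quantity in different units, so a check like this would be expected to flag them as a pair where one column can be dropped.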

# Feature Importance
Utilize machine learning algorithms like Random Forest or Gradient Boosting to identify feature importance. This will provide insights into which features have the most impact on your target variable.

In [None]:
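from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumed feature set: numeric columns created earlier in this notebook; adjust as needed
feature_cols = ['trip_distance', 'trip_duration', 'passenger_count',
                'transaction_hour', 'transaction_month', 'is_holiday']

# Fit on a random sample to keep training time manageable on the full dataset
sample = df.sample(n=min(len(df), 200_000), random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    sample[feature_cols], sample['fare_amount'], test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
feature_importances = pd.DataFrame(model.feature_importances_,
                                   index = X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)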


## Feature Engineering Revisit
Based on the insights from the correlation and importance analysis:

- **Create new features**: Interactions between features (e.g., multiplying or dividing two features) sometimes correlate better with the target than either feature alone.
- **Grouping**: Grouping sparse categorical features based on similarity or impact on the target variable can sometimes improve model performance.
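
As a sketch of the grouping idea, the helper below collapses categories that occur fewer than `min_count` times into a single `'Other'` level. `group_rare`, the threshold, and the toy zone names are illustrative assumptions.

```python
import pandas as pd

def group_rare(values: pd.Series, min_count: int, other: str = 'Other') -> pd.Series:
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values for which the condition holds and substitutes `other` elsewhere
    return values.where(~values.isin(rare), other)

zones_toy = pd.Series(['Midtown', 'Midtown', 'Midtown', 'JFK Airport', 'JFK Airport', 'Corona'])
grouped = group_rare(zones_toy, min_count=2)
print(grouped.tolist())
```

Applied to columns such as `PUzone` or `DOzone`, this would reduce the number of dummy columns created for rarely used zones, at the cost of losing their individual effects.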


# Feature Selection and Model Dataset

# Normalization & Standardization


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['fare_amount', 'trip_distance', 'trip_duration']])  # include necessary features

# Normalization
scaler = MinMaxScaler()
df_normalized = scaler.fit_transform(df[['fare_amount', 'trip_distance', 'trip_duration']])
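
The cell above fits the scalers on the full dataset. When the data is later split for modeling, a common practice is to fit the scaler on the training rows only and reuse it for the test rows, so that test-set statistics do not leak into training. Here is a minimal sketch on synthetic data (the random features below merely stand in for distance and duration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-column feature matrix (illustrative stand-in for distance and duration)
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 20.0], scale=[2.0, 8.0], size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)     # statistics come from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and standard deviation

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The target `fare_amount` is usually left unscaled, or scaled separately, so that predictions stay in dollars; tree-based models typically do not need feature scaling at all.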
