# Rideshare Price Prediction

### Team:
Marko Masnikosa: mmasniko@syr.edu 
- GitHub: https://github.com/data11y - POC <br>

Dawryn Rosario: darosari@syr.edu
- GitHub: https://github.com/darosari

Rianne Parker: riparker@syr.edu
- GitHub: https://github.com/DatawithParker

## Overview
We aim to predict hourly pricing for Lyft and Uber trips in New York City. Our approach combines trip records from the NYC Taxi and Limousine Commission (TLC) with weather data and MTA subway data, which represents an alternative travel option. By considering multiple modes of transport, we hope to provide a model that helps users decide which mode of travel is more efficient at a given time.

## Data

### TLC Data  
The NYC Taxi and Limousine Commission (TLC) publishes trip-level data for each full year. Data is available for Yellow Cabs, Green Cabs (more efficient), For-Hire Vehicles, and High-Volume For-Hire Vehicles. We focused on the High-Volume data because it includes Lyft and Uber trips as well as smaller rideshare platforms. The data is split by year, month, and vehicle type and is distributed as Parquet files. It is organized around taxi zones, which are explained below. [TLC data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Data Preprocessing:
* **Data Collection**: Data was collected for 2020 and 2024. The files are large (200MB+ each), so filtering and aggregation were applied.
* **Data Filtering**: The High-Volume data includes several rideshare platforms. We filtered it down to Uber and Lyft, which make up the bulk of the data. Several features were dropped for low variance or a high share of missing values. Trips where the components of the trip cost summed to less than the driver pay were also dropped; these rows indicate that the driver was paid more than the rider was charged, which is not normally the case.
* **Data Aggregation**: For just the two apps, the raw data exceeds 200 million rows per year. This was too large to handle easily, so it was aggregated to hourly data per app. Categorical features such as pickup and drop-off location were aggregated to the most frequent value in each hour. Numerical features such as trip distance were aggregated to the hourly mean and sum.
* **Taxi Zones**: The data is organized around taxi zones, which are defined by the TLC. Here is a map of the taxi zones in NYC: <img src='pictures/nyc_taxi_zones_satellite_overlay.png' width = "500"/>
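
The filtering and aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`hvfhs_license_num`, `pickup_datetime`, `PULocationID`, `DOLocationID`, `trip_miles`, `driver_pay`, and the fare components), the license codes for Uber and Lyft, and the choice of cost components are assumptions that should be checked against the TLC data dictionary.

```python
import pandas as pd

# Assumed TLC license codes for the two apps (verify in the TLC data dictionary).
APP_NAMES = {"HV0003": "Uber", "HV0005": "Lyft"}
# Assumed subset of fare components that make up what the rider was charged.
COST_COLS = ["base_passenger_fare", "tolls", "sales_tax", "congestion_surcharge"]


def most_frequent(values: pd.Series):
    """Most frequent value in a group (ties resolve to the smallest value)."""
    return values.mode().iloc[0]


def aggregate_hourly(trips: pd.DataFrame) -> pd.DataFrame:
    """Filter to Uber/Lyft, drop implausible rows, and aggregate per hour and app."""
    trips = trips[trips["hvfhs_license_num"].isin(list(APP_NAMES))].copy()
    trips["app"] = trips["hvfhs_license_num"].map(APP_NAMES)
    trips["rider_cost"] = trips[COST_COLS].sum(axis=1)
    trips["pickup_hour"] = trips["pickup_datetime"].dt.floor("h")
    # Drop trips where the driver was paid more than the rider was charged.
    trips = trips[trips["rider_cost"] >= trips["driver_pay"]]
    return (
        trips.groupby(["pickup_hour", "app"])
        .agg(
            trip_count=("rider_cost", "size"),
            top_pickup_zone=("PULocationID", most_frequent),
            top_dropoff_zone=("DOLocationID", most_frequent),
            trip_miles_mean=("trip_miles", "mean"),
            trip_miles_sum=("trip_miles", "sum"),
            rider_cost_mean=("rider_cost", "mean"),
        )
        .reset_index()
    )


# Tiny made-up sample, for illustration only.
demo = pd.DataFrame({
    "hvfhs_license_num": ["HV0003", "HV0003", "HV0005", "HV0004", "HV0005"],
    "pickup_datetime": pd.to_datetime([
        "2024-01-01 08:10", "2024-01-01 08:40", "2024-01-01 08:20",
        "2024-01-01 08:30", "2024-01-01 09:05",
    ]),
    "PULocationID": [1, 1, 5, 7, 9],
    "DOLocationID": [2, 2, 5, 7, 9],
    "trip_miles": [2.0, 4.0, 1.0, 3.0, 5.0],
    "base_passenger_fare": [10.0, 20.0, 5.0, 12.0, 25.0],
    "tolls": [0.0, 0.0, 0.0, 0.0, 0.0],
    "sales_tax": [1.0, 2.0, 0.5, 1.0, 2.0],
    "congestion_surcharge": [0.0, 0.0, 0.0, 0.0, 0.0],
    "driver_pay": [8.0, 15.0, 9.0, 9.0, 18.0],
})
hourly = aggregate_hourly(demo)
print(hourly)
```

For the full yearly files, the same logic would likely need to run month by month (one Parquet file at a time) with the hourly results concatenated afterwards, since a single year does not fit comfortably in memory.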


We examined many questions during data exploration. Some interesting findings include the following.
* Connections: Pickups are spread across a wide range of taxi zones, while drop-offs are concentrated in specific zones or head out of NYC. More granular exploration (not shown in this image) showed that some taxi zones were served much more by one app than the other. <img src='pictures/nyc_rideshare_pickup_and_dropoffs_2024.png' width = "500" />

* App Dominance: Uber is used significantly more than Lyft in NYC. It would be interesting to compare the driver payout structures of the two apps to see why. <img src='pictures/nyc_rideshare_moving_avg_trip_volumes_2024.png' width = "500" />

* Zone Connections: Where are people who are picked up in one zone getting dropped off? It turns out they don't typically leave their taxi zones. This excludes airport pickups and drop-offs. <img src='pictures/uber_lyft_connections_top_5_2024.png' width = "500" />

* Tips: Lyft riders are more generous than Uber riders when it comes to tipping. <img src='pictures/rider_generosity.png' width = "500" />
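
As a rough illustration of the zone-connection question above, the most common pickup-to-drop-off pairs can be counted with a simple group-by. This is only a sketch: the column names `PULocationID` and `DOLocationID` and the airport zone IDs are assumptions to verify against the TLC taxi zone lookup table, and the sample rows are made up.

```python
import pandas as pd

# Assumed taxi zone IDs for Newark, JFK, and LaGuardia airports.
AIRPORT_ZONES = {1, 132, 138}


def top_connections(trips: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n most frequent pickup/drop-off zone pairs, excluding airports."""
    non_airport = (
        ~trips["PULocationID"].isin(AIRPORT_ZONES)
        & ~trips["DOLocationID"].isin(AIRPORT_ZONES)
    )
    pairs = (
        trips[non_airport]
        .groupby(["PULocationID", "DOLocationID"])
        .size()
        .rename("trip_count")
        .reset_index()
        .sort_values("trip_count", ascending=False, kind="stable")
        .head(n)
        .reset_index(drop=True)
    )
    # Flag pairs where the rider never left the pickup zone.
    pairs["same_zone"] = pairs["PULocationID"] == pairs["DOLocationID"]
    return pairs


# Made-up sample trips, for illustration only.
demo_trips = pd.DataFrame({
    "PULocationID": [76, 76, 76, 76, 132, 132],
    "DOLocationID": [76, 76, 76, 77, 76, 76],
})
top = top_connections(demo_trips, n=1)
print(top)
```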

### MTA Delays Data EDA
#### 1. Data Loading & Initial Exploration

In this section, I load the raw MTA Delays dataset and perform an initial inspection of the structure, column names, and datatypes. The goal is to understand what information is available, identify any immediate issues (e.g., null values or formatting problems), and prepare for further preprocessing.

Some questions that can be answered:
- What are the key columns in this dataset?
- What is the size of the dataset?
- Are there any obvious missing or corrupted values?

```python
import pandas as pd

# Source for the MTA delays data:
# https://data.ny.gov/Transportation/MTA-Subway-Trains-Delayed-Beginning-2020/wx2t-qtaz/about_data
# Path to the local copy of the MTA delays file
delays_df = pd.read_csv("/workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/original_MTA_Subway_Trains_Delayed__Beginning_2020_20250303.csv")
delays_df.head()  # show the first few rows
```



#### 2. Exploratory Data Analysis (EDA) – MTA Delays

This section explores trends and distributions in the subway delay data to better understand temporal patterns, types of delays, and how delays vary across subway lines.

I aim to answer:
- What are the most common types of delays?
- Are certain subway lines more frequently delayed?
- Do delays occur more often at certain times or days?
- Are delays increasing or decreasing over time?

```python
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # available by default inside Jupyter

# Load dataset
delays_df = pd.read_csv("MTA_Subway_Trains_Delayed__Beginning_2020_20250331.csv")

# Basic structure
print("🧾 Dataset Overview:")
print("-" * 50)
print(delays_df.info())
print("\n🔍 Null Values:")
print(delays_df.isnull().sum())

# Preview the data
print("\n📄 First 5 Rows:")
display(delays_df.head())

# Top delay causes
print("\n📌 Top Reporting Categories:")
print(delays_df['reporting_category'].value_counts().head(10))

print("\n📌 Top Specific Subcategories:")
print(delays_df['subcategory'].value_counts().head(10))

# Most affected lines
print("\n🚇 Most Affected Subway Lines:")
print(delays_df['line'].value_counts().head(10))

# Convert date column
delays_df['month'] = pd.to_datetime(delays_df['month'], errors='coerce')

# Extract temporal features
delays_df['Year'] = delays_df['month'].dt.year
delays_df['Month'] = delays_df['month'].dt.month
delays_df['Weekday'] = delays_df['month'].dt.day_name()

# Plot delay count per year
plt.figure(figsize=(8, 4))
delays_df['Year'].value_counts().sort_index().plot(kind='bar')
plt.title("🗓️ Delay Reports Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Delay Reports")
plt.tight_layout()
plt.show()
```




 #### EDA Summary – MTA Subway Delays

- The dataset contains **40,503** entries and **7 columns**, covering subway delays across multiple lines and divisions.
- The most frequent **reporting categories** are:
  - Infrastructure & Equipment
  - Crew Availability
  - External Factors
- Common specific causes include door-related issues, braking, and debris on tracks.
- Only the `subcategory` column has missing values (~5.5% of records).
- Delay frequency is reported by month and has been converted to datetime format.
- Additional features (`Year`, `Month`, `Weekday`) were extracted to support temporal analysis.
- A bar chart of delay reports per year shows how delays vary across years, providing insight into longer-term trends.

#### 3. Data Cleaning & Feature Engineering – MTA Delays

This section handles missing values, standardizes categorical text data, and prepares the dataset for downstream modeling. I focus on ensuring consistency in categorical fields and creating useful features from raw columns.

Key steps:
- Fill or tag missing values in `subcategory`
- Normalize text fields to lowercase for consistency
- Ensure all datetime fields are usable
- Prepare for joins with other datasets (e.g., weather, ridership)

```python
# Fill missing values in subcategory
delays_df['subcategory'] = delays_df['subcategory'].fillna('Unknown')

# Standardize string columns (lowercase, strip whitespace)
for col in ['division', 'line', 'reporting_category', 'subcategory']:
    delays_df[col] = delays_df[col].str.lower().str.strip()

# Optional: create a 'day_type_label' if needed
# (assumed coding of day_type; confirm against the MTA data dictionary)
day_type_map = {
    1: 'Weekday',
    2: 'Saturday',
    3: 'Sunday/Holiday'
}
delays_df['day_type_label'] = delays_df['day_type'].map(day_type_map)

# Check result
print("🔍 Cleaned Columns Preview:")
display(delays_df[['month', 'division', 'line', 'reporting_category', 'subcategory', 'day_type_label', 'delays']].head())
```

🔍 Cleaned Columns Preview:

| | month | division | line | reporting_category | subcategory | day_type_label | delays |
|---|---|---|---|---|---|---|---|
| 0 | 2024-12-01 | a division | 1 | crew availability | crew availability | Weekday | 83 |
| 1 | 2024-12-01 | a division | 1 | external factors | external debris on roadbed | Weekday | 4 |
| 2 | 2024-12-01 | a division | 1 | infrastructure & equipment | braking | Weekday | 37 |
| 3 | 2024-12-01 | a division | 1 | infrastructure & equipment | door-related | Weekday | 34 |
| 4 | 2024-12-01 | a division | 1 | infrastructure & equipment | fire, smoke, debris | Weekday | 37 |


```python
# Show cleaned DataFrame structure
print("🧾 Final DataFrame Structure:")
print(delays_df.info())

# Check unique values in key categorical columns
print("\n📌 Unique Reporting Categories:")
print(delays_df['reporting_category'].value_counts())

print("\n📌 Unique Subcategories (Top 10):")
print(delays_df['subcategory'].value_counts().head(10))

print("\n🚇 Unique Subway Lines (Top 10):")
print(delays_df['line'].value_counts().head(10))

print("\n📆 Date Range:")
print(f"From {delays_df['month'].min().date()} to {delays_df['month'].max().date()}")

# Check if datetime features exist and look good
print("\n🧪 Sample of Temporal Features:")
display(delays_df[['month', 'Year', 'Month', 'Weekday']].sample(5))

# Check if day_type_label mapping worked
print("\n📅 Day Type Mapping Preview:")
print(delays_df['day_type_label'].value_counts())
```


```text
🧾 Final DataFrame Structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   month               40503 non-null  datetime64[ns]
 1   division            40503 non-null  object        
 2   line                40503 non-null  object        
 3   day_type            40503 non-null  int64         
 4   reporting_category  40503 non-null  object        
 5   subcategory         40503 non-null  object        
 6   delays              40503 non-null  int64         
 7   Year                40503 non-null  int32         
 8   Month               40503 non-null  int32         
 9   Weekday             40503 non-null  object        
 10  day_type_label      40503 non-null  object        
dtypes: datetime64[ns](1), int32(2), int64(2), object(6)
memory usage: 3.1+ MB
None

📌 Unique Reporting Categories:
reporting_category
```

| | month | Year | Month | Weekday |
|---|---|---|---|---|
| 9326 | 2023-11-01 | 2023 | 11 | Wednesday |
| 4609 | 2024-06-01 | 2024 | 6 | Saturday |
| 34765 | 2020-10-01 | 2020 | 10 | Thursday |
| 39073 | 2020-03-01 | 2020 | 3 | Sunday |
| 10809 | 2023-09-01 | 2023 | 9 | Friday |



```text
📅 Day Type Mapping Preview:
day_type_label
Weekday     24076
Saturday    16427
Name: count, dtype: int64
```


### Weather Data  

 Weather data in this project is used to identify how conditions affect rideshare pricing in NYC. It helps capture demand spikes and travel delays caused by adverse weather. This allows for more accurate fare predictions and better planning for both riders and service providers.

* **Source**: Visual Crossing 
* **Website**: https://www.visualcrossing.com/
* **Description**: Visual Crossing is a leading provider of weather data and enterprise analysis tools to data scientists, business analysts, professionals, and academics. Visual Crossing aims to provide accurate weather data and forecasts by combining data from various sources, including ground-based weather stations, satellites, and radar, and using statistical climate modeling.

<img src="/workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/PoweredByVC-WeatherLogo-RoundedRect.png" alt="Powered by Visual Crossing Weather logo" width="250" height="75">

 
#### Data Processing

* **Data Collection**: NYC weather data was collected for 2020-01-01 through 2024-12-31. The file is 2MB. It was pulled via a query from Visual Crossing.
* **Data Exploration**: The initial exploration focused on understanding the dataset structure, inspecting data types, and examining the distribution of weather variables such as temperature, precipitation, and windspeed. Special attention was given to the datetime column to ensure consistent hourly intervals throughout the time series.
* **Data Filtering and Cleaning**: Non-essential columns were removed to focus the analysis on key weather-related variables. The dataset was filtered to retain only hourly observations, and duplicate or invalid entries were excluded to maintain data quality.
    + **Columns removed**: name, dew, humidity, precipprob, snowdepth, windgust, winddir, sealevelpressure, solarradiation, solarenergy, severerisk, icon, stations, preciptype.
    + **Remaining Columns**: datetime, temp, feelslike, precip, snow, windspeed, cloudcover, visibility, uvindex, conditions
    + **DateTime Check**: Missing hourly records—primarily caused by daylight saving time transitions—were detected by comparing the dataset's timestamps against a complete hourly range. These missing records were then filled by averaging the values from the hour before and after, ensuring continuity in the time series.
        - Timestamps added to Dataframe: "2020-03-08 02:00:00", "2021-03-14 02:00:00", "2022-03-13 02:00:00", "2023-03-12 02:00:00","2024-03-10 02:00:00"
    + **'Conditions' Column Value Encoding**: The column is a categorical representation of combined weather conditions, encoded as numerical values to simplify analysis and modeling. Below is the mapping used:

        - **0** — Overcast  
        - **1** — Partially cloudy  
        - **2** — Clear  
        - **3** — Rain, Overcast  
        - **4** — Rain, Partially cloudy  
        - **5** — Snow, Rain, Partially cloudy  
        - **6** — Snow, Rain, Overcast  
        - **7** — Snow, Overcast  
        - **8** — Snow, Partially cloudy  
        - **9** — Rain  
        - **10** — Snow  
        - **11** — Snow, Rain  
    + **Dataframe shape**: 43,848 rows x 10 columns
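
The datetime check and the `conditions` encoding described above could look roughly like the sketch below. It assumes a DataFrame with a `datetime` column and numeric weather columns; the function names `fill_missing_hours` and `encode_conditions` and the sample values are hypothetical. Linear interpolation is used here because, for a single missing hour, it is equivalent to averaging the hour before and the hour after.

```python
import pandas as pd

# Mapping taken from the list above.
CONDITION_CODES = {
    "Overcast": 0,
    "Partially cloudy": 1,
    "Clear": 2,
    "Rain, Overcast": 3,
    "Rain, Partially cloudy": 4,
    "Snow, Rain, Partially cloudy": 5,
    "Snow, Rain, Overcast": 6,
    "Snow, Overcast": 7,
    "Snow, Partially cloudy": 8,
    "Rain": 9,
    "Snow": 10,
    "Snow, Rain": 11,
}


def fill_missing_hours(weather: pd.DataFrame) -> pd.DataFrame:
    """Reindex to a complete hourly range and interpolate numeric gaps."""
    weather = weather.set_index("datetime").sort_index()
    full_range = pd.date_range(weather.index.min(), weather.index.max(), freq="h")
    missing = full_range.difference(weather.index)
    print(f"Missing hourly timestamps: {list(missing)}")
    filled = weather.reindex(full_range)
    numeric_cols = filled.select_dtypes("number").columns
    # For a one-hour gap this equals the mean of the neighboring hours.
    filled[numeric_cols] = filled[numeric_cols].interpolate(method="linear")
    return filled.rename_axis("datetime").reset_index()


def encode_conditions(conditions: pd.Series) -> pd.Series:
    """Map condition labels to the integer codes listed above."""
    return conditions.map(CONDITION_CODES)


# Made-up sample with the 02:00 hour missing, for illustration only.
demo_weather = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-03-10 00:00", "2024-03-10 01:00", "2024-03-10 03:00",
    ]),
    "temp": [10.0, 12.0, 16.0],
    "windspeed": [5.0, 7.0, 9.0],
})
filled_weather = fill_missing_hours(demo_weather)
codes = encode_conditions(pd.Series(["Clear", "Rain, Overcast"]))
print(filled_weather)
```

Non-numeric columns such as `conditions` are left empty for the inserted rows in this sketch and would need a separate rule (for example, carrying the previous hour's value forward).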

#### NYC Weather Visual (2020-2024)

Temperatures steadily rise from January to July, peaking in the summer months before gradually declining through December. While the overall pattern is consistent year-to-year, slight variations appear — for example, 2023 had a warmer early spring compared to other years. 

<img src="/workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/Weather plot.png" alt="NYC weather plot, 2020-2024" width="800" height="480">



### MTA Ridership Data
MTA ridership data is used to analyze transit trends and understand how public transportation usage changed over time, especially during and after the COVID-19 pandemic. It provides insight into recovery patterns, demand for various transportation modes, and infrastructure usage across NYC. This information is crucial for planning service levels, evaluating operational efficiency, and informing transportation policy decisions.

* **Source**: New York State Open Data (data.ny.gov), MTA Daily Ridership

* **Website**: https://data.ny.gov/Transportation/MTA-Daily-Ridership-Data-2020-2025/vxuj-8kew/about_data

* **Description**: This dataset contains daily estimated ridership counts across multiple modes of MTA transportation in NYC. It includes subways, buses, Long Island Rail Road (LIRR), Metro-North, Access-A-Ride, bridges and tunnels, and the Staten Island Railway. The dataset was made available to support transparency and inform stakeholders about mobility trends in NYC during and following the pandemic.

#### Data Processing

* **Data Collection**: MTA ridership data was collected for 2020-03-01 through 2025-01-09. The raw file was downloaded as a CSV from the New York State Open Data portal (data.ny.gov). It includes roughly five years of daily ridership estimates across several transportation systems.

* **Data Exploration**: The initial exploration involved reviewing column names, inspecting data types, and identifying the presence of missing or duplicate date records. Columns were checked for consistency and numerical values were verified for each ridership metric.

* **Data Filtering and Cleaning**: All columns except the Date column were converted to float64 to ensure numerical consistency for analysis. The Date column was converted to a proper datetime format for easy resampling and time-based indexing. Duplicate date entries were removed, and missing dates within the 2020–2024 range were identified by comparing against a complete date range. Any missing dates were added with null values for interpolation or handling in further analysis.

* **NA Handling**: Potential missing values were inspected and none were identified. 

* **Date Check**: Full coverage was confirmed for the date range 2020-03-01 to 2025-01-09. The complete range includes 1,776 days.

* **Dataframe shape after cleaning**: 1,776 rows x 15 columns
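
The cleaning and date-coverage check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Date` column name comes from the text, while the function name `check_date_coverage`, the `subway_ridership` column, and the sample values are hypothetical.

```python
import pandas as pd


def check_date_coverage(ridership: pd.DataFrame, start: str, end: str):
    """Clean the daily data and report dates missing from the expected range."""
    ridership = ridership.copy()
    ridership["Date"] = pd.to_datetime(ridership["Date"])
    ridership = (
        ridership.drop_duplicates(subset="Date")
        .set_index("Date")
        .sort_index()
        .astype("float64")  # all ridership metrics as floats
    )
    full_range = pd.date_range(start, end, freq="D")
    missing = full_range.difference(ridership.index)
    # Missing dates are added as rows of nulls for later handling.
    complete = ridership.reindex(full_range).rename_axis("Date")
    return complete, missing


# Number of days in the range reported above (inclusive).
expected_days = len(pd.date_range("2020-03-01", "2025-01-09", freq="D"))
print(expected_days)

# Made-up sample with one duplicate and one missing date, for illustration only.
demo_ridership = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02", "2020-03-02", "2020-03-04"],
    "subway_ridership": ["100", "200", "200", "400"],
})
complete, missing_dates = check_date_coverage(demo_ridership, "2020-03-01", "2020-03-04")
print(list(missing_dates))
```

When `missing_dates` is empty, as reported for the real dataset, the reindexed frame has one row per day and can be resampled or joined on the date index directly.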

#### MTA Ridership - Monthly (2020-2024)

Subway ridership dropped sharply in early 2020 due to the pandemic but steadily recovered, peaking by 2024. Bus ridership also declined early but stabilized more quickly, while services like Access-A-Ride and Staten Island Railway maintained relatively low, flat usage. Bridges and Tunnels traffic steadily increased, suggesting more reliance on personal vehicles post-pandemic.

<img src="/workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/MTA Daily Ridership.png" alt="MTA ridership by mode, 2020-2024" width="800" height="480">

#### MTA Ridership - Weekend vs Weekday (2020-2024)

Ridership dropped sharply in early 2020 due to the COVID-19 pandemic but steadily recovered over time. Weekday ridership consistently remained higher than weekend levels, reflecting commuter travel patterns. Both lines show gradual growth with some seasonal dips, indicating partial normalization of public transit usage by 2024.

<img src="/workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/MTA Ridership - Weekend v. Weekday.png" alt="MTA ridership, weekend vs. weekday, 2020-2024" width="800" height="480">


