# Rideshare Price Prediction (Checkpoint 2)

##### *Due Date: 4/3/2025*

### Team:
Marko Masnikosa: mmasniko@syr.edu 
- GitHub: https://github.com/data11y - POC <br>

Dawryn Rosario: darosari@syr.edu
- GitHub: https://github.com/darosari

Rianne Parker: riparker@syr.edu
- GitHub: https://github.com/DatawithParker

## Overview
We aim to predict hourly pricing for Lyft and Uber trips in New York City. Our approach draws on trip data from the New York City Taxi and Limousine Commission (TLC), weather data, and MTA subway trip data for alternative travel options. By considering multiple modes of transport, we hope to provide a model that can tell users which mode of travel would be more efficient at a given time.

## Data

### TLC Data  
Taxi and Limousine Commission of NYC data includes trip level data for the entire year. Data is available for Yellow Cabs, Green Cabs (more efficient), For Hire Vehicles, and High-Volume For Hire Vehicles. We focused on the High Volume data as this includes Lyft and Uber trips. Data is broken up by year, vehicle type, and month and is available as parquet files. The data is centered around taxi zones which will be explained later. 
* **Data Source**: NYC Taxi and Limousine Commission - TLC Trip Record Data
* **Website**: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
#### Data Preprocessing:
* **Data Collection**: Data was collected for the years 2020 and 2024. The files are large (200MB+ each), so filtering and aggregation were applied.
* **Data Filtering and Cleaning**: The High Volume data included several rideshare app platforms. This was filtered down to just Uber and Lyft, which made up the bulk of the data. Several features were dropped for having low variance or a high share of missing values. Trips where the fare components totaled less than the driver pay were dropped; these rows indicate that the driver was paid more than the rider was charged, which is not normally the case.
* **Missingness Handling**: As the data was provided at a per trip level of granularity, it was aggregated per hour for the mean values of the numeric features or mode for the categorical features. This aggregation helped correct for any missing values.
* **Data Aggregation**: Data was aggregated to an hourly period from a trip level. Aggregations were calculated independently between apps and combined together at the end. The categorical features such as pickup location and drop off location were aggregated to the most frequent of that hour. Numerical features such as trip distance were aggregated to the mean and sum values of that hour.
* **Taxi Zones**: The data is centered around taxi zones which are zones created by the TLC. Here is a map of the taxi zones in NYC: <br> <img src='pictures/nyc_taxi_zones_satellite_overlay.png' width = "500"/>
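
The filtering and hourly aggregation steps above can be sketched roughly as follows. This is a minimal illustration on invented rows, not our actual pipeline. The column names and the license codes (`HV0003` for Uber, `HV0005` for Lyft) follow the public High-Volume FHV schema as we understand it. Which columns count as "fare components" is an assumption here (base fare plus tolls only), and `aggregate_hourly` is a hypothetical helper name.

```python
import pandas as pd

# Toy trip-level rows; values are invented for illustration.
trips = pd.DataFrame({
    "hvfhs_license_num": ["HV0003", "HV0003", "HV0005", "HV0003", "HV0004"],
    "pickup_datetime": pd.to_datetime([
        "2024-01-01 08:05", "2024-01-01 08:40", "2024-01-01 08:15",
        "2024-01-01 09:10", "2024-01-01 08:30",
    ]),
    "PULocationID": [132, 132, 79, 161, 79],
    "trip_miles": [2.0, 4.0, 3.0, 1.5, 2.5],
    "base_passenger_fare": [12.0, 20.0, 15.0, 5.0, 11.0],
    "tolls": [0.0, 0.0, 0.0, 0.0, 0.0],
    "driver_pay": [9.0, 15.0, 11.0, 8.0, 8.0],
})

APP_NAMES = {"HV0003": "uber", "HV0005": "lyft"}


def aggregate_hourly(df):
    """Filter to Uber/Lyft, drop implausible fares, aggregate per app and hour."""
    df = df[df["hvfhs_license_num"].isin(list(APP_NAMES))].copy()
    df["service"] = df["hvfhs_license_num"].map(APP_NAMES)
    # Assumption: "fare components" means base fare plus tolls in this sketch.
    fare = df["base_passenger_fare"] + df["tolls"]
    df = df[fare >= df["driver_pay"]].copy()
    # Attribute each trip to the start of its pickup hour.
    df["pickup_hour"] = df["pickup_datetime"].dt.floor("h")
    return (
        df.groupby(["service", "pickup_hour"])
        .agg(
            trip_count=("trip_miles", "size"),
            trip_miles_mean=("trip_miles", "mean"),
            trip_miles_sum=("trip_miles", "sum"),
            # Most frequent pickup zone in the hour (ties resolve to the smallest ID).
            PULocationID_mode=("PULocationID", lambda s: s.mode().iloc[0]),
        )
        .reset_index()
    )


hourly = aggregate_hourly(trips)
```

Grouping on `service` keeps the per-app aggregations independent, and the result is already one combined frame with one row per app and hour.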


Many data exploration questions were asked and examined. Some interesting findings include the following.
* Connections: Pickups happen across a wide range of taxi zones, but drop-offs are concentrated in specific zones or head out of NYC. Although not shown in this image, more granular exploration showed that some taxi zones were served much more by one app than the other. <br><img src='pictures/nyc_rideshare_pickup_and_dropoffs_2024.png' width = "500" />

* App Dominance: Uber is used significantly more in NYC than Lyft. <br><img src='pictures/nyc_rideshare_moving_avg_trip_volumes_2024.png' width = "500" />

* Zone Connections: Where are people who are picked up in one zone getting dropped off? It turns out they don't typically leave their taxi zones. This excludes airport pickups and drop-offs.<br> <img src='pictures/uber_lyft_connections_top_5_2024.png' width = "500" />

* Tips: Lyft riders are more generous than Uber riders when it comes to tipping.<br> <img src='pictures/rider_generosity.png' width = "500" />

### Weather Data  
 Weather data in this project is used to identify how conditions affect rideshare pricing in NYC. It helps capture demand spikes and travel delays caused by adverse weather. This allows for more accurate fare predictions and better planning for both riders and service providers.

* **Source**: Visual Crossing 
* **Website**: https://www.visualcrossing.com/
* **Descriptions**: Visual Crossing is a leading provider of weather data and enterprise analysis tools to data scientists, business analysts, professionals, and academics. Visual Crossing aims to provide accurate weather data and forecasts by combining data from various sources, including ground-based weather stations, satellites, and radar, and using statistical climate modeling.


 
#### Data Processing

* **Data Collection**: NYC weather data was collected for 2020-01-01 through 2024-12-31. The file is 2MB. It was pulled via a query from Visual Crossing.
* **Data Exploration**: The initial exploration focused on understanding the dataset structure, inspecting data types, and examining the distribution of weather variables such as temperature, precipitation, and windspeed. Special attention was given to the datetime column to ensure consistent hourly intervals throughout the time series.
* **Data Filtering and Cleaning**: Non-essential columns were removed to focus the analysis on key weather-related variables. The dataset was filtered to retain only hourly observations, and duplicate or invalid entries were excluded to maintain data quality.
    + **Columns removed**: name, dew, humidity, precipprob, snowdepth, windgust, winddir, sealevelpressure, solarradiation, solarenergy, severerisk, icon, stations, preciptype.
    + **Remaining Columns**: datetime, temp, feelslike, precip, snow, windspeed, cloudcover, visibility, uvindex, conditions
    + **DateTime Check**: Missing hourly records—primarily caused by daylight saving time transitions—were detected by comparing the dataset's timestamps against a complete hourly range. These missing records were then filled by averaging the values from the hour before and after, ensuring continuity in the time series.
        - Timestamps added to Dataframe: "2020-03-08 02:00:00", "2021-03-14 02:00:00", "2022-03-13 02:00:00", "2023-03-12 02:00:00","2024-03-10 02:00:00"
    + **'Conditions' Column Value Encoding**: The column is a categorical representation of combined weather conditions, encoded as numerical values to simplify analysis and modeling. Below is the mapping used:

        - **0** — Overcast  
        - **1** — Partially cloudy  
        - **2** — Clear  
        - **3** — Rain, Overcast  
        - **4** — Rain, Partially cloudy  
        - **5** — Snow, Rain, Partially cloudy  
        - **6** — Snow, Rain, Overcast  
        - **7** — Snow, Overcast  
        - **8** — Snow, Partially cloudy  
        - **9** — Rain  
        - **10** — Snow  
        - **11** — Snow, Rain  
    + **Dataframe shape**: 43,848 rows x 10 columns
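
The DateTime check described above can be sketched as follows, using a toy frame with the 02:00 record missing. This is only an illustration under stated assumptions: the values are invented, and carrying the encoded `conditions` value forward (rather than averaging a category code) is our assumption for the sketch.

```python
import pandas as pd

# Toy hourly weather rows around a spring daylight saving transition.
weather = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-03-10 00:00", "2024-03-10 01:00",
        "2024-03-10 03:00", "2024-03-10 04:00",
    ]),
    "temp": [40.0, 38.0, 42.0, 41.0],
    "conditions": [1, 1, 0, 0],
}).set_index("datetime")

# Compare the timestamps against a complete hourly range to find gaps.
full_range = pd.date_range(weather.index.min(), weather.index.max(), freq="h")
missing = full_range.difference(weather.index)

# Reindex to the full range; the added rows start out as NaN.
filled = weather.reindex(full_range)
# For a single-hour gap, linear interpolation equals the average of the
# hour before and the hour after.
filled["temp"] = filled["temp"].interpolate(method="linear")
# Assumption: the encoded category is carried forward instead of averaged.
filled["conditions"] = filled["conditions"].ffill().astype(int)
```

Applied to the full period, a complete hourly range from 2020-01-01 00:00 through 2024-12-31 23:00 contains 43,848 timestamps, which matches the dataframe shape reported above.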

#### NYC Weather Visual (2020-2024)

Temperatures steadily rise from January to July, peaking in the summer months before gradually declining through December. While the overall pattern is consistent year-to-year, slight variations appear — for example, 2023 had a warmer early spring compared to other years. 

<img src="pictures/Weather plot.png" alt="Alt Text" width="800" height="480">

### MTA Ridership Data
MTA ridership data is used to analyze transit trends and understand how public transportation usage changed over time, especially during and after the COVID-19 pandemic. It provides insight into recovery patterns, demand for various transportation modes, and infrastructure usage across NYC. This information is crucial for planning service levels, evaluating operational efficiency, and informing transportation policy decisions.

* **Source**: NYC Open Data – MTA Ridership (Daily)

* **Website**: https://data.ny.gov/Transportation/MTA-Daily-Ridership-Data-2020-2025/vxuj-8kew/about_data

* **Descriptions**: This dataset contains daily estimated ridership counts across multiple modes of MTA transportation in NYC. It includes subways, buses, Long Island Railroad (LIRR), Metro-North, Access-A-Ride, bridges and tunnels, and the Staten Island Railway. The dataset was made available to support transparency and inform stakeholders about mobility trends in NYC during and following the pandemic.

#### Data Processing

* **Data Collection**: MTA ridership data was collected between 2020-03-01 and 2025-01-09. The raw file was downloaded as a CSV from NYC Open Data. It includes 5 years of daily ridership estimates across several transportation systems.

* **Data Exploration**: The initial exploration involved reviewing column names, inspecting data types, and identifying the presence of missing or duplicate date records. Columns were checked for consistency and numerical values were verified for each ridership metric.

* **Data Filtering and Cleaning**: All columns except the Date column were converted to float64 to ensure numerical consistency for analysis. The Date column was converted to a proper datetime format for easy resampling and time-based indexing. Duplicate date entries were removed, and missing dates within the 2020–2024 range were identified by comparing against a complete date range. Any missing dates were added with null values for interpolation or handling in further analysis.

* **NA Handling**: Potential missing values were inspected and none were identified. 

* **Date Check**: Full coverage was confirmed for the date range 2020-03-01 to 2025-01-09. The complete range includes 1,776 days.

* **Dataframe shape after cleaning**: 1,776 rows x 15 columns
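
A rough sketch of the cleaning and date-coverage check described above, on a toy frame with one duplicated and one missing date. The column name `subway_ridership`, the date format, and the values are hypothetical stand-ins, not the dataset's actual headers.

```python
import pandas as pd

# Toy daily ridership rows; values are invented.
raw = pd.DataFrame({
    "Date": ["03/01/2020", "03/02/2020", "03/02/2020", "03/04/2020"],
    "subway_ridership": ["2200000", "5300000", "5300000", "5500000"],
})

# Convert the date column to datetime and every other column to float64.
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y")
value_cols = [c for c in raw.columns if c != "Date"]
raw[value_cols] = raw[value_cols].astype("float64")

# Drop duplicate dates, then compare against a complete daily range.
clean = raw.drop_duplicates(subset="Date").set_index("Date").sort_index()
full_range = pd.date_range(clean.index.min(), clean.index.max(), freq="D")
missing_dates = full_range.difference(clean.index)

# Missing dates are added as rows of nulls for later handling.
clean = clean.reindex(full_range)
```

As a sanity check on the reported coverage, a complete daily range from 2020-03-01 through 2025-01-09 contains 1,776 dates.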

#### MTA Ridership - Monthly (2020-2024)

Subway ridership dropped sharply in early 2020 due to the pandemic but steadily recovered, peaking by 2024. Bus ridership also declined early but stabilized more quickly, while services like Access-A-Ride and Staten Island Railway maintained relatively low, flat usage. Bridges and Tunnels traffic steadily increased, suggesting more reliance on personal vehicles post-pandemic.

<img src="pictures/MTA Daily Ridership.png" alt="Alt Text" width="800" height="480">

#### MTA Ridership - Weekend vs Weekday (2020-2024)

Ridership dropped sharply in early 2020 due to the COVID-19 pandemic but steadily recovered over time. Weekday ridership consistently remained higher than weekend levels, reflecting commuter travel patterns. Both lines show gradual growth with some seasonal dips, indicating partial normalization of public transit usage by 2024.

<img src="pictures/MTA Ridership - Weekend v. Weekday.png" alt="Alt Text" width="800" height="480">




### Service Alerts Data
After data exploration, we determined that the granularity and structure of the MTA Delays dataset prevented a cohesive connection with the other datasets explored. We decided instead to model service impacts across the MTA's different modes of transit using the Service Alerts data provided by the MTA.

* **Source**: NYC Open Data – MTA Service Alerts
* **Website**: https://data.ny.gov/Transportation/MTA-Service-Alerts-Beginning-April-2020/7kct-peq7/about_data
* **Description**: The MTA service alerts system is designed to inform passengers about events that can disrupt their travel, including both scheduled and unscheduled occurrences. These alerts cover a range of situations, from planned maintenance and construction to unexpected incidents like accidents or track issues. The alerts are created through continuous monitoring of the transit system, where potential disruptions are identified and communicated to passengers to provide timely and accurate information about how their travel may be affected.  

#### Data Processing:

* **Data Collection**: Data was collected from 2020-04-28 to 2025-02-27. The raw file was downloaded as a CSV from NYC Open Data. It includes 5 years of service alerts for several transportation systems.
* **Data Exploration**: Although the system gives estimates of the expected delay duration for certain events in free-text fields, these were not always provided. We ultimately decided to assign each alert a grade based on the level of severity we believed was appropriate. <br> There can be multiple service alerts per day or none at all. The chart below shows service alert counts by month across the year. <br> <img src='pictures/service_alerts_count_monthly.png' width ="800" height="300"> <br>
* **Data Filtering and Cleaning**: As this dataset contained multiple years' worth of data, we limited it to alerts from 2024. Most of the fields in this dataset were also dropped, as we needed to aggregate the data to an hourly level due to the inconsistency of the service alert timings. Each alert was attributed to the start of the hour in which it was published, so an alert published at 10:37 AM would be attributed to the 10 o'clock hour.
* **Missingness Handling**: The only feature with missing values in this dataset for this date range was the Description field, which was dropped in the cleaning process.
* **Feature Engineering**: The Status Label feature gave some indication of the type of service interruption each alert described. Some alerts combined several status labels in a pipe-delimited list. Due to the inconsistent timing of the alerts, we aggregated to an hourly level for each transportation Agency. For each Agency, we took the max severity level per hour. The severity level to status label mapping is as follows:
  
  **Low Severity (Informational/Minor) - Level 1**
    * `arrival-information-outage`: 1
    * `information-outage`: 1
    * `special-notice`: 1
    * `station-notice`: 1
    * `extra-service`: 1  *(Could be considered neutral or positive)*
    * `planned-work`: 1  *(Planned, so less disruptive impact assumed)*

    **Moderate Severity (Some Impact) - Level 2**
    * `boarding-change`: 2
    * `slow-speeds`: 2
    * `service-change`: 2
    * `some-delays`: 2
    * `expect-delays`: 2

    **Medium Severity (Noticeable Impact/Delays) - Level 3**
    * `delays`: 3
    * `stops-skipped`: 3
    * `stations-skipped`: 3
    * `express-to-local`: 3
    * `local-to-express`: 3
    * `buses-detoured`: 3
    * `shuttle-buses-detoured`: 3
    * `detour`: 3
    * `reroute`: 3
    * `some-reroutes`: 3
    * `trains-rerouted`: 3
    * `substitute-buses`: 3

    **High Severity (Significant Impact) - Level 4**
    * `severe-delays`: 4
    * `multiple-changes`: 4
    * `delays-and-cancellations`: 4

    **Very High Severity (Major Disruption/Suspension) - Level 5**
    * `part-suspended`: 5
    * `suspended`: 5
    * `cancellations`: 5

    **Default for unknown/missing status**
    * `unknown`: 0
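
The mapping and hourly aggregation can be sketched as below, using a subset of the labels listed above. This is an illustration, not our exact code: scoring a pipe-delimited label by its most severe component is an assumption, and the function names, agency names, and sample alerts are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Subset of the status-label mapping listed above.
SEVERITY = {
    "planned-work": 1,
    "some-delays": 2,
    "delays": 3,
    "severe-delays": 4,
    "suspended": 5,
}


def alert_severity(status_label):
    """Score a pipe-delimited status label by its most severe component."""
    if not status_label:
        return 0  # default for unknown/missing status
    parts = [part.strip() for part in status_label.split("|")]
    return max(SEVERITY.get(part, 0) for part in parts)


def hourly_max_severity(alerts):
    """Map (agency, hour start) to the max severity of alerts published in that hour."""
    result = defaultdict(int)
    for agency, published, label in alerts:
        # Attribute the alert to the start of the hour it was published in.
        hour = published.replace(minute=0, second=0, microsecond=0)
        result[(agency, hour)] = max(result[(agency, hour)], alert_severity(label))
    return dict(result)


# Invented sample alerts: (agency, publish time, status label).
alerts = [
    ("NYCT Subway", datetime(2024, 1, 5, 10, 37), "delays | planned-work"),
    ("NYCT Subway", datetime(2024, 1, 5, 10, 52), "suspended"),
    ("LIRR", datetime(2024, 1, 5, 10, 5), "some-delays"),
]
severity_by_hour = hourly_max_severity(alerts)
```

Taking the maximum keeps one value per agency and hour, at the cost of hiding how many alerts were active in that hour.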


* In the graph below, we can see that MTA subways and buses more consistently have higher-severity service alerts. While we can't say definitively that this causes increased delay durations, the significant gap relative to other forms of transit seems important to call out.

<img src='pictures/service_alerts_monthly_max_severity_by_agency.png' width ="800" height="300">


### MTA Delays Data
The MTA Subway Delays dataset provides monthly records of reported subway train delays across different lines and divisions in New York City. Each record includes the type of delay, the affected division and line, and the number of delay instances reported on that date. The data spans from January 2020 to December 2024 and serves as a proxy for system reliability within the city’s subway infrastructure.

* **Source**: NYC Open Data – MTA Delays

* **Website**: https://data.ny.gov/Transportation/MTA-Subway-Trains-Delayed-Beginning-2020/wx2t-qtaz/about_data

* **Description**:
    - The dataset contains **40,503** entries and **7 columns**, covering subway delays across multiple lines and divisions.
    - The most frequent **reporting categories** are:
        - Infrastructure & Equipment
        - Crew Availability
        - External Factors
    - Common specific causes include door-related issues, braking, and debris on tracks.
    - Only the `subcategory` column has missing values (~5.5% of records).
    - Delay frequency is reported by month and has been converted to datetime format.
    - Additional features (`Year`, `Month`, `Weekday`) were extracted to support temporal analysis.
    - A time series plot shows variation in delays across years, providing insight into longer-term trends.

##### *MTA Delays Per Year*

<img src="pictures/delays/MTA_delays_per_year.png" alt="MTA delays per year" width="900" height="480">


#### Data Preprocessing:

**Data Collection:** The dataset was collected as a single CSV file containing over 40,000 rows and 7 columns. Each entry represents a reported delay event for a specific subway division and line on a given date.

**Data Cleaning:** One column, subcategory, had missing values for roughly 5.5% of the rows. These were filled with the label “Unknown” to retain the entries without introducing null-related bias. All categorical text columns — including division, line, reporting_category, and subcategory — were converted to lowercase and stripped of whitespace for standardization.

**Temporal Feature Extraction:** The original date column, month, was parsed into datetime format, and additional temporal features were created, including Year, Month, and Weekday. This allows for more granular time-based analysis and model-ready features.

**Day Type Mapping:** A numeric column day_type, which identifies whether the delay occurred on a weekday, Saturday, or holiday/Sunday, was mapped to human-readable labels using a day_type_label column.
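
The cleaning, temporal feature extraction, and day type mapping steps above might look roughly like this. It is a sketch on invented rows. The column names are the ones mentioned in this section, while the numeric `day_type` codes in `DAY_TYPE_LABELS` are hypothetical and should be checked against the dataset's data dictionary.

```python
import pandas as pd

# Toy delay records; values are invented.
delays = pd.DataFrame({
    "month": ["2024-01-01", "2024-02-01"],
    "division": ["A DIVISION ", "B DIVISION"],
    "line": ["1", " F"],
    "day_type": [1, 2],
    "reporting_category": ["Infrastructure & Equipment", "Crew Availability"],
    "subcategory": ["Braking", None],
    "delays": [83, 40],
})

# Fill missing subcategories, then standardize the text columns.
delays["subcategory"] = delays["subcategory"].fillna("Unknown")
text_cols = ["division", "line", "reporting_category", "subcategory"]
for col in text_cols:
    delays[col] = delays[col].str.strip().str.lower()

# Parse the month column and derive temporal features from it.
delays["month"] = pd.to_datetime(delays["month"])
delays["Year"] = delays["month"].dt.year
delays["Month"] = delays["month"].dt.month
delays["Weekday"] = delays["month"].dt.day_name()

# Hypothetical code-to-label mapping, for illustration only.
DAY_TYPE_LABELS = {1: "weekday", 2: "saturday", 3: "holiday/sunday"}
delays["day_type_label"] = delays["day_type"].map(DAY_TYPE_LABELS)
```

Because lowercasing runs after the fill, the placeholder ends up stored as `unknown`.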

Exploratory data analysis revealed important trends in the dataset:

**Delay Categories:** Infrastructure and equipment issues were the most common causes of delays, followed by crew availability and external factors (such as debris on tracks).

**Subway Lines:** The most frequently delayed lines were concentrated within the A and B divisions, with certain lines consistently showing higher disruption rates than others.

**Time Patterns:** Delays were more frequent during weekday service, particularly in the morning and late afternoon periods. Delay volumes appeared to remain relatively consistent over time, with some seasonal spikes in colder months.

These delay patterns will be integrated with weather, ridership, and TLC trip data to better understand how subway reliability might influence rideshare demand and pricing across New York City.

## Modeling Efforts

#### 1. **Random Forest Modeling** 

Running a simple model, a Random Forest Regressor, we obtained an MSE of 0.77 and an $R^2$ goodness of fit value of 0.94. The model was fit on just the TLC data without much preprocessing or tuning. With a baseline established, we are confident we can achieve better results with further work.

#### 2. **Linear Regression Modeling** 

***File reference for code**: /workspaces/SU-IST707-Group_Project/Project Checkpoints/Checkpoint 2/Initial Models/Linear Regression Model-3.ipynb*

##### *Observations & Data Review*

The modeling effort was based on four datasets with varying time granularities:

- Three datasets (Uber, Lyft, and Weather) were structured with hourly temporal resolution, enabling detailed trend analysis across daily cycles.

- The Uber and Lyft datasets covered a consistent date range from 1/1/2024 to 12/31/2024 and included separate columns for date and hour, which required transformation to create a unified datetime field.

- Weather data spanned from 1/1/2020 to 12/31/2024 and included encoded numeric weather condition indicators, making it model-ready after filtering for 2024.

- The MTA Ridership data was available at a daily level, covering 3/1/2020 to 1/9/2025, and included system-level totals across Subways, Buses, LIRR, Metro-North, and more. This dataset needed downscaling and merging to match the hourly granularity of other inputs.

To enable accurate modeling, preprocessing steps included datetime alignment, feature engineering (e.g., hour, weekday, month), merging weather conditions and MTA ridership data, and calculating aggregated ride cost metrics like total_rideshare_cost_mean.
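
The datetime alignment and merge can be sketched as follows. This is a minimal illustration on invented rows. The column names `date`, `hour`, and `subway_ridership` are hypothetical, and repeating each day's ridership total on every hour of that day (rather than splitting it across hours) is an assumption of the sketch.

```python
import pandas as pd

# Toy hourly rideshare rows with separate date and hour columns.
rides = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "hour": [0, 23, 7],
    "total_rideshare_cost_mean": [31.5, 28.0, 25.2],
})
# Toy daily MTA ridership totals.
mta = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "subway_ridership": [1_800_000.0, 3_200_000.0],
})

# Build one datetime field from the date and hour columns.
rides["datetime"] = pd.to_datetime(rides["date"]) + pd.to_timedelta(rides["hour"], unit="h")

# Calendar features derived from the unified datetime.
rides["weekday"] = rides["datetime"].dt.dayofweek
rides["month"] = rides["datetime"].dt.month

# Join the daily totals onto each hourly row through the calendar date.
rides["merge_date"] = rides["datetime"].dt.normalize()
merged = rides.merge(mta, left_on="merge_date", right_on="Date", how="left")
merged = merged.drop(columns=["merge_date", "Date"])
```

A left join keeps every hourly row even when a day has no ridership record, which leaves nulls that are easy to spot afterwards.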


##### *Model Consideration*

A linear regression model was selected due to its interpretability and its ability to quantify the effect of each independent feature on the target — hourly rideshare fare cost. This choice is especially useful for understanding how different variables, such as time, weather, and ridership, influence pricing patterns.

##### *Model Performance (on Train Set)*

- **MAE: 1.55**: Predictions differ from actual prices by $1.55 on average.

- **MSE: 3.97**: The mean squared error, in squared dollars, penalizes large misses more heavily than MAE does.

- **RMSE: 1.99**: The square root of the MSE, so a typical error is about $2. Because RMSE is only modestly above MAE, very large errors appear to be uncommon.

- **R² Score: 0.8469**: The model explains approximately 85% of the variance in prices on the training data.

These metrics were reinforced by actual vs. predicted plots for both training and test sets, which show tightly aligned performance curves, minimal overfitting, and strong temporal tracking.
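
For reference, the four error metrics reported above relate to each other as in this small sketch. The fares and predictions are invented and `regression_metrics` is a hypothetical helper, so the printed numbers are not our model's results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Invented hourly mean fares (dollars) and predictions.
actual = [20.0, 24.0, 30.0, 26.0]
predicted = [21.0, 22.0, 29.0, 28.0]
metrics = regression_metrics(actual, predicted)
```

RMSE is the square root of MSE, which is consistent with the values reported above: the square root of 3.97 is approximately 1.99.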

##### *Train Plot*


<img src="pictures/LR_Train Chart.png" alt="Alt Text" width="800" height="480">

##### *Test Plot*

<img src="pictures/LR_test chart.png" alt="Alt Text" width="800" height="480">

##### *Feature Feedback & Interpretability*

1. **Top Predictive Features**: tips_mean, precip, trip_miles_mean, snow, congestion_surcharge_mean, service_uber, and is_weekend_mode were the most influential. These captured trip characteristics, weather-related demand, traffic congestion costs, and differences between platforms.

2. **Moderate/Useful Features**: Time-based variables like month, weekday, and hour added useful seasonal and hourly trend insights. Environmental variables like uvindex, conditions, and feelslike showed minor influence alone, but may be useful when interacting with other features (e.g., feelslike × hour).

3. **Low-Impact Features**: trip_count, windspeed, cloudcover, visibility, and daily MTA ridership metrics had negligible coefficients and could be dropped to simplify the model.

#### 3. **XGBoost Modeling with MTA Delays Integration**

##### *Observations & Data Integration*

This modeling phase explored how **NYC subway delays**, combined with rideshare trip characteristics and weather, impact hourly rideshare demand (`trip_count`).

The dataset was built by merging:
- Hourly rideshare data (Uber and Lyft)
- Hourly weather observations
- Daily MTA subway delay counts (aggregated and joined on date)

Preprocessing steps included:
- One-hot encoding for categorical variables (`service`, `conditions`)
- Type conversion for boolean fields
- Aggregation of `total_delays` and temporal alignment across datasets
- Final confirmation of a fully numeric, model-ready feature matrix
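
The encoding and type-conversion steps above can be sketched with pandas as follows. The rows are invented and only a few of the features are shown. Fitting the regressor itself is left out of the sketch.

```python
import pandas as pd

# Toy merged rows; `total_delays` stands for the joined MTA delay count.
df = pd.DataFrame({
    "service": ["uber", "lyft", "uber"],
    "conditions": [0, 3, 1],
    "is_weekend_mode": [True, False, False],
    "trip_miles_mean": [4.2, 3.9, 5.1],
    "total_delays": [120, 120, 95],
    "trip_count": [9100, 3200, 8700],
})

# One-hot encode the categorical columns, leaving the target out of X.
X = pd.get_dummies(df.drop(columns="trip_count"), columns=["service", "conditions"])

# Convert boolean columns (including any boolean dummies) to 0/1 integers.
bool_cols = X.select_dtypes(include="bool").columns
X[bool_cols] = X[bool_cols].astype(int)
y = df["trip_count"]

# Confirm that every feature column is numeric before modeling.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in X.dtypes)
```

The resulting column names, such as `service_uber` and `conditions_3`, follow the `prefix_value` pattern that `pd.get_dummies` uses by default.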

##### *Model Consideration*

We selected the **XGBoost Regressor** to model non-linear relationships and feature interactions across time, weather, trip, and delay dimensions. The model aimed to predict **hourly rideshare volume** using all available features.

##### *Model Performance*

- **RMSE**: **1588.20**  
  The root-mean-square error is about 1,588 trips per hour; larger misses weigh more heavily in this figure than in a plain average. We consider it acceptable given the wide range of trip counts observed.

##### *Prediction Performance Plot*

Shows the relationship between actual and predicted `trip_count`. The model tracks demand well, especially at mid-range volumes.

<img src="pictures/delays/predictionsvactual.png" alt="XGBoost Predictions vs. Actuals" width="700"/>

##### *Residual Distribution*

The residuals are mostly centered around zero with a slight left skew, indicating generally good prediction performance without major bias.

<img src="pictures/delays/distribution_of_residuals.png" alt="Distribution of Residuals" width="700"/>

##### *Time Series: Actual vs. Predicted*

This view shows how the model’s predictions follow the real-world fluctuations in demand across the test set. It shows alignment, with some volatility during peaks.

<img src="pictures/delays/aVp_overtime.png" alt="Actual vs Predicted Over Time" width="900"/>

##### *Feature Feedback & Interpretability*

1. **Top Predictive Features**:
   - `service_uber`: Platform usage had the highest impact, likely reflecting Uber’s larger market share.
   - `trip_miles_mean` and `trip_time_mean`: Longer trips strongly indicate higher hourly demand.

2. **Moderate Contributors**:
   - `is_weekend_mode`, `uvindex`, and `tips_mean` captured smaller but helpful signals.
   - `total_delays` from MTA data added some predictive power, showing potential for behavioral response to subway outages.

3. **Low Impact Features**:
   - Some weather categories (like `cloudcover`, `snow`, `conditions_#`) contributed minimally and may be useful only in interaction terms or under extreme conditions.


## Problems and Challenges

The assumptions made when aggregating the TLC data may have resulted in some loss of information. Going forward, we will explore different aggregation strategies to draw out more information.

## Next Steps

As we continue modeling, we will be looking to join in the other datapoints we have available. Some further feature engineering may be needed to align our datasets and to draw out the connections between datapoints. Our initial modeling was mostly centered on the TLC data. We will also explore modeling the alternative transit options to develop a model that can recommend alternate transportation during times of peak rideshare usage and potential surge pricing. 