# **MTA Ridership Data Analysis**
This notebook analyzes daily MTA ridership data, winsorizes each period separately to limit the influence of outliers, and explores trends across four periods: Pre-Pandemic, Pandemic, Recovery, and Post-Pandemic (named `pre_lockdown`, `lockdown`, `recovery`, and `post_lockdown` in the code).

## **Importing libraries & dataset**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.tsa.seasonal import seasonal_decompose


In [None]:
file_path = "MTA_Daily_Ridership.csv"
df = pd.read_csv(file_path)
df['Date'] = pd.to_datetime(df['Date'])


## **Exploratory Data Analysis (EDA)**
To better understand the dataset, we apply several EDA techniques:

- **Summary Statistics**: Overview of numerical features.
- **Missing Values Check**: Identify missing data.
- **Correlation Heatmap**: Understand relationships between numerical variables.
- **Boxplots**: Visualize outliers and distributions.
- **Time Series Trends**: Examine ridership patterns over time.
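
The correlation heatmap and time series trends listed above rest on two simple computations. A minimal sketch of both follows, using a small synthetic frame rather than the real file. The column names `Subways` and `Buses` are placeholders chosen for illustration and are not confirmed headers of `MTA_Daily_Ridership.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ridership data (hypothetical column names).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-03-01", periods=28, freq="D")
subways = rng.integers(1_000_000, 5_000_000, size=len(dates))
demo = pd.DataFrame({
    "Date": dates,
    "Subways": subways,
    "Buses": subways // 3 + rng.integers(0, 50_000, size=len(dates)),
})

# Pairwise Pearson correlations between the numeric columns. This matrix is
# what a heatmap (for example sns.heatmap(corr, annot=True)) would display.
corr = demo.select_dtypes(include="number").corr()
print(corr)

# Weekly means smooth out the weekday/weekend cycle and expose the trend.
weekly = demo.set_index("Date")["Subways"].resample("W").mean()
print(weekly)
```

On the real data, `demo` would be replaced by `df`, and the weekly series could be passed to `plt.plot` to draw the trend.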

### Data summary

In [None]:
print(df.head())

In [None]:
# info() prints its report directly and returns None, so wrapping it in print() adds a stray "None".
df.info()

In [None]:
print(df.dtypes)

In [None]:
print(df.describe())

In [None]:
print(df.isnull().sum())

### Outlier visualisation

In [None]:

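# Keep only numeric columns: this drops 'Date' and anything else a box plot cannot draw.
df_box = df.select_dtypes(include='number')

box_color = '#1f77b4'  # Blue
median_color = '#ff7f0e'  # Orange

for column in df_box.columns:
    plt.figure(figsize=(6, 4))
    box = plt.boxplot(df_box[column].dropna(), patch_artist=True)
    # Label the single box via xticks; the `labels` keyword is deprecated in Matplotlib 3.9+.
    plt.xticks([1], [column])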

    for patch in box['boxes']:
        patch.set(facecolor=box_color, alpha=0.5)

    for median in box['medians']:
        median.set(color=median_color, linewidth=2)

    plt.title(f'Box Plot – Outlier Visualization: {column}', fontsize=12, fontweight='bold')
    plt.ylabel('Values')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
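
The box plots mark as outliers the points beyond the whiskers, which Matplotlib places by default at most 1.5 times the interquartile range (IQR) away from the quartiles. As a rough numeric companion, the sketch below counts such points per column. `count_iqr_outliers` is a hypothetical helper written for this example, and the data are made up.

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column names.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data: column "a" contains one obvious outlier, column "b" has none.
demo = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 15, 16, 17, 18, 200],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
counts = count_iqr_outliers(demo)
print(counts)
```

Calling `count_iqr_outliers(df)` on the ridership frame would give a per-column count to read alongside the plots.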


## **Data cleaning**

In [None]:
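def winsorize_outliers(df, limits=(0.05, 0.05)):
    """Winsorize each numeric column of `df` in place (5% per tail by default) and return `df`."""
    for column in df.select_dtypes(include='number').columns:
        # winsorize returns a masked array; store it as a plain ndarray.
        df[column] = np.asarray(winsorize(df[column].to_numpy(), limits=limits))
    return df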
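
To make the effect of `winsorize` concrete, here is a small illustration on made-up numbers (not taken from the ridership file). Unlike trimming, winsorizing keeps every observation and replaces the most extreme ones with the nearest retained value.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten made-up values with one extreme point at the top.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=(0.1, 0.1) on 10 values replaces one value at each end:
# the minimum takes the second-smallest value, the maximum the second-largest.
clipped = np.asarray(winsorize(values, limits=(0.1, 0.1)))

print(clipped)                        # [2 2 3 4 5 6 7 8 9 9]
print(values.mean(), clipped.mean())  # 14.5 5.5
```

The input array is left unchanged because `winsorize` works on a copy unless `inplace=True` is passed.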

In [None]:

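# Half-open date ranges [start, end). Rows dated before 2020-03-01, or on or
# after 2024-10-31, match none of the masks and are left out of the period frames.
pre_lockdown = (df['Date'] >= '2020-03-01') & (df['Date'] < '2020-03-22')
lockdown = (df['Date'] >= '2020-03-22') & (df['Date'] < '2021-06-08')
recovery = (df['Date'] >= '2021-06-08') & (df['Date'] < '2021-09-13')
post_lockdown = (df['Date'] >= '2021-09-13') & (df['Date'] < '2024-10-31')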

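# Take explicit copies so that the in-place winsorization that follows
# does not raise SettingWithCopyWarning.
df_pre_lockdown = df[pre_lockdown].copy()
df_lockdown = df[lockdown].copy()
df_recovery = df[recovery].copy()
df_post_lockdown = df[post_lockdown].copy()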
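
Because the four masks are built by hand, it can be worth checking that they cover the rows they are meant to cover. The sketch below applies the same boundaries to a synthetic daily index and counts the rows that fall in no period. It assumes data running from 2020-03-01 through 2024-10-31, and the `Subways` column and the `periods` dictionary are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Synthetic stand-in: one row per day (hypothetical column name).
dates = pd.date_range("2020-03-01", "2024-10-31", freq="D")
demo = pd.DataFrame({"Date": dates, "Subways": range(len(dates))})

# Same half-open [start, end) boundaries as the masks above.
periods = {
    "pre_lockdown": ("2020-03-01", "2020-03-22"),
    "lockdown": ("2020-03-22", "2021-06-08"),
    "recovery": ("2021-06-08", "2021-09-13"),
    "post_lockdown": ("2021-09-13", "2024-10-31"),
}

# Label each row with its period; rows matching no range stay unlabeled.
demo["Period"] = None
for name, (start, end) in periods.items():
    mask = (demo["Date"] >= start) & (demo["Date"] < end)
    demo.loc[mask, "Period"] = name

print(demo["Period"].value_counts())

# Rows without a label would be missing from the concatenated frame.
unassigned = int(demo["Period"].isna().sum())
print("rows outside every period:", unassigned)
```

With these boundaries the single unlabeled row is 2024-10-31 itself, since the last range excludes its end date. If that day should be kept, the final comparison could use `<=` instead of `<`.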

In [None]:
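# Winsorize each period separately so that limits are computed within each period.
df_pre_lockdown = winsorize_outliers(df_pre_lockdown)
df_lockdown = winsorize_outliers(df_lockdown)
df_recovery = winsorize_outliers(df_recovery)
df_post_lockdown = winsorize_outliers(df_post_lockdown)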

In [None]:
df_merged = pd.concat([df_pre_lockdown, df_lockdown, df_recovery, df_post_lockdown], ignore_index=True)

In [None]:
# df_merged.to_csv('MTA-Ridership.csv', index=False, date_format='%Y-%m-%d')