# Phase 1 - Data & Descriptive Statistics

## Data Ingestion & Initial Inspection

This notebook begins the analysis of NYC yellow taxi ride demand (Jan-Mar 2019).  

**Objectives of this notebook:**
- Load and concatenate the raw CSV data
- Inspect schema, datatypes, and missing values
- Document initial observations and potential data quality issues


In [25]:
import sys
import os
import pandas as pd
import numpy as np
from importlib import reload

# Add src directory to path
sys.path.append("../../src")

# Import the load_data module
from data import load_data


In [26]:
# Load all CSVs and concatenate into a single DataFrame
df = load_data.load_raw_taxi_data()

Loading ../../data/raw/yellow_tripdata_2019-01.csv...
Loading ../../data/raw/yellow_tripdata_2019-02.csv...
Loading ../../data/raw/yellow_tripdata_2019-03.csv...
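
The `load_data` module itself is not shown in this notebook. As a rough sketch of what a loader of this kind might do, assuming it reads each monthly CSV, parses the two timestamp columns, and tags every row with its `source_file`: the function name `load_monthly_csvs`, its parameters, and the glob pattern below are hypothetical and are not taken from the project's `src` code.

```python
from pathlib import Path

import pandas as pd


def load_monthly_csvs(raw_dir, pattern="yellow_tripdata_2019-*.csv"):
    """Hypothetical stand-in for load_data.load_raw_taxi_data (not the project's code)."""
    frames = []
    # Sort so the months are concatenated in chronological (filename) order.
    for path in sorted(Path(raw_dir).glob(pattern)):
        print(f"Loading {path}...")
        frame = pd.read_csv(
            path,
            parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
        )
        # Record which file each row came from, mirroring the source_file column.
        frame["source_file"] = path.name
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No files matching {pattern!r} in {raw_dir}")
    # ignore_index gives a single continuous RangeIndex across all months.
    return pd.concat(frames, ignore_index=True)
```

Keeping the file name on each row makes it possible to trace odd records back to the monthly file they came from.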


In [30]:
# Inspect the concatenated dataframe
load_data.inspect_schema(df)


--- Schema Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22519712 entries, 0 to 22519711
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int64         
 4   trip_distance          float64       
 5   RatecodeID             int64         
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18 

## Initial Observations

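- Dataset contains 22,519,712 rows (~22.5 million) and 19 columns.
- Column names and types match TLC documentation, though casing differs from the project data dictionary.
- Some pickup datetimes fall outside the intended Jan-Mar 2019 window: values range from 2001-02-02 to 2088-01-24, and late-2018 pickups appear in the January file.
- `congestion_surcharge` is missing for ~22% of records (about 17.66 million non-null values); this is structural (introduced in early 2019 and only applied to certain zones/times).
- Some trips have unrealistic fare-duration combinations (for example, a 52.0 fare on a trip lasting under two minutes), and the monetary columns contain negative values and extreme maxima (`fare_amount` ranges from -447.0 to 943,274.8). These records will require cleaning.
- Further cleaning and feature engineering will be performed in later steps.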
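
The observations above can be quantified with a few boolean masks. The following is a minimal sketch, assuming the column names shown in the schema output; the helper name `summarize_quality_issues`, the window bounds, and the choice of checks are illustrative assumptions rather than the project's cleaning rules.

```python
import pandas as pd


def summarize_quality_issues(df, start="2019-01-01", end="2019-04-01"):
    """Count rows matching each suspected data quality issue (illustrative helper)."""
    pickup = df["tpep_pickup_datetime"]
    # Trip duration in minutes, derived from the two timestamp columns.
    duration_min = (df["tpep_dropoff_datetime"] - pickup).dt.total_seconds() / 60
    checks = {
        # Window is half-open: start <= pickup < end.
        "pickup_outside_window": (pickup < pd.Timestamp(start)) | (pickup >= pd.Timestamp(end)),
        "congestion_surcharge_missing": df["congestion_surcharge"].isna(),
        "negative_fare": df["fare_amount"] < 0,
        "non_positive_duration": duration_min <= 0,
    }
    rows = pd.Series({name: int(mask.sum()) for name, mask in checks.items()}, name="rows")
    return pd.DataFrame({"rows": rows, "share": rows / len(df)})
```

Calling `summarize_quality_issues(df)` on the concatenated frame returns one row per check with the affected row count and its share of the dataset, which gives the later cleaning steps concrete numbers to work from.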

In [32]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,source_file
0,1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,1,N,151,239,1,7.0,0.5,0.5,1.65,0.0,0.3,9.95,,yellow_tripdata_2019-01.csv
1,1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,1,N,239,246,1,14.0,0.5,0.5,1.0,0.0,0.3,16.3,,yellow_tripdata_2019-01.csv
2,2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,1,N,236,236,1,4.5,0.5,0.5,0.0,0.0,0.3,5.8,,yellow_tripdata_2019-01.csv
3,2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,1,N,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,7.55,,yellow_tripdata_2019-01.csv
4,2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,2,N,193,193,2,52.0,0.0,0.5,0.0,0.0,0.3,55.55,,yellow_tripdata_2019-01.csv


In [31]:
df.describe()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
count,22519710.0,22519712,22519712,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,22519710.0,17663730.0
mean,1.636566,2019-02-15 07:40:05.223035904,2019-02-15 07:57:10.702980608,1.571249,2.893291,1.06218,163.6597,161.9358,1.280282,12.78228,0.8911029,0.4961151,2.072678,0.3413585,0.2991298,17.79865,1.875677
min,1.0,2001-02-02 14:55:07,2001-02-02 15:07:27,0.0,0.0,1.0,1.0,1.0,1.0,-447.0,-60.0,-0.5,-89.89,-70.0,-0.3,-450.3,-2.5
25%,1.0,2019-01-24 11:05:38,2019-01-24 11:24:48,1.0,0.94,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,9.96,2.5
50%,2.0,2019-02-14 21:57:50,2019-02-14 22:14:17,1.0,1.6,1.0,162.0,162.0,1.0,9.0,0.5,0.5,1.7,0.0,0.3,13.5,2.5
75%,2.0,2019-03-09 13:28:09,2019-03-09 13:45:17.249999872,2.0,2.92,1.0,233.0,234.0,2.0,14.0,1.0,0.5,2.75,0.0,0.3,19.0,2.5
max,4.0,2088-01-24 00:25:39,2088-01-24 07:28:25,9.0,831.8,99.0,265.0,265.0,5.0,943274.8,535.38,75.0,141492.0,3288.0,1.0,1084772.0,4.5
std,0.5260722,,,1.22699,3.805281,0.6729183,66.08905,70.24575,0.4706301,292.6392,1.152017,0.05903604,29.93208,1.80156,0.02208043,313.7936,1.086081


In [36]:
# Save concatenated raw data
df.to_parquet("../../data/processed/raw_combined_2019_q1.parquet", index=False, engine="fastparquet")