# 📕 01 - Load and Validate Raw Data

## Introduction 

This notebook covers the ingestion and pre-processing of the raw rides dataset. The data is sourced from the NYC Taxi and Limousine Commission (TLC) and contains detailed trip records for yellow taxi rides in a specific month. Our goal is to download, explore, validate, and clean this data for further analysis and modeling.

## TLC Trip Record Data

[The New York City Taxi and Limousine Commission (TLC)](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) provides access to extensive trip records from both yellow and green taxis as well as For-Hire Vehicles (FHV). This data is invaluable for a variety of applications, including urban planning, traffic modeling, and socio-economic studies.

### Yellow and Green Taxi Trip Records

These records offer a detailed look into the operations of traditional taxis in the city:

- **Date/Time**: Each record contains timestamps for both the pick-up and drop-off events.
- **Locations**: Pick-up and drop-off locations are included, encoded as TLC taxi zone IDs (`PULocationID`, `DOLocationID`).
- **Trip Details**: Data encompasses the trip distance, a breakdown of the fare, the rate type, payment methods, and the number of passengers as reported by the driver.
- **Source**: The data comes from technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The TLC publishes this data but did not generate it, and therefore does not guarantee its accuracy.

## Notebook Overview

1. **Data Download**: I've written a function that downloads the trip data for a specified year and month. For the initial phase of this project, I've chosen an arbitrary date and will be working with data from a single month.
2. **Data Exploration**: A quick look at the basic data structure and the main columns we're interested in.
3. **Data Cleaning**: 
    - Filtering relevant columns.
    - Renaming columns for clarity.
    - Ensuring the data is within the desired date range.
4. **Data Saving**: The validated and cleaned data is then saved into a new Parquet file for further usage.

In [15]:
# import necessary libraries
from pathlib import Path
import requests
import pandas as pd

def download_one_file(year: int, month: int) -> Path:
    """
    Downloads the yellow taxi trip data for a given year and month from the TLC website and saves it as a parquet file.

    Args:
    - year (int): The year of the data to download.
    - month (int): The month of the data to download.

    Returns:
    - path (Path): The path to the downloaded parquet file.
    """

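    url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month:02d}.parquet'
    # a timeout keeps the cell from hanging if the server does not respond
    response = requests.get(url, timeout=60)

    if response.status_code != 200:
        raise Exception(f'{url} is not available.')

    # return a Path object, as the signature and docstring promise
    path = Path(f'../data/raw/rides_{year}_{month:02d}.parquet')
    # write_bytes opens, writes, and closes the file in one call
    path.write_bytes(response.content)
    return path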
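
Since the download is driven only by year and month, re-running the notebook fetches the same file again. A minimal sketch of one way to skip the download when the file is already on disk follows. The helper names `local_path` and `download_if_missing` are hypothetical and not part of this project, and the actual network call is passed in as a `fetch` function, so the sketch can be tried without hitting the TLC server.

```python
from pathlib import Path
from typing import Callable

# Same folder the notebook writes raw files to.
RAW_DIR = Path("../data/raw")


def local_path(year: int, month: int, raw_dir: Path = RAW_DIR) -> Path:
    """Hypothetical helper: local file path for one month of rides."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return raw_dir / f"rides_{year}_{month:02d}.parquet"


def download_if_missing(
    year: int,
    month: int,
    fetch: Callable[[int, int], bytes],
    raw_dir: Path = RAW_DIR,
) -> Path:
    """Hypothetical helper: call `fetch` only when the file is not on disk.

    `fetch(year, month)` is assumed to return the raw file content as bytes.
    """
    path = local_path(year, month, raw_dir)
    if not path.exists():
        # Create the target folder on first use, then store the payload.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(year, month))
    return path
```

In the notebook, `fetch` could wrap the `requests.get` call from `download_one_file`; a second call for the same month then returns the existing path without a new request.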

In [16]:
# explore and validate one single file
download_one_file(year=2022, month=1)

rides = pd.read_parquet('../data/raw/rides_2022_01.parquet')
rides.head(20)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0
5,1,2022-01-01 00:40:15,2022-01-01 01:09:48,1.0,10.3,1.0,N,138,161,1,33.0,3.0,0.5,13.0,6.55,0.3,56.35,2.5,0.0
6,2,2022-01-01 00:20:50,2022-01-01 00:34:58,1.0,5.07,1.0,N,233,87,1,17.0,0.5,0.5,5.2,0.0,0.3,26.0,2.5,0.0
7,2,2022-01-01 00:13:04,2022-01-01 00:22:45,1.0,2.02,1.0,N,238,152,2,9.0,0.5,0.5,0.0,0.0,0.3,12.8,2.5,0.0
8,2,2022-01-01 00:30:02,2022-01-01 00:44:49,1.0,2.71,1.0,N,166,236,1,12.0,0.5,0.5,2.25,0.0,0.3,18.05,2.5,0.0
9,2,2022-01-01 00:48:52,2022-01-01 00:53:28,1.0,0.78,1.0,N,236,141,2,5.0,0.5,0.5,0.0,0.0,0.3,8.8,2.5,0.0


In [17]:
rides = rides[['tpep_pickup_datetime', 'PULocationID']]
rides

Unnamed: 0,tpep_pickup_datetime,PULocationID
0,2022-01-01 00:35:40,142
1,2022-01-01 00:33:43,236
2,2022-01-01 00:53:21,166
3,2022-01-01 00:25:21,114
4,2022-01-01 00:36:48,68
...,...,...
2463926,2022-01-31 23:36:53,90
2463927,2022-01-31 23:44:22,107
2463928,2022-01-31 23:39:00,113
2463929,2022-01-31 23:36:42,148


In [18]:
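# rename columns to a more convenient format
# (assign the result instead of using inplace=True, because `rides` is a
# column selection of the original frame and in-place edits can trigger
# pandas' SettingWithCopyWarning)
rides = rides.rename(columns={
    'tpep_pickup_datetime': 'pickup_datetime',
    'PULocationID': 'pickup_location_id'
})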

# check data description
rides.describe()

Unnamed: 0,pickup_datetime,pickup_location_id
count,2463931,2463931.0
mean,2022-01-17 01:19:51.689724,166.0768
min,2008-12-31 22:23:09,1.0
25%,2022-01-09 15:37:41,132.0
50%,2022-01-17 12:11:45,162.0
75%,2022-01-24 13:49:37.500000,234.0
max,2022-05-18 20:41:57,265.0
std,,65.46806
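
The `min` and `max` rows above show pickups from 2008 and from May 2022 in a file that should only cover January 2022. Before dropping such rows, it can help to count how many fall on each side of the expected window. The sketch below is one way to do that; the function name `count_out_of_range` is hypothetical, and the four sample timestamps are taken from the outputs above purely for illustration.

```python
import pandas as pd


def count_out_of_range(pickups: pd.Series, start: str, end: str) -> dict:
    """Hypothetical helper: count timestamps before, inside, and after [start, end)."""
    ts = pd.to_datetime(pickups)
    lo, hi = pd.Timestamp(start), pd.Timestamp(end)
    return {
        "before_start": int((ts < lo).sum()),
        "in_range": int(((ts >= lo) & (ts < hi)).sum()),
        "on_or_after_end": int((ts >= hi).sum()),
    }


# Small illustrative sample, not the full dataset.
sample = pd.Series([
    "2008-12-31 22:23:09",
    "2022-01-01 00:35:40",
    "2022-01-31 23:36:42",
    "2022-05-18 20:41:57",
])
report = count_out_of_range(sample, "2022-01-01", "2022-02-01")
print(report)
```

On the real data the same call would take `rides['pickup_datetime']` as its first argument.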


In [19]:
# keep only rides picked up in January 2022
# (start inclusive, end exclusive, so a pickup at exactly midnight on Jan 1 is kept)
rides = rides[
    (rides['pickup_datetime'] >= '2022-01-01')
    & (rides['pickup_datetime'] < '2022-02-01')
]

# check again for data description
rides.pickup_datetime.describe()

count                       2463879
mean     2022-01-17 01:58:40.393674
min             2022-01-01 00:00:08
25%             2022-01-09 15:37:56
50%             2022-01-17 12:11:54
75%             2022-01-24 13:49:37
max             2022-01-31 23:59:58
Name: pickup_datetime, dtype: object
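
The date filter above hard-codes the month boundaries. If more months are processed later, the boundaries could be derived from the year and month instead. The sketch below shows one possible version; `filter_to_month` is a hypothetical name, and it assumes the column already has a datetime dtype, as it does after `pd.read_parquet`. It uses a half-open interval, so a pickup at exactly midnight on the first day is kept and one at midnight on the first day of the next month is dropped.

```python
import pandas as pd


def filter_to_month(
    df: pd.DataFrame,
    year: int,
    month: int,
    column: str = "pickup_datetime",
) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose `column` falls inside the given month."""
    start = pd.Timestamp(year=year, month=month, day=1)
    # DateOffset(months=1) also rolls December over into the next year.
    end = start + pd.DateOffset(months=1)
    mask = (df[column] >= start) & (df[column] < end)
    # Return a copy so later edits do not touch the original frame.
    return df.loc[mask].copy()


# Small illustrative frame with two boundary cases on each side.
demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2021-12-31 23:59:59",
        "2022-01-01 00:00:00",
        "2022-01-31 23:59:58",
        "2022-02-01 00:00:00",
    ]),
    "pickup_location_id": [142, 236, 166, 114],
})
january = filter_to_month(demo, year=2022, month=1)
print(january)
```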

In [20]:
# export validated data to parquet file
rides.to_parquet('../data/transformed/validated_rides_2022_01.parquet')