# NYC taxi data cleaning

## Load packages and NYC taxi data from January 2021.

Load packages.

In [1]:
import datetime
from pathlib import Path

import pandas as pd

Load NYC taxi data from January 2021.

In [2]:
DATA_PATH = Path("/home/fmerino/Documents/data-engineering-zoomcamp-2024/01-docker-terraform/02-docker-sql/data-nyc-taxi")

In [None]:
nyc_taxi = pd.read_parquet(DATA_PATH/"yellow_tripdata_2021-01.parquet")

nyc_taxi

In [None]:
nyc_taxi.dtypes

Discard the `store_and_fwd_flag` column because it is not relevant to this analysis.

In [5]:
del nyc_taxi["store_and_fwd_flag"]

Check the number of unique values per column/attribute to identify potentially categorical columns.

In [None]:
for column in nyc_taxi.columns:
    n_unique = nyc_taxi[column].nunique()
    if n_unique < 10:
        # Few distinct values: list them, as the column is probably categorical.
        print(f"Column {column} includes {n_unique} unique values ({nyc_taxi[column].unique()}).")
    else:
        print(f"Column {column} includes {n_unique} unique values.")

In [None]:
nyc_taxi.describe()

Compute delta time (time elapsed between pickup and dropoff) and the average speed in miles per hour (trip distance divided by delta time in hours).

In [8]:
nyc_taxi["dt"] = (
    nyc_taxi["tpep_dropoff_datetime"]
    - nyc_taxi["tpep_pickup_datetime"]
)

In [9]:
nyc_taxi["avg_speed"] = (
    nyc_taxi["trip_distance"]
    / (nyc_taxi["dt"]/pd.Timedelta(hours=1))
)
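As a quick sanity check of the two computations above, here is a minimal sketch on a hypothetical two-trip sample (the values are made up for illustration, not taken from the TLC data). It shows that dividing a timedelta column by `pd.Timedelta(hours=1)` yields the duration in hours as plain floats, so the resulting speed is in miles per hour.

```python
import pandas as pd

# Hypothetical two-trip sample (illustrative values only).
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2021-01-01 08:00:00", "2021-01-01 09:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2021-01-01 08:30:00", "2021-01-01 09:15:00"]
        ),
        "trip_distance": [10.0, 3.0],  # miles
    }
)

trips["dt"] = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# Timedelta / Timedelta gives a float: here, the trip duration in hours.
duration_hours = trips["dt"] / pd.Timedelta(hours=1)
trips["avg_speed"] = trips["trip_distance"] / duration_hours

print(trips[["dt", "trip_distance", "avg_speed"]])
```

Note that a trip with zero duration would produce an infinite or missing speed in this computation, which is one reason very short trips are worth filtering out.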

In [10]:
relevant_cols = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "dt",
    "trip_distance",
    "avg_speed",
    "PULocationID",
    "DOLocationID",
    "passenger_count",
    "total_amount",
]

Reorder columns/attributes based on their relevance.

In [11]:
nyc_taxi = nyc_taxi[
    [
        "tpep_pickup_datetime",
        "tpep_dropoff_datetime",
        "dt",
        "trip_distance",
        "avg_speed",
        "PULocationID",
        "DOLocationID",
        "RatecodeID",
        "passenger_count",
        "total_amount",
        "fare_amount",
        "tip_amount",
        "tolls_amount",
        "extra",
        "mta_tax",
        "improvement_surcharge",
        "congestion_surcharge",
        "airport_fee",
        "payment_type",
        "VendorID",
    ]
].copy()

## Discard trips considered bad data.

Note that we cannot consult business experts on how to identify bad data, so our ability to do so is limited.
The rules below are proposed heuristics for identifying bad data, based on our shallow understanding of this sector.

- Discard trips outside the analyzed time period (January 2021).

In [12]:
period_start = datetime.datetime(year=2021, month=1, day=1)
period_end = datetime.datetime(year=2021, month=2, day=1)

nyc_taxi.drop(
    nyc_taxi[
        (nyc_taxi["tpep_pickup_datetime"] < period_start)
        # A pickup at exactly midnight on February 1 already belongs to February.
        | (nyc_taxi["tpep_pickup_datetime"] >= period_end)
        | (nyc_taxi["tpep_dropoff_datetime"] < period_start)
        | (nyc_taxi["tpep_dropoff_datetime"] > period_end)
    ].index,
    inplace=True,
)
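The period check can also be written as a half-open interval, which makes the boundary behavior explicit. The sketch below uses hypothetical pickup times placed around the period boundaries; it assumes pandas 1.3 or later, where `Series.between` accepts `inclusive="left"`.

```python
import pandas as pd

# Hypothetical pickup times around the period boundaries (illustrative only).
pickups = pd.Series(
    pd.to_datetime(
        [
            "2020-12-31 23:59:59",
            "2021-01-01 00:00:00",
            "2021-01-31 23:59:59",
            "2021-02-01 00:00:00",
        ]
    )
)

period_start = pd.Timestamp("2021-01-01")
period_end = pd.Timestamp("2021-02-01")

# inclusive="left" keeps period_start and excludes period_end.
in_period = pickups.between(period_start, period_end, inclusive="left")
print(in_period.tolist())
```

Only the two middle timestamps fall inside January 2021 under this definition.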

- Discard trips with invalid `VendorID` values.

In [13]:
nyc_taxi.drop(nyc_taxi[nyc_taxi["VendorID"] == 6].index, inplace=True)

- Discard trips with invalid `RatecodeID` values and convert this column/attribute to `int64`.

In [14]:
nyc_taxi.drop(
    nyc_taxi[
        (nyc_taxi["RatecodeID"].isna())
        | (nyc_taxi["RatecodeID"] == 99.0)
    ].index,
    inplace=True,
)

In [15]:
nyc_taxi["RatecodeID"] = nyc_taxi["RatecodeID"].astype("int64")
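The order of these two steps matters: a float column that still contains missing values cannot be cast to `int64`. The following sketch illustrates this on a hypothetical `RatecodeID` sample; the nullable `Int64` dtype is shown only as an alternative for cases where missing values must be kept, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value and one placeholder code (99).
codes = pd.Series([1.0, 2.0, np.nan, 99.0], name="RatecodeID")

# Casting directly fails because NaN has no int64 representation.
try:
    codes.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(f"Direct cast failed: {cast_failed}")

# Dropping missing and placeholder values first makes the cast safe.
valid = codes[codes.notna() & (codes != 99.0)].astype("int64")
print(valid.tolist())

# Alternative: the nullable integer dtype keeps missing values as <NA>.
nullable = codes.astype("Int64")
print(nullable.isna().sum())
```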

- By law, a maximum of 4 passengers are allowed in standard NYC taxis. A child under 7 is allowed to sit on a passenger's lap in the rear seat in addition to the passenger limit. Therefore, discard trips with more than 5 passengers. Also, discard trips with no passengers.

In [16]:
nyc_taxi.drop(nyc_taxi[(nyc_taxi["passenger_count"] > 5) | (nyc_taxi["passenger_count"] == 0)].index, inplace=True)

In [17]:
nyc_taxi["passenger_count"] = nyc_taxi["passenger_count"].astype("int64")

- Discard trips with negative or zero distance.

In [18]:
nyc_taxi.drop(nyc_taxi[nyc_taxi["trip_distance"] <= 0].index, inplace=True)

- Discard trips with a negligible duration (lower than 1 minute).

In [19]:
nyc_taxi.drop(nyc_taxi[nyc_taxi["dt"]/pd.Timedelta(minutes=1) < 1].index, inplace=True)

- Discard trips with a negative average speed (i.e., either the trip distance or the duration is negative, but not both).

In [20]:
nyc_taxi.drop(nyc_taxi[nyc_taxi["avg_speed"] < 0].index, inplace=True)

- Discard trips from or to outside NYC with an average speed higher than 75 mph (max freeway speed limit in the surrounding states).

In [21]:
nyc_taxi.drop(
    nyc_taxi[
        (nyc_taxi["avg_speed"] > 75)
        & (
            (nyc_taxi["PULocationID"] > 263)
            | (nyc_taxi["DOLocationID"] > 263)
        )
    ].index,
    inplace=True,
)

- Discard trips within NYC with an average speed higher than 50 mph (max speed limit in NYC).

In [22]:
nyc_taxi.drop(
    nyc_taxi[
        (nyc_taxi["avg_speed"] > 50)
        & (
            (nyc_taxi["PULocationID"] < 264)
            & (nyc_taxi["DOLocationID"] < 264)
        )
    ].index,
    inplace=True,
)
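The two speed rules above can be combined into a single per-trip speed limit. This is only a sketch on a hypothetical sample: the constant names are introduced here for illustration, and the thresholds and the cutoff at zone ID 263 are the ones used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical names for the thresholds used in the cells above.
LAST_NYC_ZONE_ID = 263  # IDs above this are treated as outside NYC
NYC_LIMIT_MPH = 50
OUTSIDE_LIMIT_MPH = 75

# Hypothetical sample trips (illustrative values only).
trips = pd.DataFrame(
    {
        "PULocationID": [100, 100, 264, 132],
        "DOLocationID": [200, 200, 50, 265],
        "avg_speed": [45.0, 60.0, 60.0, 80.0],
    }
)

# A trip is within NYC only if both endpoints are NYC zones.
within_nyc = (trips["PULocationID"] <= LAST_NYC_ZONE_ID) & (
    trips["DOLocationID"] <= LAST_NYC_ZONE_ID
)

# Pick the applicable limit per trip, then flag trips exceeding it.
speed_limit = np.where(within_nyc, NYC_LIMIT_MPH, OUTSIDE_LIMIT_MPH)
too_fast = trips["avg_speed"] > speed_limit

kept = trips[~too_fast]
print(kept)
```

In this sample, the 60 mph trip within NYC and the 80 mph trip leaving NYC are flagged, while the 60 mph trip starting outside NYC is kept.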

- Discard trips taking more than 1 hour at an average speed lower than 3 mph, on the assumption that such slow trips cannot be explained by traffic jams, even in NYC.

In [23]:
nyc_taxi.drop(
    nyc_taxi[
        (nyc_taxi["dt"]/pd.Timedelta(hours=1) > 1)
        & (nyc_taxi["avg_speed"] < 3)
    ].index,
    inplace=True,
)
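Since these rules are heuristics, it can help to see how many rows each one removes. The sketch below uses a hypothetical helper, `drop_where` (not part of pandas or of this notebook), applied to the slow-trip rule on a made-up sample.

```python
import pandas as pd


def drop_where(df: pd.DataFrame, mask: pd.Series, reason: str) -> pd.DataFrame:
    """Return df without the rows flagged by mask and report how many were removed."""
    n_dropped = int(mask.sum())
    print(f"{reason}: dropped {n_dropped} of {len(df)} rows")
    return df[~mask]


# Hypothetical sample trips (illustrative values only).
trips = pd.DataFrame(
    {
        "dt": pd.to_timedelta(["00:10:00", "02:00:00", "01:30:00"]),
        "trip_distance": [2.0, 1.0, 30.0],  # miles
    }
)
trips["avg_speed"] = trips["trip_distance"] / (trips["dt"] / pd.Timedelta(hours=1))

# Same rule as above: longer than 1 hour and slower than 3 mph.
slow_and_long = (trips["dt"] > pd.Timedelta(hours=1)) & (trips["avg_speed"] < 3)
trips = drop_where(trips, slow_and_long, "slow trips over 1 hour")
```

Unlike `drop(..., inplace=True)`, this variant returns a new DataFrame, so each rule can be inspected before its result is assigned back.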

Check value ranges for the most relevant columns/attributes.

In [None]:
nyc_taxi[relevant_cols].describe()

In [None]:
nyc_taxi.query("passenger_count == 0")[relevant_cols].describe()

Reset index after data processing.

In [26]:
nyc_taxi.reset_index(drop=True, inplace=True)

## Save processed data on disk.

Save the processed NYC taxi data from January 2021 to disk (Parquet format, like the original data).

In [27]:
nyc_taxi.to_parquet(DATA_PATH/"yellow_tripdata_2021-01_prepared.parquet")