<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-6/practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ETL Lab Exercise: NYC Yellow Taxi Trip Dataset (Public CSV Download)

## Problem Statement

In this exercise, you'll build an ETL (Extract-Transform-Load) pipeline in Google Colab, using real-world trip data from New York City's iconic yellow taxis.

---

## Tasks

### 1. Extract
- **Download:**  
  Download a recent sample NYC Yellow Taxi Trip data CSV from this public direct URL:  
  https://data.cityofnewyork.us/resource/kxp8-n2sj.csv?$limit=5000
- **Load:**  
  Load the CSV file into a pandas DataFrame.

### 2. Transform
- **Clean:**  
  - Handle missing or inconsistent values in key columns (like fare amount, trip distance, passenger count).
  - Convert date columns (e.g., pickup/dropoff datetime) to proper datetime format.
  - Remove records that aren't plausible (e.g., zero or negative trip distance or fare).
- **Feature Engineering:**  
  - Compute trip duration in minutes.
  - Create a new feature called `Tip_Percent` as `tip_amount/total_amount * 100`.
  - Categorize each ride by time of day ("Morning", "Afternoon", "Evening", "Night") using the pickup timestamp.

### 3. Load
- **SQLite Storage:**  
  Store the transformed DataFrame in a local SQLite database in Colab.
- **SQL Queries:**  
  - What is the average fare by time-of-day category?
  - What is the distribution of tip percentages?
  - Which hour of the day sees the highest average trip duration?

---

## Constraints

- Use only pandas, Python standard library, and SQLite inside Google Colab.
- Do not use external databases or cloud services.

---

## Dataset

- **NYC Yellow Taxi Trip Data (Sample CSV, January to June 2020):**  
  https://data.cityofnewyork.us/resource/kxp8-n2sj.csv?$limit=5000

---

## Example Challenge Questions

- What is the median trip distance in the data?
- How does tip percentage relate to trip duration?
- What proportion of trips are cash vs. card payments?

---

**Expected Outcome:**  
By completing this lab, you'll have worked through the full pipeline: extracting real urban transport data, cleaning it and engineering new analytics features, and running practical SQL queries, all locally within Colab.


Handle missing or inconsistent values in key columns (like fare amount, trip distance, passenger count).


In [21]:
import pandas as pd
import sqlite3

url = "https://data.cityofnewyork.us/resource/kxp8-n2sj.csv?$limit=5000"
df = pd.read_csv(url)
df.head()


Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2020-01-01T00:28:15.000,2020-01-01T00:33:03.000,1,1.2,1,N,238,239,1,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1,2020-01-01T00:35:39.000,2020-01-01T00:43:04.000,1,1.2,1,N,239,238,1,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1,2020-01-01T00:47:41.000,2020-01-01T00:53:52.000,1,0.6,1,N,238,238,1,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1,2020-01-01T00:55:23.000,2020-01-01T01:00:14.000,1,0.8,1,N,238,151,1,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2,2020-01-01T00:01:58.000,2020-01-01T00:04:16.000,1,0.0,1,N,193,193,2,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0
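
The task asks for a download step before loading, while the cell above reads the URL directly with pandas. A minimal sketch of an explicit download using only the standard library is given below. The helper name `download_csv` and the local file names are hypothetical, and a `file://` URL on toy data stands in for the real endpoint so the example runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path

import pandas as pd


def download_csv(url: str, dest: Path) -> Path:
    """Save the resource at `url` to `dest` and return the path (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest


# Offline stand-in for the dataset URL: a tiny toy CSV served via file://.
tmp_dir = Path(tempfile.mkdtemp())
source = tmp_dir / "source.csv"
source.write_text("trip_distance,fare_amount\n1.2,6.0\n0.6,6.0\n")

local_copy = download_csv(source.as_uri(), tmp_dir / "yellow_taxi_sample.csv")
sample = pd.read_csv(local_copy)
print(sample.shape)
```

In Colab, the same helper could be pointed at the dataset URL from the task description, which keeps a local copy of the raw extract alongside the notebook.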


In [22]:
print(type(df))
# Reassign to df: the later cells all operate on df, so the cleaned frame must replace it.
df = df.dropna(subset=['tolls_amount', 'trip_distance', 'fare_amount', 'passenger_count', 'total_amount'])



<class 'pandas.core.frame.DataFrame'>
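
Dropping missing rows handles missing values but not inconsistent ones, such as text in a numeric column. One possible approach, sketched here on toy data rather than the real download, is to coerce the key columns to numeric first and count what is missing before dropping anything. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy rows (not from the dataset) with one unparseable fare and a few gaps.
toy = pd.DataFrame({
    "fare_amount": ["6.0", "bad", "7.5", None],
    "trip_distance": [1.2, 0.6, None, 2.0],
    "passenger_count": [1, 1, 2, None],
})
key_cols = ["fare_amount", "trip_distance", "passenger_count"]

# Unparseable entries become NaN instead of raising an error.
for col in key_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Report the damage per column, then keep only complete rows.
missing_per_column = toy[key_cols].isna().sum()
cleaned = toy.dropna(subset=key_cols)
print(missing_per_column)
print(len(cleaned))
```

Counting first makes it visible how many rows the cleaning step costs, which is useful when deciding whether to drop or impute.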


Convert date columns (e.g., pickup/dropoff datetime) to proper datetime format.

In [25]:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'], errors='coerce')

In [26]:
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'], errors='coerce')
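
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`. A small check like the following, shown on toy strings rather than the real columns, could confirm how many values were lost in the conversion.

```python
import pandas as pd

# Toy inputs: one valid ISO timestamp, one garbage string, one missing value.
raw = pd.Series(["2020-01-01T00:28:15.000", "not a timestamp", None])
parsed = pd.to_datetime(raw, errors="coerce")

# NaT counts as missing, so isna() reveals the failed conversions.
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)
```

Rows with `NaT` in either timestamp would produce a missing trip duration later, so they are candidates for dropping at this stage.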

Remove records that aren't plausible (e.g., zero or negative trip distance or fare)

In [30]:
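# .copy() gives an independent frame, avoiding SettingWithCopyWarning when columns are added later.
df = df[(df['trip_distance'] > 0) & (df['fare_amount'] > 0) & (df['passenger_count'] > 0)].copy()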


Feature Engineering

In [33]:
# Compute trip duration in minutes.

df['dur_to_min'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
df = df[df['dur_to_min'] > 0].copy()
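
The filter above removes zero and negative durations, but very long durations (close to a full day) can also indicate a meter left running or a bad timestamp. A minimal sketch of an upper bound follows. The 180-minute cutoff `MAX_MINUTES` and the toy rows are assumptions for illustration, not values taken from the exercise.

```python
import pandas as pd

# Toy trips: a normal ride, a ride of almost 24 hours, and a zero-length ride.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2020-01-01 00:28:15", "2020-01-01 08:00:00", "2020-01-01 00:10:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2020-01-01 00:33:03", "2020-01-02 07:20:00", "2020-01-01 00:10:00"]
    ),
})

trips["dur_to_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60

MAX_MINUTES = 180  # assumed plausibility cutoff, tune to the data
mask = (trips["dur_to_min"] > 0) & (trips["dur_to_min"] <= MAX_MINUTES)
plausible = trips[mask].copy()
print(plausible["dur_to_min"].tolist())
```

Whether to cap durations at all is a judgment call; averages per hour are sensitive to a handful of such extreme rows.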


Create a new feature called Tip_Percent as tip_amount/total_amount * 100.

In [35]:
df['Tip_Percent'] = (df['tip_amount'] / df['total_amount']) * 100
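
If `total_amount` is zero for any row, the division above yields `inf` or `NaN`. One way to guard against that, sketched on toy values, is to mask non-positive totals before dividing so the result is simply missing for those rows.

```python
import pandas as pd

# Toy fares: a tipped ride, an untipped ride, and a row with a zero total.
fares = pd.DataFrame({
    "tip_amount": [1.47, 0.0, 2.0],
    "total_amount": [11.27, 4.8, 0.0],
})

# where() keeps positive totals and replaces everything else with NaN.
valid_total = fares["total_amount"].where(fares["total_amount"] > 0)
fares["Tip_Percent"] = fares["tip_amount"] / valid_total * 100
print(fares)
```

Note that this follows the exercise's definition, where the tip is expressed as a share of the total (which itself includes the tip), not as a share of the fare.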


Categorize each ride by time of day ("Morning", "Afternoon", "Evening", "Night") using the pickup timestamp.




In [36]:
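# The pickup column was already converted to datetime above, so this re-parse is a no-op
# and the resulting variable is not used by the later cells.
tpep_pickup_datetime = pd.to_datetime(df['tpep_pickup_datetime'])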


In [37]:
def get_time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

In [40]:
df['Time_of_Day'] = df['tpep_pickup_datetime'].dt.hour.apply(get_time_of_day)
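
The same hour boundaries can also be expressed without a Python-level function by binning the hour with `pd.cut`. The sketch below uses the cut points from `get_time_of_day` (5, 12, 17, 21) and runs on the hours 0 to 23 rather than on the real pickup column. It relies on `ordered=False`, which pandas requires when a label such as "Night" appears more than once.

```python
import pandas as pd

hours = pd.Series(range(24))

# Left-closed bins: [0,5) Night, [5,12) Morning, [12,17) Afternoon,
# [17,21) Evening, [21,24) Night.
time_of_day = pd.cut(
    hours,
    bins=[0, 5, 12, 17, 21, 24],
    right=False,
    labels=["Night", "Morning", "Afternoon", "Evening", "Night"],
    ordered=False,
)
print(time_of_day.value_counts())
```

On a few thousand rows the difference in speed is negligible; the binned form mainly keeps the boundaries in one visible list.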


- **SQLite Storage:**  
  Store the transformed DataFrame in a local SQLite database in Colab.
- **SQL Queries:**  
  - What is the average fare by time-of-day category?
  - What is the distribution of tip percentages?
  - Which hour of the day sees the highest average trip duration?


In [42]:
connection = sqlite3.connect("taxi.db")

In [44]:
df.to_sql('taxi_data', connection, if_exists='replace', index=False)



4851
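
After writing the table, it can be worth confirming that SQLite holds as many rows as the DataFrame. A minimal sketch, using an in-memory database and a toy frame instead of `taxi.db`, might look like this; `contextlib.closing` is used so the connection is closed when the block ends.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy frame standing in for the transformed taxi data.
toy = pd.DataFrame({
    "fare_amount": [6.0, 7.0],
    "Time_of_Day": ["Night", "Night"],
})

with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("taxi_data", conn, if_exists="replace", index=False)
    # Count rows on the SQLite side and compare with the DataFrame.
    stored_rows = conn.execute("SELECT COUNT(*) FROM taxi_data").fetchone()[0]

print(stored_rows == len(toy))
```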

Average fare by time-of-day category

In [45]:
avg_fare_by_time_of_day = pd.read_sql_query("""
    SELECT Time_of_Day, AVG(fare_amount) AS avg_fare
    FROM taxi_data
    GROUP BY Time_of_Day
""", connection)

In [46]:
print(avg_fare_by_time_of_day)

  Time_of_Day   avg_fare
0   Afternoon  10.083333
1     Morning   4.666667
2       Night  12.185770
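
An average on its own hides how many trips back each category. A possible variant of the query, sketched on a toy table in an in-memory database, adds a trip count and sorts by average fare. The column aliases `trips` and `avg_fare` are illustrative choices.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Time_of_Day": ["Night", "Night", "Morning"],
    "fare_amount": [10.0, 14.0, 5.0],
})

with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("taxi_data", conn, index=False)
    summary = pd.read_sql_query(
        """
        SELECT Time_of_Day,
               COUNT(*) AS trips,
               ROUND(AVG(fare_amount), 2) AS avg_fare
        FROM taxi_data
        GROUP BY Time_of_Day
        ORDER BY avg_fare DESC
        """,
        conn,
    )

print(summary)
```

A category backed by only a few trips produces an unstable average, so the count helps judge how much weight each row deserves.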


Distribution of tip percentages

In [47]:
distribution_of_tip_percentages = pd.read_sql_query("""
    SELECT Tip_Percent, COUNT(*) AS count
    FROM taxi_data
    GROUP BY Tip_Percent
""", connection)

In [48]:
print(distribution_of_tip_percentages)

     Tip_Percent  count
0       0.000000   1675
1       0.078064      1
2       0.232019      1
3       0.421941      1
4       0.556328      1
..           ...    ...
557    50.000000      2
558    50.505051      1
559    51.643192      1
560    53.811659      1
561    56.179775      1

[562 rows x 2 columns]
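
Grouping by the exact `Tip_Percent` value returns one row for nearly every distinct percentage, which is hard to read as a distribution. One alternative is to bucket the values in SQL. The sketch below uses 5-point buckets, which is an assumed width, and a toy table in an in-memory database.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy tip percentages, not taken from the dataset.
toy = pd.DataFrame({"Tip_Percent": [0.0, 3.2, 13.04, 14.9, 50.0]})

with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("taxi_data", conn, index=False)
    tip_buckets = pd.read_sql_query(
        """
        SELECT CAST(Tip_Percent / 5 AS INTEGER) * 5 AS bucket_start,
               COUNT(*) AS trips
        FROM taxi_data
        WHERE Tip_Percent IS NOT NULL
        GROUP BY bucket_start
        ORDER BY bucket_start
        """,
        conn,
    )

print(tip_buckets)
```

Each `bucket_start` value b covers tips from b up to, but not including, b + 5 percent.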


Which hour of the day sees the highest average trip duration?

In [51]:
hour_with_highest_avg_trip_duration = pd.read_sql_query("""
    SELECT tpep_pickup_datetime, AVG(dur_to_min) AS avg_trip_duration
    FROM taxi_data
    GROUP BY tpep_pickup_datetime""", connection)

In [52]:
print(hour_with_highest_avg_trip_duration)

     tpep_pickup_datetime  avg_trip_duration
0     2019-12-31 13:58:16          15.350000
1     2019-12-31 14:08:25           6.066667
2     2019-12-31 14:20:03        1405.983333
3     2019-12-31 15:57:03           7.883333
4     2019-12-31 16:18:17        1399.166667
...                   ...                ...
2654  2020-01-01 02:37:26           0.266667
2655  2020-01-01 05:52:32           5.850000
2656  2020-01-01 08:42:19           2.500000
2657  2020-01-01 11:36:59           2.283333
2658  2020-01-01 12:41:48           1.733333

[2659 rows x 2 columns]
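
The query above groups by the full pickup timestamp, so it returns one row per distinct pickup time rather than one per hour. A sketch of an hour-level version follows, using SQLite's `strftime('%H', ...)` to extract the hour. It assumes the timestamps are stored as `YYYY-MM-DD HH:MM:SS` text, as the output above suggests, and it runs on a toy table in an in-memory database.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy rows with text timestamps in the format SQLite's date functions accept.
toy = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2020-01-01 00:28:15",
        "2020-01-01 00:47:41",
        "2020-01-01 13:05:00",
    ],
    "dur_to_min": [4.8, 6.2, 20.0],
})

with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("taxi_data", conn, index=False)
    by_hour = pd.read_sql_query(
        """
        SELECT CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) AS pickup_hour,
               COUNT(*) AS trips,
               AVG(dur_to_min) AS avg_trip_duration
        FROM taxi_data
        GROUP BY pickup_hour
        ORDER BY avg_trip_duration DESC
        """,
        conn,
    )

print(by_hour)
```

The first row of the result is the hour with the highest average duration; adding `LIMIT 1` would return only that row.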




What is the median trip distance in the data?



In [53]:
median_trip_distance = df['trip_distance'].median()
print(median_trip_distance)

1.9


How does tip percentage relate to trip duration?
    

In [55]:
tip_percentage = df['Tip_Percent']
trip_duration = df['dur_to_min']
correlation = tip_percentage.corr(trip_duration)
print(correlation)

-0.018877769279166675
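
A Pearson correlation near zero only rules out a linear relationship. As a complementary view, the mean tip percentage can be compared across duration bands. The sketch below does this on toy data; the band edges (10, 20, and 60 minutes) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy trips, not taken from the dataset.
trips = pd.DataFrame({
    "dur_to_min": [5.0, 8.0, 15.0, 30.0],
    "Tip_Percent": [10.0, 20.0, 15.0, 5.0],
})

# Right-closed bands: (0, 10], (10, 20], (20, 60] minutes.
duration_band = pd.cut(trips["dur_to_min"], bins=[0, 10, 20, 60])
mean_tip_by_band = trips.groupby(duration_band, observed=True)["Tip_Percent"].mean()
print(mean_tip_by_band)
```

If the band means differ noticeably while the correlation stays near zero, the relationship is probably not linear.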


What proportion of trips are cash vs. card payments?

In [57]:
cash_vs_card = pd.read_sql_query("""
SELECT payment_type, COUNT(*) AS count
FROM taxi_data
GROUP BY payment_type
""", connection)

In [60]:
payment_map = {
    1: 'Card',
    2: 'Cash',
    3: 'No Charge',
    0: 'Unknown'
}

df['payment_type_label'] = df['payment_type'].map(payment_map)

cash_vs_card = df['payment_type_label'].value_counts(normalize=True)
print(cash_vs_card)

payment_type_label
Card         0.679257
Cash         0.315583
No Charge    0.005160
Name: proportion, dtype: float64
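
The proportions can also be computed directly in SQL, which keeps this answer in the same layer as the other queries. A minimal sketch on a toy table is shown below, where codes 1 and 2 stand for card and cash as in `payment_map`. Multiplying by `1.0` forces floating-point division in SQLite.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy payment codes: three card trips (1) and one cash trip (2).
toy = pd.DataFrame({"payment_type": [1, 1, 1, 2]})

with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("taxi_data", conn, index=False)
    shares = pd.read_sql_query(
        """
        SELECT payment_type,
               COUNT(*) * 1.0 / (SELECT COUNT(*) FROM taxi_data) AS proportion
        FROM taxi_data
        GROUP BY payment_type
        ORDER BY proportion DESC
        """,
        conn,
    )

print(shares)
```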


