# Data Ingestion Validation

This notebook validates the data ingestion process by:
- Loading data from DuckDB
- Checking row counts per table
- Verifying date ranges
- Checking for missing values
- Displaying sample records
- Computing basic statistics (min/max dates, counts by month)

In [1]:
import duckdb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set visualization defaults
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Connect to DuckDB
db_path = Path('../data/nyc_mobility.duckdb')
conn = duckdb.connect(str(db_path))

print(f"Connected to DuckDB at: {db_path}")
print(f"Database exists: {db_path.exists()}")

Connected to DuckDB at: ../data/nyc_mobility.duckdb
Database exists: True


## 1. Check Available Tables

First, let's see what tables were created by DLT.

In [2]:
# Get list of all tables
tables_df = conn.execute("""
    SELECT table_schema, table_name 
    FROM information_schema.tables 
    WHERE table_schema = 'raw_data'
    ORDER BY table_name
""").df()

print("Available tables in raw_data schema:")
print(tables_df)

Available tables in raw_data schema:
  table_schema           table_name
0     raw_data           _dlt_loads
1     raw_data  _dlt_pipeline_state
2     raw_data         _dlt_version
3     raw_data             fhv_taxi
4     raw_data       hourly_weather
5     raw_data                trips
6     raw_data          yellow_taxi
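
A listing like this is easy to skim past when a table is silently missing. A minimal sketch of an explicit check follows, assuming the four source tables queried later in this notebook are the expected set. `EXPECTED_TABLES` and `missing_tables` are illustrative names, not part of the pipeline.

```python
# Hypothetical helper: report expected tables that are absent from the schema listing.
EXPECTED_TABLES = {"yellow_taxi", "fhv_taxi", "trips", "hourly_weather"}


def missing_tables(found, expected=EXPECTED_TABLES):
    """Return the expected table names that do not appear in `found`, sorted."""
    return sorted(set(expected) - set(found))


# Table names as printed by the cell above
found = [
    "_dlt_loads", "_dlt_pipeline_state", "_dlt_version",
    "fhv_taxi", "hourly_weather", "trips", "yellow_taxi",
]
print(missing_tables(found))  # prints []
```

In the notebook itself, `found` could be taken from `tables_df['table_name']` rather than typed by hand.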


## 2. Row Counts Per Table

Check how many records were loaded into each table.

In [3]:
# Get row counts for each table (display label -> table name in raw_data)
tables = {
    'Yellow Taxi': 'yellow_taxi',
    'FHV Taxi': 'fhv_taxi',
    'CitiBike Trips': 'trips',
    'Hourly Weather': 'hourly_weather',
}
row_counts = {}

for label, table_name in tables.items():
    try:
        count = conn.execute(f"SELECT COUNT(*) FROM raw_data.{table_name}").fetchone()[0]
        row_counts[label] = f"{count:,}"
    except duckdb.CatalogException:
        # Raised by DuckDB when the table does not exist
        row_counts[label] = "Table not found"

print("\nRow Counts:")
for table, count in row_counts.items():
    print(f"  {table}: {count}")


Row Counts:
  Yellow Taxi: 8,610,143
  FHV Taxi: 2,446,615
  CitiBike Trips: 1,417,052
  Hourly Weather: 1,464


## 3. Yellow Taxi Data Validation

In [4]:
# Check Yellow Taxi schema
yellow_schema = conn.execute("""
    SELECT column_name, data_type 
    FROM information_schema.columns 
    WHERE table_schema = 'raw_data' AND table_name = 'yellow_taxi'
    ORDER BY ordinal_position
""").df()

print("Yellow Taxi Schema:")
print(yellow_schema)

# Sample records
print("\nSample Yellow Taxi Records:")
yellow_sample = conn.execute("SELECT * FROM raw_data.yellow_taxi LIMIT 5").df()
display(yellow_sample)

Yellow Taxi Schema:
              column_name                 data_type
0               vendor_id                    BIGINT
1    tpep_pickup_datetime  TIMESTAMP WITH TIME ZONE
2   tpep_dropoff_datetime  TIMESTAMP WITH TIME ZONE
3         passenger_count                    BIGINT
4           trip_distance                    DOUBLE
5             ratecode_id                    BIGINT
6      store_and_fwd_flag                   VARCHAR
7          pu_location_id                    BIGINT
8          do_location_id                    BIGINT
9            payment_type                    BIGINT
10            fare_amount                    DOUBLE
11                  extra                    DOUBLE
12                mta_tax                    DOUBLE
13             tip_amount                    DOUBLE
14           tolls_amount                    DOUBLE
15  improvement_surcharge                    DOUBLE
16           total_amount                    DOUBLE
17   congestion_surcharge                   

Unnamed: 0,vendor_id,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecode_id,store_and_fwd_flag,pu_location_id,do_location_id,payment_type,...,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,_dlt_load_id,_dlt_id,cbd_congestion_fee
0,1,2025-09-30 17:15:32-07:00,2025-09-30 18:04:03-07:00,1,17.2,2,N,132,107,1,...,0.5,0.0,6.94,1.0,83.44,2.5,1.75,1766193253.4136791,aXQCVCR52P2IjQ,0.75
1,7,2025-09-30 17:00:08-07:00,2025-09-30 17:00:08-07:00,1,5.0,1,N,107,225,1,...,0.5,8.49,0.0,1.0,42.44,2.5,0.0,1766193253.4136791,rzYq7xobYQLi3Q,0.75
2,2,2025-09-30 17:08:54-07:00,2025-09-30 17:14:44-07:00,1,2.75,1,N,263,229,1,...,0.5,3.71,0.0,1.0,22.26,2.5,0.0,1766193253.4136791,nPwH22/GWO/flg,0.75
3,1,2025-09-30 17:58:48-07:00,2025-09-30 18:04:40-07:00,1,1.3,1,N,211,231,2,...,0.5,0.0,0.0,1.0,13.65,2.5,0.0,1766193253.4136791,tnufog4iEJUNig,0.75
4,2,2025-09-30 17:39:51-07:00,2025-09-30 17:49:40-07:00,1,2.88,1,N,230,151,1,...,0.5,3.99,0.0,1.0,23.94,2.5,0.0,1766193253.4136791,21oAUuaYowX6TQ,0.75


In [5]:
# Date range and statistics
yellow_stats = conn.execute("""
    SELECT 
        MIN(tpep_pickup_datetime) as min_date,
        MAX(tpep_pickup_datetime) as max_date,
        COUNT(*) as total_trips,
        AVG(trip_distance) as avg_distance,
        AVG(total_amount) as avg_fare
    FROM raw_data.yellow_taxi
""").df()

print("Yellow Taxi Statistics:")
display(yellow_stats)

# Monthly breakdown
yellow_monthly = conn.execute("""
    SELECT 
        EXTRACT(YEAR FROM tpep_pickup_datetime) as year,
        EXTRACT(MONTH FROM tpep_pickup_datetime) as month,
        COUNT(*) as trip_count
    FROM raw_data.yellow_taxi
    GROUP BY year, month
    ORDER BY year, month
""").df()

print("\nYellow Taxi Monthly Breakdown:")
display(yellow_monthly)

Yellow Taxi Statistics:


Unnamed: 0,min_date,max_date,total_trips,avg_distance,avg_fare
0,2008-12-31 16:04:21-07:00,2025-11-30 16:59:59-07:00,8610143,6.61621,26.313773



Yellow Taxi Monthly Breakdown:


Unnamed: 0,year,month,trip_count
0,2008,12,2
1,2009,1,1
2,2025,9,8451
3,2025,10,4464450
4,2025,11,4137239
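
The breakdown contains three trips dated 2008-2009, far outside the autumn 2025 period covered by the other tables. A minimal sketch of flagging such rows automatically follows. The expected months are an assumption read off the date ranges of the other tables as displayed in this notebook, and `unexpected_months` is an illustrative helper.

```python
# Monthly breakdown as displayed above: (year, month, trip_count)
yellow_monthly_rows = [
    (2008, 12, 2),
    (2009, 1, 1),
    (2025, 9, 8451),
    (2025, 10, 4464450),
    (2025, 11, 4137239),
]

# Assumption: the months that appear in the other tables' date ranges
EXPECTED_MONTHS = {(2025, 9), (2025, 10), (2025, 11)}


def unexpected_months(rows, expected=EXPECTED_MONTHS):
    """Return the (year, month, count) rows whose month is not expected."""
    return [(y, m, n) for y, m, n in rows if (y, m) not in expected]


outliers = unexpected_months(yellow_monthly_rows)
print(outliers)                        # [(2008, 12, 2), (2009, 1, 1)]
print(sum(n for _, _, n in outliers))  # 3
```

In the notebook, the rows could come from `yellow_monthly.itertuples(index=False)`. Whether to drop or keep the flagged trips is a decision for the data quality notebook.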


## 4. FHV Taxi Data Validation

In [6]:
# Check FHV schema
fhv_schema = conn.execute("""
    SELECT column_name, data_type 
    FROM information_schema.columns 
    WHERE table_schema = 'raw_data' AND table_name = 'fhv_taxi'
    ORDER BY ordinal_position
""").df()

print("FHV Taxi Schema:")
print(fhv_schema)

# Sample records
print("\nSample FHV Records:")
fhv_sample = conn.execute("SELECT * FROM raw_data.fhv_taxi LIMIT 5").df()
display(fhv_sample)

FHV Taxi Schema:
              column_name                 data_type
0    dispatching_base_num                   VARCHAR
1         pickup_datetime  TIMESTAMP WITH TIME ZONE
2       drop_off_datetime  TIMESTAMP WITH TIME ZONE
3  affiliated_base_number                   VARCHAR
4            _dlt_load_id                   VARCHAR
5                 _dlt_id                   VARCHAR
6          d_olocation_id                    BIGINT
7          p_ulocation_id                    BIGINT

Sample FHV Records:


Unnamed: 0,dispatching_base_num,pickup_datetime,drop_off_datetime,affiliated_base_number,_dlt_load_id,_dlt_id,d_olocation_id,p_ulocation_id
0,B00009,2025-09-30 17:04:00-07:00,2025-09-30 17:26:00-07:00,B00009,1766193253.4136791,bP+Wt5lar0f+Vg,,
1,B00009,2025-09-30 17:04:00-07:00,2025-09-30 17:28:00-07:00,B00009,1766193253.4136791,Q2yg+MaNm60IgQ,,
2,B00009,2025-09-30 17:22:00-07:00,2025-09-30 17:35:00-07:00,B00009,1766193253.4136791,5HYItjnu1GeUAA,,
3,B00009,2025-09-30 17:22:00-07:00,2025-09-30 17:49:00-07:00,B00009,1766193253.4136791,JRnO/UdNMFytXg,,
4,B00013,2025-09-30 17:05:28-07:00,2025-09-30 17:47:00-07:00,B00014,1766193253.4136791,V+xx1SYbN2SnYw,,
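
In these sample rows both location ID columns are empty, and the introduction lists a missing-value check that none of the cells perform. A minimal sketch of a per-column NULL count is shown here. `null_count_sql` is an illustrative helper that only builds the query text; it relies on the standard SQL behavior that `COUNT(col)` skips NULLs.

```python
def null_count_sql(table, columns):
    """Build a query returning the total row count and a NULL count per column."""
    # COUNT(col) ignores NULLs, so COUNT(*) - COUNT(col) is the number of NULLs
    parts = [f'COUNT(*) - COUNT("{c}") AS "{c}_nulls"' for c in columns]
    return f"SELECT COUNT(*) AS total_rows, {', '.join(parts)} FROM {table}"


sql = null_count_sql("raw_data.fhv_taxi", ["d_olocation_id", "p_ulocation_id"])
print(sql)
```

The resulting string could be run with `conn.execute(sql).df()`. The same helper applies to the other tables by passing their column names.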


In [7]:
# Date range and statistics
fhv_stats = conn.execute("""
    SELECT 
        MIN(pickup_datetime) as min_date,
        MAX(pickup_datetime) as max_date,
        COUNT(*) as total_trips
    FROM raw_data.fhv_taxi
""").df()

print("FHV Taxi Statistics:")
display(fhv_stats)

# Monthly breakdown
fhv_monthly = conn.execute("""
    SELECT 
        EXTRACT(YEAR FROM pickup_datetime) as year,
        EXTRACT(MONTH FROM pickup_datetime) as month,
        COUNT(*) as trip_count
    FROM raw_data.fhv_taxi
    GROUP BY year, month
    ORDER BY year, month
""").df()

print("\nFHV Taxi Monthly Breakdown:")
display(fhv_monthly)

FHV Taxi Statistics:


Unnamed: 0,min_date,max_date,total_trips
0,2025-09-30 17:00:00-07:00,2025-10-31 16:59:58-07:00,2446615



FHV Taxi Monthly Breakdown:


Unnamed: 0,year,month,trip_count
0,2025,9,8199
1,2025,10,2438416
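
The September rows are probably an artifact of the session time zone rather than real September trips: the timestamps are displayed at a -07:00 offset, and `EXTRACT` on a `TIMESTAMP WITH TIME ZONE` uses that same session zone. The sketch below converts the displayed minimum and maximum pickup times to UTC using only the standard library. That the stored values are meant to be read in UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Offset shown in the notebook output (-07:00)
session_tz = timezone(timedelta(hours=-7))

# FHV minimum and maximum pickup times as displayed above
min_pickup = datetime(2025, 9, 30, 17, 0, 0, tzinfo=session_tz)
max_pickup = datetime(2025, 10, 31, 16, 59, 58, tzinfo=session_tz)

min_utc = min_pickup.astimezone(timezone.utc)
max_utc = max_pickup.astimezone(timezone.utc)

print(min_utc.isoformat())  # 2025-10-01T00:00:00+00:00
print(max_utc.isoformat())  # 2025-10-31T23:59:58+00:00
```

In UTC the range covers exactly the month of October. If that reading is right, running `SET TimeZone = 'UTC'` on the connection before the monthly query would likely collapse the breakdown to a single October row. That statement depends on DuckDB's ICU extension and has not been run here.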


## 5. CitiBike Data Validation

In [8]:
# Check CitiBike schema
citibike_schema = conn.execute("""
    SELECT column_name, data_type 
    FROM information_schema.columns 
    WHERE table_schema = 'raw_data' AND table_name = 'trips'
    ORDER BY ordinal_position
""").df()

print("CitiBike Schema:")
print(citibike_schema)

# Sample records
print("\nSample CitiBike Records:")
citibike_sample = conn.execute("SELECT * FROM raw_data.trips LIMIT 5").df()
display(citibike_sample)

CitiBike Schema:
           column_name                 data_type
0              ride_id                   VARCHAR
1        rideable_type                   VARCHAR
2           started_at  TIMESTAMP WITH TIME ZONE
3             ended_at  TIMESTAMP WITH TIME ZONE
4   start_station_name                   VARCHAR
5     start_station_id                   VARCHAR
6     end_station_name                   VARCHAR
7       end_station_id                   VARCHAR
8            start_lat                    DOUBLE
9            start_lng                    DOUBLE
10             end_lat                    DOUBLE
11             end_lng                    DOUBLE
12       member_casual                   VARCHAR
13        _dlt_load_id                   VARCHAR
14             _dlt_id                   VARCHAR

Sample CitiBike Records:


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,_dlt_load_id,_dlt_id
0,557FE56828BC7B72,electric_bike,2025-10-02 07:01:28.598000-07:00,2025-10-02 07:06:46.526000-07:00,E 2 St & 2 Ave,5593.02,Pike St & E Broadway,5270.05,40.725029,-73.990697,40.714067,-73.992939,member,1766194152.829321,EYIMzymGt0qvzQ
1,0C147B351D46C123,electric_bike,2025-10-04 12:10:31.195000-07:00,2025-10-04 12:11:42.659000-07:00,4 Ave & 9 St,3955.05,4 Ave & 9 St,3955.05,40.670513,-73.988766,40.670513,-73.988766,member,1766194152.829321,Pk7wxq1laByRpQ
2,F95C5036724DB9E8,electric_bike,2025-10-01 09:33:41.956000-07:00,2025-10-01 09:47:40.976000-07:00,Pier 40 - Hudson River Park,5696.03,Pike St & E Broadway,5270.05,40.727714,-74.011296,40.714067,-73.992939,member,1766194152.829321,OISE7UfHwIVh6A
3,2E0404A1CDF25730,electric_bike,2025-10-05 18:00:01.922000-07:00,2025-10-05 18:25:45.023000-07:00,1 Ave & E 30 St,6079.03,1 Ave & E 30 St,6079.03,40.741444,-73.975361,40.741444,-73.975361,member,1766194152.829321,uFppN7Vx0FtH1Q
4,33F394645EF291E3,classic_bike,2025-10-06 11:47:27.905000-07:00,2025-10-06 11:54:54.480000-07:00,74 St & 37 Ave,6332.06,77 St & 31 Ave,6718.02,40.74908,-73.89172,40.75886,-73.89081,member,1766194152.829321,x8SnddV4Px9WMw


In [9]:
# Date range and statistics (note: CitiBike column names vary by year)
# Try common column names for start time
try:
    citibike_stats = conn.execute("""
        SELECT 
            MIN(started_at) as min_date,
            MAX(started_at) as max_date,
            COUNT(*) as total_trips
        FROM raw_data.trips
    """).df()
    date_col = 'started_at'
except duckdb.Error:  # e.g. the started_at column does not exist
    try:
        citibike_stats = conn.execute("""
            SELECT 
                MIN(starttime) as min_date,
                MAX(starttime) as max_date,
                COUNT(*) as total_trips
            FROM raw_data.trips
        """).df()
        date_col = 'starttime'
    except duckdb.Error:  # neither candidate column exists
        print("Unable to determine date column")
        citibike_stats = None
        date_col = None

if citibike_stats is not None:
    print("CitiBike Statistics:")
    display(citibike_stats)
    
    # Monthly breakdown
    citibike_monthly = conn.execute(f"""
        SELECT 
            EXTRACT(YEAR FROM {date_col}) as year,
            EXTRACT(MONTH FROM {date_col}) as month,
            COUNT(*) as trip_count
        FROM raw_data.trips
        GROUP BY year, month
        ORDER BY year, month
    """).df()
    
    print("\nCitiBike Monthly Breakdown:")
    display(citibike_monthly)

CitiBike Statistics:


Unnamed: 0,min_date,max_date,total_trips
0,2025-09-30 07:53:40.456000-07:00,2025-11-30 16:54:43.971000-07:00,1417052



CitiBike Monthly Breakdown:


Unnamed: 0,year,month,trip_count
0,2025,9,624
1,2025,10,999376
2,2025,11,417052
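
The cell above finds the start-time column by running a query and catching the failure. An alternative is to choose the column from the schema that was already fetched, so no query has to fail first. This is a minimal sketch; `pick_date_column` and its candidate list are illustrative.

```python
def pick_date_column(columns, candidates=("started_at", "starttime")):
    """Return the first candidate name present in `columns`, or None."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


# Column names from the CitiBike schema printed earlier
columns = [
    "ride_id", "rideable_type", "started_at", "ended_at",
    "start_station_name", "start_station_id",
    "end_station_name", "end_station_id",
    "start_lat", "start_lng", "end_lat", "end_lng",
    "member_casual", "_dlt_load_id", "_dlt_id",
]
print(pick_date_column(columns))  # started_at
```

In the notebook, `citibike_schema['column_name']` could be passed in place of the hand-written list.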


## 6. Weather Data Validation

In [10]:
# Check Weather schema
weather_schema = conn.execute("""
    SELECT column_name, data_type 
    FROM information_schema.columns 
    WHERE table_schema = 'raw_data' AND table_name = 'hourly_weather'
    ORDER BY ordinal_position
""").df()

print("Weather Schema:")
print(weather_schema)

# Sample records
print("\nSample Weather Records:")
weather_sample = conn.execute("SELECT * FROM raw_data.hourly_weather LIMIT 5").df()
display(weather_sample)

Weather Schema:
       column_name                 data_type
0        timestamp  TIMESTAMP WITH TIME ZONE
1             temp                    DOUBLE
2       feels_like                    DOUBLE
3         humidity                    BIGINT
4        dew_point                    DOUBLE
5    precipitation                    DOUBLE
6             rain                    DOUBLE
7         snowfall                    DOUBLE
8      cloud_cover                    BIGINT
9         pressure                    DOUBLE
10      wind_speed                    DOUBLE
11  wind_direction                    BIGINT
12    _dlt_load_id                   VARCHAR
13         _dlt_id                   VARCHAR

Sample Weather Records:


Unnamed: 0,timestamp,temp,feels_like,humidity,dew_point,precipitation,rain,snowfall,cloud_cover,pressure,wind_speed,wind_direction,_dlt_load_id,_dlt_id
0,2025-09-30 17:00:00-07:00,17.1,14.8,60,9.1,0.0,0.0,0.0,43,1018.5,3.86,10,1766196025.227349,C1E4LlDsmkq8ww
1,2025-09-30 18:00:00-07:00,16.2,13.5,58,7.9,0.0,0.0,0.0,57,1018.9,4.0,16,1766196025.227349,zY6vWG1RSde2tA
2,2025-09-30 19:00:00-07:00,14.9,12.2,59,6.9,0.0,0.0,0.0,96,1019.1,3.6,14,1766196025.227349,NE87bOsVSNDs3g
3,2025-09-30 20:00:00-07:00,13.8,11.1,61,6.5,0.0,0.0,0.0,91,1019.7,3.59,13,1766196025.227349,iZlzxo00VpnCFw
4,2025-09-30 21:00:00-07:00,13.0,10.1,63,6.1,0.0,0.0,0.0,1,1020.1,3.8,16,1766196025.227349,dwHiel+PZ4DWuQ


In [11]:
# Date range and statistics
weather_stats = conn.execute("""
    SELECT 
        MIN(timestamp) as min_date,
        MAX(timestamp) as max_date,
        COUNT(*) as total_records,
        AVG(temp) as avg_temp_celsius,
        AVG(humidity) as avg_humidity,
        AVG(wind_speed) as avg_wind_speed
    FROM raw_data.hourly_weather
""").df()

print("Weather Statistics:")
display(weather_stats)

# Daily breakdown
weather_daily = conn.execute("""
    SELECT 
        CAST(timestamp AS DATE) as date,
        COUNT(*) as hourly_records,
        AVG(temp) as avg_temp,
        MIN(temp) as min_temp,
        MAX(temp) as max_temp
    FROM raw_data.hourly_weather
    GROUP BY date
    ORDER BY date
    LIMIT 10
""").df()

print("\nWeather Daily Breakdown (first 10 days):")
display(weather_daily)

Weather Statistics:


Unnamed: 0,min_date,max_date,total_records,avg_temp_celsius,avg_humidity,avg_wind_speed
0,2025-09-30 17:00:00-07:00,2025-11-30 16:00:00-07:00,1464,10.527254,67.526639,2.953402



Weather Daily Breakdown (first 10 days):


Unnamed: 0,date,hourly_records,avg_temp,min_temp,max_temp
0,2025-09-30,7,14.214286,12.0,17.1
1,2025-10-01,24,14.883333,9.5,20.2
2,2025-10-02,24,13.8625,8.9,18.9
3,2025-10-03,24,16.379167,10.5,22.0
4,2025-10-04,24,20.35,13.4,27.7
5,2025-10-05,24,21.529167,14.5,29.7
6,2025-10-06,24,20.45,15.4,26.7
7,2025-10-07,24,22.820833,17.7,27.3
8,2025-10-08,24,15.891667,9.6,20.7
9,2025-10-09,24,10.120833,4.7,14.9
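
The record count can be checked against the date range directly: an hourly series with no gaps has one record per hour between its first and last timestamp, inclusive. A minimal sketch using the values displayed above follows; `expected_hourly_records` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone


def expected_hourly_records(first, last):
    """Number of hourly timestamps from `first` to `last`, inclusive."""
    return int((last - first) / timedelta(hours=1)) + 1


tz = timezone(timedelta(hours=-7))  # offset shown in the output above
first = datetime(2025, 9, 30, 17, 0, tzinfo=tz)
last = datetime(2025, 11, 30, 16, 0, tzinfo=tz)

expected = expected_hourly_records(first, last)
print(expected)  # 1464
```

The result equals `total_records`, which is consistent with a gap-free series as long as no timestamp is duplicated. The 7 records on the first day follow from the same -07:00 offset: the series starts at 17:00 local time.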


## 7. DLT Metadata Tables

Check DLT's metadata tables to see load information.

In [12]:
# Check DLT loads
try:
    dlt_loads = conn.execute("""
        SELECT * FROM raw_data._dlt_loads 
        ORDER BY inserted_at DESC 
        LIMIT 10
    """).df()
    
    print("Recent DLT Loads:")
    display(dlt_loads)
except duckdb.CatalogException:  # raised when _dlt_loads does not exist
    print("DLT metadata table not found - data may not have been loaded yet")

Recent DLT Loads:


Unnamed: 0,load_id,schema_name,status,inserted_at,schema_version_hash
0,1766196026.757368,nyc_tlc,0,2025-12-19 19:00:27.952016-07:00,mUxCxexm8XfRy8Httmh0rKcG59Ma6N0V28mLa7ZmQaI=
1,1766196025.227349,weather,0,2025-12-19 19:00:27.693572-07:00,sdbCPTwQW164lESFVf4wmX6XSSqMma+v55g9E5Gw1B0=
2,1766194152.829321,citibike,0,2025-12-19 18:33:04.618899-07:00,Az5MdUFo6UBlomJkXG43VWQCCsbn7VnZYS0m94bQAjc=
3,1766193253.4136791,nyc_tlc,0,2025-12-19 18:29:12.804222-07:00,mUxCxexm8XfRy8Httmh0rKcG59Ma6N0V28mLa7ZmQaI=
4,1766186168.559581,nyc_tlc,0,2025-12-19 16:22:13.671716-07:00,zlwwUccgqoVrtIoZY4LkA+FGakcQ57RAzsXe/FC5tC0=
5,1766185950.495216,nyc_tlc,0,2025-12-19 16:12:31.024023-07:00,TbJrRSNtBcb0MExc9yDCsJSu2WQrWZUQbESsBbO3YCw=
6,1766185948.209138,nyc_tlc,0,2025-12-19 16:12:29.587430-07:00,TbJrRSNtBcb0MExc9yDCsJSu2WQrWZUQbESsBbO3YCw=
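
The load history can be summarized rather than read row by row. The sketch below counts loads per schema and lists any with a non-zero status. It assumes, following dlt's documentation, that a `status` of 0 marks a completed load; that is worth verifying against the installed version.

```python
from collections import Counter

# (load_id, schema_name, status) as displayed above
loads = [
    ("1766196026.757368", "nyc_tlc", 0),
    ("1766196025.227349", "weather", 0),
    ("1766194152.829321", "citibike", 0),
    ("1766193253.4136791", "nyc_tlc", 0),
    ("1766186168.559581", "nyc_tlc", 0),
    ("1766185950.495216", "nyc_tlc", 0),
    ("1766185948.209138", "nyc_tlc", 0),
]

# Assumption: status 0 means the load completed
incomplete = [load for load in loads if load[2] != 0]
loads_per_schema = Counter(schema for _, schema, _ in loads)

print(incomplete)              # []
print(dict(loads_per_schema))  # {'nyc_tlc': 5, 'weather': 1, 'citibike': 1}
```

The `nyc_tlc` schema was loaded five times. Whether repeated runs appended duplicate rows depends on the pipeline's write disposition, which this notebook does not show; grouping the taxi tables by `_dlt_load_id` would be one way to find out.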


## 8. Data Completeness Summary

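The estimates below were written for Q4 2023 (Oct-Dec). The data loaded in this run covers roughly October-November 2025 (FHV: October 2025 only), so the row counts above are not directly comparable to them:
- **Yellow Taxi**: ~3M trips (Oct-Dec 2023)
- **FHV**: ~15M trips (Oct-Dec 2023)
- **CitiBike**: ~1.5M trips (Oct-Dec 2023)
- **Weather**: 2,208 hourly records (92 days × 24 hours)

For the period actually loaded, the weather count is consistent: 61 days × 24 hours = 1,464 hourly records.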

In [13]:
print("\n" + "="*80)
print("DATA INGESTION VALIDATION SUMMARY")
print("="*80)

for table, count in row_counts.items():
    print(f"  ✓ {table}: {count} records")

print("\n" + "="*80)
print("Next Steps:")
print("  1. Review data quality in 02_data_quality_assessment.ipynb")
print("  2. Perform exploratory analysis in 03_exploratory_analysis.ipynb")
print("="*80)


DATA INGESTION VALIDATION SUMMARY
  ✓ Yellow Taxi: 8,610,143 records
  ✓ FHV Taxi: 2,446,615 records
  ✓ CitiBike Trips: 1,417,052 records
  ✓ Hourly Weather: 1,464 records

Next Steps:
  1. Review data quality in 02_data_quality_assessment.ipynb
  2. Perform exploratory analysis in 03_exploratory_analysis.ipynb


In [14]:
# Close connection
conn.close()
print("\nConnection closed.")


Connection closed.
