# Enhanced DLT Pipeline - NYC Taxi Streaming Simulation

This Delta Live Tables (DLT) pipeline implements:
- Landing Zone: Raw data ingestion from a simulated stream
- Bronze Layer: Date-transformed taxi trip data
- Data quality validations and monitoring

The pipeline is defined in `resources/ghithub_trends_digger.pipeline.yml`.


In [None]:
# Import required libraries
import dlt
import sys
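# Note: importing min and max here shadows Python's built-in min() and max()
# for the rest of the notebook; they are used below as Spark aggregates.
from pyspark.sql.functions import col, when, coalesce, current_timestamp, lit, count, countDistinct, avg, min, max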
from pyspark.sql.types import TimestampType, DoubleType, IntegerType

# Add src path for custom modules
sys.path.append(spark.conf.get("bundle.sourcePath", "."))

# Import custom data simulation components
from data_simulation.streaming_simulator import simulate_taxi_stream
from data_simulation.date_mapper import get_date_mapper


In [None]:
# LANDING ZONE - Raw data ingestion
@dlt.table(
    name="taxi_landing_zone",
    comment="Landing zone for simulated NYC taxi trip data with metadata",
    table_properties={
        "quality": "bronze",
        "layer": "landing"
    }
)
def taxi_landing_zone():
    """
    Landing zone table for raw NYC taxi data simulation.
    Maps historical Jan/Feb 2016 data to current date/hour.
    """
    return simulate_taxi_stream()
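
The date mapping mentioned in the docstring lives in `data_simulation.date_mapper`, which is not shown in this notebook. As a rough sketch of the idea only, and assuming the mapper shifts every historical timestamp by one fixed offset so that a chosen 2016 reference instant lands on the current hour, the logic could look like this. The function name `shift_to_current` and the reference date `HISTORICAL_START` are hypothetical, not the project's actual API.

```python
from datetime import datetime

# Hypothetical reference point: assumed start of the historical dataset.
HISTORICAL_START = datetime(2016, 1, 1, 0, 0, 0)


def shift_to_current(ts, now, historical_start=HISTORICAL_START):
    """Shift a historical timestamp so that historical_start maps to the current hour."""
    # Truncate "now" to the hour so the mapping stays stable within that hour.
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    offset = current_hour - historical_start
    return ts + offset


pickup = datetime(2016, 1, 1, 0, 15, 30)
dropoff = datetime(2016, 1, 1, 0, 42, 10)
now = datetime(2024, 6, 3, 14, 37, 5)

mapped_pickup = shift_to_current(pickup, now)
mapped_dropoff = shift_to_current(dropoff, now)
print(mapped_pickup)   # 2024-06-03 14:15:30
print(mapped_dropoff)  # 2024-06-03 14:42:10
```

Because both timestamps receive the same offset, the trip duration is unchanged, which keeps a `dropoff > pickup` check meaningful after the shift.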


In [None]:
# BRONZE LAYER - Cleansed and standardized data
@dlt.table(
    name="taxi_trips_bronze",
    comment="Bronze layer - cleansed NYC taxi trips with current dates",
    table_properties={
        "quality": "bronze", 
        "layer": "bronze"
    }
)
# Expectation constraints are written as SQL boolean expressions (strings).
@dlt.expect_or_drop(
    "valid_timestamps",
    "tpep_pickup_datetime IS NOT NULL "
    "AND tpep_dropoff_datetime IS NOT NULL "
    "AND tpep_dropoff_datetime > tpep_pickup_datetime"
)
@dlt.expect_or_drop("valid_trip_distance", "trip_distance > 0")
@dlt.expect_or_drop("valid_fare_amount", "fare_amount > 0")
def taxi_trips_bronze():
    """
    Bronze layer with basic data quality validations.
    Filters out invalid records and standardizes data types.
    """
    return (
        spark.readStream.table("live.taxi_landing_zone")
        .select(
            col("tpep_pickup_datetime").cast(TimestampType()),
            col("tpep_dropoff_datetime").cast(TimestampType()),
            col("trip_distance").cast(DoubleType()),
            col("fare_amount").cast(DoubleType()),
            col("pickup_zip").cast(IntegerType()),
            col("dropoff_zip").cast(IntegerType()),
            col("ingestion_timestamp"),
            col("source_system"),
        )
        .withColumn("bronze_processing_timestamp", current_timestamp())
        .withColumn("data_quality_status", lit("validated"))
    )
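
To make the effect of the three drop rules concrete, here is a plain-Python sketch that applies the same predicates to a few in-memory records. It only imitates the drop semantics for illustration and is not how DLT evaluates expectations. The sample rows and the `apply_expectations` helper are made up, and treating a missing distance or fare as a failure is an assumption of this sketch.

```python
from datetime import datetime

# Each rule mirrors one expectation on taxi_trips_bronze.
RULES = {
    "valid_timestamps": lambda r: (
        r["tpep_pickup_datetime"] is not None
        and r["tpep_dropoff_datetime"] is not None
        and r["tpep_dropoff_datetime"] > r["tpep_pickup_datetime"]
    ),
    # Assumption of this sketch: a missing value counts as a failure.
    "valid_trip_distance": lambda r: r["trip_distance"] is not None and r["trip_distance"] > 0,
    "valid_fare_amount": lambda r: r["fare_amount"] is not None and r["fare_amount"] > 0,
}


def apply_expectations(records):
    """Return (kept, failures); a record is kept only if every rule passes."""
    kept = []
    failures = {name: 0 for name in RULES}
    for record in records:
        failed = [name for name, rule in RULES.items() if not rule(record)]
        for name in failed:
            failures[name] += 1
        if not failed:
            kept.append(record)
    return kept, failures


t0 = datetime(2024, 6, 3, 14, 0)
t1 = datetime(2024, 6, 3, 14, 20)
sample = [
    {"tpep_pickup_datetime": t0, "tpep_dropoff_datetime": t1, "trip_distance": 2.5, "fare_amount": 12.0},
    {"tpep_pickup_datetime": t1, "tpep_dropoff_datetime": t0, "trip_distance": 1.0, "fare_amount": 8.0},
    {"tpep_pickup_datetime": t0, "tpep_dropoff_datetime": t1, "trip_distance": 0.0, "fare_amount": -5.0},
    {"tpep_pickup_datetime": None, "tpep_dropoff_datetime": t1, "trip_distance": 3.0, "fare_amount": 15.0},
]

kept, failures = apply_expectations(sample)
print(len(kept))  # 1
print(failures)   # {'valid_timestamps': 2, 'valid_trip_distance': 1, 'valid_fare_amount': 1}
```

A record that breaks several rules is counted once per rule but dropped only once, so the per-rule failure counts can add up to more than the number of dropped rows.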


In [None]:
# DATA QUALITY MONITORING TABLE (materialized for catalog visibility)
@dlt.table(
    name="taxi_data_quality_metrics",
    comment="Data quality metrics and monitoring for taxi pipeline",
    table_properties={
        "quality": "gold",
        "layer": "analytics"
    }
)
def taxi_data_quality_metrics():
    """
    Materialized table with summary metrics over the validated bronze data.
    Rows dropped by the bronze expectations are not counted here; those
    failure counts are recorded in the DLT pipeline event log.
    Defined with @dlt.table so the result is visible in the catalog.
    """
    bronze_df = dlt.read("taxi_trips_bronze")
    
    return bronze_df.agg(
        count("*").alias("total_records"),
        countDistinct("pickup_zip").alias("distinct_pickup_zips"),
        countDistinct("dropoff_zip").alias("distinct_dropoff_zips"),
        avg("trip_distance").alias("avg_trip_distance"),
        avg("fare_amount").alias("avg_fare_amount"),
        min("tpep_pickup_datetime").alias("earliest_pickup"),
        max("tpep_pickup_datetime").alias("latest_pickup")
    ).withColumn("quality_check_timestamp", current_timestamp())


In [None]:
# ALTERNATIVE: MATERIALIZED VIEW (also visible in catalog)
@dlt.materialized_view(
    name="taxi_hourly_summary_view",
    comment="Hourly summary statistics for taxi trips - materialized view"
)
def taxi_hourly_summary_view():
    """
    Materialized view that aggregates taxi data by hour.
    This will be visible in the catalog as a materialized view.
    """
    from pyspark.sql.functions import date_trunc
    
    bronze_df = dlt.read("taxi_trips_bronze")
    
    return (bronze_df
        .withColumn("pickup_hour", date_trunc("hour", col("tpep_pickup_datetime")))
        .groupBy("pickup_hour")
        .agg(
            count("*").alias("trips_per_hour"),
            avg("trip_distance").alias("avg_distance"),
            avg("fare_amount").alias("avg_fare"),
            countDistinct("pickup_zip").alias("unique_pickup_zones")
        )
        .orderBy("pickup_hour")
    )
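
For readers less familiar with `date_trunc`, the grouping above can be imitated in plain Python: truncate each pickup time to the hour, then aggregate per bucket. This is an illustrative sketch on made-up rows with simplified field names, not part of the pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample trips; field names are shortened for the sketch.
trips = [
    {"pickup": datetime(2024, 6, 3, 14, 5), "distance": 2.0, "fare": 10.0, "zip": 10001},
    {"pickup": datetime(2024, 6, 3, 14, 50), "distance": 4.0, "fare": 20.0, "zip": 10002},
    {"pickup": datetime(2024, 6, 3, 15, 10), "distance": 1.0, "fare": 6.0, "zip": 10001},
]

# Equivalent of date_trunc("hour", ...): zero out everything below the hour.
buckets = defaultdict(list)
for trip in trips:
    hour = trip["pickup"].replace(minute=0, second=0, microsecond=0)
    buckets[hour].append(trip)

# One summary row per hour, ordered by hour.
summary = []
for hour in sorted(buckets):
    rows = buckets[hour]
    summary.append({
        "pickup_hour": hour,
        "trips_per_hour": len(rows),
        "avg_distance": sum(r["distance"] for r in rows) / len(rows),
        "avg_fare": sum(r["fare"] for r in rows) / len(rows),
        "unique_pickup_zones": len({r["zip"] for r in rows}),
    })

for row in summary:
    print(row)
```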


## View Types in DLT - Catalog Visibility Explained

**Why your original `@dlt.view()` wasn't visible:**

1. **`@dlt.view()`**: Creates a temporary view for intermediate pipeline transformations
   - ❌ **Not published** to the catalog; it exists only within the pipeline
   - ✅ Used for data transformations within the pipeline
   - ✅ Avoids storage for intermediate results, since the view's data is not persisted

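2. **`@dlt.table()`**: Creates persistent Delta tables (a streaming table for a streaming query, a materialized view for a batch query)
   - ✅ **Visible** in the catalog once the pipeline publishes to a target schema
   - ✅ Data is stored and queryable
   - ✅ Best for final outputs and analytics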

3. **`@dlt.materialized_view()`**: Creates materialized views
   - ✅ **Should be visible** in catalog (Unity Catalog)
   - ✅ Refreshed on each pipeline update
   - ✅ Good for summary/aggregation tables

**Recommendation**: Use `@dlt.table()` for data you want to query from the catalog!
