**12 Days of Demos**
# 🎅 Auto Loader Magic at the North Pole 🎄

[Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) lets elf data teams scan a cloud storage folder (S3, ADLS, GCS) and ingest only the data that has arrived since the previous run. This notebook demonstrates Auto Loader ingesting raw operational data from Unity Catalog volumes (simulating S3 buckets) into Delta tables.
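
To make the "only new data" idea concrete, here is a toy model in plain Python. It is a conceptual sketch, not Auto Loader's actual implementation: the `checkpoint` set and the `discover_new_files` helper are hypothetical stand-ins for the file-tracking state that Auto Loader persists at its checkpoint location, and the file names are made up.

```python
def discover_new_files(listing, checkpoint):
    """Return files in `listing` not yet recorded in `checkpoint`, then record them."""
    new_files = sorted(f for f in listing if f not in checkpoint)
    checkpoint.update(new_files)
    return new_files


# Hypothetical tracking state; it starts empty before the first run.
checkpoint = set()

# First run: everything in the folder is new.
first_run = discover_new_files(["part-0001.parquet", "part-0002.parquet"], checkpoint)

# Second run: only the file that arrived since the previous run is picked up.
second_run = discover_new_files(
    ["part-0001.parquet", "part-0002.parquet", "part-0003.parquet"], checkpoint
)

print(first_run)
print(second_run)
```

Because the tracking state survives between runs, re-running the ingestion does not load the same file twice.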

---

### 🦌 Step 1: Configuration

The settings below control where the demo loads and creates data. Change them if you want the demo to use a different catalog or schema.

👇 **Optionally update the cell below, then run it!**

In [0]:
# TODO: Optionally update these values for your environment
TARGET_CATALOG = "main"
TARGET_SCHEMA = "dbrx_12daysofdemos"
TARGET_BRONZE_SCHEMA = "bronze_data"
TARGET_VOLUME = "raw_data_volume"

print("✅ Configuration loaded")

In [0]:
# Derived paths - do not modify these
volume_base_path = f"/Volumes/{TARGET_CATALOG}/{TARGET_SCHEMA}/stream"
volume_source_path = f"{volume_base_path}/reindeer_telemetry"
schema_location = f"{volume_base_path}/_autoloader_schemas/reindeer_telemetry"
checkpoint_location = f"{volume_base_path}/_autoloader_checkpoints/reindeer_telemetry"
target_table = f"{TARGET_CATALOG}.{TARGET_BRONZE_SCHEMA}.reindeer_telemetry"

In [0]:
%run "../00-init/load-data"

Before running the Auto Loader cells below, you need to start the streaming notebook that deposits files into the volume.

📍 **Instructions:**
1. **Open the notebook**: `Stream_Reindeer_Telemetry_To_Volume` (in the same directory)
2. **Run it in a separate tab**: This notebook will continuously deposit Parquet files
3. **Come back here**: Continue running the cells below to ingest the data with Auto Loader

📡 **What it does:**
* Simulates real-time reindeer telemetry data arriving from sensors
* Deposits Parquet files into `/Volumes/.../stream/reindeer_telemetry`
* Runs continuously in the background

---
✨ **Tip**: Open `Stream_Reindeer_Telemetry_To_Volume` in a new browser tab so both notebooks can run simultaneously!

### 🔍 Step 2: Explore Data Uploaded to Volume


In [0]:
# 🎄 Let's explore what data the regional elf teams have uploaded to our volume
# This simulates an S3 bucket where Parquet files are continuously arriving

# List files in the volume
files = dbutils.fs.ls(volume_source_path)
print(f"❄️ Found {len(files)} files in the North Pole data volume:")
for file in files[:10]:  # Show first 10 files
    print(f"  - {file.name} ({file.size} bytes)")

### 🥉 Step 3: Create Bronze Data Schema


In [0]:
# This is where we'll deposit data after picking it up from raw_data volume

spark.sql(f"""
    CREATE SCHEMA IF NOT EXISTS {TARGET_CATALOG}.{TARGET_BRONZE_SCHEMA}
    COMMENT 'Bronze layer: Raw ingested data from Auto Loader'
""")

print(f"✅ Schema {TARGET_CATALOG}.{TARGET_BRONZE_SCHEMA} is ready!")

In [0]:
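# Optional: Drop the table if you want to start fresh
# Note: Auto Loader records ingested files in the checkpoint, so files it has already
# processed are not re-ingested unless checkpoint_location is cleared as well
spark.sql(f"DROP TABLE IF EXISTS {target_table}")
print(f"🧹 Cleaned up {target_table}")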

### 🤖 Step 4: Setup Auto Loader to Read from Volume


In [0]:
# Single stream that reads from volume and writes to Bronze Delta table

print("⏳ Starting Auto Loader ingestion...\n")

# Auto Loader stream: read from volume and write to Bronze Delta table
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(volume_source_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_location)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)  # Process all files available now, then stop
    .table(target_table))

print(f"✅ Stream started! Processing files currently in {volume_source_path}")
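
The trigger decides whether this stream behaves like an incremental batch job or a long-running listener. The helper below is a hypothetical convenience (not part of the demo) that returns keyword arguments for `DataStreamWriter.trigger`: `availableNow=True` processes the files available when the query starts and then stops, while `processingTime` keeps the query running and checks for new files on an interval. The interval values are arbitrary assumptions.

```python
def trigger_kwargs(keep_running: bool, interval: str = "10 seconds") -> dict:
    """Build keyword arguments for DataStreamWriter.trigger().

    keep_running=False -> process everything available now, then stop.
    keep_running=True  -> stay active and check for new files every `interval`.
    """
    if keep_running:
        return {"processingTime": interval}
    return {"availableNow": True}


batch_style = trigger_kwargs(keep_running=False)
always_on = trigger_kwargs(keep_running=True, interval="30 seconds")
print(batch_style, always_on)

# In the notebook, the result would be unpacked into the writer, for example:
#   .writeStream.trigger(**trigger_kwargs(keep_running=False))
```

The batch-style trigger suits scheduled jobs where compute should shut down between runs; the always-on variant trades cost for lower latency.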


### ✅ Step 5: Verify the Bronze Table

In [0]:
# Let's check what was loaded into our Bronze Delta table
# The availableNow trigger stops the query once the files present at start are processed
print('Waiting for the stream to finish processing...')
query.awaitTermination()

# Count total rows
total_rows = spark.table(target_table).count()
print(f"🎄 Total rows ingested: {total_rows:,}")

# Show sample data
print(f"\n🎁 Sample data from {target_table}:")
display(spark.table(target_table))
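
Besides counting rows in the table, the streaming query itself reports how much it read. The sketch below is a minimal illustration that assumes the progress reports from `query.recentProgress` support dict-style access with a `numInputRows` key; the `total_input_rows` helper and the sample reports are hypothetical.

```python
def total_input_rows(progress_reports):
    """Sum the rows read across a list of streaming progress reports."""
    return sum(report["numInputRows"] for report in progress_reports)


# Hypothetical reports for illustration; in the notebook, pass query.recentProgress.
sample_reports = [
    {"batchId": 0, "numInputRows": 12500},
    {"batchId": 1, "numInputRows": 300},
]
print(f"Rows read by the stream: {total_input_rows(sample_reports):,}")
```

Spark keeps only a bounded number of recent reports, so treat this as a quick sanity check rather than an exact audit.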

### 📚 Schema Evolution with Auto Loader

**How it works:**
1. **Initial ingestion**: Auto Loader samples files and infers the schema
2. **New columns appear**: When new Parquet files have additional columns, Auto Loader detects them
3. **Automatic handling**: With `mergeSchema=true`, new columns are added to the Delta table
4. **Rescued data**: If data doesn't match the schema, it's saved in the `_rescued_data` column

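**🎯 Schema Evolution Modes** (values of `cloudFiles.schemaEvolutionMode`):
* **`addNewColumns`** (default when no schema is provided): The stream stops when new columns appear and adds them to the schema, so a restart picks them up; existing column types do not change
* **`rescue`**: The schema never evolves and the stream keeps running; data in new columns is captured in the `_rescued_data` column
* **`failOnNewColumns`**: The stream fails when new columns appear and does not restart until the schema is updated or the offending file is removed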
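
The difference between the modes is easiest to see on a single record that carries an unexpected column. The following is a toy model in plain Python, not Auto Loader code: `ingest`, `NewColumnsError`, and the column names are hypothetical, and the real `_rescued_data` column holds a JSON string rather than a dict. It follows the documented behavior that `addNewColumns` stops the stream after updating the schema so that a restart picks up the new columns, and it includes `none`, a fourth documented mode that ignores new columns.

```python
class NewColumnsError(Exception):
    """Raised when the toy stream stops because unseen columns arrived."""


def ingest(record, schema, mode="addNewColumns"):
    """Toy model of how one record with possibly-new columns is handled.

    `schema` is a set of known column names and may be updated in place.
    Returns the row that would be written.
    """
    new_cols = {k: v for k, v in record.items() if k not in schema}
    known = {k: v for k, v in record.items() if k in schema}

    if not new_cols:
        return known
    if mode == "addNewColumns":
        schema.update(new_cols.keys())  # schema evolves, but this run still stops
        raise NewColumnsError(f"schema updated with {sorted(new_cols)}; restart the stream")
    if mode == "failOnNewColumns":
        raise NewColumnsError(f"unexpected columns {sorted(new_cols)}")
    if mode == "rescue":
        return {**known, "_rescued_data": new_cols}
    if mode == "none":
        return known  # new columns are ignored
    raise ValueError(f"unknown mode: {mode}")


schema = {"reindeer_id", "speed"}
record = {"reindeer_id": "dasher", "speed": 42.0, "nose_glow": 0.1}

rescued = ingest(record, set(schema), mode="rescue")
print(rescued)

evolving = set(schema)
try:
    ingest(record, evolving, mode="addNewColumns")
except NewColumnsError as err:
    print(err)
after_restart = ingest(record, evolving, mode="addNewColumns")  # the "restarted" run
print(after_restart)
```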

**✨ Best Practices:**
* Use `cloudFiles.schemaLocation` to persist inferred schemas
* Enable `mergeSchema=true` when writing to Delta for automatic schema evolution
* Monitor the `_rescued_data` column for data quality issues
* Use schema hints for critical columns that need specific types
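
Several of these practices come down to reader options, so it can help to assemble them in one place. The sketch below builds a plain dict using the Auto Loader option names `cloudFiles.format`, `cloudFiles.schemaLocation`, `cloudFiles.schemaEvolutionMode`, and `cloudFiles.schemaHints`. The `autoloader_options` helper and the hinted columns (`speed`, `event_time`) are hypothetical, and the schema location assumes the default configuration values from Step 1.

```python
def autoloader_options(schema_location, file_format="parquet", hints=None,
                       evolution_mode="addNewColumns"):
    """Assemble Auto Loader reader options as a plain dict of strings."""
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }
    if hints:
        # schemaHints expects SQL-style "name TYPE" pairs separated by commas.
        options["cloudFiles.schemaHints"] = ", ".join(
            f"{column} {dtype}" for column, dtype in hints.items()
        )
    return options


opts = autoloader_options(
    "/Volumes/main/dbrx_12daysofdemos/stream/_autoloader_schemas/reindeer_telemetry",
    hints={"speed": "DOUBLE", "event_time": "TIMESTAMP"},  # hypothetical columns
)
for key, value in opts.items():
    print(f"{key} = {value}")

# In the notebook, the dict could be passed to the reader, for example:
#   spark.readStream.format("cloudFiles").options(**opts).load(volume_source_path)
```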

---
*Mrs. Claus approves: "No more manual schema management!"* 🎅✨

### 🛑 Stop All Streams

In [0]:
from time import sleep

print("🚨 This cell only stops your streams. Cancel it if you want to re-run earlier cells, but make sure to stop your streams with stream.stop() before you leave! 🚨")

# Wait three minutes, then stop all active streams so they do not keep running
sleep(180)
for stream in spark.streams.active:
    print(f"⏹️ Stopping stream: {stream.name or stream.id}")
    stream.stop()

print("\n✅ All streams stopped successfully!")
print("🎄 Auto Loader ingestion complete. Happy holidays!")