# 🎅 What is Databricks Auto Loader?

[Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) lets you scan a cloud storage folder (S3, ADLS, GCS) and only ingest the new data that arrived since the previous run.

This is called **incremental ingestion**.

Auto Loader can be used in a near real-time stream or in a batch fashion, e.g., running every night to ingest daily data.
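
To make "incremental" concrete, here is a toy, pure-Python model of the idea. It is only a sketch of the concept, not how Auto Loader is implemented: the `IncrementalIngester` class and the file names are hypothetical, and the "checkpoint" is just an in-memory set.

```python
# Toy model of incremental ingestion (illustrative only, not Auto Loader internals).
# A "checkpoint" remembers which files were already processed, so each run
# only picks up files that arrived since the previous run.

class IncrementalIngester:
    def __init__(self):
        self.checkpoint = set()  # names of files already ingested

    def run(self, files_in_folder):
        """Return the files that are new since the last run, in sorted order."""
        new_files = sorted(set(files_in_folder) - self.checkpoint)
        self.checkpoint.update(new_files)
        return new_files


ingester = IncrementalIngester()

# Night 1: two files have landed in the folder.
first_run = ingester.run(["letters_001.csv", "letters_002.csv"])

# Night 2: one more file arrived; the two old files are still in the folder.
second_run = ingester.run(["letters_001.csv", "letters_002.csv", "letters_003.csv"])

# Night 3: nothing new arrived, so nothing is ingested.
third_run = ingester.run(["letters_001.csv", "letters_002.csv", "letters_003.csv"])

print(first_run)   # ['letters_001.csv', 'letters_002.csv']
print(second_run)  # ['letters_003.csv']
print(third_run)   # []
```

The nightly-batch and near real-time styles differ only in how often `run` would be called; the bookkeeping stays the same.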

## 🎄 The North Pole's Challenge
The North Pole has data stuck everywhere:
* 📬 **CSVs of children's letters** arriving from postal services worldwide
* 🏬 **Retail store exports** tracking gift trends from mall Santa operations
* ☁️ **S3 buckets from regional elf teams** containing behavioral analytics, workshop IoT sensors, and reindeer telemetry

## ✨ How Auto Loader Simplifies Data Ingestion

Ingesting data from cloud storage at scale is hard: you need to track which files were already processed and cope with changing schemas. Auto Loader takes care of this and offers these benefits:

* **Incremental** & **cost-efficient** ingestion (removes unnecessary listing or state handling)
* **Simple** and **resilient** operation: no tuning or manual code required
* Scalable to **billions of files**
* **Schema inference** and **schema evolution** are handled out of the box for most formats (CSV, JSON, Avro, images...)

### 🎯 Mission: December 24th Deadline!
This notebook demonstrates Auto Loader ingesting raw operational data from Unity Catalog volumes (simulating S3 buckets) into Delta tables.

---
*"From scattered files to organized tables, Auto Loader makes it magical!"* 🦌✨

In [0]:
# 🎄 Let's explore what data the regional elf teams have uploaded to our volume
# This simulates an S3 bucket where CSV files are continuously arriving

volume_path = "/Volumes/12daysofdemos/raw_data/raw_data"

# List files in the volume
files = dbutils.fs.ls(volume_path)
print(f"❄️ Found {len(files)} files in the North Pole data volume:")
for file in files[:10]:  # Show first 10 files
    print(f"  - {file.name} ({file.size} bytes)")

# Preview a few rows (this reads the whole volume folder, not a single file)
print("\n🔍 Sample rows from the CSV files in the volume:")
display(spark.read.option("header", "true").csv(volume_path).limit(3))

In [0]:
# 🎅 AUTO LOADER: COMPLETE INGESTION PIPELINE
# Single stream that reads from volume and writes to Delta table

volume_source_path = "/Volumes/12daysofdemos/raw_data/raw_data"
schema_location = "/Volumes/12daysofdemos/raw_data/raw_data/_schemas/autoloader"
checkpoint_location = "/Volumes/12daysofdemos/raw_data/raw_data/_checkpoints/autoloader"
target_table = "12daysofdemos.raw_data.santas_workshop"

# Auto Loader stream: read from volume and write to Delta
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(volume_source_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_location)
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable(target_table))

query.awaitTermination()
print(f"✅ Data successfully ingested into {target_table}!")
print("✨ Auto Loader handled schema inference and evolution automatically.")
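
The pipeline above uses `trigger(availableNow=True)`, which processes everything currently in the volume and then stops (the batch style). A minimal sketch of switching between that and a near real-time stream follows. `trigger_kwargs` is a hypothetical helper and the intervals are arbitrary; `availableNow` and `processingTime` are keyword arguments of PySpark's `DataStreamWriter.trigger`.

```python
def trigger_kwargs(mode, interval="1 minute"):
    """Return keyword arguments for DataStreamWriter.trigger().

    mode="batch"     -> process all files available now, then stop
    mode="streaming" -> keep running, checking for new files every `interval`
    """
    if mode == "batch":
        return {"availableNow": True}
    if mode == "streaming":
        return {"processingTime": interval}
    raise ValueError(f"unknown mode: {mode!r}")


batch_trigger = trigger_kwargs("batch")
streaming_trigger = trigger_kwargs("streaming", interval="30 seconds")
print(batch_trigger)      # {'availableNow': True}
print(streaming_trigger)  # {'processingTime': '30 seconds'}

# Usage in the pipeline (not executed here, needs a Databricks `spark` session):
#   writer = df.writeStream.option("checkpointLocation", checkpoint_location)
#   query = writer.trigger(**trigger_kwargs("streaming")).toTable(target_table)
```

Note that with a `processingTime` trigger the query keeps running, so `query.awaitTermination()` would block until the stream is stopped or fails.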

In [0]:
# 🔍 VERIFY THE INGESTED DATA
# Let's check what was loaded into our Delta table

target_table = "12daysofdemos.raw_data.santas_workshop"

# Count total rows
total_rows = spark.table(target_table).count()
print(f"🎄 Total rows ingested: {total_rows:,}")

# Show sample data
print("\n🎁 Sample data from santas_workshop table:")
display(spark.table(target_table).limit(10))

# Show schema
print("\n📋 Table schema (Auto Loader inferred this automatically):")
spark.table(target_table).printSchema()

## 📚 Schema Evolution with Auto Loader

### How it works:
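1. **Initial ingestion**: Auto Loader samples the files and infers the schema, which it stores in `cloudFiles.schemaLocation`
2. **New columns appear**: When new CSV files contain additional columns, Auto Loader detects them
3. **Automatic handling**: In the default `addNewColumns` mode, the stream stops, records the new columns in the schema location, and picks them up on the next run; with `mergeSchema=true` on the write side, the new columns are then added to the Delta table
4. **Rescued data**: If data doesn't match the schema, it's saved in the `_rescued_data` column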
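
The four steps can be sketched as a toy, pure-Python model. This only illustrates the idea under simplifying assumptions (types are guessed as `int` or `str` from a single row, and the columns `child`, `gifts`, and `country` are hypothetical); Auto Loader's real sampling and type inference are more involved.

```python
import json


def infer_type(value):
    """Guess int vs. str for a raw CSV string."""
    try:
        int(value)
        return int
    except ValueError:
        return str


def infer_schema(sample_rows):
    """Step 1: infer column types from a sample (here: the first row only)."""
    return {col: infer_type(val) for col, val in sample_rows[0].items()}


def evolve_schema(schema, row):
    """Steps 2-3: detect columns missing from the schema and add them."""
    new_cols = [col for col in row if col not in schema]
    evolved = dict(schema)
    for col in new_cols:
        evolved[col] = infer_type(row[col])
    return evolved, new_cols


def parse_row(schema, row):
    """Step 4: cast values; anything that does not fit goes to _rescued_data."""
    parsed, rescued = {}, {}
    for col, col_type in schema.items():
        raw = row.get(col)
        try:
            parsed[col] = None if raw is None else col_type(raw)
        except ValueError:
            parsed[col] = None
            rescued[col] = raw
    parsed["_rescued_data"] = json.dumps(rescued) if rescued else None
    return parsed


schema = infer_schema([{"child": "Ada", "gifts": "3"}])
schema, added = evolve_schema(schema, {"child": "Linus", "gifts": "2", "country": "FI"})
print(added)  # ['country']

# "many" cannot be cast to int, so it is nulled out and kept in _rescued_data.
bad = parse_row(schema, {"child": "Grace", "gifts": "many", "country": "US"})
print(bad)
```

In this sketch a rescued value is never lost: the typed column is set to `None` and the raw string is kept as JSON, which mirrors the purpose of the `_rescued_data` column.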

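### 🎯 Schema Evolution Modes:
* **`addNewColumns`** (default when no schema is provided): The stream fails once, new columns are added to the schema, and they are picked up on restart; existing columns keep their data types
* **`rescue`**: The schema never evolves and the stream doesn't fail; new columns are saved in the `_rescued_data` column
* **`failOnNewColumns`**: The stream fails when new columns appear and won't restart until the schema is updated or the offending file is removed
* **`none`**: The schema doesn't evolve; new columns are ignored and not rescued unless `rescuedDataColumn` is set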
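
The mode is selected with the `cloudFiles.schemaEvolutionMode` reader option. Here is a minimal sketch of building the reader options with a guard against typos. `autoloader_options` is a hypothetical helper; the mode names are the ones listed in the Databricks documentation (which also lists `none`), so check them against your runtime.

```python
VALID_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def autoloader_options(file_format, schema_location, evolution_mode="addNewColumns"):
    """Build the option dict for spark.readStream.format("cloudFiles")."""
    if evolution_mode not in VALID_MODES:
        raise ValueError(f"unsupported schema evolution mode: {evolution_mode!r}")
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


opts = autoloader_options(
    "csv",
    "/Volumes/12daysofdemos/raw_data/raw_data/_schemas/autoloader",
    evolution_mode="rescue",
)
print(opts["cloudFiles.schemaEvolutionMode"])  # rescue

# Usage (not executed here, needs a Databricks `spark` session):
#   df = (spark.readStream.format("cloudFiles")
#         .options(**opts)
#         .option("header", "true")
#         .load(volume_source_path))
```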

### ✨ Best Practices:
* Use `cloudFiles.schemaLocation` to persist inferred schemas
* Enable `mergeSchema=true` when writing to Delta for automatic schema evolution
* Monitor the `_rescued_data` column for data quality issues
* Use schema hints (`cloudFiles.schemaHints`) for critical columns that need specific types
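
The last two practices can be sketched as follows. The column names `child_id` and `letter_date` are hypothetical (this notebook does not show the CSV columns), and `build_schema_hints` and `count_rescued_rows` are illustrative helpers, not part of any Databricks API. The hint string is passed through the `cloudFiles.schemaHints` option.

```python
def build_schema_hints(column_types):
    """Turn {"column": "SQL_TYPE"} into the comma-separated string used for schema hints."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in column_types.items())


# Hypothetical columns: pin the types we care about, let inference handle the rest.
hints = build_schema_hints({"child_id": "BIGINT", "letter_date": "DATE"})
print(hints)  # child_id BIGINT, letter_date DATE


def count_rescued_rows(spark, table_name):
    """Count rows where at least one value ended up in _rescued_data."""
    return spark.table(table_name).filter("_rescued_data IS NOT NULL").count()


# Usage (not executed here, needs a Databricks `spark` session):
#   reader = spark.readStream.format("cloudFiles").option("cloudFiles.schemaHints", hints)
#   rescued = count_rescued_rows(spark, "12daysofdemos.raw_data.santas_workshop")
```

A non-zero count from `count_rescued_rows` is a prompt to inspect the offending files or to add a hint for the affected column.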

---
*Mrs. Claus approves: "No more manual schema management!"* 🎅✨