# 📘 Databricks Auto Loader (cloudFiles)

## 1. What is Auto Loader?
- Auto Loader is a **Databricks feature for incremental file ingestion**.
- It **automatically detects and loads new files** arriving in a directory from cloud storage (ADLS, S3, GCS).
- Handles schema inference and schema evolution, and scales to **billions of files**.
- Reduces cost and complexity compared with manually listing directories to find new files.

---

## 2. How it Works
1. **Watches a directory or container** for new files.
2. **Tracks metadata** of already processed files in a checkpoint.
3. **Ingests only new files** into Delta tables (no duplicates).
4. Can handle **CSV, JSON, Parquet, Avro, ORC, Binary, Text**.
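
The tracking steps above can be illustrated with a toy model. This is only a sketch of the idea, not Auto Loader's implementation: `ToyCheckpoint` and `discover_new` are hypothetical names, and a plain Python set stands in for the checkpoint state.

```python
from typing import Iterable, List, Set


class ToyCheckpoint:
    """Hypothetical stand-in for the file-tracking state kept in a checkpoint."""

    def __init__(self) -> None:
        self.processed: Set[str] = set()

    def discover_new(self, listing: Iterable[str]) -> List[str]:
        """Return files not seen before (sorted) and remember them."""
        new_files = sorted(f for f in listing if f not in self.processed)
        self.processed.update(new_files)
        return new_files


checkpoint = ToyCheckpoint()
first_run = checkpoint.discover_new(["a.csv", "b.csv"])
second_run = checkpoint.discover_new(["a.csv", "b.csv", "c.csv"])

print(first_run)   # ['a.csv', 'b.csv']
print(second_run)  # ['c.csv']
```

The second run sees all three files in the listing but returns only `c.csv`, which is the "no duplicates" behaviour described in step 3.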

---

## 3. File Discovery Modes
### 🔹 Directory Listing Mode (default)
- Auto Loader **scans directories** to find new files.
- Good for **small to medium datasets** (millions of files).
- Example:
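```python
# Incrementally read CSV files from cloud storage with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")                  # format of the source files
      .option("cloudFiles.schemaLocation", "/mnt/schema")  # where the inferred schema is stored
      .load("/mnt/adls/container/path"))
```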
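
The reader above only defines the stream; nothing is ingested until it is written out. A minimal sketch of the write side, assuming a Databricks runtime where `cloudFiles` is available. The function name `start_bronze_stream`, its parameters, and any paths or table names passed to it are placeholders, not part of the original notes.

```python
def start_bronze_stream(spark, source_path, schema_path, checkpoint_path, target_table):
    """Read new CSV files with Auto Loader and append them to a Delta table.

    All arguments are caller-supplied placeholders (hypothetical paths and names).
    """
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", schema_path)
          .load(source_path))

    # The checkpoint location is where the stream records its progress,
    # which is what lets a restarted stream skip files it already ingested.
    return (df.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .outputMode("append")
            .toTable(target_table))
```

`toTable` starts the query and returns a `StreamingQuery` handle, so the caller can monitor or stop the stream.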


### Why Auto Loader?

- ✅ Incremental file processing (no reprocessing of old files)
- ✅ Handles schema drift automatically
- ✅ Scales to billions of files
- ✅ Supports both streaming and batch use cases
- ✅ Integrated with Delta Lake (bronze, silver, gold architecture)

### 🔹 File Notification Mode
- Auto Loader uses **cloud-native events** to detect new files instead of listing the directory:
  - Azure → Event Grid (with a Queue Storage queue)
  - AWS → SNS and SQS
  - GCP → Pub/Sub
- Scales to **billions of files** at lower cost, because directories are not listed repeatedly.
- Example:
```python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/mnt/schema")
      .option("cloudFiles.useNotifications", "true")   # 👈 use file notifications instead of directory listing
      .load("/mnt/adls/container/path"))
```
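
Since the two discovery modes differ by a single reader option, the choice can be kept in configuration. A small sketch, where `cloudfiles_options` is a hypothetical helper (not a Databricks API) that builds the option map used in the examples above:

```python
from typing import Dict


def cloudfiles_options(file_format: str,
                       schema_location: str,
                       use_notifications: bool = False) -> Dict[str, str]:
    """Build Auto Loader reader options for either discovery mode."""
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        # Only set the flag when notification mode is wanted;
        # leaving it out keeps the default directory listing mode.
        options["cloudFiles.useNotifications"] = "true"
    return options


listing_opts = cloudfiles_options("csv", "/mnt/schema")
notification_opts = cloudfiles_options("csv", "/mnt/schema", use_notifications=True)
print(notification_opts["cloudFiles.useNotifications"])  # true
```

The resulting dictionary can be passed to the reader with `spark.readStream.format("cloudFiles").options(**notification_opts).load(path)`.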


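## 4. Schema Inference and Evolution
- **Schema Inference** → Auto Loader samples the files to detect the schema and stores it under `cloudFiles.schemaLocation`.
- `cloudFiles.inferColumnTypes` → for text formats such as CSV and JSON, columns are inferred as strings unless this is set to `true`.
- **Schema Evolution** → with `addNewColumns`, the stream stops when it detects new columns, adds them to the stored schema, and picks them up on restart.
```python
# Chain these onto the cloudFiles reader
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```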
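
To make the evolution step concrete, here is a toy model of how `addNewColumns` is documented to behave: the stream stops when a file carries unknown columns, the stored schema is extended, and a restart then reads the file with the new schema. `ToySchemaStore` and `SchemaChanged` are hypothetical names used only for illustration; the real schema is tracked under `cloudFiles.schemaLocation`.

```python
from typing import Iterable, List


class SchemaChanged(Exception):
    """Raised by the toy model when a file brings columns not seen before."""


class ToySchemaStore:
    """Hypothetical stand-in for the schema tracked at cloudFiles.schemaLocation."""

    def __init__(self, columns: Iterable[str]) -> None:
        self.columns: List[str] = list(columns)

    def read(self, file_columns: Iterable[str]) -> List[str]:
        """Return the schema used to read a file, evolving it first if needed."""
        new_columns = [c for c in file_columns if c not in self.columns]
        if new_columns:
            # Record the new columns, then stop so that a restart uses them.
            self.columns.extend(new_columns)
            raise SchemaChanged(f"new columns detected: {new_columns}")
        return list(self.columns)


store = ToySchemaStore(["id", "name"])
try:
    store.read(["id", "name", "email"])  # first attempt stops the "stream"
    stopped = False
except SchemaChanged:
    stopped = True

schema_after_restart = store.read(["id", "name", "email"])
print(stopped, schema_after_restart)
```

In a real job this means the streaming task needs to restart automatically (for example through job retries) for evolution to be hands-off.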


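## 5. Bad Record Handling
- `badRecordsPath` → records that cannot be parsed are written to this path instead of failing the pipeline.
- `enforceSchema` → CSV option; when `true`, the schema is applied to every file and the CSV headers are not validated against it.
```python
.option("badRecordsPath", "/mnt/badrecords")  # divert unparseable records here
.option("enforceSchema", "true")              # CSV only: apply the schema, ignore file headers
```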

# ⚡ Structured Streaming – Output Modes & Triggers

Auto Loader works on top of **Structured Streaming**, so we must understand:
1. **Output Modes** – how results are written
2. **Triggers** – when results are written

---

## 1. Output Modes
Defines **what data gets written** to the sink (Delta, Console, Kafka, etc.).

### 🔹 Append (most common)
- Writes **only new rows** since the last trigger.
- Used in **incremental ingestion** (Bronze layer).
```python
.outputMode("append")
```


### 🔹 Complete
- Writes the **entire result table** on every trigger.
- Only supported for aggregation queries (e.g., counts, sums).
```python
.outputMode("complete")
```


### 🔹 Update
- Writes **only updated rows** since the last trigger.
- Not all sinks support this.
```python
.outputMode("update")
```
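
The difference between the three modes is easiest to see on a running count. The sketch below is a plain-Python simulation, not Spark code: `simulate` is a hypothetical function that returns what each mode would hand to the sink after every micro-batch (raw rows for append, counts for complete and update).

```python
from collections import Counter
from typing import Dict, List, Union

Output = Union[List[str], Dict[str, int]]


def simulate(mode: str, batches: List[List[str]]) -> List[Output]:
    """Return the data written to the sink after each micro-batch."""
    counts: Counter = Counter()
    outputs: List[Output] = []
    for batch in batches:
        counts.update(batch)
        if mode == "append":
            outputs.append(list(batch))                      # new rows only
        elif mode == "complete":
            outputs.append(dict(counts))                     # whole result table
        elif mode == "update":
            outputs.append({k: counts[k] for k in set(batch)})  # changed keys only
        else:
            raise ValueError(f"unknown output mode: {mode}")
    return outputs


batches = [["a", "b"], ["a"]]
print(simulate("append", batches))    # [['a', 'b'], ['a']]
print(simulate("complete", batches))  # [{'a': 1, 'b': 1}, {'a': 2, 'b': 1}]
print(simulate("update", batches))    # [{'a': 1, 'b': 1}, {'a': 2}]
```

In the second micro-batch, complete rewrites both keys while update writes only `a`, the key whose count changed.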


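## 2. Triggers
Defines **when and how often** micro-batches run.

### 🔹 Default (micro-batch)
- Applies when no `.trigger(...)` is set on the writer.
- A new micro-batch starts as soon as the previous one completes, so new data is picked up almost immediately.
```python
# Default: omit .trigger(...) entirely.
# Setting processingTime to an interval such as "1 minute" selects a fixed interval instead.
```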


### 🔹 Fixed Interval
- Runs a micro-batch every fixed interval (e.g., every 5 minutes).
```python
.trigger(processingTime="5 minutes")
```


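### 🔹 Once
- Processes all available data **in a single micro-batch** and then stops.
- Good for batch-style ingestion; newer Spark and Databricks releases deprecate it in favor of `availableNow`.
```python
.trigger(once=True)
```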


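### 🔹 AvailableNow
- Processes everything available when the query starts, possibly across several micro-batches, then stops.
- Files that arrive after the query starts are not processed in that run; the next run picks them up.
```python
.trigger(availableNow=True)
```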
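
The contrast between `once` and `availableNow` can be sketched as a batching plan. This is a plain-Python illustration under stated assumptions, not Spark internals: `plan_once` and `plan_available_now` are hypothetical helpers, and `max_files_per_batch` stands in for a rate limit such as `cloudFiles.maxFilesPerTrigger`.

```python
from typing import List


def plan_once(files_at_start: List[str]) -> List[List[str]]:
    """once=True: everything available goes into a single micro-batch."""
    return [list(files_at_start)] if files_at_start else []


def plan_available_now(files_at_start: List[str],
                       max_files_per_batch: int) -> List[List[str]]:
    """availableNow=True: the same snapshot, split into rate-limited micro-batches."""
    return [files_at_start[i:i + max_files_per_batch]
            for i in range(0, len(files_at_start), max_files_per_batch)]


snapshot = ["f1", "f2", "f3", "f4", "f5"]  # files present when the query starts
late = ["f6"]                              # arrives while the run is in progress

once_batches = plan_once(snapshot)
now_batches = plan_available_now(snapshot, max_files_per_batch=2)
next_run = plan_available_now(late, max_files_per_batch=2)

print(once_batches)  # [['f1', 'f2', 'f3', 'f4', 'f5']]
print(now_batches)   # [['f1', 'f2'], ['f3', 'f4'], ['f5']]
print(next_run)      # [['f6']]
```

Both plans cover the same snapshot; the late file `f6` appears only in the following run, which is what "ignores late-arriving data" means in practice.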
