#Autoloader
**Auto Loader is Databricks' cloud-native file ingestion engine for incrementally loading new files from object storage.**

Supported Sources:
- AWS S3
- Azure ADLS Gen2
- Google Cloud Storage (GCS)


#Types of Autoloader:
**1. Directory Listing/Triggered / Incremental Batch Pipeline (Micro-batch)**
- Runs Auto Loader in streaming mode but triggers once (or on schedule).<br>
🔹 Use case:
- Daily/hourly ingestion
- Orchestrated by Databricks Jobs
- Cost-optimized ingestion

**What really happens:**
- The Databricks job is **triggered based on the schedule**.
- It **looks for new files** in the given cloud storage location, using the checkpoint to pick up only files that have not been processed yet.
- It reads the files and **validates the schema** against the schema stored from the previous load. If the **schema has changed, the schema location is updated**.
- It **loads the files** into the target path, using the `mergeSchema` option in case of a schema change.
- **After loading, the checkpoint location** is updated with the list of processed files.
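
The file-tracking step above can be pictured with a small pure-Python simulation. This is only an illustration of the idea, not how Auto Loader actually stores its state inside the checkpoint directory; the helper name `discover_new_files` and the sample file names are hypothetical.

```python
def discover_new_files(listed_files, processed_files):
    """Return files that are in the listing but not yet recorded as processed.

    Hypothetical helper: it mimics what a triggered run conceptually does
    with the checkpoint, not Auto Loader's internal implementation.
    """
    return sorted(f for f in listed_files if f not in processed_files)


# State carried over from the previous run (stands in for the checkpoint).
checkpoint = {"sales_01.csv", "sales_02.csv"}

# Files currently sitting in the landing directory.
listing = ["sales_01.csv", "sales_02.csv", "sales_03.csv", "sales_04.csv"]

# Run 1: only the two unseen files are picked up.
batch = discover_new_files(listing, checkpoint)
checkpoint.update(batch)  # after a successful load, record them as processed

# Run 2: nothing new has landed, so the batch is empty.
next_batch = discover_new_files(listing, checkpoint)
```

Because the processed set survives between runs, rerunning the job against the same directory does not reload old files.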


**2. Continuous (Streaming) Ingestion Pipeline**
- Auto Loader **runs in Structured Streaming mode and continuously monitors cloud storage** for new files.<br>
🔹 Use case:
- Near real-time ingestion
- IoT data
- Application logs
- Event-based files landing continuously

**What really happens:**
1. Cloud storage emits a file-create event (S3 Event, ADLS Event Grid, GCS Pub/Sub)
2. The event is delivered to the notification queue
3. Auto Loader receives the notification (push model)
4. The new file is registered
5. The schema is inferred, or evolved if needed
6. The file(s) are loaded and the schema info is stored in a schema file, so further schema inference is not needed
7. The checkpoint is updated (file1 is processed...)
8. The stream stays idle until the next event arrives
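
As a rough sketch of how the event-based flow above is switched on, the snippet below builds the source options for file notification mode and passes them to `readStream`. The helper names and the `/Volumes/demo/schema/` path are hypothetical, `spark` is assumed to be an active SparkSession, and the cloud-side permissions that notification mode needs are out of scope here.

```python
def notification_options(file_format, schema_location, backfill_interval=None):
    """Build Auto Loader source options for file notification mode."""
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Switch from directory listing to event-based file discovery.
        "cloudFiles.useNotifications": "true",
    }
    if backfill_interval is not None:
        # Periodic directory listing as a safety net for missed events.
        options["cloudFiles.backfillInterval"] = backfill_interval
    return options


def read_with_notifications(spark, source_path, options):
    """Define (but do not start) the stream; `spark` is an active SparkSession."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical schema location, used only to show the resulting options.
opts = notification_options("json", "/Volumes/demo/schema/", backfill_interval="1 day")
```

Keeping the options in a plain dictionary makes it easy to log or review them before the stream is defined.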

#Benefits of Autoloader:
1. Meant **only for file ingestion from cloud object storage** (GCS/S3/ADLS)
2. **Supports both streaming (streaming pipeline) and scheduled jobs (directory listing)**
3. **Incremental and efficient file tracking:** the checkpoint ensures that only new data is ingested
4. **Schema evolution support:** with `"cloudFiles.schemaEvolutionMode": "addNewColumns"` and `"mergeSchema": "true"`, new columns are added without breaking the pipeline
5. **Scalability and resource optimization:** properties such as `cloudFiles.maxFilesPerTrigger` let you control how many files are processed per batch
6. **Checkpointing and fault tolerance:** after a job failure, the stream can resume from the point where it failed
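
To make the schema evolution point concrete, here is a minimal pure-Python sketch of what `addNewColumns` does to the tracked schema: unseen columns are appended and existing columns keep their type. The helper `add_new_columns` and the sample schemas are hypothetical, and the sketch leaves out the runtime behaviour (as the Databricks documentation describes it, the stream stops when a new column appears and uses the updated schema after a restart).

```python
def add_new_columns(known_schema, incoming_columns):
    """Append unseen columns to the tracked schema; existing types are kept."""
    evolved = dict(known_schema)  # copy; dicts preserve insertion order
    for name, dtype in incoming_columns.items():
        if name not in evolved:
            evolved[name] = dtype
    return evolved


# Schema stored from the previous load (stands in for the schema location).
tracked = {"id": "int", "amount": "double"}

# A new file arrives with an extra column and a different type for `id`.
incoming = {"id": "string", "amount": "double", "currency": "string"}

# `currency` is appended; `id` keeps its original type.
evolved = add_new_columns(tracked, incoming)
```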

#Readstream important options:

- cloudFiles.format
- cloudFiles.schemaLocation
- checkpointLocation (set on writeStream, not readStream)
- cloudFiles.inferColumnTypes
- cloudFiles.schemaEvolutionMode
- rescuedDataColumn
- cloudFiles.includeExistingFiles
- cloudFiles.maxFilesPerTrigger
- cloudFiles.partitionColumns
- cloudFiles.validateOptions


| Category            | Option                            | Description                                                             |
| ------------------- | --------------------------------- | ----------------------------------------------------------------------- |
| **Required**        | `cloudFiles.format`               | Source file format (csv, json, parquet, avro, etc.)                     |
| **Required**        | `cloudFiles.schemaLocation`       | Location to store schema metadata and evolution history                 |
| **Schema Handling** | `cloudFiles.inferColumnTypes`     | Enables automatic data type inference                                   |
| **Schema Handling** | `cloudFiles.schemaEvolutionMode`  | Controls schema changes (`addNewColumns`, `rescue`, `failOnNewColumns`) |
| **Schema Handling** | `rescuedDataColumn`               | Column to capture unexpected or malformed fields                        |
| **File Discovery**  | `cloudFiles.includeExistingFiles` | Whether to process historical files at first run                        |
| **File Discovery**  | `cloudFiles.maxFilesPerTrigger`   | Maximum files processed per micro-batch                                 |
| **Performance**     | `cloudFiles.partitionColumns`     | Extract partition columns from directory structure                      |
| **Performance**     | `cloudFiles.validateOptions`      | Validates provided Auto Loader options                                  |
| **Performance**     | `cloudFiles.backfillInterval`     | Frequency of directory backfill when using notifications (event-based)  |
| **File Discovery**  | `cloudFiles.useNotifications`     | Enables file notification mode (event-based ingestion)                  |
| **File Discovery**  | `cloudFiles.queueName`            | Cloud queue used for file notifications (event-based ingestion)         |

#writestream options:

| Option                         | Description                       |
| ------------------------------ | --------------------------------- |
| `format()`                     | delta, parquet, console, memory   |
| `outputMode()`                 | `append` (only new rows, most common), `complete` (full aggregation results), `update` (updated rows only) |
| `option("checkpointLocation")` | Required for fault tolerance      |
| `path`                         | Output path (if not using table)  |
| `partitionBy()`                | Partition output files            |
| `queryName()`                  | Name the streaming query          |
| `mergeSchema`                  | Enable schema evolution for Delta |
| `truncate`                     | For console sink                  |


##Trigger options:
Why does the trigger belong to writeStream and not readStream?<br>
Because readStream is only a definition:<br>
**readStream Defines:**
- Where to read from
- Schema
- Source options

**It does NOT:**
- Execute anything
- Schedule anything
- Create batches

**writeStream Defines:**
- Output sink
- Output mode
- Checkpoint
- Trigger
- Starts the stream
- This is where the engine runs.

| Trigger Type                    | Syntax                                  | Runs Continuously? | Stops Automatically?                        | Needs Job Scheduling?              | Typical Use Case                              |
| ------------------------------- | --------------------------------------- | ------------------ | ------------------------------------------- | ---------------------------------- | --------------------------------------------- |
| **Micro-Batch Continuous**      | `.trigger(processingTime="30 seconds")` | ✅ Yes              | ❌ No                                        | ❌ No (runs as long-running job)    | Near real-time ingestion                      |
| **Trigger Once (Old)**          | `.trigger(once=True)`                   | ❌ No               | ✅ Yes (after one batch)                     | ✅ Yes (if periodic runs needed)    | Legacy batch-style streaming                  |
| **Available Now (Recommended)** | `.trigger(availableNow=True)`           | ❌ No               | ✅ Yes (after processing all available data) | ✅ Yes (for hourly/daily pipelines) | Scheduled ingestion / streaming-as-batch      |
| **Continuous Mode (Rare)**      | `.trigger(continuous="1 second")`       | ✅ Yes              | ❌ No                                        | ❌ No (runs continuously)           | Ultra-low latency streaming (limited support) |
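
One way to keep the trigger choice configurable is to map a mode name to the keyword argument that `trigger()` expects. The sketch below assumes `df` is a streaming DataFrame produced by `readStream`; the mode names and the helpers `trigger_kwargs` and `start_stream` are hypothetical, and the intervals are just the ones from the table.

```python
# Mode name -> keyword argument accepted by DataStreamWriter.trigger()
TRIGGERS = {
    "micro_batch": {"processingTime": "30 seconds"},
    "once": {"once": True},
    "available_now": {"availableNow": True},
    "continuous": {"continuous": "1 second"},
}


def trigger_kwargs(mode):
    """Return a copy of the trigger keyword arguments for a named mode."""
    try:
        return dict(TRIGGERS[mode])
    except KeyError:
        raise ValueError(f"unknown trigger mode: {mode!r}") from None


def start_stream(df, target_path, checkpoint_location, mode="available_now"):
    """Start a Delta write with the selected trigger; `df` is a streaming DataFrame."""
    return (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_location)
        .trigger(**trigger_kwargs(mode))
        .start(target_path)
    )
```

With this shape, a scheduled job can call `start_stream(df, target, checkpoint)` and a long-running job can pass `mode="micro_batch"` without changing the rest of the pipeline.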


In [0]:
create catalog if not exists auto_loader;
create schema if not exists auto_loader.auto_loader_sch;
create volume if not exists auto_loader.auto_loader_sch.auto_load_vol;

In [0]:
%python
# Source, checkpoint, schema, and target locations inside the volume
cloudscrpath='/Volumes/auto_loader/auto_loader_sch/auto_load_vol/cloudsrc/'
chkpointlocation='/Volumes/auto_loader/auto_loader_sch/auto_load_vol/checkpointlocation/'
schemalocation='/Volumes/auto_loader/auto_loader_sch/auto_load_vol/schemalocation/'
bronzetgt="/Volumes/auto_loader/auto_loader_sch/auto_load_vol/bronzetgt/"
# Define the Auto Loader source; nothing runs until writeStream starts the query.
# checkpointLocation is a writeStream option, so it is not set here.
df1=spark.readStream.format("cloudFiles")\
.option("cloudFiles.format","csv")\
.option("cloudFiles.inferColumnTypes",True)\
.option("header",True)\
.option("cloudFiles.schemaEvolutionMode","addNewColumns")\
.option("cloudFiles.maxFilesPerTrigger",5)\
.option("cloudFiles.schemaLocation", schemalocation)\
.load(cloudscrpath)

In [0]:
%python
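# availableNow processes everything currently available, then stops.
# cloudFiles.schemaLocation is a source option, so it is set on readStream only.
df1.writeStream.trigger(availableNow=True)\
.format("delta")\
.option("checkpointLocation", chkpointlocation)\
.option("mergeSchema", "true")\
.start(bronzetgt)\
.awaitTermination()  # block until the run finishes so the next cell sees the data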

In [0]:
%python
spark.read.format("delta").load('/Volumes/auto_loader/auto_loader_sch/auto_load_vol/bronzetgt/').show()