# COPY INTO & Auto Loader - Incremental Data Ingestion Demo

Welcome! This demo will teach you how to efficiently ingest data from cloud storage into Delta Lake.

---

## 📊 The Data Ingestion Challenge

**Common scenario:**
* New files arrive continuously in cloud storage (S3, ADLS, GCS)
* Need to load only NEW files (not reprocess old ones)
* Must handle schema changes gracefully
* Want reliable, scalable ingestion

**Naive approach problems:**
```sql
-- DON'T DO THIS!
SELECT * FROM read_files('/path/to/data/*.csv')  -- Reads ALL files every time!
```

❌ Reprocesses all files every run  
❌ Wastes time and money  
❌ No tracking of processed files  
❌ Doesn't scale  
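
What the naive approach lacks is a record of which files have already been loaded. The toy sketch below models that bookkeeping in plain Python. `select_new_files` and `ledger` are hypothetical names used only for illustration, and this is not how COPY INTO or Auto Loader store their state internally.

```python
from typing import Iterable, List, Set


def select_new_files(available: Iterable[str], processed: Set[str]) -> List[str]:
    """Return files not yet in the ledger, then record them as processed."""
    new_files = sorted(f for f in available if f not in processed)
    processed.update(new_files)
    return new_files


ledger: Set[str] = set()  # hypothetical in-memory ledger of loaded files

# First run: both files are new.
first_run = select_new_files(["batch1.csv", "batch2.csv"], ledger)

# Second run: one more file has arrived, so only that file is selected.
second_run = select_new_files(["batch1.csv", "batch2.csv", "batch3.csv"], ledger)

# Third run: nothing new has arrived, so nothing is selected (idempotent).
third_run = select_new_files(["batch1.csv", "batch2.csv", "batch3.csv"], ledger)

print(first_run, second_run, third_run)
```

Each run does work proportional to the new files only, which is the property both Databricks methods provide with durable tracking.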

---

## ✅ Databricks Solutions

Databricks provides two powerful methods for incremental ingestion:

### **1. COPY INTO**
* SQL-based command
* Idempotent (safe to re-run)
* Tracks processed files automatically
* Simple syntax
* Great for batch ingestion

### **2. Auto Loader (cloudFiles)**
* Streaming-based approach
* Automatic schema inference and evolution
* Scalable to millions of files
* Built-in error handling
* Great for continuous ingestion

---

## 🎯 What You'll Learn

1. **COPY INTO** - SQL-based incremental loading
2. **Auto Loader** - Streaming-based ingestion
3. **Comparison** - When to use each approach
4. **Best Practices** - Error handling, monitoring, optimization

**Let's get started!** 🚀

## 1. Setup Demo Environment 🛠️

We'll create sample data files to demonstrate both COPY INTO and Auto Loader.

**Setup steps:**
1. Create a Unity Catalog volume for storing files
2. Generate sample CSV files (simulating new data arriving)
3. Create target Delta tables

**Why use Volumes?**
* Modern Unity Catalog best practice
* Better governance and access control
* Works with COPY INTO and Auto Loader
* Replaces legacy DBFS paths

In [0]:
%sql
-- Create a volume to store our sample data files
-- Volumes are the modern way to store files in Unity Catalog

CREATE VOLUME IF NOT EXISTS main.default.ingestion_demo_data
COMMENT 'Sample data files for COPY INTO and Auto Loader demo'

In [0]:
# Generate sample customer data as CSV files
# We'll create multiple files to simulate incremental data arrival

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from pyspark.sql.functions import current_timestamp, lit
import random
from datetime import datetime, timedelta

# Define schema
schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("country", StringType(), True),
    StructField("signup_date", StringType(), True),
    StructField("total_purchases", DoubleType(), True)
])

# Generate first batch of data (100 customers)
data_batch1 = [
    (i, f"Customer_{i}", f"customer{i}@example.com", 
     random.choice(["USA", "UK", "Canada", "Germany", "France"]),
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(100, 5000), 2))
    for i in range(1, 101)
]

df_batch1 = spark.createDataFrame(data_batch1, schema)

# Write to volume as CSV
output_path = "/Volumes/main/default/ingestion_demo_data/customers"
df_batch1.coalesce(1).write.mode("overwrite").option("header", "true").csv(f"{output_path}/batch1")

print("✅ Created batch 1: 100 customers")
print(f"   Location: {output_path}/batch1/")
display(df_batch1.limit(5))

In [0]:
# Generate second batch (simulating new data arriving later)
data_batch2 = [
    (i, f"Customer_{i}", f"customer{i}@example.com", 
     random.choice(["USA", "UK", "Canada", "Germany", "France"]),
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(100, 5000), 2))
    for i in range(101, 151)
]

df_batch2 = spark.createDataFrame(data_batch2, schema)
df_batch2.coalesce(1).write.mode("overwrite").option("header", "true").csv(f"{output_path}/batch2")

print("✅ Created batch 2: 50 customers")
print(f"   Location: {output_path}/batch2/")
display(df_batch2.limit(5))

In [0]:
# Generate third batch with a NEW COLUMN (schema evolution scenario)
from pyspark.sql.types import BooleanType

schema_v2 = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("country", StringType(), True),
    StructField("signup_date", StringType(), True),
    StructField("total_purchases", DoubleType(), True),
    StructField("is_premium", BooleanType(), True)  # NEW COLUMN!
])

data_batch3 = [
    (i, f"Customer_{i}", f"customer{i}@example.com", 
     random.choice(["USA", "UK", "Canada", "Germany", "France"]),
     (datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))).strftime("%Y-%m-%d"),
     round(random.uniform(100, 5000), 2),
     random.choice([True, False]))
    for i in range(151, 201)
]

df_batch3 = spark.createDataFrame(data_batch3, schema_v2)
df_batch3.coalesce(1).write.mode("overwrite").option("header", "true").csv(f"{output_path}/batch3")

print("✅ Created batch 3: 50 customers with NEW COLUMN (is_premium)")
print(f"   Location: {output_path}/batch3/")
print("\n⚠️  This batch has a different schema - we'll see how each method handles this!")
display(df_batch3.limit(5))

In [0]:
# List all the files we created
print("📁 Generated files in volume:\n")
files = dbutils.fs.ls(output_path)
for file_info in files:
    print(f"  {file_info.path}")

print("\n✅ Setup complete! We have 3 batches of data ready for ingestion.")

## 2. COPY INTO - SQL-Based Ingestion 📊

**What is COPY INTO?**

COPY INTO is a SQL command that incrementally loads data from files into a Delta table.

**Key features:**
* ✅ **Idempotent** - Safe to re-run, won't duplicate data
* ✅ **Automatic tracking** - Remembers which files were processed
* ✅ **SQL-based** - Familiar syntax, works in SQL warehouses
* ✅ **File format support** - CSV, JSON, Parquet, Avro, ORC
* ✅ **Pattern matching** - Use wildcards to select files

**Basic syntax:**
```sql
COPY INTO target_table
FROM 'source_path'
FILEFORMAT = format
FORMAT_OPTIONS ('option' = 'value')
COPY_OPTIONS ('option' = 'value')
```

In [0]:
%sql
-- Create the target Delta table for COPY INTO

CREATE TABLE IF NOT EXISTS main.default.customers_copy_into (
  customer_id INT,
  name STRING,
  email STRING,
  country STRING,
  signup_date STRING,
  total_purchases DOUBLE
)
USING DELTA
COMMENT 'Customer data loaded with COPY INTO'

In [0]:
%sql
-- Load the first batch of files
-- COPY INTO will track which files it processes

COPY INTO main.default.customers_copy_into
FROM '/Volumes/main/default/ingestion_demo_data/customers/batch1'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'false')

In [0]:
%sql
-- Check what was loaded
SELECT COUNT(*) AS row_count FROM main.default.customers_copy_into

In [0]:
%sql
-- View the loaded data
SELECT * FROM main.default.customers_copy_into
LIMIT 10

In [0]:
%sql
-- Load the second batch
-- This demonstrates incremental loading

COPY INTO main.default.customers_copy_into
FROM '/Volumes/main/default/ingestion_demo_data/customers/batch2'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')

In [0]:
%sql
-- Check the new count - should be 150 (100 + 50)
SELECT COUNT(*) AS row_count FROM main.default.customers_copy_into

In [0]:
%sql
-- Run COPY INTO again on the same files
-- It will NOT reload the same files (idempotent)

COPY INTO main.default.customers_copy_into
FROM '/Volumes/main/default/ingestion_demo_data/customers/batch1'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')

In [0]:
%sql
-- Count should still be 150 (no duplicates added)
SELECT COUNT(*) AS row_count FROM main.default.customers_copy_into

In [0]:
%sql
-- Point COPY INTO at the parent directory to pick up every batch at once
-- Files that were already loaded (batch1, batch2) are skipped; only new files are ingested

COPY INTO main.default.customers_copy_into
FROM '/Volumes/main/default/ingestion_demo_data/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')  -- Allow schema evolution

In [0]:
%sql
-- Should now have all 200 customers (batch1 + batch2 + batch3)
-- Note: batch3 has an extra column (is_premium)

SELECT COUNT(*) AS row_count FROM main.default.customers_copy_into

In [0]:
%sql
-- Check if the new column was added
DESCRIBE main.default.customers_copy_into

### 📚 COPY INTO Options Reference

**FORMAT_OPTIONS (file reading):**
```sql
-- CSV options
'header' = 'true'              -- First row is header
'inferSchema' = 'true'         -- Infer data types
'delimiter' = ','              -- Field delimiter
'quote' = '"'                  -- Quote character
'escape' = '\\'                -- Escape character
'nullValue' = 'NULL'           -- NULL representation

-- JSON options
'multiLine' = 'true'           -- Multi-line JSON objects
'dateFormat' = 'yyyy-MM-dd'    -- Date format
```

**COPY_OPTIONS (behavior):**
```sql
'mergeSchema' = 'true'         -- Allow schema evolution
'force' = 'true'               -- Reprocess all files (ignore tracking)
```

**FILEFORMAT:**
* CSV
* JSON
* PARQUET
* AVRO
* ORC
* BINARYFILE
* TEXT
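
When the same load has to run against several tables or paths, the clauses above can be assembled programmatically. The following is a minimal sketch, assuming all values come from trusted configuration (it does no quoting or escaping). `build_copy_into` and `_options_clause` are hypothetical helper names, not part of any Databricks API.

```python
from typing import Dict, Optional


def _options_clause(keyword: str, options: Optional[Dict[str, str]]) -> str:
    """Render e.g. FORMAT_OPTIONS ('header' = 'true'), or '' when there are no options."""
    if not options:
        return ""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
    return f"\n{keyword} ({pairs})"


def build_copy_into(
    table: str,
    source_path: str,
    file_format: str,
    format_options: Optional[Dict[str, str]] = None,
    copy_options: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a COPY INTO statement from its parts (trusted inputs only)."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {file_format}"
        + _options_clause("FORMAT_OPTIONS", format_options)
        + _options_clause("COPY_OPTIONS", copy_options)
    )


statement = build_copy_into(
    "main.default.customers_copy_into",
    "/Volumes/main/default/ingestion_demo_data/customers/batch1",
    "CSV",
    format_options={"header": "true", "inferSchema": "true"},
    copy_options={"mergeSchema": "false"},
)
print(statement)
```

In a notebook, the resulting string could be passed to `spark.sql(statement)`.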

## 3. Auto Loader (cloudFiles) ⚡

**What is Auto Loader?**

Auto Loader is Databricks' streaming-based solution for incrementally loading data from cloud storage.

**Key features:**
* ✅ **Automatic schema inference** - Detects schema from files
* ✅ **Schema evolution** - Handles new columns automatically
* ✅ **Scalable** - Efficiently processes millions of files
* ✅ **Streaming** - Continuous ingestion with low latency
* ✅ **File notification** - Uses cloud events (S3 SQS, ADLS Event Grid)
* ✅ **Checkpointing** - Tracks progress automatically

**Basic syntax:**
```python
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", checkpoint_path) \
  .load(source_path)

df.writeStream \
  .option("checkpointLocation", checkpoint_path) \
  .toTable("target_table")
```

In [0]:
# Auto Loader with schema inference
# This will automatically detect the schema and load data

from pyspark.sql.functions import current_timestamp

# Define paths
source_path = "/Volumes/main/default/ingestion_demo_data/customers/"
checkpoint_path = "/Volumes/main/default/ingestion_demo_data/checkpoints/autoloader_customers"
target_table = "main.default.customers_autoloader"

print("⚡ Starting Auto Loader...\n")
print(f"Source: {source_path}")
print(f"Checkpoint: {checkpoint_path}")
print(f"Target: {target_table}")
print("\n⏳ This will run as a streaming query...")

In [0]:
# Read data using Auto Loader (cloudFiles)
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", checkpoint_path) \
  .option("header", "true") \
  .option("cloudFiles.inferColumnTypes", "true") \
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
  .load(source_path)

print("✅ Auto Loader stream configured")
print("\nInferred schema:")
df.printSchema()

In [0]:
# Write the stream to a Delta table
# This starts the streaming query

query = df.writeStream \
  .option("checkpointLocation", checkpoint_path) \
  .option("mergeSchema", "true") \
  .trigger(availableNow=True) \
  .toTable(target_table)

print("⏳ Streaming query started...")
print("\n👉 This will process all files and then stop (trigger=availableNow)")

# Wait for the stream to finish
query.awaitTermination()

print("\n✅ Auto Loader completed!")

In [0]:
%sql
-- Check what Auto Loader loaded
SELECT COUNT(*) AS row_count 
FROM main.default.customers_autoloader

In [0]:
%sql
-- View the data loaded by Auto Loader
-- Notice it includes the is_premium column from batch3

SELECT * 
FROM main.default.customers_autoloader
ORDER BY customer_id
LIMIT 20

In [0]:
%sql
-- Check the schema - should include is_premium column
DESCRIBE main.default.customers_autoloader

### 💡 Auto Loader Schema Hints

**Schema inference modes:**

**1. Automatic inference (default):**
```python
.option("cloudFiles.schemaLocation", checkpoint_path)
```
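Auto Loader infers the schema by sampling the files it discovers first (by default up to 50 GB or 1,000 files, whichever limit is reached first) and stores the result in the schema location. For CSV and JSON, columns are inferred as strings unless `cloudFiles.inferColumnTypes` is enabled.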

**2. Schema hints (recommended):**
```python
.option("cloudFiles.schemaHints", "customer_id INT, name STRING")
```
Provide hints for specific columns.

**3. Explicit schema:**
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
  StructField("customer_id", IntegerType()),
  StructField("name", StringType())
])

df = spark.readStream.format("cloudFiles") \
  .schema(schema) \
  .load(path)
```

**Schema evolution modes:**
* `addNewColumns` - Add new columns (default)
* `rescue` - Put unexpected data in _rescued_data column
* `failOnNewColumns` - Fail if schema changes
* `none` - No evolution
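
The difference between the four modes is easiest to see on a single record that carries an unexpected column. The sketch below is a simplified plain-Python model, not Auto Loader itself. `apply_evolution_mode` and `SchemaChangeError` are hypothetical names. Two simplifications are assumed: in real Auto Loader, `addNewColumns` stops the stream when it meets a new column and applies the evolved schema on restart, and `_rescued_data` is stored as a JSON string rather than a dictionary.

```python
from typing import Any, Dict, List, Tuple

MODES = ("addNewColumns", "rescue", "failOnNewColumns", "none")


class SchemaChangeError(Exception):
    """Raised by this toy model when a mode rejects an unexpected column."""


def apply_evolution_mode(
    columns: List[str], record: Dict[str, Any], mode: str
) -> Tuple[List[str], Dict[str, Any]]:
    """Return (schema after the record, row as it would be written)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")

    extra = {k: v for k, v in record.items() if k not in columns}
    row = {c: record.get(c) for c in columns}

    if mode == "rescue":
        # Schema stays fixed; unexpected fields are kept aside.
        row["_rescued_data"] = extra or None
        return list(columns), row
    if mode == "none" or not extra:
        # Schema stays fixed; unexpected fields (if any) are dropped.
        return list(columns), row
    if mode == "failOnNewColumns":
        raise SchemaChangeError(f"unexpected columns: {sorted(extra)}")
    # addNewColumns: the schema grows to include the new fields.
    evolved = list(columns) + list(extra)
    return evolved, {c: record.get(c) for c in evolved}


columns = ["customer_id", "name"]
record = {"customer_id": 151, "name": "Customer_151", "is_premium": True}

evolved_columns, evolved_row = apply_evolution_mode(columns, record, "addNewColumns")
rescued_columns, rescued_row = apply_evolution_mode(columns, record, "rescue")
ignored_columns, ignored_row = apply_evolution_mode(columns, record, "none")
```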

### 🛠️ Auto Loader Advanced Options

**File notification modes:**
```python
# Directory listing (default, works everywhere)
.option("cloudFiles.useNotifications", "false")

# File notification (more efficient for large-scale)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.queueUrl", "https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>")  # URL of an existing AWS SQS queue (placeholder)
```

**Schema evolution:**
```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("cloudFiles.inferColumnTypes", "true")
```

**File filtering:**
```python
.option("pathGlobFilter", "*.csv")  # Only CSV files (generic file source option, no cloudFiles. prefix)
.option("modifiedAfter", "2024-01-01 00:00:00.000000 UTC+0")  # Only files modified after this timestamp
```

**Performance:**
```python
.option("cloudFiles.maxFilesPerTrigger", 1000)  # Limit files per batch
.option("cloudFiles.maxBytesPerTrigger", "10g")  # Limit data per batch
```

**Existing files:**
```python
.option("cloudFiles.includeExistingFiles", "true")  # Process files already in the path (default)
```

**Metadata columns:**

File metadata is exposed through the hidden `_metadata` column, which is not part of the output unless you select it explicitly:
* `_metadata.file_path` - Source file path
* `_metadata.file_name` - Source file name
* `_metadata.file_modification_time` - File timestamp
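
These fields live in the hidden `_metadata` column, so a query has to select them to carry them into the target table. A minimal sketch, assuming `df` is the streaming DataFrame returned by `.load(...)`; the helper name `with_source_metadata` is hypothetical.

```python
def with_source_metadata(df):
    """Append source-file lineage fields taken from the hidden _metadata column."""
    return df.select(
        "*",
        "_metadata.file_path",
        "_metadata.file_modification_time",
    )
```

Applied before `writeStream`, for example `with_source_metadata(df).writeStream...`, this adds `file_path` and `file_modification_time` columns to every ingested row.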

## 4. COPY INTO vs Auto Loader 🤔

Both methods solve the same problem, but have different strengths. Let's compare!

### 📊 Feature Comparison

| Feature | COPY INTO | Auto Loader |
|---------|-----------|-------------|
| **Execution Model** | Batch (SQL) | Streaming (Spark Structured Streaming) |
| **Language** | SQL only | Python, Scala, SQL |
| **Schema Inference** | Manual or inferSchema | Automatic with evolution |
| **Schema Evolution** | mergeSchema option | Built-in, automatic |
| **File Tracking** | Automatic (metadata) | Checkpoint files |
| **Scalability** | Good (1000s of files) | Excellent (millions of files) |
| **Latency** | Minutes (batch) | Seconds (streaming) |
| **File Notification** | No | Yes (S3 SQS, ADLS Events) |
| **Error Handling** | Fails on error | Rescued data column (`_rescued_data`) |
| **Complexity** | Simple | Moderate |
| **Cost** | Lower (batch) | Higher (continuous) |
| **Use in SQL Warehouse** | ✅ Yes | ❌ No (needs cluster) |
| **Idempotency** | ✅ Yes | ✅ Yes |
| **Metadata Columns** | Yes (select `_metadata` explicitly) | Yes (select `_metadata` explicitly) |

### ✅ When to Use COPY INTO

**Best for:**

✅ **Scheduled batch loads** - Hourly, daily, weekly ingestion  
✅ **SQL-only environments** - SQL warehouses, SQL-based pipelines  
✅ **Simple use cases** - Straightforward CSV/JSON ingestion  
✅ **Small to medium scale** - Up to thousands of files  
✅ **Known schema** - Schema doesn't change frequently  
✅ **Cost-sensitive** - Lower cost for batch processing  

**Example scenarios:**
* Daily sales reports from partner systems
* Hourly log file ingestion
* Weekly data dumps from external sources
* One-time historical data loads
* SQL-based ETL pipelines

**Advantages:**
* Simple SQL syntax
* Works in SQL warehouses
* Easy to understand and debug
* Lower cost for batch workloads
* No streaming infrastructure needed

### ⚡ When to Use Auto Loader

**Best for:**

✅ **Continuous ingestion** - Real-time or near-real-time data  
✅ **Large scale** - Millions of files  
✅ **Schema evolution** - Frequent schema changes  
✅ **Complex scenarios** - Need advanced error handling  
✅ **Low latency** - Need data available quickly  
✅ **Production pipelines** - Enterprise-grade reliability  

**Example scenarios:**
* IoT sensor data (continuous stream)
* Application logs (high volume)
* CDC (Change Data Capture) files
* Multi-tenant data with varying schemas
* Mission-critical data pipelines
* Data lakes with millions of files

**Advantages:**
* Automatic schema inference and evolution
* Scales to millions of files
* Low latency (seconds)
* Built-in error handling (rescue columns)
* File notification for efficiency
* Metadata columns for lineage

### 🌳 Decision Tree: Which Should I Use?

```
Start: Need to ingest files from cloud storage
│
├─ Do you need real-time/continuous ingestion?
│  ├─ YES → Use Auto Loader ⚡
│  └─ NO → Continue...
│
├─ Do you have millions of files?
│  ├─ YES → Use Auto Loader ⚡
│  └─ NO → Continue...
│
├─ Does your schema change frequently?
│  ├─ YES → Use Auto Loader ⚡
│  └─ NO → Continue...
│
├─ Are you using SQL Warehouse only?
│  ├─ YES → Use COPY INTO 📊
│  └─ NO → Continue...
│
├─ Is simplicity more important than features?
│  ├─ YES → Use COPY INTO 📊
│  └─ NO → Use Auto Loader ⚡
│
└─ Default recommendation: Auto Loader ⚡
```

**Quick guide:**
* **Simple batch loads** → COPY INTO
* **Everything else** → Auto Loader
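
The decision tree can also be written as a small function, which is handy when the choice is made per pipeline from configuration. This is only a sketch of the tree as drawn above; the function and parameter names are hypothetical.

```python
def recommend_ingestion_method(
    needs_continuous: bool = False,
    millions_of_files: bool = False,
    schema_changes_often: bool = False,
    sql_warehouse_only: bool = False,
    prefers_simplicity: bool = False,
) -> str:
    """Walk the decision tree top to bottom and return the recommended method."""
    # The first three questions each lead straight to Auto Loader.
    if needs_continuous or millions_of_files or schema_changes_often:
        return "Auto Loader"
    # The next two questions each lead to COPY INTO.
    if sql_warehouse_only or prefers_simplicity:
        return "COPY INTO"
    # Default recommendation at the bottom of the tree.
    return "Auto Loader"


print(recommend_ingestion_method(sql_warehouse_only=True))
print(recommend_ingestion_method(millions_of_files=True, sql_warehouse_only=True))
```

Because the questions are evaluated in order, a pipeline with millions of files is pointed to Auto Loader even when it currently runs only on a SQL warehouse, exactly as in the tree.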

### 🔄 Side-by-Side Code Comparison

**COPY INTO (SQL):**
```sql
-- Simple and straightforward
COPY INTO main.default.target_table
FROM '/path/to/files/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```

**Auto Loader (Python):**
```python
# More configuration, more features
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", checkpoint_path) \
  .option("header", "true") \
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
  .load(source_path)

df.writeStream \
  .option("checkpointLocation", checkpoint_path) \
  .option("mergeSchema", "true") \
  .trigger(availableNow=True) \
  .toTable("main.default.target_table") \
  .awaitTermination()
```

**Key differences:**
* COPY INTO: 5 lines of SQL
* Auto Loader: 10+ lines of Python with more options

## 5. Best Practices ✅

Production-ready patterns for both COPY INTO and Auto Loader.

### 📊 COPY INTO Best Practices

**1. Use pattern matching for flexibility:**
```sql
COPY INTO table
FROM '/path/to/data/'
FILEFORMAT = CSV
PATTERN = 'year=2024/month=*/*.csv'
```

**2. Enable schema evolution when needed:**
```sql
COPY_OPTIONS ('mergeSchema' = 'true')
```

**3. Schedule with jobs:**
```sql
-- Run COPY INTO on a schedule (hourly, daily)
-- Use Databricks Jobs or Workflows
```

**4. Monitor with DESCRIBE HISTORY:**
```sql
-- Review past COPY INTO operations and their metrics
DESCRIBE HISTORY main.default.customers_copy_into
```

**5. Handle errors gracefully:**
```sql
-- Use COPY_OPTIONS to control error handling
COPY_OPTIONS (
  'force' = 'false',           -- Don't reprocess files
  'mergeSchema' = 'true'       -- Allow schema changes
)
```

**6. Test with small batches first:**
```sql
-- Test on a subset before full load
COPY INTO table
FROM '/path/to/data/batch1/'
```

In [0]:
%sql
-- View the history of operations on the table
-- Each COPY INTO run appears as an entry with its timestamp and operation metrics

DESCRIBE HISTORY main.default.customers_copy_into
LIMIT 10

### ⚡ Auto Loader Best Practices

**1. Always use checkpoints:**
```python
.option("checkpointLocation", "/path/to/checkpoint")
```
⚠️ Never change checkpoint location - it tracks progress!

**2. Use schema location:**
```python
.option("cloudFiles.schemaLocation", "/path/to/schema")
```
Stores inferred schema for consistency.

**3. Enable schema evolution:**
```python
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
```

**4. Use schema hints for critical columns:**
```python
.option("cloudFiles.schemaHints", "id INT, amount DECIMAL(10,2)")
```

**5. Use rescue columns for data quality:**
```python
.option("cloudFiles.schemaEvolutionMode", "rescue")
```
Unexpected data goes to `_rescued_data` column.

**6. Use file notifications for scale:**
```python
.option("cloudFiles.useNotifications", "true")
```
Much more efficient for large-scale ingestion.

**7. Control batch size:**
```python
.option("cloudFiles.maxFilesPerTrigger", 1000)
```
Prevents overwhelming the cluster.

**8. Use trigger modes appropriately:**
```python
# Batch mode (process everything available, then stop)
.trigger(availableNow=True)

# Fixed-interval micro-batches (query keeps running)
.trigger(processingTime='1 minute')

# Single batch, then stop (deprecated; use availableNow instead)
.trigger(once=True)
```
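
In practice the choice usually reduces to two cases: a scheduled job that drains the backlog and stops, or an always-on query that polls at a fixed interval. A minimal sketch of that choice follows; `trigger_kwargs` is a hypothetical helper that only builds the keyword arguments for `DataStreamWriter.trigger()`.

```python
from typing import Dict, Optional, Union


def trigger_kwargs(processing_interval: Optional[str] = None) -> Dict[str, Union[bool, str]]:
    """Return keyword arguments for DataStreamWriter.trigger().

    No interval: process all available files, then stop (availableNow).
    With an interval such as "1 minute": keep running in micro-batches.
    """
    if processing_interval is None:
        return {"availableNow": True}
    return {"processingTime": processing_interval}


scheduled = trigger_kwargs()
always_on = trigger_kwargs("1 minute")
print(scheduled, always_on)
```

The result can be unpacked into the writer, for example `df.writeStream.trigger(**trigger_kwargs("1 minute"))`.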

In [0]:
# Auto Loader with rescue columns for error handling
# This captures malformed data instead of failing

print("🛡️ Auto Loader with rescue columns for error handling\n")

rescue_checkpoint = "/Volumes/main/default/ingestion_demo_data/checkpoints/rescue_demo"

df_with_rescue = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", rescue_checkpoint) \
  .option("header", "true") \
  .option("cloudFiles.schemaEvolutionMode", "rescue") \
  .option("cloudFiles.inferColumnTypes", "true") \
  .load(source_path)

print("Schema with rescue column:")
df_with_rescue.printSchema()

print("\n💡 Notice the '_rescued_data' column - this captures any data that doesn't fit the schema!")

In [0]:
# Monitor Auto Loader streaming queries
# Check active streams and their progress

print("📊 Monitoring Auto Loader streams\n")

# List active streaming queries
active_streams = spark.streams.active

if len(active_streams) > 0:
    print(f"Active streams: {len(active_streams)}\n")
    for stream in active_streams:
        print(f"Stream ID: {stream.id}")
        print(f"Name: {stream.name}")
        print(f"Status: {stream.status}")
        print(f"Recent progress: {stream.recentProgress}")
        print("-" * 60)
else:
    print("✅ No active streams (all completed)")
    print("\nThis is expected since we used trigger(availableNow=True)")
    print("which processes all available data and stops.")
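
Once a stream has stopped it no longer appears in `spark.streams.active`, but the progress entries recorded on its query handle can still be summarized. The sketch below assumes each entry is dictionary-like with a `numInputRows` field, as in the progress reports of Structured Streaming. `summarize_progress` and the sample entries are hypothetical.

```python
from typing import Any, Dict, List


def summarize_progress(progress: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count micro-batches and total input rows from streaming progress entries."""
    return {
        "batches": len(progress),
        "total_input_rows": sum(int(p.get("numInputRows", 0)) for p in progress),
    }


# Hypothetical progress entries, trimmed to the fields used here.
sample_progress = [
    {"batchId": 0, "numInputRows": 150},
    {"batchId": 1, "numInputRows": 50},
]

summary = summarize_progress(sample_progress)
print(summary)
```

In the notebook, the same function could be called as `summarize_progress(query.recentProgress)` on the query started earlier.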

## 🚀 Performance Tips

### **COPY INTO Performance**

✅ **Partition your source data** - Use directory structure  
✅ **Use appropriate file sizes** - 128MB-1GB per file ideal  
✅ **Limit file patterns** - Be specific with paths  
✅ **Schedule during off-peak** - Reduce cluster contention  
✅ **Monitor with DESCRIBE HISTORY** - Track load times  
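
A quick way to check the file-size guideline is to bucket the sizes of the source files. This sketch takes the 128MB-1GB range above as its thresholds; the function names and sample sizes are hypothetical.

```python
from typing import Dict, Iterable

# Bounds taken from the "128MB-1GB per file" guideline.
MIN_IDEAL_BYTES = 128 * 1024 ** 2
MAX_IDEAL_BYTES = 1024 ** 3


def classify_file_size(size_bytes: int) -> str:
    """Label one file size relative to the ideal range."""
    if size_bytes < MIN_IDEAL_BYTES:
        return "too_small"
    if size_bytes > MAX_IDEAL_BYTES:
        return "too_large"
    return "ok"


def size_report(sizes: Iterable[int]) -> Dict[str, int]:
    """Count how many files fall below, inside, and above the ideal range."""
    report = {"too_small": 0, "ok": 0, "too_large": 0}
    for size in sizes:
        report[classify_file_size(size)] += 1
    return report


# Hypothetical sizes in bytes: 4 MB, 256 MB, 2 GB.
report = size_report([4 * 1024 ** 2, 256 * 1024 ** 2, 2 * 1024 ** 3])
print(report)
```

In a notebook the sizes could come from a directory listing, for example `[f.size for f in dbutils.fs.ls(path)]`.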

### **Auto Loader Performance**

✅ **Use file notifications** - Much faster than directory listing  
✅ **Set maxFilesPerTrigger** - Control batch size  
✅ **Use schema hints** - Avoid inference overhead  
✅ **Optimize checkpoint location** - Use fast storage  
✅ **Monitor streaming metrics** - Check Spark UI  
✅ **Use appropriate trigger intervals** - Balance latency vs cost  

### **General Tips**

✅ **Use Delta Lake** - Optimized for both methods  
✅ **Compress source files** - Reduce I/O  
✅ **Use Unity Catalog Volumes** - Modern best practice  
✅ **Test with small datasets** - Validate before production  
✅ **Monitor costs** - Streaming can be more expensive  

## ⚠️ Common Pitfalls to Avoid

### **COPY INTO Pitfalls**

❌ **Don't use force=true in production** - Reprocesses all files  
❌ **Don't ignore schema evolution** - Plan for schema changes  
❌ **Don't forget to schedule** - COPY INTO doesn't run automatically  
❌ **Don't use for real-time** - It's batch-oriented  

### **Auto Loader Pitfalls**

❌ **Don't change checkpoint location** - Loses progress tracking  
❌ **Don't skip schema location** - Can cause inconsistencies  
❌ **Don't use continuous mode unnecessarily** - Costs more  
❌ **Don't ignore rescue columns** - Monitor for data quality issues  
❌ **Don't forget to stop streams** - Can run indefinitely  

### **General Pitfalls**

❌ **Don't use read_files() for incremental** - No tracking  
❌ **Don't mix COPY INTO and Auto Loader** - Use one method per table  
❌ **Don't ignore file sizes** - Too small = overhead, too large = memory issues  
❌ **Don't skip testing** - Test schema evolution scenarios  

## 📚 Quick Reference

### **COPY INTO Template**
```sql
COPY INTO catalog.schema.table
FROM 'source_path'
FILEFORMAT = CSV|JSON|PARQUET
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)
COPY_OPTIONS (
  'mergeSchema' = 'true',
  'force' = 'false'
)
```

### **Auto Loader Template**
```python
# Read stream
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", checkpoint_path) \
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
  .option("cloudFiles.inferColumnTypes", "true") \
  .option("header", "true") \
  .load(source_path)

# Write stream
query = df.writeStream \
  .option("checkpointLocation", checkpoint_path) \
  .option("mergeSchema", "true") \
  .trigger(availableNow=True) \
  .toTable(target_table)

query.awaitTermination()
```

### **Monitoring**
```sql
-- COPY INTO history
DESCRIBE HISTORY table_name

-- Check table details
DESCRIBE DETAIL table_name
```

```python
# Auto Loader monitoring
spark.streams.active  # List active streams
query.status  # Check stream status
query.recentProgress  # View progress
```

## 🎉 Congratulations!

You've completed the COPY INTO & Auto Loader demo!

### **What You Learned:**

✅ **COPY INTO** - SQL-based incremental ingestion  
✅ **Auto Loader** - Streaming-based ingestion with cloudFiles  
✅ **Idempotency** - Both methods track processed files  
✅ **Schema Evolution** - Handle schema changes gracefully  
✅ **Comparison** - When to use each approach  
✅ **Best Practices** - Production-ready patterns  

---

### **Key Takeaways:**

1. **Never reprocess all files** - Use COPY INTO or Auto Loader
2. **COPY INTO for simplicity** - Great for batch loads
3. **Auto Loader for scale** - Best for production pipelines
4. **Schema evolution matters** - Plan for schema changes
5. **Monitor your ingestion** - Use history and streaming metrics

---

### **Decision Summary:**

| Scenario | Recommendation |
|----------|----------------|
| SQL Warehouse only | COPY INTO |
| Simple batch loads | COPY INTO |
| Real-time ingestion | Auto Loader |
| Millions of files | Auto Loader |
| Frequent schema changes | Auto Loader |
| Production pipelines | Auto Loader |
| Cost-sensitive batch | COPY INTO |

---

### **Next Steps:**

* Implement incremental ingestion in your pipelines
* Set up file notifications for Auto Loader (S3 SQS)
* Create monitoring dashboards
* Explore Delta Live Tables (DLT) for declarative pipelines
* Learn about Change Data Capture (CDC)

---

### **Resources:**

* [COPY INTO Documentation](https://docs.databricks.com/sql/language-manual/delta-copy-into.html)
* [Auto Loader Documentation](https://docs.databricks.com/ingestion/auto-loader/index.html)
* [Unity Catalog Volumes](https://docs.databricks.com/data-governance/unity-catalog/volumes.html)
* [Delta Lake Best Practices](https://docs.databricks.com/delta/best-practices.html)

---

**You're now ready to build production-grade data ingestion pipelines!** 🚀

*Happy ingesting!*