# Introduction to Databricks

## From Apache Spark to Databricks

**Apache Spark** is an open-source distributed computing framework designed for big data processing. However, after Spark became popular, several challenges emerged in production environments.

### Problems with Running Spark in Production

Engineers encountered significant **operational overhead**:

- **Cluster Setup**: Manual configuration of distributed computing clusters was **complex and time-consuming**
- **Resource Tuning**: Constant need to optimize memory, CPU, and storage allocation across nodes
- **Job Failures**: Debugging failed jobs in distributed environments required deep expertise
- **Infrastructure Management**: Engineers could spend the majority of their time (an estimated 60-70%) on infrastructure rather than data logic

**Result**: Data scientists and engineers had less time for actual data analytics and business logic.

### Databricks Solution

**Founded by**: Original creators of Apache Spark (from UC Berkeley AMPLab)

**Mission**: Make Spark easy to use in production environments

**Key Offerings**:

1. **Managed Clusters**
   - Automatic cluster provisioning and scaling
   - No manual infrastructure setup required
   - Auto-termination to save costs

2. **Collaborative Notebooks**
   - Interactive development environment
   - Support for Python, Scala, SQL, and R
   - Real-time collaboration features

3. **Monitoring and Optimization**
   - Built-in performance monitoring
   - Automatic optimization suggestions
   - Job scheduling and orchestration

---



# Getting Started with Databricks

Step-by-Step Access Instructions:   

## Step 1: Create Account
- Navigate to: https://www.databricks.com/try-databricks
- Sign up for a free community edition or enterprise trial
- Choose your cloud provider (AWS, Azure, or GCP)

## Step 2: Start a Spark Cluster

1. Navigate to "Compute" in the sidebar
2. Click "Create Cluster"
3. Configure cluster settings:
   - Cluster name
   - Runtime version (e.g., 15.0 ML with Scala 2.12, Spark 3.5.0)
   - Node type (e.g., Standard_DS3_v2: 14 GB Memory, 4 Cores)
   - Autoscaling options
4. Click "Create Cluster"

> Community Edition provides a single free, limited serverless cluster.
> If you are using it, skip ahead to Step 3.


**Runtime Options**:
- **Standard**: Basic Spark functionality
- **ML (Machine Learning)**: Includes pre-installed ML libraries (scikit-learn, TensorFlow, etc.)
- **LTS (Long Term Support)**: Stable versions for production

**Node Types**:
- **General Purpose**: Balanced compute and memory (Standard_DS3_v2, Standard_DS4_v2)
- **GPU Accelerated**: For deep learning workloads

## Step 3: Create a Notebook

1. Navigate to "Workspace"
2. Click "Create" → "Notebook"
3. Name your notebook
4. Select default language (Python recommended)
5. Attach notebook to your running cluster


## Step 4: Run Your First Spark Code
```python
# Create a simple DataFrame with 5 rows
spark.range(5).show()
```

**Expected Output**:
```
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
```


---



# PySpark Fundamentals

## Understanding the Spark Architecture

**Key Components**:

1. **Driver Program**: Coordinates the execution, runs the main() function
2. **Cluster Manager**: Allocates resources (YARN, Mesos, Kubernetes, or Standalone)
3. **Executor Nodes**: Perform the actual data processing tasks
4. **SparkSession**: Entry point for all Spark functionality

<img src="./pic/3_spark_architecture.png" width=500>



## SparkSession

The unified entry point for working with Spark.

```python
from pyspark.sql import SparkSession

# Create SparkSession (in Databricks, 'spark' is pre-created)
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# In Databricks notebooks, simply use:
spark  # Already available
```


## RDD vs DataFrame vs Dataset

The diagram below shows the Apache Spark stack and how its components interact:

<img src='./pic/3_spark_api_architecture.jpg' width=500>

**Vertical Flow** (layers listed from the bottom of the stack to the top):

```text
Data Sources API (Storage, Where data comes from)

    ↓ reads/writes data

RDDs (Low-level abstraction)

    ↓ provides distributed data

Spark Core: Execution engine  (What actually runs your code)

    ↓ executes operations

Catalyst Optimizer (Where queries are optimized automatically)

    ↓ optimizes queries

APIs: Spark SQL, Dataset, DataFrame  (What you write code with)

    ↓ provides high-level interface

User Interfaces: JDBC, Console, Programs (How you interact with Spark)
```

**Horizontal Relationships:**   
- All three APIs (Spark SQL, Dataset, DataFrame) use the same Catalyst Optimizer
- All three APIs execute through the same Spark Core Engine
- Traditional RDDs can bypass higher layers for low-level control

**Key Takeaways**:   

- **You work at the top** (Programs, DataFrame API)
- **Middle layers optimize** automatically (Catalyst, Spark Core)
- **Bottom layers** handle distribution and storage (RDDs, Data Sources)
- **Everything is unified**: SQL, DataFrames, and Datasets all use the same engine
- **You benefit from optimization** without doing anything special

### RDD (Resilient Distributed Dataset) 
An RDD is the **fundamental data structure** of Apache Spark: the original low-level abstraction that Spark was built on. It is:  

- An **immutable distributed** collection of objects
- Split into multiple **partitions** across cluster nodes
- Can be operated on in **parallel**
- **Fault-tolerant**: automatically rebuilt if partitions are lost

Think of it as: A distributed Python list that can be processed in parallel across multiple machines



### DataFrame 
A DataFrame is a **distributed collection** of data **organized into named columns**, similar to a **table** in a relational database or a spreadsheet in Excel, but designed to work with massive datasets across multiple computers. It is:  

- ✅ Distributed table with named columns
- ✅ Like Excel/SQL table but for big data
- ✅ **Immutable** (can't change, only create new versions)
- ✅ Lazy evaluation (builds execution plan)
- ✅ Automatically optimized by Catalyst


Think of it as Excel/Google Sheets: 
- Has rows and columns (like a spreadsheet)
- Each column has a name and data type
- But can handle HUGE data (terabytes)
- Processes data across multiple computers in parallel


Compared with a Python list or a pandas DataFrame, which must fit on a single machine, a Spark DataFrame can scale to **billions of rows** across hundreds of computers.

Why Use DataFrames?
- ✅ Easy to use - SQL-like operations
- ✅ Fast - Automatic optimization
- ✅ Scalable - Handles terabytes of data
- ✅ Flexible - Reads many file formats
- ✅ Industry standard - What companies actually use



### Dataset
A Dataset is a **strongly-typed, object-oriented API** available in **Scala and Java** (not Python/PySpark). It combines the benefits of RDDs (type safety) with the benefits of DataFrames (optimization).

If you're using PySpark (Python):

- ❌ **Datasets are NOT available in Python**
- ✅ DataFrames are your Dataset equivalent
- ✅ Focus on DataFrames, they're what you need

**Why no Datasets in Python?**

- **Python is dynamically typed** (no compile-time type checking)
- Datasets require **static typing** (Scala/Java feature)
- PySpark DataFrames already provide what you need

### Comparison

| Feature | RDD | DataFrame | Dataset |
|---------|-----|-----------|---------|
| **API Level** | Low-level | High-level | High-level |
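| **Type Safety** | Compile-time in Scala/Java only | No (errors surface at runtime) | Yes (Scala/Java) |
| **Optimization** | Manual | Catalyst Optimizer | Catalyst Optimizer |
| **Ease of Use** | Complex | Easy (SQL-like) | Easy |
| **Performance** | Slower | Faster | Comparable to DataFrame; typed lambda operations can be slower |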
| **Language** | All | All | Scala/Java only |

**Recommendation**: Use DataFrames for most PySpark work.

---



# ETL Process with PySpark

**ETL** stands for Extract, Transform, Load, which is the core process for data pipeline development.

```text
┌─────────┐     ┌───────────┐     ┌──────┐
│ Extract │ --> │ Transform │ --> │ Load │
└─────────┘     └───────────┘     └──────┘
```


## 1. Extract Phase

Reading data from various sources into Spark DataFrames.

### Reading CSV Files
```python
# Basic CSV read
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Advanced CSV read with options
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .option("quote", "\"") \
    .option("escape", "\\") \
    .option("nullValue", "NA") \
    .csv("path/to/file.csv")
```

### Reading JSON Files
```python
# Single-line JSON
df = spark.read.json("path/to/file.json")

# Multi-line JSON
df = spark.read \
    .option("multiline", "true") \
    .json("path/to/file.json")
```

### Reading Parquet Files
```python
# Parquet is the preferred format for Spark
df = spark.read.parquet("path/to/file.parquet")
```

### Reading from Databases (JDBC)
```python
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/database") \
    .option("dbtable", "schema.table_name") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
```

### Reading Delta Lake Tables
```python
# Delta Lake provides ACID transactions
df = spark.read.format("delta").load("/path/to/delta-table")

# Or using table name
df = spark.table("database.table_name")
```



## 2. Transform Phase

Applying business logic and data transformations.



### Basic Transformations

```python
# Select columns
df_selected = df.select("col1", "col2", "col3")

# Filter rows
df_filtered = df.filter(df["age"] > 18)
df_filtered = df.where(df["age"] > 18)  # Same as filter

# Add new column
from pyspark.sql.functions import col, lit

df_with_new = df.withColumn("new_col", col("existing_col") * 2)
df_with_constant = df.withColumn("status", lit("active"))

# Rename column
df_renamed = df.withColumnRenamed("old_name", "new_name")

# Drop columns
df_dropped = df.drop("col1", "col2")

# Drop duplicates
df_unique = df.dropDuplicates()
df_unique_subset = df.dropDuplicates(["col1", "col2"])
```



### Working with Null Values

```python
# Drop rows with null values
df_no_nulls = df.dropna()  # Drop rows with any null
df_no_nulls = df.dropna(how="all")  # Drop only if all columns are null
df_no_nulls = df.dropna(subset=["col1", "col2"])  # Check specific columns

# Fill null values
df_filled = df.fillna(0)  # Fill all numeric nulls with 0
df_filled = df.fillna({"col1": 0, "col2": "Unknown"})  # Column-specific
```



### Aggregations

```python
# Note: importing sum, max, and min this way shadows Python's built-ins of the same name
from pyspark.sql.functions import count, sum, avg, max, min, stddev

# Group by and aggregate
df_agg = df.groupBy("category") \
    .agg(
        count("*").alias("count"),
        sum("amount").alias("total_amount"),
        avg("amount").alias("avg_amount"),
        max("amount").alias("max_amount"),
        min("amount").alias("min_amount")
    )

# Multiple group by columns
df_multi_group = df.groupBy("category", "region") \
    .agg(sum("sales").alias("total_sales"))
```



### Joins

```python
# Inner join (default)
df_joined = df1.join(df2, df1["id"] == df2["id"], "inner")

# Left outer join
df_left = df1.join(df2, df1["id"] == df2["id"], "left")

# Right outer join
df_right = df1.join(df2, df1["id"] == df2["id"], "right")

# Full outer join
df_full = df1.join(df2, df1["id"] == df2["id"], "outer")

# Join on multiple columns
df_joined = df1.join(df2, 
    (df1["id"] == df2["id"]) & (df1["date"] == df2["date"]),
    "inner"
)

# Join on column name (if same name in both DataFrames)
df_joined = df1.join(df2, "id", "inner")
```



### Window Functions

Window functions are special functions that **perform calculations across a set of rows that are related to the current row**, without collapsing the rows like groupBy() does.  

Think of a **window** as a "frame" or "view" of related rows:
```text
Original DataFrame:
┌──────────┬───────┬──────┐
│department│ name  │salary│
├──────────┼───────┼──────┤
│Sales     │Alice  │ 5000 │  ┐
│Sales     │Bob    │ 6000 │  ├─ Sales Window
│Sales     │Charlie│ 5500 │  ┘
│Marketing │Diana  │ 7000 │  ┐
│Marketing │Eve    │ 6500 │  ┘─ Marketing Window
└──────────┴───────┴──────┘

with window: 
df_with_avg = df.withColumn('dept_avg', avg('salary').over(Window.partitionBy('department')))

+----------+-------+------+--------+
|department|   name|salary|dept_avg|
+----------+-------+------+--------+
| Marketing|  Diana|  7000|  6750.0|
| Marketing|    Eve|  6500|  6750.0|
|     Sales|  Alice|  5000|  5500.0|
|     Sales|    Bob|  6000|  5500.0|
|     Sales|Charlie|  5500|  5500.0|
+----------+-------+------+--------+

without window:
dept_avg = df.groupBy('department').agg(avg('salary').alias('dept_avg'))

+----------+--------+
|department|dept_avg|
+----------+--------+
|     Sales|  5500.0|
| Marketing|  6750.0|
+----------+--------+
```

A window function is calculated separately within each window.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lead, lag

# Define window specification
window_spec = Window.partitionBy("category").orderBy("amount")

# Add row number
df_with_row = df.withColumn("row_num", row_number().over(window_spec))

# Ranking functions
df_ranked = df.withColumn("rank", rank().over(window_spec))
df_dense = df.withColumn("dense_rank", dense_rank().over(window_spec))

# Lead and lag (access next/previous row)
df_with_lead = df.withColumn("next_amount", lead("amount", 1).over(window_spec))
df_with_lag = df.withColumn("prev_amount", lag("amount", 1).over(window_spec))

# Cumulative sum
from pyspark.sql.functions import sum as _sum

window_cumulative = Window.partitionBy("category").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_cumsum = df.withColumn("cumulative_sum", _sum("amount").over(window_cumulative))
```



### User-Defined Functions (UDFs)
UDFs allow you to create custom functions in Python and use them on Spark DataFrames when built-in Spark functions aren't enough.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, IntegerType

# Define Python function
def categorize_age(age):
    if age < 18:
        return "Minor"
    elif age < 65:
        return "Adult"
    else:
        return "Senior"

# Register as UDF
categorize_udf = udf(categorize_age, StringType())

# Apply UDF
df_categorized = df.withColumn("age_category", categorize_udf(col("age")))

# Alternative: Use pandas UDF for better performance
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(StringType())
def categorize_age_pandas(age: pd.Series) -> pd.Series:
    return age.apply(lambda x: "Minor" if x < 18 else ("Adult" if x < 65 else "Senior"))

df_categorized = df.withColumn("age_category", categorize_age_pandas(col("age")))
```



## 3. Load Phase

Writing transformed data to target destinations.

### Writing to CSV
```python
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("path/to/output")
```

### Writing to Parquet
```python
# Single file (coalesce to 1 partition)
df.coalesce(1).write \
    .mode("overwrite") \
    .parquet("path/to/output")

# Partitioned by column
df.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("path/to/output")
```

### Writing to Delta Lake
```python
# Overwrite
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/path/to/delta-table")

# Append
df.write \
    .format("delta") \
    .mode("append") \
    .save("/path/to/delta-table")

# Save as table
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("database.table_name")
```

### Writing to Databases
```python
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/database") \
    .option("dbtable", "schema.table_name") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()
```

**Write Modes**:
- `overwrite`: Replace existing data
- `append`: Add to existing data
- `ignore`: Do nothing if data exists
- `error` or `errorifexists` (default): Throw error if data exists

---



# PySpark DataFrame Operations


## Creating DataFrames

### From Python Lists
```python
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]

df = spark.createDataFrame(data, columns)
```

### From Pandas DataFrame
```python
import pandas as pd

pandas_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

spark_df = spark.createDataFrame(pandas_df)
```

### With Schema Definition
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("city", StringType(), nullable=True)
])

data = [("Alice", 25, "NYC"), ("Bob", 30, "LA")]
df = spark.createDataFrame(data, schema)
```



## Inspecting DataFrames

```python
# Show first n rows
df.show(10)
df.show(10, truncate=False)  # Don't truncate long strings

# Display schema
df.printSchema()

# Get column names
df.columns

# Get number of rows and columns
df.count()  # Number of rows
len(df.columns)  # Number of columns

# Summary statistics
df.describe().show()
df.summary().show()  # More detailed than describe()

# Display first n rows as list of Row objects
df.head(5)
df.take(5)

# First row
df.first()
```



## Column Operations

```python
from pyspark.sql.functions import col, expr, when, lit
from pyspark.sql.types import DoubleType

# Select columns
df.select("name", "age")
df.select(col("name"), col("age"))

# Select with expressions
df.select(
    col("name"),
    (col("age") + 5).alias("age_plus_5"),
    expr("age * 2 as age_doubled")
)

# Conditional column
df.withColumn("age_group",
    when(col("age") < 18, "Minor")
    .when((col("age") >= 18) & (col("age") < 65), "Adult")
    .otherwise("Senior")
)

# Cast column type
df.withColumn("age", col("age").cast("double"))
df.withColumn("age", col("age").cast(DoubleType()))
```



## String Operations

```python
from pyspark.sql.functions import (
    lower, upper, initcap, trim, ltrim, rtrim,
    length, substring, concat, concat_ws,
    split, regexp_replace, regexp_extract
)

# Case conversion
df.withColumn("name_lower", lower(col("name")))
df.withColumn("name_upper", upper(col("name")))
df.withColumn("name_title", initcap(col("name")))  # Title Case

# Trim whitespace
df.withColumn("name_trimmed", trim(col("name")))

# String length
df.withColumn("name_length", length(col("name")))

# Substring (1-indexed)
df.withColumn("name_first_3", substring(col("name"), 1, 3))

# Concatenation
df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
df.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))

# Split string into array
df.withColumn("name_parts", split(col("name"), " "))

# Regex replace
df.withColumn("phone_clean", regexp_replace(col("phone"), "[^0-9]", ""))

# Regex extract
df.withColumn("area_code", regexp_extract(col("phone"), r"(\d{3})-\d{3}-\d{4}", 1))
```



## Date and Time Operations

```python
from pyspark.sql.functions import (
    current_date, current_timestamp, date_format,
    year, month, dayofmonth, dayofweek, dayofyear,
    hour, minute, second,
    datediff, months_between, add_months, date_add, date_sub,
    to_date, to_timestamp, unix_timestamp, from_unixtime
)

# Current date and timestamp
df.withColumn("current_date", current_date())
df.withColumn("current_time", current_timestamp())

# Extract date parts
df.withColumn("year", year(col("date")))
df.withColumn("month", month(col("date")))
df.withColumn("day", dayofmonth(col("date")))
df.withColumn("day_of_week", dayofweek(col("date")))

# Date arithmetic
df.withColumn("date_plus_7", date_add(col("date"), 7))
df.withColumn("date_minus_30", date_sub(col("date"), 30))
df.withColumn("date_plus_2_months", add_months(col("date"), 2))

# Date difference
df.withColumn("days_diff", datediff(col("end_date"), col("start_date")))
df.withColumn("months_diff", months_between(col("end_date"), col("start_date")))

# String to date/timestamp
df.withColumn("date", to_date(col("date_string"), "yyyy-MM-dd"))
df.withColumn("timestamp", to_timestamp(col("timestamp_string"), "yyyy-MM-dd HH:mm:ss"))

# Format date
df.withColumn("formatted_date", date_format(col("date"), "MMM dd, yyyy"))
```



## Array and Map Operations

```python
from pyspark.sql.functions import (
    array, array_contains, col, explode, lit, size,
    map_keys, map_values
)

# Create array column
df.withColumn("numbers", array(lit(1), lit(2), lit(3)))

# Check if array contains value
df.withColumn("has_value", array_contains(col("array_col"), "value"))

# Array size
df.withColumn("array_size", size(col("array_col")))

# Explode array (creates new row for each element)
df.withColumn("exploded", explode(col("array_col")))

# Work with maps
df.withColumn("keys", map_keys(col("map_col")))
df.withColumn("values", map_values(col("map_col")))
```

---



# Advanced PySpark Concepts



## Partitioning

Partitioning controls how data is distributed across the cluster.

```python
# Check number of partitions
df.rdd.getNumPartitions()

# Repartition (full shuffle)
df_repartitioned = df.repartition(10)
df_repartitioned = df.repartition(10, "column_name")  # Partition by column

# Coalesce (reduces partitions without full shuffle, more efficient)
df_coalesced = df.coalesce(5)

# Write with partitioning
df.write \
    .partitionBy("year", "month") \
    .parquet("path/to/output")
```

**When to use**:
- Use `repartition()` when increasing partitions or need even distribution
- Use `coalesce()` when decreasing partitions to save on shuffle



## Caching and Persistence

```python
# Cache with the default storage level (for DataFrames: memory, spilling to disk)
df_cached = df.cache()

# Persist with specific storage level
from pyspark import StorageLevel

df_persisted = df.persist(StorageLevel.MEMORY_AND_DISK)

# Commonly used storage levels in PySpark:
# - MEMORY_ONLY: Cache in memory only
# - MEMORY_AND_DISK: Cache in memory, spill to disk if needed
# - DISK_ONLY: Cache on disk only
# - MEMORY_ONLY_2, MEMORY_AND_DISK_2: As above, replicated on two nodes
# The *_SER levels from the Scala/Java API are not available in PySpark

# Unpersist (free up memory)
df_cached.unpersist()

# Check if cached
df_cached.is_cached
```

**Best practices**:
- Cache DataFrames that are reused multiple times
- Cache after expensive transformations
- Unpersist when no longer needed



## Broadcast Variables

Use for small datasets that need to be available on all nodes.

```python
from pyspark.sql.functions import broadcast

# Broadcast small DataFrame in join
large_df = spark.read.parquet("large_data.parquet")
small_df = spark.read.parquet("small_data.parquet")

# Broadcast join (more efficient)
result = large_df.join(broadcast(small_df), "key")

# Broadcast Python variables
broadcast_var = spark.sparkContext.broadcast({"key": "value"})
# Access: broadcast_var.value
```



## Accumulators

Shared variables for aggregating information across executors.

```python
# Create accumulator
counter = spark.sparkContext.accumulator(0)

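def process_row(row):
    # Executors can only add to an accumulator; no `global` statement is needed
    if row.age is not None and row.age > 30:
        counter.add(1)

# Update the accumulator inside an action: Spark applies updates made in
# actions exactly once, whereas updates made in transformations such as
# map() can be applied again if a task is retried
df.rdd.foreach(process_row)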

print(f"Count of rows with age > 30: {counter.value}")
```



## SQL Queries

```python
# Register DataFrame as temporary view
df.createOrReplaceTempView("people")

# Run SQL query
result = spark.sql("""
    SELECT 
        category,
        COUNT(*) as count,
        AVG(amount) as avg_amount
    FROM people
    WHERE age > 18
    GROUP BY category
    ORDER BY count DESC
""")

result.show()

# Global temporary view (shared across sessions within the same Spark application)
df.createGlobalTempView("global_people")
spark.sql("SELECT * FROM global_temp.global_people")
```



## Working with JSON Data

```python
from pyspark.sql.functions import col, from_json, get_json_object, json_tuple, struct, to_json

# Parse JSON string column
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

json_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df.withColumn("parsed", from_json(col("json_string"), json_schema))

# Convert to JSON string
df.withColumn("json_string", to_json(struct(col("name"), col("age"))))

# Extract specific field from JSON string
df.withColumn("name", get_json_object(col("json_string"), "$.name"))

# Extract multiple fields
df.select(json_tuple(col("json_string"), "name", "age"))
```



## Handling Nested Data

```python
from pyspark.sql.functions import col, explode

# Select nested fields
df.select("parent.child.grandchild")
df.select(col("parent.child.grandchild"))

# Flatten nested structure
df.select(
    col("id"),
    col("nested.field1").alias("field1"),
    col("nested.field2").alias("field2")
)

# Explode nested arrays
df.select("id", explode("array_field").alias("item"))
```

---



# Performance Optimization



## Catalyst Optimizer

Spark's Catalyst optimizer automatically optimizes logical and physical execution plans.

**Optimization techniques**:
1. Predicate pushdown: Push filters down to data source
2. Projection pruning: Only read required columns
3. Constant folding: Evaluate constant expressions at compile time
4. Common subexpression elimination: Reuse computed values



## Tungsten Execution Engine

Binary processing engine for improved performance:
- Memory management and binary processing
- Cache-aware computation
- Code generation for improved CPU efficiency



## Best Practices

### 1. Use Built-in Functions
```python
# Good: Use built-in functions
from pyspark.sql.functions import col, udf, upper
from pyspark.sql.types import StringType
df.withColumn("name_upper", upper(col("name")))

# Avoid: UDFs (slower due to serialization overhead)
def upper_udf(s):
    return s.upper()
udf_upper = udf(upper_udf, StringType())
df.withColumn("name_upper", udf_upper(col("name")))
```

### 2. Filter Early
```python
# Good: Filter before expensive operations
df.filter(col("age") > 18) \
  .join(other_df, "id") \
  .groupBy("category").count()

# Avoid: Filter after expensive operations (the join processes rows that are discarded anyway)
df.join(other_df, "id") \
  .filter(col("age") > 18) \
  .groupBy("category").count()
```

### 3. Partition Pruning
```python
# Write data partitioned by commonly filtered columns
df.write \
    .partitionBy("year", "month", "day") \
    .parquet("path/to/data")

# Read with partition filter (only scans relevant partitions)
df = spark.read.parquet("path/to/data") \
    .filter((col("year") == 2024) & (col("month") == 12))
```

### 4. Avoid Shuffles When Possible
```python
# Shuffle operations (expensive):
# - repartition()
# - groupBy()
# - join() (except broadcast join)
# - distinct()
# - orderBy() / sort()

# Minimize shuffles by:
# 1. Using broadcast joins for small tables
result = large_df.join(broadcast(small_df), "key")

# 2. Pre-partitioning data
df_partitioned = df.repartition("key")
# repartition() is itself a shuffle, but later groupBy/join operations on "key"
# can reuse this partitioning instead of shuffling again
```

### 5. Choose Right File Format
```python
# Parquet (recommended):
# - Columnar format
# - Excellent compression
# - Predicate pushdown support
# - Schema evolution

# Delta Lake (best for production):
# - ACID transactions
# - Time travel
# - Schema enforcement
# - Optimized writes

# CSV (avoid for large data):
# - No schema
# - Poor compression
# - Slow to parse
```

### 6. Optimize Joins
```python
# 1. Broadcast join for small tables (< 10MB)
df1.join(broadcast(df2), "key")

# 2. Bucket joins for large tables frequently joined
df1.write \
    .bucketBy(10, "key") \
    .sortBy("key") \
    .saveAsTable("bucketed_table1")

# 3. Sort-merge join (default for large tables)
# Ensure data is pre-sorted and partitioned by join key
```

### 7. Monitor Query Plans
```python
# View physical plan
df.explain()

# View parsed, analyzed, and optimized logical plans plus the physical plan
df.explain(mode="extended")

# View the optimized logical plan with statistics, where available
df.explain(mode="cost")
```



## Debugging Performance Issues

```python
# 1. Check data skew
df.groupBy("partition_col").count().show()

# 2. Monitor Spark UI
# - Access at port 4040 when a job is running (on Databricks, open the Spark UI from the cluster page)
# - Check stages, tasks, and executors
# - Look for stragglers (slow tasks)

# 3. Enable adaptive query execution (Spark 3.0+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

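# 4. Tune memory settings
# These are launch-time settings: set them in the cluster's Spark config
# (Databricks) or via spark-submit / SparkSession.builder before the session
# starts. Calling spark.conf.set() on them at runtime does not apply them.
#   spark.executor.memory  4g
#   spark.driver.memory    2g
#   spark.memory.fraction  0.8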
```



## Handling Data Skew

```python
# Problem: Some partitions have much more data than others

# Solution 1: Add salt to skewed keys
from pyspark.sql.functions import broadcast, col, concat, floor, lit, rand

df_salted = df.withColumn("salt", (floor(rand() * 10)).cast("int"))
df_salted = df_salted.withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))

# Solution 2: Broadcast join if one side is small
result = large_skewed_df.join(broadcast(small_df), "key")

# Solution 3: Repartition by multiple columns
df.repartition("col1", "col2")
```

---



# Common PySpark Patterns



## Reading Multiple Files
```python
# Read all CSV files in directory
df = spark.read.csv("path/to/directory/*.csv", header=True)

# Read with wildcard pattern
df = spark.read.parquet("path/*/year=2024/month=*/data.parquet")

# Union multiple DataFrames
from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2, df3]
combined = reduce(DataFrame.union, dfs)
```



## Incremental Processing
```python
# Read only new data
from datetime import datetime, timedelta

from pyspark.sql.functions import col

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

df_new = spark.read.parquet("path/to/data") \
    .filter(col("date") >= yesterday)

# Process and append (my_transformation is a placeholder for your own DataFrame -> DataFrame function)
df_processed = df_new.transform(my_transformation)
df_processed.write.mode("append").parquet("path/to/output")
```



## Error Handling
```python
try:
    df = spark.read.csv("path/to/file.csv", header=True)
    result = df.groupBy("category").count()
    result.write.mode("overwrite").parquet("output/path")
except Exception as e:
    print(f"Error occurred: {str(e)}")
    # Log error, send alert, etc.
finally:
    # Cleanup code
    spark.catalog.clearCache()
```

---



# Summary

## Key Takeaways

1. **Databricks** simplifies Spark operations with managed clusters, collaborative notebooks, and built-in optimization

2. **PySpark DataFrames** are the primary abstraction for distributed data processing, offering high-level APIs with automatic optimization

3. **ETL Pipeline** follows three phases:
   - **Extract**: Read data from various sources
   - **Transform**: Apply business logic and data transformations
   - **Load**: Write results to target destinations

4. **Performance** depends on:
   - Using built-in functions over UDFs
   - Filtering early and often
   - Choosing appropriate file formats (Parquet/Delta)
   - Managing partitions effectively
   - Minimizing shuffles

5. **Best Practices**:
   - Cache frequently used DataFrames
   - Use broadcast joins for small tables
   - Monitor Spark UI for bottlenecks
   - Handle data skew proactively
   - Leverage Delta Lake for production workloads

## Next Steps

1. Practice with real datasets on Databricks Community Edition
2. Explore Delta Lake for ACID transactions
3. Learn Spark Structured Streaming for real-time processing
4. Study MLlib for machine learning at scale
5. Master performance tuning and optimization techniques

---



# Additional Resources

- **Databricks Documentation**: https://docs.databricks.com/
- **PySpark API Reference**: https://spark.apache.org/docs/latest/api/python/
- **Apache Spark Guide**: https://spark.apache.org/docs/latest/
- **Databricks Academy**: Free courses and certifications
- **Community Forums**: https://community.databricks.com/

