# Exercise 1: Setting Up a Spark Session and Working with Delta Lake

**Objective**: Learn how to set up a Spark session with Delta Lake support, then write, read, update, append to, delete from, and vacuum a Delta table, including time travel to earlier versions.

## Prerequisites

- Apache Spark 3.4.x (Delta Lake 2.4.0, used in this exercise, targets Spark 3.4).
- Jupyter Notebook or any Python IDE.


## Step 1: Install Required Libraries

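Install the `delta-spark` package, which provides the Delta Lake Python API for PySpark. Pin it to the Delta release whose JAR is loaded in Step 2 (2.4.0 here), since the Python package and the JAR should come from the same release.


In [None]:
# Install delta-spark, pinned to match the Delta 2.4.0 JAR used in Step 2
!pip install delta-spark==2.4.0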
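
Because the Maven coordinate in Step 2 names a specific Delta release, it can help to confirm which versions pip actually installed. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and it reports `None` for any package that is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pyspark", "delta-spark")):
    """Return {package: version or None} for the given distribution names."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


print(installed_versions())
```

If the reported `delta-spark` version differs from the version in the Step 2 coordinate, align one with the other before starting the session.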



## Step 2: Import Libraries and Initialize Spark Session

Import the necessary libraries and initialize a Spark session with Delta Lake support.


In [None]:
from pyspark.sql import SparkSession

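# Initialize a Spark session with Delta Lake support
spark = (
    SparkSession.builder
    .appName("SimpleApp")
    # Download the Delta Lake JAR (Scala 2.12 build, Delta 2.4.0) at startup
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    # Register Delta's SQL extensions (for example, the VACUUM command)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    # Use Delta's catalog implementation for the default catalog
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)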
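
The `spark.jars.packages` value above hard-codes the artifact name, Scala version, and Delta version. As a sketch of how those three parts fit together, the hypothetical helper below builds the coordinate string. It assumes that Delta releases before 3.0 publish the artifact as `delta-core` and releases from 3.0 onward as `delta-spark`; the default Scala version of `2.12` is also an assumption and will not suit every release.

```python
def delta_maven_coordinate(delta_version, scala_version="2.12"):
    """Build the Maven coordinate for a Delta Lake release.

    Assumption: releases before 3.0 use the artifact name `delta-core`,
    releases from 3.0 onward use `delta-spark`.
    """
    major = int(delta_version.split(".")[0])
    artifact = "delta-core" if major < 3 else "delta-spark"
    return f"io.delta:{artifact}_{scala_version}:{delta_version}"


print(delta_maven_coordinate("2.4.0"))  # io.delta:delta-core_2.12:2.4.0
print(delta_maven_coordinate("3.2.0"))  # io.delta:delta-spark_2.12:3.2.0
```

The `delta-spark` pip package also ships `configure_spark_with_delta_pip`, which takes a session builder and adds the coordinate matching the installed package, so you may prefer it over building the string by hand.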


## Step 3: Create a Sample DataFrame

Let's create a simple DataFrame to work with.


In [None]:
# Sample data
data = [("Alice", 34), ("Bob", 36), ("Cathy", 30)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()
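
Passing only column names lets Spark infer the column types from the data, and Python integers are inferred as `bigint`. If you would rather state the types explicitly, `createDataFrame` also accepts a schema string. The sketch below wraps that in a hypothetical helper, `create_people_df`; it keeps `Age` as `BIGINT` so the later steps behave the same as with the inferred schema.

```python
def create_people_df(spark):
    """Create the sample DataFrame with an explicit schema string."""
    data = [("Alice", 34), ("Bob", 36), ("Cathy", 30)]
    # BIGINT matches the type Spark infers for Python ints.
    schema = "Name STRING, Age BIGINT"
    return spark.createDataFrame(data, schema=schema)
```

Call it as `df = create_people_df(spark)` and check the result with `df.printSchema()`.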


## Step 4: Write Data to Delta Lake

Write the DataFrame to a Delta table.


In [None]:
# Define Delta table path
delta_table_path = "/tmp/delta-table"

# Write DataFrame to Delta format
df.write.format("delta").mode("overwrite").save(delta_table_path)
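
A Delta table is a directory of Parquet data files plus a `_delta_log` folder, where each commit is a JSON file named with a zero-padded 20-digit version number. To see what the write produced, the sketch below lists the commit versions on disk. `list_commit_versions` is a hypothetical helper, and it assumes the table lives on the local filesystem, as `/tmp/delta-table` does here.

```python
import re
from pathlib import Path

# Commit files look like 00000000000000000000.json
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_commit_versions(table_path):
    """Return the sorted commit versions found in <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []  # not a Delta table, or nothing written yet
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_FILE.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


print(list_commit_versions("/tmp/delta-table"))
```

Each later step that modifies the table (update, append, delete) adds one more commit file, which is what makes time travel in Step 7 possible.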


## Step 5: Read Data from Delta Lake

Read the data back from the Delta table.


In [None]:
# Read Delta table
df_delta = spark.read.format("delta").load(delta_table_path)

# Show Delta table data
df_delta.show()


## Step 6: Update Data in Delta Lake

Update records in place with the `DeltaTable.update` method, which takes a row condition and a mapping of column names to new values.


In [None]:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Create DeltaTable object
delta_table = DeltaTable.forPath(spark, delta_table_path)

# Update age where Name is 'Alice'
delta_table.update(
    condition=col("Name") == "Alice",
    set={"Age": col("Age") + 1}
)

# Show updated data
delta_table.toDF().show()
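
`DeltaTable.update` also accepts SQL expression strings in place of `Column` objects, which avoids the `col` import for simple cases. A minimal sketch of the same update in that form, wrapped in a hypothetical helper named `bump_alice_age`:

```python
def bump_alice_age(delta_table):
    """Add 1 to Age for the row where Name is 'Alice', using SQL strings."""
    delta_table.update(
        condition="Name = 'Alice'",   # SQL predicate selecting the rows
        set={"Age": "Age + 1"},       # column name -> SQL expression
    )
```

Calling `bump_alice_age(delta_table)` would apply a second increment and create another table version, so run it only if you want that extra commit in the history.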


## Step 7: Time Travel in Delta Lake

Delta Lake allows you to query older snapshots of data using time travel.

### View Table History


In [None]:
# View Delta table history
delta_table.history().show()
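
The history comes from the commit files in `_delta_log`. Each commit file holds one JSON action per line, and the `commitInfo` action records details such as `operation` and `timestamp`. As a cross-check on `history()`, the sketch below reads that action directly. `read_commit_info` is a hypothetical helper that assumes a local filesystem path, and it returns `None` when a commit has no `commitInfo` action.

```python
import json
from pathlib import Path


def read_commit_info(table_path, version):
    """Return the commitInfo action of one commit file, or None if absent."""
    commit_file = Path(table_path) / "_delta_log" / f"{version:020d}.json"
    with commit_file.open() as handle:
        for line in handle:  # one JSON action per line
            if not line.strip():
                continue
            action = json.loads(line)
            if "commitInfo" in action:
                return action["commitInfo"]
    return None
```

For example, `read_commit_info(delta_table_path, 1)` should describe the update from Step 6, provided the table was first created in Step 4.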


### Read Previous Versions

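Read the table as of version 0. If `/tmp/delta-table` did not exist before Step 4, version 0 is the initial write, so Alice's age appears as 34 rather than the updated 35.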


In [None]:
# Read data from version 0
df_version0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)

# Show data from version 0
df_version0.show()
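
Besides `versionAsOf`, the Delta reader accepts a `timestampAsOf` option that selects the snapshot current at a given time. The sketch below wraps both in one hypothetical helper, `read_as_of`, which insists on exactly one of the two selectors.

```python
def read_as_of(spark, table_path, version=None, timestamp=None):
    """Load a Delta table snapshot by version number or by timestamp string."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    else:
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(table_path)
```

For example, `read_as_of(spark, delta_table_path, version=0).show()` matches the cell above. For the timestamp form, pass a time taken from the `timestamp` column of `delta_table.history()`.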


## Step 8: Append New Data to Delta Lake

Append new records to the Delta table.


In [None]:
# New data to append
new_data = [("David", 29), ("Eva", 32)]
new_df = spark.createDataFrame(new_data, columns)

# Append to Delta table
new_df.write.format("delta").mode("append").save(delta_table_path)

# Show updated table
delta_table.toDF().show()


## Step 9: Delete Data from Delta Lake

Delete records from the Delta table.


In [None]:
# Delete where Age is less than 32
delta_table.delete(condition=col("Age") < 32)

# Show data after deletion
delta_table.toDF().show()


## Step 10: Vacuum Old Data

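`VACUUM` permanently deletes data files that the current table version no longer references and that are older than the retention period (7 days by default). The cell below sets the retention to 0 hours, which requires disabling Delta's retention safety check and removes the files that time travel to earlier versions depends on.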


In [None]:
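# Disable the safety check that rejects retention periods shorter than the default
# (only do this on a table you can afford to lose history for)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)
# Delete every unreferenced data file immediately (retention of 0 hours)
delta_table.vacuum(0)

# Try to time travel to version 0 (expected to fail once its data files are deleted)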
try:
    df_version0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
    df_version0.show()
except Exception as e:
    print("Time travel failed:", e)
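
Before running a destructive vacuum like the one above, a dry run can list the files that would be deleted without removing anything. Delta's SQL syntax supports this with `DRY RUN`. The sketch below only builds the statement; `vacuum_sql` is a hypothetical helper, and it assumes a path-based table addressed with the `delta.` prefix.

```python
def vacuum_sql(table_path, retain_hours, dry_run=True):
    """Build a VACUUM statement for a path-based Delta table."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM delta.`{table_path}` RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"  # list files only, delete nothing
    return statement


print(vacuum_sql("/tmp/delta-table", 168))
# VACUUM delta.`/tmp/delta-table` RETAIN 168 HOURS DRY RUN
```

Pass the result to `spark.sql(...)` and call `.show(truncate=False)` on the returned DataFrame to inspect the file list. The 168 hours used here corresponds to the default 7-day retention.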


## Closing the Spark Session

When you are finished, stop the Spark session to release its resources.


In [None]:
spark.stop()
