# Introduction to Apache Spark
This beginner-friendly course covers the fundamentals of Apache Spark for large-scale data processing. You will explore Spark’s
distributed architecture, master the DataFrame API, and learn to read, write, and process data using Python. Through hands-on
exercises, you will build the skills needed to execute Spark transformations and actions efficiently.

---

### Prerequisites
You should meet the following prerequisites before starting this course:

- Basic programming knowledge
- Familiarity with Python
- Basic understanding of SQL queries (`SELECT`, `JOIN`, `GROUP BY`)
- Familiarity with data processing concepts
- No prior Spark or Databricks experience required

---

# Before getting started - Select your notebook compute

Before executing cells in this notebook, select the compute for this lab. **Serverless** is enabled by default.

Navigate to the top-right of this notebook and click the *Connect* drop-down menu to select your compute.

> **NOTE:** Once you have completed this lab, terminate your cluster:
>    Navigate to the top-right of this notebook and click the *Connected* drop-down menu to select your connected cluster. 
>    When hovering over the cluster, options will appear; select *Terminate*.

# 1 - Exploring Spark Architecture in Databricks

In this demonstration, we'll explore how Spark's architecture components manifest in a Databricks environment and how to monitor them using the various UIs available to us.

### Objectives
- Identify key Spark architecture components in a Databricks cluster
- Navigate the Spark UI to monitor application execution
- Understand how Databricks implements Spark's cluster management

## Introduction to Notebooks

We will run all of the demos and exercises for this course in notebooks. Notebooks are interactive documents that combine live code, visualizations, narrative text, and outputs in a single document. In Databricks, notebooks provide a powerful environment for data exploration, analysis, and collaboration.

Key features of Databricks notebooks:
- Code execution: Run code in multiple languages (Python, SQL, R, Scala)
- Rich text formatting: Support for Markdown to create well-documented analyses
- Cell-based structure: Code and content are organized into executable cells
- Interactive visualizations: Direct plotting and charting of results
- Collaboration: Share and work together on notebooks in real-time

The default language for a notebook is set when you create it, as indicated by the __Python__ selection in the toolbar above☝️. You can change the default language through the notebook settings. Additionally, you can use magic commands to execute code in different languages within the same notebook:

- Use `%python` to execute Python code
- Use `%sql` to execute SQL queries
- Use `%r` to execute R code
- Use `%scala` to execute Scala code

The `%run` magic command is particularly useful: it executes another notebook from within your current notebook, enabling modular code organization and reuse. For example, `%run /path/to/another/notebook`.

This will execute all the cells in the referenced notebook as if they were part of your current notebook.

## The SparkSession and SparkContext

The SparkSession is automatically instantiated as `spark` in Databricks notebooks connected to a cluster.  The SparkContext is available via the SparkSession using `spark.sparkContext` or simply `sc`.

In [0]:
spark

In [0]:
sc

On serverless compute, directly accessing the underlying Spark driver JVM through the `sparkContext` attribute is not supported.

If you require direct access to these fields, consider using a single-user cluster. For more details on compatibility and limitations, check: https://docs.databricks.com/release-notes/serverless.html#limitations


## Creating and Monitoring a Spark Job

Let's create a simple job that will help us visualize the execution flow. Run the cell below to create a Spark job.

**NOTE:** Don't focus too much on the code here. We want to focus on exploring the access logs and metrics.

In [0]:
# Create a large DataFrame to see parallelization in action
# Import only what this cell uses; note that Spark's sum shadows Python's built-in sum
from pyspark.sql.functions import col, sum

# Generate some data
df = spark.range(0, 1000000)
df = df.withColumn("square", col("id") * col("id"))

# Force multiple stages with a shuffle operation
result = df.groupBy(col("id") % 100).agg(sum("square").alias("sum_squares"))

# result = df.repartition(2)

result.collect()

In [0]:
# Force the computation with the 'count' action
print(f"Number of groups: {result.count()}")
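
As a sanity check on what the job above computes, the same grouping can be sketched in plain Python on a much smaller range. This is only an illustrative reference, not Spark code; the helper name `sum_squares_by_bucket` and the range of 1,000 are assumptions made for the example.

```python
from collections import defaultdict


def sum_squares_by_bucket(n, buckets=100):
    """Plain-Python reference for groupBy(id % buckets).agg(sum(id * id))."""
    totals = defaultdict(int)
    for i in range(n):
        # Each id lands in the group given by id % buckets
        totals[i % buckets] += i * i
    return dict(totals)


small = sum_squares_by_bucket(1_000)
print(f"Number of groups: {len(small)}")
print(f"Sum of squares in group 0: {small[0]}")
```

Because the grouping key is `id % 100`, the number of groups is 100 for any range of at least 100 ids, which is the value the `count()` cell should report.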

### Exploring the Query Profile

> NOTE: The Spark UI is not available on serverless compute. Instead, use the query profile to view information about your Spark queries.

Click on the __See Performance__ link at the bottom of the last cell, then click on the executed statement. A query details panel appears on the right side of the screen.

Explore the [Query profile](https://docs.databricks.com/aws/en/sql/user/queries/query-profile).

# 2 - Reading and Writing Data with DataFrames

This demonstration will introduce you to Spark DataFrames by reading data from an external source into a DataFrame and then writing the DataFrame out to another location.

### Objectives
- Read external data into a DataFrame
- Create and inspect DataFrame schemas
- Write the contents of a DataFrame to a specified location


## Reading Data 

Let's start by checking your current catalog and schema:

In [0]:
%sql
select current_catalog(),current_schema()

Now, let's create a DataFrame by reading a directory of CSV files from the previously created volume that holds the Marketplace data.

In [0]:
# Read the CSV files in the source_files directory with schema inference
customers_df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/Volumes/databricks_simulated_retail_customer_data/v01/source_files")

In [0]:
# Show the inferred schema for customers_df
customers_df.printSchema()

> __NOTE:__ When the `header` option is `true`, column names are taken from the first row of the file. Without a header, columns are named **_c0**, **_c1**, etc., and must be renamed manually.
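
The naming rule in the note above can be illustrated with a small plain-Python sketch. It is not Spark code, and the helper names `default_column_names` and `rename_map` are hypothetical; the three target names are taken from the customer schema used later in this notebook.

```python
def default_column_names(n):
    """Names Spark assigns to n columns when a CSV is read without a header."""
    return [f"_c{i}" for i in range(n)]


def rename_map(new_names):
    """Pair each default name with the name we actually want, by position."""
    return dict(zip(default_column_names(len(new_names)), new_names))


mapping = rename_map(["customer_id", "tax_id", "tax_code"])
print(mapping)
```

In Spark, a mapping like this could be applied with `DataFrame.withColumnsRenamed(mapping)` (available from Spark 3.4), or the new names could be passed positionally to `DataFrame.toDF(*names)`.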

In [0]:
# Alternatively, print the schema as a single-line StructType representation
print(customers_df.schema)

## `display()` function

The `display()` function is available in Databricks notebooks.

`display()` triggers a Spark action and renders the results of a DataFrame in a formatted table. Because it is intended for visual inspection only, results are limited to 10,000 rows or 2 MB, whichever is less. Additional features of the `display()` function include:

- Ability to sort or filter data in the result set
- Ability to download result data to a CSV file
- Ability to profile data using the __+__ -> __Data Profile__ option
- Ability to visualize data using the __+__ -> __Visualization__ option


In [0]:
# Use the display() function to preview the data in the DataFrame
display(customers_df)

### Explicitly Defining a Schema

Let's load the same dataset into a DataFrame. This time, we will explicitly define the schema.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# or ...
from pyspark.sql.types import *

# Define explicit schema using StructType
customer_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("tax_id", StringType(), True),
    StructField("tax_code", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("state", StringType(), True),
    StructField("city", StringType(), True),
    StructField("postcode", StringType(), True),
    StructField("street", StringType(), True),
    StructField("number", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("region", StringType(), True),
    StructField("district", StringType(), True),
    StructField("lon", DoubleType(), True),
    StructField("lat", DoubleType(), True),
    StructField("ship_to_address", StringType(), True),
    StructField("valid_from", IntegerType(), True),
    StructField("valid_to", IntegerType(), True),
    StructField("units_purchased", IntegerType(), True),
    StructField("loyalty_segment", IntegerType(), True)])

# Read the customers.csv file with explicit StructType schema
customers_structtype_df = spark.read.format("csv") \
  .option("header", "true") \
  .schema(customer_schema) \
  .load("/Volumes/databricks_simulated_retail_customer_data/v01/source_files/customers.csv")

In [0]:
# Examine the explicit schema
customers_structtype_df.printSchema()

### Using a DDL Schema
Now let's define the same schema as a DDL string, which is often easier to read.

In [0]:
ddl_schema = """
  customer_id INT NOT NULL,
  tax_id STRING,
  tax_code STRING,
  customer_name STRING,
  state STRING,
  city STRING,
  postcode STRING,
  street STRING,
  number STRING,
  unit STRING,
  region STRING,
  district STRING,
  lon DOUBLE,
  lat DOUBLE,
  ship_to_address STRING,
  valid_from INT,
  valid_to INT,
  units_purchased INT,
  loyalty_segment INT
"""

In [0]:
customers_ddl_df = spark.read.format("csv") \
  .option("header", "true") \
  .schema(ddl_schema) \
  .load("/Volumes/databricks_simulated_retail_customer_data/v01/source_files/customers.csv")

In [0]:
customers_ddl_df.printSchema()

## Writing Data From DataFrames

Now let's write out the contents of a DataFrame to different output locations.

### Writing the contents of a DataFrame to a File System
We can write the contents of our DataFrame to a filesystem in various formats using the `write` and `save` methods as shown here.

In [0]:
# Define an output path
parquet_output_volume_path = "/Volumes/workspace/default/v01/customers_parquet"

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS v01

In [0]:
# Write the DataFrame out as Parquet files to a directory
customers_ddl_df.write.format("parquet") \
  .mode("overwrite") \
  .save(parquet_output_volume_path)

In [0]:
# Show the files in the directory
display(dbutils.fs.ls(parquet_output_volume_path))

### Writing the contents of a DataFrame to a Table
We can also save our DataFrame to a table (defined in a catalog and schema in Unity Catalog).

In [0]:
# Writing our DataFrame to a new table (this is an action)
customers_ddl_df.write.saveAsTable("customers_ddl_df_table")

In [0]:
# Alternatively, you can use the writeTo method, which invokes the DataFrameWriterV2
customers_ddl_df.writeTo("customers_ddl_df_table").createOrReplace()
# writeTo also lets you partition the table, and append, overwrite, or createOrReplace

In [0]:
%sql
-- Read back the data in the table
SELECT * FROM customers_ddl_df_table

## Summary

1. **Reading Data into DataFrames**:
   - Schema inference can be used for CSV files
   - The schema can also be explicitly defined (preferred) using StructType definitions or using a DDL schema
   
2. **Writing Data from DataFrames**:
   - DataFrames can be written out to a distributed filesystem (like a Unity Catalog volume)
   - DataFrames can also be written out to tables in Unity Catalog

# 3 - Flight Data ETL with the DataFrame API

This demonstration will walk through common ETL operations using the Flights dataset. We'll cover data loading, cleaning, transformation, and analysis using the DataFrame API.

### Objectives
- Implement common ETL operations using Spark DataFrames
- Handle data cleaning and type conversion
- Create derived features through transformations
- Use different column reference methods
- Work with User Defined Functions (UDFs)

### Flight Data Processing Requirements

#### Source Data
Dataset Location: `databricks_airline_performance_data.v01.flights_small` (flight information dataset)

#### Target
Table name: flight_data

Schema:

| Column Name | Data Type | Description |
|-------------|-----------|-------------|
| FlightDateTime | datetime | Datetime of the flight (derived from the Year, Month, DayofMonth, DepTime fields in the source data) |
| FlightNum | integer | Flight number |
| ElapsedTimeDiff | integer | Difference between scheduled elapsed time and actual elapsed time for the flight, derived from the ActualElapsedTime and CRSElapsedTime fields in the source data |
| ArrDelayCategory | string | Categories include "On Time", "Slight Delay", "Moderate Delay" and "Severe Delay" based upon the value of the ArrDelay in the source data |

## Data Loading and Inspection

First, let's load and inspect the flight data.

In [0]:
# Read the flights data
flights_df = spark.read.table("databricks_airline_performance_data.v01.flights_small")

In [0]:
# Print the schema
flights_df.printSchema()

In [0]:
# Visually inspect a subset of the data
display(flights_df.limit(10))

In [0]:
# Let's remove columns we don't need; remember "filter early, filter often"
flights_required_cols_df = flights_df.select(
    "Year",
    "Month",
    "DayofMonth",
    "DepTime",
    "FlightNum",
    "ActualElapsedTime",
    "CRSElapsedTime",
    "ArrDelay")

# Alternatively, we could have used the drop() method to remove the columns we didn't want

In [0]:
# Get a count of the source data records
initial_count = flights_required_cols_df.count()

print(f"Source data has {initial_count} records")

In [0]:
# Let's examine the data for invalid values (nulls or non-numeric strings) in the string columns
# "ArrDelay", "ActualElapsedTime" and "DepTime", which we intend to use in mathematical operations.
# TRY_CAST turns values that cannot be cast into NULL, and the Spark SQL COUNT_IF function counts them.

# Register the DataFrame as a temporary view with the columns cast
flights_required_cols_df \
    .selectExpr(
        "Year",
        "Month",
        "DayofMonth",
        "TRY_CAST(DepTime AS INT) AS DepTime",
        "FlightNum",
        "TRY_CAST(ActualElapsedTime AS INT) AS ActualElapsedTime",
        "CRSElapsedTime",
        "TRY_CAST(ArrDelay AS INT) AS ArrDelay"
    ) \
    .createOrReplaceTempView("flights_temp")

In [0]:
# Use Spark SQL to count null values
invalid_counts_sql = spark.sql("""
SELECT 
    COUNT_IF(Year IS NULL) AS Null_Year_Count,
    COUNT_IF(Month IS NULL) AS Null_Month_Count,
    COUNT_IF(DayofMonth IS NULL) AS Null_DayOfMonth_Count,
    COUNT_IF(DepTime IS NULL) AS Null_DepTime_Count,
    COUNT_IF(FlightNum IS NULL) AS Null_FlightNum_Count,
    COUNT_IF(ActualElapsedTime IS NULL) AS Null_ActualElapsedTime_Count,
    COUNT_IF(CRSElapsedTime IS NULL) AS Null_CRSElapsedTime_Count,
    COUNT_IF(ArrDelay IS NULL) AS Null_ArrDelay_Count
FROM flights_temp
""")

display(invalid_counts_sql)
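
The idea behind the `TRY_CAST` plus `COUNT_IF` check can be sketched in plain Python. This is only an approximation for illustration: Spark's casting rules differ in edge cases, the helper names `try_cast_int` and `count_if` are hypothetical, and the sample values are made up rather than taken from the flights table.

```python
def try_cast_int(value):
    """Return the value as an int, or None when it is missing or not an integer string."""
    if value is None:
        return None
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def count_if(values, predicate):
    """Count the values for which the predicate is true."""
    return sum(1 for v in values if predicate(v))


raw_arr_delay = ["12", "NA", "-3", None, ""]  # made-up sample values
cast_values = [try_cast_int(v) for v in raw_arr_delay]
null_count = count_if(cast_values, lambda v: v is None)
print(cast_values, null_count)
```

Values that cannot be cast become `None`, so a single null count covers both genuinely missing values and invalid strings.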

### Comparing Spark SQL to DataFrame API Operations
Spark SQL queries and their equivalent DataFrame API operations go through the same optimizer and are compiled into physical plans. Let's compare the plans for the two versions of our query.

In [0]:
# this is the equivalent of the preceding Spark SQL query using the DataFrame API
from pyspark.sql.functions import col, sum, when

# Make sure to work with the same temporary view that the SQL is using
flights_temp_df = spark.table("flights_temp")

# Use DataFrame API to count null values
invalid_counts_df = flights_temp_df.select(
    sum(when(col("Year").isNull(), 1).otherwise(0)).alias("Null_Year_Count"),
    sum(when(col("Month").isNull(), 1).otherwise(0)).alias("Null_Month_Count"),
    sum(when(col("DayofMonth").isNull(), 1).otherwise(0)).alias("Null_DayOfMonth_Count"),
    sum(when(col("DepTime").isNull(), 1).otherwise(0)).alias("Null_DepTime_Count"),
    sum(when(col("FlightNum").isNull(), 1).otherwise(0)).alias("Null_FlightNum_Count"),
    sum(when(col("ActualElapsedTime").isNull(), 1).otherwise(0)).alias("Null_ActualElapsedTime_Count"),
    sum(when(col("CRSElapsedTime").isNull(), 1).otherwise(0)).alias("Null_CRSElapsedTime_Count"),
    sum(when(col("ArrDelay").isNull(), 1).otherwise(0)).alias("Null_ArrDelay_Count")
)

display(invalid_counts_df)

In [0]:
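# Print the physical plan for the SQL version of our query.
# Note: explain() prints the plan and returns None, so there is no result to assign
invalid_counts_sql.explain()

In [0]:
# Print the physical plan for the DataFrame API version
invalid_counts_df.explain()

In [0]:
# Compare the two printed plans above by reading them side by side.
# Comparing the return values of explain() with == would only compare None with None,
# which is always True and says nothing about the plans.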
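
`DataFrame.explain()` prints the plan and returns `None`, so a programmatic comparison needs the printed text. One possible approach, sketched here under assumptions, is to capture standard output and strip the expression IDs (such as `#12L`) that differ between queries. The helper `captured_plan` and the `FakeFrame` stand-in are hypothetical, and the plan strings are made-up placeholders rather than real Spark output.

```python
import io
import re
from contextlib import redirect_stdout


def captured_plan(df):
    """Capture the text printed by df.explain() and drop expression IDs like #12L."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return re.sub(r"#\d+L?", "", buffer.getvalue())


class FakeFrame:
    """Stand-in for a DataFrame so this sketch runs without a cluster."""

    def __init__(self, plan_text):
        self._plan_text = plan_text

    def explain(self):
        print(self._plan_text)


plan_a = captured_plan(FakeFrame("HashAggregate(keys=[], functions=[sum(x#12L)])"))
plan_b = captured_plan(FakeFrame("HashAggregate(keys=[], functions=[sum(x#98L)])"))
print(plan_a == plan_b)
```

With real DataFrames you would pass `invalid_counts_sql` and `invalid_counts_df` instead. Even after stripping IDs, the two plans may still differ in how the aggregate expressions are written, so treat a mismatch as a prompt to read the plans rather than as proof that they are unrelated.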

### Using the Databricks AI Assistant
The Databricks AI Assistant can generate code or visualize metrics from DataFrames. In the code cell below, click the __generate__ link and enter:

```generate a bar chart showing nulls for each column in the flights_temp_df dataframe```

**NOTE:** Click the AI Assistant toggle button and enter the prompt above.

In [0]:
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
from pyspark.sql.functions import *

# Add your AI generated code from above cell


## Data Cleaning

The flights data contains some invalid and missing values. Let's find them and clean them (in this case, we will drop the affected rows).

In [0]:
# To drop rows where any specified columns are null, we can use the na.drop DataFrame method
non_null_flights_df = flights_required_cols_df.na.drop(
    how='any',
    subset=['CRSElapsedTime']
)

In [0]:
from pyspark.sql.functions import col

# Let's remove rows with invalid values for "ArrDelay", "ActualElapsedTime" and "DepTime" columns
flights_with_valid_data_df = non_null_flights_df.filter(
    col("ArrDelay").try_cast("integer").isNotNull() & 
    col("ActualElapsedTime").try_cast("integer").isNotNull() &
    col("DepTime").try_cast("integer").isNotNull()
)

In [0]:
# Now that we know "ArrDelay" and "ActualElapsedTime" contain integer values only,
# let's cast them from strings to integers (replacing the existing columns)
clean_flights_df = flights_with_valid_data_df \
    .withColumn("ArrDelay", col("ArrDelay").cast("integer")) \
    .withColumn("ActualElapsedTime", col("ActualElapsedTime").cast("integer"))

clean_flights_df.printSchema()

## Data Enrichment

Now let's create the derived columns required by the target schema: the flight datetime, the elapsed time difference, and a delay category.

In [0]:
# Let's start by deriving the "FlightDateTime" column from the "Year", "Month", "DayofMonth", "DepTime" columns, then drop the constituent columns
from pyspark.sql.functions import col, make_timestamp_ntz, lpad, substr, lit, pmod

flights_with_datetime_df = clean_flights_df.withColumn(
    "FlightDateTime",
    make_timestamp_ntz(
        col("Year"),
        col("Month"),
        col("DayofMonth"),
        pmod(substr(lpad(col("DepTime"), 4, "0"), lit(1), lit(2)).try_cast("integer"), lit(24)),
        substr(lpad(col("DepTime"), 4, "0"), lit(3), lit(2)).try_cast("integer"),
        lit(0)
    )
).drop("Year", "Month", "DayofMonth", "DepTime")

# Show the result
display(flights_with_datetime_df.limit(10))
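
The `lpad` / `substr` / `pmod` steps above can be traced with a plain-Python sketch. It mirrors the logic for illustration only; the function name `flight_datetime` and the sample dates are assumptions.

```python
from datetime import datetime


def flight_datetime(year, month, day, dep_time):
    """Build a datetime from date parts and an HHMM departure time."""
    hhmm = str(dep_time).rjust(4, "0")  # like lpad(DepTime, 4, "0"): 5 -> "0005"
    hour = int(hhmm[0:2]) % 24          # like pmod(hour, 24): hour 24 wraps to 0
    minute = int(hhmm[2:4])
    return datetime(year, month, day, hour, minute, 0)


early = flight_datetime(2008, 1, 3, 5)
afternoon = flight_datetime(2008, 1, 3, 1430)
midnight = flight_datetime(2008, 1, 3, 2400)
print(early, afternoon, midnight)
```

Note that a departure time of `2400` wraps to `00:00` on the same calendar day rather than rolling over to the next day, which is also how the `pmod` expression behaves.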

In [0]:
# Let's derive the "ElapsedTimeDiff" column from the "ActualElapsedTime" and "CRSElapsedTime" columns

from pyspark.sql.functions import col

flights_with_elapsed_time_diff_df = flights_with_datetime_df.withColumn(
    "ElapsedTimeDiff", col("ActualElapsedTime") - col("CRSElapsedTime")
    ).drop("ActualElapsedTime", "CRSElapsedTime")

display(flights_with_elapsed_time_diff_df.limit(10))

In [0]:
# Now let's bucket the "ArrDelay" column into categories: "On Time", "Slight Delay", "Moderate Delay", "Severe Delay"

from pyspark.sql.functions import when

enriched_flights_df = flights_with_elapsed_time_diff_df \
    .withColumn("delay_category", when(col("ArrDelay") <= 0, "On Time")
        .when(col("ArrDelay") <= 15, "Slight Delay")
        .when(col("ArrDelay") <= 60, "Moderate Delay")
        .otherwise("Severe Delay")) \
    .drop("ArrDelay")
    
display(enriched_flights_df.limit(10))
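
The chained `when` conditions are evaluated in order, so each threshold only applies to values that did not match an earlier branch. A plain-Python sketch of the same boundaries (the function name `delay_category` is hypothetical):

```python
def delay_category(arr_delay):
    """Mirror the when/otherwise chain: the first matching branch wins."""
    if arr_delay <= 0:
        return "On Time"
    if arr_delay <= 15:
        return "Slight Delay"
    if arr_delay <= 60:
        return "Moderate Delay"
    return "Severe Delay"


for minutes in (-4, 0, 15, 16, 60, 61):
    print(minutes, delay_category(minutes))
```

One difference worth keeping in mind: in Spark, a null `ArrDelay` matches none of the `when` conditions and falls through to `otherwise`, so it would be labeled "Severe Delay". That is one reason the rows with invalid `ArrDelay` values are removed before this step.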

## Analyze Delays

Let's analyze our delay categories using various column referencing approaches.

In [0]:
# Column name strings: list the first 100 records returned (not a random sample)
display(enriched_flights_df.select("FlightNum", "delay_category").limit(100))

In [0]:
# Column object
display(enriched_flights_df.select(col("FlightNum").alias("flight_number"), col("delay_category")).limit(100))

In [0]:
# String expressions
display(enriched_flights_df.selectExpr("FlightNum", "ElapsedTimeDiff", "ElapsedTimeDiff > 0 as LongerThanScheduled"))

## Working with UDFs

Let's use a vectorized (pandas) UDF to calculate the z-score (standard deviations from the mean) of the elapsed time difference for each flight.

In [0]:
from pyspark.sql.functions import pandas_udf

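# Pandas UDF (vectorized). Note: a Series-to-Series pandas UDF is called once per Arrow batch,
# so mean() and std() are computed per batch rather than over the whole DataFrame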
@pandas_udf("double")
def normalized_diff(diff_series):
    return (diff_series - diff_series.mean()) / diff_series.std()

# Apply the UDF
udf_example = enriched_flights_df \
    .withColumn("diff_normalized", normalized_diff("ElapsedTimeDiff"))

display(udf_example)

# Note: In practice, prefer built-in functions over UDFs when possible
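
A Series-to-Series pandas UDF receives the column one batch at a time, so statistics computed inside it describe each batch rather than the full dataset. The plain-Python sketch below illustrates the effect with made-up values and a hypothetical split into two batches; it uses the sample standard deviation, which matches the default of pandas `Series.std()`.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z-score each value against the mean and sample standard deviation of `values`."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


diffs = [-5, 0, 5, 10, 40, 100]  # made-up ElapsedTimeDiff values

# Z-scores against the statistics of the whole column
global_z = z_scores(diffs)

# Z-scores when the same column arrives as two separate batches
batched_z = z_scores(diffs[:3]) + z_scores(diffs[3:])

print(global_z)
print(batched_z)
```

The two results differ, which is why a dataset-wide z-score is usually better computed with built-in aggregate functions for the mean and standard deviation, in line with the note above about preferring built-in functions.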

## Putting It All Together

Let's put this together in a chained operation that transforms data from a source table and saves it to a new target table (overwriting any existing data).

In [0]:
%sql
-- Drop the target table in case it exists already
DROP TABLE IF EXISTS cleaned_and_enriched_flights;

In [0]:
from pyspark.sql.functions import col, make_timestamp_ntz, lpad, substr, lit, when, pandas_udf, pmod
# or more simply...
from pyspark.sql.functions import *

@pandas_udf("double")
def normalized_diff(diff_series):
    return (diff_series - diff_series.mean()) / diff_series.std()

(spark.read.table("databricks_airline_performance_data.v01.flights_small")
    .selectExpr(
        "Year",
        "Month",
        "DayofMonth",
        "TRY_CAST(DepTime AS INT) AS DepTime",
        "FlightNum",
        "TRY_CAST(ActualElapsedTime AS INT) AS ActualElapsedTime",
        "CRSElapsedTime",
        "TRY_CAST(ArrDelay AS INT) AS ArrDelay"
    ).filter(
        col("ArrDelay").try_cast("integer").isNotNull() & 
        col("ActualElapsedTime").try_cast("integer").isNotNull() &
        col("DepTime").try_cast("integer").isNotNull()
    )
    .na.drop()
    .withColumn(
        "FlightDateTime",
        make_timestamp_ntz(
            col("Year"),
            col("Month"),
            col("DayofMonth"),
            pmod(substr(lpad(col("DepTime"), 4, "0"), lit(1), lit(2)).try_cast("integer"), lit(24)),
            substr(lpad(col("DepTime"), 4, "0"), lit(3), lit(2)).try_cast("integer"),
            lit(0)
        )
    )
    .drop("Year", "Month", "DayofMonth", "DepTime")
    .withColumn(
        "ElapsedTimeDiff", col("ActualElapsedTime") - col("CRSElapsedTime")
        )
    .drop("ActualElapsedTime", "CRSElapsedTime")
    .withColumn("delay_category", when(col("ArrDelay") <= 0, "On Time")
        .when(col("ArrDelay") <= 15, "Slight Delay")
        .when(col("ArrDelay") <= 60, "Moderate Delay")
        .otherwise("Severe Delay"))
    .drop("ArrDelay")
    .withColumn("diff_normalized", normalized_diff("ElapsedTimeDiff"))
    # Write the result to the target table, replacing any existing data
    .write
    .mode("overwrite")
    .saveAsTable("workspace.default.cleaned_and_enriched_flights")
)

In [0]:
%sql
SELECT * FROM cleaned_and_enriched_flights;

## Key Takeaways

1. **Data Cleaning Best Practices**:
   - Validate and clean data types early
   - Handle missing values appropriately
   - Document cleaning assumptions

2. **Data Enrichment**:
   - Create meaningful derived columns
   - Consider business requirements
   - Use functions (built-in or user-defined) to enrich datasets


# 4 - Lab - Analyzing Transaction Data with DataFrames

In this lab, you'll analyze transactions from the Bakehouse dataset using Spark DataFrames. You'll apply the concepts from the lecture to solve real business problems and gain insights from the data.

### Objectives
- Reading data into a DataFrame and exploring its contents and structure
- Filtering records and projecting columns from a DataFrame
- Saving a DataFrame to a table

## Initial Setup and Data Loading

First, let's load our data and examine its structure.

In [0]:
## Read the Bakehouse transaction data
transactions_df = spark.read.table("samples.bakehouse.sales_transactions")

## Examine the schema and display first few rows
<FILL_IN>

## Data Exploration

Let's explore the basic characteristics of the dataset.

### Total Transactions Count
Getting a count of all transactions helps us understand the dataset size.

In [0]:
FILL_IN

### Transactions over $100
Find the transactions over $100 and save them to a new DataFrame named `large_transactions_df`. Display the contents of this new DataFrame.

In [0]:
from pyspark.sql.functions import col

FILL_IN

### Save the DataFrame to a table
Save the `large_transactions_df` DataFrame to a table called `large_transactions`

In [0]:
FILL_IN

### Use Spark SQL to count the number of large transactions
Count the total number of large transactions in the `large_transactions` table.

In [0]:
FILL_IN

# Cleanup


> **NOTE:** Once you have completed this lab, terminate your cluster:
>    Navigate to the top-right of this notebook and click the *Connected* drop-down menu to select your connected cluster. 
>    When hovering over the cluster, options will appear; select *Terminate*.