# Data Engineering with Spark

By Tom URBAN & Ethan SMADJA 

## Lab 3: Structured Streaming

### Prerequisites

- Connect to the [Databricks Community Edition](https://community.cloud.databricks.com/login.html)
- Upload the provided notebook

### Goals

- Stream the `events` dataset from files
- Use Spark Structured Streaming to define the streaming dataframes and process the stream
- Visualize how the aggregation results change while new data is coming in
- Compare the code for dataframe analysis in batch and streaming mode

### Lab resources

- Notebook
- The data is part of the Databricks workspace: `/databricks-datasets/structured-streaming/events`

### Useful links

- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)

### TO DO

1. Explore the dataset in batch mode
2. Do the streaming demo:
  - define the streaming dataframe
  - define the transformations
  - start the stream
  - observe the changes in the results
3. With the help of the code from the demo, implement a streaming example on another dataset


1. Explore the dataset in batch mode

In [0]:
events_path = "/databricks-datasets/structured-streaming/events/"

# display the files
display(dbutils.fs.ls(events_path))



Read the data in batch mode and display the schema

In [0]:
# Read all files in batch mode
events_batch = spark.read.json(events_path)

# Print the inferred schema
events_batch.printSchema()

# Show the first few rows without truncating values
events_batch.show(5, truncate=False)


# Aggregation in batch mode

In [0]:
# Batch read: load all the files at once
events_batch = spark.read.json(events_path)

# Transformation: number of events per action
events_by_action_batch = (
    events_batch
    .groupBy("action")
    .count()
    .orderBy("count", ascending=False)
)

# Display the final (static) result
display(events_by_action_batch)


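# 2. Streaming demo on the events dataset
a. Define the schema by reusing the one inferred in batch mode (file-based streaming sources require an explicit schema by default)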

In [0]:
schema = events_batch.schema
schema
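Reusing the inferred schema works here because the files already exist and can be read in batch mode first. As an alternative sketch, the schema can be declared up front as a DDL string, which `DataStreamReader.schema` also accepts. The field names and types below (`time` as a timestamp, `action` as a string) are assumptions about the events files and should be checked against `events_batch.printSchema()`; `build_events_stream` is a hypothetical helper, not part of the lab notebook.

```python
# Assumed layout of one event record; verify it against events_batch.printSchema().
EVENTS_SCHEMA_DDL = "time TIMESTAMP, action STRING"


def build_events_stream(spark, path, schema_ddl=EVENTS_SCHEMA_DDL, max_files=1):
    """Return a streaming DataFrame over the JSON files in `path` (hypothetical helper)."""
    return (
        spark.readStream
        .schema(schema_ddl)                       # DDL string instead of a StructType
        .option("maxFilesPerTrigger", max_files)  # files consumed per micro-batch
        .json(path)
    )
```

In a notebook where `spark` and `events_path` are already defined, this would be called as `events_stream = build_events_stream(spark, events_path)`.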

Create the volumes that store the events and the checkpoints

In [0]:
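# The schemas must exist before volumes can be created inside them
spark.sql("CREATE SCHEMA IF NOT EXISTS workspace.raw_events")
spark.sql("CREATE SCHEMA IF NOT EXISTS workspace.checkpoints")

# Volume holding the raw events for the demo
spark.sql("""
CREATE VOLUME IF NOT EXISTS workspace.raw_events.events_tmp_25_11_14
COMMENT 'Temporary raw events volume for the streaming demo'
""")

# Volume holding the streaming checkpoint
spark.sql("""
CREATE VOLUME IF NOT EXISTS workspace.checkpoints.events_by_action_demo
COMMENT 'Checkpoint storage for the streaming demo'
""")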


In [0]:
from datetime import datetime

# --- 1. Catalog and schemas ---
catalog = "workspace"                     # main catalog you are using
uc_schema_raw_events = "raw_events"       # schema to store raw input data
db_schema_checkpoints = "checkpoints"     # schema to store streaming checkpoints
stream_name = "events_by_action_demo"     # logical name of the streaming job

# --- 2. Create schemas if they do not exist ---
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{uc_schema_raw_events}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{db_schema_checkpoints}")

# --- 3. Create a temporary volume for today's raw data ---
raw_events_volume_time = datetime.now()
raw_events_volume = f"events_tmp_{raw_events_volume_time.strftime('%y_%m_%d')}"

spark.sql(f"""
CREATE VOLUME IF NOT EXISTS {catalog}.{uc_schema_raw_events}.{raw_events_volume}
COMMENT 'Temporary raw events volume for this streaming demo'
""")

# --- 4. Define full UC paths for data and checkpoint ---
raw_data_path = f"/Volumes/{catalog}/{uc_schema_raw_events}/{raw_events_volume}"
checkpoint_path = f"/Volumes/{catalog}/{db_schema_checkpoints}/{stream_name}"

print("Raw data path :", raw_data_path)
print("Checkpoint path :", checkpoint_path)

# --- 5. Read the data stream ---
events_stream = (
    spark.readStream
        .schema(schema)                   # use the predefined schema
        .option("maxFilesPerTrigger", 1)  # process one file at a time
        .json(events_path)                # read from the events dataset
)

# --- 6. Apply transformation ---
events_by_action_stream = (
    events_stream
    .groupBy("action")                    # group by action type
    .count()                              # count number of records per action
)

# --- 7. Start the streaming query and display live results ---
display(
    events_by_action_stream,
    checkpointLocation=checkpoint_path    # store progress in this checkpoint
)
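Passing `checkpointLocation` to `display` is specific to Databricks notebooks. Outside a notebook, the same aggregation could be started with the standard `writeStream` API. The sketch below assumes an in-memory sink, and `start_counts_query` is a hypothetical helper name. It uses the `complete` output mode because the query is an aggregation without a watermark, so the whole result table is rewritten at every trigger.

```python
def start_counts_query(counts_stream, query_name, checkpoint_path):
    """Start a streaming aggregation into an in-memory table (hypothetical helper)."""
    return (
        counts_stream.writeStream
        .format("memory")                               # results kept in a driver-side table
        .queryName(query_name)                          # table name usable from spark.sql
        .outputMode("complete")                         # rewrite the full result each micro-batch
        .option("checkpointLocation", checkpoint_path)  # where progress and state are saved
        .start()
    )
```

A possible use, with a checkpoint directory that no other query writes to: `query = start_counts_query(events_by_action_stream, "events_by_action", some_checkpoint_dir)`, then `spark.sql("SELECT * FROM events_by_action").show()` to inspect the current counts, and `query.stop()` once finished.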


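Batch vs. streaming: interpretation
Both approaches produce the same final counts because Structured Streaming uses the same DataFrame API and the same Spark SQL engine as batch mode.
In batch mode, all the files are read and aggregated in a single job.
In streaming mode, the files are processed as a sequence of micro-batches (one file per trigger here, because of `maxFilesPerTrigger`), and the running counts are kept as state that the checkpoint makes recoverable after a restart.
When the data source is static, as in this lab, the streaming result converges to the batch result once every file has been processed, but the streaming query keeps running and can also handle sources that grow continuously in real time.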
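The convergence described above can be illustrated without Spark. The sketch below is a plain-Python analogy, not how Spark is implemented: the three "files" and their contents are made up for illustration, a `Counter` plays the role of the aggregation state, and each loop iteration stands for one micro-batch.

```python
from collections import Counter

# Hypothetical stand-in for the input files: each inner list is one file of events.
files = [
    [{"action": "Open"}, {"action": "Close"}],
    [{"action": "Open"}],
    [{"action": "Close"}, {"action": "Open"}],
]

# Batch mode: read every file, then aggregate once.
batch_counts = Counter(event["action"] for f in files for event in f)

# Streaming mode: one file per micro-batch, updating a running state.
state = Counter()
snapshots = []
for micro_batch in files:
    state.update(event["action"] for event in micro_batch)
    snapshots.append(dict(state))  # what a live display would show after this trigger

# The last streaming snapshot matches the batch result.
print(snapshots[-1] == dict(batch_counts))
```

The intermediate entries of `snapshots` correspond to the changing chart seen while the stream is running; only the last one equals the batch aggregation.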