# Type 2 Slowly Changing Data

In this notebook, we'll create a silver table that contains the information we'll need to link workouts back to our heart rate recordings.

We'll use a Type 2 table to record this data, encoding the start and end times for each session. 

<img src="https://files.training.databricks.com/images/ade/ADE_arch_completed_workouts.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Describe how Slowly Changing Dimension tables can be implemented in the Lakehouse
- Use custom logic to implement an SCD Type 2 table with batch overwrite logic

## Setup
Set up path and checkpoint variables (these will be used later).

In [0]:
%run ../Includes/Classroom-Setup-4.4

## Review workouts_silver Table
Several helper functions were defined to land and propagate a batch of data to the **`workouts_silver`** table.

This table is created by:
* Starting a stream against the **`bronze`** table
* Filtering all records by **`topic = 'workout'`**
* Deduping the data
* Merging non-matching records into **`workouts_silver`**

This is roughly the same strategy we used earlier to create the **`heart_rate_silver`** table.
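The dedupe-then-merge pattern described above can be modeled without Spark. The sketch below is plain Python, not the implementation behind **`DA.process_workouts_silver()`**; the key columns (**`user_id`**, **`session_id`**, **`time`**) and the helper names `dedupe` and `insert_only_merge` are assumptions chosen for illustration. It shows why an insert-only merge makes replaying a batch harmless.

```python
from typing import Dict, List, Tuple

# Assumed key for illustration only: (user_id, session_id, time)
Key = Tuple[int, int, int]


def record_key(rec: dict) -> Key:
    """Build the (assumed) uniqueness key for a workout record."""
    return (rec["user_id"], rec["session_id"], rec["time"])


def dedupe(batch: List[dict]) -> List[dict]:
    """Keep the first record seen for each key within one batch."""
    seen: Dict[Key, dict] = {}
    for rec in batch:
        seen.setdefault(record_key(rec), rec)
    return list(seen.values())


def insert_only_merge(target: Dict[Key, dict], batch: List[dict]) -> int:
    """Insert records whose key is not yet in the target; return rows inserted."""
    inserted = 0
    for rec in dedupe(batch):
        key = record_key(rec)
        if key not in target:
            target[key] = rec
            inserted += 1
    return inserted


silver: Dict[Key, dict] = {}
batch = [
    {"user_id": 1, "session_id": 10, "time": 100, "action": "start"},
    {"user_id": 1, "session_id": 10, "time": 100, "action": "start"},  # duplicate
    {"user_id": 1, "session_id": 10, "time": 160, "action": "stop"},
]

first_run = insert_only_merge(silver, batch)   # the two unique records are inserted
second_run = insert_only_merge(silver, batch)  # replaying the batch inserts nothing
print(first_run, second_run)
```

Because matching records are ignored rather than updated, running the same batch twice leaves the target unchanged.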

In [0]:
DA.daily_stream.load()       # Load another day's data
DA.process_bronze()          # Update the bronze table
DA.process_workouts_silver() # Update the workouts_silver table

Review the **`workouts_silver`** data.

In [0]:
workout_df = spark.read.table("workouts_silver")
display(workout_df)

For this data, the **`user_id`** and **`session_id`** form a composite key. 

Each pair should eventually have 2 records present, marking the "start" and "stop" action for each workout.

In [0]:
aggregate_df = workout_df.groupby("user_id", "session_id").count()
display(aggregate_df)

Because we'll be triggering a shuffle in this notebook, we'll be explicit about how many partitions we want at the end of our shuffle.

As before, we can use the current level of parallelism (max number of cores) as our upper bound for shuffle partitions.

In [0]:
print(f"Executor cores: {sc.defaultParallelism}")
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

## Create Completed Workouts Table

The query below matches our start and stop actions, capturing the time for each action. The **`in_progress`** field indicates whether or not a given workout session is ongoing.

In [0]:
def process_completed_workouts():
    spark.sql(f"""
        CREATE OR REPLACE TABLE completed_workouts 
        AS (
          SELECT a.user_id, a.workout_id, a.session_id, a.start_time start_time, b.end_time end_time, a.in_progress AND (b.in_progress IS NULL) in_progress
          FROM (
            SELECT user_id, workout_id, session_id, time start_time, null end_time, true in_progress
            FROM workouts_silver
            WHERE action = "start") a
          LEFT JOIN (
            SELECT user_id, workout_id, session_id, null start_time, time end_time, false in_progress
            FROM workouts_silver
            WHERE action = "stop") b
          ON a.user_id = b.user_id AND a.session_id = b.session_id
        )
    """)
    
process_completed_workouts()
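To make the join logic in the query above easier to follow, here is a minimal plain-Python model of it. The event tuples and the `complete_workouts` helper are hypothetical and exist only for this illustration; the model also assumes at most one stop per session and a non-null stop time. Each start record is kept, a matching stop (same **`user_id`** and **`session_id`**) supplies **`end_time`**, and **`in_progress`** is true only when no stop was found.

```python
from typing import List, Optional, Tuple

# Hypothetical event shape: (user_id, workout_id, session_id, time, action)
Event = Tuple[int, int, int, int, str]


def complete_workouts(events: List[Event]) -> List[dict]:
    """Pair each start with its stop, if any, mirroring the LEFT JOIN."""
    # Right side of the join: stop times keyed by (user_id, session_id).
    stops = {(u, s): t for (u, _w, s, t, action) in events if action == "stop"}

    rows = []
    for (u, w, s, t, action) in events:
        if action != "start":
            continue  # only start records drive the left side of the join
        end_time: Optional[int] = stops.get((u, s))
        rows.append({
            "user_id": u,
            "workout_id": w,
            "session_id": s,
            "start_time": t,
            "end_time": end_time,
            "in_progress": end_time is None,  # no matching stop yet
        })
    return rows


events = [
    (1, 5, 10, 100, "start"),
    (1, 5, 10, 160, "stop"),
    (2, 7, 11, 120, "start"),  # no stop has arrived yet
    (3, 9, 12, 90, "stop"),    # stop without a start: dropped by the left join
]

rows = complete_workouts(events)
for row in rows:
    print(row)
```

Note that a stop record with no corresponding start never appears in the result, because the start records form the left side of the join.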

You can now perform a query directly on your **`completed_workouts`** table to check your results.

In [0]:
total = spark.table("completed_workouts").count()
print(f"{total:3} total")

total = spark.table("completed_workouts").filter("in_progress=true").count()
print(f"{total:3} where record is still awaiting end time")

total = spark.table("completed_workouts").filter("end_time IS NOT NULL").count()
print(f"{total:3} where end time has been recorded")

total = spark.table("completed_workouts").filter("start_time IS NOT NULL").count()
print(f"{total:3} where start time has been recorded")

total = spark.table("completed_workouts").filter("in_progress=true AND end_time IS NULL").count()
print(f"{total:3} where they are in_progress AND have no end_time")

Use the functions below to propagate another batch of records through the pipeline to this point.

In [0]:
DA.daily_stream.load()       # Load another day's data
DA.process_bronze()          # Update the bronze table
DA.process_workouts_silver() # Update the workouts_silver table

process_completed_workouts() # Update the completed_workouts table

In [0]:
%sql
SELECT COUNT(*) AS total
FROM completed_workouts
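The start and end times stored for each session are what allow a heart rate recording to be linked back to a workout. The sketch below models that range lookup in plain Python; the `find_session` helper and the sample rows are hypothetical, and treating an in-progress session (null **`end_time`**) as open-ended is an assumption made for this illustration.

```python
from typing import List, Optional


def find_session(workouts: List[dict], user_id: int, time: int) -> Optional[int]:
    """Return the session_id whose [start_time, end_time] range covers `time`."""
    for w in workouts:
        if w["user_id"] != user_id or time < w["start_time"]:
            continue
        # Assumption: a session with no end_time is still running,
        # so it matches any time at or after its start.
        if w["end_time"] is None or time <= w["end_time"]:
            return w["session_id"]
    return None


# Hypothetical rows shaped like the completed_workouts table.
workouts = [
    {"user_id": 1, "session_id": 10, "start_time": 100, "end_time": 160},
    {"user_id": 1, "session_id": 13, "start_time": 300, "end_time": None},
]

matched = find_session(workouts, user_id=1, time=130)  # inside a finished session
between = find_session(workouts, user_id=1, time=200)  # between two sessions
ongoing = find_session(workouts, user_id=1, time=350)  # inside the open session
print(matched, between, ongoing)
```

In a table-based pipeline the same condition would typically be expressed as a join predicate on the user and the time range rather than a Python loop.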

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()