ETL Pipeline 

Cell 1 - Set Databricks Context (SQL, REQUIRED)

In [0]:
%sql
USE CATALOG workspace;
USE SCHEMA bronze;

Cell 2 - Imports (Python)

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when, lit, regexp_extract

Cell 3 - Load Raw Bronze Tables (VERIFIED NAMES)

In [0]:
df_device = spark.table("device_messages_raw")
df_steps  = spark.table("rapid_step_tests_raw")

Cell 4 - Inspect Schemas (Verification)

In [0]:
df_device.printSchema()
df_steps.printSchema()
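
The `printSchema()` output has to be checked by eye. As a rough sketch, the same check can be made explicit by comparing `df.columns` against the columns the later cells reference. The helper name `missing_columns` and the two constants are hypothetical; the column lists are taken from Cells 5 through 8 of this notebook. The comparison here is an exact string match, which is stricter than Spark's default case-insensitive column resolution.

```python
from typing import Iterable, List

# Columns the later cells reference (names taken from this notebook).
REQUIRED_DEVICE_COLUMNS = ["timestamp", "sensor_type", "distance", "device_id"]
REQUIRED_STEP_COLUMNS = ["device_id", "start_time", "stop_time"]


def missing_columns(actual: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in required if name not in present]


# Hypothetical usage inside the notebook (requires the DataFrames from Cell 3):
# missing_columns(df_device.columns, REQUIRED_DEVICE_COLUMNS)
# missing_columns(df_steps.columns, REQUIRED_STEP_COLUMNS)
```

An empty list from both calls would mean the rest of the notebook can resolve every column it selects.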

Cell 5 - Prepare Device Messages

In [0]:
# Prepare Device Messages
# 1. Clean / convert distance safely:
#    take the first run of digits in `distance` and cast it to int;
#    rows with no digits (or a NULL distance) become NULL.
distance_digits = regexp_extract(col("distance"), r"(\d+)", 1)
df_device = df_device.withColumn(
    "distance_cm",
    when(distance_digits != "", distance_digits.cast("int")).otherwise(None)
)

# 2. Add source label
df_device = df_device.withColumn("source", lit("device"))
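
The intended extraction rule in Cell 5 (first run of digits, otherwise NULL) can be exercised outside Spark with Python's `re` module. This is only a plain-Python mirror for quick experiments, not the Spark expression itself; the function name `parse_distance_cm` and the sample strings are hypothetical, since the actual format of the `distance` column is not shown in this notebook.

```python
import re
from typing import Optional

# Same pattern as the Spark cell: capture the first run of digits.
_FIRST_DIGITS = re.compile(r"(\d+)")


def parse_distance_cm(raw: Optional[str]) -> Optional[int]:
    """Return the first run of digits in `raw` as an int, or None if there is none."""
    if raw is None:
        return None
    match = _FIRST_DIGITS.search(raw)
    return int(match.group(1)) if match else None
```

One consequence worth knowing: because only the first digit run is captured, a decimal value such as `"3.5cm"` would yield `3`, not `3.5` or `4`.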

Cell 6 - Prepare Step Test Data
Add Source Label

In [0]:
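# Add source label
# Note: df_steps_window below keeps only the device id and time bounds,
# so this `source` column is not carried into the join in Cell 7.
df_steps = df_steps.withColumn("source", lit("step"))

# Extract step time windows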
df_steps_window = df_steps.select(
    col("device_id").alias("step_device_id"),
    col("start_time"),
    col("stop_time")
)


Cell 7 - Timestamp Alignment + Step Labeling

In [0]:
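# Left join keeps every device reading. A reading matches a step window when
# the device ids agree and its timestamp lies within [start_time, stop_time]
# (between() is inclusive on both ends).
# A reading that falls inside several overlapping windows appears once per match.
df_labeled = (
    df_device.alias("d")
    .join(
        df_steps_window.alias("s"),
        (col("d.device_id") == col("s.step_device_id")) &
        (col("d.timestamp").between(col("s.start_time"), col("s.stop_time"))),
        "left"
    )
    .withColumn(
        "step_label",
        # A matched window means the reading was taken during a step test.
        when(col("s.start_time").isNotNull(), "step")
        .otherwise("no_step")
    )
)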
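
The labeling rule in Cell 7 can be sketched in plain Python to make two properties easy to see: the window bounds are inclusive, and a reading can match more than one window. The names `matching_windows` and `step_label`, the device ids, and the integer times are all hypothetical stand-ins; real timestamps would be compared the same way.

```python
from typing import List, Tuple

# (device_id, start_time, stop_time); integers stand in for real timestamps.
Window = Tuple[str, int, int]


def matching_windows(device_id: str, timestamp: int, windows: List[Window]) -> int:
    """Count the windows for this device that contain the timestamp (inclusive bounds)."""
    return sum(
        1
        for dev, start, stop in windows
        if dev == device_id and start <= timestamp <= stop
    )


def step_label(device_id: str, timestamp: int, windows: List[Window]) -> str:
    """Label a reading 'step' if at least one window contains it, else 'no_step'."""
    return "step" if matching_windows(device_id, timestamp, windows) > 0 else "no_step"


# Two overlapping windows for dev-1 and one window for dev-2 (made-up values).
windows = [("dev-1", 100, 200), ("dev-1", 150, 250), ("dev-2", 100, 200)]
```

A count above 1 from `matching_windows` corresponds to a reading that the left join would emit more than once, so comparing `df_device.count()` with `df_labeled.count()` is one way to detect overlapping step windows in the real data.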

Cell 8 - Final Silver Dataset Columns

In [0]:
df_final = df_labeled.select(
    col("timestamp"),
    col("sensor_type"),
    col("distance_cm"),
    col("device_id"),
    col("step_label"),
    col("source")
)

display(df_final.limit(10))

Cell 9 - Switch to Silver Schema

In [0]:
%sql
USE CATALOG workspace;
USE SCHEMA silver;

Cell 10 - Save Curated Dataset (Silver Layer)

In [0]:
# Overwrite replaces the existing table contents on every run.
(
    df_final.write
    .mode("overwrite")
    .format("delta")
    .saveAsTable("stedi_step_curated")
)

Cell 11 - Verification Queries 

Step vs No-Step Counts

In [0]:
%sql
SELECT step_label, COUNT(*) AS row_count
FROM stedi_step_curated
GROUP BY step_label;

Invalid Step Labels (expect zero rows)

In [0]:
%sql
SELECT *
FROM stedi_step_curated
WHERE step_label NOT IN ('step', 'no_step')
   OR step_label IS NULL;

Source Label Counts

In [0]:
%sql
SELECT source, COUNT(*) AS row_count
FROM stedi_step_curated
GROUP BY source;

Invalid Source Labels (expect zero rows)

In [0]:
%sql
SELECT *
FROM stedi_step_curated
WHERE source NOT IN ('device', 'step')
   OR source IS NULL;
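
The four verification queries above follow one pattern: count the values of a label column, then list rows whose value is missing or outside the allowed set. A minimal plain-Python sketch of that pattern is shown here for use on a small collected sample; the helper names `label_counts` and `invalid_rows` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set

Row = Dict[str, Optional[str]]


def label_counts(rows: Iterable[Row], column: str) -> Counter:
    """Count occurrences of each value in `column` (missing values count as None)."""
    return Counter(row.get(column) for row in rows)


def invalid_rows(rows: Iterable[Row], column: str, allowed: Set[str]) -> List[Row]:
    """Return rows whose `column` value is None or not in `allowed`."""
    return [row for row in rows if row.get(column) not in allowed]


# Made-up sample rows; in the notebook they could come from something like
# [r.asDict() for r in spark.table("stedi_step_curated").limit(1000).collect()]
sample = [
    {"step_label": "step", "source": "device"},
    {"step_label": "no_step", "source": "device"},
    {"step_label": None, "source": "device"},
]
```

As in the SQL versions, a NULL (here `None`) label is treated as invalid, so an empty result from `invalid_rows` is the passing case.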

Ethics Reflection
### Ethics Check

**Are we labeling data fairly?**  
Yes. A sensor reading is labeled `step` only when its timestamp falls within a
user-recorded step test window (start to stop time, inclusive) for the same device;
every other reading is labeled `no_step`. No inferred or speculative labeling is used.

**Are we protecting identity?**  
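Yes. The curated table keeps only `timestamp`, `sensor_type`, `distance_cm`, `device_id`, `step_label`, and `source`. Device identifiers are anonymized, and no personal or demographic data is included.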

**Are we avoiding medical claims?**  
Yes. The dataset describes movement timing only and makes no diagnostic or medical assertions.
