# STEDI Step Test – GitHub Repository Lab

This notebook completes the STEDI Step Test ETL using **pure Python**
due to restricted Spark and filesystem permissions.

Data is loaded directly from the Databricks Repo workspace path:

/Workspace/Repos/win185@ensign.edu/Databricks/data

The workflow includes data import, cleaning, joins, feature engineering,
visual validation, and ethical reflection.


In [0]:
import pandas as pd
import matplotlib.pyplot as plt

Load the raw device messages and rapid step test data from the
Databricks Repo directory with `pd.read_parquet`, then preview the
first rows of each dataset.


In [0]:
import pandas as pd

device_df = pd.read_parquet(
    "/Workspace/Repos/win185@ensign.edu/Databricks/data/device_messages.parquet"
)

step_test_df = pd.read_parquet(
    "/Workspace/Repos/win185@ensign.edu/Databricks/data/rapid_step_tests.parquet"
)

print("device_df:")
print(device_df.head())

print("\nstep_test_df:")
print(step_test_df.head())



Confirm that both datasets loaded successfully by inspecting their
column types and non-null counts with `info()`.


In [0]:
device_df.info()
step_test_df.info()
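
`info()` reports dtypes and non-null counts, but it does not fail when something looks wrong. As a minimal sketch, the same check can be made explicit. The helper name `summarize_frame` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic load-validation facts for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Number of missing values in each column
        "nulls": df.isna().sum().to_dict(),
        # Rows that are exact duplicates of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data standing in for device_df; the values are illustrative only
toy_df = pd.DataFrame(
    {"device_id": ["a", "a", "b"], "distance": ["10cm", "10cm", None]}
)
summary = summarize_frame(toy_df)
print(summary)
```

Calling `summarize_frame(device_df)` and `summarize_frame(step_test_df)` would give the same facts for the real data, and an `assert summary["rows"] > 0` could stop the notebook early on an empty load.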


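Join the device messages to the rapid step test records on `device_id`.
An inner join keeps only devices that appear in both datasets. String
timestamps should be converted to datetime objects to support time-based
analysis; the cell below performs only the join.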


In [0]:
joined_df = pd.merge(
    device_df,
    step_test_df,
    on="device_id",
    how="inner"
)

joined_df.head()
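
The join cell does not itself convert timestamps. A minimal sketch of that conversion follows, assuming a column named `timestamp` (the name used later in the plotting cell). The helper `to_datetime_column` is hypothetical, and whether the real column holds ISO-style strings or epoch numbers is not known from this notebook, so both cases are covered.

```python
import pandas as pd


def to_datetime_column(df: pd.DataFrame, column: str, unit=None) -> pd.DataFrame:
    """Return a copy of df with `column` parsed as datetime (hypothetical helper).

    Pass unit="ms" (or "s") when the column holds epoch numbers; leave it as
    None for ISO-style strings. Unparseable values become NaT.
    """
    out = df.copy()
    if unit is not None:
        # Epoch values may arrive as strings, so make them numeric first
        values = pd.to_numeric(out[column], errors="coerce")
        out[column] = pd.to_datetime(values, unit=unit, errors="coerce")
    else:
        out[column] = pd.to_datetime(out[column], errors="coerce")
    return out


# Toy data; the values are illustrative only
toy_df = pd.DataFrame(
    {"device_id": ["a", "a"], "timestamp": ["2024-01-01 10:00:00", "not a date"]}
)
converted = to_datetime_column(toy_df, "timestamp")
print(converted.dtypes)

epoch_df = pd.DataFrame({"timestamp": ["1700000000000"]})
epoch_converted = to_datetime_column(epoch_df, "timestamp", unit="ms")
print(epoch_converted)
```

Counting `NaT` values after conversion (`converted["timestamp"].isna().sum()`) shows how many records failed to parse.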


Clean the numeric columns, then aggregate the distance readings per
device to create analytical features: average, minimum, and maximum
distance, plus the maximum `total_steps` and `step_points`.


In [0]:
import pandas as pd

# --- Clean numeric columns ---

# Distance: remove units and convert to float
joined_df["distance"] = (
    joined_df["distance"]
        .astype(str)
        .str.replace("cm", "", regex=False)
        .astype(float)
)

# Step-related fields: ensure numeric
joined_df["total_steps"] = pd.to_numeric(joined_df["total_steps"], errors="coerce")
joined_df["step_points"] = pd.to_numeric(joined_df["step_points"], errors="coerce")

# --- Aggregate features per device ---

features_df = (
    joined_df
        .groupby("device_id")
        .agg(
            avg_distance=("distance", "mean"),
            min_distance=("distance", "min"),
            max_distance=("distance", "max"),
            total_steps=("total_steps", "max"),
            step_points=("step_points", "max")
        )
        .reset_index()
)

features_df.head()
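
The unit-stripping step above raises an error if any distance value is not a number followed by `cm`. A more tolerant variant is sketched below; the helper name `parse_distance_cm` and the sample values are hypothetical. It keeps only the leading number and ignores whatever unit text follows, so it assumes every reading is already in centimetres.

```python
import pandas as pd


def parse_distance_cm(values: pd.Series) -> pd.Series:
    """Extract the first number from strings such as '12cm' or ' 7.5 cm'."""
    # Capture an optional sign, digits, and an optional decimal part
    number = values.astype(str).str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    # Values without a number (for example 'n/a' or missing) become NaN
    return pd.to_numeric(number, errors="coerce")


# Toy readings; the values are illustrative only
raw = pd.Series(["12cm", " 7.5 cm", "n/a", None])
parsed = parse_distance_cm(raw)
print(parsed.tolist())
```

Unparseable readings become `NaN`, which `mean`, `min`, and `max` skip by default in a pandas aggregation.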

Plot distance readings over time for a single device to visually
verify sensor behavior and data continuity.


In [0]:
import matplotlib.pyplot as plt

# Select one device to visualize
sample_device = features_df["device_id"].iloc[0]

# Filter and sort data for that device
plot_df = (
    joined_df[joined_df["device_id"] == sample_device]
    .sort_values("timestamp")
)

# Plot distance over time
plt.figure()
plt.plot(plot_df["timestamp"], plot_df["distance"])
plt.xlabel("Timestamp")
plt.ylabel("Distance (cm)")
plt.title(f"Device {sample_device} – Distance Over Time")
plt.show()
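
Data continuity can also be checked numerically rather than by eye. The sketch below reports the largest interval between consecutive readings; the helper name `largest_gap` and the toy timestamps are hypothetical, and it assumes the timestamps can be parsed by `pd.to_datetime`.

```python
import pandas as pd


def largest_gap(timestamps: pd.Series) -> pd.Timedelta:
    """Return the largest interval between consecutive readings."""
    ordered = pd.to_datetime(timestamps).sort_values()
    # diff() gives the time elapsed since the previous reading; the first is NaT
    return ordered.diff().max()


# Toy timestamps; the values are illustrative only
toy_times = pd.Series(
    ["2024-01-01 10:00:00", "2024-01-01 10:00:05", "2024-01-01 10:00:01"]
)
gap = largest_gap(toy_times)
print(gap)
```

A gap much larger than the expected sampling interval suggests dropped messages for that device.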

### Reflection

Loading data directly from the Databricks Repo path was straightforward,
but I could not get PySpark to work. Because of restricted Spark
permissions in this environment, SQL queries are presented for logic and
documentation only, while the equivalent transformations are executed in
Python with pandas. I generated Python code that works around this issue.

The most confusing aspect was ensuring timestamps were properly converted
before joining the datasets.

An ethical risk is assuming that sensor data is accurate, which could lead
to misleading conclusions if device errors or biases go undetected.
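
One simple way to surface possible device errors is to flag readings that fall far outside the typical range. The sketch below uses the interquartile range (IQR) rule; the helper name `flag_outliers_iqr`, the multiplier `k = 1.5`, and the toy readings are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def flag_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series marking values outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy distance readings in cm; the values are illustrative only
toy_distance = pd.Series([10.0, 11.0, 9.0, 10.0, 250.0])
flags = flag_outliers_iqr(toy_distance)
print(toy_distance[flags])
```

A flagged reading is not necessarily wrong, so it should be reviewed before being dropped.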
