1 – Import Libraries & Load Silver Table

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Load the silver table
df_spark = spark.table("workspace.silver.labeled_step_test")

# Add a binary classification label: 1 if test_time > 10 (seconds), else 0
df_spark = df_spark.withColumn("step_label", (col("test_time") > 10).cast("int"))

# Convert to Pandas for scikit-learn
df = df_spark.toPandas()
df.head()
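
The labeling rule above uses a strict comparison, so a `test_time` of exactly 10 lands in class 0. A minimal pandas sketch of the same rule, using made-up values rather than rows from the silver table, makes the boundary behavior visible:

```python
import pandas as pd

# Hypothetical test_time values (seconds); not taken from the silver table
toy = pd.DataFrame({"test_time": [4.2, 10.0, 10.1, 31.5]})

# Same rule as the Spark expression: strictly greater than 10 -> 1, else 0
toy["step_label"] = (toy["test_time"] > 10).astype(int)

labels = toy["step_label"].tolist()
print(labels)
```

If the boundary should be inclusive, the comparison would need to be `>=` in both the Spark and pandas versions.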

2 – Define Features and Label Columns

In [0]:
# Define columns
feature_cols_numeric = ["distance_cm"]
feature_cols_categorical = ["sensor_type", "device_id"]
label_col = "step_label"

3 – Train/Test Split

In [0]:
from sklearn.model_selection import train_test_split

X = df[feature_cols_numeric + feature_cols_categorical]
y = df[label_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
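
To illustrate what `stratify=y` does in the split above, here is a small sketch on made-up data (the column values and the 25% positive rate are assumptions for illustration only). It checks that the train and test splits keep the same share of positive labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 5 of them positive (25%)
toy = pd.DataFrame({
    "distance_cm": [float(i) for i in range(20)],
    "step_label": [1] * 5 + [0] * 15,
})

X_toy = toy[["distance_cm"]]
y_toy = toy["step_label"]

# Same split settings as the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positive labels in each split
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(len(X_tr), len(X_te), train_pos_rate, test_pos_rate)
```

Without `stratify`, a small or imbalanced dataset could end up with very few positive rows in the test split, which makes evaluation metrics unstable.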

4 – Preprocessing Pipeline

In [0]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Combine into ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, feature_cols_numeric),
        ("cat", categorical_transformer, feature_cols_categorical)
    ]
)

5 – Scikit-Learn Pipeline

In [0]:
from sklearn.pipeline import Pipeline

# Create the pipeline
pipeline = Pipeline(steps=[
    ("preprocess", preprocessor)
])

6 – Fit & Transform Data

In [0]:
# Fit the pipeline on the training split only, then apply it to both splits
pipeline.fit(X_train)

X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)
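
Fitting on `X_train` alone means the scaling statistics come only from training rows, and the test rows are transformed with those same statistics. A minimal sketch with made-up numbers and a scaler-only pipeline (a simplification of the notebook's pipeline) shows this:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values; the real data comes from the silver table
toy_train = pd.DataFrame({"distance_cm": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"distance_cm": [40.0]})

toy_pipeline = Pipeline(steps=[("scale", StandardScaler())])
toy_pipeline.fit(toy_train)  # statistics are learned from the training rows only

learned_mean = toy_pipeline.named_steps["scale"].mean_[0]
learned_std = toy_pipeline.named_steps["scale"].scale_[0]

# The test row is scaled with the training mean and standard deviation
scaled_test = toy_pipeline.transform(toy_test)
expected = (40.0 - learned_mean) / learned_std

print(learned_mean, learned_std, scaled_test[0, 0])
```

Calling `fit` on the combined data instead would leak information about the test set into the scaler.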

7 – Save Files to GitHub Repo (/etl)

In [0]:
import joblib
import os

# Git-tracked repo path (Workspace view)
repo_path = "/Workspace/Repos/win185@ensign.edu/Databricks/etl"
os.makedirs(repo_path, exist_ok=True)  # no-op if the folder already exists

# Pipeline and transformed datasets to save, keyed by file name
artifacts = {
    "stedi_feature_pipeline.pkl": pipeline,
    "X_train_transformed.pkl": X_train_transformed,
    "X_test_transformed.pkl": X_test_transformed,
    "y_train.pkl": y_train,
    "y_test.pkl": y_test,
}

# Use compression to avoid file size issues
for filename, obj in artifacts.items():
    joblib.dump(obj, os.path.join(repo_path, filename), compress=3)

# The files land in the Workspace repo folder; they reach GitHub after a commit and push
print(f"Saved {len(artifacts)} files to {repo_path}.")
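
Before relying on the saved artifacts downstream, it can help to confirm that a dumped object reloads and behaves identically. The sketch below does a round trip with `joblib.dump` and `joblib.load` in a temporary directory, using a toy scaler and a hypothetical file name in place of the real pipeline and repo path:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted pipeline (values are made up)
toy_scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the notebook writes to the repo path instead
    artifact_path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_scaler, artifact_path, compress=3)
    reloaded = joblib.load(artifact_path)

# The reloaded object should transform new data exactly like the original
sample = np.array([[4.0]])
roundtrip_ok = bool(
    np.allclose(toy_scaler.transform(sample), reloaded.transform(sample))
)
print(roundtrip_ok)
```

Pickled scikit-learn objects are generally only safe to reload with the same library version that wrote them, so recording the scikit-learn version alongside the files is a reasonable precaution.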

Ethics Reflection
Using a consistent, reproducible feature pipeline helps guard against unfairness and hidden bias because every data point is processed the same way, regardless of who or what it represents. It reduces the risk of ad hoc transformations or human error that might favor one group over another. Standardizing the cleaning, encoding, and scaling steps makes model training transparent and auditable, which builds trust in the outcomes and makes bias easier to detect and address.

A spiritual principle that supports this idea is justice: the belief that all individuals deserve to be treated fairly and equally. Consistency in our work reflects integrity, and fairness honors the dignity of every person affected by the systems we build.