1. Import Libraries and Load Your Curated Dataset

In [0]:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.getOrCreate()
# Load the rebuilt silver table
df_spark = spark.table("workspace.silver.labeled_step_test")
df = df_spark.toPandas()
df.head()

Sanity Check

In [0]:
# Check schema and preview rows
df_spark.printSchema()
df_spark.show(5)

In [0]:
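# Column roles in the silver table. Only feature_cols_numeric is used by the
# regression below, which predicts test_time rather than label_col.
feature_cols_numeric = ["distance_cm"]
feature_cols_categorical = ["sensor_type", "device_id"]
label_col = "step_label"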
# Count total rows
print(f"Total rows: {df_spark.count()}")


Summarize the Data

In [0]:
df_spark.select("distance_cm").summary().show()

2. Import Spark ML Tools

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

3. Define Feature Column

In [0]:
feature_cols_numeric = ["distance_cm"]

4. Build Preprocessing Steps - Create Vector Assembler

In [0]:
assembler = VectorAssembler(inputCols=feature_cols_numeric, outputCol="features")

5. Apply the Vector Assembler - Transform Data for ML

In [0]:
data = assembler.transform(df_spark).select("features", "test_time")

6. Train/Test Split

In [0]:
train, test = data.randomSplit([0.7, 0.3], seed=42)
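A fixed seed is what makes the split above repeatable. The plain-Python sketch below illustrates the general idea of a seeded random split: each row gets one random draw, so the same seed gives the same assignment, and the 70/30 ratio is only approximate. The function `seeded_split` and its sample rows are hypothetical, and this is not Spark's actual implementation.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test using one seeded random draw per row."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        # A draw below train_fraction sends the row to the training set.
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


# Hypothetical rows standing in for the assembled data.
rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)

print(f"Train rows: {len(train_a)}, test rows: {len(test_a)}")
print(f"Same seed gives same split: {train_a == train_b}")
```

Because the split sizes vary from run to run without a seed, it can help to print `train.count()` and `test.count()` in the notebook before training.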

7. Train Linear Regression Model

In [0]:
lr = LinearRegression(featuresCol="features", labelCol="test_time")
lr_model = lr.fit(train)
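With a single feature and the default settings (no regularization, intercept enabled), the model fitted above is a straight line, test_time = slope * distance_cm + intercept. The plain-Python sketch below shows the closed-form least-squares fit for that one-feature case. The function `fit_line` and the sample values are hypothetical and are not taken from the silver table.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
distances = [10.0, 20.0, 30.0, 40.0]
times = [2 * d + 1 for d in distances]

slope, intercept = fit_line(distances, times)
print(f"slope={slope}, intercept={intercept}")
```

In the notebook, the fitted values can be compared against `lr_model.coefficients` and `lr_model.intercept`.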

8. Evaluate Model

In [0]:
predictions = lr_model.transform(test)

evaluator = RegressionEvaluator(
    labelCol="test_time",
    predictionCol="prediction",
    metricName="rmse"
)

rmse = evaluator.evaluate(predictions)
print(f"RMSE: {rmse}")
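RMSE is the square root of the mean squared difference between the label and the prediction, so it is expressed in the same units as test_time. The sketch below computes it by hand in plain Python to make the metric concrete. The function `rmse_by_hand` and the sample values are hypothetical.

```python
import math


def rmse_by_hand(labels, predictions):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: the errors are -1, 0, and 2.
example_labels = [10.0, 12.0, 14.0]
example_predictions = [11.0, 12.0, 12.0]

example_rmse = rmse_by_hand(example_labels, example_predictions)
print(f"RMSE: {example_rmse:.4f}")
```

Squaring the errors means that a few large misses raise RMSE more than many small ones.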

9. Show Predictions

In [0]:
predictions.select("features", "test_time", "prediction").show(10)

Ethics Reflection - Using a consistent and reproducible feature pipeline helps prevent unfairness or hidden bias by ensuring that all data is processed in the same way, regardless of when or from whom it is collected. When feature generation is inconsistent, subtle differences in preprocessing can disproportionately affect certain groups and lead to biased model outcomes that are hard to detect. Reproducibility also makes it easier to audit models, trace errors, and identify where bias may have been introduced. A spiritual principle that helps illuminate the importance of consistency and fairness is the idea of treating others with impartiality and integrity: acting with the same care and standards for everyone. This perspective reinforces that fairness in machine learning is not just a technical goal, but a moral responsibility to apply rules evenly and transparently.