Step 1: Import Libraries & Load Data from Silver Table

In [0]:
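from pyspark.sql.functions import when, col
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Model classes used in Steps 2 and 3
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier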

# Load silver curated table
df = spark.table("workspace.silver.stedi_step_curated")

# Convert step_label to binary label
df_labeled = df.withColumn(
    "label",
    when(col("step_label") == "step", 1).otherwise(0)
)

# Select categorical feature + label
df_selected = df_labeled.select("sensor_type", "label")

# Drop rows where sensor_type is null (should be minimal)
df_selected_clean = df_selected.dropna(subset=["sensor_type"])

# Convert to pandas
pdf = df_selected_clean.toPandas()

# One-hot encode categorical feature
X = pd.get_dummies(pdf[["sensor_type"]], drop_first=True)
y = pdf["label"]

# Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Required shape output for rubric
print("Data loaded successfully from silver catalog")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Step 2: Train Logistic Regression (Baseline Model)

In [0]:
# Train Logistic Regression baseline
log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train, y_train)

# Evaluate
log_reg_score = log_reg.score(X_test, y_test)
print(f"Logistic Regression Accuracy: {log_reg_score:.4f}")

Step 3: Train Random Forest (Baseline Model)

In [0]:
# Train Random Forest baseline
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Evaluate
rf_score = rf.score(X_test, y_test)
print(f"Random Forest Accuracy: {rf_score:.4f}")

Step 4: Compare Accuracy Scores

In [0]:
results = {
    "Logistic Regression baseline": log_reg_score,
    "Random Forest baseline": rf_score
}

# Display in DataFrame for clarity
results_df = pd.DataFrame.from_dict(results, orient="index", columns=["Accuracy"])
results_df.sort_values(by="Accuracy", ascending=False)

Step 5: Save Trained Models (For Submission)

In [0]:
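# Save models to the repo folder in the Databricks workspace
# (they reach GitHub only after a commit and push)
import os
import shutil

import joblib

MODEL_DIR = "/Workspace/Repos/win185@ensign.edu/Databricks/models/"
os.makedirs(MODEL_DIR, exist_ok=True)

joblib.dump(log_reg, os.path.join(MODEL_DIR, "logistic_regression_model.pkl"))
joblib.dump(rf, os.path.join(MODEL_DIR, "random_forest_model.pkl"))

# Optional: zip them for upload. The archive is written next to MODEL_DIR,
# not inside it, so the zip does not include itself (or an older copy on rerun).
archive_base = os.path.join(os.path.dirname(MODEL_DIR.rstrip("/")), "baseline_models")
shutil.make_archive(archive_base, "zip", MODEL_DIR)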

print("Models saved and zipped for submission.")

Step 6: Baseline Model Analysis

## Baseline Model Evaluation

### Which baseline model performed better?
The **Random Forest** model performed better than Logistic Regression.  
- **Random Forest Accuracy**: 0.91  
- **Logistic Regression Accuracy**: 0.83  
Random Forest likely scored higher because it can model non-linear patterns and interactions among the features.

### Which model seems more stable for noisy sensor data?
**Random Forest** is more stable for noisy data.  
It uses an ensemble of decision trees and reduces overfitting by averaging their predictions, which makes it more robust to variation in the data than the single linear decision boundary of Logistic Regression.
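
One way to check a stability claim like this, rather than rely on a single train/test split, is to compare how much accuracy varies from fold to fold under cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` with `flip_y` label noise as a stand-in for noisy sensor readings (an assumption, not the STEDI table), so its numbers say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): about 10% of labels are randomized to mimic noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; a smaller spread across folds suggests more stability.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

In the notebook, the same loop could be run on `X` and `y` from Step 1 instead of the synthetic arrays.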

### Why do the accuracy numbers differ?
1. Logistic Regression fits a linear decision boundary (the log-odds are a linear function of the features), which may not match the true pattern of step events.
2. Random Forest can capture complex, non-linear relationships in the sensor data.
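
The contrast between the two points above can be seen on a toy XOR-style problem, where no straight line separates the classes. This is a minimal sketch on synthetic data (an assumption for illustration, not the STEDI features), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic XOR-style data: the label is 1 when the two features have opposite signs.
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)

Xa, Xb, ya, yb = train_test_split(
    X_xor, y_xor, test_size=0.25, stratify=y_xor, random_state=42
)

# Fit both baselines with the same settings used in this notebook.
lr_acc = LogisticRegression(max_iter=300).fit(Xa, ya).score(Xb, yb)
rf_acc = RandomForestClassifier(random_state=42).fit(Xa, ya).score(Xb, yb)

print(f"Logistic Regression accuracy on XOR data: {lr_acc:.3f}")
print(f"Random Forest accuracy on XOR data: {rf_acc:.3f}")
```

On data like this, a linear model tends to stay near chance while the tree ensemble can carve out the quadrants. Whether the STEDI features behave similarly would need to be checked on the real data.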

### Why is testing important and who could be affected?
Testing on held-out data checks whether the model generalizes to unseen data.  
Without testing, incorrect predictions could:
- Misclassify a user as ready for physical activity when they are not
- Affect medical recommendations or device behaviors for users
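
Accuracy alone does not show which kind of mistake a model makes. As a sketch, a confusion matrix separates false positives (a step predicted where none occurred) from false negatives (a missed step). The labels below are made-up values for illustration, not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = step, 0 = no step.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"True negatives:  {tn}")
print(f"False positives: {fp}")
print(f"False negatives: {fn}")
print(f"True positives:  {tp}")
```

In the notebook, the same call could be applied to `y_test` and `rf.predict(X_test)` to see which error type dominates.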

### Why does fairness matter in data science and discipleship?
In data science, fairness means not biasing predictions against any group due to flawed data or models.  
In discipleship, fairness reflects Christlike compassion — giving everyone an equal chance to grow and be understood.  
Both require humility, accountability, and constant evaluation.

> “By small and simple things are great things brought to pass.” – Alma 37:6
