### DAY 13 (21/01/26) – Model Comparison & Feature Engineering
### Learn:

- Training multiple models
- Hyperparameter tuning
- Feature importance
- Spark ML Pipelines

### 🛠️ Tasks:

1. Train 3 different models
2. Compare metrics in MLflow
3. Build Spark ML pipeline
4. Select best model

### TASK 1: Train 3 Different Models (sklearn)

In [0]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import mlflow
import mlflow.sklearn

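# Load the gold table into pandas (this collects every row to the driver)
df = spark.table("ecommerce_catalog.default.events_gold").toPandas()

# Treat missing counts and revenue as zero
df = df.fillna({
    "views": 0,
    "carts": 0,
    "revenue": 0,
    "purchases": 0
})

# Features and regression target
X = df[["views", "carts", "revenue"]]
y = df["purchases"]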

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [0]:
# Train each model and log its parameters, R² metric, and model artifact to MLflow

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42)
}

for name, model in models.items():
    with mlflow.start_run(run_name=name):

        mlflow.log_param("model_type", name)
        mlflow.log_param("features", ",".join(X.columns))
        mlflow.log_param("test_size", 0.2)

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r2 = r2_score(y_test, y_pred)

        mlflow.log_metric("r2_score", r2)
        mlflow.sklearn.log_model(model, "model")

        print(f"{name} → R² = {r2:.4f}")

linear_regression → R² = 0.9807
decision_tree → R² = 0.9671
random_forest → R² = 0.9811


### TASK 3: Spark ML Pipeline

In [0]:
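from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
# Aliased so it does not shadow sklearn's LinearRegression imported in Task 1
from pyspark.ml.regression import LinearRegression as SparkLinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Same table as Task 1, with nulls in the model columns replaced by 0
spark_df = spark.table("ecommerce_catalog.default.events_gold") \
    .fillna(0, subset=["views", "carts", "revenue", "purchases"])

# Combine the three feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["views", "carts", "revenue"],
    outputCol="features"
)

lr = SparkLinearRegression(
    featuresCol="features",
    labelCol="purchases"
)

# Stage order matters: assemble features first, then fit the regressor
pipeline = Pipeline(stages=[assembler, lr])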


In [0]:
# Train on an 80/20 split and evaluate R² on the held-out rows

train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)

pipeline_model = pipeline.fit(train_df)
predictions = pipeline_model.transform(test_df)

evaluator = RegressionEvaluator(
    labelCol="purchases",
    predictionCol="prediction",
    metricName="r2"
)

r2 = evaluator.evaluate(predictions)
print(f"Spark Pipeline R²: {r2:.4f}")


Spark Pipeline R²: 0.9956


In [0]:
predictions.select(
    "views", "carts", "revenue", "purchases", "prediction"
).show(5)

+-----+-----+------------------+---------+--------------------+
|views|carts|           revenue|purchases|          prediction|
+-----+-----+------------------+---------+--------------------+
|    1|    0|               0.0|        0|-0.17804818047412965|
| 4612|   62|           10175.5|       23|   12.13981321741502|
| 5291|   87|19951.839999999997|       40|   22.89343925460553|
|  307|    1|               0.0|        0| -1.1574762388707485|
|    5|    0|               0.0|        0|-0.19851958764584596|
+-----+-----+------------------+---------+--------------------+
only showing top 5 rows
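
Three of the five sample rows have a negative prediction, which a purchase count cannot take, and the two larger rows are under-predicted. As a rough check, the sketch below recomputes R² by hand (1 minus residual sum of squares over total sum of squares) on just these five rows, with and without flooring predictions at zero. The flooring step is an assumption added for illustration and is not part of the pipeline above. The five-row value will not match the full test-set score.

```python
def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# (purchases, prediction) pairs copied from the five rows shown above
rows = [
    (0, -0.17804818047412965),
    (23, 12.13981321741502),
    (40, 22.89343925460553),
    (0, -1.1574762388707485),
    (0, -0.19851958764584596),
]
actual = [a for a, _ in rows]
predicted = [p for _, p in rows]

# Assumption: floor predictions at zero, since a count cannot be negative.
clipped = [max(0.0, p) for p in predicted]

r2_raw = r2_score(actual, predicted)
r2_clipped = r2_score(actual, clipped)
negatives = sum(p < 0 for p in predicted)

print(f"R² on these 5 rows (raw):     {r2_raw:.4f}")
print(f"R² on these 5 rows (clipped): {r2_clipped:.4f}")
print(f"Negative predictions: {negatives} of {len(predicted)}")
```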


### TASK 4: Select Best Model (Final Decision)

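Selected the Spark ML Pipeline with Linear Regression as the final model, for three reasons:

- Highest reported R² (0.9956, versus 0.9811 for the best sklearn model, the random forest). The Spark score comes from a different train/test split (`randomSplit`) than the sklearn scores (`train_test_split`), so the comparison is indicative rather than exact.
- Scalability on large datasets: training runs on Spark instead of on a pandas copy collected to the driver.
- Clean pipeline design: feature assembly and the regressor are fitted and applied as a single `Pipeline` object.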
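
The sklearn models were scored on a `train_test_split` test set and the Spark pipeline on a `randomSplit` test set, so part of any gap in R² may come from which rows landed in each test set. The sketch below illustrates that effect on synthetic data (an assumption, not the `events_gold` table): the same one-feature least-squares model is refitted on five different 80/20 splits and R² is reported for each. All names here are hypothetical.

```python
import random

def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data: y is roughly 0.5 * x plus Gaussian noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 100) for _ in range(200)]
ys = [0.5 * x + data_rng.gauss(0, 5) for x in xs]

def r2_for_split(seed, test_fraction=0.2):
    """Shuffle row indices with the given seed, fit on 80%, score on 20%."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    slope, intercept = fit_line(
        [xs[i] for i in train_idx], [ys[i] for i in train_idx]
    )
    preds = [slope * xs[i] + intercept for i in test_idx]
    return r2_score([ys[i] for i in test_idx], preds)

scores = {seed: r2_for_split(seed) for seed in range(5)}
for seed, score in scores.items():
    print(f"seed={seed}: R² = {score:.4f}")
```

Scoring every candidate on the same held-out rows, or averaging over several splits, would make the final ranking easier to defend.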