#### Data Preparation

In [0]:
# Load the gold-layer table and convert it to pandas for scikit-learn
df = spark.table("dl_ecommerce_idc.gold.category_daily_metrics")
df = df.toPandas()

In [0]:
df.head()

Unnamed: 0,category_code,event_date,views,carts,purchases,price
0,appliances.kitchen.steam_cooker,2019-10-14,170,7,4,130270.3
1,computers.peripherals.camera,2019-10-26,69,1,1,17180.38
2,computers.components.memory,2019-11-08,539,46,15,142884.42
3,kids.dolls,2019-10-19,498,6,8,73767.56
4,sport.bicycle,2019-11-17,2275,179,121,3572707.04


#### Splitting into train and test data

In [0]:
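from sklearn.model_selection import train_test_split

# Features: daily views and cart additions; target: daily purchases
X = df[["views", "carts"]]
y = df["purchases"]
# Hold out 20% of rows. No random_state is set, so the split (and the
# metrics below) will differ from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)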

#### Simple Linear Regression Model

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

model_a = LinearRegression()
model_a.fit(X_train, y_train)
y_pred_a = model_a.predict(X_test)

# Evaluate on the held-out test set (score() returns R² for regressors)
r2 = model_a.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_a))

print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")

R² Score: 0.9549
RMSE: 175.0858


#### Log the Model and Metrics using MLflow

In [0]:
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="v1_linear_regression"):
    # Log params
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("features", "views,carts")
    mlflow.log_param("test_size", 0.2)

    # Log metrics
    mlflow.log_metric("r2_score", r2)
    mlflow.log_metric("rmse", rmse)

    # Log trained model
    mlflow.sklearn.log_model(model_a, "model")




#### Adding one more feature (price) to the model

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

X = df[["views", "carts", "price"]]
y = df["purchases"]
# This is a fresh random split, so model_b is tested on different rows than model_a
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model_b = LinearRegression()
model_b.fit(X_train, y_train)
y_pred_b = model_b.predict(X_test)

# Evaluate on the held-out test set (score() returns R² for regressors)
r2 = model_b.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_b))

print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")

R² Score: 0.5924
RMSE: 797.5546


#### Logging the model with different features (version 2)

In [0]:
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="v2_linear_regression"):
    # Log params
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_param("features", "views,carts,price")
    mlflow.log_param("test_size", 0.2)

    # Log metrics
    mlflow.log_metric("r2_score", r2)
    mlflow.log_metric("rmse", rmse)

    # Log trained model
    mlflow.sklearn.log_model(model_b, "model")



### Model Evaluation (R² & RMSE)

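- **RMSE** (root mean squared error) is the square root of the average squared difference between predicted and actual values, in the same unit as the target (`purchases`). Because errors are squared, large misses weigh more heavily than small ones.
  - Lower RMSE = more accurate predictions.

- **R² score** is the proportion of the variance in the target that the model explains. On held-out data it is at most 1 and can be negative when the model does worse than always predicting the mean.
  - Higher R² = better fit.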
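
To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand. The helper names `rmse_manual` and `r2_manual` and the sample numbers are made up for illustration; in the notebook itself the values come from `mean_squared_error` and `model.score`.

```python
import math

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot

# Hypothetical purchase counts, for illustration only
actual = [4, 1, 15, 8]
predicted = [5, 2, 13, 8]

print(f"RMSE: {rmse_manual(actual, predicted):.4f}")
print(f"R²:   {r2_manual(actual, predicted):.4f}")
```

Predicting the mean of `actual` for every row gives an R² of exactly 0 under this definition, which is why R² is often read as "improvement over the mean baseline".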

#### Interpretation of results:
- The model using **views and carts only** achieved a **higher R² (~0.95)** and a **much lower RMSE (~175)**.
- Adding **price** lowered R² to ~0.59 and raised RMSE to ~798.

This suggests that **price does not add predictive value at the category-day level** and may introduce noise. Note that the two models were evaluated on different random test splits (no `random_state` was set), so part of the gap may come from the split rather than from the feature itself.

**Conclusion:**  
On these runs, the simpler model (without price) performs better and is preferred.
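
Each model above was fit after its own unseeded call to `train_test_split`, so the two sets of metrics come from different test rows. One way to make the comparison like-for-like is to split once, with a fixed seed, and reuse the same row indices for both feature sets. The sketch below shows the idea with the standard library only. `split_indices` is a hypothetical helper, not part of scikit-learn, and its rounding of the test-set size is an arbitrary choice. With scikit-learn, passing the same `random_state` to both `train_test_split` calls should give the same effect.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) from one seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    n_test = max(1, round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# The five rows shown by df.head(), reordered as (views, carts, price, purchases)
rows = [
    (170, 7, 130270.30, 4),
    (69, 1, 17180.38, 1),
    (539, 46, 142884.42, 15),
    (498, 6, 73767.56, 8),
    (2275, 179, 3572707.04, 121),
]

train_idx, test_idx = split_indices(len(rows))

# Both feature sets are sliced with the same indices, so each model
# would be trained and tested on exactly the same rows.
X_a_test = [rows[i][:2] for i in test_idx]      # views, carts
X_b_test = [rows[i][:3] for i in test_idx]      # views, carts, price
y_test_shared = [rows[i][3] for i in test_idx]  # purchases

print("train rows:", sorted(train_idx))
print("test rows: ", sorted(test_idx))
```

With the split held fixed, any remaining difference in R² and RMSE between the two models can be attributed to the added feature rather than to which rows happened to land in the test set.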
