1. Setup and Data Loading (Code Cell)

In [5]:
## 1. Setup and Data Loading (with Feature Engineering)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# --- Data Loading ---
file_path = '../../datasets/Supplement_Sales_Weekly_Expanded.csv'
try:
    data = pd.read_csv(file_path)
    print("Raw data loaded successfully. Shape:", data.shape)
except FileNotFoundError:
    print("Error: Please check the path to your prepared dataset.")
    # Re-raise so execution stops here instead of failing later with a NameError on `data`
    raise

# --- Feature Engineering (Based on your Steps 1 & 2) ---

# 1. Data Cleaning and Preparation (Temporal & Grouping)
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data = data.drop(columns=['Category', 'Revenue', 'Location'], errors='ignore')

# Calculate the average monthly price (as per your notes)
product_data_grouped = data.groupby(['Product_Name', 'Year', 'Month']).agg(
    Price_Avg=('Price', 'mean'), # Calculate the average price
    Product_ID=('Product_Name', 'first')
).reset_index()

# Sort data chronologically for lag/time features
product_data_grouped = product_data_grouped.sort_values(by=['Product_Name', 'Year', 'Month']).reset_index(drop=True)

# Select a single product for simplicity in this demo
PRODUCT_ID = product_data_grouped['Product_Name'].unique()[0]
product_data = product_data_grouped[product_data_grouped['Product_Name'] == PRODUCT_ID].copy()


# 2. Creating Additional Features (Time & Lag)
product_data['Time_Index'] = np.arange(len(product_data)) + 1
product_data['Time_Index_Squared'] = product_data['Time_Index'] ** 2

# Seasonal Coding
product_data['Month_sin'] = np.sin(2 * np.pi * product_data['Month'] / 12)
product_data['Month_cos'] = np.cos(2 * np.pi * product_data['Month'] / 12)

# Lag Variables (using .shift())
product_data['Price_Lag_1'] = product_data['Price_Avg'].shift(1)
product_data['Price_Lag_3'] = product_data['Price_Avg'].shift(3)
product_data['Price_Lag_12'] = product_data['Price_Avg'].shift(12)

# Moving Averages (using .rolling())
product_data['Price_MA_6'] = product_data['Price_Avg'].rolling(window=6).mean().shift(1)
product_data['Price_MA_12'] = product_data['Price_Avg'].rolling(window=12).mean().shift(1)

# Drop initial rows with NaN values created by lags/MA
product_data = product_data.dropna().reset_index(drop=True)
print("Features engineered. Ready for modeling. Final shape:", product_data.shape)

# Define features (X) and target (y)
FEATURES = ['Year', 'Month', 'Month_sin', 'Month_cos', 'Time_Index', 'Time_Index_Squared', 
            'Price_Lag_1', 'Price_Lag_3', 'Price_Lag_12', 'Price_MA_6', 'Price_MA_12']
TARGET = 'Price_Avg'

X = product_data[FEATURES]
y = product_data[TARGET]

Raw data loaded successfully. Shape: (4384, 10)
Features engineered. Ready for modeling. Final shape: (51, 14)
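
The lag and moving-average features in the cell above leave missing values at the start of the series, which is why `dropna()` shortens the data. The following is a minimal sketch of that effect on a hypothetical 24-month price series (the values and the name `toy` are illustrative assumptions, not the supplement data). It also checks that applying `.shift(1)` after `.rolling()` keeps the current month's price out of its own moving average.

```python
import numpy as np
import pandas as pd

# Hypothetical 24-month price series (illustrative values only).
toy = pd.DataFrame({"Price_Avg": np.linspace(20.0, 31.5, 24)})

# Same recipe as the notebook cell, reduced to the two longest windows.
toy["Price_Lag_12"] = toy["Price_Avg"].shift(12)
toy["Price_MA_12"] = toy["Price_Avg"].rolling(window=12).mean().shift(1)

rows_before = len(toy)
toy_clean = toy.dropna().reset_index(drop=True)
rows_dropped = rows_before - len(toy_clean)

# The first surviving moving average should only use the 12 earlier months.
first_ma = toy_clean.loc[0, "Price_MA_12"]
mean_of_prior_12 = toy["Price_Avg"].iloc[:12].mean()

print(f"Rows before dropna: {rows_before}")
print(f"Rows dropped:       {rows_dropped}")
print(f"First MA uses only past months: {np.isclose(first_ma, mean_of_prior_12)}")
```

With a 12-month lag as the longest window, the first 12 rows cannot be used for modeling.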


# 2. The Problem Cross-Validation Solves (Overfitting)

We are predicting the average price of a vitamin supplement using a Random Forest model.

### The Exam Analogy and Overfitting

Imagine your **Model is a student** and your **Data is the exam material**.

> If you test your model using the **same data** it used to learn (study), the model will get a great score. That score can hide **overfitting**: the model may have simply **memorized the answers** instead of learning the general concepts.
>
> That great score is a **fake measure** of how well the model will perform on **new, unseen data** in the real world.
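
The gap between "studied" and "unseen" scores can be seen in a small experiment. The sketch below uses synthetic data as a stand-in (an assumption; it is not the supplement dataset, and names such as `X_demo` and `demo_model` are hypothetical) and compares the MAE on the rows the model was fitted on with the MAE on rows it never saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (assumption): a noisy linear signal, 80 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=80)

# Hold out the last 20 rows; the model never sees them during fitting.
X_fit, X_holdout = X_demo[:60], X_demo[60:]
y_fit, y_holdout = y_demo[:60], y_demo[60:]

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_fit, y_fit)

mae_seen = mean_absolute_error(y_fit, demo_model.predict(X_fit))
mae_unseen = mean_absolute_error(y_holdout, demo_model.predict(X_holdout))

print(f"MAE on data the model studied: {mae_seen:.3f}")
print(f"MAE on held-out data:          {mae_unseen:.3f}")
```

On data like this, the error on the studied rows is typically much smaller than the error on the held-out rows, which is the "fake measure" the analogy warns about.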

### The Problem with a Single Split

In our original project, we used a single **80/20 train-test split**, respecting the chronological order (training on older data, testing on newer data).

While respecting time is good, relying on **just one single split** has a major flaw:

1.  **Good or Bad Luck:** If the 20% test period happens to contain unusual, scattered prices (noise), our final score will look worse than the model really is. If it happens to be a calm period, the score will look better than it really is.
2.  **No Measure of Spread:** We get one score (e.g., MAE = 0.50). We don't know if that score is stable or just a lucky/unlucky result from that specific time window.

**Cross-Validation addresses this by running several tests, each on a different part of the data, and averaging the results.**

In [6]:
## 3. The Unreliable Single Train-Test Split

# 3.1. Single split (Chronological)
# Since this is time-series data, we split chronologically (80% for train, 20% for test)
split_point = int(len(X) * 0.80)
X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

print(f"Train size: {len(X_train)} samples. Test size: {len(X_test)} samples.")


# 3.2. Train and Evaluate
model_single = RandomForestRegressor(n_estimators=100, random_state=42)
model_single.fit(X_train, y_train)

y_pred_single = model_single.predict(X_test)

mae_single = mean_absolute_error(y_test, y_pred_single)
print(f"\nSingle Train-Test MAE (Mean Absolute Error): ${mae_single:.3f}")



Train size: 40 samples. Test size: 11 samples.

Single Train-Test MAE (Mean Absolute Error): $4.998


# 3.3. Interpretation (Markdown Cell)
### Interpretation of Single Split

We achieved a Mean Absolute Error (MAE) of **$4.998** on the 11 test samples.

**Question:** Is this a robust score? What if we had tested on a different 20% period? We have no way to know.

This is why we need Cross-Validation.
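
One way to probe the question above is to hold out a different 20% window and see whether the score moves. The sketch below does this on a synthetic seasonal series (an assumption; the data, the helper `holdout_mae`, and its parameters are hypothetical and not part of the original project). It keeps the chronological rule of training only on rows that come before the test window.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in series (assumption): a yearly cycle plus noise, 100 time steps.
rng = np.random.default_rng(1)
n_steps = 100
t = np.arange(n_steps)
X_demo = np.column_stack([np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])
y_demo = 30 + 5 * X_demo[:, 0] + rng.normal(scale=2.0, size=n_steps)


def holdout_mae(train_end, test_end):
    """Train on rows [0, train_end) and return the MAE on rows [train_end, test_end)."""
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X_demo[:train_end], y_demo[:train_end])
    preds = model.predict(X_demo[train_end:test_end])
    return mean_absolute_error(y_demo[train_end:test_end], preds)


mae_last_window = holdout_mae(80, 100)    # train on the first 80%, test on the final 20%
mae_earlier_window = holdout_mae(60, 80)  # train on the first 60%, test on the next 20%

print(f"MAE, final 20% as test window:   {mae_last_window:.3f}")
print(f"MAE, earlier 20% as test window: {mae_earlier_window:.3f}")
```

The two numbers generally differ, which illustrates that a single split reports the score of one particular window rather than of the model in general.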

## 4. Introducing K-Fold Cross-Validation

Instead of one single test, K-Fold CV gives our model **multiple tests**, each on a different part of the data.

We split the entire dataset into **K** roughly equal pieces (folds). We then run **K** separate experiments, rotating which piece is used for testing (shown here for K = 5):

| Experiment | Training Data | Testing Data (Validation) |
| :--- | :--- | :--- |
| **Fold 1** | Folds 2, 3, 4, 5 | **Fold 1** |
| **Fold 2** | Folds 1, 3, 4, 5 | **Fold 2** |
| ... | ... | ... |
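
The rotation in the table can be made concrete by printing the row indices that scikit-learn's `KFold` assigns to each experiment. This is a small sketch on ten hypothetical sample IDs (`sample_ids` and `kf_demo` are illustrative names), not the project data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified by their row position 0..9.
sample_ids = np.arange(10)
kf_demo = KFold(n_splits=5, shuffle=False)

fold_assignments = []
for fold_number, (train_idx, test_idx) in enumerate(kf_demo.split(sample_ids), start=1):
    fold_assignments.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample appears in exactly one test fold, so every row is used for testing once and for training in the other four experiments.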

**Note on Time Series:** For true time-series data, standard K-Fold is usually not correct because it mixes past and future data. However, for a basic introduction, we will use it here to demonstrate the *averaging* concept. (We will address the proper time-series CV in a later notebook).
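
As a brief preview of the time-aware alternative, scikit-learn provides `TimeSeriesSplit`, in which every training set ends before its test set begins. The sketch below only prints the index layout on ten hypothetical samples (`tscv_demo` and `ts_folds` are illustrative names); it is not a full treatment of time-series CV.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten hypothetical samples in chronological order.
sample_ids = np.arange(10)
tscv_demo = TimeSeriesSplit(n_splits=4)

ts_folds = []
for fold_number, (train_idx, test_idx) in enumerate(tscv_demo.split(sample_ids), start=1):
    ts_folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Unlike standard K-Fold, the training window here grows over time and never contains rows from after the test window.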

In [9]:
## 5. Implementing K-Fold CV (5 Folds)

from sklearn.model_selection import KFold, cross_val_score

# Use K=5 for this introduction.
# shuffle=False keeps rows in time order within each fold, though the folds
# themselves are not strictly chronological train/test splits.
kf = KFold(n_splits=5, shuffle=False)

# Re-initialize the model
model_cv = RandomForestRegressor(n_estimators=100, random_state=42)

# Use cross_val_score: Note that 'neg_mean_absolute_error' is used because scikit-learn
# treats scoring metrics as something to be maximized (higher is better).
# The negative sign converts the error into a "score."
cv_scores = cross_val_score(
    model_cv, 
    X, 
    y, 
    cv=kf, 
    scoring='neg_mean_absolute_error'
)

# Convert negative scores back to positive MAE errors
cv_maes = -cv_scores 

print("Individual MAE scores for each fold (test):")
print(cv_maes)

print(f"\nFinal CV Score (Average MAE): ${cv_maes.mean():.3f}")
print(f"Standard Deviation of MAE: {cv_maes.std():.3f}")

Individual MAE scores for each fold (test):
[7.61504818 4.311373   8.2899025  5.60471    4.758919  ]

Final CV Score (Average MAE): $6.116
Standard Deviation of MAE: 1.571
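
To show what `cross_val_score` does behind the scenes, the sketch below writes the same procedure as an explicit loop and compares it with the helper. It runs on synthetic data (an assumption; `X_demo`, `base_model`, and `manual_maes` are hypothetical names), so its numbers are unrelated to the fold scores printed above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 50 rows, 4 features.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=50)

kf_demo = KFold(n_splits=5, shuffle=False)
base_model = RandomForestRegressor(n_estimators=30, random_state=42)

# Manual version: fit a fresh copy of the model on each training part,
# then score it on the matching held-out fold.
manual_maes = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_model = clone(base_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_preds = fold_model.predict(X_demo[test_idx])
    manual_maes.append(mean_absolute_error(y_demo[test_idx], fold_preds))
manual_maes = np.array(manual_maes)

# Helper version: the scores come back negated, so flip the sign.
helper_maes = -cross_val_score(
    base_model, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_absolute_error"
)

print("Manual loop MAEs:    ", np.round(manual_maes, 3))
print("cross_val_score MAEs:", np.round(helper_maes, 3))
```

Because the model uses a fixed `random_state`, the two approaches should agree fold by fold.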


## **6. Conclusion: A Reliable Score**

Now we can directly compare the result from the single, potentially unreliable test against the more robust Cross-Validation (CV) average.

| Metric | Single Split (80/20) Result | K-Fold CV (Average) Result |
| :---- | :---- | :---- |
| **MAE** | **$4.998** | **$6.116** |
| **Robustness** | Low (Based on one single test period) | High (Based on 5 different test periods) |

### **What These Results Tell Us**

1. **The Single Split May Have Been Optimistic:**
   * Your initial single test score of **$4.998** is lower than the CV average, although it still falls within the range of the individual fold scores ($4.311 to $8.290). One plausible explanation is that the final 20% of the data was an **easier, less noisy period** for the model to predict.
   * If you had relied only on that $4.998 score, you might have **overestimated** your model's real-world accuracy.
2. **The CV Score is the More Honest Grade:**
   * The **Average MAE of $6.116** is a more trustworthy estimate of the model's general performance, because **every part of your data** has been used for testing exactly once. It is still an estimate: as noted in Section 4, standard K-Fold lets some folds train on data from after the test period, so the number should be read with that caveat.
3. **The Model's Performance Varies:**
   * The **Individual MAE Scores** ranged from **$4.311** to **$8.290**.
   * The **Standard Deviation of $1.571** (about a quarter of the average MAE) shows that the model's prediction accuracy changes noticeably depending on the time period it's tested on.
   * This spread may point to high model **variance** (a form of instability), but with only about 10 test samples per fold it may also reflect genuinely harder time periods or plain sampling noise. Either way, it is information a single split cannot provide.
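
The summary figures quoted in this list can be recomputed directly from the five fold scores printed in Section 5. The sketch below does that with NumPy; the variable names are illustrative. Note that `ndarray.std()` defaults to the population formula (`ddof=0`), which is what the notebook reported; the sample formula (`ddof=1`) gives a somewhat larger value with only five folds.

```python
import numpy as np

# Fold scores copied from the notebook output in Section 5.
fold_maes = np.array([7.61504818, 4.311373, 8.2899025, 5.60471, 4.758919])
single_split_mae = 4.998  # single 80/20 split result from Section 3

mean_mae = fold_maes.mean()
std_population = fold_maes.std()      # ddof=0, matches the notebook's printed value
std_sample = fold_maes.std(ddof=1)    # ddof=1, the sample standard deviation

# Does the single-split score fall inside the range of the fold scores?
inside_fold_range = fold_maes.min() <= single_split_mae <= fold_maes.max()

print(f"Average MAE:               {mean_mae:.3f}")
print(f"Std (population, ddof=0):  {std_population:.3f}")
print(f"Std (sample, ddof=1):      {std_sample:.3f}")
print(f"Single split within range: {inside_fold_range}")
```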

**In summary, Cross-Validation gave your model a more honest overall grade of $6.116 and revealed that its performance varies more from period to period than the single initial test could show.**