# The Battle of Algorithms: Model Comparison

## Selecting the Best Model

We have tried Linear Regression, Random Forests, and XGBoost individually.
Now we run a **systematic benchmark** to see which algorithm best predicts **customer total transaction amounts** in Malaysia.

### The Contenders
1. **Linear Regression**: The baseline.
2. **Support Vector Regressor (SVR)**: Kernel-based; works well on small to medium datasets but is sensitive to feature scale.
3. **Random Forest**: Robust ensemble method.
4. **XGBoost**: Gradient boosting machine.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Models
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # For saving the model

%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Data Preparation Pipeline

To ensure a fair comparison, all models use the same data.
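SVR is scale-sensitive, so we **scale** the features.
Ordinary least squares and tree-based models (RF, XGB) predict the same values with or without scaling (for Linear Regression only the coefficients change), so feeding every model the scaled features does no harm.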

In [None]:
# Load data
df = pd.read_csv("dummy_malaysia_customer_transactions_2025.csv")

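# Encode categorical features as integer codes.
# Note: the codes follow alphabetical order, so linear models and SVR will read an ordering that has no real meaning.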
le_region = LabelEncoder()
le_quarter = LabelEncoder()
df['region_encoded'] = le_region.fit_transform(df['region'].fillna('Unknown'))
df['quarter_encoded'] = le_quarter.fit_transform(df['quarter'].fillna('Unknown'))

features = ['region_encoded', 'quarter_encoded', 'number_of_purchases']
target = 'total_transaction_amount'

data = df[features + [target]].dropna()
X = data[features]
y = data[target]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data prepared and scaled.")
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
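To make the scaling step concrete, here is a minimal sketch on synthetic data (the array `X_demo` and the `demo_` names are hypothetical stand-ins, not the transactions dataset). It shows that a scaler fitted on the training split gives that split a mean of about 0 and a standard deviation of about 1, while the test split is transformed with the same training statistics and so lands only close to those values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (hypothetical), centred far from zero on purpose
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(500, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the same statistics for the test split
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train means:", X_tr_scaled.mean(axis=0).round(3))
print("Train stds: ", X_tr_scaled.std(axis=0).round(3))
print("Test means: ", X_te_scaled.mean(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.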

## 2. Setting up the Benchmark

We define a list of models and evaluate each with **10-Fold Cross-Validation**.
We collect `R2` scores for each fold to visualize stability.

In [None]:
# Define models
models = []
models.append(('Linear', LinearRegression()))
models.append(('SVR', SVR()))
models.append(('RandomForest', RandomForestRegressor(n_estimators=100, random_state=42)))
models.append(('XGBoost', xgb.XGBRegressor(objective='reg:squarederror', random_state=42)))

# Iterate and evaluate
results = []
names = []

print("--- Cross-Validation Results (R2 Score) ---")

for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kfold, scoring='r2')
    results.append(cv_results)
    names.append(name)
    print(f"{name}: {cv_results.mean():.4f} (+/- {cv_results.std() * 2:.4f})")
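One caveat about the loop above: the scaler was fitted on the whole training set before cross-validation, so each validation fold has already contributed to the scaling statistics. A possible alternative, sketched here on synthetic data (the `X_demo`, `y_demo`, and `svr_pipeline` names are hypothetical), wraps the scaler and the model in a scikit-learn pipeline so the scaler is re-fitted inside every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical), not the transactions dataset
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# The pipeline re-fits StandardScaler on the training portion of each fold
svr_pipeline = make_pipeline(StandardScaler(), SVR())

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(svr_pipeline, X_demo, y_demo, cv=kfold, scoring='r2')

print(f"SVR pipeline: {pipeline_scores.mean():.4f} (+/- {pipeline_scores.std() * 2:.4f})")
```

In the notebook, the same idea would mean passing the unscaled `X_train` with a pipeline per model. With only three features the difference is likely small, but the pipeline form keeps the comparison free of leakage.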

## 3. Visualizing the Winner

A boxplot compares the algorithms by showing each one's median R2 score and the spread across folds (stability).

In [None]:
plt.figure(figsize=(12, 6))
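plt.boxplot(results, patch_artist=True,
            boxprops=dict(facecolor="lightblue", color="blue"),
            medianprops=dict(color="red"))
# Label the boxes via xticks; the `labels` argument is deprecated in recent Matplotlib releases
plt.xticks(range(1, len(names) + 1), names)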

plt.title('Algorithm Comparison: R2 Score Distribution (Customer Transactions)', fontsize=14)
plt.ylabel('R2 Score (Higher is Better)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
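The boxplot can be complemented with a small ranking table, which makes the choice explicit rather than visual. The sketch below uses made-up fold scores (`demo_names` and `demo_results` are hypothetical placeholders, not benchmark output); in the notebook, `names` and `results` from the benchmark loop would take their place.

```python
import numpy as np
import pandas as pd

# Hypothetical fold scores, for illustration only (not real benchmark output)
demo_names = ['ModelA', 'ModelB', 'ModelC']
demo_results = [
    np.array([0.70, 0.72, 0.68, 0.71]),
    np.array([0.80, 0.60, 0.85, 0.75]),
    np.array([0.78, 0.79, 0.77, 0.78]),
]

# One row per model, sorted so the highest mean R2 comes first
summary = pd.DataFrame({
    'model': demo_names,
    'mean_r2': [scores.mean() for scores in demo_results],
    'std_r2': [scores.std() for scores in demo_results],
}).sort_values('mean_r2', ascending=False).reset_index(drop=True)

print(summary)

best_by_mean = summary.loc[0, 'model']
print(f"Highest mean R2: {best_by_mean}")
```

Ranking by mean alone is a simplification. When two models are close, the one with the smaller spread across folds may be the safer pick.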

## 4. Final Training and Saving

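Based on the results above, we select the model with the best cross-validation scores. The cell below assumes XGBoost; change it if another model came out ahead.
We fit that model on the entire training set, evaluate it once on the held-out test set, and save it for production use.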

In [None]:
# Let's assume XGBoost won (if not, change the two lines below!)
best_model_name = 'XGBoost'
final_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Train on full training data
final_model.fit(X_train_scaled, y_train)

# Final Test Evaluation
y_pred_final = final_model.predict(X_test_scaled)
final_r2 = r2_score(y_test, y_pred_final)

print(f"Final Model ({best_model_name}) Test R2: {final_r2:.4f}")

# Save Model and Scaler
joblib.dump(final_model, 'customer_model_best.pkl')
joblib.dump(scaler, 'customer_scaler.pkl')

print("Model and Scaler saved to disk.")
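To illustrate how the saved artifacts might be used later, the following sketch saves and reloads a model and scaler with `joblib` and checks that the reloaded pair reproduces the original predictions. It uses a `LinearRegression` stand-in, synthetic data, and temporary file names (all hypothetical) so that it runs without the dataset or XGBoost.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and model (hypothetical), not the notebook's final model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, 0.5, -1.0])

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    scaler_path = os.path.join(tmp_dir, 'demo_scaler.pkl')

    # Persist both objects, then load them back as a fresh process would
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must go through the same scaler before prediction
X_new = X_demo[:5]
reloaded_pred = loaded_model.predict(loaded_scaler.transform(X_new))
original_pred = demo_model.predict(demo_scaler.transform(X_new))

print("Predictions match:", np.allclose(reloaded_pred, original_pred))
```

For the notebook's model, raw records would also need the same `region` and `quarter` encoding before scaling. The cell above does not save `le_region` or `le_quarter`, so persisting them alongside the scaler may be worth considering.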

## Summary

We have successfully:
1. Compared 4 different algorithms using 10-fold cross-validation on the **Malaysia Customer Transactions** dataset.
2. Visualized the results to make a data-driven choice.
3. Selected the best model and saved it for future use.

This concludes the model comparison module!