# Machine Learning with Scikit-Learn: Predicting Customer Transactions

**Goal**: Predict `total_transaction_amount` for Malaysian customers.

We will cover:
1. **Data Preprocessing**: Encoding categoricals + `StandardScaler`.
2. **Baseline Modeling**: Linear Regression as a benchmark.
3. **Advanced Modeling**: Random Forest Regressor.
4. **Robust Evaluation**: K-Fold Cross-Validation.
5. **Hyperparameter Tuning**: `GridSearchCV`.

Let's get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load Data
df = pd.read_csv("dummy_malaysia_customer_transactions_2025.csv")

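# Encode categorical features as integer codes.
# Note: LabelEncoder assigns codes in sorted order, so a linear model will treat them as ordered.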
le_region = LabelEncoder()
le_quarter = LabelEncoder()
df['region_encoded'] = le_region.fit_transform(df['region'].fillna('Unknown'))
df['quarter_encoded'] = le_quarter.fit_transform(df['quarter'].fillna('Unknown'))

features = ['region_encoded', 'quarter_encoded', 'number_of_purchases']
target = 'total_transaction_amount'

data = df[features + [target]].dropna()
X = data[features]
y = data[target]

print(f"Dataset loaded. Samples: {len(data)}")

## 1. Feature Scaling

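We use `StandardScaler`, which rescales each feature to mean 0 and standard deviation 1. Scaling is essential for distance-based algorithms (KNN, SVM) and for models trained by gradient descent (neural networks). scikit-learn's `LinearRegression` solves ordinary least squares directly, so scaling does not change its predictions, but it does put the coefficients on a comparable scale. The scaler is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training.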
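
To make the transformation concrete, here is a minimal sketch on synthetic data. The arrays and the names ending in `_demo` are made up for illustration and are not the notebook's dataset. It checks that `StandardScaler` matches the formula z = (x - mean) / std, and that the test split is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two columns on very different scales.
X_train_demo = rng.normal(loc=[50.0, 5000.0], scale=[5.0, 800.0], size=(200, 2))
X_test_demo = rng.normal(loc=[50.0, 5000.0], scale=[5.0, 800.0], size=(50, 2))

# Learn mean and std from the training data only.
scaler_demo = StandardScaler().fit(X_train_demo)
Z_train = scaler_demo.transform(X_train_demo)
Z_test = scaler_demo.transform(X_test_demo)  # reuses the training statistics

# The same transformation by hand (population std, ddof=0).
Z_manual = (X_train_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print("Matches manual formula:", np.allclose(Z_train, Z_manual))
print("Train mean/std:", Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
print("Test mean:", Z_test.mean(axis=0).round(3))  # near 0, but not exactly 0
```

The test split ends up close to, but not exactly at, mean 0. That is expected, because its statistics were never used to fit the scaler.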

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=features)
print("Scaled Data Example:")
X_train_scaled_df.head(3)

## 2. Baseline Model: Linear Regression

We start with a simple model to set a performance benchmark.

In [None]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("--- Baseline (Linear Regression) ---")
print(f"MSE: {mse_lr:.2f}")
print(f"R2 Score: {r2_lr:.4f}")
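
As a reference for reading these numbers, the sketch below recomputes MSE and R2 by hand and compares them with the scikit-learn functions. The values in `y_true_demo` and `y_pred_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted amounts, for illustration only.
y_true_demo = np.array([120.0, 340.0, 560.0, 410.0, 95.0])
y_pred_demo = np.array([150.0, 300.0, 500.0, 450.0, 130.0])

residuals = y_true_demo - y_pred_demo

# MSE: average squared residual (units are the target's units squared).
mse_manual = np.mean(residuals ** 2)

# R2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE brings the error back to the target's own units.
rmse_manual = np.sqrt(mse_manual)

print("MSE matches sklearn:", np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))
print("R2 matches sklearn:", np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))
print(f"RMSE: {rmse_manual:.2f}")
```

An R2 of 0 corresponds to always predicting the mean of the target, so any useful baseline should score above that.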

## 3. Advanced Model: Random Forest & Cross-Validation

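**Random Forest** can capture non-linear relationships. **K-Fold CV** (K=5) gives a more reliable performance estimate than a single train/test split, because every training sample is used for validation exactly once. Tree-based models do not need scaled inputs; we pass the scaled data only to stay consistent with the baseline.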
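
When a model does depend on scaling, the scaler should be refitted inside each fold so that validation rows never influence its statistics. The following is a minimal sketch of that pattern using a `Pipeline`, on synthetic stand-in data (`X_demo`, `y_demo`, and the other names are assumptions for illustration, not the notebook's variables).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: 3 features with a non-linear target.
X_demo = rng.normal(size=(300, 3))
y_demo = 100 * X_demo[:, 2] ** 2 + 20 * X_demo[:, 0] + rng.normal(scale=5, size=300)

# The pipeline refits the scaler on the training part of every fold.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=42),
)

# Explicit 5-fold splitter with shuffling for reproducible folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring="r2")

print("Per-fold R2:", scores_demo.round(4))
print(f"Mean R2: {scores_demo.mean():.4f}")
```

Passing `cv=5` directly, as the notebook does, uses unshuffled folds for regressors; the explicit `KFold` object is one way to control shuffling.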

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
cv_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='r2')

print("--- Random Forest Cross-Validation ---")
print(f"CV R2 Scores: {cv_scores}")
print(f"Mean CV R2: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

## 4. Hyperparameter Tuning with GridSearchCV

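`GridSearchCV` evaluates every combination of the listed parameter values with cross-validation and keeps the best-scoring one:
- `n_estimators`: Number of trees in the forest.
- `max_depth`: Maximum depth of each tree; shallower trees are less prone to overfitting.
- `min_samples_split`: Minimum number of samples a node must hold before it can be split.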
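
The cost of an exhaustive search grows multiplicatively with each parameter. As a quick check, this sketch counts the candidates for the grid used in this section with `ParameterGrid`; the names `param_grid_demo` and `cv_folds` are local to the example.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid used in this section.
param_grid_demo = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}
cv_folds = 3

# 3 x 3 x 2 parameter combinations.
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is fitted once per fold.
total_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")   # 18
print(f"Total CV fits: {total_fits}")  # 54
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.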

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, scoring='r2', n_jobs=-1, verbose=1)
print("Starting Grid Search... this may take a minute.")
grid_search.fit(X_train_scaled, y_train)

best_rf = grid_search.best_estimator_
print("\n--- Optimization Complete ---")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")

## 5. Final Evaluation

Test the **Optimized Random Forest** on the unseen Test Set.

In [None]:
y_pred_best = best_rf.predict(X_test_scaled)

final_mse = mean_squared_error(y_test, y_pred_best)
final_r2 = r2_score(y_test, y_pred_best)
final_mae = mean_absolute_error(y_test, y_pred_best)

print("--- Final Test Set Performance ---")
print(f"Optimized Random Forest R2: {final_r2:.4f}")
print(f"MAE: RM {final_mae:.2f}")
print(f"Improvement over Baseline: {final_r2 - r2_lr:.4f}")

## 6. Visualization

### 6.1 Actual vs Predicted Plot

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred_best, alpha=0.6, color='purple', s=60)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Transaction Amount (MYR)', fontsize=12)
plt.ylabel('Predicted Transaction Amount (MYR)', fontsize=12)
plt.title(f'Final Model Accuracy (R² = {final_r2:.2f})', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()

### 6.2 Feature Importance
What drives the transaction amount prediction in our optimized model?

In [None]:
importances = best_rf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(8, 4))
sns.barplot(x=importances[indices], y=np.array(features)[indices], palette='magma')
plt.title('Feature Importance (Optimized Random Forest)', fontsize=14)
plt.xlabel('Relative Importance')
plt.show()
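
The `feature_importances_` attribute is impurity-based and computed from the training data. A complementary view is permutation importance, which measures how much a held-out score drops when one feature is shuffled. The following is a minimal sketch on synthetic stand-in data in which only the third column drives the target; all names ending in `_demo` and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names_demo = ["region_encoded", "quarter_encoded", "number_of_purchases"]

# Synthetic stand-in data: only the third column actually drives the target.
X_demo = np.column_stack([
    rng.integers(0, 5, size=400),
    rng.integers(0, 4, size=400),
    rng.integers(1, 30, size=400),
]).astype(float)
y_demo = 80.0 * X_demo[:, 2] + rng.normal(scale=20.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
model_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the drop in R2.
result = permutation_importance(
    model_demo, X_te, y_te, n_repeats=10, random_state=0, scoring="r2"
)

ranked = sorted(
    zip(feature_names_demo, result.importances_mean, result.importances_std),
    key=lambda item: -item[1],
)
for name, mean, std in ranked:
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

Features whose permutation importance is near zero contribute little to held-out performance, even if their impurity-based importance is non-zero.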

## 7. Challenge Exercises

In [None]:
# Task 1: Try Support Vector Regressor (SVR)
# SVR is highly sensitive to scaling, so use X_train_scaled.
# Import SVR from sklearn.svm
# Experiment with kernel='rbf' or kernel='linear'

# Your code here

In [None]:
# Task 2: Feature Engineering
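# Create 'Avg_Per_Purchase' = total_transaction_amount / number_of_purchases.
# Add this to X, then re-split, re-scale, and retrain the model.
# Caution: this feature is computed from the target, so it leaks the answer into X.
# Does the score improve? Would this feature be available at prediction time?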

# Your code here