# Retail Sales Forecasting with SageMaker XGBoost

This notebook demonstrates end-to-end retail sales forecasting with XGBoost, intended to run in an Amazon SageMaker notebook environment. All training, tuning, and inference steps run locally in the notebook kernel. We'll cover:

1. **Data preprocessing** - Load and prepare retail sales data
2. **Feature engineering** - Create time-series features for forecasting
3. **Model training** - Train an XGBoost model locally
4. **Hyperparameter tuning** - Search hyperparameters with scikit-learn's `GridSearchCV`
5. **Batch inference** - Generate predictions on test data
6. **Resource cleanup** - Delete local files created by the notebook

## Dataset Attribution
© Chen, D. (2012). Online Retail II [Dataset]. UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/dataset/502/online+retail+ii. Licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode#s6a.

## 1. Setup and Dependencies

In [None]:
# Install required packages (s3fs lets pandas read s3:// URLs directly)
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost boto3 joblib s3fs

import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os
import time
import joblib

# Initialize boto3 session
session = boto3.Session()
s3_client = boto3.client('s3')
region = session.region_name or 'us-west-2'
account_id = boto3.client('sts').get_caller_identity()['Account']

# S3 bucket for data storage
bucket = f'sagemaker-{region}-{account_id}'
prefix = 'retail-sales-forecasting'

# Create unique session ID for resource tracking
session_id = f"{int(time.time())}"
print(f"Session ID: {session_id} (for resource cleanup)")
print(f"Region: {region}")
print(f"Bucket: {bucket}")

## 2. Data Loading and Preprocessing

In [None]:
# Load and preprocess data
current_region = boto3.Session().region_name or "us-west-2"
data_url = f"s3://sagemaker-example-files-prod-{current_region}/datasets/tabular/online_retail/online_retail_II_20k.csv"
df = pd.read_csv(data_url)
df = df.dropna(subset=["Customer ID"])
df = df[(df["Quantity"] > 0) & (df["Price"] > 0)]
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["Revenue"] = df["Quantity"] * df["Price"]

# Aggregate daily sales
daily_sales = df.groupby(df['InvoiceDate'].dt.date).agg({
    'Revenue': 'sum', 'Quantity': 'sum', 'Invoice': 'nunique', 'Customer ID': 'nunique'
}).reset_index()
daily_sales.columns = ['Date', 'Revenue', 'Quantity', 'Orders', 'Customers']
daily_sales['Date'] = pd.to_datetime(daily_sales['Date'])
daily_sales = daily_sales.sort_values('Date').reset_index(drop=True)

print('Daily sales summary:')
print(daily_sales)
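
Because `groupby` only emits dates that have at least one invoice, calendar days with no sales are absent from `daily_sales`, and a row-based `shift` would then skip over them. The following is a minimal sketch of filling those gaps, assuming that a missing day means zero sales (this may not hold if the sample simply omits data). `fill_missing_days` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def fill_missing_days(daily: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Reindex to a continuous daily calendar, filling absent days with zeros."""
    return (
        daily.set_index(date_col)
        .sort_index()
        .asfreq("D", fill_value=0)  # insert missing calendar days as zero rows
        .rename_axis(date_col)      # keep the date column name after reindexing
        .reset_index()
    )

# Toy example (made-up values): 2 December is missing from the input
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-01", "2010-12-03"]),
    "Revenue": [100.0, 250.0],
})
filled = fill_missing_days(toy)
print(filled)
```

Whether zero is the right fill value depends on the business: a closed store and a data gap look identical in the raw invoices.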

## 3. Feature Engineering

In [None]:
# Create features
def create_features(df):
    df = df.copy()
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['Month'] = df['Date'].dt.month
    df['Quarter'] = df['Date'].dt.quarter
    df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)
    
    # Use only short lags (1 and 2 days) to limit rows lost to NaN values
    for lag in [1, 2]:
        df[f'Revenue_lag_{lag}'] = df['Revenue'].shift(lag)

    # 3-day rolling mean; note that the window includes the current day's revenue
    df['Revenue_ma_3'] = df['Revenue'].rolling(window=3).mean()
    
    return df

daily_sales_features = create_features(daily_sales)

# Drop rows with NaN values
daily_sales_features = daily_sales_features.dropna().reset_index(drop=True)

if len(daily_sales_features) < 5:
    raise ValueError(f"Insufficient data after feature engineering: {len(daily_sales_features)} rows")

print(f"Final dataset has {len(daily_sales_features)} rows for modeling")
print(daily_sales_features)
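
`rolling(window=3).mean()` includes the current row, so `Revenue_ma_3` as computed above contains the very value the model is asked to predict. Below is a sketch of one leakage-free alternative: shift by one day before averaging, so each row sees only earlier revenue. The column name `Revenue_ma_3_prev` and the helper are illustrative, and this variant loses one extra leading row to NaN values.

```python
import pandas as pd

def add_past_only_rolling_mean(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add a rolling mean of revenue computed from strictly earlier rows."""
    out = df.copy()
    out[f"Revenue_ma_{window}_prev"] = (
        out["Revenue"].shift(1).rolling(window=window).mean()
    )
    return out

# Toy example with made-up revenue values
toy = pd.DataFrame({"Revenue": [10.0, 20.0, 30.0, 40.0, 50.0]})
result = add_past_only_rolling_mean(toy)
print(result)
```

The same concern applies to the same-day `Quantity`, `Orders`, and `Customers` columns, which would not be known when forecasting that day's revenue.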

## 4. Data Preparation
The final dataset has 6 rows. With an 80/20 chronological split, the first 4 rows are used for training and the last 2 for testing.

In [None]:
# Prepare training data
feature_cols = [col for col in daily_sales_features.columns if col not in ['Date', 'Revenue']]
print(f"Feature columns: {feature_cols}")

split_idx = int(len(daily_sales_features) * 0.8)
train_data = daily_sales_features[:split_idx]
test_data = daily_sales_features[split_idx:]

print(f"Train data shape: {train_data.shape}, Test data shape: {test_data.shape}")

# Prepare features and target
X_train = train_data[feature_cols]
y_train = train_data['Revenue']
X_test = test_data[feature_cols]
y_test = test_data['Revenue']

print(f"Training data: {len(train_data)} days, Test data: {len(test_data)} days")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

os.makedirs("notebook_outputs", exist_ok=True)

## 5. Model Training with Local XGBoost

Train the XGBoost model locally for faster iteration and a simpler setup.

In [None]:
# Train initial XGBoost model locally
print("Training initial XGBoost model...")

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters
params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'rmse'
}

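# Train model
# Note: early stopping monitors the last entry in evals (the test set here),
# so the reported best score is optimistic
evals = [(dtrain, 'train'), (dtest, 'test')]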
model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=evals,
    early_stopping_rounds=10,
    verbose_eval=20
)

print(f"\nModel training completed!")
print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score:.4f}")
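
Early stopping in `xgb.train` monitors the last entry of `evals`, which is the test set here, so the test data influences how many trees are kept. One common alternative is to carve a validation tail off the training rows and monitor that instead. The sketch below shows this; `split_validation_tail` and the 25% fraction are assumptions for illustration. With only a handful of training rows, as in this sample, the validation set will be very small.

```python
import pandas as pd

def split_validation_tail(train_df: pd.DataFrame, val_fraction: float = 0.25):
    """Hold out the most recent rows of a chronologically sorted frame."""
    n_val = max(1, int(round(len(train_df) * val_fraction)))
    if n_val >= len(train_df):
        raise ValueError("Not enough rows to hold out a validation tail")
    return train_df.iloc[:-n_val], train_df.iloc[-n_val:]

# Toy example with four training rows (made-up values)
toy_train = pd.DataFrame({"Revenue": [1.0, 2.0, 3.0, 4.0]})
fit_part, val_part = split_validation_tail(toy_train)
print(len(fit_part), len(val_part))
```

The two parts could then be turned into `xgb.DMatrix` objects, with the validation matrix placed last in `evals`, which keeps `dtest` out of the training loop.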

## 6. Hyperparameter Tuning with GridSearch

In [None]:
# Hyperparameter tuning using GridSearchCV
print("Starting hyperparameter tuning...")

# Define the parameter grid (a small subset of parameters, for speed)
param_grid = {
    'max_depth': [3, 6, 10],
    'learning_rate': [0.01, 0.1, 0.3]
}

# Create XGBoost regressor
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    random_state=42
)

# Perform grid search with 3-fold cross-validation
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best score (negative MSE): {grid_search.best_score_:.4f}")

# Use best model
best_model = grid_search.best_estimator_
print("\nBest model selected from grid search")
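
With `cv=3`, `GridSearchCV` uses ordinary K-fold splits for a regressor, so some folds train on later days and validate on earlier ones. For time-ordered data, scikit-learn's `TimeSeriesSplit` keeps every validation fold after its training rows. Here is a short sketch of the folds it produces on a toy array; `n_splits=3` and the array size are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six time-ordered toy samples
X_toy = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X_toy)
]

# Each validation fold lies strictly after its training indices
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```

Passing `cv=TimeSeriesSplit(n_splits=3)` to `GridSearchCV` in place of `cv=3` would apply the same ordering during tuning.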

## 7. Batch Inference

Using the best model from tuning, we'll generate predictions on the test data.

In [None]:
# Generate predictions using the best model
print("Generating predictions on test data...")

predictions = best_model.predict(X_test)

print(f"Generated {len(predictions)} predictions")
print(f"Predictions: {predictions}")

## 8. Results Analysis

In [None]:
# Analyze predictions
actual_values = y_test.values
predicted_values = predictions

# Calculate metrics
mse = mean_squared_error(actual_values, predicted_values)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_values, predicted_values)

print(f"Model Performance Metrics:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")

# Create comparison dataframe
results_df = pd.DataFrame({
    'Actual': actual_values,
    'Predicted': predicted_values,
    'Error': actual_values - predicted_values
})

print("\nPrediction Results:")
print(results_df)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(results_df.index, results_df['Actual'], 'o-', label='Actual', linewidth=2)
plt.plot(results_df.index, results_df['Predicted'], 's-', label='Predicted', linewidth=2)
plt.xlabel('Test Sample')
plt.ylabel('Revenue')
plt.title('Actual vs Predicted Revenue')
plt.legend()
plt.grid(True)
plt.show()

# Save predictions
results_df.to_csv('notebook_outputs/predictions.csv', index=False)
print("\nPredictions saved to notebook_outputs/predictions.csv")
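
RMSE and MAE are easier to interpret next to a baseline. A common one for daily series is the persistence forecast, which predicts each day's revenue as the previous day's value (the `Revenue_lag_1` column in this notebook). The sketch below uses made-up numbers; `persistence_rmse` is a hypothetical helper and the values are not from the dataset.

```python
import numpy as np

def persistence_rmse(actual, previous_day) -> float:
    """RMSE of forecasting each value with the previous day's value."""
    actual = np.asarray(actual, dtype=float)
    previous_day = np.asarray(previous_day, dtype=float)
    return float(np.sqrt(np.mean((actual - previous_day) ** 2)))

# Made-up revenue values for illustration
actual_toy = [100.0, 120.0, 90.0]
lag1_toy = [110.0, 100.0, 120.0]
baseline_rmse = persistence_rmse(actual_toy, lag1_toy)
print(f"Persistence baseline RMSE: {baseline_rmse:.2f}")
```

In the notebook this would be `persistence_rmse(y_test, X_test['Revenue_lag_1'])`. A model whose RMSE is not below this baseline adds little over repeating yesterday's value.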

## 9. Conclusion

### 🚀 **How to Improve Prediction Accuracy**

1. **Larger Training Dataset**: Use full historical data (months/years vs. sample data)
2. **Advanced Features**: Add external factors (holidays, promotions, weather)
3. **Algorithm Selection**: Try DeepAR, Prophet, or ensemble methods
4. **Data Quality**: Handle missing values and remove outliers
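
As an illustration of item 2, a calendar flag can be added alongside the existing date features. This sketch assumes a hand-maintained set of holiday dates; the dates below and the `add_holiday_flag` helper are placeholders, not part of the original notebook.

```python
import pandas as pd

# Placeholder holiday calendar; replace with dates relevant to the business
HOLIDAYS = pd.to_datetime(["2010-12-25", "2010-12-26", "2011-01-01"])

def add_holiday_flag(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Add an IsHoliday column (1 if the date is in HOLIDAYS, else 0)."""
    out = df.copy()
    out["IsHoliday"] = out[date_col].dt.normalize().isin(HOLIDAYS).astype(int)
    return out

# Toy example: the day before and the day of a placeholder holiday
toy = pd.DataFrame({"Date": pd.to_datetime(["2010-12-24", "2010-12-25"])})
flagged = add_holiday_flag(toy)
print(flagged)
```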

### ⚡ **How to Improve Performance & Speed**

1. **Infrastructure**: Use larger instances or distributed training
2. **Optimization**: Enable early stopping and cache preprocessed data
3. **Real-time**: Deploy to SageMaker endpoints with auto-scaling

### 🎯 **Next Steps for Production**

1. **MLOps**: Set up automated retraining with SageMaker Pipelines
2. **Monitoring**: Implement data drift detection
3. **Integration**: Connect with business systems (ERP, inventory)

### 💡 **Business Impact**

Accurate forecasting enables inventory optimization, better resource planning, improved financial planning, and higher customer satisfaction through product availability.

## 10. Resource Cleanup

Clean up local resources created by this notebook:

In [None]:
# Clean up local resources
print('Cleaning up local resources created by this notebook...')

# Clean up local files created by this notebook
files_to_delete = ['predictions.csv']
deleted_count = 0

for file in files_to_delete:
    file_path = os.path.join('notebook_outputs', file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f'Deleted {file_path}')
        deleted_count += 1

print(f'\nDeleted {deleted_count} local files')
print('Cleanup completed - only local resources created by this notebook were removed!')