# Retail Sales Forecasting with SageMaker XGBoost

This notebook demonstrates end-to-end retail sales forecasting using Amazon SageMaker's managed XGBoost container. We'll cover:

1. **Data preprocessing** - Load and clean the retail sales data
2. **Feature engineering** - Create time-series features for forecasting
3. **Model training** - Train an XGBoost model using SageMaker
4. **Hyperparameter tuning** - Optimize model performance automatically
5. **Batch inference** - Generate predictions on test data
6. **Resource cleanup** - Delete the S3 objects and local files created by this notebook

## Dataset Attribution
© Chen, D. (2012). Online Retail II [Dataset]. UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/dataset/502/online+retail+ii. Licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode#s6a.

## 1. Setup and Dependencies

In [None]:
# Install required packages
!pip install -q sagemaker pandas numpy matplotlib seaborn scikit-learn xgboost boto3 joblib

import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter
from sagemaker.transformer import Transformer
from sagemaker import image_uris
import matplotlib.pyplot as plt
import os
import time

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'retail-sales-forecasting'

# Create unique session ID for resource tracking
session_id = f"{int(time.time())}"
print(f"Session ID: {session_id} (for resource cleanup)")

print(f"Role: {role}")
print(f"Bucket: {bucket}")

## 2. Data Loading and Preprocessing

In [None]:
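# Load the sample dataset from the SageMaker example-files bucket for the current region
# (pandas reads s3:// URLs through the s3fs package, which must be installed)
current_region = boto3.Session().region_name or "us-west-2"
data_url = f"s3://sagemaker-example-files-prod-{current_region}/datasets/tabular/online_retail/online_retail_II_20k.csv"
df = pd.read_csv(data_url)

# Keep only rows with a known customer and a positive quantity and price
df = df.dropna(subset=["Customer ID"])
df = df[(df["Quantity"] > 0) & (df["Price"] > 0)]
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["Revenue"] = df["Quantity"] * df["Price"]  # revenue per invoice line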

# Aggregate daily sales
daily_sales = df.groupby(df['InvoiceDate'].dt.date).agg({
    'Revenue': 'sum', 'Quantity': 'sum', 'Invoice': 'nunique', 'Customer ID': 'nunique'
}).reset_index()
daily_sales.columns = ['Date', 'Revenue', 'Quantity', 'Orders', 'Customers']
daily_sales['Date'] = pd.to_datetime(daily_sales['Date'])
daily_sales = daily_sales.sort_values('Date').reset_index(drop=True)

print('Daily sales summary')
print(daily_sales)
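
Grouping by date yields rows only for days that had at least one invoice, so any later `shift` or `rolling` feature steps over row positions rather than calendar days. As a minimal sketch, assuming that days without invoices should count as zero sales (an assumption, not something the dataset documentation states), the daily frame can be reindexed onto a continuous calendar. The helper name `fill_missing_days` and the toy values are hypothetical.

```python
import pandas as pd

def fill_missing_days(daily, date_col="Date"):
    """Reindex a daily frame onto a continuous calendar, filling gaps with 0."""
    full_range = pd.date_range(daily[date_col].min(), daily[date_col].max(), freq="D")
    filled = (
        daily.set_index(date_col)
        .reindex(full_range, fill_value=0)  # missing days become zero-sales rows
        .rename_axis(date_col)              # restore the index name lost by reindex
        .reset_index()
    )
    return filled

# Toy data with a gap on 2010-01-06 (made-up values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-07"]),
    "Revenue": [100.0, 150.0, 90.0],
})
filled = fill_missing_days(toy)
print(filled)
```

Whether zero is the right fill value depends on the business: closed days and days with missing records may need different treatment.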

## 3. Feature Engineering

In [None]:
# Create features
def create_features(df):
    df = df.copy()
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['Month'] = df['Date'].dt.month
    df['Quarter'] = df['Date'].dt.quarter
    df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)
    
    # Reduced lag features to preserve data
    for lag in [1, 2]:
        df[f'Revenue_lag_{lag}'] = df['Revenue'].shift(lag)
    
    # Smaller rolling window
    df['Revenue_ma_3'] = df['Revenue'].rolling(window=3).mean()
    
    return df

daily_sales_features = create_features(daily_sales)

# Drop rows with NaN values
daily_sales_features = daily_sales_features.dropna().reset_index(drop=True)

if len(daily_sales_features) < 5:
    raise ValueError(f"Insufficient data after feature engineering: {len(daily_sales_features)} rows")

print(f"Final dataset has {len(daily_sales_features)} rows for modeling")
print(daily_sales_features)
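
Note that `rolling(window=3).mean()` as written includes the current day's revenue, which is also the prediction target, so `Revenue_ma_3` carries information that would not be available at forecast time. The sketch below shows one leakage-free variant on toy data: shift by one row first, then apply the window. The function name `create_lagged_features` and the toy values are hypothetical.

```python
import pandas as pd

def create_lagged_features(df, lags=(1, 2), window=3):
    """Build lag and rolling-mean features from past revenue only."""
    out = df.copy()
    for lag in lags:
        out[f"Revenue_lag_{lag}"] = out["Revenue"].shift(lag)
    # shift(1) first so the window covers the previous `window` rows,
    # never the row whose revenue is being predicted
    out[f"Revenue_ma_{window}"] = out["Revenue"].shift(1).rolling(window=window).mean()
    return out

toy = pd.DataFrame({"Revenue": [10.0, 20.0, 30.0, 40.0, 50.0]})
features = create_lagged_features(toy)
print(features)
```

Shifting before the window costs one additional row when NaN rows are dropped, which matters on a sample this small. The same concern applies to same-day columns such as `Quantity`, `Orders`, and `Customers` when they are used as features.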

## 4. Data Preparation
The final dataset has 6 rows. The first 4 are used for training and the remaining 2 for testing (an 80/20 chronological split).

In [None]:
# Prepare training data
feature_cols = [col for col in daily_sales_features.columns if col not in ['Date', 'Revenue']]
print(f"Feature columns: {feature_cols}")

split_idx = int(len(daily_sales_features) * 0.8)
train_data = daily_sales_features[:split_idx]
test_data = daily_sales_features[split_idx:]

print(f"Train data shape: {train_data.shape}, Test data shape: {test_data.shape}")

# XGBoost format (target first)
train_xgb = pd.concat([train_data['Revenue'], train_data[feature_cols]], axis=1)
test_xgb = pd.concat([test_data['Revenue'], test_data[feature_cols]], axis=1)

print(f"Train data shape before saving: {train_xgb.shape}")
print(f"Test data shape before saving: {test_xgb.shape}")
print(f"Train data sample:\n{train_xgb.head()}")

os.makedirs("notebook_outputs", exist_ok=True)

# Upload to S3
try:
    train_xgb.to_csv('notebook_outputs/train.csv', index=False, header=False)
    test_xgb.to_csv('notebook_outputs/test.csv', index=False, header=False)
    
    # Verify files were created and have content
    train_size = os.path.getsize('notebook_outputs/train.csv')
    test_size = os.path.getsize('notebook_outputs/test.csv')
    print(f"Local train.csv size: {train_size} bytes")
    print(f"Local test.csv size: {test_size} bytes")
    
    if train_size == 0:
        raise ValueError("train.csv is empty!")
    
    train_path = sagemaker_session.upload_data('notebook_outputs/train.csv', bucket, f'{prefix}/data')
    test_path = sagemaker_session.upload_data('notebook_outputs/test.csv', bucket, f'{prefix}/data')
    
    print(f"Training data: {len(train_data)} days, Test data: {len(test_data)} days")
    print(f"Train path: {train_path}")
    print(f"Test path: {test_path}")
    
except Exception as e:
    print(f"Error in data preparation: {e}")
    raise

## 5. Model Training with Managed XGBoost Container

You can switch the instance type from ml.m5.large to a GPU instance to reduce the training time (about 5 minutes), but GPU instances cost more.

In [None]:
# Use SageMaker's built-in XGBoost algorithm
from sagemaker import image_uris
import sagemaker

# Look up the managed XGBoost container image for the current region
region = boto3.Session().region_name
container = image_uris.retrieve(framework='xgboost', region=region, version='1.7-1')

xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    sagemaker_session=sagemaker_session
)

xgb_estimator.set_hyperparameters(
    objective='reg:squarederror',
    num_round=100,
    max_depth=6,
    eta=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='rmse'
)

print("Built-in XGBoost estimator ready")

## 6. Basic Training

In [None]:
# Train initial model
if 'train_path' not in locals():
    raise NameError("train_path not defined. Please run the data preparation cell first.")

print(f"Starting training with data: {train_path}")
xgb_estimator.fit({
    'train': sagemaker.inputs.TrainingInput(train_path, content_type='text/csv')
})
print(f"Model training completed: {xgb_estimator.model_data}")

## 7. Hyperparameter Tuning

In [None]:
# Hyperparameter tuning
hyperparameter_ranges = {
    'max_depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.3),
    'subsample': ContinuousParameter(0.5, 1.0),
    'colsample_bytree': ContinuousParameter(0.5, 1.0),
    'num_round': IntegerParameter(50, 200)
}

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=5,
    max_parallel_jobs=2
)

tuner.fit({
    'train': sagemaker.inputs.TrainingInput(train_path, content_type='text/csv'),
    'validation': sagemaker.inputs.TrainingInput(test_path, content_type='text/csv')
})
best_estimator = tuner.best_estimator()
print(f"Best model: {best_estimator.model_data}")
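
The tuning job above scores candidates on `test_path`, and the results section later reports metrics on the same rows, so those metrics are likely to be optimistic. A common alternative is a three-way chronological split, where tuning uses a validation slice and the test slice stays untouched until the end. The sketch below assumes the rows are already sorted by date; the function name and the 60/20/20 fractions are illustrative choices, not part of the original notebook.

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split an ordered sequence into train/validation/test without shuffling."""
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    # Slices keep the original order: oldest rows train, newest rows test
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]

days = list(range(10))  # stand-in for 10 date-ordered rows
train, val, test = chronological_split(days)
print(train, val, test)
```

The same slicing works on a pandas DataFrame. On a sample as small as this notebook's, each part holds only a handful of rows, so the split is most useful with the full dataset.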

## 8. Batch Inference

Using the best model from tuning, we'll generate predictions on the held-out test data.

This step can take up to 8 minutes, depending on the instance type you choose for the transform job.

In [None]:
# Batch inference
batch_input = test_data[feature_cols].fillna(0).astype(float)
batch_input.to_csv('notebook_outputs/batch_input.csv', index=False, header=False)
batch_input_path = sagemaker_session.upload_data('notebook_outputs/batch_input.csv', bucket, f'{prefix}/batch/input')

# Use built-in XGBoost container for batch transform
from sagemaker.model import Model

# Create model from best estimator
model = Model(
    image_uri=best_estimator.image_uri,
    model_data=best_estimator.model_data,
    role=role,
    sagemaker_session=sagemaker_session
)

# Create transformer
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/{prefix}/batch/output/'
)

print(f"Starting batch transform with model: {best_estimator.model_data}")
transformer.transform(batch_input_path, content_type='text/csv')
transformer.wait()
print("Batch inference completed")
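
Batch Transform writes one output object per input file, named after the input file with `.out` appended, under the transformer's `output_path`. The sketch below derives that key from the input URI rather than hard-coding it. The helper name `transform_output_key` and the bucket name `my-bucket` are hypothetical.

```python
import posixpath

def transform_output_key(input_s3_uri, output_prefix):
    """Return the S3 key where Batch Transform writes results for one input file."""
    filename = posixpath.basename(input_s3_uri)
    # S3 keys always use forward slashes, so posixpath is used instead of os.path
    return posixpath.join(output_prefix.strip("/"), filename + ".out")

key = transform_output_key(
    "s3://my-bucket/retail-sales-forecasting/batch/input/batch_input.csv",
    "retail-sales-forecasting/batch/output/",
)
print(key)
```

In this notebook the call could take `batch_input_path` and `f'{prefix}/batch/output/'` as arguments.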

## 9. Results Analysis

In [None]:
# Download and analyze predictions
import boto3
from sklearn.metrics import mean_squared_error, mean_absolute_error

s3_client = boto3.client('s3')

# Get prediction file
output_key = f'{prefix}/batch/output/batch_input.csv.out'
s3_client.download_file(bucket, output_key, 'notebook_outputs/predictions.csv')

# Load predictions
predictions = pd.read_csv('notebook_outputs/predictions.csv', header=None)
predictions.columns = ['Predicted_Revenue']

# Compare with actual values
actual_values = test_data['Revenue'].values
predicted_values = predictions['Predicted_Revenue'].values

# Calculate metrics
mse = mean_squared_error(actual_values, predicted_values)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual_values, predicted_values)

print("Model Performance Metrics:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")

# Create comparison dataframe
results_df = pd.DataFrame({
    'Actual': actual_values,
    'Predicted': predicted_values,
    'Error': actual_values - predicted_values
})

print("\nPrediction Results:")
print(results_df)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(results_df.index, results_df['Actual'], 'o-', label='Actual', linewidth=2)
plt.plot(results_df.index, results_df['Predicted'], 's-', label='Predicted', linewidth=2)
plt.xlabel('Test Sample')
plt.ylabel('Revenue')
plt.title('Actual vs Predicted Revenue')
plt.legend()
plt.grid(True)
plt.show()
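
RMSE and MAE are hard to interpret on their own. One common reference point is a naive forecast that predicts each day's revenue as the previous day's revenue. The sketch below computes both metrics from scratch for a model and for that baseline. All numbers are made up for illustration; in this notebook, the `Revenue_lag_1` column of `test_data` could serve as the naive forecast.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 120.0, 90.0]      # made-up daily revenue
model_pred = [110.0, 115.0, 95.0]  # made-up model predictions
naive_pred = [80.0, 100.0, 120.0]  # previous day's revenue (lag-1 baseline)

print(f"Model RMSE: {rmse(actual, model_pred):.2f}, MAE: {mae(actual, model_pred):.2f}")
print(f"Naive RMSE: {rmse(actual, naive_pred):.2f}, MAE: {mae(actual, naive_pred):.2f}")
```

A model that does not beat the naive baseline on held-out data is probably not yet adding value.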

## 10. Conclusion

### 🚀 **How to Improve Prediction Accuracy**

1. **Larger Training Dataset**: Use full historical data (months/years vs. sample data)
2. **Advanced Features**: Add external factors (holidays, promotions, weather)
3. **Algorithm Selection**: Try DeepAR, Prophet, or ensemble methods
4. **Data Quality**: Handle missing values and remove outliers

### ⚡ **How to Improve Performance & Speed**

1. **Infrastructure**: Use larger instances (ml.m5.xlarge) and multi-instance training
2. **Optimization**: Enable early stopping and cache preprocessed data
3. **Real-time**: Deploy to SageMaker endpoints with auto-scaling

### 🎯 **Next Steps for Production**

1. **MLOps**: Set up automated retraining with SageMaker Pipelines
2. **Monitoring**: Implement data drift detection
3. **Integration**: Connect with business systems (ERP, inventory)

### 💡 **Business Impact**

Accurate forecasting enables inventory optimization, better resource planning, improved financial planning, and higher customer satisfaction through product availability.

## 11. Resource Cleanup

Clean up resources to avoid ongoing charges:

In [None]:
# Clean up resources
sagemaker_client = boto3.client('sagemaker')
s3_client = boto3.client('s3')

print('Cleaning up resources created by this notebook...')

# Clean up S3 objects created by this notebook
print('Deleting S3 objects created by this notebook...')
objects_to_delete = [
    f'{prefix}/data/train.csv',
    f'{prefix}/data/test.csv',
    f'{prefix}/batch/input/batch_input.csv',
    f'{prefix}/batch/output/batch_input.csv.out'
]

deleted_count = 0
for obj_key in objects_to_delete:
    try:
        s3_client.delete_object(Bucket=bucket, Key=obj_key)
        print(f'Deleted s3://{bucket}/{obj_key}')
        deleted_count += 1
    except Exception as e:
        print(f'Could not delete {obj_key}: {e}')

print(f'Deleted {deleted_count} S3 objects')

# Clean up local files created by this notebook
files_to_delete = ['train.csv', 'test.csv', 'batch_input.csv', 'predictions.csv']
for file in files_to_delete:
    file_path = os.path.join('notebook_outputs', file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f'Deleted {file_path}')

print('Cleanup completed - only resources created by this notebook were removed!')
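
The batch transform step also registers a SageMaker model, which the cleanup above does not delete, so it remains listed in the account. A minimal sketch for removing it follows, assuming the `transformer` object from section 8 is still in scope. `Transformer.delete_model()` is part of the SageMaker Python SDK; the wrapper function and the stand-in class used for the dry run are hypothetical.

```python
def delete_transform_model(transformer):
    """Delete the SageMaker model behind a transformer; return True on success."""
    try:
        transformer.delete_model()
        return True
    except Exception as exc:  # e.g. model already deleted or missing permissions
        print(f"Could not delete model: {exc}")
        return False

class _FakeTransformer:
    """Stand-in used only to exercise the helper without calling AWS."""
    def __init__(self):
        self.deleted = False

    def delete_model(self):
        self.deleted = True

fake = _FakeTransformer()
result = delete_transform_model(fake)
print(result)
```

In the notebook, the call would be `delete_transform_model(transformer)`.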