# Retail Sales Forecasting with OpenShift AI

This notebook demonstrates how to build a retail sales forecasting model that can be deployed on OpenShift.

## Project Overview
- **Objective**: Predict daily sales for retail stores
- **Data**: Synthetic retail sales data (2023)
- **Approach**: Random forest regression on calendar, lag, and rolling-window features
- **Deployment Target**: OpenShift with KServe

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

In [2]:
# Load the data
df = pd.read_csv('../data/retail_sales.csv')
df['date'] = pd.to_datetime(df['date'])
df.head()
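The later cells assume the CSV contains a specific set of columns. A minimal sketch of a schema check follows; the column list is inferred from how this notebook uses the data, not from a published data dictionary, and `missing_columns` and the toy frame are hypothetical.

```python
from typing import List

import pandas as pd

# Columns referenced by later cells in this notebook (inferred, not authoritative).
REQUIRED_COLUMNS = ["date", "store_id", "product_category", "sales", "promotion", "holiday"]


def missing_columns(frame: pd.DataFrame) -> List[str]:
    """Return the required columns that are absent from `frame`."""
    return [col for col in REQUIRED_COLUMNS if col not in frame.columns]


# Hypothetical two-row frame standing in for retail_sales.csv.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "store_id": ["S1", "S1"],
    "product_category": ["Grocery", "Grocery"],
    "sales": [120.0, 135.5],
    "promotion": [0, 1],
    "holiday": [1, 0],
})

print(missing_columns(toy))                             # []
print(missing_columns(toy.drop(columns=["holiday"])))   # ['holiday']
```

Running `missing_columns(df)` right after loading would surface a schema mismatch before feature engineering fails with a less obvious `KeyError`.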

In [3]:
# Exploratory Data Analysis
plt.figure(figsize=(15, 10))

# Sales by category
plt.subplot(2, 2, 1)
sns.boxplot(data=df, x='product_category', y='sales')
plt.title('Sales Distribution by Category')
plt.xticks(rotation=45)

# Sales by store
plt.subplot(2, 2, 2)
top_stores = df.groupby('store_id')['sales'].sum().nlargest(5).index
sns.boxplot(data=df[df['store_id'].isin(top_stores)], x='store_id', y='sales')
plt.title('Sales Distribution by Top Stores')
plt.xticks(rotation=45)

# Sales over time
plt.subplot(2, 2, 3)
df.groupby('date')['sales'].sum().plot()
plt.title('Total Daily Sales')

# Promotion impact
plt.subplot(2, 2, 4)
sns.boxplot(data=df, x='promotion', y='sales')
plt.title('Sales with vs without Promotion')

plt.tight_layout()
plt.show()

In [4]:
# Feature Engineering
def create_features(df):
    # Create time-based features
    df['day_of_week'] = df['date'].dt.dayofweek
    df['day_of_month'] = df['date'].dt.day
    df['day_of_year'] = df['date'].dt.dayofyear
    # isocalendar() returns a nullable UInt32 column; cast to a plain integer dtype
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['year'] = df['date'].dt.year
    
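    # Create lag features: sales 1, 7, and 30 rows earlier within each
    # store/category series (equal to days only if each series has one row per day)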
    df = df.sort_values(['store_id', 'product_category', 'date'])
    df['lag_1'] = df.groupby(['store_id', 'product_category'])['sales'].shift(1)
    df['lag_7'] = df.groupby(['store_id', 'product_category'])['sales'].shift(7)
    df['lag_30'] = df.groupby(['store_id', 'product_category'])['sales'].shift(30)
    
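    # Rolling averages over prior rows only; shifting by one row first keeps the
    # current day's sales (the prediction target) out of its own features
    df['rolling_7'] = df.groupby(['store_id', 'product_category'])['sales'].transform(
        lambda x: x.shift(1).rolling(7, min_periods=1).mean()
    )
    df['rolling_30'] = df.groupby(['store_id', 'product_category'])['sales'].transform(
        lambda x: x.shift(1).rolling(30, min_periods=1).mean()
    )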
    
    return df

# Apply feature engineering
df = create_features(df)
df = df.dropna()  # Drop rows with NaN from lag features
df.head()
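Because `sales` is the prediction target, a rolling mean whose window includes the current row hands the model part of the answer. The toy sketch below contrasts a window that includes the current day with one restricted to prior days. The values are hypothetical and the window is 3 rows for brevity.

```python
import pandas as pd

# Hypothetical single store/category series with one row per day.
toy = pd.DataFrame({
    "store_id": ["S1"] * 5,
    "product_category": ["Grocery"] * 5,
    "date": pd.date_range("2023-01-01", periods=5, freq="D"),
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})
toy = toy.sort_values(["store_id", "product_category", "date"])
grouped = toy.groupby(["store_id", "product_category"])["sales"]

# Previous day's sales within each series.
toy["lag_1"] = grouped.shift(1)

# Window ends on the current row, so today's sales leak into the feature.
toy["rolling_3_with_today"] = grouped.transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Shift first, then roll: the window ends on yesterday's row.
toy["rolling_3_prior_only"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

print(toy[["date", "sales", "lag_1", "rolling_3_with_today", "rolling_3_prior_only"]])
```

On the last row the first variant averages 30, 40, and 50 (including the target value 50), while the second averages 20, 30, and 40.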

In [5]:
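# Prepare data for modeling
# create_features leaves the frame sorted by store and category, so re-sort by
# date; otherwise the unshuffled split below would hold out the last stores
# rather than the most recent period
df = df.sort_values('date', kind='stable')

# Define features and target
X = df.drop(columns=['date', 'sales'])
y = df['sales']

# Split data into train and test sets (last 20% of rows, no shuffling)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)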

# Define categorical and numerical features
categorical_features = ['store_id', 'product_category', 'day_of_week', 'month', 'quarter']
numerical_features = [
    'promotion', 'holiday', 'day_of_month', 'day_of_year', 
    'week_of_year', 'year', 'lag_1', 'lag_7', 'lag_30', 
    'rolling_7', 'rolling_30'
]

# Create preprocessing pipelines
numerical_transformer = SimpleImputer(strategy='constant', fill_value=0)

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Print shapes
print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
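`train_test_split(..., shuffle=False)` holds out the last rows of the frame, so the hold-out covers the most recent period only when the rows are in date order. An alternative is to split on an explicit cutoff date, which also keeps all rows of a given day on the same side. The sketch below assumes a `date` column; `chronological_split` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd


def chronological_split(frame, date_col="date", test_fraction=0.2):
    """Split so that every test row is dated on or after the cutoff date."""
    unique_dates = frame[date_col].drop_duplicates().sort_values()
    n_test_dates = max(1, round(len(unique_dates) * test_fraction))
    cutoff = unique_dates.iloc[-n_test_dates]
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test


# Hypothetical frame: two stores, ten days each, stored store by store.
dates = pd.date_range("2023-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "store_id": ["S1"] * 10 + ["S2"] * 10,
    "sales": list(range(20)),
})

train, test = chronological_split(toy)
print(len(train), len(test))                    # 16 4
print(train["date"].max() < test["date"].min())  # True
```

Both stores appear in the training and test sets here, which a row-position split on a store-sorted frame would not guarantee.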

In [6]:
# Train model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

# Fit the model
model.fit(X_train, y_train)

# Evaluate on test set
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

# Feature importance (numerical features followed by one-hot encoded categorical features)
feature_names = (
    numerical_features +
    list(model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['onehot']
         .get_feature_names_out(categorical_features))
)

importances = model.named_steps['regressor'].feature_importances_
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(15), x='importance', y='feature')
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()
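An MAE figure is easier to interpret next to a naive baseline. One common choice for daily series is to predict yesterday's value, which this notebook already has as `lag_1`. The sketch below uses hypothetical numbers to show the comparison; none of them are results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values for four test rows.
y_true = np.array([100.0, 110.0, 105.0, 120.0])
lag_1 = np.array([95.0, 100.0, 110.0, 105.0])          # naive forecast: previous day's sales
model_pred = np.array([102.0, 108.0, 107.0, 116.0])    # stand-in for model.predict(X_test)

naive_mae = mean_absolute_error(y_true, lag_1)
model_mae = mean_absolute_error(y_true, model_pred)

# Fraction of the naive error removed by the model (1.0 is perfect, 0.0 is no better).
skill = 1 - model_mae / naive_mae

print(f"Naive MAE: {naive_mae:.2f}")   # Naive MAE: 8.75
print(f"Model MAE: {model_mae:.2f}")   # Model MAE: 2.50
print(f"Skill vs naive: {skill:.2f}")  # Skill vs naive: 0.71
```

In this notebook the naive baseline would be `mean_absolute_error(y_test, X_test['lag_1'])`.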

In [7]:
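# Save model for deployment
import os

model_path = '../models/retail_sales_model.joblib'
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # joblib.dump does not create missing directories
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")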

# Create sample input for testing
sample_input = X_test.iloc[0:1].copy()
sample_input.to_csv('../data/sample_input.csv', index=False)
print("Sample input saved to ../data/sample_input.csv")
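Before handing the artifact to a serving runtime, it can help to confirm that the saved file reloads and reproduces the in-memory predictions. The sketch below shows that round trip on a small stand-in model written to a temporary directory; the data and model are hypothetical, not the pipeline trained above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem as a stand-in for the retail data.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Save, reload, and compare predictions on the same rows.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(model, path)
    restored = joblib.load(path)

matches = bool(np.allclose(model.predict(X[:5]), restored.predict(X[:5])))
print(f"Reloaded model reproduces predictions: {matches}")
```

The same check applies here by loading `model_path` with `joblib.load` and predicting on the rows saved in `sample_input.csv`. Joblib files are generally tied to the scikit-learn version that wrote them, so the serving image should pin the same version.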

## OpenShift Deployment Preparation

To deploy this model on OpenShift:

1. **Create a Model Serving Image**:
   - Use the saved model file (`retail_sales_model.joblib`)
   - Create a Dockerfile with Python dependencies
   - Build and push to an image registry

2. **Deploy with KServe**:
   - Create an InferenceService YAML
   - Configure resources and auto-scaling
   - Expose the service

3. **Set up Monitoring**:
   - Configure Prometheus metrics
   - Create Grafana dashboards

4. **Create CI/CD Pipeline**:
   - Use OpenShift Pipelines (Tekton)
   - Automate retraining and deployment

The next steps are to create the Kubernetes manifests and OpenShift resources needed to deploy this model as a scalable service.
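Once an InferenceService is running, clients typically call it over HTTP. The sketch below builds a request for KServe's V1 inference protocol, which posts a JSON body with an `instances` list to `/v1/models/<name>:predict`. The model name `retail-sales` and the feature values are hypothetical, and `build_v1_predict_request` is an illustrative helper, not part of any KServe client library. Whether each instance should be a list of values or a mapping of column names depends on the serving runtime; a pipeline that selects columns by name, like the one above, may need a custom predictor that rebuilds a DataFrame.

```python
import json


def build_v1_predict_request(rows, model_name):
    """Return the URL path and JSON body for a KServe V1 predict call."""
    path = f"/v1/models/{model_name}:predict"
    body = json.dumps({"instances": rows})
    return path, body


# Hypothetical single instance; real rows would come from sample_input.csv.
path, body = build_v1_predict_request([[1, 0, 3.5]], model_name="retail-sales")

print(path)  # /v1/models/retail-sales:predict
print(body)  # {"instances": [[1, 0, 3.5]]}
```

The body can be sent with any HTTP client to the route exposed for the service; a V1 response carries the results under a `predictions` key.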