# AutoML Training - Ship Date Prediction Model

**Goal:** Train ML models to predict when orders will ship from the distribution center using AutoML.

This notebook will:
1. Load closed delivery data from the semantic model
2. Calculate **DAYS_TO_SHIP** (days from creation to DC ship date)
3. Train a **regression model** to predict DAYS_TO_SHIP
4. Register the best model in MLflow
5. Enable ship date forecasting for open deliveries

**Use Case:** Enable the operations team to:
- Forecast when orders will ship from the distribution center
- Plan logistics and inventory allocation
- Set realistic customer expectations for ship dates
- Identify orders with unusually long processing times

### ⭐ 1. Imports & Configuration

- pandas: dataframe manipulation
- numpy: numeric helpers (used for RMSE)
- mlflow: experiment tracking and model registry
- AutoML (FLAML): automatic model selection / tuning
- sklearn: train/test split + evaluation metrics
- sempy.fabric: read tables from the Power BI semantic model

In [None]:
# ==============================================================================
# IMPORTS & CONFIGURATION
# ==============================================================================
# WHY: Load required libraries for ML training and semantic model access
#
# Key libraries:
# - sempy.fabric: Connect to Power BI semantic model
# - flaml.AutoML: Microsoft's AutoML library (automatically tests multiple algorithms)
# - mlflow: Track experiments and save models
# - sklearn: Machine learning utilities (train/test split, metrics)
# ==============================================================================

import sempy.fabric as fabric
import pandas as pd
import numpy as np
from flaml import AutoML
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Configuration
DATASET = "DLV Aging"  # UPDATE: Match your semantic model name
TARGET_COLUMN = "DAYS_TO_SHIP"  # What we're predicting (days from creation to ship)

# AutoML settings
# WHY: These settings control how AutoML explores different models
automl_settings = {
    "time_budget": 600,      # Maximum 10 minutes for training (reduce if needed)
    "metric": "mae",         # Optimize for Mean Absolute Error (avg days off)
    "task": "regression",    # Predicting a number (days to ship), not a category
    "log_file_name": "automl.log",
    "seed": 42,              # For reproducible results
    "early_stop": True,      # Stop if no improvement (saves time/capacity)
}

print("✅ Configuration loaded")
print(f"   Target: {TARGET_COLUMN}")
print(f"   AutoML budget: {automl_settings['time_budget']} seconds")

### ⭐ 2. Load Closed Deliveries (DAX)

**IMPORTANT:** Update `DATASET` in the configuration cell above to match your environment before running the query.

In [None]:
# ==============================================================================
# LOAD TRAINING DATA: Closed Deliveries
# ==============================================================================
# WHY: We need historical deliveries with known outcomes (shipped with GI Date)
#      to train the model to recognize patterns in shipping lead times.
#
# FILTER: Only deliveries with both GI Date (ship date) AND Delivery Created On
#         These are "labeled data" where we can calculate actual days to ship.
# ==============================================================================

print("Loading closed delivery data from semantic model...")

# Get workspace
ws = fabric.get_workspace_id()

# DAX query to get closed deliveries with both required dates
dax_query = """
EVALUATE
FILTER(
    Aging,
    NOT(ISBLANK(Aging[GI Date])) &&
    NOT(ISBLANK(Aging[Delivery Created On]))
)
"""

df = fabric.evaluate_dax(dataset=DATASET, dax_string=dax_query, workspace=ws)

# Clean column names (DAX adds table prefixes)
df.columns = [col.split('[')[-1].replace(']', '') if '[' in col else col for col in df.columns]

print(f"✅ Loaded {len(df):,} closed deliveries")
print(f"✅ Columns: {df.shape[1]}")
df.head()
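
The column-name cleanup in the cell above can be checked in isolation. The sketch below applies the same expression to a few sample names; the helper `clean_column_name` and the sample names are illustrative and not part of the notebook.

```python
def clean_column_name(col: str) -> str:
    """Strip a DAX table prefix such as 'Aging[GI Date]' down to 'GI Date'."""
    # Same expression as the notebook's list comprehension, applied to one name.
    return col.split('[')[-1].replace(']', '') if '[' in col else col


# Hypothetical column names as they might come back from a DAX query.
sample_columns = ["Aging[GI Date]", "Aging[Delivery Created On]", "Plant"]
cleaned = [clean_column_name(c) for c in sample_columns]
print(cleaned)
```

Names without a bracket pass through unchanged, so the cleanup is safe to run on columns that are already clean.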

### ⭐ 3. Calculate Target Variable

Compute **DAYS_TO_SHIP** for the closed deliveries (those with a GI Date) loaded for model training.

In [None]:
# ==============================================================================
# CALCULATE TARGET VARIABLE: DAYS_TO_SHIP
# ==============================================================================
# WHY: This is the metric we're predicting - how many days from order creation
#      to shipping from the DC.
#
# Formula: DAYS_TO_SHIP = GI Date - Delivery Created On
#
# This represents the processing lead time within the distribution center.
# ==============================================================================

print("=== Calculating Target Variable ===")

# Check if required columns exist
required_cols = ['GI Date', 'Delivery Created On']
missing = [col for col in required_cols if col not in df.columns]

if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Calculate DAYS_TO_SHIP
df[TARGET_COLUMN] = (
    pd.to_datetime(df['GI Date']) - 
    pd.to_datetime(df['Delivery Created On'])
).dt.days

# Remove rows with missing target values
df_clean = df.dropna(subset=[TARGET_COLUMN]).copy()
print(f"✅ Target column calculated: {TARGET_COLUMN}")
print(f"✅ Valid rows: {len(df_clean):,} (removed {len(df) - len(df_clean):,} missing target)")

# Show target distribution
print("\nTarget Variable Stats:")
print(f"  Mean: {df_clean[TARGET_COLUMN].mean():.2f} days")
print(f"  Median: {df_clean[TARGET_COLUMN].median():.2f} days")
print(f"  Min: {df_clean[TARGET_COLUMN].min():.0f} days")
print(f"  Max: {df_clean[TARGET_COLUMN].max():.0f} days")
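
A small worked example may help when sanity-checking the target. The sketch below uses made-up timestamps (not real delivery data) to show two properties of the formula: `.dt.days` keeps whole days only, and a missing date yields a missing target.

```python
import pandas as pd

# Made-up sample rows; the column names match the notebook, the values do not
# come from the semantic model.
sample = pd.DataFrame({
    "Delivery Created On": ["2024-03-01 08:00", "2024-03-01 18:00", "2024-03-04 09:30"],
    "GI Date":             ["2024-03-04 07:00", "2024-03-02 06:00", None],
})

elapsed = pd.to_datetime(sample["GI Date"]) - pd.to_datetime(sample["Delivery Created On"])

# .dt.days drops the partial day: 2 days 23 hours becomes 2, and 12 hours becomes 0.
sample["DAYS_TO_SHIP"] = elapsed.dt.days
print(sample)
```

Because partial days are truncated, a delivery that ships within 24 hours of creation gets a target of 0.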

### ⭐ 4. Data Quality Check & Filtering

Remove rows with missing target values.

In [None]:
# Check target column and remove nulls
if TARGET_COLUMN not in df.columns:
    raise ValueError(f"Target column '{TARGET_COLUMN}' not found!")

print(f"Target: {TARGET_COLUMN}")
print(f"Total rows: {len(df):,}")
print(f"Null values: {df[TARGET_COLUMN].isnull().sum():,}")

# Remove rows with null target
df_clean = df[df[TARGET_COLUMN].notna()].copy()
print(f"✅ Clean dataset: {len(df_clean):,} rows")
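
Beyond null targets, it can be worth checking for implausible values. The sketch below is an optional extra and an assumption on our part, not a step the notebook performs: it flags negative lead times (a GI Date earlier than the creation date) in a made-up sample.

```python
import pandas as pd

# Made-up target values, including one missing and one negative entry.
sample = pd.DataFrame({"DAYS_TO_SHIP": [3, None, -1, 0, 12]})

# Step 1 (as in the notebook): drop rows where the target is missing.
labeled = sample[sample["DAYS_TO_SHIP"].notna()].copy()

# Step 2 (assumption, not in the notebook): flag negative lead times.
negative_mask = labeled["DAYS_TO_SHIP"] < 0
print(f"Negative lead times: {negative_mask.sum()}")

# Whether to drop or investigate these rows is a business decision.
plausible = labeled[~negative_mask]
print(f"Rows kept: {len(plausible)}")
```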

### ⭐ 5. Feature Selection

Define features based on available columns in the Aging table.

In [None]:
# ==============================================================================
# FEATURE SELECTION
# ==============================================================================
# WHY: Choose which columns (features) the model will use to predict ship time.
#      More relevant features = better predictions.
#
# Features are the "clues" the model uses:
# - Location: Plant, Shipping Point
# - Logistics: Carrier, Delivery Type
# - Product: Brand, Channel, Category
# - Customer: Strategic Account flag
# - Timing: Day of week, month created
# ==============================================================================

# Define potential features based on Aging table schema
potential_features = [
    # Location & Routing
    'Plant',                    # Distribution center
    'Shipping Point',           # Shipping location
    'EWM Carrier Code',         # Carrier code (important for performance)
    
    # Product Information
    'Brand',                    # Calculated column (Callaway/Odyssey, Jack Wolfskin, etc.)
    'Channel',                  # Calculated column (E-commerce, Inter-company, etc.)
    'Product Category',         # Product category
    'Product Type',             # Product type
    'Standard Or Custom',       # Standard or custom product
    
    # Customer & Account
    'STRATEGIC_ACCOUNT',        # Strategic vs non-strategic (KEY for prioritization!)
    'Sold To - Key',            # Customer identifier
    
    # Delivery Attributes
    'Delivery Type',            # Type of delivery
    'DELIVERY_QTY',             # Quantity (numeric)
    'DELIVERY_VALUE_USD',       # Value in USD (numeric)
    'Delivery Priority',        # Priority level
    'Shipping Condition',       # Shipping condition code
    
    # Processing Status
    'Credit Status',            # Calculated column (Credit checked, Released, etc.)
    'Distribution Status',      # Calculated column (Confirmed, Distributed, etc.)
    'STATUS',                   # Delivery status
]

# Add temporal features from 'Delivery Created On' if it exists
# WHY: Deliveries created on certain days/months may have different processing times
if 'Delivery Created On' in df_clean.columns:
    try:
        df_clean['created_dayofweek'] = pd.to_datetime(df_clean['Delivery Created On']).dt.dayofweek
        df_clean['created_month'] = pd.to_datetime(df_clean['Delivery Created On']).dt.month
        potential_features.extend(['created_dayofweek', 'created_month'])
        print("✅ Added temporal features (day of week, month)")
    except Exception as e:
        print(f"⚠️ Could not create temporal features: {e}")

# Filter to only features that exist in the dataframe
feature_cols = [f for f in potential_features if f in df_clean.columns]

print("=== Feature Selection ===")
print(f"Potential features: {len(potential_features)}")
print(f"Available features: {len(feature_cols)}")
print("\nUsing features:")
for i, f in enumerate(feature_cols, 1):
    print(f"  {i}. {f}")

# Check for missing features
missing_features = [f for f in potential_features if f not in df_clean.columns]
if missing_features:
    print("\n⚠️  Missing features (not in data):")
    for f in missing_features:
        print(f"     - {f}")
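
To make the temporal features concrete, the sketch below derives them for three made-up creation dates. pandas numbers weekdays from Monday=0 to Sunday=6 and months from 1 to 12.

```python
import pandas as pd

# Made-up creation dates: a Monday, a Friday, and a Sunday.
created = pd.to_datetime(pd.Series(["2024-03-04", "2024-03-08", "2024-12-29"]))

temporal = pd.DataFrame({
    "created_dayofweek": created.dt.dayofweek,  # Monday=0 ... Sunday=6
    "created_month": created.dt.month,          # January=1 ... December=12
})
print(temporal)
```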

### ⭐ 6. Prepare Features + Target

In [None]:
# ==============================================================================
# FEATURE ENCODING: Convert Text to Numbers
# ==============================================================================
# WHY: Most ML estimators expect numeric inputs, not text.
#      We need to convert text like "FedEx" or "US01" into numeric codes.
#
# Process:
# 1. Text columns → Convert to category codes, assigned in sorted order
#    (e.g. DHL=0, FedEx=1, UPS=2)
# 2. Missing text → Fill with "Unknown" first
# 3. Numeric columns → Fill missing values with median (middle value)
# ==============================================================================

# Extract features and target
X = df_clean[feature_cols].copy()  # Input features
y = df_clean[TARGET_COLUMN].copy()  # Output to predict

# Encode categorical variables (text → numbers)
categorical_cols = X.select_dtypes(include=['object', 'string']).columns.tolist()
for col in categorical_cols:
    X[col] = X[col].fillna('Unknown')  # Fill missing text with "Unknown"
    X[col] = X[col].astype("category").cat.codes  # Convert to numeric codes

# Handle numeric NaNs (fill with median)
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()
for col in numeric_cols:
    if X[col].isnull().sum() > 0:
        X[col] = X[col].fillna(X[col].median())

print(f"✅ Features: {X.shape[1]} columns, {X.shape[0]:,} rows")
print(f"✅ Target: {y.shape[0]:,} values")
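
The encoding steps can be traced on a toy frame. The sketch below uses made-up values and an illustrative `Carrier` column; it also captures the code-to-label mapping, which the notebook does not store. Since `cat.codes` depends on the categories present in each dataframe, keeping such a mapping is one way to encode scoring data consistently, though that is an assumption about how the scoring notebook could work.

```python
import pandas as pd

# Made-up feature values with one missing entry per column.
X_demo = pd.DataFrame({
    "Carrier": ["FedEx", "UPS", None, "DHL", "FedEx"],
    "DELIVERY_QTY": [10.0, None, 4.0, 6.0, 20.0],
})

# Text column: fill missing values, then convert to category codes.
X_demo["Carrier"] = X_demo["Carrier"].fillna("Unknown")
carrier_cat = X_demo["Carrier"].astype("category")
mapping = dict(enumerate(carrier_cat.cat.categories))  # code -> label, sorted order
X_demo["Carrier"] = carrier_cat.cat.codes

# Numeric column: fill missing values with the median of the observed values.
X_demo["DELIVERY_QTY"] = X_demo["DELIVERY_QTY"].fillna(X_demo["DELIVERY_QTY"].median())

print(mapping)
print(X_demo)
```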


### ⭐ 7. Train/Test Split

Hold out 20% of the closed deliveries so the model can be evaluated on data it has not seen.

In [None]:
# ==============================================================================
# TRAIN/TEST SPLIT
# ==============================================================================
# WHY: Split data into two parts:
#      1. Training set (80%): Teaches the model patterns
#      2. Test set (20%): Validates how well it learned (unseen data)
#
# Scoring on held-out data reveals "overfitting" - memorizing training data
# instead of learning patterns that generalize.
# ==============================================================================

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42     # Reproducible split
)

print("=== Train/Test Split ===")
print(f"Training set:   {len(X_train):,} rows ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test set:       {len(X_test):,} rows ({len(X_test)/len(X)*100:.0f}%)")
print(f"Features:       {X_train.shape[1]}")

### ⭐ 8. Train AutoML Model

Using FLAML AutoML to find the best regression model.

In [None]:
# ==============================================================================
# AUTOML TRAINING
# ==============================================================================
# WHY: AutoML automatically tests multiple algorithms and finds the best one.
#      It's like having a data scientist try tree-based learners such as
#      LightGBM, XGBoost, Random Forest, and Extra Trees, then pick the winner.
#
# What AutoML does:
# 1. Tests different algorithms (Random Forest, Gradient Boosting, etc.)
# 2. Tunes hyperparameters (tree depth, learning rate, etc.)
# 3. Validates each model
# 4. Returns the best performer
#
# Time budget: 600 seconds (10 min) - reduce if capacity is tight
# ==============================================================================

print("\n" + "="*60)
print("STARTING AUTOML TRAINING")
print("="*60)
print("This will test multiple algorithms and find the best model...")
print(f"Time budget: {automl_settings['time_budget']} seconds")
print(f"Training on: {len(X_train):,} deliveries")
print("="*60 + "\n")

# Initialize AutoML
automl = AutoML()

# Train (runs for up to the configured time budget)
automl.fit(X_train, y_train, **automl_settings)

print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)
print(f"Best model: {automl.best_estimator}")
print(f"Best validation score (MAE): {automl.best_loss:.2f} days")
print("="*60)

### ⭐ 9. Evaluate Model Performance

In [None]:
# ==============================================================================
# EVALUATE MODEL PERFORMANCE
# ==============================================================================
# WHY: Test the model on unseen data to see how accurate predictions are.
#
# Metrics explained:
# - MAE (Mean Absolute Error): Average days the prediction is off
#   Example: MAE=2 means predictions miss by 2 days on average
#
# - RMSE (Root Mean Squared Error): Penalizes large errors more
#   Useful for catching predictions that are WAY off
#
# - R² (R-squared): Share of variance the model explains (1.0 is perfect)
#   0.5 = 50% of variance explained, 0.8 = 80% (good), 0.95 = 95% (excellent)
#   Can drop below 0 on test data if the model is worse than predicting the mean
# ==============================================================================

# Make predictions on test set (unseen data)
preds = automl.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)

print("\n=== MODEL PERFORMANCE ON TEST SET ===")
print(f"MAE:  {mae:.2f} days (avg prediction error)")
print(f"RMSE: {rmse:.2f} days (penalizes large errors)")
print(f"R²:   {r2:.3f} (variance explained: {r2*100:.1f}%)")
print("="*40)

# Interpretation guide
if mae < 2:
    print("✅ Excellent accuracy (< 2 days error)")
elif mae < 3:
    print("✅ Good accuracy (2-3 days error)")
elif mae < 5:
    print("⚠️ Moderate accuracy (3-5 days error)")
else:
    print("⚠️ Consider adding more features or data")

### ⭐ 10. Register Model in MLflow

In [None]:
# ==============================================================================
# REGISTER MODEL IN MLFLOW
# ==============================================================================
# WHY: MLflow stores the trained model so Notebook 03 can load it for scoring.
#      It also tracks experiments, metrics, and model versions.
#
# What gets saved:
# - The trained model (algorithm + learned parameters)
# - Performance metrics (MAE, RMSE, R²)
# - Model version history
#
# Model name: "ship_date_predictor" - used in Notebook 03 to load model
# ==============================================================================

# Model name for MLflow registration (defined before first use)
MODEL_NAME_REGRESSION = "ship_date_predictor"

print("\n=== Registering Model in MLflow ===")

mlflow.set_experiment("Ship_Date_Prediction")

with mlflow.start_run(run_name="automl_regression") as run:
    # Log performance metrics
    mlflow.log_metric("test_mae", mae)
    mlflow.log_metric("test_rmse", rmse)
    mlflow.log_metric("test_r2", r2)

    # Register the trained model
    mlflow.sklearn.log_model(
        sk_model=automl.model,
        artifact_path="model",
        registered_model_name=MODEL_NAME_REGRESSION
    )

    print(f"✅ Model registered: {MODEL_NAME_REGRESSION}")
    print(f"✅ Run ID: {run.info.run_id}")
    print(f"✅ Metrics logged: MAE={mae:.2f}, RMSE={rmse:.2f}, R²={r2:.3f}")
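
The three test-set metrics this notebook reports can be reproduced by hand, which may help when explaining them to the operations team. The sketch below uses four made-up actual/predicted values and plain NumPy; the `_demo` names are illustrative and chosen so they do not overwrite the notebook's own `mae`, `rmse`, and `r2`.

```python
import numpy as np

# Made-up actual and predicted DAYS_TO_SHIP values.
y_true_demo = np.array([2.0, 5.0, 3.0, 10.0])
y_pred_demo = np.array([3.0, 5.0, 1.0, 6.0])

errors = y_pred_demo - y_true_demo           # [1, 0, -2, -4]

mae_demo = np.mean(np.abs(errors))           # (1 + 0 + 2 + 4) / 4
rmse_demo = np.sqrt(np.mean(errors ** 2))    # sqrt((1 + 0 + 4 + 16) / 4)

ss_res = np.sum(errors ** 2)                             # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2) # total sum of squares
r2_demo = 1 - ss_res / ss_tot

print(f"MAE={mae_demo:.2f}, RMSE={rmse_demo:.2f}, R²={r2_demo:.3f}")
```

The single 4-day miss pushes RMSE well above MAE, which is why the two are worth reading together.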

In [None]:
print("\n" + "="*60)
print("TRAINING COMPLETE")
print("="*60)
print(f"Model: {MODEL_NAME_REGRESSION}")
print(f"Features: {len(feature_cols)}")
print(f"Training samples: {len(X_train):,}")
print(f"Test MAE: {mae:.2f} days")
print(f"\nThis model predicts DAYS_TO_SHIP (days from creation to DC ship)")
print("="*60)
print("\nNEXT: Open 03_batch_scoring_pipeline.ipynb to score open deliveries")
print("="*60)

---

## ✅ Training Complete!

Proceed to **`03_batch_scoring_pipeline.ipynb`** to generate predictions for open deliveries.
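
As a preview of how a predicted DAYS_TO_SHIP could become a forecast ship date for an open delivery, here is a minimal sketch. The function `predicted_ship_date` is hypothetical and is not taken from the scoring notebook. It assumes calendar days (not business days), rounds the prediction to whole days, and clamps negative predictions to zero.

```python
from datetime import date, timedelta


def predicted_ship_date(created_on: date, predicted_days_to_ship: float) -> date:
    """Add the rounded prediction to the creation date (calendar days, assumed)."""
    # Clamp at 0 so a slightly negative prediction never lands before creation.
    whole_days = max(0, round(predicted_days_to_ship))
    return created_on + timedelta(days=whole_days)


# Made-up example: a delivery created on 2024-03-01 with a 2.6-day prediction.
print(predicted_ship_date(date(2024, 3, 1), 2.6))
```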