# AutoML Training - Late Delivery Prediction Model

**Goal:** Train ML models to predict late deliveries using AutoML.

This notebook will:
1. Load closed delivery data from the semantic model
2. Train a **regression model** to predict AGE_REQ_DATE (days late or early)
3. Create a **classification model** for late vs on-time prediction
4. Generate **lateness buckets** (0-2, 3-5, 6-9, 10+ days)
5. Register the best models to MLflow

**Use Case:** Enable the operations team to:
- Identify deliveries at high risk of shipping late
- Prioritize corrective actions for strategic accounts
- Proactively communicate with business teams about potential delays

### ⭐ 1. Imports

- pandas: dataframe manipulation
- mlflow: experiment tracking and model registry
- AutoML (FLAML): automatic model selection / tuning
- sklearn: train/test split + evaluation metrics
- sempy.fabric: read tables from Power BI semantic model

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Machine Learning
from flaml import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    mean_absolute_error, r2_score, mean_squared_error, mean_absolute_percentage_error,
    accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
)
from sklearn.preprocessing import LabelEncoder

# Experiment tracking
import mlflow
from mlflow.tracking import MlflowClient

# Semantic Link - Connect to Power BI
import sempy.fabric as fabric

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ All libraries imported")

### ⭐ 2. Configuration

**IMPORTANT:** Update these settings for your environment.

In [None]:
# Semantic model name
DATASET = "DLV Aging Columns & Measures"

# Target variable: AGE_REQ_DATE (days late/early vs Customer Requested Delivery Date)
# Positive = Late, Negative = Early, 0 = On-time
TARGET_COLUMN = "AGE_REQ_DATE"

# Model names for MLflow registry
MODEL_NAME_REGRESSION = "POC-LateDelivery-Regression-AutoML"
MODEL_NAME_CLASSIFICATION = "POC-LateDelivery-Classification-AutoML"

# Workspace
ws = fabric.get_workspace_id()

print(f"📊 Semantic Model: {DATASET}")
print(f"🎯 Target Variable: {TARGET_COLUMN}")
print(f"🤖 Regression Model: {MODEL_NAME_REGRESSION}")
print(f"🤖 Classification Model: {MODEL_NAME_CLASSIFICATION}")
print(f"🏢 Workspace ID: {ws}")

### ⭐ 3. Load Training Data Using DAX

Load **closed deliveries** (with GI Date) for model training.

In [None]:
# DAX query: Get closed deliveries (deliveries that have shipped)
# Filter for deliveries with Goods Issue (GI) Date populated
dax_query = """
EVALUATE
FILTER(
    Aging,
    NOT(ISBLANK(Aging[GI Date]))
)
"""

# Execute DAX query
print("Loading closed delivery data from semantic model...")
df = fabric.evaluate_dax(dataset=DATASET, dax_string=dax_query, workspace=ws)

# Clean column names (DAX adds table prefixes like 'Aging[column]')
df.columns = [col.split('[')[-1].replace(']', '') if '[' in col else col for col in df.columns]

print(f"✅ Loaded {len(df):,} closed deliveries")
print(f"✅ Columns: {df.shape[1]}")
df.head()
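
The column cleanup above can be checked in isolation. The sketch below applies the same rule to a few hypothetical column labels (the names are illustrative, not taken from the semantic model): the text after the last `[` is kept, and labels without brackets pass through unchanged.

```python
def strip_table_prefix(col: str) -> str:
    """Return the bare column name from a DAX-style label such as 'Aging[GI Date]'."""
    # Same rule as the list comprehension above: keep what follows the last '['
    # and drop the closing ']'. Labels without '[' are returned unchanged.
    return col.split('[')[-1].replace(']', '') if '[' in col else col

# Hypothetical labels for illustration only
raw_columns = ["Aging[GI Date]", "Aging[AGE_REQ_DATE]", "[Total Value]", "Plant"]
cleaned = [strip_table_prefix(c) for c in raw_columns]
print(cleaned)
```

One caveat: if two tables expose a column with the same name, the cleaned labels collide, so it may be worth checking `df.columns.is_unique` after renaming.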

### ⭐ 4. Data Quality Check & Filtering

Remove rows with missing target values.

In [None]:
# Check target column and remove nulls
if TARGET_COLUMN not in df.columns:
    raise ValueError(f"Target column '{TARGET_COLUMN}' not found!")

print(f"Target: {TARGET_COLUMN}")
print(f"Total rows: {len(df):,}")
print(f"Null values: {df[TARGET_COLUMN].isnull().sum():,}")

# Remove rows with null target
df_clean = df[df[TARGET_COLUMN].notna()].copy()
print(f"✅ Clean dataset: {len(df_clean):,} rows")
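
The overview lists a late vs on-time label and lateness buckets (0-2, 3-5, 6-9, 10+ days), both of which can be derived from `AGE_REQ_DATE`. The following is a minimal sketch on made-up values. It assumes whole-day values, that "late" means strictly positive (per the configuration comment), and that early deliveries go into a separate `early` bucket. The column names `is_late` and `late_bucket` and the `early` label are assumptions, not part of the source notebook.

```python
import numpy as np
import pandas as pd

# Made-up AGE_REQ_DATE values (days); positive = late, negative = early, 0 = on-time
demo = pd.DataFrame({"AGE_REQ_DATE": [-3, 0, 2, 3, 5, 6, 9, 10, 25]})

# Binary label: 1 if the delivery shipped after the requested date
demo["is_late"] = (demo["AGE_REQ_DATE"] > 0).astype(int)

# Bucket edges are right-inclusive: (-inf, -1], (-1, 2], (2, 5], (5, 9], (9, inf)
bins = [-np.inf, -1, 2, 5, 9, np.inf]
labels = ["early", "0-2", "3-5", "6-9", "10+"]
demo["late_bucket"] = pd.cut(demo["AGE_REQ_DATE"], bins=bins, labels=labels)

print(demo)
```

Note that the `0-2` bucket includes on-time deliveries (0 days), so the bucket and the binary label do not line up exactly at zero. Whether that matters depends on how the buckets are reported.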

### ⭐ 5. Feature Selection

Define features based on available columns in the Aging table.

In [None]:
# Define features based on actual Aging table schema
# Updated to match the customer's semantic model columns

potential_features = [
    # Location & Routing
    'Plant',                    # Distribution center
    'Shipping Point',           # Shipping location
    'EWM Carrier Code',         # Carrier code (important for performance)
    
    # Product Information
    'Brand',                    # Calculated column (Callaway/Odyssey, Jack Wolfskin, etc.)
    'Channel',                  # Calculated column (E-commerce, Inter-company, etc.)
    'Product Category',         # Product category
    'Product Type',             # Product type
    'Standard Or Custom',       # Standard or custom product
    
    # Customer & Account
    'STRATEGIC_ACCOUNT',        # Strategic vs non-strategic (KEY for prioritization!)
    'Sold To - Key',            # Customer identifier
    
    # Delivery Attributes
    'Delivery Type',            # Type of delivery
    'DELIVERY_QTY',             # Quantity (numeric)
    'DELIVERY_VALUE_USD',       # Value in USD (numeric)
    'Delivery Priority',        # Priority level
    'Shipping Condition',       # Shipping condition code
    
    # Processing Status
    'Credit Status',            # Calculated column (Credit checked, Released, etc.)
    'Distribution Status',      # Calculated column (Confirmed, Distributed, etc.)
    'STATUS',                   # Delivery status
]

# Add temporal features from 'Delivery Created On' if it exists
if 'Delivery Created On' in df_clean.columns:
    try:
        # Parse once and reuse; dayofweek runs from 0 (Monday) to 6 (Sunday)
        created_on = pd.to_datetime(df_clean['Delivery Created On'])
        df_clean['created_dayofweek'] = created_on.dt.dayofweek
        df_clean['created_month'] = created_on.dt.month
        potential_features.extend(['created_dayofweek', 'created_month'])
        print("✅ Added temporal features (day of week, month)")
    except Exception as e:
        print(f"⚠️ Could not create temporal features: {e}")

# Filter to only features that exist in the dataframe
feature_cols = [f for f in potential_features if f in df_clean.columns]

print("=== Feature Selection ===")
print(f"Potential features: {len(potential_features)}")
print(f"Available features: {len(feature_cols)}")
print("\nUsing features:")
for i, f in enumerate(feature_cols, 1):
    print(f"  {i}. {f}")

# Check for missing features
missing_features = [f for f in potential_features if f not in df_clean.columns]
if missing_features:
    print("\n⚠️  Missing features (not in data):")
    for f in missing_features:
        print(f"     - {f}")

### ⭐ 6. Prepare Features + Target

In [None]:
# Extract features and target
X = df_clean[feature_cols].copy()
y = df_clean[TARGET_COLUMN].copy()

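# Encode categorical variables as integer codes.
# Missing values become their own 'Unknown' category. The codes depend on the
# values present in this dataframe, so scoring data needs the same mapping.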
categorical_cols = X.select_dtypes(include=['object', 'string']).columns.tolist()
for col in categorical_cols:
    X[col] = X[col].fillna('Unknown')
    X[col] = X[col].astype("category").cat.codes

# Handle numeric NaNs
numeric_cols = X.select_dtypes(include=['number']).columns.tolist()
for col in numeric_cols:
    if X[col].isnull().sum() > 0:
        X[col] = X[col].fillna(X[col].median())

print(f"✅ Features: {X.shape[1]} columns, {X.shape[0]:,} rows")
print(f"✅ Target: {y.shape[0]:,} values")
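
`cat.codes` assigns integers based on the categories present in the frame being encoded, so the same value can receive a different code when another dataset is encoded later (for example, at scoring time). One way to keep the mapping stable is to record the category list once and reuse it. The sketch below does this with two hypothetical helpers, `fit_categories` and `apply_categories`, on made-up data; values not seen during fitting get code `-1`.

```python
import pandas as pd

def fit_categories(df: pd.DataFrame, cols: list) -> dict:
    """Record the sorted category values for each column (hypothetical helper)."""
    return {c: sorted(df[c].fillna("Unknown").astype(str).unique()) for c in cols}

def apply_categories(df: pd.DataFrame, categories: dict) -> pd.DataFrame:
    """Encode columns with a previously recorded mapping; unseen values become -1."""
    out = df.copy()
    for col, cats in categories.items():
        values = out[col].fillna("Unknown").astype(str)
        out[col] = pd.Categorical(values, categories=cats).codes
    return out

# Illustrative data; the 'Plant' values are made up
train = pd.DataFrame({"Plant": ["P1", "P2", None, "P1"]})
new = pd.DataFrame({"Plant": ["P2", "P9", "P1"]})  # 'P9' was never seen in training

cats = fit_categories(train, ["Plant"])
train_codes = apply_categories(train, cats)["Plant"].tolist()
new_codes = apply_categories(new, cats)["Plant"].tolist()
print(train_codes)
print(new_codes)
```

The recorded `cats` dictionary would need to be saved alongside the model (for instance as an MLflow artifact) for the scoring notebook to reuse it; how that is done here is left open.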

### ⭐ 7. Train/Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"✅ Training set: {X_train.shape[0]:,} rows")
print(f"✅ Test set: {X_test.shape[0]:,} rows")
print(f"✅ Features: {X_train.shape[1]} columns")

### ⭐ 8. Train AutoML Model

Use FLAML AutoML to search for the best regression model within a fixed time budget, optimizing for mean absolute error (MAE).

In [None]:
# Initialize AutoML
automl = AutoML()

settings = {
    "time_budget": 180,  # 3 minutes
    "metric": "mae",
    "task": "regression",
    "estimator_list": ["rf", "xgboost", "extra_tree"],
    "log_file_name": "automl.log",
    "verbose": 0
}

print("Starting AutoML training...")
print(f"Time budget: {settings['time_budget']}s")
print(f"Metric: {settings['metric']}")

# Train
automl.fit(X_train, y_train, **settings)

print("\n✅ Training complete!")
print(f"Best model: {automl.best_estimator}")
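
For the late vs on-time classifier mentioned in the overview, the same FLAML workflow could be reused with a binary target. The settings below are a sketch that mirrors the regression settings. The choice of `f1` as the metric, the log file name, and the `y_train_cls` name in the commented call are assumptions rather than the notebook's actual configuration.

```python
# Assumed settings for a binary late/on-time classifier, mirroring the regression run
classification_settings = {
    "time_budget": 180,          # seconds, same budget as the regression search
    "metric": "f1",              # assumption: F1 balances precision and recall on the late class
    "task": "classification",
    "estimator_list": ["rf", "xgboost", "extra_tree"],
    "log_file_name": "automl_classification.log",  # hypothetical file name
    "verbose": 0,
}

# With FLAML installed and a binary target (1 = late, 0 = on-time or early):
# automl_cls = AutoML()
# automl_cls.fit(X_train, y_train_cls, **classification_settings)
print(classification_settings["task"], classification_settings["metric"])
```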

### ⭐ 9. Evaluate Model Performance

In [None]:
# Evaluate model
preds = automl.predict(X_test)

mae = mean_absolute_error(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)

print("\n=== MODEL PERFORMANCE ===")
print(f"MAE:  {mae:.2f} days")
print(f"RMSE: {rmse:.2f} days")
print(f"R²:   {r2:.3f}")
print("="*40)
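
A regression output can also be read as a late/on-time call by thresholding it at zero, which allows the classification metrics imported earlier to be applied. The sketch below uses made-up actuals and predictions; the zero threshold follows the target definition (positive means late), and all values are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up actual and predicted AGE_REQ_DATE values (days)
y_true_days = np.array([-2.0, 0.0, 1.0, 4.0, 7.0, -1.0, 12.0, 3.0])
y_pred_days = np.array([-1.5, 0.8, 0.4, 3.1, 5.9, -0.2, 9.5, -0.5])

# Positive values mean late, so threshold both series at zero
y_true_late = (y_true_days > 0).astype(int)
y_pred_late = (y_pred_days > 0).astype(int)

accuracy = accuracy_score(y_true_late, y_pred_late)
precision = precision_score(y_true_late, y_pred_late)
recall = recall_score(y_true_late, y_pred_late)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

For the use case described at the top, recall on the late class is probably the number to watch, since a late delivery that the model misses cannot be prioritized.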

### ⭐ 10. Register Model in MLflow

In [None]:
# Register model in MLflow
mlflow.set_experiment("Late_Delivery_Prediction")

with mlflow.start_run(run_name="automl_regression") as run:
    # Log metrics
    mlflow.log_metric("test_mae", mae)
    mlflow.log_metric("test_rmse", rmse)
    mlflow.log_metric("test_r2", r2)
    
    # Register model
    mlflow.sklearn.log_model(
        sk_model=automl.model,
        artifact_path="model",
        registered_model_name=MODEL_NAME_REGRESSION
    )
    
    print(f"\n✅ Model registered: {MODEL_NAME_REGRESSION}")
    print(f"Run ID: {run.info.run_id}")

In [None]:
print("\n" + "="*60)
print("TRAINING COMPLETE")
print("="*60)
print(f"Model: {MODEL_NAME_REGRESSION}")
print(f"Features: {len(feature_cols)}")
print(f"Training samples: {len(X_train):,}")
print(f"Test MAE: {mae:.2f} days")
print(f"Test R²: {r2:.3f}")
print("="*60)
print("\n✅ Next: Open 03_batch_scoring_pipeline.ipynb")
print("="*60)

---

## ✅ Training Complete!

Proceed to **`03_batch_scoring_pipeline.ipynb`** to generate predictions for open deliveries.