# 🤖 Ship Date Prediction - AutoML Training (Simplified)

**Train an ML model to predict when orders will ship from distribution centers**

**What this does:**
1. Load closed deliveries (GI Date populated) from Power BI semantic model
2. Calculate DAYS_TO_SHIP (GI Date - Delivery Created On)
3. Train AutoML regression model on shipping patterns
4. Register model in MLflow for batch scoring

**Target:** DAYS_TO_SHIP = Whole days from delivery creation (Delivery Created On) to DC ship date (GI Date)  
**Model:** ship_date_predictor (regression)
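
As a rough illustration of how the target behaves, the sketch below computes DAYS_TO_SHIP on a few made-up rows. The sample dates are hypothetical; only the two column names and the subtraction come from this notebook. It shows that `.dt.days` keeps the whole-day part of the difference (a same-day or next-morning shipment can come out as 0) and that rows without a GI Date end up as NaN and are dropped.

```python
import pandas as pd

# Hypothetical sample rows; the real columns come from the semantic model.
sample = pd.DataFrame({
    "Delivery Created On": ["2024-03-01 08:00", "2024-03-04 16:30", "2024-03-05 09:00"],
    "GI Date": ["2024-03-04 10:00", "2024-03-05 09:00", None],
})

created = pd.to_datetime(sample["Delivery Created On"])
shipped = pd.to_datetime(sample["GI Date"])

# .dt.days keeps only the whole-day component of each timedelta.
sample["DAYS_TO_SHIP"] = (shipped - created).dt.days

# Rows with no ship date produce NaN and are removed before training.
closed = sample.dropna(subset=["DAYS_TO_SHIP"])
print(closed[["Delivery Created On", "GI Date", "DAYS_TO_SHIP"]])
```

If sub-day precision matters, dividing `.dt.total_seconds()` by 86400 would be one alternative, at the cost of a fractional target.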

In [None]:
# Configuration & Imports
import sempy.fabric as fabric
import pandas as pd
import numpy as np
from flaml import AutoML
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Settings
DATASET = "DLV Aging"
TARGET_COLUMN = "DAYS_TO_SHIP"
MODEL_NAME = "ship_date_predictor"

automl_settings = {
    "time_budget": 600,
    "metric": "mae",
    "task": "regression",
    "seed": 42,
    "early_stop": True
}

print(f"✅ Configuration loaded | Target: {TARGET_COLUMN} | Model: {MODEL_NAME}")

### Load Training Data
Query closed deliveries with both GI Date (ship date) and Delivery Created On populated

In [None]:
# Load closed deliveries from semantic model
print("Loading closed deliveries...")

dax_query = """
EVALUATE
FILTER(
    Aging,
    NOT(ISBLANK(Aging[GI Date])) &&
    NOT(ISBLANK(Aging[Delivery Created On]))
)
"""

ws = fabric.get_workspace_id()
df = fabric.evaluate_dax(dataset=DATASET, dax_string=dax_query, workspace=ws)

# Clean column names
df.columns = [col.split('[')[-1].replace(']', '') if '[' in col else col for col in df.columns]

# Calculate target: days from creation to ship
df[TARGET_COLUMN] = (pd.to_datetime(df['GI Date']) - pd.to_datetime(df['Delivery Created On'])).dt.days
df = df.dropna(subset=[TARGET_COLUMN])

print(f"✅ Loaded {len(df):,} closed deliveries | Mean lead time: {df[TARGET_COLUMN].mean():.1f} days")
df.head()
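
The column-cleaning step above strips the table qualifier from names such as `Aging[GI Date]`. The sketch below applies the same expression to a few example names to show what it produces. The `Orders` table is hypothetical and is only there to show one thing worth checking: if a query ever returns same-named columns from two tables, the cleaned names collide.

```python
from collections import Counter


def clean_column_name(col: str) -> str:
    """Same expression as the notebook: keep the text inside the last [...]."""
    return col.split('[')[-1].replace(']', '') if '[' in col else col


raw = ["Aging[GI Date]", "Aging[Delivery Created On]", "[DELIVERY_QTY]", "Plant"]
cleaned = [clean_column_name(c) for c in raw]
print(cleaned)

# Hypothetical case: two tables exposing a column with the same name.
colliding = [clean_column_name(c) for c in ["Aging[Plant]", "Orders[Plant]"]]
duplicates = [name for name, count in Counter(colliding).items() if count > 1]
print(duplicates)
```

With a single-table query like the one above, collisions should not occur, so the duplicate check is a safeguard rather than a required step.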

### Feature Selection & Preparation
Select features and encode categorical variables for ML training

In [None]:
# Define features (filter to those available in data)
potential_features = [
    'Plant', 'Shipping Point', 'EWM Carrier Code',
    'Brand', 'Channel', 'Product Category', 'Product Type', 'Standard Or Custom',
    'STRATEGIC_ACCOUNT', 'Sold To - Key',
    'Delivery Type', 'DELIVERY_QTY', 'DELIVERY_VALUE_USD', 'Delivery Priority',
    'Shipping Condition', 'Credit Status', 'Distribution Status', 'STATUS'
]

# Add temporal features
if 'Delivery Created On' in df.columns:
    df['created_dayofweek'] = pd.to_datetime(df['Delivery Created On']).dt.dayofweek
    df['created_month'] = pd.to_datetime(df['Delivery Created On']).dt.month
    potential_features.extend(['created_dayofweek', 'created_month'])

feature_cols = [f for f in potential_features if f in df.columns]

# Extract features and target
X = df[feature_cols].copy()
y = df[TARGET_COLUMN].copy()

# Encode categorical variables
for col in X.select_dtypes(include=['object', 'string']).columns:
    X[col] = X[col].fillna('Unknown').astype('category').cat.codes

# Fill numeric NaNs with median
for col in X.select_dtypes(include=['number']).columns:
    if X[col].isnull().sum() > 0:
        X[col] = X[col].fillna(X[col].median())

print(f"✅ Features: {len(feature_cols)} | Rows: {len(X):,}")
print(f"Features: {', '.join(feature_cols[:8])}...")
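
One thing to keep in mind with `cat.codes`: the integer assigned to each value depends on which values are present in the data being encoded. The sketch below is one possible way to keep codes stable between training and batch scoring, assuming the category list is saved alongside the model. The sample values and the idea of persisting `categories` are assumptions, not something this notebook does.

```python
import pandas as pd

# Hypothetical training and scoring data for a single categorical feature.
train = pd.DataFrame({"Plant": ["P100", "P200", None, "P100"]})
score = pd.DataFrame({"Plant": ["P200", "P300"]})

# Same fill value as the notebook's encoding loop.
train_filled = train["Plant"].fillna("Unknown")

# Sorted unique values match the order astype('category') uses for strings.
categories = sorted(train_filled.unique())

# Encode both sets against the same fixed category list.
train_codes = pd.Categorical(train_filled, categories=categories).codes
score_codes = pd.Categorical(
    score["Plant"].fillna("Unknown"), categories=categories
).codes

print(categories)
print(list(train_codes))
print(list(score_codes))  # values unseen in training get code -1
```

Encoding the scoring data on its own would instead assign `P200` the code 0, which the trained model would read as a different plant.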

### Train/Test Split
Split data 80/20 for training and validation

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"✅ Train: {len(X_train):,} | Test: {len(X_test):,} | Features: {X_train.shape[1]}")

### Train AutoML Model
Use FLAML AutoML to find the best regression model (10 min budget)

In [None]:
# Train AutoML model
print("Training AutoML model (10 min budget)...")

automl = AutoML()
automl.fit(X_train, y_train, **automl_settings)

print(f"\n✅ Training complete | Best model: {automl.best_estimator}")
print(f"Best MAE: {automl.best_loss:.2f} days")

### Evaluate Performance
Test model accuracy on unseen data

In [None]:
# Evaluate the trained model on the held-out test set
y_pred = automl.predict(X_test)

# Error metrics, in days (MAE, RMSE) and as explained variance (R²)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"✅ Test MAE:  {mae:.2f} days")
print(f"   Test RMSE: {rmse:.2f} days")
print(f"   Test R²:   {r2:.3f}")
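
For readers less familiar with the three metrics imported at the top of the notebook (`mean_absolute_error`, `mean_squared_error`, `r2_score`), the sketch below computes MAE, RMSE and R² by hand on made-up numbers. The values are hypothetical and only illustrate the definitions: MAE and RMSE are in days, and R² compares the model's squared error with that of always predicting the mean.

```python
import math

# Hypothetical actual and predicted lead times, in days.
y_true = [2, 4, 6, 8]
y_pred = [3, 4, 5, 10]
n = len(y_true)

errors = [p - t for t, p in zip(y_true, y_pred)]

# MAE: average size of the error, ignoring direction.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R²: 1 minus (model squared error / squared error of predicting the mean).
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f} | RMSE: {rmse:.2f} | R2: {r2:.2f}")
```

An RMSE noticeably larger than the MAE, as here, suggests that a few deliveries are missed by a wide margin.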

## ✅ Training Complete!

**Next Steps:**
1. Run notebook `03_batch_scoring_pipeline.ipynb` to generate predictions for open deliveries
2. View predictions in Power BI report

**Model Summary:**
- Model Name: ship_date_predictor
- Target: DAYS_TO_SHIP (days from delivery creation to GI Date)
- Features: up to 20 columns (18 candidate columns plus 2 derived date features), limited to those present in the data
- Algorithm: FLAML AutoML (e.g., Random Forest, XGBoost, Extra Trees)
- Performance: Check MAE, RMSE, R² scores above