# 🔮 AGE_REQ_DATE Prediction - AutoML Training (Lateness Forecasting)

**Train an ML model to predict delivery lateness relative to customer-requested dates**

**What this does:**
1. Load closed deliveries (GI Date populated) from Power BI semantic model
2. Calculate AGE_REQ_DATE (GI Date - Req. Date Header)
3. Train AutoML regression model on lateness patterns
4. Register model in MLflow for batch scoring

**Target:** AGE_REQ_DATE = Days late/early vs customer requested delivery date
- Positive = Late (shipped after customer request)
- Negative = Early (shipped before customer request)
- Zero = On-time
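
As a quick check of the sign convention, here is a minimal sketch that computes the target for three made-up deliveries. The dates are illustrative and not taken from the DLV Aging dataset; only the column names match the ones used in this notebook.

```python
import pandas as pd

# Three illustrative deliveries, all requested for 2024-03-10 (made-up data)
deliveries = pd.DataFrame({
    "GI Date": ["2024-03-12", "2024-03-08", "2024-03-10"],
    "Req. Date Header": ["2024-03-10", "2024-03-10", "2024-03-10"],
})

# AGE_REQ_DATE = GI Date - Req. Date Header, expressed in whole days
deliveries["AGE_REQ_DATE"] = (
    pd.to_datetime(deliveries["GI Date"])
    - pd.to_datetime(deliveries["Req. Date Header"])
).dt.days

print(deliveries["AGE_REQ_DATE"].tolist())  # [2, -2, 0] -> late, early, on-time
```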

**Model:** ship_date_predictor (regression)

In [None]:
# Configuration & Imports
import sempy.fabric as fabric
import pandas as pd
import numpy as np
from flaml import AutoML
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Settings
DATASET = "DLV Aging"
TARGET_COLUMN = "AGE_REQ_DATE"
MODEL_NAME = "ship_date_predictor"

automl_settings = {
    "time_budget": 600,
    "metric": "mae",
    "task": "regression",
    "seed": 42,
    "early_stop": True
}

print(f"✅ Configuration loaded | Target: {TARGET_COLUMN} | Model: {MODEL_NAME}")

### Load Training Data
Query closed deliveries, i.e. rows where both GI Date and the customer-requested delivery date (Req. Date Header) are populated
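
The loading cell below also strips the table qualifier from each returned column name. The sketch here shows that cleanup in isolation. `clean_column_name` is a hypothetical helper name, and the `Aging[GI Date]` style of input is an assumption about how the query result labels its columns.

```python
def clean_column_name(col: str) -> str:
    """Drop a 'Table[...]' qualifier and keep only the inner column name."""
    if "[" not in col:
        return col  # already a plain name, leave untouched
    return col.split("[")[-1].replace("]", "")

# Assumed examples of qualified and unqualified names
raw_columns = ["Aging[GI Date]", "Aging[Req. Date Header]", "[Some Measure]", "Plant"]
cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
```

Names without a bracket pass through unchanged, so the cleanup is safe to apply to every column.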

In [None]:
# Load closed deliveries from semantic model
print("Loading closed deliveries...")

dax_query = """
EVALUATE
FILTER(
    Aging,
    NOT(ISBLANK(Aging[GI Date])) &&
    NOT(ISBLANK(Aging[Req. Date Header]))
)
"""

ws = fabric.get_workspace_id()
df = fabric.evaluate_dax(dataset=DATASET, dax_string=dax_query, workspace=ws)

# Clean column names
df.columns = [col.split('[')[-1].replace(']', '') if '[' in col else col for col in df.columns]

# Calculate target: days late/early vs customer requested date
df[TARGET_COLUMN] = (pd.to_datetime(df['GI Date']) - pd.to_datetime(df['Req. Date Header'])).dt.days
df = df.dropna(subset=[TARGET_COLUMN])

print(f"✅ Loaded {len(df):,} closed deliveries | Mean lateness: {df[TARGET_COLUMN].mean():.1f} days")
df.head()

### Feature Selection & Preparation
Select plant, product, customer, and delivery attributes, and derive day-of-week and month features from the requested date
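
The cell below turns categorical columns into integer codes with `cat.codes`. Those codes depend on which values appear in the data at hand, so the scoring notebook may need the same mapping to produce comparable inputs. One possible approach, sketched with the hypothetical helpers `fit_category_maps` and `apply_category_maps` (neither is part of this notebook), is to learn the mapping once and reuse it:

```python
import pandas as pd

def fit_category_maps(frame, columns):
    """Learn a value -> integer code mapping per column from the training data."""
    maps = {}
    for col in columns:
        values = frame[col].fillna("Unknown").astype(str)
        categories = sorted(values.unique())
        maps[col] = {value: code for code, value in enumerate(categories)}
    return maps

def apply_category_maps(frame, maps, unseen_code=-1):
    """Encode columns with a learned mapping; unseen values get unseen_code."""
    out = frame.copy()
    for col, mapping in maps.items():
        values = out[col].fillna("Unknown").astype(str)
        out[col] = values.map(mapping).fillna(unseen_code).astype(int)
    return out

# Made-up data: "P9" never appears in training
train = pd.DataFrame({"Plant": ["P1", "P2", None, "P1"]})
score = pd.DataFrame({"Plant": ["P2", "P9", None]})

maps = fit_category_maps(train, ["Plant"])
encoded_train = apply_category_maps(train, maps)
encoded_score = apply_category_maps(score, maps)
print(encoded_train["Plant"].tolist(), encoded_score["Plant"].tolist())
```

The `-1` fallback for unseen values is an assumption of this sketch. Whether it is a sensible choice depends on the estimator that AutoML selects.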

In [None]:
# Define features (filter to those available in data)
potential_features = [
    'Plant', 'Shipping Point', 'EWM Carrier Code',
    'Brand', 'Channel', 'Product Category', 'Product Type', 'Standard Or Custom',
    'STRATEGIC_ACCOUNT', 'Sold To Name 1', 'Sold To - Key',
    'Delivery Type', 'DELIVERY_QTY', 'DELIVERY_VALUE_USD', 'Delivery Priority',
    'Shipping Condition', 'EWM Shipping Condition', 'Credit Status', 'Distribution Status', 'STATUS'
]

# Add temporal features from requested delivery date
if 'Req. Date Header' in df.columns:
    df['req_dayofweek'] = pd.to_datetime(df['Req. Date Header']).dt.dayofweek
    df['req_month'] = pd.to_datetime(df['Req. Date Header']).dt.month
    potential_features.extend(['req_dayofweek', 'req_month'])

feature_cols = [f for f in potential_features if f in df.columns]

# Extract features and target
X = df[feature_cols].copy()
y = df[TARGET_COLUMN].copy()

# Encode categorical variables
for col in X.select_dtypes(include=['object', 'string']).columns:
    X[col] = X[col].fillna('Unknown').astype('category').cat.codes

# Fill numeric NaNs with median
for col in X.select_dtypes(include=['number']).columns:
    if X[col].isnull().sum() > 0:
        X[col] = X[col].fillna(X[col].median())

print(f"✅ Features: {len(feature_cols)} | Rows: {len(X):,}")
print(f"Features: {', '.join(feature_cols[:8])}...")

### Train/Test Split
Split data 80/20 into a training set and a held-out test set

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"✅ Train: {len(X_train):,} | Test: {len(X_test):,} | Features: {X_train.shape[1]}")

### Train AutoML Model
Use FLAML AutoML to find the best regression model (10 min budget)

In [None]:
# Train AutoML model
print("Training AutoML model (10 min budget)...")

automl = AutoML()
automl.fit(X_train, y_train, **automl_settings)

print(f"\n✅ Training complete | Best model: {automl.best_estimator}")
print(f"Best validation MAE: {automl.best_loss:.2f} days")

### Evaluate Performance
Measure how accurately the model predicts delivery lateness on the held-out test set (positive = late, negative = early)
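
To make the three reported metrics concrete, here is a small worked sketch on four made-up lateness values, computed directly with NumPy. The numbers are illustrative only and say nothing about this model's actual performance.

```python
import numpy as np

# Made-up actual and predicted lateness in days
y_true = np.array([2.0, -1.0, 0.0, 5.0])
y_pred = np.array([1.0, 0.0, 0.0, 3.0])

errors = y_true - y_pred                      # [1, -1, 0, 2]

# MAE: average absolute error, in days
mae_example = np.mean(np.abs(errors))         # (1 + 1 + 0 + 2) / 4

# RMSE: square root of the mean squared error, penalizes large misses more
rmse_example = np.sqrt(np.mean(errors ** 2))  # sqrt((1 + 1 + 0 + 4) / 4)

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_example = 1 - ss_res / ss_tot

print(f"MAE={mae_example:.2f} RMSE={rmse_example:.2f} R2={r2_example:.3f}")
```

Because MAE takes absolute values, a prediction that is two days too late counts the same as one that is two days too early.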

In [None]:
# Evaluate on test set
preds = automl.predict(X_test)

mae = mean_absolute_error(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)

print("\n=== MODEL PERFORMANCE ===")
print(f"MAE:  {mae:.2f} days (avg error in lateness prediction)")
print(f"RMSE: {rmse:.2f} days")
print(f"R²:   {r2:.3f} ({r2*100:.1f}% variance explained)")
print("\nInterpretation:")
print("- Positive predictions = Late delivery (after customer requested date)")
print("- Negative predictions = Early delivery (before customer requested date)")
print("=========================")

### Register Model in MLflow
Save model for batch scoring in Notebook 04 (AGE_REQ_DATE predictions)

In [None]:
# Register model in MLflow
print("\nRegistering model in MLflow...")

mlflow.set_experiment("AGE_REQ_DATE_Prediction")

with mlflow.start_run(run_name="automl_lateness_regression") as run:
    mlflow.log_metric("test_mae", mae)
    mlflow.log_metric("test_rmse", rmse)
    mlflow.log_metric("test_r2", r2)
    
    mlflow.sklearn.log_model(
        sk_model=automl.model,
        artifact_path="model",
        registered_model_name=MODEL_NAME
    )
    
    print(f"✅ Model registered: {MODEL_NAME}")
    print(f"✅ Run ID: {run.info.run_id}")
    print("\n🎯 Ready for batch scoring in Notebook 04")

---
**Next:** Run `04_age_req_date_scoring_pipeline.ipynb` to predict delivery lateness for open deliveries