# Semantic Link Data Preparation - Late Delivery Prediction

**Goal:** Prepare data from the "DLV Aging Columns & Measures" semantic model to predict late deliveries.

## Use Case
Predict which **open deliveries** will ship late relative to the Customer Requested Delivery Date, enabling:
- Proactive communication with business teams about specific late deliveries
- Identification of deliveries with high risk of shipping late
- Prioritization of corrective actions for strategic accounts

## This Notebook Will:
1. Connect to the semantic model
2. Load historical **closed deliveries** (last 2-4 weeks) for training
3. Validate the target variable: `AGE_REQ_DATE` (days late/early vs requested date)
4. Explore features: Plant, Brand, Channel, Carrier, Strategic Account, etc.
5. Prepare data for AutoML training

### 📦 1. Install Semantic Link

In [None]:
%pip install -U semantic-link -q

### 🔧 2. Configuration

**IMPORTANT:** Update the semantic model name to match your environment.

In [None]:
import sempy.fabric as fabric 
import pandas as pd
from sempy.fabric import FabricDataFrame

# Semantic model name
DATASET = "DLV Aging Columns & Measures"

# Get workspace ID
ws = fabric.get_workspace_id()

print(f"✅ Workspace ID: {ws}")
print(f"✅ Semantic Model: {DATASET}")

### 📊 3. List Tables in Semantic Model

Let's see what tables are available in the semantic model.

In [None]:
# List all tables
tables_fdf = fabric.list_tables(DATASET, workspace=ws) 
print(f"Tables found: {len(tables_fdf)}")
tables_fdf

### 📋 4. List Columns in the Semantic Model

List every column across all tables in the model, including those of the Aging table.

In [None]:
# List all columns across all tables
columns_df = fabric.list_columns(DATASET, workspace=ws)
print(f"\nTotal columns across all tables: {len(columns_df)}")
print(f"\nColumn details:")
columns_df
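
The call above returns columns for every table, so narrowing the result to the Aging table is a separate step. The sketch below shows one way to do that on a small stand-in frame. The headers `Table Name` and `Column Name`, the `Calendar` table, and the helper `columns_for_table` are all assumptions for illustration; check the actual headers of `columns_df` before reusing it.

```python
import pandas as pd

# Stand-in for the frame returned by fabric.list_columns(); the headers
# "Table Name" and "Column Name" are assumptions to verify against columns_df.
columns_demo = pd.DataFrame({
    "Table Name": ["Aging", "Aging", "Calendar"],
    "Column Name": ["GI Date", "AGE_REQ_DATE", "Date"],
})

def columns_for_table(columns_df, table):
    """Return the column names that belong to a single table."""
    mask = columns_df["Table Name"] == table
    return columns_df.loc[mask, "Column Name"].tolist()

aging_columns = columns_for_table(columns_demo, "Aging")
print(aging_columns)
```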

### 🔗 5. Visualize Relationships

Check if there are any relationships between tables in the semantic model.

In [None]:
try:
    from sempy.relationships import plot_relationship_metadata
    relationships = fabric.list_relationships(workspace=ws, dataset=DATASET)

    if len(relationships) > 0:
        print(f"Found {len(relationships)} relationships")
        # Wrap in display() so the graph renders from inside the try block
        display(plot_relationship_metadata(relationships))
    else:
        print("No relationships found in this semantic model.")
        print("The Aging table appears to be a single fact table with all data.")
except Exception as e:
    # Covers API errors as well as plotting failures (e.g. graphviz not installed)
    print(f"Could not list or plot relationships: {e}")

### 📥 6. Load Closed Deliveries from the Aging Table

Load the shipped (closed) deliveries from the Aging table to understand the data structure.

In [None]:
# Load closed deliveries from the Aging table using DAX
# Filter for deliveries that have shipped (GI Date is not blank).
# Note: this query applies no date window; restrict to the recent training window separately.
dax_query = """
EVALUATE
FILTER(
    Aging,
    NOT(ISBLANK(Aging[GI Date]))
)
"""

df_closed = fabric.evaluate_dax(dataset=DATASET, dax_string=dax_query, workspace=ws)

# Clean column names (remove table prefixes if present)
df_closed.columns = [col.split('[')[-1].replace(']', '') if '[' in col else col for col in df_closed.columns]

print(f"✅ Loaded {len(df_closed):,} closed deliveries")
print(f"✅ Columns: {df_closed.shape[1]}")
print(f"\nFirst few rows:")
df_closed.head()
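
The DAX query keeps every shipped delivery, so the recent training window still has to be applied somewhere. The sketch below shows one way to do that in pandas after the column names have been cleaned. The helper names `clean_dax_columns` and `keep_recent`, the 30-day default, and the sample rows are illustrative assumptions. It also assumes `GI Date` can be parsed as a date.

```python
import pandas as pd

def clean_dax_columns(df):
    """Strip 'Table[Column]' prefixes so 'Aging[GI Date]' becomes 'GI Date'."""
    out = df.copy()
    out.columns = [
        c.split("[")[-1].replace("]", "") if "[" in c else c for c in out.columns
    ]
    return out

def keep_recent(df, date_col="GI Date", days=30, as_of=None):
    """Keep rows whose date_col falls within the `days` days up to as_of."""
    as_of = pd.Timestamp(as_of) if as_of is not None else pd.Timestamp.today().normalize()
    dates = pd.to_datetime(df[date_col], errors="coerce")  # unparseable -> NaT
    cutoff = as_of - pd.Timedelta(days=days)
    return df[(dates >= cutoff) & (dates <= as_of)]

# Made-up rows shaped like an evaluate_dax result
raw = pd.DataFrame({
    "Aging[GI Date]": ["2024-05-01", "2024-05-28", "2024-03-15"],
    "Aging[AGE_REQ_DATE]": [2, -1, 5],
})
recent = keep_recent(clean_dax_columns(raw), as_of="2024-05-31")
print(recent)
```

Passing `as_of` explicitly keeps the window reproducible when the notebook is re-run on a later day.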

### 🔍 7. Explore Data Structure

Understanding closed deliveries that will serve as training data.

In [None]:
print("=== DATA SUMMARY ===\n")
print(f"Shape: {df_closed.shape[0]:,} rows × {df_closed.shape[1]} columns\n")
print(f"\nKey columns for late delivery prediction:")

# Group columns by category
key_cols = {
    'Target Variable': ['AGE_REQ_DATE', 'AGE_CREATEDON', 'Aging Bucket based on RDD'],
    'Date Fields': ['Delivery Created On', 'Req. Date Header', 'GI Date', 'Manifest Date'],
    'Delivery Info': ['Delivery Number', 'STATUS', 'Delivery Type', 'DELIVERY_QTY', 'DELIVERY_VALUE_USD'],
    'Location/Routing': ['Plant', 'Shipping Point', 'EWM_CARRIER_CODE'],
    'Product': ['Brand', 'Product Category', 'Product Type', 'Standard Or Custom'],
    'Customer': ['Channel', 'STRATEGIC_ACCOUNT', 'Sold To - Key', 'Ship To - Key'],
    'Processing': ['Credit Status', 'Distribution Status', 'OVERALL_PROCESSING_STATUS']
}

for category, cols in key_cols.items():
    print(f"\n{category}:")
    for col in cols:
        if col in df_closed.columns:
            print(f"  ✅ {col}")
        else:
            print(f"  ❌ {col} (not found)")

In [None]:
print("\n=== DATA TYPES ===\n")
print(df_closed.dtypes)

In [None]:
print("\n=== MISSING VALUES ===\n")
missing = df_closed.isnull().sum()
missing_pct = (missing / len(df_closed) * 100).round(2)
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Missing %': missing_pct.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(missing_df.to_string(index=False))
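
One way to act on the missing-value report is to flag columns whose missing share exceeds a threshold, so they can be reviewed before training. The sketch below assumes a 50% threshold and a helper named `high_missing_columns`; both are illustrative choices, and the sample rows are made up.

```python
import pandas as pd

def high_missing_columns(df, threshold_pct=50.0):
    """Return columns whose share of missing values exceeds threshold_pct."""
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()

# Made-up rows: 'Manifest Date' is 75% missing, 'EWM_CARRIER_CODE' 25%
demo = pd.DataFrame({
    "Plant": ["P1", "P2", "P3", "P4"],
    "Manifest Date": [None, None, None, "2024-05-01"],
    "EWM_CARRIER_CODE": ["C1", None, "C2", "C3"],
})
flagged = high_missing_columns(demo)
print(flagged)
```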

### 📈 8. Explore Key Features for Prediction

Understanding the features that will help predict late deliveries.

In [None]:
# Key categorical features for late delivery prediction
categorical_cols = ['Plant', 'Brand', 'Channel', 'Product Category', 'Product Type', 
                    'Standard Or Custom', 'Credit Status', 'Distribution Status',
                    'STRATEGIC_ACCOUNT', 'EWM_CARRIER_CODE', 'Delivery Type',
                    'Aging Bucket based on RDD', 'Aging Bucket based on Created Date']

print("=== CATEGORICAL FEATURE VALUE COUNTS ===\n")
for col in categorical_cols:
    if col in df_closed.columns:
        print(f"\n{col}:")
        value_counts = df_closed[col].value_counts().head(10)
        print(value_counts)
        print(f"  Total unique values: {df_closed[col].nunique()}")
        print(f"  Missing values: {df_closed[col].isnull().sum()} ({df_closed[col].isnull().sum()/len(df_closed)*100:.1f}%)")

In [None]:
# Key numeric columns
numeric_cols = ['DELIVERY_QTY', 'DELIVERY_VALUE_USD', 'AGE_CREATEDON', 'AGE_REQ_DATE']

print("\n=== NUMERIC FEATURE STATISTICS ===\n")
for col in numeric_cols:
    if col in df_closed.columns:
        print(f"\n{col}:")
        stats = df_closed[col].describe()
        print(stats)
        
        # For the target column, show late vs on-time distribution
        # (percentages use non-null values so the two shares sum to 100%)
        if col == 'AGE_REQ_DATE':
            valid = df_closed[col].dropna()
            late_count = (valid > 0).sum()
            on_time_count = (valid <= 0).sum()
            print(f"\n  Late deliveries (>0 days): {late_count} ({late_count/len(valid)*100:.1f}%)")
            print(f"  On-time/Early (<=0 days): {on_time_count} ({on_time_count/len(valid)*100:.1f}%)")

### 🎯 9. Validate Target Variable: AGE_REQ_DATE

**Target Variable:** `AGE_REQ_DATE` - Days late/early relative to Customer Requested Delivery Date

- **Positive values** = Late delivery (shipped after requested date)
- **Zero** = On-time delivery  
- **Negative values** = Early delivery (shipped before requested date)

This is the key metric for:
- Meeting customer expectations
- Identifying deliveries at risk of SLA breach
- Bucketing late deliveries (0-2, 3-5, 6-9, 10+ days)
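
The bucketing mentioned above can be sketched with `pandas.cut`. The edges follow the bucket names literally, so a value of 0 (on-time) falls into the `0-2` bucket. The separate `Early` label for negative values and the helper name `lateness_bucket` are assumptions; the semantic model's own bucket definitions may differ.

```python
import numpy as np
import pandas as pd

def lateness_bucket(age_req_date):
    """Map days late/early to a bucket label using left-closed intervals."""
    bins = [-np.inf, 0, 3, 6, 10, np.inf]  # [-inf,0) [0,3) [3,6) [6,10) [10,inf)
    labels = ["Early", "0-2", "3-5", "6-9", "10+"]
    return pd.cut(age_req_date, bins=bins, labels=labels, right=False)

# Made-up AGE_REQ_DATE values
demo_days = pd.Series([-3, 0, 2, 4, 9, 15])
buckets = lateness_bucket(demo_days)
print(buckets.tolist())
```

Left-closed intervals (`right=False`) keep fractional values such as 2.5 in the `0-2` bucket instead of pushing them into `3-5`.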

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize AGE_REQ_DATE distribution
if 'AGE_REQ_DATE' in df_closed.columns:
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))
    
    # Filter out nulls
    age_req_data = df_closed['AGE_REQ_DATE'].dropna()
    
    # Histogram
    axes[0].hist(age_req_data, bins=50, edgecolor='black', color='steelblue')
    axes[0].axvline(0, color='red', linestyle='--', linewidth=2, label='On-time threshold')
    axes[0].axvline(age_req_data.median(), color='orange', linestyle='--', label=f'Median: {age_req_data.median():.1f} days')
    axes[0].set_title('Distribution of AGE_REQ_DATE (Days Late/Early)', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Days (Negative=Early, Positive=Late)')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Late vs On-time pie chart
    late_count = (age_req_data > 0).sum()
    on_time_early_count = (age_req_data <= 0).sum()
    
    axes[1].pie([late_count, on_time_early_count], 
                labels=[f'Late\n({late_count:,})', f'On-time/Early\n({on_time_early_count:,})'],
                colors=['#FF6B6B', '#51CF66'],
                autopct='%1.1f%%',
                startangle=90,
                textprops={'fontsize': 12, 'fontweight': 'bold'})
    axes[1].set_title('Late vs On-time/Early Deliveries', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\n" + "="*70)
    print("TARGET VARIABLE: AGE_REQ_DATE")
    print("="*70)
    print(f"Mean: {age_req_data.mean():.2f} days")
    print(f"Median: {age_req_data.median():.2f} days")
    print(f"Min: {age_req_data.min():.2f} days (earliest)")
    print(f"Max: {age_req_data.max():.2f} days (latest)")
    print(f"Std Dev: {age_req_data.std():.2f} days")
    print(f"\nLate deliveries (>0 days): {late_count:,} ({late_count/len(age_req_data)*100:.1f}%)")
    print(f"On-time/Early (<=0 days): {on_time_early_count:,} ({on_time_early_count/len(age_req_data)*100:.1f}%)")
    print("="*70)
    
    # Show lateness buckets if they exist
    if 'Aging Bucket based on RDD' in df_closed.columns:
        print("\n" + "="*70)
        print("EXISTING AGING BUCKETS (from semantic model)")
        print("="*70)
        bucket_counts = df_closed['Aging Bucket based on RDD'].value_counts().sort_index()
        print(bucket_counts)
        print("="*70)
else:
    print("⚠️ Warning: AGE_REQ_DATE column not found!")
    print("Cannot proceed with late delivery prediction without this target variable.")

### 📅 10. Analyze Strategic Account Performance

Understanding late delivery patterns for strategic vs non-strategic accounts.

In [None]:
# Analyze strategic account performance
if 'STRATEGIC_ACCOUNT' in df_closed.columns and 'AGE_REQ_DATE' in df_closed.columns:
    print("=== STRATEGIC ACCOUNT PERFORMANCE ===\n")
    
    for acct_type in df_closed['STRATEGIC_ACCOUNT'].dropna().unique():
        subset = df_closed[df_closed['STRATEGIC_ACCOUNT'] == acct_type]
        age_data = subset['AGE_REQ_DATE'].dropna()
        
        if len(age_data) > 0:
            late_pct = ((age_data > 0).sum() / len(age_data) * 100)
            print(f"\n{acct_type}:")
            print(f"  Total deliveries: {len(subset):,}")
            print(f"  Average lateness: {age_data.mean():.2f} days")
            print(f"  Late delivery rate: {late_pct:.1f}%")
    
    print("\n" + "="*70)
else:
    print("⚠️ Strategic account analysis unavailable")
    print("   STRATEGIC_ACCOUNT or AGE_REQ_DATE column missing")
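
The per-account loop can also be written as a single `groupby` aggregation, which returns a table that is easier to sort or reuse for other features such as Plant or Channel. This is a sketch on made-up rows; the helper name `late_rate_by_group` is an assumption. Unlike the loop, it counts only deliveries with a non-null target.

```python
import pandas as pd

def late_rate_by_group(df, group_col="STRATEGIC_ACCOUNT", target_col="AGE_REQ_DATE"):
    """Summarize delivery count, mean lateness, and late rate per group."""
    valid = df.dropna(subset=[group_col, target_col])
    summary = valid.groupby(group_col)[target_col].agg(
        deliveries="count",
        avg_days="mean",
        late_rate_pct=lambda s: (s > 0).mean() * 100,  # share of rows with days > 0
    )
    return summary.sort_values("late_rate_pct", ascending=False)

# Made-up rows; the last one has no account type and is excluded
demo = pd.DataFrame({
    "STRATEGIC_ACCOUNT": ["Y", "Y", "N", "N", "N", None],
    "AGE_REQ_DATE": [3, -1, 0, 2, 4, 1],
})
summary = late_rate_by_group(demo)
print(summary)
```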

### ✅ 11. Summary & Recommendations

The cell below confirms that the target variable is present, lists which recommended features are available, and points to the AutoML training notebook.

In [None]:
print("="*70)
print("LATE DELIVERY PREDICTION - DATA PREPARATION COMPLETE")
print("="*70)

# Validate target variable
if 'AGE_REQ_DATE' in df_closed.columns:
    print("\n✅ TARGET VARIABLE: AGE_REQ_DATE")
    print("   Days late/early relative to Customer Requested Delivery Date")
    print("   - Positive = Late delivery")
    print("   - Zero = On-time")
    print("   - Negative = Early delivery")
else:
    print("\n❌ ERROR: AGE_REQ_DATE not found!")
    print("   Cannot proceed without target variable")

# Recommended features
print("\n📊 RECOMMENDED FEATURES FOR PREDICTION:")
recommended_features = [
    'Plant',
    'Brand', 
    'Channel',
    'Product Category',
    'Product Type',
    'Standard Or Custom',
    'Credit Status',
    'Distribution Status',
    'STRATEGIC_ACCOUNT',
    'EWM_CARRIER_CODE',
    'Shipping Point',
    'Delivery Type',
    'DELIVERY_QTY',
    'DELIVERY_VALUE_USD'
]

available_features = [f for f in recommended_features if f in df_closed.columns]
missing_features = [f for f in recommended_features if f not in df_closed.columns]

print(f"\n  Available features ({len(available_features)}/{len(recommended_features)}):")
for f in available_features:
    print(f"    ✅ {f}")

if missing_features:
    print(f"\n  Missing features ({len(missing_features)}):")
    for f in missing_features:
        print(f"    ⚠️ {f}")

# Data quality check
print(f"\n📈 DATA SUMMARY:")
print(f"   Total closed deliveries: {len(df_closed):,}")
if 'AGE_REQ_DATE' in df_closed.columns:
    valid_target = df_closed['AGE_REQ_DATE'].notna().sum()
    print(f"   Valid target values: {valid_target:,} ({valid_target/len(df_closed)*100:.1f}%)")

print("\n" + "="*70)
print("✅ NEXT STEP: Open 02_autoML_training_pipeline.ipynb")
print("="*70)
print("\nThe next notebook will:")
print("  1. Train regression model to predict AGE_REQ_DATE")
print("  2. Create classification model for late vs on-time")
print("  3. Generate lateness buckets (0-2, 3-5, 6-9, 10+ days)")
print("  4. Register best model to MLflow")
print("="*70)

---

## Next Step

Proceed to **`02_autoML_training_pipeline.ipynb`** to train the late delivery prediction model.

The model will predict:
- **AGE_REQ_DATE** (regression): How many days late/early will the delivery be?
- **is_late** (classification): Will the delivery be late (yes/no)?
- **lateness_bucket** (multi-class): Which bucket (0-2, 3-5, 6-9, 10+ days)?
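
Before handing off, the training frame itself can be assembled from the available features and the target. The following is a minimal sketch on made-up rows, not the pipeline's actual implementation. The helper name `build_training_frame` is hypothetical, and `is_late` is derived using the rule stated earlier (positive `AGE_REQ_DATE` means late).

```python
import pandas as pd

def build_training_frame(df, features, target="AGE_REQ_DATE"):
    """Select available features plus the target, drop rows without a target, add is_late."""
    available = [f for f in features if f in df.columns]
    train = df.loc[df[target].notna(), available + [target]].copy()
    train["is_late"] = (train[target] > 0).astype(int)  # 1 = late, 0 = on-time/early
    return train

# Made-up rows; 'Brand' is requested but absent, and one target is missing
demo = pd.DataFrame({
    "Plant": ["P1", "P2", "P3"],
    "DELIVERY_QTY": [10, 5, 8],
    "AGE_REQ_DATE": [4.0, None, -2.0],
})
train = build_training_frame(demo, ["Plant", "Brand", "DELIVERY_QTY"])
print(train)
```

Columns derived from the target, such as the aging buckets, are deliberately left out of the feature list here because they would leak the answer into the model.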