# Hybrid Forecast Generation - Complete Example

This notebook demonstrates the hybridization algorithm that consolidates Time Series (TS) and Machine Learning (ML) forecasts into a single hybrid forecast.

## Business Rules

The algorithm applies three rules:

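1. **ML Forecast** - Use for promo demand (segment not Retired), the Short lifecycle segment, or new assortment
2. **TS Forecast** - Use for Retired or Low Volume segments with a low TS forecast (≤ 0.01)
3. **Ensemble (Average)** - Use for all other cases; when the TS forecast is ≤ 0.01 the TS value is kept instead of the average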
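
The rule precedence above can be sketched as a small vectorized selector. This is only an illustration and not the `hybridization` implementation: the function name `pick_forecast_source`, the column name `TS_FORECAST_VALUE`, the lowercase source labels, and the assumption that Rule 1 is checked before Rule 2 are all hypothetical.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.01  # assumed zero-demand threshold, mirroring the value quoted above


def pick_forecast_source(df: pd.DataFrame, threshold: float = THRESHOLD) -> pd.Series:
    """Return 'ml', 'ts', or 'ensemble' per row (illustrative rule order)."""
    is_retired = df["SEGMENT_NAME"].eq("Retired")
    # Rule 1: promo (not retired), short lifecycle, or new assortment
    use_ml = (
        (df["DEMAND_TYPE"].eq("promo") & ~is_retired)
        | df["SEGMENT_NAME"].eq("Short")
        | df["ASSORTMENT_TYPE"].eq("new")
    )
    # Rule 2: Retired / Low Volume with a TS forecast at or below the threshold
    use_ts = (
        df["SEGMENT_NAME"].isin(["Retired", "Low Volume"])
        & df["TS_FORECAST_VALUE"].le(threshold)
    )
    # np.select takes the first matching condition, so Rule 1 wins over Rule 2
    labels = np.select([use_ml, use_ts], ["ml", "ts"], default="ensemble")
    return pd.Series(labels, index=df.index, name="FORECAST_SOURCE")


rows = pd.DataFrame({
    "SEGMENT_NAME": ["Regular", "Retired", "Regular"],
    "DEMAND_TYPE": ["promo", "regular", "regular"],
    "ASSORTMENT_TYPE": ["old", "old", "old"],
    "TS_FORECAST_VALUE": [80.0, 0.005, 75.0],
})
print(pick_forecast_source(rows).tolist())  # ['ml', 'ts', 'ensemble']
```

Using `np.select` keeps the precedence explicit in one place; if the real function evaluates the rules in a different order, only the order of the condition list would need to change.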

## Setup and Imports

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
from datetime import datetime
from hybridization import hybridization, IB_ZERO_DEMAND_THRESHOLD

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Create Sample Data

The sample data below covers all three rules plus the edge cases:
- **Cases 1-3**: ML forecast (promo, short lifecycle, new assortment)
- **Cases 4-5**: TS forecast (Retired/Low Volume with a low TS forecast)
- **Case 6**: Ensemble (regular case, average of TS and ML)
- **Cases 7-8**: Missing value handling (one of the two forecasts is NaN)
- **Case 9**: Ensemble with a TS forecast at or below the threshold (segment is not Retired/Low Volume), so the TS value is used

In [None]:
sample_data = pd.DataFrame([
    # Case 1: Promo demand (not retired) -> Should use ML
    {
        'PRODUCT_LVL_ID': 'P001', 'LOCATION_LVL_ID': 'LOC_001',
        'CUSTOMER_LVL_ID': 'CUST_001', 'DISTR_CHANNEL_LVL_ID': 'CH_1',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 80.0, 'ML_FORECAST_VALUE': 120.0,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'promo', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 2: Short lifecycle -> Should use ML
    {
        'PRODUCT_LVL_ID': 'P002', 'LOCATION_LVL_ID': 'LOC_001',
        'CUSTOMER_LVL_ID': 'CUST_001', 'DISTR_CHANNEL_LVL_ID': 'CH_1',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 60.0, 'ML_FORECAST_VALUE': 95.0,
        'SEGMENT_NAME': 'Short', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 3: New assortment -> Should use ML
    {
        'PRODUCT_LVL_ID': 'P003', 'LOCATION_LVL_ID': 'LOC_001',
        'CUSTOMER_LVL_ID': 'CUST_001', 'DISTR_CHANNEL_LVL_ID': 'CH_1',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 40.0, 'ML_FORECAST_VALUE': 110.0,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'new'
    },
    # Case 4: Retired with low TS forecast -> Should use TS
    {
        'PRODUCT_LVL_ID': 'P004', 'LOCATION_LVL_ID': 'LOC_002',
        'CUSTOMER_LVL_ID': 'CUST_001', 'DISTR_CHANNEL_LVL_ID': 'CH_1',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 0.005, 'ML_FORECAST_VALUE': 50.0,
        'SEGMENT_NAME': 'Retired', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 5: Low Volume with low TS forecast -> Should use TS
    {
        'PRODUCT_LVL_ID': 'P005', 'LOCATION_LVL_ID': 'LOC_002',
        'CUSTOMER_LVL_ID': 'CUST_001', 'DISTR_CHANNEL_LVL_ID': 'CH_1',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 0.008, 'ML_FORECAST_VALUE': 35.0,
        'SEGMENT_NAME': 'Low Volume', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 6: Regular case -> Should use Ensemble (average)
    {
        'PRODUCT_LVL_ID': 'P006', 'LOCATION_LVL_ID': 'LOC_003',
        'CUSTOMER_LVL_ID': 'CUST_002', 'DISTR_CHANNEL_LVL_ID': 'CH_2',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 75.0, 'ML_FORECAST_VALUE': 85.0,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 7: Missing TS forecast -> Should use ML value (coalesce)
    {
        'PRODUCT_LVL_ID': 'P007', 'LOCATION_LVL_ID': 'LOC_003',
        'CUSTOMER_LVL_ID': 'CUST_002', 'DISTR_CHANNEL_LVL_ID': 'CH_2',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': np.nan, 'ML_FORECAST_VALUE': 65.0,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 8: Missing ML forecast -> Should use TS value (coalesce)
    {
        'PRODUCT_LVL_ID': 'P008', 'LOCATION_LVL_ID': 'LOC_003',
        'CUSTOMER_LVL_ID': 'CUST_002', 'DISTR_CHANNEL_LVL_ID': 'CH_2',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 90.0, 'ML_FORECAST_VALUE': np.nan,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    },
    # Case 9: Ensemble with low TS (not Retired/Low Volume) -> Should use TS due to threshold
    {
        'PRODUCT_LVL_ID': 'P009', 'LOCATION_LVL_ID': 'LOC_003',
        'CUSTOMER_LVL_ID': 'CUST_002', 'DISTR_CHANNEL_LVL_ID': 'CH_2',
        'PERIOD_DT': datetime(2023, 1, 1), 'PERIOD_END_DT': datetime(2023, 1, 2),
        'TS_FORECAST_VALUE_REC': 0.005, 'ML_FORECAST_VALUE': 50.0,
        'SEGMENT_NAME': 'Regular', 'DEMAND_TYPE': 'regular', 'ASSORTMENT_TYPE': 'old'
    }
])

print("Input Data (RECONCILED_FORECAST):")
print("=" * 100)
sample_data

## Apply Hybridization Algorithm

Now we'll apply the hybridization function to consolidate the forecasts.

In [None]:
# Apply hybridization with default threshold (0.01)
hybrid_forecast = hybridization(sample_data)

print("\nOutput Data (HYBRID_FORECAST):")
print("=" * 100)
hybrid_forecast

| Unnamed: 0 | date | SEGMENT_NAME | HYBRID_FORECAST_VALUE | CUSTOMER_LVL_ID1 | DISTR_CHANNEL_LVL_ID | VF_FORECAST_VALUE | ML_FORECAST_VALUE | MID_RECONCILED_FORECAST | DEMAND_TYPE | ASSORTMENT_TYPE | PRODUCT_LVL_ID1 | LOCATION_LVL_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-12-01 | Low Volume | 0.405073 | 928845 | 2 | 0.985982 | 0.209550 | 0.503854 | regular | old | 12311 | 10000 |
| 1 | 2002-12-02 | Low Volume | 0.074648 | 917788 | 3 | 0.029834 | 0.600605 | 0.223685 | promo | old | 12311 | 10000 |
| 2 | 2002-12-03 | Low Volume | 0.057117 | 926685 | 4 | 0.040331 | 0.409552 | 0.755967 | promo | new | 10091 | 10000 |
| 3 | 2002-12-04 | Retired | 0.800762 | 919984 | 3 | 0.591651 | 0.776989 | 0.633118 | regular | new | 10084 | 10000 |
| 4 | 2002-12-05 | Low Volume | 1.810381 | 962662 | 1 | 0.369578 | 0.992910 | 0.303877 | promo | new | 12311 | 10000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7122 | 2022-06-01 | Retired | 1.928949 | 951021 | 3 | 1.110048 | 0.947514 | 1.553459 | promo | old | 10084 | 10000 |
| 7123 | 2022-06-02 | Short | 0.699601 | 939587 | 3 | 2.478038 | 1.329466 | 0.292256 | promo | new | 12311 | 10000 |
| 7124 | 2022-06-03 | Short | 0.251907 | 978303 | 2 | 0.384645 | 0.086105 | 0.187996 | promo | new | 10091 | 10000 |
| 7125 | 2022-06-04 | Low Volume | 0.554015 | 970922 | 1 | 0.618104 | 0.148942 | 0.805203 | promo | old | 10091 | 10000 |


## Analyze Results

Let's examine the key columns to see how the hybridization worked.

In [None]:
# Display key columns
key_columns = [
    'PRODUCT_LVL_ID', 'SEGMENT_NAME', 'DEMAND_TYPE', 'ASSORTMENT_TYPE',
    'TS_FORECAST_VALUE', 'ML_FORECAST_VALUE', 
    'HYBRID_FORECAST_VALUE', 'ENSEMBLE_FORECAST_VALUE', 'FORECAST_SOURCE'
]

print("\nSummary View:")
print("=" * 100)
hybrid_forecast[key_columns]

| Unnamed: 0 | date | SEGMENT_NAME | HYBRID_FORECAST_VALUE | CUSTOMER_LVL_ID1 | DISTR_CHANNEL_LVL_ID | VF_FORECAST_VALUE | ML_FORECAST_VALUE | MID_RECONCILED_FORECAST | DEMAND_TYPE | ASSORTMENT_TYPE | PRODUCT_LVL_ID1 | LOCATION_LVL_ID | FORECAST_SOURCE | ENSEMBLE_FORECAST_VALUE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-12-01 | Low Volume | 0.001000 | 928845 | 2 | 0.985982 | 0.209550 | 0.503854 | regular | old | 12311 | 10000 | vf | |
| 1 | 2002-12-02 | Low Volume | 0.600605 | 917788 | 3 | 0.029834 | 0.600605 | 0.223685 | promo | old | 12311 | 10000 | ml | |
| 2 | 2002-12-03 | Low Volume | 0.409552 | 926685 | 4 | 0.040331 | 0.409552 | 0.755967 | promo | new | 10091 | 10000 | ml | |
| 3 | 2002-12-04 | Retired | 0.776989 | 919984 | 3 | 0.591651 | 0.776989 | 0.633118 | regular | new | 10084 | 10000 | ml | |
| 4 | 2002-12-05 | Low Volume | 0.992910 | 962662 | 1 | 0.369578 | 0.992910 | 0.303877 | promo | new | 12311 | 10000 | ml | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7122 | 2022-06-01 | Retired | 0.001000 | 951021 | 3 | 1.110048 | 0.947514 | 1.553459 | promo | old | 10084 | 10000 | vf | |
| 7123 | 2022-06-02 | Short | 1.329466 | 939587 | 3 | 2.478038 | 1.329466 | 0.292256 | promo | new | 12311 | 10000 | ml | |
| 7124 | 2022-06-03 | Short | 0.086105 | 978303 | 2 | 0.384645 | 0.086105 | 0.187996 | promo | new | 10091 | 10000 | ml | |
| 7125 | 2022-06-04 | Low Volume | 0.148942 | 970922 | 1 | 0.618104 | 0.148942 | 0.805203 | promo | old | 10091 | 10000 | ml | |


## Forecast Source Distribution

Let's see how many records fall into each category (ML, TS, or Ensemble).


In [None]:
print("Forecast Source Distribution:")
print("=" * 50)
print(hybrid_forecast['FORECAST_SOURCE'].value_counts())

print("\n\nPercentage Distribution:")
print("=" * 50)
print(hybrid_forecast['FORECAST_SOURCE'].value_counts(normalize=True) * 100)


## Detailed Analysis by Forecast Source

### ML Forecast Cases


In [None]:
ml_cases = hybrid_forecast[hybrid_forecast['FORECAST_SOURCE'] == 'ml']

print("ML Forecast Cases (Rule 1: Promo/Short/New):")
print("=" * 100)
print(ml_cases[key_columns])

print("\n\nExplanation:")
print("These cases use ML forecast because they meet one of these conditions:")
print("  1. Promo demand (not retired segment)")
print("  2. Short lifecycle products")
print("  3. New assortment items")


### TS Forecast Cases


In [None]:
ts_cases = hybrid_forecast[hybrid_forecast['FORECAST_SOURCE'] == 'ts']

print("TS Forecast Cases (Rule 2: Retired/Low Volume with low forecast):")
print("=" * 100)
print(ts_cases[key_columns])

# Only the line that interpolates the threshold needs to be an f-string
print("\n\nExplanation:")
print("These cases use TS forecast because:")
print("  1. Segment is 'Retired' OR 'Low Volume'")
print(f"  2. TS forecast <= {IB_ZERO_DEMAND_THRESHOLD} (zero demand threshold)")
print("\nThis indicates minimal expected future demand.")


### Ensemble Forecast Cases


In [None]:
ensemble_cases = hybrid_forecast[hybrid_forecast['FORECAST_SOURCE'] == 'ensemble']

print("Ensemble Forecast Cases (Rule 3: Average of TS and ML or TS if low):")
print("=" * 100)
print(ensemble_cases[key_columns])

print("\n\nExplanation:")
print("These cases use ensemble because:")
print("  - They don't meet conditions for ML or TS rules")
print("  - If TS_FORECAST_VALUE <= threshold, HYBRID_FORECAST_VALUE = TS_FORECAST_VALUE")
print("  - Otherwise, HYBRID_FORECAST_VALUE = Average(TS_FORECAST_VALUE, ML_FORECAST_VALUE)")
print("\nNote: Missing values are handled gracefully - if one forecast is missing,")
print("the other is used. ENSEMBLE_FORECAST_VALUE shows the calculated average.")
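
The fallback described in the note can be reproduced with a row-wise mean, which skips NaN by default in pandas. This is a minimal sketch using assumed column names; this notebook does not confirm that `hybridization` computes the average this way.

```python
import numpy as np
import pandas as pd

# Values mirror Cases 6-8 of the sample data: both present, TS missing, ML missing
forecasts = pd.DataFrame({
    "TS_FORECAST_VALUE": [75.0, np.nan, 90.0],
    "ML_FORECAST_VALUE": [85.0, 65.0, np.nan],
})

# DataFrame.mean(axis=1) ignores NaN (skipna=True by default), so a row with
# one missing forecast falls back to the forecast that is present.
ensemble = forecasts.mean(axis=1)
print(ensemble.tolist())  # [80.0, 65.0, 90.0]
```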


## Verify Ensemble Calculations

Let's verify that the ensemble calculations are correct.


In [None]:
print("Ensemble Calculation Verification:")
print("=" * 100)

def fmt(value):
    """Format a forecast value to three decimals, or 'NaN' when it is missing."""
    # A conditional cannot live inside an f-string format spec, so format here.
    return f"{value:.3f}" if pd.notna(value) else "NaN"


for _, row in ensemble_cases.iterrows():
    ts_val = row['TS_FORECAST_VALUE']
    ml_val = row['ML_FORECAST_VALUE']
    hybrid_val = row['HYBRID_FORECAST_VALUE']
    ensemble_val = row['ENSEMBLE_FORECAST_VALUE']

    if pd.notna(ts_val) and ts_val <= IB_ZERO_DEMAND_THRESHOLD:
        # Low TS forecast: the TS value is kept instead of the average
        expected = ts_val
        expected_type = 'TS (low threshold)'
    else:
        # Average whichever forecasts are present
        values = [v for v in [ts_val, ml_val] if pd.notna(v)]
        expected = np.mean(values) if values else np.nan
        expected_type = 'Average'

    print(f"\nProduct {row['PRODUCT_LVL_ID']}:")
    print(f"  TS:       {fmt(ts_val)}")
    print(f"  ML:       {fmt(ml_val)}")
    print(f"  Hybrid:   {fmt(hybrid_val)}")
    print(f"  Ensemble: {fmt(ensemble_val)}")
    print(f"  Expected: {fmt(expected)} ({expected_type})")
    print(f"  Match:    {np.isclose(hybrid_val, expected, equal_nan=True)}")
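
The same check can be written without a Python loop. The sketch below is self-contained and runs on hand-built rows that mirror Cases 6-9. The `HYBRID_FORECAST_VALUE` entries are hand-computed from the rules described in this notebook, not taken from a real `hybridization` run, and `THRESHOLD` is assumed to equal `IB_ZERO_DEMAND_THRESHOLD`.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.01  # assumed to equal IB_ZERO_DEMAND_THRESHOLD

# Hypothetical ensemble rows mirroring Cases 6-9 of the sample data
checks = pd.DataFrame({
    "TS_FORECAST_VALUE": [75.0, np.nan, 90.0, 0.005],
    "ML_FORECAST_VALUE": [85.0, 65.0, np.nan, 50.0],
    "HYBRID_FORECAST_VALUE": [80.0, 65.0, 90.0, 0.005],
})

ts = checks["TS_FORECAST_VALUE"]
average = checks[["TS_FORECAST_VALUE", "ML_FORECAST_VALUE"]].mean(axis=1)
# NaN <= THRESHOLD evaluates to False, so rows with a missing TS use the average
expected = np.where(ts <= THRESHOLD, ts, average)

is_match = np.isclose(checks["HYBRID_FORECAST_VALUE"], expected, equal_nan=True)
mismatches = checks[~is_match]
print(f"{len(mismatches)} mismatching rows")
```

On real output, `checks` would be replaced by the ensemble subset of the hybrid forecast, and a non-empty `mismatches` frame would point at the rows to inspect.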


## Summary Statistics


In [None]:
print("Hybrid Forecast Value Statistics by Source:")
print("=" * 100)

for source in ['ml', 'ts', 'ensemble']:
    subset = hybrid_forecast[hybrid_forecast['FORECAST_SOURCE'] == source]
    if len(subset) > 0:
        print(f"\n{source.upper()} Forecast Statistics:")
        print("-" * 50)
        print(subset['HYBRID_FORECAST_VALUE'].describe())
    else:
        print(f"\n{source.upper()} Forecast: No records")


## Conclusion

This notebook demonstrated the hybrid forecast generation algorithm on nine sample cases.

### Key Takeaways:

1. **ML forecasts** (3 cases) - Used for promotional, short-lifecycle, and new assortment items where ML models excel
2. **TS forecasts** (2 cases) - Used for retired or low-volume products with minimal expected demand
3. **Ensemble forecasts** (4 cases) - Average of TS and ML for regular cases, falling back to the available forecast when one is missing and to the TS value when it is at or below the threshold

### Business Value:

- **Flexibility**: Different forecast sources for different scenarios
- **Accuracy**: Leverages strengths of both TS and ML models
- **Robustness**: Handles missing values gracefully
- **Transparency**: Clear indication of which forecast source was used

### Next Steps:

- Apply to your actual reconciled forecast data
- Adjust `ib_zero_demand_threshold` if needed
- Integrate with downstream steps (disaggregation, disaccumulation)
- Monitor forecast accuracy by source type
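
One way to monitor accuracy by source type is a weighted absolute percentage error (WAPE) per `FORECAST_SOURCE`. The sketch below is an assumption-laden illustration: `ACTUAL_VALUE` is a hypothetical column of realized demand that does not appear in this notebook, and the numbers are made up for the example.

```python
import pandas as pd

# Hypothetical scored data: ACTUAL_VALUE is an assumed column of realized demand
scored = pd.DataFrame({
    "FORECAST_SOURCE": ["ml", "ml", "ts", "ensemble", "ensemble"],
    "HYBRID_FORECAST_VALUE": [120.0, 95.0, 0.0, 80.0, 65.0],
    "ACTUAL_VALUE": [100.0, 100.0, 0.0, 100.0, 50.0],
})

scored["ABS_ERROR"] = (scored["HYBRID_FORECAST_VALUE"] - scored["ACTUAL_VALUE"]).abs()
totals = scored.groupby("FORECAST_SOURCE")[["ABS_ERROR", "ACTUAL_VALUE"]].sum()

# WAPE = sum(|forecast - actual|) / sum(actual).
# Sources with zero total demand get NaN instead of a division by zero.
denominator = totals["ACTUAL_VALUE"].where(totals["ACTUAL_VALUE"] > 0)
wape = (totals["ABS_ERROR"] / denominator).rename("WAPE")
print(wape)
```

WAPE is used here instead of a per-row percentage error because retired and low-volume items often have zero actual demand, which makes row-level percentages undefined.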
