# Batch Prediction for Demand Forecasting

This notebook uses the trained XGBoost model to generate weekly demand predictions for a specified period. The key feature of this process is that it is **iterative**: the model predicts demand one week at a time, and each week's prediction is then used as an input feature (the most recent sales lag) for the following week. This simulates a real-world scenario in which the most recent sales figures inform the next forecast.

## 1. Loading Libraries and Modules

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import pickle
import yaml
import os
import utils_gpu

# Set pandas display options
pd.options.display.max_columns = None

print("Libraries loaded successfully.")

Libraries loaded successfully.


## 2. Loading Model Artifacts and Configuration

We need to load two key artifacts from the training phase:
1.  **The trained `pipeline.pkl`**: This file contains the complete Scikit-learn pipeline, including all preprocessing steps and the final XGBoost model.
2.  **The `model_params.yml` file**: This configuration file contains the list of features the model was trained on, ensuring we use the exact same features for prediction.

In [None]:
# --- Configuration ---
# IMPORTANT: Update this path to point to the directory where your trained model is saved.
# This path should contain a 'model/pipeline.pkl' and a 'params/exeperiment_params.json' or the original 'model_params.yml'
MODEL_OUTPUT_PATH = '../02_models/outputs/202509202132_XGBOOST/' #<-- PLEASE UPDATE THIS PATH
PARAMS_FILE_PATH = '../02_models/model_params.yml'

# --- Load Model ---
model_path = os.path.join(MODEL_OUTPUT_PATH, 'model', 'pipeline.pkl')
print(f"Loading model from: {model_path}")
with open(model_path, 'rb') as f:
    pipeline = pickle.load(f)
print("Model pipeline loaded successfully.")

# --- Load Parameters ---
print(f"Loading parameters from: {PARAMS_FILE_PATH}")
with open(PARAMS_FILE_PATH, "r") as f:
    configs = yaml.safe_load(f)

FEATURE_COLUMNS = configs['FEATURES']
print(f"Model expects {len(FEATURE_COLUMNS)} features.")

Loading model from: ../02_models/outputs/202509202132_XGBOOST/model/pipeline.pkl
Model pipeline loaded successfully.
Loading parameters from: ../02_models/model_params.yml
Model expects 24 features.


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
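Because the pipeline must receive exactly the features listed in the configuration, it can be worth checking that none of them is missing before predicting. The sketch below is a minimal illustration; the helper name `find_missing_features` and the toy column names are assumptions, not part of the project code.

```python
import pandas as pd


def find_missing_features(df: pd.DataFrame, feature_columns: list) -> list:
    """Return the expected feature names that are absent from df."""
    return [col for col in feature_columns if col not in df.columns]


# Toy frame with made-up columns, standing in for the production data
toy = pd.DataFrame({"month": [1], "holiday": [0]})
missing = find_missing_features(toy, ["month", "holiday", "quantity_lag1"])
print(missing)  # ['quantity_lag1']
```

In the notebook this would be called as `find_missing_features(df_prod, FEATURE_COLUMNS)` once the production data is loaded, raising an error if the result is not empty.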


## 3. Loading and Preparing Production Data

In [None]:
DATA_PATH = '../../data/processed/processed_production_quarter_items.parquet'

print(f"Loading production data from: {DATA_PATH}")
df_prod = pd.read_parquet(DATA_PATH)

# Ensure data types are correct and sort the data chronologically
df_prod['week_of_year'] = df_prod['week_of_year'].astype(int)
df_prod = df_prod.sort_values(["week_of_year", "internal_product_id", "internal_store_id"]).reset_index(drop=True)

print(f"The production dataset contains {df_prod.shape[0]} rows and {df_prod.shape[1]} columns.")
df_prod.head()

Loading production data from: ../../data/processed/processed_production_quarter_items.parquet
The production dataset contains 1634745 rows and 27 columns.


| | internal_product_id | internal_store_id | distributor_id | premise | categoria_pdv | zipcode | tipos | label | subcategoria | marca | fabricante | month | week_of_year | city | holiday | previous_month_quantity_sum | previous_month_gross_value_sum | previous_month_net_value_sum | previous_month_gross_profit_sum | previous_month_discount_sum | quantity_lag1 | quantity_lag2 | quantity_lag3 | quantity_lag4 | quantity_lag5 | discount_rate_month | profit_margin_month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000423277513436210 | 3820079530528229795 | 4 | On Premise | Bar | 30188 | Draft | Core | Lager / Pilsner | Reformation A Cold One 97.1 Pilsner | Reformation Brewery | 1 | 0 | Woodstock | 0 | 1.0 | 99.0 | 93.5 | 28.5 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 0.287879 |
| 1 | 1000423277513436210 | 7360454609632053666 | 4 | On Premise | Restaurant | 30143 | Draft | Core | Lager / Pilsner | Reformation A Cold One 97.1 Pilsner | Reformation Brewery | 1 | 0 | Jasper | 0 | 3.0 | 297.0 | 280.5 | 85.5 | 0.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | 0.0 | 0.287879 |
| 2 | 1004943868572044494 | 3712962728699172861 | 4 | Off Premise | Package/Liquor | 30030 | Distilled Spirits | Core | Rum | La Favorite Rhum Agricole Blanc | Caribbean Spirits | 1 | 0 | Decatur | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 |
| 3 | 1009179103632945474 | 1001371918471115422 | 4 | Off Premise | Convenience | 30175 | Package | Core | Lager | Busch | AB Anheuser Busch Inc | 1 | 0 | Talking Rock | 0 | 7.0 | 145.424997 | 124.147261 | 30.459262 | 12.95 | 2.0 | 2.0 | 2.0 | 1.0 | NaN | 0.089049 | 0.20945 |
| 4 | 1009179103632945474 | 1004779246734143594 | 4 | Off Premise | Package/Liquor | 30013 | Package | Core | Lager | Busch | AB Anheuser Busch Inc | 1 | 0 | Conyers | 0 | 289.0 | 5960.458004 | 5081.991375 | 1214.015278 | 592.200004 | 9.0 | NaN | NaN | NaN | NaN | 0.099355 | 0.203678 |


### 3.1 Applying Manual Pre-Pipeline Transformations

Here, we replicate the manual data cleaning steps that were performed during training *before* the main preprocessing pipeline was applied. This ensures the prediction data has the exact same structure and format as the training data.

In [None]:
print("Applying manual pre-processing steps to match training data format...")

# 1. Handle '_sum' columns: Fill NaN values with 0
print("Filling NaN values in '_sum' columns with 0.")
# Match on the '_sum' suffix so unrelated columns that merely contain 'sum' are left alone
cols_sum = [col for col in df_prod.columns if col.endswith('_sum')]
df_prod[cols_sum] = df_prod[cols_sum].fillna(0)

# 2. Handle 'premise' column
# In training, rows with NaN 'premise' were dropped. We replicate that here.
initial_rows = len(df_prod)
df_prod.dropna(subset=['premise'], inplace=True)
drop_premise_len = len(df_prod)

if initial_rows > drop_premise_len:
    print(f"Dropped {initial_rows - drop_premise_len} rows with missing 'premise' value.")

# 3. Drop rows labelled 'Other'
df_prod = df_prod[df_prod['label'] != 'Other'].reset_index(drop=True)
drop_other_len = len(df_prod)

if drop_premise_len > drop_other_len:
    print(f"Dropped {drop_premise_len - drop_other_len} rows with label 'Other'.")

# 4. Merge the two spellings of the same label: 'Specialty' -> 'Speciality'
df_prod['label'] = df_prod['label'].replace({'Specialty': 'Speciality'})

# 5. Relabel 'Allocated Spirits' rows in the 'Liqueurs & Cordials' subcategory as 'Discontinued'
discontinued_mask = (
    (df_prod['tipos'] == 'Allocated Spirits')
    & (df_prod['subcategoria'] == 'Liqueurs & Cordials')
)
df_prod.loc[discontinued_mask, 'label'] = 'Discontinued'

# 6. Map 'premise' string values to integer values
print("Mapping 'premise' column to integers.")
df_prod['premise'] = df_prod['premise'].map({'Off Premise': 1, 'On Premise': 0}).astype(int)

print("Manual pre-processing complete.")

Applying manual pre-processing steps to match training data format...
Filling NaN values in '_sum' columns with 0.
Mapping 'premise' column to integers.
Manual pre-processing complete.
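One way to keep these manual steps identical between training and prediction is to wrap them in a single function that both notebooks import. The sketch below mirrors the steps above on a toy frame; the function name `apply_manual_preprocessing` and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def apply_manual_preprocessing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the manual cleaning steps and return a new DataFrame."""
    df = df.copy()
    # Fill missing '_sum' aggregates with 0
    sum_cols = [c for c in df.columns if c.endswith("_sum")]
    df[sum_cols] = df[sum_cols].fillna(0)
    # Drop rows without 'premise' and rows labelled 'Other'
    df = df.dropna(subset=["premise"])
    df = df[df["label"] != "Other"].reset_index(drop=True)
    # Merge the two spellings of the same label
    df["label"] = df["label"].replace({"Specialty": "Speciality"})
    # Relabel one specific type/subcategory combination
    mask = (df["tipos"] == "Allocated Spirits") & (
        df["subcategoria"] == "Liqueurs & Cordials"
    )
    df.loc[mask, "label"] = "Discontinued"
    # Encode 'premise' as an integer
    df["premise"] = df["premise"].map({"Off Premise": 1, "On Premise": 0}).astype(int)
    return df


# Toy rows (made up) covering each rule
toy = pd.DataFrame({
    "premise": ["On Premise", None, "Off Premise", "Off Premise", "On Premise"],
    "label": ["Core", "Core", "Other", "Specialty", "Core"],
    "tipos": ["Draft", "Draft", "Package", "Package", "Allocated Spirits"],
    "subcategoria": ["Lager", "Lager", "Lager", "Lager", "Liqueurs & Cordials"],
    "previous_month_quantity_sum": [1.0, 2.0, 3.0, None, 5.0],
})
clean = apply_manual_preprocessing(toy)
print(clean[["premise", "label", "previous_month_quantity_sum"]])
```

Because the function works on a copy, the input frame is left untouched, which makes it easier to compare the data before and after cleaning.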


## 4. Iterative Prediction Loop

This is the core of the notebook. We loop over every week present in the production data (weeks 0 to 4 here; they are shifted to 1 to 5 when the results are saved). In each iteration, we:

1.  **Predict**: Use the model to predict the quantity for the `current_week`.
2.  **Store**: Save the rounded predictions in a new `quantity` column.
3.  **Update Lags**: If it's not the last week, we update the `quantity_lag` features for the `next_week`. The prediction we just made for `week_n` becomes `quantity_lag1` for `week_n+1`. The old `quantity_lag1` becomes `quantity_lag2`, and so on.
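The three steps above can be illustrated on a single product/store series. This is a toy sketch only: `toy_model` is a hypothetical stand-in for `pipeline.predict` (it just averages the lags), and the starting lag values are made up.

```python
def shift_lags(lags, new_value):
    """new_value becomes lag1; every older lag moves one slot back; the oldest is dropped."""
    return [new_value] + lags[:-1]


def toy_model(lags):
    # Hypothetical stand-in for the real model: rounded mean of the lags
    return round(sum(lags) / len(lags))


lags = [10, 8, 6, 4, 2]  # quantity_lag1 .. quantity_lag5 (made-up values)
forecasts = []
for week in range(3):
    prediction = toy_model(lags)          # 1. predict
    forecasts.append(prediction)          # 2. store
    lags = shift_lags(lags, prediction)   # 3. update lags for the next week

print(forecasts)  # [6, 7, 7]
print(lags)       # [7, 7, 6, 10, 8]
```

Note how, after three weeks, the three most recent lags are predictions rather than observed sales, so errors in early weeks can carry over into later ones.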

In [None]:
print("Starting iterative prediction process...")

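# Initialize the quantity column that will hold our predictions
df_prod['quantity'] = 0

# Build a composite key identifying each (product, store, distributor) series.
# Casting to str keeps the concatenation valid even if an ID column is numeric.
df_prod['key'] = (
    df_prod['internal_product_id'].astype(str) + ' '
    + df_prod['internal_store_id'].astype(str) + ' '
    + df_prod['distributor_id'].astype(str)
)
df_prod = df_prod.sort_values(["week_of_year", "internal_product_id", "internal_store_id", "distributor_id"]).reset_index(drop=True)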

weeks_to_predict = sorted(df_prod['week_of_year'].unique())
print(f"Will predict for weeks: {weeks_to_predict}")

for i, current_week in enumerate(weeks_to_predict):
    print(f"\n--- Predicting for Week {current_week} ---")

    # --- 1. PREDICT ---
    # Select data for the current week
    week_mask = df_prod['week_of_year'] == current_week
    df_current_week = df_prod[week_mask]

    if df_current_week.empty:
        print(f"No data found for week {current_week}. Skipping.")
        continue

    # Ensure we only use the features the model was trained on
    X_pred = df_current_week[FEATURE_COLUMNS]

    # Make predictions
    predictions = pipeline.predict(X_pred)

    # --- 2. STORE ---
    # Round predictions to nearest whole number, as we can't sell fractions of products
    # and ensure they are non-negative.
    rounded_predictions = np.round(predictions).astype(int)
    rounded_predictions[rounded_predictions < 0] = 0

    # Store the predictions in our main dataframe
    df_prod.loc[week_mask, 'quantity'] = rounded_predictions
    print(f"Stored {len(rounded_predictions)} predictions for week {current_week}.")

    # --- 3. UPDATE LAGS ---
    # Check if this is not the last week in our prediction sequence
    if current_week < weeks_to_predict[-1]:
        # Select data for the next week
        next_week = weeks_to_predict[i + 1]
        next_week_mask = df_prod['week_of_year'] == next_week
        print(f"Updating lag features for next week ({next_week})...")

        # Find the (product, store, distributor) keys that appear in both the
        # current and the next week; only those rows can inherit lag values.
        # The 'key' column built before the loop is reused here.
        common_pairs = df_prod.loc[week_mask | next_week_mask, ['key']]
        repeated_items = common_pairs.loc[
            common_pairs.duplicated(subset=['key']),
            ['key']
        ]

        df_next_week_duplicated = df_prod[
            next_week_mask & df_prod['key'].isin(repeated_items['key'].values)
        ].sort_values([
            "week_of_year",
            "internal_product_id",
            "internal_store_id",
            "distributor_id"
        ]).reset_index(drop=True)
        df_current_week_duplicated = df_prod[
            week_mask & df_prod['key'].isin(repeated_items['key'].values)
        ].sort_values([
            "week_of_year",
            "internal_product_id",
            "internal_store_id",
            "distributor_id"
        ]).reset_index(drop=True)

        # Rows in the two frames line up positionally because both are restricted
        # to the same keys and sorted the same way (this assumes each key appears
        # at most once per week).

        # Shift the existing lags one step back (lag4 -> lag5, ..., lag1 -> lag2),
        # then use the prediction just made as the new lag1.
        for lag in range(4, 0, -1):
            df_next_week_duplicated[f'quantity_lag{lag+1}'] = df_current_week_duplicated[f'quantity_lag{lag}']
        df_next_week_duplicated['quantity_lag1'] = df_current_week_duplicated['quantity']

        # Write the updated lags back to next week's rows in the main dataframe.
        # The loop variable is named `lag` so it does not shadow the outer `i`.
        update_mask = next_week_mask & df_prod['key'].isin(repeated_items['key'].values)
        for lag in range(1, 6):
            df_prod.loc[update_mask, f'quantity_lag{lag}'] = df_next_week_duplicated[f'quantity_lag{lag}'].values
        print(f"Lag features updated for week {next_week}.")

print("\nIterative prediction process completed.")

Starting iterative prediction process...
Will predict for weeks: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4)]

--- Predicting for Week 0 ---
Stored 326949 predictions for week 0.
Updating lag features for next week (1)...
Lag features updated for week 1.

--- Predicting for Week 1 ---
Stored 326949 predictions for week 1.
Updating lag features for next week (2)...
Lag features updated for week 2.

--- Predicting for Week 2 ---
Stored 326949 predictions for week 2.
Updating lag features for next week (3)...
Lag features updated for week 3.

--- Predicting for Week 3 ---
Stored 326949 predictions for week 3.
Updating lag features for next week (4)...
Lag features updated for week 4.

--- Predicting for Week 4 ---
Stored 326949 predictions for week 4.

Iterative prediction process completed.
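The lag update in the loop aligns current-week and next-week rows by position after sorting, which works as long as every key occurs at most once per week. A key-based merge is one possible alternative that makes this assumption explicit. The sketch below is not the notebook's method: the helper name `next_week_lags` and the toy frames are assumptions, and it relies on a `key` column like the one built before the loop.

```python
import pandas as pd

LAG_COLS = [f"quantity_lag{k}" for k in range(1, 6)]


def next_week_lags(current: pd.DataFrame, nxt: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of nxt whose lag columns are rebuilt from current, matched on 'key'."""
    shifted = current[["key"]].copy()
    shifted["quantity_lag1"] = current["quantity"].values
    for k in range(1, 5):
        shifted[f"quantity_lag{k + 1}"] = current[f"quantity_lag{k}"].values
    # validate= raises if a key is duplicated within the current week
    merged = nxt[["key"]].merge(
        shifted, on="key", how="left", validate="many_to_one", indicator=True
    )
    has_match = (merged["_merge"] == "both").to_numpy()
    out = nxt.copy()
    # Rows without a match in the current week keep their original lags
    out.loc[has_match, LAG_COLS] = merged.loc[has_match, LAG_COLS].to_numpy()
    return out


nan = float("nan")
current = pd.DataFrame({
    "key": ["a", "b"],
    "quantity": [5, 2],
    "quantity_lag1": [4.0, 1.0],
    "quantity_lag2": [3.0, nan],
    "quantity_lag3": [nan, nan],
    "quantity_lag4": [nan, nan],
    "quantity_lag5": [nan, nan],
})
nxt = pd.DataFrame({"key": ["b", "c", "a"], **{c: [9.0, 9.0, 9.0] for c in LAG_COLS}})
updated = next_week_lags(current, nxt)
print(updated)
```

In this toy run, key `c` has no row in the current week and keeps its existing lags, while `a` and `b` receive the shifted values regardless of row order.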


## 5. Format and Save Final Results

Finally, we format the output DataFrame to match the required schema and save it as a Parquet file.

In [None]:
print("Formatting final results...")

# Define the final column mapping and order
FINAL_COLUMNS = {
    'week_of_year': 'semana',
    'internal_store_id': 'pdv',
    'internal_product_id': 'produto',
    'quantity': 'quantidade'
}

# Create the final dataframe
df_final = df_prod[list(FINAL_COLUMNS.keys())].copy()
df_final.rename(columns=FINAL_COLUMNS, inplace=True)

# Weeks are 0-based in the data; shift them to the 1-based numbering used in the output
df_final['semana'] = df_final['semana'] + 1

# Sum quantities per (week, store, product) so each combination appears exactly once
df_final = df_final.groupby(['semana', 'pdv', 'produto'])['quantidade'].sum().reset_index()

# Cast all columns to integer type, as the output schema requires
for col in df_final.columns:
    df_final[col] = df_final[col].astype(int)

# --- Save Results ---
OUTPUT_FILENAME = '../../data/processed/predictions_january_2023.parquet'
df_final.to_parquet(OUTPUT_FILENAME, index=False)

print(f"Final predictions saved to '{OUTPUT_FILENAME}'")
print(f"Final dataframe shape: {df_final.shape}")
df_final.head()

Formatting final results...
Final predictions saved to '../../data/processed/predictions_january_2023.parquet'
Final dataframe shape: (1500000, 4)


| | semana | pdv | produto | quantidade |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1000237487041964405 | 5429216175252037173 | 1 |
| 1 | 1 | 1000237487041964405 | 777251454728290683 | 1 |
| 2 | 1 | 1001371918471115422 | 1009179103632945474 | 2 |
| 3 | 1 | 1001371918471115422 | 1029370090212151375 | 3 |
| 4 | 1 | 1001371918471115422 | 1120490062981954254 | 11 |
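Before handing the file over, a few sanity checks on the output schema can catch formatting mistakes early. The sketch below is a hypothetical helper (`validate_predictions` is not part of the project) that checks the column order, uniqueness of each (semana, pdv, produto) combination, non-negative quantities, and integer dtypes, using toy frames.

```python
import pandas as pd

EXPECTED_COLUMNS = ["semana", "pdv", "produto", "quantidade"]


def validate_predictions(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return [f"unexpected columns: {list(df.columns)}"]
    problems = []
    if df.duplicated(subset=["semana", "pdv", "produto"]).any():
        problems.append("duplicate (semana, pdv, produto) rows")
    if (df["quantidade"] < 0).any():
        problems.append("negative quantities")
    if not all(pd.api.types.is_integer_dtype(df[c]) for c in EXPECTED_COLUMNS):
        problems.append("non-integer column dtype")
    return problems


# Toy frames with made-up values
good = pd.DataFrame({"semana": [1, 1], "pdv": [10, 10], "produto": [7, 8], "quantidade": [1, 3]})
bad = pd.DataFrame({"semana": [1, 1], "pdv": [10, 10], "produto": [7, 7], "quantidade": [1, -3]})

print(validate_predictions(good))  # []
print(validate_predictions(bad))   # ['duplicate (semana, pdv, produto) rows', 'negative quantities']
```

In the notebook, the same call on `df_final` would be expected to return an empty list after the grouping and integer casting steps.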
