# Inventory planning

This notebook documents the steps taken to build and evaluate a predictive model for inventory planning.<br>
The goal is to predict the quantity of items sold based on historical sales data and various features.<br>
The model will help in making informed decisions for inventory management.

## Data exploration

In [1]:
import os
import pandas

cache_dir = "../.cache/"
data = pandas.read_csv(filepath_or_buffer=os.path.join(cache_dir, "datasets/type=raw/data.csv"))
data['DATE'] = pandas.to_datetime(data['DATE'])

### Current level exploration

In [2]:
sku_current_level_consistency = data.groupby('SKU')['CURRENT_LEVEL'].nunique()
print((sku_current_level_consistency == 1).all())

True


We observe that the CURRENT_LEVEL column is constant for each SKU: it takes a single value across all records of the same SKU.<br>
Because it never varies within an SKU, CURRENT_LEVEL carries no information beyond what the SKU itself already provides, so it cannot help the forecasting model.<br>
The CURRENT_LEVEL column is therefore excluded from the subsequent steps, keeping the focus on features that have a meaningful impact on the forecast.
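
The exclusion step is not shown in the notebook. A minimal sketch of how it could look, assuming a toy frame in place of the real data and a hypothetical helper named `drop_constant_per_group`:

```python
import pandas

# Toy frame standing in for the real data; the values are illustrative only.
toy = pandas.DataFrame({
    "SKU": ["A", "A", "B", "B"],
    "CURRENT_LEVEL": [10, 10, 25, 25],
    "QUANTITY_SOLD": [3, 5, 7, 2],
})


def drop_constant_per_group(frame, group_col, candidate_col):
    """Drop candidate_col if it takes a single value within every group.

    Hypothetical helper: it repeats the consistency check from the cell above
    and only drops the column when the check passes.
    """
    n_unique = frame.groupby(group_col)[candidate_col].nunique()
    if (n_unique == 1).all():
        return frame.drop(columns=[candidate_col])
    return frame


reduced = drop_constant_per_group(toy, "SKU", "CURRENT_LEVEL")
print(list(reduced.columns))
```

Tying the drop to the check means the column is kept if a later version of the data contains an SKU whose level does vary.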

### SKU based model

The decision to include SKU as a feature in your model depends on the context and the nature of your forecasting task. Here are some considerations:

* When to Include SKU as a Feature:
    * Multiple SKUs with Different Sales Patterns: If your dataset contains multiple SKUs with distinct sales patterns, including the SKU as a feature can help the model learn these differences. This is particularly useful for models that can handle categorical features well, such as tree-based models (e.g., Random Forests, Gradient Boosting Machines) or neural networks.
    * SKU-Specific Forecasting: If your goal is to create a unified model that forecasts sales for multiple SKUs simultaneously, including the SKU can help the model differentiate between the products.

* When to Exclude SKU as a Feature:
    * Single SKU or Homogeneous SKUs: If the dataset pertains to a single SKU or multiple SKUs with similar sales patterns, the SKU may not add significant value and could potentially introduce noise.
    * Separate Models for Each SKU: If you plan to build and optimize separate models for each SKU, including SKU as a feature in each individual model would be redundant.
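
When SKU is included in a unified model, it has to be encoded first. The sketch below shows two common encodings on a toy frame. The frame and the `SKU_CODE` column name are assumptions made for this example, not part of the notebook's pipeline.

```python
import pandas

# Toy frame standing in for the real data; the values are illustrative only.
toy = pandas.DataFrame({
    "SKU": ["A", "B", "A", "C"],
    "QUANTITY_SOLD": [3, 7, 5, 2],
})

# Option 1: one-hot encoding. Each SKU becomes its own indicator column,
# which works with any model, including linear regression.
one_hot = pandas.get_dummies(toy, columns=["SKU"], prefix="SKU")

# Option 2: integer codes from a categorical dtype. This is compact, but the
# codes imply an ordering, so it suits tree-based models better than linear ones.
toy["SKU_CODE"] = toy["SKU"].astype("category").cat.codes

print(list(one_hot.columns))
print(toy["SKU_CODE"].tolist())
```

One-hot encoding grows the feature matrix by one column per SKU, so integer codes or a model with native categorical support may be preferable when there are many SKUs.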

In [5]:
import scipy.stats as stats

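# One-way ANOVA: does the mean QUANTITY_SOLD differ across SKUs?
# Each group is the series of quantities sold for one SKU.
groups = [quantities for _, quantities in data.groupby('SKU')['QUANTITY_SOLD']]
anova_result = stats.f_oneway(*groups)
print('ANOVA result:', anova_result)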

ANOVA result: F_onewayResult(statistic=57.2815366135137, pvalue=0.0)


* The F-statistic of 57.28 is high: the variation in mean sales quantity between SKUs is large relative to the variation within each SKU.
* The p-value is reported as 0.0, meaning it is too small to be represented in floating point. This is very strong evidence that the differences in mean sales quantities across SKUs are statistically significant and not due to random chance.

The ANOVA test strongly rejects the null hypothesis that all SKUs share the same mean sales quantity (F = 57.28, p ≈ 0). Sales quantity therefore depends on the SKU, so SKU is an important feature and should be included in the forecasting model. Incorporating it lets the model capture the sales pattern associated with each product, which should lead to more accurate and reliable forecasts.
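
A significant F-test shows that the SKU means differ, but with many records even small differences can reach significance. One way to gauge how much of the variance SKU explains is eta squared, the ratio of the between-group sum of squares to the total sum of squares. This was not computed in the notebook. The sketch below uses a toy frame and a hypothetical helper named `eta_squared`.

```python
import pandas

# Toy frame standing in for the real data; the values are illustrative only.
toy = pandas.DataFrame({
    "SKU": ["A", "A", "B", "B", "C", "C"],
    "QUANTITY_SOLD": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
})


def eta_squared(frame, group_col, value_col):
    """Share of the total variance in value_col explained by group_col."""
    grand_mean = frame[value_col].mean()
    ss_total = ((frame[value_col] - grand_mean) ** 2).sum()
    group_stats = frame.groupby(group_col)[value_col].agg(["mean", "count"])
    ss_between = (
        group_stats["count"] * (group_stats["mean"] - grand_mean) ** 2
    ).sum()
    return ss_between / ss_total


effect = eta_squared(toy, "SKU", "QUANTITY_SOLD")
print(round(effect, 4))
```

Values close to 1 mean that SKU explains most of the variance in sales quantity, while values close to 0 mean that the significant result is mostly a consequence of sample size.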

# Model definition

I considered two approaches for the forecasting model: manually defining features or using a temporal model.<br>
Each approach has its own set of advantages and disadvantages that are important to consider.

## Manually Defining Features

### Pros:

* Flexibility: Manually generated features such as day of the week, month, quarter, and lagged variables can provide the model with rich contextual information.
* Model Agnostic: These features can be used with a wide range of machine learning models, from linear regression to tree-based models and neural networks.

### Cons:

* Manual Effort: This approach requires significant domain knowledge and effort to engineer relevant and effective features.
* Complexity: There is a risk of missing out on capturing more complex temporal dependencies that might be better handled by specialized time series models.
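
The calendar and lag features mentioned above could be derived as in the following sketch. The toy frame, the helper name `add_calendar_and_lag_features`, and the output column names are assumptions made for illustration. The lag is computed per SKU so that one product's sales never leak into another's history.

```python
import pandas

# Toy frame standing in for the real data; the values are illustrative only.
toy = pandas.DataFrame({
    "DATE": pandas.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "SKU": ["A", "A", "A", "B", "B"],
    "QUANTITY_SOLD": [3, 5, 4, 10, 12],
})


def add_calendar_and_lag_features(frame, lags=(1,)):
    """Return a copy of frame with calendar and per-SKU lag features added."""
    out = frame.sort_values(["SKU", "DATE"]).copy()
    out["DAY_OF_WEEK"] = out["DATE"].dt.dayofweek  # Monday is 0
    out["MONTH"] = out["DATE"].dt.month
    out["QUARTER"] = out["DATE"].dt.quarter
    for lag in lags:
        # Shift within each SKU; the first `lag` rows of each SKU become NaN.
        out[f"LAG_{lag}"] = out.groupby("SKU")["QUANTITY_SOLD"].shift(lag)
    return out


features = add_calendar_and_lag_features(toy)
print(features)
```

Rows whose lag is missing have to be dropped or imputed before training, unless the model handles missing values natively.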

## Using a Temporal Model (e.g., AR(1))

### Pros:

* Capturing Temporal Dependencies: Temporal models like AR(1) are specifically designed to capture temporal dependencies and can be very effective for time series forecasting.
* Simplicity: For certain types of data, simpler autoregressive models can provide good performance without the need for extensive feature engineering.

### Cons:

* Limited Flexibility: Autoregressive models might not handle complex seasonality or other patterns as effectively as more flexible models combined with manually engineered features.
* Model Specific: These models are tailored for time series data and might not integrate as seamlessly with other predictive features or complex datasets.
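
For reference, an AR(1) model predicts each value from the previous one, y_t = c + phi * y_(t-1) + e_t. The sketch below estimates c and phi by ordinary least squares with NumPy. It is a simplified illustration rather than the estimator a dedicated time series library would use, and the names `fit_ar1` and `forecast_ar1` are hypothetical.

```python
import numpy as np


def fit_ar1(series):
    """Estimate c and phi in y_t = c + phi * y_(t-1) + e_t by least squares."""
    y = np.asarray(series, dtype=float)
    # Design matrix: an intercept column and the previous observation.
    design = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(design, y[1:], rcond=None)
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted recursion forward for a number of steps."""
    forecasts = []
    current = last_value
    for _ in range(steps):
        current = c + phi * current
        forecasts.append(current)
    return forecasts


# Noise-free toy series generated with c = 2.0 and phi = 0.5.
series = [10.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
next_values = forecast_ar1(series[-1], c, phi, steps=3)
print(c, phi, next_values)
```

On real sales data the fit would be done per SKU, and the single lag means any weekly or yearly seasonality is not modelled.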

I took the pragmatic approach and went with the first solution, because it keeps the model simple and is model-agnostic.<br>
Under other conditions, I would have tested both approaches and, depending on the results, either put one of them into production or A/B tested them.

### Ground truth model

To establish a baseline performance for our forecasting model, I first implemented a Linear Regression model to serve as our reference ("ground truth") model.<br>
This model provided a reasonable starting point with:
* R² of 0.5303
* RMSE of 16.724.

The XGBoost model significantly improved the performance, achieving:
* R² of 0.7682
* RMSE of 11.749.

XGBoost improves on the Linear Regression baseline for every evaluation metric (R², MAE, MSE, and RMSE), which indicates that it is better suited to capturing the intricacies of the data. The gain is largest for MSE, which roughly halves, and more modest for MAE.

| Metric | Linear Regression | XGBoost  |
|--------|-------------------|----------|
| R²     | 0.5303            | 0.7682   |
| MAE    | 3.048             | 2.696    |
| MSE    | 279.678           | 138.039  |
| RMSE   | 16.724            | 11.749   |
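
For readers unfamiliar with the four metrics, the sketch below computes them from scratch on a small made-up example. The function name `regression_metrics` and the sample values are assumptions for illustration; the notebook's own evaluation code is not shown here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, MSE and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    ss_residual = sum(e * e for e in errors)
    r2 = 1.0 - ss_residual / ss_total
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse}


# Made-up values, not taken from the notebook's data.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

Because RMSE is the square root of MSE, the two rows of the table can be cross-checked: the square roots of 279.678 and 138.039 round to the reported 16.724 and 11.749. RMSE being much larger than MAE for both models suggests that a small number of large errors dominates the squared-error metrics.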