# Predicting Item Sales with getML on H&M Fashion Dataset

## Introduction

This notebook shows how to use [getML](https://getml.com) to predict item sales on the H&M Fashion dataset,
outperforming other approaches in the [Relational Deep Learning Benchmark (RelBench)](http://relbench.stanford.edu/). We achieve this with minimal code complexity and without requiring knowledge from the business domain.

### Why Focus on Feature Engineering?

Pedro Domingos, a leading ML researcher, highlighted in his famous 2012 paper that *features are the most critical factor in machine learning.*
Features are the "language" that allows prediction models to interpret relational data. If that language is poor or incomplete, even the best-tuned models will underperform. In classical ML approaches like gradient boosting, features are undoubtedly king. At getML, our mission is to automate feature engineering for relational data, minimizing the need for complex models, manual SQL code, and business domain expertise, which is often the Achilles' heel of predictive analytics.

The importance of features isn’t limited to gradient boosting. Even in deep learning (text and images), architectures like CNNs, RNNs, and transformers spend 70-90% of all operations on feature extraction. Regardless of the model, it’s the quality of features, not just the final layers, that drives performance.

### Why getML?

Relational learning is heavily underutilized across industries. At getML, we aim to change that by advancing the field through innovative feature learning algorithms. [FastProp](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-fastprop) (Fast Propositionalization) is one of our core algorithms, automating feature engineering for regression and classification tasks on relational data. It runs *[60 to 1000 times faster](https://github.com/getml/getml-community?tab=readme-ov-file#benchmarks)* than tools like [featuretools](https://www.featuretools.com) and [tsfresh](https://tsfresh.com), while scaling effortlessly to millions of rows.

### First Time Using getML?

If you're new to getML, consider starting with the simpler [notebook on user churn](hm-churn.ipynb) for an introduction to basic concepts.

### What This Notebook Covers

While getML can serve as a complete end-to-end solution, getML is also designed for seamless integration with other frameworks. In this notebook, we will:
- Use getML for *feature engineering* only and export (`transform`) the generated features, to
- train a *LightGBM regressor* on these features for prediction, and
- tune the resulting model with *Optuna* for hyperparameter optimization.

### Outline

This notebook is divided into six key sections:

1. [Setup](#1.-Setup) – Launch the getML engine and download the H&M dataset from RelBench.
2. [Data Preparation](#2.-Data-Preparation) – Load the dataset, define roles, and create a DataModel.
3. [The Baseline Model](#3.-The-Baseline-Model) – Train a simple pipeline with FastProp and XGBoostRegressor.
4. [The Refined Model](#4.-The-Refined-Model) – Explore FastProp’s parameters and optimize the DataModel.
5. [Exporting Features](#5.-Exporting-Features) – Generate features and export them for external use.
6. [Training LightGBM](#6.-Training-LightGBM) – Train and evaluate a LightGBM regressor using Optuna for tuning.

---
## 1. Setup

> ⓘ Note: We assume you have all necessary libraries installed. We have [prepared an environment for you](pyproject.toml). To
> use it, just start JupyterLab through `uv run jupyter lab`.

In this section, we:
- Import required libraries.
- Create a getML project.
- Download the "H&M" dataset from RelBench.

In [2]:
import getml
import pyarrow as pa
import pyarrow.parquet as pq
from relbench.datasets import get_dataset
from relbench.tasks import get_task

# Enable textual output to avoid rendering issues in certain JupyterLab environments
getml.utilities.progress.FORCE_TEXTUAL_OUTPUT = True
getml.utilities.progress.FORCE_MONOCHROME_OUTPUT = True

# Launch getML engine and set project.
getml.set_project("hm-item")

# Download dataset and task from RelBench.
dataset = get_dataset("rel-hm", download=True)
task = get_task("rel-hm", "item-sales", download=True)

Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

---
## 2. Data Preparation
In this section, we:
- Annotate the data (assign roles) for feature learning.
- Load the tables into getML data frames.
- Build a data model to represent table relationships.

### Annotating Data
Define the roles for population, customer, and transaction tables.
These roles help getML understand how to process each column.

Roles are set based on the insights gained from the [data model](#H&M-DataModel-Overview) and the [RelBench dataset description](https://relbench.stanford.edu/datasets/rel-hm).

In [3]:
# roles for the population tables (train, test, val).
population_roles = getml.data.Roles(
    join_key=["article_id"],
    target=["sales"],
    time_stamp=["timestamp"],
)

# Customer table roles. Keeping columns 'FN', 'Active', and 'postal_code'
# unused based on earlier pipeline checks
customer_roles = getml.data.Roles(
    join_key=["customer_id"], numerical=["age"], categorical=["club_member_status"]
)

# Transaction table roles (linking articles and customers).
transaction_roles = getml.data.Roles(
    join_key=["article_id", "customer_id"],
    time_stamp=["t_dat"],
    numerical=["price"],
    categorical=["sales_channel_id"],
)

The `article` table is omitted from feature learning because every row of the population table
maps to exactly one article (a many-to-one relationship), so there is nothing to aggregate. Its categorical attributes are instead passed
directly to the LightGBM model later (see [Section 5, Exporting Features](#5.-Exporting-Features)).

### Loading Data

In [4]:
subsets = ("train", "test", "val")
populations = {}
for subset in subsets:
    populations[subset] = getml.data.DataFrame.from_parquet(
        f"{dataset.cache_dir}/tasks/item-sales/{subset}.parquet",
        subset,
        population_roles,
    )

customer = getml.data.DataFrame.from_parquet(
    f"{dataset.cache_dir}/db/customer.parquet", "customer", customer_roles
)

transaction = getml.data.DataFrame.from_parquet(
    f"{dataset.cache_dir}/db/transactions.parquet", "transaction", transaction_roles
)

### Defining the DataModel and Container
#### H&M DataModel Overview
<img src="https://relbench.stanford.edu/img/rel-hm.png" width="500"/>

#### Creating a getML DataModel

In [5]:
dm = getml.data.DataModel(population=populations["train"].to_placeholder())

dm.add(getml.data.to_placeholder(customer, transaction))

# Define table relationships:
# 1) population -> transaction (with a time restriction, 6-week memory).
dm.population.join(
    dm.transaction,
    on="article_id",
    time_stamps=("timestamp", "t_dat"),
    memory=getml.data.time.weeks(6),
)

# 2) transaction -> customer (many-to-one).
dm.transaction.join(
    dm.customer, on="customer_id", relationship=getml.data.relationship.many_to_one
)

# 3) Wrap data into a container for pipeline fitting.
container = getml.data.Container(**populations)
container.add(customer, transaction)
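
To make the join semantics more concrete, the following plain-Python sketch mimics what a single feature along the path population → transaction → customer could compute: the average age of the customers who bought an article within the memory window. The toy rows, the helper name `avg_buyer_age`, and the exact window boundaries (`t_dat <= timestamp < t_dat + memory`) are illustrative assumptions, not getML internals.

```python
from datetime import datetime, timedelta

# Toy stand-ins for the peripheral tables (all values are made up).
customers = {"c1": {"age": 20}, "c2": {"age": 40}, "c3": {"age": 60}}
transactions = [
    {"article_id": "a1", "customer_id": "c1", "t_dat": datetime(2020, 9, 1)},
    {"article_id": "a1", "customer_id": "c2", "t_dat": datetime(2020, 9, 10)},
    # Too old: falls outside the 6-week memory window.
    {"article_id": "a1", "customer_id": "c3", "t_dat": datetime(2020, 6, 1)},
    # Different article: does not match the join key.
    {"article_id": "a2", "customer_id": "c3", "t_dat": datetime(2020, 9, 12)},
]


def avg_buyer_age(article_id, timestamp, memory=timedelta(weeks=6)):
    """Average age of customers who bought `article_id` within `memory` before `timestamp`."""
    ages = [
        # Many-to-one lookup: each transaction belongs to exactly one customer.
        customers[t["customer_id"]]["age"]
        for t in transactions
        if t["article_id"] == article_id
        # Assumed window: past transactions that are not older than `memory`.
        and t["t_dat"] <= timestamp < t["t_dat"] + memory
    ]
    return sum(ages) / len(ages) if ages else None


feature = avg_buyer_age("a1", datetime(2020, 9, 14))
print(feature)
```

Only the two recent `a1` transactions contribute here. FastProp evaluates many such aggregation and column combinations automatically.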

---
## 3. The Baseline Model

With the data model in place, we define a simple pipeline with FastProp and XGBoostRegressor to establish a baseline.

In [6]:
pipe_base = getml.Pipeline(
    data_model=dm,
    feature_learners=getml.feature_learning.FastProp(),
    predictors=getml.predictors.XGBoostRegressor(),
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
)

We fit the pipeline on the training set and evaluate it on the validation set.

In [7]:
pipe_base.fit(container.train, check=True)
pipe_base.score(container.val)

[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
[2K  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
[2K  FastProp: Trying 54 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

Time taken: 0:00:03.837314.

[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

|   | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2025-01-15 17:28:23 | train | sales | 0.05194 | 0.07277 | 0.9991 |
| 1 | 2025-01-15 17:28:26 | val | sales | 3.02385 | 4.68686 | 0.1724 |
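
For readers less familiar with the `mae` and `rmse` columns reported by `score`, here is a minimal sketch of both metrics in plain Python. The function names and the toy numbers are our own and are not taken from the dataset.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up targets and predictions with one large miss in the last position.
y_true = [0.0, 2.0, 4.0, 10.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because of the single large miss, the RMSE comes out noticeably higher than the MAE. The same effect can show up when a few items sell far more than predicted.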


---
## 4. The Refined Model

Building on our baseline, this section focuses on refining the data model and the pipeline for improved accuracy.
These enhancements increase runtime from 3 minutes to approximately 40 minutes (gcloud; n2-standard-32, 32 vCPUs & 128 GB RAM).

The refined pipeline and data model expand the candidate feature space from 54 to 1584 features. The refinements are:
- Modifying the data model to capture autoregressive (AR) effects in sales,
- Adding getML's [seasonal preprocessor](https://getml.com/latest/reference/preprocessors/seasonal/) to the pipeline,
- Adding aggregations to FastProp that capture temporal patterns,
- Handling categorical columns with `n_most_frequent`,
- Limiting the total number of features with `num_features`.

### a. Capturing Autoregressive (AR) Effects in Sales

Sales data often exhibits autoregressive patterns. In the base pipeline, the MAE rose from 0.04749 on the training set to 0.06861 on the validation set, suggesting that the baseline
features didn’t generalize well. With the single join defined so far, FastProp aggregates over the entire 6-week memory window, potentially missing short-term trends.
To address this, we introduce a second join with a shorter 1-week memory window. This join is added *in addition* to the existing one, which gives
FastProp a new path to learn features from.

In [8]:
dm.population.join(
    dm.transaction,
    on="article_id",
    time_stamps=("timestamp", "t_dat"),
    memory=getml.data.time.weeks(1),
)

This short-term join captures recent sales activity. Advanced getML algorithms like [MultiRel](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-multirel) or [Relboost](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relboost) can learn such AR effects without modifying the DataModel, but for FastProp, this adjustment is crucial.
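
The effect of the second, shorter window can be illustrated with a small plain-Python sketch. It applies a `COUNT`-style aggregation to two hypothetical articles that sold equally often over six weeks but very differently during the last week. The helper name, the dates, and the window boundaries are illustrative assumptions.

```python
from datetime import datetime, timedelta


def count_in_window(sale_dates, timestamp, memory):
    """COUNT of sales inside the assumed window (timestamp - memory, timestamp]."""
    return sum(1 for d in sale_dates if timestamp - memory < d <= timestamp)


now = datetime(2020, 9, 14)
# Two hypothetical articles with the same number of sales over six weeks ...
steady = [now - timedelta(days=d) for d in (2, 9, 16, 23, 30, 37)]
# ... but the second one sold almost exclusively during the last week.
trending = [now - timedelta(days=d) for d in (1, 2, 3, 4, 5, 37)]

six_weeks, one_week = timedelta(weeks=6), timedelta(weeks=1)
counts = {
    name: (count_in_window(sales, now, six_weeks), count_in_window(sales, now, one_week))
    for name, sales in (("steady", steady), ("trending", trending))
}
print(counts)
```

The 6-week count is identical for both articles, so only the 1-week count separates the trending article from the steady one. This is the kind of signal the additional join exposes to FastProp.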

### b. Applying Seasonal Preprocessing

In [9]:
# The Seasonal preprocessor extracts temporal features (e.g., month, day-of-week) from time stamps.
seasonal_preprocessor = getml.preprocessors.Seasonal()

# We only want it to affect the population’s `timestamp`, not the transaction table’s
# `t_dat`, so we exclude `t_dat` via subroles (https://getml.com/latest/reference/data/subroles/):
transaction.set_subroles(["t_dat"], getml.data.subroles.exclude.seasonal)

# sync the container, to reflect the changed annotations
container.sync()
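
As a rough intuition for what seasonal preprocessing contributes, the sketch below breaks a time stamp into calendar components using only the standard library. The helper name and the chosen components are illustrative assumptions. The components getML actually extracts are described in its documentation.

```python
from datetime import datetime


def seasonal_components(ts):
    """Break a time stamp into calendar components a predictor can use directly."""
    return {
        "year": ts.year,
        "month": ts.month,
        "weekday": ts.weekday(),  # Monday == 0, Sunday == 6
    }


components = seasonal_components(datetime(2020, 9, 14))
print(components)
```

Components like these let a tree-based predictor pick up recurring patterns, for example weekday effects, that a raw time stamp hides.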

### c. Add aggregations to FastProp that help the predictor to catch temporal correlations

FastProp’s default aggregations include `COUNT`, `SUM`, etc. We can add more advanced
aggregations like Exponentially Weighted Moving Averages (EWMA) and quantiles to
capture temporal patterns.

In [10]:
additional_aggregations = {
    getml.feature_learning.aggregations.EWMA_1D,
    getml.feature_learning.aggregations.EWMA_7D,
    getml.feature_learning.aggregations.EWMA_30D,
    getml.feature_learning.aggregations.Q_1,
    getml.feature_learning.aggregations.Q_5,
    getml.feature_learning.aggregations.Q_10,
    getml.feature_learning.aggregations.Q_25,
    getml.feature_learning.aggregations.TIME_SINCE_FIRST_MINIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_LAST_MINIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_LAST_MAXIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_FIRST_MAXIMUM,
}
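
To give an idea of what an exponentially weighted aggregation does, here is a minimal sketch in plain Python. It assumes that a value's weight halves every `half_life_days`. Whether getML's `EWMA_7D` uses exactly this parameterization is an assumption on our side, and the function name and numbers are made up.

```python
import math


def ewma(values, ages_in_days, half_life_days):
    """Exponentially weighted average: a value's weight halves every `half_life_days`."""
    decay = math.log(2) / half_life_days
    weights = [math.exp(-decay * age) for age in ages_in_days]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


prices = [10.0, 20.0]  # hypothetical transaction prices
ages = [7.0, 0.0]      # days between each transaction and the prediction time stamp

recent_heavy = ewma(prices, ages, half_life_days=7)
print(recent_heavy)
```

With a 7-day half-life, the week-old price counts half as much as today's price, so the result lies closer to 20 than the plain mean of 15 does.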

### d. Handling Categorical Features with `n_most_frequent`

Categorical columns add a new dimension to feature generation. With brute-force methods like FastProp, this can lead to an explosion in features, because a new feature is created for each level of the categorical column, for each aggregation we apply, for each column we aggregate over. In other words, the total number of features grows multiplicatively with the number of categories in the categorical column.
`n_most_frequent` in FastProp helps to alleviate this issue by restricting the number of categories that are considered for feature generation. FastProp only creates features for the `n_most_frequent` categories in a column; all other categories are binned into a single fallback category. This is especially useful when dealing with columns that contain many categories (like `sales_channel_id`). With `n_most_frequent=2`,
FastProp considers the two most frequent categories in that column and groups everything else into the fallback.

In [11]:
n_most_frequent = 2
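
The binning idea behind `n_most_frequent` can be sketched in a few lines of plain Python. The helper name, the `__other__` label, and the example column are hypothetical. The sketch only mirrors the description above and does not reproduce FastProp's internals.

```python
from collections import Counter


def keep_most_frequent(values, n_most_frequent, other="__other__"):
    """Keep the `n_most_frequent` categories and bin everything else into `other`."""
    top = {category for category, _ in Counter(values).most_common(n_most_frequent)}
    return [v if v in top else other for v in values]


# A made-up categorical column with four distinct levels.
colors = ["black", "white", "black", "blue", "red", "white", "black"]
binned = keep_most_frequent(colors, n_most_frequent=2)
print(binned)
```

After binning, the column has at most `n_most_frequent + 1` levels, which bounds the number of category-specific features per aggregation.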

### e. Limiting the Total Number of Features

FastProp can generate a large number of features. It ranks them based on their pairwise correlation with the target. The highest-ranking subset is kept. Setting `num_features=200` means we retain only the top 200.
This prevents memory issues when feeding these features to non-memory-mapped models like XGBoost.

In [12]:
num_features = 200
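
The selection step can be pictured as ranking candidate features by the absolute value of their correlation with the target and keeping the top `num_features`. The following plain-Python sketch illustrates that idea. The feature names and values are made up, and the helpers are not part of the getML API.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_features(features, target, num_features):
    """Rank features by absolute correlation with the target and keep the best ones."""
    ranked = sorted(
        features, key=lambda name: abs(pearson(features[name], target)), reverse=True
    )
    return ranked[:num_features]


target = [1.0, 2.0, 3.0, 4.0]
features = {
    "count_6w": [1.1, 1.9, 3.2, 3.9],   # strongly correlated
    "avg_price": [4.0, 3.0, 2.5, 1.0],  # strongly anti-correlated
    "noise": [2.0, 1.0, 2.0, 1.0],      # weakly correlated
    "constant": [5.0, 5.0, 5.0, 5.0],   # no variance
}
selected = select_top_features(features, target, num_features=2)
print(selected)
```

Note that a strongly negative correlation ranks as high as a strongly positive one, since both carry information about the target.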

### f. Building and Fitting the Enhanced Pipeline

In [13]:
pipe_refined = getml.Pipeline(
    data_model=dm,
    preprocessors=seasonal_preprocessor,
    feature_learners=getml.feature_learning.FastProp(
        n_most_frequent=n_most_frequent,
        num_features=num_features,
        aggregation=(
            getml.feature_learning.FastProp.agg_sets.default | additional_aggregations
        ),
    ),
    predictors=getml.predictors.XGBoostRegressor(),
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
)

In [14]:
pipe_refined.fit(container.train, check=False)

# Evaluate the pipeline on the validation set.
pipe_refined.score(container.val)

[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Trying 1584 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

Time taken: 0:00:06.812240.

[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

|   | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2025-01-15 17:28:33 | train | sales | 0.05707 | 0.08228 | 0.9987 |
| 1 | 2025-01-15 17:28:40 | val | sales | 1.5356 | 4.0975 | 0.4312 |


Summary of Enhancements:
- Short-term trends are captured with a 1-week memory join.
- Seasonal patterns are derived via preprocessing.
- Advanced aggregations extend FastProp’s ability to model temporal dynamics.
- Categorical control via `n_most_frequent` prevents feature explosion.
- Feature limits ensure efficient training on external models like XGBoost.

These refinements lead to longer runtimes (~40 minutes end-to-end) but reduce
the MAE on the provided validation split from 0.06863 to 0.04747.

---
## 5. Exporting Features

Now that FastProp has generated features, we can export them for external use.
We enrich these features by merging item-level attributes from
the article table. Since the article table shares a many-to-one relationship
with the population, no additional aggregation is required.

The article table includes metadata (e.g., department info, section, color)
which can enhance downstream models, but not all columns seem relevant for sales prediction.
Here, we pick columns like `department_name` or `index_group_name` and do not
include item attributes like its color.

First, we load the article table as an arrow table and select the relevant columns.

In [15]:
article_meta_cols = ["department_name", "index_group_name", "section_name"]

article = pq.read_table(
    f"{dataset.cache_dir}/db/article.parquet",
    columns=["article_id", *article_meta_cols],
    schema=pa.schema(
        [
            # getML exports join keys as strings, so we need to cast the article_id
            # to string to be able to join it with the features upon export
            pa.field("article_id", pa.string()),
            *[
                # we encode categorical columns as dictionary columns and use
                # int32 for the keys as integers would be downcast to int32 by
                # LightGBM anyway
                pa.field(col, pa.dictionary(pa.int32(), pa.string()))
                for col in article_meta_cols
            ],
        ]
    ),
)

Below, we define a helper function that:
1. Applies the fitted pipeline to transform data and extract FastProp features.
2. Merges article metadata (e.g., `department_name`) to enrich the feature set.
3. Exports the final features as Parquet files for later use with LightGBM.

> 💾 A Note on Memory Management
>
> Below, we take extra care to be economical with memory because, even for small
> data sets, FastProp can generate a very large number of features in a short
> amount of time.

In [16]:
def export_features(pipe, container, subset, batch_size=100000):
    """
    Batch-wise transform and export features (+ article metadata) for a given subset.
    """
    print(f"Exporting features for {subset} set...")
    name = f"hm_item_pipe_refined_{subset}_features"
    features = pipe.transform(container[subset], df_name=name)
    # The output schema combines the FastProp feature columns with the article metadata.
    schema = pa.unify_schemas([features[:0].to_arrow().schema, article.schema])
    # The context manager closes the writer, so the Parquet footer is always written.
    with pq.ParquetWriter(f"{name}.parquet", schema, compression="snappy") as sink:
        for batch in features.iter_batches(batch_size=batch_size):
            fastprop_feature_batch = batch.to_arrow()
            sink.write_table(fastprop_feature_batch.join(article, ["article_id"]))

In [17]:
# Export features for train, validation, and test sets
export_features(pipe_refined, container, "train")
export_features(pipe_refined, container, "val")
export_features(pipe_refined, container, "test")

Exporting features for train set...
[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Exporting features for val set...
[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Exporting features for test set...
[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
[?25h

---
## 6. Training LightGBM

In this section, we train a LightGBM regressor using the features exported from the FastProp pipeline.
We leverage Optuna for hyperparameter optimization (hyperopt) to improve performance.

To run the script from the terminal:
```bash
python hm-item-lgbm_tuning.py
```

If you want to run the script in the background and write the output to a log file, use:
```bash
python hm-item-lgbm_tuning.py &> lgbm_tuning.log &
```

To run the script from the notebook, uncomment the cell below and run it.

In [18]:
# %run hm-item-lgbm_tuning.py
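
The tuning script itself is not reproduced in this notebook. As a rough orientation, a tuning loop of this kind might look like the hypothetical sketch below. The search space, its ranges, the function names, and the fixed settings (`n_estimators`, early stopping) are our own assumptions and are not taken from `hm-item-lgbm_tuning.py`. The sketch requires `optuna`, `lightgbm`, and `scikit-learn`.

```python
def suggest_params(trial):
    """Hypothetical search space; the ranges are illustrative, not those of the script."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }


def tune(X_train, y_train, X_val, y_val, n_trials=50):
    """Minimize the validation MAE of a LightGBM regressor with Optuna."""
    # Imported lazily so the search space above can be inspected without these packages.
    import lightgbm as lgb
    import optuna
    from sklearn.metrics import mean_absolute_error

    def objective(trial):
        model = lgb.LGBMRegressor(
            objective="regression_l1",  # optimize the absolute error directly
            n_estimators=2000,
            verbose=-1,
            **suggest_params(trial),
        )
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_val, y_val)],
            # Stop adding trees once the validation metric stops improving.
            callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
        )
        return mean_absolute_error(y_val, model.predict(X_val))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

The feature matrices passed to `tune` could be loaded from the exported Parquet files, for example with `pandas.read_parquet`. The default of 50 trials mirrors the number of trials reported in the results.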

## Result
After approximately 10 hours of hyperparameter optimization across 50 trials, we achieve a test set MAE of 0.031. For reference, this follows the same tuning schedule used by RelBench.

### Key Takeaways
- Our FastProp-driven features outperform the features manually engineered by a data scientist, whose best result was an MAE of 0.036 on the same dataset.
- This highlights the effectiveness of automated feature engineering and hyperparameter tuning in delivering superior performance with minimal manual effort.
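
To put the two MAE values into perspective, the relative error reduction can be computed directly from the figures quoted above. The helper name is our own.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error compared to `baseline`."""
    return (baseline - improved) / baseline


manual_mae = 0.036  # best manually engineered result quoted above
getml_mae = 0.031   # FastProp features + tuned LightGBM, as quoted above

reduction = relative_reduction(manual_mae, getml_mae)
print(f"{reduction:.1%}")
```

This corresponds to a reduction of roughly 14% in test MAE relative to the manual baseline.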