# Predicting Item Sales with getML on H&M Fashion Dataset

#### *Advanced Applications of getML with External Predictors*

This notebook shows how to use [getML](https://getml.com) to predict item sales on the H&M Fashion dataset,  
outperforming other approaches in the [Relational Deep Learning Benchmark (RelBench)](http://relbench.stanford.edu/).  

We achieve this with **minimal code complexity** and **without requiring knowledge from the business domain.**

<br>

**Why Focus on Feature Engineering?**

Pedro Domingos, a leading ML researcher, highlighted in his famous 2012 paper that *features are the most critical factor in machine learning.*
Features are the "language" that allows prediction models to interpret relational data. If that language is poor or incomplete, even the best-tuned models will underperform.  

In classical ML approaches like gradient boosting, features are undoubtedly king. At getML, our mission is to automate feature engineering for relational data, minimizing the need for complex models, manual SQL code, and business domain expertise – often the Achilles' heel of predictive analytics.  

The importance of features isn’t limited to gradient boosting. Even in deep learning on text and images, architectures like CNNs, RNNs, and transformers spend 70-90% of their operations on feature extraction. Regardless of the model, it’s the quality of features – not just the final layers – that drives performance.

<br>

**Why getML?**  

Relational learning is heavily underutilized across industries. At getML, we aim to change that by advancing the field through innovative feature learning algorithms.  

[**FastProp**](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-fastprop) (Fast Propositionalization) is one of our core algorithms, automating feature engineering for regression and classification tasks on relational data. It runs **[60 to 1000 times faster](https://github.com/getml/getml-community?tab=readme-ov-file#benchmarks)** than tools like [featuretools](https://www.featuretools.com) and [tsfresh](https://tsfresh.com), while scaling effortlessly to millions of rows.   

<br>

**First Time Using getML?**

If you're new to getML, consider starting with the simpler *hm-churn.ipynb* notebook for an introduction to basic concepts. 

<br>

**What This Notebook Covers**  

While getML can serve as a complete end-to-end solution, it’s designed for **seamless integration** with other frameworks. In this notebook, we will:  
- **Fine-tune getML's parameters** to enhance feature extraction,  
- **Integrate with LightGBM and Optuna** for model training and hyperparameter tuning.  
  
<br>

**Notebook Outline**  

This notebook is divided into four key sections:  

1. **[The Base Model](#The-Base-Model)** – Load the data, build a base data model, and train the initial pipeline.  
2. **[The Tuned Model](#The-Tuned-Model)** – Explore FastProp’s parameters and optimize the DataModel.  
3. **[Exporting Features](#Exporting-Features)** – Generate features and export them for external use.  
4. **[Training LightGBM](#Training-LightGBM)** – Train and evaluate a LightGBM regressor using Optuna for tuning.  

---
## The Base Model

##### *Load Data, Build a Data Model, and Train the Initial Pipeline*

In this section, we:
- Launch the getML engine and create a project.
- Download the "H&M" dataset from RelBench.
- Assign roles to each table (population, customer, transaction).
- Build a DataModel to represent table relationships.
- Train a simple pipeline (FastProp + XGBoostRegressor) as a baseline for initial results.


In [1]:
# We assume you already have all necessary dependencies installed.
# Otherwise, uncomment the line below to install them.

# !pip install pyarrow
# !pip install getml
# !pip install relbench

In [2]:
import getml
import pandas as pd
from relbench.datasets import get_dataset
from relbench.tasks import get_task

# Launch getML engine and set project.
getml.engine.launch(in_memory=True)  # Keeps data in RAM for faster processing (default)
getml.set_project("hm-item")

# Download dataset and task from RelBench.
dataset = get_dataset("rel-hm", download=True)
task = get_task("rel-hm", "item-sales", download=True)

# Enable textual output to avoid rendering issues in certain JupyterLab environments
getml.utilities.progress.FORCE_TEXTUAL_OUTPUT = True


# ---
# Assigning Roles to Tables
# Define the roles for population, customer, and transaction tables.
# These roles help getML understand how to process each column.
# ---

# Roles for the population tables (train, test, val).
population_roles = getml.data.Roles(
    join_key=["article_id"],
    target=["sales"],
    time_stamp=["timestamp"],
)

# Customer table roles. Keeping columns 'FN', 'Active', and 'postal_code'
# unused based on earlier pipeline checks
customer_roles = getml.data.Roles(
    join_key=["customer_id"],
    numerical=["age"],
    categorical=["club_member_status"]
)

# Transaction table roles (linking articles and customers).
transaction_roles = getml.data.Roles(
    join_key=["article_id", "customer_id"],
    time_stamp=["t_dat"],
    numerical=["price"],
    categorical=["sales_channel_id"],
)

# Article roles are omitted for simplicity (trivial many-to-one relationship).
# The categorical article attributes are passed separately to the LightGBM model
# (see Section 3 - Exporting Features).
article_roles = getml.data.Roles(
    join_key=[],
    numerical=[],
    categorical=[]
)


# ---
# Loading Data
# ---

# Load the train, test, and val tables, then store them in a dict for convenience.
subsets = ("train", "test", "val")
populations = {}
for subset in subsets:
    populations[subset] = getml.data.DataFrame.from_parquet(
        f"{dataset.cache_dir}/tasks/item-sales/{subset}.parquet", 
        subset, 
        population_roles
    )

# Load peripheral tables (customer and transaction).
customer = getml.data.DataFrame.from_parquet(
    f"{dataset.cache_dir}/db/customer.parquet",
    "customer",
    customer_roles
)

transaction = getml.data.DataFrame.from_parquet(
    f"{dataset.cache_dir}/db/transactions.parquet",
    "transaction",
    transaction_roles
)

# Omitting the article table to prevent unnecessary complexity.


# ---
# Defining the DataModel and Container
# ---

# Initialize the DataModel using the train set as the population placeholder.
dm = getml.data.DataModel(population=populations["train"].to_placeholder())

# Add peripheral tables to the model as placeholders.
dm.add(getml.data.to_placeholder(customer, transaction))

# Define table relationships:
# 1) population -> transaction (time-aware, 6-week memory).
dm.population.join(
    dm.transaction,
    on="article_id",
    time_stamps=("timestamp", "t_dat"),
    memory=getml.data.time.weeks(6)
)

# 2) transaction -> customer (many-to-one).
dm.transaction.join(
    dm.customer,
    on="customer_id",
    relationship=getml.data.relationship.many_to_one
)

# Wrap data into a container for pipeline fitting.
container_1 = getml.data.Container(**populations)
container_1.add(customer, transaction)


# ---
# Defining, Fitting, and Evaluating the Pipeline
# ---

# Define a simple pipeline with FastProp for feature learning and XGBoost for prediction.
pipe_1 = getml.Pipeline(
    data_model=dm,
    feature_learners=[getml.feature_learning.FastProp()],
    predictors=[getml.predictors.XGBoostRegressor()],
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
)

# Train the pipeline and score it on the validation set.
pipe_1.fit(container_1.train, check=True)
pipe_1.score(container_1.val)

# Display pipeline performance scores.
pipe_1.scores

Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/jupyter/.getML --in-memory=true --install=false --launch-browser=true --log=false --project-directory=/home/jupyter/.getML/projects in /opt/conda/lib/python3.10/site-packages/getml/.getML/getml-community-1.5.0-amd64-linux...
Launched the getML Engine. The log output will be stored in /home/jupyter/.getML/logs/getml_20250105171329.log


Staging... 100% • 00:02
Checking... 100% • 00:11

Staging... 100% • 00:09
FastProp: Trying 54 features... 100% • 00:00
FastProp: Building features... 100% • 00:41
XGBoost: Training as predictor... 100% • 02:06

Time taken: 0:02:58.556960.

Staging... 100% • 00:02
Preprocessing... 100% • 00:00
FastProp: Building features... 100% • 00:03

|   | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2025-01-05 17:17:06 | train | sales | 0.04749 | 0.2922 | 0.6547 |
| 1 | 2025-01-05 17:17:12 | val | sales | 0.06863 | 0.415 | 0.6089 |
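
To make the time-aware join more concrete, here is a rough pandas sketch of the kind of feature such a join gives rise to: for every population row, aggregate the transactions of the same article that fall into the 6 weeks before `timestamp`. The toy data, the helper `window_features`, and the output column names are hypothetical, and the exact boundary handling is an assumption. This is not how getML computes features internally.

```python
import pandas as pd

# Toy stand-ins for the population and transaction tables (made-up values).
population = pd.DataFrame({
    "article_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2020-03-01", "2020-06-01", "2020-06-01"]),
})
transaction = pd.DataFrame({
    "article_id": [1, 1, 1, 1, 2],
    "t_dat": pd.to_datetime(
        ["2020-01-01", "2020-02-10", "2020-02-20", "2020-05-25", "2020-02-25"]
    ),
    "price": [10.0, 14.0, 12.0, 8.0, 5.0],
})


def window_features(population, transaction, memory):
    """COUNT and AVG(price) over transactions with
    timestamp - memory < t_dat <= timestamp (assumed boundary handling)."""
    pop = population.assign(row_id=range(len(population)))
    joined = pop.merge(transaction, on="article_id", how="left")
    in_window = (joined["t_dat"] <= joined["timestamp"]) & (
        joined["t_dat"] > joined["timestamp"] - memory
    )
    agg = (
        joined[in_window]
        .groupby("row_id")["price"]
        .agg(count_6w="count", avg_price_6w="mean")
    )
    out = pop.join(agg, on="row_id").fillna({"count_6w": 0})
    return out.astype({"count_6w": int}).drop(columns="row_id")


features = window_features(population, transaction, pd.Timedelta(weeks=6))
print(features)
```

Rows without any transaction in the window keep a count of zero and a missing average, which tree-based predictors can handle directly.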


---
## The Tuned Model

##### *Refining the DataModel and FastProp Parameters*

Building on our baseline, this section focuses on refining the pipeline for improved accuracy.
These enhancements increase runtime from 3 minutes to approximately 40 minutes (gcloud; n2-standard-32, 32 vCPUs & 128 GB RAM).

The refined pipeline and data model expand the candidate feature space from 54 to 1584 features. The changes are:
- Modifying the data model to capture autoregressive (AR) effects in sales,
- Adding getML's [seasonal preprocessor](https://getml.com/latest/reference/preprocessors/seasonal/) to the pipeline,
- Adding further temporal aggregations to FastProp,
- Handling categorical columns with `n_most_frequent`,
- Limiting the number of selected features with `num_features`.
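
As a rough illustration of what seasonal components look like, the following pandas sketch derives month and day-of-week columns from a timestamp column. It only mimics the idea: getML's `Seasonal` preprocessor runs inside the engine, and the data and derived column names here are hypothetical.

```python
import pandas as pd

# Made-up population timestamps.
population = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2020-01-06", "2020-07-19", "2020-12-25"])}
)

# Calendar components that a tree-based predictor can split on directly.
seasonal = population.assign(
    timestamp_month=population["timestamp"].dt.month,
    timestamp_weekday=population["timestamp"].dt.dayofweek,  # Monday = 0
)
print(seasonal)
```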

In [3]:
# ---
# 1. Capturing Autoregressive (AR) Effects in Sales
# ---

# Sales data often exhibits autoregressive patterns. In pipe_1, the MAE rose
# from 0.04749 on the training set to 0.06863 on the validation set, suggesting
# the baseline features didn’t generalize well. With only the join defined so far,
# FastProp aggregates over the full 6-week memory window, potentially missing
# short-term trends.
# To address this, we introduce a secondary join with a smaller 1-week memory window.

dm.population.join(
    dm.transaction,
    on="article_id",
    time_stamps=("timestamp", "t_dat"),
    memory=getml.data.time.weeks(1)
)

# This short-term join captures recent sales activity.  
# Advanced getML algorithms like MultiRel or Relboost can learn such AR effects
# without modifying the DataModel, but for FastProp, this adjustment is crucial.

# ---
# 2. Applying Seasonal Preprocessing
# ---

# The Seasonal preprocessor extracts temporal features (e.g., month, day-of-week) from time stamps. 
seasonal_preprocessor = getml.preprocessors.Seasonal()

# We only want it to affect the population’s `timestamp`, not the transaction
# table’s `t_dat`, so we exclude `t_dat` via the **subroles** concept:
transaction.set_subroles(["t_dat"], getml.data.subroles.exclude.seasonal)

# Rebuild the container to reflect this change.
container_2 = getml.data.Container(**populations)
container_2.add(customer, transaction)


# ---
# 3. Add aggregations to FastProp that help the predictor to catch temporal correlations
# ---

# FastProp’s default aggregations include count, sum, etc. We can add more advanced
# aggregations like Exponentially Weighted Moving Averages (EWMA) and quantiles to 
# capture temporal patterns.

additional_aggregations = {
    getml.feature_learning.aggregations.EWMA_1D,
    getml.feature_learning.aggregations.EWMA_7D,
    getml.feature_learning.aggregations.EWMA_30D,
    getml.feature_learning.aggregations.Q_1,
    getml.feature_learning.aggregations.Q_5,
    getml.feature_learning.aggregations.Q_10,
    getml.feature_learning.aggregations.Q_25,
    getml.feature_learning.aggregations.TIME_SINCE_FIRST_MINIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_LAST_MINIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_LAST_MAXIMUM,
    getml.feature_learning.aggregations.TIME_SINCE_FIRST_MAXIMUM,
}


# ---
# 4. Handling Categorical Features with `n_most_frequent`
# ---

# `n_most_frequent` in FastProp helps manage columns that contain many 
# categories (like `sales_channel_id`). If we set `n_most_frequent=2`, 
# FastProp will look at the two most frequent categories in that column and 
# create a fallback for everything else. This avoids explosive feature growth 
# when dealing with many possible categories.

n_most_frequent = 2


# ---
# 5. Limiting the Total Number of Features
# ---

# FastProp can generate a large number of features. It ranks them based on 
# their pairwise correlation with the target. The highest-ranking subset is kept. 
# Setting `num_features=200` means we retain only the top 200. This prevents 
# memory issues when feeding these features to non-memory-mapped models like XGBoost.

num_features = 200


# ---
# Building and Fitting the Enhanced Pipeline
# ---

pipe_2 = getml.Pipeline(
    data_model=dm,
    preprocessors=seasonal_preprocessor,
    feature_learners=[
        getml.feature_learning.FastProp(
            n_most_frequent=n_most_frequent,
            num_features=num_features,
            aggregation=(
                getml.feature_learning.FastProp.agg_sets.default 
                | additional_aggregations
            ),
        )
    ],
    predictors=[getml.predictors.XGBoostRegressor()],
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
)

# Fit the pipeline on the training set (check disabled for efficiency).
pipe_2.fit(container_2.train, check=False)

# Evaluate the pipeline on the validation set.
pipe_2.score(container_2.val)
pipe_2.scores


# ---
# Summary of Enhancements:
# - **Short-term trends** are captured with a 1-week memory join. 
# - **Seasonal patterns** are derived via preprocessing.  
# - **Advanced aggregations** extend FastProp’s ability to model temporal dynamics.  
# - **Categorical control** via `n_most_frequent` prevents feature explosion.  
# - **Feature limits** ensure efficient training on external models like XGBoost.  
# 
# These refinements lengthen the runtime (~40 minutes end-to-end) but reduce the
# MAE on the provided validation split from 0.06863 to 0.04747.

Staging... 100% • 00:05
Preprocessing... 100% • 00:22
FastProp: Trying 1584 features... 100% • 28:44
FastProp: Building features... 100% • 01:44
XGBoost: Training as predictor... 100% • 04:57

Time taken: 0:35:54.589582.

Staging... 100% • 00:05
Preprocessing... 100% • 00:00
FastProp: Building features... 100% • 00:10

|   | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2025-01-05 17:53:06 | train | sales | 0.04394 | 0.2819 | 0.677 |
| 1 | 2025-01-05 17:53:23 | val | sales | 0.04747 | 0.3853 | 0.6662 |
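
The effect of `num_features` can be pictured with a small pandas sketch: score each candidate feature by its absolute correlation with the target and keep the k best. The helper `top_k_by_correlation` and the toy data are hypothetical. getML performs its own selection inside the engine and may differ in the details.

```python
import pandas as pd


def top_k_by_correlation(features, target, k):
    """Return the names of the k features most correlated (in absolute value) with the target."""
    scores = features.corrwith(target).abs().sort_values(ascending=False)
    return scores.head(k).index.tolist()


# Made-up target and candidate features.
target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="sales")
features = pd.DataFrame({
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # weakly related to the target
    "weak": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],      # roughly follows the target
    "strong": target * 2,                         # perfectly correlated
})

selected = top_k_by_correlation(features, target, k=2)
print(selected)
```

Ranking by absolute value keeps strongly negative correlations as well, since they are just as informative for the predictor.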


---
## Exporting Features

##### *Generate features and export them for external use*

Now that FastProp has generated features, we can export them for external use. 
We enrich these features by merging item-level attributes from 
the article table. Since the article table shares a many-to-one relationship 
with the population, no additional aggregation is required. 

The article table includes metadata (e.g., department info, section, color) 
which can enhance downstream models, but not all columns seem relevant for sales prediction. 
Here, we pick columns such as 'department_name' and 'index_group_name' and
leave out item attributes such as color.
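
Because the enrichment relies on the article table having at most one row per `article_id`, it can be worth letting pandas enforce that assumption at merge time. The sketch below uses the `validate` and `indicator` arguments of `pd.merge` on made-up frames. All values are hypothetical, and this check is an optional addition rather than part of the notebook's export helper.

```python
import pandas as pd

# Made-up stand-ins for the FastProp features and the article metadata.
feats = pd.DataFrame({
    "article_id": ["1", "1", "2", "3"],
    "feature_1": [0.1, 0.2, 0.3, 0.4],
})
articles = pd.DataFrame({
    "article_id": ["1", "2"],
    "department_name": ["Jersey", "Denim"],
})

merged = pd.merge(
    feats,
    articles,
    on="article_id",
    how="left",
    validate="many_to_one",  # raises MergeError if articles has duplicate keys
    indicator=True,          # adds a `_merge` column describing each row's match
)

# Count population rows that found no article metadata.
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(merged)} rows have no article metadata")
merged = merged.drop(columns="_merge")
```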

Below, we define a helper function that:
1. Applies the fitted pipeline to transform data and extract FastProp features.
2. Merges article metadata (e.g., 'department_name') to enrich the feature set.
3. Exports the final features as Parquet files for later use with LightGBM.

In [4]:
import pandas as pd

def export_and_augment_sets(pipe, name, cont, dataset, article_cols=None):
    # 1. Transform data using the fitted pipeline to extract FastProp features.
    fastprop_feats = pipe.transform(cont, df_name=f"{name}_transform_final-hm-item").to_pandas()

    # Ensure consistent data types for merging
    fastprop_feats["article_id"] = fastprop_feats["article_id"].astype(str)

    if article_cols:
        # 2. Load article metadata, selecting relevant columns for enrichment.
        article_meta = pd.read_parquet(f"{dataset.cache_dir}/db/article.parquet")
        article_meta["article_id"] = article_meta["article_id"].astype(str)
        
        # Keep only unique rows of selected metadata for merging
        article_meta_feats = article_meta[["article_id"] + article_cols].drop_duplicates("article_id")

        # 3. Merge FastProp features with article metadata
        feats_all = pd.merge(fastprop_feats, article_meta_feats, on="article_id", how="left")
    else:
        feats_all = fastprop_feats

    # 4. Export the enriched feature set to Parquet format
    feats_all.to_parquet(f"{name}_features_final-hm-item.parquet", index=False)
    
    return feats_all

# Select article metadata columns for enrichment
article_meta_cols = ['department_name', 'index_group_name', 'section_name']

# Export features for train, validation, and test sets
train_feats = export_and_augment_sets(pipe_2, "train", container_2.train, dataset, article_meta_cols)
val_feats = export_and_augment_sets(pipe_2, "val", container_2.val, dataset, article_meta_cols)
test_feats = export_and_augment_sets(pipe_2, "test", container_2.test, dataset, article_meta_cols)

Staging... 100% • 00:05
Preprocessing... 100% • 00:02
FastProp: Building features... 100% • 01:46
Staging... 100% • 00:04
Preprocessing... 100% • 00:00
FastProp: Building features... 100% • 00:04
Staging... 100% • 00:05
Preprocessing... 100% • 00:00
FastProp: Building features...

---
## Training LightGBM

##### *Predict on exported feature table and evaluate results*

In this section, we train a LightGBM regressor using the features exported from the FastProp pipeline.  
We leverage Optuna for hyperparameter optimization (hyperopt) to improve performance.  


<br>

**Run the script from terminal:**

`python hm-item-lgbm_tuning.py &`

<br>

**Observe the log output with:**

`tail -f opt-hm-item.log`
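
The tuning script `hm-item-lgbm_tuning.py` is not reproduced in this notebook. As a rough idea of what such a script might contain, the sketch below tunes a LightGBM regressor with Optuna on the exported feature frames. Everything in it is an assumption for illustration: the dropped non-feature columns, the L1 objective, the search ranges, and the early-stopping settings are not taken from the actual script or from RelBench.

```python
import pandas as pd

TARGET = "sales"
# Assumption: the exported frames still carry these non-feature columns.
NON_FEATURE_COLS = ["article_id", "timestamp"]
CATEGORICAL_COLS = ["department_name", "index_group_name", "section_name"]


def to_xy(df, cat_dtypes):
    """Split an exported feature frame into model inputs and target."""
    X = df.drop(columns=[TARGET, *NON_FEATURE_COLS], errors="ignore").copy()
    for col, dtype in cat_dtypes.items():
        X[col] = X[col].astype(dtype)  # shared dtype keeps category codes aligned
    return X, df[TARGET]


def tune(train, val, n_trials=50):
    """Search LightGBM hyperparameters that minimize the validation MAE."""
    # Imported lazily so that `to_xy` is usable without these packages.
    import lightgbm as lgb
    import optuna

    cat_dtypes = {
        col: pd.CategoricalDtype(train[col].dropna().unique())
        for col in CATEGORICAL_COLS
    }
    X_train, y_train = to_xy(train, cat_dtypes)
    X_val, y_val = to_xy(val, cat_dtypes)

    def objective(trial):
        # Illustrative search space, not the one used for the reported result.
        model = lgb.LGBMRegressor(
            objective="regression_l1",
            n_estimators=2000,
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
            num_leaves=trial.suggest_int("num_leaves", 8, 512, log=True),
            min_child_samples=trial.suggest_int("min_child_samples", 5, 200),
            colsample_bytree=trial.suggest_float("colsample_bytree", 0.3, 1.0),
            verbose=-1,
        )
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_val, y_val)],
            eval_metric="l1",
            callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
        )
        return float((y_val - model.predict(X_val)).abs().mean())

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study
```

With the Parquet files written in the previous section, `tune(pd.read_parquet("train_features_final-hm-item.parquet"), pd.read_parquet("val_features_final-hm-item.parquet"))` would return an Optuna study whose `best_params` can then be used to fit a final model and score it on the test features.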

<br>

### **Result**
After approximately 10 hours of hyperopt and 50 trials, we achieve a test set MAE of 0.031.
For reference, this follows the same tuning schedule as used by RelBench.

Key insights:

- Our FastProp-driven features outperform features engineered manually by a data scientist, whose best result was an MAE of 0.036 on the same dataset.
- This highlights how automated feature engineering combined with hyperparameter tuning can deliver superior performance with minimal manual effort.