# 📚 Custom Data Transformation and Configuration Tutorial

## 🎯 Tutorial Overview

This guide provides a deep dive into the data transformation and configuration system in **TopoBench**. You will learn how to create, configure, and apply both default and custom preprocessing steps for any dataset.

This tutorial covers:

1.  **Understanding Dynamic Configuration** 🧠
    * Learn about configuration resolvers that automate parameter setup.
    * See how the framework intelligently selects transforms based on the model and data.

2.  **Defining Default Transforms** ⚙️
    * Create default transformation pipelines for specific datasets using YAML.
    * Override and customize parameters for your default transforms.

3.  **Implementing a Custom Transform** 🔨
    * Write your own transform class from scratch by inheriting from a base class.
    * Place the new transform within the project structure to make it accessible.

4.  **Configuring and Using New Transforms** 🔄
    * Integrate your custom transform into the framework using Hydra's configuration syntax.
    * Apply your new transform as part of a preprocessing pipeline.


---

### 🛠️ Technical Framework

This tutorial focuses on:
* The **TopoBench** library's data processing pipeline.
* **Hydra** for powerful and flexible configuration management.
* Inheriting from `torch_geometric.transforms.BaseTransform` to create new transforms.

---

### 🎓 Important Notes

* To make the learning process concrete, we'll work with practical examples using the **`US-county-demos`** and **`REDDIT-BINARY`** datasets.
* The principles shown here are general and can be applied to any dataset you wish to integrate.




## Preparing to Load the Custom Dataset: Understanding Configuration Imports

Before loading a dataset, it is important to understand the configuration imports, particularly those from the `topobench.utils.config_resolvers` module. These utility functions play a key role in dynamically configuring your machine learning pipeline.

### Key Imports for Dynamic Configuration

Let's import the essential configuration resolver functions:

```python
from topobench.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)
```

### Why These Imports Matter

```yaml
data_dir: ${paths.data_dir}/${dataset.loader.parameters.data_domain}/${dataset.loader.parameters.data_type}
```
Many configuration values can be set by simply referencing other parts of the configuration, as in the `data_dir` example above. This is ideal for constructing file paths or reusing constant values. However, some parameters are not static: their appropriate values depend on other components of your experiment, such as the specific dataset or model you are using.

Examples include the correct data transformations, the metric to monitor during training (e.g., accuracy vs. loss), the number of input channels for a model, and the choice of lifting given the model and dataset domains. Automating these choices requires logic that goes beyond simple variable substitution, and adjusting them manually for every experiment would be tedious and error-prone. This is where custom resolver functions come in: they embed this decision-making logic directly into the configuration itself.
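
To make the idea concrete, here is a minimal, dependency-free sketch of the kind of decision logic a resolver can hold. The function name `sketch_monitor_mode` and the metric list are illustrative assumptions, not TopoBench's implementation. In a Hydra project, a function like this would typically be registered with `OmegaConf.register_new_resolver` so that it can be called from YAML.

```python
# Illustrative only: the function name and the metric list are assumptions,
# not the actual TopoBench resolver.
LOWER_IS_BETTER = {"loss", "mae", "mse", "rmse"}


def sketch_monitor_mode(metric: str) -> str:
    """Return "min" for error-like metrics and "max" for score-like ones."""
    name = metric.lower().split("/")[-1]  # "val/loss" -> "loss"
    return "min" if name in LOWER_IS_BETTER else "max"


print(sketch_monitor_mode("val/loss"))      # min
print(sketch_monitor_mode("val/accuracy"))  # max
```

The point is that the choice depends on another configuration value (the metric), which plain variable substitution cannot express.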




### Practical Example: Dynamic Transforms

Consider the configuration in `projects/TopoBench/configs/run.yaml`, where the `transforms` parameter uses the `get_default_transform` function:

```yaml
transforms: ${get_default_transform:${dataset},${model}}
```

This syntax allows for automatic transformation selection based on the dataset and model, demonstrating the power of these configuration resolver functions.

By importing and utilizing these functions, you gain:
- Flexible configuration management
- Automatic parameter inference
- Reduced manual configuration overhead

These facilitate seamless dataset loading and preprocessing for multiple topological domains and provide an easy and intuitive interface for incorporating novel functionality.
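
As a rough illustration of how a transform could be chosen from the dataset and model, the sketch below compares the data domain with the model domain and picks a lifting name. The function `sketch_required_lifting` and the returned names are hypothetical placeholders; the real resolvers may use different rules and names.

```python
# Illustrative only: the returned config names are hypothetical placeholders.
def sketch_required_lifting(data_domain: str, model: str) -> str:
    """Pick a lifting config name from the data domain and a model path
    such as "hypergraph/unignn2" (the part before "/" is the model domain)."""
    model_domain = model.split("/")[0]
    if model_domain == data_domain:
        return "no_lifting"  # data already lives in the model's domain
    return f"{data_domain}2{model_domain}_default"


print(sketch_required_lifting("graph", "hypergraph/unignn2"))
print(sketch_required_lifting("graph", "graph/gcn"))
```

Under these assumptions, a graph dataset paired with a hypergraph model gets a graph-to-hypergraph lifting, while a graph model needs none.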





In [1]:
from hydra import compose, initialize
from hydra.utils import instantiate



from topobench.utils.config_resolvers import (
    get_default_transform,
    get_monitor_metric,
    get_monitor_mode,
    infer_in_channels,
)


# Step 1: Default Data Transformations ⚙️

While most datasets can be used directly after integration, some require specific preprocessing transformations. These transformations might vary depending on the task, model, or other conditions.

## Example Case: US-county-demos Dataset

Let's compose the configuration for this dataset with Hydra's `compose` function:
```python
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ], 
    return_hydra_config=True
)
```
Here the model, `hypergraph/unignn2`, belongs to the hypergraph domain, while the dataset belongs to the graph domain.
As a result, the `get_default_transform` resolver discussed above,

```yaml
transforms: ${get_default_transform:${dataset},${model}}
```
infers a default transform that lifts the data from the graph domain to the hypergraph domain.

In [2]:
initialize(config_path="../configs", job_name="job")
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/US-county-demos",
    ], 
    return_hydra_config=True
)
loader = instantiate(cfg.dataset.loader)


dataset, dataset_dir = loader.load()

print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['graph2hypergraph_lifting'])

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  initialize(config_path="../configs", job_name="job")


Transform name: dict_keys(['graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobench.transforms.data_transform.DataTransform', 'transform_type': 'lifting', 'transform_name': 'HypergraphKHopLifting', 'k_value': 1, 'feature_lifting': 'ProjectionSum', 'preserve_edge_attr': False, 'complex_dim': 1, 'neighborhoods': '${oc.select:model.backbone.neighborhoods,null}'}


Some datasets require default transforms that are applied whenever the dataset is used for modeling.

The TopoBench library provides a simple way to define custom transformations and apply them to a dataset.
Take a look at the `TopoBench/configs/transforms/dataset_defaults` folder, where you can find default transformations for different datasets.

For example, REDDIT-BINARY has no initial node features, and it is common practice to define the initial features as Gaussian noise.
Hence the `TopoBench/configs/transforms/dataset_defaults/REDDIT-BINARY.yaml` file includes the `equal_gaus_features` transform by default.
Whenever you load the REDDIT-BINARY dataset without modifying the `transforms` parameter, the `equal_gaus_features` transform is applied to the dataset.

```yaml
defaults:
  - data_manipulations: equal_gaus_features
  - liftings@_here_: ${get_required_lifting:graph,${model}}
```




Below is a quick tutorial on how to create a data transformation and how to define a sequence of default transformations that is executed whenever you use a given dataset config file.

In [3]:
# Do not override `transforms`, so that the dataset's default transforms are used
cfg = compose(
    config_name="run.yaml",
    overrides=[
        "model=hypergraph/unignn2",
        "dataset=graph/REDDIT-BINARY",
    ], 
    return_hydra_config=True
)
loader = instantiate(cfg.dataset.loader)


dataset, dataset_dir = loader.load()

In [4]:
dataset[0]

Data(edge_index=[2, 480], y=[1], num_nodes=218)

Take a look at the default transforms and the parameters of the `equal_gaus_features` transform:

In [5]:
print('Transform name:', cfg.transforms.keys())
print('Transform parameters:', cfg.transforms['equal_gaus_features'])

Transform name: dict_keys(['equal_gaus_features', 'graph2hypergraph_lifting'])
Transform parameters: {'_target_': 'topobench.transforms.data_transform.DataTransform', 'transform_name': 'EqualGausFeatures', 'transform_type': 'data manipulation', 'mean': 0, 'std': 0.1, 'num_features': '${dataset.parameters.num_features}'}


In [6]:
from topobench.data.preprocessor import PreProcessor
preprocessed_dataset = PreProcessor(dataset, dataset_dir, cfg['transforms'])

Processing...
Done!


In [7]:
preprocessed_dataset[0]

Data(x=[218, 10], edge_index=[2, 480], y=[1], incidence_hyperedges=[218, 218], num_hyperedges=[1], x_0=[218, 10], x_hyperedges=[218, 10], num_nodes=218)

The preprocessed dataset now has node features generated by the preprocessor, and its connectivity has been lifted to the hypergraph domain.
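
The square shape of `incidence_hyperedges` (`[218, 218]`) is consistent with a lifting that creates one hyperedge per node. The sketch below shows that idea for 1-hop neighbourhoods (compare `k_value: 1` in the printed lifting parameters) using plain Python lists. It is an illustration under stated assumptions (undirected edges, the centre node is part of its own hyperedge), not the code of `HypergraphKHopLifting`.

```python
# Illustrative sketch of a 1-hop lifting; not TopoBench's implementation.
def one_hop_incidence(num_nodes, edges):
    """Build a node-by-hyperedge incidence matrix where hyperedge j contains
    node j and all of its direct neighbours (undirected edges assumed)."""
    neighbours = [{j} for j in range(num_nodes)]  # assumption: centre node included
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return [
        [1 if i in neighbours[j] else 0 for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


# Path graph 0 - 1 - 2 - 3
incidence = one_hop_incidence(4, [(0, 1), (1, 2), (2, 3)])
for row in incidence:
    print(row)
```

With four nodes the result is a 4 x 4 matrix, mirroring the `[num_nodes, num_nodes]` shape seen above.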

### Creating your own default transforms

Now that we have seen how to add a custom dataset and how default transforms work, you might want to create your own default transforms for a new dataset, so that they are always executed whenever the dataset is used under its default configuration.


**To configure** the default transforms, navigate to `configs/transforms/dataset_defaults` and create `<def_transforms.yaml>` with the following content:

```yaml
defaults:
  - transform_1: transform_1
  - transform_2: transform_2
  - transform_3: transform_3
```


**Important**
There are different types of transforms, including `data_manipulations`, `liftings`, and `feature_liftings`. If you want to use multiple transforms from the same category, say `data_manipulations`, you must use a special syntax. See the Hydra documentation on configuration for more information, or the example below:

```yaml
defaults:
  - data_manipulations@first_usage: transform_1
  - data_manipulations@second_usage: transform_2
```


### Notes:

- **Transforms from the same category:** If there are two transforms from the same category, for example `data_manipulations`, you must use the `@` operator to assign a distinct name to each of them (for example, `first_usage` and `second_usage`).

- **Overriding parameters:** By default, `equal_gaus_features.yaml` relies on a special register to infer the feature dimension (the register logic is described in Step 3). If for some reason you want to set the `num_features` parameter explicitly, you can override it in the defaults file, without changing the transform's config file:

```yaml
defaults:
  - data_manipulations@equal_gaus_features: equal_gaus_features
  - data_manipulations@some_transform: some_transform
  - liftings@_here_: ${get_required_lifting:graph,${model}}

equal_gaus_features:
  num_features: 100
some_transform:
  some_param: bla
```

- We recommend always adding `liftings@_here_: ${get_required_lifting:graph,${model}}`, so that a default lifting is applied and any domain-specific topological model can be run.
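
Conceptually, the override in the defaults file behaves like a recursive dictionary merge in which the dataset-level values win. Hydra and OmegaConf perform this merge themselves; the helper `merge_overrides` and the starting value `num_features: 10` below are assumptions used only to approximate the semantics.

```python
import copy


def merge_overrides(defaults: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` into a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Values as they might come from the transform's own config file (illustrative).
transform_defaults = {
    "equal_gaus_features": {"mean": 0, "std": 0.1, "num_features": 10},
}
# Values set in the dataset defaults file.
dataset_overrides = {"equal_gaus_features": {"num_features": 100}}

merged = merge_overrides(transform_defaults, dataset_overrides)
print(merged["equal_gaus_features"])
```

Only `num_features` changes; `mean` and `std` keep the values from the transform's config.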

# Step 2: Custom Data Transformations ⚙️

### Creating a Transform

In general, any transform in the library inherits from the `torch_geometric.transforms.BaseTransform` class, which makes it possible to apply a sequence of transforms to the data. Our interface requires you to implement the `forward` method. The key point is that every transform takes a `torch_geometric.data.Data` object and returns an updated `torch_geometric.data.Data` object.



For datasets that lack node features, such as REDDIT-BINARY, we created the `equal_gaus_features` transform. It is a data manipulation transform, so it is placed in the `topobench/transforms/data_manipulation/` folder.
Below you can see the `EqualGausFeatures` class:


```python
import torch
import torch_geometric


class EqualGausFeatures(torch_geometric.transforms.BaseTransform):
    r"""A transform that generates equal Gaussian features for all nodes.

    Parameters
    ----------
    **kwargs : optional
        Additional arguments for the class. It should contain the following keys:
        - mean (float): The mean of the Gaussian distribution.
        - std (float): The standard deviation of the Gaussian distribution.
        - num_features (int): The number of features to generate.
    """

    def __init__(self, **kwargs):
        super().__init__()
        self.type = "generate_non_informative_features"

        # Sample a single Gaussian feature vector once; every node shares it.
        self.mean = kwargs["mean"]
        self.std = kwargs["std"]
        num_features = kwargs["num_features"]
        self.feature_vector = torch.normal(
            mean=self.mean, std=self.std, size=(1, num_features)
        )

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(type={self.type!r}, mean={self.mean!r}, std={self.std!r}, feature_vector={self.feature_vector!r})"

    def forward(self, data: torch_geometric.data.Data):
        r"""Apply the transform to the input data.

        Parameters
        ----------
        data : torch_geometric.data.Data
            The input data.

        Returns
        -------
        torch_geometric.data.Data
            The transformed data.
        """
        data.x = self.feature_vector.expand(data.num_nodes, -1)
        return data

```

As noted above, the `forward` method takes a `torch_geometric.data.Data` object as input, modifies it, and returns it.
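
The same "take data, modify it, return it" contract is what lets transforms be chained. The dependency-free sketch below imitates that contract with a plain dict standing in for `torch_geometric.data.Data`. The classes `AddConstantFeatures` and `AddNodeDegree` and the helper `apply_pipeline` are hypothetical and exist only for illustration.

```python
# Dependency-free stand-ins: a plain dict plays the role of
# `torch_geometric.data.Data`, and both classes are hypothetical.
class AddConstantFeatures:
    """Give every node the same feature vector (mimics EqualGausFeatures)."""

    def __init__(self, feature_vector):
        self.feature_vector = list(feature_vector)

    def __call__(self, data: dict) -> dict:
        data["x"] = [list(self.feature_vector) for _ in range(data["num_nodes"])]
        return data


class AddNodeDegree:
    """Store each node's degree, computed from an undirected edge list."""

    def __call__(self, data: dict) -> dict:
        degree = [0] * data["num_nodes"]
        for u, v in data["edges"]:
            degree[u] += 1
            degree[v] += 1
        data["degree"] = degree
        return data


def apply_pipeline(data: dict, transforms) -> dict:
    """Apply transforms in order; each receives the previous one's output."""
    for transform in transforms:
        data = transform(data)
    return data


sample = {"num_nodes": 3, "edges": [(0, 1), (1, 2)]}
result = apply_pipeline(sample, [AddConstantFeatures([0.1, -0.2]), AddNodeDegree()])
print(result["x"])
print(result["degree"])
```

Because each step returns the updated object, the order of the transforms determines what later steps can rely on.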

### Register the Transform

As with adding a dataset, the transforms you create and place in the right folder are registered automatically.
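
To give an intuition for what registration by name means, here is one common pattern: a base class records every subclass in a dictionary, and a builder looks the class up by the name given in a config file. This is a hypothetical sketch (`TRANSFORM_REGISTRY`, `RegisteredTransform`, and `build_transform` are invented names); TopoBench's own discovery mechanism may work differently.

```python
# Hypothetical registry; TopoBench's own discovery mechanism may differ.
TRANSFORM_REGISTRY = {}


class RegisteredTransform:
    """Base class whose subclasses are recorded by class name on definition."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        TRANSFORM_REGISTRY[cls.__name__] = cls


class IdentityTransform(RegisteredTransform):
    def __call__(self, data):
        return data


def build_transform(transform_name: str, **params):
    """Look up a transform class by the name used in the config file."""
    try:
        cls = TRANSFORM_REGISTRY[transform_name]
    except KeyError:
        raise ValueError(f"Unknown transform: {transform_name!r}") from None
    return cls(**params)


transform = build_transform("IdentityTransform")
print(sorted(TRANSFORM_REGISTRY))
print(transform({"num_nodes": 2}))
```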


### Create a configuration file 
Now that the transform is registered, we can create its configuration file and use it in the framework:

``` yaml
_target_: topobench.transforms.data_transform.DataTransform
transform_name: "EqualGausFeatures"
transform_type: "data manipulation"

mean: 0
std: 0.1
num_features: ${dataset.parameters.num_features}
``` 
Please refer to `configs/transforms/dataset_defaults/equal_gaus_features.yaml` for the example. 

**Notes:**

- You might notice an interesting key, `_target_`, in the configuration file. For any new transform, `_target_` is always `topobench.transforms.data_transform.DataTransform`. For more information, please refer to the ["Instantiating objects with Hydra" section of the Hydra documentation](https://hydra.cc/docs/advanced/instantiate_objects/overview/).
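
The role of `_target_` can be illustrated with a simplified stand-in for `hydra.utils.instantiate`: import the class named by the dotted path and call it with the remaining keys as keyword arguments. The helper `sketch_instantiate` is an assumption for illustration, and the real Hydra function supports more (for example, nested configs). A standard-library class is used as the target so the example runs without TopoBench installed.

```python
import importlib


def sketch_instantiate(config: dict):
    """Import the class named by `_target_` and call it with the other keys."""
    params = dict(config)  # copy so the caller's config is not modified
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**params)


# A standard-library class stands in for a transform here.
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = sketch_instantiate(config)
print(type(obj).__name__, obj)
```

In the transform config above, the remaining keys (`transform_name`, `transform_type`, `mean`, `std`, `num_features`) play the role of these keyword arguments.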