## The end (?)

We've reached **the end** of this short Ph.D. course about reproducibility and deep learning experiments.

#### What to do?

## deasy-learning

In this **bonus** lecture, I'm going to briefly show you some functionalities of my small **custom library**.

#### Disclaimer

- What you are going to see is not **groundbreaking** and has probably been well known for 50 years. I just couldn't find anything suitable... but I'm a picky person from time to time...

- **No official code yet**: the library is currently undergoing a **major refactoring**!

- **ETA** release date: approximately end of May/beginning of June

- The chosen **name** is embarrassing! Any suggestions? (*DISI $\rightarrow$ deasy $\rightarrow$ diversified easy learning*)

- **Ping me** if you are interested! (*don't you have anything else to do?*)

## What are we going to cover?

- ``deasy-learning``
    - Motivational intro $\rightarrow$ *why write a custom library*
    - Overview
    - Configuration and Component
    - Registration
    - Training an SVM model with deasy-learning
- Evaluation questionnaire
- Motivational outro

## Motivational Intro

*Shouldn't you be doing research instead of jamming with your **mechanical** keyboard to write buggy code?*

### Why deasy-learning 

*Take a seat, my friend, for I'm telling you my story now*

Around the middle of my Ph.D., I noticed that:

- I was **copy-pasting** quite a lot of code from one project to another (*w/ several bugs*)

- The majority of my **most frequent errors** were about running experiments with the **wrong hyper-parameters configuration**

- I had a lot of models (variants included) to test and their **management** was becoming **cumbersome**

### An example

For one paper, I had to test a GNN:

- A basic model (**B**)

- A variant with an additional layer (**B + layer**)

- **3** variants of **B + layer** with a specific regularization (**B + layer + regs**)

- **All** the above models had to be tested on **5 different datasets**

- **All** the above models had to be tested on **2** different **input** configurations (**I1**, **I2**)

$\rightarrow $ *Because yes*

#### Totaling: 10 models per dataset $\rightarrow$ 50 models

Actually, I also had an LSTM baseline and a BERT model, so the total number of models was **150**.

$\rightarrow$ *Yes, I like to hurt myself*

### An Example (cont'd)

I was **confident** and started writing up model configurations in JSON format (*a simplified version*)

```
"dataset1_B_GNN_I1": {
        "rnn_encoding": false,
        "representation_mlp_weights": [
                    64,
                    256,
                    512
                ],
        "merge_vocabularies": false,
        "build_embedding_matrix": true,
        "embedding_model_type": "glove",
        "input_dropout_rate": 0.4,
        "feature_class": "pos_text_graph_dep_adj_features",
        "is_directional": false,
        "clip_gradient": true,
        "embedding_dimension":  200,
        "l2_regularization": 0.0002,
        "max_grad_norm":  40,
        "gcn_info": {
                "0": {
                    "message_weights": [
                        64
                    ],
                    "aggregation_weights": [
                        512,
                        512
                    ],
                    "node_weights": [
                        64
                    ],
                    "pooling_weights": [
                        64,
                        1
                    ]
                }
            },
        "use_position_feature":  true,
        "tokenizer_args": {
                "filters": "",
                "oov_token": 1
        },
        "weight_predictions": true,
        "add_gradient_noise": true,
        "dropout_rate": 0.4,
        "optimizer_args": {
                "learning_rate": 0.0002
        },
        "max_graph_nodes_limit": 140
    }
```

Eventually:

- My JSON configuration file ended up having **more than 8k lines** - barely readable (*not all models were yet listed!*)

- I could **hardly recall** (each day) each parameter type, allowed values, potential conflicts, etc...

- I was **still messing around** with wrong hyper-parameter settings (*my eyes @_@*)

- I was **wasting quite a lot of time** by just making errors and setting up the right experiment

### I want to be lazy!

I looked at my configuration hell and thought about its **limitations**...

- **Hardly** readable (*wanna collaborate? enjoy...*)

- **No typing** (*err, what was the accepted format of this parameter?*)

- Custom types (e.g., numpy.ndarray) had to be **converted**...

- **No** possibility to define configuration **constraints**

- **No** possibility to add **descriptions**

#### I'm fucking lazy...

- I don't want to **waste time** on writing my configuration in a different format

- I want to run **different** experiments with an **effort comparable to a mouse click**

- I want to **focus** on my research (*Nobody believes you, Fede!*)

### F\*\*\* you configuration files!

- F\*\*\* you JSON
- F\*\*\* you JSONL
- F\*\*\* YAML (*horrible*)
- F\*\*\* JSONNET (*kill me*)
- F\*\*\* TXT (*really?*)
- F\*\*\* CFG
- **BIG** F\*\*\* to command line arguments (*I'll never be your friend*)

$\rightarrow$ I want to write my configurations in **Python**!

### Why not some existing library?

Well, this is entirely due to my **personal experience** and the fact that I'm a very stupid user

- [AllenNLP](https://allenai.org/allennlp): cool features but fucking **hard** to **customize** unless you are on the AllenNLP team

- [ParlAI](https://parl.ai/): cool if you only need the three commands written in the **tutorial page**. Otherwise, good luck!

- [Hugging Face](https://huggingface.co/): super cool if you are working with transformers but **horrible configuration** format $\rightarrow$ you need an open tab on their documentation to **understand each hyper-parameter**

- TensorFlow/Torch/Keras: no simple way to **define configurations** except for **flags**. You'll never have me!!!

#### Disclaimer: 

I'm **NOT** saying these are bad libraries at all! It's just that I was not able to use them to solve my issues...

$\rightarrow$ Why can't I run my SVM, decision tree, LSTM, BERT models using the **same configuration** and **training format**?

## Overview

### Inspiration

I remembered a cool feature of some library (*I think it was AllenNLP...*) 

- You could write your model and **register it** so that you could run your experiment from the command line (*aaargh...*)

This feature led me to the following conclusions:

1. **Separate** configuration from logic
2. **Register** configuration and logic separately to **quickly use them later** and organize your working environment

$\rightarrow$ Formally, I denote the configuration as ``Configuration`` and the logic as ``Component``.

### A visual depiction

A ``Component`` **is built** via its ``Configuration``.

<center>
<div>
<img src="Images/Lecture-5/conf_and_comp.png" width="1200"/>
</div>
</center>

## Binding

We **bind** the ``Configuration`` to its ``Component``.

<center>
<div>
<img src="Images/Lecture-5/conf_and_comp_bound.png" width="1200"/>
</div>
</center>

## Registration

We **register** the ``Configuration`` to remember it.

<center>
<div>
<img src="Images/Lecture-5/registration.png" width="1200"/>
</div>
</center>

## Registration (cont'd)

To remember a ``Configuration`` we simply define a ``RegistrationKey`` (*a compound dictionary key*)

<br/>

<center>
<div>
<img src="Images/Lecture-5/registration_with_key.png" width="1200"/>
</div>
</center>

## Binding (cont'd)

After registration, we can bind a ``Configuration`` to a ``Component`` using the ``RegistrationKey`` defined for the ``Configuration``.

<br/>

<center>
<div>
<img src="Images/Lecture-5/binding_with_key.png" width="1200"/>
</div>
</center>

### To sum up

- We **define** our ``Component``
- We **define** its corresponding ``Configuration``
- We **register** it to the ``Registry`` with a ``RegistrationKey``
- We **bind** the ``Configuration`` to ``Component`` by using the configuration ``RegistrationKey``

Thus, the ``RegistrationKey`` of our ``Configuration`` allows us to:
- **Retrieve** the registered ``Configuration`` from the ``Registry``
- **Retrieve** the ``Component`` that our registered ``Configuration`` is bound to

#### That's it! If you get this, you get 99% of deasy-learning!
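
To make the define/register/bind cycle concrete, here is a minimal sketch. It is **not** the deasy-learning code: ``ToyRegistry``, its ``build_component`` method, and the stand-in ``Configuration`` and ``Component`` classes are hypothetical names made up for illustration. The only thing taken from these slides is the idea of a compound, hashable key pointing to a configuration and its bound component.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Type


@dataclass(frozen=True)
class RegistrationKey:
    """Hypothetical compound key: frozen, hence hashable, so it can index a dict."""
    name: str
    namespace: str
    tags: FrozenSet[str] = frozenset()

    def __str__(self) -> str:
        return f"name:{self.name}--tags:{sorted(self.tags)}--namespace:{self.namespace}"


class Configuration:
    """Stand-in for the real Configuration: it just stores values."""
    def __init__(self, **values):
        self.values = values


class Component:
    """Stand-in for the real Component: it is built from its Configuration."""
    def __init__(self, config: Configuration):
        self.config = config


class ToyRegistry:
    """Two plain dictionaries: key -> configuration class, key -> component class."""
    _configurations: Dict[RegistrationKey, Type[Configuration]] = {}
    _bindings: Dict[RegistrationKey, Type[Component]] = {}

    @classmethod
    def register_and_bind(cls, key, configuration_class, component_class):
        if key in cls._configurations:
            raise KeyError(f"Already registered: {key}")
        cls._configurations[key] = configuration_class   # registration
        cls._bindings[key] = component_class             # binding

    @classmethod
    def build_component(cls, key, **values) -> Component:
        # The same key retrieves both the configuration and its bound component
        config = cls._configurations[key](**values)
        return cls._bindings[key](config)


key = RegistrationKey(name="data_loader", namespace="examples", tags=frozenset({"imdb"}))
ToyRegistry.register_and_bind(key, Configuration, Component)
component = ToyRegistry.build_component(key, samples_amount=500)
print(key)                      # name:data_loader--tags:['imdb']--namespace:examples
print(component.config.values)  # {'samples_amount': 500}
```

The key is a frozen dataclass on purpose: a mutable key could not be used as a dictionary key.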

## Configuration and Component

*Let's delve into the details!*

### Configuration parameters

A ``Configuration`` is comprised of ``Parameter`` objects

<center>
<div>
<img src="Images/Lecture-5/configuration.png" width="1200"/>
</div>
</center>

A ``Parameter`` is essentially a wrapper for each attribute of ``Configuration``

### Why ``Parameter``?

Essentially, ``Parameter`` is a useful wrapper for storing additional metadata

- type hints
- descriptions
- allowed value range
- possible variants of interest
- optional tags for quickly retrieving a certain subset of parameters
- ...
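
As a rough illustration of the metadata listed above, a ``Parameter`` could look like the dataclass below. This is a sketch under my own assumptions, not the library's class: the field names ``allowed_range`` and ``tags`` and the ``check()`` method are hypothetical, and ``type_hint`` is limited to plain Python types (no ``List[...]`` and friends).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Set


@dataclass
class Parameter:
    """Hypothetical sketch: a value plus the metadata that describes it."""
    name: str
    value: Any = None
    type_hint: Optional[type] = None
    description: str = ""
    allowed_range: Optional[Callable[[Any], bool]] = None
    variants: List[Any] = field(default_factory=list)
    tags: Set[str] = field(default_factory=set)
    is_required: bool = False

    def check(self) -> List[str]:
        """Return a list of human-readable problems (empty if everything is fine)."""
        problems = []
        if self.value is None:
            if self.is_required:
                problems.append(f"{self.name}: required but missing")
            return problems
        if self.type_hint is not None and not isinstance(self.value, self.type_hint):
            problems.append(f"{self.name}: expected {self.type_hint.__name__}, "
                            f"got {type(self.value).__name__}")
        if self.allowed_range is not None and not self.allowed_range(self.value):
            problems.append(f"{self.name}: value {self.value!r} is out of range")
        return problems


dropout = Parameter(name="dropout_rate", value=1.4, type_hint=float,
                    description="Dropout probability",
                    allowed_range=lambda v: 0.0 <= v < 1.0,
                    variants=[0.2, 0.4], tags={"regularization"})
print(dropout.check())  # ['dropout_rate: value 1.4 is out of range']
```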

### Configuration class

In [None]:
class Configuration:
    
    def add(self, param: Parameter):
        ...
    
    def add_condition(self, condition: Callable[['Configuration'], bool], name: Optional[str] = None):
        ...
        
    def validate(self):
        ...
        
    def search(self, search_key, exact_match):
        ...
        
    def get_delta_copy(self, key_value_dict):
        ...
        
    @classmethod
    def get_default(cls):
        ...

### What does a ``Configuration`` do?

A ``Configuration`` is essentially an extension of a Python dictionary

- You can add ``Parameter``
- You can add **conditions** (i.e., callable functions) relating multiple ``Parameter``
- You can ``validate()`` your ``Configuration``: running all conditions to check for errors
- You can **quickly search** for ``Parameter``
- You can **quickly get a delta copy** of your ``Configuration`` via a simple key-value dictionary
- You can **specify the template** (``get_default()``) of your ``Configuration`` $\rightarrow$ a readable and detailed specification!
- Lastly, since it is a Python object, you can define ``Configuration`` subclasses via **inheritance**!
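
The bullets above can be mimicked with a few lines of plain Python. The following ``ToyConfiguration`` is a hypothetical stand-in written for these slides (it skips ``Parameter`` objects, search, and ``get_default()``), so take it as a sketch of the idea rather than the actual implementation.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple


class ToyConfiguration:
    """Hypothetical dictionary-like configuration with validation conditions."""

    def __init__(self):
        self._values: Dict[str, Any] = {}
        self._conditions: List[Tuple[str, Callable[["ToyConfiguration"], bool]]] = []

    def add_short(self, name: str, value: Any = None) -> None:
        self._values[name] = value

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails: expose parameters as attributes
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name)

    def add_condition(self, condition, name: str) -> None:
        self._conditions.append((name, condition))

    def validate(self) -> None:
        # Run every condition and report the ones that do not hold
        failed = [name for name, condition in self._conditions if not condition(self)]
        if failed:
            raise ValueError(f"Failed conditions: {failed}")

    def get_delta_copy(self, key_value_dict: Dict[str, Any]) -> "ToyConfiguration":
        unknown = set(key_value_dict) - set(self._values)
        if unknown:
            raise KeyError(f"Unknown parameters: {sorted(unknown)}")
        clone = copy.deepcopy(self)           # the original stays untouched
        clone._values.update(key_value_dict)  # apply only the differences
        return clone


config = ToyConfiguration()
config.add_short("param_1", True)
config.add_short("param_2", True)
config.add_condition(lambda c: c.param_1 == c.param_2, name="params_must_match")
config.validate()  # passes silently

broken = config.get_delta_copy({"param_2": False})
try:
    broken.validate()
except ValueError as error:
    print(error)  # Failed conditions: ['params_must_match']
```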

### An example

In [None]:
class DataLoaderConfig(Configuration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='name',
                         type_hint=str,
                         description="Unique dataset identifier",
                         is_required=True)
        config.add_short(name='has_test_split_only',
                         value=False,
                         type_hint=bool,
                         description="Whether DataLoader has test split only or not")
        config.add_short(name='has_val_split',
                         value=True,
                         type_hint=bool,
                         description="Whether DataLoader has a val split or not")
        config.add_short(name='has_test_split',
                         value=True,
                         type_hint=bool,
                         description="Whether DataLoader has a test split or not")

        return config

### Another example

In [None]:
class ConfigA(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigA':
        config = super().get_default()

        config.add_short(name='param_1',
                         value=True,
                         type_hint=bool)
        config.add_short(name='param_2',
                         value=True,
                         type_hint=bool)

        config.add_condition(condition=lambda p: p.param_1 == p.param_2)

        return config

### Nesting

A ``Configuration`` can also include another (or multiple) ``Configuration``.

In [None]:
class ParentConfig(Configuration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='param_1',
                         value=True,
                         type_hint=bool)
        config.add_short(name='param_2',
                         value=False,
                         type_hint=bool)
        config.add_short(name='child_A',
                         value=RegistrationKey(name='config_a',
                                               namespace='testing'),   # <--- This assumes that ConfigA is registered
                         is_registration=True)   # <--- metadata
        return config
    
class ConfigA(Configuration):
    ...

This allows us to define complex ``Component`` like

- A data-loading pipeline
- A data pre-processing pipeline
- A training routine
- ...

### An example

In [None]:
class ProcessorPipelineConfig(Configuration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='processors',
                         type_hint=List[RegistrationKey],
                         description='List of processors to be executed in sequence',
                         is_required=True,
                         is_registration=True)

        return config

In [None]:
class ProcessorPipeline(Component):
    
    def run(
            self,
            data: Optional[FieldDict] = None,
            is_training_data: bool = False
    ):
        for processor in self.processors:
            data = processor.run(data=data,
                                 is_training_data=is_training_data)
        return data

### Variants

In many cases, you may need **multiple** ``Configuration`` instances to define different scenarios.

- Work at ``Parameter`` level to specify variants
- Work at ``Configuration`` level to specify explicit configuration variants

### ``Parameter`` level variants

Simply list the possible values via the ``Parameter.variants`` field.

In [None]:
class ConfigA(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigA':
        config = super().get_default()

        config.add_short(name='param_1',
                         value=True,
                         type_hint=bool,
                         variants=[False, True])    # <---
        config.add_short(name='param_2',
                         value=True,
                         type_hint=bool,
                         variants=[False, True])    # <---

        config.add_condition(condition=lambda p: p.param_1 == p.param_2)

        return config

### Configuration level variants

We can explicitly define new ``Configuration`` via two decorators: ``supports_variants``, ``add_variant``

In [None]:
@supports_variants
class ConfigA(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigA':
        config = super().get_default()

        config.add_short(name='param_1',
                         value=True,
                         type_hint=bool,
                         variants=[False, True])
        config.add_short(name='param_2',
                         value=True,
                         type_hint=bool,
                         variants=[False, True])

        config.add_condition(condition=lambda p: p.param_1 == p.param_2)

        return config
    
    @classmethod
    @add_variant(variant_name='variant1')
    def variant1(cls):
        config = cls.get_default()
        config.param_1 = False
        config.param_2 = False
        return config
    
    @classmethod
    @add_variant(variant_name='variant2')
    def variant2(cls):
        config = cls.get_default()
        config.param_1 = True
        config.param_2 = True
        return config

### Registering variants

Both kinds of variants can be **automatically** taken into account during registration.

In [None]:
Registry.register_and_bind_configuration_variants(configuration_class=ConfigA,
                                                  component_class=ComponentA,
                                                  name='config_a',
                                                  namespace='testing',
                                                  allow_parameters_variants=True)   # <---

Variant registration is controlled by ``allow_parameters_variants``

- [**True**] **Ignore** ``add_variant`` declarations and look for ``Parameter`` level variants
- [**False**] **Only** consider ``add_variant`` declarations

### Variants and Nesting

Registering variants is a powerful tool since it **supports** ``Configuration`` nesting!

Consider your complex pipeline: data-loading, pre-processing, model training, etc...

- You can write it as a combination of ``Configuration`` and ``Component`` classes
- You can define variants
- You can specify conditions $\rightarrow$ only **valid variants** are considered!
- You can register all possible valid variant combinations in one shot!
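
To illustrate what "only valid variants are considered" means, here is a minimal sketch that enumerates every ``Parameter`` level combination and keeps those that satisfy all conditions. The function ``iter_valid_variants`` is a hypothetical helper of mine and works on plain dictionaries; how the library actually enumerates and filters variants may differ.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterator, List


def iter_valid_variants(
        variants: Dict[str, List[Any]],
        conditions: List[Callable[[Dict[str, Any]], bool]],
) -> Iterator[Dict[str, Any]]:
    """Yield every combination of variant values that satisfies all conditions."""
    names = sorted(variants)
    for values in product(*(variants[name] for name in names)):
        candidate = dict(zip(names, values))
        if all(condition(candidate) for condition in conditions):
            yield candidate


variants = {"param_1": [False, True], "param_2": [False, True]}
conditions = [lambda p: p["param_1"] == p["param_2"]]
valid = list(iter_valid_variants(variants, conditions))
for combination in valid:
    print(combination)
# {'param_1': False, 'param_2': False}
# {'param_1': True, 'param_2': True}
```

Out of the four possible combinations, only the two that respect the condition survive, so only those would be registered.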

## An Example

In [None]:
class ConfigA(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigA':
        config = super().get_default()
        config.add_short(name='param_1', value=True, type_hint=bool, variants=[False, True])
        config.add_short(name='child', value=RegistrationKey(name='config_b',
                                                             namespace='testing'), is_registration=True)
        return config


class ConfigB(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigB':
        config = super().get_default()
        config.add_short(name='param_1', value=1, type_hint=int, variants=[1, 2])
        config.add_short(name='child', value=RegistrationKey(name='config_c',
                                                             namespace='testing'), is_registration=True)
        return config


class ConfigC(Configuration):

    @classmethod
    def get_default(
            cls
    ) -> 'ConfigC':
        config = super().get_default()
        config.add_short(name='param_1', value=False, type_hint=bool, variants=[False, True])
        return config


In [None]:
if __name__ == '__main__':
    Registry.register_and_bind(configuration_class=ConfigB,
                               configuration_constructor=ConfigB.get_default,
                               component_class=Component,
                               name='config_b',
                               namespace='testing')
    Registry.register_and_bind(configuration_class=ConfigC,
                               configuration_constructor=ConfigC.get_default,
                               component_class=Component,
                               name='config_c',
                               namespace='testing')

    for config_regr_key in Registry.register_and_bind_configuration_variants(configuration_class=ConfigA,
                                                                             component_class=Component,
                                                                             name='config_a',
                                                                             namespace='testing',
                                                                             allow_parameters_variants=True):
        print(config_regr_key)

In [None]:
name:config_a--tags:['child.child.param_1=False', 'child.param_1=1', 'param_1=False']--namespace:testing
name:config_a--tags:['child.child.param_1=False', 'child.param_1=1', 'param_1=True']--namespace:testing
name:config_a--tags:['child.child.param_1=False', 'child.param_1=2', 'param_1=False']--namespace:testing
name:config_a--tags:['child.child.param_1=False', 'child.param_1=2', 'param_1=True']--namespace:testing
name:config_a--tags:['child.child.param_1=True', 'child.param_1=1', 'param_1=False']--namespace:testing
name:config_a--tags:['child.child.param_1=True', 'child.param_1=1', 'param_1=True']--namespace:testing
name:config_a--tags:['child.child.param_1=True', 'child.param_1=2', 'param_1=False']--namespace:testing
name:config_a--tags:['child.child.param_1=True', 'child.param_1=2', 'param_1=True']--namespace:testing

### Calibration

What about hyper-parameter calibration?

Calibration is **implemented via** ``Configuration`` nesting

- ``TunableConfiguration`` is a ``Configuration`` that by default has a ``Parameter`` called ``calibration_config``.
- ``calibration_config`` is a ``Configuration`` that defines the search space

#### Why?

This allows you to define **multiple** search spaces and quickly switch from one to another.

Essentially, we are **decoupling** a ``Configuration`` from its hyper-parameter search.

Besides, this also **inherently allows nesting** ``Configuration`` search spaces!

#### Difference with variants

Variants define what you **want to test**

Each variant may **undergo calibration**

### An example

In [None]:
class ConfigA(TunableConfiguration):

    @classmethod
    def get_default(cls):
        config = super().get_default()
        config.add_short(name='param1', value=1, type_hint=int)
        config.add_short(name='param2', value=True, type_hint=bool)
        config.calibration_config = RegistrationKey(name='calibration',
                                                    tags={'config_a'},
                                                    namespace='testing')

        return config

    
class CalConfigA(Configuration):

    @classmethod
    def get_default(cls):
        config = super().get_default()
        config.add_short(name='search_space',
                         value={
                             'param1': [1, 2, 3],
                             'param2': [False, True]
                         })
        return config
    
if __name__ == '__main__':
    config = ConfigA.get_default()
    search_space = config.get_search_space()
    print(search_space)

In [None]:
{'param1': [1, 2, 3], 'param2': [False, True]}

### An example with nesting

In [None]:
class ConfigA(TunableConfiguration):

    @classmethod
    def get_default(cls):
        config = super().get_default()
        config.add_short(name='param1', value=1, type_hint=int)
        config.add_short(name='param2', value=True, type_hint=bool)
        config.add_short(name='child',                               # <-- Pay attention to this name!
                         value=RegistrationKey(name='config_b',
                                               namespace='testing'),
                         is_registration=True)
        config.calibration_config = RegistrationKey(name='calibration',
                                                    tags={'config_a'},
                                                    namespace='testing')

        return config

class ConfigB(TunableConfiguration):

    @classmethod
    def get_default(cls):
        config = super().get_default()
        config.add_short(name='param1',
                         value=True,
                         type_hint=bool)
        config.calibration_config = RegistrationKey(name='calibration',
                                                    tags={'config_b'},
                                                    namespace='testing')
        return config

In [None]:
class CalConfigA(Configuration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()
        config.add_short(name='search_space',
                         value={
                             'param1': [1, 2, 3],
                             'param2': [False, True]
                         })
        return config

class CalConfigB(Configuration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()
        config.add_short(name='search_space',
                         value={
                             'param1': [False, True]
                         })
        return config

if __name__ == '__main__':
    config = ConfigA.get_default()
    search_space = config.get_search_space()
    print(search_space)

In [None]:
{'child.param1': [False, True], 'param1': [1, 2, 3], 'param2': [False, True]}
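
The dotted key ``child.param1`` comes from prefixing the child's search space with the name of the parameter that holds it. A minimal sketch of that flattening, assuming configurations are plain nested dictionaries (the ``children`` key and the ``collect_search_space`` function are hypothetical, not library APIs):

```python
from typing import Any, Dict, List


def collect_search_space(config: Dict[str, Any], prefix: str = "") -> Dict[str, List[Any]]:
    """Merge a configuration's own search space with those of its children."""
    space = {f"{prefix}{name}": values
             for name, values in config.get("search_space", {}).items()}
    for child_name, child in config.get("children", {}).items():
        # Child parameters are prefixed with the name they are nested under
        space.update(collect_search_space(child, prefix=f"{prefix}{child_name}."))
    return dict(sorted(space.items()))


config_b = {"search_space": {"param1": [False, True]}}
config_a = {
    "search_space": {"param1": [1, 2, 3], "param2": [False, True]},
    "children": {"child": config_b},
}
print(collect_search_space(config_a))
# {'child.param1': [False, True], 'param1': [1, 2, 3], 'param2': [False, True]}
```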

### Component delta copy from search space

In [None]:
if __name__ == '__main__':
    config = ConfigA.get_default()
    
    # Get search space
    search_space = config.get_search_space()
    combinations = get_dict_values_combinations(search_space)

    # Get component
    component = Registry.retrieve_component(name='config_a',
                                            namespace='testing')

    # Get a delta copy
    copy_component = component.get_delta_copy(params_dict=combinations[0])
    print(combinations)
    print(combinations[0])

In [None]:
[{'child.param1': False, 'param1': 1, 'param2': False}, {'child.param1': False, 'param1': 1, 'param2': True}, {'child.param1': False, 'param1': 2, 'param2': False}, {'child.param1': False, 'param1': 2, 'param2': True}, {'child.param1': False, 'param1': 3, 'param2': False}, {'child.param1': False, 'param1': 3, 'param2': True}, {'child.param1': True, 'param1': 1, 'param2': False}, {'child.param1': True, 'param1': 1, 'param2': True}, {'child.param1': True, 'param1': 2, 'param2': False}, {'child.param1': True, 'param1': 2, 'param2': True}, {'child.param1': True, 'param1': 3, 'param2': False}, {'child.param1': True, 'param1': 3, 'param2': True}]
{'child.param1': False, 'param1': 1, 'param2': False}
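
The implementation of ``get_dict_values_combinations`` is not shown in these slides. A plausible sketch, and only a guess at what it does, is a Cartesian product over the search space values; with the search space above it yields the same 12 combinations in the same order as the printed output.

```python
from itertools import product
from typing import Any, Dict, List


def get_dict_values_combinations(search_space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Guessed implementation: one dictionary per element of the Cartesian product."""
    keys = list(search_space)
    return [dict(zip(keys, values))
            for values in product(*(search_space[key] for key in keys))]


search_space = {'child.param1': [False, True], 'param1': [1, 2, 3], 'param2': [False, True]}
combinations = get_dict_values_combinations(search_space)
print(len(combinations))  # 12
print(combinations[0])    # {'child.param1': False, 'param1': 1, 'param2': False}
```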

### Component

The ``Component`` is a simple interface that **doesn't define** any specific **behaviour**

- I really like freedom of choice
- You can define your preferred APIs since **you decide** how ``Component`` are nested

In [None]:
class Component(ABC):

    def __init__(
            self,
            config: Configuration,
            post_build: bool = True
    ):
        self.config = config
        if post_build:
            self.config.post_build()
        self.config.validate()
        
    def get_delta_copy(
            self,
            params_dict: Dict[str, Any]
    ):
        ...
        
    @serialize_save_and_load
    def save(self, serialization_path: Union[AnyStr, Path]):
        ...
        
    @serialize_save_and_load
    def load(self, serialization_path: Union[AnyStr, Path]):
        ...

    @abstractmethod
    def run(
            self,
            serialization_path: Optional[Union[AnyStr, Path]] = None,
            serialize: bool = False,
    ) -> Any:
        pass

### Component dynamics


#### ``__init__``

- The ``Component`` **only requires** its ``Configuration`` (*just like in Hugging Face...*)
- The ``post_build(...)`` method builds **all nested** ``Configuration`` into their bound ``Component``
- The ``Configuration`` is also validated after build (*it was already validated before build operation*)

#### ``run(...)``

- A general single interface that can be **arbitrarily extended**

#### ``get_delta_copy(...)``

- Quick API to get a delta **copy** of the original ``Component``

#### ``save(...)`` and ``load(...)``

- ``Component`` **automatically supports** serialization of its own state for quick and repeated re-use
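
A toy version of the delta copy and save/load dynamics is sketched below. Everything in it is an assumption for illustration: ``ToyComponent`` stores its configuration as a nested dictionary, resolves dotted keys such as ``child.param1`` by walking that dictionary, and uses ``pickle`` for serialization, which may not be what the library does.

```python
import copy
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


class ToyComponent:
    """Hypothetical component whose configuration is a (possibly nested) dict."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config

    def get_delta_copy(self, params_dict: Dict[str, Any]) -> "ToyComponent":
        new_config = copy.deepcopy(self.config)  # never touch the original
        for dotted_key, value in params_dict.items():
            *parents, leaf = dotted_key.split(".")
            target = new_config
            for parent in parents:               # walk down to the nested config
                target = target[parent]
            if leaf not in target:
                raise KeyError(f"Unknown parameter: {dotted_key}")
            target[leaf] = value
        return ToyComponent(new_config)

    def save(self, serialization_path: Path) -> None:
        serialization_path.write_bytes(pickle.dumps(self.config))

    def load(self, serialization_path: Path) -> None:
        self.config = pickle.loads(serialization_path.read_bytes())


original = ToyComponent({"param1": 1, "param2": True, "child": {"param1": True}})
variant = original.get_delta_copy({"child.param1": False, "param1": 3})
print(original.config)  # {'param1': 1, 'param2': True, 'child': {'param1': True}}
print(variant.config)   # {'param1': 3, 'param2': True, 'child': {'param1': False}}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "component.pkl"
    variant.save(path)
    restored = ToyComponent({})
    restored.load(path)
print(restored.config == variant.config)  # True
```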

### That's all you have to know about ``Component``!

I **don't** want to set up **yet another restrictive** Python library with all its interfaces...

- Define any kind of ``Component`` you want
- Wrap existing code logic into ``Component``
- Simply re-map your configuration to ``Configuration``
- Done!

## Registration

*How does it work?*

### Registration format

Right now, registration has **very few** requirements and dynamics

- The ``Registry`` is still just a simple Python dictionary: ``RegistrationKey: Configuration``
- You need to **manually register** and **bind** ``Configuration`` via simple APIs
- You need to wrap all registrations in a ``register()`` function in ``__init__.py``

## An example

Suppose the following **recommended** code organization:

```
    project_folder
        |
        |__ configurations
        |        |__ __init__.py
        |        |__ data_loader.py
        |
        |__ components
        |       |__ __init__.py
        |       |__ data_loader.py
        |
        |__ my_script.py
```

In [None]:
# configurations/data_loader.py
from components.data_loader import ExampleLoader

class ExampleLoaderConfig(DataLoaderConfig):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.has_val_split = False
        config.name = 'example_dataset'

        config.add_short(name='file_manager_regr_info', value=RegistrationKey(
            name='file_manager',
            tags={'default'},
            namespace='generic'
        ), type_hint=RegistrationKey, description="registration info of built FileManager component."
                                                  " Used for filesystem interfacing")
        config.add_short(name='data_url', value='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                         type_hint=AnyStr, description='URL to dataset archive file')
        config.add_short(name='download_directory', value='imdb', type_hint=str,
                         description='Folder where the archive file is downloaded', is_required=True)
        config.add_short(name='download_filename', value='imdb.tar.gz', type_hint=str,
                         description='Name of the archive file', is_required=True)
        config.add_short(name='samples_amount', value=500, type_hint=int,
                         description='Number of samples per split to consider at maximum')

        return config


def register_data_loaders():
    Registry.register_and_bind(configuration_class=ExampleLoaderConfig,
                               component_class=ExampleLoader,
                               name='data_loader',
                               tags={'imdb'},
                               is_default=True,
                               namespace='examples')

In [None]:
# configurations/__init__.py
from configurations.data_loader import register_data_loaders

def register():
    register_data_loaders()

### Commands

That's **all you need to set up** since ``deasy_learning`` offers some **high-level APIs** to deal with registrations

In [None]:
# my_script.py
from pathlib import Path

from deasy_learning_generic.core.commands import setup_registry

if __name__ == '__main__':
    directory = Path(__file__).parent.resolve()   # <-- project_folder
    module_directories = [
        directory
    ]

    setup_registry(directory=directory,
                   module_directories=module_directories,
                   generate_registration=True)

The above command ``setup_registry`` only requires

- A ``directory`` path mainly used to save potentially serialized data during execution (e.g., results, data, model weights...)
- A ``module_directories`` list of base folders from which to look for registrations
- ``generate_registration=True`` tells the ``Registry`` to serialize in JSON format all ``RegistrationKey`` (*for visualization purposes*)

#### Behind the curtains

The ``Registry`` is looking for all ``register(...)`` functions in every ``__init__.py`` in subfolders of each directory in ``module_directories``
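
One way such a discovery step could work is sketched below with ``importlib``. This is not the library's ``setup_registry``: the function ``run_registrations`` is hypothetical, and it assumes that every ``__init__.py`` can be imported on its own (a real project with intra-project imports would also need its folder on ``sys.path``).

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import List


def run_registrations(module_directories: List[Path]) -> List[Path]:
    """Import every __init__.py below the given folders and call its register()."""
    executed = []
    for base in module_directories:
        for init_file in sorted(base.rglob("__init__.py")):
            # Assumption: a synthetic module name is good enough for discovery
            module_name = "_discovered_" + "_".join(init_file.parent.relative_to(base).parts)
            spec = importlib.util.spec_from_file_location(module_name, init_file)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            register = getattr(module, "register", None)
            if callable(register):
                register()
                executed.append(init_file)
    return executed


# Tiny throwaway project: only 'configurations' defines a register() function
with tempfile.TemporaryDirectory() as folder:
    project = Path(folder)
    (project / "configurations").mkdir()
    (project / "configurations" / "__init__.py").write_text(
        "def register():\n    print('registering data loaders')\n")
    (project / "components").mkdir()
    (project / "components" / "__init__.py").write_text("")  # no register() here
    found = [path.parent.name for path in run_registrations([project])]
print(found)  # ['configurations']
```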

#### Limitations

Registration is **always done** at **runtime**!

Thus, **always** begin your script with ``setup_registry(...)``

### Inspecting found registrations

A ``registrations`` folder will be created under ``directory``.

```
    project_folder
        |
        |__ configurations
        |
        |__ components
        |
        |__ registrations
        |      |
        |      |__ examples    # <-- this is the namespace!
        |             |
        |             |__ components.json
        |             |
        |             |__ configurations.json
        |
        |__ my_script.py
```

In [None]:
# registrations/examples/configurations.json
[
    "name:data_loader--tags:['default', 'imdb']--namespace:examples"
]

## Training an SVM model with deasy-learning

Enough showcasing! Let's see a practical example (*still a showcase ehehe...*)

#### Steps

- Data loading: IMDB dataset
- Preprocessor pipeline: some text normalization and tf-idf encoding
- Model: a SVM
- Routine: a train and test routine
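
For reference, the same four steps can be sketched with plain scikit-learn, without any ``deasy-learning`` abstraction. The toy sentences below stand in for the IMDB reviews and are made up for illustration; the hyper-parameters mirror the default values used in the configurations of this section (``ngram_range=(1, 1)``, ``C=1.0``, linear kernel, balanced class weights).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# 1. Data loading: toy stand-in for the IMDB train/test splits
train_texts = ['a great movie', 'truly great acting', 'a boring movie', 'truly boring plot']
train_labels = ['pos', 'pos', 'neg', 'neg']
test_texts = ['great plot', 'boring acting']
test_labels = ['pos', 'neg']

# 2. Pre-processing: fit on training data only, then transform every split
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_labels)
y_test = label_encoder.transform(test_labels)

# 3. Model: a linear SVM
model = SVC(C=1.0, kernel='linear', class_weight='balanced')

# 4. Routine: train, then evaluate on the test split
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.2f}')
```

The rest of this section wraps each of these steps into a ``Component`` with its own ``Configuration``.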

### Data loading (Component)

Let's define a base ``DataLoader`` component.

```python
class DataLoader(Component):

    @abc.abstractmethod
    def load_data(
            self
    ) -> Any:
        pass

    @abc.abstractmethod
    def get_splits(
            self,
    ) -> Tuple[Optional[pd.DataFrame], Optional[pd.DataFrame], Optional[pd.DataFrame]]:
        pass

    @abc.abstractmethod
    def parse(
            self,
            data: pd.DataFrame,
            data_name: str
    ) -> FieldDict:
        pass

    @serialize_run
    def run(self,
            serialization_path: Optional[Union[AnyStr, Path]] = None,
            serialize: bool = False,
            ) -> FieldDict:
        train_data, val_data, test_data = self.get_splits()

        # We might use a DataLoader to load inference data only
        if not self.has_test_split_only:
            if train_data is None:
                raise UnspecifiedDataSplitException(split='training')

            if self.has_val_split and val_data is None:
                raise UnspecifiedDataSplitException(split='validation')

        if self.has_test_split and test_data is None:
            raise UnspecifiedDataSplitException(split='test')

        # Build instances
        result = FieldDict()
        if train_data is not None:
            train_data = self.parse(data=train_data,
                                    data_name='train')
        result.add_short(name='train',
                         value=train_data,
                         type_hint=FieldDict,
                         tags={'train'})
        if val_data is not None:
            val_data = self.parse(data=val_data,
                                  data_name='val')
        result.add_short(name='val',
                         value=val_data,
                         type_hint=FieldDict,
                         tags={'val'})
        if test_data is not None:
            test_data = self.parse(data=test_data,
                                   data_name='test')
        result.add_short(name='test',
                         value=test_data,
                         type_hint=FieldDict,
                         tags={'test'})

        return result
```

And now we define our ``DataLoader`` for IMDB

```python
class ExampleLoader(DataLoader):

    def __init__(
            self,
            **kwargs
    ):
        super().__init__(**kwargs)

        # Update directory paths
        file_manager = Registry.retrieve_built_component_from_key(self.file_manager_regr_info)
        file_manager = cast(FileManager, file_manager)

        self.download_path = file_manager.run(filepath=self.download_directory).joinpath(self.download_filename)
        self.extraction_path = self.download_path.parents[0]
        self.dataframe_path = self.extraction_path.joinpath('dataset.csv')

    def download(
            self
    ):
        request.urlretrieve(self.data_url, self.download_path)

        logging_utility.logger.info('Download complete...Extracting files...')
        with tarfile.open(self.download_path) as loaded_tar:
            loaded_tar.extractall(self.extraction_path)
        logging_utility.logger.info('Extraction complete...')

    def read_df_from_files(
            self
    ) -> pd.DataFrame:
        dataframe_rows = []
        for split in ['train', 'test']:
            for sentiment in ['pos', 'neg']:
                folder = self.extraction_path.joinpath('aclImdb', split, sentiment)
                for filepath in folder.glob('**/*'):
                    if not filepath.is_file():
                        continue

                    filename = filepath.name
                    with filepath.open(mode='r', encoding='utf-8') as text_file:
                        text = text_file.read()
                        score = filename.split("_")[1].split(".")[0]
                        file_id = filename.split("_")[0]

                        # create single dataframe row
                        dataframe_row = {
                            "file_id": file_id,
                            "score": score,
                            "sentiment": sentiment,
                            "split": split,
                            "text": text
                        }
                        dataframe_rows.append(dataframe_row)

        df = pd.DataFrame(dataframe_rows)
        df = df[["file_id",
                 "score",
                 "sentiment",
                 "split",
                 "text"]]

        # Save dataframe for quick retrieval
        df.to_csv(path_or_buf=self.dataframe_path, index=False)

        return df

    def load_data(
            self
    ) -> pd.DataFrame:
        if not self.download_path.is_file():
            logging_utility.logger.info('First time loading dataset...Downloading...')
            self.download()
            df = self.read_df_from_files()
        else:
            if self.dataframe_path.is_file():
                logging_utility.logger.info('Loaded pre-loaded dataset...')
                df = pd.read_csv(self.dataframe_path)
            else:
                logging_utility.logger.info("Couldn't find pre-loaded dataset...Building dataset from files...")
                # read_df_from_files() already saves the dataframe to self.dataframe_path
                df = self.read_df_from_files()

        return df

    def get_splits(
            self,
    ) -> Tuple[Optional[pd.DataFrame], Optional[pd.DataFrame], Optional[pd.DataFrame]]:
        df = self.load_data()
        train = df[df.split == 'train'].sample(frac=1).reset_index(drop=True)[:self.samples_amount]
        val = None
        test = df[df.split == 'test'].sample(frac=1).reset_index(drop=True)[:self.samples_amount]

        return train, val, test

    def parse(
            self,
            data: pd.DataFrame,
            data_name: str
    ) -> Optional[FieldDict]:
        if data is None:
            return data

        return_field = FieldDict()
        return_field.add_short(name='text',
                               value=data['text'].values,
                               type_hint=Iterable[str],
                               tags={'text'},
                               description='Input text to classify')
        return_field.add_short(name='sentiment',
                               value=data['sentiment'].values,
                               type_hint=Iterable[str],
                               tags={'label'},
                               description='Sentiment associated to text')
        return return_field
```

### Data loading (Configuration)

We have defined our ``Component``; now we define the corresponding ``Configuration``.

```python
class DataLoaderConfig(TunableConfiguration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='name',
                         type_hint=str,
                         description="Unique dataset identifier",
                         is_required=True)
        config.add_short(name='has_test_split_only',
                         value=False,
                         type_hint=bool,
                         description="Whether DataLoader has test split only or not")
        config.add_short(name='has_val_split',
                         value=True,
                         type_hint=bool,
                         description="Whether DataLoader has a val split or not")
        config.add_short(name='has_test_split',
                         value=True,
                         type_hint=bool,
                         description="Whether DataLoader has a test split or not")

        return config

```

```python
class ExampleLoaderConfig(DataLoaderConfig):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.has_val_split = False
        config.name = 'example_dataset'

        config.add_short(name='file_manager_regr_info', value=RegistrationKey(
            name='file_manager',
            tags={'default'},
            namespace='generic'
        ), type_hint=RegistrationKey, description="registration info of built FileManager component."
                                                  " Used for filesystem interfacing")
        config.add_short(name='data_url', value='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                         type_hint=AnyStr, description='URL to dataset archive file')
        config.add_short(name='download_directory', value='imdb', type_hint=str,
                         description='Folder where the archive file is downloaded', is_required=True)
        config.add_short(name='download_filename', value='imdb.tar.gz', type_hint=str,
                         description='Name of the archive file', is_required=True)
        config.add_short(name='samples_amount', value=500, type_hint=int,
                         description='Number of samples per split to consider at maximum')

        return config


def register_data_loaders():
    Registry.register_and_bind(configuration_class=ExampleLoaderConfig,
                               component_class=ExampleLoader,
                               name='data_loader',
                               tags={'imdb'},
                               is_default=True,
                               namespace='examples')
```

### Data loading (Testing)

Let's test our ``ExampleLoader``!

In [None]:
from pathlib import Path
from typing import cast

from deasy_learning_generic.components.data_loader import DataLoader
from deasy_learning_generic.core.commands import setup_registry
from deasy_learning_generic.core.registry import Registry
from deasy_learning_generic.utility import logging_utility

if __name__ == '__main__':
    directory = Path(__file__).parent.parent.resolve()
    module_directories = [
        Path(__file__).parent.parent.parent.resolve()
    ]

    setup_registry(directory=directory,
                   module_directories=module_directories,
                   generate_registration=True)

    logging_utility.logger.info(f'Directory: {directory}')

    loader_regr_key = "name:data_loader--tags:['default', 'imdb']--namespace:examples"
    loader = Registry.retrieve_component_from_key(config_regr_key=loader_regr_key)
    loader = cast(DataLoader, loader)
    data = loader.run()
    logging_utility.logger.info(data)

### Data pre-processing (Component)

We now define how to convert our inputs into a form that our SVM model can digest.

```python
class Processor(Component):

    @abc.abstractmethod
    def process(
            self,
            data: FieldDict,
            is_training_data: bool = False
    ) -> FieldDict:
        pass

    @serialize_run
    def run(
            self,
            data: Optional[FieldDict] = None,
            is_training_data: bool = False,
            serialization_path: Optional[Union[AnyStr, Path]] = None,
            serialize: bool = False,
    ) -> Optional[FieldDict]:
        if data is None:
            return data

        return self.process(data=data, is_training_data=is_training_data)
```

```python
class TfIdfProcessor(Processor):

    def __init__(
            self,
            **kwargs
    ):
        super().__init__(**kwargs)
        self.vectorizer = TfidfVectorizer(ngram_range=self.ngram_range)

    def process(
            self,
            data: FieldDict,
            is_training_data: bool = False,
    ):
        if is_training_data:
            text_data = data.search_by_tag(tags='text')
            text_data = list(itertools.chain.from_iterable([field.value for field in text_data.values()]))
            self.vectorizer.fit(text_data)

        text_fields = data.search_by_tag(tags='text')
        for index, field in text_fields.items():
            data[index] = self.vectorizer.transform(field.value)
        return data
```
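
The ``ngram_range`` hyper-parameter controls which n-grams end up in the vocabulary. A small standalone scikit-learn example, using a made-up sentence, shows the difference between unigrams only and unigrams plus bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['the movie was great']

unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1)).fit(texts)
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(texts)

# vocabulary_ maps each n-gram to its column index in the encoded matrix
print(sorted(unigram_vectorizer.vocabulary_))  # ['great', 'movie', 'the', 'was']
print(len(bigram_vectorizer.vocabulary_))      # 7 (4 unigrams + 3 bigrams)
```

A wider range yields more features, which is exactly the kind of choice worth tracking in a ``Configuration``.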

```python
class LabelProcessor(Processor):

    def __init__(
            self,
            **kwargs
    ):
        super().__init__(**kwargs)
        self.encoders = dict()

    def process(
            self,
            data: FieldDict,
            is_training_data: bool = False
    ):
        label_fields = data.search_by_tag(tags='label')

        if is_training_data:
            # Fit (or re-fit) one encoder per label field, on training data only
            for index, field in label_fields.items():
                field_encoder = self.encoders.get(index, LabelEncoder())
                field_encoder.fit(field.value)
                self.encoders[index] = field_encoder

        for index, field in label_fields.items():
            # Look up the encoder with the same key used to store it
            field_encoder = self.encoders[index]
            data[index] = field_encoder.transform(field.value)
        return data

```
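
``LabelEncoder`` assigns integer ids following the sorted order of the class names, so for this dataset ``neg`` becomes 0 and ``pos`` becomes 1. A quick standalone check with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['pos', 'neg', 'pos'])          # fit on training labels only

print(encoder.classes_.tolist())            # ['neg', 'pos']
print(encoder.transform(['neg', 'pos']))    # [0 1]
print(encoder.inverse_transform([1, 0]))    # ['pos' 'neg']
```

This mapping is what makes a metric configured with ``pos_label=1`` treat ``pos`` as the positive class.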

### Data pre-processing (Configuration)

As done before, we define the corresponding ``Configuration``

```python
class TfIdfProcessorConfig(TunableConfiguration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='ngram_range',
                         value=(1, 1),
                         type_hint=Any,
                         description='Vectorizer ngram_range hyper-parameter')

        return config

def register_processors():
    tfidf_regr_key = Registry.register_and_bind(configuration_class=TfIdfProcessorConfig,
                                                component_class=TfIdfProcessor,
                                                name='processor',
                                                tags={'tf-idf'},
                                                is_default=True,
                                                namespace='examples')

    label_regr_key = Registry.register_and_bind(configuration_class=TunableConfiguration,
                                                component_class=LabelProcessor,
                                                name='processor',
                                                tags={'label'},
                                                is_default=True,
                                                namespace='examples')

    Registry.register_and_bind(configuration_class=ProcessorPipelineConfig,
                               configuration_constructor=ProcessorPipelineConfig.get_delta_class_copy,
                               configuration_kwargs={
                                   'params_dict': {
                                       'processors': [
                                           tfidf_regr_key,
                                           label_regr_key
                                       ]
                                   }
                               },
                               component_class=ProcessorPipeline,
                               name='processor',
                               tags={'tf-idf', 'label'},
                               namespace='examples')

```

### Data pre-processing (Testing)

Again, let's test our loading + preprocessing pipeline!

```python
from pathlib import Path
from typing import cast

from deasy_learning_generic.components.data_loader import DataLoader
from deasy_learning_generic.components.file_manager import FileManager
from deasy_learning_generic.components.processor import Processor
from deasy_learning_generic.core.commands import setup_registry
from deasy_learning_generic.core.registry import Registry
from deasy_learning_generic.utility import logging_utility

if __name__ == '__main__':
    directory = Path(__file__).parent.parent.resolve()
    module_directories = [
        Path(__file__).parent.parent.parent.resolve()
    ]

    file_manager_regr_key = setup_registry(directory=directory,
                                           module_directories=module_directories,
                                           generate_registration=True)

    file_manager = Registry.retrieve_built_component_from_key(config_regr_key=file_manager_regr_key)
    file_manager = cast(FileManager, file_manager)
    serialization_path = file_manager.run(filepath=file_manager.serialization_directory)
    serialize = True

    # DataLoader (dl)
    dl_regr_key = "name:data_loader--tags:['default', 'imdb']--namespace:examples"
    dl = Registry.retrieve_component_from_key(config_regr_key=dl_regr_key)
    dl = cast(DataLoader, dl)
    data = dl.run(serialization_path=serialization_path, serialize=serialize)

    # TfIdfProcessor (tip)
    tip_regr_key = "name:processor--tags:['default', 'tf-idf']--namespace:examples"
    tip = Registry.retrieve_component_from_key(config_regr_key=tip_regr_key)
    tip = cast(Processor, tip)
    tip.run(data=data.train, is_training_data=True, serialization_path=serialization_path)
    tip.run(data=data.val)
    tip.run(data=data.test, serialize=serialize, serialization_path=serialization_path)

    # LabelProcessor (lp)
    lp_regr_key = "name:processor--tags:['default', 'label']--namespace:examples"
    lp = Registry.retrieve_component_from_key(config_regr_key=lp_regr_key)
    lp = cast(Processor, lp)
    lp.run(data=data.train, is_training_data=True)
    lp.run(data=data.val)
    lp.run(data=data.test, serialize=serialize, serialization_path=serialization_path)
    logging_utility.logger.info(f'Train: {data.train}')
    logging_utility.logger.info(f'Val: {data.val}')
    logging_utility.logger.info(f'Test: {data.test}')
```

### Modeling (Component)

We are now ready to define our SVM ``Component`` wrapper!

```python
class ExampleModel(Model):

    def build_model(
            self,
            processor_state: FieldDict,
            callbacks: Optional[CallbackPipeline] = None
    ):
        self.model = SVC(C=self.C,
                         kernel=self.kernel,
                         class_weight=self.class_weight)

    def predict(
            self,
            data: FieldDict,
            callbacks: Optional[CallbackPipeline] = None,
            metrics: Optional[MetricPipeline] = None
    ) -> FieldDict:
        model_data = self.get_model_data(data=data, with_labels=True)
        predictions = self.model.predict(X=model_data['X'])

        return_field = FieldDict()
        return_field.add_short(name='predictions',
                               value=predictions)

        if 'y' in model_data and metrics is not None:
            metrics_result = metrics.run(y_pred=predictions,
                                         y_true=model_data['y'])
            return_field.add_short(name='metrics',
                                   value=metrics_result)

        return return_field

    def get_model_data(
            self,
            data: FieldDict,
            with_labels: bool = False
    ) -> Dict[str, Iterable]:
        return_data = dict()

        text_data = list(data.search_by_tag(tags='text').values())[0].value
        return_data['X'] = text_data

        if with_labels:
            label_data = data.search_by_tag(tags='label')
            if len(label_data):
                label_data = list(label_data.values())[0].value
                return_data['y'] = label_data

        return return_data

    def fit(
            self,
            train_data: FieldDict,
            val_data: Optional[FieldDict] = None,
            metrics: Optional[MetricPipeline] = None,
            callbacks: Optional[CallbackPipeline] = None
    ) -> FieldDict:
        train_info = self.model.fit(**self.get_model_data(data=train_data, with_labels=True))

        return_field = FieldDict()
        return_field.add_short(name='train_info',
                               value=train_info)

        if val_data is not None:
            val_info = self.evaluate_and_predict(data=val_data,
                                                 callbacks=callbacks,
                                                 metrics=metrics)
            return_field.add_short(name='val_info',
                                   value=val_info)
        return return_field

```

### Modeling (Configuration)

In [None]:
class ExampleModelConfig(TunableConfiguration):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.add_short(name='C',
                         value=1.0,
                         type_hint=float,
                         description='C parameter of SVC')
        config.add_short(name='kernel', type_hint=str, value='linear')
        config.add_short(name='class_weight', type_hint=str, value='balanced')

        return config


def register_models():
    Registry.register_and_bind(configuration_class=ExampleModelConfig,
                               component_class=ExampleModel,
                               name='model',
                               tags={'svm'},
                               is_default=True,
                               namespace='examples')

### Metrics (Configuration)

Before testing our model, let's define some metrics.

```python
from sklearn.metrics import f1_score, accuracy_score

from deasy_learning_generic.components.metrics import MetricPipeline, LambdaMetric
from deasy_learning_generic.configurations.metrics import MetricManagerConfig, LambdaMetricConfig
from deasy_learning_generic.core.configuration import add_variant, supports_variants
from deasy_learning_generic.core.registry import Registry


@supports_variants
class ExampleLambdaMetricConfig(LambdaMetricConfig):

    @classmethod
    @add_variant('binary_F1')
    def get_sklearn_binary_f1(
            cls
    ):
        config = cls.get_default()
        config.name = 'binary_f1'
        config.method = f1_score
        config.method_args = {'average': 'binary', 'pos_label': 1}
        return config

    @classmethod
    @add_variant('macro_F1')
    def get_sklearn_macro_f1(
            cls
    ):
        config = cls.get_default()
        config.name = 'macro_f1'
        config.method = f1_score
        config.method_args = {'average': 'macro'}
        return config

    @classmethod
    @add_variant('accuracy')
    def get_sklearn_accuracy(
            cls
    ):
        config = cls.get_default()
        config.name = 'accuracy'
        config.method = accuracy_score
        return config


def register_metrics_configurations():
    variant_regr_keys = Registry.register_and_bind_configuration_variants(configuration_class=ExampleLambdaMetricConfig,
                                                                          component_class=LambdaMetric,
                                                                          name='metrics',
                                                                          namespace='examples')

    Registry.register_and_bind(configuration_class=MetricManagerConfig,
                               configuration_constructor=MetricManagerConfig.get_delta_class_copy,
                               configuration_kwargs={
                                   'params_dict': {
                                       'metrics': variant_regr_keys
                                   }
                               },
                               component_class=MetricPipeline,
                               name='metrics',
                               tags={'binary_f1', 'macro_f1', 'accuracy'},
                               namespace='examples')

```
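
As a sanity check of what the three variants compute, here they are called directly through scikit-learn on a tiny made-up prediction vector.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

# Same arguments as the 'binary_F1', 'macro_F1' and 'accuracy' variants
binary_f1 = f1_score(y_true, y_pred, average='binary', pos_label=1)
macro_f1 = f1_score(y_true, y_pred, average='macro')
accuracy = accuracy_score(y_true, y_pred)

print(f'binary F1: {binary_f1:.3f}, macro F1: {macro_f1:.3f}, accuracy: {accuracy:.3f}')
# binary F1: 0.667, macro F1: 0.733, accuracy: 0.750
```

Binary F1 only scores the positive class (precision 1.0, recall 0.5), while macro F1 averages it with the F1 of class 0 (0.8).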

### Modeling (Testing)

Let's see if we can train and evaluate our model!

```python
from pathlib import Path
from typing import cast

from deasy_learning_generic.components.data_loader import DataLoader
from deasy_learning_generic.components.metrics import MetricPipeline
from deasy_learning_generic.components.model import Model
from deasy_learning_generic.components.processor import Processor
from deasy_learning_generic.core.commands import setup_registry
from deasy_learning_generic.core.registry import Registry
from deasy_learning_generic.utility import logging_utility

if __name__ == '__main__':
    directory = Path(__file__).parent.parent.resolve()
    module_directories = [
        Path(__file__).parent.parent.parent.resolve()
    ]

    file_manager_regr_key = setup_registry(directory=directory,
                                           module_directories=module_directories,
                                           generate_registration=True)
    # DataLoader (dl)
    dl_regr_key = "name:data_loader--tags:['default', 'imdb']--namespace:examples"
    dl = Registry.retrieve_component_from_key(config_regr_key=dl_regr_key)
    dl = cast(DataLoader, dl)
    data = dl.run()

    # TfIdfProcessor (tip)
    tip_regr_key = "name:processor--tags:['default', 'tf-idf']--namespace:examples"
    tip = Registry.retrieve_component_from_key(config_regr_key=tip_regr_key)
    tip = cast(Processor, tip)
    tip.run(data=data.train, is_training_data=True)
    tip.run(data=data.val)
    tip.run(data=data.test)

    # LabelProcessor (lp)
    lp_regr_key = "name:processor--tags:['default', 'label']--namespace:examples"
    lp = Registry.retrieve_component_from_key(config_regr_key=lp_regr_key)
    lp = cast(Processor, lp)
    lp.run(data=data.train, is_training_data=True)
    lp.run(data=data.val)
    lp.run(data=data.test)
    logging_utility.logger.info(f'Train: {data.train}')
    logging_utility.logger.info(f'Val: {data.val}')
    logging_utility.logger.info(f'Test: {data.test}')

    # Metrics
    metrics_regr_key = "name:metrics--tags:['accuracy', 'binary_f1', 'macro_f1']--namespace:examples"
    metrics = Registry.retrieve_component_from_key(config_regr_key=metrics_regr_key)
    metrics = cast(MetricPipeline, metrics)

    # Model
    model_regr_key = "name:model--tags:['default', 'svm']--namespace:examples"
    model = Registry.retrieve_component_from_key(config_regr_key=model_regr_key)
    model = cast(Model, model)

    # Training
    model.build_model(processor_state=lp.state)
    fit_info = model.fit(train_data=data.train,
                         val_data=data.val,
                         metrics=metrics,
                         callbacks=None)
    logging_utility.logger.info(f'Fit info: {fit_info}')

    predict_info = model.predict(data=data.test,
                                 metrics=metrics)
    logging_utility.logger.info(f'Predict info: {predict_info}')
```

### Routine

The ``Routine`` is our task pipeline: from data loading to model training and evaluation

```python
class TrainAndTestRoutine(Routine):

    def get_random_split(
            self,
            data: pd.DataFrame
    ) -> Tuple[pd.DataFrame, pd.DataFrame]:
        amount = int(len(data) * self.validation_percentage)
        all_indexes = np.arange(len(data))
        split_indexes = np.random.choice(all_indexes, size=amount, replace=False)
        # setdiff1d avoids the quadratic membership test of a list comprehension
        remaining_indexes = np.setdiff1d(all_indexes, split_indexes)
        # .iloc selects rows by position; data[...] would select columns instead
        split_data = data.iloc[split_indexes]
        remaining_data = data.iloc[remaining_indexes]
        return remaining_data, split_data

    def build_routine_splits(
            self,
            train_data: pd.DataFrame = None,
            val_data: pd.DataFrame = None,
            test_data: pd.DataFrame = None,
            is_training: bool = False
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        logging_utility.logger.info(f'''
        Building routine splits...
        Train data: {len(train_data) if train_data is not None else train_data}
        Validation data: {len(val_data) if val_data is not None else val_data}
        Test data: {len(test_data) if test_data is not None else test_data}
        ''')

        if is_training:
            assert train_data is not None, f'Training data should be given when training. Got {train_data}'

        if self.has_val_split:
            if train_data is not None and val_data is None:
                assert self.validation_percentage is not None, "Routine is expected to build the validation data," \
                                                               " but no validation percentage was given"

                logging_utility.logger.info(f'Randomly splitting train data into train and validation splits. '
                                            f'Validation percentage: {self.validation_percentage}')
                train_data, val_data = self.get_random_split(data=train_data)

        if test_data is None and self.parameter_dict["has_test_split"]:
            logging_utility.logger.info(f'Randomly splitting train data into train and test splits. '
                                        f'Test percentage: {self.validation_percentage}')
            train_data, test_data = self.get_random_split(data=train_data)

        logging_utility.logger.info(f'''
        Done!
        Train data: {len(train_data) if train_data is not None else train_data}
        Validation data: {len(val_data) if val_data is not None else val_data}
        Test data: {len(test_data) if test_data is not None else test_data}
        ''')
        return train_data, val_data, test_data

    def _step_execute(
            self,
            routine_suffixes: List[RoutineSuffix],
            train_data: Optional[FieldDict] = None,
            val_data: Optional[FieldDict] = None,
            test_data: Optional[FieldDict] = None,
            is_training: bool = False,
            serialization_path: Optional[Union[AnyStr, Path]] = None,
            serialize: bool = False,
            serialize_data: bool = False
    ):
        step_result = FieldDict()

        # serialization_path is optional: only build paths when one is given
        serialization_path = Path(serialization_path) if serialization_path is not None else None

        routine_name = '_'.join([str(suffix) for suffix in routine_suffixes])
        routine_path = serialization_path.joinpath(routine_name) if serialization_path is not None else None
        if routine_path is not None and not routine_path.is_dir():
            routine_path.mkdir(parents=True, exist_ok=True)

        helper = cast(FrameworkHelper, self.helper)
        assert routine_suffixes[0].name == 'seed', \
            f'{self.__class__.__name__} expects a seed for each iteration!'
        helper.run(seed=routine_suffixes[0].value)

        # Processor
        processor = Registry.retrieve_component_from_key(config_regr_key=self.processor)
        processor = cast(ProcessorPipeline, processor)
        # serialize only at the end of process
        train_data = processor.run(data=train_data,
                                   is_training_data=is_training,
                                   serialization_path=routine_path,
                                   serialize_data=serialize_data,
                                   data_serialization_index='train')
        val_data = processor.run(data=val_data,
                                 serialization_path=routine_path,
                                 serialize_data=serialize_data,
                                 data_serialization_index='val')
        test_data = processor.run(data=test_data,
                                  serialize=serialize,
                                  serialization_path=routine_path,
                                  serialize_data=serialize_data,
                                  data_serialization_index='test')

        # Model
        if self.callbacks is not None:
            callbacks = Registry.retrieve_component_from_key(config_regr_key=self.callbacks)
            callbacks = cast(CallbackPipeline, callbacks)
        else:
            callbacks = None

        model = Registry.retrieve_component_from_key(config_regr_key=self.model)
        model = cast(Model, model)

        model.build_model(processor_state=processor.state,
                          callbacks=callbacks)

        # Training
        metrics = Registry.retrieve_component_from_key(config_regr_key=self.metrics)
        metrics = cast(MetricPipeline, metrics)

        if is_training:
            model.prepare_for_training(train_data=train_data)

            assert routine_suffixes[0].name == 'seed', \
                f'{self.__class__.__name__} expects a seed for each iteration!'
            helper.run(seed=routine_suffixes[0].value)

            fit_info = model.fit(train_data=train_data,
                                 val_data=val_data,
                                 metrics=metrics,
                                 callbacks=callbacks)
            step_result.add_short(name='fit_info',
                                  value=fit_info)
        else:
            model.prepare_for_loading(data=test_data if test_data is not None else val_data)

            model.load(serialization_path=routine_path)

            model.check_after_loading()

            assert routine_suffixes[0].name == 'seed', \
                f'{self.__class__.__name__} expects a seed for each iteration!'
            helper.run(seed=routine_suffixes[0].value)

        # Evaluator
        if val_data is not None:
            val_result = model.predict(data=val_data,
                                       metrics=metrics,
                                       callbacks=callbacks)
            step_result.add_short(name='val_info',
                                  value=val_result)

        if test_data is not None:
            test_result = model.predict(data=test_data,
                                        metrics=metrics,
                                        callbacks=callbacks)
            step_result.add_short(name='test_info',
                                  value=test_result)

        # Save
        if serialize:
            model.save(serialization_path=routine_path)

        return step_result

    def run(
            self,
            helper: Optional[FrameworkHelper] = None,
            is_training: bool = False,
            routine_path: Optional[AnyStr] = None,
            save_result: bool = False,
            serialization_path: Optional[AnyStr] = None,
            serialize: bool = False,
            serialize_data: bool = False
    ) -> FieldDict:
        routine_result = FieldDict()

        # Helper
        if helper is not None:
            self.helper = helper

        self.helper.run(seed=self.seeds[0])

        # Get data splits
        data_loader = cast(DataLoader, self.data_loader)
        train_data, val_data, test_data = data_loader.get_splits()
        train_data, val_data, test_data = self.build_routine_splits(train_data=train_data,
                                                                    val_data=val_data,
                                                                    test_data=test_data,
                                                                    is_training=is_training)

        for seed_idx, seed in enumerate(self.seeds):
            logging_utility.logger.info(f'Seed: {seed} (Progress -> {seed_idx + 1}/{len(self.seeds)})')

            seed_train_data = data_loader.parse(data=train_data, data_name='train')
            seed_val_data = data_loader.parse(data=val_data, data_name='val')
            seed_test_data = data_loader.parse(data=test_data, data_name='test')

            routine_suffixes = [
                RoutineSuffix(name='seed',
                              value=seed)
            ]
            step_result = self._step_execute(routine_suffixes=routine_suffixes,
                                             train_data=seed_train_data,
                                             val_data=seed_val_data,
                                             test_data=seed_test_data,
                                             is_training=is_training,
                                             serialization_path=serialization_path,
                                             serialize=serialize,
                                             serialize_data=serialize_data)
            routine_result.add_short(name='_'.join([str(suffix) for suffix in routine_suffixes]),
                                     value=step_result)

        return routine_result
```
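
Stripped of all components, the per-seed loop in ``run`` boils down to a few lines. The following is a minimal sketch and **not** the library code: ``set_seed``, ``toy_step`` and ``run_per_seed`` are hypothetical names, and it assumes that ``helper.run(seed=...)`` essentially fixes the random seed before each iteration.

```python
import random
from typing import Dict, List


def set_seed(seed: int) -> None:
    # Hypothetical stand-in for helper.run(seed=...): only Python's RNG is seeded here.
    # A real helper would presumably also seed numpy and the deep learning framework.
    random.seed(seed)


def toy_step() -> float:
    # Hypothetical stand-in for "train + evaluate": returns a pseudo-random score.
    return random.random()


def run_per_seed(seeds: List[int]) -> Dict[str, float]:
    results = {}
    for seed_idx, seed in enumerate(seeds):
        print(f'Seed: {seed} (Progress -> {seed_idx + 1}/{len(seeds)})')
        set_seed(seed)  # re-seed right before the step, as the routine does
        results[f'seed_{seed}'] = toy_step()
    return results


first = run_per_seed([15000, 42])
second = run_per_seed([15000, 42])

# Same seeds, same order of random calls -> identical results across runs
print(first == second)
```

Re-seeding right before each step is what makes a single seed reproducible on its own, regardless of how many seeds were executed before it.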

### Routine (Configuration)

We reuse the ``TrainAndTestRoutine`` component and extend its existing ``TrainAndTestConfig`` (shown first) with our own ``ExampleRoutineConfig``

```python
class TrainAndTestConfig(RoutineConfig):

    @classmethod
    def get_default(
            cls
    ):
        config = super(TrainAndTestConfig, cls).get_default()

        config.add_short(name='test_percentage', type_hint=Optional[float],
                         description='Training set percentage to use as test split.')
        config.add_short(name='has_val_split', value=True, type_hint=bool,
                         description='If true, val data is considered as a data split. '
                                     'If no val data is provided, it is built via random split')
        config.add_short(name='has_test_split', value=True, type_hint=bool,
                         description="If true, test data is distinct from training data "
                                     "and no data split is required")
        return config
```
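
To make the three fields concrete, here is a minimal sketch of how they *could* drive the data splits. It only follows the field descriptions above: ``build_splits``, ``_carve`` and ``val_percentage`` are hypothetical, and the actual ``build_routine_splits`` may behave differently.

```python
import random
from typing import List, Optional, Sequence, Tuple


def _carve(items: Sequence, fraction: float, rng: random.Random) -> Tuple[List, List]:
    # Shuffle a copy and move `fraction` of the items into a held-out list
    items = list(items)
    rng.shuffle(items)
    cut = int(round(len(items) * fraction))
    return items[cut:], items[:cut]


def build_splits(
        train: Sequence,
        val: Optional[Sequence] = None,
        test: Optional[Sequence] = None,
        *,
        has_val_split: bool = True,
        has_test_split: bool = True,
        test_percentage: Optional[float] = None,
        val_percentage: float = 0.1,  # assumption: not part of TrainAndTestConfig
        seed: int = 42
) -> Tuple[List, Optional[Sequence], Optional[Sequence]]:
    rng = random.Random(seed)
    train = list(train)

    if not has_test_split:
        # No distinct test data: take a share of the training set as test split
        if test_percentage is None:
            raise ValueError('test_percentage is required when has_test_split is False')
        train, test = _carve(train, test_percentage, rng)

    if not has_val_split:
        val = None
    elif val is None:
        # Validation is requested but not provided: build it via random split
        train, val = _carve(train, val_percentage, rng)

    return train, val, test


# Same flags as the example routine configuration: no validation, distinct test data
train, val, test = build_splits(train=range(10), test=range(10, 13),
                                has_val_split=False, has_test_split=True)
print(len(train), val, len(test))
```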

```python
from deasy_learning_generic.components.routine import TrainAndTestRoutine
from deasy_learning_generic.configurations.routine import TrainAndTestConfig
from deasy_learning_generic.core.registry import Registry, RegistrationKey


class ExampleRoutineConfig(TrainAndTestConfig):

    @classmethod
    def get_default(
            cls
    ):
        config = super().get_default()

        config.data_loader = RegistrationKey(name='data_loader',
                                             tags={'imdb', 'default'},
                                             namespace='examples')
        config.processor = RegistrationKey(name='processor',
                                           tags={'tf-idf', 'label'},
                                           namespace='examples')
        config.model = RegistrationKey(name='model',
                                       tags={'svm', 'default'},
                                       namespace='examples')
        config.metrics = RegistrationKey(name='metrics',
                                         tags={'binary_f1', 'macro_f1', 'accuracy'},
                                         namespace='examples')
        config.helper = RegistrationKey(name='helper',
                                        tags={'default'},
                                        namespace='generic')

        config.name = 'svm'
        config.seeds = [15000, 42]
        config.has_val_split = False
        config.has_test_split = True

        return config


def register_routines():
    Registry.register_and_bind(configuration_class=ExampleRoutineConfig,
                               component_class=TrainAndTestRoutine,
                               name='routine',
                               tags={'train_and_test'},
                               namespace='examples')

```
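
The idea behind ``register_and_bind`` can be sketched with a toy registry: a key made of name, tags and namespace is bound to a (configuration, component) pair, and retrieving the key builds the component from its configuration. Everything below (``ToyKey``, ``ToyRegistry``, ``ToySVM``, ``default_svm_config``) is hypothetical and only mirrors the method names used in these slides. It is not the implementation of ``deasy-learning``.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class ToyKey:
    name: str
    tags: FrozenSet[str]
    namespace: str

    @classmethod
    def of(cls, name: str, tags: Iterable[str], namespace: str) -> 'ToyKey':
        # Tags are stored as a frozenset: order does not matter and the key is hashable
        return cls(name=name, tags=frozenset(tags), namespace=namespace)


class ToyRegistry:
    _bindings: Dict[ToyKey, Tuple[Callable[[], Dict[str, Any]], type]] = {}

    @classmethod
    def register_and_bind(cls, config_factory, component_class, name, tags, namespace) -> ToyKey:
        key = ToyKey.of(name, tags, namespace)
        if key in cls._bindings:
            raise KeyError(f'{key} is already registered')
        cls._bindings[key] = (config_factory, component_class)
        return key

    @classmethod
    def retrieve_component_from_key(cls, key: ToyKey):
        # Build the default configuration, then the component bound to it
        config_factory, component_class = cls._bindings[key]
        return component_class(**config_factory())


class ToySVM:
    def __init__(self, C: float, kernel: str):
        self.C = C
        self.kernel = kernel


def default_svm_config() -> Dict[str, Any]:
    return {'C': 1.0, 'kernel': 'linear'}


key = ToyRegistry.register_and_bind(config_factory=default_svm_config,
                                    component_class=ToySVM,
                                    name='model',
                                    tags={'svm', 'default'},
                                    namespace='examples')

# The same key can be rebuilt anywhere in the project, tags in any order
model = ToyRegistry.retrieve_component_from_key(ToyKey.of('model', ['default', 'svm'], 'examples'))
print(type(model).__name__, model.C, model.kernel)
```

Since the key is plain data, a configuration can point to other components (data loader, processor, model, metrics) without importing them, as ``ExampleRoutineConfig`` does.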

### Routine (Testing)

Let's test our **whole** pipeline in a **single shot**!

```python
from pathlib import Path
from typing import cast

from deasy_learning_generic.utility import logging_utility
from deasy_learning_generic.components.file_manager import FileManager
from deasy_learning_generic.components.routine import Routine
from deasy_learning_generic.core.commands import setup_registry
from deasy_learning_generic.core.registry import Registry, RegistrationKey

if __name__ == '__main__':
    directory = Path(__file__).parent.parent.resolve()
    module_directories = [
        Path(__file__).parent.parent.parent.resolve()
    ]

    file_manager_regr_key = setup_registry(directory=directory,
                                           module_directories=module_directories,
                                           generate_registration=True)

    logging_utility.logger.info(f'Directory: {directory}')

    file_manager = Registry.retrieve_built_component_from_key(config_regr_key=file_manager_regr_key)
    file_manager = cast(FileManager, file_manager)
    serialization_path = file_manager.run(filepath=file_manager.serialization_directory)
    serialize = True

    # Routine
    routine_regr_key = RegistrationKey(name='routine',
                                       tags={'train_and_test'},
                                       namespace='examples')
    routine = Registry.retrieve_component_from_key(config_regr_key=routine_regr_key)
    routine = cast(Routine, routine)
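    # First pass: train the model and serialize it (one run per seed)
    routine_info = routine.run(is_training=True,
                               serialization_path=serialization_path,
                               serialize=serialize,
                               serialize_data=True)
    logging_utility.logger.info(f'Training info: {routine_info}')

    # Second pass: is_training defaults to False, so the saved model is loaded and evaluated
    routine_info = routine.run(serialization_path=serialization_path,
                               serialize_data=True)
    logging_utility.logger.info(f'Predict info: {routine_info}')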
```
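
Since the routine returns one result per seed, a natural next step is to aggregate the metrics across seeds. The sketch below assumes the per-seed results can be flattened into plain dictionaries of metric values, which is an assumption about the content of ``routine_info``. ``summarize`` is a hypothetical helper and the numbers are made-up placeholders, not actual results.

```python
from statistics import mean, stdev
from typing import Dict


def summarize(per_seed: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    # Collect every metric name that appears in at least one seed
    metric_names = sorted({name for scores in per_seed.values() for name in scores})

    summary = {}
    for name in metric_names:
        values = [scores[name] for scores in per_seed.values() if name in scores]
        summary[name] = {
            'mean': mean(values),
            # Sample standard deviation needs at least two seeds
            'std': stdev(values) if len(values) > 1 else 0.0,
            'n': len(values)
        }
    return summary


# Placeholder values for illustration only
toy_results = {
    'seed_15000': {'accuracy': 0.80, 'macro_f1': 0.70},
    'seed_42': {'accuracy': 0.90, 'macro_f1': 0.80}
}

summary = summarize(toy_results)
for metric_name, stats in summary.items():
    print(f"{metric_name}: {stats['mean']:.3f} +/- {stats['std']:.3f} over {stats['n']} seeds")
```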

## Upcoming features

``deasy-learning`` is **just about** ``Registry``, ``Component``, ``Configuration``

Still, **quite a lot of stuff** can be defined

- Custom ``Component`` $\rightarrow$ e.g., ``TrainAndTestRoutine`` is a default ``Component`` of ``deasy-learning``
- Custom ``Configuration`` $\rightarrow$ same as for ``Component``
- A simple dedicated **web service** for visualizing registrations (similar to Hugging Face Spaces)
- **Upload/Download** ``Configuration`` and ``Component`` for research purposes

#### Python package

I thought about defining **small modules**, one for each supported framework

- ``deasy_learning_generic``: general module that defines the core of the library
- ``deasy_learning_tf``: specialized ``Configuration`` and ``Component`` for TensorFlow
- ``deasy_learning_th``: specialized ``Configuration`` and ``Component`` for PyTorch

## The end! Finally!

**Thank you very much** for attending my course **<3**

Since this is my **first teaching experience**, I kindly ask you for **your feedback**!!!!

#### [Evaluation questionnaire](https://forms.gle/wxAV2XarJSKr5qLa7)  $\leftarrow$ click me!

**Click on the title** to access the Google Form.

Otherwise, here's the **QR code** for a quick scan

<center>
<div>
<img src="Images/Lecture-5/questionnaire.png" width="400"/>
</div>
</center>

## Motivational outro

Before saying goodbye, here are a few things I want to tell you

- Research is **chaotic**, often **competitive** and from time to time **toxic** $\rightarrow$ let's **share** knowledge!

- ``deasy-learning`` is **my attempt** to share experience and set up a common working environment

- It may be a **good idea** to sit down and talk about research problems from time to time! $\rightarrow$ **recurring themed meetings**

# Any questions?

<center>
<div>
<img src="Images/Lecture-1/jojo-arrivederci.gif" width="1200" alt='JOJO_arrivederci'/>
</div>
</center>