<a href="https://colab.research.google.com/github/fauxneticien/lnl-examples/blob/main/notebooks/03_train_lnl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using PyTorch Lightning and Lhotse for speech processing research

<p align="center"><img width="500" src="https://user-images.githubusercontent.com/9938298/244146091-1e3cf317-910a-4fcf-a0e2-6e755a4935c0.png"></p>

This tutorial gives a brief overview of how `train_lnl.py` works. It is shorter than the other two tutorials because most of what is implemented here is specific to my own use case and preferences for how the various components are modularized and configured.


## Setup

### Install dependencies

As before, we'll assume the latest versions of `torch(audio)`, `lightning`, and `lhotse` as of early June 2023.

In [None]:
%%capture

!pip install --quiet torch==2.0.1 torchaudio==2.0.2 lightning==2.0.2 lhotse==1.14.0 hydra-core

## Configuration management with Hydra

The main motivation for using Hydra (and Lightning) in this toy/pedagogical environment is to help us navigate the many other codebases that use both, for example:
- https://github.com/NVIDIA/NeMo
- https://github.com/openspeech-team/openspeech

Recall from previous tutorials that a typical Lightning script might look like:

```python
import lightning.pytorch as pl

from models import SomeModel
from datamodule import SomeDataModule

# Instantiate/configure model, datamodule, and trainer
model = SomeModel(MODEL_CONFIG)
datamodule = SomeDataModule(DATAMODULE_CONFIG)
trainer = pl.Trainer(TRAINER_CONFIG)

trainer.fit(model, datamodule)
```

Essentially, Hydra lets us set up and store the instantiation/configuration in a collection of YAML files containing key-value pairs that can be overridden. Let's see what that actually entails.

### Class instantiation in Python

Let's have a look at a toy example below. In Python we can define a class `MyAbstractClass`, which takes a parameter `my_variable` and stores its value as the attribute `self.my_variable` when the class is instantiated. We also have a method `print_my_var` that we can call later to see what this value is.

In [None]:
class MyAbstractClass:
  def __init__(self, my_variable):
    self.my_variable = my_variable

  def print_my_var(self):
    print(self.my_variable)

my_instantiated_object = MyAbstractClass('hello!')

my_instantiated_object.print_my_var()

hello!
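
Because the constructor arguments are ordinary values, they can also be kept in a dictionary and unpacked when the class is called. The sketch below repeats the toy class so that it runs on its own; the `params` dictionary and its contents are hypothetical and only there to illustrate the idea that configuration can live apart from the code that uses it.

```python
class MyAbstractClass:
    def __init__(self, my_variable):
        self.my_variable = my_variable

    def print_my_var(self):
        print(self.my_variable)

# Hypothetical parameter dictionary: keys must match the names in __init__
params = {"my_variable": "hello, from a dict!"}

# ** unpacks the dictionary into keyword arguments
my_instantiated_object = MyAbstractClass(**params)

my_instantiated_object.print_my_var()  # prints: hello, from a dict!
```

A YAML file is, roughly speaking, a way of writing such a dictionary to disk.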


#### Import from file

We can also store `MyAbstractClass` in some file/folder and import this class when we want to use it.

In [None]:
%%bash
# my_abstract_class.py
cat << EOF > my_abstract_class.py

class MyAbstractClass:
  def __init__(self, my_variable):
    self.my_variable = my_variable

  def print_my_var(self):
    print(self.my_variable)

EOF

In [None]:
from my_abstract_class import MyAbstractClass

my_instantiated_object = MyAbstractClass('hello, from a file!')

my_instantiated_object.print_my_var()

hello, from a file!


#### Import/instantiate using Hydra

Hydra lets us store the path to the class (as `_target_`) and the instantiation parameters (as key-value pairs) in YAML format:

In [None]:
%%bash
# my_config.yaml
cat << EOF > my_config.yaml

my_instantiated_object:
  _target_: my_abstract_class.MyAbstractClass
  my_variable: "hello, from a yaml file!"

EOF

In [None]:
from hydra import initialize, compose

with initialize(version_base="1.3", config_path='.'):
    config = compose(config_name="my_config.yaml")

config

{'my_instantiated_object': {'_target_': 'my_abstract_class.MyAbstractClass', 'my_variable': 'hello, from a yaml file!'}}

In [None]:
from hydra.utils import instantiate

my_instantiated_object = instantiate(config.my_instantiated_object)

my_instantiated_object.print_my_var()

hello, from a yaml file!
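
Conceptually, `_target_` is a dotted import path and the remaining keys are keyword arguments. The block below is a rough, standard-library-only approximation of that idea, not Hydra's actual implementation: `instantiate_from_dict` is a hypothetical helper, and `fractions.Fraction` stands in for our own class so that the example runs without any extra files. Hydra's own `instantiate` does considerably more, such as handling nested configs.

```python
import importlib

def instantiate_from_dict(spec):
    """Hypothetical helper: build an object from a dict with a '_target_' key."""
    params = dict(spec)  # copy so the caller's dict is not modified
    # Split "package.module.ClassName" into the module path and the class name
    module_path, class_name = params.pop("_target_").rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    # Every remaining key is passed to the constructor as a keyword argument
    return cls(**params)

# Stand-in config: a standard-library class instead of MyAbstractClass
spec = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}

obj = instantiate_from_dict(spec)
print(obj)  # prints: 3/4
```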


### Hydra for ML workflows

The nice thing about Hydra is that it allows us to mix and match models, datasets, and trainer configurations. For example, given the (pseudo)code in `config.yaml` and `train.py`:

```yaml
# config.yaml
model:
  _target_: models.DeepSpeech.DeepSpeechLightningModule
  n_feature: 80

  val_decoder:
    _target_: models._utils.GreedyCTCDecoder
```

```python
# train.py
import hydra

@hydra.main(version_base="1.3", config_path=".", config_name="config.yaml")
def train(cfg) -> None:
  model = hydra.utils.instantiate(cfg.model)
  datamodule = hydra.utils.instantiate(cfg.datamodule)
  trainer = hydra.utils.instantiate(cfg.trainer)

  trainer.fit(model, datamodule)

if __name__ == "__main__":
    train()
```

You can run the following:
- Run training with default config:
  ```
  python train.py
  ```

- Override parameters in the YAML using the CLI:
  ```
  python train.py model.n_feature=128
  ```

- Add parameters that are not in the YAML but are accepted by the class's `__init__` (Hydra requires a `+` prefix for keys that do not already exist in the config):
  ```
  python train.py +trainer.accelerator=gpu +trainer.devices=4
  ```
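
A dotted override such as `model.n_feature=128` names a path into the nested config and the new value to store there. The following is a minimal sketch of that mapping using plain dictionaries, not Hydra's override parser: `apply_override` is a hypothetical helper, and the value conversion via `ast.literal_eval` is an assumption made for illustration.

```python
import ast
import copy

def apply_override(config, override):
    """Hypothetical helper: return a copy of `config` with one 'a.b=value' override applied."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    try:
        value = ast.literal_eval(raw_value)  # "128" becomes the integer 128
    except (ValueError, SyntaxError):
        value = raw_value  # anything else stays a string, e.g. "gpu"
    updated = copy.deepcopy(config)
    node = updated
    for key in parents:
        node = node[key]  # walk down to the dict that holds the leaf key
    node[leaf] = value
    return updated

config = {
    "model": {
        "_target_": "models.DeepSpeech.DeepSpeechLightningModule",
        "n_feature": 80,
    }
}

overridden = apply_override(config, "model.n_feature=128")
print(overridden["model"]["n_feature"])  # prints: 128
print(config["model"]["n_feature"])      # prints: 80 (the original is untouched)
```

Returning a modified copy keeps the defaults loaded from the YAML intact, so the same base config can be reused across runs with different overrides.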

### Further reading
- Hydra docs: https://hydra.cc/docs/1.3/intro/
- Lightning + Hydra template: https://github.com/ashleve/lightning-hydra-template