## Sample demo for ESPnet-Easy!
In this notebook, we demonstrate how to train an Automatic Speech Recognition (ASR) model on the LibriSpeech-100 dataset. Data preparation follows the same approach as for Kaldi-style datasets. If you are interested in fine-tuning pretrained models, please refer to `libri100_finetune.ipynb`.

Before proceeding, please make sure that you have downloaded the LibriSpeech-100 dataset from [OpenSLR](https://www.openslr.org/12) and placed it in a directory of your choice. In this notebook, we assume that the dataset is stored under `/hdd/database/librispeech-100/`. If your dataset is located elsewhere, replace `/hdd/database/librispeech-100/` in the code below with the actual path.

### Data Preparation

This notebook follows the data preparation steps outlined in `asr.sh`. First, we create dump files that store information about the data, including the data ID, audio path, and transcription.

ESPnet-Easy supports various types of datasets, including:

1. Dictionary-based dataset with the following structure:
   ```python
   {
     "data_id": {
         "speech": path_to_speech_file,
         "text": transcription
     }
   }
   ```

2. List-based dataset with the following structure:
   ```python
   [
     {
         "speech": path_to_speech_file,
         "text": transcription
     }
   ]
   ```

If you choose to use a dictionary-based dataset, it's essential to ensure that each `data_id` is unique. ESPnet-Easy also accepts a dump file that may have already been created by `asr.sh`. However, in this notebook, we will create the dump file from scratch.
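
As a minimal sketch of the two layouts, the following builds the same two utterances as a dictionary-based and as a list-based dataset, and rejects duplicate IDs while building the dictionary. The IDs, paths, and the helpers `to_dict_dataset` and `to_list_dataset` are hypothetical and are not part of ESPnet-Easy.

```python
# Hypothetical sample entries: (data_id, audio path, transcription).
entries = [
    ("utt-0001", "/path/to/utt-0001.flac", "HELLO WORLD"),
    ("utt-0002", "/path/to/utt-0002.flac", "GOOD MORNING"),
]


def to_dict_dataset(entries):
    """Build a dictionary-based dataset, rejecting duplicate IDs."""
    dataset = {}
    for data_id, speech, text in entries:
        if data_id in dataset:
            raise ValueError(f"Duplicate data_id: {data_id}")
        dataset[data_id] = {"speech": speech, "text": text}
    return dataset


def to_list_dataset(entries):
    """Build a list-based dataset; entries carry no explicit ID."""
    return [{"speech": speech, "text": text} for _, speech, text in entries]


dict_dataset = to_dict_dataset(entries)
list_dataset = to_list_dataset(entries)
```

The explicit duplicate check matters because assigning to an existing dictionary key would otherwise silently overwrite the earlier utterance.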

In [None]:
# Need to install espnet if you don't have it
%pip install -U ../../
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache

Now, let's create dump files!  
Please note that you need to provide a dictionary that specifies the dump file name and format for each data field.
This dictionary should have the following format:

```python
{
    "data_name": ["dump_file_name", "dump_format"]
}
```

In [None]:
import os
import glob

import espnetez as ez


DUMP_DIR = "./dump/libri100"
LIBRI_100_DIRS = [
    ["/hdd/database/librispeech-100/LibriSpeech/train-clean-100", "train"],
    ["/hdd/database/librispeech-100/LibriSpeech/dev-clean", "dev-clean"],
    ["/hdd/database/librispeech-100/LibriSpeech/dev-other", "dev-other"],
]
data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}


def create_dataset(data_dir):
    dataset = {}
    for chapter in glob.glob(os.path.join(data_dir, "*/*")):
        text_file = glob.glob(os.path.join(chapter, "*.txt"))[0]

        with open(text_file, "r") as f:
            lines = f.readlines()

        ids_text = {
            line.split(" ")[0]: line.split(" ", maxsplit=1)[1].replace("\n", "")
            for line in lines
        }
        # LibriSpeech from OpenSLR ships its audio as FLAC files
        audio_files = glob.glob(os.path.join(chapter, "*.flac"))
        for audio_file in audio_files:
            audio_id = os.path.splitext(os.path.basename(audio_file))[0]
            dataset[audio_id] = {
                "speech": audio_file,
                "text": ids_text[audio_id]
            }
    return dataset


for d, n in LIBRI_100_DIRS:
    dump_dir = os.path.join(DUMP_DIR, n)
    os.makedirs(dump_dir, exist_ok=True)

    dataset = create_dataset(d)
    ez.data.create_dump_file(dump_dir, dataset, data_info)

For the validation files, you have two directories: `dev-clean` and `dev-other`.
To create a unified dev dataset, you can use the `ez.data.join_dumps` function.

In [None]:
ez.data.join_dumps(
    ["./dump/libri100/dev-clean", "./dump/libri100/dev-other"], "./dump/libri100/dev"
)
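
Conceptually, joining dumps amounts to concatenating the same-named files (such as `wav.scp` and `text`) from each source directory. The sketch below illustrates that idea with a hypothetical helper, `concat_dump_dirs`; it is not the implementation of `ez.data.join_dumps`, whose behavior may differ in detail.

```python
import os


def concat_dump_dirs(src_dirs, dst_dir):
    """Concatenate same-named files from src_dirs into dst_dir.

    Illustrative only; not the implementation of ez.data.join_dumps.
    """
    os.makedirs(dst_dir, exist_ok=True)
    # Only merge files that exist in every source directory.
    names = set(os.listdir(src_dirs[0]))
    for d in src_dirs[1:]:
        names &= set(os.listdir(d))
    for name in sorted(names):
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as out:
            for d in src_dirs:
                with open(os.path.join(d, name), "r", encoding="utf-8") as f:
                    for line in f:
                        # Guard against a missing trailing newline.
                        out.write(line if line.endswith("\n") else line + "\n")
    return sorted(names)
```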

Now you have dataset files in the `dump` directory.
They look like this:

wav.scp
```
1255-138279-0008 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0008.flac
1255-138279-0022 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0022.flac
```

text
```
1255-138279-0008 TWO THREE
1255-138279-0022 IF I SAID SO OF COURSE I WILL
```
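
Both files use one `<data_id> <value>` pair per line, so they are easy to check by hand. The following sketch parses such lines into dictionaries and verifies that every audio entry has a transcription. The helper `read_scp` is hypothetical, and the sample lines are copied from the listings above.

```python
def read_scp(lines):
    """Parse '<data_id> <value>' lines into a dict; the value may contain spaces."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split(" ", maxsplit=1)
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


wav_lines = [
    "1255-138279-0008 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0008.flac\n",
    "1255-138279-0022 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0022.flac\n",
]
text_lines = [
    "1255-138279-0008 TWO THREE\n",
    "1255-138279-0022 IF I SAID SO OF COURSE I WILL\n",
]

wav = read_scp(wav_lines)
text = read_scp(text_lines)

# IDs present in only one of the two files indicate a broken dump.
mismatched_ids = sorted(set(wav) ^ set(text))
```

In practice, the lines would come from `open(path).readlines()` on the generated `wav.scp` and `text` files.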


### Train SentencePiece Model

To train a SentencePiece model, we require a text file for training. Let's begin by creating the training file.

In [None]:
# generate training texts from the training data
# you can select several datasets to train sentencepiece.
ez.preprocess.prepare_sentences(["dump/libri100/train/text"], "dump/spm")

ez.preprocess.train_sentencepiece(
    "dump/spm/train.txt",
    "data/bpemodel",
    vocab_size=5000,
)

### Configure Training Process

To configure the training process, you can use the configuration files already provided by ESPnet contributors. To use one, place the YAML file on your local machine. For instance, you can use the [E-Branchformer config](train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml).

In my case, I reduced the `batch_bins` parameter from `16000000` to `1600000` so that training runs on my GPU (RTX 2080 Ti).
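
Instead of editing the YAML file, the same change could be applied in code after the configuration is loaded. This is a sketch that assumes the loaded configuration behaves like a plain dictionary; the helper `with_overrides` is hypothetical, and `base_config` is only a stand-in for the real file contents.

```python
import copy


def with_overrides(config, **overrides):
    """Return a copy of config with the given top-level keys replaced."""
    unknown = [key for key in overrides if key not in config]
    if unknown:
        # Catch typos such as "batch_bin" before they are silently added.
        raise KeyError(f"Unknown config keys: {unknown}")
    new_config = copy.deepcopy(config)
    new_config.update(overrides)
    return new_config


# Stand-in for the loaded YAML; only batch_bins is taken from the text above.
base_config = {"batch_bins": 16000000}
small_gpu_config = with_overrides(base_config, batch_bins=1600000)
```

Returning a copy leaves the original configuration untouched, which makes it easier to compare runs with different settings.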

### Training

Before training, run the `collect_stats` method to prepare the stats files. This step must be completed before training starts, as it gathers the dataset statistics that the training process relies on.

In [None]:
import espnetez as ez

EXP_DIR = "exp/train_asr_branchformer_e24_amp"
STATS_DIR = "exp/stats"

# load config
training_config = ez.config.from_yaml(
    "asr",
    "train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml",
)
preprocessor_config = ez.utils.load_yaml("preprocess.yaml")
training_config.update(preprocessor_config)

with open(preprocessor_config["token_list"], "r") as f:
    training_config["token_list"] = [t.replace("\n", "") for t in f.readlines()]

# Define the Trainer class
trainer = ez.Trainer(
    task='asr',
    train_config=training_config,
    train_dump_dir="dump/libri100/train",
    valid_dump_dir="dump/libri100/dev",
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=1,
)
trainer.collect_stats()

Finally, we are ready to begin the training process!

In [None]:
trainer.train()

### Inference
For inference, you can use ESPnet's inference API directly.

In [None]:
import librosa
from espnet2.bin.asr_inference import Speech2Text

m = Speech2Text(
    "./exp/train_asr_branchformer_e24_amp/config.yaml",
    "./exp/train_asr_branchformer_e24_amp/valid.acc.best.pth",
    beam_size=10,
)

# Take the first entry of wav.scp; each line is "<data_id> <audio_path>"
with open("./dump/libri100/dev/wav.scp", "r") as f:
    sample_path = f.readline().split()[1]

y, sr = librosa.load(sample_path, sr=16000, mono=True)
output = m(y)
print(output[0][0])
