## ESPnet-Easy with torchaudio.datasets
This Jupyter notebook is a step-by-step guide to using a dataset from `torchaudio.datasets` as an external dataset with ESPnet-Easy. As an example, we train an Automatic Speech Recognition (ASR) model on the `train-clean-100` subset of LibriSpeech.

Install the `torchaudio` library before running the code in this notebook.

### Data Preparation

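The available datasets and their arguments are described in the official documentation of [torchaudio.datasets](https://pytorch.org/audio/stable/datasets.html). With `download=True`, the data is downloaded to `root` if it is not already there.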

In [None]:
import torchaudio
import espnetez as ez
import numpy as np

train_dataset = torchaudio.datasets.LIBRISPEECH(
    root='/hdd/database/torchaudio',
    url='train-clean-100',
    download=True
)
valid_dataset = torchaudio.datasets.LIBRISPEECH(
    root='/hdd/database/torchaudio',
    url='test-clean',
    download=True
)


### Train SentencePiece model

To train a SentencePiece model, we need a text file for training. 

Let's begin by creating the training file, and then execute `ez.preprocess.train_sentencepiece` to train the model.

In [None]:
# generate training texts from the training data
# you can select several datasets to train sentencepiece.
import os
from tqdm import tqdm

SPM_DIR = "data/spm"
SPM_TRAIN_FILE = f"{SPM_DIR}/train.txt"

# create the output directory if it does not exist yet
os.makedirs(SPM_DIR, exist_ok=True)

# create training data
text = []
for i in tqdm(range(len(train_dataset))):
    text.append(train_dataset[i][2])  # get transcription

with open(SPM_TRAIN_FILE, "w", encoding="utf-8") as f:
    f.write('\n'.join(text))

ez.preprocess.train_sentencepiece(
    SPM_TRAIN_FILE,
    "data/bpemodel",
    vocab_size=5000,
)
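
The comment in the cell above notes that several datasets can be combined to train SentencePiece. A minimal sketch of that idea follows, assuming every dataset yields tuples with the transcription at index 2, as LibriSpeech does. The helper `write_transcripts` and the toy datasets are hypothetical names used only for illustration; they are not part of ESPnet-Easy or torchaudio.

```python
import os
import tempfile


def write_transcripts(datasets, out_path, text_index=2):
    """Write one transcript per line, drawn from several indexable datasets.

    Assumption: each item is a tuple whose `text_index`-th field is the
    transcription string (true for the LibriSpeech tuples used above).
    Returns the number of lines written.
    """
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    n_lines = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for dataset in datasets:
            for i in range(len(dataset)):
                line = dataset[i][text_index].strip()
                if line:  # skip empty transcripts
                    f.write(line + "\n")
                    n_lines += 1
    return n_lines


# Toy stand-ins for real datasets (hypothetical data, no audio attached).
toy_a = [(None, 16000, "HELLO WORLD", 1, 1, 0)]
toy_b = [
    (None, 16000, "GOOD MORNING", 2, 1, 0),
    (None, 16000, "  ", 2, 1, 1),  # blank transcript, will be skipped
]

out_file = os.path.join(tempfile.mkdtemp(), "spm", "train.txt")
num_written = write_transcripts([toy_a, toy_b], out_file)
```

The resulting file can be passed to `ez.preprocess.train_sentencepiece` in place of `SPM_TRAIN_FILE`. Note that indexing a real torchaudio dataset also loads the audio for each item, so this loop can be slow on large corpora.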

### Wrap with ESPnetEasyDataset

Before training, we need to adapt the dataset to the format ESPnet expects: each sample must provide the audio and the tokenized text as `np.ndarray` instances.

The `torchaudio.datasets` module offers datasets with the following format: `(audio, sample_rate, transcription, speaker_id, chapter_id, utterance_id)`.
To align with the ESPnet format, we must undertake the following preprocessing steps:

- Convert the audio to a `np.ndarray` instance in a single channel.
- Tokenize the transcription and convert it to a `np.ndarray` instance.

We define a `data_info` argument below to specify these preprocessing steps, which is then provided to the `ESPnetEasyDataset` constructor.

In [None]:
from espnet2.text.build_tokenizer import build_tokenizer
from espnet2.text.token_id_converter import TokenIDConverter

tokenizer = build_tokenizer(
    token_type='bpe',
    bpemodel='data/bpemodel/bpe.model'
)
converter = TokenIDConverter('data/bpemodel/tokens.txt')

def tokenize(text):
    return np.array(converter.tokens2ids(tokenizer.text2tokens(text)))

data_info = {
    "speech": lambda d: d[0].squeeze(0).numpy(),
    "text": lambda d: tokenize(d[2]),
}

# Convert into ESPnet-Easy dataset format
train_dataset = ez.dataset.ESPnetEasyDataset(train_dataset, data_info=data_info)
valid_dataset = ez.dataset.ESPnetEasyDataset(valid_dataset, data_info=data_info)
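
To make the role of `data_info` concrete, here is a rough sketch of how a wrapper could apply such a mapping to each raw sample. It is an illustration only, not the implementation of `ESPnetEasyDataset`, which may return additional fields such as an utterance ID. `ToyWrappedDataset`, `toy_tokenize`, and `toy_vocab` are hypothetical names, and plain Python lists stand in for tensors and arrays so the example runs without extra dependencies.

```python
class ToyWrappedDataset:
    """Hypothetical stand-in showing how a `data_info` mapping can be applied.

    Each value in `data_info` is a callable that receives the raw sample
    and returns one field of the output dictionary.
    """

    def __init__(self, dataset, data_info):
        self.dataset = dataset
        self.data_info = data_info

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        raw = self.dataset[idx]
        return {key: fn(raw) for key, fn in self.data_info.items()}


# Toy word-level vocabulary standing in for the trained BPE model.
toy_vocab = {"<unk>": 0, "HELLO": 1, "WORLD": 2}


def toy_tokenize(text):
    """Map each whitespace-separated word to an ID, using 0 for unknowns."""
    return [toy_vocab.get(word, 0) for word in text.split()]


# One raw sample in the torchaudio layout:
# (audio, sample_rate, transcription, speaker_id, chapter_id, utterance_id)
toy_raw = [([[0.0, 0.1, -0.1]], 16000, "HELLO WORLD", 1, 1, 0)]

toy_data_info = {
    "speech": lambda d: d[0][0],           # drop the channel axis
    "text": lambda d: toy_tokenize(d[2]),  # transcription -> token IDs
}

wrapped = ToyWrappedDataset(toy_raw, toy_data_info)
sample = wrapped[0]
```

Because the callables run lazily on each access, no dump files have to be written ahead of training.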

### Configure Training Process

The configuration process is the same as in the other notebooks. Please refer to the `libri100.ipynb` notebook for more details.

### Training

While the configuration remains consistent with the other notebooks, the arguments used to instantiate the `Trainer` class differ in this case. Because we have not generated dump files, we can omit the arguments related to dump files and pass the train/valid dataset instances directly.

```
trainer = Trainer(
    ...
    train_dataset=your_train_dataset_instance,
    valid_dataset=your_valid_dataset_instance,
    ...
)
```

In [None]:
import espnetez as ez

EXP_DIR = "exp/small"
STATS_DIR = "exp/stats"

# load config
training_config = ez.config.from_yaml(
    "asr",
    "config/train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml",
)
preprocessor_config = ez.utils.load_yaml("preprocess.yaml")
training_config.update(preprocessor_config)

# read the token list, one token per line
with open(preprocessor_config["token_list"], "r", encoding="utf-8") as f:
    training_config["token_list"] = [t.replace("\n", "") for t in f.readlines()]

# Instantiate the Trainer with the wrapped datasets
trainer = ez.Trainer(
    task='asr',
    train_config=training_config,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    ngpu=1,
)
trainer.collect_stats()
trainer.train()