### Download and prepare LibriSpeech dataset

This notebook prepares the LibriSpeech dataset for training the wav2letter model:
- Download the data by creating a torchaudio dataset.
- Using the prepareLibriSpeech utility, create manifests for the training, validation and test sets (to be used by the dataset object).
- After this demo completes successfully, the data is ready to be used for training the network.

In [1]:
import os
import shutil
import torchaudio
from wav2letter.prepareLibriSpeech import *
from wav2letter import config 

### Loading the config file...
This file holds the paths and URLs needed to download and prepare the data:
- $data\_dir$: path to the root directory of the data
- $training\_url$: list of LibriSpeech subsets to be used for training (e.g. train-clean-100, train-clean-360 etc.)
- $val\_url$: list of LibriSpeech subsets to be used for validation (e.g. dev-clean)
- $test\_url$: list of LibriSpeech subsets to be used for testing (e.g. test-clean)
##### Manifest files created:
- $train\_manifest$: describes the directory structure and files of the training set.
- $val\_manifest$: describes the directory structure and files of the validation set.
- $test\_manifest$: describes the directory structure and files of the test set.
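
Before downloading anything, it can help to confirm that the config actually provides the entries listed above. The following is a minimal sketch, not part of wav2letter: `check_config` is a hypothetical helper, it assumes the manifest entries are config keys alongside the paths and URLs, and the sample values (including the `.csv` file names) are illustrative only.

```python
# Hypothetical helper (not part of wav2letter): report which of the keys
# read by this notebook are missing from a dict-like config.
REQUIRED_KEYS = (
    "data_dir", "force_redownload",
    "training_url", "val_url", "test_url",
    "train_manifest", "val_manifest", "test_manifest",
)


def check_config(cfg):
    """Return the required keys that are absent from cfg (empty if complete)."""
    return [key for key in REQUIRED_KEYS if key not in cfg]


# Illustrative values only; the real ones come from the project's config file.
sample_config = {
    "data_dir": "/tmp/librispeech",
    "force_redownload": False,
    "training_url": ["train-clean-100"],
    "val_url": ["dev-clean"],
    "test_url": ["test-clean"],
    "train_manifest": "train_manifest.csv",  # assumed file name
    "val_manifest": "val_manifest.csv",      # assumed file name
    "test_manifest": "test_manifest.csv",    # assumed file name
}

print("Missing keys:", check_config(sample_config))
```

Running the same check on the real `config` object before the download cell should surface a misspelled or missing key early, instead of partway through a multi-gigabyte download.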

In [None]:
data_dir = config['data_dir']

# Create an empty parent directory
if config['force_redownload']:
    # Remove the existing directory tree, if any
    if os.path.exists(data_dir):
        shutil.rmtree(data_dir)
    # Create the parent folder for the data directory
    os.makedirs(data_dir)
    print("Successfully created an empty parent directory.")
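
The delete-and-recreate step above can be wrapped in a small function and tried on a throwaway directory first, since `shutil.rmtree` removes everything under the path it is given. This is a sketch under that assumption; `reset_dir` is a hypothetical name, not a wav2letter utility.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete path if it exists, then recreate it as an empty directory."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


# Demonstrate on a temporary directory rather than the real data_dir.
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "data")
os.makedirs(demo_dir)
open(os.path.join(demo_dir, "stale.txt"), "w").close()

reset_dir(demo_dir)
print(os.listdir(demo_dir))  # the stale file has been removed

shutil.rmtree(demo_root)  # clean up the demo directory
```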

#### Downloading data...
- Make sure the path specified by $data\_dir$ in $config\_rf.yaml$ is valid; this is where the LibriSpeech data is downloaded and stored.

In [4]:
for url in config['training_url']:
    data = torchaudio.datasets.LIBRISPEECH(config['data_dir'], url=url, download=True)
for url in config['val_url']:
    data = torchaudio.datasets.LIBRISPEECH(config['data_dir'], url=url, download=True)
for url in config['test_url']:
    data = torchaudio.datasets.LIBRISPEECH(config['data_dir'], url=url, download=True)


  0%|          | 0.00/5.95G [00:00<?, ?B/s]

  0%|          | 0.00/21.5G [00:00<?, ?B/s]

  0%|          | 0.00/28.5G [00:00<?, ?B/s]

  0%|          | 0.00/322M [00:00<?, ?B/s]

  0%|          | 0.00/331M [00:00<?, ?B/s]
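
The three download loops differ only in the config key they read, so they can be folded into one pass. The sketch below assumes `config` behaves like a dict of lists, as in the cell above; `collect_urls` and `download_all` are hypothetical helpers introduced here for illustration.

```python
def collect_urls(cfg, keys=("training_url", "val_url", "test_url")):
    """Gather subset names from all splits, dropping duplicates in order."""
    seen = []
    for key in keys:
        for url in cfg[key]:
            if url not in seen:
                seen.append(url)
    return seen


def download_all(cfg):
    """Download every listed subset into cfg['data_dir'] via torchaudio."""
    import torchaudio  # imported lazily so collect_urls works without it

    for url in collect_urls(cfg):
        torchaudio.datasets.LIBRISPEECH(cfg["data_dir"], url=url, download=True)


# Illustrative subset names taken from the examples in the config description.
example = {
    "training_url": ["train-clean-100", "train-clean-360"],
    "val_url": ["dev-clean"],
    "test_url": ["test-clean"],
}
print(collect_urls(example))
```

Calling `download_all(config)` would then replace the three loops. Deduplicating first avoids constructing the same dataset twice if a subset appears under more than one key.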

#### Setting up dataset
Create the manifest files used by the $wav2letter.datasets.LibriSpeechDataset$ object.

In [5]:
prepare_dataset(config, train=True, val=True, test=True)

Preparing training manifest...!
Preparing val manifest...!
Preparing test manifest...!
Done...!
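
The manifest format written by `prepare_dataset` is not shown in this notebook. As a rough illustration of the kind of information such a manifest could hold, the sketch below walks a LibriSpeech split (speaker and chapter folders containing `.flac` files plus one `*.trans.txt` transcript file per chapter) and pairs each audio file with its transcript. `build_manifest` is a hypothetical helper and an assumption about what a manifest might contain, not the wav2letter implementation.

```python
import os


def build_manifest(split_dir):
    """Return sorted (audio_path, transcript) pairs found under split_dir."""
    entries = []
    for root, _dirs, files in os.walk(split_dir):
        for name in files:
            if not name.endswith(".trans.txt"):
                continue
            # Each transcript line reads "<utterance-id> <TRANSCRIPT TEXT>".
            with open(os.path.join(root, name), encoding="utf-8") as handle:
                for line in handle:
                    parts = line.strip().split(" ", 1)
                    if len(parts) != 2:
                        continue  # skip blank or malformed lines
                    utt_id, text = parts
                    audio = os.path.join(root, utt_id + ".flac")
                    if os.path.exists(audio):
                        entries.append((audio, text))
    return sorted(entries)
```

Assuming torchaudio's default extraction folder, a call such as `build_manifest(os.path.join(data_dir, "LibriSpeech", "dev-clean"))` would list the validation utterances. Entries whose audio file is missing are skipped, so a partial download yields a shorter list rather than dangling paths.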
