# Pre-Process Speech Dataset for *Efficient Speech* Model Training
## Important: This notebook requires a GPU with CUDA available!

This notebook preprocesses a speech dataset into the format that EfficientSpeech expects when training checkpoints.
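Before going further, it can help to confirm that the runtime actually exposes a GPU. The cell below is a minimal sketch, not part of the original notebook: the helper name `cuda_gpu_visible` is hypothetical, and it assumes the NVIDIA driver tools (`nvidia-smi`) are on the `PATH`, which is usually the case on Colab GPU runtimes.

```python
import shutil
import subprocess


def cuda_gpu_visible():
    """Best-effort check: True if `nvidia-smi -L` runs and lists at least one GPU."""
    # Hypothetical helper; it only checks the driver tools, not any framework.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", cuda_gpu_visible())
```

If this prints `False` on Colab, switching the runtime type to a GPU runtime is the usual fix.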

### **Dataset Specifications**
#### Input Dataset format:
This notebook assumes your dataset is a **folder** of mono, 22050 Hz `.wav` files, where each audio file has a transcription text file with the same base name.

* `MyDataset`:  folder
  - `speaker_001.wav`: an audio file
  - `speaker_001.txt`: text transcription of speaker_001.wav
  - ...
  - `speaker_999.wav`
  - `speaker_999.txt`
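
A quick way to check the pairing described above is to compare base names. This is a minimal sketch under the assumption that transcriptions use the `.txt` extension; `find_unpaired` is a hypothetical helper and works on a plain list of file names, so in the notebook you might pass it `os.listdir(dataset_path)`.

```python
import os


def find_unpaired(filenames, text_ext=".txt"):
    """Return (wavs_without_text, texts_without_wav) as sorted lists of base names."""
    wav_names = {os.path.splitext(f)[0] for f in filenames if f.lower().endswith(".wav")}
    text_names = {os.path.splitext(f)[0] for f in filenames if f.lower().endswith(text_ext)}
    return sorted(wav_names - text_names), sorted(text_names - wav_names)


# Example with an in-memory listing instead of a real folder
listing = ["speaker_001.wav", "speaker_001.txt", "speaker_002.wav", "speaker_003.txt"]
missing_text, missing_audio = find_unpaired(listing)
print(missing_text)   # ['speaker_002']
print(missing_audio)  # ['speaker_003']
```

Both lists should be empty before you start preprocessing.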

#### Output Dataset format:
The output for training is in this format:
* `content/output_dataset`:  folder
  - `configs/MyDataset`: folder
    - `preprocess.yaml`: Only this file is necessary to train EfficientSpeech models
    - `model.yaml`
    - `train.yaml`
  - `preprocessed_data/MyDataset`: folder
    - `duration`: folder
    - `energy`: folder
    - `mel`: folder
    - `pitch`: folder
    - `TextGrid/universal`: folder of .TextGrid files
    - `speakers.json`
    - `stats.json`
    - `train.txt`
    - `val.txt`     
  - `raw_data/universal`: folder
    - `metadata.csv`: corpus file
    - `speaker_001.wav`
    - `speaker_001.txt`
    - ...
    - `speaker_999.wav`
    - `speaker_999.txt`
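
Once the notebook has finished, the layout above can be checked programmatically. The following is only a sketch: `expected_output_paths` and `missing_paths` are hypothetical helpers, and the list of paths is taken from the tree shown above rather than from EfficientSpeech itself.

```python
import os


def expected_output_paths(output_path, dataset_name, speaker_name="universal"):
    """List the main files and folders the output tree above should contain."""
    pre = os.path.join(output_path, "preprocessed_data", dataset_name)
    paths = [os.path.join(output_path, "configs", dataset_name, "preprocess.yaml")]
    paths += [os.path.join(pre, name) for name in ("duration", "energy", "mel", "pitch")]
    paths.append(os.path.join(pre, "TextGrid", speaker_name))
    paths += [
        os.path.join(pre, name)
        for name in ("speakers.json", "stats.json", "train.txt", "val.txt")
    ]
    paths.append(os.path.join(output_path, "raw_data", speaker_name, "metadata.csv"))
    return paths


def missing_paths(paths):
    """Return the subset of paths that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


for path in missing_paths(expected_output_paths("/content/output_dataset", "MyDataset")):
    print("missing:", path)
```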

### Links
EfficientSpeech repository: https://github.com/roatienza/efficientspeech  
FastSpeech2 repository: https://github.com/ming024/FastSpeech2  
Montreal Forced Aligner Tutorial: https://eleanorchodroff.com/mfa_tutorial.html



#  
---



## 0) Mount Google Drive
If your dataset is in a folder named `MyDataset` in your Google Drive, the path would be `/gdrive/MyDrive/MyDataset`.  
This step is optional if you upload your dataset some other way.

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive




#  
---



# 1) Run me first!
## Install Conda and some prerequisites
The runtime restarts after installation. Run the remaining cells after the restart.

In [2]:
!pip install condacolab
!pip install numpy==1.22.4 pyworld==0.2.10
import condacolab
condacolab.install()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting condacolab
  Downloading condacolab-0.1.7-py3-none-any.whl (7.2 kB)
Installing collected packages: condacolab
Successfully installed condacolab-0.1.7
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyworld==0.2.10
  Downloading pyworld-0.2.10.tar.gz (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.0/74.0 kB 3.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyworld
  Building wheel for pyworld (setup.py) ... done
  Created wheel for pyworld: filename=pyworld-0.2.10-cp310-cp310-linux_x86_64.whl size=885499 sha256=90e36618354981b0a8fc6a901df772bde421978854d2e0a7713cce468af1c138
  Stored in directory: /root/.cache/pip/wheels/b7/6e/38/5c44182b8cdadd956e127d6b9dc2c4b539af20dfa43924f702
Successfu



#  
---



# 2) Prepare Dataset 
#### Make sure to configure the settings in the `Configuration Settings` section below before running these cells.
Running these cells will preprocess your dataset and save it to your Drive as a .zip file.



### Configuration Settings
##### Dataset
* dataset_name: The name of your dataset
* dataset_path: A directory with the raw audio files + text transcriptions. The text and audio file names should match.
* speaker_name: One of 'universal', 'LJSpeech'
* val_size: The size of your validation set. (default: 512)  
  - 0 < *val_size* < total audio files. 
  - Example: For FastSpeech2, LJSpeech config has 13,100 audio files with a *val_size* of 512.
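
The constraint on *val_size* can be checked before any processing starts. This is a minimal sketch with a hypothetical helper `split_sizes`; it assumes that every utterance not placed in the validation set ends up in the training set, which should be verified against the preprocessor you use.

```python
def split_sizes(val_size, total_files):
    """Return (train_count, val_count), or raise if val_size is out of range."""
    if not 0 < val_size < total_files:
        raise ValueError(
            f"val_size must satisfy 0 < val_size < {total_files}, got {val_size}"
        )
    return total_files - val_size, val_size


# The LJSpeech example from above: 13,100 files with a val_size of 512
print(split_sizes(512, 13100))  # (12588, 512)
```

For a small dataset (for example 9 files), the default of 512 fails this check, so lower *val_size* accordingly.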

##### Output
* output_path: Where to save the working files
* output_zip_path: Where to save the finished dataset as a .zip file

##### MFA (Montreal Forced Aligner) Settings
* text_file_extension: The file extension of the text transcription files in `dataset_path` ('.txt' or '.lab').
* corpus_name: The name of the generated corpus file (default: 'metadata.csv')
* lexicon_path: The lexicon/dictionary path written to `preprocess.yaml` and used when preprocessing the dataset.
* dictionary_file: The lexicon/dictionary passed to `mfa align` when running MFA.
* allow_overwrite_existing_corpus: Enable to allow overwriting an existing `corpus_name` file.
* acoustic_model: MFA acoustic model (default: 'english_us_arpa')
* dictionary_model: MFA dictionary model. It is downloaded, but the alignment step uses `dictionary_file` instead.
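
Words that are missing from the lexicon are a common source of poor alignments, so it can be worth scanning your transcriptions first. The sketch below is illustrative only: both helpers are hypothetical, and it assumes a lexicon where each line is an upper-case word followed by whitespace-separated phones, which is how `librispeech-lexicon.txt` appears to be laid out.

```python
import re


def load_lexicon_words(lines):
    """Collect the first token of every non-empty lexicon line, upper-cased."""
    words = set()
    for line in lines:
        parts = line.split()
        if parts:
            words.add(parts[0].upper())
    return words


def out_of_vocabulary(transcript, lexicon_words):
    """Return the sorted, upper-cased words of a transcript not found in the lexicon."""
    tokens = re.findall(r"[A-Za-z']+", transcript)
    return sorted({token.upper() for token in tokens} - lexicon_words)


# Tiny in-memory lexicon; in the notebook you might read the lines from dictionary_file
lexicon_lines = ["ASK  AE1 S K", "HER  HH ER1", "TO  T UW1"]
words = load_lexicon_words(lexicon_lines)
print(out_of_vocabulary("Ask her to bring these.", words))  # ['BRING', 'THESE']
```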

In [10]:
import os

# The input dataset
dataset_name = 'MyDataset' #@param {type:'string'}
dataset_path = '/gdrive/MyDrive/MyDataset' #@param {type:'string'}
speaker_name = 'universal' #@param {type:'string'}
val_size = 512 #@param {type:'integer'}

# The output folder for processed data
output_path = '/content/output_dataset' #@param {type:'string'}
output_zip_path = '/gdrive/MyDrive/output_dataset'#@param {type:'string'}

# MFA settings
text_file_extension = '.lab' #@param ['.txt','.lab']
corpus_name = 'metadata.csv' #@param {type:'string'}
lexicon_path = '/content/FastSpeech2/lexicon/librispeech-lexicon.txt' #@param {type:'string'}
dictionary_file = '/content/FastSpeech2/lexicon/librispeech-lexicon.txt' #@param {type:'string'}
allow_overwrite_existing_corpus = True #@param {type:'boolean'}
acoustic_model = 'english_us_arpa' #@param {type:'string'}
dictionary_model = 'english_us_arpa' #@param {type:'string'}

# Paths
preprocessed_data_path = os.path.join(output_path, 'preprocessed_data')
preprocessed_data_speaker_path = os.path.join(output_path, 'preprocessed_data',
                                              dataset_name)
raw_data_path = os.path.join(output_path, 'raw_data')
raw_data_speaker_path = os.path.join(output_path, 'raw_data', speaker_name)
corpus_path = raw_data_speaker_path
corpus_file_path = os.path.join(corpus_path, corpus_name)
mfa_output_path = os.path.join('/home/mfa_user', dataset_name, 'TextGrid')
textgrid_dir = os.path.join(preprocessed_data_speaker_path, 'TextGrid', speaker_name)
config_dir = f'/content/FastSpeech2/config/{dataset_name}'

# Create directory structure
%mkdir -p $output_path
%mkdir -p $corpus_path
%mkdir -p $preprocessed_data_speaker_path
%mkdir -p $raw_data_speaker_path
%mkdir -p $textgrid_dir
%mkdir -p $output_zip_path

## Setup dependencies

In [19]:
%rm -rf /content/FastSpeech2/
%cd /content/
!git clone https://github.com/ming024/FastSpeech2

/content
Cloning into 'FastSpeech2'...
remote: Enumerating objects: 991, done.[K
remote: Counting objects: 100% (13/13), done.[K
remote: Compressing objects: 100% (6/6), done.[K
remote: Total 991 (delta 10), reused 7 (delta 7), pack-reused 978[K
Receiving objects: 100% (991/991), 330.31 MiB | 30.69 MiB/s, done.
Resolving deltas: 100% (175/175), done.
Updating files: 100% (137/137), done.


### Install MFA

In [None]:
!conda install -c conda-forge montreal-forced-aligner

Create MFA user account

In [13]:
# MFA commands must be run as unprivileged user
!useradd -m -d /home/mfa_user mfa_user
!su - mfa_user -c "echo hello as mfa_user"

%mkdir /home/mfa_user
!chown -hR mfa_user /home/mfa_user


useradd: user 'mfa_user' already exists
hello as mfa_user
mkdir: cannot create directory ‘/home/mfa_user’: File exists


Download MFA models

In [14]:
# Excellent MFA tutorial: https://eleanorchodroff.com/mfa_tutorial.html
!su - mfa_user -c "mfa version"
!su - mfa_user -c "mfa model download acoustic $acoustic_model"
!su - mfa_user -c "mfa model download dictionary $dictionary_model"

2.2.12
 INFO     Saved model to /home/mfa_user/Documents/MFA/pretrained_models/acoustic/english_us_arpa.zip,
          you can now use english_us_arpa in place of acoustic paths in mfa commands.
 INFO     Saved model to /home/mfa_user/Documents/MFA/pretrained_models/dictionary/english_us_arpa.dict,
          you can now use english_us_arpa in place of dictionary paths in mfa commands.


### More dependencies required

In [15]:
# Moving the order of dependencies around may cause errors. You have been warned! 
!pip install librosa==0.9.2 unidecode==1.3.6 tgt==1.4.4 pyworld==0.2.10

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting librosa==0.9.2
  Using cached librosa-0.9.2-py3-none-any.whl (214 kB)
Collecting unidecode==1.3.6
  Using cached Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting tgt==1.4.4
  Using cached tgt-1.4.4.tar.gz (21 kB)
  Preparing metadata (setup.py) ... done
Collecting pyworld==0.2.10
  Using cached pyworld-0.2.10-cp310-cp310-linux_x86_64.whl
Collecting resampy>=0.2.2
  Downloading resampy-0.4.2-py3-none-any.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 34.1 MB/s eta 0:00:00
Building wheels for collected packages: tgt
  Building wheel for tgt (setup.py) ... done
  Created wheel for tgt: filename=tgt-1.4.4-py3-none-any.whl size=28903 sha256=03797ca9fe3b3f41d1716f1f0b5756d50a2612c0aa84cf4ab2641d392e803e22
  Stored in directory: /root/.cache/pip/wheels/09/e6/aa/821531faeb4e05a65d1c763570e9079146

## Preprocess Data

#### YAML helper functions

In [16]:
import os
import yaml


# YAML functions
def get_yaml_path(name):
  return os.path.join(config_dir, name+'.yaml')


def get_yaml_contents(name):
  with open(get_yaml_path(name), 'r') as f:
    return yaml.safe_load(f.read())
            

def write_yaml(name, contents):
  with open(get_yaml_path(name), 'w') as f:
    f.write(yaml.dump(contents))

### Make metadata.csv corpus
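The corpus file is saved to `output_path/raw_data/speaker_name`. Each line has the form `file_name|text|text`, with the transcription repeated in the second and third fields.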

In [17]:
import os

# Don't overwrite existing file
if not allow_overwrite_existing_corpus:
  assert(not os.path.exists(corpus_file_path)), 'Corpus file already exists, enable `allow_overwrite_existing_corpus` to disable this behavior.'


def concatenate_file_contents(filename):
    """ Reads a text file and returns a `name|text|text` corpus line """
    # splitext removes only the trailing extension; str.replace would also
    # strip the same substring from the middle of a file name
    filename_no_ext = os.path.splitext(os.path.basename(filename))[0]
    with open(filename, 'r', encoding='utf-8') as file:
        contents = file.read().strip()
    return f"{filename_no_ext}|{contents}|{contents}\n"


def process_files_in_path(text_files_path, output_corpus_file_path):
    """ Open a file at output_corpus_file_path and write formatted data to it """
    # Get all files with the configured extension in the specified path
    txt_files = sorted(file for file in os.listdir(text_files_path)
                       if file.endswith(text_file_extension))
    if not txt_files:
        print(f'No text files with extension {text_file_extension} found in {text_files_path}, try changing `text_file_extension` in settings')
    # The corpus is read back as UTF-8 later, so write it as UTF-8
    with open(output_corpus_file_path, 'w', encoding='utf-8') as f:
        # Process each file and write its corpus line
        for file in txt_files:
            file_path = os.path.join(text_files_path, file)
            f.write(concatenate_file_contents(file_path))
# Run
print(f'Dataset path: {dataset_path}')
print(f'Corpus path: {corpus_file_path}')
process_files_in_path(dataset_path, corpus_file_path)
print('Done')

Dataset path: /gdrive/MyDrive/MyDataset
Corpus path: /content/output_dataset/raw_data/universal/metadata.csv
Done
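
To make the corpus line format concrete, here is a small round-trip sketch. `format_corpus_line` and `parse_corpus_line` are hypothetical helpers, not functions from FastSpeech2; unlike the cell above, the formatter also collapses line breaks and rejects `|` characters, since either would break the one-line, pipe-separated layout.

```python
def format_corpus_line(base_name, transcript):
    """Build a `name|text|text` line, collapsing whitespace and newlines."""
    text = " ".join(transcript.split())
    if "|" in base_name or "|" in text:
        raise ValueError("'|' is the field separator and cannot appear in a field")
    return f"{base_name}|{text}|{text}\n"


def parse_corpus_line(line):
    """Return (base_name, text), reading the text from the third field."""
    base_name, _raw_text, text = line.strip().split("|")
    return base_name, text


line = format_corpus_line("p303_002", "Ask her to bring\nthese things.")
print(repr(line))
print(parse_corpus_line(line))  # ('p303_002', 'Ask her to bring these things.')
```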


### Create configuration files
Copy the LJSpeech configs from FastSpeech2 and modify them with the user-defined parameters.

In [20]:
import os
import yaml
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

copied_config_dir = os.path.join(output_path, 'configs')

!mkdir -p $config_dir 
!mkdir $copied_config_dir
!cp -r /content/FastSpeech2/config/LJSpeech/* $config_dir

# model.yaml - change speaker name
model = get_yaml_contents('model')
model['vocoder']['speaker'] = speaker_name
write_yaml('model', model)

# preprocess.yaml - update paths and add field to text
pp = get_yaml_contents('preprocess')
pp['dataset'] = dataset_name
pp['path']['corpus_path'] = corpus_path
pp['path']['lexicon_path'] = lexicon_path
pp['path']['raw_path'] = raw_data_path
pp['path']['preprocessed_path'] = preprocessed_data_speaker_path
pp['preprocessing']['text']['max_length'] = 4096  # Needed for training EfficientSpeech models
pp['preprocessing']['val_size'] = val_size  # Needed for training EfficientSpeech models

write_yaml('preprocess', pp)

# train.yaml - update paths
tr = get_yaml_contents('train')
tr['path']['ckpt_path'] = f'./output/ckpt/{dataset_name}'
tr['path']['log_path'] = f'./output/log/{dataset_name}'
tr['path']['result_path'] = f'./output/result/{dataset_name}'
write_yaml('train', tr)

print(f'Wrote configs in {config_dir}, copying to {copied_config_dir}')
!cp -r $config_dir $copied_config_dir
print('Done')

mkdir: cannot create directory ‘/content/output_dataset/configs’: File exists
Wrote configs in /content/FastSpeech2/config/MyDataset, copying to /content/output_dataset/configs
Done
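
The config edits above all follow the same pattern: load a nested mapping, set a few nested keys, write it back. The sketch below shows that pattern on a plain dictionary without touching any YAML file. `apply_overrides` and the dotted-key notation are assumptions made for illustration; the sample values are placeholders, not the real LJSpeech config.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with dotted keys such as 'path.raw_path' set."""
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        node = updated
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            # Create intermediate mappings when they do not exist yet
            node = node.setdefault(key, {})
        node[leaf] = value
    return updated


base = {"dataset": "LJSpeech", "path": {"raw_path": "./raw_data"}, "preprocessing": {"text": {}}}
updated = apply_overrides(base, {
    "dataset": "MyDataset",
    "path.raw_path": "/content/output_dataset/raw_data",
    "preprocessing.text.max_length": 4096,
    "preprocessing.val_size": 512,
})
print(updated["preprocessing"])  # {'text': {'max_length': 4096}, 'val_size': 512}
```

Working on a copy keeps the original mapping available for comparison if a later step fails.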


### Prepare align
The following code is modified from https://github.com/ming024/FastSpeech2/blob/master/preprocessor/ljspeech.py


In [21]:
#The following code is modified from 
# https://github.com/ming024/FastSpeech2/blob/master/preprocessor/ljspeech.py

import os
import yaml
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

import librosa
import numpy as np
from scipy.io import wavfile
from tqdm import tqdm

# Workaround for importing text
import sys
sys.path.append('/content/FastSpeech2')
from text import _clean_text


def prepare_align(config):
    sampling_rate = config["preprocessing"]["audio"]["sampling_rate"]
    max_wav_value = config["preprocessing"]["audio"]["max_wav_value"]
    cleaners = config["preprocessing"]["text"]["text_cleaners"]
    speaker = speaker_name
    with open(corpus_file_path, encoding="utf-8") as f:
        for line in tqdm(f):
            parts = line.strip().split("|")
            base_name = parts[0]
            text = parts[2]
            text = _clean_text(text, cleaners)

            wav_path = os.path.join(dataset_path, "{}.wav".format(base_name))
            if os.path.exists(wav_path):
                os.makedirs(raw_data_speaker_path, exist_ok=True)
                wav, sr = librosa.load(wav_path, sr=sampling_rate)
                #wav, _ = librosa.load(wav_path, sampling_rate)
                wav = wav / max(abs(wav)) * max_wav_value
                wavfile.write(
                    os.path.join(raw_data_speaker_path, "{}.wav".format(base_name)),
                    sampling_rate,
                    wav.astype(np.int16),
                )
                with open(
                    os.path.join(raw_data_speaker_path, "{}.lab".format(base_name)),
                    "w",
                ) as f1:
                    f1.write(text)


config = get_yaml_contents('preprocess')
prepare_align(config)
print('Prepare align done')

9it [00:05,  1.78it/s]

Prepare align done
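
The normalization line in `prepare_align` scales each clip so that its loudest sample reaches `max_wav_value` before converting to 16-bit integers. The plain-Python sketch below illustrates the same idea with two extra guards that the cell above does not have: silent clips (a peak of zero) and values outside the int16 range. `peak_normalize_int16` is a hypothetical helper, `32768.0` is an assumed value for `max_wav_value`, and the sketch rounds where the cell above truncates. It is meant for reading, not for processing real audio at speed.

```python
def peak_normalize_int16(samples, max_wav_value=32768.0):
    """Scale samples so the peak hits max_wav_value, then clip to the int16 range."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent (or empty) input: avoid dividing by zero
        return [0] * len(samples)
    scaled = (s / peak * max_wav_value for s in samples)
    # int16 holds -32768..32767, so a positive peak of 32768 must be clipped
    return [max(-32768, min(32767, int(round(v)))) for v in scaled]


print(peak_normalize_int16([0.0, 0.25, -0.5]))  # [0, 16384, -32768]
print(peak_normalize_int16([0.5]))              # [32767]
```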





### Run MFA forced alignment
Creates TextGrid files

In [22]:
# Output TextGrid files go here
!su - mfa_user -c "mkdir -p $mfa_output_path"
%mkdir -p $textgrid_dir

# Allow mfa_user access to output directory 
!chown mfa_user $textgrid_dir

# Command line options
# -m fast: immediate disconnect (doesn't work sadface)
# --clean: cleans output dir for subsequent runs (if off, 
#             does not overwrite old data)
# --single_speaker: multiprocessing for only one speaker
mfa_cmd_opts = '--clean --single_speaker'
align_cmd_opts = f'{corpus_path} {dictionary_file} {acoustic_model} {mfa_output_path}'

# Command must be run as unprivileged user
!echo Running mfa align with arguments: $mfa_cmd_opts $align_cmd_opts
!su - mfa_user -c "mfa align $mfa_cmd_opts $align_cmd_opts"

# If the cell hangs you can terminate it early after it says "Exporting alignment TextGrids to..."
#!echo Copying TextGrid files to $textgrid_dir
#!cp $mfa_output_path/*.* $textgrid_dir 

Running mfa align with arguments: --clean --single_speaker /content/output_dataset/raw_data/universal /content/FastSpeech2/lexicon/librispeech-lexicon.txt english_us_arpa /home/mfa_user/MyDataset/TextGrid
The global MFA database server does not exist, initializing it first.
waiting for server to start.... done
server started
 INFO     Setting up corpus information...
 INFO     Loading corpus from source files...
   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/100  [ 0:00:01 < -:--:-- , ? it/s ]
 INFO     Found 1 speaker across 9 files, average number of utterances per speaker: 9.0
 INFO     Initializing multiprocessing jobs...

Copy the TextGrid files produced by the alignment step into `textgrid_dir`

In [23]:
!echo Copying TextGrid files to $textgrid_dir
!cp $mfa_output_path/*.* $textgrid_dir 

Copying TextGrid files to /content/output_dataset/preprocessed_data/MyDataset/TextGrid/universal
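
After copying, it is worth confirming that every audio file received an alignment, since an utterance without a TextGrid is unlikely to make it into the preprocessed data. This is a minimal sketch with a hypothetical helper; it compares two file listings, so in the notebook you might pass `os.listdir(raw_data_speaker_path)` and `os.listdir(textgrid_dir)`.

```python
import os


def missing_alignments(wav_files, textgrid_files):
    """Return sorted base names of .wav files that have no matching .TextGrid file."""
    aligned = {
        os.path.splitext(name)[0]
        for name in textgrid_files
        if name.endswith(".TextGrid")
    }
    return sorted(
        os.path.splitext(name)[0]
        for name in wav_files
        if name.endswith(".wav") and os.path.splitext(name)[0] not in aligned
    )


wavs = ["p303_001.wav", "p303_002.wav", "p303_003.wav", "metadata.csv"]
grids = ["p303_001.TextGrid", "p303_003.TextGrid"]
print(missing_alignments(wavs, grids))  # ['p303_002']
```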


## Preprocess TextGrid files for preprocessed_data/ folder
This step runs the FastSpeech2 `Preprocessor`, which creates the files in `preprocessed_data/`.

In [24]:
import sys
sys.path.append('/content/FastSpeech2')
from preprocessor.preprocessor import Preprocessor

config = get_yaml_contents('preprocess')
preprocessor = Preprocessor(config)
preprocessor.build_from_path()

  fft_window = pad_center(fft_window, filter_length)
  mel_basis = librosa_mel_fn(


Processing Data ...


100%|██████████| 1/1 [00:01<00:00,  1.40s/it]

Computing statistic quantities ...
Total time: 0.01080050390526581 hours





['p303_009|universal|{DH EH1 R IH0 Z AH0 K AO1 R D IH0 NG T IH0 L EH1 JH AH0 N D AH0 B OY1 L IH0 NG P AA1 T AH0 V G OW1 L D AE1 T W AH1 N EH1 N D}|there is , according to legend, a boiling pot of gold at one end.',
 'p303_003|universal|{S IH1 K S S P UW1 N Z AH0 V F R EH1 SH S N OW1 P IY1 Z F AY1 V TH IH1 K S L AE1 B Z AH0 V B L UW1 CH IY1 Z AE1 N D M EY1 B IY0 EY1 S N AE1 K F R ER0 HH ER0 B R AH1 DH ER0 B AA1 B}|six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother bob.',
 'p303_002|universal|{AE1 S K HH ER1 T UW1 B R IH1 NG DH IY1 Z TH IH1 NG Z W IH1 TH HH ER0 F R AH1 M DH AH0 S T AO1 R}|ask her to bring these things with her from the store.',
 'p303_005|universal|{SH IY1 K AH0 N S K UW1 P DH IY1 Z TH IH1 NG Z IH0 N T AH0 TH R IY1 R EH1 D B AE1 G Z AE1 N D W IY1 W AH0 L G OW1 M IY1 T HH ER1 W EH1 N Z D EY2 AE1 T DH AH0 T R EY1 N S T EY1 SH AH0 N}|she can scoop these things into three red bags, and we will go meet her wednesday at the train 
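
The entries printed above follow a `name|speaker|{phones}|text` layout. The sketch below splits one such entry into its fields. `parse_train_line` is a hypothetical helper written from the output shown here, not a function from FastSpeech2 or EfficientSpeech.

```python
def parse_train_line(line):
    """Split a `name|speaker|{phones}|text` entry into a dictionary."""
    # maxsplit=3 keeps any later '|' characters inside the text field
    base_name, speaker, phones, text = line.rstrip("\n").split("|", 3)
    return {
        "name": base_name,
        "speaker": speaker,
        "phones": phones.strip("{}").split(),
        "text": text,
    }


sample = (
    "p303_002|universal|{AE1 S K HH ER1 T UW1 B R IH1 NG DH IY1 Z TH IH1 NG Z "
    "W IH1 TH HH ER0 F R AH1 M DH AH0 S T AO1 R}|"
    "ask her to bring these things with her from the store."
)
entry = parse_train_line(sample)
print(entry["name"], entry["speaker"], entry["phones"][:3])
```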

## Save out processed dataset
Saves to your Drive by default

In [26]:
%cd /content/

zip_name = f'{dataset_name}.zip'

!echo Saving dataset as $zip_name at $output_zip_path
!zip -r $zip_name $output_path
!cp $zip_name $output_zip_path
!echo Done, copied to $output_zip_path

/content
Saving dataset as MyDataset.zip at /gdrive/MyDrive/output_dataset
updating: content/output_dataset/ (stored 0%)
  adding: content/output_dataset/raw_data/ (stored 0%)
  adding: content/output_dataset/raw_data/universal/ (stored 0%)
  adding: content/output_dataset/raw_data/universal/p303_008.wav (deflated 16%)
  adding: content/output_dataset/raw_data/universal/p303_004.lab (deflated 9%)
  adding: content/output_dataset/raw_data/universal/p303_003.wav (deflated 14%)
  adding: content/output_dataset/raw_data/universal/p303_005.wav (deflated 14%)
  adding: content/output_dataset/raw_data/universal/p303_005.lab (deflated 22%)
  adding: content/output_dataset/raw_data/universal/p303_001.wav (deflated 19%)
  adding: content/output_dataset/raw_data/universal/p303_003.lab (deflated 22%)
  adding: content/output_dataset/raw_data/universal/p303_004.wav (deflated 17%)
  adding: content/output_dataset/raw_data/universal/metadata.csv (deflated 68%)
  adding: content/output_dataset/raw_dat
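
If you prefer to stay in Python instead of calling `zip`, the standard library can build the same kind of archive. This is a sketch under stated assumptions: `zip_folder` is a hypothetical helper, and the demo runs on a temporary folder rather than the real `output_path`. Note that it stores paths relative to the parent of the zipped folder (`output_dataset/...`), whereas the `zip -r` call above stores `content/output_dataset/...`.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder, zip_base):
    """Create <zip_base>.zip containing `folder` and return the archive path."""
    folder = os.path.abspath(folder)
    return shutil.make_archive(
        zip_base,
        "zip",
        root_dir=os.path.dirname(folder),
        base_dir=os.path.basename(folder),
    )


# Demo on a throwaway folder that mimics a tiny part of the output layout
with tempfile.TemporaryDirectory() as tmp:
    speaker_dir = os.path.join(tmp, "output_dataset", "raw_data", "universal")
    os.makedirs(speaker_dir)
    with open(os.path.join(speaker_dir, "metadata.csv"), "w", encoding="utf-8") as f:
        f.write("speaker_001|hello|hello\n")
    archive_path = zip_folder(
        os.path.join(tmp, "output_dataset"), os.path.join(tmp, "MyDataset")
    )
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(os.path.basename(archive_path))  # MyDataset.zip
```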



#  
---

