Preprocessed Data:

- graphemes (words) -> Phonemes

- waveform -> Duration
- waveform -> Mel Spectrogram 
- waveform -> Energy
- waveform -> Pitch -> Pitch Spectrogram

In [1]:
from pathlib import Path
import os

In [2]:
path = Path("../data/LJSpeech-1.1")
os.listdir(path)

['.DS_Store', 'wavs', 'README', 'metadata.csv']

In [3]:
os.listdir(path/"wavs")[:4]

['LJ026-0155.lab', 'LJ007-0005.wav', 'LJ038-0170.wav', 'LJ019-0020.lab']

The LJSpeech dataset folder must be restructured to match the corpus layout that the Montreal Forced Aligner (MFA) expects. That layout looks like this:

+-- prosodylab_corpus_directory
|   +-- speaker1
|       --- recording1.wav
|       --- recording1.lab
|       --- recording2.wav
|       --- recording2.lab
|   +-- speaker2
|       --- recording3.wav
|       --- recording3.lab
|   --- ...

Each .lab file is a plain-text file that contains the transcript of the recording with the same base name.
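
The layout above can be sketched as a small helper that copies recordings into a speaker folder and writes one .lab file per recording. This is only an illustration under assumptions: the function name `build_mfa_corpus`, the `transcripts` mapping, and the default speaker name are hypothetical and do not come from this notebook, which instead writes the .lab files directly next to the existing .wav files.

```python
from pathlib import Path
import shutil


def build_mfa_corpus(wav_dir, transcripts, corpus_dir, speaker="LJSpeech"):
    """Copy each .wav into corpus_dir/speaker and write a matching .lab file.

    `transcripts` maps a recording's base name (e.g. "LJ001-0001") to its text.
    Recordings without a transcript are skipped.
    Returns the number of wav/lab pairs written.
    """
    # Hypothetical layout: a single speaker folder inside the corpus directory.
    speaker_dir = Path(corpus_dir) / speaker
    speaker_dir.mkdir(parents=True, exist_ok=True)

    written = 0
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        text = transcripts.get(wav_path.stem)
        if not text:
            continue  # no transcript: leave this recording out of the corpus
        shutil.copy2(wav_path, speaker_dir / wav_path.name)
        lab_path = speaker_dir / f"{wav_path.stem}.lab"
        lab_path.write_text(text, encoding="utf-8")
        written += 1
    return written
```

Copying keeps the original dataset untouched, at the cost of extra disk space; writing the .lab files in place, as the cells below do, avoids the copy.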

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv(path/"metadata.csv", delimiter="|", 
                 names=["file", "transcript", "normalized_transcript"])
df.head()

   file        transcript                                         normalized_transcript
0  LJ001-0001  Printing, in the only sense with which we are ...  Printing, in the only sense with which we are ...
1  LJ001-0002  in being comparatively modern.                     in being comparatively modern.
2  LJ001-0003  For although the Chinese took impressions from...  For although the Chinese took impressions from...
3  LJ001-0004  produced the block books, which were the immed...  produced the block books, which were the immed...
4  LJ001-0005  the invention of movable metal letters in the ...  the invention of movable metal letters in the ...

In [6]:
len(df), len(os.listdir(path/"wavs"))

(13100, 26170)

In [7]:
out_path = path/"wavs"

In [8]:
df = df.dropna()
len(df)

13084
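
One plausible reason for the 16 rows lost to `dropna()` is quote handling: when a transcript starts with a double quote that is not closed in the same field, a CSV parser with default quoting can absorb the following `|` into that field and leave the last column empty. This has not been verified against the data used here. The sketch below uses only the standard library and a hypothetical helper, `read_metadata`, to show parsing with quote processing switched off; `pd.read_csv` accepts the same `quoting=csv.QUOTE_NONE` option.

```python
import csv


def read_metadata(metadata_path):
    """Parse a '|'-delimited metadata file, treating quote characters as plain text.

    Returns a list of (file, transcript, normalized_transcript) tuples.
    """
    rows = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: a '"' is an ordinary character, so it cannot hide a delimiter.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) != 3:
                raise ValueError(f"expected 3 fields, got {len(fields)}: {fields!r}")
            rows.append(tuple(fields))
    return rows
```

If this does recover the missing transcripts, the matching recordings would no longer need to be removed in the cleanup cells further down.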

In [9]:
# Write each normalized transcript to a .lab file next to its .wav recording.
for _, row in df.iterrows():
    file_name, text = row.file, row.normalized_transcript

    file_path = out_path/(file_name + ".lab")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)

In [10]:
lab_files = [f for f in os.listdir(out_path) if f.endswith(".lab")]

In [11]:
wav_files = [f for f in os.listdir(out_path) if f.endswith(".wav")]

In [12]:
# Collect recordings that have no matching transcript file.
lab_names = set(lab_files)
na_files = []
for file in wav_files:
    lab_name = Path(file).stem + ".lab"
    if lab_name not in lab_names:
        na_files.append(file)

In [13]:
# Delete recordings without a transcript so every remaining .wav has a .lab.
for basename in na_files:
    os.remove(out_path/basename)
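
After the cleanup, a quick consistency check can confirm that every recording has a transcript and the other way round. The helper below is a minimal sketch; the name `unpaired_files` is hypothetical and it assumes all files sit directly in one folder, as they do in `out_path`.

```python
from pathlib import Path


def unpaired_files(corpus_dir):
    """Return (wavs_without_lab, labs_without_wav) as sorted lists of base names."""
    corpus_dir = Path(corpus_dir)
    wav_stems = {p.stem for p in corpus_dir.glob("*.wav")}
    lab_stems = {p.stem for p in corpus_dir.glob("*.lab")}
    # Set differences give the base names that appear on only one side.
    return sorted(wav_stems - lab_stems), sorted(lab_stems - wav_stems)
```

Calling `unpaired_files(out_path)` should return two empty lists once the folder is consistent.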