Let's begin by downloading the LibriSpeech dataset.

In [1]:
!mkdir -p data

In [2]:
fnames = [
    'dev-clean.tar.gz',
    'dev-other.tar.gz',
    'test-clean.tar.gz',
    'test-other.tar.gz',
    'train-clean-100.tar.gz',
    'train-clean-360.tar.gz',
    'train-other-500.tar.gz',    
]
    
for fn in fnames:
    !wget http://www.openslr.org/resources/12/{fn} -P data

--2020-09-23 13:30:47--  http://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘data/dev-clean.tar.gz’


2020-09-23 13:31:32 (7.13 MB/s) - ‘data/dev-clean.tar.gz’ saved [337926286/337926286]

--2020-09-23 13:31:32--  http://www.openslr.org/resources/12/dev-other.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 314305928 (300M) [application/x-gzip]
Saving to: ‘data/dev-other.tar.gz’


2020-09-23 13:32:24 (5.78 MB/s) - ‘data/dev-other.tar.gz’ saved [314305928/314305928]

--2020-09-23 13:32:25--  http://www.openslr.org/resources/12/test-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46

In [None]:
%%capture

for fn in fnames:
    !tar -xvf data/{fn} -C data

Audio recordings were aligned with text using the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/). We will train our embeddings in an unsupervised way, meaning the text labels corresponding to the audio will not be used. However, we still want to be able to chunk the audio into meaningful parts (we want a complete utterance to constitute an example).

Let's obtain the labels from this very helpful [repository](https://github.com/CorentinJ/librispeech-alignments). Unfortunately, since they are stored on Google Drive, you will need to download them manually to your workstation and upload them to the `data` directory in the root of this repository.

In [None]:
!unzip data/LibriSpeech-Alignments.zip -d data

In [1]:
import numpy as np
import librosa
import os
from IPython.lib.display import Audio

To read the data, we can adapt the parser from the same repository that provided the labels.

In [2]:
librispeech_root = "data/LibriSpeech"

def split_audio(audio_fpath, words, end_times):
    # Load the audio waveform
    sample_rate = 16000     # Sampling rate of LibriSpeech 
    wav, _ = librosa.load(audio_fpath, sr=sample_rate)
    
    start_times = np.array([0.0] + end_times[:-1])
    end_times = np.array(end_times)
    assert len(words) == len(end_times) == len(start_times)
    assert words[0] == '' and words[-1] == ''
        
    segments = []
    for word, st, et in zip(words, start_times, end_times):
        if word == '': continue
        utterance = wav[int(st*sample_rate):int(et*sample_rate)]
        segments.append([word, utterance])
    return segments
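
To make the alignment format concrete, here is a minimal sketch that parses one alignment line and derives the sample ranges that `split_audio` would slice out. The line below is made up for illustration (the utterance id, words, and times are hypothetical), and the helper names `parse_alignment_line` and `word_spans` are not part of the original parser. Each word starts where the previous entry ends, and empty strings mark silences.

```python
def parse_alignment_line(line):
    """Split one alignment line into (utterance_id, words, end_times)."""
    utterance_id, words, end_times = line.strip().split(' ')
    words = words.replace('"', '').split(',')
    end_times = [float(t) for t in end_times.replace('"', '').split(',')]
    return utterance_id, words, end_times


def word_spans(words, end_times, sample_rate=16000):
    """Return (word, start_sample, end_sample) for every non-silent entry."""
    # An entry begins at the end time of the entry before it.
    start_times = [0.0] + end_times[:-1]
    spans = []
    for word, st, et in zip(words, start_times, end_times):
        if word == '':  # empty string marks a silence
            continue
        spans.append((word, int(st * sample_rate), int(et * sample_rate)))
    return spans


# Hypothetical line, written in the "id words end_times" layout the parser expects
line = '0000-000000-0000 ",GO,,DO,YOU,HEAR," "0.5,0.875,1.25,1.375,1.5,1.875,2.0"'

utterance_id, words, end_times = parse_alignment_line(line)
spans = word_spans(words, end_times)
for word, start, end in spans:
    print(word, start, end)
# GO 8000 14000
# DO 20000 22000
# YOU 22000 24000
# HEAR 24000 30000
```

Working with sample indices rather than audio keeps this check independent of `librosa`, which is handy when debugging the parser on its own.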

Let's generate training examples and store them on disk. This way our training code will be more streamlined than if we tried to do something fancy with indexing into a wave file. Simpler code means a smaller chance of introducing a bug!

This also lets us move some computation out of the training loop, which is always nice.

To extract MFCC features, we will use functionality from the [python_speech_features](https://github.com/jameslyons/python_speech_features) repository.

In [3]:
from python_speech_features.base import mfcc
import pandas as pd
import uuid

In [19]:
!mkdir -p data/examples

In [20]:
def store_audio(audio):
    fn = f'data/examples/{uuid.uuid4().hex}.pkl'
    pd.to_pickle(mfcc(audio), fn)
    return fn
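
Since `store_audio` calls `mfcc` with its default arguments, it can help to know what shape each pickled array should have. The sketch below estimates it, assuming the library defaults as I understand them (16 kHz sample rate, 25 ms window, 10 ms step, 13 cepstral coefficients) and assuming the last partial frame is zero-padded. The helper `expected_mfcc_shape` is hypothetical and not part of `python_speech_features`, so verify the numbers against your installed version.

```python
import math


def expected_mfcc_shape(num_samples, sample_rate=16000,
                        winlen=0.025, winstep=0.01, numcep=13):
    """Estimate (num_frames, numcep) for a signal of num_samples samples.

    Assumption: signals shorter than one window yield a single frame, and
    longer signals are padded so the final partial frame is kept.
    """
    frame_len = int(round(winlen * sample_rate))    # 400 samples at 16 kHz
    frame_step = int(round(winstep * sample_rate))  # 160 samples at 16 kHz
    if num_samples <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + math.ceil((num_samples - frame_len) / frame_step)
    return num_frames, numcep


print(expected_mfcc_shape(6000))   # a 0.375 s word -> (36, 13)
print(expected_mfcc_shape(16000))  # a 1 s word     -> (99, 13)
print(expected_mfcc_shape(300))    # shorter than one window -> (1, 13)
```

The number of frames varies with word duration, so the stored examples are variable-length sequences of 13-dimensional vectors.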

In [23]:
%%time

source_words, target_words, source_fns, target_fns, set_names, speaker_ids, book_ids = [], [], [], [], [], [], []

window_size = 3
offsets = list(range(-window_size+1, window_size))
offsets.remove(0)

for set_name in os.listdir(librispeech_root):
    if set_name not in ['train-clean-360', 'train-clean-100', 'test-clean', 'dev-clean']: continue
    set_dir = os.path.join(librispeech_root, set_name)
    if not os.path.isdir(set_dir):
        continue
    for speaker_id in os.listdir(set_dir):
        speaker_dir = os.path.join(set_dir, speaker_id)
        for book_id in os.listdir(speaker_dir):
            book_dir = os.path.join(speaker_dir, book_id)
            alignment_fpath = os.path.join(book_dir, "%s-%s.alignment.txt" % 
                                           (speaker_id, book_id))
            
            if not os.path.exists(alignment_fpath):
                raise Exception("Alignment file not found. Did you download and merge the txt "
                                "alignments with your LibriSpeech dataset?")

            with open(alignment_fpath, "r") as alignment_file:
                for line in alignment_file:

                    # Retrieve the utterance id, the words as a list and the end_times as a list
                    utterance_id, words, end_times = line.strip().split(' ')
                    words = words.replace('\"', '').split(',')
                    end_times = [float(e) for e in end_times.replace('\"', '').split(',')]
                    audio_fpath = os.path.join(book_dir, utterance_id + '.flac')
                    
                    segments = split_audio(audio_fpath, words, end_times)
                    segments_processed = [[word, store_audio(audio)] for word, audio in segments]
        
                    for i, (word, path) in enumerate(segments_processed):
                        for offset in offsets:
                            if i + offset < 0 or i + offset > len(segments_processed)-1: continue

                            target_word, target_path = segments_processed[i + offset]
                            source_words.append(word)
                            target_words.append(target_word)
                            source_fns.append(path)
                            target_fns.append(target_path)
                            set_names.append(set_name)
                            speaker_ids.append(speaker_id)
                            book_ids.append(book_id)

CPU times: user 11h 26min 34s, sys: 15min 56s, total: 11h 42min 30s
Wall time: 3h 13min 30s
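
The pairing logic in the loop above is easier to see on a toy input. Here is a minimal sketch of the same source/target scheme applied to a plain list of words. The helper name `context_pairs` is introduced only for illustration. Note that `window_size = 3` produces the offsets `[-2, -1, 1, 2]`, so each word is paired with up to two neighbors on each side.

```python
def context_pairs(words, window_size=3):
    """Pair every word with its neighbors within window_size - 1 positions."""
    offsets = [o for o in range(-window_size + 1, window_size) if o != 0]
    pairs = []
    for i, word in enumerate(words):
        for offset in offsets:
            j = i + offset
            if j < 0 or j > len(words) - 1:  # neighbor falls outside the utterance
                continue
            pairs.append((word, words[j]))
    return pairs


pairs = context_pairs(['I', 'FELT', 'THAT', 'IT'])
for source, target in pairs[:5]:
    print(source, target)
# I FELT
# I THAT
# FELT I
# FELT THAT
# FELT IT
```

Because every word appears as a source once per valid neighbor, the number of rows grows to roughly four times the number of words, which is why the resulting table is so large.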


In [26]:
df = pd.DataFrame(data={
    'source_word': source_words,
    'target_word': target_words,
    'source_fn': source_fns,
    'target_fn': target_fns,
    'set_name': set_names,
    'speaker_id': speaker_ids,
    'book_id': book_ids
})

In [27]:
df.shape

(17937758, 7)

In [28]:
df.head()

,source_word,target_word,source_fn,target_fn,set_name,speaker_id,book_id
0,I,FELT,data/examples/53dfbe1cffca4994b848b6b117acb365...,data/examples/f95d4d56b23f4c06920f666bb4fa65cf...,train-clean-360,7000,83696
1,I,THAT,data/examples/53dfbe1cffca4994b848b6b117acb365...,data/examples/caaa01a2dc014e4b9c4a0e832a471d97...,train-clean-360,7000,83696
2,FELT,I,data/examples/f95d4d56b23f4c06920f666bb4fa65cf...,data/examples/53dfbe1cffca4994b848b6b117acb365...,train-clean-360,7000,83696
3,FELT,THAT,data/examples/f95d4d56b23f4c06920f666bb4fa65cf...,data/examples/caaa01a2dc014e4b9c4a0e832a471d97...,train-clean-360,7000,83696
4,FELT,IT,data/examples/f95d4d56b23f4c06920f666bb4fa65cf...,data/examples/b3332ce1a360418eb9f836c38f9ff2aa...,train-clean-360,7000,83696


In [30]:
df.set_name.unique()

array(['train-clean-360', 'train-clean-100', 'test-clean', 'dev-clean'],
      dtype=object)

In [29]:
df.to_csv('data/examples.csv', header=False)
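
Because the file is written with `header=False`, it contains the index as its first column and no column names, so whoever reads it back has to supply them. Below is a minimal sketch using only the standard library `csv` module. The two rows and their file names are made up for illustration, and the name `index` for the first column is my own choice.

```python
import csv
import io

# Column order: the DataFrame index first, then the columns in the order they were defined
columns = ['index', 'source_word', 'target_word', 'source_fn', 'target_fn',
           'set_name', 'speaker_id', 'book_id']

# Hypothetical rows in the layout produced by df.to_csv(..., header=False)
sample = io.StringIO(
    "0,I,FELT,data/examples/aaa.pkl,data/examples/bbb.pkl,train-clean-360,7000,83696\n"
    "1,I,THAT,data/examples/aaa.pkl,data/examples/ccc.pkl,train-clean-360,7000,83696\n"
)

rows = list(csv.DictReader(sample, fieldnames=columns))
for row in rows:
    print(row['source_word'], row['target_word'], row['set_name'])
# I FELT train-clean-360
# I THAT train-clean-360
```

With pandas, passing `header=None` together with `names=columns` to `pd.read_csv` should achieve the same thing for the real `data/examples.csv`.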