# Seminar: Diphone Synthesis
In this seminar we will construct the simplest possible synthesis system: a diphone model.
<img src="https://github.com/alex-kozinov/courses/blob/master/speech-shad/seminars/09-week/concat-scheme.png?raw=1">
We will use part of the LJSpeech dataset.
Your task will be to design search and concatenation of the units.
The preprocessing stages have already been performed for the test samples (and your home assignment will be to create a small g2p for the CMU English phoneset).

In [1]:
import os 
!git clone https://github.com/yandexdataschool/speech_course.git
os.chdir("speech_course/week_09")
!ls

Cloning into 'speech_course'...
remote: Enumerating objects: 295, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 295 (delta 20), reused 67 (delta 18), pack-reused 222
Receiving objects: 100% (295/295), 144.43 MiB | 26.83 MiB/s, done.
Resolving deltas: 100% (107/107), done.
Checking out files: 100% (96/96), done.
concat-scheme.png   seminar4_student.ipynb  wavs_need.txt
fallback_rules.txt  test_phones.txt


## Alignment
The first and very important part of data preparation is alignment: we need to determine the timings of the phonemes our utterance consists of.
Even though concatenative synthesis is rarely used in production today, alignment is still an important phase for upsampling-based parametric acoustic models (e.g. FastSpeech).

### Montreal Forced Aligner
To process audio we will use MFA.

At the alignment stage we run a xent-trained TDNN ASR system with the text fixed on the output and determine the most probable positions of the phonemes on the timeline.
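To make the idea of forced alignment concrete, here is a minimal sketch (not MFA's actual implementation): given per-frame log-probabilities for each phoneme and a fixed phoneme sequence, dynamic programming finds the most probable monotonic segmentation. The function name `force_align` and the toy scores are assumptions made up for illustration.

```python
import numpy as np

def force_align(log_probs, phone_ids):
    """Toy Viterbi forced alignment.

    log_probs: [n_frames, n_phones] frame-level log-probabilities.
    phone_ids: the fixed phoneme sequence (indices into the second axis).
    Returns, for every frame, the position in phone_ids assigned to it.
    """
    n_frames, n_states = log_probs.shape[0], len(phone_ids)
    assert n_frames >= n_states, "need at least one frame per phoneme"
    score = np.full((n_frames, n_states), -np.inf)
    came_from_prev = np.zeros((n_frames, n_states), dtype=bool)
    score[0, 0] = log_probs[0, phone_ids[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1, s]                             # remain in the same phoneme
            move = score[t - 1, s - 1] if s > 0 else -np.inf   # advance to the next phoneme
            came_from_prev[t, s] = move > stay
            score[t, s] = max(stay, move) + log_probs[t, phone_ids[s]]
    # backtrack from the last phoneme at the last frame
    path, s = [], n_states - 1
    for t in range(n_frames - 1, -1, -1):
        path.append(s)
        if came_from_prev[t, s]:
            s -= 1
    return path[::-1]

# hypothetical scores for 3 phonemes over 6 frames (rows: frames, columns: phonemes)
toy = np.log(np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]))
frame_to_phone = force_align(toy, [0, 1, 2])
print(frame_to_phone)  # [0, 0, 1, 1, 2, 2]
```

Phoneme boundaries are the frames where the assigned position changes; multiplying the frame index by the frame shift gives the timings.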

In [2]:
%%writefile install_mfa.sh
#!/bin/bash

## a script to install Montreal Forced Aligner (MFA)

root_dir=${1:-/tmp/mfa}
mkdir -p $root_dir
cd $root_dir

# download miniconda3
wget -q --show-progress https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $root_dir/miniconda3 -f

# create py38 env
$root_dir/miniconda3/bin/conda create -n aligner -c conda-forge openblas python=3.8 openfst pynini ngram baumwelch -y
source $root_dir/miniconda3/bin/activate aligner

# install mfa, download kaldi
pip install montreal-forced-aligner # install requirements
pip install git+https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner.git # install latest updates

mfa thirdparty download

echo -e "\n======== DONE =========="
echo -e "\nTo activate MFA, run: source $root_dir/miniconda3/bin/activate aligner"
echo -e "\nTo delete MFA, run: rm -rf $root_dir"
echo -e "\nSee: https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html to know how to use MFA"

Writing install_mfa.sh


In [3]:
# download and install mfa
INSTALL_DIR="/tmp/mfa" # path to install directory

!bash ./install_mfa.sh {INSTALL_DIR}

PREFIX=/tmp/mfa/miniconda3
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | done
Solving environment: - \ done

## Package Plan ##

  environment location: /tmp/mfa/miniconda3

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - brotlipy==0.7.0=py38h27cfd23_1003
    - ca-certificates==2020.10.14=0
    - certifi==2020.6.20=pyhd3eb1b0_3
    - cffi==1.14.3=py38h261ae71_2
    - chardet==3.0.4=py38h06a4308_1003
    - conda-package-handling==1.7.2=py38h03888b9_0
    - conda==4.9.2=py38h06a4308_0
    - cryptography==3.2.1=py38h3c74f83_1
    - idna==2.10=py_0
    - ld_impl_linux-64==2.33.1=h53a641e_7
    - libedit==3.1.20191231=h14c3975_1
    - libffi==3.3=he6710b0_2
    - libgcc-ng==9.1.0=hdf63c60_0
    - libstdcxx-ng==9.1.0=hdf63c60_0
    - ncurses==6.2=he6710b0_1
    - openssl==1.1.1h=h7b6447c_0
    - pip==20.2.4=py38h06a4308_0
    - pycosat==0.6.3=py38h7b6447c_1
    - pycparser==2.20=py_2
    - pyopenssl==19.1.0=pyhd3eb1b0_1
    - pysocks=

In [4]:
!source {INSTALL_DIR}/miniconda3/bin/activate aligner; mfa align --help

usage: mfa align
       [-h]
       [--config_path CONFIG_PATH]
       [-s SPEAKER_CHARACTERS]
       [-t TEMP_DIRECTORY]
       [-j NUM_JOBS]
       [-v]
       [-c]
       [-d]
       corpus_directory
       dictionary_path
       acoustic_model_path
       output_directory

positional arguments:
  corpus_directory
    Full path
    to the
    directory
    to align
  dictionary_path
    Full path
    to the pron
    unciation
    dictionary
    to use
  acoustic_model_path
    Full path
    to the
    archive
    containing
    pre-trained
    model or
    language ()
  output_directory
    Full path
    to output
    directory,
    will be
    created if
    it doesn't
    exist

optional arguments:
  -h, --help
    show this
    help
    message and
    exit
  --config_path CONFIG_PATH
    Path to
    config file
    to use for
    alignment
  -s SPEAKER_CHARACTERS, --speaker_characters SPEAKER_CHARACTERS
    Number of
    characters
    of file
    names to
    use for
    determ

### LJSpeech data subset
Here we will download the dataset.
However, we don't need the whole of LJSpeech for diphone synthesis (and processing all of it would take quite a while).
Here we will take about 1/10 of the dataset. That's more than enough for diphone TTS.

In [5]:
!echo "download and unpack ljs dataset"
!mkdir -p ./ljs; cd ./ljs; wget -q --show-progress https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
!cd ./ljs; tar xjf LJSpeech-1.1.tar.bz2

download and unpack ljs dataset


In [6]:
# We need sox to convert audio to 16kHz (the format alignment works with)
!sudo apt install -q -y sox
!sudo apt install -q -y libopenblas-dev

Reading package lists...
Building dependency tree...
Reading state information...
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 34 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic

In [7]:
!mkdir ./wav
!cat wavs_need.txt | xargs -I F -P 30 sox --norm=-3 ./ljs/LJSpeech-1.1/wavs/F.wav -r 16k -c 1 ./wav/F.wav
!echo "Number of clips" $(ls ./wav/ | wc -l)

Number of clips 1273


There should be 1273 clips here.

In [8]:
with open('wavs_need.txt') as ifile:
    wavs_need = {l.strip() for l in ifile}

In [9]:
# metadata to transcripts
with open('./ljs/LJSpeech-1.1/metadata.csv', 'r') as ifile:
    for line in ifile:
        # each row is "file id|raw transcript|normalized transcript"
        fn, _, transcript = line.strip().split('|')
        if fn in wavs_need:
            with open('./wav/{}.txt'.format(fn), 'w') as ofile:
                ofile.write(transcript)

!echo "Number of transcripts" $(ls ./wav/*.txt | wc -l)

Number of transcripts 1273


Let's download the artifacts for alignment.

For phoneme ASR we need an acoustic model and a lexicon (a word => phonemes mapping) produced by some other g2p.
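A minimal sketch of how such a lexicon can be read into memory, assuming each line holds a word followed by its whitespace-separated phonemes, with one line per pronunciation variant. The helper name `parse_lexicon` and the sample lines are illustrative assumptions, not quoted from the downloaded file.

```python
from collections import defaultdict

def parse_lexicon(lines):
    """Map every word to the list of its pronunciations (each a list of phonemes)."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, phones = parts[0], parts[1:]
        lexicon[word].append(phones)
    return dict(lexicon)

# hypothetical sample lines in the "WORD PH1 PH2 ..." layout
sample = [
    "HELLO HH AH0 L OW1",
    "HELLO HH EH0 L OW1",
    "WORLD W ER1 L D",
]
lexicon = parse_lexicon(sample)
print(lexicon["WORLD"])  # [['W', 'ER1', 'L', 'D']]
```

A word may map to several pronunciations, which is why the values are lists of phoneme lists.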

In [10]:
!wget -q --show-progress https://github.com/MontrealCorpusTools/mfa-models/raw/master/acoustic/english.zip
!wget -q --show-progress http://www.openslr.org/resources/11/librispeech-lexicon.txt



Finally, we come to the alignment.

It will take about 15-17 min for our subset to be aligned

In [11]:
!source {INSTALL_DIR}/miniconda3/bin/activate aligner; \
mfa align -t ./temp -c -j 4 ./wav librispeech-lexicon.txt ./english.zip ./ljs_aligned
!echo "See output files at ./ljs_aligned"

All required kaldi binaries were found!
./temp/wav/align.log
INFO - Setting up corpus information...
INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 1273.0
INFO - Parsing dictionary without pronunciation probabilities without silence probabilities
INFO - Creating dictionary information...
INFO - Setting up training data...
INFO - Generating base features (mfcc)...
INFO - Calculating CMVN...
INFO - Done with setup!
INFO - Performing first-pass alignment...
INFO - Calculating fMLLR for speaker adaptation...
INFO - Performing second-pass alignment...
INFO - All done!
See output files at ./ljs_aligned


In [12]:
!ls ljs_aligned/|wc -l 

1273


In [13]:
from IPython.display import Audio, display

def display_audio(data):
    # LJSpeech wavs are sampled at 22050 Hz
    display(Audio(data, rate=22050))

In [16]:
# to install textgrids
!pip install praat-textgrids

Collecting praat-textgrids
  Downloading https://files.pythonhosted.org/packages/58/a1/474304ce1c8d391a6c6bf87ed9f9566eab3d594dd1dbb739a79db7720c3e/praat-textgrids-1.3.1.tar.gz
Building wheels for collected packages: praat-textgrids
  Building wheel for praat-textgrids (setup.py) ... done
  Created wheel for praat-textgrids: filename=praat_textgrids-1.3.1-cp37-none-any.whl size=12398 sha256=e11467cdb51bc7fcb25bf31371ada82f7e17ba6363803086a2b6f781d120a413
  Stored in directory: /root/.cache/pip/wheels/96/d1/17/9af523668ff127df07805e3790c2027d8ace0a22c633c55699
Successfully built praat-textgrids
Installing collected packages: praat-textgrids
Successfully installed praat-textgrids-1.3.1


In [21]:
import numpy as np
from scipy.io import wavfile
import textgrids
import glob

Alignment outputs are TextGrid files: an XML-like structure with layers (tiers) for phonemes and words, each with timings.

In [36]:
alignment = {f.split("/")[-1].split(".")[0][4:]: textgrids.TextGrid(f) for f in glob.iglob('ljs_aligned/*')}

In [30]:
wavs = {f.split("/")[-1].split(".")[0]: wavfile.read(f)[1] for f in glob.iglob('./ljs/LJSpeech-1.1/wavs/*.wav')}

In [38]:
allphones = {
    ph.text for grid in alignment.values() for ph in grid["phones"]
}
# let's exclude special symbols: silence, spoken noise, non-spoken noise
allphones = {ph for ph in allphones if ph == ph.upper()}
assert len(allphones) == 69

Here your part begins:
You need to create a `diphone index`: a mapping structure that lets you find the original utterance, and the position within it, by diphone text id.

E.g.:
`index[(PH1, PH2)] -> (utt_id, phoneme_index)`
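As an illustration of the mapping, here is a small sketch on plain phoneme lists instead of TextGrid objects. The names `build_diphone_index`, `toy_phones` and `toy_index` are made up for this example, and keeping the first occurrence of every diphone is just one possible choice.

```python
def build_diphone_index(phones_by_utt):
    """Map (ph1, ph2) to the first (utt_id, position of ph1) where the pair occurs."""
    index = {}
    for utt_id, phones in phones_by_utt.items():
        for i, pair in enumerate(zip(phones[:-1], phones[1:])):
            # setdefault keeps the earliest occurrence and ignores later ones
            index.setdefault(pair, (utt_id, i))
    return index

# hypothetical utterances given as phoneme lists
toy_phones = {
    "utt_a": ["HH", "AH0", "L", "OW1"],
    "utt_b": ["L", "OW1", "W"],
}
toy_index = build_diphone_index(toy_phones)
print(toy_index[("L", "OW1")])  # ('utt_a', 2)
print(toy_index[("OW1", "W")])  # ('utt_b', 1)
```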

In [53]:
alignment["LJ017-0213"]["phones"]

[<Interval text="DH" xmin=0.0 xmax=0.03>,
 <Interval text="AH0" xmin=0.03 xmax=0.07>,
 <Interval text="F" xmin=0.07 xmax=0.2>,
 <Interval text="ER1" xmin=0.2 xmax=0.28>,
 <Interval text="S" xmin=0.28 xmax=0.38>,
 <Interval text="T" xmin=0.38 xmax=0.44>,
 <Interval text="AE1" xmin=0.44 xmax=0.47>,
 <Interval text="N" xmin=0.47 xmax=0.52>,
 <Interval text="D" xmin=0.52 xmax=0.6>,
 <Interval text="S" xmin=0.6 xmax=0.69>,
 <Interval text="EH1" xmin=0.69 xmax=0.75>,
 <Interval text="K" xmin=0.75 xmax=0.83>,
 <Interval text="AH0" xmin=0.83 xmax=0.89>,
 <Interval text="N" xmin=0.89 xmax=0.97>,
 <Interval text="D" xmin=0.97 xmax=1.0>,
 <Interval text="M" xmin=1.0 xmax=1.05>,
 <Interval text="EY1" xmin=1.05 xmax=1.22>,
 <Interval text="T" xmin=1.22 xmax=1.33>,
 <Interval text="S" xmin=1.33 xmax=1.53>,
 <Interval text="sp" xmin=1.53 xmax=1.69>,
 <Interval text="K" xmin=1.69 xmax=1.82>,
 <Interval text="AA1" xmin=1.82 xmax=1.9>,
 <Interval text="R" xmin=1.9 xmax=2.01>,
 <Interval text="S" xmin=2.

In [47]:
wavs["LJ017-0213"]

array([-102, -332, -136, ...,  -18,   -6,   -9], dtype=int16)

In [58]:
a = [1, 2, 3]
a[:-1]

[1, 2]

In [59]:
diphone_index = dict()
for utt_id, some_alignment in alignment.items():
    phones = some_alignment["phones"]
    for i, (a, b) in enumerate(zip(phones[:-1], phones[1:])):
        if (a.text, b.text) in diphone_index.keys():
            continue
        diphone_index[(a.text, b.text)] = (utt_id, i)

In [60]:
# check yourself
for a, b in [('AH0', 'P'), ('P', 'AH0'), ('AH0', 'L')]:
    k, i = diphone_index[(a,b)]
    assert a == alignment[k]['phones'][i].text
    assert b == alignment[k]['phones'][i+1].text

In concatenative TTS the dataset sometimes does not contain all the diphones.
If the missing diphones are infrequent, this is not a big problem,
but we still need some mechanism to replace the missing units.
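To see how large the gap is, one can enumerate every ordered phoneme pair and check it against the inventory. The sketch below uses a made-up three-phoneme set; the helper name `missing_diphones` is an assumption for illustration.

```python
from itertools import product

def missing_diphones(phones, known_diphones):
    """Return every ordered phoneme pair that has no unit in the inventory."""
    all_pairs = product(sorted(phones), repeat=2)
    return sorted(pair for pair in all_pairs if pair not in known_diphones)

# hypothetical phoneme set and diphone inventory
toy_phoneset = {"AH0", "L", "P"}
toy_known = {("AH0", "P"), ("P", "AH0"), ("AH0", "L")}
gaps = missing_diphones(toy_phoneset, toy_known)
print(len(gaps), "of", len(toy_phoneset) ** 2, "diphones are missing")  # 6 of 9 diphones are missing
```

In the notebook the same check could be run as `missing_diphones(allphones, diphone_index)`.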

In [61]:
with open("fallback_rules.txt") as ifile:
    lines = [l.strip().split() for l in ifile]
    fallback_rules = {l[0]: l[1:] for l in lines}

The dict `fallback_rules` contains possible replacements for every phone
(several replacements, in order of similarity).

E.g. `a stressed` -> `a unstressed`  | `o stressed` | `o unstressed`

Here is also some work for you:
You need to create diphone fallbacks from the phoneme ones:

`diphone_fallbacks[(Ph1, Ph2)] -> (some_other_pair_of_phones_present_in_the_dataset)`

and also, if `diphone_fallbacks[(a, b)] = c, d` then (c, d) differs from (a, b) and:
* c = a or c $\in$ fallback_rules[a], and
* d = b or d $\in$ fallback_rules[b]
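One possible way to satisfy this constraint is sketched below on toy data. The candidate ordering (pairs that keep one of the original phonemes are tried before pairs that replace both) is an assumption of this sketch rather than a requirement of the task, and the rules and inventory are made up.

```python
def find_fallback(a, b, rules, known):
    """Pick a replacement pair for the missing diphone (a, b), or None if nothing fits."""
    left = [a] + list(rules.get(a, []))
    right = [b] + list(rules.get(b, []))
    candidates = [(c, d) for c in left for d in right if (c, d) != (a, b)]
    # stable sort: pairs that change only one phoneme come first,
    # and within each group the similarity order of the rules is preserved
    candidates.sort(key=lambda cd: (cd[0] != a) + (cd[1] != b))
    for pair in candidates:
        if pair in known:
            return pair
    return None

# hypothetical phoneme rules and diphone inventory
toy_rules = {"AA1": ["AA0", "AO1"], "Z": ["S"]}
toy_known = {("S", "AA0"), ("Z", "AO1")}
print(find_fallback("Z", "AA1", toy_rules, toy_known))  # ('Z', 'AO1')
```

Here `('Z', 'AO1')` wins over `('S', 'AA0')` because it keeps the original left phoneme.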


In [67]:
diphone_fallbacks = dict()
def add_rule(a, b, r1, r2):
    if (a, b) in diphone_fallbacks.keys():
        return False
    if (r1, r2) in diphone_index.keys():
        diphone_fallbacks[(a, b)] = (r1, r2)
        return True
    return False

for a in allphones:
    for b in allphones:
        is_complete = False
        for r1 in fallback_rules[a]:
            if is_complete:
                break
            for r2 in fallback_rules[b]:
                if is_complete:
                    break
                is_complete |= add_rule(a, b, r1, r2)
                is_complete |= add_rule(a, b, a, r2)
                is_complete |= add_rule(a, b, r1, b)


In [68]:
# check yourself
for a, b in [('Z', 'Z'), ('Z', 'AY1'), ('Z', 'EY0')]:
    assert (a, b) in diphone_fallbacks
    r1, r2 = diphone_fallbacks[(a, b)]
    assert r1 in fallback_rules[a] or r1 == a
    assert r2 in fallback_rules[b] or r2 == b
    assert r1 != a or r2 != b

In [69]:
# some helping constants
SAMPLE_RATE = 22050
WAV_TYPE = np.int16

A little DSP related to concatenative synthesis:

to prevent a disturbing "clicking" sound (caused by the difference in volume) when concatenating fragments from different utterances, we need to perform a `cross-fade`: smoothing at the concatenation point.

If we concatenate $wav_1$ and $wav_2$ at points $M_1$ and $M_2$ respectively, we perform a crossfade with an overlap of $2 V$:

$$\forall i \in [-V; V]:~output[M_1+i] = (1-\alpha) \cdot wav_1[M_1+i] + \alpha \cdot wav_2[M_2+i]$$
Where $$\alpha = \frac{i+V}{2 V}$$

And for $i < -V:~ output[M_1+i] = wav_1[M_1+i]$

for $i > V:~output[M_1+i] = wav_2[M_2+i]$


But it is not acceptable for the overlap to extend beyond the junction phoneme.

So, if the junction phoneme starts and ends at positions $B_1$ and $E_1$ (in the first wav) and $B_2$ and $E_2$ (in the second one),
the exact formula for the overlapping zone will be:
$$\forall i \in [-L; R]:~output[M_1+i] = (1-\alpha) \cdot wav_1[M_1+i] + \alpha \cdot wav_2[M_2+i]$$
Where:
$$\alpha = \frac{i+L}{L+R},~L = \min(M_1-B_1, M_2 - B_2, V), ~R = \min(E_1-M_1, E_2-M_2, V)$$
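A small worked example of the clipped overlap may help. The sketch below computes $L$, $R$ and the weights $\alpha$ for toy numbers; the helper name `crossfade_window` is an assumption, and the overlap is taken as the half-open range $i \in [-L, R)$ so that it contains exactly $L+R$ samples.

```python
import numpy as np

def crossfade_window(M1, B1, E1, M2, B2, E2, V):
    """Return (L, R, alpha) for the clipped overlap zone i in [-L, R)."""
    L = min(M1 - B1, M2 - B2, V)
    R = min(E1 - M1, E2 - M2, V)
    i = np.arange(-L, R)
    alpha = (i + L) / (L + R)
    return L, R, alpha

# toy junction: phoneme copies of 10 and 6 samples, joined at their centers
L, R, alpha = crossfade_window(M1=5, B1=0, E1=10, M2=3, B2=0, E2=6, V=4)
print(L, R)   # 3 3
print(alpha)  # weights 0, 1/6, 2/6, 3/6, 4/6, 5/6
```

Although $V = 4$ was requested, the shorter phoneme copy only has 3 samples on each side of its center, so both $L$ and $R$ are clipped to 3.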
    

In [123]:
def crossfade(lcenter, ldata, rcenter, rdata, halfoverlap):
    """
    Crossfade two copies of the junction phoneme.

    ldata, rdata - 1d numpy arrays containing only the junction phoneme (so B1 = 0, E1 = ldata.shape[0])
    lcenter = M1
    rcenter = M2
    halfoverlap = V

    Returns the concatenated version of the junction phoneme (as a numpy array).
    """
    M1 = lcenter
    M2 = rcenter
    V = halfoverlap
    B1 = 0
    E1 = ldata.shape[0]
    B2 = 0
    E2 = rdata.shape[0]

    L = min(M1 - B1, M2 - B2, V)
    R = min(E1 - M1, E2 - M2, V)

    only_l = ldata[:M1-L]
    mid_l = ldata[M1-L: M1+R]
    mid_r = rdata[M2-L : M2+R]
    only_r = rdata[M2+R:]

    alpha = np.arange(L+R) / (L+R)

    mid = mid_l * (1-alpha) + mid_r * alpha

    return np.hstack([only_l, mid, only_r])

In [124]:
def get_data(k, i):
    phoneme = alignment[k]['phones'][i]
    left = phoneme.xmin
    right = phoneme.xmax
    center = (left+right) * .5
    
    left = int(left * SAMPLE_RATE)
    center = int(center * SAMPLE_RATE)
    right = int(right * SAMPLE_RATE)
    return center - left, wavs[k][left:right]

In [125]:
# check yourself
cf = crossfade(*get_data('LJ050-0241', 3), *get_data('LJ038-0067', 56), 300)
assert np.abs(cf.shape[0] - 1764) < 10
assert np.abs(cf.mean() - 11) < 0.1

In [131]:
HALF_OVERLAP_CROSSFADE = 300

def synthesize(phonemes):
    diphones = []
    for ph1, ph2 in zip(phonemes[:-1], phonemes[1:]):
        diphone = (ph1, ph2)
        if diphone in diphone_index:
            k, i = diphone_index[diphone]
        else:
            k, i = diphone_index[diphone_fallbacks[diphone]]
            
        diphones.append((get_data(k, i), get_data(k, i+1)))
    # Construct the resulting utterance with crossfades.
    # NB: the border phonemes (the first and the last) do not require any crossfade and can simply be copied.
    # Every inner phoneme is a junction: its copy from the previous diphone (the right unit)
    # is crossfaded with its copy from the next diphone (the left unit).
    output = [diphones[0][0][1]]
    for (_, prev_right), (next_left, _) in zip(diphones[:-1], diphones[1:]):
        output.append(crossfade(*prev_right, *next_left, HALF_OVERLAP_CROSSFADE))
    output.append(diphones[-1][1][1])
    # return the wav as a 1d numpy array of type WAV_TYPE
    return np.hstack(output).astype(WAV_TYPE)

Check yourself:

If everything was correct, you should hear 'hello world'

In [132]:
display_audio(synthesize(['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']))

In [129]:
# load additional test texts
with open("test_phones.txt") as ifile:
    test_phones = []
    for l in ifile:
        test_phones.append(l.strip().split())

Here should be a small part of the GLaDOS song.

In [130]:
output = []
pause = np.zeros([int(0.1 * SAMPLE_RATE)], dtype=WAV_TYPE)
for test in test_phones:
    output.append(synthesize(test))
    output.append(pause)
    
display_audio(np.concatenate(output[:-1]))