# Training a Phoneme Language Model #
In this notebook I train a language model over English sounds (also known as [phonemes](https://en.wikipedia.org/wiki/Phoneme)). The data for English pronunciations comes from the CMU Pronouncing Dictionary, whose entries are written in [ARPABET](https://en.wikipedia.org/wiki/ARPABET), a set of symbols representing English sounds. For example, in ARPABET "fish" is written as /F IH1 SH/.

By training on tens of thousands of pronunciations the model will hopefully learn [English phonotactics](https://en.wikipedia.org/wiki/Phonotactics#English_phonotactics), the rules that govern what sounds like a valid English word. For example, /F AH1 N/ ("fun") sounds good, but /NG S ER1/ (maybe represented as "ngsr") does not.


Check out the notebook `Phoneme Exploration.ipynb` if you want to see the trained model in use.

In [1]:
import torch
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import ParameterGrid
from torch.nn import functional as F
from tqdm.notebook import tqdm

from sonorous.languagemodel import LanguageModel, ModelParams, Vocabulary
from sonorous.pronunciationdata import load_pronunciations
from sonorous.utils import split_data

## The Data ##
The data for this model comes from the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict), which contains over one hundred thousand pronunciations. Each pronunciation is in [ARPABET](https://en.wikipedia.org/wiki/ARPABET), a set of symbols for representing English speech sounds. In ARPABET the word "fish" is represented by the sequence of phonemes /F IH1 SH/. You can probably guess the first and third sounds. The vowel in the middle has "1" at the end to indicate that it carries the primary stress.
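To make the stress notation concrete, here is a minimal sketch that splits an ARPABET symbol into its base phoneme and stress digit. The helper `split_stress` is hypothetical and is not part of `sonorous`; it only relies on the convention that vowels end in 0 (unstressed), 1 (primary stress), or 2 (secondary stress).

```python
# Hypothetical helper (not part of sonorous): split an ARPABET symbol into
# its base phoneme and its optional trailing stress digit.
def split_stress(symbol):
    """Return (base, stress); stress is None for symbols without a digit."""
    if symbol[-1].isdigit():
        return symbol[:-1], int(symbol[-1])
    return symbol, None


fish = "F IH1 SH".split()
parsed = [split_stress(symbol) for symbol in fish]
print(parsed)  # [('F', None), ('IH', 1), ('SH', None)]
```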

I'll use the `load_pronunciations` function to load the Pronouncing Dictionary into a Pandas DataFrame.

In [2]:
pronunciations = load_pronunciations()
print(f"There are {len(pronunciations):,} pronunciations.")
print()
print("Sample of 5 pronunciations:")
pronunciations.sample(5)

There are 131,964 pronunciations.

Sample of 5 pronunciations:


word,pronunciation,num_phonemes,num_syllables,num_primary_stressed_syllables
diligent,"(ˈ, d, ɪ, l, ɪ, dʒ, ə, n, t)",8,3,1
lapsing,"(ˈ, l, æ, p, s, ɪ, ŋ)",6,2,1
sewage,"(ˈ, s, uː, ɪ, dʒ)",4,1,1
billet,"(ˈ, b, ɪ, l, ɪ, t)",5,2,1
minimize,"(ˈ, m, ɪ, n, ə, ˌ, m, aɪ, z)",7,3,1


Here is the pronunciation for "fish":

In [3]:
pronunciations.loc['fish']

pronunciation                     (ˈ, f, ɪ, ʃ)
num_phonemes                                 3
num_syllables                                1
num_primary_stressed_syllables               1
Name: fish, dtype: object

And here are all of the pronunciations for the word "tomato":

In [4]:
pronunciations.loc['tomato']

word,pronunciation,num_phonemes,num_syllables,num_primary_stressed_syllables
tomato,"(t, ə, ˈ, m, eɪ, ˌ, t, oʊ)",6,3,1
tomato,"(t, ə, ˈ, m, ɑː, ˌ, t, oʊ)",6,2,1


## The Model ##
The module `languagemodel.py` contains a class `LanguageModel` that implements a simple neural language model. It's a PyTorch neural network composed of the following layers:
1. An **embedding layer** that translates each phoneme into a dense vector. Note that in the code this is called the `_encoder` since it encodes input phonemes into a representation the model can work with.
2. A recurrent neural network (**RNN**) layer that processes each input phoneme sequentially and for each step generates (a) a hidden representation to pass on to the next step and (b) an output.
3. A **linear layer** that decodes the outputs (2b) into distributions over each phoneme. Note that in the code this is called the `_decoder` since it decodes the model's internal representations back into phonemes.

Let's go through a simple example of what happens when we pass the pronunciation /F IH1 SH/ through the model. Ultimately what I want out of the model is a prediction at each position of what the next phoneme should be. For example, when a well-trained model sees /F IH1/ it should know that /SH/ is likely, or at least not unlikely.
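One way to picture "a prediction at each position" is to pair every token with the token that follows it. The sketch below is illustrative only: it assumes the sequence is wrapped in start and end markers (as `Vocabulary.encode_text` does later in this notebook), and whether `LanguageModel` builds its targets exactly this way is an assumption.

```python
# Illustrative only: pair each position's input with the token the model
# should predict there (the next token in the sequence).
tokens = ["<START>", "ˈ", "f", "ɪ", "ʃ", "<END>"]

inputs = tokens[:-1]   # everything except the final token
targets = tokens[1:]   # the same sequence shifted left by one position

for seen, expected in zip(inputs, targets):
    print(f"after {seen!r} the model should predict {expected!r}")
```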

First I'll create a `Vocabulary` instance by passing in all the pronunciations. The `vocab` is used to convert phonemes into integer indices that the neural network can handle. It does a few other things too, which you can see below. The `Vocabulary` class's code is in `sonorous/languagemodel.py`.

In [5]:
print(''.join(pronunciations.loc['fish'].pronunciation))

ˈfɪʃ


In [6]:
vocab = Vocabulary.from_texts(pronunciations.pronunciation.values)

print(f"There are {len(vocab)} distinct phonemes.")
print()
print("Looking up the int index for /ʃ/:", vocab['ʃ'])
print()
print("Checking whether /ʃ/ is in the vocabulary:", 'ʃ' in vocab)
print()
print("Looking up the phoneme for a specific int index:", vocab.token_from_idx(vocab['ʃ']))
print()
print("Encoding /ˈfɪʃ/:", vocab.encode_text(tuple("ˈfɪʃ")))

There are 47 distinct phonemes.

Looking up the int index for /ʃ/: 31

Checking whether /ʃ/ is in the vocabulary: True

Looking up the phoneme for a specific int index: ʃ

Encoding /ˈfɪʃ/: [ 1  5 13 22 31  2]


Now I'll define the model. Note that I'm not actually fitting the model to any data so the output will be random. The hyperparameters aren't optimal, but again that doesn't matter here since I just want to show the flow of data through the network.

The `ModelParams` class (from `sonorous/languagemodel.py`) encapsulates hyperparameters and options for the model.

In [7]:
model_params = ModelParams(
    rnn_type='rnn', embedding_dimension=10, hidden_dimension=3, num_layers=1,
    max_epochs=3, early_stopping_rounds=3
)

language_model = LanguageModel(vocab, model_params, 'cpu')

Now I'll walk through what happens when we pass the word "fish" /F IH1 SH/ through the model:

In [8]:
fish_pronunciation = tuple("ˈfɪʃ")

The `Vocabulary.encode_text` function we saw earlier does a few things. First, it adds dummy `<START>` and `<END>` tokens to the pronunciation indicating its start and end. This allows the model to learn transition probabilities from the start of the word to the first phoneme, and from the last phoneme to the end of the word.

It then converts every phoneme to its integer index.
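The two steps just described can be sketched with a toy vocabulary. `TinyVocabulary` is hypothetical and is not the `Vocabulary` class from `sonorous/languagemodel.py`; the `<PAD>` token and the ordering of the indices are assumptions made for illustration.

```python
# A minimal sketch of a vocabulary; TinyVocabulary is hypothetical and is not
# the Vocabulary class from sonorous/languagemodel.py.
class TinyVocabulary:
    PAD, START, END = "<PAD>", "<START>", "<END>"  # assumed special tokens

    def __init__(self, texts):
        # Special tokens take the first indices, then tokens in sorted order.
        tokens = sorted({token for text in texts for token in text})
        self.idx_to_token = [self.PAD, self.START, self.END] + tokens
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def encode_text(self, text):
        # Wrap the text in start/end markers, then map tokens to indices.
        wrapped = (self.START, *text, self.END)
        return [self.token_to_idx[token] for token in wrapped]

    def decode(self, indices):
        # Map indices back to their tokens.
        return [self.idx_to_token[i] for i in indices]


tiny_vocab = TinyVocabulary([("ˈ", "f", "ɪ", "ʃ"), ("ˈ", "f", "ʌ", "n")])
encoded = tiny_vocab.encode_text(("ˈ", "f", "ɪ", "ʃ"))
print(encoded)
print(tiny_vocab.decode(encoded))
```

The encoded sequence is two tokens longer than the pronunciation because of the start and end markers.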

In [9]:
fish_input = vocab.encode_text(fish_pronunciation)
print(fish_input)
print()
for idx in fish_input:
    phoneme = vocab.token_from_idx(idx)
    print(f'{idx} => {phoneme}')

[ 1  5 13 22 31  2]

1 => <START>
5 => ˈ
13 => f
22 => ɪ
31 => ʃ
2 => <END>


And now we're ready to pass the input into the model's `forward` function, which takes in inputs and outputs predictions. This model's `forward` function expects a Tensor of dimension `(batch size, number of steps)`. A step here refers to a step forward in the sequence, so /<START> ˈ f ɪ ʃ <END>/ has 6 steps (the stress mark ˈ is its own token).

In [10]:
print("Input's shape:", fish_input.shape)
fish_batch_input = torch.LongTensor(fish_input).unsqueeze(0)
print("Batch input's shape:", fish_batch_input.shape)

Input's shape: (6,)
Batch input's shape: torch.Size([1, 6])


The first thing the `forward` function does is embed each phoneme using an [nn.Embedding](https://pytorch.org/docs/stable/nn.html#embedding). Each phoneme has a dedicated embedding vector of length `embedding_dimension`, so the shape of `embedded` is `(batch size, number of steps, embedding_dimension)`.

In [11]:
embedded = language_model._encoder(fish_batch_input)
print(embedded.shape)
print()
embedded

torch.Size([1, 6, 10])



tensor([[[-1.0845,  0.6374,  0.6895,  0.4447,  1.9042,  0.5735, -1.2779,
          -0.9029, -1.8333, -1.6840],
         [ 1.9033, -0.5870,  1.3252,  0.4500,  0.3476,  0.1615, -1.4788,
          -0.0770,  1.6703,  0.8833],
         [-0.6348,  1.3559, -1.6676, -1.2675,  0.7422, -0.0757,  0.6512,
           0.4348,  0.5014,  0.7137],
         [ 0.0088,  1.2573, -0.7593, -0.1210, -0.6114,  0.2659, -2.1194,
           0.1601, -1.0128, -0.1468],
         [-0.6256, -0.2645,  0.8634, -0.7564,  0.1040,  0.2622, -0.0729,
           0.3323,  1.2381, -0.2089],
         [-0.5891,  0.5280,  0.3780, -0.4542, -0.5642,  0.8115,  1.1417,
           0.7142,  0.5249, -0.1595]]], grad_fn=<EmbeddingBackward>)

I'll pass `embedded` to the [RNN layer](https://pytorch.org/docs/stable/nn.html#recurrent-layers), resulting in `rnn_output` and `hidden_state`. I won't go into detail on how RNNs work since there are many detailed posts on the web you can read, but the basic idea is that a cell is applied sequentially to every token (i.e. step) in the input. At each step an output and a hidden state are produced. The hidden state is passed on to the next step, and the output can be used to make a prediction.

The `rnn` layer below operates on the full sequence, so the results are for the entire sequence.
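For intuition, here is a sketch of a vanilla (Elman-style) recurrence written in plain Python. It is not the `sonorous` implementation: the weights are random placeholders, and the function `rnn_forward` and the small dimensions are assumptions chosen for readability.

```python
import math
import random


# Illustrative Elman-style recurrence in plain Python; this is not the
# sonorous implementation, and the weights below are random placeholders.
def rnn_forward(inputs, w_in, w_hidden, bias):
    """Apply h_t = tanh(W_in x_t + W_hidden h_{t-1} + b) over a sequence."""
    hidden_size = len(bias)
    hidden = [0.0] * hidden_size  # initial hidden state
    outputs = []
    for x in inputs:
        new_hidden = []
        for i in range(hidden_size):
            total = bias[i]
            total += sum(w * xj for w, xj in zip(w_in[i], x))
            total += sum(w * hj for w, hj in zip(w_hidden[i], hidden))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
        outputs.append(hidden)  # for a vanilla RNN the output is the hidden state
    return outputs, hidden


rng = random.Random(0)
embedding_dim, hidden_dim, steps = 4, 3, 6
sequence = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(steps)]
w_in = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(hidden_dim)]
w_hidden = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
bias = [0.0] * hidden_dim

rnn_outputs, final_hidden = rnn_forward(sequence, w_in, w_hidden, bias)
print(len(rnn_outputs), len(rnn_outputs[0]))  # 6 3
```

One output vector of length `hidden_dim` is produced per step, which mirrors the `(batch size, number of steps, hidden_dimension)` shape in the notebook once a batch dimension is added.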

In [12]:
rnn_output, hidden_state = language_model._rnn(embedded)
print(rnn_output.shape)
print()
rnn_output

torch.Size([1, 6, 3])



tensor([[[ 0.5714, -0.8927,  0.7541],
         [ 0.8879,  0.1008, -0.8153],
         [-0.4025,  0.4163, -0.2686],
         [ 0.9633,  0.0914,  0.8364],
         [-0.1460,  0.6187, -0.4511],
         [-0.1026, -0.3411, -0.1833]]], grad_fn=<TransposeBackward1>)

For our application we can ignore the `hidden_state`; the `rnn_output` is the interesting part. The first dimension is for the batch, and we only have a single input in our batch. The second dimension is for each of the input phonemes. The third dimension corresponds to `hidden_dimension`: you can think of this as the state of the RNN at each step.

As I said at the beginning of this section, I want the output of the RNN at each position to be predictions for the *next* position. So I'll apply a [linear layer](https://pytorch.org/docs/stable/nn.html#linear) to the `rnn_output`, resulting in a vector the size of the vocabulary at each position. The [softmax](https://pytorch.org/docs/stable/nn.functional.html#softmax) function normalizes the outputs into probability distributions for each prediction.
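As a small worked example of the normalization step, here is softmax written in plain Python. The three scores are hypothetical stand-ins for the linear layer's output at a single position over a three-token vocabulary.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for a three-token vocabulary at one position.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Larger scores get larger probabilities, and the ordering of the scores is preserved.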

In [13]:
outputs = language_model._decoder(rnn_output)
probabilities = F.softmax(outputs, dim=-1).squeeze()
probabilities.shape

torch.Size([6, 47])

The shape of `probabilities` is (6, 47) because each of the six tokens in the encoded input (`<START>`, the four symbols of /ˈfɪʃ/, and `<END>`) gets a probability distribution over the 47 tokens in the vocabulary.

The first phoneme in the input is the `<START>` token; let's see what the model thinks should come next.

In [14]:
probabilities_for_first_phoneme = probabilities[0]
most_likely_first_phoneme_idx = probabilities_for_first_phoneme.argmax().item()
most_likely_first_phoneme = vocab.token_from_idx(most_likely_first_phoneme_idx)
most_likely_first_phoneme

'g'

So the model predicts /g/ to be the first phoneme in the word. Since the model isn't fit yet this is just a random guess. To get good predictions I'll first need to train the model.

## Finding the Best Model ##
In this section I'll train a number of models on the train set and select the one that has the lowest error on the dev set. I'll split the DataFrame of pronunciations into three DataFrames, with 79% for training, 20% for dev/validation, and 1% for testing of the final model.
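The split can be sketched as a shuffle followed by slicing. `split_rows` below is hypothetical and may differ from `sonorous.utils.split_data` (for example in how it rounds or seeds the shuffle); it is only meant to show how the proportions turn into set sizes.

```python
import random


# A sketch of one way to split data; split_rows is hypothetical and may
# differ from sonorous.utils.split_data.
def split_rows(rows, dev_proportion, test_proportion, seed=0):
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    num_dev = round(len(shuffled) * dev_proportion)
    num_test = round(len(shuffled) * test_proportion)
    dev = shuffled[:num_dev]
    test = shuffled[num_dev:num_dev + num_test]
    train = shuffled[num_dev + num_test:]  # everything left over
    return train, dev, test


train, dev, test = split_rows(range(131_964), dev_proportion=0.2, test_proportion=0.01)
print(len(train), len(dev), len(test))  # 104251 26393 1320
```

Note that a row-level split like this can place two pronunciations of the same word in different sets.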

In [15]:
train_pronunciations, dev_pronunciations, test_pronunciations = split_data(pronunciations, dev_proportion=.2, test_proportion=.01)
len(train_pronunciations), len(dev_pronunciations), len(test_pronunciations)

(104251, 26393, 1320)

Note that I ran a larger parameter search before and saw that GRUs were consistently outperforming LSTMs and vanilla RNNs, so the grid below only covers GRUs. There are 9 (3 * 3) models to train, and for each one I'm measuring train and dev error at every epoch. So if each model trains for the maximum of 1,000 epochs I would end up with 9 * 1,000 = 9,000 models to choose from. There's a good chance I'm overfitting the dev set with such a large search, but I'll inspect the learning curves to try to avoid selecting an iteration that randomly did well.

While each model trains for a maximum of 1,000 epochs, it stops early if the dev error does not decrease for `early_stopping_rounds` epochs in a row (three in the single-model runs below, one in the grid search). Since I'm going to be selecting the model with the lowest dev error there's no reason to keep training a model once it's started overfitting. Alternatively I could train all models to convergence and then add regularization to reduce the complexity and identify the sweet spot, but that's far more time consuming because it requires training more models and training each of them for longer.
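The early stopping rule can be sketched as a counter over the dev losses. The function `stopping_epoch` is hypothetical, the losses are made up, and the exact rule inside `LanguageModel.fit` is an assumption here.

```python
# Illustrative early-stopping rule; the exact rule inside LanguageModel.fit
# is an assumption here.
def stopping_epoch(dev_losses, early_stopping_rounds):
    """Return the 1-based epoch at which training would stop."""
    best_loss = float("inf")
    rounds_without_improvement = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                return epoch
    return len(dev_losses)  # never triggered: ran through every epoch


# Hypothetical dev losses: improvement stalls after epoch 3.
losses = [1.40, 1.10, 0.95, 0.96, 0.97, 0.98, 0.90]
print(stopping_epoch(losses, early_stopping_rounds=3))  # 6
```

In this made-up run the loss would have improved again at epoch 7, which is the usual trade-off: a small patience saves time but can stop before a late improvement.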

In [21]:
pronunciations.loc['excuse'].iloc[1].pronunciation

('ɪ', 'k', 's', 'ˈ', 'k', 'j', 'uː', 'z')

* check errors before fitting
* overfit a small set

In [17]:
%%time

model_parameters = ModelParams(rnn_type='gru', num_layers=1, max_epochs=1000, early_stopping_rounds=3,
                              embedding_dimension=100, hidden_dimension=20)
model = LanguageModel(vocab, model_parameters, device_name='cpu')

train_losses, dev_losses = model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
)

Epoch 1: train loss: 1.0466	dev loss: 1.4134                                                                                               
	Generated: in train: 1%, assess: 0%, novel: 99%
	 eɪ ˈ d dʒ d ˌ tʃ ɛ ŋ g n p
	 ɛ
	 b b w t ɑː ŋ t s
	 uː h iː θ t ɝ ʒ
	 tʃ iː h ŋ p ˈ iː tʃ t ɔɪ
Epoch 2; Batch 13 of 102; loss: 1.0013                                                                                                    

KeyboardInterrupt: 

In [None]:
%%time

model_parameters = ModelParams(rnn_type='gru', num_layers=1, max_epochs=1000, early_stopping_rounds=3,
                              embedding_dimension=20, hidden_dimension=20)
model2 = LanguageModel(vocab, model_parameters, device_name='cuda')

train_losses, dev_losses = model2.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
)

In [None]:
train_pronunciations.head()

In [None]:
dev_pronunciations.head()

In [None]:
len(train_pronunciations), len(dev_pronunciations)

In [None]:
set(train_pronunciations.index) & set(dev_pronunciations.index)

In [None]:
train_pronunciations.loc['keanu']

In [None]:
dev_pronunciations.loc['keanu'].pronunciation

In [None]:
dev_pronunciations['perplexity'] = dev_pronunciations.pronunciation.apply(model2.perplexity_of_text)

In [None]:
train_pronunciations['perplexity'] = train_pronunciations.pronunciation.apply(model2.perplexity_of_text)

In [None]:
train_pronunciations.perplexity.describe()

In [None]:
(dev_pronunciations.num_phonemes==1).mean()

In [None]:
pronunciations.loc['environment'].pronunciation

In [None]:
train_pronunciations.pronunciation.apply(lambda p: p[0]).value_counts()

In [None]:
with open('model.pt', 'wb') as fh:
    model.save(fh)

In [None]:
model

In [None]:
model2

In [None]:
pron = ("ˈ", )
sorted(model.next_probabilities(pron).items(), key=lambda p: -p[1])

In [None]:
train_pronunciations[train_pronunciations.num_primary_stressed_syllables==0]

In [None]:
pronunciations.loc['did']

In [None]:
train_pronunciations.sort_values('perplexity', ascending=False).head()

In [None]:
dev_pronunciations.sort_values('perplexity', ascending=False).head()

In [None]:
dev_pronunciations.perplexity.describe()

In [None]:
pronunciations[pronunciations.pronunciation.apply(lambda p: p[0]) == 'g'].pronunciation.apply(lambda p: p[1]).value_counts()

In [None]:
pronunciation = ('g',)
sorted(model2.next_probabilities(pronunciation).items(), key=lambda p: -p[1])

In [None]:
pronunciations.loc['fang']

* Check the vocab. All look good?
* Check generated stress patterns vs in corpus. starting with primary. number of stresses. 
* --plot the loss for both. makes sense?--. Sort of; weird that dev drops more...
* --recheck train and dev loaders--. seems consistent

* check perplexity on dev and train sets. are there high perplexity words in the training set? or long words or something?
* are dev and train distributions the same?
  * num phonemes
  * num syllables
  * phoneme hists overall

In [None]:
plt.plot(train_losses, label='train')
plt.plot(dev_losses, label='dev')
plt.legend()

In [None]:
pronunciations.loc['environment'].pronunciation

In [None]:
pronunciation = ('ɪ', 'n', 'ˈ', 'v', 'aɪ', 'r', 'ə')
sorted(model.next_probabilities(pronunciation).items(), key=lambda p: -p[1])

In [None]:
from sonorous.languagemodel import build_data_loader
train_loader = build_data_loader(train_pronunciations.pronunciation.values.tolist(), vocab)
dev_loader = build_data_loader(dev_pronunciations.pronunciation.values.tolist(), vocab)
all_loader = build_data_loader(pronunciations.pronunciation.values.tolist(), vocab)

In [None]:
model.evaluate(train_loader)

In [None]:
model.evaluate(dev_loader)

In [None]:
model.evaluate(all_loader)

In [None]:
%%time


param_grid = ParameterGrid({
    'embedding_dimension': [10, 50, 100],
    'hidden_dimension': [10, 50, 100],
})

records = []
for params in tqdm(param_grid):
    model_parameters = ModelParams(rnn_type='gru', num_layers=1, max_epochs=1000, early_stopping_rounds=1, **params)
    model = LanguageModel(vocab, model_parameters, device_name='cuda')

    print('Model Params:', model_parameters)
    
    train_losses, dev_losses = model.fit(
        train_pronunciations.pronunciation.values.tolist(),
        dev_pronunciations.pronunciation.values.tolist()
    )
    
    for epoch, (train_loss, dev_loss) in enumerate(zip(train_losses, dev_losses), start=1):
        record = params.copy()
        record['epoch'] = epoch
        record['train_loss'] = train_loss
        record['dev_loss'] = dev_loss
        
        record['rnn_type'] = 'gru'  # must match the rnn_type passed to ModelParams above
        record['num_layers'] = 1 
    
        records.append(record)

models_df = pd.DataFrame.from_records(records)

In [None]:
%%time


param_grid = ParameterGrid({
    'embedding_dimension': [50],
    'hidden_dimension': [2],
})

records = []
for params in tqdm(param_grid):
    model_parameters = ModelParams(rnn_type='rnn', num_layers=1, max_epochs=1000, early_stopping_rounds=1, **params)
    model = LanguageModel(vocab, model_parameters, device_name='cpu')

    print('Model Params:', model_parameters)
    
    train_losses, dev_losses = model.fit(
        train_pronunciations.pronunciation.values.tolist(),
        dev_pronunciations.pronunciation.values.tolist()
    )
    
    for epoch, (train_loss, dev_loss) in enumerate(zip(train_losses, dev_losses), start=1):
        record = params.copy()
        record['epoch'] = epoch
        record['train_loss'] = train_loss
        record['dev_loss'] = dev_loss
    
        records.append(record)

models_df = pd.DataFrame.from_records(records)

In [None]:
models_df = pd.DataFrame.from_records(records)

There are {meow} models with a dev error of around .76. I'll choose the simplest one, which 

* point out that no matter how low the train error gets, the dev error
* which model parameters fail to ever get to the lowest dev error
* which model parameters overfit the most

* isolate the group of models with about .76 dev error. choose the simplest one.
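The selection idea in these notes (group the models near the best dev error, then take the simplest) could look like the sketch below. The records and loss values are made up, `simplest_near_best` is a hypothetical helper, and treating "simplest" as "smallest hidden size, then smallest embedding size" is an assumption.

```python
# Hypothetical records in the same shape as the rows of models_df;
# the loss values are made up for illustration.
example_records = [
    {"embedding_dimension": 10, "hidden_dimension": 10, "epoch": 12, "dev_loss": 0.81},
    {"embedding_dimension": 50, "hidden_dimension": 50, "epoch": 9, "dev_loss": 0.762},
    {"embedding_dimension": 100, "hidden_dimension": 100, "epoch": 7, "dev_loss": 0.760},
    {"embedding_dimension": 50, "hidden_dimension": 100, "epoch": 8, "dev_loss": 0.761},
]


def simplest_near_best(records, tolerance=0.005):
    """Pick the smallest model whose dev loss is within tolerance of the best."""
    best = min(r["dev_loss"] for r in records)
    candidates = [r for r in records if r["dev_loss"] <= best + tolerance]
    # "Simplest" is taken to mean smallest hidden size, then smallest embedding.
    return min(candidates, key=lambda r: (r["hidden_dimension"], r["embedding_dimension"]))


chosen = simplest_near_best(example_records)
print(chosen["embedding_dimension"], chosen["hidden_dimension"])  # 50 50
```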

In [None]:
t = models_df[(models_df.embedding_dimension==50) & (models_df.hidden_dimension==100) & (models_df.num_layers==3)]
t = t.set_index('epoch')
t.dev_loss.plot()
t.train_loss.plot()
_ = plt.legend()

In [None]:
models_df.sort_values('dev_loss').iloc[0]

Since the best "model" was at an earlier epoch I don't have access to it. So I'll train a model with that model's parameters and set the number of epochs to MEOW.

In [None]:
%%time

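# Retrain with the selected hyperparameters, using the classes and data
# splits defined earlier in this notebook.
model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=100, num_layers=3,
    max_epochs=69, early_stopping_rounds=69, batch_size=1024,
)
model = LanguageModel(vocab, model_params, device_name='cuda')

train_loss, dev_loss = model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
)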

Now that I have what I hope is the best model I can test how well it does on the holdout test set, which I haven't looked at at all during the training process.

In [None]:
# Evaluate on the held-out test split created by split_data above.
test_loader = build_data_loader(test_pronunciations.pronunciation.values.tolist(), vocab)
model.evaluate(test_loader)

In [None]:
# Evaluate on the dev split for comparison with the test error.
dev_loader = build_data_loader(dev_pronunciations.pronunciation.values.tolist(), vocab)
model.evaluate(dev_loader)

TODO: compute the test error for the final model. Bar chart for the train, dev, and test errors

Comment on findings, probably that test is higher and that's expected because language models are very sensitive to corpus difference (and I probably overfit the dev set)

## Training the Final Model ##
Now that I've found the best parameters for the model according to the dev set I'll train a final model using all of the data. This should increase model performance overall since more data is better, but is also necessary since I'll be using the model to predict probabilities of all English words. If some of those words weren't in the training set they would artificially get lower probabilities. (Another approach here could be to train a model on e.g. 4/5 folds of the data and make predictions about the remaining 1/5, doing that 5 times to get unbiased predictions for all data, but this would have taken much longer to run.)
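The cross-validation alternative mentioned in the parenthesis can be sketched as follows. Nothing here comes from `sonorous`: `k_fold_indices` is a hypothetical helper, the word list is a tiny stand-in, and a real run would shuffle the data first and fit a model inside the loop.

```python
# Illustrative k-fold assignment; nothing here comes from sonorous.
def k_fold_indices(num_items, num_folds):
    """Yield (train_indices, held_out_indices) for each fold."""
    folds = [list(range(fold, num_items, num_folds)) for fold in range(num_folds)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(num_items) if i not in held_out_set]
        yield train, held_out


words = ["fish", "fun", "tomato", "keanu", "fang", "did", "sewage"]
scored_out_of_fold = []
for train_idx, held_out_idx in k_fold_indices(len(words), num_folds=5):
    # A real run would fit a model on train_idx and score held_out_idx here.
    scored_out_of_fold.extend(held_out_idx)

# Every word is scored exactly once, by a model that never saw it.
print(sorted(scored_out_of_fold) == list(range(len(words))))  # True
```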

In [None]:
%%time

model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=50, num_layers=1,
    max_epochs=3, early_stopping_rounds=3, batch_size=1024
)

language_model = LanguageModel(vocab, model_params, device_name='cpu')

_ = language_model.fit(pronunciations.pronunciation.values)

And finally, I'll save the model so I can use it in the next notebook, `Phoneme Exploration.ipynb`.

In [None]:
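# Save the final model trained on all of the data (language_model, not the
# earlier experimental `model`).
with open('phoneme_language_model.pt', 'wb') as fh:
    language_model.save(fh)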
