# Phoneme Language Model #
This notebook builds a language model over English sounds (also known as [phonemes](https://en.wikipedia.org/wiki/Phoneme)). Its goal is to model what it means to be a valid English pronunciation. In this notebook I'll use it to:


In [1]:
import torch
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import normalize
from torch import nn
from torch.nn import functional as F
from tqdm.notebook import tqdm

from sonorous.languagemodel import LanguageModel, ModelParams, Vocabulary
from sonorous.pronunciationdata import load_pronunciations
from sonorous.utils import split_data

## The Data ##
The data for this model comes from the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict), which contains over one hundred thousand pronunciations. Each pronunciation is in [ARPABET](https://en.wikipedia.org/wiki/ARPABET), a set of symbols for representing English speech sounds. In ARPABET the word "fish" is represented by the sequence of phonemes /F IH1 SH/. You can probably guess the first and third sounds. The vowel in the middle has "1" at the end to indicate it has the primary stress.

I'll use the `load_pronunciations` function to load the Pronouncing Dictionary into a Pandas DataFrame.

In [2]:
pronunciations = load_pronunciations()
print(f"There are {len(pronunciations):,} pronunciations")

pronunciations.sample(5)

There are 124,996 pronunciations


| word | pronunciation | pronunciation_string | length |
|---|---|---|---|
| folly | (F, AA1, L, IY0) | F AA1 L IY0 | 4 |
| dehnert | (D, EH1, N, ER0, T) | D EH1 N ER0 T | 5 |
| aloha | (AH0, L, OW1, HH, AA0) | AH0 L OW1 HH AA0 | 5 |
| benackova | (B, EH2, N, AH0, K, OW1, V, AH0) | B EH2 N AH0 K OW1 V AH0 | 8 |
| geurin | (ZH, ER0, AE1, N) | ZH ER0 AE1 N | 4 |


Here is the pronunciation for "fish":

In [3]:
pronunciations.loc['fish', 'pronunciation_string']

'F IH1 SH'

And here are all of the pronunciations for the word "tomato":

In [4]:
pronunciations.loc['tomato', ['pronunciation_string']]

| word | pronunciation_string |
|---|---|
| tomato | T AH0 M EY1 T OW2 |
| tomato | T AH0 M AA1 T OW2 |


In [5]:
vocab = Vocabulary.from_texts(pronunciations.pronunciation.values)
print(f"There are {len(vocab)} phonemes in the vocabulary")

There are 73 phonemes in the vocabulary


In [None]:
model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=1, num_layers=1,
    max_epochs=3, early_stopping_rounds=3
)

language_model = LanguageModel(vocab, model_params, 'cpu')
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist())

In [None]:
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist(), max_epochs=3)

In [None]:
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist(), max_epochs=5)

In [None]:
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist(), max_epochs=3)

In [None]:
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist(), max_epochs=5)

In [None]:
language_model.fit(train_df.pronunciation.values.tolist(), dev_df.pronunciation.values.tolist(), max_epochs=5)

## The Model ##
The module `languagemodel.py` contains a class `LanguageModel` that implements a simple neural language model. It's a PyTorch network composed of the following layers:
1. An embedding layer that translates each phoneme into a dense vector.
2. A recurrent neural network (RNN) layer that processes each input phoneme sequentially and for each step generates (a) a hidden representation to pass on to the next step and (b) an output.
3. A linear layer that decodes the outputs (2b) into distributions over each phoneme.
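
As a rough illustration of how those three layers fit together, here is a minimal sketch. It is not the actual `LanguageModel` class: the name `TinyPhonemeLM`, the choice of `nn.RNN`, and the use of `batch_first=True` are my assumptions for the example, and the input indices are made up.

```python
import torch
from torch import nn


class TinyPhonemeLM(nn.Module):
    """Hypothetical stand-in for the three layers described above."""

    def __init__(self, vocab_size, embedding_dimension, hidden_dimension):
        super().__init__()
        # 1. Embedding: phoneme index -> dense vector
        self.encoder = nn.Embedding(vocab_size, embedding_dimension)
        # 2. RNN: processes the sequence one step at a time
        self.rnn = nn.RNN(embedding_dimension, hidden_dimension, batch_first=True)
        # 3. Linear decoder: RNN output -> one score per phoneme
        self.decoder = nn.Linear(hidden_dimension, vocab_size)

    def forward(self, indices):
        embedded = self.encoder(indices)                # (batch, steps, embedding_dimension)
        rnn_output, hidden_state = self.rnn(embedded)   # (batch, steps, hidden_dimension)
        return self.decoder(rnn_output)                 # (batch, steps, vocab_size)


model = TinyPhonemeLM(vocab_size=73, embedding_dimension=10, hidden_dimension=3)
logits = model(torch.tensor([[0, 5, 9, 12, 1]]))  # one sequence of five made-up indices
print(logits.shape)
```

The output has one row of 73 scores for each of the five input positions. Applying a softmax over the last dimension would turn each row into a probability distribution.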

Let's go through a simple example of what happens when we pass the pronunciation /F IH1 SH/ through the model. Ultimately what I want out of the model is a prediction at each position of what the next phoneme should be. For example, when a well-trained model is processing "F" I would hope it assigns a high probability to "IH1" coming next.

First I'll define our model. Note that I'm not actually fitting the model to any data so the output will be random. The parameters aren't optimal, but again that doesn't matter here since I just want to show the flow of data through the network.

In [None]:
# TODO: comment on this


In [None]:
# Build the vocabulary from the DataFrame loaded above (`df` was never defined)
vocab = Vocabulary.from_texts(pronunciations.pronunciation.values)

model_params = ModelParams(
    rnn_type='rnn', embedding_dimension=10, hidden_dimension=3, num_layers=1,
    max_epochs=3, early_stopping_rounds=3
)

language_model = LanguageModel(vocab, model_params, 'cpu')

Next I'll define the pronunciation we'll be working with. This is the standard pronunciation of "fish".

In [None]:
pronunciation = ("F", "IH1", "SH")

Within the neural network, phonemes are represented as integer indices. So the phoneme "F" might correspond to 23. Overall there are 73 distinct phonemes. The class `Vocabulary` finds every distinct phoneme in the pronunciations and constructs a dictionary mapping each one to an integer.

In [None]:
print(f'There are {len(vocab)} distinct phonemes.')

Before passing the pronunciation into the neural network I need to add dummy tokens to the pronunciation indicating its start and end. This allows the model to learn transition probabilities from the start of the word to the first phoneme, and from the last phoneme to the end of the word. We end up with this: `["START", "F", "IH1", "SH", "END"]`.

I also need to translate each phoneme into an integer index within the model. The `encode_text` method of `Vocabulary` handles both of these tasks. It uses the phoneme-to-index mapping that the vocabulary built above.

In [None]:
input_ = vocab.encode_text(pronunciation)
input_

Here is what each of those ints maps to:

In [None]:
for idx in input_:
    phoneme = vocab.token_from_idx(idx)
    print(f'{idx} => {phoneme}')
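
To make the encoding step concrete, here is a minimal sketch of what a vocabulary like this might do internally. The function names, the `<START>`/`<END>` token spellings, and the index order are all assumptions for illustration. The real `Vocabulary` class may differ on each of them, including the type it returns.

```python
# Hypothetical mini-vocabulary, not the real `Vocabulary` implementation.
START, END = "<START>", "<END>"


def build_token_to_idx(texts):
    """Map the dummy tokens and every distinct phoneme to an integer."""
    tokens = sorted({token for text in texts for token in text})
    return {token: idx for idx, token in enumerate([START, END] + tokens)}


def encode_text(text, token_to_idx):
    """Wrap the pronunciation in START/END and translate it to indices."""
    wrapped = [START, *text, END]
    return [token_to_idx[token] for token in wrapped]


token_to_idx = build_token_to_idx([("F", "IH1", "SH"), ("D", "IH1", "SH")])
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

encoded = encode_text(("F", "IH1", "SH"), token_to_idx)
decoded = [idx_to_token[idx] for idx in encoded]
print(encoded)
print(decoded)
```

The encoded sequence is two items longer than the pronunciation because of the dummy tokens, and decoding it recovers the wrapped sequence.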

And now we're ready to pass the input into the model's `forward` function. You can see the full function in `sonorous/languagemodel.py: LanguageModel`, but I'll walk through it layer by layer here.

In PyTorch, the `forward` function is the forward pass through the network, taking in inputs and outputting predictions. This model's `forward` function expects a tensor of shape (batch size, number of steps) in which each entry is a phoneme index.

In this case I have a single pronunciation so the batch size is 1.

In [None]:
print("Input's shape:", input_.shape)
batch_input = torch.LongTensor(input_).unsqueeze(0)
print("Batch input's shape:", batch_input.shape)

The first thing the `forward` function does is embed each phoneme using an [nn.Embedding](https://pytorch.org/docs/stable/nn.html#embedding). Each phoneme has a dedicated embedding vector of length `embedding_dimension`.

In [None]:
embedded = language_model._encoder(batch_input)
print(embedded.shape)
embedded

The entire embedding is of the shape (1, M, N): a batch of one, with M representing the number of phonemes in the input (including the dummy ones) and N representing the embedding dimension. Here M is 5 and N is 10.

*Side Note*: I want to talk a little bit about why embeddings are useful and whether they make sense here.

For this language model the vocabulary is the set of possible phonemes, of which there are 73. In most natural language processing tasks the tokens in the vocabulary are words (e.g. "cat", "threw"), where there are likely to be tens of thousands. So when embedding words using an embedding dimension of 100, each word is represented by a dense vector of 100 floats. Without embeddings you would need to represent each of your words with a one hot vector the size of your entire vocabulary. So embeddings result in fewer model parameters since the model only needs to know about those 100 floats. Since embeddings are updated during backpropagation, the model learns how to represent each token as a vector.
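
One way to see the relationship between the two representations: an embedding lookup gives the same result as multiplying a one hot vector by the embedding's weight matrix. The sketch below checks this with made-up indices and an untrained `nn.Embedding`. The sizes 73 and 10 are just the ones used elsewhere in this notebook.

```python
import torch
from torch import nn
from torch.nn import functional as F

vocab_size, embedding_dimension = 73, 10
embedding = nn.Embedding(vocab_size, embedding_dimension)

indices = torch.tensor([3, 41, 7])  # three made-up phoneme indices

# Direct lookup: one row of the weight matrix per index
looked_up = embedding(indices)  # (3, 10)

# Same thing via one hot vectors and a matrix multiplication
one_hot = F.one_hot(indices, num_classes=vocab_size).float()  # (3, 73)
multiplied = one_hot @ embedding.weight  # (3, 10)

print(torch.allclose(looked_up, multiplied))
```

The layer after the embedding therefore sees 10 inputs per phoneme rather than 73, which is where the savings come from when the vocabulary is large.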

For this phoneme language model the vocabulary size is already very small to begin with and I initially was going to just represent each phoneme with a one hot vector. However, I wanted to use embeddings because I was hoping that the model would learn to represent each phoneme as its component parts. For example, the embedding space could capture whether a phoneme is a vowel or a consonant, whether it's [voiced or voiceless](https://en.wikipedia.org/wiki/Voice_(phonetics)), how [sonorant](https://en.wikipedia.org/wiki/Sonorant) it is, etc. Some limited probing of the embeddings generated didn't turn up anything too interesting (see more below) so I'm not sure how much they're helping, if at all. I have not measured model performance just using one hot encoding, but I wouldn't be surprised if it were equal or better.

So now we have the output of our embedding layer. We can pass this through the RNN layer, resulting in `rnn_output` and `hidden_state`. I won't go into detail on how RNNs work, but the basic idea is it has shared weights that are applied sequentially to every token (i.e. step) in the input. At each step an output and a hidden state are produced. The hidden state can be passed on to the next step, and the output can be used to make a prediction.

The `rnn` layer below operates on the full sequence, so the results are for the entire sequence.

In [None]:
rnn_output, hidden_state = language_model._rnn(embedded)
print(rnn_output.shape)
rnn_output

For our application we can ignore the `hidden_state`; the `rnn_output` is the interesting part. The shape of `rnn_output` is (1, 5, 3). The first dimension is for the batch, and we only have a single input in our batch. The second dimension is for each of the input phonemes: `input_` was of length 5 so this is too. The third dimension corresponds to `hidden_dimension`: you can think of this as the state of the RNN at each step.
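
The relationship between the two return values can be checked on a standalone layer. This sketch uses a plain `nn.RNN` with `batch_first=True` and random inputs, both of which are assumptions on my part. I haven't confirmed how `LanguageModel` configures its own RNN.

```python
import torch
from torch import nn

embedding_dimension, hidden_dimension = 10, 3
rnn = nn.RNN(embedding_dimension, hidden_dimension, num_layers=1, batch_first=True)

# Random stand-in for the embedded input: (batch, steps, embedding_dimension)
embedded = torch.randn(1, 5, embedding_dimension)
rnn_output, hidden_state = rnn(embedded)

print(rnn_output.shape)    # (batch, steps, hidden_dimension)
print(hidden_state.shape)  # (num_layers, batch, hidden_dimension)

# For a single-layer, one-directional RNN, the final hidden state is the
# same as the output at the last step.
last_step_matches = torch.allclose(rnn_output[:, -1, :], hidden_state[-1])
print(last_step_matches)
```

So `hidden_state` carries no information beyond what is already in `rnn_output` for this configuration, which is why it can be ignored here.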

Now we're getting to the interesting part. As I said at the beginning of this section, I want the output of the RNN at each position to be predictions for the *next* position. So I'll apply a linear layer to the `rnn_output`, resulting in a vector the size of the vocabulary at each position. The [softmax](https://pytorch.org/docs/stable/nn.functional.html#softmax) function normalizes the outputs into probability distributions for each prediction.

In [None]:
outputs = language_model._decoder(rnn_output)
probabilities = F.softmax(outputs, dim=-1).squeeze()
probabilities.shape

The shape of `probabilities` is 5 × 73: for each of the 5 phonemes in the input there's a distribution over all phonemes indicating what the model thinks the next phoneme in the input should be. The first phoneme in the input is the `START` token, so let's see what the model thinks should come next.

In [None]:
probabilities_for_first_phoneme = probabilities[0]
most_likely_first_phoneme_idx = probabilities_for_first_phoneme.argmax().item()
most_likely_first_phoneme = vocab.token_from_idx(most_likely_first_phoneme_idx)
most_likely_first_phoneme

So the model predicts /HH/ to be the first phoneme in the word. Since the model isn't fit yet this is just a random guess.

Note that this isn't a very good prediction. If you look at the rest of the distribution you'll see it doesn't make sense. In order to get the model to make good predictions (e.g. that "S" is likely as the first phoneme in a word) we'll need to first train a good model.
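
Training a next-phoneme model usually means scoring the prediction at each position against the phoneme that actually follows it. Here is a minimal sketch of that idea using cross-entropy. I'm assuming cross-entropy here, since I haven't shown which loss `LanguageModel.fit` uses, and the indices and scores below are made up.

```python
import math

import torch
from torch.nn import functional as F

vocab_size = 73
# Made-up indices standing in for START, F, IH1, SH, END
encoded = torch.tensor([[0, 23, 40, 55, 1]])

inputs = encoded[:, :-1]   # START, F, IH1, SH
targets = encoded[:, 1:]   # F, IH1, SH, END (each input's next phoneme)

# Random scores in place of a real model's output: (batch, steps, vocab_size)
torch.manual_seed(0)
logits = torch.randn(1, inputs.shape[1], vocab_size)

# cross_entropy expects (N, classes) scores and (N,) target indices
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Baseline: a model that gives every phoneme the same score
uniform_logits = torch.zeros(inputs.shape[1], vocab_size)
uniform_loss = F.cross_entropy(uniform_logits, targets.reshape(-1))

print(loss.item(), uniform_loss.item(), math.log(vocab_size))
```

A model that spreads its probability evenly over all 73 phonemes scores ln 73, or about 4.29, so that number is a useful reference point when reading loss values.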

## Finding the Best Model ##
In this section I'll train a number of models on the train set and select the one that has the lowest error on the dev set. I'll split the DataFrame of pronunciations into three DataFrames, with 79% for training, 20% for dev/validation, and 1% for testing of the final model.

In [13]:
train_pronunciations, dev_pronunciations, test_pronunciations = split_data(pronunciations, dev_proportion=.2, test_proportion=.01)
len(train_pronunciations), len(dev_pronunciations), len(test_pronunciations)

(98746, 25000, 1250)
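
For reference, here is a minimal sketch of one way a three-way split like this could work: shuffle, then slice. `split_items` is a hypothetical stand-in and not the real `split_data`, which may shuffle differently or keep all pronunciations of a word in the same split.

```python
import math
import random


def split_items(items, dev_proportion, test_proportion, seed=0):
    """Hypothetical stand-in for `split_data`: shuffle, then slice."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Round the dev and test sizes up; train gets whatever is left
    num_dev = math.ceil(len(shuffled) * dev_proportion)
    num_test = math.ceil(len(shuffled) * test_proportion)

    dev = shuffled[:num_dev]
    test = shuffled[num_dev:num_dev + num_test]
    train = shuffled[num_dev + num_test:]
    return train, dev, test


train, dev, test = split_items(range(124_996), dev_proportion=.2, test_proportion=.01)
print(len(train), len(dev), len(test))
```

Rounding the dev and test sizes up reproduces the sizes printed above for 124,996 rows, but that match alone doesn't confirm the real function works this way.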

Note that I ran a larger parameter search before and saw that GRUs were consistently outperforming LSTMs and vanilla RNNs. In the search below I'm looking for the model with the lowest dev error. There are 12 (4 * 3) models to build, and for each one I'm measuring train and dev error at every epoch. So if each model trains for the maximum of 2,000 epochs I would end up with 12 * 2,000 = 24,000 models to choose from. There's a good chance I'm overfitting the dev set with such a large search, but I'll inspect the learning curves to try to avoid selecting an iteration that randomly did well.

While each model trains for a maximum of 2000 epochs, it stops early if the dev error does not decrease for three epochs in a row. Since I'm going to be selecting the model with the lowest dev error there's no reason to keep training a model once it's started overfitting. Alternatively I could train all models to convergence and then add regularization to reduce the complexity and identify the sweet spot, but that's far more time consuming because it requires training more models and each of them for longer.
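
The stopping rule itself is simple enough to sketch in a few lines. The function below is a hypothetical illustration and not the code inside `LanguageModel.fit`. It takes a list of per-epoch dev losses (made up here) and reports which epoch was best and how many epochs ran before stopping.

```python
def best_epoch_with_early_stopping(dev_losses, early_stopping_rounds):
    """Return (best_epoch, epochs_run) for a sequence of per-epoch dev losses."""
    best_loss = float("inf")
    best_epoch = 0
    rounds_without_improvement = 0
    epochs_run = 0

    for epoch, dev_loss in enumerate(dev_losses, start=1):
        epochs_run = epoch
        if dev_loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch = dev_loss, epoch
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                break

    return best_epoch, epochs_run


# Made-up dev losses: improvement stalls after epoch 3
losses = [2.98, 2.64, 2.35, 2.36, 2.40, 2.37, 2.10]
result = best_epoch_with_early_stopping(losses, early_stopping_rounds=3)
print(result)
```

Note that the made-up losses improve again at epoch 7, which the rule never sees. That is the tradeoff of stopping early: a larger patience value wastes more epochs but is less likely to miss a late improvement.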

In [17]:
%%time


param_grid = ParameterGrid({
    'embedding_dimension': [50],
    'hidden_dimension': [2],
})

records = []
for params in tqdm(param_grid):
    model_parameters = ModelParams(rnn_type='rnn', num_layers=1, max_epochs=1000, early_stopping_rounds=1, **params)
    model = LanguageModel(vocab, model_parameters, device_name='cpu')

    print('Model Params:', model_parameters)
    
    train_losses, dev_losses = model.fit(
        train_pronunciations.pronunciation.values.tolist(),
        dev_pronunciations.pronunciation.values.tolist()
    )
    
    for epoch, (train_loss, dev_loss) in enumerate(zip(train_losses, dev_losses), start=1):
        record = params.copy()
        record['epoch'] = epoch
        record['train_loss'] = train_loss
        record['dev_loss'] = dev_loss
    
        records.append(record)

models_df = pd.DataFrame.from_records(records)

Model Params: ModelParams(rnn_type='rnn', embedding_dimension=50, hidden_dimension=2, num_layers=1, max_epochs=1000, early_stopping_rounds=1, learning_rate=0.001, dropout=0, l2_strength=0, batch_size=1024)
Epoch 1: train loss: 2.7358	dev loss: 2.9814
	Generated: in train: 3%, assess: 0%, novel: 97%
	 AE1 AY2 AH0 AA2 UH1
	 N ZH UW1 OY2
	 
	 AE1 IY1 UH1 T IH1
	 OY2 IY1 TH UW2 N UW0 UW1 AE0 AH2 AA1 IY0 AO2 F UH1 AY1
Epoch 2: train loss: 2.3473	dev loss: 2.6413
	Generated: in train: 4%, assess: 0%, novel: 96%
	 P UW1 IH2 AW0 AW1 L IH0 EH0 ER0 UH1 AE1 CH P JH ER2 UW G AH2 AH2 HH G UH1 IH0
	 UW2 TH S OW2 IH2
	 TH IH1 AA0 EH1 IH2 AH1 DH W
	 R UH1 R AW2 OW0 L OW0 AE1
	 EY0 R DH EY2 OY1 UW2 AW1
Epoch 3: train loss: 2.0126	dev loss: 2.3506
	Generated: in train: 7%, assess: 0%, novel: 93%
	 UW2 UH1 EH2 AY1 EH1 UW1 EH0
	 OW0 CH SH AO0 EY0 N W EY1 UW W CH
	 P OY2
	 ER0 AH0 F K AO1 IY1 K
	 Y AE0
Epoch 4: train loss: 1.7399	dev loss: 2.1130
	Generated: in train: 8%, assess: 0%, novel: 92%
	 
	 IY1
	 

Epoch 46: train loss: 0.7150	dev loss: 1.0994
	Generated: in train: 8%, assess: 1%, novel: 91%
	 AY1 S AE1 N
	 EY1 T EH1 N
	 N R
	 L EH1 AH0 L V UW1
	 M AH0 AH0 S
Epoch 47: train loss: 0.7125	dev loss: 1.0959
	Generated: in train: 10%, assess: 0%, novel: 90%
	 IH0 L
	 AE1 L N N EY2 F L T OY1 R D
	 D D
	 M ER0 AA1 EH1 S T AH0 UH2 ER0
	 M IY0 S OW1 R R ER0
Epoch 48: train loss: 0.7096	dev loss: 1.0920
	Generated: in train: 3%, assess: 1%, novel: 96%
	 ER1 B ER1 D OW1 D N D AH0 N
	 S S R IY0
	 B IY0 IY0 AH0 SH T
	 S S
	 N M AH0 N EH1 L B AY1 D
Epoch 49: train loss: 0.7072	dev loss: 1.0886
	Generated: in train: 5%, assess: 1%, novel: 94%
	 TH AH0 AE1 L OW1 M UW IY0
	 R IY0
	 AH0 G IH0 P K AY2 AH0 K D T
	 K R B
	 B AY2 K
Epoch 50: train loss: 0.7041	dev loss: 1.0837
	Generated: in train: 9%, assess: 0%, novel: 91%
	 IY0 L UH1 P
	 L
	 L N T S
	 AH1 NG AH0 JH
	 B S N
Epoch 51: train loss: 0.7016	dev loss: 1.0802
	Generated: in train: 5%, assess: 0%, novel: 95%
	 ER2 B IH2 T OW1 N
	 UW2 P D AO

Epoch 92: train loss: 0.6584	dev loss: 1.0188
	Generated: in train: 13%, assess: 3%, novel: 84%
	 ER0 Y AH0 L AH0 K
	 M AY1 NG P
	 JH AA2 F AH0 D AH0 N
	 SH IH1 N L AE0 L IH0 K M IY1 S AH0 D
	 N
Epoch 93: train loss: 0.6579	dev loss: 1.0180
	Generated: in train: 9%, assess: 1%, novel: 90%
	 G AO1 Z
	 K DH UW1 HH ER0 IY0 OW0 OW0 IH0 K IH1 JH L
	 G R AA1 N
	 CH EH1 NG NG AH0 K ER0 T EH2 N L
	 EY1 N AH0 EH1 EH0 L AY2 IY0 AH0 Z IH0 K TH IY0
Epoch 94: train loss: 0.6575	dev loss: 1.0175
	Generated: in train: 9%, assess: 0%, novel: 91%
	 EH1 S AH0 P R Z D
	 R M UH1 L ER0 W
	 AH0 D D Z AH0 M G IH0 G
	 B OW1 R W ER0 M EH1 NG ER0
	 AH1 F IY0
Epoch 95: train loss: 0.6571	dev loss: 1.0169
	Generated: in train: 12%, assess: 1%, novel: 87%
	 L AA1 R
	 T IY1 B OW1 N
	 SH
	 S AH0 V Z L AH0 CH
	 S AE1 L V HH IY1 AA2
Epoch 96: train loss: 0.6567	dev loss: 1.0164
	Generated: in train: 8%, assess: 0%, novel: 92%
	 K IH0 R IH0 N AH0 L ER0
	 UW1 R AH0 K OW0 M
	 HH EH1 V AH0 G IH0 IH2 M AH0 G UH1 B N
	 M Y 

Early stopping because of no decrease in 1 epochs.


NameError: name 'num_parameters' is not defined

In [18]:
models_df = pd.DataFrame.from_records(records)

In [19]:
# `num_layers` is fixed in the search above and isn't stored in `records`, so group only on the searched parameters
models_df.groupby(['embedding_dimension', 'hidden_dimension']).dev_loss.min().sort_values().to_frame()

There are {meow} models with a dev error of around .76. I'll choose the simplest one, which 

* point out that no matter how low the train error gets, the dev error
* which model parameters fail to ever get to the lowest dev error
* which model parameters overfit the most

* isolate the group of models with about .76 dev error. choose the simplest one.

In [None]:
t = models_df[(models_df.embedding_dimension==50) & (models_df.hidden_dimension==100) & (models_df.num_layers==3)]
t = t.set_index('epoch')
t.dev_loss.plot()
t.train_loss.plot()
_ = plt.legend()

In [None]:
models_df.sort_values('dev_loss').iloc[0]

Since the best "model" was at an earlier epoch I don't have access to it. So I'll train a model with that model's parameters and set the number of epochs to MEOW.

In [None]:
%%time

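# Retrain with the selected parameters using the `ModelParams`/`LanguageModel` API imported above
model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=100, num_layers=3,
    max_epochs=69, early_stopping_rounds=69, batch_size=1024,
)
lm = LanguageModel(vocab, model_params, device_name='cuda')

train_loss, dev_loss = lm.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
)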

In [None]:
test_loader = build_data_loader(test_df.pronunciation.values.tolist(), lm.phoneme_to_idx)
lm.evaluate(test_loader)

In [None]:
dev_loader = build_data_loader(dev_df.pronunciation.values.tolist(), lm.phoneme_to_idx)
lm.evaluate(dev_loader)

* compute the test error for the final model. plot the train, dev, and test errors

## Training the Final Model ##
Now that we've found the best parameters for the model according to the dev set we'll train a final model using all of the data. This should increase model performance overall since more data is better, but is also necessary since we'll be using the model to predict probabilities of all English words below. If some of those words weren't in the training set they would artificially get lower probabilities. (Another approach here could be to train a model on e.g. 4/5 folds of the data and make predictions about the remaining 1/5, doing that 5 times to get unbiased predictions for all data, but this would have taken much longer to run.)
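
For completeness, here is a minimal sketch of the fold-based alternative mentioned in the parenthetical. It only generates the index splits. The model fitting and scoring are left as a comment because I did not run this approach. The function name and the toy sizes are assumptions for illustration.

```python
def k_fold_indices(num_items, num_folds):
    """Yield (train_indices, held_out_indices) once per fold."""
    indices = list(range(num_items))
    # Spread any remainder over the first few folds
    fold_sizes = [
        num_items // num_folds + (1 if fold < num_items % num_folds else 0)
        for fold in range(num_folds)
    ]

    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


held_out_seen = []
for train_indices, held_out_indices in k_fold_indices(num_items=10, num_folds=5):
    # A real run would fit a model on train_indices here and then
    # predict probabilities only for held_out_indices.
    assert not set(train_indices) & set(held_out_indices)
    held_out_seen.extend(held_out_indices)

print(sorted(held_out_seen) == list(range(10)))
```

Every item ends up held out exactly once, so each word's probability would come from a model that never saw it. In practice the rows would be shuffled before being split into folds.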

In [None]:
%%time

model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=100, num_layers=3,
    max_epochs=69, early_stopping_rounds=69, batch_size=1024
)

language_model = LanguageModel(vocab, model_params, device_name='cpu')

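# Fit on every pronunciation (`df` was never defined; the full DataFrame is `pronunciations`)
train_loss, dev_loss = language_model.fit(pronunciations.pronunciation.values.tolist())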

In [24]:
with open('lm_50_2.pt', 'wb') as fh:
    model.save(fh)