# Training a Phoneme Language Model #
In this notebook I train a language model over English sounds (also known as [phonemes](https://en.wikipedia.org/wiki/Phoneme)). The data for English pronunciations comes from the CMU Pronouncing Dictionary. The pronunciations in the pronouncing dictionary are in [ARPABET](https://en.wikipedia.org/wiki/ARPABET), a set of symbols representing English sounds. So in ARPABET "fish" is pronounced as /F IH1 SH/.

By training on tens of thousands of pronunciations the model will hopefully learn [English phonotactics](https://en.wikipedia.org/wiki/Phonotactics#English_phonotactics), the rules that govern what sounds like a valid English word. For example, /F AH1 N/ ("fun") sounds good, but /NG S ER1/ (maybe represented as "ngsr") does not.


Check out the notebook `Phoneme Exploration.ipynb` if you want to see the model used.

In [1]:
import torch
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import ParameterGrid
from torch.nn import functional as F
from tqdm.notebook import tqdm

from sonorous.languagemodel import LanguageModel, ModelParams, Vocabulary
from sonorous.pronunciationdata import load_pronunciations
from sonorous.utils import split_data

## The Data ##
The data for this model comes from the [CMU Pronouncing Dictionary](http://www.speech.cs.cmu.edu/cgi-bin/cmudict), which contains over one hundred thousand pronunciations. Each pronunciation is in [ARPABET](https://en.wikipedia.org/wiki/ARPABET), a set of symbols for representing English speech sounds. In ARPABET the word "fish" is represented by the sequence of phonemes /F IH1 SH/. You can probably guess the first and third sounds. The vowel in the middle has "1" at the end to indicate it has the primary stress.
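To make the stress notation concrete, here is a minimal sketch that splits an ARPABET symbol into its phoneme and its stress digit. The helper name `split_stress` is my own and is not part of `sonorous`; it assumes the CMU convention that vowels end in 0 (no stress), 1 (primary) or 2 (secondary).

```python
import re

# ARPABET vowels may end in a stress digit: 0 (none), 1 (primary), 2 (secondary).
_ARPABET_SYMBOL = re.compile(r"^([A-Z]+)([012])?$")


def split_stress(symbol):
    """Split an ARPABET symbol into (phoneme, stress); stress is None for consonants."""
    match = _ARPABET_SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"not an ARPABET symbol: {symbol!r}")
    phoneme, stress = match.groups()
    return phoneme, int(stress) if stress is not None else None


# "fish" is /F IH1 SH/: only the vowel carries a stress digit.
fish = [split_stress(symbol) for symbol in "F IH1 SH".split()]
print(fish)
```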

I'll use the `load_pronunciations` function to load the Pronouncing Dictionary into a Pandas DataFrame.

In [2]:
pronunciations = load_pronunciations()
print(f"There are {len(pronunciations):,} pronunciations.")
print()
print("Sample of 5 pronunciations:")
pronunciations.sample(5)

There are 131,964 pronunciations.

Sample of 5 pronunciations:


word,pronunciation,as_string,num_phonemes,num_syllables
layoff,"(ˈ, l, eɪ, ˌ, ɔ, f)",ˈleɪˌɔf,4,2
congolese,"(ˌ, k, ɑː, n, g, ə, ˈ, l, iː, z)",ˌkɑːngəˈliːz,8,3
prefers,"(p, r, ɪ, ˈ, f, ɝː, z)",prɪˈfɝːz,6,2
nader,"(ˈ, n, eɪ, d, ɝ)",ˈneɪdɝ,4,2
rollins's,"(ˈ, r, ɑː, l, ɪ, n, z, ɪ, z)",ˈrɑːlɪnzɪz,8,3


Here is the pronunciation for "fish":

In [3]:
pronunciations.loc['fish']

pronunciation    (ˈ, f, ɪ, ʃ)
as_string                ˈfɪʃ
num_phonemes                3
num_syllables               1
Name: fish, dtype: object

And here are all of the pronunciations for the word "tomato":

In [4]:
pronunciations.loc['tomato']

word,pronunciation,as_string,num_phonemes,num_syllables
tomato,"(t, ə, ˈ, m, eɪ, ˌ, t, oʊ)",təˈmeɪˌtoʊ,6,3
tomato,"(t, ə, ˈ, m, ɑː, ˌ, t, oʊ)",təˈmɑːˌtoʊ,6,3


## The Model ##
The module `languagemodel.py` contains a class `LanguageModel` that implements a simple neural language model. It's a PyTorch neural network composed of the following layers:
1. An **embedding layer** to translate each phoneme into a dense vector. Note that in the code this is called the `_encoder` since it encodes input phonemes into a representation the model can work with.
2. A recurrent neural network (**RNN**) layer that processes each input phoneme sequentially and for each step generates (a) a hidden representation to pass on to the next step and (b) an output.
3. A **linear layer** that decodes the outputs (2b) into distributions over each phoneme. Note that in the code this is called the `_decoder` since it decodes the model's internal representations back into phonemes.

Let's go through a simple example of what happens when we pass the pronunciation /F IH1 SH/ through the model. Ultimately what I want out of the model is a prediction at each position of what the next phoneme should be. For example, when a well-trained model sees /F IH1/ it should know that /SH/ is likely, or at least not unlikely.
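A small sketch of what "predict the next phoneme at each position" means for a single word. I'm assuming the usual language-model setup in which the targets are the inputs shifted by one, with the `<START>` and `<END>` markers that appear later in this notebook; `next_phoneme_pairs` is a hypothetical helper, not a `sonorous` function.

```python
def next_phoneme_pairs(phonemes):
    """Pair each position with the token the model should predict next."""
    tokens = ["<START>", *phonemes, "<END>"]
    # Inputs are every token but the last; targets are the same list shifted by one.
    return list(zip(tokens[:-1], tokens[1:]))


pairs = next_phoneme_pairs(["F", "IH1", "SH"])
for current, target in pairs:
    print(f"after {current:>7} predict {target}")
```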

First I'll create a `Vocabulary` instance by passing in all the pronunciations. The `vocab` is used to convert phonemes into integer indices that the neural network can handle. It does a few other things too, which you can see below. The `Vocabulary` class's code is in `sonorous/languagemodel.py`.

In [5]:
print(''.join(pronunciations.loc['fish'].pronunciation))

ˈfɪʃ


In [6]:
vocab = Vocabulary.from_texts(pronunciations.pronunciation.values)

print(f"There are {len(vocab)} distinct phonemes.")
print()
print("Looking up the int index for /ʃ/:", vocab['ʃ'])
print()
print("Checking whether /ʃ/ is in the vocabulary:", 'ʃ' in vocab)
print()
print("Looking up the phoneme for a specific int index:", vocab.token_from_idx(vocab['ʃ']))
print()
print("Encoding /ˈfɪʃ/:", vocab.encode_text(tuple("ˈfɪʃ")))

There are 47 distinct phonemes.

Looking up the int index for /ʃ/: 31

Checking whether /ʃ/ is in the vocabulary: True

Looking up the phoneme for a specific int index: ʃ

Encoding /ˈfɪʃ/: [ 1  5 13 22 31  2]


Now I'll define the model. Note that I'm not actually fitting the model to any data so the output will be random. The hyperparameters aren't optimal, but again that doesn't matter here since I just want to show the flow of data through the network.

The `ModelParams` class (from `sonorous/languagemodel.py`) encapsulates hyperparameters and options for the model.

In [7]:
model_params = ModelParams(
    rnn_type='rnn', embedding_dimension=10, hidden_dimension=3, num_layers=1,
    max_epochs=3, early_stopping_rounds=3
)

language_model = LanguageModel(vocab, model_params, 'cpu')

Now I'll walk through what happens when we pass the word "fish" /F IH1 SH/ through the model:

In [8]:
fish_pronunciation = tuple("ˈfɪʃ")

The `Vocabulary.encode_text` function we saw earlier does a few things. First, it adds dummy `<START>` and `<END>` tokens to the pronunciation indicating its start and end. This allows the model to learn transition probabilities from the start of the word to the first phoneme, and from the last phoneme to the end of the word.

It then converts every phoneme to its integer index.

In [9]:
fish_input = vocab.encode_text(fish_pronunciation)
print(fish_input)
print()
for idx in fish_input:
    phoneme = vocab.token_from_idx(idx)
    print(f'{idx} => {phoneme}')

[ 1  5 13 22 31  2]

1 => <START>
5 => ˈ
13 => f
22 => ɪ
31 => ʃ
2 => <END>
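The encoding step can be sketched with a toy vocabulary. This is an illustrative stand-in, not the `Vocabulary` implementation: I'm assuming index 0 is reserved for a `<PAD>` token (the output above only confirms that `<START>` is 1 and `<END>` is 2), and the phoneme indices will differ from the real ones because the toy vocabulary is built from just two words.

```python
class ToyVocabulary:
    """Illustrative stand-in for `Vocabulary`; not the sonorous implementation."""

    PAD, START, END = "<PAD>", "<START>", "<END>"

    def __init__(self, texts):
        # Reserve the lowest indices for the special tokens (assumed ordering).
        self.tokens = [self.PAD, self.START, self.END]
        self.tokens += sorted({token for text in texts for token in text})
        self.index = {token: idx for idx, token in enumerate(self.tokens)}

    def encode_text(self, text):
        # Wrap the pronunciation in start/end markers, then map tokens to ints.
        wrapped = (self.START, *text, self.END)
        return [self.index[token] for token in wrapped]

    def token_from_idx(self, idx):
        return self.tokens[idx]


toy_vocab = ToyVocabulary([("ˈ", "f", "ɪ", "ʃ"), ("ˈ", "f", "ʌ", "n")])
encoded = toy_vocab.encode_text(("ˈ", "f", "ɪ", "ʃ"))
decoded = [toy_vocab.token_from_idx(idx) for idx in encoded]
print(encoded)
print(decoded)
```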


And now we're ready to pass the input into the model's `forward` function, which takes in inputs and outputs predictions. This model's `forward` function expects a Tensor of dimension `(batch size, number of steps)`. A step here refers to a step forward in the sequence, so /<START> ˈ f ɪ ʃ <END>/ has 6 steps, because the stress marker ˈ is a token of its own.

In [10]:
print("Input's shape:", fish_input.shape)
fish_batch_input = torch.LongTensor(fish_input).unsqueeze(0)
print("Batch input's shape:", fish_batch_input.shape)

Input's shape: (6,)
Batch input's shape: torch.Size([1, 6])


The first thing the `forward` function does is embed each phoneme using an [nn.Embedding](https://pytorch.org/docs/stable/nn.html#embedding). Each phoneme has a dedicated embedding vector of length `embedding_dimension`, so the shape of `embedded` is `(batch size, number of steps, embedding_dimension)`.

In [11]:
embedded = language_model._encoder(fish_batch_input)
print(embedded.shape)
print()
embedded

torch.Size([1, 6, 10])



tensor([[[-0.8161, -0.8673, -0.8075,  2.0873, -0.3538,  0.7794, -1.1596,
          -0.1017,  0.7367, -0.1752],
         [-0.7299, -0.2113,  0.0539,  1.2842,  0.8718, -0.1122, -0.1860,
           0.4326, -0.8699,  1.6186],
         [-1.8333,  0.9768, -1.2780,  1.6892, -0.1444, -0.0279, -0.3749,
           1.2191, -1.2809,  0.8605],
         [-0.2274, -0.2111, -1.1193, -0.1698,  0.8556,  0.6363, -0.0765,
           0.1790, -0.8408,  1.4178],
         [-0.9943, -0.5585, -0.8412,  1.6460,  0.6196,  1.0931,  1.4780,
          -1.3041,  0.4240, -1.7979],
         [-1.7650, -0.5110, -1.6025,  1.0763, -0.5914, -1.4563,  0.3198,
           1.6319, -0.7680, -2.1770]]], grad_fn=<EmbeddingBackward>)
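Conceptually the embedding layer is a lookup table with one row per vocabulary entry. The sketch below mimics that with plain Python lists; the random values and the names `embedding_table` and `embed` are mine, and the real layer holds trainable weights rather than fixed numbers.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBEDDING_DIMENSION = 47, 10

# One row per vocabulary entry; training would adjust these values.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIMENSION)]
    for _ in range(VOCAB_SIZE)
]


def embed(indices):
    """Look up one embedding row per integer index."""
    return [embedding_table[idx] for idx in indices]


# The six indices from the encoded "fish" example above.
embedded_fish = embed([1, 5, 13, 22, 31, 2])
print(len(embedded_fish), len(embedded_fish[0]))
```

The same index always maps to the same row, which is why a phoneme gets an identical vector wherever it appears in a word.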

I'll pass `embedded` to the [RNN layer](https://pytorch.org/docs/stable/nn.html#recurrent-layers), resulting in `rnn_output` and `hidden_state`. I won't go into detail on how RNNs work since there are many detailed posts on the web you can read, but the basic idea is a cell is applied sequentially to every token (i.e. step) in the input. At each step an output and a hidden state are produced. The hidden state can be passed on to the next step, and the output can be used to make a prediction.

The `_rnn` layer below operates on the full sequence, so the results are for the entire sequence.

In [12]:
rnn_output, hidden_state = language_model._rnn(embedded)
print(rnn_output.shape)
print()
rnn_output

torch.Size([1, 6, 3])



tensor([[[ 0.2904,  0.9463, -0.4982],
         [-0.7268, -0.8107, -0.6783],
         [-0.9655,  0.5736, -0.6357],
         [-0.7353, -0.4838, -0.0497],
         [-0.9659,  0.9752,  0.2446],
         [-0.9720,  0.8659, -0.0780]]], grad_fn=<TransposeBackward1>)

For our application we can ignore the `hidden_state`; the `rnn_output` is the interesting part. The first dimension is for the batch, and we only have a single input in our batch. The second dimension is for each of the input phonemes. The third dimension corresponds to `hidden_dimension`: you can think of this as the state of the RNN at each step.
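To show where the `(number of steps, hidden_dimension)` output comes from, here is a rough plain-Python sketch of a vanilla (Elman) RNN applied step by step. The weights are random and the names (`W_input`, `W_hidden`, `rnn_step`, `run_rnn`) are my own; it illustrates the recurrence only, not PyTorch's actual implementation.

```python
import math
import random

random.seed(0)
EMBEDDING_DIMENSION, HIDDEN_DIMENSION = 10, 3


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_input = random_matrix(HIDDEN_DIMENSION, EMBEDDING_DIMENSION)
W_hidden = random_matrix(HIDDEN_DIMENSION, HIDDEN_DIMENSION)
bias = [0.0] * HIDDEN_DIMENSION


def rnn_step(x, h):
    """One Elman step: h_new = tanh(W_input @ x + W_hidden @ h + bias)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_input[row], x))
            + sum(w * hi for w, hi in zip(W_hidden[row], h))
            + bias[row]
        )
        for row in range(HIDDEN_DIMENSION)
    ]


def run_rnn(sequence):
    """Apply the cell left to right, collecting the hidden state at every step."""
    h = [0.0] * HIDDEN_DIMENSION
    outputs = []
    for x in sequence:
        h = rnn_step(x, h)
        outputs.append(h)
    return outputs, h


# Six random "embedded" steps stand in for the embedded phonemes.
steps = [[random.gauss(0, 1) for _ in range(EMBEDDING_DIMENSION)] for _ in range(6)]
outputs, final_hidden = run_rnn(steps)
print(len(outputs), len(outputs[0]))
```

In this single-layer sketch the output at each step is simply the hidden state at that step, so the final hidden state equals the last output row.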

As I said at the beginning of this section, I want the output of the RNN at each position to be predictions for the *next* position. So I'll apply a [linear layer](https://pytorch.org/docs/stable/nn.html#linear) to the `rnn_output`, resulting in a vector the size of the vocabulary at each position. The [softmax](https://pytorch.org/docs/stable/nn.functional.html#softmax) function normalizes the outputs into probability distributions for each prediction.

In [13]:
outputs = language_model._decoder(rnn_output)
probabilities = F.softmax(outputs, dim=-1).squeeze()
probabilities.shape

torch.Size([6, 47])

The shape of `probabilities` is (6, 47) because each of the six tokens in the encoded input (the four symbols of /ˈfɪʃ/ plus `<START>` and `<END>`) gets a probability distribution over the 47 entries in the vocabulary.

The first phoneme in the input is the `<START>` token; let's see what the model thinks should come next.

In [14]:
probabilities_for_first_phoneme = probabilities[0]
most_likely_first_phoneme_idx = probabilities_for_first_phoneme.argmax().item()
most_likely_first_phoneme = vocab.token_from_idx(most_likely_first_phoneme_idx)
most_likely_first_phoneme

'k'

So the model predicts /k/ to be the first phoneme in the word. Since the model isn't fit yet this is just a random guess. To get useful predictions I first need to fit the model to the pronunciation data.
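The softmax and argmax steps used above can be sketched in plain Python for a single position. The four-token vocabulary and the scores are made up for illustration; a real position has 47 scores.

```python
import math


def softmax(logits):
    """Normalise raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


toy_tokens = ["f", "ɪ", "ʃ", "<END>"]
toy_logits = [0.1, 2.0, -1.0, 0.5]  # made-up scores for one position

toy_probabilities = softmax(toy_logits)
# argmax: the index whose probability is largest.
best_idx = max(range(len(toy_probabilities)), key=toy_probabilities.__getitem__)
print(toy_tokens[best_idx], toy_probabilities)
```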

## Finding the Best Model ##
In this section I'll train a number of models on the train set and select the one that has the lowest error on the dev set. I'll split the DataFrame of pronunciations into three DataFrames, with 79% for training, 20% for dev/validation, and 1% for testing of the final model.

In [15]:
train_pronunciations, dev_pronunciations, test_pronunciations = split_data(pronunciations, dev_proportion=.2, test_proportion=.01)
len(train_pronunciations), len(dev_pronunciations), len(test_pronunciations)

(104251, 26393, 1320)
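I haven't shown the body of `split_data`, so the following is only a guess at the idea: shuffle, then cut by rounded proportions. The function names are hypothetical. Rounding the dev and test sizes and giving the remainder to train happens to reproduce the three sizes above.

```python
import random


def split_sizes(n_items, dev_proportion, test_proportion):
    """Return (train, dev, test) sizes; train takes whatever is left over."""
    n_dev = round(n_items * dev_proportion)
    n_test = round(n_items * test_proportion)
    return n_items - n_dev - n_test, n_dev, n_test


def split_items(items, dev_proportion, test_proportion, seed=0):
    """Shuffle a copy of `items` and cut it into train/dev/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, n_dev, _ = split_sizes(len(shuffled), dev_proportion, test_proportion)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


print(split_sizes(131_964, dev_proportion=0.2, test_proportion=0.01))
# → (104251, 26393, 1320)
```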

I split the corpus up into train/dev/test so that I could use the standard approach of training models with different hyperparameters and selecting the model that performs best against the dev set. Unfortunately that doesn't work well here because 

Note that I ran a larger parameter search before and saw that GRUs were consistently outperforming LSTMs and vanilla RNNs. There are 9 (3 * 3) models to train, and for each one I'm measuring train and dev error at every epoch. So if each model trains for the maximum of 1,000 epochs I would end up with 9 * 1,000 = 9,000 models to choose from. There's a good chance I'm overfitting the dev set with such a large search, but I'll inspect the learning curves to try to avoid selecting an iteration that randomly did well.

While each model trains for a maximum of 1,000 epochs, it stops early if the dev error does not decrease for `early_stopping_rounds` epochs in a row (three for the individual runs below, one for the grid search). Since I'm going to be selecting the model with the lowest dev error there's no reason to keep training a model once it's started overfitting. Alternatively I could train all models to convergence and then add regularization to reduce the complexity and identify the sweet spot, but that's far more time consuming because it requires training more models and each of them for longer.
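The early-stopping rule can be sketched as a small function over a list of per-epoch dev losses. This is my reading of the rule, not the code inside `LanguageModel.fit`: I'm assuming "does not decrease" means "fails to beat the best dev loss seen so far", and the losses below are invented.

```python
def stopping_epoch(dev_losses, early_stopping_rounds=3):
    """Return the 1-based epoch at which training would halt."""
    best_loss = float("inf")
    rounds_without_improvement = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                return epoch
    # Never ran out of patience: training used every available epoch.
    return len(dev_losses)


toy_dev_losses = [3.4, 3.0, 2.7, 2.6, 2.65, 2.61, 2.62, 2.5]
print(stopping_epoch(toy_dev_losses))
# → 7
```

Note that in this toy run the later, lower loss of 2.5 is never reached, which is the usual trade-off of a small patience value.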

In [17]:
%%time

#temp

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=100, hidden_dimension=2,
    max_epochs=1000, early_stopping_rounds=3
)

model = LanguageModel(vocab, model_parameters, device_name='cpu')

train_losses, dev_losses = model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

Epoch 1: train loss: 3.3053	dev loss: 3.3871                                                                                               
	Generated: in train: 0%, assess: 0%, novel: 100%
	 ˈ ʊ t æ ˌ l ɛ ɑː aɪ ʊ ɪ ŋ z ʌ ɪ ɝː b dʒ l eɪ s
	 ɔɪ ˌ ʌ n ɝ eɪ uː ɪ iː ɛ ə m b j r ɔ ˌ iː n uː ʒ n m ʃ k v h aɪ ˌ
	 ɝ ŋ
	 iː ŋ w uː v ɛ iː s ʒ ɛ t ɝ j ɛ æ tʃ θ h
	 ʒ s w
Epoch 2: train loss: 2.8625	dev loss: 3.0034                                                                                               
	Generated: in train: 0%, assess: 0%, novel: 100%
	 l j i i dʒ ɔɪ æ
	 ŋ ɛ g w ɪ w
	 ə ɛ z ˈ m w ə g oʊ ˈ
	 ɝː eɪ aɪ l ˌ v ɛ aʊ
	 ɝː dʒ ɛ b iː ɪ r ɔ aʊ g oʊ æ ˌ l
Epoch 3: train loss: 2.4605	dev loss: 2.6573                                                                                               
	Generated: in train: 1%, assess: 0%, novel: 99%
	 
	 h s ɝ ə aʊ l iː iː h dʒ aʊ oʊ l ɝ iː ə h ɔɪ ʃ ˈ ɛ ʌ ʊ uː ʊ i ɑː
	 z iː ʃ
	 i ɝ ˈ ɔ ɝ z ŋ b aʊ f ˈ
	 
Epoch 4: train loss: 2.1054	dev loss: 2.3

Epoch 31: train loss: 0.7836	dev loss: 1.1273                                                                                               
	Generated: in train: 1%, assess: 0%, novel: 99%
	 ˈ ˈ ɪ
	 ˈ ə ˈ t
	 r ˈ ɝ aɪ
	 j ˈ ɪ ə ɔ l uː ə ə f s s ˈ g
	 r ˈ uː k d
Epoch 32: train loss: 0.7806	dev loss: 1.1238                                                                                               
	Generated: in train: 0%, assess: 0%, novel: 100%
	 ˈ p
	 aʊ ˈ i ɛ s t r
	 ˈ ˈ
	 ˈ ˈ i
	 ˈ s b b k m s
Epoch 33: train loss: 0.7778	dev loss: 1.1205                                                                                               
	Generated: in train: 2%, assess: 0%, novel: 98%
	 v uː n ˈ t l ɛ ɛ ˌ ˌ f ɑː ˈ
	 ˈ n n
	 r ˈ t n
	 ʃ ˈ b ʃ iː v
	 n ˈ ɪ k t iː n r
Epoch 34: train loss: 0.7752	dev loss: 1.1176                                                                                               
	Generated: in train: 3%, assess: 0%, novel: 97%
	 ˈ ə ə t r s h ˈ
	 ʌ ˈ z k ʌ m

Epoch 91: train loss: 0.7115	dev loss: 1.0349                                                                                               
	Generated: in train: 2%, assess: 0%, novel: 98%
	 ɪ ə ˈ g d n
	 s ˈ z aɪ ə ə ə r
	 ˈ iː z s ə s iː i
	 ˈ s g ɑː
	 ˈ l ə ɪ b k ʃ t ə s z ɪ ɛ ɪ ɑː t ɝ n
Epoch 92: train loss: 0.7111	dev loss: 1.0343                                                                                               
	Generated: in train: 3%, assess: 0%, novel: 97%
	 ˈ t ə t n
	 ˈ g l tʃ t ŋ
	 ˈ aɪ ɪ p ə n
	 m m k
	 ˌ æ ˈ k z dʒ ɪ z v
Epoch 93: train loss: 0.7107	dev loss: 1.0338                                                                                               
	Generated: in train: 3%, assess: 0%, novel: 97%
	 ˈ aɪ l l
	 ˈ g dʒ
	 ˈ k ɑː ˌ p d t
	 ˈ ɝ z t k ɪ ˌ l oʊ z ɪ r l
	 k l ˌ s t ˈ f ʃ n ɪ m
Epoch 94: train loss: 0.7104	dev loss: 1.0332                                                                                               
	Generated: in train: 1%,

Epoch 121: train loss: 0.7028	dev loss: 1.0230                                                                                               
	Generated: in train: 6%, assess: 0%, novel: 94%
	 d m ŋ ˈ ɔ ə n r s
	 n ˈ ɑː s æ k
	 ˈ oʊ ɝː dʒ
	 p d ˈ b ŋ t dʒ
	 ˈ k ɛ r
Epoch 122: train loss: 0.7014	dev loss: 1.0210                                                                                               
	Generated: in train: 1%, assess: 1%, novel: 98%
	 ˈ ɑː b ɪ n ɝ
	 ˈ iː ɝ θ ə ˌ z oʊ t r r g i
	 ˈ ˌ
	 ˈ l ə ˌ ð l ɑː
	 l ˈ ɛ b t eɪ r n
Epoch 123: train loss: 0.7005	dev loss: 1.0197                                                                                               
	Generated: in train: 2%, assess: 0%, novel: 98%
	 f ˈ v s oʊ m l
	 ˈ iː l m ɪ ɪ oʊ i
	 ˈ oʊ ɛ s ɝ ɪ ˌ t ɝ z ɝ z
	 ˈ m eɪ s ə n
	 ˈ ɑː s r oʊ
Epoch 124: train loss: 0.6999	dev loss: 1.0187                                                                                               
	Generated: in train: 2%, asse

Epoch 179: train loss: 0.6929	dev loss: 1.0090                                                                                               
	Generated: in train: 2%, assess: 2%, novel: 96%
	 ˈ m d
	 m ˈ r ɑː n ə ŋ ɝ
	 p ə ˈ l ɪ s ˌ m ɔ aʊ oʊ ɝ
	 ˈ d ɪ
	 ˈ r l oʊ r b k
Epoch 180: train loss: 0.6928	dev loss: 1.0090                                                                                               
	Generated: in train: 2%, assess: 0%, novel: 98%
	 ˈ ɑː p n d
	 ˈ ɪ iː ˌ ʌ m k
	 ˈ k æ ˌ tʃ ˌ s oʊ oʊ v ɪ z
	 ˈ ɪ n ə k s
	 ˈ h ɑː k ˌ uː v m ɪ
Epoch 181: train loss: 0.6927	dev loss: 1.0086                                                                                               
	Generated: in train: 4%, assess: 1%, novel: 95%
	 s ˈ iː k d t
	 ˈ ɪ n m ə s
	 ˈ aʊ p æ ə ɑː s ə ɪ k
	 m ˈ k ɛ k
	 ˈ w aɪ j
Epoch 182: train loss: 0.6927	dev loss: 1.0087                                                                                               
	Generated: in train: 7%, assess: 

Epoch 208: train loss: 0.6886	dev loss: 1.0029                                                                                               
	Generated: in train: 3%, assess: 0%, novel: 97%
	 h f oʊ s p s ɝ l t i
	 k ˈ h ɛ ə ˌ r n ə iː s ŋ ˌ ə v t b
	 ˌ ˌ ɑː ˈ ɛ ɪ ʃ z t
	 ˈ s dʒ g eɪ t
	 ˈ h ɔ ə ˌ ˌ ɑː ˌ b l w æ b ə ɝ m p
Epoch 209: train loss: 0.6885	dev loss: 1.0026                                                                                               
	Generated: in train: 3%, assess: 1%, novel: 96%
	 l ˈ ʃ
	 oʊ ˈ ə ˈ ə ˈ
	 ˈ d uː l ə n
	 ˈ ɔ ɪ m ə l v ɪ l l
	 f k p g r r s g
Epoch 210: train loss: 0.6885	dev loss: 1.0026                                                                                               
	Generated: in train: 2%, assess: 0%, novel: 98%
	 ˈ iː ˌ aʊ t iː ˌ ɑː l ə ʃ
	 ˈ s p ɑː n ə s t
	 ˈ m æ ʃ n
	 ˈ ə n m n z
	 ˈ d oʊ ə s
Epoch 211: train loss: 0.6884	dev loss: 1.0025                                                                                   

KeyboardInterrupt: 

In [19]:
model.generate()

('s', 'l', 'l', 'k', 'l', 'g', 'dʒ', 'ˌ', 'l', 'ɛ', 't')

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=3, embedding_dimension=20, hidden_dimension=20,
    max_epochs=1000, early_stopping_rounds=3
)

model = LanguageModel(vocab, model_parameters, device_name='cuda')

train_losses, dev_losses = model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

In [None]:
with open('model_20_20_3.pt', 'wb') as fh:
    model.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=20, hidden_dimension=20,
    max_epochs=1000, early_stopping_rounds=3
)

model2 = LanguageModel(vocab, model_parameters, device_name='cuda')

train_losses2, dev_losses2 = model2.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

In [None]:
with open('model_20_20_1.pt', 'wb') as fh:
    model2.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=20, hidden_dimension=10,
    max_epochs=1000, early_stopping_rounds=3
)

small_model = LanguageModel(vocab, model_parameters, device_name='cuda')

_ = small_model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

In [None]:
with open('model_20_10_1.pt', 'wb') as fh:
    small_model.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=10, hidden_dimension=10,
    max_epochs=1000, early_stopping_rounds=3
)

model_10_10_1 = LanguageModel(vocab, model_parameters, device_name='cuda')

_ = model_10_10_1.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

with open('model_10_10_1.pt', 'wb') as fh:
    model_10_10_1.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=10, hidden_dimension=5,
    max_epochs=1000, early_stopping_rounds=3
)

model_10_5_1 = LanguageModel(vocab, model_parameters, device_name='cuda')

_ = model_10_5_1.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

with open('model_10_5_1.pt', 'wb') as fh:
    model_10_5_1.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='gru', num_layers=1, embedding_dimension=10, hidden_dimension=3,
    max_epochs=1000, early_stopping_rounds=3
)

model_10_3_1 = LanguageModel(vocab, model_parameters, device_name='cuda')

_ = model_10_3_1.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

with open('model_10_3_1.pt', 'wb') as fh:
    model_10_3_1.save(fh)

In [None]:
%%time

model_parameters = ModelParams(
    rnn_type='rnn', num_layers=1, embedding_dimension=10, hidden_dimension=3,
    max_epochs=1000, early_stopping_rounds=3
)

rnn_10_3_1 = LanguageModel(vocab, model_parameters, device_name='cuda')

_ = rnn_10_3_1.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
    print_every=1,
)

with open('rnn_10_3_1.pt', 'wb') as fh:
    rnn_10_3_1.save(fh)

In [None]:
%%time


param_grid = ParameterGrid({
    'embedding_dimension': [10, 50, 100],
    'hidden_dimension': [10, 50, 100],
})

records = []
for params in tqdm(param_grid):
    model_parameters = ModelParams(rnn_type='gru', num_layers=1, max_epochs=1000, early_stopping_rounds=1, **params)
    model = LanguageModel(vocab, model_parameters, device_name='cuda')

    print('Model Params:', model_parameters)
    
    train_losses, dev_losses = model.fit(
        train_pronunciations.pronunciation.values.tolist(),
        dev_pronunciations.pronunciation.values.tolist()
    )
    
    for epoch, (train_loss, dev_loss) in enumerate(zip(train_losses, dev_losses), start=1):
        record = params.copy()
        record['epoch'] = epoch
        record['train_loss'] = train_loss
        record['dev_loss'] = dev_loss
        
        record['rnn_type'] = 'gru'
        record['num_layers'] = 1 
    
        records.append(record)

models_df = pd.DataFrame.from_records(records)
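For a sense of scale, the grid in the cell above can be expanded by hand. This sketch uses `itertools.product` from the standard library instead of scikit-learn's `ParameterGrid`, and computes the upper bound on how many (model, epoch) rows could end up in `records` if no run stopped early.

```python
from itertools import product

grid = {
    "embedding_dimension": [10, 50, 100],
    "hidden_dimension": [10, 50, 100],
}

# Cartesian product of the value lists, one dict per combination.
names = sorted(grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(grid[name] for name in names))
]

MAX_EPOCHS = 1000
print(len(combinations), len(combinations) * MAX_EPOCHS)
# → 9 9000
```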

In [None]:
%%time


param_grid = ParameterGrid({
    'embedding_dimension': [50],
    'hidden_dimension': [2],
})

records = []
for params in tqdm(param_grid):
    model_parameters = ModelParams(rnn_type='rnn', num_layers=1, max_epochs=1000, early_stopping_rounds=1, **params)
    model = LanguageModel(vocab, model_parameters, device_name='cpu')

    print('Model Params:', model_parameters)
    
    train_losses, dev_losses = model.fit(
        train_pronunciations.pronunciation.values.tolist(),
        dev_pronunciations.pronunciation.values.tolist()
    )
    
    for epoch, (train_loss, dev_loss) in enumerate(zip(train_losses, dev_losses), start=1):
        record = params.copy()
        record['epoch'] = epoch
        record['train_loss'] = train_loss
        record['dev_loss'] = dev_loss
    
        records.append(record)

models_df = pd.DataFrame.from_records(records)

In [None]:
models_df = pd.DataFrame.from_records(records)

There are {meow} models with a dev error of around .76. I'll choose the simplest one, which 

* point out that no matter how low the train error gets, the dev error
* which model parameters fail to ever get to the lowest dev error
* which model parameters overfit the most

* isolate the group of models with about .76 dev error. choose the simplest one.
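One way to make "isolate the group with about the same dev error and choose the simplest one" concrete is sketched below. The tolerance, the definition of "simplest" (fewest embedding plus hidden units) and the records themselves are all assumptions made for illustration; none of the numbers are real results.

```python
def simplest_near_best(records, tolerance=0.01):
    """Among records within `tolerance` of the lowest dev loss, pick the smallest model."""
    best_loss = min(record["dev_loss"] for record in records)
    candidates = [r for r in records if r["dev_loss"] <= best_loss + tolerance]
    # "Simplest" is taken here to mean the fewest embedding + hidden units;
    # ties are broken by the lower dev loss.
    return min(
        candidates,
        key=lambda r: (r["embedding_dimension"] + r["hidden_dimension"], r["dev_loss"]),
    )


toy_records = [  # made-up numbers, for illustration only
    {"embedding_dimension": 100, "hidden_dimension": 100, "epoch": 12, "dev_loss": 0.760},
    {"embedding_dimension": 10, "hidden_dimension": 50, "epoch": 30, "dev_loss": 0.765},
    {"embedding_dimension": 10, "hidden_dimension": 10, "epoch": 41, "dev_loss": 0.900},
]
chosen = simplest_near_best(toy_records)
print(chosen)
```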

In [None]:
t = models_df[(models_df.embedding_dimension==50) & (models_df.hidden_dimension==100) & (models_df.num_layers==3)]
t = t.set_index('epoch')
t.dev_loss.plot()
t.train_loss.plot()
_ = plt.legend()

In [None]:
models_df.sort_values('dev_loss').iloc[0]

Since the best "model" was at an earlier epoch I don't have access to it. So I'll train a model with that model's parameters and set the number of epochs to MEOW.

In [None]:
%%time

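# Retrain with the selected hyperparameters, using the same
# ModelParams / LanguageModel API as the earlier cells.
model_parameters = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=100, num_layers=3,
    max_epochs=69, early_stopping_rounds=69, batch_size=1024,
)

model = LanguageModel(vocab, model_parameters, device_name='cuda')

train_loss, dev_loss = model.fit(
    train_pronunciations.pronunciation.values.tolist(),
    dev_pronunciations.pronunciation.values.tolist(),
)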

Now that I have what I hope is the best model I can test how well it does on the holdout test set, which I haven't looked at at all during the training process.

In [None]:
test_loader = build_data_loader(test_df.pronunciation.values.tolist(), lm.phoneme_to_idx)
lm.evaluate(test_loader)

In [None]:
dev_loader = build_data_loader(dev_df.pronunciation.values.tolist(), lm.phoneme_to_idx)
lm.evaluate(dev_loader)

TODO: compute the test error for the final model. Bar chart for the train, dev, and test errors

Comment on findings, probably that test is higher and that's expected because language models are very sensitive to corpus difference (and I probably overfit the dev set)

## Training the Final Model ##
Now that I've found the best parameters for the model according to the dev set I'll train a final model using all of the data. This should increase model performance overall since more data is better, but is also necessary since I'll be using the model to predict probabilities of all English words. If some of those words weren't in the training set they would artificially get lower probabilities. (Another approach here could be to train a model on e.g. 4/5 folds of the data and make predictions about the remaining 1/5, doing that 5 times to get unbiased predictions for all data, but this would have taken much longer to run.)
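The cross-fitting alternative mentioned in parentheses can be sketched as a fold generator. This is not something I ran; `k_fold_splits` is a hypothetical helper that works on indices, and real data would need to be shuffled first so that the folds aren't tied to alphabetical order.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, heldout_indices) so each item is held out exactly once."""
    indices = list(range(n_items))
    for fold in range(k):
        heldout = indices[fold::k]  # every k-th item, offset by the fold number
        heldout_set = set(heldout)
        train = [idx for idx in indices if idx not in heldout_set]
        yield train, heldout


folds = list(k_fold_splits(10, k=5))
for train, heldout in folds:
    print(len(train), heldout)
```

A model would be fit on each `train` list and used to score only the matching `heldout` items, so every word is scored by a model that never saw it.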

In [None]:
%%time

model_params = ModelParams(
    rnn_type='gru', embedding_dimension=50, hidden_dimension=50, num_layers=1,
    max_epochs=3, early_stopping_rounds=3, batch_size=1024
)

language_model = LanguageModel(vocab, model_params, device_name='cpu')

_ = language_model.fit(pronunciations.pronunciation.values)

And finally, I'll save the model so I can use it in the next notebook, `Phoneme Exploration.ipynb`.

In [None]:
with open('phoneme_language_model.pt', 'wb') as fh:
    language_model.save(fh)
