<a href="https://colab.research.google.com/github/BYU-Handwriting-Lab/GettingStarted/blob/solution/notebooks/language-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Model

This notebook provides code to create a language model in TensorFlow, with
character-level and word-level variants.

### Dependencies

Import the necessary dependencies and download our character set and corpus.

In [64]:
import tensorflow as tf

import string
import re

import json
import numpy as np
import pandas as pd
from tqdm import tqdm

In [65]:
!wget -q https://raw.githubusercontent.com/ericburdett/named-entity-recognition/master/char_set.json
!wget -q --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ" -O french_ner_dataset.csv && rm -rf /tmp/cookies.txt

In [66]:
# ID: 1M26Gpca8Ug4YvRLxoUDDCjMBeJtojITY
!wget -q --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1wDMLz9hTmfvPhkhCHTylbeAU6Utpkqb1' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1wDMLz9hTmfvPhkhCHTylbeAU6Utpkqb1" -O french_text.txt && rm -rf /tmp/cookies.txt

## Load the Corpus

Define some constants to help us know which characters are used for words and
which are used for punctuation/digits.

Load the corpus to be used for tokenization and dataset creation.

In [271]:
DEFAULT_CHARS = ' !"#$%&\'()*+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz|~£§¨«¬\xad' \
                '°²´·º»¼½¾ÀÂÄÇÈÉÊÔÖÜßàáâäæçèéêëìîïñòóôöøùúûüÿłŒœΓΖΤάήαδεηικλμνξοπρτυχψωόώІ‒–—†‡‰‹›₂₤℔⅓⅔⅕⅖⅗⅘⅙⅚⅛∆∇∫≠□♀♂✓ｆ'
# The default list of non-punctuation characters needed for the word beam search decoding algorithm
DEFAULT_NON_PUNCTUATION = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÂÄÇÈÉÊÔÖÜßàáâäæçèéêëìîïñòóôöøùúûüÿ' \
                          'łŒœΓΖΤάήαδεηικλμνξοπρτυχψωόώІ'

DEFAULT_PUNCTUATION = string.punctuation + '0123456789'

In [283]:
# Read the corpus with a context manager so the file handle is closed afterwards
with open('french_text.txt', 'r', encoding='utf8') as f:
    lines = f.readlines()

french_words = []
for line in lines:
    french_words.extend(line.split())
french_words = ' '.join(french_words)

## Tokenization

One of the hardest parts of building a language model is designing a good
tokenization scheme.

This tokenizer will create a token for each word. Each punctuation or digit
character will have its own token.

In [282]:
class Tokenizer:
    def __init__(self, corpus, word_chars, punctuation, lower=False):
        self.word_chars = word_chars
        self.punctuation = punctuation
        self.regex = r"[" + self.word_chars + r"]+|[^\s]" 

        words = self.split(corpus)
        all_words_list = words + list(punctuation)
        all_words_list_unique = list(set(all_words_list))
        all_words = [' '.join(all_words_list_unique)]

        self.total_tokens = len(all_words_list_unique) + 2 # +2 to account for 0 (reserved) and 1 (OOV)
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.total_tokens, filters='', lower=lower, oov_token='<OOV>')
        self.tokenizer.fit_on_texts(all_words)

    def split(self, text):
        return re.findall(self.regex, text)

    def texts_to_sequences(self, text):
        words = self.split(text)
        return self.tokenizer.texts_to_sequences([' '.join(words)])

In [299]:
tokenizer = Tokenizer(french_words, DEFAULT_NON_PUNCTUATION, DEFAULT_PUNCTUATION, lower=False)
sentence = 'acte de deces-de..1832(hello)5eme'

print('Original Sentence:', sentence)
print('Split Sentence:', tokenizer.split(sentence))
print('Tokenized Sentence:', tokenizer.texts_to_sequences(sentence))

Original Sentence: acte de deces-de..1832(hello)5eme
Split Sentence: ['acte', 'de', 'deces', '-', 'de', '.', '.', '1', '8', '3', '2', '(', 'hello', ')', '5', 'eme']
Tokenized Sentence: [[6569, 19780, 31579, 18946, 19780, 21148, 21148, 31723, 41758, 18333, 16325, 40352, 1, 34224, 21550, 32853]]
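
The splitting and out-of-vocabulary behaviour shown above can be reproduced without TensorFlow. The following is a minimal sketch, not the notebook's tokenizer: `WORD_CHARS` is an ASCII-only stand-in for `DEFAULT_NON_PUNCTUATION`, and `build_vocab` and `encode` are hypothetical helpers that assign indices in sorted order (the Keras tokenizer orders them differently). It keeps the same convention of reserving index 0 and using index 1 for unknown tokens.

```python
import re

# Assumption: a small ASCII-only word-character set stands in for
# DEFAULT_NON_PUNCTUATION to keep the sketch short.
WORD_CHARS = "A-Za-z"
PATTERN = re.compile(r"[" + WORD_CHARS + r"]+|[^\s]")


def split(text):
    """Return runs of word characters, and every other non-space character on its own."""
    return PATTERN.findall(text)


def build_vocab(tokens):
    """Hypothetical helper: map each distinct token to an index starting at 2.

    Index 0 is left reserved and index 1 is used for out-of-vocabulary tokens.
    """
    return {tok: i for i, tok in enumerate(sorted(set(tokens)), start=2)}


def encode(text, vocab, oov_index=1):
    """Hypothetical helper: look up each token, falling back to the OOV index."""
    return [vocab.get(tok, oov_index) for tok in split(text)]


# Toy corpus that does not contain the word "hello".
vocab = build_vocab(split("acte de deces-de..1832(5eme)"))

sentence = "acte de deces-de..1832(hello)5eme"
tokens = split(sentence)
ids = encode(sentence, vocab)

print(tokens)
print(ids)
```

Because `hello` never appears in the toy corpus, it is encoded as 1, just as it is in the tokenized sentence above.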


In [294]:
embedding = tf.keras.layers.Embedding(tokenizer.total_tokens, 1024)

sequence = tokenizer.texts_to_sequences(sentence)
sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=1)

tf.squeeze(embedding(sequence))

<tf.Tensor: shape=(1024,), dtype=float32, numpy=
array([ 0.02524426,  0.007955  ,  0.04026515, ...,  0.04701746,
       -0.03919454, -0.00287429], dtype=float32)>

## Dataset Creation

Create the TensorFlow dataset using the tokenizer created above.

In [302]:
type(french_words)

str

In [308]:
# Tokenize the entire corpus
tokenized_french_words = tokenizer.texts_to_sequences(french_words)[0]

# Create the dataset
dataset = tf.data.Dataset.from_tensor_slices(tokenized_french_words)

# Show one batch of 100 words
for word in dataset.batch(100).take(1):
    print(word)

tf.Tensor(
[ 6167 17775 25002 23978 26367 16754 15464 21030 19780 27240 31723 41758
 40384 40384 10216  9598 27682 26251 31723 32372   241 16325 16325 21550
  2370 16325 41742  1875 18333 31723 21550 29930 18333 13718 28333 28461
 31723 31723 18333  2706 35201 40743 25438  7105 13058  2558 39347 21945
 17621 19780 28323 10717 23790  7105 13296  7105 18333 31723 29930 31723
 41758 40384 40384 39812 35535  9620 34372 18861 18946 19626 40352 16688
 34224 41168 41168 20319 27518 21123 16688 22122   910 33712 29826  3617
  9907  9617 22951  7105 27436  5644 19780 41553 20157  9907 28010 41584
 20273  9907 30527  9598], shape=(100,), dtype=int32)


In [304]:
tf.constant(tokenizer.texts_to_sequences(french_words)[0][:100])

<tf.Tensor: shape=(100,), dtype=int32, numpy=
array([ 6167, 17775, 25002, 23978, 26367, 16754, 15464, 21030, 19780,
       27240, 31723, 41758, 40384, 40384, 10216,  9598, 27682, 26251,
       31723, 32372,   241, 16325, 16325, 21550,  2370, 16325, 41742,
        1875, 18333, 31723, 21550, 29930, 18333, 13718, 28333, 28461,
       31723, 31723, 18333,  2706, 35201, 40743, 25438,  7105, 13058,
        2558, 39347, 21945, 17621, 19780, 28323, 10717, 23790,  7105,
       13296,  7105, 18333, 31723, 29930, 31723, 41758, 40384, 40384,
       39812, 35535,  9620, 34372, 18861, 18946, 19626, 40352, 16688,
       34224, 41168, 41168, 20319, 27518, 21123, 16688, 22122,   910,
       33712, 29826,  3617,  9907,  9617, 22951,  7105, 27436,  5644,
       19780, 41553, 20157,  9907, 28010, 41584, 20273,  9907, 30527,
        9598], dtype=int32)>

### Model Creation

Build our simple model that includes an embedding layer, recurrent layer, and
dense layer to get us down to the number of classes.

In [314]:
class LanguageModel(tf.keras.Model):
    def __init__(self, vocab_size=199):
        super(LanguageModel, self).__init__()

        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=512)
        self.gru = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'))
        self.dense = tf.keras.layers.Dense(vocab_size)
        self.softmax = tf.keras.layers.Softmax()
    
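    def call(self, x):
        # x: 1-D tensor of token ids, shape (sequence_length,)
        x = self.embedding(x)
        print(x.shape)
        x = tf.expand_dims(x, 0)  # Add a batch dimension of size 1
        print(x.shape)
        x = self.gru(x)
        print(x.shape)
        x = self.dense(x)
        print(x.shape)
        x = self.softmax(x)
        print(x.shape)
        # Remove only the batch dimension, so that a length-1 sequence keeps
        # the shape (1, vocab_size) instead of collapsing to (vocab_size,)
        x = tf.squeeze(x, axis=0)
        print(x.shape)

        return x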

In [315]:
model = LanguageModel(vocab_size=tokenizer.total_tokens)

for word in dataset.batch(100).take(1):
    output = model(word)
    print(output)

(100, 512)
(1, 100, 512)
(1, 100, 256)
(1, 100, 42422)
(1, 100, 42422)
(100, 42422)
tf.Tensor(
[[2.3500746e-05 2.3613335e-05 2.3589833e-05 ... 2.3504810e-05
  2.3536553e-05 2.3515748e-05]
 [2.3552744e-05 2.3658771e-05 2.3513856e-05 ... 2.3528448e-05
  2.3576778e-05 2.3596218e-05]
 [2.3551389e-05 2.3657112e-05 2.3566507e-05 ... 2.3595325e-05
  2.3592183e-05 2.3561002e-05]
 ...
 [2.3526287e-05 2.3549321e-05 2.3616652e-05 ... 2.3522518e-05
  2.3556988e-05 2.3592453e-05]
 [2.3470309e-05 2.3575378e-05 2.3558294e-05 ... 2.3535453e-05
  2.3604551e-05 2.3590810e-05]
 [2.3587469e-05 2.3599499e-05 2.3574377e-05 ... 2.3609020e-05
  2.3578086e-05 2.3566294e-05]], shape=(100, 42422), dtype=float32)
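
The near-identical probabilities in this output are what an untrained softmax over 42422 classes should produce. As a quick sanity check in plain Python (a sketch, with `vocab_size` copied from the printed shape), the uniform probability and the corresponding cross-entropy can be computed directly:

```python
import math

# Vocabulary size taken from the printed output shape (100, 42422).
vocab_size = 42422

# A softmax with no learned preference spreads probability evenly.
uniform_prob = 1.0 / vocab_size

# Cross-entropy of a uniform prediction, in nats: -log(1 / vocab_size).
initial_loss = math.log(vocab_size)

print(f"uniform probability: {uniform_prob:.4e}")
print(f"expected initial loss: {initial_loss:.2f}")
```

The uniform probability comes out at about 2.357e-05, in line with the values printed above, and the matching loss of roughly 10.66 is a useful baseline for judging whether training is making progress on this vocabulary.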


Test it out just to make sure it works.

In [None]:
model = LanguageModel(vocab_size=199)

sequence = tf.constant(np.random.randint(0, 199, size=(100)))
output = model(sequence)

print('Sequence:', sequence.shape)
print('Output:', output.shape)

Sequence: (100,)
Output: (100, 199)


### Train the Model

Train the model based on the text in our corpus.

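The goal is to predict the next token (a character or a word, depending on the
dataset). Thus, the target is the input tensor rolled left by one position
with `tf.roll`. Note that the roll wraps around, so the final target is the
first token of the sequence.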
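
To make the input/target alignment concrete, here is a small sketch in plain Python. `roll_left` is a hypothetical helper that reproduces the ordering of `tf.roll(x, -1, 0)` on a 1-D sequence; the token ids are made up for illustration.

```python
def roll_left(seq):
    """Hypothetical helper: same ordering as tf.roll(seq, -1, 0) on a 1-D sequence."""
    return seq[1:] + seq[:1]


tokens = [10, 11, 12, 13]      # made-up token ids
targets = roll_left(tokens)    # each position now holds the following token

# Pair every input token with the token the model should predict next.
pairs = list(zip(tokens, targets))

# The last pair wraps around (last token -> first token), so it is not a
# real "next token" example; dropping it leaves only genuine transitions.
valid_pairs = pairs[:-1]

print(targets)
print(valid_pairs)
```

Whether the wrapped pair matters in practice depends on the sequence length; for long sequences it is a single noisy example per sequence.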

In [None]:
@tf.function(experimental_relax_shapes=True)
def process_sentence(sentence, target):
    with tf.GradientTape() as tape:
        output = model(sentence)
        loss = loss_fn(target, output)
        
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(target, tf.argmax(output, axis=1))

epochs = 50
dataset = tf.data.Dataset.from_tensor_slices(sentences_tensor)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Accuracy(name='train_accuracy')

for epoch in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    train_loop = tqdm(total=len(dataset), position=0, leave=True)
    for sentence in dataset:
        model.gru.reset_states()

        process_sentence(sentence, tf.roll(sentence, -1, 0))
        train_loop.set_description('Train - Epoch: {}, Loss: {:.4f}, Accuracy: {:.4f}'.format(epoch, train_loss.result(), train_accuracy.result()))
        train_loop.update(1)

Train - Epoch: 0, Loss: 3.8689, Accuracy: 0.1690: 100%|██████████| 44/44 [00:04<00:00, 10.61it/s]
Train - Epoch: 1, Loss: 2.9965, Accuracy: 0.1984: 100%|██████████| 44/44 [00:01<00:00, 33.77it/s]
Train - Epoch: 2, Loss: 2.7309, Accuracy: 0.2724: 100%|██████████| 44/44 [00:01<00:00, 33.65it/s]
Train - Epoch: 3, Loss: 2.4409, Accuracy: 0.3254: 100%|██████████| 44/44 [00:01<00:00, 33.39it/s]
Train - Epoch: 4, Loss: 2.2543, Accuracy: 0.3645: 100%|██████████| 44/44 [00:01<00:00, 31.72it/s]
Train - Epoch: 6, Loss: 2.0014, Accuracy: 0.4358: 100%|██████████| 44/44 [00:01<00:00, 34.09it/s]
Train - Epoch: 7, Loss: 1.8863, Accuracy: 0.4674: 100%|██████████| 44/44 [00:01<00:00, 33.45it/s]
Train - Epoch: 8, Loss: 1.7707, Accuracy: 0.5109: 100%|██████████| 44/44 [00:01<00:00, 33.22it/s]
Train - Epoch: 9, Loss: 1.6587, Accuracy: 0.5438: 100%|██████████| 44/44 [00:01<00:00, 33.36it/s]
Train - Epoch: 10, Loss: 1.5501, Accuracy: 0.5774: 100%|██████████| 44/44 [00:01<00:00, 32.95it/s]
Train - Epoch: 11, 

KeyboardInterrupt: ignored

### Character-Level Results

Observe the results by generating text one character at a time.

Run this code block if you chose the character-level dataset.

In [None]:
input = tf.constant([197])
string_output = ''
k = 2
model.gru.reset_states()
for _ in range(200):  # Max number of iterations
    output = model(input)
    char_idx = np.random.choice(tf.math.top_k(output, k=k).indices.numpy()[0])
    if char_idx == 198:
        break
    string_output += mapper.idx_to_char(char_idx)
    input = tf.constant([char_idx])

print(string_output)

c'une dux-sevatier à civellie he Mremen de querarancisquin maite de Stint née àa Marine mandien Avels neur en sept ans, tors apons du secour en,, dé laven apés neufante du sorre et de la cinq hons som
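
The sampling step used during generation can be isolated from the model. The sketch below uses only the standard library; `sample_top_k` is a hypothetical helper, and the probability vector is made up. Like calling `np.random.choice` on the top-k indices, it picks uniformly among the `k` most likely entries rather than weighting them by probability.

```python
import random


def sample_top_k(probs, k, rng=random):
    """Hypothetical helper: pick uniformly among the k most probable indices."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return rng.choice(ranked[:k])


# Made-up distribution over a 4-symbol vocabulary.
probs = [0.10, 0.50, 0.05, 0.35]

# With k=2 only the two most likely indices (1 and 3) can ever be drawn.
samples = {sample_top_k(probs, k=2) for _ in range(100)}
print(samples)

# With k=1 the procedure reduces to greedy decoding.
greedy = sample_top_k(probs, k=1)
print(greedy)
```

A small `k` keeps the output close to the most likely continuation, while a larger `k` adds variety at the cost of more implausible choices.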


### Word-Level Results

Observe the results by generating text one word at a time.

Run this code block if you chose the word-level dataset.

In [None]:
input = tf.constant([1042])  # Start token
k = 30
model.gru.reset_states()
sequences = []
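for _ in range(15):  # Max number of words
    output = model(input)
    word_idx = np.random.choice(tf.math.top_k(output, k=k).indices.numpy()[0])
    if word_idx == 1043:
        break
    sequences.append(word_idx)
    input = tf.constant([word_idx])  # Feed the sampled word back in as the next input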

print(tokenizer.sequences_to_texts([sequences]))

['huit la mil en deux deux à Francois trente quatre la en deux à cinq']
