<a href="https://colab.research.google.com/github/BYU-Handwriting-Lab/GettingStarted/blob/solution/notebooks/language-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Model

This notebook provides code to create a character-level language model in 
TensorFlow.

### Dependencies

Import the necessary dependencies and download our character set and corpus.

In [1]:
import tensorflow as tf

import json
import numpy as np
import pandas as pd
from tqdm import tqdm

In [2]:
!wget -q https://raw.githubusercontent.com/ericburdett/named-entity-recognition/master/char_set.json
!wget -q --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ" -O french_ner_dataset.csv && rm -rf /tmp/cookies.txt

### Character Set Mapping

Create a Character Set Mapper to go between string and integer representations.

Specify the start and end character tokens. These mark sentence boundaries
when feeding sentences into our language model.

In [4]:
class CharsetMapper():
    def __init__(self, filepath='char_set.json', sequence_size=20, start_char=197, end_char=198):
        self.start_char = start_char
        self.end_char = end_char
        with open(filepath) as f:
            self.char_dict = json.load(f)
    
    def char_to_idx(self, char):
        if char in self.char_dict['char_to_idx']:
            return int(self.char_dict['char_to_idx'][char])
        else:
            return 0
  
    def idx_to_char(self, idx):
        if str(int(idx)) in self.char_dict['idx_to_char']:
            return self.char_dict['idx_to_char'][str(int(idx))]
        else:
            return ''
  
    def str_to_idxs(self, string):
        assert isinstance(string, str)

        idxs = [self.start_char]
        for char in string:
            idxs.append(self.char_to_idx(char))
        idxs.append(self.end_char)

        return np.array(idxs)
  
    def idxs_to_str(self, idxs):
        chars = ''

        for idx in idxs:
            chars += self.idx_to_char(idx)
    
        return chars
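
As a quick illustration of the round trip, the sketch below mimics the mapper
with a tiny in-memory character set instead of `char_set.json`. The
three-character vocabulary and the token ids 4 and 5 are made up for the
example; the notebook itself uses 197 and 198 and the downloaded JSON file.

```python
import numpy as np

# Hypothetical miniature character set; the real one is loaded from char_set.json.
char_to_idx = {'a': 1, 'b': 2, 'c': 3}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}
START, END = 4, 5  # Stand-ins for the notebook's 197 and 198

def str_to_idxs(string):
    # Unknown characters map to 0, mirroring CharsetMapper.char_to_idx.
    body = [char_to_idx.get(char, 0) for char in string]
    return np.array([START] + body + [END])

def idxs_to_str(idxs):
    # Indices with no entry in this toy table (0, START, END) decode to ''.
    return ''.join(idx_to_char.get(int(idx), '') for idx in idxs)

encoded = str_to_idxs('abz')
decoded = idxs_to_str(encoded)
print(encoded)  # [4 1 2 0 5]
print(decoded)  # ab
```

Note that the round trip is lossy for characters outside the set: `'z'` is
encoded as 0 and disappears when decoding.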

### Dataset Creation

Create our dataset by reading from the CSV using pandas, joining sentences, and
mapping char representations to integer representations.

Notice the use of tf.ragged.constant. This allows us to create a tensor with
unequal sequence lengths. Without this, we would be forced to use padding so
that our sequence lengths would be constant.
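
To make the trade-off concrete, here is a small NumPy sketch of the padding
that a ragged tensor lets us skip. The two toy sequences and the pad value 0
are assumptions for illustration. Since 0 is also the index `CharsetMapper`
returns for unknown characters, a padded pipeline would likely need a mask (as
below) or a dedicated pad index.

```python
import numpy as np

# Two toy sequences of unequal length (made-up values).
sequences = [np.array([197, 12, 7, 198]), np.array([197, 3, 198])]

lengths = np.array([len(seq) for seq in sequences])
max_len = lengths.max()

# Pad every sequence on the right with 0 so they stack into one rectangular array.
padded = np.zeros((len(sequences), max_len), dtype=np.int64)
for row, seq in enumerate(sequences):
    padded[row, :len(seq)] = seq

# A mask records which positions hold real tokens rather than padding.
mask = np.arange(max_len)[None, :] < lengths[:, None]

print(padded)
print(mask)
```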

In [5]:
mapper = CharsetMapper()

df = pd.read_csv('french_ner_dataset.csv', sep='\t', header=None, names=['word', 'entity', 'id'])
df_size = df['id'].max()

sentences_str = []
sentences = []
for i in range(df_size):
    ith_sentence_words = df.loc[df['id'] == i]
    sentence = " ".join(ith_sentence_words['word'].to_list())
    sentences_str.append(sentence)
    sentences.append(mapper.str_to_idxs(sentence))

sentences_tensor = tf.ragged.constant(sentences)

### Word-Level Tokenization

We can also tokenize at the word level instead of the character level.

In [66]:
words = list(set(df['word'].values))

tokenizer = tf.keras.preprocessing.text.Tokenizer(len(words), filters='', lower=False)
tokenizer.fit_on_texts(words)

phrase = 'L\'an mil huit cent'
sequences = tokenizer.texts_to_sequences([phrase])
back_to_phrase = tokenizer.sequences_to_texts(sequences)[0]

print('Total Tokens:', len(words))
print('Original Phrase:', phrase)
print('Tokenized:', sequences)
print('Round Trip:', back_to_phrase)

Total Tokens: 1042
Original Phrase: L'an mil huit cent
Tokenized: [[195, 782, 935, 592]]
Round Trip: L'an mil huit cent


In [108]:
tokenized_sentences = tokenizer.texts_to_sequences(sentences_str)

new_sentences = [[1042] + sentence + [1043] for sentence in tokenized_sentences]
sentences_tensor = tf.ragged.constant(new_sentences)
sentences_tensor

<tf.RaggedTensor [[1042, 195, 782, 935, 592, 557, 622, 719, 108, 634, 448, 19, 1027, 107, 83, 984, 448, 43, 176, 885, 26, 255, 270, 660, 912, 1001, 278, 535, 1001, 983, 435, 1001, 92, 310, 344, 108, 776, 406, 1001, 279, 174, 1001, 372, 1008, 448, 776, 536, 977, 138, 176, 285, 410, 1001, 385, 230, 256, 719, 138, 41, 384, 410, 1001, 143, 303, 256, 416, 23, 31, 144, 20, 957, 297, 719, 202, 448, 370, 495, 26, 599, 718, 115, 423, 249, 107, 230, 984, 448, 840, 138, 176, 285, 816, 851, 92, 310, 344, 108, 776, 108, 696, 83, 832, 782, 935, 592, 263, 42, 202, 861, 719, 1001, 567, 1024, 1001, 400, 765, 920, 807, 671, 20, 969, 271, 107, 92, 310, 344, 108, 776, 77, 115, 26, 26, 20, 941, 311, 719, 108, 440, 599, 460, 210, 26, 108, 242, 652, 608, 770, 852, 138, 41, 384, 138, 176, 285, 817, 660, 862, 1043], [1042, 195, 782, 935, 592, 634, 263, 696, 303, 719, 108, 696, 935, 122, 107, 634, 984, 448, 840, 942, 885, 725, 255, 270, 660, 1004, 890, 245, 617, 1001, 278, 535, 1001, 983, 435, 1001, 123, 310, 3

### Model Creation

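Build a simple model with an embedding layer, a recurrent (GRU) layer, and a
dense layer that projects down to the number of classes. The default
vocab_size of 199 covers every index up to the end token (198). A model for the
word-level dataset needs vocab_size=1044 so that the start (1042) and end
(1043) tokens are valid indices.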

In [6]:
class LanguageModel(tf.keras.Model):
    def __init__(self, vocab_size=199):
        super(LanguageModel, self).__init__()

        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128)
        self.gru = tf.keras.layers.GRU(128, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform')
        self.dense = tf.keras.layers.Dense(vocab_size)
        self.softmax = tf.keras.layers.Softmax()
    
    def call(self, x):
        x = self.embedding(x)
        x = tf.expand_dims(x, 0)
        x = self.gru(x)
        x = self.dense(x)
        x = self.softmax(x)
        x = tf.squeeze(x, 0)

        return x

Test it out just to make sure it works.

In [8]:
model = LanguageModel(vocab_size=199)

sequence = tf.constant(np.random.randint(0, 199, size=(100)))
output = model(sequence)

print('Sequence:', sequence.shape)
print('Output:', output.shape)

Sequence: (100,)
Output: (100, 199)


### Train the Model

Train the model based on the text in our corpus.

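The goal is to predict the next token. Thus, the target is the input tensor
rolled left by one position. Note that tf.roll wraps around, so the target for
the final (end) token is the sequence's first token, the start token.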
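
The input/target relationship can be checked on a toy sequence. This sketch
uses `np.roll`, which shifts with wraparound in the same way as `tf.roll`. The
token values are made up apart from the start (197) and end (198) tokens.

```python
import numpy as np

# Toy encoded sentence: start token, three characters, end token.
sentence = np.array([197, 10, 11, 12, 198])

# Shift left by one so position i of the target holds token i + 1 of the input.
target = np.roll(sentence, -1)

# Print each input token next to the token the model should predict for it.
for current, following in zip(sentence, target):
    print(current, '->', following)
```

The last pair wraps around (198 -> 197), so the end token is trained to
predict the start token. That is harmless when generation stops at the end
token, but it does count toward the reported loss and accuracy.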

In [9]:
@tf.function(experimental_relax_shapes=True)
def process_sentence(sentence, target):
    with tf.GradientTape() as tape:
        output = model(sentence)
        loss = loss_fn(target, output)
        
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(target, tf.argmax(output, axis=1))

epochs = 50
dataset = tf.data.Dataset.from_tensor_slices(sentences_tensor)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Accuracy(name='train_accuracy')

for epoch in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    train_loop = tqdm(total=len(dataset), position=0, leave=True)
    for sentence in dataset:
        model.gru.reset_states()

        process_sentence(sentence, tf.roll(sentence, -1, 0))
        train_loop.set_description('Train - Epoch: {}, Loss: {:.4f}, Accuracy: {:.4f}'.format(epoch, train_loss.result(), train_accuracy.result()))
        train_loop.update(1)

Train - Epoch: 0, Loss: 3.8689, Accuracy: 0.1690: 100%|██████████| 44/44 [00:04<00:00, 10.61it/s]
Train - Epoch: 1, Loss: 2.9965, Accuracy: 0.1984: 100%|██████████| 44/44 [00:01<00:00, 33.77it/s]
Train - Epoch: 2, Loss: 2.7309, Accuracy: 0.2724: 100%|██████████| 44/44 [00:01<00:00, 33.65it/s]
Train - Epoch: 3, Loss: 2.4409, Accuracy: 0.3254: 100%|██████████| 44/44 [00:01<00:00, 33.39it/s]
Train - Epoch: 4, Loss: 2.2543, Accuracy: 0.3645: 100%|██████████| 44/44 [00:01<00:00, 31.72it/s]
Train - Epoch: 6, Loss: 2.0014, Accuracy: 0.4358: 100%|██████████| 44/44 [00:01<00:00, 34.09it/s]
Train - Epoch: 7, Loss: 1.8863, Accuracy: 0.4674: 100%|██████████| 44/44 [00:01<00:00, 33.45it/s]
Train - Epoch: 8, Loss: 1.7707, Accuracy: 0.5109: 100%|██████████| 44/44 [00:01<00:00, 33.22it/s]
Train - Epoch: 9, Loss: 1.6587, Accuracy: 0.5438: 100%|██████████| 44/44 [00:01<00:00, 33.36it/s]
Train - Epoch: 10, Loss: 1.5501, Accuracy: 0.5774: 100%|██████████| 44/44 [00:01<00:00, 32.95it/s]
Train - Epoch: 11, 

KeyboardInterrupt: ignored

### Character-Level Results

Observe the results by generating text one character at a time.

Run this code block if you chose the character-level dataset.

In [16]:
input = tf.constant([197])
string_output = ''
k = 2
model.gru.reset_states()
for _ in range(200):  # Max number of iterations
    output = model(input)
    char_idx = np.random.choice(tf.math.top_k(output, k=k).indices.numpy()[0])
    if char_idx == 198:
        break
    string_output += mapper.idx_to_char(char_idx)
    input = tf.constant([char_idx])

print(string_output)

c'une dux-sevatier à civellie he Mremen de querarancisquin maite de Stint née àa Marine mandien Avels neur en sept ans, tors apons du secour en,, dé laven apés neufante du sorre et de la cinq hons som
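
The sampling step in the loop above picks uniformly among the `k` most
probable next characters rather than always taking the single best one. A
NumPy sketch of that step, with a made-up probability vector (the values and
the helper name `sample_top_k` are illustrative only):

```python
import numpy as np

def sample_top_k(probs, k, rng):
    # Indices of the k largest probabilities (order within the top k is irrelevant).
    top_indices = np.argsort(probs)[-k:]
    # Uniform choice among them, like np.random.choice called without `p`.
    return int(rng.choice(top_indices))

probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])  # Made-up distribution
rng = np.random.default_rng(0)

samples = [sample_top_k(probs, k=2, rng=rng) for _ in range(20)]
print(sorted(set(samples)))
```

Because the choice is uniform, the model's probabilities only decide which
candidates make the top `k`, not how often each one is drawn. A larger `k`
typically gives more varied text at the cost of more unlikely characters.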


### Word-Level Results

Observe the results by generating text one word at a time.

Run this code block if you chose the word-level dataset.

In [140]:
input = tf.constant([1042])  # Start token
k = 30
model.gru.reset_states()
sequences = []
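for _ in range(15):  # Max number of words
    output = model(input)
    word_idx = np.random.choice(tf.math.top_k(output, k=k).indices.numpy()[0])
    if word_idx == 1043:  # End token
        break
    sequences.append(word_idx)
    input = tf.constant([word_idx])  # Feed the sampled word back in as the next input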

print(tokenizer.sequences_to_texts([sequences]))

['huit la mil en deux deux à Francois trente quatre la en deux à cinq']
