<a href="https://colab.research.google.com/github/BYU-Handwriting-Lab/GettingStarted/blob/master/notebooks/language-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Model

This notebook provides code to create a character-level language model in 
TensorFlow.

### Dependencies

Import the necessary dependencies and download our character set and corpus.

In [2]:
import tensorflow as tf

import json
import numpy as np
import pandas as pd
from tqdm import tqdm

In [2]:
!wget -q https://raw.githubusercontent.com/ericburdett/named-entity-recognition/master/char_set.json
!wget -q --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZsJ8cZSDU98GpcK-kl_Cq3eTt-R2YvSJ" -O french_ner_dataset.csv && rm -rf /tmp/cookies.txt

### Character Set Mapping

Create a character set mapper that converts between string and integer
representations.

Specify the start and end tokens (indices 197 and 198 by default). They mark
the sentence boundaries when we feed sentences into our language model.

In [3]:
class CharsetMapper:
    def __init__(self, filepath='char_set.json', sequence_size=20, start_char=197, end_char=198):
        self.start_char = start_char
        self.end_char = end_char
        with open(filepath) as f:
            self.char_dict = json.load(f)
    
    def char_to_idx(self, char):
        if char in self.char_dict['char_to_idx']:
            return int(self.char_dict['char_to_idx'][char])
        else:
            return 0
  
    def idx_to_char(self, idx):
        if str(int(idx)) in self.char_dict['idx_to_char']:
            return self.char_dict['idx_to_char'][str(int(idx))]
        else:
            return ''
  
    def str_to_idxs(self, string):
        assert isinstance(string, str)

        idxs = [self.start_char]
        for char in string:
            idxs.append(self.char_to_idx(char))
        idxs.append(self.end_char)

        return np.array(idxs)
  
    def idxs_to_str(self, idxs):
        chars = ''

        for idx in idxs:
            chars += self.idx_to_char(idx)
    
        return chars
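
To see the mapping round trip without downloading `char_set.json`, here is a minimal sketch that uses a toy in-memory dictionary. The three-character vocabulary, the token ids 4 and 5, and the helper names `encode` and `decode` are assumptions made for illustration; the real file is only assumed to share the `char_to_idx` / `idx_to_char` layout that `CharsetMapper` reads.

```python
import json

# Toy character set (hypothetical); the real one comes from char_set.json
toy_char_set = {
    'char_to_idx': {'a': 1, 'b': 2, 'c': 3},
    'idx_to_char': {'1': 'a', '2': 'b', '3': 'c'},
}
# Round-trip through JSON to mimic loading the file (JSON object keys are strings)
char_dict = json.loads(json.dumps(toy_char_set))

START, END = 4, 5  # toy boundary tokens (the notebook defaults to 197 and 198)

def encode(text):
    """Map a string to indices, wrapped in start/end tokens; unknown chars become 0."""
    return [START] + [int(char_dict['char_to_idx'].get(ch, 0)) for ch in text] + [END]

def decode(idxs):
    """Map indices back to a string; indices without a character are skipped."""
    return ''.join(char_dict['idx_to_char'].get(str(int(i)), '') for i in idxs)

encoded = encode('cab?')
decoded = decode(encoded)
print(encoded)  # [4, 3, 1, 2, 0, 5]
print(decoded)  # cab
```

Unknown characters map to index 0, and any index missing from `idx_to_char` (here 0 and the two boundary tokens) decodes to an empty string, so the round trip silently drops them.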

### Dataset Creation

Create our dataset by reading from the CSV using pandas, joining sentences, and
mapping char representations to integer representations.
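
As a rough illustration of the joining step, the sketch below builds sentences from a tiny in-memory table with the same three columns (`word`, `entity`, `id`). The rows and entity labels are made up for the example and are not taken from the real corpus.

```python
import io
import pandas as pd

# Hypothetical tab-separated rows: word, entity label, sentence id
toy_csv = (
    "Jean\tB-PER\t0\n"
    "Dupont\tI-PER\t0\n"
    "né\tO\t1\n"
    "à\tO\t1\n"
    "Paris\tB-LOC\t1\n"
)
toy_df = pd.read_csv(io.StringIO(toy_csv), sep='\t', header=None,
                     names=['word', 'entity', 'id'])

# Join the words of each sentence id with single spaces
toy_sentences = toy_df.groupby('id', sort=True)['word'].apply(' '.join).tolist()
print(toy_sentences)  # ['Jean Dupont', 'né à Paris']
```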

Notice the use of tf.ragged.constant. It lets us build a tensor whose rows
have different lengths. Without it, we would have to pad every sentence to a
common length.
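
The difference between a ragged tensor and a padded one can be sketched on two toy sequences. The values below are arbitrary placeholders, and padding with 0 is an assumption chosen only to show what the padded alternative would look like.

```python
import tensorflow as tf

# Two toy "sentences" of different lengths (start token 197, end token 198)
toy_sequences = [[197, 5, 6, 198], [197, 7, 198]]

ragged = tf.ragged.constant(toy_sequences)
print(ragged.shape)  # (2, None): the second dimension varies per row

row_lengths = ragged.row_lengths().numpy().tolist()
print(row_lengths)  # [4, 3]

# The padded alternative: every row is extended to the longest length
padded_list = ragged.to_tensor(default_value=0).numpy().tolist()
print(padded_list)  # [[197, 5, 6, 198], [197, 7, 198, 0]]
```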

In [111]:
mapper = CharsetMapper()

df = pd.read_csv('french_ner_dataset.csv', sep='\t', header=None, names=['word', 'entity', 'id'])

sentences_str = []
sentences = []
# Group the rows by sentence id so that every sentence is visited exactly once
for _, ith_sentence_words in df.groupby('id', sort=True):
    sentence = " ".join(ith_sentence_words['word'].to_list())
    sentences_str.append(sentence)
    sentences.append(mapper.str_to_idxs(sentence))

sentences_tensor = tf.ragged.constant(sentences)

### Model Creation

Build a simple model consisting of an embedding layer, a recurrent layer, and
a dense layer that projects down to the number of classes (one per entry in
the character vocabulary).
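
One possible shape for such a model is sketched below. It is not the notebook's own implementation: the class name `SketchLanguageModel`, the layer sizes `embedding_dim` and `rnn_units`, and the choice of a non-stateful GRU are all assumptions. The sketch only shows how an unbatched sequence of character indices can pass through the three layers and come out as one row of logits per character.

```python
import numpy as np
import tensorflow as tf

class SketchLanguageModel(tf.keras.Model):
    """Hypothetical embedding -> GRU -> dense model (sizes are assumptions)."""

    def __init__(self, vocab_size=199, embedding_dim=64, rnn_units=128):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, x):
        x = self.embedding(x)      # (seq_len, embedding_dim)
        x = tf.expand_dims(x, 0)   # add a batch dimension: (1, seq_len, embedding_dim)
        x = self.gru(x)            # (1, seq_len, rnn_units)
        x = self.dense(x)          # (1, seq_len, vocab_size)
        return tf.squeeze(x, 0)    # drop the batch dimension: (seq_len, vocab_size)

sketch_model = SketchLanguageModel()
sketch_sequence = tf.constant(np.random.randint(0, 197, size=100))
logits = sketch_model(sketch_sequence)
print(logits.shape)  # (100, 199)
```

The dense layer returns raw logits rather than probabilities, so a loss configured for logits (or an explicit softmax) would be needed downstream.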

In [160]:
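class LanguageModel(tf.keras.Model):
    def __init__(self, vocab_size=199):
        super().__init__()
        # TODO: define the embedding, recurrent (self.gru), and dense layers
    
    def call(self, x):
        # TODO: embedding -> recurrent -> dense; return one row of class scores per character
        pass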

Test the model on a random sequence to make sure the output has one row per
input character and one column per class.

In [209]:
model = LanguageModel()

sequence = tf.constant(np.random.randint(0, 197, size=(100)))
output = model(sequence)

print('Sequence:', sequence.shape)
print('Output:', output.shape)

Sequence: (100,)
Output: (100, 199)


### Train the Model

Train the model based on the text in our corpus.

The goal is to predict the next character, so the target is the input tensor
rolled left by one position.
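
The effect of the roll can be checked on a toy sentence. The sketch uses `np.roll`, which wraps elements around in the same way as `tf.roll(sentence, -1, 0)`; the character indices are arbitrary placeholders.

```python
import numpy as np

# Toy sentence: start token, three characters, end token
toy_sentence = np.array([197, 12, 7, 30, 198])

# Shift left by one so that each position is paired with the character that follows it
toy_target = np.roll(toy_sentence, -1)

for inp, tgt in zip(toy_sentence, toy_target):
    print(f'input {inp:3d} -> target {tgt:3d}')

print(toy_target.tolist())  # [12, 7, 30, 198, 197]
```

Because the roll wraps around, the last position (the end token) is paired with the start token as its target.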

In [None]:
@tf.function(experimental_relax_shapes=True)
def process_sentence(sentence, target):
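    output = ...  # TODO: compute the model output for `sentence`
    loss = ...  # TODO: compute the loss against `target`

    train_loss(loss)
    train_accuracy(target, tf.argmax(output, 1))

epochs = 50
dataset = ...  # TODO: build the dataset of sentences
loss_fn = ...  # TODO: choose a loss function
optimizer = ...  # TODO: choose an optimizer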
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Accuracy(name='train_accuracy')

for epoch in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    train_loop = tqdm(total=len(dataset), position=0, leave=True)
    for sentence in dataset:
        model.gru.reset_states()

        process_sentence(sentence, tf.roll(sentence, -1, 0))
        train_loop.set_description('Train - Epoch: {}, Loss: {:.4f}, Accuracy: {:.4f}'.format(epoch, train_loss.result(), train_accuracy.result()))
        train_loop.update(1)
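
For reference, a single gradient step on one sentence might look like the sketch below. Everything in it is an assumption rather than the notebook's own code: the stand-in `Sequential` model, the layer sizes, the sparse categorical cross-entropy loss on logits, the Adam optimizer, and the helper name `train_step`.

```python
import math
import tensorflow as tf

vocab_size = 199  # same default as the notebook's model

# Hypothetical stand-in model: embedding -> GRU -> dense logits
step_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])
step_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
step_optimizer = tf.keras.optimizers.Adam(1e-3)

def train_step(sentence):
    """Run one forward/backward pass on a single 1-D sentence of indices."""
    target = tf.roll(sentence, -1, 0)  # next-character targets
    with tf.GradientTape() as tape:
        logits = step_model(sentence[tf.newaxis, :])[0]  # (seq_len, vocab_size)
        loss = step_loss_fn(target, logits)
    grads = tape.gradient(loss, step_model.trainable_variables)
    step_optimizer.apply_gradients(zip(grads, step_model.trainable_variables))
    return loss

toy_sentence = tf.constant([197, 5, 6, 7, 198])
first_loss = float(train_step(toy_sentence))
print('Loss after one step:', first_loss)
```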

### Results

Observe the results by generating text with the trained model, one character
at a time.
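
The generation loop can be illustrated without a trained model. In the sketch below, a hand-written table `toy_logits` stands in for the model's next-character scores, the five-token vocabulary is hypothetical, and greedy `argmax` decoding is an assumption; sampling from the softmax distribution is a common alternative.

```python
import numpy as np

START, END = 3, 4  # toy boundary tokens (the notebook uses 197 and 198)
toy_idx_to_char = {0: 'a', 1: 'b', 2: 'c'}

# Hypothetical stand-in for the model: row i holds the scores for the token after token i
toy_logits = np.full((5, 5), -5.0)
toy_logits[START, 0] = 2.0  # start -> 'a'
toy_logits[0, 1] = 2.0      # 'a'   -> 'b'
toy_logits[1, 2] = 2.0      # 'b'   -> 'c'
toy_logits[2, END] = 2.0    # 'c'   -> end

def generate(max_len=200):
    """Greedily pick the most likely next token until the end token or max_len."""
    token = START
    chars = []
    for _ in range(max_len):
        token = int(np.argmax(toy_logits[token]))
        if token == END:
            break
        chars.append(toy_idx_to_char.get(token, ''))
    return ''.join(chars)

generated = generate()
print('Generated Text:', generated)  # Generated Text: abc
```

With a recurrent model, the hidden state would also need to be carried from one step to the next, which this lookup table does not capture.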

In [216]:
input = tf.constant([197]) # First input is the starting character token
string_output = ''
# More Stuff...

for _ in range(200):  # Max number of iterations
    pass

print('Generated Text:', string_output)

Generated Text: 
