# Tibetan to English Translation
## Setup

In this notebook, I will create a model to translate Tibetan sentences into English sentences. To create this model, I drew on the Keras tutorial provided here:

https://keras.io/examples/nlp/neural_machine_translation_with_keras_nlp/

I've adapted the model from the tutorial to translate Tibetan into English, rather than English to Spanish, and streamlined the code for simplicity and to meet my need for computational efficiency. Additionally, I've substantially altered the model in order to more fully optimize for the much, much smaller dataset available for the Tibetan language.

The first step of this process is to import the necessary libraries.

In [1]:
import pathlib
import random
import tensorflow as tf
from tensorflow import keras
import keras_nlp
import matplotlib.pyplot as plt

2023-09-09 20:42:06.115452: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Using TensorFlow backend


Next, I will establish the necessary constants for the model. 

I will use a batch size of 32 for my data.

I will train the model for 100 epochs.

I also establish a size for the vocabulary that the model will use and the dimensions for the model to expect from the data. 

An interesting addition here is AUTOTUNE. tf.data.AUTOTUNE will automate optimization for training the model. This is extremely useful both for effectively utilizing computing resources, and for avoiding too much time lost to optimization tinkering.

In [2]:
BATCH_SIZE = 32
EPOCHS = 100
MAX_SEQUENCE_LENGTH = 40
VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

AUTOTUNE = tf.data.AUTOTUNE

## Importing and Exploring the Data

The data for this project comes from Lotsawa House. I'm beginning by using their "Words of the Buddha" collection. The translations come in bilingual pdfs which need to be converted to a usable .txt file format.

### Converting PDFs to txt

In [3]:
from PyPDF2 import PdfReader

reader = PdfReader('data/texts.pdf')

num_pages = len(reader.pages)

text = []

for page in reader.pages:
    text.append(page.extract_text())

with open('data/test.txt', 'w') as f:
    f.writelines('\n'.join(text))

Now that a txt file has been created. We need to remove lines from the file that are not useful to us. This includes pages numbers, Tibetan script lines, etc.

In [4]:
import re

text = []

with open('data/test.txt', 'r') as f:
    for line in f:
        new_line = re.sub(r'[^a-zA-Z ]', '', line)
        if new_line.replace(' ', '') != '':
            text.append(new_line)

with open('data/test2.txt', 'w') as f:
    f.writelines('\n'.join(text))

Now that we've wittled the text down we can set the text into Tibetan and English sentence pairs. Lotsawa House translations are conveniently provided in multiple lines. First Tibetan and then the English translation.

In [25]:
from english_words import get_english_words_set
english = get_english_words_set(['web2'], lower=True)

pairs = []

with open('data/test2.txt', 'r') as f:
    text = f.readlines()
    
    for i in range(len(text) - 1):
        words = text[i].split()
        num_words = len(words)
        if (words[0].lower() not in english) and (words[len(words) - 1].lower() not in english):
            pair = (text[i].replace('\n', '') + ',' + text[i+1])
            pairs.append(pair)

with open('data/pairs.txt', 'w') as f:
    f.writelines(pairs)

### Working With Sentence Pairs

I've then split the sentence pairs into two sets and set every English letter to be lowercase to avoid any confusion in the model. This is not necessary for Tibetan because the script does not use upper and lower cases.

In [6]:
text_file = pathlib.Path('data/pairs.txt')

with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    tib, eng = line.split(",")[:2]
    eng = eng.lower()
    text_pairs.append((eng, tib))

The data comes in the form of numerous sentence pairs. First the sentence is given in English, then in Tibetan. Each pair also has a source attribution, but that won't be necessary for the model. Below, I've printed some representative sentence pairs.

In [7]:
for _ in range(5):
    print(random.choice(text_pairs))

('you whose shins reach  leagues down to the depths of the ocean', 'gyats zab su pakts gy tri ypa kangp pmor chinpa')
('to our attaining complete enlightenment', 'bardu chpa tamch kn')
('they enter harmoniously', 'tnpar zhukpa')
('who for the sake of all who live', 'semchen kn gyi dn gyi chir')
('gagana samudgate svabhava vishuddhe mahanaya parivare svaha ', 'aparimita punya jnana sambharo pachite  om sarva samskara parishuddha dharmate')


Now, we can split the sentence pairs into training, validation, and test sets. Notice that this dataset is quite small. This is one of the challenges of creating models for global minority languages. There is substantially less data to work with than if we were working with, for example, Spanish or French. As a result, I've allocated just 5% of the pairs to validation and testing respectively.

In [8]:
random.shuffle(text_pairs)
num_val_samples = int(0.05 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

2262 total pairs
2036 training pairs
113 validation pairs
113 test pairs


## Creating the Tokenizer

The tokenizer will assign each unique word in the dataset a 'token' a unique number that allows the data to be treated numerically during model training. In order to do this, a "vocabulary" must first be created. This is a complete list of the unique English and Tibetan words in the dataset.

### Vocabulary

In [9]:
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf.data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )

    return vocab

### Tokenizing

Note that I've set aside some peculiar tokens. These correspond to whitespace,unknown characters, the beginnings and endings of sentences. I don't want the tokenizer to treat these things as words that need to be tokenized.

In [10]:
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, VOCAB_SIZE, reserved_tokens)

tib_samples = [text_pair[1] for text_pair in train_pairs]
tib_vocab = train_word_piece(tib_samples, VOCAB_SIZE, reserved_tokens)

2023-09-09 20:42:41.824185: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-09 20:42:42.288887: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-09 20:42:42.289260: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

Below we can see some example words from the dataset. Note that Tibetan uses a distinct writing system that may not render correctly.

In [11]:
print("English Tokens: ", eng_vocab[100:115])
print("Tibetan Tokens: ", tib_vocab[100:115])

English Tokens:  ['##par', '##b', '##g', '##le', '##te', 'di', 'it', 'ts', 'will', 'tamch', 'through', 'dn', 'his', 'namo', '##ti']
Tibetan Tokens:  ['dak', 'dharmate', 'puye', '##g', 'and', 'aparimita', 'chok', 'su', '##ed', '##ha', '##ra', '##d', 'mi', '##t', '##ta']


Finally, we can tokenize the vocabularies.

In [12]:
eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)

tib_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=tib_vocab, lowercase=False
)

### Data Preprocessing

Next, I will preprocess each batch of data. This consists of re-assembling the English-Tibetan sentence pairs. Each sentence must be padded with the "[PAD]" whitespace token in order to make each sequence of tokens the same length. This is because the model expects inputs of a particular shape. Once the sentence has been padded to the appropriate length, a [START] token can be appended to the beginning and an [END] token appended to the end.

Finally, this assembled dataset can be split into training and validation sets.

In [13]:
def tib_eng_preprocess_batch(eng, tib):

    eng = eng_tokenizer(eng)
    tib = tib_tokenizer(tib)

    # pad eng to max_sequence_length
    eng_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH+1,
        pad_value = eng_tokenizer.token_to_id("[PAD]"),
    )

    eng = eng_start_end_packer(eng)

    # add special tokens [start] and [end] and pad tib
    tib_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length = MAX_SEQUENCE_LENGTH,
        start_value = tib_tokenizer.token_to_id("[START]"),
        end_value = tib_tokenizer.token_to_id("[END]"),
        pad_value = tib_tokenizer.token_to_id("[PAD]")
    )

    tib = tib_start_end_packer(tib)

    return (
        {
        "encoder_inputs": tib,
        "decoder_inputs": eng[:, :-1]
        },
        eng[:, 1:],
    )

def make_dataset(pairs):
    eng_texts, tib_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    tib_texts = list(tib_texts)
    dataset = tf.data.Dataset.from_tensor_slices((tib_texts, eng_texts))
    dataset=dataset.batch(BATCH_SIZE)
    dataset = dataset.map(tib_eng_preprocess_batch, num_parallel_calls=AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()

tib_eng_train_ds = make_dataset(train_pairs)
tib_eng_val_ds = make_dataset(val_pairs)

### Creating the Model

Now it's time to build the model itself. This model is an Autoencoder, which consists of an encoder and a decoder. 

The encoder input layer takes in a set of tokenized inputs. These inputs are then passed to a layer that accounts for the number assigned to the token as well as the position of that token in the sentence. The next layer is a typical dense Encoder layer.

The decoder takes in a set of tokenized inputs from the Tibetan dataset and passes them to a layer that will account for the token number and position of the token in those sentences. This is then passed to a typical dense Decoder layer.

Both the Encoder and Decoder layers are helpfully provided out-of-the-box by Keras.

In [14]:
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length = MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim = INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

tib_eng_translator = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="tib_eng_translator",
)

### Model Summary

In [15]:
tib_eng_translator.summary()

Model: "tib_eng_translator"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 token_and_position_embeddi  (None, None, 256)            3850240   ['encoder_inputs[0][0]']      
 ng (TokenAndPositionEmbedd                                                                       
 ing)                                                                                             
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []           

## Compilation

Now, I've compiled the model.

Of note here is the choice of optimization algorith. I have used RMSProp. RMSProp is similar to Adagrad, which we studied in class, and as a result it converges much more quickly than, say, SGD. However, it is less susceptible to vanishing gradients. This is perfect for our small dataset with small batch sizes.

The loss function is Sparse Categorical Crossentropy. Not every word appears in every sentence so the data for most natural language related tasks is necessarily sparse.

In [16]:
tib_eng_translator.compile(
    "rmsprop", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

## Fitting the Model

In [17]:
""" tib_eng_history = tib_eng_translator.fit(
    tib_eng_train_ds, 
    epochs=100, 
    validation_data=tib_eng_val_ds
    ) """

' tib_eng_history = tib_eng_translator.fit(\n    tib_eng_train_ds, \n    epochs=100, \n    validation_data=tib_eng_val_ds\n    ) '

To avoid training and retraining the model, I'll now save this model with these results.

In [18]:
# tib_eng_translator.save('models/tib-eng-translator-0.0.keras')

Now, for future testing, I can reopen the model.

In [19]:
tib_eng_translator = tf.keras.models.load_model('models/tib-eng-translator-0.0.keras')

### Visualizing the Training Results

Below, we can see how the loss and accuracy evolved over the course of training. Here we can really see how difficult it is to make effective generative tools from small datasets. Even as the model's accuracy improves substantially on the training set, the accuracy on the validation data remains unacceptably low.

The loss on the validation data also never decreases, instead getting worse as time goes on.

In [20]:
""" acc = tib_eng_history.history['accuracy']
val_acc = tib_eng_history.history['val_accuracy']

loss = tib_eng_history.history['loss']
val_loss = tib_eng_history.history['val_loss']

epochs_range = range(100)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show() """

" acc = tib_eng_history.history['accuracy']\nval_acc = tib_eng_history.history['val_accuracy']\n\nloss = tib_eng_history.history['loss']\nval_loss = tib_eng_history.history['val_loss']\n\nepochs_range = range(100)\n\nplt.figure(figsize=(8, 8))\nplt.subplot(1, 2, 1)\nplt.plot(epochs_range, acc, label='Training Accuracy')\nplt.plot(epochs_range, val_acc, label='Validation Accuracy')\nplt.legend(loc='lower right')\nplt.title('Training and Validation Accuracy')\n\nplt.subplot(1, 2, 2)\nplt.plot(epochs_range, loss, label='Training Loss')\nplt.plot(epochs_range, val_loss, label='Validation Loss')\nplt.legend(loc='upper right')\nplt.title('Training and Validation Loss')\nplt.show() "

## Decoding Translated Sentences

Even if the translations are perfect, the outputs of our model are not meaningful sentences. The model only outputs numerical tokens. In order to turn these into something that a human can read they need to be decoded.

Below is a function to decode these translated sentences. This function takes in an English sentence, runs it through our translator model then works its way through the output of the model, converting the output into words in Tibetan using our tokenizers.

Part of decoding the sequence is sampling the probabilities of tokens that should follow the existing translation. The sampler is the algorithm that is used to select that next work or token. Here I've used the Greedy sampler which simply finds the highest likelihood next word and adds it to the translated sentence. It is computationally inexpensive and because the outputs are pretty short we don't need to worry about the Greedy sampler outputting long, repetitive sentences that don't make much sense, which can be an issue with the algorithm.


In [23]:
def tib_eng_translate(input_sentences):
    batch_size = tf.shape(input_sentences)[0]

    encoder_input_tokens = eng_tokenizer(input_sentences).to_tensor(
        shape=(None, MAX_SEQUENCE_LENGTH)
    )

    def next(prompt, cache, index):
        logits = tib_eng_translator([encoder_input_tokens, prompt])[:, index - 1, :]
        hidden_states = None
        return logits, hidden_states, cache
    
    length = 40
    start = tf.fill((batch_size, 1), tib_tokenizer.token_to_id("[START]"))
    pad = tf.fill((batch_size, length - 1), tib_tokenizer.token_to_id("[PAD]"))
    prompt = tf.concat((start, pad), axis=-1)

    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next,
        prompt,
        end_token_id=tib_tokenizer.token_to_id("[END]"),
        index=1
    )
    generated_sentences = tib_tokenizer.detokenize(generated_tokens)
    try:
        generated_sentences = generated_sentences.numpy()[0].decode("utf-8")

        generated_sentences = (
            generated_sentences.replace("[PAD]", "")
            .replace("[START]", "")
            .replace("[END]", "")
            .replace("[UNK]", "")
            .strip()
        )
    except:
        pass
    return generated_sentences

### Example Translations

Now, let's look at some example translations from the model.

In [24]:
input_sentence = 'accomplished the transcendent perfection of generosity'
translated = tib_eng_translate(tf.constant([input_sentence]))

print(f"** Example **")
print(input_sentence)
print(translated)

** Example **
accomplished the transcendent perfection of generosity
tf.Tensor([b'[START]paenpa dzokp gyur S dzokp g [PAD] o \x00this\x00\x00\x00\x00\x06\x00\x00\x00teyata\x00\x00\x04\x00\x00\x00rang\x00\x00\x00\x00\x03\x00\x00\x00pen\x00\x03\x00\x00\x00nga\x00\x02\x00\x00\x00ng\x00\x00\x05\x00\x00\x00mnlam\x00\x00\x00\x04\x00\x00\x00kntu\x00\x00\x00\x00\x04\x00\x00\x00gawa\x00\x00\x00\x00\x04\x00\x00\x00dzok\x00\x00\x00\x00\x06\x00\x00\x00dzinpa\x00\x00\x03\x00\x00\x00duk\x00\n\x00\x00\x00drongkhyer\x00\x00\x02\x00\x00\x00bk\x00\x00\x07\x00\x00\x00Pearcey\x00\x04\x00\x00\x00Adam\x00\x00\x00\x00\x04\x00\x00\x00zhik\x00\x00\x00\x00\x01\x00\x00\x00z\x00\x00\x00\x03\x00\x00\x00yam\x00\x02\x00\x00\x00us\x00\x00\x06\x00\x00\x00uddhas\x00\x00\x03\x00\x00\x00tok\x00\x03\x00\x00\x00rim\x00\x0b\x00\x00\x00ranslations\x00\x02\x00\x00\x00pe\x00\x00\x03\x00\x00\x00pam\x00\x03\x00\x00\x00nyi\x00\x03\x00\x00\x00ntu\x00\x03\x00\x00\x00ngp\x00\x02\x00\x00\x00ll\x00\x00\x03\x00\x00\x00lam\x00\x05\x00\x