# Translating Tibetan to English

This notebook shows how to use the trained translation model.

## Preparation

First, we'll import the necessary libraries.

In [10]:
import tensorflow as tf
import keras_nlp
import pickle

Now, we'll load in the trained tokenizers and model.

In [11]:
with open('/home/j/Documents/Projects/MLotsawa/models/keras/tokenizers/big-dataset/eng-tokenizer.pickle', 'rb') as handle:
    eng_tokenizer = pickle.load(handle)

with open('/home/j/Documents/Projects/MLotsawa/models/keras/tokenizers/big-dataset/tib-tokenizer.pickle', 'rb') as handle:
    tib_tokenizer = pickle.load(handle)

In [12]:
tib_eng_translator = tf.keras.models.load_model("/home/j/Documents/Projects/MLotsawa/models/keras/tib-eng-translator-big-dataset.keras")

We also need to redefine our constants.

In [13]:
MAX_SEQUENCE_LENGTH = 15
VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

AUTOTUNE = tf.data.AUTOTUNE

## Decoding Translated Sentences

Even a perfect translation does not come out of our model as a readable sentence. The model only outputs numerical token IDs, and these need to be decoded before a human can read them.
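To make the idea of decoding concrete, here is a minimal sketch that uses a tiny, made-up vocabulary in place of the trained English tokenizer. The vocabulary, the `detokenize` helper and the token IDs are all assumptions for illustration; the real tokenizer's vocabulary and IDs will differ.

```python
# Toy vocabulary: a hypothetical stand-in for the trained English tokenizer.
# The position of each entry in the list is its token ID.
id_to_token = ["[PAD]", "[START]", "[END]", "[UNK]", "i", "take", "refuge"]

def detokenize(token_ids):
    """Map each integer token ID to its vocabulary entry and join with spaces."""
    return " ".join(id_to_token[i] for i in token_ids)

# A made-up model output: [START] i take refuge [END] [PAD] [PAD]
token_ids = [1, 4, 5, 6, 2, 0, 0]
decoded = detokenize(token_ids)
print(decoded)
```

The decoded string still contains the special tokens, which is why they have to be stripped out before the translation is shown.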

Below is a function to decode these translated sentences. It takes in a Tibetan sentence, runs it through our translator model, then works its way through the output of the model, converting it into English words using our tokenizers.

Part of decoding the sequence is sampling the probabilities of tokens that should follow the existing translation. The sampler is the algorithm used to select that next word or token. Here I've used the Greedy sampler, which simply finds the most likely next token and adds it to the translated sentence. It is computationally inexpensive. Greedy sampling can produce long, repetitive sentences that don't make much sense, but because our outputs are pretty short we don't need to worry about that here.
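The greedy strategy can be sketched in plain Python. This is only an illustration of the idea, not the `keras_nlp` implementation: the vocabulary, the `toy_next_token_scores` function and its scores are hypothetical stand-ins for a real model.

```python
# Hypothetical toy setup: none of these names come from the notebook.
VOCAB = ["[PAD]", "[START]", "[END]", "a", "b"]
END_ID = VOCAB.index("[END]")

def toy_next_token_scores(tokens):
    """Stand-in for a model: returns one score per vocabulary entry."""
    # For simplicity the scores depend only on how many tokens exist so far.
    table = {
        1: [0.0, 0.0, 0.1, 0.7, 0.2],  # after [START], "a" scores highest
        2: [0.0, 0.0, 0.2, 0.1, 0.7],  # then "b" scores highest
    }
    # In every other case, [END] scores highest.
    return table.get(len(tokens), [0.0, 0.0, 1.0, 0.0, 0.0])

def greedy_decode(start_id, max_length):
    """Append the highest-scoring token until [END] or max_length is reached."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = toy_next_token_scores(tokens)
        # Greedy step: take the index of the largest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == END_ID:
            break
    return tokens

result = greedy_decode(VOCAB.index("[START]"), max_length=15)
print([VOCAB[i] for i in result])
```

Because each step commits to the single best-scoring token, the loop does one model call per generated token and never revisits an earlier choice.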

In [14]:
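def tib_eng_translate(input_sentence):
    """Translate one Tibetan sentence into English using greedy decoding."""
    # Wrap the sentence in a batch of size 1.
    input_sentences = tf.constant([input_sentence])
    batch_size = tf.shape(input_sentences)[0]

    # Tokenize the Tibetan input, then pad or truncate it to a fixed length.
    encoder_input_tokens = tib_tokenizer(input_sentences).to_tensor(
        shape=(None, MAX_SEQUENCE_LENGTH)
    )

    # Callback for the sampler: returns the logits for the token at `index`.
    # Named `next_token` to avoid shadowing Python's built-in `next`.
    def next_token(prompt, cache, index):
        logits = tib_eng_translator([encoder_input_tokens, prompt])[:, index - 1, :]
        hidden_states = None  # no hidden states are passed back
        return logits, hidden_states, cache

    # Build the decoder prompt: [START] followed by [PAD] tokens.
    length = MAX_SEQUENCE_LENGTH
    start = tf.fill((batch_size, 1), eng_tokenizer.token_to_id("[START]"))
    pad = tf.fill((batch_size, length - 1), eng_tokenizer.token_to_id("[PAD]"))
    prompt = tf.concat((start, pad), axis=-1)

    # Fill in the prompt one token at a time, starting at position 1.
    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next_token,
        prompt,
        end_token_id=eng_tokenizer.token_to_id("[END]"),
        index=1,
    )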
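    # Convert the generated token IDs back into text.
    generated_sentences = eng_tokenizer.detokenize(generated_tokens)
    try:
        generated_sentences = generated_sentences.numpy()[0].decode("utf-8")

        # Strip the special tokens so that only the translation remains.
        generated_sentences = (
            generated_sentences.replace("[PAD]", "")
            .replace("[START]", "")
            .replace("[END]", "")
            .replace("[UNK]", "")
            .strip()
        )
    except (AttributeError, IndexError, UnicodeDecodeError):
        # Fall back to the raw detokenized output if it can't be turned into a string.
        pass
    return generated_sentences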
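The function above starts the sampler from a prompt made of a `[START]` token followed by padding, and begins filling it in at position 1. A minimal sketch of that layout with plain Python lists is shown here. The token IDs `START_ID` and `PAD_ID` are hypothetical; in the notebook they come from `eng_tokenizer.token_to_id`.

```python
MAX_SEQUENCE_LENGTH = 15  # same value as the notebook's constant
START_ID, PAD_ID = 1, 0   # hypothetical IDs, for illustration only

# One [START] token followed by padding up to the fixed length.
prompt = [START_ID] + [PAD_ID] * (MAX_SEQUENCE_LENGTH - 1)

# The sampler starts writing at position 1. The prediction for that
# position is read from the model's output at position index - 1.
index = 1
read_position = index - 1

print(prompt)
print(read_position)
```

The `index - 1` offset reflects that the model's output at each position is its prediction for the token that follows.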

### Example Translations

Now, let's look at some example translations from the model.

In [15]:
# Named `tibetan_sentences` to avoid shadowing Python's built-in `input`.
tibetan_sentences = ['sangye chö dang tsok kyi chok nam la',
                     'changchub bardu dak ni kyab su chi',
                     'dak gi jin sok gyipé sönam kyi',
                     'dro la pen chir sangye drubpar shok']

translated = [tib_eng_translate(sentence) for sentence in tibetan_sentences]

In [16]:
translated

['the most powerful and abundant    invoking the handsome',
 'my own mind now i have arrived in their transferred before us  after',
 'the auspicious opportunity recalling the enlightened ladys rage and abundant   and the',
 'the outer and inner benefit to all other offerings']
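The runs of spaces in these outputs are probably left behind where special tokens such as `[UNK]` or `[PAD]` were stripped from the middle of a sentence. If that matters for display, one option is to collapse them afterwards. The helper below is a hypothetical addition, not part of the original function.

```python
def collapse_whitespace(text):
    """Replace any run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())

raw = 'the most powerful and abundant    invoking the handsome'
cleaned = collapse_whitespace(raw)
print(cleaned)
```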