# Translating Tibetan to English

This notebook shows how to use the trained translation model.

## Preparation

First, we'll import the necessary libraries.

In [1]:
import tensorflow as tf
import keras_nlp
import pickle


Now, we'll load in the trained tokenizers and model.

In [2]:
with open('../tokenizers/eng-tokenizer.pickle', 'rb') as handle:
    eng_tokenizer = pickle.load(handle)

with open('../tokenizers/tib-tokenizer.pickle', 'rb') as handle:
    tib_tokenizer = pickle.load(handle)


In [3]:
tib_eng_translator = tf.keras.models.load_model("../models/tib-eng-translator-0.2.0.keras")

We also need to redefine the constants that were used when the model was trained.

In [4]:
MAX_SEQUENCE_LENGTH = 15
VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

AUTOTUNE = tf.data.AUTOTUNE

## Decoding Translated Sentences

Even when a translation is perfect, the raw output of our model is not a readable sentence. The model only outputs numerical token IDs. To turn these into something a human can read, they need to be decoded.
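To make the idea concrete, here is a minimal sketch of what decoding means, using a made-up vocabulary. The `toy_vocab` list and the `decode_ids` helper are hypothetical and only stand in for the trained English tokenizer, whose real vocabulary and token IDs will differ.

```python
# Hypothetical toy vocabulary: the position in the list is the token ID.
toy_vocab = ["[PAD]", "[START]", "[END]", "the", "sky", "is", "vast"]
SPECIAL_TOKENS = {"[PAD]", "[START]", "[END]", "[UNK]"}


def decode_ids(token_ids, vocab=toy_vocab):
    """Map token IDs back to words and drop the special tokens."""
    # IDs outside the vocabulary are treated as unknown.
    words = [vocab[i] if 0 <= i < len(vocab) else "[UNK]" for i in token_ids]
    return " ".join(w for w in words if w not in SPECIAL_TOKENS)


# A pretend model output: [START] the sky is vast [END] [PAD] [PAD]
model_output = [1, 3, 4, 5, 6, 2, 0, 0]
print(decode_ids(model_output))
```

The translation function in this notebook does the same two things: it detokenizes the IDs and then strips the special tokens from the resulting string.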

Below is a function to decode these translated sentences. It takes a Tibetan sentence, runs it through our translator model, and then works through the model's output, converting it into English words using our tokenizers.

Part of decoding the sequence is sampling from the probabilities of the tokens that could follow the existing translation. The sampler is the algorithm used to select that next word or token. Here I've used the greedy sampler, which simply picks the most likely next token and adds it to the translated sentence. It is computationally inexpensive, and because the outputs are short we don't need to worry about it producing long, repetitive sentences that don't make much sense, which can be an issue with greedy sampling.
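The greedy strategy can be sketched in a few lines of plain Python. This is an illustration only, not how `keras_nlp` implements its sampler, which works on batched, fixed-length tensors. The `greedy_decode` and `toy_scores` names, and the score table, are hypothetical.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Append the highest-scoring next token until END or max_length."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)
        # Greedy choice: the index of the largest score.
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == end_id:
            break
    return tokens


def toy_scores(tokens):
    """Hypothetical stand-in for the model: scores depend on the last token."""
    table = {
        0: [0.0, 0.1, 0.7, 0.2],    # after START (0), token 2 is most likely
        2: [0.0, 0.1, 0.1, 0.8],    # after token 2, token 3 is most likely
        3: [0.0, 0.9, 0.05, 0.05],  # after token 3, END (1) is most likely
    }
    return table[tokens[-1]]


print(greedy_decode(toy_scores, start_id=0, end_id=1, max_length=15))
# prints [0, 2, 3, 1]
```

Because only the single best token is kept at each step, the result is deterministic and needs one model call per generated token.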

In [11]:
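def tib_eng_translate(input_sentence):
    # Wrap the single sentence in a batch of size 1.
    input_sentences = tf.constant([input_sentence])
    batch_size = tf.shape(input_sentences)[0]

    # Tokenize the Tibetan input and pad or truncate it to a fixed length.
    encoder_input_tokens = tib_tokenizer(input_sentences).to_tensor(
        shape=(None, MAX_SEQUENCE_LENGTH)
    )

    # Callback for the sampler: returns the logits for the token at `index`.
    # Named `next_token` to avoid shadowing the built-in `next`.
    def next_token(prompt, cache, index):
        logits = tib_eng_translator([encoder_input_tokens, prompt])[:, index - 1, :]
        # Hidden states are not used here; the cache is passed through unchanged.
        hidden_states = None
        return logits, hidden_states, cache

    # Build the decoder prompt: [START] followed by [PAD] placeholders.
    length = MAX_SEQUENCE_LENGTH
    start = tf.fill((batch_size, 1), eng_tokenizer.token_to_id("[START]"))
    pad = tf.fill((batch_size, length - 1), eng_tokenizer.token_to_id("[PAD]"))
    prompt = tf.concat((start, pad), axis=-1)

    # Fill in the prompt greedily, starting from the position after [START].
    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next_token,
        prompt,
        end_token_id=eng_tokenizer.token_to_id("[END]"),
        index=1
    )

    # Convert the token IDs back to text and strip the special tokens.
    generated_sentences = eng_tokenizer.detokenize(generated_tokens)
    try:
        generated_sentences = generated_sentences.numpy()[0].decode("utf-8")

        generated_sentences = (
            generated_sentences.replace("[PAD]", "")
            .replace("[START]", "")
            .replace("[END]", "")
            .replace("[UNK]", "")
            .strip()
        )
    except Exception:
        # If the conversion to a Python string fails, return the raw output.
        pass
    return generated_sentences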
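
The fixed-length layout used by the function can be illustrated without TensorFlow. The sketch below mimics how the encoder input is padded or truncated to `MAX_SEQUENCE_LENGTH` and how the decoder prompt starts out. The helper names and the token IDs (`PAD_ID = 0`, `START_ID = 2`, and the sample input IDs) are assumptions for illustration; the real values come from the trained tokenizers.

```python
MAX_SEQUENCE_LENGTH = 15

# Hypothetical token IDs; the real ones come from the trained tokenizers.
PAD_ID, START_ID = 0, 2


def pad_or_truncate(token_ids, length=MAX_SEQUENCE_LENGTH, pad_id=PAD_ID):
    """Force a list of token IDs to exactly `length` entries."""
    return (list(token_ids) + [pad_id] * length)[:length]


def initial_prompt(length=MAX_SEQUENCE_LENGTH):
    """Decoder prompt: one [START] token followed by [PAD] placeholders."""
    return [START_ID] + [PAD_ID] * (length - 1)


# Six made-up token IDs standing in for a tokenized Tibetan sentence.
encoder_input = pad_or_truncate([37, 12, 405, 9, 88, 140])
print(encoder_input)     # the six IDs followed by nine padding zeros
print(initial_prompt())  # START_ID followed by fourteen padding zeros
```

The sampler then overwrites the placeholder positions one at a time, beginning at index 1, which is why the function passes `index=1`.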

### Example Translations

Now, let's look at some example translations from the model.

In [12]:
input_sentence = 'yum ni nyimé chö kyi ying'
translated = tib_eng_translate(input_sentence)

print("** Example **")
print(input_sentence)
print(translated)

** Example **
yum ni nyimé chö kyi ying
space of abiding intense streng
