# Tibetan to English Translation
## Setup

In this notebook, I will create a model to translate Tibetan sentences into English sentences. To create this model, I drew on the Keras tutorial provided here:

https://keras.io/examples/nlp/neural_machine_translation_with_keras_nlp/

I've adapted the tutorial's model to translate Tibetan into English rather than English into Spanish, and streamlined the code for simplicity and computational efficiency. I've also substantially altered the model to suit the much smaller dataset available for Tibetan.

The first step of this process is to import the necessary libraries.

In [18]:
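import os

# TF_GPU_ALLOCATOR is read from the environment, so it has to be set there
# before TensorFlow is imported; a plain Python variable has no effect.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import pathlib
import pickle
import random

import keras_nlp
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras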


In [19]:
BATCH_SIZE = 128
EPOCHS = 1000
MAX_SEQUENCE_LENGTH = 45
VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

AUTOTUNE = tf.data.AUTOTUNE

### Working With Sentence Pairs

I've lowercased the English text so that the model does not treat capitalized and uncapitalized forms of a word as different words, and then split the sentence pairs into a training set and a validation set. Lowercasing is not necessary for Tibetan because the script has no upper and lower cases.

In [20]:
text_file = pathlib.Path('/home/j/Documents/Projects/Iron-Bridge/lotsawa/data/training-batches/training-batch-1.txt')

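# Read explicitly as UTF-8 so the Tibetan text decodes the same way on any platform.
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    try:
        # Keep only the first two comma-separated fields: Tibetan, then English.
        tib, eng = line.split(",")[:2]
        eng = eng.lower()
        text_pairs.append((tib, eng))
    except ValueError:
        # Skip lines with no comma, which cannot be unpacked into a pair.
        pass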
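One caveat about the parsing above: `line.split(",")[:2]` keeps only the first two comma-separated fields, so if an English sentence itself contains a comma, everything after that comma is dropped. The sketch below illustrates this on a made-up line (not taken from the dataset). Whether the real file is affected depends on its format, which I'm assuming here is `tibetan,english` with no quoting.

```python
# A made-up example line, assuming the format "tibetan,english" with no quoting.
line = "བཀྲ་ཤིས་བདེ་ལེགས།,hello, and good luck"

# Behaviour of the parsing cell: only the first two fields survive.
tib_a, eng_a = line.split(",")[:2]

# Alternative under the same assumption: split once, keep the rest as English.
tib_b, eng_b = line.split(",", 1)

print(repr(eng_a))
print(repr(eng_b))
```

If the Tibetan side could also contain ASCII commas, neither approach would be safe and a proper CSV reader or a different delimiter would be needed.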

In [21]:
random.shuffle(text_pairs)
num_val_samples = int(0.05 * len(text_pairs))
num_train_samples = len(text_pairs) - num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")

499987 total pairs
474988 training pairs
24999 validation pairs
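As a quick sanity check, the split sizes above can be reproduced by hand from the total pair count and the 5% validation fraction. The same arithmetic gives the number of batches per epoch for a batch size of 128, which should match the step count shown in the training progress bar later on.

```python
import math

# Figures taken from the printed output and the hyperparameter cell.
total_pairs = 499_987
batch_size = 128  # BATCH_SIZE

num_val = int(0.05 * total_pairs)   # int() truncates toward zero
num_train = total_pairs - num_val
# The final batch is allowed to be smaller, so round up.
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_val, num_train, steps_per_epoch)
```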


## Creating the Tokenizer

The tokenizer maps text to "tokens", each identified by a unique integer, which allows the data to be treated numerically during model training. In order to do this, a "vocabulary" must first be created. Here, that is a list of up to `VOCAB_SIZE` WordPiece subword units, learned separately for the Tibetan and the English training sentences.
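To make the idea concrete, here is a minimal pure-Python sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in the vocabulary, with `##` marking a piece that continues a word. The toy vocabulary and the `wordpiece_tokenize` helper are my own illustrations, not part of KerasNLP, and the real tokenizer also handles whitespace and punctuation splitting that this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a continuation of the word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a token's id is simply its index in the list.
toy_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "trans", "##lat", "##ion", "##or"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

pieces = wordpiece_tokenize("translation", token_to_id)
ids = [token_to_id[p] for p in pieces]
print(pieces, ids)
```

Because "translation" and "translator" share the pieces `trans` and `##lat`, a fixed-size vocabulary can cover many more words than a list of whole words could.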

### Vocabulary

In [22]:
""" def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf.data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )

    return vocab """

### Tokenizing

Note that I've set aside some special tokens. These correspond to padding, unknown tokens, and the beginnings and endings of sentences. I don't want the tokenizer to treat these as ordinary words that need to be tokenized.

In [23]:
""" reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

tib_samples = [text_pair[0] for text_pair in train_pairs]
tib_vocab = train_word_piece(tib_samples, VOCAB_SIZE, reserved_tokens)

eng_samples = [text_pair[1] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, VOCAB_SIZE, reserved_tokens) """

Finally, we can build a tokenizer from each vocabulary.

In [24]:
""" eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)

tib_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=tib_vocab, lowercase=False
) """

Below, I'll save the tokenizers.

In [25]:
""" with open('/home/j/Documents/Projects/Iron-Bridge/lotsawa/tokenizers/eng-tokenizer.pickle', 'wb') as handle:
    pickle.dump(eng_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('/home/j/Documents/Projects/Iron-Bridge/lotsawa/tokenizers/tib-tokenizer.pickle', 'wb') as handle:
    pickle.dump(tib_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL) """

In [26]:
with open('/home/j/Documents/Projects/Iron-Bridge/lotsawa/tokenizers/tib-tokenizer.pickle', 'rb') as handle:
    tib_tokenizer = pickle.load(handle)

with open('/home/j/Documents/Projects/Iron-Bridge/lotsawa/tokenizers/eng-tokenizer.pickle', 'rb') as handle:
    eng_tokenizer = pickle.load(handle)

### Data Preprocessing

Next, I will preprocess each batch of data. Both sentences in a pair are tokenized, and each token sequence is truncated or padded with the "[PAD]" token so that every sequence has the same length. This is because the model expects inputs of a particular shape. A [START] token is placed at the beginning of each sentence and an [END] token at the end. The English sequence is packed one token longer than `MAX_SEQUENCE_LENGTH` so that it can be split into decoder inputs (all but the last token) and targets (all but the first token).

Finally, the training pairs and the validation pairs are each turned into a batched dataset.
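The packing and shifting described above can be sketched in plain Python. This is a hand-rolled illustration, not KerasNLP's `StartEndPacker`: the `pack` helper, the token ids, and the short sequence length are all hypothetical, and the library's exact truncation behaviour may differ.

```python
PAD, START, END = 0, 2, 3  # hypothetical ids for the reserved tokens


def pack(tokens, sequence_length, start=None, end=None, pad=PAD):
    """Optionally wrap tokens in start/end ids, then truncate or pad to a fixed length."""
    n_special = (start is not None) + (end is not None)
    body = list(tokens)[: sequence_length - n_special]  # leave room for the specials
    out = ([start] if start is not None else []) + body
    out += [end] if end is not None else []
    return out + [pad] * (sequence_length - len(out))


max_len = 6  # stands in for MAX_SEQUENCE_LENGTH

# Encoder side: wrapped and padded to max_len.
encoder_inputs = pack([21, 22], max_len, START, END)

# Decoder side: packed to max_len + 1, then shifted by one position.
packed = pack([11, 12, 13], max_len + 1, START, END)
decoder_inputs = packed[:-1]  # what the decoder sees
targets = packed[1:]          # what it must predict at each position

print(encoder_inputs)
print(decoder_inputs)
print(targets)
```

At every position the target is the token that follows the corresponding decoder input, which is what lets the model learn to predict the next English token from the ones before it.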

In [27]:
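def tib_eng_preprocess_batch(tib, eng):
    # Convert the raw strings to integer token ids.
    tib = tib_tokenizer(tib)
    eng = eng_tokenizer(eng)

    # Add [START] and [END] to the Tibetan (encoder) sequence and pad it.
    tib_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        start_value=tib_tokenizer.token_to_id("[START]"),
        end_value=tib_tokenizer.token_to_id("[END]"),
        pad_value=tib_tokenizer.token_to_id("[PAD]"),
    )
    tib = tib_start_end_packer(tib)

    # Add [START] and [END] to the English (decoder) sequence as well, so that
    # generation has a defined first token and stopping point. It is padded to
    # one extra token so it can be shifted into inputs and targets below.
    eng_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=eng_tokenizer.token_to_id("[START]"),
        end_value=eng_tokenizer.token_to_id("[END]"),
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    return (
        {
            "encoder_inputs": tib,
            "decoder_inputs": eng[:, :-1],  # all but the last token
        },
        eng[:, 1:],  # targets: the same sequence shifted left by one
    )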

def make_dataset(pairs, batch_size=BATCH_SIZE):
    tib_texts, eng_texts = zip(*pairs)
    tib_texts = list(tib_texts)
    eng_texts = list(eng_texts)
    dataset = tf.data.Dataset.from_tensor_slices((tib_texts, eng_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(tib_eng_preprocess_batch, num_parallel_calls=AUTOTUNE)
    # Cache the tokenized batches first, then shuffle so each epoch sees a new
    # batch order, and prefetch last to overlap input preparation with training.
    return dataset.cache().shuffle(2048).prefetch(AUTOTUNE)

tib_eng_train_ds = make_dataset(train_pairs)
tib_eng_val_ds = make_dataset(val_pairs)

## Creating the Model

Now it's time to build the model itself. This model is a Transformer sequence-to-sequence model, which consists of an encoder and a decoder.

The encoder input layer takes in tokenized Tibetan sentences. These inputs are then passed to an embedding layer that accounts for the number assigned to each token as well as the position of that token in the sentence. The next layer is a Transformer encoder layer.

The decoder takes in tokenized sentences from the English dataset and passes them to the same kind of embedding layer, which accounts for the token number and position of each token in those sentences. The result is passed, together with the encoder's output, to a Transformer decoder layer, followed by dropout and a softmax layer over the English vocabulary.

Both the encoder and decoder layers are helpfully provided out of the box by KerasNLP.

In [28]:
""" encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length = MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim = INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

tib_eng_translator = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="tib_eng_translator",
) """

tib_eng_translator = tf.keras.models.load_model("/home/j/Documents/Projects/Iron-Bridge/lotsawa/models/tib-eng-translator-0.4.0.keras")

### Model Summary

In [29]:
tib_eng_translator.summary()

Model: "tib_eng_translator"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 token_and_position_embeddi  (None, None, 256)            3851520   ['encoder_inputs[0][0]']      
 ng (TokenAndPositionEmbedd                                                                       
 ing)                                                                                             
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []           
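The 3,851,520 parameters reported for the token-and-position embedding layer can be checked by hand: the layer holds one learned vector per vocabulary entry plus one learned vector per sequence position. A small sketch of that arithmetic, using the hyperparameters defined at the top of the notebook:

```python
VOCAB_SIZE = 15000
MAX_SEQUENCE_LENGTH = 45
EMBED_DIM = 256

token_table = VOCAB_SIZE * EMBED_DIM              # one vector per token id
position_table = MAX_SEQUENCE_LENGTH * EMBED_DIM  # one vector per position
embedding_params = token_table + position_table

print(token_table, position_table, embedding_params)
```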

## Compilation

Now, I've compiled the model.

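Of note here is the choice of optimization algorithm. I have used RMSProp. RMSProp is similar to Adagrad, which we studied in class: both scale each parameter's step by the size of its past gradients, which often lets them converge more quickly than, say, plain SGD. However, RMSProp uses a decaying average of squared gradients rather than Adagrad's ever-growing sum, so its effective learning rate does not shrink toward zero over a long training run. This suits our small dataset and small batch sizes.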
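The difference between the two optimizers shows up in how each scales its step size over time. Below is a minimal sketch for a single parameter that receives the same gradient at every step. The learning rate and decay values are hypothetical, chosen for readability, and the functions are simplified textbook forms rather than Keras's implementations.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-7):
    """Adagrad divides by the root of the running SUM of squared gradients."""
    acc, sizes = 0.0, []
    for g in grads:
        acc += g * g
        sizes.append(lr / (math.sqrt(acc) + eps))
    return sizes


def rmsprop_step_sizes(grads, lr=0.1, rho=0.9, eps=1e-7):
    """RMSProp divides by the root of a decaying AVERAGE of squared gradients."""
    avg, sizes = 0.0, []
    for g in grads:
        avg = rho * avg + (1 - rho) * g * g
        sizes.append(lr / (math.sqrt(avg) + eps))
    return sizes


grads = [1.0] * 200  # a constant gradient, for illustration only
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)

print(f"Adagrad: first {ada[0]:.4f}, last {ada[-1]:.4f}")
print(f"RMSProp: first {rms[0]:.4f}, last {rms[-1]:.4f}")
```

With a constant gradient, Adagrad's step keeps shrinking as the sum grows, while RMSProp's step settles near the base learning rate.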

The loss function is Sparse Categorical Crossentropy. "Sparse" here refers to the format of the labels: each target is a single integer token id rather than a one-hot vector over the whole vocabulary, which matches the integer sequences produced by the tokenizer.
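To show what the sparse variant of the loss computes, here is a small pure-Python sketch for a single prediction over a three-token vocabulary. The helper functions are simplified stand-ins for the Keras losses, and the probabilities are made up. The sparse form takes an integer label and gives the same value as the ordinary form applied to the equivalent one-hot vector.

```python
import math


def sparse_categorical_crossentropy(label, probs):
    """Label is an integer class index."""
    return -math.log(probs[label])


def categorical_crossentropy(one_hot, probs):
    """Label is a one-hot vector of the same length as probs."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.1, 0.7, 0.2]    # hypothetical softmax output
label = 1                  # integer token id
one_hot = [0.0, 1.0, 0.0]  # the same label as a one-hot vector

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)
```

With a 15,000-token vocabulary, integer labels avoid building a 15,000-wide one-hot vector for every target position.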

In [30]:
tib_eng_translator.compile(
    "rmsprop", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

## Fitting the Model

In [31]:
acc_callback = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True)
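The effect of `patience=3` can be illustrated with a small pure-Python sketch of the stopping rule: training stops once the monitored metric has gone three consecutive epochs without beating its best value, and the best epoch is remembered so its weights can be restored. The `early_stopping_epoch` helper and the accuracy values are hypothetical, and this is a simplification of what the Keras callback does.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (stop_epoch, best_epoch), 0-indexed, for a metric where higher is better."""
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            # Strict improvement resets the counter.
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last epoch.
    return len(val_scores) - 1, best_epoch


# Made-up validation accuracies, one per epoch.
val_accuracy = [0.80, 0.85, 0.84, 0.85, 0.83, 0.90]
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=3)
print(stop_epoch, best_epoch)
```

In this made-up run, training stops before reaching the final, higher value, which is the trade-off a small patience makes.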

In [32]:
tib_eng_history = tib_eng_translator.fit(
    tib_eng_train_ds, 
    epochs=EPOCHS, 
    validation_data=tib_eng_val_ds,
    callbacks=[acc_callback]
    )

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
 175/3711 [>.............................] - ETA: 3:06 - loss: 0.6057 - accuracy: 0.9097

KeyboardInterrupt: 

In [33]:
tib_eng_translator.save('/home/j/Documents/Projects/Iron-Bridge/lotsawa/models/tib-eng-translator-0.4.0.keras')

In [None]:
acc = tib_eng_history.history['accuracy']
val_acc = tib_eng_history.history['val_accuracy']

loss = tib_eng_history.history['loss']
val_loss = tib_eng_history.history['val_loss']

epochs_range = range(len(tib_eng_history.history['loss']))

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

NameError: name 'tib_eng_history' is not defined