# Large Language Models for Global Minority Languages
## Introduction

The recent success of Large Language Models (LLMs) has driven a great deal of interest in applications ranging from homework help to global domination.

One of the most exciting possibilities for LLMs is their application to problems of knowledge accessibility. Much effort has been devoted to, for example, the translation of natural language into SQL, a coding language used to query large databases. 

LLMs can also summarize long documents which may be difficult to comprehend for individuals with learning differences. Additionally, LLMs can be used to rephrase complex or jargon-heavy writing for individuals without a large amount of expertise in a given subject matter.

Together, these capabilities could mark a great leap forward for information accessibility. However, LLMs are developed almost exclusively for English and other majority languages. This leaves a significant gap for speakers of languages in the global minority.

One such language is Nepali. Nepali is spoken natively by 16 million people and is used as a second language by an additional 9 million, yet it is rarely available as an option for machine translation and LLMs do not cater to its speakers.

In this project I will attempt to create a first draft of a solution and a potential template for similar projects that could cater to other global minority language groups.

## Text Generator

To start, I will create a mini-GPT model for text generation using the KerasNLP library. This model will be trained on the simplebooks-92 dataset, which uses a simplified English vocabulary. This is useful both for training purposes and for generating output that is readily understandable by individuals who do not speak English as a first language.

### Set Up

In [1]:
import os
import keras_nlp
import tensorrt
import tensorflow as tf
from tensorflow import keras

Using TensorFlow backend




### Settings

Below I've selected some key hyperparameters. Particularly notable is the minimum training sequence length. As used in the loading code below, this is the minimum length of a line of raw text, measured with `tf.strings.length` (which counts bytes, not tokens), for that line to be kept. We want this number to be large enough that the model does not train on single words or brief phrases, which eat up training time while contributing little to performance.

In [27]:
BATCH_SIZE = 64
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000

EPOCHS = 6

NUM_TOKENS_TO_GENERATE = 80

AUTOTUNE = tf.data.AUTOTUNE

### Load Simplebooks Data

In [3]:
raw_train_ds = (
    tf.data.TextLineDataset('simplebooks/simplebooks-92-raw/train.txt')
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

raw_val_ds = (
    tf.data.TextLineDataset("simplebooks/simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
)
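The `filter` calls above compare `tf.strings.length(x)` with `MIN_TRAINING_SEQ_LEN`. By default that function counts bytes, so the threshold applies to raw line length rather than to token count. The plain-Python sketch below mirrors that check without TensorFlow; `keep_line` and the sample lines are hypothetical names introduced only for illustration.

```python
MIN_TRAINING_SEQ_LEN = 450  # same value as in the settings cell


def keep_line(line: str) -> bool:
    """Mirror the dataset filter: keep lines longer than the threshold in bytes."""
    # tf.strings.length counts bytes by default, so encode to UTF-8 first.
    return len(line.encode("utf-8")) > MIN_TRAINING_SEQ_LEN


# Hypothetical sample lines: one short sentence and one long paragraph.
sample_lines = ["A short line.", "word " * 100]
kept = [line for line in sample_lines if keep_line(line)]

# Byte length and character length differ for non-ASCII text such as Nepali.
nepali_word = "नेपाली"
char_count = len(nepali_word)
byte_count = len(nepali_word.encode("utf-8"))
```

For English text the two measures nearly coincide, but for Devanagari each code point takes three bytes in UTF-8, which matters if the same filter is reused on Nepali data.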


### Tokenizer for Generator
Here I've defined the vocabulary for the model. This vocabulary is made up of the words and sub-word pieces ('tokens') from the dataset that the model needs to be able to represent and understand.

PAD, UNK, BOS represent padding, unknown, and beginning-of-sentence. These tokens are set aside as non-words for our purposes.

I've then loaded KerasNLP's WordPiece tokenizer and used it to preprocess the data for training. It lowercases the text, splits it on whitespace and punctuation (punctuation marks are kept as separate tokens, as the generated sample further down shows), breaks rarer words into sub-word pieces, and maps each token to a unique integer. This lets the model train on the text as numeric data.
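To make the tokenization step concrete, here is a minimal pure-Python sketch of greedy longest-match WordPiece splitting. It is not KerasNLP's implementation: `TOY_VOCAB`, `split_words`, `word_to_pieces`, and `toy_tokenize` are hypothetical names, and the tiny vocabulary is invented for the example (the real one is learned from simplebooks-92).

```python
import re

# Hypothetical toy vocabulary; reserved tokens come first, as in the notebook.
TOY_VOCAB = ["[PAD]", "[UNK]", "[BOS]", "play", "##ing", "##ed", "the", "dog", "!", "."]
TOKEN_TO_ID = {token: index for index, token in enumerate(TOY_VOCAB)}


def split_words(text):
    """Lowercase, then split on whitespace, keeping punctuation as its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_to_pieces(word):
    """Split one word into vocabulary pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text):
    pieces = [piece for word in split_words(text) for piece in word_to_pieces(word)]
    return pieces, [TOKEN_TO_ID[piece] for piece in pieces]


pieces, ids = toy_tokenize("The dog played!")
```

Here `pieces` comes out as `['the', 'dog', 'play', '##ed', '!']`: the word "played" is not in the toy vocabulary, so it is covered by the pieces `play` and `##ed`.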

In [4]:
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size = VOCAB_SIZE,
    lowercase = True,
    reserved_tokens = ["[PAD]", "[UNK]", "[BOS]"],
)

In [5]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

In [6]:
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]")
)

def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels

train_ds = raw_train_ds.map(tf.autograph.experimental.do_not_convert(preprocess), num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE)

val_ds = raw_val_ds.map(tf.autograph.experimental.do_not_convert(preprocess), num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE)
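The `preprocess` function above builds next-token training pairs: the features are the token sequence shifted right by one `[BOS]` token, and the labels are the unshifted tokens. The sketch below reproduces that relationship with plain lists. `toy_preprocess` is a hypothetical helper, the sequence length is shortened for readability, and the padding and truncation rules are simplified stand-ins for the tokenizer and `StartEndPacker`.

```python
SEQ_LEN = 6             # shortened from 128 for readability
PAD_ID, BOS_ID = 0, 2   # positions of "[PAD]" and "[BOS]" in reserved_tokens


def toy_preprocess(token_ids):
    """Return (features, labels) where features are labels shifted right by [BOS]."""
    # Stand-in for the tokenizer: pad or truncate to SEQ_LEN.
    labels = (token_ids + [PAD_ID] * SEQ_LEN)[:SEQ_LEN]
    # Stand-in for the packer: prepend [BOS], then truncate back to SEQ_LEN.
    features = ([BOS_ID] + labels)[:SEQ_LEN]
    return features, labels


features, labels = toy_preprocess([11, 12, 13, 14])
# features: [2, 11, 12, 13, 14, 0]
# labels:   [11, 12, 13, 14, 0, 0]
```

At every position the model sees the tokens up to that point in `features` and is asked to predict the token at the same position in `labels`, which is the next token of the original text.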

### Constructing the Model

In [7]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)

x = embedding_layer(inputs)

for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)

outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)

In [8]:
model.compile(optimizer='adam', loss=loss_fn, metrics=[])

In [9]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_decoder (Trans  (None, None, 256)         394749    
 formerDecoder)                                                  
                                                                 
 transformer_decoder_1 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense_4 (Dense)             (None, None, 5000)        1285000
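The parameter counts in the summary can be checked by hand from the hyperparameters. The sketch below is plain arithmetic; it assumes each attention head has size `EMBED_DIM // NUM_HEADS` (85 here, since 256 is not divisible by 3), an assumption inferred from the printed decoder total rather than from the library documentation.

```python
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 5000, 128, 256
FEED_FORWARD_DIM, NUM_HEADS = 256, 3

# Token embedding table plus learned position embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# Self-attention: query, key, and value projections, then the output projection.
head_dim = EMBED_DIM // NUM_HEADS  # assumed head size
qkv_params = 3 * (EMBED_DIM * NUM_HEADS * head_dim + NUM_HEADS * head_dim)
attention_out_params = NUM_HEADS * head_dim * EMBED_DIM + EMBED_DIM

# Two layer norms (scale and offset each) and a two-layer feed-forward block.
layer_norm_params = 2 * 2 * EMBED_DIM
feed_forward_params = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)

decoder_params = (
    qkv_params + attention_out_params + layer_norm_params + feed_forward_params
)

# Output head: one weight per (embedding dimension, vocabulary entry) plus biases.
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE
```

This gives 1,312,768 for the embedding layer and 394,749 for each decoder block, matching the summary, and 256 × 5000 + 5000 = 1,285,000 for the `Dense` head.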

In [10]:
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Epoch 1/6



   3169/Unknown - 258s 76ms/step - loss: 4.5749



Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.src.callbacks.History at 0x7ff7381a1b20>

In [11]:
model.save('text-generator.keras')

#### Testing the Text Generator

In [15]:
prompt_tokens = start_packer(tokenizer(["Don't do that! Just don't!"]))

In [16]:
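def next_token(prompt, cache, index):
    # Score the token at position `index` from the model output at the previous position.
    logits = model(prompt)[:, index - 1, :]
    # No key/value cache is used, so the full sequence is re-run at every step.
    hidden_states = None
    return logits, hidden_states, cache

In [17]:
sampler = keras_nlp.samplers.TopPSampler(p=0.5)
# index=1 starts sampling right after [BOS]. No mask is passed, so the sampler
# is free to overwrite the prompt tokens from that position on, which is why
# the generated text does not continue the prompt.
output_tokens = sampler(
    next=next_token,
    prompt=prompt_tokens,
    index=1,
)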
txt = tokenizer.detokenize(output_tokens)
print(f"Generated text: \n{txt}\n")

Generated text: 
[b'[BOS] " oh , i \' ll tell you ! " cried mrs . snap . " i \' ll have to give you all your life , and you \' ll never be able to keep the quiet of the house in the little room where you are . you have to be a thievest - - you can \' t be on your hands . you \' ll get it into a house - - i \' ll do you with me . but you \' ll be too busy to make me think of it . you see , you can \' t get up and get back to the kitchen and keep your eye on . i \'']
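The sample above was drawn with top-p (nucleus) sampling at `p=0.5`: at each step only the most probable tokens whose cumulative probability reaches `p` stay eligible, and the next token is drawn from that reduced set. The following is a minimal pure-Python sketch of the idea, not KerasNLP's implementation; `top_p_sample` and the example logits are hypothetical.

```python
import math
import random


def top_p_sample(logits, p=0.5, rng=random):
    """Sample a token id from the smallest set of tokens whose probability mass reaches p."""
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Walk the tokens from most to least probable until the mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for token_id in order:
        nucleus.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break

    # Draw from the nucleus in proportion to the original probabilities.
    weights = [probs[token_id] for token_id in nucleus]
    chosen = rng.choices(nucleus, weights=weights, k=1)[0]
    return chosen, nucleus


example_logits = [2.0, 1.0, 0.0, -1.0]
token, nucleus = top_p_sample(example_logits, p=0.5)
_, wider_nucleus = top_p_sample(example_logits, p=0.8)
```

With these logits the top token alone carries about 64% of the probability mass, so `p=0.5` reduces to picking token 0, while `p=0.8` widens the nucleus to tokens 0 and 1. A lower `p` therefore makes the output more conservative and a higher `p` more varied.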



## English to Nepali Translator

### Setup

In [2]:
import pathlib
import random
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab
)

In [43]:
BATCH_SIZE = 16
EPOCHS = 10
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
NEP_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

In [44]:
text_file = pathlib.Path('npi-eng/npi.txt')

In [45]:
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, nep = line.split("\t")[:2]
    eng = eng.lower()
    nep = nep.lower()
    text_pairs.append((eng, nep))

Below, I've printed some example sentence pairs.

In [46]:
for _ in range(5):
    print(random.choice(text_pairs))

("tom didn't watch tv.", 'टमले टिभी हेरेनन्।')
('i went aboard.', 'म विदेश गएँ ।')
('tom was in heaven.', 'टम स्वर्गमा थियो।')
('tell tom what that is.', 'टमलाई भन्नुहोस् कि त्यो के हो।')
('i eat rice almost every day.', 'म लगभग हरेक दिन भात खान्छु।')


Now, we can split the sentence pairs into training, validation, and test sets. Notice that this dataset is quite small. This is one of the challenges of creating models for global minority languages. There is substantially less data to work with than if we were working with, for example, Spanish or French.

In [47]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

1574 total pairs
1102 training pairs
236 validation pairs
236 test pairs


### Tokenizer for Translator

In [48]:
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf.data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )

    return vocab

In [49]:
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

nep_samples = [text_pair[1] for text_pair in train_pairs]
nep_vocab = train_word_piece(nep_samples, NEP_VOCAB_SIZE, reserved_tokens)


In [50]:
print("English Tokens: ", eng_vocab[100:110])
print("Nepali Tokens: ", nep_vocab[100:110])

English Tokens:  ['##ly', '##on', '##an', '##day', 'as', 'll', 'on', 'will', '##ome', '##se']
Nepali Tokens:  ['##लाई', '##क', '##र', '##ै', '##्न', 'टमलाई', '##ो', 'थियो', '##ि', 'धेरै']


In [51]:
eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)

nep_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=nep_vocab, lowercase=False
)

### Format the Text Datasets

In [52]:
def preprocess_batch(eng, nep):
    eng = eng_tokenizer(eng)
    nep = nep_tokenizer(nep)

    # pad eng to max_sequence_length
    eng_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value = eng_tokenizer.token_to_id("[PAD]"),
    )

    eng = eng_start_end_packer(eng)

    # add special tokens [start] and [end] and pad nep
    nep_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length = MAX_SEQUENCE_LENGTH + 1,
        start_value = nep_tokenizer.token_to_id("[START]"),
        end_value = nep_tokenizer.token_to_id("[END]"),
        pad_value = nep_tokenizer.token_to_id("[PAD]")
    )

    nep = nep_start_end_packer(nep)

    return (
        {
        "encoder_inputs": eng,
        "decoder_inputs": nep[:, :-1]
        },
        nep[:, 1:],
    )

def make_dataset(pairs):
    eng_texts, nep_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    nep_texts = list(nep_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, nep_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=AUTOTUNE)
    # Cache before shuffling so each epoch sees a fresh batch order; prefetch last.
    return dataset.cache().shuffle(2048).prefetch(AUTOTUNE)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
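The slicing at the end of `preprocess_batch` sets up teacher forcing: the decoder receives the Nepali sequence starting at `[START]`, and the target is the same sequence shifted one position to the left. Below is a small pure-Python sketch of that shift. `pack` is a hypothetical stand-in for `StartEndPacker`, the token IDs 11 to 13 are made up, and the sequence length is shortened for readability.

```python
MAX_SEQUENCE_LENGTH = 6               # shortened from 40 for readability
PAD_ID, START_ID, END_ID = 0, 2, 3    # positions in reserved_tokens


def pack(token_ids, sequence_length):
    """Add [START] and [END], truncating the body if needed, then pad to length."""
    body = token_ids[: sequence_length - 2]
    packed = [START_ID] + body + [END_ID]
    return packed + [PAD_ID] * (sequence_length - len(packed))


# Pack to one token more than the model length so both slices have that length.
packed = pack([11, 12, 13], MAX_SEQUENCE_LENGTH + 1)
decoder_inputs = packed[:-1]   # what the decoder sees
targets = packed[1:]           # what it should predict at each position
# packed:         [2, 11, 12, 13, 3, 0, 0]
# decoder_inputs: [2, 11, 12, 13, 3, 0]
# targets:        [11, 12, 13, 3, 0, 0]
```

Packing to `MAX_SEQUENCE_LENGTH + 1` is what allows both slices to end up with exactly `MAX_SEQUENCE_LENGTH` tokens.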

### Building the Translator Model

In [53]:
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length = MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim = INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=NEP_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(NEP_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)

In [54]:
transformer.summary()

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 token_and_position_embeddi  (None, None, 256)            3850240   ['encoder_inputs[0][0]']      
 ng_5 (TokenAndPositionEmbe                                                                       
 dding)                                                                                           
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []                  

In [55]:
transformer.compile(
    "rmsprop", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

In [56]:
transformer.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fe7b18c32e0>

In [57]:
transformer.save('translator.keras')

In [62]:
def decode_sequences(input_sentences):
    batch_size = tf.shape(input_sentences)[0]

    encoder_input_tokens = eng_tokenizer(input_sentences).to_tensor(
        shape=(None, MAX_SEQUENCE_LENGTH)
    )

    def next_token(prompt, cache, index):
        # The translator ends in a softmax, so these are probabilities rather
        # than raw logits; greedy argmax picks the same token either way.
        logits = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        hidden_states = None
        return logits, hidden_states, cache

    # Start from [START] followed by padding that the sampler fills in.
    length = MAX_SEQUENCE_LENGTH
    start = tf.fill((batch_size, 1), nep_tokenizer.token_to_id("[START]"))
    pad = tf.fill((batch_size, length - 1), nep_tokenizer.token_to_id("[PAD]"))
    prompt = tf.concat((start, pad), axis=-1)

    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next_token,
        prompt,
        end_token_id=nep_tokenizer.token_to_id("[END]"),
        index=1
    )
    generated_sentences = nep_tokenizer.detokenize(generated_tokens)
    return generated_sentences

test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(5):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences(tf.constant([input_sentence]))
    translated = translated.numpy()[0].decode("utf-8")
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()

** Example 0 **
i'll call you right back.
म स्क्वाइनँ ।

** Example 1 **
the last time i saw tom was in october.
स्बाम्रेलनुबा बुराहनन् ।

** Example 2 **
who are you calling for?
यो स्वाताइनुभर ?

** Example 3 **
i think i'd like to do that.
म मानेर्कनुँ ।

** Example 4 **
i got used to wearing a mask.
म स्क्वासनुर्कनुरको छ ।
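The translations above were produced by greedy decoding: starting from `[START]`, the decoder repeatedly appends the single highest-scoring next token until it emits `[END]` or reaches the length limit. The loop can be sketched in pure Python as follows. `greedy_decode`, `toy_scores`, and the `TRANSITIONS` table are hypothetical; a lookup table stands in for the trained transformer.

```python
START_ID, END_ID = 2, 3  # positions of "[START]" and "[END]" in reserved_tokens

# Hypothetical stand-in for the model: each token has one preferred successor.
TRANSITIONS = {2: 4, 4: 5, 5: 3}
TOY_VOCAB_SIZE = 6


def toy_scores(tokens):
    """Return one score per vocabulary entry, given the tokens generated so far."""
    scores = [0.0] * TOY_VOCAB_SIZE
    scores[TRANSITIONS[tokens[-1]]] = 1.0
    return scores


def greedy_decode(next_scores, start_id, end_id, max_length):
    """Append the highest-scoring token until end_id appears or max_length is reached."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = next_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == end_id:
            break
    return tokens


decoded = greedy_decode(toy_scores, START_ID, END_ID, max_length=40)
truncated = greedy_decode(toy_scores, START_ID, END_ID, max_length=3)
```

Greedy decoding is deterministic, so a model trained on very little data, as here, will repeat the same mistakes for the same input. Sampling or beam search are common alternatives, though they would not compensate for the small training set.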

