# Large Language Models for Global Minority Languages
## Introduction

The recent success of Large Language Models (LLMs) has driven a great deal of interest in applications ranging from homework help to global domination.

One of the most exciting possibilities for LLMs is their application to problems of knowledge accessibility. Much effort has been devoted to, for example, translating natural language into SQL, the query language used to retrieve data from relational databases.

LLMs can also summarize long documents that may be difficult for individuals with learning differences to comprehend. Additionally, LLMs can rephrase complex or jargon-heavy writing for readers without much expertise in a given subject.

Together, these capabilities offer the possibility of a great leap forward for information accessibility. However, LLMs are almost exclusively developed for English and other majority languages. This leaves a significant gap for those who speak languages that are in the global minority.

One such language is Nepali. Nepali is spoken natively by 16 million people and is used as a second language by an additional 9 million, yet it is rarely available as an option for machine translation and LLMs do not cater to its speakers.

In this project I will attempt to create a first draft of a solution and a potential template for similar projects that could cater to other global minority language groups.

## Text Generator

To start, I will create a mini-GPT model for text generation using the KerasNLP library. This model will be trained on the simplebooks-92 dataset, which uses a simplified English vocabulary. This is useful both for training purposes and for generating output that is readily understandable by individuals who do not speak English as a first language.

### Set Up

In [3]:
import os
import keras_nlp
import tensorflow as tf
from tensorflow import keras

### Settings

Below I've selected some key hyperparameters. Particularly notable here is the minimum training sequence length. Despite its name, the data-loading code compares this value against tf.strings.length, which counts bytes by default, so it sets the minimum length of a line of raw text rather than a number of tokens. We want this number to be large enough that the model is not attempting to train on single words or brief phrases, which may eat up training time while providing little in the way of performance improvements.

In [4]:
BATCH_SIZE = 64
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

EMBED_DIM = 256
FEED_FORWARD_DIM = 256
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000

EPOCHS = 6

NUM_TOKENS_TO_GENERATE = 80

AUTOTUNE = tf.data.AUTOTUNE

### Load Simplebooks Data

In [11]:
raw_train_ds = (
    tf.data.TextLineDataset('simplebooks/simplebooks-92-raw/train.txt')
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

raw_val_ds = (
    tf.data.TextLineDataset("simplebooks/simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
    .batch(BATCH_SIZE)
)
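
To make the effect of the length filter concrete, here is a minimal pure-Python sketch of the same comparison. It assumes the default behaviour of tf.strings.length (counting UTF-8 bytes); the helper name keep_line and the sample lines are hypothetical and are not taken from simplebooks-92.

```python
# Pure-Python stand-in for the tf.data filter above (illustration only).
MIN_TRAINING_SEQ_LEN = 450  # same value as in the Settings cell


def keep_line(line: str, min_len: int = MIN_TRAINING_SEQ_LEN) -> bool:
    """Mirror `tf.strings.length(x) > min_len`, assuming the default byte unit."""
    return len(line.encode("utf-8")) > min_len


# Hypothetical sample lines, not taken from the dataset.
short_line = "The boy ran home."
long_line = "The boy ran home. " * 30  # 18 bytes repeated 30 times = 540 bytes

# Only lines strictly longer than the threshold survive the filter.
kept = [line for line in (short_line, long_line) if keep_line(line)]
```

Because the comparison is strict, a line of exactly 450 bytes is dropped, and the threshold is unrelated to SEQ_LEN, which is measured in tokens.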

### Tokenizer
Here I've defined the vocabulary for the model. This vocabulary is made up of the words and word fragments ('tokens') from the dataset that the model needs to be able to represent and understand.

PAD, UNK, and BOS stand for padding, unknown, and beginning-of-sentence. These are set aside as special tokens rather than words.

I've then loaded in KerasNLP's WordPiece tokenizer and used it to preprocess the data for training. This sets every word to lowercase, splits the text on whitespace and punctuation (punctuation marks are kept as their own tokens), breaks words that are not in the vocabulary into smaller pieces, and then assigns a unique integer to each token. This allows the model to train on the text as numerical data.

In [12]:
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size = VOCAB_SIZE,
    lowercase = True,
    reserved_tokens = ["[PAD]", "[UNK]", "[BOS]"],
)

In [13]:
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)
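
As a rough illustration of what a WordPiece-style tokenizer does, the sketch below applies greedy longest-match-first splitting with a tiny hand-written vocabulary. Everything here is hypothetical: TOY_VOCAB, split_word, and toy_tokenize are made up for the example, and the real vocabulary is learned from the training data by compute_word_piece_vocabulary. It is a simplified approximation, not KerasNLP's implementation.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = ["[PAD]", "[UNK]", "[BOS]", "the", "play", "##ing", "##ed", "dog", "."]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def split_word(word):
    """Greedily split one word into the longest vocabulary pieces available."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOKEN_TO_ID:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text):
    """Lowercase, split on whitespace and punctuation, then map pieces to ids."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [TOKEN_TO_ID[piece] for word in words for piece in split_word(word)]


ids = toy_tokenize("The dog played.")
print(ids)  # [3, 7, 4, 6, 8] -> the, dog, play, ##ed, .
```

Note that "played" is not in the toy vocabulary but is still representable as "play" plus "##ed", which is how a 5000-token vocabulary can cover far more than 5000 distinct words.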

In [27]:
start_packer = keras_nlp.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]")
)

def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels

train_ds = raw_train_ds.map(tf.autograph.experimental.do_not_convert(preprocess), num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE)

val_ds = raw_val_ds.map(tf.autograph.experimental.do_not_convert(preprocess), num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE)
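
The preprocess function pairs each sequence with a copy shifted by one position, so that the model learns to predict the next token at every step. The pure-Python sketch below mimics that shift on a short list of ids. It assumes that the tokenizer pads its output to the sequence length with id 0 and that the start packer prepends [BOS] and then trims back to the same length; the helper names and the small SEQ_LEN are hypothetical choices for readability.

```python
SEQ_LEN = 8            # shortened from 128 so the result is easy to read
PAD_ID, BOS_ID = 0, 2  # positions of [PAD] and [BOS] in the reserved tokens


def pad_or_truncate(ids, length, pad_id=PAD_ID):
    """Return exactly `length` ids, padding on the right or cutting the tail."""
    return (ids + [pad_id] * length)[:length]


def toy_preprocess(token_ids):
    """Build (features, labels) the way the preprocess function is assumed to."""
    labels = pad_or_truncate(token_ids, SEQ_LEN)             # tokenizer output
    features = pad_or_truncate([BOS_ID] + labels, SEQ_LEN)   # shifted right by one
    return features, labels


features, labels = toy_preprocess([3, 7, 4, 6, 8])
print(features)  # [2, 3, 7, 4, 6, 8, 0, 0]
print(labels)    # [3, 7, 4, 6, 8, 0, 0, 0]
```

At every position i, labels[i] is the token that follows features[i], which is exactly the next-token prediction target used for training.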

### Constructing the Model

In [28]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)

x = embedding_layer(inputs)

for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)

outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)

In [32]:
model.compile(optimizer='adam', loss=loss_fn, metrics=[])

In [33]:
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 256)         1312768   
 ng_3 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_decoder_6 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 transformer_decoder_7 (Tra  (None, None, 256)         394749    
 nsformerDecoder)                                                
                                                                 
 dense_19 (Dense)            (None, None, 5000)        1285
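
The parameter counts in the summary can be checked by hand from the hyperparameters. The arithmetic below is a sketch under some assumptions about the layer internals that the summary does not spell out: each attention head has width EMBED_DIM // NUM_HEADS, every projection has a bias, and each decoder block contains two layer normalizations and a two-layer feed-forward network. Under those assumptions the totals match the embedding and decoder rows exactly.

```python
VOCAB_SIZE, SEQ_LEN = 5000, 128
EMBED_DIM, FEED_FORWARD_DIM, NUM_HEADS = 256, 256, 3

# Token embedding table plus position embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM + SEQ_LEN * EMBED_DIM

# Self-attention: 256 does not divide evenly by 3, so each head is 85 wide.
key_dim = EMBED_DIM // NUM_HEADS                           # 85
attn_width = key_dim * NUM_HEADS                           # 255
qkv_params = 3 * (EMBED_DIM * attn_width + attn_width)     # query, key, value
attn_out_params = attn_width * EMBED_DIM + EMBED_DIM       # output projection

# Two layer norms (scale and offset each) and a two-layer feed-forward network.
layer_norm_params = 2 * (2 * EMBED_DIM)
ffn_params = (EMBED_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBED_DIM + EMBED_DIM
)
decoder_params = qkv_params + attn_out_params + layer_norm_params + ffn_params

# Final projection from the hidden size to one logit per vocabulary entry.
dense_params = EMBED_DIM * VOCAB_SIZE + VOCAB_SIZE

print(embedding_params)  # 1312768
print(decoder_params)    # 394749
print(dense_params)      # 1285000
```

The same arithmetic gives 1,285,000 parameters for the final Dense layer, which suggests that the 1285 shown in the last row of the summary has been cut off.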

In [1]:
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

NameError: name 'model' is not defined