## Instruction fine-tuning Kinyarwanda

This is an experimental notebook for fine-tuning Llama 3 on Kinyarwanda. In this notebook we fine-tune for machine translation.

We use:
- a llama3-8b model that was "continue-pretrained" on Kinyarwanda
- Unsloth as the fine-tuning framework
- datasets: a Kinyarwanda-English parallel corpus (`kinyarwanda_MT.jsonl`, loaded in section 2)


In [6]:
#Import libraries 

from unsloth import FastLanguageModel
import torch

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments


from datasets import load_dataset
from datasets import Dataset

import json 
import pandas as pd 

import random 




## 1. Loading the model & fine-tuning parameters

In [2]:
# load the model with Unsloth

max_seq_length = 2048 # can be increased for longer contexts
dtype = None # auto-detected: float16 for Tesla T4 / V100, bfloat16 for Ampere and newer
load_in_4bit = True # 4-bit quantization to reduce memory usage

# pre-trained model (local path to the continue-pretrained checkpoint)

xmodel = '/home/mike/xGitHubRepos/kinyarwanda_ft_llm/02_continue_pretraining/llamarwanda_rw_v002'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = xmodel , 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)



==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj",
                      "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], 
                      # "embed_tokens" and "lm_head" are excluded here; they are only targeted for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32
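
The LoRA settings above can be related to the trainable-parameter count that the trainer reports later in this notebook (335,544,320). The sketch below does that arithmetic and also compares the classic LoRA scaling with the rank-stabilized one selected by `use_rslora = True`. The layer shapes are assumptions taken from the public Llama 3 8B configuration, not from this notebook, and `lora_param_count` is a hypothetical helper.

```python
import math

# Assumed Llama 3 8B shapes (not stated in this notebook).
HIDDEN = 4096         # model hidden size
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336  # MLP inner size
NUM_LAYERS = 32       # matches "patched 32 layers" in the log above

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank):
    """LoRA adds A (rank x in) and B (out x rank) for every targeted matrix."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


rank, alpha = 128, 32
trainable = lora_param_count(rank)
classic_scale = alpha / rank            # standard LoRA scaling: alpha / r
rslora_scale = alpha / math.sqrt(rank)  # rank-stabilized scaling: alpha / sqrt(r)

print(trainable)
print(classic_scale, round(rslora_scale, 3))
```

Under these assumptions the count is consistent with rank 128 on the seven listed projections. With `lora_alpha = 32`, the rank-stabilized scaling is roughly eleven times larger than the classic one, which is worth keeping in mind when choosing the learning rate.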


## 2. Loading the datasets



In [3]:
xlst_all = []

xfile_name = '/home/mike/xTemp_data_infrastructure/_kinyarwanda_datasets/kinyarwanda_MT.jsonl'

with open(xfile_name, 'r') as file:
    for line in file:
        xjson = json.loads(line)
        xlst_all.append( xjson )
        
print(len(xlst_all))        


47824


In [4]:
def split_list(xlist):
    '''
    split a list into two disjoint halves
    '''
    xhalf = len(xlist) // 2
    xlst_a = xlist[:xhalf]   # first half
    xlst_b = xlist[xhalf:]   # second half
    return xlst_a, xlst_b

In [31]:

# shuffle list 
xlst_all = random.sample(xlst_all, len(xlst_all))
       
# drop records whose 'kin' or 'en' field is not a string
xlst_all = [x for x in xlst_all if isinstance(x.get('kin'), str) ]
xlst_all = [x for x in xlst_all if isinstance(x.get('en'), str) ]

# split the list into test and train

xtest = xlst_all[0:2812]
xtrain = xlst_all[2812:]    
    
print(len(xlst_all), len(xtest), len(xtrain))

# split each set in two: one half for kin->en, the other half for en->kin

xtest_a, xtest_b = split_list(xtest )
xtrain_a, xtrain_b = split_list(xtrain )
len(xtest_a), len(xtest_b), len(xtrain_a), len(xtrain_b)



47812 2812 45000


(1406, 1406, 22500, 22500)
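
Each translation direction is meant to get its own sentence pairs, so it is worth checking that the two halves do not overlap. The sketch below shows one way to do that on stand-in data. The records and the helper name `split_in_halves` are hypothetical; the real records come from `kinyarwanda_MT.jsonl`.

```python
import random


def split_in_halves(items):
    """Return two disjoint halves; the second gets the extra item if the length is odd."""
    half = len(items) // 2
    return items[:half], items[half:]


# Hypothetical stand-in records with the same keys as the real dataset.
records = [{"kin": f"kin-{i}", "en": f"en-{i}"} for i in range(10)]
records.append({"kin": None, "en": "dropped"})  # a malformed pair

random.seed(0)
random.shuffle(records)

# Keep only pairs where both sides are strings.
clean = [
    r for r in records
    if isinstance(r.get("kin"), str) and isinstance(r.get("en"), str)
]

half_a, half_b = split_in_halves(clean)
ids_a = {r["en"] for r in half_a}
ids_b = {r["en"] for r in half_b}

# The halves should cover every clean record exactly once.
print(len(clean), len(half_a), len(half_b), ids_a.isdisjoint(ids_b))
```

Equal lengths alone do not show that a split is correct, so the disjointness check is the informative part.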

In [37]:
def translate_kin_en(xtext):
    '''
    build a kin->en training example with the chat template
    '''
    x1 = 'translate the following text from kinyarwanda to english'
    messages=[{ 'role': 'user', 'content': x1},
              { 'role': 'user', 'content': xtext.get('kin') },
              { 'role': 'assistant', 'content': xtext.get('en')}] 
    # note: add_generation_prompt=True appends an empty assistant header after the
    # reference translation (visible in the printed examples); False would omit it
    inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return inputs
    
def translate_en_kin(xtext):
    '''
    build an en->kin training example with the chat template
    '''
    x1 = 'translate the following text from english to kinyarwanda'

    messages=[{ 'role': 'user', 'content': x1},
              { 'role': 'user', 'content': xtext.get('en') },
              { 'role': 'assistant', 'content': xtext.get('kin')}] 
    # same note as above: the generation prompt adds a trailing assistant header
    inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return inputs



xtext = random.sample(xtest_a,1)[0]    
    
print(translate_kin_en(xtext))
print(translate_en_kin(xtext))

    

<|im_start|>user
translate the following text from kinyarwanda to english<|im_end|>
<|im_start|>user
Niba ubwiye mwarimu wawe ko wabuze umukoro wawe, ntabwo azakwemera.<|im_end|>
<|im_start|>assistant
If you tell your teacher that you have lost your homework, he or she will not accept you.<|im_end|>
<|im_start|>assistant

<|im_start|>user
translate the following text from english to kinyarwanda<|im_end|>
<|im_start|>user
If you tell your teacher that you have lost your homework, he or she will not accept you.<|im_end|>
<|im_start|>assistant
Niba ubwiye mwarimu wawe ko wabuze umukoro wawe, ntabwo azakwemera.<|im_end|>
<|im_start|>assistant
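
Both printed examples end with an extra `<|im_start|>assistant` header, which is what `add_generation_prompt=True` appends. The sketch below imitates the layout seen in the output to make the difference between a training example and an inference prompt explicit. `render_chatml` is a hypothetical stand-in written for illustration; it is not the tokenizer's actual template, which may differ in details.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper that mimics the ChatML layout printed above."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "translate the following text from kinyarwanda to english"},
    {"role": "user", "content": "Mu icupa hari amazi make."},
    {"role": "assistant", "content": "There is little water in the bottle."},
]

# Training example: ends right after the reference translation.
training_text = render_chatml(messages)

# Inference prompt: no reference, ends where the model should start writing.
prompt_text = render_chatml(messages[:2], add_generation_prompt=True)

print(training_text)
print(prompt_text)
```

In this layout a generation prompt is only needed for inference; a training example that already contains the assistant turn is complete without it.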



In [50]:
lsttext_train = []
lsttext_test  = []

for xtext in  xtrain_a:
    xdict = {}
    xdict['text'] =  translate_kin_en(xtext)
    lsttext_train.append(xdict)
for xtext in  xtrain_b:
    xdict = {}
    xdict['text'] =  translate_en_kin(xtext)
    lsttext_train.append(xdict)
    
for xtext in  xtest_a:
    xdict = {}
    xdict['text'] =  translate_kin_en(xtext)
    lsttext_test.append(xdict)    
    
for xtext in  xtest_b:
    xdict = {}
    xdict['text'] =  translate_en_kin(xtext)
    lsttext_test.append(xdict)        
    

# keep a small subset for a quick test run
lsttext_train = lsttext_train[0:1000]
lsttext_test = lsttext_test[0:100]
    
    
dataset_train = Dataset.from_pandas(pd.DataFrame(lsttext_train))
dataset_test  = Dataset.from_pandas(pd.DataFrame(lsttext_test))


dataset_train = dataset_train.shuffle(seed=42)
dataset_test = dataset_test.shuffle(seed=42)

dataset_train, dataset_test



(Dataset({
     features: ['text'],
     num_rows: 1000
 }),
 Dataset({
     features: ['text'],
     num_rows: 100
 }))

In [51]:
dataset_train[1]

{'text': '<|im_start|>user\ntranslate the following text from kinyarwanda to english<|im_end|>\n<|im_start|>user\nMu icupa hari amazi make. <|im_end|>\n<|im_start|>assistant\nThere is little water in the bottle.<|im_end|>\n<|im_start|>assistant\n'}

## 3. Training arguments

In [53]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_train,
    eval_dataset = dataset_test,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        #max_steps = 120,
        #warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 2, 

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs"
    ))


In [54]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
13.82 GB of memory reserved.


## train 

In [55]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 124
 "-____-"     Number of trainable parameters = 335,544,320


Step,Training Loss
1,2.444
2,2.6134
3,2.1869
4,1.759
5,1.2572
6,1.0359
7,0.8578
8,0.8266
9,0.8375
10,0.884
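
The batch size and step count in the banner above follow from the training arguments. The sketch below reproduces them. The rounding rule for the last, incomplete accumulation window is an assumption chosen because it matches the 124 steps reported here; other library versions may round differently.

```python
num_examples = 1000   # size of dataset_train in this test run
per_device_batch = 2  # per_device_train_batch_size
grad_accum = 8        # gradient_accumulation_steps
num_gpus = 1
epochs = 2            # num_train_epochs

# Number of examples that contribute to one optimizer update.
effective_batch = per_device_batch * grad_accum * num_gpus

# Micro-batches the dataloader yields per epoch (ceiling division).
batches_per_epoch = -(-num_examples // (per_device_batch * num_gpus))

# Assumption: micro-batches that do not fill a whole accumulation window
# are not counted as an extra step (floor division).
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = updates_per_epoch * epochs

print(effective_batch, batches_per_epoch, updates_per_epoch, total_steps)
```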


In [56]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

204.3434 seconds used for training.
3.41 minutes used for training.
Peak reserved memory = 13.82 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 58.443 %.
Peak reserved memory for training % of max memory = 0.0 %.


## inference

The tests below use the model trained with the settings above: only 1,000 examples, which takes about 3.4 minutes (204 seconds) to train.



In [58]:
def eval_translate_kin_en(xtext):
    '''
    build a kin->en prompt (no reference translation) for inference
    '''
    x1 = 'translate the following text from kinyarwanda to english'
    messages=[{ 'role': 'user', 'content': x1},
              { 'role': 'user', 'content': xtext }] 
    inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return inputs
    
def eval_translate_en_kin(xtext):
    '''
    build an en->kin prompt (no reference translation) for inference
    '''
    x1 = 'translate the following text from english to kinyarwanda'

    messages=[{ 'role': 'user', 'content': x1},
              { 'role': 'user', 'content': xtext }]
    inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return inputs

In [77]:
def trans_kin_en(xtext):
    '''
    translate from Kinyarwanda to english 
    '''
    
    #apply template 
    xtpl = eval_translate_kin_en(xtext)
    xinputs = tokenizer(xtpl , return_tensors = "pt").to("cuda")
    outputs = model.generate(**xinputs, max_new_tokens = 500, use_cache = True)
    q1 = tokenizer.batch_decode(outputs)[0]
    #extract the part 
    q2 = q1.split('assistant\n')[1].split('<|im_end|>')[0]
    return q2 


def trans_en_kin(xtext):
    '''
    translate from english to Kinyarwanda 
    '''
    
    #apply template 
    xtpl = eval_translate_en_kin(xtext)
    xinputs = tokenizer(xtpl , return_tensors = "pt").to("cuda")
    outputs = model.generate(**xinputs, max_new_tokens = 500, use_cache = True)
    q1 = tokenizer.batch_decode(outputs)[0]
    #extract the part 
    q2 = q1.split('assistant\n')[1].split('<|im_end|>')[0]
    return q2 
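
Both functions recover the translation by splitting the decoded string on `'assistant\n'`, which would also match that substring inside the user text. A slightly stricter variant, sketched below, matches the full assistant header and returns an empty string when no assistant turn is found. `extract_reply` is a hypothetical helper that works on the decoded string only, so it can be tried without a model.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
END_OF_TURN = "<|im_end|>"


def extract_reply(decoded):
    """Return the text of the first assistant turn in a decoded ChatML string."""
    # Look for the full header rather than the bare word "assistant".
    _, found, tail = decoded.partition(ASSISTANT_HEADER)
    if not found:
        return ""  # no assistant turn present
    # Keep everything up to the end-of-turn marker, dropping later output.
    return tail.split(END_OF_TURN)[0].strip()


decoded = (
    "<|im_start|>user\ntranslate the following text from kinyarwanda to english<|im_end|>\n"
    "<|im_start|>user\nMu icupa hari amazi make.<|im_end|>\n"
    "<|im_start|>assistant\nThere is little water in the bottle.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(extract_reply(decoded))
```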

    

In [76]:
## texts are taken from the headlines of https://igihe.com/index.php
## on 2 July 2024 



xtexts = ['Israel yasabye abahungiye mu Majyepfo ya Gaza kongera guhungira aho bavuye',
         'Bugarama: Inkomoko y’umuco wo gushyingura umuntu babyina',
         'Perezida Biden yitiranyije u Bufaransa n’u Butaliyani',
         'Ishyaka ritavuga rumwe na Perezida Macron ryatsinze icyiciro cya mbere cy’amatora y’abadepite',
         'Museveni yabwiye urubyiruko imyaka myiza yo gukora imibonano mpuzabitsina',
         'Ibibazo bitanu wibaza ku modoka zikoresha amashanyarazi mu Rwanda, n’ibisubizo byabyo',
          'Ese Iran ishobora gufasha Hezbollah mu ntambara na Israel?',
          'Hagaragajwe uruhare rw’abajyanama b’ubuzima mu kurwanya indwara z’ibyorezo',
          'Euro 2024: Espagne yandagaje Georgie isanga u Budage muri ¼',
          'Perezida Kagame na Motsepe wa CAF batashye Stade Amahoro nshya'
         ]

for xtext in xtexts :
    x = trans_kin_en(xtext)
    print('text kinyarwanda:', xtext)
    print('text english:',  x)    
    print('-----')
    



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


text kinyarwanda: Israel yasabye abahungiye mu Majyepfo ya Gaza kongera guhungira aho bavuye
text english: Israel has asked refugees in Southern Gaza to return to their homes
-----




text kinyarwanda: Bugarama: Inkomoko y’umuco wo gushyingura umuntu babyina
text english: Bugarama: The origin of the culture of burying a person dancing
-----




text kinyarwanda: Perezida Biden yitiranyije u Bufaransa n’u Butaliyani
text english: Biden confused France with Italy
-----




text kinyarwanda: Ishyaka ritavuga rumwe na Perezida Macron ryatsinze icyiciro cya mbere cy’amatora y’abadepite
text english: The party opposed to President Macron won the first round of the parliamentary elections
-----




text kinyarwanda: Museveni yabwiye urubyiruko imyaka myiza yo gukora imibonano mpuzabitsina
text english: Museveni told the youth the good age for having sex
-----




text kinyarwanda: Ibibazo bitanu wibaza ku modoka zikoresha amashanyarazi mu Rwanda, n’ibisubizo byabyo
text english: Five Questions about Electric Cars in Rwanda, and their Answers
-----




text kinyarwanda: Ese Iran ishobora gufasha Hezbollah mu ntambara na Israel?
text english: Can Iran help Hezbollah in the war with Israel?
-----




text kinyarwanda: Hagaragajwe uruhare rw’abajyanama b’ubuzima mu kurwanya indwara z’ibyorezo
text english: The role of health advisors in preventing pandemic diseases has been highlighted
-----




text kinyarwanda: Euro 2024: Espagne yandagaje Georgie isanga u Budage muri ¼
text english: Euro 2024: Spain humiliates Georgia finds Germany in ¼
-----
text kinyarwanda: Perezida Kagame na Motsepe wa CAF batashye Stade Amahoro nshya
text english: Perezida Kagame and Motsepe of CAF open the new Amahoro Stadium
-----


In [81]:
##  headlines from the Guardian and BBC on 2 July 2024 


xtexts = ['Marine Le Pen says National Rally should not try to form government without a majority',
          'Far-right politician says National Rally ‘wish to govern’ France but cannot do so properly without a majority',
          'Portugal and Ronaldo save face as Costa’s shootout heroics sink Slovenia',
          'Greece introduces ‘growth-oriented’ six-day working week',
          'Biden denounces supreme court decision on Trump immunity: ‘He’ll be more emboldened’',
          'At least 39 killed in Kenya’s anti-tax protests, says rights watchdog',
          'Girmay first black African to win Tour de France stage',
          'Suspected female suicide bombers death toll rises to 32 in Nigeria',
          'The Moroccan man sentenced to death for fighting for Ukraine',
         'Zelensky sacks top general accused of incompetence']
          


for xtext in xtexts :
    x = trans_en_kin(xtext)
    print('text english:', xtext)
    print('text Kinyarwanda:',  x)    
    print('-----')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


text english: Marine Le Pen says National Rally should not try to form government without a majority
text Kinyarwanda: Marine Le Pen says National Rally should not try to form government without a majority
-----




text english: Far-right politician says National Rally ‘wish to govern’ France but cannot do so properly without a majority
text Kinyarwanda: Un politicien de droite dit au National Rally 'une volonté de gouverner la France sans pouvoir le faire correctement sans une majorité
-----




text english: Portugal and Ronaldo save face as Costa’s shootout heroics sink Slovenia
text Kinyarwanda: Portugal and Ronaldo save face as Costa’s shootout heroics sink Slovenia
-----




text english: Greece introduces ‘growth-oriented’ six-day working week
text Kinyarwanda: Greece yashyizeho icyumweru cyimyaka itandatu cyakazi
-----




text english: Biden denounces supreme court decision on Trump immunity: ‘He’ll be more emboldened’
text Kinyarwanda: Biden denounces supreme court decision on Trump immunity: ‘He’ll be more emboldened’
-----




text english: At least 39 killed in Kenya’s anti-tax protests, says rights watchdog
text Kinyarwanda: At least 39 killed in Kenyas anti-tax protests, says rights watchdog
-----




text english: Girmay first black African to win Tour de France stage
text Kinyarwanda: Girmay first black African to win Tour de France stage
-----




text english: Suspected female suicide bombers death toll rises to 32 in Nigeria
text Kinyarwanda: The death toll from a suspected female suicide bomber in Nigeria rises to 32
-----




text english: The Moroccan man sentenced to death for fighting for Ukraine
text Kinyarwanda: The Moroccan man sentenced to death for fighting for Ukraine
-----
text english: Zelensky sacks top general accused of incompetence
text Kinyarwanda: Zelensky dismisses the top general accused of incompetence
-----
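
In the English to Kinyarwanda run above, several outputs repeat the English input almost verbatim. A rough way to flag such cases automatically is to compare input and output with a string-similarity ratio, as sketched below. The helper name `looks_copied` and the 0.9 threshold are assumptions made for illustration, and this check does not detect output in the wrong language (such as the French line above).

```python
from difflib import SequenceMatcher


def looks_copied(source, output, threshold=0.9):
    """Flag outputs that are (nearly) identical to the input text."""
    ratio = SequenceMatcher(None, source.lower(), output.lower()).ratio()
    return ratio >= threshold


# Three (input, output) pairs taken from the run above.
pairs = [
    ("Girmay first black African to win Tour de France stage",
     "Girmay first black African to win Tour de France stage"),
    ("At least 39 killed in Kenya’s anti-tax protests, says rights watchdog",
     "At least 39 killed in Kenyas anti-tax protests, says rights watchdog"),
    ("Greece introduces ‘growth-oriented’ six-day working week",
     "Greece yashyizeho icyumweru cyimyaka itandatu cyakazi"),
]

flags = [looks_copied(src, out) for src, out in pairs]
for (src, _), flag in zip(pairs, flags):
    print("copied" if flag else "translated?", "|", src)
```

Counting the flagged outputs over the test split would give a simple copy rate to track alongside a translation metric when training on the full dataset.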


## save model 



In [82]:
#model.save_pretrained("llamarwanda_rw_v1") # Local saving
#tokenizer.save_pretrained("llamarwanda_rw_v1")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving