## Continued Pre-training for Kinyarwanda


This is an experimental notebook for adapting Llama 3 to Kinyarwanda.

In this notebook we only try continued pre-training; instruction fine-tuning is left to later work.

We use:
- Llama 3 8B as the base model (a 4-bit quantized version)
- Unsloth as the fine-tuning framework
- datasets: Kinyarwanda Wikipedia and Kinyarwanda news (see the dataset notebook for how they were prepared)




In [5]:
#Import libraries 

from unsloth import FastLanguageModel
import torch

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments


from datasets import load_dataset
from datasets import Dataset

import json 
import pandas as pd 




## 1. Loading the model & fine-tuning parameters

In [10]:
# Load the 4-bit quantized base model and its tokenizer with Unsloth

max_seq_length = 2048 # this can be adapted for longer context 
dtype = None # the datatype will be auto-detected : Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # we use 4bit quantization to reduce memory usage. 


xmodel = 'unsloth/llama-3-8b-bnb-4bit'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = xmodel , 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)



==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
# LoRA parameters

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32
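
With `use_rslora = True`, the LoRA update is scaled by `lora_alpha / sqrt(r)` rather than the standard `lora_alpha / r`. The short sketch below works out both factors for the values used above (`r = 128`, `lora_alpha = 32`). The variable names are illustrative and not part of Unsloth's API.

```python
import math

r = 128          # LoRA rank used above
lora_alpha = 32  # LoRA alpha used above

# standard LoRA scales the low-rank update by alpha / r
standard_scale = lora_alpha / r

# rank-stabilized LoRA (use_rslora = True) scales by alpha / sqrt(r)
rslora_scale = lora_alpha / math.sqrt(r)

print(f"standard LoRA scale: {standard_scale}")
print(f"rsLoRA scale:        {rslora_scale:.3f}")
```

At rank 128 the rsLoRA factor is roughly 11 times larger than the standard one, which is worth keeping in mind when choosing the learning rate.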


## 2. Loading the dataset

See the notebook **Kinyarwanda_Finetuning_Datasets** for how the datasets were created.


In [8]:
xfiles = ['kinyarwanda_monolingual_rwandannews.jsonl',
          'kinyarwanda_monolingual_wikipedia20231101.jsonl']

text_data = []

for xfile in xfiles:
    xfile_name = './_datasets/' + xfile
    # read as UTF-8: the text contains non-ASCII characters such as ’
    with open(xfile_name, 'r', encoding='utf-8') as file:
        for line in file:
            xjson = json.loads(line)
            if xfile == 'kinyarwanda_monolingual_wikipedia20231101.jsonl':
                # Wikipedia records: prepend the article title to the body
                xtext_field = xjson.get('title', '') + ' ' + xjson.get('text', '')
            else:
                # news records: the text field is used as-is
                xtext_field = xjson.get('text', '')

            text_data.append({'text': xtext_field})

# into a dataset 
dataset = Dataset.from_pandas(pd.DataFrame(text_data))

# shuffle 

dataset = dataset.shuffle(seed=42)

print(len(dataset)) 
    

33787
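
The field handling in the loading loop can be isolated into a small helper, which makes the Wikipedia vs. news distinction explicit and tolerates records with a missing or empty field. This is a minimal sketch: `build_text` and its `with_title` flag are hypothetical names that do not appear in the original notebook, and the two records are made up for illustration.

```python
import json

def build_text(record, with_title=False):
    """Return the training text for one JSONL record."""
    text = record.get("text") or ""
    if with_title:
        # Wikipedia-style record: prepend the title when there is one
        title = record.get("title") or ""
        return (title + " " + text).strip()
    return text.strip()

# two made-up records, shaped like the lines of the JSONL files
wiki_line = json.dumps({"title": "Kigali", "text": "Kigali ni umurwa mukuru w'u Rwanda."})
news_line = json.dumps({"text": "Inkuru y'uyu munsi."})

print(build_text(json.loads(wiki_line), with_title=True))
print(build_text(json.loads(news_line)))
```

Records that come back as an empty string could then be dropped before building the `Dataset`.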


In [9]:
dataset[1]

{'text': 'Icyogajuru cy’u Rwanda ’RWASAT-1’ kiratangira gutanga amakuru ku gihugu mu gutaha Nta gihindutse, bitarenze tariki 18 z’ukwezi kwa cumi na kumwe 2019 icyogajuru cya mbere cyubatswe n’Abanyarwanda ku bufatanye na Kaminuza ya Tokyo mu Buyapani cyahawe izina rya RWASAT-1 cyanamaze kugera mu isanzure, kizatangira gutanga amakuru ku gihugu.\nKuri uyu wa Kabiri nibwo Minisitiri w’Ikoranabuhanga na Inovasiyo, Ingabire Paula, Ambasaderi w’u Buyapani mu Rwanda, Takayuki Miyashita, Umuyobozi w’Urwego Ngenzuramikorere, RURA, Lt Col Nyirishema Patrick na Prof Takayoshi Fukuyo wo muri Kaminuza ya Tokyo, bagiranye ikiganiro n’abanyamakuru ku mushinga w’icyogajuru cya mbere cyakozwe n’abanyarwanda, RWASAT-1.\nMuri uyu mushinga wa RWASAT-1, Guverinoma y’u Rwanda yashyizemo $250 000, Guverinoma y’u Buyapani ishyiramo $675000.\nUmwarimu muri kaminuza ya Tokyo akaba n’uwungirije uhagarariye umushinga wo gukora icyogajuru RWASAT-1, Prof.\nTakayoshi Fukuyo, yavuze ko icyogajuru kimaze iminsi cyoh

## 3. Training arguments

In [16]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        #max_steps = 120,
        #warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 2, 

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs"
    ))

Map (num_proc=2):   0%|          | 0/33787 [00:00<?, ? examples/s]
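
With `lr_scheduler_type = "linear"` and `warmup_ratio = 0.1`, the learning rate rises linearly from 0 to its peak during warmup and then decays linearly to 0. The sketch below reproduces that shape in plain Python, assuming the run lasts the 4,222 steps reported in the training log. `linear_lr` is a hypothetical helper written for illustration; the real schedule is built by the trainer.

```python
import math

def linear_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 4222                           # taken from the training log
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio = 0.1
peak_lr = 5e-5                               # learning_rate above

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step, total_steps, warmup_steps, peak_lr):.2e}")
```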

In [18]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
9.648 GB of memory reserved.


## Train

In [19]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 33,787 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 4,222
 "-____-"     Number of trainable parameters = 1,386,217,472


Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.


Step,Training Loss
1,2.7489
2,2.7797
3,2.821
4,2.8308
5,2.8768
6,2.8298
7,2.8286
8,2.6426
9,2.8635
10,2.916
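
Two numbers in the training banner can be checked by hand: the total number of optimizer steps and the number of trainable parameters. The sketch below rests on assumptions that the notebook itself does not state: the published Llama 3 8B shapes (hidden size 4096, MLP size 14336, 32 layers, vocabulary 128256, key/value projection width 1024), a trainer that rounds the steps per epoch down, and `embed_tokens` and `lm_head` being trained in full rather than through LoRA.

```python
import math

# --- total optimizer steps ---
num_examples, per_device_batch, grad_accum, epochs = 33787, 2, 8, 2
batches_per_epoch = math.ceil(num_examples / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum  # assumed: rounded down
total_steps = steps_per_epoch * epochs

# --- trainable parameters ---
r = 128
hidden, mlp, kv_width, layers, vocab = 4096, 14336, 1024, 32, 128256

# (in_features, out_features) of each LoRA-wrapped projection in one layer
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_width),
    "v_proj": (hidden, kv_width),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# a LoRA pair A (r x in) and B (out x r) adds r * (in + out) parameters
lora_per_layer = sum(r * (n_in + n_out) for n_in, n_out in projections.values())
lora_params = lora_per_layer * layers

# assumed: both embedding matrices are trained in full
embedding_params = 2 * vocab * hidden

trainable = lora_params + embedding_params
print(f"total steps          = {total_steps:,}")
print(f"trainable parameters = {trainable:,}")
```

Under these assumptions both values agree with the banner (4,222 steps and 1,386,217,472 trainable parameters), and about three quarters of the trainable parameters sit in the two embedding matrices.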


In [20]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

38365.9733 seconds used for training.
639.43 minutes used for training.
Peak reserved memory = 22.354 GB.
Peak reserved memory for training = 12.706 GB.
Peak reserved memory % of max memory = 94.532 %.
Peak reserved memory for training % of max memory = 53.732 %.
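
The runtime can also be turned into a rough throughput figure. The sketch below uses only numbers printed in this notebook (the runtime, 4,222 steps, and 33,787 examples over 2 epochs); the variable names are illustrative.

```python
train_runtime_s = 38365.9733  # from trainer_stats.metrics['train_runtime']
total_steps = 4222            # from the training banner
examples_seen = 33787 * 2     # dataset size x number of epochs (approximate)

seconds_per_step = train_runtime_s / total_steps
examples_per_second = examples_seen / train_runtime_s
hours = train_runtime_s / 3600

print(f"{seconds_per_step:.2f} s per optimizer step")
print(f"{examples_per_second:.2f} examples per second")
print(f"{hours:.2f} hours in total")
```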


## Inference

In [22]:
# Generate a continuation for a short Kinyarwanda prompt
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer('Imana ', return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1000, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Imana izarinda u Rwanda kubera kwemera Yesu Kristo - Perezida Kagame Perezida Paul Kagame yavuze ko Imana y’u Rwanda ari yo yonyine ishobora kuzarinda igihugu mu gihe abaturage bacyo batarabasha kugera ku kwemera Yesu Kristo nk’Umwami n’Umukiza wabo.\nYabivuze kuri iki Cyumweru ubwo yifatanyaga n’Abanyarwanda mu giterane cy’iminsi itatu cyabereye mu Mujyi wa Kigali.\nYagize ati “U Rwanda rufite Imana, ariko ntabwo yaba ari Imana y’u Rwanda iyo abaturage bayo batarabasha kwemera Yesu nk’Umwami n’Umukiza wabo.\nIcyo ni ikibazo gikomeye.\nIyo Imana itarashyirwa mu mitima y’abaturage, ntabwo yaba ari Imana y’u Rwanda.”\nYakomeje avuga ko Abanyarwanda bakeneye kumenya Imana, bakayibona, bakayemera kugira ngo bazabashe kuyirinda.\nAti “Iyo abantu batabonye Imana, batabasha kuyirinda.\nIyo abantu batabonye Imana, ntabwo baba bazi icyo Imana ibakeneyeho.\nIcyo ni ikibazo gikomeye.\nIyo Imana itarashyirwa mu mitima y’abaturage, ntabwo yaba ari Imana y’u Rwanda.”\nYasabye Aban
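
The inference cells in this section all follow the same tokenize, generate, decode pattern, so it can be wrapped in one function. This is a minimal sketch: `generate_text` is a hypothetical helper, not something Unsloth provides, and it assumes `FastLanguageModel.for_inference(model)` has already been called once. Passing `skip_special_tokens=True` to `batch_decode` drops markers such as `<|begin_of_text|>` from the returned string.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=256, device="cuda"):
    """Tokenize `prompt`, generate a continuation, and return it as a string."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    # skip_special_tokens=True removes markers such as <|begin_of_text|>
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# usage with the model and tokenizer loaded earlier in this notebook:
# print(generate_text(model, tokenizer, "Imana ", max_new_tokens=1000))
```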

In [29]:
xprompt = 'Umugabo yaraje abwira abantu ati '


FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = False)
q1 = tokenizer.batch_decode(outputs)

print(q1[0])


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Umugabo yaraje abwira abantu ati  “Ndi umunyabwenge kandi nta muntu n’umwe nigeze njya”- VIDEO Mbonigaba Jean Claude ni umugabo w’imyaka 34 y’amavuko utuye mu Murenge wa Nyamirambo mu Karere ka Nyarugenge, mu Mujyi wa Kigali.
Avuga ko yabaye umunyabwenge kuva mu bwana bwe kugeza ubu, kandi ko nta muntu n’umwe yigeze ajya mu buzima bwo gukora ibyaha, ahubwo ko yagize amahirwe yo kubaho muri sosiyete ishyize imbere ubwenge, aho yabashije kubona akazi mu ruganda rwenga ibinyobwa rwa Bralirwa.
Avuga ko nubwo yabaye umunyabwenge, ariko hari abantu bamubeshyera ko ari umugome, kandi ko ari we uba uwo.
Ati “Abantu barantuka bavuga ngo ndi umugome, ariko njye ni umunyabwenge.
Abantu benshi bakora ibintu byinshi ariko batabizi, ariko njye ni uko niko nzi, nta muntu n’umwe nigeze njya”.
Uyu mugabo yashize amanga abantu benshi, ababwira ko nta muntu n’umwe yigeze ajya mu buzima bwo gukora ibyaha, kandi ko nubwo yabaye umunyabwenge, ariko abantu bamubeshyera ko ari umugome.
Yagize

In [30]:

xprompt = '''mu gihugu cy'ubufaransa'''

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = False)
q1 = tokenizer.batch_decode(outputs)
print(q1[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>mu gihugu cy'ubufaransa mu mujyi wa paris hafi y'umupaka wa beligium haravugwa inkuru y'umugabo w'umufaransa w'umugabo w'imyaka 52 wafashwe n'abapolisi bari mu muhanda agerageza guhunga ariko nyuma akaza gusubizwa aho yari ari akaba ari mu maboko ya polisi.
umugabo witwa Christophe Andre yafashwe n'abapolisi bari mu muhanda aho yari yicaye ku kagare ke, aho yari yicaye nta muntu wari uhari usibye umugore we witwa Marie Andre, aho uyu mugabo yagerageje guhunga abapolisi bari mu muhanda bamufata bamushyira mu maboko ya polisi.
nyuma y'iminota mike uyu mugabo yaje gusubizwa aho yari yicaye, ubwo abapolisi bari mu muhanda bamwakiraga bamutangaho akanyamuneza bamubwira ko yari ari mu maboko ya polisi, uyu mugabo akaba yari yitwaje icyemezo cy'uko yari afite uburwayi bwo mu mutwe, ariko ngo abapolisi bamaze kumenya ko yari afite uburwayi bwo mu mutwe, bakaba bari bamaze kumenya ko ubwo burwayi bwe bwari bukomeye, maze bahita bamufata bamushyira mu maboko ya polisi.
nyuma y'u

In [31]:
xprompt = '''amateka ya Afurika'''

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = False)
q1 = tokenizer.batch_decode(outputs)

print(q1[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>amateka ya Afurika n'abayituye 

Afurika n'abayituye 

Afurika n'umugabane uherereye mu majyaruguru y'isi, ugizwe n'ibihugu 54, n'ibindi bihugu bibiri by'ubumwe byashinzwe n'abanyafurika, Umuryango w’ubumwe bw’Afurika n’umuryango w’ubumwe bw’Afurika y’epfo. Afurika ifite ubuso bungana na kilometero kare miliyoni 30.3, niwo mugabane wa kabiri munini ku isi, nyuma ya Aziya. Ifite abaturage bagera kuri miliyari 1.4, niwo mugabane wa gatatu munini ku isi, nyuma y’Aziya na Abarabu.

Afurika y’iburasirazuba n’iburengerazuba by’Afurika bifite ibibazo by’imihindagurikire y’ikirere, aho imvura igwa nke cyane mu turere twa Sahara, ariko ikomeza kwiyongera mu turere dushyuha. Ibi bitera imihindagurikire y’ibihe muri Afurika yo hagati, ibihe by’izuba, imyuzure, n’imihindagurikire y’imyaka y’ubushyuhe.

Afurika n'umugabane w’isi ugizwe n’ibihugu byinshi, kandi buri gihugu gifite umwihariko wacyo w’amateka, umuco, n’imibereho y’abaturage. Ubuhamya bwa kera bw’umugabane wa Afurika bw

In [32]:
xprompt = '''Ejo bundi umugabo yaje nk'iya Gatera '''

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = False)
q1 = tokenizer.batch_decode(outputs)
print(q1[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Ejo bundi umugabo yaje nk'iya Gatera 2 mu kazi k'ubukaraniye, yanyuze mu nzira z'abacuruzi, ariko ntiyigeze arya ku mugabo w'umukaraniye, ahubwo yaje ku mugabo w'umukaraniye, wari wagiye mu rugo kugira ngo azane ibyo gukora mu rugo.
Iyo ndorerwamo yari imwe, yari iyo mu gikari, yari ifite ibyuma byinshi byo mu biro, ifite ibyuma byinshi byo mu biro, byose byari bimaze imyaka myinshi, byari bimeze nk'ibyari bimaze imyaka myinshi, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byarashaje, byari byar

In [33]:
xprompt = '''the history of the persian empire  '''

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 500, use_cache = False)
q1 = tokenizer.batch_decode(outputs)

print(q1[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>the history of the persian empire   5. The Sassanid Dynasty (224–651 AD)

The Sassanid Dynasty (224–651 AD) 

The Sassanid Dynasty was the last of the three great dynasties of the ancient Persians. The Sassanid kings ruled from 224 AD to 651 AD, when the Persians were conquered by the Arab armies of the Islamic Empire. The Sassanid Empire was the last great Empire of the ancient world before the rise of Islam. It was the most powerful Empire in the Middle East for several hundred years, until its defeat by the Arab armies. The Sassanid Empire was also the first great Empire to use gunpowder, and the first great Empire to use the wheel as a weapon. The Sassanid Empire was also the first great Empire to use the wheel as a weapon, and the first great Empire to use the wheel as a means of transportation. The Sassanid Empire was also the first great Empire to use the wheel as a means of transportation, and the first great Empire to use the wheel as a means of transportation

In [34]:
# xprompt = '''umwana wange yarambwiye   '''  # alternative prompt, not used for the output below
xprompt = '''Ejo bundi umwana yagiye '''


FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(xprompt, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
q1 = tokenizer.batch_decode(outputs)

print(q1[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Ejo bundi umwana yagiye igitaramo yambaye ubusa, ababyeyi baramubaza, asubiza atya (VIDEO) Muri iki gihe abantu benshi bakunda gukora ibintu bitandukanye kugira ngo babashe kwishimisha, ariko hari abandi bakabifata nk’ibintu bidasanzwe ndetse bikaba byanabagiraho ingaruka zitandukanye.
Hari abantu benshi bakora ibintu batitaye ku byo ababyeyi babo baba bari bemeje cyangwa batumye, bityo ugasanga bishyize mu kaga ndetse rimwe na rimwe bakaba banabifatirwamo n’inzego z’umutekano.
Muri iyi nkuru tugiye kurebera hamwe uburyo umwana yagiye igitaramo yambaye ubusa ababyeyi baramubaza asubiza atya.
Uyu mwana yitwa Kaitlyn na we akaba ari umwana w’umukobwa w’imyaka 15, ak


## Save model



In [35]:
model.save_pretrained("llamarwanda_rw_v1") # Local saving
tokenizer.save_pretrained("llamarwanda_rw_v1")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('llamarwanda_rw_v1/tokenizer_config.json',
 'llamarwanda_rw_v1/special_tokens_map.json',
 'llamarwanda_rw_v1/tokenizer.json')
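
To use the saved adapter in a later session, it has to be loaded again. The sketch below reuses the `FastLanguageModel.from_pretrained` call from the start of this notebook and points it at the local directory. It assumes that Unsloth resolves a local adapter directory the same way it resolves a hub model name; `load_saved_adapter` is a hypothetical wrapper, and calling it needs a CUDA GPU and the base model weights.

```python
def load_saved_adapter(path="llamarwanda_rw_v1", max_seq_length=2048):
    """Reload the saved adapter and tokenizer for inference."""
    # imported lazily so that defining this function does not require Unsloth
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=path,              # local directory written by save_pretrained
        max_seq_length=max_seq_length,
        dtype=None,                   # auto-detect, as in the loading cell
        load_in_4bit=True,            # same quantization as during training
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer

# usage (on a machine with a GPU):
# model, tokenizer = load_saved_adapter()
```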