## Continued Pre-training for Kinyarwanda


This is an experimental notebook to fine-tune Llama 3 for Kinyarwanda.

In this notebook we only try continued pre-training.

Instruction fine-tuning is left to later work.


We use:
- Llama 3 8B as the base model (a 4-bit quantized version)
- Unsloth as the fine-tuning framework
- Datasets: Kinyarwanda Wikipedia and Kinyarwanda news (see the dataset notebook for how they were prepared)




In [1]:
#Import libraries 

from unsloth import FastLanguageModel
import torch

from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments


from datasets import load_dataset
from datasets import Dataset

import json 
import pandas as pd 




🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [2]:
import datetime

## 1 . loading the model & fine-tuning parameters 

In [3]:
# we use unsloth & here we load the model 

max_seq_length = 2048 # this can be adapted for longer context 
dtype = None # the datatype will be auto-detected : Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # we use 4bit quantization to reduce memory usage. 


xmodel = 'unsloth/llama-3-8b-bnb-4bit'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = xmodel , 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)



==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
## parameters 

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Casting embed_tokens to float32
Unsloth: Casting lm_head to float32
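
To make the effect of `use_rslora = True` concrete: in the PEFT library's LoRA implementation, the adapter update is scaled by `lora_alpha / r` by default and by `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled. The sketch below only reproduces that arithmetic for the values used in this notebook. `lora_scaling` is a hypothetical helper written for illustration, not a function from Unsloth or PEFT.

```python
import math


def lora_scaling(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Return the factor the LoRA update is multiplied by (hypothetical helper)."""
    if use_rslora:
        # rank-stabilized LoRA: divide by the square root of the rank
        return lora_alpha / math.sqrt(r)
    # classic LoRA: divide by the rank itself
    return lora_alpha / r


r, lora_alpha = 128, 32  # values from the get_peft_model call above

print("classic LoRA scaling:", lora_scaling(r, lora_alpha, use_rslora=False))
print("rsLoRA scaling:      ", round(lora_scaling(r, lora_alpha, use_rslora=True), 3))
```

With r = 128 the two settings differ by a factor of sqrt(128), roughly 11, so switching `use_rslora` on or off changes how strongly the adapters act and may call for a different learning rate.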


## 2 . loading the dataset 

see notebook **Kinyarwanda_Finetuning_Datasets** on how the datasets were created 


In [5]:
xdirectory = '/home/mike/xTemp_data_infrastructure/_kinyarwanda_datasets/'

xfiles = ['kinyarwanda_monolingual_rwandannews.jsonl',
          'kinyarwanda_monolingual_wikipedia20231101.jsonl',
         'kinyarwanda_monolingual_newssites.json']



text_data = []

for xfile in xfiles:
    xfile_name = xdirectory + xfile
    with open(xfile_name, 'r') as file:
        for line in file:
            xjson = json.loads(line)
            
            # some entries have no "text" field and are skipped (this should be fixed in the datasets)
            if xjson.get('text'):
                

                if xfile in ['kinyarwanda_monolingual_wikipedia20231101.jsonl', 
                             'kinyarwanda_monolingual_newssites.json']:
                    xtext_field = xjson.get('title') + ' ' + xjson.get('text')
                else:
                    xtext_field =  xjson.get('text')                

                xdict_text = {'text': xtext_field}


                text_data.append(xdict_text)

# into a dataset 
dataset = Dataset.from_pandas(pd.DataFrame(text_data))

# shuffle 

dataset = dataset.shuffle(seed=42)

print(len(dataset)) 
    

78353
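
The record handling in the loop above can also be written as a small function, which makes it easy to test on a few lines before running it over the full files. This is a minimal sketch: `build_text` is a hypothetical name, the sample records are invented, and the handling of a missing `title` is an assumption (the loop above expects every Wikipedia and news-site record to have one).

```python
import json
from typing import Optional


def build_text(record: dict, prepend_title: bool) -> Optional[str]:
    """Return the training text for one JSONL record, or None if it has no text."""
    text = record.get("text")
    if not text:
        return None  # skip records without usable text
    title = record.get("title") or ""  # assumption: treat a missing title as empty
    if prepend_title and title:
        return f"{title} {text}"
    return text


# invented sample lines in the same JSONL shape as the dataset files
sample_lines = [
    '{"title": "Kigali", "text": "Kigali ni umurwa mukuru w\'u Rwanda."}',
    '{"title": "Nta nyandiko", "text": ""}',
    '{"text": "Inkuru idafite umutwe."}',
]

rows = []
for line in sample_lines:
    text = build_text(json.loads(line), prepend_title=True)
    if text is not None:
        rows.append({"text": text})

print(rows)
```

The resulting list of dictionaries has the same shape as `text_data`, so it can be passed to `pd.DataFrame` and `Dataset.from_pandas` in the same way.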


In [6]:
## add EOS 

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
print(EOS_TOKEN)

def add_eos(example):
    example['text'] = example['text'] + ' ' + EOS_TOKEN
    return example

# Apply the function to the dataset
dataset = dataset.map(add_eos)

dataset[0]

<|end_of_text|>


Map:   0%|          | 0/78353 [00:00<?, ? examples/s]

{'text': '"Amafaranga dukuramo ntiwayagura imodoka, ntayindi politiki y’abarobyi uretse inzoga no kurongora. " Abarobyi bo mu Kivu Ntibimenyerewe  mu muco nyarwanda ko umuntu yiyemerera ko akora akazi k’uburaya kumugaragaro, ariko aba barobyi bo mu karere ka Rubavu mu murenge wa Nyamyumba ahazwi nko kuri Braseries, baratangaza ko bitewe n’uko amafaranga bakura muri ubu burobyi ari make ntakindi bayamaza, bo bayajyana mu nzoga n’indaya gusa akaba ari nayo nkomoko y’ubwiyongere bw’agakoko gatera SIDA.Ubuyobozi bw’aka karere ka Rubavu bwemera ko icyo kibazo cy’ubusambanyi gihari bityo bukanagaragaza ingamba zo kukirwanya.\xa0(...)Ntibimenyerewe  mu muco nyarwanda ko umuntu yiyemerera ko akora akazi k’uburaya kumugaragaro, ariko aba barobyi bo mu karere ka Rubavu mu murenge wa Nyamyumba ahazwi nko kuri Braseries, baratangaza ko bitewe n’uko amafaranga bakura muri ubu burobyi ari make ntakindi bayamaza, bo bayajyana mu nzoga n’indaya gusa akaba ari nayo nkomoko y’ubwiyongere bw’agakoko gate

## 3 . training arguments 

In [7]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2, 
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        #max_steps = 120,
        #warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 3, 

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,

        output_dir = "outputs",
        save_strategy = "epoch",  # write a checkpoint at the end of every epoch
        save_steps = 100,         # only used when save_strategy = "steps"
    ))

Map (num_proc=2):   0%|          | 0/78353 [00:00<?, ? examples/s]
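
The batch and schedule settings above determine how many optimizer steps the run takes. The arithmetic below is a sketch under stated assumptions: a single GPU, no sequence packing (one dataset row is one training example), and the floor division that the Hugging Face `Trainer` applied to gradient-accumulation steps at the time of this run. The variable names are chosen here for illustration.

```python
import math

num_examples = 78_353               # len(dataset) printed above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 3
warmup_ratio = 0.1

# examples seen per optimizer update on one GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# dataloader batches per epoch (the last, partial batch is kept)
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)

# optimizer updates per epoch (assumption: floor division, at least 1)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_steps = updates_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size = {effective_batch_size}")
print(f"total steps          = {total_steps}")
print(f"warmup steps         = {warmup_steps}")
```

The total should agree with the "Total steps" figure that Unsloth prints when training starts. With `logging_steps = 1`, every one of these steps produces a row in the loss table.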

In [8]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
9.648 GB of memory reserved.


## train 

In [9]:
print(datetime.datetime.now())

2024-06-28 15:35:30.858142


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 78,353 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 14,691
 "-____-"     Number of trainable parameters = 1,386,217,472


Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.


Step,Training Loss
1,2.5566
2,2.8776
3,2.7994
4,2.792
5,2.7364
6,2.7137
7,2.8686
8,2.7812
9,2.8058
10,2.6835
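
The training loss reported above is the mean cross-entropy per token (in nats), so exponentiating it gives a per-token perplexity that is sometimes easier to read. The snippet below is a small sketch using the ten logged values. Note that it is a training-set figure tied to the Llama 3 tokenizer, measured during warmup, and not a held-out evaluation.

```python
import math

# first ten training-loss values from the log above
losses = [2.5566, 2.8776, 2.7994, 2.792, 2.7364,
          2.7137, 2.8686, 2.7812, 2.8058, 2.6835]

mean_loss = sum(losses) / len(losses)
perplexity = math.exp(mean_loss)  # per-token perplexity

print(f"mean loss  = {mean_loss:.4f}")
print(f"perplexity = {perplexity:.2f}")
```

Tracking this value at each checkpoint gives a rough sense of how much the model's fit to Kinyarwanda text improves over the run.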


In [None]:
print(datetime.datetime.now())

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## save model 

In [None]:
model.save_pretrained("llamarwanda_rw_v002") # Local saving
tokenizer.save_pretrained("llamarwanda_rw_v002")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving