# Kinyarwanda datasets for fine-tuning 

Datasets that can be used to explore the potential of fine-tuning LLMs to work with Kinyarwanda.


**1. Monolingual corpora**

Plain texts in Kinyarwanda.

**2. Parallel corpora**

Texts in Kinyarwanda paired with translations in other languages.

**3. Instruction corpora**

Texts suitable for instruction fine-tuning.

**4. Other**

Other types of text.


This Jupyter notebook contains scripts that prepare the datasets in more or less the same format, to simplify their use in fine-tuning.
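One possible reading of "more or less the same format" is a record that always carries a `text` field, with everything else kept as metadata. The sketch below illustrates that idea only. The helper name `to_common_record` and the `source` and `meta` fields are assumptions made for illustration, not something this notebook defines, and the news row is made-up sample data.

```python
import json

def to_common_record(record, source, text_key="text"):
    """Return a dict with 'source', 'text' and 'meta' (all remaining fields).

    Hypothetical helper: the field names are an assumption, not a fixed schema.
    """
    meta = {k: v for k, v in record.items() if k != text_key}
    return {"source": source, "text": record[text_key], "meta": meta}

# Sample rows shaped like the Wikipedia and news records handled in this notebook
wiki_row = {"id": "1651", "url": "https://rw.wikipedia.org/wiki/Afurika",
            "title": "Afurika", "text": "Afurika ni umugabane ..."}
news_row = {"id_1": "1", "id_2": "2", "label": "3",
            "url": "https://example.org/a", "text": "title line\nbody text"}

common = [to_common_record(wiki_row, "wikipedia"),
          to_common_record(news_row, "news")]
for rec in common:
    # ensure_ascii=False keeps any non-ASCII characters readable in the output
    print(json.dumps(rec, ensure_ascii=False))
```

With such a wrapper, downstream fine-tuning code would only need to read `text`, whatever the original dataset looked like.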

In [None]:
## import needed libraries 

In [1]:

import json
import os 
from datasets import load_dataset

## 2. Monolingual corpora

### 2.1 Wikipedia 

- The Wikimedia Foundation released a dataset on Hugging Face containing the cleaned articles of all languages.

- The dataset corresponds to the Wikipedia dump from 01 November 2023.

- The Kinyarwanda subset is **20231101.rw**.

- It contains 8063 rows and can be explored here: https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.rw


In [4]:


dataset = load_dataset("wikimedia/wikipedia", "20231101.rw")['train']

print('number of texts:', len(dataset)) 

print('example:\n', dataset[3])


number of texts: 8063
example:
 {'id': '1651', 'url': 'https://rw.wikipedia.org/wiki/Afurika', 'title': 'Afurika', 'text': 'Afurika ni umugabane wa kabiri ku isi nini kandi wa kabiri utuwe cyane, nyuma ya  Aziya mubice byombi. Kuri kilometero zigera kuri 30.3 km2 (kilometero kare miliyoni 11.7) harimo ibirwa byegeranye, bifite 20% byubutaka bwisi na 6% byubuso bwose. Hafi ya miliyari 1.4 kugeza mu 2021, bingana na 18% by\'abatuye isi. Abatuye Afurika ni bato mu migabane yose; imyaka yo hagati muri 2012 yari 19.7, mugihe isi yo hagati yisi yari 30.4. Nubwo umutungo kamere utandukanye, Afurika nu mugabane ukize cyane ku mugabane wa buri muntu kandi uwa kabiri ukize cyane ku butunzi bwose, inyuma ya Oseyaniya. Intiti zabyitiriye ibintu bitandukanye birimo geografiya, ikirere, amoko, ubukoloni, Intambara y\'ubutita, neocolonialism, kubura demokarasi, na ruswa. Nubwo ubwo butunzi bwibanze cyane, kwagura ubukungu vuba hamwe n’abaturage benshi n’urubyiruko bituma Afurika iba isoko ry’ubukungu

In [5]:
## save to a JSONL file (one JSON object per line)

with open('kinyarwanda_monolingual_wikipedia20231101.jsonl', 'w', encoding='utf-8') as xfile:
    # Write each article as one JSON object per line
    for xnr, xtext in enumerate(dataset):
        xfile.write(json.dumps(xtext) + '\n')
        #if xnr == 10:
        #    break
            
print('done')  
    


done
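
After an export like the one above, it can be useful to read the file back and confirm that every line parses and carries text. The following is a minimal sketch, assuming each line is a JSON object with a `text` field. The helper name `check_jsonl` is hypothetical, and the demo runs on a small temporary file rather than on the real export.

```python
import json
import os
import tempfile

def check_jsonl(path, text_key="text"):
    """Count the records in a JSONL file and how many lack a non-empty text field."""
    n_records = 0
    n_empty = 0
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            n_records += 1
            if not str(record.get(text_key, "")).strip():
                n_empty += 1
    return {"records": n_records, "empty_text": n_empty}

# Demo on a small temporary file so the sketch runs without the real export
demo_rows = [{"id": "1", "text": "Afurika ni umugabane"}, {"id": "2", "text": ""}]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as fh:
    for row in demo_rows:
        fh.write(json.dumps(row) + "\n")

demo_report = check_jsonl(demo_path)
print(demo_report)
```

Pointed at `kinyarwanda_monolingual_wikipedia20231101.jsonl`, the `records` value should match the number of texts printed when the dataset was loaded.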


### 2.2  Kinyarwanda news 

- Kinyarwanda news is a dataset created by Nzeyimana, A., & Niyongabo Rubungo, A. (2022).
- Reference: 

Nzeyimana, A., & Niyongabo Rubungo, A. (2022). KinyaBERT: a Morphology-aware Kinyarwanda Language Model. ArXiv, abs/2203.08459.


- The dataset, which has about 25k articles, is accessible here: https://github.com/anzeyimana/kinyabert-acl2022


- In this script we format it so that the text body 'text' consists of the title and the text of the article.



In [2]:

xFld = '/home/mike/Downloads/kinyabert-acl2022-master/datasets/RW_NEWS/original/'

xlst_texts = []

for xfile in os.listdir(xFld):
    qq1 = os.path.join(xFld, xfile)

    with open(qq1, 'r', encoding='utf-8') as xff:
        xtext_collection = xff.read()

    # articles are separated by a blank line
    xtexts = xtext_collection.split('\n\n')
    for xtext_all in xtexts:
        if xtext_all:
            # first line is a tab-separated header; the rest is title + body
            xtext_line1, _, xtext = xtext_all.partition('\n')

            xid_1, xid_2, xlabel, xurl = xtext_line1.split('\t')
            xtext_dict = {}
            xtext_dict['id_1'] = xid_1
            xtext_dict['id_2'] = xid_2
            xtext_dict['label'] = xlabel
            xtext_dict['url'] = xurl
            xtext_dict['text'] = xtext
            xlst_texts.append(xtext_dict)
        
        

        
    
with open('kinyarwanda_monolingual_rwandannews.jsonl', 'w') as xfile:
    # Iterate over the dataset 
    for xnr, xtext in enumerate(xlst_texts):
        xfile.write(json.dumps(xtext) + '\n')
        #if xnr == 10:
        #    break
            
print(len(xlst_texts ))      

25724
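
The parsing loop above can also be sketched as a reusable function that works on a string and skips blocks whose header does not fit, instead of raising an error. This is an illustration under assumptions: the layout (blank-line-separated articles, a four-field tab-separated header, then title and body) is inferred from the loop itself, the function name `parse_news_collection` is hypothetical, and the sample string is made up.

```python
def parse_news_collection(raw):
    """Parse the content of one RW_NEWS-style file into a list of dicts.

    Assumed layout: articles separated by a blank line; the first line of
    each article is a tab-separated header (id_1, id_2, label, url); the
    remaining lines are the title and body.
    """
    records, skipped = [], 0
    for block in raw.split("\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        header, _, text = block.partition("\n")
        fields = header.split("\t")
        if len(fields) != 4:
            skipped += 1  # header does not match the assumed layout
            continue
        xid_1, xid_2, xlabel, xurl = fields
        records.append({"id_1": xid_1, "id_2": xid_2, "label": xlabel,
                        "url": xurl, "text": text})
    return records, skipped

# Made-up sample: two well-formed articles and one block with a bad header
sample = (
    "1\t10\t3\thttps://example.org/a\nTitle A\nBody of article A.\n\n"
    "2\t11\t5\thttps://example.org/b\nTitle B\nBody of article B.\n\n"
    "not a valid header\nOrphan text\n"
)
records, skipped = parse_news_collection(sample)
print(len(records), skipped)
print(records[0])
```

One limitation of this layout assumption: an article whose body itself contains a blank line would be split into two blocks, and the second one would be counted as skipped.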


## Count tokens

In [None]:

# we use the Llama 3 tokenizer to count the tokens
from transformers import AutoTokenizer
# Load the Llama 3 tokenizer (non-gated model)
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct", use_fast=True)

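def tokenize_and_count(xText):
    '''Count the words and tokens of a single text.

    Texts are processed one by one; the function could be improved
    to run in parallel.
    '''
    # Tokenize the text without truncation or padding
    tokens = tokenizer(xText, truncation=False, padding=False, return_length=True)
    # split() with no argument splits on any run of whitespace, so newlines
    # and repeated spaces do not distort the word count
    xlen_words = len(xText.split())
    xlen_tokens = tokens['length'][0]
    xdict = {'length_words': xlen_words, 'length_tokens': xlen_tokens }
    return xdict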
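
To get corpus-level numbers, the per-text counts can be summed over a whole dataset. The sketch below is one way to do that, assuming a counting function that returns `length_words` and `length_tokens` as `tokenize_and_count` does. The names `summarize_counts` and `whitespace_count` are hypothetical, and the whitespace counter is only a stand-in so the example runs without downloading a tokenizer.

```python
def summarize_counts(texts, count_fn):
    """Sum word and token counts over an iterable of texts."""
    totals = {"n_texts": 0, "length_words": 0, "length_tokens": 0}
    for text in texts:
        counts = count_fn(text)
        totals["n_texts"] += 1
        totals["length_words"] += counts["length_words"]
        totals["length_tokens"] += counts["length_tokens"]
    # Average number of tokens per word; 0.0 when there are no words at all
    words = totals["length_words"]
    totals["tokens_per_word"] = totals["length_tokens"] / words if words else 0.0
    return totals

def whitespace_count(text):
    """Stand-in counter: treats each whitespace-separated word as one token."""
    n = len(text.split())
    return {"length_words": n, "length_tokens": n}

demo = summarize_counts(["Afurika ni umugabane", "Kigali"], whitespace_count)
print(demo)
```

In the notebook, passing `tokenize_and_count` as `count_fn` and iterating over the `text` field of `dataset` or `xlst_texts` would give the totals for the Wikipedia and news corpora respectively.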
