# **LoRAfrica: Scaling LLM Fine-Tuning for African History**

## **Data Consolidation, Splitting, and Publishing on Hugging Face**

### **Plan of Action**
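- Download the source datasets from the Hugging Face Hub
- Consolidate the data into one dataset, available in both DataFrame and JSONL formats
- Split the consolidated dataset into train, validation, and test sets
- Push the result to the Hugging Face Hub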

In [None]:
# when using Colab, uncomment the following line to install required packages
# !pip install datasets huggingface_hub

### **Access tokens**

In [None]:
# # Log in to Hugging Face Hub if using Colab
# from google.colab import userdata
# from huggingface_hub import login

# hf_token = userdata.get('HF_TOKEN')
# if hf_token:
#    login(hf_token)
#    print("Successfully logged in to Hugging Face!")
# else:
#    print("Token is not set. Please save the token first.")

Successfully logged in to Hugging Face!


In [None]:
# # Log in to Hugging Face Hub if using Runpod or other environments
# from huggingface_hub import notebook_login
# notebook_login()

### **Data Sources**

In [None]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

In [None]:
dataset_one = load_dataset("Svngoku/Global-African-History-QA")
dataset_two = load_dataset("Svngoku/African-History-Extra-11-30-24-QA-Pairs")
dataset_three = load_dataset("Svngoku/African-History-Extra-Dspy-QA-Reasoning")

### **Check for columns present**

In [None]:
# Combine datasets into a list
datasets_lists = [dataset_one, dataset_two, dataset_three]

In [None]:
def dataset_columns_checker(data: list) -> None:
    """
    Print the column names of each dataset in a list.

    :param data: list of datasets whose columns should be printed
    :type data: list
    :return: None
    """
    for index, dataset in enumerate(data):
        print(f"Columns present for dataset_{index+1} are:\n{dataset.column_names}")

In [None]:
dataset_columns_checker(datasets_lists)

Columns present for dataset_1 are:
{'train': ['question', 'answer', 'timestamp', 'category']}
Columns present for dataset_2 are:
{'train': ['title', 'question', 'answer', 'explanation']}
Columns present for dataset_3 are:
{'train': ['title', 'question', 'answer', 'key_concepts', 'confidence', 'reasoning']}
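
The three schemas differ, so only the columns they all share can be kept when the datasets are concatenated. A minimal sketch of that check follows, using the column lists printed above. The names `COLUMNS` and `shared_columns` are illustrative and not part of the notebook.

```python
# Column names copied from the output printed above.
COLUMNS = {
    "dataset_1": ["question", "answer", "timestamp", "category"],
    "dataset_2": ["title", "question", "answer", "explanation"],
    "dataset_3": ["title", "question", "answer", "key_concepts", "confidence", "reasoning"],
}


def shared_columns(columns_by_dataset):
    """Return the columns present in every dataset, in first-dataset order."""
    column_lists = list(columns_by_dataset.values())
    common = set(column_lists[0]).intersection(*column_lists[1:])
    return [name for name in column_lists[0] if name in common]


print(shared_columns(COLUMNS))  # ['question', 'answer']
```

Only `question` and `answer` appear in all three datasets, which is why the next step selects exactly those two columns.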


### **Extract needed columns**

In [None]:
# Select only 'question' and 'answer' columns from each dataset
dataset_one = dataset_one.select_columns(['question','answer'])['train']
dataset_two = dataset_two.select_columns(['question','answer'])['train']
dataset_three = dataset_three.select_columns(['question','answer'])['train']

In [None]:
datasets_lists = [dataset_one, dataset_two, dataset_three]
dataset_columns_checker(datasets_lists)

Columns present for dataset_1 are:
['question', 'answer']
Columns present for dataset_2 are:
['question', 'answer']
Columns present for dataset_3 are:
['question', 'answer']


### **Check length of each dataset**

In [None]:
def dataset_length_checker(data: list) -> None:
    """
    Print the number of rows in each dataset in a list.

    :param data: list of datasets whose lengths should be printed
    :type data: list
    :return: None
    """
    for index, dataset in enumerate(data):
        print(f"Length of dataset_{index+1} is:\n{len(dataset)}")

In [None]:
dataset_length_checker(datasets_lists)

Length of dataset_1 is:
1015
Length of dataset_2 is:
844
Length of dataset_3 is:
555


### **Combining the Datasets**

In [None]:
# Concatenate the three datasets row-wise
full_dataset = concatenate_datasets([dataset_one, dataset_two, dataset_three])

In [None]:
full_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 2414
})

In [None]:
dataset_columns_checker([full_dataset])
dataset_length_checker([full_dataset])

Columns present for dataset_1 are:
['question', 'answer']
Length of dataset_1 is:
2414


### **Train-Validation-Test Split**

In [None]:
# Hold out a test set: a fraction of 0.0414 of the 2,414 rows yields 100 test rows
train_test_split = full_dataset.train_test_split(test_size=0.0414, seed=42)
train_test_split

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 2314
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 100
    })
})

In [None]:
# Split the remaining rows again: a fraction of 0.086 of the 2,314 rows yields 200 validation rows
train_val_split = train_test_split['train'].train_test_split(test_size=0.086, seed=42)
train_val_split

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 2114
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 200
    })
})
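
The fractions `0.0414` and `0.086` look arbitrary, but they are chosen to land on round split sizes. The arithmetic can be sketched as below, assuming (as the outputs above suggest) that a fractional `test_size` is rounded up to a whole number of rows. `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test


# First split: 0.0414 * 2414 = 99.94, rounded up to 100 test rows.
print(split_sizes(2414, 0.0414))  # (2314, 100)

# Second split: 0.086 * 2314 = 199.004, rounded up to 200 validation rows.
print(split_sizes(2314, 0.086))  # (2114, 200)
```

Both results match the `num_rows` values shown in the outputs above.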

In [None]:
final_dataset = DatasetDict({
    'train': train_val_split['train'],
    'validation': train_val_split['test'],
    'test': train_test_split['test']
})

In [None]:
final_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 2114
    })
    validation: Dataset({
        features: ['question', 'answer'],
        num_rows: 200
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 100
    })
})

In [None]:
train = final_dataset['train']
validation = final_dataset['validation']
test = final_dataset['test']
split_data_list = [train, validation, test]

In [None]:
dataset_length_checker(split_data_list)

Length of dataset_1 is:
2114
Length of dataset_2 is:
200
Length of dataset_3 is:
100


### **Publishing to the Hugging Face Hub**

In [None]:
# Pushing the dataset to Hugging Face Hub
print("=" * 70)
print("Pushing Dataset to Hugging Face Hub")
print("=" * 70)
print()

username = "DannyAI"  # Replace with your Hugging Face username
dataset_name = "African-History-QA-Dataset"  # Desired name for the dataset on the Hub

# Full repository ID in the form "<username>/<dataset_name>"
repo_id = f"{username}/{dataset_name}"

# Push all three splits to the Hub as a public dataset
final_dataset.push_to_hub(repo_id, private=False)

print("✓ After pushing, your dataset will be available at:")
print(f"  https://huggingface.co/datasets/{repo_id}")

Pushing Dataset to Hugging Face Hub




✓ After pushing, your dataset will be available at:
  https://huggingface.co/datasets/DannyAI/African-History-QA-Dataset
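
After publishing, it can be useful to confirm that the splits on the Hub have the sizes reported earlier. The sketch below is one way to do that. `EXPECTED_ROWS` and `split_size_mismatches` are hypothetical names, and the check is shown on stand-in objects so that it runs offline.

```python
# Row counts taken from the final_dataset output above.
EXPECTED_ROWS = {"train": 2114, "validation": 200, "test": 100}


def split_size_mismatches(splits, expected=EXPECTED_ROWS):
    """Return {split: (expected, actual)} for missing or wrong-sized splits."""
    mismatches = {}
    for name, expected_rows in expected.items():
        actual_rows = len(splits[name]) if name in splits else None
        if actual_rows != expected_rows:
            mismatches[name] = (expected_rows, actual_rows)
    return mismatches


# Stand-in splits with the expected sizes; any mapping of sized objects works.
stand_in = {"train": range(2114), "validation": range(200), "test": range(100)}
print(split_size_mismatches(stand_in))  # {}

# With network access, the published dataset could be checked the same way:
# from datasets import load_dataset
# print(split_size_mismatches(load_dataset("DannyAI/African-History-QA-Dataset")))
```

An empty dictionary means every split is present with the expected number of rows.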
