
# ZaBantu Beta

*Training Cross-Lingual Language Models for South African Bantu Languages*

## Overview

ZaBantu is a project that aims to train cross-lingual language models for South African Bantu languages using the XLM-RoBERTa architecture, drawing inspiration from the AfriBERTa and XLM-RoBERTa models. The project is currently in the beta phase and is being actively developed while we gather sufficient data and resources to benchmark the models.
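Since the models follow the XLM-RoBERTa architecture, they expose the standard masked-language-model interface. The sketch below shows what querying such a model with Hugging Face `transformers` looks like; it uses the public `xlm-roberta-base` checkpoint as a stand-in, since no ZaBantu checkpoint name is given here.

```python
# Sketch: querying an XLM-RoBERTa-style masked LM via Hugging Face transformers.
# "xlm-roberta-base" is the public base checkpoint, used here only as a
# stand-in for a trained ZaBantu model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-RoBERTa models use "<mask>" as the mask token.
for prediction in fill_mask("The capital city of France is <mask>."):
    print(prediction["token_str"], prediction["score"])
```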

## Documentation structure

You can navigate the documentation using the links below:

1. Machine setup - Instructions on how to set up your machine to run the code in this repository on a CUDA GPU
2. Get the data - Instructions on how to get the data used in this project
3. Training the model - Instructions on how to train the model on our data or your own custom dataset
4. Experiment tracking - Instructions on how to track your experiments using Comet.ml. This is optional but recommended

## Getting Started

You can quickly get started with training a lightweight model to see how everything works by following the instructions below. We assume that you already have access to a machine running Ubuntu 20.04 with 1 x NVIDIA Tesla T4 GPU; any other version of Ubuntu or GPU should work similarly, but this is the only configuration we have tested.
1. Clone this repository to your local machine:

   ```bash
   git clone https://github.com/ndamulelonemakh/zabantu-beta.git
   cd zabantu-beta
   ```
2. Install the NVIDIA drivers (if not already installed):

   ```bash
   bash scripts/nvidia_setup.sh

   # This script will cause your machine to reboot.
   # Wait for the machine to reboot, then run it again (next step).
   ```
3. Once the machine has rebooted, run the same script again to install the CUDA toolkit:

   ```bash
   bash scripts/nvidia_setup.sh
   ```
4. Install the Python dependencies:

   ```bash
   bash scripts/server-setup.sh

   # If any steps fail, try running the individual commands manually.
   ```

**Optional:** If you intend to use comet.ml and other optional tools, copy the `env.template` file to `.env` and fill in the required fields.
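A minimal sketch of that step (the variable name shown is illustrative; use whichever fields your `env.template` actually lists):

```bash
cp env.template .env

# Edit .env and fill in the required fields. The exact variable names
# depend on env.template; a Comet API key entry might look like:
# COMET_API_KEY=<your-api-key>
```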

5. (Optional) Verify that your PyTorch installation can see your CUDA installation:

   ```bash
   python -c "import torch; print(torch.cuda.is_available())"
   ```
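   For a slightly more detailed check, the sketch below uses standard PyTorch calls to also report the CUDA version and the detected GPU:

   ```python
   import torch

   # True only if PyTorch was built with CUDA support and a GPU is visible.
   print("CUDA available:", torch.cuda.is_available())

   if torch.cuda.is_available():
       # CUDA version PyTorch was compiled against, e.g. "11.8".
       print("CUDA version:", torch.version.cuda)
       # Name of the first visible GPU; on the assumed setup, "Tesla T4".
       print("Device:", torch.cuda.get_device_name(0))
   ```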
6. Run the sample training pipeline:

   ```bash
   make train_lite
   ```

   If you are using comet.ml, you should be able to see the training progress on the comet.ml dashboard.
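   For reference, Comet tracking inside a training script typically looks like the sketch below. This is a generic illustration of the `comet_ml` API, not the exact wiring used by `make train_lite`; the project name, parameters, and metric names are hypothetical.

   ```python
   from comet_ml import Experiment

   # Picks up COMET_API_KEY from the environment (e.g. loaded from .env).
   experiment = Experiment(project_name="zabantu-lite")  # hypothetical project name

   experiment.log_parameters({"lr": 5e-4, "batch_size": 32})  # illustrative values

   for step in range(100):
       loss = 1.0 / (step + 1)  # placeholder for the real training loss
       experiment.log_metric("train_loss", loss, step=step)

   experiment.end()
   ```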

## Contributing

Refer to the CONTRIBUTING.md file for instructions on how to contribute to this project.