# African NER Notebook

This notebook is for experiments with language adapters, implemented using the [AdapterHub codebase](https://adapterhub.ml/). You can run every cell from top to bottom, and all the settings are in one place.


It can: 
1. Train a language adapter on monolingual data. By default, this is a Hausa adapter trained on data from the MAFAND dataset.
2. Fine-tune on a downstream task, Masakhane NER, using that adapter to inject linguistic knowledge.

The notebook is divided up into sections: 
0. Setup: This is where we do the necessary installations, downloads, etc.
1. Settings (EDIT THIS): This is where you set everything for your experiment, such as which language to use.
2. Training a language adapter.
3. Using the language adapter to fine-tune on NER.

BEFORE YOU BEGIN: 
* Make sure you set up your Weights and Biases account first and are part of the Masakhane team (https://wandb.ai/double-bind-ner)!
* Add your name and language to this Google Sheet: https://docs.google.com/spreadsheets/d/1Its1Yjcr0WHE9vFbuSGuuIYOqDS13qIYXVV_Zk7oCtM/edit#gid=510456181
* In Google Colab's "Runtime" settings, click "Change Runtime type" and be sure you are using a GPU. 

WHEN YOU ARE READY:
* Go to the Settings section below.
* Edit your language and any other settings you wish to change.
* Run each cell in the notebook from top to bottom.


If you have any questions, ask in Slack (https://masakhane-nlp.slack.com/archives/C04HFLKU17D), or contact Colin Leong (cleong1@udayton.edu) or Herumb Shandilya.



## Setup (do not edit)


In [None]:
!git clone https://github.com/krypticmouse/double-bind-training.git
%cd double-bind-training

In [None]:
!git checkout train-lm-adapter

In [None]:
! pip install adapter-transformers seqeval ptvsd wandb datasets

#### Login to Weights and Biases for experiment tracking
Used to save all the metrics and settings for later reference. 

In [None]:
import wandb
wandb.login()

## Settings (Edit this)

This is where you set language and other settings. 

Batch sizes matter most, because the largest batch that fits depends on which GPU you have. If you get "out of memory" errors, lower the batch sizes first.
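
If you have to lower the batch size, one option is to accumulate gradients over several steps so the effective batch size stays roughly the same. The sketch below only shows the arithmetic. It assumes the training script is built on the Hugging Face `Trainer` and therefore accepts `--gradient_accumulation_steps` (check `train_lm_adapter.py` before relying on this); the helper name `accumulation_steps` is hypothetical and not part of the repository.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Return the number of gradient-accumulation steps needed so that
    per_device_batch_size * num_devices * steps >= target_batch_size."""
    if min(target_batch_size, per_device_batch_size, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    per_step = per_device_batch_size * num_devices
    return max(1, math.ceil(target_batch_size / per_step))


# Example: a batch of 8 does not fit in memory, so we drop to 2 per device
# and accumulate gradients to keep an effective batch size of 8.
steps = accumulation_steps(target_batch_size=8, per_device_batch_size=2)
print(steps)  # 4
```

If the script does accept the flag, you would set the batch size to 2 in the form below and pass the computed value as `--gradient_accumulation_steps` in the training command.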

In [None]:
language_code = "hau" #@param {type:"string"}
model_name  = "roberta-base" #@param {type:"string"}
import os
os.environ["language_code"] = language_code
# TODO: add tags in wandb for: 
# * language code
# * model name
# * GPU? 
# etc.

In [None]:
#@title Adapter LM Training Settings 
per_device_train_batch_size = 8 #@param {type:"number"}
per_device_eval_batch_size = 8 #@param {type:"number"}
epochs = 20 #@param {type:"number"}

# The "$" makes IPython substitute the Python variable's value.
# Without it, the literal variable name is stored instead.
%env lm_adapter_training_model_name_or_path=$model_name
%env lm_adapter_training_per_device_train_batch_size=$per_device_train_batch_size
%env lm_adapter_training_per_device_eval_batch_size=$per_device_eval_batch_size
%env lm_adapter_training_num_train_epochs=$epochs

In [None]:
#@title MASAKHANE NER Settings

%env MAX_LENGTH=164
%env ADAPTER_MODEL=$model_name
%env OUTPUT_DIR=hau_ner
%env BATCH_SIZE=32
%env NUM_EPOCHS=20
%env SAVE_STEPS=10000
%env SEED=1

## Training a Language Adapter

### Download and preprocess training data from MAFAND
The MAFAND dataset is available at https://huggingface.co/datasets/masakhane/mafand. It is intended for machine translation, but we can use one side of it for language modeling.

['en-amh', 'en-hau', 'en-ibo', 'en-kin', 'en-lug', 'en-nya', 'en-pcm', 'en-sna', 'en-swa', 'en-tsn', 'en-twi', 'en-xho', 'en-yor', 'en-zul', 'fr-bam', 'fr-bbj', 'fr-ewe', 'fr-fon', 'fr-mos', 'fr-wol'] are the available translation sets. 

We simply download the first dataset in the list containing the requested language code. 
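
The selection rule above is a substring match on the config name. A slightly stricter variant, sketched here as an illustration, compares the language code against the target side of the config name only and raises a readable error when nothing matches. The function name `find_mafand_config` is hypothetical; the list of config names is the one given above.

```python
MAFAND_SETS = ['en-amh', 'en-hau', 'en-ibo', 'en-kin', 'en-lug', 'en-nya',
               'en-pcm', 'en-sna', 'en-swa', 'en-tsn', 'en-twi', 'en-xho',
               'en-yor', 'en-zul', 'fr-bam', 'fr-bbj', 'fr-ewe', 'fr-fon',
               'fr-mos', 'fr-wol']


def find_mafand_config(language_code, configs=MAFAND_SETS):
    """Return the first config whose target language (the part after the
    hyphen) is exactly language_code."""
    for config_name in configs:
        _source_lang, target_lang = config_name.split("-", 1)
        if target_lang == language_code:
            return config_name
    raise ValueError(f"No MAFAND config found for language code {language_code!r}")


print(find_mafand_config("hau"))  # en-hau
```

The returned name can be passed as the second argument to `load_dataset("masakhane/mafand", ...)`.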

TODO: code to deal with other datasets

In [None]:
from datasets import load_dataset

MAFAND_SETS = ['en-amh', 'en-hau', 'en-ibo', 'en-kin', 'en-lug', 'en-nya', 'en-pcm', 'en-sna', 'en-swa', 'en-tsn', 'en-twi', 'en-xho', 'en-yor', 'en-zul', 'fr-bam', 'fr-bbj', 'fr-ewe', 'fr-fon', 'fr-mos', 'fr-wol'] 
MAFAND_SETS_containing_code = [config_name for config_name in MAFAND_SETS if language_code in config_name]

dataset = load_dataset("masakhane/mafand", MAFAND_SETS_containing_code[0])

#### Converting translation to language modeling set

Our training script expects the dataset as .txt files, so we write them out.

We pull out one language and convert it to a language modeling set. Language modeling sets are monolingual, and each data item has a "text" field, as in the example from https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/01_Adapter_Training.ipynb
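
To make the target shape concrete, here is a minimal sketch of the conversion on plain Python dictionaries. It assumes each MAFAND record has the nested layout `{"translation": {"en": ..., "hau": ...}}` (which is what flattening into `translation.en` and `translation.hau` columns suggests). The sample sentences are made-up placeholders, and `to_text_records` is a hypothetical helper, not part of the repository.

```python
def to_text_records(records, language_code):
    """Keep only the target-language sentence of each translation pair,
    stored under a single "text" key."""
    return [{"text": record["translation"][language_code]} for record in records]


# Placeholder records in the assumed nested layout (not real MAFAND data).
sample = [
    {"translation": {"en": "english sentence 1", "hau": "hausa sentence 1"}},
    {"translation": {"en": "english sentence 2", "hau": "hausa sentence 2"}},
]

print(to_text_records(sample, "hau"))
# [{'text': 'hausa sentence 1'}, {'text': 'hausa sentence 2'}]
```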


In [None]:
flat_dataset = dataset.flatten()
flat_dataset

Here we drop/remove the data that is not in the target language. 

In [None]:
for split in flat_dataset:
  # Column names look like "translation.en" and "translation.hau";
  # drop every column that is not the target language.
  columns_to_drop = [feature for feature in flat_dataset[split].features
                     if language_code not in feature]
  flat_dataset[split] = flat_dataset[split].remove_columns(columns_to_drop)
flat_dataset

We have removed the English/French, leaving us with just the target language. Now we write it out to .txt files. 

In [None]:
for split in flat_dataset:
  split_strings = []
  for data_item in flat_dataset[split]:
    values = data_item.values()
    for value in values: 
      # print(value)
      split_strings.append(value)
  with open(f"{split}.txt", "w", encoding="utf-8") as spf:
    # One sentence per line; write() is the right call for a single string.
    spf.write('\n'.join(split_strings))

Use the `pwd` command to make sure we're in the right directory.

In [None]:
!pwd

Optional: if restarting from scratch and you want to remove the existing output directory, you can uncomment this and run it to delete the directory.

In [None]:
# !rm -rf /tmp/test-mlm

Use the `ls` command to check that the training files were created correctly.
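
A listing only shows that the files exist. As an optional extra check, the sketch below counts the non-empty lines in each file the training command expects (`train.txt` and `validation.txt`). The helper names are hypothetical and not part of the repository.

```python
from pathlib import Path


def count_nonempty_lines(path):
    """Count the lines in a text file that contain more than whitespace."""
    with open(path, encoding="utf-8") as text_file:
        return sum(1 for line in text_file if line.strip())


def check_training_files(directory=".", names=("train.txt", "validation.txt")):
    """Map each expected file name to its non-empty line count,
    or to None if the file is missing."""
    report = {}
    for name in names:
        path = Path(directory) / name
        report[name] = count_nonempty_lines(path) if path.is_file() else None
    return report


print(check_training_files())
```

A value of `None` or `0` for either file means the export step above needs to be rerun before training.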

In [None]:
!ls

### Run the adapter training


In [None]:
!CUDA_VISIBLE_DEVICES=0,1 python train_lm_adapter.py \
    --model_name_or_path "$lm_adapter_training_model_name_or_path" \
    --train_file train.txt \
    --validation_file validation.txt \
    --per_device_train_batch_size "$lm_adapter_training_per_device_train_batch_size" \
    --per_device_eval_batch_size "$lm_adapter_training_per_device_eval_batch_size" \
    --train_adapter \
    --do_train \
    --do_eval \
    --num_train_epochs "$lm_adapter_training_num_train_epochs" \
    --report_to "wandb" \
    --run_name "adapter-training-lm-test" \
    --output_dir /tmp/test-mlm \
    --tags "$language_code,$lm_adapter_training_model_name_or_path" \
    --logging_steps 99

In [None]:
!ls -alh /tmp/test-mlm

## Using the Language Adapter to fine-tune on NER

In [None]:
!CUDA_VISIBLE_DEVICES=0,1 python3 train_ner_adapter.py --data_dir data/hau/ \
--model_type roberta \
--model_name_or_path $ADAPTER_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS --learning_rate 5e-4 \
--seed $SEED \
--tags "$language_code,$lm_adapter_training_model_name_or_path" \
--path_to_adapter /tmp/test-mlm \
--overwrite_output_dir \
--do_train \
--do_eval \
--do_predict

# Back up results to your Google Drive

## Mount your Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive')

## Back up the language adapter

The cells below back up your language adapter. In this example we use the name of the wandb training run as the folder name.
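
As an alternative to the shell commands, the same backup can be sketched in Python with the standard library, which fails loudly if the adapter directory is missing. The function name `backup_adapter` is hypothetical; the paths in the commented example are the ones used in this notebook.

```python
import shutil
from pathlib import Path


def backup_adapter(adapter_dir, backup_root, run_name):
    """Copy adapter_dir into backup_root/run_name and return the destination."""
    source = Path(adapter_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Adapter directory not found: {source}")
    destination = Path(backup_root) / run_name
    # dirs_exist_ok (Python 3.8+) lets a rerun overwrite an earlier backup.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return destination


# Example call (requires Google Drive to be mounted at /gdrive):
# backup_adapter("/tmp/test-mlm", "/gdrive/MyDrive/masakhane/double-bind", "dancing-fish-5")
```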


In [None]:
%env backup_adapter_folder_name dancing-fish-5

In [None]:
!mkdir -p "/gdrive/MyDrive/masakhane/double-bind/$backup_adapter_folder_name/"

In [None]:
!cp -rv /tmp/test-mlm/* "/gdrive/MyDrive/masakhane/double-bind/$backup_adapter_folder_name/"