<center>
<img src="https://raw.githubusercontent.com/afrisenti-semeval/afrisent-semeval-2023/main/afrisenti-logo.png" width="30%" />
</center>

<center>

#SemEval 2023 Shared Task 12: AfriSenti (Task B)

###Starter Notebook

</center>

#Leveraging a Pre-trained Language Model to Train a Sentiment Classifier

**Authors:**
[Idris Abdulmumin](https://www.hausanlp.org/author/idris-abdulmuminu/), [David Adelani](https://dadelani.github.io/) and [Shamsuddeen Hassan Muhammad](https://www.hausanlp.org/author/shamsuddeen-hassan-muhammad/).

**Introduction:** 

You are welcome to participate in our first-of-its-kind SemEval Shared Task! 

In this starter notebook, we take you through the process of fine-tuning a pre-trained language model on sample data to build a sentiment classifier. The notebook is adapted from a [Hugging Face implementation](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_xnli.py) for such tasks.

**Language (Track)**
* Track 14: Multilingual

**Level:** <font color='blue'>`Beginner to Intermediate`</font>

**Outline:** 

1. Installing and importing the necessary libraries
2. Formatting the dataset
3. Setting up the project parameters and running training and evaluation
4. Preparing a submission

**Before you start:**

It is **strongly advised** that you use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box.
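
After changing the runtime type, it can help to confirm that a GPU is actually visible before starting a long training run. The following is a minimal sketch, assuming an NVIDIA GPU whose driver provides the `nvidia-smi` tool; the helper name `gpu_visible` is hypothetical and not part of the starter kit.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # tool not on PATH, so no NVIDIA driver is available
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code or empty output means no usable GPU was found.
    return result.returncode == 0 and result.stdout.strip() != ""


print("GPU visible:", gpu_visible())
```

If this prints `False` in Colab, the runtime type has probably not been switched to GPU yet.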

NB: 

- **The code in this notebook is provided to help you become familiar with fine-tuning language models for sentiment classification. You may extend and/or modify it as appropriate to obtain competitive performance.**

- **We also use the data as it is, without any cleaning such as removal of emoji and hyperlinks.**




#1) Installations and imports

##a. Mount drive (if you are running on colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##b. Clone or update competition repository

After cloning, you will see the afrisent-semeval-2023 folder under MyDrive, containing all the data for the AfriSenti shared task (training and dev).

In [None]:
%cd /content/drive/MyDrive

import os

PROJECT_DIR = '/content/drive/MyDrive/afrisent-semeval-2023'
PROJECT_GITHUB_URL = 'https://github.com/afrisenti-semeval/afrisent-semeval-2023.git'

if not os.path.isdir(PROJECT_DIR):
  !git clone {PROJECT_GITHUB_URL}
else:
  %cd {PROJECT_DIR}
  !git pull {PROJECT_GITHUB_URL}

##c. Install required libraries

- Set the project directory in the cell below, where the requirements file should also be located, and run the cell

In [None]:
REQUIREMENTS_FILE = os.path.join(PROJECT_DIR, 'starter_kit', 'requirements.txt')

if os.path.isdir(PROJECT_DIR):
  # The requirements file should be in PROJECT_DIR/starter_kit
  if os.path.isfile(REQUIREMENTS_FILE):
    # Use the full path so the install works from any working directory
    !pip install -r {REQUIREMENTS_FILE}
  else:
    print('requirements.txt file not found')
else:
  print("Project directory not found, please check again.")

##d. Import libraries

Import libraries below

In [None]:
import pandas as pd
import numpy as np

#2) Dataset

##a. Formatting

The training dataset that was provided for the competition is in the following format:

| ID | text | label |
| --- | --- | --- |
| twt001 | example text | negative |
| twt002 | example text | positive |
| ... | ... | ... |

However, the code in the starter kit does not expect the ID column and requires the training (and evaluation) data to be in the following format:

|text | label |
|--- | --- |
|example text | negative |
|example text | positive |
|... | ... |

To reformat the data, run the following cells:

In [None]:
df = pd.read_csv('/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/multilingual_train.tsv', sep='\t', header=0)
df_1 = pd.read_csv('/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/dev_gold/multilingual_dev_gold_label.tsv', sep='\t', header=0)
df = pd.concat([df, df_1], axis=0)

In [None]:
df.to_csv('/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/multilingual_train.tsv', sep='\t', index=False)

In [None]:
# Training Data Paths

TASK = 'SubtaskB'
TRAINING_DATA_DIR = os.path.join(PROJECT_DIR, TASK)
FORMATTED_TRAIN_DATA = os.path.join(TRAINING_DATA_DIR, 'formatted-train-data')

if os.path.isdir(TRAINING_DATA_DIR):
  print('Data directory found.')
  if not os.path.isdir(FORMATTED_TRAIN_DATA):
    print('Creating directory to store formatted data.')
    os.mkdir(FORMATTED_TRAIN_DATA)
else:
  print(TRAINING_DATA_DIR + ' is not a valid directory or does not exist!')

Data directory found.


In [None]:
%cd {TRAINING_DATA_DIR}

training_files = os.listdir()

if len(training_files) > 0:
  for training_file in training_files:
    if training_file.endswith('train.tsv'):

      data = training_file.split('_')[0]
      if not os.path.isdir(os.path.join(FORMATTED_TRAIN_DATA, data)):
        print(data, 'Creating directory to store train, dev and test splits.')
        os.mkdir(os.path.join(FORMATTED_TRAIN_DATA, data))
      
      df_original = pd.read_csv(training_file, sep='\t', names=['ID', 'text', 'label'], header=0)
      df_gold = pd.read_csv('/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/dev_gold/multilingual_dev_gold_label.tsv',  sep='\t', names=['ID', 'text', 'label'], header=0)
      df = pd.concat([df_original, df_gold], axis=0)
      df[['text', 'label']].to_csv(os.path.join(FORMATTED_TRAIN_DATA, data, 'train.tsv'), sep='\t', index=False)
    else:
      print(training_file + ' skipped!')
else:
  print('No files are found in this directory!')

/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB
README.txt skipped!
multilingual_dev.tsv skipped!
formatted-train-data skipped!
splitted-train-dev-test skipped!
splitted-train-dev skipped!
dev_gold skipped!
test skipped!
.DS_Store skipped!


After running the code above, a new folder called formatted-train-data, containing the formatted files, is created inside the SubtaskB folder.
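
Before splitting or training, it can be useful to look at how the labels are distributed in a formatted file. The sketch below uses only the standard library and a tiny made-up in-memory sample in place of a real `train.tsv`; the helper name `label_counts` is hypothetical.

```python
import csv
import io
from collections import Counter


def label_counts(tsv_file) -> Counter:
    """Count how often each label occurs in a tab-separated file with a `label` column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["label"] for row in reader)


# Made-up rows standing in for a formatted train.tsv (text and label columns).
sample = io.StringIO(
    "text\tlabel\n"
    "example text one\tnegative\n"
    "example text two\tpositive\n"
    "example text three\tpositive\n"
)
counts = label_counts(sample)
print(counts)
```

To run it on real data, open the formatted file (for example with `open(path, newline='', encoding='utf-8')`) and pass the file object to `label_counts`.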

##b. <font color='red'>`(Optional) Creating Evaluation (Dev and Test) sets from the available training data`</font>

You may wish to create train and evaluation (dev and test) sets from the training data provided. If so, run either of the cells below.

###i. If you want to create both the Dev and Test sets, run this cell

In [None]:
if os.path.isdir(FORMATTED_TRAIN_DATA):
  print('Data directory found.')
  SPLITTED_DATA = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev-test')
  if not os.path.isdir(SPLITTED_DATA):
    print('Creating directory to store train, dev and test splits.')
    os.mkdir(SPLITTED_DATA)
else:
  print(FORMATTED_TRAIN_DATA + ' is not a valid directory or does not exist!')

%cd {FORMATTED_TRAIN_DATA}
formatted_training_files = os.listdir()

if len(formatted_training_files) > 0:
  for data_name in formatted_training_files:
    formatted_training_file = os.path.join(data_name, 'train.tsv')
    if os.path.isfile(formatted_training_file):
      labeled_tweets = pd.read_csv(formatted_training_file, sep='\t', names=['text', 'label'], header=0)
      train, dev, test = np.split(labeled_tweets.sample(frac=1, random_state=42), [int(.7*len(labeled_tweets)), int(.8*len(labeled_tweets))])

      if not os.path.isdir(os.path.join(SPLITTED_DATA, data_name)):
        print(data_name, 'Creating directory to store train, dev and test splits.')
        os.mkdir(os.path.join(SPLITTED_DATA, data_name))

      train.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'train.tsv'), sep='\t', index=False)
      dev.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'dev.tsv'), sep='\t', index=False)
      test.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name,'test.tsv'), sep='\t', index=False)
    else:
      print(formatted_training_file + ' is not a supported file!')
else:
  print('No files are found in this directory!')

Data directory found.
/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/formatted-train-data


After running the code above, a new folder called splitted-train-dev-test, containing the train-dev-test split, is created inside the SubtaskB folder. Here, the train-dev-test split is 70/10/20.
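
The 70/10/20 proportions follow from the two cut points used in the cell above: the shuffled data is cut at 70% and at 80% of its length. The following sketch reproduces the same arithmetic on a plain list instead of a DataFrame; the function name `split_70_10_20` is hypothetical.

```python
import random


def split_70_10_20(rows, seed=42):
    """Shuffle `rows` and cut them at 70% and 80% of their length."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(0.7 * len(shuffled))   # everything before this index is train
    second_cut = int(0.8 * len(shuffled))  # between the two cuts is dev, the rest is test
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train, dev, test = split_70_10_20(range(1000))
print(len(train), len(dev), len(test))  # 700 100 200
```

Changing the two fractions changes the proportions; for example, cut points at 0.8 and 0.9 would give an 80/10/10 split.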



###ii. If you want to create only the Dev set from the training data, please run this

In [None]:
from sklearn.model_selection import train_test_split

if os.path.isdir(FORMATTED_TRAIN_DATA):
  print('Data directory found.')
  SPLITTED_DATA = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev')
  if not os.path.isdir(SPLITTED_DATA):
    print('Creating directory to store train and dev splits.')
    os.mkdir(SPLITTED_DATA)
else:
  print(FORMATTED_TRAIN_DATA + ' is not a valid directory or does not exist!')

%cd {FORMATTED_TRAIN_DATA}
formatted_training_files = os.listdir()

if len(formatted_training_files) > 0:
  for data_name in formatted_training_files:
    formatted_training_file = os.path.join(data_name, 'train.tsv')
    if os.path.isfile(formatted_training_file):
      labeled_tweets = pd.read_csv(formatted_training_file, sep='\t', names=['text', 'label'], header=0)
      # Fix the seed so the split is the same on every run
      train, dev = train_test_split(labeled_tweets, test_size=0.3, random_state=42)

      if not os.path.isdir(os.path.join(SPLITTED_DATA, data_name)):
        print(data_name, 'Creating directory to store train and dev splits.')
        os.mkdir(os.path.join(SPLITTED_DATA, data_name))

      train.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'train.tsv'), sep='\t', index=False)
      dev.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'dev.tsv'), sep='\t', index=False)
    else:
      print(formatted_training_file + ' is not a supported file!')
else:
  print('No files are found in this directory!')

Data directory found.
/content/drive/MyDrive/afrisent-semeval-2023/SubtaskB/formatted-train-data


After running the code above, a new folder called splitted-train-dev, containing the train-dev split, is created inside the SubtaskB folder. Here, the train-dev split is 70/30.


#3) Training setup

##a. Set project parameters

For a list of models that can be used for fine-tuning, you can check [HERE](https://huggingface.co/models).

In [None]:
%cd {PROJECT_DIR}

# Model Training Parameters
MODEL_NAME_OR_PATH = 'masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0'
BATCH_SIZE = 4
LEARNING_RATE = 1e-5
NUMBER_OF_TRAINING_EPOCHS = 5
MAXIMUM_SEQUENCE_LENGTH = 128
SAVE_STEPS = -1

print('Everything set. You can now start model training.')

/content/drive/MyDrive/afrisent-semeval-2023
Everything set. You can now start model training.


##b. Train the model

In the section below, we provide three options: 

- 1) training model without any validation; 
- 2) training model with validation but without testing; 
- 3) training a model with validation and test set.
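
The three options differ only in whether `--do_eval` and `--do_predict` are passed to `starter_kit/run_textclass.py`. As a sketch, the command line can be assembled from a single parameter dictionary so that every value is looked up under the name of its own flag. The helper `build_train_args` and the placeholder paths are hypothetical; the flags and values are the ones used in this notebook.

```python
def build_train_args(params, data_dir, output_dir, do_eval=False, do_predict=False):
    """Return the command line for starter_kit/run_textclass.py as a list of strings."""
    args = [
        "python", "starter_kit/run_textclass.py",
        "--model_name_or_path", params["model_name_or_path"],
        "--data_dir", data_dir,
        "--do_train",
    ]
    if do_eval:
        args.append("--do_eval")      # option 2 and option 3
    if do_predict:
        args.append("--do_predict")   # option 3 only
    # Each flag is filled from the parameter with the same name.
    for flag in ("per_device_train_batch_size", "learning_rate",
                 "num_train_epochs", "max_seq_length", "save_steps"):
        args += [f"--{flag}", str(params[flag])]
    args += ["--output_dir", output_dir]
    return args


params = {
    "model_name_or_path": "masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0",
    "per_device_train_batch_size": 4,
    "learning_rate": 1e-5,
    "num_train_epochs": 5,
    "max_seq_length": 128,
    "save_steps": -1,
}
# Option 2: training with validation but without testing (placeholder paths).
option_2 = build_train_args(params, "path/to/data", "path/to/output", do_eval=True)
print(" ".join(option_2))
```

The resulting list could be passed to `subprocess.run(option_2, check=True)` instead of using the `!` shell syntax.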

###i. Training on only Train set, without any evaluation

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev', 'multilingual')
OUTPUT_DIR = os.path.join(PROJECT_DIR, 'models', 'multilingual' + '_no_eval')

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS}

If the training loss is very large, treat this run as a starting point: tune the training parameters and the model to get a competitive result.

Note also that there are no validation metrics (e.g., accuracy, loss), because we are training without validation.

###ii. Training on only Train and Dev sets

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev', 'multilingual')
OUTPUT_DIR = os.path.join(PROJECT_DIR, 'models', 'multilingual' + '_no_test3')

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --do_eval \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --overwrite_output_dir yes \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS} 

INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_me

Now there are evaluation metrics (e.g., accuracy, loss), because we evaluate the model on the validation set we created from the training data.






###iii. Training with Train, Dev and Test sets

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev-test', 'multilingual')
OUTPUT_DIR = os.path.join(PROJECT_DIR, 'models', 'multilingual')

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS}

INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=True,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_met

Now that you have trained your best model and found the best parameters, you can submit your predictions on the dev or test set on the CodaLab competition page.

#4) Submission

- For submission after training your model, unlabeled tweets are provided for dev (development phase) and test (evaluation phase).

- To generate sentiment predictions for them, provide the path to the file containing the unlabeled tweets.

**What the code does**
1. Predicts the sentiment of the unlabeled tweets (dev or test)
2. Creates a file in the submission format

In [None]:
%cd {PROJECT_DIR}

OUTPUT_DIR = os.path.join(PROJECT_DIR, 'models', 'multilingual'+'_no_test3')
FILE_NAME = os.path.join(PROJECT_DIR, TASK, 'test', 'multilingual' + '_test_participants.tsv')
TEXT_COLUMN = 'tweet'

!python starter_kit/run_predict.py \
  --model_path {OUTPUT_DIR} \
  --file_name {FILE_NAME} \
  --text_column {TEXT_COLUMN} 

/content/drive/MyDrive/afrisent-semeval-2023
***** Running Prediction *****
  Num examples = 30211
  Batch size = 8
100% 3777/3777 [18:32<00:00,  3.39it/s]
Data directory found.
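
The progress bar total in the log above follows directly from the number of examples and the batch size: the last, partially filled batch still counts as one step. A short check of that arithmetic:

```python
import math

num_examples = 30211  # "Num examples" in the log above
batch_size = 8        # "Batch size" in the log above

# Round up, because the final batch holds the leftover examples.
num_batches = math.ceil(num_examples / batch_size)
print(num_batches)  # 3777, matching the progress bar total
```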


- Congratulations! You have now trained a sentiment classifier and used it to predict labels for the unlabeled tweets.

- The prediction file (pred_multilingual.tsv) is in the "submissions" folder inside the "afrisent-semeval-2023" folder and is ready for submission. The submission file is in the format below:

<center>

|ID | label |
|--- | --- |
|hau_dev_00001| negative |
|hau_dev_00002| positive |
|... | ... |

</center>

- Inside the same folder, you will also see a file "multilingual_predictions.tsv" in the format below, which shows each tweet with its predicted sentiment. This file is not for submission.


<center>

|ID | text | label |
|--- | --- | --- |
|hau_dev_00001| @user Allah Miki albarkah 🙏🙏🙏 | positive |
|hau_dev_00002| @user Kidan ma zai dadi😂 | negative |
|... | ... | ... |

</center>
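
Before uploading, the ID/label submission file described above can be given a quick sanity check. The sketch below is not part of the starter kit: the helper name `check_submission` is hypothetical, the exact header (`ID`, `label`) is taken from the table above, and the set of allowed labels is an assumption that should be adjusted to match the task.

```python
import csv
import io

# Assumption: three sentiment classes; adjust if the task defines others.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def check_submission(tsv_file):
    """Return a list of problems found in a tab-separated ID/label file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    problems = []
    if reader.fieldnames != ["ID", "label"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    seen_ids = set()
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        if row["ID"] in seen_ids:
            problems.append(f"line {line_number}: duplicate ID {row['ID']}")
        seen_ids.add(row["ID"])
        if row["label"] not in ALLOWED_LABELS:
            problems.append(f"line {line_number}: unknown label {row['label']!r}")
    return problems


good = io.StringIO("ID\tlabel\nhau_dev_00001\tnegative\nhau_dev_00002\tpositive\n")
bad = io.StringIO("ID\tlabel\nhau_dev_00001\tnegative\nhau_dev_00001\thappy\n")
good_problems = check_submission(good)
bad_problems = check_submission(bad)
print(good_problems)
print(bad_problems)
```

An empty list means no problems were found; for a real file, pass a file object opened with `newline=''` and `encoding='utf-8'`.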




