<a href="https://colab.research.google.com/github/abumafrim/Hausa-Visual-Genome-Dataset/blob/main/AfriSenti_SemEval_2023_Starter_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src="https://raw.githubusercontent.com/afrisenti-semeval/afrisent-semeval-2023/main/afrisenti-logo.png" width="30%" />
</center>

<center>

#SemEval 2023 Shared Task 12: AfriSenti

###Starter Notebook

</center>

#Leveraging a Pre-trained Language Model to Train a Sentiment Classifier

**Authors:**
[Idris Abdulmumin](https://www.hausanlp.org/author/idris-abdulmuminu/), [David Adelani](https://dadelani.github.io/) and [Shamsuddeen Hassan Muhammad](https://www.hausanlp.org/author/shamsuddeen-hassan-muhammad/).

**Introduction:** 

You are welcome to participate in our first-of-its-kind SemEval Shared Task! 

In this starter notebook, we will take you through the process of fine-tuning a pre-trained language model on sample data to build a sentiment classifier. The notebook was adapted from a [Hugging Face implementation](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_xnli.py) for such tasks.

**Level:** <font color='blue'>`Beginner to Intermediate`</font>

**Outline:** 

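1. Installing and importing the necessary libraries
2. Formatting the dataset and, optionally, creating evaluation splits
3. Setting up the project parameters and running training and evaluation
4. Generating predictions for submission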

**Before you start:**

It is **strongly advised** that you use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box.
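
After changing the runtime type, it can be worth confirming that a GPU is actually visible before starting a long run. The snippet below is a minimal sketch that relies only on the Python standard library and on the `nvidia-smi` tool being present on GPU runtimes, which is an assumption about the environment; the helper name `gpu_is_visible` is ours and is not part of the starter kit.

```python
import shutil
import subprocess


def gpu_is_visible() -> bool:
    """Return True if the nvidia-smi tool exists and lists at least one GPU."""
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        # No NVIDIA driver tools on the PATH: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [nvidia_smi, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_is_visible():
    print("A GPU is visible to this runtime.")
else:
    print("No GPU detected; training would likely be much slower on the CPU.")
```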

**NB: The code in this notebook is provided to familiarize you with fine-tuning language models for sentiment classification. You may extend or modify it as appropriate to obtain competitive performance.**

#1) Installations and imports

##a. Mount drive (if you are running on colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

##b. Clone competition repository

In [None]:
%cd /content/drive/MyDrive

!git clone https://github.com/afrisenti-semeval/afrisent-semeval-2023.git

##c. Install required libraries

Set the project directory in the cell below, then run the cell. The requirements file is expected in the `starter_kit` folder inside that directory.

In [None]:
# os is needed here, before the main import cell further down is run
import os

PROJECT_DIR = '/content/drive/MyDrive/afrisent-semeval-2023'

if os.path.isdir(PROJECT_DIR):
  %cd {PROJECT_DIR}
else:
  print("Project directory not found, please check again.")

# The requirements file is expected at PROJECT_DIR/starter_kit/requirements.txt
if os.path.isfile('starter_kit/requirements.txt'):
  !pip install -r starter_kit/requirements.txt
else:
  print('requirements.txt file not found')

##d. Import libraries

In [None]:
import pandas as pd
import numpy as np
import os

#2) Dataset

##a. Formatting

The training dataset that was provided for the competition is in the following format:

| ID | text | label |
| --- | --- | --- |
| twt001 | example text | negative |
| twt002 | example text | positive |
| ... | ... | ... |

However, the code in the starter kit does not expect the tweet IDs and requires the training (and evaluation) data to be in the following format:

|text | label |
|--- | --- |
|example text | negative |
|example text | positive |
|... | ... |

To reformat the data, provide the directory of the competition training data and run the following cell.

In [None]:
TRAINING_DATA_DIR = '/content/drive/MyDrive/afrisent-semeval-2023/datasets/train'

if os.path.isdir(TRAINING_DATA_DIR):
  FORMATTED_TRAIN_DATA = os.path.join(TRAINING_DATA_DIR, 'formatted-train-data')
  print('Data directory found.')
  if not os.path.isdir(FORMATTED_TRAIN_DATA):
    print('Creating directory to store formatted data.')
    os.mkdir(FORMATTED_TRAIN_DATA)
else:
  print(TRAINING_DATA_DIR + ' is not a valid directory or does not exist!')

%cd {TRAINING_DATA_DIR}

training_files = os.listdir()

if len(training_files) > 0:
  for training_file in training_files:
    if training_file.endswith('.tsv'):

      data = training_file.split('_')[0]
      if not os.path.isdir(os.path.join(FORMATTED_TRAIN_DATA, data)):
        print(data, 'Creating directory to store the formatted training data.')
        os.mkdir(os.path.join(FORMATTED_TRAIN_DATA, data))
      
      df = pd.read_csv(training_file, sep='\t', names=['ID', 'text', 'label'], header=0)
      df[['text', 'label']].to_csv(os.path.join(FORMATTED_TRAIN_DATA, data, 'train.tsv'), sep='\t', index=False)
    
    else:
      print(training_file + ' is not a supported file!')
else:
  print('No files are found in this directory!')
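
To make the effect of the reformatting step concrete, the sketch below applies the same column selection to a tiny in-memory sample instead of the files on Drive. It uses the standard-library `csv` module rather than pandas so that it runs anywhere; the sample rows and the helper name `drop_id_column` are illustrative assumptions, not part of the starter kit.

```python
import csv
import io

# Hypothetical three-column sample in the competition layout (tab-separated).
raw_tsv = (
    "ID\ttext\tlabel\n"
    "twt001\texample text\tnegative\n"
    "twt002\texample text\tpositive\n"
)


def drop_id_column(tsv_text: str) -> str:
    """Keep only the `text` and `label` columns of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["text", "label"])
    for row in reader:
        writer.writerow([row["text"], row["label"]])
    return out.getvalue()


formatted_tsv = drop_id_column(raw_tsv)
print(formatted_tsv)
```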

##b. <font color='red'>`(Optional) Creating Evaluation (Dev and Test) sets from the available training data`</font>

You may wish to create train and evaluation (dev and test) sets from the training data provided. If so, run whichever of the cells below matches the splits you need.

###i. If you want to create both the Dev and Test sets, run this cell

In [None]:
if os.path.isdir(FORMATTED_TRAIN_DATA):
  print('Data directory found.')
  SPLITTED_DATA = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev-test')
  if not os.path.isdir(SPLITTED_DATA):
    print('Creating directory to store train, dev and test splits.')
    os.mkdir(SPLITTED_DATA)
else:
  print(FORMATTED_TRAIN_DATA + ' is not a valid directory or does not exist!')

%cd {FORMATTED_TRAIN_DATA}
formatted_training_files = os.listdir()

if len(formatted_training_files) > 0:
  for data_name in formatted_training_files:
    formatted_training_file = os.path.join(data_name, 'train.tsv')
    if os.path.isfile(formatted_training_file):
      labeled_tweets = pd.read_csv(formatted_training_file, sep='\t', names=['text', 'label'], header=0)
      train, dev, test = np.split(labeled_tweets.sample(frac=1, random_state=42), [int(.7*len(labeled_tweets)), int(.8*len(labeled_tweets))])

      if not os.path.isdir(os.path.join(SPLITTED_DATA, data_name)):
        print(data_name, 'Creating directory to store train, dev and test splits.')
        os.mkdir(os.path.join(SPLITTED_DATA, data_name))

      train.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'train.tsv'), sep='\t', index=False)
      dev.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'dev.tsv'), sep='\t', index=False)
      test.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name,'test.tsv'), sep='\t', index=False)
    else:
      print(formatted_training_file + ' is not a supported file!')
else:
  print('No files are found in this directory!')
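
The cell above cuts the shuffled data at 70% and 80% of its length, which yields roughly 70% train, 10% dev and 20% test. The sketch below reproduces that arithmetic on a plain list of integers so the resulting sizes are easy to inspect; the function name `split_70_10_20` and the 1,000-item input are made up for illustration.

```python
import random


def split_70_10_20(items, seed=42):
    """Shuffle `items`, then cut at 70% and 80% of the length."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_cut, second_cut = int(0.7 * n), int(0.8 * n)
    return (
        shuffled[:first_cut],            # train: first 70%
        shuffled[first_cut:second_cut],  # dev: next 10%
        shuffled[second_cut:],           # test: remaining 20%
    )


train_ids, dev_ids, test_ids = split_70_10_20(range(1000))
print(len(train_ids), len(dev_ids), len(test_ids))
```

Because the cut points are truncated with `int`, the proportions are only approximate when the dataset size is not a multiple of ten.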

###ii. If you want to create only the Dev set from the training data, please run this

In [None]:
from sklearn.model_selection import train_test_split

if os.path.isdir(FORMATTED_TRAIN_DATA):
  print('Data directory found.')
  SPLITTED_DATA = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev')
  if not os.path.isdir(SPLITTED_DATA):
    print('Creating directory to store train and dev splits.')
    os.mkdir(SPLITTED_DATA)
else:
  print(FORMATTED_TRAIN_DATA + ' is not a valid directory or does not exist!')

%cd {FORMATTED_TRAIN_DATA}
formatted_training_files = os.listdir()

if len(formatted_training_files) > 0:
  for data_name in formatted_training_files:
    formatted_training_file = os.path.join(data_name, 'train.tsv')
    if os.path.isfile(formatted_training_file):
      labeled_tweets = pd.read_csv(formatted_training_file, sep='\t', names=['text', 'label'], header=0)
      train, dev = train_test_split(labeled_tweets, test_size=0.3)

      if not os.path.isdir(os.path.join(SPLITTED_DATA, data_name)):
        print(data_name, 'Creating directory to store train and dev splits.')
        os.mkdir(os.path.join(SPLITTED_DATA, data_name))

      train.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'train.tsv'), sep='\t', index=False)
      dev.sample(frac=1).to_csv(os.path.join(SPLITTED_DATA, data_name, 'dev.tsv'), sep='\t', index=False)
    else:
      print(formatted_training_file + ' is not a supported file!')
else:
  print('No files are found in this directory!')
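
Since `train_test_split` is called above without a `stratify` argument, the class proportions in the train and dev splits are not guaranteed to match. One quick way to inspect this is to compare label shares in each split. The sketch below does so with `collections.Counter`; the label lists are hypothetical stand-ins for `train['label']` and `dev['label']`, and `label_shares` is a helper name introduced here.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total, rounded to three decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(count / total, 3) for label, count in counts.items()}


# Hypothetical label lists standing in for train['label'] and dev['label'].
train_labels = ["positive"] * 60 + ["negative"] * 40
dev_labels = ["positive"] * 15 + ["negative"] * 25

print(label_shares(train_labels))
print(label_shares(dev_labels))
```

If the shares differ noticeably, passing `stratify=labeled_tweets['label']` to `train_test_split` is one option to consider.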

#3) Training setup

##a. Set project parameters

For a list of models that can be used for fine-tuning, see the [Hugging Face model hub](https://huggingface.co/models).

In [None]:
%cd {PROJECT_DIR}

# Language to train sentiment classifier for
LANGUAGE_CODE = 'am'

# Model Training Parameters
MODEL_NAME_OR_PATH = 'Davlan/afro-xlmr-mini'
BATCH_SIZE = 32
LEARNING_RATE = 5e-5
NUMBER_OF_TRAINING_EPOCHS = 1.0
MAXIMUM_SEQUENCE_LENGTH = 128
SAVE_STEPS = -1
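
To get a feel for how `BATCH_SIZE` and `NUMBER_OF_TRAINING_EPOCHS` translate into training length, the sketch below estimates the number of parameter updates. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; the training-set size of 5,000 is a made-up figure, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: float) -> int:
    """Estimate the number of parameter updates for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * epochs)


# 5,000 is a made-up training-set size, used only for illustration.
print(optimizer_steps(num_examples=5000, batch_size=32, epochs=1.0))
```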

##b. Train the model

###i. Training on only Train set, without any evaluation

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'formatted-train-data', LANGUAGE_CODE)
OUTPUT_DIR = os.path.join('/content/drive/MyDrive/afrisent-semeval-2023/models', LANGUAGE_CODE + '_no_eval')

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS}

###ii. Training on only Train and Dev sets

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev', LANGUAGE_CODE)
OUTPUT_DIR = os.path.join('/content/drive/MyDrive/afrisent-semeval-2023/models', LANGUAGE_CODE + '_no_test')

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --do_eval \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS}

###iii. Training with Train, Dev and Test sets

In [None]:
DATA_DIR = os.path.join(TRAINING_DATA_DIR, 'splitted-train-dev-test', LANGUAGE_CODE)
OUTPUT_DIR = os.path.join('/content/drive/MyDrive/afrisent-semeval-2023/models', LANGUAGE_CODE)

!CUDA_VISIBLE_DEVICES=0 python starter_kit/run_textclass.py \
  --model_name_or_path {MODEL_NAME_OR_PATH} \
  --data_dir {DATA_DIR} \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size {BATCH_SIZE} \
  --learning_rate {LEARNING_RATE} \
  --num_train_epochs {NUMBER_OF_TRAINING_EPOCHS} \
  --max_seq_length {MAXIMUM_SEQUENCE_LENGTH} \
  --output_dir {OUTPUT_DIR} \
  --save_steps {SAVE_STEPS}
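
The three training variants above differ only in the data directory, the output directory, and whether `--do_eval` and `--do_predict` are passed. When running outside Colab, where the `!` shell syntax is unavailable, the same command can be assembled as an argument list. The sketch below uses only the flags that appear in the cells above; the helper name `build_train_command` and the placeholder paths are assumptions.

```python
import shlex


def build_train_command(model, data_dir, output_dir, *, batch_size=32,
                        learning_rate=5e-5, epochs=1.0, max_seq_length=128,
                        save_steps=-1, do_eval=False, do_predict=False):
    """Assemble the argument list for starter_kit/run_textclass.py."""
    command = [
        "python", "starter_kit/run_textclass.py",
        "--model_name_or_path", model,
        "--data_dir", data_dir,
        "--do_train",
    ]
    # Evaluation and prediction are opt-in, as in the three cells above.
    if do_eval:
        command.append("--do_eval")
    if do_predict:
        command.append("--do_predict")
    command += [
        "--per_device_train_batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
        "--num_train_epochs", str(epochs),
        "--max_seq_length", str(max_seq_length),
        "--output_dir", output_dir,
        "--save_steps", str(save_steps),
    ]
    return command


# Placeholder paths; substitute your own data and output directories.
command = build_train_command(
    "Davlan/afro-xlmr-mini", "data/am", "models/am", do_eval=True
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the project directory.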

#4) Submission

Some unlabeled tweets were provided for submission to the competition's CodaLab website. To generate predictions for them, provide the path to the file containing the unlabeled tweets so that the trained model can predict their sentiment classes.

**What the code does:**
1. Predicts the sentiment of each unlabeled tweet.
2. Creates a file in the submission format.

In [None]:
%cd {PROJECT_DIR}

OUTPUT_DIR = os.path.join('/content/drive/MyDrive/afrisent-semeval-2023/models', LANGUAGE_CODE)
FILE_NAME = '/content/drive/MyDrive/Shamsu_SA/dev_codalab/am_dev.tsv'
TEXT_COLUMN = 'text'

!python starter_kit/run_predict.py \
  --model_path {OUTPUT_DIR} \
  --file_name {FILE_NAME} \
  --text_column {TEXT_COLUMN} \
  --lang_code {LANGUAGE_CODE}
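
Before running the prediction cell, it may help to confirm that the unlabeled file really contains the column named in `TEXT_COLUMN`. The sketch below checks the header row of a tab-separated file using the standard library; the helper `has_column` and the in-memory sample, including its `ID` column, are assumptions made for illustration.

```python
import csv
import io


def has_column(tsv_file, column: str) -> bool:
    """Return True if the header row of a tab-separated file contains `column`."""
    header = next(csv.reader(tsv_file, delimiter="\t"), [])
    return column in header


# In-memory stand-in for the unlabeled file; its columns are an assumption.
sample = io.StringIO("ID\ttext\ntwt101\texample text\n")
print(has_column(sample, "text"))
```

For the real file, open it with `open(FILE_NAME, newline='', encoding='utf-8')` and pass the file object together with `TEXT_COLUMN`.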