### *Step 1: Install the necessary packages*

We need two packages:
* Transformers package made available by Huggingface
* Dataset package made availale by Huggingface


In [1]:
# installing the transformers package
!pip install transformers



In [2]:
# installing the dataset package
!pip install datasets --use-deprecated=legacy-resolver



### *Step 2: Import the necessay libraries from the installed packages*

In [3]:
!pip install evaluate



In [4]:
#importing the datasets package
from datasets import Dataset
import datasets
#import evaluate for model evaluation
import evaluate

In [5]:
#import numpy and pandas for mathematical computation and data manipulation respectively
import numpy as np
import pandas as pd
#import drive package to connect this colab file with the drive where the data will be retrived from
from google.colab import drive
#import the pipeline of transformers
from transformers import pipeline
#import AutoTokenizer for tokenization purposes
from transformers import AutoTokenizer
#import the Trainer API
from transformers import TrainingArguments, Trainer
#import early stopping callback
from transformers import EarlyStoppingCallback, IntervalStrategy


In [6]:
#import torch
import torch
#import Data loader from torch
from torch.utils.data import DataLoader
#import an optimizer
from torch.optim import AdamW
#import tqdm for a progress bar
from tqdm.auto import tqdm

In [7]:
#import train_test_split from sklearn for dividing the dataset into training, testing and validation
from sklearn.model_selection import train_test_split

### *Step 3: Import the dataset to be used for Training the model*


The dataset used for this project is an Amharic dataset that was made available by Data Mendeley. It contains Amharic posts and comments retrieved from Facebook and Telegram. It has 50,000 rows. The dataset can be accessed from [here](https://data.mendeley.com/datasets/fhvsvsbvtg/3)

In [8]:
#mount google drive to access the dataset directly from the drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Mounted at /content/drive

In [9]:
import zipfile
zip_ref=zipfile.ZipFile("/content/drive/MyDrive/Colab Notebooks/Hate Speech Detection using Amharic Language/Dataset.zip", 'r')
zip_ref.extractall("/content/datasets")
zip_ref.close()

In [10]:

#fetch the dataset from the drive
Labels=pd.read_csv('/content/datasets/Dataset/Labels.txt',header=None)
Posts=pd.read_csv('/content/datasets/Dataset/Posts.txt',header=None)


### *Step 4: Preprocess the Dataset*

When the dataset was retrived, the labels and the post were in different files.


*   Hence, the first step in this phase is merging the files into one panda's dataframe.
*   Second step is Label encoding. Lable encoding is the process of converting the labels(classes) into numeric format to make it easier for the machine to understand it
*   Third step is dividing the dataset into training, validation and testing categories. The division ratio is 7:1:2 respectively.
*   Last step is to remove an unncessary columns from the main dataset and merging the all the categories into one main dataset



In [11]:
#naming the columns
Labels.columns = ["labels"]
Posts.columns = ["post"]

In [12]:
#encoding the classes into numerical data
Labels = Labels.replace(['Free', 'Free ','Hate'],[0,0,1])

In [13]:
#check the encoded label data
Labels.head(10)

Unnamed: 0,labels
0,0
1,0
2,0
3,0
4,1
5,0
6,0
7,0
8,1
9,0


In [14]:
#check the Amharic data
Posts.head(1000)

Unnamed: 0,post
0,·ä†·àµ·âÄ·ãµ·àú ·å•·ã´·âÑ·ã¨ ·â†·å®·ãã·äê·âµ ·â†·ãç·àµ·å• ·àò·àµ·àò·à≠ ·ä•·äï·ã≤·ã∞·à≠·àµ·ãé ·ä†·ãµ·à≠·åå ·çç·âµ·àÖ·äï ·àà...
1,·ä•·äê·ãö·àÖ·äï ·ãà·à≥·äù ·åâ·ã≥·ãÆ·âΩ·äï ·ã®·àö·ã´·àµ·çà·çÖ·àù ·ä†·ä´·àç ·ä•·äï·ã≤·âã·âã·àù·äì ·ä≠·âµ·âµ·àç ·ä•·äï·ã≤·ã∞...
2,·ã®·ä†·àõ·à´ ·àÖ·ãù·â• ·â†·ä†·ä•·àù·àÆ ·ä≠·äï·çâ ·ã´·àç·â†·à®·à®·â†·âµ ·å•·â†·â•·äì ·çç·àç·àµ·çç·äì ·ã´·àç·ä®·çà·â∞·ãç ·ã®...
3,·ä®·ä†·àõ·à´ ·àÖ·ãù·â• ·ã®·àÄ·åà·à™·â± ·ãò·à≠·çà ·â•·ãô ·ä•·ãç·âÄ·âµ ·àò·äï·å≠·â∂ ·ã®·àû·àã·â†·âµ·ä®·àô·àã·â±·àù ·â†·àò·àç...
4,·ãõ·à¨ ·â†·ã®·âµ·äõ·ãç·àù ·àò·àà·ä™·ã´ ·ã≠·àÅ·äï ·àò·àò·ãò·äõ ·ä¢·âµ·ãÆ·åµ·ã´·ãä·äê·âµ ·ã®·àö·äï·çÄ·â£·à®·âÄ·ãç ·â†·ä†·àõ·à´...
...,...
995,·àò·çà·äï·âÖ·àà·àò·äï·åç·àµ·âµ ·ä®·àΩ·çè·àç·ä•·äï·ã¥ ·ã®·ã®·ä≠·àç·àâ ·àÖ·ãù·â• ·àù·äï ·ã≠·å†·â•·âÉ·àç ·àÜ ·â•·àé ·àÑ·ã∂ ...
996,·à∞·ãç ·â†·à©·äï ·ä•·äï·ã∞·çà·àà·åà ·àò·ä≠·çà·âµ ·ä•·äï·ã∞·çà·àà·åà ·àò·ãù·åã·âµ ·ã≠·âΩ·àã·àç ·ã®·àù·äï ·ä†·ãç·âÖ·àç·àª...
997,·ä≠·ä≠·ä≠·ä≠·ä≠ ·ã®·à±·ã≥·äï ·ãú·åã ·äì·âΩ·àÅ ·ä•·äï·ã¥ ·â£·àà ·äê·å†·àã ·åé·åÉ·àú ·àÅ·àã
998,·ã®·àù·äï ·àõ·àà·âÉ·âÄ·àµ ·äê·ãç ·àù·ãµ·à® ·çé·ä´·à™ ·àÅ·àã


In [15]:
#merge the datasets
Frames = [Labels, Posts]
Merged = pd.concat(Frames, axis=1)

In [16]:
#preview of merged data
Merged

Unnamed: 0,labels,post
0,0,·ä†·àµ·âÄ·ãµ·àú ·å•·ã´·âÑ·ã¨ ·â†·å®·ãã·äê·âµ ·â†·ãç·àµ·å• ·àò·àµ·àò·à≠ ·ä•·äï·ã≤·ã∞·à≠·àµ·ãé ·ä†·ãµ·à≠·åå ·çç·âµ·àÖ·äï ·àà...
1,0,·ä•·äê·ãö·àÖ·äï ·ãà·à≥·äù ·åâ·ã≥·ãÆ·âΩ·äï ·ã®·àö·ã´·àµ·çà·çÖ·àù ·ä†·ä´·àç ·ä•·äï·ã≤·âã·âã·àù·äì ·ä≠·âµ·âµ·àç ·ä•·äï·ã≤·ã∞...
2,0,·ã®·ä†·àõ·à´ ·àÖ·ãù·â• ·â†·ä†·ä•·àù·àÆ ·ä≠·äï·çâ ·ã´·àç·â†·à®·à®·â†·âµ ·å•·â†·â•·äì ·çç·àç·àµ·çç·äì ·ã´·àç·ä®·çà·â∞·ãç ·ã®...
3,0,·ä®·ä†·àõ·à´ ·àÖ·ãù·â• ·ã®·àÄ·åà·à™·â± ·ãò·à≠·çà ·â•·ãô ·ä•·ãç·âÄ·âµ ·àò·äï·å≠·â∂ ·ã®·àû·àã·â†·âµ·ä®·àô·àã·â±·àù ·â†·àò·àç...
4,1,·ãõ·à¨ ·â†·ã®·âµ·äõ·ãç·àù ·àò·àà·ä™·ã´ ·ã≠·àÅ·äï ·àò·àò·ãò·äõ ·ä¢·âµ·ãÆ·åµ·ã´·ãä·äê·âµ ·ã®·àö·äï·çÄ·â£·à®·âÄ·ãç ·â†·ä†·àõ·à´...
...,...,...
29995,1,·â†·ä†·àâ ·ã®·àÅ·àâ·àù ·ä¢·âµ·ãÆ·åµ·ã´·ãä ·àµ·àã·àç·àÜ·äê ·â†·ä¶·àÆ·àù·äõ·ãç ·â¢·àà·çã·ã∞·ãµ ·àù·äï ·ä†·åà·â£·äï
29996,0,·â∞·â£·à®·ä≠ ·ä†·â•·âπ ·çà·à≠ ·âÄ·ã≥·åÖ ·àµ·àà·àÜ·äï·àÖ ·àò·åã·à®·åÉ·ãç ·àò·âÄ·ã∞·ãµ ·àµ·àà·åÄ·àò·à®
29997,0,·ä•·àµ·ä® ·ä†·àÅ·äï ·ä†·äï·â∞ ·â•·âª ·äê·ãç ·â† ·àò·çÖ·àÄ·çç ·ã´·àç·âª·àç·ä®·ãç ·ä†·äï·â∞·àù ·â≥·à™·ä≠ ·ä•...
29998,1,·àÖ·åà·ãà·å•·âµ ·å†·âÖ·àã·ã≠ ·àö·äï·àµ·âµ·à≠ ·çÖ·â§·âµ ·ã®·â∞·çà·âÄ·ã∞ ·àÜ·äñ ·àÖ·ãù·â•·äï ·ä•·äï·ã¥·âµ ·àÖ·åç ·ä†·ä≠...


In [17]:
#Divide the dataset into train, validation and test categories
train_val_df, test_dataset = train_test_split(Merged, test_size=0.20, random_state=42)
train_dataset, evaluation_dataset = train_test_split(train_val_df, test_size=0.115, random_state=42)
print('Training dataset shape: ', train_dataset.shape)
print('Validation dataset shape: ', evaluation_dataset.shape)
print('Testing dataset shape: ', test_dataset.shape)

Training dataset shape:  (21240, 2)
Validation dataset shape:  (2760, 2)
Testing dataset shape:  (6000, 2)


In [18]:
#convert format of the dataset to HuggingFace Dataset from Pandas DataFrame
test_dataset=Dataset.from_pandas(test_dataset)

In [19]:
#convert the format of the dataset to HuggingFace Dataset from Pandas DataFrame
train_dataset=Dataset.from_pandas(train_dataset)

In [20]:
#convert the format of the dataset to HuggingFace Dataset from Pandas DataFrame
evaluation_dataset=Dataset.from_pandas(evaluation_dataset)

In [21]:
#preview of the dataset after conversion
(test_dataset)

Dataset({
    features: ['labels', 'post', '__index_level_0__'],
    num_rows: 6000
})

In [22]:
#remove unnecessary column
test_dataset=test_dataset.remove_columns("__index_level_0__")
train_dataset=train_dataset.remove_columns("__index_level_0__")
evaluation_dataset=evaluation_dataset.remove_columns("__index_level_0__")

In [23]:
#combine the train and test dataset into one datset
main_dataset= datasets.DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'evaluate': evaluation_dataset
})

In [24]:
#preview of the dataset after merging
main_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'post'],
        num_rows: 21240
    })
    test: Dataset({
        features: ['labels', 'post'],
        num_rows: 6000
    })
    evaluate: Dataset({
        features: ['labels', 'post'],
        num_rows: 2760
    })
})

In [25]:
# training and testing data size
training_data_size = main_dataset['train'].num_rows
testing_data_size = main_dataset['test'].num_rows
evaluation_data_size = main_dataset['evaluate'].num_rows

### *Step 5: Tokenizing Dataset*

A Tokenizer is used to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

In this case, the tokenizer used is an AutoTokenizer from the fine-tuned mBERT model made available by Hugging face [here](https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-amharic)


In this phase, we have the following tasks:
* Load the tokenizer
* Create a tokenizer function that takes the dataset in batches and tokenize them using the tokenizer loaded from the model
* Call the tokenizer function on the whole dataset

In [26]:
#loading a tokenizer from the pretrained model
tokenizer = AutoTokenizer.from_pretrained("Davlan/bert-base-multilingual-cased-finetuned-amharic")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [27]:
#Have a tokenizer function that uses the tokenizer
def tokenize_function(data):
    return tokenizer(data["post"], padding="max_length", truncation=True)

In [28]:
#Tokenize all the data using the mapping functionality
tokenized_datasets = main_dataset.map(tokenize_function)


Map:   0%|          | 0/21240 [00:00<?, ? examples/s]

Map:   0%|          | 0/6000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2760 [00:00<?, ? examples/s]

In [29]:
#empty cache
torch.cuda.empty_cache()

### *Step 6: Prepare the tokenized Dataset*

In this phase, we do the following tasks:

* Remove unnecessary columns such as the "posts" column from the tokenized dataset as we no longer need them
* Change the format of the tokenized dataset into pytorch since we are using pytorch
* Load the dataset using DataLoader with the proper batch size
* Preview the features of the dataset to make sure everything is okay

In [30]:
#remove the posts column as it is no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["post"])

In [31]:
#changing the format of the tokenized dataset to torch
tokenized_datasets.set_format("torch")

In [32]:
#shuffeling and selecting the needed size of dataset for training and evaluating the model
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(training_data_size))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(testing_data_size))
small_eval_dataset = tokenized_datasets["evaluate"].shuffle(seed=42).select(range(evaluation_data_size))

In [33]:
# preview of the shuffeled and selected evaluation dataset
small_eval_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2760
})

In [34]:
# preview of the shuffeled and selected training dataset
small_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 21240
})

In [35]:
# preview of the shuffeled and selected testing dataset
small_test_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 6000
})

In [36]:
#load the dataset using DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=4)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=4)
test_dataloader = DataLoader(small_test_dataset, batch_size=4)

### *Step 7: Fine-tune the model*

This phase has the following steps:
* Load the model
* Specify the computing metric
* Specify the Training/fine-tuning arguments
* Load the Trainer class
* Fine-tune the model

**7.1 Load the model**<br>
We load the fine-tuned mBERT mode in this step

In [37]:
#Load auto mode classifier from the pretrained model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("Davlan/bert-base-multilingual-cased-finetuned-amharic", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Davlan/bert-base-multilingual-cased-finetuned-amharic and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**7.2 Computing Metrics** <br>
In this stage, we load the computing metrics. The computing metrics used in this phase are the f1-score and the accuracy. These computing metrics are used during the validation and testing phase  

In [38]:
import numpy as np
metric = evaluate.load("f1","accuracy")

In [39]:
#Function that uses the loaded metrics to compute the performance of the model
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

**7.3 Specify the training arguments** <br>
This phase includes loading the training parameters and hyperparameters.
It also specifies the validation interval during the fine-tuning process.

In [40]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")

In [41]:
from transformers import EarlyStoppingCallback, IntervalStrategy

training_args = TrainingArguments(
    output_dir="training_with_callbacks",
    evaluation_strategy=IntervalStrategy.STEPS,  # Evaluate every few steps
    warmup_steps=1000,  # Increase warmup steps to stabilize training
    save_steps=2000,
    eval_steps=2000,  # Evaluate and save every 2000 steps
    save_total_limit=3,  # Keep only the last 3 models
    learning_rate=5e-5,  # Adjust learning rate to 5e-5
    per_device_train_batch_size=8,  # Increase batch size to 8
    per_device_eval_batch_size=8,  # Increase batch size to 8
    num_train_epochs=20,  # Increase epochs to 20
    weight_decay=0.01,  # Keep weight decay the same, or adjust if needed
    push_to_hub=False,
    metric_for_best_model='f1',
    do_predict=True,
    load_best_model_at_end=True
)



**7.4 Load the Trainer class**<br>
In the trainer class, early stopping strategy is called. Early Stopping is a an optimization technique used to reduce overfitting without compromising on model accuracy. It allows to specify an arbitrary large number of training epochs and stop training once the model performance stops improving on a hold out validation dataset. For this model, the early stopping patience used is 10 epoches.

In [42]:
print(len(small_train_dataset))
print(len(small_eval_dataset))

21240
2760


In [43]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)

**7.5 Fine-tune the model** <br>
Fine-tuning process embbeds the validation within itself. After every 2000 steps of finetuning, the model is validated on the loaded computing metrics to modify the hyperparameters to make the model perform well

In [45]:
trainer.train()

Step,Training Loss,Validation Loss


Error: You must call wandb.init() before wandb.log()

In [None]:
# Save the trained model to a specific directory
trainer.save_model("/content")