# Fine-Tuning an NER Model for Amharic Text

## Overview
This notebook fine-tunes the XLM-RoBERTa transformer model for Named Entity Recognition (NER) on Amharic text. It targets three entity types found in Amharic commercial text: products, prices, and locations.

## Key Features
- Data preprocessing for CoNLL format
- Token alignment for transformer models
- Fine-tuning XLM-RoBERTa for multilingual NER
- Entity-level precision, recall, and F1 (via seqeval), plus token-level accuracy
- Model persistence and inference capabilities

## 1. Environment Setup and Library Installation

Installing all required dependencies for the NER fine-tuning process.

In [1]:
# Install Required Libraries
!pip install transformers datasets torch accelerate evaluate seqeval
!pip install huggingface_hub --upgrade
!pip install --upgrade ipywidgets nbformat nbconvert

Collecting evaluate
  Downloading evaluate-0.4.4-py3-none-any.whl.metadata (9.5 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70

## 2. Import Dependencies

Importing all necessary libraries for data processing, model training, and evaluation.

In [2]:
import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import json
from datetime import datetime
import re
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score
from google.colab import drive, files
import io

## 3. Google Drive Integration

Mounting Google Drive to access training data and save model artifacts.

In [3]:
# Mount Google Drive to access your data
drive.mount('/content/drive')

Mounted at /content/drive


## 4. Data Loading and Preprocessing Functions

Defining utility functions to load CoNLL format data and convert it into a structured format suitable for transformer models.

In [4]:
def load_conll_data(file_path):
    """Load CoNLL format data and convert to structured format"""
    sentences = []
    labels = []

    current_sentence = []
    current_labels = []

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line == "":  # Empty line indicates sentence boundary
                if current_sentence:
                    sentences.append(current_sentence)
                    labels.append(current_labels)
                    current_sentence = []
                    current_labels = []
            else:
                parts = line.split('\t')
                if len(parts) >= 2:
                    token = parts[0]
                    label = parts[1]
                    current_sentence.append(token)
                    current_labels.append(label)

    # Add the last sentence if it doesn't end with empty line
    if current_sentence:
        sentences.append(current_sentence)
        labels.append(current_labels)

    return sentences, labels

## 5. Sample Data Generation

Creating sample CoNLL format data for demonstration purposes with Amharic text and entity labels.

In [5]:
def create_sample_conll_data():
    """Create sample CoNLL data for demonstration"""
    sample_data = """LIFESTAR	B-PRODUCT
1	I-PRODUCT
Million	I-PRODUCT
4K	I-PRODUCT
Android	I-PRODUCT
ሪሲቨር	I-PRODUCT
ዋጋ	B-PRICE
7000	I-PRICE
ብር	I-PRICE
ነው	O
.	O

MAGIC	B-PRODUCT
REMOTE	I-PRODUCT
በ	O
ማራኪ	B-LOC
ሳት	I-LOC
አለ	O
.	O

የመኪና	B-PRODUCT
ANDROID	I-PRODUCT
SCREEN	I-PRODUCT
በጥራት	O
ዋጋ	B-PRICE
በግማሽ	I-PRICE
የቀነሰ	I-PRICE
.	O

Xcruiser	B-PRODUCT
Magic	I-PRODUCT
Box	I-PRODUCT
መርካቶ	B-LOC
አንዋር	I-LOC
መስጂድ	I-LOC
ጎንደር	I-LOC
ውስጥ	O
.	O"""

    # Save sample data to file
    with open('/content/sample_conll_data.txt', 'w', encoding='utf-8') as f:
        f.write(sample_data)

    return '/content/sample_conll_data.txt'

## 6. Data Upload and Loading

Uploading the CoNLL format training data and loading it into memory for processing.

In [6]:
# Upload your CoNLL data file
print("Please upload your CoNLL format labeled data file:")
uploaded = files.upload()
conll_file_path = list(uploaded.keys())[0] if uploaded else create_sample_conll_data()

# Load the data
sentences, labels = load_conll_data(conll_file_path)
print(f"Loaded {len(sentences)} sentences")
print(f"Sample sentence: {sentences[0]}")
print(f"Sample labels: {labels[0]}")

Please upload your CoNLL format labeled data file:


Saving auto_labeled_training.conll to auto_labeled_training.conll
Loaded 401 sentences
Sample sentence: ['# Format: TOKEN']
Sample labels: ['LABEL']
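
The sample output above shows `['# Format: TOKEN']` with the label `LABEL`, which suggests that a header comment in the uploaded file was parsed as a one-token sentence. A minimal sketch of a loader variant that skips such lines follows. It assumes that comment lines start with `# ` (hash followed by a space) and that no real token does, which may not hold for every dataset. `load_conll_lines` and the sample text are hypothetical names used for illustration, and the function takes an iterable of lines rather than a file path so it can be tried without a file.

```python
def load_conll_lines(lines, comment_prefix="# "):
    """Parse CoNLL-style lines, skipping lines that start with comment_prefix.

    The comment_prefix default is an assumption about the file format.
    """
    sentences, labels = [], []
    current_sentence, current_labels = [], []
    for raw_line in lines:
        line = raw_line.strip()
        if line.startswith(comment_prefix):
            continue  # header or comment line, not a token
        if line == "":
            # Blank line: close the current sentence, if any
            if current_sentence:
                sentences.append(current_sentence)
                labels.append(current_labels)
                current_sentence, current_labels = [], []
            continue
        parts = line.split("\t")
        if len(parts) >= 2:
            current_sentence.append(parts[0])
            current_labels.append(parts[1])
    # Keep the last sentence when the input does not end with a blank line
    if current_sentence:
        sentences.append(current_sentence)
        labels.append(current_labels)
    return sentences, labels


# Hypothetical sample that starts with a header comment
sample_text = "# Format: TOKEN\tLABEL\n\nዋጋ\tB-PRICE\n7000\tI-PRICE\nብር\tI-PRICE\n\nነው\tO\n"
demo_sentences, demo_labels = load_conll_lines(sample_text.splitlines())
print(demo_sentences)
print(demo_labels)
```

With the header skipped, the label `LABEL` would no longer enter the label set built in the next section.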



## 7. Label Mapping Creation

Creating bidirectional mappings between entity labels and their corresponding IDs for model training.

In [7]:
# Create label mappings
unique_labels = set()
for label_seq in labels:
    unique_labels.update(label_seq)

label_list = sorted(list(unique_labels))
label_to_id = {label: i for i, label in enumerate(label_list)}
id_to_label = {i: label for i, label in enumerate(label_list)}

print(f"Unique labels: {label_list}")
print(f"Number of labels: {len(label_list)}")

Unique labels: ['B-LOC', 'B-PRICE', 'B-Product', 'I-LOC', 'I-PRICE', 'I-Product', 'LABEL', 'O']
Number of labels: 8
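
The label list printed above contains `LABEL`, which is not a BIO tag, so the classifier is sized for eight classes where only seven are meaningful. A small sanity check along the following lines could catch such entries before training. The helper name `find_non_bio_labels` and the regular expression are assumptions for illustration and are not part of the original notebook.

```python
import re

# A label is accepted if it is "O" or has the form "B-<type>" / "I-<type>"
BIO_PATTERN = re.compile(r"^(O|[BI]-\S+)$")


def find_non_bio_labels(label_names):
    """Return the labels that do not follow the BIO naming scheme."""
    return sorted(name for name in label_names if not BIO_PATTERN.match(name))


# The label list printed by the notebook
observed_labels = ['B-LOC', 'B-PRICE', 'B-Product', 'I-LOC', 'I-PRICE', 'I-Product', 'LABEL', 'O']
suspicious_labels = find_non_bio_labels(observed_labels)
print(suspicious_labels)  # ['LABEL']
```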


## 8. Tokenization and Label Alignment

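The XLM-RoBERTa tokenizer splits words into subword pieces, so the word-level entity labels must be realigned with the tokenizer output. The function below assigns each word's label to its first subword and marks special tokens and continuation subwords with -100, the index that the loss function ignores.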

In [8]:
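def tokenize_and_align_labels(examples, tokenizer, label_to_id):
    """Tokenize pre-split sentences and align word-level labels to subwords."""
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # input is already split into words
        padding=False              # padding is done per batch by the data collator
    )

    labels = []
    for i, label in enumerate(examples["labels"]):
        # word_ids maps each subword to the index of its source word (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special token: ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of a word: carries the word's label
                label_ids.append(label_to_id[label[word_idx]])
            else:
                # Continuation subword: ignored by the loss
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs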
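
To make the alignment rule concrete, the sketch below applies the same logic to a hand-written `word_ids` list instead of a real tokenizer output. The subword split shown (the first word broken into two pieces) and the label IDs are made up for illustration; the real split depends on the XLM-RoBERTa vocabulary. `align_word_labels` is a hypothetical helper name.

```python
def align_word_labels(word_ids, word_labels, label_to_id):
    """Label the first subword of each word; mark everything else with -100."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token
        elif word_idx != previous_word_idx:
            aligned.append(label_to_id[word_labels[word_idx]])  # first subword
        else:
            aligned.append(-100)  # continuation subword
        previous_word_idx = word_idx
    return aligned


# Toy inputs: three words, the first one split into two subwords
toy_label_to_id = {"B-PRICE": 0, "I-PRICE": 1, "O": 2}
toy_word_labels = ["B-PRICE", "I-PRICE", "O"]
toy_word_ids = [None, 0, 0, 1, 2, None]  # None marks the start and end special tokens

toy_aligned = align_word_labels(toy_word_ids, toy_word_labels, toy_label_to_id)
print(toy_aligned)  # [-100, 0, -100, 1, 2, -100]
```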

## 9. Dataset Preparation and Splitting

Splitting the data into training, validation, and test sets (roughly 60/20/20) with a fixed random seed, so that the model is tuned on the validation set and scored once on held-out test data.

In [9]:
# Prepare Dataset
df = pd.DataFrame({
    'tokens': sentences,
    'labels': labels
})

# Split the data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)

print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

# Convert to Hugging Face datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

Train: 240, Validation: 80, Test: 81
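
The printed split sizes follow from the two `test_size` values applied in sequence. The arithmetic can be sketched as below, assuming the test share is rounded up at each step, which is consistent with the sizes printed above. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_fraction, val_fraction_of_rest):
    """Sizes of a two-stage split, assuming each held-out share is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    n_rest = n_samples - n_test
    n_val = math.ceil(n_rest * val_fraction_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# 401 sentences, 20% test, then 25% of the remainder for validation
sizes = split_sizes(401, 0.2, 0.25)
print(sizes)  # (240, 80, 81)
```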


## 10. Model and Tokenizer Initialization

Loading the pre-trained XLM-RoBERTa model and tokenizer, configured for token classification with our specific entity labels.

In [10]:
# Model Setup
MODEL_NAME = "xlm-roberta-base"  # You can also try "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_list),
    id2label=id_to_label,
    label2id=label_to_id
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 11. Dataset Tokenization

Applying tokenization and label alignment to all dataset splits for model training compatibility.

In [11]:
# Tokenize datasets
train_tokenized = train_dataset.map(
    lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
    batched=True
)
val_tokenized = val_dataset.map(
    lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
    batched=True
)
test_tokenized = test_dataset.map(
    lambda x: tokenize_and_align_labels(x, tokenizer, label_to_id),
    batched=True
)

Map:   0%|          | 0/240 [00:00<?, ? examples/s]

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/81 [00:00<?, ? examples/s]

## 12. Data Collator Setup

Configuring the data collator for dynamic padding and batch preparation during training.

In [12]:
# Data Collator
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True
)
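
As a rough picture of what dynamic padding does, the plain-Python sketch below pads each batch only to the length of its longest sequence and fills the label positions with -100 so that padding is ignored by the loss. It is a simplified illustration, not the collator's implementation; `pad_batch`, the token IDs, and `pad_token_id=1` are illustrative assumptions.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad input_ids and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy sequences of different lengths (IDs are made up)
toy_features = [
    {"input_ids": [0, 11, 12, 2], "labels": [-100, 3, 4, -100]},
    {"input_ids": [0, 21, 2], "labels": [-100, 7, -100]},
]
toy_batch = pad_batch(toy_features, pad_token_id=1)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
print(toy_batch["labels"])
```

Padding per batch rather than to a global maximum length keeps short batches small, which is why `padding=False` is used at tokenization time.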

## 13. Evaluation Metrics Configuration

Defining the evaluation metrics with seqeval. Precision, recall, and F1 are computed at the entity level (a predicted entity counts only if its type and full span match), while accuracy is computed per token.

In [13]:
def compute_metrics(eval_pred):
    """Compute seqeval metrics from the Trainer's (logits, label IDs) pair."""
    logits, label_ids = eval_pred
    pred_ids = np.argmax(logits, axis=2)

    # Drop positions labeled -100 (special tokens and continuation subwords)
    true_predictions = [
        [id_to_label[p] for (p, l) in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(pred_ids, label_ids)
    ]
    true_labels = [
        [id_to_label[l] for (p, l) in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(pred_ids, label_ids)
    ]

    results = {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "accuracy": accuracy_score(true_labels, true_predictions),
    }
    return results
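
Entity-level scores can be much lower than token accuracy, because one wrong tag inside an entity makes the whole entity count as a miss. The sketch below illustrates this with a simplified, plain-Python span matcher. It is not seqeval's implementation, and `extract_spans`, `entity_scores`, and the example tags are hypothetical.

```python
def extract_spans(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return spans


def entity_scores(gold_tags, pred_tags):
    """Exact-match entity precision/recall/F1 plus token accuracy for one sentence."""
    gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
    n_correct = len(gold_spans & pred_spans)
    precision = n_correct / len(pred_spans) if pred_spans else 0.0
    recall = n_correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": token_accuracy}


gold = ["B-PRODUCT", "I-PRODUCT", "I-PRODUCT", "O", "B-PRICE", "I-PRICE"]
pred = ["B-PRODUCT", "I-PRODUCT", "O", "O", "B-PRICE", "I-PRICE"]  # product span cut short
toy_scores = entity_scores(gold, pred)
print(toy_scores)
```

Here five of six tokens are tagged correctly, yet only one of the two entities matches exactly, so F1 is 0.5 while token accuracy is about 0.83.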

## 14. Training Configuration

Setting up comprehensive training arguments including learning rate, batch sizes, evaluation strategy, and model saving parameters.

In [21]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    save_total_limit=2,
    report_to="none",  # Disable logging integrations such as wandb
    dataloader_pin_memory=False,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## 15. Trainer Initialization

Initializing the Hugging Face Trainer with our model, datasets, and configuration parameters.

In [22]:
# Trainer Setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



## 16. Model Training

Starting the fine-tuning process. This cell will train the model for the specified number of epochs with automatic evaluation and checkpointing.

In [23]:
# Train the Model
print("Starting training...")
trainer.train()

Starting training...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.636 | 0.417272 | 0.548936 | 0.221459 | 0.315596 | 0.875386 |
| 2 | 0.3827 | 0.318922 | 0.484375 | 0.505579 | 0.49475 | 0.917576 |
| 3 | 0.3234 | 0.284403 | 0.495502 | 0.567382 | 0.529012 | 0.924367 |


TrainOutput(global_step=45, training_loss=0.430129517449273, metrics={'train_runtime': 223.8797, 'train_samples_per_second': 3.216, 'train_steps_per_second': 0.201, 'total_flos': 186551506794240.0, 'train_loss': 0.430129517449273, 'epoch': 3.0})
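
The `global_step=45` in the output above can be checked from the training configuration. The sketch below assumes a single device, no gradient accumulation, and that a final partial batch would still count as a step; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(n_train_examples, batch_size, n_epochs):
    """Number of optimizer steps under the assumptions stated above."""
    steps_per_epoch = math.ceil(n_train_examples / batch_size)
    return steps_per_epoch * n_epochs


# 240 training sentences, batch size 16, 3 epochs
n_steps = total_optimizer_steps(240, 16, 3)
print(n_steps)  # 45
```

With only 15 steps per epoch, the model sees few updates, which is worth keeping in mind when reading the metrics.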

## 17. Model Evaluation

Evaluating the trained model on the held-out test set to assess final performance metrics.

In [24]:
# Evaluate on Test Set
test_results = trainer.evaluate(test_tokenized)
print(f"Test Results: {test_results}")

Test Results: {'eval_loss': 0.29333582520484924, 'eval_precision': 0.48769128409846974, 'eval_recall': 0.566023166023166, 'eval_f1': 0.5239456754824875, 'eval_accuracy': 0.9226082817705854, 'eval_runtime': 2.5, 'eval_samples_per_second': 32.4, 'eval_steps_per_second': 2.4, 'epoch': 3.0}
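
The reported F1 is the harmonic mean of the reported precision and recall, which can be verified directly from the numbers above. The helper name below is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the test results printed above
reported_precision = 0.48769128409846974
reported_recall = 0.566023166023166
reported_f1 = 0.5239456754824875

recomputed_f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(recomputed_f1, 6), round(reported_f1, 6))
```

Note the gap between the token accuracy of about 0.92 and the entity-level F1 of about 0.52: most tokens are `O`, so accuracy alone overstates how well entities are recovered.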




## 18. Model Persistence

Saving the fine-tuned model and tokenizer for future use and deployment.

In [25]:
# Save the Model
model.save_pretrained("./xlm-roberta-amharic-ner")
tokenizer.save_pretrained("./xlm-roberta-amharic-ner")

print("Model saved successfully!")

Model saved successfully!


## 19. Inference Function Implementation

Implementing a utility function for making predictions on new text using the trained model.

In [26]:
def predict_entities(text, model, tokenizer, id_to_label):
    """Predict one label per whitespace-separated token in `text`."""
    model.eval()
    tokens = text.split()
    encoding = tokenizer(tokens, truncation=True, is_split_into_words=True, return_tensors="pt")

    # Read word_ids from the tokenizer output before building a plain dict of tensors
    word_ids = encoding.word_ids(batch_index=0)

    # Move the input tensors to the model's device (for example a GPU after training)
    inputs = {key: val.to(model.device) for key, val in encoding.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.argmax(outputs.logits, dim=2)[0].cpu().tolist()

    # Keep the prediction for the first subword of each word, mirroring the training alignment
    previous_word_idx = None
    predicted_labels = []
    for i, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != previous_word_idx:
            predicted_labels.append(id_to_label[predictions[i]])
        previous_word_idx = word_idx

    # zip() stops at the shorter sequence, so words lost to truncation are dropped from the result
    return list(zip(tokens, predicted_labels))

## 20. Model Testing and Demonstration

Testing the trained model with sample Amharic text to demonstrate entity recognition capabilities.

In [27]:
# Test the Model
test_text = "LIFESTAR 1 Million 4K Android ሪሲቨር ዋጋ 7000 ብር መርካቶ ውስጥ አለ"


predictions = predict_entities(test_text, model, tokenizer, id_to_label)
print(f"Test text: {test_text}")
print(f"Predictions: {predictions}")

Test text: LIFESTAR 1 Million 4K Android ሪሲቨር ዋጋ 7000 ብር መርካቶ ውስጥ አለ
Predictions: [('LIFESTAR', 'O'), ('1', 'B-PRICE'), ('Million', 'O'), ('4K', 'B-PRICE'), ('Android', 'O'), ('ሪሲቨር', 'O'), ('ዋጋ', 'B-PRICE'), ('7000', 'B-PRICE'), ('ብር', 'B-PRICE'), ('መርካቶ', 'O'), ('ውስጥ', 'O'), ('አለ', 'O')]
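
Per-token tags are usually merged into entity spans before they are used downstream. The sketch below shows one lenient way to do this for BIO tags. `group_entities` and the tagged example, which reuses tokens from the sample data, are hypothetical and are not model output.

```python
def group_entities(tagged_tokens):
    """Merge (token, BIO tag) pairs into (entity type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in tagged_tokens:
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))  # close the open entity
        if tag == "O":
            current_type, current_tokens = None, []
        else:
            # "B-X" starts a new entity; a stray "I-X" is leniently treated the same way
            current_type, current_tokens = tag[2:], [token]
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-labeled example
toy_tagged = [
    ("LIFESTAR", "B-PRODUCT"), ("1", "I-PRODUCT"),
    ("ዋጋ", "B-PRICE"), ("7000", "I-PRICE"), ("ብር", "I-PRICE"),
    ("ነው", "O"),
]
toy_entities = group_entities(toy_tagged)
print(toy_entities)
```

Applied to the predictions printed above, where every price token is tagged `B-PRICE`, this grouping would return five separate one-token PRICE entities rather than a single price phrase.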


## Conclusion

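This notebook walks through the complete pipeline for fine-tuning a multilingual transformer model for Amharic NER: loading CoNLL data, aligning labels with subword tokens, training, evaluation, saving, and inference. The model is trained to identify:

- **Products**: Electronic devices and commercial items
- **Prices**: Monetary values and currency information
- **Locations**: Geographic places and addresses

After three epochs on 240 training sentences, the model reaches an entity-level F1 of about 0.52 on the test set (precision 0.49, recall 0.57). In the sample prediction above it tags only price-like tokens and misses the product and the location. It is best treated as a baseline that needs more training data, cleaner labels, or hyperparameter optimization before deployment.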