## Task 3: Fine-Tune an NER Model
Objective: Fine-tune a Named Entity Recognition (NER) model to extract key entities (e.g., products, prices, and locations) from Amharic Telegram messages.
Steps:
1. Use Google Colab or another environment with GPU support for faster training.
2. Install the necessary libraries (`transformers`, `datasets`, `accelerate`, `peft`, `seqeval`) with `pip`.
3. Use a pre-trained model that supports Amharic, such as XLM-RoBERTa, bert-tiny-amharic, or AfroXLMR.
4. Load the labeled dataset in CoNLL format from the previous task.
5. Use Hugging Face's `datasets` library to load the data, or manually parse the CoNLL format into a pandas DataFrame.
6. Tokenize the data and align the labels with the tokens produced by the tokenizer.
7. Set up training arguments, such as learning rate, number of epochs, batch size, and evaluation strategy.
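
The CoNLL loading step can be sketched on an in-memory sample before pointing it at the real file. This is a minimal illustration, not the notebook's loader: the sample messages, the tag names (`B-PRICE`, `I-PRICE`, `B-LOC`, `I-LOC`), and the helper name `parse_conll` are all assumptions made for the example. It assumes one token and one tag per line, with a blank line between messages.

```python
from io import StringIO

# Hypothetical two-message sample in CoNLL format (token, then tag, per line).
SAMPLE = """\
ዋጋ B-PRICE
500 I-PRICE
ብር I-PRICE

አዲስ B-LOC
አበባ I-LOC
"""


def parse_conll(handle):
    """Return (sentences, tags) parsed from a file-like object."""
    sentences, tags = [], []
    tokens, labels = [], []
    for raw in handle:
        parts = raw.split()
        if not parts:  # a blank line closes the current sentence
            if tokens:
                sentences.append(tokens)
                tags.append(labels)
                tokens, labels = [], []
            continue
        tokens.append(parts[0])   # first column: token
        labels.append(parts[-1])  # last column: tag
    if tokens:  # the input may not end with a blank line
        sentences.append(tokens)
        tags.append(labels)
    return sentences, tags


sentences, tags = parse_conll(StringIO(SAMPLE))
print(sentences)
print(tags)
```

Reading the tag from the last column keeps the parser usable if the file carries extra columns between token and tag.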



In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Install Hugging Face Transformers, Datasets, and PEFT (for parameter-efficient fine-tuning)
!pip install transformers datasets accelerate peft

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [2]:
!pip install transformers datasets seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=8311b3a35a409d50c8aa3c4f03d7dd4c81fe41d4ddfa5eff6367c9343c995c8e
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [13]:

import pandas as pd
from datasets import load_dataset, Dataset, ClassLabel
# The fast tokenizer is required: only fast tokenizers expose word_ids() for label alignment
from transformers import XLMRobertaTokenizerFast, Trainer, TrainingArguments, XLMRobertaForTokenClassification
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
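
The label-alignment step can be illustrated without downloading a tokenizer. The sketch below is an assumption-laden toy: `align_labels`, the `label2id` map, and the hand-written `word_ids` list are hypothetical stand-ins for what a fast tokenizer's `word_ids()` returns (`None` for special tokens, otherwise the index of the source word). Positions set to `-100` are skipped by the PyTorch cross-entropy loss.

```python
def align_labels(word_ids, word_labels, label2id, label_all_subwords=False):
    """Map word-level tags onto subword positions.

    word_ids mimics the output of a fast tokenizer's word_ids():
    None for special tokens, otherwise the index of the source word.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special token: ignored by the loss
        elif word_id != previous or label_all_subwords:
            aligned.append(label2id[word_labels[word_id]])
        else:
            aligned.append(-100)  # continuation subword: ignored by the loss
        previous = word_id
    return aligned


# Hypothetical label map and a three-word message.
label2id = {"O": 0, "B-PRICE": 1, "I-PRICE": 2}
word_labels = ["B-PRICE", "I-PRICE", "O"]
# Hypothetical tokenization: <s>, word 0, word 1 split into two pieces, word 2, </s>
word_ids = [None, 0, 1, 1, 2, None]

first_only = align_labels(word_ids, word_labels, label2id)
all_pieces = align_labels(word_ids, word_labels, label2id, label_all_subwords=True)
print(first_only)  # [-100, 1, 2, -100, 0, -100]
print(all_pieces)  # [-100, 1, 2, 2, 0, -100]
```

Labeling only the first subword of each word keeps the evaluation at one prediction per word; passing `label_all_subwords=True` instead gives every piece its word's tag, which is what the notebook's `tokenize_and_align_labels` does.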

In [None]:

def load_conll_dataset(file_path):
    sentences, labels = [], []
    with open(file_path, 'r', encoding='utf-8') as f:
        sentence, label = [], []
        for line in f:
            line = line.strip()
            if line:
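                parts = line.split()
                if len(parts) < 2:
                    continue  # skip malformed lines that lack a tag
                token, tag = parts[0], parts[-1]  # tolerate extra columns between token and tag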
                sentence.append(token)
                label.append(tag)
            else:
                if sentence:
                    sentences.append(sentence)
                    labels.append(label)
                    sentence, label = [], []
        if sentence:
            sentences.append(sentence)
            labels.append(label)
    return sentences, labels

# Load your CoNLL dataset
file_path = r'/content/drive/MyDrive/labele_data.conll'
sentences, labels = load_conll_dataset(file_path)

# Create a DataFrame
data = {'tokens': sentences, 'ner_tags': labels}
df = pd.DataFrame(data)

# Convert labels to ClassLabel (sorted so the label-to-id mapping is the same on every run)
unique_labels = {tag for label in labels for tag in label}
class_labels = ClassLabel(names=sorted(unique_labels))

# Create a Dataset object
dataset = Dataset.from_pandas(df)

# Step 3: Tokenize the data
tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-base')

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
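        # Special tokens get -100 so the loss ignores them; every subword inherits its word's tag
        label_ids = [-100 if word_id is None else class_labels.str2int(label[word_id]) for word_id in word_ids]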
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

# Step 4: Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

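# Step 5: Fine-tune the model
# The classification head needs one output per tag; without num_labels it defaults to 2
label_names = class_labels.names
model = XLMRobertaForTokenClassification.from_pretrained(
    'xlm-roberta-base',
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)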

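# Pad input_ids and labels together; the default padding collator does not pad the labels column
from transformers import DataCollatorForTokenClassification

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # Evaluates on the training data; use a separate validation set if available
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)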

trainer.train()

# Step 6: Evaluate the model
predictions, labels, _ = trainer.predict(tokenized_dataset)
predictions = np.argmax(predictions, axis=2)

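# Convert predictions and labels to tag strings, dropping positions marked -100
true_labels = [[class_labels.int2str(int(label)) for label in label_row if label != -100] for label_row in labels]
pred_labels = [[class_labels.int2str(int(pred)) for pred, label in zip(pred_row, label_row) if label != -100] for pred_row, label_row in zip(predictions, labels)]

# Flatten to one tag per token: scikit-learn expects 1-D label sequences, not nested lists
flat_true = [tag for row in true_labels for tag in row]
flat_pred = [tag for row in pred_labels for tag in row]

# Calculate token-level precision, recall, and F1 score
precision, recall, f1, _ = precision_recall_fscore_support(flat_true, flat_pred, average='weighted', zero_division=0)

print(f'Precision: {precision}, Recall: {recall}, F1 Score: {f1}')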

# Step 7: Save the model
model.save_pretrained('./fine_tuned_ner_model')
tokenizer.save_pretrained('./fine_tuned_ner_model')

print("Model saved successfully!")
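
Because the cell above evaluates on the same data it trains on, the reported scores will tend to be optimistic. One way to hold out a validation set is sketched below in plain Python; the helper name `split_sentences`, the 80/20 ratio, the seed, and the toy data are assumptions for illustration. The `datasets` library also offers `Dataset.train_test_split(test_size=..., seed=...)`, which serves the same purpose directly on a `Dataset` object.

```python
import random


def split_sentences(sentences, labels, eval_fraction=0.2, seed=42):
    """Shuffle sentence indices once and split them into train and eval parts."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_eval = max(1, int(len(indices) * eval_fraction))
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def pick(idx):
        return [sentences[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(eval_idx)


# Hypothetical toy corpus: ten one-token sentences, all tagged "O".
toy_sentences = [[f"tok{i}"] for i in range(10)]
toy_labels = [["O"] for _ in range(10)]

(train_sents, train_tags), (eval_sents, eval_tags) = split_sentences(toy_sentences, toy_labels)
print(len(train_sents), len(eval_sents))  # 8 2
```

The two halves can then be wrapped in separate `Dataset` objects and passed as `train_dataset` and `eval_dataset`.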

Map:   0%|          | 0/3257 [00:00<?, ? examples/s]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: