## Fine Tune NER Model

To fine-tune a Named Entity Recognition (NER) model to extract key entities (products, prices, and location) from Amharic Telegram messages, we will follow these steps.

**Step 1:** Set Up Environment with GPU Support

- Use Google Colab or GPU-Enabled Environment Ensure that selected a runtime with GPU in Google Colab:

  - Go to Runtime > Change runtime type > Select GPU.
  
- Install Necessary Libraries
  - Run the following commands in a code cell to install the required libraries:

In [None]:
# Uncomment below line, and run the cell
#!pip install pyarrow==10.0.1 datasets==2.4.0 seqeval


- Import necessary libraries

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import XLMRobertaTokenizerFast
from datasets import Dataset, Features, Sequence, Value
from transformers import TrainingArguments
from transformers import XLMRobertaForTokenClassification, AutoModelForTokenClassification, AutoTokenizer, Trainer


Once the required libraries are installed, we will use transformers for the model and datasets for loading data, and seqeval for evaluating the NER model.

**Step 2:** Load the Labeled Dataset from CoNLL File
- Load the CoNLL Dataset
  - we can load our CoNLL formatted data into a DataFrame. Here's how we can do that:

- Upload the conll file

In [None]:
from google.colab import files
uploaded = files.upload()


Saving labeled_data_conll.conll to labeled_data_conll.conll


In [17]:
# Function to load CoNLL formatted data
def load_conll(file_path):
    sentences = []
    labels = []
    with open(file_path, 'r', encoding='utf-8') as f:
        sentence = []
        label = []
        for line in f:
            if line.strip():  # Non-empty line
                token, label_item = line.split()
                sentence.append(token)
                label.append(label_item)
            else:  # Empty line indicates end of a sentence
                sentences.append(sentence)
                labels.append(label)
                sentence = []
                label = []
    return pd.DataFrame({'tokens': sentences, 'labels': labels})

# Load your CoNLL file
df = load_conll('labeled_data_conll.conll')


In [None]:
# Explore the first few rows
df.head()

Unnamed: 0,tokens,labels
0,"[ይህን, መፍጫ, ከሁሉም, የተሻለ, ሆኖ, አግኝተነዋል, 1, አስተማማኝ,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"[ማጂክ, መወልወያ, ውሃ, በከፍተኛ, ደረጃ, ይመጣል, በራሱ, ይጨምቃል,...","[O, O, O, O, O, O, O, O, O, O, O, B-PRICE, I-P..."
2,"[ባለ1, እና, ባለ, 2, ተች, ስቶቭ, ግዜዎን, እና, ጉልበትዎን, የሚ...","[O, O, O, O, O, B-PROD, O, O, O, O, O, B-PROD,..."
3,"[ሶስት, ፍሬ, የዳቦ, እና, የኬክ, ቅርጽ, ማውጫ, መጋገሪያ, ትራ, ት...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[360, የሚዞር, በቀላሉ, የውሃ, ቱቦ, ላይ, የሚገጠም, ለመኪና, እጥ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


**Define Unique Labels:**
- Extract unique labels from the DataFrame and create a mapping from labels to IDs.

In [18]:
unique_labels = set(label for sublist in df['labels'] for label in sublist)
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}


In [None]:
unique_labels

{'B-LOC', 'B-PRICE', 'B-PROD', 'I-PRICE', 'O'}

In [19]:
df['labels'] = df['labels'].apply(lambda x: [label2id[label] for label in x])


**Step 3:** Convert DataFrame to Hugging Face Dataset

In [20]:
# Convert DataFrame to Hugging Face Dataset
# Make sure 'labels' is a list of lists
# Define the features with the correct data types
features = Features({
    'tokens': Sequence(Value('string')),  # List of strings for tokens
    'labels': Sequence(Value('int32'))    # List of integers for labels
})

# Convert DataFrame to Hugging Face Dataset with specified features
dataset = Dataset.from_pandas(df[['tokens', 'labels']], features=features)

In [None]:
# Explore the datast
dataset

Dataset({
    features: ['tokens', 'labels'],
    num_rows: 967
})

**Step 4:** Tokenization and Label Alignment



In [21]:
# Initialize the Fast Tokenizer
# Use the fast tokenizer
# For XLM-Roberta
tokenizer = XLMRobertaTokenizerFast.from_pretrained(
    "xlm-roberta-base",
    clean_up_tokenization_spaces=True
    )
# For DistilBERT
# tokenizer_distilbert = AutoTokenizer.from_pretrained(
#     'distilbert-base-multilingual-cased',
#     clean_up_tokenization_spaces=True
#     )
# # For mBERT
# tokenizer_mbert = AutoTokenizer.from_pretrained(
#     'bert-base-multilingual-cased',
#     clean_up_tokenization_spaces=True
#     )


Define the Tokenization Function

In [22]:
# Tokenization and alignment function
def tokenize_and_align_labels(examples):
  tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True, padding="max_length", max_length=128)  # Set max_length as needed
  labels = []

  for i in range(len(examples['tokens'])):
      label = examples['labels'][i]
      tokenized_label = [-100] * len(tokenized_inputs['input_ids'][i])  # Default label for all tokens

      # Aligning labels with tokens
      for j, token in enumerate(tokenized_inputs['input_ids'][i]):
          # Check if this token corresponds to the original word
          original_word_idx = tokenizer.decode(token).strip()
          if original_word_idx in examples['tokens'][i]:
              token_index = examples['tokens'][i].index(original_word_idx)
              tokenized_label[j] = label[token_index]  # Use the corresponding label

      labels.append(tokenized_label)

  tokenized_inputs['labels'] = labels
  return tokenized_inputs

Tokenize the dataset


In [23]:
# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
tokenized_dataset

Dataset({
    features: ['tokens', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 967
})

- Split the dataset into train and test data

In [24]:
# Split into train and validation datasets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1)  # 90% train, 10% validation

In [None]:
# Print the lengths of input_ids, attention_mask, and labels for verification
print(f"Number of samples: {len(tokenized_dataset)}")
print(f"Input IDs length: {[len(x) for x in tokenized_dataset['input_ids']]}")
print(f"Attention Mask length: {[len(x) for x in tokenized_dataset['attention_mask']]}")
print(f"Labels length: {[len(x) for x in tokenized_dataset['labels']]}")

Number of samples: 967
Input IDs length: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128

In [None]:
# Check the train and test split
train_test_split

DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 870
    })
    test: Dataset({
        features: ['tokens', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 97
    })
})

**Step 5:** Set Up Training Arguments
Configure the training arguments for your model.

In [25]:
# Set up training arguments with adjustments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",     # Evaluates at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=3,
    weight_decay=0.01,               # Strength of weight decay
    max_grad_norm=1.0,  # Gradient clipping
    logging_dir='./logs',            # Directory for storing logs
    logging_strategy="steps",        # Log at regular intervals
    logging_steps=50,                # Log every 50 steps
    save_strategy="epoch",           # Save model at the end of each epoch
    report_to="none",                # Only show logs in the output (no TensorBoard)
)


**Step 6:** Load and Fine-Tune the pre-trained model

- Use Hugging Face Trainer API
Fine-tune the model using the Trainer API.

- fine-tune each of the following pre-trained models:

- `xlm-roberta-base`

- `DistilBERT`

- `mBERT`

In [26]:
# Initialize each of the models
# For XLM-Roberta
model_xlmr = XLMRobertaForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=len(unique_labels)) # Ensure unique_labels is defined

# For DistilBERT
model_distilbert = AutoModelForTokenClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=len(unique_labels))

# For mBERT
model_distilbert = AutoModelForTokenClassification.from_pretrained('bert-base-multilingual-cased', num_labels=len(unique_labels))



Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


- Set Up Trainer for Each Model

In [27]:
trainer_xlmr = Trainer(
    model=model_xlmr,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test'],  # Changed from validation to test based on split
)
trainer_distilbert = Trainer(
    model=model_distilbert,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test'],  # Changed from validation to test based on split
)
trainer_mbert = Trainer(
    model=model_distilbert,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test'],  # Changed from validation to test based on split
)

**Step 7:** Evaluate and Train each model

In [1]:
# Fine-tune XLM-Roberta
trainer_xlmr.train()
trainer_xlmr.evaluate()

# Fine-tune DistilBERT
trainer_distilbert.train()
trainer_distilbert.evaluate()

# Fine-tune mBERT
trainer_mbert.train()
trainer_mbert.evaluate()


NameError: name 'trainer_xlmr' is not defined

**Step 7:** Save the trained model

In [None]:
# Save the model
model.save_pretrained("./fine_tuned_ner_model")
tokenizer.save_pretrained("./fine_tuned_ner_model")

**Step 8:** Evaluate the model

In [None]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.004838519264012575, 'eval_runtime': 0.7838, 'eval_samples_per_second': 123.749, 'eval_steps_per_second': 31.894, 'epoch': 3.0}


In [None]:
from seqeval.metrics import classification_report

predictions, labels, _ = trainer.predict(tokenized_dataset['validation'])
preds = np.argmax(predictions, axis=2)

# Create a list of true labels and predicted labels
true_labels = [[label_list[l] for l in label] for label in labels]
pred_labels = [[label_list[p] for p in pred] for pred in preds]

print(classification_report(true_labels, pred_labels))


In [None]:
from datasets import Dataset
from transformers import XLMRobertaTokenizer, XLMRobertaForTokenClassification, Trainer, TrainingArguments
import pandas as pd

# Load your labeled data (replace with your actual data loading method)


# Load the tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Tokenization and alignment function
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True, padding="max_length", max_length=128)  # Set max_length as needed
    labels = []

    for i in range(len(examples['tokens'])):
        label = examples['labels'][i]
        tokenized_label = [-100] * len(tokenized_inputs['input_ids'][i])  # Default label for all tokens

        # Aligning labels with tokens
        for j, token in enumerate(tokenized_inputs['input_ids'][i]):
            # Check if this token corresponds to the original word
            original_word_idx = tokenizer.decode(token).strip()
            if original_word_idx in examples['tokens'][i]:
                token_index = examples['tokens'][i].index(original_word_idx)
                tokenized_label[j] = label[token_index]  # Use the corresponding label

        labels.append(tokenized_label)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

# Split into train and validation datasets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1)  # 90% train, 10% validation

# Print the lengths of input_ids, attention_mask, and labels for verification
print(f"Number of samples: {len(tokenized_dataset)}")
print(f"Input IDs length: {[len(x) for x in tokenized_dataset['input_ids']]}")
print(f"Attention Mask length: {[len(x) for x in tokenized_dataset['attention_mask']]}")
print(f"Labels length: {[len(x) for x in tokenized_dataset['labels']]}")

# Set up training arguments with adjustments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=1e-5,  # Reduced learning rate
    per_device_train_batch_size=4,  # Reduced batch size
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=1.0,  # Gradient clipping
)



# Train the model
trainer.train()






  0%|          | 0/1 [00:00<?, ?ba/s]

Number of samples: 967
Input IDs length: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,0.003028
2,No log,0.000393
3,0.071600,0.00033


('./fine_tuned_ner_model/tokenizer_config.json',
 './fine_tuned_ner_model/special_tokens_map.json',
 './fine_tuned_ner_model/sentencepiece.bpe.model',
 './fine_tuned_ner_model/added_tokens.json')

NameError: name 'trainer' is not defined