# Arabic Named-Entity Recognition (NER) — Assignment

This notebook guides you through building an Arabic NER model using the ANERCorp dataset (`asas-ai/ANERCorp`). Fill in the TODO cells to complete the exercise.

- **Objective:** Train a token-classification model (NER) that labels tokens with entity tags (e.g., people, locations, organizations).
- **Dataset:** `asas-ai/ANERCorp` — contains tokenized Arabic text and tag sequences.
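- **Typical Labels:** NER tag sets usually follow the BIO scheme: `B-PER`, `I-PER` (person), `B-LOC`, `I-LOC` (location), `B-ORG`, `I-ORG` (organization), and `O` (outside/no entity). ANERCorp itself spells the person tags `B-PERS`/`I-PERS` and adds `B-MISC`/`I-MISC`, so your code should extract the exact label set from the dataset and build `label_list`, `id2label`, and `label2id` mappings from it rather than hard-coding tag names.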
- **Key Steps (what you will implement):**
  1. Load the dataset and inspect samples.
  2. Convert the provided words into sentence groupings (use `.` `?` `!` as sentence delimiters) before tokenization so sentence boundaries are preserved.
  3. Tokenize with a pretrained Arabic tokenizer and align tokenized sub-words with original labels (use `-100` for tokens to ignore in loss).
  4. Prepare `tokenized_datasets` and data collator for dynamic padding.
  5. Configure and run model training using `AutoModelForTokenClassification` and `Trainer`.
  6. Evaluate using `seqeval` (report precision, recall, F1, and accuracy) and run inference with a pipeline.

- **Evaluation:** Use the `seqeval` metric (entity-level precision, recall, F1). When aligning predictions and labels, filter out `-100` entries so only real token labels are compared.

- **Deliverables:** Completed notebook with working cells for data loading, tokenization/label alignment, training, evaluation, and an inference example. Add short comments explaining choices (e.g., sentence-splitting strategy, tokenizer settings).

Good luck — implement each TODO in order and run the cells to verify output.
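
The Evaluation bullet above asks for entity-level scores with `-100` positions filtered out. As a rough illustration of what that means, the sketch below scores predictions by exact span match in plain Python. It is a simplified stand-in for `seqeval`, not its implementation; the helper names (`extract_entities`, `entity_f1`) and the toy label mapping are assumptions made for this example.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            current = None
        # A stray I- tag is treated leniently as the start of a new span
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return entities


def entity_f1(pred_ids, gold_ids, id2label, ignore_index=-100):
    """Entity-level precision, recall and F1 over a batch of ID sequences."""
    tp = n_pred = n_gold = 0
    for preds, golds in zip(pred_ids, gold_ids):
        # Drop special tokens and non-first sub-words (gold label == -100)
        kept = [(p, g) for p, g in zip(preds, golds) if g != ignore_index]
        pred_spans = extract_entities([id2label[p] for p, _ in kept])
        gold_spans = extract_entities([id2label[g] for _, g in kept])
        tp += len(pred_spans & gold_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy mapping and data, invented for this example
toy_id2label = {0: "B-LOC", 1: "I-LOC", 2: "B-PERS", 3: "I-PERS", 4: "O"}
gold_ids = [[-100, 2, 3, 4, 0, -100]]  # B-PERS I-PERS O B-LOC
pred_ids = [[4, 2, 4, 4, 0, 4]]        # B-PERS O      O B-LOC
print(entity_f1(pred_ids, gold_ids, toy_id2label))  # (0.5, 0.5, 0.5)
```

In the toy batch three of the four kept tokens are tagged correctly, yet only one of the two predicted spans matches a gold span exactly, which is why entity-level F1 is usually well below token accuracy.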

In [1]:
# TODO: Install the required packages for Arabic NER with transformers
# Required packages: transformers, datasets, seqeval, evaluate, accelerate
# Use pip install with -q flag to suppress output

!pip install transformers datasets seqeval evaluate accelerate -q



In [2]:
# TODO: List the files in the current directory to explore the workspace
# Hint: Use a simple command to display directory contents

# YOUR CODE HERE
import os
print(os.listdir('.'))

['.config', 'sample_data']


In [3]:
# TODO: Load the ANERCorp dataset and extract label mappings
# Steps:
# 1. Import required libraries (datasets, numpy)
# 2. Load the "asas-ai/ANERCorp" dataset using load_dataset()
# 3. Inspect the dataset structure - print the splits and a sample entry
# 4. Extract unique tags from the training split
# 5. Create label_list (sorted), id2label, and label2id mappings

# YOUR CODE HERE
import numpy as np
from datasets import load_dataset

dataset = load_dataset("asas-ai/ANERCorp")

print(f"Dataset Split: {dataset}")
print(f"Sample Entry: {dataset['train'][0]}")

unique_tags = set(dataset['train']['tag'])

label_list = sorted(list(unique_tags))
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

print(f"\nLabel List: {label_list}")
print(f"id2label: {id2label}")




Dataset Split: DatasetDict({
    train: Dataset({
        features: ['word', 'tag'],
        num_rows: 125102
    })
    test: Dataset({
        features: ['word', 'tag'],
        num_rows: 25008
    })
})
Sample Entry: {'word': 'فرانكفورت', 'tag': 'B-LOC'}

Label List: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PERS', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PERS', 'O']
id2label: {0: 'B-LOC', 1: 'B-MISC', 2: 'B-ORG', 3: 'B-PERS', 4: 'I-LOC', 5: 'I-MISC', 6: 'I-ORG', 7: 'I-PERS', 8: 'O'}


In [4]:
# TODO: Verify the dataset was loaded correctly
# Print the dataframe or dataset summary to inspect the data structure

# YOUR CODE HERE
import pandas as pd
print("Sample Data (First 10 rows):")
df_sample = pd.DataFrame(dataset['train'][:10])
print(df_sample)

# Print the features summary again
print("\nDataset Features:")
print(dataset['train'].features)

Sample Data (First 10 rows):
        word    tag
0  فرانكفورت  B-LOC
1         (د      O
2          ب      O
3         أ)      O
4       أعلن      O
5      اتحاد  B-ORG
6      صناعة  I-ORG
7   السيارات  I-ORG
8         في      O
9    ألمانيا  B-LOC

Dataset Features:
{'word': Value('string'), 'tag': Value('string')}
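
Because the dataset is one flat word/tag column, it can be worth checking that the tags form valid BIO sequences before grouping them into sentences. The sketch below flags every `I-X` tag that does not continue a `B-X` or `I-X` tag; `find_bio_violations` is a hypothetical helper written for this example, and `sample_tags` copies the ten rows printed above.

```python
def find_bio_violations(tags):
    """Return indices where an I-X tag does not continue a B-X or I-X tag."""
    violations = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            etype = tag[2:]
            if previous not in (f"B-{etype}", f"I-{etype}"):
                violations.append(i)
        previous = tag
    return violations


# Tags of the first ten training rows shown in the output above
sample_tags = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
print(find_bio_violations(sample_tags))              # []
print(find_bio_violations(["O", "I-ORG", "I-ORG"]))  # [1]
```

The same check could be run on each grouped sentence later: a sentence that starts with an `I-` tag would suggest the delimiter rule split an entity in two.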


In [5]:
# TODO: Load tokenizer and create tokenization function
# Steps:
# 1. Import AutoTokenizer from transformers
# 2. Set model_checkpoint to "aubmindlab/bert-base-arabertv02"
# 3. Load the tokenizer using AutoTokenizer.from_pretrained()
# 4. Create tokenize_and_align_labels function that:
#    - Tokenizes the input text (is_split_into_words=True)
#    - Maps tokens to their original words
#    - Handles special tokens by setting them to -100
#    - Aligns labels with sub-word tokens
#    - Returns tokenized inputs with labels
# 5. Important: Convert words to sentences using punctuation marks ".?!" as sentence delimiters
#    - This helps the model understand sentence boundaries
#    - Hint (suggested approach): group `examples['word']` into sentence lists using ".?!" as end markers, e.g.:
#        sentences = []
#        current = []
#        for w in examples['word']:
#            current.append(w)
#            if w in ['.', '?', '!'] or (len(w) > 0 and w[-1] in '.?!'):
#                sentences.append(current)
#                current = []
#        if current:
#            sentences.append(current)
#      Then align `examples['tag']` accordingly to these sentence groups before tokenization.
# 6. Apply the function to the entire dataset using dataset.map()

from transformers import AutoTokenizer
from datasets import Dataset, DatasetDict

model_checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



def group_sentences(dataset_split):
    """Group the flat word/tag columns into sentences ending at '.', '?' or '!'."""
    sentences = []
    labels = []

    current_sentence = []
    current_labels = []

    for word, tag in zip(dataset_split['word'], dataset_split['tag']):
        current_sentence.append(word)
        current_labels.append(label2id[tag])  # Convert tag string to ID here

        # endswith also matches a word that is exactly '.', '?' or '!'
        if isinstance(word, str) and word.endswith(('.', '?', '!')):
            sentences.append(current_sentence)
            labels.append(current_labels)
            current_sentence = []
            current_labels = []

    # Keep trailing words that were never closed by a delimiter
    if current_sentence:
        sentences.append(current_sentence)
        labels.append(current_labels)

    return {"tokens": sentences, "ner_tags": labels}

print("Grouping words into sentences... this may take a moment.")
grouped_datasets = DatasetDict({
    "train": Dataset.from_dict(group_sentences(dataset['train'])),
    "test": Dataset.from_dict(group_sentences(dataset['test']))
})

print(f"Original Row Count (Words): {len(dataset['train'])}")
print(f"New Row Count (Sentences): {len(grouped_datasets['train'])}")
print(f"Sample Sentence: {grouped_datasets['train'][0]['tokens']}")



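def tokenize_and_align_labels(examples):
    # Input is already split into words, so the tokenizer only adds sub-word splits
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )

    labels = []

    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each token to its source word index (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP] are ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-word of a word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-words of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs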

tokenized_datasets = grouped_datasets.map(
    tokenize_and_align_labels, 
    batched=True,
    remove_columns=grouped_datasets['train'].column_names 
)

print("\nTokenization Complete!")
print("Input IDs shape:", len(tokenized_datasets['train'][0]['input_ids']))
print("Labels shape:", len(tokenized_datasets['train'][0]['labels']))


Grouping words into sentences... this may take a moment.
Original Row Count (Words): 125102
New Row Count (Sentences): 4262
Sample Sentence: ['فرانكفورت', '(د', 'ب', 'أ)', 'أعلن', 'اتحاد', 'صناعة', 'السيارات', 'في', 'ألمانيا', 'امس', 'الاول', 'أن', 'شركات', 'صناعة', 'السيارات', 'في', 'ألمانيا', 'تواجه', 'عاما', 'صعبا', 'في', 'ظل', 'ركود', 'السوق', 'الداخلية', 'والصادرات', 'وهي', 'تسعي', 'لان', 'يبلغ', 'الانتاج', 'حوالي', 'خمسة', 'ملايين', 'سيارة', 'في', 'عام', '2002', '.']




Tokenization Complete!
Input IDs shape: 46
Labels shape: 46
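
To make the alignment rule in `tokenize_and_align_labels` easier to inspect, here is the same logic applied to a hand-written `word_ids` list, with no tokenizer involved. The `word_ids` values are hypothetical (they are not the output of the AraBERT tokenizer), and the label IDs follow the `id2label` mapping printed earlier (`8` is `O`, `0` is `B-LOC`).

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-word of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a continuation sub-word of the previous word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical tokenization: [CLS], word 0 (1 piece), word 1 (2 pieces),
# word 2 (3 pieces), [SEP]
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
word_labels = [8, 0, 8]  # O, B-LOC, O
print(align_labels(word_ids, word_labels))
# [-100, 8, 0, -100, 8, -100, -100, -100]
```

Only three of the eight positions contribute to the loss, one per original word, which is also why the evaluation step has to filter out `-100` before comparing predictions with labels.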


In [6]:
# TODO: Define the compute_metrics function for model evaluation
# Steps:
# 1. Import evaluate and load "seqeval" metric
# 2. Create compute_metrics function that:
#    - Extracts predictions from model outputs using argmax
#    - Filters out -100 labels (special tokens and sub-words)
#    - Converts prediction and label IDs back to label names
#    - Computes seqeval metrics (precision, recall, f1, accuracy)
#    - Returns results as a dictionary

import evaluate
import numpy as np

# YOUR CODE HERE

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    """Compute entity-level seqeval metrics, skipping positions labelled -100."""
    logits, labels = eval_pred

    # Pick the highest-scoring label ID at every token position
    predictions = np.argmax(logits, axis=2)

    # Keep only positions with a real label, then map IDs back to tag names
    true_predictions = [
        [label_list[pred_id] for (pred_id, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]

    true_labels = [
        [label_list[label_id] for (pred_id, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


In [7]:
# TODO: Load the model and configure training
# Steps:
# 1. Import AutoModelForTokenClassification, TrainingArguments, Trainer, and DataCollatorForTokenClassification
# 2. Load the model using AutoModelForTokenClassification.from_pretrained() with:
#    - model_checkpoint
#    - num_labels based on label_list length
#    - id2label and label2id mappings
# 3. Create TrainingArguments with:
#    - output directory "arabert-ner"
#    - eval_strategy="epoch" (called evaluation_strategy in older transformers releases)
#    - learning_rate=2e-5
#    - batch_size=16 (both train and eval)
#    - num_train_epochs=3
#    - weight_decay=0.01
# 4. Create a DataCollatorForTokenClassification for dynamic padding
# 5. Initialize the Trainer with model, args, datasets, data_collator, tokenizer, and compute_metrics
# 6. Call trainer.train() to start training

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

# YOUR CODE HERE

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

args = TrainingArguments(
    output_dir="arabert-ner",
    eval_strategy="epoch",      
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    push_to_hub=False
)

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,  # newer transformers releases rename this argument to processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Starting training")
trainer.train()


Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training


Epoch  Training Loss  Validation Loss  Precision    Recall        F1  Accuracy
    1         No log         0.161721   0.82686   0.773304  0.799186  0.963092
    2       0.113500         0.144664   0.816157  0.817943  0.817049  0.96925
    3       0.113500         0.147613   0.818582  0.813567  0.816067  0.96893


TrainOutput(global_step=801, training_loss=0.08560325560647153, metrics={'train_runtime': 249.1531, 'train_samples_per_second': 51.318, 'train_steps_per_second': 3.215, 'total_flos': 592339052900340.0, 'train_loss': 0.08560325560647153, 'epoch': 3.0})
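
The step count and the `No log` entry in the results above can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of `logging_steps=500`, none of which the training cell overrides.

```python
import math

num_sentences = 4262   # training sentences after grouping
batch_size = 16        # per_device_train_batch_size
num_epochs = 3
logging_steps = 500    # TrainingArguments default, not set in the cell above

steps_per_epoch = math.ceil(num_sentences / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss is logged
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
first_logged_epoch = math.ceil(logged_at[0] / steps_per_epoch)

print(steps_per_epoch, total_steps, logged_at, first_logged_epoch)
# 267 801 [500] 2
```

With 267 steps per epoch, the only logging event (step 500) falls in epoch 2. That matches `global_step=801`, the `No log` in epoch 1, and the identical training loss reported for epochs 2 and 3. Setting a smaller `logging_steps` would give a per-epoch training loss.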

In [8]:
# TODO: Test the trained model with inference
# Steps:
# 1. Import pipeline from transformers
# 2. Create an NER pipeline using the trained model and tokenizer
# 3. Use aggregation_strategy="simple" to merge sub-tokens back into words
# 4. Test the pipeline with an Arabic text sample
# 5. Pretty print the results showing entity, label, and confidence score

import torch
from transformers import pipeline

# YOUR CODE HERE

ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,  # first GPU if present, otherwise CPU
)

text = "أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض."
print(f"Original Text: {text}\n")

results = ner_pipeline(text)

print("--- Extracted Entities ---")
for entity in results:
    print(f"Entity: {entity['word']:<12} | Label: {entity['entity_group']:<6} | Score: {entity['score']:.2f}")

Device set to use cuda:0


Original Text: أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض.

--- Extracted Entities ---
Entity: أبل          | Label: ORG    | Score: 0.97
Entity: تيم كوك      | Label: PERS   | Score: 0.99
Entity: الرياض       | Label: LOC    | Score: 0.98
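
The grouped output above comes from `aggregation_strategy="simple"`. As a rough mental model, the sketch below merges consecutive word-level tags of the same entity type and averages their scores. It is a simplified approximation, not the pipeline's code: the real pipeline works on sub-word tokens and rebuilds the text with the tokenizer. The function name, the per-word tags, and the scores are all invented for illustration.

```python
def group_entities(tagged_words):
    """Merge (word, tag, score) triples into entity groups, BIO-style."""
    groups, current = [], None
    for word, tag, score in tagged_words:
        prefix, _, etype = tag.partition("-")
        # An I- tag of the same type extends the open group
        if prefix == "I" and current is not None and current["entity_group"] == etype:
            current["words"].append(word)
            current["scores"].append(score)
            continue
        # Anything else closes the open group
        if current is not None:
            groups.append(current)
            current = None
        if prefix in ("B", "I"):
            current = {"entity_group": etype, "words": [word], "scores": [score]}
    if current is not None:
        groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Hypothetical word-level predictions (tags and scores are made up)
tagged = [
    ("لشركة", "O", 0.9),
    ("أبل", "B-ORG", 0.9),
    ("تيم", "B-PERS", 1.0),
    ("كوك", "I-PERS", 0.8),
    ("في", "O", 0.9),
    ("الرياض", "B-LOC", 0.7),
]
for group in group_entities(tagged):
    print(group["entity_group"], group["word"], round(group["score"], 2))
```

The `B-PERS` tag on the third word starts a new group even though the previous word is also an entity, which is how the two adjacent entities stay separate.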
