# Amharic E-commerce Data Labeling

This notebook demonstrates how to label Amharic e-commerce data for Named Entity Recognition (NER) tasks.

## Overview

In this notebook, we will:

1. Load the preprocessed data
2. Convert the data to CoNLL format for labeling
3. Provide instructions for manual annotation
4. Validate the labeled data
5. Generate statistics on the labeled entities
6. Convert the labeled data back to CSV for model fine-tuning


In [2]:
# Import required libraries
import os
import sys
import json
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import ipywidgets as widgets
from IPython.display import display, HTML

# Add the project root directory to the Python path
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the NERLabeler class from our custom module
from src.data.labeling_utils import NERLabeler


In [3]:
# Initialize the NERLabeler
labeler = NERLabeler()

# Define input and output directories
processed_data_dir = project_root / "data" / "processed"
labeled_data_dir = project_root / "data" / "labeled"

# Create output directory if it doesn't exist
os.makedirs(labeled_data_dir, exist_ok=True)

print(f"Processed data directory: {processed_data_dir}")
print(f"Labeled data directory: {labeled_data_dir}")


Processed data directory: D:\10-Academy\Week4\amharic-ecommerce-extractor\data\processed
Labeled data directory: D:\10-Academy\Week4\amharic-ecommerce-extractor\data\labeled


In [4]:
# Load the preprocessed data
def load_ner_ready_data():
    """
    Load the NER-ready data from the processed data directory.
    
    Returns:
        DataFrame with token and entity columns
    """
    ner_data_path = processed_data_dir / "ner_ready_data.csv"
    
    if not ner_data_path.exists():
        print(f"NER-ready data not found at {ner_data_path}")
        return None
    
    df = pd.read_csv(ner_data_path)
    print(f"Loaded {len(df)} tokens from {ner_data_path}")
    
    return df

# Load the NER-ready data
ner_data = load_ner_ready_data()

# Display the first few rows
if ner_data is not None:
    display(ner_data.head(10))
else:
    print("No data to display")


Loaded 58875 tokens from D:\10-Academy\Week4\amharic-ecommerce-extractor\data\processed\ner_ready_data.csv


| Unnamed: 0 | message_id | channel | token | entity |
|---|---|---|---|---|
| 0 | 188877.0 | @tikvahethmart | 📢 | O |
| 1 | 188877.0 | @tikvahethmart | ይህ | O |
| 2 | 188877.0 | @tikvahethmart | የቲክቫህ | O |
| 3 | 188877.0 | @tikvahethmart | ቢዝነስ | O |
| 4 | 188877.0 | @tikvahethmart | ቤተሰብ | O |
| 5 | 188877.0 | @tikvahethmart | ቤት | O |
| 6 | 188877.0 | @tikvahethmart | ነው | O |
| 7 | 188877.0 | @tikvahethmart | ፣ | O |
| 8 | 188877.0 | @tikvahethmart | ሰውን | O |
| 9 | 188877.0 | @tikvahethmart | እንዲሁም | O |


In [5]:
# Convert to CoNLL format for labeling
def convert_to_conll():
    """
    Convert the NER-ready data to CoNLL format for labeling.
    
    Returns:
        Path to the CoNLL file
    """
    if ner_data is None:
        print("No data to convert")
        return None
    
    # Define the path for the CoNLL file
    conll_path = labeled_data_dir / "unlabeled_data.conll"
    
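    # Convert to CoNLL format
    conll_text = labeler.csv_to_conll(ner_data, conll_path)
    
    # The labeler logs conversion errors instead of raising them, so confirm
    # that a non-empty file was actually written before reporting success
    if not conll_path.exists() or conll_path.stat().st_size == 0:
        print(f"Conversion failed; no CoNLL output found at {conll_path}")
        return None
    
    print(f"Converted data to CoNLL format and saved to {conll_path}")
    
    return conll_path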

# Convert to CoNLL format
conll_path = convert_to_conll()


2025-06-22 12:51:36,079 - src.data.labeling_utils - ERROR - Error converting CSV to CoNLL: argument of type 'method' is not iterable


Converted data to CoNLL format and saved to D:\10-Academy\Week4\amharic-ecommerce-extractor\data\labeled\unlabeled_data.conll
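
The conversion step is small enough to sketch with the standard library. This is a minimal sketch, not the implementation of `NERLabeler.csv_to_conll`: it assumes rows carrying the `message_id`, `token`, and `entity` fields shown in the table above, and it treats each message as one sentence. The helper name `rows_to_conll` and the sample rows are hypothetical.

```python
from itertools import groupby


def rows_to_conll(rows):
    """Render token rows as CoNLL text, one 'token label' pair per line.

    `rows` is an iterable of dicts with 'message_id', 'token', and 'entity'
    keys (an assumption based on the columns of the NER-ready CSV).
    Sentences are separated by a blank line.
    """
    sentences = []
    # groupby only merges consecutive rows, which matches the token order
    # of a CSV where each message's tokens are stored together
    for _, group in groupby(rows, key=lambda row: row["message_id"]):
        sentences.append("\n".join(f"{row['token']} {row['entity']}" for row in group))
    return "\n\n".join(sentences) + "\n"


# Illustrative rows only, not real project data
sample_rows = [
    {"message_id": 1, "token": "ቦሌ", "entity": "O"},
    {"message_id": 1, "token": "አካባቢ", "entity": "O"},
    {"message_id": 2, "token": "ነው", "entity": "O"},
]
sample_conll = rows_to_conll(sample_rows)
```

With the DataFrame loaded earlier, the rows could be supplied as `ner_data.to_dict("records")`.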


## Manual Labeling Instructions

To label the data for NER, follow these steps:

1. Open the CoNLL file (`unlabeled_data.conll` in the labeled data directory) in a text editor
2. For each token, replace the "O" label with the appropriate entity label:
   - B-Product: Beginning of a product entity
   - I-Product: Inside a product entity
   - B-PRICE: Beginning of a price entity
   - I-PRICE: Inside a price entity
   - B-LOC: Beginning of a location entity
   - I-LOC: Inside a location entity
   - B-DELIVERY_FEE: Beginning of a delivery fee entity
   - I-DELIVERY_FEE: Inside a delivery fee entity
   - B-CONTACT_INFO: Beginning of a contact info entity
   - I-CONTACT_INFO: Inside a contact info entity
   - O: Not an entity (Outside)

3. Save the labeled file as "labeled_data.conll" in the labeled data directory
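
The B-/I-/O scheme above implies two mechanical rules: every label must come from the listed set, and an `I-` label must continue a `B-` or `I-` label of the same entity type. The following is a minimal sketch of such a check for one sentence's labels. Whether `NERLabeler.validate_conll` applies exactly these rules is an assumption, and the names `VALID_LABELS` and `check_bio_sequence` are hypothetical.

```python
# Entity types taken from the labeling instructions
ENTITY_TYPES = ["Product", "PRICE", "LOC", "DELIVERY_FEE", "CONTACT_INFO"]
VALID_LABELS = {"O"} | {
    f"{prefix}-{entity_type}" for entity_type in ENTITY_TYPES for prefix in ("B", "I")
}


def check_bio_sequence(labels):
    """Return a list of error messages for one sentence's label sequence."""
    errors = []
    previous = "O"
    for position, label in enumerate(labels):
        if label not in VALID_LABELS:
            errors.append(f"position {position}: unknown label {label!r}")
            previous = "O"
            continue
        if label.startswith("I-"):
            entity_type = label[2:]
            # An I- label must continue a B- or I- label of the same type
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                errors.append(
                    f"position {position}: {label} does not follow "
                    f"B-{entity_type} or I-{entity_type}"
                )
        previous = label
    return errors
```

Labels are case-sensitive in this sketch, so `B-price` or `B-PRODUCT` would be reported as unknown.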

### Example:

```
ጥሩ O
ሱፐር B-Product
ማርኬት I-Product
ውስጥ O
ያሉ O
ሁሉም O
አይነት O
የህፃናት B-Product
ምግቦች I-Product
በ O
ዋጋ O
ቅናሽ O
ይቀርባል O
፡፡ O
የህፃናት B-Product
ወተት I-Product
በ O
250 B-PRICE
ብር I-PRICE
ብቻ O
፡፡ O
ቦሌ B-LOC
አካባቢ I-LOC
ነው O
፡፡ O
```
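
To see how the labels in the example group into entities, the sketch below parses CoNLL text and joins each `B-` label with the `I-` labels that follow it. It assumes every non-blank line holds a token and a label separated by whitespace. The function names `read_conll` and `extract_spans` are hypothetical and not part of `NERLabeler`.

```python
def read_conll(text):
    """Split CoNLL text into sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def extract_spans(sentence):
    """Return (entity_type, text) pairs for each labeled span in a sentence."""
    spans = []
    tokens, span_type = [], None
    for token, label in sentence:
        if label.startswith("I-") and span_type == label[2:]:
            tokens.append(token)
            continue
        # Any other label closes the span that is currently open
        if tokens:
            spans.append((span_type, " ".join(tokens)))
            tokens, span_type = [], None
        if label.startswith("B-"):
            tokens, span_type = [token], label[2:]
    if tokens:
        spans.append((span_type, " ".join(tokens)))
    return spans


# Excerpt from the example above
excerpt = "የህፃናት B-Product\nወተት I-Product\nበ O\n250 B-PRICE\nብር I-PRICE\nብቻ O\n"
spans = extract_spans(read_conll(excerpt)[0])
# spans == [("Product", "የህፃናት ወተት"), ("PRICE", "250 ብር")]
```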


In [6]:
# Validate the labeled data
def validate_labeled_data():
    """
    Validate the labeled data for common errors.
    
    Returns:
        Tuple of (is_valid, errors)
    """
    labeled_conll_path = labeled_data_dir / "labeled_data.conll"
    
    if not labeled_conll_path.exists():
        print(f"Labeled data not found at {labeled_conll_path}")
        return False, ["File not found"]
    
    # Validate the CoNLL file
    is_valid, errors = labeler.validate_conll(labeled_conll_path)
    
    if is_valid:
        print("Labeled data is valid!")
    else:
        print(f"Found {len(errors)} errors in the labeled data:")
        for error in errors:
            print(f"- {error}")
    
    return is_valid, errors

# This function should be run after manual labeling is complete
# validate_labeled_data()


In [7]:
# Generate statistics on labeled data
def generate_label_statistics():
    """
    Generate statistics on the labeled data.
    
    Returns:
        Dictionary with statistics
    """
    labeled_conll_path = labeled_data_dir / "labeled_data.conll"
    
    if not labeled_conll_path.exists():
        print(f"Labeled data not found at {labeled_conll_path}")
        return None
    
    # Generate statistics
    statistics = labeler.generate_statistics(labeled_conll_path)
    
    # Display statistics
    print(f"Total tokens: {statistics['total_tokens']}")
    print(f"Total sentences: {statistics['total_sentences']}")
    
    print("\nEntity token counts:")
    for entity, count in statistics['entity_token_counts'].items():
        if count > 0:
            print(f"- {entity}: {count}")
    
    print("\nEntity span counts:")
    for entity, count in statistics['entity_span_counts'].items():
        print(f"- {entity}: {count}")
    
    print("\nEntity examples:")
    for entity, examples in statistics['entity_examples'].items():
        if examples:
            print(f"\n{entity} examples:")
            for example in examples:
                print(f"- {example}")
    
    return statistics

# This function should be run after manual labeling is complete
# statistics = generate_label_statistics()
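
The token and span counts printed above can be computed in a single pass over the labels. The sketch below is an assumption about how such counts might be derived, not the code behind `NERLabeler.generate_statistics`. It counts every non-`O` label as an entity token and every `B-` label as the start of one span, and it keys the counts by entity type. The function name `label_statistics` is hypothetical.

```python
from collections import Counter


def label_statistics(sentences):
    """Count tokens and spans per entity type.

    `sentences` is a list of sentences, each a list of (token, label) pairs.
    """
    token_counts = Counter()
    span_counts = Counter()
    for sentence in sentences:
        for _, label in sentence:
            if label == "O":
                continue
            entity_type = label[2:]  # strip the 'B-' or 'I-' prefix
            token_counts[entity_type] += 1
            if label.startswith("B-"):
                # Each B- label opens exactly one span
                span_counts[entity_type] += 1
    return {
        "total_tokens": sum(len(sentence) for sentence in sentences),
        "total_sentences": len(sentences),
        "entity_token_counts": dict(token_counts),
        "entity_span_counts": dict(span_counts),
    }


# Illustrative input built from the labeling example
demo_sentences = [
    [("ሱፐር", "B-Product"), ("ማርኬት", "I-Product"), ("ውስጥ", "O")],
    [("250", "B-PRICE"), ("ብር", "I-PRICE")],
]
demo_statistics = label_statistics(demo_sentences)
```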


In [8]:
# Convert labeled data to CSV format
def convert_to_csv():
    """
    Convert the labeled CoNLL data back to CSV format.
    
    Returns:
        Path to the CSV file
    """
    labeled_conll_path = labeled_data_dir / "labeled_data.conll"
    
    if not labeled_conll_path.exists():
        print(f"Labeled data not found at {labeled_conll_path}")
        return None
    
    # Define the path for the CSV file
    csv_path = labeled_data_dir / "labeled_data.csv"
    
    # Convert to CSV format
    df = labeler.conll_to_csv(labeled_conll_path, csv_path)
    
    print(f"Converted labeled data to CSV format and saved to {csv_path}")
    print(f"Shape: {df.shape}")
    
    # Display the first few rows
    display(df.head(10))
    
    return csv_path

# This function should be run after manual labeling is complete
# csv_path = convert_to_csv()
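
Going back from CoNLL to CSV mostly means numbering the sentences and writing one row per token. The sketch below shows one way to do that with the standard `csv` module. The `sentence_id` column and the function names are hypothetical, and the columns actually produced by `NERLabeler.conll_to_csv` may differ.

```python
import csv
import io


def conll_to_rows(text):
    """Turn CoNLL text into a list of row dicts, one per token."""
    rows = []
    sentence_id = 0
    sentence_open = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Advance the sentence counter once per run of blank lines
            if sentence_open:
                sentence_id += 1
                sentence_open = False
            continue
        token, label = line.rsplit(maxsplit=1)
        rows.append({"sentence_id": sentence_id, "token": token, "entity": label})
        sentence_open = True
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["sentence_id", "token", "entity"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


demo_csv = rows_to_csv(conll_to_rows("ቦሌ B-LOC\nአካባቢ I-LOC\n\nነው O\n"))
```

Keeping a sentence identifier in the CSV makes it possible to regroup tokens into sentences when the data is later loaded for fine-tuning.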


## Summary and Next Steps

In this notebook, we have:

1. Loaded the preprocessed data from the previous step
2. Converted the data to CoNLL format for labeling
3. Provided instructions for manual annotation
4. Created functions to validate the labeled data
5. Implemented tools to generate statistics on the labeled entities
6. Defined a function to convert the labeled data to CSV for model fine-tuning

After completing the manual labeling:

1. Run the `validate_labeled_data()` function to check for errors
2. Run the `generate_label_statistics()` function to analyze the labeled entities
3. Run the `convert_to_csv()` function to prepare the data for the next step

In the next notebook, we will fine-tune a transformer model for Amharic NER using this labeled data.
