# Amharic E-commerce Data Preprocessing

This notebook demonstrates how to preprocess Amharic e-commerce data for Named Entity Recognition (NER) tasks.

## Overview

In this notebook, we will:

1. Load the raw data collected from Telegram channels
2. Clean, normalize, and tokenize the Amharic text
3. Extract potential entities using pattern matching
4. Prepare the data for NER labeling
5. Save the processed data for the next step
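
As a rough illustration of the cleaning and normalization step, the sketch below folds Amharic homophone characters onto one canonical series and collapses whitespace. It is an assumption about what such normalization could look like, not the actual behavior of `AmharicPreprocessor`, whose internals are not shown in this notebook. The function name `normalize_amharic` and the choice of homophone families are hypothetical.

```python
import re
import unicodedata

# Assumed homophone families: (variant base, canonical base), each given as the
# code point of the first-order character in the Unicode Ethiopic block.
_HOMOPHONE_BASES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]

# Map the seven basic vowel orders of each variant onto its canonical series.
_HOMOPHONE_TABLE = {
    variant + order: canonical + order
    for variant, canonical in _HOMOPHONE_BASES
    for order in range(7)
}


def normalize_amharic(text: str) -> str:
    """Apply Unicode NFC, fold homophone characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_HOMOPHONE_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_amharic("ፀሐይ   ሠላም"))
```

Folding homophones reduces spelling variation for the same word, which can help a downstream NER model, at the cost of discarding the original orthography.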


In [5]:
from pathlib import Path
import sys

# Resolve the project root as two levels above the notebook's working directory
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))

# With the project root on sys.path, the custom module can be imported
from src.data.preprocessor import AmharicPreprocessor

# Initialize the preprocessor
preprocessor = AmharicPreprocessor()


In [2]:
# Import required libraries
import os
import sys
import json
import re
from datetime import datetime
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
# Add the project root directory to the Python path
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the AmharicPreprocessor class from our custom module
from src.data.preprocessor import AmharicPreprocessor


In [3]:
# Initialize the AmharicPreprocessor
preprocessor = AmharicPreprocessor()

# Define input and output directories
raw_data_dir = project_root / "data" / "raw"
processed_data_dir = project_root / "data" / "processed"

# Create output directory if it doesn't exist
os.makedirs(processed_data_dir, exist_ok=True)

print(f"Raw data directory: {raw_data_dir}")
print(f"Processed data directory: {processed_data_dir}")


Raw data directory: D:\10-Academy\Week4\amharic-ecommerce-extractor\data\raw
Processed data directory: D:\10-Academy\Week4\amharic-ecommerce-extractor\data\processed


In [4]:
# Function to load data from raw files
def load_raw_data(file_path=None):
    """
    Load raw data from JSON or CSV file.
    
    Args:
        file_path: Path to the data file (optional)
        
    Returns:
        List of dictionaries containing message data
    """
    if file_path is None:
        # Assumes file names end in a sortable timestamp
        # (e.g. all_messages_20250622_102341.json), so the lexicographically
        # largest name is the most recent file; JSON is preferred over CSV
        json_files = list(raw_data_dir.glob("all_messages_*.json"))
        csv_files = list(raw_data_dir.glob("telegram_data_*.csv"))
        
        if json_files:
            # Use the most recent JSON file
            file_path = max(json_files)
        elif csv_files:
            # Use the most recent CSV file
            file_path = max(csv_files)
        else:
            print("No data files found in the raw data directory")
            return []
    
    print(f"Loading data from {file_path}")
    data = preprocessor.load_data(file_path)
    print(f"Loaded {len(data)} messages")
    
    return data

# Load raw data
raw_data = load_raw_data()


Loading data from D:\10-Academy\Week4\amharic-ecommerce-extractor\data\raw\all_messages_20250622_102341.json
Loaded 1045 messages
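
The loading step delegates to `preprocessor.load_data`, whose implementation is not shown here. A minimal sketch of what such a loader might do is given below, assuming the JSON file holds a top-level list of message objects and the CSV file has a header row. The name `load_messages` is hypothetical and only the standard library is used.

```python
import csv
import json
from pathlib import Path


def load_messages(file_path):
    """Load messages from a .json or .csv file into a list of dictionaries."""
    path = Path(file_path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: the JSON file holds a top-level list of message objects.
        if not isinstance(data, list):
            raise ValueError(f"Expected a list of messages in {path}")
        return data
    if suffix == ".csv":
        # utf-8-sig tolerates a byte-order mark written by some spreadsheet tools.
        with path.open(encoding="utf-8-sig", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported file type: {path.suffix}")
```

Note that `csv.DictReader` returns every field as a string, so numeric columns such as `views` would need explicit conversion after loading from CSV.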


In [12]:
# Examine a sample of the raw data
if raw_data:
    # Convert to DataFrame for easier examination
    df_raw = pd.DataFrame(raw_data)
    
    # Display basic information
    print(f"Number of messages: {len(df_raw)}")
    print(f"Columns: {df_raw.columns.tolist()}")
    
    # Display a sample message
    print("\nSample message:")
    sample = df_raw.sample(1).iloc[0]
    for key, value in sample.items():
        if key == 'text' and isinstance(value, str) and len(value) > 200:
            # Truncate long text (non-string values such as NaN are printed as-is)
            print(f"{key}: {value[:200]}...")
        else:
            print(f"{key}: {value}")
else:
    print("No data to examine")


Number of messages: 1045
Columns: ['id', 'date', 'text', 'views', 'channel', 'has_media', 'media_type']

Sample message:
id: 4358
date: 2024-11-23T18:01:42+00:00
text: ፍንትው ላለ የልብስ ንፃት የሚገለግል የልብስ ሳሙና ለማዘዝ 0974312223 ይደውሉ  ወይም https://t.me/helloo_market_bot?start=131210010 ይጠቀሙ! 
#Madeinethiopia #Detergents #CleaningSplices #Ethiopian #Marketplace #BuyEthiopia
views: 3152.0
channel: @helloomarketethiopia
has_media: True
media_type: photo
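
The sample message mixes Amharic text with a phone number, a URL, and hashtags. The sketch below shows one way such fields could be pulled out with regular expressions. The patterns are assumptions (for example, that mobile numbers start with `09` or `07`, optionally in the `+251` international form), and `extract_contact_fields` is a hypothetical helper, not part of `AmharicPreprocessor`.

```python
import re

# Assumed Ethiopian mobile formats: 09XXXXXXXX / 07XXXXXXXX or +2519XXXXXXXX.
# The lookarounds keep the pattern from matching inside longer digit runs.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?251|0)[79]\d{8}(?!\d)")
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_contact_fields(text):
    """Return phone numbers, URLs, and hashtags found in a message."""
    return {
        "phones": PHONE_RE.findall(text),
        "urls": URL_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }


message = (
    "ለማዘዝ 0974312223 ይደውሉ ወይም "
    "https://t.me/helloo_market_bot?start=131210010 ይጠቀሙ! "
    "#Madeinethiopia #Detergents"
)
print(extract_contact_fields(message))
```

Separating these fields early makes it easier to exclude them from the tokens that are later labeled as product, price, or location entities.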


In [13]:
# Process the raw data
if raw_data:
    # Process the data using our AmharicPreprocessor
    processed_data = preprocessor.process_data(
        raw_data,
        output_dir=str(processed_data_dir),
        output_filename="processed_messages"
    )
    
    print(f"Processed {len(processed_data)} messages")
    
    # Convert to DataFrame for examination
    df_processed = pd.DataFrame(processed_data)
    
    # Display a sample processed message
    print("\nSample processed message:")
    sample = df_processed.sample(1).iloc[0]
    
    def preview(value, limit):
        """Return value as text, truncated with an ellipsis when longer than limit."""
        return f"{value[:limit]}..." if len(value) > limit else f"{value}"
    
    print(f"Original text: {preview(sample['text'], 100)}")
    print(f"Normalized text: {preview(sample['normalized_text'], 100)}")
    print(f"Tokens: {preview(sample['tokens'], 20)}")
    
    if 'potential_entities' in sample:
        print("\nPotential entities:")
        for entity_type, entities in sample['potential_entities'].items():
            print(f"{entity_type}: {entities}")
else:
    print("No data to process")


Processing messages: 100%|██████████| 1045/1045 [00:00<00:00, 1552.70it/s]
2025-06-22 10:41:27,285 - src.data.preprocessor - INFO - Saved 1045 processed messages to D:\10-Academy\Week4\amharic-ecommerce-extractor\data\processed\processed_messages.json


Processed 1045 messages

Sample processed message:
Original text: **ቴሌግራም****⭐️**** **https://t.me/nejashionlinemarketing**
     " መለያችን ታማኝ መሆናችን ጥሩ እቃ ማቅረባችን 🛍 ነጃሺ ...
Normalized text: **ቴሌግራም****⭐️**** **https://t.me/nejashionlinemarketing** " መለያችን ታማኝ መሆናችን ጥሩ እቃ ማቅረባችን 🛍 ነጃሺ onlin...
Tokens: ['**ቴሌግራም****⭐️***', '*', '**https://t.me/nejashionlinemarketing*', '*', '"', 'መለያችን', 'ታማኝ', 'መሆናችን', 'ጥሩ', 'እቃ', 'ማቅረባችን', '🛍', 'ነጃሺ', 'online', 'SHOP', '"*', '*', '**KUMTEL', '70', 'ሊትር']...

Potential entities:
prices: ['15,500']
locations: []
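
The token list above still carries Telegram emphasis markers such as `**`, which end up glued to words or emitted as standalone `*` tokens. One possible remedy, sketched here as an assumption rather than as the preprocessor's actual logic, is to strip those markers before whitespace tokenization. `strip_markup` and `tokenize` are hypothetical names.

```python
import re

# Runs of asterisks, double underscores, and backticks are treated as markup.
# Single underscores are left alone because they occur in URLs and usernames.
EMPHASIS_RE = re.compile(r"\*+|_{2,}|`+")


def strip_markup(text):
    """Replace emphasis markers with spaces and collapse whitespace."""
    text = EMPHASIS_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split markup-free text on whitespace."""
    return strip_markup(text).split()


print(tokenize("**ቴሌግራም** **https://t.me/nejashionlinemarketing** **KUMTEL 70 ሊትር"))
```

Removing the markers before tokenization keeps the token count per message lower and avoids labeling noise tokens during annotation.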


In [14]:
# Prepare data for NER labeling
if 'processed_data' in locals() and processed_data:
    # Prepare data for NER using our AmharicPreprocessor
    ner_data = preprocessor.prepare_for_ner(
        processed_data,
        output_dir=str(processed_data_dir),
        output_filename="ner_ready_data"
    )
    
    # Display the first few rows of the NER-ready data
    print("NER-ready data sample:")
    display(ner_data.head(10))
    
    # Display statistics
    print(f"\nTotal tokens: {len(ner_data)}")
    print(f"Messages: {ner_data['message_id'].nunique()}")
    print(f"Channels: {ner_data['channel'].nunique()}")
else:
    print("No processed data available for NER preparation")


2025-06-22 10:41:32,953 - src.data.preprocessor - INFO - Saved NER-ready data to D:\10-Academy\Week4\amharic-ecommerce-extractor\data\processed\ner_ready_data.csv


NER-ready data sample:


|   | message_id | channel | token | entity |
|---|---|---|---|---|
| 0 | 188877 | @tikvahethmart | 📢 | O |
| 1 | 188877 | @tikvahethmart | ይህ | O |
| 2 | 188877 | @tikvahethmart | የቲክቫህ | O |
| 3 | 188877 | @tikvahethmart | ቢዝነስ | O |
| 4 | 188877 | @tikvahethmart | ቤተሰብ | O |
| 5 | 188877 | @tikvahethmart | ቤት | O |
| 6 | 188877 | @tikvahethmart | ነው | O |
| 7 | 188877 | @tikvahethmart | ፣ | O |
| 8 | 188877 | @tikvahethmart | ሰውን | O |
| 9 | 188877 | @tikvahethmart | እንዲሁም | O |



Total tokens: 58875
Messages: 1046
Channels: 9
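
Many NER labeling tools and training scripts expect a CoNLL-style text file: one token and its label per line, with a blank line between messages. Whether the next notebook uses that format is an assumption. The sketch below converts rows shaped like the table above (`message_id`, `channel`, `token`, `entity`) into such text, using the hypothetical helper `rows_to_conll`.

```python
from itertools import groupby


def rows_to_conll(rows):
    """Render token rows as CoNLL-style text: one 'token label' pair per line,
    with a blank line between messages."""
    blocks = []
    # groupby only merges adjacent rows, so rows must already be in message order.
    for _, group in groupby(rows, key=lambda r: (r["channel"], r["message_id"])):
        blocks.append("\n".join(f"{r['token']} {r['entity']}" for r in group))
    return "\n\n".join(blocks) + "\n" if blocks else ""


rows = [
    {"message_id": 188877, "channel": "@tikvahethmart", "token": "ይህ", "entity": "O"},
    {"message_id": 188877, "channel": "@tikvahethmart", "token": "ነው", "entity": "O"},
    {"message_id": 4358, "channel": "@helloomarketethiopia", "token": "ሳሙና", "entity": "O"},
]
print(rows_to_conll(rows))
```

With a pandas DataFrame such as `ner_data`, the rows could be obtained with `ner_data.to_dict("records")`. Grouping on both channel and message ID avoids merging messages from different channels that happen to share an ID.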


In [15]:
# Helper function to visualize entity patterns
def visualize_entity_patterns(df=None):
    """
    Summarize common patterns for potential entities in the data.
    
    Args:
        df: DataFrame of processed messages (defaults to the notebook-level df_processed)
    """
    # locals() inside a function only sees the function's own variables,
    # so the notebook-level DataFrame has to be looked up in globals()
    if df is None:
        df = globals().get('df_processed')
    if df is None or len(df) == 0:
        print("No processed data available")
        return
    
    # Extract price patterns (re.IGNORECASE covers Birr/birr and ETB/etb)
    price_pattern = r'(\d+(?:,\d+)*(?:\.\d+)?)\s*(?:ብር|ETB|Birr)'
    prices = []
    
    for text in df['normalized_text']:
        # Skip missing or non-string values
        if isinstance(text, str):
            prices.extend(re.findall(price_pattern, text, re.IGNORECASE))
    
    print(f"Found {len(prices)} price mentions")
    print(f"Sample prices: {prices[:10]}")
    
    # Count location mentions
    locations = []
    if 'potential_entities' in df.columns:
        for entities in df['potential_entities']:
            if isinstance(entities, dict):
                locations.extend(entities.get('locations', []))
    
    location_counts = pd.Series(locations, dtype="object").value_counts()
    print("\nLocation mentions:")
    print(location_counts.head(10))

# Visualize entity patterns
visualize_entity_patterns()


## Summary and Next Steps

In this notebook, we have:

1. Loaded raw data collected from Telegram channels
2. Cleaned and normalized the Amharic text
3. Tokenized the text for NER tasks
4. Extracted potential entities using pattern matching
5. Prepared the data in a format suitable for NER labeling

The preprocessed data has been saved to:
- `processed_messages.json`: Contains the full processed messages with normalized text and potential entities
- `ner_ready_data.csv`: Contains tokenized data ready for NER labeling

In the next notebook, we will manually label a subset of this data for NER training.
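
One way to choose the subset for manual labeling is a reproducible random sample of message IDs. The sketch below is only an illustration: the helper name `sample_message_ids`, the sample size, and the seed are assumptions, not values used by this project.

```python
import random


def sample_message_ids(message_ids, k=50, seed=42):
    """Return a sorted, reproducible random sample of up to k unique message IDs."""
    unique_ids = sorted(set(message_ids))
    k = min(k, len(unique_ids))
    # A dedicated Random instance keeps the sample independent of global state.
    return sorted(random.Random(seed).sample(unique_ids, k))


subset = sample_message_ids([188877, 188877, 4358, 4359, 4360], k=3)
print(subset)
```

Fixing the seed means the same messages are selected on every run, so labels stay aligned with the data if the notebook is re-executed. If message IDs can repeat across channels, sampling `(channel, message_id)` pairs instead would avoid ambiguity.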
