# Amharic E-commerce Data Collection

This notebook demonstrates how to collect Amharic e-commerce data from Telegram channels for Named Entity Recognition (NER) tasks.

## Overview

In this notebook, we will:

1. Set up the Telegram API client
2. Connect to Ethiopian e-commerce Telegram channels
3. Scrape messages containing product information
4. Save the collected data for further processing

## Requirements

Before running this notebook, make sure you have:

1. Telegram API credentials (API ID and API Hash)
2. A registered Telegram phone number
3. The required Python packages installed
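
The third requirement can be checked before running the import cell. The sketch below lists which modules cannot be found in the current environment. The module list is an assumption: `pandas` and `tqdm` are imported by this notebook, while `telethon` is only a guess at what the custom `TelegramScraper` class depends on.

```python
from importlib.util import find_spec

# Import names this notebook needs. "telethon" is an assumption about what
# the custom TelegramScraper wraps; adjust to your project's dependencies.
REQUIRED_MODULES = ["pandas", "tqdm", "telethon"]


def find_missing(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available")
```

If anything is reported missing, install it into the same environment the notebook kernel uses, then restart the kernel.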


In [1]:
# Import required libraries
import os
import sys
import json
import asyncio
from datetime import datetime
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm

# Add the project root directory to the Python path
# (assumes this notebook lives one level below the root, e.g. in notebooks/)
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import the TelegramScraper class from our custom module
from src.data.telegram_scraper import TelegramScraper

# Note: Credentials are set directly as variables in the next cell
# instead of being loaded with dotenv, for simplicity


ModuleNotFoundError: No module named 'pandas'

In [None]:
# Set up Telegram API credentials
# Replace these with your actual credentials
API_ID = "YOUR_API_ID"  # Get from https://my.telegram.org/apps
API_HASH = "YOUR_API_HASH"  # Get from https://my.telegram.org/apps
PHONE = "+1234567890"  # Your phone number with country code

# List of Ethiopian e-commerce Telegram channels to scrape
CHANNELS = [
    "@shageronlinestore",  # Example channel
    # Add more channels here
]

# Output directory for saving scraped data
OUTPUT_DIR = str(project_root / "data" / "raw")
os.makedirs(OUTPUT_DIR, exist_ok=True)
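
As an alternative to typing credentials into the cell above, they can be read from environment variables so they never end up in a saved notebook. This is a minimal sketch: the variable names `TELEGRAM_API_ID`, `TELEGRAM_API_HASH`, and `TELEGRAM_PHONE` and the helper `load_credentials` are hypothetical, and casting the API ID to `int` is an assumption to check against the `TelegramScraper` constructor.

```python
import os

# Hypothetical variable names; choose whatever fits your environment.
ENV_KEYS = ("TELEGRAM_API_ID", "TELEGRAM_API_HASH", "TELEGRAM_PHONE")


def load_credentials(env=None):
    """Return (api_id, api_hash, phone) from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    missing = [key for key in ENV_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Telegram issues the API ID as a number; whether TelegramScraper expects
    # an int or a string is an assumption to verify.
    return int(env["TELEGRAM_API_ID"]), env["TELEGRAM_API_HASH"], env["TELEGRAM_PHONE"]


# Example with an in-memory mapping instead of real secrets
api_id, api_hash, phone = load_credentials({
    "TELEGRAM_API_ID": "12345",
    "TELEGRAM_API_HASH": "0123456789abcdef",
    "TELEGRAM_PHONE": "+1234567890",
})
print(api_id, phone)
```

Calling `load_credentials()` with no argument reads from the real environment and raises a `RuntimeError` that names any variable that is not set.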


In [None]:
async def scrape_telegram_channels():
    """
    Main function to scrape data from Telegram channels.
    """
    # Initialize the TelegramScraper
    scraper = TelegramScraper(API_ID, API_HASH, PHONE)

    try:
        # Connect to Telegram API
        await scraper.connect()

        # Scrape messages from multiple channels
        await scraper.scrape_multiple_channels(
            CHANNELS,
            limit_per_channel=200,  # Adjust as needed
            output_dir=OUTPUT_DIR,
            output_format="json"
        )

        print(f"Successfully scraped data from {len(CHANNELS)} channels")

    except Exception as e:
        print(f"Error scraping Telegram channels: {e}")

    finally:
        # Always disconnect, even if connecting or scraping failed
        await scraper.disconnect()

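# Run the scraping function
# Note: This will prompt for an authentication code when first connecting.
# Jupyter already runs an event loop, so use top-level await in a notebook:
# await scrape_telegram_channels()
# In a standalone script, use asyncio.run(scrape_telegram_channels()) instead.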
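
How the coroutine has to be started depends on whether an event loop is already running: `asyncio.run()` raises a `RuntimeError` when called from inside one, which is the case in a Jupyter cell. The sketch below shows one way to detect this. `has_running_loop` and `demo_task` are hypothetical names, and `demo_task` merely stands in for `scrape_telegram_channels()`.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def demo_task():
    # Stand-in for scrape_telegram_channels(); does no network I/O
    await asyncio.sleep(0)
    return "done"


if has_running_loop():
    # Notebook case: run `await demo_task()` directly in a cell
    print("Event loop detected; use top-level await")
else:
    # Script case: asyncio.run() creates and closes its own loop
    print(asyncio.run(demo_task()))
```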


In [None]:
# Function to load and display the scraped data
def load_and_display_data(file_path=None):
    """
    Load and display the scraped data.
    
    Args:
        file_path: Path to the JSON file with scraped data
    """
    if file_path is None:
        # Find the most recent file
        json_files = list(Path(OUTPUT_DIR).glob("all_messages_*.json"))
        if not json_files:
            print("No data files found")
            return None
        
        # Use the most recently modified file
        file_path = max(json_files, key=lambda p: p.stat().st_mtime)
    
    # Load the data
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    # Convert to DataFrame for easier analysis
    df = pd.DataFrame(data)
    
    # Display basic statistics
    print(f"Loaded {len(df)} messages from {file_path}")
    print(f"Channels: {df['channel'].nunique()}")
    print(f"Date range: {df['date'].min()} to {df['date'].max()}")
    print(f"Messages with media: {df['has_media'].sum()}")
    
    # Display sample messages
    print("\nSample messages:")
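    for _, row in df.sample(min(3, len(df))).iterrows():
        print(f"\n--- Message from {row['channel']} ---")
        # Media-only posts may have no text, so fall back to an empty string
        text = row['text'] if isinstance(row['text'], str) else ""
        print(text[:200] + "..." if len(text) > 200 else text)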
    
    return df

# Uncomment to load and display data after scraping
# df = load_and_display_data()
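
The sample-message preview above truncates long texts. The same logic can be factored into a small helper that also tolerates messages with no text, such as media-only posts. This is a sketch with a hypothetical helper name, `preview`; the sample string is made up for illustration.

```python
def preview(text, limit=200):
    """Return text cut to `limit` characters, with "..." appended if it was cut."""
    if not isinstance(text, str):
        # Missing text (None or NaN) is shown as an empty string
        return ""
    return text[:limit] + "..." if len(text) > limit else text


print(preview("ዋጋ 1500 ብር", limit=5))
print(repr(preview(None)))
```

Python slices strings by code point, so each Ethiopic letter in an Amharic message counts as one character and is never split in the middle.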


In [None]:
# Analyze message statistics
def analyze_message_statistics(df):
    """
    Analyze message statistics from the scraped data.
    
    Args:
        df: DataFrame with scraped messages
    """
    if df is None or len(df) == 0:
        print("No data to analyze")
        return
    
    # Convert date to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    # Messages per channel
    channel_counts = df['channel'].value_counts()
    print("Messages per channel:")
    print(channel_counts)
    
    # Messages per day
    df['day'] = df['date'].dt.date
    messages_per_day = df.groupby('day').size()
    print("\nMessages per day:")
    print(messages_per_day)
    
    # Average message length
    df['text_length'] = df['text'].str.len()
    avg_length = df['text_length'].mean()
    print(f"\nAverage message length: {avg_length:.2f} characters")
    
    # Breakdown of messages by media type
    media_counts = df['media_type'].value_counts()
    print("\nMedia types:")
    print(media_counts)
    
    # Top 10 most viewed messages
    if 'views' in df.columns:
        top_views = df.sort_values('views', ascending=False).head(10)[['channel', 'views', 'text']]
        print("\nTop 10 most viewed messages:")
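        for _, row in top_views.iterrows():
            print(f"\n--- {row['channel']} ({row['views']} views) ---")
            # Media-only posts may have no text, so fall back to an empty string
            text = row['text'] if isinstance(row['text'], str) else ""
            print(text[:100] + "..." if len(text) > 100 else text)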

# Uncomment to analyze message statistics after loading data
# analyze_message_statistics(df)


In [None]:
# Save the data for the next step in the pipeline
def save_for_preprocessing(df, output_file=None):
    """
    Save the data for the preprocessing step.
    
    Args:
        df: DataFrame with scraped messages
        output_file: Path to save the data
    """
    if df is None or len(df) == 0:
        print("No data to save")
        return
    
    if output_file is None:
        # Create a timestamped filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_file = str(Path(OUTPUT_DIR) / f"telegram_data_{timestamp}.csv")
    
    # Save to CSV
    df.to_csv(output_file, index=False, encoding="utf-8")
    print(f"Data saved to {output_file}")

# Uncomment to save data for preprocessing
# save_for_preprocessing(df)
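
Because the messages contain Amharic text, it is worth confirming that the CSV round-trips without corrupting characters. The sketch below does this with the standard-library `csv` module rather than pandas, using a made-up sample row and a temporary directory instead of the project's data folder.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample row; the channel name is a placeholder
rows = [{"channel": "@example_channel", "text": "ዋጋ 1500 ብር"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"

    # newline="" is what the csv module recommends for file objects
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["channel", "text"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back with the same encoding
    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded == rows)
```

If the file will be opened in a spreadsheet application that does not detect UTF-8 on its own, writing with `encoding="utf-8-sig"` adds a byte order mark that helps; `DataFrame.to_csv` accepts the same encoding name.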


## Summary and Next Steps

In this notebook, we have:

1. Set up a Telegram API client to connect to Ethiopian e-commerce channels
2. Created a function to scrape messages from these channels
3. Implemented functions to analyze the collected data
4. Prepared the data for the next step in our pipeline

To use this notebook:

1. Replace the API credentials with your own
2. Add the Telegram channels you want to scrape to the CHANNELS list
3. Uncomment the call to scrape_telegram_channels() and run that cell
4. Load and analyze the scraped data
5. Save the data for preprocessing

In the next notebook, we will preprocess this data to prepare it for NER labeling.
