
## Task 1: Data Ingestion and Data Preprocessing
Set up a data ingestion system to fetch messages from multiple Ethiopian-based Telegram e-commerce channels. Prepare the raw data (text, images) for entity extraction.
List of channels
Select at least 5 channels to fetch data from. You may share data with each other, since fine-tuning needs more data.
Steps:
Identify and connect to relevant Telegram channels using a custom scraper.
Implement a message ingestion system to collect text, images, and documents as they are posted in real time.
Preprocess text data by tokenizing, normalizing, and handling Amharic-specific linguistic features.
Clean and structure the data into a unified format, separating metadata (e.g., sender, timestamp) from message content.
Store preprocessed data in a structured format for further analysis
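
The Amharic-specific preprocessing step listed above could be sketched as follows. This is a minimal illustration, not a prescribed solution: the names `normalize_amharic` and `tokenize_amharic` are hypothetical, and merging homophone letter series (ሐ/ኀ into ሀ, ሠ into ሰ, ዐ into አ, ፀ into ጸ) is a common convention assumed here, not something the task requires.

```python
import re
import unicodedata

# Assumed homophone merges: (source series start, target series start) code points.
# Each Ethiopic series lists its seven vowel orders consecutively.
_HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]
_HOMOPHONE_MAP = {
    src + order: dst + order
    for src, dst in _HOMOPHONE_SERIES
    for order in range(7)
}

# Ethiopic punctuation U+1361..U+1368 (e.g. ። ፣ ፤ ፦) becomes its own token.
_TOKEN_RE = re.compile(r"[\u1361-\u1368]|[^\s\u1361-\u1368]+")


def normalize_amharic(text: str) -> str:
    """Apply NFC, merge homophone letters, and collapse runs of spaces and tabs."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_HOMOPHONE_MAP)
    # Newlines are kept because later steps treat the first line specially.
    return re.sub(r"[ \t]+", " ", text).strip()


def tokenize_amharic(text: str) -> list[str]:
    """Split on whitespace and separate Ethiopic punctuation marks."""
    return _TOKEN_RE.findall(text)
```

Whether to merge homophones depends on the downstream model: it reduces spelling variation in the training data, but it also discards distinctions that some tokenizers may have learned.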


In [29]:
import pandas as pd
import re
from clean import delete_null


In [1]:
id=28083480
hash='075e74a23e6e74708690f1f11bf0b266'
phone_number='0919180119'


In [3]:
from telethon import TelegramClient
import csv
import os
import asyncio
from dotenv import load_dotenv
from sqlite3 import OperationalError  # Telethon's default session is an SQLite file, so "database is locked" comes from sqlite3

# Load environment variables
load_dotenv('.env')
api_id = id
api_hash = hash

# Function to scrape data from a single channel
async def scrape_channel(client, channel_username, writer, media_dir):
    entity = await client.get_entity(channel_username)
    channel_title = entity.title
    async for message in client.iter_messages(entity, limit=100):
        media_path = None
        if message.media and hasattr(message.media, 'photo'):
            filename = f"{channel_username}_{message.id}.jpg"
            media_path = os.path.join(media_dir, filename)
            await client.download_media(message.media, media_path)
        writer.writerow([channel_title, channel_username, message.id, message.message, message.date, media_path])

async def main():
    async with TelegramClient('scraping_session', api_id, api_hash, timeout=10) as client:
        media_dir = 'photos'
        os.makedirs(media_dir, exist_ok=True)

        with open('telegram_data.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Channel Title', 'Channel Username', 'ID', 'Message', 'Date', 'Media Path'])
            channels = ['@Shageronlinestore', '@ZemenExpress','@AwasMart','@leyueqa','@sinayelj']
            for channel in channels:
                success = False
                while not success:
                    try:
                        await scrape_channel(client, channel, writer, media_dir)
                        success = True
                        print(f"Scraped data from {channel}")
                    except OperationalError as e:
                        print(f"Database is locked. Retrying... {e}")
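                        await asyncio.sleep(1)  # Wait before retrying

# Run the scraper. Notebooks support top-level await; in a plain script use asyncio.run(main()).
await main()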
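
The `while not success` loop retries forever if the error never clears. A bounded retry with exponential backoff is one alternative. The helper below is a sketch only: `retry_async` and its parameters are hypothetical names, and the default `retry_on` assumes that the locked-session error surfaces as `sqlite3.OperationalError`.

```python
import asyncio
import sqlite3


async def retry_async(make_call, max_attempts=5, base_delay=1.0,
                      retry_on=(sqlite3.OperationalError,)):
    """Await make_call() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # Give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # base, 2x base, 4x base, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

Inside `main()`, the body of the channel loop could then become `await retry_async(lambda: scrape_channel(client, channel, writer, media_dir))`. Note that a retry after a partial scrape would write the already-fetched rows again, so deduplicating on channel and message ID afterwards may be needed.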



In [16]:
df = pd.read_excel(r"C:\Users\ASUS VIVO\Downloads\new.xlsx")

In [17]:
df['Channel Title'].unique()

array(['Sheger online-store', 'Zemen Express', 'Sinayelij', 'Leyueqa',
       'AwasMart'], dtype=object)

In [24]:
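# .copy() makes df an independent DataFrame, which avoids SettingWithCopyWarning on later assignments
df = df.dropna(subset=['Message']).copy()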

# Print the shape of the dataset after dropping NaN values in the "Message" column
print(f"Dataset shape after dropping NaN values in 'Message' column: {df.shape}")

Dataset shape after dropping NaN values in 'Message' column: (3257, 6)


In [27]:
df['Message'].isnull().sum()

0

Remove emojis

In [30]:

def remove_emojis(text):
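    emoji_pattern = re.compile(
        "[" 
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        # Keep the remaining ranges narrow: one span from U+24C2 to U+1F251 would
        # also cover CJK, Hangul, and Ethiopic Extended letters.
        "\U000024C2"             # circled Latin capital letter M
        "\U00002600-\U000026FF"  # Miscellaneous Symbols
        "\U00002B00-\U00002BFF"  # Miscellaneous Symbols and Arrows
        "\U0000FE00-\U0000FE0F"  # variation selectors
        "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
        "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement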
        "]+", 
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)


df['Message'] = df['Message'].apply(remove_emojis)


print(df.head())


         Channel Title    Channel Username    ID  \
2  Sheger online-store  @Shageronlinestore  3791   
4  Sheger online-store  @Shageronlinestore  1100   
5  Sheger online-store  @Shageronlinestore  1037   
7  Sheger online-store  @Shageronlinestore   672   
8  Sheger online-store  @Shageronlinestore  4220   

                                             Message  \
2   Yoga mat\n\nSize  61cm*173cm*4mm\n\nዋጋ፦     7...   
4   Marado Electric Kettle \n\n 1.7L Capacity \n ...   
5   3.6L Glass dispenser jar with Bamboo stand\n\...   
7  detangling Curling brush\n    \nዋጋ- 300ብር\n\n ...   
8                      How to set up detagling brush   

                        Date                          Media Path  
2  2023-03-18 07:17:30+00:00  photos/@Shageronlinestore_1319.jpg  
4  2024-05-15 17:00:19+00:00  photos/@Shageronlinestore_4155.jpg  
5  2024-05-21 12:26:09+00:00  photos/@Shageronlinestore_4219.jpg  
7  2024-06-26 15:20:48+00:00                                 NaN  
8  2022-12-02 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Message'] = df['Message'].apply(remove_emojis)
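
Emoji filters built from code point ranges are easy to make too broad. A wide catch-all range such as `\U000024C2-\U0001F251`, which appears in many emoji-stripping snippets, also spans the Ethiopic Extended blocks. The check below is a small sketch for spotting that; `letters_removed_by` is a hypothetical helper, and the two patterns are illustrative stand-ins, not the exact pattern used in this notebook.

```python
import re
import unicodedata


def letters_removed_by(pattern: re.Pattern, sample: str) -> list[str]:
    """Return the letters in sample that the pattern would delete."""
    return [
        ch for ch in sample
        if unicodedata.category(ch).startswith("L") and pattern.fullmatch(ch)
    ]


broad = re.compile("[\U000024C2-\U0001F251]+")
narrow = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]+")

# U+2D80 is a letter from the Ethiopic Extended block; U+1F600 is an emoji.
sample = "ዋጋ 300ብር \u2D80 \U0001F600"

print(letters_removed_by(broad, sample))   # the Ethiopic Extended letter is caught
print(letters_removed_by(narrow, sample))  # no letters are caught
```

Running a check like this over a sample of real messages before applying the filter to the whole column shows whether any Amharic text would be lost.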


In [38]:
df = df.drop(columns='Media Path')

Label the messages

In [43]:


def label_message_utf8_with_birr(message):
    # Split the message at the first occurrence of '\n'
    if '\n' in message:
        first_line, remaining_message = message.split('\n', 1)
    else:
        first_line, remaining_message = message, ""
    
    labeled_tokens = []
    
    # Tokenize the first line
    first_line_tokens = re.findall(r'\S+', first_line)
    
    # Label the first token as B-PRODUCT and the rest as I-PRODUCT
    if first_line_tokens:
        labeled_tokens.append(f"{first_line_tokens[0]} B-PRODUCT")  # First token as B-PRODUCT
        for token in first_line_tokens[1:]:
            labeled_tokens.append(f"{token} I-PRODUCT")  # Remaining tokens as I-PRODUCT
    
    # Process the remaining message normally
    if remaining_message:
        lines = remaining_message.split('\n')
        for line in lines:
            tokens = re.findall(r'\S+', line)  # Tokenize each line while considering non-ASCII characters
            
            for token in tokens:
                # Long digit-only tokens (10+ digits) are phone numbers, not prices
                if re.match(r'^\d{10,}$', token):
                    labeled_tokens.append(f"{token} O")  # Label as O (outside any entity)
                # Check if token is a price (e.g., 500, ETB, $100, ዋጋ, or ብር)
                elif re.match(r'^\d+(\.\d{1,2})?$', token) or 'ETB' in token or 'ዋጋ' in token or '$' in token or 'ብር' in token:
                    labeled_tokens.append(f"{token} I-PRICE")
                # Check if token is a location keyword; tokens never contain spaces,
                # so multi-word names are listed word by word
                elif any(loc in token for loc in ['Addis', 'Ababa', 'ለቡ', 'መዳህኒዓለም', 'መገናኛ', 'ቦሌ', 'ሜክሲኮ']):
                    labeled_tokens.append(f"{token} I-LOC")
                # Everything else is outside any entity
                else:
                    labeled_tokens.append(f"{token} O")
    
    return "\n".join(labeled_tokens)

# Apply the updated function to the non-null messages
df['Labeled_Message'] = df['Message'].apply(label_message_utf8_with_birr)
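
The rule-based labeler tags price and location tokens as `I-PRICE` and `I-LOC` without a preceding `B-` tag, which strict BIO tooling may reject. One possible post-processing step is sketched below. `enforce_bio` is a hypothetical helper, not part of the original pipeline, and it assumes each line has the form `token label` with the label after the last space.

```python
def enforce_bio(labeled_message: str) -> str:
    """Turn any I-X tag that does not continue an X entity into B-X."""
    fixed_lines = []
    previous_type = None  # Entity type of the previous token, or None after "O"
    for line in labeled_message.split("\n"):
        if not line.strip():
            fixed_lines.append(line)
            previous_type = None
            continue
        token, label = line.rsplit(" ", 1)
        if label == "O":
            previous_type = None
        else:
            prefix, entity_type = label.split("-", 1)
            if prefix == "I" and entity_type != previous_type:
                label = f"B-{entity_type}"
            previous_type = entity_type
        fixed_lines.append(f"{token} {label}")
    return "\n".join(fixed_lines)
```

It could be applied with `df['Labeled_Message'] = df['Labeled_Message'].apply(enforce_bio)` before the file is written.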




In [41]:
df[df['Channel Title']=='AwasMart'].head()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Labeled_Message
271,AwasMart,@AwasMart,5081,"Aromatherapy humidifier, difuser\n\n Brings cl...",2022-02-01 19:14:29+00:00,"Aromatherapy B-PRODUCT\nhumidifier, I-PRODUCT\..."
359,AwasMart,@AwasMart,5094,Robotic Cushion Massage Seat For Car / Home,2022-01-23 19:28:49+00:00,Robotic B-PRODUCT\nCushion I-PRODUCT\nMassage ...
698,AwasMart,@AwasMart,5077,ለጤናችን-Health & Personal Care\n\nFingerTip Puls...,2021-04-12 08:36:40+00:00,ለጤናችን-Health B-PRODUCT\n& I-PRODUCT\nPersonal ...
927,AwasMart,@AwasMart,5100,Reusable convenience LVy Grip Tape\n\nSize: 3...,2022-01-21 10:15:06+00:00,Reusable B-PRODUCT\nconvenience I-PRODUCT\nLVy...
943,AwasMart,@AwasMart,5107,Saachi 3in1 Blender / Grinder\nለጁስ\nለቡና እንዲሁም ...,2022-01-17 04:08:15+00:00,Saachi B-PRODUCT\n3in1 I-PRODUCT\nBlender I-PR...


In [42]:
df[df['Channel Title']=='Leyueqa'].head()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Labeled_Message
181,Leyueqa,@Leyueqa,5059,"ከ""ጃር"" ላይ የመጠጥ ዉሃ የሚስብ ተንቀሳቃሽ ማሽን\n......water ...",2021-10-28 15:48:02+00:00,"ከ""ጃር"" B-PRODUCT\nላይ I-PRODUCT\nየመጠጥ I-PRODUCT\..."
1369,Leyueqa,@Leyueqa,5051,Mini Ultrasonic turbine washer,2021-10-31 04:22:41+00:00,Mini B-PRODUCT\nUltrasonic I-PRODUCT\nturbine ...
2501,Leyueqa,@Leyueqa,5048,Retractable Clothesline Rope,2021-10-31 04:25:38+00:00,Retractable B-PRODUCT\nClothesline I-PRODUCT\n...
2882,Leyueqa,@Leyueqa,5060,"Anti-theft Lightweight Backpack 15.6""\n ...",2021-10-25 07:59:46+00:00,Anti-theft B-PRODUCT\nLightweight I-PRODUCT\nB...
2891,Leyueqa,@Leyueqa,5055,Sweat Shaper\n Slimming body shaper \n\nለ...,2021-10-30 14:19:16+00:00,Sweat B-PRODUCT\nShaper I-PRODUCT\nSlimming O\...


Save the updated labeled dataset to a file in CoNLL format

In [44]:

labeled_data_birr_path = r"C:\Users\ASUS VIVO\Downloads\labele_data.txt-"
with open(labeled_data_birr_path, 'w', encoding='utf-8') as f:
    # One token-label pair per line, with a blank line between messages
    for labeled_message in df['Labeled_Message']:
        f.write(f"{labeled_message}\n\n")
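
To confirm that the saved file parses as expected, it can be read back into one list of `(token, label)` pairs per message. The reader below is a minimal sketch: `read_conll` is a hypothetical name, and it assumes the two-column, space-separated layout written above.

```python
import io


def read_conll(handle):
    """Parse 'token label' lines into one list of pairs per blank-line-separated block."""
    sentences, current = [], []
    for raw_line in handle:
        line = raw_line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current message
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit(" ", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Small in-memory example in the same layout as the saved file
example = io.StringIO("Yoga B-PRODUCT\nmat I-PRODUCT\n\nMini B-PRODUCT\n\n")
parsed = read_conll(example)
```

For the real file, pass an open handle instead: `with open(labeled_data_birr_path, encoding='utf-8') as f: parsed = read_conll(f)`.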