## Text Processing: Handling Amharic Text, Tokenization, and Preprocessing Techniques
Before the scraped Amharic text can be used downstream, it has to be tokenized, normalized, and adjusted for linguistic features that are specific to Amharic.

Here’s how we can approach this task:

### Steps to Preprocess Amharic Text

* Tokenization: Split the text into individual units such as words or subwords. Because Amharic is written in the Ethiopic (Ge'ez) script and has its own punctuation and morphology, tokenizers built for other languages may need adjusting.
  * Use a library that supports Amharic text, or write a custom rule-based tokenizer.
* Normalization: Clean the text and convert it to a standard format.
  * Remove special characters, punctuation, and, where they are not needed, numbers.
  * Normalize interchangeable characters (for example, different letters that represent the same sound).
  * Convert text to a standard form (for example, removing diacritics if necessary).
* Handling Amharic-specific features:
  * Amharic, like other Semitic languages, has root-and-pattern morphology.
  * Handle orthographic variants and account for prefixes, suffixes, and infixes.
  * Identify verb conjugations, plural forms, and possessives for better tokenization.
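
The tokenization and normalization steps listed above can be sketched with the standard library alone. The block below is a minimal illustration, not the implementation behind `AmharicTextPreprocessor` (which is not shown here): the function names `normalize_homophones` and `tokenize`, the choice of canonical letters, and the token pattern are all assumptions made for this example.

```python
import re

# Homophone series that are often merged when normalizing Amharic text.
# Each pair is (variant base, canonical base) in the Unicode Ethiopic block.
# Which letter counts as canonical is an assumption, not a fixed standard.
_HOMOPHONE_BASES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]

# Translation table covering the seven vowel orders of each series.
_HOMOPHONE_TABLE = {
    src + order: dst + order
    for src, dst in _HOMOPHONE_BASES
    for order in range(7)
}

# A token is a run of Ethiopic letters (U+1200 to U+135F, which excludes
# Ethiopic punctuation), a run of ASCII digits, or a run of Latin letters.
_TOKEN_RE = re.compile(r"[\u1200-\u135F]+|[0-9]+|[A-Za-z]+")


def normalize_homophones(text: str) -> str:
    """Map variant homophone letters to one canonical letter per sound."""
    return text.translate(_HOMOPHONE_TABLE)


def tokenize(text: str) -> list[str]:
    """Split text into tokens, discarding punctuation and other symbols."""
    return _TOKEN_RE.findall(text)


sample_tokens = tokenize(normalize_homophones("ሰላም እንዴት ነህ? እንኳን ደህና መጣህ።"))
print(sample_tokens)
```

A rule-based pattern like this ignores morphology entirely: prefixes and suffixes stay attached to their stems, so it is only a starting point for the affix handling described in the last bullet.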

In [2]:
# Import necessary libraries
import pandas as pd
import logging
import os, sys
import matplotlib.pyplot as plt
from matplotlib import font_manager
from collections import Counter
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import data preprocessor class
from amharic_text_processor import AmharicTextPreprocessor
from amharic_labeler import AmharicNERLabeler

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2025-01-21 22:07:13,363 - INFO - Imported libraries and configured logging.


In [3]:
# Read the data
data = pd.read_csv('../data/telegram_data.csv')
# Explore the first five rows
data.head()

| | channel_title | channel_username | channel_id | message_id | message | date | media_path |
|---|---|---|---|---|---|---|---|
| 0 | Zemen Express® | @ZemenExpress | 1307493052 | 6065 | 💥💥..........................💥💥\n\n📌 Profession... | 2025-01-21 11:03:15+00:00 | ../data/photos/@ZemenExpress_6065.jpg |
| 1 | Zemen Express® | @ZemenExpress | 1307493052 | 6064 | 💥💥..........................💥💥\n\n📌 Profession... | 2025-01-21 11:03:04+00:00 | |
| 2 | Zemen Express® | @ZemenExpress | 1307493052 | 6063 | 💥💥...................................💥💥\n\n📌 1... | 2025-01-21 07:11:47+00:00 | ../data/photos/@ZemenExpress_6063.jpg |
| 3 | Zemen Express® | @ZemenExpress | 1307493052 | 6062 | 💥💥...................................💥💥\n\n📌 1... | 2025-01-21 07:11:33+00:00 | ../data/photos/@ZemenExpress_6062.jpg |
| 4 | Zemen Express® | @ZemenExpress | 1307493052 | 6061 | | 2025-01-21 05:20:45+00:00 | ../data/photos/@ZemenExpress_6061.jpg |


In [4]:
# Check the last five rows
data.tail()

| | channel_title | channel_username | channel_id | message_id | message | date | media_path |
|---|---|---|---|---|---|---|---|
| 3390 | Zemen Express® | @ZemenExpress | 1307493052 | 2557 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2557.jpg |
| 3391 | Zemen Express® | @ZemenExpress | 1307493052 | 2556 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2556.jpg |
| 3392 | Zemen Express® | @ZemenExpress | 1307493052 | 2555 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2555.jpg |
| 3393 | Zemen Express® | @ZemenExpress | 1307493052 | 2554 | 🎯Momcoc® Smiley Face Non Stick Pancake Pan 😄😆\... | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2554.jpg |
| 3394 | Zemen Express® | @ZemenExpress | 1307493052 | 2553 | 🎯Momcoc® Smiley Face Non Stick Pancake Pan 😄😆\... | 2023-04-06 05:20:09+00:00 | |


In [5]:
data.shape

(3395, 7)

In [6]:
# Let's check the missing values
data.isnull().sum()

channel_title         0
channel_username      0
channel_id            0
message_id            0
message             976
date                  0
media_path          853
dtype: int64
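
Raw counts are easier to judge as a share of the 3,395 rows: 976 missing messages is roughly 28.7% of the data, and 853 missing media paths is roughly 25.1%. The helper below is a small sketch of how such a summary could be produced; `missing_summary` is a hypothetical name, and the frame it runs on contains made-up values rather than the scraped data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })


# Tiny illustrative frame (made-up values, not the Telegram export)
sample = pd.DataFrame({
    "message": ["ዋጋ 800 ብር", None, None, "ዋጋ 900 ብር"],
    "media_path": ["a.jpg", "b.jpg", None, "d.jpg"],
})
summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(data)` on the real DataFrame would give the same counts as `data.isnull().sum()` with the percentages alongside.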

In [7]:
# Preprocess and tokenize the Amharic messages
preprocessor = AmharicTextPreprocessor()

# Clean the 'message' column; the result gains a 'preprocessed_message' column
tokens = preprocessor.preprocess_dataframe(data, 'message')
display(tokens)

| | channel_title | channel_username | channel_id | message_id | message | date | media_path | preprocessed_message |
|---|---|---|---|---|---|---|---|---|
| 0 | Zemen Express® | @ZemenExpress | 1307493052 | 6065 | 💥💥..........................💥💥\n\n📌 Profession... | 2025-01-21 11:03:15+00:00 | ../data/photos/@ZemenExpress_6065.jpg | 3 120 ዋጋ 1100 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1መገናኛ መሰረ... |
| 1 | Zemen Express® | @ZemenExpress | 1307493052 | 6064 | 💥💥..........................💥💥\n\n📌 Profession... | 2025-01-21 11:03:04+00:00 | | 3 120 ዋጋ 1100 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1መገናኛ መሰረ... |
| 2 | Zemen Express® | @ZemenExpress | 1307493052 | 6063 | 💥💥...................................💥💥\n\n📌 1... | 2025-01-21 07:11:47+00:00 | ../data/photos/@ZemenExpress_6063.jpg | 12 ዋጋ 800 ብር ውስን ፍሬ ነው ያለን አድራሻ ቁ1መገናኛ መሰረት ደፋ... |
| 3 | Zemen Express® | @ZemenExpress | 1307493052 | 6062 | 💥💥...................................💥💥\n\n📌 1... | 2025-01-21 07:11:33+00:00 | ../data/photos/@ZemenExpress_6062.jpg | 12 ዋጋ 800 ብር ውስን ፍሬ ነው ያለን አድራሻ ቁ1መገናኛ መሰረት ደፋ... |
| 4 | Zemen Express® | @ZemenExpress | 1307493052 | 6061 | | 2025-01-21 05:20:45+00:00 | ../data/photos/@ZemenExpress_6061.jpg | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3390 | Zemen Express® | @ZemenExpress | 1307493052 | 2557 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2557.jpg | |
| 3391 | Zemen Express® | @ZemenExpress | 1307493052 | 2556 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2556.jpg | |
| 3392 | Zemen Express® | @ZemenExpress | 1307493052 | 2555 | | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2555.jpg | |
| 3393 | Zemen Express® | @ZemenExpress | 1307493052 | 2554 | 🎯Momcoc® Smiley Face Non Stick Pancake Pan 😄😆\... | 2023-04-06 05:20:42+00:00 | ../data/photos/@ZemenExpress_2554.jpg | 100 የማይዝ በአንድ ግዜ 7 ኬክ ይጋግራል ዋጋ 1700 ብር ውስን ፍሬ ... |
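
Judging only from the output above, preprocessing appears to drop emoji, Latin text, and most punctuation while keeping Ethiopic letters, digits, and `/`. The source of `preprocess_dataframe` is not shown, so the following is a guess at the per-message cleaning step, offered as a sketch: `clean_message` is a hypothetical function, and the set of characters it keeps is an assumption inferred from the displayed rows.

```python
import re

# Assumption: keep Ethiopic letters, ASCII digits, "/" and whitespace only.
_DISALLOWED = re.compile(r"[^\u1200-\u135F0-9/\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_message(text):
    """Return a cleaned string, or "" for missing or non-string input."""
    if not isinstance(text, str):
        return ""  # covers NaN, which pandas stores as a float
    kept = _DISALLOWED.sub("", text)
    # Collapse newlines and repeated spaces left behind by the removals
    return _WHITESPACE.sub(" ", kept).strip()


raw = "💥💥 Pancake Pan 😄\n\nዋጋ፦ 1700 ብር"
cleaned = clean_message(raw)
print(cleaned)
```

On a DataFrame the same function could be applied with `data['message'].apply(clean_message)`. Deleting characters rather than replacing them with spaces fuses neighbors together, which would explain tokens such as `ቁ1መገናኛ` and the trailing `///` in the output, but that explanation is also an inference.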


In [8]:
# Drop rows where the message is missing
data.dropna(subset=['message'], inplace=True)

In [9]:
list(data['preprocessed_message'])

['3 120 ዋጋ 1100 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1መገናኛ መሰረት ደፋር ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05/06 ቁ2ያሳ ጊዮርጊስ አደባባይ ራመትታቦርኦዳህንፃ 1ኛ ፎቅ ሱቅ ቁ 1 107 0902660722 0928460606 ያሳ ቅርንጫፍ 0941337070 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '3 120 ዋጋ 1100 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1መገናኛ መሰረት ደፋር ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05/06 ቁ2ያሳ ጊዮርጊስ አደባባይ ራመትታቦርኦዳህንፃ 1ኛ ፎቅ ሱቅ ቁ 1 107 0902660722 0928460606 ያሳ ቅርንጫፍ 0941337070 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '12 ዋጋ 800 ብር ውስን ፍሬ ነው ያለን አድራሻ ቁ1መገናኛ መሰረት ደፋር ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05/06 ቁ2ያሳ ጊዮርጊስ አደባባይ ራመትታቦርኦዳህንፃ 1ኛ ፎቅ ሱቅ ቁ 1 107 0902660722 0928460606 ያሳ ቅርንጫፍ 0941337070 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 '12 ዋጋ 800 ብር ውስን ፍሬ ነው ያለን አድራሻ ቁ1መገናኛ መሰረት ደፋር ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05/06 ቁ2ያሳ ጊዮርጊስ አደባባይ ራመትታቦርኦዳህንፃ 1ኛ ፎቅ ሱቅ ቁ 1 107 0902660722 0928460606 ያሳ ቅርንጫፍ 0941337070 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌግራም ገፃችን ///',
 'ዋጋ 900 ብር ውስን ፍሬ ነው ያለው አድራሻ ቁ1መገናኛ መሰረት ደፋር ሞል ሁለተኛ ፎቅ ቢሮ ቁ 05/06 ቁ2ያሳ ጊዮርጊስ አደባባይ ራመትታቦርኦዳህንፃ 1ኛ ፎቅ ሱቅ ቁ 1 107 0902660722 0928460606 ያሳ ቅርንጫፍ 0941337070 በ ለማዘዝ ይጠቀሙ ለተጨማሪ ማብራሪያ የቴሌ

In [10]:
# Ensure there are no NaN values in the preprocessed column
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
df = pd.Series(preprocessed_texts).reset_index(name='message')

In [11]:
# Initialize the labeler
labeler = AmharicNERLabeler()

# Rebuild the DataFrame of preprocessed texts, dropping any NaN values
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
df = pd.Series(preprocessed_texts).reset_index(name='message')
# df = df.iloc[10:15]

# Split each message into whitespace-separated tokens
df['Tokenized'] = df['message'].str.split()

# Label the tokens in the DataFrame
labeled_df = labeler.label_dataframe(df, 'Tokenized')

# Save to CoNLL format
labeler.save_conll_format(labeled_df, '../labeled_data_conll.conll')
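
CoNLL-style files put one token and its label on each line and separate sentences (here, messages) with a blank line. The internals of `save_conll_format` are not shown, so the block below is only a sketch of that layout under stated assumptions: `write_conll` is a hypothetical function, the space separator is one common choice, and the `B-PRICE`/`I-PRICE`/`O` tags are invented for the example rather than taken from `AmharicNERLabeler`.

```python
import tempfile
from pathlib import Path


def write_conll(sentences, path):
    """Write (token, label) pairs one per line, blank line between sentences."""
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token} {label}")
        lines.append("")  # blank line marks the sentence boundary
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Two made-up labeled messages with hypothetical tags
example = [
    [("ዋጋ", "B-PRICE"), ("1100", "I-PRICE"), ("ብር", "I-PRICE")],
    [("አድራሻ", "O")],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "example.conll"
    write_conll(example, out_path)
    written = out_path.read_text(encoding="utf-8")

print(written)
```

Writing with an explicit `encoding="utf-8"` matters for Ethiopic text, since the platform default encoding may not be able to represent it.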

In [12]:
# Drop the helper 'index' column created by reset_index
labeled_df.drop(columns=['index'], inplace=True)

In [13]:
# Count messages that repeat an earlier message
labeled_df['message'].duplicated().sum()

np.int64(1332)
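
With its default `keep='first'`, `duplicated()` flags only the second and later copies of each message, so 1,332 is the number of rows that would be removed if one copy of every distinct message were kept. If repeated posts are unwanted in the labeled set, deduplication could look like the sketch below; it runs on a made-up frame, and whether to deduplicate at all is a choice the notebook does not make.

```python
import pandas as pd

# Made-up rows standing in for labeled_df
frame = pd.DataFrame({
    "message": ["ዋጋ 800 ብር", "ዋጋ 800 ብር", "ዋጋ 900 ብር", "ዋጋ 800 ብር"],
})

# duplicated() marks every repeat after the first occurrence
n_repeats = int(frame["message"].duplicated().sum())

# keep="first" retains one copy of each distinct message
deduped = frame.drop_duplicates(subset="message", keep="first").reset_index(drop=True)

print(n_repeats, len(deduped))
```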