### Text Processing: Handling Amharic text, tokenization, and preprocessing techniques.

Before the scraped Amharic text can be used for downstream tasks, it has to be tokenized, normalized, and adjusted for Amharic-specific linguistic features.

The preprocessing steps below are tailored to the language:

**Steps to Preprocess Amharic Text**

- **Tokenization**: Tokenization splits text into individual units such as words or subwords. Amharic is written in the Ge'ez (Ethiopic) script and has its own punctuation marks (for example, ። as a full stop), so tokenizers built for Latin-script text may need adjustments.
    - Use a library that handles Amharic text, or write a custom rule-based tokenizer.

- **Normalization**: This step cleans the text and converts it into a standard format:

    - Remove special characters, punctuation, and numbers.
    - Map interchangeable characters to a single form (for example, letters that are written differently but pronounced the same, such as ሰ and ሠ).
    - Convert text to a standard form (for example, removing diacritics if necessary).

- **Handling Amharic-Specific Features**:

    - Account for root-and-pattern morphology, which Amharic shares with other Semitic languages.

    - Handle orthographic variants, and take prefixes, suffixes, and infixes into account.

    - Identify verb conjugations, plural forms, and possessives to improve tokenization.
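
As a rough illustration of the normalization and tokenization steps above, here is a minimal rule-based sketch using only the standard library. It is not the `AmharicTextPreprocessor` used later in this notebook, whose implementation is not shown here. The names `normalize_homophones` and `tokenize` are hypothetical, the choice of which letter in each homophone group counts as canonical is an assumption, and this sketch drops all punctuation, which may differ from what the project class does.

```python
import re

# Assumed convention: fold each homophone series onto one canonical series.
# Pairs are (variant series start, canonical series start) as code points.
_HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]

# Each series stores its seven vowel orders in consecutive code points,
# so the same offset applies to the variant and the canonical letter.
_HOMOPHONE_TABLE = {
    src + order: dst + order
    for src, dst in _HOMOPHONE_SERIES
    for order in range(7)
}

# A token is a run of Ethiopic letters (U+1200 to U+135F) or a run of
# ASCII letters and digits. Ethiopic punctuation (U+1360 to U+1368),
# ASCII punctuation, emoji, and whitespace all act as separators.
_TOKEN_RE = re.compile(r"[\u1200-\u135F]+|[A-Za-z0-9]+")


def normalize_homophones(text):
    """Replace variant homophone letters with their canonical form."""
    return text.translate(_HOMOPHONE_TABLE)


def tokenize(text):
    """Normalize the text, then return its word tokens without punctuation."""
    return _TOKEN_RE.findall(normalize_homophones(text))


print(tokenize("ሠላም፣ እንዴት ነህ?"))
# ['ሰላም', 'እንዴት', 'ነህ']
```

Morphological analysis (splitting prefixes and suffixes, or recovering roots) is deliberately left out of this sketch, since simple string rules tend to mis-split words that merely begin or end with an affix-like character.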

In [4]:
# Import necessary libraries
import logging
import os
import sys

import pandas as pd
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import data preprocessor class
from amharic_text_processor import AmharicTextPreprocessor

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2024-09-25 15:46:50,870 - INFO - Imported libraries and configured logging.


**Load the scraped Telegram data**

In [9]:
# Read the data
data = pd.read_csv('../data/telegram_data.csv')
# Explore the first five rows
data.head()

| | Channel Title | Channel Username | ID | Message | Date | Media Path |
|---|---|---|---|---|---|---|
| 0 | ልዩ እቃ | @Leyueqa | 5819 | 🔠🔠🔠🔠🔠Siliver crest ➡️Brand ባለ1 እና ባለ 2 ተች ስቶ... | 2024-09-25 17:39:49+00:00 | |
| 1 | ልዩ እቃ | @Leyueqa | 5818 | 🔠🔠🔠🔠ሶስት ፍሬ የዳቦ እና የኬክ ቅርጽ ማውጫ ( መጋገሪያ ፓትራ )\n\... | 2024-09-25 10:38:58+00:00 | |
| 2 | ልዩ እቃ | @Leyueqa | 5817 | 🧳🧳🧳HIGH PRESSURE WATER GUN HEAD SET\n👉 360° የሚ... | 2024-09-25 07:44:47+00:00 | |
| 3 | ልዩ እቃ | @Leyueqa | 5816 | | 2024-09-25 05:48:40+00:00 | |
| 4 | ልዩ እቃ | @Leyueqa | 5815 | | 2024-09-25 05:48:40+00:00 | ../data/photos/@Leyueqa_5815.jpg |


In [10]:
# Check the last five rows
data.tail()

| | Channel Title | Channel Username | ID | Message | Date | Media Path |
|---|---|---|---|---|---|---|
| 1666 | ልዩ እቃ | @Leyueqa | 148 | ይመቻቹ ፈታ ያለ ምሽት ተመኘሁ | 2018-10-25 13:09:24+00:00 | |
| 1667 | ልዩ እቃ | @Leyueqa | 136 | | 2018-10-20 12:46:15+00:00 | |
| 1668 | ልዩ እቃ | @Leyueqa | 70 | | 2018-09-04 15:28:25+00:00 | |
| 1669 | ልዩ እቃ | @Leyueqa | 55 | | 2018-08-23 20:18:56+00:00 | |
| 1670 | ልዩ እቃ | @Leyueqa | 1 | | 2018-08-02 07:30:19+00:00 | |


In [11]:
# Let's check the missing values
data.isnull().sum()

Channel Title         0
Channel Username      0
ID                    0
Message             704
Date                  0
Media Path          546
dtype: int64
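
The counts above show 704 rows with no `Message`, and such rows contain nothing to tokenize. One possible way to handle them, shown here only as a sketch on a toy DataFrame rather than as the step this notebook actually takes, is to drop rows whose message is missing or contains only whitespace. The `sample` and `cleaned` names are hypothetical; the column names follow the output above.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are made up).
sample = pd.DataFrame({
    "ID": [3, 2, 1],
    "Message": ["ሰላም", None, "   "],
})

# Drop rows whose Message is missing, then rows that are only whitespace.
cleaned = sample.dropna(subset=["Message"])
cleaned = cleaned[cleaned["Message"].str.strip() != ""].reset_index(drop=True)

print(len(sample), "->", len(cleaned))
# 3 -> 1
```

Whether to drop these rows or keep them (for example, because the post still carries an image in `Media Path`) depends on what the downstream task needs.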

In [6]:
if __name__ == "__main__":
    # Amharic text sample
    amharic_text = "ሰላም እንዴት ነህ? እንኳን ደህና መጣህ።"

    preprocessor = AmharicTextPreprocessor()

    # Preprocess the text
    tokens = preprocessor.preprocess(amharic_text)
    print("Tokenized Amharic Text:", tokens)


Tokenized Amharic Text: ['ሰላም', 'እንዴት', 'ነህ', '?', 'እንኳን', 'ደህና', 'መጣህ']
