# CENG442 Assignment 1 - Azerbaijani Text Preprocessing & Word Embeddings

**Group Members:**
* Talha Ubeydullah Gamga | 20050111078
* Aziz Önder | 22050141021
* Muhammed Fatih Asan | 23050151026
* Buğra Bildiren | 20050111022

## Step 1: Setup and Imports

In this step, we import the libraries needed for data processing and text cleaning: `pandas`, the standard-library modules `re` (regex), `os`, and `unicodedata`, and `ftfy` (for repairing broken Unicode text).

We also import the custom utility module `ozel_temizlik` (the `ozel_temizlik.py` script), which provides helper functions for domain detection, emoji mapping, negation handling, and other specific cleaning tasks.

Finally, we define and create the `OUTPUT_DIR` (`clean_data/`) where our processed Excel files will be saved.

In [2]:
import pandas as pd
import re
import os
import unicodedata
import ftfy

# --- Import Custom Utility Script ---
# This script contains helper functions for domain detection,
# negation, emoji mapping, and other specific cleaning tasks.
import ozel_temizlik

# --- Setup Output Directory ---
OUTPUT_DIR = "clean_data"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Libraries imported successfully.")
print("Utility functions from 'ozel_temizlik.py' imported.")
print(f"Output directory '{OUTPUT_DIR}' is ready.")

Libraries imported successfully.
Utility functions from 'ozel_temizlik.py' imported.
Output directory 'clean_data' is ready.


## Step 2: Define Core Helper Functions

In this step, we define the core helper functions required by the main processing pipeline. These functions are responsible for:

1.  **`map_sentiment_value`**: Standardizing the various sentiment labels (e.g., "positive", 1, 0.0) from the five datasets into a single float format (0.0 = negative, 0.5 = neutral, 1.0 = positive).
2.  **`lower_az`**: Handling the specific lowercase conversion for Azerbaijani characters (e.g., 'İ' -> 'i', 'I' -> 'ı').
3.  **`basic_regex_clean`**: Applying the fundamental, non-domain-specific cleaning rules (removing HTML tags and replacing URLs, emails, mentions, phone numbers, and numbers with placeholder tokens), based on the code snippets provided in the assignment PDF.
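
To see why item 2 needs a dedicated function, the short check below compares Python's built-in `str.lower()` with an Azerbaijani-aware mapping. It is only a minimal sketch: `lower_az_demo` is a hypothetical name used for illustration, and the sample phrase is made up.

```python
# Python's default lowercasing follows language-neutral Unicode rules,
# which do not match the Azerbaijani alphabet.
print(repr("İ".lower()))  # 'i' followed by U+0307 (combining dot above)
print(repr("I".lower()))  # plain 'i', whereas Azerbaijani expects dotless 'ı'


def lower_az_demo(text: str) -> str:
    """Hypothetical illustration: map the two special capitals, then lowercase."""
    return text.replace("İ", "i").replace("I", "ı").lower()


default_result = "QIZIL İŞIQ".lower()
az_result = lower_az_demo("QIZIL İŞIQ")
print(default_result)  # dotted 'i' everywhere, plus a stray combining dot
print(az_result)       # qızıl işıq
```

The replacements have to run before `.lower()`, because once the default lowercasing has been applied the dotted and dotless letters can no longer be told apart.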

In [3]:
# ----------------------------------------------------------------
# 2.1: Sentiment Label Standardization
# (Maps all labels to 0.0, 0.5, 1.0 as float)
# ----------------------------------------------------------------
def map_sentiment_value(label):
    """
    Converts various sentiment labels (str, int) from different
    datasets into a standard float value (0.0, 0.5, or 1.0).
    Returns None if the label is unmappable.
    """
    
    # Handle string labels
    if isinstance(label, str):
        label_low = label.lower().strip()
        if label_low in ['positive', 'pos', '1']:
            return 1.0
        elif label_low in ['negative', 'neg', '0']:
            return 0.0
        elif label_low in ['neutral', 'neu', '0.5']:
            return 0.5
    
    # Handle integer labels
    if isinstance(label, int):
        if label == 1:
            return 1.0
        elif label == 0:
            return 0.0
            
    # Handle floats and anything else float() can parse (e.g., numeric strings like '1.0')
    try:
        f_label = float(label)
        if f_label == 1.0: return 1.0
        if f_label == 0.0: return 0.0
        if f_label == 0.5: return 0.5
    except (ValueError, TypeError):
        pass 
    
    # If no match is found
    return None

# ----------------------------------------------------------------
# 2.2: Azerbaijani-Specific Lowercasing
# (PDF Section 5.1.4: 'İ' -> 'i', 'I' -> 'ı')
# ----------------------------------------------------------------
def lower_az(text):
    """Applies Azerbaijani-specific lowercase conversion."""
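    if not isinstance(text, str):
        text = str(text)  # Coerce non-string input, then lowercase it as well
    # Map the dotted/dotless capitals first; str.lower() alone would turn
    # 'İ' into 'i' + U+0307 and 'I' into a dotted 'i'.
    text = text.replace('İ', 'i').replace('I', 'ı')
    return text.lower()  # Standard lowercasing for the remaining characters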

# ----------------------------------------------------------------
# 2.3: Basic Text Normalization (Regex)
# (Based on PDF Section 5.1 code snippets)
# ----------------------------------------------------------------
def basic_regex_clean(text):
    """
    Applies fundamental regex cleaning rules as specified 
    in the assignment PDF (e.g., HTML, URL, EMAIL, NUM).
    """
    
    # Fix broken Unicode (e.g., â€™ -> ’) - Recommended by PDF
    text = ftfy.fix_text(text)
    
    # Remove HTML tags (PDF Section 5.1.1)
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Normalize URLs (PDF Section 5.1.2)
    text = re.sub(r'http\S+|www\S+', '<URL>', text)
    
    # Normalize Emails (PDF Section 5.1.2)
    text = re.sub(r'\S+@\S+', '<EMAIL>', text)
    
    # Normalize @mentions (PDF Section 5.1.2)
    text = re.sub(r'@\w+', '<USER>', text)
    
    # Normalize Phone (simple rule) (PDF Section 5.1.2)
    # (Note: PDF has a typo r'(\+?d... , corrected to \d)
    text = re.sub(r'(\+?\d[\d\s-]{7,}\d)', '<PHONE>', text)
    
    # Normalize Numbers (as per PDF Section 5.1.6)
    text = re.sub(r'\b\d+[\.,\d]*\b', '<NUM>', text)
    
    # Collapse runs of 3+ identical characters to 2 (e.g., çooox -> çoox) (PDF Section 5.1.6)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    
    return text

print("Core helper functions (map_sentiment_value, lower_az, basic_regex_clean) defined.")

Core helper functions (map_sentiment_value, lower_az, basic_regex_clean) defined.
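
As a quick check of the rule ordering in `basic_regex_clean`, the sketch below reapplies the same regular expressions using only the standard library. `regex_clean_demo` is a hypothetical stand-in that skips the `ftfy.fix_text` call so it runs without extra dependencies, and the sample strings are invented for illustration.

```python
import re


def regex_clean_demo(text: str) -> str:
    """Hypothetical copy of the regex steps above, without the ftfy call."""
    text = re.sub(r'<[^>]+>', ' ', text)                     # HTML tags
    text = re.sub(r'http\S+|www\S+', '<URL>', text)          # URLs
    text = re.sub(r'\S+@\S+', '<EMAIL>', text)               # emails, before mentions
    text = re.sub(r'@\w+', '<USER>', text)                   # @mentions
    text = re.sub(r'(\+?\d[\d\s-]{7,}\d)', '<PHONE>', text)  # phones, before numbers
    text = re.sub(r'\b\d+[\.,\d]*\b', '<NUM>', text)         # remaining numbers
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)               # 3+ repeats -> 2
    return text


samples = [
    "çooox yaxşı",
    "Zəng edin: +994 50 123 45 67",
    "@user123 bax: https://example.com/a?b=1",
    "Qiymət 25,50 AZN, əlaqə: info@example.com",
]
results = [regex_clean_demo(s) for s in samples]
for original, cleaned in zip(samples, results):
    print(f"{original!r} -> {cleaned!r}")
# 'çooox yaxşı' -> 'çoox yaxşı'
# 'Zəng edin: +994 50 123 45 67' -> 'Zəng edin: <PHONE>'
# '@user123 bax: https://example.com/a?b=1' -> '<USER> bax: <URL>'
# 'Qiymət 25,50 AZN, əlaqə: info@example.com' -> 'Qiymət <NUM> AZN, əlaqə: <EMAIL>'
```

The order matters in two places: the email rule runs before the mention rule so that an address is not partly rewritten as `<USER>`, and the phone rule runs before the number rule so that a phone number is not split into several `<NUM>` tokens.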

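The label mapping from section 2.1 can also be written as a lookup table, which makes the accepted labels easy to audit. The sketch below is a hypothetical variant, not the function used in this notebook: `map_label_demo`, `STRING_LABELS`, and `VALID_SCORES` are illustrative names, and rejecting booleans is an assumption that differs from `map_sentiment_value`.

```python
# Hypothetical lookup table; the string keys mirror the labels handled above.
STRING_LABELS = {
    "positive": 1.0, "pos": 1.0,
    "negative": 0.0, "neg": 0.0,
    "neutral": 0.5, "neu": 0.5,
}
VALID_SCORES = {0.0, 0.5, 1.0}


def map_label_demo(label):
    """Hypothetical table-driven variant: returns 0.0, 0.5, 1.0, or None."""
    if isinstance(label, bool):
        return None  # assumption: booleans are not valid sentiment labels
    if isinstance(label, str):
        key = label.strip().lower()
        if key in STRING_LABELS:
            return STRING_LABELS[key]
        label = key  # fall through so numeric strings such as "1" are parsed
    try:
        value = float(label)
    except (ValueError, TypeError):
        return None
    # NaN, infinities, and out-of-range numbers are not in the set.
    return value if value in VALID_SCORES else None


for raw in [" Positive ", "NEG", 0.5, "0.5", 1, 2, "mixed", None]:
    print(f"{raw!r} -> {map_label_demo(raw)}")
```

Rows whose label maps to `None` can then be dropped before training, which keeps unmappable labels from silently becoming a class.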

## Step 3: Define the Main Normalization Pipeline (normalize_text_az)

`normalize_text_az` is the main "glue" function of our pipeline. It runs all cleaning steps in a fixed, logical order.

It combines the **basic** cleaning functions (defined in Step 2, e.g., `basic_regex_clean`, `lower_az`) with the **advanced/specialized** functions imported from `ozel_temizlik.py` (e.g., `split_hashtags`, `handle_negation`).

The main `process_file` function (which we will use in the next step) will call this single function to perform the complete text normalization.