# CENG442 Assignment 1 - Azerbaijani Text Preprocessing & Word Embeddings

**Group Members:**
* Talha Ubeydullah Gamga | 20050111078
* Aziz Önder | 22050141021
* Muhammed Fatih Asan | 23050151026
* Buğra Bildiren | 20050111022

## Step 1: Setup and Imports

In this step, we import the libraries needed for data processing and text cleaning, including `pandas`, the standard-library `re` (regex) module, and `ftfy` (for repairing broken Unicode text).

We also import the custom utility functions (e.g., domain detection, emoji/negation handling) from the `ozel_temizlik.py` script.

Finally, we define and create the `OUTPUT_DIR` (`clean_data/`) where our processed Excel files will be saved.

In [10]:
# RUN THIS CELL FIRST IF YOU ARE RUNNING IN COLAB
!git clone https://github.com/eiziiaizii1/ceng442-assignment1-GroupTAFB.git
%cd ceng442-assignment1-GroupTAFB
!pip install pandas gensim openpyxl regex ftfy scikit-learn

Cloning into 'ceng442-assignment1-GroupTAFB'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 74 (delta 34), reused 45 (delta 14), pack-reused 0 (from 0)
Receiving objects: 100% (74/74), 15.72 MiB | 26.17 MiB/s, done.
Resolving deltas: 100% (34/34), done.
/content/ceng442-assignment1-GroupTAFB/ceng442-assignment1-GroupTAFB


In [11]:
import pandas as pd
import re
import os
import unicodedata
import ftfy

# --- Import Custom Utility Script ---
# This script contains helper functions for domain detection,
# negation, emoji mapping, and other specific cleaning tasks.
import ozel_temizlik

# --- Setup Output Directory ---
OUTPUT_DIR = "clean_data"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Libraries imported successfully.")
print("Utility functions from 'ozel_temizlik.py' imported.")
print(f"Output directory '{OUTPUT_DIR}' is ready.")

Libraries imported successfully.
Utility functions from 'ozel_temizlik.py' imported.
Output directory 'clean_data' is ready.


## Step 2: Define Core Helper Functions

In this step, we define the core helper functions required by the main processing pipeline. These functions are responsible for:

1.  **`map_sentiment_value`**: Standardizing the various sentiment labels (e.g., "positive", 1, 0.0) from the 5 datasets into a single numeric float format (0.0, 0.5, 1.0).
2.  **`lower_az`**: Handling the specific lowercase conversion for Azerbaijani characters (e.g., 'İ' -> 'i', 'I' -> 'ı').
3.  **`basic_regex_clean`**: Applying the fundamental, non-domain-specific cleaning rules (like removing HTML, normalizing URLs, Emails, Numbers) based on the code snippets provided in the PDF.
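
The reason a dedicated `lower_az` is needed is that Python's built-in `str.lower()` is locale-independent: it maps `I` to `i` (not `ı`) and maps `İ` to `i` followed by a combining dot (U+0307). The following is a minimal sketch of the difference; `lower_az_demo` is a hypothetical helper name and the sample string is made up for illustration.

```python
def lower_az_demo(text: str) -> str:
    """Illustrative Azerbaijani-aware lowercasing (hypothetical helper)."""
    # Map the two problematic capitals explicitly, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()

sample = "İSTİ QIŞ"  # made-up sample text

# Default lowercasing: 'İ' becomes 'i' + U+0307 and 'I' becomes 'i'.
print(repr(sample.lower()))

# Azerbaijani-aware lowercasing.
print(lower_az_demo(sample))  # isti qış
```

Doing the two replacements before `lower()` means the remaining characters can safely go through the default mapping.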

In [12]:
# ----------------------------------------------------------------
# 2.1: Sentiment Label Standardization
# (Maps all labels to 0.0, 0.5, 1.0 as float)
# ----------------------------------------------------------------
def map_sentiment_value(label):
    """
    Converts various sentiment labels (str, int) from different
    datasets into a standard float value (0.0, 0.5, or 1.0).
    Returns None if the label is unmappable.
    """

    # Handle string labels
    if isinstance(label, str):
        label_low = label.lower().strip()
        if label_low in ['positive', 'pos', '1']:
            return 1.0
        elif label_low in ['negative', 'neg', '0']:
            return 0.0
        elif label_low in ['neutral', 'neu', '0.5']:
            return 0.5

    # Handle integer labels
    if isinstance(label, int):
        if label == 1:
            return 1.0
        elif label == 0:
            return 0.0

    # Handle float labels
    try:
        f_label = float(label)
        if f_label == 1.0: return 1.0
        if f_label == 0.0: return 0.0
        if f_label == 0.5: return 0.5
    except (ValueError, TypeError):
        pass

    # If no match is found
    return None

# ----------------------------------------------------------------
# 2.2: Azerbaijani-Specific Lowercasing
# (PDF Section 5.1.4: 'İ' -> 'i', 'I' -> 'ı')
# ----------------------------------------------------------------
def lower_az(text):
    """Applies Azerbaijani-specific lowercase conversion."""
    if not isinstance(text, str):
        text = str(text)  # Ensure input is a string before lowercasing
    text = text.replace('İ', 'i').replace('I', 'ı')
    return text.lower() # Apply standard lowercasing

# ----------------------------------------------------------------
# 2.3: Basic Text Normalization (Regex)
# (Based on PDF Section 5.1 code snippets)
# ----------------------------------------------------------------
def basic_regex_clean(text):
    """
    Applies fundamental regex cleaning rules as specified
    in the assignment PDF (e.g., HTML, URL, EMAIL, NUM).
    """

    # Fix broken Unicode (e.g., â€™ -> ’) - Recommended by PDF
    text = ftfy.fix_text(text)

    # Remove HTML tags (PDF Section 5.1.1)
    text = re.sub(r'<[^>]+>', ' ', text)

    # Normalize URLs (PDF Section 5.1.2)
    text = re.sub(r'http\S+|www\S+', '<URL>', text)

    # Normalize Emails (PDF Section 5.1.2)
    text = re.sub(r'\S+@\S+', '<EMAIL>', text)

    # Normalize @mentions (PDF Section 5.1.2)
    text = re.sub(r'@\w+', '<USER>', text)

    # Normalize Phone (simple rule) (PDF Section 5.1.2)
    # (Note: PDF has a typo r'(\+?d... , corrected to \d)
    text = re.sub(r'(\+?\d[\d\s-]{7,}\d)', '<PHONE>', text)

    # Normalize Numbers (as per PDF Section 5.1.6)
    text = re.sub(r'\b\d+[\.,\d]*\b', '<NUM>', text)

    # Normalize repeating characters (e.g., çooox -> çoox) (PDF Section 5.1.6)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    return text

print("Core helper functions (map_sentiment_value, lower_az, basic_regex_clean) defined.")

Core helper functions (map_sentiment_value, lower_az, basic_regex_clean) defined.
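
The order of the placeholder rules in `basic_regex_clean` matters: emails are replaced before `@mentions`, because the mention pattern would otherwise match the domain part of an email address. The sketch below compares the two orders using the same two patterns; the function names and the sample address are made up for illustration.

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")
MENTION_RE = re.compile(r"@\w+")

def emails_then_mentions(text: str) -> str:
    """Order used in basic_regex_clean: emails are replaced first."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return MENTION_RE.sub("<USER>", text)

def mentions_then_emails(text: str) -> str:
    """Reversed order, shown only for comparison."""
    text = MENTION_RE.sub("<USER>", text)
    return EMAIL_RE.sub("<EMAIL>", text)

sample = "yaz: info@example.com və @user"  # made-up sample text
print(emails_then_mentions(sample))  # yaz: <EMAIL> və <USER>
print(mentions_then_emails(sample))  # yaz: info<USER>.com və <USER>
```

In the reversed order the email is never recognized, because the `@` it depends on has already been consumed by the mention rule.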


## Step 3: Define the Main Normalization Pipeline (normalize_text_az)

This is the main "glue" function for our pipeline. It's responsible for executing all cleaning steps in the correct logical order.

It combines the **basic** cleaning functions (defined in Step 2, e.g., `basic_regex_clean`, `lower_az`) with the **advanced/specialized** functions imported from `ozel_temizlik.py` (e.g., `split_hashtags`, `handle_negation`).

The main `process_file` function (which we will use in the next step) will call this single function to perform the complete text normalization.

In [13]:
# ----------------------------------------------------------------
# 3.1: Main Normalization Pipeline Function
# (This function is called by the PDF's process_file skeleton)
# ----------------------------------------------------------------
def normalize_text_az(raw_text, domain):
    """
    Applies the full sequence of cleaning and normalization steps.

    This function combines the basic regex cleaning with the
    domain-specific and special challenge functions in a logical order.

    Args:
        raw_text (str): The original, unprocessed text.
        domain (str): The detected domain ('news', 'social', etc.).

    Returns:
        str: The fully cleaned and normalized text.
    """

    # Step 1: Basic Azeri Lowercasing (from Step 2)
    text = lower_az(raw_text)

    # Step 2: Basic Regex Cleaning (HTML, URL, NUM, etc.) (from Step 2)
    text = basic_regex_clean(text)

    # --- Apply special functions from ozel_temizlik.py ---

    # Step 3: Split CamelCase hashtags (e.g., #QarabagIsBack -> qarabag is back)
    text = ozel_temizlik.split_hashtags(text)

    # Step 4: Map Emojis (e.g., :) -> EMO_POS)
    text = ozel_temizlik.map_emojis_and_normalize(text)

    # Step 5: Deasciify/Slang (e.g., cox -> çox)
    text = ozel_temizlik.deasciify_slang(text)

    # Step 6: Domain-Specific Normalization (e.g., 50 azn -> <PRICE>)
    # (Must run BEFORE negation to avoid tagging <PRICE>_NEG)
    text = ozel_temizlik.domain_specific_normalize(text, domain)

    # Step 7: Handle negation
    # (The negation function in ozel_temizlik.py applies the _NEG tagging)
    text = ozel_temizlik.handle_negation(text)

    # Step 8: Final cleanup (remove extra whitespace created during cleaning)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

print("Main pipeline function 'normalize_text_az' defined.")

Main pipeline function 'normalize_text_az' defined.
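
One ordering detail may be worth checking: `lower_az` runs as the first step, before `split_hashtags`. The implementation of `split_hashtags` lives in `ozel_temizlik.py` and is not shown here, so the following is only a sketch under the assumption that the splitter relies on capital letters to find word boundaries. The splitter below is hypothetical and is not the code from `ozel_temizlik.py`.

```python
import re

# Hypothetical splitter: inserts a space at each lowercase -> uppercase
# boundary inside a hashtag. This is NOT the code from ozel_temizlik.py.
_BOUNDARY = re.compile(r"(?<=[a-zəıöüçşğ])(?=[A-ZƏİÖÜÇŞĞ])")

def split_camel_hashtags(text: str) -> str:
    """Drop the '#' and split the tag at CamelCase boundaries."""
    return re.sub(r"#(\w+)", lambda m: _BOUNDARY.sub(" ", m.group(1)), text)

tag = "#QarabagIsBack"

# Split first, lowercase afterwards: the word boundaries are still visible.
print(split_camel_hashtags(tag).lower())  # qarabag is back

# Lowercase first, split afterwards: no capital letters are left to split on.
print(split_camel_hashtags(tag.lower()))  # qarabagisback
```

If the real splitter works this way, running it on the raw text (or before lowercasing) would preserve the boundaries; if it uses a dictionary or another strategy, the current order may be fine.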


## Step 4: Define Main Processing Function and Dataset List

Now we define the final pieces needed to run the entire pipeline:

1.  **`datasets_to_process`**: A list of dictionaries defining the 5 raw data files to process.
2.  **`process_file`**: The main function skeleton provided in the assignment PDF. This function reads a file, applies all our helper functions (`map_sentiment_value`, `normalize_text_az`, etc.) in the correct order, removes duplicates/empties, and saves the final two-column Excel file to our `OUTPUT_DIR`.
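
A point to keep in mind about duplicate removal: if duplicates are dropped on the raw text column, rows that differ only in casing or spacing can still end up with identical cleaned text. The sketch below illustrates this in plain Python; `toy_normalize` is a hypothetical stand-in for `normalize_text_az`, and the rows are made up.

```python
def toy_normalize(text: str) -> str:
    """Stand-in for normalize_text_az: lowercase and squeeze whitespace."""
    return " ".join(text.lower().split())

def dedupe(items):
    """Keep the first occurrence of each item, preserving order."""
    return list(dict.fromkeys(items))

raw = ["Film əla idi", "film  əla idi", "Film əla idi"]  # made-up rows

raw_unique = dedupe(raw)                       # duplicates removed on raw text
cleaned = [toy_normalize(t) for t in raw_unique]

print(len(raw_unique))       # 2
print(len(dedupe(cleaned)))  # 1
```

Whether a second de-duplication pass on the cleaned column is desirable depends on the assignment's requirements.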

In [14]:
# ----------------------------------------------------------------
# 4.1: Define the list of datasets to process
# (Based on PDF Section 2)
# ----------------------------------------------------------------
datasets_to_process = [
    {
        "in_file": "data/labeled-sentiment.xlsx",
        "text_col": "text",
        "label_col": "sentiment",
        "out_file": os.path.join(OUTPUT_DIR, "labeled-sentiment_2col.xlsx")
    },
    {
        "in_file": "data/test__1_.xlsx",
        "text_col": "text",
        "label_col": "label",
        "out_file": os.path.join(OUTPUT_DIR, "test_1_2col.xlsx") # Using PDF canonical name
    },
    {
        "in_file": "data/train__3_.xlsx",
        "text_col": "text",
        "label_col": "label",
        "out_file": os.path.join(OUTPUT_DIR, "train_3_2col.xlsx") # Using PDF canonical name
    },
    {
        "in_file": "data/train-00000-of-00001.xlsx",
        "text_col": "text",
        "label_col": "labels",
        "out_file": os.path.join(OUTPUT_DIR, "train-00000-of-00001_2col.xlsx")
    },
    {
        "in_file": "data/merged_dataset_CSV__1_.xlsx",
        "text_col": "text",
        "label_col": "labels",
        "out_file": os.path.join(OUTPUT_DIR, "merged_dataset_CSV_1_2col.xlsx") # Using PDF canonical name
    }
]

# ----------------------------------------------------------------
# 4.2: Define the main processing function
# (Based on the skeleton from PDF Section 7.1)
# ----------------------------------------------------------------
def process_file(in_file, text_col, label_col, out_file):
    """
    Reads a raw dataset, applies the full cleaning pipeline,
    and saves the required two-column (cleaned_text, sentiment_value)
    Excel file.
    """
    print(f"\nProcessing: {in_file}...")
    try:
        # 1. Read Data
        df = pd.read_excel(in_file)

        # 2. Drop rows with missing text or labels (PDF Section 5.1.7)
        df.dropna(subset=[text_col, label_col], inplace=True)

        # 3. Ensure text column is string
        df[text_col] = df[text_col].astype(str)

        # 4. Drop duplicate texts (PDF Section 5.1.7)
        df.drop_duplicates(subset=[text_col], inplace=True)

        # 5. Map sentiment labels (Using our function from Step 2)
        df['sentiment_value'] = df[label_col].apply(map_sentiment_value)

        # 6. Drop rows where mapping failed (e.g., unmappable labels)
        df.dropna(subset=['sentiment_value'], inplace=True)

        # 7. Detect domain (Using function from ozel_temizlik.py)
        # (This must be done on the *raw text* to catch hashtags, URLs, etc.)
        df['domain'] = df[text_col].apply(ozel_temizlik.detect_domain)

        # 8. Apply the main normalization pipeline (Using our function from Step 3)
        print("Applying normalization pipeline...")
        df['cleaned_text'] = df.apply(
            lambda row: normalize_text_az(row[text_col], row['domain']),
            axis=1
        )

        # 9. Select only the required two columns
        final_df = df[['cleaned_text', 'sentiment_value']]

        # 10. Save to Excel
        final_df.to_excel(out_file, index=False, engine='openpyxl')

        print(f"SUCCESS: Saved {len(final_df)} rows to {out_file}")
        return len(final_df)

    except Exception as e:
        print(f"!!! ERROR processing {in_file}: {e}")
        return 0

print("Dataset list and main 'process_file' function defined.")

Dataset list and main 'process_file' function defined.


## Step 5: Execute the Processing Pipeline

Now that all helper functions, the main normalization pipeline (`normalize_text_az`), and the main processing function (`process_file`) are defined, we can execute the process.

This final step iterates through the `datasets_to_process` list and calls the `process_file` function for each dataset.

The output will be 5 separate `.xlsx` files, saved in the `clean_data/` directory.

In [15]:
# ----------------------------------------------------------------
# 5.1: Run the main processing loop
# ----------------------------------------------------------------
print(f"Starting the processing of {len(datasets_to_process)} datasets...")
total_rows_processed = 0

for dataset in datasets_to_process:
    # Call the main function defined in Step 4
    rows = process_file(
        in_file=dataset["in_file"],
        text_col=dataset["text_col"],
        label_col=dataset["label_col"],
        out_file=dataset["out_file"]
    )
    total_rows_processed += rows

print("\n" + "="*30)
print("ALL PROCESSING COMPLETE.")
print(f"Total rows processed across all files: {total_rows_processed}")
print(f"Please check the '{OUTPUT_DIR}' directory for the 5 Excel files.")

Starting the processing of 5 datasets...

Processing: data/labeled-sentiment.xlsx...
Applying normalization pipeline...
SUCCESS: Saved 2955 rows to clean_data/labeled-sentiment_2col.xlsx

Processing: data/test__1_.xlsx...
Applying normalization pipeline...
SUCCESS: Saved 4198 rows to clean_data/test_1_2col.xlsx

Processing: data/train__3_.xlsx...
Applying normalization pipeline...
SUCCESS: Saved 19557 rows to clean_data/train_3_2col.xlsx

Processing: data/train-00000-of-00001.xlsx...
Applying normalization pipeline...
SUCCESS: Saved 41756 rows to clean_data/train-00000-of-00001_2col.xlsx

Processing: data/merged_dataset_CSV__1_.xlsx...
Applying normalization pipeline...
SUCCESS: Saved 55662 rows to clean_data/merged_dataset_CSV_1_2col.xlsx

ALL PROCESSING COMPLETE.
Total rows processed across all files: 124128
Please check the 'clean_data' directory for the 5 Excel files.


## Step 6: Training Word2Vec and FastText Embedding Models

Following the preprocessing steps, we now have five cleaned Excel files. The next task is to train the **Word2Vec** and **FastText** models as specified in the assignment.

The code below performs the following actions:
1.  Initializes an empty list called `sentences`.
2.  Loops through each of the five `_2col.xlsx` files and reads the `cleaned_text` column using `pandas`.
3.  Converts each row of cleaned text into a list of tokens (by splitting on spaces) and adds these lists to the main `sentences` collection.
4.  Creates the `embeddings/` directory if it doesn't already exist.
5.  Trains a `Word2Vec` model using the `sentences` corpus. Key parameters include `vector_size=300`, `window=5`, `min_count=3`, and `sg=1` (Skip-gram).
6.  Trains a `FastText` model using the same corpus and similar parameters, but also includes subword information (`min_n=3`, `max_n=6`).
7.  Saves both trained models to the `embeddings/` folder as `word2vec.model` and `fasttext.model`.
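
To make the subword parameters concrete, the sketch below enumerates the character n-grams of a word for `min_n=3` and `max_n=6`, following FastText's convention of wrapping each word in `<` and `>` boundary markers. `char_ngrams` is a hypothetical helper written for illustration, not a gensim function; the library additionally hashes these n-grams into a fixed number of buckets.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("pis"))
# ['<pi', 'pis', 'is>', '<pis', 'pis>', '<pis>']
```

Tokens that differ from a word only by attached punctuation share several of these n-grams with it, which may help explain why FastText neighbor lists tend to contain surface variants of the seed word.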

In [16]:
from gensim.models import Word2Vec, FastText
import pandas as pd
from pathlib import Path

files = [
    f"{OUTPUT_DIR}/labeled-sentiment_2col.xlsx",
    f"{OUTPUT_DIR}/test_1_2col.xlsx",
    f"{OUTPUT_DIR}/train_3_2col.xlsx",
    f"{OUTPUT_DIR}/train-00000-of-00001_2col.xlsx",
    f"{OUTPUT_DIR}/merged_dataset_CSV_1_2col.xlsx",
]

sentences = []
for f in files:
    df = pd.read_excel(f, usecols=["cleaned_text"])
    sentences.extend(df["cleaned_text"].astype(str).str.split().tolist())

Path("embeddings").mkdir(exist_ok=True)
w2v = Word2Vec(sentences=sentences, vector_size=300, window=5, min_count=3, sg=1,
               negative=10, epochs=10)
w2v.save("embeddings/word2vec.model")
ft = FastText(sentences=sentences, vector_size=300, window=5, min_count=3, sg=1,
              min_n=3, max_n=6, epochs=10)
ft.save("embeddings/fasttext.model")
print("Saved embeddings.")

Saved embeddings.


## Step 7: Model Evaluation: Word2Vec vs. FastText (Quantitative & Qualitative Metrics)

This section presents a comparative evaluation of the trained Word2Vec and FastText models. To assess their respective strengths in capturing the semantics of the Azerbaijani corpus, the analysis uses three evaluation metrics.

### 1. Lexical Coverage (Quantitative)

This metric quantifies the **vocabulary coverage** of each model, measuring the fraction of token occurrences in each cleaned dataset that are found within the model's learned vocabulary.

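This comparison matters because the two architectures handle unseen words differently. Word2Vec, being a word-level model, is inherently limited to its training vocabulary and cannot represent **out-of-vocabulary (OOV)** words. In contrast, FastText, which learns vectors for character n-grams (subwords), can construct vectors for words outside its vocabulary, including neologisms, misspellings, or rare words not encountered during training. Note that the coverage reported here is measured against each model's stored vocabulary (`key_to_index`), so it does not capture FastText's subword fallback; trained on the same corpus with the same `min_count`, both models report identical values.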
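
Coverage can be computed over token occurrences or over unique word types, and the two can differ noticeably because frequent words dominate the token-level figure. The following is a small sketch of both variants on a made-up corpus and toy vocabulary; the function name `coverage` is chosen for illustration.

```python
def coverage(tokens, vocab):
    """Fraction of items in `tokens` that appear in `vocab`."""
    return sum(1 for t in tokens if t in vocab) / max(1, len(tokens))

tokens = ["çox", "yaxşı", "çox", "film", "çox", "zəif"]  # made-up corpus
vocab = {"çox", "yaxşı"}                                  # toy vocabulary

print(coverage(tokens, vocab))       # token-level: 4 of 6 occurrences covered
print(coverage(set(tokens), vocab))  # type-level: 2 of 4 unique words covered
```

Passing a list of all tokens gives the token-level figure, while passing the set of unique tokens gives the type-level figure.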

### 2. Semantic Similarity (Quantitative)

A successful embedding model should capture meaningful **semantic relationships**, placing words with similar meanings close together in the vector space and words with opposite meanings far apart.

To quantify this, we measure the average **cosine similarity** for two predefined sets of word pairs:
* **Synonym Pairs** (e.g., `yaxşı`, `əla`): We expect a high similarity score (close to 1.0), indicating semantic proximity.
* **Antonym Pairs** (e.g., `yaxşı`, `pis`): We expect a low or negative similarity score (close to -1.0 or 0.0), indicating semantic distance.

A "Separation Score" (calculated as `Synonym Similarity - Antonym Similarity`) then summarizes in a single number how well the model discriminates between semantic similarity and opposition. A higher separation score is better. Because the pair lists are small (three synonym pairs and two antonym pairs), the score should be read as a rough indicator.
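
As a worked illustration of the separation score, the sketch below computes cosine similarity and the synonym/antonym difference on invented 2-D vectors. The vectors are toy values chosen to give round numbers; they are not taken from the trained models, and the helper names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_similarity(pairs, vecs):
    """Average cosine similarity over a list of word pairs."""
    sims = [cosine(vecs[a], vecs[b]) for a, b in pairs]
    return sum(sims) / len(sims)

# Invented toy vectors, for illustration only.
vecs = {"yaxşı": (1.0, 0.0), "əla": (3.0, 4.0), "pis": (0.0, 2.0)}

syn = mean_similarity([("yaxşı", "əla")], vecs)  # 0.6
ant = mean_similarity([("yaxşı", "pis")], vecs)  # 0.0
print(f"separation = {syn - ant:.3f}")           # separation = 0.600
```

A separation near zero means the model places antonyms about as close together as synonyms.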

### 3. Nearest Neighbors Analysis (Qualitative)

Beyond quantitative scores, a **qualitative analysis** of the embedding space is performed by inspecting the **nearest neighbors** for a set of predefined seed words.

By examining the top 5 most similar words for a given seed (e.g., `bahalı` or `pis`), we can intuitively assess the quality of the learned representations. This helps us judge whether the model has learned logical contexts (e.g., are the neighbors of "expensive" other price-related words?) or if it has merely learned superficial co-occurrence patterns.
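
Conceptually, a nearest-neighbor query ranks every other word by its cosine similarity to the seed and keeps the top `k`. The sketch below does this over a tiny dictionary of invented vectors; gensim's `most_similar` performs an equivalent ranking over the full vocabulary. The helper names and vectors are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest_neighbors(word, vecs, k=5):
    """Return the k words whose vectors are most similar to `word`."""
    query = vecs[word]
    scored = ((cosine(query, v), w) for w, v in vecs.items() if w != word)
    return [w for _, w in heapq.nlargest(k, scored)]

# Invented toy vectors, for illustration only.
vecs = {
    "bahalı": (1.0, 0.0),
    "ucuz": (0.8, 0.6),
    "qiymət": (0.6, 0.8),
    "film": (0.0, 1.0),
}

print(nearest_neighbors("bahalı", vecs, k=2))  # ['ucuz', 'qiymət']
```

Note that an antonym such as `ucuz` can rank as a close neighbor of `bahalı`, since words used in similar contexts tend to receive similar vectors.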

In [18]:
import pandas as pd
from gensim.models import Word2Vec, FastText
import re

w2v = Word2Vec.load("embeddings/word2vec.model")
ft  = FastText.load("embeddings/fasttext.model")

seed_words = ["yaxşı","pis","çox","bahalı","ucuz","mükəmməl","dəhşət","<PRICE>","<RATING_POS>"]
syn_pairs  = [("yaxşı","əla"), ("bahalı","qiymətli"), ("ucuz","sərfəli")]
ant_pairs  = [("yaxşı","pis"), ("bahalı","ucuz")]

def lexical_coverage(model, tokens):
    vocab = model.wv.key_to_index
    return sum(1 for t in tokens if t in vocab) / max(1,len(tokens))

files = [
    f"{OUTPUT_DIR}/labeled-sentiment_2col.xlsx",
    f"{OUTPUT_DIR}/test_1_2col.xlsx",
    f"{OUTPUT_DIR}/train_3_2col.xlsx",
    f"{OUTPUT_DIR}/train-00000-of-00001_2col.xlsx",
    f"{OUTPUT_DIR}/merged_dataset_CSV_1_2col.xlsx",
]

def read_tokens(f):
    df = pd.read_excel(f, usecols=["cleaned_text"])
    return [t for row in df["cleaned_text"].astype(str) for t in row.split()]

print("== Lexical coverage (per dataset) ==")
for f in files:
    toks = read_tokens(f)
    cov_w2v = lexical_coverage(w2v, toks)
    cov_ftv = lexical_coverage(ft, toks)  # FT still embeds OOV via subwords
    print(f"{f}: W2V={cov_w2v:.3f}, FT(vocab)={cov_ftv:.3f}")

from numpy import dot
from numpy.linalg import norm

def cos(a,b): return float(dot(a,b)/(norm(a)*norm(b)))

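def pair_sim(model, pairs):
    """Mean cosine similarity over the pairs the model can score."""
    vals = []
    for a, b in pairs:
        try:
            vals.append(model.wv.similarity(a, b))
        except KeyError:
            # Word2Vec raises KeyError for OOV words; such pairs are skipped.
            # FastText usually composes OOV vectors from subwords instead,
            # so the two models may average over different pairs.
            pass
    return sum(vals) / len(vals) if vals else float('nan')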

syn_w2v = pair_sim(w2v, syn_pairs)
syn_ft  = pair_sim(ft,  syn_pairs)
ant_w2v = pair_sim(w2v, ant_pairs)
ant_ft  = pair_sim(ft,  ant_pairs)

print("\n== Similarity (higher better for synonyms; lower better for antonyms) ==")
print(f"Synonyms: W2V={syn_w2v:.3f}, FT={syn_ft:.3f}")
print(f"Antonyms: W2V={ant_w2v:.3f}, FT={ant_ft:.3f}")
print(f"Separation (Syn - Ant): W2V={(syn_w2v - ant_w2v):.3f}, FT={(syn_ft - ant_ft):.3f}")

def neighbors(model, word, k=5):
    try:
        return [w for w, _ in model.wv.most_similar(word, topn=k)]
    except KeyError:
        return []

print("\n== Nearest neighbors (qualitative) ==")
for w in seed_words:
    print(f"  W2V NN for '{w}':", neighbors(w2v, w))
    print(f"  FT  NN for '{w}':", neighbors(ft,  w))

# (Optional) domain drift if you train domain-specific models separately:
# drift(word, model_a, model_b) = 1 - cos(vec_a, vec_b)

== Lexical coverage (per dataset) ==
clean_data/labeled-sentiment_2col.xlsx: W2V=0.920, FT(vocab)=0.920
clean_data/test_1_2col.xlsx: W2V=0.973, FT(vocab)=0.973
clean_data/train_3_2col.xlsx: W2V=0.976, FT(vocab)=0.976
clean_data/train-00000-of-00001_2col.xlsx: W2V=0.914, FT(vocab)=0.914
clean_data/merged_dataset_CSV_1_2col.xlsx: W2V=0.929, FT(vocab)=0.929

== Similarity (higher better for synonyms; lower better for antonyms) ==
Synonyms: W2V=0.340, FT=0.460
Antonyms: W2V=0.346, FT=0.411
Separation (Syn - Ant): W2V=-0.006, FT=0.049

== Nearest neighbors (qualitative) ==
  W2V NN for 'yaxşı': ['olardı', 'iyi', 'yaxshi', 'yaxşi', 'vətəndaşdır']
  FT  NN for 'yaxşı': ['yaxşı-yaxşı', 'yaxşı!', 'yaxşıı', 'yaxşı)', 'yaxşıkı']
  W2V NN for 'pis': ['pisdi', 'örnək', 'vərdişlərə', 'caynan', 'pisdir']
  FT  NN for 'pis': ['pis!', 'pis,', '(pis', 'pis.', 'pis.bu']
  W2V NN for 'çox': ['çoox', 'bəyəndim', 'tətbiqidir', 'gözəldir', 'çooxx']
  FT  NN for 'çox': ['çoxçox', 'çox.çox', '(çox', 'çoxx', '"