## Phase 2: Preprocessing and Dataset Preparation

**Objective**: Prepare the Bangla Sentiment Dataset (columns: `Tense`, `Label`) for model training so that it works with both traditional (TF-IDF-based) and neural (BanglaBERT) sentiment classifiers, while keeping the original class distribution in every split.



### Step 1: Clean Text (Remove Noise, Normalize Bangla Script)

- **Objective**: Clean the `Tense` column to remove noise (e.g., special characters, URLs) and normalize Bangla text for consistency.

In [1]:
import pandas as pd
import re
from bnlp import CleanText
    
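# Initialize BNLP cleaner. Only Unicode fixing and NFKC normalization are
# enabled here: every remove_* flag is False, so URLs, emails, emoji, digits,
# and punctuation pass through and the replace_with_* placeholders should have no effect.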
clean_text = CleanText(
    fix_unicode=True,
    unicode_norm=True,
    unicode_norm_form="NFKC",
    remove_url=False,
    remove_email=False,
    remove_emoji=False,
    remove_number=False,
    remove_digits=False,
    remove_punct=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_number="<NUMBER>",
    replace_with_digit="<DIGIT>",
    replace_with_punct = "<PUNC>"
)



In [2]:
# Load the dataset from phase 1
dataset_path = "data-source/cleaned_dataset.csv"
df = pd.read_csv(dataset_path, encoding="utf-8")

df.head(3)

Unnamed: 0,Tense,Label
0,জিনিসপত্রের অতিরিক্ত দাম বৃদ্ধির জন্য এই শহরে ...,0
1,সঠিক ভাবে তদারকি করলে এই সমস্যা থেকে পরিত্রান ...,1
2,দেশের টাকা যখন বিদেশে চোলে যাচ্ছে তখন দেশের সর...,0


In [3]:
import re
import unicodedata
from typing import Dict, List, Optional

class BanglaUnicodeNormalizer:
    """
    Comprehensive Bangla text normalizer using Unicode standards.
    Handles various inconsistencies while preserving linguistic accuracy.
    """
    
    def __init__(self):
        self._setup_unicode_mappings()
        self._compile_patterns()
    
    def _setup_unicode_mappings(self):
        """Define Unicode character mappings for normalization."""
        
        # Zero-width and invisible characters to remove/replace
        self.invisible_chars = {
            '\u200d': '',      # Zero Width Joiner (ZWJ)
            '\u200c': '',      # Zero Width Non-Joiner (ZWNJ)
            '\u00a0': ' ',     # Non-breaking space → regular space
            '\ufeff': '',      # Byte Order Mark (BOM)
            '\u2060': '',      # Word Joiner
            '\u061c': '',      # Arabic Letter Mark
        }
        
        # Bangla digit mappings (if normalization needed)
        self.digit_mappings = {
            '০': '0', '১': '1', '২': '2', '৩': '3', '৪': '4',
            '৫': '5', '৬': '6', '৭': '7', '৮': '8', '৯': '9'
        }
        
        # Punctuation normalization
        self.punctuation_mappings = {
            '॥': '।',         # Devanagari double danda → single danda
            '‍': '',           # Zero width joiner variants
            '‌': '',           # Zero width non-joiner variants
            '।।': '।',        # Double danda → single
        }
        
        # Vowel sign normalization (careful - these affect pronunciation)
        self.vowel_normalizations = {
            # Composite vowels that can be normalized
            '\u09c7\u09be': '\u09cb',  # ে + া = ো (e + aa = o)
            '\u09c7\u09d7': '\u09cc',  # ে + ৗ = ৌ (e + au-length = au)
        }
        
        # Character variants that should be standardized
        self.character_variants = {
            # Only include mappings you're absolutely certain about.
            # Placeholder: this entry maps ক্ষ to itself, so it is currently a no-op.
            '\u0995\u09cd\u09b7': '\u0995\u09cd\u09b7',
        }
        
        # Common OCR/typing errors (use with caution)
        self.ocr_corrections = {
            # Add only well-established corrections
            'ব়': 'ব',  # Remove nukta from ba if incorrectly added
            'জ়': 'জ',  # Remove nukta from ja if incorrectly added
        }
    
    def _compile_patterns(self):
        """Compile regex patterns for efficient processing."""
        
        # Pattern for duplicate diacritics
        self.duplicate_diacritics = re.compile(r'([ািীুূেৈোৌংঁঃ])\1+')
        
        # Pattern for multiple whitespace
        self.multiple_whitespace = re.compile(r'\s+')
        
        # Pattern for Bangla digits
        self.bangla_digits = re.compile(r'[০-৯]')
        
        # Pattern for ASCII digits
        self.ascii_digits = re.compile(r'[0-9]')
        
        # Pattern for multiple punctuation
        self.multiple_punct = re.compile(r'([।,;:!?])\1+')
        
        # Pattern for hasanta (virama) normalization
        self.hasanta_pattern = re.compile(r'\u09cd(?=\s|$)')
    
    def normalize_unicode(self, text: str, form: str = 'NFC') -> str:
        """
        Apply Unicode normalization.
        
        Args:
            text: Input text
            form: Unicode normalization form ('NFC', 'NFD', 'NFKC', 'NFKD')
        
        Returns:
            Unicode normalized text
        """
        return unicodedata.normalize(form, text)
    
    def remove_invisible_chars(self, text: str) -> str:
        """Remove or replace zero-width and other invisible characters."""
        # Every key is a single code point, so a translation table does the job
        # without recompiling a regex on each call.
        return text.translate(str.maketrans(self.invisible_chars))

    
    def normalize_whitespace(self, text: str) -> str:
        """Normalize whitespace characters."""
        # Replace multiple whitespace with single space
        text = self.multiple_whitespace.sub(' ', text)
        # Strip leading/trailing whitespace
        return text.strip()
    
    def normalize_punctuation(self, text: str) -> str:
        """Normalize punctuation marks."""
        for punct, normalized in self.punctuation_mappings.items():
            text = text.replace(punct, normalized)
        
        # Handle multiple consecutive punctuation
        text = self.multiple_punct.sub(r'\1', text)
        return text
    
    def normalize_vowels(self, text: str) -> str:
        """Normalize vowel combinations and signs."""
        for combination, normalized in self.vowel_normalizations.items():
            text = text.replace(combination, normalized)
        
        # Remove duplicate diacritics
        text = self.duplicate_diacritics.sub(r'\1', text)
        return text
    
    def normalize_digits(self, text: str, to_ascii: bool = False, remove: bool = False) -> str:
        """
        Normalize digit representations.
        
        Args:
            text: Input text
            to_ascii: Convert Bangla digits to ASCII
            remove: Remove all digits
        
        Returns:
            Text with normalized digits
        """
        if remove:
            text = self.bangla_digits.sub('', text)
            text = self.ascii_digits.sub('', text)
        elif to_ascii:
            for bangla, ascii_digit in self.digit_mappings.items():
                text = text.replace(bangla, ascii_digit)
        
        return text
    
    def apply_character_variants(self, text: str) -> str:
        """Apply character variant normalizations."""
        for variant, standard in self.character_variants.items():
            text = text.replace(variant, standard)
        return text
    
    def apply_ocr_corrections(self, text: str) -> str:
        """Apply common OCR error corrections (use with caution)."""
        for error, correction in self.ocr_corrections.items():
            text = text.replace(error, correction)
        return text
    
    def normalize_hasanta(self, text: str) -> str:
        """
        Normalize hasanta (virama) usage.
        Remove trailing hasanta that don't form conjuncts.
        """
        # Remove hasanta at word boundaries or end of text
        text = self.hasanta_pattern.sub('', text)
        return text
    
    def get_unicode_info(self, text: str) -> List[Dict]:
        """
        Get Unicode information for each character in text.
        Useful for debugging normalization issues.
        """
        info = []
        for char in text:
            info.append({
                'char': char,
                'unicode': f'U+{ord(char):04X}',
                'name': unicodedata.name(char, 'UNKNOWN'),
                'category': unicodedata.category(char),
                'combining': unicodedata.combining(char)
            })
        return info
    
    def normalize(self, 
                 text: str, 
                 unicode_form: str = 'NFC',
                 remove_digits: bool = False,
                 digits_to_ascii: bool = False,
                 apply_ocr_fixes: bool = False,
                 normalize_hasanta: bool = True) -> str:
        """
        Comprehensive text normalization.
        
        Args:
            text: Input text to normalize
            unicode_form: Unicode normalization form
            remove_digits: Remove all digits
            digits_to_ascii: Convert Bangla digits to ASCII
            apply_ocr_fixes: Apply OCR error corrections
            normalize_hasanta: Normalize hasanta usage
        
        Returns:
            Normalized text
        """
        
        # Step 1: Unicode normalization
        text = self.normalize_unicode(text, unicode_form)
        
        # Step 2: Remove invisible characters
        text = self.remove_invisible_chars(text)
        
        # Step 3: Normalize whitespace
        text = self.normalize_whitespace(text)
        
        # Step 4: Normalize punctuation
        text = self.normalize_punctuation(text)
        
        # Step 5: Normalize vowels and diacritics
        text = self.normalize_vowels(text)
        
        # Step 6: Handle digits
        text = self.normalize_digits(text, digits_to_ascii, remove_digits)
        
        # Step 7: Apply character variants
        text = self.apply_character_variants(text)
        
        # Step 8: Normalize hasanta (optional)
        if normalize_hasanta:
            text = self.normalize_hasanta(text)
        
        # Step 9: Apply OCR corrections (optional, use with caution)
        if apply_ocr_fixes:
            text = self.apply_ocr_corrections(text)
        
        return text

# Convenience function for quick normalization
def normalize_bangla_text(text: str, **kwargs) -> str:
    """
    Quick normalization function.
    
    Args:
        text: Text to normalize
        **kwargs: Additional options for normalization
    
    Returns:
        Normalized text
    """
    normalizer = BanglaUnicodeNormalizer()
    return normalizer.normalize(text, **kwargs)
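
As a quick check of what the `NFC` step does on its own, the sketch below uses only the standard library. It shows that NFC already composes the split vowel sign ে + া into ো (the same pair listed in `vowel_normalizations`), while a precomposed ড় (U+09DC) comes out as ড plus a separate nukta, because that code point is excluded from composition. The helper name `codepoints` is made up for this illustration.

```python
import unicodedata
from typing import List

def codepoints(text: str) -> List[str]:
    """Return the code points of text as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Split vowel sign: ka (U+0995) + e-kar (U+09C7) + aa-kar (U+09BE)
split_o = "\u0995\u09c7\u09be"
composed = unicodedata.normalize("NFC", split_o)

# Precomposed RRA (U+09DC) is a composition exclusion, so NFC decomposes it
rra = "\u09dc"
rra_nfc = unicodedata.normalize("NFC", rra)

print(codepoints(composed))
print(codepoints(rra_nfc))
```

One consequence is that a nukta (U+09BC) in NFC text is not necessarily a typing error, which is a reason to keep nukta-stripping rules as narrow as the two entries in `ocr_corrections`.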

In [4]:
# Initialize the normalizer once
normalizer = BanglaUnicodeNormalizer()

def preprocess_text(text: str) -> str:
    """
    Clean and normalize Bangla text using BNLP and custom rules.
    """
    # Step 1: BNLP text cleaning (Unicode fixes and NFKC normalization only)
    cleaned = clean_text(text)
    
    # Step 2: Remove URLs and hashtags (BNLP leaves them in place because remove_url=False)
    cleaned = re.sub(r'(https?://\S+|www\.\S+|#\S+)', '', cleaned)
    
    # Step 3: Unicode normalization (NFC form)
    cleaned = normalizer.normalize_unicode(cleaned, form='NFC')
    
    # Step 4: Remove invisible/control characters (ZWJ, ZWNJ, etc.)
    cleaned = normalizer.remove_invisible_chars(cleaned)
    
    # Step 5: Normalize punctuation and vowel signs
    cleaned = normalizer.normalize_punctuation(cleaned)
    cleaned = normalizer.normalize_vowels(cleaned)
    
    # Step 6: Remove all digits (Bangla and ASCII); OCR correction stays disabled
    cleaned = normalizer.normalize_digits(cleaned, remove=True)
    # cleaned = normalizer.apply_ocr_corrections(cleaned)

    # Step 7: Collapse whitespace last, since the removals above can leave gaps
    cleaned = normalizer.normalize_whitespace(cleaned)

    return cleaned

df['Tense_Cleaned'] = df['Tense'].apply(preprocess_text)
    
# Check sample
print("Sample Cleaned Text:")
print(df[['Tense', 'Tense_Cleaned']].head(5))

Sample Cleaned Text:
                                               Tense  \
0  জিনিসপত্রের অতিরিক্ত দাম বৃদ্ধির জন্য এই শহরে ...   
1  সঠিক ভাবে তদারকি করলে এই সমস্যা থেকে পরিত্রান ...   
2  দেশের টাকা যখন বিদেশে চোলে যাচ্ছে তখন দেশের সর...   
3          ওনার মতো ব্যর্থ মন্ত্রীর পদত্যাগ করা উচিত   
4                 আল্লাহ তোদের বিচার করবে অপেক্ষা কর   

                                       Tense_Cleaned  
0  জিনিসপত্রের অতিরিক্ত দাম বৃদ্ধির জন্য এই শহরে ...  
1  সঠিক ভাবে তদারকি করলে এই সমস্যা থেকে পরিত্রান ...  
2  দেশের টাকা যখন বিদেশে চোলে যাচ্ছে তখন দেশের সর...  
3          ওনার মতো ব্যর্থ মন্ত্রীর পদত্যাগ করা উচিত  
4                 আল্লাহ তোদের বিচার করবে অপেক্ষা কর  
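
The five sample rows contain no URLs, hashtags, or digits, so they come out unchanged. To see the removal rules act on something, here is a minimal standalone sketch that applies the same three regex ideas (URL/hashtag removal, digit removal, whitespace collapsing) to a made-up sentence. The function name `strip_noise` and the sample string are illustrative and not part of the notebook.

```python
import re

URL_OR_HASHTAG = re.compile(r"(https?://\S+|www\.\S+|#\S+)")
ANY_DIGIT = re.compile(r"[০-৯0-9]")      # Bangla and ASCII digits
WHITESPACE = re.compile(r"\s+")

def strip_noise(text: str) -> str:
    """Drop URLs, hashtags, and digits, then collapse leftover whitespace."""
    text = URL_OR_HASHTAG.sub("", text)
    text = ANY_DIGIT.sub("", text)
    return WHITESPACE.sub(" ", text).strip()

sample = "দাম ১০০ টাকা https://example.com #বাজার"
print(strip_noise(sample))

# A sentence from the dataset preview has nothing to remove
unchanged = "ওনার মতো ব্যর্থ মন্ত্রীর পদত্যাগ করা উচিত"
print(strip_noise(unchanged) == unchanged)
```

Collapsing whitespace as the final step matters here: deleting the digits and the URL each leaves a gap that would otherwise survive as a double space.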


In [6]:
# Save the cleaned dataset
df.to_csv("outputs/cleaned_dataset.csv", encoding='utf-8', index=False)

In [7]:
# Check for null values in 'Tense_Cleaned' column
null_count = df['Tense_Cleaned'].isnull().sum()
print(f"Total null values in 'Tense_Cleaned': {null_count}")


Total null values in 'Tense_Cleaned': 0
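
A null count of zero does not rule out rows that cleaning reduced to an empty string: those are not null in memory, but pandas reads empty CSV fields back as `NaN` by default, so they can surface as missing values when the file is reloaded in a later phase. A small sketch of a stricter check follows; the helper name `count_blank_rows` and the demo series are assumptions for illustration.

```python
import pandas as pd

def count_blank_rows(series: pd.Series) -> int:
    """Count rows that are null or contain only whitespace."""
    as_text = series.fillna("").astype(str)
    return int((as_text.str.strip() == "").sum())

# Toy data: one real sentence, one empty string, one whitespace-only string, one null
demo = pd.Series(["ভালো", "", "   ", None])
print("null rows:", int(demo.isnull().sum()))
print("blank or null rows:", count_blank_rows(demo))
```

On the real data this would be called as `count_blank_rows(df['Tense_Cleaned'])`; any non-zero result points at rows worth dropping or inspecting before vectorization.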


### Step 2: Tokenize Texts for Traditional (TF-IDF) and Neural Models (BanglaBERT-Compatible Tokens)

- **Objective**: Tokenize cleaned text for traditional models (TF-IDF vectors) and neural models (BanglaBERT tokens).

In [8]:
# Import necessary libraries
from bnlp import NLTKTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer
import numpy as np

In [9]:
# Initialize BNLP tokenizer
bnlp_tokenizer = NLTKTokenizer()

# TF-IDF Tokenization    
def bnlp_tokenize(text):
    return [token.strip().lower() for token in bnlp_tokenizer.word_tokenize(text)]

In [10]:
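# Text representation using TF-IDF.
# token_pattern=None avoids the warning scikit-learn emits when a custom tokenizer is supplied.
tfidf_vectorizer = TfidfVectorizer(tokenizer=bnlp_tokenize, token_pattern=None, max_features=5000)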

tfidf_matrix = tfidf_vectorizer.fit_transform(df['Tense_Cleaned'])
    

In [11]:
# Validate tokenization
print("Sample TF-IDF Tokens:", bnlp_tokenizer.word_tokenize(df['Tense_Cleaned'].iloc[0])[:10])

Sample TF-IDF Tokens: ['জিনিসপত্রের', 'অতিরিক্ত', 'দাম', 'বৃদ্ধির', 'জন্য', 'এই', 'শহরে', 'জীবন', 'ধারণ', 'করা']


In [12]:
# Save the full TF-IDF matrix in SciPy's sparse .npz format.
# (np.savez with hand-picked keys omits the `format` entry that scipy.sparse.load_npz expects.)
import scipy.sparse as sp

sp.save_npz("text_representation/tfidf_matrix.npz", tfidf_matrix)
 

In [13]:
# BanglaBERT Tokenization
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")

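# Convenience wrapper for a single text; the batched loop in the next cell calls the tokenizer directly
def tokenize_for_bert(text):
    return tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='np')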

In [14]:
from tqdm import tqdm

# Tokenize in batches
batch_size = 1000
input_ids = []
attention_masks = []

for i in tqdm(range(0, len(df), batch_size), desc="Tokenizing"):
    batch_texts = df['Tense_Cleaned'].values[i:i+batch_size].tolist()

    # Tokenize with padding, truncation, and return tensors as NumPy-compatible
    batch_tokens = tokenizer(
        batch_texts,
        padding='max_length',
        truncation=True,
        max_length=128,
        return_tensors='np'  # Ensures NumPy format
    )

    input_ids.append(batch_tokens['input_ids'])
    attention_masks.append(batch_tokens['attention_mask'])

Tokenizing: 100%|██████████| 8/8 [00:01<00:00,  5.78it/s]
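
To make the effect of `padding='max_length'` and `truncation=True` concrete, the following hand-rolled sketch pads or cuts a list of token IDs to a fixed length and builds the matching attention mask. It is an illustration of the idea only, not the tokenizer's implementation: the function name, the pad ID of 0, and the sample IDs are assumptions, and a real tokenizer also keeps the closing special token when it truncates.

```python
from typing import List, Tuple

def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = ids[:max_length]                      # truncation
    n_pad = max_length - len(kept)               # padding needed, 0 if truncated
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Hypothetical token IDs: a short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row ends up with the same length, the per-batch arrays can later be stacked into one rectangular array, and the mask tells the model which positions are padding.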


In [15]:
# Stack the per-batch arrays into single (n_samples, 128) arrays
bert_input_ids = np.vstack(input_ids)
bert_attention_masks = np.vstack(attention_masks)

In [16]:
# Validate tokenization
print("Sample BERT Tokens:", tokenizer.convert_ids_to_tokens(bert_input_ids[0][:10]))

Sample BERT Tokens: ['[CLS]', 'জিনিস', '##পত', '##রে', '##র', 'অতি', '##রিক', '##ত', 'দাম', 'বদ']


In [17]:
# Save both input_ids and attention_masks
np.save("text_representation/bert_input_ids.npy", bert_input_ids)
np.save("text_representation/bert_attention_masks.npy", bert_attention_masks)


### Step 3: Split Dataset into Training, Validation, and Test Sets (Stratified)

- **Objective**: Split the dataset into training (80%), validation (10%), and test (10%) sets, preserving class imbalance.

In [18]:
from sklearn.model_selection import train_test_split
    
# Stratified split
X = df['Tense_Cleaned']
y = df['Label']

# First split: 90% temp, 10% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# Second split: ~80% train, ~10% val (because 0.1111 * 0.9 ≈ 0.1)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1111, stratify=y_temp, random_state=42
)
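
The two-stage split can be checked by hand. The sketch below assumes that scikit-learn rounds the requested test fraction up to a whole number of rows and leaves the remainder for training, and it uses the 7743 rows reported for the full dataset later in this notebook. The helper `split_sizes` is illustrative only.

```python
import math
from typing import Tuple

def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 7743                                   # rows in the full dataset
n_temp, n_test = split_sizes(n_total, 0.1)       # first split: temp vs test
n_train, n_val = split_sizes(n_temp, 0.1111)     # second split: train vs val
print(n_train, n_val, n_test)
```

Under that assumption the sizes come out as 6193 training, 775 validation, and 775 test rows, which agrees with the 6193-row training shapes printed when the saved files are verified.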

In [19]:
# Verify distributions
print("Training Set Distribution:\n", y_train.value_counts(normalize=True) * 100)
print("Validation Set Distribution:\n", y_val.value_counts(normalize=True) * 100)
print("Test Set Distribution:\n", y_test.value_counts(normalize=True) * 100)

Training Set Distribution:
 Label
0    47.359922
2    29.081221
1    23.558857
Name: proportion, dtype: float64
Validation Set Distribution:
 Label
0    47.354839
2    29.032258
1    23.612903
Name: proportion, dtype: float64
Test Set Distribution:
 Label
0    47.354839
2    29.032258
1    23.612903
Name: proportion, dtype: float64


Split the TF-IDF matrix and the BERT token arrays using the same row indices:

In [20]:
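# Get the row indices of each split. These are index labels; they match row
# positions only because df still has the default RangeIndex from read_csv.
train_idx, val_idx, test_idx = X_train.index, X_val.index, X_test.index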

In [21]:
# Split tf-idf tokens
tfidf_train = tfidf_matrix[train_idx]
tfidf_val = tfidf_matrix[val_idx]
tfidf_test = tfidf_matrix[test_idx]
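
Note that `tfidf_vectorizer` was fitted on the full `Tense_Cleaned` column before splitting, so its vocabulary and IDF weights also reflect validation and test rows. If that is a concern for evaluation, one alternative is to fit on the training texts only and merely transform the other splits. The sketch below shows the pattern on a toy corpus; `str.split` stands in for `bnlp_tokenize`, and the sentences and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for X_train and X_val
train_texts = ["দাম অনেক বেশি", "সেবা খুব ভালো", "দাম কমানো উচিত"]
val_texts = ["সেবা ভালো নয়", "নতুন শব্দ"]

# Whitespace tokenizer used here in place of the BNLP tokenizer (assumption)
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, max_features=5000)

tfidf_train_only = vectorizer.fit_transform(train_texts)   # learns vocabulary + IDF
tfidf_val_only = vectorizer.transform(val_texts)           # reuses them, learns nothing

print(tfidf_train_only.shape, tfidf_val_only.shape)
print("নতুন" in vectorizer.vocabulary_)
```

Words that appear only in the validation texts are ignored at transform time, so a validation row made entirely of unseen words becomes an all-zero vector. That is the behaviour a deployed model would face with genuinely new text.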

In [22]:
# Split bert tokens
bert_train_ids = bert_input_ids[train_idx]
bert_val_ids = bert_input_ids[val_idx]
bert_test_ids = bert_input_ids[test_idx]

bert_train_masks = bert_attention_masks[train_idx]
bert_val_masks = bert_attention_masks[val_idx]
bert_test_masks = bert_attention_masks[test_idx]    

In [23]:
# Save Split indices
split_indices = pd.DataFrame({
    'Index': list(train_idx) + list(val_idx) + list(test_idx),
    'Split': ['Train']*len(train_idx) + ['Val']*len(val_idx) + ['Test']*len(test_idx)
})

split_indices.to_csv("text_representation/split_indices.csv", index=False)
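
Since every downstream array is sliced with these indices, it can be worth confirming that the three splits are disjoint and together cover every row. A minimal standard-library sketch is shown here; the function name `check_split_indices` and the toy indices are assumptions for illustration.

```python
from typing import Dict, Iterable, Set

def check_split_indices(splits: Dict[str, Iterable[int]], n_rows: int) -> None:
    """Raise ValueError unless the splits partition 0..n_rows-1 exactly."""
    seen: Set[int] = set()
    for name, idx in splits.items():
        idx_list = list(idx)
        idx_set = set(idx_list)
        if len(idx_set) != len(idx_list):
            raise ValueError(f"{name} contains duplicate indices")
        overlap = seen & idx_set
        if overlap:
            raise ValueError(f"{name} overlaps earlier splits at {sorted(overlap)[:5]}")
        seen |= idx_set
    if seen != set(range(n_rows)):
        raise ValueError("splits do not cover every row exactly once")

# Toy example with 6 rows
check_split_indices({"Train": [0, 2, 4, 5], "Val": [1], "Test": [3]}, n_rows=6)
print("splits are disjoint and complete")
```

With the notebook's variables the call would look like `check_split_indices({"Train": train_idx, "Val": val_idx, "Test": test_idx}, n_rows=len(df))`.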


### Step 4: Save Preprocessed Data

- **Objective**: Save preprocessed datasets in formats suitable for traditional and neural models, ensuring compatibility with Phase 3 (model training).

In [24]:
# Save Sparse TF-IDF Matrices
import scipy.sparse as sp

sp.save_npz("text_representation/tfidf_train.npz", tfidf_train)
sp.save_npz("text_representation/tfidf_val.npz", tfidf_val)
sp.save_npz("text_representation/tfidf_test.npz", tfidf_test)

In [25]:
# Save labels
pd.DataFrame({'Label': y_train}).to_csv("text_representation/labels_train.csv", encoding='utf-8', index=False)
pd.DataFrame({'Label': y_val}).to_csv("text_representation/labels_val.csv", encoding='utf-8', index=False)
pd.DataFrame({'Label': y_test}).to_csv("text_representation/labels_test.csv", encoding='utf-8', index=False)    

In [26]:
# Save cleaned text splits
pd.DataFrame({'Tense_Cleaned': X_train}).to_csv("text_representation/text_train.csv", encoding='utf-8', index=False)
pd.DataFrame({'Tense_Cleaned': X_val}).to_csv("text_representation/text_val.csv", encoding='utf-8', index=False)
pd.DataFrame({'Tense_Cleaned': X_test}).to_csv("text_representation/text_test.csv", encoding='utf-8', index=False)    
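
The per-split BERT arrays (`bert_train_ids`, `bert_train_masks`, and so on) are computed above but only the full arrays are written to disk. If Phase 3 should load ready-made splits rather than re-slice with `split_indices.csv`, a helper along the following lines could save them. The function name and the `bert_<split>_ids.npy` / `bert_<split>_masks.npy` file names are hypothetical, and the demo uses toy arrays in a temporary directory.

```python
import os
import tempfile
import numpy as np

def save_bert_split(out_dir, name, input_ids, attention_masks, idx):
    """Save the rows selected by idx as bert_<name>_ids.npy and bert_<name>_masks.npy."""
    idx = np.asarray(idx)
    np.save(os.path.join(out_dir, f"bert_{name}_ids.npy"), input_ids[idx])
    np.save(os.path.join(out_dir, f"bert_{name}_masks.npy"), attention_masks[idx])

# Toy stand-ins for bert_input_ids / bert_attention_masks (5 rows, length 4)
toy_ids = np.arange(20).reshape(5, 4)
toy_masks = np.ones((5, 4), dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp:
    save_bert_split(tmp, "train", toy_ids, toy_masks, [0, 2, 4])
    reloaded = np.load(os.path.join(tmp, "bert_train_ids.npy"))
    print(reloaded.shape)
```

In the notebook this would be called once per split, for example `save_bert_split("text_representation", "train", bert_input_ids, bert_attention_masks, train_idx)`.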

In [27]:
# Verify saved files
print("TF-IDF Train Shape:", sp.load_npz("text_representation/tfidf_train.npz").shape)
print("Labels Train Shape:", pd.read_csv("text_representation/labels_train.csv").shape)
# bert_input_ids.npy holds all 7743 rows, not just the training split
print("BERT Input IDs Shape (full dataset):", np.load("text_representation/bert_input_ids.npy").shape)

TF-IDF Train Shape: (6193, 5000)
Labels Train Shape: (6193, 1)
BERT Input IDs Shape (full dataset): (7743, 128)


In [28]:
# Write a README describing the saved files
with open("text_representation/preprocessed_data_README.md", "w", encoding='utf-8') as f:
    f.write(
        "# Preprocessed Bangla Sentiment Dataset\n\n"
        "This folder contains the files needed to train and evaluate sentiment classification models on Bangla text data.\n\n"
        "## Contents\n"
        "- **`cleaned_dataset.csv`**: Original dataset with an additional cleaned text column (`Tense_Cleaned`). Saved separately under `outputs/`, not in this folder.\n"
        "- **`tfidf_matrix.npz`**: Full sparse TF-IDF matrix for all samples.\n"
        "- **`tfidf_train.npz`**, **`tfidf_val.npz`**, **`tfidf_test.npz`**: Sparse TF-IDF matrices for the training, validation, and test sets respectively.\n"
        "- **`labels_train.csv`**, **`labels_val.csv`**, **`labels_test.csv`**: Sentiment labels corresponding to each data split.\n"
        "- **`bert_input_ids.npy`**: Tokenized input IDs for all samples, generated using the `BanglaBERT` tokenizer (`sagorsarker/bangla-bert-base`). Use `split_indices.csv` to select the rows of each split.\n"
        "- **`bert_attention_masks.npy`**: Attention masks for the BERT inputs, row-aligned with `bert_input_ids.npy`.\n"
        "- **`text_train.csv`**, **`text_val.csv`**, **`text_test.csv`**: Cleaned Bangla text for each split.\n"
        "- **`split_indices.csv`**: Index mapping of samples to their respective dataset split (Train/Val/Test).\n\n"
        "## Notes\n"
        "- Labels are encoded as: `0 = Negative`, `1 = Positive`, `2 = Neutral`\n"
        "- All text has been preprocessed using BNLP, Unicode normalization, and regex cleaning; URLs, hashtags, and digits are removed.\n"
        "- BERT tokens are padded or truncated to a fixed length of 128.\n"
        "- The split TF-IDF matrices are saved in SciPy's `.npz` sparse format and load with `scipy.sparse.load_npz`.\n"
    )
