# Emotion Classification in Persian Tweets
## Using CAR and Multinomial Naive Bayes Approaches


### Importing Required Libraries
This cell imports all necessary libraries for:
- Text processing (hazm for Persian NLP)
- Data manipulation (pandas, numpy)
- Machine learning (sklearn)
- Model evaluation and preprocessing
- File handling and system operations

In [1]:
import pandas as pd
import re
import hazm
from hazm import Stemmer, Normalizer, Lemmatizer, word_tokenize
from tqdm import tqdm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
import pathlib

### Text Preprocessing Setup
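Initializes the hazm tools for Persian and defines two preprocessing functions:
- `preprocess_text()`: normalizes the text with hazm, then deletes every character outside the Persian letter range (Latin letters, digits, punctuation, emoji, `#`)
- `extract_tokens()`: tokenizes, lemmatizes, and stems the cleaned text, then removes stopwords and tokens shorter than three characters
- The result is a space-separated string ready for vectorization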

In [2]:
# Initialize NLP tools
stemmer = Stemmer()
stopwords = hazm.stopwords_list()
normalizer = Normalizer()
lemmatizer = Lemmatizer()

def preprocess_text(text):
    text = normalizer.normalize(text)                 # Normalize with hazm first
    text = text.lower()                               # No effect on Persian letters; Latin text is removed on the next line
    # Keep only characters in the range آ-ی plus whitespace
    # (drops '#', digits, Latin letters, emoji, and the ZWNJ half-space)
    text = re.sub(r'[^آ-ی\s]', '', text)
    return text

def extract_tokens(text):
    text = preprocess_text(text)
    tokens = word_tokenize(text)                      # Tokenize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize
    tokens = [stemmer.stem(token) for token in tokens]          # Stem each lemma
    tokens = [token for token in tokens if token not in stopwords]  # Remove stopwords
    tokens = [token for token in tokens if len(token) > 2]  # Drop tokens of one or two characters
    return ' '.join(tokens)  # Return as space-separated string
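
To make the effect of the character filter in `preprocess_text()` concrete, the sketch below applies the same character class on its own, without hazm. It assumes the range `آ-ی` corresponds to U+0622 through U+06CC; the helper name `strip_non_persian` and the sample strings are illustrative and not part of the notebook.

```python
import re

# Same character class as in preprocess_text(), written with escapes.
# Assumption: the range آ-ی corresponds to U+0622..U+06CC.
NON_PERSIAN = re.compile("[^\u0622-\u06CC\\s]")

def strip_non_persian(text):
    """Delete every character that is neither in the range nor whitespace."""
    return NON_PERSIAN.sub("", text)

# The hashtag sign, Latin letters and ASCII digits are removed; spaces survive.
hashtag_sample = "#\u062e\u0648\u0628 ok 12"            # "#khub ok 12"
cleaned = strip_non_persian(hashtag_sample)

# The zero-width non-joiner (U+200C) is outside the range and is not
# whitespace, so the two halves of a compound word are glued together.
zwnj_sample = "\u0645\u06cc\u200c\u0631\u0648\u0645"     # "mi<ZWNJ>ravam"
merged = strip_non_persian(zwnj_sample)

print(cleaned.split())
print("\u200c" in zwnj_sample, "\u200c" in merged)
```

If the half-space matters for tokenization, one option is to add `\u200c` to the character class. Whether that improves results here would need to be checked against hazm's tokenizer.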

### Dataset Loading and Initial Processing
- Sets up directory structure for dataset and vocabulary
- Processes each emotion category (anger, fear, joy, sad, disgust, surprise)
- Applies text preprocessing to all tweets
- Displays initial dataset distribution showing class imbalance
- Key observation: the classes are heavily imbalanced (34,328 'sad' vs 925 'disgust', a ratio of roughly 37 to 1)


In [3]:
# Set up directories
dataset_dir = (pathlib.Path().absolute() / '../dataset').resolve()
vocab_dir = (pathlib.Path().absolute() / '../vocab').resolve()
vocab_dir.mkdir(parents=True, exist_ok=True)

# Process dataset
categories = ['anger', 'fear', 'joy', 'sad', 'disgust', 'surprise']
frames = []

for category in categories:
    print(f"Processing {category} category")
    file = dataset_dir / 'raw' / f"{category}.csv"
    df = pd.read_csv(file)
    df['label'] = category
    df['processed_text'] = df['tweet'].apply(extract_tokens)
    frames.append(df)

# Concatenate once; ignore_index gives every row a unique index
dataset_df = pd.concat(frames, ignore_index=True)

# Print dataset statistics
print("\nDataset distribution:")
print(dataset_df['label'].value_counts())

Processing anger category
Processing fear category
Processing joy category
Processing sad category
Processing disgust category
Processing surprise category

Dataset distribution:
label
sad         34328
joy         28024
anger       20069
fear        17624
surprise    12859
disgust       925
Name: count, dtype: int64
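
The printed counts can be turned into class shares and an imbalance ratio with a few lines of arithmetic. The sketch below only reuses the numbers shown in the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "sad": 34328, "joy": 28024, "anger": 20069,
    "fear": 17624, "surprise": 12859, "disgust": 925,
}

total = sum(counts.values())
largest = max(counts.values())
smallest = min(counts.values())

# Share of each class in the raw dataset, in percent.
shares = {label: 100 * n / total for label, n in counts.items()}

# Imbalance ratio between the largest and the smallest class.
imbalance_ratio = largest / smallest

# Dataset size if every class is brought up to the size of the largest one.
balanced_total = largest * len(counts)

for label, share in shares.items():
    print(f"{label:<9}{counts[label]:>7}{share:>7.1f}%")
print(f"total: {total}, ratio: {imbalance_ratio:.1f}, balanced: {balanced_total}")
```

A 20% test split of the 205,968 balanced rows comes to 41,194 rows after rounding up, which matches the support totals in the classification reports further down.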


### Multinomial Naive Bayes Implementation
- Initializes a TF-IDF vectorizer with the following settings:
  - 10,000 max features
  - Minimum document frequency of 5
  - Maximum document frequency of 95%
  - Uses both unigrams and bigrams
- Transforms text data into TF-IDF features
- Prepares labels for classification

In [4]:

# Initialize TF-IDF vectorizer with limited vocabulary size
tfidf = TfidfVectorizer(
    max_features=10000,  # Keep the 10,000 most frequent terms (unigrams and bigrams)
    min_df=5,           # Ignore terms that appear in fewer than 5 documents
    max_df=0.95,        # Ignore terms that appear in more than 95% of documents
    ngram_range=(1, 2)  # Use both unigrams and bigrams
)

# Transform text to TF-IDF features
X = tfidf.fit_transform(dataset_df['processed_text'])
y = dataset_df['label'].values
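
How `min_df`, `max_df`, and `ngram_range` prune the vocabulary is easier to see on a tiny corpus. The sketch below uses three made-up English documents and a lower `min_df` than the notebook so that the effect is visible; it is an illustration, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy English corpus (hypothetical, for illustration only).
docs = ["good day", "good day today", "bad day"]

vectorizer = TfidfVectorizer(
    min_df=2,            # keep terms found in at least 2 documents
    max_df=0.95,         # drop terms found in more than 95% of documents
    ngram_range=(1, 2),  # unigrams and bigrams
)
X_toy = vectorizer.fit_transform(docs)

# "day" occurs in all 3 documents (100% > 95%) and is dropped by max_df.
# "today", "bad", "day today" and "bad day" occur once and are dropped by min_df.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(X_toy.shape)
```

Only `good` and the bigram `good day` survive, so the third document ends up as an all-zero row. The same can happen to short tweets in the real dataset once rare and very common terms are pruned.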

### Data Balancing and Splitting
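- Applies RandomOverSampler to address class imbalance:
  - Randomly duplicates existing minority-class rows (sampling with replacement) until every class matches the largest one; no synthetic samples are generated
  - Gives all six emotions equal representation
  - Helps prevent model bias towards majority classes
- Splits the balanced dataset:
  - 80% for training
  - 20% for testing
  - Uses random state for reproducibility
- Why this step matters:
  - The classifiers see enough minority-class examples to learn their patterns
  - Per-class metrics are computed on equally sized classes
- Caveat: the oversampling runs before the split, so copies of the same tweet can land in both the training and the test set. The scores reported below are therefore likely to overstate performance on unseen tweets, most of all for 'disgust', whose 925 original rows are duplicated up to 34,328.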

In [5]:
# Oversample minority classes
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Split dataset
x_train, x_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=0)
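
The order of oversampling and splitting decides whether duplicated rows can cross from the training set into the test set. The sketch below illustrates this with plain Python on made-up `(tweet_id, label)` pairs; the `oversample`, `split`, and `leaked` helpers are hand-rolled stand-ins for `RandomOverSampler` and `train_test_split`, not the library implementations.

```python
import random

def oversample(rows, rng):
    """Random oversampling: duplicate rows of smaller classes until all are equal."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    out = []
    for group in by_label.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend(group + extra)
    return out

def split(rows, rng, test_fraction=0.2):
    """Shuffle a copy of the rows and cut off the last test_fraction as test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def leaked(train, test):
    """Count test rows whose tweet id also occurs in the training set."""
    train_ids = {tweet_id for tweet_id, _ in train}
    return sum(1 for tweet_id, _ in test if tweet_id in train_ids)

# Hypothetical toy data: 40 majority rows and 5 minority rows with unique ids.
rows = [(i, "sad") for i in range(40)] + [(100 + i, "disgust") for i in range(5)]

# Order used in the notebook: oversample first, then split.
rng = random.Random(0)
train_a, test_a = split(oversample(rows, rng), rng)

# Alternative order: split the original rows, then oversample the training part only.
rng = random.Random(0)
train_b, test_b = split(rows, rng)
train_b = oversample(train_b, rng)

print("oversample then split, leaked test rows:", leaked(train_a, test_a))
print("split then oversample, leaked test rows:", leaked(train_b, test_b))
```

With the second order, no test row can have a copy in the training set, because duplicates are only ever created from training rows. In the notebook's terms this would mean calling `train_test_split` on `X` and `y` first and then `ros.fit_resample` on the training part only; the test set then keeps the original class distribution, so the per-class supports would differ from the ones reported here.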

### Model Training and Evaluation (MNB)
- Trains Multinomial Naive Bayes classifier
- Makes predictions on test set
- Evaluates model performance:
  - Overall accuracy: 81%
  - Detailed classification report for each emotion
- Saves model and vectorizer for future use

In [6]:

# Train model
model = MultinomialNB()
model.fit(x_train, y_train)

# Make predictions
y_pred = model.predict(x_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# Save the model and vectorizer for later use
import joblib
model_dir = (pathlib.Path().absolute() / '../models').resolve()
model_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(model, model_dir / 'emotion_classifier.joblib')
joblib.dump(tfidf, model_dir / 'tfidf_vectorizer.joblib')


Model Accuracy: 0.81

Detailed Classification Report:
              precision    recall  f1-score   support

       anger       0.79      0.77      0.78      6884
     disgust       0.82      0.96      0.88      6945
        fear       0.86      0.76      0.81      6898
         joy       0.82      0.76      0.78      6836
         sad       0.77      0.75      0.76      6850
    surprise       0.80      0.86      0.83      6781

    accuracy                           0.81     41194
   macro avg       0.81      0.81      0.81     41194
weighted avg       0.81      0.81      0.81     41194



['/home/aref/projects/uni-related/hashtag-suggestion/models/tfidf_vectorizer.joblib']
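
Reusing the saved artifacts later means loading the classifier and the vectorizer together and calling `transform()` rather than `fit_transform()`. The sketch below shows the round trip on a toy model in a temporary directory; the training texts are made up, and in the notebook a raw tweet would first pass through `extract_tokens()`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the processed tweets.
texts = ["happy sunny day", "great happy news", "sad rainy day", "sad bad news"]
labels = ["joy", "joy", "sad", "sad"]

vectorizer = TfidfVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "emotion_classifier.joblib"
    vectorizer_path = Path(tmp) / "tfidf_vectorizer.joblib"
    joblib.dump(classifier, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Later session: load both artifacts together.
    loaded_classifier = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# transform() keeps the vocabulary and IDF weights learned at training time;
# fit_transform() would rebuild them from the new texts.
new_texts = ["happy news", "rainy day"]
before = classifier.predict(vectorizer.transform(new_texts))
after = loaded_classifier.predict(loaded_vectorizer.transform(new_texts))
print(list(before), list(after))
```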

### CAR Model Implementation
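- Applies feature scaling using StandardScaler with `with_mean=False`, which scales each feature to unit variance without centering so the TF-IDF matrix stays sparse
- Calculates class weights as n_samples / (n_classes * class_count); because the training data is already oversampled, these weights are all close to 1.0
- Implements CAR using LogisticRegression with:
  - L2 regularization (the scikit-learn default penalty) with a fixed `C=1.0`, where `C` is the inverse of the regularization strength
  - Class weights for imbalance handling
  - LBFGS solver for optimization
  - Multinomial (softmax) formulation for multi-class classification, which recent scikit-learn versions select by default with this solver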
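
Since `C` in `LogisticRegression` is the inverse of the regularization strength, smaller values shrink the coefficients more. The sketch below illustrates the direction of that effect on synthetic data; the data, the variable names, and the `C` values are made up for the illustration and are not a tuning recommendation for this task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (hypothetical): the label depends on two of five features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Fit the same model with three values of C and record the coefficient norm.
norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    clf.fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(f"C={C:<6} coefficient norm={norm:.3f}")
```

The coefficient norm grows with `C`, so `C=1.0` sits between heavy shrinkage and an almost unregularized fit. A grid search over `C` on a validation split would be one way to check whether 1.0 is a good choice here.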

In [8]:

scaler = StandardScaler(with_mean=False)  # Scale without centering so the matrix stays sparse
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Calculate class weights: n_samples / (n_classes * class_count),
# the same formula scikit-learn uses for class_weight='balanced'
classes, counts = np.unique(y_train, return_counts=True)
class_weights = dict(zip(classes, len(y_train) / (len(classes) * counts)))

# Train CAR model
car_model = LogisticRegression(
    C=1.0,  # Inverse of regularization strength (smaller C = stronger regularization)
    class_weight=class_weights,
    max_iter=1000,
    solver='lbfgs',
    random_state=42
)

car_model.fit(x_train_scaled, y_train)

# Make predictions
y_pred = car_model.predict(x_test_scaled)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nCAR Model Accuracy: {accuracy:.2f}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))



CAR Model Accuracy: 0.90

Detailed Classification Report:
              precision    recall  f1-score   support

       anger       0.89      0.89      0.89      6884
     disgust       0.98      1.00      0.99      6945
        fear       0.89      0.93      0.91      6898
         joy       0.88      0.84      0.86      6836
         sad       0.86      0.79      0.82      6850
    surprise       0.93      0.97      0.95      6781

    accuracy                           0.90     41194
   macro avg       0.90      0.90      0.90     41194
weighted avg       0.90      0.90      0.90     41194
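
The class-weight formula used in the cell above can be checked by hand against the numbers printed in this notebook. The sketch below assumes that every class has exactly 34,328 rows after oversampling and derives the training counts by subtracting the test supports from the classification report; `balanced_weights` is an illustrative helper, not a library function.

```python
def balanced_weights(counts):
    """Weight per class: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Raw class counts from the dataset distribution printed earlier.
raw_counts = {"sad": 34328, "joy": 28024, "anger": 20069,
              "fear": 17624, "surprise": 12859, "disgust": 925}

# Assumption: 34,328 rows per class after oversampling, minus the test
# support reported for that class, gives the training count.
test_support = {"anger": 6884, "disgust": 6945, "fear": 6898,
                "joy": 6836, "sad": 6850, "surprise": 6781}
train_counts = {label: 34328 - n for label, n in test_support.items()}

raw_weights = balanced_weights(raw_counts)
train_weights = balanced_weights(train_counts)

for label in sorted(raw_counts):
    print(f"{label:<9} raw={raw_weights[label]:6.2f}  oversampled={train_weights[label]:6.3f}")
```

On the raw counts the weights would range from about 0.55 for 'sad' to about 20.5 for 'disgust'. On the oversampled training set they all stay within one percent of 1.0, so the class weights have almost no effect in this notebook.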





### Overall Metrics
| Metric | Multinomial Naive Bayes | CAR Model |
|--------|------------------------|-----------|
| Accuracy | 81% | 90% |
| Macro Avg F1 | 0.81 | 0.90 |
| Weighted Avg F1 | 0.81 | 0.90 |

### Class-wise Performance
| Emotion | MNB F1 | CAR F1 | Improvement |
|---------|--------|--------|-------------|
| Anger | 0.78 | 0.89 | +0.11 |
| Disgust | 0.88 | 0.99 | +0.11 |
| Fear | 0.81 | 0.91 | +0.10 |
| Joy | 0.78 | 0.86 | +0.08 |
| Sad | 0.76 | 0.82 | +0.06 |
| Surprise | 0.83 | 0.95 | +0.12 |
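
The figures in the two tables can be recomputed from the per-class F1 scores in the classification reports. The sketch below does so with the rounded values as printed, so the macro averages are approximations of the ones scikit-learn computed from unrounded scores.

```python
# Per-class F1 scores copied from the two classification reports.
mnb_f1 = {"anger": 0.78, "disgust": 0.88, "fear": 0.81,
          "joy": 0.78, "sad": 0.76, "surprise": 0.83}
car_f1 = {"anger": 0.89, "disgust": 0.99, "fear": 0.91,
          "joy": 0.86, "sad": 0.82, "surprise": 0.95}

# Macro average: unweighted mean over the six classes.
macro_mnb = sum(mnb_f1.values()) / len(mnb_f1)
macro_car = sum(car_f1.values()) / len(car_f1)

# Per-class gain, and the spread between the best and the worst class.
gains = {label: round(car_f1[label] - mnb_f1[label], 2) for label in mnb_f1}
spread_mnb = max(mnb_f1.values()) - min(mnb_f1.values())
spread_car = max(car_f1.values()) - min(car_f1.values())

print(f"macro F1: MNB {macro_mnb:.2f}, CAR {macro_car:.2f}")
print("gains:", gains)
print(f"best-worst spread: MNB {spread_mnb:.2f}, CAR {spread_car:.2f}")
```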

## Key Findings

### 1. Performance Improvements
- The CAR model improves F1 in every emotion category, by 0.06 to 0.12
- Largest gains:
  - Surprise classification (+0.12 F1)
  - Disgust classification (+0.11 F1)
  - Anger classification (+0.11 F1)
- Smallest gain for Sad (+0.06), which remains the weakest class under both models (F1 0.76 and 0.82)

### 2. Technical Advantages
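- Logistic regression fits all feature weights jointly, whereas Naive Bayes assumes features are conditionally independent given the class
- That assumption is a poor fit for overlapping unigram and bigram features, which are strongly correlated; this is a plausible source of the gain
- L2 regularization (`C=1.0`) limits overfitting across the 10,000 TF-IDF features
- Both models are trained on the same oversampled data, so the improvement is unlikely to come from imbalance handling alone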

### 3. Practical Implications
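- Higher accuracy on this test split (0.90 vs 0.81)
- Higher F1 on the two classes that were smallest before oversampling: disgust (0.88 to 0.99) and surprise (0.83 to 0.95)
- Sad and joy remain the hardest classes for the CAR model (F1 0.82 and 0.86)
- The test split contains oversampled rows, so these figures should be confirmed on held-out original tweets before relying on them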

## Conclusion

The CAR model outperforms Multinomial Naive Bayes on this Persian emotion classification task: overall accuracy rises by 9 percentage points (0.81 to 0.90) and F1 improves in every emotion category, which makes it the more suitable of the two models here. The gain most plausibly comes from logistic regression weighting correlated unigram and bigram features jointly instead of treating them as independent. Because it is a linear model over TF-IDF features, it captures word context only to the extent that the bigrams do.
