# Sentiment Analysis - Classical Machine Learning Models

**Notebook 2 of 4**: Classical ML Approaches

**Author**: Aayush  
**Date**: December 24, 2025

---

## Notebook Overview

This notebook implements and evaluates classical machine learning models for sentiment classification:

1. **Feature Engineering**: TF-IDF vectorization
2. **Model Training**: Logistic Regression, Naive Bayes, SVM, Random Forest
3. **Model Evaluation**: Comprehensive metrics and comparisons
4. **Feature Analysis**: Understanding important predictive features

**Models Implemented**:
- Logistic Regression
- Multinomial Naive Bayes
- Support Vector Machine (Linear)
- Random Forest Classifier

## 1. Import Required Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score, roc_curve
)

# Model persistence
import joblib
from datetime import datetime

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("All libraries imported successfully!")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Load Processed Data

Load the preprocessed data from Notebook 1.

In [None]:
# Load processed data
train_df = pd.read_csv('../data/processed/train_processed.csv')
test_df = pd.read_csv('../data/processed/test_processed.csv')

print(f"Data loaded successfully!")
print(f"\nTraining set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

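# Prepare features and labels
# Rows whose cleaned text is empty are read back from CSV as NaN, which
# TfidfVectorizer rejects, so replace missing text with an empty string
X_train = train_df['cleaned_text'].fillna('')
y_train = train_df['label']
X_test = test_df['cleaned_text'].fillna('')
y_test = test_df['label']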

print(f"\nClass distribution:")
print(f"Training - Positive: {(y_train==1).sum()}, Negative: {(y_train==0).sum()}")
print(f"Test - Positive: {(y_test==1).sum()}, Negative: {(y_test==0).sum()}")

## 3. Feature Engineering - TF-IDF Vectorization

Convert text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).

In [None]:
# Create TF-IDF features
print("Creating TF-IDF features...")
print("This may take a few minutes...\n")

tfidf = TfidfVectorizer(
    max_features=10000,      # Keep the 10,000 most frequent terms
    ngram_range=(1, 2),      # Unigrams and bigrams
    min_df=5,                # Ignore terms appearing in < 5 documents
    max_df=0.8,              # Ignore terms appearing in > 80% of documents
    sublinear_tf=True        # Replace tf with 1 + log(tf)
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF vectorization complete!")
print(f"\nFeature matrix shape:")
print(f"  Training: {X_train_tfidf.shape}")
print(f"  Test: {X_test_tfidf.shape}")
print(f"  Vocabulary size: {len(tfidf.vocabulary_):,}")
print(f"  Matrix sparsity: {(1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])) * 100:.2f}%")

## 4. Model Training and Evaluation

### 4.1 Logistic Regression

In [None]:
print("="*80)
print("LOGISTIC REGRESSION MODEL")
print("="*80)

# Train model
print("\nTraining Logistic Regression...")
lr_model = LogisticRegression(max_iter=1000, random_state=42, C=1.0, solver='liblinear')
lr_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test_tfidf)
y_proba_lr = lr_model.predict_proba(X_test_tfidf)[:, 1]

# Metrics
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_proba_lr)

print(f"\nPerformance Metrics:")
print(f"  Accuracy:  {lr_accuracy:.4f}")
print(f"  Precision: {lr_precision:.4f}")
print(f"  Recall:    {lr_recall:.4f}")
print(f"  F1-Score:  {lr_f1:.4f}")
print(f"  ROC-AUC:   {lr_roc_auc:.4f}")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Negative', 'Positive']))

# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Logistic Regression - Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontweight='bold')
plt.xlabel('Predicted Label', fontweight='bold')
plt.tight_layout()
os.makedirs('../results/figures', exist_ok=True)  # savefig fails if the directory is missing
plt.savefig('../results/figures/cm_logistic_regression.png', dpi=300, bbox_inches='tight')
plt.show()

print("Logistic Regression training complete!")

### 4.2 Naive Bayes

In [None]:
print("="*80)
print("NAIVE BAYES MODEL")
print("="*80)

# Train model
print("\nTraining Naive Bayes...")
nb_model = MultinomialNB(alpha=1.0)
nb_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_nb = nb_model.predict(X_test_tfidf)
y_proba_nb = nb_model.predict_proba(X_test_tfidf)[:, 1]

# Metrics
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_precision = precision_score(y_test, y_pred_nb)
nb_recall = recall_score(y_test, y_pred_nb)
nb_f1 = f1_score(y_test, y_pred_nb)
nb_roc_auc = roc_auc_score(y_test, y_proba_nb)

print(f"\nPerformance Metrics:")
print(f"  Accuracy:  {nb_accuracy:.4f}")
print(f"  Precision: {nb_precision:.4f}")
print(f"  Recall:    {nb_recall:.4f}")
print(f"  F1-Score:  {nb_f1:.4f}")
print(f"  ROC-AUC:   {nb_roc_auc:.4f}")

print("\nNaive Bayes training complete!")

### 4.3 Support Vector Machine (SVM)

In [None]:
print("="*80)
print("SUPPORT VECTOR MACHINE (LINEAR)")
print("="*80)

# Train model
print("\nTraining SVM...")
svm_model = LinearSVC(random_state=42, C=1.0, max_iter=2000)
svm_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_tfidf)

# Metrics
svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_precision = precision_score(y_test, y_pred_svm)
svm_recall = recall_score(y_test, y_pred_svm)
svm_f1 = f1_score(y_test, y_pred_svm)

print(f"\nPerformance Metrics:")
print(f"  Accuracy:  {svm_accuracy:.4f}")
print(f"  Precision: {svm_precision:.4f}")
print(f"  Recall:    {svm_recall:.4f}")
print(f"  F1-Score:  {svm_f1:.4f}")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_svm, target_names=['Negative', 'Positive']))

print("\nSVM training complete!")
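
`LinearSVC` has no `predict_proba`, which is why no ROC-AUC is reported for it above. ROC-AUC only needs a ranking of the samples, so the signed distances from `decision_function` can be used instead. The sketch below shows this on synthetic data from `make_classification`, which stands in for the TF-IDF matrices; the `demo_*` names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the notebook's TF-IDF features)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

demo_svm = LinearSVC(random_state=42, C=1.0, max_iter=5000)
demo_svm.fit(X_tr, y_tr)

# Signed distance to the separating hyperplane; larger means "more positive"
demo_scores = demo_svm.decision_function(X_te)
demo_auc = roc_auc_score(y_te, demo_scores)
print(f"LinearSVC ROC-AUC (from decision_function): {demo_auc:.4f}")
```

Applied here, that would be `roc_auc_score(y_test, svm_model.decision_function(X_test_tfidf))`.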

### 4.4 Random Forest

In [None]:
print("="*80)
print("RANDOM FOREST MODEL")
print("="*80)

# Train model
print("\nTraining Random Forest...")
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1,
    max_depth=50,
    min_samples_split=5
)
rf_model.fit(X_train_tfidf, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test_tfidf)
y_proba_rf = rf_model.predict_proba(X_test_tfidf)[:, 1]

# Metrics
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_proba_rf)

print(f"\nPerformance Metrics:")
print(f"  Accuracy:  {rf_accuracy:.4f}")
print(f"  Precision: {rf_precision:.4f}")
print(f"  Recall:    {rf_recall:.4f}")
print(f"  F1-Score:  {rf_f1:.4f}")
print(f"  ROC-AUC:   {rf_roc_auc:.4f}")

print("\nRandom Forest training complete!")
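
The four evaluation cells above repeat the same metric calls. One possible way to consolidate them is a small helper that returns the metrics as a dictionary. `evaluate_model` below is a hypothetical function written for this sketch, and the synthetic data is only there to make the block self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)
from sklearn.svm import LinearSVC


def evaluate_model(model, X, y_true):
    """Hypothetical helper: compute the notebook's metrics for a fitted model."""
    y_pred = model.predict(X)
    scores = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1-Score': f1_score(y_true, y_pred),
    }
    # ROC-AUC needs a continuous score; models without predict_proba
    # (such as LinearSVC) fall back to decision_function
    if hasattr(model, 'predict_proba'):
        scores['ROC-AUC'] = roc_auc_score(y_true, model.predict_proba(X)[:, 1])
    elif hasattr(model, 'decision_function'):
        scores['ROC-AUC'] = roc_auc_score(y_true, model.decision_function(X))
    return scores


# Synthetic stand-in data; scored on the training set only to keep the demo short
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=42)
demo_models = {
    'Logistic Regression': LogisticRegression(solver='liblinear', random_state=42),
    'SVM': LinearSVC(random_state=42, max_iter=5000),
}
demo_results = {
    name: evaluate_model(model.fit(X_demo, y_demo), X_demo, y_demo)
    for name, model in demo_models.items()
}
for name, metrics in demo_results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

With the notebook's objects, a call such as `evaluate_model(rf_model, X_test_tfidf, y_test)` would return the same figures that the cell above prints.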

## 5. Model Comparison

Create a comprehensive comparison of all classical ML models.

In [None]:
# Create comparison dataframe
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Naive Bayes', 'SVM', 'Random Forest'],
    'Accuracy': [lr_accuracy, nb_accuracy, svm_accuracy, rf_accuracy],
    'Precision': [lr_precision, nb_precision, svm_precision, rf_precision],
    'Recall': [lr_recall, nb_recall, svm_recall, rf_recall],
    'F1-Score': [lr_f1, nb_f1, svm_f1, rf_f1]
})

print("\n" + "="*80)
print("CLASSICAL ML MODELS - PERFORMANCE COMPARISON")
print("="*80)
print("\n", results.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
x = np.arange(len(results))
width = 0.2

for i, metric in enumerate(metrics):
    axes[0].bar(x + i*width - 1.5*width, results[metric], width, 
                label=metric, alpha=0.8, edgecolor='black')

axes[0].set_xlabel('Model', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Score', fontsize=12, fontweight='bold')
axes[0].set_title('Classical ML Models - Performance Metrics', fontsize=14, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results['Model'], rotation=20, ha='right', fontsize=10)
axes[0].legend(fontsize=10)
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim(min(0.8, results[metrics].min().min() - 0.05), 1.0)  # widen the axis if any score falls below 0.8

# Heatmap
metrics_data = results[['Accuracy', 'Precision', 'Recall', 'F1-Score']].values
sns.heatmap(metrics_data.T, annot=True, fmt='.3f', cmap='RdYlGn',
            xticklabels=results['Model'], yticklabels=metrics,
            ax=axes[1], cbar_kws={'label': 'Score'}, vmin=0.8, vmax=1.0)
axes[1].set_title('Performance Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
os.makedirs('../results/figures', exist_ok=True)  # savefig fails if the directory is missing
plt.savefig('../results/figures/classical_ml_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# Save results
os.makedirs('../results', exist_ok=True)
results.to_csv('../results/classical_ml_results.csv', index=False)

print("\nResults saved!")
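
The comparison above rests on a single train/test split. `cross_val_score` is imported at the top of this notebook but never used; a sketch of how it could complement the comparison follows. Wrapping the vectorizer and classifier in a pipeline means TF-IDF is refit inside each fold, so no vocabulary statistics leak from the validation part. The toy texts and the `cv_*` names are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration (1 = positive, 0 = negative)
demo_texts = [
    "great film", "loved the acting", "wonderful story",
    "brilliant and moving", "really enjoyable", "a great cast",
    "awful film", "hated the acting", "boring story",
    "dull and slow", "really terrible", "a weak cast",
]
demo_labels = [1] * 6 + [0] * 6

# Vectorizer + classifier in one estimator, refit from scratch on every fold
cv_pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver='liblinear', random_state=42),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_pipeline, demo_texts, demo_labels,
                            cv=cv, scoring='f1')

print(f"F1 per fold: {cv_scores.round(3)}")
print(f"Mean F1: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

On the real data the pipeline would be given `X_train` and `y_train` directly (raw cleaned text, not `X_train_tfidf`), using the same `TfidfVectorizer` settings as in Section 3.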

## 6. Save Models

In [None]:
# Save trained models
os.makedirs('../models', exist_ok=True)

joblib.dump(lr_model, '../models/logistic_regression.pkl')
joblib.dump(nb_model, '../models/naive_bayes.pkl')
joblib.dump(svm_model, '../models/svm.pkl')
joblib.dump(rf_model, '../models/random_forest.pkl')
joblib.dump(tfidf, '../models/tfidf_vectorizer.pkl')

print("All models saved successfully!")
print("\nSaved models:")
print("  - logistic_regression.pkl")
print("  - naive_bayes.pkl")
print("  - svm.pkl")
print("  - random_forest.pkl")
print("  - tfidf_vectorizer.pkl")
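
To use a saved model later, both the classifier and the fitted vectorizer have to be loaded, because new text must be transformed with the same vocabulary. The sketch below shows the round trip with a tiny stand-in model written to a temporary directory. The toy data and the `demo_*` and `loaded_*` names are assumptions for this example; in practice the paths would be the `../models/*.pkl` files saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model, invented for illustration (1 = positive, 0 = negative)
demo_texts = ["great film", "loved it", "awful film", "hated it"]
demo_labels = [1, 1, 0, 0]
demo_vec = TfidfVectorizer()
demo_clf = LogisticRegression(solver='liblinear', random_state=42)
demo_clf.fit(demo_vec.fit_transform(demo_texts), demo_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, 'tfidf_vectorizer.pkl')
    clf_path = os.path.join(tmp_dir, 'logistic_regression.pkl')
    joblib.dump(demo_vec, vec_path)
    joblib.dump(demo_clf, clf_path)

    # Load both artifacts back, as a later notebook or script would
    loaded_vec = joblib.load(vec_path)
    loaded_clf = joblib.load(clf_path)

# Transform with the loaded vectorizer (never refit it), then predict
new_texts = ["great film", "awful film"]
new_preds = loaded_clf.predict(loaded_vec.transform(new_texts))
for text, pred in zip(new_texts, new_preds):
    print(f"{text!r} -> {'Positive' if pred == 1 else 'Negative'}")
```

New text should first go through the same cleaning steps as Notebook 1 so that it matches the `cleaned_text` column the vectorizer was fitted on.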

## Notebook 2 Summary

**What we accomplished**:
- Created TF-IDF features from preprocessed text
- Trained 4 classical ML models
- Evaluated all models with comprehensive metrics
- Compared model performances
- Saved all models for future use

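**Key Results**:
- **Best Model**: SVM, with the highest F1-score in the comparison table
- **Fastest Model**: Naive Bayes (an informal observation; training time is not measured in this notebook)
- **Most Balanced**: Logistic Regression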

**Next Steps**: Proceed to Notebook 3 for deep learning models (LSTM and BERT)