## Evaluating Sentiment Analysis Models (Tweet Data)

Model evaluation answers one question:

> **“How well will my model perform on unseen, real-world data?”**

## 1️⃣ Prepare Your Data

### Dataset Splitting

Typical split:

* **70% Training** → learn parameters
* **15% Validation** → tune hyperparameters
* **15% Test** → final, unbiased evaluation

🔑 **Rule**: *Never* tune using the test set.
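A minimal sketch of the 70/15/15 split, using only the Python standard library. It splits per class (a stratified split, which is an assumption beyond the text above) so that each subset keeps roughly the same sentiment proportions. The function name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_pct=70, val_pct=15, seed=0):
    """Return (train_idx, val_idx, test_idx), split separately per class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = len(indices) * train_pct // 100
        n_val = len(indices) * val_pct // 100
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Toy labels (illustrative only): 60 neutral, 20 positive, 20 negative.
labels = ["neutral"] * 60 + ["positive"] * 20 + ["negative"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

The test indices are set aside once and only touched for the final evaluation.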

### Labeling

* Sentiments: **positive / negative / neutral**
* Labels must be:

  * Correct
  * Consistent
  * Representative of real usage

## 2️⃣ Evaluation Metrics (Very Important)

### Accuracy

$$
\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{(binary case)}
$$

* ✅ Good for **balanced datasets**
* ❌ Misleading for **imbalanced data**
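To see why accuracy can mislead, here is a small sketch with made-up data: 90 neutral tweets, 10 negative ones, and a "model" that always predicts neutral.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy, imbalanced data (illustrative only).
y_true = ["neutral"] * 90 + ["negative"] * 10
y_pred = ["neutral"] * 100  # always predicts the majority class

negatives_found = sum(t == "negative" and p == "negative"
                      for t, p in zip(y_true, y_pred))

print(accuracy(y_true, y_pred))  # 0.9
print(negatives_found)           # 0
```

The score looks strong, yet not a single negative tweet is detected.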



### Precision

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Use when:** false positives are costly  

📌 Example: labeling neutral tweets as *negative*



### Recall (Sensitivity)

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Use when:** false negatives are costly   

📌 Example: missing negative tweets about a product



### F1 Score

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

* A strong **single metric** for imbalanced sentiment data (with three classes, report macro-averaged F1 so minority classes count equally)
* Balances precision and recall
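The precision, recall, and F1 formulas above can be sketched per class, treating one sentiment at a time as the "positive" class, and then averaged into a macro F1. The helper names and the toy labels are assumptions for illustration.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    return sum(per_class_prf(y_true, y_pred, c)[2] for c in classes) / len(classes)

# Toy labels (illustrative only).
y_true = ["positive", "positive", "negative", "negative",
          "neutral", "neutral", "neutral", "neutral"]
y_pred = ["positive", "neutral", "negative", "neutral",
          "neutral", "neutral", "neutral", "negative"]

print(per_class_prf(y_true, y_pred, "negative"))  # (0.5, 0.5, 0.5)
print(round(macro_f1(y_true, y_pred), 3))         # 0.611
```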



### AUC–ROC

* Measures **class separability**
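* Defined for **binary classification**; for positive / negative / neutral, compute it one-vs-rest per class
* Independent of the decision threshold, but can look optimistic under heavy **class imbalance**, so also check the precision-recall curve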
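One way to read AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count half). A minimal sketch of that pairwise definition follows; it is quadratic in the number of examples, so it suits small samples only. Here label `1` stands for the class of interest (for example, "negative sentiment") and `0` for everything else; the scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores (illustrative only).
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(roc_auc(y_true, scores), 3))  # 0.833
```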



### Confusion Matrix

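With three sentiment classes this is a 3×3 table of true vs. predicted labels. Treating one class at a time as the positive class, it shows: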

* True Positives (TP)
* True Negatives (TN)
* False Positives (FP)
* False Negatives (FN)
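For three sentiment classes, the matrix can be built by counting (true, predicted) pairs. This is a minimal sketch; the row/column order in `LABELS` and the toy data are assumptions.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Toy labels (illustrative only).
y_true = ["positive", "positive", "negative", "negative",
          "neutral", "neutral", "neutral", "neutral"]
y_pred = ["positive", "neutral", "negative", "neutral",
          "neutral", "neutral", "neutral", "negative"]

for label, row in zip(LABELS, confusion_matrix(y_true, y_pred)):
    print(f"{label:>8}: {row}")
```

The diagonal holds correct predictions; every off-diagonal cell is one specific kind of mistake.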

## 3️⃣ Perform Evaluation

### Confusion Matrix Analysis

Ask:

* Which sentiment is misclassified most?
* Are errors systematic or random?
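The first question can be answered mechanically by finding the largest off-diagonal cell. A small sketch, assuming a matrix whose rows are true labels and columns are predicted labels; the counts below are invented for illustration.

```python
def most_confused(matrix, labels):
    """Return (true_label, predicted_label, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (labels[i], labels[j], count)
    return best

labels = ["negative", "neutral", "positive"]
# Hypothetical counts: rows = true label, columns = predicted label.
matrix = [
    [30, 15, 5],
    [4, 80, 6],
    [2, 12, 46],
]
print(most_confused(matrix, labels))  # ('negative', 'neutral', 15)
```

If one cell dominates, the errors are likely systematic; if mistakes are spread evenly across cells, they look closer to noise.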



### Cross-Validation (k-fold)

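* Train and evaluate on k different splits, so every example is held out exactly once
* Checks **stability and consistency**: a large spread in scores across folds signals an unreliable estimate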
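A minimal sketch of how k-fold index sets can be generated. It assumes the data has already been shuffled; the function name is illustrative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), test_idx)
```

Training and scoring a model once per fold, then looking at the mean and spread of the scores, gives the stability check described above.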



### Baseline Comparison

Always compare against:

* Majority-class predictor
* Simple rule-based sentiment model

🔑 If you can’t beat the baseline → the model adds no value
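Both baselines are small enough to sketch in a few lines. The word lists below are a tiny hypothetical lexicon, not a real sentiment resource.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda tweet: majority

# Hypothetical mini-lexicon, for illustration only.
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"hate", "awful", "bad"}

def rule_based(tweet):
    """Count lexicon hits and return the sign of the difference."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

predict_majority = majority_baseline(["neutral", "neutral", "positive"])
print(predict_majority("any tweet"))    # neutral
print(rule_based("I love this phone"))  # positive
print(rule_based("not good"))           # positive (the rule ignores negation)
```

Scoring both with the same metrics as the real model shows how much of its performance is actually learned.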

## 4️⃣ Assess Real-World Performance

### Manual Review

* Sample predictions
* Compare with human judgment



### Error Analysis

Look for:

* Sarcasm
* Slang
* Emojis
* Negation (“not good”)

This guides feature/model improvements.
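A rough sketch of how misclassified tweets could be tagged automatically before a manual pass. The regex, the slang list, and the emoji ranges are all simplifying assumptions; sarcasm has no reliable surface pattern, so it falls into `other` and needs human tagging.

```python
import re
from collections import Counter

NEGATION_RE = re.compile(r"\b(?:not|no|never)\b|n't\b", re.IGNORECASE)
SLANG = {"lol", "smh", "lit", "tbh"}  # hypothetical, illustrative list

def has_emoji(text):
    # Rough check covering two common emoji ranges, not the full Unicode set.
    return any(0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
               for ch in text)

def tag_error(text):
    """Return the list of heuristic tags that apply to a misclassified tweet."""
    tags = []
    if NEGATION_RE.search(text):
        tags.append("negation")
    if has_emoji(text):
        tags.append("emoji")
    if any(w in SLANG for w in re.findall(r"[a-z']+", text.lower())):
        tags.append("slang")
    return tags or ["other"]

# Made-up examples of misclassified tweets.
misclassified = [
    "not good at all",
    "this update is lit \U0001F525",
    "I don't like it",
    "great, another delay",
]
tag_counts = Counter(tag for text in misclassified for tag in tag_error(text))
print(tag_counts)
```

A tag that dominates the counts points to where feature or model work is likely to pay off first.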



### Domain-Specific Testing

* News tweets
* Product reviews
* Political tweets

This checks **generalization** beyond the training domain.
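One simple way to run this check is to compute the same metric separately per domain. A sketch with invented records, where each record is `(domain, true_label, predicted_label)`:

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += (truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Made-up records, for illustration only.
records = [
    ("news", "neutral", "neutral"),
    ("news", "negative", "neutral"),
    ("product", "positive", "positive"),
    ("product", "negative", "negative"),
    ("politics", "negative", "positive"),
    ("politics", "neutral", "neutral"),
    ("politics", "negative", "neutral"),
    ("politics", "positive", "positive"),
]
print(accuracy_by_group(records))
# {'news': 0.5, 'product': 1.0, 'politics': 0.5}
```

A large gap between domains suggests the model has learned patterns specific to its training data.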

## 5️⃣ Check for Bias

### Bias Analysis

Evaluate:

* Performance skew across topics
* Language style sensitivity
* Demographic language patterns

⚠️ Critical for fairness and ethics

## 6️⃣ External Benchmarks

* Compare against:

  * Published models
  * Pretrained sentiment analyzers
* Gives **context**, not just raw numbers

## 7️⃣ Continuous Monitoring (Production)

### Real-Time Monitoring

* Track accuracy drift
* Watch for language evolution
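Accuracy drift can be tracked with a rolling window over recent labeled predictions. This sketch assumes a stream of human-labeled samples is available in production; the class name, window size, and thresholds are illustrative choices.

```python
from collections import deque

class RollingAccuracy:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=500, baseline=0.85, tolerance=0.05):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.baseline = baseline              # accuracy measured at deployment time
        self.tolerance = tolerance            # allowed drop before flagging drift

    def update(self, truth, pred):
        self.outcomes.append(truth == pred)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.baseline - self.tolerance

monitor = RollingAccuracy(window=4, baseline=0.75, tolerance=0.25)
for truth, pred in [("positive", "positive")] * 4:
    monitor.update(truth, pred)
print(monitor.accuracy, monitor.drifted())  # 1.0 False

for truth, pred in [("negative", "neutral")] * 3:
    monitor.update(truth, pred)
print(monitor.accuracy, monitor.drifted())  # 0.25 True
```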

### Periodic Retraining

* Slang changes
* New hashtags
* New sentiment patterns

## Summary ⭐

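* High accuracy ≠ a good model, especially on imbalanced data
* F1 (macro-averaged for three classes) is a better single metric for imbalanced sentiment data
* The confusion matrix shows *where* the errors are, which explains *why* a score is low
* Always compare to a baseline
* Manual + automated evaluation both matter
* Models degrade over time → retraining required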