# Performance Metrics for NLP Models

In Natural Language Processing (NLP), model performance is evaluated with different metrics depending on the task (e.g., classification, sequence tagging, or generation). Below is a breakdown of the most commonly used evaluation metrics in NLP.

---

## **1. Accuracy**

### Definition:
Accuracy measures the fraction of predictions that are correct (often reported as a percentage). It is typically used for classification tasks.

### Formula:
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{\sum_{i=1}^{N} I(y_i = \hat{y}_i)}{N}
$$
Where:
- \( y_i \) = true label (e.g., actual class or category)
- \( \hat{y}_i \) = predicted label
- \( I \) = indicator function (1 if \( y_i = \hat{y}_i \), 0 otherwise)
- \( N \) = number of samples
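
The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `accuracy` and the example labels are illustrative and not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    # Sum of the indicator function I(y_i == y_hat_i) over all samples.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions match.
y_true = ["pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neg"]
print(accuracy(y_true, y_pred))  # 0.8
```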

---

## **2. Precision, Recall, and F1-Score**

These metrics are especially useful when the class distribution is imbalanced (e.g., spam classification or skewed sentiment analysis datasets), because accuracy can stay high even when a model rarely predicts the minority class.

### Precision:
Measures how many of the predicted positive instances are actually positive.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$
Where:
- \( TP \) = True Positives (correct positive predictions)
- \( FP \) = False Positives (incorrect positive predictions)

### Recall (Sensitivity):
Measures how many of the actual positive instances were correctly predicted.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$
Where:
- \( FN \) = False Negatives (positive instances incorrectly predicted as negative)

### F1-Score:
The harmonic mean of precision and recall, giving a balance between the two.

$$
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
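
As a rough illustration of the three formulas for a binary task, the sketch below counts TP, FP, and FN from label lists. The function name `precision_recall_f1`, the `positive` parameter, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: TP = 2, FP = 1, FN = 1.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # each value is approximately 0.667
```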

---

## **3. Macro and Micro Averaging (for Multi-Class Classification)**

When evaluating models on multi-class tasks, metrics like precision, recall, and F1 can be averaged over classes. Two common types of averaging are:

### Micro Averaging:
This calculates the metrics globally by counting the total true positives, false positives, etc., across all classes.
$$
\text{Micro Precision} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}
$$

### Macro Averaging:
This computes metrics for each class individually and then takes the average. Each class is given equal weight regardless of its size.
$$
\text{Macro Precision} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i}
$$
Where \( C \) is the number of classes.
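
The difference between the two averages is easiest to see on an imbalanced example. The sketch below assumes single-label multi-class data; the function name `micro_macro_precision` and the toy labels are hypothetical. In this single-label setting, micro precision coincides with accuracy.

```python
from collections import Counter


def micro_macro_precision(y_true, y_pred):
    """Return (micro precision, macro precision) for single-label predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1   # correct prediction of class p
        else:
            fp[p] += 1   # class p was predicted but the true class differs

    # Micro: pool the counts over all classes, then divide once.
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

    # Macro: per-class precision, then an unweighted mean over classes.
    # Assumed convention: a class that is never predicted contributes 0.0.
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in classes]
    macro = sum(per_class) / len(classes)
    return micro, macro


# Hypothetical data: 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 7 + ["b", "b", "a"]
print(micro_macro_precision(y_true, y_pred))  # micro 0.8, macro 0.6875
```

The macro score is lower here because the weak precision on the small class "b" (1 of 2) counts as much as the strong precision on the large class "a" (7 of 8).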

---

## **4. BLEU Score (for Machine Translation)**

### Definition:
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of machine-generated text, particularly in machine translation tasks. It compares the n-grams of the predicted text with those in the reference text.

### Formula:
$$
\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$
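Where:
- \( BP \) = Brevity Penalty, which penalizes translations shorter than the reference: \( BP = 1 \) if the candidate length \( c \) exceeds the reference length \( r \), and \( BP = e^{1 - r/c} \) otherwise.
- \( p_n \) = Modified (clipped) n-gram precision, in which each candidate n-gram is counted at most as many times as it appears in the reference (typically n = 1, 2, 3, 4).
- \( w_n \) = Weight for each n-gram order, typically uniform (\( w_n = 1/N \)).
- \( N \) = Maximum n-gram order (typically 4).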
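
To make the formula concrete, here is a simplified sketch of sentence-level BLEU. It assumes a single reference, uniform weights, pre-tokenized input, and no smoothing, so it returns 0 whenever any n-gram order has no match. The function names `ngram_counts` and `sentence_bleu` are illustrative; reported scores are normally computed at corpus level with an established toolkit.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with uniform weights w_n = 1 / max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # log(0) is undefined; no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))

    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on mat".split()
print(round(sentence_bleu(candidate, reference), 4))
```

In this example the clipped precisions are 5/5, 3/4, 2/3, and 1/2, and the candidate is one token shorter than the reference, so the brevity penalty is \( e^{-0.2} \).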

---

## **5. Perplexity (for Language Models)**

### Definition:
Perplexity is a measure of how well a probability model predicts a sample. In NLP, it is commonly used to evaluate language models, with lower values indicating better performance.

### Formula:
$$
\text{Perplexity} = 2^{H(p)}
$$
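Where:
- \( H(p) \) = Cross-entropy, measured in bits (base-2 logarithm), between the true distribution and the predicted distribution. In practice it is estimated on held-out text as the average negative \( \log_2 \) probability that the model assigns to each observed token.
- Lower perplexity values indicate that the model has a better ability to predict the next word. A perplexity of \( k \) means the model is, on average, as uncertain as if it were choosing uniformly among \( k \) words.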
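
A minimal sketch of the computation, assuming the model's probability for each observed token is already available as a list of floats. The function name `perplexity` and the example values are hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the observed tokens."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    # Empirical cross-entropy in bits: average negative log2 probability.
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy_bits


# A model that assigns probability 0.25 to every observed token behaves like
# a uniform choice among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Perplexities are only comparable between models that use the same tokenization and vocabulary, since the per-token average depends on how the text is split.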

---

## **6. ROUGE Score (for Summarization)**

### Definition:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate automatic summarization and machine-generated text. It measures the overlap between n-grams, words, or word sequences in the predicted summary and the reference summary.

### ROUGE-N (Recall):
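$$
\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{g \in S} \text{Count}_{\text{match}}(g)}{\sum_{S \in \text{References}} \sum_{g \in S} \text{Count}(g)}
$$
Where:
- \( g \) ranges over the n-grams of length N in each reference summary \( S \).
- \( \text{Count}_{\text{match}}(g) \) is the number of occurrences of \( g \) shared by the predicted summary and the reference (the smaller of the two counts).
- \( \text{Count}(g) \) is the number of occurrences of \( g \) in the reference.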

### ROUGE-L (Longest Common Subsequence):
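$$
\text{ROUGE-L} = \frac{\sum_{i=1}^{N} LCS(P_i, R_i)}{\sum_{i=1}^{N} |R_i|}
$$
Where:
- \( LCS(P_i, R_i) \) is the length of the longest common subsequence between the \( i \)-th predicted (\( P_i \)) and reference (\( R_i \)) summaries.
- \( |R_i| \) is the number of words in the \( i \)-th reference summary, and \( N \) is the number of summary pairs.
- This is the recall form. ROUGE-L is also commonly reported as an F-measure that combines it with the corresponding precision, \( LCS(P_i, R_i) / |P_i| \).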
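
Both recall variants can be sketched for a single predicted/reference pair as follows. The code assumes pre-tokenized input and one reference; the function names `rouge_n_recall`, `lcs_length`, and `rouge_l_recall` are illustrative and do not correspond to a specific package.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference n-grams that also appear in the candidate."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams
print(rouge_l_recall(candidate, reference))       # LCS length 5 over 6 words
```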

---

## **7. Word Error Rate (WER) (for Speech Recognition)**

### Definition:
WER is a metric used to evaluate the performance of speech recognition systems. It is the word-level edit distance between the predicted transcription and the reference transcription, normalized by the length of the reference. Because insertions are counted, WER can exceed 1.

### Formula:
$$
\text{WER} = \frac{S + D + I}{N}
$$
Where:
- \( S \) = Substitutions (incorrectly recognized words)
- \( D \) = Deletions (missing words)
- \( I \) = Insertions (extra words)
- \( N \) = Total number of words in the reference
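
The minimum total of substitutions, deletions, and insertions is a word-level Levenshtein distance, which the following sketch computes with dynamic programming. It assumes whitespace tokenization and no text normalization (casing, punctuation), which real evaluations usually apply first; the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors, 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# Three insertions against a one-word reference give a WER above 1.
print(word_error_rate("hello", "hello there big world"))  # 3.0
```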

---

## **8. Kendall’s Tau and Spearman’s Rank Correlation (for Ranking Tasks)**

### Definition:
Kendall’s Tau and Spearman’s Rank Correlation are used to evaluate models in ranking tasks (e.g., recommendation systems). These metrics measure the correlation between the predicted ranking and the true ranking.

### Kendall’s Tau:
$$
\tau = \frac{C - D}{\sqrt{(C + D + T) \cdot (C + D + U)}}
$$
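Where:
- \( C \) = number of concordant pairs
- \( D \) = number of discordant pairs
- \( T \) = number of pairs tied only in the predicted ranking
- \( U \) = number of pairs tied only in the true ranking

This tie-corrected form is known as tau-b. Pairs tied in both rankings are excluded from every term, and without ties the denominator reduces to the total number of pairs, \( n(n-1)/2 \).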

### Spearman’s Rank Correlation:
$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$
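Where:
- \( d_i \) = difference between ranks of each item
- \( n \) = number of items

This closed form assumes there are no tied ranks. With ties, compute the Pearson correlation of the rank values instead.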
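
Both coefficients can be sketched directly from their definitions. The code below is an illustrative O(n²) implementation with hypothetical function names (`kendall_tau_b`, `spearman_rho`); the Spearman function expects ranks without ties as input.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long sequences of scores or ranks."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both sequences: excluded everywhere
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two rank lists; assumes no ties and n > 1."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings that differ by one swapped neighbouring pair.
true_ranks = [1, 2, 3, 4, 5]
pred_ranks = [1, 3, 2, 4, 5]
print(kendall_tau_b(true_ranks, pred_ranks))  # 9 concordant, 1 discordant
print(spearman_rho(true_ranks, pred_ranks))   # sum of squared differences is 2
```

For this example the two values are approximately 0.8 and 0.9. Kendall's tau penalizes the single swapped pair more heavily because it counts pairs, while Spearman's rho weighs the size of the rank differences.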

---

## **9. Cosine Similarity (for Text Similarity)**

### Definition:
Cosine similarity is used to measure the similarity between two non-zero vectors, which can represent word embeddings or document vectors. It is commonly used in information retrieval and document similarity tasks.

### Formula:
$$
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
$$
Where:
- \( A \) and \( B \) are vectors representing two documents (or words).
- \( \|A\| \) and \( \|B\| \) are the magnitudes of the vectors.
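
A minimal sketch in plain Python, assuming the vectors are given as equally long lists of numbers. The function name `cosine_similarity` and the example vectors are illustrative; in practice the vectors would be embeddings or term-weight vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction: close to 1
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal: 0.0
```

The result depends only on direction, not on vector length, so a document and a longer document with the same word proportions score as highly similar.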

---

## **10. Normalized Discounted Cumulative Gain (NDCG) (for Ranking)**

### Definition:
NDCG is used for evaluating the quality of ranked retrieval results, such as in search engines. It considers the position of relevant documents in the ranking.

### Formula:
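$$
\text{NDCG@k} = Z_k \sum_{i=1}^{k} \frac{rel(i)}{\log_2(i + 1)} = \frac{\text{DCG@k}}{\text{IDCG@k}}
$$
Where:
- \( \text{DCG@k} = \sum_{i=1}^{k} \frac{rel(i)}{\log_2(i + 1)} \) is the discounted cumulative gain of the ranking being evaluated.
- \( Z_k = 1 / \text{IDCG@k} \) is a normalization factor that makes the score range from 0 to 1, where \( \text{IDCG@k} \) is the DCG@k of the ideal ranking (items sorted by decreasing relevance).
- \( rel(i) \) is the relevance of the item at position \( i \).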
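
As a sketch of the computation, the code below takes the graded relevance of each returned item in ranked order. It assumes the list contains every candidate item, so that sorting it yields the ideal ranking; the function names `dcg_at_k` and `ndcg_at_k` and the relevance values are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-indexed)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    # Assumed convention: if no item is relevant, report 0.0.
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg else 0.0


print(ndcg_at_k([3, 2, 1, 0], k=4))            # already ideal: 1.0
print(round(ndcg_at_k([0, 1, 2, 3], k=4), 4))  # worst ordering of the same items
```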

---

## Conclusion

These metrics are central to evaluating NLP models across tasks such as classification, sequence labeling, machine translation, summarization, speech recognition, and ranking. The choice of metric depends on the specific task and the nature of the data. By understanding these metrics, you can better assess your model's performance, diagnose issues, and improve the overall quality of your NLP systems.
