#### **Evaluation Metrics for Text Classification**

Example Scenario - \
A model trained to predict the sentiment of book reviews reports that a bestseller has mostly negative reviews.
Should we accept this judgement?

Let us first generate predictions before evaluating them.

In [None]:
# Initializing the model
rnn_model = RNNModel(input_size, hidden_size, num_layers, num_classes)
# ...

# Model Training
for epoch in range(10):
    outputs = rnn_model(X_train)
    # ...
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

# Predicting on the test set
outputs = rnn_model(X_test)
_, predicted = torch.max(outputs, 1)

We can now use the `predicted` sentiments for evaluation.
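
`torch.max(outputs, 1)` returns both the largest value in each row and the index of that value; only the index is kept, and it serves as the predicted class. The step can be sketched without any dependencies as follows. The scores are hypothetical, and the column order (0 = negative, 1 = positive) is an assumption, since the label mapping is not stated here.

```python
# Hypothetical raw model scores for three reviews.
# Assumed column order: [negative, positive]
example_scores = [
    [2.1, -0.3],
    [0.2, 1.7],
    [-1.0, 0.5],
]

# For each row, keep the index of the largest score (the argmax)
example_predicted = [
    max(range(len(row)), key=row.__getitem__) for row in example_scores
]

print(example_predicted)
```

Each entry of `example_predicted` is a class index, which is the form the evaluation metrics expect.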

The most straightforward metric is accuracy: the ratio of correct predictions to the total number of predictions.

In [11]:
import torch
from torchmetrics import Accuracy

actual = torch.tensor([0, 1, 1, 0, 1, 0])
predicted = torch.tensor([0, 0, 1, 0, 1, 1])

accuracy = Accuracy(task='binary', num_classes=2)
accuracy_score = accuracy(predicted, actual)

print(f'Accuracy {accuracy_score*100:.4f}%')

Accuracy 66.6667%


Factors beyond accuracy -
- If the model is trained on 10,000 reviews, of which 9,800 are positive, \
  a model that always predicts positive reaches 98% accuracy without ever identifying a negative review.
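
The imbalance problem can be checked with a few lines of plain Python. This is a minimal sketch, assuming label 1 means positive and 0 means negative; the label lists and the always-positive "model" are made up for illustration and are not produced by `rnn_model`.

```python
# Hypothetical imbalanced dataset: 9,800 positive (1) and 200 negative (0) reviews
imbalanced_actual = [1] * 9800 + [0] * 200

# A degenerate "model" that labels every review as positive
always_positive = [1] * len(imbalanced_actual)

pairs = list(zip(always_positive, imbalanced_actual))

# Accuracy: share of predictions that match the true label
accuracy_value = sum(p == a for p, a in pairs) / len(pairs)

# Recall on the negative class: share of truly negative reviews the model found
negatives_found = sum(p == 0 and a == 0 for p, a in pairs)
negative_recall = negatives_found / imbalanced_actual.count(0)

print(f'Accuracy: {accuracy_value:.2%}')
print(f'Recall on negative reviews: {negative_recall:.2%}')
```

Accuracy comes out at 98% even though none of the 200 negative reviews is detected, which is why accuracy alone can be misleading on imbalanced data.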

#### **Other Metrics to consider**

- Precision: of the reviews the model labels as negative, the share that really are negative (confidence in a negative label)
- Recall: of the reviews that really are negative, the share the model spots
- F1 Score: a single number that balances precision and recall

Focusing on accuracy alone can cause us to miss significant feedback on the model's performance. \
Thus, we should add more evaluation metrics to better understand the model's decision making.

#### **Precision and Recall**

- Precision:

$$
\Large\frac{\text{Correctly Predicted Positive Observations}}{\text{Total Predicted Positives}}
$$

- Recall:

$$
\Large\frac{\text{Correctly Predicted Positive Observations}}{\text{All Observations in Positive Class}}
$$


In [17]:
from torchmetrics import Precision, Recall

precision = Precision(task='binary', num_classes=2)
recall = Recall(task='binary', num_classes=2)

prec = precision(predicted, actual)
rec = recall(predicted, actual)

print(f'Precision: {prec:.4f}')
print(f'Recall: {rec:.4f}')

Precision: 0.6667
Recall: 0.6667
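
These values can be reproduced by hand from the definitions. The sketch below uses plain Python lists holding the same labels as the `actual` and `predicted` tensors and treats label 1 as the positive class; the variable names are illustrative.

```python
# Same labels as the tensors above, written as plain lists
actual_labels = [0, 1, 1, 0, 1, 0]
predicted_labels = [0, 0, 1, 0, 1, 1]

label_pairs = list(zip(predicted_labels, actual_labels))

# Count the outcomes that involve the positive class (label 1)
true_positives = sum(p == 1 and a == 1 for p, a in label_pairs)
false_positives = sum(p == 1 and a == 0 for p, a in label_pairs)
false_negatives = sum(p == 0 and a == 1 for p, a in label_pairs)

# Precision: correct positives out of everything predicted positive
manual_precision = true_positives / (true_positives + false_positives)

# Recall: correct positives out of everything that is actually positive
manual_recall = true_positives / (true_positives + false_negatives)

print(f'Precision: {manual_precision:.4f}')
print(f'Recall: {manual_recall:.4f}')
```

With two true positives, one false positive and one false negative, both ratios are 2/3, matching the `torchmetrics` output.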


#### **F1 Score**

- Harmonic mean of precision and recall
- More informative than accuracy when classes are imbalanced

F1 Score scale -
- F1 Score of 1 = perfect precision and recall
- F1 Score of 0 = worst performance (no positive observation is predicted correctly)
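
The scale follows from the standard definition of the F1 score as the harmonic mean of precision $P$ and recall $R$:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R}
$$

A small sketch of that formula in plain Python follows. The helper name `f1_from` is hypothetical, and returning 0 when both inputs are 0 is a convention assumed here to avoid division by zero.

```python
def f1_from(precision_value: float, recall_value: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are 0."""
    if precision_value + recall_value == 0:
        return 0.0
    return 2 * precision_value * recall_value / (precision_value + recall_value)


# Precision and recall of 2/3 each, as in the example above
print(f'{f1_from(2 / 3, 2 / 3):.4f}')

# Lopsided models: the arithmetic mean would be 0.5 in both cases,
# but the harmonic mean stays close to the weaker of the two values
print(f'{f1_from(1.0, 0.0):.4f}')
print(f'{f1_from(0.9, 0.1):.4f}')
```

Because the harmonic mean is pulled toward the smaller input, a model cannot reach a high F1 score by maximizing only precision or only recall.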

In [18]:
from torchmetrics import F1Score

f1 = F1Score(task='binary', num_classes=2)
f1_score = f1(predicted, actual)

print(f'F1 Score: {f1_score:.4f}')

F1 Score: 0.6667
