# Evaluate a real classifier

This notebook trains a Naive Bayes classifier from NLTK on the `movie_reviews` corpus and also uses VADER, NLTK's lexicon- and rule-based sentiment analyzer.

The example shows how scikit-learn is used to evaluate the classifier.

At the end, there is an example of how to apply VADER to custom sentences.


In [21]:
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report, confusion_matrix
import random


In [22]:
# Download required NLTK datasets
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [23]:
# Preprocess the data
stop_words = set(stopwords.words('english'))

def extract_features(words):
    return {word: True for word in words if word.lower() not in stop_words}
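To see what `extract_features` returns without downloading any corpus, here is a minimal sketch. The small `demo_stop_words` set, the function name `extract_features_demo`, and the sample tokens are assumptions made for illustration; the notebook itself uses `stopwords.words('english')`.

```python
# Hypothetical stand-in for NLTK's English stopword list (assumption for this demo).
demo_stop_words = {"the", "was", "a", "and"}

def extract_features_demo(words, stop_words=demo_stop_words):
    """Bag-of-words presence features: every non-stopword token maps to True."""
    return {word: True for word in words if word.lower() not in stop_words}

tokens = ["The", "plot", "was", "dull", "and", "the", "acting", "was", "dull"]
features = extract_features_demo(tokens)
print(features)  # {'plot': True, 'dull': True, 'acting': True}
```

Repeated tokens collapse into a single key, so the classifier only sees whether a word occurs, not how often.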

In [24]:
# Prepare the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # Shuffle so the train/test split mixes both categories

# Feature extraction
feature_sets = [(extract_features(words), category) for (words, category) in documents]

# Split the data into training and testing sets
train_size = int(len(feature_sets) * 0.8)
train_set, test_set = feature_sets[:train_size], feature_sets[train_size:]

# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

In [25]:
# Evaluate the classifier
print("\nNaive Bayes Classifier Evaluation:")
print(f"Accuracy: {accuracy(classifier, test_set) * 100:.2f}%")
classifier.show_most_informative_features(10)


Naive Bayes Classifier Evaluation:
Accuracy: 70.25%
Most Informative Features
               miserably = True              neg : pos    =     13.9 : 1.0
               ludicrous = True              neg : pos    =     13.2 : 1.0
                captures = True              pos : neg    =     11.2 : 1.0
              astounding = True              pos : neg    =     10.8 : 1.0
                  avoids = True              pos : neg    =     10.8 : 1.0
                  debate = True              pos : neg    =     10.2 : 1.0
               insulting = True              neg : pos    =     10.0 : 1.0
               atrocious = True              neg : pos    =      9.8 : 1.0
            manipulation = True              pos : neg    =      9.5 : 1.0
          excruciatingly = True              neg : pos    =      9.2 : 1.0


In [26]:
# Prepare predictions and true labels for sklearn metrics
y_true = [label for (_, label) in test_set]
y_pred = [classifier.classify(features) for (features, _) in test_set]
# Evaluate using sklearn metrics
print("\nClassification Report:")
print(classification_report(y_true, y_pred))



Classification Report:
              precision    recall  f1-score   support

         neg       0.93      0.46      0.61       207
         pos       0.62      0.96      0.76       193

    accuracy                           0.70       400
   macro avg       0.78      0.71      0.69       400
weighted avg       0.78      0.70      0.68       400



In [27]:
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# VADER Sentiment Analysis on custom examples
sia = SentimentIntensityAnalyzer()
example_sentences = [
    "I absolutely loved this movie! The acting was fantastic.",
    "This was the worst film I have ever seen.",
    "The plot was predictable, but the cinematography was beautiful.",
    "I wouldn't recommend it. It was boring and too long."
]



Confusion Matrix:
[[ 95 112]
 [  7 186]]
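The classification report above can be recovered from this matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, labels in sorted order, so `neg` comes before `pos`); the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in sorted order: neg, pos.
matrix = [[95, 112],
          [7, 186]]
labels = ["neg", "pos"]

def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} computed from a square confusion matrix."""
    results = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column sum
        actually_label = sum(matrix[i])                      # row sum
        precision = correct / predicted_as_label if predicted_as_label else 0.0
        recall = correct / actually_label if actually_label else 0.0
        results[label] = (precision, recall)
    return results

metrics = per_class_metrics(matrix, labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The low recall for `neg` comes from the 112 negative reviews in the top-right cell that were predicted as positive.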


In [28]:
print("\nVADER Sentiment Analysis:")
for sentence in example_sentences:
    score = sia.polarity_scores(sentence)
    sentiment = "positive" if score['compound'] > 0 else "negative"
    print(f"Sentence: {sentence}\nSentiment: {sentiment} (Score: {score['compound']})\n")


VADER Sentiment Analysis:
Sentence: I absolutely loved this movie! The acting was fantastic.
Sentiment: positive (Score: 0.8436)

Sentence: This was the worst film I have ever seen.
Sentiment: negative (Score: -0.6249)

Sentence: The plot was predictable, but the cinematography was beautiful.
Sentiment: positive (Score: 0.7469)

Sentence: I wouldn't recommend it. It was boring and too long.
Sentiment: negative (Score: -0.5283)



# Exercise:

Create your own gold standard and measure precision, recall, and F1 manually and with scikit-learn to check whether the results match.

---

## Creating the gold standard

I'll create a list of tuples with 15 example phrases with positive and negative meanings, stored in the variable `gold_standard`.

In [29]:
# Our gold standard data - manually labeled examples
gold_standard = [
    ("The movie was absolutely fantastic, I loved every minute of it", "pos"),
    ("The actors delivered terrible performances, complete waste of time", "neg"),
    ("While the special effects where great, the story was boring", "neg"),
    ("I found the plot somewhat predictable but still enjoyed it", "pos"),
    ("The plot was interesting, but the performance of the actors made it hard to watch", "neg"),
    ("One of the best movies I have seen this year", "pos"),
    ("Terrible movie, I would love if I could ask for a refund after watching it", "neg"),
    ("The director's vision really shines through in every scene", "pos"),
    ("Despite the large budget, the film feels cheap and rushed", "neg"),
    ("I was moved to tears by the powerful performances", "pos"),
    ("The dialogue was so poorly written it became unintentionally funny", "neg"),
    ("A masterpiece that will be remembered for generations", "pos"),
    ("The cinematography was stunning even though the plot had some holes", "pos"),
    ("Too long and drawn out, I found myself checking my watch repeatedly", "neg"),
    ("An average film that neither impresses nor disappoints", "neg")
]

---
### Separating the gold standard data into texts and labels

I do not remove stopwords here: VADER scores the raw sentence with its own lexicon and rules, so that step can be skipped.

In [30]:
# Extract the first element of the tuples for the texts
texts = [value[0] for value in gold_standard]

# Extract the second element of the tuples for pos neg labels
true_labels = [value[1] for value in gold_standard]

In [31]:
print(texts)

print(true_labels)

['The movie was absolutely fantastic, I loved every minute of it', 'The actors delivered terrible performances, complete waste of time', 'While the special effects where great, the story was boring', 'I found the plot somewhat predictable but still enjoyed it', 'The plot was interesting, but the performance of the actors made it hard to watch', 'One of the best movies I have seen this year', 'Terrible movie, I would love if I could ask for a refund after watching it', "The director's vision really shines through in every scene", 'Despite the large budget, the film feels cheap and rushed', 'I was moved to tears by the powerful performances', 'The dialogue was so poorly written it became unintentionally funny', 'A masterpiece that will be remembered for generations', 'The cinematography was stunning even though the plot had some holes', 'Too long and drawn out, I found myself checking my watch repeatedly', 'An average film that neither impresses nor disappoints']
['pos', 'neg', 'neg', 'p

---
### Sentiment Analysis
In the next step, I will run a sentiment analysis of our gold standard using VADER and store the predicted labels in a list so the results can be compared afterwards.
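The labeling rule used for these predictions is a simple threshold on VADER's `compound` score. A minimal sketch of that rule in isolation; the helper name `label_from_compound` and its `threshold` parameter are hypothetical, and the sample scores are ones that appear in this notebook's VADER output.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a VADER-style compound score to 'pos' or 'neg'.

    Hypothetical helper: scores strictly above the threshold are 'pos',
    everything else (including an exactly neutral 0.0) is 'neg'.
    """
    return "pos" if compound > threshold else "neg"

# Compound scores that appear in this notebook's VADER output.
for score in (0.8431, 0.0644, 0.0, -0.7096):
    print(score, label_from_compound(score))
```

With the default threshold, a neutral score of 0.0 is labeled `neg`, and a weak score such as 0.0644 is labeled `pos`. A commonly cited convention for VADER treats scores between -0.05 and 0.05 as neutral, so a two-class setup like this one has to decide where those borderline cases go.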

In [32]:
predicted_labels = []

print("\nVADER Sentiment Analysis in custom gold Standard:")
for sentence in texts:
    score = sia.polarity_scores(sentence)
    sentiment = "pos" if score['compound'] > 0 else "neg"
    # Storing the prediction
    predicted_labels.append(sentiment)
    print(f"Sentence: {sentence}\nSentiment: {sentiment} (Score: {score['compound']})\n")


VADER Sentiment Analysis in custom gold Standard:
Sentence: The movie was absolutely fantastic, I loved every minute of it
Sentiment: pos (Score: 0.8431)

Sentence: The actors delivered terrible performances, complete waste of time
Sentiment: neg (Score: -0.7096)

Sentence: While the special effects where great, the story was boring
Sentiment: pos (Score: 0.6705)

Sentence: I found the plot somewhat predictable but still enjoyed it
Sentiment: pos (Score: 0.6652)

Sentence: The plot was interesting, but the performance of the actors made it hard to watch
Sentiment: pos (Score: 0.0644)

Sentence: One of the best movies I have seen this year
Sentiment: pos (Score: 0.6369)

Sentence: Terrible movie, I would love if I could ask for a refund after watching it
Sentiment: pos (Score: 0.2732)

Sentence: The director's vision really shines through in every scene
Sentiment: pos (Score: 0.25)

Sentence: Despite the large budget, the film feels cheap and rushed
Sentiment: neg (Score: 0.0)

Sentenc

---

### Next step: calculate precision, recall, and F1 score

The next part calculates these three metrics manually. Before that, I'll give a short explanation of each concept.

<div align="center">
    <img src="image.png" alt="alt text" width="60%">
</div>

1. **Precision**: out of all instances predicted as positive, how many were actually positive?

    $$\frac{\text{true\_positives}}{\text{true\_positives} + \text{false\_positives}}$$

2. **Recall**: out of all actual positive instances, how many were correctly predicted?

    $$\frac{\text{true\_positives}}{\text{true\_positives} + \text{false\_negatives}}$$

3. **F1 score**: harmonic mean of precision ($P$) and recall ($R$).

    $$\frac{2 \cdot P \cdot R}{P + R}$$
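
As a worked example with made-up counts (not results from this notebook), take 8 true positives, 2 false positives, and 4 false negatives:

$$P = \frac{8}{8 + 2} = \frac{4}{5} = 0.80, \qquad R = \frac{8}{8 + 4} = \frac{2}{3} \approx 0.67$$

$$F_1 = \frac{2 \cdot \frac{4}{5} \cdot \frac{2}{3}}{\frac{4}{5} + \frac{2}{3}} = \frac{8}{11} \approx 0.73$$

The harmonic mean sits closer to the lower of the two values than their arithmetic mean, which here would be about 0.73 versus 0.733.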

#### Now, I'll define the true and false positives and the true and false negatives before implementing these metrics.


In [33]:
# Initialize counters
true_positives = 0
false_positives = 0
true_negatives = 0
false_negatives = 0


In [34]:
# Loop through each pair of true and predicted labels
for true, pred in zip(true_labels, predicted_labels):
    if true == "pos" and pred == "pos":
        true_positives += 1
    elif true == "neg" and pred == "pos":
        false_positives += 1
    elif true == "neg" and pred == "neg":
        true_negatives += 1
    elif true == "pos" and pred == "neg":
        false_negatives += 1



In [35]:
# Print the results
print(f"True Positives: {true_positives}")
print(f"False Positives: {false_positives}")
print(f"True Negatives: {true_negatives}")
print(f"False Negatives: {false_negatives}")

True Positives: 7
False Positives: 4
True Negatives: 4
False Negatives: 0
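
The same tally can be written more compactly by counting `(true, predicted)` pairs. This is a minimal sketch with `collections.Counter`; the two short label lists are illustrative only and are not the notebook's gold standard.

```python
from collections import Counter

# Illustrative labels only (assumption), not the notebook's gold standard.
true_labels_demo = ["pos", "neg", "neg", "pos", "neg", "pos"]
predicted_labels_demo = ["pos", "neg", "pos", "pos", "pos", "neg"]

# Count how often each (true, predicted) combination occurs.
pair_counts = Counter(zip(true_labels_demo, predicted_labels_demo))

tp = pair_counts[("pos", "pos")]  # true positives
fp = pair_counts[("neg", "pos")]  # false positives
tn = pair_counts[("neg", "neg")]  # true negatives
fn = pair_counts[("pos", "neg")]  # false negatives

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

A `Counter` returns 0 for pairs that never occur, so no explicit initialization of the four counters is needed.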


---
#### Defining the metrics

For this, I'll use three simple functions that follow the equations in the previous markdown cell.

In [None]:
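def precission_metric(gold_standard):
    # The argument is unused; the counts are read from the global counters
    # true_positives and false_positives tallied above.
    precision = true_positives / (true_positives + false_positives)
    return precision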

print(precission_metric(gold_standard))

0.6363636363636364


In [37]:
def recall_metric(gold_standard):
    # The argument is unused; the counts are read from the global counters.
    recall = true_positives / (true_positives + false_negatives)

    return recall

print(recall_metric(gold_standard))

1.0


In [38]:
def f1_score(gold_standard):
    # Reuse the two functions above, then take their harmonic mean.
    recall = recall_metric(gold_standard)
    precision = precission_metric(gold_standard)

    f1 = (2 * precision * recall) / (precision + recall)

    # Returns a labeled string rather than a float.
    return f"f1-score: {f1}"

print(f1_score(gold_standard))

f1-score: 0.7777777777777778
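
The three functions above read the counters from global variables and would raise `ZeroDivisionError` if a denominator were zero. A variant that takes the counts as arguments might look like the sketch below. The names `safe_divide`, `precision_from_counts`, `recall_from_counts`, and `f1_from_scores` are hypothetical, and returning 0.0 for an undefined ratio is just one possible convention.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def precision_from_counts(tp, fp):
    """Share of predicted positives that are truly positive."""
    return safe_divide(tp, tp + fp)

def recall_from_counts(tp, fn):
    """Share of actual positives that were predicted as positive."""
    return safe_divide(tp, tp + fn)

def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_divide(2 * precision * recall, precision + recall)

# Counts reported earlier in this notebook for the "pos" class.
precision = precision_from_counts(tp=7, fp=4)
recall = recall_from_counts(tp=7, fn=0)
f1 = f1_from_scores(precision, recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Passing the counts explicitly makes it possible to reuse the same functions for the `neg` class or for another classifier without touching global state.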


---
## Comparison with Scikit-Learn metrics

The last step is to pass VADER's predictions to scikit-learn's built-in metrics and compare the results.

In [39]:
print("\nClassification Report:")
print(classification_report(true_labels, predicted_labels))


Classification Report:
              precision    recall  f1-score   support

         neg       1.00      0.50      0.67         8
         pos       0.64      1.00      0.78         7

    accuracy                           0.73        15
   macro avg       0.82      0.75      0.72        15
weighted avg       0.83      0.73      0.72        15
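
The manual functions cover only the `pos` class, while the report also lists `neg` and two averages. The sketch below shows how those rows can be reproduced from the four counts reported earlier (TP=7, FP=4, TN=4, FN=0). It assumes the macro average is the unweighted mean over classes and the weighted average weights each class by its support; the helper names are hypothetical.

```python
# Counts reported earlier in this notebook, with "pos" as the positive class.
tp, fp, tn, fn = 7, 4, 4, 0

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Treating "neg" as the class of interest swaps the roles of the counts.
per_class = {
    "neg": {"precision": tn / (tn + fn), "recall": tn / (tn + fp), "support": tn + fp},
    "pos": {"precision": tp / (tp + fp), "recall": tp / (tp + fn), "support": tp + fn},
}
for scores in per_class.values():
    scores["f1"] = f1(scores["precision"], scores["recall"])

total_support = sum(s["support"] for s in per_class.values())

def macro_avg(metric):
    """Unweighted mean of a metric over the classes."""
    return sum(s[metric] for s in per_class.values()) / len(per_class)

def weighted_avg(metric):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(s[metric] * s["support"] for s in per_class.values()) / total_support

for metric in ("precision", "recall", "f1"):
    print(f"{metric:<9} macro={macro_avg(metric):.2f} weighted={weighted_avg(metric):.2f}")
```

Rounded to two decimals, these values agree with the `macro avg` and `weighted avg` rows of the report.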



In [40]:
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(true_labels, predicted_labels))

Confusion Matrix:
[[4 4]
 [0 7]]


### Conclusions about differences between metrics


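| Metric (class `pos`) | Manual calculation | Scikit-learn |
|---|---|---|
| Precision | 0.6364 | 0.64 |
| Recall | 1.0000 | 1.00 |
| F1 score | 0.7778 | 0.78 |

Rounded to two decimals, the manual values match the `pos` row of the scikit-learn report.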

1. Scikit-learn metrics.




2. Custom metrics.


