# Evaluate a real classifier

This code is an example of using two sentiment tools from NLTK: a Naive Bayes classifier trained on the movie reviews corpus, and VADER, a rule-based sentiment analyzer built on a lexicon.

See in the example how the SKLearn library is used to evaluate the Naive Bayes classifier.

At the end there is an example of how to use VADER on custom sentences.
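To make the evaluation output easier to read, here is a minimal pure-Python sketch of what a confusion matrix counts: rows are true labels and columns are predicted labels, in sorted label order (the layout `sklearn.metrics.confusion_matrix` uses by default). The helper `confusion_counts` and the toy labels are invented for illustration and are not part of the notebook.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return (labels, matrix); matrix[i][j] counts items whose true label
    is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Toy labels, made up for this sketch only
toy_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos"]
toy_pred = ["neg", "neg", "neg", "pos", "pos", "pos", "neg"]

labels, matrix = confusion_counts(toy_true, toy_pred)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Reading the first row: three of the four true `neg` items were predicted `neg`, and one was predicted `pos`.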


In [3]:

import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report, confusion_matrix
import random

# Download required NLTK datasets
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Preprocess the data
stop_words = set(stopwords.words('english'))

def extract_features(words):
    return {word: True for word in words if word.lower() not in stop_words}

# Prepare the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # The corpus is listed by category, so shuffle before splitting to mix both classes

# Feature extraction
feature_sets = [(extract_features(words), category) for (words, category) in documents]

# Split the data into training and testing sets
train_size = int(len(feature_sets) * 0.8)
train_set, test_set = feature_sets[:train_size], feature_sets[train_size:]

# Train a Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate the classifier
print("\nNaive Bayes Classifier Evaluation:")
print(f"Accuracy: {accuracy(classifier, test_set) * 100:.2f}%")
classifier.show_most_informative_features(10)

# Prepare predictions and true labels for sklearn metrics
y_true = [label for (_, label) in test_set]
y_pred = [classifier.classify(features) for (features, _) in test_set]

# Evaluate using sklearn metrics
print("\nClassification Report:")
print(classification_report(y_true, y_pred))

# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# VADER Sentiment Analysis on custom examples
sia = SentimentIntensityAnalyzer()
example_sentences = [
    "I absolutely loved this movie! The acting was fantastic.",
    "This was the worst film I have ever seen.",
    "The plot was predictable, but the cinematography was beautiful.",
    "I wouldn't recommend it. It was boring and too long."
]

print("\nVADER Sentiment Analysis:")
for sentence in example_sentences:
    score = sia.polarity_scores(sentence)
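    # compound lies in [-1, 1]; a score of exactly 0 means no sentiment was detected
    sentiment = "positive" if score['compound'] > 0 else "negative" if score['compound'] < 0 else "neutral"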
    print(f"Sentence: {sentence}\nSentiment: {sentiment} (Score: {score['compound']})\n")


[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/isaacgonzalez/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/isaacgonzalez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/isaacgonzalez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/isaacgonzalez/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!



Naive Bayes Classifier Evaluation:
Accuracy: 74.50%
Most Informative Features
             outstanding = True              pos : neg    =     16.3 : 1.0
                  seagal = True              neg : pos    =     12.6 : 1.0
                captures = True              pos : neg    =     11.7 : 1.0
                    slip = True              pos : neg    =     11.3 : 1.0
               insulting = True              neg : pos    =     11.3 : 1.0
                 idiotic = True              neg : pos    =     10.7 : 1.0
                 tribute = True              pos : neg    =     10.6 : 1.0
              astounding = True              pos : neg    =     10.0 : 1.0
                  avoids = True              pos : neg    =     10.0 : 1.0
                  darker = True              pos : neg    =     10.0 : 1.0

Classification Report:
              precision    recall  f1-score   support

         neg       0.93      0.49      0.65       188
         pos       0.68      0.97     

# Exercise:

Create your own gold standard and measure precision, recall, and F1 manually and with SKLearn to check whether the results are the same.
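For reference, these are the standard definitions for a chosen positive class, where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives. The symbols $P$ and $R$ are shorthand introduced here for precision and recall.

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
$$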

In [1]:
from sklearn.metrics import classification_report

# Manually create your own gold standard
true_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Predicted labels from the classifier
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

# Calculate precision, recall, and F1 manually, treating label 1 as the positive class
true_positive = sum([1 for true, pred in zip(true_labels, predicted_labels) if true == 1 and pred == 1])
false_positive = sum([1 for true, pred in zip(true_labels, predicted_labels) if true == 0 and pred == 1])
false_negative = sum([1 for true, pred in zip(true_labels, predicted_labels) if true == 1 and pred == 0])

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * (precision * recall) / (precision + recall)

# Calculate precision, recall, and F1 using SKLearn

report = classification_report(true_labels, predicted_labels)

print("Manually calculated precision:", precision)
print("Manually calculated recall:", recall)
print("Manually calculated F1:", f1)

print("SKLearn calculated precision, recall, and F1:")
print(report)

Manually calculated precision: 0.6666666666666666
Manually calculated recall: 0.8
Manually calculated F1: 0.7272727272727272
SKLearn calculated precision, recall, and F1:
              precision    recall  f1-score   support

           0       0.75      0.60      0.67         5
           1       0.67      0.80      0.73         5

    accuracy                           0.70        10
   macro avg       0.71      0.70      0.70        10
weighted avg       0.71      0.70      0.70        10
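
The manual calculation above only covers label 1. As a sketch, the same counts can be taken for each label in turn, which should reproduce both per-class rows of the report as well as the accuracy and the macro-average F1. The helper `per_class_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumption: fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Same labels as in the exercise cell
true_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

scores = {label: per_class_metrics(true_labels, predicted_labels, label)
          for label in sorted(set(true_labels))}
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Accuracy is the share of exact matches; macro F1 is the unweighted mean over labels
accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
macro_f1 = sum(f for _, _, f in scores.values()) / len(scores)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```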

