# Exploring VADER as a sentiment analysis tool

In this notebook, we began by cleaning the movie review dataset (removing HTML tags and extra whitespace) to reduce noise. We deliberately retained capitalisation, punctuation, emojis, and slang, as these elements contribute meaningfully to VADER's sentiment calculations.

We then applied the VADER sentiment analyzer to each review to extract positive, negative, neutral, and compound scores.

Using the compound score, we classified reviews as either positive or negative and computed the overall prediction accuracy.

To further refine our approach, we identified words missing from the VADER lexicon and experimented with different compound score thresholds. Among the thresholds tested, 0.5 gave the highest accuracy (0.712).

Overall, our exploration suggests that while VADER is effective for shorter, informal social media texts, it may not be the optimal tool for analyzing the more complex sentiment expressed in movie reviews.

In [None]:
import torch
from torch import cuda

device = 'cuda' if cuda.is_available() else 'cpu'
print(f"Using device: {device}")

Using device: cuda


In [None]:
import nltk
import re
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from nltk.sentiment.vader import SentimentIntensityAnalyzer
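from google.colab import drive

# Note: the analyzer used below is imported from nltk.sentiment.vader, so the standalone vaderSentiment package is optional.
!pip install vaderSentiment
# Tokenizer data required by nltk.word_tokenize.
nltk.download('punkt_tab')
nltk.download('punkt')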



# Preparing Dataset

In [None]:
drive.mount('/content/drive')

nltk.download('punkt')

neg_df = pd.read_csv('/content/drive/MyDrive/it1244/negcsv.csv')
pos_df = pd.read_csv('/content/drive/MyDrive/it1244/poscsv.csv')


neg_df['label'] = 0
pos_df['label'] = 1

df = pd.concat([neg_df, pos_df], ignore_index=True)

print("Sample rows:")
print(df.head())
print("\nTotal reviews:", len(df))

Mounted at /content/drive


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sample rows:
    FileName                                            Content  label
0  23129.txt  Not even Goebbels could have pulled off a prop...      0
1  22912.txt  A plot that fizzled and reeked of irreconcilab...      0
2  23622.txt  The first look on the cover of this picture, i...      0
3  23637.txt  A drama at its very core, "Anna" displays that...      0
4  23109.txt  When THE MAGIC OF LASSIE opened at Radio City ...      0

Total reviews: 50000


# Cleaning Dataset

In [None]:
# The clean_text function removes only HTML tags and extra whitespace from the reviews.
# We intentionally keep punctuation and capitalisation intact because VADER leverages these features for sentiment scoring.

def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['Content'] = df['Content'].apply(clean_text)

print("Cleaned sample rows:")
print(df.head())

Cleaned sample rows:
    FileName                                            Content  label
0  23129.txt  Not even Goebbels could have pulled off a prop...      0
1  22912.txt  A plot that fizzled and reeked of irreconcilab...      0
2  23622.txt  The first look on the cover of this picture, i...      0
3  23637.txt  A drama at its very core, "Anna" displays that...      0
4  23109.txt  When THE MAGIC OF LASSIE opened at Radio City ...      0


In [None]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

# Getting sentiment scores of first few rows

In [None]:
analyzer = SentimentIntensityAnalyzer()

# This function computes VADER sentiment scores for a given text.
# The analyzer.polarity_scores function returns a dictionary containing negative, neutral, positive, and compound sentiment scores.
def get_vader_scores(text):
    return analyzer.polarity_scores(text)


df[['neg', 'neu', 'pos', 'compound']] = df['Content'].apply(
    lambda x: pd.Series(get_vader_scores(x))
)
print("\nSentiment scores (first few rows):")
print(df[['Content', 'neg', 'neu', 'pos', 'compound']].head())


Sentiment scores (first few rows):
                                             Content    neg    neu    pos  \
0  Not even Goebbels could have pulled off a prop...  0.078  0.846  0.076   
1  A plot that fizzled and reeked of irreconcilab...  0.223  0.699  0.077   
2  The first look on the cover of this picture, i...  0.128  0.677  0.195   
3  A drama at its very core, "Anna" displays that...  0.108  0.754  0.138   
4  When THE MAGIC OF LASSIE opened at Radio City ...  0.059  0.840  0.102   

   compound  
0   -0.5218  
1   -0.9704  
2    0.8690  
3    0.9866  
4    0.7453  
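
Several of the compound values above are already close to ±1. As far as we understand VADER's implementation, the compound score is the summed word valence `x` squashed by `x / sqrt(x² + α)` with `α = 15`, so long texts with many sentiment-bearing words tend to saturate. The sketch below illustrates that normalisation; the function name and the sample valence sums are illustrative assumptions, not values taken from this notebook.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into [-1, 1], mirroring VADER's normalisation."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums: a short text vs. increasingly long, opinionated texts.
for total in [1.0, 4.0, 10.0, 25.0]:
    print(f"sum={total:>5}: compound={normalize_compound(total):.4f}")
```

If this reading is right, a long review needs only a modest surplus of positive words to land near +1, which may partly explain why negative reviews with mixed wording are scored as positive.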


# Classifying reviews based on default threshold


In [None]:
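# Apply a compound-score threshold to classify reviews.
# VADER's conventional cutoffs are: positive if compound >= 0.05, negative if compound <= -0.05, neutral otherwise.
# Because the labels here are binary, we fold the neutral band into the negative class:
# - Positive review: compound score >= 0.05
# - Negative review: compound score < 0.05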

df['Predicted Sentiment'] = df['compound'].apply(
    lambda score: 1 if score >= 0.05 else 0
)

print("\nFirst few rows with predictions:")
print(df[['Content', 'Predicted Sentiment', 'compound', 'label']].head())


correct_predictions = (df['Predicted Sentiment'] == df['label']).sum()
accuracy = correct_predictions / len(df)
print(f"\nAccuracy of VADER sentiment analysis (threshold=0.05): {accuracy:.2f}")



First few rows with predictions:
                                             Content  Predicted Sentiment  \
0  Not even Goebbels could have pulled off a prop...                    0   
1  A plot that fizzled and reeked of irreconcilab...                    0   
2  The first look on the cover of this picture, i...                    1   
3  A drama at its very core, "Anna" displays that...                    1   
4  When THE MAGIC OF LASSIE opened at Radio City ...                    1   

   compound  label  
0   -0.5218      0  
1   -0.9704      0  
2    0.8690      0  
3    0.9866      0  
4    0.7453      0  

Accuracy of VADER sentiment analysis (threshold=0.05): 0.70
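
To make the classification rule concrete, the sketch below applies it by hand to the five rows printed above. The compound scores and labels are copied from that output; `classify` is an illustrative helper rather than part of the notebook. These five rows are all negative reviews, so their accuracy is not representative of the full dataset.

```python
# Compound scores and true labels copied from the five rows printed above.
compounds = [-0.5218, -0.9704, 0.8690, 0.9866, 0.7453]
labels = [0, 0, 0, 0, 0]

def classify(compound, threshold=0.05):
    """Return 1 (positive) if the compound score reaches the threshold, else 0."""
    return 1 if compound >= threshold else 0

predictions = [classify(c) for c in compounds]
sample_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)      # [0, 0, 1, 1, 1]
print(sample_accuracy)  # 0.4
```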


# Analysis of Results

In [None]:
# This function finds unknown words in a text.
# It tokenizes the text and then identifies tokens that are not present in VADER's lexicon.
# Tokens outside the lexicon contribute no valence to the score, so this highlights a limitation of VADER
# and is one possible reason for the lower accuracy observed.
# Note that the lexicon only lists sentiment-bearing tokens, so many neutral words are expected to be missing.

def find_unknown_words(text):
    tokens = nltk.word_tokenize(text)
    unknown_words = [word for word in tokens if word.lower() not in analyzer.lexicon]
    return unknown_words

df['unknown_words'] = df['Content'].apply(find_unknown_words)

# Flatten the list of unknown words across all reviews.
# A list comprehension runs in linear time, whereas summing lists copies the growing result on every step.
all_unknown_words = [word for words in df['unknown_words'] for word in words]
unique_unknown_words = list(set(all_unknown_words))  # Unique unknown words

print(f"\nSome unknown words not in VADER's lexicon: {unique_unknown_words[:10]}")


Some unknown words not in VADER's lexicon: ['Stallone\x97that', 'Deathstalker', 'more.Robin', 'harkness', 'chiselled', 'ills', 'rs', 'Bluest', 'Pitts', 'list.Rating']
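
Entries such as `'more.Robin'` and `'list.Rating'` look like two sentences fused together. One plausible cause (an assumption, since the raw reviews are not shown here) is that `<br />` tags between sentences are removed without leaving a space behind. The sketch below uses a regex as a stand-in for the BeautifulSoup step and a made-up review fragment to show the effect. With BeautifulSoup itself, `get_text(separator=' ')` would be the analogous option.

```python
import re

# Hypothetical raw fragment; the real reviews are not reproduced here.
raw = "I wanted more.<br /><br />Robin steals every scene."

def strip_tags(text, replacement):
    """Replace every HTML tag with `replacement`, then collapse whitespace."""
    text = re.sub(r'<[^>]+>', replacement, text)
    return re.sub(r'\s+', ' ', text).strip()

fused = strip_tags(raw, '')    # tags deleted outright
spaced = strip_tags(raw, ' ')  # tags replaced by a space

print(fused)   # I wanted more.Robin steals every scene.
print(spaced)  # I wanted more. Robin steals every scene.
```

Fused tokens like these can never match a lexicon entry, so they inflate the unknown-word list.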


Extracting wrongly classified examples and more performance metrics.

In [None]:
# Identify reviews where the predicted sentiment does not match the true label.
# .copy() gives an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
wrongly_labeled = df[df['Predicted Sentiment'] != df['label']].copy()

def extract_words(text):
    return nltk.word_tokenize(text)

wrongly_labeled['Words'] = wrongly_labeled['Content'].apply(extract_words)

wrongly_labeled_table = wrongly_labeled[['Content', 'Predicted Sentiment', 'label', 'Words']]
print("\nSome wrongly labeled examples:")
print(wrongly_labeled_table.head())


# Compute accuracy and other classification metrics (precision, recall, f1-score).
def compute_metrics(eval_pred):
    logits, labels = eval_pred
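    # as_tensor reuses an existing tensor instead of copy-constructing it, which avoids a PyTorch warning.
    predictions = torch.argmax(torch.as_tensor(logits), dim=-1)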

    acc = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Convert VADER compound scores to pseudo-probabilities so they can be passed through compute_metrics.
# The compound score ranges from -1 (most negative) to 1 (most positive). We map it to pos_prob = (compound + 1) / 2 and neg_prob = 1 - pos_prob.
# Note: these are probabilities rather than true logits, and taking the argmax is equivalent to a cutoff at compound > 0, not at the 0.05 threshold used earlier.
logits = []
for c in df['compound']:
    pos_prob = (c + 1.0) / 2.0
    neg_prob = 1.0 - pos_prob
    logits.append([neg_prob, pos_prob])

# Convert the logits and labels into torch tensors.
logits = torch.tensor(logits)
labels = torch.tensor(df['label'].values)
eval_pred = (logits, labels)

metrics_result = compute_metrics(eval_pred)
print("\nDetailed Metrics with default threshold interpretation:")
for k, v in metrics_result.items():
    print(f"{k.capitalize()}: {v:.3f}")





Some wrongly labeled examples:
                                             Content  Predicted Sentiment  \
2  The first look on the cover of this picture, i...                    1   
3  A drama at its very core, "Anna" displays that...                    1   
4  When THE MAGIC OF LASSIE opened at Radio City ...                    1   
5  There are few uplifting things to say about th...                    1   
7  I find it rather useless to comment on this "m...                    1   

   label                                              Words  
2      0  [The, first, look, on, the, cover, of, this, p...  
3      0  [A, drama, at, its, very, core, ,, ``, Anna, '...  
4      0  [When, THE, MAGIC, OF, LASSIE, opened, at, Rad...  
5      0  [There, are, few, uplifting, things, to, say, ...  
7      0  [I, find, it, rather, useless, to, comment, on...  

Detailed Metrics with default threshold interpretation:
Accuracy: 0.696
Precision: 0.649
Recall: 0.855
F1: 0.738
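
Because the two pseudo-probabilities sum to 1, taking the argmax amounts to asking whether the compound score is above 0, so these metrics effectively use a cutoff at 0 rather than 0.05. The pure-Python check below illustrates this; `argmax_class` is an illustrative helper, and it assumes ties resolve to class 0.

```python
def argmax_class(compound):
    """Mimic argmax over [neg_prob, pos_prob]; ties go to class 0 here."""
    pos_prob = (compound + 1.0) / 2.0
    neg_prob = 1.0 - pos_prob
    return 1 if pos_prob > neg_prob else 0

# VADER rounds compound scores to four decimals, so check that whole grid.
grid = [k / 10000 for k in range(-10000, 10001)]
mismatches = [c for c in grid if argmax_class(c) != int(c > 0)]

print(len(mismatches))                        # 0
print(argmax_class(0.03), int(0.03 >= 0.05))  # 1 0
```

A review with a compound score of 0.03 is therefore counted as positive here but as negative under the 0.05 rule.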


# Exploring different thresholds

In [None]:
# Here, we define a list of thresholds to test. Adjusting the threshold can help improve the accuracy of the VADER model.
thresholds = [-0.1, 0.0, 0.05, 0.1, 0.2, 0.5]
results = []

for t in thresholds:
    # Vectorized comparison; equivalent to applying a per-row lambda.
    df['Predicted_T'] = (df['compound'] >= t).astype(int)

    y_true = df['label']
    y_pred = df['Predicted_T']
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')

    results.append({
        'Threshold': t,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1': f1
    })

print("\nThreshold Tuning Results:")
for r in results:
    print(f"Threshold={r['Threshold']}: Accuracy={r['Accuracy']:.3f}, "
          f"Precision={r['Precision']:.3f}, Recall={r['Recall']:.3f}, F1={r['F1']:.3f}")



Threshold Tuning Results:
Threshold=-0.1: Accuracy=0.694, Precision=0.645, Recall=0.860, F1=0.737
Threshold=0.0: Accuracy=0.696, Precision=0.649, Recall=0.856, F1=0.738
Threshold=0.05: Accuracy=0.697, Precision=0.650, Recall=0.853, F1=0.738
Threshold=0.1: Accuracy=0.699, Precision=0.652, Recall=0.851, F1=0.738
Threshold=0.2: Accuracy=0.702, Precision=0.657, Recall=0.844, F1=0.739
Threshold=0.5: Accuracy=0.712, Precision=0.676, Recall=0.812, F1=0.738
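
F1 is the harmonic mean of precision and recall, so the nearly constant F1 column follows from precision rising about as fast as recall falls. The check below recomputes F1 from the rounded precision and recall values printed above. It uses only numbers from that table, and `f1_score_from` is an illustrative helper.

```python
# (threshold, precision, recall, reported F1) copied from the results above.
rows = [
    (-0.1, 0.645, 0.860, 0.737),
    (0.0, 0.649, 0.856, 0.738),
    (0.05, 0.650, 0.853, 0.738),
    (0.1, 0.652, 0.851, 0.738),
    (0.2, 0.657, 0.844, 0.739),
    (0.5, 0.676, 0.812, 0.738),
]

def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed = [round(f1_score_from(p, r), 3) for _, p, r, _ in rows]
print(recomputed)  # [0.737, 0.738, 0.738, 0.738, 0.739, 0.738]
```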


# Conclusion

In conclusion, of the thresholds tested, 0.5 yields the highest accuracy (0.712). The gain comes from higher precision (0.676) at the cost of lower recall (0.812), so F1 stays at roughly 0.738 across all thresholds. Because 0.5 is also the largest value tested, it should be read as the best of this grid rather than a true optimum. Even at this threshold, the overall accuracy remains relatively low, which suggests that VADER may not be sufficiently effective for sentiment analysis of movie reviews.