# Sentiment Analysis Ensemble Labeling using VADER and TextBlob

This notebook implements an ensemble approach to sentiment labeling by combining predictions from VADER and TextBlob sentiment analyzers. The process includes:

1. **Dual Annotation**: Apply both VADER and TextBlob to generate sentiment labels
2. **Agreement Analysis**: Calculate Cohen's Kappa to measure inter-annotator agreement
3. **Ensemble Creation**: Combine the two labels with a conservative voting rule to create final consensus labels

In [None]:
import pandas as pd
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

file_path = '/content/GenAI in education Dataset - latest cleaned.csv'
data = pd.read_csv(file_path)

data = data.drop(columns=['link'])

vader_analyzer = SentimentIntensityAnalyzer()

def vader_sentiment(text):
    score = vader_analyzer.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

def textblob_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'negative'
    else:
        return 'neutral'


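# Treat missing text as an empty string so both analyzers always receive a str
texts = data['text_clean'].fillna('').astype(str)
data['vader_label'] = texts.apply(vader_sentiment)
data['textblob_label'] = texts.apply(textblob_sentiment)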

# Save comparison CSV
data.to_csv('sentiment_labels_comparison.csv', index=False)

## Step 1: Dual Sentiment Annotation

We'll apply two different sentiment analysis approaches to create independent labels:

### VADER (Valence Aware Dictionary and sEntiment Reasoner)
- **Compound Score Thresholds**:
  - Positive: ≥ 0.05
  - Negative: ≤ -0.05  
  - Neutral: strictly between -0.05 and 0.05

### TextBlob
- **Polarity Score Thresholds**:
  - Positive: > 0
  - Negative: < 0
  - Neutral: = 0

Both methods will analyze the `text_clean` field to generate sentiment classifications.
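
The threshold rules above can be sketched as two small pure functions that map a raw score to a label. The names `label_from_compound` and `label_from_polarity` are hypothetical, and both take a precomputed score as input, so the sketch runs without either library installed.

```python
def label_from_compound(score: float) -> str:
    """Map a VADER compound score in [-1, 1] to a label using the ±0.05 cutoffs."""
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


def label_from_polarity(polarity: float) -> str:
    """Map a TextBlob polarity in [-1, 1] to a label; only exactly 0 is neutral."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# The same weak score (illustrative value) is labeled differently by the two rules.
small_score = 0.03
labels = (label_from_compound(small_score), label_from_polarity(small_score))
print(labels)
```

Because the TextBlob rule has no neutral band, weakly scored texts are a likely source of disagreement between the two annotators.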

In [None]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Load the CSV with labels
file_path = 'sentiment_labels_comparison.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values in the sentiment label columns
data_cleaned = data.dropna(subset=['vader_label', 'textblob_label'])

# Calculate Cohen's Kappa
kappa = cohen_kappa_score(data_cleaned['vader_label'], data_cleaned['textblob_label'])

print(f"Cohen's Kappa between VADER and TextBlob labels: {kappa:.3f}")

Cohen's Kappa between VADER and TextBlob labels: 0.307


## Step 2: Inter-Annotator Agreement Analysis

Now we'll measure the agreement between VADER and TextBlob using **Cohen's Kappa coefficient**.

### Cohen's Kappa Interpretation:
- **κ < 0.20**: Poor agreement
- **0.20 ≤ κ < 0.40**: Fair agreement  
- **0.40 ≤ κ < 0.60**: Moderate agreement
- **0.60 ≤ κ < 0.80**: Good agreement
- **κ ≥ 0.80**: Very good agreement

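This metric measures how consistently the two methods classify sentiment after correcting for the agreement expected by chance. The value obtained here, κ = 0.307, falls in the "fair" band, so VADER and TextBlob disagree on a substantial share of the texts.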
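
To make the metric concrete, here is a minimal pure-Python sketch of the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The helper names `cohen_kappa` and `interpret_kappa` and the toy labels are assumptions for illustration; the notebook itself relies on `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal in length")
    # Observed agreement: share of items where both annotators match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Return the agreement band from the scale listed above."""
    if kappa < 0.20:
        return "poor"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "good"
    return "very good"


# Toy labels, made up for illustration only
vader = ["positive", "positive", "negative", "neutral", "negative", "positive"]
textblob = ["positive", "neutral", "negative", "neutral", "positive", "positive"]
toy_kappa = cohen_kappa(vader, textblob)
print(f"{toy_kappa:.3f} -> {interpret_kappa(toy_kappa)}")
```

In the toy data the annotators match on 4 of 6 items, but part of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.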

## Step 3: Ensemble Labeling Strategy

We'll create consensus labels using a **voting approach** over the two annotators. With only two voters there is no tie-breaker, so the rules below require both to agree before a text is labeled positive:

### Voting Rules:
1. **Unanimous Positive**: Both VADER and TextBlob agree on "positive" → **positive**
2. **Unanimous Negative**: Both VADER and TextBlob agree on "negative" → **negative**
3. **Everything Else**: Any disagreement, or any pair involving "neutral" (including both neutral) → **negative** (conservative approach)

This conservative approach ensures that only texts both tools rate as positive are labeled positive. Neutral and ambiguous cases default to negative, which collapses the three classes into two. The aim is a higher-precision positive class in the training dataset for subsequent machine learning models.
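
The voting rules above can be enumerated over all nine label pairs to show exactly which combinations end up positive. This is a small sketch; `consensus_label` and `decision_table` are hypothetical names, and the function takes the two labels directly rather than a DataFrame row.

```python
from itertools import product

LABELS = ("positive", "neutral", "negative")


def consensus_label(vader_label: str, textblob_label: str) -> str:
    """Positive only on unanimous positive; every other pair is negative."""
    if vader_label == "positive" and textblob_label == "positive":
        return "positive"
    return "negative"


# Enumerate all 3 x 3 label pairs and their consensus label
decision_table = {
    (v, t): consensus_label(v, t) for v, t in product(LABELS, repeat=2)
}
for (v, t), label in decision_table.items():
    print(f"VADER={v:<8} TextBlob={t:<8} -> {label}")
```

Only one of the nine pairs yields a positive label, which makes the asymmetry of the rule explicit.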

In [None]:
from collections import Counter

def ensemble_vote(row):
    # Collect the two annotator labels for this row
    votes = [row['vader_label'], row['textblob_label']]

    # Count votes
    vote_counts = Counter(votes)

    # Positive only when both annotators agree on it
    if vote_counts['positive'] == 2:
        return 'positive'
    # Unanimous negative, any disagreement, or any neutral vote
    return 'negative'

In [None]:
import pandas as pd
from collections import Counter

df = pd.read_csv("/content/GenAI in education Dataset - sentiment_labels_comparison.csv")

# Apply ensemble voting
df['ensemble_label'] = df.apply(ensemble_vote, axis=1)
df.to_csv("ensemble_sentiment_labels.csv", index=False)

## Step 4: Apply Ensemble Voting and Save Results

Finally, we'll apply the ensemble voting function to create the final consensus labels and save the annotated dataset for use in downstream machine learning tasks.

### Output Dataset
- **File**: `ensemble_sentiment_labels.csv`
- **New Column**: `ensemble_label` (positive/negative)
- **Purpose**: Training data for supervised sentiment analysis models

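This ensemble-labeled dataset will serve as the reference labels for training more sophisticated sentiment analysis models like BERT or DistilBERT. The labels are generated automatically by two lexicon-based tools, so they are weak labels rather than human-verified ground truth.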