# Day-54: Sentiment Analysis Using Text Data

We've spent the last few days structuring text and extracting core entities. Today, we move into one of the most commercially valuable areas of NLP: Sentiment Analysis. We're going to determine the emotional tone or attitude (positive, negative, or neutral) expressed in a piece of text. We'll cover a powerful lexicon-based tool, VADER, and a fundamental machine learning approach, Logistic Regression.

## Topics Covered
- VADER
- Logistic Regression for Polarity Detection


## VADER: The Lexicon-Based Approach

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a popular lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media. It doesn't require any training data; it works by looking up words in a dictionary.

- `How it Works`: VADER uses a lexicon in which each word carries a pre-assigned valence score on a scale from −4 (most negative) to +4 (most positive). A strongly positive word such as "amazing" scores well above zero, and "awful" scores well below it. VADER also applies grammatical rules for context:
    - Punctuation: Exclamation marks increase intensity (e.g., "Great!!" is more positive than "Great.").
    - Capitalization: Words in all caps gain intensity (e.g., "The food is GREAT" is more positive than "The food is great").
    - Degree Modifiers (Adverbs): Adjust intensity up or down (e.g., "slightly good" vs. "extremely good").
    - Negation: Flips the polarity of the word that follows (e.g., "not good" scores as negative).

- `Analogy`: The Scorecard. VADER assigns a numerical score to every word and then uses a simple formula to combine those scores, adjust for context, and produce a final, quantifiable sentiment score for the entire sentence.

- `Output`: VADER returns four scores for each text: `neg`, `neu`, and `pos` (the proportions of the text rated negative, neutral, and positive) and `compound` (a single normalized score between −1 and +1 that summarizes the overall sentiment).
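
The scorecard idea can be sketched in a few lines of plain Python. This is a toy illustration, not VADER's actual implementation: the mini-lexicon, the booster values, and the constants below are all made up for the example, and the normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` is an assumption about how the compound score is usually described.

```python
import math
import string

# Hypothetical mini-lexicon: these valence values are invented for illustration.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "awful": -2.5, "slow": -1.0}
# Hypothetical degree modifiers: positive values intensify, negative values dampen.
TOY_BOOSTERS = {"extremely": 0.3, "very": 0.3, "slightly": -0.3}
TOY_NEGATIONS = {"not", "never", "no"}

NEGATION_SCALAR = -0.74  # assumed: flips the sign and slightly dampens the score
CAPS_BOOST = 0.7         # assumed: extra intensity for an ALL-CAPS word
EXCLAIM_BOOST = 0.3      # assumed: extra intensity per "!" (capped at four)
ALPHA = 15               # assumed normalization constant


def normalize(raw_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def toy_compound(text):
    """Return a compound-style score for `text` using the toy rules above."""
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    lowered = [w.lower() for w in words]

    total = 0.0
    for i, word in enumerate(lowered):
        if word not in TOY_LEXICON:
            continue  # words without a lexicon entry contribute nothing
        valence = TOY_LEXICON[word]
        sign = 1.0 if valence > 0 else -1.0

        # Capitalization: an ALL-CAPS word is pushed further from zero.
        if words[i].isupper():
            valence += sign * CAPS_BOOST
        # Degree modifier directly before the word adjusts its intensity.
        if i > 0 and lowered[i - 1] in TOY_BOOSTERS:
            valence += sign * TOY_BOOSTERS[lowered[i - 1]]
        # Negation within the two preceding words flips the polarity.
        if any(tok in TOY_NEGATIONS for tok in lowered[max(0, i - 2):i]):
            valence *= NEGATION_SCALAR

        total += valence

    # Punctuation: exclamation marks amplify whatever sentiment was found.
    if total != 0:
        emphasis = min(text.count("!"), 4) * EXCLAIM_BOOST
        total += emphasis if total > 0 else -emphasis

    return normalize(total)


for example in ["Great.", "Great!!", "GREAT", "slightly good", "extremely good", "not good"]:
    print(f"{example!r}: {toy_compound(example):+.3f}")
```

Running it shows each rule moving the score in the direction described in the list: more exclamation marks, capitals, or "extremely" push the score away from zero, while "not" flips its sign.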

## Logistic Regression: The Machine Learning Approach 

For many real-world applications, you need a custom model trained on your specific data (e.g., reviews about your specific product). This is where supervised machine learning comes in. Logistic Regression is a strong, widely used baseline classification algorithm for sentiment analysis.

- `How it Works`: Feature extraction comes first. Text is converted into numerical features, typically TF-IDF vectors (from Day 51) or Bag-of-Words counts.

- `Training`: The Logistic Regression algorithm learns one weight per feature, so that a weighted sum of a text's features separates the labeled classes (e.g., a "Positive" label or a "Negative" label).

- `Prediction`: When a new piece of text is fed into the model, the weighted sum is passed through the sigmoid function, which outputs a probability (between 0 and 1) that the text belongs to a certain class (e.g., 0.95 probability of being positive).

- `Analogy`: The Opinion Pollster. Instead of relying on a pre-written dictionary (like VADER), the model learns from thousands of previous opinions (training data) to predict the outcome of a new one. Because each word's weight comes from the training data, the same word can count as positive in one domain and negative in another (e.g., "slow" might be positive for a coffee brewer, but negative for a computer).

- `Advantage over VADER`: It adapts to domain-specific language (e.g., "sick" meaning "good" in a teenage slang dataset).

## Code Example: VADER and Logistic Regression (BoW)

We use VADER (from nltk.sentiment) and scikit-learn for machine learning classification.

```python
!pip install nltk
```

```python
# Ensure VADER lexicon is downloaded (run once)
import nltk
nltk.download('vader_lexicon')
```


```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# --- 1. VADER Example (Lexicon-Based) ---
print("--- VADER Sentiment Analysis ---")
vader_analyzer = SentimentIntensityAnalyzer()
sentence = "The food was absolutely fantastic, but the service was extremely slow."

# VADER analysis: returns a dict with 'neg', 'neu', 'pos', and 'compound'
vs = vader_analyzer.polarity_scores(sentence)
print(f"Text: '{sentence}'")
print(f"VADER Scores: {vs}")

# Interpret the compound score
if vs['compound'] >= 0.05:
    sentiment = "Positive"
elif vs['compound'] <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print(f"Overall Sentiment: {sentiment}\n")
# Note: VADER gives less weight to sentiment before 'but' and more to what follows.
# A 'neg' score of 0.0 in the output means no word in the sentence was scored as
# negative, so "slow" adds nothing here and the overall result stays positive.

# --- 2. Logistic Regression Example (Machine Learning) ---
print("--- Logistic Regression (ML-Based) ---")

# Simulated training data (labels: 0 = Negative, 1 = Positive)
data = {
    'text': [
        'I love this product',
        'This is awful, terrible service',
        'Great job, highly recommend',
        'The quality is disappointing',
        'It works fine but nothing special',
    ],
    'label': [1, 0, 1, 0, 0],
}
df = pd.DataFrame(data)

# A. Feature Extraction (Count Vectorizer / BoW)
# Note: fitting the vectorizer on all rows before splitting lets test-set vocabulary
# leak into the features. In a real project, fit it on the training split only.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['label']

# B. Train/Test Split (with 5 rows, test_size=0.4 leaves 2 rows for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# C. Model Training
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# D. Prediction and Evaluation
y_pred = lr_model.predict(X_test)
print(f"Predicted Labels on Test Set: {y_pred}")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

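```text
--- VADER Sentiment Analysis ---
Text: 'The food was absolutely fantastic, but the service was extremely slow.'
VADER Scores: {'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'compound': 0.3499}
Overall Sentiment: Positive

--- Logistic Regression (ML-Based) ---
Predicted Labels on Test Set: [1 1]
Accuracy: 0.00
```

Don't read too much into the accuracy of 0.00. With only five labeled sentences, the model trains on three and is tested on two, and it mislabeled both. The number reflects the toy dataset, not the method.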
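
A slightly more careful version of the same workflow might look like the sketch below. It assumes scikit-learn is installed, and the twelve reviews are hypothetical, invented only to give each class a few examples. Two choices differ from the example above: the split is stratified so both classes appear in training and testing, and the vectorizer sits inside a pipeline so its vocabulary is learned from the training split only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews (1 = positive, 0 = negative).
texts = [
    "I love this product",
    "Great job, highly recommend",
    "Excellent quality and fast delivery",
    "Works perfectly, very happy",
    "Fantastic value for the price",
    "Best purchase I have made",
    "This is awful, terrible service",
    "The quality is disappointing",
    "Broke after two days",
    "Very poor build, do not buy",
    "Terrible experience, want a refund",
    "Worst purchase I have made",
]
labels = [1] * 6 + [0] * 6

# Split the raw text first; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on the training text only,
# then applies the same vocabulary to the test text.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test sentences: {X_test}")
print(f"Predicted labels: {list(y_pred)}")
print(f"Accuracy: {accuracy:.2f}")
```

Even so, a test set of three sentences is still far too small to say anything reliable about accuracy; real evaluations need hundreds or thousands of labeled examples.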


## Summary of Day 54

Today, you learned two key approaches to Sentiment Analysis:
1. VADER: A fast, lexicon-based tool that is excellent for social media text and provides granular positive, negative, and compound scores without any training.

2. Logistic Regression: A fundamental machine learning classifier that requires labeled training data but can be customized to detect domain-specific sentiment using TF-IDF or BoW features.

## What's Next (Day 55)

We've covered what text is (words), what text means (vectors), and what text feels (sentiment). Tomorrow, on Day 55, we look at what text is about by exploring Topic Modeling using Latent Dirichlet Allocation (LDA). You'll learn how to automatically discover the hidden themes and subjects within large collections of documents.