
# 🧠 Sentiment Analysis using Naive Bayes (NLP Hands-on Activity)

### 🎯 Objective
This notebook shows how to perform **sentiment analysis** on the NLTK `movie_reviews` corpus using a **Multinomial Naive Bayes classifier** trained on bag-of-words counts.

---
## 🧩 Structured vs. Unstructured Data

**Structured Data:** Organized in rows and columns (e.g., sales data, exam scores).  
**Unstructured Data:** Text, images, videos, or audio with no predefined format.

Natural Language Processing (NLP) gives computers ways to **analyze unstructured text data**, for example by converting it into numeric features that a model can learn from.

---


In [None]:

import nltk
from nltk.corpus import movie_reviews
import random
import pandas as pd

# Download dataset if not already available
nltk.download('movie_reviews')

# Build (token list, label) pairs for every review; the labels are 'neg' and 'pos'
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle with a fixed seed so the row order (and the later split) is reproducible
random.seed(42)
random.shuffle(documents)

# Convert to DataFrame for convenience
reviews = [" ".join(words) for words, category in documents]
labels = [category for words, category in documents]
df = pd.DataFrame({"review": reviews, "label": labels})
df.head()



## ✂️ Tokenization and Text Preprocessing

**Tokenization:** Breaking text into words or subwords (tokens).  
**Stopwords:** Common words (e.g., "the", "is") that are usually removed.  
**Bag of Words (BoW):** Represents text by counting word frequencies, ignoring grammar and order.
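
To make the three terms above concrete, here is a minimal pure-Python sketch of tokenization, stopword removal, and bag-of-words counting. The tiny `STOPWORDS` set and the `tokenize` and `bag_of_words` helpers are illustrative assumptions, not scikit-learn's implementation; `CountVectorizer` applies the same idea with its own tokenizer and a much longer English stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"the", "is", "a", "was", "and", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Count how often each non-stopword token occurs, ignoring word order."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(tokens)

bow = bag_of_words("The plot was great and the acting was great")
print(bow)
```

Because only counts are kept, `bag_of_words("great plot")` and `bag_of_words("plot great")` give the same result, which is what "ignoring grammar and order" means in practice.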


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['label'], test_size=0.2, stratify=df['label'], random_state=42
)

# Build bag-of-words counts; the vocabulary is fitted on the training split only
vectorizer = CountVectorizer(stop_words='english', max_features=3000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict on test data
y_pred = model.predict(X_test_vec)



## ⚙️ Model Training and Prediction

- **`fit()`**: The model learns class priors and per-class word probabilities from the training data (word counts and their labels).  
- **`predict()`**: The model assigns each new, unseen review the more probable label, `pos` or `neg`.
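
As a rough illustration of what these two calls do, the sketch below implements a small multinomial Naive Bayes from scratch. The `fit` and `predict` functions, the `model` dictionary layout, and the four toy documents are all assumptions made for this example; they are not scikit-learn's internals. The smoothing value `alpha=1.0` mirrors the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def fit(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # per-class token counts
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            tok: math.log((word_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model

def predict(model, doc):
    """Return the label with the highest log score for one token list."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        # Tokens that never appeared in training are skipped
        scores[label] = log_prior + sum(
            model["log_lik"][label][tok] for tok in doc if tok in model["vocab"]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set, already tokenized
train_docs = [
    ["great", "fun", "film"],
    ["loved", "great", "acting"],
    ["boring", "awful", "film"],
    ["awful", "plot"],
]
train_labels = ["pos", "pos", "neg", "neg"]

nb = fit(train_docs, train_labels)
print(predict(nb, ["great", "film"]))
print(predict(nb, ["awful", "boring"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.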


In [None]:

# Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, zero_division=0))
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))



## 📊 Model Evaluation

Metrics used to assess model performance:

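- **Accuracy:** Fraction of all reviews classified correctly  
- **Precision:** Of the reviews predicted as a given class, the fraction that truly belong to it  
- **Recall:** Of the reviews that truly belong to a class, the fraction the model finds  
- **F1-score:** Harmonic mean of precision and recall  

A **Confusion Matrix** counts correct and incorrect predictions for each class, with true labels as rows and predicted labels as columns.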
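
The relationship between these metrics and the confusion matrix can be worked out by hand. The sketch below uses made-up counts (not results from this notebook) arranged the way `confusion_matrix` returns them for the sorted labels `['neg', 'pos']`, and treats `pos` as the positive class.

```python
# Hypothetical confusion matrix: rows are true labels, columns are
# predicted labels, both ordered ['neg', 'pos'].
cm = [[160, 40],
      [30, 170]]

tn, fp = cm[0]   # true 'neg' reviews: predicted neg, predicted pos
fn, tp = cm[1]   # true 'pos' reviews: predicted neg, predicted pos

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # how many predicted 'pos' are really 'pos'
recall = tp / (tp + fn)             # how many real 'pos' were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

`classification_report` repeats this calculation once per class, treating each label in turn as the positive class.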


In [None]:

# Try custom predictions
test_sentences = [
    "The movie was fantastic! I loved it.",
    "It was the worst film I have ever seen.",
    "An average storyline but good acting."
]

test_vec = vectorizer.transform(test_sentences)
preds = model.predict(test_vec)

for sent, pred in zip(test_sentences, preds):
    print(f"'{sent}' --> Predicted sentiment: {pred}")



## 💬 What is Sentiment Analysis?

**Sentiment Analysis** is the process of determining whether a piece of text expresses a **positive**, **negative**, or **neutral** sentiment. The model in this notebook is binary: it predicts only `pos` or `neg`.

It’s widely used in:
- Product reviews  
- Customer feedback monitoring  
- Social media trend analysis  
- Brand reputation management


## Student Exercise: Twitter Sentiment Analysis

Now that you’ve explored text classification on a small built-in dataset, your next challenge is to apply the same NLP pipeline to real-world social media data.

### Objective

Perform sentiment analysis on tweets from the Twitter Sentiment Analysis (Sentiment140) dataset using Naive Bayes.

### Dataset Description

**Source:** [Kaggle – Sentiment140 Dataset with 1.6 million tweets](https://www.kaggle.com/datasets/kazanova/sentiment140)

**Labels:**
- `0` → Negative sentiment
- `4` → Positive sentiment

**Columns:**
- `target` – Sentiment label (0 or 4)
- `ids` – Tweet ID
- `date` – Date of tweet
- `flag` – Query flag (not relevant)
- `user` – Username
- `text` – Tweet content

### Instructions

1. Download the dataset from Kaggle:
   https://www.kaggle.com/datasets/kazanova/sentiment140

2. Load a sample subset (e.g., 10,000–20,000 rows) for quicker experimentation:

       import pandas as pd
       data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)
       data.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
       df = data.sample(20000, random_state=42)

3. Clean the text:
   - Convert to lowercase
   - Remove URLs, mentions (@user), hashtags, and special characters

4. Preprocess and vectorize: use `CountVectorizer` or `TfidfVectorizer` from `sklearn.feature_extraction.text`.

5. Train a Naive Bayes classifier:

       from sklearn.naive_bayes import MultinomialNB
       model = MultinomialNB()
       # X_train must be the vectorized feature matrix, not the raw tweet text
       model.fit(X_train, y_train)

6. Evaluate the model. Compute and print:
   - Accuracy
   - Precision
   - Recall
   - F1-score
   - Confusion matrix

**Optional extensions:**
- Compare `CountVectorizer` vs. `TfidfVectorizer`
- Create a WordCloud for the most common positive/negative words
- Try limiting tweet length or using a different stopword list
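
As a starting point for the text-cleaning step, one possible helper is sketched below. The function name `clean_tweet`, the `LABEL_NAMES` mapping, and the sample tweet are hypothetical, and the regular expressions encode one set of choices: here the `#` symbol is dropped but the hashtag word is kept, which is only one reading of "remove hashtags".

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, '#' symbols, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                    # remove @mentions
    text = re.sub(r"#", " ", text)                       # drop '#', keep the tag word
    text = re.sub(r"[^a-z\s]", " ", text)                # remove digits and special characters
    return re.sub(r"\s+", " ", text).strip()             # collapse repeated whitespace

# Map Sentiment140's numeric targets to readable labels
LABEL_NAMES = {0: "negative", 4: "positive"}

print(clean_tweet("@bob LOVED the new film!!! http://t.co/xyz #MustSee"))
```

With a DataFrame loaded as in the instructions, the helper could be applied with `df['text'].apply(clean_tweet)`. Note that stripping every non-letter also splits contractions such as "don't" into "don t", so you may want to adjust the pattern.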