## 📊 Twitter Sentiment Analysis: A Multi-Stage NLP Project

This project focuses on building a robust Natural Language Processing (NLP) model to classify the sentiment of tweets about Apple and Google products. We'll approach this as an iterative process, starting with a simple binary classification and advancing to a more complex multiclass solution.

---

### **1. Business Problem & Project Goals** 🎯

Our primary goal is to create an automated system that can quickly and accurately gauge public opinion about major tech brands from social media data. The delivered model will let businesses monitor brand perception in real time and make **data-driven decisions** in marketing, product development, and public relations.

**Key Deliverables:**

* A well-documented, reproducible NLP pipeline.
* A binary classification model (Positive vs. Negative).
* A multiclass classification model (Positive, Negative, and Neutral).
* A clear analysis of model performance and business-relevant insights.

---

### **2. Data Preparation & Cleaning** 🧹

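Raw text data is inherently messy and requires careful cleaning before it can be used for modeling. Cleaning removes elements that carry little sentiment signal (links, handles, casing differences) so that the vectorizer sees a smaller, more consistent vocabulary.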

**Core Cleaning Steps:**

1.  **Standardization:** Convert all text to lowercase to ensure uniformity.
2.  **Noise Removal:** Use regular expressions to strip out irrelevant elements like URLs, user mentions (`@username`), and hashtags (`#`).
3.  **Tokenization:** Break down the cleaned text into individual words or "tokens."
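4.  **Stemming or Lemmatization:** Reduce words to their root form. **Lemmatization** is generally preferred because it maps words to a dictionary base form (e.g., "running" becomes "run"), whereas stemming can produce non-words. Note that NLTK's `WordNetLemmatizer` treats tokens as nouns by default, so verb forms such as "running" are only reduced when a part-of-speech tag (`pos="v"`) is supplied.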
5.  **Stop Word Removal:** Eliminate common, low-information words (e.g., "the," "is") that do not contribute to sentiment.
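
The five steps above can be sketched as a single function. This is a minimal illustration rather than the notebook's final implementation: the name `clean_tweet`, the tiny `DEMO_STOP_WORDS` set, and the optional `lemmatize` hook are assumptions made for this example. In the real pipeline the stop-word list would likely come from `nltk.corpus.stopwords` and the hook would be something like `WordNetLemmatizer().lemmatize`; both need downloaded NLTK data, so they are left out to keep the sketch self-contained.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
DEMO_STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "for", "it", "my"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")  # letters and apostrophes; digits are dropped


def clean_tweet(text, stop_words=DEMO_STOP_WORDS, lemmatize=None):
    """Lowercase, strip URLs/mentions/'#', tokenize, lemmatize, drop stop words."""
    text = text.lower()                      # 1. standardization
    text = URL_RE.sub(" ", text)             # 2. noise removal: URLs
    text = MENTION_RE.sub(" ", text)         #    user mentions
    text = text.replace("#", " ")            #    drop '#', keep the hashtag word
    tokens = TOKEN_RE.findall(text)          # 3. tokenization
    if lemmatize is not None:                # 4. optional lemmatization hook
        tokens = [lemmatize(tok) for tok in tokens]
    return [tok for tok in tokens if tok not in stop_words]  # 5. stop words


print(clean_tweet("@sxsw The #iPad line is HUGE http://bit.ly/abc"))
# ['ipad', 'line', 'huge']
```

Keeping the `#` symbol out but the hashtag word in is a design choice: words such as "ipad" inside a hashtag often name the product being discussed.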

---

### **3. Feature Engineering** 🛠️

Machine learning models require numerical input. This step is about converting our cleaned text tokens into a numerical representation.

**Vectorization Techniques:**

* **TF-IDF (Term Frequency-Inverse Document Frequency):** This is a highly effective method for text classification. It assigns a numerical score to each word that reflects its importance in a single tweet relative to the entire dataset. This down-weights words that appear in most tweets and highlights rarer, more distinctive words that are often the ones carrying sentiment.
* **N-grams:** Beyond single words, we will consider using n-grams (e.g., bigrams, trigrams). This captures the sentiment of multi-word phrases like **"not good"** or **"loved it,"** which a unigram bag-of-words approach would miss.
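
To make these two ideas concrete, the sketch below fits scikit-learn's `TfidfVectorizer` on a two-tweet toy corpus (invented for this example, not taken from the dataset) with `ngram_range=(1, 2)`, so that both single words and adjacent word pairs become features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus invented for illustration; not taken from the dataset.
toy_tweets = [
    "the battery is not good",
    "loved it great battery",
]

# ngram_range=(1, 2) keeps single words and adjacent word pairs.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(toy_tweets)

vocab = vectorizer.vocabulary_          # maps each n-gram to its column index
print("not good" in vocab)              # the bigram is kept as its own feature
print(tfidf_matrix.shape)               # (number of tweets, number of n-grams)

# "battery" occurs in both tweets and "good" in only one, so within the first
# tweet "battery" receives the lower TF-IDF weight.
row = tfidf_matrix[0].toarray().ravel()
print(row[vocab["battery"]] < row[vocab["good"]])
```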

---

### **4. Model Building: From Binary to Multiclass** 📈

This is the central part of the project, where we progressively build and refine our models.

#### **Stage 1: Binary Classification (Positive vs. Negative)**

* **Data Subset:** Filter the dataset to include only "positive" and "negative" tweets.
* **Model Selection:** We will start with a simple yet powerful algorithm, such as **Logistic Regression** or **Naive Bayes**. These models are excellent baselines for text classification and are highly interpretable.
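* **Pipeline Creation:** We'll use a `scikit-learn` pipeline to combine our TF-IDF vectorizer and our classifier. Because the vectorizer is fit inside the pipeline, its vocabulary and IDF weights are learned from the training split only, which keeps the workflow clean and prevents leakage from the test set.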
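
A minimal sketch of such a pipeline is shown here, assuming Logistic Regression as the baseline. The training sentences and the name `binary_clf` are invented for illustration; the project itself would fit on the filtered tweet dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data invented for illustration; the project uses the tweet dataset.
train_texts = [
    "love my new iphone", "great android app", "awesome ipad launch",
    "hate this battery", "terrible app crashes", "awful slow phone",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# The vectorizer lives inside the pipeline, so its vocabulary and IDF weights
# are learned only from whatever data is passed to fit().
binary_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
binary_clf.fit(train_texts, train_labels)

print(binary_clf.classes_)
predictions = binary_clf.predict(["love this great app", "terrible awful battery"])
print(predictions)
```

When this object is passed to `GridSearchCV` or another cross-validation helper, the vectorizer is refit on each training fold, which is where the leakage protection comes from.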

#### **Stage 2: Multiclass Classification (Positive, Negative, and Neutral)**

* **Data Extension:** Re-introduce the "neutral" tweets to the dataset.
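* **Model Adaptation:** The same classification algorithms can be used for multiclass problems; both Logistic Regression and Multinomial Naive Bayes in `scikit-learn` accept three or more classes without code changes.
* **Evaluation Focus:** Multiclass problems are more complex. Accuracy gives a general idea of performance, but it can look good even when a small class is almost never predicted correctly, so per-class precision, recall, and F1 need to be reported as well.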

---

### **5. Evaluation & Business-Informed Metrics** 📊

Evaluation is the most critical stage. The choice of metrics must be guided by the business problem.

#### **Choosing the Right Metrics:**

* **F1-Score (Macro Average):** This is the key metric for our project. A **macro-averaged F1-score** calculates the F1-score for each class independently and then takes the unweighted average. This matters for imbalanced datasets because every class counts equally, so a model cannot score well by getting the majority class right while ignoring a minority class. This aligns with our business goal of understanding sentiment across all three categories, not just the most common one.
* **Confusion Matrix:** This table shows exactly where the model makes mistakes (e.g., labeling a "negative" tweet as "neutral"), which helps us see which misclassifications are most problematic. For example, misclassifying a negative tweet as positive is likely more costly for brand monitoring than classifying it as neutral.
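
The gap between accuracy and macro F1 is easiest to see on a degenerate case. The sketch below uses ten invented labels and a "model" that always predicts the majority class; none of these numbers come from the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["negative", "neutral", "positive"]

# Invented toy labels: 6 neutral, 3 positive, 1 negative.
y_true = ["neutral"] * 6 + ["positive"] * 3 + ["negative"]
# A degenerate "model" that always predicts the majority class.
y_pred = ["neutral"] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy = {accuracy:.2f}")   # 0.60
print(f"macro F1 = {macro_f1:.2f}")   # 0.25
print(cm)
```

Accuracy is 0.60 because six of ten tweets are neutral, while macro F1 is 0.25: the neutral class scores 0.75 and the other two score 0, and the three are averaged with equal weight. The confusion matrix makes the failure visible as a single non-zero column.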

---

### **6. Insights & Conclusion** 📝

The final phase involves interpreting our model's performance and translating it into actionable business intelligence. We'll analyze common terms and phrases associated with each sentiment and provide concrete recommendations for Apple and Google based on our findings. The ultimate deliverable is a proof-of-concept that demonstrates the power of NLP for real-time sentiment analysis.
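
One simple way to surface common terms per sentiment is a per-class frequency count. The sketch below assumes tweets have already been cleaned into token lists; the `cleaned` data is invented for illustration.

```python
from collections import Counter, defaultdict

# Invented (tokens, sentiment) pairs standing in for cleaned tweets.
cleaned = [
    (["love", "ipad", "design"], "positive"),
    (["love", "android", "update"], "positive"),
    (["battery", "dies", "fast"], "negative"),
    (["battery", "crash", "again"], "negative"),
]

# One Counter per sentiment class, updated with that class's tokens.
term_counts = defaultdict(Counter)
for tokens, sentiment in cleaned:
    term_counts[sentiment].update(tokens)

for sentiment, counts in term_counts.items():
    print(sentiment, counts.most_common(1))
```

Raw counts tend to be dominated by product names, so inspecting the coefficients of a fitted linear model is a reasonable alternative to consider.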

In [14]:
# Core libraries
import pandas as pd
import numpy as np
import re
import string

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Sklearn - ML
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Persistence
import joblib
from pathlib import Path

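# Ensure NLTK resources
# Note: recent NLTK releases load the tokenizer from "punkt_tab"; if
# word_tokenize raises a LookupError, also run nltk.download("punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")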


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
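# Load dataset (Path builds the location portably instead of hard-coding "\")
file_path = Path("data") / "judge-1377884607_tweet_product_company.csv"

# Read as ISO-8859-1 (Latin-1), a common fallback when the default UTF-8 decoding fails
df = pd.read_csv(file_path, encoding="ISO-8859-1")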

# Basic dataset info
print("Dataset shape:", df.shape)
print("\nColumns:", df.columns.tolist())

# Display first few rows
df.head()

Dataset shape: (9093, 3)

Columns: ['tweet_text', 'emotion_in_tweet_is_directed_at', 'is_there_an_emotion_directed_at_a_brand_or_product']


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


The dataset has **9,093 tweets** and three columns:

* **`tweet_text`** → the actual tweet content (our input text).
* **`emotion_in_tweet_is_directed_at`** → the product or brand the emotion targets (e.g., iPhone, iPad, Google); this field can be missing.
* **`is_there_an_emotion_directed_at_a_brand_or_product`** → the raw sentiment label (e.g., "Positive emotion", "Negative emotion", "No emotion toward brand or product").

---

## 🧹 Step 2: Focus on Relevant Brands (Apple & Google)

### 🔎 Explanation

* The dataset covers multiple products/brands, but our project goal is **Apple vs Google**.
* We need to:

  1. **Map related terms** (e.g., "iPhone", "iPad", "MacBook" → Apple; "Android", "Nexus" → Google).
  2. Keep only those rows.
  3. Normalize the `is_there_an_emotion_directed_at_a_brand_or_product` column into simpler labels:

     * `positive`
     * `negative`
     * `neutral`

---


In [16]:
# Start fresh
df_clean = df.copy()

# Mapping from entity column
apple_entities = [
    "Apple", "iPhone", "iPad", "iPad or iPhone App", "Other Apple product or service"
]
google_entities = [
    "Google", "Android", "Android App", "Other Google product or service"
]

def map_brand_from_entity(entity):
    if pd.isna(entity):
        return None
    if entity in apple_entities:
        return "Apple"
    elif entity in google_entities:
        return "Google"
    else:
        return None

df_clean["brand"] = df_clean["emotion_in_tweet_is_directed_at"].apply(map_brand_from_entity)

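# --- Fallback: detect brand in tweet text ---
# Caveat: `kw in text` is a substring test, so short keywords can also match
# inside unrelated words (e.g. "mac" in "machine", "ios" in "curious").
apple_keywords = ["apple", "iphone", "ipad", "mac", "ipod", "ios", "imac", "macbook"]
google_keywords = ["google", "android", "nexus", "pixel", "gmail", "youtube", "chrome"]

def detect_brand_from_text(text):
    """Return "Apple" or "Google" if a keyword occurs in the text, else None."""
    text = str(text).lower()
    # Apple keywords are checked first, so a tweet naming both brands is labeled Apple
    for kw in apple_keywords:
        if kw in text:
            return "Apple"
    for kw in google_keywords:
        if kw in text:
            return "Google"
    return None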

# Fill missing brands from tweet text
df_clean.loc[df_clean["brand"].isna(), "brand"] = df_clean.loc[
    df_clean["brand"].isna(), "tweet_text"
].apply(detect_brand_from_text)

# Drop rows that still have no brand
df_clean = df_clean.dropna(subset=["brand"])

print("Remaining rows after entity + text detection:", df_clean.shape)

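# Normalize sentiment labels
def normalize_sentiment(label):
    """Collapse the raw label text into positive / negative / neutral."""
    label = str(label)  # guards against non-string values such as NaN
    if "Positive" in label:
        return "positive"
    elif "Negative" in label:
        return "negative"
    else:
        # Every other label (e.g. "No emotion toward brand or product") is treated as neutral
        return "neutral"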

df_clean["sentiment"] = df_clean[
    "is_there_an_emotion_directed_at_a_brand_or_product"
].apply(normalize_sentiment)

# Sentiment distribution check
print("\nSentiment counts:")
print(df_clean["sentiment"].value_counts())

df_clean.head(10)


Remaining rows after entity + text detection: (8338, 4)

Sentiment counts:
sentiment
neutral     4804
positive    2965
negative     569
Name: count, dtype: int64


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | brand | sentiment |
|---|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | Apple | negative |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | Apple | positive |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | Apple | positive |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | Apple | negative |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | Google | positive |
| 5 | @teachntech00 New iPad Apps For #SpeechTherapy... | | No emotion toward brand or product | Apple | neutral |
| 7 | #SXSW is just starting, #CTIA is around the co... | Android | Positive emotion | Google | positive |
| 8 | Beautifully smart and simple idea RT @madebyma... | iPad or iPhone App | Positive emotion | Apple | positive |
| 9 | Counting down the days to #sxsw plus strong Ca... | Apple | Positive emotion | Apple | positive |
| 10 | Excited to meet the @samsungmobileus at #sxsw ... | Android | Positive emotion | Google | positive |
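
One thing to keep in mind about the text fallback in the cell above: `kw in text` is a substring test, so a keyword like "mac" also matches "machine" and "ios" matches "curious". The sketch below shows a possible whole-word variant using regular expressions. The function name `detect_brand_whole_word` and the plural/number suffix rule are assumptions for this example, not part of the original notebook.

```python
import re

# Same keyword lists as in the cell above.
apple_keywords = ["apple", "iphone", "ipad", "mac", "ipod", "ios", "imac", "macbook"]
google_keywords = ["google", "android", "nexus", "pixel", "gmail", "youtube", "chrome"]


def _keyword_pattern(keywords):
    # \b anchors the keyword at word boundaries; the optional suffix still
    # accepts plurals ("iphones") and model numbers ("ipad2").
    alternatives = "|".join(re.escape(kw) for kw in keywords)
    return re.compile(rf"\b(?:{alternatives})(?:s|\d+)?\b", re.IGNORECASE)


APPLE_RE = _keyword_pattern(apple_keywords)
GOOGLE_RE = _keyword_pattern(google_keywords)


def detect_brand_whole_word(text):
    """Hypothetical variant of detect_brand_from_text using whole-word matches."""
    text = str(text)
    if APPLE_RE.search(text):      # Apple is still checked first
        return "Apple"
    if GOOGLE_RE.search(text):
        return "Google"
    return None


print(detect_brand_whole_word("This machine makes me curious"))  # None
print(detect_brand_whole_word("New #iPad2 apps at #sxsw"))        # Apple
```

Swapping this in would change the row and sentiment counts reported above, so it is an option to evaluate rather than a drop-in replacement.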


In [17]:
# Explore all unique values for brand/product mentions
unique_entities = df["emotion_in_tweet_is_directed_at"].dropna().unique()

print("Unique entity count:", len(unique_entities))
print("\nSample entities:\n", unique_entities[:50])  # first 50 for quick look


Unique entity count: 9

Sample entities:
 ['iPhone' 'iPad or iPhone App' 'iPad' 'Google' 'Android' 'Apple'
 'Android App' 'Other Google product or service'
 'Other Apple product or service']
