# Twitter Sentiment Analysis Using Classical & Neural Models

## 1. Introduction
This project studies sentiment classification on Twitter using the Kaggle "Twitter Sentiment Dataset" (Saurabh Shahane, 2021). The dataset contains cleaned tweets (`clean_text`) and sentiment labels in `category` with values -1 (negative), 0 (neutral), +1 (positive). The goal is to compare classical machine learning models against a text-oriented neural model (CNN–LSTM hybrid) and identify which approach is best for multiclass sentiment classification in terms of accuracy and robust F1 (macro).


## 2. Research Questions
1. Which model achieves the best overall and per-class performance for predicting sentiment?
2. How do TF-IDF and Word2Vec features compare when used with classical models?
3. Does a CNN–LSTM hybrid outperform classical ensembles (Voting Classifier) on this dataset?


## 3. Dataset
- Source: Kaggle — Twitter Sentiment Dataset (Saurabh Shahane, 2021).
- Columns: `clean_text` (string), `category` (int: -1, 0, 1).
- Size: (162980, 2)


4. Methodology
EDA & preprocessing: class distribution, text length, missing data handling, tokenization, stopword removal.
Feature pipelines:
TF-IDF vectorizer (for baseline classical models).
Word2Vec embeddings (gensim): average tweet vectors for classical models.
Keras Tokenizer + padding for NN (embedding layer + CNN–LSTM).
Models:
Baselines: Decision Tree, KNN, Logistic Regression.
Ensemble: Voting Classifier (hard or soft) combining best performing classical models.
Neural: CNN–LSTM hybrid (Embedding → Conv1D → MaxPool → LSTM → Dense).
Hyperparameter tuning: GridSearchCV for classical models; manual / Keras Tuner for NN if time permits.
Evaluation: accuracy, precision, recall, F1 (macro & per class), confusion matrix, ROC-AUC (one-vs-rest).

In [1]:
import os
import argparse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.utils import class_weight

# For Word2Vec
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# For neural model
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense, Dropout, GlobalMaxPooling1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Plotting utilities
from wordcloud import WordCloud
import seaborn as sns


2025-11-14 13:02:27.019543: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-14 13:02:27.019755: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-14 13:02:27.051496: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-14 13:02:27.673831: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off,

## Arguments & Globals

In [2]:
CSV_PLACEHOLDER = "twitter_data.csv"  # change if needed
TARGET_COL = "category"
RANDOM_STATE = 42
TEST_SIZE = 0.20

## Load dataset & quick EDA

In [3]:
# def parse_args():
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--data_path", type=str, default="Twitter_Data.csv")
#     parser.add_argument("--output_dir", type=str, default="outputs")
#     parser.add_argument("--test_size", type=float, default=0.2)
#     parser.add_argument("--random_state", type=int, default=42)
#     return parser.parse_args()

# args = parse_args()
# os.makedirs(args.output_dir, exist_ok=True)
# RND = args.random_state

# df = pd.read_csv(args.data_path)

def load_data(path=CSV_PLACEHOLDER):
    if not os.path.exists(path):
        raise FileNotFoundError(f"CSV not found at {path}. Please place the dataset file there or change the path.")
    try:
        df = pd.read_csv(path, sep=';')
    except Exception:
        df = pd.read_csv(path, sep=';')
    return df

df = load_data(CSV_PLACEHOLDER)
print("Dataset shape:", df.shape)
print("Columns:", df.columns.tolist())
print(df.head())

# Basic checks
print("\n--- Missing Values ---")
print(df.isnull().sum())
# print("Class distribution:\n", df['category'].value_counts())

# # Quick distribution plot
# plt.figure(figsize=(6,4))
# sns.countplot(x='category', data=df, order=[-1,0,1])
# plt.title("Sentiment distribution")
# plt.savefig(os.path.join(args.output_dir, "sentiment_distribution.png"))
# plt.close()


Dataset shape: (162980, 1)
Columns: ['clean_text,category']
                                 clean_text,category
0  when modi promised “minimum government maximum...
1  talk all the nonsense and continue all the dra...
2  what did just say vote for modi  welcome bjp t...
3  asking his supporters prefix chowkidar their n...
4  answer who among these the most powerful world...

--- Missing Values ---
clean_text,category    0
dtype: int64


## 2. Preprocessing helpers

In [4]:
# import re
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# STOPWORDS = set(stopwords.words('english'))

# def clean_text(text):
#     if not isinstance(text, str):
#         return ""
#     text = text.lower()
#     text = re.sub(r'http\S+', '', text)      # remove URLs
#     text = re.sub(r'@\w+', '', text)         # remove mentions
#     text = re.sub(r'[^a-z\s]', '', text)     # remove non-letter chars
#     text = re.sub(r'\s+', ' ', text).strip()
#     return text

# df['clean_text'] = df['clean_text'].astype(str).apply(clean_text)

# # Add basic features helpful for EDA
# df['num_chars'] = df['clean_text'].apply(len)
# df['num_words'] = df['clean_text'].apply(lambda x: len(x.split()))
# print(df[['num_chars','num_words']].describe())

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data (run once)
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove headers, footers, and quotes (using parameters in fetch_20newsgroups is better for this specific dataset)
    # However, for demonstration, let's include some regex cleaning for general text
    text = re.sub(r'From:.*\n', '', text) # Remove From line
    text = re.sub(r'Subject:.*\n', '', text) # Remove Subject line
    text = re.sub(r'Organization:.*\n', '', text) # Remove Organization line
    text = re.sub(r'Lines:.*\n', '', text) # Remove Lines line
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '', text) # Remove email addresses
    text = re.sub(r'http\S+', '', text) # Remove URLs
    text = re.sub(r'\S+\.com\S*', '', text) # Remove .com URLs

    text = text.lower() # Lowercasing
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace

    return text

# # Apply custom text cleaning (punctuation, lowercasing, numbers, etc.)
df_cleaned = df.copy()

[nltk_data] Downloading package wordnet to /home/bermar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
