# Sentiment Analysis System

## Overview
This project builds a sentiment analysis system from scratch using data from the [Sentiment140 Dataset](https://www.kaggle.com/datasets/kazanova/sentiment140).

## Goals/Objectives
Classify text into sentiment categories (positive, neutral, negative).
The system should handle short, possibly informal text.

## Dataset
The dataset contains the following 6 fields:

- `target`: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- `ids`: the id of the tweet (2087)
- `date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- `flag`: the query (lyx). If there is no query, then this value is NO_QUERY.
- `user`: the user that tweeted (robotickilldozr)
- `text`: the text of the tweet (Lyx is cool)


# Import

In [15]:
import numpy as np
import pandas as pd
import re

# scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
#from sklearn.svm import SVC
#from sklearn.ensemble import RandomForestClassifier

# Natural Language Toolkit (NLTK)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

import joblib

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/christopher/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/christopher/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/christopher/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
colNames = ['target', 'ids', 'date', 'flag', 'user', 'text']
data = pd.read_csv('./training.1600000.processed.noemoticon.csv', names=colNames, delimiter=',', encoding='ISO-8859-1')

# info() prints its summary and returns None, so call it on its own line
data.info()
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB

|   | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |

# Pre-Processing
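Steps:
- Convert text to lowercase.
- Remove unnecessary characters (@mentions, URLs, punctuation, numbers, special symbols).
- Handle contractions (e.g., "can't" → "cannot"). The `clean_text` function below does not implement this step; it only strips the apostrophe, so "can't" becomes "cant".
- Tokenization (split text into words).
- Remove stopwords (common words that add little meaning, like "the", "is").
- Lemmatization (reduce words to base form: "cats" → "cat"). The lemmatizer is called with its default noun part of speech, so verb forms such as "running" are left unchanged.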
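
The contraction step in the list above could be sketched as follows. This is a minimal illustration only: the `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names that do not appear in the notebook, and the mapping is deliberately tiny. The function expects lowercased text and would need to run before punctuation is stripped.

```python
import re

# Hypothetical, deliberately small mapping; a real system would need a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation pattern, longest keys first so longer forms are tried first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)


def expand_contractions(text: str) -> str:
    """Replace known contractions in already-lowercased text."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1)], text)


example = expand_contractions("i can't believe it's monday")
print(example)
```

Tweets may also contain typographic apostrophes (’), which this sketch does not normalize.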

In [3]:
# Build these once; recreating them on every call is slow over 1.6M tweets
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove @mentions, URLs, and every character that is not a letter or whitespace
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Lemmatization (each token is treated as a noun, the lemmatizer's default)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    # Rejoin tokens into text
    return ' '.join(tokens)
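
To make the effect of the three regular expressions concrete, the sketch below applies only the lowercase and `re.sub` steps to an invented sample tweet. The `strip_noise` helper and the sample text are illustrative assumptions, and `str.split()` stands in for `word_tokenize` so the example runs without NLTK resources.

```python
import re


def strip_noise(text: str) -> str:
    """Apply the same lowercase and re.sub steps used in clean_text."""
    text = text.lower()
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # drop @mentions
    text = re.sub(r'http\S+', '', text)         # drop URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)     # keep only letters and whitespace
    return text


# Invented sample tweet, not taken from the dataset.
sample = "@some_user I can't wait!!! http://example.com/x 2 more days :)"
tokens = strip_noise(sample).split()  # str.split() stands in for word_tokenize
print(tokens)  # ['i', 'cant', 'wait', 'more', 'days']
```

Note that the digit "2" and the emoticon disappear entirely, and "can't" collapses to "cant", which is no longer matched by a stopword list that expects the apostrophe.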

In [4]:
# Apply text cleaning
data['cleaned_text'] = [clean_text(text) for text in data['text']]

# Feature Engineering

In [5]:
# Convert text into numerical features (TF-IDF Vectorization)
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['cleaned_text'])
y = data['target']

# Remap labels to consecutive integers: 0 = negative, 1 = neutral, 2 = positive
y = y.map({0: 0, 2: 1, 4: 2})
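
For reference, the weighting that `TfidfVectorizer` applies can be written out as below. This assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), which apply here because the call above only sets `max_features`. In the formulas, $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
x_d = \frac{\bigl(\mathrm{tfidf}(t, d)\bigr)_t}{\bigl\lVert \bigl(\mathrm{tfidf}(t, d)\bigr)_t \bigr\rVert_2}
```

With `max_features=5000`, only the 5000 terms with the highest total count across the corpus are kept as columns of `X`.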

# Model Selection & Training

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
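
One thing to be aware of: the vectorizer above is fitted on the full dataset before the split, so the vocabulary and IDF weights are computed with the test tweets included. A possible alternative, sketched here on a small invented dataset rather than on Sentiment140, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so TF-IDF is fitted on the training portion only. The toy texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for data['cleaned_text'] and the mapped labels (invented examples).
texts = [
    "love this sunny day", "great happy morning",
    "awesome fun weekend", "love great friends",
    "hate this rainy day", "sad awful morning",
    "terrible boring weekend", "hate awful traffic",
]
labels = [2, 2, 2, 2, 0, 0, 0, 0]

# Split the raw text first ...
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# ... then let the pipeline fit TF-IDF on the training portion only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(text_train, y_train)

# Every vocabulary term comes from the training texts; none comes from the test set.
vocabulary = set(pipeline.named_steps["tfidf"].vocabulary_)
train_words = {word for text in text_train for word in text.split()}
print(vocabulary <= train_words)
print(pipeline.predict(text_test))
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside each cross-validation fold.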

In [7]:
# Output of the model comparison, used to choose the best model.
'''
Naive Bayes Performance:
Accuracy: 0.7585
              precision    recall  f1-score   support

           0       0.75      0.77      0.76    159815
           2       0.76      0.75      0.76    160185

    accuracy                           0.76    320000
   macro avg       0.76      0.76      0.76    320000
weighted avg       0.76      0.76      0.76    320000


Logistic Regression Performance:
Accuracy: 0.7734
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    159815
           2       0.76      0.80      0.78    160185

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
'''
# The code below produced the output above. It is wrapped in a string literal so it does not run again.
'''
# The following models were tested but did not fit in a reasonable amount of time.
# "SVM": SVC(),
# "Random Forest": RandomForestClassifier(n_estimators=100)
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),  
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"\n{name} Performance:")
    print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
    print(classification_report(y_test, preds))
'''


# Model Evaluation & Optimization
Logistic Regression scored higher than Naive Bayes on the held-out test set (accuracy 0.7734 vs. 0.7585), so it is the model tuned below.

In [16]:
# Hyperparameter tuning for Logistic Regression
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=0), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [17]:
best_model = grid_search.best_estimator_
preds = best_model.predict(X_test)

print("\nOptimized Logistic Regression Performance:")
print(f"Best Hyperparameter (C): {grid_search.best_params_['C']}")
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
print(classification_report(y_test, preds))


Optimized Logistic Regression Performance:
Best Hyperparameter (C): 10
Accuracy: 0.7749
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    159815
           2       0.76      0.80      0.78    160185

    accuracy                           0.77    320000
   macro avg       0.78      0.77      0.77    320000
weighted avg       0.78      0.77      0.77    320000
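
The tuned accuracy (0.7749) is only slightly above the untuned result (0.7734), so it can be useful to see how the cross-validated score varies across the grid rather than looking at the best value alone. The sketch below shows one way to read this from `cv_results_`. It uses synthetic data from `make_classification` as a stand-in for the TF-IDF features, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features (not the Sentiment140 data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = {"C": [0.01, 1, 100]}
demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=0),
    demo_grid,
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# Mean and standard deviation of cross-validated accuracy for each candidate C.
mean_scores = demo_search.cv_results_["mean_test_score"]
std_scores = demo_search.cv_results_["std_test_score"]
for c_value, mean_acc, std_acc in zip(
    demo_search.cv_results_["param_C"], mean_scores, std_scores
):
    print(f"C={c_value}: {mean_acc:.4f} +/- {std_acc:.4f}")

print("best C:", demo_search.best_params_["C"])
```

If the differences between candidates are smaller than the fold-to-fold standard deviation, the choice of `C` matters little for accuracy.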



In [19]:
# Save the optimized model and vectorizer
joblib.dump(best_model, "optimized_logistic_regression.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

['tfidf_vectorizer.pkl']
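
To use the saved artifacts later, both files need to be loaded, and new text has to go through the same cleaning and the saved vectorizer's `transform()` (not `fit_transform()`) before prediction. The following is a self-contained sketch of that round trip. It trains a tiny invented model and writes to a temporary directory so it runs on its own; the file names and `demo_` variables are assumptions, not the notebook's actual artifacts.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set so the sketch runs without the real data.
train_texts = ["love great day", "happy fun weekend", "hate awful day", "sad boring weekend"]
train_labels = [2, 2, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression(max_iter=1000).fit(
    demo_vectorizer.fit_transform(train_texts), train_labels
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_vectorizer, vectorizer_path)

    # Later, for example in another process: load both artifacts.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# New text should be cleaned like the training text, then transformed
# with the saved vectorizer so the feature columns line up with the model.
new_texts = ["love fun day", "awful boring day"]
features = loaded_vectorizer.transform(new_texts)
predictions = loaded_model.predict(features)
print(predictions)
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.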