# Twitter Sentiment Pipeline Introduction

This project explores Natural Language Processing (NLP) techniques for analyzing and classifying the sentiment expressed in tweets from X (formerly Twitter). Using a combination of real-time data retrieval and supervised learning, we aim to build models that accurately predict whether a tweet's sentiment is positive or negative.

We use the Sentiment140 dataset, accessed via Hugging Face Datasets, as our training and evaluation foundation. This dataset provides a large corpus of pre-labeled tweets, making it ideal for benchmarking model performance. Real-time tweets are fetched using the Tweepy library and authenticated access to the Twitter API.

The main objective is to evaluate multiple machine learning classifiers, such as Logistic Regression, Random Forest, and others, to determine which model performs best on the sentiment classification task.

## Data Pre-processing

First, we import the necessary libraries used throughout the project. We then inspect the imported data (loaded via the Hugging Face `datasets` library) and reduce its dimensions.

In [14]:
# Required imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import GradientBoostingClassifier as GBM
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

import datasets
import twitter_pipeline

In [None]:
### Load Training Data for models ###

# Load Sentiment140 directly from Hugging Face datasets
dataset = datasets.load_dataset("sentiment140", trust_remote_code=True)

train_data = dataset["train"]

print(train_data[0])

Downloading data: 100%|██████████| 81.4M/81.4M [00:02<00:00, 32.1MB/s]
Generating train split: 100%|██████████| 1600000/1600000 [00:59<00:00, 26752.54 examples/s]
Generating test split: 100%|██████████| 498/498 [00:00<00:00, 16597.38 examples/s]

{'text': "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", 'date': 'Mon Apr 06 22:19:45 PDT 2009', 'user': '_TheSpecialOne_', 'sentiment': 0, 'query': 'NO_QUERY'}





In [None]:
# Convert to pandas DataFrame for easier manipulation
train_df = train_data.to_pandas()
train_df.head()

| | text | date | user | sentiment | query |
|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | Mon Apr 06 22:19:45 PDT 2009 | \_TheSpecialOne\_ | 0 | NO_QUERY |
| 1 | is upset that he can't update his Facebook by ... | Mon Apr 06 22:19:49 PDT 2009 | scotthamilton | 0 | NO_QUERY |
| 2 | @Kenichan I dived many times for the ball. Man... | Mon Apr 06 22:19:53 PDT 2009 | mattycus | 0 | NO_QUERY |
| 3 | my whole body feels itchy and like its on fire | Mon Apr 06 22:19:57 PDT 2009 | ElleCTF | 0 | NO_QUERY |
| 4 | @nationwideclass no, it's not behaving at all.... | Mon Apr 06 22:19:57 PDT 2009 | Karoli | 0 | NO_QUERY |


In [12]:
# Distinguish predictors and target variable
X_train = train_df['text']
y_train = train_df['sentiment']

print(X_train.shape, y_train.shape)
print(X_train[0])
print(y_train[0])
print(y_train.unique())

(1600000,) (1600000,)
@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
0
[0 4]
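
The output above shows that the labels are encoded as 0 and 4 (in Sentiment140, 0 marks negative and 4 marks positive tweets). Many scikit-learn workflows read more naturally with 0/1 labels, so one option is to remap them before training. The sketch below uses a small hypothetical Series (`y_raw`) in place of `y_train`; the names `label_map` and `y_binary` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `sentiment` column, which holds 0 and 4
y_raw = pd.Series([0, 4, 4, 0, 4], name="sentiment")

# Map 0 -> 0 (negative) and 4 -> 1 (positive)
label_map = {0: 0, 4: 1}
y_binary = y_raw.map(label_map)

print(y_binary.tolist())
```

Applying the same mapping to `y_train` would leave the class balance unchanged; it only renames the positive class.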


In [25]:
'''
Expand predictors into vectorized features

We are choosing to use TF-IDF vectorization for the text data.
With TF-IDF, we want to capture word importance across documents,
    rather than simply using word counts (using CountVectorizer).
'''

# Initialize the vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)



In [26]:
# Get the feature names (i.e., words)
feature_names = vectorizer.get_feature_names_out()

print(feature_names[:20])
print(X_train_tfidf.shape)

['00' '000' '00am' '00pm' '01' '02' '03' '04' '05' '06' '07' '08' '09'
 '10' '100' '1000' '100th' '101' '102' '103']
(1600000, 10000)


In [None]:
### Inspect Vectorized Content ####

# Convert first tweet to dense and get non-zero values
vector_0 = X_train_tfidf[0].toarray()[0]

# Get non-zero features and their scores
for idx, score in enumerate(vector_0):
    if score > 0:
        print(f"{feature_names[idx]}: {score}")

awww: 0.3068072523807425
bummer: 0.36625349232604393
carr: 0.5021652350435604
com: 0.20454775102574493
david: 0.34649559250309886
day: 0.18293282029404262
got: 0.19790956165638682
http: 0.19030721798770697
shoulda: 0.4335162707531134
twitpic: 0.2467245069379507


## Data Exploration

Check the training labels for class imbalance and visualize the data.

In [28]:
# Check class distribution
y_train.value_counts()

sentiment
0    800000
4    800000
Name: count, dtype: int64
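
The counts above show a perfectly balanced training set. A quick way to quantify balance is to look at class proportions and the ratio between the largest and smallest class. This is a minimal sketch on a hypothetical label Series (`y`); `imbalance_ratio` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical labels standing in for y_train
y = pd.Series([0, 4, 0, 4, 4, 0], name="sentiment")

# Share of each class, ordered by label value
proportions = y.value_counts(normalize=True).sort_index()

# Ratio of the most common class to the least common class;
# a value of 1.0 means the classes are perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(imbalance_ratio)
```

A ratio well above 1.0 would suggest considering stratified splits or class weights; with the balanced counts shown above, neither should be required.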

## Machine Learning Models
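
The classifier comparison described in the introduction could be sketched roughly as follows. This is an illustration on a tiny hypothetical corpus, not the notebook's actual experiment: the texts, labels, hyperparameters, and the names `candidates` and `results` are all assumptions, and accuracy on three held-out examples says nothing about performance on Sentiment140.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = positive, 0 = negative
texts = [
    "i love this sunny day", "what a great morning", "so happy right now",
    "best coffee ever", "this is wonderful news", "feeling great today",
    "i hate this rainy day", "what a terrible morning", "so sad right now",
    "worst coffee ever", "this is awful news", "feeling sick today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Stratified split keeps both classes in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Candidate classifiers to compare (hyperparameters are placeholders)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each classifier behind the same TF-IDF step and record test accuracy
results = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training split only, which avoids leaking test-set information into the features.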

## Twitter Live Feed Integration (Twitter API and Machine Learning Model)