<a href="https://colab.research.google.com/github/andrewmuhoro/Tweets-SentimentAnalysis/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview

**Goal**

The goal is to use a simple machine learning model to classify the sentiment of a tweet as negative or positive.

**Dataset**

The data used here can be found on Kaggle by following this [link](https://www.kaggle.com/datasets/kazanova/sentiment140).

This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets are annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

The dataset comprises 6 columns:


1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); the training file used here contains only 0 and 4

2. ids: the id of the tweet (2087)

3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

4. flag: the query (lyx). If there is no query, this value is NO_QUERY.

5. user: the user that tweeted (robotickilldozr)

6. text: the text of the tweet (Lyx is cool)

# Data Gathering

In [1]:
# Install pyforest
# pyforest lazily imports common data science libraries (e.g. pandas as pd) on first use
!pip install pyforest

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Import the pyforest library
import pyforest
# Explicit imports so the cells below do not depend on pyforest's lazy imports
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Authenticate the Kaggle API client. 

In [3]:
# Get the username and key from your Kaggle account
os.environ['KAGGLE_USERNAME'] = "username"
os.environ['KAGGLE_KEY'] = "key"
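
Hard-coding credentials in a notebook makes them easy to leak. As an alternative, here is a minimal sketch that reads them from a `kaggle.json` file (the format Kaggle provides when you create an API token) and exports them as the same two environment variables. The helper name `load_kaggle_credentials` is hypothetical and not part of the Kaggle client.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json-style file and export its values as environment variables.

    Hypothetical helper: assumes the file holds a JSON object with
    "username" and "key" entries.
    """
    credentials = json.loads(Path(path).read_text())
    os.environ["KAGGLE_USERNAME"] = credentials["username"]
    os.environ["KAGGLE_KEY"] = credentials["key"]
    return credentials["username"]
```

A typical call would be `load_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")`, assuming the token file was saved there.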


Download the dataset from Kaggle

In [4]:
!kaggle datasets download -d kazanova/sentiment140

sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
# Unzip the downloaded dataset
!unzip sentiment140

Archive:  sentiment140.zip
replace training.1600000.processed.noemoticon.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: training.1600000.processed.noemoticon.csv  


In [6]:
# List all files
!ls

sample_data  sentiment140.zip  training.1600000.processed.noemoticon.csv


Load the downloaded tweets dataset

In [7]:
tweets_df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1')
tweets_df.head()


Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


## Explore the Data

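Looking at `tweets_df`, we can see that the file has no header row, so `read_csv` used the first tweet as the column names (visible in the header of the output above). Let us assign proper column names instead. Note that the first tweet is lost in the process, which is why the following cells report 1,599,999 rows rather than 1,600,000.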

In [8]:
# Assign a list of column names via the .columns attribute
tweets_df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']
tweets_df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
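
Renaming the columns after loading overwrites the header that pandas inferred from the first tweet, so that row never becomes data. One way to keep it, sketched here on a two-row in-memory sample rather than the real file, is to pass `header=None` and `names=` to `read_csv`. The sample rows are abbreviated from the output above, and `sample_csv` and `demo_df` are illustrative names.

```python
import io

import pandas as pd

# Two abbreviated rows in the same headerless, quoted layout as the dataset
sample_csv = io.StringIO(
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","Awww, that\'s a bummer."\n'
    '"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset"\n'
)

column_names = ['target', 'id', 'date', 'flag', 'user', 'text']

# header=None tells pandas the first line is data; names supplies the header
demo_df = pd.read_csv(sample_csv, header=None, names=column_names)
print(demo_df.shape)
```

Applied to the real file (with `encoding='latin-1'` as in the loading cell), this should keep all 1,600,000 rows.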


In [9]:
# Check each column's data type and non-null count
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1599999 non-null  int64 
 1   id      1599999 non-null  int64 
 2   date    1599999 non-null  object
 3   flag    1599999 non-null  object
 4   user    1599999 non-null  object
 5   text    1599999 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [10]:
# Check count of rows and columns
tweets_df.shape

(1599999, 6)

In [11]:
# Check whether there are duplicate records
tweets_df.duplicated().sum()

0

In [12]:
# Check for null/missing values
tweets_df.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

# Data Cleaning

In [13]:
# Create a copy of the original dataset
tweets_cp = tweets_df.copy()
tweets_cp.head(3)

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire


In [14]:
# Check the polarity of the data by examining the target values
tweets_cp.target.value_counts()

4    800000
0    799999
Name: target, dtype: int64

In [15]:
# Drop unnecessary columns
#tweets_cp = tweets_cp.drop([], axis=1)
#tweets_cp.head()

In [16]:
# Map the target values to 0 (negative) or 1 (positive)
tweets_cp['target'] = tweets_cp['target'].replace({0: 0, 4: 1})

Pre-process the ***text column*** using regular expressions to remove punctuation, special characters, URLs, usernames, and the `#` symbol from hashtags, then convert the text to lowercase and remove stopwords.

Stopwords are common English words that add little meaning to a sentence, so they can be dropped without sacrificing the meaning of the sentence.

In [17]:
# Import NLTK (Natural Language Toolkit)
# This library provides tools for loading and cleaning text
import nltk
import re
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# define a function to implement the pre-processing & cleaning of the text data
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@[^\s]+', '', text)  # Remove usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)  # Strip the '#' symbol but keep the hashtag word
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply clean_text to the text column and store the result in a new clean_text column
# Then drop the original text column (drop returns a new DataFrame, so reassign it)
tweets_cp['clean_text'] = tweets_cp['text'].apply(clean_text)
tweets_cp = tweets_cp.drop(['text'], axis=1)
tweets_cp

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,target,id,date,flag,user,clean_text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,upset cant update facebook texting might cry r...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,dived many times ball managed save 50 rest go ...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,whole body feels itchy like fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,behaving im mad cant see
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,whole crew
...,...,...,...,...,...,...
1599994,1,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,woke school best feeling ever
1599995,1,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,thewdbcom cool hear old walt interviews â
1599996,1,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,ready mojo makeover ask details
1599997,1,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,happy 38th birthday boo alll time tupac amaru ...
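
To see what the cleaning steps do to a single tweet, here is a self-contained sketch that repeats the same regular expressions. It uses a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, an assumption made so the example runs without downloading NLTK data), and the sample tweet is illustrative.

```python
import re

# Small stand-in for NLTK's English stopword list (assumption for this demo)
DEMO_STOP_WORDS = {"a", "the", "is", "that"}


def clean_text_demo(text):
    text = re.sub(r'http\S+', '', text)       # Remove URLs
    text = re.sub(r'@[^\s]+', '', text)       # Remove usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)  # Strip '#' but keep the hashtag word
    text = re.sub(r'[^\w\s]', '', text)       # Remove punctuation
    text = text.lower()                       # Convert to lowercase
    # Remove stopwords and collapse repeated whitespace
    return ' '.join(word for word in text.split() if word not in DEMO_STOP_WORDS)


sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. #sad"
print(clean_text_demo(sample))  # awww thats bummer sad
```

Because punctuation is removed before the stopword filter runs, contractions such as "that's" become "thats" and no longer match a stopword entry.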


Convert the text data into numerical features and split the dataset into training and testing sets.

In [18]:
# Convert the text data into numerical features using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X = tfidf.fit_transform(tweets_cp['clean_text'])
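
For reference, with scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), which the cell above does not override, the weight of term $t$ in tweet $d$ is computed roughly as follows, where $n$ is the number of tweets and $\mathrm{df}(t)$ is the number of tweets containing $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each row of the resulting matrix is then scaled to unit Euclidean length. With `max_features=5000`, only the 5,000 most frequent terms are kept as columns.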

In [19]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, tweets_cp['target'], test_size=0.3, random_state=42)

Pick a machine learning algorithm to train and test the model. Common choices include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.

In this case, we will use Naive Bayes.

In [20]:
# Train a Naive Bayes classifier on the training data
nb = MultinomialNB()
nb.fit(X_train, y_train)

In [21]:
# Test the model on the testing data
y_pred = nb.predict(X_test)
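
The cells above fit the TF-IDF vectorizer on the full dataset before splitting, so the test tweets influence the vocabulary and IDF weights. A possible variant, sketched here on a made-up toy corpus rather than the real data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training portion only. This is an alternative arrangement, not what this notebook ran; the texts, labels, and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration (1 = positive, 0 = negative)
texts = [
    "love sunny day", "great fun friends", "happy birthday boo", "best feeling ever",
    "hate mondays", "upset cant update", "feels itchy fire", "mad cant see",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vectorizer never sees the test tweets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and Naive Bayes on the training texts only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(len(predictions))
```

A pipeline also accepts raw strings at prediction time, so a separate `transform` call is not needed for new tweets.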

Evaluate the model performance.

In [22]:
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1-Score:', f1_score(y_test, y_pred))

Accuracy: 0.7511354166666667
Precision: 0.7564523638210522
Recall: 0.7427183457378064
F1-Score: 0.7495224455818614
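
To make the four scores concrete, here is a small sketch that computes them by hand from the confusion counts for a binary problem where 1 is the positive class. The function name `binary_metrics` and the example labels are illustrative; in the notebook itself these values come from scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


example_true = [1, 1, 1, 0, 0, 0, 1, 0]
example_pred = [1, 0, 1, 0, 1, 0, 1, 0]
scores = binary_metrics(example_true, example_pred)
print(scores)  # every score is 0.75 for this example
```

Precision answers "of the tweets predicted positive, how many were positive?", while recall answers "of the positive tweets, how many were found?".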


Use the trained model to predict the sentiment of a new tweet.

In [23]:
new_tweet = 'I hate Mondays'
new_tweet_cleaned = clean_text(new_tweet)
new_tweet_vectorized = tfidf.transform([new_tweet_cleaned])
sentiment = nb.predict(new_tweet_vectorized)[0]
print('Sentiment:', sentiment)

Sentiment: 0


> The model predicts the new tweet is negative.
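
Since the target was remapped to 0 (negative) and 1 (positive), the raw prediction can be turned into a readable label. A minimal sketch, with `SENTIMENT_LABELS` and `describe_sentiment` as hypothetical names:

```python
# Mapping that mirrors the 0/1 target encoding used in this notebook
SENTIMENT_LABELS = {0: "negative", 1: "positive"}


def describe_sentiment(prediction):
    """Translate a 0/1 model prediction into a human-readable label."""
    return SENTIMENT_LABELS[int(prediction)]


print(describe_sentiment(0))  # negative
```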