# Sentiment Analysis of Tweets
### Detecting Positive and Negative Sentiments from 1.6 Million Tweets

#### `Why` Analyze the Sentiment of Tweets?
Twitter sentiment analysis is essential for understanding public opinions and trends in real time. 

It helps businesses improve customer satisfaction, track brand perception, and enhance marketing strategies. 

Researchers and policymakers can leverage it to analyze societal issues and public sentiment on critical topics.

Ultimately, it provides valuable insights for data-driven decision-making.

#### `Workflow` The Process of the Project

<u>Data Preparation</u>
- Extract the Sentiment140 dataset.
- Clean and preprocess the text data.
- Split tweets into individual tokens.
- Remove common words that do not add meaning.
- Reduce words to their base or root forms.

<u>Model Training</u>
- Convert text data to numerical format (e.g., TF-IDF, Bag of Words).
- Divide the dataset into training and testing subsets.
- Choose a suitable machine learning model for sentiment classification.
- Train the model using the training dataset.

<u>Model Evaluation</u>
- Test the model using the testing dataset and calculate performance metrics.
- Use the trained model to classify new tweets as positive or negative.


#### Data Description 
[Kaggle link](https://www.kaggle.com/datasets/kazanova/sentiment140)

- **The Sentiment140 Dataset** contains 1,600,000 tweets extracted using the Twitter API.
- Each tweet is annotated with **sentiment polarity**:
  - `0 = Negative sentiment`
  - `4 = Positive sentiment`
  - `2 = Neutral sentiment` *(not present in the training file used here)*.

#### Dataset Fields:
- **target**: Sentiment polarity of the tweet (0, 2, or 4).
- **ids**: Unique identifier for the tweet.
- **date**: Timestamp when the tweet was posted.
- **flag**: Query used to retrieve the tweet, or `NO_QUERY` if none.
- **user**: Username of the person who tweeted.
- **text**: The actual tweet content.

### 1. Import Necessary Libraries

In [1]:
import pandas as pd 
import numpy as np
from warnings import filterwarnings
filterwarnings('ignore')

### 2. Import Data 

In [2]:
data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin1')
data

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
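
Note that the first tweet appears as the column header in the output above: the file has no header line, and `read_csv` treats the first line as one by default. The following is a minimal sketch on a two-row in-memory sample (the rows are shortened, illustrative stand-ins for the real file, and `raw`, `cols`, `with_default`, and `with_names` are names chosen here) showing how passing `names=` keeps every row as data:

```python
import io
import pandas as pd

# Two shortened, illustrative rows in the same layout as the Sentiment140 file (no header line).
raw = (
    '"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset"\n'
    '"0","1467811372","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","joy_wolf","not the whole crew"\n'
)

# Default behaviour: the first data row is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw))

# Passing names= signals that the file has no header row, so both rows are kept.
cols = ['target', 'Id', 'date', 'flag', 'user', 'text']
with_names = pd.read_csv(io.StringIO(raw), names=cols)

print(len(with_default), len(with_names))
```

With the default call one tweet is lost to the header; with `names=` both rows survive and the columns carry readable labels.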


### 3. Rename the Columns

In [3]:
columns_names = ['target','Id','date','flag','user','text']
data = pd.read_csv('training.1600000.processed.noemoticon.csv',names = columns_names, encoding='ISO-8859-1')

In [4]:
df = data.groupby('target', group_keys=False).apply(lambda x: x.sample(frac=0.25, random_state=42))

In [5]:
df.shape

(400000, 6)
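
Sampling 25% within each `target` group, rather than from the whole table, keeps the class proportions of the sample identical to those of the full data. A small sketch on toy data illustrates the idea; it uses `DataFrameGroupBy.sample`, which is an alternative spelling of per-group sampling and may not select exactly the same rows as the `apply`-based call above. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy data: 8 negative (0) and 8 positive (4) tweets.
toy = pd.DataFrame({
    'target': [0] * 8 + [4] * 8,
    'text': [f'tweet {i}' for i in range(16)],
})

# Draw 25% of the rows from each target group separately.
sampled = toy.groupby('target').sample(frac=0.25, random_state=42)

print(sampled['target'].value_counts().to_dict())
```

Each class contributes exactly a quarter of its rows, so a balanced dataset stays balanced after sampling.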

### 4. Check For Null Values 

In [6]:
df.isnull().sum()

target    0
Id        0
date      0
flag      0
user      0
text      0
dtype: int64

### 5. Replacing Values in the `Target` Column

In [8]:
# Map the positive label 4 to 1 (0 stays 0); assign back instead of using inplace on a column
df['target'] = df['target'].replace({4: 1})

In [9]:
df['target'].value_counts()

target
0    200000
1    200000
Name: count, dtype: int64

### 6. Data Preparation

In [10]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the stop-word list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Welcome\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [11]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [12]:
def process_text(text):
    # Initialize the WordNet lemmatizer and the English stop-word set
    lem = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    # Replace non-alphabetic characters with spaces
    text_cleaned = re.sub('[^a-zA-Z]', ' ', text)

    # Convert to lowercase and split into words
    words = text_cleaned.lower().split()

    # Remove stop words (fast set lookup) and lemmatize the remaining words
    lemmatized_words = [lem.lemmatize(word) for word in words if word not in stop_words]

    # Join the words back into a single string
    processed_text = ' '.join(lemmatized_words)

    return processed_text
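
Because the regular expression only replaces non-letters with spaces, the letters of `@mentions` and URLs survive as ordinary tokens (for example `xnausikaax` in the processed output). As a possible refinement, and not part of the notebook's pipeline, the sketch below strips mentions and links before the letter filter. It is standard-library only; `STOP_WORDS` is a deliberately tiny, hypothetical stand-in for NLTK's list, lemmatization is omitted, and `clean_tweet` is an illustrative name.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's full English list.
STOP_WORDS = {"i", "is", "a", "the", "to", "it", "that", "you", "of", "do", "s"}

# Matches http(s) links, www links, and @mentions.
URL_OR_MENTION = re.compile(r"(https?://\S+|www\.\S+|@\w+)")

def clean_tweet(text):
    text = URL_OR_MENTION.sub(" ", text)        # drop links and mentions first
    text = re.sub(r"[^a-zA-Z]", " ", text)      # keep letters only
    words = text.lower().split()                # lowercase and tokenize on whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(clean_tweet(sample))
```

Whether removing usernames and links helps accuracy is an empirical question; it mainly shrinks the vocabulary by discarding tokens that are unlikely to carry sentiment.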

In [13]:
df['Stemmed_Text']=df['text'].apply(process_text)

In [14]:
df.head()

Unnamed: 0,target,Id,date,flag,user,text,Stemmed_Text
212188,0,1974671194,Sat May 30 13:36:31 PDT 2009,NO_QUERY,simba98,@xnausikaax oh no! where did u order from? tha...,xnausikaax oh u order horrible
299036,0,1997882236,Mon Jun 01 17:37:11 PDT 2009,NO_QUERY,Seve76,A great hard training weekend is over. a coup...,great hard training weekend couple day rest le...
475978,0,2177756662,Mon Jun 15 06:39:05 PDT 2009,NO_QUERY,x__claireyy__x,"Right, off to work Only 5 hours to go until I...",right work hour go free xd
588988,0,2216838047,Wed Jun 17 20:02:12 PDT 2009,NO_QUERY,Balasi,I am craving for japanese food,craving japanese food
138859,0,1880666283,Fri May 22 02:03:31 PDT 2009,NO_QUERY,djrickdawson,Jean Michel Jarre concert tomorrow gotta work...,jean michel jarre concert tomorrow gotta work ...


In [15]:
x=df['Stemmed_Text'].values
y=df['target'].values

In [16]:
print(x)

['xnausikaax oh u order horrible'
 'great hard training weekend couple day rest let lot computer time put'
 'right work hour go free xd' ... 'amyeve glad like'
 'donniewahlberg love wake morn amp see tweetin ure lol luvs ya gonna great day'
 'piece work week stand betweeen end nd year']


In [17]:
print(y)

[0 0 0 ... 1 1 1]


In [18]:
from sklearn.model_selection import train_test_split

In [19]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=10)

In [20]:
print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
ytest.shape

(320000,)
(80000,)
(320000,)


(80000,)
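
The split above relies on random shuffling alone. If exact class proportions in both subsets matter, `train_test_split` also accepts a `stratify` argument. The sketch below, on made-up toy data, shows its effect; it is an optional variation, not what the notebook runs.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 negative and 10 positive examples.
texts = [f'tweet {i}' for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels keeps the 50/50 class ratio in both the train and test subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=10, stratify=labels
)

print(len(x_tr), len(x_te), sum(y_te))
```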

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
vect=TfidfVectorizer()

In [23]:
xtrain=vect.fit_transform(xtrain)
xtest=vect.transform(xtest)
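
The vectorizer is fitted on the training text only and then reused on the test text, so the vocabulary and IDF weights never see test data. A minimal sketch on a made-up three-document corpus shows what that means in practice: words that were not seen during fitting are silently dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration.
train_docs = ["great day", "horrible day", "great food"]
test_docs = ["great unknownword"]

demo_vect = TfidfVectorizer()
X_train = demo_vect.fit_transform(train_docs)  # learns vocabulary and IDF from training text
X_test = demo_vect.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(demo_vect.vocabulary_))
print(X_train.shape, X_test.shape)
print(X_test.nnz)  # number of non-zero entries in the test row
```

Each row of the resulting sparse matrix is one document and each column one vocabulary word, which is the `(row, column)  weight` layout that printing the matrix displays.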

In [24]:
print(xtrain)

  (0, 33181)	0.42122610845123787
  (0, 133797)	0.6238533587517132
  (0, 73485)	0.3922175221562081
  (0, 173416)	0.5287163394929993
  (1, 159531)	0.45540362861342765
  (1, 173979)	0.17997530537877518
  (1, 116896)	0.20703829801440152
  (1, 9702)	0.26631459555093334
  (1, 146761)	0.3039411635693178
  (1, 20244)	0.39322258983748715
  (1, 125423)	0.3572563072090752
  (1, 84147)	0.521336550359119
  (2, 43606)	0.5418912072379292
  (2, 55862)	0.3077507877303229
  (2, 164407)	0.38868081146428385
  (2, 34652)	0.514231868953075
  (2, 20673)	0.44287264977878094
  (3, 168782)	0.47081240350843917
  (3, 53319)	0.3241106193809267
  (3, 105208)	0.3313198766092873
  (3, 86733)	0.7506764459277546
  (4, 19506)	0.8223314475966161
  (4, 188760)	0.40797403964624823
  (4, 64361)	0.39664615625059996
  (5, 141260)	0.5433003954756446
  :	:
  (319996, 171824)	0.34532551588966576
  (319996, 53288)	0.3493841674942259
  (319996, 53229)	0.3147373199776172
  (319996, 158735)	0.31353555475197953
  (319996, 39932)	0.20

### 7. Model Building  

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [26]:
model=LogisticRegression(max_iter=1000)
model.fit(xtrain,ytrain)

In [27]:
xtrain_pred=model.predict(xtrain)
training_data_accuracy=accuracy_score(ytrain,xtrain_pred)

In [28]:
print('The Accuracy score of training data is', training_data_accuracy)

The Accuracy score of training data is 0.820784375


In [29]:
xtest_pred=model.predict(xtest)
testing_data_accuracy=accuracy_score(ytest,xtest_pred)

In [32]:
print('The Accuracy score of testing data is', testing_data_accuracy)

The Accuracy score of testing data is 0.7732625
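
Accuracy alone does not show which class the errors fall on. The `confusion_matrix` and `classification_report` functions imported earlier can break the result down per class. The sketch below uses small made-up label arrays (`y_true`, `y_pred`) rather than the notebook's predictions, so the numbers are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Made-up labels for illustration: 0 = Negative, 1 = Positive.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
```

In the notebook, the same calls would take `ytest` and `xtest_pred` as arguments.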


### 8. Saving the Model 

In [38]:
import joblib

In [34]:
joblib.dump(model, 'Model 1.0.joblib')

['Model 1.0.joblib']

In [45]:
# Save the vectorizer for later use
joblib.dump(vect, 'tfidf_vectorizer.joblib')

['tfidf_vectorizer.joblib']
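
Saving the model and the vectorizer as two files works, but they must always be loaded as a matching pair. An alternative, shown here only as a sketch on made-up toy data, is to wrap both steps in a scikit-learn `Pipeline` and persist a single object. The file name `sentiment_pipeline.joblib` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned training texts.
texts = ['love great day', 'horrible bad day', 'great food love', 'bad horrible food']
labels = [1, 0, 1, 0]

# The pipeline vectorizes and classifies in one object.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)

# Persist to a temporary directory and load it back.
path = os.path.join(tempfile.mkdtemp(), 'sentiment_pipeline.joblib')
joblib.dump(pipe, path)
restored = joblib.load(path)

print(list(restored.predict(texts)) == list(pipe.predict(texts)))
```

The restored pipeline accepts raw strings directly, so there is no separate vectorizer file to keep in sync.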

### 9. Check for a Tweet 

In [46]:
model = joblib.load('Model 1.0.joblib')
tfidf_vectorizer = joblib.load('tfidf_vectorizer.joblib')

In [48]:
# Function to predict the sentiment of a tweet
def predict_tweet_sentiment(tweet):
    # Preprocess the tweet (lowercase)
    tweet = tweet.lower()  # Basic preprocessing
    
    # Vectorization using TF-IDF
    tweet_vectorized = tfidf_vectorizer.transform([tweet])  # Transform the tweet to the TF-IDF vector
    
    # Make prediction
    prediction = model.predict(tweet_vectorized)  # Pass the vectorized tweet to the model
    return 'Positive' if prediction[0] == 1 else 'Negative'
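
The training text went through `process_text` (letter filtering, stop-word removal, lemmatization), whereas the function above only lowercases the tweet, so plural or inflected words in a new tweet may not match the vocabulary the vectorizer learned. One way to keep the two sides aligned is to route both through the same cleaning function. The sketch below illustrates the pattern with a simplified stand-in, `normalize`, and a four-tweet toy training set; both are hypothetical, and in the notebook the shared function would be `process_text`.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def normalize(text):
    """Simplified, hypothetical stand-in for the notebook's process_text."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())

# Made-up training tweets: 1 = Positive, 0 = Negative.
train_texts = ["I love it!!!", "so good :)", "I hate it...", "so bad :("]
train_labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(
    demo_vectorizer.fit_transform([normalize(t) for t in train_texts]),
    train_labels,
)

def predict_sentiment(tweet):
    # Apply exactly the same cleaning that was used on the training text.
    features = demo_vectorizer.transform([normalize(tweet)])
    return "Positive" if demo_model.predict(features)[0] == 1 else "Negative"

print(predict_sentiment("LOVE it, so good"))
print(predict_sentiment("hate it... so bad"))
```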

In [49]:
# Example usage 1
tweet = "I love this product! It's amazing."
sentiment = predict_tweet_sentiment(tweet)
print(f'Tweet: "{tweet}" -> Sentiment: {sentiment}')

Tweet: "I love this product! It's amazing." -> Sentiment: Positive


#### Example Tweet Link
You can view the tweet [here](https://x.com/sudi0ne/status/1716739204011405583).

In [52]:
tweet = "Samsung follows the same path as Apple, the path of charts from Excel This is not a good path Without numbers and crazy specifications there will be no success"
sentiment = predict_tweet_sentiment(tweet)
print(f'Tweet: "{tweet}" -> Sentiment: {sentiment}')

Tweet: "Samsung follows the same path as Apple, the path of charts from Excel This is not a good path Without numbers and crazy specifications there will be no success" -> Sentiment: Negative
