# Twitter Sentiment Analysis using ML
Training a logistic regression model on the Sentiment140 dataset to classify tweets as positive or negative.

Link: https://www.kaggle.com/datasets/kazanova/sentiment140

The dataset contains 1,600,000 tweets.



In [1]:
# installing kaggle library
!pip install kaggle



Upload the Kaggle JSON file and set up access to the Kaggle dataset.

In [59]:
#configuring the path of kaggle.json file
!mkdir -p ~/.kaggle #!ls -a ~/ to view directory
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Import Twitter Sentiment Dataset

In [3]:
# Kaggle API call to fetch the dataset
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
 95% 77.0M/80.9M [00:01<00:00, 114MB/s]
100% 80.9M/80.9M [00:01<00:00, 76.7MB/s]


In [4]:
# extracting the compressed dataset
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

# use a name other than `zip` to avoid shadowing the built-in
with ZipFile(dataset, 'r') as archive:
  archive.extractall()
  print("The dataset is extracted")

The dataset is extracted


Importing the Dependencies

In [5]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [6]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
#printing the stopwords in English, not required for ML processing

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data processing

In [8]:
#Loading data from csv to pandas dataframe
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding = 'ISO-8859-1')

In [9]:
#checking the number of rows and columns
twitter_data.shape

(1599999, 6)

In [10]:
#printing the first 5 rows
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [11]:
# the CSV has no header row, so the first read used the first tweet as column names;
# pass explicit column names and read the dataset again
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', names = column_names, encoding = 'ISO-8859-1')

In [12]:
twitter_data.shape

(1600000, 6)
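
The one-row difference between the two shapes comes from header handling: when no column names are given, the first record is consumed as the header. The sketch below reproduces the same effect with Python's standard `csv` module on a made-up three-row sample. The sample rows and the `FIELDS` list are illustrative assumptions, and this is an analogy to the pandas behavior rather than pandas itself.

```python
import csv
import io

# Hypothetical three-row sample with no header line, loosely shaped like the dataset.
SAMPLE = "0,1,user_a,first tweet\n0,2,user_b,second tweet\n4,3,user_c,third tweet\n"
FIELDS = ["target", "id", "user", "text"]  # illustrative column names

# Without field names, the first data row is treated as the header and lost.
rows_without_names = list(csv.DictReader(io.StringIO(SAMPLE)))

# With explicit field names, every line is kept as data.
rows_with_names = list(csv.DictReader(io.StringIO(SAMPLE), fieldnames=FIELDS))

print(len(rows_without_names), len(rows_with_names))
```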

In [13]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [14]:
#counting the number of missing values in the dataset
twitter_data.isnull().sum()

Unnamed: 0,0
target,0
id,0
date,0
flag,0
user,0
text,0


In [15]:
#checking the distribution of target column
twitter_data['target'].value_counts()


target,count
0,800000
4,800000


Convert the target "4" to "1"

In [16]:
twitter_data.replace({'target':{4:1}}, inplace=True)

In [17]:
#checking the distribution of target column
twitter_data['target'].value_counts()


target,count
0,800000
1,800000


0 ---> Negative Tweet

1 ---> Positive Tweet

# *Stemming*
Stemming is the process of reducing a word to its root form.

Example (conceptual): actor, acting, actress → act

In [24]:
port_stem = PorterStemmer()

In [32]:
def stemming(content):
  # keep only ASCII letters and whitespace
  stemmed_content = re.sub(r'[^a-zA-Z\s]', '', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # stem each word, skipping English stop words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
  stemmed_content = ' '.join(stemmed_content)

  return stemmed_content

In [33]:
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming) #50min exec. time
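
Much of the long runtime noted above probably comes from calling `stopwords.words('english')` once per word, which produces a list that is then scanned linearly. A common remedy is to build the stop-word collection once as a set and reuse it. The sketch below shows the idea under stated assumptions: to stay runnable without NLTK it uses a tiny hand-written stop-word set and an identity stemmer as stand-ins, and `clean_text` is a hypothetical name. In the notebook, `set(stopwords.words('english'))` and `port_stem.stem` would take their place.

```python
import re

# Stand-ins so the sketch runs without NLTK (assumptions, not the real lists).
STOP_WORDS = frozenset({"i", "is", "the", "a", "that", "to", "it"})
NON_LETTERS = re.compile(r"[^a-zA-Z\s]")  # compiled once, reused for every tweet


def identity_stem(word):
    """Placeholder for port_stem.stem; returns the word unchanged."""
    return word


def clean_text(content, stem=identity_stem, stop_words=STOP_WORDS):
    """Strip non-letters, lowercase, drop stop words, stem the rest."""
    words = NON_LETTERS.sub("", content).lower().split()
    # Set membership is a constant-time lookup on average.
    return " ".join(stem(w) for w in words if w not in stop_words)


print(clean_text("That is a BUMMER, I wanted to see it!"))
```

Passing the stemmer and stop words as parameters keeps the function easy to test and lets the real NLTK objects be swapped in without changing its body.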

In [34]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot httptwitpiccomyzl awww that bummer ...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset cant updat facebook text might cri resul...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav im mad cant see


In [35]:
print(twitter_data['stemmed_content'])

0          switchfoot httptwitpiccomyzl awww that bummer ...
1          upset cant updat facebook text might cri resul...
2          kenichan dive mani time ball manag save rest g...
3                            whole bodi feel itchi like fire
4                      nationwideclass behav im mad cant see
                                 ...                        
1599995                           woke school best feel ever
1599996    thewdbcom cool hear old walt interview httpbli...
1599997                         readi mojo makeov ask detail
1599998    happi th birthday boo alll time tupac amaru sh...
1599999    happi charitytuesday thenspcc sparkschar speak...
Name: stemmed_content, Length: 1600000, dtype: object


In [36]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


In [38]:
# separating the data and label
X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [39]:
print(X)

['switchfoot httptwitpiccomyzl awww that bummer shoulda got david carr third day'
 'upset cant updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguphh']


In [40]:
print(Y)

[0 0 0 ... 1 1 1]


Splitting the data into training and test data

In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [42]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)
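
The `stratify=Y` argument asks for a split in which each class keeps the same share in the training and test parts. The sketch below illustrates that idea in plain Python on toy labels. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper written for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 50 + [1] * 50  # toy balanced labels, like the 0/1 targets
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))  # number of positives in the test part
```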


In [43]:
print(X_train)

['watch saw iv drink lil wine' 'hatermagazin im'
 'even though favourit drink think vodka coke wipe mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyon mother great day cant wait hear guy store tomorrow'
 'love wake folger bad voic deeper']


In [44]:
print(X_test)

['mmangen fine havent much time chat twitter hubbi back summer amp tend domin free time'
 'ah may show w ruth kim amp geoffrey sanhueza'
 'ishatara mayb bay area thang dammit' ...
 'destini nevertheless hooray member wonder safe trip' 'feel well'
 'supersandro thank']


Converting text to numerical values

In [45]:
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
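
The vectorizer is fitted on the training text only, so the vocabulary and IDF weights are learned from the training set and merely applied to the test set. To make the weighting itself concrete, the sketch below computes TF-IDF by hand using the formula that scikit-learn documents for its default settings: raw term counts, a smoothed `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalisation. Splitting on whitespace is a simplification of the real tokenizer, and the three documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["feel well", "feel sick today", "love wake"]
for vec in tfidf(docs):
    print(vec)
```

Terms that occur in fewer documents get a larger weight, which is why "well" outweighs "feel" in the first toy document.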

In [46]:
print(X_train)

  (0, 590499)	0.271935973896376
  (0, 485808)	0.35734734000828555
  (0, 275143)	0.5287163126006141
  (0, 135664)	0.3751210798745192
  (0, 330499)	0.4200753866446622
  (0, 599523)	0.4490363130629349
  (1, 199663)	0.9736525845879415
  (1, 265981)	0.2280364982304841
  (2, 135664)	0.46222653304912487
  (2, 265981)	0.11789035957962253
  (2, 153171)	0.1905056115706456
  (2, 552429)	0.1883319097978993
  (2, 160220)	0.29161611256480313
  (2, 551021)	0.3225385174402054
  (2, 586717)	0.3320039094486021
  (2, 96511)	0.3154389006079266
  (2, 600052)	0.33687662733057383
  (2, 368908)	0.24369409966007974
  (2, 555054)	0.1531113773352699
  (2, 186388)	0.1889237729533197
  (2, 163721)	0.20357992328484822
  (2, 395936)	0.16862711364982916
  (3, 551021)	0.28972941659572027
  (3, 197107)	0.4472150500940576
  (3, 187999)	0.2779040561459071
  :	:
  (1279996, 587269)	0.2711295663878293
  (1279996, 531631)	0.2200653602406279
  (1279996, 509515)	0.360527343929314
  (1279996, 334155)	0.5223798733648134
  (1279

In [47]:
print(X_test)

  (0, 18732)	0.17227034982071815
  (0, 39017)	0.15827452510671522
  (0, 84309)	0.26125887586922236
  (0, 131463)	0.358745828899733
  (0, 163794)	0.2491385168135991
  (0, 171224)	0.2319187213687419
  (0, 199983)	0.21795643358385713
  (0, 258514)	0.2740305353913019
  (0, 374084)	0.44015549781966073
  (0, 384916)	0.1734590777271476
  (0, 529131)	0.21456948420123334
  (0, 541797)	0.338782303721007
  (0, 555054)	0.3071892805923686
  (0, 571692)	0.17566905040466332
  (1, 8007)	0.3053247665651762
  (1, 18732)	0.21441858437620598
  (1, 180079)	0.5700111071661289
  (1, 310417)	0.4060733877878853
  (1, 356789)	0.28423771310139145
  (1, 478347)	0.48070801231601
  (1, 500312)	0.24315001209093084
  (2, 28147)	0.35334726711330017
  (2, 43320)	0.3799460828604977
  (2, 111209)	0.36316906406311267
  (2, 272488)	0.5865060038453336
  :	:
  (319994, 600571)	0.25994636406247995
  (319995, 133807)	0.37415114246535175
  (319995, 135827)	0.3387823265608489
  (319995, 192771)	0.31055261383164634
  (319995, 265

Training the ML model


Logistic Regression

In [48]:
model = LogisticRegression(max_iter = 1000)

In [49]:
model.fit(X_train, Y_train)
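
Once fitted, a binary logistic regression model scores a sample as a weighted sum of its features plus a bias, passes that score through the sigmoid function, and predicts label 1 when the resulting probability is at least 0.5. The sketch below shows that prediction step only. The weights, bias, and feature values are hypothetical numbers chosen for illustration, not values learned in this notebook.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(features, weights, bias):
    """Return (probability of class 1, predicted label) for one sparse row."""
    score = bias + sum(weights.get(term, 0.0) * value
                       for term, value in features.items())
    prob = sigmoid(score)
    return prob, int(prob >= 0.5)


# Hypothetical weights: positive values push towards label 1 (positive tweet).
weights = {"love": 2.0, "happi": 1.5, "upset": -2.0, "cri": -1.5}
bias = 0.0

print(predict({"love": 0.7, "wake": 0.7}, weights, bias))
print(predict({"upset": 0.6, "cri": 0.5}, weights, bias))
```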

Model Evaluation

Accuracy score

In [50]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [51]:
print('Accuracy score for the training data:', training_data_accuracy)

Accuracy score for the training data: 0.8140203125


In [52]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [53]:
print('Accuracy score for the test data:', test_data_accuracy)

Accuracy score for the test data: 0.78260625


The training and test accuracies should be close to each other; a much higher training score would indicate overfitting.
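
As a rough illustration of what `accuracy_score` reports, the sketch below computes the fraction of matching labels by hand on made-up toy data, then takes the difference between the two scores printed in this notebook. The `accuracy` helper is a hypothetical stand-in written for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels and predictions, made up for illustration.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))

# Difference between the training and test scores reported above.
train_acc, test_acc = 0.8140203125, 0.78260625
print(round(train_acc - test_acc, 4))
```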


Saving the model

In [54]:
import pickle

In [55]:
filename = 'trained_model.sav'
# the context manager closes the file once the model is written
with open(filename, 'wb') as f:
  pickle.dump(model, f)

Using the saved model for future predictions

In [56]:
# loading the saved model
with open('/content/trained_model.sav', 'rb') as f:
  loaded_model = pickle.load(f)
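
The saved file holds only the classifier. To score new raw tweets later, the fitted `TfidfVectorizer` would be needed as well, because the model expects the same feature columns it was trained on. One option is to pickle both objects together. The sketch below shows the pattern with plain dictionaries as stand-ins for the fitted objects; the stand-ins and the file name `sentiment_bundle.pkl` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (assumptions); in the notebook these would
# be `model` and `vectorizer`.
model_stub = {"kind": "logistic-regression", "coef": [0.5, -0.25]}
vectorizer_stub = {"vocabulary": {"love": 0, "upset": 1}}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")

# Context managers close the file even if pickling raises an error.
with open(path, "wb") as f:
    pickle.dump({"model": model_stub, "vectorizer": vectorizer_stub}, f)

with open(path, "rb") as f:
    bundle = pickle.load(f)

print(sorted(bundle))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.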

In [57]:
X_new = X_test[200]
print(Y_test[200])  # true label of the selected sample

# predict only the selected sample, using the model loaded from disk
prediction = loaded_model.predict(X_new)
print(prediction)

if prediction[0] == 0:
  print("Negative tweet")
else:
  print("Positive tweet")


In [58]:
X_new = X_test[3]
print(Y_test[3])  # true label of the selected sample

# predict only the selected sample, using the model loaded from disk
prediction = loaded_model.predict(X_new)
print(prediction)

if prediction[0] == 0:
  print("Negative tweet")
else:
  print("Positive tweet")
