<a href="https://colab.research.google.com/github/Yuvadi29/My-Journey-To-AI/blob/master/Twitter_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Installing Kaggle Library
! pip install kaggle



### **Upload your Kaggle.json file**

In [3]:
# Configuring the path of Kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


### **Importing Twitter Sentiment Dataset**

In [4]:
# Download the Sentiment140 dataset through the Kaggle API
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
 89% 72.0M/80.9M [00:00<00:00, 104MB/s] 
100% 80.9M/80.9M [00:00<00:00, 96.4MB/s]


In [5]:
# Extract the Zip file downloaded from API
from zipfile import ZipFile
dataset = './sentiment140.zip'

with ZipFile(dataset, 'r') as zip_file:  # 'r' opens the archive for reading; avoid shadowing the built-in zip()
  zip_file.extractall()
  print("The dataset is extracted")

The dataset is extracted


### **Importing the Necessary Dependencies**

In [6]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
# Print the English stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### **Data Processing**

In [9]:
# Load the data from CSV to pandas dataframe
twitter_data = pd.read_csv('./training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')
twitter_data

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [10]:
# Check the number of rows and columns
twitter_data.shape

(1599999, 6)

In [11]:
# Print the first 5 rows
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [12]:
# The CSV has no header row, so the first load above used the first tweet as column names.
# Supply explicit column names and read the dataset again.
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
twitter_data = pd.read_csv('./training.1600000.processed.noemoticon.csv', names=column_names, encoding='ISO-8859-1')
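The difference between the two loads (1,599,999 rows versus 1,600,000) can be reproduced on a tiny in-memory CSV. This is a minimal sketch; the two-row sample and the shortened column list are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample with no header line, mimicking the layout of the real file.
raw = "0,1,first tweet\n4,2,second tweet\n"

# Default behaviour: the first line is consumed as the header.
with_default_header = pd.read_csv(io.StringIO(raw))

# Passing names= tells pandas that the file itself carries no header line.
with_names = pd.read_csv(io.StringIO(raw), names=["target", "id", "text"])

print(with_default_header.shape)  # one data row survives
print(with_names.shape)           # both rows are kept as data
print(list(with_names.columns))
```

Passing `names` without `header` makes pandas treat every line as data, which is why the second load recovers the missing row.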

In [13]:
twitter_data

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [14]:
# Check for missing values
twitter_data.isnull().sum()

Unnamed: 0,0
target,0
id,0
date,0
flag,0
user,0
text,0


In [15]:
# Check the distribution of the target column
twitter_data['target'].value_counts()

target,count
0,800000
4,800000


### **Convert the target "4" to "1"**

In [16]:
twitter_data.replace({'target': {4:1}}, inplace=True)

In [17]:
twitter_data['target'].value_counts()

target,count
0,800000
1,800000


0 ----> Negative Tweet
1 ----> Positive Tweet

### **Stemming**

In [18]:
port_stem = PorterStemmer()

In [19]:
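def stemming(content):
  # Keep letters only: every non-letter character becomes a space.
  # The outputs recorded in this notebook came from the pattern '[a-zA-Z]' (no ^),
  # which strips the letters instead and leaves only punctuation and digits.
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # Stem each remaining word and drop English stopwords
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stopwords.words('english')]
  stemmed_content = ' '.join(stemmed_content)

  return stemmed_content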

In [21]:
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)  #Takes 12 minutes to complete the execution
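Part of the long runtime probably comes from calling `stopwords.words('english')` once per word inside the list comprehension. A possible speed-up is to build the stopword collection once and test membership against a set. The sketch below shows only that pattern: `STOP_WORDS` is a small hypothetical stand-in for the NLTK list, `clean_tokens` is an illustrative name, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')), built once up front.
STOP_WORDS = frozenset({"i", "me", "my", "is", "that", "the", "and", "on", "its"})


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep letters only, and drop stopwords (no stemming here)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return [word for word in letters_only.lower().split() if word not in stop_words]


print(clean_tokens("my whole body feels itchy and like its on fire"))
```

Set membership is a constant-time lookup on average, whereas a list is scanned element by element for every word.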

In [22]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...","@ :// . /2 1 - , ' . . ;"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,' ... . !
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,@ . 50%
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....","@ , ' . ' . ? ' ."


In [23]:
# We will now only focus on the stemmed content and target
print(twitter_data['stemmed_content'])

0          @ :// . /2 1 - , ' . . ;
1                         ' ... . !
2                           @ . 50%
3                                  
4                 @ , ' . ' . ? ' .
                     ...           
1599995                           .
1599996         . - ! â« :// . /~8
1599997                           ?
1599998                      38 !!!
1599999                   # @ @ @ 4
Name: stemmed_content, Length: 1600000, dtype: object
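The `stemmed_content` values printed above contain only punctuation and digits, with no words. The character class passed to `re.sub` decides which side is kept: `[a-zA-Z]` replaces the letters, while `[^a-zA-Z]` replaces everything except the letters. A minimal sketch on one of the sample tweets (the variable names are illustrative):

```python
import re

tweet = "is upset that he can't update his Facebook"

# '[a-zA-Z]' matches letters, so the letters are what gets replaced.
strip_letters = re.sub(r"[a-zA-Z]", " ", tweet).split()

# '[^a-zA-Z]' matches everything that is NOT a letter, so the words survive.
keep_letters = re.sub(r"[^a-zA-Z]", " ", tweet).split()

print(strip_letters)
print(keep_letters)
```

With the first pattern only the apostrophe is left, which matches the shape of the column shown above.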


In [24]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


## **Separating Data and Labels, Splitting Training and Test Data**

In [25]:
X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values

In [26]:
# test_size=0.2 holds out 20% of the rows for testing and keeps 80% for training.
# stratify=Y preserves the class proportions of Y in both splits.
# random_state=2 fixes the shuffle seed so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
print(X.shape, X_train.shape, X_test.shape)
print(Y.shape, Y_train.shape, Y_test.shape)

(1600000,) (1280000,) (320000,)
(1600000,) (1280000,) (320000,)
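To illustrate what stratification and a fixed seed do, here is a plain-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper written only for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed -> the same shuffle on every run
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))   # class counts in the test split
print(Counter(labels[i] for i in train_idx))  # class counts in the training split
```

Each class contributes the same fraction of its rows to the test split, and calling the function again with the same seed returns the same indices.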


### **Vectorizing the data**

In [27]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
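The vocabulary and IDF weights are learned from the training data only (`fit_transform`), and the test data is then mapped onto that fixed vocabulary (`transform`). The sketch below imitates this in plain Python using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which are the `TfidfVectorizer` defaults. It is a simplification: tokens are split on whitespace rather than with the vectorizer's token pattern, and `fit_idf` and `transform` are hypothetical helper names.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn a smoothed IDF weight for every term seen in the training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map a document to L2-normalised TF-IDF weights; terms unseen in training are dropped."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["good day", "bad day", "good"]
idf = fit_idf(train_docs)

print(transform("good", idf))       # a single known term
print(transform("bad day", idf))    # two known terms with different IDF weights
print(transform("great day", idf))  # 'great' was never seen during fitting
```

A document with a single known term ends up with a weight of 1.0 after normalisation (up to floating-point rounding), which is consistent with a row whose only entry is 1.0 in a printed sparse matrix.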

In [28]:
print(X_train)

  (5, 3469)	1.0
  (6, 8484)	0.7382151603694854
  (6, 8444)	0.6745653244872989
  (20, 2318)	1.0
  (30, 533)	1.0
  (45, 1447)	1.0
  (46, 740)	1.0
  (61, 739)	1.0
  (74, 5784)	1.0
  (88, 6276)	1.0
  (96, 4235)	0.5988307948879529
  (96, 4499)	0.8008755702940765
  (99, 4055)	1.0
  (106, 7028)	1.0
  (107, 2534)	1.0
  (108, 4318)	1.0
  (109, 1)	0.6527945207385148
  (109, 1784)	0.75753502473072
  (110, 6772)	1.0
  (123, 4825)	0.5532665234794572
  (123, 4773)	0.8330042941047515
  (136, 739)	1.0
  (139, 5644)	1.0
  (142, 6127)	1.0
  (149, 3921)	0.6436322420464065
  :	:
  (1279826, 6920)	1.0
  (1279836, 7608)	1.0
  (1279842, 5938)	1.0
  (1279843, 5062)	1.0
  (1279848, 789)	1.0
  (1279849, 3142)	1.0
  (1279851, 3550)	1.0
  (1279861, 1656)	1.0
  (1279863, 4476)	1.0
  (1279864, 2686)	1.0
  (1279880, 2687)	1.0
  (1279894, 304)	0.8053986539714999
  (1279894, 1207)	0.5927335051951225
  (1279900, 4536)	1.0
  (1279913, 5551)	1.0
  (1279914, 3047)	1.0
  (1279917, 739)	1.0
  (1279934, 5222)	0.8274527483087

In [29]:
print(X_test)

  (6, 7029)	1.0
  (15, 2470)	1.0
  (19, 1599)	1.0
  (20, 6667)	1.0
  (23, 3552)	1.0
  (37, 2470)	1.0
  (41, 6968)	1.0
  (54, 902)	1.0
  (61, 6074)	1.0
  (64, 739)	1.0
  (66, 3613)	0.7178771988917738
  (66, 2534)	0.45491671156885005
  (66, 377)	0.5269754385611195
  (78, 1447)	1.0
  (92, 6641)	1.0
  (110, 8270)	1.0
  (113, 974)	1.0
  (114, 4055)	1.0
  (123, 3550)	1.0
  (131, 8270)	1.0
  (149, 4655)	1.0
  (170, 740)	1.0
  (176, 740)	1.0
  (184, 739)	1.0
  (195, 6772)	1.0
  :	:
  (319773, 659)	1.0
  (319822, 7976)	1.0
  (319834, 3550)	1.0
  (319844, 2504)	1.0
  (319849, 6825)	1.0
  (319860, 2504)	0.6699573225007557
  (319860, 0)	0.7423996134344486
  (319862, 6276)	1.0
  (319878, 4234)	0.7759385055112661
  (319878, 3550)	0.6308085570638233
  (319879, 6362)	1.0
  (319887, 739)	1.0
  (319898, 5946)	1.0
  (319907, 3306)	1.0
  (319917, 3469)	1.0
  (319922, 5391)	1.0
  (319929, 825)	1.0
  (319941, 8206)	1.0
  (319952, 1803)	1.0
  (319961, 6565)	1.0
  (319970, 1447)	1.0
  (319982, 3140)	1.0
  (31

### **Training the ML Model**

In [30]:
model = LogisticRegression(max_iter=1000)

In [32]:
model.fit(X_train, Y_train)  # Learn weights that relate the TF-IDF features (X) to the labels (Y)
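As a rough picture of what fitting a logistic regression involves, the sketch below trains one with plain batch gradient descent on a four-row toy dataset. This is not how scikit-learn solves the problem (its default solver and regularisation differ); `fit_logistic`, `predict`, the learning rate and the epoch count are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) - yi
            for j, x in enumerate(xi):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(X, weights, bias):
    """Return 1 where the predicted probability is at least 0.5, else 0."""
    return [int(sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) >= 0.5) for xi in X]


# Toy features: column 0 is active for positive rows, column 1 for negative rows.
X_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_toy = [1, 1, 0, 0]

weights, bias = fit_logistic(X_toy, y_toy)
print(predict(X_toy, weights, bias))
```

The learned weight for the first column becomes positive and the one for the second column negative, so the toy rows are classified correctly.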

### **Model Evaluation**

In [34]:
X_train_prediction = model.predict(X_train)
data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [36]:
print('Accuracy Score on Training data: ', data_accuracy)  #0.51 means 51% accuracy

Accuracy Score on Training data:  0.51601953125


In [37]:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [38]:
print('Accuracy Score on Testing data: ', test_data_accuracy)  #0.51 means 51% accuracy

Accuracy Score on Testing data:  0.510903125
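Accuracy is easier to judge next to a baseline. The dataset holds 800,000 tweets per class, so always predicting a single class would already score 50%, and the scores reported above sit close to that level. A minimal sketch of both quantities on made-up labels (`accuracy` and `majority_baseline` are illustrative helpers, not scikit-learn functions):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```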


### **Model Accuracy = 51%**

In [39]:
# Saving the model
import pickle


In [40]:
filename = 'trained_model.sav'
with open(filename, 'wb') as model_file:  # the with-block closes the file after writing
  pickle.dump(model, model_file)

In [44]:
# Load the saved model for future predictions
with open('./trained_model.sav', 'rb') as model_file:
  loaded_model = pickle.load(model_file)
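The save and load steps can be checked with a small round trip. In this sketch a plain dictionary stands in for the trained model (an assumption made so the example runs without scikit-learn), and a temporary directory keeps the test file out of the working folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
model_state = {"coef": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_model.sav")

    # Write the object to disk; the with-block closes the file handle.
    with open(path, "wb") as model_file:
        pickle.dump(model_state, model_file)

    # Read it back into a new object.
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == model_state)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.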

In [48]:
X_new = X_test[200]
print(Y_test[200])

prediction = loaded_model.predict(X_new)  # use the model restored from disk
print(prediction)

if prediction[0] == 0:
  print('Negative Tweet')
else:
  print('Positive Tweet')

1
[0]
Negative Tweet


In [49]:
X_new = X_test[3]
print(Y_test[3])

prediction = loaded_model.predict(X_new)  # use the model restored from disk
print(prediction)

if prediction[0] == 0:
  print('Negative Tweet')
else:
  print('Positive Tweet')

0
[0]
Negative Tweet
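The two prediction cells repeat the same if/else. One way to avoid the duplication is a lookup table from the numeric label to its description. A minimal sketch, with `LABELS` and `describe` as hypothetical names:

```python
# Numeric label -> description, following the 0/1 convention used in this notebook.
LABELS = {0: "Negative Tweet", 1: "Positive Tweet"}


def describe(prediction):
    """Return the description for the first element of a prediction array or list."""
    return LABELS[int(prediction[0])]


print(describe([0]))
print(describe([1]))
```

With the trained model this could be called as `describe(loaded_model.predict(X_new))`, assuming `loaded_model` and `X_new` are defined as in the cells above.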
