In this project, we will train a model on a dataset of 1.6 million tweets extracted from Twitter, each labelled with the sentiment it conveys, so that we can predict the sentiment (positive, negative or neutral) of any given tweet. The dataset is called Sentiment140 and was obtained from Kaggle. Nowadays, many people use Twitter as a platform to talk about issues, be they political, economic, or related to entertainment, technology and so on.
With the help of sentiment analysis we can do a variety of things, some of which are:
1. Judge whether a movie is good or not from the sentiment of the tweets that mention it.
2. Gauge public opinion on an important political issue.
3. Help curb the spread of hate speech.
So with that, let's start the project by reading the data.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import chardet

In [3]:
# Try to identify the file's encoding with chardet (only the first 10,000 bytes are inspected).
# It reports ascii, but reading the file as ascii did not work.
# chardet can still be a useful first guess for character encodings.
with open('Twitter_Sentiment_Data.csv','rb') as bytedata:
    result = chardet.detect(bytedata.read(10000))
print(result)

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
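
Because chardet only saw the first 10,000 bytes, an all-ASCII prefix can hide non-ASCII bytes further into the file. A minimal sketch of a fallback, assuming a hypothetical helper `first_working_encoding` and an illustrative candidate list (neither is part of this notebook): try to decode the full byte string with each candidate in turn and keep the first one that does not raise.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None

# Toy bytes (not the real file): 10,000 ASCII characters, then one non-ASCII byte.
raw = b"a" * 10000 + b"\xe9"
prefix_guess = first_working_encoding(raw[:10000])  # what a 10,000-byte sample sees
full_guess = first_working_encoding(raw)            # what the whole input needs
print(prefix_guess, full_guess)
```

Note that latin-1 accepts every possible byte, so a successful decode only shows that the encoding is usable, not that it is the one the file was written in.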


In [4]:
data=pd.read_csv('Twitter_Sentiment_Data.csv',encoding='cp437',header=None)
data.columns=['Sentiment','id','Date_of_posting','flag','username','tweet']
data.head()

Unnamed: 0,Sentiment,id,Date_of_posting,flag,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [5]:
data.info()
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
Sentiment          1600000 non-null int64
id                 1600000 non-null int64
Date_of_posting    1600000 non-null object
flag               1600000 non-null object
username           1600000 non-null object
tweet              1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Sentiment          0
id                 0
Date_of_posting    0
flag               0
username           0
tweet              0
dtype: int64

We can see that there are no null values in this data, but some data cleaning is still required. We need to clean the 'tweet' column. Let's start!

In [6]:
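import re

def clean_data(tweet):
    # Note: str.lower() returns a new string, so a bare tweet.lower() has no effect.
    # Case is left unchanged here, consistent with the outputs shown below;
    # use tweet = tweet.lower() if lowercasing is wanted at this stage.
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with 'URL'
    tweet = re.sub(r'@[\S]+', 'User_Name', tweet)  # @username -> User_Name
    tweet = re.sub(r'[\s]+', ' ', tweet)  # collapse redundant whitespace
    tweet = re.sub(r'#([\S]+)', r'\1', tweet)  # strip the hash sign, i.e. #topic -> topic
    return tweet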
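
To see what each substitution does, here is a small worked example on a made-up tweet. The helper `clean_tweet` is a hypothetical stand-alone copy of the four regex steps above, and the sample string is invented for illustration.

```python
import re

def clean_tweet(tweet):
    """Apply the same four substitutions, in the same order, as the cleaning step."""
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # URLs -> 'URL'
    tweet = re.sub(r'@[\S]+', 'User_Name', tweet)                       # mentions -> 'User_Name'
    tweet = re.sub(r'[\s]+', ' ', tweet)                                # collapse whitespace runs
    tweet = re.sub(r'#([\S]+)', r'\1', tweet)                           # '#topic' -> 'topic'
    return tweet

sample = "@bob   check http://example.com/x  #nlp rocks"
cleaned = clean_tweet(sample)
print(cleaned)  # User_Name check URL nlp rocks
```

Since `@[\S]+` matches every non-space character after the `@`, punctuation attached to a mention (for example `@bob,`) is swallowed into the placeholder as well.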

In [7]:
data_copy=data.copy()
data_copy['tweet']=data_copy['tweet'].apply(clean_data)
data_copy.head()

Unnamed: 0,Sentiment,id,Date_of_posting,flag,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"User_Name URL - Awww, that's a bummer. You sho..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,User_Name I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"User_Name no, it's not behaving at all. i'm ma..."


In [8]:
data_copy.tail()

Unnamed: 0,Sentiment,id,Date_of_posting,flag,username,tweet
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy charitytuesday User_Name User_Name User_...


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv=TfidfVectorizer(sublinear_tf=True, stop_words = "english")
features=tfv.fit_transform(np.array(data_copy['tweet']))
features.shape

(1600000, 284989)

In [10]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
labels=np.array(data_copy['Sentiment'])
#labels=labels.reshape(-1,1)
#labels.shape
model.fit(features,labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [11]:
from sklearn.metrics import accuracy_score
pred=model.predict(features)
score=accuracy_score(labels,pred)
score

0.79574125
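
The score above is computed on the same rows the model was fitted on. A minimal sketch of a held-out evaluation, assuming a tiny invented corpus in place of the real tweets (the texts, the 75/25 split and `random_state=0` are illustrative choices, not part of this notebook): the vectorizer and the classifier are fitted on the training split only, and accuracy is measured on rows neither has seen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the tweets; labels use the dataset's 0 / 4 coding.
texts = [
    "i love this movie", "what a great day", "so happy right now",
    "best concert ever", "this is wonderful news", "really enjoyed the game",
    "i hate mondays", "this is terrible", "so sad and tired",
    "worst service ever", "my phone broke again", "feeling sick today",
]
labels = [4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0]

# Hold out 25% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(sublinear_tf=True)
train_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

classifier = MultinomialNB().fit(train_features, y_train)
held_out_accuracy = accuracy_score(y_test, classifier.predict(test_features))
print(held_out_accuracy)
```

On a corpus this small the number itself means little; the point is the order of operations, which carries over unchanged to the full dataset.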

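With a simple Naive Bayes classifier, we get an accuracy of about 80%, which is quite good considering how simple the model is. Note that this score is computed on the same data the model was trained on, so it is likely an optimistic estimate of performance on unseen tweets. Now, let's try some natural language processing with spaCy and see if our accuracy improves.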

In [12]:
import spacy

The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens. When we create a spaCy model with nlp = spacy.load('en_core_web_sm'), there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

In [13]:
nlp=spacy.blank('en')
text_cat=nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})
nlp.add_pipe(text_cat)

Before we proceed, let's clean up the sentiment column in our data by converting the numerical values to their corresponding labels.

In [14]:
def sentiments(val):
    if val==0:
        return 'negative'
    elif val==4:
        return 'positive'
    else:
        return 'neutral'
data_copy['Sentiment']=data_copy['Sentiment'].apply(sentiments)
data_copy.head()

Unnamed: 0,Sentiment,id,Date_of_posting,flag,username,tweet
0,negative,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"User_Name URL - Awww, that's a bummer. You sho..."
1,negative,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,negative,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,User_Name I dived many times for the ball. Man...
3,negative,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,negative,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"User_Name no, it's not behaving at all. i'm ma..."
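
The head and tail shown in this notebook only contain the values 0 and 4; whether any other value occurs in the column is not shown. A minimal sketch of how one might check which classes are actually present after the mapping, assuming a toy DataFrame in place of `data_copy` (the five rows are invented):

```python
import pandas as pd

def sentiments(val):
    """Same mapping as above: 0 -> negative, 4 -> positive, anything else -> neutral."""
    if val == 0:
        return 'negative'
    elif val == 4:
        return 'positive'
    else:
        return 'neutral'

# Toy stand-in for the real column.
toy = pd.DataFrame({'Sentiment': [0, 0, 4, 4, 4]})
counts = toy['Sentiment'].map(sentiments).value_counts()

# Classes the categorizer will be told about but that never occur in the data.
missing = [name for name in ('negative', 'positive', 'neutral') if name not in counts.index]
print(counts.to_dict(), missing)
```

If a class turns out to be missing from the real data, the categorizer gets no examples of it, and predictions for that class should be read with that in mind.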


Now we will add the labels to the text categorizer, i.e. negative, positive and neutral.

In [15]:
text_cat.add_label('negative')
text_cat.add_label('positive')
text_cat.add_label('neutral')

1

Next, we'll convert the labels in the data to the form TextCategorizer requires. For each document, we'll create a dictionary of boolean values, one per class.

In [16]:
train_texts = data_copy['tweet'].values
train_labels = [{'cats': {'negative': label == 'negative',
                          'positive': label == 'positive',
                          'neutral': label=='neutral'}} 
                for label in data_copy['Sentiment']]
# Each entry is a dict marking the sentiment of the corresponding text, e.g. for 'Some text':
# {'cats': {'negative': True, 'positive': False, 'neutral': False}}
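
The same conversion can be written as a small helper that works for any list of class names. This is only a sketch; `to_cats` is a hypothetical name, not a spaCy function.

```python
def to_cats(label, classes=("negative", "positive", "neutral")):
    """Build the {'cats': {...}} dict with True for the matching class only."""
    return {"cats": {name: label == name for name in classes}}

example = to_cats("positive")
print(example)  # {'cats': {'negative': False, 'positive': True, 'neutral': False}}
```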

In [17]:
train_data = list(zip(train_texts, train_labels))
train_data[:5]

[("User_Name URL - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",
  {'cats': {'negative': True, 'positive': False, 'neutral': False}}),
 ("is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
  {'cats': {'negative': True, 'positive': False, 'neutral': False}}),
 ('User_Name I dived many times for the ball. Managed to save 50% The rest go out of bounds',
  {'cats': {'negative': True, 'positive': False, 'neutral': False}}),
 ('my whole body feels itchy and like its on fire ',
  {'cats': {'negative': True, 'positive': False, 'neutral': False}}),
 ("User_Name no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ",
  {'cats': {'negative': True, 'positive': False, 'neutral': False}})]

Now we are ready to train the model. First, we'll create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general it's more efficient to train models in small batches. spaCy provides the minibatch function, which returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with nlp.update to update the model's parameters. We repeat this process for 3 epochs (passes over the data) so that the model has several chances to learn from every tweet.

In [19]:
import random
from spacy.util import minibatch
random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

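# losses is created once, outside the loop, so the printed values are running totals across epochs
losses = {}
for epoch in range(3):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 10000
    batches = minibatch(train_data, size=10000)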
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 6.709390627523959e-07}
{'textcat': 1.1893510454275003e-06}
{'textcat': 1.6742088368992825e-06}
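
To make the batching and the `zip(*batch)` split in the loop above concrete, here is a plain-Python sketch. `simple_minibatch` is a hypothetical stand-in written for illustration; it is not spaCy's implementation, and the five training pairs are invented.

```python
from itertools import islice

def simple_minibatch(items, size):
    """Yield consecutive lists of at most `size` items until `items` is exhausted."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

# Five toy (text, annotation) pairs in the same shape as train_data.
pairs = [
    ("t1", {"cats": {"negative": True, "positive": False}}),
    ("t2", {"cats": {"negative": False, "positive": True}}),
    ("t3", {"cats": {"negative": True, "positive": False}}),
    ("t4", {"cats": {"negative": False, "positive": True}}),
    ("t5", {"cats": {"negative": True, "positive": False}}),
]

batches = list(simple_minibatch(pairs, size=2))
batch_sizes = [len(batch) for batch in batches]

# zip(*batch) turns a list of (text, label) pairs into one tuple of texts and one of labels.
texts, labels = zip(*batches[0])
print(batch_sizes, texts)
```

The last batch is simply shorter when the data does not divide evenly by the batch size.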


For some basic validation, we will use a text from the training data itself and see what our model predicts.

In [21]:
text=[ "Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D'"]
docs=[nlp.tokenizer(token) for token in text]
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)

[[7.8798574e-01 2.1181148e-01 2.0275638e-04]]


In [22]:
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['negative']


Voila! The prediction is accurate according to the dataset.

In [29]:
text=['''There has been a unprecedented surge in COVID Recoveries in India.

There is more than 100% increase in recovered patients and they have been discharged in last 29 days''']
docs=[nlp.tokenizer(token) for token in text]
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

[[8.2303387e-01 1.7687981e-01 8.6328742e-05]]
['negative']
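
The score-to-label step is repeated in both prediction cells, so it could be pulled into a small helper. A minimal sketch, assuming the label order negative, positive, neutral (the order in which the labels were added); `top_labels` is a hypothetical name, and the score rows are copied from the two outputs above.

```python
import numpy as np

def top_labels(scores, labels):
    """Return, for each row of scores, the label with the highest score."""
    scores = np.asarray(scores)
    return [labels[index] for index in scores.argmax(axis=1)]

labels = ("negative", "positive", "neutral")  # assumed to match textcat.labels
recorded_scores = [
    [7.8798574e-01, 2.1181148e-01, 2.0275638e-04],  # first example
    [8.2303387e-01, 1.7687981e-01, 8.6328742e-05],  # second example
]
predicted = top_labels(recorded_scores, labels)
print(predicted)  # ['negative', 'negative']
```

In both recorded rows the third score is several orders of magnitude smaller than the other two, so the choice is effectively between negative and positive.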
