In this project, we will train a model on a dataset of 1.6 million tweets extracted from Twitter, each labelled with the sentiment it conveys, so that we can predict the sentiment of any given tweet (positive, negative or neutral). The dataset is called Sentiment140 and was obtained from Kaggle. Nowadays, many people use Twitter as a platform to talk about all kinds of issues: politics, the economy, entertainment, technology, and the list goes on.
With the help of sentiment analysis we can do a variety of things, some of which are:
1. Judge whether a movie is good or not from the sentiment of the tweets that mention it.
2. Get the public's opinion on an important political issue.
3. Help curb the spread of hate speech.
So with that, let's start the project by reading the data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import chardet

In [2]:
# Try to identify the file's encoding with chardet. It reports ascii, but reading the file as ascii did not work,
# most likely because only the first 10,000 bytes are sampled. chardet can still be a useful first check for character encodings.
with open('Twitter_Sentiment_Data.csv','rb') as bytedata:
    result = chardet.detect(bytedata.read(10000))
print(result)

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
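
When a detector's guess does not work, one option is to try a short list of candidate encodings in order. The sketch below is only an illustration: `first_working_encoding` and its candidate list are hypothetical names, not part of chardet or pandas, and the sample bytes are made up.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return name
    return None

# Made-up sample: one non-ASCII byte that is also invalid as UTF-8.
sample = "café".encode("latin-1")
print(first_working_encoding(sample))
```

Note that single-byte encodings such as latin-1 or cp437 accept any byte sequence, so a successful decode does not prove the encoding is the right one; it only means the file can be read without errors.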


In [3]:
data=pd.read_csv('Twitter_Sentiment_Data.csv',encoding='cp437',header=None)
data.columns=['Sentiment','id','Date_of_posting','flag','username','tweet']
data.head()

,Sentiment,id,Date_of_posting,flag,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [4]:
data.info()
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
Sentiment          1600000 non-null int64
id                 1600000 non-null int64
Date_of_posting    1600000 non-null object
flag               1600000 non-null object
username           1600000 non-null object
tweet              1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Sentiment          0
id                 0
Date_of_posting    0
flag               0
username           0
tweet              0
dtype: int64

We can see that there are no null values in this data, but some data cleaning is still required. We need to clean the 'tweet' column. Let's start!

In [5]:
import re

def clean_data(tweet):
    """Normalise URLs, @mentions, whitespace and hashtags in a single tweet."""
    # Case is left unchanged here; TfidfVectorizer lowercases by default when the features are built.
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with 'URL'
    tweet = re.sub(r'@[\S]+', 'User_Name', tweet)  # @username -> User_Name
    tweet = re.sub(r'[\s]+', ' ', tweet)  # collapse redundant whitespace
    tweet = re.sub(r'#([\S]+)', r'\1', tweet)  # strip the hash sign, i.e. #topic -> topic
    return tweet
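
To see what each substitution does, it can help to trace one tweet through the steps. The sketch below reuses the same four patterns in the same order; `STEPS`, `trace_cleaning` and the sample tweet are hypothetical and exist only for this illustration (the sample is loosely modelled on the first row of the data, with a made-up ending).

```python
import re

# The same four substitutions as clean_data, kept in order so each stage can be inspected.
STEPS = [
    ("urls", r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL'),
    ("mentions", r'@[\S]+', 'User_Name'),
    ("whitespace", r'[\s]+', ' '),
    ("hashtags", r'#([\S]+)', r'\1'),
]

def trace_cleaning(tweet):
    """Return [(step_name, text_after_step), ...] for one tweet."""
    trace = []
    for name, pattern, repl in STEPS:
        tweet = re.sub(pattern, repl, tweet)
        trace.append((name, tweet))
    return trace

# Made-up sample tweet with a mention, a URL, extra spaces and a hashtag.
sample = "@switchfoot http://twitpic.com/2y1zl - Awww,   that's a #bummer"
for name, text in trace_cleaning(sample):
    print(f"{name:<10} {text}")
```

The order matters: URLs are replaced before mentions, so an `@` inside a link is consumed by the URL pattern and is not turned into `User_Name`.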

In [6]:
data_copy=data.copy()
data_copy['tweet']=data_copy['tweet'].apply(clean_data)
data_copy.head()

,Sentiment,id,Date_of_posting,flag,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"User_Name URL - Awww, that's a bummer. You sho..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,User_Name I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"User_Name no, it's not behaving at all. i'm ma..."


In [7]:
data_copy.tail()

,Sentiment,id,Date_of_posting,flag,username,tweet
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy charitytuesday User_Name User_Name User_...


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv=TfidfVectorizer(sublinear_tf=True, stop_words = "english")
features=tfv.fit_transform(np.array(data_copy['tweet']))
features.shape

(1600000, 284989)
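
Each of the 284,989 columns holds a TF-IDF weight for one term. As a rough picture of what `sublinear_tf=True` changes, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's default smoothed idf and L2 normalisation. The tokeniser is simplified to a whitespace split and stop-word removal is left out, so the numbers will not match `TfidfVectorizer` exactly; `tfidf_sketch` is a hypothetical helper and the two documents are made up.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokeniser (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            # sublinear tf: 1 + ln(tf); smoothed idf: ln((1 + n) / (1 + df)) + 1
            term: (1.0 + math.log(tf)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return rows

rows = tfidf_sketch(["good good good movie", "bad movie"])
print(rows[0])
```

With sublinear scaling, a word repeated three times counts for 1 + ln(3), roughly 2.1, rather than 3, which dampens the effect of repeated words within a single tweet.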

In [29]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
labels=np.array(data_copy['Sentiment'])
#labels=labels.reshape(-1,1)
#labels.shape
model.fit(features,labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
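
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. A minimal sketch of the idea, using made-up integer counts for one class (in the notebook the per-class totals are sums of TF-IDF weights rather than raw counts); `smoothed_likelihoods` is a hypothetical helper, not a scikit-learn function.

```python
def smoothed_likelihoods(term_counts, alpha=1.0):
    """Map {term: count within one class} to {term: P(term | class)} with additive smoothing."""
    total = sum(term_counts.values())
    vocab_size = len(term_counts)
    return {
        term: (count + alpha) / (total + alpha * vocab_size)
        for term, count in term_counts.items()
    }

# Hypothetical counts for the negative class over a three-word vocabulary.
counts_negative = {"sad": 3, "tired": 1, "happy": 0}
probs = smoothed_likelihoods(counts_negative)
print(probs)
```

Without smoothing, a word never seen in a class would get probability zero and wipe out the whole product for any tweet containing it; adding `alpha` to every count avoids that.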

In [36]:
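from sklearn.metrics import accuracy_score
pred=model.predict(features)  # predictions on the same tweets the model was fitted on
score=accuracy_score(labels,pred)  # so this is training accuracy, not held-out accuracy
score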

0.79574125
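
This score is computed on the same tweets the model was fitted on, so it probably overstates how well the model does on unseen tweets. A common alternative is to hold out part of the data and fit the vectorizer and the classifier on the training part only. The sketch below shows the pattern on a tiny made-up corpus standing in for `data_copy['tweet']` and `data_copy['Sentiment']`; the texts, the 80/20 split and `LABEL_NAMES` are assumptions for illustration (0 and 4 are taken to mean negative and positive, as in the usual Sentiment140 encoding).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the cleaned tweets and their labels (made-up examples).
texts = [
    "love this movie", "great day today", "happy and excited",
    "awesome concert tonight", "wonderful sunny morning",
    "hate this traffic", "awful rainy day", "sad and tired",
    "terrible headache tonight", "horrible service again",
]
labels = [4] * 5 + [0] * 5

# Hold out 20% of the tweets, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training tweets only.
pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words="english"),
    MultinomialNB(),
)
pipeline.fit(X_train, y_train)

held_out_score = accuracy_score(y_test, pipeline.predict(X_test))
print(held_out_score)

# Predicting the sentiment of a new, unseen tweet.
LABEL_NAMES = {0: "negative", 4: "positive"}  # assumed label meaning
new_tweet = "what a great concert"
print(LABEL_NAMES[int(pipeline.predict([new_tweet])[0])])
```

On the full dataset the same pattern would apply with `texts = data_copy['tweet']` and `labels = data_copy['Sentiment']`; the held-out score is the number to compare against the training accuracy.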