# Sentiment Analysis with Twitter Data

We reuse the classifier from the previous case study to label the tweets that we fetched. The cell below reads the labeled training tweets and trains a BernoulliNB classifier (details are in the original notebook).

Because the goal here is to predict the sentiment of the fetched tweets rather than to evaluate the model, we train on all of the labeled data.

Both BernoulliNB and CountVectorizer come from scikit-learn.

In [2]:
import string
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def clean_words(line):
    # Keep words that are not mentions or links and have at least 3 characters
    usedwords = []
    for word in line.split():
        if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http':
            # Strip punctuation (this form works on both Python 2 and Python 3)
            wordnp = ''.join(ch for ch in word if ch not in string.punctuation)
            # Drop words that are empty or start with a digit
            if len(wordnp) > 0 and not wordnp[0].isdigit():
                usedwords.append(wordnp.lower())
    return usedwords

tweets = []
y = []
# Label 0 = negative tweet, label 1 = positive tweet
for path, label in [('../data/neg_tweets.txt', 0), ('../data/pos_tweets.txt', 1)]:
    with open(path, 'r') as infile:
        for line in infile:
            usedwords = clean_words(line)
            if len(usedwords) > 0:  # We omit empty tweets
                tweets.append(' '.join(usedwords))
                y.append(label)



# stop_words='english' removes common English stop words.
# min_df=50 keeps only words that appear in at least 50 tweets.
# binary=True records presence or absence (1 or 0) instead of word counts.
cv = CountVectorizer(min_df=50, stop_words='english', binary=True)
cv_matrix = cv.fit_transform(tweets)
print(np.shape(cv_matrix))

bnb = BernoulliNB()

# Train the classifier on the full labeled set
bnb.fit(cv_matrix, y)
print('BNB trained')

(642542, 5698)
BNB trained
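
To make the effect of `min_df` and `binary=True` concrete, here is a minimal sketch on a made-up four-tweet corpus. The names `toy_tweets`, `toy_y`, `toy_cv` and `toy_bnb` are hypothetical and are not part of the original notebook. The threshold is lowered to `min_df=2` so that it has a visible effect on such a small corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus; 0 = negative, 1 = positive (same labels as above)
toy_tweets = ["sad sad day", "happy day", "sad news today", "happy happy news"]
toy_y = [0, 1, 0, 1]

# min_df=2 drops words that occur in fewer than two tweets ("today" here)
toy_cv = CountVectorizer(min_df=2, binary=True)
toy_matrix = toy_cv.fit_transform(toy_tweets)

vocab = sorted(toy_cv.vocabulary_)
dense = toy_matrix.toarray()
print(vocab)
print(dense)  # "sad sad day" is encoded with a 1 for "sad", not a 2

# Train on the toy data and classify an unseen sentence
toy_bnb = BernoulliNB().fit(toy_matrix, toy_y)
toy_pred = toy_bnb.predict(toy_cv.transform(["such a sad sad story"]))
print(toy_pred)
```

Words outside the fitted vocabulary ("such", "story") are ignored by `transform`, which is also what happens to unseen words in the fetched tweets.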


Next, we load the fetched tweets and clean them in a similar way so that they can be passed to the classifier.

In [9]:
import json
import string
import unicodedata

new_tweets = []       # cleaned text, used as classifier input
new_tweets_list = []  # ASCII version of the original text, used for display
with open('../data/fetched_tweets.txt', 'r') as f:
    for line in f:
        tweet = json.loads(line)  # load it as a Python dict
        tweet = tweet['text']
        # Drop non-ASCII characters; decode so that we get text back on Python 3 too
        tweet = unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('ascii')
        # Skip mentions, links and short words, then strip punctuation and lowercase
        usedwords = [''.join(ch for ch in word if ch not in string.punctuation).lower()
                     for word in tweet.split()
                     if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http']
        # Drop purely numeric tokens
        usedwords = [word for word in usedwords if not word.isdigit()]
        new_tweets.append(' '.join(usedwords))
        new_tweets_list.append(tweet)
print(new_tweets[0])
print(new_tweets_list[0])

we must stand against hate wherever rears its ugly head hillary
RT @HillaryClinton: "We must stand against hate wherever it rears its ugly head." Hillary in 2000
https://t.co/qyhdZysMmH
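
The `unicodedata.normalize('NFKD', ...)` step followed by an ASCII encode is easy to misread, so here is a small sketch of what it does to a string. The helper name `to_ascii` and the sample strings are hypothetical and do not come from the fetched data.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding with errors='ignore' then drops everything that is not ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

accented = to_ascii(u'caf\u00e9 na\u00efve')
emoji = to_ascii(u'great day \U0001F600')
print(accented)     # accents are removed, base letters are kept
print(repr(emoji))  # the emoji has no ASCII equivalent and is dropped
```

Accented letters therefore survive as their base letters, while emoji and other symbols disappear, which means any sentiment they carried is lost before classification.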


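We now vectorize the cleaned tweets with the fitted CountVectorizer and compute class probabilities with the trained classifier. Since the labels are 0 (negative) and 1 (positive), column 0 of the result holds the probability of a negative tweet and column 1 that of a positive tweet.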

In [10]:
cv_new_tweets = cv.transform(new_tweets)

y_sentiment = bnb.predict_proba(cv_new_tweets)
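
As a quick check on how to read the output of `predict_proba`, the sketch below fits a BernoulliNB on a tiny made-up binary matrix. The names `X_toy`, `y_toy` and `clf` are hypothetical. The columns of the returned array follow the order of `clf.classes_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature matrix: 4 documents, 2 features
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = [0, 0, 1, 1]  # 0 = negative, 1 = positive

clf = BernoulliNB().fit(X_toy, y_toy)
proba = clf.predict_proba(np.array([[1, 0]]))

print(clf.classes_)  # column order used by predict_proba
print(proba)         # one row per sample; each row sums to 1
```

The sample shares its only active feature with the negative training documents, so most of the probability mass lands in column 0.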

Print the tweets whose predicted probability of being negative is at least 0.9; the cell after that does the same for positive tweets.

In [11]:
for i in range(len(y_sentiment)):
    if y_sentiment[i, 0] >= 0.9:
        print(new_tweets_list[i])
        print('Negative sentiment:  {} Positive sentiment:  {}'.format(y_sentiment[i, 0], y_sentiment[i, 1]))

RT @HillaryClinton: "We must stand against hate wherever it rears its ugly head." Hillary in 2000
https://t.co/qyhdZysMmH
Negative sentiment:  0.973976264389 Positive sentiment:  0.0260237356105
Wow! No news Fox wants it to be Bernie because Trump will eat him alive sadly to say and they know Hill will destroy Trump in UC election.
Negative sentiment:  0.943720021124 Positive sentiment:  0.0562799788764
RT @chatachula: VOTE FOR BERNIE TOMORROW PLEASE GOD IF HE DONT GET THIS CALIFORNIA PRIMARY IT GON BE BETWEEN HILARY AND TRUMP DO U WANNA DIE
Negative sentiment:  0.940436511975 Positive sentiment:  0.0595634880246
RT @david8hughes: Trump: fucken hate muslims
Advisor: Muhammad Ali just died
Trump: butterfly box man?
Advisor: butterfly box man
Trump: I...
Negative sentiment:  0.973090057892 Positive sentiment:  0.0269099421076
RT @cher: MUST VOTE TOMM!!Am Angry News Didnt Wait Till After Polls Closed TUES,B4 Calling HILLARY Presumptive Nominee.WE MUST VOTE 4HER T...
Negative sentiment:  0
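
The same selection can be written with a NumPy boolean condition instead of an explicit `if` inside the loop. This is only a sketch: `probs` and `texts` are hypothetical stand-ins for `y_sentiment` and `new_tweets_list`, and `threshold` is an illustrative name for the 0.9 cutoff.

```python
import numpy as np

# Hypothetical stand-ins for y_sentiment and new_tweets_list
probs = np.array([[0.97, 0.03], [0.40, 0.60], [0.05, 0.95]])
texts = ['tweet a', 'tweet b', 'tweet c']

threshold = 0.9
# Indices of rows whose negative / positive probability reaches the threshold
negative_idx = np.where(probs[:, 0] >= threshold)[0]
positive_idx = np.where(probs[:, 1] >= threshold)[0]

for i in negative_idx:
    print(texts[i], probs[i, 0], probs[i, 1])
```

Keeping the index arrays around also makes it easy to count how many tweets fall into each group, for example with `len(negative_idx)`.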

In [12]:
for i in range(len(y_sentiment)):
    if y_sentiment[i, 1] >= 0.9:
        print(new_tweets_list[i])
        print('Negative sentiment:  {} Positive sentiment:  {}'.format(y_sentiment[i, 0], y_sentiment[i, 1]))

RT @mitchellvii: PRIMARY DAY! California, New Jersey, Montana, New Mexico and South Dakota all vote today!  Let's get Trump over 1400!
Negative sentiment:  0.0554977837101 Positive sentiment:  0.94450221629
RT @Vipin4Vns: Congratulations: Madam Hillary Clinton clinches Democratic presidential nomination https://t.co/2yXdtXpDbF
Negative sentiment:  0.0453437830962 Positive sentiment:  0.954656216904
RT @msmitharena: AP calls it for Hillary as presumptive Democratic candidate. So proud of her--she's my hero. Unadulterated joy! https://t....
Negative sentiment:  0.0456692631977 Positive sentiment:  0.954330736802
RT @Schwarzenegger: Judge Curiel is an American hero who stood up to the Mexican cartels. I was proud to appoint him when I was Gov. https:...
Negative sentiment:  0.0424177215223 Positive sentiment:  0.957582278478
RT @JackPMoore: Meryl Streep dressed up like Donald Trump tonight and Christine Baranski loved it. Sometimes the world is perfect. https://...
Negative sentiment:  0.