# Introduction

The purpose of this notebook will be to test different Sentiment Analysis techniques. 

First, I will use a technique proposed by Peter Turney in his paper [Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews](https://arxiv.org/abs/cs/0212032). 

Next, I will train the TextBlob sentiment analysis tool to accurately capture the sentiment of tweets in my data.

## Step 0: Recap 

First, I will be reperforming a few of the same steps from part 2 such as reading in the libraries and csv file, cleaning the tweets, tokenizing the tweets, and counting the most common words and co-occurrences. 

In [1]:
import re
from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
import string
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline

In [2]:
df = pd.read_csv('die-hard-tweets-no-retweets.csv')
df.columns = ['Tweet']
df.head()

Unnamed: 0,Tweet
0,Terrible take. Die Hard is a Christmas movie a...
1,Uhhh die hard is a Christmas movie fight me
2,Broke: Die Hard is a Christmas movie. Woke: E...
3,Okay I'm like 97% Die Hard is a Christmas movi...
4,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証 https://t.co/y...


In [3]:
# First, create function to remove Twitter Handles
def clean_tweet(tweet):
    tweet = re.sub('http\S+\s*', '', tweet) #removes all urls
    tweet = re.sub('RT|cc', '', tweet) #removes RT and CC's 
    tweet = re.sub('@\S+', '', tweet) #removes mentions
    tweet = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet) #removes any special characters
    tweet = re.sub('\s+',' ', tweet) #removes blank spaces
    return tweet.lower()

# Remove twitter handles
df['clean_tweet'] = np.vectorize(clean_tweet)(df['Tweet'])
df.head(25)

Unnamed: 0,Tweet,clean_tweet
0,Terrible take. Die Hard is a Christmas movie a...,terrible take die hard is a christmas movie an...
1,Uhhh die hard is a Christmas movie fight me,uhhh die hard is a christmas movie fight me
2,Broke: Die Hard is a Christmas movie. Woke: E...,broke die hard is a christmas movie woke eyes ...
3,Okay I'm like 97% Die Hard is a Christmas movi...,okay im like 97 die hard is a christmas movie ...
4,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証 https://t.co/y...,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証
5,"Okay, Darren, I concede. Even though Bruce Wil...",okay darren i concede even though bruce willis...
6,@FakeFakeNedYost If Die Hard is a Christmas mo...,if die hard is a christmas movie than die har...
7,Im sad cause I forgot to watch Die Hard for Ch...,im sad cause i forgot to watch die hard for ch...
8,"Well, I guess Die Hard *is* a Christmas tradit...",well i guess die hard is a christmas tradition...
9,Perhaps then a Boxing Day movie? https://t.co/...,perhaps then a boxing day movie


In [5]:
df['clean_tweet'].replace(' ', np.nan, inplace=True)
df = df.dropna()

# Reset indeces
df.index = range(2097)
df.head()

Unnamed: 0,Tweet,clean_tweet
0,Terrible take. Die Hard is a Christmas movie a...,terrible take die hard is a christmas movie an...
1,Uhhh die hard is a Christmas movie fight me,uhhh die hard is a christmas movie fight me
2,Broke: Die Hard is a Christmas movie. Woke: E...,broke die hard is a christmas movie woke eyes ...
3,Okay I'm like 97% Die Hard is a Christmas movi...,okay im like 97 die hard is a christmas movie ...
4,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証 https://t.co/y...,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証


In [6]:
df.shape

(2097, 2)

In [7]:
tokenized_tweet = df['clean_tweet'].apply(lambda x: x.split())
tokenized_tweet.head()

0    [terrible, take, die, hard, is, a, christmas, ...
1    [uhhh, die, hard, is, a, christmas, movie, fig...
2    [broke, die, hard, is, a, christmas, movie, wo...
3    [okay, im, like, 97, die, hard, is, a, christm...
4                    [「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証]
Name: clean_tweet, dtype: object

In [8]:
import operator
from collections import Counter

count_all = Counter()
for line in tokenized_tweet:
    count_all.update(line)
print(count_all.most_common(5))

[('christmas', 2588), ('hard', 2034), ('die', 2024), ('a', 1998), ('is', 1538)]


Next, we will import the defaultdict function which takes a function object as an argument and will return a value, in our case an integer. This will help facilitate our calculation of the probability of observing the terms individually and ocurring together.

We will use the code we used to create the co-occurences matrix to facilitate this process.

In [9]:
# co-occurences matrix 
from collections import defaultdict

com = defaultdict(lambda : defaultdict(int))
 
for line in tokenized_tweet: 
    terms_only = [term for term in line 
                  if not term.startswith(('#', '@'))]

    # Build co-occurrence matrix
    for i in range(len(terms_only)-1):            
        for j in range(i+1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])                
            if w1 != w2:
                com[w1][w2] += 1
                
com_max = []
# For each term, look for the most common co-occurrent terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:5]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))

        # Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:20])

[(('a', 'christmas'), 2934), (('christmas', 'hard'), 2638), (('christmas', 'die'), 2617), (('die', 'hard'), 2230), (('christmas', 'is'), 2200), (('christmas', 'movie'), 2122), (('a', 'hard'), 2071), (('a', 'die'), 2044), (('a', 'is'), 2042), (('a', 'movie'), 2005), (('christmas', 'the'), 1952), (('is', 'movie'), 1619), (('hard', 'is'), 1597), (('die', 'is'), 1569), (('die', 'movie'), 1530), (('hard', 'movie'), 1528), (('hard', 'the'), 1343), (('die', 'the'), 1340), (('is', 'the'), 1190), (('movie', 'the'), 969)]


## Step 1: Semantic Orientation

As mentioned above, I will first be doing Peter Turney’s technique to perform sentiment analysis (I encourage you to read his paper for more information as my technique is a simplification of his technique). The basis of this technique classifies phrases based on its association with positive and negative words. This classification is defined as Semantic Orientation. To do this, we must first calculate how close a word is with positive and negative words such as “good” and “bad”. This measure of closeness is called the Pointwise Mutual Information (PMI), which is calculated as: 

PMI(t1, t2) = log((P(t1 ^ t2))/P(t1) - P(t2))

where t1 and t2 are terms. 

In order to computer P(t), the probability of observing the term t, and P(t1 ^ t2), the probability of observing terms t1 and t2 together), we can re-use the code used to calculate the term frequencies and term co-occurrences that we did above.

To calculate the probabilities, we will use the following equation:

P(t) = DF(t) / abs(D)

P(t1 ^ t2) = DF(t1 ^ t2) / abs(D)

where DF is Document Frequency of a term which is the number of time the term occurs in the document or tweets, in our case, and D is the set of documents/tweets.

In [10]:
# Import defaultdict
from collections import defaultdict

# from our shape argument, we see our total amount of tweets is 2,096
p_t = {}
p_t_com = defaultdict(lambda : defaultdict(int))

for term, n in count_all.items():
    p_t[term] = n/2096
    for t2 in com[term]:
        p_t_com[term][t2] = com[term][t2]/2096

Next, I will define words that will help determine whether a tweet has a positive vs negative opinion. 

In [11]:
positive_vocab = ['good', 'great', 'favorite', 'yes', 'right', 'great', 
                  'terrific', ':)', ':-)', 'endorse', 'is', 'agree', 'awesome', 'fantastic',
                 'best', 'better', 'correct', 'like', 'love', 'outstanding']

negative_vocab = ['isnt', 'terrible', 'isn\'t', 'not', 'bad', 'no', 'wrong', 
                  'disagree', 'dont', 'don\'t', 'worse', 'worst', 'hate'
                 ':(', ':-(']

Now, I will perform the calculation for PMI that we defined above.  I will also define the Semantic Orientation (SO) using the positive and negative vocabulary defined above.

In [12]:
import math

pmi = defaultdict(lambda : defaultdict(int))
for t1 in p_t:
    for t2 in com[t1]:
        denom = p_t[t1] * p_t[t2]
        pmi[t1][t2] = math.log2(p_t_com[t1][t2]/denom)

semantic_orientation = {}
for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc

Now that the semantic orientation has been defined, I will sort the words in the tweets based on negative and positive sentiment.  

In [13]:
semantic_sorted = sorted(semantic_orientation.items(),
                        key=operator.itemgetter(1),
                        reverse=True)
top_pos = semantic_sorted[:20]
top_neg = semantic_sorted[-20:]

print('Top Positive Terms:')
print(top_pos)

print('Top Negative Terms:')
print(top_neg)

Top Positive Terms:
[('4th', 22.958244518024493), ('affect', 22.806366566188657), ('church', 22.47055575806039), ('earth', 19.729368543140435), ('classics', 18.77542571172153), ('brazil', 17.81733768400966), ('bringing', 17.80636656618866), ('fam', 17.353632146724294), ('arsenal', 16.41773022530042), ('drag', 16.41773022530042), ('ep', 16.41773022530042), ('amy', 16.41773022530042), ('easter', 16.139265778897027), ('breaker', 15.81733768400966), ('150', 15.16526098742997), ('andor', 14.41773022530042), ('alternative', 14.386459574274337), ('deal', 14.06245018184619), ('beat', 13.477487681125035), ('done', 13.063433918157239)]
Top Negative Terms:
[('1976', -9.989533379808881), ('anything', -10.191921299593476), ('brave', -10.484704021416123), ('doublestandard', -10.484704021416123), ('important', -10.584089176945792), ('niggas', -10.97408886215505), ('inside', -11.446585653195857), ('howl', -11.446585653195857), ('either', -11.808158884648607), ('mariahs', -11.89316886677148), ('neighbo

As the top positive and negative do not really tell us much about whether people believe *Die Hard* is or is not a Christmas movie, let's look specifically at whether specific words relating to 'Die', 'Hard', and 'Christmas' have a positive or negative orientation. 

In [14]:
print('Orientation for Die: %f' % semantic_orientation['die'])
print('Orientation for Hard: %f' % semantic_orientation['hard'])
print('Orientation for Christmas: %f' % semantic_orientation['christmas'])

Orientation for Die: -0.021344
Orientation for Hard: 0.977816
Orientation for Christmas: 1.058212


As we can see, two of the three terms have strong positive orientations and the one with the negative orientation, “Die”, is fairly low, with less than 1%. We can also attribute the negative orientation to the fact that “Die” as a word itself has a negative connotation. 

Therefore, based on this information, we can conclude that most of the words related to these terms are mostly positive and therefore, have a positive opinion with regards to whether Die Hard is a Christmas movie.

## Step 2: TextBlob

Now that we’ve tried one technique, let’s try and confirm our conclusion by using the TextBlob library. The TextBlob library is used for sentiment analysis by classifying the polarity of a tweet. TextBlob is already trained and therefore easy to use. However, because the wording between positive and negative tweets are so similar, I will train the classifier by using the *textblob.classifiers* module to create custom classifiers.

First, I'll enter in some training data that I've extracted from my Twitter pull that I will mark as positive or negative.

In [15]:
train = [
    ('die hard is christmas movie', 'pos'), 
    ('die hard is christmas story', 'pos'),
    ('die hard is best christmas movie', 'pos'),
    ('die hard is favorite christmas movie', 'pos'), 
    ('die hard is ultimate christmas movie', 'pos'), 
    ('die hard is great christmas movie', 'pos'),   
    ('die hard isn’t christmas movie', 'neg'), 
    ('die hard not christmas movie', 'neg'), 
    ('die hard isnt christmas movie', 'neg'),
    ('i dont think die hard christmas movie', 'neg'),
    ('people who think die hard christmas movie are wrong', 'neg'),
    ('die hard still isn’t christmas movie it’s just set at christmas', 'neg')
]

Now that we have the train data, we can use the NaiveBayesClassifier to train our classifier. By using the NaiveBayesClassifier, we are overriding the default PatternAnalyzer that is typically used for sentiment analysis within TextBlob. In addition, we will see what informative features determine whether a sentiment is positive or negative. 

In [16]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

In [17]:
cl.show_informative_features(10)

Most Informative Features
         contains(think) = False             pos : neg    =      1.4 : 1.0
           contains(isn) = False             pos : neg    =      1.4 : 1.0
             contains(t) = False             pos : neg    =      1.4 : 1.0
             contains(’) = False             pos : neg    =      1.4 : 1.0
         contains(great) = False             neg : pos    =      1.2 : 1.0
             contains(s) = False             pos : neg    =      1.2 : 1.0
          contains(dont) = False             pos : neg    =      1.2 : 1.0
         contains(movie) = True              neg : pos    =      1.2 : 1.0
      contains(favorite) = False             neg : pos    =      1.2 : 1.0
           contains(are) = False             pos : neg    =      1.2 : 1.0


Above, we can see the different words that would indicate whether a tweet is positive or negative based on our train data.

Now, we will create a function to analye the sentiment of our tweets based on our classifier.

In [18]:
from textblob import TextBlob

def analyze_sentiment(tweet):
    analysis = TextBlob(tweet, classifier=cl)
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

Now that we have created the function, we can assign a sentiment to each of the tweets: 1 for positive, 0 for neutral, and -1 for negative.

In [19]:
df['SA'] = np.array([analyze_sentiment(tweet) for tweet in df['clean_tweet']])
df.head(25)

Unnamed: 0,Tweet,clean_tweet,SA
0,Terrible take. Die Hard is a Christmas movie a...,terrible take die hard is a christmas movie an...,-1
1,Uhhh die hard is a Christmas movie fight me,uhhh die hard is a christmas movie fight me,-1
2,Broke: Die Hard is a Christmas movie. Woke: E...,broke die hard is a christmas movie woke eyes ...,-1
3,Okay I'm like 97% Die Hard is a Christmas movi...,okay im like 97 die hard is a christmas movie ...,-1
4,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証 https://t.co/y...,「ダイ・ハード」がクリスマス映画かどうかを登場要素から徹底検証,0
5,"Okay, Darren, I concede. Even though Bruce Wil...",okay darren i concede even though bruce willis...,1
6,@FakeFakeNedYost If Die Hard is a Christmas mo...,if die hard is a christmas movie than die har...,-1
7,Im sad cause I forgot to watch Die Hard for Ch...,im sad cause i forgot to watch die hard for ch...,-1
8,"Well, I guess Die Hard *is* a Christmas tradit...",well i guess die hard is a christmas tradition...,-1
9,Perhaps then a Boxing Day movie? https://t.co/...,perhaps then a boxing day movie,0


Now that we have assigned a label for each tweet, let's classify them into positive, negative, or neutral tweets to find the percentages.

In [20]:
pos_tweets = [ tweet for index, tweet in enumerate(df['clean_tweet']) if df['SA'][index] > 0]
neu_tweets = [ tweet for index, tweet in enumerate(df['clean_tweet']) if df['SA'][index] == 0]
neg_tweets = [ tweet for index, tweet in enumerate(df['clean_tweet']) if df['SA'][index] < 0]

In [21]:
print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(df['clean_tweet'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(df['clean_tweet'])))
print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/len(df['clean_tweet'])))

Percentage of positive tweets: 36.00381497377206%
Percentage of neutral tweets: 3.4334763948497855%
Percentage of negative tweets: 60.56270863137816%


## Limitations

When looking at our tweets and the sentiment labels, there are some inaccuracies with the way some of the tweets are labelled. For example, “I guess Die Hard is a Christmas movie” was labelled as a negative tweet when it should be positive, while “Forget 'Die Hard,' 'Go' Is the Ultimate Anti-Christmas Movie” is labeled as negative as well, but should be neutral. 