# Independent Study - Week 4 - James Quacinella

## Preparation

In [1]:
import nltk
import pickle
import prettytable
from collections import defaultdict
import textblob

%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Sentiment Analysis

In this section I will do some sentiment analysis in two ways: with a list of positive and negative words, and with a new Python module called TextBlob.

### Word Lists

After doing some research, I found this [list of positive and negative words](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), AFINN, intended for tagging microblogs such as Twitter. Each line of the file holds a word and an integer score separated by a tab.

In [4]:
import math
import re
 
# AFINN-111 is as of June 2011 the most recent version of AFINN
filenameAFINN = 'AFINN/AFINN-111.txt'
# Each line is "<word>\t<score>"; build a word -> integer score lookup
with open(filenameAFINN) as afinn_file:
    afinn = dict((w, int(s)) for w, s in
                 (ws.strip().split('\t') for ws in afinn_file))
 
# Word splitter pattern
pattern_split = re.compile(r"\W+")
 
def sentimentAFINN(text):
    """
    Returns a float for sentiment strength based on the input text.
    Positive values indicate positive valence, negative values negative valence.
    """
    words = pattern_split.split(text.lower())
    # Words that are not in AFINN contribute a score of 0
    sentiments = [afinn.get(word, 0) for word in words]
    if sentiments:
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        sentiment = float(sum(sentiments)) / math.sqrt(len(sentiments))
        
    else:
        sentiment = 0
    return sentiment

def sentimentDisplayValue(sentimentScore):
    if sentimentScore > 0.1:
        return "Positive" 
    elif sentimentScore < -0.1:
        return "Negative"
    else:
        return "Neutral"
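
The `sqrt(N)` weighting in `sentimentAFINN` is easier to judge next to the two alternatives its comment mentions. Below is a minimal sketch, assuming a made-up three-word lexicon (`toy_lexicon`) instead of the real AFINN file; `score_text` and its `weighting` parameter are hypothetical names introduced only for this illustration.

```python
import math
import re

# Hypothetical lexicon in the AFINN style (word -> integer valence).
# These entries are illustrative assumptions, not read from AFINN-111.txt.
toy_lexicon = {"good": 3, "bad": -3, "threat": -2}

token_pattern = re.compile(r"\W+")

def score_text(text, lexicon, weighting="sqrt"):
    """Sum the word valences, then divide by N, sqrt(N), or 1."""
    words = token_pattern.split(text.lower())
    total = float(sum(lexicon.get(word, 0) for word in words))
    n = len(words)  # re.split always returns at least one element
    if weighting == "n":
        return total / n
    if weighting == "sqrt":
        return total / math.sqrt(n)
    return total  # "none": the raw sum, i.e. a weight of 1

sample_text = "a good day with one bad threat and more"
for weighting in ("none", "sqrt", "n"):
    print(weighting, score_text(sample_text, toy_lexicon, weighting))
```

The sample sentence has nine tokens and a raw valence sum of -2, so the three weightings give -2, roughly -0.67, and roughly -0.22. Dividing by N damps long texts more strongly than dividing by sqrt(N).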

Now let's load our tweets and see if we can use the sentiment function to classify them:

In [5]:
allStatuses = pickle.load( open( "../Week2/allStatuses", "rb" ) )

# Create a pretty table of tweet contents and sentiment
pt = prettytable.PrettyTable(["Tweet Status", "Sentiment Score", "Sentiment"])
pt.align["Tweet Status"] = "l" 
pt.align["Sentiment Score"] = "l" 
pt.max_width = 60 
pt.padding_width = 1 # One space between column edges and contents (default)

totals = defaultdict(int)

for status in allStatuses['fairmediawatch']:
    sentimentScore = sentimentAFINN(status.text)
    sentimentDisplay = sentimentDisplayValue(sentimentScore)
    totals[sentimentDisplay] = totals[sentimentDisplay] + 1
    pt.add_row([status.text, sentimentScore,  sentimentDisplay])

# Let's see the results!
print(pt)
print(totals)

+--------------------------------------------------------------+-----------------+-----------+
| Tweet Status                                                 | Sentiment Score | Sentiment |
+--------------------------------------------------------------+-----------------+-----------+
| That most US terrorists aren't Muslim "may come as a         | 0.0             |  Neutral  |
| surprise"--especially if you rely on corporate media.        |                 |           |
| http://t.co/J5bn1tQzRY                                       |                 |           |
| Baltimore "gang threat" swallowed by media was found to be   | -0.4472135955   |  Negative |
| "non-credible" by FBI. @Vice @AdamJohnsonNYC                 |                 |           |
| http://t.co/4kZSXwnRka                                       |                 |           |
| Downplaying right's role in radicalizing Dylann Roof,        | 0.0             |  Neutral  |
| corporate media preferred to pathologize him.   

In [28]:
# Create a pretty table of tweet contents and sentiment
pt = prettytable.PrettyTable(["Tweet Status", "Sentiment Score", "Sentiment"])
pt.align["Tweet Status"] = "l" 
pt.align["Sentiment Score"] = "l" 
pt.max_width = 60 
pt.padding_width = 1 # One space between column edges and contents (default)

totals = defaultdict(int)

for status in allStatuses['AccuracyInMedia']:
    sentimentScore = sentimentAFINN(status.text)
    sentimentDisplay = sentimentDisplayValue(sentimentScore)
    totals[sentimentDisplay] = totals[sentimentDisplay] + 1
    pt.add_row([status.text, sentimentScore,  sentimentDisplay])

# Let's see the results!
print(pt)
print(totals)

+--------------------------------------------------------------+-----------------+-----------+
| Tweet Status                                                 | Sentiment Score | Sentiment |
+--------------------------------------------------------------+-----------------+-----------+
| Rachel Dolezal's contract goes without renewal at Eastern    | 0.0             |  Neutral  |
| Washington University http://t.co/V1RjTtX9tM #tcot           |                 |           |
| In a 6-3 Vote, #SCOTUS upholds ObamaCare (again) in          | 0.0             |  Neutral  |
| #KingvBurwell decision. What's your take on the decision?    |                 |           |
| http://t.co/W33sy0GzgQ #tcot                                 |                 |           |
| IRS Shelled out $18.8 Million in Contracts to Contractors    | 0.0             |  Neutral  |
| who had Unpaid Back Taxes http://t.co/BBCN5KgJEY #tcot       |                 |           |
| Study: 5% of Colleges Protect First Amendment Ri

As you can see, with this simple method most tweets from both accounts are classified as negative.
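
A statement about "most tweets" is easiest to check when the label counts are turned into shares. The sketch below is one way to do that, assuming the same +/-0.1 cut-offs as `sentimentDisplayValue`; `label`, `label_shares`, and the scores in `example_scores` are hypothetical and are not results from the tweet data.

```python
from collections import Counter

LABELS = ("Positive", "Neutral", "Negative")

def label(score, threshold=0.1):
    """Map a score to a label using strict comparisons, as in the cell above."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"

def label_shares(scores, threshold=0.1):
    """Return the fraction of scores that fall under each label."""
    counts = Counter(label(score, threshold) for score in scores)
    total = float(sum(counts.values()))
    if total == 0:
        return dict((name, 0.0) for name in LABELS)
    return dict((name, counts[name] / total) for name in LABELS)

# Made-up scores for illustration only
example_scores = [0.0, -0.45, 0.0, 0.5, -1.2]
print(label_shares(example_scores))
```

Passing the list of scores for one account to `label_shares` would give three fractions that can be compared directly between accounts, even when the accounts have different numbers of tweets.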

### Sentiment with TextBlob

Let's do the same thing, this time using the default classifier in TextBlob:

In [11]:
# Create a pretty table of tweet contents and sentiment
pt = prettytable.PrettyTable(["Tweet Status", "Sentiment Score", "Sentiment"])
pt.align["Tweet Status"] = "l" 
pt.align["Sentiment Score"] = "l" 
pt.max_width = 60 
pt.padding_width = 1 # One space between column edges and contents (default)

totals = defaultdict(int)

for status in allStatuses['fairmediawatch']:
    blob = textblob.TextBlob(status.text)
    sentimentScore = sum([sentence.sentiment.polarity for sentence in blob.sentences])
    
    sentimentDisplay = sentimentDisplayValue(sentimentScore)
    totals[sentimentDisplay] = totals[sentimentDisplay] + 1
    pt.add_row([status.text, sentimentScore,  sentimentDisplay])

# Let's see the results!
print(pt)
print(totals)

+--------------------------------------------------------------+------------------+-----------+
| Tweet Status                                                 | Sentiment Score  | Sentiment |
+--------------------------------------------------------------+------------------+-----------+
| That most US terrorists aren't Muslim "may come as a         | 0.166666666667   |  Positive |
| surprise"--especially if you rely on corporate media.        |                  |           |
| http://t.co/J5bn1tQzRY                                       |                  |           |
| Baltimore "gang threat" swallowed by media was found to be   | 0.0              |  Neutral  |
| "non-credible" by FBI. @Vice @AdamJohnsonNYC                 |                  |           |
| http://t.co/4kZSXwnRka                                       |                  |           |
| Downplaying right's role in radicalizing Dylann Roof,        | 0.142857142857   |  Positive |
| corporate media preferred to pathologi

In [10]:
# Create a pretty table of tweet contents and sentiment
pt = prettytable.PrettyTable(["Tweet Status", "Sentiment Score", "Sentiment"])
pt.align["Tweet Status"] = "l" 
pt.align["Sentiment Score"] = "l" 
pt.max_width = 60 
pt.padding_width = 1 # One space between column edges and contents (default)

totals = defaultdict(int)

for status in allStatuses['AccuracyInMedia']:
    blob = textblob.TextBlob(status.text)
    sentimentScore = sum([sentence.sentiment.polarity for sentence in blob.sentences])
    
    sentimentDisplay = sentimentDisplayValue(sentimentScore)
    totals[sentimentDisplay] = totals[sentimentDisplay] + 1
    pt.add_row([status.text, sentimentScore,  sentimentDisplay])

# Let's see the results!
print(pt)
print(totals)

+--------------------------------------------------------------+------------------+-----------+
| Tweet Status                                                 | Sentiment Score  | Sentiment |
+--------------------------------------------------------------+------------------+-----------+
| Rachel Dolezal's contract goes without renewal at Eastern    | 0.0              |  Neutral  |
| Washington University http://t.co/V1RjTtX9tM #tcot           |                  |           |
| In a 6-3 Vote, #SCOTUS upholds ObamaCare (again) in          | 0.0              |  Neutral  |
| #KingvBurwell decision. What's your take on the decision?    |                  |           |
| http://t.co/W33sy0GzgQ #tcot                                 |                  |           |
| IRS Shelled out $18.8 Million in Contracts to Contractors    | 0.1              |  Neutral  |
| who had Unpaid Back Taxes http://t.co/BBCN5KgJEY #tcot       |                  |           |
| Study: 5% of Colleges Protect First Am

This algorithm is a bit more biased toward neutral sentiment. This might be because we are dealing with microblog data rather than the kind of data it might have been trained on.
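
One factor that could contribute to the difference (not verified against these tweets) is that the same +/-0.1 cut-off is applied to two different scales. AFINN valences are integers, so under the `sqrt(N)` weighting the smallest non-zero score a text of N tokens can get is 1/sqrt(N), while TextBlob polarity is a float between -1.0 and 1.0. The sketch below only does this arithmetic; `smallest_nonzero_afinn_score` is a hypothetical helper name.

```python
import math

THRESHOLD = 0.1  # cut-off used by sentimentDisplayValue

def smallest_nonzero_afinn_score(n_tokens):
    """Smallest non-zero |score| under sqrt(N) weighting: a net valence of 1."""
    return 1.0 / math.sqrt(n_tokens)

for n_tokens in (5, 20, 40, 99, 100):
    score = smallest_nonzero_afinn_score(n_tokens)
    # The last column shows whether the score clears the Neutral band
    print(n_tokens, round(score, 4), score > THRESHOLD)
```

For any text with fewer than 100 tokens, a non-zero net AFINN valence always lands outside the Neutral band, so the word-list method can only return Neutral when the valences sum to exactly zero. A TextBlob polarity, by contrast, can fall anywhere between -0.1 and 0.1.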