# Sentiment Analysis for Twitter

## Overview

This tutorial introduces some simple tools for detecting sentiment in Tweets. We will be using a set of tools called the Natural Language Toolkit ([NLTK](http://www.nltk.org)). This is a collection of software written in the [Python](https://www.python.org) programming language. An important design goal behind Python is that it should be easy to read and fun to use, which makes it well suited to beginners. A similar motivation inspired NLTK: it should make complex tasks easy to carry out, and it should be written in a way that allows users to inspect and understand the code.

Why is this relevant? Well, a lot of software these days is built to be easy to use, but hard to inspect. For example, smartphones have a lot of slick apps on them, but very few people have the expertise to look under the hood to find out how they work. NLTK has quite the opposite approach: you are actively encouraged to discover how the code works. However, your level of understanding will depend heavily on how far you get to grips with Python itself.

This tutorial is written using the [IPython](http://ipython.org) framework. This allows text to be interspersed with fragments of code, occurring in special "cells". Just below is a cell where we are using Python to do a simple calculation:

In [1]:
3 + 4

7

Some of the cells will contain snippets of code that are necessary for the big story to work, but which you don't need to understand. We'll try to make it clear when it's important for you to pay attention to one of the cells. 

## Twitter

As you know, people are tweeting all the time. The rate varies, with about 6,000 per second being the average, but when I last checked, the [rate was over 10,000 Tweets per second](http://www.internetlivestats.com/one-second/#tweets-band). So, a lot. Twitter kindly allows people to tap into a small sample of this stream; unless you're able to pay, the sample is at most 1% of the total stream.

Here's a tiny snapshot of Tweets, taken at 13:00 on Sunday 11 November 2015. By using the keywords `'love, hate'`, we restrict our sample to just those Tweets containing one or both of those words.

In [2]:
import nltk # load up the NLTK library
from nltk.twitter import Twitter
tw = Twitter() # start a new client that connects to Twitter
tw.tweets(keywords='love, hate', limit=25) #filter Tweets from the public stream

RT @dollsfromheaven: Give your child the gift of a Therese doll. Help your child start a love for her “little way https://t.co/wkgUcSFIUv h…
RT @WorldStarFunny: Retweet if you love your grandmother ❤ https://t.co/zTIUgl9UyE
RT @chancetherapper: Black Women are soooo beautiful. I love your skin, I love your hair, I love your shape. nothin like it
@Soorajpancholi9 love u soo much birthday boy..may u hv many many many more #HappyBirthdaySoorajPancholi
RT @andrea_arias_: I love my bed omg
Major #SRKSalman love today 😍😍 Hopefully a movie together soon😘😘
RT @CharlotteHawkns: Such terribly sad news @rivooh has lost his fight with #MND. My heart goes out to you @DavinaRivers so much love xx ht…
RT @asvpxrocky: IF U LOVE ART  THEN EXPRESS IT EVERYDAY , ALL YOU YOUNG NIGGAS ARE GOING TO RULE THIS GODFORSAKEN WORLD TOMORROW , PLZ MAKE…
#BiharResults  Grand alliance178,NDA:58,Ohters:7 a grand #Chmaat to violence and hate politics
@henderfabelous this is very sweet!! Tysm love 💞
"Can't Help Falling

## Using a Twitter corpus

You too can sample Tweets in this way, but you'll need to set up your Twitter API keys according to [these instructions](http://www.nltk.org/howto/twitter.html). Since this is a bit of a hassle, for the rest of this tutorial we'll focus our attention on a sample of 20,000 Tweets that were collected at the end of April 2015. In order to focus on Tweets about the UK general election, the public stream was filtered with the following set of terms:
```
david cameron, miliband, milliband, sturgeon, clegg, farage, tory, tories, ukip, snp, libdem
```
The following code cell allows us to get hold of this collection, and prints out the text of the first 20. You don't need to worry about the details of how this happens.

In [3]:
from nltk.corpus import twitter_samples
strings = twitter_samples.strings('tweets.20150430-223406.json')
for string in strings[:20]:
    print(string)

RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY
RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…
RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1
RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…
RT @Nigel_Farage: Make sure you tune in to #AskNigelFarage tonight on BBC 1 at 22:50! #UKIP http://t.co/ogHSc2Rsr2
RT @joannetallis: Ed Milliband is an embarrassment. Would you want him representing the UK?!  #bbcqt vote @Conservatives
RT @abstex: The FT is backing the Tories. On an unrelated note, here's a photo of FT leader writer Jonathan Ford (next to Boris) http://t.c…
RT @NivenJ1: “@George_Osborne: Ed Mi

## Sentiment Analysis

When we talk about understanding natural language, we often focus on
'who did what to whom'. Yet in many situations, we are more interested
in attitudes and opinions. When someone writes about a movie, did they
like it or hate it? Is a product review for a water bottle on Amazon
positive or negative? Is this Tweet about the US President supportive
or critical? We might also care about the intensity of the views
expressed: `this is a fine movie`:lx: is different from `WOW!
This movie is soooooo great!!!!`:lx: even though both are
positive.

`Sentiment analysis`:dt: (or `opinion mining`:dt:) is a broad term for
a range of techniques that try to identify the subjective views
expressed in texts. Many organisations care deeply about public
opinion |mdash| whether this concerns commercial products, creative
works, or political parties and policies |mdash| and have consequently
turned to sentiment analysis as a way of gleaning valuable insights
from voluminous bodies of online text. This in turn has stimulated
much activity in the area, ranging from academic research to
commercial applications and industry-focussed conferences.

We'll look at sentiment analysis in more detail later in the book [NB
XREF]. For the time being, let's say that our task is to classify a
sentence into one of three categories: positive, negative or
neutral. Each of these can be illustrated by posts on Twitter collected during
the UK General Election in 2015.

.. _ex-twitter-sa1:
.. ex::
   .. ex:: [positive] Good stuff from Clegg. Clear, passionate & \
           honest about the difficulties of govt but also the difference @LibDems have made. 
   .. ex:: [negative] Hmm. Ed Miliband being against SNP is a bad move \
           I think. It'll cost him n it is a dumb choice.
   .. ex:: [neutral] Why is Ed Milliband trending when him name is Ed Miliband?

The easiest approach to classifying examples like these is to get hold
of two lists of words, positive ones such as `good`:lx:,
`excellent`:lx:, `fine`:lx:, `triumph`:lx:, `well`:lx:, `succeed`:lx:
|DOTS| and negative ones such as `bad`:lx:, `poor`:lx:, `dismal`:lx:,
`lying`:lx:, `fail`:lx:, `disaster`:lx: |DOTS|. We then base our
polarity assignment on the ratio of positive tokens to negative ones
in a given string. A string with neither positive nor negative tokens
(or possibly an equal number of each) will be categorised as
neutral. This simple approach is likely to yield the intuitively
correct results for ex-twitter-sa1_.
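
The word-list approach can be sketched in a few lines of Python. This is a minimal illustration, not a real lexicon: the two sets contain only the example words listed above, the tokeniser is a simple regular expression, and the function name `lexicon_polarity` is our own invention. Instead of computing a ratio, it compares the two counts, which leads to the same three-way decision.

```python
import re

# Tiny illustrative word lists; a real lexicon would be far larger.
POSITIVE = {"good", "excellent", "fine", "triumph", "well", "succeed"}
NEGATIVE = {"bad", "poor", "dismal", "lying", "fail", "disaster"}

def lexicon_polarity(text):
    """Label text as positive, negative or neutral by counting lexicon hits."""
    # Crude tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # No hits at all, or an equal number of each.
    return "neutral"

print(lexicon_polarity("Good stuff from Clegg."))
print(lexicon_polarity("It'll cost him n it is a bad move I think."))
print(lexicon_polarity("Why is Ed Milliband trending?"))
```

Because every token counts equally, this sketch has no notion of intensity and no way of handling negation.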

Things become more complicated when negation enters into the
picture. ex-twitter-neg_ is mildly positive (at least in British
English), so we need to ensure that `not`:lx: flips the
polarity of `bad`:lx: in appropriate contexts.

.. _ex-twitter-neg:

.. ex::  Given Miliband personal ratings still 20 points behind \
         Cameron, I'd say that not a bad margin for Labour leader  https://t.co/ILQP93VYLF
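
One shallow way of handling this kind of negation, sketched here under simplifying assumptions, is to flip the polarity of a sentiment word whenever a negator occurs within the few tokens before it. The word lists, the window of three tokens, and the names `is_negator` and `negation_aware_score` are all illustrative choices and not part of NLTK.

```python
import re

# Illustrative word lists only.
POSITIVE = {"good", "excellent", "fine"}
NEGATIVE = {"bad", "poor", "dismal"}
NEGATORS = {"not", "no", "never"}

def is_negator(token):
    """Treat 'not', 'no', 'never' and contractions such as "isn't" as negators."""
    return token in NEGATORS or token.endswith("n't")

def negation_aware_score(text, window=3):
    """Sum word polarities, flipping any word preceded closely by a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue
        # Look back at most `window` tokens for a negator.
        if any(is_negator(t) for t in tokens[max(0, i - window):i]):
            value = -value
        score += value
    return score

print(negation_aware_score("I'd say that not a bad margin for Labour leader"))
print(negation_aware_score("a bad move I think"))
```

A positive score is read as positive polarity and a negative score as negative. The fixed window is a blunt instrument: it will also flip words that happen to follow a negator without being in its scope.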

Relatively 'shallow' techniques can deal fairly effectively
with the way in which the prior polarity of a word is modified by the
contextual effects of negation and other semantic operators.
Nevertheless it's not hard to find examples where something close to
full natural language understanding is required to determine the
correct polarity.

.. _ex-twitter-sa2:
.. ex::
   .. ex:: David Cameron doesn't seem to have done too badly until \
           now. Otherwise #milifandom and #cleggers would be attacking \
           him for these bad things 
   .. ex:: Even though I don't like UKIP I'm hating them less and less \
           every day, they do actually have very some good policies 

This has led some researchers to develop approaches where syntactic
structure is also factored into sentiment analysis.

A further challenge in sentiment analysis is deciding the right level
of granularity for the topic under discussion. Often, we can agree on the overall polarity of a
sentence (or even of larger texts) because there is a single dominant topic. But in a list-like construction
such as ex-twitter-sa3_, different sentiments are associated with different
entities, and there is no sensible way of aggregating this into a combined
polarity score for the text as a whole:

.. _ex-twitter-sa3:
.. ex::  @hugorifkind Audience - good. Mili - bad. Clegg - a bit sad. Cam - unscathed
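
For a list-like Tweet of this kind, one option is to record a separate judgement for each entity instead of a single score. The sketch below assumes the very regular `entity - judgement.` pattern of this particular example, with the leading mention removed by hand. The function name `entity_judgements` is hypothetical, and real Tweets would need proper entity recognition.

```python
def entity_judgements(text):
    """Split 'Entity - judgement.' clauses into an entity-to-judgement mapping."""
    judgements = {}
    for clause in text.split("."):
        # Only clauses containing the ' - ' separator are treated as pairs.
        if " - " in clause:
            entity, judgement = clause.split(" - ", 1)
            judgements[entity.strip()] = judgement.strip()
    return judgements

tweet = "Audience - good. Mili - bad. Clegg - a bit sad. Cam - unscathed"
for entity, judgement in entity_judgements(tweet).items():
    print(entity, "->", judgement)
```

Each judgement could then be scored separately, so that the positive view of the audience does not cancel out the negative view of Miliband.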

Finally, our current approaches to language processing struggle with
sarcasm, irony and satire, which again lead to polarity reversals.

.. _ex-twitter-sa4:
.. ex::
   .. ex:: LOVE being sat on a plane for 4 hours after a 10 hour flight !! Soooo fun !
   .. ex:: The wrong spelling of Ed Miliband is trending, but not the \
           correct one. Good job, Britain.



It's clear that many of the people responsible for the Tweets shown above have strong opinions. We will use the term `sentiment analysis` to refer to the process of using software to figure out what these opinions actually are.

Although sentiment analysis is designed to work with written text, the way in which people express their feelings often goes far beyond what they literally say. In spoken language, intonation is also important. And of course we often express emotion using no words at all, as illustrated in this picture from Darwin's book *The Expression of the Emotions*.

<a title="By Charles Darwin (author of volume); unknown photographer of plate [Public domain], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3APlate_depicting_emotions_of_grief_from_Charles_Darwin's_book_The_Expression_of_the_Emotions.jpg"><img align="left" width="512" alt="Plate depicting emotions of grief from Charles Darwin's book The Expression of the Emotions" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Plate_depicting_emotions_of_grief_from_Charles_Darwin%27s_book_The_Expression_of_the_Emotions.jpg/512px-Plate_depicting_emotions_of_grief_from_Charles_Darwin%27s_book_The_Expression_of_the_Emotions.jpg"/></a>



In [4]:
import pandas as pd
data = pd.DataFrame()

In [5]:
full_tweets = twitter_samples.docs('tweets.20150430-223406.json')


In [19]:
data['text'] = [t['text'] for t in full_tweets]

In [27]:
parties = {}
parties['conservative'] = set(['osborne', 'portillo', 'pickles', 'tory', 'tories',
                                'torie', 'voteconservative', 'conservative', 'conservatives', 'bullingdon', 'telegraph'])
parties['labour'] = set(['uklabour', 'scottishlabour', 'labour', 'lab', 'murphy'])
parties['libdem'] = set(['libdem', 'libdems', 'dems', 'alexander'])
parties['ukip'] = set(['ukip', 'davidcoburnukip'])
parties['snp'] = set(['salmond', 'snp', 'snpwin', 'votesnp', 'snpbecause', 'scotland',
                       'scotlands', 'scottish', 'indyref', 'independence', 'celebs4indy'])

leaders = {}
leaders['cameron'] = set(['cameron', 'david_cameron', 'davidcameron','dave', 'davecamm'])
leaders['miliband'] = set(['miliband', 'ed_miliband', 'edmiliband', 'edm', 'milliband', 'ed', 'edforchange', 'edforpm', 'milifandom'])
leaders['clegg'] = set(['clegg'])
leaders['farage'] = set(['farage', 'nigel_farage', 'nigel', 'askfarage', 'asknigelfarage', 'asknigelfar'])
leaders['sturgeon'] = set(['sturgeon', 'nicola_sturgeon', 'nicolasturgeon', 'nicola'])

from nltk.tokenize import wordpunct_tokenize

def tweet_classify(text, keywords):
    """Return the label whose keyword set shares the most tokens with `text`."""
    # Lowercase the tokens so that matching is case-insensitive.
    toks_lower = {t.lower() for t in wordpunct_tokenize(text)}
    # Count how many keywords of each label occur in the Tweet.
    counts = {label: len(words & toks_lower) for label, words in keywords.items()}
    # On a tie (including the case where nothing matches), max() returns the
    # first label in the dictionary's iteration order, so every Tweet gets a label.
    return max(counts, key=counts.get)

data['party'] = [tweet_classify(text, parties) for text in data['text']]
data['leader'] = [tweet_classify(text, leaders) for text in data['text']]
data

|   | text | party | sentiment | leader |
|---|------|-------|-----------|--------|
| 0 | RT @KirkKus: Indirect cost of the UK being in ... | ukip | 0.0000 | farage |
| 1 | VIDEO: Sturgeon on post-election deals http://... | conservative | 0.0000 | sturgeon |
| 2 | RT @LabourEoin: The economy was growing 3 time... | conservative | 0.1779 | cameron |
| 3 | RT @GregLauder: the UKIP east lothian candidat... | ukip | 0.0000 | farage |
| 4 | RT @thesundaypeople: UKIP's housing spokesman ... | ukip | 0.4588 | farage |
| 5 | RT @Nigel_Farage: Make sure you tune in to #As... | ukip | 0.3802 | farage |
| 6 | RT @joannetallis: Ed Milliband is an embarrass... | conservative | -0.4389 | miliband |
| 7 | RT @abstex: The FT is backing the Tories. On a... | conservative | 0.0258 | farage |
| 8 | RT @NivenJ1: “@George_Osborne: Ed Miliband pro... | conservative | 0.0000 | miliband |
| 9 | LOLZ to Trickle Down Wealth. It's never trickl... | conservative | 0.8167 | farage |
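
In the table above, row 1 is labelled `conservative` even though its text does not appear to contain any of the party keywords: when every count is zero, `max` simply returns the first label it encounters. A hedged variant that returns `None` for unmatched Tweets might look like the sketch below. To keep it self-contained, it uses a regular expression in place of `wordpunct_tokenize` and a cut-down, hypothetical keyword dictionary; `classify_or_none` and `demo_parties` are not part of the original notebook.

```python
import re

def classify_or_none(text, keywords):
    """Return the best-matching label, or None if no keyword occurs in text."""
    # Words and punctuation runs, lowercased (an approximation of wordpunct_tokenize).
    tokens = {t.lower() for t in re.findall(r"\w+|[^\w\s]+", text)}
    counts = {label: len(words & tokens) for label, words in keywords.items()}
    best = max(counts, key=counts.get)
    # Only accept the winner if at least one keyword actually matched.
    return best if counts[best] > 0 else None

# Hypothetical, shortened keyword sets for illustration only.
demo_parties = {
    "conservative": {"tory", "tories", "conservatives"},
    "ukip": {"ukip"},
    "snp": {"snp", "scotland"},
}

print(classify_or_none("VIDEO: Sturgeon on post-election deals", demo_parties))
print(classify_or_none("The FT is backing the Tories.", demo_parties))
```

With labels assigned this way, unmatched Tweets would no longer inflate the count of whichever label happens to come first. Rows with a missing label are normally left out of a pandas `groupby` by default.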


In [10]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
data['sentiment'] = [sia.polarity_scores(row['text'])['compound'] for index, row in data.iterrows()]

In [35]:
grouped_party = data['sentiment'].groupby(data['party'])
grouped_party.mean().sort_values(ascending=False)

party
ukip            0.103774
conservative    0.074607
labour          0.020996
libdem         -0.001636
snp            -0.013133
Name: sentiment, dtype: float64

In [34]:
grouped_party.count().sort_values(ascending=False)

party
conservative    12367
snp              2995
ukip             2700
labour           1826
libdem            112
dtype: int64

In [26]:
grouped_party.min()

party
conservative   -0.9821
labour         -0.9274
libdem         -0.8819
snp            -0.9538
ukip           -0.9435
Name: sentiment, dtype: float64

In [36]:
grouped_leader = data['sentiment'].groupby(data['leader'])
grouped_leader.mean().sort_values(ascending=False)

leader
clegg       0.110223
miliband    0.082646
cameron     0.073460
sturgeon    0.041736
farage      0.040028
Name: sentiment, dtype: float64

In [37]:
grouped_leader.count().sort_values(ascending=False)

leader
farage      9949
miliband    6476
cameron     2251
clegg        695
sturgeon     629
dtype: int64

In [None]:
input_file = twitter_samples.abspath("tweets.20150430-223406.json")

In [None]:
from nltk.twitter.util import json2csv
with open(input_file) as fp:
    json2csv(fp, 'tweets_text.csv', ['text'])

In [None]:
import pandas as pd
tweets = pd.read_csv('tweets_text.csv', header=0, encoding="utf8")
tweets.head(20)

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer