# Sentiment Analysis with Twitter Data

In this case study, we will explore Twitter data used in the [paper](http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) by Go, Bhayani and Huang (2009). Our objective is to train a classifier and then use this classifier to determine whether a tweet is positive or negative. 

Go, Bhayani and Huang have automatically labeled the tweets as positive, negative or neutral. Their labeling algorithm is relatively simple: If a tweet includes a positive emoticon, then they label the tweet as positive. Likewise, if the tweet has a negative emoticon, then they label the tweet as negative. If a tweet does not include any emoticon, it is labeled as neutral.
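
The labeling rule described above can be sketched in a few lines. This is only an illustration: the emoticon lists and the function name `label_by_emoticon` are assumptions made for this example, not the lists or code used in the paper, and a tweet containing both kinds of emoticon is simply treated as positive here.

```python
# Hypothetical emoticon lists; the lists used in the paper may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return 'positive', 'negative' or 'neutral' based on emoticons alone."""
    if any(emoticon in tweet for emoticon in POSITIVE_EMOTICONS):
        return "positive"
    if any(emoticon in tweet for emoticon in NEGATIVE_EMOTICONS):
        return "negative"
    return "neutral"


print(label_by_emoticon("Just got tickets for the show :)"))
print(label_by_emoticon("Missed my train again :("))
print(label_by_emoticon("Reading about sentiment analysis"))
```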

From this data set, we have used only the positive and the negative tweets. These are stored in two files, "neg_tweets.txt" and "pos_tweets.txt." Here are the first lines of these files.

In [1]:
with open('pos_tweets.txt', 'r') as posfile, open('neg_tweets.txt', 'r') as negfile:
    print posfile.readline()
    print negfile.readline()

I LOVE @Health4UandPets u guys r the best!! "

@switchfoot http://twitpic.com/2y1zl - Awww



Tweets include mentions, which are marked by the '@' sign at the beginning of a word. These mentions do not help us infer the sentiment of a tweet. Similarly, links to websites or pictures start with the string 'http'. We shall exclude both types of words. Moreover, people often use shorthand such as 'u' instead of 'you.' As a crude way of filtering these out, we will consider only words that have three or more characters. Given a tweet, we can extract the words that qualify for analysis.

In [2]:
tweet = "I LOVE @Health4UandPets u guys r the best!! \""
usedwords = [word for word in tweet.split() if word[0] != '@' and \
             len(word) >= 3 and word[0:4] != 'http']
print usedwords

['LOVE', 'guys', 'the', 'best!!']


This is better. However, we still need to strip the punctuation marks. Luckily, Python strings have a `translate` method, and the `string` module provides the punctuation characters in `string.punctuation`. At the same time, we can convert the words to lowercase.

In [3]:
import string
usedwords = [word.translate(None, string.punctuation).lower() for word in tweet.split() \
             if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http']
print usedwords

['love', 'guys', 'the', 'best']
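
The cells in this case study are written for Python 2, where `translate(None, ...)` deletes the given characters. Under Python 3 that call signature no longer works, so a rough equivalent might look like the sketch below. The helper name `extract_words` is an assumption introduced for this example.

```python
import string

# Python 3 only: translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def extract_words(tweet):
    """Drop mentions, links and short tokens; strip punctuation; lowercase."""
    return [
        word.translate(_STRIP_PUNCT).lower()
        for word in tweet.split()
        if not word.startswith("@")
        and not word.startswith("http")
        and len(word) >= 3
    ]


print(extract_words('I LOVE @Health4UandPets u guys r the best!! "'))
# prints ['love', 'guys', 'the', 'best']
```

As in the original cell, the length check runs before the punctuation is stripped, so a token such as "oo!" still passes the filter and ends up as "oo".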


We will use the Naive Bayes classifier from the `nltk` package. This classifier is trained on a list of pairs, each consisting of a feature set and a label. Here, the feature set of a tweet is a dictionary that maps every word we extract from the tweet to `True`, and the label is either "positive" or "negative."
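
As a small illustration of that data structure, the sketch below builds one labeled entry from a list of extracted words. The helper name `to_labeled_featureset` is hypothetical and exists only for this example.

```python
def to_labeled_featureset(words, label):
    """Pair a {word: True} feature dictionary with its sentiment label."""
    features = {word: True for word in words}
    return (features, label)


example = to_labeled_featureset(["love", "guys", "the", "best"], "positive")
print(example)
```

Because the features live in a dictionary, a word that occurs several times in a tweet is recorded only once.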

We are now ready to read the data from the files.

In [4]:
tweets = []
# Negative tweets: build one {word: True} dictionary per tweet
with open('neg_tweets.txt','r') as infile:
    for line in infile:
        usedwords = []
        for word in line.split():
            # Skip mentions, short words and links
            if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http':
                wordnp = word.translate(None, string.punctuation)
                if not wordnp.isdigit(): # Skip pure numbers
                    usedwords.append(wordnp.lower())
        dictwords = dict([(word, True) for word in usedwords])
        if len(dictwords) > 0: # We omit empty tweets
            tweets.append((dictwords, 'negative'))
# Show the words and the features of the last negative tweet
print usedwords
print dictwords

# Positive tweets: same processing, different label
with open('pos_tweets.txt','r') as infile:
    for line in infile:
        usedwords = []
        for word in line.split():
            if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http':
                wordnp = word.translate(None, string.punctuation)
                if not wordnp.isdigit():
                    usedwords.append(wordnp.lower())
        dictwords = dict([(word, True) for word in usedwords])
        if len(dictwords) > 0: # We omit empty tweets
            tweets.append((dictwords, 'positive'))

['cant', 'belive', 'maddy', 'dead']
{'cant': True, 'dead': True, 'maddy': True, 'belive': True}


Here are two tweets in our desired data structure.

In [5]:
print dictwords
print tweets[0]
print tweets[-1]

{'oo': True, 'yeh': True, 'from': True, 'burytoes': True, 'haha': True, 'one': True, 'shud': True, 'breakfast': True, 'where': True, 'eat': True, 'day': True}
({'awww': True}, 'negative')
({'oo': True, 'yeh': True, 'from': True, 'burytoes': True, 'haha': True, 'one': True, 'shud': True, 'breakfast': True, 'where': True, 'eat': True, 'day': True}, 'positive')


Since we have appended the negative tweets followed by the positive tweets, we should shuffle them before we create our data sets for training and testing. Otherwise, the two sets would not contain a representative mix of labels.

In [6]:
import random
random.shuffle(tweets)
print tweets[0]
print tweets[-1]

({'real': True, 'beijing': True, 'good': True, 'for': True, 'photo': True, 'sexy': True, 'amp': True, 'you': True, 'girl': True, 'massage': True}, 'positive')
({'just': True, 'some': True, 'finally': True, 'finished': True, 'stuff': True, 'new': True}, 'positive')


We are ready to split our data set into two. We shall use a quarter of the tweets for testing and the rest will be used for training.

In [7]:
cutoffval = len(tweets)*3//4  # integer division: first three quarters for training
train_data = tweets[0:cutoffval]
test_data = tweets[cutoffval:]
print len(train_data)
print len(test_data)

482112
160704
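
The shuffle above is unseeded, so every run produces a different split and slightly different accuracy figures. One possible way to make the split repeatable is sketched below, assuming Python 3. The function name `shuffle_and_split` and its `seed` parameter are assumptions introduced for this example.

```python
import random


def shuffle_and_split(items, train_fraction=0.75, seed=0):
    """Return (train, test) after shuffling a copy of items reproducibly."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator, fixed seed
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


train, test = shuffle_and_split(range(8))
print(len(train), len(test))
# prints 6 2
```

Using a private `random.Random` instance leaves the global random state untouched, and shuffling a copy keeps the original list in its original order.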


Next, we import the Naive Bayes classifier as well as the utility function that computes accuracy. Finally, we can train and test our classifier.

In [8]:
from nltk.classify import NaiveBayesClassifier, util
import time
start = time.time()
classifier = NaiveBayesClassifier.train(train_data)
print 'Elapsed time:', time.time() - start
print 'Obtained Accuracy:', util.accuracy(classifier, test_data)
classifier.show_most_informative_features(n=20)

Elapsed time: 11.6920001507
Obtained Accuracy: 0.741375448029
Most Informative Features
                nauseous = True           negati : positi =     32.1 : 1.0
                  ughhhh = True           negati : positi =     30.8 : 1.0
                hayfever = True           negati : positi =     26.5 : 1.0
               toothache = True           negati : positi =     26.4 : 1.0
                  booooo = True           negati : positi =     25.6 : 1.0
                     ftl = True           negati : positi =     25.5 : 1.0
                  cancel = True           negati : positi =     24.8 : 1.0
                 injured = True           negati : positi =     24.5 : 1.0
                  gutted = True           negati : positi =     23.7 : 1.0
                     bom = True           positi : negati =     23.5 : 1.0
                stranded = True           negati : positi =     23.5 : 1.0
                illusion = True           positi : negati =     22.5 : 1.0
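
The accuracy reported above is the fraction of test tweets whose predicted label matches the stored label. The sketch below reproduces that idea without `nltk`; it is not the library's implementation, and `KeywordClassifier` is a toy stand-in invented for this example.

```python
def accuracy(classifier, labeled_featuresets):
    """Fraction of labeled feature sets whose predicted label is correct."""
    correct = sum(
        1 for features, label in labeled_featuresets
        if classifier.classify(features) == label
    )
    return correct / float(len(labeled_featuresets))


class KeywordClassifier(object):
    """Toy stand-in: predicts 'negative' only if the word 'ughhhh' occurs."""

    def classify(self, features):
        return "negative" if "ughhhh" in features else "positive"


toy_test_data = [
    ({"ughhhh": True, "monday": True}, "negative"),
    ({"love": True, "guys": True}, "positive"),
    ({"toothache": True}, "negative"),
    ({"best": True, "day": True}, "positive"),
]
print(accuracy(KeywordClassifier(), toy_test_data))
# prints 0.75
```

The trained NLTK classifier exposes the same `classify` method, so labeling a new tweet amounts to building its feature dictionary and passing it to `classifier.classify`.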
            

With a few simple steps, we have obtained about 74% accuracy. Note that we have neither excluded the frequently used words (the, are, for, etc.) nor removed the stop words (because, but, and, etc.). It seems that with a little more effort, the accuracy of our analysis can be improved. Furthermore, it may be worthwhile to try alternative classification methods like **Maximum Entropy** and **SVM**.
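
As one example of such extra effort, stop words could be filtered out of each feature dictionary before training. The sketch below uses a deliberately tiny, hand-written stop list that is an assumption for this example; a fuller list is available from `nltk.corpus.stopwords` once the corpus has been downloaded. Whether this step actually improves the accuracy on this data set has not been tested here.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset(["the", "are", "for", "and", "but", "because"])


def remove_stop_words(features):
    """Return a copy of a {word: True} dictionary without stop words."""
    return {word: value for word, value in features.items()
            if word not in STOP_WORDS}


features = {"love": True, "guys": True, "the": True, "best": True}
print(remove_stop_words(features))
```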

It is also possible to use scikit-learn classifiers through the **SklearnClassifier** wrapper in `nltk.classify.scikitlearn`.

In [9]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from nltk.classify.scikitlearn import SklearnClassifier
classify = SklearnClassifier(MultinomialNB())
classify2 = SklearnClassifier(BernoulliNB())
start = time.time()
classify.train(train_data)
print 'Elapsed time:', time.time() - start
print 'Obtained Accuracy:', util.accuracy(classify, test_data)

start = time.time()
classify2.train(train_data)
print 'Elapsed time:', time.time() - start
print 'Obtained Accuracy:', util.accuracy(classify2, test_data)

Elapsed time: 7.57499980927
Obtained Accuracy: 0.75596127041
Elapsed time: 6.59500002861
Obtained Accuracy: 0.75728668857
