# Sentiment Analysis with Twitter Data

In this case study, we will explore Twitter data used in the [paper](http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) by Go, Bhayani and Huang (2009). Our objective is to train a classifier and then use this classifier to determine whether a tweet is positive or negative. 

Go, Bhayani and Huang have automatically labeled the tweets as positive, negative or neutral. Their labeling algorithm is relatively simple: If a tweet includes a positive emoticon, then they label the tweet as positive. Likewise, if the tweet has a negative emoticon, then they label the tweet as negative. If a tweet does not include any emoticon, it is labeled as neutral.
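
The labeling rule described above can be sketched in a few lines. This is only an illustration: the emoticon lists and the function name `label_tweet` are assumptions made for this example and are not necessarily the lists used in the paper. The text above also does not say what happens when a tweet has both kinds of emoticons, so the sketch marks such tweets as "mixed."

```python
# Illustrative emoticon lists (an assumption, not necessarily the paper's lists).
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_tweet(tweet):
    """Label a tweet by the emoticons it contains."""
    has_positive = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_negative = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_positive and has_negative:
        return "mixed"  # assumption: this case is not covered by the rule above
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"


for text in ["great game today :)", "missed my bus :(", "on my way home"]:
    print(label_tweet(text), "<-", text)
```

Because the labels come from emoticons rather than from human annotators, they are noisy, which is worth keeping in mind when reading the accuracy figures later on.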

From this data set, we have used only the positive and the negative tweets. These are stored in two files, "neg_tweets.txt" and "pos_tweets.txt." Here are the first lines of these files.

In [1]:
with open('../data/pos_tweets.txt', 'r') as posfile, open('../data/neg_tweets.txt', 'r') as negfile:
    # Print the first tweet of each file (Python 3 print function)
    print(posfile.readline())
    print(negfile.readline())

I LOVE @Health4UandPets u guys r the best!! "

@switchfoot http://twitpic.com/2y1zl - Awww



Tweets include mentions, which are marked by the '@' sign at the beginning of a word. These mentions do not help us infer the sentiment of a tweet. Similarly, links to websites or pictures start with the string 'http'. We shall exclude both types of words. Moreover, people often use shorthand such as 'u' instead of 'you.' Therefore, we will consider only words that have three or more characters. Given a tweet, we can extract the words that qualify for analysis.

In [2]:
tweet = "I LOVE @Health4UandPets u guys r the best!! \""
# Keep words of three or more characters that are neither mentions nor links
usedwords = [word for word in tweet.split()
             if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http']
print(usedwords)

['LOVE', 'guys', 'the', 'best!!']


This is better. However, we still need to strip the punctuation marks. Luckily, Python strings have a `translate` method that removes them when it is combined with the `string.punctuation` constant from the `string` module. At the same time, we can convert the words to lowercase.

In [3]:
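import string

# Translation table that deletes every ASCII punctuation character (Python 3)
nopunct = str.maketrans('', '', string.punctuation)
usedwords = [word.translate(nopunct).lower() for word in tweet.split()
             if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http']
print(usedwords)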

['love', 'guys', 'the', 'best']


We will use the Bernoulli Naive Bayes classifier from the **scikit-learn** package. This classifier is trained on feature vectors paired with labels. Here, each word that we extract from a tweet is a binary feature (the word is either present in the tweet or not), and the label of each tweet is either "positive" (1) or "negative" (0).
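
To make the model more concrete, here is a minimal from-scratch sketch of Bernoulli Naive Bayes with Laplace smoothing. It is meant only for intuition and is not how scikit-learn implements the classifier; the function names and the toy tweets are hypothetical.

```python
import math


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate class priors and per-class word-presence probabilities."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, presence = {}, {}
    for c in sorted(set(labels)):
        class_docs = [set(d.split()) for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Laplace smoothing: add alpha to both the "present" and "absent" counts.
        presence[c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
            for w in vocab
        }
    return vocab, log_prior, presence


def predict_bernoulli_nb(model, doc):
    """Return the class with the highest log posterior score."""
    vocab, log_prior, presence = model
    words = set(doc.split())
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in vocab:
            p = presence[c][w]
            # Absent words count as evidence too; this is the Bernoulli part.
            score += math.log(p if w in words else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy tweets: 1 = positive, 0 = negative
toy_tweets = ["love this song", "love the best day", "hate this traffic", "sad and tired"]
toy_labels = [1, 1, 0, 0]
toy_model = train_bernoulli_nb(toy_tweets, toy_labels)
print(predict_bernoulli_nb(toy_model, "love this day"))  # prints 1
print(predict_bernoulli_nb(toy_model, "so sad"))         # prints 0
```

Words that never appeared in the training tweets, such as "so" above, are simply ignored at prediction time.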

We are now ready to read the data from the files.

In [4]:
# Translation table that deletes every ASCII punctuation character (Python 3)
nopunct = str.maketrans('', '', string.punctuation)

def extract_words(line):
    """Return the cleaned, lowercased words of a tweet that qualify for analysis."""
    usedwords = []
    for word in line.split():
        if word[0] != '@' and len(word) >= 3 and word[0:4] != 'http':
            wordnp = word.translate(nopunct)
            if not wordnp.isdigit():  # We omit numbers
                usedwords.append(wordnp.lower())
    return usedwords

tweets = []
y = []
# Negative tweets are labeled 0 and are read first; positive tweets are labeled 1
for filename, label in [('../data/neg_tweets.txt', 0), ('../data/pos_tweets.txt', 1)]:
    with open(filename, 'r') as infile:
        for line in infile:
            usedwords = extract_words(line)
            if len(usedwords) > 0:  # We omit empty tweets
                tweets.append(' '.join(usedwords))
                y.append(label)

Here are two tweets in our desired data structure.

In [5]:
# First (negative) and last (positive) tweet with their labels
print(tweets[0], y[0])

print(tweets[-1], y[-1])

awww 0
breakfast burytoes from where oo haha shud eat burytoes one day oo yeh 1


Since we have appended the negative tweets followed by the positive tweets, our data set is ordered by label and must be shuffled before it is split into two. We shall use a quarter of the tweets for testing and the rest for training. **train_test_split** shuffles the tweets before creating our data sets for testing and training.
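
The following sketch shows why the shuffle matters when the data is ordered by label. It is a simplified stand-in written for this illustration, not the implementation of **train_test_split**; the function `shuffled_split` and the demo data are hypothetical.

```python
import random


def shuffled_split(items, labels, train_fraction=0.75, seed=0):
    """Shuffle items and labels together, then cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


# Demo data ordered like ours: all negatives first, then all positives.
demo_labels = [0] * 6 + [1] * 6
demo_items = ["tweet %d" % i for i in range(12)]

# Without shuffling, the last quarter contains only positive tweets.
print(demo_labels[9:])  # prints [1, 1, 1]

d_train, d_test, l_train, l_test = shuffled_split(demo_items, demo_labels)
print(len(d_train), len(d_test))  # prints 9 3
```

A test set taken from the unshuffled tail would contain a single class, so neither accuracy nor AUC would tell us anything useful about the classifier.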

Next, we import BernoulliNB, train the classifier, and check its accuracy and AUC on the test set. In order to use BernoulliNB, we must first create a document-term matrix consisting of 0s and 1s by using **CountVectorizer**.

In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
# train_test_split lives in model_selection; sklearn.cross_validation was removed
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X_train, X_test, y_train, y_test = train_test_split(tweets, y, train_size=0.75)

# We use stop_words='english' in order to remove stop words.
# In addition, we select binary=True to create a matrix of 0s and 1s.
cv = CountVectorizer(min_df=1, stop_words='english', binary=True)
cv_matrix = cv.fit_transform(X_train)
print(np.shape(cv_matrix))

bnb = BernoulliNB()

# This line is for training
bnb.fit(cv_matrix, y_train)
print('BNB trained')

# To create the test matrix we use transform, not fit_transform. This reuses the
# vocabulary learned from the training data, so that the train and test matrices
# have the same number of columns.
cv_matrix_test = cv.transform(X_test)

print(accuracy_score(y_test, bnb.predict(cv_matrix_test)))
print(roc_auc_score(y_test, bnb.predict_proba(cv_matrix_test)[:, 1]))

(482112, 166370)
BNB trained
0.743889386699
0.820065203043


With a few simple steps, we have obtained about 74% accuracy and an AUC of 0.82. Note that the only vocabulary filtering we applied is the built-in English stop-word list of **CountVectorizer** (the, are, for, because, but, and, etc.); with `min_df = 1`, even words that occur in a single tweet are kept. It seems that with a little more effort, the accuracy of our analysis can be improved. Furthermore, it may be worthwhile to try alternative classification methods such as Random Forests and SVM.
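
As a starting point for such a comparison, the sketch below fits three scikit-learn classifiers on the same kind of binary matrix. The toy tweets and variable names are hypothetical and stand in for `X_train`, `y_train` and `X_test`; the sketch makes no claim about how these classifiers perform on the real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical toy tweets; in the case study these would be X_train and y_train.
toy_train = ["love this song", "best day ever", "great happy morning",
             "hate this traffic", "sad and tired", "worst day ever"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test = ["love this day", "so sad and tired"]

toy_cv = CountVectorizer(binary=True)
toy_matrix = toy_cv.fit_transform(toy_train)
toy_matrix_test = toy_cv.transform(toy_test)

candidates = [BernoulliNB(),
              LinearSVC(),
              RandomForestClassifier(n_estimators=50, random_state=0)]
for clf in candidates:
    clf.fit(toy_matrix, toy_train_labels)
    print(type(clf).__name__, clf.predict(toy_matrix_test))
```

Note that `LinearSVC` does not provide `predict_proba`. To compute the AUC for it, the scores returned by `decision_function` can be passed to `roc_auc_score` instead.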