# Sentiment Analysis of Tweets Corpus

This notebook uses **sentiment analysis** to predict the polarity (positive or negative) of opinions expressed in tweets. The work is based on, and attempts to replicate, the following [blog post](http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/), whose dataset contains over 1.5 million tweets annotated with sentiment polarity: 0 for negative sentiment and 1 for positive sentiment.

#### Import libraries

In [1]:
import nltk
import sklearn
import csv
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ronaldpeic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
with open('Sentiment Analysis Dataset.csv', encoding="utf-8") as f:
    reader = csv.reader(f)
    print("Header line: %s" % next(reader))
    annotated_data = [r for r in reader]
print(annotated_data[0])
print("Total number of rows:", len(annotated_data))

Header line: ['\ufeffItemID', 'Sentiment', 'SentimentSource', 'SentimentText']
['1', '0', 'Sentiment140', '                     is so sad for my APL friend.............']
Total number of rows: 1578614


#### Each element in the list `annotated_data` is made up of the following fields:
* Item ID
* Sentiment (0 if negative, 1 if positive)
* Sentiment source: where the dataset author took the tweet from (the file is compiled from multiple sources)
* Tweet text

In [5]:
import random
random.seed(1234)  # fix the seed so that the shuffle is reproducible
random.shuffle(annotated_data)
# annotated_data = annotated_data[:500000]
annotated_data = annotated_data[:50000]  # using a reduced sample size due to hardware limitations

#### Splitting the data as follows:
* Training set: 80%
* Dev-test set: 10%
* Test set: 10%

In [6]:
threshold1 = int(len(annotated_data) *8/10)
threshold2 = int(threshold1 + len(annotated_data) *1/10)
print("threshold1", threshold1, "threshold2", threshold2)

train = annotated_data[:threshold1]
dev = annotated_data[threshold1:threshold2]
test = annotated_data[threshold2:]
print(dev[:5])

threshold1 40000 threshold2 45000
[['240313', '0', 'Sentiment140', "@Honey_ It's nasty.   No reports of flooding as yet. Multiple reports of bad hair and wet pants however."], ['618735', '0', 'Sentiment140', 'aww poor lacey  am nawt even in the call lol'], ['74547', '1', 'Sentiment140', '@Anniejunieee heyyy  i saw u on icarly ! you rocked  xoxoxo'], ['979620', '0', 'Sentiment140', "@leeagoldstein LOL. That's awesome. I was promised that when mine was taken it was being put in a safe place. It was so thrown out. "], ['1125777', '1', 'Sentiment140', 'Ok, time for bed! See you all tomorrow ']]
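
The 80/10/10 split above can also be wrapped in a small helper. This is only a sketch: the name `split_80_10_10` is hypothetical, and it uses integer arithmetic instead of `int(...)` on a float, which gives the same thresholds for the 50,000-row sample used here.

```python
def split_80_10_10(rows):
    """Split a list into train/dev-test/test parts of 80%, 10% and 10%."""
    n = len(rows)
    cut1 = n * 8 // 10      # end of the training part
    cut2 = cut1 + n // 10   # end of the dev-test part
    return rows[:cut1], rows[cut1:cut2], rows[cut2:]


# Stand-in data with the same size as the reduced sample.
train_part, dev_part, test_part = split_80_10_10(list(range(50000)))
print(len(train_part), len(dev_part), len(test_part))  # 40000 5000 5000
```

Any rows left over by the integer division end up in the test part, so no row is dropped.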


#### Checking the data is balanced


In [7]:
from collections import Counter

# reduce each row to a (tweet, sentiment) tuple for easier processing
train_tweets = [(row[3],int(row[1])) for row in train]
dev_tweets = [(row[3],int(row[1])) for row in dev]
test_tweets = [(row[3],int(row[1])) for row in test]

# list of tweets in train set separated by sentiment
train_sntmt_pos = [t for (t,s) in train_tweets if s == 1]
train_sntmt_neg = [t for (t,s) in train_tweets if s == 0]
print("train data with positive sentiment:", len(train_sntmt_pos)/len(train_tweets))
print("train data with negative sentiment:", len(train_sntmt_neg)/len(train_tweets))

# list of tweets in dev-test set separated by sentiment
dev_sntmt_pos = [t for (t,s) in dev_tweets if s == 1]
dev_sntmt_neg = [t for (t,s) in dev_tweets if s == 0]
print("dev-test data with positive sentiment:", len(dev_sntmt_pos)/len(dev_tweets))
print("dev-test data with negative sentiment:", len(dev_sntmt_neg)/len(dev_tweets))

# list of tweets in test set separated by sentiment
test_sntmt_pos = [t for (t,s) in test_tweets if s == 1]
test_sntmt_neg = [t for (t,s) in test_tweets if s == 0]
print("test data with positive sentiment:", len(test_sntmt_pos)/len(test_tweets))
print("test data with negative sentiment:", len(test_sntmt_neg)/len(test_tweets))


train data with positive sentiment: 0.492275
train data with negative sentiment: 0.507725
dev-test data with positive sentiment: 0.5034
dev-test data with negative sentiment: 0.4966
test data with positive sentiment: 0.501
test data with negative sentiment: 0.499
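
The same balance check can be written once and reused for each split. The helper below is a minimal sketch, assuming the data is a list of `(tweet, sentiment)` tuples as built above; `label_proportions` and the toy `sample` list are hypothetical names introduced for illustration.

```python
from collections import Counter


def label_proportions(pairs):
    """Return the fraction of examples carrying each label."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data standing in for train_tweets, dev_tweets or test_tweets.
sample = [("good day", 1), ("so sad", 0), ("love it", 1), ("ugh", 0)]
print(label_proportions(sample))  # {1: 0.5, 0: 0.5}
```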


#### Size of the training set vocabulary

In [8]:
from nltk import word_tokenize
train_vocab = set([word for row in train_tweets for word in word_tokenize(row[0])])
print("size of training set vocabulary is", len(train_vocab))

size of training set vocabulary is 63906


#### Most frequent words for each tweet sentiment

In [9]:
train_tweets_pos = Counter([word for tweet in train_sntmt_pos for word in word_tokenize(tweet)])
train_tweets_neg = Counter([word for tweet in train_sntmt_neg for word in word_tokenize(tweet)])

print("most frequent words in tweets with positive sentiment:", train_tweets_pos.most_common()[:5])
print("most frequent words in tweets with negative sentiment:", train_tweets_neg.most_common()[:5])

most frequent words in tweets with positive sentiment: [('!', 13312), ('@', 11929), ('.', 9563), ('I', 6649), (',', 6412)]
most frequent words in tweets with negative sentiment: [('.', 11436), ('!', 9928), ('I', 9770), ('@', 8247), ('to', 8006)]
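
The counts above are dominated by punctuation and function words, which say little about sentiment. One possible refinement is sketched below. It is not part of the original analysis: it uses plain whitespace splitting instead of `word_tokenize`, strips surrounding punctuation (so `@user` becomes `user`), and takes a caller-supplied stopword set. The names `content_word_counts`, `toy` and `top` are hypothetical.

```python
import string
from collections import Counter


def content_word_counts(tweets, stopwords=frozenset()):
    """Count lowercased tokens, skipping bare punctuation and stopwords."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            token = token.strip(string.punctuation)  # "sad..." -> "sad", "." -> ""
            if token and token not in stopwords:
                counts[token] += 1
    return counts


toy = ["I love this!", "love love it .", "@user so sad ..."]
top = content_word_counts(toy, stopwords={"i", "it", "this", "so"})
print(top.most_common(1))  # [('love', 3)]
```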


#### One-hot encoding for Naive Bayes in NLTK
Apply a feature extractor using one-hot encoding on the entire vocabulary in the training set. Use the feature extractor to train a Naive Bayes classifier in NLTK and report the accuracy of the classifier using the test set.

In [10]:
# feature extractor that uses one-hot encoding
def nltk_one_hot_extr(words, wordlist=None):
    """Return a dict of 'has(word)' features for the tokens in `words`.

    If `wordlist` is given, only tokens that appear in it are kept.
    """
    result = dict()
    for w in word_tokenize(words):
        if wordlist is None or w in wordlist:
            result['has(%s)' % w] = True
    return result

nltk_one_hot_extr("Testing. This is a test.")

{'has(.)': True,
 'has(Testing)': True,
 'has(This)': True,
 'has(a)': True,
 'has(is)': True,
 'has(test)': True}

In [11]:
# extract the features from the respective data sets
train_features = [(nltk_one_hot_extr(x),y) for (x,y) in train_tweets]
test_features = [(nltk_one_hot_extr(x),y) for (x,y) in test_tweets]
print(train_features[:10])

[({'has(@)': True, 'has(shabooty)': True, 'has(:)': True, 'has(Taylor)': True, 'has(Swift)': True, 'has(is)': True, 'has(19)': True, 'has(.)': True, 'has(She)': True, "has('s)": True, 'has(already)': True, 'has(way)': True, 'has(behind)': True, 'has(on)': True, 'has(her)': True, 'has(career)': True, 'has(Portman)': True, 'has(did)': True, 'has(it)': True, 'has(better)': True, 'has(,)': True, 'has(this)': True, 'has(just)': True, 'has(makes)': True, 'has(me)': True, 'has(cringe)': True}, 0), ({'has(@)': True, 'has(raccoon9ta)': True, 'has(you)': True, 'has(can)': True, 'has(do)': True, 'has(it)': True}, 1), ({'has(@)': True, 'has(TheStitchWitch)': True, 'has(Not)': True, 'has(on)': True, 'has(a)': True, 'has(friday)': True, 'has(!)': True, 'has(Get)': True, 'has(out)': True, 'has(the)': True, 'has(excedrin)': True, 'has(and)': True, 'has(caffeine)': True, 'has(show)': True, 'has(that)': True, 'has(b1+ch)': True, 'has(who)': True, "has('s)": True, 'has(boss)': True}, 1), ({'has(@)': True

In [12]:
# train a Naive Bayes classifier
nltk_naive_bayes_classifier = nltk.NaiveBayesClassifier.train(train_features)

In [13]:
# report the accuracy using the test set
nltk.classify.accuracy(nltk_naive_bayes_classifier,test_features)

0.7378
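
For intuition about what a Naive Bayes classifier does with presence features like `has(word)`, here is a minimal pure-Python sketch. It is not NLTK's implementation: it assumes add-one smoothing, ignores words never seen in training, and scores only the words present in the tweet. All names (`train_presence_nb`, `classify_presence_nb`, `toy`, `model`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_presence_nb(examples):
    """examples: list of (set_of_words, label). Returns (label_counts, word_counts)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> word -> number of tweets containing it
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(set(words))
    return label_counts, word_counts


def classify_presence_nb(words, label_counts, word_counts):
    """Return the label with the highest log-probability score."""
    total = sum(label_counts.values())
    known = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for w in set(words) & known:       # unseen words are ignored
            # add-one smoothed estimate of P(word present | label)
            score += math.log((word_counts[label][w] + 1) / (n_label + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy = [({"so", "sad"}, 0), ({"feel", "horrible"}, 0),
       ({"so", "happy"}, 1), ({"love", "it"}, 1)]
model = train_presence_nb(toy)
print(classify_presence_nb({"sad", "day"}, *model))  # 0
print(classify_presence_nb({"love"}, *model))        # 1
```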

#### One-hot encoding of most informative features
Use NLTK to build another Naive Bayes classifier that uses the 2000 most informative features and train it on the training set, then report the accuracy on the test set. 

In [14]:
# obtain the 2000 most informative features
most_informative_features = nltk_naive_bayes_classifier.most_informative_features(2000)
print(most_informative_features[:10])

[('has(horrible)', True), ('has(terrible)', True), ('has(vip)', True), ('has(hurts)', True), ('has(throat)', True), ('has(pleasure)', True), ('has(surgery)', True), ('has(depressing)', True), ('has(upset)', True), ('has(sad)', True)]


In [15]:
# get the most informative words
most_informative_words = [w[4:-1] for w,f in most_informative_features]
print(most_informative_words[:10])

['horrible', 'terrible', 'vip', 'hurts', 'throat', 'pleasure', 'surgery', 'depressing', 'upset', 'sad']
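
A rough way to think about "informative" here is how unevenly a word is distributed across the two labels. The sketch below ranks words by the ratio between the largest and smallest smoothed estimate of P(word present | label). It is an illustration under assumptions, not NLTK's code: the add-0.5 smoothing and the names `rank_by_likelihood_ratio` and `examples` are choices made for this example.

```python
from collections import Counter


def rank_by_likelihood_ratio(examples, top_n=3):
    """examples: list of (set_of_words, label). Return the top_n most lopsided words."""
    label_counts = Counter(label for _, label in examples)
    doc_freq = {label: Counter() for label in label_counts}
    for words, label in examples:
        doc_freq[label].update(set(words))
    vocab = set().union(*(c.keys() for c in doc_freq.values()))

    def ratio(word):
        # smoothed P(word present | label) for every label
        probs = [(doc_freq[l][word] + 0.5) / (label_counts[l] + 1) for l in label_counts]
        return max(probs) / min(probs)

    # highest ratio first; ties broken alphabetically for a stable order
    return sorted(vocab, key=lambda w: (-ratio(w), w))[:top_n]


examples = [({"so", "sad"}, 0), ({"sad", "day"}, 0),
            ({"so", "happy"}, 1), ({"happy", "day"}, 1)]
print(rank_by_likelihood_ratio(examples, top_n=2))  # ['happy', 'sad']
```

Words such as `so` and `day`, which occur equally often under both labels, get a ratio of 1 and fall to the bottom of the ranking.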


In [16]:
# extract the most informative words from the respective data sets
# using the function defined above for extracting features that use one-hot encoding
train_features_informative = [(nltk_one_hot_extr(x, most_informative_words),y) for (x,y) in train_tweets]
test_features_informative = [(nltk_one_hot_extr(x, most_informative_words),y) for (x,y) in test_tweets]
print(train_features_informative[:10])

[({'has(Taylor)': True}, 0), ({}, 1), ({'has(Get)': True}, 1), ({'has(sorry)': True, 'has(left)': True}, 0), ({}, 0), ({}, 1), ({'has(Ugh)': True}, 0), ({'has(Happy)': True}, 0), ({}, 1), ({}, 1)]


In [17]:
# train a Naive Bayes classifier using the list of most informative words
nltk_naive_bayes_classifier2 = nltk.NaiveBayesClassifier.train(train_features_informative)

In [18]:
# report the test-set accuracy of the classifier restricted to the 2000 most informative features
nltk.classify.accuracy(nltk_naive_bayes_classifier2,test_features_informative)

0.6626

#### TF-IDF for Naive Bayes in Scikit-Learn
Using Scikit-Learn, generate the TF-IDF matrix of the training set. Train an `sklearn` Naive Bayes classifier on this matrix and report its accuracy on the test set.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(lowercase=False)
sklearn_tfidf = tfidf.fit_transform([tweet[0] for tweet in train_tweets])
print([tweet[0] for tweet in train_tweets][:10])

["@shabooty : Taylor Swift is 19. She's already way behind on her career. Portman did it better, this just makes me cringe ", '@raccoon9ta you can do it ', "@TheStitchWitch Not on a friday! Get out the excedrin and caffeine and show that b1+ch who's boss! ", '@LauraKim123 yes, sorry! Saturday ... haha, was obviously very hopeful about the number of days left in this week ', 'cause mom got up late ', '@krystlerb Leave it to u to crack me up! IS that a Hummer on HIGHER wheels than neccesssary! OMG!   how are you Home Skillit?', 'Ugh, I need to get new glasses. This pair is scratched up to high heaven. ', "Happy birthday me!  Happy birthday evil identical twin! We're old ", 'some one buy me a teddy bear! ', 'Beer pong with @ChrisMallin, Marcie, an AJ. Minus the two ppl I made up to seem less pathetic... ']
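
To make the weighting concrete, the following pure-Python sketch computes TF-IDF rows by hand. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the variant documented for `TfidfVectorizer`'s default settings. It uses whitespace splitting, so its tokens will differ from the vectorizer's. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_rows(["happy happy day", "sad day"])
# "day" occurs in both documents, so it is weighted below "sad" in the second one.
print(rows[1]["sad"] > rows[1]["day"])  # True
```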


In [None]:
# Fit on the sentiment labels (tweet[1]). Passing the tweet text (tweet[0]) as the
# target would create one class per distinct tweet and can exhaust memory.
from sklearn.naive_bayes import MultinomialNB
sklearn_tfidf_NB = MultinomialNB()
sklearn_tfidf_NB.fit(sklearn_tfidf, [tweet[1] for tweet in train_tweets])

In [None]:
from sklearn.metrics import accuracy_score
sklearn_tfidf_test = tfidf.transform([tweet[0] for tweet in test_tweets])
predictions = sklearn_tfidf_NB.predict(sklearn_tfidf_test)
accuracy_score([tweet[1] for tweet in test_tweets], predictions)  # compare against the gold sentiment labels
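
When reading an accuracy figure, it helps to compare it with a trivial baseline that always predicts the most frequent training label. Since the classes here are roughly balanced, that baseline should sit near 0.5. The sketch below is pure Python with hypothetical helper names (`accuracy`, `majority_baseline`) and toy labels, not the notebook's data.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority] * len(test_labels))


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))        # 0.75
print(majority_baseline([0, 0, 1], [0, 1, 1, 0]))  # 0.5
```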