# Sentiment Analysis - Question B - Voulgari Eleni - A.M. 17005

## Apply an existing application for sentiment analysis on Tweeter data

In [2]:
# Load libraries
import os
import re
import pandas as pd
from textblob import TextBlob
from sklearn.metrics import accuracy_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [4]:
# Load test set to evaluate the application
cols = ['id','sentiment','text']
test2016 = pd.read_csv("./sets/twitter-2016test-A.tsv", sep='\t', header=None, names=cols)
print len(test2016)

20342


In [4]:
# Remove the instances where the tweet text is not available and the column 'id' which is not needed for the purpose 
# of sentiment analysis.
test2016 = test2016[test2016.text != 'Not Available']
test2016 = test2016.drop(['id'], axis=1)
print len(test2016)

15437


In [5]:
# Check the number of instances that have each sentiment label in the test set.
sentiment_counts = test2016.sentiment.value_counts()
print(sentiment_counts)

neutral     7727
positive    5439
negative    2271
Name: sentiment, dtype: int64


In [6]:
# Transform the sentiment (positive, negative, neutral) in numerical form to be able to compare the results easily.
def sentiment2target(sentiment):
    return {
        'negative': 0,
        'neutral': 1,
        'positive' : 2
    }[sentiment]
test_targets = test2016.sentiment.apply(sentiment2target)

In [7]:
# First, we try Vader Sentiment Analysis from the nltk.sentiment.vader package.
vader = SentimentIntensityAnalyzer()

def vader_polarity(text):
    """ Transform the output to a result with the three sentiments"""
    score = vader.polarity_scores(text)
    if score['compound'] >= 0.5:
        return 2
    if -0.5 < score['compound'] < 0.5:
        return 1
    if score['compound'] <= -0.5:
        return 0

pred_y = [vader_polarity(text) for text in test2016.text]
print accuracy_score(test_targets, pred_y)

0.598885793872


In [8]:
# Then, we try TextBlob Sentiment Analysis from the textblob package after a basic cleaning of the text.
def clean_tweet(tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])(\w+:\/\/\S+)", " ", tweet).split())


def blob_polarity(text):
    """ Transform the output to a result with the three sentiments"""
    score = TextBlob(clean_tweet(text)).sentiment.polarity
    if score >= 0.5:
        return 2
    if -0.5 < score < 0.5:
        return 1
    if score <= -0.5:
        return 0

pred_y = [blob_polarity(text) for text in test2016.text]
print accuracy_score(test_targets, pred_y)

0.545701885081


As we can observe Vader sentiment analysis performs better than TextBlob sentiment analysis. 

Now, if we compare Vader's performance with the performance of the LinearSVC classifier obtained in question A, we observe that it is almost the same.

High performance in text sentiment analysis is a really hard task to achieve. This is due to the peculiarity of text as data and especially on Tweets' text, as it consists of small sentences, sometimes with ambiguous meaning or its sentiment is shown with punctuation or emoticon and such.