# Lexicon-based sentiment analysis of tweets: using VADER and TextBlob

Calculate the sentiment score of the tweets using the lexicon-based models VADER and TextBlob. Both score text against a human-curated, valence-based sentiment lexicon rather than a model trained on labelled tweets.

## Set up

In [None]:
import os
import re
import string
import pandas as pd
import numpy as np

from emot.emo_unicode import UNICODE_EMO, EMOTICONS

In [None]:
import sys
print(sys.executable)
print(sys.version)
print(sys.version_info)

In [None]:
%load_ext autoreload
from src.preproc_text import *
from src.utils import chain_functions
from src.analyse_text import get_sentiment_score_VDR, get_sentiment_score_TB

In [None]:
%reload_ext autoreload

In [None]:
os.getcwd()

In [None]:
pd.options.display.max_seq_items = 10000
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

Environment variables and constants

In [None]:
DATA_DIR = os.environ.get("DIR_DATA_INTERIM")

In [None]:
FILENAME = "tweets_relevant_keywords"

### Define domain-specific stopwords

Sentiment requires context. 

With a simple approach to sentiment analysis, we have to hope that context can be ignored and that the sentiments will average out to the right trend.

However, we can take some context into account by excluding terms that are sentiment-loaded but so common in the COVID-19 context that they are effectively neutral (e.g., crisis, virus, pandemic).
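As a minimal sketch of the idea, the snippet below drops domain-specific terms from a tokenised tweet before scoring. The function `remove_stopwords_sketch` and the tiny `BASE_STOPWORDS` set are hypothetical stand-ins for illustration; the project's own `remove_stopwords` helper may behave differently.

```python
# Hypothetical stand-in for the project's remove_stopwords helper.
BASE_STOPWORDS = {"the", "a", "is", "of", "in", "and", "to"}  # tiny illustrative set


def remove_stopwords_sketch(tokens, extra_stopwords=()):
    """Drop tokens found in the base or the domain-specific stopword set."""
    stopwords = BASE_STOPWORDS | {word.lower() for word in extra_stopwords}
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = ["the", "pandemic", "is", "a", "terrible", "crisis"]
filtered = remove_stopwords_sketch(tokens, extra_stopwords=["pandemic", "crisis"])
print(filtered)
```

Only "terrible" survives here, so the lexicon no longer counts "pandemic" or "crisis" towards the score.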


In [None]:
EXTRA_STOPWORDS = ["coronavirus", "covid", "covid19", "covid-", "covid-19", "’s", "link", "dominic", "cummings", "boris", "johnson", 
                   "dr", "david", "halpern", "susan", "michie", "richard", "amlot", "thaler", "cass", "sunstein", 
                   "daniel", "kahneman", 
                   "d-", "th", "january", "february", "march", "april", "may", "june", "july", "august", "september", 
                   "october", "november", "december", "corona", "virus", "wd", "&amp;", "article", "here", "%", "'s",
                  "'ve'", '&', 'amp', "'re", "via", "hoe", "'ve",
                  "crisis", "pandemic", "epidemic"]

## Get Data

In [None]:
tweets_df = pd.read_csv(os.path.join(DATA_DIR, FILENAME + '.csv'))

In [None]:
tweets_df.shape

# Text preprocessing

In [None]:
tweets_df.text.head(10)

## Are there still duplicates?

Looks like there are still duplicates in the dataset that we need to get rid of. Consider retweet counts when doing so.

In [None]:
# Find duplicate texts (the first occurrence of each text is not flagged)
duplicate_tweets = tweets_df[tweets_df.duplicated(['text'])]
print(duplicate_tweets[['favorite_count', 'retweet_count', 'text']])

Apparently they are all duplicates of a single tweet.

We will keep the one with the largest count of "favourites". 

In [None]:
duplicate_tweets[duplicate_tweets.favorite_count == max(duplicate_tweets.favorite_count)].index

In [None]:
# get the index of the duplicates that do not have the largest favourite count
duplicate_tweets_index = duplicate_tweets[duplicate_tweets.favorite_count != 
                                          max(duplicate_tweets.favorite_count)].index

In [None]:
duplicate_tweets_index

In [None]:
tweets_df = tweets_df.drop(duplicate_tweets_index, axis=0).copy()
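Note that `DataFrame.duplicated` with its default `keep='first'` does not flag the first occurrence of a text, so the first copy is never part of `duplicate_tweets` and stays in the data regardless of its favourite count. An alternative sketch, shown here on a made-up toy frame (`toy_df` is illustrative, not the real data), sorts by `favorite_count` and then keeps one row per text:

```python
import pandas as pd

# Toy data standing in for tweets_df; values are invented for illustration.
toy_df = pd.DataFrame({
    "text": ["same tweet", "same tweet", "same tweet", "another tweet"],
    "favorite_count": [2, 10, 5, 1],
})

# Put the most-favourited copy of each text first, keep only that copy,
# then restore the original row order.
deduped = (
    toy_df.sort_values("favorite_count", ascending=False)
          .drop_duplicates(subset="text", keep="first")
          .sort_index()
)
print(deduped)
```

This keeps exactly one row per distinct text and works when several different tweets are duplicated, not just one.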

In [None]:
tweets_df.shape

### Quick and dirty sentiment analysis without preprocessing the text

In [None]:
tweets_df['quick_VDR_sentiment'] = [get_sentiment_score_VDR(tweet) for tweet in tweets_df.text]

In [None]:
tweets_df['quick_TB_sentiment'] = [round(get_sentiment_score_TB(tweet),3) for tweet in tweets_df.text]

In [None]:
tweets_df[['text', 'quick_VDR_sentiment', 'quick_TB_sentiment']]

Looks like many negative tweets have not been captured as such.

### Sample 120 cases from tweets 

In [None]:
sample_tweets = tweets_df.sample(n=120, random_state=11)

In [None]:
sample_tweets[['created_at', 'id', 'user_location', 
               'relevant_subkwords', 'text']].to_csv(os.path.join(DATA_DIR, "sample1_tweets.csv"))

## Text-preprocessing steps

#### First part

1. Replace emojis and emoticons with corresponding text description
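A minimal sketch of this step, assuming a mapping from emoji characters to colon-delimited descriptions. The two-entry `EMOJI_TO_TEXT` dictionary and the function name are hypothetical; the notebook's `convert_emojis` relies on the `emot` package's much larger mapping and may differ in detail.

```python
# Hypothetical two-entry mapping, for illustration only.
EMOJI_TO_TEXT = {
    "😂": ":face_with_tears_of_joy:",
    "👍": ":thumbs_up:",
}


def convert_emojis_sketch(text):
    """Replace each known emoji with its text description, without the colons."""
    for emoji_char, description in EMOJI_TO_TEXT.items():
        text = text.replace(emoji_char, description.strip(":"))
    return text


converted = convert_emojis_sketch("so funny 😂")
print(converted)
```

The underscores in the description are split into separate words in a later step, so that the lexicon can match "tears" and "joy".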

#### Second part 

2. Replace URLs with word "url" or remove them
3. Remove the usernames of mentioned users or replace them with "user_mentioned"
4. Replace all hashtags with the same words without the hash symbol (e.g., "#hello" -> "hello")

URLs and user mentions are not part of the lexicons, so they do not contribute to the sentiment score.
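The second part can be sketched with three regular expressions. The function name `clean_tweet_sketch` and the exact patterns are assumptions for illustration; the project's `clean_tweet_quibbles` may use different rules.

```python
import re


def clean_tweet_sketch(text):
    """Replace URLs and mentions with placeholders and strip hash symbols."""
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)  # URLs -> "url"
    text = re.sub(r"@\w+", "user_mentioned", text)        # @handle -> placeholder
    text = re.sub(r"#(\w+)", r"\1", text)                 # "#hello" -> "hello"
    return text


cleaned = clean_tweet_sketch("@nudge_fan read this #BehaviouralScience https://example.com/a")
print(cleaned)
```

URLs are handled first so that any `@` or `#` inside a link is not treated as a mention or hashtag.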

#### Third part

5. Split compounded-by-upper-case strings
6. Split compounded-by-underscore "_" strings (this is to get the words that make up the emoji descriptions)
7. Remove digits
8. Remove single-character words
9. Split domain-specific compounded all-lower-case strings (e.g., behaviouraleconomics)
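Steps 5 to 8 can be sketched as follows. The function name and regular expressions are assumptions, not the project's implementation, and step 9 is left out because it needs a hand-made list of domain compounds.

```python
import re


def split_and_strip_sketch(text):
    """Split compounds, then drop digits and single-character words."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # 5. split at lower->upper boundary
    text = text.replace("_", " ")                     # 6. split at underscores
    text = re.sub(r"\d+", "", text)                   # 7. remove digits
    text = re.sub(r"(?<!\S)\w(?!\S)", "", text)       # 8. remove whitespace-delimited single characters
    return re.sub(r"\s+", " ", text).strip()          # tidy up leftover whitespace


result = split_and_strip_sketch("BehaviouralScience face_with_tears_of_joy 2020 x nudges")
print(result)
```

Single characters are matched only when surrounded by whitespace or the string edges, so the "s" in a contraction such as "it's" is left alone.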

#### Fourth part

10. Remove stop words
11. Remove punctuation (but keep !?...)
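Step 11 could look like the sketch below, which removes every ASCII punctuation character except `!`, `?` and `.`. The names `KEEP` and `remove_punctuation_sketch` are hypothetical, and only ASCII punctuation from `string.punctuation` is covered.

```python
import string

KEEP = "!?."  # punctuation that may carry emphasis and should stay in the text
REMOVE_TABLE = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in KEEP)
)


def remove_punctuation_sketch(text):
    """Delete all ASCII punctuation except the characters listed in KEEP."""
    return text.translate(REMOVE_TABLE)


no_punct = remove_punctuation_sketch("Wow!!! Really? (yes), ok...")
print(no_punct)
```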


First part

In [None]:
# as a check: a sample of tweets that contain emojis
idx_sample_tweets_emojs = [37, 57, 135, 136]

In [None]:
tweets = [convert_emojis(t) for t in tweets_df.text]

In [None]:
tweets = [convert_emoticons(t) for t in tweets]

In [None]:
# check
[tweets[i] for i in idx_sample_tweets_emojs]

Second part

In [None]:
tweets = [clean_tweet_quibbles(tweet) for tweet in tweets]

In [None]:
tweets[:10]

In [None]:
[tweets[i] for i in idx_sample_tweets_emojs]

Third part

In [None]:
preproc_pipe1 = chain_functions(split_lowercase_compounds,
                                split_string_at_uppercase,
                                break_compound_words,
                                remove_digits, 
                                remove_single_characters)
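For readers without access to `src.utils`, a helper like `chain_functions` can be sketched with `functools.reduce`. This version applies the functions left to right, which is an assumption about how the project's helper orders them.

```python
from functools import reduce


def chain_functions_sketch(*functions):
    """Return a function that applies the given functions from left to right."""
    def chained(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return chained


pipe = chain_functions_sketch(str.strip, str.lower, lambda s: s.replace("_", " "))
piped = pipe("  Thumbs_Up  ")
print(piped)
```

Because each step receives the output of the previous one, the order of the arguments matters: splitting at upper-case letters, for example, has to happen before any lower-casing.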

In [None]:
tweets = [preproc_pipe1(tweet) for tweet in tweets]

In [None]:
tweets[:10]

In [None]:
[tweets[i] for i in idx_sample_tweets_emojs]

Fourth part

In [None]:
# lower text
tweets = [tweet.lower() for tweet in tweets]

In [None]:
tokenise_pipe = chain_functions(tokenise_sent, tokenise_word)

In [None]:
tweets_tok = [tokenise_pipe(tweet) for tweet in tweets]

In [None]:
tweets_tok = [remove_stopwords(tweet, extra_stopwords= EXTRA_STOPWORDS) for tweet in tweets_tok]

In [None]:
# check
[tweets_tok[idx] for idx in idx_sample_tweets_emojs][:1]

Let's not remove punctuation for the time being.

In [None]:
tokens2string_pipe = chain_functions(flatten_irregular_listoflists, list, detokenise_list)

In [None]:
tweets_cleaned = [tokens2string_pipe(tweet) for tweet in tweets_tok]

In [None]:
# remove extra white space before punctuation
tweets_cleaned = [re.sub(r'\s([?.!,;:"](?:\s|$))', r'\1', tweet) for tweet in tweets_cleaned]
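To see what this substitution does, here it is applied to a made-up detokenised string. The pattern matches a whitespace character followed by a punctuation mark that is itself followed by whitespace or the end of the string, and keeps everything except the leading whitespace.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
PUNCT_SPACE = re.compile(r'\s([?.!,;:"](?:\s|$))')

detokenised = "stay home , stay safe !"
tidied = PUNCT_SPACE.sub(r"\1", detokenised)
print(tidied)
```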

In [None]:
# let's take a look
[tweets_cleaned[i] for i in idx_sample_tweets_emojs]

## Merge to original dataset of tweets

In [None]:
len(tweets_cleaned)

In [None]:
tweets_df['tweet_cleaned'] = tweets_cleaned

In [None]:
tweets_df[['text', 'tweet_cleaned']][:10]

# VADER sentiment analysis on cleaned-text tweets

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion.

Intro to VADER: https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664

In [None]:
tweets_df['VDR_sentiment'] = [get_sentiment_score_VDR(tweet) for tweet in tweets_df.tweet_cleaned]

In [None]:
tweets_df[['text', 'tweet_cleaned','VDR_sentiment']][:10]
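If `VDR_sentiment` holds VADER's compound score (an assumption about what `get_sentiment_score_VDR` returns by default), it lies between -1 and 1 and can be turned into a label. The cut-offs of ±0.05 below are the ones conventionally suggested in the VADER documentation; the function name is hypothetical.

```python
def label_compound_sketch(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


labels = [label_compound_sketch(score) for score in (0.6, -0.3, 0.01)]
print(labels)
```

On the real data this could be applied with something like `tweets_df['VDR_sentiment'].apply(label_compound_sketch)`.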

# TextBlob sentiment analysis on cleaned-text tweets

Intro to TextBlob: https://planspace.org/20150607-textblob_sentiment/  

In [None]:
tweets_df['TB_sentiment'] = [round(get_sentiment_score_TB(tweet),3) for tweet in tweets_df.tweet_cleaned]

In [None]:
tweets_df[['text', 'tweet_cleaned','VDR_sentiment', 'TB_sentiment']][:10]

Let's compare these scores to the ones obtained for the non-preprocessed tweet texts:
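One way to make the comparison concrete is to look at the per-tweet difference between the two scores. The sketch below uses a toy frame with invented values (not results from this dataset) and the same column names as above:

```python
import pandas as pd

# Invented scores for illustration only; not taken from the real tweets.
toy_scores = pd.DataFrame({
    "quick_VDR_sentiment": [0.50, -0.25, 0.0],
    "VDR_sentiment": [0.25, -0.50, 0.0],
})

# Positive values mean the cleaned text scored higher than the raw text.
toy_scores["VDR_diff"] = toy_scores["VDR_sentiment"] - toy_scores["quick_VDR_sentiment"]
mean_abs_diff = toy_scores["VDR_diff"].abs().mean()

print(toy_scores)
print(round(mean_abs_diff, 3))
```

The same two lines can be run on `tweets_df`, and repeated for `quick_TB_sentiment` against `TB_sentiment`.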

## VADER with individual score for pos/neu/neg

In [None]:
tweets_df['VDR_detailed_sentiment'] = [get_sentiment_score_VDR(tweet, score_type='all') for tweet in tweets_df.tweet_cleaned]

In [None]:
tweets_df[['text', 'tweet_cleaned','VDR_sentiment', 'VDR_detailed_sentiment', 'TB_sentiment']][:10]

## Let's clean the text less

Let's try cleaning the text less, since VADER should be sensitive to emoticons, capital letters used for emphasis, etc. See: https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

So, we will:

- keep emojis and emoticons in as they are
- not lemmatise
- not lower-case
- not remove stop-words 

#### New cleaning steps

1. Replace URLs with word "url" or remove them
2. Remove the usernames of mentioned users or replace them with "user_mentioned"
3. Replace all hashtags with the same words without the hash symbol (e.g., "#hello" -> "hello")


4. Split compounded-by-upper-case strings
5. Split compounded-by-underscore "_" strings (this is to get the words that make up the emoji descriptions)
6. Remove digits
7. Remove single-character words
8. Split domain-specific compounded all-lower-case strings (e.g., behaviouraleconomics)

In [None]:
tweets_2 = [clean_tweet_quibbles(tweet) for tweet in tweets_df.text]

In [None]:
tweets_2[:5]

In [None]:
tweets_2 = [preproc_pipe1(tweet) for tweet in tweets_2]

In [None]:
tweets_2[:5]

In [None]:
tweets_df['tweet_cleaned_less'] = tweets_2

In [None]:
tweets_df['VDR_sentiment_2'] = [get_sentiment_score_VDR(tweet) for tweet in tweets_df.tweet_cleaned_less]

In [None]:
tweets_df['VDR_detailed_sentiment_2'] = [get_sentiment_score_VDR(tweet, score_type='all') for tweet in tweets_df.tweet_cleaned_less]

In [None]:
tweets_df[['text', 'tweet_cleaned_less','VDR_sentiment', 'VDR_sentiment_2', 'VDR_detailed_sentiment', 'VDR_detailed_sentiment_2']][:10]

Not much difference, really.

## Save dataset with sentiment scores

In [None]:
tweets_df_to_save = tweets_df[['id', 'created_at', 'favorite_count', 'retweet_count',
                               'text', 'tweet_cleaned', 'tweet_cleaned_less',
                               'VDR_sentiment', 'VDR_sentiment_2', 'TB_sentiment',
                               'VDR_detailed_sentiment', 'VDR_detailed_sentiment_2']]

In [None]:
tweets_df_to_save.to_csv(os.path.join(DATA_DIR, "tweets_en_lexicon_sentiments.csv"))

In [None]:
tweets_df_to_save.to_pickle(os.path.join(DATA_DIR, "tweets_en_lexicon_sentiments.pickle"))