# Labeling Data with VADER

The data used in this tutorial was scraped from Twitter using [Twint](https://github.com/twintproject/twint), an advanced Twitter scraping tool written in Python. It was then partially cleaned and labeled as negative or positive using the VADER (Valence Aware Dictionary and sEntiment Reasoner) model. VADER performs sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data. The steps below show how the labels were produced.


**Natural language is a difficult problem.** One of the most challenging aspects of supervised machine learning and deep learning is obtaining good data, and what counts as good depends on the context. Labeling data is expensive and is not free from human bias. The dataset and the labels created for it in this tutorial are far from perfect.

More accurate labels require finer-grained labeling, which often means a human-in-the-loop approach. However, because this tutorial aims to introduce participants to the basic concepts of NLP, we decided to focus less on the accuracy of the model and more on the steps toward an end-to-end model. We believe that better data beats better algorithms; those who would like to experiment further will need to iterate on data preparation and cleaning and try more sophisticated models to achieve better accuracy.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/covid19_tweets.csv')

In [3]:
df.head()

Unnamed: 0,date,timezone,tweet,hashtags,retweet,sentiment
0,2020-05-26,CDT,Words matter! Masks Don’t “Prevent” a person f...,[],False,neg
1,2020-05-26,CDT,"If you are scared wear a mask, which does not ...",[],False,neg
2,2020-05-26,CDT,Why should we wear a sweaty useless mask? Harc...,[],False,neg
3,2020-05-26,CDT,"WEAR A MASK, people!\n\nDoctor recovering from...",[],False,neg
4,2020-05-26,CDT,Yo Dee why are u wearing a mask in public when...,[],False,neg


In [4]:
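import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon if it is not already present.
nltk.download('vader_lexicon')

sentiment_analyzer = SentimentIntensityAnalyzer()

# Each result is a dict with 'neg', 'neu', 'pos', and 'compound' keys.
df['scores'] = df['tweet'].apply(sentiment_analyzer.polarity_scores)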

df.head()

Unnamed: 0,date,timezone,tweet,hashtags,retweet,sentiment,scores
0,2020-05-26,CDT,Words matter! Masks Don’t “Prevent” a person f...,[],False,neg,"{'neg': 0.128, 'neu': 0.847, 'pos': 0.025, 'co..."
1,2020-05-26,CDT,"If you are scared wear a mask, which does not ...",[],False,neg,"{'neg': 0.126, 'neu': 0.798, 'pos': 0.076, 'co..."
2,2020-05-26,CDT,Why should we wear a sweaty useless mask? Harc...,[],False,neg,"{'neg': 0.25, 'neu': 0.694, 'pos': 0.056, 'com..."
3,2020-05-26,CDT,"WEAR A MASK, people!\n\nDoctor recovering from...",[],False,neg,"{'neg': 0.096, 'neu': 0.904, 'pos': 0.0, 'comp..."
4,2020-05-26,CDT,Yo Dee why are u wearing a mask in public when...,[],False,neg,"{'neg': 0.077, 'neu': 0.867, 'pos': 0.056, 'co..."
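
Each entry in the `scores` column is a plain Python dict. As far as we understand VADER's output, `neg`, `neu`, and `pos` are the proportions of the text that fall into each category, so they should add up to roughly 1, while `compound` is a separate overall score. The sketch below checks this on the first three rows; the values are copied by hand from the outputs shown in this tutorial, and the helper name `proportions_total` is our own, not part of NLTK.

```python
# Score dicts for the first three tweets, copied from the tutorial outputs.
first_scores = [
    {"neg": 0.128, "neu": 0.847, "pos": 0.025, "compound": -0.6696},
    {"neg": 0.126, "neu": 0.798, "pos": 0.076, "compound": -0.4344},
    {"neg": 0.25, "neu": 0.694, "pos": 0.056, "compound": -0.8439},
]


def proportions_total(scores):
    """Sum the three proportion fields of one VADER score dict."""
    return scores["neg"] + scores["neu"] + scores["pos"]


# Round to three decimals to absorb floating-point noise.
totals = [round(proportions_total(s), 3) for s in first_scores]
compounds = [s["compound"] for s in first_scores]

print(totals)
print(compounds)
```

Note that a tweet can be mostly neutral by proportion (`neu` around 0.8) and still carry a strongly negative `compound` score, which is why the labeling below relies on `compound` rather than on the largest proportion.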


#### The `compound` score in VADER is computed by summing the valence scores of the words in the text and normalizing the result to the range [-1, 1]
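
The normalization step can be sketched in a few lines. To the best of our knowledge, the reference implementation sums the rule-adjusted valence of each word and then maps that unbounded sum `x` to `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function below is our own minimal re-creation of that formula for illustration, not a call into NLTK, and the raw sums fed to it are made-up inputs.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded summed valence into the range [-1, 1].

    alpha = 15 is assumed here to match the VADER reference implementation.
    """
    norm = score / math.sqrt(score * score + alpha)
    # Clamp defensively so rounding can never push the value out of range.
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums, chosen only to show the shape of the curve.
for raw in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(f"raw sum {raw:+.1f} -> compound {normalize(raw):+.4f}")
```

The curve is symmetric around zero and flattens as the raw sum grows, so a tweet with many emotional words approaches, but never exceeds, -1 or +1.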

In [5]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

df.tail()

Unnamed: 0,date,timezone,tweet,hashtags,retweet,sentiment,scores,compound
4508,2020-05-13,CDT,Pakistan top management without mask on their ...,[],False,pos,"{'neg': 0.0, 'neu': 0.813, 'pos': 0.187, 'comp...",0.3485
4509,2020-05-13,CDT,@UPS I can't believe that a UPS store employee...,[],False,pos,"{'neg': 0.122, 'neu': 0.682, 'pos': 0.197, 'co...",0.4753
4510,2020-05-13,CDT,How are you making things FUN at work during C...,['#hrcommunity'],False,pos,"{'neg': 0.09, 'neu': 0.668, 'pos': 0.241, 'com...",0.8765
4511,2020-05-13,CDT,"COMMUNITY ALERT (Friday, May 15th from 11am un...","['#teamreynolds', '#hd27', '#givingback', '#pa...",False,pos,"{'neg': 0.0, 'neu': 0.758, 'pos': 0.242, 'comp...",0.8877
4512,2020-05-13,CDT,"NO ONE CARES.... 83,000 PEOPLE HAVE DIED FROM ...",[],False,neg,"{'neg': 0.37, 'neu': 0.63, 'pos': 0.0, 'compou...",-0.8485


In [6]:
# Label by the sign of the compound score; a score of exactly 0 counts as 'pos'.
df['sentiment'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

df.head()

Unnamed: 0,date,timezone,tweet,hashtags,retweet,sentiment,scores,compound
0,2020-05-26,CDT,Words matter! Masks Don’t “Prevent” a person f...,[],False,neg,"{'neg': 0.128, 'neu': 0.847, 'pos': 0.025, 'co...",-0.6696
1,2020-05-26,CDT,"If you are scared wear a mask, which does not ...",[],False,neg,"{'neg': 0.126, 'neu': 0.798, 'pos': 0.076, 'co...",-0.4344
2,2020-05-26,CDT,Why should we wear a sweaty useless mask? Harc...,[],False,neg,"{'neg': 0.25, 'neu': 0.694, 'pos': 0.056, 'com...",-0.8439
3,2020-05-26,CDT,"WEAR A MASK, people!\n\nDoctor recovering from...",[],False,neg,"{'neg': 0.096, 'neu': 0.904, 'pos': 0.0, 'comp...",-0.1759
4,2020-05-26,CDT,Yo Dee why are u wearing a mask in public when...,[],False,neg,"{'neg': 0.077, 'neu': 0.867, 'pos': 0.056, 'co...",-0.2406
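
The labeling rule above is binary, so a tweet with a compound score of exactly 0 (no sentiment detected) ends up as `pos`. A common alternative, suggested in the VADER project's own documentation, is to treat scores between -0.05 and 0.05 as neutral. The sketch below shows that three-way rule; the function name `label_from_compound`, the `neu` label, and the sample scores are our own choices for illustration and are not used elsewhere in this tutorial.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'.

    The 0.05 threshold follows the convention described in the VADER
    documentation; adjust it to make the neutral band wider or narrower.
    """
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# A few sample scores: three taken from the tutorial outputs, two invented
# near-zero values to exercise the neutral band.
samples = [-0.6696, -0.1759, 0.0, 0.03, 0.3485]
labels = [label_from_compound(c) for c in samples]

for score, label in zip(samples, labels):
    print(f"{score:+.4f} -> {label}")
```

With a DataFrame like the one in this tutorial, the same function could be applied with `df['compound'].apply(label_from_compound)`. Keep in mind that a third class changes the downstream task from binary to multi-class classification.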


In [7]:
sentimental_df = df[['date', 'timezone', 'tweet', 'hashtags', 'retweet', 'sentiment']]

In [8]:
sentimental_df.head()

Unnamed: 0,date,timezone,tweet,hashtags,retweet,sentiment
0,2020-05-26,CDT,Words matter! Masks Don’t “Prevent” a person f...,[],False,neg
1,2020-05-26,CDT,"If you are scared wear a mask, which does not ...",[],False,neg
2,2020-05-26,CDT,Why should we wear a sweaty useless mask? Harc...,[],False,neg
3,2020-05-26,CDT,"WEAR A MASK, people!\n\nDoctor recovering from...",[],False,neg
4,2020-05-26,CDT,Yo Dee why are u wearing a mask in public when...,[],False,neg


In [9]:
sentimental_df.to_csv('./covid19_mask_tweets.csv', index=False)