# Labeling Data with VADER

The data used in this tutorial was scraped from Twitter using [Twint](https://github.com/twintproject/twint), an advanced Twitter scraping tool written in Python. It was then partially cleaned and labeled as negative or positive using the VADER (Valence Aware Dictionary and sEntiment Reasoner) model, which ships with the Natural Language Toolkit (NLTK) library. VADER performs sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion, and it can be applied directly to unlabeled text data. The steps below show the labeling process.


**Natural Language is a difficult problem.** One of the most challenging aspects of supervised Machine Learning and Deep Learning is obtaining good data. What counts as good data depends on the context. Labeling data is an expensive process and is not free from human bias. The dataset and the labels created for it in this tutorial are far from perfect.

**A more sophisticated labeling approach needs to be applied** to get more accurate labels, and this often requires a human in the loop. However, because this tutorial aims to introduce participants to the basic concepts of NLP, we decided to focus less on the accuracy of the model (or the labels) and more on the steps needed to build an end-to-end model. We believe that better data beats better algorithms. Those who would like to experiment further will need to iterate on the preparation and cleaning of the data and try out more sophisticated models to achieve better accuracy.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('<the_data_file_you_want_to_label>.csv')

In [None]:
df.head()

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon (only needed once per environment).
nltk.download('vader_lexicon')

sentiment_analyzer = SentimentIntensityAnalyzer()

# The data we collected from Twitter has a `tweet` column. This is the column we want to label.
df['scores'] = df['tweet'].apply(sentiment_analyzer.polarity_scores)
df.head()

#### **We will remove the usernames from tweets that are replies to other people**

+ Usernames in tweet replies start with an @ symbol.

In [None]:
# Drop every whitespace-separated token that starts with '@'.
f = lambda x: ' '.join([item for item in x.split() if not item.startswith('@')])
df["tweet"] = df["tweet"].apply(f)

#### **We will also remove any URLs in tweets to reduce noise in the data**

In [None]:
# Drop every whitespace-separated token that starts with 'http://' or 'https://'.
remove_urls = lambda x: ' '.join([item for item in x.split() if not item.startswith(('http://', 'https://'))])
df["tweet"] = df["tweet"].apply(remove_urls)
df.head()
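
The two cleaning steps above split each tweet on whitespace, so a mention or URL glued to punctuation (for example `(https://...`) survives. As a minimal sketch of an alternative, the steps could be combined with regular expressions. The helper name `clean_tweet` and both patterns are illustrative assumptions, not part of the original notebook, and they are not exhaustive.

```python
import re

# Illustrative patterns (assumptions, not from the original notebook).
URL_RE = re.compile(r'https?://\S+')        # http:// or https:// up to the next whitespace
MENTION_RE = re.compile(r'(?<!\w)@\w+')     # @username that is not part of an e-mail address


def clean_tweet(text):
    """Remove URLs and @mentions, then collapse the leftover whitespace."""
    text = URL_RE.sub('', text)       # remove URLs first, since a URL may contain '@'
    text = MENTION_RE.sub('', text)
    return ' '.join(text.split())
```

If this variant is preferred, it could be applied with `df["tweet"] = df["tweet"].apply(clean_tweet)` in place of the two lambdas.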

#### `compound` **in VADER is computed by summing the valence scores of the words in the text and normalizing the result to the range -1 to 1**

In [None]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.tail()
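
To make the normalization step concrete, here is a minimal sketch of the squashing function that, to our understanding, VADER's reference implementation uses to map the summed valence score into the range -1 to 1. The function name and the default `alpha=15` follow that implementation as we recall it, so treat them as assumptions and check the NLTK source before relying on them.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)
```

Large positive sums approach 1 and large negative sums approach -1, while a sum of 0 stays at 0, which is why `compound` can be compared across tweets of different lengths.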

#### **We apply a simple rule here to decide whether a sentiment is negative or positive**

+ You would probably want to spend more time on how to make this decision. For now, we will keep it simple. Note that a compound score of exactly 0 (neutral) is labeled `pos` by this rule.

In [None]:
df['sentiment'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
df.head()
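
One common refinement of the rule above is to reserve a neutral band around zero. The sketch below uses the thresholds of ±0.05 that the VADER authors suggest as typical cut-offs. The function name `label_sentiment` and the `'neu'` label are illustrative assumptions and are not used elsewhere in this tutorial.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'  # scores close to zero are treated as neutral
```

It could be applied with `df['compound'].apply(label_sentiment)`. Keep in mind that this yields three classes, so any downstream binary classifier would need to drop or reassign the neutral rows.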

In [7]:
df = df[['tweet', 'sentiment']]
df.head()

#### **Save your labeled data to a CSV file**

In [9]:
df.to_csv('<your_file_name>.csv', index=False)