# Natural Language Processing 101 Workshop
### Demo 2: Sentiment Analysis with VADER

In this demo, we'll perform an exploratory analysis of the emotional 
sentiment that r/nyc users express in their comment text. We will use readr, 
dplyr, and stringr once again, as these packages are core libraries
across NLP use cases. We're also including the **vader** package, which brings the original 
Python VADER sentiment analyzer to R, as well as **ggplot2** for simple
data visualizations. Let's load them all in along with our r/nyc comment data set as follows: 

In [None]:
library(readr)
library(dplyr)
library(stringr)
library(vader)
library(ggplot2)

nyc <- read_csv("nyc_reddit.csv")

Let’s consider three individual string examples in sequence to hone our intuition for how VADER generates its sentiment scores. We'll be using get_vader, VADER's core function for analyzing the sentiment of individual records. 

In [None]:
get_vader("As someone who always depended on cars before, I LOVE the subway! <3")

In [None]:
get_vader("The subway is very helpful, but I'm not a fan of the rats.")

In [None]:
get_vader("I hate how delayed the subway always is… being late for work sucks. :(")

Each get_vader call returns the individual score of every word, reflecting its polarity and intensity, along with the four overall sentiment scores for the string (positive, negative, neutral, and compound). The first string is scored strongly positive, and its score reflects both the capitalization of “love” and the heart emoticon. The second string combines a positive connotation through “helpful” with a negative tone through “not a fan,” leading to a weakly negative compound score. The final string is accurately identified as strongly negative, and its score captures the intention behind the sad face emoticon.
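To make the compound score a little less of a black box: in the original Python VADER implementation, the summed valence of a text is squashed into the range -1 to 1 by dividing it by the square root of the squared sum plus a constant (15 by default). The sketch below reproduces only that normalization step in base R. The helper name `normalize_valence` is hypothetical and not part of the vader package, the input sums are made up, and the adjustments VADER applies before summing (capitalization, punctuation, negation, "but" handling) are deliberately left out.

```r
# Hypothetical helper (not part of the vader package): VADER-style
# normalization of a summed valence score into the open interval (-1, 1).
# alpha = 15 is the default constant in the original Python implementation.
normalize_valence <- function(valence_sum, alpha = 15) {
  valence_sum / sqrt(valence_sum^2 + alpha)
}

# Made-up summed valences, for illustration only
sums <- c(-8, -2, 0, 2, 8)
normalized <- normalize_valence(sums)
round(normalized, 3)
```

The takeaway is that the compound score saturates: a handful of strongly valenced words is enough to push it close to -1 or 1, so very long comments do not score arbitrarily high.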

Having checked VADER’s scoring on text very similar to our r/nyc comments, let’s go ahead and create a data frame of the VADER metrics for each of our comments. VADER has a dedicated function, vader_df, that generates its four sentiment scores across a full column of text and returns them as a separate data frame. This process can take quite some time, however, so I've gone ahead and pre-generated the results, which we can load directly from another CSV. 

In [None]:
# Don't actually run this: scoring every comment takes a while
# nyc_sentiment <- vader_df(nyc$body)
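If you do want to generate the scores yourself at some point, one way to avoid paying the cost more than once is to cache the result on disk. The following is a minimal base R sketch of that pattern. `cached_csv` is a hypothetical helper name, and the demonstration uses a tiny stand-in function and a temporary file rather than the real scoring step.

```r
# Hypothetical helper: run an expensive computation once and cache it as a CSV.
# On later calls the cached file is read instead of recomputing.
cached_csv <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- compute()
  write.csv(result, path, row.names = FALSE)
  result
}

# Toy demonstration with a stand-in for the slow step (invented values)
cache_path <- tempfile(fileext = ".csv")
slow_step <- function() {
  data.frame(text = c("a", "b"), compound = c(0.5, -0.5),
             stringsAsFactors = FALSE)
}
first  <- cached_csv(cache_path, slow_step)  # computes and writes the file
second <- cached_csv(cache_path, slow_step)  # reads the cached file
```

In the workshop setting, the call might look like `cached_csv("my_sentiment.csv", function() vader_df(nyc$body))`, where the file name is only an example.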

In [None]:
nyc_sentiment <- read_csv('r_nyc_10k_sentiment.csv')
head(nyc_sentiment)

Viewing the top of the generated data frame, we can see the positive, negative, neutral, and compound scores for each comment. There's also a "but_count" column, which counts occurrences of "but" to flag shifts in sentiment within a comment, e.g. "it's okay, but not my favorite". 
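When reading compound scores, a common convention from the VADER documentation is to treat values of 0.05 and above as positive, values of -0.05 and below as negative, and anything in between as neutral. The sketch below applies that convention in base R. The helper name `label_compound` is hypothetical, the thresholds are a convention rather than a hard rule, and the scores are invented for illustration.

```r
# Hypothetical helper: bucket compound scores using the conventional
# +/- 0.05 thresholds (an assumption; adjust to suit your data).
label_compound <- function(compound, threshold = 0.05) {
  ifelse(compound >= threshold, "positive",
         ifelse(compound <= -threshold, "negative", "neutral"))
}

# Invented scores, for illustration only
toy_scores <- c(0.84, 0.02, -0.05, -0.61)
labels <- label_compound(toy_scores)
table(labels)
```

On the real data, `table(label_compound(nyc_sentiment$compound))` would give a quick overview of how the comments split across the three buckets.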

Now that we have scores for our entire dataset, let’s investigate which comments have the highest and lowest compound scores. We’ll start on the positive side with some dplyr-powered data frame manipulation:

In [None]:
top_pos <- nyc_sentiment %>%
    top_n(5, compound)

top_pos$text

Briefly reviewing the text, we can see that the most positive comments under VADER’s classification speak highly of particular locations or community initiatives, or are friendly replies to other users within the subreddit. The scores are driven largely by words such as "better", "grand", and "well" that the VADER dictionary classifies as positively valenced. 

Let’s replicate this for the most negative comments.

In [None]:
top_neg <- nyc_sentiment %>%
    top_n(-5, compound)

top_neg

We’ll skip reading through the most negative comments in detail because of their often disturbing content, but skimming the text gives a rather clear picture of their general theme: crime within the city. The fact that these are the most negatively scored comments suggests that VADER catches the negative emotional valence of text related to violence and fear quite effectively.

As a final exercise, we’ll explore whether there’s a relationship between an r/nyc comment’s community score and its expressed sentiment as identified by VADER. Comments can be either upvoted or downvoted by other users. This produces a score that serves as a proxy for the collective community reaction to a given comment. 

To prepare for this analysis, we’ll first have to combine our two data frames: the baseline Reddit data and the VADER scores by comment. Luckily, the 'body' column of comments can serve as a join key for a simple merge, provided the comment text is unique. 

In [None]:
nyc_full <- merge(nyc, nyc_sentiment, by.x = "body", by.y = "text")
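One caveat when joining on comment text: merge() matches every pair of rows with equal keys, so a comment body that appears more than once (for example, a short reply that several users posted verbatim) produces extra rows. The sketch below shows the effect and one possible remedy on toy data. The data frames `toy_comments` and `toy_sentiment` are invented stand-ins for nyc and nyc_sentiment, and dropping repeated sentiment rows is only one option; it relies on identical text always receiving identical VADER scores.

```r
# Toy stand-ins for nyc and nyc_sentiment (invented rows)
toy_comments <- data.frame(
  body  = c("great park", "thanks", "thanks"),
  score = c(12, 3, -1),
  stringsAsFactors = FALSE
)
toy_sentiment <- data.frame(
  text     = c("great park", "thanks", "thanks"),
  compound = c(0.62, 0.44, 0.44),
  stringsAsFactors = FALSE
)

# How many rows repeat an earlier comment body?
n_repeats <- sum(duplicated(toy_comments$body))

# Naive join: each repeated key matches every copy on the other side
naive <- merge(toy_comments, toy_sentiment, by.x = "body", by.y = "text")

# One option: keep a single sentiment row per distinct text before joining
deduped <- toy_sentiment[!duplicated(toy_sentiment$text), ]
joined  <- merge(toy_comments, deduped, by.x = "body", by.y = "text")

c(original = nrow(toy_comments), naive = nrow(naive), deduped = nrow(joined))
```

Comparing `nrow(nyc_full)` with `nrow(nyc)` is a quick way to check whether the real join was affected.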

We can now explore the sentiment trends of comments with a score above 20 through a ggplot scatter plot. Notice how the following code subsets our dataset to the subpopulation of high-upvote comments. 

In [None]:
ggplot(nyc_full[which(nyc_full$score>20),], aes(x=compound, y=score)) + geom_point()

Let's look at the equivalent for comments that received a negative score. We'll reuse the same ggplot call, adjusting the condition to capture the negative-scoring subgroup. 

In [None]:
ggplot(nyc_full[which(nyc_full$score<0),], aes(x=compound, y=score)) + geom_point()
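Scatter plots can be hard to read by eye, so it may help to pair them with a single summary number. One option is a Spearman rank correlation between compound sentiment and score, which depends only on the ordering of the values and is therefore less sensitive to a few extremely high or low scores. The sketch below uses base R's cor() on a toy data frame. `toy_full` is an invented stand-in for nyc_full, and its values carry no meaning.

```r
# Toy stand-in for nyc_full (invented values, for illustration only)
toy_full <- data.frame(
  compound = c(-0.8, -0.3, 0.0, 0.4, 0.9, 0.6),
  score    = c(-4, 25, 2, 40, 31, -2)
)

# Same subsetting idea as the plots: comments scoring above 20
high <- toy_full[toy_full$score > 20, ]

# Rank correlation across all rows and within the high-score subgroup
rho_all  <- cor(toy_full$compound, toy_full$score, method = "spearman")
rho_high <- cor(high$compound, high$score, method = "spearman")
```

On the real data, the equivalent call would be `cor(nyc_full$compound, nyc_full$score, method = "spearman", use = "complete.obs")`, where `use = "complete.obs"` drops rows with missing values.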

And those are the fundamentals of exploring text data with a dictionary-based sentiment analyzer! As you can imagine, there's a wide range of extensions you can build from these foundations: looking at sentiment over time, across subgroups, before and after significant events, and beyond. Let's now return to our discussion of additional NLP methods back in the main workshop slides. 