# Chapter 22 - Sentiment analysis using VADER

In this notebook, we focus on sentiment analysis, which is the task of determining whether a text expresses a negative, neutral, or positive opinion. We introduce how to work with [VADER](https://github.com/cjhutto/vaderSentiment), as included in NLTK, to perform sentiment analysis. Given a sentence, e.g., "I like Python", VADER predicts a sentiment score on a scale from -1 (very negative) to 1 (very positive). The goal of this notebook is to show you how to work with VADER. One of the learning goals of the accompanying assignment is to gain insight into VADER by reading blogs about the system.

### At the end of this notebook, you will:
* have VADER installed on your computer
* be able to load the VADER model
* be able to apply the VADER model on new sentences:
    * with and without lemmatization
    * restricted to certain parts of speech, e.g., providing only the adjectives from a sentence as input to VADER
    
### If you want to learn more about this chapter, you might find the following links useful:
* [blog on sentiment analysis](https://towardsdatascience.com/quick-introduction-to-sentiment-analysis-74bd3dfb536c)
* [GitHub repository](https://github.com/cjhutto/vaderSentiment)
* [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html)

If you have **questions** about this chapter, please contact **Marten (m.c.postma@vu.nl)**.

## 1. Downloading the VADER package
Please run the following commands to download VADER to your computer.

In [None]:
import nltk

In [None]:
# You only need to run this cell once.
# After that, you can comment it out.

nltk.download('vader_lexicon', quiet=False)

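To verify that the VADER module is available in your NLTK installation, you can run the following import. Note that the import alone does not load the lexicon; a missing lexicon only surfaces when the model is created in Section 2.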

In [None]:
from nltk.sentiment import vader
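
A more direct check is to ask NLTK whether it can locate the downloaded resource. The following is a minimal sketch: `vader_lexicon_available` is a hypothetical helper (not part of NLTK), and the resource path assumes NLTK's default data layout, in which the lexicon is stored as `sentiment/vader_lexicon.zip`.

```python
def vader_lexicon_available():
    """Return True if NLTK is installed and can locate the VADER lexicon."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is not installed.
        return False
    try:
        # Assumed resource path (NLTK's default layout for 'vader_lexicon').
        nltk.data.find('sentiment/vader_lexicon.zip')
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing.
        return False
    return True


print('VADER lexicon found:', vader_lexicon_available())
```

If this prints `False`, rerunning the `nltk.download('vader_lexicon')` cell above is the first thing to try.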

## 2. Load VADER model
The model can be loaded in the following way.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
vader_model = SentimentIntensityAnalyzer()

We will use the following three sentences:

In [None]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."] 

The next for loop assigns a sentiment score from VADER to **each sentence**.

In [None]:
for sent in sentences:
    scores = vader_model.polarity_scores(sent)
    print()
    print('INPUT SENTENCE', sent)
    print('VADER OUTPUT', scores)

VADER returns a dictionary with four keys for each sentence.
The keys *neg*, *neu*, and *pos* give the proportions of the text that are rated as negative, neutral, and positive, respectively.
The *compound* key holds a single overall score, computed from the lexicon ratings of the words in the sentence and normalized to range from -1, i.e., very negative, to 1, i.e., very positive. You can read more about the VADER system on [this blog](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).
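
If you need a label rather than a number, the VADER README suggests thresholds of 0.05 and -0.05 on the compound score. The sketch below applies them. Note that `label_from_scores` is a hypothetical helper (not part of NLTK), and the example dictionary is written by hand for illustration; it is not real VADER output.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Hand-written dict with the same four keys that VADER returns
# (illustrative values only, not actual VADER output).
example_scores = {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.4}
print(label_from_scores(example_scores))
```

In the loop above, you could call `label_from_scores(scores)` on each dictionary that `vader_model.polarity_scores` returns.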

## 3. Using spaCy to manipulate the input to VADER 

In the examples in Section 2, VADER always takes into account each token, i.e., word, in the sentence to arrive at its sentiment prediction. In this section,
we use spaCy as a tool to manipulate the input to VADER. Manipulating the input and inspecting the output is one way to gain insight into how language systems work.

Please first install spaCy by following the instructions from **Chapter 19 - More about Natural Language Processing Tools (spaCy) -- Section 2.1 Installing and loading spaCy**.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') # the 'en' shortcut no longer works in spaCy v3

The next function defines an API to process texts (`textual_unit`) using different settings. This function operates on texts and assumes that spaCy is loaded with the corresponding language model, as we just did, and that `vader_model` from Section 2 is available. Take a little time to analyze the function, which uses certain spaCy token properties to process the text in different ways and returns the VADER sentiment.

In [None]:
def run_vader(nlp,
              textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a text after processing it with spaCy
    
    :param nlp: spaCy model
    :param str textual_unit: a textual unit, e.g., one or more sentences (one string);
    the function loops over doc.sents
    :param bool lemmatize: if True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered
    :param int verbose: if >= 1, the input and output are printed;
    if >= 2, each token and its part of speech are printed as well
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:
            
            if verbose >= 2:
                print(token, token.pos_)

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                # spaCy v2 lemmatizes pronouns to '-PRON-'; keep the original word instead
                if to_add == '-PRON-': 
                    to_add = token.text

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        print('INPUT SENTENCE', doc)  # the full text, not only its last sentence
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores
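
To see the selection logic of `run_vader` in isolation, the sketch below reproduces it without spaCy or VADER. It is a simplified stand-in, not the function above: `select_tokens` is a hypothetical helper, and the tokens are hand-annotated `(text, lemma, part of speech)` tuples, so the labels are assumptions and spaCy's own output may differ.

```python
def select_tokens(tokens, lemmatize=False, parts_of_speech_to_consider=None):
    """Return the list of strings that would be joined and passed to VADER.

    tokens: iterable of (text, lemma, pos) tuples standing in for spaCy tokens.
    """
    selected = []
    for text, lemma, pos in tokens:
        # Use the lemma instead of the surface form if requested.
        to_add = lemma if lemmatize else text
        # No filter (None or empty set) keeps every token.
        if not parts_of_speech_to_consider or pos in parts_of_speech_to_consider:
            selected.append(to_add)
    return selected


# Hand-annotated example tokens (assumed labels, not produced by spaCy).
tokens = [('The', 'the', 'DET'), ('dogs', 'dog', 'NOUN'),
          ('were', 'be', 'AUX'), ('happy', 'happy', 'ADJ')]

print(select_tokens(tokens))
print(select_tokens(tokens, lemmatize=True))
print(select_tokens(tokens, parts_of_speech_to_consider={'ADJ'}))
print(select_tokens(tokens, parts_of_speech_to_consider={'VERB'}))
```

The last call returns an empty list because no token carries the label `VERB`, so joining it yields an empty string.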

We can now use various API settings to experiment with processing text in different ways.

### 3.1 Lemmatization
The first setting is to lemmatize the provided sentence. If you want to know more about lemmas, you can read [this blog](https://www.retresco.de/en/encyclopedia/lemma/). If you want the function to print more information, you can set the keyword parameter **verbose** to 1.

In [None]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."]

In [None]:
prediction = run_vader(nlp, sentences[1], lemmatize=False, verbose=1)

In [None]:
print(prediction)

In [None]:
prediction = run_vader(nlp, sentences[1], lemmatize=True, verbose=1)

In [None]:
print(prediction)

Perhaps you are surprised to see that there is no difference in the output! This tells you something about how the system works: VADER looks words up in a lexicon, and the lemma of *nice* is simply *nice*. Perhaps there are sentences for which lemmatization does matter. Feel free to experiment with other sentences.

### 3.2 Filter on part of speech
You can also filter on the part of speech, i.e., we let VADER make a prediction by only considering the nouns, verbs, or adjectives. The manipulation of the input to VADER allows you to gain insight into how the system works.

Only nouns:

In [None]:
run_vader(nlp, 
          sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'NOUN'},
          verbose=1)

Please note that in this case, VADER only considers *day* to predict the sentiment score and ignores all other words. Do you agree with the assessment that *day* is neutral? I hope you have a great day!

Only verbs:

In [None]:
run_vader(nlp, 
          sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'VERB'},
          verbose=1)

This is even more interesting. The part of speech label *VERB* is not applied to any of the tokens (*'s* is labeled as auxiliary and not with the label VERB). We have not provided VADER with input at all!

Let's also try adjectives:

In [None]:
run_vader(nlp,
          sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'ADJ'},
          verbose=1)

Very interesting! By only considering adjectives, i.e., *nice*, VADER rates the input as entirely positive, since the only word it sees is a positive one! I hope that you start to get an understanding of how VADER works.
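
To compare the three filters side by side, you could collect the compound scores in one dictionary. The sketch below is one way to do that. `compare_pos_filters` is a hypothetical helper, and it takes the scoring function as an argument so that it does not depend on any particular model being loaded.

```python
def compare_pos_filters(score_fn, text, pos_tags=('NOUN', 'VERB', 'ADJ')):
    """Score `text` once without a filter and once per part of speech.

    score_fn(text, pos_set) must return a VADER-style dict with a
    'compound' key; pos_set is None when no filter is applied.
    """
    # 'ALL' holds the score for the unfiltered text.
    results = {'ALL': score_fn(text, None)['compound']}
    for pos in pos_tags:
        # Restrict the input to a single part of speech at a time.
        results[pos] = score_fn(text, {pos})['compound']
    return results
```

With the objects defined in this notebook, a call could look like `compare_pos_filters(lambda text, pos: run_vader(nlp, text, lemmatize=True, parts_of_speech_to_consider=pos), sentences[1])`.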