# Lab3.2 Sentiment analysis using VADER

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we introduce how to use [VADER](https://github.com/cjhutto/vaderSentiment), as included in NLTK, to perform sentiment analysis.

**At the end of this notebook, you will**:
* have VADER installed on your computer
* be able to load the VADER model
* be able to apply the VADER model on new sentences:
    * with and without lemmatization
    * selecting only certain parts of speech, e.g., only providing the adjectives from a sentence as input to VADER

### About VADER
VADER is a rule-based system that makes use of a lexicon. The lexicon can be found [here](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). It associates words with valence scores; a group of crowd workers rated words on a scale of -4 (very negative) to 4 (very positive). For each word, the lexicon lists the mean and standard deviation of the ratings as well as the raw ratings themselves. VADER is particularly useful for tweets; the lexicon includes emojis and internet slang as well as 'regular' words.

VADER is a very basic tool that relies mostly on the lexicon created by crowd annotations. It can easily be adapted to suit your needs, but the approach remains basic.
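
To make the lexicon structure more concrete, the sketch below parses a single lexicon line. It assumes the tab-separated layout used in the linked file (token, mean rating, standard deviation, list of raw ratings). The helper `parse_lexicon_line` and the sample line are made up for illustration; the sample is not copied from the real lexicon.

```python
import json
from statistics import mean, pstdev


def parse_lexicon_line(line):
    """Split one tab-separated lexicon line into its four fields."""
    token, mean_rating, std_dev, raw_ratings = line.rstrip("\n").split("\t")
    return {
        "token": token,
        "mean": float(mean_rating),
        "std": float(std_dev),
        # The raw ratings are written as a bracketed, comma-separated list
        "ratings": json.loads(raw_ratings),
    }


# Hypothetical line in the assumed format (NOT a real lexicon entry)
sample_line = "exampleword\t2.0\t0.63246\t[2, 1, 2, 3, 2, 1, 2, 2, 3, 2]"

entry = parse_lexicon_line(sample_line)
print(entry["token"], entry["mean"], entry["std"])

# The stored mean can be recomputed from the raw ratings
print("recomputed mean:", mean(entry["ratings"]))
print("recomputed std :", round(pstdev(entry["ratings"]), 5))
```

Only the mean rating is needed as the valence of a word; the standard deviation and raw ratings are useful when you want to judge how much the annotators agreed.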

## Downloading VADER package
Please run the following commands first to download VADER to your computer.

In [2]:
import nltk

In [4]:
### Once you have downloaded VADER successfully, you do not need to do this again.
### You can comment it out in your personal copy, as I did below, to skip this step.

#nltk.download('vader_lexicon', quiet=False)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\szcze\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

To verify that the download was successful, you can run the following command.

In [6]:
from nltk.sentiment import vader
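
If you prefer an explicit check over a silent import, the sketch below asks NLTK whether it can locate the lexicon on disk. The helper name `vader_lexicon_available` is hypothetical, and the resource path `sentiment/vader_lexicon.zip` is an assumption about where NLTK stores the package; adjust it if your installation differs.

```python
import importlib.util


def vader_lexicon_available():
    """Return True if NLTK is importable and can locate the VADER lexicon."""
    if importlib.util.find_spec("nltk") is None:
        return False  # NLTK itself is not installed

    import nltk

    try:
        # Assumed resource path of the downloaded package
        nltk.data.find("sentiment/vader_lexicon.zip")
    except LookupError:
        # NLTK raises LookupError when a resource is missing
        return False
    return True


print("VADER lexicon found:", vader_lexicon_available())
```

If this prints `False`, running `nltk.download('vader_lexicon')` once should resolve it.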

## Using VADER
The model can be loaded in the following way.

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [10]:
vader_model = SentimentIntensityAnalyzer()

We will use the following three sentences:

In [13]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."] 

The following `for` loop assigns a sentiment score from VADER to each sentence.

As you can see, VADER outputs four scores. `neg`, `neu` and `pos` are the proportions of the text that have negative, neutral and positive valence. The `compound` score is the result of the rule-based computation and reflects the valence of the sentence as a whole; it is normalized to lie between -1 (most negative) and 1 (most positive).

In [16]:
for sent in sentences:
    scores = vader_model.polarity_scores(sent)
    print()
    print('INPUT SENTENCE', sent)
    print('VADER OUTPUT', scores)


INPUT SENTENCE Here are my sentences.
VADER OUTPUT {'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'compound': 0.0516}

INPUT SENTENCE It's a nice day.
VADER OUTPUT {'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.4215}

INPUT SENTENCE It's a rainy day.
VADER OUTPUT {'neg': 0.394, 'neu': 0.606, 'pos': 0.0, 'compound': -0.0772}
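
The `compound` values above can be reproduced by hand. The sketch below assumes the normalization used in the VADER reference implementation, `x / sqrt(x**2 + alpha)` with `alpha = 15`, applied to the summed valence of the sentence. It also assumes that 'nice' is the only lexicon word in "It's a nice day." and that its valence is 1.8; that value is inferred from the output above, not read from the lexicon here.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Assumption: the summed valence of "It's a nice day." is 1.8 (from 'nice')
assumed_valence_sum = 1.8
print(round(normalize(assumed_valence_sum), 4))  # 0.4215

# A sentence without any lexicon words stays at exactly 0
print(normalize(0.0))  # 0.0
```

Because of this squashing, doubling the summed valence does not double the compound score; strongly worded sentences approach -1 or 1 without ever reaching them.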


## Using VADER with spaCy
We can manipulate the input to VADER by providing the lemmas as input instead of the words and by only providing words with certain parts of speech, e.g., only adjectives. We use spaCy for the lemmatization and part of speech tagging.

In [19]:
import spacy
nlp = spacy.load('en_core_web_sm')

The function `run_vader` defines an API to process texts (`textual_unit`) using different settings. It operates on a single string and assumes that the spaCy English model is loaded as `nlp` and VADER as `vader_model`. The keyword arguments control whether lemmatization is used and which parts of speech are passed on to VADER.

Take some time to analyse the function, which uses certain spaCy token properties to process the text in different ways and returns the VADER sentiment.

In [21]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a textual unit after processing it with spaCy
    
    :param str textual_unit: a textual unit, e.g., a sentence or several
    sentences in one string (the function loops over doc.sents)
    :param bool lemmatize: If True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -None or empty set: all parts of speech are provided
    -non-empty set: only these parts of speech are considered.
    :param int verbose: if set to 1, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

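            if lemmatize:
                to_add = token.lemma_

                # spaCy v2 returns '-PRON-' as the lemma of pronouns;
                # fall back to the original word in that case
                if to_add == '-PRON-': 
                    to_add = token.text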

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    if verbose >= 1:
        print()
        # print the full input; `sent` would only hold the last sentence
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)

    return scores
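
The selection logic inside `run_vader` can also be studied in isolation, without spaCy or VADER. The sketch below is a simplified stand-in, not the function above: the helper `select_tokens` is hypothetical, it skips the `'-PRON-'` special case, and the `(text, lemma, part of speech)` triples are written by hand to match the outputs shown in this notebook. A real spaCy model may tag the sentence differently.

```python
def select_tokens(tokens, lemmatize=False, parts_of_speech_to_consider=None):
    """Return the strings that would be passed on to VADER.

    tokens: iterable of (text, lemma, pos) triples
    """
    selected = []
    for text, lemma, pos in tokens:
        # None or an empty set means: keep every part of speech
        if parts_of_speech_to_consider and pos not in parts_of_speech_to_consider:
            continue
        selected.append(lemma if lemmatize else text)
    return selected


# Hand-written analysis of "It's a nice day." (assumed tags and lemmas)
tagged = [("It", "it", "PRON"), ("'s", "be", "AUX"), ("a", "a", "DET"),
          ("nice", "nice", "ADJ"), ("day", "day", "NOUN"), (".", ".", "PUNCT")]

print(select_tokens(tagged))
print(select_tokens(tagged, lemmatize=True))
print(select_tokens(tagged, lemmatize=True,
                    parts_of_speech_to_consider={"ADJ", "NOUN"}))
```

Note that a filter which matches no token yields an empty list, so VADER then receives an empty string.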

We can now use various API settings to experiment with processing text in different ways.

#### Provide VADER with lemmas

In [24]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."]

In [25]:
run_vader(sentences[1], lemmatize=True)

{'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}

Do you notice any differences with the previous output?

#### Verbose output
If you want the function to print more information, you can set the keyword argument `verbose` to 1.

In [32]:
run_vader(sentences[1], lemmatize=True, verbose=1)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['it', 'be', 'a', 'nice', 'day', '.']
VADER OUTPUT {'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}


{'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}

#### Using specific parts-of-speech
You can also filter on part of speech using the keyword argument `parts_of_speech_to_consider`.

Only consider nouns:

In [36]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'NOUN'},
          verbose=1)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['day']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Only verbs:

In [39]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'VERB'},
          verbose=1)


INPUT SENTENCE It's a nice day.
INPUT TO VADER []
VADER OUTPUT {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}


{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

Only adjectives:

In [42]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'ADJ'},
          verbose=1)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['nice']
VADER OUTPUT {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}


{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
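
For many experiments you will want a single label rather than four numbers. The sketch below maps the `compound` score of a VADER output dict to a label. The helper `label_from_scores` is hypothetical, and the cut-off of 0.05 is an assumption based on the thresholds commonly suggested for VADER; treat it as a setting to tune, not as a fixed rule.

```python
def label_from_scores(scores, threshold=0.05):
    """Turn a VADER output dict into 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Output dicts copied from the cells in this notebook
examples = {
    "It's a nice day.": {'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.4215},
    "It's a rainy day.": {'neg': 0.394, 'neu': 0.606, 'pos': 0.0, 'compound': -0.0772},
    "nouns only: ['day']": {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
}

for text, scores in examples.items():
    print(text, '->', label_from_scores(scores))
```

With this cut-off, restricting the input to nouns turns "It's a nice day." from positive into neutral, which shows how strongly the part-of-speech filter can affect the final label.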

## End of this Notebook