# Lab3.3 Sentiment analysis with rules using VADER

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we show how to use [VADER](https://github.com/cjhutto/vaderSentiment), as included in NLTK, to perform sentiment analysis. VADER is a rule-based system that uses a lexicon with sentiment values created through crowd annotation of tweets.

**At the end of this notebook, you will**:
* have VADER installed on your computer
* be able to load the VADER model
* be able to apply the VADER model to any text:
    * with and without lemmatization
    * with only certain parts of speech as input, e.g., only the adjectives from a sentence

## Downloading the VADER package
Please run the following commands first to download the VADER lexicon to your computer.

In [1]:
import nltk

In [2]:
### Once you have downloaded VADER successfully, you do not need to do this again.
### You can comment it out in your personal copy to skip this step.

nltk.download('vader_lexicon', quiet=False)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/piek/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

To verify that the download was successful, you can run the following command. If no error message appears, the download worked.

In [4]:
from nltk.sentiment import vader

VADER is a rule-based system that makes use of a lexicon. The lexicon can be found [here](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt).
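
As a minimal sketch of how such a lexicon can be read: the file is plain text with one entry per line, and tab-separated fields give the token, its mean valence, and further rating statistics. The function name `parse_lexicon` and the sample lines are hypothetical. The two valences are illustrative values consistent with the compound scores reported later in this notebook, and the remaining columns are left as placeholders rather than copied from the real file.

```python
def parse_lexicon(lines):
    """Map each token to its mean valence, ignoring the remaining columns."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # only the first two tab-separated fields are needed
        token, valence = line.split("\t")[:2]
        lexicon[token] = float(valence)
    return lexicon


# Hypothetical sample in the lexicon's tab-separated layout.
# '<sd>' and '<ratings>' are placeholders, not values from the real file.
sample_lines = [
    "nice\t1.8\t<sd>\t<ratings>",
    "rainy\t-0.3\t<sd>\t<ratings>",
]

lexicon = parse_lexicon(sample_lines)
print(lexicon)
```

Words that are missing from the resulting dictionary carry no sentiment value of their own, which is why a lexicon-based system depends so heavily on coverage.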

## Load VADER model
The model can be loaded in the following way.

In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [6]:
vader_model = SentimentIntensityAnalyzer()

We will use the following three sentences:

In [7]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."] 

The next for loop assigns a sentiment score from VADER to **each sentence**.

In [8]:
for sent in sentences:
    scores = vader_model.polarity_scores(sent)
    print()
    print('INPUT SENTENCE', sent)
    print('VADER OUTPUT', scores)


INPUT SENTENCE Here are my sentences.
VADER OUTPUT {'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'compound': 0.0516}

INPUT SENTENCE It's a nice day.
VADER OUTPUT {'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.4215}

INPUT SENTENCE It's a rainy day.
VADER OUTPUT {'neg': 0.394, 'neu': 0.606, 'pos': 0.0, 'compound': -0.0772}
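
The `compound` value can be reproduced by hand. The sketch below assumes the normalization used in the VADER reference implementation, x / sqrt(x^2 + 15), applied to the summed valence x of a sentence. The function name `normalize_valence` is hypothetical, and the summed valences (0.2, 1.8 and -0.3) are assumptions inferred from the outputs above rather than looked up in the lexicon.

```python
import math


def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    return max(-1.0, min(1.0, score))


# Assumed summed valences, inferred from the compound scores printed above.
assumed_sums = {
    "Here are my sentences.": 0.2,
    "It's a nice day.": 1.8,
    "It's a rainy day.": -0.3,
}

for sentence, total in assumed_sums.items():
    print(sentence, round(normalize_valence(total), 4))
```

With these assumed sums, the rounded results agree with the three `compound` values shown above.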


We can manipulate the input to VADER in two ways: by providing lemmas instead of words, and by providing only words with certain parts of speech, e.g., only adjectives. We use spaCy for the lemmatization and part-of-speech tagging.

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the shortcut 'en' only works in spaCy 2.x

The next function defines an API for processing a text (`textual_unit`) with different settings. It assumes that spaCy has been loaded with the corresponding language model, as we just did. Take a little time to analyse the function: it uses spaCy token properties to process the text in different ways and returns the VADER sentiment scores.

In [18]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=set(),
              verbose=False):
    """
    Run VADER on a textual unit that is first processed with spaCy
    
    :param str textual_unit: a textual unit, e.g., a sentence or several sentences (one string);
    its tokens are collected by looping over doc.sents
    :param bool lemmatize: if True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    -empty set -> all parts of speech are provided
    -non-empty set -> only these parts of speech are considered
    :param bool verbose: if True, information is printed
    about input and output
    
    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

            if lemmatize:
                to_add = token.lemma_

                # spaCy 2.x returns '-PRON-' as the lemma of pronouns;
                # in that case we keep the original word
                if to_add == '-PRON-': 
                    to_add = token.text
            else:
                # keep the original token
                # do nothing
                pass

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add)
                else:
                    # we ignore this word and do nothing
                    pass
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
    ### if the verbose flag is set to True we print more info, the default value is False
    if verbose:
        print()
        # print the full input text; the loop variable sent would only hold the last sentence
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)
    else:
        pass # do nothing

    ## we always return the aggregated scores
    return scores
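
To see the selection logic in isolation, the following sketch applies the same two settings to tokens that are already analysed, so it runs without spaCy or VADER. The helper `select_tokens` and the hand-written `(text, lemma, pos)` triples are hypothetical; the triples mirror the analysis of "It's a nice day." that the outputs in this notebook imply.

```python
def select_tokens(tagged_tokens, lemmatize=False, parts_of_speech_to_consider=None):
    """Return the strings that would be joined and passed to VADER."""
    selected = []
    for text, lemma, pos in tagged_tokens:
        # use the lemma only when requested and when it is a real lemma
        to_add = lemma if lemmatize and lemma != '-PRON-' else text
        # an empty or missing filter keeps every token
        if not parts_of_speech_to_consider or pos in parts_of_speech_to_consider:
            selected.append(to_add)
    return selected


# Hand-written analysis (an assumption, not produced by spaCy here)
tagged = [("It", "-PRON-", "PRON"), ("'s", "be", "VERB"), ("a", "a", "DET"),
          ("nice", "nice", "ADJ"), ("day", "day", "NOUN"), (".", ".", "PUNCT")]

print(select_tokens(tagged, lemmatize=True))
print(select_tokens(tagged, lemmatize=True, parts_of_speech_to_consider={'ADJ'}))
```

Keeping the selection separate from the scoring makes it easy to check what VADER actually receives before looking at its scores.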

We can now use various API settings to experiment with processing text in different ways.

Provide VADER with lemmas:

In [15]:
sentences = ["Here are my sentences.",
             "It's a nice day.",
             "It's a rainy day."]

In [16]:
run_vader(sentences[1], lemmatize=True)

{'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}

Compare this with the output for the same sentence without lemmatization. Do you notice any differences?

If you want the function to print more information, you can set the keyword argument **verbose** to True.

In [19]:
run_vader(sentences[1], lemmatize=True, verbose=True)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['It', 'be', 'a', 'nice', 'day', '.']
VADER OUTPUT {'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}


{'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}
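
The compound score is unchanged while `neu` and `pos` shift, which can be traced to the number of tokens that VADER counts. The sketch below rests on assumptions about the scoring that are not stated in this notebook: tokens of a single character are dropped, each sentiment-bearing token adds its valence plus 1 (or minus 1 for negative words) to its class, and every other token counts as one neutral unit. The function name `proportions` and the valence of 1.8 for "nice" are likewise assumptions, inferred from the outputs above.

```python
def proportions(valences):
    """Turn per-token valences into rounded neg/neu/pos proportions."""
    pos_sum = sum(v + 1 for v in valences if v > 0)
    neg_sum = sum(abs(v - 1) for v in valences if v < 0)
    neu_count = sum(1 for v in valences if v == 0)
    total = pos_sum + neg_sum + neu_count
    if total == 0:
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0}  # nothing left to score
    return {'neg': round(neg_sum / total, 3),
            'neu': round(neu_count / total, 3),
            'pos': round(pos_sum / total, 3)}


# "It's a nice day." -> It's, nice, day. ("a" is a single character)
print(proportions([0, 1.8, 0]))
# "It be a nice day ." -> It, be, nice, day ("a" and "." are single characters)
print(proportions([0, 0, 1.8, 0]))
```

Under these assumptions, lemmatization splits "It's" into "It" and "be", which adds one neutral token and so raises `neu` at the expense of `pos`.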

You can also filter on part of speech. 

Only nouns:

In [20]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'NOUN'},
          verbose=True)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['day']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Only verbs:

In [21]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'VERB'},
          verbose=True)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['be']
VADER OUTPUT {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Only adjectives:

In [22]:
run_vader(sentences[1], 
          lemmatize=True, 
          parts_of_speech_to_consider={'ADJ'},
          verbose=True)


INPUT SENTENCE It's a nice day.
INPUT TO VADER ['nice']
VADER OUTPUT {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}


{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}

VADER is a very basic tool that relies mostly on a lexicon created through crowd annotation and tuned to tweets. It can easily be adapted, but the approach remains basic.

## End of this notebook