<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis of Movie Reviews With SpaCy and VADER

_Authors: Kiefer Katovich (SF)_

---

### Learning Objectives
- Understand the goal of basic sentiment analysis.
- Calculate sentiment scores manually using a movie reviews data set and scores tagged by word.
- Practice using the spaCy parser to distill part-of-speech tags from text.
- Fit a model using sentiment and grammar features.
- Use the VADER sentiment analyzer to get more accurate sentiment scores and compare the models.

### Lesson Guide
- [Introduction to Sentiment Analysis](#intro)
- [Load the Word Sentiment Data Set](#load-sen)
    - [Engineer Objectivity and Positive Difference Scores](#adj-scores)
    - [Put Scores in a Parts-of-Speech Dictionary](#pos-dict)
- [Load the Rotten Tomatoes Reviews Data Set](#rt-reviews)
    - [Restrict Data to Reviews With Valid Ratings and Reviews More Than 10 Words Long](#subset)
- [Import SpaCy](#spacy)
    - [Parse All of the Quotes Using SpaCy's Multithreaded Parser](#multi)
- [Create Features With Parts-of-Speech Proportions](#pos-features)
- [Assign Sentiment Scores](#assign)
- [Print Out the Most Positive and Negative Reviews](#print-most)
- [Print Out the Most Objective and Subjective Reviews](#print-most-obj)
- [Build a Model to Classify Fresh vs. Rotten With the Sentiment and Grammar Features](#model)
- [Use the VADER Library to Get Better Sentiment Scores](#vader)
    - [Fit a Model Using the VADER Sentiment Features](#vader-model)

<a id='intro'></a>

## Introduction to Sentiment Analysis
---

Sentiment analysis is one of the most popular topics in NLP. Most commonly, it's the quantification of text into valence and subjectivity scores.

First, we'll load in a data set of precoded sentiment scores for positivity and negativity of words. These words are also tagged with their part of speech in the sentence. We can use these valence scores to evaluate the sentiment of Rotten Tomatoes movie reviews. Many packages, such as TextBlob, come prepackaged with sentiment scores for words after parsing text. But doing the sentiment parsing manually will show you how it can be done without any "magic."

We'll also explore a more advanced sentiment analysis library in Python: [VADER](https://github.com/cjhutto/vaderSentiment). We can parse the sentiment of the movie reviews using this package and compare it to our more basic method.


<a id='load-sen'></a>

## Load the Word Sentiment Data Set
---

Below, we’ll load in some pre-tagged positive and negative valence scores for a dictionary of words. Each row of the data set contains the part of speech, the word, and the positive and negative scores for the word. A word may appear more than once if it can be classified with different parts-of-speech tags. 

These scores are designed so we can also derive the *objectivity score* of the word from the positive and negative scores.

Objectivity is calculated as: 

    1. - (positive_score + negative_score)

If a word has a positive score of zero and a negative score of zero, it's completely objective. If a word has, for example, a 0.5 positive and a 0.5 negative score, it is no more positive than negative, but we can tell that it's subjective.


In [1]:
import pandas as pd
import numpy as np

In [2]:
sen = pd.read_csv('./datasets/sentiment_words_simple.csv')

In [99]:
# A:

sen.head()

**Make the parts-of-speech tags uppercase (this will come in handy when we use spaCy).**
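
As a hedged illustration of this step, the sketch below uppercases a tag column on a toy frame. The rows and the lowercase tag values are made up; only the `pos` column name comes from the data set.

```python
import pandas as pd

# Toy stand-in for the sentiment data; these rows are invented for illustration.
toy = pd.DataFrame({
    "pos": ["adj", "noun", "verb"],
    "word": ["good", "film", "love"],
})

# Series.str.upper() uppercases every string in the column.
toy["pos"] = toy["pos"].str.upper()
print(toy["pos"].tolist())
```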

In [97]:
# A:


<a id='adj-scores'></a>

### Engineer Objectivity and Positive Difference Scores

Because the subjective-versus-objective distinction is embedded in the positive and negative scores, we should extract it as its own score and convert the positive and negative scores into a single relative difference score.

**Calculate two new scores:**

    objectivity = 1. - (pos_score + neg_score)
    pos_vs_neg = pos_score - neg_score
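
A minimal sketch of the two formulas, applied to a toy frame. The three words and their scores are invented for illustration; the column names follow the data set.

```python
import pandas as pd

# Toy rows: a fully objective word, a clearly positive word, and a word
# that is equally positive and negative (subjective but balanced).
toy = pd.DataFrame({
    "word": ["table", "wonderful", "bittersweet"],
    "pos_score": [0.0, 0.75, 0.5],
    "neg_score": [0.0, 0.0, 0.5],
})

# Objectivity is whatever is left over after positive and negative mass.
toy["objectivity"] = 1.0 - (toy["pos_score"] + toy["neg_score"])

# Signed valence: above zero leans positive, below zero leans negative.
toy["pos_vs_neg"] = toy["pos_score"] - toy["neg_score"]

print(toy)
```

Note how the balanced word ends up with `objectivity = 0` and `pos_vs_neg = 0`: the two scores capture different things.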
    

In [100]:
# A:


In [8]:
sen.head()

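| Unnamed: 0 | pos | word | pos_score | neg_score | objectivity | pos_vs_neg |
|---|---|---|---|---|---|---|
| 0 | ADJ | .22-caliber | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | ADJ | .22-calibre | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | ADJ | .22_caliber | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | ADJ | .22_calibre | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | ADJ | .38-caliber | 0.0 | 0.0 | 1.0 | 0.0 |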


<a id='pos-dict'></a>

### Put Scores in a Parts-of-Speech Dictionary

The dictionary format of the data will be much easier to index using our parsing functions later on. Create a dictionary where the keys are the four parts-of-speech tags:

    ADJ
    NOUN
    VERB
    ADV

For each key, store a dictionary that contains all of the words for that part of speech with their objectivity and positive versus negative scores.
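
One possible layout for this dictionary is sketched below on a toy frame. The rows are invented, and the inner `{"objectivity": ..., "pos_vs_neg": ...}` shape is an assumption rather than the only reasonable choice.

```python
import pandas as pd

# Toy stand-in for the scored sentiment data; values are illustrative only.
toy = pd.DataFrame({
    "pos": ["ADJ", "ADJ", "NOUN", "VERB", "ADV"],
    "word": ["good", "bad", "film", "love", "badly"],
    "objectivity": [0.25, 0.25, 1.0, 0.5, 0.5],
    "pos_vs_neg": [0.75, -0.75, 0.0, 0.5, -0.5],
})

# Outer keys are the four part-of-speech tags; inner keys are words.
sen_dict = {tag: {} for tag in ["ADJ", "NOUN", "VERB", "ADV"]}

for row in toy.itertuples(index=False):
    # Assumes every tag in the data is one of the four keys above.
    sen_dict[row.pos][row.word] = {
        "objectivity": row.objectivity,
        "pos_vs_neg": row.pos_vs_neg,
    }

print(sen_dict["ADJ"]["good"])
```

Looking up a word then takes two dictionary accesses, `sen_dict[tag][word]`, which is much faster than filtering the DataFrame for every token.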

In [95]:
# A:


<a id='rt-reviews'></a>

## Load the Rotten Tomatoes Reviews Data Set

---

This data set has:
    
    critic: Critic's name.
    fresh: Fresh vs. rotten rating.
    imdb: Code for IMDB.
    publication: Where the review was published.
    quote: The review snippet.
    review_date: Date of review.
    rtid: Rotten Tomatoes ID.
    title: Name of movie.

In [10]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [98]:
# A:


<a id='subset'></a>

### Restrict Data to Reviews With Valid Ratings and Reviews More Than 10 Words Long

Additionally, clean up the reviews by adding a column that holds each quote lowercased and with punctuation removed.
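
The sketch below shows one way to do the filtering and cleaning on a toy frame. It assumes that valid ratings are the labels `fresh` and `rotten`, that "words" means whitespace-separated tokens, and that the cleaned column is called `qt`; check these against the real data.

```python
import string
import pandas as pd

# Toy stand-in for the reviews; the rows are invented for illustration.
toy = pd.DataFrame({
    "fresh": ["fresh", "rotten", "none"],
    "quote": [
        "A warm, funny, beautifully acted film that rewards patience and close attention.",
        "Dull.",
        "This one never received a rating, so it is dropped no matter how long it is.",
    ],
})

# Keep rows whose rating is one of the two assumed valid labels.
valid = toy[toy["fresh"].isin(["fresh", "rotten"])].copy()

# Keep reviews with more than 10 whitespace-separated words.
valid = valid[valid["quote"].str.split().str.len() > 10].copy()

# Lowercase, then strip ASCII punctuation, into a new column.
remove_punct = str.maketrans("", "", string.punctuation)
valid["qt"] = valid["quote"].str.lower().str.translate(remove_punct)

print(valid["qt"].tolist())
```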

In [94]:
# A:


<a id='spacy'></a>

## Import SpaCy

---

The spaCy package is one of the most widely used libraries for parsing the grammatical structure of text. We're going to use it to find the parts-of-speech tags for the review words.

Once we’ve parsed the tags with spaCy, we can assign objectivity and valence scores by finding the match in our sentiment data set.

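To install spaCy and download its English model, run:

- ```pip install spacy```
- ```python -m spacy download en```

In spaCy 3 and later, the `en` shortcut no longer exists; download and load `en_core_web_sm` instead.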
  




**For macOS**  
If you have problems with "unknown locale UTF" (or similar), open the file `~/.bash_profile` (e.g., with ```nano ~/.bash_profile``` from the terminal) and add these two lines:


```export LC_ALL=en_US.UTF-8```  
```export LANG=en_US.UTF-8```

In [None]:
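# For macOS only: if you hit the locale problem above, also run this.
# `!export` only affects a throwaway subshell, so use %env to set the
# variables for the notebook process itself.
%env LC_ALL=en_US.UTF-8
%env LANG=en_US.UTF-8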

In [23]:
import spacy
en_nlp = spacy.load('en')

**Parse a single quote.**

In [33]:
# A:
tmp = en_nlp(rt.qt.values[0])

**Print out the parts-of-speech tags for each word in the quote.**

In [101]:
# A:


<a id='multi'></a>
### Parse All of the Quotes Using SpaCy's Multithreaded Parser

Parsing a lot of text can take a while. Luckily, spaCy comes with multithreading functionality to speed up the process. Below is code that will parse the quotes across multiple threads and collect the parsed documents in a list.

In [None]:
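parsed_quotes = []
# Note: written against an older spaCy release; `n_threads` and
# `Doc.is_parsed` may not be available in newer versions.
for i, parsed in enumerate(en_nlp.pipe(rt.qt.values, batch_size=50, n_threads=4)):
    assert parsed.is_parsed
    # Print progress every 1,000 quotes.
    if (i % 1000) == 0:
        print(i)
    parsed_quotes.append(parsed)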

In [102]:
# A:


<a id='pos-features'></a>

## Create Features With Parts-of-Speech Proportions

---

With our spaCy-parsed reviews, we have a lot of feature engineering potential before we even get to sentiment. Something simple we could do is calculate the proportion of words in the quote that have each part-of-speech tag. We can try using these as predictors later.

**Find all of the unique part-of-speech categories in the reviews.**

In [103]:
# A:


**Create the proportion columns for each part of speech.**

In [38]:
# A:


**Iterate through the reviews and calculate the proportions of each part-of-speech tag.**
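
The three steps above can be sketched without spaCy by starting from lists of tags. The tag lists and the `_prp` column suffix are hypothetical; with parsed quotes, each list would come from `[token.pos_ for token in parsed_quote]`.

```python
from collections import Counter

# Hypothetical pre-tagged reviews, one list of part-of-speech tags per review.
tagged_reviews = [
    ["DET", "ADJ", "NOUN", "VERB"],
    ["ADJ", "ADJ", "NOUN", "PUNCT"],
]

# Step 1: unique tags across all reviews, sorted for a stable column order.
unique_tags = sorted({tag for tags in tagged_reviews for tag in tags})


def pos_proportions(tags, all_tags):
    """Return the share of tokens carrying each tag in all_tags."""
    counts = Counter(tags)
    total = len(tags)
    # Guard against empty reviews to avoid dividing by zero.
    return {f"{t}_prp": (counts[t] / total if total else 0.0) for t in all_tags}


# Steps 2 and 3: one dict of proportion columns per review.
rows = [pos_proportions(tags, unique_tags) for tags in tagged_reviews]
print(rows[1])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame` and joined onto the reviews.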

In [104]:
# A:


<a id='assign'></a>

## Assign Sentiment Scores
---

We'll now use the parsed reviews and the sentiment data set to assign the average objectivity and positive versus negative scores.

If a word can't be found in the data set, we can ignore it. If a review has no words that match something in our data set, we can assign overall neutral scores of `objectivity = 1` and `pos_vs_neg = 0`.
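
A minimal sketch of this scoring rule follows. It assumes the lookup has the nested `{tag: {word: scores}}` shape and that each review has already been reduced to `(word, tag)` pairs; the words and scores are invented.

```python
# Hypothetical lookup in a nested {tag: {word: scores}} layout.
sen_dict = {
    "ADJ": {
        "good": {"objectivity": 0.25, "pos_vs_neg": 0.75},
        "bad": {"objectivity": 0.25, "pos_vs_neg": -0.75},
    },
    "NOUN": {"film": {"objectivity": 1.0, "pos_vs_neg": 0.0}},
}


def score_review(tagged_words, lookup):
    """Average the scores of the (word, tag) pairs found in lookup."""
    matches = [
        lookup[tag][word]
        for word, tag in tagged_words
        if tag in lookup and word in lookup[tag]
    ]
    # No matching words: fall back to the neutral defaults.
    if not matches:
        return {"objectivity": 1.0, "pos_vs_neg": 0.0}
    n = len(matches)
    return {
        "objectivity": sum(m["objectivity"] for m in matches) / n,
        "pos_vs_neg": sum(m["pos_vs_neg"] for m in matches) / n,
    }


matched = score_review([("good", "ADJ"), ("film", "NOUN"), ("the", "DET")], sen_dict)
unmatched = score_review([("zzz", "X")], sen_dict)
print(matched, unmatched)
```

One limitation is visible here: unmatched words are skipped rather than counted as neutral, so a single strongly scored word can dominate a long review.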

There are problems with this approach, but for now we can keep it simple and see if things improve when we use the VADER analyzer.

In [105]:
# A:


<a id='print-most'></a>
## Print Out the Most Positive and Negative Reviews
---

Now that we have the average valence for reviews, try printing out the top 10 most positive and top 10 most negative reviews to visually verify that our approach makes sense.
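
As a sketch, pandas' `nlargest` and `nsmallest` select the extremes directly. The toy quotes and scores are invented, and `n = 2` stands in for the 10 used in the exercise.

```python
import pandas as pd

# Toy reviews with an already-computed average valence column.
toy = pd.DataFrame({
    "quote": ["loved it", "hated it", "it exists", "pretty good"],
    "pos_vs_neg": [0.8, -0.7, 0.0, 0.4],
})

n = 2  # use 10 on the real data

# Rows with the highest and lowest average valence, already sorted.
most_positive = toy.nlargest(n, "pos_vs_neg")
most_negative = toy.nsmallest(n, "pos_vs_neg")

for quote in most_positive["quote"]:
    print("+", quote)
for quote in most_negative["quote"]:
    print("-", quote)
```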

In [106]:
# A:


<a id='print-most-obj'></a>

## Print Out the Most Objective and Subjective Reviews
---

Complete the same task as above but sort by objectivity. What kind of differences do you notice between the two? Does our approach appear to capture meaningful subjectivity and objectivity in the reviews?

In [19]:
# A:

<a id='model'></a>

## Build a Model to Classify Fresh vs. Rotten With the Sentiment and Grammar Features

---

Let's use the features we've created to construct a logistic regression that will predict whether a review is fresh or rotten. 

Don't forget to check the baseline score and remember that it's a good practice to standardize your predictors.
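
The workflow can be sketched on synthetic data as follows. The feature matrix and labels are randomly generated stand-ins, not the review features; the point is the baseline calculation and the pipeline that keeps scaling inside cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in: 200 "reviews", 3 features, labels loosely tied to feature 0.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Baseline: accuracy from always predicting the majority class.
baseline = max(y.mean(), 1 - y.mean())

# Putting the scaler in the pipeline standardizes each fold using
# only that fold's training split.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print("baseline:", baseline)
print("mean CV accuracy:", scores.mean())
```

A model is only useful if its cross-validated accuracy clearly beats the baseline.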

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

<a id='vader'></a>

## Use the VADER Library to Get Better Sentiment Scores
---

The [VADER](https://github.com/cjhutto/vaderSentiment) package for Python is a more advanced way to calculate positivity, negativity, and objectivity in our reviews. Its GitHub page describes VADER as:

> VADER (valence aware dictionary and sentiment reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media and works well on texts from other domains.

You will likely need to install VADER with `pip` or `conda`. Instructions can be found on the GitHub page. Once it is installed, you can load the `SentimentIntensityAnalyzer` and parse text.

**Parse a couple of quotes with the `SentimentIntensityAnalyzer` and print out the dictionary of scores using `analyzer.polarity_scores`**.

In [107]:
# Install first with: pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [108]:
# A:


These scores look more legitimate. VADER polarity score dictionaries have four elements: `neg`, `pos`, `neu`, and `compound`. The compound score is a single metric that represents the overall valence.

**Calculate the four scores for each review and save them as features in the DataFrame.**

In [109]:
# A:



<a id='vader-model'></a>

### Fit a Model Using the VADER Sentiment Features

Does this model perform better? 

In [110]:
# A:


<a id='vader-top'></a>

### Print Out the Most Negative, Positive, Neutral, and Subjective Reviews by VADER Score

In [26]:
# A:

