<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis With SpaCy and VADER

# What is Sentiment Analysis?



## SpaCy and Part of Speech (PoS)

---


In [3]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
Building wheels for collected packages: en-core-web-sm
  Running setup.py bdist_wheel for en-core-web-sm: started
  Running setup.py bdist_wheel for en-core-web-sm: finished with status 'done'
  Stored in directory: C:\Users\Samson\AppData\Local\pip\Cache\wheels\54\7c\d8\f86364af8fbba7258e14adae115f18dd2c91552406edc3fdaa
Successfully built en-core-web-sm




In [4]:
import spacy

In [7]:
en_nlp = spacy.load('en_core_web_sm')

In [None]:
# Alternative for older spaCy setups where the model is linked under the shortcut name 'en':
# en_nlp = spacy.load('en')

**Parse a single sentence.**

In [8]:
# ERROR: under Python 2, spaCy expects a unicode string, so a plain str raises a TypeError.
sentence = "this is a very nice sentence about football and food"
sentence_parsed = en_nlp(sentence)

TypeError: Argument 'string' has incorrect type (expected unicode, got str)

In [9]:
sentence = u"this is a very nice sentence about football and food"
sentence_parsed = en_nlp(sentence)

In [10]:
len(sentence_parsed)  # number of tokens (here, one per word)

10

In [11]:
sentence_parsed[0]

this

In [12]:
type(sentence_parsed[0])

spacy.tokens.token.Token

In [13]:
sentence_parsed.sentiment

0.0

In [14]:
for token in sentence_parsed:
    print(token, token.pos_)

(this, u'DET')
(is, u'VERB')
(a, u'DET')
(very, u'ADV')
(nice, u'ADJ')
(sentence, u'NOUN')
(about, u'ADP')
(football, u'NOUN')
(and, u'CCONJ')
(food, u'NOUN')


In [15]:
pos_counts = {}
for token in sentence_parsed:
    pos = token.pos_
    pos_counts[pos] = pos_counts.get(pos,0) + 1
    
pos_counts

{u'ADJ': 1,
 u'ADP': 1,
 u'ADV': 1,
 u'CCONJ': 1,
 u'DET': 2,
 u'NOUN': 3,
 u'VERB': 1}

In [16]:
pos_perc = {}
for k, v in pos_counts.items():
    # Proportion of tokens carrying each tag (1. forces float division under Python 2)
    pos_perc[k] = 1. * v / len(sentence_parsed)
    
pos_perc

{u'ADJ': 0.1,
 u'ADP': 0.1,
 u'ADV': 0.1,
 u'CCONJ': 0.1,
 u'DET': 0.2,
 u'NOUN': 0.3,
 u'VERB': 0.1}

#### These part-of-speech proportions are new features you can use!
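
The counting steps above can be packaged into a small reusable feature extractor, sketched below. The function name `pos_proportions` is hypothetical, and it takes a plain list of tag strings (for example `[token.pos_ for token in sentence_parsed]`) so that it runs without a loaded spaCy model.

```python
from collections import Counter


def pos_proportions(tags):
    """Return {tag: share of tokens with that tag} for a list of POS tag strings."""
    counts = Counter(tags)
    total = float(len(tags))  # float forces true division under Python 2 as well
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Tags copied from the parsed example sentence above.
tags = ['DET', 'VERB', 'DET', 'ADV', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'CCONJ', 'NOUN']
features = pos_proportions(tags)
print(features['NOUN'], features['DET'])
```

Applied to every document in a corpus, the resulting dictionaries could be stacked into a feature table with one column per tag.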

## Sentiment analysis

In [19]:
import pandas as pd

sen = pd.read_csv('datasets/sentiment_words_simple.csv')
sen['pos'] = sen['pos'].str.upper()

sen.sample(10)

Unnamed: 0,pos,word,pos_score,neg_score
48994,NOUN,deciduous_holly,0.0,0.0
50056,NOUN,deuteromycotina,0.0,0.0
83067,NOUN,lebanon,0.0,0.0
47648,NOUN,cuss,0.0,0.125
87635,NOUN,march_17,0.0,0.0
10965,ADJ,made-to-order,0.0,0.0
98062,NOUN,oscar_fingal_o'flahertie_wills_wilde,0.0,0.0
18362,ADJ,togolese,0.0,0.0
12284,ADJ,nonelective,0.0,0.5
3051,ADJ,brown-striped,0.125,0.0


In [20]:
# Net polarity of each word: positive score minus negative score
sen['pos_vs_neg'] = sen['pos_score'] - sen['neg_score']

In [21]:
# example 1
sen[(sen['word']=='sentence') & (sen['pos']=='NOUN')]

Unnamed: 0,pos,word,pos_score,neg_score,pos_vs_neg
116721,NOUN,sentence,0.0,0.0,0.0


### We can get a score for each word and average the results

In [22]:
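for token in sentence_parsed:
    # Look up the row that matches both the word and its part-of-speech tag.
    word_match = sen['word'] == token.text
    pos_match = sen['pos'] == token.pos_
    score = sen.loc[word_match & pos_match, 'pos_vs_neg'].values
    if len(score) > 0:
        print(token, token.pos_, score[0])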

(very, u'ADV', 0.125)
(nice, u'ADJ', 0.5750000000000001)
(sentence, u'NOUN', 0.0)
(football, u'NOUN', 0.0)
(food, u'NOUN', -0.0416666666667)
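
The loop above prints per-word scores but stops short of the averaging step the heading mentions. A minimal sketch of that step follows, assuming a plain dictionary keyed by `(word, pos)` stands in for the `sen` DataFrame. The name `mean_polarity` is hypothetical, and the scores are the ones printed above.

```python
def mean_polarity(tagged_tokens, lexicon):
    """Average the lexicon scores of the (word, pos) pairs found in the lexicon.

    Pairs missing from the lexicon are skipped; returns 0.0 if none match.
    """
    scores = [lexicon[pair] for pair in tagged_tokens if pair in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / float(len(scores))


# Scores copied from the loop output above (a stand-in for the full lexicon).
lexicon = {
    ('very', 'ADV'): 0.125,
    ('nice', 'ADJ'): 0.575,
    ('sentence', 'NOUN'): 0.0,
    ('football', 'NOUN'): 0.0,
    ('food', 'NOUN'): -0.0416666666667,
}
tagged = [('this', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('very', 'ADV'),
          ('nice', 'ADJ'), ('sentence', 'NOUN'), ('about', 'ADP'),
          ('football', 'NOUN'), ('and', 'CCONJ'), ('food', 'NOUN')]
sentence_score = mean_polarity(tagged, lexicon)
print(round(sentence_score, 4))
```

Averaging only over matched words means unknown words do not dilute the score. Dividing by the full token count instead is an equally defensible choice.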


<a id='print-most-obj'></a>
## Objective and Subjective
---

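Objective = 1 - (positive + negative)

A word's objective score is whatever remains after its positive and negative scores are subtracted from 1; the subjective part is positive + negative.

"terrible":

* positive = 0.0
* negative = 0.8
* objective = 0.2

"very":

* positive = 0.7
* negative = 0.0
* objective = 0.3

"room":

* positive = 0.02
* negative = 0.03
* objective = 0.95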
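
The formula and the three examples above can be checked with a few lines of code. This is only a sketch: the helper name `objectivity` is hypothetical, and the scores are the illustrative values listed above rather than rows read from the dataset.

```python
def objectivity(positive, negative):
    """Objective score: the share left over after the positive and negative scores."""
    return 1.0 - (positive + negative)


# (positive, negative) pairs taken from the three examples above.
examples = {'terrible': (0.0, 0.8), 'very': (0.7, 0.0), 'room': (0.02, 0.03)}
objective_scores = {word: objectivity(pos, neg)
                    for word, (pos, neg) in examples.items()}

# List the words from most to least objective.
ranked = sorted(objective_scores, key=objective_scores.get, reverse=True)
for word in ranked:
    print(word, round(objective_scores[word], 2))
```

On the `sen` DataFrame the same idea would be a single column expression, `1 - (sen['pos_score'] + sen['neg_score'])`.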



## Sentiment Scores with VADER Library
---

In [None]:
# VADER scores text from a word-level lexicon plus a set of rules.
# It is less robust than a trained model such as an RNN, so treat it as a simple baseline.

In [25]:
# Install the library first with: pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [26]:
sentences = ['Hawthorne is by turn outrageous and pathetic and imperious and poignant and very funny.',
            'Delivers guilt-free escapism about pretty people having wicked-hot fun in pretty places.',
            'Brian De Palma take on Tom Wolfe The Bonfire of the Vanities is a misfire of inanities.',
            'I hated this movie. Hated hated hated hated hated this movie. Hated it.']

In [67]:
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print(sentence)
    print(vs)
    print('')

Hawthorne is by turn outrageous and pathetic and imperious and poignant and very funny.
{'neg': 0.321, 'neu': 0.526, 'pos': 0.153, 'compound': -0.5434}

Delivers guilt-free escapism about pretty people having wicked-hot fun in pretty places.
{'neg': 0.0, 'neu': 0.481, 'pos': 0.519, 'compound': 0.8658}

Brian De Palma take on Tom Wolfe The Bonfire of the Vanities is a misfire of inanities.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

I hated this movie. Hated hated hated hated hated this movie. Hated it.
{'neg': 0.855, 'neu': 0.145, 'pos': 0.0, 'compound': -0.9854}
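
The `compound` value is usually the one to act on. The sketch below turns it into a coarse label using the ±0.05 cut-offs suggested in the vaderSentiment README. The function name `label_from_compound` and its default threshold are assumptions for illustration, and the scores are copied from the output above so the block runs without the library installed.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores copied from the four outputs above, in order.
compounds = [-0.5434, 0.8658, 0.0, -0.9854]
labels = [label_from_compound(c) for c in compounds]
print(labels)
```

The third review scores exactly 0.0 with `neu` at 1.0, most likely because none of its words appear in VADER's lexicon. That is the kind of miss a lexicon-based baseline is prone to.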

