# Sentiment Analysis
Now that we've seen word vectors, we can start to investigate sentiment analysis. The goal is to find commonalities between documents, with the understanding that similarly *combined* vectors should correspond to similar sentiments.

While the scope of sentiment analysis is very broad, we will focus our work in three ways.

### 1. Polarity classification
We won't try to determine if a sentence is objective or subjective, fact or opinion. Rather, we care only if the text expresses a *positive*, *negative* or *neutral* opinion.
### 2. Document level scope
We'll also try to aggregate all of the sentences in a document or paragraph, to arrive at an overall opinion.
### 3. Coarse analysis
We won't try to perform a fine-grained analysis that would determine the degree of positivity/negativity. That is, we're not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.

## Broad Steps:
* First, consider the text being analyzed. A model trained on paragraph-long movie reviews might not be effective on tweets. Make sure to use an appropriate model for the task at hand.
* Next, decide the type of analysis to perform. In the previous section on text classification we used a bag-of-words technique that considered only single tokens, or *unigrams*. Some rudimentary sentiment analysis models go one step further, and consider two-word combinations, or *bigrams*. In this section, we'd like to work with complete sentences, and for this we're going to import a trained NLTK lexicon called *VADER*.
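To make the unigram/bigram distinction concrete, here is a minimal sketch in plain Python. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of the course material.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "this movie was not good".split()

unigrams = ngrams(tokens, 1)   # single tokens
bigrams = ngrams(tokens, 2)    # two-word combinations

print(unigrams)
print(bigrams)
```

The bigram `('not', 'good')` keeps the negation next to the word it modifies, which a unigram model would lose.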

## VADER (Valence Aware Dictionary and sEntiment Reasoner)
<div class="alert alert-info" style="margin:20px">
    
VADER is a model used for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data. VADER relies primarily on a dictionary that maps lexical features to emotion intensities, known as sentiment scores. The sentiment score of a text can be obtained by summing up the intensity of each word in the text. For example, words like **"love", "like", "enjoy" and "happy"** all convey a positive sentiment.
    
VADER also handles the basic context of those words, so that **"did not like"** is treated as a negative sentiment. It also accounts for capitalization and punctuation, as in **"LOVE!!!"**.
    
Sentiment analysis on raw text is always challenging, however, due to a variety of possible factors:

* Positive and negative sentiment in the same text data.
* Sarcasm using positive words in a negative way.

</div>
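To make the idea of summing word intensities concrete, here is a toy sketch of a lexicon-based scorer. The lexicon values, the negation list and the boost factors are all made up for illustration; VADER's real lexicon and rules are considerably richer, so treat this as a simplified picture rather than its implementation.

```python
import string

# Hypothetical intensities, chosen only for this illustration
TOY_LEXICON = {"love": 3.0, "like": 1.5, "enjoy": 2.0, "happy": 2.5, "awful": -2.0}
NEGATIONS = {"not", "never", "no"}

def toy_score(text):
    """Sum word intensities, flipping the sign of a word that follows a
    negation and boosting words written in ALL CAPS."""
    total = 0.0
    negate = False
    for raw in text.split():
        word = raw.strip(string.punctuation)
        key = word.lower()
        if key in NEGATIONS:
            negate = True
            continue
        if key in TOY_LEXICON:
            value = TOY_LEXICON[key]
            if word.isupper() and len(word) > 1:
                value *= 1.5      # emphasis from capitalization
            if negate:
                value = -value    # simple negation handling
            total += value
            negate = False
    # Up to three exclamation marks push the score further from zero
    bangs = min(text.count("!"), 3)
    if total > 0:
        total += 0.3 * bangs
    elif total < 0:
        total -= 0.3 * bangs
    return total

print(toy_score("I love this movie"))
print(toy_score("I did not like it"))
print(toy_score("LOVE!!!"))
```

Note that a sarcastic sentence built from positive words would still score as positive here, which is exactly the difficulty listed above.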

In [3]:
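# Import NLTK and load the VADER sentiment analyzer
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon; download it once if it is not already installed
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()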

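The `polarity_scores()` method of VADER's `SentimentIntensityAnalyzer` takes in a string and returns a dictionary of scores in each of four categories:
* negative
* neutral
* positive
* compound *(computed by summing the valence scores of the words in the text and normalizing the result to lie between -1 and +1)*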

In [5]:
a="This is a great movie."
sia.polarity_scores(a)

{'neg': 0.0, 'neu': 0.423, 'pos': 0.577, 'compound': 0.6249}

In [6]:
a="I don't like this movie, it's such a waste of time"
sia.polarity_scores(a)

{'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.6007}

In [7]:
a="hey, did you see that movie, it was awful."
sia.polarity_scores(a)

{'neg': 0.273, 'neu': 0.727, 'pos': 0.0, 'compound': -0.4588}
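The compound values above can be reproduced by hand. The sketch below assumes the normalization `x / sqrt(x² + alpha)` with `alpha = 15`, and assumes raw valences of 3.1 for "great" and -2.0 for "awful"; these values are consistent with the outputs shown, but check them against the lexicon installed on your machine.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Assumed raw valence sums for the two single-sentiment-word examples above
great = normalize(3.1)    # "This is a great movie."
awful = normalize(-2.0)   # "hey, did you see that movie, it was awful."

print(round(great, 4))   # 0.6249
print(round(awful, 4))   # -0.4588
```

Because of the square root in the denominator, adding more sentiment-bearing words moves the compound score toward ±1 but never past it.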

## Use VADER to analyze Amazon Reviews
For this exercise we're going to apply `SentimentIntensityAnalyzer` to a dataset of 10,000 Amazon reviews. Like our movie review datasets, these are labeled as either "pos" or "neg". At the end we'll determine the accuracy of our sentiment analysis with VADER.

In [8]:
import numpy as np
import pandas as pd

In [9]:
# Load the tab-separated reviews file
df = pd.read_csv("../TextFiles/amazonreviews.tsv", sep="\t")
df.head()

| | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [10]:
df.tail()

| | label | review |
|---|---|---|
| 9995 | pos | A revelation of life in small town America in ... |
| 9996 | pos | Great biography of a very interesting journali... |
| 9997 | neg | Interesting Subject; Poor Presentation: You'd ... |
| 9998 | neg | Don't buy: The box looked used and it is obvio... |
| 9999 | pos | Beautiful Pen and Fast Delivery.: The pen was ... |


In [11]:
df["label"].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

## Check for any empty values

In [12]:
df.isnull().sum()

label     0
review    0
dtype: int64

So we don't have any missing values in our data.
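Note that `isnull()` only catches missing values; a review made up solely of whitespace still counts as present. A minimal sketch of an additional check follows, using a small hypothetical DataFrame (`sample`) rather than the Amazon data.

```python
import pandas as pd

# Hypothetical sample data: the second review is whitespace only
sample = pd.DataFrame({
    "label": ["pos", "neg", "pos"],
    "review": ["Great pen.", "   ", "Fast delivery."],
})

# Rows whose review is empty once surrounding whitespace is stripped
blank_mask = sample["review"].str.strip() == ""
blank_index = list(sample[blank_mask].index)

# Drop those rows before scoring
cleaned = sample.drop(blank_index)

print(blank_index, len(cleaned))
```

The same mask could be applied to `df['review']` if blank reviews are a concern.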

## Adding Scores and Labels to the DataFrame

In [13]:
df.iloc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [14]:
sia.polarity_scores(df.iloc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [16]:
# Score every review; each cell of 'scores' holds the full VADER dictionary
df['scores'] = df['review'].apply(lambda review: sia.polarity_scores(review))

In [17]:
df.head()

| | label | review | scores |
|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... |


In [18]:
# Pull the compound value out of each scores dictionary
df['compound'] = df['scores'].apply(lambda x: x['compound'])

In [19]:
df.head()

| | label | review | scores | compound |
|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 |


In [20]:
# Label a review "pos" when its compound score is non-negative, otherwise "neg"
df['compound_score'] = df['compound'].apply(lambda score: "pos" if score >= 0 else "neg")

In [21]:
df.head()

| | label | review | scores | compound | compound_score |
|---|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |
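The cutoff of 0 used above folds every neutral text (compound exactly 0.0) into "pos". A commonly suggested alternative is to treat compound scores within ±0.05 of zero as neutral. The sketch below shows that three-way rule; the function name `label_from_compound` and the default threshold are assumptions for illustration, and because this dataset only has "pos"/"neg" labels, a "neu" prediction would need separate handling before scoring.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg' or 'neu' using a symmetric band."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Compound scores taken from the examples earlier in this section
print(label_from_compound(0.9454))
print(label_from_compound(-0.6007))
print(label_from_compound(0.0))
```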


## Evaluate the outcome 

In [22]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

accuracy_score(df["label"],df["compound_score"])

0.7091

In [23]:
confusion_matrix(df["label"],df["compound_score"])

array([[2623, 2474],
       [ 435, 4468]], dtype=int64)

In [24]:
print(classification_report(df["label"],df["compound_score"]))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000
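The figures in the classification report follow directly from the confusion matrix. As a cross-check, this sketch recomputes them in plain Python, assuming rows are true labels and columns are predicted labels in the order neg, pos (the layout scikit-learn uses).

```python
# Confusion matrix from the output above: rows = true, columns = predicted
cm = [[2623, 2474],
      [435, 4468]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    predicted = sum(row[i] for row in cm)   # column total: predicted as this class
    actual = sum(cm[i])                     # row total: truly this class
    precision = cm[i][i] / predicted
    recall = cm[i][i] / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(accuracy)   # 0.7091
print(report)     # {'neg': (0.86, 0.51, 0.64), 'pos': (0.64, 0.91, 0.75)}
```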



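This means VADER reaches an accuracy of only about 71% on this dataset. Most of the errors come from negative reviews: 2,474 of the 5,097 negative reviews were scored as positive (a recall of 0.51), whereas only 435 of the 4,903 positive reviews were scored as negative.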