# Sentiment Analysis using Python and NLTK

## How are we going to be doing this?
Python, apart from its incredible readability, has some remarkable libraries at hand, one of which is NLTK. NLTK, the Natural Language Toolkit, is one of the best Python NLP libraries out there. The functionality it puts at your fingertips, while staying easy to use and, again, readable, is just fantastic.

In fact, we’re going to complete this mini project in under 25 lines of code. And you’re most probably going to understand each line as you read through it. Crazy, I know.

Let’s get right into it!
- IDE: Personally, whenever I’m doing anything even relatively fancy in Python, I use Jupyter Lab.

Now, we’ve got to get hold of the libraries we need. Just four, all super easy to get:

- NLTK
- Numpy
- Pandas
- Scikit-learn

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
import nltk
nltk.download('vader_lexicon') # one time only
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
vader = SentimentIntensityAnalyzer() # or whatever you want to call it

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\gotha\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### What is this ‘VADER’ ?
While there is an official page for NLTK’s VADER, it’s actually just the code and not an explanation of VADER, which, by the way, does not refer to Darth Vader. Very sad, I know.

It actually stands for Valence Aware Dictionary and sEntiment Reasoner. It’s basically going to do all the sentiment analysis for us. So convenient. I mean, at this rate jobs are definitely going to be vanishing faster. (No, I’m kidding)

The way this magical downloadable works is by mapping the words you pass into it to lexical features with emotional intensities. In plain English, since you ask, that means looking each word up in a lexicon (the vader_lexicon we just downloaded) of words rated by how positive or negative they are, and giving the word a score. A sentiment score, to be precise.

So now that each word has a sentiment score, the score of a paragraph of words is going to be, you guessed it, based on the sum of all the sentiment scores. Shocking, I know.
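
To make the lookup-and-sum idea concrete, here is a minimal sketch in plain Python. The tiny TOY_LEXICON and its scores are made up for illustration and are not VADER’s real lexicon; the real thing also has rules for negation, intensifiers and punctuation that this sketch ignores.

```python
# Hypothetical mini-lexicon: word -> sentiment score (illustrative values, not VADER's)
TOY_LEXICON = {
    "love": 3.0,
    "great": 2.5,
    "poor": -2.0,
    "hate": -3.0,
}

def toy_sentiment_sum(text):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

print(toy_sentiment_sum("I really love NVIDIA"))        # only 'love' is in the toy lexicon
print(toy_sentiment_sum("poor sound but great price"))  # one negative and one positive word
```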

Now let’s try out what this ‘VADER’ can do. Write the following and run it:

In [3]:
sample = 'I really love NVIDIA'
vader.polarity_scores(sample)

{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6697}

So, it was 69.2% positive. Which might not be perfect, but it definitely gets the job done, as you’ll see.

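In case you’re wondering, the compound value is the overall score: the summed word scores squashed into a range from -1 (most negative) to +1 (most positive). The neg, neu and pos values are the proportions of the text that fall into each category.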
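
For the curious, here is a rough sketch of how a summed word score could be squashed into the -1 to +1 range that compound lives in. It assumes the formula from VADER’s reference implementation, score / sqrt(score² + alpha) with alpha = 15, and it assumes a summed score of about 3.493 for the sample sentence (roughly 3.2 for ‘love’ plus a boost of 0.293 for ‘really’). Treat both numbers as assumptions rather than something this tutorial has verified.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed score into the open range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Assumed summed word score for 'I really love NVIDIA' (see the note above)
summed = 3.2 + 0.293
print(round(normalize(summed), 4))
```

Under these assumptions the result comes out at roughly 0.67, in line with the compound value VADER reported for the sample.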

Now, try this

In [4]:
sample = 'I really don\'t love NVIDIA'
vader.polarity_scores(sample)

{'neg': 0.549, 'neu': 0.451, 'pos': 0.0, 'compound': -0.5642}

54.9% negative, whew, by the skin of its teeth. 

### Now let’s work on some real-world data
Here’s a file with Amazon reviews of a product from which we’re going to be extracting sentiments. Go ahead and download it. Also ensure that it’s in the same directory as the Python file you’re working on. Otherwise, remember to add the correct path to it.
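
If you’re not sure where the file ended up, a small check like the one below can save you a confusing FileNotFoundError later. This is only a sketch: find_reviews_file is a hypothetical helper, and the directories you pass in are whatever makes sense on your machine.

```python
from pathlib import Path

def find_reviews_file(filename, search_dirs):
    """Return the first existing path to filename inside search_dirs, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# Look in the current directory first, then in a 'Resources' subfolder
path = find_reviews_file("amazonreviews.csv", [".", "./Resources"])
print(path if path else "File not found, check your download location")
```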

We’re going to be needing both pandas and numpy now.

In [5]:
# file with Amazon reviews of a product from which we’re going to be extracting sentiments
amzn_df = pd.read_csv('./Resources/amazonreviews.csv', sep='\t')
display(amzn_df.head())
display(amzn_df.tail())

Unnamed: 0,label,"review,,,,,,,,,,,,,,,,,,,,,,,,"
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,"pos\t""Amazing!: This soundtrack is my favorite...",
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember,Pull Your Jaw Off The Floor After Hea..."


Unnamed: 0,label,"review,,,,,,,,,,,,,,,,,,,,,,,,"
9995,"pos\t""A revelation of life in small town Ameri...",
9996,pos,Great biography of a very interesting journali...
9997,neg,Interesting Subject; Poor Presentation: You'd ...
9998,neg,Don't buy: The box looked used and it is obvio...
9999,pos,Beautiful Pen and Fast Delivery.: The pen was ...


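In the above code, we’ve initialized a Pandas DataFrame object and displayed its first and last five rows. Notice that the review column’s header picked up a trail of stray commas, and that a few rows (2 and 9995, for example) didn’t split cleanly into label and review.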

This dataset already has all the reviews categorized under positive and negative. This is just for you to cross check the values you get back from VADER and calculate your metrics.

To see how many positive and negative reviews we have, type in the following

In [6]:
amzn_df['label'].value_counts()

neg    3985
pos    ...


### Let’s try one of the objects out, shall we?

But before we do that, let’s ensure that our dataset is nice and clean, i.e., ensure that there aren’t any blank objects.

In [8]:
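amzn_df.dropna(inplace=True)  # drop rows with missing values
empty_objects = []
for index, label, review in amzn_df.itertuples():
    if type(review) == str:
        if review.isspace():
            empty_objects.append(index)  # remember rows whose review is only whitespace

# drop the whitespace-only rows once, after the loop has finished
amzn_df.drop(empty_objects, inplace=True)
amzn_df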

Unnamed: 0,label,"review,,,,,,,,,,,,,,,,,,,,,,,,"
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember,Pull Your Jaw Off The Floor After Hea..."
5,pos,an absolute masterpiece: I am quite sure any o...
...,...,...
9994,neg,"Sorry Jim: As a former realtor,Mr. Cole owes t..."
9996,pos,Great biography of a very interesting journali...
9997,neg,Interesting Subject; Poor Presentation: You'd ...
9998,neg,Don't buy: The box looked used and it is obvio...


This little snippet will drop any blank dataframe objects. The inplace=True argument ensures that the dataframe keeps the changes made by dropping any blank objects, instead of cheekily throwing them away despite all our effort. Very much like a commit in Git.

In [10]:
amzn_df['label'].value_counts()

neg    3985
pos    3955
Name: label, dtype: int64

Even if this particular dataset turns out to have no whitespace-only reviews, it doesn’t hurt to be careful.

Currently there are a couple of problems:

- We can’t compare the extracted sentiment to the original sentiment by hand, as doing that for each review is time consuming and, quite frankly, completely caveman.
- The extracted sentiment is just printed out, which, in my opinion, is plain flimsy.

Let’s fix it.

Let’s add the sentiment to the dataframe alongside its original sentiment.

In [11]:
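# The review column's header loaded with stray trailing commas (see the output above),
# so rename the two columns before looking up 'review'
amzn_df.columns = ['label', 'review']
amzn_df['scores'] = amzn_df['review'].apply(lambda review: vader.polarity_scores(review))
amzn_df.head()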

But currently the scores column holds just the raw sentiment dictionary, which we can’t really compare programmatically with the ‘label’ column, so let’s find a workaround.

Let’s use the compound value.

In [None]:
amzn_df['compound'] = amzn_df['scores'].apply(lambda score_dict: score_dict['compound'])
amzn_df

If the compound value is greater than 0, we can safely say that the review is positive, otherwise it’s negative. Great! Let’s implement that now!
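
The rule above can be sketched as a tiny function. label_from_compound is a name made up for this example, and the commented-out last line assumes the dataframe has a ‘compound’ column holding VADER’s compound scores and that you want the result in a ‘sentiment’ column.

```python
def label_from_compound(compound):
    """Map a compound score to 'pos' or 'neg', matching the dataset's labels."""
    return 'pos' if compound > 0 else 'neg'

# Quick check with the two compound scores from the samples earlier
print(label_from_compound(0.6697))   # pos
print(label_from_compound(-0.5642))  # neg

# With the dataframe from this tutorial, this would be applied as:
# amzn_df['sentiment'] = amzn_df['compound'].apply(label_from_compound)
```

Note that a compound score of exactly 0 lands in ‘neg’ here, simply because the rule says “greater than 0”.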

Well then, let’s check our score now, shall we?

In [None]:
from sklearn.metrics import accuracy_score 

In [None]:
accuracy_score(amzn_df['label'], amzn_df['sentiment'])

There’s definitely room for improvement. But, do keep in mind that we got this score without making any changes to VADER and that we didn’t write any custom code to figure out the sentiment ourselves.
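
In case you’re wondering what accuracy_score is doing for this kind of label comparison, it boils down to the fraction of predictions that match the true labels. Here is a plain-Python sketch of that idea; simple_accuracy is a made-up name and the labels are invented purely for illustration.

```python
def simple_accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels: 3 of the 4 predictions agree with the truth
truth = ['pos', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'neg']
print(simple_accuracy(truth, predicted))  # 0.75
```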

Alright then, if you have any queries, feel free to post them in the comments and I’ll try to help out! Peace.