<h1><center>Sentiment Analysis</center></h1>

[Test Sentiment Analysis here](https://monkeylearn.com/sentiment-analysis-online/)<br><br>
Sentiment analysis identifies emotions <b>in text</b> and classifies opinions as <b>positive/negative/neutral</b>

---
<h1><center>Tokenization</center></h1>

Tokenization breaks raw text into small chunks. The words or sentences that the raw text is split into are known as <b>tokens</b><br><br>

<h3>Why is tokenization important?</h3><br>
Tokenization helps in <b>interpreting the meaning</b> of the text by <b>analyzing the sequence of words</b><br>

E.g : <code>"It is raining"</code> can be converted into <code>'It'</code> <code>'is'</code> <code>'raining'</code><br><br>

<h3>Why is Tokenization required in NLP?</h3>

Because the <b>meaning of the text</b> can be interpreted more easily by <b>analyzing the words present in the text</b><br>

Also, the meaningful/important information we want to extract comes from the individual words

On social media, hashtags are used to talk about a topic. By counting the hashtags, it is easy to determine how many times a given hashtag has been mentioned.
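
Counting hashtags can be sketched in a few lines. This is only a rough illustration: the sample posts are invented, and a hashtag is approximated with the regular expression `#\w+`, which is an assumption rather than an official definition.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented for this illustration
posts = [
    "Loving the new album #GFRIEND #kpop",
    "#GFRIEND is underrated",
    "Best institute in town #Nexperts",
]

# Approximate a hashtag as '#' followed by letters, digits or underscores
hashtags = []
for post in posts:
    hashtags.extend(re.findall(r"#\w+", post))

# Count how many times each hashtag was mentioned
hashtag_counts = Counter(hashtags)
print(hashtag_counts.most_common())
```

`Counter.most_common()` lists the hashtags from most to least frequent, which is usually what you want when looking for trending topics.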

<b>For example,</b><br>
Before tokenization : <code>"This is a cat"</code><br>
After tokenization : <code>['This','is','a','cat']</code><br><br>

Before tokenization : <code>"Smith is a good politician"</code><br>
After tokenization : <code>['Smith','is','a','good','politician']</code><br>
After the stopwords are removed (as is done before sentiment analysis): <code>['Smith','good','politician']</code><br><br>

Before tokenization : <code>"Alex is an average politician"</code><br>
After tokenization : <code>['Alex','is','an','average','politician']</code><br>
After the stopwords are removed (as is done before sentiment analysis): <code>['Alex','average','politician']</code><br><br>

From the <b>tokenized text</b>, we can:
<ul>
    <li>Count number of words in the text</li>
    <li>Count frequency of the word (how many times the word is repeated/present in a paragraph)</li>
</ul>
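
Both counts can be sketched with the standard library. The token list below is written by hand to stand in for the output of a tokenizer, so nothing needs to be downloaded to try it.

```python
from collections import Counter

# Tokens as they would come out of a tokenizer (written by hand here)
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'with', 'the', 'dog']

word_count = len(tokens)            # total number of tokens in the text
word_frequency = Counter(tokens)    # how many times each token occurs

print(word_count)
print(word_frequency['the'])
```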


In [2]:
import nltk

<code>'punkt'</code> is a tokenizer that divides a text into a list of sentences

[Read more about 'punkt'](https://www.kite.com/python/docs/nltk.punkt)

In [4]:
nltk.download('punkt') 
from nltk import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<h3>Tokenization in action</h3>

<b>Fun fact:</b> All 26 letters of the alphabet are in <code>test_sentence</code>!

In [5]:
test_sentence = "The quick brown fox jumps over the little lazy dog"
print(word_tokenize(test_sentence))

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'little', 'lazy', 'dog']


---
<h2>Some important interview questions</h2>

<h3>What is the difference between <code>.split()</code> and <code>word_tokenize</code>?</h3>

With <code>.split()</code>, the words <code>python,</code> and <code>nltk,</code> keep their commas: the punctuation is <b>not separated</b> from the word

However, <code>word_tokenize</code> returns <code>python</code>
<code>,</code> and <code>nltk</code> <code>,</code> as separate tokens

In [10]:
data = "This is the tutorial of python, nltk, textProcessing"

# Using .split()
split_data = data.split()
print(split_data)

['This', 'is', 'the', 'tutorial', 'of', 'python,', 'nltk,', 'textProcessing']


In [11]:
#Using word_tokenize

tokenized_data = word_tokenize(data)
print(tokenized_data)

['This', 'is', 'the', 'tutorial', 'of', 'python', ',', 'nltk', ',', 'textProcessing']


<h3>What is the difference between <code>word_tokenize</code> and <code>sent_tokenize</code>?</h3>

<code>sent_tokenize</code> is used when you want to <b>separate the sentences</b>

In [21]:
from nltk import sent_tokenize

data = "We have a little voice in our heads. Like the one you used to read this."
data2 = "My name is Fikri. I live in Malaysia. I am unemployed. Send help"

# Using word_tokenize
print(word_tokenize(data))
print(word_tokenize(data2))

['We', 'have', 'a', 'little', 'voice', 'in', 'our', 'heads', '.', 'Like', 'the', 'one', 'you', 'used', 'to', 'read', 'this', '.']
['My', 'name', 'is', 'Fikri', '.', 'I', 'live', 'in', 'Malaysia', '.', 'I', 'am', 'unemployed', '.', 'Send', 'help']


<h4>Using <code>sent_tokenize</code></h4>

When to use <code>word_tokenize</code> or <code>sent_tokenize</code>?

It is very <b>situational</b> and depends on your <b>programming needs</b>

The general rule of thumb is,
<ul>
    <li><b><code>word_tokenize</code></b> separates words</li>
    <li><b><code>sent_tokenize</code></b> separates sentences</li>
</ul>

In [22]:
print(sent_tokenize(data))
print(sent_tokenize(data2))

['We have a little voice in our heads.', 'Like the one you used to read this.']
['My name is Fikri.', 'I live in Malaysia.', 'I am unemployed.', 'Send help']


<h3>Performing tokenization on Twitter Dataset</h3>

Remember how <b>hashtags</b> and <b>names</b> play an important role in social media?

Yup, we're using them here.

In [24]:
data = "#Nexperts is the best institute"
data1 = "#GFRIEND is underrated"
word_tokenize(data)

['#', 'Nexperts', 'is', 'the', 'best', 'institute']

In [25]:
word_tokenize(data1)

['#', 'GFRIEND', 'is', 'underrated']

Why is <code>word_tokenize</code> separating the <code>#</code>?

<code>word_tokenize</code> treats <code>#</code> as punctuation, but for tweets we need the <b>'#'</b> and the text attached to it to stay together as one token.

To deal with this, we use the <b><code>TweetTokenizer</code> class</b>

In [27]:
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer() #Creating an object from our class

In [28]:
tk.tokenize(data)

['#Nexperts', 'is', 'the', 'best', 'institute']

In [29]:
tk.tokenize(data1)

['#GFRIEND', 'is', 'underrated']

---

<h1><center>Stopwords</center></h1>

Stopwords are the most common words in any natural language.

Since we are analyzing text data and building NLP models, <b>stopwords</b> might not add much value to the meaning 

<b><i><center>basically useless</center></i></b>
    
Words such as : 
    <ul>
        <li>the</li>
        <li>is</li>
        <li>in</li>
        <li>for</li>
        <li>where</li>
        <li>when</li>
        <li>to</li>
        <li>at</li>
        <li>and many more..</li>
    </ul>
    
Why are they useless?<br><br>
Consider a text string <code>"There is a pen on the table"</code>

<code>['is','a','on','the']</code> add no meaning while parsing it.

However, <code>['there','pen','table']</code> tells us what the statement is all about

<h3>Why do we need to remove stopwords?</h3>

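It depends <b>on the task you're working on</b>. Whether you remove or keep them is up to you.

In tasks like <b>text classification</b>, where text is classified into <b>different categories</b>, <b>stopwords</b> are <b>removed/excluded</b> from the text so that more focus can be given to the words which <b>define the meaning of the text</b>

Removing stopwords also leaves fewer tokens to process, so your analysis becomes more efficient. Be careful with sentiment analysis, though: the NLTK list contains negations such as <code>'not'</code>, and removing them can flip the meaning of a sentence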


In [32]:
# Stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


<h4>Which stopwords are in the package?</h4>

In [33]:
print(stopwords.words('english'))

# Any token that appears in this list is removed during filtering

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [36]:
test = "my laptop is overheating"
test1 = "my overheating is laptop"
stop_words = stopwords.words('english')

# First, separate (tokenize) the text
token = word_tokenize(test)

# Loop over the tokens and keep only those that are not stopwords,
# appending them to a new list (cleaned_token)
cleaned_token = []

for word in token:
    if word not in stop_words:
        cleaned_token.append(word)

print(cleaned_token)

['laptop', 'overheating']


<h4>But what if the stopword that we want is not in the list?</h4>

Simple: you can add your own. Only do this when your stopword <b>IS NOT ALREADY IN THE LIST</b>
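
One way this could look is sketched below. To keep the example runnable without any download, `stop_words` is a short hand-written stand-in for `stopwords.words('english')`, and the extra words in `custom_stop_words` are hypothetical.

```python
# Stand-in for stopwords.words('english'): only a handful of entries,
# written by hand so the example runs without downloading anything
stop_words = ['i', 'my', 'is', 'a', 'the', 'on']

# Hypothetical domain-specific words we also want to drop
custom_stop_words = ['laptop', 'my']

# Add only the words that are not already in the list
for word in custom_stop_words:
    if word not in stop_words:
        stop_words.append(word)

tokens = ['my', 'laptop', 'is', 'overheating']
cleaned_token = [word for word in tokens if word not in stop_words]
print(cleaned_token)
```

With the real NLTK list the idea is the same: build the list once, append your own words, then filter as before.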

---

Note : When handling text data/reviews, convert them to lower case first. The stopword check is case sensitive, so capitalized stopwords would otherwise slip through

<code>test = "My laptop is overheating"</code> will output <code>['My','laptop','overheating']</code> because <b><code>'My'</code></b> is not in <code>stopwords.words('english')</code> (only <code>'my'</code> is)

Use the string method <code>.lower()</code>

So the strategy when the dataset has mixed-case text is,
<ol>
    <li>Use .lower() to convert to lowercase</li>
    <li>Tokenize the text</li>
    <li>For loop to filter</li>
</ol>
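
The three steps can be sketched as one small function. Two things here are assumptions made so the example runs on its own: `str.split()` stands in for `word_tokenize`, and `stop_words` is a tiny hand-written stand-in for the NLTK list. The function name `remove_stopwords` is made up for this sketch.

```python
def remove_stopwords(text, stop_words):
    """Lowercase the text, split it into tokens and drop the stopwords."""
    lowered = text.lower()      # step 1: convert to lowercase
    tokens = lowered.split()    # step 2: tokenize (str.split stands in for word_tokenize)
    # step 3: keep only the tokens that are not stopwords
    return [token for token in tokens if token not in stop_words]

# Small hand-written stand-in for stopwords.words('english')
stop_words = {'i', 'my', 'is', 'a', 'the', 'on'}

result = remove_stopwords("My laptop is overheating", stop_words)
print(result)
```

Using a `set` for the stopwords makes each membership check fast, which matters once you filter thousands of reviews.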

---

<h1><center>Stemming</center></h1>

Stemming is the process of <b>reducing the morphological variants of a word to a common root/base form</b> (the stem)

The programs that do this are known as <b>stemming algorithms</b> or <b>stemmers</b>

Ideally, <code>['chocolates','chocolatey','choco']</code> will be reduced to <code>['chocolate']</code>

and <code>['retrieval','retrieved','retrieves']</code> to <code>['retrieve']</code>, although a real stemmer often returns a truncated stem that is not a dictionary word

In [38]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

words = ['program','programs','programmer','programming','programmers']

#Applying stemming

for w in words:
    print(w," : ",ps.stem(w))

program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm


---
<h1><center>Lemmatization</center></h1>

Lemmatization reduces a word to its <b>base dictionary form</b> (the lemma), similar to stemming

In [49]:
nltk.download('wordnet')
# Without 'omw-1.4' a weird error can appear (should not happen if using Google Colab)
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\omw-1.4.zip.


True

In [50]:
from nltk.stem import WordNetLemmatizer
wnlem = WordNetLemmatizer()

In [52]:
#Lemmatization in action

print("rocks : ",wnlem.lemmatize("rocks"))
print("cries : ",wnlem.lemmatize("cries"))

rocks :  rock
cries :  cry


<h3>How is Lemmatization different from Stemming?</h3>

Stemming,
<ul>
    <li>part of the word is chopped off at the tail end to arrive at the stem (it removes the suffix)</li>
</ul>

Example, 

<code>words = ['program','programs','programmer','programming','programmers']</code>

After stemming will output, 

program  :  program<br>
programs  :  program<br>
programmer  :  programm<br>
programming  :  program<br>
programmers  :  programm<br>

A <b>stemming algorithm</b> doesn't actually know the meaning of the word in the language it belongs to. It might also change the meaning of the word!

<b>Lemmatization</b> has this knowledge!

<b><center>Lemma vs Stemma</center></b>

In [53]:
#Stemma
word = 'studies'
print(ps.stem(word))

studi


In [55]:
#Lemma
print(wnlem.lemmatize(word))

study


| Stemming | Lemmatization |
| :- | :- |
| Faster, because it chops the word without considering context | Slower, because it looks the word up (and can use its part of speech) before converting |
| Rule-based approach | Dictionary-based approach |
| Less accurate | More accurate compared to stemming |
| May produce a non-existent word after conversion | Gives a dictionary word when converting to the root form |
| Preferred when the meaning of the word is not important for analysis, e.g. spam detection | Recommended when the meaning of the word is important for analysis, e.g. question answering |
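
The rule-based vs dictionary-based difference can be illustrated with two toy functions. These are deliberately simplistic and are NOT the Porter algorithm or WordNet: the suffix list and the lookup table are invented for this sketch.

```python
# Toy rule-based "stemmer": strips a few common suffixes and knows nothing
# about the language. Not the Porter algorithm, only the general idea.
SUFFIXES = ['ies', 'es', 's', 'ing', 'ed']

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left over
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Toy dictionary-based "lemmatizer": looks the word up in a hand-written table.
# WordNet plays this role in NLTK; the entries below are invented.
LEMMA_TABLE = {'studies': 'study', 'cries': 'cry', 'rocks': 'rock'}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ['studies', 'cries', 'rocks']:
    print(w, ':', toy_stem(w), '|', toy_lemmatize(w))
```

The stemmer is fast because it only applies string rules, but it can return fragments that are not words. The lemmatizer returns real words, but only for entries its dictionary knows about.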

---


<h3><center>TextBlob.correct() Method</center></h3>

When to use?

When you are doing analysis and you want clean words (no spelling mistakes). <code>TextBlob.correct()</code> will try to correct the spelling of our words according to its dictionary


In [58]:
#!pip install textblob --Run once!-- Not needed in Colab/Only other IDEs
from textblob import TextBlob

In [63]:
test = TextBlob('XYZ is a good compny and alays valule ttheir employees.')
test = test.correct()
print(test)

XYZ is a good company and always value their employees.


In [67]:
test2 = TextBlob('AESPA needs morer recogniton from teh world.')
test2= test2.correct()
print(test2)

AESPA needs more recognition from the world.


After importing the class, pass the user reviews into <code>TextBlob()</code>.
You can pass an entire dataset in there too.

When is this useful?<br>

User reviews are often not written in perfect English: they contain spelling mistakes and short forms. <code>.correct()</code> will try to autocorrect them!

<h4>Can you apply the method to other languages too?</h4>

Suppose that you are doing a sentiment analysis for a Malaysian company.

It might be possible that <b>reviews are mixed/not in English.</b>

<code>TextBlob()</code> can also help with the Malay language.

In [69]:
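sample_text = 'Adakah saudara faham'
check = TextBlob(sample_text)
# Note: detect_language() and translate() call an online translation service;
# they were deprecated in TextBlob 0.16.0 and removed in later releases
print(check.detect_language())

# 'ms' is the language code for Malay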

ms


<h4>Performing Translation from MS to Target Language</h4>

In [72]:
print(check.translate(to='en'))

Do you understand


<h1><center>VADER</center></h1>

Valence<br>Aware<br>Dictionary<br>and sEntiment<br>Reasoner

is a <b>rule-based sentiment analysis tool</b> specifically attuned to <b>sentiments expressed in social media</b>

Used for sentiment analysis of text that has both polarities (negative/positive)

Used to quantify <b>how positive or negative</b> the emotion in the text is, and also <b>the intensity of that emotion</b>

| Advantages
| :-
| Does not require any training data
| Understands the sentiment of text containing emoticons, slang, conjunctions, capitalized words, punctuation etc. very well
| Works excellently on social media text
 

In [84]:
#!pip install nltk[twitter] --RUN ONCE ONLY--


The pip command above installs Twython, which NLTK's sentiment tools look for when VADER is imported


<h4>Downloading dictionary for VADER</h4>

In [78]:
import nltk
nltk.download('vader_lexicon') # Lexicon -- is a dictionary

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...


True

In [79]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

<h2>Using VADER</h2>

In [80]:
sample = 'I really love ice cream'
vader.polarity_scores(sample)

{'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.6697}

<code>'neg'</code> shows the negative score<br>
<code>'neu'</code> shows the neutral score<br>
<code>'pos'</code> shows the positive score<br>
<code>'compound'</code> shows the compound score (a single overall score)

<h4>Compound Score</h4>

The compound score is the <b>sum of the valence scores of the words in the text</b> (positive, negative and neutral), which is then normalized to lie between <b>-1 (most extreme negative)</b> and <b>+1 (most extreme positive)</b>

The closer to +1, the higher the positivity of the text, and <b>vice versa</b>
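
The normalization step can be sketched as follows. To the best of my knowledge the reference VADER implementation uses the formula `x / sqrt(x*x + alpha)` with `alpha = 15`, but treat both the formula and the constant as assumptions here. The input totals are made-up numbers, not real VADER output.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded sum of valence scores into the range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Hypothetical summed valence scores, chosen only for illustration
for total in [-8.0, -1.0, 0.0, 1.0, 8.0]:
    print(total, '->', round(normalize(total), 4))
```

Whatever the size of the raw sum, the result never reaches exactly -1 or +1, and a raw sum of 0 stays at 0.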

In [82]:
# Negative review example
sample = "I really don't like mint ice cream"
vader.polarity_scores(sample)

{'neg': 0.323, 'neu': 0.677, 'pos': 0.0, 'compound': -0.3374}

In [85]:
sample = "Blackpink's songs are okay"
vader.polarity_scores(sample)

{'neg': 0.0, 'neu': 0.612, 'pos': 0.388, 'compound': 0.2263}

<h2>Applying VADER on real dataset</h2>

In [87]:
import pandas as pd
data = pd.read_csv('amazonreviews.tsv',sep='\t')
data.head()
# We will be focusing on the reviews

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


<h3>Performing Exploratory Data Analysis</h3>

1. How many reviews are we going to deal with?

In [89]:
data.shape
#10000 reviews and 2 columns

(10000, 2)

2. Are there any missing values? Checking missing values..

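<code>.dropna()</code> will remove all rows that contain missing values

<code>inplace=True</code> applies the removal to <code>data</code> itself, so the rows are removed <b>PERMANENTLY</b>

Without <code>inplace=True</code>, <code>.dropna()</code> returns a new DataFrame and leaves <code>data</code> unchanged, so the result is lost unless you assign it to a variable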

In [90]:
data.dropna(inplace=True)

In [91]:
data.shape

(10000, 2)

Shape is still the same before & after <code>.dropna()</code>. 

So there are <b>no missing values</b>.

Good news!
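
A more direct check is to count the missing values per column instead of comparing shapes. The sketch below uses a tiny hand-made DataFrame (not the Amazon data) and also shows what `.dropna()` does when `inplace=True` is left out.

```python
import pandas as pd

# Tiny hand-made frame with one missing review (not the Amazon data)
df = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'review': ['Great sound', None, 'Love it'],
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Without inplace=True, dropna() returns a new frame and df is untouched
cleaned = df.dropna()
print(df.shape, cleaned.shape)
```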

3. What are the columns in this dataset?

In [92]:
data.columns

Index(['label', 'review'], dtype='object')

In summary we are dealing with,
<ul>
    <li>10000 amazon reviews</li>
    <li>No missing values</li>
    <li>The column names are <b>label</b> and <b>review</b></li>
</ul>

Additionally, you can check how many <b>positive or negative reviews</b> there are

In [93]:
data['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

So, we are dealing with

<b>5097 negative reviews</b><br>
<b>4903 positive reviews</b>

---
<h2>Lambda Functions</h2>

also known as anonymous functions.
<ul>
    <li>can take any number of <b>arguments</b>, but can only have <b>one expression</b>
</ul>



In [94]:
x = lambda x : x + 10
print(x(5))

15


<code>x1 = lambda x2 : x2 + 10</code>

<code>x1</code> is the name the function is assigned to<br>
<code>x2</code> is the argument<br>
<code>x2 + 10</code> is the expression<br>

So when we call <code>x(5)</code> on <code>x = lambda x : x + 10</code>

the argument is replaced by 5 and the expression becomes,

<code>5 + 10</code>, which evaluates to <code>15</code>

<b>Remember, you can pass any number of arguments in a lambda</b>

In [95]:
x = lambda a,b,c : a + b + c
print(x(10,5,6))

21


<code>x = lambda a,b,c : a + b + c</code>

Since <code>x(10,5,6)</code> is passed,

It will become <code> x = lambda a,b,c : 10 + 5 + 6 </code>

Output : <code>21</code>
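
To make the idea concrete, here is a small sketch showing that a lambda behaves like a one-line regular function, and where it is typically used. The names `add_ten`, `add_ten_def` and `words` are made up for the example.

```python
# A lambda assigned to a name ...
add_ten = lambda x: x + 10

# ... behaves like this regular function
def add_ten_def(x):
    return x + 10

print(add_ten(5), add_ten_def(5))

# Lambdas are most useful when passed straight to another function,
# for example as a sort key (here: sort words by their length)
words = ['politician', 'is', 'average']
sorted_words = sorted(words, key=lambda w: len(w))
print(sorted_words)
```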

----

<h2>Applying Lambda function on our dataset</h2>

Since we have 2 columns in our dataset, <code>['label','review']</code>

We only need <code>['review']</code> for our analysis

<b>But why lambda?</b>

If we don't use <code>.apply()</code> with a lambda,

we need to use a <code>for</code> loop and collect the results into a <code>list</code>.. etc, many steps. Very inefficient

<b>What is happening with the lambda?</b>

Each <b>review</b> in <code>data['review']</code> is passed into <code>vader.polarity_scores</code> and scored. <br>The results are then stored in a new column named <code>data['scores']</code>

In [97]:
data['scores'] = data['review'].apply(lambda review : vader.polarity_scores(review))

In [98]:
data

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."
...,...,...,...
9995,pos,A revelation of life in small town America in ...,"{'neg': 0.017, 'neu': 0.846, 'pos': 0.136, 'co..."
9996,pos,Great biography of a very interesting journali...,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp..."
9997,neg,Interesting Subject; Poor Presentation: You'd ...,"{'neg': 0.084, 'neu': 0.754, 'pos': 0.162, 'co..."
9998,neg,Don't buy: The box looked used and it is obvio...,"{'neg': 0.091, 'neu': 0.909, 'pos': 0.0, 'comp..."


<h3>Retrieving review and scores</h3>

In [99]:
data[['review','scores']]

Unnamed: 0,review,scores
0,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."
...,...,...
9995,A revelation of life in small town America in ...,"{'neg': 0.017, 'neu': 0.846, 'pos': 0.136, 'co..."
9996,Great biography of a very interesting journali...,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp..."
9997,Interesting Subject; Poor Presentation: You'd ...,"{'neg': 0.084, 'neu': 0.754, 'pos': 0.162, 'co..."
9998,Don't buy: The box looked used and it is obvio...,"{'neg': 0.091, 'neu': 0.909, 'pos': 0.0, 'comp..."


After all that, the most important score is the <b>compound score</b>

Closer to <b>+1, the higher the positivity of the text</b> and <b>vice versa</b>

<h3>Compound scores of dataset</h3>

The lambda function receives each dictionary in <code><b>['scores']</b></code> as <code><b>scores_dict</b></code>,

and from <code><b>scores_dict['compound']</b></code> only the value of <code>{'compound' : value}</code> is <b>taken and added</b>

into a new column <code>data['compound']</code>

In [101]:
data['compound'] = data['scores'].apply(lambda scores_dict: scores_dict['compound'])
data

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781
...,...,...,...,...
9995,pos,A revelation of life in small town America in ...,"{'neg': 0.017, 'neu': 0.846, 'pos': 0.136, 'co...",0.9610
9996,pos,Great biography of a very interesting journali...,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp...",0.9544
9997,neg,Interesting Subject; Poor Presentation: You'd ...,"{'neg': 0.084, 'neu': 0.754, 'pos': 0.162, 'co...",0.9102
9998,neg,Don't buy: The box looked used and it is obvio...,"{'neg': 0.091, 'neu': 0.909, 'pos': 0.0, 'comp...",-0.3595


<h3>Retrieving review and compound info</h3>

In [102]:
data[['review','compound']]

Unnamed: 0,review,compound
0,Stuning even for the non-gamer: This sound tra...,0.9454
1,The best soundtrack ever to anything.: I'm rea...,0.8957
2,Amazing!: This soundtrack is my favorite music...,0.9858
3,Excellent Soundtrack: I truly like this soundt...,0.9814
4,"Remember, Pull Your Jaw Off The Floor After He...",0.9781
...,...,...
9995,A revelation of life in small town America in ...,0.9610
9996,Great biography of a very interesting journali...,0.9544
9997,Interesting Subject; Poor Presentation: You'd ...,0.9102
9998,Don't buy: The box looked used and it is obvio...,-0.3595


<h3>Converting float values of compound to positive and negatives</h3>

Each <code>data['compound']</code> score is passed into the variable <code>c</code>, one by one.

Then if <code>compound score, c </code> is <b>>= 0</b> it will be labelled as <b>'pos' or positive</b>

else, it will be labelled as <b>'neg' or negative</b>


In [107]:
data['sentiment'] = data['compound'].apply(lambda c: 'pos' if c>= 0 else 'neg')
data

Unnamed: 0,label,review,scores,compound,sentiment
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
...,...,...,...,...,...
9995,pos,A revelation of life in small town America in ...,"{'neg': 0.017, 'neu': 0.846, 'pos': 0.136, 'co...",0.9610,pos
9996,pos,Great biography of a very interesting journali...,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp...",0.9544,pos
9997,neg,Interesting Subject; Poor Presentation: You'd ...,"{'neg': 0.084, 'neu': 0.754, 'pos': 0.162, 'co...",0.9102,pos
9998,neg,Don't buy: The box looked used and it is obvio...,"{'neg': 0.091, 'neu': 0.909, 'pos': 0.0, 'comp...",-0.3595,neg


<h3>Retrieving review and sentiment</h3>

In [105]:
data[['review','sentiment']]

Unnamed: 0,review,sentiment
0,Stuning even for the non-gamer: This sound tra...,pos
1,The best soundtrack ever to anything.: I'm rea...,pos
2,Amazing!: This soundtrack is my favorite music...,pos
3,Excellent Soundtrack: I truly like this soundt...,pos
4,"Remember, Pull Your Jaw Off The Floor After He...",pos
...,...,...
9995,A revelation of life in small town America in ...,pos
9996,Great biography of a very interesting journali...,pos
9997,Interesting Subject; Poor Presentation: You'd ...,pos
9998,Don't buy: The box looked used and it is obvio...,neg


A compound score <b>closer to +1 is more positive, and a score below 0 is negative</b>
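
The notebook uses a two-way split because the dataset only has `pos` and `neg` labels. If a neutral class is wanted, a commonly suggested convention for VADER is to treat scores between -0.05 and +0.05 as neutral. The sketch below assumes that threshold, and the function name `label_sentiment` is made up.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg' or 'neu'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

# Compound scores from the earlier examples in this notebook, plus 0.0
for c in [0.6697, -0.3374, 0.2263, 0.0]:
    print(c, label_sentiment(c))
```

With pandas, the same function could be passed to `.apply()` in place of the lambda.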

<h3>Comparing label with sentiment</h3>

In [108]:
data[['label','sentiment']].head(10)

Unnamed: 0,label,sentiment
0,pos,pos
1,pos,pos
2,pos,pos
3,pos,pos
4,pos,pos
5,pos,pos
6,neg,neg
7,pos,pos
8,pos,pos
9,pos,pos


In [109]:
data['sentiment'].value_counts()

pos    6936
neg    3064
Name: sentiment, dtype: int64

<h2><center>Before VADER, we had</center></h2>

<b><center>5097 negative reviews</center></b><br>
<b><center>4903 positive reviews</center></b><br>

<h2><center>After VADER, now we have</center></h2>
<b><center>3064 negative reviews</center></b><br>
    <b><center>6936 positive reviews</center></b>
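
The counts alone do not tell us how often VADER agrees with the original label. A rough way to measure that is sketched below on ten hand-made rows standing in for `data[['label', 'sentiment']]`. The rows are invented, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Ten hand-made rows standing in for data[['label', 'sentiment']]
df = pd.DataFrame({
    'label':     ['pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg'],
    'sentiment': ['pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg'],
})

# Share of rows where VADER's label agrees with the original label
accuracy = (df['label'] == df['sentiment']).mean()
print(accuracy)

# Cross-tabulation: rows are the original labels, columns are VADER's labels
confusion = pd.crosstab(df['label'], df['sentiment'])
print(confusion)
```

On the real data, the off-diagonal cells of the cross-tabulation would show how many negative reviews VADER marked as positive, and the other way round.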
    
    
---
---

<h4>Video timestamps</h4>

Tokenization --- Stemming : sentiment_analysis1<br><br>
Lemmatization --- TextBlob : sentiment_analysis2<br><br>
VADER part1 : sentiment_analysis3<br><br>
Applying VADER --- lambda data compound : sentiment_analysis4<br><br>
Filtering between positive & negative reviews : sentiment_analysis5

<b><center>This notebook is using Python 3.8</center></b>