<img style="float: left;" src="pic2.png">

### Sridhar Palle, Ph.D, spalle@emory.edu (Applied ML & DS with Python Program)

# Unsupervised Models

**Import the libraries and dependencies**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
import nltk
import contractions
%matplotlib inline

In [2]:
#nltk.download('all', halt_on_error=False) # do this only once

**Let's load the data set and store it in `imdb_big`**

In [3]:
imdb_big = pd.read_csv('movie_reviews.csv')

In [4]:
imdb_big.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [5]:
imdb_big.shape

(50000, 2)

In [6]:
imdb_big['review'].describe()

count                                                 50000
unique                                                49582
top       Loved today's show!!! It was a variety and not...
freq                                                      5
Name: review, dtype: object

## 1. Text Preprocessing

**Text normalization (preprocessing) steps**

- Converting to lowercase
- Removing HTML tags
- Expanding contractions
- Removing punctuation
- Removing stop words
- Stemming or lemmatization

We have already defined functions that perform these steps. They all live in the
`Text_Preprocessing.py` file, so we can import them directly from there.

In [7]:
from Text_Preprocessing import lower_case,html_parser,replace_contractions
from Text_Preprocessing import remove_special, remove_stopwords, word_stem

# remember we are importing from a .py file, not from a .ipynb notebook
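The contents of `Text_Preprocessing.py` are not shown in this notebook, so the following is only a rough sketch of what such helpers might look like. Every name prefixed with `toy_`, the tiny stop word set, and the contraction table are assumptions made for illustration; the real functions most likely rely on BeautifulSoup, `contractions`, and NLTK, and they also lemmatize, which this sketch skips.

```python
import re

# Hypothetical, deliberately tiny lookup tables (not the ones used by the real module)
TOY_STOPWORDS = {"a", "an", "the", "and", "is", "it", "of", "to", "this", "was"}
TOY_CONTRACTIONS = {"haven't": "have not", "it's": "it is", "don't": "do not"}

def toy_lower_case(text):
    return text.lower()

def toy_html_parser(text):
    # crude regex tag stripper; a real HTML parser is more robust
    return re.sub(r"<[^>]+>", " ", text)

def toy_replace_contractions(text):
    for short, full in TOY_CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def toy_remove_special(text):
    # keep only lowercase letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", " ", text)

def toy_remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)

def toy_preprocess(text):
    # apply the steps in the same order as the list in this section
    for step in (toy_lower_case, toy_html_parser, toy_replace_contractions,
                 toy_remove_special, toy_remove_stopwords):
        text = step(text)
    return text

sample = "It's a wonderful little production. <br /><br />I haven't seen better!"
cleaned = toy_preprocess(sample)
print(cleaned)
```

The order matters in this sketch: contractions are expanded before punctuation is stripped, otherwise the apostrophes would already be gone.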

**Let's preprocess the reviews with the functions imported above**

In [8]:
def text_preprocess(text):
    text = lower_case(text) # convert to lower case
    text = html_parser(text) # remove html tags
    text = replace_contractions(text) # expand contractions, e.g. haven't -> have not
    text = remove_special(text) # remove special characters @, #, %, $ etc..
    text = remove_stopwords(text) # remove stop words. Ex: and, the
    text = word_stem(text, 'lemmatize') # stem or lemmatize
    return text

In [9]:
prep_review = []
for review in imdb_big['review']:
    prep_review.append(text_preprocess(review))
    
imdb_big['prep_review'] = prep_review
imdb_big.head()

Unnamed: 0,review,sentiment,prep_review
0,One of the other reviewers has mentioned that ...,positive,one reviewer mentioned watching 1 oz episode h...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...


## 2. Lexicon Models

### 2.1 AFINN Lexicon

**Let's import the Afinn library**

In [10]:
#pip install afinn on anaconda command terminal. Do this only once.

In [11]:
from afinn import Afinn

In [12]:
afn = Afinn(emoticons=True)

**Let's check how afn works on a sample text**

In [16]:
afn.score('I this stupid movie')

-2.0

In [17]:
afn.score('Data science has plenty of jobs')

0.0
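The two scores above are easier to read once you picture a lexicon model as a word-to-valence lookup whose hits are summed. The sketch below is a toy stand-in, not the Afinn implementation: `TOY_VALENCES` and `toy_score` are hypothetical names, the table holds only a few illustrative entries, and the real library tokenizes with regular expressions rather than a plain `split`.

```python
# Toy valence table. Only 'stupid' = -2 is implied by the output above;
# the other entries are illustrative values chosen for this sketch.
TOY_VALENCES = {"stupid": -2, "wonderful": 4, "bad": -3, "good": 3}

def toy_score(text, valences=TOY_VALENCES):
    # sum the valence of every known word; unknown words contribute 0
    words = text.lower().split()
    return float(sum(valences.get(word, 0) for word in words))

print(toy_score("I this stupid movie"))              # -2.0
print(toy_score("Data science has plenty of jobs"))  # 0.0
```

Because the score is a plain sum, it is not normalized by length: a long review containing many valenced words can reach a large absolute value.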

**Let's check how it works on a few sample reviews**

In [21]:
imdb_big['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [18]:
print (imdb_big['prep_review'][0])   # 0th review
afn.score(imdb_big['prep_review'][0])

one reviewer mentioned watching 1 oz episode hooked right exactly happened first thing struck oz brutality unflinching scene violence set right word go trust show faint hearted timid show pull punch regard drug sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inwards privacy high agenda em city home many aryan muslim gangsta latino christian italian irish scuffle death stare dodgy dealing shady agreement never far away would say main appeal show due fact go show would dare forget pretty picture painted mainstream audience forget charm forget romance oz mess around first episode ever saw struck nasty surreal could say ready watched developed taste oz got accustomed high level graphic violence violence injustice crooked guard sold nickel inmate kill order get away well mannered middle class inmate turned prison bitch due lack street skill prison experience 

-38.0

In [19]:
print (imdb_big['prep_review'][1])   # review at index 1
afn.score(imdb_big['prep_review'][1])

wonderful little production filming technique unassuming old time bbc fashion give comforting sometimes discomforting sense realism entire piece actor extremely well chosen michael sheen got polari voice pat truly see seamless editing guided reference williams diary entry well worth watching terrificly written performed piece masterful production one great master comedy life realism really come home little thing fantasy guard rather use traditional wouldream technique remains solid disappears play knowledge sens particularly scene concerning orton halliwell set particularly flat halliwell mural decorating every surface terribly well done


12.0

**Let's calculate the polarity scores predicted by the afinn lexicon**

In [20]:
# a list comprehension is used here; a for loop, or apply as shown in a later cell, works equally well
polarity = [afn.score(review) for review in imdb_big['prep_review']]
polarity[0:5]

[-38.0, 12.0, 23.0, -8.0, 29.0]

**Let's create a new column with the afinn polarity values**

In [22]:
imdb_big['afinn_polarity'] = polarity
imdb_big.head(5)

Unnamed: 0,review,sentiment,prep_review,afinn_polarity
0,One of the other reviewers has mentioned that ...,positive,one reviewer mentioned watching 1 oz episode h...,-38.0
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...,12.0
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...,23.0
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...,-8.0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...,29.0


In [23]:
# We could achieve the same with 
# imdb_big['prep_review'].apply(afn.score)

**Let's create a new column with sentiment predictions derived directly from the polarity**

In [24]:
imdb_big['afinn_senti'] = ['positive' if pol > 0 else 'negative' for pol in imdb_big['afinn_polarity']]
imdb_big.head(3)

Unnamed: 0,review,sentiment,prep_review,afinn_polarity,afinn_senti
0,One of the other reviewers has mentioned that ...,positive,one reviewer mentioned watching 1 oz episode h...,-38.0,negative
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...,12.0,positive
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...,23.0,positive


In [25]:
# The following code achieves the same result with a traditional for loop
# senti = []
# for pol in imdb_big['afinn_polarity']:
#     if pol > 0:
#         senti.append('positive')
#     else:
#         senti.append('negative')
# imdb_big['afinn_senti'] = senti

**Let's evaluate the performance of the model**

In [26]:
# In practice we usually cannot evaluate unsupervised models, because the actual target values are unknown.
# Here, however, the data set does include the target variable (sentiment).

In [27]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [28]:
confusion_matrix(imdb_big['sentiment'], imdb_big['afinn_senti'])

array([[13735, 11265],
       [ 3597, 21403]], dtype=int64)

In [29]:
accuracy_score(imdb_big['sentiment'], imdb_big['afinn_senti'])

0.70276
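The accuracy alone hides how the errors are distributed, so it can help to unpack the confusion matrix by hand. The sketch below copies the four counts from the output above; scikit-learn sorts the labels, so index 0 is `negative` and index 1 is `positive`, with rows as actual classes and columns as predicted classes. The variable names are just for illustration.

```python
# Counts copied from the confusion matrix printed above
cm = [[13735, 11265],   # actual negative: predicted negative, predicted positive
      [3597, 21403]]    # actual positive: predicted negative, predicted positive

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all reviews labelled correctly
recall_pos = tp / (tp + fn)         # share of positive reviews that were found
recall_neg = tn / (tn + fp)         # share of negative reviews that were found
precision_pos = tp / (tp + fp)      # share of 'positive' predictions that are right

print(f"accuracy      : {accuracy:.5f}")
print(f"recall (pos)  : {recall_pos:.5f}")
print(f"recall (neg)  : {recall_neg:.5f}")
print(f"precision(pos): {precision_pos:.5f}")
```

Read this way, the lexicon recovers most positive reviews but labels a large share of negative reviews as positive, which is what pulls the overall accuracy down.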

### 2.2 Vader Lexicon

In [30]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

**Let's see how the Vader lexicon works with a sample text**

In [31]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores('Secrets of success: Positive attitude, faith, goals, time, and effort')

{'neg': 0.0, 'neu': 0.409, 'pos': 0.591, 'compound': 0.8779}

It generates not only a compound polarity score, but also separate scores for negativity, neutrality, and positivity.
The recommendation from the lexicon's authors is to treat
* compound polarity >= 0.05 as positive
* -0.05 < compound polarity < 0.05 as neutral
* compound polarity <= -0.05 as negative
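A rule of this shape can be sketched as a small helper that maps a compound score to a label. `label_from_compound` is a hypothetical name, and the cutoff is left as an argument so that different values can be compared; the scores fed in are the compound values printed in this section.

```python
def label_from_compound(compound, cutoff):
    # symmetric band around zero: scores inside (-cutoff, cutoff) count as neutral
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"

# compound scores taken from the outputs in this section
samples = {"secrets of success": 0.8779, "lol": 0.4215, "wtf": -0.5859}

for cutoff in (0.05, 0.5):
    labels = {name: label_from_compound(score, cutoff) for name, score in samples.items()}
    print(cutoff, labels)
```

Note how the choice of cutoff matters for borderline scores: `lol` at 0.4215 counts as positive with a narrow neutral band but falls into the neutral class with a wide one.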


**Vader can also estimate scores for slang, emoticons, etc.**

In [32]:
sia.polarity_scores('lol')

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}

In [33]:
sia.polarity_scores('wtf')

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5859}

In [None]:
# Because it can find scores even for non-standard words, we will do either
# ...minimal text preprocessing or none before passing the reviews to this lexicon

**Let's calculate the Vader polarity scores for the raw imdb reviews**

In [None]:
polarity_vader = [sia.polarity_scores(review)['compound'] for review in imdb_big['review']]   
polarity_vader[0:5]

**Let's add the Vader polarity scores as a new column, then derive predictions from them with a threshold of 0**

In [None]:
imdb_big['vader_polarity'] = polarity_vader
imdb_big.head(3)

In [None]:
# the data set has only positive/negative labels, so we split at 0 rather than keep a neutral band
imdb_big['vader_senti'] = ['positive' if pol > 0 else 'negative' for pol in imdb_big['vader_polarity']]
imdb_big.head(3)

**Let's evaluate the performance of the Vader model**

In [None]:
# In practice we usually cannot evaluate unsupervised models, because the actual target values are unknown.
# Here, however, the data set does include the target variable (sentiment).

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
confusion_matrix(imdb_big['sentiment'], imdb_big['vader_senti'])

In [None]:
accuracy_score(imdb_big['sentiment'], imdb_big['vader_senti'])