# Rule Based Approach

Two methods:
- VADER
    - a powerful, easy-to-use rule-based classifier
- SentiWordNet
    - performs reasonably well and can also be used for building classifiers

Create a `data` folder in the project directory.   
Download the data and extract it into that folder (`<project directory>/data`).   
Download the data here: <a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz">rt-polaritydata.tar.gz</a>
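
The two manual steps above can also be scripted. This is a minimal sketch using only the standard library; the helper names `extract_archive` and `download_and_extract` and the default `data` folder are assumptions made for illustration and are not part of the original notebook.

```python
import os
import tarfile
import urllib.request

DATA_URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"


def extract_archive(archive_path, data_dir="data"):
    """Unpack a .tar.gz archive into data_dir, creating the folder if needed."""
    os.makedirs(data_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(data_dir)
    # Return the folder listing so the caller can see what was unpacked.
    return sorted(os.listdir(data_dir))


def download_and_extract(url=DATA_URL, data_dir="data"):
    """Download the archive next to the data folder (if missing) and unpack it."""
    archive_path = os.path.basename(url)  # hypothetical choice: keep the archive in the working directory
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, data_dir)
```

Calling `download_and_extract()` once from the project directory should leave the review files under `data/`; check the returned listing against the file names used in the next cells.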

In [2]:
import os
from string import punctuation 
from nltk.sentiment import vader 
from nltk.corpus import stopwords
from nltk.corpus import sentiwordnet as swn

In [3]:
# paths are relative to the project directory (the data was extracted to ./data)
posReviewFileName = "data/rt-polaritydata/rtpolarity.pos"
negReviewFileName = "data/rt-polaritydata/rtpolarity.neg"

In [7]:
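# the review files are not valid UTF-8, so read them as Latin-1; one review per line
with open(posReviewFileName, 'r', encoding='latin-1') as f:
    positiveReviews = f.readlines()
with open(negReviewFileName, 'r', encoding='latin-1') as f:
    negativeReviews = f.readlines()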

### VADER

In [8]:
sia = vader.SentimentIntensityAnalyzer()  # instantiate the analyzer from the vader module (needs the NLTK vader_lexicon resource)

In [6]:
def vaderSentiment(review):
    return sia.polarity_scores(review)['compound']

In [11]:
sampPosReview = "This is a good restaurant"
sampNegReview = "This is not a good restaurant"

*************************

In [17]:
# Trial :

print(vaderSentiment(sampPosReview) , vaderSentiment(sampNegReview))

0.4404 -0.3412
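
The compound score is a single normalized value between -1 and 1. A commonly cited convention for turning it into a label uses a small neutral band of ±0.05 around zero; the rest of this notebook simply uses 0 as the cutoff. The sketch below illustrates that convention. The function name `label_from_compound` and the `neutral_band` parameter are hypothetical.

```python
def label_from_compound(compound, neutral_band=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= neutral_band:
        return "positive"
    if compound <= -neutral_band:
        return "negative"
    return "neutral"


# The two trial scores printed above.
print(label_from_compound(0.4404), label_from_compound(-0.3412))
# prints: positive negative
```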


***************************************

In [18]:
# The following function takes a scoring function as input and returns a dictionary of scores, so that we can plug in any sentiment calculator
def getReviewSentiments(sentimentCalculatorFunc):
    negReviewResult = [sentimentCalculatorFunc(i) for i in negativeReviews]
    posReviewResult = [sentimentCalculatorFunc(i) for i in positiveReviews]
    return {'positive_results' : posReviewResult, 'negative_results' : negReviewResult}

In [21]:
# Measure accuracy: a positive review is correct when its score is > 0,
# a negative review when its score is < 0 (a score of exactly 0 is wrong for both)
def runDiagnostics(reviewResult):
    
    posReviewRslt = reviewResult['positive_results']
    negReviewRslt = reviewResult['negative_results']
    
    # number of correctly classified reviews per class
    numTruePositive = sum(x > 0 for x in posReviewRslt)
    numTrueNegative = sum(x < 0 for x in negReviewRslt)
    
    # fraction correct per class
    pctTruePositive = numTruePositive/len(posReviewRslt)
    pctTrueNegative = numTrueNegative/len(negReviewRslt)
    
    # overall accuracy in percent
    total       = len(posReviewRslt)+len(negReviewRslt)
    overall_Acc = ((numTruePositive+numTrueNegative)*100)/total
    
    #display results
    print("Accuracy of positive reviews : " + "%.2f" % (pctTruePositive*100) +"%" )
    print("Accuracy of negative reviews : " + "%.2f" % (pctTrueNegative*100) +"%" )
    print("Overall Accuracy : " + "%.2f" % overall_Acc +"%" )

Test: VADER

In [22]:
runDiagnostics(getReviewSentiments(vaderSentiment))

Accuracy of positive reviews : 69.44%
Accuracy of negative reviews : 40.09%
Overall Accuracy : 54.76%
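
To make the arithmetic behind these three numbers concrete, here is a small sketch on made-up scores. The function `accuracy_report` and the toy lists are hypothetical; the counting rule is the one used above (score > 0 for positive reviews, score < 0 for negative reviews). Note that a score of exactly 0 counts as a miss in both classes, which may be part of why the per-class accuracies are low.

```python
def accuracy_report(pos_scores, neg_scores):
    """Return (positive accuracy, negative accuracy, overall accuracy) as fractions."""
    true_pos = sum(s > 0 for s in pos_scores)   # positive reviews scored above 0
    true_neg = sum(s < 0 for s in neg_scores)   # negative reviews scored below 0
    pos_acc = true_pos / len(pos_scores)
    neg_acc = true_neg / len(neg_scores)
    overall = (true_pos + true_neg) / (len(pos_scores) + len(neg_scores))
    return pos_acc, neg_acc, overall


# Toy scores, not taken from the dataset.
toy_pos = [0.6, 0.0, 0.3, -0.2]
toy_neg = [-0.5, 0.0, 0.4, 0.0]
print(accuracy_report(toy_pos, toy_neg))
# prints: (0.5, 0.25, 0.375)
```

When both classes have the same number of reviews, as in the toy lists, the overall accuracy is simply the mean of the two per-class accuracies.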


### SENTIWORDNET

In [28]:
# use a separate name so the imported stopwords corpus is not overwritten when the cell is re-run
stopWords = set(stopwords.words('english') + list(punctuation))

In [29]:
def senti_word_net(review):
    reviewPolarity = 0.0
    for word in review.lower().split():
        if word in stopWords:
            continue
        weight = 0.0        # summed (pos - neg) over the word's polarized senses
        numMeanings = 0     # number of senses whose pos and neg scores differ
        try:
            for meaning in swn.senti_synsets(word):
                diff = meaning.pos_score() - meaning.neg_score()
                if diff != 0:
                    weight += diff
                    numMeanings += 1
        except Exception:
            # ignore tokens that SentiWordNet cannot look up
            pass
        
        if numMeanings > 0:
            # add the word's average polarity to the review total
            reviewPolarity += (weight/numMeanings)
    return reviewPolarity
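
The per-word step of this function can be illustrated without NLTK. For each word, it averages the difference between positive and negative scores over the senses where the two differ, then sums those averages over the review. The sketch below uses hypothetical `(pos_score, neg_score)` pairs, not real SentiWordNet values, and the helper name `word_polarity` is an assumption.

```python
def word_polarity(sense_scores):
    """Average (pos - neg) over the senses that are not perfectly balanced."""
    diffs = [pos - neg for pos, neg in sense_scores if pos != neg]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical (pos_score, neg_score) pairs for three senses of one word.
toy_senses = [(0.75, 0.0), (0.5, 0.25), (0.0, 0.0)]
print(word_polarity(toy_senses))
# prints: 0.5
```

The balanced third sense is ignored, so the result is (0.75 + 0.25) / 2 rather than a mean over all three senses.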

In [31]:
runDiagnostics(getReviewSentiments(senti_word_net))

Accuracy of positive reviews : 75.56%
Accuracy of negative reviews : 42.79%
Overall Accuracy : 59.17%


Accuracy improved, but the two sample reviews expose a problem:

************************************************

In [35]:
# Trial:

print(senti_word_net(sampPosReview) , senti_word_net(sampNegReview))

0.6630434782608695 0.6630434782608695


***********************************

Both the positive and the negative sample review get the same score, because "not" is in the English stopword list and is removed before scoring.   
This is one of the typical issues with the rule-based approach.  
It requires a lot of fine-tuning.
Rule-based approaches do not work well when the data contains a wide variety of patterns.
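
The stopword effect can be reproduced without NLTK. The sketch below hardcodes a small subset of the English stopword list (an assumption made so the example is self-contained) and shows that both sample reviews reduce to the same tokens. Keeping negation words out of the stopword set is one possible tweak; the names `STOPWORD_SUBSET`, `NEGATIONS`, and `content_words` are hypothetical.

```python
# Assumed subset of NLTK's English stopword list, hardcoded for illustration.
STOPWORD_SUBSET = {"this", "is", "a", "not"}
# Negation words we might want to keep (hypothetical choice).
NEGATIONS = {"not", "no", "never"}


def content_words(review, stop_words):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in review.lower().split() if w not in stop_words]


samp_pos = "This is a good restaurant"
samp_neg = "This is not a good restaurant"

print(content_words(samp_pos, STOPWORD_SUBSET))              # ['good', 'restaurant']
print(content_words(samp_neg, STOPWORD_SUBSET))              # ['good', 'restaurant']
print(content_words(samp_neg, STOPWORD_SUBSET - NEGATIONS))  # ['not', 'good', 'restaurant']
```

Keeping "not" only preserves the token. The scoring function would still need an extra rule, for example flipping the polarity of the following word, which is exactly the kind of fine-tuning mentioned above.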

******************