# Naive Bayes for Sentiment Analysis

We use the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) for binary sentiment classification. It contains 25,000 highly polar movie reviews for training and 25,000 for testing. To download the data:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

## Load Data

In [1]:
from pyspark import SparkContext
import numpy as np
from collections import Counter
import re
import string
import os
from nltk.corpus import stopwords

In [2]:
sc = SparkContext.getOrCreate()
train_path = "aclImdb/train/"
test_path ="aclImdb/test/"

In [159]:
data_raw_pos = sc.textFile(train_path + "pos/*.txt")
data_raw_neg = sc.textFile(train_path + "neg/*.txt")

In [160]:
# textFile yields one element per line; here each element is a whole review
data_raw_pos.first()

u'For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.'

In [161]:
# sample about 20% of the data (without replacement, seed 1)
data_raw_pos = data_raw_pos.sample(False, 0.2, 1)
data_raw_neg = data_raw_neg.sample(False, 0.2, 1)
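
With `withReplacement=False`, `sample` keeps each element independently with probability equal to the fraction, so the sampled size is only approximately 20% of the input. A minimal pure-Python sketch of that behavior follows; `bernoulli_sample` is a hypothetical helper for illustration, not Spark's implementation, and it will not select the same elements as Spark does for the same seed.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

# Stand-in for the 12,500 positive training reviews
reviews = ["review_%d" % i for i in range(12500)]
sampled = bernoulli_sample(reviews, 0.2, seed=1)

# The size is close to, but generally not exactly, 0.2 * 12500 = 2500
print(len(sampled))
```

Fixing the seed makes the sample reproducible, which is why the same seed is reused for the positive and negative sets.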

In [162]:
# number of partitions (one per input file here)
data_raw_pos.getNumPartitions()

12500

In [163]:
# Optionally repartition (or coalesce) to cut down the many small partitions
# Rule of thumb: num_partitions = 3 to 4 times the number of CPU cores
num_partitions = 8
data_raw_pos = data_raw_pos.repartition(num_partitions)
data_raw_neg = data_raw_neg.repartition(num_partitions)
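
The rule of thumb in the comment above can be computed instead of hard-coded. This is a small sketch under the assumption that the job runs on a single machine, so the local core count is the relevant one; `suggested_partitions` is a hypothetical helper name.

```python
import os

def suggested_partitions(factor=3, cpus=None):
    """Return factor * CPU count, falling back to 1 CPU if it is unknown."""
    cpus = cpus or os.cpu_count() or 1
    return factor * cpus

# On a 2-core machine with factor 4 this gives 8, the value used above
print(suggested_partitions(factor=4, cpus=2))
```

On a cluster, the total number of executor cores would be the more appropriate input than the driver's core count.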

In [164]:
# count the sampled elements (roughly 2,500 per class)
# this takes some time
print(data_raw_pos.count())
print(data_raw_neg.count())

KeyboardInterrupt: 

## Training NB

In [175]:
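# Data cleaning: remove stop words, punctuation, and digits
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
STOP_WORDS = set(stopwords.words('english'))  # build once, not on every call

def words(text):
    # Replace punctuation and digits with spaces, then normalize case
    word = PUNCT_RE.sub(" ", text).strip().lower()
    # Stop words return None so that callers can filter them out
    if word not in STOP_WORDS:
        return word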
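
To see what this cleaning step produces on individual tokens, here is a self-contained sketch of the same logic. It assumes a tiny hypothetical stop-word set in place of the NLTK list so that it runs without the corpus download; `clean_token` and `STOP_WORDS_DEMO` are illustrative names.

```python
import re
import string

# Same character class as in the notebook: punctuation, digits, \r, \t, \n
PUNCT_DEMO_RE = re.compile('[' + re.escape(string.punctuation) + r'0-9\r\t\n]')
# Hypothetical stand-in for stopwords.words('english')
STOP_WORDS_DEMO = {"the", "a", "is"}

def clean_token(text):
    """Replace punctuation/digits with spaces, lowercase, drop stop words."""
    word = PUNCT_DEMO_RE.sub(" ", text).strip().lower()
    if word not in STOP_WORDS_DEMO:
        return word
    return None

for raw in ["The", "Linda's", "eye.<br", "--"]:
    print(repr(raw), "->", repr(clean_token(raw)))
```

Because punctuation is replaced by a space rather than removed, a token such as `Linda's` becomes `linda s`, and a token made only of punctuation becomes the empty string. Both effects are visible in the vocabulary printed later in the notebook.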

In [166]:
# The 100 most frequent bigrams in positive reviews, joined as 'w1_w2'
bigrams = data_raw_pos.map(lambda x: x.split()).map(lambda x: [words(w) for w in x])\
    .map(lambda x: [w.strip() for w in x if w is not None])\
    .flatMap(lambda x: [x[i]+'_'+x[i+1] for i in range(0,len(x)-1)])\
    .map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda x: x[1], ascending = False)\
    .map(lambda x: x[0]).take(100)
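
The pipeline above pairs each token with its successor, counts the pairs, and keeps the most frequent ones. The same idea on an in-memory list, without Spark, can be sketched as follows; `top_bigrams` and the toy documents are assumptions for illustration.

```python
from collections import Counter

def top_bigrams(docs, k):
    """Return the k most frequent adjacent-token pairs, joined with '_'."""
    counts = Counter()
    for tokens in docs:
        # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens
        counts.update(a + "_" + b for a, b in zip(tokens, tokens[1:]))
    return [bigram for bigram, _ in counts.most_common(k)]

docs = [
    ["great", "movie", "great", "movie"],
    ["great", "movie", "bad", "plot"],
]
print(top_bigrams(docs, 1))
```

`Counter.update` plays the role of `map` plus `reduceByKey`, and `most_common(k)` replaces `sortBy(...).take(k)`.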

In [183]:
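def bigram_mapping(x, bigrams):
    # Replace x[i] with the bigram x[i]_x[i+1] when that bigram is frequent.
    # The loop stops at len(x)-2, so the final token is emitted only as
    # the second half of a bigram.
    merged = []
    for i in range(0, len(x)-1):
        if x[i]+'_'+x[i+1] in bigrams:
            merged.append(x[i]+'_'+x[i+1])
        else:
            merged.append(x[i])
    # Wrapped in a list because the caller flattens twice
    return [merged]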
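
Note that the loop in `bigram_mapping` never emits the last token on its own, and the second word of a matched bigram is visited again on the next iteration. The sketch below is a variant, not the notebook's method, that consumes both words of a matched bigram and keeps the final token; `merge_bigrams` is a hypothetical name, and using it would change the counts reported later.

```python
def merge_bigrams(tokens, bigrams):
    """Greedily merge adjacent tokens that form a known bigram."""
    bigram_set = set(bigrams)  # set membership is O(1), unlike a list
    merged = []
    i = 0
    while i < len(tokens):
        pair = tokens[i] + "_" + tokens[i + 1] if i + 1 < len(tokens) else None
        if pair in bigram_set:
            merged.append(pair)
            i += 2  # skip the second word of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_bigrams(["a", "great", "movie", "indeed"], ["great_movie"]))
# ['a', 'great_movie', 'indeed']
```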

In [180]:
pos_words = data_raw_pos.map(lambda x: x.split()).map(lambda x: [words(w) for w in x])\
    .map(lambda x: [w.strip() for w in x if w is not None])\
    .flatMap(lambda x: bigram_mapping(x, bigrams)).flatMap(lambda x: x)
data_pos = pos_words.map(lambda x: (x, 1))
data_pos = data_pos.reduceByKey(lambda x,y:x+y)

In [184]:
# The 100 most frequent bigrams in negative reviews, joined as 'w1_w2'
bigrams_neg = data_raw_neg.map(lambda x: x.split()).map(lambda x: [words(w) for w in x])\
    .map(lambda x: [w.strip() for w in x if w is not None])\
    .flatMap(lambda x: [x[i]+'_'+x[i+1] for i in range(0,len(x)-1)])\
    .map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y).sortBy(lambda x: x[1], ascending = False)\
    .map(lambda x: x[0]).take(100)

In [185]:
neg_words = data_raw_neg.map(lambda x: x.split()).map(lambda x: [words(w) for w in x])\
    .map(lambda x: [w.strip() for w in x if w is not None])\
    .flatMap(lambda x: bigram_mapping(x, bigrams_neg)).flatMap(lambda x: x)
data_neg = neg_words.map(lambda x: (x, 1))
data_neg = data_neg.reduceByKey(lambda x,y:x+y)

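How should we compute count(pos) and count(neg)? Here each is the total number of tokens (single words and merged bigrams) in that class, obtained by summing the per-token counts.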

In [186]:
count_pos = data_pos.map(lambda x: x[1]).reduce(lambda x,y:x+y)
count_neg = data_neg.map(lambda x: x[1]).reduce(lambda x,y:x+y)

In [187]:
print(count_pos, count_neg)

(317578, 312018)


In [188]:
# V is the vocabulary size: the number of distinct tokens across both classes
v1 = data_pos.map(lambda x: x[0]) # pos vocabulary
v2 = data_neg.map(lambda x: x[0]) # neg vocabulary
v = v1.union(v2)
#v.count()
v0 = v.distinct()
V = v0.count()
print(V)

52389


In [189]:
# Add-one smoothing denominators differ per class: count(c) + V + 1
# (the extra 1 leaves room for words unseen in training, see pred_class)
pos_denom = float(count_pos + V + 1)
neg_denom = float(count_neg + V + 1)

In [190]:
# log probabilities
pos_prob = data_pos.map(lambda x: (x[0], np.log(float(x[1] + 1)/pos_denom)))

neg_prob = data_neg.map(lambda x: (x[0], np.log(float(x[1] + 1)/neg_denom))) 
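
The smoothed estimate used here is P(w | c) = (count(w, c) + 1) / (count(c) + V + 1). As a worked check, the sketch below plugs in the positive-class totals printed above; `smoothed_log_prob` is a hypothetical helper, and it uses `math.log` in place of `np.log` to stay dependency-free.

```python
import math

def smoothed_log_prob(word_count, class_token_count, vocab_size):
    """Log of the add-one smoothed probability of a word given a class."""
    denom = float(class_token_count + vocab_size + 1)
    return math.log((word_count + 1) / denom)

count_pos, V = 317578, 52389  # values printed earlier in the notebook

print(smoothed_log_prob(1, count_pos, V))  # word seen once: about -12.128
print(smoothed_log_prob(2, count_pos, V))  # word seen twice: about -11.723
print(smoothed_log_prob(0, count_pos, V))  # unseen word: log(1 / denom)
```

A word seen once in positive reviews gets a log probability of about -12.128, and the unseen-word case is the default that the prediction step falls back to.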

In [191]:
pos_prob.take(10)

[(u'', -4.6735940581919788),
 (u'fawn', -12.128024613833803),
 (u'eye  br', -12.128024613833803),
 (u'plot kurt', -12.128024613833803),
 (u'divinely', -11.72255950572564),
 (u'linda s', -12.128024613833803),
 (u'blackend', -12.128024613833803),
 (u'resist', -10.423276521595378),
 (u'sahan', -11.72255950572564),
 (u'joshua', -11.72255950572564)]

In [192]:
pos_prob = dict(pos_prob.collect())
neg_prob = dict(neg_prob.collect())
print(pos_prob)



In [193]:
# broadcast: ship a read-only copy of each dict to every worker node once
pos_prob_b = sc.broadcast(pos_prob)
neg_prob_b = sc.broadcast(neg_prob)

## Prediction

In [194]:
test_raw_pos = sc.textFile(test_path + "pos/*.txt")
test_raw_neg = sc.textFile(test_path + "neg/*.txt")

test_raw_pos = test_raw_pos.sample(False, 0.2, 1)
test_raw_neg = test_raw_neg.sample(False, 0.2, 1)

num_partitions = 8
test_raw_pos = test_raw_pos.repartition(num_partitions)
test_raw_neg = test_raw_neg.repartition(num_partitions)

print(test_raw_pos.count())
print(test_raw_neg.count())

2529
2529


This film grabs you from the opening scenes and never lets go. You watch indulgently upon viewing Janos Szaby's excitement over coming to America. He's a likable fellow. You cannot help being fond of him even when his eagerness is replaced by bitterness as his fortunes turn. You know that in his circumstances,you would be forced to make the same choices he does to survive. This movie comments on society's worship of beauty and all things superficial and is only more true in the culture of the twenty-first century. Janos himself becomes victim to this philosophy when he tells his blind girlfriend "you're young and beautiful; if you could see, you would have the world." And like many a modern gangster movie, when her safety is threatened, he extracts a powerful revenge. His innocence is not altogether lost however for he demands an equally high price of himself, knowing he deserves his fate.<br /><br />Peter Lorre is in fine form in this starring role. Only a few actors could convincingl

In [23]:
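def pred_class(doc):
    # Tokenize on single spaces and count each token
    words = doc.split(" ")
    counts = Counter(words)
    log_pos = 0.0
    log_neg = 0.0
    for w in counts:
        # Unseen words fall back to the smoothed probability 1/denom
        log_pos += counts[w] * pos_prob_b.value.get(w, np.log(1.0/pos_denom))
        log_neg += counts[w] * neg_prob_b.value.get(w, np.log(1.0/neg_denom))
    # Class priors are omitted; the two classes are about equally sized
    if log_pos > log_neg:
        return "pos"
    return "neg"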
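
The decision rule sums, for each class, the log probability of every token weighted by how often it occurs, and picks the larger total. The toy sketch below applies the same rule with plain dictionaries instead of broadcast variables. The probabilities, the default for unseen words, and the name `classify` are all made up for illustration.

```python
import math
from collections import Counter

def classify(doc, pos_logp, neg_logp, pos_default, neg_default):
    """Return 'pos' if the positive log-likelihood is larger, else 'neg'."""
    counts = Counter(doc.split(" "))
    log_pos = sum(n * pos_logp.get(w, pos_default) for w, n in counts.items())
    log_neg = sum(n * neg_logp.get(w, neg_default) for w, n in counts.items())
    return "pos" if log_pos > log_neg else "neg"

# Hypothetical per-class word probabilities
pos_logp = {"great": math.log(0.05), "boring": math.log(0.001)}
neg_logp = {"great": math.log(0.005), "boring": math.log(0.04)}
default = math.log(1e-5)  # hypothetical score for unseen words

print(classify("great great boring", pos_logp, neg_logp, default, default))
print(classify("boring boring great", pos_logp, neg_logp, default, default))
```

Working in log space avoids numerical underflow from multiplying many small probabilities. With unbalanced classes, a log prior would be added to each total before comparing.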

In [24]:
pred_class(doc)

'pos'

In [25]:
test_pos_res = test_raw_pos.map(pred_class)
test_pos_res.take(10)

['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos']

In [26]:
test_pos_res = test_raw_pos.map(pred_class).map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y)
pos_results = dict(test_pos_res.collect())
print(pos_results)

{'neg': 592, 'pos': 1890}


In [27]:
test_neg_res = test_raw_neg.map(pred_class).map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y)
neg_results = dict(test_neg_res.collect())
print(neg_results)

{'neg': 2156, 'pos': 326}


In [28]:
# compute accuracy
total = sum(neg_results.values()) + sum(pos_results.values())
acc = float(neg_results["neg"] + pos_results["pos"]) / float(total)
print(acc)

0.815068493150685
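
Accuracy alone hides how the errors split between the two classes. Using the prediction counts printed above, the sketch below also derives per-class recall and the precision of the positive class; the variable names are illustrative.

```python
# Prediction counts copied from the outputs above
pos_results = {"neg": 592, "pos": 1890}   # true label: pos
neg_results = {"neg": 2156, "pos": 326}   # true label: neg

tp, fn = pos_results["pos"], pos_results["neg"]
tn, fp = neg_results["neg"], neg_results["pos"]

accuracy = (tp + tn) / float(tp + tn + fp + fn)
recall_pos = tp / float(tp + fn)      # share of positive reviews found
recall_neg = tn / float(tn + fp)      # share of negative reviews found
precision_pos = tp / float(tp + fp)   # share of 'pos' predictions that are right

print(accuracy, recall_pos, recall_neg, precision_pos)
```

The counts show that the classifier misses more positive reviews (592) than negative ones (326), so recall is lower for the positive class.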
