Final Project for CIS 399

Author: Benjamin Martinson

I'm importing a dataset called 'Grammer and Product Reviews' downloaded from Kaggle.com. My goal is to predict the rating of a product based on the text review that is given. To do so I will use the Naive Bayes method. The rating for each review is 1-5, so the method will output a prediction with 5 classes. I will take the highest predicted class of each review and compare it to the actual rating to see how close the prediction is.

Import statements first

In [75]:
from nltk.corpus import stopwords
swords = stopwords.words('english')

In [76]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()   

In [77]:
import string
punctuation = string.punctuation

In [78]:
import pandas as pd

Import the dataset into a pandas dataframe

In [79]:
review_table = pd.read_csv('/Users/owner/Downloads/GrammarandProductReviews.csv',
                          encoding='utf-8')

The first function I'm defining takes a string sentence and seperates each word, removing punctuation and 'stop words' (words that are very common and shouldn't be used in the comparisons).

In [80]:
def sentence_wrangler(sentence, swords, punctuation):
    removed_tokes = []
    wrangled = []
    word_tokes = word_punct_tokenizer.tokenize(sentence)
    word_tokes = [x.lower() for x in word_tokes]
    for i in range(len(word_tokes)):
        foundPunct = False
        for punct in punctuation:
            if punct in word_tokes[i]:
                foundPunct = True
        if word_tokes[i] in swords:
            removed_tokes.append(word_tokes[i])
        elif foundPunct == True:
            removed_tokes.append(word_tokes[i])
        else:
            wrangled.append(word_tokes[i].encode("utf-8"))

    return (wrangled, removed_tokes)
            

Here is an example of sentence_wrangler in action. I'm printing the review text followed by the list of words extracted from that text. 

In [81]:
for i in range(5):
    text = review_table.loc[i, 'reviews.text']
    print(text+'\n')
    print(sentence_wrangler(text, swords, punctuation)[0])
    print('='*10)

i love this album. it's very good. more to the hip hop side than her current pop sound.. SO HYPE! i listen to this everyday at the gym! i give it 5star rating all the way. her metaphors are just crazy.

[b'love', b'album', b'good', b'hip', b'hop', b'side', b'current', b'pop', b'sound', b'hype', b'listen', b'everyday', b'gym', b'give', b'5star', b'rating', b'way', b'metaphors', b'crazy']
Good flavor. This review was collected as part of a promotion.

[b'good', b'flavor', b'review', b'collected', b'part', b'promotion']
Good flavor.

[b'good', b'flavor']
I read through the reviews on here before looking in to buying one of the couples lubricants, and was ultimately disappointed that it didn't even live up to the reviews I had read. For starters, neither my boyfriend nor I could notice any sort of enhanced or 'captivating' sensation. What we did notice, however, was the messy consistency that was reminiscent of a more liquid-y vaseline. It was difficult to clean up, and was not a pleasant,

This next function produces a dictionary with keys of all words extracted from each review text, with corresponding values identifying the rating given for the review that the word was used in. For example bag['hello'] = [0,1,0,5,0] means that 'hello' is in 6 review texts, where 1 review gave a rating of 2, and 5 reviews gave a rating of 4.

In [85]:
def all_words(table, swords, punct):
    bag = {}
    for i in range(len(table)):
        #Some review texts are NaN so I'm checking for that here
        if pd.isnull(table.loc[i, 'reviews.text']):
            continue
            
        text = table.loc[i, 'reviews.text']     
        rating = table.loc[i, 'reviews.rating']
        sentence = sentence_wrangler(text, swords, punctuation)
        sentence = sentence[0]
        #rating is one more than the rating index
        idx = rating-1
        if i %5000 == 0:
            print('did 5000')
        for word in sentence:
            if word.isalpha():
                if word not in bag:
                    bag[word] = [0,0,0,0,0]
                bag[word][idx] += 1   
    return bag

In [86]:
bag_of_words = all_words(review_table, swords, punctuation)
len(bag_of_words) 

did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000


26337

The naive bayes method needs the total count of each review rating given. I'll collect this in a list.

In [87]:
total_count = [0,0,0, 0, 0]
for index, row in review_table.iterrows():
    idx = row['reviews.rating'] - 1
    total_count[idx] +=1
total_count

[3701, 1833, 4369, 14598, 46543]

The naive_bayes function I'm defining uses the numClasses variable to represent the number of classes in the prediction, but that will be 5 for my purposes. The function returns a list of 5 elements that represent the prediction for each rating, 1-5 (left to right).

In [88]:
def naive_bayes(raw_sentence, bag, counts, numClasses):
    sentence = sentence_wrangler(raw_sentence, swords, punctuation)
    sentence = sentence[0]
    numerator = [1] * numClasses
    casePercentage = [0] * numClasses
    tableSize = 0
    for i in range(numClasses):
        tableSize += counts[i]
  
    for i in range(numClasses):
        for word in sentence:
            if word in bag:
                numerator[i] *= (bag[word][i] / counts[i])
            
        casePercentage[i] = counts[i] / tableSize
    
    predictions = []
    for i in range(numClasses):
        predictions.append(numerator[i] * casePercentage[i])
    return predictions

I'm using the following code as a check to make sure I'm on the right track. I'm printing the prediction list, followed by the predicted rating (max of the list) and the actual rating.

In [89]:
for i in range(10):
    predictions = naive_bayes(review_table.loc[i, 'reviews.text'], bag_of_words, total_count, 5)
    print(predictions)
    m = max(predictions)
    print('Rating Prediction =', predictions.index(m)+1)
    print('actual rating =', review_table.loc[i, 'reviews.rating'])
    print('============')

[0.0, 0.0, 0.0, 0.0, 3.7154811459226627e-47]
Rating Prediction = 5
actual rating = 5
[5.898149682819661e-10, 2.5187986026742685e-08, 2.078700058067819e-07, 9.029366301808492e-07, 3.0493451735918935e-06]
Rating Prediction = 5
actual rating = 5
[2.8113535661845036e-05, 2.1770238788879918e-05, 7.065276804400496e-05, 0.00018073082982545665, 0.0005387804752337192]
Rating Prediction = 5
actual rating = 5
[5.437953728208661e-108, 0.0, 0.0, 0.0, 0.0]
Rating Prediction = 1
actual rating = 1
[1.7115749906034258e-19, 4.362577677240861e-23, 2.61094409743329e-25, 7.638665597932919e-27, 7.797542175922421e-28]
Rating Prediction = 1
actual rating = 1
[1.5470875712768262e-37, 0.0, 0.0, 1.5001378859331859e-43, 0.0]
Rating Prediction = 1
actual rating = 1
[5.542780014621437e-33, 0.0, 0.0, 1.2004092276944343e-38, 0.0]
Rating Prediction = 1
actual rating = 1
[8.544887589944767e-32, 0.0, 0.0, 1.567397639132955e-39, 0.0]
Rating Prediction = 1
actual rating = 1
[3.030485187441979e-27, 0.0, 0.0, 0.0, 0.0]
Rati

The first 10 predictions happen to be correct but let's try it for all the reviews now.

In [90]:
predictions = []

for i in range(len(review_table)):
    if pd.isnull(review_table.loc[i, 'reviews.text']):
        continue
    if i%5000 == 0: print('did 5000')
    pair = naive_bayes(review_table.loc[i, 'reviews.text'], bag_of_words, total_count, 5)
    m = max(pair)
    predictions.append(pair.index(m)+1)

did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000
did 5000


To efficiently compare each prediction to the actual rating, I will zip the two side by side

In [91]:
actuals = review_table['reviews.rating']
zipped = list(zip(predictions, actuals))

Heres the first 50. As you can see some predictions are off.

In [92]:
zipped[:50]

[(5, 5),
 (5, 5),
 (5, 5),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (2, 3),
 (3, 3),
 (4, 4),
 (4, 4),
 (1, 4),
 (5, 5),
 (4, 5),
 (5, 5),
 (1, 5),
 (5, 5),
 (1, 5),
 (5, 5),
 (5, 5),
 (5, 5),
 (5, 5),
 (5, 5),
 (4, 5),
 (5, 5),
 (2, 4),
 (4, 5),
 (1, 5),
 (5, 5),
 (5, 1),
 (1, 1),
 (3, 2),
 (3, 3),
 (4, 4),
 (4, 4),
 (5, 4),
 (4, 4),
 (4, 4),
 (5, 4)]

Now let's see what the accuracy percentage for all the predictions is.

In [93]:
correct = 0
for i in range(len(zipped)):
    if zipped[i][0]==zipped[i][1]:
        correct += 1
1.0*correct/len(zipped)

0.6085621743416421

60% doesn't look great, but given that there are 5 different ratings to choose from, and some of the reviews texts are too short or not descriptive enough to make an exact prediction, 60% is actually not bad in my opinion. 

You might notice that some of the predictions are only off by 1. I will now calculate the accuracy percentage, including predictions off by one at most one.

In [94]:
correct = 0
for i in range(len(zipped)):
    if abs(zipped[i][0] - zipped[i][1]) <= 1:
        correct += 1
1.0*correct/len(zipped)

0.8324602168708632

This shows that over 80% of predictions are very close to correct. Now let's see off by at most 2.

In [95]:
correct = 0
for i in range(len(zipped)):
    if abs(zipped[i][0] - zipped[i][1]) <= 2:
        correct += 1
1.0*correct/len(zipped)

0.911970145049993

I consider this a success because the method is able to predict the general positivity of the review text, whether the reviewer will give a positive of negative rating, with over 90% accuracy. 