# Extra Credit - Basic Baseline System

## Yelp Reviews

Here we take the baseline system developed previously and apply it to Yelp reviews, separating the good reviews (3.5 stars and above) from the bad ones (less than 3.5 stars).

The first step is to import the Python libraries needed in the code.

In [1]:
import nltk
import numpy as np
import math
import glob
from string import punctuation
from nltk.corpus import stopwords
from collections import Counter

#### Extracting the Text

##### Creating a list of all reviews

In [2]:
filename = "yelp_reviews/all_reviews.txt"

with open(filename) as file:
    text = file.read()

reviews = text.split("]]]")
del reviews[10391]  # remove one known garbage entry
reviews[0] = '\n' + reviews[0]  # prepend a newline so the first review has the same layout as the others
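
Deleting a hard-coded index only works for this exact file. As a hedged alternative, the sketch below filters out malformed chunks instead. It assumes a chunk is usable when it is longer than 21 characters and has a digit at index 5 (the positions `evaluate_review` reads); `is_well_formed` and the sample chunks are hypothetical and not part of the original notebook.

```python
def is_well_formed(review, star_index=5, text_start=21):
    """Return True if the chunk is long enough and has a digit where the rating is expected."""
    return len(review) > text_start and review[star_index].isdigit()

# Hypothetical chunks: two usable reviews, one blank tail, one garbage entry
chunks = [
    "\n[[[ 4 stars review: Great food",
    "\n[[[ 2 stars review: Slow service",
    "\n",
    "\ncorrupted entry with no rating at all",
]

clean = [c for c in chunks if is_well_formed(c)]
print(len(clean))  # number of chunks that passed the check
```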

#### Evaluate each review to return star rating and text

In [3]:
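def evaluate_review(review):
    """Return (stars, text) read from fixed character positions of a raw review."""
    stars = review[5]   # a single character, so only the leading digit of the rating is kept
    text = review[21:]  # the review body starts at index 21
    return int(stars), text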
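
To make the fixed-position parsing concrete, here is a small sketch on made-up input. The header layout in the sample strings is an assumption chosen so that the rating sits at index 5 and the body starts at index 21; the real file may look different.

```python
def evaluate_review(review):
    stars = review[5]    # one character only
    text = review[21:]   # everything from index 21 onward
    return int(stars), text

# Hypothetical raw reviews (layout invented for illustration)
whole = "\n[[[ 4 stars review: Great food"
half = "\n[[[ 3.5 stars review: Decent"

print(evaluate_review(whole))    # rating and body of the first sample
print(evaluate_review(half)[0])  # only the leading digit of "3.5" is read
```

Because only one character is read, a rating such as 3.5 comes back as 3, which is presumably why the later cells look for the token `'3.5'` when `stars == 3`.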

### Cleaning Text

#### Cleaning text and updating the Vocabulary (The vocabulary needs to be constantly updated)

In [5]:
def update_vocab(tokens,vocab):    
    vocab.update(tokens)
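
The cell that defines `clean_text` (used later in `test_review`) is not shown in this section. The sketch below is one plausible version, not the original: it assumes the function lowercases the text, strips punctuation, and drops stop words. A small hand-picked stop-word set stands in for `nltk.corpus.stopwords` so the example runs without downloading NLTK data.

```python
from string import punctuation

# Assumption: a tiny stand-in for stopwords.words("english")
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of"}

def clean_text(text):
    """Hypothetical cleaner: lowercase, strip punctuation, drop stop words."""
    table = str.maketrans("", "", punctuation)  # deletes every punctuation character
    tokens = [t.translate(table) for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(clean_text("The food was great, and the service is FAST!"))
```

With this sketch the token `3.5` becomes `35`, so it would not satisfy a later `'3.5' in tokens` check; the original function may behave differently.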

#### Saving the vocabulary as a list

In [6]:
def save_vocab(vocab):
    return list(vocab)  # the distinct tokens (the Counter's keys), without their counts

#### Splitting reviews into training and testing lists

In [7]:
training_review = reviews[:8500]
testing_review = reviews[8500:]

### Creating the Vocabulary using the previous functions

In [8]:
# Token counters for the positive and negative classes
pos_vocab = Counter()
neg_vocab = Counter()

for i in training_review:
    stars,text = evaluate_review(i)
    tokens = text.split()
    if stars > 3:
        update_vocab(tokens,pos_vocab)
    elif stars < 3:
        update_vocab(tokens,neg_vocab)
    elif stars == 3:
        if ('3.5' in tokens) and ('stars' in tokens):
            update_vocab(tokens,pos_vocab)
        else:
            update_vocab(tokens,neg_vocab)

pos_vocabulary= save_vocab(pos_vocab) 
neg_vocabulary = save_vocab(neg_vocab)
#print(pos_vocabulary)
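
The labelling rule in the loop above is repeated in the testing code, so it can be factored into one helper. This is a sketch of that refactoring; `is_positive` is a hypothetical name that does not appear in the original notebook.

```python
def is_positive(stars, tokens):
    """Above 3 is positive, below 3 is negative; a 3 counts as positive
    only when the tokens contain both '3.5' and 'stars'."""
    if stars != 3:
        return stars > 3
    return "3.5" in tokens and "stars" in tokens

print(is_positive(4, ["great"]))
print(is_positive(3, ["3.5", "stars", "ok"]))
print(is_positive(3, ["ok"]))
```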

### Testing

In [9]:
def test_review():
    """Count how many test reviews the vocabulary-overlap rule labels correctly."""
    correct = 0
    for review in testing_review:
        stars, text = evaluate_review(review)
        tokens = clean_text(review)
        pos_decision = 0
        neg_decision = 0
        for token in tokens:
            if token in pos_vocabulary:
                pos_decision += 1  # * pos_vocab[token]/pos_sum  (weighting disabled for the reason given earlier)
            if token in neg_vocabulary:
                neg_decision += 1  # * neg_vocab[token]/neg_sum
        # Ties are resolved in favour of the positive class
        if pos_decision >= neg_decision:
            pos = 1
        else:
            pos = -1
        if stars > 3:
            if pos == 1:
                correct += 1
        elif stars < 3:
            if pos == -1:
                correct += 1
        elif stars == 3:
            # A truncated 3 is treated as positive only when the review mentions "3.5 stars"
            if ('3.5' in tokens) and ('stars' in tokens):
                if pos == 1:
                    correct += 1
            else:
                if pos == -1:
                    correct += 1
    return correct
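
The decision rule inside `test_review` can be isolated as follows. This is a minimal sketch on toy vocabularies, not the trained ones; `classify` is a hypothetical helper, and converting the vocabulary lists to sets is a suggested speed-up (constant-time lookups) rather than something the original code does.

```python
def classify(tokens, pos_set, neg_set):
    """Return 1 (positive) or -1 (negative) by counting vocabulary hits.
    A token found in both vocabularies adds one to each count."""
    pos_hits = sum(1 for t in tokens if t in pos_set)
    neg_hits = sum(1 for t in tokens if t in neg_set)
    return 1 if pos_hits >= neg_hits else -1  # ties go to the positive class

# Toy vocabularies, invented for illustration
pos_set = set(["great", "food", "friendly"])
neg_set = set(["slow", "food", "rude"])

print(classify(["great", "food"], pos_set, neg_set))
print(classify(["slow", "rude", "food"], pos_set, neg_set))
print(classify(["unknown"], pos_set, neg_set))
```

Note that a review with no known tokens is labelled positive, since zero hits on both sides counts as a tie.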

In [11]:
number_of_corrects = test_review()  # evaluate on the held-out test reviews
accuracy = number_of_corrects / len(testing_review) * 100

print("Accuracy is " + str(accuracy))

Accuracy is 69.32839767318879
