In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl 
%matplotlib inline
mpl.rcParams['patch.force_edgecolor'] = True
sns.set()

In [2]:
df_nyt_yelp = pd.read_pickle('df_nyt_yelp_corrected.pkl') # import nyt-yelp master dataframe

In [3]:
import re
from dateutil import parser

# make a new, second master dataframe that compiles all reviews scraped so far
df_reviews = pd.DataFrame()
rest_length = []
for idx in range(0,280):
    pkl_name = 'import/restaurants/' + str(idx) + '.pkl' # get Pickle file name of restaurant
    df = pd.read_pickle(pkl_name) # import pickled dataframe
    df['review_idx'] = idx # affix restaurant's index to dataframe
    df_reviews = df_reviews.append(df) # append to master dataframe of reviews
    rest_length.append(len(df))

df_reviews = df_reviews.reset_index().drop('index', axis=1)

# convert date to DateTime object
df_reviews['review_date'] = [parser.parse(t) for t in df_reviews['review_date']]
#df_reviews['review_date'] = [datetime.strptime(str(t).split()[0], '%Y-%m-%d') for t in df_reviews['review_date']]

# convert rating from str to numeric
df_reviews['rating'] = pd.to_numeric(df_reviews['rating'])

print('Total number of reviews: ', len(df_reviews))
df_reviews.head(2)

Total number of reviews:  72007


Unnamed: 0,cool_count,elite_count,friend_count,funny_count,length_count,rating,review_count,review_date,review_text,useful_count,user_count,review_idx
0,4,1,45,2,1771,5,345,2018-06-18,"Davelle, uh, oden, uh. Foodie, why you trippin...",4,Jennie C.,0
1,1,1,68,0,791,4,350,2018-07-01,Keep It Simple Smart. Cute all day cafe with ...,1,Yvonne C.,0


In [52]:
#test = pd.DataFrame({'yelp_name':df_nyt_yelp['yelp_name'][:238], 'length':rest_length})
#test.to_csv('test.csv')

#### Drop invalid restaurants

Drop some other restaurants that were discovered to be invalid (not from NYC, etc)

In [4]:
delete_idx = [34,44,47,151,352,498]
df_nyt_yelp = df_nyt_yelp.drop(delete_idx)

delete_reviews = [idx for idx, row in df_reviews.iterrows() if row['review_idx'] in delete_idx]
df_reviews = df_reviews.drop(delete_reviews).reset_index().drop('index', axis=1)

#### Fill in master dataframe w/ scraped reviews

In [None]:
# grab restaurants that have complete Yelp scraped data now
#df_master = df_nyt_yelp.iloc[:`90]

# fill in Yelp scraped data
#for idx in range(0,88):
#    pkl_name = str(idx) + '.pkl'
#    df = pd.read_pickle(pkl_name)
#    df_master.loc[idx, 'yelp_reviews'] = 

# 1. Yelp reviews - NLP corpus

## 1.1 NLP pre-processing

Generate a pre-processed, tokenized list of documents in preparation for using gensim to create a corpus. Here, a <u>document</u> = a review's text. Each document is converted to a list of pre-processed tokens (not unique - will list all instances of tokens).

Pre-processing steps: 
- Lowercase
- Remove non-alphabetic characters/punctuation
- Remove stop words
- Lemmatize
- Correct (some) misspellings w/ [TextBlob](http://textblob.readthedocs.io/en/dev/)

In [13]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from textblob import TextBlob

text_tokenized = []
text_preprocessed = []

for idx, review in df_reviews.iterrows():
    
    # basic pre-processing
    text = review['review_text'].lower() # lowercase doc
    text = str(TextBlob(text).correct())
    
    text2 = word_tokenize(text) # tokenize doc
    text3 = [tok for tok in text2 if tok.isalpha()] # retain only alphabetic words
    
    # remove stopwords
    stop_words = set(stopwords.words('english')) # generate stopwords from English dictionary
    text4 = [tok for tok in text3 if tok not in stop_words]
    
    # lemmatize tokens
    wordnet_lemmatizer = WordNetLemmatizer()
    text5 = [wordnet_lemmatizer.lemmatize(tok) for tok in text4]
    
    # correct some misspellings
    #text6 = [str(TextBlob(tok).correct()) for tok in text5]
    
    text_tokenized.append(text5)

## 1.2 Create corpus w/ gensim

We use [gensim](https://radimrehurek.com/gensim/) to create a corpus, where each token is mapped to a unique numerical ID and word count (i.e. bag of words, BoW) in order to set up structure for inputting to NLP algorithms.

In [16]:
from gensim.corpora.dictionary import Dictionary

# create dictionary from list of pre-processed tokens (all instances) across all documents ('lemmatized')
dictionary = Dictionary(text_tokenized)

# generate corpus
corpus = [dictionary.doc2bow(doc) for doc in text_tokenized] # .doc2bow method converts documents into BoW format



Visualize a sample review under our different processing steps leading up to gensim corpus.

In [17]:
print('Review (after pre-processing): ', text, '\n')
print('Review (after document tokenization, removing stopwords, lemmatization): ', text5, '\n')
print('Review (after gensim corpus): ', corpus[-1])

Review (after pre-processing):  i don't give 5 stars often, unless it was truly a stellar meal. and of course it doesn't have to be a stuffy fancypants place to be stellar, although sometimes it is. shuko was one of those upscale places and was truly, from the bottom of my heart, one of the best meals i've ever had. it's so badass that the door isn't even marked- it's this elite club that you enter because you know about it already. i got the omakase menu, every bite was carefully prepared in front of my eyes before being placed on my little stone tray and then popped into my mouth. it was as if centuries of preparation and thought had gone into the making of each bite; the culture and technique behind the assembly of the sushi, the flavor and texture profiles, the quality of the ingredients... i think when i went through that unmarked black door, i went to narnia and then reemerged into the mundane world upon exit. sake was phenomenal. i tasted everything that the sommelier (sakelier?

#### Create dataframe of corpus, which tracks the restaurant that the review belongs to

In [7]:
df_corpus = pd.DataFrame({'restaurant_idx':df_reviews['review_idx'], 
                          'corpus':corpus, 
                          'yelp_rating':df_reviews['rating']})
df_corpus.head()

Unnamed: 0,restaurant_idx,corpus,yelp_rating
0,0,"[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...",5
1,0,"[(17, 1), (26, 1), (34, 1), (42, 1), (72, 1), ...",4
2,0,"[(14, 1), (29, 1), (30, 1), (42, 1), (44, 1), ...",3
3,0,"[(23, 1), (26, 1), (42, 1), (60, 2), (61, 1), ...",5
4,0,"[(4, 1), (13, 1), (29, 1), (42, 2), (50, 2), (...",4


## 1.3 Basic word count and bag of words

#### Find most frequent words in best-rated and worst-rated Yelp restaurants

- "Good" Yelp reviews have ratings = 5 
- "Bad" Yelp reviews have ratings <= 3

Note that individual reviews can only be an integer from 1 to 5. Overall average Yelp rating for a restaurant, however, is capable of increments of 0.5 (such as 4.5/5).

In [8]:
# Best-rating reviews

idx_good = df_reviews[df_reviews['rating'] == 5].index
idx_good_doc = [t for t,j in df_corpus['restaurant_idx'].iteritems() if j in idx_good] # index of docs belonging to those restaurants

subcorpus_good = []
subcorpus_good = [(subcorpus_good + doc) for idx, doc in df_corpus.loc[idx_good_doc]['corpus'].iteritems()]

# Worst-rating reviews

idx_bad = df_reviews[df_reviews['rating'] <= 3 ].index
idx_bad_doc = [t for t,j in df_corpus['restaurant_idx'].iteritems() if j in idx_bad] # index of docs belonging to those restaurants

subcorpus_bad = []
subcorpus_bad = [(subcorpus_bad + doc) for idx, doc in df_corpus.loc[idx_bad_doc]['corpus'].iteritems()]

#### Print top 10 words for "good" and "bad" Yelp reviews.

In [9]:
import collections
import itertools

# Good-rating reviews

total_word_count_good = collections.defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(subcorpus_good):
    total_word_count_good[word_id] += word_count

sorted_word_count_good = sorted(total_word_count_good.items(), key=lambda w: w[1], reverse=True) 

print('Top 10 words for GOOD-rating Yelp reviews:','\n')
for word_id, word_count in sorted_word_count_good[:10]:
    print(dictionary.get(word_id), word_count)
print('\n')

# Bad-rating reviews

total_word_count_bad = collections.defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(subcorpus_bad):
    total_word_count_bad[word_id] += word_count

sorted_word_count_bad = sorted(total_word_count_bad.items(), key=lambda w: w[1], reverse=True) 

print('Top 10 words for BAD-rating Yelp reviews:','\n')
for word_id, word_count in sorted_word_count_bad[:10]:
    print(dictionary.get(word_id), word_count)


Top 10 words for GOOD-rating Yelp reviews: 

food 34855
good 27015
place 24056
dish 20569
like 19130
great 18700
restaurant 18590
one 18162
service 16875
would 16590


Top 10 words for BAD-rating Yelp reviews: 

food 14010
good 9392
dish 8281
place 8195
restaurant 7776
service 6912
like 6870
great 6812
one 6387
would 5782


<b>Conclusion</b>: As seen from the overlap between top-10 "good"/"bad" words from a simple bag of words count, we will need more sophisticated tools to parse keywords associated with "good" or "bad" ratings.

# 2. tf-idf EDA

In the previous section, we did simple pre-processing and simply took token frequency. Here, we experiment with using gensim's [tf-idf](https://radimrehurek.com/gensim/models/tfidfmodel.html) to identify most important words in each document. This is accomplished with their algorithm by down-weighting shared words (between documents) beyond simply stopwords, ensuring that common words don't show up as key words. Conversely, document-specific words are weighted highly.

## 2.1 Experimenting with tf-idf on a Yelp review

We generate tfidf weights for a single document (Yelp review) to see how tf-idf performs. The tf-idf model is generated on the entire corpus of documents (i.e. reviews).

In [170]:
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
doc = corpus[0]
tfidf_weights = tfidf[doc]

# Print the first five weights
print('tfidf weights: ', '\n', tfidf_weights[:5], '\n')

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
print('Top 5 weighted words:')
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary[term_id], weight)

print('\n')
print('Text: ','\n', ' '.join(corpus_tokenized[0]))

tfidf weights:  
 [(0, 0.11605235285752101), (1, 0.055517904959346144), (2, 0.058550703583362056), (3, 0.01923654735257033), (4, 0.027415914582950895)] 

Top 5 weighted words:
oden 0.43697383680554125
dashi 0.2652405475558776
uh 0.2273065414231621
mochi 0.19016016546668416
spaghetti 0.1868451957856557


Text:  
 davelle uh oden uh foodie trippin get order right uh shawty look good eatin oden oden dish drink dashi davelle oden moonlight xxxtentacion rip everything amazing u dining tiny cozy cramped beautiful little spot got oden set karaage cod spaghetti hokkaido spaghetti uni tomato cold dish topped kinda optional light cheese drink dashi aaaalllll dish good soft blanched skinless savory daikon served spicy yuzu paste use sparingly pretty big kick red miso paste soft mushy perfectly cooked heart shaped daikon mochi lightly fried bag soft gooey delicious mochi def drink dashi scallion enoki mushroom ginger hanpen white fish cake soft texture airy typical fish cake denseness fishcake del

<b>Conclusion</b>: It appears that if we take a document to be a single review, tf-idf may pick keywords that are specific to the reviewed restaurant's cuisine. Although this may be useful for identifying what food the restaurant serves, we are more interested in what the reviewer thought of the food, service, etc. 

Next, we try taking a document to be the the concatenation of all reviews belonging to a single restaurant to see if we get more relevant results.

## 2.2 Experimenting with tf-idf on all reviews belonging to a single restaurant

In [156]:
# Combine a single restaurant's reviews into one document (Davelle, first entry in df_nyt_yelp restaurant database)

restaurant_idx = 0
df = df_corpus[df_corpus['restaurant_idx']==0]
doc = df['corpus'].tolist()
doc = list(itertools.chain(*doc))

tfidf_weights = tfidf[doc]

# Print the first five weights
print('tfidf weights: ', '\n', tfidf_weights[:5], '\n')

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
print('Top 5 weighted words:')
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary[term_id], weight)

tfidf weights:  
 [(0, 0.0385683994677437), (1, 0.018717773450899693), (2, 0.01961377610979211), (3, 0.006518710429663028), (4, 0.00919997238797367)] 

Top 5 weighted words:
oden 0.17360491863449606
oden 0.14467076552874672
gobo 0.13137044585937177
ada 0.10397982332595031
dashi 0.0903979302813469


<b>Conclusion:</b> Top tf-idf keywords still seem to refer to the restaurant's type of cuisine more so than taste/food. It's likely that tf-idf, because it is designed to down-weight common words between documents, will actually leave out the phrases we want regarding food, service, and quality, since these are likely to appear across all documents (i.e. reviews).

## 2.3 Summary

We conclude this section with a pipeline setting up for converting documents (Yelp reviews) into token-wordcount mappings. As seen from the top wordcounts of "best-rating" and "worst_rating" Yelp reviews, there are many confounding terms that probably won't serve as good predictors for "good" or "bad" restaurants. 

Next, we'll experiment with word embeddings, sentiment analysis, CountVectorizer train-test-split on "good"/"bad" restaurants, etc.

# 3. Classification: "good"/"bad" reviews

Here, we use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to train/predict review labels: "good" or "bad". 

## 3.1 Generating "good" & "bad" review labels

Like the previous section, we'll take "good" ratings = 5 and "bad" ratings <= 3. As a result, we ignore reviews with a "4"-rating for now. 

The rationale is that these can contain a mix of positive/negative comments - negative comments explaining why the restaurant is not a 5, but also positive comments explaining why the restaurant would be > 3. Such a mix may confound our results.

#### Generate dataframe for classification train/test

Generate dataframe we will be working with, which contains only reviews of <=3 & = 5

In [11]:
# Grab "good" & "bad" review indices
idx_class = idx_good.append(idx_bad)

# Filter dataframe for only "good" and "bad" reviews. Save to 'df_class'
df_class = df_reviews.loc[idx_class]

# Include pre-processed text in new column: 'text_preprocessed'
text_preprocessed = [' '.join(doc) for doc in text_tokenized]
df = pd.DataFrame({'text':text_preprocessed})
df_class['text_preprocessed'] = df.loc[idx_class]

# Assign "good" or "bad" label
df_class.loc[idx_good, 'label'] = 'good'
df_class.loc[idx_bad, 'label'] = 'bad'

df_class.head(3)

Unnamed: 0,cool_count,elite_count,friend_count,funny_count,length_count,rating,review_count,review_date,review_text,useful_count,user_count,review_idx,text_preprocessed,label
0,4,1,45,2,1771,5,345,2018-06-18,"Davelle, uh, oden, uh. Foodie, why you trippin...",4,Jennie C.,0,davelle uh oden uh foodie trippin get order ri...,good
3,1,1,55,0,486,5,58,2018-06-27,Lovely little 16 seater at the south end of th...,0,Adam W.,0,lovely little seater south end le went late lu...,good
6,2,1,277,2,1583,5,96,2018-03-20,If you enjoy an small intimate cafe with diffe...,8,Maria S.,0,enjoy small intimate cafe different type japan...,good


## 3.2 CountVectorizer for train/test split

Use CountVectorizer to convert text to a sparse <b>document-term matrix (DTM)</b> of token counts, where each column is a <b>token</b> from the corpus vocabulary (generated from training set), each row is a <b>document</b> (a Yelp review), and the values are token frequency. Train & fit to set up for next section of predicting "good"/"bad" reviews.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Create a series to store the labels: y
y = df_class.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_class['text_preprocessed'], y, test_size=.33, random_state = 53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Learn the "vocabulary" of the training set & transform into 'document-term' matrix
count_train = count_vectorizer.fit_transform(X_train.values)

#  Use the fitted vocabulary to build a DTM from the testing data (IGNORES tokens it hasn't seen before)
count_test = count_vectorizer.transform(X_test.values)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:20])

['aa', 'aaa', 'aaaaaaaaaand', 'aaaaaaaand', 'aaaaaamazing', 'aaaaall', 'aaaalllll', 'aaaamazing', 'aaaand', 'aaaanyway', 'aaah', 'aaalll', 'aaamazing', 'aaanndd', 'aaawwweeesome', 'aahed', 'aahhhhh', 'aahing', 'aahs', 'aamer']


Tiny sample of resulting sparse DTM, where rows = Yelp reviews, columns = vocabulary generated from training set.

In [13]:
print('Shape of DTM: ', count_train.shape)
pd.DataFrame(count_train[0,1000:1015].toarray(), columns=count_vectorizer.get_feature_names()[1000:1015])

Shape of DTM:  (34497, 37663)


Unnamed: 0,amature,amazballs,amaze,amazeballs,amazed,amazeee,amazeeeeeeeballs,amazement,amazes,amazig,amazin,amazing,amazinggg,amazingggg,amazinggggg
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


It's apparent that there will be many duplicate versions/misspellings of a word, which might confound results and cause some terms to be downweighted in importance. Not clear how to correct for these

## 3.3 Naive Bayes classifier

The Naive Bayes model is commonly used for testing NLP classification problems. It is rooted in probability ([Bayes theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem)) and generates predictions based on past data - given prior training data with features and labelled outcomes, what can we predict with our set of test observations and their features? The label it predicts for each observation is based on its calculation of the likeliest out of the possible labels. See [here](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/) for a good explanation.

Here, each word from `CountVectorizer` acts as a feature.

#### Generate predicted "good"/"bad" reviews

Use sklearn's [naive_bayes](http://scikit-learn.org/stable/modules/naive_bayes.html) module to generate predictions.

In [14]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print('Accuracy score:', score, '\n')

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['good', 'bad'])
print('Confusion matrix:','\n', cm)

Accuracy score: 0.8936558380414312 

Confusion matrix: 
 [[9954  825]
 [ 982 5231]]


Our classifier performed fairly well. Let's inspect the model to actually see what it has learned.

In [15]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = count_vectorizer.get_feature_names()

# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))

# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20], '\n')

# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

bad [(-13.98729597068333, 'aaaanyway'), (-13.98729597068333, 'aaah'), (-13.98729597068333, 'aaalll'), (-13.98729597068333, 'aahed'), (-13.98729597068333, 'aahing'), (-13.98729597068333, 'aamer'), (-13.98729597068333, 'aarp'), (-13.98729597068333, 'aaverage'), (-13.98729597068333, 'abaiyin'), (-13.98729597068333, 'abandoning'), (-13.98729597068333, 'abberation'), (-13.98729597068333, 'abbreviation'), (-13.98729597068333, 'aberation'), (-13.98729597068333, 'abercrombie'), (-13.98729597068333, 'abhorrent'), (-13.98729597068333, 'abiding'), (-13.98729597068333, 'abit'), (-13.98729597068333, 'abject'), (-13.98729597068333, 'abnormal'), (-13.98729597068333, 'abnoxious')] 

good [(-5.577465297595593, 'love'), (-5.55066228712551, 'come'), (-5.510299969018506, 'try'), (-5.4789417279342985, 'experience'), (-5.422646838110797, 'definitely'), (-5.403566255466849, 'meal'), (-5.180273014757853, 'amazing'), (-5.17803035851402, 'really'), (-5.150486088284966, 'best'), (-5.1146686047644945, 'menu'), (-

The results make sense. Negative words such as "abberation" and "abominable" feature prominently in "bad" reviews, while "good" reviews have positive descriptors such as "delicious" and "amazing". 

It is interesting to note features that by themselves are neutral, such as "service". Since they are included in "good" reviews, it seems service is an important factor and is conducted well in "good" reviews.

However, it seems like reviewer misspellings could downweight the term's importance (ex. "abnoxious" vs "obnoxious"). There are a few duplicate words (ex. "abberation", "aberation") to the same effect as well.

#### Check examples of false positives

Where "bad" reviews were incorrectly classified as "good" reviews.

In [55]:
false_positives = X_test[y_test < pred]
print('1 example: ','\n')
print('Pre-processed: ', false_positives[47597], '\n')
print('Actual review: ', df_reviews.loc[47597, 'review_text'], '\n')
print('First 10 examples: ')
false_positives.head(10)

1 example:  

Pre-processed:  great bistro ambiance bistro actual building gorgeous feel much like sitting nice bistro lyon nyc high end price view street delivery truck service fantastic gourgeres bar nice touch food good priced great ambiance crowded 

Actual review:  Great Bistro ambiance, at not so bistro prices.. . The actual building is gorgeous and feels very much like you are sitting in a nice bistro in Lyon, but with NYC high end prices, and views of 55th street delivery trucks and trashbags.. . The service is fantastic, and the gourgeres at the bar are a nice touch. The food is very good but over priced for what you get.. . Great ambiance when not crowded. 

First 10 examples: 


37051    although definitely better pho chinatown nice ...
47597    great bistro ambiance bistro actual building g...
22326    white gold lunch game another level went back ...
30444    bookmarked quite time picture seen instagram p...
49345    let took trip ny specifically eat per se loved...
4851     food pretty good flavor authentic tried nasi l...
20765    good service place beautiful convinced food we...
35277    would given star service faster gluten free pa...
7839     try e alpukat coffee avocado milkshake one tim...
60917    restaurant small cozy good spot date night sin...
Name: text_preprocessed, dtype: object

As expected, a likely reason some reviews were false positives was due to the prevalence of positive words (ex. "good", "gorgeous") in the midst of a negative evaluation with a few turns of phrase (ex. "...but over priced for what you get"). 

In fact, in our example, pre-processing and removing stopwords may have removed tokens that would have caused the review to be correctly labelled as "bad", since they were critical parts of negative turns of phrase. Phrases/words such as "not so", "trashbags", "but over priced for what you get" are lost as a result.

#### Check examples of false negatives

Where "good" reviews were incorrectly classified as "bad" reviews.

In [60]:
false_negatives = X_test[y_test > pred]

print('1 example: ','\n')
print('Pre-processed: ', false_negatives[18056],'\n')
print('Actual review: ', df_reviews.loc[18056, 'review_text'], '\n')
print('First 10 examples: ')
false_negatives.head(10)

1 example:  

Pre-processed:  love place went first time ordered salt pepper chicken rice tofu vegetable rice beef tendon dish rice stayed meal complimentary pork soup tea nice soup tea like place mama lee made soup tea heart kindness know restarurant complimentary thing taste kinda bland shitty sometimes felt like home eating meal rice came separately another bowl even tho ordered rice worth 

Actual review:  Love this place. I went there for the first time and ordered salt and pepper chicken over rice, tofu with vegetable over rice,  and beef tendon dish over rice. We stayed for the meal and they have complimentary pork soup and tea. . . It's a very nice soup and tea, this is not like other places, Mama Lee made these soup and tea with her heart (kindness). You know how other restarurant complimentary things taste kinda just bland and shitty sometimes, this is not at all!   I felt like home eating all the meals. . . The rice came separately with another bowl even tho you ordered "___

14283    love place bummer ca since come visit nyc ever...
18056    love place went first time ordered salt pepper...
32238    brother stopped quick lunch food great loved a...
60373    cute kitschy vibe decent price strong drink cu...
54334    true service bad food take awhile come pizza g...
46904    despite poke craze taking entire city place ju...
60155    amazing place noodle tasty take low spice othe...
14436    visited tim ho wan couple time must say go dim...
13951    came tuesday evening line door already pretty ...
7329     rice noodle mi fen decently made color texture...
Name: text_preprocessed, dtype: object

Like the false positives, these reviews tend to have a mixed bag of vocabulary associated with both "good" & "bad" reviews. In the above example for instance, there are many positive words (ex. "love", "nice") but also negative words that would probably trigger a "bad" review (ex. "shitty", "bland") even though the reviewer used these terms to describe other competing restaurants.

# 4. Conclusion

In this notebook, we: 
- Set up a pipeline for NLP - including pre-processing, generating a corpus, bag of words, CountVectorizer sparse DTM, etc. 
- Discovered words (and word frequencies) associated with "good" & "bad" reviews.
- Successfully explored classification for "good" & "bad" reviews. Our Naive Bayes classifier worked fairly well, with an accuracy score of ~90%. 

Now that we have quantified terms associated with "good"/"bad" reviews, as well as experimented with a classifier for predicting such reviews, we proceed with incorporating NYT data in our next notebook (`data_EDA_timeseries`). 

The results in this section will allow us to determine whether the introduction of NYT reviews has any influence on Yelp reviews. For instance, introducing NYT reviews may shift the emphasis on different terms predicting "good"/"bad" reviews (ex. "NYT", "service", "hole-in-the-wall", "critic").