# Analysis of Yelp Restaurant Reviews with NLP

## Outline
1. A tour of the dataset
1. Introduction to text processing with spaCy
1. Automatic phrase modeling
1. Topic modeling with LDA
1. Visualizing topic models with pyLDAvis
1. Word vector models with word2vec
1. Visualizing word2vec with t-SNE

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [**Yelp**](http://yelp.com) for academic research and educational purposes.

The current iteration of the Yelp dataset (as of this demo) consists of the following data:
- __552K__ users
- __77K__ businesses
- __2.2M__ user reviews

When focusing on restaurants alone, there are approximately __55K__ restaurants with approximately __1M__ user reviews written about them.

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- __business.json__ &mdash; _the records for individual businesses_
- __review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _an array containing relevant category values of businesses_

In [13]:
import json
import os

with open('business.json', 'r') as f:
    first_business_record = json.loads(f.readline())

print(first_business_record)

{'business_id': 'FYWN1wneV18bWNgQjJ2GNg', 'name': 'Dental by Design', 'neighborhood': '', 'address': '4855 E Warner Rd, Ste B9', 'city': 'Ahwatukee', 'state': 'AZ', 'postal_code': '85044', 'latitude': 33.3306902, 'longitude': -111.9785992, 'stars': 4.0, 'review_count': 22, 'is_open': 1, 'attributes': {'AcceptsInsurance': True, 'ByAppointmentOnly': True, 'BusinessAcceptsCreditCards': True}, 'categories': ['Dentists', 'General Dentistry', 'Health & Medical', 'Oral Surgeons', 'Cosmetic Dentists', 'Orthodontists'], 'hours': {'Friday': '7:30-17:00', 'Tuesday': '7:30-17:00', 'Thursday': '7:30-17:00', 'Wednesday': '7:30-17:00', 'Monday': '7:30-17:00'}}


The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews. A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

In [14]:
with open('review.json', 'r') as f:
    first_review_record = json.loads(f.readline())
    
print(first_review_record)

{'review_id': 'v0i_UHJMo_hPBq9bxWvW4w', 'user_id': 'bv2nCi5Qv5vroFiqKGopiw', 'business_id': '0W4lkclzZThpx3V65bVgig', 'stars': 5, 'date': '2016-05-28', 'text': "Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.", 'useful': 0, 'funny': 0, 'cool': 0}


The code below extracts the unique restaurant IDs and counts them

In [15]:
restaurant_ids = set()

# open the businesses file
with open('business.json') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

54,618 restaurants in the dataset.


We now extract from the reviews the text relative to restaurant reviews, and only those.

In [16]:
%%time

review_txt_filepath = os.path.join('intermediate', 'review_text_all.txt')
if os.path.isfile(review_txt_filepath):
    with open(review_txt_filepath, 'r') as fh:
        for review_count, line in enumerate(review_txt_filepath):
            pass
else:
    review_count = 0

    # create & open a new file in write mode
    with gzip.open(review_txt_filepath, 'w') as review_txt_filepath:

        # open the existing review json file
        with gzip.open('review.json') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_filepath.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

print('''Text from {:,} restaurant reviews 
         written to the new txt file.'''.format(review_count))

Text from 31 restaurant reviews 
         written to the new txt file.
CPU times: user 694 µs, sys: 828 µs, total: 1.52 ms
Wall time: 785 µs


![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

SpaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

SpaCy Can perform
- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

Let's look at one review.

In [18]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

with open(review_txt_filepath) as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

This is currently my parents new favourite restaurant. 

We come here in the morning for dim sum. They are not the cart pushing type of dim sum, it is order off of the sheet. Dim sum is not bad and not expensive either.

We also frequent the dinner scene. Their set dinner menu is not bad. We typically order a 6 dish menu and it's big enough to feed a family of 9 with leftovers. 

Overall, food is pretty tasty!



### Decomposition into sentences

Let's decompose this review into it's language components. First of all, how many sentences is the review composed of?

In [19]:
parsed_review = nlp(sample_review)

for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
This is currently my parents new favourite restaurant. 



Sentence 2:
We come here in the morning for dim sum.

Sentence 3:
They are not the cart pushing type of dim sum, it is order off of the sheet.

Sentence 4:
Dim sum is not bad and not expensive either.



Sentence 5:
We also frequent the dinner scene.

Sentence 6:
Their set dinner menu is not bad.

Sentence 7:
We typically order a 6 dish menu and it's big enough to feed a family of 9 with leftovers. 



Sentence 8:
Overall, food is pretty tasty!




### Named-Entity recognition

**GPE** stands for Geo-Political Entity. Apparently the newlines are confusing the library.

In [20]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: 6 - CARDINAL

Entity 2: 9 - CARDINAL

Entity 3: 
 - GPE



### Parts Of Speech (POS) tagging

Let's decompose the text into POS tags, like verbs, adverbs, adjectives, punctuation, etc.

In [21]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]
pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,This,DET
1,is,VERB
2,currently,ADV
3,my,ADJ
4,parents,NOUN
5,new,ADJ
6,favourite,ADJ
7,restaurant,NOUN
8,.,PUNCT
9,\n\n,SPACE


### Stemming, lemmatization, shape analysis

Note that `is` is lemmatized into `be`. Plurals are stemmed into singulars.

In [22]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,This,this,Xxxx
1,is,be,xx
2,currently,currently,xxxx
3,my,-PRON-,xx
4,parents,parent,xxxx
5,new,new,xxx
6,favourite,favourite,xxxx
7,restaurant,restaurant,xxxx
8,.,.,.
9,\n\n,\n\n,\n\n


Let's check the relative frequency of certain types of token:
- stopword
- punctuation
- whitespace
- represents a number
- whether or not the token is included in spaCy's default vocabulary?

In [30]:
token_attributes = [(token.orth_,
                     # token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           # 'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'in vocab.?'])

df.loc[:, 'stop?':'in vocab.?'] = (df.loc[:, 'stop?':'in vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,stop?,punctuation?,whitespace?,number?,in vocab.?
0,This,,,,,Yes
1,is,Yes,,,,Yes
2,currently,,,,,Yes
3,my,Yes,,,,Yes
4,parents,,,,,Yes
5,new,,,,,Yes
6,favourite,,,,,Yes
7,restaurant,,,,,Yes
8,.,,Yes,,,Yes
9,\n\n,,,Yes,,Yes


**Note** that all this works because the text is *generic*. If we had clinical, medical, or other technical language, SpaCy would struggle. This can be remediated, but this is outside of the scope of this notebook.

## Phrase Modeling

So far we have considered single words. Words that often co-occur together can refer to a single entity (e.g. 'New' and 'York' and 'City' refer to one concept, not three). The Chinese food *dim sum* was actually interpreted as the two English words *dim* and *sum*.
We compare the frequency of the occurrence of words `A` and `B` separately, with the frequency of occurrences of the word-pair `A B`.

In [31]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

The functions below:
1. Check whether a word is punctuation or whitespace.
2. WHAT DO THEY DO????

In [32]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename) as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename)):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [33]:
%%time
unigram_sentences_filepath = os.path.join('intermediate', 'unigram_sentences_all.txt')

if not os.path.isfile(unigram_sentences_filepath):  
    with open(unigram_sentences_filepath, 'w') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

CPU times: user 593 µs, sys: 1.1 ms, total: 1.69 ms
Wall time: 6.04 ms
