(To load this notebook's stylesheet, execute the last code cell.)

# Yelp Reviews and Clustering

In this assignment, we will be working with the [Yelp dataset](http://cs-people.bu.edu/kzhao/teaching/yelp_dataset_challenge_academic_dataset.tar). You can find the format of the dataset [here](https://www.yelp.com/dataset_challenge).

First, we will look at Review Objects and perform some [sentiment analysis](http://sentiment.christopherpotts.net/) on the review text.

You will need to preprocess the text using a stemming algorithm, such as the well-known Porter stemmer. Then, use a lexicon to assign a score to each review based on the positive and negative words you find in its text. You can find various lexicons [here](http://sentiment.christopherpotts.net/lexicons.html).
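
The stem-then-score idea can be sketched as follows. This is a minimal illustration, not the required solution: `TOY_LEXICON` and `lexicon_score` are hypothetical names, the six-word lexicon is a stand-in for one of the linked lexicons, and the regex tokenizer is an assumption chosen to avoid extra NLTK downloads.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical toy lexicon (+1 positive, -1 negative); a real run would
# load one of the lexicons linked above instead.
TOY_LEXICON = {"love": 1, "great": 1, "amazing": 1,
               "terrible": -1, "horrible": -1, "bad": -1}

stemmer = PorterStemmer()

# Stem the lexicon entries as well, so lookups compare stem against stem.
STEMMED_LEXICON = {stemmer.stem(word): polarity
                   for word, polarity in TOY_LEXICON.items()}

def lexicon_score(text):
    """Return (positive hits - negative hits) / token count, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    total = sum(STEMMED_LEXICON.get(stemmer.stem(tok), 0) for tok in tokens)
    return total / len(tokens)
```

Dividing by the token count keeps long reviews from dominating; whether that normalization suits the Yelp data is a design choice worth testing.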

After you have assigned scores to the reviews based on the text analysis, compare your scores with the stars associated with the reviews. **(20 pts)**
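
One simple way to compare text scores with star ratings is a correlation coefficient. The sketch below assumes the two sequences are aligned review by review; `pearson_r` is a hypothetical helper name, and Pearson correlation is only one of several reasonable measures (a rank correlation may suit the ordinal star scale better).

```python
import numpy as np

def pearson_r(stars, scores):
    """Pearson correlation between star ratings and sentiment scores."""
    stars = np.asarray(stars, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the correlation between the two inputs.
    return float(np.corrcoef(stars, scores)[0, 1])
```

A value near 1 would suggest that the text score tracks the stars closely, while a value near 0 would suggest the lexicon captures little of what drives the rating.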

In [None]:
"""
Read the review file line by line, parsing each line as a JSON object.
Score each review's text with NLTK's SentimentIntensityAnalyzer,
which takes a string and returns sentiment scores.
"""

import json
import numpy as np
# Requires the VADER lexicon: run nltk.download('vader_lexicon') once beforehand
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import pandas as pd
import time

%matplotlib inline

path = "/Users/ALaw/Desktop/Stuff/2016 Spring/591/submissions/591-hw/hw2-submission/yelp_dataset_challenge_academic_dataset/"
filename = path + "yelp_academic_dataset_review.json"

debug = False

def parse(fn=filename):
    '''Read the review file line by line and return a DataFrame with one (stars, sentiment) row per review; also writes the rows to sample_data.csv'''
    
    data = []
    sid = SentimentIntensityAnalyzer()
    
    if debug:
        print("Begin Parsing...")
    
    with open(fn) as f:
        for line in f:
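            json_line = json.loads(line)
            # VADER returns 'pos', 'neg', 'neu' proportions and a normalized 'compound' score in [-1, 1]
            scores = sid.polarity_scores(json_line['text'])
            # Custom score: the pos/neg balance, damped by the squared compound score
            sentiment = (scores['pos']*2 - scores['neg']*2)*(scores['compound']**2)
            stars = json_line['stars']
            data.append((stars,sentiment))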
    
    
    if debug:
        print("Converting to DataFrame...")
    
    data = pd.DataFrame(data,columns = ['stars','sentiment'])
    data = data.sort_values('stars')
    
    #print(list(data['stars']))
    #print(list(data['sentiment']))
    
    if debug:
        print("Converting to CSV...")
    
    data.to_csv('sample_data.csv')
    
    return data

def visualize(data):
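    '''Visualizes the sentiment vs star rating data through a scatter plot and overlays a least-squares regression line'''
    
    if debug:
        print("Extracting X,Y values...")
    
    x = list(data['sentiment'])
    y = list(data['stars'])
    
    # Least-squares fit: stars ~ slope * sentiment + intercept
    slope, intercept = np.polyfit(x, y, 1)
    line_x = np.array([min(x), max(x)])
    
    plt.figure(figsize=(10,9))
    plt.scatter(x, y)
    plt.plot(line_x, slope*line_x + intercept, color='red', label='Linear fit')
    plt.legend()
    plt.title('Review Data Scatterplot')
    plt.ylabel('# of Stars')
    plt.xlabel('Sentiment Score')
    
    if debug:
        print("Graph complete")
    
    plt.show()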

data = parse()
visualize(data)

Visualization and short (detailed) analysis. **(10 pts)**
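
A per-star summary table can support the written analysis alongside the plot. The sketch below assumes a DataFrame with `stars` and `sentiment` columns, as produced by `parse()` above; `summarize_by_stars` is a hypothetical helper name.

```python
import pandas as pd

def summarize_by_stars(data):
    """Mean, median, and count of sentiment scores for each star rating."""
    return data.groupby('stars')['sentiment'].agg(['mean', 'median', 'count'])
```

If the scores are informative, the mean and median should rise with the star rating; large overlaps between neighboring ratings would be worth discussing.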

-----------------

Now, let's look at Business Objects. Try to find culinary districts in Las Vegas. These are characterized by the closeness and similarity of their restaurants. Use "longitude" and "latitude" to cluster by closeness, and "categories" and "attributes" to cluster by similarity.
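
One possible data representation is a coordinate matrix plus a multi-hot category matrix. The sketch below assumes each business is a dict with `city`, `latitude`, `longitude`, and a list-valued `categories` field, and that restaurants carry the category `"Restaurants"`; check these assumptions against the dataset format. `select_restaurants` and `build_features` are hypothetical helper names, and "attributes" are left out for brevity.

```python
import numpy as np

def select_restaurants(businesses, city="Las Vegas"):
    """Keep businesses in `city` whose categories include "Restaurants"."""
    return [b for b in businesses
            if b.get("city") == city
            and "Restaurants" in (b.get("categories") or [])]

def build_features(businesses):
    """Return (coords, cats, vocab): an n x 2 [latitude, longitude] array,
    an n x |vocab| multi-hot category array, and the sorted category names."""
    # "Restaurants" is shared by every selected business, so it carries no signal.
    vocab = sorted({c for b in businesses for c in b["categories"]} - {"Restaurants"})
    index = {c: i for i, c in enumerate(vocab)}
    coords = np.array([[b["latitude"], b["longitude"]] for b in businesses],
                      dtype=float)
    cats = np.zeros((len(businesses), len(vocab)))
    for row, b in enumerate(businesses):
        for c in b["categories"]:
            if c in index:
                cats[row, index[c]] = 1.0
    return coords, cats, vocab
```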

Find clusters using the 3 different techniques we discussed in class: k-means++, hierarchical, and GMM. Explain your data representation and how you determined certain parameters (for example, the number of clusters in k-means++). **(30 pts)**
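
The three techniques can be run side by side with scikit-learn, as sketched below. The sketch assumes `X` is a numeric feature matrix with one row per restaurant; `cluster_three_ways` and `pick_k_by_silhouette` are hypothetical helper names, and the silhouette score is only one possible way to choose the number of clusters (the elbow method, or BIC for the GMM, are alternatives).

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def pick_k_by_silhouette(X, candidates, seed=0):
    """Return the candidate k whose k-means++ clustering has the highest silhouette score."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def cluster_three_ways(X, k, seed=0):
    """Cluster X into k groups with k-means++, Ward hierarchical clustering, and a GMM."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed).fit(X)
    hc = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(X)
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
    return {"kmeans": km.labels_,
            "hierarchical": hc.labels_,
            "gmm": gmm.predict(X)}
```

Ward linkage is an assumption here; other linkages (single, complete, average) may produce quite different districts and are worth comparing.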

Things you may want to consider:
1. The spatial coordinates and restaurant categories/attributes have different units of scale. Your results could be arbitrarily skewed if you don't incorporate some scaling.
2. Some restaurant types are inherently more common than others. For example, there are probably lots of "pizza" restaurants. You may want to normalize your vectors so that you don't end up with only clusters of "pizza" restaurants.
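
Both considerations can be addressed before clustering, as in the sketch below. It assumes `coords` is an n x 2 array of coordinates and `cats` is an n x m multi-hot category array; `scale_and_combine` and `spatial_weight` are hypothetical names, and the IDF-style weighting is one possible way to down-weight common categories, not the only one.

```python
import numpy as np

def scale_and_combine(coords, cats, spatial_weight=1.0):
    """Z-score the coordinates, IDF-weight and L2-normalize the categories,
    then stack both into a single feature matrix."""
    coords = np.asarray(coords, dtype=float)
    cats = np.asarray(cats, dtype=float)

    # Consideration 1: put the coordinates on a unit-variance scale.
    std = coords.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    z = (coords - coords.mean(axis=0)) / std

    # Consideration 2: common categories (e.g. "pizza") get a smaller weight.
    n = cats.shape[0]
    doc_freq = cats.sum(axis=0)
    idf = np.log((1 + n) / (1 + doc_freq)) + 1.0
    weighted = cats * idf

    # Unit-length rows, so restaurants with many categories do not dominate.
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return np.hstack([spatial_weight * z, weighted / norms])
```

Raising `spatial_weight` favors geographically compact clusters, while lowering it favors clusters of similar cuisine; the right balance is a parameter to justify in the write-up.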

Visualize your clusters using each technique. Label your clusters. **(10 pts)**
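
A labeled scatter plot is one straightforward visualization. The sketch below assumes `coords` holds `[latitude, longitude]` rows and `labels` holds one integer cluster label per row; `plot_clusters` is a hypothetical helper name.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters(coords, labels, title):
    """Scatter restaurants by location, colored by cluster, with a label at each cluster mean."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(8, 7))
    for lab in np.unique(labels):
        pts = coords[labels == lab]
        # Longitude on the x-axis and latitude on the y-axis, as on a map.
        ax.scatter(pts[:, 1], pts[:, 0], s=12, label=f"cluster {lab}")
        ax.annotate(str(lab), (pts[:, 1].mean(), pts[:, 0].mean()),
                    fontsize=14, fontweight="bold")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title(title)
    ax.legend()
    return ax
```

Calling it once per technique with the same coordinates makes the three clusterings easy to compare visually.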

Now let's detect outliers. These are the restaurants that are farthest from the centroids of their clusters. Track them down and describe any interesting observations that you can make. **(10 pts)**
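
The farthest-from-centroid rule can be sketched as follows. It assumes `X` is the feature matrix used for clustering and `labels` holds integer cluster labels; `farthest_from_centroid` is a hypothetical helper name, and the centroid is taken to be the mean of each cluster's points, which is an approximation for hierarchical clustering and GMM.

```python
import numpy as np

def farthest_from_centroid(X, labels, top_n=1):
    """For each cluster, return the top_n (row index, distance) pairs
    farthest from the cluster mean, in decreasing order of distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    outliers = {}
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        order = np.argsort(dist)[::-1][:top_n]
        outliers[int(lab)] = [(int(idx[i]), float(dist[i])) for i in order]
    return outliers
```

The returned row indices can be mapped back to business names to see which restaurants sit far from their district.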

Give a short (detailed) analysis comparing the 3 techniques. **(10 pts)**
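
A quantitative starting point for the comparison is the pairwise agreement between the label assignments. The sketch below uses scikit-learn's adjusted Rand index, which ignores how clusters are numbered; `pairwise_agreement` is a hypothetical helper name, and the input is assumed to be a dict mapping technique names to label sequences.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def pairwise_agreement(labelings):
    """Adjusted Rand index for every pair of techniques, keyed by the sorted name pair."""
    return {(a, b): adjusted_rand_score(labelings[a], labelings[b])
            for a, b in combinations(sorted(labelings), 2)}
```

A score of 1 means two techniques produced identical partitions, while scores near 0 indicate agreement no better than chance.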

-----------------

In [None]:
# Code for setting the style of the notebook
from IPython.display import HTML
def css_styling():
    # Read the custom stylesheet, closing the file handle when done
    with open("../theme/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()