# Yelp Data Challenge - Clustering Analysis

- From the sentiment analysis, I concluded that the rating score is unreasonable in the 4-5 star and 1-2 star cases. I suggest replacing the 1-5 rating system with either a like/dislike system or a below_average/average/above_average system. Here I want to determine which one is better through clustering analysis.
- The clustering method is K-means. It turns out that three clusters (below_average/average/above_average) fit the reviews better than two clusters (like/dislike), judging by the top centroid words and by sampled reviews.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('../last_2_years_restaurant_reviews.csv')
df.head(2)

Unnamed: 0,business_id,name,categories,avg_stars,review_id,user_id,stars,date,text,useful,funny,cool
0,--9e1ONYQuAa-CB_Rrw7Tw,"""Delmonico Steakhouse""",Cajun/Creole;Steakhouses;Restaurants,4.0,6SgvNWJltnZhW7duJgZ42w,oFyOUOeGTRZhFPF9uTqrTQ,5,2016-03-31,This is mine and my fiancé's favorite steakhou...,0,0,0
1,--9e1ONYQuAa-CB_Rrw7Tw,"""Delmonico Steakhouse""",Cajun/Creole;Steakhouses;Restaurants,4.0,UxFpgng8dPMWOj99653k5Q,aVOGlN9fZ-BXcbtj6dbf0g,5,2016-02-10,Truly Fantastic! Best Steak ever. Service was...,0,0,0


# Cluster the review text data for all the restaurants

#### Define my feature variables: the text of the reviews (I only run on part of the data, every 20th review, because I hit memory errors many times.)

In [6]:
documents = df['text'][::20]

In [7]:
from sklearn.model_selection import train_test_split
# The clustering algorithm is slow, so I take a sample to find the optimal number of clusters
documents_train, documents_test= train_test_split(documents, test_size=0.3)

In [8]:
len(documents_train), len(documents_test)

(12196, 5227)

#### Representation of the documents

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Train the model with my training data
documents_train_vec = vectorizer.fit_transform(documents_train).toarray()
# Get the vocab of my tfidf
# (on scikit-learn >= 1.0 the replacement is vectorizer.get_feature_names_out())
words = vectorizer.get_feature_names()
# Use the trained model to transform all the reviews
documents_vec = vectorizer.transform(documents).toarray()
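One likely contributor to the memory errors mentioned above is the `.toarray()` call, which turns the sparse TF-IDF matrix into a dense one. The sketch below is not what this notebook ran; it only illustrates, on a made-up toy corpus, that `KMeans` can be fitted directly on the sparse matrix returned by `TfidfVectorizer`. All `toy_*` names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the review texts (assumption: not the Yelp data)
toy_reviews = [
    "great food and friendly staff",
    "amazing service, great place",
    "waited forty minutes for my order",
    "order was wrong and the food was cold",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
# fit_transform returns a scipy sparse matrix; note there is no .toarray() call
toy_sparse = toy_vectorizer.fit_transform(toy_reviews)

# KMeans accepts the sparse matrix as-is; random_state is fixed for repeatability
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_km.fit_predict(toy_sparse)

print(type(toy_sparse).__name__, toy_sparse.shape)
print(toy_labels)
```

The centroids in `cluster_centers_` are still a dense array of shape (n_clusters, n_features), so the centroid inspection in the following cells would work unchanged.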

## KMeans 

#### K = 2 

In [10]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)
km_clf.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### Make predictions on all data

In [11]:
cluster = km_clf.predict(documents_vec)

####  Inspect the centroids
To find out what "topics" K-means has discovered, I must inspect the centroids. The cell below prints the centroids of the K-means clustering.
These centroids are simply a bunch of vectors. To make sense of them I need to map them back into 'word space'. Think of each centroid as the "average" review of its cluster: each feature/dimension holds the average TF-IDF weight of one word across the reviews in that cluster.

In [12]:
km_clf.cluster_centers_

array([[ 4.70370891e-04, -1.30646362e-17,  4.16172982e-03, ...,
         1.51689537e-04,  5.14538600e-04, -8.80914265e-18],
       [ 2.56770575e-03,  3.27333436e-04,  7.78323301e-03, ...,
         1.70504095e-04,  4.12182372e-04,  2.68677805e-04]])

####  Find the top 10 features for each cluster.

In [15]:
cluster_top_features = list()
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:10])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num, ", ".join(words[i] for i in centroid)))

0: great, food, service, place, amazing, good, love, staff, best, friendly
1: good, food, place, just, like, time, ordered, chicken, service, order


It seems that both clusters correspond to positive reviews. Print out the rating and text of one randomly sampled review from each cluster to get a sense of the cluster.

In [16]:
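for i in range(km_clf.n_clusters):
    # Positions within `documents` (every 20th review) assigned to cluster i
    sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in `documents` back to the matching row of df
    row = df.loc[documents.index[sample[0]]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % row['stars'])
    print("The review is:\n%s.\n" % row['text'])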

The cluster is 1.
The star is: 3 stars.
The review is:
Sung to the tune of Eye of The Tiger by Survivor:

Feast Buffet, you hooked me in
With your $10 special
Stood forever for an awesome feast
Just a girl with a will to digest

So many foods I just couldn't wait
Everything looked so so yummy
Why can't the line move any faster?
You must wait just to get to the food

It's the anticipation, its the need for some grinds
and I need to eat soon 'coz I'm hungry
I get up to the buffet and I start piling stuff
And I'm sucking down the shrimp like there's no tomorrow

Hand to mouth it's all a blur
Who cares if the food is real salty
I gotta eat and eat all I can
Just a girl who is really pake*

It's the Feast Buffet and it gave me the runs
I was running to the crapper in the morning
That's the last time I'll eat there and I pray I'm alright
And I swear I will never go there, never ever!


(*Pake pronounced pah-keh- Local Hawaiian pidgin for a penny-pinching cheap Chi

#### Try a different k: k = 3

In [17]:
# Fit K-means with three clusters on the training sample
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)
km_3_clusters.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [18]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:10])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: good,place,food,chicken,best,delicious,vegas,like,really,service
2: food,order,just,time,service,like,minutes,didn,place,got
3: great,food,service,place,amazing,good,awesome,love,staff,friendly


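It seems that three clusters can capture the positive, negative, and in-between types: cluster 3 is clearly positive (great, amazing, awesome), cluster 2 reads like complaints about orders and waiting (order, minutes, didn), and cluster 1 sits in between.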

In [19]:
three_cluster_pred = km_3_clusters.predict(documents_vec)

Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [20]:
for i in range(km_3_clusters.n_clusters):
    # Positions within `documents` (every 20th review) assigned to cluster i
    sub_cluster = np.arange(0, three_cluster_pred.shape[0])[three_cluster_pred == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in `documents` back to the matching row of df
    row = df.loc[documents.index[sample[0]]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % row['stars'])
    print("The review is:\n%s.\n" % row['text'])

The cluster is 1.
The star is: 3 stars.
The review is:
I stopped by here at very late night Saturday.. this place opens till 6am.  The price of the late night food was pretty fair and the food was good as well. 

You can't expect a fancy interior nor service here but it you are looking for some authentic Cantonese food in a late night with fair price. Then it's a good place to go..

The cluster is 2.
The star is: 2 stars.
The review is:
Food isn't that special but service was good. We were a bit lost and I have to commend the guy who answered the phone (didn't get his name) but he gave us very detailed directions on how to reach the spot. Unfortunately, that was the highlight for me and my company. Wouldn't go back to this place but you can try it once..

The cluster is 3.
The star is: 5 stars.
The review is:
Just starting off with saying, this place is A MUST GO TO!!! Every single item here is so purely unique and delicious. Not to mention, how incredibly welcoming the owners are! T

At least the three examples have different ratings. Now let's see what happens when k = 5.
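One random review per cluster is thin evidence, so a fuller check is to cross-tabulate cluster labels against star ratings. The sketch below uses made-up labels and ratings (the `toy_*` names are hypothetical, not notebook variables) to show the idea with `pd.crosstab` and a per-cluster mean.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels and ratings for eight reviews (made-up values)
toy_clusters = np.array([0, 0, 1, 1, 2, 2, 2, 1])
toy_stars = pd.Series([3, 4, 1, 2, 5, 5, 4, 2], name="stars")

# Rows: cluster label, columns: star rating, cells: number of reviews
table = pd.crosstab(pd.Series(toy_clusters, name="cluster"), toy_stars)
print(table)

# Mean star rating per cluster as a one-number summary
mean_stars = toy_stars.groupby(toy_clusters).mean()
print(mean_stars)
```

With this notebook's variables, the equivalent call would presumably be `pd.crosstab(three_cluster_pred, df['stars'].iloc[::20].values)`; the `[::20]` slice matters because the predictions cover only every 20th review.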

#### Try a different k: k = 5

In [21]:
km_5_clusters = KMeans(verbose = 0, n_clusters = 5)
km_5_clusters.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [22]:
cluster_top_features = list()
for i in range(km_5_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_5_clusters.cluster_centers_[i])[::-1][:10])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: food,order,time,just,service,minutes,got,didn,ordered,burger
2: great,food,service,place,good,amazing,awesome,love,time,staff
3: pizza,good,crust,place,great,slice,cheese,best,like,time
4: food,good,place,service,vegas,best,delicious,amazing,love,time
5: chicken,good,fried,rice,food,like,place,ordered,thai,really


In [23]:
five_cluster_pred = km_5_clusters.predict(documents_vec)

In [24]:
for i in range(km_5_clusters.n_clusters):
    # Positions within `documents` (every 20th review) assigned to cluster i
    sub_cluster = np.arange(0, five_cluster_pred.shape[0])[five_cluster_pred == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in `documents` back to the matching row of df
    row = df.loc[documents.index[sample[0]]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % row['stars'])
    print("The review is:\n%s.\n" % row['text'])

The cluster is 1.
The star is: 5 stars.
The review is:
A friend of mine comes here weekly for karaoke. 

I, sadly, usually have to work super early the next day, so I hadn't been able to come out...until last Monday night.  I had some vacation days in, so on Monday night, about 9:00pm, we headed over to Aces and Ales.

I'd actually been to this location about 5 years ago, and enjoyed the food.  

This time, we didn't eat, but we drank...oh did we drink.  

Their beer selection isn't as vast as the Tenaya location, but they still have GREAT beers at great prices.  

My selections for the evening: 
Crafthaus (my FAVORITE local brewery) Centerpiece Sour:  AMAZING.  I'm new to the sour beer side of things, and I just loved this.  
Delirium Tremens:  My second favorite beer of all time, I had two of these.  This beer is SO much better on draught, rather than out of the bottle, I will definitely come back here the next time I want one, instead of buying a bottle at Lee's. 

I a

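It seems that with k = 5 the clusters capture not only the star rating but also the food type (for example, pizza in cluster 3 and chicken, fried rice, and Thai food in cluster 5).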
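The comparison of k = 2, 3, and 5 above rests on reading top words and sampled reviews. A numeric complement, which this notebook did not compute, is the silhouette score for each k. The sketch below runs on a small made-up corpus, so its numbers say nothing about the Yelp data; `toy_reviews` and `scores` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up reviews standing in for the Yelp text (assumption)
toy_reviews = [
    "great food and friendly staff",
    "amazing service and a great place",
    "love this place, best staff",
    "waited forty minutes for my order",
    "order was wrong and the food was cold",
    "slow service, never got my drink",
    "pizza crust was thin and crispy",
    "best pizza slice with lots of cheese",
    "fried chicken and rice were fine",
]

# Dense is acceptable for a nine-document toy corpus
X = TfidfVectorizer(stop_words="english").fit_transform(toy_reviews).toarray()

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On the full review matrix the pairwise distances are expensive, so passing `sample_size` to `silhouette_score` would likely be necessary.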