# Yelp_Data_Challenge - Clustering

## Main tasks

1. Data preprocessing
    - 1.1 Define feature variables
    - 1.2 Define target variable
    - 1.3 Create training dataset and test dataset
    - 1.4 Get NLP representation of the documents
2. Cluster reviews with KMeans
    - 2.1 Fit k-means clustering with the training vectors and apply it on all the data
    - 2.2 Make predictions on all data
    - 2.3 Inspect the centroids
    - 2.4 Try using different k (clusters)
3. Cluster all the reviews of the most reviewed restaurant
    - 3.1 Vectorize the text feature
    - 3.2 Define target variable
    - 3.3 Create train and test datasets
    - 3.4 Get NLP representation of the documents
    - 3.5 Cluster reviews with KMeans
4. Other user cases of clustering
    - 4.1 Different distance/similarity metrics for clusterings
    - 4.2 Cluster restaurants by category information
    - 4.3 Cluster restaurants by restaurant names
    - 4.4 Cluster restaurants by tips

#### Read in the dataset

In [94]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
plt.style.use("ggplot")

In [95]:
df = pd.read_csv('dataset/last_2_years_restaurant_reviews.csv')

In [96]:
df.head(2)

Unnamed: 0,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
0,Delmonico Steakhouse,"['Cajun/Creole', 'Steakhouses', 'Restaurants']",4.0,0,2016-03-31,0,6SgvNWJltnZhW7duJgZ42w,5,This is mine and my fiancé's favorite steakhou...,0,oFyOUOeGTRZhFPF9uTqrTQ
1,Delmonico Steakhouse,"['Cajun/Creole', 'Steakhouses', 'Restaurants']",4.0,0,2016-02-10,0,UxFpgng8dPMWOj99653k5Q,5,Truly Fantastic! Best Steak ever. Service was...,0,aVOGlN9fZ-BXcbtj6dbf0g


## 1. Data preprocessing 

### 1.1 Filter positive reviews 

#### Here I am only interested in perfect (5 stars) rating reviews

In [98]:
df_positive = df[df['stars'] == 5]

In [100]:
len(df_positive)

210559

### 1.2 Define feature variables

#### Here feautre variable is the text of the review 

In [101]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df_positive['text'].values

In [102]:
print(len(documents))

210559


### 1.3 Create training dataset and test dataset

In [103]:
from sklearn.cross_validation import train_test_split

In [104]:
# X: documents
# Y: targets
# Now split the data to training set 80% and test set 20%
documents_train, documents_test = train_test_split(documents, test_size = 0.2, random_state = 42)

In [105]:
len(documents_train), len(documents_test)

(168447, 42112)

### 1.4 Get NLP representation of the documents

#### Fit TfidfVectorizer with training data only, then tranform all the data to tf-idf

In [106]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [107]:
# Create TfidfVectorizer, and name it vectorizer
# choose a reasonable max_features, e.g. 1000 to fast the computation speed
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 1000)

In [108]:
# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

In [109]:
vectors_train.shape

(168447, 1000)

In [110]:
# Get the vocab of your tfidf
words = vectorizer.get_feature_names()

In [111]:
# Use the trained model to transform all the reviews
vectors_documents = vectorizer.transform(documents).toarray()

## 2 Cluster reviews with KMeans

### 2.1 Fit k-means clustering with the training vectors and apply it on all the data

In [112]:
from sklearn.cluster import KMeans

kmeans = KMeans()

kmeans.fit(vectors_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

### 2.2 Make predictions on all data

In [113]:
assigned_cluster = kmeans.predict(vectors_documents)

### 2.3 Inspect the centroids

- Description: To find out what "topics" Kmeans has discovered we must inspect the centroids. Print out the centroids of the Kmeans clustering. These centroids are simply a bunch of vectors.  To make sense of them we need to map these vectors back into our 'word space'.  Think of each feature/dimension of the centroid vector as representing the "average" review or the average occurances of words for that cluster.
- Solution: Find the top 10 features (words) within each cluster. 
- Steps: 
    - (1) Sort each centroid vector to find the top 10 features 
    - (2) Go back to our vectorizer object to find out what words each of these features corresponds to

In [114]:
# Default of kmeans uses 8 clusters
print ('number of clusters:' + str(kmeans.cluster_centers_.shape))

number of clusters:(8, 1000)


In [115]:
# print top 10 words of each cluster centers
# step (1) Sort each centroid vector to find the top 10 features
top_centroids = kmeans.cluster_centers_.argsort()[:, -1:-11:-1]
print("top 10 features for each cluster:")
# step (2) Go back to our vectorizer object to find out what words each of these features corresponds to
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

top 10 features for each cluster:
0: burger, fries, burgers, good, great, place, cheese, best, shake, food
1: food, good, place, best, vegas, amazing, delicious, time, service, just
2: excellent, service, food, great, place, good, vegas, definitely, restaurant, best
3: love, place, food, great, good, service, amazing, best, friendly, staff
4: pizza, great, crust, place, good, best, vegas, cheese, service, delicious
5: great, food, service, place, amazing, good, awesome, friendly, staff, definitely
6: sushi, place, roll, rolls, great, fresh, ayce, service, best, fish
7: chicken, fried, good, food, rice, place, delicious, great, ordered, amazing


#### We will try different k, because:
    - Using eight clusters (default setting in kmeans), I found that several clusters are kind of similar to each other, such as in Cluster 0 and 7 might signify fast food restaurants. 
    - The rest of clusters have some significant meanings such as in Cluster 6, it mainly tell about Japanese restaurants.

### 2.4 Try using different k (clusters)

#### How does the top features change after using 5 clusters?
- Using five clusters, the difference among clusters stands out more significant than using eight clusters. Each cluster now has an unique topic, such as Cluster 0 is surrounding with the topic of chicken, Cluster 2 is relating to Japanese food, Cluster 3 is relating to the pizza, and Cluster 4 is mainly about service aspect in vegas.
- However, the top features using five clusters seem to be highly overlapped with the default method. In fact, it's a good strategy to narrow down overlapped clusters into denser clusters.

In [116]:
# Find the top 10 features for each cluster.
kmeans = KMeans(n_clusters = 5)
kmeans.fit(vectors_train)
assigned_cluster = kmeans.predict(vectors_documents)

top_centroids = kmeans.cluster_centers_.argsort()[:, -1:-11:-1]
print("top 10 features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ",".join(words[i] for i in centroid)))

top 10 features for each cluster:
0: good,food,really,place,service,great,nice,love,chicken,time
1: place,food,best,vegas,delicious,amazing,time,love,ve,just
2: sushi,place,roll,rolls,great,fresh,ayce,service,best,fish
3: pizza,great,place,crust,good,best,love,service,vegas,cheese
4: great,food,service,place,amazing,awesome,friendly,excellent,staff,definitely


#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [117]:
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, vectors_documents.shape[0])[assigned_cluster==i]
    sample_reviews = np.random.choice(cluster, 1, replace=False)
    print("cluster %d:" % i)
    for review in sample_reviews:
        print("    %s" % df.loc[review]['text'])

cluster 0:
    My friends and I come here every Friday! It is our tradition. :) We love Sushi Kaya for many, many reasons. Their sushi is so fresh and cold. We usually do all-you-can-eat, and start off with miso soup and seaweed salad. I love their spicy tuna, sashimi, yellowtail, and albacore. They have a nice fish-to-rice too ratio since some places give way too much rice! I love their mochi as a dessert. The service is usually great every time. It doesn't take long when we order for our AYCE sushi.

This place gets packed on the weekends, and for good reason! I would wait the 20 minutes or make it easier for yourself, and call ahead of time to make reservations.
cluster 1:
    Oh how I miss Hawaii after coming here! If you're looking for good and cheap food, this is the place to be. We ordered a furikake chicken and that was more than enough for two people. We had a ton of leftover. The chicken has so much flavor! We also ordered a half order of avocado poke and it was just enough f

## 3. Cluster all the reviews of the most reviewed restaurant
- 3.1 Vectorize the text feature
- 3.2 Define the target variable
- 3.3 Create train and test datasets
- 3.4 Get NLP representation of the documents
- 3.5 Cluster reviews with KMean

#### Let's find the most reviewed restaurant and analyze its reviews

In [118]:
# Find the business who got most reviews, get your filtered df, name it df_top_restaurant
df_top_restaurant = df['name'].value_counts().index[0]
df_top_restaurant

'Hash House A Go Go'

#### We can also load restaurant profile information from the business dataset

In [119]:
# Load business dataset (optional)
# Take a look at the most reviewed restaurant's profile 
df_top_restaurant = df[df['name'] == df_top_restaurant].copy().reset_index()
df_top_restaurant

Unnamed: 0,index,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
0,32737,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2016-06-22,0,psGDwACpn7tFmWm36865fA,4,"There isn't much here for vegetarians, but I h...",0,Y76nS3L426UCz7N_1pUfUQ
1,32738,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-06-19,0,ZY0ym6jDPXCnyzyRKSVTHg,4,"Visiting Las Vegas again, and decided to stop ...",0,SeHCNZeTtVvL1HmKFOLSkQ
2,32739,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-01-13,0,vPFRrO6k6ynH-CgGKJLpPQ,5,This place is as crazy as Las Vegas. The twis...,0,SvpxzDdYOrrI9ntolyNSxQ
3,32740,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2015-08-26,0,DOZWVKN2n4CAp7mtkhxiaw,1,I've eaten at Hash House A Go Go on the strip ...,0,Io0qqdu_PyKfkr8d7F19mg
4,32741,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,1,2017-10-16,0,-UGGkrLKjWMdW2N9l2rb2Q,4,We were told that this was a good place for br...,0,JrILFVrSIRIacx2qTy5tiA
5,32742,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-06-23,0,VvPH04YYZ8RcOimJdZXU7A,4,EDC food. The chicken (2 breasts or thighs? Ca...,0,qyPBg6aUIAM83vbkNJCtSQ
6,32743,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-03-29,0,EdCoN1v8Tv7CtSRe5WLyNg,5,We went there after checking Yelp during our t...,0,PdiutioUdu9q8VhtHdzpVQ
7,32744,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2015-07-19,0,uGDn7km6sXBQ8NlAC3chhg,5,Brunch was amazing! I mean the banana French t...,0,Eq_3Wq22Xjw2mxVln-NALw
8,32745,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-10-23,0,umgRwr9PbF0xOM8p5H4Waw,2,I felt the service was not stellar at all. Wai...,0,Af2xB-Sfv-r0kdwl_FbGzg
9,32746,Hash House A Go Go,"['American (New)', 'Restaurants', 'Breakfast &...",3.5,0,2017-09-30,0,uDf1xM8e9BzwtDwN6rb7IQ,4,I forget what our dish was called but it was t...,0,QpGBJKgosHPz7MBz95NGbA


### 3.1 Vectorize the text feature

In [120]:
# Take the values of the column that contains review text data, save to a variable named "documents_top_restaurant"
documents_top_restaurant = df_top_restaurant['text'].values
documents_top_restaurant.shape

(3620,)

### 3.2 Define target variable

#### Again, we look at perfect (5 stars) and imperfect (1-4 stars) rating

In [121]:
df_top_restaurant['target'] = df_top_restaurant['stars'] == 5
target_top_restaurant = df_top_restaurant['target'].values
target_top_restaurant[:5]

array([False, False,  True, False, False])

#### Check the statistic of the target variable

In [122]:
len(target_top_restaurant), target_top_restaurant.mean(), target_top_restaurant.std()

(3620, 0.42265193370165743, 0.49398104886716776)

### 3.3 Create training dataset and test dataset

In [123]:
from sklearn.cross_validation import train_test_split

In [124]:
# X: documents_top_restaurant
# Y: target
# Now split the data to training set 80% and test set 20%
X_train, X_test, y_train, y_test = train_test_split(
    documents_top_restaurant,
    target_top_restaurant,
test_size = 0.2, random_state = 42) 

### 3.4 Get NLP representation of the documents

In [125]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [126]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 1000)

In [127]:
# Train the model with your training data
vector_train = vectorizer.fit_transform(X_train).toarray()

In [128]:
# Get the vocab of your tfidf
words = vectorizer.get_feature_names()

In [129]:
# Use the trained model to transform the test data
vector_test = vectorizer.transform(X_test).toarray()

In [130]:
# Use the trained model to transform all the data
vector_documents_top_restaurant = vectorizer.transform(documents_top_restaurant).toarray()

### 3.5 Cluster reviews with KMeans

#### Fit k-means clustering on the training vectors and make predictions on all data

In [160]:
from sklearn.cluster import KMeans

# Fit k-means clustering on the train vectors

kmeans = KMeans(n_clusters = 4)

kmeans.fit(vector_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### Make predictions on all your data

In [161]:
# Make predictions on all data
assigned_cluster = kmeans.predict(vector_documents_top_restaurant)

#### Inspect the centroids and find the top 10 features for each cluster.

In [162]:
# Find the top 10 features for each cluster.
top_centroids = kmeans.cluster_centers_.argsort()[:, -1:-11:-1]
print("top 10 features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

top 10 features for each cluster:
0: chicken, waffles, fried, sage, bacon, benedict, good, food, place, huge
1: food, minutes, wait, time, just, service, good, took, order, table
2: hash, good, breakfast, food, house, eggs, pancake, place, potatoes, huge
3: great, food, portions, place, service, huge, wait, good, vegas, amazing


#### Summary:
- Using four clusters, the difference among clusters stands out significantly and each cluster now has an unique topic, shows different aspects that customers care about:
    - Cluster 0 is surrounding with the topic of food, like chicken and waffles. 
    - Cluster 1 is surrounding with the topic of waiting time and service.
    - Cluster 2 is relating to the breakfast, like eggs and pancake. 
    - Cluster 3 is mainly about the taste and nutritional value.

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [164]:
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, vector_documents_top_restaurant.shape[0])[assigned_cluster==i]
    sample_reviews = np.random.choice(cluster, 1, replace=False)
    print("cluster %d:" % i)
    for review in sample_reviews:
        print("    %s" % df_top_restaurant.loc[review]['text'])

cluster 0:
    While being seated, I saw the popular Chicken and Waffles being delivered to a neighboring table. The presentation was so fabulous, I felt I had to try it! It was quite tasty, and HUGE! ( Share if you can, or plan on a 'to go' box if you have somewhere to keep the leftovers!) 4 good sized waffles, bacon strips layered about, and two ample boneless fried chicken breasts on top...YUM! My husband had the steel cut oatmeal with fruit so he could help me with my endeavor, but his turned out to be plentiful, also! The oatmeal had banana, apple, blueberries, and finely chopped mango on top.  The restaurant itself was spacious with a comfortable theme. There are numerous autographed menus on display of current celebrities, sports greats, & political figures...interesting to look at while you wait to get seated. Our only drawback was we felt a bit like a cattle-call as far as service. It was a bit impersonal, & my husband's oatmeal came out significantly earlier than my breakfast

## 4. Other user cases of clustering
- 4.1 Different distance/similarity metrics for clusterings
- 4.2 Cluster restaurants by category information
- 4.3 Cluster restaurants by restaurant names
- 4.4 Cluster restaurants by tips

### 4.1 Different distance/similarity metrics for clusterings

#### Q: How do you compare with Cosine distance or Euclidean distance?

A:
- Cosine takes more computation time in comparison to  Euclidean distance. 
- While the “correlation” distance measures show a better interpretation of the clustered data. 


### 4.2 Cluster restaurants by category information
**Note:** a business may have mutiple categories, e.g. a restaurant can have both "Restaurants" and "Korean"

In [109]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['categories'].values
documents.shape, documents.dtype

((447033,), dtype('O'))

In [110]:
# X: documents_top_restaurant
# Y: target
# Now split the data to training set 80% and test set 20%
documents_train, documents_test = train_test_split(
    documents,test_size = 0.2, random_state = 42) 

In [111]:
# Create TfidfVectorizer, and name it vectorizer, choose a reasonable max_features, e.g. 1000
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 500)

# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

# Get the vocab of your tfidf
words = vectorizer.get_feature_names()

# Use the trained model to transform all the reviews
vectors_documents = vectorizer.transform(documents).toarray()

In [112]:
# Fit k-means clustering on the training vectors and make predictions on all data

kmeans = KMeans(n_clusters=5)

kmeans.fit(vectors_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [113]:
# Make predictions on all data
assigned_cluster = kmeans.predict(vectors_documents)

In [114]:
# Find the top 10 features for each cluster.
top_centroids = kmeans.cluster_centers_.argsort()[:, -1:-11:-1]
print("top 10 features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

top 10 features for each cluster:
0: restaurants, food, mexican, chinese, thai, barbeque, asian, seafood, fusion, japanese
1: bars, nightlife, sushi, restaurants, japanese, american, wine, new, cocktail, seafood
2: pizza, italian, restaurants, sandwiches, wings, chicken, salad, food, seafood, delis
3: breakfast, brunch, american, restaurants, traditional, sandwiches, food, new, buffets, diners
4: american, traditional, new, burgers, restaurants, food, steakhouses, fast, seafood, southern


#### Summary
#### Cluster restaurants from their category information, the difference among clusters is significant. Each cluster now has an unique topic, such as Cluster 0 is mainly about Mexican and Chinese, Cluster 1 is Japanese, Cluster 2 is Italian,  Cluster 3 is American breakfast, and Cluster 4 is American(Traditional) in vegas.

#### Here we defined the most representative restaurant as the one with most review comments in each cluster

In [115]:
for i in range(kmeans.n_clusters):
    cluster = df.iloc[assigned_cluster == i]
    print("cluster %d representative restaurant: %s" % (i, cluster['name'].value_counts().index[0]))

cluster 0 representative restaurant: Gangnam Asian BBQ Dining
cluster 1 representative restaurant: Lotus of Siam
cluster 2 representative restaurant: Secret Pizza
cluster 3 representative restaurant: Hash House A Go Go
cluster 4 representative restaurant: Gordon Ramsay BurGR


### 4.3 Cluster restaurants by restaurant names

#### If we cluster categories from business entities, we are trying to find the similarity between restaurant names

In [103]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents_name = df['name'].values
documents_name.shape, documents_name.dtype

((447033,), dtype('O'))

In [104]:
# X: documents_top_restaurant
# Y: target
# Now split the data to training set 80% and test set 20%
documents_name_train, documents_name_test = train_test_split(
    documents_name,test_size = 0.2, random_state = 42) 

In [105]:
# Create TfidfVectorizer, and name it vectorizer, choose a reasonable max_features, e.g. 1000
vectorizer_name = TfidfVectorizer(stop_words = 'english', max_features = 500)

# Train the model with your training data
vectors_train_name = vectorizer_name.fit_transform(documents_train).toarray()

# Get the vocab of your tfidf
words_name = vectorizer_name.get_feature_names()

# Use the trained model to transform all the reviews
vectors_documents_name = vectorizer_name.transform(documents).toarray()

In [106]:
# Fit k-means clustering on the training vectors and make predictions on all data

kmeans_name = KMeans(n_clusters=5)

kmeans_name.fit(vectors_train_name)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [107]:
# Make predictions on all data
assigned_cluster = kmeans_name.predict(vectors_documents_name)

In [108]:
# Find the top 10 features for each cluster.
top_n = 10
top_centroids = kmeans_name.cluster_centers_.argsort()[:, -1:-(top_n+1):-1]
print("top 10 features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words_name[i] for i in centroid)))

top 10 features for each cluster:
0: restaurants, food, american, mexican, burgers, chinese, new, traditional, fast, seafood
1: japanese, sushi, bars, restaurants, fusion, asian, ramen, noodles, seafood, poke
2: bars, nightlife, american, restaurants, wine, new, cocktail, sports, traditional, mexican
3: breakfast, brunch, american, restaurants, traditional, sandwiches, food, new, buffets, diners
4: pizza, italian, restaurants, sandwiches, wings, salad, chicken, food, seafood, american


#### We notice the most used business names are very straight forword, telling the major business the entities are running.
#### While I don't think these clusters are meaningful in distinguishing each other.

### 4.4 Cluster restaurants by tips

#### As we have data "tip.json", we can cluster the tips business entities to customers, to see whether different business entities emphasis different aspects of their business. 

In [106]:
import json
import pandas as pd
import numpy as np

file_business, file_checkin, file_review, file_tip, file_user = [
    'dataset/business.json',
    'dataset/checkin.json',
    'dataset/review.json',
    'dataset/tip.json',
    'dataset/user.json'
]

In [107]:
with open(file_tip) as f:
    df_tip = pd.DataFrame(json.loads(line) for line in f)
df_tip.head(10)

Unnamed: 0,business_id,date,likes,text,user_id
0,tJRDll5yqpZwehenzE2cSg,2012-07-15,0,Get here early enough to have dinner.,zcTZk7OG8ovAmh_fenH21g
1,jH19V2I9fIslnNhDzPmdkA,2015-08-12,0,Great breakfast large portions and friendly wa...,ZcLKXikTHYOnYt5VYRO5sg
2,dAa0hB2yrnHzVmsCkN4YvQ,2014-06-20,0,Nice place. Great staff. A fixture in the tow...,oaYhjqBbh18ZhU0bpyzSuw
3,dAa0hB2yrnHzVmsCkN4YvQ,2016-10-12,0,Happy hour 5-7 Monday - Friday,ulQ8Nyj7jCUR8M83SUMoRQ
4,ESzO3Av0b1_TzKOiqzbQYQ,2017-01-28,0,"Parking is a premium, keep circling, you will ...",ulQ8Nyj7jCUR8M83SUMoRQ
5,k7WRPbDd7rztjHcGGkEjlw,2017-02-25,0,Homemade pasta is the best in the area,ulQ8Nyj7jCUR8M83SUMoRQ
6,k7WRPbDd7rztjHcGGkEjlw,2017-04-08,0,"Excellent service, staff is dressed profession...",ulQ8Nyj7jCUR8M83SUMoRQ
7,SqW3igh1_Png336VIb5DUA,2016-07-03,0,Come early on Sunday's to avoid the rush,ulQ8Nyj7jCUR8M83SUMoRQ
8,KNpcPGqDORDdvtekXd348w,2016-01-07,0,Love their soup!,ulQ8Nyj7jCUR8M83SUMoRQ
9,KNpcPGqDORDdvtekXd348w,2016-05-22,0,Soups are fantastic!,ulQ8Nyj7jCUR8M83SUMoRQ


In [109]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df_tip['text'].values
documents.shape, documents.dtype

((1098325,), dtype('O'))

In [110]:
# Now split the data to training set and test set
# Now your data is smaller, you can use a typical "test_size", e.g. 0.3-0.7
documents_train, documents_test = train_test_split(
    documents,
test_size = 0.7, random_state = 42) 

In [111]:
# Create TfidfVectorizer, and name it vectorizer, choose a reasonable max_features, e.g. 1000
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 500)

# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

# Get the vocab of your tfidf
words = vectorizer.get_feature_names()

# Use the trained model to transform all the reviews
vectors_documents = vectorizer.transform(documents).toarray()

In [112]:
# Fit k-means clustering on the training vectors and make predictions on all data

kmeans = KMeans(n_clusters=5)

kmeans.fit(vectors_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [114]:
# Make predictions on all data
assigned_cluster = kmeans.predict(vectors_documents)

In [115]:
# Find the top 10 features for each cluster.
top_n = 10
top_centroids = kmeans.cluster_centers_.argsort()[:, -1:-(top_n+1):-1]
print("top 10 features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

top 10 features for each cluster:
0: great, food, service, place, staff, friendly, love, atmosphere, amazing, prices
1: place, love, time, amazing, food, service, try, don, delicious, like
2: awesome, food, service, place, great, staff, love, friendly, good, best
3: best, town, ve, place, vegas, food, pizza, service, love, hands
4: good, food, service, great, place, really, nice, pretty, friendly, prices


####  We notice that almost all business entities are using positive words in their tips, thus these clusters are not meaningful in distinguishing each other.