# Yelp Data Challenge - Clustering and PCA

BitTiger DS501

Jun 2017

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [4]:
df = pd.read_csv('last_2_years_restaurant_reviews_from_else.csv')

In [5]:
df.head()

| Unnamed: 0 | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Steakhouses', 'Cajun/Creole', 'Restaurants'] | 4.0 | 1 | 2016-05-17 | 0 | 0Qc1THNHSapDL7cv-ZzW5g | 5 | What can I say.. Wowzers! Probably one of the ... | 0 | 4LxKRRIikhr65GfPDW626w |
| 1 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Steakhouses', 'Cajun/Creole', 'Restaurants'] | 4.0 | 0 | 2017-01-20 | 0 | L8lo5SKXfZRlbn1bpPiC9w | 5 | Went here for guys weekend. Unbelievable. Ravi... | 0 | nT8zgjoc-PbdBoQsFEXFLw |
| 2 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Steakhouses', 'Cajun/Creole', 'Restaurants'] | 4.0 | 52 | 2016-09-25 | 30 | 6eUT3IwwWPP3CZkAhxqOIw | 5 | One word my friends: tableside!!! Yes, tablesi... | 56 | 7RlyCglsIzhBn081inwvcg |
| 3 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Steakhouses', 'Cajun/Creole', 'Restaurants'] | 4.0 | 1 | 2017-02-12 | 0 | 3cnTdE45VrsS0o4cVhfGog | 3 | Located inside my favorite hotel Venetian, Del... | 1 | rOIrilMC7VFwFVBeQNiKMw |
| 4 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Steakhouses', 'Cajun/Creole', 'Restaurants'] | 4.0 | 0 | 2016-10-30 | 0 | tYrSbjX3QgZGBZuQ3n8g6w | 5 | After the most incredible service, delicious m... | 2 | PiWlV_UC_-SXqyxQM9fAtw |


## 1. Cluster the review text data for all the restaurants

### Define your feature variable, here the text of the review

In [3]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents=df['text'].values

### Define your target variable (any categorical variable that may be meaningful)

#### For example, I am interested in perfect (5 stars) and imperfect (1-4 stars) rating

In [4]:
# Make a column and take the values, save to a variable named "target"
df['target']=df['stars']>4


In [5]:
target=df['target'].values.astype(int)

In [6]:
target[:10]

array([ True,  True,  True, False,  True,  True, False, False,  True, False], dtype=bool)

#### You may want to look at the statistics of the target variable

In [7]:
# To be implemented
target.mean(), target.std()

(0.45747874965725255, 0.49818866232511694)

### Create training dataset and test dataset

In [8]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split



In [9]:
# documents is your X, target is your y
# Now split the data to training set and test set
# You may want to start with a big "test_size", since a large training set can easily crash your laptop.

document_train, document_test, target_train, target_test=train_test_split(documents, target, test_size=0.95, random_state=42)

In [10]:
document_train.size

21882
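
To make the effect of a large `test_size` concrete, here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` are illustrative stand-ins, not the Yelp reviews, and the sketch assumes a scikit-learn version where `train_test_split` is imported from `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 "documents" and binary labels
X_toy = np.array(['review %d' % i for i in range(1000)])
y_toy = np.arange(1000) % 2

# test_size=0.95 keeps only 5% of the rows for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.95, random_state=42)

print(X_tr.size, X_te.size)
```

The return order is X train, X test, y train, y test, so the left-hand side of the assignment has to follow that order.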

### Get NLP representation of the documents

#### Fit TfidfVectorizer with training data only, then transform all the data to tf-idf
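
The fit-on-train, transform-everything pattern can be sketched on a toy corpus. The documents and the names `train_docs`, `all_docs`, and `toy_vectorizer` below are made up for illustration; the point is only that the vocabulary is learned once and then reused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the review text
train_docs = ["great steak and great service", "slow service and cold food"]
all_docs = train_docs + ["amazing sushi and friendly service"]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_train = toy_vectorizer.fit_transform(train_docs)   # learns vocabulary + idf
toy_all = toy_vectorizer.transform(all_docs)           # reuses that vocabulary

# Words never seen during fit (e.g. "sushi") are simply ignored
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(toy_train.shape, toy_all.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.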

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# Create TfidfVectorizer (named "vectors" here), choose a reasonable max_features, e.g. 1000
vectors=TfidfVectorizer(stop_words='english', max_features=1000)

In [13]:
# Train the model with your training data
vectors_train=vectors.fit_transform(document_train).toarray()

In [36]:
# Get the vocab of your tfidf
# (on scikit-learn >= 1.0 use vectors.get_feature_names_out() instead)
words=vectors.get_feature_names()

[u'00',
 u'10',
 u'100',
 u'11',
 u'12',
 u'15',
 u'16',
 u'18',
 u'20',
 u'24',
 u'25',
 u'30',
 u'40',
 u'45',
 u'50',
 u'95',
 u'99',
 u'able',
 u'absolutely',
 u'accommodating',
 u'actually',
 u'add',
 u'added',
 u'affordable',
 u'afternoon',
 u'ago',
 u'ahead',
 u'al',
 u'amazing',
 u'ambiance',
 u'ambience',
 u'american',
 u'appetizer',
 u'appetizers',
 u'apple',
 u'area',
 u'aren',
 u'arrived',
 u'asada',
 u'asian',
 u'ask',
 u'asked',
 u'asking',
 u'ate',
 u'atmosphere',
 u'attention',
 u'attentive',
 u'attitude',
 u'authentic',
 u'available',
 u'average',
 u'avocado',
 u'avoid',
 u'away',
 u'awesome',
 u'awful',
 u'ayce',
 u'bacon',
 u'bad',
 u'bag',
 u'baked',
 u'banana',
 u'bar',
 u'barely',
 u'bartender',
 u'bartenders',
 u'based',
 u'basic',
 u'basically',
 u'bathroom',
 u'bbq',
 u'bean',
 u'beans',
 u'beautiful',
 u'beef',
 u'beer',
 u'beers',
 u'believe',
 u'bellagio',
 u'belly',
 u'benedict',
 u'best',
 u'better',
 u'big',
 u'birthday',
 u'bit',
 u'bite',
 u'black',
 ...

In [14]:
# Use the trained model to transform all the reviews
vectors_documents=vectors.transform(documents).toarray()

### Cluster reviews with KMeans

#### Fit k-means clustering with the training vectors and apply it on all the data

In [15]:
# To be implemented
from sklearn.cluster import KMeans
kmeans=KMeans()
kmeans.fit(vectors_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### Make predictions on all your data

In [None]:
# To be implemented
assigned_clusters=kmeans.predict(vectors_documents)

#### Inspect the centroids
To find out what "topics" KMeans has discovered, we must inspect the centroids. Print out the centroids of the KMeans clustering.

These centroids are simply a bunch of vectors. To make any sense of them we need to map these vectors back into our 'word space'. Think of each centroid vector as the "average" review of its cluster: each feature/dimension holds the average tf-idf weight of one word across the reviews in that cluster.

Each dimension represents a word. If a dimension has a high average value, that word occurs often (or carries a lot of weight) in the cluster.
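
As a minimal sketch of the "centroid as average review" idea, the toy example below checks that each centroid equals the mean of the rows assigned to it. The matrix `toy_X` and the word list `toy_words` are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: rows are documents, columns are word weights
toy_words = ['pizza', 'sushi', 'service']
toy_X = np.array([[0.9, 0.0, 0.1],
                  [0.8, 0.1, 0.2],
                  [0.0, 0.9, 0.1],
                  [0.1, 0.8, 0.3]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)

for k, center in enumerate(toy_km.cluster_centers_):
    members = toy_X[toy_km.labels_ == k]
    # On this toy data each centroid matches the mean of its member rows
    print(k, np.allclose(center, members.mean(axis=0)),
          dict(zip(toy_words, center.round(2))))
```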

In [None]:
# To be implemented
pass

#### Find the top 10 features for each cluster.
For topics we are only really interested in the most prominent words, i.e. the features/dimensions with the largest values in the centroid. Print out the top ten words for each centroid.

* Sort each centroid vector to find the top 10 features
* Go back to your vectorizer object to find out what words each of these features corresponds to.
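
The two steps above can be sketched with plain NumPy. The centroid matrix `toy_centers` and vocabulary `toy_vocab` are hypothetical; in the notebook they would come from the fitted KMeans model and the vectorizer.

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters x 5 features
toy_centers = np.array([[0.1, 0.7, 0.0, 0.4, 0.2],
                        [0.5, 0.0, 0.9, 0.1, 0.3]])
toy_vocab = np.array(['bar', 'pizza', 'sushi', 'cheese', 'beer'])

n_top = 3
# argsort is ascending, so reverse each row and keep the first n_top columns
top_idx = np.argsort(toy_centers, axis=1)[:, ::-1][:, :n_top]

for k, idx in enumerate(top_idx):
    print('{}: {}'.format(k, ', '.join(toy_vocab[idx])))
```

Note that a slice such as `[:, -1:-n_top:-1]` returns only `n_top - 1` columns, because the stop index of a slice is exclusive.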


In [None]:
# To be implemented
print ('cluster centers:')
print (kmeans.cluster_centers_.shape)

n_feat=10
# argsort is ascending: reverse each row, then keep the first n_feat indices
top_centroids=kmeans.cluster_centers_.argsort()[:,::-1][:,:n_feat]

In [None]:
print ('top features for each cluster:')

for num, centroid in enumerate(top_centroids):
    print ('{}:{}' .format(num, ','.join(words[i] for i in centroid)))

#### Try different k
If you set k to a different number, how do the top features change?
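
Besides reading the top words, one possible way to compare values of k is the within-cluster sum of squares that scikit-learn exposes as `inertia_`. The sketch below uses made-up 2D blobs (`toy_points`), not the review vectors, and is only meant to show the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs in 2D
rng = np.random.RandomState(0)
blob_centers = np.array([[0, 0], [5, 5], [0, 5]])
toy_points = np.vstack([c + 0.3 * rng.randn(50, 2) for c in blob_centers])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_points)
    inertias[k] = km.inertia_   # within-cluster sum of squared distances
    print(k, round(km.inertia_, 1))
```

On data like this the inertia drops sharply up to the true number of blobs and flattens afterwards. On sparse tf-idf vectors the curve is usually much less clear-cut, so treat it as a rough guide.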

In [None]:
# To be implemented
kmeans=KMeans(n_clusters=6)
kmeans.fit(vectors_train)

In [None]:
# predict all documents

assigned_clusters=kmeans.predict(vectors_documents)

# select top 10 features for each centroid
n_feat=10
top_centroids=kmeans.cluster_centers_.argsort()[:,::-1][:,:n_feat]

print ('top features for each cluster')
for num, centroid in enumerate(top_centroids):
    print('{}: {}'.format(num, ','.join(words[i] for i in centroid)))

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.
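
One way to draw a couple of reviews per cluster is to collect the row positions of each cluster and sample from them. This is a sketch on a hypothetical DataFrame `toy_df` with made-up labels `toy_labels`; the column names mirror the notebook's `stars` and `text`.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews and cluster labels
toy_df = pd.DataFrame({
    'stars': [5, 1, 4, 2, 5, 3],
    'text': ['great', 'awful', 'good', 'meh', 'superb', 'okay']})
toy_labels = np.array([0, 1, 0, 1, 0, 1])

rng = np.random.RandomState(42)
samples = {}
for k in np.unique(toy_labels):
    positions = np.flatnonzero(toy_labels == k)   # row positions in cluster k
    picked = rng.choice(positions, size=2, replace=False)
    samples[k] = picked
    print('Cluster {}'.format(k))
    print(toy_df.iloc[picked][['stars', 'text']])
```

Because the positions are integer offsets, `iloc` is used for the lookup; `loc` would only be equivalent when the index is the default 0..n-1 range.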

In [None]:
# To be implemented
for i in range(kmeans.n_clusters):
    # row positions of the reviews assigned to cluster i
    cluster=np.arange(0,vectors_documents.shape[0])[assigned_clusters==i]
    sample_reviews=np.random.choice(cluster, 2,replace=False)
    print ('='*10)
    print ('Cluster {}'.format(i))
    for review_index in sample_reviews:
        # rating and review text on one line
        print ('{} {}'.format(df.iloc[review_index]['stars'], df.iloc[review_index]['text']))

## 2. Cluster all the reviews of the most reviewed restaurant
Let's find the most reviewed restaurant and analyze its reviews

In [None]:
# Find the business with the most reviews, filter df to it, and name the result df_top_restaurant

most_reviewed_restaurant=df['business_id'].value_counts().index[0]

# select rows of most reviewed restaurant
# .copy() makes a deep copy, so changing df_top_restaurant does not affect the original df

df_top_restaurant=df[df['business_id']==most_reviewed_restaurant].copy().reset_index()
df_top_restaurant.head()

We can also load restaurant profile information from the business dataset (optional)
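
The loading cell in this section parses one JSON object per line. As a small sketch of that layout, the example below reads a made-up two-line sample from memory; the field values are invented, and `pd.read_json(..., lines=True)` is shown as an alternative to parsing each line by hand.

```python
import io
import json
import pandas as pd

# Hypothetical two-line sample in the same one-object-per-line layout
raw = io.StringIO(
    '{"business_id": "a1", "name": "Cafe A", "stars": 4.0}\n'
    '{"business_id": "b2", "name": "Bar B", "stars": 3.5}\n')

# Option 1: parse each line yourself
toy_business = pd.DataFrame([json.loads(line) for line in raw])

# Option 2: let pandas read JSON Lines directly
raw.seek(0)
toy_business_2 = pd.read_json(raw, lines=True)

print(toy_business[toy_business['business_id'] == 'b2'])
```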

In [None]:
# Load business dataset (optional)
# Take a look at the most reviewed restaurant's profile (optional)

import json
import pandas as pd

# The business file holds one JSON object per line
with open('sample_data/business.json') as f:
    df_business=pd.DataFrame(json.loads(line) for line in f)

In [None]:
df_business[df_business['business_id']==most_reviewed_restaurant]

In [None]:
df_business[df_business['business_id']==most_reviewed_restaurant]['categories'].values
df_business[df_business['business_id']==most_reviewed_restaurant]['attributes'].values 

### Vectorize the text feature

In [None]:
# Take the values of the column that contains review text data, save to a variable named "documents_top_restaurants"
documents_top_restaurants=df_top_restaurant['text'].values

documents_top_restaurants[:3]

### Define your target variable (for later classification use)

#### Again, we look at perfect (5 stars) and imperfect (1-4 stars) rating

In [None]:
# To be implemented

df_top_restaurant['target']=df_top_restaurant['stars']>4
target_top_restaurants=df_top_restaurant['target'].values.astype(int)
target_top_restaurants

#### Check the statistics of the target variable

In [None]:
# To be implemented
target_top_restaurants.mean()

In [None]:
documents_top_restaurants.shape, target_top_restaurants.shape

### Create training dataset and test dataset

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [None]:
# documents_top_restaurants is your X, target_top_restaurants is your y
# Now split the data to training set and test set
# Now your data is smaller, you can use a typical "test_size", e.g. 0.3-0.7
# train_test_split returns X_train, X_test, y_train, y_test (in that order)
documetns_top_restaurants_train, documetns_top_restaurants_test, target_top_restaurants_train, target_top_restaurants_test=train_test_split(
    documents_top_restaurants,target_top_restaurants, test_size=0.4, random_state=42)

### Get NLP representation of the documents

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer=TfidfVectorizer(stop_words='english', max_features=1000)

In [None]:
# Train the model with your training data
vectors_train=vectorizer.fit_transform(documetns_top_restaurants_train).toarray()

In [None]:
# Get the vocab of your tfidf
# (on scikit-learn >= 1.0 use vectorizer.get_feature_names_out() instead)
words=vectorizer.get_feature_names()

In [None]:
# Use the trained model to transform the test data
vectors_test=vectorizer.transform(documetns_top_restaurants_test).toarray()

In [None]:
# Use the trained model to transform all the data
vectors_documents_top_restaurants=vectorizer.transform(documents_top_restaurants).toarray()

### Cluster reviews with KMeans

#### Fit k-means clustering on the training vectors and make predictions on all data

In [None]:
# To be implemented
from sklearn.cluster import KMeans
kmeans=KMeans(n_clusters=5)
kmeans.fit(vectors_train)

#### Make predictions on all your data

In [None]:
# To be implemented
assigned_clusters=kmeans.predict(vectors_documents_top_restaurants)

#### Inspect the centroids

In [None]:
# To be implemented
print ('cluster centers:')
print (kmeans.cluster_centers_.shape)

#### Find the top 10 features for each cluster.

In [None]:
# To be implemented
n_feat=10
# argsort is ascending: reverse each row, then keep the first n_feat indices
top_centroid=kmeans.cluster_centers_.argsort()[:,::-1][:,:n_feat]

for n,centroid in enumerate(top_centroid):
    print ('{},{}'.format(n,','.join(words[i] for i in centroid)))

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [None]:
# To be implemented
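for i in range(kmeans.n_clusters):
    # row positions of this restaurant's reviews assigned to cluster i
    cluster=np.arange(0,vectors_documents_top_restaurants.shape[0])[assigned_clusters==i]
    sample_reviews=np.random.choice(cluster,2,replace=False)
    print('='*10)
    print('Cluster {}'.format(i))
    for review_index in sample_reviews:
        # look up rows in df_top_restaurant, not the full df
        print('{} {}'.format(df_top_restaurant.iloc[review_index]['stars'], df_top_restaurant.iloc[review_index]['text']))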

In [None]:
# try this one: select index labels with a boolean mask instead of positions

for i in range(kmeans.n_clusters):
    cluster=df_top_restaurant.index[assigned_clusters==i]
    sample_reviews=np.random.choice(cluster,2,replace=False)
    print('='*10)
    print('Cluster {}'.format(i))
    for review_index in sample_reviews:
        print('{} {}'.format(df_top_restaurant.loc[review_index]['stars'], df_top_restaurant.loc[review_index]['text']))

## 3. Use PCA to reduce dimensionality

### Standardize features
Your X_train and X_test
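
As a minimal sketch of standardizing train and test consistently, the example below fits the scaler on the training rows only and reuses those statistics for the test rows. The arrays `toy_train` and `toy_test` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learn mean/std on train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuse train statistics

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1; the test rows generally will not, since they are scaled with the training statistics.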

In [None]:
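# vectors_train / vectors_test: tf-idf vectors of the most reviewed restaurant
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
# To be implemented
# Fit the scaler on the training vectors only, then reuse it for the test vectors
x_train_scaled=scaler.fit_transform(vectors_train)
x_test_scaled=scaler.transform(vectors_test)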


### Use PCA to transform data (train and test) and get principal components

In [None]:
from sklearn.decomposition import PCA

# Let's pick a n_components
n_columns = 50
pca=PCA(n_components=n_columns)
# To be implemented
x_train_pca=pca.fit_transform(x_train_scaled)
x_test_pca=pca.transform(x_test_scaled)


In [None]:
x_train_pca.shape,x_test_pca.shape

### See how much variance (and what percentage of the total) the principal components explain
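
To illustrate how the two attributes relate, here is a small sketch on synthetic data (`toy_data` is made up). With all components kept, the explained variance ratio is each component's variance divided by the sum of all of them, and the cumulative ratio reaches 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 5 features with decreasing spread
rng = np.random.RandomState(0)
toy_data = rng.randn(200, 5) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

toy_pca = PCA(n_components=5).fit(toy_data)

# With every component kept, the variances sum to the total variance
ratio = toy_pca.explained_variance_ / toy_pca.explained_variance_.sum()
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

print(toy_pca.explained_variance_.round(2))
print(cumulative.round(3))
```

When fewer components than features are kept, the ratios sum to less than 1, and the shortfall is the variance left in the discarded components.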

In [None]:
# To be implemented
print (pca.explained_variance_[:10])

In [None]:
# To be implemented
print (pca.explained_variance_ratio_[:10])

### Viz: plot proportion of variance explained with top principal components

For clear display, you may start with plotting <=20 principal components

In [None]:
# To be implemented
n_col_to_display=20
pca_range=np.arange(n_col_to_display)+1
pca_name = ['pca_%s' % i for i in pca_range]
plt.figure(figsize=(10,10))
# bar heights: proportion of variance explained by each of the top components
plt.bar(pca_range,
        pca.explained_variance_ratio_[:n_col_to_display], align='center')
xticks = plt.xticks(pca_range, pca_name, rotation=90)
plt.ylabel('proportion of variance explained')
plt.show()

## Classifying positive/negative reviews with PCA preprocessing

### Logistic Regression Classifier
#### Use standardized tf-idf vectors as features

In [None]:
x_train_scaled.shape,target_top_restaurants_train.shape

In [None]:
x_train_pca.shape,target_top_restaurants_train.shape

In [None]:
# Build a Logistic Regression Classifier, train with standardized tf-idf vectors

from sklearn.linear_model import LogisticRegression

# To be implemented
lrc_model=LogisticRegression()
lrc_model.fit(x_train_scaled,target_top_restaurants_train)


In [None]:
# Get score for training set
lrc_model.score(x_train_scaled,target_top_restaurants_train)

In [None]:
# Get score for test set
lrc_model.score(x_test_scaled,target_top_restaurants_test)

#### Use (Standardized + PCA) tf-idf vectors as features

In [None]:
# Build a Logistic Regression Classifier, train with PCA transformed X

from sklearn.linear_model import LogisticRegression

# To be implemented
lrc_model=LogisticRegression()
lrc_model.fit(x_train_pca,target_top_restaurants_train)

In [None]:
# Get score for training set
lrc_model.score(x_train_pca,target_top_restaurants_train)

In [None]:
# Get score for test set, REMEMBER to use PCA-transformed X!
lrc_model.score(x_test_pca,target_top_restaurants_test)

#### Q: What do you see from the training score and the test score? How do you compare the results from PCA and non-PCA preprocessing?

A: (insert your comments here) overfitting

#### You can plot the coefficients against principal components
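
A possible shape for such a plot, sketched on synthetic data rather than the review vectors: fit PCA, fit a logistic regression on the transformed features, and draw one bar per principal component. All `toy_*` names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends mostly on the first feature
rng = np.random.RandomState(0)
toy_X = rng.randn(300, 6)
toy_y = (toy_X[:, 0] + 0.1 * rng.randn(300) > 0).astype(int)

toy_pca = PCA(n_components=4)
toy_X_pca = toy_pca.fit_transform(toy_X)
toy_lr = LogisticRegression().fit(toy_X_pca, toy_y)

pc_range = np.arange(toy_pca.n_components_) + 1
pc_names = ['pca_%s' % i for i in pc_range]
coefs = toy_lr.coef_.ravel()          # one coefficient per principal component

plt.bar(pc_range, coefs, align='center')
plt.xticks(pc_range, pc_names, rotation=90)
plt.ylabel('logistic regression coefficient')
```

For a binary problem `coef_` has shape `(1, n_components)`, hence the `ravel()` before plotting.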


In [None]:
# To be implemented
pca_range=np.arange(pca.n_components_)+1
pca_name =['pca_%s' % i for i in pca_range]


### Random Forest Classifier
#### Use standardized tf-idf vectors as features

In [None]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# To be implemented
rd_model=RandomForestClassifier(max_depth=None, n_estimators=20, min_samples_leaf=3,random_state=42)
rd_model.fit(x_train_scaled,target_top_restaurants_train)

In [None]:
# Get score for training set
rd_model.score(x_train_scaled,target_top_restaurants_train)

In [None]:
# Get score for test set
rd_model.score(x_test_scaled,target_top_restaurants_test)

#### Use (Standardized + PCA) tf-idf vectors as features

In [None]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# To be implemented
rd_model=RandomForestClassifier(max_depth=None, n_estimators=20, min_samples_leaf=3,random_state=42)
rd_model.fit(x_train_pca,target_top_restaurants_train)

In [None]:
# Get score for training set
rd_model.score(x_train_pca,target_top_restaurants_train)

In [None]:
# Get score for test set, REMEMBER to use PCA-transformed X!
rd_model.score(x_test_pca,target_top_restaurants_test)

#### Q: What do you see from the training result and the test result?

A: (insert your comments here)

#### You can plot the feature importances against principal components
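
As a sketch of where the numbers for such a plot come from, the example below fits a random forest on PCA-transformed synthetic data and ranks the components by `feature_importances_`. The data and `toy_*` names are hypothetical; the ranked values could be passed to `plt.bar` in the same way as the explained-variance plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: the first feature has the largest spread and drives the label
rng = np.random.RandomState(0)
toy_X = rng.randn(300, 6) * np.array([3.0, 1, 1, 1, 1, 1])
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_X_pca = PCA(n_components=4).fit_transform(toy_X)
toy_rf = RandomForestClassifier(n_estimators=20, random_state=42).fit(toy_X_pca, toy_y)

importances = toy_rf.feature_importances_   # one value per principal component
order = np.argsort(importances)[::-1]       # most important first
for i in order:
    print('pca_%d: %.3f' % (i + 1, importances[i]))
```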


In [None]:
# To be implemented
pass

## Extra Credit #1: Can you cluster restaurants from their category information?
Hint: a business may have multiple categories, e.g. a restaurant can have both "Restaurants" and "Korean"
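
One possible starting point, sketched on made-up businesses: turn each category list into a 0/1 vector and cluster those vectors. This assumes the categories are already Python lists; in the review CSV they appear as strings and would need parsing first (for example with `ast.literal_eval`).

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical businesses, each with a list of categories
toy_categories = [
    ['Restaurants', 'Korean', 'Barbeque'],
    ['Restaurants', 'Korean'],
    ['Restaurants', 'Pizza', 'Italian'],
    ['Restaurants', 'Italian'],
]

mlb = MultiLabelBinarizer()
cat_matrix = mlb.fit_transform(toy_categories)   # one 0/1 column per category

cat_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cat_matrix)
print(list(mlb.classes_))
print(cat_km.labels_)
```

A category shared by every business, such as "Restaurants" here, contributes nothing to the distances and could be dropped.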

In [None]:
# To be implemented

## Extra Credit #2: Can you try different distance/similarity metrics for clustering, e.g. Pearson correlation, Jaccard distance, etc.?

Hint: You can take a look at the [scipy](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist) documentation to use other distances
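
A minimal sketch of this idea, assuming binary category vectors as input (`toy_bin` is made up): compute pairwise Jaccard distances with `pdist` and feed them to hierarchical clustering, which accepts precomputed distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical binary category vectors (rows = businesses)
toy_bin = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 0, 1, 1],
                    [0, 1, 1, 1]], dtype=bool)

# Condensed pairwise distance vector; swap in 'correlation', 'cosine', ...
jaccard_d = pdist(toy_bin, metric='jaccard')

# Average-linkage hierarchical clustering on the precomputed distances
tree = linkage(jaccard_d, method='average')
labels = fcluster(tree, t=2, criterion='maxclust')
print(jaccard_d.round(2))
print(labels)
```

Hierarchical clustering is used here because scikit-learn's `KMeans` works with Euclidean distance only.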

#### Q: How do you compare with Cosine distance or Euclidean distance?

In [None]:
# To be implemented

## Extra Credit #3: Can you cluster categories from business entities? What does a cluster mean in this case?
Hint: Think of the example where words can be clustered from the transposed tf-idf matrix.
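
The transposition trick can be sketched on a made-up business-by-category matrix: after transposing, each row describes a category by the businesses that carry it, so categories that tend to co-occur end up close together.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical business-by-category 0/1 matrix
toy_cat_names = ['Korean', 'Barbeque', 'Pizza', 'Italian']
toy_biz_cat = np.array([[1, 1, 0, 0],
                        [1, 1, 0, 0],
                        [1, 0, 0, 0],
                        [0, 0, 1, 1],
                        [0, 0, 1, 1],
                        [0, 0, 0, 1]])

# Transpose so each row is a category described by the businesses carrying it
cat_by_biz = toy_biz_cat.T
cat_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cat_by_biz)

for name, label in zip(toy_cat_names, cat_km.labels_):
    print(name, label)
```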

In [None]:
# To be implemented

## Extra Credit #4: What are the characteristics of each of the clusters? For each cluster, which restaurant can best represent ("define") its cluster?
Hint: how to interpret "best"?
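
One possible reading of "best" is "closest to the centroid". The sketch below picks, for each cluster, the data point nearest to its center using `pairwise_distances_argmin_min`; the restaurant names and feature values are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Hypothetical restaurants with two numeric features each
toy_names = np.array(['A', 'B', 'C', 'D', 'E', 'F'])
toy_feats = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
                      [5.0, 5.0], [5.2, 5.0], [6.0, 5.0]])

rep_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_feats)

# For each centroid, index of and distance to the nearest data point
closest_idx, closest_dist = pairwise_distances_argmin_min(
    rep_km.cluster_centers_, toy_feats)

for k, idx in enumerate(closest_idx):
    print('cluster {}: representative {}'.format(k, toy_names[idx]))
```

Other readings are possible, for example the restaurant with the most reviews in the cluster, so the choice should follow from what the cluster is used for.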

In [None]:
# To be implemented

## Extra Credit #5: Can you think of other use cases where clustering can be used?
Hint: of course you can make use of other yelp dataset. You can try anything you want as long as you can explain it.

In [None]:
# To be implemented