# Yelp Data Challenge - Clustering and PCA

BitTiger DS501

Nov 2017

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [None]:
df = pd.read_csv('data/last_2_years_restaurant_reviews.csv')

In [None]:
df.head()

## 1. Cluster the review text data for all the restaurants

### Define your feature variable: here, the text of the review

In [None]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df.text.values

### Define your target variable (any categorical variable that may be meaningful)

#### For example, I am interested in perfect (5 stars) versus imperfect (1-4 stars) ratings

In [None]:
# Make a column and take the values, save to a variable named "target"
df['perfection'] = df['stars'].apply(lambda x : int(x == 5))
target = df['perfection'].values

#### You may want to look at the statistics of the target variable

In [None]:
# The mean of a 0/1 target is the share of 5-star reviews
print("mean:{}".format(target.mean()))

### Create training dataset and test dataset

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [None]:
# documents is your X, target is your y
# Now split the data to training set and test set
# You may want to start with a big "test_size", since a large training set can easily crash your laptop.
documents_train, documents_test, target_train, target_test = train_test_split(documents, target, 
test_size=0.3, random_state=11)

### Get NLP representation of the documents

#### Fit TfidfVectorizer with training data only, then transform all the data to tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create TfidfVectorizer, name it vectorizer, and choose a reasonable max_features, e.g. 1000
vectorizer = TfidfVectorizer(stop_words = "english", max_features = 5000)

In [None]:
# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

In [None]:
# Get the vocab of your tfidf
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
words = vectorizer.get_feature_names_out()

In [None]:
# Use the trained model to transform all the reviews
vectors_test = vectorizer.transform(documents_test).toarray()
vectors_all = vectorizer.transform(documents).toarray()

### Cluster reviews with KMeans

#### Fit k-means clustering with the training vectors and apply it on all the data

In [None]:
# Pick the number of clusters by silhouette score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k_cluster(train_data, test_data, max_cluster_num):
    '''
    Choose the number of clusters by silhouette score.
    '''
    K = range(2, max_cluster_num + 1)
    kmeans = [KMeans(n_clusters=k, max_iter=2000, random_state=113) for k in K]
    for model in kmeans:
        model.fit(train_data)
    assigned_cluster = [model.predict(test_data) for model in kmeans]
    s_score = [silhouette_score(test_data, labels) for labels in assigned_cluster]
    plt.plot(K, s_score)
    # s_score[0] corresponds to k=2, so shift the argmax by 2
    return np.argmax(s_score) + 2

#### Make predictions on all your data

In [None]:
# To be implemented
n_clusters = choose_k_cluster(vectors_train, vectors_all, 20)
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(vectors_train)

#### Inspect the centroids
To find out what "topics" k-means has discovered, we must inspect the centroids. Print out the centroids of the k-means clustering.

These centroids are simply a bunch of vectors. To make sense of them we need to map them back into our 'word space'. Think of each centroid as the "average" review of its cluster: each feature/dimension holds the average tf-idf weight of one word across the reviews in that cluster.

In [None]:
# To be implemented
top_centroids = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]

#### Find the top 10 features for each cluster.
For topics we are only really interested in the most present words, i.e. features/dimensions with the greatest representation in the centroid.  Print out the top ten words for each centroid.

* Sort each centroid vector to find the top 10 features
* Go back to your vectorizer object to find out what words each of these features corresponds to.
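
The two steps above can be sketched on a toy example. The vocabulary and centroid values below are hypothetical and only illustrate how `argsort` maps the largest centroid entries back to words; they are not taken from the Yelp data.

```python
import numpy as np

# Hypothetical toy vocabulary and centroids (one row per cluster).
vocab = np.array(["burger", "fries", "sushi", "roll", "service"])
centroids = np.array([
    [0.60, 0.50, 0.00, 0.05, 0.20],   # cluster 0
    [0.00, 0.02, 0.70, 0.55, 0.10],   # cluster 1
])

top_n = 3
# argsort is ascending, so reverse each row and keep the first top_n columns.
top_idx = np.argsort(centroids, axis=1)[:, ::-1][:, :top_n]
top_words = [vocab[row].tolist() for row in top_idx]

for k, cluster_words in enumerate(top_words):
    print("%d: %s" % (k, ", ".join(cluster_words)))
```

In the notebook, `vocab` would be the vocabulary returned by the fitted vectorizer and `centroids` would be `kmeans.cluster_centers_`.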


In [None]:
# Map the top feature indices of each centroid back to words
print("top features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

#### Try different k
If you set k to a different number, how do the top features change?

In [None]:
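# Refit with a different k (5 is an arbitrary choice) and compare the top 10 words per cluster
from sklearn.cluster import KMeans

kmeans_k5 = KMeans(n_clusters=5, random_state=113)
kmeans_k5.fit(vectors_train)
top_centroids_k5 = kmeans_k5.cluster_centers_.argsort()[:, -1:-11:-1]
print("top features for each cluster:")
for num, centroid in enumerate(top_centroids_k5):
    print("%d: %s" % (num, ", ".join(words[i] for i in centroid)))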

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [None]:
# Show three random reviews (rating and text) from each cluster
assigned_cluster = kmeans.predict(vectors_all)
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, vectors_all.shape[0])[assigned_cluster == i]
    sample_reviews = np.random.choice(cluster, 3, replace=False)
    print("cluster %d:" % i)
    for reviews in sample_reviews:
        # vectors_all follows the row order of df, so index by position (df.ix was removed from pandas)
        print('rate:%d -' % df.iloc[reviews]['stars'], end=' ')
        print("    %s" % df.iloc[reviews]['text'])

## 2. Cluster all the reviews of the most reviewed restaurant
Let's find the most reviewed restaurant and analyze its reviews

In [None]:
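# Find the business with the most reviews and keep only its reviews as df_top_restaurant
# (assumes each review row carries a 'business_id' column)
top_business_id = df['business_id'].value_counts().idxmax()
df_top_restaurant = df[df['business_id'] == top_business_id].copy()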

We can also load restaurant profile information from the business dataset (optional)

In [None]:
# Load business dataset (optional)
# Take a look at the most reviewed restaurant's profile (optional)
pass

### Vectorize the text feature

In [None]:
# Take the values of the column that contains review text data, save to a variable named "documents_top_restaurant"
documents_top_restaurant = df_top_restaurant.text.values

### Define your target variable (for later classification use)

#### Again, we look at perfect (5 stars) versus imperfect (1-4 stars) ratings

In [None]:
# To be implemented
df_top_restaurant['perfection'] = df['stars'].apply(lambda x : int(x == 5))

#### Check the statistics of the target variable

In [None]:
# To be implemented
target = df_top_restaurant['perfection'].values
print ("mean:{}".format(target.mean()))

### Create training dataset and test dataset

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

In [None]:
# documents_top_restaurant is your X, target is your y
# Now split the data to training set and test set
# Now that your data is smaller, you can use a typical "test_size", e.g. 0.3-0.7
documents_train, documents_test, target_train, target_test = train_test_split(
    documents_top_restaurant, target, test_size=0.3, random_state=11)

### Get NLP representation of the documents

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer = TfidfVectorizer(stop_words = "english", max_features = 5000)

In [None]:
# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

In [None]:
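# Get the vocab of your tfidf
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
words = vectorizer.get_feature_names_out()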

In [None]:
# Use the trained model to transform the test data
vectors_test = vectorizer.transform(documents_test).toarray()

In [None]:
# Use the trained model to transform all the reviews of the top restaurant
vectors_all = vectorizer.transform(documents_top_restaurant).toarray()

### Cluster reviews with KMeans

#### Fit k-means clustering on the training vectors and make predictions on all data

In [None]:
# To be implemented
def choose_k_cluster(train_data,test_data,max_cluster_num):
    '''
    choose number of cluster by silhouette score
    '''
    K=range(2,max_cluster_num+1)
    kmeans = [KMeans(n_clusters=k,max_iter=2000,random_state=113) for k in K]
    [models.fit(train_data) for models in kmeans]
    assigned_cluster = [models.predict(test_data) for models in kmeans]
    s_score = [silhouette_score(test_data, assigned_cluster[i]) for i in range(0,max_cluster_num-1)] 
    plt.plot(K,s_score)
    return np.argmax(s_score)+2

#### Make predictions on all your data

In [None]:
# To be implemented
n_clusters = choose_k_cluster(vectors_train, vectors_all, 20)

#### Inspect the centroids

In [None]:
# To be implemented
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(vectors_train)

#### Find the top 10 features for each cluster.

In [None]:
# To be implemented
top_centroids = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
print ("top features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print ("%d: %s" % (num, ", ".join(words[i] for i in centroid)))

#### Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [None]:
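# Show up to three random reviews (rating and text) from each cluster
assigned_cluster = kmeans.predict(vectors_all)
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, vectors_all.shape[0])[assigned_cluster == i]
    # A small cluster may hold fewer than three reviews
    sample_reviews = np.random.choice(cluster, min(3, len(cluster)), replace=False)
    print("cluster %d:" % i)
    for reviews in sample_reviews:
        # vectors_all follows the row order of df_top_restaurant, so index by position
        print('rate:%d -' % df_top_restaurant.iloc[reviews]['stars'], end=' ')
        print("    %s" % df_top_restaurant.iloc[reviews]['text'])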


## 3. Use PCA to reduce dimensionality

### Standardize features
Standardize your X_train and X_test

In [None]:
from sklearn.preprocessing import StandardScaler

# To be implemented
pass
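
A minimal sketch of the standardization step, assuming random stand-in matrices in place of the tf-idf vectors (`X_train` and `X_test` here are hypothetical toy arrays). The point it illustrates is that the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the tf-idf matrices (rows = reviews, columns = terms).
rng = np.random.RandomState(0)
X_train = rng.rand(100, 8)
X_test = rng.rand(40, 8)

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test rows so no test information leaks in.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The test columns will generally not have exactly zero mean and unit variance, which is expected because they are scaled with the training statistics.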


### Use PCA to transform data (train and test) and get principal components

In [None]:
from sklearn.decomposition import PCA

# Let's pick a n_components
n_components = 50

# To be implemented
pass
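
One possible shape for this step, sketched on hypothetical random data rather than the review vectors. PCA is fitted on the standardized training set and the same projection is applied to the test set; `n_components` is set to 5 here only because the toy data has 30 columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the tf-idf matrices.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 30)
X_test = rng.rand(80, 30)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

n_components = 5  # toy value; the notebook suggests 50 for the real data
pca = PCA(n_components=n_components, random_state=0)
# Fit the projection on the training data, then apply it unchanged to the test data.
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(X_train_pca.shape, X_test_pca.shape)
# Each row of components_ is one principal axis expressed in the original feature space.
print(pca.components_.shape)
```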


### See how much variance (and what percentage of the total) the principal components explain

In [None]:
# To be implemented
pass

In [None]:
# To be implemented
pass
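
A short sketch of how both quantities can be read from a fitted `PCA` object, again on hypothetical random data. `explained_variance_` holds the absolute variance per component and `explained_variance_ratio_` the share of the total variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_scaled = rng.rand(200, 30)   # hypothetical stand-in for standardized features

pca = PCA(n_components=5).fit(X_scaled)
variance = pca.explained_variance_          # absolute variance per component
ratio = pca.explained_variance_ratio_       # fraction of total variance per component
cumulative = np.cumsum(ratio)               # running total of the fractions

for i, (v, r, c) in enumerate(zip(variance, ratio, cumulative), start=1):
    print("PC%d: variance=%.4f, ratio=%.2f%%, cumulative=%.2f%%" % (i, v, 100 * r, 100 * c))
```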

### Viz: plot proportion of variance explained with top principal components

For clear display, you may start with plotting <=20 principal components

In [None]:
# To be implemented
pass
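
One way to draw this plot, assuming a PCA fitted on hypothetical random data with 20 components. Bars show the individual proportions and the line shows the cumulative proportion.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_scaled = rng.rand(200, 30)   # hypothetical stand-in for standardized features

pca = PCA(n_components=20).fit(X_scaled)
ratio = pca.explained_variance_ratio_
component_ids = np.arange(1, len(ratio) + 1)

fig, ax = plt.subplots(figsize=(8, 4))
# One bar per component, plus a line for the running total.
bars = ax.bar(component_ids, ratio, label="individual")
ax.plot(component_ids, np.cumsum(ratio), marker="o", color="black", label="cumulative")
ax.set_xlabel("principal component")
ax.set_ylabel("proportion of variance explained")
ax.set_xticks(component_ids)
ax.legend()
```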

## Classifying positive/negative review with PCA preprocessing

### Logistic Regression Classifier
#### Use standardized tf-idf vectors as features

In [None]:
# Build a Logistic Regression Classifier, train with standardized tf-idf vectors

from sklearn.linear_model import LogisticRegression

# To be implemented
pass

In [None]:
# Get score for training set
pass

In [None]:
# Get score for test set
pass
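
The three cells above could look roughly like the following sketch. It uses a synthetic dataset from `make_classification` as a hypothetical stand-in for the standardized tf-idf vectors and the perfection target, so the scores say nothing about the Yelp data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (tf-idf vectors, perfection target).
X, y = make_classification(n_samples=400, n_features=40, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)

# score() returns mean accuracy for classifiers.
train_acc = clf.score(X_train_scaled, y_train)
test_acc = clf.score(X_test_scaled, y_test)
print("train accuracy: %.3f, test accuracy: %.3f" % (train_acc, test_acc))
```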

#### Use (Standardized + PCA) tf-idf vectors as features

In [None]:
# Build a Logistic Regression Classifier, train with PCA transformed X

from sklearn.linear_model import LogisticRegression

# To be implemented
pass

In [None]:
# Get score for training set
pass

In [None]:
# Get score for test set, REMEMBER to use PCA-transformed X!
pass
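
A sketch of the PCA variant, assuming the same kind of synthetic stand-in data. Wrapping the steps in a scikit-learn pipeline is one design choice among several; it guarantees that the test set goes through the same scaler and PCA projection that were fitted on the training set.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (tf-idf vectors, perfection target).
X, y = make_classification(n_samples=400, n_features=40, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

# Standardize, project onto 10 principal components, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_acc_pca = model.score(X_train, y_train)
test_acc_pca = model.score(X_test, y_test)   # the pipeline applies the fitted PCA to X_test
print("train accuracy: %.3f, test accuracy: %.3f" % (train_acc_pca, test_acc_pca))
```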

#### Q: What do you see from the training score and the test score? How do you compare the results from PCA and non-PCA preprocessing?

A: (insert your comments here)

#### You can plot the coefficients against principal components


In [None]:
# To be implemented
pass

### Random Forest Classifier
#### Use standardized tf-idf vectors as features

In [None]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# To be implemented
pass

In [None]:
# Get score for training set
pass

In [None]:
# Get score for test set
pass

#### Use (Standardized + PCA) tf-idf vectors as features

In [None]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# To be implemented
pass

In [None]:
# Get score for training set
pass

In [None]:
# Get score for test set, REMEMBER to use PCA-transformed X!
pass

#### Q: What do you see from the training result and the test result?

A: (insert your comments here)

#### You can plot the feature importances against principal components


In [None]:
# To be implemented
pass
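
As a rough sketch, a random forest trained on PCA-transformed features exposes one importance value per principal component through `feature_importances_`. The data below is synthetic and hypothetical, so the ranking it prints has no meaning for the review data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (tf-idf vectors, perfection target).
X, y = make_classification(n_samples=300, n_features=30, n_informative=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_pca, y)

# One importance per principal component; the values sum to 1.
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]
for idx in order[:5]:
    print("PC%d: importance=%.3f" % (idx + 1, importances[idx]))
```

Plotting `importances` against the component index (for example with `plt.bar`) shows whether the leading components are also the most useful ones for classification, which is not guaranteed.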

## Extra Credit #1: Can you cluster restaurants from their category information?
Hint: a business may have multiple categories, e.g. a restaurant can have both "Restaurants" and "Korean"

In [None]:
# To be implemented
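
One possible starting point, assuming the categories arrive as comma-separated strings (the sample strings below are hypothetical, and the real dataset may store them differently). Each restaurant becomes a 0/1 vector with one column per category, which k-means can then cluster.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical category strings; the comma-separated format is an assumption.
categories = [
    "Restaurants, Korean, Barbeque",
    "Restaurants, Korean",
    "Restaurants, Pizza, Italian",
    "Restaurants, Italian",
    "Restaurants, Korean, Barbeque",
    "Restaurants, Pizza",
]
category_lists = [[c.strip() for c in row.split(",")] for row in categories]

# One binary column per distinct category.
mlb = MultiLabelBinarizer()
X_cat = mlb.fit_transform(category_lists)

kmeans_cat = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans_cat.fit_predict(X_cat)

print(list(mlb.classes_))
print(labels)
```

Because every row contains "Restaurants", that column carries no information; dropping such near-universal categories before clustering may give cleaner groups.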

## Extra Credit #2: Can you try different distance/similarity metrics for clustering, e.g. Pearson correlation, Jaccard distance, etc.?

Hint: You can take a look at the [scipy](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist) documentation to use other distances

#### Q: How do you compare with Cosine distance or Euclidean distance?

In [None]:
# To be implemented
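
A small sketch of how `pdist` metrics can be swapped in. Since scikit-learn's `KMeans` only works with Euclidean distance, this example substitutes average-linkage hierarchical clustering from scipy, which accepts any precomputed distance. The binary matrix is a hypothetical toy example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical binary matrix: rows = restaurants, columns = categories.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

results = {}
for metric in ("jaccard", "correlation", "cosine", "euclidean"):
    # Jaccard expects boolean input; the other metrics work on floats.
    data = X.astype(bool) if metric == "jaccard" else X.astype(float)
    distances = pdist(data, metric=metric)       # condensed pairwise distance vector
    Z = linkage(distances, method="average")     # hierarchical clustering on those distances
    results[metric] = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(metric, results[metric])
```

On this toy matrix all four metrics produce the same two groups; on real data they can disagree, and comparing the resulting labels is one way to answer the question below.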

## Extra Credit #3: Can you cluster categories from business entities? What does a cluster mean in this case?
Hint: Think of the example where words can be clustered from the transposed tf-idf matrix.

In [None]:
# To be implemented

## Extra Credit #4: What are the characteristics of each of the clusters? For each cluster, which restaurant can best represent ("define") its cluster?
Hint: how to interpret "best"?

In [None]:
# To be implemented
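
One reading of "best" is the data point closest to the cluster centroid. The sketch below illustrates that interpretation on hypothetical 2-D points using `pairwise_distances_argmin_min`; other readings (for example the most reviewed restaurant in the cluster) are equally defensible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Hypothetical feature vectors for six restaurants, forming two obvious groups.
X = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.2, 5.0], [5.1, 5.1],
])

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For each centroid, find the index of the nearest row of X and its distance.
closest, distances = pairwise_distances_argmin_min(kmeans_demo.cluster_centers_, X)
for k, (idx, dist) in enumerate(zip(closest, distances)):
    print("cluster %d: representative row %d (distance %.3f)" % (k, idx, dist))
```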

## Extra Credit #5: Can you think of other use cases for clustering?
Hint: you can of course make use of the other Yelp datasets. You can try anything you want as long as you can explain it.

In [None]:
# To be implemented