# Assignment 2

Zach Aldrich  
Alec Grubbs  
James Krach  
Brian Lakey  
Greg Merchant  
Francisco Sananez  

## Setup

In [1]:
import sklearn
import pandas as pd
import nltk
import numpy as np
from sklearn.cross_validation import train_test_split

In [2]:
yelp = pd.read_csv('Yelp Data Restaurant Reviews Ratings.csv')
yelp['target'] = (yelp['stars']>3).astype(int)

In [3]:
numerics = ['votes_cool','votes_funny','votes_useful',
            'Moderate','Expensive','VeryExpensive','American',
            'Chinese','French','Japanese','Indian','Italian',
            'Greek','Mediterranean','Mexican','Thai','Vietnamese','Others']
yelp_numerics = yelp[numerics]

## Part A

Ignore the text (reviews) and run a classification model with the numeric data (you can use standard methods like logistic regression, k-nearest neighbors or anything else). What is the best accuracy of your model?

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.cross_validation import cross_val_score

In [5]:
lr_numeric = LogisticRegression()
rf_numeric = RandomForestClassifier(n_estimators=20)
ab_numeric = AdaBoostClassifier(RandomForestClassifier(n_estimators=20))

In [6]:
lr_num_cv = cross_val_score(lr_numeric, yelp_numerics, y=yelp['target'], cv=10, n_jobs=-1)
rf_num_cv = cross_val_score(rf_numeric, yelp_numerics, y=yelp['target'], cv=10, n_jobs=-1)

print "Logistic Regression 10-fold cross validated accuracy: " + str(sum(lr_num_cv)/len(lr_num_cv))
print 'Random Forest 10-fold cross validated accuracy: ' + str(sum(rf_num_cv)/len(rf_num_cv))

Logistic Regression 10-fold cross validated accuracy: 0.682335594821
Random Forest 10-fold cross validated accuracy: 0.657338840247


In [7]:
ab_num_cv = cross_val_score(ab_numeric, yelp_numerics, y=yelp['target'], cv=10, n_jobs=-1)
print "Boosted Random Forest 10-fold cross validated accuracy: " + str(sum(ab_num_cv)/len(ab_num_cv))

Boosted Random Forest 10-fold cross validated accuracy: 0.653887589447


The best-performing model we tested was logistic regression, at about 68% accuracy. This is only slightly better than the roughly 65% obtained by simply guessing that all restaurants are high quality (the data is split 65:35).
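The baseline mentioned above can be checked directly. The following is a minimal sketch, not part of the original notebook: `majority_baseline_accuracy` and `example_target` are hypothetical names, and the labels are made up to match the 65:35 split rather than taken from the Yelp data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))

# Hypothetical labels with the same 65:35 split as the data (not the real column).
example_target = [1] * 65 + [0] * 35
baseline = majority_baseline_accuracy(example_target)
print("Majority-class baseline accuracy: " + str(baseline))
```

In the notebook, the same function could be applied to `yelp['target']` to give the number any classifier has to beat.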

## Part B

Perform a supervised classification on a subset of the corpus using the reviews only. You can write your code in Python or R. What accuracy do you get from this text mining exercise?

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
corpus = TfidfVectorizer(stop_words='english',
                       min_df=5)

In [10]:
yelp_text = corpus.fit_transform(yelp['Review'])

In [11]:
boosted_forests_text = AdaBoostClassifier(RandomForestClassifier(n_estimators=20))

In [12]:
text_cv_score = cross_val_score(boosted_forests_text, yelp_text, y=yelp['target'], cv=10, n_jobs=-1)

In [13]:
print "Boosted Random Forests 10-fold cross validated accuracy on reviews: " + str(sum(text_cv_score)/len(text_cv_score))

Boosted Random Forests 10-fold cross validated accuracy on reviews: 0.789592763398


The text-only classifier reaches about 79% accuracy, roughly 14 percentage points above the 65% baseline expected from guessing that all restaurants are highly rated (the data is split 65:35). The gain over the baseline is moderate rather than dramatic, but the model works and is relatively effective.

## Part C

Combine the numeric data and the text classification model (in task B) to create a hybrid model. It is your task to figure out how to do this. Now run this hybrid classification model and compare the results with those in A and B.

In [14]:
from sklearn.cross_validation import train_test_split

In [15]:
X_train, yelp_numText, y_train, yelp_numText_target = train_test_split(pd.concat([yelp_numerics, yelp['Review']], axis=1), yelp['target'])

In [16]:
text_prediction_corpus = TfidfVectorizer(stop_words='english',
                       min_df=5)

In [17]:
text_prediction_dtm = text_prediction_corpus.fit_transform(X_train['Review'])

In [18]:
text_prediction_classifier = LogisticRegression()

In [20]:
text_prediction_model = text_prediction_classifier.fit(text_prediction_dtm, y_train)

In [21]:
yelp_text_smaller = text_prediction_corpus.transform(yelp_numText['Review'])
predictions = text_prediction_model.predict(yelp_text_smaller)

In [22]:
yelp_numText = yelp_numText[numerics].copy()  # copy so the new column is not set on a slice
yelp_numText['text_predictions'] = predictions

In [23]:
ab_hybrid = AdaBoostClassifier(RandomForestClassifier(n_estimators=20))
ab_hybrid_cv = cross_val_score(ab_hybrid, yelp_numText, y=yelp_numText_target, cv=10, n_jobs=-1)
print "Boosted Random Forest 10-fold cross validated hybrid model accuracy: " + str(sum(ab_hybrid_cv)/len(ab_hybrid_cv))

Boosted Random Forest 10-fold cross validated hybrid model accuracy: 0.828395989584


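While the purely numeric models from Part A reach roughly 65% to 68% accuracy and the purely text-based model from Part B reaches about 79%, the hybrid model performs best at about 83%, since it incorporates both kinds of information. This indicates that the text provides additional information that is not captured in the numeric data. Note that the hybrid score is cross-validated only on the held-out rows produced by `train_test_split` (25% of the data by default), so it is not measured on exactly the same rows as the scores in Parts A and B.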
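The hybrid construction used in this part can be sketched on a toy example: a text model is fit on one portion of the rows, and its predictions become an extra column for the remaining rows. This is only an illustration under stated assumptions. The data frame `toy` and its values are made up, only one numeric column is used, and the import path `sklearn.model_selection` assumes a recent scikit-learn release (the notebook itself imports from the older `sklearn.cross_validation`).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny made-up data set standing in for the Yelp frame (hypothetical values).
toy = pd.DataFrame({
    'Review': ['great food', 'terrible service', 'loved it', 'awful meal',
               'great service', 'terrible food', 'loved the meal', 'awful place'],
    'votes_useful': [3, 0, 1, 2, 0, 1, 4, 0],
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})

# Half of the rows train the text model; the other half receive its predictions.
text_rows, hybrid_rows = train_test_split(
    toy, test_size=0.5, random_state=0, stratify=toy['target'])

vectorizer = TfidfVectorizer()
text_model = LogisticRegression()
text_model.fit(vectorizer.fit_transform(text_rows['Review']), text_rows['target'])

# Build the hybrid feature table only from rows the text model never saw.
hybrid_features = hybrid_rows[['votes_useful']].copy()
hybrid_features['text_predictions'] = text_model.predict(
    vectorizer.transform(hybrid_rows['Review']))
print(hybrid_features.shape)
```

Keeping the two portions disjoint matters: if the text model scored rows it was trained on, the `text_predictions` column would look more reliable to the second-stage classifier than it would be on new reviews.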

## Part D

Use unsupervised sentiment analysis on the reviews (with SentiStrength or any other tool) and use the sentiment score to predict high/low rating. Compare and contrast the results of tasks B and D. What can you conclude from your analysis?

In [109]:
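from textblob import TextBlob
from sklearn.metrics import confusion_matrix

def review_to_sent(review):
    # Python 2 only: escape non-printable and non-ASCII bytes before scoring.
    review = review.encode('string-escape')
    blob = TextBlob(review)
    return blob.sentiment.polarity  # polarity lies in [-1, 1]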

In [113]:
yelp['review_sent'] = yelp['Review'].map(review_to_sent)

In [114]:
yelp['sent_classification'] = yelp['review_sent']>0

In [115]:
cm_sentiment_1 = confusion_matrix(yelp['target'], yelp['sent_classification'])

In [116]:
print "Sentiment based classification scheme accuracy: " + str(round((cm_sentiment_1[0][0]+cm_sentiment_1[1][1])/float((sum(cm_sentiment_1[0])+sum(cm_sentiment_1[1]))), 3)) 

Sentiment based classification scheme accuracy: 0.722


The accuracy of the purely sentiment-based classification scheme (about 72%) is lower than that of the bag-of-words model in Part B (about 79%). This is likely because the word sentiments used by TextBlob here, as with SentiStrength, come from a general collection of words and scores that is not specific to this corpus (i.e., Yelp restaurant reviews). The absolute sentiment of a word is often not the same as its sentiment in context. An accurate sentiment-based scheme would need training data related to the corpus being tested, and at that point it would be easier to use a bag-of-words model directly. Part B achieved higher accuracy because it learned from the wording of these particular reviews, but it was also far more computationally intensive than Part D, so there is a clear tradeoff in using a more specific model.
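The decision rule used in this part reduces to thresholding a polarity score at zero. The sketch below illustrates that rule without TextBlob; the function names, the `threshold` parameter, and the polarity values are hypothetical and chosen only for illustration.

```python
def classify_by_polarity(polarities, threshold=0.0):
    """Label a review as high-rated (1) when its polarity exceeds the threshold."""
    return [1 if p > threshold else 0 for p in polarities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

# Hypothetical polarity scores in [-1, 1] and true high/low labels.
example_polarities = [0.6, 0.1, -0.3, 0.05, -0.2]
example_targets = [1, 1, 0, 0, 0]

predicted = classify_by_polarity(example_polarities)
print(accuracy(example_targets, predicted))
```

In this made-up example the fourth review is mildly positive in wording (0.05) but low-rated, so a threshold of zero misclassifies it. A generic lexicon would be expected to make this kind of error whenever the wording of a review is more positive than its rating.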

## Part E

Use unsupervised clustering on the text. Does clustering achieve good separation between high and low rated restaurants? How can you explain the result?

In [24]:
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

In [25]:
cluster_corpus = TfidfVectorizer(stop_words='english',
                                 min_df=5,
                                 max_features=20)

In [26]:
yelp_text_cluster = cluster_corpus.fit_transform(yelp['Review'])

In [27]:
kmeans_cluster = KMeans(n_clusters=2, n_jobs=-1)
clusters = kmeans_cluster.fit(yelp_text_cluster)

In [28]:
cluster_entropy = normalized_mutual_info_score(yelp['target'], clusters.labels_)

In [29]:
print "The normalized entropy of clusters is: " + str(round(cluster_entropy,4))

The normalized entropy of clusters is: 0.0305


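Unsurprisingly, clustering performs very poorly at separating high- and low-rated reviews. The score printed above is the normalized mutual information between the two k-means clusters and the high/low labels; it ranges from 0 (no shared information) to 1 (identical groupings), so a value of about 0.03 means the clusters are nearly unrelated to rating. This is likely because, while the text does contain clusters, nothing forces an unsupervised algorithm to split along the one distinction we care about. Additionally, text data needs at least 100 dimensions to say anything meaningful about the documents (the vectorizer here was limited to 20 features), and measuring distance in high-dimensional space with a simple clustering algorithm is extremely difficult.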
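To make the scale of the score concrete, normalized mutual information can be computed by hand on small label lists. The sketch below is an illustration, not the scikit-learn implementation: it normalizes by the geometric mean of the two entropies, which is one common convention (the default normalizer in scikit-learn has varied across versions), and all function names and label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = float(len(a))
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())

def normalized_mutual_information(a, b):
    """MI divided by the geometric mean of the entropies (assumed convention)."""
    denom = math.sqrt(entropy(a) * entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0

ratings = [1, 1, 0, 0]
aligned_clusters = [0, 0, 1, 1]    # same grouping as the ratings, labels swapped
unrelated_clusters = [0, 1, 0, 1]  # each cluster holds one high and one low review

print(normalized_mutual_information(ratings, aligned_clusters))
print(normalized_mutual_information(ratings, unrelated_clusters))
```

The aligned clustering scores at the top of the scale even though its cluster ids are swapped relative to the ratings, while the unrelated clustering scores at the bottom. The value reported above sits very close to the second case.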

## Part F

What are the top 5 attributes of a restaurant that are associated with (i) high and (ii) low ratings?

In [124]:
ab_hybrid.fit(yelp_numText, yelp_numText_target)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)

In [148]:
feature_importances = []
for i in range(len(yelp_numText.columns)):
    feature_importances.append((yelp_numText.columns[i], ab_hybrid.feature_importances_[i]))
ordered_feature_importances = sorted(feature_importances, key=lambda x: x[1])[::-1]
for i in range(5):
    print ordered_feature_importances[i]

('votes_useful', 0.23786825687164273)
('votes_cool', 0.13511610472757099)
('votes_funny', 0.10750255367262984)
('text_predictions', 0.10395620534159045)
('Moderate', 0.085149482681453309)


In [149]:
boosted_forests_text.fit(yelp_text, y=yelp['target'])

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)

In [153]:
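text_feature_importances = []
# NOTE: this range is the number of hybrid-model columns, not the vocabulary
# size, so only the first few (alphabetically sorted) terms are ranked here.
for i in range(len(yelp_numText.columns)):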
    text_feature_importances.append((corpus.get_feature_names()[i], boosted_forests_text.feature_importances_[i]))
ordered_text_feature_importances = sorted(text_feature_importances, key=lambda x: x[1])
for i in range(5):
    print ordered_text_feature_importances[i]

(u'05', 3.0496556863730767e-07)
(u'02', 4.4864205158504963e-07)
(u'07', 6.1475414279939689e-07)
(u'06', 6.6888815875615668e-07)
(u'04', 9.4404560616427456e-07)


In [155]:
for i in range(5):
    print ordered_text_feature_importances[::-1][i]

(u'00', 0.00035484546925930731)
(u'10', 0.00020311674798398523)
(u'100', 2.9762992173300608e-05)
(u'09', 1.5337025121344874e-05)
(u'03', 7.7907334843650031e-06)


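The text-model features with the lowest and highest importances are listed above, but they are not descriptive. The loop in cell 153 iterates over `range(len(yelp_numText.columns))`, so it examines only the first few vocabulary terms in alphabetical order, which is why every listed term is a number. In addition, feature importances from a tree ensemble are unsigned: they indicate how much a feature contributes to the prediction, not whether it is associated with high or low ratings. It is therefore more meaningful to look at the features of the hybrid model at the top. There, the most important features for determining whether a review accompanies a high or low rating are how useful, cool, and funny the review was voted. These features are somewhat unsurprising since they are, in a way, meta-reviews. It is also unsurprising that the prediction from the text model is the next most important feature, since it summarizes the review text using a model fit on an entire training set.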
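A ranking over the whole vocabulary, rather than only its first few entries, could be sketched as follows. The helper `rank_features` and the example names and importances are hypothetical. In the notebook, the inputs would presumably be `corpus.get_feature_names()` and `boosted_forests_text.feature_importances_`.

```python
def rank_features(names, importances, top_n=5):
    """Return the top_n (name, importance) pairs, most important first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    paired = zip(names, importances)
    return sorted(paired, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical vocabulary and importances (not results from the Yelp data).
example_names = ['00', '05', 'delicious', 'rude', 'the']
example_importances = [0.01, 0.0, 0.40, 0.35, 0.24]

top = rank_features(example_names, example_importances, top_n=3)
for name, importance in top:
    print(name + ": " + str(importance))
```

Pairing names and importances with `zip` over their full length avoids tying the loop to the size of an unrelated table. The ranking still says nothing about direction: separating terms associated with high ratings from those associated with low ratings would need a signed measure, such as the coefficients of a linear model fit on the same features.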