# Yelp Data Challenge - NLP

BitTiger DS501

Jun 2017

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [3]:
df.head(2)

| Unnamed: 0 | business_id | name | categories | avg_stars | review_id | user_id | stars | date | text | useful | funny | cool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | --9e1ONYQuAa-CB_Rrw7Tw | "Delmonico Steakhouse" | Cajun/Creole;Steakhouses;Restaurants | 4.0 | 6SgvNWJltnZhW7duJgZ42w | oFyOUOeGTRZhFPF9uTqrTQ | 5 | 2016-03-31 | This is mine and my fiancé's favorite steakhou... | 0 | 0 | 0 |
| 1 | --9e1ONYQuAa-CB_Rrw7Tw | "Delmonico Steakhouse" | Cajun/Creole;Steakhouses;Restaurants | 4.0 | UxFpgng8dPMWOj99653k5Q | aVOGlN9fZ-BXcbtj6dbf0g | 5 | 2016-02-10 | Truly Fantastic! Best Steak ever. Service was... | 0 | 0 | 0 |


### The feature variable is the text of the review

In [4]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text'].values # here .values makes result numpy array

In [5]:
# inspect your documents, e.g. check the size, take a peek at elements of the numpy array
documents.dtype, documents.shape

(dtype('O'), (348455,))

In [6]:
documents[20]

'It\'s hard for me to only give four stars but....  I ate at CUT WHICH IS a two minute walk from Delmonicos two nights ago.   The service was great the food was good but CUT is a masterpiece the Mona Lisa.  I ordered the spinach salad which was very good fresh and delicious.  For my main course I ordered the bone in rib eye "house specialty"  it was great but it did not compare to the magic of my CUT ribeye.  This is an excellent restaurant the only flaw is its two minutes away from CUT!  It\'s like playing basketball next to Jordan baseball next to Ruth.'
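The cells that define `target` (In [7] to In [9]) are not part of this export, yet the train/test split below depends on it. The classification report at the end shows `False`/`True` labels, so `target` is a boolean array with one entry per review. A minimal sketch of one way to build such a label on a toy DataFrame, assuming a review counts as positive only when it has 5 stars (the threshold is an assumption, not taken from the notebook):

```python
import pandas as pd

# Toy stand-in for the review DataFrame; only the 'stars' column matters here.
toy_df = pd.DataFrame({
    'stars': [5, 4, 1, 5, 3],
    'text': ['great', 'good', 'awful', 'amazing', 'meh'],
})

# ASSUMPTION: "positive" means a 5-star review.
# The original threshold is not shown; adjust it to match your own definition.
toy_target = (toy_df['stars'] > 4).values  # boolean numpy array

print(toy_target)
```

On the real data the same expression would be applied to `df['stars']`, giving an array aligned with `documents`.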

## Train test split

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Split to documents_train, documents_test, target_train, target_test
documents_train, documents_test, target_train, target_test = train_test_split(
    documents,
    target,
    test_size = 0.2,
    random_state = 42
)

## Let's get an NLP representation of the documents

Here I used TF-IDF to vectorize the review text, dropping English stop words and keeping the 5,000 terms with the highest frequency across the training corpus.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', max_features=5000)

In [14]:
# Train the model with your training data
vectors_train = vectorizer.fit_transform(documents_train).toarray()

In [15]:
# Get the vocab of your tfidf
words = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()

In [16]:
vectors_train.shape

(278764, 5000)

In [17]:
# Use the trained model to transform your test data
vectors_test = vectorizer.transform(documents_test).toarray()
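To make the fit/transform split concrete, here is a small sketch on a hypothetical toy corpus (the sentences and the `toy_` names are illustrative, not from the dataset). It shows that the vocabulary is learned from the training text only and then reused for the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for documents_train / documents_test
toy_train = ['the food was great',
             'the service was slow',
             'great food and great service']
toy_test = ['slow food']

toy_vectorizer = TfidfVectorizer(stop_words='english')

# fit_transform learns the vocabulary and IDF weights from the training text only
toy_vectors_train = toy_vectorizer.fit_transform(toy_train)
# transform reuses that vocabulary; words never seen in training are ignored
toy_vectors_test = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))  # learned terms, stop words removed
print(toy_vectors_train.shape, toy_vectors_test.shape)
```

Both results are SciPy sparse matrices. The notebook converts them with `.toarray()`; at 278,764 x 5,000 float64 values the dense training matrix takes roughly 11 GB, so keeping the sparse form may be worth considering when memory is tight. The scikit-learn estimators and `cosine_similarity` used later accept sparse input.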

## Similar review search engine

I used cosine similarity to measure how similar two reviews are.
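For two vectors `a` and `b`, cosine similarity is their dot product divided by the product of their lengths, so it depends on the mix of words rather than on review length. A minimal numpy sketch with made-up term-weight vectors (the helper name `cosine_sim` and the numbers are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical term-weight vectors for three short reviews
query = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])  # same word mix, twice the weight
no_overlap = np.array([0.0, 0.0, 3.0])      # shares no terms with the query

print(cosine_sim(query, same_direction))  # close to 1: same direction
print(cosine_sim(query, no_overlap))      # 0: orthogonal vectors
```

Since `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the cosine similarity between two TF-IDF rows reduces to their dot product.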

In [18]:
import numpy as np

# We will need these helper methods pretty soon

def get_top_values(lst, n, labels):
    return [labels[i] for i in np.argsort(lst)[::-1][:n]]  

def get_bottom_values(lst, n, labels):
    return [labels[i] for i in np.argsort(lst)[:n]]
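Both helpers rely on `np.argsort`, which returns the indices that would sort the scores in ascending order; reversing that order gives the highest scores first. A small sketch of the same indexing with hypothetical scores and labels:

```python
import numpy as np

# Hypothetical scores and labels, one label per score
scores = np.array([0.1, 0.9, 0.4, 0.7])
labels = ['a', 'b', 'c', 'd']

order = np.argsort(scores)                    # indices from lowest to highest score
top_2 = [labels[i] for i in order[::-1][:2]]  # reverse, then take the first two
bottom_2 = [labels[i] for i in order[:2]]     # the two lowest scores

print(top_2)     # ['b', 'd']
print(bottom_2)  # ['a', 'c']
```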


In [20]:

from sklearn.metrics.pairwise import cosine_similarity
# Draw an arbitrary review from test (unseen in training) documents
some_random_number = 42
search_query = documents_test[some_random_number]
search_queries = [search_query] # Need to be put into a list-like format
print(search_query)


This place has the best food. Everything is brought as it's made and ready so food is deliciously fresh. Their devil eggs is a must to try. Love love love their jerk turkey burger with tater tots. And please leave room for dessert, their bread pudding is amazing!!! Plus you will love the atmosphere and they just expanded and added a bar area. Recommend the sangria.


In [21]:
# Transform the drawn review(s) to vector(s)
vector_search_queries = vectorizer.transform(search_queries).toarray()

In [22]:
# Calculate the similarity score(s) between vector(s) and training vectors
similarity_scores = cosine_similarity(vector_search_queries, vectors_train)

In [23]:
# Let's find top 3 similar reviews
n = 3
returned_reviews = get_top_values(similarity_scores[0], n, documents_train)

In [24]:
print('Our search query:')
print(search_queries[0])

Our search query:
This place has the best food. Everything is brought as it's made and ready so food is deliciously fresh. Their devil eggs is a must to try. Love love love their jerk turkey burger with tater tots. And please leave room for dessert, their bread pudding is amazing!!! Plus you will love the atmosphere and they just expanded and added a bar area. Recommend the sangria.


In [25]:
print('query')
print(search_query)

print('\n\nMost %s similar reviews:' % n)
for i, review in enumerate(returned_reviews):
    print('#%s:' % i)
    print(review)

query
This place has the best food. Everything is brought as it's made and ready so food is deliciously fresh. Their devil eggs is a must to try. Love love love their jerk turkey burger with tater tots. And please leave room for dessert, their bread pudding is amazing!!! Plus you will love the atmosphere and they just expanded and added a bar area. Recommend the sangria.


Most 3 similar reviews:
#0:
Always very friendly, service and food is amazing.  I love the takuyaki here!  Love it here
#1:
Amazing food! I love the customer service and the coffee... Love love love! Must try this spot. Ask for Julia
#2:
Love love love love love love love cheap fun good food for cheap fun and good love the people and place dearly yessssss !!!!


Here we can see that all the returned reviews are positive, like the query, so it is not surprising that they score as similar.

## Classifying positive/negative review

#### Naive-Bayes Classifier

In [29]:
# Build a Naive-Bayes Classifier

from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()

model_nb.fit(vectors_train, target_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [30]:
# Get score for training set
model_nb.score(vectors_train, target_train) # accuracy

0.8153133116184299

In [31]:
# Get score for test set
model_nb.score(vectors_test, target_test)

0.80861947740741269

#### Logistic Regression Classifier

In [32]:
# Build a Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression

model_lrc = LogisticRegression()
model_lrc.fit(vectors_train, target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [33]:
# Get score for training set
model_lrc.score(vectors_train, target_train)

0.84719691208333936

In [34]:
# Get score for test set
model_lrc.score(vectors_test, target_test)

0.82946147996154451

#### What are the key features (words) that drive a positive prediction?

In [35]:
# Let's find it out by ranking
n = 20
get_top_values(model_lrc.coef_[0], n, words)

['amazing',
 'best',
 'delicious',
 'awesome',
 'thank',
 'perfect',
 'fantastic',
 'great',
 'bomb',
 'highly',
 'phenomenal',
 'love',
 'favorite',
 'gem',
 'excellent',
 'perfection',
 'wonderful',
 'incredible',
 'heaven',
 'perfectly']

#### What are the key features (words) that drive a negative prediction?

In [36]:
# Let's find it out by ranking
n = 20
get_bottom_values(model_lrc.coef_[0], n, words)

['worst',
 'ok',
 'horrible',
 'mediocre',
 'rude',
 'terrible',
 'disappointing',
 'slow',
 'bland',
 'okay',
 'poor',
 'average',
 'decent',
 'overpriced',
 'lacking',
 'wasn',
 'unfortunately',
 'dry',
 'worse',
 'awful']
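The two rankings above read the sign and size of the logistic regression weights: a large positive weight pushes a review toward the `True` class, a large negative weight toward `False`. A minimal sketch on a hypothetical six-review corpus (all `toy_` names and the data are illustrative) shows the same mapping from vocabulary terms to weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 'great' only in positive reviews, 'bad' only in negative
toy_docs = ['great food', 'great service', 'great place',
            'bad food', 'bad service', 'bad place']
toy_labels = [True, True, True, False, False, False]

toy_vec = TfidfVectorizer()
X_toy = toy_vec.fit_transform(toy_docs)
toy_lr = LogisticRegression().fit(X_toy, toy_labels)

# vocabulary_ maps each term to its column index;
# coef_[0][j] is the weight of column j for the positive (True) class
weights = {term: toy_lr.coef_[0][j] for term, j in toy_vec.vocabulary_.items()}

for term in sorted(weights, key=weights.get, reverse=True):
    print('%-8s %+.3f' % (term, weights[term]))
```

Words that appear equally often in both classes ('food', 'service', 'place') end up with weights near zero, which is why neutral terms do not show up in either ranking.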

#### Random Forest Classifier

In [43]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

model_rfc = RandomForestClassifier(max_depth = 20,
                                   n_estimators = 50,
                                   min_samples_leaf = 10,
                                   n_jobs = -1)
model_rfc.fit(vectors_train, target_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [44]:
# Get score for training set
model_rfc.score(vectors_train, target_train)

0.79988807737010514

In [45]:
# Get score for test set
model_rfc.score(vectors_test, target_test)

0.78458122282647691

#### Which features (words) are important according to the RFC model?

In [46]:
n = 20
get_top_values(model_rfc.feature_importances_, n, words)

['amazing',
 'delicious',
 'best',
 'love',
 'great',
 'ok',
 'bad',
 'didn',
 'definitely',
 'awesome',
 'worst',
 'wasn',
 'minutes',
 'horrible',
 'good',
 'favorite',
 'vegas',
 'like',
 'place',
 'money']
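Unlike logistic regression coefficients, `feature_importances_` values are non-negative and sum to 1, so they indicate how much a word helps the forest separate the classes but not in which direction. That would explain why 'amazing' and 'worst' appear in the same list. A small sketch on synthetic data (the data and `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
# The label depends on column 0 only; columns 1 and 2 are pure noise
y_toy = X_toy[:, 0] > 0.5

toy_rfc = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)
importances = toy_rfc.feature_importances_

print(importances)           # unsigned shares, one per column
print(importances.argmax())  # index of the most informative column
```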

In [47]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model_lrc,
                            vectors_train,
                            target_train,
                            cv = 5,
                            scoring="accuracy")
cv_scores

array([ 0.83119306,  0.82294282,  0.82988951,  0.82221266,  0.82349143])
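The five fold accuracies are easier to compare with the single test score once they are summarized as a mean and a spread. A short sketch using the values printed above (the variable name `fold_scores` is introduced here for illustration):

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
fold_scores = np.array([0.83119306, 0.82294282, 0.82988951, 0.82221266, 0.82349143])

# Mean accuracy and twice the standard deviation across folds
print("accuracy: %0.3f (+/- %0.3f)" % (fold_scores.mean(), fold_scores.std() * 2))
```

The mean is about 0.826, close to the 0.829 test accuracy reported for logistic regression earlier, which suggests the single train/test split was not unusually lucky.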

In [48]:
# Tune the logistic regression regularization strength C and the penalty type (l1 vs. l2)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = [{'penalty':['l1'], 'C':[0.1, 100]},
              {'penalty':['l2'], 'C':[0.1, 100]}]

scores = ['accuracy']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score + "\n\n")
    clf = GridSearchCV(LogisticRegression(),
                       param_grid,
                       cv=5,
                       scoring=score)
    clf.fit(vectors_train[:500,:], target_train[:500])
    print("Best parameters set found on development set:\n\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n\n")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    
    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print("\n")
    y_true, y_pred = target_test, clf.predict(vectors_test)
    print(classification_report(y_true, y_pred))
    print("\n")

# Tuning hyper-parameters for accuracy


Best parameters set found on development set:


{'C': 100, 'penalty': 'l2'}

Grid scores on development set:


0.538 (+/-0.005) for {'C': 0.1, 'penalty': 'l1'}
0.746 (+/-0.027) for {'C': 100, 'penalty': 'l1'}
0.572 (+/-0.047) for {'C': 0.1, 'penalty': 'l2'}
0.764 (+/-0.046) for {'C': 100, 'penalty': 'l2'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.


             precision    recall  f1-score   support

      False       0.75      0.79      0.77    145336
       True       0.76      0.71      0.74    133428

avg / total       0.75      0.75      0.75    278764
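The grid above tries only two values of `C` per penalty and is fit on the first 500 training rows, which likely explains why its best score (about 0.76) is below the roughly 0.83 of the untuned model trained on all rows. A sketch of a slightly denser grid on synthetic data follows; the data, the `toy_` names, and the specific `C` values are illustrative assumptions. In recent scikit-learn versions the default solver is `lbfgs`, which does not support the `l1` penalty, so the solver is set to `liblinear` explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and the boolean labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

# One grid covers both penalties; liblinear supports l1 and l2
toy_grid = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]}

toy_search = GridSearchCV(LogisticRegression(solver='liblinear'),
                          toy_grid, cv=5, scoring='accuracy')
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

On the real data, fitting the search on a larger sample of `vectors_train` than 500 rows would make the cross-validated scores more comparable to the earlier results, at the cost of longer run time.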



