<img src="ost_logo.png" width="240" height="240" align="right"/>
<div style="text-align: left"> <b> Machine Learning </b> <br> MSE FTP MachLe <br> 
<a href="mailto:christoph.wuersch@ost.ch"> Christoph Würsch </a> </div>

# Lab 7, A3: Yelp Business Classification using Bag-of-Words and tf-idf

Yelp is a crowd-sourced review forum and an American multinational corporation headquartered in San Francisco, California. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations. The company also trains small businesses to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores. The data is open and can be downloaded from https://www.yelp.com/dataset/challenge.

We have prepared a subset of the full Yelp dataset in the __pickle serialization format__ that you can read into a pandas dataframe using ``pd.read_pickle``.
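
As a minimal sketch of how the pickle round trip works, the following writes a tiny dataframe to disk and reads it back with ``pd.read_pickle``. The toy frame, its contents and the file name `toy_subset.pkl` are illustrative assumptions, not part of the prepared dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for one of the prepared subsets.
toy = pd.DataFrame({
    "text": ["Great cocktails", "Tasty noodles"],
    "categories": ["Bars, Nightlife", "Chinese, Restaurants"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_subset.pkl")
    toy.to_pickle(path)              # serialize the dataframe to disk
    restored = pd.read_pickle(path)  # read it back, index and dtypes included

print(restored)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.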


In [1]:
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection as modsel
import sklearn.preprocessing as preproc

### (a) Load and prepare the Yelp reviews data

In [2]:
DataPath='D:/Downloads/YelpDataset/'
nightlife_subset = pd.read_pickle('nightlife_subset.pkl')
restaurant_subset = pd.read_pickle('restaurant_subset.pkl')

### (b) Combine both datasets

In [3]:
combined = pd.concat([nightlife_subset, restaurant_subset])

In [4]:
combined['target'] = combined.apply(lambda x: 'Nightlife' in x['categories'],
                                    axis=1)
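
The labeling step above is a substring test on the `categories` string. The sketch below reproduces it on a made-up three-row frame and compares it with a vectorized alternative. The vectorized variant is an assumption about what might be preferable on large frames, not the notebook's method.

```python
import pandas as pd

# Made-up category strings, for illustration only.
toy = pd.DataFrame({
    "categories": [
        "Wine Bars, Nightlife, Restaurants",
        "Restaurants, Pizza",
        "Nightlife, Bars",
    ]
})

# Row-wise version, as in the cell above.
row_wise = toy.apply(lambda row: "Nightlife" in row["categories"], axis=1)

# Vectorized version: plain substring match, no regular expression.
vectorized = toy["categories"].str.contains("Nightlife", regex=False)

print(pd.DataFrame({"row_wise": row_wise, "vectorized": vectorized}))
```

Note that a business listed under both "Nightlife" and "Restaurants" is labeled `True` by either variant.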

In [5]:
combined

| | business_id | name | stars_y | text | categories | target |
|---|---|---|---|---|---|---|
| 2203299 | lpYFsXFrojiBZ1kbWR2lZw | Four Peaks Grill & Tap | 5 | Great service and food... Enjoy the atmosphere... | Food, Restaurants, American (New), Local Flavo... | True |
| 482774 | KskYqH1Bi7Z_61pH6Om8pg | Lotus of Siam | 5 | Lotus is one of my all time favorites in Las V... | Wine Bars, Nightlife, Restaurants, Seafood, Ca... | True |
| 3086879 | zFnPRtP7LGvr3sfxvy_dfg | Revolution Ale House | 1 | This place has definitely gone downhill. We fi... | Nightlife, Italian, Restaurants, Pizza, Bars | True |
| 512115 | gUR2pWQKLPgMEm_R_aI_aw | Shooters On The Water | 3 | Thought this might be a worn out hang out for ... | Dive Bars, American (New), American (Tradition... | True |
| 3278073 | hmYnzs8-aHbltaOOGDgmbA | Zipps Sports Grill | 4 | Zipps is almost always on happy hour! They hav... | Sports Bars, American (Traditional), Nightlife... | True |
| ... | ... | ... | ... | ... | ... | ... |
| 2429937 | Vs7gc9EE3k9wARuUcN9piA | Pan Asian | 5 | Pan Asian stays 5 stars for me. I went back t... | Thai, Chinese, Japanese, Restaurants | False |
| 640067 | mYzlPKXvOVRrQivHnDqD5g | YamChops | 5 | Amazingly delicious with a variety of distinct... | Butcher, Juice Bars & Smoothies, Vegan, Restau... | False |
| 1566257 | 9Zl4uWSgSMpxHnsK_MPneg | Palermo Family Restaurant | 4 | Eat here quite often, the Palermo Special piz... | Restaurants, Pizza | False |
| 1607254 | F-AYOq1xIY2u_qmWUG5VBw | Hakka Ren | 5 | It was amazing! I had the chilli fish schezuan... | Chinese, Halal, Indian, Restaurants | False |


### (c) Split the dataset into a training and test set

In [6]:
# Split into training and test data sets
training_data, test_data = modsel.train_test_split(combined,
                                                   train_size=0.7, 
                                                   random_state=123)
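
To illustrate what `train_size=0.7` and a fixed `random_state` do, here is a small sketch on synthetic arrays. The toy arrays are assumptions, and the `stratify` argument in the last call is an optional extra that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 samples, 2 features, balanced binary labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# Same random_state gives the same split on every call.
a_train, a_test = train_test_split(X_toy, train_size=0.7, random_state=123)
b_train, b_test = train_test_split(X_toy, train_size=0.7, random_state=123)

# Optional: keep the class ratio identical in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=123, stratify=y_toy
)

print(a_train.shape, a_test.shape)
print("positives in train/test:", y_tr_s.sum(), y_te_s.sum())
```

The 70/30 proportion here is the same one that yields the 14000 and 6000 rows reported for the combined dataset.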

In [7]:
training_data.shape

(14000, 6)

In [8]:
test_data.shape

(6000, 6)

### (d) Transform the text as BoW (bag-of-words)

In [9]:
# Represent the review text as a bag-of-words 
bow_transform = text.CountVectorizer()
X_tr_bow = bow_transform.fit_transform(training_data['...'])

KeyError: '...'

In [None]:
len(bow_transform.vocabulary_)

In [None]:
X_tr_bow.shape
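
To make the shape of the bag-of-words matrix concrete, the following pure-Python sketch builds one by hand for two made-up sentences. It only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters) and returns dense lists instead of a sparse matrix. The function names are hypothetical.

```python
import re
from collections import Counter

# Two made-up "reviews" used as the training corpus.
docs = ["Great food, great service", "The food was cold"]

def tokenize(doc):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is learned from the training documents only.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_counts(doc):
    # One count per vocabulary entry; words outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocabulary]

bow_matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

The matrix has one row per document and one column per vocabulary entry, which is why the test reviews must be transformed with the vocabulary fitted on the training reviews.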

In [None]:
X_te_bow = bow_transform.transform(test_data['...'])

In [None]:
y_tr = training_data['...']
y_te = test_data['...']

### (e,f) Classify with logistic regression

In [None]:
def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    """Train a logistic regression classifier and print its accuracy on the test data."""
    m = LogisticRegression(C=_C).fit(X_tr, y_tr)
    s = m.score(X_test, y_test)
    print('Test score with', description, 'features:', s)
    return m
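
The `_C` argument is passed on as `C`, which in scikit-learn is the inverse of the regularization strength: a smaller `C` means a stronger penalty. The sketch below shows the effect on the size of the learned coefficients, using synthetic data from `make_classification` as a stand-in for the Yelp features. The `max_iter=1000` setting is an assumption added only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Yelp features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C = strong L2 penalty, large C = weak L2 penalty.
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X_toy, y_toy)
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X_toy, y_toy)

norm_strong = np.linalg.norm(strong_reg.coef_)
norm_weak = np.linalg.norm(weak_reg.coef_)

print("coefficient norm with C=0.01:", norm_strong)
print("coefficient norm with C=100 :", norm_weak)
```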

In [None]:
m1 = simple_logistic_classify(..., y_tr, ..., y_te, 'bow')

### (f) Applying normalization to the features

In [None]:
X_tr_l2 = preproc.normalize(..., axis=0)
X_te_l2 = preproc.normalize(..., axis=0)
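
With `axis=0`, `normalize` rescales each feature (column) to unit l2 norm, whereas the default `axis=1` rescales each sample (row). The NumPy sketch below shows the difference on a small made-up matrix. It assumes no all-zero rows or columns, a case that `preproc.normalize` handles separately.

```python
import numpy as np

# Made-up dense count matrix: 2 documents (rows) x 2 terms (columns).
X = np.array([[3.0, 0.0],
              [4.0, 2.0]])

# Column-wise: divide every feature by the l2 norm of its column.
col_norms = np.linalg.norm(X, axis=0)
X_by_feature = X / col_norms

# Row-wise: divide every document by the l2 norm of its row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_by_sample = X / row_norms

print(X_by_feature)
print(X_by_sample)
```

One consequence worth keeping in mind: when the training and test matrices are normalized in two separate calls with `axis=0`, each uses its own column norms, so the same count can map to different values in the two sets.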

In [None]:
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')

### (g) tf-idf representation

In [None]:
# Create the tf-idf representation using the bag-of-words matrix
tfidf_trfm = text.TfidfTransformer(norm=None)
X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)

In [None]:
X_te_tfidf = tfidf_trfm.transform(X_te_bow)
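
With `norm=None`, the transformer multiplies each raw count by an inverse document frequency weight learned from the training matrix. The pure-Python sketch below spells out that computation, assuming scikit-learn's documented default `smooth_idf=True`, i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1. The count matrix is made up.

```python
import math

# Made-up counts: 3 documents (rows) x 3 terms (columns).
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency, learned on the training data.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# tf-idf without normalization: raw count times idf weight.
tfidf = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

print("df :", df)
print("idf:", idf)
```

A term that occurs in every document gets the minimum weight of 1, while rarer terms are scaled up. The test matrix reuses the idf weights from the training data, which is why only `transform` is called on it.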

In [None]:
m3 = simple_logistic_classify(..., y_tr, ..., y_te, 'tf-idf')

### (h) Tune regularization parameters using grid search

In [None]:
param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}
bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5, return_train_score=True,
                                 param_grid=param_grid_)
l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5, return_train_score=True,
                                param_grid=param_grid_)
tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5, return_train_score=True,
                                   param_grid=param_grid_)

In [None]:
bow_search.fit(X_tr_bow, y_tr)

In [None]:
bow_search.best_score_

In [None]:
l2_search.fit(X_tr_l2, y_tr)

In [None]:
l2_search.best_score_

In [None]:
tfidf_search.fit(X_tr_tfidf, y_tr)

In [None]:
tfidf_search.best_score_

What regularization parameters are best for each method?

In [None]:
bow_search.best_params_

In [None]:
l2_search.best_params_

In [None]:
tfidf_search.best_params_

Let's check one of the grid search outputs to see how it went:

In [None]:
bow_search.cv_results_
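
The raw `cv_results_` dictionary is easier to read as a dataframe. The sketch below runs a small grid search on synthetic data and shows one row per candidate `C`. The data, the shortened grid and the `max_iter=1000` setting are assumptions for illustration and do not reproduce the Yelp results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook itself uses the Yelp feature matrices.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_grid = {"C": [1e-3, 1e-1, 1e1]}
toy_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=toy_grid,
                          cv=5, return_train_score=True)
toy_search.fit(X_toy, y_toy)

# One row per candidate C, with scores averaged over the 5 folds.
summary = pd.DataFrame(toy_search.cv_results_)[
    ["param_C", "mean_train_score", "mean_test_score", "rank_test_score"]
]
print(summary)
```

Comparing `mean_train_score` with `mean_test_score` across the rows gives a quick indication of where the model starts to overfit as `C` grows.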

In [None]:
import pickle

In [None]:
# Protocol -1 selects the highest pickle protocol available
with open('tfidf_gridcv_results.pkl', 'wb') as results_file:
    pickle.dump(bow_search, results_file, -1)
    pickle.dump(tfidf_search, results_file, -1)
    pickle.dump(l2_search, results_file, -1)

In [None]:
# The objects are read back in the order in which they were written
with open('tfidf_gridcv_results.pkl', 'rb') as pkl_file:
    bow_search = pickle.load(pkl_file)
    tfidf_search = pickle.load(pkl_file)
    l2_search = pickle.load(pkl_file)
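
Several objects written to one pickle file come back in exactly the order they were dumped, so the load order has to mirror the dump order. A minimal standard-library sketch, with placeholder dictionaries and a hypothetical file name in place of the fitted search objects:

```python
import os
import pickle
import tempfile

# Placeholder payloads standing in for the three fitted search objects.
first = {"label": "first"}
second = {"label": "second"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_results.pkl")

    # Append two pickles to the same file, one after the other.
    with open(path, "wb") as out_file:
        pickle.dump(first, out_file, pickle.HIGHEST_PROTOCOL)
        pickle.dump(second, out_file, pickle.HIGHEST_PROTOCOL)

    # Each load call consumes the next pickle in the file.
    with open(path, "rb") as in_file:
        loaded_first = pickle.load(in_file)
        loaded_second = pickle.load(in_file)

print(loaded_first, loaded_second)
```

An alternative that avoids relying on the order is to dump a single dictionary that maps a name to each object.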

In [None]:
search_results = pd.DataFrame.from_dict({'bow': bow_search.cv_results_['mean_test_score'],
                               'tfidf': tfidf_search.cv_results_['mean_test_score'],
                               'l2': l2_search.cv_results_['mean_test_score']})
search_results

## Plot cross-validation results

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [None]:
ax = sns.boxplot(data=search_results, width=0.4)
ax.set_ylabel('Accuracy', size=14)
ax.tick_params(labelsize=14)
plt.savefig('tfidf_gridcv_results.png')

In [None]:
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow', 
                              _C=bow_search.best_params_['C'])
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized', 
                              _C=l2_search.best_params_['C'])
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf', 
                              _C=tfidf_search.best_params_['C'])

In [None]:
bow_search.cv_results_['mean_test_score']