# Classification modeling - Restaurants

I create two sets of models: binary classifiers (useful / not useful) and ordinal classifiers with three or four levels of usefulness.

## Import modules and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import time
import re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

In [2]:
correct_index = np.load('../data/rests_eng_index.npy')

In [3]:
rests = pd.read_csv('../data/restaurants.csv', compression='gzip')



In [4]:
rests.shape

(3055990, 28)

In [25]:
def useful_mapper(x):
    if x == 0:
        return 0
    elif x in (1, 2):
        return "Undetermined"
    elif x >= 3:
        return 1

In [26]:
rests['Usefulness'] = rests['useful'].map(useful_mapper)

In [5]:
rests['isUseful'] = (rests['useful'] > 0).astype(int)
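
The two target definitions above (the three-way `Usefulness` label and the binary `isUseful` flag) can be compared on a toy series. This is a minimal sketch, not the notebook's own code: the vote counts are made up, and the `pd.cut` bin edges and level names are illustrative assumptions chosen to mirror `useful_mapper`.

```python
import pandas as pd

# Toy vote counts standing in for rests['useful'] (made-up values).
votes = pd.Series([0, 1, 2, 3, 7, 0, 12])

# Binary target, as in the cell above: any useful vote counts as useful.
is_useful = (votes > 0).astype(int)

# Hypothetical ordinal target: 0 votes, 1-2 votes, 3 or more votes.
# Intervals are right-closed: (-1, 0], (0, 2], (2, inf).
levels = pd.cut(
    votes,
    bins=[-1, 0, 2, float("inf")],
    labels=["not useful", "undetermined", "useful"],
)

# Integer codes keep the ordering of the levels (0 < 1 < 2).
ordinal_codes = levels.cat.codes

print(pd.DataFrame({"votes": votes, "isUseful": is_useful,
                    "level": levels, "code": ordinal_codes}))
```

Unlike `useful_mapper`, which mixes integers with the string `"Undetermined"`, an ordered categorical keeps a single dtype and makes the level order explicit.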

In [6]:
rests.drop(['useful', 'text', 'cool', 'state'], axis=1, inplace=True)

In [7]:
rests = rests[rests.index.isin(correct_index)]

In [8]:
lsa_matrix = np.load('../data/lsa_matrix.npy')

In [9]:
rests.drop([1841405, 1841406], axis=0, inplace=True)

In [10]:
lsa_matrix = np.delete(lsa_matrix, [1841405, 1841406], 0)

In [11]:
rests.columns

Index(['stars', 'funny', 'active_life', 'arts_and_entertainment', 'automotive',
       'beauty_and_spas', 'education', 'event_planning_and_services',
       'financial_services', 'food', 'health_and_medical', 'home_services',
       'hotels_and_travel', 'local_flavor', 'local_services', 'mass_media',
       'nightlife', 'pets', 'professional_services',
       'public_services_and_government', 'religious_organizations',
       'restaurants', 'shopping', 'review_length', 'isUseful'],
      dtype='object')

In [12]:
left_array = rests[rests.columns[:-2]].values

In [13]:
left_array.shape

(2984419, 23)

In [14]:
lsa_matrix.shape

(2984419, 100)

In [17]:
features = np.hstack((left_array, lsa_matrix))

In [18]:
features.shape

(2984419, 123)
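
The positional slice `rests.columns[:-2]` drops the last two columns, which in the column listing above are `review_length` and `isUseful`. The notebook does not say whether excluding `review_length` was intended. The sketch below, on a made-up toy frame, contrasts the positional slice with selecting by name so the difference is visible. `toy` and `toy_lsa` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `rests` (values are made up).
toy = pd.DataFrame({
    "stars": [5, 3, 4],
    "funny": [0, 1, 0],
    "review_length": [120, 45, 300],
    "isUseful": [1, 0, 1],
})
toy_lsa = np.zeros((3, 2))  # stand-in for the LSA matrix

# Positional slice, as in the notebook: drops the last TWO columns.
by_position = toy[toy.columns[:-2]].values

# Name-based alternative: drops only the target column.
by_name = toy.drop(columns=["isUseful"]).values

# Stack the review features next to the topic weights.
features_toy = np.hstack((by_name, toy_lsa))
print(by_position.shape, by_name.shape, features_toy.shape)
```

Selecting by name also stays correct if columns are added later, for example when the `Usefulness` column is created before this step.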

In [20]:
np.save('../data/lsa_features.npy', features)

In [16]:
np.save('../data/rests_useful_scores.npy', rests['isUseful'].values)

## Modeling Pipeline

1. Features for reviews from review dataset
2. Topic weights from topic model
3. Scaler
4. GridSearched models
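
The four steps above can be sketched with scikit-learn's `Pipeline`, which this notebook imports. This is a minimal sketch under stated assumptions: the data is random stand-in data, the step names `scaler` and `clf` are arbitrary, and the small `C` grid is for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(32)

# Steps 1 and 2: stand-ins for review features and topic weights.
review_feats = rng.normal(size=(400, 5))
topic_weights = rng.normal(size=(400, 10))
X_demo = np.hstack((review_feats, topic_weights))
y_demo = (X_demo[:, 0] + X_demo[:, 5]
          + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=32, stratify=y_demo
)

# Steps 3 and 4: the scaler and the model live in one estimator,
# and the grid addresses model parameters as "<step>__<param>".
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=1000,
                               random_state=32)),
])
search = GridSearchCV(pipe, param_grid={"clf__C": [1e-2, 1e-1, 1]}, cv=3)
search.fit(X_tr, y_tr)

demo_score = search.score(X_te, y_te)
print(search.best_params_, demo_score)
```

With the scaler inside the pipeline, it is refit on the training portion of every cross-validation fold, so the held-out fold never contributes to the scaling statistics. The cells below instead scale once before the grid search, which is simpler but lets each validation fold influence the scaler.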

### Load features array

In [2]:
features = np.load('../data/lsa_features.npy')
target = np.load('../data/rests_useful_scores.npy')

## Set target and feature vectors, train/test split, standardize

In [3]:
X = features
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y)

ss = StandardScaler()

X_train = ss.fit_transform(X_train)

X_test = ss.transform(X_test)

### Try modeling with only the LSA feature weights

In [None]:
X = dt_matrix
y = rests['isUseful']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)

In [None]:
X_test = ss.transform(X_test)

In [None]:
lr = GridSearchCV(LogisticRegression(), param_grid={'random_state': [32], 'C': [1e-4, 1e-3]})

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

## LR model with LSA feature weights and features

In [13]:
lr = GridSearchCV(LogisticRegression(), param_grid={'random_state': [32], 
                                                    'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 
                                                    'solver': ['saga'],
                                                    'penalty': ['l2'],
                                                    'n_jobs': [-1],
                                                    'verbose': [1]})

In [14]:
%time lr.fit(X_train, y_train)

convergence after 24 epochs took 65 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.1min finished


convergence after 23 epochs took 63 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.1min finished


convergence after 24 epochs took 66 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.1min finished


convergence after 45 epochs took 122 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.0min finished


convergence after 44 epochs took 121 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.0min finished


convergence after 45 epochs took 125 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.1min finished


convergence after 54 epochs took 144 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 52 epochs took 137 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.3min finished


convergence after 54 epochs took 143 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 146 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 53 epochs took 141 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.3min finished


convergence after 55 epochs took 154 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.6min finished


convergence after 55 epochs took 144 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 53 epochs took 140 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.3min finished


convergence after 55 epochs took 146 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 148 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


convergence after 54 epochs took 143 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 145 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 147 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


convergence after 54 epochs took 144 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 147 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


convergence after 55 epochs took 147 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


convergence after 54 epochs took 145 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.4min finished


convergence after 55 epochs took 148 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


convergence after 39 epochs took 151 seconds
CPU times: user 56min 8s, sys: 32.1 s, total: 56min 40s
Wall time: 55min 54s


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'random_state': [32], 'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'solver': ['saga'], 'penalty': ['l2'], 'n_jobs': [-1], 'verbose': [1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

(0.70162050543400079, 0.70181140724160807)
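
An accuracy of about 0.70 is easier to interpret next to a majority-class baseline, since the share of useful reviews is not reported in this notebook. The sketch below shows one way to compute that baseline with `DummyClassifier`. The labels are made up, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels standing in for y_train / y_test.
y_train_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_test_toy = np.array([0, 0, 1, 0, 1])

# The baseline ignores the features, so placeholders are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_acc = baseline.score(X_test_toy, y_test_toy)
print(baseline_acc)
```

On the real data, the same two lines (`fit` on `X_train, y_train`, `score` on `X_test, y_test`) would give the number that the 0.70 test accuracy has to beat.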

## Random Forest

In [54]:
rf = GridSearchCV(RandomForestClassifier(), param_grid={'random_state': [32], 
                                                        'min_samples_split': [3, 4, 5], 
                                                        'min_samples_leaf': range(4, 11, 1),
                                                        'verbose': [1],
                                                        'n_jobs': [-1]})

In [55]:
%%time
rf.fit(X_train, y_train)
rf.score(X_train, y_train), rf.score(X_test, y_test)

[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   36.4s remaining:   24.3s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.1min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.5s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.9s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    1.1s remaining:    0.7s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.8s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   37.8s remaining:   25.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.2min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.5s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.9s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    1.1s remaining:    0.7s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.8s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   36.8s remaining:   24.5s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.1min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.5s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.8s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.9s remaining:    0.6s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.7s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   35.6s remaining:   23.8s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.1min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.5s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.8s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.9s remaining:    0.6s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.7s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   35.3s remaining:   23.5s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.1min finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  1.7min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  2.1min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.4s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.8s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.9s remaining:    0.6s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.5s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  1.7min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  2.5min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    2.5s remaining:    1.7s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    3.1s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    5.0s remaining:    3.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    6.1s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  1.9min remaining:  1.3min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  3.3min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.4s remaining:    0.3s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.7s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.8s remaining:    0.5s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    1.4s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  2.4min remaining:  1.6min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  3.2min finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    2.3s remaining:    1.5s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    2.8s finished
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    4.8s remaining:    3.2s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    5.8s finished
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  1.5min remaining:   58.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  2.1min finished

CPU times: user 12h 33min 56s, sys: 2min 55s, total: 12h 36min 52s
Wall time: 2h 6min 21s


[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    3.1s finished


In [56]:
rf.best_params_

{'min_samples_leaf': 10,
 'min_samples_split': 3,
 'n_jobs': -1,
 'random_state': 32,
 'verbose': 1}

In [57]:
rf.score(X_train, y_train)

[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    8.1s remaining:    5.4s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    9.8s finished


0.85624492363448557

In [58]:
rf.score(X_test, y_test)

[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    2.7s remaining:    1.8s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    3.2s finished


0.69385006131844718
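
The random forest scores about 0.86 on the training set and 0.69 on the test set, a gap that the logistic regression above does not show. One way to see how that gap varies across the grid is to request training scores from `GridSearchCV` and tabulate `cv_results_`. This is a sketch on random stand-in data with a deliberately small, hypothetical grid; it does not reproduce the search above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Train/test accuracy reported above for the fitted random forest.
train_acc, test_acc = 0.85624492363448557, 0.69385006131844718
gap = train_acc - test_acc

# Random stand-in data (not the review features).
rng = np.random.RandomState(32)
X_demo = rng.normal(size=(300, 8))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)

# return_train_score=True adds mean_train_score to cv_results_.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=32),
    param_grid={"min_samples_leaf": [1, 10, 50]},
    cv=3,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)[
    ["param_min_samples_leaf", "mean_train_score", "mean_test_score"]
].copy()
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(round(gap, 4))
print(results)
```

The best `min_samples_leaf` found above (10) is the largest value in the searched range, so extending the range upward may be worth trying.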