# Classification Modeling - Businesses

Fit and cross-validate binary classification models that predict usefulness for reviews of non-restaurant businesses. We choose a logistic regression model that classifies reviews as useful or not useful with 82% accuracy.

## Import modules

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import pickle

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

## Load features and target

In [2]:
features = np.load('../data/businesses_train_features.npy')
target = np.load('../data/business_target.npy')

## Scale, train/test split

In [8]:
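# Split first so the scaler never sees the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(features, target)

ss = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics on the test split.
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)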



## Baseline Accuracy

In [9]:
target.mean()

0.31229046971224506
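
Note that `target.mean()` is the share of reviews in the positive class (about 31%, assuming useful reviews are coded as 1), so the accuracy a model has to beat is that of always predicting the majority class. A minimal sketch of that calculation follows; the helper name `majority_baseline_accuracy` and the toy label list are illustrative assumptions, not part of the original notebook.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the more common of two 0/1 classes.

    Works on a plain list or a 1-D NumPy array of 0/1 labels.
    """
    positive_rate = sum(labels) / len(labels)
    return max(positive_rate, 1 - positive_rate)


# Toy labels (hypothetical): 3 of 10 reviews are in the positive class.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(round(baseline, 3))
```

Applied to the mean printed above, the same calculation gives 1 - 0.312, or roughly 0.688, so the model accuracies in this notebook should be read against a baseline of about 69%.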

## Logistic Regression

The logistic regression model assumes that the log-odds of a review being useful are a linear combination of the features entered into the model.
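
The log-odds assumption can be illustrated with a small standard-library sketch. The coefficients, intercept, and feature values below are made up for illustration and are not taken from the fitted model; the point is only that passing the linear score through the sigmoid and then taking the log-odds recovers the linear score.

```python
import math


def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def predict_proba(x, coefs, intercept):
    """Probability of the positive class under a logistic regression model."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)


# Hypothetical coefficients and one (already scaled) feature vector.
coefs = [0.8, -0.5, 0.1]
intercept = -0.3
x = [1.2, 0.4, -2.0]

linear_score = intercept + sum(b * xi for b, xi in zip(coefs, x))
p_useful = predict_proba(x, coefs, intercept)

# The log-odds of the predicted probability equal the linear score.
print(linear_score, log_odds(p_useful))
```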

In [10]:
lr = GridSearchCV(LogisticRegression(), param_grid={'random_state': [32], 
                                                    'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 
                                                    'solver': ['saga'],
                                                    'penalty': ['l2'],
                                                    'n_jobs': [-1],
                                                    'verbose': [1]})

In [None]:
lr.fit(X_train, y_train)

In [12]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

(0.82158141719403521, 0.82175349520813346)

The model is 82% accurate on both the training and test splits at predicting whether a review had 3+ useful votes or 0 useful votes. In the next notebook, I will apply the trained model to the validation data to see which reviews are predicted as useful and which are predicted as not useful.

In [18]:
lr.best_params_

{'C': 10,
 'n_jobs': -1,
 'penalty': 'l2',
 'random_state': 32,
 'solver': 'saga',
 'verbose': 1}

In [23]:
with open('../models/logisticreg.pkl', 'wb') as model:
    pickle.dump(lr, model)
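
For reference, reading the pickled model back in a later notebook mirrors the dump above. This is a minimal round-trip sketch: the helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for the fitted `lr` object so the example runs without the data.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a Python object (e.g. a fitted estimator) to disk."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object written by save_model. Only unpickle trusted files."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be the fitted `lr` grid search.
stand_in = {'C': 10, 'penalty': 'l2'}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'logisticreg.pkl'
    save_model(stand_in, path)
    restored = load_model(path)

print(restored)
```

A pickled scikit-learn estimator is generally only safe to load with the same library version that wrote it, so it may be worth recording the version alongside the file.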

## Random Forest

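Random forest classifiers aim to control the overfitting present in single decision trees by fitting multiple trees, each trained on a bootstrap sample of the data and restricted to a random subset of the features at each split, and then combining their predictions.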
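
The two sources of randomness and the combination step can be sketched with the standard library alone. This is an illustration of the idea, not scikit-learn's implementation: the function names are hypothetical, no trees are actually grown, and scikit-learn averages per-tree class probabilities rather than counting hard votes as done here.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one tree's training set)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Combine per-tree class labels into one forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(32)
rows = bootstrap_indices(10, rng)                               # rows seen by one tree
features = random_feature_subset(n_features=16, k=4, rng=rng)   # features tried at one split
vote = majority_vote([1, 0, 1, 1, 0])                           # labels from five hypothetical trees

print(rows, features, vote)
```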

In [16]:
rf = GridSearchCV(RandomForestClassifier(), param_grid={'random_state': [32],
                                                        'n_estimators': [10, 50, 100],
                                                        'min_samples_split': range(2, 4),
                                                        'min_samples_leaf': range(8, 12),
                                                        'n_jobs': [-1],
                                                        'verbose':[-1]})
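
To get a feel for how much work this grid implies, the number of candidate settings and model fits can be counted directly. The sketch below only includes the parameters that actually vary, and it assumes 3-fold cross-validation, which is an assumption about the default used when `cv` is not set in older scikit-learn releases.

```python
from itertools import product

# Only the parameters with more than one value contribute to the grid size.
rf_grid = {
    'n_estimators': [10, 50, 100],
    'min_samples_split': range(2, 4),
    'min_samples_leaf': range(8, 12),
}

candidates = list(product(*rf_grid.values()))

n_folds = 3  # assumption: default number of CV folds in the installed version
total_fits = len(candidates) * n_folds + 1  # +1 for the final refit on all training data

print(len(candidates), total_fits)
```

Under that assumption the search fits 24 candidate settings three times each, which helps explain the long wall time reported for this cell.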

In [19]:
%%time
rf.fit(X_train, y_train)

CPU times: user 3d 17h 51min 48s, sys: 18min 20s, total: 3d 18h 10min 8s
Wall time: 11h 49min 50s


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 36.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'random_state': [32], 'n_estimators': [10, 50, 100], 'min_samples_split': range(2, 4), 'min_samples_leaf': range(8, 12), 'n_jobs': [-1], 'verbose': [-1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [20]:
rf.score(X_train, y_train), rf.score(X_test, y_test)

(0.91140107943526016, 0.82144071811191954)

In [22]:
rf.best_params_

{'min_samples_leaf': 10,
 'min_samples_split': 2,
 'n_estimators': 100,
 'n_jobs': -1,
 'random_state': 32,
 'verbose': -1}

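The grid search selected the largest number of estimators it was offered (100), which suggests that accuracy could be improved by adding more trees to the forest. Even so, I will continue to the prediction stage using the logistic regression model: the two models score about the same on the test split (roughly 82%), the random forest fits the training split much more closely (91%) than it generalizes, and the fit times differ greatly (approx. 120 min for a random forest of 100 estimators vs. approx. 25 min for logistic regression).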