# Classification Modeling - Businesses

Fit and cross-validate binary classification models that predict the usefulness of non-restaurant business reviews. We choose a logistic regression model that classifies business reviews as useful or not useful with 82% accuracy.
___

## Import modules

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import pickle

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

from sklearn.metrics import classification_report, confusion_matrix

## Load features and target

In [4]:
features = np.load('../data/businesses_train_features.npy')

target = np.load('../data/business_target.npy')

## Scale, train/test split

In [9]:
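# Hold out a test set (train_test_split defaults to a 75/25 split)
X_train, X_test, y_train, y_test = train_test_split(features, target)

# Fit the scaler on the training data only, then apply the same scaling to the test data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)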



## Baseline Accuracy

In [9]:
target.mean()

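0.31229046971224506

About 31% of reviews are labeled useful, so a model that always predicts "not useful" would be about 69% accurate. This is the baseline to beat.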
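The value returned by `target.mean()` is the share of positive labels, so the majority-class baseline follows from it directly. A minimal sketch, assuming the target is coded 0/1 as above; the helper name `majority_baseline` is made up for illustration:

```python
def majority_baseline(positive_rate):
    """Accuracy obtained by always predicting the more common class."""
    return max(positive_rate, 1.0 - positive_rate)

# Positive rate reported above by target.mean()
positive_rate = 0.31229046971224506
baseline = majority_baseline(positive_rate)
print(round(baseline, 4))  # 0.6877
```

Any model worth keeping should score clearly above this value on held-out data.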

## Logistic Regression

The logistic regression model assumes that the log-odds of a review being useful are a linear combination of the features entered into the model.
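To make the log-odds assumption concrete, the sketch below turns a linear combination of features into a probability with the logistic function. The weights, intercept, and feature values are hypothetical and chosen only for illustration; they are not the fitted coefficients from this notebook.

```python
import math

def predict_proba(features, weights, intercept):
    """Treat the linear combination as log-odds and map it to a probability."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients and one scaled feature vector (illustration only)
weights = [0.8, -0.5]
intercept = -0.3

p = predict_proba([1.0, 1.0], weights, intercept)
print(p)  # close to 0.5, because the log-odds here are approximately 0
```

Taking `math.log(p / (1 - p))` recovers the linear combination, which is what "linear in the log-odds" means.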

In [10]:
# Only C is searched; the single-valued entries pin the remaining settings
lr = GridSearchCV(LogisticRegression(), param_grid={'random_state': [32], 
                                                    'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 
                                                    'solver': ['saga'],
                                                    'penalty': ['l2'],
                                                    'n_jobs': [-1],
                                                    'verbose': [1]})

In [None]:
lr.fit(X_train, y_train)

In [12]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

(0.82158141719403521, 0.82175349520813346)

Our model is 82% accurate at predicting whether a review had 3+ useful votes or 0 useful votes, on both the training and test splits. In the [next notebook](https://github.com/gd32/DSI_capstone/blob/master/notebooks/08_Review_Prediction_Businesses.ipynb), I will use the trained model on the validation data to see examples of the model's predictions.

In [18]:
lr.best_params_

{'C': 10,
 'n_jobs': -1,
 'penalty': 'l2',
 'random_state': 32,
 'solver': 'saga',
 'verbose': 1}

Generate confusion matrix:

In [5]:
with open('../models/logisticreg.pkl', 'rb') as m:
    lr = pickle.load(m)

In [13]:
tn, fp, fn, tp = confusion_matrix(y_train, lr.predict(X_train)).ravel()

In [15]:
tn, fp, fn, tp

(641285, 37828, 138564, 170247)

In [24]:
print(classification_report(y_train, lr.predict(X_train), target_names=['Not useful', 'Useful']))

             precision    recall  f1-score   support

 Not useful       0.82      0.94      0.88    679113
     Useful       0.82      0.55      0.66    308811

avg / total       0.82      0.82      0.81    987924



We have a high number of false negatives (138,564) and a low number of false positives (37,828). The classification report shows that we have high specificity (0.94 recall for "Not useful") but low sensitivity (0.55 recall for "Useful"); this is likely because short reviews are generally being classified as not useful even if they speak specifically about a product or service the business provides.
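Sensitivity and specificity can be read off the confusion matrix counts directly. The sketch below uses the four training-set counts reported above; only the variable names `sensitivity` and `specificity` are new.

```python
# Counts reported above for the training data
tn, fp, fn, tp = 641285, 37828, 138564, 170247

sensitivity = tp / (tp + fn)  # recall for the "Useful" class
specificity = tn / (tn + fp)  # recall for the "Not useful" class

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
# sensitivity=0.55, specificity=0.94
```

These match the recall column of the classification report for the two classes.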
___

Save the model for predictions:

In [23]:
with open('../models/logisticreg.pkl', 'wb') as model:
    pickle.dump(lr, model)

## Random Forest

Random forest classifiers aim to control the overfitting present in single decision trees by fitting multiple trees, each using a random subset of the feature space, and combining their predictions.
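The randomness that separates the trees can be sketched without any modeling code. The example below is a simplified illustration, not scikit-learn's implementation: it draws one bootstrap sample of rows and one feature subset per tree, whereas `RandomForestClassifier` redraws the candidate features at every split. The function name `draw_tree_inputs` and the toy sizes are assumptions.

```python
import random

def draw_tree_inputs(n_rows, n_features, max_features, rng):
    """Pick the rows and candidate features for one tree in the ensemble."""
    # Bootstrap: sample row indices with replacement
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random feature subset: sample feature indices without replacement
    features = sorted(rng.sample(range(n_features), max_features))
    return rows, features

rng = random.Random(32)  # seeded so the draws are repeatable
for tree_id in range(3):
    rows, features = draw_tree_inputs(n_rows=8, n_features=6, max_features=2, rng=rng)
    print(tree_id, rows, features)
```

Because each tree sees different rows and features, the trees make different errors, and averaging their votes reduces the variance of any single tree.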

In [16]:
rf = GridSearchCV(RandomForestClassifier(), param_grid={'random_state': [32],
                                                        'n_estimators': [10, 50, 100],
                                                        'min_samples_split': range(2, 4),
                                                        'min_samples_leaf': range(8, 12),
                                                        'n_jobs': [-1],
                                                        'verbose':[-1]})

In [None]:
%%time
rf.fit(X_train, y_train)

In [20]:
rf.score(X_train, y_train), rf.score(X_test, y_test)

[Parallel(n_jobs=8)]: Done  56 tasks      | elapsed:   27.4s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:   47.6s finished
[Parallel(n_jobs=8)]: Done  56 tasks      | elapsed:    8.8s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:   15.4s finished


(0.91140107943526016, 0.82144071811191954)

In [22]:
rf.best_params_

{'min_samples_leaf': 10,
 'min_samples_split': 2,
 'n_estimators': 100,
 'n_jobs': -1,
 'random_state': 32,
 'verbose': -1}

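The grid search selected the largest number of estimators tried (100), which suggests that accuracy might improve with more trees in the forest. Even so, the random forest's test accuracy (0.821) is no better than that of logistic regression (0.822), its much higher training accuracy (0.911) points to overfitting, and it takes far longer to fit (approx. 120 minutes for a random forest of 100 estimators vs. approx. 25 minutes for logistic regression). I will therefore continue to the prediction stage using the logistic regression model.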