# 3.6.3 Amazon Reviews

This dataset contains product reviews and metadata from Amazon: 142.8 million reviews spanning May 1996 - July 2014.
It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). The files are available at http://jmcauley.ucsd.edu/data/amazon/

Columns are as below:

reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B

asin - ID of the product, e.g. 0000013714

reviewerName - name of the reviewer

helpful - helpfulness rating of the review, e.g. 2/3

reviewText - text of the review

overall - rating of the product

summary - summary of the review

unixReviewTime - time of the review (unix time)

reviewTime - time of the review (raw)
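
To make the column list concrete, here is a minimal sketch of reading one record. The reviewerID and asin are the examples from the list above; every other value is made up for illustration, and the sketch assumes `helpful` is stored as a `[helpful_votes, total_votes]` pair (the 2/3 in the list).

```python
import json

# One review record as a single JSON line. The reviewerID and asin come from the
# column list above; every other value is made up for illustration.
line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"reviewerName": "J. Doe", "helpful": [2, 3], '
    '"reviewText": "My dog loves this toy.", "overall": 5.0, '
    '"summary": "Great toy", "unixReviewTime": 1252800000, '
    '"reviewTime": "09 13, 2009"}'
)

record = json.loads(line)

# Assumption: "helpful" is a [helpful_votes, total_votes] pair, i.e. 2/3.
helpful_votes, total_votes = record["helpful"]
helpful_ratio = helpful_votes / total_votes if total_votes else None

print(record["asin"], record["overall"], helpful_ratio)
```

Only `reviewText` and `overall` are used in the rest of this section.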

In [1]:

import gzip
import json
import math
import time
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
# A convenience for displaying visualizations.
from IPython.display import Image, display
from matplotlib import pyplot as plt
from sklearn import ensemble, linear_model, model_selection, neighbors, preprocessing, tree
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV lives in model_selection.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

start = datetime.now()


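def parse(path):
  # Yield one review dict per line of the gzipped, line-delimited JSON file.
  # json.loads replaces eval so that file contents are never executed as code.
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  # Collect the parsed records into a DataFrame, one row per review.
  return pd.DataFrame.from_records(list(parse(path)))

#df = getDF('C:/Users/ankush/Downloads/reviews_Pet_Supplies_5.json.gz')
df = data = pd.read_json('C:/Users/ankush/Desktop/Pet_Supplies_5.json', lines=True)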




# Exploring The Data

We care about only one column in the dataset here: reviewText. We extract token features from reviewText and use them to evaluate whether a review is negative or positive.

In [8]:
df['overall'].describe()

count   65430.000
mean        4.245
std         1.183
min         1.000
25%         4.000
50%         5.000
75%         5.000
max         5.000
Name: overall, dtype: float64

We first use a CountVectorizer object to convert the collection of review texts into a matrix of token counts, and then a TfidfTransformer to get the TF-IDF weighted frequency of each token.

In [2]:
#positive_keywords = ['great','enjoyed','loves','like','recommend','happy','yummy','nice','very','good','easy','useful','awesome','works','love','excellent','helpful','Best','nicely','wonderful']
#negative_keywords = ['not','disgusting','junk','bad','didn\'t work','waste','crap']

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(df.reviewText)
X_counts.shape


(65430, 45087)

In [3]:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
X_tfidf.shape

(65430, 45087)
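
The two steps above can be followed on a toy corpus. This is a minimal sketch with three made-up reviews standing in for `df.reviewText`; it shows that the TF-IDF matrix keeps the shape of the count matrix and that a token appearing in every document gets a lower weight than a rarer one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up reviews standing in for df.reviewText.
docs = [
    "my dog loves this toy",
    "my cat hates this toy",
    "great toy great price",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)           # sparse matrix of raw token counts
tfidf = TfidfTransformer().fit_transform(counts)  # same shape, reweighted per token

vocab = count_vect.vocabulary_                    # token -> column index
print(counts.shape, tfidf.shape)
print("count of 'great' in doc 2:", counts[2, vocab["great"]])
print("tf-idf of 'toy' vs 'dog' in doc 0:",
      tfidf[0, vocab["toy"]], tfidf[0, vocab["dog"]])
```

Both tokens occur once in the first review, but "toy" occurs in all three reviews, so its inverse document frequency, and therefore its weight, is smaller.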

In [4]:
Y = [1 if x > 3 else 0 for x in df.overall]
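
The labelling rule above treats 4 and 5 stars as positive and everything else as negative. A small sketch with hypothetical ratings (not taken from the dataset) shows the rule and how to check the resulting class balance, which is what motivates the oversampling step that follows.

```python
from collections import Counter

# Hypothetical star ratings standing in for df.overall.
ratings = [5.0, 4.0, 3.0, 1.0, 5.0, 2.0, 4.0, 5.0]

# Same rule as above: 4 and 5 stars are positive (1), 1 to 3 stars are negative (0).
labels = [1 if r > 3 else 0 for r in ratings]

balance = Counter(labels)
print(labels)
print("positive share:", balance[1] / len(labels))
```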

In [16]:
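from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from imblearn.over_sampling import SMOTE
# Balance the two classes with SMOTE, then split into train and test sets.
# Note: the oversampling runs before the split, so the test set also contains synthetic samples.
sm = SMOTE(random_state=2)
# fit_resample is the current name of fit_sample, which newer imbalanced-learn releases no longer provide.
X_sampled,y_sampled = sm.fit_resample(X_tfidf,Y)
X_train_sample,X_test_sample,y_train_sample,y_test_sample = train_test_split(X_sampled,y_sampled,test_size=.20)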
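
In the cell above the oversampling runs before `train_test_split`, so synthetic samples also end up in the test split. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that order on random stand-in data; it uses plain random oversampling via `sklearn.utils.resample` as a simple substitute for SMOTE, so it is not the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                 # stand-in for X_tfidf
y = np.array([1] * 160 + [0] * 40)   # imbalanced labels, like the 4-5 star majority

# 1) Hold out a test set first, keeping the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# 2) Oversample the minority class in the training portion only.
minority = X_tr[y_tr == 0]
n_extra = int((y_tr == 1).sum() - (y_tr == 0).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)

X_tr_bal = np.vstack([X_tr, extra])
y_tr_bal = np.concatenate([y_tr, np.zeros(n_extra, dtype=int)])

print(np.bincount(y_tr_bal), np.bincount(y_te))
```

With this order the test split keeps the real class ratio and contains only real reviews, so test scores are closer to what the model would see in practice.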


Since PCA has limitations on sparse matrices, we use truncated SVD for dimension reduction, and then fit each model on the reduced features as below.

In [17]:
from sklearn.decomposition import TruncatedSVD
# Reduce the sparse TF-IDF features to 100 components; the SVD is fitted on the training split only.
X_svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
X_svd.fit(X_train_sample)
X_train_svd = X_svd.transform(X_train_sample)
X_test_svd = X_svd.transform(X_test_sample)


#X_pca = PCA(n_components=3)
#X_pca.fit(X_train_sample)
#X_train_pca = X_pca.transform(X_train_sample)
#X_test_pca = X_pca.transform(X_test_sample)
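
One way to judge the choice of `n_components` is to look at how much variance the retained components keep. The sketch below does this on a random sparse matrix standing in for the TF-IDF training matrix; the sizes and the value 20 are arbitrary illustration values, not the notebook's settings.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF training matrix.
X_sparse = sparse_random(500, 300, density=0.02, format="csr", random_state=0)

svd = TruncatedSVD(n_components=20, n_iter=7, random_state=42)
X_reduced = svd.fit_transform(X_sparse)   # dense array, one row per document

# Fraction of the variance kept by the retained components.
kept = float(svd.explained_variance_ratio_.sum())
print(X_reduced.shape, round(kept, 3))
```

On the real data the same attribute, `X_svd.explained_variance_ratio_`, would show how much of the TF-IDF variance the 100 components retain.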

The SVD step keeps 100 components, and every classification model below is fitted on them. For each model we use these techniques: GridSearchCV, cross validation, classification report, AUC and confusion matrix.

In [18]:
#Naive Bayes is used with its default parameters, so we only report the accuracy score
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb = bnb.fit(X_train_svd, y_train_sample)
#Predictions and accuracy here are on the training data
y_pred_bnb = bnb.predict(X_train_svd)
print("Accuracy:", bnb.score(X_train_svd, y_train_sample))

Accuracy: 0.7073962355212355


In [None]:
#KNN
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_svd, y_train_sample)
y_pred_knn = knn_model.predict(X_train_svd)
params = {'n_neighbors':[5,6,7,8,9,10],
          'leaf_size':[1,2,3,5]
                  }
#Making models with hyper parameters sets
grid_class = model_selection.GridSearchCV(knn_model, param_grid=params)
#Learning
grid_class.fit(X_train_svd,y_train_sample)
#The best hyper parameters set

results = grid_class.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class.best_params_)
print(grid_class.best_score_)
final_model = grid_class.best_estimator_
print(final_model)
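
The grid-search cells in this section read `mean_train_score` from `cv_results_`. In recent scikit-learn releases that key is only present when `return_train_score=True` is passed, so `results.get('mean_train_score')` would print `None` otherwise. A minimal sketch on synthetic data (the estimator and grid are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    cv=5,
    return_train_score=True,   # without this, recent versions omit mean_train_score
)
grid.fit(X, y)

results = grid.cv_results_
for params, tr, te in zip(results["params"],
                          results["mean_train_score"],
                          results["mean_test_score"]):
    print(params, "train=%.3f" % tr, "cv=%.3f" % te)
print("best:", grid.best_params_)
```

A large gap between the train and CV columns for a parameter set is the overfitting signal to look for when reading the grid-search outputs.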

In [8]:
#Random Forest
rfc = ensemble.RandomForestClassifier(max_depth=10,n_estimators= 100)
rfc.fit(X_train_svd, y_train_sample)
y_pred_rfc = rfc.predict(X_train_svd)
#Gridsearchcv to get the best parameters
dt_grid={'max_depth':[3,4,5],'max_features': [50]}

grid_class_rfc=model_selection.GridSearchCV(rfc,dt_grid,cv=6)
grid_class_rfc.fit(X_train_svd, y_train_sample)
results = grid_class_rfc.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_rfc.best_params_)
print(grid_class_rfc.best_score_)
final_model = grid_class_rfc.best_estimator_
print(final_model)


[{'max_depth': 3, 'max_features': 50}, {'max_depth': 4, 'max_features': 50}, {'max_depth': 5, 'max_features': 50}]
[0.75380068 0.76043678 0.76911197]
[0.75500242 0.76358832 0.77361969]
Best Hyper Parameters:
 {'max_depth': 5, 'max_features': 50}
0.7691119691119691
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=50, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
[0 0 1 ... 0 1 1]




In [9]:
#Decision Tree
decision_tree = tree.DecisionTreeClassifier(max_depth=10)
decision_tree.fit(X_train_svd, y_train_sample)
y_pred_dt = decision_tree.predict(X_train_svd)
dt_grid={'max_depth':[50],'max_features': [10]}
grid_class_dt=model_selection.GridSearchCV(decision_tree,dt_grid,cv=6)
grid_class_dt.fit(X_train_svd, y_train_sample)
results = grid_class_dt.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_dt.best_params_)
print(grid_class_dt.best_score_)
final_model = grid_class_dt.best_estimator_
print(final_model)


[{'max_depth': 50, 'max_features': 10}]
[0.76692809]
[0.99979247]
Best Hyper Parameters:
 {'max_depth': 50, 'max_features': 10}
0.7669280888030888
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=50,
            max_features=10, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')




In [None]:
#SVC
from sklearn.svm import SVC
svm=SVC()
#The PCA cell above is commented out, so X_train_pca and y_train do not exist; use the SVD features
svm.fit(X_train_svd, y_train_sample)
y_pred_svm = svm.predict(X_train_svd)
param_grid = {'C':[1,10,100,1000],'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear','rbf']}
grid_class_svm=model_selection.GridSearchCV(SVC(),param_grid,refit = True)
grid_class_svm.fit(X_train_svd, y_train_sample)
results = grid_class_svm.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_svm.best_params_)
print(grid_class_svm.best_score_)
final_model = grid_class_svm.best_estimator_
print(final_model)

In [10]:
#Logistic Regression (printed as "Linear Regression" in the outputs below)
lr = LogisticRegression(C=1e9)
lr.fit(X_train_svd, y_train_sample)
y_pred_lr = lr.predict(X_train_svd)
#Gridsearchcv 
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression()
logreg_cv=model_selection.GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(X_train_svd,y_train_sample)
results = logreg_cv.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",logreg_cv.best_params_)
print(logreg_cv.best_score_)
final_model = logreg_cv.best_estimator_
print(final_model)

[{'C': 0.001, 'penalty': 'l1'}, {'C': 0.001, 'penalty': 'l2'}, {'C': 0.01, 'penalty': 'l1'}, {'C': 0.01, 'penalty': 'l2'}, {'C': 0.1, 'penalty': 'l1'}, {'C': 0.1, 'penalty': 'l2'}, {'C': 1.0, 'penalty': 'l1'}, {'C': 1.0, 'penalty': 'l2'}, {'C': 10.0, 'penalty': 'l1'}, {'C': 10.0, 'penalty': 'l2'}, {'C': 100.0, 'penalty': 'l1'}, {'C': 100.0, 'penalty': 'l2'}, {'C': 1000.0, 'penalty': 'l1'}, {'C': 1000.0, 'penalty': 'l2'}]
[0.68929778 0.75611728 0.78295125 0.78559363 0.82481902 0.81934122
 0.82825772 0.82747346 0.82839044 0.82835425 0.8284749  0.82839044
 0.82846284 0.8284749 ]
[0.6892589  0.75629692 0.78308129 0.78589125 0.8256864  0.82001421
 0.82898702 0.82809417 0.82910634 0.82901518 0.82908623 0.82909159
 0.82907148 0.82907953]
Best Hyper Parameters:
 {'C': 100.0, 'penalty': 'l1'}
0.8284749034749035
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_st



In [None]:
#GBR
params = {'n_estimators': 500,
          'max_depth': 7,
          'loss': 'deviance'}
# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
#X_train_pca, X_train and y_train are never defined in this notebook; use the SVD features
clf.fit(X_train_svd, y_train_sample)
y_pred_gbr = clf.predict(X_train_svd)

params ={
    "loss":["deviance"],
    "max_depth":[3,5,8],
    "n_estimators":[1000]
    }
gb = model_selection.GridSearchCV(clf, params, cv=10)
gb.fit(X_train_svd, y_train_sample)
results = gb.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",gb.best_params_)
print(gb.best_score_)
final_model = gb.best_estimator_
print(final_model)


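GridSearchCV looks good for all the models except the Decision Tree, whose mean training score (about 1.00) is far above its mean test score (about 0.77), which indicates overfitting.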

# Cross Validation

In [11]:
# Naive Bayes model 
print("Training data accuracy for Naive Bayes: ",cross_val_score(bnb, X_train_svd, y_train_sample, cv=10))
print("Testing data accuracy Naive Bayes: ",cross_val_score(bnb, X_test_svd, y_test_sample, cv=10))
#KNN

#print("Training data accuracy for KNN: ",cross_val_score(knn_model, X_train_pca, y_train, cv=10))
#print("Testing data accuracy for KNN: ",cross_val_score(knn_model, X_test_pca, y_test, cv=10))
#Random Forest
print("Training data accuracy for RandomForest: ",cross_val_score(rfc, X_train_svd, y_train_sample, cv=10))
print("Testing data accuracy for RandomForest: ",cross_val_score(rfc, X_test_svd, y_test_sample, cv=10))
#Decision Tree
print("Training data accuracy for Decision Tree: ",cross_val_score(decision_tree, X_train_svd, y_train_sample, cv=10))
print("Testing data accuracy for Decision Tree: ",cross_val_score(decision_tree, X_test_svd, y_test_sample, cv=10))
#SVC
#print("Training data accuracy for SVC: ",cross_val_score(SVC(), X_train_pca, y_train, cv=6))
#print("Testing data accuracy for SVC: ",cross_val_score(SVC(), X_test_pca, y_test, cv=6))
#Linear Regression
print("Training data accuracy for Linear Regression : ",cross_val_score(lr, X_train_svd, y_train_sample, cv=10))
print("Testing data accuracy for Linear Regression  : ",cross_val_score(lr, X_test_svd, y_test_sample, cv=10))
#GBR
#print("Training data accuracy for GBR: ",cross_val_score(clf, X_train_pca, y_train, cv=10))
#print("Testing data accuracy for GBR: ",cross_val_score(clf, X_test_pca, y_test, cv=10))

print("Training data accuracy for BNB: ",cross_val_score(bnb, X_train_svd, y_train_sample, cv=10))
print("Testing data accuracy for BNB: ",cross_val_score(bnb, X_test_svd, y_test_sample, cv=10))

Training data accuracy for Naive Bayes:  [0.70659911 0.70310049 0.70141151 0.70756424 0.71187259 0.70921815
 0.70821769 0.70001207 0.70882105 0.7094244 ]
Testing data accuracy Naive Bayes:  [0.72503618 0.72310661 0.70815244 0.69126869 0.70994208 0.70366795
 0.71559633 0.71994206 0.71849348 0.70787059]
Training data accuracy for RandomForest:  [0.80757631 0.82338038 0.81891664 0.82880927 0.82142857 0.82215251
 0.82502715 0.82309642 0.82201038 0.82261373]
Testing data accuracy for RandomForest:  [0.81186686 0.80607815 0.80221901 0.81283164 0.81418919 0.80984556
 0.81265089 0.81506519 0.82665379 0.83631096]
Training data accuracy for Decision Tree:  [0.75763059 0.768609   0.76390397 0.77246954 0.76749517 0.77099421
 0.77591408 0.77181127 0.764209   0.77808616]
Testing data accuracy for Decision Tree:  [0.74722624 0.72841293 0.74433189 0.74529667 0.75048263 0.71573359
 0.73297924 0.73442781 0.75229358 0.75132786]
Training data accuracy for Linear Regression :  [0.81626252 0.83701291 0.8238


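Cross-validation accuracy on the SVD features is consistent across folds, and for each model the scores on the training and test splits are close (the Decision Tree is a few points lower on the test split), so there is no strong sign of overfitting here. The models were not run on the full TF-IDF feature set, so no comparison with it is available.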
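
The cell above calls `cross_val_score` on the test split as well, which trains fresh models on folds of the test data instead of scoring the model that was fitted on the training split. A minimal sketch of the more usual held-out check, on synthetic data with an illustrative model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training split: each fold trains a fresh clone.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=10)

# Held-out check: fit once on the training split, score on data never used for fitting.
model.fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print("cv mean=%.3f std=%.3f" % (cv_scores.mean(), cv_scores.std()))
print("held-out accuracy=%.3f" % test_acc)
```

Reporting the mean and standard deviation of the fold scores is also easier to compare across models than the raw arrays.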

# Area Under Curve

In [12]:
from sklearn.metrics import roc_auc_score
print ("AUC Score for BNB :" ,roc_auc_score(y_train_sample, y_pred_bnb))
#print ("AUC Score for KNN:" ,roc_auc_score(y_train, y_pred_knn))
print ("AUC Score for Random Forest:" ,roc_auc_score(y_train_sample,y_pred_rfc))
print ("AUC Score for Decision Tree:" ,roc_auc_score(y_train_sample, y_pred_dt))
#print ("AUC Score for SVM:" ,roc_auc_score(y_train, y_pred_svm))
print ("AUC Score for Linear Regression:" ,roc_auc_score(y_train_sample, y_pred_lr))
#print ("AUC Score for GBR:" ,roc_auc_score(y_train, y_pred_gbr))


AUC Score for BNB : 0.7068965376279065
AUC Score for Random Forest: 0.8650449992624996
AUC Score for Decision Tree: 0.8132196050081972
AUC Score for Linear Regression: 0.8292603488036231


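The AUC scores with the SVD features are reasonable, ranging from about 0.71 for BNB to about 0.87 for Random Forest. They are computed from predictions on the training data.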
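
The AUC cell passes hard class predictions (the output of `predict`) to `roc_auc_score`. With 0/1 labels as scores, AUC reduces to the average of the two per-class recalls, which is why the AUC values above are almost identical to the accuracy values on this balanced data. The sketch below shows the effect with a small rank-based AUC written in plain Python; the probabilities are hypothetical.

```python
def auc(y_true, scores):
    """Rank-based AUC: probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]         # hypothetical predict_proba[:, 1] values
labels = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would return

auc_scores = auc(y_true, proba)
auc_labels = auc(y_true, labels)

# With hard labels, AUC collapses to (recall of class 1 + recall of class 0) / 2.
tpr = sum(l == 1 for t, l in zip(y_true, labels) if t == 1) / 3
tnr = sum(l == 0 for t, l in zip(y_true, labels) if t == 0) / 3
print(auc_scores, auc_labels, (tpr + tnr) / 2)
```

To get a ranking-based AUC from the fitted models, one would pass `model.predict_proba(X)[:, 1]` instead of `model.predict(X)`.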

# Confusion Matrix

In [13]:

#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score 
#print(confusion_matrix(y_train, y_pred_bnb))
#print('Accuracy Score for KBN :',accuracy_score(y_train, y_pred_bnb))
#print(confusion_matrix(y_train, y_pred_knn))
#print('Accuracy Score for KNN :',accuracy_score(y_train, y_pred_knn))
print(confusion_matrix(y_train_sample, y_pred_rfc))
print('Accuracy Score for Random Forest :',accuracy_score(y_train_sample, y_pred_rfc))
print(confusion_matrix(y_train_sample, y_pred_dt))
print('Accuracy Score for Decision Tree :',accuracy_score(y_train_sample, y_pred_dt))
#print(confusion_matrix(y_train, y_pred_SVM))
#print('Accuracy Score for SVM:',accuracy_score(y_train, y_pred_svm))
print(confusion_matrix(y_train_sample, y_pred_lr))
print('Accuracy Score for Linear Regression :',accuracy_score(y_train_sample, y_pred_lr))
#print(confusion_matrix(y_train, y_pred_gbr))
#print('Accuracy Score for GBR :',accuracy_score(y_train, y_pred_gbr))
print(confusion_matrix(y_train_sample, y_pred_bnb))
print('Accuracy Score for BNB :',accuracy_score(y_train_sample, y_pred_bnb))

[[33419  7905]
 [ 3267 38289]]
Accuracy Score for Random Forest : 0.8652027027027027
[[34254  7070]
 [ 8414 33142]]
Accuracy Score for Decision Tree : 0.8131756756756757
[[34642  6682]
 [ 7471 34085]]
Accuracy Score for Linear Regression : 0.8292350386100386
[[29316 12008]
 [12285 29271]]
Accuracy Score for BNB : 0.7068894787644787


Accuracy looks good after SVD. Three models (KNN, SVC and GBR) take a long time to complete: their accuracy is similar, but they are very slow compared to the others.

In [14]:
#Classification Report
from sklearn.metrics import classification_report
#print("Classification Report for KNN:",classification_report(y_train, y_pred_knn))
print("Classification Report for RFC:",classification_report(y_train_sample, y_pred_rfc))
print("Classification Report for DT:",classification_report(y_train_sample, y_pred_dt))
#print("Classification Report for SVM:",classification_report(y_train, y_pred_svm))
print("Classification Report for LR:",classification_report(y_train_sample, y_pred_lr))
#print("Classification Report for GBR:",classification_report(y_train, y_pred_gbr))
print("Classification Report for BNB:",classification_report(y_train_sample, y_pred_bnb))

Classification Report for RFC:              precision    recall  f1-score   support

          0       0.91      0.81      0.86     41324
          1       0.83      0.92      0.87     41556

avg / total       0.87      0.87      0.86     82880

Classification Report for DT:              precision    recall  f1-score   support

          0       0.80      0.83      0.82     41324
          1       0.82      0.80      0.81     41556

avg / total       0.81      0.81      0.81     82880

Classification Report for LR:              precision    recall  f1-score   support

          0       0.82      0.84      0.83     41324
          1       0.84      0.82      0.83     41556

avg / total       0.83      0.83      0.83     82880

Classification Report for BNB:              precision    recall  f1-score   support

          0       0.70      0.71      0.71     41324
          1       0.71      0.70      0.71     41556

avg / total       0.71      0.71      0.71     82880
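
The figures in the classification report can be recomputed from the confusion matrix. As a check, this sketch takes the Random Forest matrix printed in the confusion matrix cell above and derives precision, recall, F1 for class 1 and the overall accuracy.

```python
# Random Forest confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[33419, 7905],
      [3267, 38289]]

tn, fp = cm[0]
fn, tp = cm[1]

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 4))
```

The rounded values match the class 1 row of the RFC report (0.83, 0.92, 0.87) and the printed accuracy score.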



F1 score and recall are good, and these models ran in a reasonable time. Now let's get the accuracy using the same techniques after feature reduction with SelectKBest (SKB).

# SelectKBest (SKB)

In [25]:
skb = SelectKBest(f_classif,k=100)
skb.fit(X_train_sample, y_train_sample)
X_train_skb = skb.transform(X_train_sample)
X_test_skb = skb.transform(X_test_sample)
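
When feature selection is combined with cross validation, one option is to put the selector inside a pipeline, so that each fold chooses its features from its own training part only. This is a sketch on synthetic data with illustrative sizes and an illustrative classifier; it is not the setup used in the cells below.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           random_state=0)

# Selection lives inside the pipeline, so each CV fold picks its own k features
# from its own training part only.
pipe = make_pipeline(SelectKBest(f_classif, k=10),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)

# Fitting the selector on its own shows which columns were kept.
selector = SelectKBest(f_classif, k=10).fit(X, y)
print(scores.mean(), selector.get_support().sum())
```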



In [None]:
#Gridsearchcv for KNN
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_skb, y_train_sample)
y_pred_knn = knn_model.predict(X_train_skb)
params = {'n_neighbors':[5],
          'leaf_size':[1,2,3,5]
                  }
#Making models with hyper parameters sets
grid_class = model_selection.GridSearchCV(knn_model, param_grid=params)
#Learning
grid_class.fit(X_train_skb,y_train_sample)
#The best hyper parameters set

results = grid_class.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class.best_params_)
print(grid_class.best_score_)
final_model = grid_class.best_estimator_
print(final_model)

In [27]:
#Gridsearchcv for Random Forest
rfc = ensemble.RandomForestClassifier(max_depth=10,n_estimators= 100)
rfc.fit(X_train_skb, y_train_sample)
y_pred_rfc = rfc.predict(X_train_skb)

#Gridsearchcv to get the best parameters
dt_grid={'max_depth':[3,4,5],'max_features': [100]}

grid_class_rfc=model_selection.GridSearchCV(rfc,dt_grid,cv=6)
grid_class_rfc.fit(X_train_skb, y_train_sample)
results = grid_class_rfc.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_rfc.best_params_)
print(grid_class_rfc.best_score_)
final_model = grid_class_rfc.best_estimator_
print(final_model)

[{'max_depth': 3, 'max_features': 100}, {'max_depth': 4, 'max_features': 100}, {'max_depth': 5, 'max_features': 100}]
[0.71914817 0.73122587 0.74121622]
[0.7207336  0.73363418 0.74592906]
Best Hyper Parameters:
 {'max_depth': 5, 'max_features': 100}
0.7412162162162163
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=100, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)




In [28]:
#Gridsearchcv for Decision Tree
decision_tree = tree.DecisionTreeClassifier(max_depth=50)
decision_tree.fit(X_train_skb, y_train_sample)
y_pred_dt = decision_tree.predict(X_train_skb)
dt_grid={'max_depth':[50],'max_features': [100]}
grid_class_dt=model_selection.GridSearchCV(decision_tree,dt_grid,cv=6)
grid_class_dt.fit(X_train_skb, y_train_sample)
results = grid_class_dt.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_dt.best_params_)
print(grid_class_dt.best_score_)
final_model = grid_class_dt.best_estimator_
print(final_model)

[{'max_depth': 50, 'max_features': 100}]
[0.77502413]
[0.999264]
Best Hyper Parameters:
 {'max_depth': 50, 'max_features': 100}
0.7750241312741313
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=50,
            max_features=100, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')




In [None]:
#Gridsearchcv for SVC
from sklearn.svm import SVC
svm=SVC()
svm.fit(X_train_skb, y_train_sample)
y_pred_svm = svm.predict(X_train_skb)
param_grid = {'C':[1,10,100,1000],'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear','rbf']}
grid_class_svm=model_selection.GridSearchCV(SVC(),param_grid,refit = True)
grid_class_svm.fit(X_train_skb, y_train_sample)
results = grid_class_svm.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",grid_class_svm.best_params_)
print(grid_class_svm.best_score_)
final_model = grid_class_svm.best_estimator_
print(final_model)

In [29]:
#Gridsearchcv for Logistic Regression (printed as "Linear Regression" in the outputs below)
lr = LogisticRegression(C=1e9)
lr.fit(X_train_skb, y_train_sample)
y_pred_lr = lr.predict(X_train_skb)
#Gridsearchcv 
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression()
logreg_cv=model_selection.GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(X_train_skb,y_train_sample)
results = logreg_cv.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",logreg_cv.best_params_)
print(logreg_cv.best_score_)
final_model = logreg_cv.best_estimator_
print(final_model)

[{'C': 0.001, 'penalty': 'l1'}, {'C': 0.001, 'penalty': 'l2'}, {'C': 0.01, 'penalty': 'l1'}, {'C': 0.01, 'penalty': 'l2'}, {'C': 0.1, 'penalty': 'l1'}, {'C': 0.1, 'penalty': 'l2'}, {'C': 1.0, 'penalty': 'l1'}, {'C': 1.0, 'penalty': 'l2'}, {'C': 10.0, 'penalty': 'l1'}, {'C': 10.0, 'penalty': 'l2'}, {'C': 100.0, 'penalty': 'l1'}, {'C': 100.0, 'penalty': 'l2'}, {'C': 1000.0, 'penalty': 'l1'}, {'C': 1000.0, 'penalty': 'l2'}]
[0.50054295 0.74416023 0.74854006 0.76478041 0.80591216 0.79517375
 0.80798745 0.80694981 0.80815637 0.80815637 0.80814431 0.80815637
 0.80814431 0.80812017]
[0.50054295 0.74415889 0.74919965 0.76492117 0.80648461 0.79591377
 0.80887494 0.80773943 0.80910419 0.80904654 0.80915647 0.80914441
 0.80915513 0.80915111]
Best Hyper Parameters:
 {'C': 10.0, 'penalty': 'l1'}
0.8081563706563707
LogisticRegression(C=10.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_stat



In [None]:
#Gridsearchcv for GBR
params = {'n_estimators': 500,
          'max_depth': 7,
          'loss': 'deviance'}
# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train_skb, y_train_sample)
y_pred_gbr = clf.predict(X_train_skb)

params ={
    "loss":["deviance"],
    "max_depth":[3,5,8],
    "n_estimators":[1000]
    }
gb = model_selection.GridSearchCV(clf, params, cv=10)
gb.fit(X_train_skb, y_train_sample)
results = gb.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
print(results.get('mean_train_score'))
print("Best Hyper Parameters:\n",gb.best_params_)
print(gb.best_score_)
final_model = gb.best_estimator_
print(final_model)


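GridSearchCV results look good for Random Forest and Logistic Regression, where the mean train and test scores are close together. The Decision Tree is the exception: its mean training score is about 1.00 against a mean test score of about 0.78. The three models KNN, SVC and GBR are very slow to run.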
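
To put numbers on which models are slow, a small timing helper can be run on a subsample before launching the full grid search. The sketch below times one fit plus one predict for two classifiers on synthetic data; the helper name `fit_predict_seconds` and the data sizes are made up for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

def fit_predict_seconds(model, X, y):
    """Wall-clock seconds for one fit plus one predict on the same data."""
    start = time.perf_counter()
    model.fit(X, y)
    model.predict(X)
    return time.perf_counter() - start

timings = {
    "LogisticRegression": fit_predict_seconds(LogisticRegression(max_iter=1000), X, y),
    "KNeighborsClassifier": fit_predict_seconds(KNeighborsClassifier(n_neighbors=5), X, y),
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-22s %.3fs" % (name, seconds))
```

Multiplying the single-fit time by the number of parameter combinations and CV folds gives a rough estimate of how long a grid search would take.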

# Cross Validation

In [30]:
# Naive Bayes model 
print("Training data accuracy for Naive Bayes: ",cross_val_score(bnb, X_train_skb, y_train_sample, cv=10))
print("Testing data accuracy Naive Bayes: ",cross_val_score(bnb, X_test_skb, y_test_sample, cv=10))
#KNN

#print("Training data accuracy for KNN: ",cross_val_score(knn_model, X_train_pca, y_train, cv=10))
#print("Testing data accuracy for KNN: ",cross_val_score(knn_model, X_test_pca, y_test, cv=10))
#Random Forest
print("Training data accuracy for RandomForest: ",cross_val_score(rfc, X_train_skb, y_train_sample, cv=10))
print("Testing data accuracy for RandomForest: ",cross_val_score(rfc, X_test_skb, y_test_sample, cv=10))
#Decision Tree
print("Training data accuracy for Decision Tree: ",cross_val_score(decision_tree, X_train_skb, y_train_sample, cv=10))
print("Testing data accuracy for Decision Tree: ",cross_val_score(decision_tree, X_test_skb, y_test_sample, cv=10))
#SVC
#print("Training data accuracy for SVC: ",cross_val_score(SVC(), X_train_pca, y_train, cv=6))
#print("Testing data accuracy for SVC: ",cross_val_score(SVC(), X_test_pca, y_test, cv=6))
#Linear Regression
print("Training data accuracy for Linear Regression : ",cross_val_score(lr, X_train_skb, y_train_sample, cv=10))
print("Testing data accuracy for Linear Regression  : ",cross_val_score(lr, X_test_skb, y_test_sample, cv=10))
#GBR
#print("Training data accuracy for GBR: ",cross_val_score(clf, X_train_pca, y_train, cv=10))
#print("Testing data accuracy for GBR: ",cross_val_score(clf, X_test_pca, y_test, cv=10))

Training data accuracy for Naive Bayes:  [0.70671975 0.71793944 0.70406563 0.70394499 0.70901194 0.71195849
 0.7102691  0.70821769 0.71702667 0.70749366]
Testing data accuracy Naive Bayes:  [0.70670526 0.72165943 0.7105644  0.72021225 0.71490593 0.70642202
 0.71269918 0.70980203 0.7136649  0.71704491]
Training data accuracy for RandomForest:  [0.80504283 0.8084208  0.81228134 0.8032332  0.80745566 0.81501146
 0.80547846 0.8028237  0.80209967 0.80378907]
Testing data accuracy for RandomForest:  [0.80125422 0.80752533 0.80511336 0.80173661 0.79739508 0.80733945
 0.79575085 0.81409947 0.80685659 0.80251086]
Training data accuracy for Decision Tree:  [0.77548558 0.77693329 0.78031126 0.7752443  0.77331403 0.78049958
 0.77277664 0.78230964 0.7671051  0.77036322]
Testing data accuracy for Decision Tree:  [0.72551857 0.75156778 0.74240232 0.74191992 0.73130728 0.72766779
 0.71994206 0.73635925 0.73684211 0.72815065]
Training data accuracy for Linear Regression :  [0.80733502 0.8179515  0.8067

Cross-validation accuracy with SKB is consistent across folds and close between the training and test splits (the Decision Tree is again a few points lower on the test split). The models were not run on the full feature set, so no comparison with it is available.

# AUC

In [32]:
from sklearn.metrics import roc_auc_score
print ("AUC Score for BNB :" ,roc_auc_score(y_train_sample, y_pred_bnb))
#print ("AUC Score for KNN:" ,roc_auc_score(y_train, y_pred_knn))
print ("AUC Score for Random Forest:" ,roc_auc_score(y_train_sample, y_pred_rfc))
print ("AUC Score for Decision Tree:" ,roc_auc_score(y_train_sample, y_pred_dt))
#print ("AUC Score for SVM:" ,roc_auc_score(y_train, y_pred_svm))
print ("AUC Score for Linear Regression:" ,roc_auc_score(y_train_sample, y_pred_lr))
#print ("AUC Score for GBR:" ,roc_auc_score(y_train, y_pred_gbr))

AUC Score for BNB : 0.7073864700171258
AUC Score for Random Forest: 0.8317776339482007
AUC Score for Decision Tree: 0.9983948360060639
AUC Score for Linear Regression: 0.8089648167617285


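The AUC scores are also good after SKB, but they are computed on the training data. The Decision Tree's 0.998 reflects a tree with max_depth=50 fitting the training set almost perfectly; its cross-validated accuracy is only about 0.77. Note that bnb is not refit in this section, so y_pred_bnb still holds the predictions made on the SVD features.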

# Confusion Matrix

In [33]:

#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score 
print(confusion_matrix(y_train_sample, y_pred_bnb))
print('Accuracy Score for BNB :',accuracy_score(y_train_sample, y_pred_bnb))
#print(confusion_matrix(y_train, y_pred_knn))
#print('Accuracy Score for KNN :',accuracy_score(y_train, y_pred_knn))
print(confusion_matrix(y_train_sample, y_pred_rfc))
print('Accuracy Score for Random Forest :',accuracy_score(y_train_sample, y_pred_rfc))
print(confusion_matrix(y_train_sample, y_pred_dt))
print('Accuracy Score for Decision Tree :',accuracy_score(y_train_sample, y_pred_dt))
#print(confusion_matrix(y_train, y_pred_svm))
#print('Accuracy Score for SVM:',accuracy_score(y_train, y_pred_svm))
print(confusion_matrix(y_train_sample, y_pred_lr))
print('Accuracy Score for Linear Regression :',accuracy_score(y_train_sample, y_pred_lr))
#print(confusion_matrix(y_train, y_pred_gbr))
#print('Accuracy Score for GBR :',accuracy_score(y_train, y_pred_gbr))

[[29719 11766]
 [12485 28910]]
Accuracy Score for BNB : 0.7073962355212355
[[31404 10081]
 [ 3868 37527]]
Accuracy Score for Random Forest : 0.8316964285714286
[[41435    50]
 [   83 41312]]
Accuracy Score for Decision Tree : 0.9983952702702703
[[34019  7466]
 [ 8366 33029]]
Accuracy Score for Linear Regression : 0.808976833976834


It did fairly well, but this only covers predicting whether a customer review is positive.

# Classification Report

In [34]:
#Classification Report
from sklearn.metrics import classification_report
#print("Classification Report for KNN:",classification_report(y_train, y_pred_knn))
print("Classification Report for RFC:",classification_report(y_train_sample, y_pred_rfc))
print("Classification Report for DT:",classification_report(y_train_sample, y_pred_dt))
#print("Classification Report for SVM:",classification_report(y_train, y_pred_svm))
print("Classification Report for LR:",classification_report(y_train_sample, y_pred_lr))
#print("Classification Report for GBR:",classification_report(y_train, y_pred_gbr))
print("Classification Report for BNB:",classification_report(y_train_sample, y_pred_bnb))

Classification Report for RFC:              precision    recall  f1-score   support

          0       0.89      0.76      0.82     41485
          1       0.79      0.91      0.84     41395

avg / total       0.84      0.83      0.83     82880

Classification Report for DT:              precision    recall  f1-score   support

          0       1.00      1.00      1.00     41485
          1       1.00      1.00      1.00     41395

avg / total       1.00      1.00      1.00     82880

Classification Report for LR:              precision    recall  f1-score   support

          0       0.80      0.82      0.81     41485
          1       0.82      0.80      0.81     41395

avg / total       0.81      0.81      0.81     82880

Classification Report for BNB:              precision    recall  f1-score   support

          0       0.70      0.72      0.71     41485
          1       0.71      0.70      0.70     41395

avg / total       0.71      0.71      0.71     82880



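F1 score and recall are good. Compared with the SVD run, the average F1 drops slightly for Random Forest (0.86 to 0.83) and Logistic Regression (0.83 to 0.81), stays at 0.71 for BNB, and rises to 1.00 for the Decision Tree, which here uses max_depth=50 instead of 10 and is scored on its own training data.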

# Conclusion:

All the models that finished had a good accuracy score. Three models (KNN, SVC and GBR) are very slow: after more than 7 hours they were still running. There is little difference in performance between SVD and SKB: accuracy and F1 score remained similar, which is what we had expected.

This is a classification problem, so we used classification models and measured how accurately they predict the outcome. The end goal is to predict the right outcome, so the accuracy should be considerably high. The cross-validation accuracy, F1 score and precision observed above are very promising.

We usually expect GBR to be the slowest of all, since it runs in multiple cycles to gain accuracy, but we were a little surprised by SVC and KNN. This problem is about predicting positive reviews: we are given the text and ratings of customer reviews, and we check how accurate the models are in predicting the outcome.