# Documentation
This notebook is intended for model deployment. <br>
It includes three parts:<br>
1. Import a pre-prepared dataset for modeling, obtained from another Jupyter notebook: 'Review_Score_Data_Preparation'.
2. Modeling - We mainly use Random Forest and XGBoost models, with the Gridsearch method to search for the best hyperparameters. Among all the models, Random Forest with Gridsearch performs the best. Here are the metrics:<br>
    **Out-of-sample AUC Score:** 92%  <br>
    **Out-of-sample Recall rate:** 92% <br>
    *In this project, we especially care about the recall rate because we want to pick out as many fraudulent reviews as possible.*<br>
3. Apply the model to the Reviewbox dataset - we apply our model to the RSC review dataset, using a prepared dataset obtained from another Jupyter notebook: 'Review_Score_Feature_Engineering'.


**Input:**
1. model_data.csv
  * the dataset used for model training, web scraped from Reviewmeta
  * used to build the Random Forest/XGBoost models



2. full_merged_data_RSC.csv
  * the output of "02_Review_Score_Feature_Engineering.ipynb"
  * the input to the trained model, which returns the 1/0 result.
  
  <br>


**Output:**
1. final_output_RSC.csv - contains the reviews of RSC products, their features (flags), and a final prediction: 1 = fraudulent, 0 = not fraudulent.

In [0]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, precision_recall_curve, make_scorer, recall_score, accuracy_score, precision_score
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Import the dataset for modeling
**Notes**:
- The dataset 'model_data.csv' that we use for modeling is obtained by running the code in another Jupyter notebook: 'Review_Score_Data_Preparation'.
- There's no need to run the code in the 'Review_Score_Data_Preparation' notebook unless there's new data that you want to add to train the model.
- Obtaining 'model_data.csv' involved 3 main steps: <br>
    1. In February 2020, the team scraped 20,996 reviews from the Reviewmeta website. The columns include one dependent variable, the trust score that Reviewmeta gives an individual review, and 20 independent variables, the flags that make a review untrustworthy.<br>
    See an example of what this looks like [Most Trusted Reviews/Most Trusted Reviews]:  https://reviewmeta.com/amazon/B00063446M
    2. In April 2020, the team discovered that some of the review links scraped in February no longer worked because the reviews had been deleted. We assume that these reviews were deleted because they are fraudulent. We also sampled a number of good reviews from the rest of the dataset and combined them with the deleted reviews.
    3. Based on the links, we then scraped the reviewer profiles of these reviews to get more features.

In [0]:
model_data = pd.read_csv('model_data.csv', index_col = 0)

In [0]:
model_data.head()

Unnamed: 0,Fraudulent,Non_Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Take_backs,Overrep_part,Overrep_wrd_cnt,Overlapping_rev_history,One_hit,...,single_day,num_of_unverified,mode_number,samedate_20,anonymous,only_5star,0_review,Easy_grade_rating,5_star,1_star
0,1,1,1,0,0,0,0,0,0,1,...,0,0.0,0.0,0.0,0.0,0.0,1.0,1,1,0
1,1,1,1,0,0,0,1,0,0,1,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,1
2,1,1,1,1,0,0,1,0,0,1,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0
3,1,1,0,1,0,1,0,0,1,0,...,0,0.0,0.0,0.0,0.0,0.0,1.0,1,1,0
4,1,0,0,0,1,1,0,0,1,0,...,0,0.0,0.0,0.0,0.0,0.0,1.0,1,1,0


In [0]:
model_data.shape

(2668, 24)

# Modeling
**Notes:**
1. In the modeling part, we mainly use two models: Random Forest and XGBoost. We also use the Gridsearch method to optimize the hyperparameters and improve model performance. Therefore, there are four models in total.<br>
**Hyperparameter:** A parameter whose value is set before the learning process begins. <br>
**Gridsearch:** A tuning method that tries every combination from a predefined set of hyperparameter values and keeps the combination that scores best.
2. Among the 4 models, Random Forest with Gridsearch performs the best. Here are the metrics:<br>
**Out-of-sample AUC Score:** 92%  (Explanation of AUC: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)<br>
**Out-of-sample Recall rate:** 92%  (meaning our classifier is able to pick out 92% of the fraudulent reviews)



## Random Forest
### With default hyperparameter

In [0]:
model_data.columns

Index(['Fraudulent', 'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Take_backs',
       'Overrep_part', 'Overrep_wrd_cnt', 'Overlapping_rev_history', 'One_hit',
       'incentivized', 'Brand_repeater', 'Brand_Loyalist', 'Brand_Monogamist',
       'single_day', 'num_of_unverified', 'mode_number', 'samedate_20',
       'anonymous', 'only_5star', '0_review', 'Easy_grade_rating', '5_star',
       '1_star'],
      dtype='object')

**Initially, there are 23 independent variables in total (the 24 columns above minus the dependent variable). We selected 13 of them based on how much they contribute to model performance.** <br>

* Dependent variable: 'Fraudulent'
* Independent variables:

1. 'Non_Verified_Purchases': Whether the review comes from a non-verified purchase. Set to 1 if true;
2. 'Nvr_verified_reviewer': Whether the reviewer has never written a verified-purchase review. Set to 1 if true;
3. 'Contains_rep_phrases': Whether the review repeats phrases that may indicate incentivized behavior. Reviews with problematic phrase repetition are labeled 1;
4. 'Overlapping_rev_history': Whether the reviewer has reviewed >= 3 of the same products as another reviewer. Set to 1 if true;
5. 'high_vol_day_rev': Whether the review was written on a date when the product received more reviews than usual. Set to 1 if true;
6. 'Take_backs': Whether the reviewer has deleted review(s). Set to 1 if true;
7. 'Overrep_wrd_cnt': Whether the review's word count falls into a range that contributes far more reviews for the product than expected from a comparison within the product category. Set to 1 if true;
8. 'One_hit': Whether the user has written only 1 review. Set to 1 if true;
9. 'incentivized': Whether the review contains any term from a pre-defined list of incentivized words, such as "free product". Set to 1 if true;
10. 'single_day': Whether the reviewer posted all reviews on the same date. Set to 1 if true;
11. 'samedate_20': Whether the reviewer posted >= 20 reviews on the same date. Set to 1 if true;
12. '0_review': Whether the reviewer's profile displays no reviews. Set to 1 if true;
13. 'Easy_grader': Whether the reviewer has an average rating of >= 4.5 AND gave a 5-star review for this purchase. Set to 1 if true.


In [0]:
# Create an interaction term: an easy grader (average rating >= 4.5) who gave this review 5 stars.
# The new interaction feature Easy_grader goes into the model in place of Easy_grade_rating and 5_star.
model_data['Easy_grader'] = model_data['5_star']*model_data['Easy_grade_rating']
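
Because both flags are 0/1, their product behaves like a logical AND: it is 1 only when both conditions hold. A minimal sketch on a made-up four-row frame (the values are illustrative and not taken from 'model_data.csv'):

```python
import pandas as pd

# Hypothetical toy rows covering all four combinations of the two flags
toy = pd.DataFrame({
    '5_star':            [1, 1, 0, 0],
    'Easy_grade_rating': [1, 0, 1, 0],
})

# Product of two binary columns == logical AND of the two conditions
toy['Easy_grader'] = toy['5_star'] * toy['Easy_grade_rating']

print(toy)
```

Only the first row, where the reviewer is an easy grader and also gave 5 stars, ends up with `Easy_grader` equal to 1.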

In [0]:
model_data = model_data[['Fraudulent', 'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases','Overlapping_rev_history','high_vol_day_rev', 'Take_backs','Overrep_wrd_cnt', 'One_hit', 'incentivized',
        'single_day','samedate_20', '0_review','Easy_grader']]

In [0]:
model_data['Fraudulent'].value_counts()

0    1954
1     714
Name: Fraudulent, dtype: int64

In a realistic situation, we would have a much smaller proportion of fraudulent reviews. To better train the model, we use an oversampled dataset with roughly 27% fraudulent reviews (714 of 2,668).

In [0]:
x = model_data.loc[:,'Non_Verified_Purchases':] 
y = model_data['Fraudulent']

In [0]:
# Split train and test dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 123, train_size = 0.7)
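
The split above is not stratified, so the share of fraudulent reviews can differ slightly between the training and test sets. One option, sketched here on synthetic labels rather than the notebook's data, is to pass `stratify=y` so that both parts keep roughly the original class ratio. The names `x_demo` and `y_demo` are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalance similar to the class counts above (714 of 2668)
rng = np.random.RandomState(123)
y_demo = (rng.rand(1000) < 0.27).astype(int)
x_demo = rng.rand(1000, 3)  # placeholder features, illustrative only

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.7, random_state=123, stratify=y_demo)

print('positive share overall/train/test: %.3f/%.3f/%.3f'
      % (y_demo.mean(), y_tr.mean(), y_te.mean()))
```

Whether stratifying would change the reported metrics here is not something this sketch can tell; it only shows how the option is used.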

In [0]:
x_train

Unnamed: 0,Non_Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,Overlapping_rev_history,high_vol_day_rev,Take_backs,Overrep_wrd_cnt,One_hit,incentivized,single_day,samedate_20,0_review,Easy_grader
2127,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
952,0,0,1,1,1,1,0,0,0,0,0.0,1.0,0
458,0,0,0,1,0,1,0,0,0,0,0.0,0.0,1
1599,0,0,0,1,0,0,0,0,0,0,0.0,0.0,1
2224,0,0,0,1,0,1,0,0,0,0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1147,0,0,0,0,0,0,1,1,0,0,0.0,0.0,1
2154,0,0,0,0,0,0,0,0,0,1,0.0,1.0,0
1766,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1122,0,0,0,0,0,1,0,0,0,0,0.0,1.0,0


In [0]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((1867, 13), (801, 13), (1867,), (801,))

In [0]:
clf = RandomForestClassifier(random_state = 123)
clf.fit(x_train, y_train)
test_pred = clf.predict(x_test)
train_pred = clf.predict(x_train)



**Evaluation**

In [0]:
print ('RF result: %.3f/%.3f' % (roc_auc_score(y_train, train_pred), roc_auc_score(y_test, test_pred)))
print ("=== Confusion Matrix ===")
print (confusion_matrix(y_test, test_pred))
print ('\n')
print ("=== Classification Report ===")
print (classification_report(y_test, test_pred))
print ('\n')

RF result: 0.930/0.905
=== Confusion Matrix ===
[[554  39]
 [ 26 182]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.96      0.93      0.94       593
           1       0.82      0.88      0.85       208

    accuracy                           0.92       801
   macro avg       0.89      0.90      0.90       801
weighted avg       0.92      0.92      0.92       801





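Precision indicates: of the reviews the model flags as fraudulent, the share that are actually fraudulent.

Recall indicates: of the reviews that are actually fraudulent, the share that the model flags.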
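
As a worked check, both values for the fraudulent class can be recomputed by hand from the confusion matrix printed above (rows are actual classes, columns are predicted classes):

```python
# Test-set confusion matrix of the default Random Forest, copied from the output above
tn, fp = 554, 39   # actual 0: predicted 0 / predicted 1
fn, tp = 26, 182   # actual 1: predicted 0 / predicted 1

precision = tp / (tp + fp)   # flagged reviews that are truly fraudulent
recall = tp / (tp + fn)      # fraudulent reviews that get flagged

print('precision: %.2f, recall: %.2f' % (precision, recall))
```

The results agree with the class-1 row of the classification report.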

### Gridsearch for best hyperparameter

In [0]:
clf = RandomForestClassifier(n_jobs=-1)

# Possible values for these hyperparameters
param_grid = {
    'min_samples_split': [3, 5, 10], 
    'n_estimators' : [100, 300],
    'max_depth': [3, 5, 15, 25],
    'max_features': [0.2, 0.35, 0.6, 0.8, 1.0]
}

# Evaluation metrics
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}

In [0]:
def grid_search_wrapper(refit_score='recall_score'):
    """
    fits a GridSearchCV classifier using refit_score for optimization
    prints classifier performance metrics
    """
    skf = StratifiedKFold(n_splits=10)
    grid_search = GridSearchCV(clf, param_grid, scoring=scorers, refit=refit_score,
                           cv=skf, return_train_score=True, n_jobs=-1)
    grid_search.fit(x_train.values, y_train.values)

    # make the predictions
    
    y_train_pred = grid_search.predict(x_train.values)
    y_test_pred = grid_search.predict(x_test.values)
    
    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)
    print('======================')

    # roc_auc_score
    print('roc_auc_score: %.3f/%.3f' % (roc_auc_score(y_train, y_train_pred), roc_auc_score(y_test, y_test_pred)))
    print('======================')
    
    # confusion matrix on the test data.
    print('\nConfusion matrix of Random Forest optimized for {} on the test data:'.format(refit_score))
    print(pd.DataFrame(confusion_matrix(y_test, y_test_pred),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))
    
    # classification report on the test data.
    print('=== Classification Report ===')
    print(classification_report(y_test, y_test_pred))
    return grid_search

**Evaluation**

We only look at the Confusion matrix optimized for recall_score. <br>
Note that it takes a long time to run the following chunk.
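
To see why the search is slow, it helps to count the models it has to train. The sketch below repeats the Random Forest grid defined earlier and multiplies the number of combinations by the 10 folds used in `grid_search_wrapper`:

```python
from itertools import product

# Same grid as defined for the Random Forest search above
param_grid = {
    'min_samples_split': [3, 5, 10],
    'n_estimators': [100, 300],
    'max_depth': [3, 5, 15, 25],
    'max_features': [0.2, 0.35, 0.6, 0.8, 1.0],
}
n_splits = 10  # folds used by StratifiedKFold in grid_search_wrapper

# Every combination of one value per hyperparameter
combos = list(product(*param_grid.values()))
n_fits = len(combos) * n_splits

print('%d combinations x %d folds = %d model fits' % (len(combos), n_splits, n_fits))
```

On top of these, GridSearchCV refits the best combination once on the full training set.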

In [0]:
grid_search_clf = grid_search_wrapper(refit_score='recall_score')



Best params for recall_score
{'max_depth': 25, 'max_features': 0.35, 'min_samples_split': 3, 'n_estimators': 100}
roc_auc_score: 0.944/0.920

Confusion matrix of Random Forest optimized for recall_score on the test data:
     pred_neg  pred_pos
neg       544        49
pos        16       192
=== Classification Report ===
              precision    recall  f1-score   support

           0       0.97      0.92      0.94       593
           1       0.80      0.92      0.86       208

    accuracy                           0.92       801
   macro avg       0.88      0.92      0.90       801
weighted avg       0.93      0.92      0.92       801
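
Note that the `roc_auc_score` values in this notebook are computed from hard 0/1 predictions rather than predicted probabilities. In that case the ROC curve has a single operating point, and the area reduces to the mean of the true positive rate and the true negative rate. A quick check against the test-set confusion matrix above:

```python
# Test-set confusion matrix of the grid-searched Random Forest, copied from the output above
tn, fp = 544, 49   # actual 0: predicted 0 / predicted 1
fn, tp = 16, 192   # actual 1: predicted 0 / predicted 1

tpr = tp / (tp + fn)  # recall on the fraudulent class
tnr = tn / (tn + fp)  # recall on the non-fraudulent class

# With hard labels, ROC AUC equals the average of the two rates
auc_from_labels = (tpr + tnr) / 2
print('%.3f' % auc_from_labels)
```

This reproduces the out-of-sample value reported above. An AUC computed from `predict_proba` scores would generally differ from it.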



In [0]:
# Get the feature importances from the best estimator
feature_importances = pd.DataFrame(grid_search_clf.best_estimator_.feature_importances_,
                                   index = x_train.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)

In [0]:
feature_importances

Unnamed: 0,importance
Easy_grader,0.238431
high_vol_day_rev,0.211279
Overlapping_rev_history,0.126146
0_review,0.088427
Contains_rep_phrases,0.061613
Take_backs,0.059005
Non_Verified_Purchases,0.056102
Nvr_verified_reviewer,0.053598
Overrep_wrd_cnt,0.050447
One_hit,0.025683


## XGBoost
### With default hyperparameter

In [0]:
xg_model = XGBRegressor(objective = 'binary:logistic')
xg_model.fit(x_train, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='binary:logistic',
             random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)

In [0]:
test_pred = xg_model.predict(x_test)
test_pred = [round(value) for value in test_pred]
train_pred = xg_model.predict(x_train)
train_pred = [round(value) for value in train_pred]
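
With `objective='binary:logistic'`, the regressor returns a score between 0 and 1, and `round` turns it into a label by cutting at 0.5 (apart from exact ties). Since recall is the priority in this project, it can help to make that cut-off explicit so that it can be lowered. A sketch with made-up scores; the `to_labels` helper, the `threshold` parameter, and the values are illustrative assumptions, not part of the notebook:

```python
import numpy as np

# Hypothetical predicted scores in [0, 1], standing in for xg_model.predict(...)
scores = np.array([0.10, 0.45, 0.55, 0.90])

def to_labels(scores, threshold=0.5):
    """Label a review as fraudulent (1) when its score reaches the threshold."""
    return (scores >= threshold).astype(int)

print(to_labels(scores))                 # cut at 0.5
print(to_labels(scores, threshold=0.4))  # lower cut-off flags more reviews
```

Lowering the threshold trades precision for recall, so any new value would need to be checked on held-out data.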

**Evaluation**

In [0]:
print ('xgboost result: %.3f/%.3f' % (roc_auc_score(y_train, train_pred), roc_auc_score(y_test, test_pred)))
print ("=== Confusion Matrix ===")
print (confusion_matrix(y_test, test_pred))
print ('\n')
print ("=== Classification Report ===")
print (classification_report(y_test, test_pred))
print ('\n')

xgboost result: 0.889/0.902
=== Confusion Matrix ===
[[562  31]
 [ 30 178]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       593
           1       0.85      0.86      0.85       208

    accuracy                           0.92       801
   macro avg       0.90      0.90      0.90       801
weighted avg       0.92      0.92      0.92       801





### Gridsearch for best hyperparameter

In [0]:
params = {'objective': ['binary:logistic'],
          'colsample_bytree': [0.2, 0.3, 0.4, 0.5], 'learning_rate': [0.1, 0.2, 0.3],
          'max_depth': [3, 5, 15, 25], 'alpha': [10, 11, 12]}

In [0]:
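# No `scoring` is passed, so the search ranks candidates by the estimator's
# default score (R^2 for XGBRegressor), not by recall as in the Random Forest search.
best_xgb = GridSearchCV(
    xg_model, param_grid=params, cv=10, verbose=0, n_jobs=-1)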

In [0]:
best_xgb.fit(x_train, y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=1, gamma=0,
                                    importance_type='gain', learning_rate=0.1,
                                    max_delta_step=0, max_depth=3,
                                    min_child_weight=1, missing=None,
                                    n_estimators=100, n_jobs=1, nthread=None,
                                    objective='binary:logistic', random_st...
                                    reg_alpha=0, reg_lambda=1,
                                    scale_pos_weight=1, seed=None, silent=None,
                                    subsample=1, verbosity=1),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': [10, 11, 12],
                         'colsample_bytree': [0.2, 0.3, 0.4, 0.5],
      

In [0]:
test_pred_grid = best_xgb.predict(x_test)
test_pred_grid = [round(value) for value in test_pred_grid]
train_pred_grid = best_xgb.predict(x_train)
train_pred_grid = [round(value) for value in train_pred_grid]

**Evaluation**

In [0]:
print ('xgboost result: %.3f/%.3f' % (roc_auc_score(y_train, train_pred_grid), roc_auc_score(y_test, test_pred_grid)))
print ("=== Confusion Matrix ===")
print (confusion_matrix(y_test, test_pred_grid))
print ('\n')
print ("=== Classification Report ===")
print (classification_report(y_test, test_pred_grid))
print ('\n')

xgboost result: 0.914/0.916
=== Confusion Matrix ===
[[559  34]
 [ 23 185]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.96      0.94      0.95       593
           1       0.84      0.89      0.87       208

    accuracy                           0.93       801
   macro avg       0.90      0.92      0.91       801
weighted avg       0.93      0.93      0.93       801





**Feature importance**

In [0]:
best_xgb.best_estimator_.get_booster().get_score(importance_type="gain")

{'Non_Verified_Purchases': 3.6706794878634437,
 'Take_backs': 3.6944666632300014,
 'Overrep_wrd_cnt': 1.894281941586701,
 'samedate_20': 0.8568166589799999,
 'incentivized': 0.56343452125,
 'Easy_grader': 15.795185098030851,
 '0_review': 2.9063036379911313,
 'Contains_rep_phrases': 2.881587537415733,
 'single_day': 1.3248296605870484,
 'One_hit': 1.2463246602344673,
 'Overlapping_rev_history': 3.6675401073728056,
 'high_vol_day_rev': 17.35198059797523,
 'Nvr_verified_reviewer': 1.4696701618345964}
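
The XGBoost values above are raw average gains, while the Random Forest importances sum to 1, so the two outputs are not on the same scale. One way to make the rankings easier to compare is to rescale the gains into shares of the total, as in this sketch (the dictionary is copied from the output above):

```python
# Gain values copied from the output above
gain = {
    'Non_Verified_Purchases': 3.6706794878634437,
    'Take_backs': 3.6944666632300014,
    'Overrep_wrd_cnt': 1.894281941586701,
    'samedate_20': 0.8568166589799999,
    'incentivized': 0.56343452125,
    'Easy_grader': 15.795185098030851,
    '0_review': 2.9063036379911313,
    'Contains_rep_phrases': 2.881587537415733,
    'single_day': 1.3248296605870484,
    'One_hit': 1.2463246602344673,
    'Overlapping_rev_history': 3.6675401073728056,
    'high_vol_day_rev': 17.35198059797523,
    'Nvr_verified_reviewer': 1.4696701618345964,
}

total = sum(gain.values())

# Share of total gain per feature, largest first
shares = sorted(((name, value / total) for name, value in gain.items()),
                key=lambda item: item[1], reverse=True)

for name, share in shares:
    print('%-25s %.3f' % (name, share))
```

The two importance measures are defined differently, so only the rankings are loosely comparable. In both models, 'high_vol_day_rev' and 'Easy_grader' are the two top-ranked features.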

# Apply model on Reviewbox Data


With the model trained, our next step is to apply feature engineering to Reviewbox's dataset to create exactly the same features as the trained model's input.<br> We then use the trained model to predict fraudulent reviews in Reviewbox's dataset.<br><br>
The dataset used is 'full_merged_data_RSC.csv', which was obtained from another Jupyter notebook: 'Review_Score_Feature_Engineering'.

In [0]:
full_data_RSC = pd.read_csv('full_merged_data_RSC.csv')

**Here we need to drop the rows with missing values (NAs). Some pages were not downloaded correctly during scraping, so we could not extract useful information from them.<br>
This also means that some reviews cannot be evaluated because of the scraping problem.**

In [0]:
full_data_RSC = full_data_RSC.dropna()
full_data_RSC = full_data_RSC.reset_index(drop = True)
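
Since dropped rows are reviews that never receive a prediction, it may be worth recording how many are lost at this step. A minimal sketch on a made-up frame (the rows and the `n_dropped` name are illustrative, not taken from 'full_merged_data_RSC.csv'):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for full_merged_data_RSC.csv
toy = pd.DataFrame({
    'reviewid': ['r1', 'r2', 'r3'],
    'One_hit': [1, np.nan, 0],   # NaN marks a profile page that failed to download
})

n_before = len(toy)
toy = toy.dropna().reset_index(drop=True)  # same two steps as in the cell above
n_dropped = n_before - len(toy)

print('dropped %d of %d reviews' % (n_dropped, n_before))
```

Resetting the index matters later on, because the final output is assembled by merging on the row index.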

In [0]:
full_data_RSC.columns

Index(['reviewid', 'author', 'source', 'product', 'profile',
       'Verified_Purchases', 'source_product', 'profile_id',
       'Non_Verified_Purchases', 'helpful_votes', 'name', 'num_of_reviews',
       'num_of_reviews_count', '0_review', 'One_hit', 'take_back',
       'Take_backs', 'num_of_verified', 'num_of_unverified',
       'Nvr_verified_reviewer', 'single_day', 'avg_rating',
       'Easy_grade_rating', 'mode_number', 'samedate_20', 'totalwords',
       'Overrep_wrd_cnt', 'high_vol_day_rev', 'Contains_rep_phrases',
       'incentivized', 'text', 'stars', '5_star', 'Easy_grader',
       'Overlapping_rev_history'],
      dtype='object')

In [0]:
x_train.columns

Index(['Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'Overlapping_rev_history', 'high_vol_day_rev',
       'Take_backs', 'Overrep_wrd_cnt', 'One_hit', 'incentivized',
       'single_day', 'samedate_20', '0_review', 'Easy_grader'],
      dtype='object')

In [0]:
# Selecting the exact same features.
model_data_RSC = full_data_RSC[['Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'Overlapping_rev_history', 'high_vol_day_rev',
       'Take_backs', 'Overrep_wrd_cnt', 'One_hit', 'incentivized',
       'single_day', 'samedate_20', '0_review', 'Easy_grader']]
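
The grid search was fitted on `x_train.values`, so the model identifies features by position only. The new data therefore needs the same columns in the same order as the training data. A sketch of a simple guard, using made-up stand-ins for `x_train.columns` and `full_data_RSC`:

```python
import pandas as pd

# Hypothetical stand-ins for x_train.columns and full_data_RSC
train_columns = ['One_hit', 'Take_backs', 'Easy_grader']
new_data = pd.DataFrame({
    'reviewid': ['r1', 'r2'],
    'Easy_grader': [1, 0],
    'One_hit': [0, 1],
    'Take_backs': [0, 0],
})

# Fail early if any training feature is absent from the new data
missing = [c for c in train_columns if c not in new_data.columns]
assert not missing, 'missing features: %s' % missing

# Index with the training column list so the order matches the fitted model
features = new_data[train_columns]
print(list(features.columns))
```

In the notebook itself, the same effect could be had by indexing `full_data_RSC` with `x_train.columns` instead of retyping the list.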

In [0]:
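# Use the best-performing model (Random Forest with Gridsearch) to predict whether each RSC review is fraudulent.
# The model was fitted on x_train.values, so pass a plain array here as well.
model_data_RSC_pred = grid_search_clf.predict(model_data_RSC.values)
# The classifier already returns 0/1 labels, so rounding leaves them unchanged.
model_data_RSC_pred = [round(value) for value in model_data_RSC_pred]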

In [0]:
model_data_RSC_pred = pd.DataFrame(model_data_RSC_pred)

In [0]:
# Adding additional information in the dataset for checking.
final_output_RSC = pd.merge(full_data_RSC[['reviewid','name','product','profile','text']],model_data_RSC, left_index = True, right_index = True)
final_output_RSC = pd.merge(final_output_RSC,model_data_RSC_pred, left_index = True, right_index = True)

In [0]:
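# Name the last column (the predictions) 'pred_fraud'; assigning to .columns avoids
# the `inplace` argument of set_axis, which newer pandas versions no longer accept.
final_output_RSC.columns = [*final_output_RSC.columns[:-1], 'pred_fraud']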

The final output is a table that contains all the reviews and their predicted fraud label ("pred_fraud" = 1 or 0). You can find the CSV file "final_output_RSC.csv" in the folder "sample_outputs". <br>
A glance at how the table looks:

![](https://drive.google.com/uc?id=1masKtwXWVJjxlAAVahxAbkBqgglqChlW)

In [0]:
final_output_RSC.to_csv('sample_outputs/final_output_RSC.csv')