# Documentation

1. This notebook prepares the dataset used for model training (`model_data.csv`).
2. **You don't need to run this notebook unless you have new data to add for re-training the model. Otherwise, treat it as a reference for our methodology.**
3. If you want to modify `model_data.csv` without adding new data, you can skip the web-scraping sections. In that case, **please run the code from section 3.3 to the end.**

**Input:**

1. review_dataframe_mega_ALL_New.csv - the web-scraped Reviewmeta (RM) dataset
2. wrong_link.csv - About two months after scraping, we re-checked the reviews in the RM dataset and found that some of them had been deleted. This csv contains the links of those deleted reviews.
3. Scraped_data.csv <br>
*Note: After scraping the profile pages and extracting features, we saved the result to Scraped_data.csv because scraping and feature extraction take a long time.*

**Output:**
1. model_data.csv




This notebook mainly consists of 3 parts:<br>
1. Get deleted reviews and good reviews and combine them together.
2. Scrape the profile pages to create more features.
3. Remove deleted reviews with only 1 or 2 flags to improve model performance.

In [0]:
import pandas as pd
import numpy as np
import re
import os 
import math
import datetime
import random
import requests
import bs4
from tqdm import tqdm_notebook as tqdm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Get deleted reviews and good reviews


## import datasets
1. review_dataframe_mega_ALL_New.csv: the web-scraped Reviewmeta (RM) dataset
2. wrong_link.csv: About two months after scraping, we re-checked the reviews in the RM dataset and found that some of them had been deleted. For an example, see https://www.amazon.com/gp/customer-reviews/R1KIX5COX51UFL?pldnSite=1.
   We therefore re-scraped the review links and identified those that show "Sorry, we couldn't find that page." All of these links are in this csv file.
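
The re-scrape itself is not part of this notebook. As a minimal sketch of the check described above, assuming the deletion notice appears verbatim in the page text (the helper name `is_deleted_page` is hypothetical):

```python
# Hypothetical helper: decide whether a fetched review page has been deleted.
# Assumes the deletion notice quoted above appears in the page text.
DELETED_MARKER = "Sorry, we couldn't find that page"

def is_deleted_page(html: str) -> bool:
    """Return True if the page text contains the deletion notice."""
    # The apostrophe may arrive as a typographic quote or an HTML entity,
    # so normalise the common variants before searching.
    normalised = (html.replace("\u2019", "'")
                      .replace("&#39;", "'")
                      .replace("&apos;", "'"))
    return DELETED_MARKER.lower() in normalised.lower()

print(is_deleted_page("<h1>Sorry, we couldn't find that page.</h1>"))  # True
print(is_deleted_page("<h1>Great product</h1>"))                       # False
```

Links for which such a check returns True would be the ones collected in wrong_link.csv.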

In [0]:
reviewmeta = pd.read_csv('review_dataframe_mega_ALL_New.csv', index_col=0)
wrong_link = pd.read_csv('wrong_link.csv')

In [0]:
reviewmeta.head()

Unnamed: 0,product,trust,Unnamed: 3,review_rating,review_title,reviewer_details,reviewer_link_RM,rvwr_text_Amazon,rvwr_link_Amazon,Amazon_ID,Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Average_Rating,Critical_Rev_rating,Take_backs,Take_backs_rating,Overrep_part,Overrep_wrd_cnt,Easy_grade_rating,Overlapping_rev_history,Brand_Rep_freq,Brand_rep_rating,One_hit,incentivized,Brand_repeater,Brand_Loyalist,Brand_Monogamist,single_day
0,B00SMMDNCA,1.0,0.970929,2,Nicht gut verpackt!,\n Verified PurchaserReviewer: Ulrike Kohlhaas...,https://reviewmeta.com/profile/amazon-de/A291K...,\n\t\t\t\tAn sich ein tolles Teil. Hatte mich ...,https://smile.amazon.de/gp/customer-reviews/RB...,RB8O5NGJMI0KN,1,0,0,0,5.0,0.0,0,0.0,0,0,5.0,0,0,0.0,0,0,0,0,0,0
1,B00SMMDNCA,1.0,0.966222,5,Würde ich wieder kaufen.,\n Verified PurchaserReviewer: M. Fritzsch\n\n...,https://reviewmeta.com/profile/amazon-de/A12RI...,\n\t\t\t\tIch war wirklich sehr sehr skeptisch...,https://smile.amazon.de/gp/customer-reviews/R3...,R3PWIWOZ36AHAB,1,0,0,0,4.0,4.0,0,0.0,0,0,0.0,0,0,0.0,0,0,0,0,0,0
2,B00SMMDNCA,1.0,0.970929,1,Fehlkauf,\n Verified PurchaserReviewer: Barbara\n\n Eas...,https://reviewmeta.com/profile/amazon-de/AELCA...,\n\t\t\t\tLeider hat es unser yorkshire terrie...,https://smile.amazon.de/gp/customer-reviews/R3...,R3MS9TWCGVYCIL,1,0,0,0,5.0,0.0,0,0.0,0,0,5.0,0,0,0.0,0,0,0,0,0,0
3,B00SMMDNCA,1.0,0.947291,1,Plastikschrott in grottenschlechter Qualität,\n Verified PurchaserReviewer: Martin Zimmerma...,https://reviewmeta.com/profile/amazon-de/A44CR...,\n\t\t\t\tVöllig ausgefranster Kunstrasen. Da ...,https://smile.amazon.de/gp/customer-reviews/R1...,R1HTSZPEKJTW5A,1,0,0,0,1.0,1.0,0,0.0,0,0,0.0,0,0,0.0,0,0,0,0,0,0
4,B00SMMDNCA,1.0,0.947291,1,Keine Funktion,\n Verified PurchaserReviewer: Ahmet Sahin\n\n...,https://reviewmeta.com/profile/amazon-de/AHMUV...,\n\t\t\t\tHilft nichts [Go to full review]\n,https://smile.amazon.de/gp/customer-reviews/RC...,RCXPXSEUFKKQZ,1,0,0,0,1.0,1.0,0,0.0,0,0,0.0,0,0,0.0,0,0,0,0,0,0


In [0]:
wrong_link.head()

Unnamed: 0,wrong link
0,https://smile.amazon.com/gp/customer-reviews/R...
1,https://smile.amazon.com/gp/customer-reviews/R...
2,https://smile.amazon.com/gp/customer-reviews/R...
3,https://smile.amazon.com/gp/customer-reviews/R...
4,https://smile.amazon.com/gp/customer-reviews/R...


In [0]:
deleted_reviews = pd.merge(reviewmeta,wrong_link,left_on='rvwr_link_Amazon',right_on='wrong link')
deleted_reviews['rvwr_link_Amazon'].nunique()
# We have 921 deleted reviews.

921

In [0]:
# Some rows share the same Amazon review link but have different product ASINs. Since they are the same review,
# we remove the duplicates based on the review link.
deleted_reviews = deleted_reviews.drop_duplicates(subset='rvwr_link_Amazon', keep="first")

## Data cleaning

In [0]:
# Rename columns and drop the unnecessary ones.
deleted_reviews.rename(columns={"trust": "RM_Score", "Unnamed: 3": "RB_Score"}, inplace = True)
deleted_reviews.drop(['wrong link'], axis=1, inplace = True)
deleted_reviews = deleted_reviews.drop(['Average_Rating','RB_Score','Take_backs_rating','Brand_Rep_freq','Brand_rep_rating','product', 'RM_Score', 'review_title', 'reviewer_details', 'rvwr_text_Amazon','rvwr_link_Amazon'], axis=1)
deleted_reviews = deleted_reviews.reset_index(drop = True)

In [0]:
# All other flags use 1 for "suspicious" and 0 for "not suspicious", so we invert Verified_Purchases and rename it to Non_Verified_Purchases.
def modify_column_veri_purchase(df):
    df['Verified_Purchases'] = 1-df['Verified_Purchases']
    df.rename(columns = {'Verified_Purchases': 'Non_Verified_Purchases'}, inplace = True)
modify_column_veri_purchase(deleted_reviews)

## Get good reviews 
Get good reviews by subtracting the datasets: <br>full - bad reviews = good reviews

In [0]:
good_reviews = pd.merge(reviewmeta, wrong_link, left_on='rvwr_link_Amazon',right_on='wrong link', how = "outer", indicator=True)
good_reviews = good_reviews[good_reviews['_merge'] == 'left_only']
# Drop duplicates, rename columns, and drop the unnecessary ones.
good_reviews = good_reviews.drop_duplicates(subset='rvwr_link_Amazon', keep="first")
good_reviews.drop(['wrong link','_merge'], axis=1, inplace = True)
good_reviews.rename(columns={"trust": "RM_Score", "Unnamed: 3": "RB_Score"}, inplace = True)
good_reviews = good_reviews.drop(['Average_Rating','RB_Score','Take_backs_rating','Brand_Rep_freq','Brand_rep_rating','product', 'review_title', 'reviewer_details', 'rvwr_text_Amazon','rvwr_link_Amazon'], axis=1).reset_index(drop = True)
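
The subtraction above is an anti-join. A minimal sketch on toy data shows how `indicator=True` makes it work; the column names mirror the notebook, but the values are made up:

```python
import pandas as pd

# Toy data; the values are invented for illustration.
full = pd.DataFrame({"rvwr_link_Amazon": ["a", "b", "c", "d"]})
bad = pd.DataFrame({"wrong link": ["b", "d"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = pd.merge(full, bad, left_on="rvwr_link_Amazon", right_on="wrong link",
                  how="outer", indicator=True)

# Rows found only in the full dataset are the ones that were not deleted.
good = merged.loc[merged["_merge"] == "left_only", ["rvwr_link_Amazon"]]
good_links = good["rvwr_link_Amazon"].tolist()
print(good_links)
```

An equivalent filter without a merge is `full[~full['rvwr_link_Amazon'].isin(bad['wrong link'])]`.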

## Sampling the good reviews
1. We have 13559 good reviews but only 921 deleted (bad) reviews, so a modeling dataset built from all of them would be extremely unbalanced. We therefore sample from the good reviews. <br>
2. Specifically, we use stratified sampling over bins of RM_Score. This helps us build a stronger classifier: if we only included good reviews with a 100% authenticity score, they would have very few flags, and separating them from fraudulent reviews would be trivially easy.
3. We select 450 reviews from each bin, except for the bin from 0.5 to 0.7, which contains only 154 reviews and is kept in full. This keeps the sampled dataset from being too skewed.

In [0]:
# Check the distribution of RM_Score. 
bins = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
good_reviews['categories'] = pd.cut(good_reviews['RM_Score'], bins)
good_reviews.groupby('categories').count()

categories,RM_Score,review_rating,reviewer_link_RM,Amazon_ID,Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Critical_Rev_rating,Take_backs,Overrep_part,Overrep_wrd_cnt,Easy_grade_rating,Overlapping_rev_history,One_hit,incentivized,Brand_repeater,Brand_Loyalist,Brand_Monogamist,single_day
"(0.1, 0.3]",472,472,472,472,472,472,472,472,472,472,472,472,472,472,472,472,472,472,472,472
"(0.3, 0.5]",985,985,985,985,985,985,985,985,985,985,985,985,985,985,985,985,985,985,985,985
"(0.5, 0.7]",154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154,154
"(0.7, 0.9]",1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118
"(0.9, 1.0]",7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277,7277


In [0]:
RM_Score = good_reviews['RM_Score']
sample_df1 = good_reviews[(RM_Score <= 0.3) & (RM_Score > 0.1)].sample(n = 450, random_state=123)
sample_df2 = good_reviews[(RM_Score <= 0.5) & (RM_Score > 0.3)].sample(n = 450, random_state=123)
sample_df3 = good_reviews[(RM_Score <= 0.7) & (RM_Score > 0.5)]
sample_df4 = good_reviews[(RM_Score <= 0.9) & (RM_Score > 0.7)].sample(n = 450, random_state=123)
sample_df5 = good_reviews[(RM_Score <= 1.0) & (RM_Score > 0.9)].sample(n = 450, random_state=123)
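
The five hand-written filters can also be expressed as one loop over bins. The following is a sketch of that generalisation, not the code used to build the dataset: the helper `sample_per_bin` is hypothetical, and it will not necessarily draw the same rows as the cell above.

```python
import pandas as pd

def sample_per_bin(df, score_col, bins, n, seed=123):
    """Take at most n rows from each score bin; smaller bins are kept whole."""
    binned = pd.cut(df[score_col], bins)  # right-closed intervals, e.g. (0.1, 0.3]
    parts = []
    for _, group in df.groupby(binned, observed=True):
        parts.append(group.sample(n=min(n, len(group)), random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Made-up scores: three in (0.1, 0.3], one in (0.5, 0.7], three in (0.9, 1.0].
toy = pd.DataFrame({"RM_Score": [0.2, 0.25, 0.28, 0.6, 0.95, 0.97, 0.99]})
sampled = sample_per_bin(toy, "RM_Score", [0.1, 0.3, 0.5, 0.7, 0.9, 1.0], n=2)
print(len(sampled))  # 5
```

Capping with `min(n, len(group))` plays the role of keeping the small 0.5 to 0.7 bin in full.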

In [0]:
good_reviews_sample = pd.concat([sample_df1, sample_df2, sample_df3, sample_df4, sample_df5], ignore_index = True)


In [0]:
# Drop the helper column 'categories'; it was only needed for binning.
good_reviews_sample = good_reviews_sample.drop('categories', axis = 1)
modify_column_veri_purchase(good_reviews_sample)

In [0]:
# RM_Score now becomes the target label: 1 for deleted reviews and 0 for the sampled good reviews.
deleted_reviews['RM_Score'] = 1
good_reviews_sample['RM_Score'] = 0

In [0]:
good_reviews_sample.columns

Index(['RM_Score', 'review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Critical_Rev_rating',
       'Take_backs', 'Overrep_part', 'Overrep_wrd_cnt', 'Easy_grade_rating',
       'Overlapping_rev_history', 'One_hit', 'incentivized', 'Brand_repeater',
       'Brand_Loyalist', 'Brand_Monogamist', 'single_day'],
      dtype='object')

In [0]:
deleted_reviews.columns

Index(['review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Critical_Rev_rating',
       'Take_backs', 'Overrep_part', 'Overrep_wrd_cnt', 'Easy_grade_rating',
       'Overlapping_rev_history', 'One_hit', 'incentivized', 'Brand_repeater',
       'Brand_Loyalist', 'Brand_Monogamist', 'single_day', 'RM_Score'],
      dtype='object')

In [0]:
# make the order of columns the same for two datasets.
deleted_reviews = deleted_reviews[['RM_Score', 'review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Critical_Rev_rating',
       'Take_backs', 'Overrep_part', 'Overrep_wrd_cnt', 'Easy_grade_rating',
       'Overlapping_rev_history', 'One_hit', 'incentivized', 'Brand_repeater',
       'Brand_Loyalist', 'Brand_Monogamist', 'single_day']]
deleted_reviews = deleted_reviews.reset_index(drop = True)

## Combining deleted reviews and good reviews

In [0]:
combined_df = pd.concat([deleted_reviews, good_reviews_sample], sort = False).reset_index(drop = True)

In [0]:
combined_df

Unnamed: 0,review_rating,reviewer_link_RM,Amazon_ID,Non_Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Critical_Rev_rating,Take_backs,Overrep_part,Overrep_wrd_cnt,Easy_grade_rating,Overlapping_rev_history,One_hit,incentivized,Brand_repeater,Brand_Loyalist,Brand_Monogamist,single_day,RM_Score
0,5,https://reviewmeta.com/profile/amazon/AFG4VMDI...,RXDGH790RKPUF,1,1,0,0,0.0,0,0,0,5.0,0,1,0,0,0,0,0,1
1,1,https://reviewmeta.com/profile/amazon/AHWBOFLE...,RVJE4LSV9ZWLK,1,1,0,0,1.0,0,1,0,0.0,0,1,0,0,0,0,0,1
2,2,https://reviewmeta.com/profile/amazon/A12K842R...,RV3XIX9GL0RTH,1,1,1,0,2.0,0,1,0,0.0,0,1,0,0,0,0,0,1
3,5,https://reviewmeta.com/profile/amazon/A18LBGL7...,R8P2NMWQ7HZFO,1,0,1,0,0.0,1,0,0,4.6,1,0,0,1,0,0,0,1
4,5,https://reviewmeta.com/profile/amazon/A315QJ0Z...,R1OF6OLI5LWG8T,0,0,0,1,0.0,1,0,0,4.9,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2870,4,https://reviewmeta.com/profile/amazon/AHTHWVPV...,RAXV97PZ9VAVS,0,0,0,0,4.0,0,0,0,0.0,0,0,0,0,0,0,0,0
2871,5,https://reviewmeta.com/profile/amazon/A20W67PS...,R1AM6XQJML5XE7,0,0,0,0,0.0,0,0,0,4.5,0,0,0,0,0,0,0,0
2872,4,https://reviewmeta.com/profile/amazon-de/A1P8M...,R327FYLRX3UVKA,0,0,0,0,4.2,0,0,0,0.0,0,0,0,0,0,0,0,0
2873,3,https://reviewmeta.com/profile/amazon-de/AGHA2...,R3FYWX703RVOD7,0,0,0,0,0.0,0,0,0,4.4,0,0,0,0,0,0,0,0


# Web-scraping reviewer profiles
1. We are doing this because we want to create more profile-related features to help predict the score (0/1).
2. Don't run this whole section unless there's new data coming in and you need profile pages.

## Scrape and get reviewer profile link
Currently, we only have reviewer_link_RM, which looks like this: https://reviewmeta.com/profile/amazon/AFG4VMDIIXHIEJQ5XF2JVPKOE4LA
Thus, in order to scrape the Amazon profile pages, we first need to scrape each reviewer_link_RM page and extract the Amazon reviewer profile link from it.

In [0]:
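user_agent = {'User-agent': 'Mozilla/5.0'}
for i in tqdm(range(len(combined_df))):
    url = combined_df.loc[i, 'reviewer_link_RM']
    response = requests.get(url, headers = user_agent)
    soup = bs4.BeautifulSoup(response.text)
    # The Amazon profile link is the first <a> inside the third 'col-md-8' div.
    profile_url = soup.find_all('div', class_ = 'col-md-8')[2].find('a').get('href')
    combined_df.loc[i,'profile_url'] = profile_url
# final_df was never defined in the original notebook; we assume it is combined_df with the new column.
final_df = combined_df
profile_url = final_df['profile_url']
# Write without a header row so that the file matches the read below (header = None).
profile_url.to_csv('profile_url.csv', header = False)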

In [0]:
profile_url = pd.read_csv('profile_url.csv', header = None, index_col = 0)
final_df['profile_url'] = profile_url

## Web-scrape reviewer profile
1. We use Selenium's webdriver because the profile page only loads more reviews as you scroll down. The webdriver scrolls automatically so that all reviews are scraped. Otherwise, only around 10 reviews would be captured.

In [0]:
%%time
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Note: executable_path and find_element_by_tag_name belong to the Selenium 3 API;
# Selenium 4 replaces them with Service(...) and find_element(By.TAG_NAME, ...).
d = webdriver.Chrome(executable_path=os.path.abspath('chromedriver'))
for i in tqdm(range(2475, 2875)):  # Rows 2475-2874 only; adjust the range to scrape other rows.
    time.sleep(3) # Wait 3 seconds before the next scrape.
    newurl = final_df.loc[i,'profile_url']
    Amazon_ID = final_df.loc[i,'Amazon_ID']

    body = d.find_element_by_tag_name("body")
    body.send_keys(Keys.CONTROL + 't')

    d.get(newurl)
    d.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 'w')
    d.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    lenOfPage = d.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # Keep scrolling to the bottom until the page height stops growing or the round limit is hit.
    match = False
    counter = 0
    while not match:
        counter = counter + 1
        if counter >= 10:
            break
        lastCount = lenOfPage
        time.sleep(3) # Give the newly loaded reviews time to render.
        lenOfPage = d.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount == lenOfPage:
            match = True

    # Save the fully loaded page source to one file per review, named after its Amazon_ID.
    path = os.getcwd() + "/profile_RM/"
    name = Amazon_ID + '.txt'
    with open(path + name, 'w', encoding="utf-8") as file:
        file.write(d.page_source)

# Close the browser window that webdriver opened once all pages are saved.
d.close()
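
The scrolling logic in the cell above can be isolated into a small function. This is a sketch only: `scroll_until_stable` and its parameters are hypothetical names, and a plain iterator stands in for the browser so the example runs without Selenium.

```python
import time

def scroll_until_stable(scroll_and_measure, max_rounds=10, pause=3.0):
    """Call scroll_and_measure() until its return value stops changing.

    scroll_and_measure is any zero-argument callable that scrolls to the
    bottom of the page and returns the current page height.
    Returns the last measured height.
    """
    height = scroll_and_measure()
    for _ in range(max_rounds):
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = scroll_and_measure()
        if new_height == height:  # nothing new was loaded
            break
        height = new_height
    return height

# Stand-in for a browser: the page grows twice, then stops.
heights = iter([1000, 2000, 3000, 3000, 3000])
print(scroll_until_stable(lambda: next(heights), pause=0))  # 3000
```

With the webdriver `d` from the cell above, the callable could be `lambda: d.execute_script("window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")`.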

In [0]:
# Save all the profile txts in a dictionary.
soup = {}
path = os.path.join(os.getcwd(), 'profile_RM')
for i in tqdm(range(0, 2875)):
    try:
        Amazon_ID = final_df.loc[i,'Amazon_ID']
        name = Amazon_ID + '.txt'
        with open(os.path.join(path, name), "r", encoding="utf-8") as f:
            soup[i] = bs4.BeautifulSoup(f.read()) # Create a BeautifulSoup object from the saved page.
    except Exception:
        print(i) # Print the row index of any profile that could not be loaded.

In [0]:
# extracting more features
for i in tqdm(range(0, 2875)):
    try:
        tag0 = soup[i].find_all('div', class_='dashboard-desktop-stat-value')[0] 
        final_df.loc[i,'helpful_votes'] = tag0.find('span', class_='a-size-large a-color-base').get_text() 

        for tag in soup[i].find_all('div', class_='a-row a-spacing-none name-container'):    
            final_df.loc[i,'name'] = tag.find('span', class_='a-size-extra-large').get_text() 

        tag1 = soup[i].find_all('div', class_='dashboard-desktop-stat-value')[1]    
        final_df.loc[i,'num_of_reviews'] = int(tag1.find('span', class_='a-size-large a-color-base').get_text())
        
        final_df.loc[i,'num_of_reviews_count'] = len(soup[i].find_all('div', class_='a-section profile-at-content'))
        
        
        # verified
        verified = []
        if len(soup[i].find_all('div', class_='a-row a-spacing-mini')) == 0:
            final_df.loc[i,'num_of_verified'] = 0 
        else:
            for tag in soup[i].find_all('div', class_='a-row a-spacing-mini'): 
                try:   
                    verified.append(tag.find('span', class_='a-size-small a-color-state profile-at-review-badge a-text-bold').get_text())
                    final_df.loc[i,'num_of_verified'] = len(verified)
                except:
                    continue 


        final_df.loc[i,'num_of_unverified'] = final_df.loc[i,'num_of_reviews_count'] - final_df.loc[i,'num_of_verified']

        
        date_mode_number = []
        
        # mode_number is the number of reviews posted on the reviewer's most frequent date,
        # i.e. how many times the most common date appears on the profile page.
        if len(soup[i].find_all('div', class_='a-profile-content')) == 0:
            final_df.loc[i,'mode_number'] = 0
        else:
            for tag in soup[i].find_all('div', class_='a-profile-content'):
                date_mode_number.append(tag.find('span', class_='a-profile-descriptor').get_text())
                final_df.loc[i,'mode_number'] = len([j for j, review in enumerate(date_mode_number) if review == max(set(date_mode_number), key=date_mode_number.count)])
        if final_df.loc[i,'mode_number'] > 20:
            final_df.loc[i,'samedate_20'] = 1
        else:
            final_df.loc[i,'samedate_20'] = 0

        # reviewer anonymous
        if ('Customer' in final_df.loc[i,'name']) | ('customer' in final_df.loc[i,'name']):
            final_df.loc[i,'anonymous'] = 1
        else:
            final_df.loc[i,'anonymous'] = 0
        
        # only 5 star reviews
        star5 = []
        if len(soup[i].find_all('div',class_='a-section a-spacing-mini')) == 0:
            final_df.loc[i,'only_5star'] = 0
        else:
            for tag in soup[i].find_all('div',class_='a-section a-spacing-mini'):
                star5.append(tag.find('span',class_='a-icon-alt').text)
            # Caution: this only matches if the label reads exactly '5 out of five stars'.
            # Check the wording in the saved pages (e.g. '5.0 out of 5 stars') before relying on this feature.
            if (len(set(star5)) == 1) and ('5 out of five stars' in set(star5)):
                final_df.loc[i,'only_5star'] = 1
            else:
                final_df.loc[i,'only_5star'] = 0
    except:
        print(i)
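
The `mode_number` and `samedate_20` features can be illustrated on their own. The following is a minimal sketch with made-up dates; the helper `most_common_date_count` is hypothetical and uses `collections.Counter` instead of the list comprehension above.

```python
from collections import Counter

def most_common_date_count(dates):
    """Return how many times the most frequent date occurs (0 if empty)."""
    if not dates:
        return 0
    return Counter(dates).most_common(1)[0][1]

# Made-up review dates for one profile.
dates = ["Jan 3, 2020", "Jan 3, 2020", "Feb 9, 2020", "Jan 3, 2020"]
mode_number = most_common_date_count(dates)
samedate_20 = int(mode_number > 20)  # 1 only if more than 20 reviews share a date
print(mode_number, samedate_20)  # 3 0
```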
  

In [0]:
final_df.to_csv('Scraped_data.csv')

## Read the csv with new profile-related features

In [0]:
Scraped_data = pd.read_csv('Scraped_data.csv', index_col = 0)

In [0]:
Scraped_data.head()

Unnamed: 0,RM_Score,review_rating,reviewer_link_RM,Amazon_ID,Non_Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Take_backs,Overrep_part,Overrep_wrd_cnt,Average_Rating,Overlapping_rev_history,One_hit,incentivized,Brand_repeater,Brand_Loyalist,Brand_Monogamist,single_day,profile_url,helpful_votes,name,num_of_reviews,num_of_reviews_count,num_of_verified,num_of_unverified,mode_number,samedate_20,anonymous,only_5star
0,1,5,https://reviewmeta.com/profile/amazon/AFG4VMDI...,RXDGH790RKPUF,1,1,0,0,0,0,0,5.0,0,1,0,0,0,0,0,https://smile.amazon.com/gp/profile/amzn1.acco...,0,PSPP Inc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1,https://reviewmeta.com/profile/amazon/AHWBOFLE...,RVJE4LSV9ZWLK,1,1,0,0,0,1,0,1.0,0,1,0,0,0,0,0,https://smile.amazon.com/gp/profile/amzn1.acco...,0,Philip Powell,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,2,https://reviewmeta.com/profile/amazon/A12K842R...,RV3XIX9GL0RTH,1,1,1,0,0,1,0,2.0,0,1,0,0,0,0,0,https://smile.amazon.com/gp/profile/amzn1.acco...,0,K. Jan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,5,https://reviewmeta.com/profile/amazon/A18LBGL7...,R8P2NMWQ7HZFO,1,0,1,0,1,0,0,4.6,1,0,0,1,0,0,0,https://smile.amazon.com/gp/profile/amzn1.acco...,0,Larry,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,5,https://reviewmeta.com/profile/amazon/A315QJ0Z...,R1OF6OLI5LWG8T,0,0,0,1,1,0,0,4.9,1,0,0,0,0,0,0,https://smile.amazon.com/gp/profile/amzn1.acco...,0,Pete Ramos,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
Scraped_data.columns

Index(['RM_Score', 'review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Take_backs',
       'Overrep_part', 'Overrep_wrd_cnt', 'Average_Rating',
       'Overlapping_rev_history', 'One_hit', 'incentivized', 'Brand_repeater',
       'Brand_Loyalist', 'Brand_Monogamist', 'single_day', 'profile_url',
       'helpful_votes', 'name', 'num_of_reviews', 'num_of_reviews_count',
       'num_of_verified', 'num_of_unverified', 'mode_number', 'samedate_20',
       'anonymous', 'only_5star'],
      dtype='object')

## Add new columns

### adding a new column: 0_review
Some reviewers have posted reviews before, but their profile page no longer shows any. The 0_review flag marks these cases.


In [0]:
model_data = Scraped_data

In [0]:
# Flag reviewers whose profile shows no reviews (the count is 0 or missing).
count = model_data['num_of_reviews_count']
model_data['0_review'] = (count.isna() | (count == 0)).astype(float)

### Create and modify rating related feature
1. Easy_grade_rating: We create this for explainability. It is hard to explain how a raw average rating acts in a classification model, so we convert it into a binary "easy grader" flag (average rating of 4.5 or higher).
2. 5_star: indicates whether the rating of this individual review is 5
3. 1_star: indicates whether the rating of this individual review is 1

In [0]:
model_data['Easy_grade_rating'] = np.where(model_data['Average_Rating'] >= 4.5, 1, 0)
model_data = model_data.drop('Average_Rating', axis = 1)

In [0]:
model_data['5_star'] = np.where(model_data['review_rating'] == 5, 1, 0)

In [0]:
model_data['1_star'] = np.where(model_data['review_rating'] == 1, 1, 0)

# Removing reviews with only 1/2 flags
To make our model stricter and reduce false positives, we remove reviews that were deleted by Amazon but have fewer than 3 flags. Because the filter is `num_flags < 3`, deleted reviews with no flags at all are removed as well.

In [0]:
model_data.columns

Index(['RM_Score', 'review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Take_backs',
       'Overrep_part', 'Overrep_wrd_cnt', 'Overlapping_rev_history', 'One_hit',
       'incentivized', 'Brand_repeater', 'Brand_Loyalist', 'Brand_Monogamist',
       'single_day', 'profile_url', 'helpful_votes', 'name', 'num_of_reviews',
       'num_of_reviews_count', 'num_of_verified', 'num_of_unverified',
       'mode_number', 'samedate_20', 'anonymous', 'only_5star', '0_review',
       'Easy_grade_rating', '5_star', '1_star'],
      dtype='object')

In [0]:
num_flags = pd.DataFrame(model_data[['Non_Verified_Purchases','Nvr_verified_reviewer', 'Contains_rep_phrases', 
    'high_vol_day_rev','Take_backs', 'Overrep_part', 'Overrep_wrd_cnt',
    'Overlapping_rev_history', 'One_hit', 'incentivized', 'Brand_repeater',
    'Brand_Loyalist', 'Brand_Monogamist', 'single_day']].apply(lambda x: x.sum(), axis = 1), columns = ['num_flags'])

In [0]:
model_data_flag = model_data.merge(num_flags, left_index = True, right_index = True)

In [0]:
# Drop deleted reviews (RM_Score == 1) that have fewer than 3 flags.
few_flags = (model_data_flag['RM_Score'] == 1) & (model_data_flag['num_flags'] < 3)
model_data_flag = model_data_flag[~few_flags]


In [0]:
model_data_flag = model_data_flag.reset_index(drop = True)
model_data = model_data_flag
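
The flag-count filter can be checked on a toy frame. This is a sketch under stated assumptions: `flag_a`, `flag_b` and `min_flags` are hypothetical names, with two columns standing in for the fourteen flags used above.

```python
import pandas as pd

# Toy data: two flag columns stand in for the fourteen used above.
toy = pd.DataFrame({
    "RM_Score": [1, 1, 0, 0],
    "flag_a":   [1, 1, 0, 1],
    "flag_b":   [1, 0, 0, 1],
})
flag_cols = ["flag_a", "flag_b"]
toy["num_flags"] = toy[flag_cols].sum(axis=1)  # row-wise count of raised flags

# Drop deleted reviews (RM_Score == 1) with fewer than min_flags flags;
# reviews labelled 0 are kept regardless of their flag count.
min_flags = 2
weak_positive = (toy["RM_Score"] == 1) & (toy["num_flags"] < min_flags)
filtered = toy[~weak_positive].reset_index(drop=True)
print(filtered["num_flags"].tolist())  # [2, 0, 2]
```

Only the second row is removed: it is labelled as deleted but raises a single flag.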

In [0]:
model_data.columns

Index(['RM_Score', 'review_rating', 'reviewer_link_RM', 'Amazon_ID',
       'Non_Verified_Purchases', 'Nvr_verified_reviewer',
       'Contains_rep_phrases', 'high_vol_day_rev', 'Take_backs',
       'Overrep_part', 'Overrep_wrd_cnt', 'Overlapping_rev_history', 'One_hit',
       'incentivized', 'Brand_repeater', 'Brand_Loyalist', 'Brand_Monogamist',
       'single_day', 'profile_url', 'helpful_votes', 'name', 'num_of_reviews',
       'num_of_reviews_count', 'num_of_verified', 'num_of_unverified',
       'mode_number', 'samedate_20', 'anonymous', 'only_5star', '0_review',
       'Easy_grade_rating', '5_star', '1_star', 'num_flags'],
      dtype='object')

# Data cleaning and export dataset
In this part, we mainly do some data cleaning, deal with the null values and export the dataset for modeling.

In [0]:
# NA appears in this column when none of the reviewer's reviews are verified, so we fill it with the total review count.
model_data['num_of_unverified'] = model_data['num_of_unverified'].fillna(model_data['num_of_reviews_count'])

In [0]:
# Drop unnecessary columns.
model_data = model_data.drop(['reviewer_link_RM','Amazon_ID','profile_url','name','num_of_reviews', 
                              'num_of_reviews_count','num_of_verified','helpful_votes','num_flags','review_rating'], axis = 1)

In [0]:
# Rename the target variable to 'Fraudulent'.
model_data.rename(columns = {'RM_Score':'Fraudulent'}, inplace = True)

In [0]:
# These NAs are there because there are no reviews on the reviewer page. 0_review is 1 for all these rows.
pd.options.display.max_columns = None
# Select each row that has at least one null value (once per row).
model_data[model_data.isnull().any(axis = 1)]

Unnamed: 0,Fraudulent,Non_Verified_Purchases,Nvr_verified_reviewer,Contains_rep_phrases,high_vol_day_rev,Take_backs,Overrep_part,Overrep_wrd_cnt,Overlapping_rev_history,One_hit,incentivized,Brand_repeater,Brand_Loyalist,Brand_Monogamist,single_day,num_of_unverified,mode_number,samedate_20,anonymous,only_5star,0_review,Easy_grade_rating,5_star,1_star
64,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,,,,,,1.0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2664,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,1.0,0,0,1


In [0]:
model_data = model_data.fillna(0)

In [0]:
model_data.to_csv('model_data.csv')