# Documentation

This notebook is used to generate features for model fitting. By running through it with required inputs, you can get one full table that stores all the reviews and their features we selected for modeling. We used reviews from the brand RSC to create these features. And if you only want to test one/some of the features, you can refer to respective chunks and export the outputs. <br>

<br>
Features included are listed below:<br> 

*Features that need profile id's scraped pages to extract*: <br>
<li>'0_review': Whether the reviewer's profile has no reviews displayed. Will be inputted as 1 if true;<br></li>
<li>'One_hit': Indicates that the user has only given 1 review. Will be inputted as 1 if true; <br></li>
<li>'Take_backs': Whether the reviewer has deleted review(s). Will be inputted as 1 if true;<br></li>
<li>'Nvr_verified_reviewer': Whether the reviewer has never written a verified purchaser review. Will be inputted as 1 if true;<br></li>
<li>'single_day': Whether the reviewer posted all reviews on a same date. Will be inputted as 1 if true;<br></li>
<li>'Easy_grade_rating': Whether the reviewer has an average rating of >= 4.5. Will be inputted as 1 if true;<br></li>
<li>'samedate_20': Whether the reviewer posted >= 20 reviews in a same date. Will be inputted as 1 if true;<br></li>
<li>'Easy_grader': Whether the reviewer has an average rating of >= 4.5 AND gave a 5-star review for this purchase. Will be inputted as 1 if true;<br></li>

*Features that don't need scraped pages to extract*:<br>
<li>'Non_Verified_Purchases': Whether the review comes from a non-verified purchase. Will be inputted as 1 if true;<br> </li>
<li>'high_vol_day_rev': Whether the review was written on a date when a certain product got a larger amount of reviews than usual. Will be inputted as 1 if true;<br></li>
<li>'Overrep_wrd_cnt': Whether there was a specific word count range that contributes much more reviews for a certain product based on a comparison within the product category, and the review falls into that range group. Will be inputted as 1 if true;<br></li>
<li>'incentivized': Use a pre-defined list of incentivized words, like "free product", and check the incentivized words existence. Will be inputted as 1 if true;<br></li>
<li>'Contains_rep_phrases': Phrases that have a potential to indicate incentivized behaviors are selected to help detect reviews. Find reviews with problematic phrase repetition and label them as 1;<br></li>
<li>'Overlapping_rev_history': Whether a reviewer has reviewed >= 3 products that are same as another reviewer. Will be inputted as 1 if true;<br></li>

**Input:** <br>
1. RB Review dataset: RSC reviews with profile ids.csv<br>
2. Profile pages scraped for feature extraction<br>
3. RB Sales rank dataset: SalesRankExport_f0337c16-d7f3-4fc0-a46b-a0e14f18b595.csv<br>

**Output:** <br>
full_merged_data_RSC.csv

In [0]:
import pandas as pd
import numpy as np
import requests
from selenium import webdriver
import time
import bs4
import re
import os 
import math
import datetime
from tqdm import tqdm_notebook as tqdm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import loadtxt
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tnrange

In [0]:
# Read the dataset with the profile links
profile_urls = pd.read_csv('RSC reviews with profile ids.csv')

# Features that need scraped pages to extract

*Features that need scraped profile pages to extract*: <br>
<li>'0_review': Whether the reviewer's profile has no reviews displayed. Will be inputted as 1 if true;<br></li>
<li>'One_hit': Indicates that the user has only given 1 review. Will be inputted as 1 if true; <br></li>
<li>'Take_backs': Whether the reviewer has deleted review(s). Will be inputted as 1 if true;<br></li>
<li>'Nvr_verified_reviewer': Whether the reviewer has never written a verified purchaser review. Will be inputted as 1 if true;<br></li>
<li>'single_day': Whether the reviewer posted all reviews on a same date. Will be inputted as 1 if true;<br></li>
<li>'Easy_grade_rating': Whether the reviewer has an average rating of >= 4.5. Will be inputted as 1 if true;<br></li>
<li>'samedate_20': Whether the reviewer posted >= 20 reviews in a same date. Will be inputted as 1 if true;<br></li>
<li>'Easy_grader': Whether the reviewer has an average rating of >= 4.5 AND gave a 5-star review for this purchase. Will be inputted as 1 if true;<br></li>

\**Note that in this part we would use the profile pages scraped to retrieve information*

In [0]:
# Keep columns for dataset aggregation
profile_urls['source_product'] = profile_urls['source'] +' '+ profile_urls['product']
profile_urls = profile_urls.rename(columns={'verified': 'Verified_Purchases'})
profile_urls.Verified_Purchases = profile_urls.Verified_Purchases.astype(int)
profile_urls_useful = profile_urls[['author','source','reviewid','product','profile','Verified_Purchases','stars']]
profile_urls_useful = profile_urls_useful.dropna()
profile_urls_useful = profile_urls_useful[profile_urls_useful['profile'].str.contains('account')].reset_index(drop = True)

In [0]:
# To save profile pages and name them using profile id, we first get the profile id from the url links to profiles
for i in range(len(profile_urls_useful)):
    if profile_urls_useful.loc[i,'profile'].find('ref') == -1:
        # extract profile id from profile link
        profile_urls_useful.loc[i,'profile_id'] = re.findall(r"account.(.+?)$",profile_urls_useful.loc[i,'profile'])[0]
    else:
        profile_urls_useful.loc[i,'profile_id'] = re.findall(r"account.(.+?)/",profile_urls_useful.loc[i,'profile'])[0]

In [0]:
profile_urls_useful.head()

Unnamed: 0,author,source,reviewid,product,profile,Verified_Purchases,profile_id
0,Amazon Customer,amazon.ca,R31B5G60GS531M,B078N8NR7G,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHXGHQLLRKPURPVJYTWO7I7AB4DA
1,nathalie,amazon.ca,RRCY4V48RQBXG,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHNM2DIYUKF6P6JQOU2KNOKRPEPQ
2,Conure Mum,amazon.ca,R18076F5C879LP,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHOEGFFK7INBIPCDO7PKPUTZMRXQ
3,Wayne Smith,amazon.ca,RLA1DFN3DCSFJ,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AGZ476P3EVM6UAENKIUUUYSA4J4A
4,Rob Self,amazon.ca,R3F4GS6FDS5ALH,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AGHAHXXQGDC6W4B6RGY6D2ILUJPA


## Web-Scraping

In [0]:
folder = os.getcwd() + '/profiles/'
if not os.path.exists(folder):
    os.mkdir(folder)

In [0]:
%time
from selenium.webdriver.common.keys import Keys


d = webdriver.Chrome(executable_path=os.path.abspath('chromedriver'))   

for i in tqdm(range(0,10)):
    time.sleep(3) #Hold 1 seconds before the next scrape.
    num=str(i)
    newurl = profile_urls_useful.loc[i,'profile']
 
    
    body = d.find_element_by_tag_name("body")
    body.send_keys(Keys.CONTROL + 't')
    
    d.get(newurl)
    d.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 'w') 
    lenOfPage = d.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match=False
    while(match==False):
            lastCount = lenOfPage
            time.sleep(5)
            lenOfPage = d.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
            if lastCount==lenOfPage:
                match=True
    
    
    #time.sleep(2) # sleep again the let the page load
    path = os.getcwd() +"/profiles/"
    profile_id = profile_urls_useful.loc[i,'profile_id']
    name = profile_id+'.txt' #The new file name. 
    with open(path + name, 'w') as file:
        file.write(d.page_source)
        file.close()
  

    #Close the google webpage that webdriver open for you, otherwise it will be crazy.
d.close()

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.05 µs


HBox(children=(IntProgress(value=0, max=10), HTML(value='')))




## Extracting features from pages scraped

In [0]:
# Save the scraped pages into dictionary
def save_in_dict(folder_name, df_name, soup):
    for i in tqdm(range(len(df_name))):
        try:
            num = str(i)
            slash = '/'
            name = df_name.loc[i,'profile_id']+'.txt'
            path = os.getcwd() + '/' + folder_name + '/'
            f = open("{}{}{}".format(path,slash,name),"r", encoding="utf-8").read()

            soup[i]=bs4.BeautifulSoup(f) #Create a beautifulsoup object using the txt we got.
        except:
            print(i)

In [0]:
# Generating new columns: one reviewer per row
def generate_features(df_name, soup):
    for i in tqdm(range(len(df_name))):
        try:
            # Store some info in order to generate features:
            tag0 = soup[i].find_all('div', class_='dashboard-desktop-stat-value')[0] 
            df_name.loc[i,'helpful_votes'] = tag0.find('span', class_='a-size-large a-color-base').get_text() 

            for tag in soup[i].find_all('div', class_='a-row a-spacing-none name-container'):    
                df_name.loc[i,'name'] = tag.find('span', class_='a-size-extra-large').get_text() 

            tag1 = soup[i].find_all('div', class_='dashboard-desktop-stat-value')[1]    
            df_name.loc[i,'num_of_reviews'] = int(tag1.find('span', class_='a-size-large a-color-base').get_text())

            df_name.loc[i,'num_of_reviews_count'] = len(soup[i].find_all('div', class_='a-section profile-at-content'))

            # Next, let's generate features of interests:
            # 0 review
            if df_name.loc[i,'num_of_reviews_count'] == 0:
                df_name.loc[i,'0_review'] = 1
            else:
                df_name.loc[i,'0_review'] = 0
            
            # One-Hit wonder
            if df_name.loc[i,'num_of_reviews'] == 1:
                df_name.loc[i,'One_hit'] = 1
            else:
                df_name.loc[i,'One_hit'] = 0

            # Take-backs
            df_name['take_back'] = df_name.apply(lambda x: x['num_of_reviews'] - x['num_of_reviews_count'], axis=1)
            if df_name.loc[i,'take_back'] > 0:
                df_name.loc[i,'Take_backs'] = 1
            else:
                df_name.loc[i,'Take_backs'] = 0

            # Never verified reviewer
            verified = []
            for tag in soup[i].find_all('div', class_='a-row a-spacing-mini'): 
                try:   
                    verified.append(tag.find('span', class_='a-size-small a-color-state profile-at-review-badge a-text-bold').get_text())
                except:
                    continue  
            df_name.loc[i,'num_of_verified'] = len(verified)
            df_name.loc[i,'num_of_unverified'] = df_name.loc[i,'num_of_reviews_count'] - df_name.loc[i,'num_of_verified']
            if (df_name.loc[i,'num_of_unverified'] == df_name.loc[i,'num_of_reviews_count']) & (df_name.loc[i,'num_of_unverified'] > 0):
                df_name.loc[i,'Nvr_verified_reviewer'] = 1
            else:
                df_name.loc[i,'Nvr_verified_reviewer'] = 0

            # Single day
            date_mode_number = []
            for tag in soup[i].find_all('div', class_='a-profile-content'):
                date_mode_number.append(tag.find('span', class_='a-profile-descriptor').get_text())
                if len(set(date_mode_number)) == 1:
                    df_name.loc[i,'single_day'] = 1
                else:
                    df_name.loc[i,'single_day'] = 0
                    
            # Easy grader
            ## compute reviewers' average ratings 
            stars = []
            for tag in soup[i].find_all('div',class_='a-section a-spacing-mini'):
                stars.append(int(tag.find('span',class_='a-icon-alt').text[0]) )
            df_name.loc[i,'avg_rating'] = sum(stars)/len(stars) 
            
            # when average rating >=4.5, then easy grader 
            if df_name.loc[i,'avg_rating'] >= 4.5:
                df_name.loc[i,'Easy_grade_rating'] = 1
            else:
                df_name.loc[i,'Easy_grade_rating'] = 0
            
            # Samedate >= 20 reviews
            # if a person has many purchases on one day, how many purchases in total for that day? 
            date_mode_number = []  
            # We first find the date that appears most times.
            for tag in soup[i].find_all('div', class_='a-profile-content'):
                date_mode_number.append(tag.find('span', class_='a-profile-descriptor').get_text())
                df_name.loc[i,'mode_number'] = len([j for j, review in enumerate(date_mode_number) if review == max(set(date_mode_number), key=date_mode_number.count)])
            if df_name.loc[i,'mode_number'] > 20:
                df_name.loc[i,'samedate_20'] = 1
            else:
                df_name.loc[i,'samedate_20'] = 0

        except:
            continue

In [0]:
soup = {}
save_in_dict('profiles',profile_urls_useful, soup)
generate_features(profile_urls_useful, soup)

In [0]:
# Check the features
print(profile_urls_useful.loc[0,'profile'])

https://www.amazon.ca/gp/profile/amzn1.account.AHJBVGCBBQYCWRCMMZTRCI2I6ZCQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8


### Easy Grader

In [0]:
# Check if the review is a 5-star review
profile_urls_useful['5_star'] = np.where(profile_urls_useful['stars'] == 5, 1, 0)

# Easy grader needs to satisfy 2 conditions: has an average rating of >=4.5; also in this review gave a 5-star
profile_urls_useful['Easy_grader'] = profile_urls_useful['Easy_grade_rating'] * profile_urls_useful['5_star']

In [0]:
profile_urls_useful.head()

Unnamed: 0,author,source,reviewid,product,profile,Verified_Purchases,source_product,profile_id,Non_Verified_Purchases,helpful_votes,...,take_back,Take_backs,num_of_verified,num_of_unverified,Nvr_verified_reviewer,single_day,avg_rating,Easy_grade_rating,mode_number,samedate_20
0,Amazon Customer,amazon.ca,R31B5G60GS531M,B078N8NR7G,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B078N8NR7G,AHXGHQLLRKPURPVJYTWO7I7AB4DA,0,1,...,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,3.0,0.0
1,nathalie,amazon.ca,RRCY4V48RQBXG,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B01HO8U5NC,AHNM2DIYUKF6P6JQOU2KNOKRPEPQ,0,0,...,3.0,1.0,17.0,0.0,0.0,0.0,5.0,1.0,4.0,0.0
2,Conure Mum,amazon.ca,R18076F5C879LP,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B01HO8U5NC,AHOEGFFK7INBIPCDO7PKPUTZMRXQ,0,58,...,3.0,1.0,153.0,9.0,0.0,0.0,3.67284,0.0,13.0,0.0
3,Wayne Smith,amazon.ca,RLA1DFN3DCSFJ,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B01HO8U5NC,AGZ476P3EVM6UAENKIUUUYSA4J4A,0,0,...,0.0,0.0,2.0,0.0,0.0,0.0,4.0,1.0,1.0,0.0
4,Rob Self,amazon.ca,R3F4GS6FDS5ALH,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B01HO8U5NC,AGHAHXXQGDC6W4B6RGY6D2ILUJPA,0,3,...,0.0,0.0,7.0,0.0,0.0,0.0,3.857143,0.0,5.0,0.0


# Features that don't need scraped pages to extract

*Features that don't need scraped pages to extract*:<br>
<li>'Non_Verified_Purchases': Whether the review comes from a non-verified purchase. Will be inputted as 1 if true;<br> </li>
<li>'high_vol_day_rev': Whether the review was written on a date when a certain product got a larger amount of reviews than usual. Will be inputted as 1 if true;<br></li>
<li>'Overrep_wrd_cnt': Whether there was a specific word count range that contributes much more reviews for a certain product based on a comparison within the product category, and the review falls into that range group. Will be inputted as 1 if true;<br></li>
<li>'incentivized': Use a pre-defined list of incentivized words, like "free product", and check the incentivized words existence. Will be inputted as 1 if true;<br></li>
<li>'Contains_rep_phrases': Phrases that have a potential to indicate incentivized behaviors are selected to help detect reviews. Find reviews with problematic phrase repetition and label them as 1;<br></li>
<li>'Overlapping_rev_history': Whether a reviewer has reviewed >=3 products that are same as another reviewer. Will be inputted as 1 if true;<br></li>

The data used:<br>
- '**RSC reviews with profile ids.csv**'( = profile_urls in the previous session) 
- **SalesRankExport_f0337c16-d7f3-4fc0-a46b-a0e14f18b595.csv**

## Non-verified purchases

In [0]:
# Non-verified purchases
profile_urls_useful['Non_Verified_Purchases'] = 1-profile_urls_useful['Verified_Purchases']

In [0]:
# Save the results
profile_urls_useful.to_csv('sample_outputs/profile_urls_useful_profile_features.csv',index=False)

## High Volumn Day

This feature is designed to retrieve the dates where a certain product got a larger amount of reviews than usual. The threshold we chose was 1 standard diviation above from the average number of reviews/day.<br>

**Input**: 'RSC reviews with profile ids.csv'( = profile_urls in the previous session)

In [0]:
# Use review dataset
reviews = profile_urls.copy()

In [0]:
reviews.head()

Unnamed: 0,source,product,PART NUMBER_custom,SKU_custom,analysis_purpose_custom_custom,flag_custom,special_name_custom,test_field2_custom,test_field3_custom,name,...,statustime,helpfulcount,commenttext,commentauthor,officialcomment,totalcomments,commentts,commentdatestring,inputtime,source_product
0,amazon.ca,B078N8NR7G,,,,,,,,PetSafe 900 Meter Remote Trainer,...,,,,,,,,,2018-12-22 06:24,amazon.ca B078N8NR7G
1,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,,,2019-04-15 10:32,amazon.ca B01HO8U5NC
2,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,,,2019-04-04 12:02,amazon.ca B01HO8U5NC
3,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,,,2019-04-02 20:35,amazon.ca B01HO8U5NC
4,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,,,2018-07-31 08:00,amazon.ca B01HO8U5NC


In [0]:
# Set product as index for later merging tasks
reviews.set_index('product',drop=True,inplace=True)

In [0]:
# Create review dictionary
reviews_dict = {}
for i in reviews.index.unique():
    reviews_dict[i] = reviews.loc[i, ['source','date','reviewid', 'text']]

In [0]:
# Check the date column
pd.to_datetime(reviews_dict["B01HO8U5NC"]['date'])

product
B01HO8U5NC   2019-04-13 08:00:00
B01HO8U5NC   2019-04-02 08:00:00
B01HO8U5NC   2019-03-31 08:00:00
B01HO8U5NC   2018-07-31 08:00:00
B01HO8U5NC   2018-07-29 08:00:00
B01HO8U5NC   2018-04-12 08:00:00
B01HO8U5NC   2018-01-21 08:00:00
B01HO8U5NC   2018-01-08 08:00:00
B01HO8U5NC   2017-10-21 08:00:00
B01HO8U5NC   2017-10-16 08:00:00
B01HO8U5NC   2017-08-18 08:00:00
B01HO8U5NC   2017-08-14 08:00:00
B01HO8U5NC   2017-05-24 08:00:00
B01HO8U5NC   2017-04-24 08:00:00
B01HO8U5NC   2017-04-09 08:00:00
B01HO8U5NC   2017-03-15 08:00:00
B01HO8U5NC   2017-03-13 08:00:00
B01HO8U5NC   2017-02-17 08:00:00
B01HO8U5NC   2017-02-14 08:00:00
B01HO8U5NC   2017-02-01 08:00:00
B01HO8U5NC   2017-01-17 08:00:00
B01HO8U5NC   2017-01-12 08:00:00
B01HO8U5NC   2017-01-10 08:00:00
B01HO8U5NC   2017-01-07 08:00:00
B01HO8U5NC   2017-01-01 08:00:00
B01HO8U5NC   2016-12-29 08:00:00
B01HO8U5NC   2016-12-28 08:00:00
B01HO8U5NC   2016-12-18 08:00:00
B01HO8U5NC   2016-12-18 08:00:00
B01HO8U5NC   2016-12-16 08:00:00
  

In [0]:
# Create a 'new_date' column for date comparison
for i in reviews_dict:
    if type(reviews_dict[i]) != pd.core.series.Series:
        reviews_dict[i] = reviews_dict[i].sort_values('date').drop_duplicates()
        datetime = pd.to_datetime(reviews_dict[i]['date'])
        reviews_dict[i]['new_date'] = datetime.dt.strftime('%Y-%m-%d')
        reviews_dict[i]['new_date'] = pd.to_datetime(reviews_dict[i]['new_date'])

In [0]:
# Generate the high volumn day feature
for i in reviews_dict:
    if type(reviews_dict[i]) != pd.core.series.Series:
        num_review_per_day = reviews_dict[i][["text", "new_date"]].groupby(by = "new_date", as_index =False).count()
        if len(num_review_per_day) > 1:
            num_avg = num_review_per_day["text"].mean(axis = 0)
            num_std = num_review_per_day["text"].std(axis = 0)
            num_limit = math.ceil(num_avg + num_std)
            high_volumes_day = num_review_per_day[num_review_per_day['text'] > num_limit]['new_date'] 
            reviews_dict[i]['whether_high_volume'] = reviews_dict[i]['new_date'].isin(high_volumes_day)
        else:
            reviews_dict[i]['whether_high_volume'] = True # because all the reviews were left on the same date
    else:
        reviews_dict[i]['whether_high_volume'] = False

In [0]:
# Check the results:
reviews_dict

{'B078N8NR7G': source                                                     amazon.ca
 date                                                2018-12-20 08:00
 reviewid                                              R31B5G60GS531M
 text                   Produit disfonctionnel. J'exige remboursement
 whether_high_volume                                            False
 Name: B078N8NR7G, dtype: object,
 'B01HO8U5NC':                source              date        reviewid  \
 product                                                   
 B01HO8U5NC     amazon  2016-09-09 08:00  R3JHD87AZ0P9P4   
 B01HO8U5NC     amazon  2016-09-12 08:00  R2TDVPYR6WM6TH   
 B01HO8U5NC     amazon  2016-09-13 08:00  R12OI2Q1O0L92O   
 B01HO8U5NC     amazon  2016-09-18 08:00  R3KLGQ74I3L33C   
 B01HO8U5NC     amazon  2016-09-20 08:00  R249T5AUQANFRK   
 B01HO8U5NC     amazon  2016-09-23 08:00  R397TRIAMDL9QW   
 B01HO8U5NC     amazon  2016-10-26 08:00  R2IFVW5R6ABTWX   
 B01HO8U5NC     amazon  2016-10-28 08:00  R3QYGA

In [0]:
# Combine the feature back to the dataframe
reviews_high_volume = pd.DataFrame()
for i in reviews_dict:
    if type(reviews_dict[i]) != pd.core.series.Series:
        df_bin = reviews_dict[i]
    else:
        df_bin = reviews_dict[i].to_frame().T
    reviews_high_volume = pd.concat([reviews_high_volume, df_bin], ignore_index = True)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


In [0]:
# Set the results to 1 and 0
reviews_high_volume['high_vol_day_rev'] =reviews_high_volume['whether_high_volume'].apply(lambda x: 1 if x==True else 0)

In [0]:
reviews_high_volume

Unnamed: 0,date,new_date,reviewid,source,text,whether_high_volume,high_vol_day_rev
0,2018-12-20 08:00,NaT,R31B5G60GS531M,amazon.ca,Produit disfonctionnel. J'exige remboursement,0,0
1,2016-09-09 08:00,2016-09-09,R3JHD87AZ0P9P4,amazon,dogs love it,0,0
2,2016-09-12 08:00,2016-09-12,R2TDVPYR6WM6TH,amazon,Stopped working after 3 weeks,0,0
3,2016-09-13 08:00,2016-09-13,R12OI2Q1O0L92O,amazon,I was going to return this water fountain. On...,0,0
4,2016-09-18 08:00,2016-09-18,R3KLGQ74I3L33C,amazon,"Took my pets a few hours to get used to, but a...",0,0
5,2016-09-20 08:00,2016-09-20,R249T5AUQANFRK,amazon,Nice idea but the fountain pump failed the sec...,0,0
6,2016-09-23 08:00,2016-09-23,R397TRIAMDL9QW,amazon,I like the Pet Fountain I add water every two ...,0,0
7,2016-10-26 08:00,2016-10-26,R2IFVW5R6ABTWX,amazon,It took a little while for my girls to start d...,0,0
8,2016-10-28 08:00,2016-10-28,R3QYGAS400FKDS,amazon,I have to fill it more often. Its small for tw...,0,0
9,2016-11-05 08:00,2016-11-05,R2GOO41TBSYQC4,amazon.ca,Works great! My cat doesn't drink enough water...,0,0


In [0]:
# We can save the table for further use
reviews_high_volume.to_csv('sample_outputs/reviews_high_volume.csv',index=False)

In [0]:
# To compile we just keep the feature we need:
reviews_high_volume.set_index('reviewid',drop=True,inplace=True)
reviews_high_volume = reviews_high_volume[['high_vol_day_rev']]

## Overrepresented Word Count

To build our word count distribution, we start by putting every single review for a product into a “word count group”.  For example, a 23 word review would fall into the “21-25 word count group”, a 109 word review would fall into the “101-125 word count group”, and a 600 word review would fall into the “201+ word count group”.  This gives us the product’s word count distribution.  But just a product’s  word count distribution doesn’t really tell us that much: we need something to compare it to. That is why we grab the word count distribution for all of the reviews in the products category (category2) to get the expected word count distribution.  

Once we have the word count distribution of the product and the expected distribution of the category we compare the two distributions and identify product word count groups that are higher in concentration than we’d expect to see. For each of the larger groups we run a significance test to ensure that it isn’t due to random chance or lack of data points but rather that they are substantially overrepresented. If a product doesn’t have that many reviews, we are likely to see more variance due to random chance.  However, if our formula determines the difference is statistically significant, we’ll label that group as an **Overrepresented Word Count Group**.

**Input**: <br>
'RSC reviews with profile ids.csv'( = profile_urls in the previous session)<br>
SalesRankExport_f0337c16-d7f3-4fc0-a46b-a0e14f18b595.csv

In [0]:
# Load the sales dataset
sales = pd.read_csv("SalesRankExport_f0337c16-d7f3-4fc0-a46b-a0e14f18b595.csv")
sales.shape

  interactivity=interactivity, compiler=compiler, result=result)


(2258613, 17)

In [0]:
# Check column names
sales.columns

Index(['source', 'id', 'start_ts', 'end_ts', 'date', 'category_id1',
       'category_name1', 'category_rank1', 'category_id2', 'category_name2',
       'category_rank2', 'category_id3', 'category_name3', 'category_rank3',
       'category_id4', 'category_name4', 'category_rank4'],
      dtype='object')

In [0]:
# Extract only columns of interest
sales = sales[['id','category_id2']]

# Take only the unique product id
sales = sales.drop_duplicates('id')
sales.shape

(2525, 2)

In [0]:
# Now let's compile the reviews and sales dataframes to identify the category of each product in the reviews dataset
reviews.reset_index(inplace=True)
compiled = pd.merge(profile_urls,sales, how = 'inner', left_on = "product", right_on="id")
compiled.head()

Unnamed: 0,source,product,PART NUMBER_custom,SKU_custom,analysis_purpose_custom_custom,flag_custom,special_name_custom,test_field2_custom,test_field3_custom,name,...,commenttext,commentauthor,officialcomment,totalcomments,commentts,commentdatestring,inputtime,source_product,id,category_id2
0,amazon.ca,B078N8NR7G,,,,,,,,PetSafe 900 Meter Remote Trainer,...,,,,,,,2018-12-22 06:24,amazon.ca B078N8NR7G,B078N8NR7G,pet-supplies
1,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,2019-04-15 10:32,amazon.ca B01HO8U5NC,B01HO8U5NC,pet-supplies
2,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,2019-04-04 12:02,amazon.ca B01HO8U5NC,B01HO8U5NC,pet-supplies
3,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,2019-04-02 20:35,amazon.ca B01HO8U5NC,B01HO8U5NC,pet-supplies
4,amazon.ca,B01HO8U5NC,,,,,,,,Drinkwell Platinum Pet Fountain 168oz,...,,,,,,,2018-07-31 08:00,amazon.ca B01HO8U5NC,B01HO8U5NC,pet-supplies


In [0]:
# Check information 
print(compiled.shape)
print("\n")
print(compiled.info())

(65451, 42)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 65451 entries, 0 to 65450
Data columns (total 42 columns):
source                            65451 non-null object
product                           65451 non-null object
PART NUMBER_custom                0 non-null float64
SKU_custom                        0 non-null float64
analysis_purpose_custom_custom    0 non-null float64
flag_custom                       0 non-null float64
special_name_custom               0 non-null float64
test_field2_custom                0 non-null float64
test_field3_custom                0 non-null float64
name                              65451 non-null object
date                              65451 non-null object
status                            65451 non-null object
sentiment                         65451 non-null object
topic                             65451 non-null object
notes                             0 non-null float64
profile                           32187 non-null object
autho

In [0]:
# Take only columns of interest
## We will select category_id2 where we will be comparing the word count of the individual products with the word count
## of this category level
compiled = compiled[['source','product','text','category_id2','reviewid']]
compiled.head()

Unnamed: 0,source,product,text,category_id2,reviewid
0,amazon.ca,B078N8NR7G,Produit disfonctionnel. J'exige remboursement,pet-supplies,R31B5G60GS531M
1,amazon.ca,B01HO8U5NC,J’ai adorer,pet-supplies,RRCY4V48RQBXG
2,amazon.ca,B01HO8U5NC,Bought this as a running bird bath for my two ...,pet-supplies,R18076F5C879LP
3,amazon.ca,B01HO8U5NC,Lid is easily knocked off but it’s still a gre...,pet-supplies,RLA1DFN3DCSFJ
4,amazon.ca,B01HO8U5NC,Works okay but is VERY NOISY.,pet-supplies,R3F4GS6FDS5ALH


In [0]:
# Let's create the word count column
compiled['totalwords'] = compiled['text'].str.split().str.len()

In [0]:
# Create word bins with appropriate ranges
compiled['word_bins'] = pd.cut(x=compiled['totalwords'], bins=[0, 5, 15, 25, 40, 65, 100, 200, 100000])
compiled['word_bins'] = pd.cut(x=compiled['totalwords'], bins=[0, 5, 15, 25, 40, 65, 100, 200, 100000], labels=['0 - 5 words', '6 - 15 words', '16 - 25 words', '26 - 40 words', '41 - 65 words', '66 - 100 words', '101 - 200 words','200+'])
compiled.head()

Unnamed: 0,source,product,text,category_id2,reviewid,totalwords,word_bins
0,amazon.ca,B078N8NR7G,Produit disfonctionnel. J'exige remboursement,pet-supplies,R31B5G60GS531M,4.0,0 - 5 words
1,amazon.ca,B01HO8U5NC,J’ai adorer,pet-supplies,RRCY4V48RQBXG,2.0,0 - 5 words
2,amazon.ca,B01HO8U5NC,Bought this as a running bird bath for my two ...,pet-supplies,R18076F5C879LP,36.0,26 - 40 words
3,amazon.ca,B01HO8U5NC,Lid is easily knocked off but it’s still a gre...,pet-supplies,RLA1DFN3DCSFJ,11.0,6 - 15 words
4,amazon.ca,B01HO8U5NC,Works okay but is VERY NOISY.,pet-supplies,R3F4GS6FDS5ALH,6.0,6 - 15 words


In [0]:
# Create a dataframe to aggregate word bins across products
# Normalize to get proportions
product_aggregation = pd.crosstab(compiled["product"], compiled["word_bins"], margins=True, normalize='index')
product_aggregation.head()

word_bins,0 - 5 words,6 - 15 words,16 - 25 words,26 - 40 words,41 - 65 words,66 - 100 words,101 - 200 words,200+
product,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
B0000AVVPU,0.181818,0.272727,0.181818,0.272727,0.0,0.0,0.090909,0.0
B0000BYCM0,0.4,0.292308,0.092308,0.076923,0.0,0.107692,0.030769,0.0
B0000DAPGK,0.411765,0.176471,0.235294,0.117647,0.058824,0.0,0.0,0.0
B0001ZWZ9S,0.052632,0.263158,0.0,0.157895,0.0,0.368421,0.157895,0.0
B00023N7TG,0.189873,0.303797,0.113924,0.160338,0.122363,0.067511,0.037975,0.004219


In [0]:
# Create a dataframe to aggregate word bins across categories
# Normalize to get proportions
category_aggregation = pd.crosstab(compiled["category_id2"], compiled["word_bins"], margins=True, normalize='index')
category_aggregation.head()

word_bins,0 - 5 words,6 - 15 words,16 - 25 words,26 - 40 words,41 - 65 words,66 - 100 words,101 - 200 words,200+
category_id2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ce-de/3578331,0.334525,0.235883,0.173695,0.128306,0.083631,0.029664,0.012152,0.002144
diy,0.054878,0.073171,0.097561,0.146341,0.152439,0.182927,0.146341,0.146341
garden/4339577031,0.254054,0.275676,0.237838,0.118919,0.081081,0.021622,0.010811,0.0
industrial/4546048031,0.252708,0.296029,0.148014,0.111913,0.101083,0.057762,0.021661,0.01083
pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053


In [0]:
# Next we need to merge product_aggregation and category_aggregation!
# To do that we first merge category_id to product_aggregation on id
product_aggregation = pd.merge(product_aggregation,sales, how = 'inner', left_on = "product", right_on="id")
product_aggregation.head()

Unnamed: 0,0 - 5 words,6 - 15 words,16 - 25 words,26 - 40 words,41 - 65 words,66 - 100 words,101 - 200 words,200+,id,category_id2
0,0.181818,0.272727,0.181818,0.272727,0.0,0.0,0.090909,0.0,B0000AVVPU,pet-supplies/2975425011
1,0.4,0.292308,0.092308,0.076923,0.0,0.107692,0.030769,0.0,B0000BYCM0,pet-supplies
2,0.411765,0.176471,0.235294,0.117647,0.058824,0.0,0.0,0.0,B0000DAPGK,pet-supplies/2975425011
3,0.052632,0.263158,0.0,0.157895,0.0,0.368421,0.157895,0.0,B0001ZWZ9S,pet-supplies
4,0.189873,0.303797,0.113924,0.160338,0.122363,0.067511,0.037975,0.004219,B00023N7TG,pet-supplies/2975349011


In [0]:
# Next we merge the the category_aggregation table by joining it on category_id2
product_aggregation = pd.merge(product_aggregation,category_aggregation, how = 'inner', left_on = "category_id2", right_on="category_id2")
product_aggregation.head()

Unnamed: 0,0 - 5 words_x,6 - 15 words_x,16 - 25 words_x,26 - 40 words_x,41 - 65 words_x,66 - 100 words_x,101 - 200 words_x,200+_x,id,category_id2,0 - 5 words_y,6 - 15 words_y,16 - 25 words_y,26 - 40 words_y,41 - 65 words_y,66 - 100 words_y,101 - 200 words_y,200+_y
0,0.181818,0.272727,0.181818,0.272727,0.0,0.0,0.090909,0.0,B0000AVVPU,pet-supplies/2975425011,0.135165,0.174176,0.134066,0.143956,0.157692,0.092308,0.101099,0.061538
1,0.411765,0.176471,0.235294,0.117647,0.058824,0.0,0.0,0.0,B0000DAPGK,pet-supplies/2975425011,0.135165,0.174176,0.134066,0.143956,0.157692,0.092308,0.101099,0.061538
2,0.087324,0.135211,0.135211,0.16338,0.205634,0.101408,0.126761,0.04507,B00062F6HE,pet-supplies/2975425011,0.135165,0.174176,0.134066,0.143956,0.157692,0.092308,0.101099,0.061538
3,0.0,0.0,0.571429,0.0,0.285714,0.142857,0.0,0.0,B00062F6OM,pet-supplies/2975425011,0.135165,0.174176,0.134066,0.143956,0.157692,0.092308,0.101099,0.061538
4,0.173077,0.298077,0.153846,0.173077,0.125,0.038462,0.038462,0.0,B00068R98C,pet-supplies/2975425011,0.135165,0.174176,0.134066,0.143956,0.157692,0.092308,0.101099,0.061538


In [0]:
# Now let's compile the word count comparison to our original dataframe to begin comparing on a review basis
compiled_word_count = pd.merge(compiled,product_aggregation, how = 'inner', left_on = "product", right_on="id")
compiled_word_count.head()

Unnamed: 0,source,product,text,category_id2_x,reviewid,totalwords,word_bins,0 - 5 words_x,6 - 15 words_x,16 - 25 words_x,...,id,category_id2_y,0 - 5 words_y,6 - 15 words_y,16 - 25 words_y,26 - 40 words_y,41 - 65 words_y,66 - 100 words_y,101 - 200 words_y,200+_y
0,amazon.ca,B078N8NR7G,Produit disfonctionnel. J'exige remboursement,pet-supplies,R31B5G60GS531M,4.0,0 - 5 words,1.0,0.0,0.0,...,B078N8NR7G,pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053
1,amazon.ca,B01HO8U5NC,J’ai adorer,pet-supplies,RRCY4V48RQBXG,2.0,0 - 5 words,0.163636,0.139394,0.145455,...,B01HO8U5NC,pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053
2,amazon.ca,B01HO8U5NC,Bought this as a running bird bath for my two ...,pet-supplies,R18076F5C879LP,36.0,26 - 40 words,0.163636,0.139394,0.145455,...,B01HO8U5NC,pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053
3,amazon.ca,B01HO8U5NC,Lid is easily knocked off but it’s still a gre...,pet-supplies,RLA1DFN3DCSFJ,11.0,6 - 15 words,0.163636,0.139394,0.145455,...,B01HO8U5NC,pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053
4,amazon.ca,B01HO8U5NC,Works okay but is VERY NOISY.,pet-supplies,R3F4GS6FDS5ALH,6.0,6 - 15 words,0.163636,0.139394,0.145455,...,B01HO8U5NC,pet-supplies,0.121428,0.168432,0.147719,0.152633,0.146834,0.107325,0.108578,0.047053


In [0]:
# Rename columns to make them look a little prettier
compiled_word_count.rename(columns={'0 - 5 words_x':'product_0-5',
                                   '6 - 15 words_x':'product_6-15',
                                   '16 - 25 words_x':'product_16-25',
                                   '26 - 40 words_x':'product_26-40',
                                   '41 - 65 words_x':'product_41-65',
                                   '66 - 100 words_x':'product_66-100',
                                   '101 - 200 words_x':'product_101-200',
                                   '200+_x':'product_200+',
                                   '0 - 5 words_y':'category_0-5',
                                   '6 - 15 words_y':'category_6-15',
                                   '16 - 25 words_y':'category_16-25',
                                   '26 - 40 words_y':'category_26-40',
                                   '41 - 65 words_y':'category_41-65',
                                   '66 - 100 words_y':'category_66-100',
                                   '101 - 200 words_y':'category_101-200',
                                   '200+_y':'category_200+'}, inplace=True)

In [0]:
# Include the number of reviews per product as this will be one of our thresholds 
## We will only look at overrepresented word category for products having > 10 reviews; otherwise the results could
## be due to lack of data
compiled_word_count['number_of_reviews'] = compiled_word_count['product'].map(compiled_word_count['product'].value_counts())

In [0]:
# Create functions that will output a value of 1 for products with overrepresented word category counts (>10 %)
## 0 will be shown for products that are not within the overrepresented word category
## This is applied across all the word bins

def a(row):
    if (row['product_0-5'] > 0.1 + row['category_0-5']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_0-5'] < 0.1 + row['category_0-5']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def b(row):
    if (row['product_6-15'] > 0.1 + row['category_6-15']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_6-15'] < 0.1 + row['category_6-15']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def c(row):
    if (row['product_16-25'] > 0.1 + row['category_16-25']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_16-25'] < 0.1 + row['category_16-25']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def d(row):
    if (row['product_26-40'] > 0.1 + row['category_26-40']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_26-40'] < 0.1 + row['category_26-40']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def e(row):
    if (row['product_41-65'] > 0.1 + row['category_41-65']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_41-65'] < 0.1 + row['category_41-65']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def f(row):
    if (row['product_66-100'] > 0.1 + row['category_66-100']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_66-100'] < 0.1 + row['category_66-100']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def g(row):
    if (row['product_101-200'] > 0.1 + row['category_101-200']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_101-200'] < 0.1 + row['category_101-200']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

def h(row):
    if (row['product_200+'] > 0.1 + row['category_200+']) and (row['number_of_reviews'] > 10):
        val = 1
    elif (row['product_200+'] < 0.1 + row['category_200+']) and (row['number_of_reviews'] > 10):
        val = 0
    else:
        val = 0
    return val

In [0]:
# Create a new column showing the overrepresented word bins for each product
compiled_word_count['0-5_OR'] = compiled_word_count.apply(a, axis=1)
compiled_word_count['6-15_OR'] = compiled_word_count.apply(b, axis=1)
compiled_word_count['16-25_OR'] = compiled_word_count.apply(c, axis=1)
compiled_word_count['26-40_OR'] = compiled_word_count.apply(d, axis=1)
compiled_word_count['41-65_OR'] = compiled_word_count.apply(e, axis=1)
compiled_word_count['66-100_OR'] = compiled_word_count.apply(f, axis=1)
compiled_word_count['101-200_OR'] = compiled_word_count.apply(g, axis=1)
compiled_word_count['200+_OR'] = compiled_word_count.apply(h, axis=1)

In [0]:
# Check what it looks like!
compiled_word_count.head()

Unnamed: 0,source,product,text,category_id2_x,reviewid,totalwords,word_bins,product_0-5,product_6-15,product_16-25,...,category_200+,number_of_reviews,0-5_OR,6-15_OR,16-25_OR,26-40_OR,41-65_OR,66-100_OR,101-200_OR,200+_OR
0,amazon.ca,B078N8NR7G,Produit disfonctionnel. J'exige remboursement,pet-supplies,R31B5G60GS531M,4.0,0 - 5 words,1.0,0.0,0.0,...,0.047053,1,0,0,0,0,0,0,0,0
1,amazon.ca,B01HO8U5NC,J’ai adorer,pet-supplies,RRCY4V48RQBXG,2.0,0 - 5 words,0.163636,0.139394,0.145455,...,0.047053,165,0,0,0,0,0,0,0,0
2,amazon.ca,B01HO8U5NC,Bought this as a running bird bath for my two ...,pet-supplies,R18076F5C879LP,36.0,26 - 40 words,0.163636,0.139394,0.145455,...,0.047053,165,0,0,0,0,0,0,0,0
3,amazon.ca,B01HO8U5NC,Lid is easily knocked off but it’s still a gre...,pet-supplies,RLA1DFN3DCSFJ,11.0,6 - 15 words,0.163636,0.139394,0.145455,...,0.047053,165,0,0,0,0,0,0,0,0
4,amazon.ca,B01HO8U5NC,Works okay but is VERY NOISY.,pet-supplies,R3F4GS6FDS5ALH,6.0,6 - 15 words,0.163636,0.139394,0.145455,...,0.047053,165,0,0,0,0,0,0,0,0


In [0]:
# Create a function that will check if the individual review in the row with subject word bin is within the overrepresented criteria
def i(row):
    if row['word_bins'] == "0 - 5 words":
        val =  row['0-5_OR']
    elif row['word_bins'] == "6 - 15 words":
        val =  row['6-15_OR']
    elif row['word_bins'] == "16 - 25 words":
        val =  row['16-25_OR']
    elif row['word_bins'] == "26 - 40 words":
        val =  row['26-40_OR']
    elif row['word_bins'] == "41 - 65 words":
        val =  row['41-65_OR']
    elif row['word_bins'] == "66 - 100 words":
        val =  row['66-100_OR']
    elif row['word_bins'] == "101 - 200 words":
        val =  row['101-200_OR']
    elif row['word_bins'] == "200+":
        val =  row['200+_OR']
    else:
        val = 0
    return val

In [0]:
# Apply function for every row and create new column
compiled_word_count['Overrep_wrd_cnt'] = compiled_word_count.apply(i, axis=1)

In [0]:
# Delete any unneccessary columns
compiled_word_count.drop(['category_id2_x', 'id','category_id2_y'], axis=1, inplace = True)

In [0]:
compiled_word_count.head()

Unnamed: 0,source,product,text,reviewid,totalwords,word_bins,product_0-5,product_6-15,product_16-25,product_26-40,...,number_of_reviews,0-5_OR,6-15_OR,16-25_OR,26-40_OR,41-65_OR,66-100_OR,101-200_OR,200+_OR,Overrep_wrd_cnt
0,amazon.ca,B078N8NR7G,Produit disfonctionnel. J'exige remboursement,R31B5G60GS531M,4.0,0 - 5 words,1.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1,amazon.ca,B01HO8U5NC,J’ai adorer,RRCY4V48RQBXG,2.0,0 - 5 words,0.163636,0.139394,0.145455,0.115152,...,165,0,0,0,0,0,0,0,0,0
2,amazon.ca,B01HO8U5NC,Bought this as a running bird bath for my two ...,R18076F5C879LP,36.0,26 - 40 words,0.163636,0.139394,0.145455,0.115152,...,165,0,0,0,0,0,0,0,0,0
3,amazon.ca,B01HO8U5NC,Lid is easily knocked off but it’s still a gre...,RLA1DFN3DCSFJ,11.0,6 - 15 words,0.163636,0.139394,0.145455,0.115152,...,165,0,0,0,0,0,0,0,0,0
4,amazon.ca,B01HO8U5NC,Works okay but is VERY NOISY.,R3F4GS6FDS5ALH,6.0,6 - 15 words,0.163636,0.139394,0.145455,0.115152,...,165,0,0,0,0,0,0,0,0,0


In [0]:
# We can save the table for further use
compiled_word_count.to_csv('sample_outputs/compiled_word_count.csv',index=False)

In [0]:
# To compile we just keep the feature we need:
compiled_word_count.set_index('reviewid',drop=True,inplace=True)
word_count_labeled = compiled_word_count[['totalwords', 'Overrep_wrd_cnt']]

Compiling the above 2 features with the main table **"profile_urls_useful"**:

In [0]:
merged_highvol_wordcnt = pd.merge(word_count_labeled,reviews_high_volume,how='left', left_index=True, right_index=True)

In [0]:
profile_urls_useful.set_index('reviewid',drop=True,inplace=True)
merged_profile_highvol_wordcnt = pd.merge(profile_urls_useful,merged_highvol_wordcnt,how='left', left_index=True, right_index=True)

In [0]:
# save the table for further use
merged_profile_highvol_wordcnt.to_csv('sample_outputs/merged_profile_highvol_wordcnt.csv',index=True)

## Text-related: Repetitive phrases and incentivized reviews

- The features are created based on the review text analysis.
- The analysis could be applied to any column containing a list of text format data

**Input**: 'RSC reviews with profile ids.csv'( = profile_urls in the previous session)

**Structures of Data explained** :
1. text -> a column of reviews, each on is a string.   
    * e.g. df['text']
2. norm_corpus -> A corpus can be defined as a collection of text documents. It can be thought as just a bunch of text files in a directory, often alongside many other directories of text files.   
    * e.g. norm_corpus = normalize_corpus(text)


**Incentivized phrases** <br>
Use a pre-defined list of incentivized words, like "free product", and check the incentivized words existence. 

*Methodlogy*

1. Clean Reviewbox text column to tidytext format
    * function normalize_document
2. Craft incentivized word list 
    * function create_incentivized_words():
    * clean the word with normalize_ducument
    * Use a pre-defined list of incentivized words, like "free product", and check the incentivized words existence in review column 
3. (Level1) Check whether a Review contains incentivized word list
    * <b>Result: pick 0.45% of reviews</b>
4. Craft Word features<br>


*(The 2 steps below are not included in this feature, they are further employed and used in the incentivized product model.)*
5. Use the features to build random forest model
    * manual label dataset
6. (Level 2) Make Prediction from Labeled data

**Phrase Repetition**<br>
Phrases that have a potential to indicate incentivized behaviors are selected to help detect reviews.<br>

*Methodology* <br>
1. Clean Reviewbox text column to tidytext format
    * function normalize_document
2. Create Dataframe subset 
    * Aggregate the dataframe by [Productid (Product Level), Reviewerid (Amazon Profiler Level)].
    * Break the dataframe into multiple sub-dataframes with same productId
3. Turn the subsets to Matrixs
    - function NLP Models
    - use CountVectorizer to turn the into matrix
4. Find similarity for the matrixs
    - function find_text_similarity
    - Filler low similarity records with a pre-define threshold
    


*CountVectorizer has three options*
   - Bag of Word
       * similarity between word
   - Bag of N-grams (We used 2-gram)
       * similarity between word pairs
   - TF-IDF (term frequency, document frequency)

In [0]:
#load_NLP_packages
def normalize_document(doc):   # doc is one Review text
    '''
       Input: list of text (Reviewbox provided text)
       Output: Cleaned text
        
       sample: 
           texts=df['text'].apply(str)
           normalize_corpus=np.vectorize(normalize_document)
           texts_clean= normalize_corpus(texts)
       
    ''' 
    # Lemmatizer, tokenizer, stop_words
    lemmatizer = WordNetLemmatizer() 
    stop_words = nltk.corpus.stopwords.words('english')

    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()

    # tokenize & lemmatize document
    tokens = [lemmatizer.lemmatize(word,pos="v") for word in word_tokenize(doc)]
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc


def NLP_models(norm_corpus,option=0):
    '''
    turn text data into text matrixs
    
    Input: 
    - preprocessed Text data
    - option for CountVectorizer
        * 0 bag of word
        * 1 bag of n grams
        * 2 TF IDF
    
    Output:
    - text matrixs
    '''
    # 1. Define CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    if(option==1):
         # Option 1: Bag of Words
        cv = CountVectorizer(min_df=0.02,max_df=0.99,max_features=300)
    
    elif(option==2):
        # Option 2: Bag of 2-grams
        cv = CountVectorizer(ngram_range=(2,2))
    else:
        # Option 3: TF-IDF
        cv = TfidfVectorizer(min_df=0.02,max_df=0.99,max_features=300, use_idf=True)

    cv_fit=cv.fit_transform(norm_corpus)
    cv_matrix = cv.fit_transform(norm_corpus)
    cv_matrix = cv_matrix.toarray()

    # get all unique words in the corpus
    vocab = cv.get_feature_names()

    # show document feature vectors
    cv_matrix_df=pd.DataFrame(np.round(cv_matrix,2), columns=vocab)
    return cv_matrix_df


def find_text_similarity(norm_corpus,model_option=2,bench_mark=0.5):
    '''
    '''
    cv_matrix_df=NLP_models(norm_corpus,option=model_option)
    similarity_matrix = cosine_similarity(cv_matrix_df)
    similarity_df=pd.DataFrame(cosine_similarity(similarity_matrix))
    index= (similarity_df[similarity_df>bench_mark].notna().sum() !=1)==True
    true_index=index[index==True]
    
    return(true_index)

In [0]:
texts=profile_urls['text'].apply(str)
normalize_corpus=np.vectorize(normalize_document)
texts_clean= normalize_corpus(texts)

### Incentivized Review

In [0]:
def create_incentivized_words():
    
    incentivized_words=["Free collar"
    ,"Free collar offer"
    ,"Free one"
    ,"Free product"
    ,"Free dog collar for a positive review"
    ,"Free second collar"
    ,"Free gift"
    ,"Additional free chargers for a positive review"
    ,"Promised a free collar"
    ,"Another free"
    ,"In exchange for a positive review"
    ,"In exchange for a review"
    ,"If you review"
    ,"If I reviewed the product"
    ,"Write a review"
    ,"Writing a review"
    ,"Leave us a review"
    ,"Leave a review"
    ,"Positive review"
    ,"If I Left a review"
    ,"Reviews are paid"
    ,"Review in return"
    ,"For a review"
    ,"For our review"
    ,"For my review"
    ,"Leave a 5 star review"
    ,"Incentive"
    ,"Incentivized"
    ,"Gift card"
    ,"Inside the packaging was a flyer"
    ,"Flyer"
    ,"Bribe"]
    cleaned= normalize_corpus(incentivized_words)
    cleaned=cleaned[cleaned!='review']

    incentivized_words_list=[]
    [incentivized_words_list.append(x) for x in cleaned if x not in incentivized_words_list]
    print(incentivized_words_list)
    return(incentivized_words_list)

In [0]:
# Use the word list above
incentivized_words_cleaned=create_incentivized_words()

# Create an incentivized reviews list
vector=[]
for text in texts_clean:
    if any(word in text for word in incentivized_words_cleaned):
        vector.append(1)
    else:
        vector.append(0)
print("Total incentivzed reviews= {} ".format(sum(vector)))

print("{} percent of the reviews are incentivized ".format(sum(vector)/profile_urls.shape[0]*100))
profile_urls['incentivized']=vector

['free collar', 'free collar offer', 'free one', 'free product', 'free dog collar positive review', 'free second collar', 'free gift', 'additional free chargers positive review', 'promise free collar', 'another free', 'exchange positive review', 'exchange review', 'review product', 'write review', 'leave us review', 'leave review', 'positive review', 'review pay', 'review return', 'leave star review', 'incentive', 'incentivized', 'gift card', 'inside package flyer', 'flyer', 'bribe']
Total incentivzed reviews= 294 
0.44862208929715874 percent of the reviews are incentivized 


In [0]:
# Save output to csv
incentivized_review = profile_urls[['reviewid','incentivized']]
incentivized_review.set_index('reviewid',inplace=True,drop=True)
incentivized_review.to_csv("sample_outputs/incentivized_review.csv",index=True)

### Phrase repetition

In [0]:
df = profile_urls.copy()
df = df.dropna(axis=1,thresh=len(df)*0.9)
print(df.head(3))

      source     product                                   name  \
0  amazon.ca  B078N8NR7G       PetSafe 900 Meter Remote Trainer   
1  amazon.ca  B01HO8U5NC  Drinkwell Platinum Pet Fountain 168oz   
2  amazon.ca  B01HO8U5NC  Drinkwell Platinum Pet Fountain 168oz   

               date        status     sentiment         topic  \
0  2018-12-20 08:00  Not Assigned  Not Assigned  Not Assigned   
1  2019-04-13 08:00  Not Assigned  Not Assigned  Not Assigned   
2  2019-04-02 08:00  Not Assigned  Not Assigned  Not Assigned   

            author  Verified_Purchases  stars  ...  \
0  Amazon Customer                   1      1  ...   
1         nathalie                   1      5  ...   
2       Conure Mum                   1      4  ...   

                            title  \
0                   Remboursement   
1  La livraison très rapide merci   
2            Awesome for my birds   

                                                text image video  \
0      Produit disfonctionnel. J'exi

In [0]:
# Check number of products
num_products= len(df['product'].value_counts())
print('there are {} products'.format(num_products))
unique_product_list=df['product'].unique()

# Dictionary to Store Products
product_dict={}
for product in unique_product_list:
    product_dict[product]= df.loc[df['product']==product,]

print("The product we are interested is {}".format(product))
print("\n")
print(product_dict[product].head(3))
sub_df= product_dict[product]
text=sub_df['text']
text.reset_index(drop=True, inplace=True)
###    Finish Preparing Text

normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(text)

# Run TF-IDF model, get 
cv_matrix_df=NLP_models(norm_corpus,option=0)

# Use cosine similarity
similarity_matrix = cosine_similarity(cv_matrix_df)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df.head(3)

there are 962 products
The product we are interested is B00074L4UO


       source     product                                               name  \
65365  amazon  B00074L4UO  PetSafe Gentle Leader Headcollar, No-Pull Dog ...   
65366  amazon  B00074L4UO  PetSafe Gentle Leader Headcollar, No-Pull Dog ...   
65367  amazon  B00074L4UO  PetSafe Gentle Leader Headcollar, No-Pull Dog ...   

                   date        status     sentiment         topic  \
65365  2019-12-26 08:00  Not Assigned  Not Assigned  Not Assigned   
65366  2019-12-25 08:00  Not Assigned  Not Assigned  Not Assigned   
65367  2019-12-22 08:00  Not Assigned  Not Assigned  Not Assigned   

                author  Verified_Purchases  stars  ...  \
65365  SpringersRGreat                   1      5  ...   
65366            Jason                   1      1  ...   
65367          A. Bell                   1      3  ...   

                                   title  \
65365                   A big difference   
65366   I gu

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,159,160,161,162,163,164,165,166,167,168
0,1.0,0.242133,0.144538,0.0,0.0,0.0,0.0,0.0,0.018503,0.134383,...,0.055782,0.0,0.0,0.0,0.040905,0.089112,0.0,0.127143,0.0,0.0
1,0.242133,1.0,0.058671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.194382,0.0,0.387328,0.0,0.129582,0.0,0.0,0.070571,0.0,0.0
2,0.144538,0.058671,1.0,0.0,0.0,0.41045,0.250534,0.129391,0.253577,0.213089,...,0.160162,0.025428,0.06546,0.022507,0.126007,0.110065,0.0,0.15707,0.075348,0.082393


In [0]:
# Char length for each review
df['text_len']=df['text'].apply(str).apply(lambda x:len(x))
temp=df.copy()

In [0]:
# filter records for text length<30
df=df[df['text_len']>=30]
print("{} percentage of reviews keeped".format(df.shape[0]/temp.shape[0]))

0.8765831476790673 percentage of reviews keeped


In [0]:
# Find reviews with problematic phrase repetition
problem_review_id=[]

for product in unique_product_list:
    try:
        print(product)
        product_dict[product]= df.loc[df['product']==product,]
        sub_df= product_dict[product]
        text=sub_df['text'].apply(str)
        text.reset_index(drop=True, inplace=True)
        norm_corpus = normalize_corpus(text)
        index=find_text_similarity(norm_corpus)
        problem_review_id.append(sub_df.iloc[index.index,][["reviewid"]].values.tolist()) 
    except:
        continue

B078N8NR7G
B01HO8U5NC
B01HI5ZXN8
B01HB7N5ZQ
B078N83GS4
B078N564WT
B078N3JVYV
B01GCAS5SK
B01GCAS4VS
B01GCAS4RM
B01J18Z1BO
B01J18Z1AU
B01E6TI2DC
B01E6TI1Q0
B01GCAS4JA
B01EYK74FK
B01ESR0PT6
B078N35M1S
B01GCAS4MC
B01GCAS4KE
B01GCAS4K4
B01ESR0O5G
B01ESR0OAQ
B01EA7E88I
B01EA7E766
B01ESR0MAI
B01ATSHB5E
B01ESR0MSU
B01ESR0MR6
B01CZ6VENI
B01B1FT4H2
B01E6THUR6
B01E6THUK8
B01E6THUJE
B01ATS8NUQ
B01E6THUIU
B01E6THU30
B01DGEGIPW
B015TNVVGY
B014COTASW
B0188Y676U
B017N6IF5U
B0167GU9AG
B015TNW2FS
B015TNW12C
B015TNW0Z0
B019I1ZTXY
B019I1ZTKC
B0188Y67J2
B015TNVZEW
B015TNVYRU
B015TNVYP2
B012F869RM
B010E08V06
B015TNW0GE
B015TNW01O
B015TNVZHE
B014COTAK0
B014COTA6E
B014COTA46
B00ZCFPHO2
B00ZCFPH56
B00YHPNWWC
B015TNVY0M
B015TNVXQ2
B015TNVXAI
B00ZEGHU8A
B00ZEGHS4G
B00ZEGHR10
B015TNVWWM
B015TNVWEK
B00VIXRB6O
B00UTIASZ0
B00T88U5DC
B00YHPNS8U
B00MPE5KFY
B00MPE5KCM
B00VPYYR9A
B00VPYYR8G
B00VPYYQZA
B00QV5GF34
B00QTCUV0C
B00Q52H0DW
B00OZMOR26
B00OZMOQM2
B00OH46TSW
B00MPE5U2W
B00MPE5PAO
B00MPE5P5O
B00MPE5JZA
B00MPE5FUY

B00CMLS0VG
B000A27NGW
B000LXY3CC
B00F0JD184
B0007RD9O0
B00B17ETPI
B000LXVYM4
B00CW9XWTI
B000LXW0YA
B000LXU3N0
B000LXU3NA
B01171OR6I
B000RXY4H0
B00IAOB4VC
B00B732D2W
B00CW9XWXE
B000RXVJEQ
B00B17ETPS
B00B17ETR6
B000LY0XWU
B0016HNU12
B00LPFP31A
B000241NRI
B00CZ7HP4A
B00CZ7HP5O
B00CZ7HP68
B00QGYMAIY
B00VPYYY16
B00CZ7HO1Y
B00CZ7HO3W
B00CZ7HOS2
B0016HPTFW
B01MYBV6FN
B00CZ7HE4Q
B00CZ7HE9G
B00CZ7HFBS
B01K4KYZL0
B00B23AUVS
B00B17ETNU
B00C1FI63A
B00LHUWS6Q
B008LUKBGE
B008LUKC7W
B00I04Y7RA
B00L51ZQHU
B0752XP3R5
B073FV5LVW
B0011F4WWK
B004WO90E2
B008LUKARE
B075T6VM7W
B01ATS8NY2
B01ATS8OP0
B01ATS8EVY
B01ATS8JFU
B00WFKJWNY
B01ATS8JH8
B01ATS8ESM
B00U2P342E
B00US6U6ZU
B00W8GDDBM
B01ATS8JKU
B01ATS8OK0
B00VKW57VE
B00VPVJKMM
B00W8GGDQ4
B00W8GJK64
B00VKW6Y2U
B00SX8JQR4
B00T3X1W52
B00VKVZAZI
B00VV5TG08
B00W8G9NBG
B00VKW1R2C
B00VKW3GIA
B00QRSA540
B00QHID8VC
B00QHID92K
B00VKVHBT6
B00PJ8RQR8
B00PJ8RFI8
B00PJ8RGNW
B00RKFK3Q4
B00S8JW1T8
B00PJ8NDB6
B00DQXR42U
B00DSOMPEY
B00E1T0CAO
B00DLQWSCI
B00DLQWYZO
B00DLQWSV4

In [0]:
def get_flattened_list(lst):
    flattened_list = []
    #flatten the lis
    for x in lst:
        for y in x:
            flattened_list.append(y)
    return(flattened_list)

problem_review_id=get_flattened_list(get_flattened_list(problem_review_id))

In [0]:
# Create the new column
new_col=df['reviewid'].apply(lambda x: 1 if x in problem_review_id else 0)
df['Contains_rep_phrases']=new_col

In [0]:
df.head()

Unnamed: 0,source,product,name,date,status,sentiment,topic,author,Verified_Purchases,stars,...,image,video,reviewid,reviewlink,parent,inputtime,source_product,incentivized,text_len,Contains_rep_phrases
0,amazon.ca,B078N8NR7G,PetSafe 900 Meter Remote Trainer,2018-12-20 08:00,Not Assigned,Not Assigned,Not Assigned,Amazon Customer,1,1,...,No,No,R31B5G60GS531M,https://www.amazon.ca/review/R31B5G60GS531M/re...,B078Y3F7MG,2018-12-22 06:24,amazon.ca B078N8NR7G,0,45,0
2,amazon.ca,B01HO8U5NC,Drinkwell Platinum Pet Fountain 168oz,2019-04-02 08:00,Not Assigned,Not Assigned,Not Assigned,Conure Mum,1,4,...,Yes,No,R18076F5C879LP,https://www.amazon.ca/review/R18076F5C879LP/re...,B01MQ1JZ3L,2019-04-04 12:02,amazon.ca B01HO8U5NC,0,172,0
3,amazon.ca,B01HO8U5NC,Drinkwell Platinum Pet Fountain 168oz,2019-03-31 08:00,Not Assigned,Not Assigned,Not Assigned,Wayne Smith,1,4,...,No,No,RLA1DFN3DCSFJ,https://www.amazon.ca/review/RLA1DFN3DCSFJ/ref...,B01MQ1JZ3L,2019-04-02 20:35,amazon.ca B01HO8U5NC,0,54,0
8,amazon.ca,B01HO8U5NC,Drinkwell Platinum Pet Fountain 168oz,2018-01-08 08:00,Not Assigned,Not Assigned,Not Assigned,Lisa,1,5,...,No,No,R3HA5C606GWKN,https://www.amazon.ca/review/R3HA5C606GWKN/ref...,B01MQ1JZ3L,2018-01-08 18:43,amazon.ca B01HO8U5NC,0,138,0
9,amazon.ca,B01HO8U5NC,Drinkwell Platinum Pet Fountain 168oz,2017-10-21 08:00,Not Assigned,Not Assigned,Not Assigned,mika jiang,1,1,...,No,No,R2OHNULSZ4ARHS,https://www.amazon.ca/review/R2OHNULSZ4ARHS/re...,B01MQ1JZ3L,2017-10-23 06:09,amazon.ca B01HO8U5NC,0,252,0


In [0]:
# Save output to csv
repetitive_phrase = df[['reviewid','Contains_rep_phrases']]
repetitive_phrase.to_csv("sample_outputs/Contains_rep_phrases.csv",index=False)

In [0]:
# Add this feature to the main dataset
repetitive_phrase.set_index('reviewid',drop=True,inplace=True)
merged_profile_highvol_wordcnt_text = pd.merge(merged_profile_highvol_wordcnt,repetitive_phrase,how='left', left_index=True, right_index=True)
merged_profile_highvol_wordcnt_text = pd.merge(merged_profile_highvol_wordcnt_text,incentivized_review,how='left', left_index=True, right_index=True)

In [0]:
merged_profile_highvol_wordcnt_text.head()

Unnamed: 0_level_0,author,source,product,profile,Verified_Purchases,source_product,profile_id,Non_Verified_Purchases,helpful_votes,name,...,single_day,avg_rating,Easy_grade_rating,mode_number,samedate_20,totalwords,Overrep_wrd_cnt,high_vol_day_rev,Contains_rep_phrases,incentivized
reviewid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
R1001WAW3T7HTQ,gilles gaujard,amazon.fr,B0756GW8SJ,https://www.amazon.fr/gp/profile/amzn1.account...,1,amazon.fr B0756GW8SJ,AFMS5OAIEA7PLFXGKOCG54VEK5SA,0,30,gilles gaujard,...,0.0,3.333333,0.0,2.0,0.0,60.0,0.0,0.0,0.0,0
R1004ZE6L2BXTF,Cindy Oliver,amazon.ca,B005CO91TK,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B005CO91TK,AHXTUBEEM6W63JGMG7QEBVXHDYVQ,0,0,Cindy Oliver,...,1.0,5.0,1.0,1.0,0.0,4.0,0.0,0.0,,0
R1007O54FB5M3J,Amazon Kunde,amazon.de,B004MXPIMQ,https://www.amazon.de/gp/profile/amzn1.account...,1,amazon.de B004MXPIMQ,AH2YDI2ZQEPTYU4FNIGZJZGF2IEQ,0,2,Amazon Kunde,...,0.0,3.8,0.0,8.0,0.0,14.0,0.0,0.0,0.0,0
R100C3T3PQ3L1G,FreshRandy,amazon.ca,B002RT8M8O,https://www.amazon.ca/gp/profile/amzn1.account...,1,amazon.ca B002RT8M8O,AFD6AWSQDRBABCE3XQIREVTJTC6A,0,1,FreshRandy,...,0.0,3.5,0.0,5.0,0.0,2.0,0.0,0.0,,0
R100JBMVROD5NL,James Palovich,amazon,B00VPVGAPM,https://www.amazon.com/gp/profile/amzn1.accoun...,1,amazon B00VPVGAPM,AHSHQ4GUFFMMSLADSLRPRAQMJ3CQ,0,4,James Palovich,...,0.0,3.75,0.0,2.0,0.0,8.0,0.0,0.0,0.0,0


## Overlap History

This feature is created to detect whether certain reviewers have overlapping review histories with others. If a reviewer has >= 3 items(products) in his/her history that are same as another reviewer, than we will flag **the reviewer**.

Possible bias: without the full review history of a certain reviewer, we can only tag according to the reviews we scraped for a certain brand. The result will be more reliable if the full review page of a customer is available.

**Input**: <br>
'RSC reviews with profile ids.csv'( = profile_urls in the previous session)<br>

In [0]:
# use the dataset with the profile links
profiles = profile_urls_useful.copy()

# Create a subset to contain only reviewer and the products they reviewed
sub=profiles[['profile_id','product']]
sub.groupby('profile_id').agg({'product':len}).sort_values('product',ascending=False)

Unnamed: 0_level_0,product
profile_id,Unnamed: 1_level_1
AHSG62PGZDRYBGJCXEDUDR5NZ77Q,7
AEFZNPOYVG3HRDUZLC6GNYO37Y7Q,7
AHHRSOIFKMBHDPPP25MB7KJ7ZQ5A,7
AGBD44L3D2EI457VKT4XHMJ44HIA,7
AE3UUDONQZB7F2RCS4QAA6ZDN6TQ,7
AF5N7TVAEBTUBTMHUT6YF74YXXCA,7
AHKMAG5Y5DJ3IMJGK3KRHBUDXAHA,7
AE7QE73NQZWH2HS4U5YWIX2FI5IA,6
AFHK7X5ZHGSCZZ5WDMCZ6E74IKQA,6
AFDQ6KXDEM4YCII3362IQX7T2U3A,6


In [0]:
profiles.shape

(26769, 7)

In [0]:
# Create an author list
author_list = list(sub['profile_id'].unique())

In [0]:
len(author_list)

25142

In [0]:
# Create a product list that shows the set of products that an author reviewed
product_list = []
for i in range(0,len(author_list)):
    author_name = sub['profile_id'][i]
    products = list(sub[sub['profile_id']==author_name]['product'].unique())
    product_list.append(products)

In this part we will compare each random 2 pairs of product lists from different authors, <br>and get all the authors that have >= 3 overlapping review histories with others.

In [0]:
# Define a function to find out for each pair of author, how many products overlap
def find_overlay(s1,s2):
    return len([x for x in list(s1) if x in list(s2)])

# Choose only the pairs with more than two products overlapping
pairs = []
for i in tnrange(-1,len(author_list)-2):
    i += 1
    j = i + 1
    for j in range(i+1,len(author_list)):
        if find_overlay(product_list[i], product_list[j]) >= 3:
            pairs.append([i,j])
        else:
            pass
        j += 1

  import sys


HBox(children=(FloatProgress(value=0.0, max=25141.0), HTML(value='')))




In [0]:
# Retrieve only the unique reviewers in the pairs
pairs = pd.DataFrame(pairs)
pairs.columns = [['author','overlap_author']]

list1 = pairs.overlap_author.values.tolist()
list2 = pairs.author.values.tolist()
list1.extend(list2)
row = list(np.unique(list1))

In [0]:
len(row)

318

In [0]:
# Change from number to the respective names of authors
overlap_author_ids = [author_list[x] for x in row]
overlap_author_ids

['AHY5J5UKSO5IVCBHTPBCJ57VAQDQ',
 'AHCFSPAAGVRKSZBNAVCIBUVSRC7Q',
 'AGDABQI2NMUCDNUGNX4L2NXJ5TRA',
 'AFU5IX7VB4GCEU7WTG3QKEFGGAEA',
 'AEHXC2Z2EMMVS4CBXJLLRZ2ZZJRA',
 'AFRBHN26LK5X36CTUJQGVGUC73TA',
 'AHHCY5AG764TZCBGJVYKRTQXGCLA',
 'AHEH4UQ6WDQSD4RXMNWX4OLWI2EA',
 'AFMMQ7M242BQVWGU2RRV6ZX2WEVQ',
 'AEUZCG56PRVEZCCYYAXRL7E5G7XA',
 'AGS6K7KXB76QNAQDNBIFM5UJ42ZA',
 'AEDCQ7APA3OPPBV43O3RTCFB7A4A',
 'AEDMOPVM3FJLBUBVEZLNGMVJO7EA',
 'AH6HNYZXCHCY5VWWPC7CUJRNPXIA',
 'AF4QFVMCKRG5DBOIVM33AXI3BA2Q',
 'AG7H2QQTHWGIPIYJDZI3YZDXSFQA',
 'AHPJUH5GZ3ZHAOUXTKUWU3CJGRHA',
 'AGGXCQOHSVI6EWWD6EDYNY5KSXMA',
 'AHZF3VWQZQWLYPIVUX2MQGREDAOA',
 'AEAJ47N7SV7SXTRXJGGSU3V7M63Q',
 'AHNHK52FFJKKSHMDUOVTRKH4LHOA',
 'AGEXLIX6ARYO6CW6GONDMOSFKDEA',
 'AGBJUZOQYPKATDXUNWBADC4AFQXA',
 'AH36E6GFODCDRGUDAX7QQZYLSLSA',
 'AFVA4WBSP4YGUZWCMNSHOJNNWXFA',
 'AEQIZDOQGAUTUV3X5DCRVJWNHA5A',
 'AFGQIWI2MWQJFSEEA2626K6I34JA',
 'AEM3DZD5G2BNCNJ2SPLFUEXFZ7WQ',
 'AFGNVG5O5GJ6QEYXMCDFVHGYOYVQ',
 'AHME4KBN7QAHCAVLDMAI4J7JCIIA',
 'AFCW6LV3

In [0]:
# Tag these authors with overlapping history labels
overlap_to_join = pd.DataFrame(overlap_author_ids)
overlap_to_join.columns = ['profile_id']
overlap_to_join['Overlapping_rev_history'] = 1

In [0]:
# join with the review table
overlaps = pd.merge(profiles, overlap_to_join, how='left', on='profile_id')
overlaps.Overlapping_rev_history = overlaps.Overlapping_rev_history.fillna('0')
overlaps

Unnamed: 0,author,source,reviewid,product,profile,Verified_Purchases,profile_id,Overlapping_rev_history
0,Amazon Customer,amazon.ca,R31B5G60GS531M,B078N8NR7G,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHXGHQLLRKPURPVJYTWO7I7AB4DA,0
1,nathalie,amazon.ca,RRCY4V48RQBXG,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHNM2DIYUKF6P6JQOU2KNOKRPEPQ,0
2,Conure Mum,amazon.ca,R18076F5C879LP,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHOEGFFK7INBIPCDO7PKPUTZMRXQ,0
3,Wayne Smith,amazon.ca,RLA1DFN3DCSFJ,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AGZ476P3EVM6UAENKIUUUYSA4J4A,0
4,Rob Self,amazon.ca,R3F4GS6FDS5ALH,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AGHAHXXQGDC6W4B6RGY6D2ILUJPA,0
5,Richard Goods,amazon.ca,R1KDTNCKJ3DCV2,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AGEIDBVV56UXQVRXJEAZUVZLAAIA,0
6,Jerome Tanguay,amazon.ca,R10TY9YVK98S85,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AELNNSQ7AJWS6USZ3H6PBPPBSUOQ,0
7,Tammy Roode,amazon.ca,R1ZRSF0QANSTRY,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AHL63IJNR23UZN3UKHES45H3QURQ,0
8,Lisa,amazon.ca,R3HA5C606GWKN,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AESVWPHBGIQHV25TV4KVVRFVOUJA,0
9,mika jiang,amazon.ca,R2OHNULSZ4ARHS,B01HO8U5NC,https://www.amazon.ca/gp/profile/amzn1.account...,1,AFMRFPAP2N7DX2LTKCYWNXP6R3FA,0


In [0]:
overlaps.to_csv('sample_outputs/Overlapping_rev_history.csv',index=False)

In [0]:
# Keep only the columns in interest:
overlap_labeled = overlaps[['reviewid','profile_id','Overlapping_rev_history','text']]
overlap_labeled.set_index('reviewid',drop=True,inplace=True)

In [0]:
# Merge with the full dataset
full_merged_data = pd.merge(merged_profile_highvol_wordcnt_text,overlap_labeled,how='left', left_index=True, right_index=True)

# Export data
full_merged_data.to_csv('sample_outputs/full_merged_data_RSC.csv',index=True)