# Shopee competition 1: Order Brushing Detection

**Task**
1. Identify all shops that are deemed to have conducted order brushing.
2. For each shop that is identified to have conducted order brushing, identify the buyers suspected to have conducted order brushing for that shop.

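For the purpose of this question, a shop is deemed to have conducted order brushing if its concentrate rate (the number of orders within a one-hour window divided by the number of unique buyers in that window) is greater than or equal to 3 at any instance.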

[link to kaggle](https://www.kaggle.com/c/order-brushing-shopee-code-league)
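
As a small illustration of the threshold, here is a minimal sketch of the concentrate rate for a single one-hour window. It assumes the rate is the number of orders in the window divided by the number of unique buyers in it; the helper name `concentrate_rate` and the toy data are hypothetical, not part of the competition material.

```python
import pandas as pd

def concentrate_rate(window_df):
    """Orders in the window divided by unique buyers in the window.

    Assumes window_df is non-empty and already restricted to one shop
    and one 1-hour window.
    """
    return len(window_df) / window_df["userid"].nunique()

# Toy window: 4 orders, all placed by the same buyer
single_buyer = pd.DataFrame({"orderid": [1, 2, 3, 4],
                             "userid": [501, 501, 501, 501]})

# Toy window: 3 orders spread over 2 buyers
two_buyers = pd.DataFrame({"orderid": [5, 6, 7],
                           "userid": [501, 501, 502]})

rate_single = concentrate_rate(single_buyer)   # 4 orders / 1 buyer
rate_two = concentrate_rate(two_buyers)        # 3 orders / 2 buyers

print(rate_single >= 3, rate_two >= 3)
```

Only the first window reaches the threshold of 3, since the second spreads its orders over two buyers.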

In [112]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load dataset

In [113]:
# the .csv file was downloaded to my local drive
df = pd.read_csv('order_brush_order.csv')
df.head()

Unnamed: 0,orderid,shopid,userid,event_time
0,31076582227611,93950878,30530270,2019-12-27 00:23:03
1,31118059853484,156423439,46057927,2019-12-27 11:54:20
2,31123355095755,173699291,67341739,2019-12-27 13:22:35
3,31122059872723,63674025,149380322,2019-12-27 13:01:00
4,31117075665123,127249066,149493217,2019-12-27 11:37:55


# Data Exploration

In [114]:
print("Dataset: {} rows & {} columns".format(df.shape[0], df.shape[1]))

Dataset: 222750 rows & 4 columns


In [115]:
# change 'event_time' datatype to datetime

df['event_time'] = pd.to_datetime(df['event_time'])

In [116]:
# add another column -> 'ts'
# this parses 'event_time' into a numeric Unix timestamp (resolution down to the second)

df['ts'] = df['event_time'].apply(lambda x: x.timestamp()).astype(int)
df.head()

Unnamed: 0,orderid,shopid,userid,event_time,ts
0,31076582227611,93950878,30530270,2019-12-27 00:23:03,1577406183
1,31118059853484,156423439,46057927,2019-12-27 11:54:20,1577447660
2,31123355095755,173699291,67341739,2019-12-27 13:22:35,1577452955
3,31122059872723,63674025,149380322,2019-12-27 13:01:00,1577451660
4,31117075665123,127249066,149493217,2019-12-27 11:37:55,1577446675
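
The row-by-row `apply` above can also be expressed as a single vectorized operation. The sketch below is one possible alternative, shown on two toy timestamps taken from the tables in this notebook; like `Timestamp.timestamp()` on timezone-naive values, it assumes no timezone conversion is wanted.

```python
import pandas as pd

# Two timestamps that appear in the notebook's own output tables
event_time = pd.to_datetime(pd.Series(["2019-12-27 00:00:00",
                                       "2019-12-27 00:23:03"]))

# Seconds elapsed since the Unix epoch, computed for the whole column at once
epoch = pd.Timestamp("1970-01-01")
ts = (event_time - epoch) // pd.Timedelta(seconds=1)

print(ts.tolist())
```

On the full dataset this avoids calling a Python lambda once per row.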


In [117]:
# sort dataframe in ascending time order

df = df.sort_values(by = 'event_time', ascending = True)
df.head()

Unnamed: 0,orderid,shopid,userid,event_time,ts
150060,31075200506751,6042309,97707522,2019-12-27 00:00:00,1577404800
25985,31075200506752,104804492,97707522,2019-12-27 00:00:00,1577404800
81020,31075200506753,8715449,97707522,2019-12-27 00:00:00,1577404800
146598,31075201870570,190969466,170182475,2019-12-27 00:00:02,1577404802
80355,31075205798264,2859407,12532131,2019-12-27 00:00:05,1577404805


#### Compile list of unique shopid   
There are 18770 shops in total.   
However, most shops have only a few orders:    
half of them have 2 or fewer, and 75% have 8 or fewer.

In [118]:
unique_shopid = df['shopid'].unique()
print(f"Total shops: {len(unique_shopid)}")

df.groupby('shopid').size().describe()

Total shops: 18770


count    18770.000000
mean        11.867342
std        104.244723
min          1.000000
25%          1.000000
50%          2.000000
75%          8.000000
max      11703.000000
dtype: float64

# Data Analysis

#### Compile list of shops that have at least 3 orders

In [119]:
order_counts = df.groupby('shopid').size()    # orders per shop, computed once
shopid_over3_dict = order_counts[order_counts >= 3].to_dict()
shopid_over3 = shopid_over3_dict.keys()

print(f"{len(shopid_over3)} shops have >= 3 orders")


9031 shops have >= 3 orders


In [120]:
# take a peek at the first shopid in the list that has >= 3 orders
df[df['shopid'] == list(shopid_over3)[0]]

Unnamed: 0,orderid,shopid,userid,event_time,ts
188513,31195675919209,10061,168750452,2019-12-28 09:27:55,1577525275
174620,31205132327893,10061,194819216,2019-12-28 12:05:32,1577534732
200439,31412115824794,10061,130633421,2019-12-30 21:35:15,1577741715
145426,31431527100615,10061,62464559,2019-12-31 02:58:48,1577761128


#### Define function to extract the keys of a dict whose value is 3 or more    
This function will be applied to a dict, where      
key => userid   
value => number of times the userid appears in the df

In [121]:
def find_brush(df_dict):
    """Return the keys (userid) whose value (order count) is at least 3."""
    key_list = set()
    for key, value in df_dict.items():
        if value >= 3:
            key_list.add(key)
    return list(key_list)
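
To make the filtering step concrete, here is a compact sketch of the same idea on a toy dict. The name `find_brush_compact`, the `threshold` parameter and the sample counts are hypothetical and used only for illustration.

```python
def find_brush_compact(order_counts, threshold=3):
    """Return the userids whose order count is at least `threshold`."""
    return [user for user, n in order_counts.items() if n >= threshold]

# Toy input: key = userid, value = number of orders in the window
counts = {111: 1, 222: 3, 333: 5}
flagged = find_brush_compact(counts)

print(flagged)
```

Because dict keys are already unique, the list comprehension needs no intermediate set.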

#### The code block below detects order brushing   
1. Loop over the shopids with at least 3 orders.
2. Build a dataframe for each shopid (i.e. shop_df).
3. Loop through each row of shop_df and slice out a smaller dataframe (i.e. tmp_df) covering the 1-hour span that starts at that row. Flag any userid with 3 or more orders in tmp_df as brushing.

In [122]:
timespan = 3600    # 1-hour interval, in seconds
ans_dict = dict()  # placeholder to record the shopids that have brushing orders
n_shops = len(shopid_over3)


# Loop through the shopids with at least 3 orders:
for count, shop in enumerate(shopid_over3):
    shop_df = df[df['shopid'] == shop]
    len_shop = shop_df.shape[0]

    if count % 1000 == 0:
        print(f"{count} of {n_shops}")

    # record the userid if brushing is found
    brush_userid = set()

    # slide a 1-hour window that starts at each row in turn
    # brushing needs at least 3 orders, so the last 2 rows need not start a window
    for row in range(len_shop - 2):
        start_ts = shop_df['ts'].iloc[row]
        mask_1hr = shop_df['ts'] <= start_ts + timespan
        mask_current = shop_df['ts'] >= start_ts    # this excludes the earlier rows

        # combine both masks in one step; chained indexing (shop_df[m1][m2])
        # triggers a pandas reindexing UserWarning
        tmp_df = shop_df[mask_1hr & mask_current]     # dataframe of the 1-hour span

        # brushing is recorded in 2 scenarios
        # scenario one: exactly 3 orders, all from the same userid
        if tmp_df.shape[0] == 3 and tmp_df['userid'].nunique() == 1:
            brush_userid.update(tmp_df['userid'])

        # scenario two: more than 3 orders
        elif tmp_df.shape[0] > 3:
            user_dict = tmp_df.groupby('userid').size().to_dict()    # key=userid, value=count of userid
            brush_id = find_brush(user_dict)      # use the 'find_brush' function to get userids with >= 3 orders
            if len(brush_id) > 0:
                brush_userid.update(brush_id)

    if len(brush_userid) != 0:
        ans_dict[shop] = list(brush_userid)
  

0 of 9031
1000 of 9031
2000 of 9031
3000 of 9031
4000 of 9031
5000 of 9031
6000 of 9031
7000 of 9031
8000 of 9031
9000 of 9031
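
The loop above builds two boolean masks over the whole shop for every order, which gets slow for shops with thousands of orders. As a hedged alternative, the sketch below locates each window's boundaries with `np.searchsorted` on the sorted timestamps. The function name `brush_users_for_shop` and the toy inputs are hypothetical; it is intended to flag the same users (3 or more orders inside some window `[ts, ts + timespan]`), but it has not been run against the competition data.

```python
import numpy as np

def brush_users_for_shop(ts, users, timespan=3600):
    """Flag users with >= 3 orders in any window [ts[i], ts[i] + timespan].

    Assumes `ts` is sorted ascending and `users` is aligned with it.
    """
    ts = np.asarray(ts)
    users = np.asarray(users)
    # first index of each window: earliest order with the same timestamp
    starts = np.searchsorted(ts, ts, side="left")
    # one past the last index of each window: first order later than ts[i] + timespan
    ends = np.searchsorted(ts, ts + timespan, side="right")

    flagged = set()
    for lo, hi in zip(starts, ends):
        if hi - lo < 3:          # fewer than 3 orders cannot contain a brusher
            continue
        ids, counts = np.unique(users[lo:hi], return_counts=True)
        flagged.update(ids[counts >= 3].tolist())
    return flagged

# Toy shop: user 1 places 3 orders within an hour, users 2 and 3 one each
toy_flagged = brush_users_for_shop([0, 10, 20, 5000, 5001], [1, 1, 1, 2, 3])
print(toy_flagged)
```

Each window is then a contiguous slice of the arrays, so no per-row mask over the full shop dataframe is needed.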


#### Count how many shops are brushing:

In [123]:
len(ans_dict)

393

#### Define function to join the userids when more than 1 userid is detected brushing

In [81]:
# This function joins all the items in the input list with "&"
# e.g. [11, 22, 33] => '11&22&33'
# (the parameter is named 'values' so it does not shadow the built-in 'list')
def join_dict_value(values):
    return '&'.join(str(i) for i in values)
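
Because the userids are collected in a set, the order in which they are joined is arbitrary. If the submission format expects the ids in ascending order (an assumption here, not something this notebook states), a sorted variant could look like the sketch below; `join_ids_sorted` is a hypothetical name.

```python
def join_ids_sorted(user_ids):
    """Join user ids with '&' after sorting them in ascending order."""
    return "&".join(str(u) for u in sorted(user_ids))

# Toy input: two userids taken from the results table in this notebook
joined = join_ids_sorted({214226569, 198097381})
print(joined)
```

Sorting also makes the output reproducible from one run to the next.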

# Compiled results

In [124]:
brush_shopid = []   # placeholder to save ALL unique shopids
brush_userid = []   # placeholder to save the userids if brushing is detected, else "0"

# ans_dict: key=shopid, value=list of userids detected brushing
for shop in ans_dict.keys():
    if len(ans_dict[shop]) > 1:
        brush_userid.append(join_dict_value(ans_dict[shop]))    # use the 'join_dict_value' function to join all userids
    else:
        brush_userid.append(str(ans_dict[shop][0]))   # index 0 accesses the single item in the list
    brush_shopid.append(shop)

# save the shopids where NO brushing was detected
for shop in unique_shopid:
    if shop not in ans_dict:
        brush_userid.append("0")
        brush_shopid.append(shop)

print("total list of shopid: {}".format(len(brush_shopid)))
print("total list of userid: {}".format(len(brush_userid)))

submission_df = pd.DataFrame(zip(brush_shopid, brush_userid), columns = ['shopid', 'userid'])
submission_df

total list of shopid: 18770
total list of userid: 18770


Unnamed: 0,shopid,userid
0,10159,214988798
1,10287,214226569&198097381
2,10402,77819
3,10536,672345
4,29583,214495220&215275495
...,...,...
18765,173249173,0
18766,212058195,0
18767,577768,0
18768,162561288,0
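
The same table can also be assembled without explicit loops by mapping each shopid to its joined userids and filling the remaining shops with "0". This is a minimal sketch on toy stand-ins for `ans_dict` and `unique_shopid`; the names `ans`, `all_shops`, `joined_ids` and `submission` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: two shops with detected brushing, one without
ans = {10159: [214988798], 10287: [214226569, 198097381]}
all_shops = pd.Series([10159, 10287, 577768], name="shopid")

# shopid -> '&'-joined userids for the shops with detected brushing
joined_ids = {shop: "&".join(str(u) for u in users) for shop, users in ans.items()}

# shops missing from joined_ids map to NaN, which is then replaced by "0"
submission = pd.DataFrame({
    "shopid": all_shops,
    "userid": all_shops.map(joined_ids).fillna("0"),
})
print(submission)
```

Unlike the loop version, this keeps the shops in their original order instead of listing the brushing shops first.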


In [101]:
# save to csv file
submission_df.to_csv('brushing_submission.csv', index = False)