# Order Brushing Detection

Task
1. Identify all shops that are deemed to have conducted order brushing.
2. For each shop that is identified to have conducted order brushing, identify the buyers suspected to have conducted order brushing for that shop.

For the purpose of this question, a shop is deemed to have conducted order brushing if its concentrate rate is greater than or equal to 3 at any instance.
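
The concentrate rate is not defined in this notebook. Assuming the definition from the competition statement (number of orders a shop receives within a 1-hour window divided by the number of unique buyers in that window), a minimal sketch on made-up data looks like this. `concentrate_rate` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

def concentrate_rate(orders: pd.DataFrame) -> float:
    """Orders divided by unique buyers, for one shop within one 1-hour window."""
    return len(orders) / orders["userid"].nunique()

# Made-up window: four orders from buyer 10 and one order from buyer 20
window = pd.DataFrame({
    "orderid": [1, 2, 3, 4, 5],
    "userid": [10, 10, 10, 10, 20],
})

rate = concentrate_rate(window)
print(rate)  # 2.5, which is below the threshold of 3
```

With only buyer 10's four orders in the window the rate would be 4.0, and the shop would be flagged.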

[link to kaggle](https://www.kaggle.com/c/order-brushing-shopee-code-league)

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

## Load data

In [2]:
df = pd.read_csv('order_brush_order.csv')
df.head()

Unnamed: 0,orderid,shopid,userid,event_time
0,31076582227611,93950878,30530270,2019-12-27 00:23:03
1,31118059853484,156423439,46057927,2019-12-27 11:54:20
2,31123355095755,173699291,67341739,2019-12-27 13:22:35
3,31122059872723,63674025,149380322,2019-12-27 13:01:00
4,31117075665123,127249066,149493217,2019-12-27 11:37:55


# Data Exploration

In [3]:
# there are 222,750 transactions in total
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222750 entries, 0 to 222749
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   orderid     222750 non-null  int64 
 1   shopid      222750 non-null  int64 
 2   userid      222750 non-null  int64 
 3   event_time  222750 non-null  object
dtypes: int64(3), object(1)
memory usage: 6.8+ MB


There are 18,770 shops in total, and most of them have only a few orders.   
Half of the shops have 2 or fewer orders (the median is 2).

In [4]:
unique_shopid = df['shopid'].unique()
len(unique_shopid)

18770

In [5]:
#df.groupby('shopid').size().describe()
df.groupby('shopid')['userid'].count().describe()

count    18770.000000
mean        11.867342
std        104.244723
min          1.000000
25%          1.000000
50%          2.000000
75%          8.000000
max      11703.000000
Name: userid, dtype: float64
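
The share of shops at or below a given order count can also be computed directly instead of being read off the quartiles. A small sketch on a made-up table (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for the real order table (values are made up)
toy = pd.DataFrame({
    "shopid": [1, 1, 1, 2, 3, 3, 4],
    "userid": [7, 8, 9, 7, 7, 8, 9],
})

# One row per shop with its number of orders
orders_per_shop = toy.groupby("shopid").size()

# Fraction of shops with at most 2 orders
share_small = (orders_per_shop <= 2).mean()
print(share_small)  # 0.75
```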

Top 10 shops by number of orders

In [6]:
df.groupby('shopid')['userid'].count().sort_values(ascending = False).head(10)

shopid
147941492    11703
61556313      6691
54615708      2640
91799978       707
57189823       687
107921853      684
25924280       655
89768730       607
43412276       529
32504026       516
Name: userid, dtype: int64

# Data Analysis

Convert `event_time` to timestamps

In [7]:
df['event_time'] = pd.to_datetime(df['event_time'])

# find the first and last timestamps in the dataframe; the dataset covers only 5 days
print(f"transaction start: {df['event_time'].min()}")
print(f"transaction end: {df['event_time'].max()}")

transaction start: 2019-12-27 00:00:00
transaction end: 2019-12-31 23:59:56


In [8]:
# hour bucket stored as a string (object dtype), e.g. '27/12/2019-00'
df['day-hr'] = df['event_time'].dt.strftime('%d/%m/%Y-%H')

In [9]:
df.head()

Unnamed: 0,orderid,shopid,userid,event_time,day-hr
0,31076582227611,93950878,30530270,2019-12-27 00:23:03,27/12/2019-00
1,31118059853484,156423439,46057927,2019-12-27 11:54:20,27/12/2019-11
2,31123355095755,173699291,67341739,2019-12-27 13:22:35,27/12/2019-13
3,31122059872723,63674025,149380322,2019-12-27 13:01:00,27/12/2019-13
4,31117075665123,127249066,149493217,2019-12-27 11:37:55,27/12/2019-11
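
The `day-hr` column is a string label for the calendar hour. As an alternative sketch, the same bucket can be kept as a real timestamp with `dt.floor`, which preserves the datetime dtype and sorts chronologically. The sample times below are taken from the first rows shown above.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2019-12-27 00:23:03",
    "2019-12-27 11:54:20",
]))

# String bucket, as used in this notebook
as_string = times.dt.strftime("%d/%m/%Y-%H")

# Timestamp bucket: each time is rounded down to the start of its hour
as_timestamp = times.dt.floor("h")

print(as_string.tolist())
print(as_timestamp.tolist())
```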


In [10]:
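def get_brushers(shopid):
    """Return '&'-joined IDs of users with 3+ orders at the shop in one calendar hour, or 0 if none."""
    # heuristic: count each user's orders at this shop per calendar-hour bucket
    by_id = df[df['shopid'] == shopid].groupby(['day-hr', 'userid']).agg({'orderid':'count'}).reset_index()
    # drop_duplicates: a user who reaches the threshold in several hours is listed only once
    brusher_ids = by_id[by_id['orderid'] >= 3]['userid'].drop_duplicates().sort_values().astype(str)

    if len(brusher_ids):
        return '&'.join(brusher_ids)
    else:
        return 0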
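
`get_brushers` buckets orders by calendar hour, so a burst that straddles an hour boundary is split across two buckets. One possible refinement, sketched here under the assumption that the concentrate rate is orders divided by unique buyers in a 1-hour window, is to slide the window so that it starts at each order. `max_concentrate_rate` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def max_concentrate_rate(shop_orders: pd.DataFrame) -> float:
    """Highest orders / unique-buyers ratio over 1-hour windows starting at each order."""
    orders = shop_orders.sort_values("event_time")
    one_hour = pd.Timedelta(hours=1)
    best = 0.0
    for start in orders["event_time"]:
        # all orders of this shop from `start` up to one hour later (inclusive)
        in_window = orders[
            (orders["event_time"] >= start)
            & (orders["event_time"] <= start + one_hour)
        ]
        best = max(best, len(in_window) / in_window["userid"].nunique())
    return best

# Made-up shop: one buyer places three orders across the 11:00 boundary
shop = pd.DataFrame({
    "userid": [1, 1, 1],
    "event_time": pd.to_datetime([
        "2019-12-27 10:50:00",
        "2019-12-27 10:58:00",
        "2019-12-27 11:05:00",
    ]),
})

sliding_rate = max_concentrate_rate(shop)
print(sliding_rate)  # 3.0

# The calendar-hour bucketing sees at most 2 orders per bucket for the same data
hour_buckets = shop["event_time"].dt.strftime("%d/%m/%Y-%H")
bucket_max = shop.groupby([hour_buckets, "userid"]).size().max()
print(bucket_max)  # 2
```

The loop is quadratic in the number of orders per shop, which is acceptable for a sketch but may be slow for the largest shops in this dataset.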

In [11]:
df_submit = pd.DataFrame()
df_submit['shopid'] = df['shopid'].unique()
df_submit['userid'] = df_submit['shopid'].apply(get_brushers)
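
Calling `get_brushers` once per shop filters the full dataframe 18,770 times. A possible alternative, sketched on a made-up table, is to group all shops in a single pass and then map the result back. Two assumptions differ from the code above: each user is listed once per shop, and shops without flagged users get the string `"0"` rather than the integer `0`.

```python
import pandas as pd

# Toy stand-in for df with the 'day-hr' column already added (values are made up)
toy = pd.DataFrame({
    "orderid": range(1, 8),
    "shopid": [1, 1, 1, 1, 2, 2, 3],
    "userid": [10, 10, 10, 11, 20, 21, 30],
    "day-hr": ["27/12/2019-00"] * 7,
})

# Orders per (shop, hour bucket, user), computed once for all shops
counts = (
    toy.groupby(["shopid", "day-hr", "userid"])
    .size()
    .reset_index(name="n_orders")
)
flagged = counts[counts["n_orders"] >= 3]

# '&'-joined, de-duplicated, numerically sorted user IDs per flagged shop
brushers = flagged.groupby("shopid")["userid"].apply(
    lambda ids: "&".join(ids.drop_duplicates().sort_values().astype(str))
)

result = pd.DataFrame({"shopid": toy["shopid"].unique()})
result["userid"] = result["shopid"].map(brushers).fillna("0")
print(result)
```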

In [82]:
# collect the '&'-joined user ID strings of all flagged shops
userid_list = df_submit.loc[df_submit['userid'] != 0, 'userid'].tolist()

In [12]:
df_submit

Unnamed: 0,shopid,userid
0,93950878,0
1,156423439,0
2,173699291,0
3,63674025,0
4,127249066,0
...,...,...
18765,163337224,0
18766,772599,0
18767,50236030,0
18768,203587596,0


In [16]:
len(df_submit[df_submit['userid'] != 0])

296