# Order Brushing Detection

Task
1. Identify all shops that are deemed to have conducted order brushing.
2. For each shop that is identified to have conducted order brushing, identify the buyers suspected to have conducted order brushing for that shop.

For the purpose of this question, a shop is deemed to have conducted order brushing if its concentrate rate (the number of orders divided by the number of unique buyers within a one-hour window) is greater than or equal to 3 at any instance.

[link to kaggle](https://www.kaggle.com/c/order-brushing-shopee-code-league)
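
The detection loop later in this notebook computes the ratio for a window as the number of distinct orders divided by the number of distinct buyers. A minimal sketch of that ratio, using made-up `(orderid, userid)` pairs; the helper name `concentrate_rate` is an assumption for illustration and does not appear in the notebook:

```python
def concentrate_rate(orders):
    """orders: list of (orderid, userid) pairs for one shop within one window."""
    order_ids = {orderid for orderid, _ in orders}
    user_ids = {userid for _, userid in orders}
    return len(order_ids) / len(user_ids)

# Hypothetical windows: 4 orders from 1 buyer, and 3 orders from 3 buyers.
suspicious = [(1, 'u1'), (2, 'u1'), (3, 'u1'), (4, 'u1')]
normal = [(5, 'u1'), (6, 'u2'), (7, 'u3')]

print(concentrate_rate(suspicious))  # 4.0
print(concentrate_rate(normal))      # 1.0
```

Only the first window reaches the threshold of 3.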

In [11]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

## Load data

In [12]:
df = pd.read_csv('order_brush_order.csv')
df.head()

Unnamed: 0,orderid,shopid,userid,event_time
0,31076582227611,93950878,30530270,2019-12-27 00:23:03
1,31118059853484,156423439,46057927,2019-12-27 11:54:20
2,31123355095755,173699291,67341739,2019-12-27 13:22:35
3,31122059872723,63674025,149380322,2019-12-27 13:01:00
4,31117075665123,127249066,149493217,2019-12-27 11:37:55


# Data Exploration

In [13]:
# there are a total of 222750 transactions
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222750 entries, 0 to 222749
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   orderid     222750 non-null  int64 
 1   shopid      222750 non-null  int64 
 2   userid      222750 non-null  int64 
 3   event_time  222750 non-null  object
dtypes: int64(3), object(1)
memory usage: 6.8+ MB


There are 18770 shops in total, and most of them have only a few orders.   
Half of the shops have 2 orders or fewer (the median is 2).

In [14]:
unique_shopid = df['shopid'].unique()
len(unique_shopid)

18770

In [15]:
#df.groupby('shopid').size().describe()
df.groupby('shopid')['userid'].count().describe()

count    18770.000000
mean        11.867342
std        104.244723
min          1.000000
25%          1.000000
50%          2.000000
75%          8.000000
max      11703.000000
Name: userid, dtype: float64

Top 10 shops by number of orders

In [16]:
df.groupby('shopid')['userid'].count().sort_values(ascending = False).head(10)

shopid
147941492    11703
61556313      6691
54615708      2640
91799978       707
57189823       687
107921853      684
25924280       655
89768730       607
43412276       529
32504026       516
Name: userid, dtype: int64

# Data Analysis


#### Convert `event_time` datatype to datetime

In [17]:
df['event_time'] = pd.to_datetime(df['event_time'])

# find the first and last timestamps in the dataframe; the dataset covers only 5 days
print(f"transaction start: {df['event_time'].min()}")
print(f"transaction end: {df['event_time'].max()}")

transaction start: 2019-12-27 00:00:00
transaction end: 2019-12-31 23:59:56


#### Find all shop IDs which have at least 3 orders

In [28]:
# dict: key = shopid, value = number of orders
shop_order_counts = df.groupby('shopid').size()
shopid_over3_dict = pd.DataFrame(shop_order_counts[shop_order_counts >= 3]).to_dict()

shopid_over3 = shopid_over3_dict[0].keys()
print(f"{len(shopid_over3)} shops with >= 3 orders")


9031 shops with >= 3 orders
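
The same filter can be written without the round trip through a DataFrame and a dict. A small sketch on made-up data, assuming only the shop IDs are needed; `toy`, `order_counts`, and `busy_shops` are illustrative names:

```python
import pandas as pd

# Toy data (made up): shop 1 has 3 orders, shop 2 has 1, shop 3 has 4.
toy = pd.DataFrame({'shopid': [1, 1, 1, 2, 3, 3, 3, 3]})

# Count orders per shop, then keep the shops with at least 3 orders.
order_counts = toy['shopid'].value_counts()
busy_shops = order_counts[order_counts >= 3].index.tolist()

print(sorted(busy_shops))  # [1, 3]
```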


#### Generate timestamp

In [30]:
#.timestamp() -> Return POSIX timestamp as float.

df['ts'] = df['event_time'].apply(lambda x: x.timestamp()).astype(int)
df.head()

Unnamed: 0,orderid,shopid,userid,event_time,ts
0,31076582227611,93950878,30530270,2019-12-27 00:23:03,1577406183
1,31118059853484,156423439,46057927,2019-12-27 11:54:20,1577447660
2,31123355095755,173699291,67341739,2019-12-27 13:22:35,1577452955
3,31122059872723,63674025,149380322,2019-12-27 13:01:00,1577451660
4,31117075665123,127249066,149493217,2019-12-27 11:37:55,1577446675
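
The row-wise `apply` can probably be replaced by a vectorised conversion. A sketch, assuming `event_time` is timezone-naive as in this dataset: subtract the Unix epoch and floor-divide by one second. The two sample values are taken from the table above.

```python
import pandas as pd

event_time = pd.to_datetime(
    pd.Series(['2019-12-27 00:23:03', '2019-12-27 11:54:20'])
)

# Seconds since 1970-01-01, computed on the whole column at once.
ts = (event_time - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(ts.tolist())  # [1577406183, 1577447660]
```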


#### Sort orders by event time

In [32]:
df = df.sort_values(by = 'event_time', ascending = True)
df.head()

Unnamed: 0,orderid,shopid,userid,event_time,ts
150060,31075200506751,6042309,97707522,2019-12-27 00:00:00,1577404800
25985,31075200506752,104804492,97707522,2019-12-27 00:00:00,1577404800
81020,31075200506753,8715449,97707522,2019-12-27 00:00:00,1577404800
146598,31075201870570,190969466,170182475,2019-12-27 00:00:02,1577404802
80355,31075205798264,2859407,12532131,2019-12-27 00:00:05,1577404805


## Brushing orders detection

**Variable definition:**

- previous_order: The nearest order before the time interval.
- previous_order_time: The event time of previous_order.
- start_order: The first order of this time interval.
- start_order_time: The event time of start_order.
- end_order: The last order of this time interval.
- end_order_time: The event time of end_order.
- next_order: The nearest order after the time interval.
- next_order_time: The event time of next_order.   

**We will look for time intervals that satisfy the following two conditions:**

1. start_order_time and end_order_time are within 1 hour of each other.
2. next_order_time - previous_order_time > 1 hour.

Warning: The detection loop takes more than 5 minutes to run.
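
The two conditions can be sketched as a small predicate on timestamps in seconds. The function name `is_candidate_interval` is hypothetical, and the sketch follows the prose (strictly more than 3600 seconds for condition 2), whereas the loop in this notebook compares against 3601:

```python
ONE_HOUR = 3600  # seconds

def is_candidate_interval(previous_order_time, start_order_time,
                          end_order_time, next_order_time):
    """Return True when an interval of orders satisfies both conditions."""
    # Condition 1: the first and last orders of the interval are within 1 hour.
    within_hour = end_order_time - start_order_time <= ONE_HOUR
    # Condition 2: the neighbouring orders outside the interval are more than
    # 1 hour apart, so the interval cannot be widened and still fit in 1 hour.
    isolated = next_order_time - previous_order_time > ONE_HOUR
    return within_hour and isolated

print(is_candidate_interval(0, 4000, 4600, 9000))     # True
print(is_candidate_interval(3900, 4000, 4600, 4700))  # False
print(is_candidate_interval(0, 4000, 8000, 9000))     # False
```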

In [33]:
import time

In [35]:
print(f"{time.time()}")
print(f"local time: {time.localtime(time.time())}")
print(f"asc time: {time.asctime(time.localtime())}")

1592171704.925118
local time: time.struct_time(tm_year=2020, tm_mon=6, tm_mday=15, tm_hour=5, tm_min=55, tm_sec=4, tm_wday=0, tm_yday=167, tm_isdst=0)
asc time: Mon Jun 15 05:55:04 2020


In [34]:
from datetime import datetime
import time

count = 0
ans_dict = dict()

for shop_id in shopid_over3:
    time_record = time.time()
    cheat = 0
    # record the cheating row index in shop_df
    cheat_order_list = set() 
    
    shop_df = df[df['shopid'] == shop_id]
    len_shop_data = len(shop_df)
    # initialise previous_order_time to more than 1 hour before the first order; 9999 secs is greater than 3600 secs (1 hour)
    previous_order_time = shop_df.iloc[0]['ts'] - 9999
    # the last 2 orders cannot start an interval, since order brushing needs at least 3 orders
    for start_order in range(len_shop_data - 2): 
        start_order_time = shop_df.iloc[start_order]['ts']
        # for the same reason, end_order starts from start_order + 2
        for end_order in range(start_order + 2, len_shop_data): 
            
            # start_order_time and end_order_time need to be within 1 hour
            end_order_time = shop_df.iloc[end_order]['ts']
            if end_order_time > start_order_time + 3600:
                break
            
            # Find the event time of next_order
            # edge case: end_order is the last order of this shop
            if end_order == len_shop_data - 1: 
                # similar to how we did for previous_order_time
                next_order_time = shop_df.iloc[end_order]['ts'] + 9999 
            else:
                next_order_time = shop_df.iloc[end_order + 1]['ts']
            
            # check whether next_order_time - previous_order_time > 1 hour (note: the threshold used here is 3601 secs)
            if next_order_time > previous_order_time + 3601:
                # orders with the same timestamp cannot be split across different time intervals
                if next_order_time == end_order_time: 
                    continue

                tmp_df = shop_df.iloc[start_order: end_order + 1]
                order_num = tmp_df.orderid.nunique()
                user_num = tmp_df.userid.nunique()

                # check whether order brushing
                if order_num / user_num >= 3:
                    cheat = 1
                    cheat_order_list.update(list(range(start_order, end_order + 1)))
        # refresh the previous order time
        previous_order_time = start_order_time 
        
    count += 1 
    
    if cheat == 1:
        # find the brushing user
        tmp_df = shop_df.iloc[list(cheat_order_list)]
        tmp_dict = tmp_df.groupby('userid').size().to_dict()
        max_time = max(list(tmp_dict.values()))
        
        ans_dict[shop_id] = []
        for user in tmp_dict:
            if tmp_dict[user] == max_time:
                ans_dict[shop_id].append(user)
                        
        print(f'{count:4d}, shop {shop_id} used {time.time() - time_record:4.2f} sec(s), cheat!!!')


   8, shop 10159 used 0.06 sec(s), cheat!!!
  27, shop 10402 used 0.01 sec(s), cheat!!!
  37, shop 10536 used 0.01 sec(s), cheat!!!
  67, shop 42472 used 0.01 sec(s), cheat!!!
  69, shop 42818 used 0.05 sec(s), cheat!!!
  77, shop 76934 used 0.01 sec(s), cheat!!!
 101, shop 195531 used 0.02 sec(s), cheat!!!
 136, shop 425364 used 0.00 sec(s), cheat!!!
 148, shop 599533 used 0.01 sec(s), cheat!!!
 149, shop 605561 used 0.01 sec(s), cheat!!!
 167, shop 722132 used 0.08 sec(s), cheat!!!
 169, shop 731606 used 0.01 sec(s), cheat!!!
 170, shop 736620 used 0.03 sec(s), cheat!!!
 178, shop 769445 used 0.07 sec(s), cheat!!!
 186, shop 823357 used 0.02 sec(s), cheat!!!
 226, shop 1175477 used 0.12 sec(s), cheat!!!
 241, shop 1532569 used 0.01 sec(s), cheat!!!
 292, shop 2831130 used 0.04 sec(s), cheat!!!
 293, shop 2856708 used 0.12 sec(s), cheat!!!
 309, shop 3124091 used 0.01 sec(s), cheat!!!
 332, shop 3701311 used 0.02 sec(s), cheat!!!
 348, shop 4149345 used 0.21 sec(s), cheat!!!
 ...

4579, shop 114919380 used 0.39 sec(s), cheat!!!
4593, shop 115179739 used 0.20 sec(s), cheat!!!
4600, shop 115506347 used 0.05 sec(s), cheat!!!
4659, shop 116337057 used 0.01 sec(s), cheat!!!
4672, shop 116348362 used 0.03 sec(s), cheat!!!
4721, shop 117079735 used 0.02 sec(s), cheat!!!
4749, shop 118139770 used 0.01 sec(s), cheat!!!
4784, shop 118949192 used 0.01 sec(s), cheat!!!
4812, shop 119309887 used 0.01 sec(s), cheat!!!
4837, shop 119362482 used 0.06 sec(s), cheat!!!
4863, shop 119595148 used 0.01 sec(s), cheat!!!
4873, shop 119690532 used 0.01 sec(s), cheat!!!
4886, shop 120358496 used 0.07 sec(s), cheat!!!
4895, shop 120692254 used 0.02 sec(s), cheat!!!
4972, shop 122238789 used 0.00 sec(s), cheat!!!
4982, shop 122459518 used 0.03 sec(s), cheat!!!
5003, shop 123401085 used 0.01 sec(s), cheat!!!
5007, shop 123548863 used 0.01 sec(s), cheat!!!
5047, shop 126261266 used 0.01 sec(s), cheat!!!
5059, shop 126587078 used 0.01 sec(s), cheat!!!
5063, shop 126599655 used 0.01 sec(s), cheat!!!
 ...
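
The loop above filters the full frame once per shop with `df[df['shopid'] == shop_id]`, which scans every row each time. One possible alternative, sketched on made-up data and not timed against the real dataset, is to sort once and iterate over `groupby`, which hands over each shop's rows already split:

```python
import pandas as pd

# Toy data (made up): two shops with interleaved orders.
toy = pd.DataFrame({
    'shopid': [1, 2, 1, 2, 1],
    'ts': [30, 10, 10, 20, 20],
})

# Sort once, then visit each shop's rows in time order.
shop_ts = {}
for shop_id, shop_df in toy.sort_values('ts').groupby('shopid'):
    shop_ts[int(shop_id)] = shop_df['ts'].tolist()

print(shop_ts)  # {1: [10, 20, 30], 2: [10, 20]}
```

Inside the loop body, `shop_df` could be used in the same way as in the cell above.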

In [35]:
# count how many shops are brushing (tip: the true number is 315 shops)
len(ans_dict)

315

In [36]:
ans_dict

{10159: [214988798],
 10402: [77819],
 10536: [672345],
 42472: [740844],
 42818: [170385453],
 76934: [190449497],
 195531: [214992524],
 425364: [72914921],
 599533: [264511],
 605561: [181682008],
 722132: [7670129],
 731606: [75558350],
 736620: [62618064],
 769445: [141006168],
 823357: [188942105],
 1175477: [122277324],
 1532569: [181408876],
 2831130: [15053804],
 2856708: [123959597],
 3124091: [214568881],
 3701311: [80690628],
 4149345: [212325226],
 4888564: [143847348],
 6765647: [740844],
 8566282: [556867],
 8715449: [9753706],
 8996761: [215382704],
 9153340: [197569856],
 9372692: [139795934],
 9374147: [148176353],
 9466594: [84811421],
 9676950: [210920501],
 10199219: [8405753],
 10206302: [95058664],
 10498480: [199416406],
 11342466: [152292010],
 11474437: [148215831],
 11612863: [214546342],
 11809607: [136680607],
 11849988: [148215831],
 11896733: [156614746],
 12078788: [214588488],
 12537642: [48412388],
 13735592: [215424202],
 14184981: [32594],
 14598427: ...
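
Each entry of `ans_dict` holds the buyers with the highest order count among a shop's flagged orders, and ties keep every tied buyer. A minimal sketch of that selection step using `collections.Counter`; the helper name `top_buyers` and the sample IDs are made up, and the result is sorted here only to keep the output deterministic:

```python
from collections import Counter

def top_buyers(flagged_userids):
    """Return all buyers that share the maximum number of flagged orders."""
    counts = Counter(flagged_userids)
    max_orders = max(counts.values())
    return sorted(user for user, n in counts.items() if n == max_orders)

# Buyers 11 and 22 both placed 3 flagged orders; buyer 33 placed 1.
print(top_buyers([11, 11, 11, 22, 22, 22, 33]))  # [11, 22]
```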

## Save the result

In [37]:
ans_shop = []
ans_user = []
for shop in ans_dict.keys():
    if len(ans_dict[shop]) > 1:
        user_cheat = set(ans_dict[shop])
        user_cheat = sorted(list(user_cheat), reverse=False)
        ans_user.append("&".join([str(i) for i in user_cheat]))
    else:
        ans_user.append(str(ans_dict[shop][0]))
    ans_shop.append(shop)

for shop in unique_shopid:
    if shop not in ans_dict.keys():
        ans_shop.append(shop)
        ans_user.append("0")

print(len(ans_shop))
print(len(ans_user))

df_ans = pd.DataFrame({'shopid': ans_shop, 'userid': ans_user})
df_ans.to_csv('prediction.csv',index=False)

18770
18770
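
The `userid` column written above follows three rules: a single suspicious buyer is written as is, several buyers are sorted and joined with `&`, and shops without brushing get `"0"`. A compact sketch of that formatting; the helper name `format_users` is hypothetical:

```python
def format_users(users):
    """Format one shop's suspicious buyers for the prediction file."""
    if not users:
        return "0"
    # Deduplicate, sort ascending, and join with '&'.
    return "&".join(str(u) for u in sorted(set(users)))

print(format_users([22, 11]))  # 11&22
print(format_users([77819]))   # 77819
print(format_users([]))        # 0
```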


In [38]:
df_ans

Unnamed: 0,shopid,userid
0,10159,214988798
1,10402,77819
2,10536,672345
3,42472,740844
4,42818,170385453
...,...,...
18765,163337224,0
18766,772599,0
18767,50236030,0
18768,203587596,0

## Scratch exploration

The remaining cells are exploratory scratch work. Their execution counts show that they were run out of order, so `df_temp` is used before the cell that defines it.


In [21]:
df_temp = df_temp[df_temp.duplicated(['userid'], keep = False)]
df_temp

Unnamed: 0,shopid,userid,event_time,day,hour,sec
65430,137739819,125683576,2019-12-27 21:17:39,27,21,39
209383,137739819,125683576,2019-12-27 21:21:36,27,21,36
170852,137739819,10842469,2019-12-28 14:57:58,28,14,58
128266,137739819,10842469,2019-12-28 15:00:39,28,15,39
178518,137739819,117286517,2019-12-28 16:52:12,28,16,12
209988,137739819,117286517,2019-12-28 16:59:57,28,16,57
66773,137739819,117286517,2019-12-28 17:14:03,28,17,3
222584,137739819,185908193,2019-12-28 18:03:51,28,18,51
194357,137739819,185908193,2019-12-28 18:05:28,28,18,28
170970,137739819,98685464,2019-12-28 18:49:32,28,18,32


In [79]:
a = df_temp.groupby('userid')['userid'].count()
a

userid
6978796      2
8296144      2
10842469     2
19165534     2
43640107     2
59014318     2
98685464     2
117286517    3
125683576    2
155492625    2
185908193    2
193404254    2
202808247    2
213030082    2
Name: userid, dtype: int64

In [102]:
b = dict(a.value_counts())

In [105]:
if key

1

In [101]:
for v in dict(b):
    print(v)

2
3


In [100]:
dict(b)

{2: 13, 3: 1}

In [21]:
unique_shopid[0]

93950878

In [23]:
df_temp = df[df.shopid == unique_shopid[4]].sort_values(by = 'event_time')
df_temp.head(10)

Unnamed: 0,orderid,shopid,userid,event_time
156868,31075952319741,127249066,49281203,2019-12-27 00:12:32
49442,31080441865252,127249066,295870,2019-12-27 01:27:22
118245,31080987985123,127249066,2829682,2019-12-27 01:36:27
157001,31098840741099,127249066,66416773,2019-12-27 06:34:00
193417,31099060351386,127249066,153868837,2019-12-27 06:37:40
193,31111092590803,127249066,21958993,2019-12-27 09:58:13
166900,31114106875564,127249066,42725367,2019-12-27 10:48:27
4,31117075665123,127249066,149493217,2019-12-27 11:37:55
30837,31117223287389,127249066,149493217,2019-12-27 11:40:24
27744,31120265364530,127249066,213763075,2019-12-27 12:31:05


In [19]:
df_temp.shape

(6, 4)

In [41]:
offset = timedelta(hours = 1)

In [43]:
df['event_time'][0] + offset

Timestamp('2019-12-27 01:23:03')

In [24]:
a = df_temp.groupby(['userid'])['userid'].count()

In [26]:
len(a)

118

In [27]:
a.unique()

array([1, 2])