# Web Analytics (2IID0) Homework Assignment 3
## Exceptional Model Mining vs A/B Testing
#### <i>Abdel K. Bokharouss, Bart van Helvert, Joris Rombouts & Remco Surtel</i>   -   December 2017

In this assignment we combine Exceptional Model Mining (EMM) with A/B testing. The dataset has been provided by StudyPortals. The core concept of A/B testing is that each test subject is assigned one variant. We then measure the rate of success per variant; the variant with the higher success rate is kept and the other is discarded. However, subgroups that would be served better by the losing version are then disadvantaged. To do better, we mine the data further to find coherent subgroups for which the alternative delivers more success. New visitors who match such a subgroup can then be served whichever version (A or B) works best for that subgroup. Exceptional Model Mining allows us to discover these subgroups. First, we need to decide which attributes will be used as descriptors and which as targets.
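
To make the motivation concrete, here is a minimal sketch on made-up data. The column names `version`, `device` and `clicked` are illustrative assumptions, not StudyPortals attributes. Version B wins overall, yet one subgroup does better with version A.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "version": ["A"] * 6 + ["B"] * 6,
    "device":  ["mobile"] * 3 + ["desktop"] * 3 + ["mobile"] * 3 + ["desktop"] * 3,
    "clicked": [1, 1, 0,  0, 0, 0,  0, 0, 0,  1, 1, 1],
})

# Classic A/B test: one success rate per version.
overall = toy.groupby("version")["clicked"].mean()

# The same rates, split by a candidate subgroup descriptor.
by_subgroup = toy.groupby(["device", "version"])["clicked"].mean().unstack("version")

print(overall)
print(by_subgroup)
```

Keeping only the overall winner would serve the mobile visitors in this toy example worse, which is the kind of subgroup EMM is meant to surface.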

### <font color="green">imports, preparation and configuration</font>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### 1 Data Preprocessing

<b><i>a. Which columns, available in which of the input files, must we designate as our targets in the EMM process?</i></b><br>
The raw data that StudyPortals delivered consists of three datasets: `clicking_data`, `experiment_details` and `meta_data`. StudyPortals is primarily interested in the association between which version of the website was shown (version A or B) and whether or not the user clicked on the button. Therefore the binary attribute `action` of the dataset `clicking_data` is the first target attribute we define. We also need to know which version (A or B) was shown to the user, so the attribute `condition` of the dataset `experiment_details` is the second target attribute.
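
As a rough sketch of how these two targets could be encoded: the rows below are made up, and the code assumes that the value `clic` in `action` denotes a click, that `1-Control` corresponds to version A, and that `user_session` in the clicking data matches `user_id` in the experiment data. None of these assumptions is confirmed by the data description.

```python
import pandas as pd

# Made-up rows that mimic the two relevant columns of each file.
clicks = pd.DataFrame({
    "user_session": ["s1", "s1", "s2", "s3"],
    "action": ["view", "clic", "view", "clic"],
})
experiment = pd.DataFrame({
    "user_id": ["s1", "s2", "s3"],
    "condition": ["1-Control", "2-Buttony-Conversion-Buttons",
                  "2-Buttony-Conversion-Buttons"],
})

# Target 1: did the session contain at least one click? (assumes 'clic' = click)
clicked = (clicks["action"] == "clic").groupby(clicks["user_session"]).any()

# Target 2: which version was shown (assumed mapping of condition labels).
version = experiment.set_index("user_id")["condition"].map(
    {"1-Control": "A", "2-Buttony-Conversion-Buttons": "B"}
)

# Align both targets on the shared identifier.
targets = pd.DataFrame({"clicked": clicked, "version": version})
print(targets)
```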

<b><i>b.</i></b><br>With the targets defined, the next step is to select which attributes will be used as descriptors. The goal is to gather as much descriptor information as is reasonably possible. First we explore the data and check how many NaN values the three datasets contain.
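
The NaN check itself can be sketched as follows, on a small made-up frame that stands in for any of the three datasets (the values are invented; only the column names are borrowed from the meta data):

```python
import pandas as pd

# Toy frame standing in for one of the three datasets.
toy = pd.DataFrame({
    "geo_country": ["GR", "NL", None],
    "os_name": [None, None, "Android"],
    "user_id": ["u1", "u2", "u3"],
})

nan_per_column = toy.isnull().sum()             # NaN count for each column
rows_with_nan = toy.isnull().any(axis=1).sum()  # rows with at least one NaN

print(nan_per_column)
print(rows_with_nan)
```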

In [2]:
df_click = pd.read_csv('data/clicking_data.csv')
df_click.head()

Unnamed: 0,action,action_label,action_type,tstamp,user_session
0,clic,revenue,link,1472755490,379881d5-32d7-49f4-bf5b-81fefbc5fcce
1,clic,revenue,link,1472839117,2a0f4218-4f62-479b-845c-109b2720e6e7
2,clic,revenue,link,1472879219,a511b6dc-2dca-455b-b5e2-bf2d224a5505
3,clic,revenue,link,1472890876,9fb616a7-4e13-4307-ac92-0b075d7d376a
4,clic,revenue,link,1472892380,64816772-688d-4460-a591-79aa49bba0d5


In [7]:
df_click.shape

(919, 5)

In [9]:
df_click.user_session.describe()

count                                      919
unique                                     778
top       8ab1e096-cbf3-4484-bacd-01a833e3e0fd
freq                                        17
Name: user_session, dtype: object

In [10]:
df_click.loc[df_click.user_session == '8ab1e096-cbf3-4484-bacd-01a833e3e0fd']

Unnamed: 0,action,action_label,action_type,tstamp,user_session
94,view,premium,org,1472891985,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
95,view,premium,org,1472892049,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
274,view,premium,org,1472806209,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
296,view,premium,org,1472887685,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
297,view,premium,org,1472887801,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
298,view,premium,org,1472887848,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
300,view,premium,org,1472891365,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
301,view,premium,org,1472891381,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
402,view,premium,org,1472887207,8ab1e096-cbf3-4484-bacd-01a833e3e0fd
403,view,premium,org,1472887330,8ab1e096-cbf3-4484-bacd-01a833e3e0fd


In [11]:
pd.concat(g for _, g in df_click.groupby("user_session") if len(g) > 1)

Unnamed: 0,action,action_label,action_type,tstamp,user_session
333,clic,revenue,link,1473024926,007f8a0c-e8ad-4268-a818-eaf0723aecba
764,view,premium,org,1473024321,007f8a0c-e8ad-4268-a818-eaf0723aecba
168,view,premium,org,1473114699,02cca032-cdc2-4e41-87d6-9edf4c7c6706
789,view,premium,org,1473114419,02cca032-cdc2-4e41-87d6-9edf4c7c6706
22,clic,revenue,link,1473103156,06b119f3-3ced-44df-bf1f-f1b1ae75f515
476,view,premium,org,1473103043,06b119f3-3ced-44df-bf1f-f1b1ae75f515
448,view,premium,org,1473030266,07ed74dd-4db7-4a31-aa5d-a24ad80bc01d
765,view,premium,org,1473030324,07ed74dd-4db7-4a31-aa5d-a24ad80bc01d
241,view,premium,org,1473365377,081e0fd6-6385-4228-b2c9-8e90b905c076
250,view,premium,org,1474572550,081e0fd6-6385-4228-b2c9-8e90b905c076


In [None]:
df_click.info()

In [12]:
df_experiment = pd.read_csv('data/experiment_details_new.csv')
df_experiment.head()

Unnamed: 0,user_id,experiment_id,condition,collector_tstamp
0,7794c3b2-4d08-4bb8-bfc9-b50935fed1fc,86-v2-Butonny buttons,1-Control,2016-09-01 12:21:25.716000
1,0baf5074-74bd-4257-ac9f-a07d25f37667,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:01.726000
2,623d19a6-64b4-412a-8143-750995742605,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:31.797000
3,d4b62fc9-dead-441f-936a-db08c2711a1e,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:53.218000
4,8de611c2-0e10-4f63-a016-0f1bd7a683e1,86-v2-Butonny buttons,1-Control,2016-09-01 12:24:15.978000


In [17]:
df_experiment.loc[df_experiment.user_id == "8ab1e096-cbf3-4484-bacd-01a833e3e0fd"]

Unnamed: 0,user_id,experiment_id,condition,collector_tstamp
593,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-02 08:50:27.865000
1031,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 08:05:36.112000
2964,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 07:40:11.146000
2979,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 08:12:13.334000
2982,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 08:13:41.091000
6879,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 07:20:33.206000
6886,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 07:40:13.986000
6905,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 08:27:41.687000
11309,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-01 23:55:20.910000
11890,8ab1e096-cbf3-4484-bacd-01a833e3e0fd,86-v2-Butonny buttons,1-Control,2016-09-03 07:14:22.320000


In [27]:
df_experiment['count'] = 1

In [38]:
# Count, per user, how many times each condition was assigned.
test = df_experiment.pivot_table(index='user_id', columns='condition', values=['count'], aggfunc='sum')
test.fillna(0, inplace=True)
control = test[('count', '1-Control')]
variant = test[('count', '2-Buttony-Conversion-Buttons')]
# 0 if the user saw exactly one of the two versions, 1 otherwise.
test["two_versions_one_userid"] = np.where((control > 0) ^ (variant > 0), 0, 1)
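
The same flag can be derived more directly by counting the distinct conditions per user. A minimal sketch on made-up rows (the user ids are invented; the condition labels are taken from the experiment file):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "condition": ["1-Control", "2-Buttony-Conversion-Buttons", "1-Control",
                  "2-Buttony-Conversion-Buttons", "2-Buttony-Conversion-Buttons"],
})

# Number of distinct conditions each user was assigned.
n_conditions = toy.groupby("user_id")["condition"].nunique()

# Users who were assigned more than one condition.
saw_both = n_conditions[n_conditions > 1].index.tolist()
print(saw_both)
```

This avoids the pivot table and the multi-level column names, at the cost of not keeping the per-condition counts.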

In [51]:
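# Assumed reconstruction of the truncated cell: keep users flagged as having seen both versions.
have_seen_two_versions = test[test["two_versions_one_userid"] == 1]
one = list(have_seen_two_versions.index.unique())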

In [52]:
two = list(df_click.user_session.unique())

In [53]:
list(set(one) & set(two))

['0d44242b-bd92-4f38-bb04-6ef1d5890fd6',
 'c2bde2e2-d0e0-476b-8a48-78529cb4c5d0',
 '10094519-9aa0-4665-a099-634a7311fb8f',
 '1a491e50-5b0c-46c6-81b9-5dd8e3c21d51',
 'd9db6db7-bf57-46d1-994f-82bedea5ad26',
 'ddc9391b-08b6-460f-a37c-261bd58ea3d9',
 '17bc8bf8-e432-4577-8667-f6f310149f04',
 '1a296072-dd42-4158-8c1d-be82d5fdf80f',
 'bedecc9a-22d0-422c-80ac-2ba13fdfbbf2',
 '03af59ca-fb4f-4fda-ad44-e6f8bfc13151',
 'f08a611c-7f09-45dd-bb8d-8af9a5300ced',
 'aad247eb-57bf-46eb-b865-ed2d047d2db5',
 '26660906-5f26-4f32-8d75-9b912bbe83af',
 '985b802b-edbe-4168-ac39-87eae3ce5042',
 '6d9ad57d-3deb-4987-b171-1fa42bce607d',
 'f72aab36-93b1-46a2-9d0c-60ecb07f9eca']

In [37]:
test.columns.values

array([('count', '1-Control'), ('count', '2-Buttony-Conversion-Buttons')], dtype=object)

In [15]:
df_experiment.user_id.describe()

count                                    14764
unique                                    9360
top       0df1a4d2-1d98-40b0-9f37-588fc7ebd848
freq                                       185
Name: user_id, dtype: object

In [None]:
df_experiment.rename(columns={'timestamp':'etl_tstamp'}, inplace=True)

In [None]:
df_experiment.info()

In [54]:
# 'ansi' is a Windows-only codec alias; on other platforms 'cp1252' is typically the equivalent.
df_meta = pd.read_csv('data/meta_data_modified.csv', encoding = 'ansi')
df_meta.head()

Unnamed: 0,platform,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,domain_userid,domain_sessionid,user_id,geo_country,geo_region,...,pp_yoffset_max,useragent,browser_language,browser_cookies,browser_colordepth,browser_viewdepth,browser_viewheight,os_name,os_timezone,dvce_type
0,web,06:22.2,06:33.7,06:26.2,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,915.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
1,web,06:49.2,07:00.2,06:52.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
2,web,07:09.2,07:20.1,07:12.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
3,web,09:49.3,10:01.0,09:53.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,3411.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
4,web,10:40.3,10:51.8,10:44.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,


In [None]:
df_meta

In [None]:
df_meta.info()

In [None]:
df_meta.isnull().any(axis=1).sum() # number of rows with at least one NaN value

This assignment analyzes visitor behavior in terms of an unusual association between which version of the website visitors were shown (version A or B) and whether or not they clicked on a particular button. Before we assess the best strategy for handling the missing values, we first need to evaluate the volume and completeness of the attributes that identify a specific visitor, since the visitors are at the core of this analysis.

In [None]:
df_meta[['domain_userid', 'domain_sessionid', 'user_id']].apply(
    pd.Series.nunique) # unique number of values for each column

In [None]:
"unique user-sessions in click data: ", len(df_click.user_session.unique())

In [None]:
"unique user-id's in experiment data: ", len(df_experiment.user_id.unique())

Each user is identified by a unique user id and each visit is identified by a unique user session. Since we are interested in whether a user who visits a certain version of the website clicks on a particular button during a session, we want to limit ourselves to the session ids for which action labels are available. We can see that the numbers of user ids and session ids in the three datasets are not in line with each other. The goal is to obtain a dataframe that contains, for each user session and id, the experiment details and all the relevant information we can obtain from the meta data file.

In [None]:
overlap = list(set(df_experiment.columns) & set(df_meta.columns))

In [None]:
len(list(set(df_meta.user_id.values) & set(df_experiment.user_id.values)))

In [None]:
#len(list(set(df_meta.collector_tstamp.values) & set(df_experiment.timestamp.values)))

In [None]:
df_experiment.shape, df_meta.shape

Tara from StudyPortals: <i>"Timestamp in the experiment file indicates the timestamp that the users starts the session while seeing the variation A or B. I recommend you to use the meta-data timestamp for your further analysis. "</i>
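
The files store timestamps in different representations: `tstamp` in the clicking data looks like Unix epoch seconds, while `collector_tstamp` in the experiment file is a datetime string. A minimal sketch for bringing both to a comparable type, assuming the epoch values are in seconds and in UTC (the time zone is not stated in the data):

```python
import pandas as pd

# Sample values copied from the outputs above.
click_ts = pd.Series([1472755490, 1472839117])
experiment_ts = pd.Series(["2016-09-01 12:21:25.716000"])

# Epoch seconds -> datetime (assumed to be UTC).
click_dt = pd.to_datetime(click_ts, unit="s")

# Datetime string -> datetime.
experiment_dt = pd.to_datetime(experiment_ts)

print(click_dt.iloc[0])
print(experiment_dt.iloc[0])
```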

In [None]:
session_action = df_experiment.merge(df_meta, on = overlap, how = 'left')
session_action.info()
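
A left merge silently keeps experiment rows that have no match in the meta data. One way to see how many rows actually matched is the `indicator` parameter of `DataFrame.merge`. A sketch on made-up frames, merging on `user_id` only (the frames and their values are illustrative assumptions):

```python
import pandas as pd

experiment = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "condition": ["1-Control", "2-Buttony-Conversion-Buttons", "1-Control"],
})
meta = pd.DataFrame({
    "user_id": ["u1", "u1", "u3"],
    "geo_country": ["GR", "GR", "NL"],
})

# indicator=True adds a '_merge' column describing where each row came from.
merged = experiment.merge(meta, on="user_id", how="left", indicator=True)

# 'both' = key found in the meta data, 'left_only' = no meta data for this row.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Note that a user with several meta data rows is duplicated in the result, so the merged frame can have more rows than the experiment frame.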

In [None]:
len(list(set(df_meta.user_id.values) & set(df_click.user_session.values)))

In [None]:
session_action = session_action.merge(df_click, how = 'left', left_on = 'user_id', right_on = 'user_session')

In [None]:
session_action = session_action[pd.notnull(session_action['action'])] # not sure whether absence implies no click
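
Dropping rows without an `action` keeps only sessions that appear in the clicking data. If absence from the clicking data were taken to mean that the visitor did not click (an assumption the data does not confirm), the rows could instead be kept and flagged. A sketch on made-up rows, again assuming `clic` denotes a click:

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "action": ["clic", None, "view"],
})

# Option used in the cell above: keep only rows with a recorded action.
with_action = sessions[sessions["action"].notnull()]

# Alternative under the stated assumption: a missing action counts as no click.
flagged = sessions.assign(clicked=sessions["action"].eq("clic"))

print(len(with_action))
print(flagged["clicked"].tolist())
```

The two options give different denominators for the success rate per version, so the choice should be stated explicitly in the analysis.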

In [None]:
session_action.info()

### 2 A/B Testing

### 3 Beyond A/B Testing