# Web Analytics (2IID0) Homework Assignment 3
## Exceptional Model Mining vs A/B Testing
#### <i>Abdel K. Bokharouss, Bart van Helvert, Joris Rombouts & Remco Surtel</i>   -   January 2018

In this assignment we combine Exceptional Model Mining (EMM) and A/B Testing. The dataset has been provided by StudyPortals. The core concept of A/B Testing is that each test subject is assigned one of two variants. We then measure the success rate per variant; the variant with the higher success rate is kept and the other is discarded. However, subgroups that would have been served better by the losing variant are then disadvantaged. To do better than this, we mine the data further to find coherent subgroups in which the alternative variant delivers more success. New visitors to the website who match such a subgroup can then be served version A or B, whichever works best for that subgroup. Exceptional Model Mining is the technique that allows us to discover these subgroups. First, we need to decide which attributes will be used as descriptors and which attributes will be used as targets.

<font color = "red">Search for "to-do" (Ctrl-F) to find the to-do's that need some evaluation/action</font>

### <font color="green">imports, preparation and configuration</font>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# 1 Data Preprocessing

<b><i>a. Which columns, available in which of the input files, must we designate as our targets in the EMM process?</i></b><br>
The raw data that StudyPortals delivered consists of three datasets: `clicking_data`, `experiment_details` and `meta_data`. Of primary importance to StudyPortals is the association between which version of the website was shown (version A or B) and whether or not the user clicked on the button. Therefore the binary attribute `action` of the dataset `clicking_data` is the first target attribute we define. We also need to know which version (A or B) was shown to the user, so the attribute `condition` of the dataset `experiment_details` is the second target attribute we define.
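To make the two targets concrete, the following is a minimal sketch on a toy dataframe, not on the StudyPortals data. The `action` value `'view'` is an assumption (only `'clic'` is visible in the data preview), and the column name `clicked` is introduced here purely for illustration.

```python
import pandas as pd

# Toy stand-in for click data joined with the experiment condition.
# The 'view' label is an assumption; only 'clic' is visible in the real preview.
toy = pd.DataFrame({
    "action": ["clic", "view", "view", "clic", "clic", "view"],
    "condition": ["1-Control", "1-Control", "1-Control",
                  "2-Buttony-Conversion-Buttons",
                  "2-Buttony-Conversion-Buttons",
                  "2-Buttony-Conversion-Buttons"],
})

# Target 1 as a 0/1 column: did the visitor click?
toy["clicked"] = toy["action"].eq("clic").astype(int)

# 2x2 contingency table of the two targets (version shown vs. clicked).
contingency = pd.crosstab(toy["condition"], toy["clicked"])

# Click rate per version: the mean of a 0/1 column is the share of clicks.
click_rate = toy.groupby("condition")["clicked"].mean()

print(contingency)
print(click_rate)
```

An EMM quality measure would compare such a table computed inside a subgroup with the same table computed on the whole dataset.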

<b><i>b.</i></b><br>After the targets are defined, the next step is to select which attributes will be used as descriptors. The goal is to gather as much descriptor information as is reasonably possible. First we explore the data and check how many NaN values the three datasets contain.

In [2]:
df_click = pd.read_csv('data/clicking_data.csv')
df_click.head()

Unnamed: 0,action,action_label,action_type,tstamp,user_session
0,clic,revenue,link,1472755490,379881d5-32d7-49f4-bf5b-81fefbc5fcce
1,clic,revenue,link,1472839117,2a0f4218-4f62-479b-845c-109b2720e6e7
2,clic,revenue,link,1472879219,a511b6dc-2dca-455b-b5e2-bf2d224a5505
3,clic,revenue,link,1472890876,9fb616a7-4e13-4307-ac92-0b075d7d376a
4,clic,revenue,link,1472892380,64816772-688d-4460-a591-79aa49bba0d5


In [3]:
df_click.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 919 entries, 0 to 918
Data columns (total 5 columns):
action          919 non-null object
action_label    919 non-null object
action_type     919 non-null object
tstamp          919 non-null int64
user_session    919 non-null object
dtypes: int64(1), object(4)
memory usage: 36.0+ KB


In [4]:
df_experiment = pd.read_csv('data/experiment_details_new.csv') # to-do: need to use "_new" dataset?
df_experiment.head()

Unnamed: 0,user_id,experiment_id,condition,collector_tstamp
0,7794c3b2-4d08-4bb8-bfc9-b50935fed1fc,86-v2-Butonny buttons,1-Control,2016-09-01 12:21:25.716000
1,0baf5074-74bd-4257-ac9f-a07d25f37667,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:01.726000
2,623d19a6-64b4-412a-8143-750995742605,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:31.797000
3,d4b62fc9-dead-441f-936a-db08c2711a1e,86-v2-Butonny buttons,2-Buttony-Conversion-Buttons,2016-09-01 12:22:53.218000
4,8de611c2-0e10-4f63-a016-0f1bd7a683e1,86-v2-Butonny buttons,1-Control,2016-09-01 12:24:15.978000


In [5]:
df_experiment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14847 entries, 0 to 14846
Data columns (total 4 columns):
user_id             14764 non-null object
experiment_id       14847 non-null object
condition           14847 non-null object
collector_tstamp    14847 non-null object
dtypes: object(4)
memory usage: 464.0+ KB


In [6]:
df_meta = pd.read_csv('data/meta_data_modified.csv', encoding = 'ansi') # note: 'ansi' is a Windows-only codec alias
df_meta.head()

Unnamed: 0,platform,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,domain_userid,domain_sessionid,user_id,geo_country,geo_region,...,pp_yoffset_max,useragent,browser_language,browser_cookies,browser_colordepth,browser_viewdepth,browser_viewheight,os_name,os_timezone,dvce_type
0,web,06:22.2,06:33.7,06:26.2,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,915.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
1,web,06:49.2,07:00.2,06:52.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
2,web,07:09.2,07:20.1,07:12.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
3,web,09:49.3,10:01.0,09:53.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,3411.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
4,web,10:40.3,10:51.8,10:44.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,


In [7]:
df_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171949 entries, 0 to 171948
Data columns (total 30 columns):
platform               171949 non-null object
etl_tstamp             171949 non-null object
collector_tstamp       171949 non-null object
dvce_created_tstamp    171949 non-null object
event                  171949 non-null object
domain_userid          171949 non-null object
domain_sessionid       171949 non-null object
user_id                171949 non-null object
geo_country            171690 non-null object
geo_region             171949 non-null object
geo_city               88981 non-null object
geo_region_name        87899 non-null object
geo_timezone           165463 non-null object
page_url               171949 non-null object
page_title             133368 non-null object
page_referrer          63876 non-null object
refr_source            26349 non-null object
pp_xoffset_min         40918 non-null float64
pp_xoffset_max         40918 non-null float64
pp_yoffset_min     

In [8]:
df_meta.isnull().any(axis = 1).sum() # number of rows with at least one NaN value

171862

This whole assignment is about analyzing the behavior of visitors in terms of an unusual association between which version of the website they were shown (version A or B) and whether or not they clicked on a particular button. Before we assess the best possible strategy for handling the missing values, we first need to evaluate the volume and completeness of the attributes that identify a specific visitor, since the visitors are at the core of this analysis.

In [9]:
df_meta[['domain_userid', 'domain_sessionid', 'user_id']].apply(
    pd.Series.nunique) # unique number of values for each column

domain_userid        8666
domain_sessionid    10310
user_id              9373
dtype: int64

In [10]:
"unique user-sessions in click data: ", len(df_click.user_session.unique())

('unique user-sessions in click data: ', 778)

In [11]:
"unique user-id's in experiment data: ", len(df_experiment.user_id.unique())

("unique user-id's in experiment data: ", 9361)

Each user is identified by a unique user id and each visit is identified by a unique user session. Since we are interested in whether a user who visits a certain version of the website clicks on a particular button during a session, we want to limit ourselves to the session id's for which action labels are available. We can see that the numbers of user id's and session id's in the three datasets are not in line with each other. The goal is to obtain a dataframe that holds, for each user session and id, the experiment details and all the relevant information we can obtain from the meta data file.
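One way to quantify how far the identifiers are out of line is to compare them as sets. The sketch below uses toy identifiers; the helper name `id_overlap` and the keys of the returned dictionary are hypothetical and only serve as an illustration.

```python
import pandas as pd

def id_overlap(click_ids, experiment_ids, meta_ids):
    """Count how many click identifiers also occur in the other two sources."""
    # dropna() removes missing identifiers so they are never counted as a match.
    click = set(pd.Series(click_ids).dropna())
    experiment = set(pd.Series(experiment_ids).dropna())
    meta = set(pd.Series(meta_ids).dropna())
    return {
        "click_in_experiment": len(click & experiment),
        "click_in_meta": len(click & meta),
        "click_in_both": len(click & experiment & meta),
        "click_unmatched": len(click - experiment - meta),
    }

# Toy identifiers, not taken from the StudyPortals files.
overlap = id_overlap(["a", "b", "c", "c"],
                     ["a", "b", "x", None],
                     ["a", "c", "y"])
print(overlap)
```

In the notebook the call would presumably look like `id_overlap(df_click.user_session, df_experiment.user_id, df_meta.user_id)`.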

### First step: Removing users who have seen two versions 

It is customary in A/B testing and similar research to remove users who have seen both versions of the web page, as explained in the referenced paper on A/B- and A&B-testing with EMM. This is the first step we take in the pre-processing part of the assignment. Note that we can obtain the version to which users were exposed from the experiment dataset. If a user (identified by a particular <i>user_id</i> in this problem context) has several entries in the experiment dataset with different values in the <i>condition</i> column (the website version indicator), this user should be removed from the dataset(s) used in this assignment.

In [12]:
pd.concat(g for _, g in df_click.groupby("user_session") if len(g) > 1).shape # just an observation

(235, 5)

From the previous statement we can conclude that the clicking data contains multiple rows for the same unique <i>user_session</i> identifier. Repeated identifiers are a first hint that users who have seen both versions of the website may be present, and such users can distort the research results. Let's start looking for the users (id's, sessions) in question.

In [13]:
two_versions = df_experiment.copy()
two_versions['count'] = 1 # new column to be used later
two_versions = two_versions.pivot_table(index = 'user_id', columns = 'condition', aggfunc = sum)
two_versions.fillna(0, inplace = True)
two_versions["two_versions_one_userid"] = np.where((((two_versions[('count', '1-Control')] > 0) & (two_versions[('count', '2-Buttony-Conversion-Buttons')] == 0)) |
                                            ((two_versions[('count', '1-Control')] == 0) & (two_versions[('count', '2-Buttony-Conversion-Buttons')] > 0))), 0, 1)
have_seen_two_versions = list(two_versions.loc[two_versions.two_versions_one_userid == 1].index.unique())

We now have a list of users (<i>have_seen_two_versions</i>) and the actual removal of these users is going to be done in the next step.

In [14]:
len(have_seen_two_versions)

233
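The pivot-table logic above can also be expressed as a count of distinct conditions per user. The following is a minimal sketch on toy data, under the assumption that "has seen two versions" means "appears with more than one distinct `condition`"; the function name `users_with_multiple_conditions` is hypothetical.

```python
import pandas as pd

def users_with_multiple_conditions(experiment, id_col="user_id", cond_col="condition"):
    """Return the ids that appear with more than one distinct condition."""
    # groupby skips rows with a missing id; nunique counts distinct conditions.
    n_conditions = experiment.groupby(id_col)[cond_col].nunique()
    return list(n_conditions[n_conditions > 1].index)

# Toy experiment data: u1 saw both versions, u3 saw version B twice.
toy_experiment = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", None],
    "condition": ["1-Control", "2-Buttony-Conversion-Buttons",
                  "1-Control",
                  "2-Buttony-Conversion-Buttons", "2-Buttony-Conversion-Buttons",
                  "1-Control"],
})

print(users_with_multiple_conditions(toy_experiment))
```

Applied to `df_experiment`, this should flag the same users as the pivot table, since both approaches only look at which conditions occur per `user_id`.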

### Second step: Merging the clicking data with the experiment details

Unfortunately, there are quite a few users who would distort the research results. The clicking dataset is already relatively small, so let's hope we do not lose too many instances of this dataset.

In [15]:
len(list(set(have_seen_two_versions) & set(df_click.user_session.unique())))

16

As can be seen from the previous statement, removing these users from the clicking data will only result in the loss of clicking/experiment data (click or view) of 16 users (id's, sessions).

Aside from the user_id's/sessions in the list <i>have_seen_two_versions</i>, each unique identifier in the clicking dataset is associated with at most one version (<i>condition</i>), as can be found in the experiment dataset. Let's make a dataframe with two relevant columns: the unique identifier and the condition (version). This only works for users who have seen one version, so the users in the list need to be removed first.

In [16]:
users_conditions = df_experiment.copy()
users_conditions = users_conditions[~users_conditions.user_id.isin(have_seen_two_versions)] # drop users who have seen more than one version

In [17]:
users_conditions.drop_duplicates(subset = ['user_id'], keep = 'first', inplace = True)
users_conditions.reset_index(drop = True, inplace = True)

We now have a dataframe <i>users_conditions</i> which holds all the unique user_id's/sessions that were exposed to this experiment (and to only one version). The next step is to merge this dataset with the clicking data. We will then know, for each entry in the clicking dataset, the condition (website version) of the user_id/session that produced the click/view.

In [18]:
clicking_conditions = pd.merge(left = df_click, right = users_conditions, left_on = 'user_session', right_on = 'user_id')
clicking_conditions.drop(['user_session', 'experiment_id'], axis = 1, inplace = True) # to-do: can also remove action_label, action_type

We can obviously delete some columns, since they are not relevant to the further analysis.
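The inner merge above silently drops click rows that have no matching `user_id` (in this notebook, 919 click rows become 902). A minimal sketch of how such rows could be made visible, using toy data and the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy data: session s9 has no entry in the experiment details.
clicks = pd.DataFrame({"user_session": ["s1", "s2", "s2", "s9"],
                       "action": ["clic", "view", "clic", "clic"]})
conditions = pd.DataFrame({"user_id": ["s1", "s2", "s3"],
                           "condition": ["1-Control",
                                         "2-Buttony-Conversion-Buttons",
                                         "1-Control"]})

# how="left" keeps every click row, indicator=True adds a `_merge` column
# ("both" or "left_only"), and validate="many_to_one" raises an error if
# user_id is not unique in the right-hand frame.
merged = pd.merge(clicks, conditions, how="left",
                  left_on="user_session", right_on="user_id",
                  indicator=True, validate="many_to_one")

unmatched = merged[merged["_merge"] == "left_only"]
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(len(matched), "matched rows;", len(unmatched), "unmatched rows")
```

Inspecting `unmatched` shows which sessions are lost, for example because the user was removed for having seen both versions.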

In [19]:
clicking_conditions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 902 entries, 0 to 901
Data columns (total 7 columns):
action              902 non-null object
action_label        902 non-null object
action_type         902 non-null object
tstamp              902 non-null int64
user_id             902 non-null object
condition           902 non-null object
collector_tstamp    902 non-null object
dtypes: int64(1), object(6)
memory usage: 56.4+ KB


### Third step: enriching the clicking data with meta data

The third and final preprocessing step is to enrich the (merged) clicking and experiment data with as much meta data as possible. The main objective of this assignment is to identify subpopulations where the targets display an unusual interaction: can we find subgroups where the click rate interacts exceptionally with the web page version (condition)? To find these subgroups we need suitable descriptors, which will mostly be obtained from the meta data, since it contains things such as device characteristics, location information and language data. This information can be used for A&B Testing (with Exceptional Model Mining) because it can be queried before a new user (session/id) is exposed to one of the versions; if all goes well, these characteristics place the user in a certain subgroup.

Let's first make an assessment of the columns in the meta dataset which are definitely not usable.

In [20]:
enrich_meta = df_meta.copy()

In [21]:
print(enrich_meta.platform.unique()) # one unique value for the entire dataframe
enrich_meta.drop(['platform'], axis = 1, inplace = True)

['web']


In [22]:
enrich_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171949 entries, 0 to 171948
Data columns (total 29 columns):
etl_tstamp             171949 non-null object
collector_tstamp       171949 non-null object
dvce_created_tstamp    171949 non-null object
event                  171949 non-null object
domain_userid          171949 non-null object
domain_sessionid       171949 non-null object
user_id                171949 non-null object
geo_country            171690 non-null object
geo_region             171949 non-null object
geo_city               88981 non-null object
geo_region_name        87899 non-null object
geo_timezone           165463 non-null object
page_url               171949 non-null object
page_title             133368 non-null object
page_referrer          63876 non-null object
refr_source            26349 non-null object
pp_xoffset_min         40918 non-null float64
pp_xoffset_max         40918 non-null float64
pp_yoffset_min         40918 non-null float64
pp_yoffset_max     

From the dataset info we can see that some columns contain few NaN values. Let's first focus on these columns before checking whether we can still obtain something from the columns that contain (relatively) many NaN values. We start with a completion percentage of 75% as the lower bound.

In [23]:
atleast_seventy_five = list(enrich_meta.loc[:, enrich_meta.isnull().mean() < 0.25].columns)
atleast_seventy_five

['etl_tstamp',
 'collector_tstamp',
 'dvce_created_tstamp',
 'event',
 'domain_userid',
 'domain_sessionid',
 'user_id',
 'geo_country',
 'geo_region',
 'geo_timezone',
 'page_url',
 'page_title']

The columns 'event', 'geo_country', 'geo_region', 'geo_timezone', 'page_url' and 'page_title' would certainly be good candidate descriptors. Let's check whether this data is available for all/the majority of the users/sessions in the clicking dataset.

In [47]:
enrich_meta_first = enrich_meta[atleast_seventy_five]

We can merge this subset of columns with the clicking data using the user_id and the collector_tstamp. Note that the original clicking data did not have a collector_tstamp attribute, which makes it hard to be a hundred percent sure that a particular action was made during a particular session. The least we can do is make sure that meta data such as the page_url and page_title conforms to the experiment dataset (things such as geo information will probably be less prone to change).

In [31]:
enrich_meta_first.loc[enrich_meta_first.user_id == '379881d5-32d7-49f4-bf5b-81fefbc5fcce'].head(1)

Unnamed: 0,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,domain_userid,domain_sessionid,user_id,geo_country,geo_region,geo_timezone,page_url,page_title
4441,55:48.3,56:05.0,55:30.6,page_view,5346c722-7418-4c41-8ce8-f17ac198ff05,ce5519f6-4952-4722-8a04-b186f856f640,379881d5-32d7-49f4-bf5b-81fefbc5fcce,CY,,Asia/Nicosia,http://www.mastersportal.eu/study-options/2689...,10 Human Computer Interaction Master's degrees...


In [30]:
clicking_conditions.loc[clicking_conditions.user_id == '379881d5-32d7-49f4-bf5b-81fefbc5fcce']

Unnamed: 0,action,action_label,action_type,tstamp,user_id,condition,collector_tstamp
0,clic,revenue,link,1472755490,379881d5-32d7-49f4-bf5b-81fefbc5fcce,1-Control,2016-09-01 18:43:14.257000


One can, however, see that the timestamp formats of the two datasets do not match. In addition, even if one went as far as converting the <i>collector_tstamp</i> to the format used in the meta dataset, one would still find different / conflicting timestamps. This was later confirmed and explained by Tara from StudyPortals: <i>"Timestamp in the experiment file indicates the timestamp that the users starts the session while seeing the variation A or B. I recommend you to use the meta-data timestamp for your further analysis. "</i>. We are thus just going to merge on the user id's.
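As an aside, the click `tstamp` and the experiment `collector_tstamp` can at least be brought to a common type. The sketch below uses the two values shown for the example user and assumes that `tstamp` is Unix time in seconds and that both timestamps are in the same time zone; neither assumption is confirmed by the data description.

```python
import pandas as pd

# Values copied from the two rows shown above for the example user.
click_tstamp = 1472755490                     # `tstamp` in the clicking data
session_start = "2016-09-01 18:43:14.257000"  # `collector_tstamp` in the experiment data

# Assumption: `tstamp` counts seconds since the Unix epoch.
click_time = pd.to_datetime(click_tstamp, unit="s")
start_time = pd.to_datetime(session_start)

# Time between the start of the session and the click, in seconds.
seconds_after_start = (click_time - start_time).total_seconds()

print(click_time, seconds_after_start)
```

Under these assumptions the click falls about a minute and a half after the session start, which fits the explanation that the experiment timestamp marks the beginning of the session. The meta data timestamps (e.g. `55:48.3`) only show minutes and seconds, so they cannot be aligned in the same way.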

In [50]:
enrich_meta_user_id = enrich_meta_first.copy()

In [51]:
enrich_meta_user_id.drop_duplicates(subset = ['user_id'], inplace = True, keep = 'first')
enrich_meta_user_id.reset_index(drop = True, inplace = True)

In [58]:
enrich_click = pd.merge(clicking_conditions, enrich_meta_user_id, on = 'user_id')
enrich_click.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 902 entries, 0 to 901
Data columns (total 18 columns):
action                 902 non-null object
action_label           902 non-null object
action_type            902 non-null object
tstamp                 902 non-null int64
user_id                902 non-null object
condition              902 non-null object
collector_tstamp_x     902 non-null object
etl_tstamp             902 non-null object
collector_tstamp_y     902 non-null object
dvce_created_tstamp    902 non-null object
event                  902 non-null object
domain_userid          902 non-null object
domain_sessionid       902 non-null object
geo_country            900 non-null object
geo_region             902 non-null object
geo_timezone           864 non-null object
page_url               902 non-null object
page_title             872 non-null object
dtypes: int64(1), object(17)
memory usage: 133.9+ KB


We now have a relatively complete dataset with clicking data, experiment data (condition) and a significant amount of meta data which are potential descriptors in the next phase. Let's check whether we can still obtain some other useful meta data from the columns which were less complete.

In [59]:
less_complete = list(set(enrich_meta.columns) - set(atleast_seventy_five))
less_complete

['browser_language',
 'os_timezone',
 'os_name',
 'browser_colordepth',
 'pp_yoffset_min',
 'refr_source',
 'browser_cookies',
 'geo_city',
 'dvce_type',
 'page_referrer',
 'geo_region_name',
 'pp_xoffset_min',
 'browser_viewdepth',
 'useragent',
 'pp_xoffset_max',
 'pp_yoffset_max',
 'browser_viewheight']

In [60]:
enrich_meta_second = enrich_meta[less_complete + ['user_id']]
#enrich_meta_second = enrich_meta_second.copy() # to prevent a potential warning

In [63]:
enrich_meta_second.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171949 entries, 0 to 171948
Data columns (total 18 columns):
browser_language      7853 non-null object
os_timezone           7677 non-null object
os_name               7862 non-null object
browser_colordepth    7862 non-null float64
pp_yoffset_min        40918 non-null float64
refr_source           26349 non-null object
browser_cookies       7862 non-null object
geo_city              88981 non-null object
dvce_type             7862 non-null object
page_referrer         63876 non-null object
geo_region_name       87899 non-null object
pp_xoffset_min        40918 non-null float64
browser_viewdepth     7862 non-null float64
useragent             69825 non-null object
pp_xoffset_max        40918 non-null float64
pp_yoffset_max        40918 non-null float64
browser_viewheight    7862 non-null float64
user_id               171949 non-null object
dtypes: float64(7), object(11)
memory usage: 23.6+ MB


An attribute is only fit to be a candidate descriptor if it is available before or while the page loads, since the main goal of A&B Testing is to serve a dynamic web page / dynamic page elements. Therefore, some columns can already be excluded from the list of potential descriptors in the next task. Take for example the scrolling characteristics: these continuous values only exist once the visitor has interacted with the page, so one cannot expect them to be of any use as descriptors.

In [69]:
not_fit = ['pp_xoffset_min', 'pp_yoffset_min', 'pp_xoffset_max', 'pp_yoffset_max']

The browser information can be used to load dynamic pages. One can, for example, use JavaScript for this purpose on the front end, although handling it on the server side would be better. The color depth, for example, can be read in JavaScript (<i>screen.colorDepth</i>), and this value can then be used in logical statements to load a certain web page version or certain page elements.

Let's check the remaining attributes and their values in the dataset to exclude some more potential descriptors.

In [80]:
columns_meta = list(set(enrich_meta_second.columns) - set(not_fit))
for c in columns_meta:
    print(c, '\n', "unique number of values:", len(enrich_meta_second[c].unique()), '\n',
          enrich_meta_second[c].unique(), '\n')

browser_language 
 unique number of values: 26 
 [nan 'en' 'zh-cn' 'en-US' 'en-GB' 'en-IN' 'sw' 'fr-FR' 'fr' 'vi' 'el' 'ro'
 'es-ES' 'ar' 'id' 'de' 'zh-CN' 'ru' 'en-SG' 'de-CH' 'en-CN' 'en-JM' 'ca'
 'pt' 'hi' 'it'] 

os_timezone 
 unique number of values: 47 
 [nan 'Africa/Lagos' 'UTC' 'Europe/London' 'Asia/Baghdad' 'Asia/Dhaka'
 'Asia/Tokyo' 'Pacific/Majuro' 'Asia/Kolkata' 'Europe/Minsk'
 'Africa/Johannesburg' 'America/Los_Angeles' 'Asia/Jakarta'
 'America/Mexico_City' 'Europe/Helsinki' 'Asia/Karachi'
 'America/Argentina/Buenos_Aires' 'America/Santo_Domingo' 'Asia/Kathmandu'
 'Africa/Cairo' 'Australia/Sydney' 'Asia/Shanghai' 'Asia/Baku'
 'Africa/Windhoek' 'Europe/Berlin' 'Asia/Omsk' 'America/Montevideo'
 'America/Noronha' 'Atlantic/Azores' 'Pacific/Gambier'
 'Atlantic/Cape_Verde' 'America/New_York' 'America/Mazatlan'
 'Pacific/Pago_Pago' 'Asia/Beirut' 'Europe/Moscow' 'Asia/Tehran'
 'Asia/Irkutsk' 'Asia/Rangoon' 'Pacific/Kiritimati' 'Pacific/Tongatapu'
 'Asia/Dubai' 'America/Godthab' '

* browser_language seems fit to use as a potential descriptor.
* os_timezone also seems usable, but we already have geo_timezone, which has the same format and holds the same information. We can, however, check whether this attribute can be used to fill in NaN values of geo_timezone.
* os_name seems fit to use as a potential descriptor, but needs some feature engineering (i.e. flattening sub-OSs into one OS, for example mapping the different Android versions to just "Android"). This will increase the attribute's usability as a descriptor later on.
* browser_colordepth can be used, but it is far-fetched to assume that the color depth will be a decent descriptor of any subgroup.
* refr_source seems fit to use as a potential descriptor. To-do: check whether a NaN value for this attribute implies that the referral was internal (i.e. clicking through the website).
* browser_cookies seems fit to use as a potential descriptor, if one assumes that a NaN value indicates that the user does not allow the website to use cookies.
* geo_city has a lot of unique values. One is better off using the higher-level location attributes.
* dvce_type seems fit to use as a potential descriptor.
* page_referrer needs some feature engineering to be used as a potential descriptor. This attribute can be used to complete/improve the usability of the refr_source attribute.
* geo_region_name has a lot of unique values. One is better off using the higher-level location attributes.
* useragent is in itself not very useful as a descriptor, but can be used to complete the os_name info, if necessary.
* browser_viewdepth/browser_viewheight could be used, but it is far-fetched to assume that these attributes will be a decent descriptor of any subgroup (even if they were binned).
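Some of the feature engineering suggested in the list above can be sketched as follows. This is only an illustration on toy rows: the helper names `language_family` and `os_family`, the keyword list and the shortened user-agent strings are assumptions, not part of the original notebook.

```python
import pandas as pd

def language_family(tag):
    """Collapse a browser language tag such as 'en-US' or 'zh-cn' to 'en' / 'zh'."""
    if pd.isna(tag):
        return tag
    return str(tag).split("-")[0].lower()

def os_family(useragent):
    """Rough OS family derived from a user-agent string (illustrative keywords)."""
    if pd.isna(useragent):
        return useragent
    ua = str(useragent)
    # Order matters: Windows Phone user agents may also mention Android.
    for keyword, family in [("Windows Phone", "Windows Phone"),
                            ("Android", "Android"),
                            ("iPhone", "iOS"),
                            ("iPad", "iOS"),
                            ("BlackBerry", "BlackBerry"),
                            ("BB10", "BlackBerry")]:
        if keyword in ua:
            return family
    return "Other"

# Toy rows; the user-agent strings are shortened, made-up examples.
toy = pd.DataFrame({
    "browser_language": ["en-US", "zh-cn", None],
    "useragent": ["Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300)",
                  "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_4 like Mac OS X)",
                  "Mozilla/5.0 (Mobile; Windows Phone 8.1; Android 4.0)"],
    "geo_timezone": [None, "Asia/Nicosia", None],
    "os_timezone": ["Africa/Lagos", "Europe/London", None],
})

toy["language"] = toy["browser_language"].map(language_family)
toy["os_family"] = toy["useragent"].map(os_family)
# Use os_timezone only where geo_timezone is missing.
toy["timezone"] = toy["geo_timezone"].fillna(toy["os_timezone"])

print(toy[["language", "os_family", "timezone"]])
```

Collapsing values in this way reduces the number of distinct descriptor values, which usually makes the resulting subgroups larger and easier to interpret.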

In [96]:
user_meta = df_meta.copy()
user_meta.drop_duplicates(subset = ['user_id'], inplace = True)
user_meta.reset_index(inplace = True, drop = True)

In [99]:
user_meta.fillna(np.nan, inplace = True)
user_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9373 entries, 0 to 9372
Data columns (total 30 columns):
platform               9373 non-null object
etl_tstamp             9373 non-null object
collector_tstamp       9373 non-null object
dvce_created_tstamp    9373 non-null object
event                  9373 non-null object
domain_userid          9373 non-null object
domain_sessionid       9373 non-null object
user_id                9373 non-null object
geo_country            9354 non-null object
geo_region             9373 non-null object
geo_city               4858 non-null object
geo_region_name        4802 non-null object
geo_timezone           9002 non-null object
page_url               9373 non-null object
page_title             9001 non-null object
page_referrer          5876 non-null object
refr_source            4896 non-null object
pp_xoffset_min         1760 non-null float64
pp_xoffset_max         1760 non-null float64
pp_yoffset_min         1760 non-null float64
pp_yoffset

In [117]:
df_meta

Unnamed: 0,platform,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,domain_userid,domain_sessionid,user_id,geo_country,geo_region,...,pp_yoffset_max,useragent,browser_language,browser_cookies,browser_colordepth,browser_viewdepth,browser_viewheight,os_name,os_timezone,dvce_type
0,web,06:22.2,06:33.7,06:26.2,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,915.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
1,web,06:49.2,07:00.2,06:52.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
2,web,07:09.2,07:20.1,07:12.6,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
3,web,09:49.3,10:01.0,09:53.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,3411.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
4,web,10:40.3,10:51.8,10:44.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
5,web,11:00.3,11:11.8,11:04.4,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,,,,,,,,,,
6,web,23:43.5,23:54.9,23:47.0,page_ping,e8702985-ffee-4ede-a89f-396312539812,0ca47d52-1375-4346-ab2e-b205b64f642e,bc5effb7-a476-414f-a62e-5479022e7553,GR,,...,55.0,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,,,,,,,,
7,web,02:50.5,03:02.7,02:56.2,page_ping,cd6b8831-aaec-43ef-94d8-4eec2f31c98b,6a7c5739-b04c-45a3-9687-a645d10e159d,90ecf090-9317-4a63-91f6-63b60b8d069e,NL,6,...,,,,,,,,,,
8,web,38:06.9,38:18.8,36:48.6,page_view,b8a9c575-3aa6-4ff6-b101-c5135c25e8b9,ae17ece4-ea7b-4b6f-a8a1-aceb1274018c,c46ec943-7db4-4f3a-a093-86a53c841d43,NG,,...,,Mozilla/5.0 (Linux; Android 4.4.2; Infinix-X55...,,,,,,,,
9,web,03:24.4,03:36.5,03:30.0,page_ping,cd6b8831-aaec-43ef-94d8-4eec2f31c98b,08c3c5fa-b943-4c74-9b82-92b3fbfaba13,90ecf090-9317-4a63-91f6-63b60b8d069e,NL,6,...,,,,,,,,,,


In [116]:
df_meta.groupby('user_id').first().reset_index()

Unnamed: 0,user_id,platform,etl_tstamp,collector_tstamp,dvce_created_tstamp,event,domain_userid,domain_sessionid,geo_country,geo_region,...,pp_yoffset_max,useragent,browser_language,browser_cookies,browser_colordepth,browser_viewdepth,browser_viewheight,os_name,os_timezone,dvce_type
0,00054fe1-4c25-4634-882a-50813fb2cd15,web,11:54.5,12:10.9,12:01.2,page_view,a7fe1b94-e793-4a29-8f49-3618557cc7fd,844454a0-2b8b-41f0-a6dd-e95aea97d365,GR,35,...,,Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-...,,,,,,,,
1,0009ab14-821d-4275-b2ba-bf6ace3dcf5b,web,18:17.3,18:40.8,18:27.7,page_view,b982453f-eb9a-4bf0-9be7-3b72f4f2a8a6,5bd40a76-2e1c-4b91-8d9f-61f5f8884540,GB,,...,,Opera/9.80 (BlackBerry; Opera Mini/8.0.35667/3...,en,True,4.0,640.0,330.0,BlackBerryOS,Africa/Lagos,Mobile
2,000ddf66-651b-4701-8439-577d45048f0e,web,33:35.9,34:02.1,33:46.9,page_view,3f57247d-02d2-41db-93bf-c3a23492f7e8,fefd9881-ef8a-486c-ae2b-7c6a1a7b47e3,CZ,52,...,3520.0,Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_4 like ...,,,,,,,,
3,00109076-45a4-4491-8b19-94216053a091,web,45:22.7,45:46.5,45:30.5,page_view,b6a677b3-893a-4107-9a0f-9b5570d98f10,f73564dd-950a-4e95-a21b-2a3fcacbf039,NG,,...,1341.0,Mozilla/5.0 (BB10; Kbd) AppleWebKit/537.35 (K...,,,,,,,,
4,00161b83-56b0-42c6-b719-cd2f0d1bd818,web,13:01.2,13:17.7,13:14.2,page_ping,11d32392-ecad-486e-9180-4aed6fa67ed4,6408debe-65bb-49b2-933f-a1d9c6fc59f8,HR,,...,541.0,Mozilla/5.0 (Linux; Android 4.2.2; ALCATEL ONE...,,,,,,,,
5,001c2bad-237c-4fb1-b39c-6f65ba541806,web,41:34.8,41:58.1,41:45.0,page_view,fa1e9d38-b3af-449e-8759-e1a35ff45be0,3a719d50-483d-482f-b91d-ae5f0103d4af,GB,,...,2791.0,Mozilla/5.0 (Linux; Android 6.0.1; SM-G903F Bu...,,,,,,,,
6,001feed7-56d3-460b-ba78-b0fa2132d931,web,58:23.5,58:51.8,58:31.4,page_ping,8f623b37-5203-4e82-b94e-51f1acfe087a,b12e6ca3-2c27-4655-a82c-85eda580a4ac,DE,16,...,431.0,Mozilla/5.0 (Mobile; Windows Phone 8.1; Androi...,,,,,,,,
7,002bca42-f8d4-4686-a344-c7ae90823dc3,web,16:28.2,16:56.8,16:39.9,page_view,8fa02db8-c1be-496d-aa5b-3668244745d9,f5233037-2916-48b2-8583-050212e45f5b,US,CA,...,2115.0,Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like ...,,,,,,,,
8,002f2214-81f3-4ad6-a875-a7d4ef117196,web,14:47.6,15:16.3,15:00.1,page_view,f08062ca-f975-4313-a9fa-e325406495de,976e1d60-92f4-4485-af36-ea802c26c3f3,NL,,...,,Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_5 like ...,,,,,,,,
9,002fad5a-0f02-4d6f-83ee-1b497ad5dc4b,web,28:18.5,28:39.3,28:05.2,page_view,1fbef24a-c68d-4344-ae52-a84360ca6d23,fd522c30-7e9b-48ac-b3cc-3195c2eda625,IN,,...,2138.0,Mozilla/5.0 (Linux; Android 6.0.1; XT1562 Buil...,,,,,,,,
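The two ways of reducing the meta data to one row per user used in this notebook do not behave the same with respect to missing values. A minimal sketch on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy meta data: each user has one row with os_name and one row without.
toy_meta = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "geo_country": ["GR", "GR", "NL", "NL"],
    "os_name": [np.nan, "Android", "iOS", np.nan],
})

# drop_duplicates keeps the first ROW per user, including its NaN values.
first_row = (toy_meta.drop_duplicates(subset=["user_id"], keep="first")
             .reset_index(drop=True))

# groupby(...).first() takes the first NON-NULL value per column and user.
first_non_null = toy_meta.groupby("user_id").first().reset_index()

print(first_row)
print(first_non_null)
```

With `groupby(...).first()` the values in one output row may come from different input rows of the same user, which is acceptable for attributes that are stable per user but not for attributes that change within or between sessions.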


In [115]:
df_meta.drop_duplicates(subset = ['user_id']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9373 entries, 0 to 171935
Data columns (total 30 columns):
platform               9373 non-null object
etl_tstamp             9373 non-null object
collector_tstamp       9373 non-null object
dvce_created_tstamp    9373 non-null object
event                  9373 non-null object
domain_userid          9373 non-null object
domain_sessionid       9373 non-null object
user_id                9373 non-null object
geo_country            9354 non-null object
geo_region             9373 non-null object
geo_city               4858 non-null object
geo_region_name        4802 non-null object
geo_timezone           9002 non-null object
page_url               9373 non-null object
page_title             9001 non-null object
page_referrer          5876 non-null object
refr_source            4896 non-null object
pp_xoffset_min         1760 non-null float64
pp_xoffset_max         1760 non-null float64
pp_yoffset_min         1760 non-null float64
pp_yoffs

In [93]:
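# Compare every user in user_meta with every user_id row in df_meta.
for user_1 in user_meta.user_id:
    for user_2 in df_meta.user_id:
        if user_1 == user_2:
            for c in list(user_meta.columns):
                # to-do: the missing-value check for column c is still unfinished
                pass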


<class 'str'>

In [87]:
enrich_

Unnamed: 0,browser_language,os_timezone,os_name,browser_colordepth,pp_yoffset_min,refr_source,browser_cookies,geo_city,dvce_type,page_referrer,geo_region_name,pp_xoffset_min,browser_viewdepth,useragent,pp_xoffset_max,pp_yoffset_max,browser_viewheight,user_id
0,,,,,345.0,,,,,,,0.0,,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,0.0,915.0,,bc5effb7-a476-414f-a62e-5479022e7553
1,,,,,,,,,,,,,,,,,,bc5effb7-a476-414f-a62e-5479022e7553
2,,,,,,,,,,,,,,,,,,bc5effb7-a476-414f-a62e-5479022e7553
3,,,,,3356.0,,,,,,,0.0,,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,0.0,3411.0,,bc5effb7-a476-414f-a62e-5479022e7553
4,,,,,,,,,,http://www.mastersportal.eu/search/?q=ci-8|lv-...,,,,,,,,bc5effb7-a476-414f-a62e-5479022e7553
5,,,,,,,,,,http://www.mastersportal.eu/search/?q=ci-8|lv-...,,,,,,,,bc5effb7-a476-414f-a62e-5479022e7553
6,,,,,55.0,,,,,,,0.0,,Mozilla/5.0 (Linux; Android 4.4.4; GT-I9300 Bu...,0.0,55.0,,bc5effb7-a476-414f-a62e-5479022e7553
7,,,,,,,,Eindhoven,,,Noord-Brabant,,,,,,,90ecf090-9317-4a63-91f6-63b60b8d069e
8,,,,,,Google,,,,https://www.google.com.ng/,,,,Mozilla/5.0 (Linux; Android 4.4.2; Infinix-X55...,,,,c46ec943-7db4-4f3a-a093-86a53c841d43
9,,,,,,,,Eindhoven,,,Noord-Brabant,,,,,,,90ecf090-9317-4a63-91f6-63b60b8d069e
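
The table above has several rows per `user_id`, each carrying only a few non-missing attributes. A minimal sketch of one way to collapse such rows into a single row per user is shown below. It assumes that the first non-missing value per column is an acceptable representative for a user. The frame `events` and the result `per_user` are hypothetical names with made-up values; they are not part of the StudyPortals data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the per-event metadata; all values are invented.
events = pd.DataFrame({
    'user_id':        ['u1', 'u1', 'u1', 'u2', 'u2'],
    'geo_city':       [np.nan, 'Eindhoven', np.nan, np.nan, np.nan],
    'refr_source':    [np.nan, np.nan, 'Google', np.nan, np.nan],
    'pp_yoffset_max': [915.0, np.nan, 3411.0, np.nan, 55.0],
})

# GroupBy.first() takes, per column, the first non-null value in each group,
# so values scattered over several rows end up on one row per user.
per_user = events.groupby('user_id').first()
print(per_user)
```

Compared to a nested loop over both frames, this performs the grouping in a single pass. Columns that are missing in every row of a user simply stay missing.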


## --------------------------------------------------------------------------------------------

In [None]:
df_experiment.rename(columns={'timestamp':'collector_tstamp'}, inplace=True)

In [None]:
overlap = list(set(df_experiment.columns) & set(df_meta.columns))

In [None]:
len(list(set(df_meta.user_id.values) & set(df_experiment.user_id.values)))

In [None]:
#len(list(set(df_meta.collector_tstamp.values) & set(df_experiment.timestamp.values)))

In [None]:
df_experiment.shape, df_meta.shape

Tara from study_portals: 

In [None]:
session_action = df_experiment.merge(df_meta, on = overlap, how = 'left')
session_action.info()
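
A left merge can silently duplicate rows (when the right frame has several matches) or leave rows without metadata. The sketch below shows how the `indicator` option of `merge` could be used to inspect this. The frames `experiment` and `meta` are toy stand-ins with invented values; only the column names follow the notebook.

```python
import pandas as pd

# Toy frames; the values are invented for illustration.
experiment = pd.DataFrame({
    'user_id':   ['u1', 'u2', 'u3'],
    'condition': ['A', 'B', 'A'],
})
meta = pd.DataFrame({
    'user_id':     ['u1', 'u1', 'u3'],
    'geo_country': ['NL', 'NL', 'DE'],
})

# Join on all shared columns, as in the cell above.
shared = sorted(set(experiment.columns) & set(meta.columns))
checked = experiment.merge(meta, on=shared, how='left', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for experiment rows
# that have no metadata at all.
print(checked['_merge'].value_counts())
print(len(experiment), len(checked))
```

If the merged frame has more rows than `df_experiment`, some users matched multiple metadata rows. This matters later when success rates per variant are computed.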

In [None]:
len(list(set(df_meta.user_id.values) & set(df_click.user_session.values)))

In [None]:
session_action = session_action.merge(df_click, how = 'left', left_on = 'user_id', right_on = 'user_session')

In [None]:
session_action = session_action[pd.notnull(session_action['action'])] # not sure whether absence implies no click
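
Whether a missing `action` means "no click" is an open question, so it may help to compare both readings side by side. The sketch below does this on a toy frame. The 0/1 encoding of `action` and the names `merged`, `observed` and `assumed` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy merged frame; 'action' is NaN where no click record was joined.
merged = pd.DataFrame({
    'user_id':   ['u1', 'u2', 'u3', 'u4'],
    'condition': ['A', 'A', 'B', 'B'],
    'action':    [1.0, np.nan, np.nan, 0.0],
})

# Option 1 (as in the cell above): keep only rows with a recorded action.
observed = merged[merged['action'].notna()]

# Option 2 (assumption): interpret a missing action as "no click".
assumed = merged.assign(action=merged['action'].fillna(0))

# Success rate per variant under both readings.
print(observed.groupby('condition')['action'].mean())
print(assumed.groupby('condition')['action'].mean())
```

A large gap between the two rates would indicate that the choice affects the A/B comparison and should be confirmed with StudyPortals.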

# 2 A/B Testing

# 3 Beyond A/B Testing