# 2 A&B Testing

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import csv
from queue import Queue
import math
import itertools
from pandas.api.types import is_numeric_dtype

import warnings 
warnings.filterwarnings('ignore') #ignore warning messages from output beam_search

In [2]:
df_action = pd.read_csv('data/action_condition_meta.csv')
df_action.head()

Unnamed: 0,action,user_id,condition,geo_country,refr_source,browser_language,os_name,os_timezone,dvce_type
0,clic,379881d5-32d7-49f4-bf5b-81fefbc5fcce,1-Control,FI,Google,greek,Android,Europe,Mobile
1,clic,2a0f4218-4f62-479b-845c-109b2720e6e7,2-Buttony-Conversion-Buttons,AU,Google,english,iOS,Australia,Mobile
2,clic,a511b6dc-2dca-455b-b5e2-bf2d224a5505,2-Buttony-Conversion-Buttons,GB,Google,english,Android,Europe,Mobile
3,clic,9fb616a7-4e13-4307-ac92-0b075d7d376a,2-Buttony-Conversion-Buttons,FI,Google,english,iOS,Europe,Mobile
4,clic,64816772-688d-4460-a591-79aa49bba0d5,2-Buttony-Conversion-Buttons,BD,Google,english,Android,Asia,Mobile


### 2a - Beam Search implementation

<b>We created our own priority-queue class, because the standard <i>PriorityQueue</i> blocks (it cannot insert an element) once its maximum size is reached. We need a priority queue in which the item with the lowest priority is discarded to make room for a new item.
<i>priority_queue</i> represents a min-heap.</b>

The following function checks whether a certain element is already present in the data structure used to store the attributes (essentially a list whose elements are lists of tuples). It is used both in the priority-queue class and in the implementation of the beam-search algorithm.

In [3]:
def alreadyInList(list1, list2):
    permutations = list(itertools.permutations(list1))
    for t in permutations:
        if list(t) in list2:
            return True
    return False
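Generating every permutation grows factorially with the number of conditions. A minimal alternative sketch, assuming a description is a list of `(attribute, value)` tuples in which `value` is either a string or a list of strings; the names `canonical` and `already_in_list` are illustrative and not part of the notebook:

```python
def canonical(description):
    """Order-independent, hashable form of a list of (attribute, value) conditions."""
    return frozenset(
        # Lists (complement values) are unhashable, so turn them into tuples.
        (attribute, tuple(value) if isinstance(value, list) else value)
        for attribute, value in description
    )


def already_in_list(description, descriptions):
    """True if `descriptions` holds the same conditions in any order."""
    target = canonical(description)
    return any(canonical(other) == target for other in descriptions)


beam = [[("geo_country", "DE"), ("os_name", "iOS")]]
same_conditions = already_in_list([("os_name", "iOS"), ("geo_country", "DE")], beam)
other_conditions = already_in_list([("os_name", "iOS")], beam)
```

Unlike the permutation check, this sketch also treats a description that repeats a condition as equal to one that lists it once, which may or may not be desired.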

In [4]:
import heapq
from heapq import heappush, heappop

class priority_queue:
    def __init__(self, max_size):
        self.items = []
        self.max = max_size
   
    def push(self, item, priority):
        if ((len(self.items) < self.max) and (not alreadyInList(item, self.get_items()))):
            heapq.heappush(self.items, (priority, item))
        elif (not alreadyInList(item, self.get_items())):
            heapq.heappushpop(self.items, (priority, item))
    
    def get_items(self):
        result = []
        for i in self.items:
            result.append(i[1])
        return result

    def pop(self):
        return heapq.heappop(self.items)

    def get_max_item(self):
        return self.items[0]
    
    def empty(self):
        return not self.items
    
    def print_elements(self):
        result = []
        for i in self.items:
            result.append(i)
        return result
       
    def heap_sort(self):
        return [heapq.heappop(self.items) for _ in range(len(self.items))]
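The discard-the-lowest behaviour rests on `heapq.heappushpop`, which pushes the new entry and then removes the smallest one. A minimal standalone sketch of that idea; the class name `BoundedBest` and the insertion counter used as tie-breaker are assumptions for illustration and do not appear in the class above:

```python
import heapq


class BoundedBest:
    """Keep only the `max_size` highest-priority items."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []    # min-heap: the worst retained entry sits at index 0
        self._counter = 0  # tie-breaker, so the items themselves are never compared

    def push(self, item, priority):
        entry = (priority, self._counter, item)
        self._counter += 1
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop whichever entry is now the smallest.
            heapq.heappushpop(self._heap, entry)

    def best_first(self):
        """Retained (priority, item) pairs, highest priority first."""
        return [(priority, item) for priority, _, item in sorted(self._heap, reverse=True)]


queue = BoundedBest(max_size=2)
for name, score in [("a", 0.1), ("b", 0.5), ("c", 0.3)]:
    queue.push(name, score)
ranked = queue.best_first()  # "a" has the lowest score and is discarded
```

The tie-breaker matters when two descriptions receive the same quality: without it, Python falls back to comparing the items, which can fail for mixed string and list values.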

The following function <i>create_dataframe</i> is used to obtain a subset of the <i>df_action</i> dataframe, based on the description language which is passed as argument. The returned dataframe is the subgroup of records (users) belonging to that particular description language, for example the group of users who were all using iOS as operating system at the time of the experiment.

<b>to-do (in preprocessing part?)</b>: The <i>create_dataframe</i> function drops duplicate rows (after applying the first condition). In other words, users that visited a particular version multiple times, with the same cookie settings and the same action type, are kept only once in the dataset. It could be that a user visits the website very frequently and performs the same action (view/clic) at every visit. This can skew the dataset, because such a user would be over-represented compared to users who visit the particular version less frequently. We don't remove all duplicated user ids, because sometimes a user clicked on the button in a particular version, while another time the same user did not click.

In [5]:
def create_dataframe(Set):
    d =df_action.copy()
    count = 0
    for column, item in Set:
        if count==0:
            if isinstance(item, str): # single value
                df_new = pd.merge(d, df_action.loc[(df_action[column] == item)], on=list(df_action), how='inner')
            else: # list of values
                df_new = pd.merge(d, df_action.loc[df_action[column].isin(item)], on=list(df_action), how='inner')
            df_new.drop_duplicates(inplace = True)
        else:
            if isinstance(item, str): # single value
                df_new = pd.merge(df_new, df_action.loc[(df_action[column] == item)], on=list(df_action), how='inner')
            else: # list of values
                df_new = pd.merge(df_new, df_action.loc[df_action[column].isin(item)], on=list(df_action), how='inner')
        count +=1
    return df_new

The following function <i>constraints</i> checks whether a certain subgroup, represented by a dataframe which is passed as argument, satisfies the constraints $C$. At the moment the only constraint is on the subgroup size: a subgroup should cover more than $2$% and less than $40$% of the records (users) in the original dataset. The upper bound is added because otherwise we get subgroups that are simply too large. Such subgroups differ only slightly from the dataset as a whole, so no conclusions can be drawn from them.

In [6]:
def constraints(df_matches):
    return (df_matches.shape[0] > (df_action.shape[0] *0.02)) & (df_matches.shape[0] < (df_action.shape[0] *0.40)) 
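With the 899 records reported for the preprocessed dataset, the two bounds translate into concrete subgroup sizes. A minimal sketch of the same check on plain counts; the name `within_coverage` and its keyword arguments are illustrative, not part of the notebook:

```python
def within_coverage(n_subgroup, n_total, lower=0.02, upper=0.40):
    """True when the subgroup covers strictly between `lower` and `upper` of the data."""
    return lower * n_total < n_subgroup < upper * n_total


N = 899  # number of records reported for the preprocessed dataset
checks = {n: within_coverage(n, N) for n in (10, 18, 355, 360)}
```

Under these bounds a subgroup needs at least 18 and at most 359 records to be admissible.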

The refinement operator inspects the records and decides to which type each attribute belongs.
There are 3 types of attributes:

1. Numeric: attribute whose records are all numbers
2. Binary: attribute whose records are true or false
3. Nominal: attribute with multiple different, non-numeric values

Each type is refined differently:

1. For numeric attributes we sort all records covered by the description language D and split them into equal-sized bins.
The number of bins depends on a predefined value. For each split point we add one description requiring the
numeric value to be greater than or equal to the split point and one requiring it to be smaller than or equal to it.
2. For binary attributes we add one description where the attribute is true and one where it is false.
3. For nominal attributes we add, for each value, one description with that value and one without it. In our case
the description without the value is the list of all other values of the attribute.

For the first level, we generate all patterns consisting of <i>one</i> condition on <i>one</i> attribute. All patterns are evaluated with the quality measure $\varphi$ and the $w$ best are saved as the <i>beam</i>. The attributes `action`, `condition` and `user_id` are discarded from the list of attributes, because these are not relevant for defining a description language.

In [7]:
def all_paterns_one_condition(y): # level 1
    result = []
    columns = list(df_action.columns.values)
    not_relevant = ['action', 'condition', 'user_id']
    columns = [x for x in columns if (x not in not_relevant)]
    for column in columns:
        if df_action[column].unique().size == 2: # binary attribute
            result.append((column, True))
            result.append((column, False))
        elif is_numeric_dtype(df_action[column]): # numeric attribute
            # equal height binning
            n = df_action.shape[0]
            values = sorted(df_action[column].values)
            for a in range(1, y):
                boundary_point = (a * n) // y # integer division: list indices must be integers
                found_boundary_value = values[boundary_point] # value at boundary point
                result.append((column, found_boundary_value)) # append takes a single (column, value) tuple
        else: # nominal attribute
            for value in df_action[column].unique():
                all_unique = list(df_action[column].unique())
                complement_value = [x for x in all_unique if x != value]
                result.append((column, value))
                result.append((column, complement_value))
    return result
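For numeric attributes, the refinement described above adds a "smaller than or equal" and a "greater than or equal" condition per split point. A minimal sketch of that step in isolation, assuming conditions are stored as `(column, operator, threshold)` triples; this triple format, the function name `numeric_conditions` and the `age` column are hypothetical and differ from the `(column, value)` tuples used in the notebook:

```python
def numeric_conditions(column, values, n_bins):
    """Equal-frequency split points turned into (column, operator, threshold) triples."""
    ordered = sorted(values)
    n = len(ordered)
    conditions = []
    for a in range(1, n_bins):
        threshold = ordered[(a * n) // n_bins]  # integer index of the a-th boundary
        for operator in ("<=", ">="):
            condition = (column, operator, threshold)
            if condition not in conditions:  # tied values can repeat a threshold
                conditions.append(condition)
    return conditions


ages = [18, 21, 21, 25, 30, 34, 40, 52]  # made-up values for a hypothetical column
conds = numeric_conditions("age", ages, n_bins=4)
```

With 8 values and 4 bins the split points fall at positions 2, 4 and 6 of the sorted list, which gives the thresholds 21, 30 and 40.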

As alluded to in the referenced paper, the (original) StudyPortals dataset comes naturally equipped with $m=2$ nominal targets. The first nominal target attribute is $condition$, a binary column that tells us to which version the particular user was exposed during the experiment (i.e. version A or B). The second nominal target attribute is $action$, a binary column representing whether the page visitor merely viewed or also clicked on the button in question during the experiment. Considering these peculiarities of the StudyPortals dataset, the natural choice of EMM instance is the association model class. So we strive to find subgroups for which the association between view/click and A/B is exceptional.

|      | View | Click |
|------|------|-------|
|   A  |$n_1$ | $n_2$ |
|   B  |$n_3$ | $n_4$ |

Now that we know which model class will be used, the next step is to define an appropriate quality measure. Since one can easily achieve huge deviations in target behaviour (the association between the two nominal target attributes) on very small subgroups, it makes sense to have a component in the quality measure that reflects the subgroup size. In addition, the quality measure needs a component that captures the target deviation.

<b>The Target Deviation Component</b> ($\varphi_{Q}(S)$)

The first quality measure that is implemented is Yule's Quality Measure, as described in section $4.3$ of the A&B Testing paper. For the quality measure, we use the cells of the target contingency table given above. Given a subgroup $S\subseteq \Omega$, we can assign each record in $S$ to the appropriate cell of this contingency table, which leads to count values $n_i$ such that $n_1 + n_2 + n_3 + n_4 = |S|$. Yule's Q is defined as $Q = \frac{n_1 \cdot n_4 - n_2 \cdot n_3}{n_1 \cdot n_4 + n_2 \cdot n_3}$. Higher numbers on the main diagonal imply a positive association between the two targets, and higher numbers off the main diagonal imply a negative association. The value of $Q$ instantiated by the subgroup $S$ is denoted by $Q_S$. We contrast Yule's Q instantiated by a subgroup $S$ with Yule's Q instantiated by its complement $S^\mathsf{C}$: $\varphi_{Q}(S) = |Q_S - Q_{S^\mathsf{C}}|$. This component detects subgroups whose view/click-A/B association is different from the rest of the dataset.

In [8]:
def target_deviation(df_matches):
    zero = np.finfo(np.double).tiny
    
    n_1 = df_matches.loc[(df_matches.action == 'view') & (df_matches.condition == '1-Control')].count()[0]
    n_2 = df_matches.loc[(df_matches.action == 'clic') & (df_matches.condition == '1-Control')].count()[0]
    n_3 = df_matches.loc[(df_matches.action == 'view') & (df_matches.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    n_4 = df_matches.loc[(df_matches.action == 'clic') & (df_matches.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    Q_s = (n_1*n_4 - n_2*n_3)/(n_1*n_4 + n_2*n_3)
    
    df_complement = df_matches.merge(df_action, indicator=True, how='outer')
    df_complement = df_complement[df_complement['_merge'] == 'right_only']
    df_complement.drop(['_merge'], axis=1, inplace = True)
    
    n_c_1 = df_complement.loc[(df_complement.action == 'view') & (df_complement.condition == '1-Control')].count()[0]
    n_c_2 = df_complement.loc[(df_complement.action == 'clic') & (df_complement.condition == '1-Control')].count()[0]
    n_c_3 = df_complement.loc[(df_complement.action == 'view') & (df_complement.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    n_c_4 = df_complement.loc[(df_complement.action == 'clic') & (df_complement.condition == '2-Buttony-Conversion-Buttons')].count()[0]

    Q_s_c = (n_c_1*n_c_4 - n_c_2*n_c_3)/(n_c_1*n_c_4 + n_c_2*n_c_3)
    
    if (math.isnan(Q_s_c)):
        Q_s_c = zero
    if (math.isnan(Q_s)):
        Q_s = zero

    phi_Q_S = abs(Q_s - Q_s_c)
    
    return phi_Q_S
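To make the arithmetic concrete, here is a minimal sketch that computes Yule's Q directly from the four cell counts instead of from a dataframe. The counts are made up and not taken from the StudyPortals data, and returning `0.0` for an undefined Q is an assumption of this sketch (the function above substitutes a tiny positive constant instead):

```python
def yules_q(n1, n2, n3, n4):
    """Yule's Q for the 2x2 table [[n1, n2], [n3, n4]] (rows A/B, columns view/click)."""
    denominator = n1 * n4 + n2 * n3
    if denominator == 0:
        return 0.0  # assumption: treat an undefined Q as "no association"
    return (n1 * n4 - n2 * n3) / denominator


# Made-up counts for a subgroup and its complement.
q_subgroup = yules_q(40, 10, 30, 20)    # version B is clicked relatively more often
q_complement = yules_q(30, 20, 40, 10)  # version A is clicked relatively more often
phi_q = abs(q_subgroup - q_complement)
```

Here $Q_S = 500/1100 \approx 0.45$ and $Q_{S^\mathsf{C}} \approx -0.45$, so $\varphi_Q(S) \approx 0.91$. Since $Q$ lies between $-1$ and $1$, the target deviation component can be at most $2$.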

<b>The Subgroup Size Component</b><br>
To represent the subgroup size, we take the entropy function as described in section $3.1$ of the referenced paper. The function conceptually rewards $50/50$ splits between subgroup and complement, while punishing subgroups that are either (relatively) small or cover the vast majority of the dataset. With $n = |S|$, $N = |\Omega|$ and $n^C = N - n$:

$\varphi_{ef}(S) = - \frac{n}{N}\lg\left(\frac{n}{N}\right) - \frac{n^C}{N}\lg\left(\frac{n^C}{N}\right)$

In [9]:
def entropy_function(df_matches):
    zero = np.finfo(np.double).tiny # dealing with divisions/logarithms of 0
    n = df_matches.shape[0] # size subgroup in original dataset
    N = df_action.shape[0] # size original dataset
    n_c = N - n # size of the complement of the subgroup in the original dataset
    if (n == 0):
        n = zero
    if (n_c == 0): #to-do if (n_c <= 0)
        n_c = zero
    return ((-(n/N) * math.log2(n/N)) - ((n_c/N) * math.log2(n_c/N)))

Combining these two components yields an association model class quality measure known as <i>Yule's Quality Measure</i>: $\varphi_{Yule}(S) = \varphi_{Q}(S) \cdot \varphi_{ef}(S)$, i.e. the product of the target deviation component and the subgroup size component.

In [10]:
def phiYule(df_matches):
    return target_deviation(df_matches) * entropy_function(df_matches)

The multiplication ensures that a subgroup only scores well if it scores well on both components.
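As a worked example, the product can be recomputed from plain numbers. The sketch below takes the target deviation, subgroup size and dataset size that this notebook reports for `os_name = "iOS"`; the function names `entropy_size` and `phi_yule` are illustrative, and skipping empty parts (instead of substituting a tiny constant) is a simplification of this sketch:

```python
import math


def entropy_size(n, N):
    """Binary entropy of the split between a subgroup of size n and its complement."""
    total = 0.0
    for part in (n, N - n):
        if part > 0:  # the limit of p * log2(p) for p -> 0 is 0
            total -= (part / N) * math.log2(part / N)
    return total


def phi_yule(phi_q, n, N):
    """Product of the target deviation and the subgroup size component."""
    return phi_q * entropy_size(n, N)


# Values reported in this notebook for os_name = "iOS":
# phi_Q = 0.5818, |S| = 355 and N = 899.
size_component = entropy_size(355, 899)
score_ios = phi_yule(0.5817904444906733, 355, 899)
```

The size component is roughly $0.97$ for this subgroup, so the score is roughly $0.56$, in line with the quality that `beam_search` reports for it.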

The next function is responsible for generating new candidate patterns for level $n+1$ by refining patterns from the level-$n$ beam (denoted by <i>current_beam</i>). Conceptually, a pattern is refined into many new candidates by taking its conjunction with each possible single condition on a single attribute. In this implementation, the candidates are formed by merging pairs of patterns from the current beam and keeping only the combinations that consist of exactly $d$ conditions. The function returns a list with all the candidates. The <i>beam search</i> algorithm should evaluate all level-$(n+1)$ candidates with $\varphi$ and store the $w$ best as the new beam. In addition, it should update the list of the $q$ best-performing subgroups if there are new candidates which surpass the current top-$q$ in terms of the quality measure $\varphi$.

In [11]:
def refine(current_beam, d):
    w = len(current_beam)
    result= []
    for i in range(0, w):
        for j in range(0, w):
            if (i != j):
                len_i = len(current_beam[i])
                len_j = len(current_beam[j])
                min_len = min(len_i, len_j)
                new = []
                for t in range(min_len): # visit every position, not only the first and the last
                    if(current_beam[i][t] not in new):
                        new.append(current_beam[i][t])
                    if(current_beam[j][t] not in new):
                        new.append(current_beam[j][t])
                if ((not alreadyInList(new, current_beam)) & (len(new) == d)):
                    result.append(new)
    return result
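The function above builds candidates by merging pairs of beam patterns. For comparison, here is a minimal sketch of the textbook refinement, in which every beam pattern is extended with every single condition. The name `refine_with_conditions`, the rule of one condition per attribute, and the restriction to hashable (string) values are assumptions of this sketch:

```python
def refine_with_conditions(beam, single_conditions, depth):
    """Extend each beam pattern by one condition on an attribute it does not use yet."""
    candidates, seen = [], set()
    for pattern in beam:
        used_attributes = {attribute for attribute, _ in pattern}
        for condition in single_conditions:
            if condition[0] in used_attributes:
                continue  # assumption: at most one condition per attribute
            candidate = pattern + [condition]
            key = frozenset(candidate)  # the same conditions in another order count once
            if len(candidate) == depth and key not in seen:
                seen.add(key)
                candidates.append(candidate)
    return candidates


beam = [[("os_name", "iOS")], [("geo_country", "DE")]]
conditions = [("os_name", "iOS"), ("geo_country", "DE"), ("dvce_type", "Mobile")]
level_2 = refine_with_conditions(beam, conditions, depth=2)
```

The main difference is that this variant can add conditions that did not make it into the beam themselves, such as `dvce_type = "Mobile"` here, at the cost of evaluating more candidates per level.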

The next step is the implementation of the `beam_search` algorithm. There are a few parameters which influence the outcome and which can be changed to suit a particular experiment.

In [12]:
d = 2 # search depth
w = 10 # search width (i.e. beam width)
q = 5 # size of the set of best subgroup results (i.e. top q subgroups)
y = 10 # number of bins into which numeric descriptors are dynamically discretized
quality_measure = phiYule # choose between phiYule (Yule's Quality Measure), ...

In [13]:
def beam_search(d, w, q, quality_measure):
    resultSet = priority_queue(q)
    beam = priority_queue(w)
    # first level
    all_paterns = all_paterns_one_condition(y)
    print("level: "+str(1))
    for desc in all_paterns:
        subset_desc = create_dataframe([desc])
        quality = quality_measure(subset_desc)
        if (constraints(subset_desc)):
            beam.push([desc], quality)
            resultSet.push([desc], quality)
    for level in range(2, d + 1):
        print("level: "+str(level))
        current_beam = beam.get_items()
        new_candidates = refine(current_beam, level)
        for c in new_candidates:
            subset_desc = create_dataframe(c)
            quality = quality_measure(subset_desc)
            if (constraints(subset_desc)):
                beam.push(c, quality)
                resultSet.push(c, quality)
    return resultSet
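The overall control flow can also be shown on a toy problem without any dataframes. The sketch below is a generic level-wise beam search in which the beam is rebuilt from scratch at every level; all names (`beam_search_sketch`, `satisfies`, `weights`) are illustrative, conditions are assumed to be hashable and sortable, and it is not a drop-in replacement for the function above:

```python
import heapq


def beam_search_sketch(conditions, quality, satisfies, depth, width, top_q):
    """Level-wise beam search over conjunctions of hashable, sortable conditions."""
    results = []  # (quality, pattern) of every admissible pattern evaluated
    beam = [()]   # start from the empty description
    for _ in range(depth):
        scored, seen = [], set()
        for pattern in beam:
            for condition in conditions:
                if condition in pattern:
                    continue
                candidate = tuple(sorted(pattern + (condition,)))
                if candidate in seen or not satisfies(candidate):
                    continue
                seen.add(candidate)
                scored.append((quality(candidate), candidate))
        # Only the `width` best candidates of this level form the next beam.
        beam = [pattern for _, pattern in heapq.nlargest(width, scored)]
        results.extend(scored)
    return heapq.nlargest(top_q, results)


weights = {"a": 4, "b": 2, "c": 1}  # toy quality: the sum of the condition weights
top = beam_search_sketch(
    conditions=list(weights),
    quality=lambda pattern: sum(weights[c] for c in pattern),
    satisfies=lambda pattern: True,
    depth=2, width=2, top_q=3,
)
```

With a beam width of 2, the pattern `("c",)` is dropped after the first level, so `("b", "c")` is only reached through `("b",)`.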

In [14]:
result = beam_search(d = d, w = w, q = q, quality_measure = quality_measure)
sorted_result = priority_queue.heap_sort(result)
sorted_result

level: 1
level: 2


[(0.34039077670350143, [('geo_country', 'DE')]),
 (0.3647984213664936, [('browser_language', 'latin_lan'), ('os_name', 'iOS')]),
 (0.47027223676731572,
  [('browser_language',
    ['greek', 'latin_lan', 'asian', 'cyrillic', 'herbrew']),
   ('os_name', 'iOS')]),
 (0.50998200142197703, [('geo_country', 'DE'), ('os_name', 'iOS')]),
 (0.56310255294633826, [('os_name', 'iOS')])]

The implementation given above allows the end user to:
* manually set the beam width $w$ and search depth $d$
* manually choose the number of bins $y$ into which numeric descriptors are dynamically discretized
* easily swap out the association model class on these two specific targets for another model class (to be coded by the end user) with any number of targets of the user's choosing, by setting the `quality_measure` parameter to the quality measure of the model class one wants to use


### Some code used to fill in the table (see 2b)

In [15]:
def yule_Q_multiple_return(df_matches):
    zero = np.finfo(np.double).tiny
    
    n_1 = df_matches.loc[(df_matches.action == 'view') & (df_matches.condition == '1-Control')].count()[0]
    n_2 = df_matches.loc[(df_matches.action == 'clic') & (df_matches.condition == '1-Control')].count()[0]
    n_3 = df_matches.loc[(df_matches.action == 'view') & (df_matches.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    n_4 = df_matches.loc[(df_matches.action == 'clic') & (df_matches.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    Q_s = (n_1*n_4 - n_2*n_3)/(n_1*n_4 + n_2*n_3)
    
    df_complement = df_matches.merge(df_action, indicator=True, how='outer')
    df_complement = df_complement[df_complement['_merge'] == 'right_only']
    df_complement.drop(['_merge'], axis=1, inplace = True)
    
    n_c_1 = df_complement.loc[(df_complement.action == 'view') & (df_complement.condition == '1-Control')].count()[0]
    n_c_2 = df_complement.loc[(df_complement.action == 'clic') & (df_complement.condition == '1-Control')].count()[0]
    n_c_3 = df_complement.loc[(df_complement.action == 'view') & (df_complement.condition == '2-Buttony-Conversion-Buttons')].count()[0]
    n_c_4 = df_complement.loc[(df_complement.action == 'clic') & (df_complement.condition == '2-Buttony-Conversion-Buttons')].count()[0]

    Q_s_c = (n_c_1*n_c_4 - n_c_2*n_c_3)/(n_c_1*n_c_4 + n_c_2*n_c_3)
    
    if (math.isnan(Q_s_c)):
        Q_s_c = zero
    if (math.isnan(Q_s)):
        Q_s = zero

    phi_Q_S = abs(Q_s - Q_s_c)
    
    return {"phi_Q_S": phi_Q_S,"Q_s": Q_s, "Q_s_c": Q_s_c}

In [16]:
S_1 = sorted_result[0][1]
df_S1 = create_dataframe(S_1)
size_S1 = df_S1.shape[0]
size_S1

155

In [17]:
yule_S1 = yule_Q_multiple_return(df_S1)
yule_S1

{'Q_s': -0.12408759124087591,
 'Q_s_c': 0.38916994566235713,
 'phi_Q_S': 0.51325753690323306}

In [18]:
S_2 = sorted_result[1][1]
df_S2 = create_dataframe(S_2)
size_S2 = df_S2.shape[0]
size_S2

61

In [19]:
yule_S2 = yule_Q_multiple_return(df_S2)
yule_S2

{'Q_s': -0.69465648854961837,
 'Q_s_c': 0.32473061061593722,
 'phi_Q_S': 1.0193870991655556}

In [20]:
S_3 = sorted_result[2][1]
df_S3 = create_dataframe(S_3)
size_S3 = df_S3.shape[0]
size_S3

94

In [21]:
yule_S3 = yule_Q_multiple_return(df_S3)
yule_S3

{'Q_s': -0.6216216216216216,
 'Q_s_c': 0.3514468430908384,
 'phi_Q_S': 0.97306846471246}

In [22]:
S_4 = sorted_result[3][1]
df_S4 = create_dataframe(S_4)
size_S4 = df_S4.shape[0]
size_S4

91

In [23]:
yule_S4 = yule_Q_multiple_return(df_S4)
yule_S4

{'Q_s': -0.6908212560386473,
 'Q_s_c': 0.38767876787678768,
 'phi_Q_S': 1.078500023915435}

In [24]:
S_5 = sorted_result[4][1]
df_S5 = create_dataframe(S_5)
size_S5 = df_S5.shape[0]
size_S5

355

In [25]:
yule_S5 = yule_Q_multiple_return(df_S5)
yule_S5

{'Q_s': -0.08771929824561403,
 'Q_s_c': 0.49407114624505927,
 'phi_Q_S': 0.5817904444906733}

### 2b - Found Subgroups

The <i>beam search</i> algorithm is executed with parameters $d=2$, $w=10$ and $q=5$. We choose $d=2$, i.e. a conjunction of at most $2$ conditions on single descriptors, for the sake of interpretability. When $d>2$, the results become more complex and therefore give less information on which a domain expert can act. With $q=5$, the output of <i>beam search</i> can easily be compared to the top five subgroups found in the A&B Testing paper. We choose $w=10$, so that the beam width is twice the number $q$ of best subgroups. Larger values for $w$ were tested, but they hardly changed the result. The biggest influence on the results was the maximum subgroup size that is allowed. This constraint is part of the <i>constraints</i> function. Without an upper bound, the top $q$ results contained subgroups representing over $85$% of the dataset, with very small differences between their values. As one can see, the <i>beam search</i> is executed with the same parameters as in the referenced A/B Testing paper. This also makes it much easier to compare the results; however, one must note that the dataset `df_action`, preprocessed in question 1, contains $899$ records, while the dataset used in the referenced A/B Testing paper contains over $3500$ records.

In [26]:
target_deviation(df_action)

0.26575602240066371

The dataset $\Omega$ that is preprocessed in question 1 has a total of 899 records. Since the complement of $\Omega$ is empty, the value returned by `target_deviation(df_action)` is Yule's Q on the whole dataset (in absolute value): $0.27$. In other words, the result of the traditional A/B test tells us that variant B, the more buttony variation, generates more clicks than the less buttony control version. The new variation is therefore slightly better, although it can be argued whether the difference is significant. In A&B Testing, we can mine deeper into the data and find specific subgroups which prefer version A or version B. This allows us to present each found subgroup with its own preferred version.

The top-five subgroups found are presented in the table below, in order of descending quality. Let $S_1$ denote the best subgroup and $S_5$ the subgroup with the lowest $\varphi_{Yule}(S)$ value. Besides the value of the quality measure $\varphi_{Yule}$, the table lists the value of Yule's Q on both the subgroup and its complement, as well as the subgroup size. Note that `sorted_result` is ordered by ascending quality, so the variables `S_1` to `S_5` in the code cells above are numbered in the reverse order.

<table>
    <tr>
    <th>Subgroup definition</th>
        <th>$\varphi_{Yule}(S)$</th>
        <th>$Q_S$</th>
        <th>$Q_{S^C}$</th>
        <th>$|S|$</th>
    </tr>
    <tr>
        <td>os_name = "iOS"</td>
        <td>$0.56$</td>
        <td>$-0.09$</td>
        <td>$0.49$</td>
        <td>$355$</td>
    </tr>
    
    <tr>
        <td>geo_country = "DE" $\wedge$ os_name = "iOS"</td>
        <td>$0.51$</td>
        <td>$-0.69$</td>
        <td>$0.39$</td>
        <td>$91$</td>
    </tr>
    
    <tr>
        <td>browser_language $\in$ {"greek", "latin_lan", "asian", "cyrillic", "herbrew"} $\wedge$ os_name = "iOS"</td>
        <td>$0.47$</td>
        <td>$-0.62$</td>
        <td>$0.35$</td>
        <td>$94$</td>
    </tr>
    
    <tr>
        <td>browser_language = "latin_lan" $\wedge$ os_name = "iOS"</td>
        <td>$0.36$</td>
        <td>$-0.69$</td>
        <td>$0.32$</td>
        <td>$61$</td>
    </tr>
    
    <tr>
        <td>geo_country = "DE"</td>
        <td>$0.34$</td>
        <td>$-0.12$</td>
        <td>$0.39$</td>
        <td>$155$</td>
    </tr>

</table>

Before we explain the top-five subgroups found, we first make clear what a positive or negative value for Yule's Q means. First, note that version A denotes the control (current) version of the website, while version B denotes the new version with a more buttony styled button. A positive value for Yule's Q implies a positive association between the two targets. In other words, it tells us that people presented with web page variant B (the more buttony variant) click the button more often than people presented with web page variant A. Conversely, when Q is negative, people presented with variant A click the button more often than people presented with variant B.

The best subgroup found, denoted as $S_1$, consists of people who visit the web page from an iOS device. It is also the largest of the five subgroups and represents almost $40$% of the total dataset. One can see that the Q-value on $S_1$ is slightly negative. It is therefore hard to say whether iOS users prefer the old version A, but it is clear that they do not click more often when they are presented with the more buttony version B. The buttony button does not follow the characteristic flat iOS design style, which may explain why iOS users do not click it more often than the more modern button of version A. We also see that Yule's Q on the complement of $S_1$ is positive. So Android, Windows and 'other' users click the buttony button more often than the modern one. For Windows users a buttony button is more familiar, because these kinds of buttons are widely used in desktop applications. There are also many Android phones that run an old version of the Android OS. These versions do not have the new flat design, so for these users variant B might also be more familiar and more in the style of their operating system. Because subgroup $S_1$ has a substantial size, this is a subgroup on which the domain expert can act. Presenting iOS users with the more modern version of the button, and Android, Windows and 'other' users with the more buttony version, is expected to increase the overall revenue.

The second, third and fourth subgroups are specializations of $S_1$. One can see that iOS users combined with a second condition in the description language give a substantially lower (i.e. more strongly negative) value for Yule's Q. Let's start with the second-best subgroup, $S_2$, which represents iOS users from Germany. Compared to Yule's Q on $S_1$, there is a large difference. Apparently, iOS users from Germany click the button on version B less often than on version A. The positive Yule's Q on the complement tells us that the remaining users, i.e. everyone who is not a German iOS user, click the button on version B more often than on version A.

The third subgroup requires more attention, because its description language contains many values for `browser_language`. It is therefore easier to first look at the complement of this subgroup. The complement consists of the users who have either not set their browser language to one of these values <i>or</i> who visited the website from an Android, Windows or 'other' device. To identify the remaining browser languages among iOS users, we execute the code below.

The following function returns the complement dataframe of the subgroup given as a parameter; its body is taken from the <i>target_deviation</i> function. The cell after it restricts the complement of $S_3$ to iOS users, so that only records whose browser language is not one of the values in $S_3$ remain, and outputs the unique values in the column `browser_language`:

In [27]:
def create_complement(df_matches):
    df_complement = df_matches.merge(df_action, indicator=True, how='outer')
    df_complement = df_complement[df_complement['_merge'] == 'right_only']
    df_complement.drop(['_merge'], axis=1, inplace = True)
    return df_complement

In [28]:
S_3 = sorted_result[2][1]
df_S3 = create_dataframe(S_3)
df_S3_c = create_complement(df_S3)
df_S3_c = df_S3_c.loc[(df_S3_c.os_name == 'iOS')]
df_S3_c['browser_language'].unique()

array(['english'], dtype=object)

The only browser language in this dataframe is English. Yule's Q on $S_3^C$ is positive. In other words, users who are either non-iOS users <i>or</i> have set their browser language to English click the more buttony button more often than the more modern button. Just as $S_2$, $S_3$ is a subgroup that clicks the button in version A more often than the button in version B.
Recall that the dataset contains 899 records. This means that $S_2$ and $S_3$ each represent about $10$% of the dataset, which is a substantial percentage. Therefore, it is useful for the domain expert to give German iOS users the more modern version, as well as iOS users who have set their browser language to something other than English.
The fourth-ranked subgroup $S_4$ represents iOS users whose `browser_language` is `latin_lan`. This is the smallest of the five subgroups, so we have to be careful about what we say about it. However, it is clear that iOS users with this browser language do not click version B more often than the more modern button of version A.

The fifth-ranked subgroup $S_5$ represents, just as $S_1$, a substantial part of the total dataset. Its description language is very clear: German visitors do not click more often on the new buttony button than German visitors who are presented with the control version. There is quite a difference between the subgroup and its complement. The complement of this subgroup, $S^C_5$, clicks more often on the buttony button of version B than on the modern button of version A. Because of the size of $S_5$, this is a potential subgroup on which StudyPortals can focus. Visitors who are not from Germany click more often on the buttony button, so showing them version B is expected to increase the overall revenue.

### Comparison between found subgroups
When we compare the found subgroups to the subgroups of the referenced A/B Testing paper, there are a few notable differences. First, we see that iOS dominates the subgroups that we found above, with four of the five subgroups containing iOS as `os_name`. The referenced paper contains only one subgroup with iOS users. Also, the results and conclusion about this subgroup are very different from what we concluded above. The referenced paper concludes that visitors who run the iOS operating system strongly prefer version B. In the paper this is noted as remarkable, because the buttons of version B do not conform to Apple's design standards. From our subgroups we conclude the opposite: iOS users strongly prefer version A. This seems to make more sense, because this button is more in line with the modern, flat look of the iOS operating system. Because iOS users dominate the top-five subgroups of our result, we can say with confidence that iOS users prefer version A over B. Users who have set their browser language to English or who are non-iOS users click the buttony button more often, which is also the conclusion in the A/B Testing paper. The subgroups $S_2$, $S_4$ and $S_5$ identified by our <i>beam search</i> implementation do not appear in the top five subgroups of the A/B Testing paper. This could be due to a difference in the <i>beam search</i> implementation, but also note that the dataset used in the paper is almost four times larger than the one we used. This of course has a big influence on the output. The subgroups that differ from those found in the paper are all explained above and need no further explanation in this paragraph.

### Conclusion
The top five subgroups found with <i>beam search</i> are listed in the table above. The largest schism lies between people who visit the website from an iOS device, who strongly prefer version A, and people who visit the website from an Android, Windows or 'other' device. The latter group strongly prefers version B, the more buttony button version. Because of the sizes of these subgroups, the overall revenue is expected to increase substantially when iOS users are presented version A and Android, Windows and 'other' users are presented version B.

<font color = "red"><b>This assignment was made in Jupyter notebooks. If you (the end user/reviewer/evaluator) want to change the parameters of our implementation of the beam-search algorithm (e.g. the beam width $w$ and/or the search depth $d$), you can download the notebook file by clicking the following link: <a href="https://www.dropbox.com/sh/y81uxp5oahfzy76/AADHMc0Ofm5QWnPMrzX7dTKLa?dl=0">Beam-Search implementation Notebook and pre-processed data</a>. This particular notebook can be found in the file "Beam_search-V2.ipynb". The link will remain live until 01 February 2018.</b></font>

# 3 Beyond A&B Testing

In Question 2 we explored the straightforward solution to StudyPortals’ inquiry; however, there is more to explore. In Question 3, we will design an EMM instance that extracts more information out of the StudyPortals situation. We thus shift from implementing existing ideas to contributing new ones to this field of research, possibly inspired by existing ideas.

## 3a. The design of a new EMM instance

### Multivariate testing

Let's look at what would have happened if StudyPortals had designed four different variations of their webpage. For example, they might not only be interested in the two different button designs that were mentioned, but they might also be interested in knowing the effect of a new logo, a different font, or another colour, at the same time as they are testing their different buttons. As such, we would not be performing a simple A/B test, but instead an A/B/C/D test (often called an A/B/n test, and loosely referred to here as a multivariate test).

With regular A/B testing, we had targets $t_1$, the binary column representing whether the page visitor merely viewed or also clicked, and $t_2$, the binary column representing whether the visitor was presented version A or B of the buttons. Therefore, the natural choice of EMM instance was the association model class, which allows us to determine the association between two nominal targets.

Now, with A/B/C/D testing, it still makes sense to use the association model, because we are still using two nominal targets. In this case, our targets will be $x$, the binary column representing whether the page visitor merely viewed or also clicked, and $y$, the integer-coded column representing whether the visitor was presented version A, B, C, or D of the webpage.

Even though we will be using integers (1, 2, 3, 4 for version A, B, C, D respectively) to code the distinct values of $y$, their values will be treated as unordered (nominal).
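As a minimal sketch of this coding, the integer codes can be stored as an unordered categorical in pandas, so that no later step accidentally relies on $1 < 2 < 3 < 4$. The series `y` and its values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical integer codes for the four versions (1=A, 2=B, 3=C, 4=D).
y = pd.Series([1, 3, 2, 4, 1, 2], name="condition")

# Declare the codes as an unordered (nominal) categorical.
nominal_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=False)
y_nominal = y.astype(nominal_dtype)

# Optionally replace the codes by readable version labels.
y_nominal = y_nominal.cat.rename_categories({1: "A", 2: "B", 3: "C", 4: "D"})

print(y_nominal.cat.ordered)  # False: the versions carry no order
print(y_nominal.tolist())
```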

### Quality measure

Having fixed the model class, we need to define an appropriate quality measure. To ensure the discovery of subgroups that represent substantial effects within the datasets, a common approach is to craft a quality measure by multiplying two components: one reflecting the target deviation, and one reflecting the subgroup size.

|      | View | Click |
|------|------|-------|
|   A  |$V_A$ | $C_A$ |
|   B  |$V_B$ | $C_B$ |
|   C  |$V_C$ | $C_C$ |
|   D  |$V_D$ | $C_D$ |

<i><b>Table 1: Target cross table</b></i><br>
Here $V_X$ and $C_X$ denote the number of views and the number of clicks, respectively, found for class $X$ within a certain group of the dataset (described by one or more attribute values).
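A table of this shape could, for instance, be built with `pandas.crosstab`. The sketch below uses an invented toy frame; the column names `condition` and `action` and the value `clic` follow `df_action`, while the version labels `A` to `D` and the value `view` are assumptions made for the example.

```python
import pandas as pd

# Toy records, invented for illustration only.
toy = pd.DataFrame({
    "condition": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "action":    ["view", "clic", "view", "view", "clic",
                  "clic", "clic", "view", "view", "view"],
})

def target_cross_table(group):
    """Count views and clicks per version (rows: version, columns: action)."""
    table = pd.crosstab(group["condition"], group["action"])
    # Make sure both columns exist, even if a group has no clicks or no views.
    return table.reindex(columns=["view", "clic"], fill_value=0)

cross = target_cross_table(toy)
print(cross)
```

Each row of `cross` then holds the pair $(V_X, C_X)$ for one version $X$.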

### Target deviation component

For the quality measure component representing the target deviation, we use the target contingency table, shown in Table 1. Now, since this is no longer a 2x2 matrix, we cannot use Yule's Quality Measure here. Instead, we will have to define another appropriate quality measure.

An interesting (and logically sound) target deviation statistic for a subgroup is the maximum deviation found across the classes. One can imagine that we would like to know the highest deviation a subgroup shows from the original dataset across all the classes (versions) present in the dataset. Let us specify this.

First we compute the average click rate for each of the classes in the original dataset:

Let $ClickTotal_X$ denote the number of records in the original dataset of class $X$ (i.e. version "$X$") in which a click was registered and let, similarly, $ViewTotal_X$ denote the number of records in the original dataset of class $X$ for which a view was registered.

$avgTotal_A = ClickTotal_{A} \,/\, (ViewTotal_{A} + ClickTotal_{A})$<br>
$avgTotal_B = ClickTotal_{B} \,/\, (ViewTotal_{B} + ClickTotal_{B})$<br>
$avgTotal_C = ClickTotal_{C} \,/\, (ViewTotal_{C} + ClickTotal_{C})$<br>
$avgTotal_D = ClickTotal_{D} \,/\, (ViewTotal_{D} + ClickTotal_{D})$

For a particular subgroup we then calculate<br>
$a_{group} = C_A \,/\, (V_A + C_A)$ <br>
$b_{group} = C_B \,/\, (V_B + C_B)$ <br>
$c_{group} = C_C \,/\, (V_C + C_C)$ <br>
$d_{group} = C_D \,/\, (V_D + C_D)$ <br>

And the deviation now becomes:

$devGroup_A = |\,a_{group} - avgTotal_A\,|$<br>
$devGroup_B = |\,b_{group} - avgTotal_B\,|$<br>
$devGroup_C = |\,c_{group} - avgTotal_C\,|$<br>
$devGroup_D = |\,d_{group} - avgTotal_D\,|$<br>

Each of these is a value between 0 and 1 that captures the deviation for a particular group (a subset of the original dataset, filtered on certain attribute value(s)) for a particular class (i.e. "version" in our problem context).

The final target deviation component for this particular group becomes:

$devGroup = \max \{devGroup_A, \,devGroup_B, \,devGroup_C, \,devGroup_D\}$
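The computation above can be sketched in a few lines of pandas. This is an illustration under assumptions, not the implementation used later on: the toy data, the version labels `A` to `D`, the value `view` and the helper names `click_rates` and `max_deviation` are all hypothetical.

```python
import pandas as pd

# Toy dataset with four versions, invented for illustration only.
data = pd.DataFrame({
    "condition": list("AAAABBBBCCCCDDDD"),
    "os_name": ["iOS", "iOS", "Android", "Android",
                "iOS", "iOS", "Android", "Android",
                "iOS", "Android", "Android", "Android",
                "iOS", "Android", "Android", "Android"],
    "action": ["clic", "clic", "view", "view",
               "view", "clic", "clic", "view",
               "view", "view", "clic", "view",
               "clic", "view", "view", "view"],
})

def click_rates(records):
    """Click rate C_X / (V_X + C_X) per version X."""
    return records["action"].eq("clic").groupby(records["condition"]).mean()

def max_deviation(group, dataset):
    """Return (devGroup, version) of a group relative to the full dataset."""
    baseline = click_rates(dataset)                  # avgTotal_X
    rates = click_rates(group)                       # a_group, b_group, ...
    # Versions missing from the group become NaN and are skipped by max/idxmax.
    deviations = (rates.reindex(baseline.index) - baseline).abs()
    return deviations.max(), deviations.idxmax()

ios = data[data["os_name"] == "iOS"]
dev_group, version = max_deviation(ios, data)
print(dev_group, version)
```

Returning the version next to the deviation keeps track of the class for which the maximum was found, which is needed for the subgroup-size component.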

This is a well-suited target deviation component considering the (unbounded) number of classes one should be able to expose to this EMM instance. Imagine a group that shows a large deviation for one particular class: taking the average of the deviations would dilute this signal, since it also takes into account the deviations for the other classes, and this would affect the relative ranking of the subgroups in the next step of this elaboration. Using the maximum allows one to compare groups on the deviation they show for a single class, instead of constantly comparing averages or another statistical estimator. We want to rate the deviations of subgroups, but considering the unbounded number of classes in this EMM instance, it makes sense to rate each subgroup on its deviation for <b>one</b> class (the maximum). The resulting list may contain subgroups rated on different classes, but this does not matter for the results: if you want a list of the top-5 subgroups found in the dataset (with certain constraints), it does not matter for which of the classes a subgroup showed a deviation from the original dataset. Eventually you do want to know for which class a top subgroup shows the deviation, but this is an implementation detail which we will tackle later on. With Yule's Q one could simply look at whether a value is positive or negative to know to which class it belongs; this is no longer possible when taking a statistical measure (e.g. the average) over the deviations of all classes, which is why we choose this approach (the maximum).

This choice also helps with the subgroup-size component of the quality measure: one can simply pass the model class which had the highest deviation to the function responsible for the subgroup-size component. This immediately brings us to the subgroup-size component.

### Subgroup-size component

The target deviation component is now defined as $devGroup = \max \{devGroup_A, \,devGroup_B, \,devGroup_C, \,devGroup_D\}$. If the maximum deviation is found for class X, we can simply use the same entropy function as was used in question 2. A brief recap:

To represent the subgroup size, we take the entropy function as described in section $3.1$ of the referenced paper. The function conceptually rewards $50/50$ splits between subgroup and complements, while punishing subgroups that are either (relatively) small or cover the vast majority of the dataset.

$\varphi_{ef}(D) = - \frac{n}{N}\lg\left(\frac{n}{N}\right) - \frac{n^C}{N}\lg\left(\frac{n^C}{N}\right)$

In this case one would filter the group (already a subset of the original dataset of $N$ records) on the model class ($condition$) $X$. The size (number of records) of this subgroup for the model class which showed the maximum deviation is then the $n$ in the entropy function.
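A minimal sketch of the entropy function follows. It assumes that $\lg$ is the base-2 logarithm and that the complement size is $n^C = N - n$; the function name `entropy_size` is hypothetical.

```python
import math

def entropy_size(n, N):
    """phi_ef = -(n/N) lg(n/N) - (n^C/N) lg(n^C/N), with lg taken as log base 2."""
    n_c = N - n                      # assumption: the complement holds all other records
    total = 0.0
    for part in (n, n_c):
        if part > 0:                 # convention: 0 * lg(0) = 0
            p = part / N
            total -= p * math.log2(p)
    return total

print(entropy_size(50, 100))   # 50/50 split, the maximum of the function
print(entropy_size(10, 100))   # small subgroup, lower score
print(entropy_size(90, 100))   # near-complete coverage, same score as its mirror
```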

### Combining the target-deviation and subgroup-size component: Max-dev measure

The last step is to combine the two components into one final quality measure which can be used to rate the subgroups based on the deviations they show for a certain model class.

$\varphi_{maxdev}(D) = \varphi_{ef}(D) \cdot devGroup(D)$

We multiply the two components to have a quality measure which captures the maximum target deviation found across the classes (i.e. versions), while having a dimension/component in the quality measure which reflects the group size, considering the fact that one can easily achieve huge deviations in target behaviour in (very) small sub-groups.
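To illustrate why the size component matters, the sketch below scores two made-up candidates: a tiny subgroup with a huge deviation and a large subgroup with a moderate one. All numbers and names are hypothetical, $\lg$ is assumed to be the base-2 logarithm and $n^C = N - n$.

```python
import math

def phi_ef(n, N):
    """Entropy-based size component (lg assumed base 2, n^C assumed N - n)."""
    return -sum(p * math.log2(p) for p in (n / N, (N - n) / N) if p > 0)

def phi_maxdev(n, N, dev_group):
    """Max-dev quality: size component times maximum target deviation."""
    return phi_ef(n, N) * dev_group

N = 1000
# Hypothetical candidates: (subgroup size n, maximum deviation devGroup).
candidates = {"tiny": (5, 0.9), "balanced": (400, 0.3)}

scores = {name: phi_maxdev(n, N, dev) for name, (n, dev) in candidates.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Despite its much larger deviation, the tiny candidate ends up below the balanced one, which is the behaviour the multiplication is meant to achieve.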

### Revision of the given answers (3a) and the proposed ideas which will be implemented

Note that this quality measure focuses on the deviation found for one particular class. The quality measure does not capture how the subgroup deviates from the norm across all classes. It checks each class separately for the subgroup and reports the maximum deviation it found, which makes sense in the problem context of "A&B Testing", since you want to be able to present a certain group the one version (i.e. class) of the website for which the subgroup showed a deviation from the norm.

<i>The choice is made to design an EMM instance for the situation in which subgroups can be researched when there are multiple (i.e. more than 2) versions of a website/application available. In particular, the situation with 4 versions of the StudyPortals website is picked as an example, and will be examined in more detail in the next section (3b). This examination of the model class and related quality measure will be done by artificially modifying the original dataset, since these versions are, of course, not present in the original dataset.</i>

This EMM instance can be used to test different versions of a website/application, but it is not suitable (just like the EMM instance in the referenced paper) for evaluating deviations among subgroups for several components which are related to each other. There needs to be one particular component which differs amongst the versions (e.g. the styling of a button). This model class is not suitable for testing the performance (i.e. target behaviour) of different types of components of a webpage (e.g. font, button styling, colours, response times, contrast). Another model class and related quality measure would need to be designed for such an experiment.

So the EMM instance is able to extract additional information when there are more than 2 versions available. For a particular group (described by one or more attribute values; the description language) and each version used in the experiment, the quality measure assesses the target deviation relative to the target statistics of that version in the original (complete) dataset. It then picks the class (version) which has the highest target deviation and evaluates the subgroup-size component of this subset. The target deviation component and subgroup-size component are then combined into the final quality measure. This quality measure can then be implemented in algorithms such as beam search. The implementation has to make sure that the end user is able to extract the actual class (version) for which a subgroup showed an interesting deviation in the target component, corrected by the subgroup-size component. This quality measure is then useful for experiments in which there are $>2$ versions of a particular website/application, and one wants to find subgroups which show unusual target behaviour for <b>one</b> of these versions. This information can, for example, be used to have dynamic webpages and/or components, based on the subgroup to which a visitor belongs.

## 3b. Implementation of the model class & related quality-measure into the beam-search algorithm

Before we can actually start with the implementation of the new quality measure to find interesting subgroups with the new EMM instance, we have to modify the existing dataset. In the original dataset there were two conditions (versions) present: a control version (<i>1-Control</i>) and the experimental version (<i>2-Buttony-Conversion-Buttons</i>). The EMM instance which has been designed in section $3a$ is particularly interesting when there are $>2$ versions evaluated in an experiment. Unfortunately, there is no data available from such an experiment (done by StudyPortals), which is why the choice is made (as proposed in the assignment) to alter the original dataset, for the sake of research (i.e. the experimental evaluation of the newly designed EMM instance).