# AB Testing Notes

A general methodology used to test out a new product or feature

- Take two sets of users
- One set is shown an existing product
- Second set is given a treated version
- How do the customers respond differently? Determine which version is better based on some metric

Can you use AB tests for everything?

"AB testing is good for optimizing an existing product but not good for developing a new product based on an existing one"

Amazon did AB tests for personalized recommendations and found that they had an increase in revenue when given the personalized recommendations.

Google tested 41 different shades of blue

LinkedIn tested a ranking process where they checked whether it's better to show news articles or an encouragement to add more contacts on a user's "stream". (Use click-through rate as metric?)

Amazon determined that every 100 ms increase in page load time decreased sales by 1%

Need a consistent response from your control and experiment groups

### What can't you do with AB tests?

- Test out new experiences
    - Change aversion - Users refuse to participate in the test
    - Novelty effect - Too drastic of a change leads customers to "try out everything". Can't test out a specific treatment
- No baseline for comparison
    - Can't set up a control group if there is no baseline. 
- How much time do you need to have your users adapt to the new experience?
    - Need the plateaued experience to make a "robust decision". The metric being observed will be noisy in the beginning and only when it has stabilised can you check for any statistically significant changes to your metric.
- Long term effects are difficult to test
    - Difficult to measure changes in your metric over a long time period where other aspects of your product or users will change (can't attribute the change in your metric to the treatment)
- Can't test whether you're missing something in your product
    - No baseline, can't set up a control and treatment group because what do you change about the treatment group?

### Other techniques

- Logs of what users did on your website. Analyse them retrospectively or observationally to see if a hypothesis can be developed about what caused changes in their behaviour. This can then be used to design an experiment. 
- User experience research, focus groups, surveys, human evaluation
- A/B testing gives quantitative data, other techniques give qualitative data.
- A completely new product is difficult to test

In online A/B tests, you don't know much about your users. You're using online user data and so it's difficult to distinguish whether a user is a single person, internet cafe etc.

The goal is to determine whether a new feature is desirable. To do this, you need to design an experiment that can be **repeatable**.

### Online Case Study

Audacity

Creates online finance courses

**User flow/Customer funnel**

- Homepage visits
- Explore the site
- Create an account
- Completion

Listed in decreasing number of users.

**The hypothesis:**

Changing the "Start Now" button from orange to pink will **increase** how many students explore Audacity's courses

**Possible metrics**

- ~~**Total number of courses completed**~~
    - Will take too much time. Students may take months to complete a course
- ~~**How many users click on the "Start Now" button**~~
    - Assumes that users who progress through the top of the customer funnel will eventually lead to more users being passed through the rest of the customer funnel
    - In unequal control/treatment groups, the number of users in the group will affect the total number of clicks
- **CTR: $\frac{\text{Number of clicks}}{\text{Number of page views}}$**
    - Called Click-Through-Rate
    - Single users can click more than once and inflate the CTR
- **CTP: $\frac{\text{Unique visitors who click}}{\text{Unique visitors to the page}}$**
    - Called Click-Through-Probability
    - The better metric to use in this case.
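
A minimal sketch of how the two metrics can diverge, assuming a hypothetical event log of `(visitor_id, event_type)` pairs. The log format and the function name `ctr_and_ctp` are illustrative and not part of the case study:

```python
def ctr_and_ctp(events):
    """events: list of (visitor_id, event_type), event_type is "view" or "click"."""
    page_views = sum(1 for _, kind in events if kind == "view")
    clicks = sum(1 for _, kind in events if kind == "click")
    visitors = {v for v, kind in events if kind == "view"}
    clickers = {v for v, kind in events if kind == "click"}
    ctr = clicks / page_views                        # every click counts
    ctp = len(clickers & visitors) / len(visitors)   # each visitor counts at most once
    return ctr, ctp


# Hypothetical log: visitor "a" clicks three times, nobody else clicks.
events = [
    ("a", "view"), ("a", "click"), ("a", "click"), ("a", "click"),
    ("b", "view"),
    ("c", "view"), ("c", "view"),
    ("d", "view"),
]
ctr, ctp = ctr_and_ctp(events)
print(f"CTR = {ctr:.2f}, CTP = {ctp:.2f}")
```

In this made-up log the repeated clicks from one visitor give a CTR of 0.6 while the CTP is 0.25.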
    
**Updated metric**

Changing the "Start Now" button from orange to pink will **increase** the Click-Through-Probability of the button

**When do you use CTR vs. CTP?**

Generally:

- Use a rate when you want to measure **usability**
    - Users have a number of different places they can press, you use a rate to measure how often the users clicked a specific button
    - Will have to change the website to log every page view of the website and every click of a button
- Use a probability when you want to measure **total impact**
    - You don't want to count when users double clicked, reloaded etc when measuring a total effect (e.g. getting to the second level of a page)
    - Will have to change the website to match each page view with all of their "child clicks" to count at most one click per page view

<br>
<br>

___

<br>
<br>

# The statistics

**Which distribution?**

When producing the CTP, the sample proportion $p_0$ was computed to be 0.1 $(\frac{100}{1000})$. When using a different sample to compute the CTP, the sample proportion was instead computed to be 0.15.

Is 0.15 or 15% considered to be surprising? How do you know how variable your estimate is likely to be?

We compare the sample proportion computed to the **binomial** distribution where we model each click as a Bernoulli trial. Each unique visitor either clicks the button (success) or doesn't (failure).

**Variance**

We can use the standard error formula for a sample proportion $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ to estimate how variable the sample proportion $p_0$ should be as a result of sampling variability. Then we compare the second computed sample proportion (0.15) to the first relative to the variance to see if it is a surprising value or not. 

Either compute a CI or perform a hypothesis test
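
As a rough check of the numbers above, the sketch below computes the standard error and a 95% confidence interval for the first sample (100 clicks out of 1000 visitors) and asks whether 0.15 falls inside it. The 95% level (z = 1.96) is an assumption, and the check ignores the sampling variability of the second estimate:

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation CI for a sample proportion (z=1.96 assumes 95%)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


p_hat, se, (lower, upper) = proportion_ci(100, 1000)
second_estimate = 0.15
is_surprising = not (lower <= second_estimate <= upper)
print(f"p_hat={p_hat:.3f} SE={se:.4f} CI=({lower:.4f}, {upper:.4f})")
print("0.15 is surprising:", is_surprising)
```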

<img src="images/difference_of_two_proportions.png" width=600>

**Practical significance**

We have to decide how big of a change is practically significant (aka substantive) to warrant changing the existing system. Statistical significance of any arbitrary difference in proportion can be achieved with a big enough sample size, however a small difference may not be practically significant. 

- Each change may require an investment in resources and so a small change may not warrant the investment
- Online A/B tests have a smaller margin for practical significance
- We need to make sure for online A/B tests that the change is **repeatable**. We want a sample size big enough that the statistical significance bar is lower than the practical significance bar, which helps ensure repeatability

We will decide that a 2% change in CTP is practically significant

**Size vs. Power tradeoff**

The power of a hypothesis test is the probability that the test rejects the null hypothesis  $H_0$  when a specific alternative hypothesis  $H_1$  is true. The idea is that, given a practically significant effect size and a significance level, we want the hypothesis test to be able detect the effect (by rejecting the null hypothesis) at a high enough probability, which can be controlled by increasing the sample size.

$\alpha$ is the probability of making a type 1 error i.e. False Positive (reject the null hypothesis when it is True).

$\beta$ is the probability of making a type 2 error (failing to reject the null hypothesis when it is false). 

Statistical power is equal to $1-\beta$

When the CI captures the null hypothesis $H_0$, the test is statistically insignificant (recall that you create a CI around the point estimate $\hat{p}$ or $\hat{d} = \hat{p}_1 - \hat{p}_2$). When the CI is outside of both $H_0$ **and** the practical significance level $d_{\text{min}}$, then we can agree to launch the change. For cases in between where the CI is too wide or does not capture $H_0$ but does capture $d_{\text{min}}$, we have to use our best judgement.
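
The decision rule above can be sketched as follows. This is an illustration under stated assumptions: the difference is taken as experiment minus control, the margin of error uses a pooled standard error at the 95% level, the counts are made up, and the "significant decrease" branch is an addition for completeness that the notes do not discuss:

```python
import math


def diff_ci(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """CI for d_hat = p_exp - p_cont using a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    return d_hat, (d_hat - z * se_pool, d_hat + z * se_pool)


def decide(ci, d_min):
    lower, upper = ci
    if lower <= 0 <= upper:
        return "not statistically significant"  # CI captures H0 (d = 0)
    if lower > d_min:
        return "launch"                         # CI clears both 0 and d_min
    if upper < 0:
        return "significant decrease"           # assumption: not covered in the notes
    return "judgement call"                     # excludes 0 but still captures d_min


d_min = 0.02  # practical significance level chosen earlier
_, ci_clear = diff_ci(1000, 10000, 1300, 10000)       # hypothetical counts
_, ci_borderline = diff_ci(1000, 10000, 1150, 10000)  # hypothetical counts
print(decide(ci_clear, d_min), "|", decide(ci_borderline, d_min))
```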

<br>
<br>

___

<br>
<br>

# The Ethics

Participants in any kind of experimental test need to be adequately protected

**Example**

Facebook experiment to gauge the effect of altering users' news feeds on emotions. In particular for this study, there is no discussion by the experimenters on the benefits of the study being conducted.

Four main principles govern the ethics of experiments with regard to the safety of their participants:

1. Risk - What risks are the participants being exposed to?
2. Benefit - What benefits might be the outcome of the study?
3. Choice - What other choices do participants have?
4. Privacy - What expectation of privacy and confidentiality do participants have?

**Risk**

Does the risk exceed that of "minimal risk"? **Minimal risk** is defined as the probability (and magnitude) of harm that a participant would be exposed to in normal daily life. In most online A/B testing, the risk (of the test) does not exceed minimal risk although there are grey areas such as in the Facebook example.

**Benefits**

How might the results of the experiment help? It is important to be able to state what the benefit would be from completing the study. In most online A/B testing, the benefits are around improving the product.

**Alternatives**

Do participants of the test really have a choice in whether to participate or not (and how does that affect the risks and the benefits)? For example, in medical clinical trials testing out new drugs for terminal illnesses, the alternative for most participants is death. Thus, the risk allowable for participants, given **informed consent**, is quite high.

For online experiments, if users do not want to participate in the testing, we must consider how this may inconvenience the users (for example costs, time, information etc. required to switch services).

**Data Sensitivity**

How sensitive is the data? What is the re-identification risk of individuals from the data? As the sensitivity and the risk increases, then the level of data protection must increase: confidentiality, access control, security, monitoring & auditing, etc. Sensitive data includes bank information, health information etc. whilst the re-identification risk of the data is determined by whether it is considered to be identified, pseudonymous, anonymous or anonymized.

**Identified data** means that data is stored and collected with personally identifiable information. This can be names, IDs such as a social security number or driver’s license ID, phone numbers, etc. HIPAA is a common standard, and that standard has 18 identifiers (see the Safe Harbor method) that it considers personally identifiable. Device IDs, such as a smartphone’s device ID, are considered personally identifiable in many instances.

**Anonymous data** means that data is stored and collected without any personally identifiable information. This data can be considered **pseudonymous** if it is stored with a randomly generated id such as a cookie that gets assigned on some event, such as the first time that a user goes to an app or website and does not have such an id stored.

In most cases, anonymous data still has time-stamps -- which is one of the HIPAA 18 identifiers. Why? Well, we need to distinguish between anonymous data and anonymized data. **Anonymized data** is identified or anonymous data that has been looked at and guaranteed in some way that the re-identification risk is low to non-existent, i.e. given the data, it would be hard to impossible for someone to be able to figure out which individual this data refers to. Often times, this guarantee is done statistically, and looks at how many individuals would fall into every possible bucket (i.e., combination of values).
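
The bucket-counting idea can be sketched as below. Counting how many records share each combination of values is the idea behind k-anonymity, which is named here as a related technique rather than something these notes prescribe. The field names and records are hypothetical:

```python
from collections import Counter


def smallest_bucket(records, quasi_identifiers):
    """Count records per combination of values and return the smallest count."""
    buckets = Counter(tuple(r[k] for k in quasi_identifiers) for r in records)
    return min(buckets.values()), buckets


# Hypothetical "anonymous" records: no names, but coarse attributes remain.
records = [
    {"age_band": "20-29", "zip3": "940", "hour": 9},
    {"age_band": "20-29", "zip3": "940", "hour": 9},
    {"age_band": "30-39", "zip3": "940", "hour": 14},
    {"age_band": "30-39", "zip3": "100", "hour": 14},
]
k, buckets = smallest_bucket(records, ["age_band", "zip3"])
print("smallest bucket size:", k)
```

A smallest bucket of size 1 means at least one individual is unique on those attributes, so the re-identification risk is not low even though no identified data is stored.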

What this means is that anonymous data may still have high re-identification risk. **Aggregated data** is usually not sensitive

For online A/B testing, questions that must be considered include:

- Are users being informed about the data being gathered via a ToS or Privacy policy?
- What user identifiers are tied to the data being gathered? Is any identified data being gathered?
- What type of data is being collected? Any health or financial data?
- What level of confidentiality and security is the data subject to? Is the access of data being logged and audited?

**Informed consent**

Participants are told about the risks that they may face if they take part in the study, what benefits might result, what other options they have, what data is being gathered and how that data is being handled. Typically informed consent is handled by giving participants a document detailing all of the aforementioned information and participants can then choose whether they want to participate or not.



<br>
<br>

___

<br>
<br>

# Metrics

There are two main uses for metrics in A/B testing:

- Invariant checking - The metrics that shouldn't change across your treatment and control. 
- Evaluation - To check whether the treatment group is "performing better" than your control group

Two types of Evaluation metrics

- High level metrics
- Well defined metrics



### High level metrics

These usually relate to the business objective. They are not directly used to perform the A/B test but they help to decide on which metrics will eventually be used to do so.

In a customer driven business, we can use a **customer funnel** to brainstorm the possible high level metrics that will be important.


Each level in the customer funnel is a high level metric that can be used to answer questions such as "what business objective are you tracking?" 

You can categorize every metric into three different types:
- Count
- Rate
- Probability

You want the metric that you eventually use in an A/B test to be a KPI for your business, but these high level metrics still need to be transformed into formal definitions.

### Difficult metrics

Some metrics will be difficult to use in an A/B test. This comes down to two specific reasons

- **Time**: For some metrics, the data will just take too long to collect. It is also difficult to ensure the smooth operation of a long-term ongoing online experiment  
- **Availability**: For other metrics, the data may not be readily available to use. The business may not have access to the data

### Other techniques for coming up with metrics

There are other techniques you can use to help you get an understanding of your users, which you can use to come up with ideas for metrics, validate your existing metrics, or even brainstorm ideas of what you might want to test in your experiments in the first place. See [here](https://s3-us-west-2.amazonaws.com/gae-supplemental-media/additional-techniquespdf/additional_techniques.pdf)

<img src="images/gathering_additional_data.png" width=600>

### Well defined metrics

To convert a high level metric into a well defined metric, you must consider two main things:

- What data are we actually going to look at to compute the metric (cookies? page views? clicks? are we going to filter the data?)
- Given the events (e.g. clicks) how will we summarize the metric? (mean? median? etc)

Essentially you must now consider the logistics of collecting and summarizing the metric

Example:

$\text{CTP} = \frac{\text{Unique visitors who click}}{\text{Unique visitors to the page}}$

This is not a well defined metric yet

- How do we determine when two events are from the same user? 
    - Let's use a cookie
- What time period do we use to count events? hour? day? week?


A few well defined metrics that act like CTP are:

1 - $\text{Cookie Probability} = \frac{\text{Number of cookies that click}}{\text{Total number of cookies}}$ for every hour (or any other time interval)

2 - $\text{Pageview probability} = \frac{\text{Pageview with click}}{\text{Total number of page views}}$ for every hour (or any other time interval)

3 - $\text{CTR (with time period)} = \frac{\text{Total number of clicks}}{\text{Total number of page views}}$ for every hour (or any other time interval)
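
A minimal sketch of the three definitions, computed per hour from a hypothetical log of `(hour, cookie, pageview_id, kind)` tuples. The log layout and the function name `hourly_metrics` are assumptions for illustration, and the sketch assumes every click belongs to a page view logged in the same hour:

```python
from collections import defaultdict


def hourly_metrics(events):
    """events: list of (hour, cookie, pageview_id, kind), kind is "view" or "click"."""
    stats = defaultdict(lambda: {
        "cookies": set(), "clicking_cookies": set(),
        "pageviews": set(), "clicked_pageviews": set(),
        "clicks": 0,
    })
    for hour, cookie, pageview_id, kind in events:
        s = stats[hour]
        if kind == "view":
            s["cookies"].add(cookie)
            s["pageviews"].add(pageview_id)
        elif kind == "click":
            s["clicking_cookies"].add(cookie)
            s["clicked_pageviews"].add(pageview_id)
            s["clicks"] += 1
    return {
        hour: {
            "cookie_probability": len(s["clicking_cookies"]) / len(s["cookies"]),
            "pageview_probability": len(s["clicked_pageviews"]) / len(s["pageviews"]),
            "ctr": s["clicks"] / len(s["pageviews"]),
        }
        for hour, s in stats.items()
    }


# Hypothetical hour 0: cookie "a" double-clicks on one of its two page views.
events = [
    (0, "a", "pv1", "view"), (0, "a", "pv1", "click"), (0, "a", "pv1", "click"),
    (0, "a", "pv2", "view"),
    (0, "b", "pv3", "view"),
]
metrics = hourly_metrics(events)
print(metrics[0])
```

The same raw events give three different values (1/2, 1/3 and 2/3 here), which is why the definition has to be fixed before the experiment.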

### Segmenting and filtering data

Often times you want to filter out any **unexpected** traffic or participants from your experiments. An example would be spam IP addresses that don't represent a typical user (and thus you do not care how a change to your website affects these participants and thus do not want these participants to skew your metric). 

Filtering out changes that are targeted towards a specific subset of users is also important because you want to avoid **diluting your results**. Results are diluted when a change in your metric can no longer be detected due to a **reduction in power** of the A/B test where you have included participants in your experiment whom you know will not be affected. $N$ is then much larger than it should be, and legitimate changes to the behaviour of the participants (that you suspect should be affected) may be considered as **sampling variation** in the hypothesis test.

Filtering is used to **de-bias** the data whilst avoiding **introducing** bias to the data. Bias may be introduced if you choose to use a filter but it is removing participants from certain subsets disproportionately. The experiment is no longer **randomized**.

Use **slicing** to check if your filter is introducing bias

### Summary metrics

For count metrics in particular and other well defined metrics (e.g. load time of a video) you have many different ways to summarize the metric e.g. mean, median (the 50%ile), 75%ile, 90%ile etc. 

To choose between these metrics, you should consider two main factors:

- **Sensitivity** - You want your metric to be sensitive enough to changes in the treatment
- **Robustness** - You want your metric to be robust to ongoing changes in your system that aren't the treatment but you have no control over

To choose how to summarize the metric, you can perform a **retrospective analysis** and produce a **histogram** of the metric. For skewed distributions, the mean is not necessarily the best measure of central tendency and a median or %ile may be more appropriate.

>Let’s talk about some common distributions that come up when you look at real user data. For example, let’s measure the rate at which users click on a result on our search page, analogously, we could measure the average staytime on the results page before traveling to a result. In this case, you’d probably see what we call a **Poisson distribution**, or that the stay times would be exponentially distributed. Another common distribution of user data is a “power-law,” Zipfian or **Pareto distribution**. That basically means that the probability of a more extreme value, z, decreases like 1/z (or 1/z^exponent). This distribution also comes up in other rare events such as the frequency of words in a text (the most common word is really really common compared to the next word on the list). These types of heavy-tailed distributions are common in internet data. Finally, you may have data that is a composition of different distributions - latency often has this characteristic because users on fast internet connection form one group and users on dial-up or cell phone networks form another. Even on mobile phones you may have differences between carriers, or newer cell phones vs. older text-based displays. This forms what is called a **mixture distribution** that can be hard to detect or characterize well. The key here is not to necessarily come up with a distribution to match if the answer isn’t clear - that can be helpful - but to choose summary statistics that make the most sense for what you do have. If you have a distribution that is lopsided with a very long tail, choosing the mean probably doesn’t work for you very well - and in the case of something like the Pareto, the mean may be infinite!

There are 4 main types of summary metrics

<img src="images/categories_for_summary_metrics.png" width=600>

A review of the literature may need to be considered to select the best summary metric (e.g. say a study showed that people take 5 seconds to internalize the information of a web page, then use the number of users who spent 5 seconds or longer on a page)

### Sensitivity or Robustness

The mean is **not a robust metric** as it is sensitive to outliers: each value influences the mean in proportion to its size, so a few extreme values can shift it substantially. The median is robust but may not be sensitive enough.
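
A small illustration with made-up load times (in seconds): adding one extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# Hypothetical page load times in seconds.
load_times = [1.0, 1.1, 1.2, 1.3, 1.4]
with_outlier = load_times + [60.0]  # one user on a very slow connection

mean_before = statistics.mean(load_times)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(load_times)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```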

To measure the **sensitivity** of a summary metric, you can perform a **retrospective study** and look at previous changes to the website (that match the change you're trying to test) to see if it affected the metric in ways that you expect. You can also perform separate simple experiments to check if your summary metric responds to the change in expected ways. Looking back at previous experiments performed may also be done.

To measure the **robustness** of a summary metric, you can perform an **A/A test** where you don't change anything and see if the metric picks up on any spurious changes. The metric is not robust if there are more statistically significant changes to your metric between A/A groups than expected due to sampling variation. (See the spreadsheet for example). You can also perform retrospective studies or simple experiments and check to see if your metric is robust between comparable groups.
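
A simulated sketch of the A/A idea, not the spreadsheet referred to above. Both groups are drawn from the same click probability, so any "significant" result is spurious, and at the 5% level roughly 5% of A/A comparisons are expected to look significant. The click probability, group size and number of runs are arbitrary assumptions:

```python
import math
import random


def z_statistic(x1, n1, x2, n2):
    """Two-proportion z statistic with a pooled standard error."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (x2 / n2 - x1 / n1) / se


def aa_significant_share(p=0.1, n=1000, runs=400, z_crit=1.96, seed=0):
    """Share of simulated A/A tests flagged as significant at the 5% level."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(runs):
        x1 = sum(rng.random() < p for _ in range(n))  # clicks in group A1
        x2 = sum(rng.random() < p for _ in range(n))  # clicks in group A2
        if abs(z_statistic(x1, n, x2, n)) > z_crit:
            significant += 1
    return significant / runs


share = aa_significant_share()
print(f"share of significant A/A comparisons: {share:.3f}")
```

With real data, a share well above the significance level would suggest the metric (or its variance estimate) is not robust.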

Here is an example of a simple experiment to check for robustness where each video should be comparable in latency (each video is the same size, resolution etc). 

<img src="images/robustness.png" width=600>


Here is an example of a simple experiment to check for sensitivity where the higher the video number, the lower the resolution (i.e. we expect the latency to decrease w.r.t video number)

<img src="images/sensitivity.png" width=600>

**From the experiments, the 85%ile might be a good metric to use based on sensitivity and robustness**

### Absolute vs Relative difference

How do you compute the difference in your metric between treatment and control? If you run multiple experiments, then using a relative difference (e.g. percentage change) means that you only need to define one practical effect size (e.g. 2% increase).

**Absolute vs. relative difference**

Suppose you run an experiment where you measure the number of visits to your homepage, and you measure 5000 visits in the control and 7000 in the experiment. Then the absolute difference is the result of subtracting one from the other, that is, 2000. The relative difference is the absolute difference divided by the control metric, that is, 40%.

**Relative differences in probabilities**

For probability metrics, people often use percentage points to refer to absolute differences and percentages to refer to relative differences. For example, if your control click-through-probability were 5%, and your experiment click-through-probability were 7%, the absolute difference would be 2 percentage points, and the relative difference would be 40 percent. However, sometimes people will refer to the absolute difference as a 2 percent change, so if someone gives you a percentage, it's important to clarify whether they mean a relative or absolute difference!
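
The two worked examples above can be reproduced with a couple of helper functions (the function names are illustrative):

```python
def absolute_difference(control, experiment):
    return experiment - control


def relative_difference(control, experiment):
    return (experiment - control) / control


# Homepage visits: 5000 in control, 7000 in experiment.
visits_abs = absolute_difference(5000, 7000)
visits_rel = relative_difference(5000, 7000)

# Click-through-probability: 5% in control, 7% in experiment.
ctp_abs = absolute_difference(0.05, 0.07)  # in probability units (percentage points / 100)
ctp_rel = relative_difference(0.05, 0.07)

print(f"visits: absolute={visits_abs}, relative={visits_rel:.0%}")
print(f"CTP: absolute={ctp_abs * 100:.0f} percentage points, relative={ctp_rel:.0%}")
```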

## Variability

You need to compute the variance of the metric in order to check whether a statistically (or practically) significant difference in your metric has resulted from the treatment.

For some non-standard metrics (e.g. ratios, 90%ile) it is difficult to calculate the variance analytically, and so it is better to compute the variance empirically. Here is a list of common metrics and their sample distribution/sample variance

<img src="images/metrics_and_variance.png" width=600>

For difficult metrics where you can't calculate the variance analytically, you may also not want to make any assumptions about the distribution of the metric. If so, you have a few options

- **Sign test** - Run 20 experiments, if on 15 of those experiments the metric increased, use binomial distribution to perform a hypothesis test to check the chance of this happening due to sampling variation (assume p - the chance of success is 50% for no difference between the two groups).
    - Doesn't help estimate the size of the effect (practical significance)
    - Part of a broader range of methods called non-parametric methods

- **Empirically compute confidence interval** - This can be done by performing multiple experiments (possibly on different sample sizes), calculating the sample variance (or sample standard deviation) of the metric between the experiments, and using that as an estimate of the population variance.
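
The sign test example from the list above (15 increases out of 20 experiments) can be worked through with an exact binomial calculation. Reporting a two-sided p-value is an assumption here; a one-sided test would halve it:

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Two-sided exact binomial p-value under p = 0.5 (no difference)."""
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


p_value = sign_test_p_value(15, 20)
print(f"two-sided p-value: {p_value:.4f}")
```

The result is a little over 0.04, so 15 out of 20 would be significant at the 5% level, although it says nothing about the size of the effect.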

Empirical CIs/Variances are done using A/A tests. See the spreadsheet for an example of this. 

To compute empirical CIs/Variances, you can either run many experiments or **bootstrap** from one large experiment: resample it with replacement many times, recompute the metric on each resample, and use the standard deviation of those values as an estimate of the standard error of the metric. 
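
A minimal bootstrap sketch using only the standard library. The data (100 clicks among 1000 visitors), the number of resamples and the function name `bootstrap_se` are assumptions for illustration. For a simple proportion the result should land near the analytical value $\sqrt{0.1 \cdot 0.9 / 1000} \approx 0.0095$, and the same function works for metrics with no analytical formula (e.g. pass `statistics.median`):

```python
import random
import statistics


def bootstrap_se(sample, stat, n_boot=1000, seed=0):
    """Standard deviation of `stat` over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [stat(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


# Hypothetical experiment: 1 = visitor clicked, 0 = visitor did not click.
clicks = [1] * 100 + [0] * 900
se_boot = bootstrap_se(clicks, statistics.fmean)
print(f"bootstrap SE of the click-through-probability: {se_boot:.4f}")
```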

<br>
<br>

___

<br>
<br>

# Designing an Experiment

### Units of diversion

In A/B testing, we need to assign different subjects to our control and treatment groups. How do you decide what is a subject in the experiment? With a user visible change, ideally we want a unique person to be a subject in our experiment, however there only exist imperfect proxies for a unique person in online testing. The way we define/approximate a subject in our experiment is called the **unit of diversion** where you divert each unit into the control or treatment groups.

We may try to distinguish users by

- **User ID**
    - A single user may sometimes have multiple accounts
    - is personally identifiable
- **Cookie** 
    - If you switch browser, you get assigned a different cookie
    - If you clear your history, you get assigned a different cookie
- **Event**
    - Users will not get a consistent experience
- **Device ID**
    - Only available for mobile
    - tied to a specific device
    - is personally identifiable
- **IP address**
    - Can change

### Consistency of diversion

There are three main things to consider when selecting the unit of diversion

**Consistency and Statefulness**

The idea is to make sure that the subjects in your experiment have a consistent experience during testing i.e. the experience of each subject is the same throughout the lifetime of the experiment.

For user visible changes, it's best to use User ID or Cookies to ensure consistency of diversion
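
One common way to get a consistent assignment is to hash the unit of diversion, so the same cookie or user ID always lands in the same group. The sketch below is an assumption about how this could be implemented, not something described in these notes; the experiment name and ID format are hypothetical:

```python
import hashlib


def assign_group(unit_id, experiment="start_now_button_color", treatment_share=0.5):
    """Deterministically map a unit of diversion (e.g. a cookie) to a group."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-looking value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same cookie always gets the same experience.
first_visit = assign_group("cookie-123")
second_visit = assign_group("cookie-123")

# Across many hypothetical cookies the split is close to the requested share.
groups = [assign_group(f"cookie-{i}") for i in range(10000)]
treatment_fraction = groups.count("treatment") / len(groups)
print(first_visit == second_visit, f"{treatment_fraction:.3f}")
```

Including the experiment name in the hash keeps assignments in different experiments independent of each other. An event-based diversion would instead draw a fresh random group for every event, which is why it cannot give a consistent experience.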

For non user visible changes e.g. latency changes, backend infrastructure changes, ranking changes etc, you do not need to worry about consistency and thus you can use an event based unit of diversion


If the metric you're measuring is some kind of learning effect i.e. if a user adapts to changes made, then you need to track the same user over a long period of time. You then need to use a **stateful** unit of diversion that is attached to a single user for which User ID and Cookie are most appropriate.

**Ethical considerations**

Cookie and User ID are personally identifiable units of diversion. To use these then, you need to consider whether you're prepared to go through all of the necessary data protection steps that are required when collecting such data. This includes for example collecting informed consent from your participants.

**Variability**

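When the **unit of analysis** is the same as the unit of diversion, the **empirical variability** of your metric tends to be close to the analytical estimate. When the unit of diversion is broader than the unit of analysis (e.g. you divert by cookie but the metric is per page view), the events within one unit are correlated, so the empirical variability is higher and the analytical estimate of the variance is an underestimate. Lower variability is desirable as it increases the power of the A/B test. The unit of analysis is the denominator of the metric (if your metric is a ratio).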

You then want to choose a **unit of diversion** that matches the denominator of your metric when you can (which usually ends up being an event based diversion).
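
The effect can be illustrated with a small simulation under a deliberately simple, hypothetical model: every user views 20 pages, and each user is either a rare clicker (2% per page view) or a frequent clicker (50%). The metric is a page-view-level CTR. When diverting by page view, each page view in a group comes from an independent random user; when diverting by user, all 20 page views of a user stay together:

```python
import math
import random
import statistics


def simulate_ctr_sd(divert_by, n_users=200, views_per_user=20, runs=200, seed=0):
    """Empirical SD of a group's CTR across simulated runs (no treatment effect)."""
    rng = random.Random(seed)
    total_views = n_users * views_per_user
    ctrs = []
    for _ in range(runs):
        clicks = 0
        if divert_by == "user":
            for _ in range(n_users):
                p = rng.choice((0.02, 0.5))  # one propensity per user
                clicks += sum(rng.random() < p for _ in range(views_per_user))
        else:  # "pageview": every page view belongs to an independent random user
            for _ in range(total_views):
                p = rng.choice((0.02, 0.5))
                clicks += rng.random() < p
        ctrs.append(clicks / total_views)
    return statistics.stdev(ctrs)


overall_p = (0.02 + 0.5) / 2
analytical_se = math.sqrt(overall_p * (1 - overall_p) / (200 * 20))
sd_pageview = simulate_ctr_sd("pageview")
sd_user = simulate_ctr_sd("user")
print(f"analytical SE:          {analytical_se:.4f}")
print(f"empirical SD, pageview: {sd_pageview:.4f}")
print(f"empirical SD, user:     {sd_user:.4f}")
```

Under this model the page-view diversion matches the analytical standard error closely, while the user diversion is noticeably more variable, so relying on the analytical formula there would overstate the power of the test.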

<img src="images/unit_of_diversion_unit_of_analysis.png" width=600>

### Inter vs. Intra user experiment

A/B testing is an inter user experiment where there are different participants in each of the control and treatment groups. More information in [this paper](http://www.cs.cornell.edu/people/tj/publications/chapelle_etal_12a.pdf)

### Target population

There are various reasons why you would want to target a specific group in the population. This is particularly important if you know in advance who will be affected by the changes made in your treatment group. Some reasons to target a specific group in your population include (but are not limited to):

- If you're testing a high-profile feature that you're unsure about whether it will be released or not, you may want to restrict the change so that only a limited amount of users experience it and as a result avoid getting press coverage etc.
- You may want to restrict the target population by language as you want to avoid going through the trouble of testing in different languages.
- You may not be sure whether your feature works on all browsers and thus you may want to restrict the target population to specific browsers
- If you're running multiple experiments at your company, you may not want to overlap participants of the study (have subjects take part in multiple experiments)
- You may not want to dilute the effect of your experiment if you know your change will only affect a specific group of your population. Increasing $N$ whilst keeping the number of participants affected the same will reduce the effect size e.g. the difference between proportions for CTP and thus reduce the power of your experiment (see filtering section)

You need to ask the engineering team if they have an idea about who the changes will target.

After the experiment, you will want to test the changes on your global population just to check if there are any unwanted effects on the traffic you were not targeting. An example of diluting the effect size is shown below:

<img src="images/diluting_variance_1.png" width=600>

<img src="images/diluting_variance_2.png" width=600>

Adding all of the unaffected traffic outside of the population that would be affected (New Zealand traffic) diluted the difference of the proportions (disproportionately to the reduction in SE) and thus the difference is no longer statistically significant. 
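
The same dilution can be reproduced with made-up numbers (these are not the figures from the images above). The affected population has 5000 users per group with click-through-probabilities of 10% and 12%; the diluted version adds 45000 unaffected users per group who click at 10% in both groups:

```python
import math


def z_score(x_cont, n_cont, x_exp, n_exp):
    """z statistic for a difference of proportions with a pooled SE."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    return (x_exp / n_exp - x_cont / n_cont) / se


# Hypothetical targeted experiment: only the affected traffic.
z_targeted = z_score(500, 5000, 600, 5000)

# Same experiment with 45000 unaffected users (4500 clicks) added to each group.
z_diluted = z_score(500 + 4500, 50000, 600 + 4500, 50000)

print(f"targeted z = {z_targeted:.2f}, diluted z = {z_diluted:.2f}")
```

The standard error shrinks with the larger $N$, but the difference in proportions shrinks by a factor of ten, so the diluted z statistic drops below the 1.96 threshold.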

### Cohort

If you just divert your users by Cookie or User ID, you may have participants drop out of your groups or join your groups during the lifetime of your experiment. Cohorts are groups of users who entered the experiment at the same time and/or have a similar level of activity. You typically want to use a cohort in experiments when:

- You're looking for learning effects (whether users are adapting to a change or not)
- Examining user retention
- Want to increase user activity
- Anything requiring the user to be established i.e. has used the site for at least a specific number of hours

This is essentially a form of filtering to ensure that the participants of the study are who you want to test the change on. In the Audacity case study, they may change the structure of a specific course and only target a cohort of participants who haven't completed the course yet.

### Sizing the experiment

Previously, we only considered how to size the experiment based on practical significance, statistical significance and sensitivity (e.g. ensuring the experiment has at least 80% power). From what we just learned, we now also need to take into account the unit of diversion vs. unit of analysis as well as the target population and then decide whether the size is realistic relative to how long you have to run the experiment.

To perform the calculation, you just do a power calculation for $N$, the sample size required to ensure a certain $\alpha$ (usually 0.05), $\beta$ (usually 0.2) and $d_{\text{min}}$ whilst estimating SE empirically (or analytically for simple metrics).
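
A sketch of one standard power calculation for a difference of two proportions, using the analytical standard error and equal group sizes. The baseline of 10% and $d_{\text{min}}$ of 2 percentage points come from the earlier example; treating the test as two-sided is an assumption, and other calculators may use slightly different formulas:

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group N for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p_treat = p_baseline + d_min
    # SD of the difference under H0 (both groups at baseline) and under H1.
    sd_null = math.sqrt(2 * p_baseline * (1 - p_baseline))
    sd_alt = math.sqrt(p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat))
    return math.ceil(((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2)


n_per_group = sample_size_per_group(p_baseline=0.10, d_min=0.02)
n_smaller_effect = sample_size_per_group(p_baseline=0.10, d_min=0.01)
print(n_per_group, n_smaller_effect)
```

Halving $d_{\text{min}}$ roughly quadruples the required sample size, which is why the choice of practical significance level dominates the sizing.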

To reduce the size of the experiment required whilst maintaining the same $\alpha$, $\beta$ and $d_{\text{min}}$, you could try:
- Changing the unit of diversion to match the unit of analysis
- Target experiment to specific traffic (filtering or using cohort)

### Sizing triggering

Even after deciding on a well-defined metric, you often don't know beforehand whether the population that is supposed to be exposed to the changes in your treatment is truly exposed to the changes or not. A good idea is to run a "pilot" where you turn on the changes and observe who is actually affected and see if it matches your intuitions.

### Duration

Depending on the sizing calculations relative to the traffic on your website, you will need to decide on a suitable **duration** and **exposure proportion**.

For example, if your website gets $100000$ users per day, and the sizing calculations resulted in a required sample size of $N=1000000$, if you expose your change to your website's entire traffic, then you will need to run your experiment for 10 days.

However, exposing the change to the entire traffic isn't a good idea because of reasons mentioned before in the "Target Population" section such as avoiding press coverage, overlapping experiments, filtering by cohorts etc.

Reducing the exposure proportion will then increase the required duration of your experiment.
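
The arithmetic above can be written as a small helper (the function name is illustrative). It assumes the daily traffic is constant and that every exposed user counts towards $N$:

```python
import math


def experiment_duration_days(required_n, daily_traffic, exposure=1.0):
    """Days needed to collect required_n units at a given exposure proportion."""
    return math.ceil(required_n / (daily_traffic * exposure))


full_exposure_days = experiment_duration_days(1_000_000, 100_000)
half_exposure_days = experiment_duration_days(1_000_000, 100_000, exposure=0.5)
print(full_exposure_days, half_exposure_days)
```

Exposing the change to half of the traffic doubles the duration from 10 to 20 days.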

Another reason is the fact that you are trying to account for confounding factors by **randomizing across your unit of diversion**. However, other **temporal factors** might be affecting your subjects/metric which you have to also account for. For example, if you run your online experiment over a holiday, the results of the experiment (if statistically and practically significant) may not apply all year round. 

By having a longer running experiment, or by running your experiment on specific days of the week/year, you can ensure that the results of your experiment are **generalizable**.

<img src="images/limit_exposure.png" width=600>

### Learning effects

When you want to measure learning effects, you're trying to quantify whether users are adapting to the changes (in your treatment group) or not. Measuring this type of effect is difficult due to the two reasons mentioned before:

- **Change aversion** - Users refuse to participate in the test
- **Novelty effect** - Too drastic of a change leads customers to "try out everything". Can't test out a specific treatment

The key idea is that for learning effects, you need to consider the duration of the experiment that is required for the user behaviour to plateau or converge.

These are the things you need to keep in mind when measuring learning effects (some of which have been mentioned already)

- You need a **stateful** unit of diversion
- You need to consider the dosage (how often the user experiences the change) and thus you should select a **cohort** based on dosage
- Learning effects are usually high risk and thus you'll probably want to test on a smaller proportion of users for a longer duration of time (to meet the sample size $N$ quota)

**Pre-period vs. post-period experiments** are a method you can use to ensure that user experience and populations are comparable between your control and treatment groups. The idea is that you run A/A tests **before and after** the actual A/B test on your treatment and control groups to ensure that there are no inherent differences between the two groups. This ensures that after your treatment group is exposed to the change, the differences between the two groups can be attributed to the treatment and not any other confounding factors.