# Practical matters in A/B testing
Selected material from:  
[*Tuning up: From A/B testing to Bayesian Optimization*](https://www.manning.com/books/tuning-up-from-a-b-testing-to-bayesian-optimization)  
Manning Publications,  2021 (summer, estimated)



David Sweet  
[linkedin.com/in/dsweet99/](http://linkedin.com/in/dsweet99/)  
[@phinance99](http://twitter.com/phinance99)  

[<img src="images/Sweet-TUp-HI.jpeg" style="border:1px solid black;width:400px">](https://www.manning.com/books/tuning-up-from-a-b-testing-to-bayesian-optimization)

# Audience  

- ML/AI engineers
- Quantitative traders, "quants"
- Software engineers



# A/B Test

- A: The current system
- B: A good idea, meant to improve the system
- Test: An experiment

- systems: trading, ads, social media, streaming songs
- compare A to B; measure the difference
- in production; ex., live trading, live website, live ads
- won't talk (today) about the details of how to design or analyze an A/B test; read my book
- multiple methods taught in book, focus on A/B testing here


# Good ideas aren't that good

- How many experiments improve metrics?
    - Amazon: 50%
    - Microsoft: 33.3%
    - Netflix: 10%


- key point; why A/B at all
- 1/10 "air of resignation"
- all experiments run on good ideas from smart people at top companies who know their products
- domain knowledge, simulations, R^2, and cross-entropy don't replace A/B testing; they're complementary to it

# Engineer's workflow
<img src="images/CH01_F01_sweet.png" style="width:1000px">

- ...so every new idea gets tested; measure some business metric online / live / in prod

# A/B test basics
- Randomization
- Replication
- Limit false positives (5%) and false negatives (20%)
    - ex: one A/B test every two weeks for a year, 33.3% accepted, <1 f.p., ~2 f.n.

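```python
# Expected false positives and false negatives for 26 tests per year, 33.3% accepted
print((52 / 2) * 0.333 * 0.05)  # false positives
print((52 / 2) * 0.333 * 0.20)  # false negatives
```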

- randomize to avoid biasing results; ex: don't want to try B only in one region, or only on a certain demographic, or only at a certain time of day; want to try it a little bit everywhere; inc. accuracy / dec. bias
- replication ex: show new way (B) to many users or use B on many trades; inc. precision / dec. variation


- 33.3%, like MSFT
- fp: making system worse; explicit cost
- fn: missing opportunity to make better; opportunity cost
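
A minimal sketch of the two principles above, using only made-up data: the users, regions, and sample sizes are assumptions for illustration. Random assignment keeps B's regional mix close to the overall mix, and more replication shrinks the standard error roughly like 1/sqrt(n).

```python
import math
import random
import statistics

rng = random.Random(0)

# Randomization: assign each hypothetical user to A or B by coin flip,
# independent of region, so B is tried "a little bit everywhere".
users = [{"id": i, "region": rng.choice(["US", "EU", "APAC"])} for i in range(9000)]
for user in users:
    user["variant"] = rng.choice(["A", "B"])

b_users = [u for u in users if u["variant"] == "B"]
share_us_all = sum(u["region"] == "US" for u in users) / len(users)
share_us_b = sum(u["region"] == "US" for u in b_users) / len(b_users)
print(f"US share overall: {share_us_all:.3f}, within B: {share_us_b:.3f}")


# Replication: the standard error of a mean shrinks like 1/sqrt(n).
def standard_error(samples):
    return statistics.stdev(samples) / math.sqrt(len(samples))


se_small = standard_error([rng.gauss(0.0, 1.0) for _ in range(100)])
se_large = standard_error([rng.gauss(0.0, 1.0) for _ in range(10_000)])
print(f"standard error with 100 samples: {se_small:.3f}, with 10,000: {se_large:.3f}")
```

In a real system the coin flip would typically be tied to a stable user or trade identifier so that the same unit always sees the same variant; that detail is left out here.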

# Holdout test

- Many A/B tests over 6 months
- Holdout
    - A: System at start of 6 months
    - B: System at end of 6 months
- Net improvement < sum of individual improvements
    - 5% f.p, nonstationarity
  

- keep some small sample (ex., of users) on old system
- measures net improvement of all A/B-tested changes
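
One way to see why the net improvement tends to fall short of the sum of individual improvements is a small simulation. This is a sketch under assumed numbers (200 tests, normally distributed true effects and measurement noise, a one-sided 5% threshold), and it only illustrates the false-positive / selection effect, not nonstationarity.

```python
import random

rng = random.Random(0)

NUM_TESTS = 200               # hypothetical number of A/B tests
NOISE_SE = 1.0                # assumed standard error of each measurement
THRESHOLD = 1.64 * NOISE_SE   # accept B if measured improvement exceeds this

sum_measured = 0.0  # what the individual A/B tests reported
sum_true = 0.0      # what a holdout test would (ideally) see
num_accepted = 0

for _ in range(NUM_TESTS):
    true_effect = rng.gauss(0.0, 1.0)                  # some ideas help, some hurt
    measured = true_effect + rng.gauss(0.0, NOISE_SE)  # noisy A/B measurement
    if measured > THRESHOLD:
        num_accepted += 1
        sum_measured += measured
        sum_true += true_effect

print(f"accepted {num_accepted} of {NUM_TESTS} changes")
print(f"sum of measured improvements: {sum_measured:.1f}")
print(f"sum of true improvements:     {sum_true:.1f}")
```

Accepted changes are the ones where noise happened to push the measurement up, so the measured improvements overstate the true ones on average.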


# Business Metric

- Immediate reward
    - Click-through rate
    - Markout profit
    - Engagement: like, retweet, comment, skip song, etc.
- Daily aggregates
    - Revenue, pnl, trading volume
    - Time spent on app, number of songs streamed
    - Active users
- Long-term
    - Monthly active users
    - Pnl/trade with multi-day hold time
    - Will an ad view lead to a purchase later?
    - User activity over next D days

- daily aggregates a sweet spot for A/B testing b/c
    - can't be handled with a contextual bandit, the way an immediate reward can
    - still short enough (in time) to run an A/B test, unlike long-term metrics

# Multiple business metrics

- Don't usually care about just one
- Maybe trade off: more revenue, less time spent
- Maybe "guardrail": higher CTR, but only if revenue and engagement don't drop
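
A guardrail rule can be written down as a simple screen. The sketch below is hypothetical: `accept_b`, `max_drop`, and the metric names are made up for illustration, and the lifts are assumed to be relative changes (B vs. A) that have already been measured in an A/B test.

```python
def accept_b(primary_lift, guardrail_lifts, max_drop=0.0):
    """Pass B only if the primary metric improves and no guardrail
    metric drops by more than max_drop."""
    if primary_lift <= 0:
        return False
    return all(lift >= -max_drop for lift in guardrail_lifts.values())


# Higher CTR, but engagement dropped: the guardrail blocks the change.
blocked = accept_b(0.02, {"revenue": 0.001, "engagement": -0.01})

# Higher CTR, revenue and engagement did not drop: the change passes.
passed = accept_b(0.02, {"revenue": 0.001, "engagement": 0.0})

print(blocked, passed)
```

A rule like this only screens candidates; weighing a genuine tradeoff (more revenue, less time spent) still takes judgment.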


# Deciding to accept or reject

- Acceptance / rejection a group discussion
- Higher stakes ==> larger discussion
- Sanity check surprising/dramatic results
    - Could there be an error in the experiment?
    - Did you learn something new? Dig deeper to understand
- Carefully weigh tradeoffs of multiple metrics

- possibly also discuss across teams
- use soft skills
- earlier engagement reduces communication problems later on

# Running an A/B test

- Risks
    - deploying bugs
    - reducing business metrics
    - wasting time


# Running an A/B test

<img src="images/CH02_F20_sweet.png" style="width:1000px">

- A/A test, small: Does experimentation code alone change metrics?
- A/B test, small: Does new code have bugs or dramatically change metrics?
- A/B test, large: The full experiment

# Early stopping
<img src="images/CH02_F22_sweet.png" style="width:600px">
- Idea: "To save time, if t-stat crosses threshold, I'll stop"  **NO**
- Can generate false positives

- This is not "early stopping" when fitting an NN or boosted trees.


# Early stopping

<img src="images/CH02_F23_sweet.png" style="width:600px">

- False positive rate can be *very* high
- Much higher than 5%, for which A/B test is (usually) designed
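
The inflated false positive rate can be reproduced with a simulated A/A test, where any "significant" result is a false positive by construction. This is a sketch under assumptions: both arms draw from a standard normal, the variance is treated as known (so a z-statistic stands in for the t-stat), and the sample size, peeking schedule, and number of simulations are arbitrary.

```python
import math
import random


def z_stat(sum_diff, n):
    """z-statistic for n paired differences; each difference has variance 2
    because A and B are both drawn from N(0, 1)."""
    return sum_diff / math.sqrt(2 * n)


def run_aa_test(rng, n_max=500, peek_every=10, z_crit=1.96):
    """Simulate one A/A test.

    Returns (crossed_while_peeking, significant_at_end)."""
    sum_diff = 0.0
    crossed_while_peeking = False
    for n in range(1, n_max + 1):
        sum_diff += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if n % peek_every == 0 and abs(z_stat(sum_diff, n)) > z_crit:
            crossed_while_peeking = True  # "I'll stop" would trigger here
    significant_at_end = abs(z_stat(sum_diff, n_max)) > z_crit
    return crossed_while_peeking, significant_at_end


rng = random.Random(0)
NUM_SIMS = 1000
results = [run_aa_test(rng) for _ in range(NUM_SIMS)]

peeking_rate = sum(peeked for peeked, _ in results) / NUM_SIMS
fixed_rate = sum(at_end for _, at_end in results) / NUM_SIMS

print(f"false positive rate, stop at first crossing: {peeking_rate:.2f}")
print(f"false positive rate, fixed sample size:      {fixed_rate:.2f}")
```

Checking only once, at the planned sample size, keeps the rate near the designed 5%; stopping at the first crossing gives the statistic many chances to wander past the threshold.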

# p - hacking

- "cherry-picking"
- 5% f.p. is 1/20
- p-Hack: Run an experiment 20 times
- p-Hack: Run an experiment and examine 20 metrics


- maybe run slightly different experiments until you get statistical significance in one of them
- in practice metrics are correlated, so you'd need to examine more than 20 to expect a false positive
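
The arithmetic behind "run it 20 times" can be sketched as follows, assuming the 20 tries are independent and each has a 5% false positive rate (`prob_any_false_positive` is a name made up for this example).

```python
def prob_any_false_positive(num_tries, alpha=0.05):
    """Chance that at least one of num_tries independent tests is a false positive."""
    return 1 - (1 - alpha) ** num_tries


print(round(prob_any_false_positive(1), 2))   # 0.05
print(round(prob_any_false_positive(20), 2))  # 0.64
```

With correlated metrics the tries are not independent, so the chance of at least one false positive grows more slowly than this formula suggests.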


# Transient effects

- Short-lived, goes away
- Ex: Users engage with your new feature b/c it's novel, then abandon it
- Fix: Drop first K samples or days of data

- learn K for your system by running many different experiments
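
Dropping the transient can be sketched as below. The daily lift numbers and K = 3 are invented for illustration; as the note above says, a suitable K has to be learned from your own system.

```python
import statistics


def steady_state_mean(daily_values, k):
    """Mean of a daily metric after discarding the first k days as transient."""
    if k >= len(daily_values):
        raise ValueError("k must be smaller than the number of days")
    return statistics.mean(daily_values[k:])


# Hypothetical daily engagement lift for B: a novelty spike that fades.
daily_lift = [0.30, 0.20, 0.10, 0.02, 0.01, 0.02, 0.01, 0.02]

naive = statistics.mean(daily_lift)          # includes the novelty spike
steady = steady_state_mean(daily_lift, k=3)  # first 3 days dropped

print(f"mean lift, all days:          {naive:.3f}")
print(f"mean lift, first 3 days dropped: {steady:.3f}")
```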


## To view slides
- `jupyter nbconvert Practical\ matters\ in\ A-B\ testing.ipynb --to slides --post serve`
