<!DOCTYPE html>
<html>
<head>
<title>Vanity — A/B Testing</title>
<link href="css/page.css" media="screen,print" rel="stylesheet" type="text/css">
<link href="css/print.css" media="print" rel="stylesheet" type="text/css">
<link href="images/favicon.png" rel="shortcut icon">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
<script type="text/javascript" src="site.js"></script>
</head>
<body>
<div id="header">
<div class="title"><a href="http://vanity.labnotes.org" title="Mirror, mirror, on the wall …">Vanity</a></div>
<div class="tagline">Experiment<br>Driven Development</div>
</div>
<div id="links">
<a href="http://github.com/assaf/vanity">Source code</a> |
<a href="http://stackoverflow.com/questions/tagged/vanity" title="stackoverflow vanity tag">StackOverflow Tag</a> |
<a href="http://rdoc.info/gems/vanity">API reference</a>
</div>
<div id="sidebar">
<ul>
<li><a href="index.html#intro" title="A/B Testing with Rails">Intro</a></li>
<li><a href="metrics.html">Metrics</a></li>
<li><a href="ab_testing.html" title="Everything you need to know">A/B Testing</a></li>
<li><a href="rails.html">Using with Rails</a></li>
<li><a href="email.html">Testing emails</a></li>
<li><a href="identity.html">Managing Identity</a></li>
<li><a href="configuring.html">Configuring</a></li>
<li><a href="adapters.html">Adapters</a></li>
<li><a href="contributing.html">Contributing</a></li>
<li><a href="experimental.html">Experimental</a></li>
</ul>
<ul id="stats">
<li><a href="http://travis-ci.org/assaf/vanity"><img src="https://api.travis-ci.org/assaf/vanity.png?branch=master"></a></li>
<li><a href="http://wiki.github.com/assaf/vanity/whos-using-vanity">Who's using it?</a></li>
</ul>
</div>
<div id="content">
<h1 id="a/b testing">A/B Testing</h1>
<div id="toc">
<ol>
<li><a href="#tf">True or False</a></li>
<li><a href="#interpret">Interpreting the Results</a></li>
<li><a href="#multiple">Multiple Alternatives</a></li>
<li><a href="#weights">Weighted Alternatives &amp; Multi-Armed Bandits</a></li>
<li><a href="#test">A/B Testing and Code Testing</a></li>
<li><a href="#decide">Let the Experiment Decide</a></li>
</ol>
</div>
<p><a href="http://en.wikipedia.org/wiki/A/B_testing">A/B testing</a> (or “split testing”) is a way of running experiments that compare the performance of different alternatives. A classic example is using an A/B test to compare two versions of a landing page, to find out which alternative leads to more registrations.</p>
<p>You can use A/B tests to gauge interest in a new feature, response to a feature change, improve the site’s design and copy, and so forth. In spite of the name, you can use A/B tests to check out more than two alternatives.</p>
<blockquote>
<p>“If you are not embarrassed by the first version of your product, you’ve launched too late” — Reid Hoffman, founder of LinkedIn</p>
</blockquote>
<h3 id="tf">True or False</h3>
<p>Let’s start with a simple experiment. We have this idea that a bigger sign-up link will increase the number of people who sign up for our service. Let’s see how well our hypothesis holds.</p>
<p>We already have a <a href="metrics.html">metric</a> we’re monitoring, and our experiment will measure against it:</p>
<pre>
ab_test "Big signup link" do
description "Testing to see if a bigger sign-up link increases number of signups."
metrics :signup
end
</pre>
<p>Next, we’re going to show some of our visitors a bigger sign-up link:</p>
<pre>
&lt;% bigger = "font:14pt;font-weight:bold" if ab_test(:big_signup_link) %&gt;
&lt;%= link_to "Sign up", signup_url, style: bigger %&gt;
</pre>
<p>Approximately half the visitors to our site will see this link:</p>
<a href="#">Sign up</a>
<p>The other half will see this one:</p>
<a href="#" style="font:14pt;font-weight:bold">Sign up</a>
<h3 id="interpret">Interpreting the Results</h3>
<p>An A/B test has two parts. We just covered the first part, which decides which alternative to show. The second part measures the effectiveness of each alternative, and happens as a result of tracking the metric.</p>
<p>Remember that we’re measuring signups, so we already have this in the code:</p>
<pre>
class SignupController &lt; ApplicationController
def signup
Account.create(params[:account])
Vanity.track!(:signup)
end
end
</pre>
<p>We’re going to let the experiment run for a while and track the results using <a href="rails.html#dashboard">the dashboard</a>, or by running the command <code>vanity report</code>.</p>
<p>Vanity splits the audience randomly — using <a href="identity.html">cookies and other mechanisms</a> — and records who got to see each alternative, and how many in each group converted (in our case, signed up). Dividing conversions by participants gives you the conversion rate.</p>
<p><img src="images/clear_winner.png" alt="" /></p>
<p>Vanity will show the conversion rate for each alternative, and how that conversion rate compares to the worst-performing alternative. In the example above, option A has an 80.6% conversion rate, 11% more than option B’s 72.6% conversion rate (72.6% × 111% ≈ 80.6%).</p>
<p>(These large numbers are easily explained by the fact that this report was generated from made-up data.)</p>
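<p>The arithmetic behind the report is simple division. Here is a minimal sketch in plain Ruby, using hypothetical participant and conversion counts chosen to match the percentages above (the report doesn’t show the raw counts):</p>

```ruby
# Hypothetical counts picked to match the rates in the report above.
option_a = { participants: 500, conversions: 403 }  # 80.6%
option_b = { participants: 500, conversions: 363 }  # 72.6%

# Conversion rate = conversions / participants.
rate = ->(alt) { alt[:conversions].to_f / alt[:participants] }

# Relative difference against the worst-performing alternative.
uplift = (rate.call(option_a) / rate.call(option_b) - 1) * 100
puts format("A converts at %.1f%%, %.0f%% more than B",
            rate.call(option_a) * 100, uplift)
```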
<p>It takes only a handful of visits before one alternative appears to perform clearly better than all the others. That’s a sign that you should continue running the experiment: small sample sizes tend to give random results.</p>
<p>To get actionable results, you want a large enough sample; more specifically, you want to look at the probability. Vanity picks the top two alternatives and <a href="http://20bits.com/articles/statistical-analysis-and-ab-testing/">calculates a z-score</a> to determine the probability that the best alternative performed better than the second best. It presents that probability, which should tell you when it’s a good time to wrap up the experiment.</p>
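<p>The underlying statistic is the standard two-proportion z-test. Here is a hedged sketch in plain Ruby, with the same made-up counts as before; Vanity’s actual implementation may differ in its details:</p>

```ruby
# Two-proportion z-test: is the difference between two observed
# conversion rates more than noise?
def z_score(conv_a, part_a, conv_b, part_b)
  p_a = conv_a.to_f / part_a
  p_b = conv_b.to_f / part_b
  pooled = (conv_a + conv_b).to_f / (part_a + part_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / part_a + 1.0 / part_b))
  (p_a - p_b) / se
end

# Probability that the better alternative truly outperforms,
# via the standard normal CDF.
def probability(z)
  0.5 * (1 + Math.erf(z / Math.sqrt(2)))
end

z = z_score(403, 500, 363, 500)  # roughly 3.0 for an 80.6% vs 72.6% split
```
<p>A z-score near 3 corresponds to a probability above 99%, which is why the report above can call a clear winner.</p>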
<p>This is the part that gets most people confused about A/B testing. Let’s say we ran an experiment with two alternatives and we notice that option B performs 50% better than option A (A * 150% = B). We calculate from the z-score a 90% probability.</p>
<p>“With 90% probability” does not mean 90% of the difference (50%), it does not mean B performs 45% better than A (90% * 50% = 45%). In fact, it doesn’t tell us how well B performs relative to A. Option B may perform exceptionally well during the experiment, not so well later on.</p>
<p>The only thing “with 90% probability” tells us is the probability that option B is somewhat better than option A. And that means 10% probability that the results we’re seeing are totally random and mean nothing in the long run. In other words: 9 out of 10 times, B is indeed better than A.</p>
<p>If you run the test longer to collect a larger sample size you’ll see the probability increase to 95%, then 99% and finally 99.9%. That’s big confidence in the outcome of the experiment, but it might take a long time to get there.</p>
<p>You might want to instead decide on some target probability (which could change from one experiment to another). For example, if you pick 95% as the target, you’re going to act on the wrong conclusion 1 out of 20 times, but you’re going to finish your experiments faster, which means you’ll get to iterate quickly and more often. Fast iterations are one way to improve the quality of your software.</p>
<p>You’ll want to read more about <a href="http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/">A/B testing and statistical significance</a>.</p>
<h3 id="multiple">Multiple Alternatives</h3>
<p>Your A/B tests can have as many alternatives as you like, with two caveats. The first is that the more alternatives you have, the larger the sample size you need, and the longer it will take to learn the outcome of your experiment. You want alternatives that are significantly different from each other: testing two pricing options at $5 and $25 is fast; testing every price between $5 and $25 at $1 increments will take a long time to reach any conclusive result.</p>
<p>The second caveat is that right now Vanity only scores the two best-performing alternatives. This may be an issue in some experiments; it may also be fixed in a future release.</p>
<p>To define an experiment with multiple alternatives:</p>
<pre>
ab_test "Price options" do
description "Mirror, mirror on the wall, who's the better price of them all?"
alternatives 5, 15, 25
metrics :signup
end
</pre>
<p>The <code>ab_test</code> method returns the value of one of the chosen alternatives, so in your views you can write:</p>
<pre>
&lt;h2&gt;Get started for only $&lt;%= ab_test(:price_options) %&gt; a month!&lt;/h2&gt;
</pre>
<p><img src="images/price_options.png" alt="" /></p>
<p>If you don’t give any values, Vanity will run your experiment with the values <code>false</code> and <code>true</code>. Here are other examples of rendering A/B tests with multiple values:</p>
<pre>
def index
# alternatives are names of templates
render template: Vanity.ab_test(:new_page)
end
</pre>
<pre>
&lt;%= ab_test(:greeting) %&gt; &lt;%= current_user.name %&gt;
</pre>
<pre>
&lt;% ab_test(:features) do |count| %&gt;
&lt;%= count %&gt; features to choose from!
&lt;% end %&gt;
</pre>
<h3 id="weights">Weighted Alternatives &amp; Multi-Armed Bandits</h3>
<p>By default, with n alternatives, Vanity assigns each alternative a probability of 1/n. If you want a non-uniform split, you can assign weights to the alternatives. For example:</p>
<pre>
ab_test "Background color" do
metrics :coolness
alternatives "red" => 10, "blue" => 5, "orange" => 1
default "red"
end
</pre>
<p>This would make “red” 10 times as likely to appear as orange. (Note that these are weights, not percentages, so the probability of assigning red above is about 62%, blue about 31%, and orange about 6%.) This is useful, for example, to assign a higher weight to a ‘control’ alternative, ensuring that most users continue having the default experience while test alternatives receive lower probabilities.</p>
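<p>Under the hood, weighted assignment amounts to sampling in proportion to the weights. A minimal sketch of the idea in plain Ruby (not Vanity’s actual code):</p>

```ruby
# Pick an alternative with probability proportional to its weight.
def weighted_choice(weights)
  total = weights.values.sum
  target = rand * total  # uniform point in [0, total)
  weights.each do |alternative, weight|
    # Walk through the alternatives, subtracting each weight until
    # the running total passes the random point.
    return alternative if (target -= weight) <= 0
  end
end

weights = { "red" => 10, "blue" => 5, "orange" => 1 }
# red is drawn with probability 10/16 (62.5%), blue 5/16, orange 1/16
weighted_choice(weights)
```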
<h4 id="bandits">Multi-armed bandits</h4>
<p>An alternative to a uniform split of traffic is a “multi-armed bandit”; the specific implementation included in Vanity is Bayesian. In this mode, most traffic is sent to the currently best-performing alternative (exploitation), while the rest is split between the remaining alternatives (exploration). Since worse-performing alternatives receive much less traffic, bandit-driven experiments tend to have higher average conversion rates than traditional A/B split tests. The disadvantage is that, in the worst case, it can take much more traffic to declare a conclusion with statistical significance.</p>
<p>This can be enabled in the experiment definition, for example:</p>
<pre>
ab_test "noodle_test" do
alternatives "spaghetti", "linguine"
metrics :signup
score_method :bayes_bandit_score
rebalance_frequency 100
end
</pre>
<p>Note: Setting the score method to <code>bayes_bandit_score</code> won’t adjust alternative probabilities (i.e., you won’t get the benefit of maximizing conversions) unless you also set <code>rebalance_frequency</code>, which controls how many impressions must pass before alternative probabilities are rebalanced.</p>
<p>Also note that the impressions count is stored in memory and is therefore reset when your app restarts.</p>
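<p>The Bayesian bandit idea can be illustrated with Thompson sampling: model each alternative’s conversion rate as a Beta posterior, draw one sample per alternative, and serve the alternative with the highest draw. This is a simplified sketch of the technique, not Vanity’s implementation (which rebalances on a schedule, as described above):</p>

```ruby
# Gamma(shape, 1) draw for integer shape: sum of `shape` Exp(1) draws.
def gamma_draw(shape)
  Array.new(shape) { -Math.log(1 - rand) }.sum
end

# Beta(a, b) draw via the ratio of two Gamma draws.
def beta_draw(a, b)
  x = gamma_draw(a)
  y = gamma_draw(b)
  x / (x + y)
end

# Thompson sampling: each arm's conversion rate has a
# Beta(conversions + 1, failures + 1) posterior; serve the arm
# with the highest sampled rate.
def thompson_choose(arms)
  arms.max_by { |_name, (conversions, participants)|
    beta_draw(conversions + 1, participants - conversions + 1)
  }.first
end

arms = { "spaghetti" => [80, 1000], "linguine" => [120, 1000] }
thompson_choose(arms)  # almost always "linguine" with this much data
```

<p>Because the better-converting arm wins most draws, it automatically receives most of the traffic, while weaker arms still get an occasional look.</p>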
<h3 id="test">A/B Testing and Code Testing</h3>
<p>If you’re presenting more than one alternative to visitors of your site, you’ll want your test suite to cover more than one alternative too. Don’t let A/B testing become A/broken.</p>
<p>You can force a functional/integration test to choose a given alternative:</p>
<pre>
def test_big_signup_link
Vanity.experiment(:big_signup_link).chooses(true)
get :index
assert_select "a[href=/signup][style^=font:14pt]", "Sign up"
end
</pre>
<p>Here’s another example using Webrat:</p>
<pre>
def test_price_option
[5, 15, 25].each do |price|
Vanity.experiment(:price_options).chooses(price)
visit root_path
assert_contain "Get started for only $#{price} a month!"
end
end
</pre>
<p>You’ll also want to test each alternative visually, from your Web browser. For that you’ll have to install <a href="rails.html#dashboard">the Dashboard</a>, which lets you pick which alternative is shown to you:</p>
<p><img src="images/ab_in_dashboard.png" alt="" /></p>
<p>Once the experiment is over, simply remove its definition from the experiments directory and run the test suite again. You’ll see errors in all the places that touch the experiment (from failing to load it), pointing you to what parts of the code you need to remove/change.</p>
<h3 id="decide">Let the Experiment Decide</h3>
<p>Sample size and probability help you interpret the results; you can also use them to configure an experiment to complete itself automatically.</p>
<p>This experiment will conclude once it has 1000 participants for each alternative, or a leading alternative with probability of 95% or higher:</p>
<pre>
ab_test "Self completed" do
description "This experiment will self-complete."
metrics :coolness
complete_if do
alternatives.all? { |alt| alt.participants >= 1000 } ||
(score.choice && score.choice.probability >= 95)
end
end
</pre>
<p>When it reaches its end, the experiment will stop recording conversions, choose one of its alternatives as the outcome, and switch every usage of <code>ab_test</code> to that alternative.</p>
<p>By default Vanity will choose the alternative with the highest conversion rate. This is most often, but not always, the best outcome. Imagine an experiment where option B results in fewer conversions than option A, but conversions of higher quality. Perhaps you’re interested in option B as long as it loses no more than 20% of conversions compared to option A. Here’s a way to write that outcome:</p>
<pre>
ab_test "Self completed" do
description "This experiment will self-complete."
metrics :coolness
complete_if do
score(95).choice # only return choice with probability >= 95
end
outcome_is do
a, b = alternatives
b.conversion_rate >= 0.8 * a.conversion_rate ? b : a
end
end
</pre>
</div>
<div id="footer"><a href="credits.html">Credits / License</a></div>
<script type="text/javascript">var _gaq=_gaq||[];_gaq.push(["_setAccount", "UA-1828623-6"], ["_trackPageview"]);(function(){var ga=document.createElement("script");ga.src=("https:"==document.location.protocol?"https://ssl":"http://www")+".google-analytics.com/ga.js";ga.setAttribute("async", "true");document.documentElement.firstChild.appendChild(ga);})();</script>
</body>
</html>