In [None]:
In an A/B test, how can you check if assignment to the various buckets was truly random?

Identifying bucket imbalance
An effective way to detect bucketing bias is to test for imbalanced bucket sizes.
Our experimentation framework automatically checks if buckets are roughly the expected size, using two methods.
First, we perform an overall health check using the multinomial goodness of fit test.
This test checks if observed bucket allocations collectively matched expected traffic allocations.
If overall health is bad, we also perform binomial tests on each bucket to pinpoint which buckets might have problems,
and show a time series of newly bucketed users in case experimenters want to do a deep dive.


Assume we have input data (collected via experiments) in the following format.

(Also assume that the experiment has one control and one treatment group and tests the effect of a particular feature.)
The following table shows the number of clicks alongside the number of unique visitors:



                                        Treatment    Control

Number of unique customers visited            450        780

Observed                                     1500       2300

Expected                                     3400       4356

(Expected values obtained using the binomial formula: P(X = x) = C(n, x) * p^x * (1 - p)^(n - x))




A population is called multinomial if its data is categorical and belongs to a collection of discrete
non-overlapping classes.

In case of 5 buckets (multivariate testing) we shall conduct both tests.
Multinomial test: it checks the overall health of the 5 buckets together and flags the experiment when the observed allocation
fails the test at the chosen significance level. (With only one control and one treatment group, the multinomial test reduces to a binomial test.)

The multinomial test can protect us from the woes of multiple hypothesis testing, but it has a disadvantage:
it does not tell us which buckets are problematic.
To provide more guidance, we run additional binomial tests when the multinomial test flags an experiment (it flags the experiment as a whole, not the individual buckets used in it).

A binomial test can be run for every bucket at a specific point in time, which lets us figure out which buckets are
unhealthy, whereas the multinomial test measures the difference between observed and expected counts across all the buckets
used in an experiment. The idea is to count unique customers rather than impressions, because impression counts can be
legitimately imbalanced: a properly designed and implemented experiment can have the total number of bucketed impressions vary across buckets due to experiment effects or implementation details. Comparing bucket imbalance based on unique bucketed users is therefore a better test than looking at total triggers or total visits.
The bucket balance can be checked using the number of unique customers allocated to each bucket. This combination of batch-level and global testing allows us to detect more subtle problems than either type of test would detect individually.
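
As a rough sketch of the two-level check described above, the snippet below runs an overall goodness-of-fit test on five buckets and, only if that flags the experiment, follows up with one binomial test per bucket. Note the assumptions: Pearson's chi-square statistic stands in as a large-sample approximation to the multinomial test, each per-bucket binomial test uses a normal approximation, and the bucket counts, the equal 20% traffic shares, the 0.05 threshold, and all function names are made up for illustration. This is not the framework's actual implementation.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points for df = 1 and df = 2 ...
    if df % 2:
        d, tail = 1, math.erfc(math.sqrt(half))
    else:
        d, tail = 2, math.exp(-half)
    # ... then sf(df + 2) = sf(df) + half^(df/2) * exp(-half) / Gamma(df/2 + 1).
    while d < df:
        tail += half ** (d / 2.0) * math.exp(-half) / math.gamma(d / 2.0 + 1.0)
        d += 2
    return tail


def overall_health(observed, shares):
    """Pearson chi-square goodness-of-fit p-value across all buckets."""
    total = sum(observed)
    stat = sum((o - total * s) ** 2 / (total * s) for o, s in zip(observed, shares))
    return chi2_sf(stat, len(observed) - 1)


def bucket_p_value(count, total, share):
    """Two-sided binomial test for one bucket (normal approximation)."""
    z = (count - total * share) / math.sqrt(total * share * (1 - share))
    return math.erfc(abs(z) / math.sqrt(2))


def unhealthy_buckets(observed, shares, alpha=0.05):
    """Indices of suspicious buckets, or [] if the overall check passes."""
    if overall_health(observed, shares) >= alpha:
        return []
    total = sum(observed)
    return [i for i, (o, s) in enumerate(zip(observed, shares))
            if bucket_p_value(o, total, s) < alpha]


# Hypothetical unique-user counts for 5 buckets with equal expected traffic.
observed = [2000, 2010, 1985, 2005, 1700]
shares = [0.2] * 5
print("overall p-value:", overall_health(observed, shares))
print("buckets to inspect:", unhealthy_buckets(observed, shares))
```

Running the per-bucket tests only after the overall test fails keeps the number of hypothesis tests small, which is the protection against multiple testing mentioned above.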

What happens with the above table?
1. First set the null and alternative hypotheses
2. Use the appropriate test (multinomial or binomial) to quantify the difference between the
observed and expected counts
3. Assess significance by comparing the resulting p-value with the significance level (e.g., 0.05 for a 95% confidence level)
4. If the p-value falls below the significance level, set a flag to indicate the imbalance
5. This tells us whether the traffic was split between the groups randomly, without detectable bias
6. Proceed with the analysis accordingly
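
The steps above can be sketched for the two-bucket case using the unique customer counts from the table (450 in treatment, 780 in control). This is only an illustration: the intended 50/50 traffic split and the 0.05 significance level are assumptions (the table does not state them), and the helper names are hypothetical. The test used is an exact two-sided binomial test.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    threshold = binom_pmf(k, n, p) * (1 + 1e-7)  # small tolerance for float ties
    total = sum(pmf for pmf in (binom_pmf(i, n, p) for i in range(n + 1))
                if pmf <= threshold)
    return min(1.0, total)


# Step 1: H0 says each unique customer lands in treatment with probability 0.5
#         (the 50/50 split is an assumption, not something the table states).
treatment, control = 450, 780  # unique customers from the table above
expected_share, alpha = 0.5, 0.05

# Steps 2-3: compute the p-value and compare it with the significance level.
p_value = binom_test_two_sided(treatment, treatment + control, expected_share)

# Steps 4-5: flag the experiment if such a split is unlikely under random assignment.
imbalanced = p_value < alpha
print(f"p-value = {p_value:.3g}, imbalanced = {imbalanced}")
```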


In [None]:
What might be the benefits of running an A/A test, where you have two buckets who are exposed to the exact same product?
In A/A testing the control and the treatment are identical versions; in fact, there are no separate control and treatment versions
per se. Why use A/A testing at all? It is a sanity check: it verifies that assignment is random (and the outcome therefore unbiased) before we rely on, for example, an A/B test that we might conduct later.

Example: Let's say I run an A/B testing platform, like Optimizely, for my clients. My clients pay to use my platform
to conduct their A/B tests and get results. Suppose a client questions the outcome of a recent A/B test
on the platform because they are not sure how unbiased it was. As a proof of concept,
I take the control page that the client tested on their website and run an A/A test, serving that same page to both buckets.
If the platform is sound, a statistically significant difference between the two identical buckets should show up only by chance,
about 5% of the time at the 95% confidence level, which shows that the results are trustworthy.

Used when:
To set a baseline conversion rate to compare with that of a later A/B test. In this situation the A/A test is conducted before the A/B test
To spot any bugs in the tracking and setup
To identify suspicious conversion lift:
For example, if the difference in the conversion rate or AOV is lower than 5% (at 95% confidence level), be very suspicious that the
potential lift is driven by chance rather than by the difference in the design
To check afterwards whether the test was conducted correctly. In this case, the A/A test is conducted after the A/B test
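
To illustrate the sanity-check idea, the simulation below runs many A/A tests in which both buckets share the same true conversion rate and counts how often a two-proportion z test still reports a significant difference. With a healthy setup that share should sit near the significance level (about 5% at 95% confidence). The 10% conversion rate, 500 users per bucket, 1000 runs, and the function names are all hypothetical choices for this sketch.

```python
import math
import random


def two_sided_p_value(x1, n1, x2, n2):
    """Two-sided p-value of the pooled two-proportion z test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no conversions (or only conversions) in both buckets
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(rate=0.10, users=500, runs=1000, alpha=0.05, seed=0):
    """Share of simulated A/A tests that wrongly report a significant difference."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(runs):
        # Both buckets see the same product, so both use the same true rate.
        a = sum(rng.random() < rate for _ in range(users))
        b = sum(rng.random() < rate for _ in range(users))
        if two_sided_p_value(a, users, b, users) < alpha:
            flagged += 1
    return flagged / runs


print("false positive rate:", aa_false_positive_rate())
```

A rate far above the significance level in a real A/A test would point to a problem in bucketing or tracking rather than to a product effect.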





In [None]:
You are AirBnB and you want to test the hypothesis that a greater number of photographs increases the chances that a 
buyer selects the listing.
How would you test this hypothesis?

Questions:
Do we already have a baseline for how many photographs count as "greater"? Here let us assume more than 5 photos of the property (bedroom, bathroom, living space, patio, entrance, pool, and backyard; in short, anything about the property).

Photographs of the property for sure, but what else? Ambiance, environment, beach access (for beach properties), special features?

What is the conversion rate (of buyers selecting the listings: meaning buyers instant booking or sending a request to the host)
currently for listings with fewer than 5 photos?

The assumption for the hypothesis is regardless of the type of properties: Private room, Entire place, Hotel room, right? 

What categories of buyers (age, gender, occupation?) are we considering for bucketing?

Should I collect equal number of men and women?

Should I worry about representing both the genders equally in every age in the cluster 22 to 45?

Won't the sample size get smaller with too many filters, such as restricting to buyers who opted specifically for instant booking?

Is it enough to worry only about the factors that I can control from my end? That is, can I ignore how
many days/nights the buyers choose while searching, or how many guests they select?

Should I worry about including an equal number of cities from North America, South America, etc.?
What would you do if you did not achieve the sample size within the stipulated time?




Let us just assume that drastic environmental factors such as floods, storms, or volcanic eruptions were absent, so they did not play
a role in affecting the outcome of the experiment. Of course time of the year and availability

Type of property: Let us consider only the properties that fall in the 'Entire place' category
    
Category of buyers:
    age: 22 to 45
    gender: both men and women
    
Neighborhood: Let us assume that in any destination we consider properties within 35 miles of the city center (downtown or wherever we take the center of the city to be).
For example, for Stockholm we would consider the properties located within 35 miles of downtown

City Limits: Let us consider only 25 popular cities in the US for now: NYC, SF, LA...
    
Trip type: For families
    
Price range: $65 to $350
    
House rules: Let us consider properties that both allow and do not allow pets
    
Host and booking: Let us focus on buyers that opted specifically for Instant booking
    
Facilities: By and large they are the same for all the properties in our buckets, say 2 bedrooms and 1.5 or 2 baths with a similar-size backyard, the same or similar property accessibility,
and space for car parking

Unique homes: included if they are within the specified city limits
    
Cancellations: considering all

Host language: considering all

Amenities: All the properties in our buckets provided the same amenities

Sample size: My guess is that we will end up (after applying all the filters) with at most approx. 500 houses per city, which amounts
to approx. 12500 houses across the 25 cities. Because we need properties with fewer than 5 photos for control and properties with more than 5 photos for treatment,
let us consider 300 properties for control and 300 for treatment.

Expected time period to run the experiment: The number of properties available affects buyers' selection behavior, so the experiment has to run for weeks, say 3 to 4 weeks. Let us consider 3 weeks, starting from the last week of the current
month through the first two weeks of the next month: say from the last week of March to the first two weeks of April
    
Treatment: Consider the properties: Entire place with more than 5 photos
    
Control: Consider the properties: Entire place with fewer than 5 photos
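
The 300-per-bucket figure above is a guess, so it may help to compare it with a standard power calculation. The sketch below uses the usual normal-approximation formula for a one-tailed test of two proportions. Everything numeric here is an assumption made for illustration: the 10% and 15% selection rates, the 80% power target, and the treatment of each observation in a bucket as an independent unit (which is also what the z test in this plan assumes).

```python
import math
from statistics import NormalDist


def sample_size_per_bucket(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate n per bucket for a one-tailed two-proportion z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    spread_h0 = math.sqrt(2 * p_bar * (1 - p_bar))
    spread_ha = math.sqrt(p_control * (1 - p_control)
                          + p_treatment * (1 - p_treatment))
    effect = abs(p_treatment - p_control)
    return math.ceil(((z_alpha * spread_h0 + z_power * spread_ha) / effect) ** 2)


# Hypothetical rates: 10% of buyers select a control listing, 15% a treatment listing.
needed = sample_size_per_bucket(0.10, 0.15)
print("needed per bucket:", needed, "| planned per bucket:", 300)
```

Comparing the result with the planned 300 per bucket indicates whether the experiment is likely to detect an effect of the assumed size; smaller assumed effects require larger samples.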

Experiment:
H0: There is no difference between the percentage of unique buyers selecting the listings with more than 5 photos (treatment) and the percentage selecting the listings with fewer than 5 photos (control)
Ha: The percentage of unique buyers selecting the listings with more than 5 photos (treatment) > The percentage of unique buyers selecting the listings with fewer than 5 photos (control)
The idea is that a buyer who messages the host many times is counted only once in the proportion, so repeated actions do not inflate the rate and cause
false positives

In other words:
H0: p1-p2=0
Ha: p1-p2>0, i.e., p1 > p2 (that means a one-tailed test: right-tailed)
where p1 represents the percentage of unique buyers selecting the listings with more than 5 photos (treatment)
p2 represents the percentage of unique buyers selecting the listings with fewer than 5 photos (control)
    
What are we measuring: How many unique buyers in either bucket actually instant book a property or message the host (the message could be a request to book; whether the property was available and whether they actually booked is beside the point)

Statistical test: We would use a Z test to compare the two proportions. The formula is shown below:
z = ((^p1 - ^p2) - 0) / SQRT(^p(1 - ^p) * ((1/n1) + (1/n2)))

where ^p = (y1 + y2) / (n1 + n2)
^p is the proportion of successes in the two samples combined, and y1, y2 are the numbers of successes in each sample

Here n1 = n2 = 300

Use either the p-value or the critical-region approach to decide whether or not to reject H0 at the 95% confidence level
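
A minimal sketch of this right-tailed z test follows, showing both the p-value and the critical-region decision. The success counts (60 of 300 for treatment, 42 of 300 for control) are invented purely to exercise the formula, and the function name is hypothetical.

```python
import math


def one_tailed_z_test(y1, n1, y2, n2):
    """Right-tailed pooled z test for H0: p1 - p2 = 0 against Ha: p1 - p2 > 0."""
    p1_hat, p2_hat = y1 / n1, y2 / n2
    p_hat = (y1 + y2) / (n1 + n2)  # proportion of successes in both samples combined
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    z = ((p1_hat - p2_hat) - 0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under H0
    return z, p_value


# Hypothetical counts of unique buyers who selected a listing, n1 = n2 = 300.
z, p_value = one_tailed_z_test(y1=60, n1=300, y2=42, n2=300)

alpha = 0.05
critical_value = 1.645  # right-tail cutoff of the standard normal for alpha = 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0 (p-value approach):", p_value < alpha)
print("reject H0 (critical-region approach):", z > critical_value)
```

Both approaches lead to the same decision; the p-value additionally shows how strong the evidence is.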



In [None]:
2. A/B test plan.
 
At Minted, customers can personalize photo cards using their favorite photos. Customers can upload, store and choose photos using a functionality called “Photo Tray”. 
 
Below are two versions of “Photo Tray” UI. Your task is to figure out if the performance of Version 1 differs from Version 2 using an A/B test, analyze test results and help Minted choose the winner UI based on the results.
 
Please outline an analysis plan in no more than 1 page (bullet points are fine) describing:
●	Your hypothesis and justification for this hypothesis
●	Any metrics you will track
●	Any methods you will use to analyze the results
●	How you will present your results
 
Feel free to make any assumptions in order to craft a coherent analysis plan.


There are two steps to get to the stage of uploading photos:
First, click the 'Personalize' button
Then click 'upload photo' on the personalize page
    
H0: p1-p2=0
Ha: p1-p2 !=0 (which means two tailed test)
    
where p1: the proportion of unique customers clicking 'upload photo' on Version 1 (this proportion is simply the click-through probability)
p2: the proportion of unique customers clicking 'upload photo' on Version 2
    
Let us assume that uploading photos onto the photo tray defines the performance of a version. The reason for measuring how many unique customers clicked 'upload photo' is that customers who click it
have found that version interactive and easily navigable (or had some other compelling reason). Whether they completed the upload or finally made a purchase is beside the point
in this context. Also, by counting each email-id once we avoid inflating the click count, which could produce false positives, since
a customer can click upload many times for various reasons (power outage, technical glitch, getting busy with
other chores, etc.) during the time we conduct the experiment.
    
        
Experiment:
    
How would you get the count of unique customers?

Let us assume every unique email-id represents a unique customer
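
A small sketch of that counting rule is shown below: each email-id is counted once per version no matter how many times the customer clicks. The event format, the sample records, and the lowercasing and trimming of email-ids are assumptions for illustration only.

```python
from collections import defaultdict


def unique_clickers(click_events):
    """Count unique email-ids per version; repeat clicks by one customer count once."""
    emails_by_version = defaultdict(set)
    for email, version in click_events:
        # Normalizing case and whitespace is an assumption about how ids are stored.
        emails_by_version[version].add(email.strip().lower())
    return {version: len(emails) for version, emails in emails_by_version.items()}


# Hypothetical click log: (email-id, version shown) for each 'upload photo' click.
events = [
    ("ann@example.com", "Version 1"),
    ("ann@example.com", "Version 1"),   # repeat click, counted once
    ("Bob@example.com ", "Version 2"),
    ("bob@example.com", "Version 2"),   # same customer after normalization
    ("cat@example.com", "Version 2"),
]
print(unique_clickers(events))
```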

One option is not to conduct the experiment during festive and peak business periods (identified from our records). But the fact is that
Minted also makes more than 80% of its business during these times.
Then the question becomes: would there be enough traffic to conduct the experiment outside them? Why avoid peak season or festive time? Because
it is highly likely that necessity or gift obligations alter customers' shopping behavior.

At the same time, it is not a bad idea to run the experiment during peak times to get more traffic.
    
Gender: We assume that there is not much of a difference between men and women with regard to uploading photos
via either version. Therefore we do not have to worry about having an equal number of men and women in our buckets.

Age: We are not going to worry about customers' age, as we do not have much control over it

Customers: Both signed-in customers (existing customers with a Minted account and customers who have just created an account) and guest customers are included
    

Helpful info: Standard shipping within the US usually takes 6 to 9 days and the RUSH option takes 5 days
Time to conduct: Let us consider only US customers in the month of November (assuming the traffic will be heavy)
    
    
Control:

Treatment:

  
How long would you run the experiment for?
    
Statistical test: We would use a Z test to compare the two proportions. The formula is shown below:
z = ((^p1 - ^p2) - 0) / SQRT(^p(1 - ^p) * ((1/n1) + (1/n2)))

where ^p = (y1 + y2) / (n1 + n2)
^p is the proportion of successes in the two samples combined, and y1, y2 are the numbers of successes in each sample
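
For the two-tailed hypothesis in this plan, the test can be sketched as below, together with a confidence interval for the difference, which is often easier to present than a bare p-value. The click counts (480 and 560 unique clickers out of 4000 unique customers per version) are hypothetical, as is the function name. Using an unpooled standard error for the interval is a common convention rather than something stated in the plan.

```python
import math
from statistics import NormalDist


def compare_versions(y1, n1, y2, n2, confidence=0.95):
    """Two-tailed pooled z test plus a confidence interval for p1 - p2."""
    p1_hat, p2_hat = y1 / n1, y2 / n2
    diff = p1_hat - p2_hat
    p_hat = (y1 + y2) / (n1 + n2)  # proportion of successes in both samples combined
    pooled_se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    z = (diff - 0) / pooled_se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    # The interval does not assume H0, so it uses the unpooled standard error.
    unpooled_se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * unpooled_se
    return z, p_value, (diff - margin, diff + margin)


# Hypothetical unique customers who clicked 'upload photo' on each version.
z, p_value, interval = compare_versions(y1=480, n1=4000, y2=560, n2=4000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print(f"95% CI for p1 - p2: ({interval[0]:.4f}, {interval[1]:.4f})")
```

If the interval excludes zero, the sign of the difference indicates which version has the higher click-through probability.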

 
