In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Sample Size

The most common use of power calculations is to estimate how big a sample you will
need.

For example, suppose you are looking at click-through rates (clicks as a percentage of
exposures), and testing a new ad against an existing ad. How many clicks do you need
to accumulate in the study? If you are interested only in results that show a huge
difference (say, a 50% difference), a relatively small sample might do the trick. If, on the
other hand, even a minor difference would be of interest, then a much larger sample
is needed. A standard approach is to establish a policy that a new ad must do better
than an existing ad by some percentage, say, 10%; otherwise, the existing ad will
remain in place. This goal, the “effect size,” then drives the sample size.

For example, suppose current click-through rates are about 1.1%, and you are seeking
a 10% boost to 1.21%. So we have two boxes: box A with 1.1% ones (say, 110 ones and
9,890 zeros), and box B with 1.21% ones (say, 121 ones and 9,879 zeros). For starters,
let’s try 300 draws from each box (this would be like 300 “impressions” for each ad).
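One such draw of 300 impressions per ad can be simulated with NumPy. This is a minimal sketch: the `make_box` helper, the seed, and drawing with replacement are illustrative assumptions (with 10,000 tickets per box, replacement makes little difference), and a different seed gives different counts.

```python
import numpy as np


def make_box(n_ones, n_zeros):
    """Hypothetical helper: a 0/1 'box' with n_ones ones and n_zeros zeros."""
    return np.concatenate([np.ones(n_ones, dtype=int), np.zeros(n_zeros, dtype=int)])


box_a = make_box(110, 9_890)   # existing ad: 1.1% click-through
box_b = make_box(121, 9_879)   # new ad: 1.21% click-through (a 10% boost)

rng = np.random.default_rng(seed=1)                  # arbitrary seed, for reproducibility
draw_a = rng.choice(box_a, size=300, replace=True)   # 300 "impressions" for ad A
draw_b = rng.choice(box_b, size=300, replace=True)   # 300 "impressions" for ad B
print("Box A ones:", draw_a.sum())
print("Box B ones:", draw_b.sum())
```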

Suppose our first draw yields the following:

• Box A: 3 ones
• Box B: 5 ones

Right away we can see that any hypothesis test would reveal this difference (5 versus
3) to be well within the range of chance variation. This combination of sample size
(n = 300 in each group) and effect size (10% difference) is too small for any hypothesis
test to reliably show a difference.

So we can try increasing the sample size (let’s try 2,000 impressions), and require a
larger improvement (50% instead of 10%).
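To check that 5 versus 3 is well within chance variation, one option is Fisher's exact test from SciPy on the 2×2 table of ones and zeros. The choice of test is an assumption (the text does not name one); Fisher's exact test suits counts this small.

```python
from scipy import stats

# Rows: box A, box B; columns: ones (clicks), zeros (no click), out of 300 draws each
table = [[3, 300 - 3],
         [5, 300 - 5]]
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"p-value: {p_value:.3f}")  # far above 0.05, so no evidence of a real difference
```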

For example, suppose current click-through rates are still 1.1%, but we are now seeking
a 50% boost to 1.65%. So we have two boxes: box A still with 1.1% ones (say, 110
ones and 9,890 zeros), and box B with 1.65% ones (say, 165 ones and 9,835 zeros).

Now we’ll try 2,000 draws from each box. Suppose our first draw yields the following:

• Box A: 19 ones
• Box B: 34 ones

A significance test on this difference (34–19) shows it still registers as “not significant”
(though much closer to significance than the earlier difference of 5–3). To calculate
power, we would need to repeat the previous procedure many times, or use
statistical software that can calculate power, but our initial draw suggests to us that
even detecting a 50% improvement will require several thousand ad impressions.
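The 34 versus 19 comparison can be checked with SciPy's chi-square test of independence on the 2×2 table (an assumed choice of test). The p-value sits right around 0.05, so the verdict depends on details: with SciPy's default Yates continuity correction it lands just above 0.05, while without the correction it dips just below.

```python
from scipy import stats

# Rows: box A, box B; columns: ones (clicks), zeros, out of 2,000 draws each
table = [[19, 2_000 - 19],
         [34, 2_000 - 34]]

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(table)
# The same test without the continuity correction, for comparison
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
print(f"with Yates correction:    p = {p_yates:.3f}")
print(f"without Yates correction: p = {p_raw:.3f}")
```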

In summary, for calculating power or required sample size, there are four moving
parts:
• Sample size
• Effect size you want to detect
• Significance level (alpha) at which the test will be conducted
• Power

Specify any three of them, and the fourth can be calculated. Most commonly, you
would want to calculate sample size, so you must specify the other three.
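In statsmodels, `solve_power` solves for whichever one of `effect_size`, `nobs1`, `alpha`, or `power` is left as `None`. The sketch below reuses the 1.1% versus 1.65% example and the same t-test power class the notebook uses further down; the 2,000-impressions-per-ad figure comes from the text, and the variable names are illustrative.

```python
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = TTestIndPower()
effect = proportion_effectsize(0.0165, 0.011)  # Cohen's h for a 50% boost

# Effect size, sample size and alpha given: solve for power
power_at_2000 = float(analysis.solve_power(effect_size=effect, nobs1=2_000,
                                           alpha=0.05, alternative="larger"))

# Sample size, alpha and power given: solve for the smallest detectable effect size
min_effect = float(analysis.solve_power(nobs1=2_000, alpha=0.05, power=0.8,
                                        alternative="larger"))

print(f"power with 2,000 impressions per ad: {power_at_2000:.2f}")
print(f"smallest effect size detectable with 80% power: {min_effect:.4f}")
```

The low power at 2,000 impressions per ad is consistent with the single draw above failing to reach significance.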

## Procedure 

Here’s a fairly intuitive approach to estimating power by simulation:

1. Start with some hypothetical data that represents your best guess about the data
that will result (perhaps based on prior data)—for example, a box with 20 ones
and 80 zeros to represent a .200 hitter, or a box with some observations of “time
spent on website.”
2. Create a second sample simply by adding the desired effect size to the first sample:
for example, a second box with 33 ones and 67 zeros, or a second box with
25 seconds added to each initial “time spent on website.”
3. Draw a bootstrap sample of size n from each box.
4. Conduct a permutation (or formula-based) hypothesis test on the two bootstrap
samples and record whether the difference between them is statistically
significant.
5. Repeat the preceding two steps many times and determine how often the difference
was significant; that’s the estimated power.
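The five steps above can be sketched in Python. This is a minimal sketch under stated assumptions: it uses the 1.1% versus 1.65% boxes from the click-through example, and for step 4 it swaps in a formula-based one-sided two-proportion z-test (pooled standard error) instead of a permutation test, to keep the repetitions fast. The one-sided direction mirrors the `alternative='larger'` setting used later; `make_box` and `simulated_power` are hypothetical names.

```python
import numpy as np
from scipy import stats


def make_box(n_ones, n_zeros):
    """Hypothetical helper: a 0/1 'box' with n_ones ones and n_zeros zeros."""
    return np.concatenate([np.ones(n_ones, dtype=np.int8),
                           np.zeros(n_zeros, dtype=np.int8)])


def simulated_power(box_a, box_b, n, n_sims=1000, alpha=0.05, seed=0):
    """Estimate power by bootstrapping n draws from each box n_sims times
    and testing (one-sided) whether box B's rate exceeds box A's."""
    rng = np.random.default_rng(seed)
    # Step 3: bootstrap samples of size n; each row is one simulated study,
    # so summing a row counts the ones in that sample.
    ones_a = rng.choice(box_a, size=(n_sims, n), replace=True).sum(axis=1)
    ones_b = rng.choice(box_b, size=(n_sims, n), replace=True).sum(axis=1)
    # Step 4: pooled two-proportion z-test (formula-based, not a permutation test)
    p_a, p_b = ones_a / n, ones_b / n
    pooled = (ones_a + ones_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    # Guard against se == 0 (no ones in either sample): treat z as 0
    z = np.divide(p_b - p_a, se, out=np.zeros_like(se), where=se > 0)
    significant = z > stats.norm.ppf(1 - alpha)
    # Step 5: the share of significant results is the estimated power
    return significant.mean()


box_a = make_box(110, 9_890)   # step 1: 1.1% ones
box_b = make_box(165, 9_835)   # step 2: 1.65% ones (a 50% boost)
for n in (300, 2_000, 5_500):
    print(f"n = {n:,}: estimated power {simulated_power(box_a, box_b, n):.2f}")
```

With these boxes the estimate should rise steadily with n and land near 80% at about 5,500 impressions per ad, in line with the statsmodels calculation later in the notebook.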

In [3]:
from scipy import stats
import statsmodels.api as sm  # provides sm.stats.proportion_effectsize and sm.stats.TTestIndPower
from statsmodels.stats import power

In [31]:
# Effect size (Cohen's h) for a 10% boost, from 1.1% to 1.21%

effect_size = sm.stats.proportion_effectsize(0.0121, 0.011)

In [27]:
effect_size

0.01029785095103608

In [28]:
# Statistical power calculation for a t-test with two independent samples

analysis = sm.stats.TTestIndPower()

In [29]:
result = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                              alternative='larger')

In [30]:
print(f'Sample size: {result:.3f}')

Sample size: 116602.393


**This says we need at least about 117,000 impressions per ad to get 80% power**
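As a sanity check, the usual normal-approximation formula for two equal-sized groups, n ≈ 2(z_alpha + z_power)^2 / h^2, gives almost the same answer, because with this many observations the t distribution is practically normal. Here z_alpha and z_power are standard normal quantiles at 1 − alpha and at the target power; `approx_sample_size` is an illustrative helper.

```python
import numpy as np
from scipy import stats


def approx_sample_size(h, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a one-sided test
    with equal group sizes and standardized effect size h."""
    z_alpha = stats.norm.ppf(1 - alpha)   # one-sided critical value
    z_power = stats.norm.ppf(power)       # quantile for the target power
    return 2 * (z_alpha + z_power) ** 2 / h ** 2


h = 2 * (np.arcsin(np.sqrt(0.0121)) - np.arcsin(np.sqrt(0.011)))  # Cohen's h
print(f"approximate sample size per group: {approx_sample_size(h):,.0f}")
```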

In [32]:
effect_size = sm.stats.proportion_effectsize(0.0165, 0.011)
analysis = sm.stats.TTestIndPower()
result = analysis.solve_power(effect_size=effect_size, 
                              alpha=0.05, power=0.8, alternative='larger')
print('Sample Size: %.3f' % result)

Sample Size: 5488.408


#### Raising the target to a 50% boost (1.65%) requires a much smaller sample, because a larger difference is easier to detect

We see that if we want a power of 80% to detect a 10% boost, we
require a sample size of almost 120,000 impressions per ad. If we are seeking a 50% boost
(p1 = 0.0165), the sample size is reduced to about 5,500 impressions per ad.
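Looping over a few candidate boosts makes the trade-off explicit: the required sample size falls roughly with the inverse square of the effect size. This sketch uses the same statsmodels classes as the cells above; the 25% boost is an extra illustrative value not in the original analysis.

```python
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.011                      # current click-through rate
analysis = TTestIndPower()
sample_sizes = {}
for boost in (0.10, 0.25, 0.50):      # relative improvements we want to detect
    target = baseline * (1 + boost)
    h = proportion_effectsize(target, baseline)   # Cohen's h
    sample_sizes[boost] = float(analysis.solve_power(effect_size=h, alpha=0.05,
                                                     power=0.8, alternative="larger"))
    print(f"{boost:.0%} boost to {target:.3%}: about {sample_sizes[boost]:,.0f} impressions per ad")
```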