In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(plot_style = False)

In [2]:
os.chdir(path)

Effective way of performing sampling

We don't have to test the entire population; we can test a sample instead. Conducting a survey on just a sample is less time-consuming and less expensive than performing a population census, as long as we choose the sample so that it's fairly representative of the population.

The key to sampling is to choose a sample that's representative of the population. If it's representative, it has similar characteristics (e.g. where the center of the data is) to the population.

We need a mechanism for making sure our samples are a reliable representation of the population.

- Define the target population
- Define the sampling unit
- Define the sampling frame: a list of all the sampling units within our target population, preferably with each sampling unit either named or numbered. This is often impractical, in which case we can settle for a list showing how many units there are within each characteristic, e.g. how many gumballs of each color there are in the population.

e.g. when sampling from a list of registered customers, we need to make sure the sample is fairly representative by customer gender and age group. So we can use stratified sampling, where we divide the customers into different strata, then use simple random sampling to choose a proportionate number of customers from each stratum.
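The stratified sampling idea can be sketched in a few lines. This is a minimal illustration, not a production sampler: the `stratified_sample` helper, the `customers` list and its 600/400 gender split are all made up for the example, and only the standard library is used.

```python
import random
from collections import defaultdict

def stratified_sample(units, key, fraction, seed=0):
    """Draw a simple random sample of `fraction` of the units in each stratum."""
    rng = random.Random(seed)

    # group the sampling units into strata, e.g. by gender
    strata = defaultdict(list)
    for unit in units:
        strata[key(unit)].append(unit)

    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)      # proportionate allocation
        sample.extend(rng.sample(members, k))   # simple random sampling within the stratum
    return sample

# hypothetical sampling frame: 600 female and 400 male customers
customers = [{'id': i, 'gender': 'F' if i < 600 else 'M'} for i in range(1000)]
sample = stratified_sample(customers, key=lambda c: c['gender'], fraction=0.1)

n_female = sum(c['gender'] == 'F' for c in sample)
n_male = len(sample) - n_female
print(n_female, n_male)
```

Because each stratum contributes in proportion to its size, the sample keeps the same 60/40 gender split as the (made-up) sampling frame.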

We know that an unbiased sample shares the same characteristics as its parent population, but what's the best way to analyze the sampled data points?

Estimating the population mean: We'd expect the distribution of the population to be a similar shape to that of the sample, so we can assume that the mean of the sample and the population should have the same value. Note that we can't say that they exactly match, but it's the best estimate we can make.

The population mean is often denoted as $\mu$, the estimated population mean as $\hat{\mu}$, and the mean of the sample as $\bar{x}$. So here we're basically saying $\hat{\mu} = \bar{x}$: we're using the sample mean to estimate the mean of the population. Usually, the larger our sample, the more accurate this point estimate of the population mean is going to be.

---

It's great that we've come up with a good estimate for the population mean, but what about the variance? If we can come up with a good estimate of the population variance, then we will be able to tell how much variation there's likely to be in the population based on the results of the sample data.

When we choose a sample, we have a smaller number of values than the population, and since we have fewer values, there's a good chance that they're more clustered around the mean. The rationale is that more extreme values are less likely to be in your sample.

If the size of the sample is $n$, we can estimate the population variance using:

$$
\hat{\sigma}^2 = \frac{\sum(x - \bar{x})^2}{n - 1}
$$

It's just like finding the variance of the values in the sample, but dividing by $n - 1$ instead of $n$. Dividing by $n - 1$ gives a higher value than dividing by $n$, and this difference is more noticeable when $n$ is fairly small. Since the population variance tends to be higher than the variance of the data in the sample, this formula is a slightly better point estimate for the population variance.
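As a small illustration of the two point estimators, the sketch below uses Python's `statistics` module on a made-up sample of five values. `statistics.pvariance` divides by $n$, while `statistics.variance` divides by $n - 1$ as in the formula above.

```python
import statistics

sample = [4.0, 7.0, 5.0, 6.0, 8.0]  # made-up sample values

# point estimate of the population mean: the sample mean
x_bar = statistics.mean(sample)

# variance of the values in the sample itself (divides by n)
sample_variance = statistics.pvariance(sample)

# point estimate of the population variance (divides by n - 1)
estimated_pop_variance = statistics.variance(sample)

print(x_bar, sample_variance, estimated_pop_variance)
```

The squared deviations from the mean sum to 10 here, so the two results differ only in dividing by 5 versus 4. With numpy the same choice is made through the `ddof` argument, e.g. `np.var(sample, ddof=1)`.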

This time we did a survey on a random sample of people and asked them whether they prefer gumballs produced by company A or those of company A's main rival. We're hoping we can use the result to predict what proportion of the population is likely to prefer company A.

We found that in a sample of 40 people, 32 of them prefer company A and the other 8 prefer its rival. So how do we use the sample data to predict the proportion of successes in the population? In the same way that our best estimate of the population mean is the mean of the sample, our best guess for the proportion of successes in the population has to be the proportion of successes in the sample, i.e. $\hat{p} = p_s$: the point estimator for the proportion of successes in the population is the proportion of successes in the sample.

- Look at all possible samples the same size as the one we're considering
- Look at the distribution formed by all the samples, and find the expectation and variance for the population
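The two steps above can be approximated by simulation. The sketch below is only an illustration: we cannot enumerate every possible sample, so it draws 10,000 random samples of size 40 instead, and it assumes (purely for the example) that the true population proportion equals the survey estimate of 0.8. The helper `sample_proportion` is hypothetical.

```python
import random
import statistics

# point estimate from the survey: 32 of 40 people prefer company A
n = 40
p_hat = 32 / n

# assumption for illustration only: pretend the true proportion is 0.8
true_p = 0.8
rng = random.Random(42)

def sample_proportion(rng, p, n):
    """Proportion of successes in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

# distribution formed by many samples of the same size
proportions = [sample_proportion(rng, true_p, n) for _ in range(10000)]

print(statistics.mean(proportions))       # should be close to true_p
print(statistics.pvariance(proportions))  # should be close to true_p * (1 - true_p) / n
```

The average of the simulated sample proportions lands near the assumed population proportion, and their variance lands near $p(1 - p) / n$, the standard result for the variance of a sample proportion.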

Suppose we ran an A/B test with two different versions of a web page, a and b, for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

In [34]:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'version': ['A', 'B'],
    'not_converted': [4514, 4473],
    'converted': [486, 527]
})[['version', 'not_converted', 'converted']]
data

|     | version | not_converted | converted |
|:---:|:-------:|:-------------:|:---------:|
|  0  |    A    |     4514      |    486    |
|  1  |    B    |     4473      |    527    |


It is trivial to compute the conversion rate of each version: 486/(486 + 4514) = 9.72% for a and 527/(527 + 4473) = 10.54% for b. With such a relatively small difference, however, can we convincingly say that version b converts better? To test the statistical significance of a result like this, a hypothesis test can be used.
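For completeness, the conversion rates can be computed from the counts in the contingency table. This is a plain-Python sketch; the `counts` dictionary simply restates the table and is not part of the original notebook.

```python
# counts restated from the contingency table above
counts = {
    'A': {'not_converted': 4514, 'converted': 486},
    'B': {'not_converted': 4473, 'converted': 527},
}

rates = {}
for version, row in counts.items():
    visitors = row['not_converted'] + row['converted']
    rates[version] = row['converted'] / visitors  # conversions per visitor

for version, rate in rates.items():
    print('{}: {:.2%}'.format(version, rate))
```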


## Background

An appropriate hypothesis test here is Pearson’s chi-squared test. There are two types of the chi-squared test, goodness of fit and test of independence, but it is the latter which is useful for the case in question. The reason as to why a test of “independence” is applicable becomes clear by converting the contingency table into a probability matrix by dividing each element by the grand total of frequencies:

In [32]:
cols = ['not_converted', 'converted']
data[cols] = data[cols] / data[cols].values.sum()
data

|     | version | not_converted | converted |
|:---:|:-------:|:-------------:|:---------:|
|  0  |    A    |    0.4514     |  0.0486   |
|  1  |    B    |    0.4473     |  0.0527   |


We will denote $V$ as the version of the web page ($a$ or $b$) and $C$ as the conversion result, $f$ (false, did not convert) or $t$ (true, did convert). The table that we computed above, which is the data that we observed, can then be translated into this form:

| Version (V) | Did Not Convert (f)  | Did in fact Convert (t) |
|:-----------:|:--------------------:|:-----------------------:|
|    A        | $P(V = a, C = f)$    | $P(V = a, C = t)$       |
|    B        | $P(V = b, C = f)$    | $P(V = b, C = t)$       |



Now, our interest is whether the conversion $C$ depends on the page version $V$, and if it does, to learn which version converts better. In probability theory, the events $C$ and $V$ are said to be independent if the joint probability can be computed by $P(V, C) = P(V) \cdot P(C)$, where $P(V)$ and $P(C)$ are marginal probabilities of $V$ and $C$, respectively. It is straightforward to compute the marginal probabilities from row and column marginals:

$$P(V = a) = \frac{4514 + 486}{10000}  \hspace{1cm} P(V = b) = \frac{4473 + 527}{10000}$$
$$P(C = f) = \frac{4514 + 4473}{10000} \hspace{1cm} P(C = t) = \frac{486 + 527}{10000}$$

The null hypothesis is that $V$ and $C$ are independent, in which case the elements of the matrix, i.e. the distribution that we're expecting, are equivalent to:

| Version (V) | Did Not Convert (f)  | Did in fact Convert (t) |
|:-----------:|:--------------------:|:-----------------------:|
|    A        | $P(V = a)P(C = f)$   | $P(V = a)P(C = t)$      |
|    B        | $P(V = b)P(C = f)$   | $P(V = b)P(C = t)$      |


The conversion $C$ is said to be dependent on the version $V$ of the web site if this null hypothesis is rejected. Hence rejecting the null hypothesis means that one version is better at converting than the other. This is the reason why the test is on independence.
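To make the expected distribution concrete, the sketch below computes the marginal probabilities from the observed counts and multiplies them together, which is exactly what independence implies. The variable names are our own, and numpy is assumed to be available as in the rest of the notebook.

```python
import numpy as np

# observed counts: rows are versions (A, B), columns are (not converted, converted)
observed = np.array([[4514, 486],
                     [4473, 527]])
total = observed.sum()

# marginal probabilities from the row and column totals
p_version = observed.sum(axis = 1) / total  # P(V = a), P(V = b)
p_convert = observed.sum(axis = 0) / total  # P(C = f), P(C = t)

# under independence, each joint probability is the product of its marginals
expected_prob = np.outer(p_version, p_convert)

# scale back to frequencies so they are comparable with the observed counts
expected_freq = expected_prob * total
print(expected_freq)
```

Since both versions received the same number of visitors, the expected frequencies are identical for the two rows: 4493.5 non-conversions and 506.5 conversions each.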

The chi-squared test compares an observed distribution $O_{ij}$ to an expected distribution $E_{ij}$:

\begin{align}
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{align}

The $\chi^2$ statistic gives a way of measuring the difference between the frequencies we observe and the frequencies we expect. The smaller the value of $\chi^2$, the smaller the overall difference between the observed and expected frequencies. The formula divides each squared difference by the expected frequency so that each difference is measured relative to the frequency we expect.

In [61]:
from scipy.stats import chi2_contingency


data = pd.DataFrame({
    'version': ['A', 'B'],
    'not_converted': [4514, 4473],
    'converted': [486, 527]
})[['version', 'not_converted', 'converted']]
data = data.set_index('version')
observed = data.values
print(observed)

# correction = False turns off Yates' continuity correction
result = chi2_contingency(observed, correction = False)
chisq, p = result[:2]
print('chisq = {}, p = {}'.format(chisq, p))

[[4514  486]
 [4473  527]]
chisq = 1.8464754013996965, p = 0.17419388311716985
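The statistic and p-value reported above can be reproduced directly from the $\chi^2$ formula. The sketch below uses only the standard library. It relies on the table having $(2 - 1)(2 - 1) = 1$ degree of freedom, in which case the upper-tail probability of the chi-squared distribution can be written with the complementary error function. That shortcut holds for one degree of freedom only.

```python
import math

observed = [[4514, 486],
            [4473, 527]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chisq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # expected frequency under independence
        e = row_totals[i] * col_totals[j] / total
        chisq += (o - e) ** 2 / e

# upper-tail probability of a chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chisq / 2))
print('chisq = {}, p = {}'.format(chisq, p_value))
```

Every cell differs from its expected frequency by 20.5, so the statistic is driven by the two `converted` cells, whose expected frequency (506.5) is much smaller than that of the `not_converted` cells (4493.5).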


In [47]:
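# conversion rate of each version
control = 486 / 5000
experiment = 527 / 5000

# pooled conversion rate under the null hypothesis of no difference
pooled = (486 + 527) / (5000 + 5000)
standard_error = np.sqrt(pooled * (1 - pooled) * (1 / 5000 + 1 / 5000))
zscore = (experiment - control) / standard_error
zscore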

1.3588507649479744

In [66]:
from scipy.stats import norm

# two-sided p-value under the standard normal distribution
n = norm(loc = 0, scale = 1)
(1 - n.cdf(zscore)) * 2

0.17419388311717388
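The two-proportion z-test above returns the same p-value as the chi-squared test. For a 2x2 table without continuity correction this is expected, since the squared z-score equals the $\chi^2$ statistic. The sketch below checks this numerically with the standard library; the variable names are our own.

```python
import math

conv_a, conv_b = 486, 527
n_a, n_b = 5000, 5000

# pooled conversion rate and standard error under the null hypothesis
pooled = (conv_a + conv_b) / (n_a + n_b)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z = (conv_b / n_b - conv_a / n_a) / std_err

# two-sided p-value of a standard normal variable
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print('z = {}, z^2 = {}, p = {}'.format(z, z ** 2, p_two_sided))
```

Squaring the z-score recovers the chi-squared statistic computed earlier (about 1.846), and the two-sided p-value matches as well (about 0.174).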

In [63]:
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportions_chisquare

counts = np.array([486, 527])
nobs = np.array([5000, 5000])
stat, pval = proportions_ztest(counts, nobs)
print(stat)
print('{0:0.3f}'.format(pval))

-1.35885076495
0.174


In [59]:
stat, pval, table = proportions_chisquare(counts, nobs)
print(stat)
print('{0:0.3f}'.format(pval))

1.8464754014
0.174
