# Sample Size Calculation

Sample size calculation is one of the first steps when designing a survey.

Often at the design phase, many variables are of interest.

For the sample size calculation, the investigators must reduce that number, ideally to a single variable or at most a handful.

In this tutorial, we will show how to use `samplics` to calculate sample size for a single variable.

References for more details on sample size calculation:
- Chow, S., Shao, J., Wang, H., Lokhnygina, Y. (2018), Sample Size Calculations in Clinical Research, Third Edition, CRC Press, Taylor & Francis Group.
- Ryan, T.P. (2013), Sample Size Determination and Power, John Wiley & Sons, Inc.

## The Class `SampleSize`

To import the class from the `sampling` module, we run the code below

```python
from samplics.sampling import SampleSize
```

The first step is to instantiate the class, choosing the parameter and method of interest.

```python
SampleSize(
    parameter = "proportion",  # possible values are "mean", "proportion", and "total"
    method = "wald",           # possible values are "wald" or "fleiss"
    stratification = False
)
```


The main method of `SampleSize` is `calculate()` with the following signature

```python
SampleSize.calculate(
    self,
    target,                   # The expected value for the parameter
    half_ci,                  # The desired precision (margin of error / half of the CI)
    sigma = None,             # Optional. The standard deviation
    deff = 1.0,               # Design effect, default is 1
    resp_rate = 1.0,          # Response rate, default is 1
    number_strata = None,     # Optional. Number of strata
    pop_size  = None,         # Optional. Population size
    alpha  = 0.05,            # Type I error, default is 0.05
)
```

### Estimation of a Proportion 

In this situation, we want to estimate the minimum sample size required to estimate a proportion with a desired margin of error. 

The equation for this situation is $$ n_0 = \big(\frac{z_{\alpha/2}}{e}\big)^2 p(1-p)$$ where $z_{\alpha/2}$ is the z-score, $p$ is the expected proportion, and $e$ is the margin of error.

**Example 5.1** 

Calculate the sample size to estimate a proportion `p=0.5` with a margin of error `e=0.03`. 



In [1]:
from samplics.sampling import SampleSize



In [2]:
# Instantiate an object SampleSize for a proportion

size_prop = SampleSize(parameter="proportion")

# Calculate the sample size for the expected proportion and the desired margin of error

size_prop.calculate(target=0.5, half_ci=0.03)

To show the content of the object `size_prop`, we can print the member `__dict__`

In [3]:
size_prop.__dict__

{'parameter': 'proportion',
 'method': 'wald',
 'stratification': False,
 'samp_size': 1068,
 'deff_c': 1.0,
 'deff_w': 1.0,
 'resp_rate': 1.0,
 'pop_size': None,
 'half_ci': 0.03,
 'target': 0.5,
 'sigma': 0.25,
 'alpha': 0.05}

Hence, the calculated sample size is stored in the member `samp_size`

In [4]:
# The calculated sample size is stored in the object

size_prop.samp_size

1068
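As a cross-check, the value 1068 can be reproduced by evaluating $n_0 = (z_{\alpha/2}/e)^2 p(1-p)$ by hand. The sketch below is not part of `samplics`: it uses only the Python standard library, the helper name `wald_size_proportion` is a hypothetical name chosen for illustration, and it assumes the result is rounded up to the next integer.

```python
from math import ceil
from statistics import NormalDist


def wald_size_proportion(p: float, e: float, alpha: float = 0.05) -> int:
    """Evaluate n0 = (z_{alpha/2} / e)^2 * p * (1 - p), rounded up (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z-score, about 1.96 for alpha = 0.05
    return ceil((z / e) ** 2 * p * (1 - p))


print(wald_size_proportion(p=0.5, e=0.03))  # 1068
```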

In Example 5.1, we assumed that the population is very large.

If the population is not large relative to the sample, we should apply the finite population correction: $ n = \frac{n_0}{1+\frac{n_0}{N}} $, where $N$ is the population size.

**Example 5.2** 

Calculate the same sample size as in Example 5.1, assuming that `N=5000`.

In [5]:
# Instantiate an object SampleSize for a proportion
size_prop_fpc = SampleSize(parameter="proportion")

# Calculate the sample size for the expected proportion and the desired margin of error
size_prop_fpc.calculate(target=0.5, half_ci=0.03, pop_size=5000)

# Print the sample size
print(f"\nThe minimum required sample size is: {size_prop_fpc.samp_size}")


The minimum required sample size is: 880
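The correction $n = n_0 / (1 + n_0/N)$ can also be checked by hand. This is a minimal standard-library sketch, not the `samplics` implementation; it assumes the correction is applied to the unrounded $n_0$ and that only the final value is rounded up.

```python
from math import ceil
from statistics import NormalDist

p, e, N, alpha = 0.5, 0.03, 5000, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)
n0 = (z / e) ** 2 * p * (1 - p)  # unrounded size for a very large population
n = n0 / (1 + n0 / N)            # finite population correction

print(ceil(n))  # 880
```

Under this assumption the order of rounding matters: correcting the already rounded value 1068 would give 881 instead of 880.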


**Exercise 5.1**

What is the minimum sample size required to estimate a proportion of 0.70 with a margin of error of 5%? The population size is 2500.

### Estimation of a Mean

In the context of estimating means, the equation for the sample size requires the standard deviation of the variable of interest.

The equation for the sample size is $$ n_0 = \big(\frac{z_{\alpha/2} \sigma}{e}\big)^2 $$ where $z_{\alpha/2}$ is the z-score, $\sigma$ is the standard deviation, and $e$ is the margin of error. With the finite population correction factor, we get 
$$ n = \frac{N z_{\alpha/2}^2 \sigma^2}{(N-1)e^2+z_{\alpha/2}^2 \sigma^2} $$

**Example 5.3** 

Using $\sigma=2$, $e=0.5$, $N=1000$, and $z_{\alpha/2}=1.96$, calculate the minimum required sample size.

In [6]:
# Instantiate an object SampleSize for a mean
size_mean_fpc = SampleSize(parameter="mean")

# Calculate the sample size for the expected standard deviation and the desired margin of error
size_mean_fpc.calculate(half_ci=0.5, sigma=2, pop_size=1000)

# Print the sample size
print(f"\nThe minimum required sample size is: {size_mean_fpc.samp_size}")


The minimum required sample size is: 58


If the population is large enough that we do not need the finite population correction, then we get

In [7]:
# Instantiate an object SampleSize for a mean
size_mean_no_fpc = SampleSize(parameter="mean")

# Calculate the sample size for the expected standard deviation and the desired margin of error
size_mean_no_fpc.calculate(half_ci=0.5, sigma=2)

# Print the sample size
print(f"\nThe minimum required sample size is: {size_mean_no_fpc.samp_size}")


The minimum required sample size is: 62
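Both results for the mean (58 with the finite population correction, 62 without) can be reproduced from $n_0 = (z_{\alpha/2}\sigma/e)^2$ and $n = N z_{\alpha/2}^2\sigma^2 / ((N-1)e^2 + z_{\alpha/2}^2\sigma^2)$. The sketch below is a standard-library illustration; `size_mean` is a hypothetical helper name, not a `samplics` function.

```python
from math import ceil
from statistics import NormalDist


def size_mean(sigma, e, pop_size=None, alpha=0.05):
    """Sample size for a mean; applies the finite population correction if pop_size is given."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    zs2 = (z * sigma) ** 2  # z^2 * sigma^2
    if pop_size is None:
        return ceil(zs2 / e**2)
    return ceil(pop_size * zs2 / ((pop_size - 1) * e**2 + zs2))


print(size_mean(sigma=2, e=0.5, pop_size=1000))  # 58
print(size_mean(sigma=2, e=0.5))                 # 62
```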


**Exercise 5.2** 

What is the minimum sample size required to estimate a mean with a standard deviation of 50 and a margin of error of 5?

## One-Sample Design

Hypothesis testing is a common analysis goal in survey sampling. 

When hypothesis testing is the main goal of the study, investigators need to calculate the sample sizes to appropriately power the tests.

### Comparing Means 

The equation to calculate the minimum required sample size depends on the type of the hypothesis test and other factors. 

With `samplics`, we can use the class `SampleSizeMeanOneSample` to calculate the sample sizes for these one-sample tests of the mean.


```python
SampleSizeMeanOneSample(
    method = "wald",         # Only possible value is "wald" for now
    stratification = False,  # A boolean to indicate whether the sampling design is stratified
    two_sides = True,        # A boolean to indicate whether the test is two-sided or one-sided
    params_estimated = True  # A boolean to indicate whether the parameters are known or estimated
)
```

The main method of `SampleSizeMeanOneSample` is `calculate()` with the following signature

```python 
 SampleSizeMeanOneSample.calculate(
        mean_0,                # mean value under the null hypothesis
        mean_1,                # mean value under the alternative hypothesis
        sigma,                 # standard deviation
        delta = 0.0,           # margin of the test
        deff = 1.0,            # design effect, default is 1.
        resp_rate = 1.0,       # response rate (proportion), default is 1.
        number_strata = None,  # number of strata, default is None
        pop_size = None,       # population size, default is None
        alpha = 0.05,          # significance level, default is 0.05
        power = 0.80,          # power, default is 0.80
    )
```

#### Test for Equality 

Let's assume we have one sample and we are interested in the following hypotheses

$$ H_0: \mu = \mu_0 \quad \text{versus} \quad H_a: \mu \neq \mu_0$$

The equation for the sample size needed to achieve power $1-\beta$ is

$$ n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\epsilon^2}, $$

where $\epsilon$ is the difference $\mu - \mu_0$. This is called the **test for equality**.

**Example 5.4** 

We want to calculate the sample size required to test the difference before and after treatment. The average before treatment is 1.5 and the average after treatment is 2. The standard deviation is 1.

What is the minimum sample size required?

In [8]:
# we import the needed class
from samplics.sampling import SampleSizeMeanOneSample

# We instantiate the object with the default parameters
mean_equality = SampleSizeMeanOneSample()

# We update the object by calculating the sample size
mean_equality.calculate(mean_0=1.5, mean_1=2, sigma=1)

# we print the sample size
print(f"\nThe minimum required sample size is: {mean_equality.samp_size}")


The minimum required sample size is: 32
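The value 32 follows from evaluating $n = (z_{\alpha/2} + z_\beta)^2 \sigma^2 / \epsilon^2$ with $\epsilon = 2 - 1.5$. The lines below are a standard-library sketch for checking the arithmetic, assuming the result is rounded up; they do not call `samplics`.

```python
from math import ceil
from statistics import NormalDist

mean_0, mean_1, sigma = 1.5, 2.0, 1.0
alpha, power = 0.05, 0.80

z_alpha_half = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_beta = NormalDist().inv_cdf(power)                # z_beta with beta = 1 - power
epsilon = mean_1 - mean_0

n = (z_alpha_half + z_beta) ** 2 * sigma**2 / epsilon**2
print(ceil(n))  # 32
```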


#### Test for Noninferiority/Superiority

Similarly, we have the test for **noninferiority**/**superiority** defined as

$$ H_0: \epsilon \leq \delta \quad \text{versus} \quad H_a: \epsilon > \delta$$ which leads to
$$ n = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2}{(\epsilon - \delta)^2}, $$

where $\epsilon = \mu - \mu_0$ is the difference between the mean under the alternative hypothesis (`mean_1`) and the mean under the null hypothesis (`mean_0`), and $\delta$ is the margin of the test. We can use the same class `SampleSizeMeanOneSample` to calculate the sample size for the test for noninferiority.

The test for noninferiority is a one-sided test, hence we need to use `two_sides = False`.

**Example 5.5** 

We want to calculate the sample size required to test the difference before and after treatment. The average before treatment is 1.5 and the average after treatment is 2. The standard deviation is 1.

What is the minimum sample size required **to show that the mean after treatment is no less than before treatment by -0.1**?

In [9]:
# We instantiate the object with two_sides = False
mean_superior = SampleSizeMeanOneSample(two_sides=False)

# We update the object by calculating the sample size
mean_superior.calculate(mean_0=1.5, mean_1=2, delta=-0.1, sigma=1)

# we print the sample size
print(f"\nThe minimum required sample size is: {mean_superior.samp_size}")


The minimum required sample size is: 18
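The value 18 can be reproduced by evaluating $n = (z_\alpha + z_\beta)^2 \sigma^2 / (\epsilon - \delta)^2$ with the one-sided critical value $z_\alpha$. The sketch below assumes $\epsilon$ is taken as `mean_1 - mean_0`, which is the convention that matches the printed result; it uses only the standard library.

```python
from math import ceil
from statistics import NormalDist

mean_0, mean_1, sigma, delta = 1.5, 2.0, 1.0, -0.1
alpha, power = 0.05, 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
z_beta = NormalDist().inv_cdf(power)
epsilon = mean_1 - mean_0                  # assumed sign convention

n = (z_alpha + z_beta) ** 2 * sigma**2 / (epsilon - delta) ** 2
print(ceil(n))  # 18
```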


#### Test for Equivalence

With the test for equivalence, the objective is to test the following hypotheses
$$ H_0: |\epsilon|  \geq \delta \quad \text{versus} \quad H_a: |\epsilon| < \delta$$ which leads to
$$ n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma^2}{(\delta - |\epsilon|)^2}, $$
where $\epsilon = \mu - \mu_0$. We can use the class `SampleSizeMeanOneSample` to calculate the sample size for the test for equivalence.

The test for equivalence is a two-sided test, hence we need to use `two_sides = True` with $\delta \neq 0$.

**Example 5.6** 

We want to calculate the sample size required to test the difference before and after treatment. The average before treatment is 1.5 and the average after treatment is 2. The standard deviation is 1. We will consider the before and after treatment means equivalent if their difference is within 0.1.

What is the minimum sample size required **to show that the means before and after treatment are equivalent**?

In [10]:
# We instantiate the object with two_sides = True
mean_equivalent = SampleSizeMeanOneSample(two_sides=True)

# We update the object by calculating the sample size
mean_equivalent.calculate(mean_0=1.5, mean_1=2, delta=-0.1, sigma=1)

# we print the sample size
print(f"\nThe minimum required sample size is: {mean_equivalent.samp_size}")


The minimum required sample size is: 24
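The printed value 24 is what one obtains by evaluating $n = (z_\alpha + z_{\beta/2})^2 \sigma^2 / (\delta - |\epsilon|)^2$ with the inputs of the cell above. This is a standard-library sketch under that assumption, not a description of the `samplics` internals.

```python
from math import ceil
from statistics import NormalDist

mean_0, mean_1, sigma, delta = 1.5, 2.0, 1.0, -0.1
alpha, power = 0.05, 0.80
beta = 1 - power

z_alpha = NormalDist().inv_cdf(1 - alpha)
z_beta_half = NormalDist().inv_cdf(1 - beta / 2)  # z_{beta/2}
epsilon = mean_1 - mean_0

n = (z_alpha + z_beta_half) ** 2 * sigma**2 / (delta - abs(epsilon)) ** 2
print(ceil(n))  # 24
```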


**Exercise 5.3**

Let us assume that the average income is 45k for men and 35k for women. The standard deviation for both men and women is 5k. What minimum sample size is required to test that the average incomes of men and women are equal with a power of 85% and $\alpha=0.05$?

### Comparing Proportions

To calculate sample size for comparing proportions under one sample, we use the class `SampleSizePropOneSample`.

```python
SampleSizePropOneSample(
    method = "wald",         # Only possible value is "wald" for now
    stratification = False,  # A boolean to indicate whether the sampling design is stratified
    two_sides = True,        # A boolean to indicate whether the test is two-sided or one-sided
    params_estimated = True  # A boolean to indicate whether the parameters are known or estimated
)
```

The main method of `SampleSizePropOneSample` is `calculate()` with the following signature

```python 
 SampleSizePropOneSample.calculate(
        prop_0,                # proportion under the null hypothesis
        prop_1,                # proportion under the alternative hypothesis
        delta = 0.0,           # margin of the test
        deff = 1.0,            # design effect, default is 1.
        resp_rate = 1.0,       # response rate (proportion), default is 1.
        number_strata = None,  # number of strata, default is None
        pop_size = None,       # population size, default is None
        alpha = 0.05,          # significance level, default is 0.05
        power = 0.80,          # power, default is 0.80
    )
```

The API for `SampleSizePropOneSample` is very similar to `SampleSizeMeanOneSample`.

**Example 5.7** 

The goal is to show that a new treatment is at least as good as the reference treatment. Suppose that the proportion of positive outcomes with the new treatment is 50% ($p_1=0.50$) and the reference value is 30% ($p_0=0.30$). Also, suppose that a difference of 10% is considered to be of no clinical significance ($\delta=-0.10$).

What is the required sample size to achieve a power of 80% ($\beta=0.20$) with $\alpha=0.05$?

In [11]:
# we import the needed class
from samplics.sampling import SampleSizePropOneSample

# We instantiate the object with two_sides = False
prop_noninferior = SampleSizePropOneSample(two_sides=False)

# We update the object by calculating the sample size
prop_noninferior.calculate(prop_0=0.30, prop_1=0.50, delta=-0.10)

# we print the sample size
print(f"\nThe minimum required sample size is: {prop_noninferior.samp_size}")


The minimum required sample size is: 18
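The tutorial does not state the formula for proportions. The sketch below assumes the one-sample noninferiority form $n = (z_\alpha + z_\beta)^2 p_1(1-p_1) / (\epsilon - \delta)^2$ with $\epsilon = p_1 - p_0$, as given in Chow et al.; this assumption reproduces the printed value, but it is not confirmed here to be what `samplics` computes internally.

```python
from math import ceil
from statistics import NormalDist

prop_0, prop_1, delta = 0.30, 0.50, -0.10
alpha, power = 0.05, 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
z_beta = NormalDist().inv_cdf(power)
epsilon = prop_1 - prop_0

# Assumed variance term: p1 * (1 - p1), evaluated under the alternative
n = (z_alpha + z_beta) ** 2 * prop_1 * (1 - prop_1) / (epsilon - delta) ** 2
print(ceil(n))  # 18
```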


**Exercise 5.4**

Re-calculate the minimum required sample size as in Example 5.7 with $\delta = -0.05$ and $\beta=0.15$.

## Two-Sample Design

In situations where we are interested in testing means or proportions from different populations, the equations for the sample size calculation change.

### Comparing Means

With `samplics`, we can use the class `SampleSizeMeanTwoSample` to calculate the sample sizes for tests associated with a **two-sample** design for the means.


```python
SampleSizeMeanTwoSample(
    method = "wald",         # Only possible value is "wald" for now
    stratification = False,  # A boolean to indicate whether the sampling design is stratified
    two_sides = True,        # A boolean to indicate whether the test is two-sided or one-sided
    params_estimated = True  # A boolean to indicate whether the parameters are known or estimated
)
```

The main method of `SampleSizeMeanTwoSample` is `calculate()` with the following signature

```python 
 SampleSizeMeanTwoSample.calculate(
        mean_1,                # first mean 
        mean_2,                # second mean 
        sigma_1,               # standard deviation of the first sample
        sigma_2 = None,        # standard deviation of the second sample, default is None
        equal_variance = 1,    # boolean to indicate if variances are equal
        kappa = 1,             # factor to distribute the sample n1 = kappa*n2
        delta = 0.0,           # margin of the test
        deff = 1.0,            # design effect, default is 1.
        resp_rate = 1.0,       # response rate (proportion), default is 1.
        number_strata = None,  # number of strata, default is None
        pop_size = None,       # population size, default is None
        alpha = 0.05,          # significance level, default is 0.05
        power = 0.80,          # power, default is 0.80
    )
```

#### Test for Equality (Equal Variance)

Let's assume we have two samples and we are interested in the following hypotheses

$$ H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_a: \mu_1 \neq \mu_2$$

The equation for the sample size needed to achieve power $1-\beta$ is

$$ n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2(1+1/\kappa)}{\epsilon^2},$$

where $\epsilon$ is the difference $\mu_2 - \mu_1$ and $\kappa$ is the factor to control the balance of the samples. 

**Example 5.8** 

Suppose that we want to conduct a study to compare blood pressure between white and black people. Based on historical data, whites have an average systolic blood pressure (SBP) of 120 mm Hg and blacks have an average SBP of 125 mm Hg. Both populations (whites and blacks) have the same standard deviation of 15 mm Hg.

Given that we want 2/3 of whites and 1/3 of blacks in our sample, what is the minimum sample size required to test for equality of SBP between whites and blacks with a power of 80% and $\alpha=0.05$?

In [12]:
# we import the needed class
from samplics.sampling import SampleSizeMeanTwoSample

# We instantiate the object with two_sides = True
mean_equality2 = SampleSizeMeanTwoSample(two_sides=True)

# We update the object by calculating the sample size
mean_equality2.calculate(mean_1=120, mean_2=125, sigma_1=15, equal_variance=True, kappa=2)

# we print the sample size
print(f"\nThe minimum required sample sizes are: {mean_equality2.samp_size}")


The minimum required sample sizes are: (212, 106)
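The pair (212, 106) can be reproduced from $n_2 = (z_{\alpha/2} + z_\beta)^2 \sigma^2 (1 + 1/\kappa) / \epsilon^2$ and $n_1 = \kappa n_2$, using the inputs passed to `calculate()` in the cell above. This standard-library sketch assumes $n_2$ is rounded up before $n_1$ is derived from it.

```python
from math import ceil
from statistics import NormalDist

mean_1, mean_2, sigma, kappa = 120, 125, 15, 2
alpha, power = 0.05, 0.80

z_alpha_half = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(power)
epsilon = mean_2 - mean_1

n_2 = ceil((z_alpha_half + z_beta) ** 2 * sigma**2 * (1 + 1 / kappa) / epsilon**2)
n_1 = kappa * n_2  # kappa = 2 puts two thirds of the total sample in group 1

print((n_1, n_2))  # (212, 106)
```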


**Exercise 5.5**

How would you change the code in the cell above to ensure that the sample size for whites is half of the one for blacks?

#### Test for Noninferiority/Superiority and Equivalence (Equal Variance)

As in the one-sample situation, we can compute the minimum required sample sizes for both tests.

The sample sizes for the test for **noninferiority**/**superiority** ($H_0: \epsilon \leq \delta \quad \text{versus} \quad H_a: \epsilon > \delta$) are

$$ n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_\alpha + z_\beta)^2 \sigma^2(1+1/\kappa)}{(\epsilon-\delta)^2},$$

The sample sizes for the test for **equivalence** ($H_0: |\epsilon| \geq \delta \quad \text{versus} \quad H_a: |\epsilon| < \delta$) are

$$ n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_\alpha + z_{\beta/2})^2 \sigma^2(1+1/\kappa)}{(\delta-|\epsilon|)^2},$$

where $\epsilon = \mu_2 - \mu_1$.

We can use the same class `SampleSizeMeanTwoSample` to calculate the sample sizes for these tests.

The test for noninferiority/superiority is a one-sided test, hence we need to use `two_sides = False`.
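To make the two equal-variance formulas concrete, they can be written as small functions. This is an illustrative standard-library sketch: the function names and the margins `delta=-2` and `delta=8` are hypothetical values chosen for the example, $\epsilon$ is taken as `mean_2 - mean_1`, and the results are not `samplics` output.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def two_sample_noninferiority(mean_1, mean_2, sigma, delta, kappa=1, alpha=0.05, power=0.80):
    """n2 = (z_alpha + z_beta)^2 sigma^2 (1 + 1/kappa) / (eps - delta)^2, and n1 = kappa * n2."""
    eps = mean_2 - mean_1
    n_2 = ceil((Z(1 - alpha) + Z(power)) ** 2 * sigma**2 * (1 + 1 / kappa) / (eps - delta) ** 2)
    return kappa * n_2, n_2


def two_sample_equivalence(mean_1, mean_2, sigma, delta, kappa=1, alpha=0.05, power=0.80):
    """n2 = (z_alpha + z_{beta/2})^2 sigma^2 (1 + 1/kappa) / (delta - |eps|)^2, and n1 = kappa * n2."""
    eps = mean_2 - mean_1
    z_beta_half = Z(1 - (1 - power) / 2)
    n_2 = ceil((Z(1 - alpha) + z_beta_half) ** 2 * sigma**2 * (1 + 1 / kappa) / (delta - abs(eps)) ** 2)
    return kappa * n_2, n_2


# Hypothetical margins, for illustration only
print(two_sample_noninferiority(120, 125, sigma=15, delta=-2, kappa=2))  # (86, 43)
print(two_sample_equivalence(120, 125, sigma=15, delta=8, kappa=2))      # (644, 322)
```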

### Comparing Proportions

With `samplics`, we can use the class `SampleSizePropTwoSample` to calculate the sample sizes for tests associated with a **two-sample** design for the proportions.


```python
SampleSizePropTwoSample(
    method = "wald",         # Only possible value is "wald" for now
    stratification = False,  # A boolean to indicate whether the sampling design is stratified
    two_sides = True,        # A boolean to indicate whether the test is two-sided or one-sided
    params_estimated = True  # A boolean to indicate whether the parameters are known or estimated
)
```

The main method of `SampleSizePropTwoSample` is `calculate()` with the following signature

```python 
 SampleSizePropTwoSample.calculate(
        prop_1,                # first proportion 
        prop_2,                # second proportion
        kappa = 1,             # factor to distribute the sample n1 = kappa*n2
        delta = 0.0,           # margin of the test
        deff = 1.0,            # design effect, default is 1.
        resp_rate = 1.0,       # response rate (proportion), default is 1.
        number_strata = None,  # number of strata, default is None
        pop_size = None,       # population size, default is None
        alpha = 0.05,          # significance level, default is 0.05
        power = 0.80,          # power, default is 0.80
    )
```

**Example 5.9** 

Suppose that we want to conduct a study to compare the prevalence of obesity between white and black people. Based on historical data, the prevalence of obesity is 42% among whites and 50% among blacks.

Given that we want 2/3 of whites and 1/3 of blacks in our sample, what is the minimum sample size required to test for equality of the prevalence of obesity between whites and blacks with a power of 80% and $\alpha=0.05$?

In [13]:
# we import the needed class
from samplics.sampling import SampleSizePropTwoSample

# We instantiate the object with two_sides = True
prop_equality2 = SampleSizePropTwoSample(two_sides=True)

# We update the object by calculating the sample size
prop_equality2.calculate(prop_1=0.42, prop_2=0.50, kappa=2)

# we print the sample size
print(f"\nThe minimum required sample sizes are: {prop_equality2.samp_size}")


The minimum required sample sizes are: (912, 456)
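The tutorial does not spell out the two-sample formula for proportions. The sketch below assumes the form from Chow et al., $n_2 = (z_{\alpha/2} + z_\beta)^2 \left[ p_1(1-p_1)/\kappa + p_2(1-p_2) \right] / \epsilon^2$ with $n_1 = \kappa n_2$ and $\epsilon = p_2 - p_1$. That assumption reproduces the printed pair, but it is not confirmed here to be the `samplics` implementation.

```python
from math import ceil
from statistics import NormalDist

prop_1, prop_2, kappa = 0.42, 0.50, 2
alpha, power = 0.05, 0.80

z_alpha_half = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(power)
epsilon = prop_2 - prop_1

# Assumed variance term for unequal allocation n1 = kappa * n2
variance_term = prop_1 * (1 - prop_1) / kappa + prop_2 * (1 - prop_2)

n_2 = ceil((z_alpha_half + z_beta) ** 2 * variance_term / epsilon**2)
n_1 = kappa * n_2

print((n_1, n_2))  # (912, 456)
```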


## Stratified Design

In a stratified design, the population is divided into $H$ partitions or strata. 

A sample is selected independently from each stratum.

We can consider two approaches for sample size calculation under stratified design.

The `samplics` APIs shown above support stratification.

When instantiating the objects, we can indicate that it's a stratified design using `stratification = True`.

The parameters should then be provided by stratum using Python dictionaries.

For example, assume that we have a stratified design with four strata: `North`, `South`, `West`, and `East`. We can provide the expected prevalence for each stratum as follows

```python
prevalence = {
    "North": 0.50,
    "South": 0.65,
    "West": 0.55,
    "East": 0.45
}
```

**Example 5.10** 

Assume that we have a stratified design with four strata: `North`, `South`, `West`, and `East`. We want to conduct a study about the prevalence of obesity. From previous studies, we have that:
- Prevalence rates were 50% in the North, 65% in the South, 55% in the West, and 45% in the East. 
- Response rates were 90% in the North, 85% in the South, 85% in the West, and 70% in the East. 
- Design effects were 1.2 in the North, 1.1 in the South, 1.4 in the West, and 1.7 in the East. 

What are the required minimum sample sizes for the strata to achieve a margin of error of 7%?

In [14]:
# Enter the data into Python dictionaries

prevalence = {"North": 0.50, "South": 0.65, "West": 0.55, "East": 0.45}

response = {"North": 0.90, "South": 0.85, "West": 0.85, "East": 0.70}

deff_c = {"North": 1.2, "South": 1.1, "West": 1.4, "East": 1.7}


Then we can use the parameters to calculate the sample sizes


In [15]:
# Instantiate an object SampleSize for a proportion
obesity_study = SampleSize(parameter="proportion", stratification=True)

# Calculate the sample size for the expected proportion and the desired margin of error
obesity_study.calculate(target=prevalence, half_ci=0.07, deff=deff_c, resp_rate=response)

# Print the sample size
print(f"\nThe minimum required sample size is: {obesity_study.samp_size}")


The minimum required sample size is: {'North': 236, 'South': 197, 'West': 272, 'East': 330}
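The stratum sizes printed above are matched by applying the proportion formula within each stratum and inflating by the design effect, $n_h = \text{deff}_h \, (z_{\alpha/2}/e)^2 \, p_h(1-p_h)$, rounded up. The sketch below is a standard-library check under that assumption. It does not use the response rates, and how `resp_rate` enters the reported `samp_size` may depend on the `samplics` version.

```python
from math import ceil
from statistics import NormalDist

prevalence = {"North": 0.50, "South": 0.65, "West": 0.55, "East": 0.45}
deff_c = {"North": 1.2, "South": 1.1, "West": 1.4, "East": 1.7}
half_ci, alpha = 0.07, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)

# One independent calculation per stratum, inflated by that stratum's design effect
sizes = {
    stratum: ceil(deff_c[stratum] * (z / half_ci) ** 2 * p * (1 - p))
    for stratum, p in prevalence.items()
}

print(sizes)  # {'North': 236, 'South': 197, 'West': 272, 'East': 330}
```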



If more convenient, we can use the method `to_dataframe()` to transform the data into a Pandas DataFrame


In [16]:
obesity_study.to_dataframe()

|   | _parameter | _stratum | _target | _sigma | _half_ci | _samp_size |
|---|---|---|---|---|---|---|
| 0 | proportion | North | 0.5 | 0.25 | 0.07 | 236 |
| 1 | proportion | South | 0.65 | 0.2275 | 0.07 | 197 |
| 2 | proportion | West | 0.55 | 0.2475 | 0.07 | 272 |
| 3 | proportion | East | 0.45 | 0.2475 | 0.07 | 330 |


**Exercise 5.6**

Assume that we want to calculate sample size for a stratified study. The stratification variable is race and it takes three values: black, hispanic, and white.  From previous studies, we have the following data on income

```python
mean_women = {"black": 25000, "hispanic": 30000, "white":35000}
mean_men = {"black": 35000, "hispanic": 33000, "white":55000}
sigma = {"black": 1200, "hispanic": 3000, "white":3200}
```

For each of these groups, we want to test whether men and women have the same income on average. Calculate the required sample size for each group for a power of 80% and $\alpha=0.05$. We will assume that both the response rate and the design effect are equal to 1.