In [1]:
import numpy as np
import pandas as pd
import altair as alt

# Lab 1: Sampling design and statistical bias

In the following scenarios you'll explore through simulation how nonrandom sampling can produce datasets with statistical properties that are distorted relative to the population that the sample was drawn from. This kind of distortion is known as **bias**.

In common usage, the word 'bias' means disproportion or unfairness. In statistics, the concept has the same connotation -- biased sampling favors certain observational units over others, and biased estimates are estimates that favor larger or smaller values than the truth.

This lab has you explore sampling bias. The goal is to refine your understanding about what (statistical) bias is and is not, and develop your intuition about potential mechanisms by which bias is introduced and the effect that this can have on sample statistics. 

### Objectives

* Simulate biased and unbiased sampling designs
* Examine the impact of sampling bias on the sample mean
* Apply a simple bias correction by inverse probability weighting

---

## Background

#### Sampling design

The **sampling design** of a study refers to _**the way observational units are selected**_ from the sampling frame (the collection of all observable units). Any design can be expressed by the probability that each unit is included in the sample. In a random sample, all units are equally likely to be included.

For example, you might want to learn about U.S. residents (population), but for ethical reasons be able to study only adults (sampling frame), and decide to conduct a mail survey of 2000 randomly selected addresses in each state (sampling design). (This is not a random sample of all individuals, because individuals share addresses and population sizes differ from state to state.)

#### Bias

Formally, **bias** describes _**the 'typical' deviation of a sample statistic (observed) from its population counterpart (unobserved)**_. 

For example, if a particular sampling design tends to produce an average measurement around 1.5 units, but the true average in the population is 2 units, then the estimate has a bias of -0.5 units. The language 'typical' and 'tends to' is important here. Estimates are rarely (almost never) perfect, so just because an estimate is off by -0.5 units for one sample doesn't make it biased -- it is only biased if it is *consistently* off under repeated sampling. 
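
The same idea can be written in symbols. The notation here is an assumption introduced only for this illustration and is not used elsewhere in the lab: $\hat{\theta}$ denotes the sample statistic, $\theta$ its population counterpart, and $\mathbb{E}$ the average over repeated samples drawn with the same design.

$$\text{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$

In the example above, $\mathbb{E}[\hat{\theta}] = 1.5$ and $\theta = 2$, so the bias is $1.5 - 2 = -0.5$ units.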

Although bias is technically a property of a sample statistic (like the sample average), it's common to talk about a biased *sample* -- this term refers to a dataset collected using a sampling design that produces biased statistics. 

This is exactly what you'll explore in this lab -- the relationship between sampling design and bias.

#### Simulated data

You will be simulating data in this lab. **Simulation** is a great means of exploration for the present topic _**because you can control the population properties**_. 

When working with real data, you just have one dataset, and you don't know any of the properties of the population or what might have happened if a different sample were collected. That makes it difficult to understand sampling variation and impossible to directly compare the sample properties to the population properties! 

With simulated data, by contrast, you control how data are generated with exact precision -- so by extension, you know everything there is to know about the population. In addition, repeated simulation of data makes it possible to explore the typical behavior of a particular sampling design, so you can learn 'what usually happens' for a particular sampling design by direct observation.

---

## Scenario 1: unbiased samples

In this scenario you'll compare the sample mean and the distribution of sample values for a single variable with the population mean and distribution for an unbiased sampling design.

### Hypothetical population

To provide a little context to this scenario, imagine that you're measuring eucalyptus seeds to determine their typical diameter. The cell below simulates diameter measurements for a hypothetical population of 5000 seeds; imagine that this is the total number of seeds in a small grove at some point in time.

In [2]:
# simulate seed diameters
np.random.seed(40221) # for reproducibility
population = pd.DataFrame(
    data = {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000), 
            'seed': np.arange(5000)}
).set_index('seed')

# check first few rows
population.head(3)

seed,diameter
0,0.831973
1,1.512187
2,0.977392


#### Question 1a

Calculate the mean diameter for the hypothetical population.

In [3]:
# solution
np.mean(population.diameter)

1.0189291497049837

#### Question 1b

Calculate the standard deviation of diameters for the hypothetical population.

In [4]:
# solution
np.std(population.diameter)

0.7238573219955743

The cell below produces a histogram of the population values -- the distribution of diameter measurements among the hypothetical population -- with a vertical line indicating the population mean.

In [5]:
# base layer
base_pop = alt.Chart(population).properties(width = 400, height = 300)

# histogram of diameter measurements
hist_pop = base_pop.mark_bar(opacity = 0.8).encode(
    x = alt.X('diameter', 
              bin = alt.Bin(maxbins = 20), 
              title = 'Diameter (mm)', 
              scale = alt.Scale(domain = (0, 6))),
    y = alt.Y('count()', title = 'Number of seeds in population')
)

# vertical line for population mean
mean_pop = base_pop.mark_rule(color='blue').encode(
    x = 'mean(diameter)'
)

# display
hist_pop + mean_pop

### Hypothetical sampling design

Imagine that your sampling design involves collecting bunches of plant material from several locations in the grove and sifting out the seeds with a fine sieve until you obtain 250 seeds. We'll suppose that using your collection method, any of the 5000 seeds is equally likely to be obtained, so that your 250 seeds comprise a *random sample* of the population.

We can simulate samples obtained using your hypothetical design by drawing values without replacement from the population.

In [6]:
# draw a random sample of seeds
np.random.seed(40221) # for reproducibility
sample = population.sample(n = 250, replace = False)
sample

seed,diameter
2885,0.733690
1516,0.419389
1044,0.235597
1944,1.559574
270,0.615990
...,...
3541,0.680930
191,0.842957
1792,1.124163
1797,1.687671


#### Question 1c

Calculate the mean diameter of seeds in the simulated sample. Is it close to the population mean?

In [7]:
# solution
np.mean(sample.diameter)

0.9777218824084053

#### Answer

*It is slightly less than the population mean diameter, but still pretty close.*

The cell below produces a histogram of the sample values, and displays it alongside the histogram of population values.

In [8]:
# base layer
base_samp = alt.Chart(sample).properties(width = 400, height = 300)

# histogram of diameter measurements
hist_samp = base_samp.mark_bar(opacity = 0.8).encode(
    x = alt.X('diameter', 
              bin = alt.Bin(maxbins = 20),
              scale = alt.Scale(domain = (0, 6)),
              title = 'Diameter (mm)'),
    y = alt.Y('count()', title = 'Number of seeds in sample')
)

# vertical line for sample mean
mean_samp = base_samp.mark_rule(color='blue').encode(
    x = 'mean(diameter)'
)

# display
hist_samp + mean_samp | hist_pop + mean_pop

Notice that while there are some small differences, the overall shape is similar and the sample mean is close to the population mean (about 0.98 versus 1.02). So with this sampling design, you obtained a dataset with few distortions of the population properties, and the sample mean is a good estimate of the population mean.

### Assessing bias

You may wonder: *does that happen all the time, or was this just a lucky draw?* This question can be answered by simulating a large number of samples to see whether the undistorted representation of the population is typical for this sampling design. To simplify life a little, let's focus on whether the sample mean is usually accurate.

The cell below estimates the bias of the sample mean by:

* drawing 1000 samples of size 250;
* storing the sample mean from each sample; 
* computing the average difference between the sample means and the population mean.

In [9]:
np.random.seed(40221) # for reproducibility

# number of samples to simulate
nsim = 1000

# storage for the sample means
samp_means = np.zeros(nsim)

# repeatedly sample and store the sample mean
for i in range(0, nsim):
    # select the diameter column so that a single number (not a Series) is stored
    samp_means[i] = population.sample(n = 250, replace = False).diameter.mean()

The bias of the sample mean is its average deviation from the population mean, that is, the average of the signed differences between the two. We can estimate this using our simulation results as follows:

In [10]:
# bias
samp_means.mean() - population.diameter.mean()

-0.0012458197406362004

So the average error observed in 1000 simulations was about -0.001 mm, which is negligible next to a population mean of about 1.02 mm! This suggests that the sample mean is *unbiased*: on average, there is essentially no error. Therefore, at least with respect to estimating the population mean, random samples appear to be *unbiased samples*.

However, **unbiasedness does not mean that you won't observe estimation error**. There is a natural amount of variability from sample to sample, because in each sample a different collection of seeds is measured.

The cell below plots a histogram representing the distribution of values of the sample mean across the 1000 samples you simulated (this is known as the *sampling distribution* of the sample mean). It shows a peak right at the population mean (blue vertical line) but some symmetric variation to either side -- most values are between about 0.93 and 1.12.

In [11]:
# plot the simulated sampling distribution
sampling_dist = alt.Chart(pd.DataFrame({'sample mean': samp_means})).mark_bar().encode(
    x = alt.X('sample mean', bin = alt.Bin(maxbins = 30), title = 'Value of sample mean'),
    y = alt.Y('count()', title = 'Number of simulations')
)

sampling_dist + mean_pop
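
The spread of this sampling distribution can also be checked numerically. The sketch below is a hedged illustration rather than part of the original lab: it rebuilds the same seed population, repeats the simulation, and compares the standard deviation of the simulated sample means with the textbook standard error for sampling without replacement from a finite population. The names `simulated_se` and `theoretical_se` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# rebuild the hypothetical seed population from scenario 1
np.random.seed(40221)
population = pd.DataFrame(
    {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000)}
)

N = len(population)   # population size
n = 250               # sample size
nsim = 1000           # number of simulated samples

# simulate the sampling distribution of the sample mean
samp_means = np.array([
    population.diameter.sample(n = n, replace = False).mean()
    for _ in range(nsim)
])

# spread of the simulated sample means
simulated_se = samp_means.std()

# textbook standard error with the finite population correction
sigma = population.diameter.std(ddof = 0)
theoretical_se = sigma/np.sqrt(n) * np.sqrt((N - n)/(N - 1))

print(simulated_se, theoretical_se)
```

If the two numbers are close, the sample-to-sample variation seen in the histogram is what random sampling alone would be expected to produce, with no systematic error.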

---

## Scenario 2: biased sampling

In this scenario, you'll use the same hypothetical population of eucalyptus seed diameter measurements and explore the impact of a biased sampling design.

### Hypothetical sampling design

In the first design, you were asked to imagine that you collected and sifted plant material to obtain seeds. Suppose you didn't know that the typical seed is about 1mm in diameter and decided to use a sieve that is a little too coarse, tending only to sift out larger seeds and letting smaller seeds pass through. As a result, small seeds have a lower probability of being included in the sample and large seeds have a higher probability of being included in the sample. 

This kind of sampling design can be described by assigning differential *sampling weights* $w_1, \dots, w_N$ to each observation. The cell below defines a `weight_fn` that calculates a weight $w_i$ between 0 and 1 according to diameter, so that larger diameters have larger weights and are more likely to be sampled.

In [12]:
# inclusion weight as a function of seed diameter
def weight_fn(x, r = 2, c = 2):
    out = 1/(1 + np.e**(-r*(x - c)))
    return out

# create a grid of values to use in plotting the function
grid = np.linspace(0, 6, 100)
weight_df = pd.DataFrame(
    {'seed diameter': grid,
     'weight': weight_fn(grid)}
)

# plot of inclusion probability against diameter
weight_plot = alt.Chart(weight_df).mark_area(opacity = 0.3, line = True).encode(
    x = 'seed diameter',
    y = 'weight'
).properties(height = 100)

# show plot
weight_plot

The actual probability that a seed is included in the sample -- its **inclusion probability** -- is proportional to the sampling weight. These inclusion probabilities $\pi_i$ can be calculated by normalizing the weights $w_i$:

$$\pi_i = \frac{w_i}{\sum_{j = 1}^N w_j}$$

To picture how the weights will be used in sampling, it may help to line up this plot with the population distribution. In effect, we will sample more from the right tail of the population distribution, where the weight is nearest to 1.

In [13]:
hist_pop & weight_plot
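
As a small illustration of the normalization step, the sketch below applies the same logistic weight function to four made-up diameters and rescales the weights so that they sum to 1. The values in `diameters` are hypothetical and chosen only for illustration. Strictly speaking, the normalized values are the chances of being picked on a single draw.

```python
import numpy as np

# same logistic weight function as in the lab
def weight_fn(x, r = 2, c = 2):
    return 1/(1 + np.exp(-r*(x - c)))

# hypothetical diameters (mm), for illustration only
diameters = np.array([0.5, 1.0, 2.0, 3.0])

# raw sampling weights, each between 0 and 1
weights = weight_fn(diameters)

# normalize so the probabilities sum to 1
probs = weights/weights.sum()

print(weights)
print(probs)
```

Larger seeds receive larger weights, and a seed with diameter equal to `c` receives a weight of exactly 0.5.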

The following cell draws a sample without replacement from the hypothetical seed population *with seeds weighted according to the inclusion probability given by the function above*.

In [14]:
# assign inclusion probability to each seed
population['inclusion_prob'] = weight_fn(population.diameter)/(weight_fn(population.diameter)).sum()

# draw weighted sample
np.random.seed(40721)
sample = population.sample(n = 250, replace = False, weights = 'inclusion_prob').drop(columns = 'inclusion_prob')
sample

seed,diameter
3426,1.721296
1412,2.308436
4979,1.676742
2839,2.986968
1715,1.651348
...,...
3958,1.813409
2199,2.549680
859,2.755936
1659,0.925576


#### Question 2a

Calculate the mean diameter of seeds in the simulated sample. Is it close to the population mean?

In [15]:
# solution
np.mean(sample.diameter)

1.8117214583859176

#### Answer

*Compared to the mean of the random sample, this sample mean is much farther from the population mean. It is also greater than the population mean.*

#### Question 2b

Show side-by-side plots of the distribution of sample values and the distribution of population values, with vertical lines indicating the corresponding mean on each plot. (*Hint*: copy the cell that produced this plot in scenario 1.) Does the distribution of diameters of seeds in the sample seem to accurately reflect the population?

In [16]:
# solution

# base layer
base_samp = alt.Chart(sample).properties(width = 400, height = 300)

# histogram of diameter measurements
hist_samp = base_samp.mark_bar(opacity = 0.8).encode(
    x = alt.X('diameter', 
              bin = alt.Bin(maxbins = 20),
              scale = alt.Scale(domain = (0, 6)),
              title = 'Diameter (mm)'),
    y = alt.Y('count()', title = 'Number of seeds in weighted sample')
)

# vertical line for sample mean
mean_samp = base_samp.mark_rule(color='blue').encode(
    x = 'mean(diameter)'
)

# display
hist_samp + mean_samp | hist_pop + mean_pop

#### Answer

*The population distribution is skewed to the right, whereas the biased sample looks fairly symmetric, almost like a normal distribution, and is shifted toward larger diameters. So the sample does not accurately reflect the population.*

### Assessing bias

Here you'll mimic the simulation done in scenario 1 to assess the bias of the sample mean under this new sampling design.

#### Question 2c

Investigate the bias of the sample mean by:

* drawing 1000 samples with observations weighted by inclusion probability;
* storing the sample mean from each sample; 
* computing the average difference between the sample means and the population mean.

(*Hint*: copy the cell that performs this simulation in scenario 1, and be sure to adjust the sampling step to include `weights = ...` with the appropriate argument.)

In [17]:
# solution
np.random.seed(40221) # for reproducibility

# number of samples to simulate
nsim = 1000

# storage for the sample means
samp_means = np.zeros(nsim)

# repeatedly sample and store the sample mean
for i in range(0, nsim):
    # select the diameter column so that a single number (not a Series) is stored
    samp_means[i] = population.sample(n = 250, replace = False, weights = 'inclusion_prob').diameter.mean()
    
# bias
samp_means.mean() - population.diameter.mean()

0.7722765557571376

#### Question 2d

Does this sampling design seem to introduce bias? If so, does the sample mean tend to over-estimate or under-estimate the population mean?

#### Answer

*According to the calculation at the end of the previous cell, this sampling design introduces a large bias (about 0.77 mm). The sample mean tends to overestimate the population mean.*

---

## Scenario 3

In this scenario, you'll explore sampling from a population with group structure -- frequently bias can arise from inadvertent uneven sampling of groups within a population.

### Hypothetical population

Suppose you're interested in determining the average beak-to-tail length of red-tailed hawks to help differentiate them from other hawks by sight at a distance. Females and males differ slightly in length -- females are generally larger than males. The cell below generates length measurements for a hypothetical population of 3000 females and 2000 males.

In [18]:
# for reproducibility
np.random.seed(40721)

# simulate hypothetical population
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
population = pd.concat([
    # 3000 females, generally larger
    pd.DataFrame(
        data = {'length': np.random.normal(loc = 57.5, scale = 3, size = 3000),
                'sex': np.repeat('female', 3000)}
    ),
    # 2000 males
    pd.DataFrame(
        data = {'length': np.random.normal(loc = 50.5, scale = 3, size = 2000),
                'sex': np.repeat('male', 2000)}
    )
])

# preview
population.groupby('sex').head(2)

,length,sex
0,53.97523,female
1,60.516768,female
0,53.076663,male
1,49.933166,male


The cell below produces a histogram of the lengths in the population overall (bottom panel) and when distinguished by sex (top panel).

In [19]:
base = alt.Chart(population).properties(height = 200)

hist = base.mark_bar(opacity = 0.5, color = 'red').encode(
    x = alt.X('length', 
              bin = alt.Bin(maxbins = 40), 
              scale = alt.Scale(domain = (40, 70)),
              title = 'length (cm)'),
    y = alt.Y('count()', 
              stack = None,
              title = 'number of birds')
)

hist_bysex = hist.encode(color = 'sex').properties(height = 100)

hist_bysex & hist

The population mean -- average length of both female and male red-tailed hawks -- is shown below.

In [20]:
# population mean
population.mean(numeric_only = True) # skip the non-numeric 'sex' column

length    54.737717
dtype: float64

First try drawing a random sample from the population:

In [21]:
# for reproducibility
np.random.seed(40821)

# randomly sample
sample = population.sample(n = 300, replace = False)
sample

,length,sex
1063,50.882531,male
173,50.264073,male
2535,59.807364,female
2519,55.485999,female
2166,59.809179,female
...,...,...
479,53.524782,male
444,54.557387,male
983,51.048157,male
660,56.971426,female


#### Question 3a

Do you expect that the sample will contain equal numbers of male and female hawks? Think about this for a moment (you don't have to provide a written answer), and then compute the proportions of individuals in the sample of each sex. 

(*Hint*: group by sex, use `.count()`, and divide by the sample size. Be sure to rename the output column appropriately, as the default behavior produces a column called `length`.) 

In [22]:
# solution
proportion = sample.groupby('sex').count()/300
proportion.rename(columns = {'length':'proportion'})

sex,proportion
female,0.596667
male,0.403333


*Since the hypothetical population contains more females than males, it is very likely that a random sample would also contain more females than males. Just like how it is in the hypothetical population, there are more females than males in the sample.*
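
As an alternative to grouping and counting, group proportions can be obtained in one step with `value_counts(normalize = True)`. The sketch below uses a tiny made-up sample (the five rows are hypothetical, for illustration only), so the numbers are not those of the lab's sample.

```python
import pandas as pd

# tiny hypothetical sample, for illustration only
toy_sample = pd.DataFrame(
    {'length': [58.1, 51.2, 56.9, 49.8, 57.3],
     'sex': ['female', 'male', 'female', 'male', 'female']}
)

# proportion of individuals of each sex
proportions = toy_sample['sex'].value_counts(normalize = True)
print(proportions)
```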

The sample mean is shown below, and is fairly close to the population mean. This should be expected, since you already saw in scenario 1 that random sampling is an unbiased sampling design with respect to the mean.

In [23]:
# solution
sample.mean(numeric_only = True) # skip the non-numeric 'sex' column

length    54.952103
dtype: float64

### Biased sampling

Let's now consider a biased sampling design. Usually, length measurements are taken from dead specimens found opportunistically. Imagine that male mortality is higher, so there are better chances of finding dead males than dead females. Suppose in particular that any given male is five times as likely to be collected as any given female; to represent this situation, we'll assign sampling weights of 5/6 to all male hawks and weights of 1/6 to all female hawks.

In [24]:
def weight_fn(sex, p = 5/6):
    if sex == 'male':
        out = p
    else:
        out = 1 - p
    return out

weight_df = pd.DataFrame(
    {'length': [50.5, 57.5],
     'weight': [5/6, 1/6],
     'sex': ['male', 'female']})

wt = alt.Chart(weight_df).mark_bar(opacity = 0.5).encode(
    x = alt.X('length', scale = alt.Scale(domain = (40, 70))),
    y = alt.Y('weight', scale = alt.Scale(domain = (0, 1))),
    color = 'sex'
).properties(height = 70)

hist_bysex & wt

#### Question 3b

Draw a weighted sample from the population using the weights defined by `weight_fn`, and compute the sample mean.

In [25]:
# for reproducibility
np.random.seed(40821)

# assign weights (5/6 for males, 1/6 for females)
population['weights'] = population.sex.map(weight_fn)
# randomly sample
sample = population.sample(n = 300, replace = False, weights = 'weights').drop(columns = 'weights')
# compute mean (skipping the non-numeric 'sex' column)
sample.mean(numeric_only = True)

length    51.887106
dtype: float64

#### Question 3c

Investigate the bias of the sample mean by:

* drawing 1000 samples with observations weighted by `weight_fn`;
* storing the sample mean from each sample; 
* computing the average difference between the sample means and the population mean.

In [26]:
# solution
np.random.seed(40221) # for reproducibility

# number of samples to simulate
nsim = 1000
# storage for the sample means
samp_means = np.zeros(nsim)
# repeatedly sample and store the sample mean
for i in range(nsim):
    # select the length column so that a single number (not a Series) is stored
    samp_means[i] = population.sample(n = 300, replace = False, weights = 'weights').length.mean()
# bias
samp_means.mean() - population.length.mean()

-2.5720649894415715

#### Question 3d

Reflect a moment on your simulation result in question 3c. If *female* mortality is higher and specimens for measurement are collected opportunistically, as described in this sampling design, do you expect that the average length in the sample will be an underestimate or an overestimate of the population mean? Explain why in 1-2 sentences.

#### Answer

*If female hawk mortality is higher, so that females are more likely to be collected, I would expect the sample mean to be an overestimate of the population mean. This is because, as the earlier histograms and population statistics show, females tend to be larger than males, so oversampling them pulls the average upward.*
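
A back-of-the-envelope calculation shows which way the bias points. The sketch below is a rough approximation under stated assumptions, not part of the original lab: it treats each draw as independent (ignoring the small effect of sampling without replacement) and uses the group means 57.5 and 50.5 from the simulation settings rather than the realized population values. The function name `expected_bias` is introduced here for illustration.

```python
def expected_bias(n_female = 3000, n_male = 2000,
                  mu_female = 57.5, mu_male = 50.5,
                  w_female = 1/6, w_male = 5/6):
    # chance that a single draw is male, given the sampling weights
    p_male = w_male*n_male/(w_male*n_male + w_female*n_female)
    # approximate expected sample mean under the weighted design
    expected_sample_mean = p_male*mu_male + (1 - p_male)*mu_female
    # population mean implied by the group sizes and group means
    population_mean = (n_male*mu_male + n_female*mu_female)/(n_male + n_female)
    return expected_sample_mean - population_mean

# males oversampled, as in question 3c
bias_male_favored = expected_bias()

# females oversampled, as in question 3d
bias_female_favored = expected_bias(w_female = 5/6, w_male = 1/6)

print(bias_male_favored, bias_female_favored)
```

The sign of the approximate bias follows whichever sex is oversampled: negative when the smaller males are favored, positive when the larger females are favored.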

---

## Bias correction

*What can be done if a sampling design is biased? Is there any remedy?* 

You've seen some examples above of how bias can arise from a sampling mechanism in which units have unequal chances of being selected in the sample. Ideally, we'd work with random samples all the time, but that's not very realistic in practice. Fortunately, biased sampling is not a hopeless case -- **it is possible to apply bias corrections if you have good information about which individuals were more likely to be sampled**.

To illustrate how this would work, let's revisit scenario 2 -- sampling larger eucalyptus seeds more often than smaller ones. Imagine you realize the mistake and conduct a quick experiment with your sieve to determine the proportion of seeds of each size that pass through, and use this to estimate the inclusion probabilities. (To simplify this exercise, we'll just use the sampling weights we defined earlier to calculate the actual inclusion probabilities.)

The cell below generates the population and sample from scenario 2 again:

In [27]:
# simulate seed diameters
np.random.seed(40221) # for reproducibility
population = pd.DataFrame(
    data = {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000), 
            'seed': np.arange(5000)}
).set_index('seed')

# sampling weight as a function of seed diameter
def weight_fn(x, r = 2, c = 2):
    out = 1/(1 + np.e**(-r*(x - c)))
    return out

# assign sampling weight to each seed
population['samp_weight'] = weight_fn(population.diameter)

# draw weighted sample
np.random.seed(40721)
sample = population.sample(n = 250, replace = False, weights = 'samp_weight')

The sample mean and population mean you calculated earlier are shown below:

In [28]:
# print sample and population means
pd.Series({'sample mean': sample.diameter.mean(), 'population mean': population.diameter.mean()})

sample mean        1.811721
population mean    1.018929
dtype: float64

We can obtain an approximately unbiased estimate of the population mean by replacing the sample average with a *weighted average* of the diameter measurements, in which each measurement is weighted in inverse proportion to its sampling weight:

$$\text{weighted average} = \sum_{i = 1}^{250} \underbrace{\left(\frac{w_i^{-1}}{\sum_{j = 1}^{250} w_j^{-1}}\right)}_{\text{bias adjustment}} \times \text{diameter}_i$$

This might look a little complicated, but the idea is simple -- the weighted average corrects for bias by simply up-weighting observations with a lower sampling weight and down-weighting observations with a higher sampling weight.

The cell below performs this calculation.

In [29]:
# compute bias adjustment
sample['bias_adjustment'] = (sample.samp_weight**(-1))/(sample.samp_weight**(-1)).sum()

# weight diameter measurements
sample['weighted_diameter'] = sample.diameter*sample.bias_adjustment

# sum to compute weighted average
sample.weighted_diameter.sum()

1.0139201948221341
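
The same weighted average can be written more compactly with `np.average`, which normalizes the weights internally. The sketch below uses three made-up measurements and sampling weights (hypothetical values, for illustration only) and checks that the step-by-step and compact versions agree.

```python
import numpy as np

# hypothetical measurements and sampling weights, for illustration only
diameter = np.array([0.5, 1.0, 2.5])
samp_weight = np.array([0.1, 0.2, 0.8])

# step-by-step version, mirroring the formula
bias_adjustment = (1/samp_weight)/(1/samp_weight).sum()
weighted_avg_manual = (bias_adjustment*diameter).sum()

# compact version: np.average rescales the weights to sum to 1
weighted_avg_compact = np.average(diameter, weights = 1/samp_weight)

print(weighted_avg_manual, weighted_avg_compact, diameter.mean())
```

Because the largest value has the largest sampling weight, it is down-weighted, and the weighted average falls below the plain average.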

Notice that the weighted average successfully corrected for the bias:

In [30]:
# print sample and population means
pd.Series({'sample mean': sample.diameter.mean(),
           'weighted average': sample.weighted_diameter.sum(),
           'population mean': population.diameter.mean()})

sample mean         1.811721
weighted average    1.013920
population mean     1.018929
dtype: float64
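
A single sample cannot show whether the correction works *typically*. The sketch below is a hedged check rather than part of the original lab: it repeats the weighted sampling 200 times (a number chosen here only to keep the run short) and compares the estimated bias of the plain sample mean with that of the inverse-weighted average. The names `naive_bias` and `weighted_bias` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# rebuild the seed population and sampling weights from scenario 2
np.random.seed(40221)
population = pd.DataFrame(
    {'diameter': np.random.gamma(shape = 2, scale = 1/2, size = 5000)}
)
population['samp_weight'] = 1/(1 + np.exp(-2*(population.diameter - 2)))

nsim = 200
naive = np.zeros(nsim)      # plain sample means
weighted = np.zeros(nsim)   # inverse-weighted averages

for i in range(nsim):
    s = population.sample(n = 250, replace = False, weights = 'samp_weight')
    naive[i] = s.diameter.mean()
    weighted[i] = np.average(s.diameter, weights = 1/s.samp_weight)

# estimated bias of each estimator
pop_mean = population.diameter.mean()
naive_bias = naive.mean() - pop_mean
weighted_bias = weighted.mean() - pop_mean

print(naive_bias, weighted_bias)
```

The weighted average may still carry a small residual bias, in part because sampling without replacement makes the true inclusion probabilities only approximately proportional to the weights, but it should sit far closer to the population mean than the uncorrected sample mean.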

---

## Takeaways

These simulations illustrate through a few simple examples that random sampling -- a sampling design where each unit is equally likely to be selected -- produces unbiased sample means. That means that 'typical samples' will yield sample averages that are close to the population value. By contrast, deviations from random sampling tend to yield biased sample averages -- in other words, nonrandom sampling tends to distort the statistical properties of the population in ways that can produce misleading conclusions (if uncorrected).

Here are a few key points to reflect on:

* bias is not a property of an individual sample, but of a *sampling design*
    - unbiased sampling designs tend to produce faithful representations of populations
    - but there are no guarantees for individual samples


* if you hadn't known the population distributions, there would have been no computational method to detect bias
    - in practice, it's necessary to *reason* about whether the sampling design is sound


* the sample statistic (sample mean) was only reliable when the sampling design was sound
    - the quality of data collection is arguably more important for reaching reliable conclusions than the choice of statistic or method of analysis



---
## Submission Checklist
1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Select *File > Download* (should save as .ipynb)
5. Submit to Gradescope