# Sampling strategies and experimental design

## Sampling strategies

### Why not take a census?

* Conducting a census is very resource intensive
* It is (nearly) impossible to collect data from every individual, hence there is no guarantee of unbiased results
* Populations constantly change

### Sampling is natural

* We don't taste a whole pot of soup to check it.
* Make sure you stir the soup thoroughly before taking a sip

### Simple random sample

* Each case is equally likely to be selected.

### Stratified sample

* First, divide the population into homogeneous groups, called strata, and then randomly sample from within each stratum.

### Cluster sample

* Divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters.
* Unlike the strata in stratified sampling, clusters are heterogeneous within themselves, and each cluster is similar to the others, so we can get away with sampling from just a few of them.
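
The two steps of cluster sampling can be sketched in base R. This is only an illustration: the `population` data frame, with 20 clusters of 30 individuals each, is hypothetical and not part of the county data used later.

```r
# Hypothetical population: 20 clusters (e.g. villages), 30 individuals each
set.seed(42)  # fixed seed so this sketch is reproducible
population <- data.frame(
    cluster = rep(1:20, each = 30),
    id      = 1:600
)

# Step 1: randomly select 4 of the 20 clusters
chosen_clusters <- sample(unique(population$cluster), size = 4)

# Step 2: keep every observation that belongs to a chosen cluster
cluster_sample <- population[population$cluster %in% chosen_clusters, ]

# Number of sampled observations per chosen cluster
table(cluster_sample$cluster)
```

Every individual in a chosen cluster ends up in the sample, so the sample size is fixed by the number and size of the clusters drawn.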

### Multistage sample

* Divide the population into clusters, randomly sample a few clusters, and then randomly sample observations from within those clusters.
* Cluster and multistage sampling are often used for economic reasons.
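
A multistage sample adds a second random draw inside each selected cluster. The base R sketch below assumes a hypothetical `population` of 20 clusters with 30 individuals each; the cluster count and per-cluster sample size are arbitrary choices for illustration.

```r
# Hypothetical population: 20 clusters, 30 individuals each
set.seed(42)  # fixed seed so this sketch is reproducible
population <- data.frame(
    cluster = rep(1:20, each = 30),
    id      = 1:600
)

# Stage 1: randomly select 4 of the 20 clusters
chosen_clusters <- sample(unique(population$cluster), size = 4)
stage1 <- population[population$cluster %in% chosen_clusters, ]

# Stage 2: within each chosen cluster, randomly sample 5 observations
rows_by_cluster <- split(seq_len(nrow(stage1)), stage1$cluster)
sampled_rows <- unlist(lapply(rows_by_cluster,
                              function(idx) sample(idx, size = 5)))
multistage_sample <- stage1[sampled_rows, ]

# Number of sampled observations per chosen cluster
table(multistage_sample$cluster)
```

Compared with a cluster sample, only a subset of each chosen cluster is observed, which reduces the cost of data collection further.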

## Sampling in R

### Setup

In [1]:
# Load packages
library(openintro)
library(dplyr)

# Load county data
data(county)

county_noDC <-
    county %>%
    filter(state != "District of Columbia") %>%
    droplevels()



### Simple random sample

In [2]:
# Simple random sample of 150 counties
county_srs <-
    county_noDC %>%
    sample_n(size = 150)

# Glimpse county_srs
glimpse(county_srs)

Observations: 150
Variables: 10
$ name          <fctr> Concordia Parish, San Bernardino County, Tallahatchi...
$ state         <fctr> Louisiana, California, Mississippi, California, Main...
$ pop2000       <dbl> 20247, 1709434, 14903, 54501, 186742, 14792, 10037, 2...
$ pop2010       <dbl> 20822, 2035210, 15378, 55365, 197131, 16912, 10099, 2...
$ fed_spend     <dbl> 10.345548, 5.792442, 9.314475, 8.138264, 8.216658, 7....
$ poverty       <dbl> 30.8, 14.8, 32.5, 11.7, 8.5, 17.5, 18.7, 11.3, 20.6, ...
$ homeownership <dbl> 70.8, 65.1, 72.6, 70.2, 75.2, 77.0, 79.1, 83.1, 73.4,...
$ multiunit     <dbl> 5.5, 18.8, 4.7, 8.6, 22.4, 7.8, 4.1, 10.8, 10.8, 9.2,...
$ income        <dbl> 15911, 21867, 12687, 25483, 27137, 18735, 16835, 2840...
$ med_income    <dbl> 30062, 55845, 24668, 47462, 55008, 37095, 34732, 4870...


### SRS state distribution

In [3]:
# State distribution of SRS counties
county_srs %>%
    group_by(state) %>%
    count()

state,n
Alabama,1
Alaska,1
Arizona,1
Arkansas,2
California,3
Colorado,2
Delaware,1
Florida,1
Georgia,9
Hawaii,2


### Stratified sample

In [5]:
# Stratified sample of 150 counties, each state is a stratum
county_str <-
    county_noDC %>%
    group_by(state) %>%
    sample_n(size = 3)

glimpse(county_str)

Observations: 150
Variables: 10
$ name          <fctr> Dallas County, Wilcox County, Perry County, Prince o...
$ state         <fctr> Alabama, Alabama, Alabama, Alaska, Alaska, Alaska, A...
$ pop2000       <dbl> 46365, 13183, 11861, 6146, NA, 9196, 167517, 155032, ...
$ pop2010       <dbl> 43820, 11670, 10591, 5559, 968, 9492, 211033, 200186,...
$ fed_spend     <dbl> 13.948106, 13.639417, 12.354735, 10.248966, 0.000000,...
$ poverty       <dbl> 31.8, 38.5, 28.8, 14.0, 10.8, 24.6, 13.7, 16.1, 24.4,...
$ homeownership <dbl> 62.6, 76.8, 67.8, 69.0, 59.1, 56.2, 72.5, 71.5, 72.5,...
$ multiunit     <dbl> 16.0, 6.0, 11.4, 9.7, 27.2, 17.4, 11.0, 9.8, 7.0, 2.4...
$ income        <dbl> 16646, 12573, 13433, 24193, 35536, 20549, 25527, 2152...
$ med_income    <dbl> 26029, 23491, 25950, 45728, 73500, 53899, 43290, 3978...


### Simple random sample in R

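Suppose you want to collect some data from a sample of eight states. A list of all states and the region each belongs to (Northeast, Midwest, South, West) is given in the us_regions data frame. Note that R's built-in state.region, used below to build us_regions, labels the Midwest as North Central.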

INSTRUCTIONS

The dplyr package and us_regions data frame have been loaded for you.

* Use simple random sampling to select eight states from us_regions. Save this sample in a data frame called states_srs.
* Count the number of states from each region in your sample.

In [14]:
# Create the dataset
us_regions <- data.frame(state = state.name,
                   region = as.character(state.region))

# Simple random sample: states_srs
states_srs <- us_regions %>%
    sample_n(size = 8)

# Count states by region
states_srs %>%
    group_by(region) %>%
    count()

region,n
North Central,3
Northeast,1
South,2
West,2


### Stratified sample in R

In the last exercise, you took a simple random sample of eight states. However, as you may have noticed when you counted the number of states selected from each region, this strategy is unlikely to select an equal number of states from each region. Stratifying by region and sampling the same number of states from each stratum guarantees that every region is represented equally.

INSTRUCTIONS

The dplyr package has been loaded for you and us_regions is still available in your workspace.

* Use stratified sampling to select a total of eight states, where each stratum is a region. Save this sample in a data frame called states_str.
* Count the number of states from each region in your sample to confirm that each region is represented equally in your sample.

In [15]:
# Stratified sample
states_str <- us_regions %>%
    group_by(region) %>%
    sample_n(size = 2)

# Count states by region
states_str %>%
    group_by(region) %>%
    count()

region,n
North Central,2
Northeast,2
South,2
West,2
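
The claim that a simple random sample rarely balances the regions can be checked with a small simulation. The sketch below uses only base R and the built-in `state.region` vector; the helper `is_balanced` and the number of repetitions are illustrative choices, not part of the exercise.

```r
set.seed(1)  # fixed seed so this sketch is reproducible

# Region label for each of the 50 states
regions <- as.character(state.region)

# Draw 8 states by simple random sampling and check whether
# each of the 4 regions contributes exactly 2 of them
is_balanced <- function() {
    picked <- sample(regions, size = 8)
    counts <- table(factor(picked, levels = unique(regions)))
    all(counts == 2)
}

# Repeat the draw many times and record how often it is balanced
balanced <- replicate(10000, is_balanced())

# Proportion of simple random samples with 2 states per region
mean(balanced)
```

The proportion comes out small, whereas the stratified sample above is balanced by construction.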


## Principles of experimental design

* Control: compare treatment of interest to a control group
* Randomize: randomly assign subjects to treatments
* Replicate: collect a sufficiently large sample within a study, or replicate the entire study
* Block: account for the potential effect of confounding variables
    - Group subjects into blocks based on these variables
    - Randomize within each block to treatment groups
    
### Design a study, with blocking

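* Question: is R learned better through a lecture or an online course?
    - Blocking variable: previous programming experience
    - Randomize within each block so that students with and without experience are equally represented in both treatment groups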

![design_a_study_with_blocking](design_a_study_with_blocking.png)
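
Randomizing within blocks can be sketched in base R as follows. The `subjects` data frame (40 students, half with prior programming experience) is hypothetical, as are the treatment labels `lecture` and `online`.

```r
set.seed(123)  # fixed seed so this sketch is reproducible

# Hypothetical subjects: 40 students, half with programming experience
subjects <- data.frame(
    id         = 1:40,
    experience = rep(c("yes", "no"), each = 20),
    stringsAsFactors = FALSE
)

# Randomize within each block: build a balanced vector of treatment
# labels for the block, then shuffle it before assigning
subjects$treatment <- NA_character_
for (block in unique(subjects$experience)) {
    in_block <- which(subjects$experience == block)
    labels <- rep(c("lecture", "online"), length.out = length(in_block))
    subjects$treatment[in_block] <- sample(labels)
}

# Each experience level is split evenly across the two treatments
table(subjects$experience, subjects$treatment)
```

Because the shuffle happens separately inside each block, programming experience cannot end up concentrated in one treatment group.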