# Sample Selection

The `samplics` class `SampleSelection` implements several popular random selection methods. 

`SampleSelection` is part of the sampling module of `samplics`. Hence, we can import it as follows

```python
from samplics.sampling import SampleSelection
```

The API for instantiating an object of this class is

```python
SampleSelection(
    method,
    stratification = False,
    with_replacement = True,
)
```

The main method of `SampleSelection` is `select()`, which has the following signature

```python
select(
    samp_unit,
    samp_size = None,
    stratum = None,
    mos = None,
    samp_rate = None,
    probs = None,
    shuffle = False,
    to_dataframe = False,
    sample_only = False,
    ) 
```

Let's use the `SampleSelection` class to illustrate the selection of several samples using different methods.

## Simple Random Sampling (SRS)

The SRS is the simplest selection method. With this approach, all the sampling units have the same probability of selection. 

Assume that we have a population of 100 units and we want to select 10 using SRS.
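
To make the equal-probability property concrete, here is a minimal sketch that uses only Python's standard library. It is an illustration of SRS without replacement, not a description of how `samplics` performs the selection; the names `pop`, `n`, and `rng` are chosen for this example. Each unit's inclusion probability is $n/N$, which is 10/100 = 0.1 here.

```python
import random

# Hypothetical stand-in for the population of 100 units
pop = list(range(1, 101))
n = 10

# A dedicated generator with a fixed seed, so the draw is reproducible
rng = random.Random(2024)

# random.sample draws n distinct units, i.e. SRS without replacement
srs_units = sorted(rng.sample(pop, n))

# Every unit has the same inclusion probability n / N
incl_prob = n / len(pop)

print(srs_units)
print(incl_prob)
```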

In [1]:
# We generate a population of 100 units

population = list(range(1, 101))

print(f"\nThe generated population is {population}\n")


The generated population is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]



Now we will use the class `SampleSelection` to select a sample using the SRS method

In [2]:
# we import the needed class
from samplics.sampling import SampleSelection

# We instantiate the object with the default parameters
srs_sampling = SampleSelection(method="srs")

# We select the sample and return a tuple of 3 arrays
srs_sample, srs_hits, srs_probs = srs_sampling.select(samp_unit=population, samp_size=10)

# we print the arrays
print(f"\nThe sample is:\n {srs_sample}")

print(f"\nThe number of hits are:\n {srs_hits}")

print(f"\nThe probabilities of selection are:\n {srs_probs}\n")


The sample is:
 [0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0]

The number of hits are:
 [0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0]

The probabilities of selection are:
 [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]






The setting `to_dataframe=True` will convert the data to a pandas DataFrame instead of the tuple of three arrays. Note that the columns in the dataframe will be reduced to the minimum. 


In [3]:
# Use the parameter to_dataframe to return a Pandas DataFrame
srs_sample2 = srs_sampling.select(samp_unit=population, samp_size=10, to_dataframe=True)

# we print the first obs of the frame
print(f"\nThe sample is now a Pandas DataFrame")
srs_sample2.head(15)


The sample is now a Pandas DataFrame


Unnamed: 0,_samp_unit,_mos,_sample,_hits,_probs
0,1,1.0,0,0,0.1
1,2,1.0,1,1,0.1
2,3,1.0,0,0,0.1
3,4,1.0,0,0,0.1
4,5,1.0,0,0,0.1
5,6,1.0,0,0,0.1
6,7,1.0,0,0,0.1
7,8,1.0,0,0,0.1
8,9,1.0,0,0,0.1
9,10,1.0,0,0,0.1


The code above returns the entire population, with the variable `_sample` indicating the selected units. 

We can use `sample_only=True` to subset the returned data to only contain the sample. 

In [4]:
# Use sample_only=True to keep only the selected units in the returned DataFrame
srs_sample3 = srs_sampling.select(
    samp_unit=population, samp_size=10, to_dataframe=True, sample_only=True
)

# we print the sample
print(f"\nThe sample is now a Pandas DataFrame")
srs_sample3


The sample is now a Pandas DataFrame


Unnamed: 0,_samp_unit,_mos,_sample,_hits,_probs
0,20,1.0,1,1,0.1
1,30,1.0,1,1,0.1
2,43,1.0,1,1,0.1
3,44,1.0,1,1,0.1
4,49,1.0,1,1,0.1
5,50,1.0,1,1,0.1
6,78,1.0,1,1,0.1
7,83,1.0,1,1,0.1


**Exercise 6.1** 

Select a systematic sample of size 10 from the population data. 

## Probability Proportional to Size (PPS)

Probability proportional to size (PPS) sampling requires a measure of size (mos) associated with each sampling unit.

Let's generate the mos as follows

In [5]:
import random

mos = []
for _ in range(100):
    mos.append(round(100 * random.random(), 0))

mos[1:15]

[56.0,
 35.0,
 4.0,
 90.0,
 38.0,
 39.0,
 68.0,
 35.0,
 85.0,
 70.0,
 56.0,
 35.0,
 87.0,
 71.0]


The PPS selection method can be implemented with or without replacement. 

In addition to PPS systematic selection (`pps-sys`), the available PPS algorithms for selecting a sample with unequal probabilities of selection are the Brewer, Hanurav-Vijayan (hv), Murphy, and Rao-Sampford (rs) methods. 

```python 
SampleSelection(method="pps-sys", with_replacement=True)
SampleSelection(method="pps-sys", with_replacement=False)
SampleSelection(method="pps-brewer", with_replacement=False)
SampleSelection(method="pps-hv", with_replacement=False)  # Hanurav-Vijayan method
SampleSelection(method="pps-murphy", with_replacement=False)
SampleSelection(method="pps-rs", with_replacement=False)  # Rao-Sampford method
```
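
To give an intuition for PPS systematic selection, the sketch below walks along the cumulative measures of size with a random start and a fixed interval. It is a generic textbook-style illustration written with the standard library only, not the `samplics` implementation; the function name `pps_systematic` and the `sizes` values are made up for the example.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(mos, samp_size, rng):
    """Return (hits, probs) for a PPS systematic draw (illustrative only)."""
    cum = list(accumulate(mos))      # cumulative measures of size
    total = cum[-1]
    step = total / samp_size         # sampling interval
    start = rng.uniform(0, step)     # random start within the first interval
    hits = [0] * len(mos)
    for k in range(samp_size):
        point = start + k * step
        # Unit whose cumulative range contains the selection point;
        # min() guards against a floating-point edge case at the upper end
        idx = min(bisect_right(cum, point), len(mos) - 1)
        hits[idx] += 1
    # Expected number of hits per unit: n * mos_i / sum(mos)
    probs = [samp_size * m / total for m in mos]
    return hits, probs


sizes = [43, 56, 35, 4, 90, 38, 39, 68, 35, 85]  # hypothetical measures of size
hits, probs = pps_systematic(sizes, samp_size=3, rng=random.Random(7))
print(hits)
print([round(p, 3) for p in probs])
```

In this sketch a unit whose measure of size exceeds the sampling interval can be hit more than once, in which case `probs` is better read as an expected number of hits than as a probability.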


In [6]:
# We instantiate the object with the pps systematic method
pps_sampling = SampleSelection(method="pps-sys")

# We select the sample
pps_sample = pps_sampling.select(samp_unit=population, samp_size=10, mos=mos, to_dataframe=True)

pps_sample.head(15)

Unnamed: 0,_samp_unit,_mos,_sample,_hits,_probs
0,1,43.0,True,1,0.091665
1,2,56.0,False,0,0.119378
2,3,35.0,False,0,0.074611
3,4,4.0,False,0,0.008527
4,5,90.0,False,0,0.191857
5,6,38.0,False,0,0.081006
6,7,39.0,False,0,0.083138
7,8,68.0,False,0,0.144958
8,9,35.0,False,0,0.074611
9,10,85.0,False,0,0.181198


**Exercise 6.2** 

Select a sample of 2 units using the PPS Brewer method with replacement. 

## Stratified Design

In a stratified design, the population is divided into $H$ partitions or strata. 

A sample is selected independently from each stratum. 

Using `stratification = True` indicates a stratified design to `samplics`. 

We then have to provide the stratum variable when calling `select()`.

Let's create the following population with the stratification variable `region`. 

In [7]:
# import the numpy library
import numpy as np

# fix the random seed to ensure reproducibility
np.random.seed(seed=12345)

# create the stratification by randomly selecting values from the list ["North", "South", "West", "East"]
region = list(np.random.choice(["North", "South", "West", "East"], 100))

# sort the list and replace the original list by the sorted one
region.sort()

# print the sorted list
print(f"\nThis is the generated stratification variable:\n {region}\n")


This is the generated stratification variable:
 ['East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'East', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'North', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West', 'West']



Let's select a stratified SRS sample of size 5 per stratum as follows 

In [8]:
# instantiate the object for a stratified srs with replacement
str_sampling = SampleSelection(method="srs", stratification=True, with_replacement=True)

# we select the sample
str_sample = str_sampling.select(
    samp_unit=population, samp_size=5, stratum=region, to_dataframe=True, sample_only=True
)

# print the frame with the indicator for the sample
str_sample

Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
0,5,East,1.0,1,1,0.172414
1,16,East,1.0,1,1,0.172414
2,18,East,1.0,1,1,0.172414
3,23,East,1.0,1,1,0.172414
4,27,East,1.0,1,1,0.172414
5,32,North,1.0,1,1,0.277778
6,36,North,1.0,1,1,0.277778
7,38,North,1.0,1,1,0.277778
8,45,North,1.0,1,1,0.277778
9,46,North,1.0,1,1,0.277778



If we want the sample size to vary by stratum then we can use a Python dictionary


In [9]:
# variable sample size per stratum

str_samp_size = {"North": 4, "South": 2, "West": 4, "East": 3}

In [10]:
# instantiate the object for a stratified srs without replacement
str_sampling2 = SampleSelection(method="srs", stratification=True, with_replacement=False)

# we select the sample
str_sample2 = str_sampling2.select(
    samp_unit=population,
    samp_size=str_samp_size,
    stratum=region,
    to_dataframe=True,
    sample_only=True,
)

# print the sample
str_sample2

Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
0,11,East,1.0,1,1,0.103448
1,15,East,1.0,1,1,0.103448
2,26,East,1.0,1,1,0.103448
3,34,North,1.0,1,1,0.222222
4,39,North,1.0,1,1,0.222222
5,42,North,1.0,1,1,0.222222
6,44,North,1.0,1,1,0.222222
7,61,South,1.0,1,1,0.074074
8,72,South,1.0,1,1,0.074074
9,81,West,1.0,1,1,0.153846
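
The `_probs` values in the two stratified tables above are consistent with $n_h / N_h$, the stratum sample size divided by the stratum population size. The sketch below recomputes them with the standard library only. The stratum counts (29, 18, 27, 26) were tallied from the printed `region` list, and the helper name `srs_probs_by_stratum` is made up for this illustration.

```python
def srs_probs_by_stratum(stratum_sizes, samp_size):
    """Per-stratum SRS probability n_h / N_h; samp_size is an int or a dict."""
    probs = {}
    for stratum, pop_size in stratum_sizes.items():
        n_h = samp_size[stratum] if isinstance(samp_size, dict) else samp_size
        probs[stratum] = n_h / pop_size
    return probs


# Stratum sizes counted from the sorted `region` list printed earlier
stratum_sizes = {"East": 29, "North": 18, "South": 27, "West": 26}

# Same sample size in every stratum
equal_alloc = srs_probs_by_stratum(stratum_sizes, 5)

# Sample size varying by stratum, given as a dictionary
varying_alloc = srs_probs_by_stratum(
    stratum_sizes, {"North": 4, "South": 2, "West": 4, "East": 3}
)

print({k: round(v, 6) for k, v in equal_alloc.items()})
print({k: round(v, 6) for k, v in varying_alloc.items()})
```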


**Exercise 6.3** 

Select a stratified sample where region is the stratification variable using systematic selection with replacement. Ensure that the sample sizes are 2 in the North, 5 in the South, 3 in the West, and 3 in the East.

## Two-stage Cluster Sampling 

Multi-stage cluster designs are common in large national surveys.

In multi-stage designs, the selection is conducted in independent layers or stages.

For example, with two-stage designs, the selection is conducted in two steps
- Stage 1: a sample of primary sampling units (PSUs) is selected
- Stage 2: from each selected PSU, a sample of secondary sampling units (SSUs) is selected.

Let's illustrate an example of two-stage sampling design.

Let's assume that the first stage sample consists of small geographic areas with a large number of households. Here the PSUs are the geographic areas or clusters of households. 

The secondary sampling units are the households. Hence, at the second stage, a sample of households will be selected from each PSU in the sample. 
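
Before turning to `samplics`, the two steps can be sketched with the standard library alone. This is a toy illustration under simplifying assumptions: the frame is made up, and simple random sampling is used at both stages for brevity, whereas the example that follows uses PPS at the first stage.

```python
import random

rng = random.Random(42)

# Hypothetical frame: 6 clusters (PSUs), each listing 20 households (SSUs)
frame = {
    f"cluster_{c}": [f"hh_{c}_{h}" for h in range(1, 21)]
    for c in range(1, 7)
}

# Stage 1: select 2 PSUs out of 6
selected_psus = rng.sample(sorted(frame), 2)

# Stage 2: within each selected PSU, select 5 households
two_stage_sample = {psu: rng.sample(frame[psu], 5) for psu in selected_psus}

for psu, households in two_stage_sample.items():
    print(psu, households)
```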

### First Stage of Sampling

To illustrate the two-stage sampling, let's use the datasets included in `samplics`.

In [11]:
import numpy as np

# The function load_psu_frame() allows us to import the PSU sampling frame
from samplics.datasets import load_psu_frame

#### PSU Frame

The PSU frame is a synthetic dataset mimicking a census frame of household clusters, e.g. census blocks in the USA or enumeration areas in other countries.
- The file has 100 clusters classified into 4 mutually exclusive regions (East, North, South, and West). 
- Each cluster represents a group of households and has a status variable indicating whether the cluster is in scope or not.
- The number of households is the census count for the given cluster.


In [12]:
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]
psu_frame.head(15)

Unnamed: 0,cluster,region,number_households_census,cluster_status,comment
0,1,North,105,1,
1,2,North,85,1,
2,3,North,95,1,
3,4,North,75,1,
4,5,North,120,1,
5,6,North,90,1,
6,7,North,130,1,
7,8,North,55,1,
8,9,North,30,1,
9,10,North,600,1,due to a large building


#### PSU Probability of Selection

At the first stage, we use the probability proportional to size (PPS) method to select a random sample of clusters. The measure of size is the number of households (`number_households_census`) as provided in the PSU sampling frame. The sample is stratified by region. For stratified PPS, the probabilities are obtained as follows: \begin{equation} p_{hi} = \frac{n_h M_{hi}}{\sum_{i=1}^{N_h}{M_{hi}}} \end{equation} where $p_{hi}$ is the probability of selection for unit $i$ from stratum $h$, $M_{hi}$ is the measure of size (mos), and $n_h$ and $N_h$ are the sample size and the total number of clusters in stratum $h$, respectively.
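
As a worked illustration of the formula above, the sketch below computes the stratified PPS probabilities for a small made-up frame using only the standard library. The cluster sizes, the sample sizes, and the function name `pps_probs` are assumptions for this example and are unrelated to the `samplics` dataset.

```python
from collections import defaultdict


def pps_probs(strata, mos, samp_size):
    """p_hi = n_h * M_hi / sum_i M_hi, computed within each stratum."""
    totals = defaultdict(float)
    for h, m in zip(strata, mos):
        totals[h] += m                      # total measure of size per stratum
    return [samp_size[h] * m / totals[h] for h, m in zip(strata, mos)]


# Made-up frame: two strata with three clusters each
strata = ["North", "North", "North", "South", "South", "South"]
mos = [200, 300, 300, 50, 150, 200]
samp_size = {"North": 2, "South": 1}

probs = pps_probs(strata, mos, samp_size)
print(probs)  # [0.5, 0.75, 0.75, 0.125, 0.375, 0.5]
```

Within each stratum the probabilities sum to $n_h$: 2 for North and 1 for South in this toy frame.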

#### PSU Sample size

For a stratified sampling design, the sample size is provided using a Python dictionary. Python dictionaries allow us to pair the strata with the sample sizes. Let’s say that we want to select 3 clusters from stratum East, 2 from West, 2 from North, and 3 from South. The snippet of code below demonstrates how to create the Python dictionary. Note that it is important to spell the keys of the dictionary correctly: they must correspond to the values of the stratum variable (in our case, `region`).

In [13]:
psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}

print(f"\nThe sample size per domain is: {psu_sample_size}\n")


The sample size per domain is: {'East': 3, 'West': 2, 'North': 2, 'South': 3}
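
Because the dictionary keys must match the stratum labels, a quick check before selection can catch a misspelled key early. The helper below is a hypothetical convenience written for this tutorial with the standard library; it is not part of `samplics` and says nothing about how `samplics` itself reacts to mismatched keys.

```python
def check_sample_size_keys(samp_size, stratum_values):
    """Return (missing, unknown) labels; both empty means the keys match."""
    strata = set(stratum_values)
    keys = set(samp_size)
    missing = sorted(strata - keys)   # strata with no sample size
    unknown = sorted(keys - strata)   # keys matching no stratum (e.g. typos)
    return missing, unknown


# Stand-in for the stratum variable of the frame
regions = ["North", "South", "East", "West", "North", "East"]

good = {"East": 3, "West": 2, "North": 2, "South": 3}
bad = {"East": 3, "West": 2, "north": 2, "South": 3}  # "north" is misspelled

print(check_sample_size_keys(good, regions))  # ([], [])
print(check_sample_size_keys(bad, regions))   # (['North'], ['north'])
```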




To select the first stage sample of PSUs, we use the PPS method. Before selecting the sample, it is possible to use `inclusion_probs()` to calculate the probabilities of selection. 


In [14]:
stage1_design = SampleSelection(method="pps-sys", stratification=True, with_replacement=False)

psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    samp_unit=psu_frame["cluster"],
    samp_size=psu_sample_size,
    stratum=psu_frame["region"],
    mos=psu_frame["number_households_census"],
)

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame \n")
psu_frame.head(nb_obs)


First 15 observations of the PSU frame 



Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob
0,1,North,105,1,,0.151625
1,2,North,85,1,,0.122744
2,3,North,95,1,,0.137184
3,4,North,75,1,,0.108303
4,5,North,120,1,,0.173285
5,6,North,90,1,,0.129964
6,7,North,130,1,,0.187726
7,8,North,55,1,,0.079422
8,9,North,30,1,,0.043321
9,10,North,600,1,due to a large building,0.866426


#### PSU Selection
In this section, we select a sample of PSUs using a PPS method. In the section above, we calculated the probabilities of selection, but that step is not required when using `samplics`. We can directly use the method `select()` to calculate the probabilities of selection and select the sample in one run.

NB: `np.random.seed()` fixes the random seed to allow us to reproduce the random selection.

In [15]:
np.random.seed(23)

psu_sample = stage1_design.select(
    samp_unit=psu_frame["cluster"],
    samp_size=psu_sample_size,
    stratum=psu_frame["region"],
    mos=psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=True,
)

print("\nPSU sample without the non-sampled units\n")
psu_sample


PSU sample without the non-sampled units



Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
0,7,North,130,1,1,0.187726
1,10,North,600,1,1,0.866426
2,16,South,190,1,1,0.209174
3,24,South,75,1,1,0.082569
4,29,South,200,1,1,0.220183
5,34,East,305,1,1,0.210587
6,45,East,450,1,1,0.310702
7,52,East,700,1,1,0.483314
8,64,West,300,1,1,0.091673
9,86,West,280,1,1,0.085561



Any of the PPS methods listed earlier can be specified when instantiating the `SampleSelection` class; calling `select()` then draws the sample.

For example, if we wanted to select the sample using the Rao-Sampford method, we could use the following snippet of code.


In [16]:
np.random.seed(12345)

stage1_design_rs = SampleSelection(method="pps-rs", stratification=True, with_replacement=False)

psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = stage1_design_rs.select(
    samp_unit=psu_frame["cluster"],
    samp_size=psu_sample_size,
    stratum=psu_frame["region"],
    mos=psu_frame["number_households_census"],
)

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame with the sampling information \n")
psu_frame.head(nb_obs)


First 15 observations of the PSU frame with the sampling information 



Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob,psu_sample,psu_hits,psu_probs
0,1,North,105,1,,0.151625,0,0,0.151625
1,2,North,85,1,,0.122744,0,0,0.122744
2,3,North,95,1,,0.137184,1,1,0.137184
3,4,North,75,1,,0.108303,0,0,0.108303
4,5,North,120,1,,0.173285,0,0,0.173285
5,6,North,90,1,,0.129964,0,0,0.129964
6,7,North,130,1,,0.187726,0,0,0.187726
7,8,North,55,1,,0.079422,0,0,0.079422
8,9,North,30,1,,0.043321,0,0,0.043321
9,10,North,600,1,due to a large building,0.866426,0,0,0.866426


### Second Stage of Sampling 

To select the second stage sample, we need the second stage frame which is the list of all the households in the 10 selected clusters (psus). 

The households are referred to as the secondary sampling units (SSUs).

#### SSU Sampling Frame

In this tutorial, we will simulate the second stage frame. For the simulation, assume that the PSU frame was obtained from a previous census conducted several years before. We also assume that the current number of households in each cluster follows a normal distribution with a mean 5% higher than the census value and a standard deviation of 0.15 times the number of households from the census. Under these assumptions, we generate the following second stage frame of households. Note that the frame is created only for the selected PSUs.

In [17]:
import pandas as pd

# Create a synthetic second stage frame
census_size = psu_frame.loc[psu_frame["psu_sample"] == 1, "number_households_census"].values
stratum_names = psu_frame.loc[psu_frame["psu_sample"] == 1, "region"].values
cluster = psu_frame.loc[psu_frame["psu_sample"] == 1, "cluster"].values

np.random.seed(15)

listing_size = np.zeros(census_size.size)
for k in range(census_size.size):
    listing_size[k] = np.random.normal(1.05 * census_size[k], 0.15 * census_size[k])

listing_size = listing_size.astype(int)
hh_id = rr_id = cl_id = []
for k, s in enumerate(listing_size):
    hh_k1 = np.char.array(np.repeat(stratum_names[k], s)).astype(str)
    hh_k2 = np.char.array(np.arange(1, s + 1)).astype(str)
    cl_k = np.repeat(cluster[k], s)
    hh_k = np.char.add(np.char.array(cl_k).astype(str), hh_k2)
    hh_id = np.append(hh_id, hh_k)
    rr_id = np.append(rr_id, hh_k1)
    cl_id = np.append(cl_id, cl_k)

ssu_frame = pd.DataFrame(cl_id.astype(int))
ssu_frame.rename(columns={0: "cluster"}, inplace=True)
ssu_frame["region"] = rr_id
ssu_frame["household"] = hh_id

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the SSU frame\n")
ssu_frame.head(nb_obs)


First 15 observations of the SSU frame



Unnamed: 0,cluster,region,household
0,3,North,31
1,3,North,32
2,3,North,33
3,3,North,34
4,3,North,35
5,3,North,36
6,3,North,37
7,3,North,38
8,3,North,39
9,3,North,310


In [18]:
psu_sample = psu_frame.loc[psu_frame["psu_sample"] == 1]
ssu_counts = ssu_frame.groupby("cluster").count()
ssu_counts.drop(columns="region", inplace=True)
ssu_counts.reset_index(inplace=True)
ssu_counts.rename(columns={"household": "number_households_listed"}, inplace=True)

pd.merge(
    psu_sample[["cluster", "region", "number_households_census"]],
    ssu_counts[["cluster", "number_households_listed"]],
    on=["cluster"],
)

Unnamed: 0,cluster,region,number_households_census,number_households_listed
0,3,North,95,95
1,24,South,75,82
2,52,East,700,718
3,89,West,360,350



According to the simulated second stage frame shown above, cluster 3 has the same number of households as in the census (95). Clusters 24 and 52 list more households than the census (82 vs. 75 and 718 vs. 700), while cluster 89 lists fewer (350 vs. 360).

Now that we have a second stage frame, let’s use samplics to calculate the probabilities of selection and to select a sample. The second stage sample size is 150 households and the strategy is to select 15 households per cluster.



#### SSU (household) Selection

The second stage probabilities of selection are conditional on the first stage realization. For this stage, simple random selection (srs) and systematic selection (sys) are common methods used to select households. For this example, we use srs to select 15 households from each cluster. Conditional on the first stage, the second stage selection is a stratified srs where the clusters are the strata. More generally, we have \begin{equation} p_{hij} = \frac{m_{hi}}{M_{hi}^{'}} \end{equation} where $p_{hij}$ is the conditional probability of selection for unit $j$ from cluster $i$ in stratum $h$, and $m_{hi}$ and $M_{hi}^{'}$ are the sample size and the number of secondary sampling units listed for cluster $i$ in stratum $h$, respectively.
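
As a quick arithmetic check of this conditional probability, the sketch below divides the per-cluster sample size by the number of households listed in each cluster, using the counts from the merged table above. The last two lines multiply by a first-stage probability to obtain an overall probability of selection; that product is the standard result for two-stage designs rather than something computed in this tutorial, and the value 0.137184 is simply read from the PSU frame output for cluster 3.

```python
# Listed household counts per selected cluster (from the merged table above)
listed = {3: 95, 24: 82, 52: 718, 89: 350}
m = 15  # households to select per cluster

# Conditional second-stage probability: sample size / listed households
cond_probs = {cluster: m / size for cluster, size in listed.items()}
print({c: round(p, 6) for c, p in cond_probs.items()})

# Overall probability for a household in cluster 3:
# first-stage probability times conditional second-stage probability
psu_prob_cluster3 = 0.137184
overall_cluster3 = psu_prob_cluster3 * cond_probs[3]
print(round(overall_cluster3, 6))
```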


In this scenario, the sample size is the same in each stratum. Hence, the parameter `samp_size` does not need to be a Python dictionary; we only need to provide 15 in the function call. 

In [19]:
stage2_design = SampleSelection(method="srs", stratification=True, with_replacement=False)

np.random.seed(11)
ssu_sample, ssu_hits, ssu_probs = stage2_design.select(
    samp_unit=ssu_frame["household"], samp_size=15, stratum=ssu_frame["cluster"]
)

ssu_frame["ssu_sample"] = ssu_sample
ssu_frame["ssu_hits"] = ssu_hits
ssu_frame["ssu_probs"] = ssu_probs

ssu_frame[ssu_frame["ssu_sample"] == 1].sample(15)

Unnamed: 0,cluster,region,household,ssu_sample,ssu_hits,ssu_probs
46,3,North,347,1,1,0.157895
609,52,East,52433,1,1,0.020891
573,52,East,52397,1,1,0.020891
1140,89,West,89246,1,1,0.042857
748,52,East,52572,1,1,0.020891
68,3,North,369,1,1,0.157895
227,52,East,5251,1,1,0.020891
99,24,South,245,1,1,0.182927
222,52,East,5246,1,1,0.020891
162,24,South,2468,1,1,0.182927


To use systematic selection, we just need to replace `method="srs"` with `method="sys"`.
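
For readers unfamiliar with systematic selection, the idea is to pick a random start and then take units at a fixed interval. The sketch below is a generic standard-library illustration with a fractional interval; the function name `systematic_indices` is hypothetical, and the code does not claim to reproduce the exact algorithm used by `samplics`.

```python
import random


def systematic_indices(pop_size, samp_size, rng):
    """Return samp_size 0-based positions using a random start and a fixed step."""
    step = pop_size / samp_size      # sampling interval, may be fractional
    start = rng.uniform(0, step)     # random start within the first interval
    # min() guards against a floating-point edge case at the upper end
    return [min(int(start + k * step), pop_size - 1) for k in range(samp_size)]


rng = random.Random(11)

# For example, 15 households out of a cluster with 95 listed households
positions = systematic_indices(pop_size=95, samp_size=15, rng=rng)
print(positions)
```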

Another common approach is to use a rate for selecting the sample. 

Instead of selecting 15 households from the 95 listed in the first cluster, we may want to select with a rate of 15/95, and similarly for the other clusters.

In [20]:
number_strata = 4

rates = np.repeat(15, number_strata) / ssu_counts["number_households_listed"].values

ssu_rates = dict(zip(np.unique(ssu_frame["cluster"]), rates))

ssu_rates

{3: 0.15789473684210525,
 24: 0.18292682926829268,
 52: 0.020891364902506964,
 89: 0.04285714285714286}


A sample is selected using the rates as follows:


In [21]:
np.random.seed(22)

stage2_design2 = SampleSelection(method="sys", stratification=True, with_replacement=False)

ssu_frame["ssu_sample_r"], ssu_frame["ssu_hits_r"], _ = stage2_design2.select(
    ssu_frame["household"], stratum=ssu_frame["cluster"], samp_rate=ssu_rates
)

ssu_frame.head(25)

Unnamed: 0,cluster,region,household,ssu_sample,ssu_hits,ssu_probs,ssu_sample_r,ssu_hits_r
0,3,North,31,0,0,0.157895,0,0
1,3,North,32,1,1,0.157895,0,0
2,3,North,33,1,1,0.157895,0,0
3,3,North,34,0,0,0.157895,0,0
4,3,North,35,0,0,0.157895,0,0
5,3,North,36,0,0,0.157895,1,1
6,3,North,37,0,0,0.157895,0,0
7,3,North,38,0,0,0.157895,0,0
8,3,North,39,0,0,0.157895,0,0
9,3,North,310,0,0,0.157895,0,0



Let’s store the first- and second-stage samples.


In [22]:
psu_sample[["cluster", "region", "psu_prob"]].to_csv("psu_sample.csv")

ssu_sample = ssu_frame.loc[ssu_frame["ssu_sample"] == 1]
ssu_sample[["cluster", "household", "ssu_probs"]].to_csv("ssu_sample.csv")