# ___Probability Sampling___
-----------------

## ___Simple Random Sampling (SRS)___
-----------------

In [1]:
# We start with a population of N units (the sampling frame) and randomly select n units from it.
# n is the sample size.
# Every unit in the frame has a probability of n/N of being selected for the sample.

# ___$P(x)~=~(\frac{n}{N})$___

In [3]:
# This probability is the same for every entity in the sampling frame.
# This also means that all possible samples of size n are equally likely.
# Standard estimates computed from an SRS, such as the sample mean, are unbiased for the corresponding population value.

In [4]:
# Imagine an SRS of 134 people selected from a population of 10,000 people.
# Collecting data from these 134 randomly selected people can help us make representative inferences about the larger population.
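
A minimal sketch of that draw, assuming a made-up population in which each person has one numeric measurement (the values and the seed are illustrative only). It uses `random.sample`, which samples without replacement, and compares the sample mean with the population mean.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N = 10_000  # population size
n = 134     # sample size

# Hypothetical population: one made-up measurement (e.g. age) per person.
population = [random.gauss(40, 12) for _ in range(N)]

# random.sample draws without replacement; every subset of size n is equally likely.
sample = random.sample(population, n)

population_mean = sum(population) / N
sample_mean = sum(sample) / n

print(f"inclusion probability n/N = {n / N}")
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

The two means will usually be close but not identical; the gap is sampling error, which shrinks as n grows.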

In [5]:
# SRS can be done with or without replacement.

# SRS with replacement
# Here, entities selected once can be selected again.
# At every single draw, each entity has the same probability of selection, 1/N.
# Over n draws, the chance of being picked at least once is 1 - (1 - 1/N)^n, which is close to n/N when n is much smaller than N.

# SRS without replacement
# Here, entities selected once cannot be selected again.

# NOTE
# The per-draw probability of selection increases as we proceed, because previously sampled individuals are not returned and the pool shrinks.
# At the i-th draw, N - (i - 1) entities remain, so each remaining entity is picked with probability 1/(N - i + 1).
# Taken over all n draws, every entity's overall probability of ending up in the sample is still n/N.
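
A quick numerical check of the selection probabilities under both schemes, using exact fractions. The population size and sample size here are small hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
from fractions import Fraction

N, n = 10, 3  # small hypothetical population and sample size

# Without replacement: a unit is picked on draw i only if it survived
# draws 1..i-1, and is then chosen among the N - (i - 1) units left.
p_without = Fraction(0)
p_not_yet_picked = Fraction(1)
for i in range(1, n + 1):
    remaining = N - (i - 1)
    p_pick_now = Fraction(1, remaining)  # conditional probability on draw i
    p_without += p_not_yet_picked * p_pick_now
    p_not_yet_picked *= 1 - p_pick_now

# With replacement: every draw picks a given unit with probability 1/N,
# so the unit appears at least once with probability 1 - (1 - 1/N)^n.
p_with = 1 - (1 - Fraction(1, N)) ** n

print(p_without)  # 3/10, i.e. exactly n/N
print(p_with)     # 271/1000, slightly below n/N
```

The per-draw probability without replacement rises (1/10, 1/9, 1/8), yet the overall inclusion probability works out to n/N.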

In [6]:
# SRS is rarely used in real life
# Because collecting n representative entities from a large population can be prohibitively expensive.

# Imagine sampling 1,000 random individuals from the whole population of the USA.
# The population of the USA is 331.9 million

# Fraction of the US population that would end up in such a sample
1_000 / (331.9 * 1_000_000)

3.0129557095510694e-06

In [7]:
# Suppose we have to drive to the residences of these 1,000 random people all over the USA,
# imagine how expensive that expedition will be.
# Thus, SRS is practically limited to smaller populations.

In [8]:
# So, less expensive alternatives to SRS are needed for larger populations.

## ___SRS and IID data___
-----------------

In [1]:
# IID observations are independent of one another and identically distributed, i.e. all drawn from the same distribution.
# SRS will generate IID data, theoretically.
    # We select a sample
    # Take the measurements from the entities in the sample
# In theory, this should produce an IID sample.
# (Strictly, this holds for sampling with replacement; without replacement it is a close approximation when n is much smaller than N.)

# All randomly sampled entities give measurements that are independent: selecting one unit tells us nothing about which other
# units are selected.

# These measurements are identically distributed: each one is a draw from the same distribution,
# namely the distribution of that measurement across the larger population of entities, i.e. the sampling frame.
# As a result, the characteristics of the entities in the sample satisfactorily reflect those of the entities in the larger population.

In [2]:
# SO WE HAVE A REPRESENTATIVE RANDOMLY SELECTED SAMPLE OF UNITS THAT ARE INDEPENDENT OF ONE ANOTHER.

### ___SRS example.___

In [3]:
# A customer service database contains 2,500 emails received from customers in 2018.
# The director of the customer services division has received several complaints about poor and slow customer service.
# So he now wants an estimate of how long the customer service agents take to respond to emails from customers.

# Unfortunately, the database does not record the reply time alongside the received emails.
# A record of a received email only has the time of arrival.
# To trace the time of reply, a staff member needs to manually search for the given email, find the thread of that conversation
# and note down the time marked in the reply email.

# This is a very tedious process that would take a very long time for 2,500 emails.
# So, the director asks the analytics team to sample just 150 emails, process them and find the estimate!

In [4]:
# Sampling approaches

# Naive approach
# Just process the first 150 emails in the database.
    # Likely to be biased
    # At the start, the customer service employees may have been inexperienced and taken more time to respond
    # Eventually they got better at their job, so now they are able to respond faster.
    # At the start, the company may have had very few customer service employees, so they might have taken longer to respond
    # Alternatively, at the start the email traffic may have been very small, so the employees could have responded very quickly
# No randomness in selection here.
# The first 150 emails are not representative of the response times across all 2,500 emails.
# This is essentially nonprobability sampling.
    
# Alternative
# Number the emails from 1 to 2,500
# Randomly pick 150 of them using a random number generator.
# Here every email has a known probability of selection
# P(x) = n/N
#      = 150 / 2500
# This will produce a random representative sample (in theory)
# The estimated mean response time will be an unbiased estimate of the mean response time of the whole set of 2,500 emails.
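
The random selection of emails could be sketched as follows. The seed and variable names are illustrative assumptions; only the numbers 2,500 and 150 come from the example.

```python
import random

N = 2_500  # emails in the database
n = 150    # emails the team can afford to process

rng = random.Random(2018)  # seeded generator, only so the draw can be repeated

# Email IDs 1..N; sample n of them without replacement.
email_ids = range(1, N + 1)
sampled_ids = sorted(rng.sample(email_ids, n))

print(sampled_ids[:10])  # first few IDs to look up manually
print(n / N)             # 0.06
```

Each email therefore has a 6% chance of being processed, regardless of when it arrived.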

## ___Complex Probability Sampling___
---------------

In [5]:
# Not every probability sample is an SRS
# SRS has its own drawbacks: it is very expensive for larger populations.
# This is where complex probability sampling helps.

In [6]:
# SRS is used when the population is relatively small, or
# when the data needed is already available, e.g. in a database or a cabinet full of files (no need for data collection).
# Then we just have to sample it.
# In both cases each entity in the population has an equal likelihood of getting selected.

# In complex sampling we use specific features of probability sample design to minimize the sampling costs.
# The term complex sampling is used for anything more complicated than SRS.

### ___Key features of complex sampling___
--------------

#### ___(1) Stratification___

In [7]:
# The whole population is partitioned into different strata, based on a criterion
# And each entity belongs to just one stratum.
# This ensures the sample has a representative mix of entities from each stratum
# And reduces the variance of survey estimates.

# Consider a stadium with 4 quadrants.
# We randomly select 100 audience members from the stadium; by chance, they could come mostly or even entirely from one quadrant.
# In that instance, our sample is not really representative of all the people in the stadium.
# This is still a random, probability sample but not a representative one.

# With stratification, we would allocate a quota of 25 to each quadrant and randomly select 25 people from each quadrant.
# A more representative sample would result.
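
The stadium example can be sketched as an independent SRS inside each quadrant. The quadrant names and seat counts below are made up for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical stadium: 4 quadrants with made-up numbers of spectators.
quadrants = {
    "north": [f"N{i}" for i in range(5_000)],
    "east":  [f"E{i}" for i in range(4_000)],
    "south": [f"S{i}" for i in range(6_000)],
    "west":  [f"W{i}" for i in range(5_000)],
}

per_stratum = 25  # equal allocation: 25 people from every quadrant

# An independent SRS inside each stratum guarantees every quadrant is covered.
stratified_sample = {
    name: rng.sample(people, per_stratum) for name, people in quadrants.items()
}

for name, people in quadrants.items():
    # Selection probability differs by quadrant when quadrant sizes differ.
    print(name, len(stratified_sample[name]), per_stratum / len(people))
```

With equal allocation and unequal quadrant sizes, people in smaller quadrants have a higher probability of selection, but that probability is still known for everyone.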

#### ___(2) Cluster sampling___

In [8]:
# In cluster sampling we first select clusters of population units rather than individual units.
# e.g. counties in the USA (clusters) are sampled with a known probability of selection within each region (stratum).

# x counties from the Western United States     --> clusters sampled from stratum A
# y counties from the Midwestern United States  --> clusters sampled from stratum B
# z counties from the Southern United States    --> clusters sampled from stratum C
# etc.

# Now we have a sample of counties that is representative of all the counties in the USA.
# We can visit these counties to select individuals or households.
# The effort becomes more focused and coordinated, which reduces costs.

# Instead of visiting households that are sampled completely at random all over the USA,
# this method saves a lot of cost by grouping the sites spatially, so that one can visit county by county to select people.

# Now that we have a sample of counties (the clusters, grouped by region),
# we can randomly sample households within those counties with a defined probability of selection.
# e.g. get all the addresses in County x,
# pick 100 addresses at random and visit those households to collect the measurements.
# This part is very similar to SRS.

# This employs a hierarchical approach to sampling.
# Population -> Sample of counties -> For each county, a random sample of households.
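
A minimal two-stage sketch of this hierarchy for a single stratum. The county names, household counts and sample sizes are hypothetical.

```python
import random

rng = random.Random(7)

# Hypothetical stratum: 12 counties, each with a made-up number of households.
counties = {
    f"county_{c}": [f"c{c}_hh{h}" for h in range(rng.randint(200, 400))]
    for c in range(12)
}

n_counties = 3      # stage 1: counties (clusters) to visit
n_households = 100  # stage 2: households to sample in each visited county

# Stage 1: SRS of counties.
chosen_counties = rng.sample(sorted(counties), n_counties)

# Stage 2: SRS of households within each chosen county.
sample = {
    county: rng.sample(counties[county], n_households)
    for county in chosen_counties
}

for county, households in sample.items():
    print(county, len(households), "of", len(counties[county]))
```

Field staff only need to travel to the three chosen counties, which is where the cost saving comes from.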

### ___Determinants of an entity's probability of selection in complex sampling___

In [9]:
# Number of clusters sampled from each stratum (e.g. counties selected from Eastern USA)

# Total number of clusters in each stratum (e.g. all the counties in Southern USA)

# Number of units ultimately sampled from each cluster (e.g. number of households from County A)

# Total number of units in each cluster (e.g. all the households in County A)

### ___Finding the probability of selection in Complex Sampling___

In [1]:
# Select a clusters out of the A clusters in a stratum (the population of clusters)
# Select b entities out of the B entities in a selected cluster

# Hierarchy level 1 ---> a from A
# Hierarchy level 2 (lower) ---> b from B

# ___$P(\text{entity})~=~(\frac{a}{A}) \cdot (\frac{b}{B})$___

In [2]:
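# A stratum, e.g. Midwestern USA, containing A counties -> (A)
# Number of randomly selected counties in Midwestern USA -> (a)

# Number of households in one selected county -> (B)  (the county is one of the a selected clusters)
# Number of households randomly selected from that county -> (b)

# ___Probability that a given cluster is among the `a` selected from the `A` in the stratum$~=~P(\text{cluster})~=~(\frac{a}{A})$___

# ___Probability that a given entity is among the `b` selected from the `B` in its cluster, given that the cluster was selected$~=~P(\text{entity} \mid \text{cluster})~=~(\frac{b}{B})$___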
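
The product of the two stage probabilities can be written as a small helper. The function name and the numbers in the call are hypothetical; exact fractions are used so the result is easy to check by hand.

```python
from fractions import Fraction


def selection_probability(a, A, b, B):
    """Overall selection probability in a two-stage design.

    a of A clusters are sampled, then b of the B entities in a sampled cluster.
    """
    return Fraction(a, A) * Fraction(b, B)


# Hypothetical numbers: 5 of 50 counties, then 100 of 2,000 households.
p = selection_probability(a=5, A=50, b=100, B=2_000)
print(p, float(p))  # 1/200 0.005
```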

## __Q__
----------------
In the northeastern region of the United States (a stratum), suppose that 20 counties (clusters) are sampled at random from a list of 300 counties, and 100 housing units (elements) are sampled from a purchased list of housing units in each sampled county. In the southeastern region of the United States, suppose that 10 counties are sampled at random from a list of 200 counties, and 100 housing units are sampled from a list of housing units in each county. What are the probabilities of selection for housing units in each of the two strata?

1) 20/100 and 10/100
2) 20/300 and 10/200
3) 100/300 and 100/200
4) We cannot determine the probabilities of selection from the information provided.

# ___$P(h_{NE})~=~(\frac{100}{?}) \cdot (\frac{20}{300})$___
# ___$P(h_{SE})~=~(\frac{100}{?}) \cdot (\frac{10}{200})$___

In [3]:
# The answer is 4: we do not know how many housing units there are in each selected county (B), so we cannot compute the probabilities.
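
To see why the missing county sizes matter, the sketch below plugs in a few hypothetical values for the number of housing units per county. These values are assumptions for illustration only; the question does not supply them.

```python
from fractions import Fraction


def p_housing_unit(counties_sampled, counties_total, units_sampled, units_in_county):
    """Two-stage selection probability of one housing unit."""
    return Fraction(counties_sampled, counties_total) * Fraction(units_sampled, units_in_county)


# Hypothetical county sizes; the question does not give them.
for units_in_county in (1_000, 5_000, 20_000):
    p_ne = p_housing_unit(20, 300, 100, units_in_county)  # northeastern stratum
    p_se = p_housing_unit(10, 200, 100, units_in_county)  # southeastern stratum
    print(units_in_county, p_ne, p_se)
```

The probabilities change with every assumed county size, so they cannot be pinned down from the information given.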

## ___Complex Sampling Example___

In [4]:
# Let's use NHANES (the National Health and Nutrition Examination Survey).

# Divide the USA into different regions, based on geography and population density.
# This division forms the strata (stratification).
# By sampling from every stratum, we ensure representativeness.

# Select a random sample of counties from each region (stratum).
# The sampled counties form the clusters.
# This reduces costs.

In [5]:
# Sample certain socio-demographic groups at higher rates within counties,
# a practice known as oversampling.
# Certain demographic groups (e.g. people of Mexican, Chinese or Vietnamese origin) may be over-represented in certain counties,
# and certain counties may represent the extremes of the socioeconomic hierarchy, like extremely poor or rich households.

In [6]:
# This might seem to introduce bias at first glance
# Because different types of people are going to be represented in different proportions in the samples
# And that's okay. We still have a probability design and this can be used to make representative inferences.
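
One common way to use the known selection probabilities is to weight each sampled person by the inverse of that probability. The sketch below uses made-up group sizes and means, and assumes for simplicity that each group's sample mean equals its population mean, to show how an oversampled group distorts a naive average and how weighting corrects it.

```python
# Hypothetical population: group sizes (N), sample sizes (n) and group means.
groups = {
    "majority": {"N": 9_000, "n": 100, "mean": 50.0},
    "minority": {"N": 1_000, "n": 100, "mean": 70.0},  # oversampled group
}

total_N = sum(g["N"] for g in groups.values())
total_n = sum(g["n"] for g in groups.values())

true_mean = sum(g["N"] * g["mean"] for g in groups.values()) / total_N

# Naive mean treats all sampled people alike, so the oversampled group counts too much.
naive_mean = sum(g["n"] * g["mean"] for g in groups.values()) / total_n

# Weight of each sampled person = 1 / selection probability = N / n within the group.
weighted_total = sum(g["n"] * (g["N"] / g["n"]) * g["mean"] for g in groups.values())
weight_sum = sum(g["n"] * (g["N"] / g["n"]) for g in groups.values())
weighted_mean = weighted_total / weight_sum

print(true_mean, naive_mean, weighted_mean)  # 52.0 60.0 52.0
```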

In [7]:
# The NHANES approach

# Stage 1   -> The whole USA is divided into regions from which county clusters are sampled randomly.
#              From each of these clusters, one county is randomly selected.
# This step improves representativeness

# Stage 2   -> From a given county multiple segments (areas) are randomly sampled.
# This step reduces sampling expenses.

# Stage 3   -> Select a random sample of households from each selected segment.

# Stage 4   -> Visit the household and randomly select one member for measurements.

In [None]:
# At all 4 stages we know what the probabilities of selection are.
# This gives us the ability to calculate the probability of selection for each entity at every step.
# Ultimately, we can compute the overall probability of inclusion for every single individual.
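
As a sketch, the overall inclusion probability is the product of the selection probabilities at each stage. The stage probabilities below are hypothetical placeholders, not actual NHANES figures.

```python
from fractions import Fraction
from math import prod

# Hypothetical selection probabilities at each stage (not actual NHANES figures).
stage_probabilities = {
    "county":    Fraction(1, 20),
    "segment":   Fraction(1, 10),
    "household": Fraction(1, 50),
    "person":    Fraction(1, 3),
}

# Overall probability that a given individual ends up in the sample.
p_inclusion = prod(stage_probabilities.values())
print(p_inclusion)  # 1/30000
```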