# ___Probability Sampling___
-----------------

## ___Simple Random Sampling (SRS)___
-----------------

In [1]:
# We start with a population of N units (the sampling frame) and randomly select n units from the frame.
# n is the sample size.
# Every unit in the frame has a probability of n/N of being selected for the sample.

# ___$P(x)~=~(\frac{n}{N})$___

In [3]:
# This probability is the same for every entity in the sampling frame.
# It also means that all possible samples of size n are equally likely.
# Estimates such as the sample mean computed from an SRS are unbiased for the corresponding population quantity.
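
The claims above can be checked exactly on a toy case. The sketch below assumes a made-up population of five units (the names and values are purely illustrative), enumerates every possible sample of size 2, and confirms that one unit's inclusion probability is n/N and that the sample means average out to the population mean.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population: N = 5 units with made-up values.
population = {"a": 2, "b": 4, "c": 6, "d": 8, "e": 10}
n = 2

# Every subset of size n is one possible sample; under SRS each is equally likely.
samples = list(combinations(population, n))

# Inclusion probability of unit "a": the share of samples that contain it.
p_a = Fraction(sum("a" in s for s in samples), len(samples))

# Average of the sample means over all equally likely samples.
mean_of_sample_means = Fraction(
    sum(Fraction(sum(population[u] for u in s), n) for s in samples),
    len(samples),
)
population_mean = Fraction(sum(population.values()), len(population))

print(len(samples), p_a, mean_of_sample_means, population_mean)
# prints: 10 2/5 6 6
```

Exhaustive enumeration only works for tiny populations, but it shows the unbiasedness property without relying on simulation noise.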

In [4]:
# Imagine an SRS of 134 people selected from a population of 10,000 people.
# Collecting data from these 134 randomly selected people can help us make representative inferences about the larger population.
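
A rough simulation of this scenario, assuming a made-up population of 10,000 measurements (the normal distribution, its parameters, and the seed are illustrative choices, not part of the original example):

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up measurements (e.g. heights in cm).
N, n = 10_000, 134
population = [rng.gauss(170, 10) for _ in range(N)]

# SRS without replacement: every unit has the same chance n/N of being included.
sample = rng.sample(population, n)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(f"population mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

The sample mean should land close to the population mean, although any single sample will miss it by some amount.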

In [5]:
# SRS can be done with or without replacement.

# SRS with replacement
# Here, entities selected once can get selected again.
# This means that at every single draw, each entity has the same probability of being picked, 1/N.

# SRS without replacement
# Here, entities selected once cannot be selected again.

# DOUBT
# Here, the per-draw probability of selection increases as we proceed with sampling,
# because previously sampled individuals are not returned, so the pool of remaining units shrinks.
# At the i-th draw, each of the N-(i-1) remaining units has a probability of 1/(N-i+1) of being picked.
# Over all n draws, the overall probability that a given unit ends up in the sample is still n/N.
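
A minimal sketch of both variants using Python's standard `random` module, assuming a hypothetical frame of 10 numbered units. It also computes the exact chance that one fixed unit ends up in a without-replacement sample by chaining the per-draw probabilities, where each remaining unit has a 1/(N-i+1) chance at draw i.

```python
import random
from fractions import Fraction

rng = random.Random(0)      # fixed seed for repeatability
frame = list(range(1, 11))  # hypothetical sampling frame, N = 10
N, n = len(frame), 4

with_replacement = rng.choices(frame, k=n)   # a unit may appear more than once
without_replacement = rng.sample(frame, n)   # all n units are distinct

# Exact inclusion probability of one fixed unit without replacement:
# sum over draws i of P(not picked in the first i-1 draws) * P(picked at draw i).
inclusion = Fraction(0)
still_unpicked = Fraction(1)
for i in range(1, n + 1):
    p_draw = Fraction(1, N - i + 1)  # chance at draw i, given the unit is still in the pool
    inclusion += still_unpicked * p_draw
    still_unpicked *= 1 - p_draw

print(with_replacement, without_replacement, inclusion)
```

The per-draw chance grows as the pool shrinks, yet the terms add up to exactly n/N, here 4/10.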

In [6]:
# SRS is rarely used in real life
# Because collecting data from n randomly selected entities in a large population can be prohibitively expensive.

# Imagine sampling 1,000 random individuals from the whole population of the USA.
# The population of the USA is 331.9 million.

1_000 / (331.9 * 1_000_000)

3.0129557095510694e-06

In [7]:
# Suppose we have to drive to the residences of these 1,000 random people all over the USA;
# imagine how expensive that expedition would be.
# Thus, SRS is practically limited to smaller populations.

In [8]:
# So, less expensive alternatives to SRS are needed for larger populations.

## ___SRS and IID data___
-----------------

In [1]:
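# IID observations are independent and identically distributed: each comes from the same distribution and none depends on the others.
# SRS will generate IID data, theoretically.
# (Strictly, this holds for SRS with replacement; without replacement it is a good approximation when n is much smaller than N.)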
    # We select a sample
    # Take the measurements from the entities in the sample
# In theory, this should produce an IID sample.

# All randomly sampled entities will give measurements that are independent, such that there's no connection between the units that are
# randomly sampled.

# These measurements from the entities in the sample are identically distributed.
# That is, they represent a larger population of entities, i.e. the sampling frame.
# The characteristics of the entities in the sample satisfactorily reflect the characteristics of entities in the larger population.

In [2]:
# SO WE HAVE A REPRESENTATIVE RANDOMLY SELECTED SAMPLE OF UNITS THAT ARE INDEPENDENT OF ONE ANOTHER.

### ___SRS example.___

In [3]:
# A customer service database contains 2,500 emails received from customers in 2018.
# The director of the customer service division has received several complaints about poor and slow customer service.
# So he now wants an estimate of how long the customer service agents take to respond to emails from customers.

# Unfortunately, the database did not record the reply time in the received email records.
# A record of a received email only has the time of arrival.
# To trace the time of reply, a staff member needs to manually search for the given email, find the thread of that conversation,
# and note down the time of reply marked in the reply email.

# This is a very tedious process that would take a very long time for 2,500 emails.
# So the director asks the analytics team to sample just 150 emails, process them and find the estimate!

In [None]:
# Sampling approaches

# Naive approach
# Just process the first 150 emails in the database.
    # Likely to be biased
    # At the start, the customer service employees may have been inexperienced and taken more time to respond
    # Eventually they got better at their job, and now they are able to respond faster.
    # At the start, the company may have had very few customer service employees, so they might have taken long to respond
    # At the start, the email traffic may have been very small, so the employees could have responded very quickly
# No randomness in selection here.
# The selected entities are not representative of the whole set of 2,500 email response times.
# This is essentially nonprobability sampling.
    
# Alternative
# Number the emails from 1 to 2,500
# Randomly pick 150 of them using a random number generator.
# Here every email has a known probability of selection
# P(x) = n/N
#      = 150 / 2500
# This will produce a random representative sample (in theory)
# The estimated mean response time will be an unbiased estimate of the mean response time of the whole set of 2,500 emails.
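
The two approaches can be compared in a small simulation. In the sketch below the reply times are entirely made up: they are drawn from an exponential distribution and scaled so that earlier emails tend to get slower replies, as an assumed stand-in for the trend described above. In the real scenario only the 150 sampled emails would be looked up by hand.

```python
import random
import statistics

rng = random.Random(2018)  # fixed seed for repeatability

N, n = 2_500, 150
email_ids = range(1, N + 1)  # emails numbered 1..2,500

# Hypothetical reply times in hours, made up for illustration only:
# early emails are given slower replies to mimic a trend over the year.
reply_hours = {i: rng.expovariate(1 / 24) * (1.5 - i / N) for i in email_ids}

naive_ids = list(email_ids)[:n]     # first 150 emails, no randomness
srs_ids = rng.sample(email_ids, n)  # SRS: each email has a 150/2500 chance

true_mean = statistics.fmean(reply_hours.values())
naive_mean = statistics.fmean(reply_hours[i] for i in naive_ids)
srs_mean = statistics.fmean(reply_hours[i] for i in srs_ids)

print(f"selection probability = {n / N}")
print(f"true = {true_mean:.1f} h, naive = {naive_mean:.1f} h, SRS = {srs_mean:.1f} h")
```

Under this assumed trend the naive estimate would tend to overshoot, while the SRS estimate varies around the true mean from sample to sample.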