# ___Inference for Non-Probability Samples___
---------------------

In [2]:
# So far, we have discussed how to make inferences about a population based on a single probability sample.
# But what if our sample is a non-probability sample?

In [3]:
# Non-probability samples do not give us the luxury of leveraging sampling theory.
# That is, we cannot base the accuracy of our inferences on sampling distributions.

In [4]:
# Sampling theory is a set of concepts that allow us to make inferences about populations based on samples.

In [6]:
# There are two approaches to making inferences from non-probability samples:
    # Quasi randomization.
    # Population modelling.

## ___Quasi randomization___
------------

In [7]:
# Involves combining data from a non-probability sample with data from a probability sample that collected the same types of measures.
# It is important that the two samples are independent and that they collected the same measures (focused on the same features).

In [9]:
# e.g. assume we are measuring blood pressure, age, sex and ethnicity on a non-probability sample of volunteers.
# Because the participants are volunteers, our sample is likely biased and not representative of the larger population.

# But we can get data from a probability sample that collected the same variables and combine it with our non-probability sample.
# e.g. the NHANES data

In [13]:
# So we have our non-probability sample with 2,000 records,
# we randomly select 2,000 records (a probability sample) from NHANES,
# and stack them on top of one another.

In [15]:
# Then we introduce a binary categorical label as a new column, to identify which sample a given record belongs to.
# let's say if the label is 0, then that record came from the non-probability sample
# if the label is 1, then that record came from the probability sample (NHANES).

In [16]:
# We may even be interested in a variable that was measured only in the non-probability sample.
# So that variable will be absent in the probability sample (NHANES).

# let's say NHANES didn't care to collect information about the participants' marital status, number of children, country of origin,
# but we did.
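The stacking and labelling steps above can be sketched with pandas. This is a minimal illustration on simulated stand-in records, not real NHANES data; the column names (`systolic_bp`, `survey_weight`, `marital_status`) and the helper `make_fake_sample` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_fake_sample(n, rng):
    """Simulated stand-in records (hypothetical columns, not real data)."""
    return pd.DataFrame({
        "age": rng.integers(18, 80, size=n),
        "sex": rng.choice(["F", "M"], size=n),
        "systolic_bp": rng.normal(125, 15, size=n).round(),
    })

volunteers = make_fake_sample(2000, rng)    # non-probability sample
nhanes_like = make_fake_sample(2000, rng)   # stand-in for the NHANES records

# Only the probability sample carries survey weights (made-up values here).
nhanes_like["survey_weight"] = rng.uniform(500, 5000, size=2000)

# A variable measured only in the non-probability sample.
volunteers["marital_status"] = rng.choice(["single", "married"], size=2000)

# Binary label: 0 = non-probability sample, 1 = probability sample.
volunteers["label"] = 0
nhanes_like["label"] = 1

# Stack the two samples; columns missing from one sample are filled with NaN.
stacked = pd.concat([volunteers, nhanes_like], ignore_index=True, sort=False)
print(stacked.groupby("label")["marital_status"].apply(lambda s: s.isna().mean()))
```

In the stacked frame, `marital_status` is entirely missing for the rows with label 1, which mirrors the situation described above.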

In [17]:
# Then we fit a logistic regression model to predict the probability that a record has label 0,
# in other words, the probability of being in the non-probability sample given the other features of that record.
# e.g. the probability of a 36-year-old French woman with 145 mmHg blood pressure being in the non-probability sample.

In [18]:
# In fitting that model, we'll weight the non-probability cases by 1, so we do not give them any differential weights.
# We'll weight the probability cases by their survey weights (i.e. the inverses of their probabilities of selection).

In [19]:
# So, in essence, we try to simulate what's happening in the population based on the probability sample
# and then we try to determine the probability of being in the non-probability sample, based on the model.
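The weighted model fit described above might look like the sketch below. It is only an illustration under several assumptions: the data are simulated, age is the only predictor, every reference record has a made-up survey weight of 100, and the small Newton-Raphson routine `fit_weighted_logit` is a hypothetical helper written for this example rather than a library function. The response is coded as 1 for records with label 0 (the non-probability sample), so the fitted probability is the probability of being in the non-probability sample.

```python
import numpy as np

def fit_weighted_logit(X, y, w, n_iter=50, tol=1e-8):
    """Weighted logistic regression fitted by Newton-Raphson.

    X: (n, p) design matrix including an intercept column
    y: 0/1 response, w: non-negative case weights
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 2000

# Simulated ages: volunteers skew younger than the reference sample.
age_volunteers = np.clip(rng.normal(35, 10, size=n), 18, 80)
age_reference = rng.uniform(18, 80, size=n)
survey_weight = np.full(n, 100.0)  # made-up survey weights

age = np.concatenate([age_volunteers, age_reference])
age_std = (age - age.mean()) / age.std()
X = np.column_stack([np.ones(2 * n), age_std])

# Response: 1 = in the non-probability sample (label 0), 0 = reference sample.
in_nonprob = np.concatenate([np.ones(n), np.zeros(n)])

# Case weights: 1 for volunteers, the survey weight for reference records.
case_weight = np.concatenate([np.ones(n), survey_weight])

beta = fit_weighted_logit(X, in_nonprob, case_weight)

# Predicted probability of being in the non-probability sample, per volunteer.
p_hat = 1.0 / (1.0 + np.exp(-(X[:n] @ beta)))
print(beta, p_hat[:5])
```

Because the simulated volunteers are younger than the reference records, the fitted age coefficient comes out negative: younger people are predicted to be more likely to appear in the volunteer sample.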

## ___The concept of quasi randomization___

In [20]:
# We can predict the probability of being in the non-probability sample, within whatever population is represented by the probability sample.
# e.g. being a volunteer, or being a student of the professor who conducted the survey

In [21]:
# This is what we get from stacking the two datasets together: a slice of the overall population.
# Then we predict the probability of an individual in the larger population appearing in the non-probability sample of volunteers.

In [22]:
# Once we have the predicted probabilities for being in the non-probability sample,
# we treat the inverse of those probabilities as survey weights in standard weighted survey analyses.

In [23]:
# This weight tells us how many people in the larger population will be represented by this type of individual.
# We'll then use this weight in analyzing data from the non-probability sample.
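As a small worked example of the weighting step, the sketch below takes four volunteers with made-up predicted probabilities (`p_hat`) and blood pressure readings (`bp`), inverts the probabilities to get weights, and compares the weighted and unweighted means. All numbers are invented for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of being in the non-probability sample.
p_hat = np.array([0.02, 0.01, 0.005, 0.02])

# Hypothetical systolic blood pressure readings for the same four volunteers.
bp = np.array([120.0, 130.0, 150.0, 125.0])

# Inverse of the predicted probability: roughly how many people in the
# larger population each volunteer stands in for (50, 100, 200, 50).
weights = 1.0 / p_hat

unweighted_mean = bp.mean()
weighted_mean = np.sum(weights * bp) / np.sum(weights)

print(unweighted_mean)  # 131.25
print(weighted_mean)    # about 138.1
```

The third volunteer is the least likely to have volunteered, so that record gets the largest weight and pulls the weighted mean upwards.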

## ___How to calculate sampling variance?___

In [24]:
# In probability samples, the known sampling design (the selection probabilities) lets us estimate the sampling variance.
# That technique is not directly applicable to the quasi-randomization approach, because the weights here come from a fitted model.
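The notes above do not name an alternative. One commonly used practical option, offered here purely as an assumption, is a replication method such as the bootstrap. The sketch below resamples simulated volunteer records with replacement and recomputes a weighted mean each time. It treats the pseudo-weights as fixed, which is a simplification: a fuller version would also refit the weighting model in every replicate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulated volunteer outcomes and hypothetical pseudo-weights.
bp = rng.normal(130, 15, size=n)
weights = rng.uniform(20, 200, size=n)

def weighted_mean(y, w):
    """Weighted mean of y with weights w."""
    return np.sum(w * y) / np.sum(w)

estimate = weighted_mean(bp, weights)

# Bootstrap: resample records (outcome and weight together) with replacement.
n_reps = 1000
reps = np.empty(n_reps)
for b in range(n_reps):
    idx = rng.integers(0, n, size=n)
    reps[b] = weighted_mean(bp[idx], weights[idx])

# The spread of the replicate estimates approximates the standard error.
se = reps.std(ddof=1)
print(estimate, se)
```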

## ___Population Modelling___
----------------

In [None]:
# Use predictive modelling to predict aggregate sample quantities (usually totals) on key variables of interest for population
# units not included in the non-probability sample.

# Suppose we have a dataset that contains general auxiliary information about everyone in the population.

# In our non-probability sample, we collect measurements for only a small, specific subset of the population.
# The sample includes some variables that are absent from the population-level data.
# So, when the two are combined, the population records outside the sample will have missing values in those columns.
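The idea can be sketched on simulated data as follows. The assumptions are loud ones: age is the only auxiliary variable, it is known for every unit in a made-up population of 10,000, blood pressure depends linearly on age, and ordinary least squares stands in for whatever predictive model one would actually choose. The true values for unsampled units are simulated only so that the estimate can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

# Auxiliary variable known for everyone in the (simulated) population.
age = rng.uniform(18, 80, size=N)

# Variable of interest; in practice it is observed only for sampled units.
true_bp = 100 + 0.5 * age + rng.normal(0, 10, size=N)

# Non-probability sample: younger people are more likely to volunteer.
p_select = 0.2 * (80 - age) / 62 + 0.01
in_sample = rng.random(N) < p_select

# Fit a linear model on the sampled units only.
X = np.column_stack([np.ones(N), age])
coef, *_ = np.linalg.lstsq(X[in_sample], true_bp[in_sample], rcond=None)

# Predict the variable for units not in the sample, then form the total:
# observed values for sampled units plus predictions for the rest.
pred = X[~in_sample] @ coef
total_est = true_bp[in_sample].sum() + pred.sum()

# Naive alternative: scale the sample mean up to the population size.
naive_total = N * true_bp[in_sample].mean()

true_total = true_bp.sum()
print(total_est / true_total, naive_total / true_total)
```

In this simulation the naive total is pulled down by the over-representation of younger volunteers, while the model-based total uses the auxiliary age information to correct for it. That correction relies on the model being a reasonable description of the unsampled units.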