# ___Non-Probability Sampling___

## ___What defines a non-probability sample?___
------------

In [1]:
# Unlike probability samples, the probability of selection cannot be determined for the entities in a non-probability sample.
# There is no random selection. Researchers often have little or no control over the mechanism that brings entities into these samples.

# The sample can still be divided into groups (i.e. strata, clusters),
# but those groups are not randomly sampled at earlier stages (at the upper hierarchical levels, the sampling is often non-random).
# Data collection is usually much cheaper than with probability sampling.

In [2]:
# Examples

# Studying volunteers -> very common in clinical trials, psychological studies, etc.

# These studies often recruit volunteers by telling them that the study may (or may not) be able to help them in some regard.
# e.g. "If you suffer from chronic depression, the drug that we are developing might help you cope with life better.
# If you are interested, please contact us at ..."

# Here the researcher has no control over who volunteers, and often knows nothing about their probabilities of
# selection.

# Opt-in web surveys

# Like the surveys conducted by JetBrains or StackOverflow
# There is no known probability of selection.
# There is no known randomness in selection.


# Snowball sampling

# This is a type of non-probability sampling where the sample grows as participants refer others to join.
# Again, no known probabilities of selection and no randomness.
# Here recruitment happens by word of mouth.
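
The referral mechanism of snowball sampling can be sketched in a few lines. This is only an illustration: the referral network, the names in it, and the `snowball_sample` helper are all hypothetical, and real recruitment is far less orderly than a queue.

```python
from collections import deque

# Hypothetical referral network: each person lists the people they would refer.
REFERRALS = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["ana"],
    "fay": ["gus"],  # nobody in ana's chain knows fay or gus
    "gus": [],
}


def snowball_sample(referrals, seeds, max_size):
    """Grow a sample by following referrals outward from the seed participants."""
    sample = []
    seen = set()
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in seen:
            continue  # already recruited through another referral
        seen.add(person)
        sample.append(person)
        # Each recruit passes the invitation on to the people they know.
        queue.extend(referrals.get(person, []))
    return sample


sample = snowball_sample(REFERRALS, seeds=["ana"], max_size=10)
print(sample)
```

Who ends up in the sample depends entirely on the seeds and on who knows whom. In this toy network `fay` and `gus` can never be reached from `ana`, and nothing in the procedure tells us what anyone's chance of selection was.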

# Convenience sampling

# Researchers sample whoever is easiest to reach, e.g. university faculty/professors using their own students as study participants.
# Again, no known probabilities of selection and no randomness.

# Quota sampling
# In quota sampling you have certain targets that need to be met.
# Researchers do whatever they need to hit those targets.
# e.g. find 2,000 adult male cancer patients, by whatever means are available.
# No randomness, no known probabilities of selection.
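
A minimal sketch of quota sampling, assuming people simply arrive in some order and are accepted until each target is met. The `quota_sample` helper, the arrival list, and the quotas are made up for illustration.

```python
def quota_sample(arrivals, quotas, key):
    """Accept people in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person in arrivals:
        group = key(person)
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all targets met, stop recruiting
    return sample


# Hypothetical stream of people in the order they happened to show up.
arrivals = [
    {"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}, {"id": 3, "sex": "M"},
    {"id": 4, "sex": "M"}, {"id": 5, "sex": "F"}, {"id": 6, "sex": "F"},
    {"id": 7, "sex": "M"},
]
quotas = {"M": 2, "F": 2}

sample = quota_sample(arrivals, quotas, key=lambda p: p["sex"])
print([p["id"] for p in sample])
```

The quotas fix how many people of each group end up in the sample, but who fills them is decided by arrival order, not by a random mechanism, so no selection probability can be attached to anyone.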

In [3]:
# A common characteristic of all these types of non-probability sampling is that the probabilities of selection are not known a priori.

## ___What's wrong with non-probability sampling?___
------------

In [1]:
# With no randomness and no known probabilities of selection,
# there is no straightforward way to make statistically sound and meaningful inferences about the larger population based on the selected sample.

# Knowing the probabilities of selection allows us to estimate the features of the sampling distribution.
# The sampling distribution is the distribution of an estimate (e.g. a sample mean) that would arise if we had taken many samples using the same probability
# sampling design.
# In non-probability sampling the sample units are not selected at random. This poses a strong risk of sampling bias.
# e.g. consider an opt-in web survey: the sample will only consist of people who happen to visit that site, or who were informed by those
# who frequent it. That is not really a representative sample at all.

# The sample units in the non-probability samples are generally not representative of the larger population.
# This makes it very difficult to gain meaningful insights into the larger population from the sample.
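
The sampling bias described above can be made concrete with a small simulation. Everything here is an assumption made for illustration: the synthetic population (weekly hours spent on a hypothetical website) and, in particular, the opt-in mechanism in which heavier users are more likely to respond.

```python
import random

random.seed(0)

# Synthetic population: weekly hours spent on a hypothetical website.
population = [random.expovariate(1 / 5) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Probability sample: simple random sample, every unit equally likely.
srs = random.sample(population, 1_000)
srs_mean = sum(srs) / len(srs)

# Opt-in sample: the chance of responding grows with usage (assumed mechanism).
# In a real opt-in survey this mechanism is unknown, which is the whole problem.
opt_in = [x for x in population if random.random() < min(1.0, x / 50)]
opt_in_mean = sum(opt_in) / len(opt_in)

print(f"population mean : {true_mean:.2f}")
print(f"SRS mean        : {srs_mean:.2f}")
print(f"opt-in mean     : {opt_in_mean:.2f}  (n = {len(opt_in)})")
```

Under these assumptions the opt-in sample is much larger than the random sample and still lands far from the population mean, because a larger sample does not correct a biased selection mechanism.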

# Much of the "Big Data" craze is founded on non-probability sampling.
# These types of data are gathered mostly by automated processes, where little or no consideration has gone into the sampling design.
# e.g. Twitter data, web-scraped data

## ___So, how to remediate these shortcomings?___
---------------------

In [2]:
# Can we say anything meaningful about the larger population?
# Not all is lost when it comes to non-probability sampling.

# There are two possible remedies.
    # Pseudo-randomization
        # Here, with a little bit of work, we can treat the non-probability sample as if it were a probability sample.
    # Calibration
        # Here, we weight the non-probability sample so that it looks more like the target population.

## ___Population Inference Approaches for Non-probability Samples___
----------------

### ___Pseudo Randomization___

In [5]:
# Here, the non-probability sample is combined with a probability sample.
# The two samples must collect similar measurements, and the measuring units must be identical.
# e.g. height & weight of students (in cm and kg).

# In this approach we must first find the variables common to the probability sample and the non-probability sample.
# Then we stack the two samples together.
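
The stacking step might look like the following sketch. The records, the variable names, and the `common_variables` and `stack` helpers are all hypothetical; a `source` column is added so that the origin of each row stays known after stacking.

```python
# Hypothetical records; variable names and values are made up for illustration.
prob_sample = [
    {"height_cm": 170, "weight_kg": 65, "gender": "F", "design_weight": 40.0},
    {"height_cm": 182, "weight_kg": 80, "gender": "M", "design_weight": 55.0},
]
nonprob_sample = [
    {"height_cm": 165, "weight_kg": 58, "gender": "F", "favourite_ide": "PyCharm"},
    {"height_cm": 175, "weight_kg": 72, "gender": "M", "favourite_ide": "Vim"},
    {"height_cm": 168, "weight_kg": 60, "gender": "F", "favourite_ide": "VS Code"},
]


def common_variables(*samples):
    """Return the variables present in every record of every sample."""
    keys = None
    for sample in samples:
        for record in sample:
            keys = set(record) if keys is None else keys & set(record)
    return sorted(keys or [])


def stack(prob, nonprob):
    """Keep only the shared variables and stack both samples, tagging the origin."""
    shared = common_variables(prob, nonprob)
    stacked = []
    for source, sample in (("probability", prob), ("non-probability", nonprob)):
        for record in sample:
            row = {name: record[name] for name in shared}
            row["source"] = source
            stacked.append(row)
    return shared, stacked


shared, stacked = stack(prob_sample, nonprob_sample)
print(shared)
print(len(stacked), "stacked rows")
```

This sketch drops everything that is not shared. In practice one would likely keep the probability sample's design weights alongside the stacked rows, since they are needed when the selection probabilities are modelled.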

In [6]:
# We then estimate the probability of being included in the non-probability sample, as a function of the auxiliary information available
# in both samples.

# e.g. if the datasets of student height & weight also contain variables like gender, race, field of study, etc., these can be
# leveraged to model the probabilities of selection.
# We then use this model to estimate the probabilities of selection for the entities in the non-probability sample.

# Finally, we treat the estimated probabilities as "known" probabilities for the non-probability sample.
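
One deliberately simplified way to estimate those probabilities, assuming a single categorical auxiliary variable: use the probability sample's design weights to estimate how many people each cell contains in the population, then divide the non-probability sample's count in that cell by the estimate. This is a sketch under stated assumptions rather than the method from the notes. The data, the `design_weight` field, and the helper name are hypothetical, and with several auxiliary variables a regression model (logistic regression is one common choice) would normally take the place of the cell counts.

```python
from collections import Counter, defaultdict

# Hypothetical inputs. Each probability-sample record carries a design weight
# (1 / its known selection probability); the non-probability records do not.
prob_sample = (
    [{"gender": "F", "design_weight": 50.0}] * 2
    + [{"gender": "M", "design_weight": 50.0}] * 4
)
nonprob_sample = [{"gender": "F"}] * 30 + [{"gender": "M"}] * 10


def pseudo_inclusion_probabilities(prob, nonprob, cell):
    """Estimate P(in non-probability sample | cell) from the two samples."""
    # Estimated population count per cell, from the probability sample.
    population_estimate = defaultdict(float)
    for record in prob:
        population_estimate[record[cell]] += record["design_weight"]
    # Observed non-probability sample count per cell.
    nonprob_count = Counter(record[cell] for record in nonprob)
    return {
        value: nonprob_count[value] / population_estimate[value]
        for value in nonprob_count
        if population_estimate[value] > 0
    }


pseudo_prob = pseudo_inclusion_probabilities(prob_sample, nonprob_sample, "gender")
# Treat the estimates as if they were known: weight = 1 / probability.
pseudo_weight = {value: 1 / p for value, p in pseudo_prob.items()}
print(pseudo_prob)
print(pseudo_weight)
```

With these made-up numbers the non-probability sample over-represents one group, so that group receives a higher pseudo-probability and therefore a smaller weight.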

### ___Calibration___

In [7]:
# We compute a weight for every entity in the non-probability sample so that the weighted sample mirrors a known population.

# e.g. There are 67% males and 32% females in the larger population (all the students in a university)
# and there are 23% males and 75% females in our sample (final year students at the CS department).

# From our knowledge of the larger population, we know that males are disproportionately under-represented and
# females are disproportionately over-represented in our sample.

# So, we up-weight males and down-weight females in our non-probability sample.
# Once the weighting is done, our sample should look more like the larger population.

# A limitation here is that if the weighting variable used in the calibration is not related to the variables that we are interested in,
# the weighting may not reduce the sampling bias in those variables.
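
A minimal sketch of this kind of weighting, using the shares quoted in the example (taken as given, even though they do not sum to 100%). Each group's weight is its population share divided by its sample share. The respondent counts are an assumption chosen to match the quoted sample shares.

```python
# Shares quoted in the example above.
population_share = {"male": 0.67, "female": 0.32}
sample_share = {"male": 0.23, "female": 0.75}

# Calibration weight per group: population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical respondent counts that match the quoted sample shares.
sample_counts = {"male": 23, "female": 75}

weighted_counts = {g: sample_counts[g] * weights[g] for g in sample_counts}
total = sum(weighted_counts.values())
weighted_share = {g: weighted_counts[g] / total for g in weighted_counts}

for g in weights:
    print(f"{g:>6}: weight {weights[g]:.3f}, weighted share {weighted_share[g]:.3f}")
```

The group that is scarce in the sample relative to the population gets a weight above 1 and the over-represented group a weight below 1. After weighting, the ratio of males to females in the sample matches the ratio in the population.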

## ___Twitter example___
-----------------

In [8]:
# Assume we have scraped 1,567,870 tweets using the Twitter API that indicate support for Donald Trump in an election.

# This is a non-probability sample, collected by filtering 100 million tweets programmatically.
# This involves designing a binary natural language classifier that decides whether a given tweet supports Trump or not.

# Here, we do not know the probability of a tweet being selected into our sample.
# We just collected all the tweets that show support for Trump in one way or another.
# This is essentially a convenience sample.

# Twitter users are not a random sample of the US population.
# And not all Twitter users are vocal about their political alignments there.
# And Twitter is polluted with a massive number of bots. Did we have a criterion in place to detect tweets that could potentially
# have come from a bot?

# Given the lack of probabilities of selection and the non-random sampling,
# we run the risk of a high sampling bias.

# People who tweet in support of Trump represent a very particular subset of the larger population.
# These are people who have very strong political opinions and are very vocal about them.

## ___Sampling distributions & Sampling variance___
--------------------

In [10]:
# How do we estimate the features of the sampling distribution based on only one probability sample?

# How do we make inferences about a population based on just one probability sample?

# Model-based approaches to analyzing data!
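
To make the first question concrete, here is a small simulation sketch. It assumes a synthetic population with made-up parameters and simple random sampling without replacement. It first builds the sampling distribution of the mean by brute force (many repeated samples), then shows that a single sample already yields an estimate of that distribution's spread via the usual standard error formula with a finite population correction.

```python
import random
import statistics

random.seed(1)

# Synthetic population of heights in cm (made-up parameters).
population = [random.gauss(170, 10) for _ in range(20_000)]
n = 200

# Brute force: the sampling distribution of the mean over many random samples.
means = [statistics.fmean(random.sample(population, n)) for _ in range(2_000)]
simulated_se = statistics.stdev(means)

# What we can do with just ONE sample: estimate the standard error from it.
one_sample = random.sample(population, n)
fpc = (1 - n / len(population)) ** 0.5  # finite population correction
estimated_se = statistics.stdev(one_sample) / n ** 0.5 * fpc

print(f"spread of 2,000 sample means : {simulated_se:.3f}")
print(f"SE estimated from one sample : {estimated_se:.3f}")
```

The two numbers should be close. That agreement depends on knowing the selection probabilities (here, equal for every unit), which is what a non-probability sample lacks.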