# **Week 6: Sampling Methods**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```

In this workshop, we will study common sampling methods and sampling distributions, and examine the asymptotic distribution of the sample mean as the sample size increases.

## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change runtime type.

### **Importing Required Packages**
**Run the following lines of code**:

In [None]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Run the pre-configuration script, which loads the required packages
invisible(source("R/preConfigurated.R"))

**Do not modify the following**

In [None]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr", "knitr") %in% loadedNamespaces()))

})

Test passed 🥳


## **Finite-Population Non-Probability Sampling**

In this section, we will explore non-probability sampling through a series of practical discussion questions, rather than exercises or programming examples.


### **Non-Probability Sampling Schemes**

Non-probability sampling is a method of selecting units from a population through non-random, subjective procedures. This approach does not require a complete sampling frame — a list of all units in the population — making it relatively fast, convenient, and cost-effective. However, because selection is not random, the resulting sample may not accurately represent the population, and any inferences drawn are subject to potential bias.

Common forms of non-probability sampling include sequential sampling, convenience sampling, snowball sampling, quota sampling, and purposive sampling, each of which has distinct characteristics and limitations. In particular:

- **Sequential sampling**: Samples are selected periodically from a list of potential subjects, taken one after another in sequence. These schemes sometimes resemble random sampling but lack true randomness.
- **Convenience sampling**: Subjects are chosen because they are easiest to reach, or they self-select to participate (e.g., IMDB user rankings). However, results are often unreliable due to self-selection bias.
- **Snowball sampling**: Similar to convenience sampling, except participants are asked to refer others to the study. This can expand the sample but still suffers from self-selection bias.
- **Quota sampling**: Aims to ensure demographic balance by filling pre-set quotas (e.g., 50% male, 50% female). While designed to improve representativeness, it is not random and risks over- or under-representing subgroups.
- **Purposive sampling**: Participants are deliberately selected because they fit specific characteristics or criteria relevant to the study (e.g., interviewing only experts on climate change). This method can provide rich insights but is vulnerable to researcher bias and lacks generalisability.

#### **Exercise**

Can you think of an example where snowball sampling might be useful?

<details>
<summary>▶️ Click to show the solution</summary>

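Studying the experiences of undocumented immigrants in a city can be challenging, as these individuals are often difficult to identify and may avoid official records out of fear of legal consequences. Snowball sampling helps here because initial participants can refer others in their community, reaching people who would not appear in any sampling frame.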

</details>

### **Free Swimming Tickets for Everyone**

A city council wants to evaluate the health benefits of a new policy that provides free swimming tickets to residents. Participants are given tickets, and their heart rate is recorded before the program begins and after one year. All participants attend the sessions regularly, with no dropouts. After analysing the data, the council observes no substantial improvement in heart rate.

#### **Question 1**

What type of sampling method was used in this study?

<details>
<summary>▶️ Click to show the solution</summary>

Convenience sampling

</details>

#### **Question 2**

What are the potential problems with this sampling approach?

<details>
<summary>▶️ Click to show the solution</summary>

Several problems can arise from this approach:

- People who accept free swimming tickets may already be health-conscious, physically active, or motivated to improve their fitness.
- Those who are less active or indifferent may ignore the offer, so the sample does not represent the general population.

</details>

#### **Question 3**

What biases could affect conclusions drawn from this study?

<details>
<summary>▶️ Click to show the solution</summary>

- Self-selection bias / voluntary response bias: People who chose to participate in the swimming program may differ systematically from the general population (e.g., already health-conscious or motivated). The observed lack of heart rate improvement may not reflect the effect on the entire population.
- Coverage bias: Residents who did not accept the free tickets or were unaware of the program were excluded from the study.

</details>

### **84% Want Europe to Stop Changing the Clock**

In 2018, the European Commission conducted a public consultation on whether to end the bi-annual clock change across Europe. The consultation ran online from 4 July to 16 August and received 4.6 million responses from all 28 Member States, the highest number ever received for any Commission consultation. According to preliminary results, 84% of respondents favoured stopping the clock changes, and 76% described the experience of changing clocks twice a year as “negative” or “very negative”. See the press release [Summertime Consultation: 84% want Europe to stop changing the clock](https://ec.europa.eu/commission/presscorner/detail/en/ip_18_5302) for more details.

![Which of the following alternatives would you favour?](http://ec.europa.eu/avservices/avs/files/video6/repository/prod/photo/store/store2/9/P037989-973692.jpg)


![What is your overall experience with the clock change?](http://ec.europa.eu/avservices/avs/files/video6/repository/prod/photo/store/store2/9/P037989-904263.jpg)



#### **Question 1**

What type of sampling method was used in this consultation?

<details>
<summary>▶️ Click to show the solution</summary>

Convenience sampling

</details>

#### **Question 2**

Why might the results not accurately represent the views of all European citizens?

<details>
<summary>▶️ Click to show the solution</summary>

- European citizens who are highly motivated or have strong opinions about the clock change (especially those annoyed by it) were more likely to respond.
- Many citizens who are indifferent or less engaged may not have participated.

</details>

#### **Question 3**

What biases could affect conclusions drawn from this online consultation?

<details>
<summary>▶️ Click to show the solution</summary>

- Self-selection bias/voluntary response bias: Strong opinions are overrepresented. The 84% figure likely overestimates support for ending clock changes compared to the general population.
- Coverage bias: Citizens without internet access or awareness of the consultation were excluded.

While the survey collected millions of responses, the large sample size does not guarantee representativeness. Any conclusions drawn should be treated cautiously.

</details>

#### **Question 4**

Can you think of an example of a survey that might suffer from the same type of bias as the online consultation?

<details>
<summary>▶️ Click to show the solution</summary>

An online survey titled "Who Do You Think Will Win the Next US Election?" published on the website of a political party. Respondents are likely to be party supporters, which can lead to an overrepresentation of strong opinions and a sample that does not accurately reflect the views of the general population.

</details>


## **Finite-Population Probability Sampling**

### **Simple Random Sampling**


Simple random sampling (SRS) is a common method where each unit in the population has an equal chance of being selected, and all possible samples of a given size are equally likely.


There are two main SRS schemes:
- SRS Without Replacement (SRSWOR): Each unit in the population can be selected only once. Once a unit is chosen, it is removed from the pool of possible selections.
- SRS With Replacement (SRSWR): Each unit can be selected multiple times, independently of previous selections. This means that after a unit is chosen, it is “put back” into the population, so it is still available for subsequent draws.

SRSWR is less efficient than SRSWOR as it may select the same observation multiple times. However, SRSWR is mathematically simpler to work with because each draw is independent. Details on this topic are beyond the scope of this unit.

#### **R Examples**

To perform simple random sampling in R, we can use the `sample(x, size, replace, prob)` function, where:

- `x`: a vector of values to sample from
- `size`: the number of items to select
- `replace`: whether or not to perform sampling with replacement
- `prob`: an optional vector of probability weights for elements in `x`. By default, all elements are equally likely.

If `replace = FALSE`, `size` cannot be greater than `length(x)`, because each unit can be selected only once.
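
As a minimal sketch of the `sample()` calls described above, the following draws one sample under each scheme. The population vector, the seed, and the sample sizes are hypothetical values chosen only for illustration.

```r
set.seed(107)  # arbitrary seed so the draws are reproducible

# Hypothetical population: 10 units labelled 1 to 10
population <- 1:10

# SRSWOR: each unit can appear at most once in the sample
srswor <- sample(population, size = 5, replace = FALSE)

# SRSWR: units may repeat, so size may exceed length(population)
srswr <- sample(population, size = 15, replace = TRUE)

# Unequal selection probabilities via `prob` (weights are illustrative)
weighted <- sample(c("A", "B"), size = 10, replace = TRUE, prob = c(0.7, 0.3))

# Without replacement, requesting more units than exist raises an error;
# tryCatch() captures the error message instead of stopping the script
too_many <- tryCatch(
  sample(population, size = 15, replace = FALSE),
  error = function(e) conditionMessage(e)
)

print(srswor)
print(srswr)
print(weighted)
print(too_many)
```

Because 15 draws are taken from only 10 units, the with-replacement sample must contain at least one repeated unit, whereas the without-replacement sample never does.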

### **Stratified Random Sampling**

### **Cluster Sampling**

## **Sampling Distribution**

Suppose we have a population (for example, the heights of all people in a city). We are often interested in some characteristic of this population, such as the average height or the proportion of people taller than 180 cm.
In practice, we rarely observe the entire population, so we instead take a sample from it. From this sample, we calculate a statistic (a function of the sample data), such as:

- The sample mean: an estimate of the population mean.
- The sample variance: an estimate of the population variance.
- The sample median: an estimate of the population median.

If we were to repeat this sampling process many times, we would obtain a collection of values for the statistic. The distribution of these values is called the sampling distribution of that statistic.
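
The repeated-sampling idea can be sketched with a small simulation. Everything below is an illustrative assumption rather than part of the unit's data: the population is simulated from a normal distribution with mean 170 cm and standard deviation 10 cm, and the sample size and number of repetitions are arbitrary.

```r
set.seed(107)  # arbitrary seed so the simulation is reproducible

# Hypothetical population: heights (cm) of 10,000 people
population <- rnorm(10000, mean = 170, sd = 10)

n      <- 30    # size of each sample (illustrative)
n_reps <- 2000  # number of repeated samples (illustrative)

# Draw n_reps simple random samples without replacement and
# record the sample mean of each one
sample_means <- replicate(
  n_reps,
  mean(sample(population, size = n, replace = FALSE))
)

# The collected sample means approximate the sampling distribution
# of the sample mean
centre  <- mean(sample_means)         # should be close to mean(population)
spread  <- sd(sample_means)           # empirical standard error
approx_se <- sd(population) / sqrt(n) # usual approximation, ignoring the
                                      # finite-population correction

print(c(centre = centre, population_mean = mean(population)))
print(c(spread = spread, approx_se = approx_se))

hist(sample_means,
     main = "Approximate sampling distribution of the sample mean",
     xlab = "Sample mean height (cm)")
```

Increasing `n` should make the histogram narrower, since the spread of the sample mean shrinks roughly in proportion to `1 / sqrt(n)`.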

## **Workshop Questions**


### **Question 1**