Erika Duan 2020-10-18
- Introduction
- Bernoulli trial
- Binomial distribution
- Binomial confidence intervals
- Drawing control limits for binomial proportions (funnel plots)
- Resources
#-----load required packages-----
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
               tidyverse,
               ggrepel,
               DescTools,
               Hmisc)
Discrete probability distributions are a great place to start learning about mathematical models. They can be used to describe real-world processes, such as detecting manufacturing defects, because they model variables which are categorical or represent discrete counts.
A Bernoulli trial is the simplest discrete distribution. It has two important assumptions:
- There are only two possible outcomes for each trial.
- The outcome of individual trials is independent of each other (the first outcome does not affect the probability of a second outcome also happening).
Keeping in mind that there are only two possible outcomes (i.e. x = 0 or x = 1), the probability mass function for a Bernoulli distribution is defined as

P(X = x) = p^x × (1 - p)^(1 - x), for x ∈ {0, 1}.

This can be simplified into the following conditions.

P(X = 1) = p
P(X = 0) = 1 - p
The expectation (or theoretical mean of a discrete random variable) of a Bernoulli distribution can therefore be calculated.

E(X) = (1 × p) + (0 × (1 - p)) = p
The variance of a Bernoulli distribution can also be easily calculated.

Var(X) = E(X²) - E(X)² = p - p² = p × (1 - p)
#-----calculate the expectation and variance of a Bernoulli distribution-----
summarise_bernoulli_dist <- function(p) {
  if (p <= 0 | p >= 1) {
    stop("Error: probability must be between 0 and 1.")
  }

  mean <- p
  var <- p * (1 - p)

  list(P = p,
       Expectation = mean,
       Variance = var)
}
summarise_bernoulli_dist(p = 0.23)
#> $P
#> [1] 0.23
#
#> $Expectation
#> [1] 0.23
#
#> $Variance
#> [1] 0.1771
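These theoretical values can also be checked empirically. The sketch below simulates Bernoulli trials with `rbinom` (a binomial draw with `size = 1` is a single Bernoulli trial) and compares the sample mean and variance against p and p × (1 - p); the seed value is an arbitrary choice.

```r
#-----check the Bernoulli expectation and variance by simulation-----
set.seed(111) # arbitrary seed for reproducibility

# a binomial draw with size = 1 is a single Bernoulli trial
trials <- rbinom(n = 100000, size = 1, prob = 0.23)

mean(trials) # should be close to the expectation p = 0.23
var(trials)  # should be close to the variance p * (1 - p) = 0.1771
```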
The binomial distribution can be described as a finite number (n) of Bernoulli trials which are all independent of each other. Each independent trial has two discrete outcomes.

Its probability mass function can be written as below.

P(X = x) = C(n, x) × p^x × (1 - p)^(n - x)

This equation can be broken down into two components: the combinations formula, C(n, x) (the total number of orderings in which x successes and n - x failures can happen), and the Bernoulli trial (i.e. the probability of one specific ordering of successes and failures).

The binomial coefficient, or combinations formula, is represented below.

C(n, x) = n! / (x! × (n - x)!)

It arises from the concept of finding all combinations of a particular permutation of object orderings.
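As a quick sanity check, base R's `choose()` function returns the same value as the factorial definition of the binomial coefficient.

```r
#-----compare choose() with the factorial definition-----
# number of ways to order 2 successes among 5 trials
choose(5, 2)
#> [1] 10

# the same value from n! / (x! * (n - x)!)
factorial(5) / (factorial(2) * factorial(3))
#> [1] 10
```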
#-----calculate binomial probability-----
# if the probability of an account being hacked is p = 0.001
# what is the probability 2 accounts are hacked out of 1000
dbinom(x = 2, size = 1000, prob = 0.001)
#> [1] 0.1840317
# what is the probability of 2 or less accounts being hacked out of 1000 accounts
pbinom(q = 2, size = 1000, prob = 0.001)
#> [1] 0.9197907
# what is the probability of more than 2 accounts being hacked out of 1000 accounts
1 - pbinom(q = 2, size = 1000, prob = 0.001)
#> [1] 0.08020934
#-----create binomial probability distribution function-----
calc_binom_dist <- function(x, n, p) {
  combinations <- choose(n, x)
  combinations * (p^x) * ((1 - p)^(n - x))
}
calc_binom_dist(x = 2, n = 1000, p = 0.001)
#> [1] 0.1840317
The expectation can be derived as E(X) = n × p. The intuitive explanation is that the mean of the binomial distribution is equal to the sum of the means of each independent Bernoulli variable that constitutes the distribution.

The variance can be derived as Var(X) = n × p × (1 - p).
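These two results can also be checked by simulation. The sketch below reuses the account hacking example (n = 1000 and p = 0.001, so E(X) = 1 and Var(X) = 0.999); the seed value is an arbitrary choice.

```r
#-----check the binomial expectation and variance by simulation-----
set.seed(111) # arbitrary seed for reproducibility

draws <- rbinom(n = 100000, size = 1000, prob = 0.001)

mean(draws) # should be close to n * p = 1
var(draws)  # should be close to n * p * (1 - p) = 0.999
```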
To calculate binomial proportion confidence intervals, the easiest concept is to estimate normal approximation intervals for the binomial proportion. We assume that the distribution of the standard error of the sample proportion p̂ is approximately normal. We can use this approach if our sample sizes are relatively large (n × p > 5 and n × (1 - p) > 5) and the probability of each independent Bernoulli trial is not close to 0 or 1.

The normal approximation interval for a binomial proportion is represented below, where p̂ = k/n and z is the 1 - α/2 quantile of the standard normal distribution.

p̂ ± z × sqrt(p̂ × (1 - p̂) / n)
Note: Whilst the rationale for the normal approximation interval is the easiest to understand, its underlying assumption that the variance is independent of the mean renders it undesirable for use. The normal approximation interval is also called the Wald interval.
#-----estimate the unemployment rate using the normal approximation interval-----
# 600 people were randomly surveyed and 78 were unemployed
n <- 600
n_success <- 78
#------calculate 95% and 99% confidence interval for the true unemployment rate-----
# use the DescTools package
BinomCI(n_success, n,
conf.level = 0.95,
sides = "two.sided",
method = "wald")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.1030906 0.1569094
BinomCI(n_success, n,
conf.level = 0.99,
sides = "two.sided",
method = "wald")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.09463507 0.1653649
#-----create a function to manually calculate binomial normal approximation intervals-----
calc_binom_CI_wald <- function(n, k, CI = 0.95) {
  if (!is.numeric(n)) {
    stop("Error: n must be a number (number of total trials).")
  }

  if (!is.numeric(k)) {
    stop("Error: k must be a number (number of successful trials).")
  }

  if (!is.numeric(CI) | CI <= 0 | CI >= 1) {
    stop("Error: the 95% confidence interval should be represented as 0.95 etc.")
  }

  alpha <- 1 - CI
  z <- qnorm(p = 1 - (alpha/2))
  p_hat <- k/n

  lower_bound <- p_hat - z * sqrt(p_hat * (1 - p_hat)/n)
  upper_bound <- p_hat + z * sqrt(p_hat * (1 - p_hat)/n)

  return(c("est" = p_hat, "lwr.ci" = lower_bound, "upr.ci" = upper_bound))
}
calc_binom_CI_wald(n = 600, k = 78, CI = 0.95)
#> est lwr.ci upr.ci
#> 0.1300000 0.1030906 0.1569094
The Wilson score interval is recommended instead of the normal approximation interval, as its actual coverage probability sits closer to the nominal confidence level (i.e. 0.95).

This approach starts by using the normal approximation of the binomial.

The lower and upper bounds of the Wilson score interval are derived by solving the rearranged quadratic equation below for p.

(p̂ - p)² = z² × p × (1 - p) / n

Solving this quadratic for p yields the interval bounds.

(p̂ + z²/(2n) ± z × sqrt(p̂ × (1 - p̂)/n + z²/(4n²))) / (1 + z²/n)
Note: A continuity correction also exists for the Wilson score interval. Continuity corrections are used when the probability mass function of a discrete distribution is very jagged (i.e. not very smooth).
#-----estimate the unemployment rate using the Wilson score interval-----
# 600 people were randomly surveyed and 78 were unemployed
n <- 600
n_success <- 78
p_hat <- n_success/n
#------calculate 95% and 99% confidence interval for the true unemployment rate-----
# use the DescTools package
BinomCI(n_success, n,
conf.level = 0.95,
sides = "two.sided",
method = "wilson")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.1054271 0.1592806
BinomCI(n_success, n,
conf.level = 0.99,
sides = "two.sided",
method = "wilson")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.09864373 0.1694498
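To mirror `calc_binom_CI_wald` above, a manual sketch of the Wilson score interval is shown below (the function name is my own; it should reproduce the DescTools `method = "wilson"` output).

```r
#-----create a function to manually calculate Wilson score intervals-----
calc_binom_CI_wilson <- function(n, k, CI = 0.95) {
  alpha <- 1 - CI
  z <- qnorm(p = 1 - (alpha/2))
  p_hat <- k/n

  # centre and half-width come from solving the rearranged quadratic for p
  centre <- (p_hat + z^2/(2 * n)) / (1 + z^2/n)
  half_width <- (z / (1 + z^2/n)) * sqrt(p_hat * (1 - p_hat)/n + z^2/(4 * n^2))

  c("est" = p_hat,
    "lwr.ci" = centre - half_width,
    "upr.ci" = centre + half_width)
}

calc_binom_CI_wilson(n = 600, k = 78, CI = 0.95)
```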
The Clopper–Pearson interval is an exact interval as it is directly based on the binomial distribution. The lower and upper bounds of the confidence interval are represented through the following two binomial distributions, where k is the number of observed successes out of n trials.

- The lower bound is the value of p which satisfies P(X ≥ k) = α/2.
- The upper bound is the value of p which satisfies P(X ≤ k) = α/2.

An exception is that the lower bound is 0 when k = 0 and the upper bound is 1 when k = n.
Note: The formula for calculating the Clopper-Pearson interval can also be re-written based on the beta-binomial or F distribution.
Note: Although the Clopper-Pearson approach has been traditionally favoured, its results are more conservative than results obtained using the Wilson score interval. This is because the coverage probability of the exact Clopper-Pearson interval is always at least the nominal level 1 - α (i.e. at least 0.95 for a 95% interval).
#-----estimate the unemployment rate using the Clopper-Pearson interval-----
# 600 people were randomly surveyed and 78 were unemployed
n <- 600
n_success <- 78
p_hat <- n_success/n
#------calculate 95% and 99% confidence interval for the true unemployment rate-----
BinomCI(n_success, n,
conf.level = 0.95,
sides = "two.sided",
method = "clopper-pearson")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.1041328 0.1595719
BinomCI(n_success, n,
conf.level = 0.99,
sides = "two.sided",
method = "clopper-pearson")
#> est lwr.ci upr.ci
#> [1,] 0.13 0.09686711 0.1691624
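The beta distribution form of the Clopper-Pearson interval is easy to sketch in base R using `qbeta` (the function name is my own; it should reproduce the `method = "clopper-pearson"` output above).

```r
#-----create a function to manually calculate Clopper-Pearson intervals-----
calc_binom_CI_exact <- function(n, k, CI = 0.95) {
  alpha <- 1 - CI
  p_hat <- k/n

  # the bounds are beta distribution quantiles,
  # with the edge cases k = 0 and k = n handled explicitly
  lower_bound <- if (k == 0) 0 else qbeta(alpha/2, k, n - k + 1)
  upper_bound <- if (k == n) 1 else qbeta(1 - alpha/2, k + 1, n - k)

  c("est" = p_hat, "lwr.ci" = lower_bound, "upr.ci" = upper_bound)
}

calc_binom_CI_exact(n = 600, k = 78, CI = 0.95)
```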
A summary of the estimated confidence intervals for the local unemployment rate, as calculated using different interval methods, can be viewed below.
conf.level | method | est | lwr.ci | upr.ci |
---|---|---|---|---|
0.95 | wald | 0.13 | 0.1030906 | 0.1569094 |
0.99 | wald | 0.13 | 0.0946351 | 0.1653649 |
0.95 | wilson | 0.13 | 0.1054271 | 0.1592806 |
0.99 | wilson | 0.13 | 0.0986437 | 0.1694498 |
0.95 | clopper-pearson | 0.13 | 0.1041328 | 0.1595719 |
0.99 | clopper-pearson | 0.13 | 0.0968671 | 0.1691624 |
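The claim that the Wilson score interval's actual coverage sits closer to the nominal 0.95 than the Wald interval's can be checked by simulation. The sketch below uses a deliberately small sample size and a probability far from 0.5, a setting where the Wald interval is known to under-cover; the parameter values and seed are arbitrary choices.

```r
#-----compare the actual coverage of the Wald and Wilson intervals-----
library(DescTools)

set.seed(111) # arbitrary seed for reproducibility

true_p <- 0.1
n_trials <- 30
k <- rbinom(n = 10000, size = n_trials, prob = true_p)

# BinomCI is vectorised over the number of successes
wald_CIs <- BinomCI(k, n_trials, conf.level = 0.95, method = "wald")
wilson_CIs <- BinomCI(k, n_trials, conf.level = 0.95, method = "wilson")

# proportion of simulated intervals that contain the true p
# (the Wald coverage typically falls below 0.95 in this setting)
mean(wald_CIs[, "lwr.ci"] <= true_p & true_p <= wald_CIs[, "upr.ci"])
mean(wilson_CIs[, "lwr.ci"] <= true_p & true_p <= wilson_CIs[, "upr.ci"])
```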
Probability distributions can be used to approximate the occurrence of natural (especially social) phenomena, where the results obtained arise from:
- The true distribution (which exists in theory but is unknown).
- Natural and unavoidable variability (i.e. stochastic error).

Stochastic errors should not be confused with systematic errors. The former can be modelled using statistical analysis whereas the latter should be eliminated from the study through improved study design.
The exercise below aims to replicate the concept of visualising binomial proportion confidence intervals using funnel plots, as inspired by Figure 9.2 from The Art of Statistics - Learning from Data by David Spiegelhalter. I have made up a series of random unemployment rates for different LGAs.
#-----construct a scenario to create the funnel plot------
# imagine if the national unemployment rate is 10%
# we already know p = 0.1
# we want to know if the unemployment rate of LGAs differs from p
lgas <- c("Albury", "Bega Valley", "Broken Hill",
"Byron", "Central Coast", "Griffith",
"Inner West", "Leeton", "Murray River",
"Newcastle", "Orange", "Parramatta")
lga_pop <- c(52171, 33941, 18114,
33399, 336611, 26356,
191194, 11407, 11872,
160707, 41210, 234444)
lga_unemploy_count <- c(5313, 3294, 1962,
3433, 33751, 2797,
18812, 1008, 1225,
15894, 6357, 23094)
#-----create dataframe for all LGAs-----
unemployment_data <- tibble(lgas,
lga_pop,
lga_unemploy_count)
unemployment_data <- unemployment_data %>%
mutate(lga_unemploy_rate = lga_unemploy_count / lga_pop)
#----calculate 95% confidence intervals using the Wilson score interval-----
values_total <- seq(1, 400000, 10)
values_10_percent <- values_total * 0.1
CIs <- map2(values_10_percent, values_total, ~ BinomCI(.x, .y, conf.level = 0.95,
sides = "two.sided",
method = "wilson"))
# create CIs matrix
CIs <- matrix(unlist(CIs), ncol = 3, byrow = T) %>%
as_tibble(.name_repair = "unique")
colnames(CIs) <- c("est", "lwr.ci", "upr.ci")
CIs <- CIs %>%
mutate(lga_pop = values_total)
unemployment_data <- bind_rows(unemployment_data, CIs)
#-----visualise the funnel plot-----
# x-axis displays LGA population size
# y-axis displays local unemployment rate
# control limits are 95% confidence intervals for p
# subset the LGAs which fall outside of the 95% confidence intervals
subset_lgas <- c("Orange", "Leeton", "Broken Hill", "Griffith")
ggplot(unemployment_data, aes(x = lga_pop, y = lga_unemploy_rate, label = lgas)) +
geom_point(colour = if_else(unemployment_data$lgas %in% subset_lgas, "#b2182b", "grey40")) +
geom_text_repel(colour = if_else(unemployment_data$lgas %in% subset_lgas,"black","grey40"),
segment.color = "grey70", point.padding = 0.1,
size = 3.5, direction = "both", force = 6, nudge_x = 10000, nudge_y = 0.006) +
geom_hline(yintercept = 0.1, size = 0.5, colour = "#2166ac", linetype = "dashed") +
geom_line(aes(x = lga_pop, y = lwr.ci), size = 0.5, colour = "#67a9cf", linetype = "dotted") +
geom_line(aes(x = lga_pop, y = upr.ci), size = 0.5, colour = "#67a9cf", linetype = "dotted") +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(limits = c(0.04, 0.16), labels = scales::percent_format(accuracy = 1)) +
labs(x = "LGA resident population",
y = "LGA unemployment rate (%)",
title = "Which LGA unemployment rates significantly differ from the national rate?",
caption = "Labour Force, Australia, Detailed - released 24 September 2020") +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linetype = "dashed"),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5, colour = "grey35"))
Note: The final step in generating a funnel plot would be to
calculate overdispersion limits (when we have more variation between
LGAs than expected from statistical theory alone). It would also be
interesting to apply the package
nullabor
to visually confirm that the unemployment rate in Orange is not
differently distributed due to chance alone.
- YouTube video series explaining discrete probability distributions.
- Mathematical proof for the derivation of the binomial distribution’s mean and variance.
- Explanation of how the standard error of the mean is defined for a binomial distribution using the normal approximation.
- Explanations here, here and here of the Wilson score interval for calculating the confidence interval of a proportion as modelled by a binomial distribution.
- Explanation of how to calculate confidence intervals for proportions in R.
- Blog posts here and here for drawing funnel plots in SAS.
- An excellent introduction to drawing funnel plots from the NHS R community blog.
- An excellent research article providing guidelines for the standardisation of constructing funnel plots.