🏷️sec_statistics
Undoubtedly, to be a top deep learning practitioner, the ability to train the state-of-the-art and high accurate models is crucial. However, it is often unclear when improvements are significant, or only the result of random fluctuations in the training process. To be able to discuss uncertainty in estimated values, we must learn some statistics.
The earliest reference of statistics can be traced back to an Arab scholar Al-Kindi in the
More specifically, statistics can be divided to descriptive statistics and statistical inference. The former focus on summarizing and illustrating the features of a collection of observed data, which is referred to as a sample. The sample is drawn from a population, denotes the total set of similar individuals, items, or events of our experiment interests. Contrary to descriptive statistics, statistical inference further deduces the characteristics of a population from the given samples, based on the assumptions that the sample distribution can replicate the population distribution at some degree.
You may wonder: “What is the essential difference between machine learning and statistics?” Fundamentally speaking, statistics focuses on the inference problem. This type of problems includes modeling the relationship between the variables, such as causal inference, and testing the statistically significance of model parameters, such as A/B testing. In contrast, machine learning emphasizes on making accurate predictions, without explicitly programming and understanding each parameter's functionality.
In this section, we will introduce three types of statistics inference methods: evaluating and comparing estimators, conducting hypothesis tests, and constructing confidence intervals. These methods can help us infer the characteristics of a given population, i.e., the true parameter
In statistics, an estimator is a function of given samples used to estimate the true parameter
We have seen simple examples of estimators before in section :numref:sec_maximum_likelihood
. If you have a number of samples from a Bernoulli random variable, then the maximum likelihood estimate for the probability the random variable is one can be obtained by counting the number of ones observed and dividing by the total number of samples. Similarly, an exercise asked you to show that the maximum likelihood estimate of the mean of a Gaussian given a number of samples is given by the average value of all the samples. These estimators will almost never give the true value of the parameter, but ideally for a large number of samples the estimate will be close.
As an example, we show below the true density of a Gaussian random variable with mean zero and variance one, along with a collection samples from that Gaussian. We constructed the
from d2l import mxnet as d2l
from mxnet import np, npx
import random
npx.set_np()
# Sample datapoints and create y coordinate
epsilon = 0.1
random.seed(8675309)
xs = np.random.normal(loc=0, scale=1, size=(300,))
ys = [np.sum(np.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
/ np.sqrt(2*np.pi*epsilon**2)) / len(xs) for i in range(len(xs))]
# Compute true density
xd = np.arange(np.min(xs), np.max(xs), 0.01)
yd = np.exp(-xd**2/2) / np.sqrt(2 * np.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=np.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(np.mean(xs)):.2f}')
d2l.plt.show()
#@tab pytorch
from d2l import torch as d2l
import torch
torch.pi = torch.acos(torch.zeros(1)) * 2 #define pi in torch
# Sample datapoints and create y coordinate
epsilon = 0.1
torch.manual_seed(8675309)
xs = torch.randn(size=(300,))
ys = torch.tensor(
[torch.sum(torch.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))\
/ torch.sqrt(2*torch.pi*epsilon**2)) / len(xs)\
for i in range(len(xs))])
# Compute true density
xd = torch.arange(torch.min(xs), torch.max(xs), 0.01)
yd = torch.exp(-xd**2/2) / torch.sqrt(2 * torch.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=torch.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(torch.mean(xs).item()):.2f}')
d2l.plt.show()
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
tf.pi = tf.acos(tf.zeros(1)) * 2 # define pi in TensorFlow
# Sample datapoints and create y coordinate
epsilon = 0.1
xs = tf.random.normal((300,))
ys = tf.constant(
[(tf.reduce_sum(tf.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2)) \
/ tf.sqrt(2*tf.pi*epsilon**2)) / tf.cast(
tf.size(xs), dtype=tf.float32)).numpy() \
for i in range(tf.size(xs))])
# Compute true density
xd = tf.range(tf.reduce_min(xs), tf.reduce_max(xs), 0.01)
yd = tf.exp(-xd**2/2) / tf.sqrt(2 * tf.pi)
# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=tf.reduce_mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(tf.reduce_mean(xs).numpy()):.2f}')
d2l.plt.show()
There can be many ways to compute an estimator of a parameter
Perhaps the simplest metric used to evaluate estimators is the mean squared error (MSE) (or
eq_mse_est
This allows us to quantify the average squared deviation from the true value. MSE is always non-negative. If you have read :numref:sec_linear_regression
, you will recognize it as the most commonly used regression loss function. As a measure to evaluate an estimator, the closer its value to zero, the closer the estimator is close to the true parameter
The MSE provides a natural metric, but we can easily imagine multiple different phenomena that might make it large. Two fundamentally important are fluctuation in the estimator due to randomness in the dataset, and systematic error in the estimator due to the estimation procedure.
First, let us measure the systematic error. For an estimator
eq_bias
Note that when
It is worth being aware, however, that biased estimators are frequently used in practice. There are cases where unbiased estimators do not exist without further assumptions, or are intractable to compute. This may seem like a significant flaw in an estimator, however the majority of estimators encountered in practice are at least asymptotically unbiased in the sense that the bias tends to zero as the number of available samples tends to infinity:
Second, let us measure the randomness in the estimator. Recall from :numref:sec_random_variables
, the standard deviation (or standard error) is defined as the squared root of the variance. We may measure the degree of fluctuation of an estimator by measuring the standard deviation or variance of that estimator.
eq_var_est
It is important to compare :eqref:eq_var_est
to :eqref:eq_mse_est
. In this equation we do not compare to the true population value
It is intuitively clear that these two main components contribute to the mean squared error. What is somewhat shocking is that we can show that this is actually a decomposition of the mean squared error into these two contributions plus a third one. That is to say that we can write the mean squared error as the sum of the square of the bias, the variance and the irreducible error.
We refer the above formula as bias-variance trade-off. The mean squared error can be divided into three sources of error: the error from high bias, the error from high variance and the irreducible error. The bias error is commonly seen in a simple model (such as a linear regression model), which cannot extract high dimensional relations between the features and the outputs. If a model suffers from high bias error, we often say it is underfitting or lack of flexibilty as introduced in (:numref:sec_model_selection
). The high variance usually results from a too complex model, which overfits the training data. As a result, an overfitting model is sensitive to small fluctuations in the data. If a model suffers from high variance, we often say it is overfitting and lack of generalization as introduced in (:numref:sec_model_selection
). The irreducible error is the result from noise in the
Since the standard deviation of an estimator has been implementing by simply calling a.std()
for a tensor a
, we will skip it but implement the statistical bias and the mean squared error.
# Statistical bias
def stat_bias(true_theta, est_theta):
return(np.mean(est_theta) - true_theta)
# Mean squared error
def mse(data, true_theta):
return(np.mean(np.square(data - true_theta)))
#@tab pytorch
# Statistical bias
def stat_bias(true_theta, est_theta):
return(torch.mean(est_theta) - true_theta)
# Mean squared error
def mse(data, true_theta):
return(torch.mean(torch.square(data - true_theta)))
#@tab tensorflow
# Statistical bias
def stat_bias(true_theta, est_theta):
return(tf.reduce_mean(est_theta) - true_theta)
# Mean squared error
def mse(data, true_theta):
return(tf.reduce_mean(tf.square(data - true_theta)))
To illustrate the equation of the bias-variance trade-off, let us simulate of normal distribution
theta_true = 1
sigma = 4
sample_len = 10000
samples = np.random.normal(theta_true, sigma, sample_len)
theta_est = np.mean(samples)
theta_est
#@tab pytorch
theta_true = 1
sigma = 4
sample_len = 10000
samples = torch.normal(theta_true, sigma, size=(sample_len, 1))
theta_est = torch.mean(samples)
theta_est
#@tab tensorflow
theta_true = 1
sigma = 4
sample_len = 10000
samples = tf.random.normal((sample_len, 1), theta_true, sigma)
theta_est = tf.reduce_mean(samples)
theta_est
Let us validate the trade-off equation by calculating the summation of the squared bias and the variance of our estimator. First, calculate the MSE of our estimator.
#@tab all
mse(samples, theta_true)
Next, we calculate
bias = stat_bias(theta_true, theta_est)
np.square(samples.std()) + np.square(bias)
#@tab pytorch
bias = stat_bias(theta_true, theta_est)
torch.square(samples.std(unbiased=False)) + torch.square(bias)
#@tab tensorflow
bias = stat_bias(theta_true, theta_est)
tf.square(tf.math.reduce_std(samples)) + tf.square(bias)
The most commonly encountered topic in statistical inference is hypothesis testing. While hypothesis testing was popularized in the early 20th century, the first use can be traced back to John Arbuthnot in the 1700s. John tracked 80-year birth records in London and concluded that more men were born than women each year. Following that, the modern significance testing is the intelligence heritage by Karl Pearson who invented
A hypothesis test is a way of evaluating some evidence against the default statement about a population. We refer the default statement as the null hypothesis
Imagine you are a chemist. After spending thousands of hours in the lab, you develop a new medicine which can dramatically improve one's ability to understand math. To show its magic power, you need to test it. Naturally, you may need some volunteers to take the medicine and see whether it can help them learn math better. How do you get started?
First, you will need carefully random selected two groups of volunteers, so that there is no difference between their math understanding ability measured by some metrics. The two groups are commonly referred to as the test group and the control group. The test group (or treatment group) is a group of individuals who will experience the medicine, while the control group represents the group of users who are set aside as a benchmark, i.e., identical environment setups except taking this medicine. In this way, the influence of all the variables are minimized, except the impact of the independent variable in the treatment.
Second, after a period of taking the medicine, you will need to measure the two groups' math understanding by the same metrics, such as letting the volunteers do the same tests after learning a new math formula. Then, you can collect their performance and compare the results. In this case, our null hypothesis will be that there is no difference between the two groups, and our alternate will be that there is.
This is still not fully formal. There are many details you have to think of carefully. For example, what is the suitable metrics to test their math understanding ability? How many volunteers for your test so you can be confident to claim the effectiveness of your medicine? How long should you run the test? How do you decide if there is a difference between the two groups? Do you care about the average performance only, or also the range of variation of the scores? And so on.
In this way, hypothesis testing provides a framework for experimental design and reasoning about certainty in observed results. If we can now show that the null hypothesis is very unlikely to be true, we may reject it with confidence.
To complete the story of how to work with hypothesis testing, we need to now introduce some additional terminology and make some of our concepts above formal.
The statistical significance measures the probability of erroneously rejecting the null hypothesis,
It is also referred to as the type I error or false positive. The
:numref:fig_statistical_significance
shows the observations' values and probability of a given normal distribution in a two-sample hypothesis test. If the observation data example is located outsides the
🏷️
fig_statistical_significance
The statistical power (or sensitivity) measures the probability of reject the null hypothesis,
Recall that a type I error is error caused by rejecting the null hypothesis when it is true, whereas a type II error is resulted from failing to reject the null hypothesis when it is false. A type II error is usually denoted as
Intuitively, statistical power can be interpreted as how likely our test will detect a real discrepancy of some minimum magnitude at a desired statistical significance level.
One of the most common uses of statistical power is in determining the number of samples needed. The probability you reject the null hypothesis when it is false depends on the degree to which it is false (known as the effect size) and the number of samples you have. As you might expect, small effect sizes will require a very large number of samples to be detectable with high probability. While beyond the scope of this brief appendix to derive in detail, as an example, want to be able to reject a null hypothesis that our sample came from a mean zero variance one Gaussian, and we believe that our sample's mean is actually close to one, we can do so with acceptable error rates with a sample size of only
We can imagine the power as a water filter. In this analogy, a high power hypothesis test is like a high quality water filtration system that will reduce harmful substances in the water as much as possible. On the other hand, a smaller discrepancy is like a low quality water filter, where some relative small substances may easily escape from the gaps. Similarly, if the statistical power is not of enough high power, then the test may not catch the smaller discrepancy.
A test statistic
Often,
The
If the
Normally there are two kinds of significance test: the one-sided test and the two-sided test. The one-sided test (or one-tailed test) is applicable when the null hypothesis and the alternative hypothesis only have one direction. For example, the null hypothesis may state that the true parameter
After getting familiar with the above concepts, let us go through the general steps of hypothesis testing.
- State the question and establish a null hypotheses
$H_0$ . - Set the statistical significance level
$\alpha$ and a statistical power ($1 - \beta$ ). - Obtain samples through experiments. The number of samples needed will depend on the statistical power, and the expected effect size.
- Calculate the test statistic and the
$p$ -value. - Make the decision to keep or reject the null hypothesis based on the
$p$ -value and the statistical significance level$\alpha$ .
To conduct a hypothesis test, we start by defining a null hypothesis and a level of risk that we are willing to take. Then we calculate the test statistic of the sample, taking an extreme value of the test statistic as evidence against the null hypothesis. If the test statistic falls within the reject region, we may reject the null hypothesis in favor of the alternative.
Hypothesis testing is applicable in a variety of scenarios such as the clinical trails and A/B testing.
When estimating the value of a parameter Neyman.1937
, who first introduced the concept of confidence interval in 1937.
To be useful, a confidence interval should be as small as possible for a given degree of certainty. Let us see how to derive it.
Mathematically, a confidence interval for the true parameter
eq_confidence
Here
Note that :eqref:eq_confidence
is about variable
It is very tempting to interpret a
This may seem pedantic, but it can have real implications for the interpretation of the results. In particular, we may satisfy :eqref:eq_confidence
by constructing intervals that we are almost certain do not contain the true value, as long as we only do so rarely enough. We close this section by providing three tempting but false statements. An in-depth discussion of these points can be found in :cite:Morey.Hoekstra.Rouder.ea.2016
.
- Fallacy 1. Narrow confidence intervals mean we can estimate the parameter precisely.
- Fallacy 2. The values inside the confidence interval are more likely to be the true value than those outside the interval.
-
Fallacy 3. The probability that a particular observed
$95%$ confidence interval contains the true value is$95%$ .
Sufficed to say, confidence intervals are subtle objects. However, if you keep the interpretation clear, they can be powerful tools.
Let us discuss the most classical example, the confidence interval for the mean of a Gaussian of unknown mean and variance. Suppose we collect
If we now consider the random variable
we obtain a random variable following a well-known distribution called the Student's t-distribution on
This distribution is very well studied, and it is known, for instance, that as
Thus, we may conclude that for large
Rearranging this by multiplying both sides by
Thus we know that we have found our eq_gauss_confidence
It is safe to say that :eqref:eq_gauss_confidence
is one of the most used formula in statistics. Let us close our discussion of statistics by implementing it. For simplicity, we assume we are in the asymptotic regime. Small values of t_star
obtained either programmatically or from a
# Number of samples
N = 1000
# Sample dataset
samples = np.random.normal(loc=0, scale=1, size=(N,))
# Lookup Students's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = np.mean(samples)
sigma_hat = samples.std(ddof=1)
(mu_hat - t_star*sigma_hat/np.sqrt(N), mu_hat + t_star*sigma_hat/np.sqrt(N))
#@tab pytorch
# PyTorch uses Bessel's correction by default, which means the use of ddof=1
# instead of default ddof=0 in numpy. We can use unbiased=False to imitate
# ddof=0.
# Number of samples
N = 1000
# Sample dataset
samples = torch.normal(0, 1, size=(N,))
# Lookup Students's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = torch.mean(samples)
sigma_hat = samples.std(unbiased=True)
(mu_hat - t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)),\
mu_hat + t_star*sigma_hat/torch.sqrt(torch.tensor(N, dtype=torch.float32)))
#@tab tensorflow
# Number of samples
N = 1000
# Sample dataset
samples = tf.random.normal((N,), 0, 1)
# Lookup Students's t-distribution c.d.f.
t_star = 1.96
# Construct interval
mu_hat = tf.reduce_mean(samples)
sigma_hat = tf.math.reduce_std(samples)
(mu_hat - t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)), \
mu_hat + t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)))
- Statistics focuses on inference problems, whereas deep learning emphasizes on making accurate predictions without explicitly programming and understanding.
- There are three common statistics inference methods: evaluating and comparing estimators, conducting hypothesis tests, and constructing confidence intervals.
- There are three most common estimators: statistical bias, standard deviation, and mean square error.
- A confidence interval is an estimated range of a true population parameter that we can construct by given the samples.
- Hypothesis testing is a way of evaluating some evidence against the default statement about a population.
- Let
$X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(0, \theta)$ , where "iid" stands for independent and identically distributed. Consider the following estimators of$\theta$ :$$\hat{\theta} = \max {X_1, X_2, \ldots, X_n };$$ $$\tilde{\theta} = 2 \bar{X_n} = \frac{2}{n} \sum_{i=1}^n X_i.$$ - Find the statistical bias, standard deviation, and mean square error of
$\hat{\theta}.$ - Find the statistical bias, standard deviation, and mean square error of
$\tilde{\theta}.$ - Which estimator is better?
- Find the statistical bias, standard deviation, and mean square error of
- For our chemist example in introduction, can you derive the 5 steps to conduct a two-sided hypothesis testing? Given the statistical significance level
$\alpha = 0.05$ and the statistical power$1 - \beta = 0.8$ . - Run the confidence interval code with
$N=2$ and$\alpha = 0.5$ for$100$ independently generated dataset, and plot the resulting intervals (in this caset_star = 1.0
). You will see several very short intervals which are very far from containing the true mean$0$ . Does this contradict the interpretation of the confidence interval? Do you feel comfortable using short intervals to indicate high precision estimates?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: