# Confidence Sets

## This is a rough work in progress!

## Types of Parameters

Many different kinds of things are called "parameters." Here are several categories.

### Population parameters

Any property of a population may be called a _parameter_.
Examples include the population mean, percentiles, number of modes, etc.

If the population has more than one "value" per item, e.g., if the population is a group of people each of whom has a height and a weight, then the population correlation between height and weight is a parameter.

Similarly, consider a group of individuals and the values of some quantity (the "response") for each of those individuals without and with some intervention (notionally, a "treatment").
The average response with the intervention minus the average response without the intervention is a parameter (the _average treatment effect_).

If we are sampling at random from a population, the probability distribution of the sample
depends on the values in the population, and thus, in general, on the 
values of population parameters.
(It will also depend on the sampling design.)

### Functional parameters of probability distributions

Suppose $X \sim P$, where $P$ is a probability distribution on some space $\mathcal{X}$ of possible outcomes.
We assume that $P \in \mathcal{P}$, some known set of possible distributions.

A _functional parameter_ $\theta(P)$ is a function of $P$.
For instance the (population) mean is a functional parameter:

\begin{equation}
\theta(P) = \mathbb{E}X \equiv \int_\mathcal{X} x dP(x).
\end{equation}

So are other moments of the probability distribution:

\begin{equation}
\theta(P) = \mathbb{E}X^n \equiv \int_\mathcal{X} x^n dP(x), \;\; n=1, 2, \ldots .
\end{equation}

Other properties of $P$, such as percentiles of a univariate
distribution, are also functional parameters.
For instance, if $X$ is a real-valued random variable and $\alpha \in (0, 1)$,
then the $\alpha$ percentile of $P$,
\begin{equation}
\theta(P) = P_\alpha \equiv \inf \left \{x: \int_{-\infty}^x dP(u) \ge \alpha \right \},
\end{equation}
is a functional parameter.

For multivariate distributions, correlations among the components of $X$
are functional parameters.

In general, the value of a functional parameter does not determine the distribution:
there can be distributions $P \ne Q$ with $\theta(P) = \theta(Q)$.
For instance, there are infinitely many normal distributions with the
same mean (but different variances).
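
The functional parameters described in this section can be sketched for a distribution with finitely many possible values, where the integrals reduce to sums. The distributions `P` and `Q` below are hypothetical, chosen only for illustration, and the helper names `moment` and `percentile` are not from any library.

```python
from fractions import Fraction

# Hypothetical discrete distributions on the real line: value -> probability.
P = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}
Q = {1: Fraction(3, 4), 2: Fraction(1, 4)}

def moment(dist, n=1):
    """n-th moment: the integral of x**n dP(x), which is a finite sum here."""
    return sum(x**n * prob for x, prob in dist.items())

def percentile(dist, alpha):
    """Smallest x whose cumulative probability is at least alpha."""
    cumulative = 0
    for x in sorted(dist):
        cumulative += dist[x]
        if cumulative >= alpha:
            return x
    raise ValueError("alpha must not exceed the total probability")

mean_P = moment(P, 1)                      # Fraction(5, 4)
mean_Q = moment(Q, 1)                      # also Fraction(5, 4), although Q != P
median_P = percentile(P, Fraction(1, 2))   # 1
```

Here `P` and `Q` are different distributions with the same mean, so the mean alone does not identify which of the two generated the data.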

### Parameters as indices of sets of distributions

Another use of the term "parameter" is as an abstract
index that points to a particular distribution in
a family of distributions.
For instance, we might have a multiset of distributions
$\mathcal{P} = \{P_\eta\}_{\eta \in \Theta}$.
In that case, $\eta$ is an index parameter.
An index parameter is _identifiable_ if distinct indices always give distinct distributions:
$P_\eta \ne P_\nu$ whenever $\eta, \nu \in \Theta$ and $\eta \ne \nu$.
That is, in principle $X$ contains enough information to
identify the value of the parameter with arbitrarily high accuracy, given enough
observations.
Otherwise, the parameter is _non-identifiable_ or _unidentifiable_: the data
do not contain enough information to distinguish among different values
of the parameter, no matter how many observations are made.


#### Special case: location-scale families

Many indexed families of distributions are related through the
value of their parameter in a particular way. 
For instance, suppose that the outcome space $\mathcal{X}$ is a real vector space,
so it makes sense to add elements of $\mathcal{X}$ and to multiply them by scalars.

If $X \sim P$, then for any $\theta \in \mathcal{X}$ and $a \in \Re \backslash \{0\}$,
we could define $P_{\theta,a}$ to be the distribution of $aX+\theta$.
Then $\{P_{\theta, a} \}_{\theta \in \mathcal{X}, a \in \Re \backslash \{0\}}$ is a 
_location-scale family_
with parameter $(\theta, a)$.
As $\theta$ varies, the probability distribution "shifts" its location.
As $a$ varies, the probability distribution $P$ is "stretched" or re-scaled.
The family of univariate normal distributions is a location-scale family with the two-dimensional parameter $(\mu, \sigma)$, where $\mu \in \Re$ and $\sigma \in \Re \backslash \{0\}$: take $P$ to be the standard normal distribution, $\theta = \mu$, and $a = \sigma$.
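
As a minimal sketch of the normal case, Python's standard-library `statistics.NormalDist` supports multiplying a normal distribution by a constant and adding a constant, which is exactly the map $X \mapsto aX + \theta$. The wrapper `location_scale` is a hypothetical helper written for this illustration.

```python
from statistics import NormalDist

def location_scale(base, theta, a):
    """Distribution of a*X + theta when X ~ base (a NormalDist) and a != 0."""
    if a == 0:
        raise ValueError("the scale a must be nonzero")
    return base * a + theta

Z = NormalDist(0, 1)                              # the base distribution P
shifted = location_scale(Z, theta=3, a=1)         # location moves, spread unchanged
stretched = location_scale(Z, theta=0, a=2)       # spread doubles, location unchanged
both = location_scale(Z, theta=3, a=2)
reflected = location_scale(Z, theta=3, a=-2)

# Because the standard normal is symmetric about 0, a = 2 and a = -2
# produce the same distribution.
same = (both == reflected)
```

The last comparison shows that when the base distribution is symmetric and $a$ ranges over $\Re \backslash \{0\}$, the sign of $a$ cannot be recovered from the distribution; restricting to $a > 0$ removes that ambiguity.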

#### Notation for index parameters
To keep the notation for index parameters
parallel with the notation for functional parameters, we will define $\theta(P) \equiv 
\{ \eta: P = P_\eta\}$.
If $\theta$ is identifiable, $\theta(P)$ is a singleton set; otherwise,
it may contain more than one value. 
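
A small sketch may make the set-valued notation concrete. It assumes a hypothetical finite family in which $P_\eta$ is the Bernoulli distribution with success probability $\eta^2$, so that $\eta$ and $-\eta$ index the same distribution. The names `family`, `theta_of`, and `is_identifiable` are illustrative only.

```python
from fractions import Fraction

# Hypothetical index set: -1, -1/2, 0, 1/2, 1.
Theta = [Fraction(k, 2) for k in range(-2, 3)]

# P_eta is stored as the pair (P{X = 0}, P{X = 1}) with P{X = 1} = eta**2.
family = {eta: (1 - eta**2, eta**2) for eta in Theta}

def theta_of(P, family):
    """The set {eta : P == P_eta} of indices that point to the distribution P."""
    return {eta for eta, P_eta in family.items() if P_eta == P}

def is_identifiable(family):
    """True when distinct indices always give distinct distributions."""
    return len(set(family.values())) == len(family)

# Restricting the index set to eta >= 0 removes the duplicated distributions.
restricted = {eta: P_eta for eta, P_eta in family.items() if eta >= 0}

half = Fraction(1, 2)
indices_full = theta_of(family[half], family)              # two indices
indices_restricted = theta_of(family[half], restricted)    # a singleton
```

In the full family the index is not identifiable and `theta_of` returns a two-element set; in the restricted family it is identifiable and `theta_of` returns a singleton.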

### Parametric families of distributions

A _parametric family of distributions_ is an indexed collection of probability distributions
that depends on the index parameter (which might be multidimensional) in a 
fixed functional way.
(We can think of things like the mean and standard deviation of a normal distribution as either a multidimensional parameter or as a collection of parameters.)

Most distributions that have names are parametric families, e.g., Bernoulli (the parameter $p$), Binomial (the two-dimensional parameter $(n, p)$), Geometric ($p$), Hypergeometric (the three-dimensional parameter $(N, G, n)$), Negative Binomial $(p, k)$, Normal $(\mu, \sigma)$, Student's $T$ $(\mu, \sigma, \nu)$, continuous uniform (the endpoints of the interval of support, the two-dimensional parameter $(a, b)$), and so on. 

### Nuisance parameters

When the probability distribution of the data depends on a multi-dimensional parameter
but only some components of that parameter are of interest, the other components are called _nuisance parameters_. 
For instance, in estimating the mean of a normal distribution, the
variance of the distribution is a nuisance parameter: we don't care what it is,
but it affects the probability distribution of the data.

Similarly, in estimating a population mean from a stratified sample, the means
within the different strata are nuisance parameters.


### Abstract parameters

For most of the theory in this chapter, $\theta$ will be an abstract parameter:
the development applies to functional parameters, index parameters, parameters of
parametric families, etc.

## Confidence sets

What can we learn about the value of $\theta(P)$ from observations?
[The chapter on testing](./tests.ipynb) discusses testing hypotheses, including
hypotheses about parameters.
Here we explore a different approach to quantifying what a sample tells us about
$\theta$: confidence sets.
The treatment will be abstract but informal. 
(For instance, we shall ignore measurability issues.)

In an abuse of notation, we will let $\theta$ denote both the value of a parameter, and
the mapping from a distribution to the value of the parameter for that distribution,
as if $\theta$ were a functional parameter even if it is an index parameter (or some other
kind of parameter).
Thus, $\theta: \mathcal{P} \rightarrow \Theta$, $P \mapsto \theta(P)$.
If $P = P_\eta$, then $\theta(P) = \eta$.
The set $\Theta$ will denote the possible values of $\theta$. 
Lowercase Greek letters such as $\eta$ will denote
generic elements of $\Theta$.

We shall observe $X \sim P$, where $X$ takes values in the outcome space $\mathcal{X}$.
We do not know $P$, but we know that $P \in \mathcal{P}$, a known set of distributions.
Let $\mathcal{I}(\cdot)$ be a set-valued function that assigns a subset of $\Theta$ to each possible observation $x \in \mathcal{X}$.
For instance, we might observe $X \sim N(\theta, 1)$, and $\mathcal{I}(x)$ might be the interval $[x-c, x+c]$ for some fixed constant $c > 0$.

Fix $\alpha \in (0, 1)$.
Suppose that for all $\eta \in \Theta$, if $\theta(P) = \eta$ then
\begin{equation}
P \{\mathcal{I}(X) \ni \eta \} \ge 1-\alpha.
\end{equation}
Then $\mathcal{I}(\cdot)$ is a $1-\alpha$ _confidence set procedure_ for $\theta(P)$.
It maps outcomes to sets in such a way that the chance is at least
$1-\alpha$ that the resulting set will contain the parameter $\theta(P)$, whatever $P \in \mathcal{P}$ generated the data.

If we observe $X=x$, $\mathcal{I}(x)$ is _a $1-\alpha$ confidence set for $\theta$_.
The _confidence level_ of the set is $1-\alpha$.

When $\mathcal{I}(x) \ni \theta$, we say that the confidence set _covers_ $\theta$.
The _coverage probability_ of the confidence set $\mathcal{I}$
is $\Pr_P \{\mathcal{I}(X) \ni \theta \}$.

Before the data $X$ are observed, the chance that $\mathcal{I}(X)$ will contain $\theta$
is at least $1-\alpha$.
After the data $X=x$ are observed, the set $\mathcal{I}(x)$ either
does or does not contain $\theta$: there is nothing random anymore.

A _confidence interval_ is a special case of a confidence set, when the set is an interval of real numbers.

### Example: confidence interval for a Normal mean

Suppose $X \sim N(\theta, 1)$: $\theta$ is an index parameter for the family of unit variance normal distributions and also a functional parameter, since $\mathbb{E} X = \theta$.

Define $\mathcal{I}(x) \equiv [x - z_{1-\alpha/2}, x + z_{1-\alpha/2}]$, where
$z_{1-\alpha/2}$ is the $1-\alpha/2$ percentile of the standard normal distribution.
Then 
\begin{equation}
   \Pr \{ \mathcal{I}(X) \ni \theta \} = 1-\alpha
\end{equation}
whatever be $\theta \in \Re$.
Thus $[x - z_{1-\alpha/2}, x + z_{1-\alpha/2}]$ is a $1-\alpha$ confidence interval for $\theta$.

Why is the coverage probability of $[X - z_{1-\alpha/2}, X + z_{1-\alpha/2}]$ equal to $1-\alpha$?
Since $X - \theta$ has a standard normal distribution, the chance is $1-\alpha$ that $|X-\theta| \le z_{1-\alpha/2}$.
The interval $[X - z_{1-\alpha/2}, X + z_{1-\alpha/2}]$ contains $\theta$ if and only if that event occurs.
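
The coverage claim can be checked numerically. The sketch below uses only the Python standard library; the function names, the number of replications, and the seed are arbitrary choices made for this illustration.

```python
import random
from statistics import NormalDist

def normal_mean_interval(x, alpha):
    """The interval [x - z, x + z], with z the 1 - alpha/2 percentile of N(0, 1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (x - z, x + z)

def exact_coverage(alpha):
    """P{|X - theta| <= z} for X ~ N(theta, 1); it does not depend on theta."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    return std.cdf(z) - std.cdf(-z)

def simulated_coverage(theta, alpha, reps=20_000, seed=12345):
    """Fraction of simulated intervals that cover theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = normal_mean_interval(rng.gauss(theta, 1), alpha)
        hits += (lo <= theta <= hi)
    return hits / reps

lo, hi = normal_mean_interval(0.0, alpha=0.05)    # roughly (-1.96, 1.96)
```

The exact calculation returns $1-\alpha$ up to floating-point error, and the simulated fraction should be close to $1-\alpha$ for any value of `theta`, up to sampling variability.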

## Duality between hypothesis tests and confidence sets

One of the most versatile ways of constructing confidence sets is to _invert_
hypothesis tests.

Suppose we have a family of significance level $\alpha$ hypothesis tests for all possible values of a parameter $\theta \in \Theta$. 
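That is, for each $\eta \in \Theta$, we have a test $T_\eta : \mathcal{X} \rightarrow \{0, 1\}$,
where $T_\eta(x) = 0$ means that the test rejects the hypothesis $\theta(P) = \eta$ and $T_\eta(x) = 1$ means that it does not, such that
if $\theta(P) = \eta$,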
\begin{equation}
P \{ T_\eta(X) = 0 \} \le \alpha.
\end{equation}

Consider the set 
\begin{equation}
\mathcal{I}(X) \equiv \{ \eta \in \Theta : T_\eta(X) = 1 \};
\end{equation}
that is, $\mathcal{I}$ is the set of possible parameters $\eta \in \Theta$ for which the corresponding test $T_\eta$ does not reject the hypothesis that $\theta(P) = \eta$.

**Claim:** $\mathcal{I}(\cdot)$ is a $1-\alpha$ confidence set procedure. That is,
whatever the true value of $\theta(P)$ happens to be,
\begin{equation}
P \{ \mathcal{I}(X) \ni \theta(P) \} \ge 1-\alpha.
\end{equation}

**Proof:** By construction, $\mathcal{I}(X) \ni \theta(P)$ if and only if $T_{\theta(P)}(X) = 1$.
But $P \{ T_{\theta(P)}(X) = 0 \} \le \alpha$, so
$P \{ \mathcal{I}(X) \ni \theta(P) \} = P \{ T_{\theta(P)}(X) = 1 \} \ge 1-\alpha$.
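
The inversion can be sketched for a Binomial observation with known $n$. The test used here is one possible choice, not necessarily the one intended for the examples in this chapter: it rejects $\theta = \eta$ when either tail probability of the observed count is at most $\alpha/2$, which makes its level at most $\alpha$. Searching over a grid of candidate values is an approximation made for simplicity, and all function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P{X = k} for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def T(eta, x, n, alpha):
    """Return 0 if the test rejects theta = eta on observing x, else 1.

    Rejects when P_eta{X <= x} <= alpha/2 or P_eta{X >= x} <= alpha/2.
    """
    lower_tail = sum(binom_pmf(k, n, eta) for k in range(0, x + 1))
    upper_tail = sum(binom_pmf(k, n, eta) for k in range(x, n + 1))
    return int(lower_tail > alpha / 2 and upper_tail > alpha / 2)

def confidence_set(x, n, alpha, grid_size=1001):
    """Grid approximation to {eta in [0, 1] : T_eta(x) = 1}."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [eta for eta in grid if T(eta, x, n, alpha) == 1]

def coverage(p, n, alpha):
    """Exact P_p{I(X) contains p}, which equals P_p{T_p(X) = 1}."""
    return sum(binom_pmf(x, n, p) for x in range(n + 1) if T(p, x, n, alpha) == 1)

accepted = confidence_set(x=3, n=10, alpha=0.05)
endpoints = (min(accepted), max(accepted))
```

Because the Binomial distribution is discrete, `coverage(p, n, alpha)` is generally strictly greater than $1-\alpha$, which is why the definition of a confidence set procedure uses an inequality rather than an equality.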

### Example: confidence interval for Binomial $p$ (known $n$)


### Example: confidence interval for Hypergeometric $G$ (known $N$, $n$)

### Example: confidence interval for Gaussian $\mu$ (known $\sigma$)

## Confidence intervals from permutation tests

### The two-sample problem

### Alternative hypotheses: constant shift

### Alternative hypotheses: constant multiple