# STAT 345: Nonparametric Statistics

## Lesson 03.2: Confidence Intervals for Quantiles

**Reading: Conover Section 3.2**

*Prof. John T. Whelan*

Tuesday 4 February 2025

These lecture slides are in a computational notebook.  You have access to them through http://vmware.rit.edu/

Flat HTML and slideshow versions are also in MyCourses.

The notebook can run Python commands (other notebooks can use R or Julia; "Ju-Pyt-R").  Think: computational data analysis, not "coding".

Standard commands to activate inline interface and import libraries:

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

- Recall, for some $p^*\in[0,1]$, the $p^*$ quantile $x_{p^*}$ of a random variable $\color{royalblue}{X}$ is defined by
$$P({\color{royalblue}{X}}{\mathbin{<}}x_{p^*}) \le p^* \qquad\hbox{and}\qquad P({\color{royalblue}{X}}{\mathbin{>}}x_{p^*}) = 1-P({\color{royalblue}{X}}{\mathbin{\le}}x_{p^*})\le 1-p^*$$

- If $\color{royalblue}{X}$ is a continuous rv, $P({\color{royalblue}{X}}{\mathbin{\le}}x_{p^*})=P({\color{royalblue}{X}}{\mathbin{<}}x_{p^*})$ & the defn simplifies to $P({\color{royalblue}{X}}{\mathbin{\le}}x_{p^*})=p^*$

- Today, we'll consider how to construct a confidence interval for a quantile $x_{p^*}$, given the data $\{x_i\}=x_1,\ldots,x_n$.

- First, consider a point estimator for a quantile.  An obvious estimator for the population median $x_{0.5}$ is the sample median, e.g., if $n=11$, this would be the 6th value in the sorted sample, which has 5 values below and 5 above.

- Sample median is an example of an **order statistic**.  The $k$th order statistic of a sample $\{{\color{royalblue}{X_i}}\}$ is written ${\color{royalblue}{X^{(k)}}}$, and it’s simply the $k$th value in the sorted list of the sample.

- E.g., if $x_1=1.3$, $x_2=-2.1$, $x_3=3.4$, and $x_4=0.7$,<br>we have $x^{(1)}=-2.1$,
$x^{(2)}=0.7$, $x^{(3)}=1.3$, and $x^{(4)}=3.4$

In [3]:
x_i = np.array([1.3,-2.1,3.4,0.7])
np.sort(x_i)

array([-2.1,  0.7,  1.3,  3.4])
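As a small sketch of the indexing convention (the helper name `order_stat` and the random sample of size 11 are assumptions made up for illustration, not part of the lecture data), the $k$th order statistic can be read off the sorted array, keeping in mind that $k$ counts from 1 while Python indexes from 0. The last lines check that for $n=11$ the sample median coincides with $x^{(6)}$.

```python
import numpy as np

def order_stat(x, k):
    """Return the k-th order statistic x^(k), with k counted from 1."""
    x = np.asarray(x)
    if not 1 <= k <= len(x):
        raise ValueError("k must be between 1 and n")
    # Sorted position k corresponds to Python index k-1
    return np.sort(x)[k - 1]

# The four-point example from the text
x_small = np.array([1.3, -2.1, 3.4, 0.7])
print([float(order_stat(x_small, k)) for k in range(1, 5)])

# Hypothetical sample of size 11, only to illustrate median = x^(6)
rng = np.random.default_rng(0)
x_eleven = rng.normal(size=11)
print(order_stat(x_eleven, 6) == np.median(x_eleven))
```

Sorting the whole array is the simplest choice here; for a single order statistic of a very large sample, a partial sort would also work.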

- For a random sample $\{{\color{royalblue}{X_i}}\}$, each order statistic ${\color{royalblue}{X^{(k)}}}$ is a random variable which depends on the whole sample.

- Endpoints of confidence interval on quantile $x_{p^*}$ are order statistics; define a $1-\alpha$ CI by
$$
  P({\color{royalblue}{X^{(r)}}} {\mathbin{\le}}x_{p^*} {\mathbin{\le}}{\color{royalblue}{X^{(s)}}}) = 1-\alpha$$

- Choose the integers $r$ and $s$ by considering
the inequalities ${\color{royalblue}{X^{(r)}}} {\mathbin{\le}}x_{p^*}$ and
$x_{p^*} {\mathbin{\le}}{\color{royalblue}{X^{(s)}}}$.

- Assuming for
simplicity that we’re dealing with a continuous distribution, so that
$P({\color{royalblue}{X}}{\mathbin{\le}}x_{p^*})=P({\color{royalblue}{X}}{\mathbin{<}}x_{p^*})=p^*$
for each point in the sample, the statement
${\color{royalblue}{X^{(r)}}} \le x_{p^*}$ means that at least $r$ of the
values in the sample are below $x_{p^*}$.

- OTOH,
$x_{p^*} \le {\color{royalblue}{X^{(s)}}}$ means that fewer than $s$ of the
points in the sample are below $x_{p^*}$.

- So if ${\color{royalblue}{Y}}\sim\operatorname{Bin}(n,p^*)$ is the #
of points in the sample below $x_{p^*}$, we can write
$$P(r {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{<}}s) = \sum_{i=r}^{s-1} \binom{n}{i} (p^*)^i(1-p^*)^{n-i}
  = 1-\alpha$$

$$
P({\color{royalblue}{X^{(r)}}} {\mathbin{\le}}x_{p^*} {\mathbin{\le}}{\color{royalblue}{X^{(s)}}}) = 1-\alpha
\ \hbox{where}\ P(r {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{<}}s) = \sum_{i=r}^{s-1} \binom{n}{i} (p^*)^i(1-p^*)^{n-i}
  = 1-\alpha
$$
- Find the `r` and `s` which satisfy this and pick the corresponding order statistics as the ends of the confidence interval.

- As usual, if $np^*$ and $n(1-p^*)$ are large enough, we can use the
normal approximation, along with $E({\color{royalblue}{Y}})=np^*$ and
$\operatorname{Var}({\color{royalblue}{Y}})=np^*(1-p^*)$, and the continuity
correction $$P(r {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{<}}s)
  = P\left(r-\frac{1}{2} {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{\le}}s-\frac{1}{2}\right)
  = 1-\alpha$$ to write
$$\begin{aligned}
    r-\frac{1}{2} &\approx np^* - z_{1-\alpha/2} \sqrt{np^*(1-p^*)}\\
    s-\frac{1}{2} &\approx np^* + z_{1-\alpha/2} \sqrt{np^*(1-p^*)}\\
  \end{aligned}$$
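The claim that the coverage of $[X^{(r)}, X^{(s)}]$ equals the binomial sum, whatever the continuous distribution, can be checked by simulation. The sketch below is only an illustration under stated assumptions: the exponential distribution, the seed, the number of trials, and the values $n=11$, $p^*=0.5$, $r=3$, $s=9$ are all arbitrary choices that do not come from the lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(345)

# Illustrative (assumed) settings, not taken from the lecture data
n, pstar, r, s = 11, 0.5, 3, 9
n_trials = 20000

# True quantile of a standard exponential, chosen only as an example
x_pstar = stats.expon.ppf(pstar)

# Each row is one sample of size n; sorting a row gives its order statistics
samples = rng.exponential(size=(n_trials, n))
ordered = np.sort(samples, axis=1)

# Fraction of samples whose interval [x^(r), x^(s)] contains the true quantile
covered = (ordered[:, r - 1] <= x_pstar) & (x_pstar <= ordered[:, s - 1])
mc_coverage = covered.mean()

# Binomial prediction P(r <= Y < s) with Y ~ Bin(n, pstar)
ydist = stats.binom(n, pstar)
exact_coverage = ydist.cdf(s - 1) - ydist.cdf(r - 1)

print('simulated coverage %.4f, binomial prediction %.4f'
      % (mc_coverage, exact_coverage))
```

Swapping in a different continuous distribution (and its `ppf`) should leave the agreement unchanged, up to simulation noise, which is the sense in which the interval is nonparametric.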

To illustrate, let's take a random sample, as in the last lesson:

In [4]:
x_i=np.array([0.5103, 0.9597, 0.0861, 0.4118, 0.2941, 0.2506, 0.3237, 0.4470, 0.4915, 0.6421,
              0.5123, 0.8789, 0.3373, 1.6668, 0.1830, 0.8486, 0.5105, 0.6678, 0.2892, 0.3326,
              1.2161, 3.6242, 0.4207, 0.8942, 1.6524, 1.8217, 0.2444, 0.1984, 0.3115, 1.6670,
              0.2557, 0.5141, 3.0989, 0.6351, 0.8932, 0.4223, 0.8816, 1.3748, 0.1684, 1.0407])
n = len(x_i); n

40

To get the order statistics $\{x^{(i)}\}$, we just have to sort the array $\{x_i\}$:

In [5]:
xordered_i = np.sort(x_i); xordered_i

array([0.0861, 0.1684, 0.183 , 0.1984, 0.2444, 0.2506, 0.2557, 0.2892,
       0.2941, 0.3115, 0.3237, 0.3326, 0.3373, 0.4118, 0.4207, 0.4223,
       0.447 , 0.4915, 0.5103, 0.5105, 0.5123, 0.5141, 0.6351, 0.6421,
       0.6678, 0.8486, 0.8789, 0.8816, 0.8932, 0.8942, 0.9597, 1.0407,
       1.2161, 1.3748, 1.6524, 1.6668, 1.667 , 1.8217, 3.0989, 3.6242])

Now let's construct a 90% confidence interval on the 60th percentile $x_{0.6}$ of the distribution.  The indices for the order statistics are $r$ and $s$, where at least 90% of the probability of the binomial distribution $\operatorname{Bin}(n,0.6)$ lies between $r$ and $s-1$, inclusive, i.e.,  $P(r {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{<}}s)=P(r {\mathbin{\le}}{\color{royalblue}{Y}} {\mathbin{\le}}(s-1))\ge 0.9$

In [6]:
pstar = 0.6; alpha = 1. - 0.90; mydist = stats.binom(n,pstar)
# interval() returns the equal-tailed endpoints, which here play the roles of r and s-1
r,sm1 = mydist.interval(1.-alpha); r = int(r); s = int(sm1) + 1
r,s

(19, 30)

So $r=19$ and $s=30$.  Remembering that Python indexes from 0 rather than 1, we can extract the order statistics $x^{(19)}$ and $x^{(30)}$ as `xordered_i[r-1]` and `xordered_i[s-1]`:

In [7]:
print('The (at least) %d%% CI on x_%.1f is from x^(%d)=%g to x^(%d)=%g.' %
     (100*(1-alpha),pstar,r,xordered_i[r-1],s,xordered_i[s-1]))

The (at least) 90% CI on x_0.6 is from x^(19)=0.5103 to x^(30)=0.8942.
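The endpoints returned by `interval` are the equal-tailed quantiles of the binomial distribution, so the same $r$ and $s$ can be reproduced with `ppf`. The sketch below uses new variable names (`ydist`, `lower`, `upper`, `r_check`, `s_check`) purely for illustration, with the same $n=40$, $p^*=0.6$, and $\alpha=0.1$ as above.

```python
from scipy import stats

n, pstar, alpha = 40, 0.6, 0.10
ydist = stats.binom(n, pstar)

# ppf(q) is the smallest k with P(Y <= k) >= q, so
# P(Y < lower) < alpha/2 and P(Y > upper) <= alpha/2
lower = int(ydist.ppf(alpha / 2))
upper = int(ydist.ppf(1 - alpha / 2))

# r is the lower endpoint; s is one past the upper endpoint,
# because the event is r <= Y < s
r_check, s_check = lower, upper + 1

# Probability actually contained in r <= Y <= s-1
coverage = ydist.cdf(s_check - 1) - ydist.cdf(r_check - 1)
print(r_check, s_check, coverage)
```

Because each tail is cut at or below $\alpha/2$, the resulting coverage is at least $1-\alpha$ rather than exactly $1-\alpha$.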


We can check the actual coverage of the confidence interval, by looking at the probability between $r$ and $s-1$:

In [8]:
print('The CI on x_%.1f from x^(%d)=%g to x^(%d)=%g has coverage %g%%.'
      % (pstar,r,xordered_i[r-1],s,xordered_i[s-1],100*(mydist.cdf(s-1) - mydist.cdf(r-1))))

The CI on x_0.6 from x^(19)=0.5103 to x^(30)=0.8942 has coverage 92.561%.


The actual coverage is 92.6% rather than 90%, because $Y$ is discrete, but we can verify that neither endpoint can be moved inward without losing the guarantee.  If we tried to reduce $s$ to 29 or increase $r$ to 20, we'd dip below 90%:

In [9]:
print('The CI on x_%.1f from x^(%d)=%g to x^(%d)=%g has coverage %g%%.'
      % (pstar,r,xordered_i[r-1],s-1,xordered_i[s-2],100*(mydist.cdf(s-2) - mydist.cdf(r-1))))
print('The CI on x_%.1f from x^(%d)=%g to x^(%d)=%g has coverage %g%%.'
      % (pstar,r+1,xordered_i[r],s,xordered_i[s-1],100*(mydist.cdf(s-1) - mydist.cdf(r))))

The CI on x_0.6 from x^(19)=0.5103 to x^(29)=0.8932 has coverage 88.9883%.
The CI on x_0.6 from x^(20)=0.5105 to x^(30)=0.8942 has coverage 89.0426%.


We can also check what the normal approximation would give us:

In [10]:
mu = mydist.mean(); sigma = mydist.std(); mu,sigma

(24.0, 3.0983866769659336)

In [11]:
zcrit = stats.norm.isf(0.5*alpha); zcrit

1.6448536269514729

In [12]:
rn = 0.5 + mu - zcrit * sigma; sn = 0.5 + mu + zcrit * sigma; rn,sn

(19.403607436694465, 29.596392563305535)

If we round $r$ down and $s$ up to get a conservative interval, we see that again we get $x^{(19)}$ to $x^{(30)}$:

In [13]:
(int(np.floor(rn)),int(np.ceil(sn)),r,s)

(19, 30, 19, 30)
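To see how the two recipes compare for other quantiles, the steps above can be wrapped in two small functions. This is a sketch, not a definitive implementation: the function names `exact_indices` and `normal_indices` are hypothetical, and the extra value $p^*=0.9$ is an assumed example that does not appear in the lecture.

```python
import numpy as np
from scipy import stats

def exact_indices(n, pstar, alpha):
    """(r, s) from the equal-tailed quantiles of Bin(n, pstar)."""
    lower, upper = stats.binom(n, pstar).interval(1 - alpha)
    return int(lower), int(upper) + 1

def normal_indices(n, pstar, alpha):
    """(r, s) from the normal approximation with continuity correction."""
    mu = n * pstar
    sigma = np.sqrt(n * pstar * (1 - pstar))
    zcrit = stats.norm.isf(alpha / 2)
    # Round r down and s up so the interval errs on the conservative side
    r = int(np.floor(mu + 0.5 - zcrit * sigma))
    s = int(np.ceil(mu + 0.5 + zcrit * sigma))
    return r, s

# No clipping is applied: an index outside 1..n means the sample has
# no order statistic to serve as that endpoint.
for pstar in (0.5, 0.6, 0.9):
    print(pstar, exact_indices(40, pstar, 0.10), normal_indices(40, pstar, 0.10))
```

The approximation is most trustworthy when $np^*$ and $n(1-p^*)$ are both large, so any disagreement is more likely for quantiles near 0 or 1.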

Quantile intervals require care to avoid off-by-one errors.

One case that offers a built-in check is the CI on the median $x_{0.5}$.

In [14]:
pstar = 0.5; alpha = 1. - 0.90; mydist = stats.binom(n,pstar)
r,sm1 = mydist.interval(1.-alpha); r = int(r); s = int(sm1) + 1
print('The (at least) %d%% CI on x_%.1f with n=%d is from x^(%d)=%g to x^(%d)=%g.' %
     (100*(1-alpha),pstar,n,r,xordered_i[r-1],s,xordered_i[s-1]))
print('The exact coverage is %g%%.' % (100*(mydist.cdf(s-1) - mydist.cdf(r-1))));

The (at least) 90% CI on x_0.5 with n=40 is from x^(15)=0.4207 to x^(26)=0.8486.
The exact coverage is 91.931%.


This should be symmetric, and we see there are $15-1=14$ values below $x^{(15)}$ and $40-26=14$ values above $x^{(26)}$, so it is indeed symmetric.

### Question: Can we ever have $\gamma$ so large that we prove $H_1$ is true?

For a test with critical region $C$, the power is $\gamma=P({\color{royalblue}{\mathbf{X}}}\in C|H_1)$.
- The power is a property of the test, not of the observation.
- Could get $\gamma=1$ if the test *always* rejected $H_0$, but that wouldn't make us conclude $H_1$ is true.

What can we conclude about $H_1$ from observing $\mathbf{x}\in C$ or more generally from some data $D$?
Bayes's theorem tells us
$$
P(H_1|D) = \frac{P(H_1,D)}{P(D)} = \frac{P(D|H_1)P(H_1)}{P(D)}
$$
but to use this, need to define $P(H_1)$ (prior prob) & $P(D)$ (overall prob of observing what we did).

Even if $P(D|H_1)=1$, that doesn't tell us about $P(H_1|D)$, which still depends upon $P(H_1)$ and $P(D)$.

On the other hand, what if $P(D|H_1)=0$ (but $P(D)>0$, so $D$ is not a set of measure zero)?

$$
\hbox{Then}\qquad
P(H_1|D)=\frac{P(D|H_1)P(H_1)}{P(D)}=0
$$

I.e., a hypothesis which tells us the observed data are impossible is definitely false.

Don't really need probability to conclude that, only logic.
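Both cases can be made concrete with a minimal numerical sketch. It assumes, beyond what is stated above, that $H_0$ and $H_1$ are mutually exclusive and exhaustive, so that $P(D)=P(D|H_1)P(H_1)+P(D|H_0)P(H_0)$; the function name `posterior_h1` and all of the numbers are made up for illustration.

```python
def posterior_h1(prior_h1, like_h1, like_h0):
    """P(H1|D) via Bayes's theorem, assuming H0 and H1 are exhaustive."""
    # P(D) = P(D|H1) P(H1) + P(D|H0) P(H0)
    evidence = like_h1 * prior_h1 + like_h0 * (1.0 - prior_h1)
    return like_h1 * prior_h1 / evidence

# P(D|H1) = 1 does not settle the matter: the answer depends on the prior
for prior in (0.5, 0.01):
    print('P(H1)=%g gives P(H1|D)=%g' % (prior, posterior_h1(prior, 1.0, 0.5)))

# P(D|H1) = 0 with P(D) > 0 forces P(H1|D) = 0 whatever the prior
print('P(H1|D)=%g' % posterior_h1(0.5, 0.0, 0.5))
```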