# Parameter estimation

## From parameter estimation to hypothesis testing

Suppose that we have a sample $X$ drawn from a distribution with density $p(\cdot)$, parametrized by unknown parameters $\theta$ that we would like to estimate from the sample, for example by maximizing the likelihood:
$$
p(X|\theta) \to \max_\theta
$$
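As a minimal sketch of this maximization, assume a normal model with $\theta = (\mu, \sigma)$ and synthetic data (both are illustrative assumptions, not part of the setting above). SciPy's `norm.fit` returns the maximum-likelihood estimates of location and scale:

```python
import numpy as np
import scipy.stats as sts

# synthetic sample; the "true" parameters are arbitrary illustration values
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=1000)

# maximum-likelihood estimates of (mu, sigma) for the normal family
mu_hat, sigma_hat = sts.norm.fit(sample)

def log_likelihood(mu, sigma):
    """Log-likelihood of the sample under N(mu, sigma^2)."""
    return np.sum(sts.norm.logpdf(sample, loc=mu, scale=sigma))

print(mu_hat, sigma_hat)
# the likelihood at the MLE is at least as high as at a nearby guess
print(log_likelihood(mu_hat, sigma_hat) >= log_likelihood(mu_hat + 0.1, sigma_hat))
```

For the normal family the estimates coincide with the sample mean and the (biased) sample standard deviation.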

Now let's turn it inside out. In the same setting, suppose we have a guess $\theta_0$ for the parameters. We can then evaluate the likelihood of observing our sample given these parameters. If the likelihood is high enough, we may assume that our guess is correct. This is the essence of hypothesis testing.

# Hypothesis testing

## Hypotheses

A hypothesis is any statement about the distribution of the sample:
$$
H_0: X \sim p(\cdot)
$$
1. The hypothesis $H_0$ is called the null hypothesis.
2. This is a simple hypothesis. There are also composite hypotheses, in which the distribution is specified only up to a family:
$$
H_0: X \sim p(\cdot) \in \mathcal{F} = \{p_\theta(\cdot), \theta \in \Theta\}
$$
- Example: $H_0: X \sim \mathcal{N}(\cdot, \cdot)$
3. There is always an alternative hypothesis $H_1$ (one-sided or two-sided), which is the negation of the null hypothesis.

## Hypothesis testing

Now, how do we decide whether our guess is correct? We could look at the likelihood, but it gives us only a number, not an answer. We would like an instrument that simply tells us whether the guess is correct or not. This instrument is called a hypothesis test. We introduce a decision (rejection) rule of the following form:
$$
\varphi(X) = \mathbb{I}\rm{nd}\{T(X) > c\}
$$
where $T$ is a test statistic (replacement for likelihood) and $c$ is the statistic's critical value.

Since the decision is itself a random variable, it is never certain. Two types of errors are possible:
1. Rejecting the null hypothesis when it is true (Type I)
2. Failing to reject the null hypothesis when it is false (Type II)

We'd like to bound the probability of a Type I error by a (small) $\alpha$ called the significance level. Here $P(H_1|H_0)$ denotes the probability of deciding in favor of $H_1$ when $H_0$ is true:
$$
P(H_1|H_0) \leqslant \alpha
$$

The probability of not making a Type II error is denoted $\beta$ and is called the power of the test.
$$
\beta = P(H_1|H_1) = 1 - P(H_0|H_1)
$$
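Both quantities can be estimated by simulation. The sketch below assumes a six-sided die, the $\chi^2$ statistic introduced later in these notes, $n = 72$ rolls, $\alpha = 0.05$, and one particular loaded die as the alternative; all of these are illustrative choices.

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)
n, r, alpha = 72, 6, 0.05         # illustrative values
n_sim = 20000                     # number of simulated experiments
c = sts.chi2(r - 1).isf(alpha)    # critical value of the test

p_null = np.full(r, 1 / r)                              # H0: fair die
p_alt = np.array([0.3, 0.14, 0.14, 0.14, 0.14, 0.14])   # hypothetical loaded die

def rejection_rate(p_true):
    """Share of simulated samples for which the rule T(X) > c rejects H0."""
    counts = rng.multinomial(n, p_true, size=n_sim)
    T = np.sum((counts - n * p_null) ** 2 / (n * p_null), axis=1)
    return np.mean(T > c)

type_1_rate = rejection_rate(p_null)  # estimates P(H1 | H0), should be close to alpha
power = rejection_rate(p_alt)         # estimates P(H1 | H1) for this alternative
print(type_1_rate, power)
```

The power depends on the alternative: the closer the loaded die is to a fair one, the lower the rejection rate.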

<img src="confidence_levels.png" width=800px>

<img src="error_types.png" width=800px>

Now, let's assume that we have performed two tests for two different null hypotheses, and neither null hypothesis was rejected. How do we determine which hypothesis describes the data better?

<img src="p-values.png" width=800px>
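A p-value is the probability, under $H_0$, of a statistic at least as extreme as the observed one. A minimal sketch, assuming the statistic follows a $\chi^2$ distribution with 5 degrees of freedom under $H_0$ (an illustrative choice), shows that comparing the p-value with $\alpha$ gives the same decision as comparing $T$ with $c$:

```python
import scipy.stats as sts

alpha = 0.01
null_dist = sts.chi2(5)      # assumed distribution of T under H0
c = null_dist.isf(alpha)     # critical value: P(T > c | H0) = alpha

for T in [2.0, 8.0, 14.0, 20.0]:   # hypothetical observed statistics
    pvalue = null_dist.sf(T)       # P(T' > T | H0)
    # both comparisons lead to the same decision
    print(T, round(pvalue, 4), T > c, pvalue < alpha)
```

Unlike the binary decision, the p-value is a number on a common scale, so it can be compared across tests.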

## Goodness-of-fit tests

Let's focus on the following null hypothesis:
$$
H_0: X \sim p(\cdot)
$$

It's called a goodness-of-fit hypothesis. There are many goodness-of-fit tests:
- Kolmogorov–Smirnov test
- Chi-squared test
- etc

For a simple null hypothesis, these tests are non-parametric. Some of them also have a parametric variant for composite hypotheses.

A test is fully defined by its statistic.

## Non-parametric $\chi^2$-test

Suppose that we have a sample $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ of real numbers. Let's split the real line $\mathbb{R}$ into $r$ intervals $\Delta_i$ and count how many observations fall into each interval:
$$
\nu_i = \sum_k \mathbb{I}\rm{nd}\{x_k \in \Delta_i\}
$$

Let's make a goodness-of-fit test. We suppose that our sample is drawn from a distribution with density $p(\cdot)$:
$$
H_0: X \sim p(\cdot)
$$

Now we can compute the probability of an observation falling into each interval under the null hypothesis:
$$
p_i = \int_{\Delta_i} p(x) \rm{d} x
$$

Let's define $\chi^2$-test statistic as
$$
T(X) = \sum_{i=1}^r \frac{\left(\nu_i - np_i\right)^2}{np_i}
$$
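The construction above can be sketched in a few lines. The helper name `chi2_statistic`, the cut points, and the synthetic standard normal sample are assumptions made for illustration; the intervals are taken as $(-\infty, c_1], (c_1, c_2], \ldots, (c_{r-1}, +\infty)$.

```python
import numpy as np
import scipy.stats as sts

def chi2_statistic(sample, cuts, cdf):
    """Pearson statistic for the intervals (-inf, cuts[0]], ..., (cuts[-1], +inf)."""
    n = len(sample)
    # nu_i: number of observations in each interval
    nu = np.bincount(np.searchsorted(cuts, sample), minlength=len(cuts) + 1)
    # p_i: probability of each interval under H0, from differences of the CDF
    p = np.diff(np.concatenate(([0.0], cdf(cuts), [1.0])))
    return np.sum((nu - n * p) ** 2 / (n * p)), nu, p

rng = np.random.default_rng(0)
sample = rng.normal(size=200)            # synthetic data
cuts = np.array([-1.0, 0.0, 1.0])        # r = 4 intervals
T, nu, p = chi2_statistic(sample, cuts, sts.norm.cdf)   # H0: standard normal
print(T, nu, p)
```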


The claim is that, under $H_0$,
$$
T(X) \to^d_{n\to\infty} \chi^2(r-1)
$$
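The limit can be checked empirically. The sketch below assumes $r = 4$ equiprobable intervals and $n = 500$ (illustrative values), generates many samples under $H_0$, and compares the simulated statistic with $\chi^2(r-1)$:

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)
n, r, n_sim = 500, 4, 20000          # illustrative values
p = np.full(r, 1 / r)                # interval probabilities under H0

counts = rng.multinomial(n, p, size=n_sim)           # interval counts generated under H0
T = np.sum((counts - n * p) ** 2 / (n * p), axis=1)

limit = sts.chi2(r - 1)
print(T.mean(), limit.mean())                    # both close to r - 1
print(T.var(), limit.var())                      # both close to 2 (r - 1)
print(np.quantile(T, 0.95), limit.ppf(0.95))     # upper quantiles should roughly agree
```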

## Example: dice

Let's assume that we rolled a die $n=72$ times and obtained the following counts $\nu_i$:

| 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- |
| 9 | 20 | 14 | 8 | 11 | 10 |

Let's test whether the die is symmetric (fair) at significance level $\alpha = 0.01$. What is the null hypothesis?

$$
H_0: p_i = p = \frac16, \forall i
$$

## Example: dice

- $n = 72$
- $\alpha = 0.01$
- $r = 6$

$$
\begin{aligned}
T & = \sum_{i=1}^6 \frac{\left(\nu_i - n p_i\right)^2}{np_i} = \\
& = \frac{(9-12)^2 + (20-12)^2 + (14-12)^2 + (8-12)^2 + (11-12)^2 + (10-12)^2}{12} = \frac{98}{12} \approx 8.17
\end{aligned}
$$

- $T(X) \sim \chi^2(r-1)$
- $c_{0.01}^{\chi^2(5)} = 15.086$ (believe me for now)

In which part of the hypothesis "space" are we?

$$
T(X) < c_{\alpha} \Longrightarrow \text{we do not reject } H_0
$$

## Code

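```python
import numpy as np
import scipy.stats as sts
```

```python
# data: observed counts for faces 1..6
X = np.array([9, 20, 14, 8, 11, 10])
n = X.sum()
r = len(X)

# H0: every face has probability 1/6
p = 1/6

# significance level
alpha = 0.01

# critical value: upper alpha-quantile of chi2(r - 1)
chisq = sts.chi2(r-1)
c = chisq.isf(alpha)

# test statistic
T = np.sum((X - n * p) ** 2 / (n * p))

# p-value: P(chi2(r - 1) > T)
pvalue = chisq.sf(T)

# the two decision rules are equivalent
print("Rejected" if T > c else "Not rejected")
print("Rejected" if pvalue < alpha else "Not rejected")
```

```
Not rejected
Not rejected
```

```python
# the same test with SciPy's built-in function (equal expected frequencies by default)
result = sts.chisquare(X)
print("Rejected" if result.pvalue < alpha else "Not rejected")
```

```
Not rejected
```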


## Parametric $\chi^2$-test

Same formulation: suppose that we have a sample $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ of real numbers. Let's split the real line $\mathbb{R}$ into $r$ intervals $\Delta_i$ and count how many observations fall into each interval:
$$
\nu_i = \sum_k \mathbb{I}\rm{nd}\{x_k \in \Delta_i\}
$$

Let's make a goodness-of-fit test. We suppose that our sample is drawn from a distribution with density $p(\cdot)$, which is defined up to a family:
$$
H_0: X \sim p(\cdot) \in \mathcal{F} = \{p(\cdot, \theta), \theta \in \Theta \subseteq \mathbb{R}^s\}
$$

The test statistic is defined as
$$
T(X) = \sum_{i=1}^r \frac{\left(\nu_i - np_i(\theta)\right)^2}{np_i(\theta)}
$$

The claim is:
$$
T(X) \to^d_{n\to\infty} \chi^2(r-s-1)
$$

But where do we get $\theta$ from?

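We can take the MLE computed from the grouped data, i.e. maximize $\sum_{i=1}^r \nu_i \ln p_i(\theta)$. This results (prove!) in the system of equations:

$$
\sum_{i=1}^r \frac{\nu_i}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial \theta_k} = 0, \; k = 1, \ldots, s
$$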
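When these equations have no convenient closed-form solution, the grouped log-likelihood $\sum_i \nu_i \ln p_i(\theta)$ can be maximized numerically instead. The sketch below is one possible approach, assuming a normal family, synthetic data, and hypothetical cut points; the helper names are illustrative.

```python
import numpy as np
import scipy.stats as sts
from scipy.optimize import minimize

def cell_probs(theta, cuts):
    """Interval probabilities p_i(theta) for N(mu, sigma^2), theta = (mu, log sigma)."""
    mu, log_sigma = theta
    cdf = sts.norm.cdf(cuts, loc=mu, scale=np.exp(log_sigma))
    return np.diff(np.concatenate(([0.0], cdf, [1.0])))

def neg_log_lik(theta, nu, cuts):
    """Negative grouped log-likelihood: -sum_i nu_i * log p_i(theta)."""
    p = np.clip(cell_probs(theta, cuts), 1e-300, None)  # guard against log(0)
    return -np.sum(nu * np.log(p))

# synthetic sample with arbitrary "true" parameters mu = 1, sigma = 2
rng = np.random.default_rng(1)
sample = rng.normal(loc=1.0, scale=2.0, size=5000)
cuts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
nu = np.bincount(np.searchsorted(cuts, sample), minlength=len(cuts) + 1)

# start from the ungrouped estimates and refine on the grouped likelihood
x0 = [sample.mean(), np.log(sample.std())]
res = minimize(neg_log_lik, x0=x0, args=(nu, cuts), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```

Optimizing over $\ln \sigma$ rather than $\sigma$ keeps the scale positive without explicit constraints.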

## Example

Let's assume that we have the following data $\nu_i$:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 8 | 6 | 12 | 14 | 18 | 11 | 6 | 13 | 7 |

Let's test two hypotheses:
1. $X \sim DU\{0,\ldots,9\}$
2. $X \sim \mathcal{N}(\cdot, \cdot)$

## Example: case 1

Let's apply the non-parametric $\chi^2$-test.

- $n = 100$
- $\alpha = 0.01$
- $r = 10$
- $p_i = p = 0.1$

$$
\begin{aligned}
T & = \sum_{i=0}^{9} \frac{\left(\nu_i - n p_i\right)^2}{np_i} = \\
& = \frac{(5-10)^2 + (8-10)^2 + \ldots + (7-10)^2}{10} = \frac{164}{10} = 16.4
\end{aligned}
$$

- $T(X) \sim \chi^2(r-1)$
- $p\rm{-value} = 0.059$ (not rejected)
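The numbers for case 1 can be reproduced with SciPy. In this sketch the counts are assumed to sum to the stated $n = 100$:

```python
import numpy as np
import scipy.stats as sts

# counts for the values 0..9, assumed to sum to the stated n = 100
nu = np.array([5, 8, 6, 12, 14, 18, 11, 6, 13, 7])
alpha = 0.01

# with no expected frequencies given, chisquare tests against equal ones
result = sts.chisquare(nu)
print(result.statistic, result.pvalue)
print("Rejected" if result.pvalue < alpha else "Not rejected")
```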

## Example: case 2

Let's apply the parametric $\chi^2$-test. Our unknown parameters are $\theta = (\mu, \sigma^2)$, so $s = 2$. We will find the optimal parameters by solving:
$$
\sum_{i=1}^r \frac{\nu_i}{p_i(\theta)} \frac{\partial p_i(\theta)}{\partial \theta_k} = 0, \; k = 1, \ldots, s
$$

Here,
$$
p_i(\theta) = \int_{\Delta_i} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(x - \mu)^2}{2 \sigma^2} \right) \rm{d} x = \int_{\Delta_i} f(x) \rm{d} x
$$

After a bit of lengthy calculation, we will obtain
$$
\begin{cases}
\hat{\mu} & = \frac1n \sum_{i=0}^{r-1} i \nu_i, \\
\hat{\sigma}^2 & = \frac1n \sum_{i=0}^{r-1} \nu_i (i - \hat{\mu})^2.
\end{cases}
$$
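These two estimates are easy to evaluate directly. In this sketch the index $i$ runs over the observed values $0, \ldots, 9$, and the counts are assumed to sum to the stated $n = 100$:

```python
import numpy as np

values = np.arange(10)                                # the values 0..9
nu = np.array([5, 8, 6, 12, 14, 18, 11, 6, 13, 7])    # counts, assumed to sum to n = 100
n = nu.sum()

mu_hat = np.sum(values * nu) / n                        # weighted mean of the values
sigma2_hat = np.sum(nu * (values - mu_hat) ** 2) / n    # weighted (biased) variance
print(round(mu_hat, 2), round(sigma2_hat, 2))
```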

## Example: case 2

Calculation yields $\hat{\mu} = 4.77, \hat{\sigma}^2 = 6.28$. Using these values and the intervals $(-\infty, 0.5], (0.5, 1.5], \ldots, (7.5, 8.5], (8.5, +\infty)$, we obtain:

| $p_0$ | $p_1$ | $p_2$ | $p_3$ | $p_4$ | $p_5$ | $p_6$ | $p_7$ | $p_8$ | $p_9$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.058 | 0.062 | 0.098 | 0.133 | 0.154 | 0.154 | 0.131 | 0.096 | 0.059 | 0.055 |

Then,
$$
T(X) = \sum_{i=1}^r \frac{\left(\nu_i - np_i(\theta)\right)^2}{np_i(\theta)} = 13.13
$$

- $T(X) \sim \chi^2(r-s-1)$
- $p\rm{-value} = 0.069$ (not rejected)
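The p-value follows from the reported statistic and the reduced number of degrees of freedom. A minimal sketch, taking $T = 13.13$ as given rather than recomputing it:

```python
import scipy.stats as sts

r, s = 10, 2          # number of intervals and of estimated parameters
T = 13.13             # value of the statistic reported above
df = r - s - 1        # degrees of freedom of the limiting distribution

pvalue = sts.chi2(df).sf(T)
print(df, round(pvalue, 3))
```

Each estimated parameter removes one degree of freedom, so the same value of $T$ yields a smaller p-value than it would under $\chi^2(r-1)$.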

## Example: comparison

- The non-parametric $\chi^2$-test gave us $p\rm{-value} = p_1 = 0.059$, and $H_0: X \sim DU\{0,\ldots,9\}$ was not rejected at significance level $\alpha = 0.01$
- The parametric $\chi^2$-test gave us $p\rm{-value} = p_2 = 0.069$, and $H_0: X \sim \mathcal{N}(\cdot, \cdot)$ was not rejected at significance level $\alpha = 0.01$

Which hypothesis does the data support more?

$p_2 > p_1$, therefore the data is closer to the second hypothesis $H_0: X \sim \mathcal{N}(\cdot, \cdot)$.