# https://github.com/wiso/StatisticsTutorialATLASItalia17

### Other resources

* PhD school lectures https://github.com/wiso/StatisticsLectures
* [Kyle Cranmer](http://orcid.org/0000-0002-5769-7094) [lectures](https://indico.cern.ch/event/117033/other-view?view=standard) and [proceedings](https://cds.cern.ch/record/2004587/files/arXiv:1503.07622.pdf) at 2011 ESHEP (see page 3 for many books)
* Kyle Cranmer [lectures](https://indico.cern.ch/event/243641/) for summer students in 2013
* Glen Cowan [Statistical Data Analysis for Particle Physics](http://www.pp.rhul.ac.uk/~cowan/stat_aachen.html) and [other](http://www.pp.rhul.ac.uk/~cowan/) lectures
* [Luca Lista](http://people.na.infn.it/~lista/Statistics/) with RooStats examples
* [Asymptotic formulae for likelihood-based tests of new physics](https://arxiv.org/pdf/1007.1727v3.pdf)

<strong>Statistic</strong>: a function of the data (the mean, the number of observed events, ...)

<strong>p-value</strong>: the probability of obtaining a result equal to or "more extreme" than what was actually observed, computed under the hypothesis being tested.
What does "more extreme" mean? It depends on the question we want to answer.
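
As a minimal sketch of the definition, assume a counting experiment: the statistic is the number of observed events, the tested hypothesis predicts a Poisson mean `b`, and "more extreme" is taken to mean "at least as many events as observed". The values of `b` and `n_obs` are invented for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k events."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def p_value_excess(n_obs, b):
    """P(n >= n_obs | b): one-sided p-value for an excess over the expectation."""
    return 1.0 - sum(poisson_pmf(k, b) for k in range(n_obs))

b = 3.0      # hypothetical expected number of events
n_obs = 8    # hypothetical observed number of events
p = p_value_excess(n_obs, b)
print(f"p-value = {p:.4f}")
```

If the question were instead "is there a deficit?", the sum would run over `n <= n_obs`, which gives a different p-value for the same data.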

<strong>Likelihood</strong>: $\mathcal{L}(\theta) = P(\text{data}|\theta)$

## Hypothesis testing

A statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables

A statistical hypothesis test is a method of statistical inference

The goal of hypothesis testing is to determine if the null ($H_0$) hypothesis can be
rejected. A statistical test can either reject (prove false) or fail to reject (fail to
prove false) a null hypothesis, but never prove it true (i.e., failing to reject a null
hypothesis does not prove it true).

## Normal significance

Usually p-values are translated into a friendlier normal significance (number of sigmas) $z$. With the one-tail definition this is the value (quantile) of a standard Gaussian corresponding to a given p-value:

$$ \int_z^\infty N[x| 0, 1] dx = \text{p-value}$$

or $\int_z^\infty N[x|0,1]\,dx + \int_{-\infty}^{-z} N[x|0,1]\,dx = \text{p-value}$ for the two-tail definition.

Taking into account the definition of the cumulative distribution function of a normal distribution, $\Phi(z) = \int_{-\infty}^{z} N[x| 0, 1] dx$, the one-tail definition can be written as:

$$ z = \Phi^{-1}(1 - \text{p-value})$$

or, taking into account the definition of the survival function $\text{SF}(z) = 1 - \Phi(z)$:

$$ z = \text{SF}^{-1}(\text{p-value})$$
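
A minimal sketch of the one-tail conversion, using only the Python standard library (`statistics.NormalDist`). It relies on the symmetry of the Gaussian, $\text{SF}^{-1}(p) = -\Phi^{-1}(p)$, which avoids computing $1 - p$ for very small p-values. The helper names `p_to_z` and `z_to_p` are chosen here for illustration; `scipy.stats.norm.isf` and `scipy.stats.norm.sf` provide the same operations.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def p_to_z(p_value):
    """One-tail significance: z = SF^{-1}(p) = -Phi^{-1}(p) by symmetry."""
    return -STD_NORMAL.inv_cdf(p_value)

def z_to_p(z):
    """One-tail p-value: p = SF(z) = Phi(-z)."""
    return STD_NORMAL.cdf(-z)

for z in (1.0, 3.0, 5.0):
    p = z_to_p(z)
    print(f"z = {z:.1f} -> p-value = {p:.3e} -> z = {p_to_z(p):.3f}")
```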


## Error type

   * Type I error: false positive (excessive credulity). Rate = $\alpha = P(\text{reject } H_0|H_0 \text{ is true})$
   * Type II error: false negative (excessive skepticism). Rate = $\beta = P(\text{don't reject } H_0|H_0 \text{ is false})$
   
   * Power of the test: $1-\beta$
   * Remember that "positive"/"negative" refers to $H_1$, the alternative hypothesis: a "positive" is a decision in favour of $H_1$ (rejection of $H_0$)
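
The rates above can be made concrete with a toy case. This sketch assumes a single observable $x$ that is Gaussian with unit width, centred at 0 under $H_0$ and at 2 under a simple alternative $H_1$, and a test that rejects $H_0$ when $x$ exceeds a cut. All numbers are invented for illustration.

```python
from statistics import NormalDist

h0 = NormalDist(mu=0.0, sigma=1.0)   # distribution of x if H0 is true
h1 = NormalDist(mu=2.0, sigma=1.0)   # distribution of x if H1 is true (hypothetical)
x_cut = 1.645                        # reject H0 when x > x_cut

alpha = 1.0 - h0.cdf(x_cut)          # type I rate: P(reject H0 | H0 true)
beta = h1.cdf(x_cut)                 # type II rate: P(don't reject H0 | H1 true)
power = 1.0 - beta

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Moving `x_cut` to higher values lowers $\alpha$ but raises $\beta$: for a fixed experiment the two error rates trade off against each other.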

## Hypothesis testing in practice

   * Define the null hypothesis you want to try to reject
   * Define the observables (number of events in the signal region, ...)
   * Fix the type I error rate $\alpha$ of the test (5%, $5\sigma$, ...)
   * Define the test statistic (trying to maximize the power $1-\beta$)
   * Find the rejection region in the observable space, which is the region where $H_0$ is rejected (p-value $<\alpha$)
   * Do the experiment
   * If the outcome falls in the rejection region (i.e. outside the acceptance region), reject the null hypothesis
   
In complex cases one does not compute the acceptance region explicitly, but just computes the observed p-value and compares it with $\alpha$.
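
The steps above can be sketched for a simple counting experiment. The assumptions are: the observable is the event count $n$, Poisson-distributed with mean `b` under $H_0$ and `s + b` under $H_1$, and the test statistic is $n$ itself. The yields and the observed count are invented.

```python
import math

def poisson_sf(n_min, mean):
    """P(n >= n_min) for a Poisson variable with the given mean."""
    cdf = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(n_min))
    return 1.0 - cdf

b, s = 3.0, 5.0        # hypothetical background and signal yields
alpha = 0.05           # chosen type I error rate

# Rejection region: n >= n_cut, with n_cut the smallest count whose p-value is <= alpha
n_cut = 0
while poisson_sf(n_cut, b) > alpha:
    n_cut += 1

size = poisson_sf(n_cut, b)        # actual type I rate (<= alpha because n is discrete)
power = poisson_sf(n_cut, s + b)   # P(reject H0 | H1 true)

n_obs = 8              # hypothetical outcome of the experiment
reject = n_obs >= n_cut
print(f"n_cut = {n_cut}, size = {size:.4f}, power = {power:.4f}, reject H0: {reject}")
```

Because the observable is discrete, the actual size of the test is in general smaller than the requested $\alpha$.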

## Neyman–Pearson lemma
Having two simple hypotheses (no additional parameters) $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$, the likelihood-ratio test:
$$
\Lambda(x) = \frac{L(\theta_0|x)}{L(\theta_1|x)}
$$

which rejects $H_0$ in favour of $H_1$ when $\Lambda \leq k_\alpha$ (rejection region) with $\alpha=P(\Lambda(X)\leq k_\alpha|H_0)$ is the most powerful test with size $\alpha$.
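
As a worked example (not part of the original notes), take a counting experiment where the number of events $n$ is Poisson-distributed with mean $b$ under $H_0$ and $s+b$ under $H_1$, with $s>0$ and both $s$ and $b$ known:

$$
\Lambda(n) = \frac{e^{-b}\, b^n / n!}{e^{-(s+b)}\, (s+b)^n / n!} = e^{s} \left(\frac{b}{s+b}\right)^n
$$

Since $b/(s+b) < 1$, $\Lambda$ decreases monotonically with $n$, so the condition $\Lambda \leq k_\alpha$ is equivalent to $n \geq n_\alpha$ for some threshold $n_\alpha$. In this simple case the most powerful test is just a cut on the number of observed events.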

## Profile likelihood ratio

Suppose you have composite hypotheses such as

$$H_0: \theta\in\Theta_0 \quad H_1: \theta\in\Theta_0^C$$

where $\Theta_0 = \{s=0, \ldots\}$ and $\Theta_0^C = \{s\neq 0, \ldots\}$. The likelihood ratio is defined as

$$\lambda(x) = \frac{\sup_{\theta\in\Theta_0}{L(\theta|x)}}{\sup_{\theta\in\Theta}{L(\theta|x)}}$$

with $\Theta_0 \subset \Theta$, where $\Theta = \Theta_0 \cup \Theta_0^C$ is the full parameter space. For example, for $H_0: s=0$, $H_1: s\neq 0$:

$$\lambda(x) = \frac{L(s=0, \hat{\hat\theta}(0)|x)}{L(\hat{s}, \hat\theta|x)}$$

where $\hat{\hat\theta}(0)$ is the value of $\theta$ which maximizes the likelihood for $s=0$ (conditional likelihood), while $\hat{s}$ and $\hat{\theta}$ are the values that maximize the likelihood without any constraints (unconditional likelihood).

<small>It varies between 0 and 1; low values mean that the observed result is less likely to occur under the null hypothesis than under the alternative.

The profile likelihood ratio is a nearly optimal test statistic.</small>

<small>It is important to have an analytical expression $f_q$ for the distribution of the test statistic $q$ in order to compute the p-value: $\text{p-value} = \int_{q^\text{obs}}^{\infty} f_q(q) dq$. Otherwise toys (pseudo-experiments) must be run.</small>
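
A minimal sketch of a profile likelihood ratio with one nuisance parameter. The model is an assumption made here for illustration: a signal region with $n \sim \text{Poisson}(s + b)$ and a control region with $m \sim \text{Poisson}(\tau b)$, where the background $b$ plays the role of $\theta$. For this model both maximizations have closed forms, so no numerical minimizer is needed; the counts and $\tau$ are invented.

```python
import math

def log_likelihood(s, b, n, m, tau):
    """log L(s, b) for n ~ Poisson(s + b), m ~ Poisson(tau * b), constant terms dropped."""
    return n * math.log(s + b) - (s + b) + m * math.log(tau * b) - tau * b

n, m, tau = 20, 30, 3.0    # hypothetical signal-region count, control-region count, scale

# Unconditional maximum: both s and b free
b_hat = m / tau
s_hat = n - b_hat

# Conditional maximum: s fixed to 0, only b free
b_hat_hat = (n + m) / (1.0 + tau)

log_lambda = (log_likelihood(0.0, b_hat_hat, n, m, tau)
              - log_likelihood(s_hat, b_hat, n, m, tau))
lam = math.exp(log_lambda)
t = -2.0 * log_lambda
print(f"s_hat = {s_hat:.2f}, b_hat = {b_hat:.2f}, b_hat_hat = {b_hat_hat:.2f}")
print(f"lambda = {lam:.4f}, -2 log(lambda) = {t:.3f}")
```

Note that the conditional estimate of $b$ differs from the unconditional one: with $s$ forced to 0 the background has to absorb part of the excess in the signal region.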

## Wilks's theorem

The quantity $t=-2\log(\lambda)$ is asymptotically (in the large-sample limit) distributed as a $\chi^2$ distribution with $n=\text{dim}(\Theta)-\text{dim}(\Theta_0)$ degrees of freedom when $H_0$ is true.
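
The theorem can be checked with toys. This sketch assumes a model chosen only because its likelihood ratio has a closed form: `n_events` values drawn from an exponential distribution, testing $H_0$: rate $=$ `rate0` against a free rate, so that $n = 1 - 0 = 1$ degree of freedom. The fraction of toys above the 95% quantile of a $\chi^2$ with one degree of freedom should then be close to 5%.

```python
import math
import random

def t_stat(sample, rate0):
    """-2 log(lambda) for H0: rate = rate0, exponential model, MLE rate = 1/mean."""
    n = len(sample)
    u = rate0 * sum(sample) / n          # rate0 / rate_hat
    return 2.0 * n * (u - 1.0 - math.log(u))

def chi2_sf_1dof(t):
    """Survival function of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))

random.seed(0)
rate0, n_events, n_toys = 2.0, 50, 20000   # hypothetical settings
threshold = 3.841                          # chi2 (1 dof) quantile for p = 0.05

# Generate toys under H0 and evaluate the test statistic on each
toys = [t_stat([random.expovariate(rate0) for _ in range(n_events)], rate0)
        for _ in range(n_toys)]
fraction = sum(t > threshold for t in toys) / n_toys
print(f"toys above threshold: {fraction:.4f}, chi2 prediction: {chi2_sf_1dof(threshold):.4f}")
```

Lowering `n_events` to a handful of events is a simple way to see where the asymptotic approximation starts to degrade.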

## Test statistics $t_\mu$ and $\tilde{t}_\mu$

$$ t_\mu = -2\log\lambda (\mu) = -2\log \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})}$$

High values mean incompatibility with the data. If we want to test a specific $\mu$ we can compute the p-value $= \int_{t_{\mu, obs}}^\infty f(t_\mu|\mu) dt_\mu$. Values of $\mu$ can be excluded because they are too low or too high.



Usually we can assume that $\mu\geq 0$, so we need a new test statistic:

$$ \tilde{t}_\mu = -2\log \tilde\lambda(\mu)$$

$$ \tilde\lambda(\mu) = \begin{cases} 
      \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(0, \hat{\hat{\theta}}(0))} & \hat{\mu} < 0 \\
      \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})} & \hat{\mu} \geq 0
  \end{cases}$$
  
Also in this case, values can be excluded because they are too low, or too high.

## $q_0$ statistic for discovery of a positive signal

For discovery we want to exclude the hypothesis $s=0$ (background-only), assuming $\mu\geq 0$. Defining $q_0 = \tilde{t}_0$:

$$ q_0 = 
\begin{cases} 
-2\log\lambda(0)\qquad &\hat \mu\geq 0 \\
0\qquad &\hat \mu < 0
\end{cases}
$$

If $\hat \mu<0$ we are observing fewer events than predicted by the background-only model. Since we truncate the definition of the test statistic, downward fluctuations are not counted as discrepancies with the model. A high value of $\hat\mu$ means a high value of $q_0$ and a large discrepancy with the background-only model.
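
A minimal sketch of $q_0$ for the simplest possible model, assumed here for illustration: a single counting experiment with $n \sim \text{Poisson}(\mu s + b)$ and $b$ known exactly (no nuisance parameters), so that $\hat\mu = (n - b)/s$ and $q_0 = 2\,(n\log(n/b) - (n-b))$ for $n > b$. The significance uses the asymptotic result $p_0 = 1 - \Phi(\sqrt{q_0})$ from the asymptotic-formulae paper listed in the resources. The numbers are invented.

```python
import math

def q0_counting(n_obs, b):
    """q0 for n ~ Poisson(mu*s + b) with b known; mu_hat = (n_obs - b)/s."""
    if n_obs <= b:           # mu_hat <= 0: no excess, q0 = 0
        return 0.0
    return 2.0 * (n_obs * math.log(n_obs / b) - (n_obs - b))

b, n_obs = 3.0, 8            # hypothetical background yield and observed count
q0 = q0_counting(n_obs, b)
z = math.sqrt(q0)                            # asymptotic significance
p0 = 0.5 * math.erfc(z / math.sqrt(2.0))     # asymptotic p-value, 1 - Phi(z)
print(f"q0 = {q0:.3f}, z = {z:.3f}, p0 = {p0:.4f}")
```

With counts this small the asymptotic formula is only a rough approximation; it becomes more accurate as the expected number of events grows.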

## $q_\mu$ statistic for exclusion
Suppose we want to set an upper limit: the null hypothesis we try to exclude is then the signal+background hypothesis, with $\mu$ as signal-strength multiplier.

$$
q_\mu=\begin{cases}
-2\log\lambda(\mu)\qquad & \hat\mu \leq \mu\\
0 \qquad & \hat\mu > \mu
\end{cases}
$$

We set $q_\mu=0$ when the best-fit value $\hat\mu$ is greater than the value of $\mu$ being tested, since we don't want such outcomes to enter the rejection region when setting an upper limit: an upward fluctuation should not count as bad agreement with the data.
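
A minimal sketch of an upper limit based on $q_\mu$, under assumptions made here for illustration: a single counting experiment with $n \sim \text{Poisson}(\mu s + b)$, $s$ and $b$ known exactly, the asymptotic p-value $p_\mu = 1 - \Phi(\sqrt{q_\mu})$, and a plain bisection to find the $\mu$ at which $p_\mu$ crosses $\alpha$. The function names and numbers are hypothetical.

```python
import math

def q_mu_counting(mu, n_obs, s, b):
    """q_mu for n ~ Poisson(mu*s + b) with s and b known; mu_hat = (n_obs - b)/s."""
    expected = mu * s + b
    if n_obs > expected:     # mu_hat > mu: upward fluctuation, q_mu = 0
        return 0.0
    log_term = n_obs * math.log(n_obs / expected) if n_obs > 0 else 0.0
    return 2.0 * (expected - n_obs + log_term)

def p_mu_asymptotic(mu, n_obs, s, b):
    """Asymptotic p-value p_mu = 1 - Phi(sqrt(q_mu))."""
    return 0.5 * math.erfc(math.sqrt(q_mu_counting(mu, n_obs, s, b) / 2.0))

def upper_limit(n_obs, s, b, alpha=0.05, mu_max=100.0):
    """Bisect for the mu where p_mu crosses alpha (assumes the crossing is below mu_max)."""
    lo, hi = max(0.0, (n_obs - b) / s), mu_max
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if p_mu_asymptotic(mid, n_obs, s, b) > alpha:
            lo = mid         # mid is not excluded yet, move up
        else:
            hi = mid         # mid is excluded, move down
    return 0.5 * (lo + hi)

n_obs, s, b = 3, 5.0, 3.0    # hypothetical observed count, signal and background yields
mu_up = upper_limit(n_obs, s, b)
print(f"95% CL upper limit on mu: {mu_up:.3f}")
```

Values of $\mu$ above `mu_up` have $p_\mu < \alpha$ and are excluded; values below $\hat\mu$ always give $q_\mu = 0$ and can never be excluded by this statistic.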