# Fundamentals of Statistics
In practice, researchers have to find methods to choose among distributions and to estimate distribution parameters from real data. The subject of sampling brings us to the theory of statistics. Whereas probability assumes the distributions are known, statistics attempts to make inferences from actual data.

We sample from the distribution of a population, say the return on a stock market index, to make inferences about the population. Issues of interest are the choices of the best distribution and of the best parameters. In addition, risk measurement deals with large numbers of random variables. As a result, we also need to characterize the relationships between risk factors.

In this notebook, we will talk about two important problems in statistical inference: **estimation** and **tests of hypotheses**. With estimation, we wish to estimate the value of an unknown parameter from sample data. With tests of hypotheses, we wish to verify a conjecture about the data.

- [Parameter Estimation](#parameter_estimation)

## <a name="parameter_estimation">Parameter Estimation</a>
### Parameters
The first step in risk measurement is to define the risk factors. These can be movements in stock prices, interest rates, exchange rates, or commodity prices. The next step is to measure their distribution. This usually involves choosing a particular distribution function and then estimating parameters. For instance, define $X$ as the random variable of interest. We observe a sequence of $T$ realized values for $x$: $x_{1}, x_{2}, ..., x_{T}$.

As an example, we could assume that the observed values for $x$ are drawn from a normal distribution

$x \sim  \Phi(\mu, \sigma)$

with mean $\mu$ and standard deviation $\sigma$. Generally, we also need to assume that the random variables are independent and identically distributed ($i.i.d.$). Estimation is still possible if this is not the case but requires additional steps, fitting a model to $x$ until the residuals are $i.i.d.$ Even in simple cases, the $i.i.d.$ assumption requires a basic transformation of the data. For example, $x$ should be the rate of change in the stock index, not its level $P$. We know that the level tomorrow cannot be far from the level today. What is random is whether the level will go up or down. So, the random variable should be the rate of change in the level.
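The transformation from levels to rates of change can be sketched in a few lines. The function name `simple_returns` and the index levels below are hypothetical and chosen only for illustration; simple (not logarithmic) rates of change are assumed.

```python
def simple_returns(prices):
    """Convert price levels P_t into rates of change P_t / P_{t-1} - 1."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

# Hypothetical index levels, for illustration only.
levels = [100.0, 102.0, 99.96, 101.9592]

# T levels give T - 1 rates of change.
returns = simple_returns(levels)
print(returns)
```

The levels drift slowly from one day to the next, while the rates of change alternate in sign, which is closer to what an $i.i.d.$ sample looks like.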

Armed with our $i.i.d.$ sample of $T$ observations, we can start estimating the parameters of interest, such as the mean, the variance, and other moments. Say that the random variable $X$ has a normal distribution with **parameters** $\mu$ and $\sigma^{2}$. These are unknown values that must be estimated. This approach can also be used to check whether the parametric distribution is appropriate. For instance, the normal distribution implies a value of three for the kurtosis coefficient. We can compute an estimate of the kurtosis for the sample at hand and test whether it is equal to three. If not, the assumption that the distribution is normal must be rejected and the risk manager must search for another distribution that fits the data better.

### Parameter estimators
The expected return, or mean, $\mu = E(X)$ can be estimated by the sample mean,

$m = \hat{\mu} = \frac{1}{T}\Sigma^{T}_{i=1}x_{i}$

The sample mean $m$ is an estimator, which is a function of the data. The particular value of this estimator for this sample is a **point estimate**.

Note that we assign the same weight of $1/T$ to all observations because they all have the same probability due to the $i.i.d.$ property. Other estimators are possible, however. For instance, the pattern of weights could be different, as long as they sum to $1$. A good estimator should have several properties.

- It should be **unbiased**, meaning that its expectation is equal to the parameter of interest; for example, $E[m] = \mu$. Otherwise, the estimator is biased.
- It should be **efficient**, which implies that it has the smallest standard deviation of all possible estimators; for example, $V[m - \mu]$ is lowest.

The sample mean, for example, satisfies both of these conditions. An estimator that is unbiased and efficient among all linear combinations of the data is said to be the **best linear unbiased estimator (BLUE)**.

A weaker condition is for an estimator to be **consistent**. This means that it converges to the true parameter as the sample size $T$ increases, or asymptotically. 
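The role of the weights can be checked with a little arithmetic. For $i.i.d.$ observations, any weighted average whose weights sum to $1$ is unbiased, and its variance is $\sigma^{2}\Sigma w_{i}^{2}$. The sketch below compares this factor for equal and unequal weights. The helper name `variance_factor` and the weight values are illustrative assumptions.

```python
def variance_factor(weights):
    """Return sum(w_i^2): the variance of a weighted mean in units of sigma^2."""
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must sum to 1 for the estimator to be unbiased")
    return sum(w * w for w in weights)

equal = [0.25, 0.25, 0.25, 0.25]   # the sample mean with T = 4
unequal = [0.4, 0.3, 0.2, 0.1]     # hypothetical alternative weights

factor_equal = variance_factor(equal)      # equals 1 / T
factor_unequal = variance_factor(unequal)  # larger, so this estimator is less efficient
print(factor_equal, factor_unequal)
```

Both estimators are unbiased, but the equal-weight version has the smaller variance, which is what efficiency among linear estimators means here.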

The variance, $\sigma^{2} = E[(X - \mu)^{2}]$ can be estimated by the sample variance

$s^{2} = \hat{\sigma}^{2} = \frac{1}{T - 1}\Sigma^{T}_{i=1}(x_{i} - \hat{\mu})^{2}$

Note that we divide by $T - 1$ instead of $T$. This is because we estimate the variance around an unknown parameter, the mean. So, we have fewer degrees of freedom than otherwise. As a result, we need to adjust $s^{2}$ to ensure that its expectation equals the true value, or that it is unbiased. In most situations, however, $T$ is large so this adjustment is minor.
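As a minimal sketch, the two estimators can be written directly from their formulas. The function names and the five sample values are hypothetical; the result is cross-checked against `statistics.variance` from the Python standard library, which also divides by $T - 1$.

```python
import statistics

def sample_mean(xs):
    """m = (1/T) * sum of x_i."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """s^2 = (1/(T-1)) * sum of (x_i - m)^2, the unbiased estimator."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical rates of change, for illustration only.
data = [0.01, -0.02, 0.03, 0.00, 0.02]

m = sample_mean(data)
s2 = sample_variance(data)
print(f"m = {m:.4f}, s^2 = {s2:.6f}")

# The standard library uses the same T - 1 divisor.
assert abs(s2 - statistics.variance(data)) < 1e-12
```

With only five observations the choice between $T$ and $T - 1$ changes the estimate by a factor of $5/4$; with hundreds of observations the difference becomes negligible.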

It's essential to note that these estimated values depend on the particular sample and, hence, have some inherent variability. The sample mean itself is distributed as 

$m = \hat{\mu} \sim N(\mu, \sigma^{2}/T)$

If the population distribution is normal, this exactly describes the distribution of the sample mean. Otherwise, the central limit theorem implies that this distribution is valid only asymptotically (i.e., for large samples). The standard deviation of this sampling distribution is the standard error of the sample mean:

$se(m) = \sigma\sqrt{\frac{1}{T}}$
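The standard error formula can be checked by simulation. The sketch below draws many normal samples of size $T$, computes the mean of each, and compares the dispersion of those means with $\sigma/\sqrt{T}$. The parameter values, the seed, and the function name `simulate_sample_means` are arbitrary assumptions.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, T, reps, seed=42):
    """Draw `reps` independent samples of size T and return their sample means."""
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sigma) for _ in range(T)) / T for _ in range(reps)]

mu, sigma, T = 0.0, 2.0, 25
means = simulate_sample_means(mu, sigma, T, reps=5000)

empirical_se = statistics.pstdev(means)   # dispersion of the simulated means
theoretical_se = sigma / math.sqrt(T)     # sigma * sqrt(1/T)
print(f"empirical se = {empirical_se:.4f}, theoretical se = {theoretical_se:.4f}")
```

The two numbers should be close but not identical, because the empirical value is itself an estimate based on a finite number of replications.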

For the distribution of the sample variance $\hat{\sigma}^{2}$, one can show that, when $X$ is normal, the following ratio is distributed as a chi-square with $T - 1$ degrees of freedom:

$\frac{(T - 1)\hat{\sigma}^{2}}{\sigma^{2}} \sim \chi^{2}(T - 1)$ 

If the sample size $T$ is large enough, the chi-square distribution converges to a normal distribution

$\hat{\sigma}^{2} \sim N(\sigma^{2}, \sigma^{4}\frac{2}{T - 1})$

Using the same approximation, the sample standard deviation has a normal distribution with a standard error of 

$se(\hat{\sigma}) = \sigma \sqrt{\frac{1}{2T}}$

Note that the precision of these estimators, or standard error ($se$), is proportional to $1/\sqrt{T}$. This is a typical result, which is due to the fact that the observations are independent of each other.
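The three standard errors above can be collected in one small helper to see how they scale with $T$. This is a sketch that assumes the true $\sigma$ is known; the function name `standard_errors` and the input values are hypothetical.

```python
import math

def standard_errors(sigma, T):
    """Standard errors of the sample mean, variance, and standard deviation."""
    return {
        "mean": sigma * math.sqrt(1.0 / T),
        "variance": sigma ** 2 * math.sqrt(2.0 / (T - 1)),
        "std": sigma * math.sqrt(1.0 / (2.0 * T)),
    }

se_small = standard_errors(sigma=2.0, T=50)
se_large = standard_errors(sigma=2.0, T=200)

for name in ("mean", "variance", "std"):
    print(f"{name}: T=50 -> {se_small[name]:.4f}, T=200 -> {se_large[name]:.4f}")
```

Quadrupling the sample size halves the standard error of the mean and of the standard deviation, in line with the $1/\sqrt{T}$ rate.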

We can use this information for **hypothesis testing**. For instance, we would like to detect a constant trend in $X$. Here, the **null hypothesis** is that $\mu = 0$. To answer the question, we use the distributional assumption and compute a standard normal variable as the ratio of the estimated mean to its standard error, or 

$z = \frac{m - 0}{\sigma/\sqrt{T}}$

Because this is now a standard normal variable, we would not expect to observe values far away from $0$ if the null hypothesis is true. We need to decide on a **significance level** for the test, which is $1$ minus the confidence level $c$. Typically, we would set $c = 95\%$, which translates into a two-tailed interval for $z_{c}$ of $[-1.96, +1.96]$. The significance level here is $5\%$.

Roughly, this means that, if the absolute value of $z$ is greater than $2$, we would reject the hypothesis that $m$ came from a distribution with a mean of $0$. We can have some confidence that the true $\mu$ is indeed different from $0$.

In fact, we do not know the true $\sigma$ and use the estimated $s$ instead. The distribution then becomes a Student's t with $T - 1$ degrees of freedom:

$t = \frac{m - 0}{s/\sqrt{T}}$

for which the cutoff values can be found from Student's tables. The quantile values for the interval are then $t_{c}$. For large values of $T$, however, this distribution is close to the normal.
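The test of $\mu = 0$ can be sketched as follows. The data and the function name `mean_test_statistic` are hypothetical. The Python standard library has no Student's t quantile function, so the cutoff below is the normal value from `statistics.NormalDist`, which is an approximation appropriate only for large $T$.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_test_statistic(xs, mu0=0.0):
    """Return (m - mu0) / (s / sqrt(T)) using the estimated s."""
    T = len(xs)
    m = mean(xs)
    s = stdev(xs)   # sample standard deviation, divides by T - 1
    return (m - mu0) / (s / math.sqrt(T))

# Hypothetical rates of change, for illustration only.
data = [0.01, -0.02, 0.03, 0.00, 0.02]

t_stat = mean_test_statistic(data)
z_c = NormalDist().inv_cdf(0.975)   # two-tailed cutoff at 95% confidence
reject = abs(t_stat) > z_c
print(f"t = {t_stat:.3f}, cutoff = {z_c:.3f}, reject null: {reject}")
```

For a sample this small, the Student's t cutoff is larger than $1.96$, so the conclusion of not rejecting the null hypothesis would be the same with the exact cutoff.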

These test statistics can be transformed into **confidence intervals**. These are random intervals that contain the true parameter with a fixed level of confidence 

$c = P[m - z_{c} \times se(m) \leq \mu \leq m + z_{c} \times se(m)]$

Say for instance that we want to determine a $95\%$ confidence interval that contains $\mu$. If $T$ is large, we can use the normal distribution, and the multiplier $z_{c}$ is $1.96$. The confidence interval for the mean is then 

$m \pm z_{c}se(m) = [m - 1.96 \times se(m), m + 1.96 \times se(m)]$

In this case, the confidence interval is symmetric because the distribution is normal. More generally, this interval could be asymmetric. For instance, the distribution of the sample variance is chi-square, which is asymmetric.
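A confidence interval for the mean can be computed along the same lines. The sketch below assumes $T$ is large enough for the normal multiplier to apply and uses the estimated $s$ in place of $\sigma$. The simulated sample, its parameters, and the function name `mean_confidence_interval` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, c=0.95):
    """Return (lower, upper) = m -/+ z_c * se(m), using the normal multiplier."""
    T = len(xs)
    m = mean(xs)
    se = stdev(xs) / math.sqrt(T)
    z_c = NormalDist().inv_cdf((1.0 + c) / 2.0)   # 1.96 for c = 0.95
    return m - z_c * se, m + z_c * se

# Simulated sample with arbitrary parameters, for illustration only.
rng = random.Random(7)
sample = [rng.gauss(0.05, 0.20) for _ in range(500)]

lower, upper = mean_confidence_interval(sample, c=0.95)
print(f"95% confidence interval for the mean: [{lower:.4f}, {upper:.4f}]")
```

Raising `c` to `0.99` widens the interval, because a higher confidence level requires a larger multiplier.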

### Choosing significance levels for tests
Hypothesis testing requires the choice of a significance level, which needs careful consideration. 

| Decision | Model Correct | Model Incorrect |
|----------|---------------|-----------------|
| Accept   | Okay          | Type 2 error    |
| Reject   | Type 1 error  | Okay            |

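A type $1$ error is rejecting a correct model; a type $2$ error is accepting an incorrect one. For a given test, increasing the confidence level (that is, lowering the significance level) will decrease the probability of a type $1$ error but increase the probability of a type $2$ error.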
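The trade-off between the two error types can be illustrated by simulation. The sketch below applies a two-tailed z-test of $\mu = 0$ with known $\sigma = 1$ to many simulated samples, once when the null is true and once when the true mean is $0.3$. All parameter values, the seed, and the function name `rejection_rate` are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def rejection_rate(true_mu, significance, T=50, reps=4000, seed=0):
    """Fraction of simulated samples in which the null mu = 0 is rejected."""
    rng = random.Random(seed)
    z_c = NormalDist().inv_cdf(1.0 - significance / 2.0)   # two-tailed cutoff
    rejections = 0
    for _ in range(reps):
        m = sum(rng.gauss(true_mu, 1.0) for _ in range(T)) / T
        z = m / (1.0 / math.sqrt(T))   # sigma = 1 is assumed known
        if abs(z) > z_c:
            rejections += 1
    return rejections / reps

# Null is true: every rejection is a type 1 error.
type1_5pct = rejection_rate(0.0, 0.05)
type1_1pct = rejection_rate(0.0, 0.01)

# Null is false (true mean 0.3): every non-rejection is a type 2 error.
type2_5pct = 1.0 - rejection_rate(0.3, 0.05)
type2_1pct = 1.0 - rejection_rate(0.3, 0.01)

print(f"type 1 error rate: 5% test {type1_5pct:.3f}, 1% test {type1_1pct:.3f}")
print(f"type 2 error rate: 5% test {type2_5pct:.3f}, 1% test {type2_1pct:.3f}")
```

The stricter test rejects less often in both scenarios, so it makes fewer type $1$ errors and more type $2$ errors.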

### Precision of estimates
When the sample size increases, the standard error of $\hat{\mu}$ shrinks at a rate proportional to $1/\sqrt{T}$. The precision of the estimate increases as the number of observations increases.

This result will prove useful to assess the precision of estimates generated from **numerical simulations**, which are widely used in risk management. Numerical simulations create independent random variables over a fixed number of replications $T$. If $T$ is too small, the final estimates will be imprecisely measured. If $T$ is very large, the estimates will be accurate. The standard error of the estimates shrinks at a rate proportional to $1/\sqrt{T}$, so quadrupling the number of replications only halves the error.
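The effect of the number of replications can be seen in a small experiment. The sketch below estimates the mean of a simulated normal variable for increasing $T$ and reports the estimated standard error each time. The distribution parameters, the seed, and the function name `simulate_estimate` are arbitrary assumptions.

```python
import math
import random
from statistics import fmean, stdev

def simulate_estimate(T, seed=1):
    """Return (estimate, standard error) of the mean from T simulated draws."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.05, 0.20) for _ in range(T)]
    return fmean(draws), stdev(draws) / math.sqrt(T)

# Each step multiplies the number of replications by four.
results = {T: simulate_estimate(T) for T in (1000, 4000, 16000)}

for T, (estimate, se) in results.items():
    print(f"T = {T:6d}: estimate = {estimate:.4f}, se = {se:.4f}")
```

Each fourfold increase in $T$ cuts the standard error roughly in half, which is why high precision from simulation can be computationally expensive.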

### Hypothesis testing for distributions
The analysis so far has focused on hypothesis testing for specific parameters. Another application is to test the hypothesis that the sample comes from a specific distribution such as the normal distribution. Such a hypothesis can be tested using a variety of tools. A widely used test focuses on the moments. Define $\hat{\gamma}$ and $\hat{\delta}$ as the estimated skewness and kurtosis. With a normal distribution, the true values are $\gamma = 0$ and $\delta = 3$. 

The **Jarque-Bera ($JB$)** statistic measures the deviations from these expected values:

$JB = T\left[\frac{\hat{\gamma}^{2}}{6} + \frac{(\hat{\delta} - 3)^{2}}{24}\right]$

which under the null hypothesis has a chi-square distribution with $2$ degrees of freedom. The cutoff point at the $95\%$ level of confidence is $5.99$. Hence if the observed value of the $JB$ statistic is above $5.99$, we would have to reject the hypothesis that the observations come from a normal distribution.
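The statistic can be computed directly from the sample moments. The sketch below assumes the skewness and kurtosis are estimated from central moments with a $1/T$ divisor, and it uses the fact that a chi-square variable with $2$ degrees of freedom exceeds $x$ with probability $e^{-x/2}$. The function name `jarque_bera`, the simulated samples, and the seed are hypothetical.

```python
import math
import random

def jarque_bera(xs):
    """Return (JB statistic, p-value under a chi-square with 2 degrees of freedom)."""
    T = len(xs)
    m = sum(xs) / T
    m2 = sum((x - m) ** 2 for x in xs) / T   # central moments with a 1/T divisor
    m3 = sum((x - m) ** 3 for x in xs) / T
    m4 = sum((x - m) ** 4 for x in xs) / T
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = T * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
    return jb, math.exp(-jb / 2.0)

critical_value = -2.0 * math.log(0.05)   # about 5.99 at the 95% confidence level

rng = random.Random(3)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]   # clearly non-normal

for name, xs in (("normal", normal_sample), ("exponential", skewed_sample)):
    jb, p = jarque_bera(xs)
    print(f"{name}: JB = {jb:.2f}, p-value = {p:.4f}, reject: {jb > critical_value}")
```

The exponential sample has strong skewness and excess kurtosis, so its $JB$ value lies far above the cutoff. The normal sample will usually, though not always, fall below it, since a correct null is still rejected $5\%$ of the time.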