# Fundamentals of Statistics
In practice, researchers have to find methods to choose among distributions and to estimate distribution parameters from real data. The subject of sampling brings us to the theory of statistics. Whereas probability assumes the distributions are known, statistics attempts to make inferences from actual data.

We sample from the distribution of a population, say the return on a stock market index, to make inferences about the population. Issues of interest are the choices of the best distribution and of the best parameters. In addition, risk measurement deals with large numbers of random variables. As a result, we also need to characterize the relationships between risk factors.

In this notebook, we will talk about two important problems in statistical inference: **estimation** and **tests of hypotheses**. With estimation, we wish to estimate the value of an unknown parameter from sample data. With tests of hypotheses, we wish to verify a conjecture about the data.

- [Parameter Estimation](#parameter_estimation)
- [Regression Analysis](#regression_analysis)

## <a name="parameter_estimation">Parameter Estimation</a>
### Parameters
The first step in risk measurement is to define the risk factors. These can be movements in stock prices, interest rates, exchange rates, or commodity prices. The next step is to measure their distribution. This usually involves choosing a particular distribution function and then estimating parameters. For instance, define $X$ as the random variable of interest. We observe a sequence of $T$ realized values for $x$: $x_{1}, x_{2},..., x_{T}$.

As an example, we could assume that the observed values for $x$ are drawn from a normal distribution

$x \sim  \Phi(\mu, \sigma)$

with mean $\mu$ and standard deviation $\sigma$. Generally, we also need to assume that the random variables are independent and identically distributed ($i.i.d.$). Estimation is still possible if this is not the case but requires additional steps, fitting a model to $x$ until the residuals are $i.i.d.$ Even in simple cases, the $i.i.d.$ assumption requires a basic transformation of the data. For example, $x$ should be the rate of change in the stock index, not its level $P$. We know that the level tomorrow cannot be far from the level today. What is random is whether the level will go up or down. So, the random variable should be the rate of change in the level.

Armed with our $i.i.d.$ sample of $T$ observations, we can start estimating the parameters of interest, such as the mean, the variance, and other moments. Say that the random variable $X$ has a normal distribution with **parameters** $\mu$ and $\sigma^{2}$. These are unknown values that must be estimated. This approach can also be used to check whether the parametric distribution is appropriate. For instance, the normal distribution implies a value of three for the kurtosis coefficient. We can compute an estimate of the kurtosis for the sample at hand and test whether it is equal to three. If not, the assumption that the distribution is normal must be rejected and the risk manager must search for another distribution that fits the data better.

### Parameter estimators
The expected return, or mean, $\mu = E(X)$ can be estimated by the sample mean,

$m = \hat{\mu} = \frac{1}{T}\Sigma^{T}_{i=1}x_{i}$

The sample mean $m$ is an estimator, which is a function of the data. The particular value of this estimator for this sample is a **point estimate**.

Note that we assign the same weight of $1/T$ to all observations because they all have the same probability due to the $i.i.d.$ property. Other estimators are possible, however. For instance, the pattern of weights could be different, as long as they sum to $1$. A good estimator should have several properties.

- It should be **unbiased**, meaning that its expectation is equal to the parameter of interest; for example, $E[m] = \mu$. Otherwise, the estimator is biased.
- It should be **efficient**, which implies that it has the smallest standard deviation of all possible estimators; for example, $V[m - \mu]$ is lowest.

The sample mean, for example, satisfies all of these conditions. An estimator that is unbiased and efficient among all linear combinations of the data is said to be the **best linear unbiased estimator (BLUE)**.

A weaker condition is for an estimator to be **consistent**. This means that it converges to the true parameter as the sample size $T$ increases, or asymptotically. 

The variance, $\sigma^{2} = E[(X - \mu)^{2}]$ can be estimated by the sample variance

$s^{2} = \hat{\sigma}^{2} = \frac{1}{T - 1}\Sigma^{T}_{i=1}(x_{i} - \hat{\mu})^{2}$

Note that we divide by $T - 1$ instead of $T$. This is because we estimate the variance around an unknown parameter, the mean. So, we have fewer degrees of freedom than otherwise. As a result, we need to adjust $s^{2}$ to ensure that its expectation equals the true value, or that it is unbiased. In most situations, however, $T$ is large so this adjustment is minor.
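As a minimal sketch of the two estimators above, the following uses only the Python standard library. The `returns` values are made-up numbers for illustration, and the function names are not from the original text.

```python
import statistics

def sample_mean(xs):
    """Equally weighted average: m = (1/T) * sum of x_i."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance: squared deviations divided by T - 1, not T."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical daily returns in percent (illustrative values only)
returns = [0.5, -1.2, 0.8, 0.3, -0.4, 1.1, -0.6, 0.2]

m = sample_mean(returns)
s2 = sample_variance(returns)

# Cross-check against the standard library, which also divides by T - 1
check = statistics.variance(returns)

print(f"sample mean     = {m:.4f}")
print(f"sample variance = {s2:.4f} (statistics.variance gives {check:.4f})")
```

Dividing by $T$ instead would give a slightly smaller number, which is the downward bias that the $T - 1$ adjustment removes.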

It's essential to note that these estimated values depend on the particular sample and, hence, have some inherent variability. The sample mean itself is distributed as 

$m = \hat{\mu} \sim N(\mu, \sigma^{2}/T)$

If the population distribution is normal, this exactly describes the distribution of the sample mean. Otherwise, the central limit theorem states that this distribution is only valid asymptotically (i.e., for large samples). The standard deviation of this distribution is the standard error of the sample mean:

$se(m) = \sigma\sqrt{\frac{1}{T}}$

For the distribution of the sample variance $\hat{\sigma}^{2}$, one can show that, when $X$ is normal, the following ratio is distributed as a chi-square with $T - 1$ degrees of freedom:

$\frac{(T - 1)\hat{\sigma}^{2}}{\sigma^{2}} \sim \chi^{2}(T - 1)$ 

If the sample size $T$ is large enough, the chi-square distribution converges to a normal distribution

$\hat{\sigma}^{2} \sim N(\sigma^{2}, \sigma^{4}\frac{2}{T - 1})$

Using the same approximation, the sample standard deviation has a normal distribution with a standard error of 

$se(\hat{\sigma}) = \sigma \sqrt{\frac{1}{2T}}$

Note that the precision of these estimators, or standard error ($se$), is proportional to $1/\sqrt{T}$. This is a typical result, which is due to the fact the observations are independent of each other.

We can use this information for **hypothesis testing**. For instance, we would like to detect a constant trend in $X$. Here, the **null hypothesis** is that $\mu = 0$. To answer the question, we use the distributional assumption and compute a standard normal variable as the ratio of the estimated mean to its standard error, or 

$z = \frac{m - 0}{\sigma/\sqrt{T}}$

Because this is now a standard normal variable, we would not expect to observe values far away from $0$. We need to decide on a **significance level** for the test. This is $1$ minus the confidence level, which we call $c$. Typically, we would set $c = 95\%$, which translates into a two-tailed interval for $z_{c}$ of $[-1.96, +1.96]$. The significance level here is $5\%$.

Roughly, this means that, if the absolute value of $z$ is greater than $2$, we would reject the hypothesis that $m$ came from a distribution with a mean of $0$. We can have some confidence that the true $\mu$ is indeed different from $0$.

In fact, we do not know the true $\sigma$ and use the estimated $s$ instead. The distribution then becomes a Student's t with $T - 1$ degrees of freedom:

$t = \frac{m - 0}{s/\sqrt{T}}$

for which the cutoff values can be found from Student's tables. The quantile values for the interval are then $t_{c}$. For large values of $T$, however, this distribution is close to the normal.
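The test of $\mu = 0$ can be sketched in a few lines. This is an illustration under stated assumptions: the `returns` values are made up, the helper name `t_statistic` is hypothetical, and the cutoff of $2$ is the rough rule of thumb from the text (exact Student's t cutoffs for small samples are somewhat larger and would come from tables).

```python
import math
import statistics

def t_statistic(xs, mu0=0.0):
    """t = (m - mu0) / (s / sqrt(T)), with s the sample standard deviation."""
    T = len(xs)
    m = statistics.mean(xs)
    s = statistics.stdev(xs)          # uses the T - 1 divisor
    return (m - mu0) / (s / math.sqrt(T))

# Hypothetical daily returns in percent (illustrative values only)
returns = [0.5, -1.2, 0.8, 0.3, -0.4, 1.1, -0.6, 0.2]

t = t_statistic(returns)
reject = abs(t) > 2.0                 # rough two-tailed 5% rule of thumb

print(f"t = {t:.3f}, reject mu = 0: {reject}")
```

With so few observations and a mean that is small relative to its standard error, the statistic stays well inside the cutoff, so the hypothesis of a zero mean is not rejected for this sample.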

These test statistics can be transformed into **confidence intervals**. These are random intervals that contain the true parameter with a fixed level of confidence 

$c = P[m - z_{c} \times se(m) \leq \mu \leq m + z_{c} \times se(m)]$

Say for instance that we want to determine a $95\%$ confidence interval that contains $\mu$. If $T$ is large, we can use the normal distribution, and the multiplier $z_{c}$ is $1.96$. The confidence interval for the mean is then 

$m \pm z_{c}se(m) = [m - 1.96 \times se(m), m + 1.96 \times se(m)]$

In this case, the confidence interval is symmetric because the distribution is normal. More generally, this interval could be asymmetric. For instance, the distribution of the sample variance is chi-square, which is asymmetric.
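A confidence interval for the mean can be computed the same way. The sketch below assumes $T$ is large enough for the normal multiplier to apply, replaces the unknown $\sigma$ with the sample standard deviation, and uses a simulated sample (seed and parameters chosen arbitrarily) in place of real data.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, c=0.95):
    """Two-sided interval m +/- z_c * se(m), using the normal approximation."""
    T = len(xs)
    m = mean(xs)
    se = stdev(xs) / math.sqrt(T)               # estimated standard error of the mean
    z_c = NormalDist().inv_cdf(0.5 + c / 2.0)   # about 1.96 when c = 0.95
    return m - z_c * se, m + z_c * se

# Simulated sample standing in for observed data (arbitrary seed and parameters)
rng = random.Random(1)
sample = [rng.gauss(0.05, 1.0) for _ in range(500)]

lower, upper = mean_confidence_interval(sample, c=0.95)
print(f"95% confidence interval for the mean: [{lower:.4f}, {upper:.4f}]")
```

Raising `c` to `0.99` widens the interval, because the multiplier $z_{c}$ grows with the confidence level.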

### Choose significance levels for tests
Hypothesis testing requires the choice of a significance level, which needs careful consideration. 

| Decision | Model Correct  | Model Incorrect |
|----------|----------------|-----------------|
| Accept   | Okay           | Type $2$ error  |
| Reject   | Type $1$ error | Okay            |

For a given test, increasing the confidence level (that is, lowering the significance level) will decrease the probability of a type $1$ error but increase the probability of a type $2$ error.

### Precision of estimates
When the sample size increases, the standard error of $\hat{\mu}$ shrinks at a rate proportional to $1/\sqrt{T}$. The precision of the estimate increases as the number of observations increases.

This result will prove useful to assess the precision of estimates generated from **numerical simulations**, which are widely used in risk management. Numerical simulations create independent random variables over a fixed number of replications $T$. If $T$ is too small, the final estimates will be imprecisely measured. If $T$ is very large, the estimates will be accurate. The precision of the estimates increases at a rate proportional to $1/\sqrt{T}$.
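The $1/\sqrt{T}$ rate can be checked with a small simulation. This is a sketch under assumed settings (standard normal draws, 1,000 repeated experiments, arbitrary seed); the function name is hypothetical. It compares the dispersion of the simulated sample means with the theoretical $\sigma/\sqrt{T}$.

```python
import math
import random
import statistics

def simulated_se_of_mean(T, n_trials=1000, sigma=1.0, seed=0):
    """Standard deviation of the sample mean across independent experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, sigma) for _ in range(T)]
        means.append(sum(draws) / T)
    return statistics.stdev(means)

# Each entry maps T to (simulated standard error, theoretical sigma / sqrt(T))
results = {T: (simulated_se_of_mean(T), 1.0 / math.sqrt(T)) for T in (25, 100, 400)}

for T, (empirical, theoretical) in results.items():
    print(f"T = {T:4d}  simulated se = {empirical:.4f}  sigma/sqrt(T) = {theoretical:.4f}")
```

Each fourfold increase in the number of replications should roughly halve the standard error, which is why gaining one extra digit of precision requires about a hundred times more replications.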

### Hypothesis testing for distributions
The analysis so far has focused on hypothesis testing for specific parameters. Another application is to test the hypothesis that the sample comes from a specific distribution such as the normal distribution. Such a hypothesis can be tested using a variety of tools. A widely used test focuses on the moments. Define $\hat{\gamma}$ and $\hat{\delta}$ as the estimated skewness and kurtosis. With a normal distribution, the true values are $\gamma = 0$ and $\delta = 3$. 

The **Jarque-Bera ($JB$)** statistic measures the deviations from the expected values 

$JB = T[\frac{\hat{\gamma}^{2}}{6} + \frac{(\hat{\delta} - 3)^{2}}{24}]$

which under the null hypothesis has a chi-square distribution with $2$ degrees of freedom. The cutoff point at the $95\%$ level of confidence is $5.99$. Hence if the observed value of the $JB$ statistic is above $5.99$, we would have to reject the hypothesis that the observations come from a normal distribution.
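The $JB$ statistic is straightforward to compute by hand. The sketch below is an illustration, not a reference implementation: it assumes the skewness and kurtosis are estimated from central moments divided by $T$ (a common convention that the text does not spell out), and it uses simulated samples with an arbitrary seed.

```python
import random

def jarque_bera(xs):
    """Return (JB, skewness, kurtosis) using central moments divided by T."""
    T = len(xs)
    m = sum(xs) / T
    m2 = sum((x - m) ** 2 for x in xs) / T
    m3 = sum((x - m) ** 3 for x in xs) / T
    m4 = sum((x - m) ** 4 for x in xs) / T
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = T * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
    return jb, skew, kurt

CUTOFF_95 = 5.99  # chi-square(2) cutoff quoted in the text

# Simulated samples: one normal, one strongly skewed (exponential)
rng = random.Random(7)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

for label, data in (("normal", normal_sample), ("exponential", skewed_sample)):
    jb, skew, kurt = jarque_bera(data)
    verdict = "reject normality" if jb > CUTOFF_95 else "do not reject"
    print(f"{label:12s} skew = {skew:6.3f}  kurtosis = {kurt:6.3f}  JB = {jb:10.2f}  {verdict}")
```

Because the statistic is multiplied by $T$, even moderate skewness or excess kurtosis leads to rejection in large samples.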

## <a name="regression_analysis">Regression Analysis</a>
Regression analysis has particular importance for risk management, because it can be used to model relationships between financial variables. It can also be used for the mapping process. For example, positions in individual stocks can be replaced by exposures on a smaller set of stock indices. This considerably reduces the dimensions of the risk space, using exposures estimated via regressions.
### Bivariate regression: ordinary least squares (OLS) estimation
In a linear regression, the **dependent variable** $y$ is projected on a set of $N$ predetermined **independent variables**, $x$. In the simplest bivariate case, with $2$ variables, we write

$y_{t} = \alpha + \beta x_{t} + \epsilon_{t}, t = 1,..., T$

where $\alpha$ is called the **intercept**, or constant, $\beta$ is called the **slope**, and $\epsilon$ is called the **error term**. Here, $y$ is called the **regressand**, and $x$ the **regressor** (as in predictor).

This regression could represent a time series or a cross section of variables. It is **linear** in the coefficients but not necessarily in the variables. The variables themselves could have been transformed. For example, when size or total market value is used, it is commonly transformed by taking logs, in which case $x = \ln(X)$.

Next comes the question of how to estimate these parameters, which are usually unobservable. The estimated regression is 

$y_{t} = \hat{\alpha} + \hat{\beta}x_{t} + \hat{\epsilon_{t}}, t = 1,..., T$

where the estimated error $\hat{\epsilon}$ is called the **residual**, or deviation between the observed and fitted values. 

A particularly convenient regression model is based on **ordinary least squares (OLS)**, using these assumptions:
- The errors are independent of the variable $x$.
- The errors have constant variance.
- The errors are uncorrelated across observations $t$.
- The errors are normally distributed.

Independence between $2$ random variables $u$ and $v$ is a strong assumption; it implies that the covariance $Cov(u, v)$ and the correlation $\rho(u, v)$ are $0$. The assumption of normality, for its part, leads to exact statistical results. More generally, the conditions of the central limit theorem hold as $T$ grows large, which leads to similar results.

$OLS$ provides estimators of the coefficients by minimizing the sum of squared errors, $\Sigma\hat{\epsilon}^{2}$. Thus, the estimators provide the least sum of squared errors.

Under the model assumptions, the Gauss-Markov theorem states that $OLS$ provides the best linear unbiased estimators ($BLUEs$). The estimator $\hat{\beta}$ is unbiased, or has expectation that is the true value $\beta$. Best means that the $OLS$ estimator is efficient, or has the lowest variance compared to others. When the errors are normal, $OLS$ is a maximum likelihood estimator. However, it also has good properties for broader distributions.

The $OLS$ $\beta$ is 

$\hat{\beta} = \frac{[1/(T - 1)]\Sigma_{t}(x_{t} - \bar{x})(y_{t} - \bar{y})}{[1/(T - 1)]\Sigma_{t}(x_{t} - \bar{x})^{2}}$

where $\bar{x}$ and $\bar{y}$ are the means of $x_{t}$ and $y_{t}$. $\alpha$ is estimated by

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$

When the regression includes an intercept, it can be shown that the estimated residuals must average to $0$ by construction 

$(1/T)\Sigma^{T}_{t=1}\hat{\epsilon_{t}} = 0$
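The slope and intercept formulas translate directly into code. In this sketch the data points are made up for illustration and the function name `ols_bivariate` is hypothetical; it also checks that the residuals average to zero when an intercept is included.

```python
def ols_bivariate(x, y):
    """Return (alpha_hat, beta_hat) for the regression y = alpha + beta * x + error."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    # Sample covariance of x and y, and sample variance of x (both divide by T - 1)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (T - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (T - 1)
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Illustrative data (not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = ols_bivariate(x, y)
residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"average residual = {mean_residual:.2e}")
```

The two $1/(T - 1)$ factors cancel in the ratio, so the slope could equally be computed from the raw sums of cross products and squares.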

Note that the numerator in the $\hat{\beta}$ equation is also the sample covariance between $2$ series $x_{i}$ and $x_{j}$, which can be written as 

$\hat{\sigma_{ij}} = \frac{1}{T - 1}\Sigma^{T}_{t=1}(x_{t,i} - \hat{\mu_{i}})(x_{t,j} - \hat{\mu_{j}})$

To interpret $\beta$, we can take the covariance between $y$ and $x$, which is

$Cov(y, x) = Cov(\alpha + \beta x + \epsilon, x) = \beta Cov(x, x) = \beta V(x)$

because $\epsilon$ is uncorrelated with $x$. This shows that the population $\beta$ is also

$\beta(y, x) = \frac{Cov(y, x)}{V(x)} = \frac{\rho(y, x)\sigma(y)\sigma(x)}{\sigma^{2}(x)} = \rho(y, x)\frac{\sigma(y)}{\sigma(x)}$

Once the parameters have been estimated, we can construct a forecast for $y$ conditional on observations on $x$:

$\hat{y_{t}} = E[y_{t}|x_{t}] = \hat{\alpha} + \hat{\beta}x_{t}$

This is a very convenient representation for risk measurement. It shows that the variable $y$ can be mapped on a variable $x$ when the regression provides a good fit, meaning that the error terms are relatively small.

### Bivariate regression: quality of fit
The **regression fit** can be assessed by examining the size of the residuals, obtained by subtracting the fitted values $\hat{y_{t}}$ from $y_{t}$,

$\hat{\epsilon_{t}} = y_{t} - \hat{y_{t}} = y_{t} - \hat{\alpha} - \hat{\beta}x_{t}$

and taking the estimated variance as 

$V(\hat{\epsilon}) = \frac{1}{T - 2}\Sigma^{T}_{t=1}\hat{\epsilon_{t}}^{2}$

We divide by $T - 2$ because the estimator uses $2$ unknown quantities, $\hat{\alpha}$ and $\hat{\beta}$. Without this adjustment, it would be too small, or biased downward. Also note that, because the regression includes an intercept, the average value of $\hat{\epsilon}$ has to be exactly $0$.

The quality of the fit can be assessed using a unitless measure called the **regression R-squared**, also called the **coefficient of determination**. This is defined as

$R^{2} = 1 - \frac{SSE}{SSY} = 1 - \frac{\Sigma_{t}{\hat{\epsilon_{t}}^{2}}}{\Sigma_{t}(y_{t} - \bar{y})^{2}}$

where $SSE$ is the sum of squared errors (residuals, to be precise), and $SSY$ is the sum of squared deviations of $y$ around its mean. If the regression includes a constant, we always have $0 \leq R^{2} \leq 1$. In this case, $R$-squared is also the square of the usual correlation coefficient, 

$R^{2} = \rho(y, x)^{2}$

The $R^{2}$ measures the degree to which the size of the errors is smaller than that of the original dependent variable $y$. To interpret $R^{2}$, consider $2$ extreme cases. On one hand, if the fit is excellent, all the errors will be $0$ and the numerator in the $R^{2}$ equation will be $0$, which gives $R^{2} = 1$. On the other hand, if the fit is poor, $SSE$ will be as large as $SSY$ and the ratio will be one, giving $R^{2} = 0$.
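The following sketch computes $R^{2}$ from $SSE$ and $SSY$ and confirms numerically that, with an intercept, it equals the squared correlation between $y$ and $x$. The data are illustrative and the function name is hypothetical.

```python
import math

def r_squared_and_correlation(x, y):
    """Return (R-squared from SSE/SSY, sample correlation of x and y)."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    ssy = sum((yi - y_bar) ** 2 for yi in y)      # squared deviations of y around its mean
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    sse = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - sse / ssy
    rho = sxy / math.sqrt(sxx * ssy)
    return r2, rho

# Illustrative data (not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, rho = r_squared_and_correlation(x, y)
print(f"R-squared = {r2:.6f}, rho squared = {rho ** 2:.6f}")
```

The residuals here are tiny relative to the spread of $y$, so the ratio $SSE/SSY$ is close to zero and $R^{2}$ is close to one.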

Alternatively, we can interpret the $R$-squared by decomposing the variance of $y_{t} = \alpha + \beta x_{t} + \epsilon_{t}$. Because $\epsilon$ and $x$ are uncorrelated, this yields

$V(y) = \beta^{2}V(x) + V(\epsilon)$

Dividing by $V(y)$,

$1 = \frac{\beta^{2}V(x)}{V(y)} + \frac{V(\epsilon)}{V(y)}$

Because the $R$-squared is also $R^{2} = 1 - V(\epsilon)/V(y)$, it is equal to $\beta^{2}V(x)/V(y)$, which is the contribution to the variation of $y$ due to $\beta$ and $x$.

Assessing the quality of fit also involves checking whether the $OLS$ assumptions are satisfied. For instance, the errors should have a distribution with constant variance: $V(\epsilon_{t}|x_{t}) = \sigma^{2}$. The absence of subscript in $\sigma^{2}$ means that it is constant. This can be checked by plotting the squared residuals against $x_{t}$.

### Bivariate regression: hypothesis testing
Finally, we can derive the distribution of the estimated coefficients, which is normal and centered around the true values. For the slope coefficient, $\hat{\beta} \sim N(\beta, V(\hat{\beta}))$, with variance given by

$V(\hat{\beta}) = V(\hat{\epsilon})\frac{1}{\Sigma_{t}(x_{t} - \bar{x})^{2}}$

This can be used to test whether the slope coefficient is significantly different from $0$. The associated test statistic

$t = \hat{\beta}/\sigma(\hat{\beta})$

has a Student's t distribution.

The usual practice is to check whether the absolute value of the statistic is above $2$. If so, we would reject the hypothesis that there is no relationship between $y$ and $x$. This corresponds to a two-tailed significance level of $5\%$. The $t$ equation can also be used to construct confidence intervals for the population coefficients.
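As a sketch of the slope test, the code below computes the residual variance with the $T - 2$ divisor, the variance of $\hat{\beta}$, and the resulting $t$-statistic. The data are illustrative, the function name is hypothetical, and the cutoff of $2$ is the rule of thumb from the text.

```python
import math

def slope_t_statistic(x, y):
    """Return (beta_hat, standard error of beta_hat, t-statistic for beta = 0)."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    sse = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    resid_var = sse / (T - 2)        # two estimated coefficients: alpha and beta
    se_beta = math.sqrt(resid_var / sxx)
    return beta_hat, se_beta, beta_hat / se_beta

# Illustrative data (not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_hat, se_beta, t = slope_t_statistic(x, y)
significant = abs(t) > 2.0           # rough two-tailed 5% rule of thumb

print(f"beta_hat = {beta_hat:.4f}, se = {se_beta:.4f}, t = {t:.2f}, significant: {significant}")
```

An approximate confidence interval for $\beta$ follows by taking $\hat{\beta}$ plus or minus about two standard errors.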

### Autoregression
A particularly useful application is a regression of a variable on a lagged value of itself, called **autoregression**:

$y_{t} = \alpha + \beta_{k}y_{t-k} + \epsilon_{t}, t = 1,..., T$

If the $\beta$ coefficient is significant, previous movements in the variable can be used to predict future movements. Here, the coefficient $\beta_{k}$ is known as the $k$th-order **autocorrelation coefficient**.

Consider, for instance, a first-order autoregression, where the daily change in the stock market is regressed on the previous day's value. A positive coefficient $\hat{\beta_{1}}$ indicates a trend, or momentum. A negative coefficient indicates mean reversion. 

Autocorrelation changes normal patterns in risk across horizons. When there is no autocorrelation, risk increases with the square root of time. With positive autocorrelation, shocks have a longer-lasting effect and risk increases faster than the square root of time.
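A first-order autoregression is just a bivariate regression of $y_{t}$ on $y_{t-1}$. The sketch below simulates a series with a known coefficient of $0.5$ (an arbitrary choice, as are the seed and sample size) and recovers it by $OLS$. The helper `two_period_vol` applies the standard result $V(y_{t} + y_{t+1}) = 2\sigma^{2}(1 + \rho)$, which is not stated in the text but illustrates the point about the square root of time; all function names are hypothetical.

```python
import math
import random

def ols_bivariate(x, y):
    """Return (alpha_hat, beta_hat) for y = alpha + beta * x + error."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat

def simulate_ar1(alpha, beta, T, sigma=1.0, seed=42):
    """Simulate y_t = alpha + beta * y_{t-1} + e_t with normal shocks."""
    rng = random.Random(seed)
    y = [alpha / (1.0 - beta)]       # start at the unconditional mean
    for _ in range(T):
        y.append(alpha + beta * y[-1] + rng.gauss(0.0, sigma))
    return y

def two_period_vol(sigma, rho):
    """Volatility of the sum of two consecutive values with autocorrelation rho."""
    return sigma * math.sqrt(2.0 * (1.0 + rho))

y = simulate_ar1(alpha=0.0, beta=0.5, T=5000)
alpha_hat, beta_hat = ols_bivariate(y[:-1], y[1:])   # regress y_t on y_{t-1}

print(f"estimated first-order coefficient: {beta_hat:.3f}")
print(f"two-period vol, rho = 0:   {two_period_vol(1.0, 0.0):.4f}")
print(f"two-period vol, rho = 0.5: {two_period_vol(1.0, 0.5):.4f}")
```

With $\rho = 0$ the two-period volatility is $\sigma\sqrt{2}$, the square-root-of-time rule; a positive $\rho$ pushes it above that level and a negative $\rho$ pulls it below.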

### Multivariate regression
More generally, the regression with $N$ independent variables can be written as

$y_{t} = \alpha + \Sigma^{N}_{i=1}\beta_{i}x_{t,i} + \epsilon_{t}$

Stacking all the $y$ variables, we have

$\begin{bmatrix}y_{1} \\ \vdots \\ y_{T} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1N} \\ \vdots &&&& \\ x_{T1} & x_{T2} & x_{T3} & \dots & x_{TN} \end{bmatrix} \begin{bmatrix} \beta_{1} \\ \vdots \\ \beta_{N} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \vdots \\ \epsilon_{T} \end{bmatrix}$

This can include the case of a constant when the first column of $X$ is a vector of ones, in which case $\beta_{1}$ is the usual $\alpha$. In matrix notation,

$y = X\beta + \epsilon$

The estimated coefficients can be written succinctly as 

$\hat{\beta} = (X'X)^{-1}X'y$

which requires another assumption for $OLS$ estimation:
- The matrix $X$ must have full rank, meaning that a variable cannot be a linear combination of others.

If $X$ were not full rank, the matrix ($X'X$) would not be invertible. This assumption rules out perfect correlations between variables. Sometimes, however, $2$ variables can be highly correlated, which is described as **multicollinearity**. When this is the case, the regression output is unstable. Small changes in the data can produce large changes in the estimates. Indeed, coefficients will have very high standard errors.

The covariance matrix of coefficients is 

$V(\hat{\beta}) = V(\hat{\epsilon})(X'X)^{-1}$

using 

$V(\hat{\epsilon}) = \frac{1}{T - N}\Sigma^{T}_{t=1}\hat{\epsilon_{t}}^{2}$

where the denominator is adjusted for the number of estimated coefficients $N$.
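The matrix formulas can be sketched with NumPy. This is an illustration on simulated data, assuming NumPy is available; the coefficient values, noise level, seed, and function name are arbitrary choices, and the first column of `X` is a vector of ones so that the first coefficient plays the role of $\alpha$.

```python
import numpy as np

def ols_multivariate(X, y):
    """Return (beta_hat, covariance matrix of beta_hat, residuals)."""
    T, N = X.shape
    XtX = X.T @ X
    # Solving the normal equations avoids forming the inverse just to get beta_hat
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta_hat
    resid_var = resid @ resid / (T - N)       # divisor adjusted for N coefficients
    cov_beta = resid_var * np.linalg.inv(XtX)
    return beta_hat, cov_beta, resid

# Simulated data with known coefficients (illustrative values only)
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T), rng.normal(size=T)])
true_beta = np.array([0.5, 2.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=T)

beta_hat, cov_beta, resid = ols_multivariate(X, y)
std_errors = np.sqrt(np.diag(cov_beta))

print("beta_hat   :", np.round(beta_hat, 4))
print("std errors :", np.round(std_errors, 4))
```

If two columns of `X` were nearly collinear, `XtX` would be close to singular and the diagonal of `cov_beta` would become very large, which is the instability described above.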

We can extend the $t$-statistic to a multivariate environment. Say we want to test whether the last $m$ coefficients are jointly $0$. Define $\hat{\beta_{m}}$ as these grouped coefficients and $V_{m}(\hat{\beta})$ as their covariance matrix. We set up a statistic

$F = \frac{\hat{\beta_{m}'}V_{m}(\hat{\beta})^{-1}\hat{\beta_{m}}/m}{SSE/(T - N)}$

which has an $F$-distribution with $m$ and $T - N$ degrees of freedom. As before, we would reject the hypothesis if the value of $F$ is too large compared to critical values from tables.

Finally, we can also report a regression $R$-squared. We should note, however, that the $R$-squared mechanically increases as variables are added to the regression. With more variables, the variance of the residual term must be smaller, because this is in-sample fitting with more variables. Sometimes an **adjusted R-squared** is used:

$\bar{R}^{2} = 1 - \frac{SSE/(T - N)}{SSY/(T - 1)} = 1 - (1 - R^{2})\frac{T - 1}{T - N}$

This more properly penalizes for the number of independent variables.
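A short sketch makes the penalty concrete. The function name and the numbers are illustrative; $N$ is taken to count all estimated coefficients, including the intercept, as in the matrix notation above.

```python
def adjusted_r2(r2, T, N):
    """Adjusted R-squared: 1 - (1 - R^2) * (T - 1) / (T - N)."""
    return 1.0 - (1.0 - r2) * (T - 1) / (T - N)

# Same raw R-squared and sample size, different numbers of coefficients
small_model = adjusted_r2(0.60, T=60, N=2)
large_model = adjusted_r2(0.60, T=60, N=10)

print(f"N =  2: adjusted R-squared = {small_model:.4f}")
print(f"N = 10: adjusted R-squared = {large_model:.4f}")
```

For the same raw $R^{2}$, the model with more coefficients receives the lower adjusted value, so extra variables must improve the fit enough to offset the loss in degrees of freedom.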

### Pitfalls with regressions
We now briefly mention potential problems of interpretation.

The original $OLS$ setup assumes that the $X$ variables are predetermined (i.e., exogenous or fixed), as in a controlled experiment. In practice, regressions are performed on actual, existing data that do not satisfy these strict conditions.

If the $X$ variables are stochastic, however, most of the $OLS$ results are still valid as long as the $X$ variables are distributed independently of the errors and their distribution does not involve $\beta$ and $\sigma^{2}$.