Statistical Methods
==========
***

Linear Regression Models and Least Squares
-----------

A linear model can be expressed in the matrix form:

$\mathbf{y}=X\beta + \varepsilon$

where $\mathbf{y}$ is the vector of dependent variables, $X$ is the matrix of independent variables, $\beta$ is the vector of fit parameters, and $\varepsilon$ is an error term. When you collect many data points, you add another index to your matrices/tensors, but for simplicity let's assume for now that we have just one independent and one dependent variable. Then a single measurement is $y_i=X_i\beta$ (neglecting the error term), and the collection of measurements is expressed as

$\mathbf{y}=X\beta$

The least squares minimization approach seeks to minimize the quantity

$\chi^2 = (\mathbf{y} - X\beta)^2 = (\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) $

which, in the case of a linear regression, has an analytic solution. The minimum is found by setting the derivative with respect to $\beta$ equal to 0, i.e.

$\frac{\partial \chi^2}{\partial \beta} = - 2 X^T \mathbf{y} + 2 X^T X \beta = 2 ( - X^T \mathbf{y} + X^T X \beta) = 0
\,\,\rightarrow\,\,\,   X^T \mathbf{y} = X^T X \beta
$

Then simply

$\beta = (X^T X)^{-1} X^T \mathbf{y}$

assuming $(X^T X)$ is invertible (see multicollinearity). In practice, however, explicitly inverting $X^T X$ becomes computationally expensive and numerically unstable for large problems, so numerical methods (e.g. QR or singular value decomposition, or iterative solvers) are typically used to solve the least-squares problem.
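As a minimal sketch of the closed-form solution, the snippet below fits a straight line with NumPy in two ways: by solving the normal equations $X^T X \beta = X^T \mathbf{y}$ directly, and by calling the library solver `np.linalg.lstsq`. The data (intercept 2, slope 3, Gaussian noise) and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 + 3x + small Gaussian noise
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix: a column of ones (intercept) and a column of x values
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver, which avoids forming X^T X at all
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_lstsq)
```

Both approaches should agree to numerical precision on a well-conditioned problem like this one; the library solver is usually the safer choice when the columns of $X$ are nearly collinear.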

Extensions to the least squares method include weighting each measurement by its own uncertainty $\sigma_i$, e.g.

$ \chi^2 = \sum_i\frac{(y_i - X_i\beta)^2}{\sigma_i^2}$
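One way to sketch the weighted case is to divide each row of $X$ and each entry of $\mathbf{y}$ by the corresponding uncertainty, which turns the weighted $\chi^2$ into an ordinary least-squares problem. The function name `weighted_least_squares` and the toy data are assumptions made for this example.

```python
import numpy as np

def weighted_least_squares(X, y, sigma):
    """Minimize sum_i ((y_i - X_i beta) / sigma_i)^2."""
    w = 1.0 / np.asarray(sigma, dtype=float)
    # Scaling each row by 1/sigma_i reduces the problem to ordinary least squares
    Xw = X * w[:, None]
    yw = y * w
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

# Hypothetical noise-free data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 5.0, 7.0])
sigma = np.array([0.1, 0.2, 0.1, 0.5])  # per-measurement uncertainties

beta = weighted_least_squares(X, y, sigma)
print(beta)
```

Because the toy data lie exactly on a line, the weights do not change the answer here; with noisy data, points with small $\sigma_i$ pull the fit more strongly.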

Linear vs Nonlinear least squares
-------

A polynomial $f(x) = A + Bx + Cx^2 + ...$ is still said to be a *linear* model in the sense that the function is expressed as a linear combination of the coefficients $A$, $B$, $C$ of the fit. A *nonlinear model* in this sense would be an expression like $f(x) = \frac{B\ln x}{1+Ax}$, which is clearly not a linear combination of coefficients.
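To illustrate why a polynomial counts as a linear model, the sketch below builds a design matrix whose columns are $1$, $x$, $x^2$ and solves for the coefficients with the same linear least-squares machinery. The coefficient values are hypothetical.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 20)
# Hypothetical noise-free quadratic with A = 1, B = -2, C = 0.5
y = 1.0 - 2.0 * x + 0.5 * x**2

# Columns 1, x, x^2: nonlinear in x, but the model is linear in (A, B, C)
X = np.vander(x, N=3, increasing=True)

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)
```

A model such as $\frac{B\ln x}{1+Ax}$ cannot be written as a design matrix times a coefficient vector, so it would need an iterative nonlinear minimizer instead.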


Bayes' Theorem and Bayesian Statistics
-------------

Simply stated:

$P(Y|X)P(X) = P(X|Y)P(Y)$

$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$

More slowly, let's say we have a test $t$ to check for a trait, for which one can be positive $+$ or negative $-$.
The joint probability $P(+,t_+)$ (the probability that you are positive *and* you test positive) can be written in two equivalent ways:

$P(+,t_+) = P(+|t_+)P(t_+) = P(t_+|+) P(+)$

or in other words, the fraction of positive test results times the probability that an individual is truly positive based on a positive test result, is equal to the fraction of positive people times the conditional probability that you test positive if you are positive (the **sensitivity** of the test).

A typical question is: *if someone tests positive, what is the probability that they are truly positive?* This is as simple as rearranging the terms:

$P(+|t_+) = \frac{P(t_+|+) P(+) }{ P(t_+)}$,

or in other words, the probability of a true positive, conditional on a positive test result, is equal to the sensitivity of the test times the prevalence of the trait, divided by the total probability of a positive test result $P(t_+)$. This last quantity can be calculated by:

$P(t_+) = P(t_+|+)P(+) + P(t_+|-)P(-)$

Terms: **Sensitivity** (true positive rate), **specificity** (true negative rate, or $1-$FPR), and **prevalence** (of the positive condition in the population).
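
As a worked example of the formulas above, the sketch below computes $P(+|t_+)$ from the sensitivity, specificity, and prevalence. The function name and the numbers (99% sensitivity, 95% specificity, 1% prevalence) are hypothetical.

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P(+ | t+) from Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    # Total probability of a positive test: true positives + false positives
    p_test_positive = (sensitivity * prevalence
                       + false_positive_rate * (1.0 - prevalence))
    return sensitivity * prevalence / p_test_positive

p = posterior_positive(sensitivity=0.99, specificity=0.95, prevalence=0.01)
print(f"P(+|t+) = {p:.3f}")
```

With these numbers the result is $1/6$, about 17%: even a fairly accurate test gives mostly false positives when the trait is rare.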

Bayesian Inference
--------

Something something flat prior, something something marginalization.


Naive Bayes Classifiers, and relation to the Likelihood Ratio
---------

A naive Bayes classifier starts with the conditional probability $P(H_1|\mathbf{x})$ (that is, the probability of Hypothesis 1 given an observation $\mathbf{x}$), and assumes that the variables $x_i$ are all independent of one another given the hypothesis. Then, by Bayes' theorem, the joint likelihood can be written as a product of one-dimensional probabilities, e.g.

$P(H_1|\mathbf{x}) \propto P(H_1)\times P(x_0|H_1)\times P(x_1|H_1)\times ... = P(H_1)\prod_i P(x_i|H_1)$

If you have a multi-class problem, then you can simply assign the class with the highest probability as the selected label.

If you are testing two hypotheses $H_0$ and $H_1$ against one another, then you can construct a **likelihood ratio** discriminant (for equal priors, this is the same as the ratio of the posterior probabilities), e.g.

$d = \large{\frac{P(\mathbf{x}|H_1)}{P(\mathbf{x}|H_0)}}$, or
$d = \large{\frac{P(\mathbf{x}|H_1)}{P(\mathbf{x}|H_0) + P(\mathbf{x}|H_1)}}$ if you want your result to be between 0 and 1.
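
A minimal sketch of a two-hypothesis naive Bayes discriminant follows. It assumes, purely for illustration, that each feature is Gaussian-distributed under each hypothesis; the means, widths, prior, and function names are all hypothetical. Per-feature likelihoods $P(x_i|H)$ are multiplied together and then normalized so the output lies between 0 and 1.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """One-dimensional Gaussian density, evaluated element-wise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def naive_likelihood(x, means, stds):
    # Naive assumption: the joint likelihood is a product of 1D densities
    return np.prod(gaussian_pdf(x, means, stds))

def discriminant(x, params_h0, params_h1, prior_h1=0.5):
    """Normalized discriminant in [0, 1]; values near 1 favour H1."""
    l0 = naive_likelihood(x, *params_h0) * (1.0 - prior_h1)
    l1 = naive_likelihood(x, *params_h1) * prior_h1
    return l1 / (l0 + l1)

# Hypothetical per-feature (means, stds) for each hypothesis
params_h0 = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
params_h1 = (np.array([2.0, 2.0]), np.array([1.0, 1.0]))

d_mid = discriminant(np.array([1.0, 1.0]), params_h0, params_h1)
d_h1 = discriminant(np.array([2.0, 2.0]), params_h0, params_h1)
print(d_mid, d_h1)
```

A point halfway between the two hypothetical means gives a discriminant of 0.5, while a point at the $H_1$ mean gives a value close to 1. For many features it is usually safer to sum log-likelihoods rather than multiply densities, to avoid numerical underflow.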

***
Kernels
==============

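Very basically, a kernel is a function of two vectors that quantifies how they relate to one another (a positive definite kernel, in particular, behaves like an inner product in some feature space). A very simple example of a function relating two vectors is the distance between them (although the plain distance is not itself positive definite), e.g.

$K(x,y) = |x-y|$.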

The Gaussian kernel is used e.g. in kernel density estimation (KDE) of probability density functions and in Gaussian processes, where it is often referred to as the RBF (radial basis function) kernel. It is defined as

$K(\mathbf x, \mathbf y) = e^{-|\mathbf y - \mathbf x|^2/2\sigma^2}$

In the KDE case, each data point $x_0$ is replaced with a Gaussian kernel $K(x,x_0)$, and these kernels are then added together over all data points in the data set.
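
The KDE construction can be sketched in one dimension as follows. The function name, the toy data, and the bandwidth are hypothetical; the normalization factor $1/(\sigma\sqrt{2\pi})$ and the division by the number of points are added here so that the estimate integrates to 1.

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, sigma):
    """Average of normalized Gaussian kernels centred on each data point."""
    diff = x_eval[:, None] - data[None, :]
    kernels = np.exp(-diff**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

data = np.array([-1.0, 0.0, 0.5, 2.0])   # hypothetical 1D sample
grid = np.linspace(-8.0, 9.0, 2001)
density = gaussian_kde_1d(grid, data, sigma=0.5)

# Simple Riemann sum as a check that the estimate is normalized
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

The bandwidth $\sigma$ controls the smoothness of the estimate: small values give a spiky density that follows individual points, large values wash out structure.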

Kernels as Feature Map generators
----------

The key idea is that a kernel implicitly generates a larger "feature space" (creates more features) by calculating the relationship between two given data points.

An example is the **polynomial kernel**, given by $K(x,y) = (x^T y + c)^d$ of some degree $d$, where $x$ and $y$ are two samples (events) in a dataset. Taking $d=2$ and assuming that there are two features in the dataset, the polynomial kernel expands to

$K(x,y) = \left(\sum_i x_i y_i + c \right)^2 = x_0^2 y_0^2 + x_1^2 y_1^2 + 2x_0y_0 x_1 y_1 + 2cx_0y_0 + 2cx_1y_1 + c^2 $

Note that this kernel satisfies the formulation $K(x,y) = \phi(x)\cdot \phi(y)$ where $\phi(x)$ is a mapping of x to another feature space with more features. By inspection of the above kernel, we can deduce that

$\phi(x) = \langle x_0^2,~~ x_1^2,~~ \sqrt{2c}\,x_0,~~ \sqrt{2c}\,x_1,~~ \sqrt{2}\,x_0x_1,~~ c \rangle$.
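
As a quick numerical check of $K(x,y) = \phi(x)\cdot\phi(y)$ for the degree-2 polynomial kernel with two features, the sketch below writes out the feature map including the numerical factors ($\sqrt{2}$ and $\sqrt{2c}$) needed for the identity to hold exactly. The function names and sample values are hypothetical.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x . y + c)^d."""
    return (x @ y + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for d = 2 and two input features."""
    x0, x1 = x
    return np.array([
        x0**2,
        x1**2,
        np.sqrt(2 * c) * x0,
        np.sqrt(2 * c) * x1,
        np.sqrt(2) * x0 * x1,
        c,
    ])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_direct = poly_kernel(x, y)     # computed in the original 2D space
k_mapped = phi(x) @ phi(y)       # computed in the 6D feature space
print(k_direct, k_mapped)
```

Both routes give the same number, but the kernel evaluates it without ever constructing the six-dimensional vectors, which is the practical appeal of working with kernels.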

RBF (Gaussian) Kernel
--------

The RBF kernel technically has an infinite feature space, so in order to avoid this ..... ???

Lagrangian Dual
========================

Lagrange Multipliers
===========

***
Bibliography
=======
 - https://en.wikipedia.org/wiki/Ordinary_least_squares
 - Glen Cowan: Statistical Data Analysis