# Regression analysis
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable, often called the **outcome variable**, and one or more independent variables, often called **predictors**, **covariates**, or **features**.

## 1) Linear Regression
Although the terms **least squares** and **linear model** are closely linked, they are not synonymous. Linear regression models are often fitted using the least squares approach, but other approaches can be used, e.g. minimizing a penalized version of the least squares cost function, as in ridge regression (Tikhonov regularization, an L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models.

Given a data set ${\displaystyle \{y_{i},\,x_{i1},\ldots ,x_{in}\}_{i=1}^{m}}$ of $m$ observations with $n$ predictors each, a linear regression model assumes that the relationship between the dependent variable $y$ and the vector of predictors $\textbf{x}$ is linear:

${\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{i1}+\cdots +\beta _{n}x_{in}+\varepsilon _{i}=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}+\varepsilon _{i},\qquad i=1,\ldots ,m}$

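Ignoring the noise term, the $m$ equations can be written in matrix form, where the leading column of ones multiplies the intercept $\beta_{0}$:

${\displaystyle {\begin{pmatrix}1&x_{11}&\cdots &x_{1n}\\1&x_{21}&\cdots &x_{2n}\\\vdots &\vdots &\ddots &\vdots \\1&x_{m1}&\cdots &x_{mn}\end{pmatrix}}{\begin{pmatrix}\beta_{0}\\ \beta_{1}\\\vdots \\ \beta_{n}\end{pmatrix}}={\begin{pmatrix}y_{1}\\y_{2}\\\vdots \\ y_{m}\end{pmatrix}}}$

or, more compactly:

$X_{m\times (n+1)}\,{\vec {\beta }}_{(n+1)\times 1}=Y_{m\times 1}$

This system generally has no exact solution, so the least squares estimate minimizes the squared norm of the residual: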

${\displaystyle L(D,{\vec {\beta }})=||X{\vec {\beta }}-Y||^{2}=(X{\vec {\beta }}-Y)^{T}(X{\vec {\beta }}-Y)=Y^{T}Y-Y^{T}X{\vec {\beta }}-{\vec {\beta }}^{T}X^{T}Y+{\vec {\beta }}^{T}X^{T}X{\vec {\beta }}}$


${\displaystyle {\frac {\partial L(D,{\vec {\beta }})}{\partial {\vec {\beta }}}}={\frac {\partial \left(Y^{T}Y-Y^{T}X{\vec {\beta }}-{\vec {\beta }}^{T}X^{T}Y+{\vec {\beta }}^{T}X^{T}X{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}=-2X^{T}Y+2X^{T}X{\vec {\beta }}}$
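The gradient formula above can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy and small synthetic data; the function names `loss`, `grad`, and `numeric_grad` are illustrative and not part of the original notes.

```python
import numpy as np

def loss(X, Y, beta):
    """Least squares loss ||X beta - Y||^2."""
    r = X @ beta - Y
    return float(r @ r)

def grad(X, Y, beta):
    """Analytic gradient from the derivation: -2 X^T Y + 2 X^T X beta."""
    return -2 * X.T @ Y + 2 * X.T @ X @ beta

def numeric_grad(X, Y, beta, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        g[j] = (loss(X, Y, beta + step) - loss(X, Y, beta - step)) / (2 * eps)
    return g

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
beta = rng.normal(size=3)

max_diff = np.max(np.abs(grad(X, Y, beta) - numeric_grad(X, Y, beta)))
print(max_diff)  # should be close to zero
```

Because the loss is quadratic in $\vec{\beta}$, the central difference agrees with the analytic gradient up to floating-point rounding.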


Setting the gradient of the loss to zero and solving for ${\displaystyle {\vec {\beta }}}$ (assuming $X^{T}X$ is invertible), we get:

${\displaystyle -2X^{T}Y+2X^{T}X{\vec {\beta }}=0\Rightarrow X^{T}Y=X^{T}X{\vec {\beta }}\Rightarrow {\vec {\hat {\beta }}}=(X^{T}X)^{-1}X^{T}Y}$
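As a rough illustration of the closed-form solution, the sketch below fits a linear model on synthetic data with NumPy. The data, the coefficients in `true_beta`, and the noise level are assumptions made up for the example. It solves the normal equations $X^{T}X{\vec{\beta}}=X^{T}Y$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more numerically stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (assumed for illustration): m samples, 2 predictors
rng = np.random.default_rng(0)
m = 200
features = rng.normal(size=(m, 2))

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(m), features])
true_beta = np.array([1.0, 2.0, -3.0])          # hypothetical coefficients
Y = X @ true_beta + 0.1 * rng.normal(size=m)    # linear signal plus noise

# Normal equations: X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Both estimates should agree with each other and lie close to `true_beta`, since the noise is small relative to the signal.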

## 2) Logistic regression
The logistic model is used to model the probability of a certain class or event, such as pass/fail or win/lose. It can be extended to model several classes of events with **multinomial logistic regression**.
With linear regression, the outputs are not guaranteed to lie between zero and one. We have a real-valued input $\textbf{X} \in \mathbb{R}$ and want to map it to a probability $p(\textbf{X}) \in [0,1]$, so we use the sigmoid function to limit the outcome to the range between 0 and 1:


$p(x)=\frac{1}{1+e^{-\beta x}}$. 
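A minimal sketch of the sigmoid in NumPy follows. Splitting the computation by the sign of the input is an implementation choice assumed here to avoid overflow in `np.exp` for large-magnitude inputs; it computes the same function as the formula above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), evaluated without overflow."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the direct formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

p = sigmoid([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(p)
```

Every output lies between 0 and 1, and `sigmoid(z) + sigmoid(-z)` equals 1 for any `z`.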




### 2-1) Odds
Odds measure how likely a particular outcome is relative to it not occurring. With a fair six-sided die, the odds of rolling a 6 are 1:5, and the odds of rolling either a 5 or a 6 are 2:4.


$\text{odds}=\frac{P(\text{event})}{1-P(\text{event})}$
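The dice example can be reproduced with exact arithmetic. This is a small sketch using Python's standard `fractions` module; the helper name `odds` is illustrative.

```python
from fractions import Fraction

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

p_six = Fraction(1, 6)           # probability of rolling a 6
p_five_or_six = Fraction(2, 6)   # probability of rolling a 5 or a 6

print(odds(p_six))          # 1/5, i.e. odds of 1:5
print(odds(p_five_or_six))  # 1/2, i.e. odds of 2:4 in lowest terms
```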

### 2-2) Logit
The logit function or the log-odds is the logarithm of the odds. If $p$ is a probability, 

${\displaystyle \operatorname {logit} (p)=\log \left({\frac {p}{1-p}}\right)=\log(p)-\log(1-p)=-\log \left({\frac {1}{p}}-1\right)\,.}$
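The logit is the inverse of the logistic sigmoid, which is why it appears naturally in logistic regression. The sketch below, assuming NumPy, checks that round trip numerically; the function names are illustrative.

```python
import numpy as np

def logit(p):
    """Log-odds: log(p) - log(1 - p), for 0 < p < 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

def sigmoid(z):
    """Logistic sigmoid, the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)

print(z)           # log-odds: negative below p = 0.5, zero at 0.5, positive above
print(sigmoid(z))  # recovers the original probabilities
```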


### 2-3) Parameter Estimation
In linear regression we used least squares to estimate the parameters; here we use maximum likelihood estimation (MLE). Since the samples in our training set are assumed to be independent of each other, for any two samples we have:

$p(x_i\cap x_j)=p(x_i, x_j)=p(x_i)p(x_j)$


We have two labels (classes), $0$ and $1$: ${\displaystyle Y\in \{0,1\}}$

We are looking for the parameter $\theta$ that maximizes the likelihood:


${\displaystyle {\begin{aligned}L(\theta \mid y;x)&=\Pr(Y\mid X;\theta )\\&=\prod _{i}\Pr(y_{i}\mid x_{i};\theta )\\&=\prod _{i}h_{\theta }(x_{i})^{y_{i}}(1-h_{\theta }(x_{i}))^{(1-y_{i})}\end{aligned}}}$

Where:

$\Pr(Y=1\mid X;\theta )={\displaystyle h_{\theta }(X)={\frac {1}{1+e^{-\theta ^{T}X}}}}$


$\Pr(Y=0\mid X;\theta )=1-\Pr(Y=1\mid X;\theta )={\displaystyle 1-h_{\theta }(X)=1-{\frac {1}{1+e^{-\theta ^{T}X}}}}$
 

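Taking the $\log$ of both sides turns the product into a sum. Substituting $h_{\theta}(x_{i})$ and simplifying gives:

$l(\theta)=\sum_{i}\left[y_{i}\log h_{\theta}(x_{i})+(1-y_{i})\log \left(1-h_{\theta}(x_{i})\right)\right]=\sum_{i}\left[y_{i}\theta^{T}x_{i}-\log \left(1+e^{\theta^{T}x_{i}}\right)\right]$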
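To check that the simplified sum really equals the log of the likelihood product, the sketch below evaluates both on synthetic data. The data and the value of `theta` are arbitrary assumptions for the check. `np.logaddexp(0, z)` computes $\log(1+e^{z})$ in a numerically stable way.

```python
import numpy as np

def log_likelihood_direct(theta, X, y):
    """Sum of y*log(h) + (1-y)*log(1-h), with h the sigmoid of X @ theta."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return float(np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)))

def log_likelihood_simplified(theta, X, y):
    """Simplified form: sum of y*z - log(1 + exp(z)), with z = X @ theta."""
    z = X @ theta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Arbitrary synthetic inputs (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
theta = np.array([0.5, -1.0, 2.0])

ll_direct = log_likelihood_direct(theta, X, y)
ll_simplified = log_likelihood_simplified(theta, X, y)
print(ll_direct, ll_simplified)
```

The two values should match up to floating-point error, and both are non-positive because every factor in the likelihood is a probability.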

Setting the gradient of $l(\theta)$ to zero yields a **transcendental equation**, which has no closed-form solution. A **transcendental function** is an analytic function that does not satisfy a polynomial equation, in contrast to an algebraic function; examples of transcendental equations are $x=e^{-x}$, $x=\cos x$, and $2^{x}=x^{2}$.
This equation can be solved numerically by algorithms such as:
- Gradient descent
- Newton–Raphson method
- Quasi-Newton methods, for example:
  - Davidon–Fletcher–Powell formula
  - Broyden–Fletcher–Goldfarb–Shanno algorithm
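As a rough sketch of the first option in the list, the code below fits a logistic regression by gradient descent on the mean negative log-likelihood, whose gradient is $X^{T}(h_{\theta}(X)-y)/m$. The synthetic data, the step size `lr`, the iteration count, and the function name `fit_logistic_gd` are all assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_iter=2000):
    """Plain gradient descent on the mean negative log-likelihood."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the mean negative log-likelihood with respect to theta
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= lr * grad
    return theta

# Synthetic data (assumed): one predictor plus an intercept column
rng = np.random.default_rng(0)
m = 500
x = rng.normal(size=m)
X = np.column_stack([np.ones(m), x])
true_theta = np.array([-0.5, 2.0])   # hypothetical parameters
y = (rng.random(m) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit_logistic_gd(X, y)
grad_norm = np.linalg.norm(X.T @ (sigmoid(X @ theta_hat) - y) / m)
print(theta_hat, grad_norm)
```

A gradient norm near zero at `theta_hat` indicates that the iteration has reached a stationary point of the log-likelihood. Newton–Raphson and quasi-Newton methods typically need far fewer iterations, at the cost of computing or approximating second-derivative information.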

Refs: [1](https://www.youtube.com/watch?v=YMJtsYIp4kg)

### 2-4) Example:
Refs: [1](http://faculty.cas.usf.edu/mbrannick/regression/Logistic.html)

### 2-5) Multinomial logistic regression


## 3) Nonlinear regression