<h4>Explain Linear Regression</h4>

$$y_i = \beta^T x_i + \epsilon_i$$

<img src="https://github.com/BadrulAlom/Data-Science-Notes/raw/master/_img/aml/aml010.png" height="100" width="300">

* Y is called the regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable
* X is called the regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables 
* $\beta$ is the parameter vector. <b>You will often see it being denoted by W, a weight vector.</b> Elements of this are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.
* $\epsilon$ is called the error term, disturbance term, or noise. This variable captures all other factors which influence the dependent variable $y_i$ other than the regressors $x_i$. The relationship between the error term and the regressors, for example whether they are correlated, is a crucial consideration in formulating a linear regression model, as it will determine the method to use for estimation.
* Linear regression makes a number of assumptions (https://en.wikipedia.org/wiki/Linear_regression#Extensions), some of which are relaxed by its many extensions
<br>


More about the error term:
<br>We often assume $\epsilon$ has a Gaussian (normal) distribution. We denote this by $\epsilon \sim N(\mu,\sigma^2)$
<br> $p(y|x,\theta) = N(y | \mu(x), \sigma^2(x))$

1. The mean of the error term, $\mu$, represents the overall mean error. It is therefore a constant that you add on to adjust the prediction; in practice it is usually absorbed into the intercept, so $\epsilon$ is assumed to have zero mean
2. The output, y, then follows a normal distribution whose mean, $\mu(x)$, and standard deviation, $\sigma(x)$, are functions of x

<h4 style="background-color:#616161;color:white">Measuring the loss</h4>

Given N rows of data, the loss is

$$Loss(\beta) = \sum^N_{i=1} (y^i - \beta^Tx^i)^2$$

or, writing the parameter vector as w in place of $\beta$,

$$E(w) = \sum^N_{i=1} (y^i - w^Tx^i)^2$$
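
As a minimal sketch of the loss above, assuming NumPy and a small made-up dataset (the names `squared_loss`, `perfect_w` and `worse_w` are illustrative, not from any library), the sum of squared residuals could be computed like this:

```python
import numpy as np

def squared_loss(w, X, y):
    """Sum of squared residuals: sum_i (y_i - w^T x_i)^2."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# Hypothetical toy data: 3 rows, first column of ones acts as the intercept
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

perfect_w = np.array([1.0, 2.0])  # y = 1 + 2x reproduces y exactly
worse_w = np.array([0.0, 2.0])    # every prediction is off by 1

loss_perfect = squared_loss(perfect_w, X, y)
loss_worse = squared_loss(worse_w, X, y)
```

The better the weights fit the data, the smaller the loss; a perfect fit gives a loss of zero.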

<h4 style="background-color:#616161;color:white">How to derive the parameter vector</h4>

A large number of procedures have been developed for parameter estimation and inference in linear regression. 

<b>Least-squares estimation and related techniques:</b>
- Ordinary Least Squares (OLS). Minimizes the sum of squared residuals, which leads to a closed-form (i.e. analytic) way to derive the parameter vector:

$${\hat { {\beta }}}=({X} ^{\top }{X} )^{-1} {X} ^{\top }{y} =\left(\sum {x} _{i}{x} _{i}^{\top }\right)^{-1}\left(\sum {x} _{i}y_{i}\right)$$

- Generalised Least Squares (GLS) is an extension of OLS that allows estimation when heteroscedasticity (non-constant variance) or correlations are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data.

- Total least squares (TLS) is an approach to least squares estimation of the linear regression model that treats the covariates and response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and is also sometimes used even when the covariates are assumed to be error-free.
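
The OLS closed form above can be sketched in a few lines of NumPy. This is an illustration on synthetic data rather than a production routine: the names `ols_fit` and `true_beta` and the noise level are assumptions made for the example. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed for illustration): y = 1 + 2x + small Gaussian noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # column of ones gives the intercept
true_beta = np.array([1.0, 2.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

beta_hat = ols_fit(X, y)

# Cross-check with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred, because $X^TX$ can be badly conditioned when the regressors are highly correlated.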

<b>Maximum-likelihood estimation and related techniques</b>
- Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family $f_\theta$ of probability distributions.
- Ridge regression and other forms of penalized estimation such as Lasso regression deliberately introduce bias into the estimation of $\beta$ in order to reduce the variability of the estimate. The resulting estimators generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present or when overfitting is a problem. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.

<h4 style="background-color:#616161;color:white">Regularization</h4>

Regularization is needed to keep your model in check and avoid overfitting. There are a number of ways you can do this:

* Normalize each data point by dividing by the standard deviation: $x^n_i \rightarrow \frac{x^n_i}{\sigma_i}$
* Add an extra term to penalize rapid changes in the output: $$E'(w) = E(w) + \lambda R(w)$$ where R(w) is the regularization (penalty) function and $\lambda$ controls the penalty strength
* For example, an L2 penalty on the weight vector, $R(w) = w^Tw$, gives
$$E(w) = \sum^{N}_{n=1}(y^n-w^TX^n)^2 + \lambda w^Tw$$

In this case the optimal w is given by

$$w=\left(\sum_nX^n(X^n)^T+\lambda I \right)^{-1} \sum^{N}_{n=1}y^nX^n$$
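
A minimal NumPy sketch of this closed form, assuming the rows of the matrix `X` are the vectors $X^n$ (so that $\sum_n X^n(X^n)^T = X^TX$). The function name `ridge_fit`, the synthetic data and the value of `lam` are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized solution: (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, assumed for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # a larger lam shrinks the weights
```

Setting `lam` to zero gives back the OLS solution, while increasing it pulls the weight vector towards zero. Note that this form penalizes every weight equally, including an intercept if one is included as a column of ones.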

------------------------

'Independent variables' do not have to be independent

<b>What is multi-collinearity?</b>

It occurs when the 'independent' variables are highly correlated with one another

<b>What is heteroscedasticity?</b>

It is when the variance of the error is not constant across observations, for example increasing or decreasing over time

https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/

Non-Parametrics

e.g. Quintic function: $y(x) = \alpha_0+\alpha_1x+\alpha_2x^2+\alpha_3x^3+\alpha_4x^4+\alpha_5x^5$

Higher-degree polynomials are useful as they can provide a better fit to data if a simple linear model does not suffice. Each extra degree allows an additional level of curvature: notice that a quadratic curve can change direction once, and a quintic can change direction up to 4 times
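
To illustrate, here is a small sketch that fits polynomials of increasing degree to the same data with `np.polyfit` and compares the training error. The sine-shaped synthetic data and the helper name `fit_sse` are assumptions made for the example.

```python
import numpy as np

# Synthetic curved data, assumed for illustration
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

def fit_sse(degree):
    """Least-squares polynomial fit; returns the sum of squared errors."""
    coeffs = np.polyfit(x, y, degree)  # coefficients, highest degree first
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))

# Training error for a straight line, a quadratic and a quintic
sse = {degree: fit_sse(degree) for degree in (1, 2, 5)}
```

The training error can only go down as the degree increases, since each lower-degree polynomial is a special case of the higher-degree one. That is not the same as generalizing better: a high-degree fit can also chase the noise.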

<b>What is Polynomial Least Squares?</b>

Watch this video: https://www.youtube.com/watch?v=Z5iq95Vg2ZY&index=4&list=PLpT5xJ7AmkRW5Q0HwfVRWgUjFdiis-alc

<b>What is Ridge Regression?</b>

Ridge regression is a technique that deals with over-fitting due to having too many parameters. Think of over-fitting as resulting in lots of coefficients that are highly tuned to the dataset the model was trained on; ridge regression counters this by adding an L2 penalty that shrinks the coefficients.

<b>What is Robust Linear Regression?</b>

It is very common to model the noise (i.e. error term) in regression models using a Gaussian distribution with zero mean and constant variance, $\epsilon_i \sim N(0, \sigma^2)$, where $\epsilon_i = y_i - w^Tx_i$. In this case, maximizing likelihood is equivalent to minimizing the sum of squared residuals, as we have seen. However, if we have outliers in our data, this can result in a poor fit.

This is because squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near to the line. One way to achieve robustness to outliers is to replace the Gaussian distribution for the response variable with a distribution that has heavy tails. Such a distribution will assign higher likelihood to outliers, without having to perturb the straight line to "explain" them.
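
The effect can be seen in the simplest possible regression, a constant-only model. Under Gaussian noise the maximum-likelihood estimate of the constant is the mean (it minimizes squared error). Under Laplace noise, which is one heavy-tailed choice assumed here purely for illustration, it is the median (it minimizes absolute error). The data below are made up.

```python
import numpy as np

# Made-up observations clustered around 1.0, plus a single outlier
y = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 10.0])

# Gaussian noise model: minimizing squared error gives the mean
mean_est = float(np.mean(y))

# Laplace (heavy-tailed) noise model: minimizing absolute error gives the median
median_est = float(np.median(y))
```

The single outlier drags the mean to more than twice the typical value, while the median stays with the bulk of the data.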

<b>What is the logistic function?</b>

$$f(x) =\frac{1}{1+e^{-x}}$$

This is the sigmoid function. It has a tight S-shaped curve that squashes any input into the range (0, 1), which makes it well suited to approximating binary (0 or 1) outputs
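
A minimal sketch of the function in plain Python. The two-branch form is an implementation choice made for this example: it is mathematically identical to the formula above but avoids overflow in `math.exp` for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    z = math.exp(x)
    return z / (1.0 + z)

midpoint = sigmoid(0.0)      # exactly 0.5
near_one = sigmoid(10.0)     # close to 1
near_zero = sigmoid(-10.0)   # close to 0
```

Large positive inputs map close to 1, large negative inputs map close to 0, and an input of 0 maps to exactly 0.5.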

<b>How is this used in logistic regression?</b>

Remember that for normal linear regression the output y  = $ h_\theta (X) + c$

This can be thought of as '<i>some</i> function of X with parameter theta', similar to saying f(X). (Remember that the function can be a long polynomial formula with higher exponents $X^n$, and $\theta$ can be a vector, not just 1 value.) The c above can be ignored for our purposes: it's simply the intercept of y.

In linear regression $ h_\theta (X)$ is  $\theta^TX$  (where T = transpose)

In logistic regression $ h_\theta (X)$ is also $\theta^TX$ but then wrapped up in a sigmoid function


$$h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$$

This can also be thought of as the probability $p(y=1|x,\theta)$ (e.g. 0.7 = a 70% chance that y = 1).

In logistic regression the predicted Y = 1 when this probability is >= 0.5
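
As a sketch of the two steps above (wrap $\theta^Tx$ in a sigmoid, then threshold at 0.5), assuming NumPy. The names `predict_proba` and `predict` and the values of `theta` and `X` are made up for the example.

```python
import numpy as np

def predict_proba(theta, X):
    """h_theta(x) = sigmoid(theta^T x), applied to every row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def predict(theta, X, threshold=0.5):
    """Predict class 1 when the probability is at least the threshold."""
    return (predict_proba(theta, X) >= threshold).astype(int)

# Hypothetical parameters: intercept -1, slope 2 (first column of X is ones)
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 2.0]])

probs = predict_proba(theta, X)
labels = predict(theta, X)
```

Here $\theta^Tx$ is -1, 0 and 3 for the three rows, so the probabilities are below 0.5, exactly 0.5 and above 0.5, and the predicted labels are 0, 1 and 1.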

<img src="img/LogisticFunction.png" >

<b>Logistic Regression Loss function</b>

The fit of a logistic regression model is measured by the following loss function (the average cross-entropy, or log loss):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m\Big[ y^i \log(h_\theta(x^i))  + (1 - y^i)\log(1-h_\theta(x^i))\Big]$$

This is what we would like to minimize.
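
A minimal NumPy sketch of this loss. The function name `log_loss`, the clipping constant `eps` (added to avoid taking the log of zero) and the toy data are assumptions for illustration.

```python
import numpy as np

def log_loss(theta, X, y, eps=1e-12):
    """Average cross-entropy J(theta) for logistic regression."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Toy data: first column of ones, labels switch from 0 to 1 as x increases
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

loss_uninformed = log_loss(np.zeros(2), X, y)         # every probability is 0.5
loss_better = log_loss(np.array([0.0, 1.0]), X, y)    # slope in the right direction
```

With all-zero parameters every prediction is 0.5 and the loss equals $\log 2$; parameters that push the probabilities towards the true labels give a lower loss.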

<b>Gradient Descent and Newton-Raphson algorithm</b>

For logistic regression, the $\theta$ vector cannot be calculated in closed form. It requires an iterative method such as gradient descent to find the minimum point of the loss. Mathematically this means finding a $\theta$ vector where the derivative of the loss (the change in loss given the change in $\theta$) is approximately 0.

This is done in iterative steps, and the Newton-Raphson method does the iteration efficiently by calculating a second derivative on top of the first derivative:

New $\theta$ = old $\theta-\frac {DifferentialOfLoss} {DifferentialOfDifferentialOfLoss}$

$\theta_{t+1} = \theta_{t}  - \frac {J'(\theta_{t})} {J''(\theta_{t})}$

The above version of the formula works for when your $\theta$ is not a vector.
The generalized version looks like this:

New $\theta$ = old $\theta$ - inverse of DifferentialOfDifferentialOfLoss * DifferentialOfLoss

$\theta_{t+1} = \theta_t  - $ Hessian($\theta_t)^{-1}$ * GradientVector

$\theta_{t+1} = \theta_t  - H(\theta_t)^{-1} \nabla J(\theta_t)$
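
The vector update can be sketched for logistic regression as follows. This assumes the standard expressions for the gradient, $\nabla J = \frac{1}{m}X^T(h - y)$, and the Hessian, $H = \frac{1}{m}X^T \mathrm{diag}(h(1-h)) X$, of the log loss, which are not derived in these notes. The function names, the fixed iteration count and the toy data are illustrative assumptions. The data are deliberately not perfectly separable, since otherwise the minimum is not finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of the average log loss: X^T (h - y) / m."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def newton_logistic(X, y, n_iter=25):
    """Newton-Raphson: theta <- theta - H^{-1} grad, starting from zero."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m
        # Hessian: X^T diag(h * (1 - h)) X / m
        hess = (X.T * (h * (1 - h))) @ X / m
        # Solve H * step = grad rather than inverting H explicitly
        theta = theta - np.linalg.solve(hess, grad)
    return theta

# Toy, non-separable data: column of ones plus one feature
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_hat = newton_logistic(X, y)
grad_norm = float(np.linalg.norm(gradient(theta_hat, X, y)))
```

At the returned `theta_hat` the gradient is numerically zero, which is the stopping condition described above. A more careful implementation would stop when the gradient norm falls below a tolerance instead of running a fixed number of iterations.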

Loss functions

It's very common to train models using the squared loss. However, if the task
performance is measured by some other loss, it often makes more sense to
train the model using that loss. One often sees users train a model
using one kind of loss but evaluate it using a very different one, which is a bad
idea in general.