# Linear and Logistic Regression, AWS Machine Learning: Data Scientist

## Linear Regression via Least Squares Minimization

### Types of Machine Learning

* **Supervised**. Learns a function that approximates the relationship between input and output data, using labeled examples.
    * **Classification**. Predicts discrete labels/categories. Example: Logistic Regression.
    * **Regression**. Predicts continuous values. Example: Ordinary Least Squares.
* **Unsupervised**. Finds structure in data without using explicit labels.

### Introduction to Linear Models

* **What is it?**. In linear modeling, the relationship between each individual input variable and the output is a straight line. The slopes of these lines become the coefficients of the linear equation.
* **An example of linear notation**. $y = a_0 + a_1x_1 + a_2x_2 + ...$.
* **Why use linear models?**.
    * Interpretable
    * Low complexity
    * Scalable
    * Good baseline

### Definition of 1-Dimension Ordinary Least Squares (OLS)

* $i$ - an observation. Assume $i = 1, 2, ..., N$.
* $x$ - independent variable, feature
* $y$ - dependent variable, output 

The linear relationship looks like this

\begin{equation*}
y = ax + b
\end{equation*}

where $a, b$ are constants.  

The OLS problem is to solve for $a$ and $b$ so that $y_i = ax_i + b + \epsilon_i$, where $\epsilon_i$ is the noise. There is always some noise/error: no matter how well we choose $a$ and $b$, the independent variable never fully explains the dependent variable. We usually skip $\epsilon_i$ and instead write

\begin{equation*}
\hat{y}_i = ax_i + b
\end{equation*}

where  $\hat{y}_i$ are our estimates of $y_i$.

We solve for $a$ and $b$ by defining a loss function: a quantity that measures how far the estimates $\hat{y}_i$ are from the observed $y_i$, and that we then minimize. For linear regression we use the sum of squared errors.

\begin{equation*}
L = \sum_{i=1}^{N}(y_i - \hat{y_i})^2
\end{equation*}
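As a minimal sketch of the loss above, the following evaluates $L$ for a candidate pair $(a, b)$. The function name `squared_error_loss` and the toy data are made up for illustration; they are not part of the course material.

```python
import numpy as np

def squared_error_loss(x, y, a, b):
    """Sum of squared residuals L = sum_i (y_i - (a*x_i + b))^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = a * x + b                      # estimates for every observation
    return float(np.sum((y - y_hat) ** 2))

# Illustrative data (an assumption of this sketch), lying exactly on y = 2x + 1.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

loss_exact = squared_error_loss(x_demo, y_demo, a=2.0, b=1.0)  # correct line
loss_off = squared_error_loss(x_demo, y_demo, a=1.0, b=1.0)    # wrong slope
print(loss_exact, loss_off)
```

The correct line gives a loss of zero, while the wrong slope gives a strictly positive loss, which is what makes $L$ usable as an objective to minimize.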

### Solution of Ordinary Least Squares (OLS)

Find the minimum of $L$ by setting the first derivatives of $L$ with respect to $a$ and $b$ to zero, and solving for the values of $a$ and $b$ at which $L$ is minimal.

\begin{equation*}
L = \sum_{i=1}^{N}(y_i - \hat{y_i})^2 = \sum_{i=1}^{N}(y_i - ax_i - b)^2
\end{equation*}

Set $\frac{dL}{da} = 0$ and $\frac{dL}{db} = 0$

Let's do it

\begin{align*}
&\frac{dL}{db} = \sum_{i=1}^{N} 2(y_i - ax_i - b)(-1) = 0 \;\Longrightarrow\; \sum_{i=1}^{N} (y_i - ax_i - b) = 0 \\
&\sum_{i=1}^{N} y_i- a\sum_{i=1}^{N}x_i - bN = 0 \\
& \boxed{b = \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N}x_i}
\end{align*}


\begin{align*}
& \frac{dL}{da} = \sum_{i=1}^{N} 2(y_i - ax_i - b)(-x_i) = 0 \;\Longrightarrow\; \sum_{i=1}^{N} (y_i - ax_i - b)(x_i) = 0\\
& \sum_{i=1}^{N} x_iy_i - a\sum_{i=1}^{N}x_i^2 - b\sum_{i=1}^{N}x_i = 0 \\
& \sum_{i=1}^{N} x_iy_i = a\sum_{i=1}^{N}x_i^2 + \big( \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N}x_i  \big) \sum_{i=1}^{N}x_i \\
& \boxed{a = \frac{\sum_{i} x_iy_i - \frac{1}{N}(\sum_{i}x_i)(\sum_{i}y_i)}{\sum_{i}x_i^2 - \frac{1}{N}(\sum_{i}x_i)^2}} 
\end{align*}
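
The last step above, from the substituted equation to the boxed expression for $a$, can be spelled out by expanding the bracket and collecting the terms in $a$:

\begin{align*}
\sum_{i} x_iy_i &= a\sum_{i}x_i^2 + \frac{1}{N}\Big(\sum_{i}x_i\Big)\Big(\sum_{i}y_i\Big) - \frac{a}{N}\Big(\sum_{i}x_i\Big)^2 \\
\sum_{i} x_iy_i - \frac{1}{N}\Big(\sum_{i}x_i\Big)\Big(\sum_{i}y_i\Big) &= a\Big[\sum_{i}x_i^2 - \frac{1}{N}\Big(\sum_{i}x_i\Big)^2\Big]
\end{align*}

Dividing both sides by the bracket gives the boxed result, provided the bracket is nonzero, i.e. the $x_i$ are not all equal.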

So the exact solutions to the minimization problem are

\begin{align*}
a &= \frac{\sum_{i} x_iy_i - \frac{1}{N}(\sum_{i}x_i)(\sum_{i}y_i)}{\sum_{i}x_i^2 - \frac{1}{N}(\sum_{i}x_i)^2} \qquad \big[= \frac{Cov(X,Y)}{Var(X)} \big] \\
b &= \frac{1}{N}\sum_{i}y_i - \frac{a}{N}\sum_{i}x_i \qquad[ = E[Y] - aE[X]]
\end{align*}
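
The closed-form expressions can be checked numerically. The sketch below assumes NumPy is available; `fit_ols_1d` and the data are hypothetical names and values chosen for illustration. It computes $a$ and $b$ from the sums, then compares them with `np.polyfit` and with the $Cov(X,Y)/Var(X)$ form.

```python
import numpy as np

def fit_ols_1d(x, y):
    """Closed-form 1-D OLS: returns (a, b) for y_hat = a*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx, sy = x.sum(), y.sum()
    # a = (sum xy - (1/N) sum x sum y) / (sum x^2 - (1/N) (sum x)^2)
    a = (np.sum(x * y) - sx * sy / n) / (np.sum(x ** 2) - sx ** 2 / n)
    # b = mean(y) - a * mean(x)
    b = sy / n - a * sx / n
    return a, b

# Made-up data, roughly following y = 2x.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

a_hat, b_hat = fit_ols_1d(x_demo, y_demo)

# Cross-check 1: np.polyfit returns [slope, intercept] for deg=1.
a_ref, b_ref = np.polyfit(x_demo, y_demo, deg=1)

# Cross-check 2: slope as Cov(X, Y) / Var(X), both normalized by N.
a_cov = np.cov(x_demo, y_demo, bias=True)[0, 1] / np.var(x_demo)

print(a_hat, b_hat)
```

All three routes should agree up to floating-point error, since they are the same formula written in different ways.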

### Interpretation

* Interpretation of $a$ and $b$
    * $a$ – slope of the line (the change in $\hat{y}$ per unit change in $x$, i.e. the size of the relationship between $X$ and $Y$)
    * $b$ – intercept (the value of $\hat{y}$ when $x=0$). Sometimes we center the independent variable, $x$, by subtracting the mean, $\bar{x}$, from all $x$. Then $b$ becomes the value of $\hat{y}$ when $x=\bar{x}$.
* Interpretation of $L$, Loss function
    * How well the model is capturing the variation in the output variable. That quantity is usually represented by $R^2$.
    * $\boxed{R^2 = 1 - \frac{L}{N \cdot Var(y)} = 1 - \frac{MSE}{Var(y)} = 1 - \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2}}$, where $MSE = L/N$.
    * For an OLS fit with an intercept, evaluated on the data it was fitted to, $R^2 \in [0,1]$. The higher the $R^2$, the better the model fits the observed data. On other data, or for a model fitted some other way, $R^2$ can be less than $0$, which means that the model performs worse than simply predicting the average $\bar{y}$. In that case, we should throw away the model and use the average instead.
    * **Caution**. One shouldn't use $R^2$ blindly to judge the performance of the model. For example, the training $R^2$ of an OLS model never decreases when more features are added, even if they are irrelevant.
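
The sum-of-squares ratio form of $R^2$ can be sketched in a few lines. The helper `r_squared` and the toy vectors are assumptions made for illustration; the three cases show a perfect fit, the "predict the average" baseline, and a model that does worse than the baseline.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - sum (y_i - y_hat_i)^2 / sum (y_i - mean(y))^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares (the loss L)
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
    return float(1.0 - ss_res / ss_tot)

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_demo, y_demo)                       # predictions equal y
r2_mean = r_squared(y_demo, np.full(4, y_demo.mean()))       # always predict the mean
r2_bad = r_squared(y_demo, y_demo[::-1])                     # reversed predictions
print(r2_perfect, r2_mean, r2_bad)
```

The first case gives $1$, the second gives $0$, and the third is negative, matching the interpretation that a negative $R^2$ means the model is worse than the simple average.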




### Multivariate Ordinary Least Squares (OLS)

This applies when we have more than one independent variable (feature).

* $i = 1, 2, ..., N$ observations
* $\vec{x_i} = [x_{i1}, x_{i2}, ..., x_{iM}]$. $M$ features, independent variables
* $y_i$ - dependent variable/output 

The linear relationship looks like this  

\begin{equation*}
\hat{y_i} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_M x_{iM} = \vec{x_i} \beta
\end{equation*}

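where $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ ... \\ \beta_M\end{pmatrix}$. Note that, for the product $\vec{x_i} \beta$ to include the intercept $\beta_0$, we prepend a constant $1$ to the feature vector: $\vec{x_i} = [1, x_{i1}, x_{i2}, ..., x_{iM}]$.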

In matrix form, we can rewrite this model as 

\begin{align*}
& \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N\end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,M} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,M} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & x_{N,2} & \cdots & x_{N,M} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M\end{pmatrix} \\
& N \times 1 \qquad\qquad\quad N \times (M+1) \qquad\qquad (M+1) \times 1
\end{align*}

\begin{equation*}
\boxed{\hat{Y} = X \beta}
\end{equation*}

Here $\hat{Y}$ is the vector of estimates; the observed outputs are collected in the vector $Y = (y_1, y_2, ..., y_N)^T$.

The loss function is 

\begin{align*}
L &= \sum_{i=1}^{N} (y_i - \hat{y_i})^2 \\
  &= \sum_{i=1}^{N} (y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_M x_{iM}))^2 \\
  &= \sum_{i=1}^{N} (y_i - \vec{x_i} \beta)^2 \\
  &= \Vert Y - X \beta\Vert_2^2 \\
  &= (Y -X \beta)^T(Y - X\beta)
\end{align*}

where $\Vert . \Vert_2$ is the $L2$ norm of the vector.
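
A small sketch of the matrix form, assuming NumPy: it builds the design matrix by adding the column of ones to the features and checks that the matrix expression of the loss agrees with the element-wise sum. The helper names and the numbers are hypothetical.

```python
import numpy as np

def add_intercept_column(features):
    """Prepend a column of ones so that beta_0 acts as the intercept."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])     # shape (N, M + 1)

def matrix_loss(X, Y, beta):
    """L = (Y - X beta)^T (Y - X beta)."""
    residual = Y - X @ beta
    return float(residual @ residual)

# Made-up example with N = 3 observations and M = 2 features.
features_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_demo = add_intercept_column(features_demo)
beta_demo = np.array([1.0, 2.0, 3.0])      # [beta_0, beta_1, beta_2]
Y_demo = np.array([9.0, 19.0, 30.0])

loss_matrix = matrix_loss(X_demo, Y_demo, beta_demo)

# Same loss written as the sum over observations.
loss_sum = float(sum((y_i - x_i @ beta_demo) ** 2
                     for x_i, y_i in zip(X_demo, Y_demo)))
print(X_demo.shape, loss_matrix, loss_sum)
```

Here only the last observation is off by one unit, so both forms of the loss evaluate to $1$.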

### Optimization

Recall $L = (Y -X \beta)^T(Y - X\beta)$.

\begin{align*}
\frac{dL}{d\beta} &= 0 \\
\frac{d\left[(Y -X \beta)^T(Y - X\beta)\right]}{d\beta} &= 0 \\
\frac{d(Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta)}{d\beta} &= 0 \\
-(X^TY) - (X^TY) + 2 (X^TX)\beta &= 0 \\
X^TY &= (X^TX)\beta \\
&\boxed{\beta = (X^TX)^{-1}X^TY}
\end{align*}

assuming that $X^TX$ is invertible, which happens whenever $X$ has full column rank. In our case that means that the columns of $X$ (the constant column and the $M$ features) must be linearly independent, which in particular requires $N \geq M+1$.
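
The boxed solution can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: the data are synthetic, `fit_ols` is a hypothetical helper, and the system $(X^TX)\beta = X^TY$ is solved with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice. `np.linalg.lstsq` is used as an independent cross-check.

```python
import numpy as np

def fit_ols(X, Y):
    """Solve the normal equations (X^T X) beta = X^T Y for beta."""
    XtX = X.T @ X
    XtY = X.T @ Y
    return np.linalg.solve(XtX, XtY)

# Synthetic data (an assumption of this sketch): 50 observations, 2 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X_demo = np.column_stack([np.ones(50), features])   # column of ones for beta_0
beta_true = np.array([1.0, -2.0, 0.5])
Y_demo = X_demo @ beta_true + 0.1 * rng.normal(size=50)

beta_hat = fit_ols(X_demo, Y_demo)

# Cross-check against NumPy's least-squares solver.
beta_ref, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

# At the minimum the gradient -2 X^T (Y - X beta) vanishes.
gradient = -2.0 * X_demo.T @ (Y_demo - X_demo @ beta_hat)
print(beta_hat)
```

With small noise the estimate should land close to `beta_true`, and the gradient at `beta_hat` should be zero up to rounding.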

### OLS Pros and Cons

**Pros**
* Efficient computation
* Unique minimum
* Stable under perturbation of data
* Easy to interpret

**Cons**
* Influenced by outliers
* $X^TX$ has to be invertible
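
Both cons can be seen on toy data. The sketch below uses made-up values and NumPy only: it first shifts a single observation to show how much one outlier moves the fitted slope, then builds a design matrix with a duplicated (scaled) feature to show the rank deficiency that makes $X^TX$ singular.

```python
import numpy as np

# Con 1: influence of outliers. Points lie exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)

# Shift only the last observation upward by 50 units.
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

# Con 2: linearly dependent columns. The third column is 2 * the second.
X_dup = np.column_stack([np.ones_like(x), x, 2.0 * x])
rank_dup = np.linalg.matrix_rank(X_dup)    # smaller than the number of columns

print(slope_clean, slope_outlier, rank_dup)
```

A single shifted point more than doubles the fitted slope here, because squared errors weight large residuals heavily. In the second case the rank is below the number of columns, so $X^TX$ has no inverse and the closed-form solution does not apply as written.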

## Linear Regression: A Probabilistic Approach

* The definition of a probabilistic approach
* Maximum Likelihood Estimate (MLE)
* MLE for the case of continuous probability with probability density function
* Gaussian MLE
* OLS as MLE
* Reformulate OLS model
* OLS likelihood function
* Minimizing MSE = Maximizing likelihood
* Improvements in MLE

