# Week 2: Linear Regression - Theory

# 1. What is linear regression?

Linear Regression is a highly interpretable, standard method for modeling the <b>past relationship</b> between <b>independent input variables ($x$)</b> and <b>dependent output variables ($y$)</b> to help predict future values of the output variables ($y$).

# 2. When is linear regression used?

Linear Regression is commonly used for <b>predictive analysis</b>, e.g. linear regression can be used to predict house prices based on the square footage or proximity to a shopping centre.

# 3. What are the ingredients for linear regression?

Linear Regression has <b>four</b> ingredients.  These are:

1. A <b>training dataset</b> of <b>independent</b> $x$ values mapped to corresponding <b>dependent</b> $y$ values.


2. A <b>hypothesis</b>: a function that takes an input ($x$) and maps it to a predicted output ($y$).


3. A <b>cost function</b> to score the accuracy of the hypothesis.


4. A <b>gradient descent function</b> to minimise the cost function and thereby optimise the hypothesis.

The <b>goal</b> of linear regression is to use (4) to minimise (3) in order to identify the best version of (2), and thereby obtain a model that predicts $y$ values as accurately as possible for given input values of $x$.

These are explained below, and later implemented using Python.

## 3.1. Training Data

For any ML problem we need training data, i.e. data that we can examine and model in order to produce an outcome.  In <b>supervised</b> ML techniques, such as linear regression, we have a <b>labelled</b> training dataset.  

This means we have a set of input variables ($x$) plus their corresponding output variables ($y$).  

We use this to "<b>train</b>" our hypothesis so that it can predict $y$ values for new $x$ values, e.g. house price based on square footage.

## 3.2. The hypothesis

The hypothesis for linear regression is usually described as $y = mx + b$:

<img src="../Images/linRegModel.png" width=100%/>

Mathematically, linear regression is usually presented as the below, which is configured for a single input feature, $x_1$:

\begin{align}
h_\theta(x) = \theta_0 + \theta_1x
\end{align}

### 3.2.1. What is this formula doing?

* $\theta$ represents each <b>parameter</b> (also known as "<b>weights</b>" and "<b>model coefficients</b>").  These values are "<b>learned</b>" during the model fitting / training step.


* $\theta_0$ is the y axis intercept.  It is sometimes represented simply as "b", as in the diagram above.


* $\theta_1$ is the parameter for $x_1$ (the <b>first feature</b>) and so on (if we have more than one feature).  It is sometimes represented as "m" as in the diagram above, and is the <b>slope</b> of the line.

### 3.2.2. Some examples in action

#### <center> h(x) = 1.5 + 0x
<img src="../Images/linreggraph1.png" width=40%/>

#### <center> h(x) = 1 + 1.5x
<img src="../Images/linreggraph2.png" width=40%/>
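As a minimal sketch of the hypothesis in Python (the function name `hypothesis` is an illustrative assumption, not something defined elsewhere in these notes), the two example lines above can be evaluated like so:

```python
def hypothesis(x, theta_0, theta_1):
    """Return the predicted y for input x: h(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x


# First example: h(x) = 1.5 + 0x is a flat line, so every x maps to 1.5.
flat = [hypothesis(x, 1.5, 0.0) for x in (0, 1, 2)]

# Second example: h(x) = 1 + 1.5x rises by 1.5 for every unit increase in x.
sloped = [hypothesis(x, 1.0, 1.5) for x in (0, 1, 2)]

print(flat)    # [1.5, 1.5, 1.5]
print(sloped)  # [1.0, 2.5, 4.0]
```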

## 3.3. Cost Function (aka Loss Function)

### 3.3.1. What is the cost function?

It measures how <b>good</b> or <b>bad</b> our hypothesis is at predicting $y$ values for a given $x$ value.  It does so by scoring how closely fitted the hypothesis line is to the actual data points.  

A commonly used cost function is the <b>mean squared error function</b> ("MSE"):

\begin{align}
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2
\end{align}

Whereby:

* The <b>smaller</b> the MSE, the <b>closer</b> the fit is to the data.


* The <b>larger</b> the MSE, the <b>looser</b> the fit is to the data.

In doing so, the cost function helps us achieve our primary goal, i.e. to find the values of $\theta$ so that $h_\theta(x)$ outputs $y$ values that best describe the corresponding input values $x$. 

### 3.3.2. What is the goal of our cost function?

From a mathematical point of view, for each $i^{th}$ point in my data set, I want the difference between my predicted $y$ value and actual $y$ value for a given input $x$ (i.e. as described by: $h_\theta(x^{(i)}) - y^{(i)}$), to be very small (ideally as small as possible).  In doing so, this means the corresponding hypothesis parameters will be set to optimise the hypothesis such that it more accurately predicts $y$ values for given $x$ values.

### 3.3.3. What is this formula doing?

1. The formula iterates through each $(x, y)$ point (i.e. each training example) in our data set and sums ($\sum_{i=1}^{m}$) the squared distances (i.e. everything in parentheses) between the candidate line's <b>predicted</b> $y$ value (i.e. $h_\theta(x^{(i)})$) and each point's <b>actual</b> $y$ value (i.e. $y^{(i)}$).  


2. The $\frac{1}{2m}$ takes the <b>mean</b> of the result from (1) and halves it.


3. Note that the $2$ is added to ease calculations in further steps, and that the <b>squaring</b> is done so negative values do not cancel positive values.


4. Finally, we can expand the cost function as follows:

\begin{align}
h_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}
\end{align}

5.  Which means we can write the MSE as:

\begin{align}
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\theta_0 + \theta_1x^{(i)} - y^{(i)})^2
\end{align}

Graphically this can be represented like so:

<img src="../Images/LossSideBySide.png" width=40%/>

The above demonstrates <b>high loss</b> for the left model, i.e. greater average distances between actual $y$ values (yellow nodes) and predicted $y$ values (corresponding points on the blue line).  Conversely, the right model demonstrates <b>low loss</b>, i.e. smaller average distances between actual $y$ values and predicted $y$ values.
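The cost function can be sketched in a few lines of Python. The function name `compute_cost` and the training data are hypothetical, chosen only so that one candidate line fits perfectly and another fits badly:

```python
def compute_cost(xs, ys, theta_0, theta_1):
    """Half mean squared error J(theta_0, theta_1) over the training set."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta_0 + theta_1 * x   # h_theta(x) for this example
        total += (prediction - y) ** 2       # squared error for this example
    return total / (2 * m)


# Hypothetical training data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

low_loss = compute_cost(xs, ys, 0.0, 2.0)   # perfect fit
high_loss = compute_cost(xs, ys, 0.0, 0.5)  # line is far too shallow

print(low_loss)   # 0.0
print(high_loss)  # 8.4375
```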

### 3.3.4. A simple example of the cost function in action

Let's begin with the below dataset:

<img src="../Images/linregsimple1.png" width=40%/>

To keep things simple, and because we can see that the data passes through (0, 0), we set $\theta_0 = 0$ and look at only a single parameter, $\theta$ (the slope).  Therefore our objective is to minimise $J(\theta)$.

#### <center> J(1.0) = 0

<img src="../Images/linregsimple2.png" width=40%/>

#### <center> J(0.5) = 0.58

<img src="../Images/linregsimple3.png" width=40%/>

#### <center> J(other values)

<img src="../Images/linregsimple4.png" width=40%/>

### 3.3.5. Plotting the cost function values for each theta value

<img src="../Images/linregsimple5.png" width=40%/>

The pink line visualises the cost function values for each value of $\theta$.  We can see that $J(\theta)$ is at a minimum when $\theta = 1$.  Therefore the winning hypothesis is $h_\theta(x) = 1x$, i.e. the orange line.
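The same sweep over candidate $\theta$ values can be reproduced in Python. This is only a sketch: the dataset is assumed to be the three points (1, 1), (2, 2) and (3, 3), which is consistent with the $J(1.0) = 0$ and $J(0.5) = 0.58$ values quoted above, and `cost_single_theta` is a hypothetical helper name.

```python
def cost_single_theta(theta, xs, ys):
    """J(theta) for the one-parameter hypothesis h(x) = theta * x."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed dataset: three points on the line y = x.
xs = [1, 2, 3]
ys = [1, 2, 3]

candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
costs = {theta: cost_single_theta(theta, xs, ys) for theta in candidates}

# The winning theta is the one with the smallest cost.
best_theta = min(costs, key=costs.get)

for theta, cost in costs.items():
    print(f"J({theta}) = {cost:.2f}")
print("best theta:", best_theta)  # 1.0
```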

Note also that where we have more than one $\theta$ value, we can plot the cost function values against corresponding theta values in multidimensional space, e.g.

<img src="../Images/Image[16].png"/>

## 3.4. Gradient Descent

### 3.4.1. What is gradient descent?

Gradient descent is an algorithm that automatically minimizes the cost function, thereby identifying the best values for $\theta$, which in turn optimise our initial hypothesis so that it predicts $y$ more accurately for a given input $x$.

By analogy with the 3D diagram above, we can think of gradient descent as finding the optimal path from the top of a mountain down to the lowest point in the surrounding valleys.

The gradient descent function is usually presented per the below (explained later):

\begin{align}
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\end{align}

### 3.4.2. Minimums and Maximums

Returning to the mountain valley analogy, we can describe the various peaks and troughs as follows:

- The <b>local minimum</b> is a point where the function evaluates to a <b>greater</b> value at every other point in a <i>neighbourhood</i> around the local minimum.


- The <b>global minimum</b> is a point where the function evaluates to a <b>greater</b> value at every other point across the entire function.


- The <b>local maximum</b> is a point where the function evaluates to a <b>lower</b> value at every other point in a <i>neighbourhood</i> around the local maximum.


- The <b>global maximum</b> is a point where the function evaluates to a <b>lower</b> value at every other point across the entire function.


- Visualised by example, these are as follows:

<p align = "center">
<img src="../Images\maxmin.png" width=40%>
</p>

### 3.4.3. Types of Gradient Descent

As with most things ML, there's more than one way to skin a cat.

1. <b>Batch</b>: is the <i>total</i> number of examples used to calculate the gradient in a single iteration of a gradient descent calculation.


2. <b>Stochastic</b>: adjective to describe something that was <i>randomly</i> determined.

#### Batch Gradient Descent

What we've actually covered is <b>Batch Gradient Descent</b>.  Batch Gradient Descent uses the <b>entire</b> set of training examples to calculate gradient descent in each iteration.   

#### Stochastic Gradient Descent

Stochastic gradient descent (aka <b>"SGD"</b>) uses a <b>single</b> training example to calculate gradient descent in each iteration. The particular training example for each iteration is chosen <b>randomly</b>, hence the term "Stochastic Gradient Descent".  

#### Mini-Batch Stochastic Gradient descent

Mini-Batch stochastic gradient descent uses a <b>subset</b> of the entire set of training examples to calculate gradient descent in each iteration.  The batch size of a mini-batch is usually between <b>10</b> and <b>1,000</b>.

#### When to use which type of Gradient Descent?

Batch gradient descent (in the sense of iterating on the entire set of training examples) works well for smaller more manageable datasets.  However, at huge scale (e.g. Google scale), data sets often contain billions or even hundreds of billions of examples.  Further, each example may contain a huge number of features.  Consequently, a batch can be enormous and too costly to compute.  In such scenarios, SGD or MBSGD are often used.

For instance, a large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
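The practical difference between the three variants is simply which training examples feed each gradient calculation. A rough sketch, assuming a hypothetical helper `pick_indices` and a made-up dataset size of 1,000 examples:

```python
import random


def pick_indices(m, variant, batch_size=32, rng=random):
    """Return the training-example indices used for one gradient step."""
    if variant == "batch":
        return list(range(m))                            # every example
    if variant == "stochastic":
        return [rng.randrange(m)]                        # one random example
    if variant == "mini-batch":
        return rng.sample(range(m), min(batch_size, m))  # random subset
    raise ValueError(f"unknown variant: {variant}")


m = 1000                # hypothetical number of training examples
rng = random.Random(0)  # seeded so the sketch is repeatable

batch = pick_indices(m, "batch", rng=rng)
single = pick_indices(m, "stochastic", rng=rng)
mini = pick_indices(m, "mini-batch", batch_size=32, rng=rng)

print(len(batch))   # 1000
print(len(single))  # 1
print(len(mini))    # 32
```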

### 3.4.4. How does gradient descent work?

#### The common sense idea

Essentially gradient descent automates this process:

1. Start with some values of the coefficients/parameters, eg. $\theta_0 = 0$, $\theta_1 = 0$.


2. Keep changing $\theta_0$ and $\theta_1$ in whatever direction (e.g. up or down) reduces the cost function, until it reaches a minimum.  

This process is often expressed as Min $J(\theta_0, \theta_1)$, i.e. minimising the cost function.

#### The gradient descent formula and how it works

The formula for gradient descent is usually expressed as:

\begin{align}
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\end{align}

$\theta_j$ means the $\theta$ value corresponding to the $j^{th}$ feature, e.g.


* $\theta_0$ = theta value corresponding to $x_0$


* $\theta_1$ = theta value corresponding to $x_1$

$:=$ means assignment.

$\alpha$ (alpha) is the <b>learning rate</b>.  It controls how big a step you take.  If $\alpha$ is large you take big steps up or down between $\theta$ values and vice versa.  Note:

<p align = "center">
<img src="../Images\graddescentlargesmall.png" width=50%>
</p>
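To see the effect of the learning rate in isolation, here is a small sketch that uses a toy cost function, $J(\theta) = \theta^2$ (an assumption for illustration only, not the regression cost function), whose derivative is $2\theta$ and whose minimum is at $\theta = 0$:

```python
def descend(alpha, theta=1.0, steps=20):
    """Run gradient descent on J(theta) = theta**2 (derivative: 2 * theta)."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


small_step = descend(alpha=0.1)  # shrinks steadily towards the minimum at 0
large_step = descend(alpha=1.1)  # overshoots the minimum further on every step

print(abs(small_step) < 0.05)  # True
print(abs(large_step) > 1.0)   # True
```

With $\alpha = 0.1$ each step multiplies $\theta$ by 0.8, so it converges; with $\alpha = 1.1$ each step multiplies $\theta$ by -1.2, so it bounces from side to side and moves away from the minimum.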

$\frac{\partial}{\partial \theta_j}$ means we take the <b>partial derivative</b> of $J(\theta_0, \theta_1)$, i.e. the cost function, with respect to $\theta_j$.  Expanded (i.e. substituting $\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$ for $J(\theta_0, \theta_1)$), it looks like this:

\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = 
\frac{\partial}{\partial \theta_j} \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2
\end{align}

Taking the derivative of the cost function as the above describes generates a new equation, which is:

\begin{align} 
\text{repeat until convergence \{} \\
\theta_j & := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ 
\text{\}}
\end{align}
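
The step from the cost function to this update rule is an application of the chain rule. As a sketch, assuming the usual convention that $x_0^{(i)} = 1$ so that $h_\theta(x^{(i)}) = \sum_j \theta_j x_j^{(i)}$ and therefore $\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}$:

\begin{align}
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 \\
&= \frac{1}{2m}\sum_{i=1}^{m} 2 (h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
&= \frac{1}{m}\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
\end{align}

The 2 produced by differentiating the square cancels the 2 in $\frac{1}{2m}$, which is why the cost function was defined with that extra factor. Multiplying this result by $\alpha$ and subtracting it from $\theta_j$ gives the update rule shown above.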

This new equation allows us to calculate the <b>gradient of the straight line tangent</b> to the cost function curve at the point corresponding to the current value of $\theta$:

<img src="../Images/linregderivative.png" width=40%/>

If the gradient is <b>positive</b> then the values of $\theta_0$ and $\theta_1$ are <b>reduced</b> to slide down the slope backwards to min $J(\theta_0, \theta_1)$.

<img src="../Images/gradDescentNegativeSlope.png" width=40%/>

If the gradient is <b>negative</b> then the values of $\theta_0$ and $\theta_1$ are <b>increased</b> to slide down the slope forwards to min $J(\theta_0, \theta_1)$.

<img src="../Images/gradDescentPositiveSlope.png" width=40%/>

Concretely, gradient descent works as follows: for every $\theta_j$ of the J function (i.e. $\theta_0, \theta_1,...\theta_n$) compute the value of $\theta_j$ by subtracting from itself the derivative of the function at point $\theta_j$ multiplied by a number $\alpha$ (i.e. what is described above).

Rinse and repeat the above until <b>convergence</b>, which means reaching the optimal result or very nearly the optimal result (i.e. the parameters have been adjusted so that the cost function output is at, or very close to, its minimum and further iterations barely change it).

#### Executing this formula in practice (and correctly!)

The correct way to implement gradient descent requires <b>simultaneous</b> updates of $\theta_j$.  That is:

1. To store each value in a temporary variable before assigning those temporary values to the actual $\theta_j$:

\begin{align}
temp_0 := \theta_0 - \alpha . \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1,... \theta_n)
\end{align}

\begin{align}
temp_1 := \theta_1 - \alpha . \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1,... \theta_n)
\end{align}

\begin{align}
...
\end{align}

\begin{align}
temp_n := \theta_n - \alpha . \frac{\partial}{\partial \theta_n} J(\theta_0, \theta_1,... \theta_n)
\end{align}

2. And <b>only after every temporary value has been calculated</b>, assign those temporary values to the actual $\theta_j$:

\begin{align}
\theta_0 := temp_0
\end{align}

\begin{align}
\theta_1 := temp_1
\end{align}

\begin{align}
...
\end{align}

\begin{align}
\theta_n := temp_n
\end{align}

This prevents you updating the value of $\theta_0$ then using the newly updated $\theta_0$ value when updating $\theta_1$ and so on!
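The difference between the two approaches is easy to see in code. The sketch below is illustrative only: the helper names (`gradients`, `simultaneous_step`, `sequential_step`) and the data are assumptions. It performs one correct simultaneous step and one incorrect sequential step from the same starting point.

```python
def gradients(theta_0, theta_1, xs, ys):
    """Partial derivatives of J with respect to theta_0 and theta_1."""
    m = len(xs)
    errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
    grad_0 = sum(errors) / m
    grad_1 = sum(e * x for e, x in zip(errors, xs)) / m
    return grad_0, grad_1


def simultaneous_step(theta_0, theta_1, xs, ys, alpha):
    """Correct update: both gradients are computed from the OLD theta values."""
    grad_0, grad_1 = gradients(theta_0, theta_1, xs, ys)
    temp_0 = theta_0 - alpha * grad_0
    temp_1 = theta_1 - alpha * grad_1
    return temp_0, temp_1


def sequential_step(theta_0, theta_1, xs, ys, alpha):
    """Incorrect update: theta_1's gradient sees the already-updated theta_0."""
    grad_0, _ = gradients(theta_0, theta_1, xs, ys)
    theta_0 = theta_0 - alpha * grad_0
    _, grad_1 = gradients(theta_0, theta_1, xs, ys)
    theta_1 = theta_1 - alpha * grad_1
    return theta_0, theta_1


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # hypothetical data on y = 2x
correct = simultaneous_step(0.0, 0.0, xs, ys, alpha=0.1)
wrong = sequential_step(0.0, 0.0, xs, ys, alpha=0.1)

print(correct)  # theta_1 is roughly 0.933
print(wrong)    # theta_1 is roughly 0.853, a different result
```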

### Useful Resources

- https://hackernoon.com/gradient-descent-aynk-7cbe95a778da
- https://www.internalpointers.com/post/gradient-descent-function 
- https://www.internalpointers.com/post/gradient-descent-action
- https://medium.com/ai-society/hello-gradient-descent-ef74434bdfa5
- https://hackernoon.com/life-is-gradient-descent-880c60ac1be8
- https://storage.googleapis.com/supplemental_media/udacityu/315142919/GradientDescent.pdf

## 3.5. Putting it all together

Therefore, the resulting linear regression algorithm is the following:

\begin{align} 
\text{repeat until convergence \{} \\
\theta_0 & := \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ 
\theta_1 & := \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \\\\
\text{\}}
\end{align}
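
A minimal Python sketch of this algorithm is shown below. The function name `fit_linear_regression`, the learning rate, the data, and the use of a fixed number of iterations in place of a real convergence check are all simplifying assumptions.

```python
def fit_linear_regression(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for h(x) = theta_0 + theta_1 * x."""
    m = len(xs)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        grad_0 = sum(errors) / m
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before either change.
        theta_0, theta_1 = theta_0 - alpha * grad_0, theta_1 - alpha * grad_1
    return theta_0, theta_1


# Hypothetical data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta_0, theta_1 = fit_linear_regression(xs, ys)
print(round(theta_0, 3), round(theta_1, 3))  # 1.0 2.0
```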


# 4. Feature Scaling

## 4.1. The Challenge

Gradient descent will converge:

1. <b>Quickly</b> where the ranges of the input features are <b>small</b> and similar to one another; and


2. <b>Slowly</b> where the ranges of the input features are <b>large</b> or very uneven.

For instance:

<p align = "center">
<img src="../Images\feature_scaling_mean_normalization0.png" width=40%>
</p>

Because the two ranges (i.e. square footage size and number of rooms) are wildly different, the resulting cost surface is skewed and narrow.  This makes gradient descent less efficient at finding the global minimum for θ<sub>0</sub> and θ<sub>1</sub>.

## 4.2. The Desired Solution

### 4.2.1. What are we trying to achieve?

To speed up gradient descent, we want to modify the range of our input variables so they are roughly the same, ideally: 

   (a) $-1 \leq x_i \leq 1$; or 
   
   (b) $-0.5 \leq x_i \leq 0.5$.

   
### 4.2.2. How can we achieve this?

Two techniques help with this: 

1. <b>Feature Scaling</b>; and 


2. <b>Mean Normalization</b>.

### 4.2.3. How do we implement this?

We implement both together, using this formula:

\begin{align}
x_i := \frac{x_i - \mu_i}{s_i}
\end{align}

Where:

1. $x_i$ is the <b>original value</b> of $x_i$ that is updated (i.e. as denoted by $:=$).


2. $\mu_i$ is the <b>mean</b> of all values of the input variable $x_i$.


3. $s_i$ is the <b>range</b>, i.e. difference between the <b>highest</b> $x$ value and the <b>lowest</b> $x$ value, e.g. $s_i = x_{max} - x_{min}$


4. $x_i - \mu_i$ implements <b>Mean Normalisation</b>, i.e. <b>subtracting</b> the average value of an input variable from each value of that input variable, resulting in a new average value for the input variable of zero.


5. Dividing by $s_i$ implements <b>Feature Scaling</b>, i.e. <b>dividing</b> the input values by the range (i.e. the maximum value minus the minimum value) of the input variables, resulting in a new range of just 1.
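
The formula above can be sketched in Python as follows. The helper name `scale_feature` and the two feature columns are hypothetical, and the sketch assumes the feature is not constant (otherwise $s_i$ would be zero).

```python
def scale_feature(values):
    """Apply x := (x - mean) / (max - min) to one feature column."""
    mu = sum(values) / len(values)   # mean, used for mean normalisation
    s = max(values) - min(values)    # range, used for feature scaling
    return [(v - mu) / s for v in values]


# Hypothetical feature columns with very different ranges.
square_feet = [1000.0, 1500.0, 2000.0, 3000.0]
rooms = [1.0, 2.0, 3.0, 5.0]

scaled_feet = scale_feature(square_feet)
scaled_rooms = scale_feature(rooms)

print(scaled_feet)   # [-0.4375, -0.1875, 0.0625, 0.5625]
print(scaled_rooms)  # [-0.4375, -0.1875, 0.0625, 0.5625]
```

After scaling, both columns have a mean of zero and a range of exactly 1, even though the raw values differed by three orders of magnitude.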

### 4.2.4. How does this help?

Doing this for the above example results in this:

<p align = "center">
<img src="../Images\feature_scaling_mean_normalization0b.png" width=60%>
</p>

This has the following effects:

1. Transforms both features $x_i$ to a comparable range of width 1 centred on a mean of zero, so every value lies within $-1 \leq x_i \leq 1$.


2. It speeds up gradient descent by making it require <b>fewer</b> iterations to get to a good solution.


3. It does <b>not</b> speed up solving for $\theta$ using the normal equation.  The magnitude of the feature values is <b>insignificant</b> in terms of computational cost.


4. Does <b>not</b> prevent the matrix $X^T X$ (used in the normal equation) from being non-invertible (singular / degenerate).


5. Is <b>not</b> necessary to prevent gradient descent from getting stuck in local optima.  The cost function $J(\theta)$ for linear regression has no local optima. 

# 5. Normal Equation

## 5.1. What is the "Normal Equation"?

In Gradient Descent for Linear Regression, we iteratively minimize cost function $J(\theta)$ to converge to the global minimum.

In contrast, the <b>Normal Equation</b> would give us a method to solve for $\theta$ analytically.  In other words, rather than needing to run an iterative algorithm, we can instead solve for the optimal value for $\theta$ all at one go, <b>i.e. in one step you get to the optimal value</b>.

## 5.2. How does the normal equation work?

<p align = "center">
<img src="../Images\normalequation3.png" width=70%>
</p>

To minimize a function, you take the derivative function and set it to zero.  So in the above:

#### Where $\theta$ is a single <u>Real</u> number

The bit in blue handwriting minimizes the quadratic function (shown in the "U" shaped graph) where $\theta$ is a single real number (a scalar).

#### Where $\theta$ is <u>not</u> a single Real number

1. In the above, the bit circled yellow describes where $\theta$ is <u>not</u> a single Real number but a vector of Real numbers, i.e. one $\theta_j$ per parameter.


2. As such, to minimize the cost function $J(\theta)$ (i.e. bit in green and bit in pink), we need to:

  (a) take the derivative of $J(\theta)$, with respect to every parameter of $\theta_j$;

  (b) set all of these values to $0$; and

  (c) solve for the values of $\theta_0$, $\theta_1$ etc. up to $\theta_n$, which gives you the values of $\theta$ that minimise $J(\theta)$.

## 5.3. Example of Normal Equation

Here we are trying to predict house prices based on four features: (1) Size, (2) No. of Bedrooms, (3) No. of floors and (4) Age.

<p align = "center">
<img src="../Images\normalequation4.png" width=70%>
</p>

In the above, the part circled <b>red</b>, is the formula that provides the value of $\theta$ that minimizes your cost function.

<p align = "center">
<img src="../Images\normalequation7.png" width=70%>
</p>
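
The normal equation is conventionally written as $\theta = (X^T X)^{-1} X^T y$, where $X$ is the matrix of training inputs (with a leading column of ones for $\theta_0$) and $y$ is the vector of outputs. Below is a minimal NumPy sketch on a small hypothetical single-feature dataset (not the house price data in the images above):

```python
import numpy as np

# Hypothetical training data generated from y = 1 + 2x (single feature).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Design matrix X: a column of ones (x_0 = 1, for theta_0) next to the feature.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(np.round(theta, 6))  # approximately [1. 2.]
```

Note there is no learning rate and no loop: $\theta$ is found in a single step. `np.linalg.solve` will raise an error if $X^T X$ is singular, in which case a pseudo-inverse (`np.linalg.pinv`) is a common fallback.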

## 5.4. When to use Normal Equation vs. Gradient Descent?

With the normal equation, computing the inversion has complexity $O(n^{3})$.

So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.

Other distinctions in the below comparison table:

<p align = "center">
<img src="../Images\normalequation6.png" width=50%>
</p>

# 6. Polynomial Regression

## 6.1. Math Refresh: What is a "Polynomial"

A polynomial is simply a function that has <b>one or more</b> terms.  Specifically, a polynomial can have:

1. <b>Constants</b>, e.g. 3, -20 or 1/2;


2. <b>Variables</b>, e.g. x and y; and


3. <b>Exponents</b>, e.g. the 2 in x<sup>2</sup> (but only 0, 1, 2, 3, ...),

that can be combined using <b>addition</b>, <b>subtraction</b>, <b>multiplication</b> and <b>division</b> <u>EXCEPT</u> <b>division by a variable</b>.

### 6.1.1. Things that <u>are</u> Polynomials

1. 3x


2. x − 2


3. −6y<sup>2</sup> − (<sup>7</sup>/<sub>9</sub>)x


4. 3xyz + 3xy<sup>2</sup>z − 0.1xz − 200y + 0.5


5. 512v<sup>5</sup> + 99w<sup>5</sup>


6. 5


7. <sup>x</sup>/<sub>2</sub> is allowed because you can divide by a constant


8. <sup>3x</sup>/<sub>8</sub>, also for same reason as (7)


9. √2 is allowed because it is a constant

### 6.1.2. Things that are <u>not</u> Polynomials

1. 3xy<sup>-2</sup> is not, because the exponent is "-2"

    (exponents can only be 0,1,2,... and negative exponents flip the base, e.g. x<sup>-2</sup> == <sup>1</sup>/<sub>x<sup>2</sup></sub>)
  

2. <sup>2</sup>/<sub>(x+2)</sub> is not, because dividing by a variable is not allowed



3. <sup>1</sup>/<sub>x</sub> is not either


4. √x is not, because the exponent is "½" (see fractional exponents)

## 6.2. What is Polynomial Regression?

Polynomial Regression is an attempt to create a <b>polynomial function</b> that approximates a set of data points.

## 6.3. When to use Polynomial Regression?

Polynomial Regression should be used when the data does not appear to follow a straight line, e.g.

<p align = "center">
<img src="../Images\polynomialregression1.png" width=40%>
</p>

For instance:

<p align = "center">
<img src="../Images\polynomialregression2.png" width=40%>
</p>

## 6.4. How to use Polynomial Regression?

To map our old linear hypothesis and cost functions to a polynomial set of variables, we create new features from powers of x:

1. x<sub>1</sub> = x


2. x<sub>2</sub> = x<sup>2</sup>


3. x<sub>3</sub> = x<sup>3</sup>
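
Because the hypothesis is still linear in $\theta$, the same fitting machinery applies once the powers of x are treated as separate features. A minimal NumPy sketch, assuming hypothetical data generated from a known cubic and using least squares (`np.linalg.lstsq`) as the fitting step:

```python
import numpy as np

# Hypothetical data generated from the cubic y = 2 - x + 0.5 x^3.
x = np.linspace(-2.0, 2.0, 9)
y = 2.0 - x + 0.5 * x**3

# New features: x, x^2 and x^3, plus a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# The hypothesis is linear in theta, so ordinary least squares applies.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(theta, 6))  # approximately [2, -1, 0, 0.5]
```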

## 6.5. How does Polynomial Regression impact Feature Scaling?

Polynomial Regression makes Feature Scaling more important because it <b>exaggerates</b> features, e.g.

1. x<sub>1</sub> = x, then possible range for x<sub>1</sub> = 1 - 1,000


2. x<sub>2</sub> = x<sup>2</sup>, then possible range for x<sub>2</sub> = 1 - 1,000,000


3. x<sub>3</sub> = x<sup>3</sup>, then possible range for x<sub>3</sub> = 1 - 1,000,000,000
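
A short sketch makes the effect concrete. The raw inputs and the helper name `scale_feature` are hypothetical; the scaling itself is the mean normalisation and range scaling formula from the Feature Scaling section.

```python
def scale_feature(values):
    """Mean normalisation and range scaling: (x - mean) / (max - min)."""
    mu = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - mu) / s for v in values]


xs = [1.0, 10.0, 100.0, 1000.0]  # hypothetical raw inputs
features = {
    "x1": xs,
    "x2": [x**2 for x in xs],
    "x3": [x**3 for x in xs],
}

# Range of each feature before and after scaling.
raw_ranges = {name: max(col) - min(col) for name, col in features.items()}
scaled_ranges = {
    name: max(scale_feature(col)) - min(scale_feature(col))
    for name, col in features.items()
}

print(raw_ranges)     # ranges of 999, 999999 and 999999999
print(scaled_ranges)  # every scaled range is approximately 1
```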

## 6.6. Additional Resources

- https://www.mathsisfun.com/definitions/polynomial.html

- http://polynomialregression.drque.net/math.html


# 7. Useful Resources

- https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
- https://developers.google.com/machine-learning/crash-course/descending-into-ml/linear-regression
- https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss
- https://medium.com/simple-ai/linear-regression-intro-to-machine-learning-6-6e320dbdaf06 
- http://onlinestatbook.com/2/regression/intro.html 
- https://medium.com/@lachlanmiller_52885/machine-learning-week-1-cost-function-gradient-descent-and-univariate-linear-regression-8f5fe69815fd
- https://hackernoon.com/machine-learning-bit-by-bit-multivariate-gradient-descent-e198fdd0df85
- https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer