# Module 23 topic review

## Derivatives

$f'(x) = \dfrac{\Delta y}{\Delta x} =  \dfrac{f(x + \Delta x) - f(x)}{\Delta x}$

- A derivative is the instantaneous rate of change of a function (i.e. the slope of a curve at a single point).
- If the rate of change is measured over an interval (e.g. a time-frame), the derivative is the *limit* that rate approaches as the interval shrinks to zero.

### Derivatives of linear functions

The derivative of a linear function is simply its slope: the change in $y$ divided by the change in $x$ between any two points on the line.  
$$\large f'(x) = \dfrac{\Delta y}{\Delta x} = \dfrac{y_{2}-y_{1}}{x_{2}-x_{1}}$$  
<img src="images/linear_derivative.png">

### Derivatives of non-linear functions

Derivatives of non-linear functions are a little trickier because the slope changes from point to point: the straight line between any two points on the curve only approximates the curve between them.  

To deal with this issue we take the *limit* of that slope, which is the value it approaches as the distance between the two points approaches zero. 

So the derivative formula for non-linear functions is then defined as:  
$$ f'(x) = \displaystyle {\lim_{ \Delta x \to 0}} \frac{f(x + \Delta x) - f(x)}{\Delta x} $$  
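As a quick numerical check of the limit definition, the sketch below evaluates the difference quotient for $f(x) = x^2$ at $x = 3$ with progressively smaller $\Delta x$. The helper name `difference_quotient`, the example function, and the chosen point are assumptions made for illustration only.

```python
def difference_quotient(f, x, delta_x):
    """Slope of the straight line through (x, f(x)) and (x + delta_x, f(x + delta_x))."""
    return (f(x + delta_x) - f(x)) / delta_x


def f(x):
    return x ** 2


# As delta_x shrinks, the slope approaches the derivative f'(3) = 2 * 3 = 6.
for delta_x in (1.0, 0.1, 0.01, 0.001):
    print(delta_x, difference_quotient(f, 3, delta_x))
```

For this particular function the quotient simplifies to $6 + \Delta x$, so the printed values close in on 6 as $\Delta x$ gets smaller.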

<img src = "images/non-linear-derivatives.png">

### Rules for calculating derivatives

**The power rule**
-  If a variable $x$ is raised to an exponent $r$, then the derivative of that function is the exponent $r$ multiplied by the variable raised to the original exponent minus one.
- given that: $ f(x) = x^r $
- therefore: $ f'(x) = r*x^{r-1} $
- example . . .
    - given the function: $f(x) = x^2 $
    - therefore: $f'(x) = 2*x^{2-1} = 2*x^1 = 2*x $  

**The constant factor rule**
- If a function is multiplied by a constant, take the derivative of the function first and then multiply the result by the constant.
- the general case is defined as: $\frac{d}{dx}(a*f(x)) = a * \frac{d}{dx}f(x) $
- example . . . 
    - given the function: $f(x) = 2x^2 $
    - therefore: $f'(x) = 2*\frac{d}{dx} x^{2} = 2*2*x^{2-1} = 4x^1 = 4x $

**The addition rule**
- To take the derivative of a function that has multiple terms, simply take the derivative of each term individually and add (or subtract) the results.
- example:
    - given the function: $ f(x) = 4x^3 - x^2 + 3x $
    - therefore: $ f'(x) = 12x^2 - 2x + 3  $
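The three rules can be combined in a few lines of code. The sketch below is one possible way to do it, assuming a polynomial is represented as a list of `(coefficient, exponent)` pairs; that representation and the function name `differentiate` are choices made here for illustration.

```python
def differentiate(terms):
    """Differentiate a polynomial given as a list of (coefficient, exponent) pairs.

    Each term a * x**r becomes (a * r) * x**(r - 1): the power rule combined with
    the constant factor rule. Handling the terms one by one is the addition rule.
    """
    derivative = []
    for coefficient, exponent in terms:
        if exponent == 0:
            continue  # the derivative of a constant term is zero
        derivative.append((coefficient * exponent, exponent - 1))
    return derivative


# f(x) = 4x^3 - x^2 + 3x
f_terms = [(4, 3), (-1, 2), (3, 1)]
print(differentiate(f_terms))  # [(12, 2), (-2, 1), (3, 0)], i.e. 12x^2 - 2x + 3
```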

A visualization of derivatives calculated at various values for x on a non-linear function
<img src = "images/tangent-lines.png">

### Minima and maxima

The minima and maxima of a non-linear function can be simply understood as the troughs and peaks of its curve.  

Given a non-linear function $f(x)$, the value of $x$ at one of the function's minima (or maxima) is a value of $x$ where $f'(x) = 0$, because the curve is flat at the bottom of a trough or the top of a peak.

An example:

- given the function: $f(x) = 2x^2-8x$
- the derivative is: $f'(x) = 4x - 8$
- so the optimum x-value is where $f'(x) = 4x - 8 = 0$
- which simplifies to $x = 2$
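The example can be checked directly in code. This is a minimal sketch: the function names `f` and `f_prime` are assumptions, and the root of $f'(x)$ is solved by hand rather than with a solver.

```python
def f(x):
    return 2 * x ** 2 - 8 * x


def f_prime(x):
    return 4 * x - 8


# Solve f'(x) = 4x - 8 = 0 by hand: x = 8 / 4
x_opt = 8 / 4

print(f_prime(x_opt))  # 0.0
print(f(x_opt))        # -8.0

# Nearby points on either side give larger values of f, so x = 2 is a minimum.
print(f(x_opt - 0.5), f(x_opt + 0.5))  # -7.5 -7.5
```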

visualized . . .   
<img src = "images/minima.png">

## Gradient Descent

### Overview
Gradient descent is an optimization algorithm used to find the values of a function's coefficients that minimize the related cost function. 

1: begin with a regression line with a best-guess for $m$ and $b$  
2: calculate the residual sum of squares $(RSS)$  
3: adjust $m$ (the slope) and/or $b$ (the y-intercept)  
4: re-calculate $RSS$  
5: repeat process . . .  
6: select the values for $m$ and $b$ where $RSS$ is the least.  
    - The **cost function** is $RSS$ viewed as a function of the parameters: it describes how the error changes as the input ($m$ or $b$) changes

The amount by which a parameter is changed is referred to as the *step size*.  
The sign and steepness of the cost curve's slope help determine the appropriate direction and size of the step.  
- If the slope is negative, the next step should increase the parameter (i.e. move in the positive direction along the x-axis); if the slope is positive, the next step should decrease it.
- The further the slope is from $0$ (i.e. the larger its absolute value), the larger the step should be.

The general procedure to find the ideal $m$ is as follows: 
1.  Randomly choose a starting value of $m$,
2.  Select some step size $(\eta)$, then
3.  Repeatedly update $m$ with the formula $ m_{i+1} = m_i - (\eta) * slope_{m = m_i} $.
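The procedure can be sketched for a single parameter as follows. This is an illustration under stated assumptions, not the course's implementation: the helper names `rss_slope_m` and `descend_m`, the toy data, and the learning rate are all hypothetical, $b$ is held fixed at 0, and $m$ starts at 0 rather than a random value so the run is reproducible. Each step subtracts the scaled slope so that $m$ moves downhill on the cost curve.

```python
def rss_slope_m(x_values, y_values, m, b):
    """Derivative of RSS with respect to m: -2 * sum(x_i * (y_i - (m * x_i + b)))."""
    return -2 * sum(x * (y - (m * x + b)) for x, y in zip(x_values, y_values))


def descend_m(x_values, y_values, m, b, learning_rate, steps):
    """Repeatedly step m against the slope of the RSS curve, holding b fixed."""
    for _ in range(steps):
        m = m - learning_rate * rss_slope_m(x_values, y_values, m, b)
    return m


# Hypothetical toy data generated by y = 3x, so the ideal m is 3 when b = 0.
x_values = [1, 2, 3, 4]
y_values = [3, 6, 9, 12]

m_found = descend_m(x_values, y_values, m=0.0, b=0.0, learning_rate=0.01, steps=100)
print(round(m_found, 4))  # close to 3
```

With a learning rate that is too large for the data, the same loop can overshoot and diverge, which is why the step size is treated as something to tune.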

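```python
def gradient_descent(x_values, y_values, steps, current_b, learning_rate, m):
    # Assumes slope_at, residual_sum_squares and updated_b are defined elsewhere;
    # they are not shown in these notes. Here b is updated while m is held fixed.
    cost_curve = []
    for i in range(steps):
        # slope and RSS at the current value of b
        cost_slope = slope_at(x_values, y_values, m, current_b)
        current_rss = residual_sum_squares(x_values, y_values, m, current_b)
        # record this step before moving b
        cost_curve.append({'b': current_b, 'rss': current_rss, 'slope': cost_slope})
        current_b = updated_b(current_b, learning_rate, cost_slope)
    return cost_curve
```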

### Gradient descent in three-dimensions

Just as a multi-variable function (e.g. $f(x, y) = y*x^2 $) depends on two inputs, a cost function that depends on two parameters is plotted as a three-dimensional surface rather than a curve. 

For a regression line, the residual sum of squares is a function of both $m$ and $b$:  
$J(m, b) = \sum_{i=1}^{n}(y_i - (mx_i + b))^2$  

This means adjusting $m$ *and* $b$ in tandem, which requires taking the *partial derivative* of the cost function with respect to each parameter while treating the other as a constant.  
<img src = "images/multi-variable-function.png" width = 400>

To calculate the gradient of the cost function we take its partial derivative with respect to each parameter, treating the other parameter as a constant. . .  
$$ \nabla J(m, b) = \left( \frac{\partial J}{\partial m}, \frac{\partial J}{\partial b} \right)$$  

For example:
- given the cost function $J(m, b) = \sum_{i=1}^{n}(y_i - (mx_i + b))^2$, with $\hat{y}_i = mx_i + b$ and $\epsilon_i = y_i - \hat{y}_i$
- Each partial derivative is solved for as . . .  
$$ \frac{\partial J}{\partial m} = -2*\sum_{i=1}^n x_i(y_i - \hat{y}_i)  = -2*\sum_{i=1}^n x_i*\epsilon_i$$
$$ \frac{\partial J}{\partial b} = -2*\sum_{i=1}^n(y_i - \hat{y}_i) = -2*\sum_{i=1}^n \epsilon_i$$
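One way to sanity-check these partial derivatives is to compare them with finite differences of $J(m, b)$, nudging one parameter at a time while holding the other constant. The sketch below does that; the function names `rss` and `rss_gradient`, the data, and the parameter values are hypothetical and chosen only for illustration.

```python
def rss(x_values, y_values, m, b):
    """J(m, b): sum of squared residuals for the line y = m * x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values))


def rss_gradient(x_values, y_values, m, b):
    """Analytic partial derivatives (dJ/dm, dJ/db) of the residual sum of squares."""
    residuals = [y - (m * x + b) for x, y in zip(x_values, y_values)]
    dJ_dm = -2 * sum(x * e for x, e in zip(x_values, residuals))
    dJ_db = -2 * sum(residuals)
    return dJ_dm, dJ_db


# Hypothetical data and parameter values, chosen only for illustration.
x_values = [1.0, 2.0, 3.0]
y_values = [2.0, 4.5, 5.5]
m, b, h = 1.0, 0.5, 1e-6

# Central finite differences: nudge one parameter while holding the other constant.
numeric_dm = (rss(x_values, y_values, m + h, b) - rss(x_values, y_values, m - h, b)) / (2 * h)
numeric_db = (rss(x_values, y_values, m, b + h) - rss(x_values, y_values, m, b - h)) / (2 * h)

print(rss_gradient(x_values, y_values, m, b))  # (-21.0, -9.0)
print(numeric_dm, numeric_db)                  # approximately the same values
```

Both partial derivatives are negative here, so a descent step would increase both $m$ and $b$.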

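Practically speaking, the update for each parameter would be something along the lines of the following, where $\eta$ is the step size . . .  

`current_m` =  `old_m` $ - \eta * \frac{\partial J}{\partial m} $

`current_b` =  `old_b` $ - \eta * \frac{\partial J}{\partial b} $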

### Coding gradient descent

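```python
import matplotlib.pyplot as plt
import numpy as np
# notebook-only magic: remove this line when running as a plain script
%matplotlib inline
np.random.seed(11)

# two random features and a target built from known coefficients plus noise
x1 = np.random.rand(100,1).reshape(100)
x2 = np.random.rand(100,1).reshape(100)
y_randterm = np.random.normal(0,0.2,100)
y = 2 + 3*x1 + -4*x2 + y_randterm

# one row per observation, with columns [y, x1, x2]
data = np.array([y, x1, x2])
data = np.transpose(data)
```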

```python
def step_gradient_multi(b_current, m_current, points):
    b_gradient = 0
    m_gradient = np.zeros(len(m_current))
    learning_rate = .1 # arbitrarily selected
    N = float(len(points)) # scales the gradient by the size of the data set
    for i in range(0, len(points)):
        y = points[i][0]                     # first column is the target
        x = points[i][1:(len(m_current)+1)]  # remaining columns are the features
        residual = y - (sum(m_current * x) + b_current)
        b_gradient += -(1/N) * residual
        m_gradient += -(1/N) * x * residual
    # step each parameter against its gradient
    new_b = b_current - (learning_rate * b_gradient)
    new_m = m_current - (learning_rate * m_gradient)
    return (new_b, new_m)
```

```python
b = 0
m = [0, 0]
iterations = []
for i in range(500):
    # each step returns the updated (b, m) pair, which feeds the next step
    iteration = step_gradient_multi(b, m, data)
    b, m = iteration
    iterations.append(iteration)
```

```python
iterations[-5:]
```

```
[(1.9431512015830243, array([ 2.99522584, -3.90814716])),
 (1.9434733566282547, array([ 2.99539515, -3.90888235])),
 (1.9437935889183264, array([ 2.99556229, -3.90961205])),
 (1.9441119102745354, array([ 2.99572729, -3.9103363 ])),
 (1.944428332442866, array([ 2.99589015, -3.91105514]))]
```
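As a cross-check that is not part of the original notes, the same data can be fitted with NumPy's closed-form least-squares solver. The sketch below regenerates the data with the same seed and the same order of random draws; the variable names `X` and `coefficients` are assumptions made here.

```python
import numpy as np

# Regenerate the same synthetic data as above (same seed, same order of draws).
np.random.seed(11)
x1 = np.random.rand(100, 1).reshape(100)
x2 = np.random.rand(100, 1).reshape(100)
y_randterm = np.random.normal(0, 0.2, 100)
y = 2 + 3 * x1 + -4 * x2 + y_randterm

# Design matrix with a leading column of ones for the intercept b.
X = np.column_stack([np.ones(100), x1, x2])

# Least-squares solution; rcond=None selects NumPy's current default cutoff.
coefficients, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coefficients)  # ordered as [b, m1, m2]
```

The result should sit near the coefficients used to generate the data (2, 3 and -4). The last five gradient descent iterations are still changing in the fourth decimal place, so some gap between the two sets of estimates is expected until more iterations are run.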