# On the Notion of the Gradient

In [1]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np

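Gradient descent finds a minimum of a cost function by repeatedly stepping against the gradient of that function. The table below lists the sum-of-squared-errors cost $\mathcal L$ and its gradient for several regression models; $\hat{\beta_j}$ denotes the unit vector along the $\beta_j$ axis.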

<table style = "width:100%">
    <tr>
        <th style = "width:10%">Regression</th>
        <th style = "width:35%">Cost Function Equation</th>
        <th>Gradient of Cost Function</th>
    </tr>
    <tr>
        <td>simple - no intercept</td>
        <td>$\mathcal L = \sum (y - \beta x)^2$</td>
        <td>$\frac{d\mathcal L}{d\beta} = -2\sum x(y - \beta x)$</td>
    </tr>
    <tr>
        <td>simple - with intercept</td>
        <td>$\mathcal L = \sum (y - (\beta_1 x + \beta_0))^2$</td>
        <td>$\nabla\mathcal L = -2\sum(y - (\beta_1 x + \beta_0))\hat{\beta_0}-2\sum x(y - (\beta_1 x + \beta_0))\hat{\beta_1}$</td>
    </tr>
    <tr>
        <td>multiple</td>
        <td>$\mathcal L = \sum (y - (\beta_n x_n + ... + \beta_1 x_1 + \beta_0))^2$</td>
        <td>$\nabla\mathcal L = -2\sum (y - (\beta_n x_n + ... + \beta_1 x_1 + \beta_0))\hat{\beta_0}- ... -2\sum x_n(y - (\beta_n x_n + ... + \beta_1 x_1 + \beta_0))\hat{\beta_n}$</td>
    </tr>
    <tr>
        <td>multiple - with ridge regularization</td>
        <td>$\mathcal L = \sum (y - (\beta_2 x_2 + \beta_1 x_1))^2 + \lambda(\beta_1^2 + \beta_2^2)$</td>
        <td>$\nabla\mathcal L = -2[\sum x_1(y - (\beta_2 x_2 + \beta_1 x_1)) - \lambda\beta_1]\hat{\beta_1} - 2[\sum x_2(y - (\beta_2 x_2 + \beta_1 x_1)) - \lambda\beta_2]\hat{\beta_2}$
        </td>
    </tr>
</table>
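
The "multiple" row can be written compactly in matrix form as $\nabla\mathcal L = -2X^T(y - X\beta)$, where $X$ carries a leading column of ones for $\beta_0$. The following is a minimal sketch that checks this form against a central finite-difference approximation. The synthetic data and the helper names `sse`, `sse_grad`, and `numeric_grad` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, assumed for illustration only
n_samples, n_features = 50, 3
X = rng.normal(size=(n_samples, n_features))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so that beta[0] plays the role of the intercept beta_0.
X1 = np.column_stack([np.ones(n_samples), X])

def sse(beta):
    """Sum of squared errors, L = sum (y - X1 @ beta)^2."""
    residual = y - X1 @ beta
    return residual @ residual

def sse_grad(beta):
    """Analytic gradient in matrix form, -2 * X1^T (y - X1 @ beta)."""
    return -2 * X1.T @ (y - X1 @ beta)

def numeric_grad(f, beta, eps=1e-6):
    """Central finite differences, perturbing one coefficient at a time."""
    g = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        g[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return g

beta = rng.normal(size=n_features + 1)  # arbitrary point in coefficient space
analytic = sse_grad(beta)
numeric = numeric_grad(sse, beta)
print(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-3))
```

Because the cost is quadratic in $\beta$, the central difference should agree with the analytic gradient up to rounding error, so the check is expected to pass at any point `beta`.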

## Where Gradient Equals $0$

Simple linear regression, no intercept: setting $\frac{d\mathcal L}{d\beta} = -2\sum x(y - \beta x) = 0$ gives $\sum xy = \beta\sum x^2$, so $\beta = \frac{\sum xy}{\sum x^2}$ (provided $\sum x^2 \neq 0$).

In [2]:
X, y = make_regression(n_features=1, random_state=42)

In [3]:
# fit_intercept=False matches the no-intercept model derived above
LinearRegression(fit_intercept=False).fit(X, y).coef_

array([41.74110031])

In [4]:
X = X[:, 0]

In [5]:
np.sum(X*y) / np.sum(X**2)

41.7411003148779
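
The same argument carries over to the with-intercept case. As a sketch, writing $\bar x$ and $\bar y$ for the sample means (notation introduced here, not used elsewhere in this notebook), setting both components of $\nabla\mathcal L$ to $0$ gives:

$$\sum (y - (\beta_1 x + \beta_0)) = 0 \implies \beta_0 = \bar y - \beta_1 \bar x$$

$$\sum x(y - (\beta_1 x + \beta_0)) = 0 \implies \beta_1 = \frac{\sum (x - \bar x)(y - \bar y)}{\sum (x - \bar x)^2}$$

The second line follows by substituting the expression for $\beta_0$ into the $\hat{\beta_1}$ component and using $\sum (x - \bar x) = 0$.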

## Gradient Points in Direction of Steepest Increase

![plot](images/contourPlotSSE.png)
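
One way to see the claim numerically is to take a small step of fixed length in many directions from the same point and record which direction raises the cost the most. The sketch below does this for the with-intercept cost. The synthetic data, the starting point, and the step size are arbitrary choices assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, assumed for illustration only
x = rng.normal(size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=100)

def sse(beta_0, beta_1):
    """Sum of squared errors for the with-intercept model."""
    return np.sum((y - (beta_1 * x + beta_0)) ** 2)

def grad(beta_0, beta_1):
    """Gradient of sse as an array (beta_0 component, beta_1 component)."""
    residual = y - (beta_1 * x + beta_0)
    return np.array([-2 * np.sum(residual), -2 * np.sum(x * residual)])

point = np.array([5.0, 1.0])  # arbitrary (beta_0, beta_1), away from the minimum
g = grad(*point)
g_unit = g / np.linalg.norm(g)

# Try 360 unit directions, one per degree, with the same small step length.
step = 1e-4
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
increases = np.array([sse(*(point + step * d)) - sse(*point) for d in directions])

best = directions[np.argmax(increases)]
print("unit gradient:      ", g_unit)
print("steepest direction: ", best)
```

Up to the one-degree resolution of the search, the direction with the largest increase should line up with the normalized gradient, and the direction with the largest decrease with its negative.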

## In Code

$\nabla\mathcal L = -2\sum(y - (\beta_1 x + \beta_0))\hat{\beta_0}-2\sum x(y - (\beta_1 x + \beta_0))\hat{\beta_1}$

In [6]:
X, y = make_regression(n_features=1, bias=2, random_state=42)

In [7]:
lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)

[41.74110031] 2.0000000000000004


In [8]:
X = X[:, 0]

In [9]:
def grad(beta_0, beta_1, x=X, y=y):
    """Return the (beta_0, beta_1) components of the SSE gradient."""
    residual = y - (beta_1 * x + beta_0)  # shared by both components
    beta_0_component = -2 * np.sum(residual)
    beta_1_component = -2 * np.sum(x * residual)
    return beta_0_component, beta_1_component

**At the fitted coefficients, both components are zero up to floating-point error**

In [10]:
grad(lr.intercept_, lr.coef_[0])

(-8.304468224196171e-14, 1.240140003855395e-12)

**Close to optimal for $\beta_0$, off the mark for $\beta_1$: the $\hat{\beta_1}$ component dominates**

In [11]:
grad(3, 60)

(-179.2246287496094, 3000.3712222450754)

**Close to optimal for $\beta_1$, off the mark for $\beta_0$: the $\hat{\beta_0}$ component dominates**

In [12]:
grad(10, 42)

(1594.6228338691285, -123.31656003412012)
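
The sign and size of each component indicate which way, and roughly how far, each coefficient should move. Stepping repeatedly against the gradient turns this into gradient descent. The following is a minimal sketch, not a tuned implementation: the synthetic data, the function name `gradient_descent`, the learning rate, and the step count are all assumptions chosen for illustration. Because the sum-of-squares gradient grows with the number of samples, the learning rate has to be small for the updates to remain stable.

```python
import numpy as np

rng = np.random.default_rng(42)  # synthetic data, assumed for illustration only
x = rng.normal(size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

def grad(beta_0, beta_1, x, y):
    """Components of the SSE gradient for the with-intercept model."""
    residual = y - (beta_1 * x + beta_0)
    return -2 * np.sum(residual), -2 * np.sum(x * residual)

def gradient_descent(x, y, learning_rate=1e-3, n_steps=500):
    """Start at (0, 0) and repeatedly step against the gradient."""
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_steps):
        g0, g1 = grad(beta_0, beta_1, x, y)
        beta_0 -= learning_rate * g0
        beta_1 -= learning_rate * g1
    return beta_0, beta_1

beta_0, beta_1 = gradient_descent(x, y)

# Least-squares fit for comparison; np.polyfit returns (slope, intercept) for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)
print("gradient descent:", beta_0, beta_1)
print("least squares:   ", intercept, slope)
```

With these settings the two pairs of coefficients should agree to many decimal places. A learning rate that is too large for the data makes the iterates diverge instead of converge.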