# Ch. 4: Training Models

***

What did we do in the last few chapters?
 - Optimized a regression system
 - Improved image classification

Now it is time to look under the hood, since knowing how the models work makes debugging more efficient.

What will we look at in this chapter?
 - Linear Regression models
  - Training Method 1: a "closed-form" equation that directly computes the model parameters that best fit the training set (i.e., that minimize the cost function over the training set)
  - Training Method 2: "Gradient Descent", which iteratively tweaks the parameters until they converge on the values that minimize the cost function
 - Polynomial Regression models
  - Detect overfitting
  - Regularization techniques to avoid overfitting
 - Regression for classification
  - Logistic regression
  - Softmax regression
***

## 4.1 - Linear Regression

### 4.1.1 - Mathematical Definition of Linear Regression Model

A linear model makes a prediction by computing a weighted sum of the input features plus a bias term:

$$\hat{y} = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \cdots + \theta_{n}x_{n}$$

- $\hat{y}$: predicted value
- $n$: number of features (or dimensions)
- $x_{i}$: $i$th feature value
- $\theta_{j}$: $j$th model parameter, including the bias term $\theta_{0}$ and the feature weights $\theta_{1}$ to $\theta_{n}$

In vectorized notation:

$$\boxed{\hat{y} = h_{\boldsymbol{\theta}} \textbf{(x)} = \boldsymbol{\theta} \cdot \textbf{x}}$$
- $\boldsymbol{\theta}$: **Parameter Vector** containing bias term $\theta_{0}$ and feature weights $\theta_1$ to $\theta_n$
- $\textbf{x}$: **Feature Vector** containing $x_0$ to $x_n$ where $x_0 = 1$
- $\boldsymbol{\theta} \cdot \textbf{x}$: **Dot Product** of the _parameter_ and _feature_ vectors (element-wise multiplication, then summing)
- $h_{\boldsymbol{\theta}}$: **Hypothesis Function** using model parameters $\boldsymbol{\theta}$

Vectors are often represented as **column vectors** (2D arrays with a single column). If $\boldsymbol{\theta}$ and $\textbf{x}$ are both column vectors for a single instance (with the bias term included via $x_0 = 1$), the prediction can be written as a matrix multiplication:

$$\boxed{\hat{y} = \boldsymbol{\theta}^{T}\textbf{x}}$$
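
As a quick check that the three forms above agree, here is a minimal NumPy sketch. The parameter and feature values are made up purely for illustration; they do not come from any dataset in this chapter.

```python
import numpy as np

# Hypothetical parameters: bias theta_0 = 4, weights theta_1 = 3, theta_2 = -2
theta = np.array([4.0, 3.0, -2.0])

# One instance with two (made-up) feature values
features = np.array([1.5, 0.5])
x = np.concatenate(([1.0], features))   # prepend x_0 = 1 so the bias is included

# 1) Explicit weighted sum plus bias
y_hat_sum = theta[0] + theta[1] * features[0] + theta[2] * features[1]

# 2) Dot product of the parameter and feature vectors
y_hat_dot = theta.dot(x)

# 3) Column-vector form: theta^T x is a 1x1 matrix, so extract its single entry
theta_col = theta.reshape(-1, 1)
x_col = x.reshape(-1, 1)
y_hat_mat = (theta_col.T @ x_col).item()

print(y_hat_sum, y_hat_dot, y_hat_mat)   # all three equal 4 + 4.5 - 1 = 7.5
```

The only difference between the dot-product form and the $\boldsymbol{\theta}^{T}\textbf{x}$ form is the shape of the result: a scalar in the first case, a single-cell matrix in the second.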

### 4.1.2 - Training Linear Regression: Cost Function

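In Ch. 2, we saw that the most common performance measure for a regression model is the _Root Mean Square Error_ (RMSE), so training the model means finding the value of $\boldsymbol{\theta}$ that minimizes the RMSE. In practice it is simpler to minimize the _Mean Square Error_ (MSE): because the square root is monotonically increasing, the value of $\boldsymbol{\theta}$ that minimizes the MSE also minimizes the RMSE.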

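**MSE cost function of a Linear Regression hypothesis** $h_{\boldsymbol{\theta}}$ **on a training set** $\textbf{X}$:

$$\boxed{MSE(\textbf{X}, h_{\boldsymbol{\theta}}) = \frac{1}{m} \sum_{i=1}^{m} (\boldsymbol{\theta}^{T}\textbf{x}^{(i)} - y^{(i)})^2}$$

- $m$: number of instances in the training set
- $\textbf{x}^{(i)}$: feature vector of the $i$th instance (including $x_0 = 1$)
- $y^{(i)}$: target value of the $i$th instance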
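
The MSE formula above can be sketched in a few lines of NumPy. The helper name `mse` and the three-instance training set are hypothetical, chosen so the numbers can be checked by hand.

```python
import numpy as np

def mse(X_b, y, theta):
    """MSE of a linear hypothesis; X_b must already contain the x_0 = 1 column."""
    errors = X_b.dot(theta) - y          # theta^T x^(i) - y^(i) for every instance
    return float(np.mean(errors ** 2))   # average of the squared errors

# Tiny made-up training set generated by y = 1 + 2x (no noise)
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta_exact = np.array([1.0, 2.0])   # the parameters that generated the data
theta_off = np.array([0.0, 2.0])     # bias is wrong by 1, so every error is -1

mse_exact = mse(X_b, y, theta_exact)
mse_off = mse(X_b, y, theta_off)
rmse_off = np.sqrt(mse_off)

print(mse_exact, mse_off, rmse_off)
```

With the exact parameters every error is zero, so the MSE is zero; with the bias off by 1, every squared error is 1, so both the MSE and the RMSE equal 1.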

### 4.1.3 - Training Linear Regression: Closed Form Solution of Minimization of Cost Function

There is an equation that directly gives the best value for $\boldsymbol{\theta}$ called the **normal equation**:

$$\boxed{\boldsymbol{\hat{\theta}} = (\textbf{X}^{T}\textbf{X})^{-1} \textbf{X}^T \textbf{y}}$$

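- $\hat{\boldsymbol{\theta}}$: parameter vector that minimizes the cost function
- $\textbf{X}$: matrix of training instances, one row per instance, with $x_0 = 1$ included in each row
- $\textbf{y}$: vector of target values, $y^{(1)}$ to $y^{(m)}$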
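
For reference, the normal equation follows from setting the gradient of the MSE to zero. The sketch below writes the MSE in matrix form and assumes that $\textbf{X}^{T}\textbf{X}$ is invertible:

$$
\begin{aligned}
MSE(\boldsymbol{\theta}) &= \frac{1}{m} (\textbf{X}\boldsymbol{\theta} - \textbf{y})^{T} (\textbf{X}\boldsymbol{\theta} - \textbf{y}) \\
\nabla_{\boldsymbol{\theta}} MSE(\boldsymbol{\theta}) &= \frac{2}{m} \textbf{X}^{T} (\textbf{X}\boldsymbol{\theta} - \textbf{y}) = \textbf{0} \\
\textbf{X}^{T}\textbf{X}\,\boldsymbol{\theta} &= \textbf{X}^{T}\textbf{y} \\
\hat{\boldsymbol{\theta}} &= (\textbf{X}^{T}\textbf{X})^{-1} \textbf{X}^{T} \textbf{y}
\end{aligned}
$$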

#### 4.1.3.1 - Testing and Plotting Closed Form Solution

Below, we generate a random dataset with a linear dependence plus Gaussian noise and compute the parameter vector that minimizes the cost function. NumPy's linear algebra module (`np.linalg`) provides `inv()` to compute the inverse of a matrix, and the array method `dot()` performs matrix multiplication:

In [19]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import interactive

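X = 2 * np.random.rand(100, 1)     # 100 instances, one feature, uniform in [0, 2)

def get_rand_lin_dist(alpha, beta):
    # y = alpha + beta * x + standard normal noise
    return alpha + beta * X + np.random.randn(100, 1)

def get_theta_best(X, Y):
    X_tot = np.c_[np.ones((len(X), 1)), X]     # Add x0 = 1 to every instance
    # Normal equation: theta = (X^T X)^(-1) X^T y
    return np.linalg.inv(X_tot.T.dot(X_tot)).dot(X_tot.T).dot(Y)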

def plt_rand_lin_dist(alpha, beta):
    Y = get_rand_lin_dist(alpha, beta)
    
    X_lims = np.linspace(0, 2, num=100)
    X_lims_tot = np.c_[np.ones((100,1)), X_lims]
    
    theta_closed_form_pred = get_theta_best(X, Y)
    
    y_pred = X_lims_tot.dot(theta_closed_form_pred)
    
    plt.scatter(X, Y)
    plt.plot(X_lims, y_pred, 'r-')
    plt.grid(True)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Random Linear Distribution')
    plt.xlim(0, 2)
    plt.ylim(0,20)
    plt.show()
    
    print("Closed Form Best Theta Predictions: ", theta_closed_form_pred)
    
interactive_plt_rand_lin_dist = interactive(plt_rand_lin_dist, 
                                            alpha=(0,20), beta=(0,10))
interactive_plt_rand_lin_dist


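Note that the closed-form solution for $\hat{\boldsymbol{\theta}}$ is close to, but not exactly equal to, the `alpha` and `beta` used to generate the data: the random noise makes it impossible to recover the exact parameters of the original function.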
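
To illustrate the effect of the noise, the following sketch fits the normal equation on a small and a large sample drawn from the same noisy line. The helper name `fit_normal_equation`, the seed, the sample sizes, and the true parameters (10 and 5) are all assumptions made for this example, not part of the cell above.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, chosen arbitrarily for reproducibility

def fit_normal_equation(m, alpha=10.0, beta=5.0):
    """Draw m noisy points from y = alpha + beta * x and solve the normal equation."""
    X = 2 * rng.random((m, 1))                          # features uniform in [0, 2)
    y = alpha + beta * X + rng.standard_normal((m, 1))  # add standard normal noise
    X_b = np.c_[np.ones((m, 1)), X]                     # add x_0 = 1 to every instance
    return np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

true_theta = np.array([[10.0], [5.0]])

theta_small = fit_normal_equation(100)
theta_large = fit_normal_equation(100_000)

# Largest absolute deviation from the true parameters in each case
err_small = np.abs(theta_small - true_theta).max()
err_large = np.abs(theta_large - true_theta).max()

print("m = 100:     ", theta_small.ravel(), "max error:", err_small)
print("m = 100,000: ", theta_large.ravel(), "max error:", err_large)
```

The estimate from the larger sample is typically much closer to the true parameters, because the noise averages out as the number of instances grows; it never vanishes entirely for a finite sample.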

#### 4.1.3.2 - Linear Regression via Scikit-Learn

