# Supervised Learning

The three branches of machine learning regularly recognized today are:
- Supervised Machine Learning
- Unsupervised Machine Learning
- Reinforcement Machine Learning

In supervised machine learning, algorithms learn from labeled data. After studying labeled examples, these techniques determine which label to assign to new, unlabeled data by matching it against the patterns they observed during training.

Supervised learning can be divided into two categories: 
- Classification
- Regression

Classification models predict a category that an item belongs to. 
Regression models predict a numeric value (e.g. home price).

Unsupervised learning creates models for data when there are no pre-existing labels to train on.

Reinforcement Learning is used to train algorithms that learn based on taking certain actions and receiving rewards for those actions (e.g. self-driving cars, game playing agents). 

Deep learning can be applied to all three categories. Over time, the field has cared less about how predictions are made and more about how accurate they are. Deep learning has three major barriers to entry:
1. You must have enough data to train on.
2. You must have enough computing power.
3. You won't always understand why certain decisions are being made, given the complexity and flexibility of these algorithms.

### Linear Regression

$y = w_1x+w_2$

where $w_1$ is slope and $w_2$ is intercept. 

#### Absolute trick

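$y = (w_1+p\alpha)x+(w_2+\alpha)$  for a point $(p,q)$ above the line and $\alpha$ = learning rate. If the point lies below the line, subtract instead: $y = (w_1-p\alpha)x+(w_2-\alpha)$.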
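A minimal sketch of the absolute trick as a function. The name `absolute_trick` and its argument order are assumptions made for illustration. It uses the usual sign rule for this trick: add when the point lies above the line, subtract when it lies below.

```python
def absolute_trick(w1, w2, p, q, alpha):
    """Move the line y = w1*x + w2 a small step toward the point (p, q)."""
    q_hat = w1 * p + w2              # prediction of the current line at x = p
    if q > q_hat:                    # point above the line: add
        return w1 + p * alpha, w2 + alpha
    if q < q_hat:                    # point below the line: subtract
        return w1 - p * alpha, w2 - alpha
    return w1, w2                    # point already on the line: no change


# Line y = 2x + 3 predicts 5 at x = 1, so the point (1, 10) lies above it.
new_w1, new_w2 = absolute_trick(2.0, 3.0, 1.0, 10.0, 0.5)
print(new_w1, new_w2)  # 2.5 3.5
```

Note that the size of the step depends only on `p` and `alpha`, not on how far the point is from the line.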

#### Square trick

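Unlike the absolute trick, the square trick also takes into account the vertical distance from the point to the line, $q - q'$, where $q'$ is the line's prediction at $x = p$.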

$y = (w_1+p(q-q')\alpha)x+(w_2+(q-q')\alpha)$  for coordinate (p,q) and $\alpha$ = learning rate
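The formula above can be sketched as a small function, assuming $q'$ denotes the current line's prediction at $x = p$. The name `square_trick` is hypothetical.

```python
def square_trick(w1, w2, p, q, alpha):
    """Move the line y = w1*x + w2 toward (p, q), scaled by the vertical distance."""
    q_prime = w1 * p + w2            # prediction of the current line at x = p
    diff = q - q_prime               # signed vertical distance to the point
    return w1 + p * diff * alpha, w2 + diff * alpha


# Line y = 2x + 3 predicts 5 at x = 1; the point (1, 10) is 5 units above.
new_w1, new_w2 = square_trick(2.0, 3.0, 1.0, 10.0, 0.5)
print(new_w1, new_w2)  # 4.5 5.5
```

Because `diff` carries its own sign, no separate above/below case is needed: a point below the line produces a negative `diff` and the update subtracts automatically.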

Let's say that we have a line whose equation is y = -0.6x + 4. For the point (x,y) = (-5, 3), apply the absolute trick to get the new equation for the line, using a learning rate of alpha = 0.1.

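In [12]:
alpha = 0.1                  # learning rate given in the exercise
p, q = -5, 3                 # the point
q_hat = -0.6*p + 4           # prediction at p is 7, so the point is below the line

# absolute trick (point below the line, so subtract)
w1 = -0.6 - p*alpha
w2 = 4 - alpha
print("y={:.3f}x+{:.3f}".format(w1,w2))

# square trick, for comparison
w3 = -0.6 + p*(q-q_hat)*alpha
w4 = 4 + (q-q_hat)*alpha
print("y={:.3f}x+{:.3f}".format(w3,w4))

y=-0.100x+3.900
y=1.400x+3.600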


#### Gradient Descent

$w_i \rightarrow w_i-\alpha \frac{\partial}{\partial w_i}Error$
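As a rough illustration of this update rule, the sketch below repeatedly applies it to a single weight. The error function $Error(w) = (w-3)^2$, the helper name `gradient_descent`, and the chosen learning rate and step count are all assumptions picked to keep the example small.

```python
def gradient_descent(grad, w, alpha, steps):
    """Apply w -> w - alpha * dError/dw a fixed number of times."""
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# Hypothetical error: Error(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
# Its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w=0.0, alpha=0.1, steps=100)
print(round(w_min, 6))  # 3.0
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate that is too large would overshoot instead.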

#### Mean Absolute Error

We always want each point's error to be positive, so that positive and negative errors don't cancel each other out when summed.

$Error = \frac{1}{m}\sum^m_{i=1}|y-\hat{y}|$

#### Mean Squared Error

$Error = \frac{1}{2m}\sum^m_{i=1}(y-\hat{y})^2$

Compute the mean absolute error and mean squared error for the following line and points:
- line: $y = 1.2x + 2$
- points: (2, -2), (5, 6), (-4, -4), (-7, 1), (8, 14)

In [29]:
x = [2,5,-4,-7,8]
y = [-2,6,-4,1,14]

def mean_absolute_error(x, y):
    y_h = [0]*len(x)
    error = 0
    for i in range(len(x)):
        y_h[i] = 1.2*x[i]+2
        error += abs(y[i] - y_h[i])
    return error*(1/(len(x)))
        
print(mean_absolute_error(x, y))

def mean_squared_error(x, y):
    y_h = [0]*len(x)
    error = 0
    for i in range(len(x)):
        y_h[i] = 1.2*x[i]+2
        error += (y[i] - y_h[i])**2
    return error*(1/(2*len(x)))

print(mean_squared_error(x, y))

3.88
10.692000000000002


### Minimizing Error Functions

#### Development of the derivative of the error function

Notice that we've defined the squared error to be

$Error = \frac{1}{2} (y - \hat{y})^2$

Also, we've defined the prediction to be

$\hat{y} = w_1 x + w_2$ 

So to calculate the derivative of the Error with respect to $w_1$ we simply use the chain rule:

$\frac{\partial}{\partial w_1} Error = \frac{\partial Error}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_1}$

The first factor of the right hand side is the derivative of the Error with respect to the prediction $\hat{y}$ 
which is $-(y-\hat{y})$. 

The second factor is the derivative of the prediction with respect to $w_1$ which is simply $x$.

Therefore, the derivative is:

$\frac{\partial}{\partial w_1} Error = -(y-\hat{y})x$   
$\frac{\partial}{\partial w_2} Error = -(y-\hat{y})$
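One way to sanity-check these derivatives is to compare them with a finite-difference approximation. The sketch below does this for a single point; the function names and the step size `h` are assumptions for illustration.

```python
def squared_error(w1, w2, x, y):
    """Error = 1/2 * (y - y_hat)^2 for the prediction y_hat = w1*x + w2."""
    y_hat = w1 * x + w2
    return 0.5 * (y - y_hat) ** 2


def analytic_gradient(w1, w2, x, y):
    """Derivatives from the chain rule: -(y - y_hat)*x and -(y - y_hat)."""
    y_hat = w1 * x + w2
    return -(y - y_hat) * x, -(y - y_hat)


def numeric_gradient(w1, w2, x, y, h=1e-6):
    """Central-difference approximation of the same two derivatives."""
    d_w1 = (squared_error(w1 + h, w2, x, y) - squared_error(w1 - h, w2, x, y)) / (2 * h)
    d_w2 = (squared_error(w1, w2 + h, x, y) - squared_error(w1, w2 - h, x, y)) / (2 * h)
    return d_w1, d_w2


# Line y = -0.6x + 4 and point (-5, 3): the prediction is 7, so y - y_hat = -4.
exact = analytic_gradient(-0.6, 4.0, -5.0, 3.0)
approx = numeric_gradient(-0.6, 4.0, -5.0, 3.0)
print(exact)
print(approx)
```

The two pairs should agree to several decimal places, which supports the formulas $-(y-\hat{y})x$ and $-(y-\hat{y})$.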

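Note for absolute error, $Error = |y-\hat{y}|$, the derivatives depend only on whether the point lies above or below the line:   
$\frac{\partial}{\partial w_1} Error = -\text{sign}(y-\hat{y})\,x = \pm x$   
$\frac{\partial}{\partial w_2} Error = -\text{sign}(y-\hat{y}) = \pm 1$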
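A small sketch of the absolute-error derivatives, making the choice of sign explicit. The function name is hypothetical, and returning zero when the point sits exactly on the line (where the derivative is undefined) is an assumed convention.

```python
def absolute_error_gradient(w1, w2, x, y):
    """Derivatives of |y - y_hat| with respect to w1 and w2."""
    y_hat = w1 * x + w2
    sign = (y > y_hat) - (y < y_hat)   # +1 above the line, -1 below, 0 on it
    return -sign * x, -sign


# Line y = 2x + 3 predicts 5 at x = 1.
print(absolute_error_gradient(2.0, 3.0, 1.0, 10.0))  # point above the line
print(absolute_error_gradient(2.0, 3.0, 1.0, 0.0))   # point below the line
```

Subtracting $\alpha$ times this gradient from the weights adds $\alpha x$ and $\alpha$ for a point above the line and subtracts them for a point below, which is the absolute trick.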

#### Gradient Step

$w_i \rightarrow w_i - \alpha \frac{\partial}{\partial w_i}Error$
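To see what one gradient step does with the squared error, the sketch below plugs the derivatives $-(y-\hat{y})x$ and $-(y-\hat{y})$ into the update. The function name `gradient_step` is an assumption.

```python
def gradient_step(w1, w2, x, y, alpha):
    """One gradient descent step on Error = 1/2 * (y - y_hat)^2 for one point."""
    y_hat = w1 * x + w2
    grad_w1 = -(y - y_hat) * x
    grad_w2 = -(y - y_hat)
    return w1 - alpha * grad_w1, w2 - alpha * grad_w2


# Line y = 2x + 3, point (1, 10), learning rate 0.5.
print(gradient_step(2.0, 3.0, 1.0, 10.0, 0.5))  # (4.5, 5.5)
```

Expanding the minus signs gives $w_1 + x(y-\hat{y})\alpha$ and $w_2 + (y-\hat{y})\alpha$, which has the same form as the square trick.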

#### Mean vs Total Squared (or Absolute) Error   
A potential confusion is the following: How do we know if we should use the mean or the total squared (or absolute) error?

The total squared error is the sum of errors at each point, given by the following equation:

$M = \sum_{i=1}^m \frac{1}{2} (y - \hat{y})^2$

whereas the mean squared error is the average of these errors, given by the equation, where $m$ is the number of points:

$T = \sum_{i=1}^m \frac{1}{2m}(y - \hat{y})^2$

The good news is, it doesn't really matter. As we can see, the total squared error is just a multiple of the mean squared error, since

$M = mT$

Therefore, since differentiation is linear, the gradient of $M$ is also $m$ times the gradient of $T$.

However, the gradient descent step consists of subtracting the gradient of the error times the learning rate $\alpha$. Therefore, choosing between the mean squared error and the total squared error really just amounts to picking a different learning rate.

In real life, we'll have algorithms that help us determine a good learning rate to work with. Whether we use the mean error or the total error, the algorithm will just end up picking a different learning rate.
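The claim can be checked numerically. The sketch below reuses the five points from the earlier error exercise and compares one step on the total squared error with learning rate $\alpha$ against one step on the mean squared error with learning rate $\alpha m$. The starting weights and the value of `alpha` are arbitrary assumptions.

```python
import math

xs = [2, 5, -4, -7, 8]
ys = [-2, 6, -4, 1, 14]
m = len(xs)


def total_gradient(w1, w2):
    """Gradient of the total squared error: sum of 1/2 * (y - y_hat)^2."""
    g1 = sum(-(y - (w1 * x + w2)) * x for x, y in zip(xs, ys))
    g2 = sum(-(y - (w1 * x + w2)) for x, y in zip(xs, ys))
    return g1, g2


def mean_gradient(w1, w2):
    """Gradient of the mean squared error: the total gradient divided by m."""
    g1, g2 = total_gradient(w1, w2)
    return g1 / m, g2 / m


w1, w2, alpha = 1.2, 2.0, 0.01

t1, t2 = total_gradient(w1, w2)
step_total = (w1 - alpha * t1, w2 - alpha * t2)

a1, a2 = mean_gradient(w1, w2)
step_mean = (w1 - (alpha * m) * a1, w2 - (alpha * m) * a2)

print(step_total)
print(step_mean)
```

Up to floating-point rounding, both steps land on the same weights.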

### Batch vs Stochastic Gradient Descent
At this point, it seems that we've seen two ways of doing linear regression.

1. By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
2. By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.

The latter is called batch gradient descent. The former is called stochastic gradient descent. 

The question is, which one is used in practice?

Actually, in most cases, neither. Think about this: if your data is huge, both are computationally slow. The usual way to do linear regression in practice is to split your data into many small batches, each with roughly the same number of points, and then use each batch in turn to update your weights. This is called mini-batch gradient descent.
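A minimal sketch of mini-batch gradient descent for a line, written in plain Python. The function name, the default hyperparameters, and the synthetic noise-free data on $y = 2x + 1$ are all assumptions chosen so the result is easy to check. Each update uses the mean squared-error gradient over one batch.

```python
import random


def minibatch_gradient_descent(xs, ys, alpha=0.1, batch_size=8, epochs=200, seed=0):
    """Fit y = w1*x + w2 by updating the weights once per small batch."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    w1, w2 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # visit the points in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of 1/2 * (y - y_hat)^2 over the points in this batch.
            g1 = sum(-(ys[i] - (w1 * xs[i] + w2)) * xs[i] for i in batch) / len(batch)
            g2 = sum(-(ys[i] - (w1 * xs[i] + w2)) for i in batch) / len(batch)
            w1 -= alpha * g1
            w2 -= alpha * g2
    return w1, w2


# Synthetic data lying exactly on y = 2x + 1.
xs = [i / 10 for i in range(-20, 21)]
ys = [2 * x + 1 for x in xs]

w1, w2 = minibatch_gradient_descent(xs, ys)
print(round(w1, 3), round(w2, 3))
```

Setting `batch_size=1` gives stochastic gradient descent and `batch_size=len(xs)` gives batch gradient descent, so the same loop covers all three variants.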

### Linear Regression in scikit-learn

`from sklearn.linear_model import LinearRegression  `   
`model = LinearRegression()   `   
`model.fit(x_values, y_values)   `   
`print(model.predict([ [127], [248] ])) `    
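The snippet above leaves `x_values` and `y_values` undefined. A self-contained version might look like the following, where the tiny dataset on $y = 2x + 1$ is made up for illustration. scikit-learn expects the inputs as a 2-D array with one row per sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on y = 2x + 1.
x_values = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (n_samples, n_features)
y_values = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(x_values, y_values)

# The fitted slope and intercept correspond to w1 and w2 in the notes above.
print(model.coef_, model.intercept_)

predictions = model.predict([[127], [248]])
print(predictions)
```

With this data the fitted line should recover a slope close to 2 and an intercept close to 1, so the two predictions come out near 255 and 497.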
