# Linear Regression

In regression algorithms, the model finds a relationship between a feature (plotted on the `x` axis) and a continuous target value (plotted on the `y` axis).

<br><img src="assets/images/linear-regression.png" alt="Linear Regression" style="width: 720px"><br>

The idea behind fitting a line to the data is to start with an arbitrary line and measure its distance to the data points. In each iteration the line is moved slightly, trying to reduce the distance to each point.

## Absolute Trick

The Absolute Trick adds a small number to the slope of the line: the horizontal coordinate of a point (`p` in the example image) times a small constant called the **Learning Rate** (`α`). The y-intercept changes as well, by adding 1 times the learning rate. When the point lies below the line, both amounts are subtracted instead of added, so the line always moves toward the point.

<br><img src="assets/images/absolute-trick.png" alt="Absolute Trick" style="width: 720px"><br>
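
The update can be sketched in a few lines of Python. This is only an illustration: the function name `absolute_trick` and the names `w1` (slope), `w2` (y-intercept), and `q` (vertical coordinate of the point) are assumptions made for the example, and the sign flip for points below the line follows the usual convention for this trick.

```python
def absolute_trick(w1, w2, p, q, alpha):
    """Move the line y = w1 * x + w2 one small step toward the point (p, q)."""
    y_hat = w1 * p + w2  # height of the line at x = p
    if q > y_hat:
        # Point is above the line: add to slope and intercept.
        return w1 + alpha * p, w2 + alpha
    if q < y_hat:
        # Point is below the line: subtract instead.
        return w1 - alpha * p, w2 - alpha
    # Point is already on the line: nothing to do.
    return w1, w2


# The line y = 2x + 3 passes through (5, 13), so the point (5, 15) is above it.
w1, w2 = absolute_trick(w1=2.0, w2=3.0, p=5.0, q=15.0, alpha=0.1)
print(w1, w2)
```

Note that the size of the step depends only on `p` and `alpha`, not on how far the point is from the line.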

## Square Trick

The Square Trick is a slight variation of the Absolute Trick. The difference is that both terms added are also multiplied by the vertical distance of the point to the line.

<br><img src="assets/images/square-trick.png" alt="Square Trick" style="width: 720px"><br>
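
A minimal sketch of the Square Trick, using the same hypothetical names as before (`w1` for the slope, `w2` for the y-intercept, `(p, q)` for the point). Because the vertical distance is signed, the update moves the line in the right direction whether the point is above or below it.

```python
def square_trick(w1, w2, p, q, alpha):
    """Move the line y = w1 * x + w2 toward (p, q), scaled by the vertical distance."""
    y_hat = w1 * p + w2
    distance = q - y_hat  # positive if the point is above the line, negative if below
    new_w1 = w1 + alpha * p * distance
    new_w2 = w2 + alpha * distance
    return new_w1, new_w2


# The line y = 2x + 3 predicts 13 at x = 5, so the point (5, 15) is 2 units above it.
w1, w2 = square_trick(w1=2.0, w2=3.0, p=5.0, q=15.0, alpha=0.01)
print(w1, w2)
```

Points far from the line produce larger steps than points close to it, which the Absolute Trick does not do.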

# Gradient Descent

Gradient Descent is a method for moving the line step by step toward the one where the error, as defined by an error function, is at its minimum.

<br><img src="assets/images/gradient-descent.png" alt="Gradient Descent" style="width: 720px"><br>

## Mean Absolute Error (MAE)

The Mean Absolute Error is one of the most common error functions. The error is the average of the absolute values of the vertical distances from the points to the line.

<br><img src="assets/images/mean-absolute-error.png" alt="Mean Absolute Error" style="width: 720px"><br>
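
As a rough illustration, the MAE of a line `y = w1 * x + w2` over a small dataset can be computed like this. The function name and the sample data are made up for the example.

```python
def mean_absolute_error(xs, ys, w1, w2):
    """Average absolute vertical distance between the points and the line."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = w1 * x + w2
        total += abs(y - y_hat)
    return total / len(xs)


# Hypothetical data: the line y = 2x hits the first two points and misses the third by 1.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
mae = mean_absolute_error(xs, ys, w1=2.0, w2=0.0)
print(mae)
```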

## Mean Squared Error (MSE)

The Mean Squared Error function is similar to the MAE, but instead of the absolute value, we use the squared difference. There's also an extra 1/2 factor multiplying the whole mean, a convenience that cancels the 2 produced when taking the derivative.

<br><img src="assets/images/mean-squared-error.png" alt="Mean Squared Error" style="width: 720px"><br>
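
A comparable sketch for the MSE, including the 1/2 factor described above. Again, the function name and the data are only illustrative.

```python
def mean_squared_error(xs, ys, w1, w2):
    """Half of the average squared vertical distance between the points and the line."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = w1 * x + w2
        total += (y - y_hat) ** 2
    return total / (2 * len(xs))


# Same hypothetical data as a quick check: only the third point is off, by 1.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
mse = mean_squared_error(xs, ys, w1=2.0, w2=0.0)
print(mse)
```

Squaring makes large distances count much more than small ones, which is one reason the MSE reacts more strongly to points far from the line than the MAE does.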


There is more than one way to do Linear Regression.

The first one consists of applying Gradient Descent to all points, _one-by-one_ and is called **Stochastic Gradient Descent**. In this, we calculate the error for the first point and update the weights of the model. Then we repeat the process for the remaining data points.

The second, **Batch Gradient Descent**, applies the trick to all points _at the same time_, sums the calculated values, and applies the total to the model as a single update.

In practice, neither is commonly used, because both are computationally slow on large datasets. What is used instead is **Mini-Batch Gradient Descent**, which splits the dataset into small batches and updates the weights once per batch.
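
The following is a minimal sketch of Mini-Batch Gradient Descent in plain Python, assuming a single feature and a squared-error update averaged over each batch. The function name, the default `batch_size`, `alpha`, `epochs`, and `seed` values, and the toy data are all assumptions chosen for the example, not recommended settings.

```python
import random


def mini_batch_gradient_descent(xs, ys, batch_size=5, alpha=0.5, epochs=1000, seed=0):
    """Fit y = w1 * x + w2 by updating the weights once per mini-batch."""
    rng = random.Random(seed)  # fixed seed so the shuffling is reproducible
    w1, w2 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the points in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            step_w1 = 0.0
            step_w2 = 0.0
            for i in batch:
                error = ys[i] - (w1 * xs[i] + w2)  # signed vertical distance
                step_w1 += error * xs[i]
                step_w2 += error
            # Average the per-point updates and apply them in one go.
            w1 += alpha * step_w1 / len(batch)
            w2 += alpha * step_w2 / len(batch)
    return w1, w2


# Toy data generated from y = 2x + 1, so the fit should land close to w1 = 2, w2 = 1.
xs = [i / 10 for i in range(10)]
ys = [2 * x + 1 for x in xs]
w1, w2 = mini_batch_gradient_descent(xs, ys)
print(round(w1, 3), round(w2, 3))
```

Setting `batch_size=1` turns this into Stochastic Gradient Descent, and setting it to `len(xs)` turns it into Batch Gradient Descent.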

## N-Dimensions

* In a 2D problem (1 input, 1 output), the prediction is a _line_.
* In a 3D problem (2 inputs, 1 output), the prediction is a _plane_.

In an n-dimensional problem (n-1 inputs, 1 output), the prediction is an (n-1)-dimensional _hyperplane_. Since this is hard to picture, we work with the linear equation instead:

$$
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_{n-1} x_{n-1} + w_n
$$

<br><img src="assets/images/n-dimensions.png" alt="N-Dimensions" style="width: 720px">
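
In code, the equation is just a weighted sum of the inputs plus a constant. In this small sketch, `weights` holds `w_1` to `w_{n-1}` and `bias` stands in for `w_n`; both names and the sample numbers are assumptions made for the example.

```python
def predict(weights, bias, features):
    """Evaluate y_hat = w_1 * x_1 + ... + w_{n-1} * x_{n-1} + w_n for one data point."""
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per input feature")
    return sum(w * x for w, x in zip(weights, features)) + bias


# A 3D problem (2 inputs, 1 output): 2 * 3 + (-1) * 4 + 0.5
y_hat = predict(weights=[2.0, -1.0], bias=0.5, features=[3.0, 4.0])
print(y_hat)
```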

## Linear Regression Warnings

Linear regression comes with a set of implicit assumptions and is not the best model for every situation. Here are a couple of issues that you should watch out for.

### Linear Regression Works Best When the Data is Linear

Linear regression produces a straight line model from the training data. If the relationship in the training data is not really linear, you'll need to either make adjustments (transform your training data), add features (we'll come to this next), or use another kind of model.

![Quadratic Linear Regression](assets/images/quadratic-linear-regression.png)

### Linear Regression is Sensitive to Outliers

Linear regression tries to find a 'best fit' line among the training data. If your dataset has some outlying extreme values that don't fit a general pattern, they can have a surprisingly large effect.

In this first plot, the model fits the data pretty well.

![Linear Regression Without Outliers](assets/images/linear-regression-no-outliers.png)

However, adding a few points that are outliers and don't fit the pattern really changes the way the model predicts.

![Linear Regression With Outliers](assets/images/linear-regression-outliers.png)

In most circumstances, you'll want a model that fits most of the data most of the time, so watch out for outliers!
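
To get a feel for the size of the effect, here is a small sketch that fits a line with the closed-form least-squares formulas for a single feature (used here instead of Gradient Descent only to keep the example short) and compares the slope with and without one extreme point. The data is invented for the illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Five points that lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
slope_clean, _ = fit_line(xs, ys)

# The same points plus a single outlier at (6, 40).
slope_outlier, _ = fit_line(xs + [6.0], ys + [40.0])

print(slope_clean, slope_outlier)
```

In this toy dataset a single outlier triples the fitted slope, from about 2 to about 6.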

## Polynomial Regression

When a line doesn't fit the data well, we can use higher-degree polynomials and apply the same techniques!

<br><img src="assets/images/polynomial-regression.png" alt="Polynomial Regression" style="width: 720px">
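
One way to read "the same techniques" is that each power of `x` is treated as an extra input feature, after which the fitting procedure is unchanged. The sketch below does this with plain batch gradient descent; the function names, the learning rate `alpha`, the number of `steps`, and the toy data are assumptions chosen so the example converges, not general recommendations.

```python
def polynomial_features(x, degree):
    """Expand a single input x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]


def fit_polynomial(xs, ys, degree, alpha=0.5, steps=2000):
    """Fit y = bias + sum_j weights[j] * x^(j+1) with batch gradient descent."""
    rows = [polynomial_features(x, degree) for x in xs]
    weights = [0.0] * degree
    bias = 0.0
    m = len(xs)
    for _ in range(steps):
        # Signed errors of the current model on every point.
        errors = [
            y - (sum(w * f for w, f in zip(weights, row)) + bias)
            for row, y in zip(rows, ys)
        ]
        # Same update as for a line, applied to every polynomial feature.
        for j in range(degree):
            weights[j] += alpha * sum(e * row[j] for e, row in zip(errors, rows)) / m
        bias += alpha * sum(errors) / m
    return weights, bias


# Toy data generated from y = 3x^2 - 2x + 1 on inputs between -1 and 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [3 * x ** 2 - 2 * x + 1 for x in xs]
weights, bias = fit_polynomial(xs, ys, degree=2)
print([round(w, 3) for w in weights], round(bias, 3))
```

The inputs are kept in a small range on purpose: high powers of large values grow quickly and can make gradient descent unstable unless the features are scaled first.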

## Regularization

Regularization is a technique for improving classification and regression models by making them less likely to overfit.

<br><img src="assets/images/regularization.png" alt="Regularization" style="width: 720px"><br>

The idea behind this is to take the complexity of the model into account when measuring the error.

### L1 Regularization

L1 Regularization adds the absolute values of the coefficients to the error.

<br>
<img src="assets/images/l1-simple.png" alt="L1 Simple" style="float: left; width: 480px">
<img src="assets/images/l1-complex.png" alt="L1 Complex" style="float: right; width: 480px">
<div style="clear: both"></div>
<br>
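
A small sketch of the L1 term on its own. The coefficient lists are hypothetical and leave out the constant term, which is an assumption of this example (the constant is commonly excluded from the penalty).

```python
def l1_penalty(coefficients):
    """Sum of the absolute values of the model coefficients."""
    return sum(abs(c) for c in coefficients)


# Hypothetical models: a simple one with two coefficients and a complex one with six.
simple_model = [3.0, 4.0]
complex_model = [2.0, -2.0, -4.0, 3.0, 6.0, 4.0]

print(l1_penalty(simple_model), l1_penalty(complex_model))
```

The complex model pays a larger penalty simply because it has more, and larger, coefficients.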

### L2 Regularization

L2 Regularization adds the squares of the coefficients to the error.

<br>
<img src="assets/images/l2-simple.png" alt="L2 Simple" style="float: left; width: 480px">
<img src="assets/images/l2-complex.png" alt="L2 Complex" style="float: right; width: 480px">
<div style="clear: both"></div>
<br>
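
The L2 term can be sketched the same way. As before, the coefficient lists are hypothetical and the constant term is left out by assumption.

```python
def l2_penalty(coefficients):
    """Sum of the squared model coefficients."""
    return sum(c ** 2 for c in coefficients)


# Hypothetical models: a simple one with two coefficients and a complex one with six.
simple_model = [3.0, 4.0]
complex_model = [2.0, -2.0, -4.0, 3.0, 6.0, 4.0]

print(l2_penalty(simple_model), l2_penalty(complex_model))
```

Because the coefficients are squared, large coefficients are penalized far more heavily than small ones.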

Simple models have the advantage of being less prone to overfitting, while more complex ones may produce less error. Because of this, it's important to know when each should be used. Some systems can accept complexity, while others don't mind a little error:

<br><img src="assets/images/simple-complex.png" alt="Simple vs Complex" style="width: 720px"><br>

### Lambda

To avoid punishing a model's complexity too much or too little, we multiply the regularization term by a **Lambda** parameter (`λ`). Small values favor complex models, while large values reward simple ones.

<br>
<img src="assets/images/small-lambda.png" alt="Small Lambda" style="float: left; width: 480px">
<img src="assets/images/large-lambda.png" alt="Large Lambda" style="float: right; width: 480px">
<div style="clear: both"></div>
<br>

<br><img src="assets/images/l1-l2.png" alt="L1 vs L2" style="width: 720px">
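
The effect of Lambda can be illustrated with a short sketch that adds `lambda_ * penalty` to a model's error, using the L1 penalty here. The error values and coefficients below are invented purely for the example, and `lambda_` carries a trailing underscore only because `lambda` is a reserved word in Python.

```python
def regularized_error(error, coefficients, lambda_):
    """Model error plus lambda times the L1 penalty of its coefficients."""
    penalty = sum(abs(c) for c in coefficients)
    return error + lambda_ * penalty


# Hypothetical models: the complex one fits better but has many more coefficients.
simple = {"error": 2.0, "coefficients": [3.0, 4.0]}
complex_ = {"error": 0.5, "coefficients": [2.0, -2.0, -4.0, 3.0, 6.0, 4.0]}

for lambda_ in (0.01, 1.0):
    simple_total = regularized_error(simple["error"], simple["coefficients"], lambda_)
    complex_total = regularized_error(complex_["error"], complex_["coefficients"], lambda_)
    winner = "simple" if simple_total < complex_total else "complex"
    print(lambda_, round(simple_total, 2), round(complex_total, 2), winner)
```

With the small Lambda the complex model has the lower total error, while with the large Lambda the penalty dominates and the simple model wins.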

## Quizzes

01. [Mini-Batch Gradient Descent](quizes/mini-batch-gradient-descent/mini-batch-gradient-descent.ipynb)
02. [Linear Regression in Scikit-Learn](quizes/linear-regression-scikit-learn/linear-regression-scikit-learn.ipynb)
03. [Multiple Linear Regression](quizes/multiple-linear-regression/multiple-linear-regression.ipynb)
04. [Polynomial Regression](quizes/polynomial-regression/polynomial-regression.ipynb)
05. [Regularization](quizes/regularization/regularization.ipynb)
06. [Feature Scaling](quizes/feature-scaling/feature-scaling.ipynb)