# Getting Started with Linear Models

After such an unbelievably overkill introduction to ML involving TensorFlow, it's time to learn the basics the right way. I'll admit it, the hype train got me. BUT... Now this part of the learning path is going to be incredibly easy.

As you know, the basis of linear regression follows this equation, with `y_hat` being the predicted value:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mrow data-mjx-texclass="ORD">
    <mover>
      <mi>y</mi>
      <mo stretchy="false">^</mo>
    </mover>
  </mrow>
  <mo stretchy="false">(</mo>
  <mi>w</mi>
  <mo>,</mo>
  <mi>x</mi>
  <mo stretchy="false">)</mo>
  <mo>=</mo>
  <msub>
    <mi>w</mi>
    <mn>0</mn>
  </msub>
  <mo>+</mo>
  <msub>
    <mi>w</mi>
    <mn>1</mn>
  </msub>
  <msub>
    <mi>x</mi>
    <mn>1</mn>
  </msub>
  <mo>+</mo>
  <mo>.</mo>
  <mo>.</mo>
  <mo>.</mo>
  <mo>+</mo>
  <msub>
    <mi>w</mi>
    <mi>p</mi>
  </msub>
  <msub>
    <mi>x</mi>
    <mi>p</mi>
  </msub>
</math>
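
To make the equation concrete, here is a minimal NumPy sketch of the prediction step. The weights and feature values are made up for illustration (they do not come from any fitted model), and `w0`, `w`, and `X` are just names I picked to mirror the symbols above.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration.
w0 = 1.0                     # intercept w_0
w = np.array([2.0, -0.5])    # coefficients w_1, w_2 (so p = 2)

# Two samples, each with p = 2 features (x_1, x_2).
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# y_hat = w_0 + w_1 * x_1 + ... + w_p * x_p, computed for every row of X.
y_hat = w0 + X @ w
print(y_hat)  # [2. 5.]
```

The matrix product `X @ w` does the sum over features for each row, so the whole equation becomes a single line.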

# Ordinary Least Squares

`LinearRegression` fits a linear model with coefficients <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>w</mi>
  <mo>=</mo>
  <mo stretchy="false">(</mo>
  <msub>
    <mi>w</mi>
    <mn>1</mn>
  </msub>
  <mo>,</mo>
  <mo>.</mo>
  <mo>.</mo>
  <mo>.</mo>
  <mo>,</mo>
  <msub>
    <mi>w</mi>
    <mi>p</mi>
  </msub>
  <mo stretchy="false">)</mo>
</math> to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Mathematically, it solves a problem of the form:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <munder>
    <mo data-mjx-texclass="OP" movablelimits="true">min</mo>
    <mrow data-mjx-texclass="ORD">
      <mi>w</mi>
    </mrow>
  </munder>
  <mrow data-mjx-texclass="ORD">
    <mo stretchy="false">|</mo>
  </mrow>
  <mrow data-mjx-texclass="ORD">
    <mo stretchy="false">|</mo>
  </mrow>
  <mi>X</mi>
  <mi>w</mi>
  <mo>&#x2212;</mo>
  <mi>y</mi>
  <mrow data-mjx-texclass="ORD">
    <mo stretchy="false">|</mo>
  </mrow>
  <msubsup>
    <mo stretchy="false">|</mo>
    <mn>2</mn>
    <mn>2</mn>
  </msubsup>
</math>

![Alt image](images/sphx_glr_plot_ols_ridge_001.png)
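
As a sanity check on the minimization problem, the sketch below solves it directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data is synthetic and the "true" weights are assumptions made up for the demo. A column of ones is appended to `X` so that the intercept becomes part of the weight vector.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): 50 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first weight acts as the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Solve min_w ||X1 w - y||_2^2 directly.
w_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The same problem, solved by scikit-learn.
reg = LinearRegression().fit(X, y)

# Residual sum of squares at the least-squares solution.
rss = np.sum((X1 @ w_hat - y) ** 2)

print("lstsq intercept, coefs:", w_hat[0], w_hat[1:])
print("sklearn intercept, coefs:", reg.intercept_, reg.coef_)
print("RSS:", rss)
```

Both routes should agree up to floating-point error, since they minimize the same objective.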

`LinearRegression` takes in its `fit` method arguments `X`, `y`, `sample_weight` and stores the coefficients <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>w</mi>
</math> of the linear model in its `coef_` and `intercept_` attributes.

(Hey, doesn't this all make a lot more sense after reading the textbook???)

In [1]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [2]:
reg.coef_

array([0.5, 0.5])

In [3]:
reg.intercept_

np.float64(4.440892098500626e-16)

The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and some columns of the design matrix *X* have an approximately linear dependence, the design matrix becomes close to singular. As a result, the least squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of **multicollinearity** can arise, for example, when data are collected without an experimental design.
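
A rough way to see this effect is to refit OLS many times on targets with fresh noise and watch how much one coefficient moves around. The sketch below is my own toy setup, not the scikit-learn example: `coef_spread` is a hypothetical helper, and the noise scales and sample sizes are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)

def coef_spread(x2, n_repeats=200):
    """Return the condition number of X = [x1, x2] and the standard
    deviation of the fitted x1 coefficient over repeated noisy targets."""
    X = np.column_stack([x1, x2])
    coefs = []
    for _ in range(n_repeats):
        # Same underlying relationship each time, fresh noise in the target.
        y = x1 + x2 + rng.normal(scale=0.5, size=n)
        coefs.append(LinearRegression().fit(X, y).coef_[0])
    return np.linalg.cond(X), np.std(coefs)

# Case 1: second feature drawn independently of the first.
cond_indep, spread_indep = coef_spread(rng.normal(size=n))

# Case 2: second feature is almost a copy of the first (near-collinear).
cond_coll, spread_coll = coef_spread(x1 + 0.01 * rng.normal(size=n))

print(f"independent:    cond={cond_indep:.1f}, coef std={spread_indep:.3f}")
print(f"near-collinear: cond={cond_coll:.1f}, coef std={spread_coll:.3f}")
```

In the near-collinear case the condition number of the design matrix is much larger, and the fitted coefficient swings far more from one noisy target to the next.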

## Example: OLS and Ridge Regression

1. Ordinary Least Squares: We illustrate how to use the OLS model, `LinearRegression`, on a single feature of the diabetes dataset. We train on a subset of the data, evaluate on a test set, and visualize the predictions.
1. OLS and Ridge Regression Variance: We then show how OLS can have high variance when the data is sparse or noisy, by repeatedly fitting on a very small synthetic sample. Ridge regression, `Ridge`, reduces this variance by penalizing (shrinking) the coefficients, leading to more stable predictions.
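
The second point can be sketched in a few lines. This is a simplified version under my own assumptions, not the official example code: a two-point training set, Gaussian noise added to the feature on every repeat, and an arbitrary `alpha=0.1` for `Ridge`. `slope_spread` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# A very small training set: two points on the line y = x.
X_base = np.array([[0.5], [1.0]])
y = np.array([0.5, 1.0])

def slope_spread(model, n_repeats=200):
    """Refit `model` on noisy copies of X_base and return the
    standard deviation of the fitted slope."""
    slopes = []
    for _ in range(n_repeats):
        X_noisy = X_base + 0.1 * rng.normal(size=X_base.shape)
        slopes.append(model.fit(X_noisy, y).coef_[0])
    return np.std(slopes)

ols_spread = slope_spread(LinearRegression())
ridge_spread = slope_spread(Ridge(alpha=0.1))

print(f"OLS slope std:   {ols_spread:.3f}")
print(f"Ridge slope std: {ridge_spread:.3f}")
```

With only two points, OLS passes a line exactly through them, so a small nudge in the feature values changes the slope a lot. The ridge penalty damps that reaction, at the cost of a slope that is biased toward zero.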

Before we continue, let's refer to the textbook on **Ridge Regression**.

# Ridge Regression

(Refer to page 248)

Recall from the earlier notes (found in the other linear regression repository) that the Residual Sum of Squares (RSS) follows this formula:

![Alt image](images/rss_formula.png)

**Ridge regression** is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates are the values that minimize

![Alt image](images/ridge_regression_formula.png)

where λ ≥ 0 is a **tuning parameter**, to be determined separately. The above equation trades off two different criteria. As with least squares, ridge regression seeks estimates that fit the data well by making the RSS small. However, the second term, called a **shrinkage penalty**, is small when the coefficients β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of `βj` towards zero. The tuning parameter serves to control the relative impact of these two terms on the regression coefficient estimates.
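
To watch the tuning parameter at work, the sketch below computes ridge estimates for several values of λ and prints the size of the coefficient vector. It makes two simplifying assumptions: there is no intercept, and the estimates come from the closed form (XᵀX + λI)⁻¹Xᵀy. `ridge_closed_form` is a hypothetical helper, and the data is synthetic. In scikit-learn, the `alpha` parameter of `Ridge` plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only), generated without an intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=30)

def ridge_closed_form(X, y, lam):
    """Minimize RSS + lam * sum(beta_j ** 2), with no intercept term."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:6.1f}  beta={np.round(beta, 3)}  norm={norms[-1]:.3f}")

# Cross-check one value of lambda against scikit-learn's Ridge.
sk_beta = Ridge(alpha=10.0, fit_intercept=False).fit(X, y).coef_
```

At λ = 0 the estimates are the plain least squares ones. As λ grows, the penalty dominates and the coefficient vector shrinks towards zero.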

In other words, **small coefficients get gentle treatment**: if a feature has a small but potentially useful relationship with the response, ridge shrinks its coefficient rather than aggressively eliminating it.

**Ridge regression is more useful than Ordinary Least Squares when dealing with multicollinearity**.

**We'll resume this later after we get through Chapter 6: Linear Model Selection and Regularization**.