# [Linear regression from scratch](http://gluon.mxnet.io/chapter02_supervised-learning/linear-regression-scratch.html)

Powerful ML libraries can eliminate repetitive work, but if you rely too much on abstractions, you might never learn how neural networks really work under the hood.  
So for this first example, let’s get our hands dirty and build everything from scratch, relying only on `autograd` and `NDArray`.  
First, we’ll import the same dependencies as in the [autograd chapter](http://gluon.mxnet.io/chapter01_crashcourse/autograd.html).  
We’ll also import the powerful `gluon` package, but in this chapter we’ll only use it for data loading.

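```python
# This notebook runs on Python 3; the __future__ import keeps print() behaving the same way under Python 2.
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd, gluon

# Fix the random seed so that results are reproducible.
mx.random.seed(1)
```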

## Set the context

We’ll also want to specify the contexts where computation should happen.  
This tutorial is so simple that you could probably run it on a calculator watch.  
But to develop good habits, we’re going to specify two contexts: one for data and one for our models.

```python
data_ctx = mx.cpu()
model_ctx = mx.cpu()
```

## Linear regression

To get our feet wet, we'll start off by looking at the problem of regression.  
This is the task of predicting a *real-valued target* $y$ given a data point $x$.  
In linear regression, the simplest and still perhaps the most useful approach, we assume that prediction can be expressed as a *linear* combination of the input features (thus giving the name *linear* regression):  
$$\hat{y} = w_1 \cdot x_1 + \dots + w_d \cdot x_d + b$$  
Given a collection of data points $X$ and corresponding target values $\boldsymbol{y}$, we'll try to find the *weight* vector $\boldsymbol{w}$ and bias term $b$ (also called an *offset* or *intercept*) that approximately associate data points $\boldsymbol{x}_i$ with their corresponding labels $y_i$.  
Using slightly more advanced math notation, we can express the predictions $\boldsymbol{\hat{y}}$ corresponding to a collection of datapoints $X$ via the matrix-vector product:  
$$\boldsymbol{\hat{y}} = X \boldsymbol{w} + b$$  
Before we can get going, we will need two more things:  
* Some way to measure the quality of the current model  
* Some way to manipulate the model to improve its quality
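
The matrix-vector form of the prediction above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: it uses NumPy as a stand-in so that it runs without MXNet (`nd.dot` plays the same role for `NDArray`), and the values of `X`, `w`, and `b` are made up.

```python
import numpy as np

# Hypothetical toy data: 3 data points with 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([2.0, -1.0])  # one weight per feature (made-up values)
b = 0.5                    # scalar bias, broadcast to every data point

# Matrix-vector form: all predictions at once.
y_hat = X.dot(w) + b

# Per-example form: w_1 * x_1 + ... + w_d * x_d + b, one data point at a time.
y_hat_loop = np.array(
    [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b for x_i in X]
)

print(y_hat)       # predictions from the matrix-vector product
print(y_hat_loop)  # the same predictions, computed feature by feature
```

Both forms produce the same predictions; the matrix-vector version is simply more compact and lets the library do the looping.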

### Square loss

In order to say whether we’ve done a good job, we need some way to measure the quality of a model.  
Generally, we will define a loss function that says how far our predictions are from the correct answers.  
For the classical case of linear regression, we usually focus on the squared error.  
Specifically, our loss will be the sum, over all examples, of the squared error $(\hat{y}_i - y_i)^2$ on each:  
$$\ell(y, \hat{y}) = \sum_{i=1}^n (\hat{y}_i-y_i)^2.$$  
For one-dimensional data, we can easily visualize the relationship between our single feature and the target variable. It’s also easy to visualize a linear predictor and its error on each example.  
Note that squared loss heavily penalizes outliers.  
For the visualized predictor below, the lone outlier would contribute most of the loss.

![linear_regression](img/linear_regression.png)
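
To make the effect of an outlier concrete, here is a small sketch of the squared loss. It is an illustration under stated assumptions: NumPy stands in for `NDArray`, the helper name `squared_loss` is hypothetical, and the targets and predictions are invented, with the last example playing the role of the outlier.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Sum over all examples of the squared error (y_hat_i - y_i)^2."""
    return np.sum((y_hat - y) ** 2)

# Made-up targets and predictions; the last example is an outlier.
y = np.array([1.0, 2.0, 3.0, 10.0])
y_hat = np.array([1.5, 2.5, 3.5, 4.0])

per_example = (y_hat - y) ** 2   # squared error of each example
total = squared_loss(y_hat, y)   # the loss defined above
outlier_share = per_example[-1] / total

print(per_example)
print(total)
print(outlier_share)
```

In this toy case, three of the four predictions are off by only 0.5, yet the single outlier accounts for nearly all of the total loss.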

### Manipulating the model

For us to minimize the error, we need some mechanism to alter the model.  
We do this by choosing values of the *parameters* $\boldsymbol{w}$ and $b$.  
This is the only job of the learning algorithm.  
We take as given the training data ($X$, $\boldsymbol{y}$) and the functional form of the model $\boldsymbol{\hat{y}} = X\boldsymbol{w} + b$.  
Learning then consists of choosing the best possible $\boldsymbol{w}$ and $b$ based on the available evidence.
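
As a deliberately naive sketch of what "choosing the best parameters" means, the snippet below scores a handful of hand-picked candidates by their squared loss and keeps the one with the lowest loss. Everything here is an assumption for illustration: NumPy stands in for `NDArray`, the data and candidate values are invented, and picking from a fixed list is not how a real learning algorithm searches for parameters.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Sum of squared errors over all examples."""
    return np.sum((y_hat - y) ** 2)

# Made-up one-dimensional data that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Hypothetical candidate parameters, each a (w, b) pair.
candidates = [
    (np.array([0.0]), 0.0),
    (np.array([1.0]), 1.0),
    (np.array([2.0]), 1.0),
    (np.array([3.0]), 0.0),
]

# Score every candidate with the model y_hat = Xw + b.
losses = [squared_loss(X.dot(w) + b, y) for w, b in candidates]

# Keep the candidate with the smallest loss.
best_w, best_b = candidates[int(np.argmin(losses))]

print(losses)
print(best_w, best_b)
```

The candidate that matches the rule used to generate the data reaches a loss of zero, so it is the one selected.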

Matters of provenance aside, you might wonder: if Legendre and Gauss worked on linear regression, does that mean they were the original deep learning researchers?  
And if linear regression doesn't wholly belong to deep learning, then why are we presenting a linear model as the first example in a tutorial series on neural networks?  
Well, it turns out that we can express linear regression as the simplest possible (useful) neural network.  
A neural network is just a collection of nodes (aka neurons) connected by directed edges.  
In most networks, we arrange the nodes into layers with each feeding its output into the layer above.  
To calculate the value of any node, we first perform a weighted sum of the inputs (according to weights $\boldsymbol{w}$) and then apply an *activation function*.  
For linear regression, we only have two layers: one corresponding to the input (depicted in orange) and a one-node layer (depicted in green) corresponding to the output.  
For the output node the activation function is just the identity function.  
![](img/onelayer.png)  
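
The single output node described above can be sketched as a function that takes a weighted sum of its inputs, adds the bias $b$ from the linear regression formula, and applies an activation function. This is only an illustration: NumPy stands in for `NDArray`, the names `node_output` and `identity` are hypothetical, and the numbers are made up.

```python
import numpy as np

def identity(z):
    """Identity activation: returns its input unchanged."""
    return z

def node_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    return activation(np.dot(inputs, weights) + bias)

# Made-up input features, weights, and bias.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
b = 0.25

# With the identity activation, the node computes w_1*x_1 + ... + w_d*x_d + b.
y_hat = node_output(x, w, b, identity)
print(y_hat)
```

Swapping `identity` for a nonlinear function would turn the same node into a building block of a deeper network, while the weighted sum stays unchanged.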
While you certainly don't have to view linear regression through the lens of deep learning, you can (and we will!).  
To ground the concepts that we just discussed in code, let's actually code up a neural network for linear regression from scratch.