Regression problems pop up whenever we want to predict a numerical value.

%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

Linear regression flows from a few simple assumptions.
First, we assume that the relationship between features $\mathbf{x}$ and target $y$ is approximately linear, i.e., that the conditional mean $E[Y|X = \mathbf{x}]$ can be expressed as a weighted sum of the features $\mathbf{x}$.
This setup allows that the target value may still deviate from its expected value on account of observation noise.
Next, we can impose the assumption that any such noise is well behaved, following a Gaussian distribution.
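These assumptions can be made concrete by generating synthetic data that satisfies them. The following is a minimal sketch, not a definitive implementation: the helper name `synthetic_data`, the ground-truth values `true_w` and `true_b`, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def synthetic_data(w, b, num_examples, noise_std=0.01, seed=0):
    """Hypothetical helper: draw features, then set y = Xw + b + Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Features drawn from a standard normal distribution (an arbitrary choice).
    X = rng.normal(0.0, 1.0, size=(num_examples, len(w)))
    # Observation noise: zero-mean Gaussian, as assumed in the text.
    noise = rng.normal(0.0, noise_std, size=num_examples)
    y = X @ w + b + noise
    return X, y

true_w = np.array([2.0, -3.4])  # assumed ground-truth weights
true_b = 4.2                    # assumed ground-truth bias
X, y = synthetic_data(true_w, true_b, num_examples=1000)

# The deviation of each target from its conditional mean is exactly the noise.
residual = y - (X @ true_w + true_b)
```

Because the conditional mean is linear in the features, the residuals here have mean close to zero and standard deviation close to `noise_std`.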

At the heart of every solution is a model that describes how features can be transformed into an estimate of the target.
The assumption of linearity means that the expected value of the target can be expressed as a weighted sum of the features.

(3.1.1) $y = w_1 \cdot x_1 + w_2 \cdot x_2 + b$

The weights determine the influence of each feature on our prediction.
The bias determines the value of the estimate when all features are zero.
We need the bias because it allows us to express all linear functions of our features (rather than restricting us to lines that pass through the origin).

Strictly speaking, (3.1.1) is an affine transformation of input features, which is characterized by a linear transformation of features via a weighted sum, combined with a translation via the added bias.
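As a small illustration of the affine transformation, the model can be applied to a whole batch of examples at once as a matrix-vector product plus the bias. The function name `linreg` and the numbers below are assumptions made for this sketch.

```python
import numpy as np

def linreg(X, w, b):
    """Affine map: weighted sum of the features in each row of X, plus the bias."""
    return X @ w + b

w = np.array([0.5, -1.0])  # illustrative weights
b = 2.0                    # illustrative bias

# Three examples with two features each; the second has all features zero.
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [4.0, 1.0]])

y_hat = linreg(X, w, b)
# y_hat is [0.5, 2.0, 3.0]; the all-zero example is mapped to the bias itself.
```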
Given a dataset, our goal is to choose the weights $\mathbf{w}$ and the bias $b$ that, on average, make our model's predictions fit the true target values observed in the data as closely as possible.

Given a training dataset $\mathbf{X}$ (where $\mathbf{X}$ contains one row for every example, and one column for every feature) and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the smallest error.

Before we can go about searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things: (i) a measure of the quality of some given model; and (ii) a procedure for updating the model to improve its quality.

Naturally, fitting our model to the data requires that we agree on some measure of fitness (or, equivalently, of unfitness).
Loss functions quantify the distance between the real and predicted values of the target.
The loss will usually be a nonnegative number where smaller values are better and perfect predictions incur a loss of 0.
For regression problems, the most common loss is the squared error.
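A minimal sketch of the squared error follows. The factor of 1/2 is a common convention that simplifies the derivative; it is an assumption of this sketch, as are the function name `squared_loss` and the example values.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared error, scaled by 1/2 (a convention assumed here)."""
    return 0.5 * (y_hat - y) ** 2

y = np.array([3.0, -1.0, 2.0])      # observed targets
y_hat = np.array([2.0, -1.0, 4.0])  # model predictions

per_example = squared_loss(y_hat, y)  # [0.5, 0.0, 2.0]
mean_loss = per_example.mean()        # average loss over the dataset
```

The perfect prediction (second example) incurs a loss of 0, and the loss grows quadratically with the size of the error.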

Unlike most of the models that we will cover, linear regression presents us with a surprisingly easy optimization problem. In particular, we can find the optimal parameters (as assessed on the training data) analytically by applying a simple formula as follows. First, we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column to the design matrix consisting of 1s. Then our prediction problem is to minimize $||\mathbf{y} - \mathbf{Xw}||^2$. As long as the design matrix $\mathbf{X}$ has full rank (no feature is linearly dependent on the others), then there will be just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields: ...
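The analytic approach can be sketched numerically. A standard least-squares result is that the minimizer satisfies the normal equations $\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}$. The sketch below solves them on synthetic data and cross-checks against `np.linalg.lstsq`; the data and the ground-truth values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b = np.array([2.0, -3.4]), 4.2  # assumed ground truth
X = rng.normal(size=(200, 2))
y = X @ true_w + true_b + rng.normal(0.0, 0.01, size=200)

# Subsume the bias into the weights by appending a column of ones.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the normal equations (X^T X) theta = X^T y. This requires X_aug
# to have full column rank so that X^T X is invertible.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

w_hat, b_hat = theta[:-1], theta[-1]
```

In practice a dedicated least-squares solver is usually preferable to forming $\mathbf{X}^\top \mathbf{X}$ explicitly, since it tends to be more stable numerically.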

While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude almost all exciting aspects of deep learning.

Fortunately, even in cases where we cannot solve the models analytically, we can still often train models effectively in practice. Moreover, for many tasks, those hard-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.

The key technique for optimizing nearly every deep learning model consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called gradient descent.

The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update, even if the update steps might be very powerful. Even worse, if there is a lot of redundancy in the training data, the benefit of a full update is limited.
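The full-batch variant might look like the following sketch for linear regression with the averaged squared loss. The learning rate `lr`, the number of steps, and the synthetic data are assumptions; note that every single update touches all examples.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b = np.array([2.0, -3.4]), 4.2  # assumed ground truth
X = rng.normal(size=(500, 2))
y = X @ true_w + true_b + rng.normal(0.0, 0.01, size=500)

w, b = np.zeros(2), 0.0  # initial parameters
lr = 0.1                 # learning rate (an assumed value)

for step in range(200):
    err = X @ w + b - y             # prediction errors on the ENTIRE dataset
    grad_w = X.T @ err / len(y)     # gradient of mean(0.5 * err**2) w.r.t. w
    grad_b = err.mean()             # gradient of the same loss w.r.t. b
    w -= lr * grad_w                # step against the gradient
    b -= lr * grad_b

final_loss = 0.5 * np.mean((X @ w + b - y) ** 2)
```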

The other extreme is to consider only a single example at a time and to take update steps based on one observation at a time. The resulting algorithm, stochastic gradient descent (SGD), can be an effective strategy, even for large datasets. Unfortunately, SGD has drawbacks, both computational and statistical. One problem arises from the fact that processors are a lot faster at multiplying and adding numbers than they are at moving data from main memory to processor cache. It is up to an order of magnitude more efficient to perform a matrix-vector multiplication than a corresponding number of vector-vector operations. This means that it can take a lot longer to process one sample at a time compared to a full batch. A second problem is that some layers, such as batch normalization, only work well when we have access to more than one observation at a time.
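The computational point can be checked with a rough timing sketch that compares one matrix-vector product against the equivalent sequence of vector-vector products. The array sizes are arbitrary assumptions, and the measured times depend heavily on hardware and library builds, so the numbers should be read as indicative only.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 256))  # 2000 examples, 256 features (arbitrary sizes)
w = rng.normal(size=256)

# One example at a time: 2000 separate vector-vector dot products.
start = time.perf_counter()
loop_result = np.array([X[i] @ w for i in range(X.shape[0])])
loop_seconds = time.perf_counter() - start

# All examples at once: a single matrix-vector product.
start = time.perf_counter()
vec_result = X @ w
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.5f} s, vectorized: {vec_seconds:.5f} s")
```

Both approaches compute the same predictions; only the cost of producing them differs.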

The solution is to pick an intermediate strategy: rather than taking a full batch or only a single sample at a time, we take a minibatch of observations. The specific choice of minibatch size depends on many factors. A number between 32 and 256, preferably a multiple of a large power of 2, is a good start. This leads us to minibatch stochastic gradient descent.
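A minimal sketch of minibatch stochastic gradient descent for linear regression follows. The helper `data_iter`, the hyperparameters (`lr`, `batch_size`, `num_epochs`), and the synthetic data are assumptions chosen for illustration rather than recommended settings.

```python
import numpy as np

def data_iter(batch_size, X, y, rng):
    """Hypothetical helper: yield shuffled minibatches covering the dataset once."""
    indices = rng.permutation(len(y))
    for i in range(0, len(y), batch_size):
        batch = indices[i:i + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
true_w, true_b = np.array([2.0, -3.4]), 4.2  # assumed ground truth
X = rng.normal(size=(1000, 2))
y = X @ true_w + true_b + rng.normal(0.0, 0.01, size=1000)

w, b = np.zeros(2), 0.0
lr, batch_size, num_epochs = 0.03, 32, 10  # assumed hyperparameters

for epoch in range(num_epochs):
    for X_batch, y_batch in data_iter(batch_size, X, y, rng):
        err = X_batch @ w + b - y_batch
        # Gradients of the squared loss averaged over this minibatch only.
        w -= lr * X_batch.T @ err / len(y_batch)
        b -= lr * err.mean()

train_loss = 0.5 * np.mean((X @ w + b - y) ** 2)
```

Each update uses only `batch_size` examples, so the parameters are updated many times per pass over the data while each step still benefits from vectorized computation.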

Linear regression happens to be a learning problem with a global minimum (whenever $\mathbf{X}$ is full rank, or equivalently, whenever $\mathbf{X}^\top\mathbf{X}$ is invertible). However, the loss surfaces for deep networks contain many saddle points and minima. Fortunately, we typically do not care about finding an exact set of parameters but merely any set of parameters that leads to accurate predictions (and thus low loss). In practice, deep learning practitioners seldom struggle to find parameters that minimize loss on training sets. The more formidable task is to find parameters that lead to accurate predictions on previously unseen data, a challenge called generalization.