# 12 Least Squares Data Fitting

<center><img src="figs/02_regression.png" alt="Drawing" width=600px/></center>


#### Unit 1: Vectors, Book ILA Ch. 1-5

#### Unit 2: Matrices, Book ILA Ch. 6-11 + Book IMC Ch. 2

#### Unit 3: Least Squares, Book ILA Ch. 12-14
- 11 Least Squares
- _**12 Least Squares Data Fitting**_
- 13 Least Squares Classification

# Outline: 12 Least Squares Data Fitting

- [Least Square Model Fitting](#sec)
- [Validation](#sec)
- [Feature Engineering](#sec)

# True relationship: $f$



$\color{#EF5645}{\text{Definition}}$: When we believe that a scalar $y$ and an $n$-vector $x$ are related by the model:
$$y ≈ f (x),$$
we use the following vocabulary:
- $x$ is called the independent variable 
- $y$ is called the outcome or response variable
- $f : \mathbb{R}^n \rightarrow \mathbb{R}$ represents the "true" relationship between $x$ and $y$.

Generally, we do not know $f$; we only assume that it exists. Our goal is to learn $f$, or a reasonable approximation of it, from data.

# Data

$\color{#EF5645}{\text{Definition}}$: The data:
$$x^{(1)}, \ldots, x^{(N)}, \quad y^{(1)}, \ldots, y^{(N)}$$
are called observations, examples, samples, or measurements.
- $(x^{(i)}, y^{(i)})$ is the $i$th data pair,
- $x^{(i)}_j$ is the $j$th component of the $i$th data point $x^{(i)}$.


# Model: $\hat{f}$

$\color{#EF5645}{\text{Definition}}$: Given a chosen set of basis functions $f_i: \mathbb{R}^n \rightarrow \mathbb{R}$, for $i=1, \ldots, p$, we model an approximation of $f$ as:
$$\hat{f}(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x),$$
where:
- $\theta_i$ are model parameters that we will learn from the data,
- $\hat{y}^{(i)} = \hat{f}(x^{(i)})$ is (the model’s) prediction of $y^{(i)}$.


$\color{#EF5645}{\text{Remark}}$: If our model is good, then $\hat{y}^{(i)} ≈ y^{(i)}$, i.e., the model is consistent with the observed data.
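
To make the definition concrete, here is a minimal sketch of a model that is linear in its parameters. The basis functions ($1$, $x_1$, $x_1^2$), the parameter values, and the names `basis` and `f_hat` are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical basis functions for n = 1: f_1(x) = 1, f_2(x) = x_1, f_3(x) = x_1^2.
basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[0] ** 2]

def f_hat(x, theta):
    """Evaluate theta_1 f_1(x) + ... + theta_p f_p(x)."""
    return sum(t * f(x) for t, f in zip(theta, basis))

theta = np.array([1.0, 2.0, 3.0])  # assumed parameter values, p = 3
x = np.array([2.0])                # a single 1-vector input
y_hat = f_hat(x, theta)            # 1 + 2*2 + 3*4 = 17
```

The model is linear in $\theta$ even though the basis functions may be nonlinear in $x$.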


# Residuals

$\color{#EF5645}{\text{Definition}}$: Given:
- observations $x^{(1)}, \ldots, x^{(N)}, y^{(1)}, \ldots, y^{(N)}$,
- a model $\hat{f}$ generating predictions $\hat{y}^{(i)} = \hat{f}(x^{(i)})$ of $y^{(i)}$, for $i=1, \ldots, N$,

we define the prediction error, or residual: 
$$r_i = y^{(i)} - \hat{y}^{(i)}.$$
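
As a small illustration, the residuals can be computed elementwise once outcomes and predictions are stored as arrays. The numbers below are made up for the example.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])      # observed outcomes y^(i) (made-up values)
y_hat = np.array([1.5, 1.5, 3.0])  # model predictions (made-up values)

# r_i = y^(i) - y_hat^(i); positive means the model under-predicts.
r = y - y_hat
```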

# Least Square Data Fitting

$\color{#EF5645}{\text{Definition}}$: The Least Squares Data Fitting problem is the problem of choosing the model parameters $\theta_1, \ldots, \theta_p$ that minimize the RMS prediction error on the dataset:
$$\left(\frac{r_1^2 + \cdots + r_N^2}{N}\right)^{1/2}.$$
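
The RMS prediction error can be sketched in a few lines. The helper name `rms` and the residual values are assumptions chosen for illustration.

```python
import numpy as np

def rms(r):
    """Root-mean-square value of a vector of residuals."""
    r = np.asarray(r, dtype=float)
    return np.sqrt(np.mean(r ** 2))

r = np.array([3.0, -4.0, 0.0, 0.0])  # made-up residuals, N = 4
rms_error = rms(r)                   # sqrt((9 + 16) / 4) = 2.5
```

Equivalently, the RMS error is $||r|| / \sqrt{N}$.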

# LS Data Fitting and LS

The Least Squares Data Fitting problem can be formulated as a Least Squares (LS) problem.

$\color{#EF5645}{\text{Notations}}$: We collect the outcomes $y^{(i)}$, the predictions $\hat{y}^{(i)}$, and the residuals $r_i$ into $N$-vectors:
- $y = (y^{(1)}, \ldots, y^{(N)})$ is the vector of outcomes,
- $\hat{y} = (\hat y^{(1)}, \ldots, \hat y^{(N)})$ is the vector of predictions,
- $r = (r_1, \ldots, r_N)$ is the vector of residuals.

$\color{#6D7D33}{\text{Proposition}}$: Define the $N \times p$ matrix $A$ with elements $A_{ij} = f_j(x^{(i)})$, so that $\hat y = A \theta$ and $r = y - A\theta$. The RMS prediction error equals $||r|| / \sqrt{N}$, so minimizing it is equivalent to minimizing $||r||^2$. The least squares data fitting problem therefore amounts to choosing $\theta$ that minimizes:
$$||A\theta - y||^2,$$
which shows that it can be written as a Least Squares problem.
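
The construction of $A$ can be sketched as follows. The helper name `design_matrix`, the basis functions, and the data are illustrative assumptions. The last lines check numerically that $||A\theta - y||^2$ equals $N$ times the squared RMS error.

```python
import numpy as np

def design_matrix(X, basis):
    """Return the N x p matrix with entries A[i, j] = basis[j](X[i])."""
    return np.array([[f(x) for f in basis] for x in X], dtype=float)

# Hypothetical basis for n = 1: f_1(x) = 1, f_2(x) = x_1.
basis = [lambda x: 1.0, lambda x: x[0]]

X = np.array([[0.0], [1.0], [2.0]])  # N = 3 data points, each a 1-vector
y = np.array([1.0, 2.0, 6.0])        # made-up outcomes
theta = np.array([1.0, 2.0])         # assumed parameters, p = 2

A = design_matrix(X, basis)          # shape (3, 2)
y_hat = A @ theta                    # predictions: [1, 3, 5]

r = y - y_hat
rms_error = np.sqrt(np.mean(r ** 2))
sum_sq = np.sum((A @ theta - y) ** 2)  # equals N * rms_error**2
```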

# Solving the LS Data Fitting Problem

$\color{#6D7D33}{\text{Proposition}}$: Consider an LS Data Fitting problem formulated as minimizing $||A \theta - y||^2$. Assuming that the columns of $A$ are linearly independent, the solution is:
$$\hat \theta = (A^TA)^{-1}A^Ty.$$
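
A minimal sketch of this formula for a straight-line fit, assuming the basis functions $f_1(x) = 1$ and $f_2(x) = x$ and made-up data. The normal equations are solved with `np.linalg.solve` instead of forming the inverse explicitly, and the result is compared with NumPy's `np.linalg.lstsq`, which is usually the more numerically robust choice in practice.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])

# Columns of A are the basis functions evaluated on the data: [1, x].
A = np.column_stack([np.ones_like(x), x])

# theta_hat = (A^T A)^{-1} A^T y, computed by solving the normal equations.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# The same problem solved by NumPy's least squares routine.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# At the solution, the residual is orthogonal to the columns of A.
gradient = A.T @ (A @ theta_normal - y)
```

Both approaches should agree up to floating-point error when the columns of $A$ are linearly independent.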