# 12 Least Squares Data Fitting

<center><img src="figs/02_regression.png" alt="Drawing" width=600px/></center>


<center><img src="figs/12_regression.png" alt="Drawing" width=600px/></center>

#### Unit 1: Vectors, Textbook Ch. 1-5

#### Unit 2: Matrices, Textbook Ch. 6-11

#### Unit 3: Least Squares, Textbook Ch. 12-14
- 11 Least Squares
- _**12 Least Squares Data Fitting**_

##### Outline: 12 Least Squares Data Fitting

- [Least Squares Model Fitting](#sec)
- [Application to Regression and Classification](#sec)

### True relationship: $f$



$\color{#EF5645}{\text{Definition}}$: When we believe that a scalar $y$ and an $n$-vector $x$ are related by the model
$$y \approx f(x),$$
we use the following vocabulary:
- $x$ is called the independent variable,
- $y$ is called the outcome or response variable,
- $f : \mathbb{R}^n \rightarrow \mathbb{R}$ represents the "true" relationship between $x$ and $y$.

Generally, we do not know $f$; we only assume that it exists. Our goal is to learn $f$, or a reasonable approximation of it, from data.

### Model: $\hat{f}$

$\color{#EF5645}{\text{Definition}}$: Choosing a set of basis functions $f_j: \mathbb{R}^n \rightarrow \mathbb{R}$, for $j = 1, \ldots, p$, we model a guess or approximation of $f$ as:
$$\hat{f}(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x),$$
where:
- $\theta_j$ are parameters that we learn from the data $x^{(1)}, \ldots, x^{(N)}$ and $y^{(1)}, \ldots, y^{(N)}$,
- $\hat{y}^{(i)} = \hat{f}(x^{(i)})$ is (the model's) prediction of $y^{(i)}$, for $i = 1, \ldots, N$.
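
To make the definition concrete, here is a minimal sketch of evaluating $\hat f$ as a weighted sum of basis functions. The three basis functions and the parameter values are hypothetical choices made for illustration (with $n = 1$), not ones prescribed by the course.

```python
import numpy as np

# Hypothetical basis functions for a scalar input x (n = 1):
# f_1(x) = 1, f_2(x) = x, f_3(x) = x^2.
basis_functions = [
    lambda x: 1.0,
    lambda x: x,
    lambda x: x ** 2,
]

def f_hat(x, theta):
    """Evaluate theta_1 f_1(x) + ... + theta_p f_p(x)."""
    return sum(t * f(x) for t, f in zip(theta, basis_functions))

# Illustrative parameter values (assumed, not learned from data).
theta = np.array([1.0, 2.0, 0.5])

# Prediction at x = 2: 1*1 + 2*2 + 0.5*4 = 7
y_hat = f_hat(2.0, theta)
```

The model is linear in the parameters $\theta_j$ even when the basis functions themselves are nonlinear in $x$.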

$\color{#047C91}{\text{Exercise}}$: What are the basis functions in linear regression?

### Prediction Error (Residual)

$\color{#EF5645}{\text{Remark}}$: Our predictions are $\hat{y}^{(i)} = \hat{f}(x^{(i)})$, for $i=1, \ldots, N$. If our model is good, then $\hat{y}^{(i)} \approx y^{(i)}$ for $i=1, \ldots, N$.


$\color{#EF5645}{\text{Definition}}$: We define the _prediction error_, or _residual_ for each $i=1, ..., N$: 
$$r_i = y^{(i)} - \hat{y}^{(i)}.$$

$\color{#047C91}{\text{Exercise}}$: What is the average prediction error, averaged over the dataset?

### Least Squares Data Fitting


$\color{#EF5645}{\text{Definition}}$: The Least Squares Data Fitting problem is the problem of choosing the model parameters $\theta_1, \ldots, \theta_p$ that minimize the RMS prediction error on the dataset:
$$\left(\frac{r_1^2 + \cdots + r_N^2}{N}\right)^{1/2}.$$
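
As a small numerical illustration, the residuals and the RMS prediction error can be computed as follows. The outcome and prediction values are made up for this sketch.

```python
import numpy as np

# Illustrative values only: outcomes y^(i) and predictions y_hat^(i), N = 4.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Residuals r_i = y^(i) - y_hat^(i)
r = y - y_hat

# RMS prediction error: ((r_1^2 + ... + r_N^2) / N)^(1/2)
rms = np.sqrt(np.mean(r ** 2))

# Equivalent form using the norm: ||r|| / sqrt(N)
rms_norm = np.linalg.norm(r) / np.sqrt(len(r))
```

Because $N$ is fixed, minimizing the RMS error is the same as minimizing the sum of squared residuals $\|r\|^2$.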

$\color{#6D7D33}{\text{Proposition}}$: Define the $N \times p$ matrix $A$ with elements $A_{ij} = f_j(x^{(i)})$, so that the vector of predictions is $\hat y = A \theta$, and let $y = (y^{(1)}, \ldots, y^{(N)})$ be the vector of outcomes. Since the RMS prediction error equals $\|y - A\theta\| / \sqrt{N}$, the least squares data fitting problem amounts to choosing $\theta$ that minimizes:
$$\|A\theta - y\|^2,$$
which is a least squares problem. Assuming that the columns of $A$ are linearly independent, the solution is:
$$\hat \theta = (A^TA)^{-1}A^Ty.$$
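
The proposition can be sketched in NumPy as follows. The data are synthetic and the basis functions ($1$, $x$, $x^2$) are hypothetical choices for this example. `np.linalg.lstsq` is used for the fit, and the normal equations are solved alongside it only to show that both give the same $\hat\theta$ when the columns of $A$ are independent.

```python
import numpy as np

# Synthetic data (assumption): scalar inputs and noisy outcomes.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)

# Hypothetical basis functions f_1(x) = 1, f_2(x) = x, f_3(x) = x^2.
basis_functions = [np.ones_like, lambda t: t, lambda t: t ** 2]

# A_ij = f_j(x^(i)): one row per data point, one column per basis function.
A = np.column_stack([f(x) for f in basis_functions])

# Least squares solution of: minimize ||A theta - y||^2.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Same solution from the normal equations A^T A theta = A^T y.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# RMS prediction error of the fitted model on the dataset.
rms_error = np.linalg.norm(A @ theta_hat - y) / np.sqrt(len(y))
```

In practice, solving the least squares problem directly is usually preferred over forming $(A^TA)^{-1}$ explicitly, for numerical stability.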

### Outline: 12 Least Squares Data Fitting

- [Least Squares Model Fitting](#sec)
- [Application to Regression and Classification](#sec)

### Classification and Regression

<center><img src="figs/04_ai.png" alt="default" width=1500px/></center>

### Regression of House Prices

- $x$: house's area in 1000 sq ft,
- $y$: house's price in k\$.

(We do not consider the number of bedrooms, for simplicity.)

Consider the model: $\hat f (x) = \theta_1 f_1(x) + \theta_2 f_2(x)$ with $f_1(x) = 1$ and $f_2(x) = x$, i.e.:
$$\hat f (x) = \theta_1 + \theta_2 x.$$

$\color{#047C91}{\text{Example}}$: Write down the least squares problem associated with this model. What are $A$ and $y$? Explain how you can find $\hat \theta_1$ and $\hat \theta_2$ with Python. At home: compute $\hat \theta_1$ and $\hat \theta_2$ manually (hard).
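
One possible way to set this up in Python is sketched below. The areas and prices are made-up numbers for illustration, not a real dataset. Here $A$ has a column of ones (for $f_1(x) = 1$) and a column of areas (for $f_2(x) = x$), and $y$ is the vector of prices.

```python
import numpy as np

# Made-up data for illustration (not from a real dataset).
area = np.array([0.8, 1.2, 1.5, 2.0, 2.6])               # x, in 1000 sq ft
price = np.array([155.0, 205.0, 260.0, 325.0, 425.0])    # y, in k$

# A = [1  x]: column of ones for f_1(x) = 1, column of areas for f_2(x) = x.
A = np.column_stack([np.ones_like(area), area])

# Solve: minimize ||A theta - y||^2.
theta_hat, *_ = np.linalg.lstsq(A, price, rcond=None)
theta_1, theta_2 = theta_hat

# Predicted price (in k$) for a hypothetical house of 1800 sq ft.
predicted_price = theta_1 + theta_2 * 1.8
```

With this setup, $\hat\theta_1$ is the intercept in k\$ and $\hat\theta_2$ is the price increase in k\$ per additional 1000 sq ft.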

### From Regression to (Binary) Classification

- Model $\hat f$ outputs a number.
- Binary classification wants a category: +1 or -1 only.

$\rightarrow$ use $\operatorname{sign}(\hat f)$ in place of $\hat f$ to classify.
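
A minimal sketch of this thresholding step is shown below. The helper name `classify` and the input values are hypothetical. Note that `np.sign` returns $0$ for an input of exactly $0$, so this sketch assigns ties to $+1$, which is an arbitrary convention chosen here.

```python
import numpy as np

def classify(f_hat_values):
    """Map real-valued model outputs to the labels +1 or -1.

    Outputs that are exactly 0 are assigned to +1 (assumed convention).
    """
    return np.where(np.asarray(f_hat_values) >= 0, 1, -1)

# Illustrative model outputs: positive, negative, zero, positive.
labels = classify([0.7, -1.3, 0.0, 2.1])
```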

### Classification of MNIST

- $x$: image of size $28 \times 28$ from the MNIST dataset, flattened into a $784$-vector,
- $y$: whether the image shows the digit $0$ ($y = +1$) or another digit ($y = -1$).

Consider the model: $\hat f (x) = \operatorname{sign}(\theta_0 f_0(x) + \theta_1 f_1(x) + \cdots + \theta_{784} f_{784}(x))$ with $f_0(x) = 1$ and $f_j(x) = x_j$ for $j = 1, \ldots, 784$, i.e.:
$$\hat f (x) = \operatorname{sign}(\theta_0 + \theta_1 x_1 + \cdots + \theta_{784} x_{784}).$$

$\color{#047C91}{\text{Example}}$: What are $x, y, A$? Explain how you can find the $\hat \theta$s with Python.
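
The mechanics can be sketched as follows. To keep the example runnable without downloading anything, it uses random arrays as a stand-in for MNIST images and random labels, so the resulting error rate is meaningless; only the shapes and steps carry over. The label convention ($+1$ for the digit $0$, $-1$ otherwise) is an assumption of this sketch.

```python
import numpy as np

# Stand-in data (assumption): random "images" and labels instead of real MNIST.
rng = np.random.default_rng(0)
N = 2000
images = rng.random((N, 28, 28))      # N images of size 28 x 28
is_zero = rng.random(N) < 0.1         # placeholder for "image shows digit 0"

# x: each image flattened into a 784-vector; y: +1 for digit 0, -1 otherwise.
X = images.reshape(N, 784)
y = np.where(is_zero, 1.0, -1.0)

# A: column of ones (f_0 = 1) followed by the 784 pixel values.
A = np.column_stack([np.ones(N), X])  # shape (N, 785)

# Fit the real-valued part of the model by least squares.
# lstsq still returns a solution if the columns of A are dependent.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify with the sign of the fitted linear function (ties go to +1).
y_pred = np.where(A @ theta_hat >= 0, 1, -1)
error_rate = np.mean(y_pred != y)
```

With real MNIST images, many border pixels are zero in all or nearly all images, so the columns of $A$ may not be independent and the formula $(A^TA)^{-1}A^Ty$ may not apply directly. Using a solver such as `np.linalg.lstsq`, or dropping pixels that are almost always zero, are two common ways around this.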


<center><img src="figs/04_mnist.png" alt="default" width=250px/></center>


<center><img src="figs/12_classification.png" alt="default"/></center>

Resources: Textbook Ch. 13-14.