# Abstract

# Introduction
In this first project of the course we look at linear regression and resampling methods. The goal is to implement three methods for linear regression, namely Ordinary Least Squares (OLS), Ridge, and Lasso regression, and to study their performance and behavior. Two resampling methods, the bootstrap and k-fold cross validation, are used to better evaluate the models and to determine the optimal values of the relevant parameters.

We study two types of data. First we create synthetic data using the Franke function. This data is then used to explore and verify the models. After this we move on to real data, in this project elevation data from [USGS's EarthExplorer](https://earthexplorer.usgs.gov/), and explore how the models perform on it.

We will look at the performance of the models and study their bias-variance trade-off.

# Data
We use synthetic data generated with the Franke function, as well as real digital terrain data, to explore our regression models.
## The Franke function 
The Franke function is a two-dimensional weighted sum of four exponentials. It has two Gaussian peaks of different heights and a smaller dip, and is often used as a test function in interpolation problems[2].

The function is defined as

$$
\begin{align*}
f(x,y) &= \frac{3}{4}\exp{\left(-\frac{(9x-2)^2}{4} - \frac{(9y-2)^2}{4}\right)}+\frac{3}{4}\exp{\left(-\frac{(9x+1)^2}{49}- \frac{(9y+1)}{10}\right)} \\
&+\frac{1}{2}\exp{\left(-\frac{(9x-7)^2}{4} - \frac{(9y-3)^2}{4}\right)} -\frac{1}{5}\exp{\left(-(9x-4)^2 - (9y-7)^2\right) }.
\end{align*}
$$

We evaluate it on the domain $x,y\in [0,1]$. See figure (?) for a plot.
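The definition above translates directly into numpy. The following is a minimal sketch; the function name `franke` and the grid resolution `n` are illustrative assumptions, not values taken from the project code.

```python
import numpy as np

def franke(x, y):
    """Evaluate the Franke function elementwise on x and y."""
    term1 = 0.75 * np.exp(-((9 * x - 2) ** 2) / 4 - ((9 * y - 2) ** 2) / 4)
    term2 = 0.75 * np.exp(-((9 * x + 1) ** 2) / 49 - (9 * y + 1) / 10)
    term3 = 0.5 * np.exp(-((9 * x - 7) ** 2) / 4 - ((9 * y - 3) ** 2) / 4)
    term4 = -0.2 * np.exp(-((9 * x - 4) ** 2) - (9 * y - 7) ** 2)
    return term1 + term2 + term3 + term4

# Hypothetical grid resolution; the report does not state the one used.
n = 50
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
z = franke(x, y)  # array of shape (n, n) with the function values on the grid
```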

## Real Data
We will use topographic map data as real data for trying out our regression methods. I used EarthExplorer[1] to find a suitable elevation map and chose an area over the Teton mountain range in Wyoming, USA. The map section has the entity ID SRTM1N43W111V3. I downloaded it as a GeoTIFF file with a resolution of 1 arc-second. A plot of this map is shown in figure (?).

# Methods
I explore three different methods for linear regression, as well as two methods for resampling. 
## Regression Methods

### OLS
Ordinary Least Squares Regression (OLS) fits a linear model with coefficients $\beta_i$ to minimize the residual sum of squares between the output value (aka dependent or target variable) in the dataset, and the output as predicted by the linear approximation. With $\boldsymbol{X}$ as a matrix of the input variables, and $\boldsymbol{y}$ as the output or target, we approximate the target as

$$
\boldsymbol{\hat{y}}= \boldsymbol{X}\boldsymbol{\beta}.
$$

The goal of OLS is to find the $\boldsymbol{\hat{\beta}}$ that minimizes the difference between the predicted values $\hat{y}_i$ and the observed values $y_i$.

We define the loss function that quantifies this difference as the mean squared error:

$$
L(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{\hat{y}}\right)^T\left(\boldsymbol{y}-\boldsymbol{\hat{y}}\right)\right\}.
$$

We want to minimize this function. Setting the derivative of $L$ with respect to each $\beta_j$ to zero and solving for $\boldsymbol{\beta}$, we find the solution

$$
\boldsymbol{\hat{\beta}} =\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.
$$

This closed-form solution requires $\boldsymbol{X}^T\boldsymbol{X}$ to be invertible.

Given a new input $\boldsymbol{X}_a$, we can use the fitted $\boldsymbol{\hat{\beta}}$ to compute a prediction $\boldsymbol{\hat{y}}_a = \boldsymbol{X}_a\boldsymbol{\hat{\beta}}$ of the target $\boldsymbol{y}_a$.
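The closed-form OLS solution can be sketched in a few lines of numpy. This is not the project's own implementation; the function names and the toy polynomial data are assumptions made for illustration. The pseudoinverse is used in place of a plain inverse so the sketch also behaves sensibly when $\boldsymbol{X}^T\boldsymbol{X}$ is close to singular.

```python
import numpy as np

def ols_fit(X, y):
    """Return the OLS coefficients (X^T X)^+ X^T y."""
    # pinv replaces the plain inverse as a safeguard against near-singular X^T X
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def predict(X, beta):
    """Predict targets for the rows of X using the coefficients beta."""
    return X @ beta

# Toy data (assumed for illustration): a noisy second-degree polynomial
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones_like(x), x, x**2])  # design matrix with intercept column
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=100)

beta_hat = ols_fit(X, y)
y_hat = predict(X, beta_hat)
```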



### Ridge regression
Ridge regression is a modification of OLS that restricts the size of the coefficients $\boldsymbol{\beta}$. This is particularly useful in models with many, partly correlated, input variables. In such models the OLS coefficients are likely to be poorly determined and tend to have high variance.

To combat this behavior, ridge regression adds a term to the OLS loss function that penalizes large coefficient values. The penalty is proportional to the squared magnitude of the coefficient vector. More succinctly, the ridge model adds L2 regularization to the OLS model.

Starting with the sum of squared residuals from the section above (the constant factor $1/n$ is dropped here),

$$
L(\boldsymbol{\beta})=\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2=\sum_{i=1}^{n}(y_i-\sum_{j=1}^{p}x_{ij}\beta_j)^2,
$$

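a penalty term is added:

$$
L(\boldsymbol{\beta})=\sum_{i=1}^{n}(y_i-\sum_{j=1}^{p}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^{p}\beta_j^2,
$$

where $\lambda \geq 0$ is the regularization parameter. This is equivalent to minimizing the sum of squared residuals subject to

$$
\sum_{j=1}^{p} \beta_j^2 \leq t,
$$

where $t$ is a positive number that corresponds one-to-one to $\lambda$.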

In matrix notation this reads

$$
L(\boldsymbol{\beta})=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})+\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}.
$$

Setting the gradient with respect to $\boldsymbol{\beta}$ to zero gives an expression for the coefficients, $\boldsymbol{\beta}^{\texttt{ridge}}$:

$$
\boldsymbol{\beta}^{\texttt{ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y},
$$

where $\boldsymbol{I}$ is the $p\times p$ identity matrix.

We can see from this that for $\lambda=0$ the model reduces to OLS. The larger the value of $\lambda$, the more strongly the $\beta_j$ values are shrunk toward zero.
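The ridge expression can be sketched with numpy as follows. The function name `ridge_fit`, the toy data, and the chosen values of `lam` are illustrative assumptions. Note that this sketch penalizes every column of $\boldsymbol{X}$, including an intercept column if one is present.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return the ridge coefficients (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

beta_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reproduces the OLS solution
beta_ridge = ridge_fit(X, y, lam=10.0)   # a larger lam shrinks the coefficients
```

Comparing `np.linalg.norm(beta_ridge)` with `np.linalg.norm(beta_ols)` shows the shrinkage effect of the penalty.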


### Lasso regression
Like ridge regression, lasso (least absolute shrinkage and selection operator) regression adds a penalty to the loss function. Lasso performs L1 regularization: the penalty is proportional to the sum of the absolute values of the coefficients.

With the regularization parameter $\lambda \geq 0$, the loss function becomes

$$
L(\boldsymbol{\beta})=\sum_{i=1}^{n}(y_i-\sum_{j=1}^{p}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^{p}|\beta_j|.
$$
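Unlike OLS and ridge, the lasso has no closed-form solution in general, so the coefficients must be found numerically. The sketch below uses scikit-learn's `Lasso` on toy data; the data and the value of `alpha` are assumptions for illustration. scikit-learn minimizes $\frac{1}{2n}\lVert\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\rVert_2^2+\alpha\lVert\boldsymbol{\beta}\rVert_1$, so its `alpha` corresponds to $\lambda/(2n)$ in the notation used here.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data (assumed for illustration): only two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)

# fit_intercept=False because the toy data has no offset
model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(X, y)
beta_lasso = model.coef_  # coefficients of irrelevant features are driven to zero
```

This illustrates the "selection" part of the name: the L1 penalty can set coefficients exactly to zero, whereas ridge only shrinks them.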

## Resampling 
### The bootstrap
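A minimal sketch of how the bootstrap could be used to study the bias-variance trade-off of a linear model is given here. It is not the project's own implementation: the function names, the toy data, and the number of resamples `n_boot` are all assumptions. The training set is resampled with replacement, the model is refitted on each resample, and the spread of the resulting predictions on a fixed test set is measured.

```python
import numpy as np

def bootstrap_bias_variance(X_train, y_train, X_test, y_test, fit, n_boot=100, seed=0):
    """Refit on bootstrap resamples and decompose the test error."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = np.empty((n_boot, X_test.shape[0]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        beta = fit(X_train[idx], y_train[idx])
        preds[b] = X_test @ beta
    mse = np.mean((y_test - preds) ** 2)
    # Squared bias is measured against the noisy test targets,
    # so it also contains the irreducible noise
    bias2 = np.mean((y_test - preds.mean(axis=0)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return mse, bias2, variance

def ols_fit(X, y):
    """OLS coefficients via the pseudoinverse (illustrative helper)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data (assumed for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=200)

mse, bias2, variance = bootstrap_bias_variance(X[:150], y[:150], X[150:], y[150:], ols_fit)
```

With these definitions the returned `mse` equals `bias2 + variance` up to floating-point rounding.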
### k-Fold Cross Validation
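A minimal sketch of k-fold cross validation for a linear model is given here. It is not the project's own implementation: the function names, the toy data, and the choice `k=5` are assumptions. The data is shuffled and split into `k` folds, each fold serves once as the test set while the model is fitted on the remaining folds, and the test errors are averaged.

```python
import numpy as np

def kfold_mse(X, y, fit, k=5, seed=0):
    """Estimate the test MSE of a linear model with k-fold cross validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])  # shuffle before splitting
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit(X[train_idx], y[train_idx])
        scores.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(scores))

def ols_fit(X, y):
    """OLS coefficients via the pseudoinverse (illustrative helper)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Toy data (assumed for illustration)
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=100)

cv_mse = kfold_mse(X, y, ols_fit, k=5)
```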

## Error measures
The $r^2$ score, also known as the coefficient of determination, is a common measure of how well a model is able to predict outcomes. It is defined as one minus the ratio of the residual sum of squares,

$$
\text{RSS} = \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2,
$$

to the total sum of squares,

$$
\text{TSS} = \sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2,
$$

where $\bar{y}$ is the mean of the observed values, giving

$$
r^2 = 1 - \frac{\text{RSS}}{\text{TSS}}.
$$

Values closer to 1 are better, with $r^2=1$ corresponding to a perfect fit. It is worth noting that $r^2$ can take negative values, which happens when the model predicts worse than simply using the mean of the observed values.

The mean squared error (MSE) is the mean of the squared errors, that is, the residual sum of squares divided by the number of data points $n$:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i-\hat{y}_i)^2,
$$

and naturally values closer to zero are better.
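Both error measures are short to write out in numpy. The function names `r2` and `mse` and the small example arrays are assumptions for illustration.

```python
import numpy as np

def r2(y, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y - y_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

def mse(y, y_pred):
    """Mean squared error: RSS / n."""
    return np.mean((y - y_pred) ** 2)

# Small example (assumed for illustration)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # close to the targets
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean

score_good = r2(y, y_good)
score_bad = r2(y, y_bad)   # negative, since RSS exceeds TSS
error_good = mse(y, y_good)
```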

## Preprocessing
Before a data analysis can begin, it is important to preprocess the data we are working on. This includes removing inconsistent and corrupted values, confirming the completeness of the dataset, and possibly removing highly correlated features as well as outliers. It may also include selecting certain features from the dataset to focus on, in order to reduce dimensionality.

In this project our data is either synthetic, and as such well defined, or map data for which the issues mentioned are not a concern. What is most relevant in this assignment is scaling.
### Scaling
Many models are sensitive to the effective value range of the features or input data. There are several ways to employ scaling. One popular option is to adjust the data so each predictor has a mean of zero and a variance of one. Another option is to scale all the data points so each feature vector has the same Euclidean length. Yet another option is to scale the values so they all lie between a given minimum and maximum value, typically zero and one.
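The first and last of these options can be spelled out in numpy as below. The project itself uses scikit-learn for scaling; this sketch only shows the arithmetic, and the function names and toy data are assumptions. The scaling parameters are computed from the training data only and then applied to the test data, so that no information from the test set leaks into the model.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit variance (training statistics)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std

def min_max_scale(X_train, X_test):
    """Scale each feature to the range [0, 1] (training minimum and maximum)."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # leave constant columns unscaled
    return (X_train - lo) / span, (X_test - lo) / span

# Toy data (assumed for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(80, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))

X_train_std, X_test_std = standardize(X_train, X_test)
X_train_mm, X_test_mm = min_max_scale(X_train, X_test)
```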
### Dividing the data set
A crucial step when using regression to create a model from a dataset is to divide the dataset into at least two sets: a training set to train the model, and a test set to test it. In addition, if the size of the dataset allows, one may also add a validation set for validating and fine-tuning the model before testing it on the test set.

I will be dividing the data into training and test sets, and use cross validation in place of a separate validation set. I will be using a 75%/25% split between training and test data.
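A 75%/25% split can be sketched with scikit-learn's `train_test_split`. The toy arrays and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = rng.uniform(size=100)

# test_size=0.25 gives the 75%/25% split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```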

## Packages and Tools
While I have written my own code for the OLS and ridge regression models, as well as for the bootstrap and k-fold CV, I have used functionality from the library scikit-learn[3] for the lasso regression as well as for scaling and splitting the dataset. This Python library is based on numpy and scipy, and contains a wide array of machine learning algorithms, including regression methods.

Other packages I've used are numpy[4] for array handling, matplotlib.pyplot[5] for plots and visualizations, and Python's random module[6] for generating (pseudo) random numbers.

# Bibliography
[1] USGS (United States Geological Survey) EarthExplorer: https://earthexplorer.usgs.gov/

[2] Franke, R. (1979). A critical comparison of some methods for interpolation of scattered data.

[3] SciKit-learn: https://scikit-learn.org/stable/index.html

[4] Numpy: https://numpy.org/

[5] MatPlotLib.PyPlot: https://matplotlib.org/api/pyplot_api.html

[6] Python.random: https://docs.python.org/3/library/random.html