The point of this notebook is to help me refresh my knowledge of linear algebra and provide concrete examples of how to code linear algebra solutions both directly and with tools like scipy.

I'll also try to relate these examples to solutions obtained with numerical optimisation techniques.

To start, let us consider the case where we have data $(x_i, y_i)$ with $i \in \{1, \dots, N\}$ and $N>2$. In this case, $x_i$ are the independent variables and $y_i$ are the dependent variables.

--------------------------------------------------------------

In general, we might be interested in finding the relationship between these two variables and using that relationship to make predictions about future data points yet to be taken.

To do this, we are interested in regression analysis -- that is, determining the relationship between these two variables by finding the trend, or regressor, that best represents the average behaviour of the data and minimizes the differences between the data points and the trend.

We'll be using an example where we think that the trend is roughly linear, so our regressor will be a line, i.e. $f(x) = \hat{y} = m x + b$.

As I said, we'll be using linear algebra to carry out the regression analysis first! 

So, let's start out by defining our matrices.

First, we have our dependent variables 
                                 $\mathbf{Y}=\left[ \begin{matrix} y_1 \\
                                                 y_2 \\
                                                 \vdots \\
                                                 y_N 
                                   \end{matrix} \right]$, 
                                   
which is a column vector that contains our observed data $y_i$.

Next, we have our independent variables 
                                  $\mathbf{X} = \left[ \begin{matrix} 1 & x_1 \\
                                                  1 & x_2 \\
                                                  \vdots & \vdots \\
                                                  1 & x_N 
                                   \end{matrix} \right]$
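
As a minimal sketch of how these two matrices might be built with NumPy: the data values below are made up purely for illustration (they are not from this notebook), and the names `x`, `y`, `X` and `Y` are my own choices. The column of ones in ${\bf X}$ is what lets the intercept appear in the matrix product.

```python
import numpy as np

# Hypothetical example data (assumed for illustration only): N = 5 points.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Y as an (N, 1) column vector of the observed dependent variables.
Y = y.reshape(-1, 1)

# X has a first column of ones (for the intercept) and a second column of x values.
X = np.column_stack([np.ones_like(x), x])

print(Y.shape)  # (5, 1)
print(X.shape)  # (5, 2)
```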
                                                  
We also consider the general case where each measurement $y_i$ has an associated variance $\sigma^2_{yi}$. Assuming the measurement errors are uncorrelated, this gives us the diagonal covariance matrix
$\mathbf{C} = \left[ \begin{matrix} \sigma^2_{y1} & 0 & \cdots & 0 \\
                                    0 & \sigma^2_{y2} & \cdots & 0 \\
                                    \vdots & \vdots & \ddots & \vdots \\
                                    0 & 0 & \cdots & \sigma^2_{yN}
                                    \end{matrix} \right]$
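
A diagonal matrix like this can be sketched with `np.diag`. The standard deviations in `sigma_y` are hypothetical values chosen only to have something to square; they are not measurements from this notebook.

```python
import numpy as np

# Hypothetical per-point standard deviations (assumed for illustration only).
sigma_y = np.array([0.1, 0.2, 0.1, 0.3, 0.2])

# Variances go on the diagonal; every off-diagonal entry is zero,
# which encodes the assumption that the measurement errors are uncorrelated.
C = np.diag(sigma_y**2)

print(C.shape)  # (5, 5)
```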
                                    
                                    
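In matrix form, we can write our model as ${\bf Y}={\bf X}\beta$, where $\beta$ contains the parameters that make our regressor best fit the data. Because the first column of ${\bf X}$ is all ones, the intercept must come first, i.e., $\beta = \left[ \begin{matrix} b\\ m\end{matrix}\right]$, so that row $i$ of ${\bf X}\beta$ gives $b + m x_i$.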
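
A quick sanity check of the matrix form, as a sketch: with the parameter vector ordered to match the columns of ${\bf X}$ (intercept first, then slope), the product ${\bf X}\beta$ should reproduce $m x_i + b$ row by row. The values of `m`, `b` and `x` here are arbitrary assumptions.

```python
import numpy as np

# Arbitrary illustrative values (assumptions, not fitted results).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
m, b = 2.0, 1.0

# Design matrix: column of ones, then the x values.
X = np.column_stack([np.ones_like(x), x])

# Parameter vector ordered to match the columns of X: [intercept, slope].
beta = np.array([b, m])

# Matrix form of the model versus the scalar form evaluated at every x_i.
y_hat = X @ beta
print(np.allclose(y_hat, m * x + b))  # True
```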

So, conceptually, what we want to do is identify the values of the parameters $m$ and $b$ (i.e. $\hat{\beta}$) which best describe the relationship between our data $(x,y)$.

To do this, we minimize the sum of the squared residuals between the data ${\bf Y}$ and our model $mx+b$ (i.e. ${\bf X}\beta$). This motivates us to define the residual $e_i=y_i-(mx_i+b)$ (i.e. $e={\bf Y}-{\bf X}\beta$).

Then our best fitting regressor is found by minimising the sum of the squared residuals, $\sum_{i=1}^N e_i^2 = \sum_{i=1}^N \left( y_i - (mx_i + b) \right)^2$, which in matrix form is $e^\top e$.
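
To make the quantity being minimised concrete, here is a rough sketch that evaluates the sum of squared residuals for a guessed $\beta$. The data are the same made-up values as before, the helper name `sum_squared_residuals` is hypothetical, and the call to `np.linalg.lstsq` is only an off-the-shelf cross-check, not the derivation this notebook works through.

```python
import numpy as np

def sum_squared_residuals(beta, X, y):
    """Return e^T e, where e = y - X @ beta is the residual vector."""
    e = y - X @ beta
    return float(e @ e)

# Hypothetical example data (assumed for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])

# A hand-picked guess, ordered [intercept b, slope m].
beta_guess = np.array([1.0, 2.0])
S_guess = sum_squared_residuals(beta_guess, X, y)

# Cross-check: NumPy's least-squares solver should never do worse than the guess.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
S_ls = sum_squared_residuals(beta_ls, X, y)

print(S_guess, S_ls)
```

Any other choice of `beta_guess` should give a value at least as large as `S_ls`, which is what "best fitting" means here.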