In [14]:
import pandas as pd
import numpy as np

## 2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

Notation Assumptions:
> If $X$ is a vector, its components are accessed by subscripts, as in $X_j$.

> All vectors are assumed to be column vectors; the $i$th row of $\textbf{X}$ is $x_{i}^T$, the transpose of the vector $x_i$.

The linear model has been a mainstay for 30 years or more and remains one of the most important tools in machine learning.

Given a vector of inputs denoted as:
$$
X^T = (X_1, X_2, ..., X_{p})
$$

Here $X^T$ denotes the transpose of $X$ (with $X$ a column vector).

For example, if $X$ is the column vector
$$
X = 
\begin{bmatrix}
1 \\
2 \\
3
\end{bmatrix}
$$

Then transposing it turns it into:

$$
X^T = 
\begin{bmatrix}
1 & 2 & 3
\end{bmatrix}
$$
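The same transposition can be sketched in NumPy. The names `x_col` and `x_row` are illustrative only. Note that the column vector has to be stored as a 2-D array of shape `(3, 1)`, because `.T` leaves a 1-D array unchanged.

```python
import numpy as np

# Column vector stored as a 3x1 (2-D) array so that transposition is meaningful.
x_col = np.array([[1], [2], [3]])

# Transposing swaps the two axes: 3x1 becomes 1x3.
x_row = x_col.T

print(x_col.shape)  # (3, 1)
print(x_row.shape)  # (1, 3)
print(x_row)        # [[1 2 3]]
```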

We want to predict the output $Y$ via the model:

$$
\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \cdot \hat{\beta}_j
$$

Here $p$ is the number of input features (the components of $X$), $\hat{\beta}_j$ are the coefficients chosen to best predict $Y$, and $\hat{\beta}_0$ is the intercept, also called the bias.

A common trick is to absorb the bias into the coefficient vector: include the constant variable 1 in $X$ and $\hat{\beta}_0$ in $\hat{\beta}$, then write the linear model in vector form as an **inner product**.

$$
\hat{Y} = X^T \cdot \hat{\beta}
$$
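As a quick check that the two forms agree, here is a minimal sketch with made-up numbers. The values of `x`, `beta_0`, and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical inputs and coefficients for p = 3 features.
x = np.array([2.0, -1.0, 0.5])
beta_0 = 4.0                        # bias (intercept)
beta = np.array([1.0, 3.0, -2.0])   # coefficients beta_1 .. beta_p

# Explicit form: beta_0 + sum_j x_j * beta_j
y_hat_sum = beta_0 + np.sum(x * beta)

# Inner-product form: prepend the constant 1 to x and beta_0 to beta.
x_aug = np.concatenate(([1.0], x))
beta_aug = np.concatenate(([beta_0], beta))
y_hat_dot = x_aug @ beta_aug

print(y_hat_sum, y_hat_dot)  # both forms give the same prediction
```

Prepending the constant 1 means the intercept no longer needs special handling: every later formula can treat $\hat{\beta}$ as a single vector.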

We can expand the inner product as follows:

$$
\begin{aligned}
\hat{Y} &= X^T \hat{\beta} \\
&= (X_1, X_2, \dots, X_{p})
\begin{bmatrix}
\hat{\beta}_1 \\
\vdots \\
\hat{\beta}_p
\end{bmatrix} \\
&= \sum_{j=1}^{p} X_j \hat{\beta}_j
\end{aligned}
$$

If there are $K$ outputs instead of one, $\hat{\beta}$ becomes a $p \times K$ matrix with one column of coefficients per output, and $\hat{Y}$ is a row of $K$ inner products, one per column:

$$
\hat{Y} = (X_1, X_2, \dots, X_{p})
\begin{bmatrix}
\hat{\beta}_{11} & \dots & \hat{\beta}_{1K} \\
\vdots & \ddots & \vdots \\
\hat{\beta}_{p1} & \dots & \hat{\beta}_{pK}
\end{bmatrix}
= \left( \sum_{j=1}^{p} X_j \hat{\beta}_{j1}, \; \dots, \; \sum_{j=1}^{p} X_j \hat{\beta}_{jK} \right)
$$
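When the coefficients form a matrix with several columns, the same product yields one prediction per column. The sketch below uses a hypothetical input `x_vec` and a hypothetical $3 \times 2$ coefficient matrix `B`, and checks that each entry of the result is the inner product of the input with the corresponding column.

```python
import numpy as np

x_vec = np.array([1.0, 2.0, 3.0])   # p = 3 inputs
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # p x K coefficient matrix, K = 2 outputs

# One matrix product gives all K predictions at once.
y_hat = x_vec @ B

# Each prediction equals the inner product of x_vec with one column of B.
per_column = np.array([x_vec @ B[:, k] for k in range(B.shape[1])])

print(y_hat)       # [4. 5.]
print(per_column)  # [4. 5.]
```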

This has the same structure as the linear system $Ax = b$ from linear algebra: the inputs play the role of $A$, the coefficients the role of $x$, and the outputs the role of $b$.

### Fitting Least Squares

The method of least squares chooses $\beta$ to minimize the residual sum of squares:

$$
RSS(\beta) = \sum_{i=1}^{N} (y_i - x^T_i \beta)^2
$$

Since $RSS(\beta)$ is a quadratic function of the parameters, a **minimum** always exists, but it **may not** be unique.
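Non-uniqueness is easy to see when there are fewer samples than features. The sketch below uses small illustrative data (`X_demo`, `y_demo`, and the helper `rss` are names introduced here, not part of the original notebook): two different coefficient vectors both reach $RSS = 0$, and `np.linalg.lstsq` returns the minimizer with the smallest 2-norm.

```python
import numpy as np

# Illustrative data: N = 2 samples, p = 3 features.
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
y_demo = np.array([6.0, 15.0])


def rss(beta):
    """Residual sum of squares for the demo data."""
    r = y_demo - X_demo @ beta
    return float(r @ r)


beta_a = np.array([1.0, 1.0, 1.0])
# beta_b = beta_a + (1, -2, 1); X_demo maps (1, -2, 1) to zero,
# so the fitted values (and hence the RSS) do not change.
beta_b = np.array([2.0, -1.0, 2.0])

print(rss(beta_a), rss(beta_b))  # both are 0.0

# Among all minimizers, lstsq returns the one with the smallest 2-norm.
beta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_ls, rss(beta_ls))
```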

We can also write:

$$
RSS(\beta) = (\textbf{y} - \textbf{X}\beta)^T (\textbf{y} - \textbf{X}\beta)
$$
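The sum form and the matrix form compute the same number. A minimal sketch on hypothetical data (`X_small`, `y_small`, and `beta_small` are made up for illustration):

```python
import numpy as np

# Hypothetical training data: N = 3 samples, p = 2 features.
X_small = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
y_small = np.array([1.0, 2.0, 4.0])
beta_small = np.array([0.5, 0.25])

# Sum form: add up the squared residual of each sample.
rss_sum = sum((y_i - x_i @ beta_small) ** 2 for x_i, y_i in zip(X_small, y_small))

# Matrix form: (y - X beta)^T (y - X beta).
residual = y_small - X_small @ beta_small
rss_mat = residual @ residual

print(rss_sum, rss_mat)  # 0.25 0.25
```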

Here $\textbf{X}$ is an $N \times p$ matrix with each row an input vector, and $\textbf{y}$ is an $N$-vector of the outputs in the training set.

For example:

Given $\textbf{X}$ with shape $2 \times 3$:

$$
\textbf{X} = 
\begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6 
\end{bmatrix}
$$

Given $\textbf{y}$ with shape $2 \times 1$:

$$
\textbf{y} =
\begin{bmatrix}
6 \\
15
\end{bmatrix}
$$



In [16]:
# Suppose we have features a, b, c with 2 samples in our training data.
X = np.array(
    [
        [1, 2, 3],
        [4, 5, 6]
    ]
)
df_x = {
    'a': [1, 4],
    'b': [2, 5],
    'c': [3, 6]
}

# Same data as a DataFrame, for display.
X_df = pd.DataFrame(df_x)
X_df

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


In [30]:
# Our labels, one per training sample.
y = np.array([[6], [15]])
df_y = {
    'y': [6, 15]
}

# Same labels as a DataFrame, for display.
Y_df = pd.DataFrame(df_y)
Y_df

|   | y  |
|---|----|
| 0 | 6  |
| 1 | 15 |


Suppose we choose the coefficients:

$$
\beta = 
\begin{bmatrix}
1 \\
1 \\
1
\end{bmatrix}
$$

With these coefficients, $\textbf{X}\beta$ reproduces $\textbf{y}$ exactly, so we expect $RSS(\beta) = 0$ from:

$$
RSS(\beta) = (\textbf{y} - \textbf{X}\beta)^T (\textbf{y} - \textbf{X}\beta)
$$

In [31]:
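beta = np.array([[1], [1], [1]])

print("X Shape:", X.shape)
print("y Shape:", y.shape)
print("beta Shape:", beta.shape)


def calculateRSS(X, y, beta):
    # Fitted values X @ beta, shape (N, 1).
    print(np.matmul(X, beta))
    # Residual vector y - X @ beta, shape (N, 1).
    residual = y - np.matmul(X, beta)

    print(residual.T)
    print(residual)

    # RSS = residual^T residual is a 1x1 matrix; return it as a scalar.
    return np.matmul(residual.T, residual).item()

print("RSS:", calculateRSS(X, y, beta))

X Shape: (2, 3)
y Shape: (2, 1)
beta Shape: (3, 1)
[[ 6]
 [15]]
[[0 0]]
[[0]
 [0]]
RSS: 0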
