## Normal Equation

In [3]:
from sklearn.linear_model import LinearRegression

In [4]:
height = [[160], [166], [172], [174], [180]]
weight = [56.3, 60.6, 65.1, 68.5, 75]
x_test = [[176]]

In [5]:
estimator = LinearRegression()
estimator.fit(height, weight)

LinearRegression()


In [7]:
print(f'the estimator coefficient is: {estimator.coef_}')
print(f'the estimator intercept is: {estimator.intercept_}')
y_pred = estimator.predict(x_test)
print(f'the predicted value for x_test is: {y_pred}')

the estimator coefficient is: [0.92942177]
the estimator intercept is: -93.27346938775514
the predicted value for x_test is: [70.3047619]


### Math in Machine Learning: A Review

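1. scalar = a quantity that has magnitude only (e.g. `a = 10`)
2. vector = an ordered list of elements (e.g. a pandas Series or a Python list); it has both magnitude and direction
3. matrix = a 2-D array (e.g. a DataFrame)
4. tensor = an array with any number of dimensions; scalars, vectors, and matrices are the 0-D, 1-D, and 2-D cases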

scalar -> vector -> matrix -> tensor 

![illustration](../assets/tensor.png)

- Tensor: in NumPy, a tensor is represented by `ndarray`, an n-dimensional (n-D) array.
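As a minimal sketch of the scalar -> vector -> matrix -> tensor progression, the snippet below builds one NumPy array of each kind and prints its number of dimensions. The variable names and the values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Illustrative values only; the names are chosen for this sketch
scalar = np.array(10)                      # 0-D: a single number
vector = np.array([1, 2, 3])               # 1-D: an ordered list of elements
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D: rows and columns
tensor = np.zeros((2, 3, 4))               # 3-D: a stack of two 3x4 matrices

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # ndim is the number of axes, shape is the size along each axis
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}")
```

All four objects are instances of the same `ndarray` type; only `ndim` and `shape` differ.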

##### Partial Derivative 
$z$ is a function of $x$ and $y$, written as $z = z(x, y)$. Find the minimum of $z = (x - 2)^2 + (y - 3)^2$.

At a minimum, both partial derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ are zero. When differentiating with respect to one variable, treat the other variable as a constant.

Partial derivative with respect to $x$:

$$
\frac{\partial z}{\partial x} = \frac{d}{dx}(x - 2)^2 = 2(x - 2) = 0 \Rightarrow x = 2
$$

Partial derivative with respect to $y$:

$$
\frac{\partial z}{\partial y} = \frac{d}{dy}(y - 3)^2 = 2(y - 3) = 0 \Rightarrow y = 3
$$

- Conclusion: the minimum is at $(x, y) = (2, 3)$, where $z = 0$.
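The result can be checked numerically. The sketch below, in plain Python, evaluates the analytic partial derivatives and compares them with a central finite-difference approximation. The helper names `grad_z` and `numeric_grad` and the step size `h` are assumptions made for this illustration.

```python
def z(x, y):
    return (x - 2) ** 2 + (y - 3) ** 2

def grad_z(x, y):
    # Analytic partial derivatives: dz/dx = 2(x - 2), dz/dy = 2(y - 3)
    return 2 * (x - 2), 2 * (y - 3)

def numeric_grad(f, x, y, h=1e-6):
    # Central differences: vary one variable while holding the other fixed
    dz_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dz_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dz_dx, dz_dy

print(grad_z(2, 3))            # both partial derivatives vanish at (2, 3)
print(z(2, 3))                 # value of z at the minimum
print(grad_z(0, 0))            # analytic gradient away from the minimum
print(numeric_grad(z, 0.0, 0.0))  # should agree closely with the line above
```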


### Matrix Review
1. Vector addition: each vector can be seen as a Series in pandas, and two vectors of the same length are added element by element

$$
\begin{pmatrix} 
1 \\ 2 \\ 3 
\end{pmatrix}
+
\begin{pmatrix} 
3 \\ 4 \\ 5 
\end{pmatrix}
=
\begin{pmatrix} 
4 \\ 6 \\ 8 
\end{pmatrix}
$$

2. Transpose
$$
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{pmatrix}^\mathrm{T}
=
\begin{pmatrix}
1 & 4 \\
2 & 5 \\
3 & 6
\end{pmatrix}
$$

- By the way, a transpose can be written in plain Python as:

In [4]:
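def transpose(matrix: list[list[int]]) -> list[list[int]]:
    """Return the transpose of a rectangular matrix given as a list of rows."""
    row = len(matrix)
    col = len(matrix[0])
    # The result has `col` rows and `row` columns
    results = [[0] * row for _ in range(col)]
    for i in range(row):
        for j in range(col):
            # Element (i, j) of the input becomes element (j, i) of the output
            results[j][i] = matrix[i][j]
    return results

if __name__ == '__main__':
    A = [
        [1, 2, 3],
        [4, 5, 6]
    ]
    print(transpose(A))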


[[1, 4], [2, 5], [3, 6]]
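In practice the transpose is usually taken with NumPy rather than a hand-written loop. A minimal sketch on the same 2x3 matrix, using the `.T` attribute of `ndarray`; the `zip(*rows)` line is a common pure-Python alternative shown for comparison.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

A_t = A.T                 # transpose of a 2x3 array is a 3x2 array
print(A_t)
print(A.shape, "->", A_t.shape)

# Pure-Python alternative: zip(*rows) pairs up the i-th elements of every row
rows = [[1, 2, 3], [4, 5, 6]]
zipped = [list(col) for col in zip(*rows)]
print(zipped)
```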


### Norm 
- L1 Norm: the sum of absolute values of all components in a vector
$$
\|v\|_1 = |v_1| + |v_2| + \dots + |v_n|
$$

Example: 
$v = (3, -4, 1)$

$$
  \|v\|_1 = |3| + |-4| + |1| = 8
$$

- L2 Norm: the square root of the sum of the squares of its components

$$
\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}
$$

Example: 
$v = (3, -4, 1)$:
$$
\|v\|_2 = \sqrt{3^2 + (-4)^2 + 1^2} = \sqrt{26} \approx 5.099
$$

Also, 
$$
\|x\|_2 = \sqrt{x^T x} 
$$
or
$$
\|x\|_2^2 = x^T x
$$
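The order matters: it must be $x^T x$ (transpose times non-transpose), which is a scalar for a column vector $x$; the reverse product $x x^T$ is an $n \times n$ matrix.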

- Lp Norm
$$
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}
$$
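The norms above can be reproduced with `np.linalg.norm`, whose `ord` argument selects the norm. This is a small sketch on the example vector $v = (3, -4, 1)$; the choice $p = 3$ for the Lp case is an arbitrary assumption for illustration.

```python
import numpy as np

v = np.array([3, -4, 1])

l1 = np.linalg.norm(v, ord=1)   # sum of absolute values
l2 = np.linalg.norm(v)          # default for a 1-D array is the L2 norm
l3 = np.linalg.norm(v, ord=3)   # Lp norm with p = 3

# The same quantities written out from the definitions
manual_l1 = np.abs(v).sum()
manual_l2 = np.sqrt(v @ v)                   # sqrt(v^T v)
manual_l3 = (np.abs(v) ** 3).sum() ** (1 / 3)

print(l1, manual_l1)
print(l2, manual_l2)
print(l3, manual_l3)
print(v @ v)                    # squared L2 norm, v^T v
```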


##### Symmetric Matrix 

A matrix is **symmetric** if it equals its transpose.

$$
A =
\begin{pmatrix}
1 & 2 & 3 \\
2 & 5 & 6 \\
3 & 6 & 9
\end{pmatrix},
\quad
A^T =
\begin{pmatrix}
1 & 2 & 3 \\
2 & 5 & 6 \\
3 & 6 & 9
\end{pmatrix}
$$


##### Unit (Identity) Matrix

A **unit matrix** (or **identity matrix**) has 1s on the diagonal and 0s elsewhere.

$$
I =
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}
$$
$AI = A$

- Matrix multiplication is associative, which means: for any matrices A, B, and C: 
$(AB)C = A(BC)$
- Inverse Matrix: if two square matrices $A$ and $B$ satisfy
$$
AB = BA = I,
$$
then $B$ is called the **inverse** of $A$, denoted as $A^{-1}$.

- $(A + B)^{T} = A^{T} + B^{T}$
- $(AB)^{T} = B^{T} A^{T}$
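The matrix properties listed above can be verified numerically. The sketch below reuses the symmetric matrix $A$ from this section; the matrices `B`, `C`, and `D` are hypothetical 2x2 examples chosen only for the check (a separate invertible matrix is used for the inverse because the symmetric $A$ above has determinant 0).

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 5, 6],
              [3, 6, 9]])
I = np.eye(3)

print(np.array_equal(A, A.T))       # symmetric: A equals its transpose
print(np.array_equal(A @ I, A))     # multiplying by the identity changes nothing

# Hypothetical small matrices for the remaining identities
B = np.array([[2.0, 1.0], [1.0, 3.0]])
C = np.array([[0.0, 1.0], [4.0, 2.0]])
D = np.array([[1.0, 2.0], [3.0, 5.0]])

B_inv = np.linalg.inv(B)
print(np.allclose(B @ B_inv, np.eye(2)))      # B B^{-1} = I
print(np.allclose(B_inv @ B, np.eye(2)))      # B^{-1} B = I
print(np.allclose((B @ C) @ D, B @ (C @ D)))  # associativity
print(np.array_equal((B + C).T, B.T + C.T))   # (B + C)^T = B^T + C^T
print(np.allclose((B @ C).T, C.T @ B.T))      # (BC)^T = C^T B^T
```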




#### Normal Equation Method


### Univariate Linear Regression — Loss Function Derivation

**Model (hypothesis):**
$$
\hat{y}_i = w x_i + b
$$

**Loss function (Mean Squared Error with 1/2 factor):**
$$
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (w x_i + b - y_i)^2
$$

### 1) Compute gradients
$$
\frac{\partial J}{\partial w}
= \frac{1}{m} \sum_{i=1}^{m} (w x_i + b - y_i)x_i,
\qquad
\frac{\partial J}{\partial b}
= \frac{1}{m} \sum_{i=1}^{m} (w x_i + b - y_i)
$$

### 2) Set gradients to zero (Normal equations)
$$
\sum_{i=1}^{m} x_i (w x_i + b - y_i) = 0, 
\qquad
\sum_{i=1}^{m} (w x_i + b - y_i) = 0
$$

This leads to the linear system:
$$
\begin{cases}
w \sum x_i^2 + b \sum x_i = \sum x_i y_i, \\[4pt]
w \sum x_i + m b = \sum y_i
\end{cases}
$$

Let
$$
\bar{x} = \frac{1}{m}\sum x_i,\quad
\bar{y} = \frac{1}{m}\sum y_i,\quad
S_{xx} = \sum (x_i - \bar{x})^2,\quad
S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})
$$
Then the optimal parameters are:
$$
w^* = \frac{S_{xy}}{S_{xx}}
= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2},
\qquad
b^* = \bar{y} - w^* \bar{x}
$$

### (Optional) Summation form
$$
w^* = \frac{m \sum x_i y_i - \sum x_i \sum y_i}{m \sum x_i^2 - (\sum x_i)^2},
\qquad
b^* = \frac{\sum y_i - w^* \sum x_i}{m}
$$
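As a sketch, both closed forms can be applied to the height/weight data from the start of this section and compared with the `LinearRegression` output shown there. The variable names are assumptions made for this illustration.

```python
import numpy as np

# Height/weight data from the first example
x = np.array([160, 166, 172, 174, 180], dtype=float)
y = np.array([56.3, 60.6, 65.1, 68.5, 75.0])
m = len(x)

# Centered form: w* = S_xy / S_xx, b* = y_bar - w* x_bar
x_bar, y_bar = x.mean(), y.mean()
s_xy = ((x - x_bar) * (y - y_bar)).sum()
s_xx = ((x - x_bar) ** 2).sum()
w = s_xy / s_xx
b = y_bar - w * x_bar

# Summation form: should give the same values up to rounding
w_sum = (m * (x * y).sum() - x.sum() * y.sum()) / (m * (x ** 2).sum() - x.sum() ** 2)
b_sum = (y.sum() - w_sum * x.sum()) / m

print(w, b)
print(w_sum, b_sum)
print(w * 176 + b)   # prediction for a height of 176
```

The slope, intercept, and prediction should match the `coef_`, `intercept_`, and `predict` values printed by the scikit-learn estimator above.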


### Steps to Derive Parameters in Univariate Linear Regression

1. **Take partial derivatives** of the loss function with respect to $w$ and $b$.
2. **Obtain** a system of two linear equations:
   $$
   \begin{cases}
   \frac{\partial J}{\partial w} = 0, \\[4pt]
   \frac{\partial J}{\partial b} = 0
   \end{cases}
   $$
3. **Solve** the system to find the optimal values of $w$ and $b$.


### For Multiple Linear Regression
The procedure is the same: with $n$ features there are $n$ weights plus one bias, so setting the partial derivatives to zero gives $n + 1$ linear equations.
#### Proof

In multivariate linear regression:
$$
\hat{y}^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b
$$

The loss function is:
$$
J(w_1, w_2, \dots, w_n, b)
= \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
$$

#### Step 1. Take partial derivatives
For each parameter $w_j$, take a partial derivative while treating all other $w_k$ ($k \neq j$) as constants:
$$
\frac{\partial J}{\partial w_j}
= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}
$$

Finally, take the derivative with respect to the bias $b$:
$$
\frac{\partial J}{\partial b}
= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})
$$

#### Step 2. Set all gradients to zero
Setting all $(n+1)$ derivatives to zero gives $(n+1)$ linear equations:
$$
\begin{cases}
\frac{\partial J}{\partial w_1} = 0 \\[4pt]
\frac{\partial J}{\partial w_2} = 0 \\[2pt]
\vdots \\[2pt]
\frac{\partial J}{\partial w_n} = 0 \\[4pt]
\frac{\partial J}{\partial b} = 0
\end{cases}
$$

#### Step 3. Solve for all parameters
The system of equations yields the optimal parameters:
$$
w_1^*, w_2^*, \dots, w_n^*, b^*
$$

This is why **multivariate linear regression** leads to an **(n + 1)-variable linear system**.
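The three steps can be sketched in NumPy: compute the $n + 1$ partial derivatives, solve the system, and confirm that every derivative vanishes at the solution. The data are the hours/attendance/score values from the example later in this notebook; the function name `gradients` and the use of `np.linalg.lstsq` as the solver are assumptions for this illustration.

```python
import numpy as np

# n = 2 features (hours studied, attendance rate), m = 5 samples
X = np.array([[2, 60], [4, 70], [6, 75], [8, 80], [10, 90]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)
m, n = X.shape

def gradients(w, b):
    residual = X @ w + b - y        # y_hat - y for every sample
    grad_w = X.T @ residual / m     # one partial derivative per weight w_j
    grad_b = residual.mean()        # partial derivative with respect to b
    return grad_w, grad_b

# Solve the (n + 1)-equation system via least squares on [1, X]
X_aug = np.hstack([np.ones((m, 1)), X])
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
b_opt, w_opt = params[0], params[1:]

grad_w, grad_b = gradients(w_opt, b_opt)
print(grad_w, grad_b)               # all n + 1 values should be close to zero
print(gradients(np.zeros(n), 0.0))  # far from zero at an arbitrary starting point
```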


### Matrix Form
However, instead of solving the **(n + 1)** linear equations one by one,  
we can express the entire regression problem in **matrix form**,  
which allows us to compute all parameters at once.

---

#### Model:
$$
\hat{y} = Xw + b
$$
where  
- $X$: feature matrix of shape $(m \times n)$  
- $w$: weight vector $(n \times 1)$  
- $b$: bias (a scalar added to every prediction, or absorbed into $w$ by adding a column of 1s to $X$)  
- $y$: target vector $(m \times 1)$  

---

#### Loss function:
$$
J(w) = \frac{1}{2m}(Xw + b - y)^T(Xw + b - y)
$$

---

#### Gradient with respect to $w$:
Using matrix calculus,
$$
\frac{\partial J}{\partial w} = \frac{1}{m} X^T (Xw + b - y)
$$
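One way to see where this gradient comes from is to work component by component. The residual vector $r$ below is a symbol introduced only for this sketch, with $r_i = \sum_{j=1}^{n} X_{ij} w_j + b - y_i$:

$$
J(w) = \frac{1}{2m} r^T r = \frac{1}{2m} \sum_{i=1}^{m} r_i^2
$$

Applying the chain rule to one weight $w_j$, and using $\frac{\partial r_i}{\partial w_j} = X_{ij}$:

$$
\frac{\partial J}{\partial w_j}
= \frac{1}{m} \sum_{i=1}^{m} r_i \frac{\partial r_i}{\partial w_j}
= \frac{1}{m} \sum_{i=1}^{m} X_{ij} r_i
= \frac{1}{m} (X^T r)_j
$$

Stacking the $n$ components gives $\frac{\partial J}{\partial w} = \frac{1}{m} X^T r = \frac{1}{m} X^T (Xw + b - y)$, which matches the per-parameter derivatives obtained earlier.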

---

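#### Set the gradient to zero:
Absorb the bias into $w$ by prepending a column of 1s to $X$, so that the separate $b$ term disappears. Setting the gradient to zero then gives:
$$
X^T X w = X^T y
$$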

---

#### Solve for the optimal weights (assuming $X^T X$ is invertible):
$$
w^* = (X^T X)^{-1} X^T y
$$



In [7]:
import pandas as pd 
data = {
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10],
    'y':  [3, 5, 7, 9, 11]
}
df = pd.DataFrame(data)
print(df)

   x1  x2   y
0   1   2   3
1   2   4   5
2   3   6   7
3   4   8   9
4   5  10  11


$$
X =
\begin{pmatrix}
1 & 1 & 2 \\
1 & 2 & 4 \\
1 & 3 & 6 \\
1 & 4 & 8 \\
1 & 5 & 10
\end{pmatrix},
\quad
y =
\begin{pmatrix}
3 \\
5 \\
7 \\
9 \\
11
\end{pmatrix}
$$

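- The first column of 1s represents the intercept $b$

#### According to the equation: 


$$
w^* = (X^T X)^{-1} X^T y
$$

#### However, here $X^T X$ is not invertible:

In this data $x_2 = 2x_1$, so the columns of $X$ are linearly dependent, $X^T X$ is singular, and the formula cannot be applied directly. The data satisfy $y = 1 + 2x_1$ exactly, so every $w = (1, a, c)^T$ with $a + 2c = 2$ fits perfectly and the solution is not unique. One valid choice is:

$$
w^* =
\begin{pmatrix}
1 \\
2 \\
0
\end{pmatrix}
$$

$$
\hat{y} = 1 + 2x_1 + 0x_2
$$

The pseudoinverse picks the minimum-norm solution instead, $w = (1, 0.4, 0.8)^T$, which gives the same predictions on this data.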
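A quick numerical look at this example, sketched with NumPy. Because the third column of $X$ is twice the second, $X$ has rank 2 rather than 3, so $X^T X$ has no inverse. `np.linalg.lstsq` still returns a least-squares solution (the minimum-norm one). The `rcond=1e-10` cutoff is an assumption chosen so that the numerically tiny singular value is treated as zero.

```python
import numpy as np

X = np.array([[1, 1, 2],
              [1, 2, 4],
              [1, 3, 6],
              [1, 4, 8],
              [1, 5, 10]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

# lstsq reports the effective rank and returns the minimum-norm solution
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=1e-10)

print(rank)                     # fewer than 3 means X^T X is singular
print(np.round(w, 6))           # minimum-norm weights
print(np.allclose(X @ w, y))    # the fit is still exact on this data

# Another exact solution: intercept 1, slope 2 on x1, nothing on x2
w_alt = np.array([1.0, 2.0, 0.0])
print(np.allclose(X @ w_alt, y))
```

Both weight vectors reproduce $y$ exactly, which illustrates that the solution is not unique when features are perfectly collinear.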

### An example:

In [8]:
import pandas as pd

data = {
    "hours_studied": [2, 4, 6, 8, 10],
    "attendance_rate": [60, 70, 75, 80, 90],
    "score": [65, 70, 75, 85, 95]
}

df = pd.DataFrame(data)
print(df)

   hours_studied  attendance_rate  score
0              2               60     65
1              4               70     70
2              6               75     75
3              8               80     85
4             10               90     95


### Use `.values` to convert a DataFrame into a NumPy array

In [21]:
import numpy as np
x_train = df[["hours_studied", "attendance_rate"]].values
y_train = df['score'].values.reshape(-1, 1)
print(y_train)
print(x_train)
print(type(x_train))

[[65]
 [70]
 [75]
 [85]
 [95]]
[[ 2 60]
 [ 4 70]
 [ 6 75]
 [ 8 80]
 [10 90]]
<class 'numpy.ndarray'>


### Add a column of 1s (the intercept term) in front of `x_train`

In [25]:
x_train = np.hstack((np.ones((x_train.shape[0], 1)), x_train))

$$
w^* = (X^T X)^{-1} X^T y
$$

In [26]:
w = np.linalg.inv(x_train.T @ x_train) @ x_train.T @ y_train
print(w)

[[5.5500000e+01]
 [3.7500000e+00]
 [5.5067062e-14]]
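In the output above, the third weight (about `5.5e-14`) is floating-point noise around zero: on this data the attendance rate adds nothing once hours studied is in the model. As a hedged alternative to forming the inverse explicitly, the sketch below solves the same normal equations with `np.linalg.solve` and cross-checks with `np.linalg.lstsq`. The sample `x_new` is a hypothetical student invented for the prediction.

```python
import numpy as np

# Design matrix with the intercept column already prepended
X = np.array([[1, 2, 60],
              [1, 4, 70],
              [1, 6, 75],
              [1, 8, 80],
              [1, 10, 90]], dtype=float)
y = np.array([65, 70, 75, 85, 95], dtype=float)

# Solve X^T X w = X^T y directly instead of computing the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X itself
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_solve, 6))
print(np.round(w_lstsq, 6))

# Hypothetical new sample: [intercept, hours_studied, attendance_rate]
x_new = np.array([1.0, 7.0, 78.0])
pred = x_new @ w_lstsq
print(pred)
```

Solving the linear system is generally preferred over `inv` because it avoids building the inverse matrix and tends to be more accurate numerically.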


### Time Complexity of Normal Equation
$$
w^* = (X^T X)^{-1} X^T y
$$

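$X^T X$: $O(mn^2)$ (an $n \times m$ matrix times an $m \times n$ matrix)

$(X^T X)^{-1}$: $O(n^3)$

$X^T y$: $O(mn)$

$(X^T X)^{-1} (X^T y)$: $O(n^2)$ (an $n \times n$ matrix times an $n \times 1$ vector)

#### So, the total time complexity is $O(mn^2 + n^3)$; the $O(n^3)$ inversion is what makes the normal equation slow when the number of features $n$ is large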
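To get a feel for how the terms compare, the sketch below tallies rough operation counts for each step. The function `normal_equation_cost` is a hypothetical helper written for this illustration; it ignores constant factors and is not a benchmark.

```python
def normal_equation_cost(m: int, n: int) -> dict[str, int]:
    """Rough multiply-add counts per step, ignoring constant factors."""
    steps = {
        "X^T X": m * n * n,        # (n x m) times (m x n)
        "inverse": n ** 3,         # inverting an n x n matrix
        "X^T y": m * n,            # (n x m) times (m x 1)
        "final product": n * n,    # (n x n) times (n x 1)
    }
    steps["total"] = sum(steps.values())
    return steps

# Few features vs. many features, same number of samples
for m, n in [(1000, 10), (1000, 1000)]:
    print(m, n, normal_equation_cost(m, n))
```

With few features the $mn^2$ term dominates; once $n$ is comparable to $m$, the cubic inversion term is just as large.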