# Machine Learning: Solving the Normal Equation

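The normal equation comes up in linear regression as a way to solve for the model parameters. Linear regression itself is covered in a separate note; this note focuses on deriving the normal equation. In his machine learning open course videos, Andrew Ng mentions that he often uses the method he presents there for derivations of this kind, and that the steps are quite simple, so it is worth studying that method carefully.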

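## Linear regression model
For a model with n features, let $\boldsymbol{x} = \left[ x_0, x_1, \cdots, x_n \right]^\text{T} \in \mathbb{R}^{(n+1) \times 1}$ denote the input and $\boldsymbol{\theta} = \left[ \theta_0, \theta_1, \cdots, \theta_n \right]^\text{T} \in \mathbb{R}^{(n+1) \times 1}$ the model parameters, where by the usual convention $x_0 = 1$ so that $\theta_0$ acts as the intercept. The model output is then: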

$$h_\boldsymbol{\theta}(\boldsymbol{x}) = \boldsymbol{\theta}^{\text{T}}\boldsymbol{x} $$

Clearly, $h_\boldsymbol{\theta}(\boldsymbol{x})$ is a real number. Let $y$ denote the true output for the input $\boldsymbol{x}$. The squared error of the model on the input $\boldsymbol{x}$ is then:

$$J(\boldsymbol{\theta}) = \left( h_\boldsymbol{\theta}(\boldsymbol{x}) - y \right)^2$$

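Let m be the number of samples. For $1 \leq i \leq m$, $\boldsymbol{x}_i = \left[ x_{i0}, x_{i1}, \cdots, x_{in} \right]^\text{T} \in \mathbb{R}^{(n+1) \times 1}$ denotes the input of the i-th sample and $y_i$ its true output. The error over all samples, referred to in this note as the mean squared error (strictly, it is half the sum of squared errors, which has the same minimizer), is defined as: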

$$J(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^m \left( h_\boldsymbol{\theta}(\boldsymbol{x}_i) -y_i \right)^2$$
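As a minimal sketch of the two definitions above, the following NumPy snippet evaluates $h_\boldsymbol{\theta}(\boldsymbol{x})$ and $J(\boldsymbol{\theta})$ with an explicit loop over samples. The toy data set and the names `predict`, `cost`, `samples`, and `targets` are illustrative assumptions, not part of the original note.

```python
import numpy as np

def predict(theta, x):
    """Model output h_theta(x) = theta^T x for one sample (x includes x_0 = 1)."""
    return float(theta @ x)

def cost(theta, samples, targets):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, written as an explicit loop."""
    total = 0.0
    for x_i, y_i in zip(samples, targets):
        total += (predict(theta, x_i) - y_i) ** 2
    return 0.5 * total

# Hypothetical toy data: m = 3 samples, n = 1 feature, first column is x_0 = 1.
samples = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
targets = np.array([1.0, 3.0, 5.0])  # generated by y = 1 + 2 * x_1

theta_exact = np.array([1.0, 2.0])   # reproduces the targets exactly
theta_off = np.array([0.0, 2.0])     # every prediction is off by exactly 1

cost_exact = cost(theta_exact, samples, targets)
cost_off = cost(theta_off, samples, targets)
print(cost_exact, cost_off)
```

With `theta_off` each of the three residuals equals -1, so the cost is half of 3.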

We want the model parameters $\boldsymbol{\theta}$ that minimize the mean squared error. The two most common ways to find them are **gradient descent** and the **normal equation**.

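Informally, the normal equation is obtained by differentiation: set the derivative to zero and solve for the stationary point. In general a zero derivative is only a necessary condition for an extremum (this can be seen from the Taylor expansion), but $J(\boldsymbol{\theta})$ is a convex quadratic function of $\boldsymbol{\theta}$, so its stationary point is a global minimum. Everything that follows is just differentiating the mean squared error function.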

## Matrix form of the mean squared error

First, write the mean squared error in matrix form.

Let $\boldsymbol{w} = \left[ w_1, w_2, \cdots, w_m \right]^{\text{T}} $, where $ w_i = h_\boldsymbol{\theta}(\boldsymbol{x}_i) - y_i $. Then

$$ \begin{align*}
\sum_{i=1}^m \left( h_\boldsymbol{\theta}(\boldsymbol{x}_i) - y_i \right)^2 &= \sum_{i=1}^m w_i^2 \\
&= \boldsymbol{w}^{\text{T}}\boldsymbol{w} \\
&= \left[ \begin{matrix} h_\boldsymbol{\theta}(\boldsymbol{x}_1) - y_1 & \cdots & h_\boldsymbol{\theta}(\boldsymbol{x}_m) - y_m \end{matrix} \right] \left[ \begin{matrix} h_\boldsymbol{\theta}(\boldsymbol{x}_1) - y_1 \\ \vdots \\ h_\boldsymbol{\theta}(\boldsymbol{x}_m) -y_m \end{matrix} \right]
\end{align*} $$

We also have

$$ \begin{align*}
\left[ \begin{matrix} h_\boldsymbol{\theta}(\boldsymbol{x}_1) - y_1 \\ \vdots \\ h_\boldsymbol{\theta}(\boldsymbol{x}_m) - y_m \end{matrix} \right] &= \left[ \begin{matrix} \boldsymbol{\theta}^{\text{T}}\boldsymbol{x}_1 \\ \vdots \\ \boldsymbol{\theta}^{\text{T}}\boldsymbol{x}_m \end{matrix} \right] - \left[ \begin{matrix} y_1 \\ \vdots \\ y_m \end{matrix} \right] \\
&= \left[ \begin{matrix} \sum_{j=0}^{n}\theta_j x_{1j} \\ \vdots \\ \sum_{j=0}^{n}\theta_j x_{mj} \end{matrix} \right] - \left[ \begin{matrix} y_1 \\ \vdots \\ y_m \end{matrix} \right] \\
&= \left[ \begin{matrix} x_{10} & x_{11} & \cdots & x_{1n} \\ x_{20} & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m0} & x_{m1} & \cdots & x_{mn} \end{matrix} \right] \left[ \begin{matrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{matrix} \right] - \boldsymbol{y} \\
&= \boldsymbol{X\theta} - \boldsymbol{y}
\end{align*}$$

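Here $\boldsymbol{X} \in \mathbb{R}^{m \times (n+1)}$ is the matrix whose i-th row is $\boldsymbol{x}_i^{\text{T}}$, and $\boldsymbol{y} = \left[ y_1, \cdots, y_m \right]^{\text{T}}$. Therefore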

$$J(\boldsymbol{\theta}) = \frac{1}{2} \left(\boldsymbol{X\theta} - \boldsymbol{y}\right)^{\text{T}} \left(\boldsymbol{X\theta} - \boldsymbol{y}\right) $$
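The equivalence between the summation form and the matrix form can be checked numerically. The sketch below assumes randomly generated data (the seed, sizes, and variable names are arbitrary choices for illustration) and compares the two expressions for $J(\boldsymbol{\theta})$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2  # hypothetical sizes: 5 samples, 2 features

# Design matrix X: one row per sample, first column is the constant x_0 = 1.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Summation form: 1/2 * sum_i (theta^T x_i - y_i)^2
loop_form = 0.5 * sum((theta @ X[i] - y[i]) ** 2 for i in range(m))

# Matrix form: 1/2 * (X theta - y)^T (X theta - y)
residual = X @ theta - y
matrix_form = 0.5 * (residual @ residual)

print(loop_form, matrix_form)
```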







## Taking the derivative

$$
\begin{align*}
\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} &= \frac{\partial}{\partial \boldsymbol{\theta}} \frac{1}{2} \left(\boldsymbol{X\theta} - \boldsymbol{y}\right)^{\text{T}} \left(\boldsymbol{X\theta} - \boldsymbol{y}\right) \\
&=  \frac{1}{2} \frac{\partial}{\partial \boldsymbol{\theta}} \left( \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{X\theta} -  \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{y} - \boldsymbol{y}^{\text{T}}\boldsymbol{X\theta} + \boldsymbol{y}^{\text{T}}\boldsymbol{y} \right)  \\
&= \frac{1}{2} \left[\frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{X\theta} -  \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{y} - \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{y}^{\text{T}}\boldsymbol{X\theta} \right] \tag{1}
\end{align*} $$ 

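Here $\boldsymbol{\theta}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}$, $\boldsymbol{\theta}^T\boldsymbol{X}^T \boldsymbol{y}$, and $\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{\theta}$ are all scalars (1×1 matrices), and the term $\boldsymbol{y}^T\boldsymbol{y}$ does not depend on $\boldsymbol{\theta}$, so its derivative is zero. The problem therefore splits into three scalar-by-vector derivatives.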


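### General method

Differentiate each of the three scalars in turn. In the formulas below, the components are indexed from 1 to n for brevity, and $\boldsymbol{W}$ is a placeholder that is redefined for each term; with the intercept component $\theta_0$ included, the index simply runs from 0 to n and the argument is unchanged.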

$$ \begin{align*}
\frac{\partial (\boldsymbol{\theta}^T\boldsymbol{X}^T \boldsymbol{y}) }{\partial \boldsymbol{\theta}}
\overset{\boldsymbol{X}^T \boldsymbol{y} = \boldsymbol{W}}{\Longrightarrow} \frac{\partial (\boldsymbol{\theta}^T\boldsymbol{W}) }{\partial \boldsymbol{\theta}}
&= \frac{\partial}{\partial \boldsymbol{\theta}} \left[ \begin{matrix} \theta_1 & \theta_2 & \cdots & \theta_n \end{matrix} \right] \left[ \begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{matrix} \right] \\
&= \frac{\partial}{\partial \boldsymbol{\theta}} \sum_{i=1}^{n}\theta_i w_i 
= \left[ \begin{matrix} \frac{\partial}{\partial \theta_1} \sum_{i=1}^{n}\theta_i w_i \\ \frac{\partial}{\partial \theta_2} \sum_{i=1}^{n}\theta_i w_i \\ \vdots \\ \frac{\partial}{\partial \theta_n} \sum_{i=1}^{n}\theta_i w_i \end{matrix} \right] 
= \left[ \begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{matrix} \right] = \boldsymbol{W} = \boldsymbol{X}^T \boldsymbol{y} \\
\\ \frac{\partial (\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{\theta}) }{\partial \boldsymbol{\theta}}
\overset{\boldsymbol{y}^T\boldsymbol{X} = \boldsymbol{W}}{\Longrightarrow} \frac{\partial (\boldsymbol{W}\boldsymbol{\theta}) }{\partial \boldsymbol{\theta}}
&= \frac{\partial}{\partial \boldsymbol{\theta}}  \left[ \begin{matrix} w_1 & w_2 & \cdots & w_n \end{matrix} \right] \left[ \begin{matrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{matrix} \right] \\
&= \frac{\partial}{\partial \boldsymbol{\theta}} \sum_{i=1}^{n}w_i \theta_i
= \left[ \begin{matrix} \frac{\partial}{\partial \theta_1} \sum_{i=1}^{n}w_i \theta_i \\ \frac{\partial}{\partial \theta_2} \sum_{i=1}^{n}w_i \theta_i \\ \vdots \\ \frac{\partial}{\partial \theta_n} \sum_{i=1}^{n}w_i \theta_i \end{matrix} \right]
= \left[ \begin{matrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{matrix} \right] = \boldsymbol{W}^T = (\boldsymbol{y}^T\boldsymbol{X})^T =\boldsymbol{X}^T \boldsymbol{y} \\
\\ \frac{\partial (\boldsymbol{\theta}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}) }{\partial \boldsymbol{\theta}}
\overset{\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{W}}{\Longrightarrow} \frac{\partial (\boldsymbol{\theta}^T\boldsymbol{W}\boldsymbol{\theta}) }{\partial \boldsymbol{\theta}}
&= \frac{\partial}{\partial \boldsymbol{\theta}} \left[ \begin{matrix} \theta_1 & \theta_2 & \cdots & \theta_n \end{matrix} \right] \left[ \begin{matrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{matrix} \right] \left[ \begin{matrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{matrix} \right] \\
&= \frac{\partial}{\partial \boldsymbol{\theta}} \sum_{i=1}^{n} \sum_{j=1}^{n}\theta_i w_{ij} \theta_j
= \left[ \begin{matrix} \frac{\partial}{\partial \theta_1} \sum_{i=1}^{n} \sum_{j=1}^{n}\theta_i w_{ij} \theta_j \\ \frac{\partial}{\partial \theta_2} \sum_{i=1}^{n} \sum_{j=1}^{n}\theta_i w_{ij} \theta_j \\ \vdots \\ \frac{\partial}{\partial \theta_n} \sum_{i=1}^{n} \sum_{j=1}^{n}\theta_i w_{ij} \theta_j \end{matrix} \right] 
= \left[ \begin{matrix} 2\sum_{i=1}^{n}w_{1i}\theta_i \\ 2\sum_{i=1}^{n}w_{2i}\theta_i \\ \vdots \\ 2\sum_{i=1}^{n}w_{ni}\theta_i \end{matrix} \right] \\
&= 2 \left[ \begin{matrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{matrix} \right] \left[ \begin{matrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{matrix} \right]
= 2\boldsymbol{W}\boldsymbol{\theta} = 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}
 \end{align*} $$
 
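The derivative $\frac{\partial}{\partial \theta_k} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n}\theta_i w_{ij} \theta_j$ deserves a more detailed treatment. Note that $w_{ik} = w_{ki}$, because $\boldsymbol{W} = \boldsymbol{X}^T\boldsymbol{X}$ is symmetric. The derivation is as follows: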

$$ \begin{align*}
\frac{\partial}{\partial \theta_k} \sum_{i=1}^{n} \sum_{j=1}^{n}\theta_i w_{ij} \theta_j
&= \frac{\partial}{\partial \theta_k} \left( \sum_{i=k}^{k} \sum_{j=k}^{k}\theta_i w_{ij} \theta_j + \sum_{i=k}^{k} \sum_{j=1,j\neq k}^{n}\theta_i w_{ij} \theta_j + \sum_{i=1,i\neq k}^{n} \sum_{j=k}^{k}\theta_i w_{ij} \theta_j + \sum_{i=1,i\neq k}^{n} \sum_{j=1,j\neq k}^{n}\theta_i w_{ij} \theta_j \right) \\
&= \frac{\partial}{\partial \theta_k} \left( w_{kk} \theta_k^2 + \sum_{j=1,j\neq k}^{n}\theta_k w_{kj} \theta_j + \sum_{i=1,i\neq k}^{n} \theta_i w_{ik} \theta_k + \sum_{i=1,i\neq k}^{n} \sum_{j=1,j\neq k}^{n}\theta_i w_{ij} \theta_j \right) \\
&= 2w_{kk}\theta_k +  \sum_{j=1,j\neq k}^{n} w_{kj} \theta_j + \sum_{i=1,i\neq k}^{n} \theta_i w_{ik} + 0 \\
&=  \sum_{i=1}^{n} w_{ki} \theta_i + \sum_{i=1}^{n} w_{ik}\theta_i \\
&= 2\sum_{i=1}^{n} w_{ki}\theta_i
\end{align*} $$
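The three scalar-by-vector derivatives derived above can be sanity-checked with central finite differences. This is only a sketch on random data; the helper `numerical_gradient` and all variable names are assumptions introduced for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Approximate the gradient of scalar function f at theta by central differences."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

# d(theta^T X^T y)/d(theta) should equal X^T y
grad_linear_1 = numerical_gradient(lambda t: t @ X.T @ y, theta)
# d(y^T X theta)/d(theta) should also equal X^T y
grad_linear_2 = numerical_gradient(lambda t: y @ X @ t, theta)
# d(theta^T X^T X theta)/d(theta) should equal 2 X^T X theta
grad_quadratic = numerical_gradient(lambda t: t @ X.T @ X @ t, theta)

print(np.allclose(grad_linear_1, X.T @ y, atol=1e-5))
print(np.allclose(grad_linear_2, X.T @ y, atol=1e-5))
print(np.allclose(grad_quadratic, 2 * X.T @ X @ theta, atol=1e-5))
```

The factor 2 in the quadratic term relies on $\boldsymbol{X}^T\boldsymbol{X}$ being symmetric; for a general square matrix the two sums in the derivation would not merge.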

So equation (1) becomes

$$
\begin{align*}
\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} 
&= \frac{1}{2} \left[\frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{X\theta} -  \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{y} - \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{y}^{\text{T}}\boldsymbol{X\theta} \right]  \\
&=\frac{1}{2} \left[ 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta} - \boldsymbol{X}^T \boldsymbol{y} - \boldsymbol{X}^T \boldsymbol{y} \right] \\
&= \boldsymbol{X}^{\text{T}}\left( \boldsymbol{X\theta} -  \boldsymbol{y}\right)
\end{align*}
$$
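As a small worked example of the result $\boldsymbol{X}^{\text{T}}(\boldsymbol{X\theta} - \boldsymbol{y})$, the sketch below evaluates the gradient on a hypothetical three-sample data set, once at parameters that fit the data exactly and once at parameters that do not. The function name `gradient` and the data are illustrative assumptions.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) = 1/2 * ||X theta - y||^2, i.e. X^T (X theta - y)."""
    return X.T @ (X @ theta - y)

# Hypothetical toy data generated by y = 1 + 2 * x_1; first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta_exact = np.array([1.0, 2.0])
theta_off = np.array([0.0, 2.0])

grad_at_exact = gradient(theta_exact, X, y)  # residual is zero, so the gradient is zero
grad_at_off = gradient(theta_off, X, y)      # residual is [-1, -1, -1]

# The same value from the unfactored form 1/2 * (2 X^T X theta - 2 X^T y).
unfactored = 0.5 * (2 * X.T @ X @ theta_off - 2 * X.T @ y)

print(grad_at_exact, grad_at_off, unfactored)
```

At `theta_off` every residual is -1, so each gradient component is minus the corresponding column sum of `X`, which is -3 for both columns here.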

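### Differentiation using the trace

Andrew Ng introduces derivatives of the trace in his derivation. The derivation here uses slightly different formulas from his, but it leads to the same result.

The trace of an n×n matrix $\boldsymbol{A} = \left[ a_{ij} \right]$ is defined as:

$$ \text{tr}\ \boldsymbol{A} = \text{tr}(\boldsymbol{A}) = a_{11} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}$$

The trace is defined only for square matrices; a non-square matrix has no trace.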

Some identities involving the trace:

1. $ a \in \mathbb{R}^{1 \times 1}$
$$ \begin{align*} a = \text{tr}\ a  \tag{2} \end{align*} $$
2. $ \boldsymbol{A} \in \mathbb{R}^{n \times n}$
$$ \begin{align*} \text{tr}\ \boldsymbol{A}^{\text{T}} = \text{tr}\ \boldsymbol{A}  \tag{3} \end{align*} $$
3. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{n \times m}$
$$ \begin{align*} \text{tr}\ \boldsymbol{AB} = \text{tr}\ \boldsymbol{BA}  \tag{4} \end{align*} $$
4. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{n \times r}$, $ \boldsymbol{C} \in \mathbb{R}^{r \times m}$
$$ \begin{align*} \text{tr}\ \boldsymbol{ABC} = \text{tr}\ \boldsymbol{CAB} = \text{tr}\ \boldsymbol{BCA}  \tag{5} \end{align*} $$
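Identities (3), (4), and (5) are easy to spot-check numerically. The following sketch uses random matrices of arbitrary, assumed shapes; it illustrates the identities on one example and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(3, 3))   # square matrix for identity (3)
A = rng.normal(size=(2, 3))   # m x n
B = rng.normal(size=(3, 4))   # n x r
C = rng.normal(size=(4, 2))   # r x m
D = rng.normal(size=(3, 2))   # n x m, pairs with A for identity (4)

# (3): tr(S^T) = tr(S)
tr_s, tr_s_t = np.trace(S), np.trace(S.T)

# (4): tr(AD) = tr(DA), even though AD is 2x2 and DA is 3x3
tr_ad, tr_da = np.trace(A @ D), np.trace(D @ A)

# (5): the trace is invariant under cyclic permutations of the product
tr_abc = np.trace(A @ B @ C)
tr_cab = np.trace(C @ A @ B)
tr_bca = np.trace(B @ C @ A)

print(np.isclose(tr_s, tr_s_t), np.isclose(tr_ad, tr_da))
print(np.isclose(tr_abc, tr_cab), np.isclose(tr_abc, tr_bca))
```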

Derivatives of the trace:

1. $ \boldsymbol{A} \in \mathbb{R}^{n \times n}$
$$ \begin{align*} \frac{\partial \text{tr}\ \boldsymbol{A}}{\partial \boldsymbol{A}} = \boldsymbol{I}  \tag{6} \end{align*} $$
2. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{n \times m}$
$$ \begin{align*} \frac{\partial \text{tr}\ \boldsymbol{AB}}{\partial \boldsymbol{A}} = \frac{\partial \text{tr}\ \boldsymbol{BA}}{\partial \boldsymbol{A}} = \boldsymbol{B}^{\text{T}}  \tag{7} \end{align*} $$
3. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{m \times n}$
$$ \begin{align*} \frac{\partial \text{tr}\ \boldsymbol{A}^{\text{T}}\boldsymbol{B}}{\partial \boldsymbol{A}} = \frac{\partial \text{tr}\ \boldsymbol{B}\boldsymbol{A}^{\text{T}}}{\partial \boldsymbol{A}} = \boldsymbol{B}  \tag{8} \end{align*} $$
4. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{m \times m}$
$$ \begin{align*} \frac{\partial \text{tr}\ \boldsymbol{A}^{\text{T}}\boldsymbol{BA}}{\partial \boldsymbol{A}} =  \left( \boldsymbol{B} + \boldsymbol{B}^{\text{T}} \right) \boldsymbol{A}  \tag{9} \end{align*} $$
5. $ \boldsymbol{A} \in \mathbb{R}^{m \times n}$, $ \boldsymbol{B} \in \mathbb{R}^{n \times n}$
$$ \begin{align*} \frac{\partial \text{tr}\ \boldsymbol{AB}\boldsymbol{A}^{\text{T}}}{\partial \boldsymbol{A}} = \boldsymbol{A} \left( \boldsymbol{B} + \boldsymbol{B}^{\text{T}} \right)  \tag{10} \end{align*} $$
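The trace derivative formulas can likewise be checked entry by entry with finite differences. The sketch below does this for formulas (7), (9), and (10) on random matrices; the helper `numerical_matrix_gradient` and the chosen shapes are assumptions made for illustration.

```python
import numpy as np

def numerical_matrix_gradient(f, A, eps=1e-6):
    """Approximate d f(A) / d A entry by entry using central differences."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        step = np.zeros_like(A)
        step[idx] = eps
        grad[idx] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 2))     # m x n
B7 = rng.normal(size=(2, 3))    # n x m, for formula (7)
B9 = rng.normal(size=(3, 3))    # m x m and deliberately not symmetric, for formula (9)
B10 = rng.normal(size=(2, 2))   # n x n, for formula (10)

# (7): d tr(A B) / dA = B^T
grad7 = numerical_matrix_gradient(lambda M: np.trace(M @ B7), A)
# (9): d tr(A^T B A) / dA = (B + B^T) A
grad9 = numerical_matrix_gradient(lambda M: np.trace(M.T @ B9 @ M), A)
# (10): d tr(A B A^T) / dA = A (B + B^T)
grad10 = numerical_matrix_gradient(lambda M: np.trace(M @ B10 @ M.T), A)

print(np.allclose(grad7, B7.T, atol=1e-5))
print(np.allclose(grad9, (B9 + B9.T) @ A, atol=1e-5))
print(np.allclose(grad10, A @ (B10 + B10.T), atol=1e-5))
```

Using a non-symmetric `B9` matters here: with a symmetric matrix the check could not distinguish $(\boldsymbol{B} + \boldsymbol{B}^{\text{T}})\boldsymbol{A}$ from $2\boldsymbol{BA}$.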

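By (2), each term in (1) can be rewritten as the derivative of a trace; applying (9), (8), and (7) in turn then gives the result directly. Below, $\boldsymbol{W} = \boldsymbol{X}^{\text{T}}\boldsymbol{X}$ and $\boldsymbol{Z} = \boldsymbol{X}^{\text{T}}\boldsymbol{y}$: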

$$
\begin{align*}
\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} 
&= \frac{1}{2} \left[\frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{X\theta} -  \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{y} - \frac{\partial}{\partial \boldsymbol{\theta}} \boldsymbol{y}^{\text{T}}\boldsymbol{X\theta} \right]  \\
&= \frac{1}{2} \left[\frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{X\theta} -  \frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{\theta}^{\text{T}}\boldsymbol{X}^{\text{T}}\boldsymbol{y} - \frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{y}^{\text{T}}\boldsymbol{X\theta} \right] \\
&= \frac{1}{2} \left[\frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{\theta}^{\text{T}}\boldsymbol{W}\boldsymbol{\theta} -  \frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{\theta}^{\text{T}}\boldsymbol{Z} - \frac{\partial}{\partial \boldsymbol{\theta}} \text{tr} \ \boldsymbol{Z}^{\text{T}}\boldsymbol{\theta} \right] \\
&= \frac{1}{2} \left[ \left( \boldsymbol{W} + \boldsymbol{W}^{\text{T}} \right) \boldsymbol{\theta} -  \boldsymbol{Z} - \left( \boldsymbol{Z}^{\text{T}} \right)^{\text{T}} \right] \\
&=  \frac{1}{2} \left[ \left( \boldsymbol{X}^{\text{T}}\boldsymbol{X} + (\boldsymbol{X}^{\text{T}}\boldsymbol{X})^{\text{T}} \right)\boldsymbol{\theta} - \boldsymbol{X}^{\text{T}}\boldsymbol{y} - \boldsymbol{X}^{\text{T}}\boldsymbol{y} \right] \\
&= \boldsymbol{X}^{\text{T}}\left( \boldsymbol{X\theta} -  \boldsymbol{y}\right)
\end{align*}
$$ 

Setting the derivative to zero gives the normal equation:

$$ \boldsymbol{X}^{\text{T}}\boldsymbol{X} \boldsymbol{\theta} = \boldsymbol{X}^{\text{T}} \boldsymbol{y}$$

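When $\boldsymbol{X}^{\text{T}}\boldsymbol{X}$ is invertible (that is, when $\boldsymbol{X}$ has full column rank), this can be solved for $\boldsymbol{\theta}$: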

$$ \boldsymbol{\theta} = \left( \boldsymbol{X}^{\text{T}}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{\text{T}} \boldsymbol{y}$$
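To make the closed-form solution concrete, here is a minimal NumPy sketch on a small made-up data set (the numbers are illustrative assumptions). It solves the normal equation as a linear system, compares the result with NumPy's least-squares routine, and confirms that the gradient vanishes at the solution.

```python
import numpy as np

# Hypothetical data: 4 samples, 1 feature, first column is the constant x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Solve (X^T X) theta = X^T y as a linear system rather than forming the inverse.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the solution the gradient X^T (X theta - y) should be numerically zero.
gradient_at_solution = X.T @ (X @ theta_normal - y)

print(theta_normal, theta_lstsq, gradient_at_solution)
```

Calling `np.linalg.solve` instead of computing $\left(\boldsymbol{X}^{\text{T}}\boldsymbol{X}\right)^{-1}$ explicitly is a design choice in this sketch: it gives the same $\boldsymbol{\theta}$ when $\boldsymbol{X}^{\text{T}}\boldsymbol{X}$ is invertible and is generally the numerically preferable way to apply the formula.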


## References

- Andrew Ng, machine learning course videos
- Zhang Xianda, *Matrix Analysis and Applications* (矩阵分析与应用)
