- references
    - https://atmos.washington.edu/~dennis/MatrixCalculus.pdf
    - https://math.stackexchange.com/questions/312077/differentiate-fx-xtax

## cases

$$
\begin{split}
&\frac{\partial}{\partial x}(v^Tx)=v^T\\
&\frac{\partial}{\partial x}(Ax)=A\\
&\frac{\partial}{\partial x}(x^TAx)=x^T(A+A^T)\quad(=2x^TA\ \text{when}\ A=A^T)
\end{split}
$$

## $\mathbf y=\mathbf {Ax}$

$$
\mathbf{y} = \psi(\mathbf{x}),
$$


$$
\begin{equation}
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = 
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
\end{equation}
$$

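- $\mathbf{y} = \psi(\mathbf{x})$ maps $\mathbf x\in\mathbb R^n$ to $\mathbf y\in\mathbb R^m$; for example, $\mathbf y=\mathbf {Ax}$
- $\frac{\partial \mathbf y}{\partial \mathbf x}$ is the derivative of a vector (multiple outputs) with respect to a vector (multiple inputs); the result is the $m\times n$ Jacobian matrix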


$$
\begin{split}
&\mathbf y=\mathbf {Ax}\\
&\frac{\partial \mathbf y}{\partial \mathbf x}=\mathbf A
\end{split}
$$

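- A short derivation:
    - $y_i=A_{[i]}\mathbf x=\sum_k a_{ik}x_k$, where $A_{[i]}$ is the $i$-th row of $\mathbf A$
        - $y_1=\sum_ka_{1k}x_k$
    - hence $\frac{\partial y_i}{\partial x_j}=a_{ij}$, which is entry $(i,j)$ of $\mathbf A$
- A special case: when $\mathbf A$ is a single row vector ($\mathbf w^T$), the map reduces to an inner product with multiple inputs and one scalar output, and the derivative is a vector with as many entries as the input: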

    $$
    y=\mathbf w^T\mathbf x
    $$
  
    - $y=\mathbf w^T\mathbf x=\sum_iw_ix_i$
 

$$
\frac{\partial y}{\partial \mathbf x}=\begin{bmatrix}w_1,w_2,\cdots,w_n\end{bmatrix}=\mathbf w^T
$$
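
The two results of this section can be checked numerically. The sketch below is only an illustration: the values of `A`, `x`, and `w` are made up, and it relies on `torch.autograd.functional.jacobian`, which returns a tensor of shape output shape + input shape.

```python
import torch
from torch.autograd.functional import jacobian

# Illustrative values (not from the text): a 3x2 matrix and a length-2 input.
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
x = torch.tensor([0.5, 1.0])

# Jacobian of the linear map x -> A x; its shape is (m, n) = (3, 2).
J = jacobian(lambda v: A @ v, x)
print(J.shape)
print(torch.allclose(J, A))   # True: dy/dx = A

# Row-vector special case: y = w^T x is a scalar, so the derivative has n entries.
w = torch.tensor([2.0, -1.0])
g = jacobian(lambda v: w @ v, x)
print(torch.allclose(g, w))   # True: dy/dx = w^T
```

Because PyTorch returns 1-D tensors here, the row-versus-column distinction of $\mathbf w^T$ is not visible in the output; only the entries are.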

## $\alpha=\mathbf y^T\mathbf A\mathbf x$

$$
\begin{split}
&\frac{\partial \alpha}{\partial \mathbf x}=\mathbf y^T\mathbf A\\
&\frac{\partial \alpha}{\partial \mathbf y}=\mathbf x^T\mathbf A^T
\end{split}
$$

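Proof:
- For the first derivative
    - define $\mathbf w^T=\mathbf y^T\mathbf A$
    - then $\alpha=\mathbf w^T\mathbf x$
    - so $\frac{\partial \alpha}{\partial \mathbf x}=\mathbf w^T=\mathbf y^T\mathbf A$
- For the second derivative
    - $\alpha$ is a scalar, so $\alpha=\alpha^T=\mathbf x^T\mathbf A^T\mathbf y$
    - applying the same inner-product rule with $\mathbf x^T\mathbf A^T$ as the row vector gives $\frac{\partial \alpha}{\partial \mathbf y}=\mathbf x^T\mathbf A^T$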
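
Both identities can be confirmed with autograd. This is a minimal sketch with assumed sizes (`m = 3`, `n = 2`) and random values; none of these names come from the text above.

```python
import torch

torch.manual_seed(0)
m, n = 3, 2                                # hypothetical sizes
A = torch.randn(m, n)
x = torch.randn(n, requires_grad=True)
y = torch.randn(m, requires_grad=True)

alpha = y @ A @ x                          # scalar y^T A x
alpha.backward()

# Gradients are stored as 1-D tensors, so only the entries are compared.
print(torch.allclose(x.grad, y.detach() @ A))   # entries of y^T A
print(torch.allclose(y.grad, A @ x.detach()))   # entries of x^T A^T = (A x)^T
```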

## $\alpha =\mathbf x^T\mathbf A\mathbf x$

$$
\frac{\partial \alpha}{\partial \mathbf x}=\mathbf x^T(\mathbf A+\mathbf A^T)
$$

Proof, working from the component-wise definition of the matrix-vector product:

$$
\begin{split}
&\alpha=\sum_ix_i\sum_ja_{ij}x_j=\sum_i\sum_jx_ia_{ij}x_j\\
&\frac{\partial \alpha}{\partial x_k}=\sum_ix_ia_{ik}+\sum_ja_{kj}x_j\\
&\frac{\partial \alpha}{\partial \mathbf x}=\mathbf x^T\mathbf A+\mathbf x^T\mathbf A^T=\mathbf x^T(\mathbf A+\mathbf A^T)
\end{split}
$$
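
A quick numerical check of the quadratic-form result, using a deliberately non-symmetric matrix so that the general formula and the symmetric shortcut $2\mathbf A\mathbf x$ give different answers. The values are illustrative assumptions.

```python
import torch

A = torch.tensor([[1.0, 2.0],
                  [0.0, 3.0]])             # deliberately non-symmetric
x = torch.tensor([1.0, -1.0], requires_grad=True)

alpha = x @ A @ x                          # scalar x^T A x
alpha.backward()

expected = x.detach() @ (A + A.T)          # entries of x^T (A + A^T)
shortcut = 2 * (A @ x.detach())            # 2 A x, valid only when A = A^T

print(torch.allclose(x.grad, expected))    # True
print(torch.allclose(x.grad, shortcut))    # False for this non-symmetric A
```

By hand, $\mathbf A+\mathbf A^T=\begin{bmatrix}2&2\\2&6\end{bmatrix}$ and $\mathbf x^T(\mathbf A+\mathbf A^T)=\begin{bmatrix}0&-4\end{bmatrix}$, whereas $2\mathbf A\mathbf x$ has entries $-2$ and $-6$.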

## $\mathbf y=\mathbf A\mathbf x$

$$
\frac{\partial \mathbf y}{\partial \mathbf A}
$$

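- It is a three-dimensional tensor (shown here for a $2\times 2$ matrix $\mathbf A$)
    - each $\frac{\partial y_i}{\partial \mathbf A}$ is a matrix with the same shape as $\mathbf A$
    - $y_1=a_{11}x_1+a_{12}x_2$ => $\begin{bmatrix}x_1 & x_2\\0 & 0\end{bmatrix}$
    - $y_2=a_{21}x_1+a_{22}x_2$ => $\begin{bmatrix}0 & 0\\x_1 & x_2\end{bmatrix}$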
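
The two slices above are instances of one general pattern. Writing $a_{jk}$ for the entries of $\mathbf A$ and $\delta_{ij}$ for the Kronecker delta (a symbol introduced here for illustration), differentiating $y_i=\sum_k a_{ik}x_k$ entry by entry gives:

$$
\frac{\partial y_i}{\partial a_{jk}}=\delta_{ij}\,x_k=
\begin{cases}
x_k, & i=j\\
0, & i\neq j
\end{cases}
$$

So slice $i$ of the tensor has $\mathbf x^T$ in row $i$ and zeros everywhere else.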

In [10]:
import torch

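# Define the matrix A and the vector x
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]], requires_grad=True)

x = torch.tensor([[0.5], [1.0]], requires_grad=True)

# Compute y = A @ x
y = torch.matmul(A, x)

# Backpropagate an all-ones upstream gradient. This yields d(sum(y))/dA,
# i.e. the Jacobian summed over the output index, not the full Jacobian.
y.backward(torch.ones_like(y))

# Read the accumulated gradient (same shape as A)
grad_A = A.grad
grad_A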

tensor([[0.5000, 1.0000],
        [0.5000, 1.0000],
        [0.5000, 1.0000]])

In [12]:
import torch

# Define the matrix A and the vector x
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]], requires_grad=True)

x = torch.tensor([[0.5], [1.0]], requires_grad=True)

# Compute y = A @ x
y = torch.matmul(A, x)

# Allocate a zero tensor of shape (len(y), *A.shape) to hold the Jacobian
jacobian = torch.zeros((y.size(0), A.size(0), A.size(1)))

# Fill in the Jacobian one output element at a time
for i in range(y.size(0)):
    # Clear the gradient left over from the previous iteration
    A.grad = None
    
    # Backpropagate from the i-th element of y
    y[i].backward(retain_graph=True)
    
    # Store dy_i/dA as slice i of the Jacobian
    jacobian[i] = A.grad
jacobian

tensor([[[0.5000, 1.0000],
         [0.0000, 0.0000],
         [0.0000, 0.0000]],

        [[0.0000, 0.0000],
         [0.5000, 1.0000],
         [0.0000, 0.0000]],

        [[0.0000, 0.0000],
         [0.0000, 0.0000],
         [0.5000, 1.0000]]])
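
The loop above can also be written as a single call. The sketch below uses `torch.autograd.functional.jacobian` with a 1-D `x` (an assumption made to keep the result three-dimensional), compares the result with the closed form $\delta_{ij}x_k$, and shows that summing over the output index gives back the all-ones `backward` result from the first cell.

```python
import torch
from torch.autograd.functional import jacobian

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
x = torch.tensor([0.5, 1.0])               # 1-D here, unlike the column vector above

# J[i, j, k] = d y_i / d a_jk, so J has shape (3, 3, 2).
J = jacobian(lambda M: M @ x, A)

# Closed form: delta_ij * x_k.
expected = torch.einsum("ij,k->ijk", torch.eye(3), x)
print(J.shape)
print(torch.allclose(J, expected))         # True

# Summing over the output index i leaves x in every row,
# which is what y.backward(torch.ones_like(y)) stores in A.grad.
summed = J.sum(dim=0)
print(summed)
```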

## loss backward

In [1]:
import torch

# Input x, weight W, and bias b
x = torch.tensor([[1.0, 2.0]], requires_grad=True)  # 1x2 row vector
W = torch.tensor([[0.5, -0.5], [1.5, -1.0]], requires_grad=True)  # 2x2 matrix
b = torch.tensor([[0.1, -0.1]], requires_grad=True)  # 1x2 row vector

# Forward pass: z = xW + b
z = x @ W + b  # matrix product plus bias

# A simple scalar loss: the sum of the entries of z
L = z.sum()

# Backward pass to compute the gradients
L.backward()

# Print the gradients
print("dL/dx:", x.grad)
print("dL/dW:", W.grad)
print("dL/db:", b.grad)

# Verify dL/dW by hand
dL_dz = torch.ones_like(z)  # L = z.sum(), so dL/dz is all ones
dz_dW = x.t()  # for z = xW + b, the chain rule gives dL/dW = x^T @ dL/dz
manual_dL_dW = dz_dW @ dL_dz  # (2x1) @ (1x2): an outer product
print("Manual dL/dW:", manual_dL_dW)

dL/dx: tensor([[0.0000, 0.5000]])
dL/dW: tensor([[1., 1.],
        [2., 2.]])
dL/db: tensor([[1., 1.]])
Manual dL/dW: tensor([[1., 1.],
        [2., 2.]], grad_fn=<MmBackward0>)
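
All three printed gradients follow from the chain rule applied entry by entry. Writing $\mathbf g=\frac{\partial L}{\partial \mathbf z}$ as a $1\times 2$ row (the symbol $\mathbf g$ is introduced here for illustration), and storing each gradient in the same shape as its parameter, as PyTorch does:

$$
\begin{split}
&z_j=\sum_i x_iW_{ij}+b_j\\
&\frac{\partial L}{\partial W_{ij}}=x_i\,g_j\ \Rightarrow\ \frac{\partial L}{\partial \mathbf W}=\mathbf x^T\mathbf g\\
&\frac{\partial L}{\partial x_i}=\sum_j g_jW_{ij}\ \Rightarrow\ \frac{\partial L}{\partial \mathbf x}=\mathbf g\mathbf W^T\\
&\frac{\partial L}{\partial b_j}=g_j\ \Rightarrow\ \frac{\partial L}{\partial \mathbf b}=\mathbf g
\end{split}
$$

With $\mathbf g=\begin{bmatrix}1 & 1\end{bmatrix}$, $\mathbf x=\begin{bmatrix}1 & 2\end{bmatrix}$ and the $\mathbf W$ above, this gives $\mathbf x^T\mathbf g=\begin{bmatrix}1 & 1\\2 & 2\end{bmatrix}$, $\mathbf g\mathbf W^T=\begin{bmatrix}0.5-0.5 & 1.5-1.0\end{bmatrix}=\begin{bmatrix}0 & 0.5\end{bmatrix}$, and $\frac{\partial L}{\partial \mathbf b}=\begin{bmatrix}1 & 1\end{bmatrix}$, matching the printed values.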
