# Deep Learning Math

## Notation

**Scalar:** Just a single value.

**Vector:** Ordered array of numbers, where each element is identified by a single index.

**Matrix:** 2-D array of numbers, which is why each element is indexed by two subscripts.

From left to right: Column Vector, Row Vector, Matrix
$$
\begin{align}
    x &= \begin{bmatrix}
       x_{1} \\
       x_{2} \\
       \vdots \\
       x_{m}
     \end{bmatrix}, \ \ \ \ x^T = \begin{bmatrix}
       x_{1} & x_{2} & \cdots & x_{m}
     \end{bmatrix}, \ \ \ \ X = \begin{bmatrix}
       x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\
       x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\
       \vdots & \vdots & \ddots & \vdots \\
       x_{m,1} & x_{m,2} & \cdots & x_{m,n}
     \end{bmatrix}
\end{align}
$$

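One special matrix is the identity matrix ***I***<sub>n</sub> &isin; ℝ<em><sup>n×n</sup></em>: multiplying any vector by it leaves the vector unchanged. It also defines the inverse ***A***<sup>-1</sup> of a square matrix ***A***, whenever that inverse exists. For n = 3:
$$
\begin{align}
    A^{-1}A = I_3 &= \begin{bmatrix}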
       1 & 0 & 0 \\
       0 & 1 & 0 \\
       0 & 0  & 1
     \end{bmatrix}
\end{align}
$$
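Both properties of the identity matrix can be checked numerically. The following is a minimal sketch assuming NumPy is available; the matrix `A` and the vector `x` are illustrative values chosen here, not taken from the sources.

```python
import numpy as np

# Illustrative values chosen for this sketch; they do not come from the text.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
x = np.array([3.0, -1.0, 2.0])

I3 = np.eye(3)             # 3x3 identity matrix
A_inv = np.linalg.inv(A)   # exists because A is invertible (det(A) = 2)

# Multiplying by the identity leaves x unchanged.
identity_keeps_x = np.allclose(I3 @ x, x)
# The inverse times the matrix gives the identity (up to rounding).
inverse_gives_identity = np.allclose(A_inv @ A, I3)
print(identity_keeps_x, inverse_gives_identity)
```

The comparison uses `np.allclose` rather than `==` because the inverse is computed in floating point.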
**Tensor:** Generalization of a matrix to an arbitrary number of axes.
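In array libraries, the four objects above differ only in their number of axes. A small sketch, assuming NumPy; the variable names and values are hypothetical and only serve as illustration:

```python
import numpy as np

s = np.float64(3.5)                  # scalar: 0 axes
x = np.array([1.0, 2.0, 3.0])        # vector: 1 axis
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # matrix: 2 axes (m = 2 rows, n = 3 columns)
T = np.zeros((2, 3, 4))              # tensor: here 3 axes

for name, value in [("scalar", s), ("vector", x), ("matrix", X), ("tensor", T)]:
    print(name, np.ndim(value), np.shape(value))

# Element x_{2,3} in the 1-based math notation is X[1, 2] with 0-based indexing.
element_2_3 = X[1, 2]
```

Note that NumPy does not distinguish row and column vectors for 1-D arrays; an explicit column vector would need shape `(m, 1)`.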

## Operations

**Transpose:** Mirror image of the matrix across its main diagonal, which runs from the top left to the bottom right. Rows become columns and vice versa.
$$
\begin{align}
    (A^T)_{i,j} = A_{j,i}
\end{align}
$$
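The index rule can be verified entry by entry. A minimal sketch, assuming NumPy, where `.T` returns the transpose:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [3, 2, 1]])   # shape (2, 3)
A_T = A.T                   # shape (3, 2)

# Check (A^T)_{i,j} = A_{j,i} for every index pair.
entries_match = all(
    A_T[i, j] == A[j, i]
    for i in range(A_T.shape[0])
    for j in range(A_T.shape[1])
)
print(A_T.shape, entries_match)
```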
**Addition:** The sum of two matrices is the elementwise sum of their entries, so both matrices must have the same shape. Numerical libraries often relax this requirement through *broadcasting*: if the shapes agree along one axis and one of the matrices has size 1 along the other axis, that matrix is implicitly repeated along that axis until the shapes match. A scalar can be added to a matrix in the same way, by adding it to every element.

$$
\begin{align}
    A + B := \begin{bmatrix}
       a_{1,1} + b_{1,1} & \cdots & a_{1,n} + b_{1,n} \\
       \vdots  & \ddots & \vdots \\
       a_{m,1} + b_{m,1} & \cdots & a_{m,n} + b_{m,n}
     \end{bmatrix}
\end{align}
$$
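Broadcasting is easiest to see on a small example. The sketch below assumes NumPy's broadcasting rules; the arrays are illustrative and not taken from the sources.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
row = np.array([[10, 20, 30]])     # shape (1, 3): stretched along the rows
col = np.array([[100], [200]])     # shape (2, 1): stretched along the columns

A_plus_row = A + row   # adds `row` to every row of A
A_plus_col = A + col   # adds `col` to every column of A
A_plus_one = A + 1     # a scalar is added to every element

print(A_plus_row)
print(A_plus_col)
print(A_plus_one)
```

Shapes that cannot be matched this way, for example `(2, 3)` and `(2, 2)`, raise an error instead of being broadcast.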

**Multiplication:** There are several kinds of multiplication on matrices. The standard matrix product is *not* the elementwise product. The elementwise product, the so-called *Hadamard product*, is also common in numerical code, especially combined with broadcasting to apply the same operation to every row or every column when the shapes differ. The standard matrix product ***C*** = ***AB*** is defined as:
$$
\begin{align}
    C_{i,j} = \sum \limits _k A_{i,k} B_{k,j}
\end{align}
$$
**Example:**
$$
\begin{align}
    \text{For } A &= \begin{bmatrix}
       1 & 2 & 3 \\
       3 & 2 & 1
     \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \
     B = \begin{bmatrix}
       0 & 2 \\
       1 & -1 \\
       0 & 1
     \end{bmatrix} \in \mathbb{R}^{3 \times 2}
\end{align}
$$

$$
\begin{align}
    AB &= \begin{bmatrix}
       1 & 2 & 3 \\
       3 & 2 & 1
     \end{bmatrix}
     \begin{bmatrix}
       0 & 2 \\
       1 & -1 \\
       0 & 1
     \end{bmatrix} = \begin{bmatrix}
       2 & 3 \\
       2 & 5
     \end{bmatrix} \in \mathbb{R}^{2 \times 2}
\end{align}
$$

$$
\begin{align}
    BA &= \begin{bmatrix}
       0 & 2 \\
       1 & -1 \\
       0 & 1
     \end{bmatrix}
     \begin{bmatrix}
       1 & 2 & 3 \\
       3 & 2 & 1
     \end{bmatrix} = \begin{bmatrix}
       6 & 4 & 2 \\
       -2 & 0 & 2 \\
       3 & 2 & 1
     \end{bmatrix} \in \mathbb{R}^{3 \times 3}
\end{align}
$$

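This example shows that matrix multiplication is not commutative: ***AB*** and ***BA*** even have different shapes here. Furthermore, the matrix product (sometimes loosely called the dot product) is only defined if the inner dimensions match, i.e. the number of columns of the left matrix equals the number of rows of the right matrix.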
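The example can be reproduced by implementing the sum over k directly. This is a sketch for illustration, assuming NumPy; `matmul_by_definition` is a hypothetical helper name, and in practice the built-in `@` operator would be used instead of explicit loops.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Compute C[i, j] = sum_k A[i, k] * B[k, j] with explicit loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, p), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2, 3],
              [3, 2, 1]])
B = np.array([[0, 2],
              [1, -1],
              [0, 1]])

AB = matmul_by_definition(A, B)   # shape (2, 2)
BA = matmul_by_definition(B, A)   # shape (3, 3)
hadamard = A * B.T                # elementwise product of two (2, 3) arrays

print(AB)
print(BA)
print(hadamard)
```

The Hadamard product is taken with `B.T` here because `A` and `B` themselves have different shapes, so their elementwise product is not defined.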



## Norms

In machine learning, the size of a vector is usually measured with a norm. Norms are functions that map vectors to non-negative values. Intuitively, the norm of a vector x measures the distance from the origin to the point x. The most important norm is the L<sup>2</sup> norm, also known as the Euclidean norm, which is the L<sup>p</sup> norm for p = 2 and gives the straight-line distance between the origin and x. More generally, the L<sup>p</sup> norm is defined for p ≥ 1 as:
$$
\begin{align}
    \| \mathbf{x} \|_p = \bigg( \sum \limits _i | x_i |^p \bigg)^{\frac{1}{p}}
\end{align}    
$$
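The formula translates almost literally into code. A minimal sketch, assuming NumPy; `lp_norm` is a hypothetical helper written for this note, and the vector is an illustrative choice.

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm of a vector, following the formula above (p >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])   # illustrative vector, not from the text

l1 = lp_norm(x, 1)   # |3| + |-4|
l2 = lp_norm(x, 2)   # sqrt(3^2 + (-4)^2)
print(l1, l2)
```

NumPy offers the same computation as `np.linalg.norm(x, ord=p)`.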

The most common way to measure the size of a whole matrix is the *Frobenius norm*, the square root of the sum of all squared entries. It can also be written with the trace Tr(·), which sums up all diagonal entries of a matrix:
$$
\begin{align}
    \operatorname{Tr}(A) = \sum \limits _i A_{i,i} \ \ , \ \ \|A\|_F = \sqrt{\operatorname{Tr}(AA^T)} \ \ , \ \ \|A\|_F = \sqrt{ \sum \limits _{i,j} A_{i,j}^{2} }
\end{align} 
$$
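That the trace form and the entrywise form agree can be checked numerically. A sketch assuming NumPy, with an illustrative matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0]])

fro_from_entries = np.sqrt(np.sum(A ** 2))       # sqrt of the sum of squared entries
fro_from_trace = np.sqrt(np.trace(A @ A.T))      # sqrt(Tr(A A^T))
fro_builtin = np.linalg.norm(A, "fro")           # NumPy's Frobenius norm

print(fro_from_entries, fro_from_trace, fro_builtin)
```

The diagonal entries of `A @ A.T` are the squared row lengths of `A`, which is why their sum equals the sum of all squared entries.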

## Rank

The rank of a matrix ***A*** &isin; ℝ<em><sup>m×n</sup></em> is the number of its linearly independent columns. This always equals the number of linearly independent rows. The rank of ***A*** is denoted as rk(***A***).
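A small numerical illustration, assuming NumPy's `np.linalg.matrix_rank`; the matrix is a hypothetical example whose second row is a multiple of the first:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],    # 2 * first row, so linearly dependent
              [0.0, 1.0, 1.0]])

rank_A = np.linalg.matrix_rank(A)       # rank counted over columns
rank_A_T = np.linalg.matrix_rank(A.T)   # same count over rows
print(rank_A, rank_A_T)
```

`matrix_rank` estimates the rank from singular values with a tolerance, so for nearly dependent columns the result depends on that tolerance.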

## Transformation

A transformation matrix represents a linear mapping from a vector space ***B*** into a vector space ***C***. Typical transformations in 2D are stretching, squeezing, rotation, shearing, reflection and orthogonal projection. Transformation matrices also apply in more than two dimensions. The three matrices that rotate by an angle θ about the e<sub>1</sub>-axis, e<sub>2</sub>-axis and e<sub>3</sub>-axis (written R<sub>x</sub>, R<sub>y</sub> and R<sub>z</sub> below) are:

$$
\begin{align}
    R_x(\theta) &= \begin{bmatrix}
       1 & 0 & 0 \\
       0 & \cos \theta & -\sin \theta \\
       0 & \sin \theta & \cos \theta
     \end{bmatrix}
\end{align}
$$

$$
\begin{align}
    R_y(\theta) &= \begin{bmatrix}
       \cos \theta & 0 & \sin \theta \\
       0 & 1 & 0 \\
       -\sin \theta & 0 & \cos \theta
     \end{bmatrix}
\end{align}
$$

$$
\begin{align}
    R_z(\theta) &= \begin{bmatrix}
       \cos \theta & -\sin \theta & 0 \\
       \sin \theta & \cos \theta & 0 \\
       0 & 0 & 1
     \end{bmatrix}
\end{align}
$$
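As a sketch of how such a matrix is applied, the following builds R<sub>z</sub> and rotates the first basis vector by a quarter turn. It assumes NumPy and angles in radians; `rotation_z` is a hypothetical helper name.

```python
import numpy as np

def rotation_z(theta):
    """Rotation matrix R_z(theta) about the e3-axis, theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

R = rotation_z(np.pi / 2)          # quarter turn
e1 = np.array([1.0, 0.0, 0.0])
rotated = R @ e1                   # e1 is carried onto e2 (up to rounding)

# Rotation matrices are orthogonal and have determinant 1.
is_orthogonal = np.allclose(R.T @ R, np.eye(3))
has_unit_determinant = np.isclose(np.linalg.det(R), 1.0)
print(np.round(rotated, 6), is_orthogonal, has_unit_determinant)
```

The two checks at the end reflect that a rotation preserves lengths and orientation.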

## Sources

- [Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.](https://www.deeplearningbook.org/contents/linear_algebra.html)
- [Deisenroth, M., Faisal, A., and Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press.](https://mml-book.github.io/book/mml-book.pdf)
- [Wikipedia. (2022). Transformation matrix.](https://en.wikipedia.org/wiki/Transformation_matrix)