---
title: Mathematical Preliminaries
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lVert #1 \right\rVert' 
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\boldsymbol{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
---

**DIVE into Deep Learning**
___

In [None]:
import numpy

The following is a lecture series that introduces the basic theory of deep learning.

::::{card}
:header: [open in new tab](https://www.youtube.com/embed/videoseries?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
:::{iframe} https://www.youtube.com/embed/videoseries?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
:width: 100%
:::
::::

**What to know about vector calculus?**

::::{card}
:header: [open in new tab](https://www.cs.cityu.edu.hk/~ccha23/dl/Notation.mp4)
:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/dl/Notation.mp4
:width: 100%
:::
::::

[Vectors](https://en.wikipedia.org/wiki/Euclidean_vector) are represented in lowercase boldface font as in

$$
\begin{align}
\M{x} &:= \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}= [x_i]\in \mathbb{R}^n  
&\text{such as}\quad \M{x}&:= \begin{bmatrix} 1 \\ 2 \\ \vdots \\ 9 \end{bmatrix} \text{, and}  \tag{column vector}\\
\M{x}^\intercal &=\begin{bmatrix}x_1 & x_2 & \cdots & x_n \end{bmatrix} & \text{such as}\quad \M{x}^\intercal &= \begin{bmatrix} 1 & 2 & \cdots & 9 \end{bmatrix}.\tag{row vector}\\
\end{align}
$$

- The above example defines a Euclidean vector, which is a 1-D array of $n$ real numbers (elements of $\mathbb{R}$) arranged as a column or a row.
- The transpose $(\cdot)^\intercal$ turns a column vector into a row vector, and vice versa.

In [None]:
import numpy as np

seq = np.arange(1, 10)  # 1D array
x = seq.reshape(-1, 1)  # column vector
x_transposed = x.transpose()  # row vector

print("Column vector:", x, "Row vector:", x_transposed, sep="\n")

[Matrices](https://en.wikipedia.org/wiki/Matrix_(mathematics)) are represented in uppercase boldface font as in

$$
\begin{align}\M{W} 
&:=\begin{bmatrix}w_{11} & w_{12} & \cdots & w_{1n}\\
w_{21} &  \ddots &  & \vdots\\
\vdots &  & \ddots & \vdots \\
w_{m1} & \cdots  & \cdots & w_{mn}\\
\end{bmatrix} 
=[w_{ij}] \in \mathbb{R}^{m\times n}
&\text{such as} \quad
 \M{W} 
&:= 
\begin{bmatrix}1 & 2 & 3 \\
4 & 5 & 6  \\
7 & 8 & 9  \end{bmatrix}, \text{and}
\\
\M{W}^\intercal &=
\begin{bmatrix}w_{11} & w_{21} & \cdots & w_{m1}\\
w_{12} &  \ddots &  & \vdots\\
\vdots &  & \ddots & \vdots \\
w_{1n} & \cdots  & \cdots & w_{mn}\\
\end{bmatrix}
&\text{such as} \quad
 \M{W}^\intercal 
&= 
\begin{bmatrix}1 & 4 & 7 \\
2 & 5 & 8  \\
3 & 6 & 9  \end{bmatrix}.
\end{align}
$$
- The above defines a real matrix, which is a 2-D array of real numbers organized into a table with $m$ rows and $n$ columns.
- Transposing a matrix turns its rows (columns) into columns (rows).

In [None]:
W = np.arange(1, 10).reshape(3, -1)  # 3-by-3 matrix

print("W:", W, "W^T:", W.transpose(), sep="\n")

- [Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication):

$$
\begin{align}
\M{W}\M{x} = \begin{bmatrix}w_{11} & w_{12} & \cdots \\
w_{21} & \ddots &  \\
\vdots &  &  \end{bmatrix}
\begin{bmatrix}x_1 \\ x_2 \\ \vdots \end{bmatrix}
=
\begin{bmatrix}w_{11}x_1 + w_{12}x_2 + \cdots \\ w_{21}x_1+\cdots \\ \vdots \end{bmatrix}
\end{align}
$$

In [None]:
W = np.arange(1, 10).reshape(3, -1)
x = np.arange(1, 4).reshape(-1, 1)
Wx = W @ x

print("W:", W, "x:", x, "Wx:", Wx, sep="\n")
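
To connect the formula for $\M{W}\M{x}$ with the `@` operator, the following sketch recomputes each entry of the product as the sum $\sum_j w_{ij}x_j$ and compares it with NumPy's result. The name `Wx_manual` is introduced here for illustration only.

```python
import numpy as np

W = np.arange(1, 10).reshape(3, -1)  # 3-by-3 matrix
x = np.arange(1, 4).reshape(-1, 1)  # 3-by-1 column vector

# Entry i of Wx is the sum over j of w_ij * x_j,
# i.e., the dot product of row i of W with x.
Wx_manual = np.array(
    [[sum(W[i, j] * x[j, 0] for j in range(W.shape[1]))] for i in range(W.shape[0])]
)

print("Wx computed entry by entry:", Wx_manual, sep="\n")
print("Matches W @ x:", np.array_equal(Wx_manual, W @ x))
```

The explicit loops are only meant to mirror the formula; in practice `W @ x` is the idiomatic and much faster choice.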

**What to know about Probability Theory?**

::::{card}
:header: [open in new tab](https://www.cs.cityu.edu.hk/~ccha23/dl/Distribution.mp4)
:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/dl/Distribution.mp4
:width: 100%
:::
::::

[Joint distribution](https://en.wikipedia.org/wiki/Joint_probability_distribution#Mixed_case):  

$$
p_{\RM{x},\R{y}}(\M{x},y)
= \underbrace{p_{\R{y}|\RM{x}}(y|\M{x})}_{
\underbrace{\Pr}_{
\text{probability measure}\kern-3em}\Set{\R{y}=y|\RM{x}=\M{x}}} \cdot \underbrace{p_{\RM{x}}(\M{x})
}_{(\underbrace{\partial_{x_1}}_{\text{partial derivative w.r.t. $x_1$}\kern-5em} \partial_{x_2}\cdots)\Pr\Set{\RM{x} \leq \M{x}}\kern-4em}\kern1em \text{where}$$    
- $p_{\R{y}|\RM{x}}(y|\M{x})$ is the *probability mass function [(pmf)](https://en.wikipedia.org/wiki/Probability_mass_function)* of $\R{y}=y\in \mc{Y}$ [conditioned](https://en.wikipedia.org/wiki/Conditional_probability_distribution) on $\RM{x}=\M{x}\in \mc{X}$, and
- $p_{\RM{x}}(\M{x})$ is the *(multivariate) probability density function [(pdf)](https://en.wikipedia.org/wiki/Probability_density_function#Densities_associated_with_multiple_variables)* of $\RM{x}=\M{x}\in \mc{X}$.
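
As a minimal sketch of the factorization above, consider a toy model that is not taken from the lecture: a scalar $\R{x}$ with a standard normal pdf and a binary $\R{y}\in\Set{0,1}$ with $\Pr\Set{\R{y}=1|\R{x}=x}$ given by the sigmoid of $x$. The functions `p_x`, `p_y_given_x`, and `p_xy` are hypothetical names chosen for this example.

```python
import numpy as np


def p_x(x):
    """Assumed marginal pdf of x: standard normal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)


def p_y_given_x(y, x):
    """Assumed conditional pmf of y in {0, 1}: Pr{y=1 | x} = sigmoid(x)."""
    p1 = 1.0 / (1.0 + np.exp(-x))
    return np.where(y == 1, p1, 1.0 - p1)


def p_xy(x, y):
    """Joint distribution as the product p_{y|x}(y|x) * p_x(x)."""
    return p_y_given_x(y, x) * p_x(x)


x = 0.5
joint = [p_xy(x, y) for y in (0, 1)]
print("p(x, y=0), p(x, y=1):", joint)
# Summing the joint over all values of y recovers the marginal pdf of x.
print("sum over y:", sum(joint), "p_x(x):", p_x(x))
```

The joint is a density in $x$ and a mass function in $y$, so it sums to the marginal pdf over $y$ rather than to one.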

::::{card}
:header: [open in new tab](https://www.cs.cityu.edu.hk/~ccha23/dl/Expectation.mp4)
:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/dl/Expectation.mp4
:width: 100%
:::
::::

For any function $g$ of $(\RM{x},\R{y})$, the expectations are:  
  
$$
\begin{align}
E[g(\RM{x},\R{y})|\RM{x}]&=\sum_{y\in \mc{Y}} g(\RM{x},y)\cdot p_{\R{y}|\RM{x}}(y|\RM{x})\tag{conditional exp.}
\\
E[g(\RM{x},\R{y})] &=\int_{\mc{X}} \underbrace{\sum_{y\in \mc{Y}} g(\M{x},y)\cdot \underbrace{p_{\RM{x},\R{y}}(\M{x},y)}_{p_{\R{y}|\RM{x}}(y|\M{x}) p_{\RM{x}}(\M{x})}\kern-1.7em}_{E[g(\RM{x},\R{y})|\RM{x}=\M{x}]\cdot p_{\RM{x}}(\M{x})}\kern1.4em\,d \M{x} \tag{exp.}\\
&= E[E[g(\RM{x},\R{y})|\RM{x}]] \tag{iterated exp.}
\end{align}
$$
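
The iterated expectation can be checked numerically. The sketch below assumes a toy model that does not come from the lecture: a scalar $\R{x}$ that is standard normal, a binary $\R{y}$ with $\Pr\Set{\R{y}=1|\R{x}=x}$ equal to the sigmoid of $x$, and the hypothetical function $g(x,y)=xy$. It computes the inner sum over $y$ exactly, approximates the outer integral over $x$ with a Riemann sum, and compares the result with a Monte Carlo average over joint samples.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def g(x, y):
    """Hypothetical function of (x, y), chosen only for illustration."""
    return x * y


def cond_exp(x):
    """E[g(x, y) | x]: sum over y in {0, 1} of g(x, y) * p(y | x)."""
    p1 = sigmoid(x)  # assumed Pr{y = 1 | x}
    return g(x, 0) * (1.0 - p1) + g(x, 1) * p1


# Outer expectation: integrate E[g | x] * p_x(x) over a grid (Riemann sum).
grid = np.linspace(-8.0, 8.0, 16001)
dx = grid[1] - grid[0]
pdf = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)  # standard normal pdf
iterated = np.sum(cond_exp(grid) * pdf) * dx

# Direct Monte Carlo estimate of E[g(x, y)] from samples of the joint.
rng = np.random.default_rng(0)
xs = rng.standard_normal(200_000)
ys = (rng.random(xs.size) < sigmoid(xs)).astype(float)
monte_carlo = g(xs, ys).mean()

print("Iterated expectation (numerical):", iterated)
print("Monte Carlo estimate:", monte_carlo)
```

The two numbers should agree up to sampling noise and discretization error, which illustrates that averaging the conditional expectation over $\RM{x}$ gives the overall expectation.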