# Basics

## Training set

We have a set consisting of $m$ training examples:

$\qquad X = \begin{pmatrix}
  x_1^{(1)} & x_2^{(1)} & ... & x_n^{(1)} \\
  x_1^{(2)} & x_2^{(2)} & ... & x_n^{(2)} \\
  ... \\
  x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\
\end{pmatrix}_{m \times n}$

and training labels (also known as target values):

$\qquad Y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ ... \\ y^{(m)} \end{pmatrix}_{m \times 1}$

Each training example $x^{(i)}$ in $X$ consists of $n$ features:

$\qquad x^{(i)} = \begin{pmatrix} x_1^{(i)} & x_2^{(i)} & ... & x_n^{(i)} \end{pmatrix}_{1 \times n}$

and has a corresponding training label $y^{(i)}$ in $Y$.

## Linear Regression

The linear regression function is:

$\qquad z(x) = b + w_1 x_1 + w_2 x_2 + ... + w_n x_n$

where $w$ is a vector of coefficients (weights):

$\qquad w = \begin{pmatrix} w_1 & w_2 & ... & w_n \end{pmatrix}_{1 \times n}$

and $b$ is the bias.

Linear regression can be expressed in vector form as:

$\qquad z(x) = x \cdot w^T + b$
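As a minimal sketch of the vector form above, the dot product can be written in plain Python. The function name `linear` and the sample values are illustrative assumptions, not part of the original notes.

```python
def linear(x, w, b):
    """Return z(x) = x . w^T + b for one example x with n features."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of elements")
    return sum(x_j * w_j for x_j, w_j in zip(x, w)) + b


# Hypothetical example with n = 3 features.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.25
z = linear(x, w, b)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```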

## Linear Prediction Model

Linear prediction function is:

$\qquad \hat{y}(x) = z(x) = x \cdot w^T + b$

The performance of a linear prediction model with weights $w$ and bias $b$ for a given example $x$ and label $y$ is measured using the squared error loss function:

$\qquad l(x, y) = (\hat{y}(x) - y)^2$

$\qquad l(x, y) = ((x \cdot w^T + b) - y)^2$

Average loss across the entire training set $X$, $Y$ &mdash; also known as the Mean Squared Error (MSE) &mdash; is given by the cost function:

$\qquad L = \dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{l(x^{(i)}, y^{(i)})}$

$\qquad L = \dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{(\hat{y}(x^{(i)}) - y^{(i)})^2}$

$\qquad L = \dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{((x^{(i)} \cdot w^T + b) - y^{(i)})^2}$

The cost function can be expressed in matrix form as:

$\qquad L = \dfrac{1}{m} \|(X \cdot w^T + \textbf{1}_{m \times 1} \cdot b) - Y\|^2$

where $\textbf{1}_{m \times 1}$ is a column vector of ones and $\|\ \|$ denotes the Euclidean ($L^2$) norm.

The goal is to find the parameters $w$ and $b$ that minimize the cost function $L$ for the given training set $X$, $Y$.
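The MSE cost can be sketched directly from the summation form. This is an illustrative example only: the helper names `predict` and `mse` and the tiny data set are assumptions made for the sketch.

```python
def predict(x, w, b):
    """Linear prediction y_hat(x) = x . w^T + b."""
    return sum(x_j * w_j for x_j, w_j in zip(x, w)) + b


def mse(X, Y, w, b):
    """Mean squared error over m examples (rows of X) and labels Y."""
    m = len(X)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(X, Y)) / m


# Hypothetical training set with m = 3 examples and n = 2 features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
Y = [5.0, 4.0, 8.0]
w = [2.0, 1.0]
b = 0.0

# Predictions are 4, 4 and 7, so the squared errors are 1, 0 and 1.
cost = mse(X, Y, w, b)  # (1 + 0 + 1) / 3
```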

## Linear Classification Model (Perceptron)

Linear classification function is:

$\qquad \hat{y}(x) = a(z)$

$\qquad z(x) = x \cdot w^T + b$

where $a$ is an activation function and $z$ is the linear regression function.

### Multiclass Perceptron

Given $k$ classes ($k > 2$), multiclass classification function becomes:

$\qquad \hat{y}(x) = a(Z)$

$Z$ is a vector of $k$ linear regression functions &mdash; one for each class &mdash; and is expressed as:

$\qquad Z(x) = \begin{pmatrix} z_1(x) & z_2(x) & ... & z_k(x) \end{pmatrix}_{1 \times k}$

$\qquad z_j(x) = x \cdot w_j^T + b_j$

where $w_j$ and $b_j$ are the weights and bias for a given function $z_j$.

Vectors $w_j$ and values $b_j$ can be combined into weights matrix $W$ and biases vector $b$:

$\qquad W = \begin{pmatrix} w_1^T & w_2^T & ... & w_k^T \end{pmatrix} = \begin{pmatrix}
  w_{1,1} & w_{1,2} & ... & w_{1,k} \\
  w_{2,1} & w_{2,2} & ... & w_{2,k} \\
  ... \\
  w_{n,1} & w_{n,2} & ... & w_{n,k} \\
\end{pmatrix}_{n \times k}$

$\qquad b = \begin{pmatrix} b_1 & b_2 & ... & b_k \end{pmatrix}_{1 \times k}$

Vector $Z$ can then be expressed in matrix form as:

$\qquad Z(x) = x \cdot W + b$
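A minimal sketch of $Z(x) = x \cdot W + b$ in plain Python, assuming $W$ is stored as a list of $n$ rows with $k$ columns each. The function name `linear_multiclass` and the numbers are hypothetical.

```python
def linear_multiclass(x, W, b):
    """Return Z(x) = x . W + b, where W is n x k and b has k elements."""
    n, k = len(W), len(b)
    if len(x) != n:
        raise ValueError("x must have one element per row of W")
    return [sum(x[i] * W[i][j] for i in range(n)) + b[j] for j in range(k)]


# Hypothetical example with n = 2 features and k = 3 classes.
x = [1.0, 2.0]
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 1.0]]
b = [0.0, 1.0, -1.0]
Z = linear_multiclass(x, W, b)  # [2.0, 5.0, 0.0]
```

Column $j$ of `W` holds the weights $w_j$ of the function $z_j$, which matches the layout of $W$ given above.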

A perceptron can be broken down into a so-called $input$ layer and an $output$ layer:

<img src="perceptron.png" width="50%">

## Binary Classification Model

Training labels $y$ can take on one of two values:

$\qquad y = \begin{cases}
  0 & negative\ class \\
  1 & positive\ class
\end{cases}$

Binary classification model uses sigmoid (also known as logistic) activation function:

$\qquad \hat{y}(x) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

$\qquad z(x) = x \cdot w^T + b$

where the value of $\hat{y}$ represents the probability (or confidence) that the example belongs to the $positive\ class$.

Conversely, the value of $(1 - \hat{y}(x))$ is the probability that the example belongs to the $negative\ class$.

The performance of a binary classification model with weights $w$ and bias $b$ for a given example $x$ and label $y$ is measured using the binary cross-entropy loss function:

$\qquad l(x, y) = -(y \times \ln \hat{y} + (1 - y) \times \ln (1 - \hat{y}))$

Binary cross-entropy function is a special case of the categorical cross-entropy function described below.

Cost function for the training set $X$, $Y$ is the average of all losses:

$\qquad L =  \dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{l(x^{(i)}, y^{(i)})}$

$\qquad L = -\dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{(y^{(i)} \times \ln \hat{y}^{(i)} + (1 - y^{(i)}) \times \ln (1 - \hat{y}^{(i)}))}$

Cost function can be expressed in matrix form as:

$\qquad Z = X \cdot w^T + \textbf{1}_{m \times 1} \cdot b$

$\qquad \hat{Y} = \sigma(Z) = \dfrac{1}{1 + e^{\circ(-Z)}}$

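$\qquad L = \dfrac{1}{m} \times |\ Y \circ \ln \hat{Y} + (1 - Y) \circ \ln (1 - \hat{Y})\ |$

where $\textbf{1}_{m \times 1}$ is a column vector of ones, $\circ$ indicates Hadamard (element-wise) exponential and multiplication, $\ln$ is applied element-wise, and $|\ \ |$ denotes the $L^1$ norm. Every element inside the norm is non-positive, so the $L^1$ norm (the sum of absolute values) already equals the negated sum and no leading minus sign is needed.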
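The binary cross-entropy cost can be sketched from the summation form as follows. The helper names and the toy data are assumptions for illustration. The clipping constant `eps` is also an addition that is not in the formulas above: it only guards against taking $\ln 0$.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)); very negative z is not handled."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example; eps clipping is an assumption to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def bce_cost(X, Y, w, b):
    """Average binary cross-entropy over the training set."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(x_j * w_j for x_j, w_j in zip(x, w)) + b
        total += binary_cross_entropy(y, sigmoid(z))
    return total / m


# With all-zero parameters every prediction is 0.5, so the cost is ln 2.
X = [[0.0], [2.0]]
Y = [0, 1]
cost = bce_cost(X, Y, w=[0.0], b=0.0)
```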

## Multiclass Classification Model

Given $k$ classes, training labels $y$ can take on any of the values:

$\qquad y \in \{ 1, 2, ..., k \}$

Training labels are transformed into vectors of size $k$ using "one-hot encoding" technique:

$\qquad \dot{y} = \begin{pmatrix} [y = 1] & [y = 2] & ... & [y = k] \end{pmatrix}_{1 \times k}$

where $[\ ]$ denotes Iverson bracket.
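One-hot encoding with the Iverson bracket maps directly to a comparison per class. A minimal sketch, assuming labels are integers in $1..k$ as defined above (the function name `one_hot` is hypothetical):

```python
def one_hot(y, k):
    """Return the 1 x k one-hot vector for a label y in {1, ..., k}."""
    if not 1 <= y <= k:
        raise ValueError("label must be between 1 and k")
    # [y = j] is 1 when the label equals class j and 0 otherwise.
    return [1 if y == j else 0 for j in range(1, k + 1)]


encoded = one_hot(3, 4)  # [0, 0, 1, 0]
```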

Matrix $\dot{Y}$ is defined as the stack of all "one-hot encoded" labels, one per row:

$\qquad \dot{Y} = \begin{pmatrix} \dot{y}^{(1)} \\ \dot{y}^{(2)} \\ ... \\ \dot{y}^{(m)} \end{pmatrix} =
\begin{pmatrix}
  \dot{y}_1^{(1)} & \dot{y}_2^{(1)} & ... & \dot{y}_k^{(1)} \\
  \dot{y}_1^{(2)} & \dot{y}_2^{(2)} & ... & \dot{y}_k^{(2)} \\
  ... \\
  \dot{y}_1^{(m)} & \dot{y}_2^{(m)} & ... & \dot{y}_k^{(m)} \\
\end{pmatrix}_{m \times k}$

Multiclass classification model uses $softmax$ activation function:

$\qquad \hat{y}(x) = softmax(Z) = \begin{pmatrix} p_1 & p_2 & ... & p_k \end{pmatrix}_{1 \times k}$

$\qquad Z(x) = x \cdot W + b$

$\qquad p_i = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$

where $p_i$ represents probability of the prediction belonging to class $i$.

Function $softmax$ has a property where: $\sum_{i=1}^k{p_i} = 1$
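A short sketch of $softmax$ for a single vector $Z$. Subtracting the maximum before exponentiating is an assumption added here for numerical stability. It does not change the result, because the common factor cancels in the ratio.

```python
import math


def softmax(z):
    """Return the probabilities p_i = e^(z_i) / sum_j e^(z_j)."""
    shift = max(z)  # stability shift (an addition, not in the formula above)
    exps = [math.exp(z_j - shift) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


p = softmax([2.0, 5.0, 0.0])
# The outputs are positive and sum to 1; the largest z gets the largest p.
```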

The performance of a multiclass classification model with weights $W$ and biases $b$ for a given example $x$ and "one-hot encoded" label $\dot{y}$ is measured using the categorical cross-entropy loss function:

$\qquad l(x, y) = -\displaystyle\sum_{j=1}^k{(\dot{y}_j \times \ln \hat{y}_j)}$

Cost function for the training set $X$, $Y$ is the average of all losses:

$\qquad L =  \dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{l(x^{(i)}, y^{(i)})}$

$\qquad L = -\dfrac{1}{m} \displaystyle\sum_{i=1}^{m}{\displaystyle\sum_{j=1}^k{(\dot{y}_j^{(i)} \times \ln \hat{y}_j^{(i)})}}$

Cost function can be expressed in matrix form as:

$\qquad Z = X \cdot W + \textbf{1}_{m \times 1} \cdot b$

$\qquad \hat{Y} = softmax(Z) = e^{\circ Z} \oslash (e^{\circ Z} \cdot \textbf{1}_{k \times k})$

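$\qquad L = \dfrac{1}{m} \times |\ \dot{Y} \circ \ln \hat{Y} \ |$

where $\textbf{1}_{m \times 1}$ and $\textbf{1}_{k \times k}$ are "matrices of ones", $\circ$ indicates Hadamard (element-wise) exponential and multiplication, $\oslash$ is Hadamard (element-wise) division, $\ln$ is applied element-wise, and $|\ \ |$ denotes the entry-wise $L^1$ norm. Every element of $\dot{Y} \circ \ln \hat{Y}$ is non-positive, so the norm (the sum of absolute values) already equals the negated double sum and no leading minus sign is needed.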
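The categorical cross-entropy cost can be sketched from the double sum. The function names and the toy matrices are illustrative assumptions, and the `eps` floor is an addition that only guards against $\ln 0$.

```python
import math


def categorical_cross_entropy(y_dot, y_hat, eps=1e-12):
    """Loss for one example: -sum_j y_dot_j * ln(y_hat_j)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_dot, y_hat))


def cce_cost(Y_dot, Y_hat):
    """Average loss over m one-hot rows and m predicted probability rows."""
    m = len(Y_dot)
    return sum(categorical_cross_entropy(t, p)
               for t, p in zip(Y_dot, Y_hat)) / m


# Hypothetical example with m = 2 examples and k = 3 classes.
Y_dot = [[1, 0, 0], [0, 1, 0]]
Y_hat = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]

# Each true class is predicted with probability 0.5, so the cost is ln 2.
cost = cce_cost(Y_dot, Y_hat)
```

Only the predicted probability of the true class contributes to each loss, because all other entries of the one-hot label are zero.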

## Activation Functions

Sigmoid (Logistic) activation function:

$\qquad \sigma(x) = \dfrac{1}{1 + e^{-x}}$

Hyperbolic tangent activation function:

$\qquad tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$

ReLU activation function:

$\qquad relu(x) = \begin{cases}
  0 & for\ x < 0 \\
  x & for\ x \geqslant 0
\end{cases}$

$\qquad relu(x) = max(0, x)$
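The three activation functions above translate almost literally into Python. This is a sketch for scalar inputs only. The direct formulas can overflow for inputs of very large magnitude, and the standard library's `math.tanh` avoids that for the hyperbolic tangent.

```python
import math


def sigmoid(x):
    """Sigmoid (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent written out from its definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Rectified linear unit: 0 for negative x, x otherwise."""
    return max(0.0, x)
```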

## Multilayer Perceptron (MLP)

Multilayer Perceptron contains one or more $hidden$ layers between the $input$ layer and the $output$ layer &mdash; each with a set of linear functions and activation functions:

<img src="mlp.png" width="35%">

NB: MLP may also have a multiclass output layer.
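As a rough sketch, a forward pass through an MLP chains linear functions and activations layer by layer. The architecture below is an assumption chosen for illustration, not one given in these notes: a single hidden layer with ReLU and one sigmoid output unit. All names and weights are hypothetical.

```python
import math


def dense(x, W, b):
    """Linear layer x . W + b, where W is (inputs x outputs)."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) + b[j]
            for j in range(len(b))]


def mlp_forward(x, W1, b1, W2, b2):
    """Hidden layer with ReLU, then a single sigmoid output unit."""
    hidden = [max(0.0, z) for z in dense(x, W1, b1)]
    z_out = dense(hidden, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-z_out))


# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
x = [1.0, 2.0]
W1 = [[1.0, -1.0],
      [0.5, 0.0]]
b1 = [0.0, 0.0]
W2 = [[1.0],
      [3.0]]
b2 = [-2.0]

# Hidden pre-activations are [2, -1], ReLU gives [2, 0], so z_out = 0.
y_hat = mlp_forward(x, W1, b1, W2, b2)  # sigmoid(0) = 0.5
```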