<a href="https://colab.research.google.com/github/dmtzt/machine-learning-specialization/blob/main/2_multiclass_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiclass Classification
The target $y$ can take on more than two possible values.

## Activation functions
- Linear activation function
- Sigmoid
- ReLU: **Re**ctified **L**inear **U**nit
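
As a quick reference, the three functions can be sketched in NumPy. The function names are chosen here for illustration and are not taken from any particular library.

```python
import numpy as np

def linear(z):
    # Linear activation: the output equals the input.
    return z

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for non-negative inputs.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(linear(z), sigmoid(z), relu(z))
```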

### Choosing an activation function
#### Output layer
| Problem                 | Function |
|-------------------------|----------|
| Binary classification   | Sigmoid  |
| Regression              | Linear   |
| Non-negative regression | ReLU     |

#### Hidden layers
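- ReLU is by far the most common choice.
  - It is cheaper to compute than the sigmoid: just $\max(0, z)$, with no exponential.
  - It is flat only for negative inputs, whereas the sigmoid flattens out on both sides. Fewer flat regions mean fewer near-zero gradients, so gradient descent converges faster.
- Sigmoid was the usual choice in the early history of neural networks. It is now hardly ever used in hidden layers.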

#### Why activation functions?
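- Without non-linear activation functions (that is, with a linear activation in every layer), a neural network cannot compute anything more complex than a linear function of its input.
  - A linear function of a linear function is itself a linear function.
  - The neural network would be equivalent to a linear regression model (or to logistic regression, if only the output layer uses a sigmoid).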
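
The collapse can be checked numerically. The sketch below uses randomly generated, purely illustrative weights for two layers with linear activations and compares them with a single equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with linear activations (shapes are arbitrary illustrations).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Forward pass through both layers.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer: W x + b.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```

Expanding $W_2(W_1\vec{x}+\vec{b}_1)+\vec{b}_2$ gives $(W_2W_1)\vec{x}+(W_2\vec{b}_1+\vec{b}_2)$, which is why the two results agree.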

## Gradient descent
Gradient descent is a way to minimize an objective function $J(\theta)$, parameterized by a model's parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the opposite direction of the gradient of the objective function with respect to the parameters, $\nabla_{\theta}J(\theta)$.

The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
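
A minimal sketch of this update loop, assuming a toy one-dimensional objective $J(\theta) = (\theta - 3)^2$ that was picked here only for illustration (its minimum is at $\theta = 3$):

```python
def grad_J(theta):
    # Gradient of the toy objective J(theta) = (theta - 3)**2.
    return 2.0 * (theta - 3.0)

eta = 0.1     # learning rate (illustrative value)
theta = 0.0   # arbitrary starting point

for _ in range(100):
    # Step in the opposite direction of the gradient.
    theta = theta - eta * grad_J(theta)

print(theta)  # ends up very close to 3
```

With a much larger learning rate (for example `eta = 1.1`) the same loop overshoots the valley and diverges, which is the trade-off the step size controls.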

### Gradient descent variants
There are three variants of gradient descent, which differ in **how much data** we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the *accuracy* of the parameter update and the *time* it takes to perform an update.

#### Batch Gradient Descent
Vanilla gradient descent (batch gradient descent) computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training set:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta)$$

The gradients for the whole dataset must be calculated to perform just one update, so batch gradient descent can be very slow, and it is intractable for datasets that don't fit in memory. Moreover, it doesn't allow us to update the model online (with new examples on the fly).

#### Stochastic Gradient Descent (SGD)
SGD performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$$\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$$

Batch gradient descent performs redundant computation for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD removes this redundancy by performing one update per example. Its frequent updates have a high variance, which causes the objective function to fluctuate heavily.

![](https://ruder.io/content/images/2016/09/sgd_fluctuation.png)

#### Mini-batch gradient descent
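Mini-batch gradient descent sits between the two variants above: it performs one update for every mini-batch of $n$ training examples, which reduces the variance of the updates compared with SGD while avoiding a full pass over the data per update.

The sketch below is a hedged illustration rather than a reference implementation. It assumes a linear model with a mean squared error cost and noise-free toy data, all made up for this example. The hypothetical helper `minibatch_gd` covers all three variants through its `batch_size` argument: the full dataset gives batch gradient descent, `1` gives SGD, and anything in between gives mini-batch gradient descent.

```python
import numpy as np

def gradient(theta, X, y):
    # Gradient of J = mean((X @ theta - y)**2) / 2 with respect to theta.
    return X.T @ (X @ theta - y) / len(y)

def minibatch_gd(X, y, batch_size, eta=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle the data every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # One parameter update per (mini-)batch.
            theta = theta - eta * gradient(theta, X[idx], y[idx])
    return theta

# Toy data (assumption): y = 2*x1 - 1*x2, without noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta

theta_batch = minibatch_gd(X, y, batch_size=len(y))  # batch gradient descent
theta_sgd = minibatch_gd(X, y, batch_size=1)         # stochastic gradient descent
theta_mini = minibatch_gd(X, y, batch_size=16)       # mini-batch gradient descent
print(theta_batch, theta_sgd, theta_mini)
```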


## Softmax Regression
- A generalization of the logistic regression algorithm.
- Used to carry out multiclass classification problems.

What is the probability that $y$ equals each of the $N$ given classes?

$z_j=\vec{w}_j\cdot\vec{x}+b_j,\quad j=1,\ldots,N$

$a_j=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}=P(y=j\mid\vec{x})$

### Loss function: Sparse Categorical Crossentropy
- **Categorical**: $y$ is classified into categories
- **Sparse**: $y$ can only take on one of these categories
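
For a single example whose true class is $j$, this loss is $-\log a_j$. A minimal NumPy sketch of softmax and the loss follows. The function names and toy values are chosen here for illustration, and classes are indexed from 0 in the code rather than from 1 as in the formulas above.

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum does not change the result
    # but avoids overflow in the exponential.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(a, y):
    # a: (m, N) predicted probabilities; y: (m,) integer class labels.
    # Picks the probability assigned to the true class of each example.
    return -np.log(a[np.arange(len(y)), y])

z = np.array([[2.0, 1.0, 0.1],   # toy logits for two examples, three classes
              [0.0, 0.0, 0.0]])
y = np.array([0, 2])             # true class of each example

a = softmax(z)
loss = sparse_categorical_crossentropy(a, y)
print(a, loss)
```

Each row of `a` sums to 1, and the second example (equal logits) has a loss of $\log 3$ because every class receives probability $1/3$.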

## Optimization algorithm: Adam
- **Ada**ptive **M**oment Estimation
- Automatically increases or decreases the learning rate $\alpha$, depending on how gradient descent is proceeding, in order to reach the minimum faster.
- Uses a different learning rate for every parameter of the model.

Intuition:
- If $w_j$ keeps moving in the same direction, increase $\alpha_j$.
- If $w_j$ keeps oscillating, reduce $\alpha_j$.
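
This intuition can be illustrated with a hedged sketch of the standard Adam update for a single parameter. Adam does not store an explicit $\alpha_j$; the effective step size comes from running averages of the gradient and of its square. The defaults `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the commonly used ones, while `alpha=0.01` and the gradient sequences are assumptions made for this example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def run(grads):
    # Apply Adam to one parameter, starting from w = 0.
    w, m, v = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        w, m, v = adam_step(w, g, m, v, t)
    return w

w_steady = run([1.0] * 10)            # gradient keeps the same sign
w_oscillating = run([1.0, -1.0] * 5)  # gradient flips sign every step

print(abs(w_steady), abs(w_oscillating))
```

With a steady gradient the parameter travels almost the full `alpha` per step, whereas with an oscillating gradient the averaged gradient shrinks and the parameter moves far less.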

## Multi-label Classification
Several different labels can be associated with a single input.

- Approach 1: treat each label as a separate machine learning problem and train one model per label.
- Approach 2: train a single neural network with one output per label, using a sigmoid activation function for each unit in the output layer.
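
A minimal sketch of the output layer in Approach 2, assuming three hypothetical labels and made-up pre-activation values. Unlike softmax, each sigmoid output is an independent probability, so the outputs need not sum to 1 and several labels can be predicted at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical labels and output-layer pre-activations for one input.
label_names = ["car", "bus", "pedestrian"]
z = np.array([2.0, -1.0, 0.5])

probs = sigmoid(z)        # one independent probability per label
predicted = probs >= 0.5  # threshold each label separately

print(dict(zip(label_names, predicted.tolist())))
```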