<a href="https://colab.research.google.com/github/cyrus2281/DataStructure_Algorithm/blob/main/MachineLearning/Deep_Learning_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Table of Contents

>[Preprocessing](#scrollTo=xsNekpZKBOEc)

>>[Normalization](#scrollTo=7p33kYVQBRGQ)

>>>[Scale](#scrollTo=E9zLAgPcCDSo)

>>>>[MinMaxScaler](#scrollTo=Ok0T-s4sEQd1)

>>>[Standardize](#scrollTo=PrKtPYkGCDMb)

>>>>[RobustScaler](#scrollTo=4wuCZztWEjZl)

>>>>[StandardScaler](#scrollTo=wgcHnXy9E5Y6)

>>>>[Normalizer](#scrollTo=jcWMNwX9FurO)

>[Neural Network](#scrollTo=ELNsz_33314I)

>>[Activation Functions](#scrollTo=_s8F6y7T46-Q)

>>>[Softmax Function](#scrollTo=jWjrCgx84_mH)

>[Compiling a Model](#scrollTo=EhZfLrKulBpn)

>>[Loss Function](#scrollTo=FbgEgkqUfSsW)

>>>[Regression Loss Functions](#scrollTo=iOYd6N6EfWq_)

>>>>[Mean Square Error Loss (MSE)](#scrollTo=efhGe2hBfZeV)

>>>>[Mean Squared Logarithmic Error Loss](#scrollTo=s0uhvs0dfy00)

>>>>[Mean Absolute Error Loss](#scrollTo=L9OFgTGggPew)

>>>[Binary Classification Loss Functions](#scrollTo=_urvie6nhgbR)

>>>>[Binary Cross-Entropy Loss](#scrollTo=KJcaCabCkRz8)

>>>>[Hinge Loss](#scrollTo=n7sxDb-dmA4v)

>>>>[Squared Hinge Loss](#scrollTo=fu8QrIP4rQO8)

>>>[Multi-Class Classification Loss Functions](#scrollTo=0A9XLcElsZUr)

>>>>[Multi-Class Cross-Entropy Loss](#scrollTo=X61eXdGDx2fP)

>>>>[Sparse Multiclass Cross-Entropy Loss](#scrollTo=bXr8oJ9x0C_Z)

>>>>[Kullback Leibler Divergence Loss](#scrollTo=GNMDVkZQ0sZz)



# Preprocessing

## Normalization

resources:
* https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02

Normalize can be used to mean either `scale` or `standardize` (or even more!). Avoid the term normalize, because it has many definitions and is prone to creating confusion.
Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.

Examples of such algorithm families include:
* linear and logistic regression
* nearest neighbors
* neural networks
* support vector machines with radial basis function (RBF) kernels
* principal components analysis
* linear discriminant analysis

MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn classes for preprocessing data for machine learning.


### Scale
Scale generally means to change the range of the values. The shape of the distribution doesn’t change. Think about how a scale model of a building has the same proportions as the original, just smaller. That’s why we say it is drawn to scale. The range is often set at 0 to 1.

#### MinMaxScaler
For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.

MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data.

Note that MinMaxScaler **doesn’t reduce the importance of outliers**. It’s non-distorting.

The default range for the feature returned by MinMaxScaler is 0 to 1.
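The arithmetic described above can be sketched in plain NumPy. This is only an illustration of the formula, not scikit-learn's implementation; the helper name `min_max_scale` and the handling of constant columns are assumptions made for this example.

```python
import numpy as np


def min_max_scale(X):
    """Rescale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column (range 0) is left at 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# The first column contains an outlier (100); the second does not.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

In the first column the outlier still maps to 1, and the remaining values are squeezed into roughly the bottom 2% of the range, which is what "doesn't reduce the importance of outliers" looks like in practice.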

### Standardize
Standardize generally means changing the values so that the distribution’s standard deviation equals one. Scaling is often implied.

#### RobustScaler
RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (the 75th percentile value minus the 25th percentile value).

Note that RobustScaler does not scale the data into a predetermined interval like MinMaxScaler. It does not meet the strict definition of *scale*.

Use RobustScaler if you want to **reduce the effects of outliers**, relative to MinMaxScaler.
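A minimal NumPy sketch of the median and interquartile-range calculation follows. The helper name `robust_scale` is hypothetical, and the zero-IQR guard is an assumption added for this example.

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    iqr = q75 - q25
    # Assumption: columns with zero IQR are only centered, not rescaled.
    iqr[iqr == 0] = 1.0
    return (X - median) / iqr


# One feature with an outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_robust = robust_scale(X)
print(X_robust.ravel())
```

The median and the quartiles are barely affected by the outlier, so the four typical values keep a sensible spread instead of being compressed toward zero. The outlier itself stays far from the rest.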

#### StandardScaler
StandardScaler is a common default choice for standardizing features.

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation. StandardScaler does not meet the strict definition of *scale*.

StandardScaler results in a distribution with a standard deviation equal to 1. The variance is equal to 1 also, because variance = standard deviation squared. And 1 squared = 1.

StandardScaler makes the mean of the distribution approximately 0.

Deep learning algorithms often call for zero mean and unit variance. Regression-type algorithms also benefit from normally distributed data with small sample sizes.

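Use StandardScaler if you want each feature to have zero mean and unit standard deviation. StandardScaler only shifts and rescales the values, so it does not change the shape of the distribution; if you want more normally distributed data, and are okay with transforming your data, you need a nonlinear transformation instead.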
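The standardization step can be sketched as follows. This is an illustration of the formula only; the helper name `standardize` is hypothetical, and it assumes the population standard deviation (`ddof=0`).

```python
import numpy as np


def standardize(X):
    """Subtract the column mean and divide by the column standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: constant columns are only centered, not rescaled.
    std[std == 0] = 1.0
    return (X - mean) / std


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_std = standardize(X)
print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```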

#### Normalizer
Normalizer works on the rows, not the columns! I find that very unintuitive. It’s easy to miss this information in the docs.

By default, L2 normalization is applied to each observation so that the values in a row have a unit norm. Unit norm with L2 means that if each element were squared and summed, the total would equal 1. Alternatively, L1 (aka taxicab or Manhattan) normalization can be applied instead of L2 normalization.

Normalizer does transform all the features to values between -1 and 1.
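To make the row-wise behavior concrete, here is a small NumPy sketch of L2 and L1 row normalization. The helper name `normalize_rows` and its `norm` argument are assumptions for this example, not the scikit-learn API.

```python
import numpy as np


def normalize_rows(X, norm="l2"):
    """Rescale each row (observation) of X to unit L2 or L1 norm."""
    X = np.asarray(X, dtype=float)
    if norm == "l2":
        norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    elif norm == "l1":
        norms = np.abs(X).sum(axis=1, keepdims=True)
    else:
        raise ValueError("norm must be 'l2' or 'l1'")
    # Assumption: all-zero rows are left unchanged.
    norms[norms == 0] = 1.0
    return X / norms


X = np.array([[3.0, 4.0], [1.0, -1.0]])
X_l2 = normalize_rows(X, norm="l2")
X_l1 = normalize_rows(X, norm="l1")
print(X_l2)  # each row has squared values that sum to 1
print(X_l1)  # each row has absolute values that sum to 1
```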

# Neural Network

## Activation Functions

### Softmax Function

The softmax function, also known as softargmax or normalized exponential function, converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom.

The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, `each component will be in the interval (0,1), and the components will add up to 1`, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.



The standard (unit) softmax function $ \sigma: \mathbb{R}^K \rightarrow (0,1)^K $ is defined when $K \geq 1$ by the formula:


$$
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \text{ for } i = 1,...,K \text{ and } z= (z_1,...,z_K) \in \mathbb{R}^K
$$

In simple words, it applies the standard exponential function to each element $z_{i}$ of the input vector $z$ and normalizes these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector $\sigma (\mathbf {z} )$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector.

For example, the standard softmax of (1,2,8) is approximately (0.001,0.002,0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
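The (1, 2, 8) example can be checked with a few lines of NumPy. This is a sketch of the formula above; subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result, and the helper name `softmax` is just for this example.

```python
import numpy as np


def softmax(z):
    """Standard softmax of a 1-D vector z."""
    z = np.asarray(z, dtype=float)
    # Shifting by the maximum avoids overflow in exp and leaves the output unchanged.
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


probs = softmax([1, 2, 8])
print(np.round(probs, 3))
print(probs.sum())
```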


The softmax function is used in various `multiclass classification methods`.

# Compiling a Model

## Loss Function

resources:
* https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/

### Regression Loss Functions

#### Mean Square Error Loss (MSE)

The Mean Squared Error, or MSE, loss is the default loss to use for regression problems.

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to be evaluated first and only changed if you have a good reason.

Mean squared error is calculated as the average of the squared differences between the predicted and actual values. The result is never negative, regardless of the sign of the predicted and actual values, and a perfect value is 0.0. The squaring means that larger mistakes result in more error than smaller mistakes, meaning that the model is punished for making larger mistakes.

keras: `mean_squared_error`
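As a quick check of the definition, here is a NumPy sketch of the calculation. It illustrates the formula only and is not the Keras implementation; the sample values are made up.

```python
import numpy as np


def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 5.0]
mse_value = mse(y_true, y_pred)
print(mse_value)  # the error of 2.0 contributes 4.0, the error of 0.5 only 0.25
```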

#### Mean Squared Logarithmic Error Loss

There may be regression problems in which the target has a wide spread of values, and when predicting a large value you may not want to punish the model as heavily as mean squared error does.

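Instead, you can first calculate the natural logarithm of each of the predicted and actual values, then calculate the mean squared error between the logs. This is called the Mean Squared Logarithmic Error loss, or MSLE for short. Keras adds 1 to each value before taking the logarithm, so zero values are valid.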

It has the effect of relaxing the punishing effect of large differences in large predicted values.

As a loss measure, it may be more appropriate when the model is predicting unscaled quantities directly.

Keras: `mean_squared_logarithmic_error`
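The relaxing effect on large values can be seen in a small NumPy sketch. It assumes the `log(1 + x)` form of the loss and non-negative inputs; the helper names and sample values are made up for this example.

```python
import numpy as np


def mse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


def msle(y_true, y_pred):
    """Mean squared error between log(1 + actual) and log(1 + predicted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: inputs are >= 0, so log1p is always defined.
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    return np.mean(log_diff ** 2)


# Both predictions are off by a factor of two, at very different magnitudes.
y_true = [10.0, 1000.0]
y_pred = [20.0, 2000.0]
mse_value = mse(y_true, y_pred)
msle_value = msle(y_true, y_pred)
print(mse_value)   # dominated by the error on the large target
print(msle_value)  # both errors contribute a similar amount
```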

#### Mean Absolute Error Loss

On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers, e.g. large or small values far from the mean value.

The Mean Absolute Error, or MAE, loss is an appropriate loss function in this case as it is **more robust to outliers**. It is calculated as the average of the absolute difference between the actual and predicted values.

keras: `mean_absolute_error`
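A NumPy sketch of the calculation, using made-up values with one large error, shows why MAE is less sensitive to outliers than a squared loss. It illustrates the formula only.

```python
import numpy as np


def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 5.0]
mae_value = mae(y_true, y_pred)
print(mae_value)  # the error of 2.0 counts as 2.0, not 4.0 as it would when squared
```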

### Binary Classification Loss Functions

#### Binary Cross-Entropy Loss

Cross-entropy is the default loss function to use for binary classification problems.

It is intended for use with binary classification where the target values are in the set {0, 1}.

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.

Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized and a perfect cross-entropy value is 0.

keras: `binary_crossentropy`

With Keras's default settings, this loss expects predicted probabilities, so the output layer should be a single node with a `sigmoid` activation.
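The score can be sketched in NumPy as follows. This is an illustration of the standard formula rather than the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood for targets in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: probabilities are clipped away from 0 and 1 so the logs stay finite.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


y_true = [1, 0, 1]
y_prob = [0.9, 0.1, 0.8]  # predicted probability of class 1
bce_value = binary_crossentropy(y_true, y_prob)
print(bce_value)
```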

#### Hinge Loss
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.

It is intended for use with binary classification where the target values are in the set **{-1, 1}**.

The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values.

Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.

keras: `hinge`

The output layer of the network must be configured to have a single node with a hyperbolic tangent activation function (keras: `tanh`) capable of outputting a single value in the range [-1, 1].
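A NumPy sketch of the hinge calculation, assuming the usual definition `max(0, 1 - y * prediction)` averaged over examples. The sample values are made up.

```python
import numpy as np


def hinge(y_true, y_pred):
    """Mean hinge loss for targets in {-1, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))


y_true = [1, -1, 1]
y_pred = [0.8, -0.5, -0.3]  # the last prediction has the wrong sign
hinge_value = hinge(y_true, y_pred)
print(hinge_value)  # the wrong-sign example contributes 1.3, the others 0.2 and 0.5
```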

#### Squared Hinge Loss
The hinge loss function has many extensions, often the subject of investigation with SVM models.

A popular extension is the squared hinge loss, which simply calculates the square of the hinge loss score. It has the effect of **smoothing the surface** of the error function and making it numerically easier to work with.

If using a hinge loss does result in better performance on a given binary classification problem, it is likely that a squared hinge loss may be appropriate.

As with using the hinge loss function, the target variable must be modified to have values in the set {-1, 1}.

keras: `squared_hinge`

The output layer must use a single node with a hyperbolic tangent activation function (keras: `tanh`) capable of outputting continuous values in the range [-1, 1].
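The squared variant differs from the plain hinge loss only in squaring each per-example score before averaging. A NumPy sketch under that assumption, with made-up values:

```python
import numpy as np


def squared_hinge(y_true, y_pred):
    """Mean of the squared per-example hinge scores for targets in {-1, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)


y_true = [1, -1, 1]
y_pred = [0.8, -0.5, -0.3]
squared_hinge_value = squared_hinge(y_true, y_pred)
print(squared_hinge_value)  # per-example scores 0.2, 0.5, 1.3 become 0.04, 0.25, 1.69
```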

### Multi-Class Classification Loss Functions

#### Multi-Class Cross-Entropy Loss
Cross-entropy is the default loss function to use for multi-class classification problems.

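In this case, it is intended for use with multi-class classification where the target values are in the set {0, 1, 2, …, n-1}, where each class is assigned a unique integer value. Keras's `categorical_crossentropy` expects these integer labels to be one hot encoded before training.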

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.

Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0.

keras: `categorical_crossentropy`

The function requires that the output layer is configured with n nodes (one for each class) and a `softmax` activation in order to predict the probability for each class.
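For one hot encoded targets, the score reduces to the average negative log of the probability assigned to the correct class. A NumPy sketch of that formula follows; it is not the Keras implementation, and `eps` is an assumed clipping constant.

```python
import numpy as np


def categorical_crossentropy(y_true_onehot, y_prob, eps=1e-7):
    """Average cross-entropy between one hot targets and predicted probabilities."""
    y = np.asarray(y_true_onehot, dtype=float)
    # Assumption: probabilities are clipped so that log(0) never occurs.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return np.mean(-np.sum(y * np.log(p), axis=1))


y_true = [[1, 0, 0], [0, 1, 0]]              # classes 0 and 1, one hot encoded
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # softmax outputs, each row sums to 1
cce_value = categorical_crossentropy(y_true, y_prob)
print(cce_value)
```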

#### Sparse Multiclass Cross-Entropy Loss
A possible cause of frustration when using cross-entropy with classification problems with a **large number of labels** is the one hot encoding process.

For example, predicting words in a vocabulary may have tens or hundreds of thousands of categories, one for each label. This can mean that the target element of each training example may require a one hot encoded vector with tens or hundreds of thousands of zero values, requiring significant memory.

Sparse cross-entropy addresses this by performing the same cross-entropy calculation of error, without requiring that the target variable be one hot encoded prior to training.

keras: `sparse_categorical_crossentropy`

The function requires that the output layer is configured with n nodes (one for each class) and a `softmax` activation in order to predict the probability for each class.
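The sketch below shows why no one hot encoding is needed: the integer label is used directly as an index into the predicted probabilities. It is an illustration in NumPy, not the Keras implementation, and it compares the result against the one hot calculation on made-up values.

```python
import numpy as np


def sparse_categorical_crossentropy(labels, y_prob, eps=1e-7):
    """Average cross-entropy computed directly from integer class labels."""
    labels = np.asarray(labels, dtype=int)
    # Assumption: probabilities are clipped so that log(0) never occurs.
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    # Pick, for each row, the probability predicted for the true class.
    picked = p[np.arange(len(labels)), labels]
    return np.mean(-np.log(picked))


labels = [0, 1]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
sparse_loss = sparse_categorical_crossentropy(labels, y_prob)

# Same calculation with explicit one hot targets, for comparison.
one_hot = np.eye(3)[labels]
dense_loss = np.mean(-np.sum(one_hot * np.log(y_prob), axis=1))
print(sparse_loss, dense_loss)
```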

#### Kullback Leibler Divergence Loss
Kullback Leibler Divergence, or KL Divergence for short, is a measure of how one probability distribution differs from a baseline distribution.

A KL divergence loss of 0 suggests the distributions are identical. In practice, the behavior of KL Divergence is very similar to cross-entropy. It calculates how much information is lost (in terms of bits) if the predicted probability distribution is used to approximate the desired target probability distribution.

As such, the KL divergence loss function is more commonly used **when using models that learn to approximate a more complex function** than simply multi-class classification, such as in the case of an autoencoder used for learning a dense feature representation under a model that must reconstruct the original input. In this case, KL divergence loss would be preferred. Nevertheless, it can be used for multi-class classification, in which case it is functionally equivalent to multi-class cross-entropy.

Keras: `kullback_leibler_divergence` (newer releases also expose it as `kl_divergence`)

The function requires that the output layer is configured with n nodes (one for each class) and a softmax activation in order to predict the probability for each class.
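A NumPy sketch of the divergence, assuming the usual definition `sum(target * log(target / predicted))` and a small clipping constant `eps` to keep the logs finite. It also checks the claim that, for one hot targets, the value matches cross-entropy. The sample distributions are made up.

```python
import numpy as np


def kl_divergence(y_true, y_pred, eps=1e-7):
    """Average KL divergence from predicted to target distributions (natural log)."""
    # Assumption: both distributions are clipped to [eps, 1] to avoid log(0).
    t = np.clip(np.asarray(y_true, dtype=float), eps, 1.0)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return np.mean(np.sum(t * np.log(t / p), axis=1))


target = [[0.5, 0.3, 0.2]]
kl_same = kl_divergence(target, target)              # identical distributions
kl_diff = kl_divergence(target, [[0.4, 0.4, 0.2]])   # slightly different prediction

# With a one hot target, KL divergence is (almost exactly) the cross-entropy.
kl_onehot = kl_divergence([[1, 0, 0]], [[0.7, 0.2, 0.1]])
ce_onehot = -np.log(0.7)
print(kl_same, kl_diff, kl_onehot, ce_onehot)
```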