View in Google Colab for the best experience.

# Table of Contents

>[Table of Contents](#scrollTo=j5-omnpm4t_3)

>[Preprocessing](#scrollTo=xsNekpZKBOEc)

>>[Normalization](#scrollTo=7p33kYVQBRGQ)

>>>[Scale](#scrollTo=E9zLAgPcCDSo)

>>>>[MinMaxScaler](#scrollTo=Ok0T-s4sEQd1)

>>>[Standardize](#scrollTo=PrKtPYkGCDMb)

>>>>[RobustScaler](#scrollTo=4wuCZztWEjZl)

>>>>[StandardScaler](#scrollTo=wgcHnXy9E5Y6)

>>>>[Normalizer](#scrollTo=jcWMNwX9FurO)

>>[Data Augmentation](#scrollTo=2hBKNVqjAAKM)

>>[Batches](#scrollTo=0t8i2Fky1Rv2)

>[Neural Network](#scrollTo=ELNsz_33314I)

>>[Fundamentals](#scrollTo=p4mDMzyMhmqY)

>>>[Tensor](#scrollTo=4sx5wS7khqcZ)

>>>[Neuron](#scrollTo=8ycTqUltiD_N)

>>>[Layer](#scrollTo=ORwIvoWwi-Cv)

>>>[Weights & Biases](#scrollTo=rYPkK9IXjGUY)

>>>[Mixed Precision](#scrollTo=SK7o1zAQc1NQ)

>>[Layers](#scrollTo=GRhEhh7oSlbR)

>>>[Input Layer](#scrollTo=219kZKYYmIkh)

>>>[Fully Connected (Dense) Layer](#scrollTo=CM8xMdybSnXm)

>>>[Convolution Layer](#scrollTo=MGf4ou3GSxcS)

>>>>[Padding](#scrollTo=uL1UPiq1rPUE)

>>>>>[Same Padding](#scrollTo=9ecJwax3Uy0y)

>>>>>[Valid Padding](#scrollTo=9TPAjFUBV5eo)

>>>>[Kernel size](#scrollTo=R8WFbXzfrTHM)

>>>>[Stride](#scrollTo=asP9JLplrr4r)

>>>[Pooling Layer](#scrollTo=BFF9DpWVXsPy)

>>>>[Max Pooling](#scrollTo=i2wnx3wkYE3E)

>>>>[Average Pooling](#scrollTo=7XJBKaoUYOJR)

>>>[Flatten Layer](#scrollTo=eUGmGyoOStMX)

>>[Transfer learning](#scrollTo=hNaoDxVHDteM)

>>>[Feature Extraction](#scrollTo=Yk71ZT8dDwCP)

>>>[Fine Tuning](#scrollTo=cRlvsjz8D0b1)

>>[Activation Functions](#scrollTo=_s8F6y7T46-Q)

>>>[Sigmoid](#scrollTo=_shE9oTDtXk9)

>>>[Softmax](#scrollTo=jWjrCgx84_mH)

>>>[ReLU](#scrollTo=xAv8zU3RsSyN)

>>[Algorithms](#scrollTo=kf17bf2Rjpkl)

>>>[Convolutional Neural Network (CNN)](#scrollTo=jWAQaROejuRB)

>>>[Natural Language Processing (NLP)](#scrollTo=VEQpNB0aOkpC)

>>>>[Syntax Analysis](#scrollTo=tC-Pb7RWQmW-)

>>>>[Semantic Analysis](#scrollTo=cOH6hL_kQmkJ)

>>>>[Tokenization](#scrollTo=zaID8rlJX8HX)

>>>>[Embeddings](#scrollTo=_eAb7clXYEA_)

>>>[Recurrent Neural Networks (RNN)](#scrollTo=RxqEMJsgxQyH)

>>>[Transformers](#scrollTo=QtIA3pOIKb7n)

>[Compiling a Model](#scrollTo=EhZfLrKulBpn)

>>[Optimization Function](#scrollTo=FbgEgkqUfSsW)

>>>[Stochastic Gradient Descent (SDG)](#scrollTo=NdKtMngE6aiR)

>>[Loss Function](#scrollTo=tgylCsjI6UDs)

>>>[Empirical Loss](#scrollTo=MsZj9cIYY4dJ)

>>>[Regression Loss Functions](#scrollTo=iOYd6N6EfWq_)

>>>>[Mean Square Error Loss (MSE)](#scrollTo=efhGe2hBfZeV)

>>>>[Mean Squared Logarithmic Error Loss](#scrollTo=s0uhvs0dfy00)

>>>>[Mean Absolute Error Loss](#scrollTo=L9OFgTGggPew)

>>>[Binary Classification Loss Functions](#scrollTo=_urvie6nhgbR)

>>>>[Binary Cross-Entropy Loss](#scrollTo=KJcaCabCkRz8)

>>>>[Hinge Loss](#scrollTo=n7sxDb-dmA4v)

>>>>[Squared Hinge Loss](#scrollTo=fu8QrIP4rQO8)

>>>[Multi-Class Classification Loss Functions](#scrollTo=0A9XLcElsZUr)

>>>>[Multi-Class Cross-Entropy Loss](#scrollTo=X61eXdGDx2fP)

>>>>[Sparse Multiclass Cross-Entropy Loss](#scrollTo=bXr8oJ9x0C_Z)

>>>>[Kullback Leibler Divergence Loss](#scrollTo=GNMDVkZQ0sZz)

>>[Metrics](#scrollTo=1yisTFEg5veU)

>>>[$R^2$](#scrollTo=4hHt0uXD5xji)



# ***Preprocessing***

## Normalization


---



Resources:
* [Scale, Standardize, or Normalize with Scikit-Learn](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02)

Normalize can be used to mean either `scale` or `standardize` (or even more!). Avoid the term normalize, because it has many definitions and is prone to creating confusion.
Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed.

Examples of such algorithm families include:
* linear and logistic regression
* nearest neighbors
* neural networks
* support vector machines with radial basis function kernels
* principal components analysis
* linear discriminant analysis

MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn classes for preprocessing data for machine learning.


### **Scale**
Scale generally means to change the range of the values. The shape of the distribution doesn’t change. Think about how a scale model of a building has the same proportions as the original, just smaller. That’s why we say it is drawn to scale. The range is often set at 0 to 1.

#### *MinMaxScaler*
For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.

MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data.

Note that MinMaxScaler **doesn’t reduce the importance of outliers**. It’s non-distorting.

The default range for the feature returned by MinMaxScaler is 0 to 1.
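
The computation described above can be sketched in a few lines of NumPy. This is an illustration of the arithmetic only, not scikit-learn's implementation; the function name `min_max_scale` and the sample data are made up for the example.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Scale each column (feature) of X into feature_range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant column: avoid division by zero
    low, high = feature_range
    return low + (X - col_min) / col_range * (high - low)

# The first column contains an outlier (100.0).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
X_scaled = min_max_scale(X)
```

In the first column the outlier still maps to 1 while the other values are squeezed close to 0, which is what "doesn't reduce the importance of outliers" means in practice.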

### **Standardize**
Standardize generally means changing the values so that the distribution’s standard deviation equals one. Scaling is often implied.

#### *RobustScaler*
RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (75th percentile value minus 25th percentile value).

Note that RobustScaler does not scale the data into a predetermined interval like MinMaxScaler. It does not meet the strict definition of *scale*.

Use RobustScaler if you want to **reduce the effects of outliers**, relative to MinMaxScaler.
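
As a rough sketch of the median and interquartile-range arithmetic (assuming the 25th and 75th percentiles described above; `robust_scale` is a hypothetical helper, not the scikit-learn class):

```python
import numpy as np

def robust_scale(X):
    """Subtract the column median and divide by the column interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    iqr = q75 - q25
    iqr[iqr == 0] = 1.0  # avoid division by zero for constant columns
    return (X - median) / iqr

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
x_robust = robust_scale(x)
```

The four ordinary values end up between -1 and 0.5, while the outlier stays far away at 48.5 instead of compressing the rest of the data.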

#### *StandardScaler*
StandardScaler is a common default choice in practice.

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation. StandardScaler does not meet the strict definition of *scale*.

StandardScaler results in a distribution with a standard deviation equal to 1. The variance is equal to 1 also, because variance = standard deviation squared. And 1 squared = 1.

StandardScaler makes the mean of the distribution approximately 0.

Deep learning algorithms often call for zero mean and unit variance. Regression-type algorithms also benefit from normally distributed data with small sample sizes.

Use StandardScaler if you want each feature to have zero mean and unit standard deviation. Note that StandardScaler does not change the shape of the distribution; if you want more normally distributed data, you need a transformation that is allowed to change it.
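
A minimal NumPy sketch of standardization, assuming the population standard deviation (`ddof=0`) is used; the helper name `standardize` is illustrative only:

```python
import numpy as np

def standardize(X):
    """Subtract the column mean and divide by the column standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

# Two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_std = standardize(X)
```

After the transformation both columns have a mean of approximately 0 and a standard deviation of 1, regardless of their original units.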

#### *Normalizer*
Normalizer works on the rows, not the columns! I find that very unintuitive. It’s easy to miss this information in the docs.

By default, L2 normalization is applied to each observation so that the values in a row have a unit norm. Unit norm with L2 means that if each element were squared and summed, the total would equal 1. Alternatively, L1 (aka taxicab or Manhattan) normalization can be applied instead of L2 normalization.

Normalizer does transform all the features to values between -1 and 1.
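
To make the row-wise behaviour concrete, here is a small NumPy sketch; `normalize_rows` is a hypothetical helper that mimics the idea, not the scikit-learn class itself:

```python
import numpy as np

def normalize_rows(X, norm="l2"):
    """Rescale every row (observation) of X to unit L2 or L1 norm."""
    X = np.asarray(X, dtype=float)
    if norm == "l2":
        norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    elif norm == "l1":
        norms = np.abs(X).sum(axis=1, keepdims=True)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms

X = np.array([[3.0, 4.0], [1.0, -1.0]])
X_l2 = normalize_rows(X)             # first row becomes [0.6, 0.8]
X_l1 = normalize_rows(X, norm="l1")  # second row becomes [0.5, -0.5]
```

Note that the sums run over `axis=1` (within a row), whereas the scalers above compute their statistics over `axis=0` (within a column).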

## **Data Augmentation**

Data augmentation is a technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.

When you don't have a large image dataset, it's a good practice to artificially introduce sample diversity by applying random, yet realistic, transformations to the training images, such as rotation and horizontal flipping. This helps expose the model to different aspects of the training data and reduce overfitting.

## **Batches**

A batch is a small subset of the dataset that the model processes in a single training step.

Reasons to use batches:
* The full dataset might not fit into the memory of the processor (or GPU).
* The model might not learn well if it is trained on a very large dataset all at once.
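
As an illustration, a dataset can be split into shuffled batches with a simple generator. The function name `iterate_batches` and the toy arrays are assumptions made for this sketch:

```python
import numpy as np

def iterate_batches(X, y, batch_size, seed=0):
    """Yield shuffled (features, labels) batches that together cover the dataset once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X = np.arange(20).reshape(10, 2)  # 10 samples with 2 features each
y = np.arange(10)                 # 10 labels
batch_sizes = [len(xb) for xb, _ in iterate_batches(X, y, batch_size=4)]
```

With 10 samples and a batch size of 4, the last batch is smaller than the others (4, 4, 2).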

# ***Neural Network***


## Fundamentals 


---



### **Tensor**

A tensor can be thought of as an n-dimensional matrix. In the CNN, tensors will be 3-dimensional with the exception of the output layer.

### **Neuron**

A neuron, also known as a **perceptron**, can be thought of as a function that takes in multiple inputs and yields a single output. 

$$
y = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)
$$
* $y$: Output
* $g$: Non-linear activation function
* $w_0$: Bias
* $\sum_{i=1}^{m} x_i w_i$: Linear combination of the inputs
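
The formula above translates almost directly into code. In this sketch the activation `g` defaults to `tanh`, which is just one possible choice, and the input values are arbitrary:

```python
import numpy as np

def neuron(x, w, w0, g=np.tanh):
    """Compute y = g(w0 + sum_i x_i * w_i) for a single neuron."""
    return g(w0 + np.dot(x, w))

x = np.array([1.0, 2.0, 3.0])     # inputs
w = np.array([0.5, -0.25, 0.1])   # weights
w0 = 0.2                          # bias
y = neuron(x, w, w0)              # tanh(0.2 + 0.5 - 0.5 + 0.3) = tanh(0.5)
```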


### **Layer**

A layer is simply a collection of neurons with the same operation, including the same hyperparameters.

### **Weights & Biases**

Kernel weights and biases, while unique to each neuron, are tuned during the training phase, and allow the classifier to adapt to the problem and dataset provided.

### **Mixed Precision**


Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model will have a lower step time and train equally well in terms of evaluation metrics such as accuracy.

Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each of which takes 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations and 16-bit dtypes can be read from memory faster.
However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality. 
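
The trade-off can be seen with plain NumPy dtypes. This sketch only illustrates memory use and rounding; it does not involve any accelerator or framework mixed-precision API, and the update size `1e-4` is an arbitrary example value:

```python
import numpy as np

# Memory: float16 needs half the bytes of float32.
weights32 = np.ones(1024, dtype=np.float32)
weights16 = weights32.astype(np.float16)

# Precision: a small update to a weight of 1.0 is lost in float16
# because the spacing between float16 values near 1.0 is about 0.001.
small_update = 1e-4
updated16 = np.float16(1.0) + np.float16(small_update)
updated32 = np.float32(1.0) + np.float32(small_update)
```

Here `updated16` is still exactly 1.0 while `updated32` has moved, which is one reason variables are usually kept in float32.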

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit from mixed precision, although memory and bandwidth savings can enable some speedups. On CPUs, mixed precision will run significantly slower.


If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the training elements per second your model can run on.


Modern NVIDIA GPUs use special hardware units called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain tensor dimensions to be a multiple of 8.

## Layers


---


Resources:
* [CNN Explainer](https://poloclub.github.io/cnn-explainer/)
* [A Comprehensive Guide to Convolutional Neural Networks](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

### **Input Layer**

***In a ConvNet***, the input layer represents the input image fed into the CNN. Because we use RGB images as input, the input layer has three channels, corresponding to the red, green, and blue channels.

### **Fully Connected (Dense) Layer**

***In a ConvNet***, adding a Fully-Connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features as represented by the output of the convolutional layer. The Fully-Connected layer is learning a possibly non-linear function in that space.

### **Convolution Layer**

The convolutional (Kernel) layers  are the foundation of CNN, as they contain the learned kernels (weights), which extract features that distinguish different images from one another.

![convolving layer](https://miro.medium.com/v2/resize:fit:786/1*GcI7G-JLAQiEoCON7xFbhg.gif)

In the above demonstration, the green section represents our 5x5x1 input image. The element that carries out the convolution operation in the first part of a Convolutional Layer is called the **Kernel/Filter**, K, shown in yellow. We have selected K as a 3x3x1 matrix.

The kernel visits 9 positions because the stride length is 1 (non-strided): a stride of 1 means that the kernel is shifted by 1 pixel per dot product, so a 3x3 kernel fits a 5x5 input at 3 x 3 = 9 positions. At every position, the convolutional neuron performs an elementwise dot product between a unique kernel and the output of the previous layer’s corresponding neuron. This yields as many intermediate results as there are unique kernels. The output of the convolutional neuron is the sum of all of the intermediate results plus the learned bias.
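
The sliding-window operation can be sketched for a single channel as follows. Like most deep learning frameworks, this sketch does not flip the kernel (strictly speaking it is a cross-correlation); the input values and the vertical-edge kernel are made up for the example, and no bias is added:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image without padding and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])             # 3x3 kernel
feature_map = conv2d_valid(image, kernel)         # 3x3 output, i.e. 9 positions
```

Because the input increases by 1 per column, every window produces the same response of -6.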


***The objective of the Convolution Operation is to extract features, such as edges, from the input image.***

Conventionally, the first ConvLayer is responsible for capturing the Low-Level features such as edges, color, gradient orientation, etc. With added layers, the architecture adapts to the High-Level features as well, giving us a network that has a wholesome understanding of images in the dataset.

**Hyperparameters:**

#### *Padding*
Padding is often necessary when the kernel extends beyond the activation map. Padding conserves data at the borders of activation maps, which leads to better performance, and it can help preserve the input's spatial size, which allows an architecture designer to build deeper, higher performing networks. There exist many padding techniques, but the most commonly used approach is zero-padding because of its performance, simplicity, and computational efficiency. The technique involves adding zeros symmetrically around the edges of an input. This approach is adopted by many high-performing CNNs such as AlexNet.

There are two types of results to the operation — one in which the convolved feature is reduced in dimensionality as compared to the input, and the other in which the dimensionality is either increased or remains the same. This is done by applying ***Valid Padding*** in the case of the former, or ***Same Padding*** in the case of the latter.

##### Same Padding

![same padding: 5x5x1 image is padded with 0s to create a 7x7x1 image](https://miro.medium.com/v2/resize:fit:640/1*nYf_cUIHFEWU1JXGwnz-Ig.gif)

When we pad the 5x5x1 image with one ring of zeros into a 7x7x1 image and then apply the 3x3x1 kernel over it with a stride of 1, the convolved matrix has dimensions 5x5x1 (7 - 3 + 1 = 5), the same as the input. Hence the name: Same Padding.

```
"SAME" = with zero padding:

               pad|                                      |pad
   inputs:      0 |1  2  3  4  5  6  7  8  9  10 11 12 13|0  0
               |________________|
                              |_________________|
                                             |________________|

```

"SAME" tries to pad evenly left and right, but if the amount of columns to be added is odd, it will add the extra column to the right, as is the case in this example (the same logic applies vertically: there may be an extra row of zeros at the bottom).

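With a stride of 1, this keeps the input tensor's spatial shape; with a larger stride, the output has ceil(input size / stride) positions along each dimension.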

##### Valid Padding


```
"VALID" = without padding:

   inputs:         1  2  3  4  5  6  7  8  9  10 11 (12 13)
                  |________________|                dropped
                                 |_________________|
```
"VALID" only ever drops the right-most columns (or bottom-most rows).

This reduces the spatial size of the output relative to the input tensor's shape.
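
The two diagrams above (input width 13, kernel width 6, stride 5) can be reproduced with a little arithmetic. The sketch below follows the convention described in the text, where any odd padding goes to the right; the function name `conv_output_1d` is hypothetical:

```python
import math

def conv_output_1d(n, k, stride, padding):
    """Return (output size, (pad_left, pad_right), dropped inputs) along one dimension."""
    if padding == "VALID":
        out = (n - k) // stride + 1
        dropped = n - ((out - 1) * stride + k)  # right-most inputs never covered
        return out, (0, 0), dropped
    if padding == "SAME":
        out = math.ceil(n / stride)
        total = max((out - 1) * stride + k - n, 0)
        left = total // 2                       # the extra zero goes to the right
        return out, (left, total - left), 0
    raise ValueError("padding must be 'SAME' or 'VALID'")

valid_result = conv_output_1d(13, 6, 5, "VALID")  # 2 windows, inputs 12 and 13 dropped
same_result = conv_output_1d(13, 6, 5, "SAME")    # 3 windows, 1 zero left, 2 zeros right
same_5x5 = conv_output_1d(5, 3, 1, "SAME")        # 5 outputs, 1 zero on each side
```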

#### *Kernel size*

Kernel size, often also referred to as filter size, refers to the dimensions of the sliding window over the input. Choosing this hyperparameter has a massive impact on the image classification task. For example, small kernel sizes are able to extract a much larger amount of information containing highly local features from the input. A smaller kernel size also leads to a smaller reduction in layer dimensions, which allows for a deeper architecture. Conversely, a large kernel size extracts less information, which leads to a faster reduction in layer dimensions, often leading to worse performance. Large kernels are better suited to extracting larger features. Ultimately, choosing an appropriate kernel size depends on your task and dataset, but generally, smaller kernel sizes lead to better performance for the image classification task because an architecture designer is able to stack more and more layers together to learn more and more complex features!

#### *Stride*

Stride indicates how many pixels the kernel should be shifted over at a time. For example, Tiny VGG uses a stride of 1 for its convolutional layers, which means that the dot product is performed on a 3x3 window of the input to yield an output value, then is shifted to the right by one pixel for every subsequent operation. The impact stride has on a CNN is similar to kernel size. As stride is decreased, more features are learned because more data is extracted, which also leads to larger output layers. On the contrary, as stride is increased, this leads to more limited feature extraction and smaller output layer dimensions. One responsibility of the architecture designer is to ensure that the kernel slides across the input symmetrically when implementing a CNN.

### **Pooling Layer**

![3x3 pooling over 5x5 convolved feature](https://miro.medium.com/v2/resize:fit:640/1*uoWYsCV5vBU8SHFPAPao-w.gif)

Similar to the Convolutional Layer, the Pooling layer is responsible for gradually reducing the spatial size of the Convolved Feature. This is to **decrease the number of parameters, and the computational power required to process the data** through dimensionality reduction. Furthermore, it is useful for **extracting dominant features** which are rotationally and positionally invariant, thus keeping the training of the model effective.

There are two types of Pooling: Max Pooling and Average Pooling.

![Types of Pooling](https://miro.medium.com/v2/resize:fit:828/format:webp/1*KQIEqhxzICU7thjaQBfPBQ.png)

#### *Max Pooling*

Max Pooling returns the **maximum value** from the portion of the image covered by the Kernel.

The Max-Pooling operation requires selecting a kernel size and a stride length during architecture design. Once selected, the operation slides the kernel with the specified stride over the input while only selecting the largest value at each kernel slice from the input to yield a value for the output. 

#### *Average Pooling*

Average Pooling returns the **average of all the values** from the portion of the image covered by the Kernel.

The Average-Pooling operation requires selecting a kernel size and a stride length during architecture design. Once selected, the operation slides the kernel with the specified stride over the input and computes the average of the values at each kernel slice to yield a value for the output. 
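
Both pooling variants differ only in the reduction applied to each window, as this NumPy sketch shows. The helper `pool2d`, the 2x2 window, and the stride of 2 are choices made for the illustration:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Slide a size x size window over x and reduce each window to one value."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 1.0, 2.0],
              [7.0, 2.0, 9.0, 0.0],
              [1.0, 8.0, 3.0, 4.0]])
max_pooled = pool2d(x, mode="max")  # [[6, 4], [8, 9]]
avg_pooled = pool2d(x, mode="avg")  # [[3.75, 2.25], [4.5, 4.0]]
```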


### **Flatten Layer**

Flattening converts all the resultant 2-dimensional arrays from the pooled feature maps into a single long continuous linear vector. The flattened vector is fed as input to the fully connected layer to classify the image.

## Transfer learning


---

Resources:
* [TensorFlow - Transfer Learning](https://www.tensorflow.org/tutorials/images/transfer_learning)



The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world. You can then take advantage of these learned feature maps without having to start from scratch by training a large model on a large dataset.

The very last classification layer (on "top", as most diagrams of machine learning models go from bottom to top) is not very useful. Instead, you will follow the common practice to depend on the very last layer before the flatten operation. This layer is called the "bottleneck layer". The bottleneck layer features retain more generality as compared to the final/top layer.

### **Feature Extraction**

Feature Extraction: Use the representations learned by a previous network to extract meaningful features from new samples. You simply add a new classifier, which will be trained from scratch, on top of the pretrained model so that you can repurpose the feature maps learned previously for the dataset.

You do not need to (re)train the entire model. The base convolutional network already contains features that are generically useful for classifying pictures. However, the final, classification part of the pretrained model is specific to the original classification task, and subsequently specific to the set of classes on which the model was trained.

It is important to freeze the convolutional base before you compile and train the model. Freezing (keras: by setting `layer.trainable = False`) prevents the weights in a given layer from being updated during training.

### **Fine Tuning**

Fine-Tuning: Unfreeze a few of the top layers of a frozen model base and jointly train both the newly-added classifier layers and the last layers of the base model. This allows us to "fine-tune" the higher-order feature representations in the base model in order to make them more relevant for the specific task.

You should try to fine-tune a small number of top layers rather than the whole base model. In most convolutional networks, the higher up a layer is, the more specialized it is. The first few layers learn very simple and generic features that generalize to almost all types of images. As you go higher up, the features are increasingly more specific to the dataset on which the model was trained. The goal of fine-tuning is to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.

**Important note about BatchNormalization layers in TensorFlow**

Many models contain `tf.keras.layers.BatchNormalization` layers. This layer is a special case and precautions should be taken in the context of fine-tuning.

When you set `layer.trainable = False`, the BatchNormalization layer will run in inference mode, and will not update its mean and variance statistics.

When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing `training=False` when calling the base model. Otherwise, the updates applied to the non-trainable weights will destroy what the model has learned.

## Activation Functions


---



### **Sigmoid**

The sigmoid activation function is a commonly used mathematical function in artificial neural networks (NNs) and deep learning models. It is a type of nonlinear activation function that maps any input value to a range between 0 and 1.

The mathematical expression for the sigmoid function is:

\begin{equation}
S = \frac{1}{1+e^{-(m \times x+b)}}
\end{equation}


The sigmoid function has a distinctive S-shaped curve that increases gradually at first and then more steeply, before leveling off again as it approaches its upper limit of 1.0 (which it never actually reaches). The sigmoid function has the property of being differentiable, which is important for training neural networks using backpropagation.

One advantage of the sigmoid function is that it is easy to compute and is well-suited for binary classification problems. However, the main disadvantage of the sigmoid function is that it can suffer from the vanishing gradient problem when used in deep neural networks, which can slow down or even prevent convergence during the training process. As a result, other activation functions like ReLU, Leaky ReLU, and ELU have become more popular in modern deep learning architectures.
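
A short sketch of the sigmoid and its derivative helps show where the vanishing gradient comes from. The parameters `m` and `b` mirror the formula above; the derivative is written for the plain sigmoid with `m = 1` and `b = 0`, and the function names are illustrative:

```python
import numpy as np

def sigmoid(x, m=1.0, b=0.0):
    """S = 1 / (1 + exp(-(m * x + b)))."""
    return 1.0 / (1.0 + np.exp(-(m * x + b)))

def sigmoid_grad(x):
    """Derivative of the plain sigmoid: S * (1 - S)."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak_grad = sigmoid_grad(0.0)        # the derivative is largest at x = 0
saturated_grad = sigmoid_grad(10.0)  # close to zero once the sigmoid saturates
```

The derivative never exceeds 0.25, so multiplying it across many layers during backpropagation shrinks the gradient quickly.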

### **Softmax**

The softmax function, also known as softargmax or normalized exponential function, converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom.

The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, `each component will be in the interval (0,1), and the components will add up to 1`, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.



The standard (unit) softmax function $ \sigma: \mathbb{R}^K \rightarrow (0,1)^K $ is defined when $K \geq 1$ by the formula:


$$
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \text{ for } i = 1,...,K \text{ and } z= (z_1,...,z_K) \in \mathbb{R}^K
$$

In simple words, it applies the standard exponential function to each element $z_{i}$ of the input vector $z$ and normalizes these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector $\sigma (\mathbf {z} )$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector.

For example, the standard softmax of (1,2,8) is approximately (0.001,0.002,0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
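
The example can be checked with a few lines of NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; the function below is a sketch, not a library implementation:

```python
import numpy as np

def softmax(z):
    """Exponentiate each element and divide by the sum of all exponentials."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shifting by the max avoids overflow
    return e / e.sum()

p = softmax([1.0, 2.0, 8.0])
p_rounded = np.round(p, 3)
```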

\\

The softmax function is used in various `multiclass classification methods`

\\

The difference between standard normalization and softmax is that although both rescale the logits to values between 0 and 1, in softmax the largest logit (ideally the correct answer) receives by far the largest “signal”. By using softmax, we are effectively “approximating” argmax while gaining differentiability. Rescaling doesn’t weigh the max significantly higher than other logits, whereas softmax does. Simply put, softmax is a “softer” argmax.

### **ReLU**

The rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the positive part of its argument:

$$
\text{ReLU}(x) = \text{max}(0,x) = \left\{\begin{matrix}
x \text{ if } x > 0\\ 
0 \text{ if } x \leq  0
\end{matrix}\right.
$$

\\

![relu graph](https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png)
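
In code, the definition is a single elementwise maximum. A minimal NumPy sketch with arbitrary sample inputs:

```python
import numpy as np

def relu(x):
    """Return x where x > 0 and 0 elsewhere."""
    return np.maximum(0.0, x)

activations = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
```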


## Algorithms

---

### **Convolutional Neural Network (CNN)**

A CNN is a neural network: an algorithm used to recognize patterns in data. Neural Networks in general are composed of a collection of neurons that are organized in layers, each with their own learnable weights and biases.

A CNN conveys a differentiable score function, which is represented as class scores in the visualization on the output layer.

CNNs utilize a special type of layer, aptly named a convolutional layer, that makes them well-positioned to learn from image and image-like data. Regarding image data, CNNs can be used for many different computer vision tasks, such as image processing, classification, segmentation, and object detection.

### **Natural Language Processing (NLP)**

Resources:
* [A Simple Introduction to Natural Language Processing](https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32)
* [The illustrated Word2Vec](http://jalammar.github.io/illustrated-word2vec/)

Natural Language Processing is the technology used to help computers understand human natural language.



#### *Syntax Analysis*

Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.

In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules.

Here are some syntax techniques that can be used:

* **Lemmatization**: It entails reducing the various inflected forms of a word into a single form for easy analysis.
* **Morphological segmentation**: It involves dividing words into individual units called morphemes.
* **Word segmentation**: It involves dividing a large piece of continuous text into distinct units.
* **Part-of-speech tagging**: It involves identifying the part of speech for every word.
* **Parsing**: It involves undertaking grammatical analysis for the provided sentence.
* **Sentence breaking**: It involves placing sentence boundaries on a large piece of text.
* **Stemming**: It involves cutting the inflected words to their root form.

#### *Semantic Analysis*

Semantics refers to the meaning that is conveyed by a text.
It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.

Here are some techniques in semantic analysis:

* **Named entity recognition (NER)**: It involves determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of people and names of places.
* **Word sense disambiguation**: It involves giving meaning to a word based on the context.
* **Natural language generation**: It involves using databases to derive semantic intentions and convert them into human language.

In NLP, there are two main concepts for turning text into numbers: tokenization and embeddings.


#### *Tokenization*
A straight mapping from a word, character, or sub-word to a numerical value. There are three main levels of tokenization:
  1. Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being `0`, "love" being `1` and "TensorFlow" being `2`. In this case, every word in a sequence is considered a single **token**.
  2. **Character-level tokenization**, such as converting the letters A-Z to values `1-26`. In this case, every character in a sequence is considered a single **token**.
  3. **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple **tokens**.
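
The first two levels can be sketched in plain Python. The helper `build_vocab` is hypothetical, and real tokenizers handle punctuation, casing, and unknown words, which this sketch ignores:

```python
def build_vocab(tokens):
    """Assign each distinct token the next free integer id, in order of appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

# Word-level: every word is one token.
sentence = "I love TensorFlow"
word_tokens = sentence.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[token] for token in word_tokens]

# Character-level: map the letters a-z to the values 1-26.
char_ids = [ord(ch) - ord("a") + 1 for ch in "love"]
```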



#### *Embeddings*
An embedding is a representation of natural language which can be learned. Representation comes in the form of a **feature vector**. For example, the word "dance" could be represented by the 5-dimensional vector `[-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]`. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings: 
  1. **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)) and an embedding representation will be learned during model training.
  2. **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpuses of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.

### **Recurrent Neural Networks (RNN)**

Resources:
* [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)

The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: sequences in the input, the output, or in the most general case both. A few examples may make this more concrete:

![rnn-types](https://karpathy.github.io/assets/rnn/diags.jpeg)

So how do these things work? At the core, RNNs have a deceptively simple API: they accept an input vector x and give you an output vector y. However, crucially this output vector’s contents are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in in the past.

### **Transformers**

Resources:
* [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

# ***Compiling a Model***


## Optimization Function


---


### **Stochastic Gradient Descent (SGD)**

*Simplified steps of gradient descent for univariate linear regression*:
1. Pick starting values for 'b' and 'm'.
2. Calculate the slope of MSE with respect to 'b' and 'm' (using the slope instead of comparing two MSE values).
3. Use the sign of each slope to tell whether 'b' or 'm' is too high or too low. If both slopes are very small, we're done.
4. Multiply both slopes by the learning rate.
5. Subtract the results from 'b' and 'm', then go back to step 2.

Calculating the slope of MSE with respect to 'm' and 'b' in one step:

\begin{equation}
\frac{\text{Features}^T * ((\text{Features} * \text{Weights}) - \text{Labels})}{n}
\end{equation}

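Where: `Labels` is a tensor of our label data, `Features` is a tensor of our feature data (with a column of ones so that 'b' is treated like any other weight), `n` is the number of observations, and `Weights` is a tensor holding 'm' and 'b'. The constant factor 2 that comes from differentiating the square is omitted here; it can be absorbed into the learning rate.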
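A minimal sketch of the five steps and the one-step slope formula, written with plain Python lists rather than tensors. The toy data, learning rate, and stopping tolerance are assumptions chosen for illustration, and, like the formula, the slopes omit the constant factor 2 of the exact derivative.

```python
def mse_gradients(xs, ys, m, b):
    """Slopes of MSE with respect to m and b for the line y = m*x + b.

    Equivalent to Features^T * (Features * Weights - Labels) / n when
    Features has the columns [x, 1] and Weights holds [m, b].
    """
    n = len(xs)
    errors = [(m * x + b) - y for x, y in zip(xs, ys)]
    grad_m = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return grad_m, grad_b

def fit_line(xs, ys, learning_rate=0.05, steps=5000, tol=1e-9):
    m, b = 0.0, 0.0                                    # step 1: starting values
    for _ in range(steps):
        grad_m, grad_b = mse_gradients(xs, ys, m, b)   # step 2: slopes
        if abs(grad_m) < tol and abs(grad_b) < tol:    # step 3: small enough, stop
            break
        m -= learning_rate * grad_m                    # steps 4 and 5: scale, subtract
        b -= learning_rate * grad_b
    return m, b

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
```

On this noise-free data the loop should recover a slope close to 2 and an intercept close to 1. A learning rate that is too large would make the updates diverge instead.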

*Equation of Stochastic Gradient Descent*

\begin{equation}
\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta_{0}, \theta_{1}, \dots, \theta_{n})
\end{equation}

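where $\theta_{j}$ is the parameter to be updated, $\alpha$ is the learning rate, and $J(\theta_{0}, \theta_{1}, \dots, \theta_{n})$ is the cost function. The derivative of the cost function with respect to the parameter $\theta_{j}$ is computed and used to update the value of $\theta_{j}$. In stochastic gradient descent, this derivative is estimated from a single randomly chosen sample (or a small mini-batch) rather than from the entire dataset.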
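The update rule can be sketched as a generic loop that visits one sample at a time in random order. The function names, the toy data, and the hyperparameters below are assumptions made for illustration; in Keras the optimizer does this work for you.

```python
import random

def sgd(grad_fn, theta, data, learning_rate=0.01, epochs=1000, seed=0):
    """Apply theta_j := theta_j - alpha * dJ/dtheta_j using one sample per update."""
    rng = random.Random(seed)
    theta = list(theta)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                       # visit the samples in random order
        for sample in data:
            grads = grad_fn(theta, sample)      # gradient estimated from one sample
            theta = [t - learning_rate * g for t, g in zip(theta, grads)]
    return theta

def squared_error_grad(theta, sample):
    """Gradient of (theta0 + theta1 * x - y)^2 for a single (x, y) pair."""
    x, y = sample
    error = theta[0] + theta[1] * x - y
    return [2 * error, 2 * error * x]

# Toy data from y = 1 + 2x (made up for illustration).
data = [(x, 1 + 2 * x) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
theta = sgd(squared_error_grad, [0.0, 0.0], data)
```

Because each update uses only one sample, individual steps are noisy estimates of the full gradient, but they are cheap to compute.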

## Loss Function


---


resources:
* [How to Choose Loss Functions When Training Deep Learning Neural Networks](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)

### **Empirical Loss**

The empirical loss measures the average loss over our entire dataset.
Also known as the **Objective Function**, **Cost Function**, or **Empirical Risk**.

$$
J(W)=\frac{1}{n}\sum^{n}_{i=1}\mathcal{L}(f(x^{(i)};W), y^{(i)})
$$

* $f(x^{(i)};W)$: Predicted
* $y^{(i)}$: Actual
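As a rough sketch, the formula is an average of a per-sample loss over the dataset. The model and the per-sample loss below are hypothetical stand-ins for $f(x; W)$ and $\mathcal{L}$.

```python
def empirical_loss(model, loss_fn, xs, ys):
    """Average the per-sample loss L(f(x; W), y) over the dataset."""
    return sum(loss_fn(model(x), y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical model and per-sample loss, for illustration only.
def model(x):
    return 2.0 * x

def squared_error(predicted, actual):
    return (predicted - actual) ** 2

loss = empirical_loss(model, squared_error, [1.0, 2.0, 3.0], [2.0, 4.0, 7.0])
```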



### **Regression Loss Functions**

#### *Mean Square Error Loss (MSE)*

The Mean Squared Error, or MSE, loss is the default loss to use for regression problems.

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to be evaluated first and only changed if you have a good reason.

Mean squared error is calculated as the average of the squared differences between the predicted and actual values. The result is always positive regardless of the sign of the predicted and actual values and a perfect value is 0.0. The squaring means that larger mistakes result in more error than smaller mistakes, meaning that the model is punished for making larger mistakes.

keras: `mean_squared_error`

\begin{equation}
\text{Mean Squared Error} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y_{i}} - y_{i})^2
\end{equation}

where $n$ is the number of samples, $y_{i}$ is the true value of the target variable for the $i^{th}$ sample, and $\hat{y_{i}}$ is the predicted value of the target variable for the $i^{th}$ sample. The MSE is a measure of the average squared difference between the predicted and true values of the target variable.
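A small sketch of the formula in plain Python, with made-up numbers, showing how one large mistake costs more than several small ones:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

two_small_errors = mean_squared_error([3.0, 5.0], [2.0, 6.0])   # errors of 1 and 1
one_large_error = mean_squared_error([3.0, 5.0], [3.0, 9.0])    # errors of 0 and 4
```

The total absolute error is larger in the second case by a factor of two, but the MSE is larger by a factor of eight because the error is squared.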

*Vectorized Mean Squared Error:*

\begin{equation}
\text{VMSE} = \frac{\text{sum}\left(((\text{Features} * \text{Weights}) - \text{Labels})^2\right)}{n}
\end{equation}


Equation for the **derivative** of Mean Squared Error (MSE)

\begin{equation}
\frac{\partial}{\partial \hat{y_{i}}} MSE = \frac{2}{n} (\hat{y_{i}} - y_{i})
\end{equation}


where $\frac{\partial}{\partial \hat{y_{i}}} MSE$ is the derivative of the MSE with respect to the predicted value $\hat{y_{i}}$ for the $i^{th}$ sample, $n$ is the number of samples, and $y_{i}$ is the true value of the target variable for the $i^{th}$ sample. The derivative of the MSE with respect to the predicted values is used in gradient descent to update the parameters of the model.

*Vectorized Equation for the derivative of Mean Squared Error:*

\begin{equation}
\frac{\text{Features}^T * ((\text{Features} * \text{Weights}) - \text{Labels})}{n}
\end{equation}

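Where: `Labels` is a matrix of our label data, `Features` is a matrix of our feature data, `n` is the number of observations, and `Weights` is a matrix holding 'M' and 'B'. The constant factor 2 from differentiating the square is omitted here; it can be absorbed into the learning rate.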
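One way to sanity-check the vectorized expression is to compare it with a finite-difference estimate of the MSE gradient. The sketch below uses plain lists in place of matrices and made-up data. It should show that the expression equals the exact gradient up to the constant factor 2, which the expression leaves out.

```python
def predict(features, weights):
    """Features * Weights, one prediction per row."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]

def mse(features, weights, labels):
    preds = predict(features, weights)
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def vectorized_slope(features, weights, labels):
    """Features^T * (Features * Weights - Labels) / n, written with plain lists."""
    n = len(labels)
    errors = [p - y for p, y in zip(predict(features, weights), labels)]
    return [sum(row[j] * e for row, e in zip(features, errors)) / n
            for j in range(len(weights))]

def numerical_gradient(features, weights, labels, eps=1e-6):
    """Central finite differences of the MSE with respect to each weight."""
    grads = []
    for j in range(len(weights)):
        up = weights[:j] + [weights[j] + eps] + weights[j + 1:]
        down = weights[:j] + [weights[j] - eps] + weights[j + 1:]
        grads.append((mse(features, up, labels) - mse(features, down, labels)) / (2 * eps))
    return grads

# Made-up data; the second feature column of ones plays the role of 'B'.
features = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
labels = [2.0, 4.0, 7.0]
weights = [0.5, 0.0]
slope = vectorized_slope(features, weights, labels)
exact = numerical_gradient(features, weights, labels)
```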

#### *Mean Squared Logarithmic Error Loss*

There may be regression problems in which the target values span a wide range, and when predicting a large value you may not want to punish the model as heavily as mean squared error does.

Instead, you can first calculate the natural logarithm of the predicted and actual values, then calculate the mean squared error of those logarithms. This is called the Mean Squared Logarithmic Error loss, or MSLE for short. Keras takes the logarithm of each value plus one, so that a value of zero is handled.

It has the effect of relaxing the punishing effect of large differences in large predicted values.

As a loss measure, it may be more appropriate when the model is predicting unscaled quantities directly.

Keras: `mean_squared_logarithmic_error`
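A minimal sketch of MSLE, assuming the log-of-value-plus-one convention and non-negative inputs. The numbers are made up to show that the same absolute error is punished less when the values are large.

```python
import math

def mean_squared_log_error(y_true, y_pred):
    """MSE computed on log(1 + value); the +1 keeps the logarithm defined at zero."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

# Both predictions are off by 10, but at very different scales.
error_at_small_scale = mean_squared_log_error([10.0], [20.0])
error_at_large_scale = mean_squared_log_error([1000.0], [1010.0])
```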

#### *Mean Absolute Error Loss*

On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers, e.g. large or small values far from the mean value.

The Mean Absolute Error, or MAE, loss is an appropriate loss function in this case as it is **more robust to outliers**. It is calculated as the average of the absolute difference between the actual and predicted values.

keras: `mean_absolute_error`
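One way to see the robustness is to compare two constant predictions on made-up data with a single outlier: the median, which ignores the outlier, and the mean, which is pulled toward it. This is only a sketch of the idea.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

targets = [1.0, 2.0, 3.0, 4.0, 100.0]        # the last value is an outlier
median, mean = 3.0, sum(targets) / len(targets)

mae_at_median = mean_absolute_error(targets, [median] * len(targets))
mae_at_mean = mean_absolute_error(targets, [mean] * len(targets))
mse_at_median = mean_squared_error(targets, [median] * len(targets))
mse_at_mean = mean_squared_error(targets, [mean] * len(targets))
```

MAE prefers the prediction near the bulk of the data, while MSE prefers the prediction that was dragged toward the outlier.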

### **Binary Classification Loss Functions**

#### *Binary Cross-Entropy Loss*

Cross-entropy is the default loss function to use for binary classification problems.

It is intended for use with binary classification where the target values are in the set {0, 1}.

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.

Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized and a perfect cross-entropy value is 0.

keras: `binary_crossentropy`

Requires a `sigmoid` activation on the output layer.

\begin{equation}
-\frac{1}{n}\sum^n_{i=1} \left[ \text{Actual}_i \cdot \log(\text{Guess}_i) + (1 - \text{Actual}_i) \cdot \log(1 - \text{Guess}_i) \right]
\end{equation}

Where Actual is the encoded label value, Guess is the predicted value from the sigmoid function, and n is the number of observations.

*Vectorized Equation for Binary Cross Entropy:*

\begin{equation}
-\frac{1}{n} \cdot \left(\text{Actual}^T \cdot \log(\text{Guess}) + (1 - \text{Actual})^T \cdot \log(1-\text{Guess})\right)
\end{equation}

Where: `Actual` is a matrix of our label data, `Guess` is a matrix of predictions, and `n` is the number of observations
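A small sketch of the binary cross-entropy formula in plain Python. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(actual, guess, eps=1e-12):
    """Average of -[a * log(g) + (1 - a) * log(1 - g)] over all observations."""
    total = 0.0
    for a, g in zip(actual, guess):
        g = min(max(g, eps), 1.0 - eps)      # keep g strictly inside (0, 1)
        total += a * math.log(g) + (1 - a) * math.log(1 - g)
    return -total / len(actual)

confident_and_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_and_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confident wrong guesses are punished far more heavily than confident right guesses are rewarded, and a perfect set of guesses gives a loss close to 0.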

*Vectorized Equation for the derivative of Binary Cross Entropy:*

\begin{equation}
\frac{\text{Features}^T * (\text{Sigmoid}((\text{Features} * \text{Weights})) - \text{Labels})}{n}
\end{equation}

Where: `Labels` is a matrix of our label data, `Features` is a matrix of our feature data, `n` is the number of observations, and `Weights` is a matrix holding 'M' and 'B'
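For binary classification the prediction comes from a sigmoid, so the gradient is the feature-weighted average of (sigmoid output minus label). The sketch below applies that gradient in a plain gradient-descent loop on made-up data; the learning rate and step count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_gradient(features, weights, labels):
    """Features^T * (sigmoid(Features * Weights) - Labels) / n with plain lists."""
    n = len(labels)
    guesses = [sigmoid(sum(f * w for f, w in zip(row, weights))) for row in features]
    errors = [g - y for g, y in zip(guesses, labels)]
    return [sum(row[j] * e for row, e in zip(features, errors)) / n
            for j in range(len(weights))]

# Made-up data; the second column of ones plays the role of 'B'.
features = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]

initial_gradient = bce_gradient(features, [0.0, 0.0], labels)

weights = [0.0, 0.0]
for _ in range(2000):
    grads = bce_gradient(features, weights, labels)
    weights = [w - 0.5 * g for w, g in zip(weights, grads)]

predictions = [round(sigmoid(sum(f * w for f, w in zip(row, weights))))
               for row in features]
```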

#### *Hinge Loss*
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.

It is intended for use with binary classification where the target values are in the set **{-1, 1}**.

The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values.

Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.

keras: `hinge`

The output layer of the network must be configured to have a single node with a hyperbolic tangent activation function (keras: `tanh`) capable of outputting a single value in the range [-1, 1].
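A sketch of the standard hinge loss definition, the mean of `max(0, 1 - actual * predicted)`, using made-up predictions in the range [-1, 1]:

```python
def hinge_loss(y_true, y_pred):
    """Mean of max(0, 1 - y_true * y_pred) for labels in {-1, 1}."""
    return sum(max(0.0, 1.0 - t * p) for t, p in zip(y_true, y_pred)) / len(y_true)

correct_sign = hinge_loss([1, -1], [0.8, -0.6])
wrong_sign = hinge_loss([1, -1], [-0.8, 0.6])
```

Predictions with the wrong sign always contribute more than 1 to the sum, while predictions with the right sign contribute less than 1.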

#### *Squared Hinge Loss*
The hinge loss function has many extensions, often the subject of investigation with SVM models.

A popular extension is the squared hinge loss, which simply calculates the square of the hinge loss score. It has the effect of **smoothing the surface** of the error function and making it numerically easier to work with.

If using a hinge loss does result in better performance on a given binary classification problem, it is likely that a squared hinge loss may be appropriate.

As with using the hinge loss function, the target variable must be modified to have values in the set {-1, 1}.

keras: `squared_hinge`

The output layer must use a single node with a hyperbolic tangent activation function (keras: `tanh`) capable of outputting continuous values in the range [-1, 1].
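A sketch that squares each per-sample hinge term before averaging, with made-up values:

```python
def squared_hinge_loss(y_true, y_pred):
    """Mean of max(0, 1 - y_true * y_pred) squared, for labels in {-1, 1}."""
    return sum(max(0.0, 1.0 - t * p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

small_margin_miss = squared_hinge_loss([1], [0.5])    # hinge term 0.5, squared
wrong_sign = squared_hinge_loss([1], [-1.0])          # hinge term 2.0, squared
```

Squaring shrinks hinge terms below 1 and amplifies terms above 1, so predictions with the wrong sign are penalized more strongly than with the plain hinge loss.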

### **Multi-Class Classification Loss Functions**

#### *Multi-Class Cross-Entropy Loss*
Cross-entropy is the default loss function to use for multi-class classification problems.

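In this case, it is intended for use with multi-class classification where the target values are in the set {0, 1, 2, …, n-1} for n classes, where each class is assigned a unique integer value. Keras expects these integer labels to be one hot encoded before they are passed to this loss.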

Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.

Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0.

keras: `categorical_crossentropy`

The function requires that the output layer is configured with n nodes (one for each class) and a `softmax` activation in order to predict the probability for each class.

*Vectorized Equation for the derivative of Multinomial Cross Entropy:*

\begin{equation}
\frac{\text{Features}^T * (\text{Softmax}((\text{Features} * \text{Weights})) - \text{Labels})}{n}
\end{equation}

Where: `Labels` is a matrix of combined (one hot encoded) labels, `Features` is a matrix of feature data, `n` is the number of observations, and `Weights` is a matrix holding 'M' and 'B' for each category
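A sketch of softmax and the cross-entropy for a single observation, with made-up scores for three classes. The `eps` clipping is an assumption added to avoid the logarithm of zero.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """-sum(label * log(probability)) for one observation."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

one_hot = [1, 0, 0]                       # the true class is class 0
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy(one_hot, probs)

# The (Softmax(...) - Labels) term of the vectorized derivative for this observation.
error_term = [p - t for p, t in zip(probs, one_hot)]
```

The error term sums to zero: probability that is missing from the true class equals the probability wrongly given to the other classes.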

#### *Sparse Multiclass Cross-Entropy Loss*
A possible cause of frustration when using cross-entropy with classification problems with a **large number of labels** is the one hot encoding process.

For example, predicting words in a vocabulary may have tens or hundreds of thousands of categories, one for each label. This can mean that the target element of each training example may require a one hot encoded vector with tens or hundreds of thousands of zero values, requiring significant memory.

Sparse cross-entropy addresses this by performing the same cross-entropy calculation of error, without requiring that the target variable be one hot encoded prior to training.

keras: `sparse_categorical_crossentropy`

The function requires that the output layer is configured with n nodes (one for each class) and a `softmax` activation in order to predict the probability for each class.
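The sketch below, on made-up probabilities, illustrates that the sparse form gives the same value as the one hot form while taking integer class ids as targets. The function names are hypothetical helpers, not the Keras implementation.

```python
import math

def sparse_categorical_cross_entropy(class_ids, probs, eps=1e-12):
    """Mean of -log(probability given to the true class); targets are integer ids."""
    return -sum(math.log(max(p[c], eps)) for c, p in zip(class_ids, probs)) / len(class_ids)

def categorical_cross_entropy(one_hots, probs, eps=1e-12):
    """Same loss, but targets are one hot encoded vectors."""
    total = 0.0
    for one_hot, p in zip(one_hots, probs):
        total += -sum(t * math.log(max(q, eps)) for t, q in zip(one_hot, p))
    return total / len(one_hots)

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
sparse = sparse_categorical_cross_entropy([0, 1], probs)
dense = categorical_cross_entropy([[1, 0, 0], [0, 1, 0]], probs)
```

With tens of thousands of classes, each integer id replaces a vector of tens of thousands of mostly zero entries, which is where the memory saving comes from.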

#### *Kullback Leibler Divergence Loss*
Kullback Leibler Divergence, or KL Divergence for short, is a measure of how one probability distribution differs from a baseline distribution.

A KL divergence loss of 0 suggests the distributions are identical. In practice, the behavior of KL Divergence is very similar to cross-entropy. It calculates how much information is lost (in terms of bits) if the predicted probability distribution is used to approximate the desired target probability distribution.

As such, the KL divergence loss function is more commonly used **when using models that learn to approximate a more complex function** than simply multi-class classification, such as in the case of an autoencoder used for learning a dense feature representation under a model that must reconstruct the original input. In this case, KL divergence loss would be preferred. Nevertheless, it can be used for multi-class classification, in which case it is functionally equivalent to multi-class cross-entropy.

Keras: `kullback_leibler_divergence`

The function requires that the output layer is configured with n nodes (one for each class) and a `softmax` activation in order to predict the probability for each class.
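A sketch of KL divergence on made-up distributions, using natural logarithms (so the result is in nats rather than bits). It also illustrates the relationship to cross-entropy: KL divergence is cross-entropy minus the entropy of the target, and the entropy of a one hot target is 0.

```python
import math

def entropy(target):
    return -sum(t * math.log(t) for t in target if t > 0)

def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted) if t > 0)

def kl_divergence(target, predicted, eps=1e-12):
    """Sum of target * log(target / predicted); zero-probability targets add nothing."""
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, predicted) if t > 0)

soft_target = [0.7, 0.2, 0.1]
one_hot_target = [1.0, 0.0, 0.0]
predicted = [0.5, 0.3, 0.2]

kl_soft = kl_divergence(soft_target, predicted)
gap = cross_entropy(soft_target, predicted) - entropy(soft_target)
kl_one_hot = kl_divergence(one_hot_target, predicted)
```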

## Metrics


---


### **$R^2$**

$R^2$ is the coefficient of determination. A value of 1 means the predictions match the actual values exactly, a value between 0 and 1 means the model is learning something, and a value below 0 means it does worse than always predicting the average.

\begin{equation}
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
\end{equation}

Where $SS_{\text{res}}$ is the sum of squares of residuals, and $SS_{\text{tot}}$ is the total sum of squares.

\begin{equation}
SS_{\text{tot}} = \sum_{i=1}^n(\text{Actual}_i-\text{Average})^2
\end{equation}

$SS_{\text{tot}}$ does not depend on the predictions. It acts as a baseline: the error of always predicting the average.

\begin{equation}
SS_{\text{res}} = \sum_{i=1}^n(\text{Actual}_i-\text{Predicted}_i)^2
\end{equation}
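A short sketch of $R^2$ computed from the two sums of squares, with made-up values covering a perfect model, the predict-the-average baseline, and a model that does worse than the baseline:

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot."""
    average = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - average) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)
baseline = r_squared(actual, [2.5, 2.5, 2.5, 2.5])     # always predict the average
worse = r_squared(actual, [4.0, 3.0, 2.0, 1.0])        # predictions in reverse order
```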

