# DL Conceptual brief

## Libraries
### TensorFlow
- Google calls it **"an open source software library for machine intelligence"**
- Other deep learning libraries:
  - PyTorch (https://pytorch.org/)
  - Caffe (https://caffe.berkeleyvision.org/)
  - MXNet (https://mxnet.apache.org/)

#### Features
- Provides **auto-differentiation**
- Runs on either CPU or GPU
- Offers **pretrained models**
- Supports commonly used NN architectures such as recurrent neural networks, convolutional neural networks, and deep belief networks
- Supports **eager computation** in TensorFlow 2.0
- Supports **graph computation based on static graphs**
- Keras, a high-level neural network API, has been **integrated with TensorFlow** (in 2.0, Keras became the standard API for interacting with TensorFlow)


### PyTorch

### Keras
- The core layers in Keras include dense layers, activation layers, and dropout layers. There are other, more complex layers, including convolutional layers and pooling layers.
- A dense layer is also known as a fully-connected layer. It is fully-connected because it uses all of its input (as opposed to a subset of the input) for the mathematical function that it implements.
- A dense layer implements the following function:

$$\hat{y} = \sigma(Wx + b)$$

  - where $\hat{y}$ is the output, $\sigma$ is the activation function, $x$ is the input, and $W$ and $b$ are the weights and biases respectively.
- A model is a collection of layers
- Loss function – error metric for neural network training
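
The dense-layer formula above, $\hat{y} = \sigma(Wx + b)$, can be sketched in a few lines of plain Python. This is only an illustration, not the Keras implementation: the names `sigmoid` and `dense_forward`, the choice of a logistic activation, and the example weights are all assumptions.

```python
import math

def sigmoid(z):
    """Logistic activation, standing in for sigma in y_hat = sigma(Wx + b)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(W, b, x, activation=sigmoid):
    """Compute activation(W @ x + b) for a single input vector x.

    W is a list of rows (one row per output unit); b holds one bias per unit.
    Every output unit reads every input, which is what "fully connected" means.
    """
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

# Hypothetical 2-unit layer acting on a 3-dimensional input.
W = [[0.5, -1.0, 0.25],
     [1.0, 1.0, 1.0]]
b = [0.0, -1.0]
x = [2.0, 1.0, 4.0]
y_hat = dense_forward(W, b, x)  # one activation per output unit
```

Stacking several such calls, each feeding its output to the next, gives the "collection of layers" that makes up a model.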

<img src="https://www.oreilly.com/library/view/neural-network-projects/9781789138900/assets/081fd69c-7a42-4394-a198-a533a7e2892d.png">

- A linear classifier can solve the AND and OR problems but is not able to solve the XOR problem, because the XOR classes are not linearly separable
- https://dev.to/jbahire/demystifying-the-xor-problem-1blk
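
As a small experiment on the AND/OR/XOR claim, the sketch below trains a single perceptron (a linear classifier) on each truth table. The helper names `train_perceptron` and `accuracy`, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron update rule on 2-D inputs; returns (weights, bias)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred          # 0 when correct, +1 or -1 otherwise
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(samples, w, b):
    """Fraction of samples the linear rule w.x + b > 0 classifies correctly."""
    hits = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = list(zip(inputs, [0, 0, 0, 1]))
OR = list(zip(inputs, [0, 1, 1, 1]))
XOR = list(zip(inputs, [0, 1, 1, 0]))

results = {
    name: accuracy(data, *train_perceptron(data))
    for name, data in [("AND", AND), ("OR", OR), ("XOR", XOR)]
}
```

AND and OR are linearly separable, so the perceptron fits them exactly; no single line separates the XOR classes, so its accuracy there stays below 100% however long it trains.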


## General concepts
### Perceptron algorithm
### Softmax algorithm

### Transfer Learning
- Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
- How do you make an image classifier that can be trained in a few minutes on a CPU with very little data?
  - Use pre-trained models, i.e., models with known weights

### Representation Learning

### Activation function
### ReLU

### Back propagation
- Gradients are important
  - If it's differentiable, we can probably learn on it
- Gradients can vanish
  - As a network gets deeper, each additional layer can further reduce the gradient signal relative to noise
  - ReLUs are useful here
  - Try to limit depth of model
- Gradients can explode
  - can get NaNs
  - Learning rates are important here
    - try lower learning rate
  - Batch normalization (useful knob) can help
- ReLU units can die (get stuck outputting 0, so no gradient flows through them)
  - Keep calm and lower your learning rates
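
To make the vanishing and exploding cases concrete, here is a toy calculation. It assumes a chain of scalar layers that all share the same weight and pre-activation value, which real networks do not; the function names are hypothetical. Backpropagation multiplies one `weight * activation'` factor per layer, so the product shrinks or grows geometrically with depth.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic function; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, pre_activation, weight, depth):
    """Product of (weight * activation') factors along `depth` scalar layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight * local_grad(pre_activation)
    return grad

vanishing = chained_gradient(sigmoid_grad, 0.0, 1.0, depth=20)  # 0.25 ** 20
relu_path = chained_gradient(relu_grad, 1.0, 1.0, depth=20)     # stays at 1.0
exploding = chained_gradient(relu_grad, 1.0, 2.0, depth=20)     # 2.0 ** 20
dead_relu = chained_gradient(relu_grad, -1.0, 1.0, depth=20)    # 0.0, no signal
```

The `dead_relu` case mirrors the "ReLU can die" bullet: once a unit's pre-activation is negative for all inputs, its local gradient is 0 and nothing flows back through it.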

### Forward propagation

### Training Neural Nets
#### Normalizing Feature Values
  - features must have reasonable scales
    - Roughly zero-centered, \[-1, 1\] range often works well
    - Helps gradient descent converge; avoid NaN trap
    - Avoiding outlier values can also help
  - Can use a few standard methods
    - Linear scaling
    - Hard cap (clipping) to max, min
    - Log scaling
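
The three standard methods listed above can be sketched as follows. The function names, the target range `[-1, 1]`, and the use of `log(1 + v)` for log scaling are assumptions for illustration; other conventions (for example z-score scaling) are equally common.

```python
import math

def linear_scale(values, lo=-1.0, hi=1.0):
    """Map values linearly so that min -> lo and max -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                      # constant feature: avoid dividing by 0
        return [(lo + hi) / 2.0 for _ in values]
    span = v_max - v_min
    return [lo + (hi - lo) * (v - v_min) / span for v in values]

def clip(values, lower, upper):
    """Hard cap: anything outside [lower, upper] is pulled to the boundary."""
    return [min(max(v, lower), upper) for v in values]

def log_scale(values):
    """log(1 + v) compresses long tails; assumes non-negative values."""
    return [math.log1p(v) for v in values]

raw = [0.0, 5.0, 10.0, 1000.0]      # hypothetical feature with one outlier
scaled = linear_scale(raw)          # the outlier squeezes the other values near -1
capped = clip(raw, 0.0, 10.0)       # cap the outlier first ...
rescaled = linear_scale(capped)     # ... then the scaling uses the full range
logged = log_scale(raw)
```

Comparing `scaled` with `rescaled` shows why avoiding outlier values helps: without clipping, three of the four values land within 0.02 of each other.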

#### Dropout Regularization
- Dropout: Another form of regularization, useful for NNs
  - Works by randomly "dropping out" units in a network for a single gradient step
    - There's a connection to ensemble models here
  - The more you drop out, the stronger the regularization
    - 0.0 = no dropout regularization
    - 1.0 = drop everything out! learns nothing
    - Intermediate values more useful
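
A minimal sketch of dropout on a list of activations follows. It assumes the common "inverted dropout" variant, in which surviving units are rescaled by `1 / (1 - rate)` so that the expected activation is unchanged; that rescaling and the function signature are assumptions, not something the notes above specify.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each unit with probability `rate`; rescale the survivors.

    A fresh random mask is drawn on every call, i.e. for every gradient step.
    """
    if not training or rate == 0.0:
        return list(activations)            # 0.0 = no dropout regularization
    if rate >= 1.0:
        return [0.0 for _ in activations]   # 1.0 = everything dropped
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                      # seeded for repeatability
acts = [1.0] * 10000
dropped = dropout(acts, rate=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)    # close to 1.0 thanks to rescaling
```

At inference time (`training=False`) the activations pass through unchanged, which is why the rescaling is done during training.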


### Cross Entropy
### Encoder-Decoder

### Neural Style Transfer 
- Neural style transfer is the technique of blending the style of one image into another image while keeping the content of the latter intact. Only the style of the image changes, which gives it an artistic touch.

### Content Loss
### Style Loss
### Gram Matrices

### Multi-Class Neural Nets
#### One-Vs-All Multi-Class
  - SoftMax Multi-Class
    - Adds an additional constraint: the outputs of all one-vs-all nodes must sum to 1.0
    - Softmax is applied to the logits so that the resulting outputs sum to 1
    - The additional constraint helps training converge quickly
    - It also allows the outputs to be interpreted as probabilities
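
The sum-to-1 constraint can be checked with a small softmax sketch. The function below uses the standard max-subtraction trick for numerical stability; the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]     # hypothetical scores for three classes
probs = softmax(logits)      # largest logit gets the largest probability
```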
    
#### Multi-Class, Single-Label Classification:
  - An example may be a **member of only one class.**
  - Constraint that classes are **mutually exclusive** is helpful structure.
  - Useful to encode this in the loss.
  - Use **one softmax loss for all possible classes.**
  
#### Multi-Class, Multi-Label Classification:
  - An example may be a **member of more than one class.**
  - No additional constraints on class membership to exploit.
  - **One logistic regression loss for each possible class.**
  - **Multiple logistic regression**
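
The two loss choices above can be contrasted in code. This is a sketch under stated assumptions: the function names are hypothetical, labels are encoded as a class index (single-label) or a 0/1 vector (multi-label), and no framework API is implied.

```python
import math

def softmax_cross_entropy(logits, true_class):
    """Single-label loss: -log softmax(logits)[true_class]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

def sigmoid_binary_cross_entropy(logits, labels):
    """Multi-label loss: one independent logistic loss per class, summed."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

logits = [2.0, -1.0, 0.5]                      # hypothetical scores for 3 classes
single_label_loss = softmax_cross_entropy(logits, true_class=0)
multi_label_loss = sigmoid_binary_cross_entropy(logits, labels=[1, 0, 1])
```

In the softmax loss, raising one class's probability necessarily lowers the others, which encodes mutual exclusivity; the per-class logistic losses place no such coupling between classes.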

### SoftMax Options
#### Full SoftMax
- For multi-class classification problems
  - Brute force; calculates a probability for every class.
- Disadvantages
  - Relatively expensive to train
  - With a million classes, a million output nodes must be trained for every single example
  
#### Candidate Sampling
- More efficient because most negative classes are easy to tell apart from the positives and need not all be evaluated for every example
  - e.g., for a Labrador puppy, the model can easily differentiate it from a toaster
- Calculates for all the positive labels, but only for a random sample of negatives.
  - This is more efficient than full softmax
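
The idea of scoring only the positive label plus a random sample of negatives can be sketched as below. This is a deliberately simplified illustration: the function name is hypothetical, negatives are drawn uniformly, and the correction for the sampling distribution that production implementations typically apply is omitted.

```python
import math
import random

def sampled_softmax_loss(logits, positive, num_negatives, rng):
    """Cross-entropy over the positive class plus a random subset of negatives.

    Returns (loss, candidate_class_ids). Only len(candidates) logits are used.
    """
    num_classes = len(logits)
    # Sample from the num_classes - 1 negatives without building a full list:
    # indices at or above `positive` are shifted up by one to skip it.
    picks = rng.sample(range(num_classes - 1), num_negatives)
    negatives = [c if c < positive else c + 1 for c in picks]
    candidates = [positive] + negatives
    scores = [logits[c] for c in candidates]
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - logits[positive], candidates

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # hypothetical 1000-class output
loss, candidates = sampled_softmax_loss(logits, positive=3, num_negatives=10, rng=rng)
```

Here only 11 of the 1000 class scores enter the loss for this example, which is where the savings over full softmax come from.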


### Local Response Normalization vs Batch Normalization
- https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac


## 10 most popular deep learning algorithms
[Link](https://www.simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm)

### Convolutional Neural Networks (CNNs)
- Computer vision with deep learning has been built up and refined over time, primarily around one particular algorithm: the **Convolutional Neural Network (ConvNet/CNN)**.
- Neural networks came to prominence in **2012**, when machine learning researcher **Alex Krizhevsky** used them to win **first prize in the ImageNet competition**.
- Applications
  - Facebook's automatic photo-tagging algorithm is built on neural networks.
  - Product recommendations on Amazon and similar platforms rely on neural networks.
  - Neural networks are behind Google's image search capabilities.
  - Instagram's search infrastructure uses neural networks.
- A convolutional network **ingests a color image** as three separate strata of color **stacked one on top of the other**. A normal color image is seen as a rectangular box whose width and height are measured by the number of pixels along those dimensions. The three color layers (RGB) that make up its depth are referred to as channels.
- A ConvNet is able to successfully **capture the Spatial and Temporal dependencies** in an image through the application of relevant filters. 
- The **role of the ConvNet** is to **reduce the images** into a form which is easier to process, **without losing features** which are critical for getting a good prediction.
- The objective of the convolution operation is to **extract features** from the input image. ConvNets need not be limited to only one convolutional layer. Conventionally, the **first ConvLayer** is responsible for **capturing the low-level features** such as edges, color, and gradient orientation; with added layers, the network captures **high-level features** as well.
- CNNs have a **ReLU layer** that **applies the rectifier element-wise**, setting negative values to 0. The output is a rectified feature map.
- The **pooling layer** is responsible for **reducing the spatial size of the convolved feature**. This **decreases the computational power required to process the data through dimensionality reduction**. Furthermore, it is useful for **extracting dominant features that are rotationally and positionally invariant**, which helps the model train effectively.
- **Adding a Fully-Connected layer** is a (usually) **cheap way of learning non-linear combinations of the high-level features** as represented by the output of the convolutional layer. The Fully-Connected layer is learning a possibly non-linear function in that space.
- **Various CNN architectures** have been key in building the algorithms that power AI today and are likely to keep doing so in the foreseeable future. Some of them are listed below:
  - LeNet
  - AlexNet
  - VGGNet
  - GoogLeNet
  - ResNet
  - ZFNet
- https://poloclub.github.io/cnn-explainer/

#### General Concepts
##### Convolution
##### Max Pooling
##### Average Pooling
##### Fully Connected
##### SOTA Models
##### Differential Learning Rate
##### Feature Maps
##### Kernel
##### Filter
##### Spatial information
##### local receptive fields, shared weights, and pooling
##### Strides
##### Convolution2D, MaxPooling2D
##### Padding
##### Flatten Layer
##### Pooling layers



##### Convolutional Layer

<img src="https://miro.medium.com/max/1400/1*qtinjiZct2w7Dr4XoFixnA.gif">


##### CNN sequence to classify handwritten digits

<img src="https://miro.medium.com/max/1400/1*uAeANQIOQPqWZnnuH-VEyw.jpeg">


##### Flattening a 3x3 image matrix into a 9x1 vector

<img src="https://miro.medium.com/max/850/1*GLQjM9k0gZ14nYF0XmkRWQ.png" width=500 height=500>


##### Kernel 

- Convolving a 5x5x1 image (green) with a 3x3x1 kernel (yellow) to get a 3x3x1 convolved feature (red)

<img src="https://miro.medium.com/max/1052/1*GcI7G-JLAQiEoCON7xFbhg.gif" width=500 height=500>


- Convolution operation on a MxNx3 image matrix with a 3x3x3 Kernel

<img src="https://miro.medium.com/max/1400/1*ciDgQEjViWLnCbmX-EeSrA.gif">


- Convolution Operation with Stride Length = 2

<img src="https://miro.medium.com/max/790/1*1VJDP6qDY9-ExTuQVEOlVg.gif" width=500 height=500>


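- SAME padding: the 5x5x1 image is padded with 0s so that the convolved output keeps the 5x5x1 size (for a 3x3x1 kernel with stride 1, one ring of zeros gives a 7x7x1 padded image)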

<img src="https://miro.medium.com/max/790/1*nYf_cUIHFEWU1JXGwnz-Ig.gif" width=500 height=500>
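
The kernel, stride, and padding cases illustrated above can be reproduced with a small single-channel convolution. This is a sketch, not a library implementation: `conv2d` is a hypothetical helper, it handles square kernels only, and it computes the cross-correlation form (no kernel flip) that CNN layers conventionally use. For an input of size $n$, kernel $k$, padding $p$, and stride $s$, the output size is $\lfloor (n + 2p - k) / s \rfloor + 1$.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2-D convolution (cross-correlation) with zero padding."""
    if padding:
        width = len(image[0]) + 2 * padding
        image = (
            [[0] * width for _ in range(padding)]
            + [[0] * padding + list(row) + [0] * padding for row in image]
            + [[0] * width for _ in range(padding)]
        )
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    return [
        [
            sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 5x5 binary image and 3x3 kernel.
image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

valid = conv2d(image, kernel)               # 5x5 -> 3x3
strided = conv2d(image, kernel, stride=2)   # 5x5 -> 2x2
same = conv2d(image, kernel, padding=1)     # 5x5 -> 5x5 (padded input is 7x7)
```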


##### Types of Pooling 

<img src="https://miro.medium.com/max/1192/1*KQIEqhxzICU7thjaQBfPBQ.png" width=500 height=500>
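
Max pooling and average pooling differ only in how each window is reduced, as the following sketch shows. The helper `pool2d`, its defaults (2x2 window, stride 2), and the example feature map are assumptions for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the map and reduce each window."""
    if mode == "max":
        reduce_fn = max
    elif mode == "average":
        reduce_fn = lambda window: sum(window) / len(window)
    else:
        raise ValueError(f"unknown pooling mode: {mode!r}")
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [7, 2, 9, 8],
        [1, 0, 3, 4]]
max_pooled = pool2d(fmap, mode="max")        # keeps the strongest response
avg_pooled = pool2d(fmap, mode="average")    # keeps the mean response
```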


##### Fully Connected Layer (FC Layer)

<img src="https://miro.medium.com/max/1400/1*kToStLowjokojIQ7pY2ynQ.jpeg" width=500 height=500>



- https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
- https://github.com/ss-is-master-chief/MNIST-Digit.Recognizer-CNNs/blob/master/MNIST-Hand.Written.Digit.Recognition-CNN.ipynb
- https://medium.datadriveninvestor.com/introduction-to-how-cnns-work-77e0e4cde99b


#### CNN Research Papers


##### ResNet
- Deep Residual Learning for Image Recognition

##### VGG
- Very Deep Convolutional Networks for Large-Scale Image Recognition

##### DenseNet
- Densely Connected Convolutional Networks
- A DenseNet is a type of convolutional neural network that utilises dense connections between layers, through Dense Blocks, where we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers.

##### AlexNet
- ImageNet Classification with Deep Convolutional Neural Networks

##### VGG-16
- Very Deep Convolutional Networks for Large-Scale Image Recognition
- https://arxiv.org/pdf/1409.1556.pdf

##### MobileNetV2
- MobileNetV2: Inverted Residuals and Linear Bottlenecks

##### EfficientNet
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

##### Darknet-53
- YOLOv3: An Incremental Improvement

##### ResNeXt
- Aggregated Residual Transformations for Deep Neural Networks

##### GoogLeNet
- Going Deeper with Convolutions

##### Xception
- Xception: Deep Learning With Depthwise Separable Convolutions

##### SqueezeNet
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

##### Inception-v3
- Rethinking the Inception Architecture for Computer Vision

##### CSPDarknet53
- YOLOv4: Optimal Speed and Accuracy of Object Detection

#### CNN Applications
##### General Classification
##### Image Classification
##### Semantic Segmentation
- Classify Each Pixel
- Fully-Convolutional Networks
- SegNet & U-NET
- Faster R-CNN linked to Semantic Segmentation: Mask R-CNN

##### Object Detection 
##### Quantization 
##### Computed Tomography (CT) 
##### Domain Adaptation 
##### Out-of-Distribution Detection 
##### Language Modelling 

### Long Short Term Memory Networks (LSTMs)

### Recurrent Neural Networks (RNNs)
- **Recurrent Neural Networks (RNNs)** are widely used for data with some kind of **sequential structure**. For instance, **time series data has an intrinsic ordering based on time**. Sentences are also sequential: “I love dogs” has a different meaning than “Dogs I love.” Simply put, if the **semantics of your data is altered by random permutation, you have a sequential dataset and RNNs may be used for your problem!** Common applications are listed below.
- **Applications**
  - Speech Recognition
  - Sentiment Classification
  - Machine Translation (e.g., Chinese to English)
  - Video Activity Recognition
  - Named Entity Recognition (e.g., identifying names in a sentence)
- What are RNNs
  - RNNs **differ from classical multi-layer perceptron (MLP) networks** for two main reasons:
    - 1) They take into account what happened previously, and
    - 2) they share parameters/weights across time steps.
  - A recurrent neural network is a neural network specialized for processing a sequence of data $x^{(1)}, \dots, x^{(\tau)}$, with the time step index $t$ ranging from 1 to $\tau$. For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs. In an NLP problem, if you want to predict the next word in a sentence, it is important to know the words before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far.
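
The two properties above (memory of earlier steps and shared weights) show up directly in a vanilla RNN forward pass. The sketch below assumes the common update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ with $h_0 = 0$; the `tanh` activation, the function name, and the tiny example weights are assumptions.

```python
import math

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    The same W_xh, W_hh, and b_h are reused at every time step, and each
    new state depends on the previous one.
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_xh[k], x))
                + sum(w * hj for w, hj in zip(W_hh[k], h))
                + b_h[k]
            )
            for k in range(hidden_size)
        ]
        states.append(h)
    return states

# Hypothetical weights: 1-dimensional inputs, 2 hidden units.
W_xh = [[0.5], [-0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]

seq_a = [[1.0], [0.0], [0.0]]   # same elements as seq_b ...
seq_b = [[0.0], [0.0], [1.0]]   # ... in a different order
final_a = rnn_forward(seq_a, W_xh, W_hh, b_h)[-1]
final_b = rnn_forward(seq_b, W_xh, W_hh, b_h)[-1]
```

The two sequences contain the same values but produce different final states, so the network is sensitive to order; the same function also accepts sequences of any length without new parameters.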


------



- https://pub.towardsai.net/whirlwind-tour-of-rnns-a11effb7808f
- https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85
- https://github.com/javaidnabi31/RNN-from-scratch/blob/master/RNN_char_text%20generator.ipynb
- https://www.deeplearningbook.org/contents/rnn.html
- https://gist.github.com/karpathy/d4dee566867f8291f086
- https://www.coursera.org/learn/nlp-sequence-models/lecture/0h7gT/why-sequence-models
- [A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation](https://arxiv.org/pdf/1610.02583.pdf)



#### Evolution of R-CNN
- Region based CNN
- Fast R-CNN
- Faster R-CNN

#### Fully-Convolutional Networks (FCN)
#### RNN vs CNN
- Type of Data
  - CNNs are used in solving problems related to spatial data, such as images. 
  - RNNs are better suited to analyzing temporal, sequential data, such as text or videos.
- Architecture
  - CNNs are "feed-forward neural networks" that use filters and pooling layers, 
  - RNNs feed results back into the network 
- Size of input
  - CNNs take inputs of a fixed size and produce outputs of a fixed size.
  - In RNNs, the size of the input and of the resulting output may vary.
- Use cases
  - CNN use cases include facial recognition, medical analysis, and classification
  - RNN use cases include text translation, NLP, sentiment analysis, and speech analysis

#### U-NET 
- Long Skip Connections

#### Model Compression
##### Knowledge distillation
##### Pruning
##### Quantization
##### Low-rank approximation and sparsity



### Generative Adversarial Networks (GANs)

### Radial Basis Function Networks (RBFNs)

### Multilayer Perceptrons (MLPs)

### Self Organizing Maps (SOMs)

### Deep Belief Networks (DBNs)

### Restricted Boltzmann Machines (RBMs)

### Autoencoders


## Optimization algorithms - Deep Learning
- [Optimizing GD](https://ruder.io/optimizing-gradient-descent/)
- [Optimizer Visualization](https://github.com/Jaewan-Yun/optimizer-visualization)

### ASGD
### Adadelta
### Adagrad
### Adam
### AdamW
### Adamax
### LBFGS
### NAdam
### RAdam
### RMSprop
### Rprop
### SGD
### SparseAdam

## Artificial neural network
- https://towardsdatascience.com/neural-network-architectures-156e5bad51ba


### Autoencoder
### Cognitive computing
### Deep learning
### DeepDream

### Multilayer perceptron
A **perceptron** was the name given to a model with a single linear layer; a model with multiple layers is called a **multi-layer perceptron (MLP)**. Note that the input and output layers are visible from outside, while all the layers in the middle are hidden, hence the name hidden layers. In this context, a single layer is simply a linear function, and the MLP is obtained by stacking multiple single layers one after the other:

<img src="https://www.researchgate.net/profile/Hassan-Afzaal/publication/338103191/figure/fig2/AS:838599264174093@1576949053675/The-multilayer-perceptron-MLP-model-for-various-input-variable-combinations.jpg" width=500 height=500>
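
Stacking layers can be sketched as a loop that feeds each layer's output into the next. One assumption goes beyond the paragraph above: a ReLU non-linearity is placed after the hidden layer, because a stack of purely linear layers collapses into a single linear function. The weights are hand-picked (not trained) so that the two-layer network computes XOR.

```python
def relu(z):
    return max(0.0, z)

def layer(W, b, x, activation=None):
    """One layer: linear map W @ x + b, optionally followed by an activation."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [activation(v) for v in out] if activation else out

def mlp(x, layers):
    """Apply the layers one after the other."""
    for W, b, activation in layers:
        x = layer(W, b, x, activation)
    return x

# Hypothetical hand-picked weights: y = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
xor_net = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),   # hidden layer, 2 units
    ([[1.0, -2.0]], [0.0], None),                    # linear output layer
]
outputs = {x: mlp(list(x), xor_net)[0] for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

With one hidden layer the network represents XOR, which a single linear layer cannot.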

### RNN 
#### LSTM
#### GRU
#### ESN
### Restricted Boltzmann machine
### GAN
### SOM
### Convolutional neural network 
#### U-Net
### Transformer
#### Vision
### Spiking neural network
### Memtransistor
### Electrochemical RAM (ECRAM)

## Computer Vision
- Background Subtraction
- Colorspace
- Features
- Filters
- Geometry
- Affine transforms
- Projective transforms
- HOG Features
- Histograms
- Homography
- Hough Transform
- Image Gradients
- K-Means
- Kalman Filter
- Linear algebra
- Vectors
- Matrices
- Morphological Operations
- Optical Flow
- Segmentation
- Thresholding



## Intro to Deep Learning - Lectures

- Lecture 1 - Intro to Deep Learning
- Lecture 2 - Deep Sequence Modeling
- Software Lab 1 - Intro to TensorFlow; Music Generation
- Lecture 3 - Deep Computer Vision
- Lecture 4 - Deep Generative Modeling
- Software Lab 2 - De-biasing Facial Recognition Systems
- Lecture 5 - Deep Reinforcement Learning
- Lecture 6 - Limitations and New Frontiers
- Software Lab 3 - Learning End-to-End Self-Driving Control
- Lecture 7 - Autonomous Driving with LiDAR
- Lecture 8 - Uncertainty in Deep Learning
- Lecture 9 - AI 4 Science
- Lecture 10 - Speech Recognition
- Final Project
- Project Competition