## 0. Table of Contents
1. [Tutorial](#tutorial)
2. [Dataset](#dataset)  
    2.1 [Visual](#visual)  
    2.2 [Audio](#audio)  
    2.3 [Text](#text)
3. [Module](#module)  
    3.1 [Linear Layer](#linear_layer)  
    3.2 [Convolution Layer](#convolution_layer)  
    3.3 [Pooling Layer](#pooling_layer)  
4. [Model](#model)  
    4.1 [Convolution](#convolution)  
    4.2 [Recurrent](#recurrent)  
5. [Loss](#loss)  
    5.1 [Regression](#loss)  
    5.2 [Classification](#loss)  
6. [Activation](#activation)   

<a id="tutorial"></a>
## 1. Tutorial
* [CS231n](http://cs231n.github.io/convolutional-networks/)
* [Convolution Arithmetic](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html#)

<a id="dataset"></a>
## 2. Dataset
<a id="visual"></a>
### 2.1 Visual
#### 2.1.1 MNIST
The MNIST database of handwritten digits 0-9 has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
![MNIST](./img/MNIST.jpg)
* URL:
    * [MNIST](http://yann.lecun.com/exdb/mnist/)
    * [Pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#mnist)
        
#### 2.1.2 COCO
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
* Object segmentation
* Recognition in context
* Superpixel stuff segmentation
* 330K images (>200K labeled)
* 1.5 million object instances
* 80 object categories
* 91 stuff categories
* 5 captions per image
* 250,000 people with keypoints
![COCO](./img/COCO.jpg)
* URL:
    * [COCO](http://cocodataset.org/#home)
    * [Pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#coco)
    
#### 2.1.3 WebVision 
WebVision is built on the ImageNet synsets. The dataset is designed to facilitate research on learning visual representations from noisy web data. Its authors' stated goal is to disentangle deep learning techniques from the heavy human labor of annotating large-scale vision datasets. The dataset is released as a benchmark to advance research on learning from web data, including weakly supervised visual representation learning, visual transfer learning, text and vision, etc. (see the recommended settings for the WebVision dataset).

Similar to the WebVision 1.0 dataset, the WebVision 2.0 dataset contains images crawled from the Flickr website and Google Images search. In this version, the number of visual concepts is extended from 1,000 to 5,000, and the total number of training images reaches 16 million. The 5,000 visual concepts contain the original 1,000 concepts in the WebVision 1.0 dataset and an additional 4,000 ImageNet synsets with the largest number of images. Semantically overlapping synsets are removed, so that no synset is the parent or child of another. All 5,000 visual concepts have corresponding synsets in the ImageNet dataset, so many existing approaches can be directly investigated and compared to models trained on the human-annotated ImageNet dataset, which also makes it possible to study dataset bias at large scale. The textual information accompanying the images (e.g., caption, user tags, or description) is also provided as additional meta information. A validation set containing around 250K images (up to 50 images per category) is provided to facilitate algorithmic development.
![WebVision](./img/WebVision.jpg)
* URL:
    * [Imagenet](http://www.image-net.org/)
    * [WebVision](https://www.vision.ee.ethz.ch/webvision/)
    * [Pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#imagenet)

#### 2.1.4 CIFAR
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. 
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. 

The CIFAR-100 dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
![CIFAR](./img/CIFAR.jpg)
* URL:
    * [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html)
    * [Pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#cifar)

#### 2.1.5 SVHN
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images. 
![SVHN](./img/SVHN.jpg)
* URL:
    * [SVHN](http://ufldl.stanford.edu/housenumbers/)
    * [Pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#svhn)
    
<a id="audio"></a>    
### 2.2 Audio
    
<a id="text"></a> 
### 2.3 Text


<a id="module"></a>
## 3. Module
<a id="linear_layer"></a>
### 3.1 Linear Layer
#### 3.1.1 Linear
* Other names: Fully Connected, Dense
* Parameters: $C_{in},C_{out}$
* Formula:  
\begin{equation*}
\text{out}(N, C_{out})=\text{input}(N, C_{in})(\text{weight}(C_{out},C_{in}))^{T} + \text{bias}(C_{out})
\end{equation*}
* Shape:
    * Input: $(N,∗,C_{in})$ where $∗$ means any number of additional dimensions
    * Output: $(N,∗,C_{out})$ where all but the last dimension are the same shape as the input
    * Weight:  $(C_{out}, C_{in})$
    * Bias: $(C_{out})$
* URL:
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#linear)
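
As a minimal sketch of the formula above, the following pure-Python function computes `out = input · weightᵀ + bias` for a small batch. The function name `linear` and the toy values are illustrative assumptions, not part of any library.

```python
def linear(x, weight, bias):
    """Apply out = x @ weight.T + bias to a batch of row vectors.

    x:      N rows, each of length C_in
    weight: C_out rows, each of length C_in
    bias:   C_out values
    """
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]


x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]        # N = 2, C_in = 3
weight = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]   # C_out = 2, C_in = 3
bias = [0.0, 1.0]                             # C_out = 2

out = linear(x, weight, bias)
print(out)  # [[1.0, 4.0], [0.0, 1.5]]
```

Each output row has $C_{out}$ entries, matching the shape rule listed above.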

<a id="convolution_layer"></a>
### 3.2 Convolution Layer
#### 3.2.1 Conv (2D)
* Parameters: $C_{in},C_{out},\text{kernel_size},N_{stride},N_{padding},N_{dilation},N_{groups}$
* Formula:  
\begin{equation*}
\text{out}(N_i, C_{out_j}) = \text{bias}(C_{out_j}) +
                        \sum_{k = 0}^{C_{in} - 1} \text{weight}(C_{out_j}, k) \star \text{input}(N_i, k)
\end{equation*}
where $⋆$ is the valid cross-correlation operator
* Shape:
    * Input: $(N, C_{in}, H_{in}, W_{in})$ 
    * Output: $(N, C_{out}, H_{out}, W_{out})$ where
\begin{align}\begin{aligned}
H_{out} = \left\lfloor\frac{H_{in}  + 2 * N_{padding}[0] - N_{dilation}[0] * (\text{kernel_size}[0] - 1) - 1}{N_{stride}[0]} + 1\right\rfloor\\
W_{out} = \left\lfloor\frac{W_{in}  + 2 * N_{padding}[1] - N_{dilation}[1] * (\text{kernel_size}[1] - 1) - 1}{N_{stride}[1]} + 1
\right\rfloor
\end{aligned}\end{align}
    * Weight:  $(C_{out}, C_{in}, \text{kernel_size}[0], \text{kernel_size}[1])$
    * Bias: $(C_{out})$
* URL:
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#conv2d)
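
The output-size formula above can be checked with a few lines of arithmetic. This is a sketch for one spatial dimension; the helper name `conv2d_out_size` is a hypothetical name chosen for illustration.

```python
def conv2d_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a 2D convolution."""
    numerator = size_in + 2 * padding - dilation * (kernel_size - 1) - 1
    return numerator // stride + 1  # integer floor division implements the floor


# 5x5 kernel on a 32x32 input, no padding
print(conv2d_out_size(32, 5))                       # 28
# 11x11 kernel, stride 4, padding 2 on a 224x224 input
print(conv2d_out_size(224, 11, stride=4, padding=2))  # 55
# dilation 2 makes a 3x3 kernel cover a 5x5 window
print(conv2d_out_size(32, 3, dilation=2))           # 28
```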
    
#### 3.2.2 ConvTranspose (2D)
* Parameters: $C_{in},C_{out},\text{kernel_size},N_{stride},N_{padding},N_{output\_padding},N_{dilation},N_{groups}$
* Formula:  
\begin{equation*}
\text{out}(N_i, C_{out_j}) = \text{bias}(C_{out_j}) +
                        \sum_{k = 0}^{C_{in} - 1} \text{weight}(C_{out_j}, k) \star \text{input}(N_i, k)
\end{equation*}
where $⋆$ is the valid cross-correlation operator
* Shape:
    * Input: $(N, C_{in}, H_{in}, W_{in})$ 
    * Output: $(N, C_{out}, H_{out}, W_{out})$ where
\begin{align}\begin{aligned}
H_{out} = (H_{in} - 1) * N_{stride}[0] - 2 * N_{padding}[0] + N_{dilation}[0] * (\text{kernel_size}[0] - 1) + N_{output\_padding}[0] + 1\\
W_{out} = (W_{in} - 1) * N_{stride}[1] - 2 * N_{padding}[1] + N_{dilation}[1] * (\text{kernel_size}[1] - 1) + N_{output\_padding}[1] + 1
\end{aligned}\end{align}
    * Weight:  $(C_{in}, C_{out} / N_{groups}, \text{kernel_size}[0], \text{kernel_size}[1])$
    * Bias: $(C_{out})$
* URL:
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#convtranspose2d)
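
A transposed convolution is typically used to undo the spatial reduction of a strided convolution. The sketch below computes the output size along one dimension, including a dilation term that reduces to `kernel_size` when dilation is 1, and checks the round trip against the Conv (2D) shape rule. Both helper names are hypothetical.

```python
def conv2d_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a regular convolution along one dimension."""
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv_transpose2d_out_size(size_in, kernel_size, stride=1, padding=0,
                              output_padding=0, dilation=1):
    """Output length of a transposed convolution along one dimension."""
    return ((size_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# Upsample 5 -> 10 with a 3x3 kernel, stride 2, padding 1, output_padding 1
up = conv_transpose2d_out_size(5, 3, stride=2, padding=1, output_padding=1)
print(up)  # 10

# The matching strided convolution maps 10 back to 5
down = conv2d_out_size(up, 3, stride=2, padding=1)
print(down)  # 5
```

The `output_padding` argument resolves the ambiguity caused by the floor in the forward formula: inputs of size 9 and 10 both map to 5 under the strided convolution above.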

<a id="pooling_layer"></a>
### 3.3 Pooling Layer
#### 3.3.1 MaxPool (2D)
* Parameters: $\text{kernel_size},N_{stride},N_{padding},N_{dilation}$
* Formula:  
\begin{equation*}
\text{out}(N_i, C_j, h, w)  = \max_{m=0, \ldots, kH-1} \max_{n=0, \ldots, kW-1}
                       \text{input}(N_i, C_j, \text{stride}[0] * h + m, \text{stride}[1] * w + n)
\end{equation*}
* Shape (2D):
    * Input: $(N, C, H_{in}, W_{in})$ 
    * Output: $(N, C, H_{out}, W_{out})$ where
\begin{align}\begin{aligned}
H_{out} = \left\lfloor\frac{H_{in}  + 2 * N_{padding}[0] - N_{dilation}[0] * (\text{kernel_size}[0] - 1) - 1}{N_{stride}[0]} + 1\right\rfloor\\
W_{out} = \left\lfloor\frac{W_{in}  + 2 * N_{padding}[1] - N_{dilation}[1] * (\text{kernel_size}[1] - 1) - 1}{N_{stride}[1]} + 1
\right\rfloor
\end{aligned}\end{align}
    * Kernel Size:  $(kH,kW)$
* URL:
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#maxpool2d)
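
The max-pooling formula above can be written directly as nested loops. This is a minimal sketch for a single channel given as a list of rows, assuming no padding and dilation 1; the function name `max_pool2d_plain` is hypothetical.

```python
def max_pool2d_plain(image, kernel_size, stride=None):
    """Max-pool one 2D channel (list of rows); no padding, dilation 1."""
    k_h, k_w = kernel_size
    s_h, s_w = stride if stride is not None else kernel_size
    h_out = (len(image) - k_h) // s_h + 1
    w_out = (len(image[0]) - k_w) // s_w + 1
    return [
        [max(image[s_h * h + m][s_w * w + n]
             for m in range(k_h) for n in range(k_w))
         for w in range(w_out)]
        for h in range(h_out)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d_plain(image, (2, 2))
print(pooled)  # [[6, 8], [14, 16]]
```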
    
#### 3.3.2 AvgPool (2D)
* Parameters: $\text{kernel_size},N_{stride},N_{padding}$
* Formula:  
\begin{equation*}
\text{out}(N_i, C_j, h, w)  = \frac{1}{kH * kW} \sum_{m=0}^{kH-1} \sum_{n=0}^{kW-1}
                       \text{input}(N_i, C_j, \text{stride}[0] * h + m, \text{stride}[1] * w + n)
\end{equation*}
* Shape (2D):
    * Input: $(N, C, H_{in}, W_{in})$ 
    * Output: $(N, C, H_{out}, W_{out})$ where
\begin{align}\begin{aligned}
H_{out} = \left\lfloor\frac{H_{in}  + 2 * N_{padding}[0] - \text{kernel_size}[0]}{N_{stride}[0]} + 1\right\rfloor\\
W_{out} = \left\lfloor\frac{W_{in}  + 2 * N_{padding}[1] - \text{kernel_size}[1]}{N_{stride}[1]} + 1\right\rfloor
\end{aligned}\end{align}
    * Kernel Size:  $(kH,kW)$
* URL:
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#avgpool2d)
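
Average pooling follows the same window arithmetic but replaces the maximum with a mean over the $kH \times kW$ window. A minimal single-channel sketch, assuming no padding (the function name `avg_pool2d_plain` is hypothetical):

```python
def avg_pool2d_plain(image, kernel_size, stride=None):
    """Average-pool one 2D channel (list of rows); no padding."""
    k_h, k_w = kernel_size
    s_h, s_w = stride if stride is not None else kernel_size
    h_out = (len(image) - k_h) // s_h + 1
    w_out = (len(image[0]) - k_w) // s_w + 1
    return [
        [sum(image[s_h * h + m][s_w * w + n]
             for m in range(k_h) for n in range(k_w)) / (k_h * k_w)
         for w in range(w_out)]
        for h in range(h_out)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
averaged = avg_pool2d_plain(image, (2, 2))
print(averaged)  # [[3.5, 5.5], [11.5, 13.5]]
```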


<a id="model"></a>
## 4. Model

```python
import torch
from torch import nn
```

<a id="convolution"></a>
### 4.1 Convolution
#### 4.1.1 LeNet5
LeNet5 is a pioneering 7-level convolutional network introduced by LeCun et al. in 1998 for digit classification. Several banks applied it to recognise hand-written numbers on checks (cheques) digitized as 32x32 pixel greyscale input images. Processing higher-resolution images requires larger and more numerous convolutional layers, so the technique is constrained by the availability of computing resources.
![LeNet5](./img/LeNet5.jpg)
* URL:
    * [LeNet5](http://yann.lecun.com/exdb/lenet/)

```python
import torch.nn.functional as F  # functional API (relu, max_pool2d, dropout) used in forward()


class LeNet5(nn.Module):
    """Classic LeNet-5 layout; expects 1x32x32 greyscale inputs."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)    # 32x32 -> 28x28
        self.conv2 = nn.Conv2d(6, 16, 5)   # 14x14 -> 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))  # -> 6x14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))  # -> 16x5x5
        x = x.view(x.size(0), -1)                        # flatten to (N, 400)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


# Current implementation: a smaller variant with dropout.
# The 320 input features of fc1 (20 * 4 * 4) imply 1x28x28 inputs.
class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)  # active only in training mode
        x = self.fc2(x)
        return x
```
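
The input sizes of the first fully connected layers (`16*5*5` and `320`) follow from the convolution and pooling arithmetic. The sketch below traces the spatial size through two conv-plus-pool stages, assuming 5x5 kernels without padding and 2x2 pooling as in the code above; the helper names are hypothetical.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    """Spatial size after non-overlapping pooling (stride = kernel size)."""
    return (size - kernel_size) // kernel_size + 1


def flattened_features(input_size, conv_channels, kernel_size=5, pool=2):
    """Number of features entering the first fully connected layer."""
    size = input_size
    for _ in conv_channels:
        size = pool_out(conv_out(size, kernel_size), pool)
    return conv_channels[-1] * size * size


print(flattened_features(32, [6, 16]))   # 400 = 16 * 5 * 5 (LeNet5, 32x32 input)
print(flattened_features(28, [10, 20]))  # 320 = 20 * 4 * 4 (LeNet, 28x28 input)
```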

#### 4.1.2 AlexNet
In 2012, AlexNet significantly outperformed all prior competitors and won the ILSVRC challenge, reducing the top-5 error from 26% to 15.3%. The second-place entry, which was not a CNN variant, had a top-5 error rate of around 26.2%.
The network had an architecture very similar to LeNet by Yann LeCun et al. but was deeper, with more filters per layer and with stacked convolutional layers. It consisted of 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum. ReLU activations were attached after every convolutional and fully-connected layer. AlexNet was trained for 6 days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two pipelines. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever.
![AlexNet](./img/AlexNet.jpg)
* URL:
    * [AlexNet](https://en.wikipedia.org/wiki/AlexNet)

```python
class AlexNet(nn.Module):

    def __init__(self, num_classes=1000):
        super().__init__()
        # Convolutional feature extractor: five conv layers, three max-pools
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Fully connected classifier on the flattened 256x6x6 feature map
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), 256 * 6 * 6)  # flatten; assumes a 6x6 final feature map
        x = self.classifier(x)
        return x
```
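
The `256 * 6 * 6` flatten size and the roughly 60 million parameters can be verified by hand. The sketch below assumes a 224x224 RGB input (an assumption, since the code above does not fix the input size) and uses the layer settings from the class; the helper names are hypothetical.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out  # weights + biases


def linear_params(c_in, c_out):
    return c_in * c_out + c_out  # weights + biases


# (kernel_size, stride, padding) of every conv / pool layer, in order
spatial_layers = [(11, 4, 2), (3, 2, 0), (5, 1, 2), (3, 2, 0),
                  (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 0)]
size = 224  # assumed input resolution
for k, s, p in spatial_layers:
    size = out_size(size, k, s, p)
print(size)  # 6

conv_layers = [(3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3)]
linear_layers = [(256 * size * size, 4096), (4096, 4096), (4096, 1000)]
total = (sum(conv_params(*layer) for layer in conv_layers)
         + sum(linear_params(*layer) for layer in linear_layers))
print(total)  # 61100840
```

Most of the parameters sit in the first fully connected layer, which is why the flatten size matters so much for model size.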

#### 4.1.3 VGGNet
The runner-up at the ILSVRC 2014 competition is dubbed VGGNet by the community and was developed by Simonyan and Zisserman. VGGNet has 16 weight layers (13 convolutional and 3 fully connected in the VGG-16 configuration) and is appealing because of its very uniform architecture. Like AlexNet it stacks convolution and pooling layers, but it uses only 3x3 convolutions with many filters. It was trained on 4 GPUs for 2–3 weeks. It has been a popular choice in the community for extracting features from images. The weight configuration of VGGNet is publicly available and has been used in many other applications and challenges as a baseline feature extractor. However, VGGNet has 138 million parameters, which can be challenging to handle.
![VGGNet_1](./img/VGGNet_1.jpg)
![VGGNet_2](./img/VGGNet_2.jpg)
* URL:
    * [VGGNet](https://arxiv.org/abs/1409.1556)

```python
class VGG(nn.Module):

    def __init__(self, features, num_classes=1000, init_weights=True):
        # `features` is the convolutional feature extractor (an nn.Module),
        # built elsewhere; it must produce a 512x7x7 feature map.
        super().__init__()
        self.features = features
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten to (N, 512 * 7 * 7)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # He initialization, suited to ReLU activations
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)
```
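
The `features` argument is not defined in the code above. As a rough sketch, the 16-layer configuration from the VGG paper can be written as a list of channel widths with `"M"` marking 2x2 max pooling; the list name `VGG16_CFG` and the helper `describe` are illustrative assumptions, as is the 224x224 input size. Walking the list reproduces the `512 * 7 * 7` classifier input and the 138 million parameter figure.

```python
# Channel widths of the 3x3 conv layers; "M" marks a 2x2 max-pool (stride 2)
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def describe(cfg, in_channels=3, input_size=224, num_classes=1000):
    """Return (number of conv layers, flattened feature size, parameter count)."""
    conv_layers, params = 0, 0
    channels, size = in_channels, input_size
    for v in cfg:
        if v == "M":
            size //= 2                        # pooling halves the resolution
        else:
            params += channels * v * 9 + v    # 3x3 weights + biases
            channels = v
            conv_layers += 1
    flat = channels * size * size
    for c_in, c_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        params += c_in * c_out + c_out        # fully connected layers
    return conv_layers, flat, params


print(describe(VGG16_CFG))  # (13, 25088, 138357544)
```

The 3x3 convolutions use padding 1 and therefore keep the spatial size, so only the five pooling steps reduce 224 to 7.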

#### 4.1.4 InceptionNet
#### 4.1.4.1 InceptionNet v1
The winner of the ILSVRC 2014 competition was GoogLeNet (a.k.a. Inception v1) from Google. It achieved a top-5 error rate of 6.67%. This was very close to human-level performance, which the organisers of the challenge were now forced to evaluate. As it turns out, this was rather hard to do and required some human training in order to beat GoogLeNet's accuracy. After a few days of training, the human expert (Andrej Karpathy) was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The network used a CNN inspired by LeNet but implemented a novel element dubbed the inception module. This module is based on several very small convolutions in order to drastically reduce the number of parameters. The architecture consisted of a 22-layer deep CNN but reduced the number of parameters from 60 million (AlexNet) to 4 million. Within each module, convolution kernels of different sizes run in parallel so that features are extracted at different scales. Training used image distortions as augmentation; batch normalization and RMSprop belong to the later versions listed below.
![InceptionNet_1](./img/InceptionNet_1.jpg)
![InceptionNet_2](./img/InceptionNet_2.jpg)
* URL:
    * [InceptionNet_v1](http://arxiv.org/abs/1409.4842)

#### 4.1.4.2 InceptionNet v2, v3
* Add [Batch Normalization Layer](#bn_layer)
* The 5x5 convolutional layers are replaced by two consecutive layers of 3x3 convolutions with up to 128 filters to reduce the number of parameters needed.
* Factorize a 2D convolution into two 1D convolutions. A 7x7 convolution is factorized into two convolutions (1x7, 7x1), and a 3x3 convolution into (1x3, 3x1). This approximates the 2D convolution with fewer parameters, and the saved computation can be used to go deeper.
![InceptionNet_3](./img/InceptionNet_3.jpg)
* URL:
    * [InceptionNet_v2_3](http://arxiv.org/abs/1512.00567)
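
The savings from both factorizations can be counted directly. The sketch below compares weight counts (biases ignored) for a layer with the same number of input and output channels; the channel count of 64 is an arbitrary assumption and the helper name is hypothetical.

```python
def conv_weights(k_h, k_w, channels):
    """Weights of a k_h x k_w convolution with `channels` inputs and outputs."""
    return k_h * k_w * channels * channels


c = 64  # arbitrary channel count for illustration

five_by_five = conv_weights(5, 5, c)
two_three_by_three = 2 * conv_weights(3, 3, c)
print(five_by_five, two_three_by_three)   # 102400 73728

seven_by_seven = conv_weights(7, 7, c)
asymmetric_seven = conv_weights(1, 7, c) + conv_weights(7, 1, c)
print(seven_by_seven, asymmetric_seven)   # 200704 57344

three_by_three = conv_weights(3, 3, c)
asymmetric_three = conv_weights(1, 3, c) + conv_weights(3, 1, c)
print(three_by_three, asymmetric_three)   # 36864 24576
```

Two stacked 3x3 layers cover the same 5x5 receptive field with 18/25 of the weights, and the 1x7 plus 7x1 pair needs 14/49 of the weights of a full 7x7 kernel.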
    
#### 4.1.4.3 InceptionNet v4
Uses residual connections to enhance the performance of InceptionNet.
![InceptionNet_4](./img/InceptionNet_4.jpg)
* URL:
    * [InceptionNet_v4](https://arxiv.org/abs/1602.07261)
    * [Pytorch](https://github.com/pytorch/vision/blob/master/torchvision/models/inception.py)

#### 4.1.5 ResNet
At ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming He et al. introduced a novel architecture with “skip connections” and heavy use of batch normalization. Such skip connections are similar in spirit to the gated units (for example gated recurrent units) that have been applied successfully in RNNs. Thanks to this technique they were able to train a network with 152 layers while still having lower complexity than VGGNet. It achieves a top-5 error rate of 3.57%, which beats human-level performance on this dataset.
ResNet mitigates the vanishing gradient issue, making it possible to go far deeper, to 100 or even 1000 layers.
![ResNet_1](./img/ResNet_1.jpg)
![ResNet_2](./img/ResNet_2.jpg)
* URL:
    * [ResNet](https://arxiv.org/abs/1512.03385)
    * [Pytorch](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py)
    * [Variants](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035)
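
Why skip connections help against vanishing gradients can be illustrated with a scalar toy model (a simplification for intuition, not taken from the paper). For a plain layer $y = f(x)$ the local gradient is $f'(x)$, while for a residual layer $y = x + f(x)$ it is $1 + f'(x)$, so the product over many layers no longer collapses towards zero. The function names and the value 0.01 are illustrative assumptions.

```python
import math


def plain_gradient(local_grads):
    """Gradient through a stack of plain layers y = f(x): product of f'(x)."""
    return math.prod(local_grads)


def residual_gradient(local_grads):
    """Gradient through residual layers y = x + f(x): product of 1 + f'(x)."""
    return math.prod(1.0 + g for g in local_grads)


local_grads = [0.01] * 50  # 50 layers, each with a small local gradient

plain = plain_gradient(local_grads)
residual = residual_gradient(local_grads)
print(plain)     # vanishingly small, on the order of 1e-100
print(residual)  # stays of order one (about 1.64)
```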

#### 4.1.5.1 WideResNet
Because ResNet is so deep, many residual blocks contribute only limited information. A 16-layer WideResNet can match the result of a 1000-layer ResNet. The main contribution of ResNet is the residual block; going very deep is not necessary. WideResNet and the basic ResNet have similar numbers of parameters, but owing to GPU parallel computation WideResNet is about 8 times faster than ResNet.
* Width, i.e. the number of channels, enhances the performance
* Increasing depth and width helps until the parameter space is too large to regularize
* For the same number of parameters, increasing width is better than increasing depth
![WideResNet](./img/WideResNet.jpg)
* URL:
    * [WideResNet](https://arxiv.org/abs/1605.07146)
    * [Pytorch](https://github.com/szagoruyko/wide-residual-networks/blob/master/pytorch/resnet.py)

#### 4.1.6 XceptionNet
The hypothesis: "cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly." Instead of partitioning the input data into several compressed chunks, it maps the spatial correlations of each channel separately (a depthwise convolution), and then performs a 1x1 pointwise convolution to capture cross-channel correlations.
![XceptionNet](./img/XceptionNet.jpg)
* URL:
    * [XceptionNet](https://arxiv.org/abs/1610.02357)
    * [Pytorch](https://github.com/tstandley/Xception-PyTorch/blob/master/xception.py)
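
Decoupling spatial and cross-channel mapping also saves parameters. The sketch below compares the weight count (biases ignored) of a standard convolution with that of a per-channel spatial convolution followed by a 1x1 convolution; the channel sizes are arbitrary assumptions and the function names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, k):
    """One k x k kernel per (input channel, output channel) pair."""
    return k * k * c_in * c_out


def separable_conv_weights(c_in, c_out, k):
    """One k x k kernel per input channel, then a 1x1 conv mixing channels."""
    spatial = k * k * c_in
    cross_channel = c_in * c_out
    return spatial + cross_channel


standard = standard_conv_weights(128, 256, 3)
separable = separable_conv_weights(128, 256, 3)
print(standard)   # 294912
print(separable)  # 33920
```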

#### 4.1.7 ShuffleNet
ShuffleNet is an efficient convolutional neural network architecture for mobile devices. Grouped convolution is a variant of convolution in which the channels of the input feature map are split into groups and convolution is performed independently on each group. Channel shuffle is an operation (layer) that changes the order of the channels so that information can flow between groups; it is implemented with a tensor reshape and transpose. The main contribution of this kind of network is to reduce the number of parameters and the amount of computation while maintaining performance. Such models can be used in resource-scarce scenarios like mobile applications.

![ShuffleNet](./img/ShuffleNet.jpg)
* URL:
    * [ShuffleNet](https://arxiv.org/abs/1707.01083)
    * [Group](https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d)
    * [Pytorch](https://github.com/kuangliu/pytorch-cifar/blob/master/models/shufflenet.py)
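
The reshape-and-transpose trick behind channel shuffle can be shown on a plain list of channel indices. This is a minimal sketch that ignores the batch and spatial dimensions; the function name `channel_shuffle` is chosen here for illustration.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups via reshape -> transpose -> flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # reshape: (groups * per_group,) -> (groups, per_group)
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # transpose to (per_group, groups), then flatten back to one list
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, every group of the next grouped convolution sees channels that originated from all groups of the previous one.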

#### 4.1.8 SENet
SENet introduces a building block for CNNs that models channel interdependencies at almost no computational cost. SE blocks were used in the winning ILSVRC 2017 entry and helped to improve on the previous year's result by 25%. Besides this large performance boost, they can easily be added to existing architectures. The main idea is this:
Let’s add parameters to each channel of a convolutional block so that the network can adaptively adjust the weighting of each feature map.
The block takes a convolutional feature map and its current number of channels as input. We squeeze each channel to a single numeric value using average pooling. A fully connected layer followed by a ReLU function adds the necessary nonlinearity; its output dimension is also reduced by a certain ratio. A second fully connected layer followed by a Sigmoid activation gives each channel a smooth gating function. Finally, we weight each feature map of the convolutional block by the result of this side network.
![SENet_1](./img/SENet_1.jpg)
![SENet_2](./img/SENet_2.jpg)
* URL:
    * [SENet](https://arxiv.org/abs/1709.01507)
    * [Pytorch](https://github.com/moskomule/senet.pytorch)
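
The squeeze, excitation, and scaling steps described above can be sketched in pure Python for a single sample. Biases are omitted for brevity, and the weights `w1` and `w2` are toy values chosen for illustration; the function name `se_block` is hypothetical.

```python
import math


def se_block(feature_maps, w1, w2):
    """Squeeze-and-excitation on C channels, each a 2D list (H x W).

    w1: (C / r) x C weights of the reducing fully connected layer
    w2: C x (C / r) weights of the expanding fully connected layer
    """
    # Squeeze: global average pooling, one value per channel
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation, part 1: fully connected layer + ReLU (reduced dimension)
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, part 2: fully connected layer + sigmoid gate per channel
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: weight every feature map by its gate
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates


feature_maps = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
                [[3.0, 3.0], [3.0, 3.0]]]   # channel 1
w1 = [[0.5, 0.5]]        # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [1.0]]      # expand back to 2 gates

scaled, gates = se_block(feature_maps, w1, w2)
print(gates[0])  # 0.5, since the first gate sees a zero weight
```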

<a id="recurrent"></a>
### 4.2 Recurrent

<a id="loss"></a>
## 5. Loss
### 5.1 Regression
#### 5.1.1 MAELoss
* Other names: L1Loss
\begin{equation*}
\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad
l_n = \left| x_n - y_n \right|,
\end{equation*}
* URL: 
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#torch.nn.L1Loss)
    
#### 5.1.2 MSELoss
* Other names: L2Loss
\begin{equation*}
\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad
l_n = \left( x_n - y_n \right)^2,
\end{equation*}
* URL: 
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#torch.nn.MSELoss)
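
Both regression losses above are element-wise and are usually reduced to a single number, for example by averaging. A minimal sketch with hypothetical helper names and toy values:

```python
def l1_loss(x, y):
    """Element-wise absolute error l_n = |x_n - y_n|."""
    return [abs(a - b) for a, b in zip(x, y)]


def mse_loss(x, y):
    """Element-wise squared error l_n = (x_n - y_n)^2."""
    return [(a - b) ** 2 for a, b in zip(x, y)]


def mean(values):
    return sum(values) / len(values)


x = [1.0, 2.0, 4.0]   # predictions
y = [1.0, 0.0, 1.0]   # targets

print(l1_loss(x, y))   # [0.0, 2.0, 3.0]
print(mse_loss(x, y))  # [0.0, 4.0, 9.0]
print(mean(l1_loss(x, y)), mean(mse_loss(x, y)))
```

The squared error penalizes the largest deviation (3.0 versus 9.0) much more strongly, which makes MSE more sensitive to outliers than MAE.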
    
### 5.2 Classification
#### 5.2.1 CrossEntropyLoss
This loss combines LogSoftmax and NLLLoss in a single loss
\begin{equation*}
\text{loss}(x, class) = weight[class] \left(-x[class] + \log\left(\sum_j \exp(x[j])\right)\right)
\end{equation*}
* URL: 
    * [Pytorch](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss)
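
The formula above is the negative log-softmax of the target class, optionally scaled by a class weight. The sketch below computes it for a single sample with a numerically stable log-sum-exp; the function names and the example logits are illustrative assumptions.

```python
import math


def log_softmax(x):
    """Stable log-softmax: subtract the maximum before exponentiating."""
    m = max(x)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_sum_exp for v in x]


def cross_entropy(x, target, weight=None):
    """weight[target] * (-x[target] + log(sum_j exp(x[j]))) for one sample."""
    w = 1.0 if weight is None else weight[target]
    return -w * log_softmax(x)[target]


logits = [2.0, 1.0, 0.1]
loss = cross_entropy(logits, target=0)

# Direct evaluation of the formula for comparison
direct = -logits[0] + math.log(sum(math.exp(v) for v in logits))
print(loss, direct)
```

With two equal logits the loss is $\log 2$, the value for a uniform guess between two classes.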

<a id="activation"></a>
## 6. Activation
### 6.1 ReLU
\begin{equation*}
\text{ReLU}(x)= \max(0, x)
\end{equation*}

<a id="organic"></a>
## 7. Organic Machine

Organic Machine is a coherent artificial intelligence system that dynamically adjusts its parameters and architecture to adapt to various signals and problem scenarios, and ultimately mimics or surpasses human intelligence.

<a id="signal"></a>
### 7.1 Signal

<a id="objective"></a>
### 7.2 Objective

<a id="achitecture"></a>
### 7.3 Architecture

#### 7.3.1 Exploring Neural Network
Neural networks usually have a lot of parameters. There is a trade-off among parameters, computation, and performance. The rule of thumb when exploring neural network architectures is to reduce the number of parameters and the computation cost while achieving high performance.

Classical feature selection aims to find the features that are most correlated with the target while being least correlated with each other. We often need a distance metric to evaluate these two kinds of "correlation". We also need to incorporate the metric into a search algorithm, since the number of possible discrete states of $K$ features is $2^K$. Even if the overhead of computing the distance metric is small, it becomes hard to explore neural network architectures under limited computation.

Owing to scarce server computation, interval estimation by Monte Carlo sampling is also unattractive. Variational Inference is marginally acceptable if the number of parameters to sample is small, since even a single forward pass can create non-negligible overhead. The compromise between Bayesian and Frequentist approaches is more of a compromise of computation here. So far, the Bayesian Deep Learning framework is gradually gaining attention for the sake of inference and interpretation, but we still lack a framework that can efficiently switch between point and interval estimates. Classical Bayesian Variable Selection techniques are therefore not suitable here because of their computation cost. People are trying to use a relaxation of the discrete distribution, called the concrete distribution, to compute the gradient directly. Applying this trick to Dropout gives the so-called concrete Dropout, which adaptively adjusts the Dropout rate in a Variational Inference framework. However, since Dropout is more of a data-augmentation technique, its usage in variable selection has not been addressed.

Rationally, the most widely used technique in high-dimensional data analysis should be appropriate in this setting. The so-called LASSO and its variants have been attractive to researchers since the beginning. LASSO can be applied naturally in a Gradient Descent algorithm, and it sparsifies the parameters. It takes into account not only performance but also computation. However, it introduces another hyperparameter that is hard to tune. Using a search algorithm like LARS is also unsuitable owing to its computation cost. Besides, we still need to compute the gradient of each parameter in order to select features. Although LASSO addresses the computation load in a fairly high-dimensional problem, it fails when the parameter space is allowed to expand. Another gradient-based method uses a sigmoid gate to weight features, where the weight is determined by trainable parameters. Although this method performs well in LSTM and SENet, introducing more parameters to solve a feature selection problem is neither scalable nor efficient.

It seems that computation is the heavier load to balance. The technique we need has to roughly satisfy the following criteria:
1. Select features that are most correlated with the target
2. Select features that are least correlated with each other, or perform a decorrelation transformation
3. The selection process should not involve any searching, sampling, or gradient-based algorithms that do not scale well with the number of parameters
4. After applying selection and Gradient Descent, we also need to explore new parameter space by initializing new parameters or perturbing existing ones

Principal Component Analysis (PCA) is a simple dimension reduction tool. Its heaviest computation is a single eigenvalue decomposition. With batch data and a finite number of features, PCA is possibly the cheapest way to select features. We can also decorrelate the features with whitening. This greatly alleviates the computation load and satisfies the 2nd and 3rd criteria. To address the 1st criterion, it is possible to first screen out features whose correlation with the target is less than a threshold. At each layer, the number of input features is determined by the representation of the previous layer, and the number of output features is determined by the available computation power. In this way, the number of features at every layer of the neural network can be determined dynamically with respect to a trade-off between performance and computation. The details are as follows:


**Hyperparameter**: $\epsilon$, PCA cutoff ratio $c_{pca}$, exploration cutoff ratio $c_{p}$, running average momentum $m$  
**For** iteration $e$   
**Input**: Exploitation parameter set $P_{e},|P_{e}|=p$, Exploration parameter set $Q_{e},|Q_{e}|=q$, batch signal $X_{b\times (p+q)}^e$, expected mean $\bar{\mu}$, expected projection matrix $\bar{\Sigma}_{p\times p}^{-1/2}$  
**Output**: batch signal after selection $\tilde{X}_{b\times k}^{e}$  
**Forward Pass**  
**1.** $X_{b\times p}^{e},P_{e},Q_{e} = $ pre-screen$(X^{e},P_{e}+Q_{e},c_{p})$  
**2.** $\tilde{\mu}_{1\dots p} = \frac{1}{b}\sum\limits_{i=1}^{b}X_{i,1\dots p}^{e}$  
**3.** $\tilde{\Sigma} = \frac{1}{b}\sum\limits_{i=1}^{b}(X_{i}^{e}-\tilde{\mu})^{T}(X_{i}^{e}-\tilde{\mu})+\epsilon I$  
**4.** $\tilde{\Sigma} = USV^T$  
**5.** $\lambda_{1 \dots p} = diag(S)^2$  
**6.** select $k$ principal components $\text{s.t. } \frac{\sum\limits_{i=1}^{k}\lambda_{i}}{\sum\limits_{i=1}^{p}\lambda_{i}}<c_{pca}$   
**7.** $\tilde{X} = U_{1 \dots k}S_{1 \dots k}(V^TX^{e})_{1 \dots k}$  
**8.** $\bar{\mu} = (1-m)\bar{\mu} + m\tilde{\mu}$  
**9.** $\bar{\Sigma} = (1-m)\bar{\Sigma} + m\tilde{\Sigma}$  
**Backward Pass**  
**10.** Update $P$ with Back-Propagation $\nabla L_P$  
**11.** Update $Q$ by reinitialization, sampling from distribution $p(\nabla L_P)$, or perturbing with random noise (cheaper than computing gradient)
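
Parts of the forward pass can be sketched in pure Python. The block below covers only the batch statistics (steps 2 and 3), the cutoff rule (step 6), and the running average (step 8); the decomposition itself (step 4) is left to a linear algebra library. All function names are hypothetical, and keeping at least one component in `select_k` is an added assumption, not part of the listed procedure.

```python
def batch_mean(X):
    """Step 2: per-feature mean over the batch (X is a list of rows)."""
    b = len(X)
    return [sum(col) / b for col in zip(*X)]


def batch_covariance(X, eps):
    """Step 3: biased batch covariance plus eps on the diagonal."""
    mu = batch_mean(X)
    b, p = len(X), len(mu)
    centered = [[v - m for v, m in zip(row, mu)] for row in X]
    return [[sum(r[i] * r[j] for r in centered) / b + (eps if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def select_k(eigenvalues, c_pca):
    """Step 6: largest k whose cumulative ratio stays below c_pca.

    Assumption: at least one component is always kept.
    """
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    k, running = 0, 0.0
    for value in lam:
        if (running + value) / total >= c_pca:
            break
        running += value
        k += 1
    return max(k, 1)


def running_average(old, new, m):
    """Step 8: exponential running average with momentum m."""
    return [(1 - m) * o + m * n for o, n in zip(old, new)]


X = [[1.0, 2.0], [3.0, 6.0]]                 # batch of b = 2 samples, p = 2 features
print(batch_mean(X))                         # [2.0, 4.0]
print(batch_covariance(X, eps=0.0))          # [[1.0, 2.0], [2.0, 4.0]]
print(select_k([5.0, 3.0, 1.5, 0.5], 0.9))   # 2
```

The running statistics from steps 8 and 9 would presumably replace the batch statistics at inference time, in the same way batch normalization uses its running mean and variance; the source does not state this explicitly.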