Neural Network
---

\begin{align} y = f(w^T\phi(x)) \end{align}

\begin{align} J(\theta) = \sum_{i=1}^{N} (h_\theta(x_i)-y_i)^2 \end{align}
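
As a minimal sketch of the two formulas above, the snippet below evaluates a single unit and its sum-of-squares cost in plain Python. It assumes an identity feature map (phi(x) = x) and a sigmoid for f; the function names `unit_output` and `squared_error_cost` are illustrative and do not come from the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(w, x, f=sigmoid):
    # y = f(w^T phi(x)); phi is assumed to be the identity map here
    return f(sum(wi * xi for wi, xi in zip(w, x)))

def squared_error_cost(w, xs, ys, f=sigmoid):
    # J = sum_i (h(x_i) - y_i)^2
    return sum((unit_output(w, x, f) - y) ** 2 for x, y in zip(xs, ys))

w = [0.0, 0.0]                      # illustrative weights
xs = [[1.0, 2.0], [3.0, -1.0]]      # two toy inputs
ys = [1.0, 0.0]                     # their targets
cost = squared_error_cost(w, xs, ys)
```

With all-zero weights every output is 0.5, so each sample contributes 0.25 to the cost.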

<img src="https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/Single-Perceptron.png?x31195" alt="Drawing" style="width: 500px;"/>

Relation with previous model
---

\begin{align} y = f(w^T\phi(x)) \end{align}

\begin{align} J(\theta) = \sum_{i=1}^{N} (h_\theta(x_i)-y_i)^2 \end{align}


### Perceptron Rule for SVM

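\begin{align} y = \operatorname{sgn}(w^{*T}x+b^*) \end{align}

The perceptron criterion sums only over the set $\mathcal{M}$ of misclassified samples:

\begin{align} E_p(w) = -\sum_{i \in \mathcal{M}} y_iw^Tx_i \end{align}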
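
A hedged sketch of how the criterion above turns into an update rule: taking a gradient step on E_p for one misclassified sample gives w <- w + lr * y_i * x_i. The bias update, the learning rate `lr`, and the toy data are assumptions made for illustration, with labels in {-1, +1}.

```python
def sgn(z):
    return 1 if z >= 0 else -1

def train_perceptron(xs, ys, lr=1.0, epochs=100):
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                # Misclassified (or on the boundary): step along y * x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every sample is on the correct side
    return w, b

# Linearly separable toy data
xs = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [1, 1, -1, -1]
w, b = train_perceptron(xs, ys)
predictions = [sgn(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in xs]
```

The perceptron stops at any separating hyperplane, whereas an SVM additionally looks for the one with the largest margin.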




<img src="https://image.slidesharecdn.com/2012mdsp-pr13supportvectormachine-130701022429-phpapp02/95/2012mdsp-pr13-support-vector-machine-5-638.jpg?cb=1372645656" alt="Drawing" style="width: 600px;"/>


Cost Function
---

### Probabilistic Supervised Learning - Logistic Regression

![](http://www.saedsayad.com/images/LogReg_1.png)

\begin{align} p(y = 1 |x; \theta) = \sigma(\theta^Tx) \end{align}
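
The formula above can be sketched in a few lines of plain Python. The parameter vector and input are made-up values chosen so that theta^T x = 0; `predict_proba` is an illustrative name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    # p(y = 1 | x; theta) = sigma(theta^T x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = [0.5, -0.25]   # illustrative parameters
x = [2.0, 4.0]         # theta^T x = 1.0 - 1.0 = 0
p1 = predict_proba(theta, x)
p0 = 1.0 - p1          # p(y = 0 | x; theta)
```

A point with theta^T x = 0 lies on the decision boundary, so both classes get probability 0.5.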


### Maximum Likelihood Estimation

\begin{align} \theta_{ML} = \underset{\theta}{\operatorname{argmax}} P(Y|X; \theta) \end{align}


### Information Theory

- Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
- Less likely events should have higher information content.
- Independent events should have additive information.

\begin{align} I(X) = -\log P(X) \end{align}
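
The three properties listed above can be checked numerically. This is a small sketch using natural logarithms (so information is measured in nats); the probabilities are arbitrary illustrative values.

```python
import math

def information(p):
    # I(x) = -log P(x), in nats
    return -math.log(p)

certain = information(1.0)        # a guaranteed event carries no information
likely = information(0.9)
unlikely = information(0.1)

# Independent events: P(a and b) = P(a) * P(b), so information adds up
p_a, p_b = 0.5, 0.25
joint = information(p_a * p_b)
separate = information(p_a) + information(p_b)
```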


### Shannon's Entropy

\begin{align} H(X) = E_{x \sim P}[I(X)] \end{align}
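
For a discrete distribution the expectation above becomes H(X) = -sum_x P(x) log P(x). The sketch below assumes natural logarithms and treats terms with P(x) = 0 as contributing zero; the example distributions are illustrative.

```python
import math

def entropy(probs):
    # H(X) = E[I(X)] = -sum_x P(x) log P(x); zero-probability terms are skipped
    return -sum(p * math.log(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])      # maximal for two outcomes: log 2
biased = entropy([0.9, 0.1])    # a more predictable coin has lower entropy
sure = entropy([1.0, 0.0])      # a deterministic outcome has zero entropy
```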

<img src="https://i.imgur.com/Pynf9sG.png" alt="Drawing" style="width: 200px;"/>


### Cross-Entropy

The expected information content of the estimated distribution Q(X), taken under the data distribution P(X).

\begin{align} H(P,Q) = H(P) + D_{KL}(P||Q) \end{align}

\begin{align} H(P,Q) = -E_{x \sim P}[\log Q(X)] \end{align}


### Kullback-Leibler divergence, KLD

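A function used to measure the difference between two probability distributions: here, the difference between the distribution of the observed data, P(x), and the distribution estimated by the model, Q(x).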

\begin{align} D_{KL}(P||Q) = E_{x \sim P}[\log\frac{P(X)}{Q(X)}] \end{align}
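
A quick numerical check of the decomposition H(P,Q) = H(P) + D_KL(P||Q) for discrete distributions. The sketch uses the sign convention H(P,Q) = -E_P[log Q], natural logarithms, and two made-up distributions `p` and `q`.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "data" distribution (illustrative)
q = [0.5, 0.3, 0.2]   # "model" distribution (illustrative)

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
```

The divergence is zero only when the two distributions match, and it is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.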


### Relation between Cross Entropy Loss and Maximum Likelihood Estimation
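
One way to sketch the connection, assuming N i.i.d. samples and writing $\hat{P}$ for the empirical data distribution and $Q_\theta$ for the model distribution (both symbols are introduced here for illustration only):

\begin{align} \theta_{ML} = \underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{N} Q_\theta(y_i|x_i) = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{N} \log Q_\theta(y_i|x_i) \end{align}

\begin{align} \theta_{ML} = \underset{\theta}{\operatorname{argmin}} \Big(-\frac{1}{N}\sum_{i=1}^{N} \log Q_\theta(y_i|x_i)\Big) = \underset{\theta}{\operatorname{argmin}} \; H(\hat{P}, Q_\theta) \end{align}

Since $H(\hat{P})$ does not depend on $\theta$, minimizing the cross-entropy $H(\hat{P}, Q_\theta)$ is the same as minimizing $D_{KL}(\hat{P}||Q_\theta)$, so under these assumptions the cross-entropy loss and maximum likelihood select the same parameters.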


Back Propagation
---


Generalization
---

<img src="https://image.slidesharecdn.com/random-170910154045/95/-53-638.jpg?cb=1505089848" alt="Drawing" style="width: 600px;"/>


### Learning Theory - VC Dimension (Vapnik–Chervonenkis dimension)

<img src="http://www.guochaoping.top/wp-content/uploads/2016/11/Sample_complexity_tradeoff.png" alt="Drawing" style="width: 600px;"/>

With probability at least $1-\delta$:

\begin{align} error_{test} \leq error_{training} + \sqrt{\frac{1}{N}\Big[D\Big(\log\Big(\frac{2N}{D}\Big)+1\Big)-\log\Big(\frac{\delta}{4}\Big)\Big]} \end{align}

\begin{align} D = d_{VC}:\:VC\:dimension\:(model\:complexity), \:\: N:\:number\:of\:samples \end{align}
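
To get a feel for the trade-off, the sketch below evaluates the square-root term of the bound. It assumes the commonly quoted form sqrt((D (log(2N/D) + 1) - log(delta/4)) / N), with D the VC dimension, N the number of samples, and delta the allowed failure probability; `vc_confidence_term` is an illustrative name and the numbers are arbitrary.

```python
import math

def vc_confidence_term(D, N, delta):
    # Gap between test and training error allowed by the bound (assumed form)
    return math.sqrt((D * (math.log(2 * N / D) + 1) - math.log(delta / 4)) / N)

small_data = vc_confidence_term(D=10, N=1_000, delta=0.05)
large_data = vc_confidence_term(D=10, N=100_000, delta=0.05)
complex_model = vc_confidence_term(D=100, N=1_000, delta=0.05)
```

More samples tighten the term, while a higher VC dimension at a fixed sample size loosens it.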


### Regularization

<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/Screen-Shot-2018-04-03-at-7.52.01-PM-e1522832332857.png" alt="Drawing" style="width: 700px;"/>

\begin{align} J(\theta) = \sum_{i=1}^{N} (h_\theta(x_i)-y_i)^2 + \lambda \|\theta\|_2^2 \end{align}

\begin{align} h_\theta(x_i) = \theta^Tx_i + b \end{align}

\begin{align} \theta^* = \arg\min_\theta J(\theta) \end{align}
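
A minimal worked case of the regularized objective, assuming a scalar parameter and no bias (b = 0). Setting dJ/dtheta = 0 then gives the closed form theta* = sum(x_i y_i) / (sum(x_i^2) + lambda). The function names and data are illustrative.

```python
def ridge_cost(theta, xs, ys, lam):
    # J(theta) = sum_i (theta * x_i - y_i)^2 + lam * theta^2
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) + lam * theta ** 2

def ridge_fit_1d(xs, ys, lam):
    # Closed-form minimizer for the scalar, bias-free case
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                       # generated by y = 2x

theta_unreg = ridge_fit_1d(xs, ys, 0.0)    # 28 / 14
theta_reg = ridge_fit_1d(xs, ys, 14.0)     # 28 / 28
```

The penalty shrinks the estimate toward zero: a larger lambda trades a worse fit on the training data for smaller weights.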


### Dropout


### Data Augmentation


### Hyperparameter Tuning - Cross Validation

<img src="http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png" alt="Drawing" style="width: 700px;"/>

<img src="https://i.stack.imgur.com/1fXzJ.png" alt="Drawing" style="width: 700px;"/>
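
A sketch of how k-fold splitting can be implemented on sample indices. It assumes no shuffling and contiguous folds; `k_fold_indices` is a hypothetical helper, not part of PyTorch or this notebook.

```python
def k_fold_indices(n_samples, k):
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds

folds = k_fold_indices(10, 3)
all_val = sorted(i for _, val_idx in folds for i in val_idx)
```

Each candidate hyperparameter setting would be trained once per fold and scored on that fold's validation indices, and the average score is used to compare settings.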



In [None]:
%matplotlib inline
#MNIST Tutorials

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

cuda = torch.cuda.is_available()
print('Using PyTorch version:', torch.__version__, 'CUDA:', cuda)

batch_size = 32

kwargs = {'num_workers': 1, 'pin_memory': True} if cuda else {}

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)

# Note: the MNIST test split (train=False) serves as the validation set here
validation_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=False, **kwargs)

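# Fetch a single batch to inspect tensor shapes and types
X_train, y_train = next(iter(train_loader))
print('X_train:', X_train.size(), 'type:', X_train.type())
print('y_train:', y_train.size(), 'type:', y_train.type())

# Show the first 10 digits of the batch with their labels
pltsize = 1
plt.figure(figsize=(10*pltsize, pltsize))

for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.axis('off')
    # Each image has shape (1, 28, 28); reshape drops the channel dimension
    plt.imshow(X_train[i].numpy().reshape(28, 28), cmap="gray")
    # .item() turns the 0-dim label tensor into a plain int for the title
    plt.title('Class: ' + str(y_train[i].item()))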


Training Techniques
---

### Early Stopping
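
A common way to implement early stopping is a patience counter on the validation loss, sketched below. The function name, the `patience` parameter, and the loss values are illustrative assumptions rather than anything defined in this notebook.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best: remember it and reset the patience counter
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Validation loss improves until epoch 2, then worsens for two epochs
stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.5], patience=2)
```

In a real training loop the model weights from `best_epoch` would be saved and restored after stopping. Note that a small patience can stop before a later improvement, as with the final 0.5 in this toy sequence.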

### Batch Normalization

### Curriculum Learning

### Supervised PreTraining