<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/3_Lecture_NN_tabular_data_DeeperDive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Nuts and Bolts - Neural Networks for Structured Data or Tabular data
Feed Forward Neural Network (FFNN) / Multi-layer Perceptrons (MLP)

References:
* Chapter 13, Probabilistic Machine Learning: An Introduction by Kevin Murphy  
* Bengio, Y., Practical recommendations for gradient-based training of deep architectures, 2012, https://arxiv.org/abs/1206.5533
* Nwankpa, C., et al., Activation Functions: Comparison of trends in Practice and Research for Deep Learning, 2018, https://doi.org/10.48550/arXiv.1811.03378


# **Recommended readings**
* Bengio, Y., Practical recommendations for gradient-based training of deep architectures, 2012, https://arxiv.org/abs/1206.5533
* Nwankpa, C., et al., Activation Functions: Comparison of trends in Practice and Research for Deep Learning, 2018, https://doi.org/10.48550/arXiv.1811.03378

---

### **Outline**
* Data handling
* Data preprocessing
* Split data into test and train  
* Batching of train set
* Activation functions
* Regularization (avoid overfitting)
* Optimizers
* Hyperparameter tuning
* Loss functions


---

## **Data preprocessing**
 * Data standardization, or Z-score normalization, rescales each feature to have a mean of 0 and a standard deviation of 1. It preserves the shape of the distribution but does not map the data to a fixed range.
 * Data normalization, such as Min-Max scaling, rescales each feature to a specific range, typically 0 to 1. The range is set by the most extreme values, so outliers can squeeze the remaining data into a narrow band. It is useful for algorithms that require data within a certain range.

 In deep learning, standardization is usually preferred: it centers each feature at 0 with unit scale and is less sensitive to outliers than Min-Max scaling.
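
The two rescalings above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data; the helper names `standardize` and `min_max_scale` are hypothetical and not taken from any library.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the column mean, divide by the column std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

def min_max_scale(X):
    """Rescale each column linearly so that its minimum is 0 and its maximum is 1."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)

# Hypothetical toy data: 5 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0],
              [5.0, 900.0]])

X_std = standardize(X)
X_mm = min_max_scale(X)

print("standardized mean/std:", X_std.mean(axis=0), X_std.std(axis=0))
print("min-max min/max:      ", X_mm.min(axis=0), X_mm.max(axis=0))
```

Both functions assume every column has non-zero spread; a constant column would need special handling.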

 ---

### **Train and test sets**
* It is typical to split the data into train and test sets
* Use the ‘train’ set to train the model
* Evaluate the trained model on the held-out ‘test’ set
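
A minimal sketch of such a split, assuming a random 80/20 partition (the fraction, the seed, and the helper name `train_test_split` are illustrative choices, not prescribed by these notes). It also shows one common convention: preprocessing statistics are computed on the train set only and then reused on the test set.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices and hold out a fraction of the rows as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 samples, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Standardize with statistics from the train set only, then apply them to the test set.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```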
---

## **Batching of train set**

* Batch: use the full train set for every gradient update
* Mini-batch: divide the train set into small mini-batches and update on one mini-batch at a time
  * Typically a power of two, $2^n$: 32, 64, 128, 256 (sizes that map well onto CPU/GPU architecture)
* Incremental/ online learning: single sample at a time

Bengio, Practical recommendations for gradient-based training of deep architectures, 2012, https://arxiv.org/abs/1206.5533
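
The three regimes differ only in how many samples are used per update. A minimal sketch of a shuffled mini-batch iterator (the name `iterate_minibatches` is hypothetical): setting `batch_size=len(X)` gives batch training and `batch_size=1` gives incremental/online learning.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, seed=0):
    """Yield (X_batch, y_batch) pairs that together cover one shuffled pass (epoch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Hypothetical data: 100 samples, 3 features.
X = np.random.default_rng(1).normal(size=(100, 3))
y = np.arange(100)

batch_sizes = [len(yb) for _, yb in iterate_minibatches(X, y, batch_size=32)]
print(batch_sizes)  # [32, 32, 32, 4]
```

The last mini-batch is smaller whenever the train set size is not a multiple of the batch size.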

---

## **Activation functions**
Commonly used functions
\begin{array}{|l|l|l|l|}
\hline
\textbf{Name} & \textbf{Definition} & \textbf{Range} & \textbf{Reference} \\
\hline
\text{Sigmoid} & \sigma(a) = \frac{1}{1 + e^{-a}} & [0, 1] &  \\
\hline
\text{Hyperbolic tangent} & \tanh(a) = 2\sigma(2a) - 1 & [-1, 1] &  \\
\hline
\text{Softplus} & \sigma_{+}(a) = \log(1 + e^{a}) & [0, \infty) & \text{[GBB11]} \\
\hline
\text{Rectified linear unit} & \mathrm{ReLU}(a) = \max(a, 0) & [0, \infty) & \text{[GBB11; KSH12]} \\
\hline
\text{Leaky ReLU} & \max(a, 0) + \alpha \min(a, 0) & (-\infty, \infty) & \text{[MHN13]} \\
\hline
\text{Exponential linear unit} & \max(a, 0) + \min\big(\alpha(e^{a} - 1), 0\big) & (-\infty, \infty) & \text{[CUH16]} \\
\hline
\text{Swish} & a \, \sigma(a) & (-\infty, \infty) & \text{[RZL17]} \\
\hline
\text{GELU} & a \, \Phi(a) & (-\infty, \infty) & \text{[HG16]} \\
\hline
\end{array}
*Table Source:* Reproduced from Table 13.4 PML: An Introduction by Murphy

<p align="center">
  <img src="https://raw.githubusercontent.com/probml/pml-book/main/book1-figures/Figure_13.14_A.png" width="45%" />
  <img src="https://raw.githubusercontent.com/probml/pml-book/main/book1-figures/Figure_13.14_B.png" width="45%" />
</p>

<p align="center"><em>Figure 13.14 from PML: An Introduction by Murphy (Directly embedded figure from the textbook's GitHub repository)</em></p>
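
The definitions in the table translate directly into NumPy. The sketch below follows the table's formulas; the default values of `alpha` are illustrative assumptions, and $\Phi$ (the standard normal CDF) is computed from the error function.

```python
import numpy as np
from math import erf, sqrt

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    # log(1 + e^a), computed in a numerically stable way
    return np.logaddexp(0.0, a)

def relu(a):
    return np.maximum(a, 0.0)

def leaky_relu(a, alpha=0.01):
    return np.maximum(a, 0.0) + alpha * np.minimum(a, 0.0)

def elu(a, alpha=1.0):
    return np.maximum(a, 0.0) + np.minimum(alpha * (np.exp(a) - 1.0), 0.0)

def swish(a):
    return a * sigmoid(a)

def gelu(a):
    # Phi(a) = 0.5 * (1 + erf(a / sqrt(2))) is the standard normal CDF
    Phi = 0.5 * (1.0 + np.vectorize(erf)(a / sqrt(2.0)))
    return a * Phi

a = np.linspace(-5.0, 5.0, 11)
for name, fn in [("sigmoid", sigmoid), ("tanh", np.tanh), ("softplus", softplus),
                 ("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu),
                 ("swish", swish), ("gelu", gelu)]:
    print(f"{name:>10}: min={fn(a).min():.3f}, max={fn(a).max():.3f}")
```

Evaluating `2 * sigmoid(2 * a) - 1` reproduces `np.tanh(a)`, which is the identity given in the table's tanh row.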


---

## **Vanishing/Exploding Gradient Issues**
When training very deep models, the gradient tends to become either very small (this is called the vanishing gradient problem) or very large (this is called the exploding gradient problem).

Solutions:
* Initialize weights to be neither too large nor too small
* The exploding gradient problem can be fixed by **gradient clipping**: $g' = \min\left(1,\frac{c}{\lVert g \rVert}\right)g$  
  * $g$ is the gradient at some layer; the $\min$ factor ensures that the norm of $g'$ is never greater than some constant $c$, and multiplying by $g$ ensures that $g'$ points in the same direction as $g$
* Use non-saturating activation functions
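
A minimal NumPy sketch of the clipping rule $g' = \min(1, c/\lVert g \rVert)\,g$. The function name `clip_gradient` is hypothetical; deep learning libraries ship their own utilities for this (for example `torch.nn.utils.clip_grad_norm_` in PyTorch, which clips the combined norm over all parameters).

```python
import numpy as np

def clip_gradient(g, c):
    """Rescale g so that its L2 norm is at most c, keeping its direction."""
    norm = np.linalg.norm(g)
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return scale * g

g_large = np.array([3.0, 4.0])   # norm 5, larger than c
g_small = np.array([0.3, 0.4])   # norm 0.5, smaller than c

clipped_large = clip_gradient(g_large, c=1.0)   # rescaled to [0.6, 0.8], norm 1
clipped_small = clip_gradient(g_small, c=1.0)   # left unchanged

print(clipped_large, np.linalg.norm(clipped_large))
print(clipped_small, np.linalg.norm(clipped_small))
```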


---



**Saturating functions: Sigmoid function and Tanh activation functions**

The sigmoid saturates at 1 for large positive inputs and at 0 for large negative inputs. The tanh function has a similar shape but saturates at -1 and +1.
In the saturated regimes, the gradient of the output with respect to the input is close to zero, so any gradient signal from higher layers cannot propagate back to earlier layers. This is called the vanishing gradient problem. The solution is to use **non-saturating** functions.
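
To see the saturation numerically, recall that the sigmoid derivative is $\sigma'(a) = \sigma(a)(1-\sigma(a))$, which peaks at 0.25 when $a=0$. The sketch below evaluates it at a few inputs and computes an upper bound on the product of these activation-derivative factors across a stack of sigmoid layers; the depth of 10 is an arbitrary illustration, and the weight matrices are ignored.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    """Derivative of the sigmoid: sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

for a in [0.0, 2.0, 5.0, 10.0]:
    print(f"a = {a:5.1f}   sigmoid'(a) = {sigmoid_grad(a):.2e}")

# Each sigmoid layer contributes a derivative factor of at most 0.25,
# so the product of these factors over `depth` layers shrinks geometrically.
depth = 10
upper_bound = 0.25 ** depth
print("upper bound over", depth, "layers:", upper_bound)
```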

---


**ReLU (Rectified Linear Unit)**

The ReLU function simply “turns off” negative inputs, and passes positive inputs unchanged.
$\mathrm{ReLU}(a) = \max(a, 0) = a\,\mathbb{I}(a>0)$, where $\mathbb{I}$ is the indicator function.  
$\mathrm{ReLU}'(a) = \mathbb{I}(a>0)$  
Suppose $z = \mathrm{ReLU}(a)$ with $a = Wx$. Then  
$\frac{\partial z}{\partial W}=\mathbb{I}(a>0)x^T$

**Dead ReLU problem**

When using ReLU, if the weights are initialized such that $a = Wx$ takes on large negative values, then the corresponding neurons output 0 and their gradient is also 0. The weights feeding those neurons are then never updated, so the neurons stay off and the signal dies out. This is called the **dead ReLU** problem.

**Leaky ReLU : Non-saturating version of ReLU**  
Overcomes the dead ReLU problem.  
$\mathrm{LReLU}(a; \alpha) = \max(\alpha a, a)$  
where $0 < \alpha < 1$. The slope of this function is 1 for positive inputs and $\alpha$ for negative inputs, thus ensuring that some signal is passed back to earlier layers even when the input is negative.

$\alpha$ can be treated as a **hyperparameter**.
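
A small numerical illustration of the dead ReLU problem and the Leaky ReLU fix. The weights and input are made up so that the first unit receives a large negative pre-activation; $\alpha = 0.1$ is an arbitrary choice. The gradient is written row by row as $\partial z_i / \partial W_{ij} = \mathbb{I}(a_i > 0)\, x_j$.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def relu_grad(a):
    return (a > 0).astype(float)

def leaky_relu(a, alpha=0.1):
    return np.maximum(alpha * a, a)

def leaky_relu_grad(a, alpha=0.1):
    return np.where(a > 0, 1.0, alpha)

x = np.array([1.0, 2.0])
W = np.array([[-3.0, -3.0],    # large negative weights: this unit is off for this input
              [0.5, 0.5]])
a = W @ x                      # pre-activations: [-9.0, 1.5]

# Row i holds the gradient of z_i with respect to row i of W.
dz_dW_relu = relu_grad(a)[:, None] * x[None, :]
dz_dW_leaky = leaky_relu_grad(a)[:, None] * x[None, :]

print("ReLU output:      ", relu(a))
print("ReLU dz/dW:       ", dz_dW_relu.tolist())
print("Leaky ReLU output:", leaky_relu(a))
print("Leaky ReLU dz/dW: ", dz_dW_leaky.tolist())
```

With ReLU the first row of the gradient is exactly zero, so that unit receives no update for this input; with Leaky ReLU it still receives a gradient scaled by $\alpha$.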

**Exponential Linear Unit (ELU)**
$$
\mathrm{ELU}(a; \alpha) =
\begin{cases}
\alpha \big(e^{a} - 1\big), & \text{if } a \leq 0 \\
a, & \text{if } a > 0
\end{cases}
$$

**Self-normalizing ELU**  
$\mathrm{SELU}(a; \alpha, \lambda) = \lambda\,\mathrm{ELU}(a; \alpha)$  
The authors (Klambauer et al., 2017) prove that, by setting $\alpha$ and $\lambda$ to carefully chosen values, this activation function ensures that the output of each layer is standardized (provided the input is also standardized), even without the use of techniques such as batchnorm.
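
A quick Monte Carlo check of the self-normalizing property, as a sketch: the constants below are the values of $\alpha$ and $\lambda$ published with the SELU paper (they do not appear in these notes), and the check only covers the idealized case where the pre-activations are exactly standard normal.

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017); not stated in these notes.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def elu(a, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(a > 0, a, alpha * (np.exp(np.minimum(a, 0.0)) - 1.0))

def selu(a):
    return SELU_LAMBDA * elu(a, SELU_ALPHA)

rng = np.random.default_rng(0)
a = rng.standard_normal(200_000)   # standardized pre-activations
z = selu(a)

print("mean of SELU output:    ", z.mean())
print("variance of SELU output:", z.var())
```

The sample mean and variance come out close to 0 and 1, which is the fixed point the constants were chosen to produce.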

---

## **Regularization- steps to avoid over-fitting**
1. **Early stopping**: stop the training procedure when the error on the validation set starts to increase.
2. **Weight decay**: add a penalty to the loss function based on the sum of the squares of the model's weights, i.e. L2 regularization (for plain SGD the two are equivalent). In the neural networks literature, this is called weight decay, since it encourages small weights, and hence simpler models, as in ridge regression. (This is equivalent to using a Gaussian prior for the weights, $\mathcal{N} (w|0, \alpha^2I)$, and biases, $\mathcal{N} (b|0, \beta^2I)$.)
   *  In NN packages, this is typically an input to the optimizer.
3. **Dropout**: turn off all the outgoing connections from each neuron with probability $p$ during training. This can dramatically reduce over-fitting and is widely used. Intuitively, each unit must learn to perform well even if some of the other units are missing at random.
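
Two of these ideas are easy to sketch in NumPy. The first pair of functions shows why an L2 penalty is called weight decay under plain SGD, assuming the penalty is written as $\frac{\lambda}{2}\lVert w \rVert^2$ (a convention chosen here for illustration). The second shows "inverted" dropout, one common variant in which the surviving units are rescaled by $1/(1-p)$ so that the expected activation is unchanged. All function names are hypothetical.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, lam):
    """SGD step on loss + (lam/2)*||w||^2: the penalty adds lam*w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_step_weight_decay(w, grad, lr, lam):
    """Same update, written as: shrink (decay) the weights, then take the plain step."""
    return (1.0 - lr * lam) * w - lr * grad

def dropout(z, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return z
    mask = rng.random(z.shape) >= p
    return z * mask / (1.0 - p)

w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
step_a = sgd_step_l2(w, grad, lr=0.1, lam=0.01)
step_b = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.01)
print("identical SGD updates:", np.allclose(step_a, step_b))

rng = np.random.default_rng(0)
z = np.ones(100_000)
z_drop = dropout(z, p=0.5, rng=rng)
print("fraction dropped:", np.mean(z_drop == 0.0))
print("mean activation after dropout:", z_drop.mean())
```

At test time `dropout(..., training=False)` returns the activations unchanged.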

___


## **Optimizers (See SLIDES on Canvas)**
Different options for search direction and learning rate (step size)
### Line search methods
Methods that pick a search direction and step in that direction with some step size. In ML, the step size is typically referred to as the learning rate.
* Gradient as search direction
  * SGD
  * SGD with momentum
  * SGD with Nesterov momentum
* Adaptive learning rate (step size)
  * AdaGrad
  * RMSProp
  * Adam
* Hessian (or an approximation to it) in the search direction
  * Newton's method
  * Conjugate gradients
  * BFGS
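
To make the first two families concrete, here is a minimal sketch of gradient descent, momentum, and Adam on a toy quadratic $f(w) = \frac{1}{2} w^T A w$. The problem, the step sizes, and the function names are illustrative assumptions; the full gradient is used, so there is no stochasticity, and the update rules follow the standard textbook forms rather than any particular library implementation.

```python
import numpy as np

A = np.diag([1.0, 10.0])   # toy ill-conditioned quadratic

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def run_sgd(w, lr=0.05, steps=500):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr=0.05, beta=0.9, steps=500):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)     # running accumulation of past gradients
        w = w - lr * v
    return w

def run_adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(w)           # first-moment (mean) estimate
    s = np.zeros_like(w)           # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        s_hat = s / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(s_hat) + eps)   # per-coordinate step size
    return w

w0 = np.array([2.0, 2.0])
final_losses = {
    "sgd": loss(run_sgd(w0.copy())),
    "momentum": loss(run_momentum(w0.copy())),
    "adam": loss(run_adam(w0.copy())),
}
print("initial loss:", loss(w0))
print(final_losses)
```

Note how Adam rescales each coordinate by its own gradient history, which is what "adaptive learning rate" refers to in the list above.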

### Trust-region methods
Methods that determine search direction and step-size together
* Levenberg-Marquardt

**See SLIDES on Canvas**

### References
[Algorithms](https://www.deeplearningbook.org/contents/optimization.html) Algorithms 8.1 to 8.7 from Chapter 8: Ian Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning  
[Computational implementation in PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD)  

[Visualization tool](https://github.com/lilipads/gradient_descent_viz)


(See convergence properties for optimizers in [1suppl_Mathematical Foundations of ML](https://github.com/chaitragopalappa/MIE590-690D/blob/main/suppl_files/1suppl_Mathematical_foundations_of_ML.ipynb))

---

### **Optimizers -Additional variants**

New variants are continually added. The best way to keep up is to check the options available in NN libraries (Keras, PyTorch, TensorFlow):
* [Keras](https://keras.io/api/optimizers/)  
* [PyTorch](https://pytorch.org/docs/stable/optim.html)  
* [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)  

Bengio, Practical recommendations for gradient-based training of deep architectures, 2012, https://arxiv.org/abs/1206.5533

---

## **Hyperparameter tuning summary -for convergence**
* Initialize with different initial weights (or use different random seeds if fixing the seed)
* Pick non-saturating activation functions if there are vanishing/exploding gradient issues
* Tune hyperparameters: parameters related to the optimizer and weight decay (regularization weights)
* Try different optimizers
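
The first and third points can be sketched as a small sweep over random seeds (initial weights) and learning rates. The one-dimensional loss below is a made-up non-convex function with one local and one global minimum, standing in for a real training run; `train` and the grid values are hypothetical.

```python
import numpy as np

def loss(w):
    # Toy non-convex loss: local minimum near w = 1.13, global minimum near w = -1.30
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    return 4 * w ** 3 - 6 * w + 1

def train(seed, lr, steps=1000):
    """Gradient descent from a seed-dependent random initial weight."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w, loss(w)

runs = []
for seed in range(8):
    for lr in (0.01, 0.05):
        w, final_loss = train(seed, lr)
        runs.append({"seed": seed, "lr": lr, "w": w, "loss": final_loss})

best = min(runs, key=lambda r: r["loss"])
for r in runs:
    print(r)
print("best run:", best)
```

Runs that start on the wrong side of the hump settle in the local minimum; keeping the best of several restarts recovers the global one. In practice the comparison should be made on a validation set rather than on the training loss.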

---


# **Architecture for regression v. classification**

## **Architecture: MLP on tabular data : regression problem**


 $y=f(\mathbf{x})$  
 Predict:  
 $\hat{y} = \mathbf{z}_L = \mathbf{b}_{L} +\mathbf{W}_{L}\mathbf{z}_{L-1}$ (linear output layer, no activation)  
 $\mathbf{z}_l=  \phi_l (\mathbf{b}_{l} +\mathbf{W}_{l}\mathbf{z}_{l-1})$ (matrix form) for $l=1:L-1$ hidden layers; $\mathbf{z}_0=\mathbf{x}$
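
A minimal NumPy sketch of this forward pass, assuming ReLU for every hidden activation $\phi_l$ and a He-style random initialization; both are illustrative choices, and the layer sizes are arbitrary.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_mlp(layer_sizes, seed=0):
    """Create a list of (W_l, b_l) pairs; W_l has shape (n_out, n_in)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # assumed init scheme
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def mlp_regression_forward(x, params):
    z = x                                   # z_0 = x
    for W, b in params[:-1]:                # hidden layers l = 1, ..., L-1
        z = relu(b + W @ z)
    W_L, b_L = params[-1]                   # output layer: linear, no activation
    return b_L + W_L @ z

params = init_mlp([4, 8, 8, 1])             # 4 inputs, two hidden layers, 1 output
x = np.array([0.5, -1.0, 2.0, 0.0])
y_hat = mlp_regression_forward(x, params)
print(y_hat)
```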

## **Architecture: MLP for tabular data : classification problem**

**Two classes**  
Predict:  
 $p(y|x; \theta) = \mathrm{Ber}(y|\sigma(a_L))$  
 $\sigma(a)$ is the sigmoid activation function  
$a_L = w^T_L z_{L-1} + b_L$  
 $\mathbf{z}_l=  \phi_l (\mathbf{b}_{l} +\mathbf{W}_{l}\mathbf{z}_{l-1})$ (matrix form) for $l=1:L-1$ hidden layers; $z_0=x$

 $a_L$, i.e., the output of the final layer, is called the **logit score**.  
 In statistics, a logit is the natural logarithm of the odds, $\log\frac{p}{1-p}$, where the odds are the ratio of the probability of an event occurring to the probability of it not occurring. The logit function transforms a probability (a value between 0 and 1) into a value that can range from negative to positive infinity. In deep learning, we do the inverse: we pass the logits through either a sigmoid (2-class problem) or a softmax (N-class problem) to convert the outputs into a probability distribution. Thus, the NN is effectively learning the logits.

 **Multi-class classification (> 2 classes)**  
Same as above, except that the logit scores are passed through a softmax function to generate a probability distribution that sums to 1. (A sigmoid only maps each value to between 0 and 1, so it is used only in the 2-class problem: if we know the probability of class 1, $p$, we can calculate the probability of class 2 as $1-p$.)

In NN packages, softmax is typically integrated into the cross-entropy loss; thus, when setting up the NN architecture, we need not pass the final layer through a softmax.
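
The relationship between logits, sigmoid, and softmax can be checked numerically. In this sketch the logit values are made up, and the softmax subtracts the maximum logit before exponentiating, a standard trick for numerical stability that does not change the result.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

def softmax(a):
    shifted = a - np.max(a, axis=-1, keepdims=True)   # stability shift
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Two classes: one logit a_L, probability of class 1 is sigmoid(a_L)
a_L = 0.8
p1 = sigmoid(a_L)
print("p(class 1) =", p1, " logit(p1) =", logit(p1))

# More than two classes: one logit per class, softmax gives probabilities summing to 1
logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)
print("class probabilities:", probs, " sum =", probs.sum())

# A 2-class softmax over logits [a_L, 0] reproduces the sigmoid
two_class = softmax(np.array([a_L, 0.0]))
print("softmax([a_L, 0]) =", two_class)
```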


 ---

# **Loss function for MLP regression v. classification**

## **Loss function for regression - MSE**
Objective function: $\min_{\mathbf{\theta}} \mathcal{L}(\mathbf{\theta}) = \min_{\mathbf{\theta}} \frac{1}{N}\lVert\mathbf{\hat{y}}-\mathbf{y}\rVert_2^2$, where $N$ is the number of observations

## **Loss function for classification- cross-entropy or log-loss**

* Cross-Entropy loss or log-loss
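  * Binary output (2 classes): Suppose the output $y_i$ takes values 1 or 0, and $p_i$ is the model-predicted probability that $y_i = 1$ for observation $i$; then the cross-entropy loss is
  $$-\frac{1}{N}\sum_{i=1}^N \big(y_i \log(p_i)+(1-y_i)\log(1-p_i)\big)$$

  * Categorical output (>2 classes): $y_{i,c}=1$ if the true class of observation $i$ is $c$ (and 0 otherwise), and $p_{i,c}$ is the model-predicted probability of class $c$; the loss for observation $i$ is
   $$ -\sum_c y_{i,c} \log(p_{i,c})$$
   which is averaged over the $N$ observations, as in the binary case.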
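
These losses can be written in a few lines of NumPy. The sketch below averages over observations, clips probabilities away from 0 to avoid `log(0)` (the `eps` value is an arbitrary assumption), and includes a variant that takes raw logits and integer class labels, which mirrors how NN packages typically fold softmax into the cross-entropy loss. Function names and toy numbers are illustrative.

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(p, y, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_cross_entropy(P, Y, eps=1e-12):
    """P: (N, C) predicted probabilities; Y: (N, C) one-hot true labels."""
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def cross_entropy_from_logits(logits, labels):
    """logits: (N, C) raw scores; labels: (N,) integer classes. Uses a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy binary example
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
bce = binary_cross_entropy(p, y)

# The same example written as a 2-class categorical problem (columns: class 0, class 1)
P2 = np.stack([1.0 - p, p], axis=1)
Y2 = np.stack([1.0 - y, y], axis=1)
cce = categorical_cross_entropy(P2, Y2)

print("MSE:", mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print("binary CE:", bce, " categorical CE (2 classes):", cce)
```

The binary and 2-class categorical values coincide, which is the sense in which the binary formula is a special case of the categorical one.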

---

# MLP for heteroskedastic regression
“Heteroskedastic” means that the predicted output variance is input-dependent. The model has two outputs, which compute $f_{\mu}(x) = \mathbb{E} [y|x, \theta]$ and $f_{\sigma}(x) = \sqrt{\mathbb{V} [y|x, \theta]}$.
Most of the layers (and hence parameters) can be shared between the two functions by using a common “backbone” and two output “heads”. A linear activation for the $\mu$ head and a softplus activation for the $\sigma$ head are typically used; the softplus keeps the predicted standard deviation positive.
![](https://brendanhasz.github.io/assets/img/dual-headed/TwoHeadedNet.svg)
Source: Brendan Hasz, Trip Duration Prediction using Bayesian Neural Networks and TensorFlow 2.0, https://brendanhasz.github.io/2019/07/23/bayesian-density-net.html
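
A minimal NumPy sketch of the two-headed forward pass, assuming a single shared ReLU hidden layer as the backbone and random, untrained weights. The Gaussian negative log-likelihood at the end is one natural training loss for such a model; it is an assumption added here and is not stated in the notes above. All names and sizes are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softplus(a):
    return np.logaddexp(0.0, a)   # log(1 + e^a), always positive

def init_two_headed(n_in, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "w_mu": rng.normal(0.0, np.sqrt(1.0 / n_hidden), size=n_hidden),
        "b_mu": 0.0,
        "w_sigma": rng.normal(0.0, np.sqrt(1.0 / n_hidden), size=n_hidden),
        "b_sigma": 0.0,
    }

def forward(X, p):
    H = relu(X @ p["W1"].T + p["b1"])                      # shared backbone
    mu = H @ p["w_mu"] + p["b_mu"]                          # mean head: linear
    sigma = softplus(H @ p["w_sigma"] + p["b_sigma"])       # std head: softplus
    return mu, sigma

def gaussian_nll(y, mu, sigma):
    """Average negative log-likelihood of y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2) + 0.5 * ((y - mu) / sigma) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))      # hypothetical batch: 16 samples, 3 features
y = rng.normal(size=16)

params = init_two_headed(n_in=3, n_hidden=8)
mu, sigma = forward(X, params)
print("mu shape:", mu.shape, " sigma shape:", sigma.shape)
print("smallest sigma:", sigma.min())
print("Gaussian NLL:", gaussian_nll(y, mu, sigma))
```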
