# Part 2: Deep Learning: Modern Practices

# Chapter 6: Deep Feedforward Networks

Also called multilayer perceptrons (MLPs) <br>
The goal of the network is to approximate some function $f^*$ <br>
The network is a chain of functions: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ <br>
^ $f^{(1)}$ is the first layer, $f^{(2)}$ is the second layer, $f^{(3)}$ is the third layer, and so on <br>
^ The number of layers in the chain is the <b>depth</b> of the model
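The chain structure can be sketched in a few lines of Python. The three scalar "layers" below are hypothetical and chosen only for illustration; a real layer would map vectors to vectors.

```python
from functools import reduce


def compose(*layers):
    """Return a function that applies the layers in order (first layer first)."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)


# Hypothetical scalar layers, not taken from the notes.
f1 = lambda x: 2.0 * x + 1.0   # first layer
f2 = lambda x: max(0.0, x)     # second layer
f3 = lambda x: x - 3.0         # third layer

f = compose(f1, f2, f3)        # f(x) = f3(f2(f1(x)))
depth = 3                      # number of functions in the chain
```

Calling `f(1.0)` evaluates the layers from the inside out, exactly as the nested formula reads.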

Dimensionality of these hidden layers determines the width of the model <br>

We want a non-linear function because a linear model is limited: a composition of linear layers is still linear, so it can only represent linear functions of the input

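We want $f$ to be as close as possible to $f^*$; $θ$ denotes the trainable parameters <br>
The Mean Squared Error (MSE) loss function: <br>
$J(θ) = \frac{1}{|X|}\sum_{x∈X} (f^*(x) - f(x; θ))^2$ <br>
(the textbook's XOR example has four training points, so the factor there is $\frac{1}{4}$)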
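A minimal sketch of this loss for scalar outputs. The function name `mse` and the sample values are assumptions for illustration; with four examples, the averaging factor is $\frac{1}{4}$.

```python
def mse(targets, predictions):
    """Mean squared error between two equal-length sequences of scalars."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(targets)


# Hypothetical targets f*(x) and model outputs f(x; theta) on four examples.
targets = [0.0, 1.0, 1.0, 0.0]
predictions = [0.5, 0.5, 0.5, 0.5]
loss = mse(targets, predictions)
```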

If the model is linear: $f(x; w, b) = x^Tw + b$

ReLU: $g(z) = \max(0, z)$
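To see why the non-linearity matters, here is a sketch on the XOR problem, the textbook's running example for this chapter. The weights are hand-picked for illustration rather than learned, and the function names are assumptions.

```python
def relu(z):
    return max(0.0, z)


def relu_network(x):
    """One hidden layer of two ReLU units followed by a linear output."""
    W = [[1.0, 1.0], [1.0, 1.0]]   # hidden weights (hand-picked)
    c = [0.0, -1.0]                # hidden biases
    w = [1.0, -2.0]                # output weights
    b = 0.0                        # output bias
    h = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    return sum(w[j] * h[j] for j in range(2)) + b


def linear_model(x, w=(0.0, 0.0), b=0.5):
    """f(x; w, b) = x^T w + b, with parameters that output 0.5 everywhere."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
relu_outputs = [relu_network(x) for x, _ in xor_data]
linear_outputs = [linear_model(x) for x, _ in xor_data]
```

The ReLU network reproduces all four XOR targets, while this linear model outputs the same value for every input.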

## How a Model processes Batches of Data

Let the input be $X$, a design matrix in which each row is one input example

Multiply the input matrix by the weight matrix, then add the bias vector to each row: $XW + b$ (the linear part of the layer)
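A minimal sketch of this batch computation using plain lists. The name `affine_batch` and the example numbers are assumptions; a numerical library would do the same thing with one matrix product.

```python
def affine_batch(X, W, b):
    """Compute X W + b, where each row of X is one input example.

    X has shape (batch, n_in), W has shape (n_in, n_out), b has length n_out.
    The bias vector is added to every row of the product.
    """
    n_in, n_out = len(W), len(b)
    outputs = []
    for row in X:
        if len(row) != n_in:
            raise ValueError("each input row must have n_in entries")
        outputs.append(
            [sum(row[i] * W[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]
        )
    return outputs


# Hypothetical batch of two examples with two features each.
X = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [10.0, 20.0]
Z = affine_batch(X, W, b)
```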

## 6.2: Gradient Based Learning

The non-linearity of a neural network causes most interesting loss functions to become non-convex. <br>
Unlike convex optimization, which converges from any starting point, stochastic gradient descent applied to non-convex loss functions has no convergence guarantee and is sensitive to the values of the initial parameters.
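The sensitivity to initial parameters can be illustrated on a one-dimensional toy problem. This is a sketch using plain (deterministic) gradient descent rather than SGD, and the non-convex function $θ^4 - 3θ^2 + θ$ is a hypothetical stand-in for a network's loss.

```python
def loss(theta):
    """Toy non-convex loss with two local minima."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def grad(theta):
    """Derivative of the toy loss."""
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def gradient_descent(theta, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        theta -= learning_rate * grad(theta)
    return theta


# Same algorithm and step size, two different initial values.
theta_left = gradient_descent(-1.0)
theta_right = gradient_descent(1.0)
```

The two runs settle in different minima, and the one reached from `1.0` has the higher loss.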

## 6.2.1: Cost Functions

Negative log-likelihood: <br>
$J(θ) = -E_{x,y \sim \hat{p}_{data}} \log p_{model}(y | x)$ <br>
For example, with a Gaussian model: $p_{model}(y | x) = N(y; f(x; θ), I)$
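As a worked step, substituting the Gaussian model into the negative log-likelihood recovers mean squared error. Here $n$ (the dimension of $y$) is notation introduced only for this derivation:

$$\log N(y; f(x; θ), I) = -\frac{1}{2} \lVert y - f(x; θ) \rVert^2 - \frac{n}{2} \log(2π)$$

$$J(θ) = \frac{1}{2} E_{x,y \sim \hat{p}_{data}} \lVert y - f(x; θ) \rVert^2 + \text{const}$$

The constant does not depend on $θ$, so minimizing this cost is equivalent to minimizing the mean squared error.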

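Cross-entropy cost functions are more popular than mean squared error or mean absolute error because those two often lead to poor results when used with gradient-based optimization (output units that saturate produce very small gradients when combined with these losses)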

## 6.2.2: Output Units

## 6.2.2.1: Linear Units for Gaussian Output Distributions

## 6.2.2.2: Sigmoid Units for Bernoulli Output Distributions

$\hat{y} = σ(w^Th + b)$ <br>
where σ is the logistic sigmoid function, which squashes the linear output into $(0, 1)$ so it can be read as a probability
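A minimal sketch of a sigmoid output unit. The branch on the sign of `z` is an implementation choice (not from the notes) that avoids overflow in `math.exp` for large negative inputs; `bernoulli_output` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe: z < 0, so exp(z) is below 1
    return ez / (1.0 + ez)


def bernoulli_output(h, w, b):
    """y_hat = sigmoid(w^T h + b) for a hidden feature vector h."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return sigmoid(z)


# Hypothetical hidden features and output-layer parameters.
y_hat = bernoulli_output(h=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```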

## 6.2.2.3: Softmax Units for Multinoulli Output Distributions

Softmax is used to represent a probability distribution over a discrete variable with $n$ possible values: $softmax(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
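A minimal sketch of the softmax function, $softmax(z)_i = \exp(z_i) / \sum_j \exp(z_j)$. Subtracting the maximum before exponentiating is a standard stabilization trick that leaves the result unchanged; the example scores are hypothetical.

```python
import math


def softmax(z):
    """Map a list of n real scores to n probabilities that sum to 1."""
    m = max(z)                                # shift so the largest score is 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a variable with three possible values.
probs = softmax([2.0, 1.0, 0.1])
```

The largest score always receives the largest probability, and shifting every score by the same constant gives the same distribution.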

## 6.2.2.4: Other Output Types

Neural Networks with Gaussian mixtures as their output are often called mixture density networks.

## 6.3: Hidden Units

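ReLU is a good default choice of hidden unit. The ReLU function is not differentiable at $z = 0$, but this rarely matters in practice: implementations return one of the one-sided derivatives.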

## 6.3.1: Rectified Linear Units and Their Generalizations

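ReLUs use the activation function $g(z) = \max(0, z)$

Generalizations use a non-zero slope $α_i$ when $z_i < 0$: $h_i = g(z, α)_i = \max(0, z_i) + α_i \min(0, z_i)$

Absolute value rectification: fixes $α_i = -1$, giving $g(z) = |z|$

Leaky ReLU: fixes $α_i$ to a small value such as 0.01

Parametric ReLU (PReLU): treats $α_i$ as a learnable parameter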
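The variants above can all be written as one function with a slope $α$ for negative inputs, $g(z, α) = \max(0, z) + α \min(0, z)$. This is a sketch; the function name is an assumption, and the leaky slope of 0.01 is just a commonly used value.

```python
def generalized_relu(z, alpha):
    """max(0, z) + alpha * min(0, z): slope 1 for z > 0, slope alpha for z < 0."""
    return max(0.0, z) + alpha * min(0.0, z)


z = -3.0
relu_value = generalized_relu(z, alpha=0.0)     # plain ReLU
abs_value = generalized_relu(z, alpha=-1.0)     # absolute value rectification
leaky_value = generalized_relu(z, alpha=0.01)   # leaky ReLU
# A parametric ReLU uses the same formula, with alpha updated during training.
```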

Rectified linear units and all these generalizations of them are based on the
principle that models are easier to optimize if their behavior is closer to linear

## 6.3.2: Logistic Sigmoid and Hyperbolic Tangent

Logistic Sigmoid Activation Function: $g(z) = σ(z)$ <br>
Hyperbolic Tangent Activation Function: $g(z) = \tanh(z)$ <br>
These two functions are closely related because $\tanh(z) = 2σ(2z) - 1$
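The identity can be checked numerically. This is a quick sketch; the sample points are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Largest gap between tanh(z) and 2 * sigmoid(2z) - 1 over a few sample points.
points = [-3.0, -1.0, -0.25, 0.0, 0.5, 2.0]
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in points)
```

The gap is at the level of floating-point rounding, so tanh is a rescaled and shifted sigmoid.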

## 6.3.3: Other Hidden Units

It is possible to have no activation function at all (Identity Function) <br>
A few other reasonably common hidden unit types include: <br>
1) Radial basis function (RBF) unit <br>
2) Softplus <br>
3) Hard tanh <hr>
<b>NOTE FROM TEXTBOOK: Hidden unit design remains an active area of research, and many useful hidden
unit types remain to be discovered </b>
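The three unit types listed above can be sketched as follows, using their usual definitions: RBF $h = \exp(-\lVert c - x \rVert^2 / σ^2)$, softplus $\log(1 + e^a)$, and hard tanh $\max(-1, \min(1, a))$. The function names and the argument name `center` are assumptions for illustration.

```python
import math


def rbf_unit(x, center, sigma):
    """Radial basis function unit: largest (1.0) when x equals the center."""
    squared_distance = sum((ci - xi) ** 2 for ci, xi in zip(center, x))
    return math.exp(-squared_distance / sigma ** 2)


def softplus(a):
    """Smooth version of ReLU, log(1 + exp(a)), written to avoid overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def hard_tanh(a):
    """Shaped like tanh but piecewise linear and bounded to [-1, 1]."""
    return max(-1.0, min(1.0, a))
```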

## 6.4: Architecture Design

The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error <br><br>
Neural Networks can be described through depth of a network and width of each layer

## 6.4.1: Universal Approximation Properties and Depth

## 6.4.2: Other Architectural Considerations

## 6.5: Back-Propagation and Other Differentiation Algorithms

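Back-propagation lets information from the cost flow backward through the network in order to compute the gradients. The
back-propagation algorithm does so using a simple and inexpensive procedure. <br>
The back-propagation algorithm is often simply called backprop <hr>
Forward Propagation: From the input $x$ we produce an output $\hat{y}$ (y hat) from the MLP, and then a scalar cost $J(θ)$ <br>
Backward Propagation: Information from the cost flows backward through the network to compute the gradient of the cost with respect to the parameters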
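The two passes can be sketched on a very small network: one scalar input, a hidden layer of ReLU units, a linear output, and a squared-error cost. The network shape, parameter names, and cost are assumptions chosen to keep the example short.

```python
def forward(x, params):
    """Forward propagation: input x -> hidden activations h -> output y_hat."""
    W1, b1, w2, b2 = params
    z = [W1[j] * x + b1[j] for j in range(len(W1))]       # hidden pre-activations
    h = [max(0.0, zj) for zj in z]                        # ReLU
    y_hat = sum(w2[j] * h[j] for j in range(len(h))) + b2
    return z, h, y_hat


def loss_and_grads(x, y, params):
    """Backward propagation: apply the chain rule from the cost back to each parameter."""
    W1, b1, w2, b2 = params
    z, h, y_hat = forward(x, params)
    loss = 0.5 * (y_hat - y) ** 2

    d_y_hat = y_hat - y                                   # dJ/dy_hat
    d_w2 = [d_y_hat * hj for hj in h]                     # dJ/dw2
    d_b2 = d_y_hat                                        # dJ/db2
    d_h = [d_y_hat * wj for wj in w2]                     # dJ/dh
    d_z = [dh if zj > 0 else 0.0 for dh, zj in zip(d_h, z)]  # through the ReLU
    d_W1 = [dz * x for dz in d_z]                         # dJ/dW1
    d_b1 = list(d_z)                                      # dJ/db1
    return loss, (d_W1, d_b1, d_w2, d_b2)


# Hypothetical parameters and a single training example.
params = ([0.5, -1.0], [0.1, 0.2], [1.5, -0.7], 0.3)
loss, grads = loss_and_grads(x=2.0, y=1.0, params=params)
```

A useful sanity check for any hand-written backward pass is to compare one gradient entry against a finite-difference estimate of the same derivative.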

## 6.5.1: Computational Graphs

A node in the graph represents a variable (it can be a scalar, vector, matrix, tensor, or another type of variable)
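A minimal sketch of such a graph for scalar variables, with gradients propagated from the output back to the inputs. The `Node` class and the `add`, `mul`, and `backward` helpers are hypothetical names, not part of any library.

```python
class Node:
    """A variable in the graph, with links to the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent_node, local_derivative)
        self.grad = 0.0


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):             # depth-first topological sort
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):              # from the output toward the inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule


x, y = Node(3.0), Node(4.0)
z = add(mul(x, y), x)            # z = x * y + x
backward(z)
```

Because `x` feeds into two operations, its gradient is the sum of the contributions from both paths.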

## 6.5.2: Chain Rule of Calculus

## 6.5.3: Recursively Applying the Chain Rule to Obtain Backprop

## 6.5.4: Back-Propagation Computation in Fully Connected MLP

## 6.5.5: Symbol-to-Symbol Derivatives

## 6.5.6: General Back-Propagation

## 6.5.7: Example: Back-Propagation for MLP Training

## 6.5.8: Complications

Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values.

## 6.5.9: Differentiation outside the Deep Learning Community

The field of automatic differentiation is concerned with how to compute derivatives algorithmically.
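As one illustration of computing derivatives algorithmically, here is a sketch of forward-mode automatic differentiation with dual numbers. Note that this is a different technique from back-propagation, which corresponds to reverse-mode accumulation; the `Dual` class and `derivative` helper are hypothetical names and support only addition and multiplication.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, whose derivative is zero."""
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).deriv


slope = derivative(lambda t: 3 * t * t + 2 * t + 1, 2.0)   # d/dt of 3t^2 + 2t + 1
```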

## 6.5.10: Higher-Order Derivatives

## 6.6: Historical Notes

<b>Feedforward networks can be seen as efficient nonlinear function approximators
based on using gradient descent to minimize the error in a function approximation.
From this point of view, the modern feedforward network is the culmination of
centuries of progress on the general function approximation task.</b><br>
One algorithmic change that noticeably improved neural network performance was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in
the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community
and the machine learning community.