# Deep Learning from the Basics
## Python and Deep Learning: Theory and Implementation

#### Koki Saitoh
#### Packt Publishing 2021

- Data/Code available on GitHub, link in the book

## Stated Learning Objectives
- Use Python with minimum external libraries to implement DL programs
- Study various DL and NN theories
- Learn how to set initial values of weights
- Implement techniques such as batch normalization, dropout, and Adam
- Explore applications like automatic driving, image generation, and reinforcement learning

## Target Audience
- Data Scientists
- Data Analysts
- Developers

...who want to use DL to develop efficient solutions

- "This book is ideal for those who want a deeper understanding as well as an overview of the techs"

## Prerequisite Knowledge
- Some working knowledge of Python is a must
- numpy/pandas beneficial but not necessary

## Getting Started Notes/Impressions
- The book clearly states its goal, to implement DL algorithms from scratch
- The book clearly states its focus is on image recognition and not other areas
- It also clearly states that it is not utilizing standard Python DL packages/frameworks
- It also says it is not focusing on GPU usage, model tuning, or latest research

# Ch 1: Intro to Python

- Intro to Python -> the book says to skip it if already familiar with Python
- Basic intro to Python that seems to cover the most basic info including very basic intro to classes and OOP
- Does mention using numpy/matplotlib, but leaves out `pip install` or `conda install` directions
- I mostly skimmed this chapter, but it does seem like a decent intro

# Ch 2: Perceptrons

- A logical starting point for DL basics
- Does a good job of representing how perceptrons work, describing weight/bias
- Explains how multi-layer perceptrons build on simple principles to represent more complex problems
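
To make the weight/bias idea concrete, here is a minimal sketch of single-layer perceptrons acting as logic gates, plus XOR built by stacking them. The specific weights and biases are illustrative values that happen to work; they are my assumption and many other choices would too.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 when the weighted sum of inputs plus the bias is positive, else 0."""
    return int(np.sum(w * x) + b > 0)

# Single-layer gates: same structure, only the weights and bias differ.
def AND(x1, x2):
    return perceptron(np.array([x1, x2]), np.array([0.5, 0.5]), -0.7)

def NAND(x1, x2):
    return perceptron(np.array([x1, x2]), np.array([-0.5, -0.5]), 0.7)

def OR(x1, x2):
    return perceptron(np.array([x1, x2]), np.array([0.5, 0.5]), -0.2)

def XOR(x1, x2):
    # XOR is not linearly separable, so one perceptron cannot represent it.
    # Two layers can: NAND and OR feed into AND.
    return AND(NAND(x1, x2), OR(x1, x2))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [AND(a, b) for a, b in inputs]
xor_outputs = [XOR(a, b) for a, b in inputs]
print(and_outputs, xor_outputs)
```

The point of the XOR function is the layering: each gate is still a simple weighted sum with a threshold, but composing them represents something a single one cannot.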

# Ch 3: Neural Networks

- Focuses on forward propagation
- Explains NN layout: input layer, hidden layer(s), output layer
- Addresses how activation functions impact output from a node
- Explains why non-linear functions must be used as activation functions -> a stack of layers with linear activations collapses into an equivalent single-layer NN
    - So you lose the advantage gained by multiple layers
- Talks about step, sigmoid, and ReLU activation functions
- Goes over matrix multiplication
- Implements a three layer NN using matrix multiplication
- Introduces activation functions for output layers and common choices: identity/regression, sigmoid/2-class, softmax/multi-class
- Explains why output of the softmax function can be interpreted as probability
- Says that the softmax function is often omitted from the Output Layer (doesn't change the order of probabilities, meaning which class is the 'answer')
- Explains how to determine the number of output nodes (neurons) (equal to the number of classes for classification)
- Sort of explains that the number of input nodes = the number of features, using MNIST data (length of the flattened image array)
- Effectively explains batching and why this accelerates the calculation process
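
A rough sketch of the pieces this chapter covers: the activation functions, a softmax output, a batched forward pass using matrix multiplication, and a check that linear layers collapse. The layer sizes, the random weights, and the omission of biases are my simplifications, not the book's network.

```python
import numpy as np

def step(x):
    return (x > 0).astype(int)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(a):
    # Subtracting the row max avoids overflow in exp and does not change the result.
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical shapes: 4 input features -> 5 -> 3 -> 2 classes (biases left out for brevity).
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 3))
W3 = rng.normal(size=(3, 2))

def forward(x):
    z1 = sigmoid(x @ W1)
    z2 = sigmoid(z1 @ W2)
    return softmax(z2 @ W3)

# Batching: 10 samples go through the network in one set of matrix products.
batch = rng.normal(size=(10, 4))
probs = forward(batch)

# Softmax keeps the ordering of the scores, so argmax is the same with or without it.
scores = np.array([0.3, 2.9, 4.0])
same_answer = np.argmax(scores) == np.argmax(softmax(scores))

# Without a non-linearity, two layers equal one layer with weights W1 @ W2.
collapsed = np.allclose((batch @ W1) @ W2, batch @ (W1 @ W2))
print(probs.shape, same_answer, collapsed)
```

Each row of `probs` sums to 1, which is why the softmax output can be read as a probability over classes.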

# Ch 4: Neural Network Training

- Distinguishes between characteristics of how ML & DL learn from data
- Good intro to train/test (generalization) and overfitting
- Decent explanation of Loss Function
- Decent examples of calculating loss function
- Introduces gradient descent and describes how it works before mentioning the term
- Introduces the learning rate, which sets how much the parameters are updated in each iteration of the gradient method
    - Also mentions that checking whether or not training is successful can be accomplished by changing the learning rate (0.01 and 0.001 are common)
    - Only after describing does it mention the term hyperparameter and talk about modifying it (without using the term hyperparameter tuning)
- Lastly describes that Stochastic Gradient Descent (SGD) occurs when using random mini-batches from the training set
- Describes that an epoch is the completion of all mini-batches once (all training data has been seen) and indicates the number of iterations through the mini-batches
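
The loss, gradient, mini-batch, and epoch ideas above can be sketched on a toy problem. This is not the book's MNIST example: the data (a noisy line), the learning rate, batch size, and epoch count are all assumptions chosen to keep the sketch small.

```python
import numpy as np

def cross_entropy_error(y, t):
    """y: predicted probabilities, t: one-hot labels, both shaped (N, classes)."""
    delta = 1e-7  # keeps log() away from log(0)
    return -np.sum(t * np.log(y + delta)) / y.shape[0]

def numerical_gradient(f, x):
    """Central-difference estimate of the gradient of f at x (1-D float array)."""
    h = 1e-4
    grad = np.zeros_like(x)
    for i in range(x.size):
        original = x[i]
        x[i] = original + h
        f_plus = f(x)
        x[i] = original - h
        f_minus = f(x)
        grad[i] = (f_plus - f_minus) / (2 * h)
        x[i] = original  # restore the parameter
    return grad

# Loss for one sample whose true class (index 1) was given probability 0.8.
ce = cross_entropy_error(np.array([[0.1, 0.8, 0.1]]), np.array([[0, 1, 0]]))

# Toy regression data: t = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
T = 3.0 * X + 1.0 + rng.normal(scale=0.05, size=200)

params = np.array([0.0, 0.0])   # [slope, intercept]
learning_rate = 0.1             # hyperparameter
batch_size = 20
epochs = 50
iters_per_epoch = len(X) // batch_size  # one epoch = every mini-batch seen once

for epoch in range(epochs):
    order = rng.permutation(len(X))      # random mini-batches make this SGD
    for i in range(iters_per_epoch):
        idx = order[i * batch_size:(i + 1) * batch_size]
        xb, tb = X[idx], T[idx]
        loss = lambda p: np.mean((p[0] * xb + p[1] - tb) ** 2)
        params -= learning_rate * numerical_gradient(loss, params)

print(ce, params)
```

Numerical differentiation is slow but simple, which is the motivation for the backpropagation chapter that follows.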

# Ch 5: Backpropagation

- Uses computational graphs (network graphs with nodes/edges) instead of formulas to describe backpropagation
- Good job of visually describing how backpropagation finds local differentials
- Illustrates how the chain rule applies to back propagation with an example to show how it works
- Uses derivatives to explain why backpropagation through an addition node doesn't change the value passed to the lower stream (the derivative of addition with respect to each input is 1, so the incoming value is multiplied by 1)
- Illustrates multiplication backpropagation by reversing the operation (multiplying by x forward means multiplying by y backward and vice versa)

In [1]:
# ch 5 code exercise
class MulLayer:
    # a multiplication layer with forward and backward operations
    def __init__(self):
        self.x = None
        self.y = None
        
    def forward(self, x, y):
        # multiplies the two inputs and returns one output; stores them for the backward pass
        self.x = x
        self.y = y
        out = x * y
        
        return out
        
    def backward(self, dout):
        # accepts dout, the gradient flowing back from this layer's output, and returns the gradients with respect to the two forward inputs
        dx = dout * self.y # reverse x and y
        dy = dout * self.x
        
        return dx, dy

In [2]:
# forward prop in a multiplication layer
apple = 100
apple_num = 2
tax = 1.1

# layer
mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)
price = mul_tax_layer.forward(apple_price, tax)

print(price)

220.00000000000003


In [3]:
# backward prop in a multiplication layer
# start from dprice = d(price)/d(price) = 1, the derivative of the final output with respect to itself
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)
print(dapple, dapple_num, dtax)

2.2 110.00000000000001 200


In [4]:
# an addition layer
class AddLayer:
    def __init__(self):
        pass
    
    def forward(self, x, y):
        return x + y
    
    def backward(self, dout):
        # the derivative of addition is 1 for each input, so dout (the gradient arriving from the output side) passes to both inputs unchanged
        dx = dout * 1
        dy = dout * 1
        return dx, dy

In [5]:
# set vars
apple = 100
apple_num = 2
orange = 150
orange_num = 3
tax = 1.1

# layers to multiply the number*price of apple/oranges, layer to add prices together, layer to multiply the tax
mul_apple_layer = MulLayer()
mul_orange_layer = MulLayer()
add_apple_orange_layer = AddLayer()
mul_tax_layer = MulLayer()

In [6]:
# forward prop
apple_price = mul_apple_layer.forward(apple, apple_num)
orange_price = mul_orange_layer.forward(orange, orange_num)
all_price = add_apple_orange_layer.forward(apple_price, orange_price)
price = mul_tax_layer.forward(all_price, tax)
print(price)

715.0000000000001


In [7]:
# backward prop
dprice = 1
dall_price, dtax = mul_tax_layer.backward(dprice)
dapple_price, dorange_price = add_apple_orange_layer.backward(dall_price)
dorange, dorange_num = mul_orange_layer.backward(dorange_price)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)
print(dapple_num, dapple, dorange, dorange_num, dtax)

110.00000000000001 2.2 3.3000000000000003 165.0 650
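
One way to sanity-check the backward pass above is to compare it with numerical differentiation of the same computation. The sketch below writes the whole graph as one function and estimates each partial derivative with a central difference; the helper names are mine, not the book's.

```python
def total_price(apple, apple_num, orange, orange_num, tax):
    # Same computation as the layer graph: two multiplications, one addition, one multiplication.
    return (apple * apple_num + orange * orange_num) * tax

def numerical_partials(f, args, h=1e-4):
    """Central-difference estimate of the partial derivative of f for each argument."""
    grads = []
    for i in range(len(args)):
        plus = list(args)
        minus = list(args)
        plus[i] += h
        minus[i] -= h
        grads.append((f(*plus) - f(*minus)) / (2 * h))
    return grads

args = [100.0, 2.0, 150.0, 3.0, 1.1]
dapple, dapple_num, dorange, dorange_num, dtax = numerical_partials(total_price, args)
print(dapple_num, dapple, dorange, dorange_num, dtax)
```

Up to floating-point rounding, these estimates should agree with the values printed by the backward pass (110, 2.2, 3.3, 165, 650).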


- Implementing an Activation Function Layer

- ReLU
    - In forward prop, returns x if x > 0, or 0 if x <= 0
    - In back prop, passes the incoming gradient through where x > 0 and outputs 0 where x <= 0 (the derivative of ReLU is 1 or 0)

In [8]:
# the relu class

# assumes x is a numpy array
class Relu:
    def __init__(self):
        self.mask = None
        
    def forward(self, x):
        # create a T/F array 
        self.mask = (x <= 0)
        out = x.copy()
        
        # replaces all x <= 0 values with 0
        out[self.mask] = 0
        
        return out
    
    def backward(self, dout):
        # dout is the gradient arriving from the next layer; zero it wherever the forward input was <= 0
        # (note: this modifies dout in place)
        dout[self.mask] = 0
        
        # the gradient passes through unchanged where the forward input was > 0
        dx = dout
        
        return dx

- Complex chapter that really dives into the details for how backpropagation works

# Ch 6: Training Techniques

- Discusses optimization techniques and advantages/disadvantages including stochastic gradient descent (SGD), Momentum, AdaGrad
- Reinforces the idea that the learning rate is an important hyperparameter (learning rate is how much weights are updated during training)
    - Too small and training takes forever
    - Too large and divergence occurs and correct training does not occur
- Learning rate decay
    - Learning rate is larger at first and decreases as training progresses
- Disadvantage of SGD
    - If the gradient is small in a particular dimension, it is inefficient (folded paper, not much slope to the middle of the paper in one direction)
- Momentum
    - Helps this problem by reducing the amount of zigzag (takes a shorter path) to get to the local minimum
- AdaGrad
    - Adjusts the learning rate for each element of the parameter adaptively for training
    - If conducted infinitely, the learning rate becomes 0 and no updates occur
- RMSProp
    - Solves AdaGrad's learning-rate-to-0 issue by gradually forgetting past gradients (their scale is reduced exponentially)
        - Doesn't reduce the learning rate as much as AdaGrad for each iteration
- Adam
    - Basic idea is combining Momentum and AdaGrad
    - Also applies 'bias correction' to its moment estimates
    - Research paper on Adam indicates the hyperparameter values for the primary moment (beta1) and secondary moment (beta2) are often 0.9, and 0.999 respectively, and are effective in many cases
    
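The update rules for the optimizers above can be sketched in a few lines each. This is a simplified version that updates a single NumPy array in place rather than a dictionary of parameters; the test function (an elongated bowl, in the spirit of the folded-paper picture), the starting point, and the learning rates are my assumptions for illustration.

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, w, g):
        w -= self.lr * g

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr, self.momentum, self.v = lr, momentum, None
    def update(self, w, g):
        if self.v is None:
            self.v = np.zeros_like(w)
        self.v = self.momentum * self.v - self.lr * g  # velocity accumulates past gradients
        w += self.v

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr, self.h = lr, None
    def update(self, w, g):
        if self.h is None:
            self.h = np.zeros_like(w)
        self.h += g * g  # grows forever, so the effective step keeps shrinking
        w -= self.lr * g / (np.sqrt(self.h) + 1e-7)

class RMSProp:
    def __init__(self, lr=0.01, decay=0.99):
        self.lr, self.decay, self.h = lr, decay, None
    def update(self, w, g):
        if self.h is None:
            self.h = np.zeros_like(w)
        self.h = self.decay * self.h + (1 - self.decay) * g * g  # old gradients fade out
        w -= self.lr * g / (np.sqrt(self.h) + 1e-7)

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.m, self.v, self.t = None, None, 0
    def update(self, w, g):
        if self.m is None:
            self.m, self.v = np.zeros_like(w), np.zeros_like(w)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g      # first moment
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g  # second moment
        m_hat = self.m / (1 - self.beta1 ** self.t)              # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        w -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-8)

def loss(w):
    return w[0] ** 2 / 20 + w[1] ** 2  # shallow slope along w[0], steep along w[1]

def grad(w):
    return np.array([w[0] / 10, 2 * w[1]])

def run(optimizer, steps=200):
    w = np.array([-7.0, 2.0])
    for _ in range(steps):
        optimizer.update(w, grad(w))
    return loss(w)

start_loss = loss(np.array([-7.0, 2.0]))
results = {
    "SGD": run(SGD(lr=0.95)),
    "Momentum": run(Momentum(lr=0.1)),
    "AdaGrad": run(AdaGrad(lr=1.5)),
    "RMSProp": run(RMSProp(lr=0.05)),
    "Adam": run(Adam(lr=0.3)),
}
print(start_loss, results)
```

Recording `w` at every step inside `run` and plotting the paths would show the zigzag of plain SGD next to the smoother paths of the others.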
- Explains why initializing with random weights is required (uniform weights will result in improper training using backpropagation)
    - Also explains that if wanting weights to start small, use a random normal distribution with a small stdev (`np.random.randn(10, 100) * 0.01`)
- Briefly explains the problem of gradient vanishing, where (sigmoid) activations pile up near 0 and 1 and the gradients there shrink toward 0
    - Shows that this occurred when initial weights had a stdev of 1; a stdev of 0.01 fixes it, but then the activations become nearly uniform, which negates the value of having multiple neurons
        - Activations become biased in this situation resulting in 'limited representation' (distributions are very narrow and all around the same value)
    - Explains that the distribution of activations need to be spread properly -> this is efficient learning
        - Otherwise you get either gradient vanishing or 'limited representation'
    - Talks about the Xavier initializer, which is frequently used in ordinary DL frameworks (for tanh/sigmoid activation functions)
        - Uses a distribution with a stdev of $\frac{1}{\sqrt{n}}$ where `n` is the number of nodes in the previous layer
        - This results in activation values for each layer that are more spread out, they still have the same mean, but the stdevs of the activations within the layers are much larger
        - Also explains why the tanh function is better than sigmoid (it's symmetrical about (0, 0) rather than (0, 0.5) like the sigmoid function)
    - Also states that for ReLU -> a dedicated initial value is recommended
        - This is the He initializer
        - Gaussian dist with stdev of $\sqrt{\frac{2}{n}}$ where n = the number of nodes in the previous layer
        - This essentially means that for ReLU, the coefficient must be doubled to provide more spread vs the Xavier initializer (because negative is 0 output for ReLU)
        - Results of experiments for weight initializers
            - Gaussian stdev of 0.01 -> gradient vanishing and narrow spread (no learning)
            - Xavier -> bigger spread, but still some gradient vanishing with higher frequencies around 0 than any other value -> slower training than He
            - He -> a high number of 0, but otherwise a flat distribution (even frequencies) for other activation values -> best result
- Batch Normalization (batch norm)
    - Purpose: to adjust the distribution of activations in each layer so they have a proper spread
    - Accelerates learning (can increase the learning rate)
    - Not as dependent on initial weight values (don't need to be cautious with initial values)
    - Reduces overfitting (and the need for dropout)
    - How it works
        - it normalizes each mini-batch used for training (avg = 0, stdev = 1) between layers
        - use it either before or after the activation function (some discussion on which is better) to reduce the distribution bias of the data
- Regularization (discusses L2 norm only, but mentions L1 and L∞)
    - Weight Decay (adding L2 penalty to the loss function to penalize large weights)
        - imposes a penalty on large weights during training (reduces overfitting)
    - Dropout
        - erases hidden layer neurons at random during training (random different ones each time the data flows)
        - during testing, all neurons are used
    - Ensemble Models
        - similar idea to dropout in that several models are used and predictions are averaged (dropout essentially uses different models each time)
        - can improve accuracy by several percent
- Validating Hyperparameters
    - Essential to not use test data as validation data -> overfitting and test leakage
    - Good job explaining that validation data is obtained from the training data
    - Reports that random sampling of hyperparameters is a better method for NN's than systematic search (random vs grid search)
    - Also mentions that using a log scale (powers of 10) is a good initial approach
    - Also mentions that the number of training epochs is often reduced during hyperparameter tuning (search), since trying many options takes a lot of time
    - Also mentions trying Bayesian Optimization for hyperparameter tuning and offers a paper
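
The weight-initialization notes earlier in this chapter can be checked with a small experiment: push random data through a stack of ReLU layers and watch the spread of the activations under each initializer. The layer count, layer width, sample count, and seed below are assumptions; this is a sketch of the idea, not a reproduction of the book's figures.

```python
import numpy as np

def activation_stds(std_for_fan_in, n_layers=5, n_nodes=100, n_samples=1000, seed=0):
    """Return the stdev of the ReLU activations after each layer for a given weight stdev rule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, n_nodes))
    stds = []
    for _ in range(n_layers):
        # n_nodes plays the role of n, the number of nodes in the previous layer.
        w = rng.normal(size=(n_nodes, n_nodes)) * std_for_fan_in(n_nodes)
        x = np.maximum(0, x @ w)  # ReLU
        stds.append(float(x.std()))
    return stds

small = activation_stds(lambda n: 0.01)               # fixed stdev of 0.01
xavier = activation_stds(lambda n: 1.0 / np.sqrt(n))  # Xavier: 1 / sqrt(n)
he = activation_stds(lambda n: np.sqrt(2.0 / n))      # He: sqrt(2 / n)

for name, stds in [("0.01", small), ("Xavier", xavier), ("He", he)]:
    print(name, ["%.2e" % s for s in stds])
```

With ReLU, the spread collapses almost immediately for the 0.01 case, shrinks layer by layer for Xavier, and stays roughly level for He, which lines up with the ordering of the results listed in the notes.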

# Ch 7: Convolutional Neural Networks

- Problem with fully connected layers
    - Shape of the input data is ignored: 3D images (height, width, channel dims) are flattened
    - CNN's retain shape, receive shape in the input and also output it -> thus preserving spatial relationships between data points
    - Input/output data for a conv. layer is a feature map (input feature map and output feature map)
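
To illustrate the shape point, the sketch below contrasts flattening an image with passing it through a naive convolution that returns a feature map. The loop-based `conv2d`, its `pad` and `stride` parameters, and the tiny 3-channel 5x5 input are my own illustration and are not taken from the book's implementation.

```python
import numpy as np

def conv2d(x, kernels, pad=0, stride=1):
    """Naive convolution (cross-correlation, as DL frameworks compute it).

    x: input feature map shaped (C, H, W)
    kernels: filters shaped (FN, C, FH, FW)
    returns: output feature map shaped (FN, OH, OW)
    """
    c, h, w = x.shape
    fn, kc, fh, fw = kernels.shape
    assert c == kc, "filter channels must match input channels"
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding on height and width only
    oh = (h + 2 * pad - fh) // stride + 1
    ow = (w + 2 * pad - fw) // stride + 1
    out = np.zeros((fn, oh, ow))
    for f in range(fn):
        for i in range(oh):
            for j in range(ow):
                top, left = i * stride, j * stride
                patch = xp[:, top:top + fh, left:left + fw]
                out[f, i, j] = np.sum(patch * kernels[f])
    return out

image = np.arange(3 * 5 * 5, dtype=float).reshape(3, 5, 5)  # (channel, height, width)

flat = image.reshape(-1)        # what a fully connected layer sees: neighbours are lost
kernels = np.ones((2, 3, 3, 3)) # two hypothetical 3x3 filters
fmap = conv2d(image, kernels, pad=1)

print(flat.shape, fmap.shape)
```

The flattened vector has no notion of which values were adjacent, while each entry of `fmap` is computed from a local 3x3 neighbourhood and stays at its spatial position.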

# Ch 8: Deep Learning

- Data Augmentation
    - Artificially expands the training data (particularly images) by adding new images that are slight modifications of existing images
        - Does so by rotation or vertical/horizontal movements
        - Can also crop part of an image, flip it horizontally (only works when symmetry isn't important), or change its brightness
- Discusses Transfer Learning
    - Take part of the trained weights from one pretrained model then fine tune them
- Discusses GPUs for LTM data
- Also distributed use of GPUs using Google's TensorFlow or MS Computational Network Toolkit (CNTK)
- Talks about different areas of DL and their applications moving forward
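
The data augmentation operations noted above are simple array manipulations. Here is a minimal NumPy sketch, assuming images are 2-D arrays with values in [0, 1]; the function names are hypothetical, and rotation is left out because it needs interpolation (typically done with an image library).

```python
import numpy as np

def flip_horizontal(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def shift(img, dy, dx):
    """Move the image by (dy, dx) pixels, filling the vacated area with 0."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def random_crop(img, size, rng):
    """Cut out a random size x size window."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def adjust_brightness(img, factor):
    """Scale pixel values and keep them inside [0, 1]."""
    return np.clip(img * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
img = np.arange(12, dtype=float).reshape(3, 4) / 11.0  # tiny stand-in for an image

augmented = [
    flip_horizontal(img),
    shift(img, 1, 0),
    random_crop(img, 2, rng),
    adjust_brightness(img, 1.5),
]
print([a.shape for a in augmented])
```

Each variant keeps the original label, which is how a small dataset can be stretched into a larger one.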

# Impressions
- Good for someone who is familiar with ML/DL but wants to have a better understanding of the inner-workings of DL/NN
- Otherwise good for those strong in math, calculus (particularly derivatives), and vector/matrix math
- Would be difficult to comprehend without a background, but does a good job of shedding light on the details of these principles that many are familiar with and really should understand to properly implement DL
- Explains the math behind what's occurring to help Data Scientists make more informed decisions on the choice of
    - activation functions
    - regularization
    - hyperparameter tuning
    - optimization methods for loss functions
- Provides some intro on particular types of DL and areas where they are applied, things to look out for