
overview:
- hyperparameter sweeps
- batch normalization
- softmax loss 

# hyper parameters
In rough order of tuning importance:
- learning rate (alpha): the most important hyperparameter to tune
- momentum term (beta): second most important
- number of hidden units
- learning rate decay
- number of layers
- mini-batch size
- beta1 and beta2 for Adam


It used to be common practice to sample points on a grid (for example, over 2 hyperparameters); this works reasonably well when the number of hyperparameters is small.

It's better to sample hyperparameters at random:
- it's difficult to know in advance which hyperparameters matter most for your problem
- some hyperparameters will have little to no impact on training
- sampling at random lets you try more distinct values of each hyperparameter for the same number of trials
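
A small sketch of the last point, under illustrative assumptions (two hyperparameters, both scaled to the range 0 to 1, and a budget of 25 trials): a 5 x 5 grid only ever tries 5 distinct values of each hyperparameter, while random sampling tries a new value on every trial.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 25

# Grid search: 5 x 5 grid over two hyperparameters (illustrative ranges).
grid_axis = np.linspace(0.0, 1.0, 5)
grid_points = [(a, b) for a in grid_axis for b in grid_axis]

# Random search: same budget, each trial draws both values independently.
random_points = rng.uniform(0.0, 1.0, size=(n_trials, 2))

# Count how many distinct values of the first hyperparameter each method tried.
distinct_grid = len({a for a, _ in grid_points})
distinct_random = len(np.unique(random_points[:, 0]))

print(distinct_grid, distinct_random)
```

If the first hyperparameter turns out to be the only one that matters, the grid has effectively spent 25 trials on 5 experiments.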

Coarse to fine:
- once you see that a certain range of hyperparameters gives the best results, restrict your sampling to that region and sample more densely there to optimize further
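
A rough sketch of coarse-to-fine search. The score function `toy_score` is a made-up stand-in for "train a model and measure validation performance", and the window of half a decade around the coarse winner is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(log_alpha):
    # Hypothetical validation score that peaks at alpha = 1e-3.
    return -(log_alpha + 3.0) ** 2

def random_search(low, high, n):
    """Sample n values of log10(alpha) uniformly in [low, high]; return the best one."""
    samples = rng.uniform(low, high, size=n)
    scores = toy_score(samples)
    return samples[np.argmax(scores)]

# Coarse pass over a wide range, then a fine pass around the coarse winner.
coarse_best = random_search(-6.0, 0.0, 20)
fine_best = random_search(coarse_best - 0.5, coarse_best + 0.5, 20)

print(10 ** coarse_best, 10 ** fine_best)
```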

# picking hyperparameter scale

Some hyperparameters, such as the number of layers, can be sampled uniformly.

Other hyperparameters, such as the learning rate, are better sampled on a log scale, because the values worth comparing span orders of magnitude (1e-4, 1e-5, 1e-6).

Beta (the decay rate) should be sampled so that values such as 0.9, 0.99, and 0.999 are covered evenly, which means sampling 1 - beta on a log scale.

```python
import numpy as np

def sample_log_range(low, high):
    """
    Sample from [low, high] on a log scale.
    low, high: positive numbers
    """
    a = np.log10(low)
    b = np.log10(high)
    r = np.random.uniform(a, b)
    return 10**r

def sample_beta(beta_min=0.9, beta_max=0.999):
    """
    Sample beta in [beta_min, beta_max] by sampling (1-beta) on a log scale.
    """
    low = 1 - beta_max
    high = 1 - beta_min

    a = np.log10(low)
    b = np.log10(high)

    r = np.random.uniform(a, b)
    return 1 - 10**r


```
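
One way to see why beta is sampled through 1 - beta: an exponentially weighted average with decay beta averages over roughly 1 / (1 - beta) steps, so the same small change in beta matters far more close to 1. A quick check (the helper name `effective_window` is only for illustration):

```python
def effective_window(beta):
    """Approximate number of steps an exponentially weighted average looks back over."""
    return 1.0 / (1.0 - beta)

# Same step of 0.0005 in beta, very different change in the averaging window.
for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(beta, round(effective_window(beta), 2))
```

Going from 0.9 to 0.9005 barely changes the window (about 10 steps either way), while going from 0.999 to 0.9995 roughly doubles it from about 1000 to about 2000.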


Hyperparameter intuitions often do not transfer cleanly across ML domains (images, NLP, etc.).


Babysitting one model (the "panda" approach)
- used when compute is limited
- you monitor a single model's learning curve daily or hourly, nudging hyperparameters as training progresses
- revert to an earlier model checkpoint if a new adjustment causes divergence

Training many models in parallel (the "caviar" approach)
- used when you have a lot of compute power
- train many different models with different hyperparameters at the same time

With large datasets you sometimes have to use the panda approach, because training many models at once is too expensive.

# Batch Normalization


### normalizing input features
```python
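import numpy as np

# assumes X has shape (m, n_features): one row per example
X = X - np.mean(X, axis=0)        # center each feature at 0
variance = np.mean(X**2, axis=0)  # per-feature variance (X is already centered)
X = X / np.sqrt(variance)         # divide by the standard deviation, not the variance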
```

We can apply the same idea inside the network:

For any hidden layer, can we normalize its inputs a[l-1] so that the layer trains faster? (In practice, it is the pre-activation values z that get normalized, before the activation function is applied.)

### implementing batch norm

Given hidden values z[1]... z[n]:

```python
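# z[n]: pre-activations of layer n over the mini-batch
epsilon = 1e-8  # small constant for numerical stability
mu = np.mean(z[n])
variance = np.var(z[n])
z_norm = (z[n] - mu) / np.sqrt(variance + epsilon)

gamma = 1.0  # learnable scale; starting at 1 leaves z_norm unchanged (0 would zero it out)
beta = 0.0   # learnable shift

z_new[n] = gamma * z_norm + beta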
```

Batch norm normalizes values deep in the network, but you do not want them to always have mean 0 and variance 1.

If that happened, it could limit what the network can represent.

With gamma and beta, the network can learn to change the mean and variance of the distribution, to make the most of the nonlinear activation function.
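
A minimal numerical sketch of what gamma and beta can do, using made-up values for one hidden unit over a mini-batch of 64. With gamma = 1 and beta = 0 the values keep mean 0 and variance 1; with gamma = sqrt(var + epsilon) and beta = mu the normalization is undone entirely, so the network can represent anything in between.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1e-8

# Hypothetical pre-activations for one hidden unit across a mini-batch of 64.
z = rng.normal(loc=3.0, scale=2.0, size=64)

mu = z.mean()
var = z.var()
z_norm = (z - mu) / np.sqrt(var + epsilon)

# gamma = 1, beta = 0: output keeps mean 0 and variance 1.
z_unit = 1.0 * z_norm + 0.0

# gamma = sqrt(var + epsilon), beta = mu: the normalization is undone.
z_recovered = np.sqrt(var + epsilon) * z_norm + mu

print(z_unit.mean(), z_unit.var())
print(np.allclose(z_recovered, z))
```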

Batch norm helps training by making each layer see more stable distributions during training, which makes optimization easier and often much faster


```python
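def forward(self, prev_a):
    # assumes prev_a has shape (n_prev, m): one column per example in the mini-batch
    self.prev_a = prev_a
    # linear transformation; no bias term, because the mean subtraction below cancels it
    self.z = self.W @ prev_a  # shape (p, m)

    epsilon = 1e-8
    # statistics are computed per unit, across the mini-batch (axis=1)
    mu = np.mean(self.z, axis=1, keepdims=True)  # shape (p, 1)
    var = np.var(self.z, axis=1, keepdims=True)  # shape (p, 1)

    # normalize z
    z_norm = (self.z - mu) / np.sqrt(var + epsilon)  # shape (p, m)

    # scale and shift with the learnable parameters
    # (assumed to be stored on the layer as self.gamma and self.beta, shape (p, 1))
    z_new = self.gamma * z_norm + self.beta  # shape (p, m)

    # the activation is applied to the batch-normalized values, not to the raw z
    self.a = self.activation(z_new)
    return self.a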
```

Interestingly, subtracting the mean cancels any constant bias added to z, so the bias term is redundant; the beta term takes over the role of the bias.
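
A quick check of this cancellation, as a sketch with made-up layer sizes and the convention that each column is one example: normalizing `W @ a_prev` and `W @ a_prev + b` gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1e-8

def batch_normalize(z):
    """Normalize each unit (row) over the mini-batch (columns)."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + epsilon)

W = rng.normal(size=(3, 5))        # 3 units, 5 inputs (illustrative sizes)
a_prev = rng.normal(size=(5, 8))   # mini-batch of 8 examples as columns
b = rng.normal(size=(3, 1))        # per-unit bias

without_bias = batch_normalize(W @ a_prev)
with_bias = batch_normalize(W @ a_prev + b)

# The bias shifts every example of a unit by the same amount, so the mean absorbs it.
print(np.allclose(without_bias, with_bias))
```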

Covariate shift: a phenomenon in machine learning where the statistical distribution of the input variables (the covariates, or features) in the production or test data differs from that in the training data, while the underlying relationship between the inputs and outputs remains the same.

Inside a network, the inputs to a later layer shift whenever the parameters of earlier layers change. By normalizing each layer's values, batch norm limits how much this distribution can move, which reduces the covariate shift seen by later layers.

This makes learning in later layers easier, because the values coming from upstream become more stable.

It also allows each layer to learn more independently.

### regularization effect

Batch norm introduces slight noise, because the mean and variance are estimated from each mini-batch rather than from the full dataset.

The noise comes from estimating the mean and variance on a mini-batch, which is a relatively small sample compared to the full batch.

Because it adds noise, it forces downstream layers not to rely on any single input.

It should not replace other regularization methods; the effect is minor.
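
To get a feel for the size of this noise, here is a small simulation with made-up data for a single unit. It measures how much the mini-batch mean varies from batch to batch; smaller batches give noisier estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of one unit over a "full" dataset of 10,000 examples.
z_full = rng.normal(loc=2.0, scale=1.5, size=10_000)

def spread_of_batch_means(batch_size, n_batches=500):
    """Standard deviation of the mini-batch mean across randomly drawn mini-batches."""
    means = [rng.choice(z_full, size=batch_size, replace=False).mean()
             for _ in range(n_batches)]
    return float(np.std(means))

small = spread_of_batch_means(32)
large = spread_of_batch_means(512)
print(small, large)
```

The spread for the batch of 32 comes out noticeably larger than for the batch of 512, which is consistent with the regularization effect weakening as the mini-batch size grows.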

# Batch norm at test time

How do we do a forward pass when there are no batches?

With a single input, z_norm would always be 0 (the mean equals the input itself), so we cannot compute meaningful values for the mean and variance.

- estimate mu and var from the training set
- in practice, keep an exponentially weighted average of each mini-batch's mu and var during training
- this ensures that in testing and production the output of the model is deterministic and independent of any other test data point
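
The steps above can be sketched as follows for a single unit. The decay rate of 0.9, the initial running values, and the helper name `batch_norm_inference` are assumptions for illustration, and the mini-batches are random made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1e-8
momentum = 0.9   # assumed decay rate for the running averages

running_mu, running_var = 0.0, 1.0   # assumed initial values

# Training: update the running statistics from each mini-batch of one unit's z values.
for _ in range(200):
    z_batch = rng.normal(loc=4.0, scale=2.0, size=64)   # hypothetical mini-batch
    running_mu = momentum * running_mu + (1 - momentum) * z_batch.mean()
    running_var = momentum * running_var + (1 - momentum) * z_batch.var()

def batch_norm_inference(z, gamma=1.0, beta=0.0):
    """Normalize with the stored running statistics instead of batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + epsilon)
    return gamma * z_norm + beta

# A single test example now gets a meaningful output that does not depend on other inputs.
print(batch_norm_inference(np.array([6.0])))
```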


# Softmax Regression

Softmax regression handles multi-class classification.

It uses a NN whose output layer has n units, one per class.

If we have 4 classes, the output will look like:

- First unit  -> probability input is class 0
- Second unit  -> probability input is class 1
- Third unit  -> probability input is class 2
- Fourth unit  -> probability input is class 3


```python
z[L] = w[L] @ a[L-1] + b[L]

t = np.exp(z[L])      # element-wise exponentiation

a[L] = t / np.sum(t)  # normalize so the outputs sum to 1

assert np.isclose(np.sum(a[L]), 1.0)  # compare floats with a tolerance, not ==

```

Softmax is the multi-class generalization of logistic regression. On its own (with no hidden layers) it draws linear decision boundaries, like logistic regression but between multiple classes.

The name softmax comes from the contrast with hardmax, where the largest value becomes 1 and the rest become 0; softmax instead gives a probability for each class.
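
A small side-by-side sketch of the two, on an example vector of output values. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the notes above; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector; subtracting max(z) avoids overflow in exp."""
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def hardmax(z):
    """1 at the position of the largest value, 0 everywhere else."""
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0
    return out

z = np.array([5.0, 2.0, -1.0, 3.0])   # example values for z[L]
print(hardmax(z))   # [1. 0. 0. 0.]
print(softmax(z))   # largest entry gets the highest probability; entries sum to 1
```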

### Loss function

The NN now outputs a vector instead of a single value.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0])         # one-hot encoding
y_hat = np.array([0.3, 0.2, 0.1, 0.4])  # predicted probabilities

# in y_true only one value is 1, the rest are 0
# this isolates the probability y_hat assigns to the target class (the third unit)
# cross-entropy loss -> generalization of binary cross-entropy
loss = -np.sum(y_true * np.log(y_hat))  # = -log(0.1), about 2.30
```

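Backprop: with a softmax output and cross-entropy loss, the gradient of the loss with respect to z[L] simplifies to

```python
dz_L = y_hat - y_true  # dJ/dz[L], with y_hat and y_true as NumPy arrays
```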
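
This gradient can be sanity-checked numerically. The sketch below, with arbitrary example values for z, compares `y_hat - y_true` against a central finite-difference estimate of the cross-entropy loss; the helper functions are defined here only for the check.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def loss(z, y_true):
    """Cross-entropy loss of softmax(z) against a one-hot target."""
    return -np.sum(y_true * np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0, 0.1])      # arbitrary example values for z[L]
y_true = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot target

analytic = softmax(z) - y_true

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (loss(z + step, y_true) - loss(z - step, y_true)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))
```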


# Deep learning frameworks

They abstract away many parts of the learning algorithm, such as implementing backpropagation.

There are many options, each with its own advantages and disadvantages, such as:
- TensorFlow
- PyTorch
- Caffe

Things to consider when choosing one:
- ease of use
- running support
- whether it is open source
- how well it is supported


# TensorFlow





```python
import numpy as np
import tensorflow as tf

# We have a cost function J(w) = w**2 - 10*w + 25 and want to find the value of w that minimizes it
# (we know that w=5 makes the function 0, but want the algorithm to find it for us)

# parameter we want to optimize
w = tf.Variable(0, dtype=tf.float32)
optimizer = tf.keras.optimizers.Adam(0.1)

def train_step():
    # compute the cost inside a GradientTape, which records the operations performed
    with tf.GradientTape() as tape:
        cost = w ** 2 - 10 * w + 25
    trainable_variables = [w]
    # use the recorded operations to compute the derivative of cost with respect to w
    grads = tape.gradient(cost, trainable_variables)
    # let the optimizer update w using that gradient
    optimizer.apply_gradients(zip(grads, trainable_variables))

for i in range(1000):
    train_step()
print(w)
```

Output:

```
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=5.000000953674316>
```


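The same problem with the coefficients of the cost function passed in as data:

```python
w = tf.Variable(0, dtype=tf.float32)
x = np.array([1.0, -10.0, 25.0], dtype=np.float32)  # coefficients; float32 to match w
optimizer = tf.keras.optimizers.Adam(0.1)


def training(x, w, optimizer):
    # defines our cost function from the coefficients in x
    def cost_fn():
        return x[0] * w ** 2 + x[1] * w + x[2]
    for _ in range(1000):
        # optimizer.minimize(cost_fn, [w]) raised the AttributeError shown below,
        # so record the forward pass with GradientTape and apply the gradient instead
        with tf.GradientTape() as tape:
            cost = cost_fn()
        grads = tape.gradient(cost, [w])
        optimizer.apply_gradients(zip(grads, [w]))
    return w

training(x, w, optimizer)
```

The original version of this cell called `optimizer.minimize(cost_fn, [w])` inside the loop, which failed with:

```
AttributeError: 'Adam' object has no attribute 'minimize'
```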