# Office Hours - Back Propagation

## Intro

[Transcript - Intro](#Transcript---Intro)




# Stuff we know

[Transcript - Stuff we know](#Transcript---Stuff-we-know)

Output layer math:

1. Derivative of the logistic function 

$$\boldsymbol{\frac{\partial \hat{y}}{\partial a_3}} = \hat{y}(1 - \hat{y})$$


2. Derivative of the logistic cost function

$$\boldsymbol{\frac{\partial E}{\partial \hat{y}}} = \hat{y} - y$$


3. Chain rule of $g(x) = f(h(x))$:

$$ \boldsymbol{\frac{\partial g}{\partial x}} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$


4. Weight update rule:

$$\boldsymbol{w} = w - \alpha\Delta w$$

# The example neural network for this exercise:

[Transcript - Example neural network](#Transcript---Example-neural-network)

* 2 input nodes
* 2 hidden nodes
* 1 output node

![01-example-network.png](01-example-network.png)


## At some point during training, the network looks like this:

[Transcript - Example Training Weights](#Transcript---Example-Training-Weights)

![02-example-weights.png](02-example-weights.png)


## And we encounter the example:

$$[x_1, x_2, y] = [0, 1, 1]$$

# A Forward Pass using the example:

[Transcript - Forward Pass](#Transcript---Forward-Pass)

$$\boldsymbol{[x_1,x_2,y]} = [0,1,1]$$

$$\boldsymbol{a_1} =0(.1)+1(.3)=0+.3=.3$$

$$\boldsymbol{z_1} =log_{10}(.3) = .57$$

$$\boldsymbol{a_2}=0(.2)+1(.4)=0+.4=.4$$

$$\boldsymbol{z_2}=log_{10}(.4) = .60$$

$$\boldsymbol{a_3}=.57(.5)+.60(.6)=.646$$

$$\boldsymbol{\hat{y}}=log_{10}(.646) = .66$$

In [1]:
from scipy.special import expit

import math

w = [.1, .2, .3, .4, .5, .6]
x1 = 0
x2 = 1
y = 1

a1 = (x1 * w[0]) + (x2 * w[2])
print(f'a1 = {a1:.2f}')

z1 = expit(a1)
print(f'z1 = {z1:.2f}')

a2 = (x1 * w[1]) + (x2 * w[3])
print(f'a2 = {a2:.2f}')

z2 = expit(a2)
print(f'z2 = {z2:.2f}')

a3 = (z1 * w[4]) + (z2 * w[5])
print(f'a3 = {a3:.2f}')

y_hat = expit(a3)
print(f'y_hat = {y_hat:.6f}')



a1 = 0.30
z1 = 0.57
a2 = 0.40
z2 = 0.60
a3 = 0.65
y_hat = 0.656206


## Cost Function

[Transcript - Calculate the Cost Function](#Transcript---Calculate-the-Cost-Function)


cost $(y, \hat{y}) = -log_{10}(\hat{y})$ if $y = 1$

$= .18$



In [2]:
cost = -math.log10(y_hat)
print(f'cost = {cost:.2f} if y = 1')

cost = 0.18 if y = 1


# Back Propagation

[Transcript - Back Propagation - first layer - between the output later and the hidden layer](#Transcript---Back-Propagation---first-layer---between-the-output-later-and-the-hidden-layer)

[Transcript - Back Propagation - next layer - between the hidden later and the first layer](#Transcript---Back-Propagation---next-layer---between-the-hidden-later-and-the-first-layer)


## Prep for Back Propagation:

* Create w_old
* Alpha is the learning rate, which is a dampening function applied to the change specified by the error allocation.
* $\alpha = 0.2$, meaning that the error allocation will be reduced to 20% of the specified change.

In [3]:
alpha = 0.2
w_old = w.copy()

## First, update the hidden layer based on the output.  

We use the chain rule twice to assign the errors to $w_5$ and $w_6$.

Output layer math:

$$\boldsymbol{\frac{\partial E}{\partial w_5}}=\frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot \frac{\partial a_3}{\partial w_5}$$

Derivative of the logistic cost function

$$\boldsymbol{\frac{\partial E}{\partial \hat{y}}} = \hat{y} - y = 0.66 - 1$$

In [4]:
dE_dy_hat = y_hat - y

print(f'{y_hat:.2f} - {y:.2f} = {dE_dy_hat:.2f}')

0.66 - 1.00 = -0.34


Derivative of the logistic function 

$$\boldsymbol{\frac{\partial \hat{y}}{\partial a_3}} = \hat{y}(1 - \hat{y}) = 0.66 (1 - 0.66)$$

In [5]:
dy_hat_da_3 = y_hat * (1 - y_hat)

print(f'{y_hat:.2f} (1 - {y_hat:.2f}) = {dy_hat_da_3:.2f}')

0.66 (1 - 0.66) = 0.23


$$\boldsymbol{\frac{\partial a_3}{\partial w_5}} = z_1$$

In [6]:
print(f'z1 = {z1:.2f}')

z1 = 0.57


$$\boldsymbol{\frac{\partial a_3}{\partial w_6}} = z_2$$

In [7]:
print(f'z2 = {z2:.2f}')

z2 = 0.60


### Adjustment for $w_5$:

$$\boldsymbol{\frac{\partial E}{\partial w_5}}=\frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot \frac{\partial a_3}{\partial w_5}$$

In [8]:
dE_dw_5 = dE_dy_hat * dy_hat_da_3 * z1

print(f'dE_dw_5 = {dE_dy_hat:.2f} * {dy_hat_da_3:.2f} * {z1:.2f} = {dE_dw_5:.3f}')

dE_dw_5 = -0.34 * 0.23 * 0.57 = -0.045


### Adjustment for $w_6$:

$$\boldsymbol{\frac{\partial E}{\partial w_6}}=\frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot \frac{\partial a_3}{\partial w_6}$$

In [9]:
dE_dw_6 = dE_dy_hat * dy_hat_da_3 * z2

print(f'dE_dw_6 = {dE_dy_hat:.2f} * {dy_hat_da_3:.2f} * {z2:.2f} = {dE_dw_6:.3f}')

dE_dw_6 = -0.34 * 0.23 * 0.60 = -0.046


### Weight update for $w_5$:

$$\boldsymbol{w_5} = w_5 - \alpha\Delta w_5$$

In [10]:
w[4] = w_old[4] - (alpha * dE_dw_5)

print(f'w_5 = {w_old[4]:.2f} - ({alpha:.2f} * {dE_dw_5:.3f}) = {w[4]:.2f}')

w_5 = 0.50 - (0.20 * -0.045) = 0.51


### Weight update for $w_6$:

$$\boldsymbol{w_6} = w_6 - \alpha\Delta w_6$$

In [11]:
w[5] = w_old[5] - (alpha * dE_dw_5)

print(f'w_6 = {w_old[5]:.2f} - ({alpha:.2f} * {dE_dw_6:.3f}) = {w[5]:.2f}')

w_6 = 0.60 - (0.20 * -0.046) = 0.61


## Update the next layer based on the previous updates.  

We use the chain rule again assign the errors to $w_1$ through $w_4$.



Hidden layer math:

$$\boldsymbol{\frac{\partial E}{\partial w_1}} = 
\frac{\partial E}{\partial z_1} \cdot \frac{\partial z_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$$

$$\boldsymbol{\frac{\partial E}{\partial z_1}}
= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_1}
= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot w_5
= (-0.34) \cdot (0.23) \cdot (0.5) 
= $$

In [12]:
dE_dz_1 = dE_dy_hat * dy_hat_da_3 * w_old[4]

print(f'dE_dz_1 = {dE_dy_hat:.2f} * {dy_hat_da_3:.2f} * {w_old[4]:.2f} = {dE_dz_1:.3f}')

dE_dz_1 = -0.34 * 0.23 * 0.50 = -0.039


$$\boldsymbol{\frac{\partial E}{\partial z_2}}
= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot \frac{\partial a_3}{\partial z_2}
= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_3} \cdot w_6
= (-0.34) \cdot (0.23) \cdot (0.6) 
= $$

In [13]:
dE_dz_2 = dE_dy_hat * dy_hat_da_3 * w_old[5]

print(f'dE_dz_2 = {dE_dy_hat:.2f} * {dy_hat_da_3:.2f} * {w_old[5]:.2f} = {dE_dz_2:.3f}')

dE_dz_2 = -0.34 * 0.23 * 0.60 = -0.047


$$\boldsymbol{\frac{\partial z_1}{\partial a_1}} = a_1(1 - a_1) = 0.30 (1 - 0.30) = $$

In [14]:
dz_1_da_1 = a1 * (1 - a1)

print(f'dz_1_da_1 = {a1:.2f} (1 - {a1:.2f}) = {dz_1_da_1:.2f}')

dz_1_da_1 = 0.30 (1 - 0.30) = 0.21


$$\boldsymbol{\frac{\partial z_2}{\partial a_2}} = a_2(1 - a_2) = 0.40 (1 - 0.40) = $$

In [15]:
dz_2_da_2 = a2 * (1 - a2)

print(f'dz_2_da_2 = {a2:.2f} (1 - {a2:.2f}) = {dz_2_da_2:.2f}')

dz_2_da_2 = 0.40 (1 - 0.40) = 0.24


$$\boldsymbol{\frac{\partial a_1}{\partial w_1}} = x_1$$

In [16]:
da_1_dw_1 = x1

print(f'da_1_dw_1 = {da_1_dw_1:.2f}')

da_1_dw_1 = 0.00


$$\boldsymbol{\frac{\partial a_2}{\partial w_2}} = x_1$$

In [17]:
da_2_dw_2 = x1

print(f'da_2_dw_2 = {da_2_dw_2:.2f}')

da_2_dw_2 = 0.00


$$\boldsymbol{\frac{\partial a_1}{\partial w_3}} = x_2$$

In [18]:
da_1_dw_3 = x2

print(f'da_1_dw_3 = {da_1_dw_3:.2f}')

da_1_dw_3 = 1.00


$$\boldsymbol{\frac{\partial a_2}{\partial w_4}} = x_2$$

In [19]:
da_2_dw_4 = x2

print(f'da_2_dw_4 = {da_2_dw_4:.2f}')

da_2_dw_4 = 1.00


## Calculate the Adjustments

### Adjustment for $w_1$:

$$\boldsymbol{\frac{\partial E}{\partial w_1}} = 
\frac{\partial E}{\partial z_1} \cdot \frac{\partial z_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$$

In [20]:
dE_dw_1 = dE_dz_1 * dz_1_da_1 * da_1_dw_1

print(f'dE_dw_1 = {dE_dz_1:.3f} * {dz_1_da_1:.2f} * {da_1_dw_1:.2f} = {dE_dw_1:.3f}')

dE_dw_1 = -0.039 * 0.21 * 0.00 = -0.000


### Adjustment for $w_2$:

$$\boldsymbol{\frac{\partial E}{\partial w_2}} = 
\frac{\partial E}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial w_2}$$

In [21]:
dE_dw_2 = dE_dz_2 * dz_2_da_2 * da_2_dw_2

print(f'dE_dw_2 = {dE_dz_2:.3f} * {dz_2_da_2:.2f} * {da_2_dw_2:.2f} = {dE_dw_2:.3f}')

dE_dw_2 = -0.047 * 0.24 * 0.00 = -0.000


### Adjustment for $w_3$:

$$\boldsymbol{\frac{\partial E}{\partial w_3}} = 
\frac{\partial E}{\partial z_1} \cdot \frac{\partial z_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_3}$$

In [22]:
dE_dw_3 = dE_dz_1 * dz_1_da_1 * da_1_dw_3

print(f'dE_dw_3 = {dE_dz_1:.3f} * {dz_1_da_1:.2f} * {da_1_dw_3:.2f} = {dE_dw_3:.3f}')

dE_dw_3 = -0.039 * 0.21 * 1.00 = -0.008


### Adjustment for $w_4$:

$$\boldsymbol{\frac{\partial E}{\partial w_4}} = 
\frac{\partial E}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_2} \cdot \frac{\partial a_2}{\partial w_4}$$

In [23]:
dE_dw_4 = dE_dz_2 * dz_2_da_2 * da_2_dw_4

print(f'dE_dw_4 = {dE_dz_2:.3f} * {dz_2_da_2:.2f} * {da_2_dw_4:.2f} = {dE_dw_4:.3f}')

dE_dw_4 = -0.047 * 0.24 * 1.00 = -0.011


## Update Weights

### Weight update for $w_1$:

$$w_1 = w_1 - \alpha\Delta w_1$$

In [24]:
w[0] = w_old[0] - (alpha * dE_dw_1)

print(f'w_1 = {w_old[0]:.2f} - ({alpha:.2f} * {dE_dw_1:.3f}) = {w[0]:.3f}')

w_1 = 0.10 - (0.20 * -0.000) = 0.100


### Weight update for $w_2$:

$$w_2 = w_2 - \alpha\Delta w_2$$

In [25]:
w[1] = w_old[1] - (alpha * dE_dw_2)

print(f'w_2 = {w_old[1]:.2f} - ({alpha:.2f} * {dE_dw_2:.3f}) = {w[1]:.3f}')

w_2 = 0.20 - (0.20 * -0.000) = 0.200


### Weight update for $w_3$:

$$w_3 = w_3 - \alpha\Delta w_3$$

In [26]:
w[2] = w_old[2] - (alpha * dE_dw_3)

print(f'w_3 = {w_old[2]:.2f} - ({alpha:.2f} * {dE_dw_3:.3f}) = {w[2]:.3f}')

w_3 = 0.30 - (0.20 * -0.008) = 0.302


### Weight update for $w_4$:

$$w_4 = w_4 - \alpha\Delta w_4$$

In [27]:
w[3] = w_old[3] - (alpha * dE_dw_4)

print(f'w_4 = {w_old[3]:.2f} - ({alpha:.2f} * {dE_dw_4:.3f}) = {w[3]:.3f}')

w_4 = 0.40 - (0.20 * -0.011) = 0.402


# Run forward propagation again using updated weights to get a new prediction

[Transcript - Run forward propagation again using updated weights to get a new prediction](#Transcript---Run-forward-propagation-again-using-updated-weights-to-get-a-new-prediction)

In [28]:
print(f'x1 = {x1}')
print(f'x2 = {x2}')

print('\nold')
[print(f'w_old_{i+1} = {w_old[i]}') for i in range(len(w))]
print(f'a1 = {a1:.2f}')
print(f'z1 = {z1:.2f}')
print(f'a2 = {a2:.2f}')
print(f'z2 = {z2:.2f}')
print(f'a3 = {a3:.2f}')

y_hat_old = y_hat.copy()
print(f'y_hat_old = {y_hat_old:.6f}')


print('\nnew')


[print(f'w_{i+1} = {w[i]}') for i in range(len(w))]


a1 = (x1 * w[0]) + (x2 * w[2])
print(f'a1 = {a1:.2f}')

z1 = expit(a1)
print(f'z1 = {z1:.2f}')

a2 = (x1 * w[1]) + (x2 * w[3])
print(f'a2 = {a2:.2f}')

z2 = expit(a2)
print(f'z2 = {z2:.2f}')

a3 = (z1 * w[4]) + (z2 * w[5])
print(f'a3 = {a3:.2f}')

y_hat = expit(a3)
print(f'y_hat = {y_hat:.6f}')

print(f'change = {y_hat - y_hat_old}')


x1 = 0
x2 = 1

old
w_old_1 = 0.1
w_old_2 = 0.2
w_old_3 = 0.3
w_old_4 = 0.4
w_old_5 = 0.5
w_old_6 = 0.6
a1 = 0.30
z1 = 0.57
a2 = 0.40
z2 = 0.60
a3 = 0.65
y_hat_old = 0.656206

new
w_1 = 0.1
w_2 = 0.2
w_3 = 0.30162875345302287
w_4 = 0.40223371902128857
w_5 = 0.5089107165030491
w_6 = 0.6089107165030491
a1 = 0.30
z1 = 0.57
a2 = 0.40
z2 = 0.60
a3 = 0.66
y_hat = 0.658680
change = 0.002473435353621767


# Transcript

## Transcript - Intro



 Unknown Speaker

00:00:29

Good.

 Todd Holloway

00:00:46

By the way, for preparation. This is super critical pen and paper. So if you don't have something to write with Andy good time to go get something

00:02:23

Get started.

00:02:27

Joining in a minute.

00:02:29

So here's what we're gonna do.

00:02:36

back propagation. So we want to go through the math on back propagation. We want to, you know, go through all the excruciating details.

00:02:42

So that'll be sort of done at once.

00:02:45

Hopefully that like

00:02:48

Helps internalize what it's doing is, is

00:02:54

In terms of how it's updating the parameters.

00:02:58

But we also don't want it to. It's all a ton of bookkeeping, as we'll see in a minute to do this. So we want to sort of do as little as possible.

00:03:07

In terms of like the minimum the minimum amount to see all of the action happening by propagation without doing any extra meaning any like repeat math things that we want to learn from. And so what we're going to do is take a very small neural network. This one.

00:03:28

Basically, it's like, what's the minimum we can do and still to learn everything we need no more revelation. And so that's this network that to input nodes to hidden notes and one open

00:03:39

So we're going to do one forward pass on this network and then going to one complete backward pass, and then we'll do one for plastic and see if you've improved them all.

00:03:50

I'll take a whole hour probably

00:04:00

So let's go work out my notes here.

00:04:04

A little bit.

00:04:08

So before we get into the network itself. There's a few things we want to just kind of list off. These are things that we've already encountered as a class in the sink.

00:04:19

So none of this is new. It's just things we know things we know that we're going to use here. So the first thing we know is that the derivative of the logistic function is why had times one minus y. Super simple. Right.

00:04:36

A few questions about any of this as we go along. Just, just chime in.

00:04:41

Second thing we know

00:04:44

Is that the derivative of the logistic function.

00:04:48

Will just sort of logistic costs function is simply why why hat minus why

00:04:55

It's really important. We get this sign right as it inside here. Otherwise, we'll get the wrong what we couldn't have update the parameters in the wrong direction to make the model.

 Unknown Speaker

00:05:04

Worse.

 Todd Holloway

00:05:08

I'll go a little bit slow as well. So if you guys want to write this down along the way to come

00:05:14

To the third thing we know is that the chain rule states that if we have a function g of x.

00:05:21

That function can be

00:05:36

Chain rule states that function g of x can be described as a composition of to other functions of age of x.

00:05:43

And that can be applied.

00:05:46

To help us decompose the derivative into to constituent parts. So the first part being that the derivative of g with the changing g with respect to change in x.

00:05:58

Equals changing g with respect to the changing age, times the change in age with respect to the change in x. So care kind of see where this would come into play. Right, so like where we're at a particular note in the network and we have an activation function. The activation function is

00:06:20

Multiplied against the

00:06:24

The dot product. So essentially ideation, though, we have a composition of two functions. And so we can see how it might be valuable to be able to decompose

00:06:35

The derivative of those two functions.

00:06:41

Will come back 30 seconds and then the way up their role, which is simply the near way.

00:06:49

Equals the old ways, minus the learning rate alpha

00:06:54

Times the change in the way

00:06:57

delta w

00:06:59

weights of the parameters.

00:07:04

Questions about any of this stuff.

00:07:11

Will lead us off to the side, we can refer to it.

00:07:22

## Transcript - Stuff we know



Here's our, our network.

00:07:26

So labeled everything

00:07:29

So, or two features are x one x two.

00:07:38

At the hidden nodes. We're going to use to labels. We're going to have

00:07:43

A NZ. And so a one is the node it for the note value before the activation functions applied and Z one is node one, the hidden node one after that commission junction and apply

00:08:00

That's gonna be helpful to have that decomposition. Again, that's where the channels good play

00:08:08

And then

00:08:09

The output node. Again, the value before the activation functions can be called a three and the value after the activation function in this case where you call why happy because it's a prediction.

00:08:21

So we have it's fully connected. So we have a six. I just 6.6 parameters W one, W. W. W. W WC usually the literature you see these written with to some scripts, one for the

00:08:39

Essentially the layer. It's in and the other for the note within the layer. I'm just calling it was using one subscript here because this is super small network.

00:08:49

Because with me so far.

 Lee Moore

00:08:51

Todd. What would that look like if you had written it with the two sub scripts, can you

 Todd Holloway

00:08:55

Just say that this is to be like, there's no to be W one one and this, that would be W two to come on

00:09:10

Okay, so then we're going to say that at some point. I'm trying to process already ingrained to send this to Cathy greatness, then we're iterating through all the training examples.

00:09:20

That each point each training example we do a forward past, we compute the air then we do a backward pass. We update our model. And so some particular as we're both here is a big long trend is that for whatever problem we're trying to solve.

## Transcript - Example neural network



00:09:34

At some particular moments. There's some particular example, we get to this point, which is that w one equals point one w two equals point to W three equals point three.

00:09:45

W four equals point forward, w equals point five and six equals point six obviously there arbitrary values, but you can imagine that at some point in training model looks like this. Right.


## Transcript - Example Training Weights



00:10:01

So then the next example we encounter is this one x one equals zero, x two equals one and y equals one. So the expectation here is that inputs, zero and one, the predictions should be one. So, the air is going to be how far off from one it was.

00:10:27

Obviously I'm intentionally keeping this

00:10:31

Super simple, but also have everything we

00:10:36

Need to learn how this works.

00:10:45

Guys, good.

 Unknown Speaker

00:10:48

All right.

## Transcript - Forward Pass



Todd Holloway

00:11:06

So the first thing we want to do is at at that particular example for that particular example, we want to do a forecast, but I want to make a prediction.

00:11:15

So once you guys to work out the math and do a Ford British. Do you know how to do for prediction right done a bunch of times.

00:11:23

Already so much to do with this network right here in the middle of the right hand screen are still for past with this example.

00:11:32

So, you know, you're gonna you're gonna take zero here as input for x one and you'd say one. And so this node before activation is going to be zero times point one plus one times point three.

00:11:47

And then you're gonna put that through the logistic function, you can use Wolfram Alpha to help you with that one.

00:11:54

Is everyone familiar with both Wolfram Alpha

00:12:03

It's a more than that. But it's like have an island graphing calculator puzzle a knowledge base.

00:12:26

So you can just type logistic function right into the info box.

00:12:34

Just like logistic whatever you want.

00:12:39

So you do that, you get the logistic function value of point three.

00:12:44

And then

00:12:48

A three is going to be W five plus sorry it's gonna be 01 times w five plus the two times w six times. And we're going to use logistic function again. So, again, to keep things simple. We just need to push for all activation just totally fine.

00:13:12

So if you plug that in.

00:13:14

And you want to keep your bookkeeping around so you want to again meticulously lay this out on paper. So everyone is going to equal zero plus point three equals point 3010 we should get point five, seven.

00:13:31

Let me know if you have trouble getting that from open alpha

00:13:35

A two is going to be zero plus point four because point for z two is going to be again logistic point four, which is point six

00:13:47

And a three is going to be

00:13:50

Point five, seven times point five plus point six times point six

00:13:58

Gives you point five to five and then we have our prediction about predictions going to be the logistic appoint by two.

00:14:13

So I'm going to pause for a minute. I'm going to pause actually even possible five minutes, but you guys work through this board pass, make sure everyone gets the solution. If you have questions or comments just shut them out.

 Amy Breden

00:14:31

link to that calculator in the chat.

 Todd Holloway

00:14:38

Box.

 Lee Moore

00:14:43

I don't see it in the chat box.

 Unknown Speaker

00:14:45

Oh,

 Todd Holloway

00:14:51

Sorry.

00:16:39

Oh, you just like to segue into Wolfram Alpha

 Unknown Speaker

00:16:43

5.25

 Todd Holloway

00:17:07

Same result.

 Lee Moore

00:17:15

Todd, I am sort of a side question i for some reason I came away from the ACR thinking that the sigmoid was the same as the hyperbolic tangent function. It's not

 Todd Holloway

00:17:27

It's not technically

00:17:31

Technically sigmoid is a classic functions. It's any function that has that shaped

00:17:36

One instance of a stigma that is attached image.

00:17:40

But people often and practices sigma logistic function or terribly

00:17:45

Regardless of 10 inches between is it as a sigmoid function between minus one and one and the logistic function as a as a sigmoid between zero and one.

 Lee Moore

00:17:56

And when we type in sigmoid into Wolfram Alpha, which gives us religious just take

 Todd Holloway

00:18:09

The 10 ish is is the lot of literature, because

 Unknown Speaker

00:18:14

People sometimes find a convergence pastor, the logistic problem because of the negative numbers.

 Rajiv Nair

00:18:38

What is it actually doing the signal. I mean like when you. What does it mean when you put in a number and it gives you a percent. Is that a percentage of, like, what is it actually

00:18:50

In terms of the input. What is it, actually, what does it mean

00:18:54

Like a sigma 2.3 is equal 2.57 but what does that I understand the formula, but I don't quite understand what is it telling me about

 Todd Holloway

00:19:06

What is it like, what's the point

 Rajiv Nair

00:19:09

Yeah.

00:19:11

Yeah.

00:19:26

slightly off topic.

 Todd Holloway

00:19:32

So it's just the properties that were concerned with. Right. So the properties are that it's ask them one and zero. So we're saying we're saying we want

00:19:47

So we're saying we want a output that's always going to be bounded between zero and one.

00:19:54

It's continuous so that it's differentiable at all points.

00:20:06

And that's the main thing is basically just the shape of it intuitively for trying to do with activation function is is

00:20:14

Squash a value in a certain range at least once it was intended, with the SIG one

00:20:20

Itself was the squash evaluate a certain range.

00:20:24

In a way that's continuously differentiable. And so like it's those two those two aspects of it that are make it appealing, but again, we could use other functions that looks similar.

00:20:36

Can he can have the same shape but

00:20:40

With different Awesome job.

00:20:49

There's like 10 inches gonna have the same properties because I have some type of value in this case minus one on one. And it's continuously differentiable

00:20:59

And so like

00:21:02

And then you can experiment with that you could change. You could use a different function as a different slope. And that's going to mean different things for the activation of the notes.

00:21:12

And then again in recent years, people have challenged this by using functions like Ray lose rectified activations that aren't us and tonic at all. And so, and that has different set of properties.

 Rajiv Nair

00:21:25

So does it activated. If it's over 50% or something like that.

 Todd Holloway

00:21:29

Oh yeah, what does it mean in terms of a threshold. So that's a separate thing is like you can apply a threshold to it.

00:21:36

In the same way, to apply. So you could treat this like a backup so you can treat this as a probability right you know

00:21:43

That's in particular. That's why, that's why value between zero and one was popular for a long time because it's interpreted as a probability

00:21:51

And in the same way that you know when we talk about like naive Bayes which was probability or decision tree, which was possibility.

00:21:57

We could apply a threshold to convert that to predict the label. And so, so if we want we can use a threshold here to to convert if it's two classes to convert into a predictions and if it's more than two classes, we can use a soft max to convert into prediction.

00:22:14

And soft max is just takes the highest probability

 Unknown Speaker

00:22:19

Basically

 Kevin Stone

00:22:22

Taught yeah I'm probably gonna embarrass myself here.

00:22:28

So that the calculate like a three. It's just Z one W five

00:22:35

Plus z three w six correct

 Todd Holloway

00:22:41

Yes, so the 01

 Unknown Speaker

00:22:46

Plus point six

 Todd Holloway

00:22:49

2.57 times point five plus points six times point six and that will give you a three. I got

00:23:00

Point 6456 double check them out. So

 Bruno Todescan

00:23:06

So tonight.

 Todd Holloway

00:23:40

I'm wrong.

 Kevin Stone

00:23:43

Yeah.

 Todd Holloway

00:24:06

Alright, so it should be point six five.

 Unknown Speaker

00:24:11

Good catch.

 Todd Holloway

00:24:35

Okay. Does everyone want to have that

## Transcript - Calculate the Cost Function



00:24:45

And then measure the error of the next step right

00:24:49

So how close. Obviously, if we're doing absolutely there it's point three five off.

00:25:07

We're not doing absolutely right, we're doing. We're using the surrogate cost function of the logistic function.

00:25:13

was covered in the basic

00:25:17

Which comes out to be minus log of why had if y is intended to be one.

00:25:23

So make sure enough, my math and still off.

 Bruno Todescan

00:25:29

Content just to confirm, that's a natural longer than right yeah

 Todd Holloway

00:26:45

Yeah, you should get minus

00:26:52

Sorry positive

00:26:59

One nine now.

 Amy Breden

00:27:12

Are we doing. Are we just taking the natural log of point six 5am I getting that right

 Unknown Speaker

00:27:29

Point four three.

 Shaji Kunjumohamed

00:27:33

I might have to use the calculator wrong. Yeah.

 Todd Holloway

00:27:39

So as long as to one basis.

 Unknown Speaker

00:27:42

It's lobbies 10 yeah

 Kevin Stone

00:27:59

So it's not the natural log

 Todd Holloway

00:28:02

Sorry log base time

00:28:11

So Wolfram Alpha is to minus log

 Unknown Speaker

00:28:16

Minus long time

00:28:33

Okay.

 Todd Holloway

00:28:34

So that's going to be a logistic error. So it's in a sense of depth in the air a little bit by using the logistic loss function.

 Rajiv Nair

00:28:47

I'm sorry. How did you get that formula for cost.

 Todd Holloway

00:28:53

So if you're calling a sink introduced want to talk about logistic function.

00:28:59

We introduced the surrogate cost functions letter B convex

00:29:05

So I'm just using that cost question. So with any of these things that you can show you the record place other things, right. So we didn't have to use this activation function of being unhappy users cost function, but these are the ones up choosing us

00:29:18

And why am I choosing to use well. Which isn't to use

00:29:22

The logistic function for

00:29:25

The activation functions because it's

00:29:28

Sort of the most straightforward to do here.

00:29:32

And the one we talked with most of class.

 Lee Moore

00:29:41

CAN YOU EXPLAIN I'M. Where does a three come from. It doesn't mean that number still valid point five to five.

 Unknown Speaker

00:29:50

So,

 Todd Holloway

00:29:53

That's this number. So that's the value of the last note the

 Unknown Speaker

00:29:58

Output

 Todd Holloway

00:30:00

For

 Lee Moore

00:30:06

Alright. So yeah, if you multiply three with a one and a two. Yeah, that's what you would get. Okay.

 Unknown Speaker

00:30:51

Okay.




## Transcript - Back Propagation - first layer - between the output later and the hidden layer



 Todd Holloway

00:30:54

We'll move on to the back propagation

00:31:20

Okay, so back obligations so

00:31:24

Where do you layer by layer backwards. So the first layer. We want to update is the between the output later and the hidden layer one of those parameters. Right. And so the way to think about

00:31:37

back propagation

00:31:48

When we have more nodes on the on the previous layer. In this case, since we have two. Is that what we want to do is, is take our error and assign a like

00:32:02

Sign a fraction to it that we believe that the parameter on one of those notes is responsible for right so basically want to say how much of that area, we believe, W. Five is responsible for and how much of the air w six responsible for

00:32:16

Does that make sense.

00:32:19

This is where so that's where we're going to use the chain rule.

00:32:23

To do a partial derivatives.

00:32:27

To that assignment so

00:32:30

We're going to get there by taking the partial derivative twice that sort of taking the chain rule twice. Right. So we're going to say

00:32:36

We have the era. Now, right, which was the point

 Unknown Speaker

00:32:42

So that was

 Todd Holloway

00:32:44

Where was was one night, right, we said

00:32:50

We have the arrow. So we're gonna say the derivative of the air with respect to w five over the air with just like a W five equals the derivative, the air change in the air with respect the change and white hat.

00:33:05

Times the change in y hat with respect to the change in a three.

00:33:10

Times the change in a three with respect to change and Wi Fi.

00:33:18

So what we're doing here.

00:33:23

Is we're looking at

00:33:27

The derivative of the slow with respect to check the change in in each between each of these, these items right the change in a three.

00:33:36

With respect to w five. So it's really the change with respect to z one

00:33:41

Change from a free from sorry from 012 a three given a change in w

00:33:47

change in y hat.

00:33:50

From

00:33:52

Z one times w given a three.

00:34:00

That's going to help us a sign that that attribution that partial iteration.

00:34:05

So if we break these out. So how do we do that when we first we did two channels. Right. So the first gen gave us web site change changing the air with respect to Wi Fi.

00:34:21

decomposes. The change in the air with respect to y hat times the change in white hat with respect to change in Wi Fi and then we've we've had a terrible again broken out once more.

00:34:33

And so what are these values. Well, the first one that changing the air with respect to change and white hat.

00:34:42

Is going to be the derivative of the cost function, right, because what is the relationship between my hat and the air. Well, the cost function, what's the derivative, the cost of actually going to change. Right. And so it's by definition.

00:34:57

And then the second item is was the change in y hat with respect to the change. So with respect to change in a three. And that's the job of the logistic function.

00:35:10

By definition,

00:35:11

Right.

00:35:13

And then how does a three change with respect to a change in Wi Fi or changes as you want, right, because a three equals, you want to be fine.

00:35:27

Super simple. Right.

00:35:35

Questions on that.

00:35:57

It's just bookkeeping. It's all like if you break down each constituent part super, super simple.

00:36:04

It's just all the bookkeeping is Mr.

 Bruno Todescan

00:36:09

Todd, are these are these partial derivatives.

00:36:15

Yes, he may be partial derivatives. Yeah.

 Todd Holloway

00:36:18

It's partially because it's the change in error with respect to just, okay, great.

 Unknown Speaker

00:36:23

Okay, great.

 Todd Holloway

00:36:35

So go ahead and run those numbers you can use your values, instead of my philosophy is point six times 2.6 you should come on guys are pretty similar line.

 Shaji Kunjumohamed

00:36:51

So the third one question for number one you have. Why had into one one minus Y hat. Right. It has to be sigmoid into one minus the sigmoid. We had tried

00:37:09

Seek more way had into one minus seek mode, we had right or

00:37:16

Or where is this. Oh, the, the stuff we know the number one debate you have the cost logistic function.

 Todd Holloway

00:37:26

Or what about it. Yeah.

 Shaji Kunjumohamed

00:37:28

It is why you had to do one minus Y hat. Right, so it has to be sigmoid of white hat into one minus sigmoid of white hat. Right. It has to be like that right or am I

 Todd Holloway

00:37:41

Know, it was so delicious a function that derivative of it is why

 Unknown Speaker

00:37:47

Okay, yes, this is the definition I wrote down here.

 Todd Holloway

00:37:55

Again, the order is really important. If you mix up the order and all this bookkeeping, you make some design, you'll end up probably changing the grammar in the wrong direction.

00:38:16

Okay, so again we're gonna take a few minutes here and if everyone can try to write this down.

00:38:23

Again, so

00:38:25

Multiple so you can use point six five to prefer a point six five, you should at the end she represented results and okay

00:38:33

Point six five minus one would be

00:38:38

The first component

00:38:42

Because that's this and then point six five times one minus point six five.

00:38:48

And then times

00:38:51

01 which was points, five, seven,

00:38:56

And then she come out to value around might be a little different around minus point 00 0.05

00:39:06

And I do the same thing, but do it for w six

00:39:26

And I use a learning right here points here, which is really high.

00:39:30

What it illustrated, which is the point.

00:39:35

And even with a really high learning rate, you'll notice that our new ways is only going to be 100 larger right so we go from point five 2.51

00:39:49

And from point six one.

00:39:53

Are so it back propagation is timeless to this point is that it believes for w and w six or too small.

00:40:02

At least with respect to that particular trend example.

 Lee Moore

00:40:20

In practice, how can we be positive numbers for the changes because we just have to negative values in the parentheses.

00:40:32

Now we don't. Okay. Sorry, I just got it everyone

 Amy Breden

00:40:38

For the learning rate. How do you in practice actually figure out what that should be like you just pick point two, but I might have picked one. I don't like I don't know what the magnitude should be

 Todd Holloway

00:40:52

In practice, these days, people most often adaptive algorithms.

00:40:57

Like RMS prop. And so these are algorithms. They're built in the cycle learning TensorFlow and carries

00:41:05

Their algorithms that will like the initialized to value, but then they change it. So the idea is that like as you start converting

00:41:13

You want smaller and smaller like basically early, early in process to really highlight any right so you move quickly to the better values. But then as you start to converge. You want small it's parlor anyways.

 Amy Breden

00:41:26

And it just happens to be very similar to what our error was that's not at all related

 Todd Holloway

00:41:34

What is the weather.

 Amy Breden

00:41:36

Or error was point one nine but the learning that you chose was point two. That's not related. Right. Okay.

 Todd Holloway

00:41:55

You guys all good roughly the same result.

00:42:11

Slide this over here as you can still see you want to bring up




## Transcript - Back Propagation - next layer - between the hidden later and the first layer



00:42:17

This is where it gets to be a lot of bookkeeping.

00:42:20

The hidden layer.

00:42:24

Same process. But now, obviously we have, first of all, we have four inches before parameters.

00:42:33

And second of all, to compute an update to W one, we have to account for the update to Wi Fi. So they start to compose your parlance of every layer you add in that work as a backdrop. It's more and more bookkeeping.

00:42:49

And so same formula. It's still

00:42:54

Change the air with respect, you're changing W one decompose to the three constituent parts.

00:43:01

And the second two parts are still pretty straightforward. So the change in a one with respect to change in w one is going to be x one, which in this case x one is zero right zero

 Unknown Speaker

00:43:17

So,

 Todd Holloway

00:43:23

X one.

00:43:26

So that's the same as before, right. So it's the same same Z one here, we have x one here. So, then a changing XE one with respect to change in the area one. Well, that's the activation function.

00:43:40

Derivative so

00:43:43

And we're using the same activation function. That means the output layer. So it's going to be here. It's going to be point three times one minus

00:43:51

Y hat effectively the prediction on the hidden node why at times one minus point

00:44:00

You kind of think of it as like, part of the reason back propagation is a kind of why was back probably isn't hard to embed like why did it take people you know 10 plus years to a method

00:44:11

Like part of its figuring out like you're trying to assign error on the hidden knows, but you don't know what it is they're supposed to be predicting right

00:44:21

So back propagation in a way of saying is we're making a claim about what the hidden known should be predicting and what the error of the hidden note is

00:44:33

OK, so it's these two components. Pretty straightforward. They look a lot like on the output layer, the least two components.

00:44:40

But the same, right. So the difference is that when it gets to this component you try to compete this now. It's a composition of all the previous players.

00:44:49

And so for me, a simple Valley having one edge there and one up a note. But even so, we've now got that change in the air with respect to changes in

00:45:02

Z one

00:45:06

Is defined as the change in the air with respect to change and what has changed. What have a three change and change in a three, which is like the shamans he wants. So now you've put everything you computed over here into this term.

00:45:22

That we added a second hidden layer. We will now have

00:45:27

Everything every component it computed here as the single term in that

00:45:33

Competition.

00:45:36

Make sense

00:45:42

To try to read this out of become

00:46:08

Try to write this out for all four winds, if you do that.

00:46:13

We can update the model and then we can do another forward pass and see if recalls better

00:46:55

This all makes sense or isn't clear as one

00:47:11

Kevin, what do you think

 Kevin Stone

00:47:17

Yeah, it makes sense. I'm

00:47:20

Looking at it but

 Unknown Speaker

00:47:23

Thanks.

 Todd Holloway

00:47:34

Okay, so we'll wait for the first person. The first person who comes up with all of the updates, let us know.

00:48:21

It's not that hard, right, once you go through it all, once you kind of like, you know, not that complicated algorithm is just a lot of bookkeeping. Right.

00:48:33

It's just a lot of bookkeeping and you just need to know. You need to have the continuous derivatives continuously differentiable function.

00:48:56

You can see how there's the problem of vanishing gradient, as it's called, where as you add more layers. The numbers you're multiplying keep getting smaller and smaller. And so the left most layers start to become

00:49:15

Not easy not easily updatable not easily updated.

00:49:20

Which is why people have moved away from

00:49:23

These activation functions that have a

00:49:27

Less shape to them.

00:49:30

Toward ones that have

00:49:33

Like a rally.

00:49:35

That is not a synthetic

00:49:38

That does not just approach a number

00:49:42

Where there's a hard part

00:49:47

Because the vanishing gradient, the small numbers, but come from.

00:49:51

If you're applying activation, the smaller and smaller numbers.

00:49:55

Sort of numbers are closer, closer to 01 or minus one on one. It's smaller and smaller numbers back in sponsored updates.

00:50:02

Which means the miles that learning quickly,

 Padma Sridhar

00:50:19

In the derivative of the errors with respect to w one is that coming from point three times one minus

00:50:29

Point five seven

00:50:34

You're saying this number right here 23 or four this point you want. Yeah.

00:50:40

This point to one. Yeah.

 Unknown Speaker

00:50:43

Is

 Todd Holloway

00:50:46

Yes, it was a point three times one minus point three of using slightly different numbers. I mean, you might get something slightly different.

00:51:01

Dogs are in the background.

00:51:04

My lectures.

00:51:21

Questions.

00:51:25

All the way it's something

00:51:56

Again she got a slightly closer prediction one after you've updated them all at one time.

00:52:03

And you might want to like if you'd be worried if it was too big of a jump in that direction. I'd be a red flag.

00:52:11

We definitely but but obviously if you

00:52:14

Listen to think about like to take it to an extreme. If you just have one training example and he kept doing exactly what we did. Over and over again, it should converge on a perfect prediction in short order order right to be overfitting, so to speak, or maybe

00:52:28

Maybe not really the only the only thing you want to predict that it's not orbiting fitting, but

00:52:36

But it should just keep it. One example back propagation should just keep moving the model in the direction of having perfect accuracy and not one example.

00:52:56

You have enough nodes and parameters to represent the function

00:54:07

How's it going so far.

00:54:17

Everybody good

00:54:43

According to the bathroom THINK THEY'RE NOT TALKING WHAT'S GOING ON. I was off.

 Lee Moore

00:55:10

Todd, I'm probably still a little confused about why

00:55:16

The derivative of z one with respect to A one is, yeah, this point three times one point but yeah the whites, the point three, I guess, is

 Todd Holloway

00:55:30

Yeah, good question. So maybe because

00:55:35

It's a function of the input right and the input is going to be zero times point one.

 Lee Moore

00:55:41

Okay, yes.

00:55:43

Yeah.

 Kevin Stone

00:55:49

That note that should be Z one, not a one.

00:55:53

Year.

00:55:56

I think Z one is point

00:55:59

Five, seven or something.

 Todd Holloway

00:56:08

Sir, what do you make them.

 Kevin Stone

00:56:10

Well, yeah, that little star to the right of the equation.

00:56:16

Oh yeah there. Oh.

00:56:20

I think XE one is point 570

 Todd Holloway

00:56:23

You're right, you're right.

00:56:26

That's why. Oh, that's what that was the confusion to

00:56:34

That's right. So it's been so much point five, seven times one minus point seven so be

00:56:44

Yeah look that up from what your previous

00:56:50

Thanks.

00:57:33

Was

00:57:35

A little bit.

00:57:38

It's more authentic.

00:57:52

It's tricky if you pick up a

00:57:55

Example looked a lot different textbooks and now they have a tried

00:58:00

And

00:58:02

It's tricky because there's a lot of different treatments of back propagation, with all the, you know, ultimately, it's all the

00:58:08

Same but

00:58:11

One of the ways that Scrabble back propagation works.

00:58:14

Like this is just kind of like trying to distill it down as much as possible. Just the pure calculations.

00:58:52

Well, somebody's got it.

00:59:17

Next time I to just be my iPod in real time.

00:59:21

And then it might be either

00:59:24

But these are all on

00:59:48

I'm sure

01:00:07

Know,

01:00:17

Getting the right direction or

01:00:25

You

 Craig Fleischman

01:00:27

Need to go back and redo the math.

 Todd Holloway

01:00:38

You

 Bruno Todescan

01:00:41

Know, I, I got a little bit lost in the derivatives. I was trying to like compute the derivative on my own, like, make sure they knew where the derivative came from.

01:00:52

It's like apply the dirt. The derivatives to the cost function and then solve that. And I got a little bit my I really get the brush.

01:01:02

Yeah, kind of embarrassing but

01:01:05

I stopped there and then the number is kind of confused me so I'm I'm rambling into right now on the side me

 Amy Breden

01:01:17

By the way, the result I got lost with what we're supposed to do after we got the W one, W. W. W. Four. I didn't know what to do with them.

 Todd Holloway

01:01:27

Would help you guys go back over real quick. What's more,

 Unknown Speaker

01:01:30

Yes.

 Todd Holloway

01:01:41

Okay, let's go, let's go to the whole thing real quick.

01:01:45

So again,

01:01:48

Stuff. We now can we add the logistic costs function sort of religiously activation function of all all three nodes in the hidden layer. The two hidden layer and then when I put could use a different one there. It'd be even easier if we use the regular really has the super simple derivative

01:02:07

Then we have a cost function together we're using a surrogate cost function that comes from the ACE think

01:02:14

We use a surrogate because continuously differentiable, and has a simple derivative and the derivative of it. The Lucy cost me to the surrogate, which is a cost function is of Y hat minus y.

01:02:26

Then just writing the chain rule when it is

01:02:30

You know is that this is only one decomposition composition. So whenever we're using the we're actually plant general twice.

01:02:37

And then the way to update rule, then we're just using a fixed clarity, right, a point to

01:02:42

That's the stuff, stuff we know

01:02:45

Here again, kind of the simplest network that's still interesting if we only permitted even small and only have one hit a note, it wouldn't be just enough and so began. You got one more down it's fully connected, you have a ton more calculations to make

01:03:01

And so

01:03:04

So break into a one and z one A one is before the activation function XE one is after the activation function on the output layer subcontinent Z three we're going to call it a while because the prediction.

01:03:17

Again, we're so reduce the costs are going to set at some point, we get to a particular example that example.

01:03:26

We get to a particular example in at that point the parameters have been set 2.1 point 2.3 point

01:03:33

Six. And then we have this as a particular example, and we want to compute the air and then update the parameters, based on the error and then see if we're more accurate on six which we should be

01:03:44

It should be if we're stochastic means that we're only doing one example at a time. The model should always get better in the direction of that example. Right. And so, so that's our example is x equals one, or sorry, x equals x, y equals zero, x

01:04:00

Two equals one and our prediction expected prediction is one

01:04:06

Video forward pass

01:04:11

Video forward pass somewhere here.

01:04:22

We are four paths.

01:04:26

And literally just means doing the three dot products right so we're gonna do adopt product times logistic function dot product trends logistic function dot product has logistic function that's a prediction. So if we were doing this.

01:04:43

Other code.

01:04:46

Produce that our code.

01:04:50

So if we go to reduce on our, on our GPU machine and vector eyes in it, we would just have three matrix operations and three multiplication.

01:05:04

So this would be super, super buzzed with hardware acceleration. Okay. And so then we want to do.

01:05:10

back propagation. We do in two steps. The first step is to back propagate the output layer to the hidden layer propagating we cycle back propagation or by copying the air right

01:05:20

And so, literally meaning that we're going to, we're trying to push the air back and it bypasses the network. And in doing so, partially attribute the air to the different weights.

01:05:35

And so we're going to do that by taking the air with respect to a particular ways

01:05:42

So the question here is basically question we're trying to answer it. This equation is what is the contribution of W five to the overall error. Right.

01:05:50

And we're getting that this by by using the general twice to decompose into three terms which we can just look up right, basically. And so the first one is, what does it. How does a three change with respect to Wi Fi, what changes by the one

01:06:05

Just the statement of all the neural network works.

01:06:08

The second thing is, how does the prediction itself change with respect to a change in a three fold the difference between the prediction and a three is the activation function right

01:06:19

And so it's going to change by the derivative of that commission function and then the change in the air with respect to the change of the prediction is what's called air, which is the cost function. So it's going to be derivative of that right

01:06:33

And so these are just basically statements of how we created our own network.

01:06:40

So if we multiply that all together, we should get a

01:06:46

change in w five

01:06:51

That we want to make after destiny by learning right

01:06:57

So, SO W five here. It was point five. Now we're going to dampen it with warranty rate of point to. And so as a result we get a new W five which is pointing right so we do the same thing for w six and then we want to do back propagation to the next layer.

01:07:18

For a hidden layer to our input layer assigning a partially attributing yard each of those w one w four so that we can update those always make sense.

01:07:32

It's the same process here. The only difference this time is that now to go from the air. The final error to z one

01:07:45

We have to incorporate everything we just previously computed

01:07:50

We added another layer, we would have to do the same thing but previously, but to go from like this would be the two and this one easy one. And we'd have to to change Eric's from one to two

01:08:02

Sorry, a change in the in the air with respect to z one in that case would be everything and two stops was going to keep composing

01:08:11

So this is going to turn that changes from the last layer in that it now contains the last layer is computations, which we could just look off from a lookup table we're implementing this right

01:08:24

So if we compete. This we should get that

01:08:30

W one does not change W two does not change W three increases by a little bit and W four increases by a little bit by little bit. I've been to 1000 right

01:08:45

So you also notice something else is that as a backdrop from left from right to left.

01:08:51

If we use an S shaped activation, we get smaller and smaller multiplication. So we tend to see less change in the earlier layers. Again, as part of the reason why people don't use these activation functions as much anymore to use Rails.

01:09:05

And similar activation function.

## Transcript - Run forward propagation again using updated weights to get a new prediction





01:09:12

So now if you plug those in. If you plug in the same W on the same W 2.302 as W 3.402 is W 4.51 is w six five and point six one and w six and run that same train example again, you should get a prediction.

01:09:35

I believe it's

01:09:38

Is like slightly higher in the direction of expected

 Unknown Speaker

01:09:42

That

 Amy Breden

01:09:46

I got point five minutes 6595

 Todd Holloway

01:09:49

Perfect. So we were successful at updating the ways in the direction of desired direction. Right.

 Rajiv Nair

01:10:00

Sorry. How did the initial, initial weights that he just randomly pick some are like how I know you probably did it. How is it typically done

 Todd Holloway

01:10:08

Yeah, so it's typically typically you pick random values on a very small. You don't want to pick zeros. Is there no I don't think zeros.

 Lee Moore

01:10:18

Yeah, doesn't update

 Todd Holloway

01:10:21

Why might you want random numbers. And so just to save a life. Everything

 Amy Breden

01:10:28

Optimize like a local.

01:10:31

Like minimum or maximum

 Todd Holloway

01:10:36

That could happen. Also, just if you have a uniform weights across the network.

01:10:42

Depending upon the distribution of values in your in your example you could end up not basically learning features learning interesting features. So you can just get some

01:10:54

If you want some variability in your in your

01:10:57

Network to start with.

 Rajiv Nair

01:10:59

It would be the end and weights between zero and one.

 Todd Holloway

01:11:03

Now countries are typically people to very small numbers to start with. Okay. Yeah. Because if you if you pick two highways and the random

01:11:13

That can

01:11:16

Happen there.

01:11:23

If you pick your highways in a random. I think you have trouble with convergence.

01:11:31

Potentially

01:11:36

I'll see you don't want the fact that you don't want to like the random like you really want them to be zero. Like, you don't want the initial settings to influence with a network learns

01:11:44

But you can't sum zero because then you won't learn anything. Right. So you want them to be small around them.

 Padma Sridhar

01:11:53

And

01:11:56

Equally evenly balanced sort of network. If we have attributed the same way to every node, then when we do the SEC propagation, would we end up seeing an equal amount of air attributed to each other.

 Todd Holloway

01:12:11

If the feature values were to be all the same across all features and then you're doing the first pass through. And so all your ways for the same thing. Yeah, equal attribution and you want to really learn anything useful. Yeah.

 Unknown Speaker

01:12:25

Okay.

 Shaji Kunjumohamed

01:12:28

Oh, for the bias terms basically for the bad back propagation

01:12:34

treated differently. The same. That's why.

 Todd Holloway

01:12:36

That's why they put bias in here just because they want they want added into the learning. It's the same. It's exactly the same process. Okay.

01:12:44

You can even think of when you when you go from like a like a logistic regression equation to a single hidden layer neural network.

01:12:51

You can think of the bias note as being the intercept with a, with the input always be in one. So you can think of, like, like if you do your homework is prediction and your equation is price equals

01:13:03

Alpha. Alpha is the intercept plus beta one times number of bass beta two times number of beds. You can think of it as being price equals one times alpha plus

01:13:18

Beta one of the replies, like there's a feature value there. They're just always one. And that's the wise.

01:13:24

So, so you can treat it. So really the bias is just another feature that is the same.

01:13:30

Same input.

01:13:35

Other questions.

01:13:42

Okay, I hope that was helpful, I hope that shed a little bit of light. If you have other questions. If you think about it more, let me know. Otherwise, I'll see you guys next time.

 Craig Fleischman

01:13:53

Thanks john

 Padma Sridhar

01:13:54

Thank you.

 Lee Moore

01:13:55

Thank you.

 Sonal Thakkar

01:13:57

Thank you.


