# B"H

## The Partial Derivative

### Intro

The **partial derivative** measures how much impact a **single** input has on a function’s output. 

The method for calculating a partial derivative is the **same** as for the derivatives explained in the previous chapter; we simply repeat the process for each of the independent inputs.

<br>

---

<br>

We have to calculate the derivative with respect to each input **separately** to learn about each of them. That’s why we call these partial derivatives **with respect to a given input**.

<br>

---

<br>

The partial derivative is a **single equation**, and the full multivariate function’s derivative consists of a set of equations called the **gradient**. 

The gradient is a **vector** with one entry per input; each entry is the partial derivative with respect to that input.

<br>

---

<br>

To denote the partial derivative, we’ll be using the $\large ∂$ notation. It’s very similar to Leibniz’s notation: we only need to replace the differential operator $\large d$ with $\large ∂$.

> While the $\large d$ operator can also be applied to a multivariate function, its meaning is a bit different: it denotes the rate of the function’s change with respect to the given input **when other inputs might change as well** (the total derivative), and it is used mostly in physics.
>
> We are interested in partial derivatives, where we find the impact of a given input on the output **while treating all of the other inputs as constants**. We care about the impact of individual inputs because our goal, in the model, is to update parameters. The $\large ∂$ operator says exactly that: this is a partial derivative.

![](https://drive.google.com/uc?id=11Y3HORRpUJeOfxRz9lIe-iAfssUgQlIQ)
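The idea of holding the other inputs constant can be checked numerically. The following is a minimal sketch, not part of the original material: `partial_derivative` and the sample function `f` are hypothetical names chosen for illustration, and the derivative is approximated with a central difference rather than solved analytically.

```python
def partial_derivative(f, inputs, index, h=1e-6):
    """Approximate the partial derivative of f with respect to inputs[index].

    Only the chosen input is nudged; every other input keeps its value,
    which is the "treat the others as constants" idea.
    """
    up = list(inputs)
    down = list(inputs)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


# Hypothetical example function: f(x, y) = x^2 * y
def f(x, y):
    return x ** 2 * y


point = [3.0, 2.0]
df_dx = partial_derivative(f, point, 0)  # analytically 2xy = 12
df_dy = partial_derivative(f, point, 1)  # analytically x^2 = 9
print(df_dx, df_dy)
```

The two printed values should be close to 12 and 9, up to floating-point error.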

### Partial Derivative of a Sum

#### Example 1

![](https://drive.google.com/uc?id=1BEQjT_gFJGBIYr2hEkwoa370aBmgy6iY)

#### Example 2

![](https://drive.google.com/uc?id=1rpGJsbz6CWgWnlhvf5kV30obmSXERRQk)
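As an additional worked example (an illustrative function chosen here, which may differ from the ones pictured above), the sum rule lets us differentiate each term separately while treating the other variable as a constant:

> $\large f(x, y) = 2x + 3y^2$
>
> $\large \frac{\partial}{\partial x} f(x, y) = \frac{\partial}{\partial x} 2x + \frac{\partial}{\partial x} 3y^2 = 2 + 0 = 2$
>
> $\large \frac{\partial}{\partial y} f(x, y) = \frac{\partial}{\partial y} 2x + \frac{\partial}{\partial y} 3y^2 = 0 + 6y = 6y$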

### Partial Derivative of Multiplication

#### Example 1

![](https://drive.google.com/uc?id=1-4bOjxzAB_nWlqMwEYXfKuxGbQb_ap-5)

We have already mentioned that we need to treat the other independent variables as constants, and we also have learned that we can move constants to the outside of the derivative. 

That’s exactly how we calculate the partial derivative of a multiplication: we treat the other variables as constants and move them outside of the derivative.

<br>

---


The intuition, when calculating the partial derivative with respect to $\large x$:
- Every change of $\large x$ by $1$ changes the function’s output by $\large y$.
- For example:
    - If $y=3$ and $x=1$, the result is $1 \cdot 3=3$.
    - When we change $x$ by $1$, so that $y=3$ and $x=2$, the result is $2 \cdot 3=6$.
    - We changed $x$ by $1$ and the result changed by $3$, which is the value of $y$.
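
The same intuition as a short Python sketch (the function name `f` is only an illustration of $x \cdot y$, not code from the original material):

```python
# f(x, y) = x * y, the multiplication discussed above
def f(x, y):
    return x * y


y = 3
before = f(1, y)  # 1 * 3 = 3
after = f(2, y)   # 2 * 3 = 6

# Changing x by 1 while y stays constant changes the output by y
change = after - before
print(change)  # 3

# The same holds from any starting x, because the partial derivative
# with respect to x is y everywhere
changes = [f(x + 1, y) - f(x, y) for x in range(-2, 3)]
print(changes)  # [3, 3, 3, 3, 3]
```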

#### Example 2

This example is longer, but not more complicated.


![](https://drive.google.com/uc?id=16HhEK2G39JoJU7yP4rXnmdS7-JEQDbQb)

---

![](https://drive.google.com/uc?id=1oJMWNOz4wbRZbvVQ-wTfluabTEK9MPUu)

---

![](https://drive.google.com/uc?id=10NmqfFM_bRCE345FGkcDsD1hDaeLq37D)

### Partial Derivative of Max

We know that the derivative of $x$ **with respect to $x$** equals $1$, so the derivative of $max(x, y)$ with respect to $x$ equals $1$ if $x$ is greater than $y$, since the function returns $x$ in that case.

In the other case, where $y$ is greater than $x$ and gets returned instead, the derivative of $max(x, y)$ **with respect to $x$** equals $0$: we treat $y$ as a constant, and the derivative of a constant **with respect to $x$** equals $0$.

![](https://drive.google.com/uc?id=1p8sfbRH0Hwj1cJHQsAobC-qW4T6AOPdw)

We can denote that as $1(x > y)$, which means $1$ if the condition is met, and $0$ otherwise.

> **Note:** we could also calculate the partial derivative of max() **with respect to y**, but we won’t need it anywhere in this book.
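
A minimal sketch of the indicator $1(x > y)$ in Python. The function name is hypothetical, and returning $0$ at the tie $x = y$ (where the derivative is not defined) is a convention assumed here rather than something stated above:

```python
def max_partial_x(x, y):
    """Partial derivative of max(x, y) with respect to x, written as 1(x > y)."""
    return 1.0 if x > y else 0.0


print(max_partial_x(5.0, 2.0))  # 1.0, max() returns x
print(max_partial_x(2.0, 5.0))  # 0.0, max() returns y, a constant here

# Numerical check away from the tie: nudge x only and watch max(x, y)
h = 1e-6
numeric = (max(5.0 + h, 2.0) - max(5.0 - h, 2.0)) / (2 * h)
print(numeric)  # close to 1.0
```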


### Partial Derivative of Max with Constant

![](https://drive.google.com/uc?id=1Tn5WUOU-3h8oD7WdfVdDGtnJs8ec3V8m)

As above, the derivative is $1$ when $x$ is greater than $0$; otherwise, it’s $0$.


Notice that since this function takes a **single parameter**, we used the $\large d$ operator instead of $\large ∂$, because this is an ordinary (non-partial) derivative.


Handling this is going to be useful when we calculate the derivative of the ReLU activation function since that activation function is defined as $max(x, 0)$.
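
As a sketch of where this leads, here are ReLU and its derivative in plain Python. The function names are hypothetical, and the value $0$ at exactly $x = 0$ is an assumed convention:

```python
def relu(x):
    """ReLU activation: max(x, 0)."""
    return max(x, 0.0)


def relu_derivative(x):
    """Derivative of max(x, 0): 1 when x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
derivatives = [relu_derivative(x) for x in inputs]
print(outputs)      # [0.0, 0.0, 0.0, 0.5, 2.0]
print(derivatives)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```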



## The Gradient

As we mentioned at the beginning of this chapter, the **gradient** is a **vector** composed of all of the partial derivatives of a function, calculated with respect to each input variable.

### Example

![](https://drive.google.com/uc?id=1i4CiZ_4nEdDS-_ljBuGaoig4Eg0QnOiM)

That’s all we have to know about the gradient: it’s a vector of all of the possible partial derivatives of the function, and we denote it using $\large ∇$, the nabla symbol, which looks like an inverted delta.
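
To make the idea concrete, here is a small sketch that builds a gradient numerically, one partial derivative per input. The function `f` below is a hypothetical example (not necessarily the one pictured above), and `numerical_gradient` is an illustrative helper based on central differences:

```python
def numerical_gradient(f, inputs, h=1e-6):
    """Return a list with one approximate partial derivative per input."""
    gradient = []
    for i in range(len(inputs)):
        up = list(inputs)
        down = list(inputs)
        up[i] += h    # nudge only input i ...
        down[i] -= h  # ... while the others stay constant
        gradient.append((f(*up) - f(*down)) / (2 * h))
    return gradient


# Hypothetical function: f(x, y, z) = 3x^3 z - y^2 + 5z + 2yz
def f(x, y, z):
    return 3 * x ** 3 * z - y ** 2 + 5 * z + 2 * y * z


# Analytic gradient: [9x^2 z, -2y + 2z, 3x^3 + 5 + 2y]
def analytic_gradient(x, y, z):
    return [9 * x ** 2 * z, -2 * y + 2 * z, 3 * x ** 3 + 5 + 2 * y]


point = [1.0, 2.0, 3.0]
numeric = numerical_gradient(f, point)
exact = analytic_gradient(*point)
print(numeric)  # approximately [27, 2, 12]
print(exact)    # [27.0, 2.0, 12.0]
```

The gradient has three entries because the function has three inputs.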

<br>

---

We’ll be using **derivatives** of single-parameter functions and **gradients** of multivariate functions to perform **gradient descent** using the **chain rule**, or, in other words, to perform the **backward pass**, which is a part of the model training.

## The Chain Rule

### Intro

Let’s take 2 functions:

![](https://drive.google.com/uc?id=1TPkvGXoUGU8INubvqDs0uy88MKJj4gpJ)

<br>

We could write the same calculation as:

![](https://drive.google.com/uc?id=1f-t6NRPXHmQ_72Pu95iKOLY4vyZn611s)

<br>

The output of the function $g$ is influenced by $x$ in some way, so there must exist a derivative which can inform us of this influence.

### The Why

We can look at the **loss function** not only as a function that takes the **model’s output and targets as parameters** to produce the error, but also as a function that takes **targets, samples, and all of the weights and biases as inputs** if we chain all of the functions performed during the forward pass. 

To improve loss, we need to learn how each weight and bias impacts it. How do we do that for a chain of functions? By using the chain rule.

<br>


The chain rule turns out to be the most important rule in finding the **impact of a single input on the output of a chain of functions**, which is the calculation of loss in our case.

### The Rule

The chain rule: **the derivative of a function chain** $\large =$ **the _product_ of derivatives of _all_ of the functions in this chain**.

For example:

![](https://drive.google.com/uc?id=1cEMUvaXaFihOYWWnv36pm6FRkz3M6Rhh)



### With Multiple Inputs

Here's an example of 3 functions and multiple inputs. 

We'll show the partial derivative of this function **with respect to x**.

> **IMPORTANT NOTE**: we can’t use the prime ($\prime$) notation in this case since we have to state which variable we are differentiating with respect to.

![](https://drive.google.com/uc?id=1PmtF49J2vCnDym7f5loQ6nEWjZW8fFy4)

<br>

Here is the same function composition in pseudo-code:

```
f(
    g(
        y,
        h(x,z)          
    )    
)
```

> **IMPORTANT NOTE**: 
>
> Looking at the above pseudo-code we can now see why the middle derivative is with respect to $h(x, z)$ and not $y$.
>
> This is because $h(x, z)$ lies on the chain that leads to the parameter $x$.
>
> This idea will become clearer as we work through some examples.
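
Here is a sketch of that chain with concrete functions. The definitions of `f`, `g`, and `h` below are hypothetical, chosen only so that the partial derivative with respect to $x$ can be computed by hand and compared against a numerical estimate:

```python
# Hypothetical functions matching the shape f(g(y, h(x, z)))
def h(x, z):
    return x * z          # dh/dx = z


def g(y, h_out):
    return y + 3 * h_out  # dg/dh = 3 (y is treated as a constant)


def f(g_out):
    return g_out ** 2     # df/dg = 2 * g


def chain(x, y, z):
    return f(g(y, h(x, z)))


def chain_partial_x(x, y, z):
    """Chain rule: df/dg * dg/dh * dh/dx."""
    df_dg = 2 * g(y, h(x, z))
    dg_dh = 3
    dh_dx = z
    return df_dg * dg_dh * dh_dx


x, y, z = 1.0, 2.0, 3.0
analytic = chain_partial_x(x, y, z)  # 2 * 11 * 3 * 3 = 198

# Numerical estimate: nudge x only, keep y and z constant
step = 1e-6
numeric = (chain(x + step, y, z) - chain(x - step, y, z)) / (2 * step)
print(analytic, numeric)
```

Note that the middle factor is the derivative of `g` with respect to its second argument, the output of `h`, because that is the argument through which $x$ reaches `g`.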

### Example

Let’s solve the derivative of $\large h(x) = 3(2x^2)^5$ 

The first thing that we can notice here is that we have a composite function that can be split into two simpler functions:
- $\large g(x) = 2x^2$ 
- $\large f(y) = 3(y)^5$
- i.e. $\large y$ is $\large g(x)$

In short: $\large h(x) = f(g(x)) = 3(2x^2)^5$ 

<br>

---

Let's solve it.

**Step 1 - Get outer function derivative**
- Don't touch what's inside the parentheses, i.e. the inner function.

![](https://drive.google.com/uc?id=11U_LvFnYCiwKIB-FwuKfvIXdxYzeQreo)

**Step 2 - Multiply the above derivative by the derivative of the inner function**

![](https://drive.google.com/uc?id=19ycpI14ZCAFHv-NZ7hmJx2CcVlt3Hlr7)

<br>

---

In theory, we could just stop here. We can plug an input into $15(2x^2)^4 \cdot 4x$ and get the answer.

That said, let's simplify:

![](https://drive.google.com/uc?id=1aaNhSmUUUrz9jtzHRNgx9EPX5ONaiIVk)

#### Alternative

Alternatively, we **do NOT** have to treat it as two functions, but rather just keep it as one single function.

Let's solve the derivative of $\large h(x) = 3(2x^2)^5$ without separating it into two functions. 

---

**Step 1 - Simplify Expression**

> $\large 3(2x^2)^5$
>
> $\large = 3 \cdot 2^5 \cdot x^{2 \cdot 5}$
>
> $\large = 3 \cdot 32 \cdot x^{10}$
>
> $\large = 96  x^{10}$

**Step 2 - Solve Derivative**

> $\large h(x) = 96 x^{10}$
>
> $\large h^\prime(x) = \frac{d}{dx} 96 x^{10}$
>
> $\large h^\prime(x) = 10 \cdot 96 x^{9}$
>
> $\large h^\prime(x) = 960 x^{9}$

#### IMPORTANT NOTE

In other words, the **Chain Rule** is a tool that can be very helpful at times, but it is not always strictly required.
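
A quick check, as a sketch, that both routes give the same derivative. The function names are illustrative; integer inputs are used so the two formulas can be compared exactly:

```python
def h(x):
    """The original function h(x) = 3(2x^2)^5."""
    return 3 * (2 * x ** 2) ** 5


def derivative_chain_rule(x):
    """Unsimplified chain rule result: 15(2x^2)^4 * 4x."""
    return 15 * (2 * x ** 2) ** 4 * 4 * x


def derivative_simplified(x):
    """Result after simplifying first: 960x^9."""
    return 960 * x ** 9


for x in [-2, -1, 0, 1, 2, 3]:
    assert derivative_chain_rule(x) == derivative_simplified(x)

# Both also agree with a central-difference estimate of h'(x) at x = 1
step = 1e-6
numeric = (h(1 + step) - h(1 - step)) / (2 * step)
print(derivative_simplified(1), numeric)  # 960 and approximately 960
```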

### When to use/not use

See [here](https://socratic.org/questions/when-do-you-use-the-chain-rule-instead-of-the-product-rule) for details.

<br>

---

**Explanation 1**

![](https://drive.google.com/uc?id=14lvO4TxcFDibRtR1ASsVZEyWrdxG0IsN)

<br>

**Explanation 2**

![](https://drive.google.com/uc?id=1BwuD4azFg2yA3kj8TDBIe1gaGfokvypU)

#### Side note

- Reminder of [Product Rule](https://socratic.org/calculus/basic-differentiation-rules/product-rule)
- Reminder of [Quotient Rule](https://socratic.org/calculus/basic-differentiation-rules/quotient-rule)