## Machine Learning and Artificial Intelligence 
Summer High School Academic Program for Engineers (2025)
## Calculus Review (just the basics)


Overview: 
* 1. Limits of Sequences and Functions
* 2. Derivatives of Single-Variable Functions
* 3. Partial Derivatives of Multi-Variable Functions

## 1. Limits of Sequences and Functions 

### Sequences

A **sequence** is an ordered list of elements (usually numbers). Each element $a_n$ is identified by its index $n$. 

Often, we can define the next element of a sequence in terms of the previous ones. For example: 2,4,6,8, ... 
Here, $a_1 = 2$ and every other element is $a_n = a_{n-1} + 2$. Alternatively, we can define the sequence based on $n$ directly; here $a_n = 2n$. 

Another example: $a_1 = 1$ and  $a_n = \frac{1}{2} a_{n-1}$. So we get $1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, ...$
We can also write this as $a_n = \frac{1}{2^{n-1}}$

These are examples of **infinite sequences**, but a sequence can also be a **finite sequence**. 

An infinite sequence may be **divergent**, that is, it grows without bound as $n\rightarrow \infty$ or it oscillates between values. 
It can also be **convergent**, that is, it approaches a specific value as $n\rightarrow \infty$. For example, $\lim\limits_{n\rightarrow \infty} \frac{1}{2^{n-1}} = 0$. 

Formally, a sequence $(a_n)$ converges towards $x$, that is $\lim\limits_{n\rightarrow \infty} a_n = x$, if for each $\epsilon > 0$ there exists a natural number $N > 0$ such that for each $n \geq N$ we have $|a_n - x| < \epsilon$.
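The formal definition can be explored numerically. The following is a minimal sketch for the example sequence $a_n = \frac{1}{2^{n-1}}$ with limit $0$; the helper name `first_index_within` and the search cap `max_n` are assumptions made for illustration. Because this particular sequence is decreasing, the first index that falls within $\epsilon$ of the limit can serve as the $N$ from the definition.

```python
def a(n):
    """n-th element of the example sequence a_n = 1 / 2^(n-1), with n starting at 1."""
    return 1 / 2 ** (n - 1)


def first_index_within(epsilon, limit=0.0, max_n=10_000):
    """Return the smallest n with |a_n - limit| < epsilon, or None if not found.

    The search is capped at max_n (an arbitrary choice for this sketch).
    """
    for n in range(1, max_n + 1):
        if abs(a(n) - limit) < epsilon:
            return n
    return None


# A smaller epsilon requires going further out in the sequence.
for epsilon in (0.1, 0.01, 0.001):
    print(f"epsilon = {epsilon}: N = {first_index_within(epsilon)}")
```

This only illustrates the definition for a few values of $\epsilon$; it does not prove convergence, which requires the argument to hold for every $\epsilon > 0$.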



### Functions

For functions we are interested in the limit at a specific point $a$: $\lim\limits_{x \rightarrow a} f(x) = L$.

The value $f(x)$ approaches the limit $L$ as $x$ approaches $a$ from either side if and only if for each $\epsilon > 0$ there exists a $\delta > 0$ such that if $0<|x-a|< \delta$, then $|f(x)-L|< \epsilon$.

Note that the limit may still exist if $f(a)$ is undefined!

<img src="https://www.cs.columbia.edu/~bauer/shape/function_limit.png" width=400px>
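To see how a limit can exist at a point where the function is undefined, here is a small numerical sketch. The function $f(x) = \frac{x^2 - 1}{x - 1}$ is a hypothetical example chosen for illustration (it is not the function in the figure): it is undefined at $a = 1$, but equals $x + 1$ everywhere else, so its limit at $1$ is $2$.

```python
def f(x):
    """Hypothetical example: undefined at x = 1, equal to x + 1 elsewhere."""
    return (x**2 - 1) / (x - 1)


a = 1.0

# Evaluate f closer and closer to a from both sides.
for delta in (0.1, 0.01, 0.001):
    print(f"delta = {delta}: f(a - delta) = {f(a - delta):.6f}, "
          f"f(a + delta) = {f(a + delta):.6f}")

# At the point itself the function has no value.
try:
    f(a)
except ZeroDivisionError:
    print("f(1) is undefined")
```

Both columns approach $2$ as `delta` shrinks, even though evaluating `f(1.0)` fails.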

**One-sided Limit**

$\lim\limits_{x\rightarrow a^+} f(x) = L$  means that the limit of $f(x)$ as $x$ approaches $a$ from the right is $L$.  
$\lim\limits_{x\rightarrow a^-} f(x) = L$  means that the limit of $f(x)$ as $x$ approaches $a$ from the left is $L$.  

For a two-sided limit to exist, both one-sided limits must exist and must be identical.

For example, the step function has two one-sided limits at the step, but they are not identical, so the two-sided limit does not exist there. 

<img src="https://www.cs.columbia.edu/~bauer/shape/one_sided__limit.png" width=400px>
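The behavior of the step function can be checked numerically. This sketch assumes one common definition of the step function (value $0$ for negative inputs and $1$ otherwise, with the step at $0$); the exact function in the figure may differ.

```python
def step(x):
    """Assumed step function: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0


# Approach 0 from the left and from the right.
for delta in (0.1, 0.001, 0.00001):
    print(f"delta = {delta}: left = {step(0 - delta)}, right = {step(0 + delta)}")
```

No matter how small `delta` becomes, the left values stay at $0$ and the right values stay at $1$, so the one-sided limits disagree.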

**Other Nonexistent Limits**

There are other cases in which a two-sided limit does not exist. 

For example, $\lim\limits_{x \rightarrow 0} \frac{1}{x}$  "blows up" to positive and negative infinity.

<img src="https://www.cs.columbia.edu/~bauer/shape/hyperbola.png" width=400px>

And $\lim\limits_{x \rightarrow 0} \sin \frac{1}{x}$ oscillates between +1 and -1, no matter how close you get to 0. 

<img src="https://www.cs.columbia.edu/~bauer/shape/sin1overx.png" width=400px>
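The oscillation of $\sin \frac{1}{x}$ can also be observed numerically. The sketch below picks inputs of the form $x = \frac{1}{\pi/2 + 2\pi k}$ (where the function is $+1$) and $x = \frac{1}{3\pi/2 + 2\pi k}$ (where it is $-1$); the choice of $k$ values is arbitrary.

```python
import math


def f(x):
    return math.sin(1 / x)


# Larger k gives inputs closer to 0, but the outputs keep alternating.
for k in (1, 10, 100, 1000):
    x_peak = 1 / (math.pi / 2 + 2 * math.pi * k)
    x_trough = 1 / (3 * math.pi / 2 + 2 * math.pi * k)
    print(f"x = {x_peak:.6f}: f(x) = {f(x_peak):+.3f}    "
          f"x = {x_trough:.6f}: f(x) = {f(x_trough):+.3f}")
```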

**Continuity**

A function $f(x)$ is **continuous at c** if $\lim\limits_{x\rightarrow c} f(x)$ exists, $f(c)$ is defined, and $\lim\limits_{x\rightarrow c}f(x)=f(c)$.

A function is **continuous** if it is continuous at all real values of $c$. 

## 2. Derivatives of Single-Variable Functions

### Slope of a Linear Function 

The idea behind the derivative is that it captures the rate of change of a function $f(x)$, with respect to $x$.

Consider the slope $m$ of a linear function $y = f(x) = m x + b$

<img src="https://www.cs.columbia.edu/~bauer/shape/linear_slope.png" width=200px>

$m = \frac{\Delta f}{\Delta x} = \frac{f(x + \Delta{x}) - f(x)}{\Delta x}$

For linear functions, the rate of change is constant across all values of $x$. 

### Derivative of a function

What if the function is not linear? The rate of change may be different at each point.

<img src="https://www.cs.columbia.edu/~bauer/shape/xsquared.png" width=300px>


The **derivative** of $f$ is a function describing the rate of change for any point $x$.

The derivative of $f$ is defined as 

$$f'(x) = \frac{df}{dx} = \lim\limits_{\Delta x \rightarrow 0} \frac{\Delta f}{\Delta x}$$
$$ = \lim_{\Delta x \rightarrow 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

You can think of the limit as finding better and better estimates for the rate of change by reducing $\Delta x$ until it is infinitesimally small. 

<img src="https://www.cs.columbia.edu/~bauer/shape/xsquared_limit.png" width=300px>


Example: Derivative of $x^2$

$f(x) = x^2$

$\frac{df}{dx} = \lim\limits_{\Delta x \rightarrow 0} \frac{(x + \Delta x)^2 - x^2}{\Delta x}$

$= \lim\limits_{\Delta x \rightarrow 0} \frac{(x^2 + 2 x \Delta x + (\Delta x)^2) - x^2}{\Delta x}$

$ = \lim\limits_{\Delta x \rightarrow 0} \frac{2 x \Delta x + (\Delta x)^2}{\Delta x}$

$ = \lim\limits_{\Delta x \rightarrow 0} (2 x + \Delta x)$ 

$ = 2x $
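The limit can be approximated numerically by evaluating the difference quotient for smaller and smaller $\Delta x$. This is a minimal sketch; the function name `difference_quotient` and the evaluation point $x = 3$ are choices made for illustration.

```python
def f(x):
    return x**2


def difference_quotient(f, x, dx):
    """Slope of the line through (x, f(x)) and (x + dx, f(x + dx))."""
    return (f(x + dx) - f(x)) / dx


x = 3.0
# As dx shrinks, the estimate approaches the derivative 2x = 6.
for dx in (1.0, 0.1, 0.01, 0.001):
    print(f"dx = {dx}: estimate = {difference_quotient(f, x, dx):.6f}")
```

For $f(x) = x^2$ the difference quotient simplifies to $2x + \Delta x$, so each estimate overshoots the true derivative by about $\Delta x$.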


### Some Rules for Derivatives

* **Power rule:**  if $f(x) = x^a$ then $\frac{df}{dx} = a x^{a-1}$

  Examples: $\frac{dx^5}{dx} = 5x^4$  and  $\frac{d\sqrt{x}}{dx} =  \frac{d x^{\frac{1}{2}}}{dx} = \frac{1}{2} x ^{-\frac{1}{2}}$

* **Exponential rule:** if $f(x)  = b^x$ then $\frac{df}{dx} = b^x \ln b$.  ($\ln b$ is the natural logarithm of $b$, so $\ln e^b = b$.)

  Example: $\frac{d e^x}{dx} = e^x \ln e = e^x$

* **Rule of linearity** (linear combinations): if $f(x) = a g(x)  + b h(x)$, then $\frac{df}{dx} = a \frac{dg}{dx} + b \frac{dh}{dx}$.

    Example: $f(x) = 3x^5 + 2x^4 + 3x.$      $\frac{df}{dx} = 15 x^4 + 8 x^3 + 3$

  * **Sum rule**: if $f(x) = g(x) + h(x)$, then $\frac{df}{dx} = \frac{dg}{dx} + \frac{dh}{dx}$

* **Product rule**: if $f(x) = g(x) \cdot h(x)$ then $\frac{df}{dx} = g(x) \frac{dh}{dx} + h(x) \frac{dg}{dx}$

* **Quotient rule:** if $f(x) = \frac{g(x)}{h(x)}$ then $\frac{df}{dx} = \frac{h(x) \frac{dg}{dx} - g(x) \frac{dh}{dx}}{h(x)^2}$
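The rules above can be sanity-checked against a numerical estimate. The sketch below uses a central difference, $\frac{f(x + \Delta x) - f(x - \Delta x)}{2 \Delta x}$, which is a slightly different (and usually more accurate) approximation than the one-sided quotient in the definition. The helper name, the step size, the evaluation point $x = 2$, and the base $b = 3$ are all assumptions made for illustration.

```python
import math


def numerical_derivative(f, x, dx=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)


x = 2.0

# Each entry pairs a function with the derivative value predicted by a rule.
checks = {
    "power rule, x^5": (lambda t: t**5, 5 * x**4),
    "power rule, sqrt(x)": (math.sqrt, 0.5 * x**-0.5),
    "exponential rule, 3^x": (lambda t: 3**t, 3**x * math.log(3)),
    "linearity, 3x^5 + 2x^4 + 3x": (lambda t: 3 * t**5 + 2 * t**4 + 3 * t,
                                    15 * x**4 + 8 * x**3 + 3),
}

for name, (func, expected) in checks.items():
    estimate = numerical_derivative(func, x)
    print(f"{name}: numerical = {estimate:.6f}, rule = {expected:.6f}")
```

The two columns should agree to several decimal places; small differences come from floating-point rounding and the finite step size.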

### The Chain Rule

We will look at the chain rule in detail because of its importance for the backpropagation algorithm in neural networks.

Consider the case of **function composition**: $f(x) = h(g(x)) = (h \circ g)(x)$. The function $h$ is applied to the output of function $g$. 

Note that composition is different from just multiplying the output of the two functions. 
For example, if $g(x) = x^3$  and $h(x) = x^4$ then $h(x) g(x) = x^4 x^3 = x^7$.
But $h(g(x)) = (x^3)^4 = x^{12}$.

* The **Chain Rule** says: If $f(x) = h(g(x))$ then $\frac{df}{dx} = \frac{dh}{dg} \frac{dg}{dx}$

To see why, consider that each change in $x$ causes a change in $y=g(x)$, and each change in $y$ causes a change in $z=h(g(x))$.

<img src="https://www.cs.columbia.edu/~bauer/shape/chain_rule.png" width=600px>

We can therefore factor the derivative into the two changes: 

$\lim\limits_{\Delta x \rightarrow 0}\frac{\Delta f(x)}{\Delta x} = \lim\limits_{\Delta x \rightarrow 0}\frac{\Delta z}{\Delta x} =  \lim\limits_{\Delta x \rightarrow 0}\frac{\Delta z}{\Delta y} \frac{\Delta y}{\Delta x}$

Example: $g(x) = x^3$ and $h(x) = x^4$ and  $f(x) = h(g(x))$.

$\frac{df}{dx} = 4(g(x))^3 \cdot 3x^2 = 4(x^3)^3 \cdot 3x^2 = 4x^9 \cdot 3 x^2 = 12 x^{11}$
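The same example can be written as code, which mirrors how the chain rule is applied mechanically: compute the inner value, then multiply the outer derivative (evaluated at the inner value) by the inner derivative. This is a minimal sketch; the function names and the evaluation point $x = 1.5$ are chosen for illustration.

```python
import math


def g(x):
    return x**3


def h(y):
    return y**4


def dg_dx(x):
    return 3 * x**2      # power rule applied to g


def dh_dy(y):
    return 4 * y**3      # power rule applied to h


def df_dx(x):
    """Chain rule for f(x) = h(g(x)): dh/dg evaluated at g(x), times dg/dx."""
    return dh_dy(g(x)) * dg_dx(x)


x = 1.5
print(df_dx(x))          # chain rule
print(12 * x**11)        # closed form 12 x^11 from the worked example
print(math.isclose(df_dx(x), 12 * x**11))
```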


## 3. Multi-variable Functions and Partial Derivatives

### Partial Derivatives
Assume we have a multi-variable function such as $f(x,y) = x^2 - y^2 + 2x + y$.

<img src="https://www.cs.columbia.edu/~bauer/shape/saddle_function_plot.png" width=300px>

Instead of the derivative, we now compute the **partial derivatives** of the function with respect to each variable: 
$ \frac{\partial f(x,y)}{\partial x}$ and $ \frac{\partial f(x,y)}{\partial y}$. 
We usually just write $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$.

In other words, we are asking how the value of the function changes when one of the variables changes while the other(s) stay constant. 
One way to think about keeping the other variables constant is to imagine slicing the function parallel to the axes. 

<img src="https://www.cs.columbia.edu/~bauer/shape/saddle_function_slice.png" width=300px>

The partial derivative of $f$ w.r.t. $y$  is $$\frac{\partial f(x,y)}{\partial{y}} = \lim\limits_{\Delta y \rightarrow 0} \frac{\Delta f}{\Delta y} = \lim\limits_{\Delta y \rightarrow 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y}$$

Most of the rules for single-variable derivatives work exactly the same, except that we treat all variables but one as constant. 

For example: 
if $f(x,y) = x^2 - y^2 + 2x + y$, then

$\frac{\partial f}{\partial x} = 2x + 2$ and 

$\frac{\partial f}{\partial y} = -2y + 1$.

Another example: 
if $f(x,y) = x^2 y^2 + xy + y$, then

$\frac{\partial f}{\partial x} = 2xy^2 + y$ and 

$\frac{\partial f}{\partial y} = 2yx^2 + x + 1$ 
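Holding one variable constant translates directly into code: to estimate a partial derivative numerically, nudge only one argument and leave the other untouched. The sketch below checks the second example at the point $(2, 3)$ using central differences; the helper names, step size, and evaluation point are assumptions made for illustration.

```python
def f(x, y):
    return x**2 * y**2 + x * y + y


def partial_x(f, x, y, d=1e-6):
    """Estimate df/dx: vary x only, keep y fixed."""
    return (f(x + d, y) - f(x - d, y)) / (2 * d)


def partial_y(f, x, y, d=1e-6):
    """Estimate df/dy: vary y only, keep x fixed."""
    return (f(x, y + d) - f(x, y - d)) / (2 * d)


x, y = 2.0, 3.0
print(partial_x(f, x, y), 2 * x * y**2 + y)      # numerical vs. 2xy^2 + y
print(partial_y(f, x, y), 2 * y * x**2 + x + 1)  # numerical vs. 2yx^2 + x + 1
```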

### Gradients 

The **gradient** of a multivariable function is a **vector-valued function**: a vector whose components are the partial derivatives of the function. 

$$\nabla f(x_1, \ldots x_d) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_d} \end{bmatrix} $$

Example: If $f(x,y) = x^2 - y^2 + 2x + y$, then 

$$\nabla f = \begin{bmatrix}2x + 2 \\ -2y + 1 \end{bmatrix}$$

The gradient describes a **vector field**. For any input point of the function, we can apply the gradient functions and obtain a real-valued vector. This vector describes the direction of the greatest change / steepest ascent of the function. 

For the example above: 
$ \nabla{f}(2,3) = \begin{bmatrix} 6 \\ -5\end{bmatrix}$
<img src="https://www.cs.columbia.edu/~bauer/shape/saddle_function_gradient_field.png" width=400px>

In machine learning, we often say "the gradient of f at (2,3)" to refer to the value of the gradient vector at that specific point.
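The value of the gradient at a point can be estimated by computing one numerical partial derivative per input. The following is a minimal sketch for the example function, with `numerical_gradient` as a hypothetical helper name and a step size chosen for illustration.

```python
def f(x, y):
    return x**2 - y**2 + 2 * x + y


def numerical_gradient(f, point, d=1e-6):
    """Estimate the gradient of f at `point` with one central difference per input."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += d       # nudge only the i-th coordinate
        down[i] -= d
        grad.append((f(*up) - f(*down)) / (2 * d))
    return grad


# Compare with the analytic gradient [2x + 2, -2y + 1] at (2, 3).
print(numerical_gradient(f, [2.0, 3.0]))
```

The result should be close to the vector $\begin{bmatrix} 6 \\ -5\end{bmatrix}$ computed above. In practice, machine learning libraries compute gradients with the chain rule rather than with finite differences, which become expensive when there are many inputs.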

When $f$ is a scalar valued function applied to a vector, such as $f(\mathbf{x})$ we often just write the gradient as
$\nabla f(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}}$. The gradient is the vector of partial derivatives with respect to each component of $\mathbf{x}$. 

### Generalized Chain Rule

The generalized chain rule is what makes backpropagation work, so this is an important section. 

Recall the chain rule for single-variable functions
                
If $f(x) = h(g(x))$ then $\frac{df}{dx} = \frac{dh}{dg} \frac{dg}{dx}$


**Composite Function With a Single Input Variable**

Now consider a function $f(x,y)$, where $x=g(t)$ and $y=h(t)$.

<img src="https://www.cs.columbia.edu/~bauer/shape/comp_graph_single_var.png" width=100px>

The derivative of $f$ needs to account for all ways in which $f$ changes when $t$ changes. Part of this change is due to the change in $g(t)$ and part is due to the change in $h(t)$. 

$\frac{d}{dt} f(g(t), h(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} +  \frac{\partial f}{\partial y}\frac{dy}{dt}$
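As a concrete check, the sketch below reuses the function $f(x,y) = x^2 - y^2 + 2x + y$ from the partial derivatives section and plugs in two hypothetical inner functions, $x = g(t) = t^2$ and $y = h(t) = 3t$ (these are assumptions chosen only for illustration). It sums the two paths from $t$ to $f$ and compares the result with a numerical estimate.

```python
def g(t):
    return t**2          # hypothetical x = g(t)


def h(t):
    return 3 * t         # hypothetical y = h(t)


def f(x, y):
    return x**2 - y**2 + 2 * x + y


def df_dt(t):
    """Sum the contributions of both paths from t to f."""
    x, y = g(t), h(t)
    df_dx = 2 * x + 2    # partial derivative of f w.r.t. x
    df_dy = -2 * y + 1   # partial derivative of f w.r.t. y
    dx_dt = 2 * t        # derivative of g
    dy_dt = 3            # derivative of h
    return df_dx * dx_dt + df_dy * dy_dt


def df_dt_numeric(t, d=1e-6):
    """Central difference estimate, treating f(g(t), h(t)) as a function of t."""
    return (f(g(t + d), h(t + d)) - f(g(t - d), h(t - d))) / (2 * d)


t = 1.0
print(df_dt(t), df_dt_numeric(t))
```

Substituting directly gives $f = t^4 - 7t^2 + 3t$, whose derivative $4t^3 - 14t + 3$ agrees with the chain rule result.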

**Generalized Chain Rule**

Now assume there are multiple input variables $t_1, \ldots, t_n$. 

Assume we have a function $f(x_1, \ldots x_m)$. Each of the $x_i = g_i(t_1, \ldots, t_n)$ is the result of applying some function $g_i$ to the input variables.

We can now ask for the partial derivative of $f$ with respect to each $t_j$. 

$\frac{\partial f}{\partial t_j} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial t_j} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial t_j} + \cdots + \frac{\partial f}{\partial g_m} \frac{\partial g_m}{\partial t_j}$

Example: 

$f(x,y) = xy$

$x = g(a,b) = a+b$

$y = h(a,b) = b+1$


We can display the function $f$ as a **computation graph**

<img src="https://www.cs.columbia.edu/~bauer/shape/chain_rule_example1.png" width=300px>

We want to compute $\frac{\partial f}{\partial a}$ and $\frac{\partial f}{\partial b}$.

The chain rule tells us that 

$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial a} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial a} $ 

and 

$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial b} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial b} $ 

So we need 

$\frac{\partial f}{\partial x} = y$

$\frac{\partial f}{\partial y} = x$


and (left side)

$\frac{\partial x}{\partial a} = 1$

$\frac{\partial x}{\partial b} = 1$

and (right side)

$\frac{\partial y}{\partial b} = 1$
and $\frac{\partial y}{\partial a} = 0$ (note: $y$ doesn't depend at all on $a$).

Plugging these back in: 

$\frac{\partial f}{\partial a} = y \cdot 1 + x \cdot 0 = y = b+1$ 

$\frac{\partial f}{\partial b} = y \cdot 1 + x \cdot 1 = x + y = (a+b) + (b+1) = 2b + a + 1$ 
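The worked example maps naturally onto code: a forward pass computes the intermediate values $x$, $y$, and $f$, and a backward pass combines the local derivatives with the chain rule. This is a simplified sketch of the idea behind backpropagation for this one graph, not a general implementation; the function names `forward` and `backward` and the test inputs are chosen for illustration.

```python
def forward(a, b):
    """Evaluate the computation graph: x = a + b, y = b + 1, f = x * y."""
    x = a + b
    y = b + 1
    f = x * y
    return x, y, f


def backward(a, b):
    """Return (df/da, df/db) using the generalized chain rule."""
    x, y, _ = forward(a, b)

    # Local derivatives at the output node f = x * y.
    df_dx, df_dy = y, x

    # Local derivatives at the intermediate nodes.
    dx_da, dx_db = 1.0, 1.0   # x = a + b
    dy_da, dy_db = 0.0, 1.0   # y = b + 1 does not depend on a

    # Sum over all paths from each input to f.
    df_da = df_dx * dx_da + df_dy * dy_da
    df_db = df_dx * dx_db + df_dy * dy_db
    return df_da, df_db


a, b = 2.0, 3.0
print(backward(a, b))             # chain rule result
print((b + 1, 2 * b + a + 1))     # closed forms derived above
```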
