# 1. Preliminaries.

Before talking about deep learning and deep learning frameworks, let's establish a bit of notation. From here on, we will talk about optimizing models. Think of a model as a set of functions. For example, a model could be a family of functions described as 
$$F(x; w) = x + w \text{ with } x,w \in \mathbb{R}$$
$$\text{or equivalently } F(x; w) = \{f:\mathbb{R}\rightarrow\mathbb{R}\;\vert\; f(x) = x + w,\ w \in \mathbb{R}\}$$

However, we will use the first notation. Now, let's make a very important distinction. We call the arguments before ";" **inputs** (such as $x$) and the arguments after ";" **parameters** (such as $w$).

In Python, a model could look something like this:

In [2]:
def F(w):
    return lambda x: x + w

print(f"F returns functions. For example F(5) = {F(5)}")
print(f"if F returns functions we can call F(5). For example, F(5)(5) = {F(5)(5)}")

F returns functions. For example F(5) = <function F.<locals>.<lambda> at 0x7fdfae908160>
if F returns functions we can call F(5). For example, F(5)(5) = 10


## 1.1 Training/Optimizing

What does it mean to train or optimize a model? It means searching the family of functions for those that satisfy some desired properties. For example, we might want to search for functions that are symmetric, or for functions that interpolate a set of points. Ultimately, this amounts to finding good parameters. You can search for good parameters any way you like. You can even sample parameters at random until you find good ones.
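To make "searching for a function with a desired property" concrete, here is a minimal sketch on the model $F(x; w) = x + w$ from above. The property (the function should map $1$ to roughly $3$), the helper name `badness`, and the fixed seed are all assumptions made only for this illustration.

```python
import random

def F(w):
    # model F(x; w) = x + w: fixing w selects one function from the family
    return lambda x: x + w

def badness(f):
    # hypothetical property: we want f(1) to be as close to 3 as possible
    return abs(f(1.0) - 3.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
best_w = None
best_badness = float("inf")
for _ in range(10_000):
    w = rng.uniform(-10, 10)   # sample a candidate parameter
    b = badness(F(w))          # how far is this function from the property?
    if b < best_badness:
        best_w, best_badness = w, b

print(f"best w found: {best_w:.4f}, |f(1) - 3| = {best_badness:.4f}")
```

The search should end up close to $w = 2$, since $f(1) = 1 + 2 = 3$.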

## 1.2 Optimize a constant function.

Let's look at a concrete, yet super simple, example. This should help you understand how all the pieces fit together. Consider the following model: 

$$F(;w_1) = w_1^2 + 1$$ 

where $w_1 \in [-10,10]$. $F$ is a model that can be instantiated as infinitely many functions. Each of these functions takes no input and returns a constant real output. This model is not particularly useful to optimize, but it helps to understand the methodology. 

Now suppose that we want to find the function in $F$ with the lowest output. For example, if we set $w_1 = 5$ we get $g() = 5^2+1 = 26$, which is not very good. If instead we set $w_1 = 2$ we get $s() = 2^2+1 = 5$, which is already better. Let's formalize what we want:

$$\min_{w_1 \in [-10,10]}\{F(;w_1)\} = \min_{w_1 \in [-10, 10]}\{w_1^2 + 1\}$$

### 1.2.1 Zeroth Order Optimization (or derivative-free optimization)
Zeroth-order optimization tries to optimize models without any knowledge of gradients. It is useful when you cannot compute the gradients of the model (doing so may require too much time or too much memory, or you may not know the model at all). These methods are often called black-box optimization, as they only require the ability to query the model. Evolutionary algorithms are one common family of zeroth-order methods.

Let's see a super simple example of zeroth-order optimization that randomly searches for the optimal value.

In [3]:
import random

# let's define our model as before
def F(w1):
    return lambda: w1**2 + 1
    
# let's search for good functions by randomly sampling w1.
best_score = float("+inf")
best_param = None
for i in range(10000):
    w1 = random.uniform(-10, 10)
    if F(w1)() < best_score:
        best_score, best_param = F(w1)(), w1
        
print(f"best parameter found w1={best_param} which yield f(w1)={F(best_param)()}")

best parameter found w1=0.00020506493425997974 which yield f(w1)=1.0000000420516273


This is pretty close to the optimal value ($1$). Unfortunately, random search only works in super simple cases. If we want to tackle real-world problems, we need to search more intelligently. 
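One small step up from pure random sampling, in the spirit of the evolutionary methods mentioned earlier, is to keep the best parameter found so far and only try random mutations of it. The sketch below is a simple hill climber of this kind (similar to what is often called a (1+1) evolution strategy). The mutation size `sigma`, the iteration count, and the helper name `objective` are arbitrary choices made for this illustration.

```python
import random

def objective(w1):
    # value of the constant function selected by w1, i.e. F(; w1) = w1**2 + 1
    return w1**2 + 1

rng = random.Random(0)          # fixed seed so the run is repeatable
parent = rng.uniform(-10, 10)   # random starting parameter
sigma = 0.5                     # mutation step size (an arbitrary choice)

for _ in range(2000):
    # mutate the current best parameter with Gaussian noise
    child = parent + rng.gauss(0, sigma)
    # keep the candidate inside the allowed interval [-10, 10]
    child = max(-10.0, min(10.0, child))
    # keep the child only if it is at least as good as the parent
    if objective(child) <= objective(parent):
        parent = child

print(f"best parameter found w1={parent} which yields f(w1)={objective(parent)}")
```

It still only queries the model and never uses a derivative, so it remains a zeroth-order method.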

### 1.2.2 First Order Optimization

First-order optimization algorithms require knowledge of the derivative of $F$. The most widely used first-order technique is gradient descent. Compared with zeroth-order algorithms, we have more information, so we can usually obtain good results faster. However, computing gradients does require time and memory. Let's compute the derivative of $F$ with respect to $w_1$.

$$\frac{dF(;w_1)}{dw_1} = \frac{d(w_1^2 + 1)}{d w_1} = 2w_1$$
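A quick way to sanity-check a hand-computed derivative is to compare it with a finite-difference approximation. The following is a minimal sketch. The helper `numerical_derivative` and the step size `h` are assumptions for illustration only.

```python
def F(w1):
    return lambda: w1**2 + 1

def dF_dw1(w1):
    # derivative computed by hand
    return 2 * w1

def numerical_derivative(w1, h=1e-6):
    # central difference: (F(w1 + h) - F(w1 - h)) / (2h)
    return (F(w1 + h)() - F(w1 - h)()) / (2 * h)

for w1 in (-3.0, 0.0, 2.0, 5.0):
    print(f"w1={w1}: by hand={dF_dw1(w1)}, numerical={numerical_derivative(w1):.6f}")
```

The two columns should agree up to rounding error.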

Now, no matter how we choose $w_1$, we can compute the slope of the model. Knowing the slope, we can take small steps towards functions with smaller and smaller values. Let me visualize what I mean.

In [4]:
%matplotlib widget
%matplotlib inline

import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np


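@widgets.interact(w1=(-10, 10, 0.25))
def update(w1 = 1.0):
    # plot the model: the output of F(; w) for every w in [-10, 10]
    w = np.linspace(-10, 10)
    y = F(w)() 
    plt.plot(w, y)
    
    # mark the function selected by the slider
    plt.scatter(w1, F(w1)())
    
    # tangent line at w1: the value F(w1) plus the slope 2*w1 times the offset (x - w1)
    x = np.linspace(-10,10)
    y = F(w1)() + 2*w1*(x - w1)
    plt.plot(x,y)
    
    plt.gca().set_xlim((-5,+5))
    plt.gca().set_ylim((-2,+10))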


[Interactive plot: the model $F(;w_1)$ plotted against $w_1$, with a `w1` slider (from -10 to 10 in steps of 0.25) that moves the highlighted point and its tangent line.]

The blue line is our model. Remember, each point of our model is actually a function, a constant function. For each of these points we can compute a derivative, which tells us the slope of the model. By following the slope, we obtain a function with a higher value. By following the negative slope, we obtain a function with a lower value. Step by step, we can obtain functions with increasingly larger or increasingly smaller values.   

Just for fun, let's write an algorithm that performs gradient descent on our model.

In [5]:
import random

def F     (w1): return lambda: w1**2 + 1
def dF_dw1(w1): return 2*w1

param_w1 = random.uniform(-10,10)
learning_rate = 0.001

for i in range(10000):
    param_w1 = param_w1 - learning_rate * dF_dw1(param_w1) # take a small step against the slope (downhill)

print(f"best parameter found w1={param_w1} which yield f(w1)={F(param_w1)()}")

best parameter found w1=-9.375544264783508e-09 which yield f(w1)=1.0
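It is worth seeing why the result lands so close to $0$. For this particular model the update is $w_1 \leftarrow w_1 - \eta \cdot 2w_1 = (1 - 2\eta)\,w_1$, where $\eta$ is the learning rate, so every step shrinks the parameter by the same constant factor. The sketch below compares the loop with this closed form. The starting point `w_start = 5.0` is an arbitrary value chosen for illustration.

```python
learning_rate = 0.001
steps = 10_000
w_start = 5.0  # arbitrary starting point, chosen for illustration

# run the same update rule as in the gradient descent loop
w = w_start
for _ in range(steps):
    w = w - learning_rate * 2 * w

# each step multiplies w by (1 - 2 * learning_rate), so after `steps` steps:
w_closed_form = w_start * (1 - 2 * learning_rate) ** steps

print(f"loop: {w}, closed form: {w_closed_form}")
```

Both values are on the order of $10^{-8}$. This also hints at why the learning rate matters: a larger rate shrinks the parameter faster, but for this model any rate above $1$ makes the factor $(1 - 2\eta)$ larger than $1$ in absolute value and the iterates diverge.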


### 1.2.3 Second Order Optimization

We could go a step further and also use the second derivative. There are many reasons both for and against using the second derivative. However, we will not cover these optimization methods.

## 1.3 Optimize a simple function

Now that we have seen how to optimize a model that produces constant functions, we can try to optimize a model that describes more interesting functions. Consider this model:

$$ F(x;w_1, w_2) = x*w_1 + w_2 $$

First, let's see what kinds of functions our $F$ describes.

In [6]:
def F(w1, w2):
    return lambda x: x*w1+w2

@widgets.interact(w1=(-10, 10, 0.25), w2=(-10, 10, 0.25))
def plot(w1=7.5, w2=7.5):
    x = np.linspace(-10, 10)
    y = F(w1, w2)(x)
    plt.plot(x, y)
    
    plt.gca().set_xlim((-5,+5))
    plt.gca().set_ylim((-10,+10))


[Interactive plot: the line $F(x; w_1, w_2)$ plotted against $x$, with `w1` and `w2` sliders (each from -10 to 10 in steps of 0.25).]

These are all straight lines. No matter which $w_1$ and $w_2$ you choose, $F$ will always describe a straight line. Let's suppose that we want to find the function with this input-output relation:

| x | y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |

A function from our model is good if, when fed the $x$s, it outputs something close to the corresponding $y$s. Let's define a bit more formally what this means. To do so, we derive yet another model from $F$, which we call $L$.

$$ L(x; w_1, w_2) = \sum_{(x,y) \in D} (F(x; w_1, w_2) - y)^2 $$

Here $D$ is the set of $(x, y)$ pairs in the table above. $L$ is close to $0$ when $F(x_i; w_1, w_2)$ is close to $y_i$ for all $i$. In fact, $L$ is exactly $0$ when $F(x_i; w_1, w_2) = y_i$ for all $i$. So, minimizing this model yields exactly the property that we desire. Again, we need to compute the gradients.
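Before differentiating, it can help to evaluate $L$ directly for a few parameter choices. The sketch below is one possible way to write it in plain Python. The name `data` for the table and the three parameter pairs are choices made only for this illustration.

```python
# the (x, y) pairs from the table
data = [(1, 2), (2, 3), (3, 4), (4, 5)]

def F(w1, w2):
    return lambda x: x * w1 + w2

def L(w1, w2):
    # sum of squared differences between the model output and the target y
    f = F(w1, w2)
    return sum((f(x) - y) ** 2 for x, y in data)

print(L(0, 0))  # 54: the line y = 0 is far from every point
print(L(2, 0))  # 14: closer, but still off
print(L(1, 1))  # 0: the line y = x + 1 passes through every point
```

The smaller the value of $L$, the better the selected line matches the table.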

Now we need the derivative wrt. $w_1$: 
$$\frac{d L(x;w_1,w_2)}{d w_1} = 
\frac{d \sum_{(x,y) \in D} (F(x; w_1, w_2) - y)^2}{d w_1} = 
\sum_{(x,y) \in D} \frac{d(F(x; w_1, w_2) - y)^2}{dw_1} =
\sum_{(x,y) \in D} \frac{d(x*w_1+w_2 - y)^2}{dw_1} = 
\sum_{(x,y) \in D} 2(x*w_1+w_2 - y) \frac{d(x*w_1+w_2 - y)}{dw_1} = 
\sum_{(x,y) \in D} 2(x*w_1+w_2 - y)x
$$

And also wrt. $w_2$:

$$\frac{d L(x;w_1,w_2)}{d w_2} = 
\frac{d \sum_{(x,y) \in D} (F(x; w_1, w_2) - y)^2}{d w_2} = 
\sum_{(x,y) \in D} \frac{d(F(x; w_1, w_2) - y)^2}{dw_2} =
\sum_{(x,y) \in D} \frac{d(x*w_1+w_2 - y)^2}{dw_2} = 
\sum_{(x,y) \in D} 2(x*w_1+w_2 - y) \frac{d(x*w_1+w_2 - y)}{dw_2} = 
\sum_{(x,y) \in D} 2(x*w_1+w_2 - y)
$$
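Since these derivations are easy to get wrong, a finite-difference check is a cheap safeguard. The sketch below compares the two formulas above with a numerical approximation at one point. The helper `numerical_gradient`, the step size `h`, and the test point `(0.5, -2.0)` are assumptions made for illustration.

```python
# the (x, y) pairs from the table
data = [(1, 2), (2, 3), (3, 4), (4, 5)]

def L(w1, w2):
    return sum((x * w1 + w2 - y) ** 2 for x, y in data)

def dL_dw1(w1, w2):
    # derivative of L with respect to w1, from the derivation above
    return sum(2 * (x * w1 + w2 - y) * x for x, y in data)

def dL_dw2(w1, w2):
    # derivative of L with respect to w2, from the derivation above
    return sum(2 * (x * w1 + w2 - y) for x, y in data)

def numerical_gradient(w1, w2, h=1e-6):
    # central differences, perturbing one parameter at a time
    g1 = (L(w1 + h, w2) - L(w1 - h, w2)) / (2 * h)
    g2 = (L(w1, w2 + h) - L(w1, w2 - h)) / (2 * h)
    return g1, g2

w1, w2 = 0.5, -2.0  # arbitrary test point
print("by hand:  ", dL_dw1(w1, w2), dL_dw2(w1, w2))  # -90.0 -34.0
print("numerical:", numerical_gradient(w1, w2))
```

If the numerical values drift far from the hand-computed ones, the derivation (or its implementation) most likely contains a mistake.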

Once again, we have all the pieces to perform first-order optimization, just as we did before.

In [10]:
import numpy as np
import random

# the (x, y) pairs from the table
x = np.array([1,2,3,4])
y = np.array([2,3,4,5])

def F(w1, w2):
    return lambda x: x*w1+w2

# derivative of the loss L with respect to w1
def dL_dw1(w1, w2):
    return (2*(x * w1 + w2 - y)*x).sum()

# derivative of the loss L with respect to w2
def dL_dw2(w1, w2):
    return (2*(x * w1 + w2 - y)).sum()

param_w1 = random.uniform(-10,10)
param_w2 = random.uniform(-10,10)
learning_rate = 0.001

for i in range(100000):
    # update both parameters at once, using gradients evaluated at the old values
    param_w1, param_w2 = (param_w1 - learning_rate * dL_dw1(param_w1, param_w2), 
                          param_w2 - learning_rate * dL_dw2(param_w1, param_w2))

print(f"best parameters found (w1, w2)={param_w1, param_w2} which yield f(2;w1,w2)={F(param_w1, param_w2)(2)}")

best parameter found w1=(1.0000000000000249, 1.0000000000000249) which yield f(2;w1,w2)=2.999999999999981
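For a straight-line model with a squared-error loss, the optimum can also be computed directly with the standard least-squares formulas, which gives an independent check on what gradient descent found. This is a different technique from the one used above, shown only as a cross-check. The variable names are chosen for this illustration.

```python
xs = [1, 2, 3, 4]
ys = [2, 3, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope: covariance of x and y divided by the variance of x
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
# intercept: the fitted line passes through the point of means
w2 = mean_y - w1 * mean_x

print(w1, w2)  # 1.0 1.0
```

Both parameters come out as $1$, matching the line $y = x + 1$ that the table describes.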


# 2. Automatic Differentiation

As you have noticed, computing the gradients by hand is a real pain, and it becomes unmanageable really fast. Fortunately, this work can be done automatically.   


In [8]:
import tensorflow as tf


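# w1 is a trainable variable; TensorFlow tracks the operations applied to it
w1 = tf.Variable(10.0, name="w1", trainable=True, dtype=tf.float32)

# record the computation of f so that it can be differentiated
with tf.GradientTape() as tape:
  f = w1**2 + 1

# derivative of f with respect to w1, evaluated at w1 = 10
df_dw1 = tape.gradient(f, w1)

print(df_dw1.numpy())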

20.0
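The result matches the hand-computed derivative $2w_1$ evaluated at $w_1 = 10$. To give a rough feeling for how a derivative can be computed automatically, here is a toy sketch based on dual numbers (forward-mode differentiation). It is not how TensorFlow works internally: `tf.GradientTape` records operations and differentiates in reverse mode. The class `Dual` is a hypothetical helper written only for this illustration, and it supports just addition and multiplication.

```python
class Dual:
    """A value together with its derivative with respect to the input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # plain numbers are constants, so their derivative is 0
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        # sum rule: (a + b)' = a' + b'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def f(w1):
    # same function as above, written with * because Dual has no power operator
    return w1 * w1 + 1


w1 = Dual(10.0, 1.0)  # derivative of w1 with respect to itself is 1
out = f(w1)
print(out.value, out.deriv)  # 101.0 20.0
```

Every arithmetic operation propagates the derivative alongside the value, so no formula has to be derived by hand.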

