# Practical 9b: Gradient Descent Algorithm

To find a local minimum of a function $z=f(x,y)$, we can use the gradient descent algorithm. It requires the partial derivatives of the function, $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$, which can be derived by hand or obtained using the SymPy library. The algorithm proceeds as follows:

Step 0.	Set a learning rate $\alpha>0$ and an initial point $x=x_0,\ y=y_0$ and compute $f(x_0,y_0)$.

Step 1.	At the n-th point $x=x_n,\ y=y_n$, compute $f_x(x_n,y_n)$ and $f_y(x_n,y_n)$.

Step 2.	Update to the (n+1)-th point, $x_{n+1}=x_n-\alpha f_x(x_n,y_n), \ y_{n+1}=y_n-\alpha f_y(x_n,y_n)$ and compute $f(x_{n+1},y_{n+1})$. 

Step 3.	Repeat steps 1 and 2 until a stopping criterion is reached.
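Steps 1 and 2 together amount to a single update of the point. The following is a minimal sketch of that update as a function; the name `gradient_step` and the test function $f(x,y)=x^2+y^2$ are assumptions made for illustration only.

```python
def gradient_step(x, y, alpha, f_x, f_y):
    """Return the point reached by one gradient descent update from (x, y)."""
    # Step 1: evaluate both partial derivatives at the current point
    grad_x = f_x(x, y)
    grad_y = f_y(x, y)
    # Step 2: move against the gradient, scaled by the learning rate alpha
    return x - alpha * grad_x, y - alpha * grad_y


# Illustrative function (an assumption for this sketch): f(x, y) = x**2 + y**2,
# so f_x = 2x and f_y = 2y
x1, y1 = gradient_step(1.0, 2.0, 0.1, lambda x, y: 2 * x, lambda x, y: 2 * y)
print(x1, y1)
```

Both partial derivatives are evaluated at the current point before either coordinate is updated, which matches the update formulas in Step 2.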

Any one of the following, or a combination of them, can serve as the stopping criterion:    
1.	The **maximum number of iterations** is reached.
2.	The **value of $f_x^2(x_n,y_n)+ f_y^2(x_n,y_n)$**, the squared magnitude of the gradient, is smaller than a fixed constant. 
3.	**Convergence**, which, in simple terms, means that the updated point differs little from the previous one. This can mean either that the move from $x=x_n,\ y=y_n$ to $x=x_{n+1},\ y=y_{n+1}$ is small, or that the reduction from $f(x_{n},y_{n})$ to $f(x_{n+1},y_{n+1})$ is small (in both cases, the difference is smaller than a fixed constant).
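
Criteria 2 and 3 can each be written as a small test that returns `True` when the loop should stop. The sketch below is one possible way to express them; the function names are hypothetical, and using the Euclidean distance for "little difference" between points is an assumption, since the text does not fix a particular measure. Criterion 1 needs no function because it is simply the bound of the `for` loop.

```python
def gradient_small(grad_x, grad_y, epsilon):
    """Criterion 2: f_x^2 + f_y^2 at the current point is below epsilon."""
    return grad_x**2 + grad_y**2 < epsilon


def point_converged(x_old, y_old, x_new, y_new, epsilon):
    """Criterion 3 (points): the update moved the point by less than epsilon.

    Assumption: the move is measured by the Euclidean distance.
    """
    return ((x_new - x_old)**2 + (y_new - y_old)**2) ** 0.5 < epsilon


def value_converged(f_old, f_new, epsilon):
    """Criterion 3 (values): the function value changed by less than epsilon."""
    return abs(f_new - f_old) < epsilon


print(gradient_small(0.01, 0.02, 0.001))
print(value_converged(1.0, 1.5, 0.001))
```

Criteria can be combined with `or`, so that the loop stops as soon as any one of them is satisfied.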
 
As an example, we want to find the minimum of the function $f(x,y)=x^2+3y^2$. The partial derivatives are $f_x(x,y) = 2x$ and $f_y(x,y)=6y$. With the initial point $x=4,\ y=5$ and learning rate $\alpha=0.01$, the code for the algorithm is as follows:

In [None]:
import numpy as np

next_x = 4 # Initial point
next_y = 5 # Initial point
alpha = 0.01 # Learning rate
epsilon = 0.001 # Stopping criterion constant
max_iters = 500 # Maximum number of iterations

# Partial derivatives and function
partialf_x = lambda x,y: 2*x 
partialf_y = lambda x,y: 6*y 
func = lambda x,y: x**2+3*y**2

next_func = func(next_x,next_y) # Initial value of function

for n in range(max_iters):
    current_x = next_x
    current_y = next_y
    current_func = next_func
    next_x = current_x-alpha*partialf_x(current_x,current_y) # update of x
    next_y = current_y-alpha*partialf_y(current_x,current_y) # update of y
    next_func = func(next_x,next_y)
    change_func = abs(next_func-current_func) # stopping criterion: values of function converge
    print("Iteration",n+1,": x = ",next_x,", y = ",next_y,", f(x,y) = ",next_func)
    if change_func<epsilon:
        break

The convergence criterion used in the code above is that the difference between the function values of two consecutive updates is less than a fixed constant, `epsilon`. The loop stops when this criterion is met or when the maximum number of iterations is reached, whichever comes first.
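
The partial derivatives in the example were written out by hand. As a cross-check, they can also be obtained with SymPy, as mentioned in the introduction. The sketch below assumes SymPy is installed; `sp.diff` and `sp.lambdify` are standard SymPy functions, while the variable names simply mirror those in the example.

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + 3*y**2

# Symbolic partial derivatives of f
f_x = sp.diff(f, x)
f_y = sp.diff(f, y)
print(f_x, f_y)  # prints: 2*x 6*y

# Convert the symbolic expressions into ordinary numeric functions
partialf_x = sp.lambdify((x, y), f_x)
partialf_y = sp.lambdify((x, y), f_y)
print(partialf_x(4, 5), partialf_y(4, 5))  # prints: 8 30
```

The functions returned by `sp.lambdify` take `(x, y)` as arguments, so they could stand in for the hand-written lambdas in the loop.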

Now, try to repeat the example above with different learning rates and observe what happens.
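
To make the comparison less repetitive, the example loop can be wrapped in a function that takes the learning rate as an argument. This is only a convenience sketch: the name `gradient_descent`, its default arguments (copied from the example cell) and its return values are assumptions, and the per-iteration printing is dropped in favour of a one-line summary.

```python
def gradient_descent(alpha, x0=4.0, y0=5.0, epsilon=0.001, max_iters=500):
    """Run the example loop for f(x, y) = x**2 + 3*y**2 with learning rate alpha.

    Returns the final x, final y, final function value and the number of
    iterations performed.
    """
    func = lambda x, y: x**2 + 3 * y**2
    partialf_x = lambda x, y: 2 * x
    partialf_y = lambda x, y: 6 * y

    x, y = x0, y0
    value = func(x, y)
    iterations = 0
    for n in range(max_iters):
        new_x = x - alpha * partialf_x(x, y)  # update of x
        new_y = y - alpha * partialf_y(x, y)  # update of y
        new_value = func(new_x, new_y)
        change = abs(new_value - value)  # change in function value
        x, y, value = new_x, new_y, new_value
        iterations = n + 1
        if change < epsilon:
            break
    return x, y, value, iterations


for alpha in (0.05, 0.1, 0.2, 0.25, 0.5):
    x, y, value, iterations = gradient_descent(alpha)
    print("alpha =", alpha, ": x =", x, ", y =", y,
          ", f(x,y) =", value, ", iterations =", iterations)
```

Comparing the final function value and the iteration count across the rows is one way to organise the observations asked for below.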

$\alpha=0.05$:

$\alpha=0.1$:
    
$\alpha=0.2$:
    
$\alpha=0.25$:
    
$\alpha=0.5$: 
    
Conclusion: 

Next, we look at how different initial points may lead to different results. Given the function $f(x,y)=x^2+y^4+3y^3-y^2-3y$, we would like to find a local minimum using the gradient descent algorithm, starting from different initial points. Fill in the code below with the appropriate parameters and functions.

In [None]:
# Feel free to set your own constants

import numpy as np

next_x = _____ # Initial point # students to set 
next_y = _____ # Initial point # students to set
alpha = _____ # Learning rate # students to set
epsilon = _____ # Stopping criterion constant # students to set
max_iters = _____ # Maximum number of iterations # students to set

# Partial derivatives and function
partialf_x = lambda x,y: ________ # students to fill in formula
partialf_y = lambda x,y: ________ # students to fill in formula
func = lambda x,y: ________ # students to fill in formula

next_func = func(next_x,next_y) # Initial value of function

for n in range(max_iters):
    current_x = next_x
    current_y = next_y
    current_func = next_func
    next_x = current_x-alpha*partialf_x(current_x,current_y) # update of x
    next_y = current_y-alpha*partialf_y(current_x,current_y) # update of y
    next_func = func(next_x,next_y)
    change_func = abs(next_func-current_func) # stopping criterion: values of function converge
    print("Iteration",n+1,": x = ",next_x,", y = ",next_y,", f(x,y) = ",next_func)
    if change_func<epsilon:
        break

With all parameters constant except the initial points, what do you observe?

Initial point = (1,1):
    
Initial point = (2,1):
    
Initial point = (0,-3):
    
Initial point = (-1,-2):

Conclusion: 
    
## Stopping criterion

So far, the stopping criterion has always been a combination of the maximum number of iterations and convergence of the function values. Edit the code so that, instead of checking convergence of the function values, the loop stops when $f^2_x(x_n,y_n)+f^2_y(x_n,y_n)$ is smaller than a small constant.

In [None]:
# Feel free to set your own constants

import numpy as np

next_x = _____ # Initial point # students to set 
next_y = _____ # Initial point # students to set
alpha = _____ # Learning rate # students to set
epsilon = _____ # Stopping criterion constant # students to set
max_iters = _____ # Maximum number of iterations # students to set

# Partial derivatives and function
partialf_x = lambda x,y: ________ # students to fill in formula
partialf_y = lambda x,y: ________ # students to fill in formula
func = lambda x,y: ________ # students to fill in formula

next_func = func(next_x,next_y) # Initial value of function

for n in range(max_iters):
    current_x = next_x
    current_y = next_y
    current_func = next_func
    next_x = current_x-alpha*partialf_x(current_x,current_y) # update of x
    next_y = current_y-alpha*partialf_y(current_x,current_y) # update of y
    next_func = func(next_x,next_y)
    partial_norm = ________ # stopping criterion: (f_x)^2+(f_y)^2 # student to fill in 
    print("Iteration",n+1,": x = ",next_x,", y = ",next_y,", f(x,y) = ",next_func)
    if partial_norm<epsilon:
        break