# How to optimize with Torch
Kai Puolamäki, 28 June 2022

## Introduction

This brief tutorial demonstrates how [PyTorch](https://pytorch.org) can be used to find minimum values of arbitrary functions, as is done in [SLISEMAP](https://github.com/edahelsinki/slisemap). The advantages of PyTorch include autograd (automatic differentiation) and optional GPU acceleration. These can give significant speedups when optimizing high-dimensional loss functions, which are common in deep learning but also occur elsewhere.

The existing documentation of PyTorch is geared towards deep learning. It is currently difficult to find documentation of how to do "simple" optimization without any deep learning context, which is why I wrote this tutorial in the hope that it will be useful for someone.

## Toy example

Here we minimise a simple least squares loss given by
$$
L = \lVert {\bf y}-{\bf X}{\bf b} \rVert_2^2,
$$
where ${\bf X}\in{\mathbb{R}}^{3\times 2}$ and ${\bf y}\in{\mathbb{R}}^3$ are constants and ${\bf{b}}\in{\mathbb{R}}^2$ is a vector whose values are to be found by the optimiser. We could optimize any reasonably behaving function; here we picked the least squares loss for simplicity.

In this example, we use the following values for the constant matrix and vector:
$$
{\bf{X}}=
\begin{pmatrix}
1 & 2.2 \\
1 & 3 \\
1 & 4.4
\end{pmatrix},~~~~
{\bf{y}}=
\begin{pmatrix}
1 \\ 2 \\ 3 
\end{pmatrix}.
$$
In this example, the loss $L$ obtains a minimal value of $L=0.0484$ when ${\bf{b}}=\left(-0.839,0.887\right)^T$.
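
Because the loss is quadratic in ${\bf{b}}$, the minimum quoted above can be cross-checked in closed form before any iterative optimiser is involved. The following is a minimal sketch that assumes NumPy is available and uses `np.linalg.lstsq`; the names `b_opt` and `loss_opt` are only illustrative.

```python
import numpy as np

# Constants of the toy example.
X = np.array([[1, 2.2], [1, 3], [1, 4.4]], dtype=float)
y = np.array([1, 2, 3], dtype=float)

# lstsq solves min_b ||y - X b||_2^2 directly (no iterative optimiser needed).
b_opt, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
loss_opt = float(((y - X @ b_opt) ** 2).sum())

print(b_opt.round(3), round(loss_opt, 4))
```

The printed values should agree with the minimum stated above up to rounding. For functions without such a closed form, an iterative optimiser is needed.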

## Numpy and Scipy

We first solve the problem with the standard `scipy` optimizer, starting from an arbitrarily chosen point.

We define the matrices and vectors as NumPy arrays and then a loss function `loss_fn0` that takes the value of ${\bf{b}}$ as input and outputs the value of the loss $L$.

In [1]:
import numpy as np
from scipy.optimize import minimize

In [2]:
X0 = np.array([[1,2.2], [1,3], [1,4.4]], dtype=float)
y0 = np.array([1,2,3], dtype=float)
b0 = np.array([0.12,0.34], dtype=float)

def loss_fn0(b):
    return ((y0 - X0 @ b)**2).sum()

loss_fn0(b0)

2.672479999999999

For this starting point the loss value is $L=2.672$, which is clearly larger than the optimal value.

We can find the value of ${\bf{b}}$ that minimizes the loss $L$ by using a library optimization algorithm, BFGS in this case. We find the correct value of ${\bf{b}}$ and the corresponding loss:

In [3]:
res = minimize(loss_fn0, b0, method="BFGS")
res.x, loss_fn0(res.x)

(array([-0.83870945,  0.8870967 ]), 0.04838709677420688)
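
When no gradient is supplied, `scipy.optimize.minimize` estimates it numerically with finite differences. For comparison with the autograd approach, the sketch below passes a hand-derived gradient through the `jac` argument. The names `grad_fn0` and `res_jac` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.optimize import minimize

X0 = np.array([[1, 2.2], [1, 3], [1, 4.4]], dtype=float)
y0 = np.array([1, 2, 3], dtype=float)
b0 = np.array([0.12, 0.34], dtype=float)

def loss_fn0(b):
    return ((y0 - X0 @ b) ** 2).sum()

def grad_fn0(b):
    # Gradient of ||y - X b||^2 with respect to b, derived by hand.
    return -2.0 * X0.T @ (y0 - X0 @ b)

# Same BFGS call as above, but with an explicit gradient function.
res_jac = minimize(loss_fn0, b0, jac=grad_fn0, method="BFGS")
print(res_jac.x, loss_fn0(res_jac.x))
```

Deriving such gradients by hand becomes tedious and error-prone for more complicated losses, which is the step that autograd automates.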

## PyTorch

We'll repeat the same with PyTorch. First we define a helper function `LBFGS` that takes the loss function and the variables to be optimized as input and, as a side effect, updates the variables to their values at the minimum of the loss function.

The helper function uses the Torch LBFGS optimizer. The `closure` is a function that essentially evaluates the loss function and updates the gradient values. The file [utils.py](https://github.com/edahelsinki/slisemap/blob/main/slisemap/utils.py) in the SLISEMAP source code contains a more advanced version of the helper function.

You can use this helper function as a generic optimizer, much in the same way as you would use the `scipy.optimize.minimize` above.

In [4]:
import torch

def LBFGS(loss_fn, variables, max_iter=500, line_search_fn="strong_wolfe", **kwargs):
    """Optimise a function using LBFGS.
    Args:
        loss_fn (Callable[[], torch.Tensor]): Function that returns a value to be minimised.
        variables (List[torch.Tensor]): List of variables to optimise (must have `requires_grad=True`).
        max_iter (int, optional): Maximum number of LBFGS iterations. Defaults to 500.
        line_search_fn (Optional[str], optional): Line search method (None or "strong_wolfe"). Defaults to "strong_wolfe".
        **kwargs (optional): Arguments passed to `torch.optim.LBFGS`.
    Returns:
        torch.optim.LBFGS: The LBFGS optimiser.
    """
    
    optimiser = torch.optim.LBFGS(
        variables,
        max_iter=max_iter,
        line_search_fn=line_search_fn,
        **kwargs
    )
    
    def closure():
        optimiser.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss
    
    optimiser.step(closure)
    
    return optimiser
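
To illustrate that the helper is not tied to least squares, here is a sketch that applies it to a non-quadratic function. The Rosenbrock function, the starting point, and the use of double precision are assumptions made for this illustration only. The helper is repeated in compact form so that the block runs on its own.

```python
import torch

def LBFGS(loss_fn, variables, max_iter=500, line_search_fn="strong_wolfe", **kwargs):
    # Compact copy of the helper defined above.
    optimiser = torch.optim.LBFGS(
        variables, max_iter=max_iter, line_search_fn=line_search_fn, **kwargs
    )

    def closure():
        optimiser.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss

    optimiser.step(closure)
    return optimiser

# Illustrative test problem: the Rosenbrock function has its global minimum at (1, 1).
v = torch.tensor([-1.5, 2.0], dtype=torch.double, requires_grad=True)

def rosenbrock():
    return (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2

LBFGS(rosenbrock, [v])
print(v.detach(), rosenbrock().item())
```

As with the least squares example, the loss function takes no arguments and reads the variable `v` from the enclosing scope.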

Torch functions typically require that we define the variables as torch tensors. Torch tensors correspond to NumPy arrays, but they carry autograd information and can optionally be placed on a GPU. Notice that we need to attach the slot for the gradients to the ${\bf{b}}$ tensor (`requires_grad=True`) because we want to optimize it!

In [5]:
X = torch.tensor(X0, dtype=torch.float)
y = torch.tensor(y0, dtype=torch.float)
b = torch.tensor(b0, dtype=torch.float, requires_grad=True)
X, y, b

(tensor([[1.0000, 2.2000],
         [1.0000, 3.0000],
         [1.0000, 4.4000]]),
 tensor([1., 2., 3.]),
 tensor([0.1200, 0.3400], requires_grad=True))
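
The "slot for the gradients" can be inspected directly. The sketch below (with illustrative names `grad_before` and `grad_manual`) shows that `b.grad` is empty until a backward pass has been run, and that the autograd result agrees with the hand-derived gradient $-2{\bf X}^T({\bf y}-{\bf X}{\bf b})$.

```python
import torch

X = torch.tensor([[1, 2.2], [1, 3], [1, 4.4]], dtype=torch.float)
y = torch.tensor([1, 2, 3], dtype=torch.float)
b = torch.tensor([0.12, 0.34], dtype=torch.float, requires_grad=True)

grad_before = b.grad  # None: no backward pass has been run yet

loss = ((y - X @ b) ** 2).sum()
loss.backward()  # fills b.grad with dL/db

# Hand-derived gradient for comparison, computed outside autograd.
with torch.no_grad():
    grad_manual = -2 * X.T @ (y - X @ b)

print(grad_before, b.grad, grad_manual)
```

Only tensors created with `requires_grad=True` receive a gradient; the constants `X` and `y` are left untouched.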

The safe way to convert Torch tensors to NumPy arrays is to first move them to the CPU, then detach any autograd part, and finally convert them to NumPy arrays:

In [6]:
X.cpu().detach().numpy()

array([[1. , 2.2],
       [1. , 3. ],
       [1. , 4.4]], dtype=float32)
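
The detach step matters for tensors that are tracked by autograd. As a small sketch (the flag `direct_conversion_ok` is only illustrative), calling `numpy()` directly on such a tensor raises an error, whereas the procedure above works:

```python
import torch

b = torch.tensor([0.12, 0.34], requires_grad=True)

try:
    b.numpy()  # fails because b is tracked by autograd
    direct_conversion_ok = True
except RuntimeError as err:
    direct_conversion_ok = False
    print("Direct conversion failed:", err)

# The safe procedure: CPU first, then detach, then convert.
b_np = b.cpu().detach().numpy()
print(b_np)
```

For a tensor that already lives on the CPU, the resulting array shares memory with the tensor, so copy it if you intend to modify it independently.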

Next, we define the loss function that takes no parameters as an input and which outputs the loss (a tensor with only one real number as a value). If you want to evaluate the value of loss for different values of ${\bf{b}}$ you must update the values in the corresponding tensor.

It is important to use only Torch arithmetic operations that support autograd. Luckily, there are enough operations to cover most needs. Instead of the `sum` method of the [Tensor object](https://pytorch.org/docs/stable/tensors.html), as in the first example below, we can alternatively use [torch.sum](https://pytorch.org/docs/stable/generated/torch.sum.html) (both of which support torch tensors and autograd), but we cannot use, e.g., [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) (which does not support torch tensors and autograd).

In [7]:
## Using the sum method in the Tensor object:
def loss_fn():
    return ((y - X @ b)**2).sum()

## Alternate but equivalent way of writing the same thing by using torch.sum:
def loss_fn():
    return torch.sum((y - X @ b)**2)

## You cannot use, e.g., Numpy operations which do not support tensors and autograd:
def loss_fn_WRONG_DO_NOT_USE():
    return np.sum((y - X @ b)**2)

Evaluating the loss function gives the value of the loss (as a tensor):

In [8]:
loss_fn()

tensor(2.6725, grad_fn=<SumBackward0>)
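
Since the loss function takes no arguments, evaluating it at another point means overwriting the contents of the tensor `b`. One way to do this, sketched below under the assumption that the update itself should not be recorded by autograd, is an in-place copy inside `torch.no_grad()`. The names `loss_start` and `loss_near_opt` are illustrative.

```python
import torch

X = torch.tensor([[1, 2.2], [1, 3], [1, 4.4]], dtype=torch.float)
y = torch.tensor([1, 2, 3], dtype=torch.float)
b = torch.tensor([0.12, 0.34], dtype=torch.float, requires_grad=True)

def loss_fn():
    return torch.sum((y - X @ b) ** 2)

loss_start = loss_fn().item()

# Overwrite b in place; no_grad keeps the assignment out of the autograd graph.
with torch.no_grad():
    b.copy_(torch.tensor([-0.839, 0.887]))

loss_near_opt = loss_fn().item()
print(loss_start, loss_near_opt)
```

After the copy, `b` is still the same tensor object with `requires_grad=True`, so it can be passed to an optimiser as before.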

If we want the loss value as a plain Python number, the correct procedure is to first move the tensor to the CPU (this matters only if we use a GPU; otherwise it is a no-op), then detach the autograd component, and finally take the only item out as a number:

In [9]:
loss_fn().cpu().detach().item()

2.6724798679351807

We use the helper function `LBFGS` defined above to do the optimization. We need to give as parameters the loss function and a list of tensors to be optimized. As a result, the value of the tensor ${\bf{b}}$ is updated to the value that minimizes the loss!

In [10]:
LBFGS(loss_fn, [b])
X, y, b

(tensor([[1.0000, 2.2000],
         [1.0000, 3.0000],
         [1.0000, 4.4000]]),
 tensor([1., 2., 3.]),
 tensor([-0.8387,  0.8871], requires_grad=True))

The optimum value of the loss function is the same, up to floating-point precision, as in the first example with NumPy and the standard SciPy optimization function.

In [11]:
loss_fn().cpu().detach().item()

0.048387106508016586
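
The introduction mentions optional GPU acceleration. The following sketch shows one way the same optimization could be run on a GPU when one is available, falling back to the CPU otherwise. The device selection and the switch to double precision are assumptions made for this illustration, not something the original notebook does.

```python
import torch

# Use a GPU if PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# All tensors taking part in the loss must live on the same device.
X = torch.tensor([[1, 2.2], [1, 3], [1, 4.4]], dtype=torch.double, device=device)
y = torch.tensor([1, 2, 3], dtype=torch.double, device=device)
b = torch.tensor([0.12, 0.34], dtype=torch.double, device=device, requires_grad=True)

def loss_fn():
    return torch.sum((y - X @ b) ** 2)

optimiser = torch.optim.LBFGS([b], max_iter=500, line_search_fn="strong_wolfe")

def closure():
    optimiser.zero_grad()
    loss = loss_fn()
    loss.backward()
    return loss

optimiser.step(closure)

# Move the results back to the CPU before converting them.
b_opt = b.cpu().detach().numpy()
loss_opt = loss_fn().cpu().detach().item()
print(b_opt, loss_opt)
```

For a problem this small, a GPU is unlikely to give any speedup; the benefit is expected only for much larger tensors.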