# Numerical Optimization using PyTorch

In this Jupyter notebook, I perform some basic econometric optimization routines using the PyTorch package.


In [1]:
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable, grad
import torch.distributions as td

## A Basic Optimization Problem

Consider the following problem:

$$
    \min_{x} \quad 2 x^2 - 7 x + 6
$$

We use the `torch.tensor` function to store parameters, constants and matrices. The parameters over which we optimize the objective function additionally need `requires_grad=True`, so that PyTorch records the operations applied to them and can compute gradients by backpropagation.

In [2]:
x = torch.tensor([-2.45], requires_grad=True)

optimizer = optim.Adam([x], lr=0.05)

for _ in range(1000):
    optimizer.zero_grad()
    y = (2 * x**2 - 7 * x + 6)
    y.backward()
    optimizer.step()

print("Numerical optimization solution:", x)
print("Analytic optimization solution:", 7/4)

Numerical optimization solution: tensor([1.7500], requires_grad=True)
Analytic optimization solution: 1.75


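`optimizer.step()` updates the parameter using the gradient computed by the `.backward()` call, moving it toward a minimizer of the objective function. `optimizer.zero_grad()` clears the gradient from the previous iteration, which PyTorch would otherwise accumulate.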
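To see what a single update does, the sketch below compares the gradient that `.backward()` stores in `.grad` with the analytic derivative $4x - 7$. Plain SGD is swapped in for Adam here (my assumption, purely for illustration) so that the update is simply $x \leftarrow x - \text{lr} \cdot \nabla$; the names `x0` and `sgd` are hypothetical.

```python
import torch
import torch.optim as optim

# Same starting point as above, under a different (hypothetical) name.
x0 = torch.tensor([-2.45], requires_grad=True)
lr = 0.05
sgd = optim.SGD([x0], lr=lr)  # plain gradient descent, for illustration only

sgd.zero_grad()
y0 = (2 * x0**2 - 7 * x0 + 6).sum()
y0.backward()  # stores dy/dx in x0.grad

autograd_grad = x0.grad.clone()
analytic_grad = 4 * x0.detach() - 7  # d/dx (2x^2 - 7x + 6)

x_before = x0.detach().clone()
sgd.step()  # x0 <- x0 - lr * x0.grad
x_after = x0.detach().clone()

print("autograd gradient:", autograd_grad)
print("analytic gradient:", analytic_grad)
print("step taken:       ", x_after - x_before)
```

Adam rescales this raw gradient step using running averages of past gradients, so its updates differ in size, but the gradient it consumes is computed in exactly the same way.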

----------------------------------------------------------

## Maximum Likelihood Estimation Part 1

In this example, consider some observations $\{X_i \mid i = 1, \ldots, N\}$ drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.

In [3]:
N = 1000
μ, σ = 0.05, 0.1
x_data = μ + torch.randn(N) * σ

We know that the ML estimates maximize the log-likelihood function, or equivalently minimize the negative log-likelihood. We can represent the mean and standard deviation as PyTorch tensors that track gradients (the role formerly played by `Variable`) and ask the optimizer to minimize the objective accordingly.

$$
    \widehat{\boldsymbol{\theta}} \; \equiv \; \begin{pmatrix} \widehat{\mu} \\ \widehat{\sigma} \end{pmatrix} \quad \in \;\; \underset{\begin{pmatrix} \mu \\ \sigma \end{pmatrix}}{\arg \min} \; \frac{N}{2} \log (2 \pi) + N \log \sigma + \frac{1}{2} \sum_{i = 1}^N  {\left( \frac{X_i - \mu}{\sigma} \right)}^2
$$

I used the built-in `log_prob` method of the `Normal` distribution in `torch.distributions` to evaluate the log-density at each observation, and summed these values to obtain the log-likelihood of the sample.
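As a sanity check, the following minimal sketch confirms that the summed `log_prob` values agree with the closed-form log-likelihood of the normal model. The seed, the sample `x_check` and the trial values `mu0` and `sigma0` are assumptions introduced only for this check.

```python
import math
import torch
import torch.distributions as td

torch.manual_seed(0)  # hypothetical seed, only to make the check reproducible
n_obs = 1000
x_check = 0.05 + 0.1 * torch.randn(n_obs)

# Arbitrary trial parameter values at which to evaluate the log-likelihood.
mu0 = torch.tensor(0.0)
sigma0 = torch.tensor(1.0)

# Log-likelihood via torch.distributions.
ll_torch = td.Normal(loc=mu0, scale=sigma0).log_prob(x_check).sum()

# Closed form: -N/2 log(2 pi) - N log(sigma) - 1/2 sum(((x - mu) / sigma)^2).
ll_closed = (
    -n_obs / 2 * math.log(2 * math.pi)
    - n_obs * torch.log(sigma0)
    - 0.5 * (((x_check - mu0) / sigma0) ** 2).sum()
)

print("log_prob sum:", ll_torch.item())
print("closed form: ", ll_closed.item())
```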

The subpackage `torch.optim` implements various optimization algorithms. To construct an `Optimizer`, we pass an iterable containing the parameters to optimize (here, $\mu$ and $\sigma$), followed by optimizer-specific options such as the learning rate or weight decay. I chose the Adam algorithm with a learning rate of 0.05.

In [4]:
# Set the starting values of the parameters for the ML optimization.
# requires_grad=True makes them trainable; the Variable wrapper is not needed.
mu_hat = torch.zeros(1, requires_grad=True)
sigma_hat = torch.ones(1, requires_grad=True)

# Define the objective function.
def log_lik(mu, sigma):
    return td.Normal(loc=mu, scale=sigma).log_prob(x_data).sum()

# Define the adaptive gradient descent optimizer used to find the estimates.
opt = optim.Adam([mu_hat, sigma_hat], lr=0.05)

In [5]:
for epoch in range(10000):

    opt.zero_grad() # Reset gradient inside the optimizer

    # Compute the objective at the current parameter values.
    loss = - log_lik(mu_hat, sigma_hat)
    loss.backward() # Gradient computed.
    opt.step()      # Update parameter values with one Adam step.

In [6]:
print('Parameters: mu = {:10.3e}, sigma = {:10.3e}'.format(
    mu_hat.detach().numpy()[0], sigma_hat.detach().numpy()[0])
)

Parameters: mu =  5.195e-02, sigma =  9.651e-02
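For the normal model the ML estimates also have a closed form: the sample mean, and the square root of the mean squared deviation (the variance with divisor $N$, not $N - 1$). The sketch below computes them on a simulated sample and checks that the gradient of the negative log-likelihood vanishes there. The seed and the names `x_check`, `mu_closed` and `sigma_closed` are assumptions for illustration; double precision is used so the gradient is numerically close to zero.

```python
import torch
import torch.distributions as td

torch.manual_seed(0)  # hypothetical seed, only to make the check reproducible
n_obs = 1000
x_check = 0.05 + 0.1 * torch.randn(n_obs, dtype=torch.float64)

# Closed-form ML estimates for the normal model.
mu_closed = x_check.mean()
sigma_closed = ((x_check - mu_closed) ** 2).mean().sqrt()

# Evaluate the gradient of the negative log-likelihood at these values.
mu_var = mu_closed.clone().requires_grad_(True)
sigma_var = sigma_closed.clone().requires_grad_(True)
neg_ll = -td.Normal(loc=mu_var, scale=sigma_var).log_prob(x_check).sum()
grad_mu, grad_sigma = torch.autograd.grad(neg_ll, [mu_var, sigma_var])

print("closed-form estimates:", mu_closed.item(), sigma_closed.item())
print("gradient at estimates:", grad_mu.item(), grad_sigma.item())
```

Estimates returned by the optimizer should agree with these values up to the optimizer's convergence tolerance.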


We know that the asymptotic distribution of the ML estimator is given by

$$
    \sqrt{N} \left(\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta} \right) \;\; \underset{d}{\longrightarrow} \;\; \mathcal{N}\left( \boldsymbol{0} \, , \, \mathbf{I}(\boldsymbol{\theta})^{-1} \right) \;\; \Longrightarrow \;\; \widehat{\boldsymbol{\theta}} \; \underset{d}{\sim} \;\; \mathcal{N}\left( \boldsymbol{\theta} \, , \, \frac{\mathbf{I}(\boldsymbol{\theta})^{-1}}{N}  \right)
$$

Since the Fisher information matrix is the negative expected Hessian of the log-likelihood, we can use `torch.autograd.functional.hessian` to compute the Hessian at the estimates and negate it to obtain the observed information matrix. The standard errors are then the square roots of the diagonal elements of the inverse information matrix.

In [7]:
theta_hat = (mu_hat, sigma_hat)

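# Observed information matrix: the negative Hessian of the summed
# log-likelihood, which already aggregates all N observations.
I = -torch.tensor(torch.autograd.functional.hessian(log_lik, theta_hat))

# Variance matrix of the estimator. No further division by N is needed,
# because I estimates N times the per-observation Fisher information.
V = torch.inverse(I)

# Standard errors: square roots of the diagonal of V.
std_err = np.sqrt(np.diag(V))

In [8]:
print('Standard Errors: mu = {:10.3e}, sigma = {:10.3e}'.format(std_err[0], std_err[1]))

Standard Errors: mu =  3.052e-03, sigma =  2.095e-03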
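For the normal model the information matrix is known in closed form, so Hessian-based standard errors can be checked against $\widehat{\sigma} / \sqrt{N}$ for the mean and $\widehat{\sigma} / \sqrt{2N}$ for the standard deviation. The sketch below is a self-contained check under stated assumptions: the seed, the sample `x_check` and the helper `log_lik_vec` (which packs both parameters into one vector so the Hessian is a plain $2 \times 2$ matrix) are hypothetical names, and the closed-form ML estimates are used in place of the optimizer output.

```python
import torch
import torch.distributions as td
from torch.autograd.functional import hessian

torch.manual_seed(0)  # hypothetical seed, only to make the check reproducible
n_obs = 1000
x_check = 0.05 + 0.1 * torch.randn(n_obs, dtype=torch.float64)

# Closed-form ML estimates for the normal model.
mu_mle = x_check.mean()
sigma_mle = ((x_check - mu_mle) ** 2).mean().sqrt()

def log_lik_vec(theta):
    # theta[0] is mu and theta[1] is sigma; the sum runs over all observations.
    return td.Normal(loc=theta[0], scale=theta[1]).log_prob(x_check).sum()

theta_mle = torch.stack([mu_mle, sigma_mle])

# The negative Hessian of the summed log-likelihood is the observed information.
H = hessian(log_lik_vec, theta_mle)
se_hessian = torch.sqrt(torch.diag(torch.inverse(-H)))

# Analytic standard errors: sigma / sqrt(N) and sigma / sqrt(2N).
se_analytic = torch.stack([sigma_mle / n_obs**0.5, sigma_mle / (2 * n_obs) ** 0.5])

print("Hessian-based:", se_hessian)
print("Analytic:     ", se_analytic)
```

With $\sigma \approx 0.1$ and $N = 1000$, both standard errors are of order $10^{-3}$.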
