# Optimization and Calculus of Variations

## Overview
### 1. Optimization Primer
### 2. The Derivative
### 3. Calculus of Variations
### 4. The Euler-Lagrange Equations

## 1. Optimization Primer

Optimization is the science of finding the best way to do something. Unsurprisingly, it is prevalent in many fields of engineering, science, and finance. We want to find ways to maximize profits, minimize failures, et cetera.
The entire field of Operations Research is dedicated to solving optimization problems.

In its simplest form, an optimization problem is given as a function, and we seek the point at which this function attains a minimum or maximum.

**Example**: I hold two risky securities whose returns have variances $\sigma_1^2$ and $\sigma_2^2$. The securities are correlated, with covariance $c$. What proportion of each security should I hold in order to minimize the variance of the portfolio?

**Model**: Let $w$ be the fraction that I invest in security 1. Then $(1-w)$ is the fraction of cash invested in the second security. The total variance of this portfolio is $w^2\sigma_1^2 + (1-w)^2\sigma_2^2+2w(1-w)c$.
To solve the problem, I must find $w$.


### Methodology

In the above example, we modeled the problem as a function of one unknown variable. We seek the value of the variable that results in the minimum value of the function. 

We should make precise what we mean by a minimum:

#### Definition: Minimum
Let $S$ be a set and $f: S \rightarrow \mathbb{R}$ a function from this set to the real numbers. $f$ has a **local minimum** at $x_0 \in S$ if there exists a neighborhood $U$ of $x_0$ such that $f(x) \geq f(x_0)$ for all $x \in U$. 
$x_0$ is a **global minimum** if $f(x) \geq f(x_0)$ for all $x \in S$.

The definition for maximum is similar with the inequalities reversed. 

I have not yet defined what I mean by a *neighborhood*. Intuitively, it means a subset of $S$ consisting of points that are close to $x_0$. In order to determine whether points are close or far, we need some measure of distance defined on the set $S$.
Luckily, the real line, as well as any Euclidean space, comes equipped with a natural measure of distance.

Recall that the standard norm in $\mathbb{R}^n$ is given by $||x|| = \sqrt{\sum_i x_i^2}$,
and the distance between two points $x$ and $y$ is given by the norm of the difference:
$||x-y|| = \sqrt{\sum_i (x_i-y_i)^2}$.

The key result for solving optimization problems is the following theorem from calculus:

#### Theorem: Necessary Optimality Conditions
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a continuously differentiable function.

> If $f$ has a local minimum at $x_0$, then $\nabla f(x_0) = 0$. 

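The converse is not always true, but if $f$ is twice continuously differentiable,
a stronger condition guarantees a minimum.

> If $\nabla f(x_0) = 0$ and the Hessian $\nabla^2 f(x_0)$ is positive definite, then $f$ has a local minimum at $x_0$. Conversely, at any local minimum, $\nabla f(x_0) = 0$ and $\nabla^2 f(x_0)$ is positive semidefinite ($\nabla^2 f(x_0) \geq 0$).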


Note that $\nabla f$ stands for the vector $(\partial{f}/\partial{x_1}, \ldots, \partial{f}/\partial{x_n})$.
In the one-dimensional case this is simply the well-known derivative $\frac{df}{dx}$.
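To see why a zero derivative is necessary but not sufficient, here is a small numerical sketch. The helper names and the two test functions ($x^2$ and $x^3$) are my own choices for illustration, and the derivatives are approximated with finite differences rather than computed exactly.

```python
def derivative(f, x, step=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step**2

def square(x):
    return x**2   # zero derivative at 0, and a genuine minimum there

def cube(x):
    return x**3   # zero derivative at 0, but no minimum there

for f in (square, cube):
    print(f.__name__, derivative(f, 0.0), second_derivative(f, 0.0))

# cube takes values below cube(0) arbitrarily close to 0,
# so 0 is a critical point without being a local minimum.
print(cube(-1e-3) < cube(0.0))
```

Both functions have (approximately) zero derivative at 0, but only $x^2$ has a positive second derivative there.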

The following picture illustrates a function $f: \mathbb{R} \rightarrow \mathbb{R}$ and the various points where the derivative is 0.

<img src='variations/Tangent_function_animation.gif'/>

This property suggests a simple algorithm for finding the minimum (assuming the function attains one): find all points where the derivative of the function is 0. If there are several, compute the value of the function at each one and choose the smallest.
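The algorithm just described can be sketched in a few lines. This is only an illustration under my own assumptions: the quartic `f`, the search interval $[-3, 3]$, and the scan-plus-bisection root finder are all hypothetical choices, and the scan only detects zeros where the derivative changes sign.

```python
def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    return 4 * x**3 - 6 * x + 1

def bisect(g, lo, hi, tol=1e-12):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = g(mid)
        if (g_lo < 0) == (g_mid < 0):
            lo, g_lo = mid, g_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return (lo + hi) / 2

def critical_points(g, lo, hi, steps=1000):
    """Scan [lo, hi] for sign changes of g and refine each one by bisection."""
    points = []
    width = (hi - lo) / steps
    for i in range(steps):
        a, b = lo + i * width, lo + (i + 1) * width
        if (g(a) < 0) != (g(b) < 0):
            points.append(bisect(g, a, b))
    return points

# Step 1: find all points where the derivative is 0.
candidates = critical_points(df, -3.0, 3.0)
# Step 2: evaluate f at each candidate and keep the smallest.
best = min(candidates, key=f)
print(candidates, best)
```

For this particular quartic the derivative has three zeros, and the smallest value of `f` is attained at the negative one.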

Using this algorithm, we can now solve the variance minimization problem:

$$
f(w) = w^2\sigma_1^2 + (1-w)^2\sigma_2^2+2w(1-w)c
$$

Taking the derivative and doing a bit of algebra, we obtain:

$$
w_{min} = \frac{\sigma_2^2 - c}{\sigma_1^2+\sigma_2^2-2c}
$$
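As a sanity check, the closed-form weight can be compared against a brute-force search. The numbers below (variances 0.04 and 0.09, covariance 0.01) are made up for illustration, and the function names are my own.

```python
def portfolio_variance(w, var1, var2, cov):
    """Variance of a portfolio with weight w in security 1 and (1 - w) in security 2."""
    return w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov

def min_variance_weight(var1, var2, cov):
    """Closed-form stationary point w_min from the formula above."""
    return (var2 - cov) / (var1 + var2 - 2 * cov)

# Hypothetical inputs: sigma_1^2 = 0.04, sigma_2^2 = 0.09, c = 0.01.
var1, var2, cov = 0.04, 0.09, 0.01
w_min = min_variance_weight(var1, var2, cov)

# Brute-force check on a fine grid of weights in [0, 1].
grid = [i / 10000 for i in range(10001)]
w_grid = min(grid, key=lambda w: portfolio_variance(w, var1, var2, cov))

print(w_min, w_grid)
```

The two weights should agree up to the grid resolution. Note that the stationary point is a minimum only when $\sigma_1^2+\sigma_2^2-2c > 0$, which holds for these inputs.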

Even though finding minima by computing the points of zero gradient only guarantees a local minimum, it works so well that it has widespread applications. 
A large class of Machine Learning problems are formulated as multivariate optimization problems that are solved by searching for points of zero gradient. When it comes to Deep Learning networks, this optimization involves spaces of tens of thousands, even millions, of dimensions.
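In high dimensions one usually cannot solve $\nabla f = 0$ in closed form, so the zero-gradient point is approached iteratively. Gradient descent is one common way to do this; it is not derived in this text, and the sketch below uses a toy two-variable objective, step size, and iteration count that I picked purely for illustration.

```python
def grad(x, y):
    """Gradient of the hypothetical objective f(x, y) = (x - 1)^2 + 2 * (y + 3)^2."""
    return 2 * (x - 1), 4 * (y + 3)

def gradient_descent(x, y, rate=0.1, steps=200):
    """Repeatedly step against the gradient until it is (nearly) zero."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - rate * gx, y - rate * gy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)
print(x_min, y_min, grad(x_min, y_min))
```

The iterates settle at the point where the gradient vanishes, which for this objective is $(1, -3)$.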

## Limitations
We have a powerful tool for calculating optimality over a large number of dimensions. Yet, there are still problems that cannot be solved with this framework as developed so far.

### Example: Surface of revolution
Consider the following problem: I draw the graph of a function $x(t): \mathbb{R} \rightarrow \mathbb{R}$ between two points $t_1$ and $t_2$, with $0 < t_1 < t_2$. Then I revolve the graph around the $x$-axis to create a surface.
The surface thus described has area 
$$
A = 2\pi\int\limits_{t_1}^{t_2} t\sqrt{1+\dot{x}(t)^2}\,dt
$$

We are interested in finding the function that results in the surface of revolution of minimal area.

<img src='variations/revolution-vase.png'/>

## 2. The Derivative

A problem of the type above calls for optimizing a function of another function. Such a function is frequently called a *functional*. We can think of a functional as a function $F: V \rightarrow \mathbb{R}$, where $V$ is a space of functions.
Unlike the problems we tackled before, where the domain of $f: \mathbb{R}^n \rightarrow \mathbb{R}$ had finite dimension, 
this new space of functions can have infinite dimension.

Can we even take derivatives in an infinite dimensional space?


The first thing to do is to look closely at the definition of the derivative and see how it can be extended.

In a calculus class the derivative at point $x$ is usually defined as

$$
\frac{df}{dx} := \lim_{h \rightarrow 0}\frac{f(x+h)-f(x)}{h} 
$$

Even in this simple one-dimensional definition one has to be careful, because we may arrive at different results if we approach 0 from the left ($h$ negative) or from the right ($h$ positive).
<img src='variations/Absolute_value.png'  width='300'/>

So in general we need to make sure that we take the derivative in one direction, then the other and make sure that the two agree. In more dimensions, the situation becomes even trickier. Each partial derivative may exist at a given point, but the function may still fail to be differentiable at that point. 
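The absolute value function makes the one-dimensional issue concrete. The following sketch (the helper name and step size are my own) evaluates the difference quotient at 0 from each side.

```python
def one_sided_quotients(f, x, h=1e-6):
    """Difference quotients of f at x, approaching from the left and from the right."""
    right = (f(x + h) - f(x)) / h
    left = (f(x - h) - f(x)) / (-h)
    return left, right

left, right = one_sided_quotients(abs, 0.0)
print(left, right)
```

The two quotients come out with opposite signs no matter how small `h` is, so the two one-sided limits disagree and $|x|$ has no derivative at 0.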

There are a few ways to generalize the simple derivative from calculus to more than one dimension and even to infinite-dimensional vector spaces. The "proper" way to define a derivative is called the *Fréchet* or *strong* derivative. 
Its definition is somewhat technical, and getting into the details would bring us too far afield.

Instead, I will present a generalization of the directional derivative, called the *variational derivative* (also known as the *Gateaux derivative*), which is simpler and is what is usually used in practice.

### Definition: The Variational Derivative
Let $F: V \rightarrow \mathbb{R}$ be a real-valued function defined on a vector space $V$ (possibly infinite dimensional).
The *variational derivative* of $F$ at $x$ in the direction $h$ is given by 

$$
D_hF(x) := \lim_{\epsilon \to 0} \frac{F(x+\epsilon h)-F(x)}{\epsilon}
$$

where $\epsilon$ is a positive real number.

Note that this derivative in general depends on the direction vector $h$. If, upon calculating the derivative,
we find that it exists for every $h$ and depends on $h$ linearly, then that's a good sign, as it means that the derivative is probably well defined in every direction.
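The definition can be tried out numerically on a simple functional. In the sketch below everything is an assumption made for illustration: the functional $F(x) = \int_0^1 \sin(x(t))\,dt$, the functions `x` and `h`, the midpoint-rule discretization, and the finite value of $\epsilon$ standing in for the limit. For this $F$, a short calculation gives $D_hF(x) = \int_0^1 \cos(x(t))\,h(t)\,dt$, which the code uses as a reference.

```python
import math

N = 2000                                   # number of midpoint-rule cells on [0, 1]
ts = [(i + 0.5) / N for i in range(N)]

def integrate(values):
    """Midpoint-rule integral over [0, 1] of samples taken at the points ts."""
    return sum(values) / N

def F(x):
    """Functional F(x) = integral of sin(x(t)) dt, for x given as a callable."""
    return integrate([math.sin(x(t)) for t in ts])

def variational_derivative(F, x, h, eps=1e-6):
    """Finite-epsilon approximation of D_h F(x)."""
    shifted = lambda t: x(t) + eps * h(t)
    return (F(shifted) - F(x)) / eps

x = lambda t: t**2
h = lambda t: math.cos(3 * t)

numeric = variational_derivative(F, x, h)
analytic = integrate([math.cos(x(t)) * h(t) for t in ts])
print(numeric, analytic)
```

The two numbers agree up to an error of order `eps`. Doubling `h` doubles the result, which is the linear dependence on the direction one hopes to see.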

## 3.  Calculus of Variations

The *Calculus of Variations* is the field of computing derivatives and solving optimization problems over function spaces.

Luckily, it turns out that the conditions for optimality still hold, with some care, in this case:

> If $x_0$ is a local minimum of $F: V \to \mathbb{R}$, then the generalized (Fréchet) derivative is 0 at $x_0$.

As we mentioned above, in practice it is sufficient to calculate the variational (Gateaux) derivative and require that it vanish for every choice of "direction" $h$.

### Example: Derivative of the norm.

Let $V$ be a (real) Hilbert space. A Hilbert space is a complete vector space equipped with an inner product, and therefore a norm. Hilbert spaces of finite and infinite dimension are omnipresent in Quantum Mechanics.

Let $\langle v,w \rangle$ denote the inner product of two vectors in the space, and $||v||$ the norm of the vector $v$.
We are interested in computing the derivative of the squared norm in the direction of $h$.

Applying the definition of the derivative:

$$
D_h ||x||^2 := \lim_{\epsilon \to 0} \frac{||x+\epsilon h||^2-||x||^2}{\epsilon}
$$

Now we remember that $||x||^2 = \langle x,x \rangle$, so by bilinearity and symmetry of the inner product we can write

$$
||x+\epsilon h||^2 = \langle x+\epsilon h, x+\epsilon h \rangle = ||x||^2+2\epsilon \langle x,h \rangle+\epsilon^2||h||^2
$$

Plugging this into the formula for the derivative, we get 

$$
D_h ||x||^2 = \lim_{\epsilon \to 0} \frac{2\epsilon \langle x,h \rangle+\epsilon^2||h||^2}{\epsilon} = 2\langle x,h \rangle
$$

So we obtain a result consistent with the finite-dimensional case, where the gradient of $||x||^2$ is $2x$.
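The result is easy to check in $\mathbb{R}^3$ with the standard inner product. The vectors `x` and `h` below are arbitrary values chosen for illustration, and a small finite $\epsilon$ stands in for the limit.

```python
def inner(v, w):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(v, w))

def squared_norm(v):
    return inner(v, v)

def directional_derivative(f, x, h, eps=1e-6):
    """Finite-epsilon approximation of D_h f(x)."""
    shifted = [xi + eps * hi for xi, hi in zip(x, h)]
    return (f(shifted) - f(x)) / eps

x = [1.0, -2.0, 0.5]
h = [0.3, 0.1, -4.0]

numeric = directional_derivative(squared_norm, x, h)
exact = 2 * inner(x, h)
print(numeric, exact)
```

The small gap between the two numbers is the leftover term $\epsilon||h||^2$ from the expansion, which disappears in the limit.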

## 4. The Euler-Lagrange Equations

We are now in a position to derive and apply the Euler-Lagrange equations. Let

$$
I(x) = \int\limits_{t_1}^{t_2} L(x, \dot{x})dt
$$

**Problem**: Find the function $x$ that minimizes $I$.

This is an optimization problem in an infinite dimensional space. 
It turns out that the situation is similar to the finite-dimensional case. 
We look for places where the derivative of $I$ is equal to 0.

As we mentioned, the Fréchet derivative is hard to compute from the definition. Instead, we compute the variational (Gateaux) derivative and require that it vanish in every direction.


### Preparation

$x$ is a function of the variable $t$ and $\dot{x}$ is its derivative. $L(x, \dot{x})$ can be viewed as a function of two variables $L(x, y)$.

Recall that the Taylor expansion of any such (sufficiently smooth) function $L$ at the point $(x+ex', y+ey')$ is given by

$$
L(x+ex', y+ey') = L(x,y) + e \left(\frac{\partial L(x,y)}{\partial x}x' + \frac{\partial L(x,y)}{\partial y}y'\right) + O(e^2)
$$

where $e$ is a small number. I can write the above as 

$$
L(x+ex', y+ey') - L(x,y) =  e \left(\frac{\partial L(x,y)}{\partial x}x' + \frac{\partial L(x,y)}{\partial y}y'\right) + O(e^2)
$$


### Computation of the derivative

Now we are going to use the definition of the variational derivative, applied to $I$: 
$\lim_{e \to 0} \frac{I(x+e h)-I(x)}{e}$.

First we compute $I(x+eh)-I(x)$, where $e$ is a small number and $h$ an arbitrary function.
It is not, however, completely arbitrary. 
The function $h$ must be chosen so that the value at the endpoints $t_1$ and $t_2$ remains the same. 
In other words, we must have $x(t_1)+eh(t_1) = x(t_1)$, from which we obtain $h(t_1) = 0$. 
The same holds for $t_2$. The idea is illustrated in the picture below.

<img src='variations/variations.png' width=500/>

$$
I(x+eh) - I(x) =\\
\int dt \left(L(x+eh, \dot{x}+e\dot{h})-L(x,\dot{x})\right)
$$

Now I can use the Taylor expansion to obtain

$$
I(x+eh) - I(x) =\\
e\int dt \left(\frac{\partial L}{\partial x}h + \frac{\partial L}{\partial \dot{x}}\dot{h} \right) + O(e^2)
$$

We know that once we divide by $e$ and take the limit $e\rightarrow 0$, the term $O(e^2)$ goes to 0, so we are not going to worry about it.

We are presently concerned with the second term in the integrand. We are going to use a trick from calculus, the product rule:

$$
\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}h\right) = 
h\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) + \dot{h}\frac{\partial L}{\partial \dot{x}} \implies\\
\dot{h}\frac{\partial L}{\partial \dot{x}} = 
\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}h\right)
- h\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right)
$$

Now we note that 

$$
\int\limits_{t_1}^{t_2} dt \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}h\right)
= \left.\frac{\partial L}{\partial \dot{x}}\right|_{t_2}h(t_2) - \left.\frac{\partial L}{\partial \dot{x}}\right|_{t_1}h(t_1)
$$

But as we discussed, $h$ must leave the endpoints fixed, which means $h(t_1) = h(t_2) = 0$.
We conclude that the value of this integral is 0.

We are left with

$$
\frac{I(x+eh) - I(x)}{e} = \int dt\, h \left[
\frac{\partial L}{\partial x}
- \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) 
\right] 
+O(e)
$$

Taking the limit $e \to 0$ eliminates the $O(e)$ term. To find the optimum, we set the derivative to 0. 

$$
\int dt\, h \left[
\frac{\partial L}{\partial x}
- \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) 
\right] = 0
$$

But since $h$ is a (nearly) arbitrary function, the only way for this to hold for every $h$ is if the integrand is identically 0. The result is the Euler-Lagrange equation:

$$
\frac{\partial L}{\partial x}
- \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right)  = 0
$$
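Before tackling the surface of revolution, it may help to test the equation on the simplest case I can think of. Take $L = \dot{x}^2$ on $[0, 1]$ with $x(0) = 0$ and $x(1) = 1$ (an example of my own, not part of the derivation above). Then $\partial L/\partial x = 0$ and the equation reduces to $\ddot{x} = 0$, so the candidate minimizer is the straight line $x(t) = t$. The sketch below discretizes the integral and compares the line against a few perturbations $x + eh$ with $h(0) = h(1) = 0$.

```python
import math

N = 1000
dt = 1.0 / N

def action(x):
    """Approximate I(x) = integral over [0, 1] of x'(t)^2 dt using finite differences."""
    total = 0.0
    for i in range(N):
        slope = (x((i + 1) * dt) - x(i * dt)) / dt
        total += slope**2 * dt
    return total

line = lambda t: t                         # solves x'' = 0 with x(0) = 0, x(1) = 1

def perturbed(e, k):
    # h(t) = sin(k * pi * t) vanishes at both endpoints, so the endpoints stay fixed
    return lambda t: t + e * math.sin(k * math.pi * t)

base = action(line)
worse = [action(perturbed(e, k)) for e in (-0.1, 0.05, 0.2) for k in (1, 2, 3)]
print(base, min(worse))
```

Every perturbed curve gives a larger value of $I$ than the straight line, as expected for a minimizer. This is a spot check on a handful of perturbations, not a proof.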

### Application: Surface of Revolution

Let's return to the problem we encountered before. We want to find the surface of revolution of minimal area between two points:

$$
A = 2\pi\int\limits_{t_1}^{t_2} t\sqrt{1+\dot{x}(t)^2}\,dt
$$

We can now solve this problem by employing the Euler-Lagrange equations.

In this case $L = t\sqrt{1+\dot{x}^2}$ (the constant factor $2\pi$ does not affect the minimizer).

We have $\frac{\partial L}{\partial x} = 0$ and
$$
\frac{\partial L}{\partial \dot{x}} = \frac{t\dot{x}}{\sqrt{1+\dot{x}^2}}
$$

The Euler-Lagrange equation in this case is 
$$
\frac{d}{dt}\frac{t\dot{x}}{\sqrt{1+\dot{x}^2}} = 0 \implies \frac{t\dot{x}}{\sqrt{1+\dot{x}^2}} = \textrm{constant} \equiv a
$$

Rearranging we obtain

$$
\dot{x} = \frac{a}{\sqrt{t^2-a^2}} \implies\\
x = \int \frac{a\,dt}{\sqrt{t^2-a^2}} \implies\\
x = a\,\textrm{arccosh}(t/a) + b
$$

This type of curve is called a *catenary* and the resulting surface of revolution a *catenoid*.

<img src='variations/catenoid.png'/>
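The solution above can be checked numerically: along $x(t) = a\,\textrm{arccosh}(t/a) + b$, the quantity $t\dot{x}/\sqrt{1+\dot{x}^2}$ should stay equal to $a$. The constants $a = 1$, $b = 0$ and the sample points below are arbitrary choices for illustration, and $\dot{x}$ is approximated by a finite difference.

```python
import math

a, b = 1.0, 0.0        # hypothetical constants of integration

def x(t):
    """Candidate solution x(t) = a * arccosh(t / a) + b, valid for t > a."""
    return a * math.acosh(t / a) + b

def x_dot(t, step=1e-6):
    """Central finite-difference approximation of x'(t)."""
    return (x(t + step) - x(t - step)) / (2 * step)

def conserved(t):
    """The quantity t * x'(t) / sqrt(1 + x'(t)^2), which should equal a."""
    v = x_dot(t)
    return t * v / math.sqrt(1 + v**2)

values = [conserved(t) for t in (1.5, 2.0, 3.0, 5.0)]
print(values)
```

All four values should come out equal to `a` up to finite-difference error, which is what the Euler-Lagrange equation requires.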