These are some notes on online convex optimization. They mostly follow <a href=http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf>this survey</a>, with a few excursions. If you want a really thorough intro, you should probably read that survey. This is going to be a little shorter, but will have more detail in a few areas.

So to start with let's set up what an online learning problem is. The goal is to find an algorithm $L$ that plays a particular "game" well. The game consists of $T$ rounds of the following:
<ol>
<li> The environment provides some value $x_t\in X$. For example, $x_t$ could be a picture of some object, or meteorological data.</li>
<li> The learning algorithm $L$ produces a value $\hat y_t\in Y$. $\hat y_t$ could be a guess about what the object is, or how much it will rain tomorrow.
<li> The environment reveals the true value $y_t\in Y$.
<li> The learning algorithm suffers a "loss" given by $\ell(x_t,y_t,\hat y_t)$. So in the picture example, $\ell(x_t,y_t,\hat y_t)$ is 1 if the guessed object was incorrect and zero otherwise, or in the rain example, $\ell(x_t,y_t,\hat y_t)$ is the absolute difference between the true amount of rain and the predicted amount of rain.
</ol>

$L$ is considered to play this game well if the value of $\sum_{t=1}^T \ell(x_t,y_t,\hat y_t)$ is small - the loss is intended to measure how unhappy we are with our predicted value.
We will make no assumptions about how the $x_t$s and $y_t$s are chosen by the environment. In particular, they could be chosen adversarially in order to make $L$ perform as badly as possible, such as might occur in the case of email spam filtering.

This problem is very general: we make no assumptions about what sort of sets $X$ and $Y$ are and no assumptions about what $\ell$ might look like. It turns out (perhaps unsurprisingly) that this makes the problem impossibly hard. To start with, consider the case in which $x_t$ is some description of a Turing machine (or a computer program), and $y_t$ is whether or not the machine will always halt (whether the program always terminates). There is <a href=https://en.wikipedia.org/wiki/Halting_problem>no algorithm</a> that can solve this problem. Thus it would be in a sense unfair to penalize $L$ for being unable to do so. So instead of measuring performance with $\sum_{t=1}^T \ell(x_t,y_t,\hat y_t)$, we'll consider a quantity called <i>regret</i>. Let $H$ be a set of functions that map $X$ to $Y$. Then the regret of $L$ with respect to $H$ is the amount by which we would rather have used a fixed element in $H$ to make our predictions:
$$
\text{Regret}_T(H) = \sum_{t=1}^T\ell(x_t,y_t,\hat y_t) - \inf_{h\in H}\sum_{t=1}^T \ell(x_t,y_t,h(x_t))
$$

This makes things a little fairer in the Halting problem case. We can have $H$ be the set of all algorithms that run in quadratic time for example, and then it may be possible to achieve low regret even if we can't necessarily achieve low overall loss. In general we will consider an algorithm a "success" if it can achieve <i>sublinear</i> regret. That is, 
$$
\frac{1}{T}\text{Regret}_T(H)\to 0
$$ 
This means that on average we are doing as well as the best element of $H$.
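To make the definition of regret concrete, here is a minimal sketch that computes it for a finite $H$ (so the infimum is just a minimum). The names <code>zero_one_loss</code> and <code>regret</code>, the toy parity data, and the three hypotheses are all made up for illustration.

```python
def zero_one_loss(x, y, y_hat):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise (x is unused)."""
    return 0.0 if y == y_hat else 1.0


def regret(xs, ys, predictions, hypotheses, loss=zero_one_loss):
    """Cumulative loss of the learner minus that of the best fixed h in H."""
    learner_loss = sum(loss(x, y, p) for x, y, p in zip(xs, ys, predictions))
    best_fixed = min(
        sum(loss(x, y, h(x)) for x, y in zip(xs, ys)) for h in hypotheses
    )
    return learner_loss - best_fixed


# Hypothetical toy data: x_t is an integer and y_t is its parity.
xs = [1, 2, 3, 4, 5, 6]
ys = [x % 2 for x in xs]
H = [lambda x: 0, lambda x: 1, lambda x: x % 2]  # finite H, so inf is a min
always_one = [1] * len(xs)

# The learner that always says 1 errs 3 times; the parity hypothesis never errs.
print(regret(xs, ys, always_one, H))
```

Dividing the returned value by the number of rounds gives the average regret that the sublinearity condition asks to vanish.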

Unfortunately, we're still not really safe. Consider the case in which $Y=\{0,1\}$, $\ell$ is the 0-1 loss (1 for a wrong guess and 0 for a correct one), and $H$ consists of the function that always outputs $0$ and the function that always outputs $1$. Then if the environment sets $y_t=1-\hat y_t$, every guess is wrong and we have
$$
\sum_{t=1}^T \ell(x_t,y_t,\hat y_t) = T
$$
However, if we always predict the majority value of $y_t$ (which is one of the two functions in $H$), we see that
$$
\inf_{h\in H}\sum_{t=1}^T \ell(x_t,y_t,h(x_t))\le \frac{T}{2}
$$
so that our overall regret grows linearly:
$$
\text{Regret}_T(H)\ge \frac{T}{2}
$$
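Here is a small simulation of this adversary, as a sketch. It assumes the learner is deterministic and is given the history of revealed labels; <code>majority_learner</code> is just one hypothetical learner to try.

```python
def play_against_adversary(learner, T):
    """Run T rounds where the environment always reveals y_t = 1 - y_hat_t."""
    history, learner_loss = [], 0
    for _ in range(T):
        y_hat = learner(history)  # deterministic guess in {0, 1}
        y = 1 - y_hat             # the adversary picks the opposite label
        learner_loss += 1         # 0-1 loss: every guess is wrong
        history.append(y)
    # H = {always 0, always 1}: the better one only errs on the minority label.
    ones = sum(history)
    best_fixed_loss = min(ones, T - ones)
    return learner_loss - best_fixed_loss


def majority_learner(history):
    """Hypothetical learner: predict the label seen most often so far."""
    return 1 if 2 * sum(history) > len(history) else 0


print(play_against_adversary(majority_learner, 100))
```

Whatever deterministic learner is plugged in, the returned regret is at least $T/2$, since the learner's loss is exactly $T$ and the minority label appears at most $T/2$ times.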

There are various ways to handicap the environment in order to make this problem go away. For example, we could allow the algorithm $L$ to use randomness that the environment cannot see.

Most of these methods for patching up the impossibility of the general problem can be encompassed in the framework of <i>online convex optimization</i>, which we'll describe now. For the rest of these notes we'll focus exclusively on this type of problem.

In online convex optimization, one attempts to find an algorithm that plays the following game:

<ol>
<li> $L$ predicts a vector $w_t\in S$, where $S$ is some given <i>convex</i> set.</li>
<li> The environment provides a <a href=https://en.wikipedia.org/wiki/Convex_function><i>convex</i></a> loss function $f_t:S\to \R$.</li>
<li> $L$ suffers loss $f_t(w_t)$
</ol>

Regret is now measured relative to a convex set $U$:
$$
\text{Regret}_T(U) = \sum_{t=1}^T f_t(w_t) -\inf_{u\in U} \sum_{t=1}^T f_t(u)
$$
We will almost always set $U=S$, but the notation is defined so that this is not necessarily the case. We'll also define the regret relative to some particular vector $u$:
$$
\text{Regret}_T(u) = \sum_{t=1}^T f_t(w_t) -\sum_{t=1}^T f_t(u)
$$
so that $\text{Regret}_T(U) = \sup_{u\in U} \text{Regret}_T(u)$.


<h3>Follow the Leader</h3>

So now that we've set up the problem, let's talk about some actual algorithms. The first of these is the Follow the Leader (FTL) algorithm. It is quite simple - at each round we predict using the vector that was the best over all the previous rounds:
$$
w_t = \argmin_{w\in S} \sum_{i=1}^{t-1}f_i(w)
$$

The primary result used to analyze FTL is the following:

<div class = "theorem">
<h4>Lemma 1</h4>

If $w_1,\dots,w_T$ is the sequence of vectors produced by FTL, then for any $u\in S$ we have:
$$
\text{Regret}_T(u) = \sum_{t=1}^T f_t(w_t)-f_t(u) \le \sum_{t=1}^T f_t(w_t)-f_t(w_{t+1})
$$
</div>

<i>proof</i>

First note that the statement is equivalent to
$$
\sum_{t=1}^T f_t(u) \ge \sum_{t=1}^T f_t(w_{t+1})
$$
And now we make a short inductive argument:

By definition of $w_2$ we have $f_1(u)\ge f_1(w_2)$ for all $u\in S$. Now suppose that
$$
\sum_{t=1}^K f_t(u) \ge \sum_{t=1}^K f_t(w_{t+1})
$$
In particular,
$$
\sum_{t=1}^K f_t(w_{K+2}) \ge \sum_{t=1}^K f_t(w_{t+1})
$$
and by definition of $w_{K+2}$ we have
$$
\sum_{t=1}^{K+1} f_t(u)\ge \sum_{t=1}^{K+1} f_t(w_{K+2})=f_{K+1}(w_{K+2})+\sum_{t=1}^K f_t(w_{K+2})\ge \sum_{t=1}^{K+1} f_t(w_{t+1})
$$
<div align="right">$\square$</div>

So using this lemma, we see that if $w_t$ are such that $f_t(w_t)-f_t(w_{t+1})$ goes to 0 as $t\to\infty$, then $\text{Regret}_T(S)$ will grow sublinearly. Intuitively, if the $f_t$ are reasonably smooth, this suggests that as long as the $w_t$ are <i>stable</i> - that is they tend not to move too fast, then the regret should be small. 

As an example, consider the case of online quadratic optimization. In this scenario, $S=\R^d$ and $f_t(w)=\frac{1}{2}\|w-z_t\|^2$ for some $z_t\in \R^d$ with $\|z_t\|\le M$ at each round $t$. In this case it is not hard to find an analytical expression for $w_{t+1}$ (just differentiate the sum of the losses) and so we get: 
$$
w_{t+1}=\frac{1}{t}\sum_{i=1}^t z_i = \frac{t-1}{t}w_t+\frac{z_t}{t}
$$
where here we've set $w_1=0$. Since we want to apply the lemma, let's compute:
$$
\begin{align*}
f_t(w_t)-f_t(w_{t+1}) &= \frac{1}{2}\|w_t-z_t\|^2-\frac{1}{2}\|w_{t+1} -z_t\|^2\\
&= \frac{1}{2}\|w_t-z_t\|^2 -\frac{1}{2}\left\|\frac{t-1}{t}w_t +\frac{1-t}{t}z_t\right\|^2\\
&= \frac{1}{2}\|w_t-z_t\|^2 - \frac{1}{2}\left(\frac{t-1}{t}\right)^2\|w_t-z_t\|^2\le\frac{1}{t}\|w_t-z_t\|^2
\end{align*}
$$
Now since $w_t$ is the average of $z_1,\dots,z_{t-1}$ and $\|z_i\|\le M$, $\|w_t-z_t\|^2\le 4M^2$. Thus applying the lemma we have
$$
\text{Regret}_T(\R^d) \le \sum_{t=1}^T\frac{1}{t}\|w_t-z_t\|^2\le 4M^2\sum_{t=1}^T\frac{1}{t}\le 4M^2(\log(T)+1)
$$
which is only logarithmic growth.
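As a sanity check on this bound, here is a sketch that runs FTL on one-dimensional quadratic losses and compares the regret to $4M^2(\log(T)+1)$. The random choice of the $z_t$ and the function name <code>ftl_quadratic_regret</code> are assumptions made for the example.

```python
import math
import random


def ftl_quadratic_regret(zs):
    """Run FTL on f_t(w) = 0.5 * (w - z_t)^2 in one dimension; return the regret."""
    w, total = 0.0, 0.0  # w_1 = 0
    for t, z in enumerate(zs, start=1):
        total += 0.5 * (w - z) ** 2  # suffer f_t(w_t)
        w = (t - 1) / t * w + z / t  # w_{t+1} is the running mean of z_1..z_t
    mean = sum(zs) / len(zs)         # best fixed point in hindsight
    best = sum(0.5 * (mean - z) ** 2 for z in zs)
    return total - best


random.seed(0)
M, T = 1.0, 1000
zs = [random.uniform(-M, M) for _ in range(T)]  # hypothetical data with |z_t| <= M
r = ftl_quadratic_regret(zs)
bound = 4 * M**2 * (math.log(T) + 1)
print(r, "<=", bound)
```

Note that the update only needs the previous $w_t$ and the round counter, so no past losses have to be stored in this special case.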

So that seems pretty good! However this turns out to just be a very easy problem; in general the FTL algorithm has a few drawbacks. An obvious first drawback is that it may take a while to compute each new $w_t$ since we have to solve a (non-online) convex optimization problem at each step. Further, the amount of memory required grows linearly with $T$ since we have to remember each $f_t$. In the case of quadratic losses above we were able to find a closed form for $w_t$ that removed these obstacles, but this probably won't happen in general. However there is an even worse problem with FTL - it doesn't actually do a good job on many problems. The reason is that the $w_t$ can be made <i>unstable</i>, so that $f_t(w_t)-f_t(w_{t+1})$ is very large. 

Here's an example that illustrates this instability. Let $S=[-1,1]$. Let $f_1(x)=-\frac{1}{2}x$, and $f_t(x)=(-1)^tx$ for $t>1$. Then the sequence of $w_1,w_2,\dots$ produced by FTL will be $w_1,1,-1,1,-1,1,\cdots$. This produces a total loss of $-\frac{1}{2}w_1+T-1$, while predicting $0$ at all times would achieve a loss of $0$. Thus the total regret is at least $-\frac{1}{2}w_1+T-1$, which grows linearly with $T$.
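The oscillation is easy to reproduce. The following sketch plays FTL on exactly these losses; since every loss is linear, the cumulative loss is a single slope times $x$ and its minimizer over $[-1,1]$ is an endpoint. The function name and the default $w_1=0$ are choices made for the example.

```python
def ftl_on_interval(T, w1=0.0):
    """FTL on S = [-1, 1] with f_1(x) = -x/2 and f_t(x) = (-1)^t * x for t > 1."""
    coeffs = [-0.5] + [(-1) ** t for t in range(2, T + 1)]  # f_t(x) = coeffs[t-1] * x
    w, slope, total, plays = w1, 0.0, 0.0, []
    for c in coeffs:
        plays.append(w)
        total += c * w                  # suffer the linear loss f_t(w_t)
        slope += c                      # cumulative loss so far is slope * x
        w = -1.0 if slope > 0 else 1.0  # minimizer of slope * x over [-1, 1]
    return plays, total


plays, total = ftl_on_interval(10)
print(plays)  # starts at w_1 and then alternates between 1 and -1
print(total)  # equals T - 1 when w_1 = 0
```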

Fortunately, it's possible to patch up FTL so that it works quite well. The idea is to enforce some form of stability on the $w_t$s and then use the lemma to prove regret bounds. It will turn out that the standard gradient descent algorithm will fall out of this analysis in a really cool way.


<h3>Follow the Regularized Leader</h3>

The way we will enforce stability on the $w_t$s is to make use of a <i>regularizer</i>. In symbols, we have a function $R:S\to \R$ and we set
$$
w_{t+1} = \argmin_{w\in S} R(w)+\sum_{i=1}^t f_i(w)
$$
Effectively, we are creating a fictitious $f_0(x)=R(x)$ and running the same FTL algorithm with this regularizer in front. This is called Follow the Regularized Leader (FTRL). A simple choice for $R$ is $R(w)=\frac{1}{2}\|w\|_2^2$. Intuitively, this is saying that we want the $w_t$ values to shrink towards zero a little bit. In the example of instability, the $w_t$s were oscillating. This shrinking is intended to put a damper on that oscillation.

Let's analyze FTRL. We can essentially immediately apply the lemma for FTL to get:
<div class="theorem">
<h4>Lemma 2</h4>
Let $w_1,w_2,\dots$ be the vectors produced by FTRL. Then for all $u\in S$,
$$
\text{Regret}_T(u)\le R(u)-R(w_1)+\sum_{t=1}^T (f_t(w_t)-f_t(w_{t+1}))
$$
</div>

<i>proof</i>

First note that FTRL is equivalent to running FTL on the sequence of functions $R=f_0,f_1,f_2,\dots$. Thus by Lemma 1, we have
$$
\begin{align*}
R(w_0)-R(u)+\sum_{t=1}^T f_t(w_t)-f_t(u)&=\sum_{t=0}^Tf_t(w_t)-f_t(u)\\
&\le \sum_{t=0}^T f_t(w_t)-f_t(w_{t+1})\\
&=R(w_0)-R(w_1)+\sum_{t=1}^T f_t(w_t)-f_t(w_{t+1})
\end{align*}
$$

Subtracting $R(w_0)-R(u)$ from both sides finishes the proof.
<div align = "right">$\square$</div>

Now let's specialize to the case $R(w)=\frac{1}{2\eta}\|w\|_2^2$ for some chosen parameter $\eta$. We'll first show that this allows us to do well on the linear losses that caused oscillation with the ordinary FTL algorithm. Then we'll make our first use of the assumption that the $f_t$ are all convex (notice that we haven't actually needed this yet!) to derive online gradient descent.

<div class="theorem">
<h4>Lemma 3</h4>
Suppose $f_t(w) = \langle z_t,w\rangle$ for some choices of $z_t\in \R^d=S$. Let $R(w)=\frac{1}{2\eta}\|w\|_2^2$ and let $w_1,w_2,\dots$ be the predictions of FTRL with regularizer $R(w)$. Then for all $u\in \R^d$ we have
$$
\text{Regret}_T(u) \le \frac{1}{2\eta}\|u\|_2^2+\eta\sum_{t=1}^T\|z_t\|_2^2
$$
</div>

An immediate consequence of this lemma is the following:

Let $U = \{u\ |\ \|u\|_2\le B\}$ and suppose $L^2\ge \frac{1}{T}\sum_{t=1}^T \|z_t\|_2^2$. Then by setting $\eta=\frac{B}{L\sqrt{2T}}$ we have
$$
\text{Regret}_T(U) \le BL\sqrt{2T}
$$
which is sublinear in $T$.
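To see where this choice of $\eta$ comes from, here is the short calculation, using nothing beyond the bound of Lemma 3 and the definitions of $B$ and $L$. For $\|u\|_2\le B$ the bound of Lemma 3 is at most
$$
g(\eta)=\frac{B^2}{2\eta}+\eta TL^2
$$
Setting $g'(\eta)=-\frac{B^2}{2\eta^2}+TL^2=0$ gives $\eta=\frac{B}{L\sqrt{2T}}$, and at this value the two terms are equal:
$$
\frac{B^2}{2\eta}=\eta TL^2=\frac{BL\sqrt{2T}}{2}
$$
so their sum is $BL\sqrt{2T}$.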

<i>proof</i>

First, since $R(w)+\sum_{i=1}^t f_i(w)=\frac{1}{2\eta}\|w\|_2^2+\langle \sum_{i=1}^t z_i,w\rangle$, we can see (take the derivative!) that $w_{t+1} = -\eta\sum_{i=1}^t z_i=w_t-\eta z_t$.

Using this equation we can see that
$$
f_t(w_t)-f_t(w_{t+1})=\langle z_t,w_t-w_{t+1}\rangle = \eta \|z_t\|_2^2
$$

Now let's finish up by using Lemma 2 (note that $w_1=0$):
$$
\text{Regret}_T(u)\le R(u)-R(w_1)+\sum_{t=1}^T (f_t(w_t)-f_t(w_{t+1}))=\frac{1}{2\eta}\|u\|_2^2+\eta \sum_{t=1}^T \|z_t\|_2^2
$$
<div align="right">$\square$</div>
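As a quick check of Lemma 3, the sketch below runs FTRL in one dimension on the same oscillating linear losses that broke FTL, now with $S=\R$, and compares the regret against every $u$ with $|u|\le B$ to $BL\sqrt{2T}$. The function names and the values of $T$ and $B$ are assumptions for the example.

```python
import math


def ftrl_linear(zs, eta):
    """FTRL with R(w) = w^2 / (2 * eta) on f_t(w) = z_t * w over S = R."""
    w, plays = 0.0, []  # w_1 = 0 minimizes R
    for z in zs:
        plays.append(w)
        w -= eta * z    # closed form: w_{t+1} = w_t - eta * z_t
    return plays


def regret_vs(u, zs, plays):
    """Regret against the fixed point u for linear losses f_t(w) = z_t * w."""
    return sum(z * (w - u) for z, w in zip(zs, plays))


T, B = 1000, 1.0
zs = [-0.5] + [(-1.0) ** t for t in range(2, T + 1)]  # oscillating linear losses
L = math.sqrt(sum(z * z for z in zs) / T)
eta = B / (L * math.sqrt(2 * T))
plays = ftrl_linear(zs, eta)
# The regret is linear in u, so its maximum over [-B, B] is at an endpoint.
worst = max(regret_vs(u, zs, plays) for u in (-B, B))
bound = B * L * math.sqrt(2 * T)
print(worst, "<=", bound)
```

With the regularizer in place the iterates stay within a small multiple of $\eta$ of zero instead of jumping between the endpoints.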

So far (in Lemmas 1 and 2) we've actually made no use of assuming that the $f_t$ are all convex. Now let's make our first use of this assumption to derive Online Subgradient Descent. First, what is a "subgradient"?

The subgradient is just what we use in place of a gradient when $f_t$ is non-differentiable. If $S\subset \R^d$ is an open convex set and $f:S\to \R$ a convex function, then for all $w\in S$ there exists $z$ such that for all $w+h\in S$,
$$
f(w+h)\ge f(w)+\langle h ,z\rangle
$$
Any $z$ satisfying this condition is a <i>subgradient</i> of $f$ at $w$. The notation for this is $z\in \partial f(w)$. The set $\partial f(w)$ is the set of all subgradients at $w$ and is called the <i>subdifferential</i>. The next block proves a couple things about convex functions, including the existence of subgradients. It's not really helpful for understanding anything else, so can be safely skipped.

<div style="margin-left: 3em; margin-right: 6em;" class="aside">
<p>
Let's prove that subgradients actually exist. Let $G=\{(s,b)\in S\times \R|b\ge f(s)\}$, where $S$ is an open convex set. Our strategy is to first show that $G$ (which is called the <a href=https://en.wikipedia.org/wiki/Epigraph_%28mathematics%29>epigraph</a>) is convex. Then applying the <a href=https://en.wikipedia.org/wiki/Supporting_hyperplane>supporting hyperplane theorem</a> to any boundary point $(w,f(w))$ of $G$ gives a subgradient at $w$, so long as the normal to the hyperplane has a nonzero final component. However, since $S$ is open, this is guaranteed. Let $p_1= (x_1,b_1),\ p_2=(x_2,b_2)\in G$. We will show that the line segment connecting them is contained in $G$. The first thing to note is that if $(a,b)\in G$, then so is $(a,b+k)$ for all $k\ge 0$. Now let $q_1=(x_1,f(x_1)),\ q_2=(x_2,f(x_2))\in G$. Then for all $t\in [0,1]$, $f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$. Therefore $tq_1+(1-t)q_2\in G$. Now $tf(x_1)+(1-t)f(x_2)\le tb_1+(1-t)b_2$ and so $tp_1+(1-t)p_2\in G$.
</p>
<br></br>
<p>
As an aside we'll also show that all convex functions $f:S\to \R$ with $S$ a convex open subset of $\R^n$ are continuous - I think this is a surprisingly difficult thing to prove! Suppose $x\in S$. Let $x+h\in S$. Then for $t\in[0,1]$,
$$
f(x+th)\le (1-t)f(x)+tf(x+h)\\
f(x+th)-f(x)\le t(f(x+h)-f(x))
$$
Let $\Sigma\subset S$ be an $n$-simplex centered around $x$ with diameter $D$. Such a $\Sigma$ must exist since $S$ is open. Suppose $f$ is bounded on $\Sigma$: $|f(z)|\le M$ for all $z\in \Sigma$. Then let $r$ be small enough that the ball of radius $r$ centered at $x$ is contained in $\Sigma$. Let $\epsilon>0$ and $\delta= \min\{r, r\epsilon/(2M)\}$. Then for any $\Delta$ with $\|\Delta\|\le \delta$, we can write $\Delta = \frac{d}{r}h$ for some $h$ with $\|h\|= r$ and $d=\|\Delta\|\le \delta$, and so
$$
f(x+\Delta)-f(x)\le \frac{d}{r}(f(x+h)-f(x))\le \frac{d}{r}2M\le \epsilon
$$
Similarly, since $x$ is a convex combination of $x+\Delta$ and $x-h$ with weights $\frac{r}{r+d}$ and $\frac{d}{r+d}$, convexity gives $f(x)-f(x+\Delta)\le \frac{d}{r}(f(x-h)-f(x))\le \epsilon$, and so convex functions are continuous. In fact this proof also shows that they are locally Lipschitz.
</p>
</div>
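As a quick numerical illustration of the subgradient inequality, here is a sketch that tests whether a candidate $z$ satisfies it for a one-dimensional $f$ on a finite grid of points. Passing the test on a grid is only a necessary condition, and the helper name <code>is_subgradient</code> and the grid are made up for the example.

```python
def is_subgradient(f, w, z, test_points, tol=1e-12):
    """Check f(u) >= f(w) + z * (u - w) at every test point u (1-D only)."""
    return all(f(u) >= f(w) + z * (u - w) - tol for u in test_points)


grid = [k / 10 for k in range(-30, 31)]  # test points in [-3, 3]

# f(w) = |w| is not differentiable at 0; any slope in [-1, 1] works there.
print(is_subgradient(abs, 0.0, 0.5, grid))
print(is_subgradient(abs, 0.0, 1.5, grid))
# Away from the kink the only subgradient is the ordinary derivative.
print(is_subgradient(abs, 2.0, 1.0, grid))
```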

Now let's rewrite the subgradient condition as: for all $u\in S$,
$$
f(u)\ge f(w)+\langle u-w,z\rangle\\
$$
This implies something really cool:

<div class ="theorem">
<h4>Linear Functions are the Worst Case</h4>
If $f_t$ is convex for all $t$ and $z_t$ is a subgradient of $f_t$ at $w_t$ ($z_t\in \partial f_t(w_t)$), we have
$$
f_t(w_t)-f_t(u)\le \langle w_t-u,z_t\rangle=\langle w_t,z_t\rangle-\langle u,z_t\rangle
$$
and, summing over $t$,
$$
\text{Regret}_T(u) \le \sum_{t=1}^T \langle w_t,z_t\rangle -\langle u,z_t\rangle
$$
</div>

This means that if we have an algorithm that achieves low regret with respect to the <i>linear</i> functions $f_t'(w)=\langle w,z_t\rangle$, then this algorithm also achieves low regret with respect to the actual convex functions $f_t$. This is an extremely powerful tool - it reduces online convex optimization to online linear optimization. By lemma 3 we already have an algorithm that does well on online linear optimization- FTRL! 

Specifically, we'll apply FTRL with $R(w) = \frac{1}{2\eta} \|w\|_2^2$ to the sequence of linear functions $\hat f_t(w)=\langle z_t,w\rangle$, where $z_t\in\partial f_t(w_t)$. This gives us the Online Subgradient Descent algorithm. Expanding out the FTRL updates on the subgradients, we get:

<div class = "algorithm">
<h4>Online (sub)Gradient Descent (OGD):</h4>
<ol>
<li> Initialize $x_1=0$, $w_1=\Pi_S(x_1)$, where $\Pi_S(x)= \argmin_{w\in S}\|w-x\|_2$ is the <i>projection</i> onto the convex set $S$.
<br>
For each $t=1,2,\dots$</li>
<li> Predict $w_t=\Pi_S(x_t)$.</li>
<li> Receive subgradient $z_t$.</li>
<li> Update $x_{t+1}=x_t-\eta z_t$.</li>
</ol>
</div>
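The steps above can be sketched in a few lines. This is a one-dimensional version under assumptions of my own choosing: $S=[-B,B]$, the losses are $f_t(w)=|w-y_t|$ (which are 1-Lipschitz), and each round hands the algorithm a subgradient oracle. None of the function names come from the survey.

```python
import math


def project_interval(x, B):
    """Euclidean projection of a scalar onto S = [-B, B]."""
    return max(-B, min(B, x))


def ogd_lazy(subgradient_oracles, eta, B):
    """OGD with lazy projections in 1-D; each oracle maps w_t to some z_t."""
    x, plays = 0.0, []
    for oracle in subgradient_oracles:
        w = project_interval(x, B)  # predict w_t = Pi_S(x_t)
        plays.append(w)
        z = oracle(w)               # receive a subgradient of f_t at w_t
        x -= eta * z                # unprojected update x_{t+1} = x_t - eta * z_t
    return plays


def abs_loss_subgradient(y):
    """Subgradient oracle for the hypothetical loss f_t(w) = |w - y|."""
    return lambda w: (w > y) - (w < y)  # sign(w - y), and 0 at the kink


# Hypothetical targets y_t, all inside S.
ys = [0.8, -0.3, 0.5, 0.9, -1.0, 0.4] * 50
T, B, L = len(ys), 1.0, 1.0
eta = B / (L * math.sqrt(2 * T))
plays = ogd_lazy([abs_loss_subgradient(y) for y in ys], eta, B)
learner_loss = sum(abs(w - y) for w, y in zip(plays, ys))
# A sum of |u - y_t| is minimized at one of the y_t (a median).
best_fixed = min(sum(abs(u - y) for y in ys) for u in set(ys))
print(learner_loss - best_fixed, "<=", B * L * math.sqrt(2 * T))
```

Note that the state $x_t$ is never projected; only the prediction $w_t$ is.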

Let's analyze how this should perform. Immediate application of Lemma 3 to the fact that regret on $f_t$ is bounded by the regret on $f_t'$ tells us:
$$
\text{Regret}_T(u)\le\frac{1}{2\eta}\|u\|_2^2+\eta \sum_{t=1}^T \|z_t\|_2^2
$$
Now to package this result in a nicer fashion, we can say:

<div class = "theorem">
<h4>OGD Regret</h4>

Suppose $f_t$ is $L_t$-Lipschitz and $L^2\ge \frac{1}{T}\sum_{t=1}^T L_t^2$. Suppose that $\|u\|_2\le B$ for all $u\in S$. Then $\|z_t\|_2^2\le L_t^2$ and so we have
$$
\text{Regret}_T(S)\le\frac{1}{2\eta}B^2+\eta \sum_{t=1}^T \|z_t\|_2^2
$$
Then setting $\eta = \frac{B}{L\sqrt{2T}}$ yields a regret of
$$
\text{Regret}_T(S)\le BL\sqrt{2T}
$$
</div>

So that's pretty cool, but in order to set $\eta$ we need to know three parameters- $B$, $L$, and $T$. We could just guess for $B$ and $L$ and still achieve sublinear regret, but this won't work for $T$. So what can we do? There are a couple ways around this - one is a cute trick, but I find the second method more satisfying. The "trick" method is to just first run $2$ steps of OGD assuming $T=2$, then start over and run 4 steps assuming $T=4$, then start over and assume $T=8$ and so on. This will obtain a regret of $O(\sqrt{T})$.
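Here is a sketch of that doubling trick in one dimension with $S=[-B,B]$. It restarts lazy-projection OGD at the start of each epoch and tunes $\eta$ for that epoch's length. The function name and the oracle interface (each round supplies a function from $w_t$ to a subgradient) are assumptions for the example.

```python
import math


def doubling_trick_plays(subgradient_oracles, B, L):
    """Restart lazy OGD on epochs of length 2, 4, 8, ..., each tuned for its own horizon."""
    plays, start, epoch_len = [], 0, 2
    n = len(subgradient_oracles)
    while start < n:
        eta = B / (L * math.sqrt(2 * epoch_len))  # step size for horizon T = epoch_len
        x = 0.0                                   # start over at the beginning of each epoch
        for oracle in subgradient_oracles[start:start + epoch_len]:
            w = max(-B, min(B, x))                # predict the projection of x onto [-B, B]
            plays.append(w)
            x -= eta * oracle(w)                  # lazy update within the epoch
        start += epoch_len
        epoch_len *= 2
    return plays


# Hypothetical run: 14 rounds with constant subgradient 1, i.e. epochs of 2, 4 and 8 rounds.
plays = doubling_trick_plays([lambda w: 1.0] * 14, B=1.0, L=1.0)
print(plays)
```

The regret against a fixed $u$ is at most the sum of the per-epoch regrets, and the per-epoch bounds $BL\sqrt{2\cdot 2^k}$ form a geometric series, which is where the $O(\sqrt{T})$ comes from.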

However, we can do a little better than this - instead of $w_{t+1}=w_t-\eta z_t$, we set $w_{t+1}=w_t-\eta_t z_t$ where $\eta_t =\frac{\alpha}{\sqrt{t}}$ for some $\alpha$. This has the advantage of being somehow "smoother" than the doubling method, and can achieve better regret. The analysis is somewhat different though.

So now we have a "time horizon agnostic" gradient descent algorithm:

<div class = "algorithm">
<h4>Infinite Horizon OGD:</h4>
<ol>
<li> Initialize $x_1$ to some vector in $\R^n$, $w_1=\Pi_S(x_1)$.
<br>
For each $t=1,2,\dots$</li>
<li> Predict $w_t=\Pi_S(x_t)$.</li>
<li> Receive subgradient $z_t$.</li>
<li> Update $x_{t+1}=w_t-\frac{\alpha}{\sqrt{t}} z_t$.</li>
</ol>
</div>
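A minimal one-dimensional sketch of this algorithm, again assuming $S=[-B,B]$ and a per-round subgradient oracle (both choices are mine, made to keep the example short):

```python
import math


def ogd_greedy(subgradient_oracles, alpha, B, x1=0.0):
    """OGD with greedy projections on S = [-B, B] and step size alpha / sqrt(t)."""
    x, plays = x1, []
    for t, oracle in enumerate(subgradient_oracles, start=1):
        w = max(-B, min(B, x))            # predict w_t = Pi_S(x_t)
        plays.append(w)
        z = oracle(w)                     # receive a subgradient z_t at w_t
        x = w - alpha / math.sqrt(t) * z  # step from the projected point w_t
    return plays


# Hypothetical losses f_t(w) = |w - y_t| with alternating targets.
ys = [1.0 if t % 2 else -1.0 for t in range(200)]
oracles = [(lambda w, y=y: (w > y) - (w < y)) for y in ys]
plays = ogd_greedy(oracles, alpha=math.sqrt(2), B=1.0)
print(sum(abs(w - y) for w, y in zip(plays, ys)))
```

The only differences from the earlier version are that the step size depends on $t$ and that the update starts from the projected point $w_t$ rather than from $x_t$, so the horizon $T$ never appears.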

Notice that in the first version of OGD we analyzed the algorithm used <i>lazy projections</i>, in which we kept track of a vector $x_t$ outside of $S$ and only projected to $S$ when absolutely necessary. In this scheme we are using <i>greedy projections</i>. It is possible to analyze gradient descent with lazy projections and an unbounded horizon, but we'll do this <a href=mirror-descent-and-ftrl.html>later</a>. For now we'll just deal with the greedy projections case.
The analysis of this scheme follows that of <a href =http://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf>Zinkevich</a>.

We already saw that linear functions are the hardest to deal with in online learning, so we'll just suppose that our functions $f_t$ are given by a sequence of subgradient functions $f_t'(x) = \langle x,z_t\rangle$. Suppose $w^*\in S$. To bound the regret with respect to $w^*$, we're interested in $\langle z_t,w_t-w^*\rangle$. To get a handle on this we'll first work with $x_{t+1}-w^*$:
$$
x_{t+1}-w^*=w_{t}-\frac{\alpha}{\sqrt{t}}z_t-w^*
$$
$$
(x_{t+1}-w^*)^2 = (w_{t}-w^*)^2 + \frac{\alpha^2}{t}\|z_t\|_2^2 - 2\frac{\alpha}{\sqrt{t}}z_t(w_{t}-w^*)
$$
Now since $w^*\in S$ and $w_{t+1}=\Pi_S x_{t+1}$ and $S$ is convex, we have $(w_{t+1}-w^*)^2\le (x_{t+1}-w^*)^2$. Therefore (after some rearrangements):
$$
(w_{t}-w^*)z_t \le \frac{\sqrt{t}}{2\alpha}\left((w_{t}-w^*)^2-(w_{t+1}-w^*)^2\right) +\frac{\alpha}{2\sqrt{t}}\|z_t\|_2^2
$$
And now we see the light - the first term is going to telescope!
$$
\text{Regret}_T(w^*) \le \sum_{t=1}^T \langle w_t-w^*,z_t\rangle \le\sum_{t=1}^T \frac{\sqrt{t}}{2\alpha}\left((w_{t}-w^*)^2-(w_{t+1}-w^*)^2\right) +\frac{\alpha}{2\sqrt{t}}\|z_t\|_2^2\\
\le\frac{1}{2\alpha}\left((w_1-w^*)^2+\sum_{t=2}^T(\sqrt{t}-\sqrt{t-1})(w_t-w^*)^2\right)+\frac{\alpha}{2}\sum_{t=1}^T\frac{1}{\sqrt{t}}\|z_t\|_2^2
$$
So let's assume that the diameter of $S$ is at most $D$, and that $L_{\max}^2\ge  \|z_t\|_2^2$ for all $t$. Then we have
$$
\text{Regret}_T(S)\le \frac{D^2}{2\alpha}\left(1+\sum_{t=2}^T\sqrt{t}-\sqrt{t-1}\right)+\frac{\alpha L_\max ^2}{2}\sum_{t=1}^T\frac{1}{\sqrt{t}}\\
\le \frac{D^2\sqrt{T}}{2\alpha}+\frac{\alpha L_{\max}^2}{2}(2\sqrt{T}-1)
$$
Dropping the "-1" and optimizing for $\alpha$ we get $\alpha = \frac{D}{L_{\max}\sqrt{2}}$ which yields a regret bound of:

<div class = "theorem">

<h4>Infinite Horizon OGD Regret</h4>

Suppose $z_1,z_2,\dots$ is the sequence of gradients given to the Infinite Horizon OGD algorithm and $S$ is a convex set with diameter at most $D$. If $\|z_t\|_2\le L_\max$ for all $t$, then
$$
\text{Regret}_T(S)\le \frac{D^2\sqrt{T}}{2\alpha}+\frac{\alpha L_{\max}^2}{2}(2\sqrt{T}-1)
$$
and with $\alpha = \frac{D}{L_{\max}\sqrt{2}}$ we have
$$
\text{Regret}_T(S) \le L_{\max}D\sqrt{2T}
$$
</div>

Notice that this bound looks very similar to the bound for the non-infinite horizon version of OGD. However, here we are using a bound on the maximum gradient $(L_{\max})$ rather than one on the average gradient (just $L$). 

If we assume that the feasible set is the ball $S=\{w\ |\ \|w\|\le B\}$, which has diameter $D=2B$, then by setting $x_1=0$ and $\alpha = \frac{B}{\sqrt{2}L_{\max}}$ we can actually improve this regret bound to:

$$
\text{Regret}_T(S) \le BL_{\max}\sqrt{2T} = \frac{L_{\max}D\sqrt{2T}}{2}
$$

So to recap, we've introduced the idea of online learning and online convex optimization along with the use of <i>regret</i> to measure the performance of algorithms. Under the online convex optimization framework, we used the Follow the Regularized Leader algorithm to achieve low regret on linear functions. Then we showed that by replacing convex functions with subgradients, an algorithm that does well on linear functions necessarily does well on all convex functions, leading to the Online Gradient Descent algorithm. Finally, we were able to improve Online Gradient Descent to a time-horizon agnostic version by using a decaying learning rate. Unlike the previous algorithms, this decaying-rate version of OGD was not a special case of Follow the Leader and had to be analyzed in a different way.