In [None]:
from dialoghelper import add_msg
import re
from fastcore.foundation import Path
def md_to_notes(path):
    "Read markdown file and create a note for each header section"
    txt = Path(path).read_text()
    parts = re.split(r'^(#{1,4}\s+.+)$', txt, flags=re.MULTILINE)
    if parts[0].strip(): add_msg(content=parts[0].strip())
    for i in range(1, len(parts), 2):
        content = parts[i] + (parts[i+1] if i+1 < len(parts) else '')
        if content.strip(): add_msg(content=content.strip())

In [None]:
md_to_notes('./md/ch05.md')

## Chapter 5

## Math Preliminaries

In this chapter, we review some of the basic mathematical results needed in subsequent chapters. With an eye towards establishing the machinery of Taylor series, we present definitions and results for the gradient, hessian, and Jacobian of multivariate functions. Next we look at the growth rate categorizations established via big  $O$  and little  $o$  notation. This culminates in establishing several formulations of Taylor's theorem.

The following section deals with convex functions, of primary importance in our work in optimization. By using various results from Taylor's theorem, we prove three equivalencies depending on continuity assumptions.

### 5.1 Multivariate Calculus

In what follows, we will classify functions by their continuity properties. We say  $f \in C^N$  if  $f$  is  $N$ -times differentiable with a continuous  $N$ th derivative. In this vein, continuous functions will be denoted by the class  $C^0$ . Notice that if  $f \in C^N$ , then  $f \in C^{N-k}$  for  $k = 0, \dots, N$ .

#### 5.1.1 The Gradient and Hessian

The gradient is the direct analogue of the first derivative in univariate calculus. For a function

$$f: \mathbb{R}^N \to \mathbb{R},$$

the *gradient* of  $f$  is denoted by  $\nabla f$  and defined by

$$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_N} \end{pmatrix}. \quad (5.1)$$

where

$$\left. \frac{\partial f}{\partial x_i} \right|_x = \lim_{h \to 0} \frac{f(x + he_i) - f(x)}{h} \quad (5.2)$$

and  $e_i = (0, \dots, 0, 1, 0, \dots, 0) \in \mathbb{R}^N$  is the vector with a 1 in the  $i$ th component and zeros everywhere else.

As a result, the gradient is a function

$$\nabla f: \mathbb{R}^N \to \mathbb{R}^N.$$

We define a *stationary point* of  $f$  as any point  $x^*$  such that  $\nabla f(x^*) = 0$ , and will often use notation  $\nabla f^* = \nabla f(x^*)$  when the context is clear.
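The limit in (5.2) can be approximated numerically, which is a handy sanity check on a hand-derived gradient. The sketch below is illustrative only: the helper `numerical_gradient`, the test function `example_f`, and the step size `h` are assumptions introduced here, and it uses a symmetric (central) difference quotient rather than the one-sided quotient written in (5.2).

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one coordinate direction e_i at a time."""
    grad = []
    for i in range(len(x)):
        forward, backward = list(x), list(x)
        forward[i] += h    # x + h * e_i
        backward[i] -= h   # x - h * e_i
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def example_f(x):
    # Hypothetical test function: f(x1, x2) = x1^2 + 3*x1*x2 + 2*x2^2
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2


# Analytic gradient is (2*x1 + 3*x2, 3*x1 + 4*x2), so (8, 11) at (1, 2).
grad_at_point = numerical_gradient(example_f, [1.0, 2.0])

# The origin is a stationary point: both partial derivatives vanish there.
grad_at_origin = numerical_gradient(example_f, [0.0, 0.0])
```

For this quadratic test function the central difference is exact up to rounding, so `grad_at_point` should agree with the analytic gradient closely.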

Two useful properties of the gradient are that it points in the direction of greatest increase of $f$ (and, as a corollary, $-\nabla f$ points in the direction of greatest decrease), and that $\nabla f(x)$ is orthogonal to the level curve of $f$ at $x$.

We prove the first statement below.

*Proof.* To show that the gradient points in the direction of greatest increase, we must first define the *directional derivative*. For a unit vector  $u$ , the directional derivative at a point  $x$  is given by

$$D_u f(x) = (\nabla f(x), u). \quad (5.3)$$

The directional derivative measures the rate of change in the direction $u$. Notice that, by the geometric form of the inner product,

$$D_u f(x) = \cos\theta ||\nabla f(x)|| \cdot ||u||$$

where $\theta$ is the angle between $\nabla f(x)$ and $u$. Since $||u|| = 1$, maximizing $D_u f(x)$ requires $\cos\theta = 1$; that is, the angle between $\nabla f(x)$ and $u$ must be 0. Hence the direction of greatest increase is the direction of the gradient.

Similarly, to find the direction which decreases $f$ the most, we must choose $\theta$ so that $\cos\theta = -1$, giving the direction of greatest decrease as $-\nabla f(x)$. $\square$

We define a *descent direction* for the function  $f$  at a point  $x$  as any direction  $u$  satisfying  $D_u f(x) < 0$ . That is, a descent direction satisfies

$$u' \nabla f(x) < 0. \quad (5.4)$$

From the above, it is clear that $-\nabla f(x)$ is a descent direction whenever $\nabla f(x) \neq 0$. Further, for any $H > 0$, $-H\nabla f(x)$ is a descent direction as well.
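The claims about steepest ascent and descent directions can be checked numerically. The following is a minimal sketch under stated assumptions: the function $f(x_1, x_2) = x_1^2 + 2x_2^2$, the point $(1, 1)$, and the matrix `H` are arbitrary choices made here, and $H > 0$ is read as "symmetric positive definite matrix".

```python
import math

# Hypothetical example: f(x1, x2) = x1^2 + 2*x2^2, evaluated at x = (1, 1).
grad = [2.0, 4.0]                       # analytic gradient (2*x1, 4*x2) at (1, 1)
grad_norm = math.hypot(grad[0], grad[1])


def directional_derivative(u):
    """Inner product of the gradient with u, as in (5.3)."""
    return grad[0] * u[0] + grad[1] * u[1]


# Scan unit vectors u = (cos a, sin a) in one-degree steps.
unit_vectors = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
                for a in range(360)]
best_u = max(unit_vectors, key=directional_derivative)
best_value = directional_derivative(best_u)

# Steepest descent direction, normalised to unit length.
neg_grad_unit = (-grad[0] / grad_norm, -grad[1] / grad_norm)
steepest_descent_value = directional_derivative(neg_grad_unit)

# Assumption: H is a symmetric positive definite matrix (chosen arbitrarily).
H = [[2.0, 1.0], [1.0, 3.0]]
scaled_direction = [-(H[0][0] * grad[0] + H[0][1] * grad[1]),
                    -(H[1][0] * grad[0] + H[1][1] * grad[1])]
# Not a unit vector, but the sign test in (5.4) does not depend on length.
scaled_value = directional_derivative(scaled_direction)
```

In this sketch `best_value` comes out just under `grad_norm`, `steepest_descent_value` equals `-grad_norm` up to rounding, and `scaled_value` is negative, in line with (5.4).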

In the sequel, we will be concerned with finding minima of functions. Based on the preceding result, the astute reader may see an immediate application of the gradient.

The *hessian* of  $f$  is the direct analogue of the second derivative in univariate calculus, and is denoted by  $\nabla^2 f$ . It is defined by the matrix of mixed partials

$$\nabla^2 f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_2 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_N \partial x_1} \\ & \ddots & & \vdots \\ & & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_N} & \cdots & \cdots & \frac{\partial^2 f}{\partial x_N^2} \end{pmatrix}. \quad (5.5)$$

If $f \in C^2$, then $\nabla^2 f$ is symmetric. Notice that $\nabla^2 f$ is a function

$$\nabla^2f:\mathbb{R}^N\rightarrow\mathbb{R}^{N\times N}.$$

For univariate functions, the second derivative gives an indication of whether the function is ‘concave up’ or ‘concave down,’ according to whether the second derivative is positive or negative, respectively. While we cite these terms out of familiarity, the preferred technical terms will be shown to be ‘convex’ and ‘concave.’ In any event, the hessian provides similar color on the shape of the function  $f$ , with a positive definite hessian indicating that the function is locally convex. We will formally define these terms and derive this result in a subsequent section.
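The matrix in (5.5) can also be approximated by finite differences, which makes the symmetry of the mixed partials easy to observe. This is a sketch only: `numerical_hessian`, the test function `example_f`, and the step size `h` are assumptions introduced for illustration.

```python
import math


def numerical_hessian(f, x, h=1e-4):
    """Approximate the matrix of second partials of f at x by central differences."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Evaluate f at x + si*h*e_i + sj*h*e_j.
                point = list(x)
                point[i] += si * h
                point[j] += sj * h
                return f(point)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return hess


def example_f(x):
    # Hypothetical test function: f(x1, x2) = exp(x1) + x1 * x2^2
    return math.exp(x[0]) + x[0] * x[1] ** 2


# Analytic hessian is [[exp(x1), 2*x2], [2*x2, 2*x1]], i.e. [[1, 2], [2, 0]] at (0, 1).
hessian_at_point = numerical_hessian(example_f, [0.0, 1.0])
```

At this particular point the hessian has negative determinant, so it is symmetric but not positive definite.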

#### 5.1.2 The Jacobian

For a function,  $F$ ,

$$F:\mathbb{R}^N\rightarrow\mathbb{R}^M$$

given by

$$F(x)=\begin{pmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{pmatrix}$$

with $f_i(x):\mathbb{R}^N\rightarrow\mathbb{R}$ for each $i=1,\dots,M$, the *Jacobian* of $F$ is denoted $\nabla F$, and defined by

$$\nabla F=\begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \cdots & \frac{\partial f_M}{\partial x_N} \end{pmatrix}. \quad (5.6)$$

The Jacobian of  $F$  may be written in terms of the gradients of  $f_i$ :

$$\nabla F=\begin{pmatrix} - & \nabla f'_1 & - \\ & \vdots & \\ - & \nabla f'_M & - \end{pmatrix}. \quad (5.7)$$

Notice that the Jacobian is a function

$$\nabla F:\mathbb{R}^N\rightarrow\mathbb{R}^{M\times N},$$

and that for  $f:\mathbb{R}^N\rightarrow\mathbb{R}$

$$\nabla(\nabla f)=\nabla^2f, \quad (5.8)$$

or ‘the Jacobian of the gradient is the hessian.’

The preceding observation highlights the intersection of notation. The apparent misuse is perhaps excusable when one considers that both the gradient and the Jacobian are the best linear approximations of their respective functions. For example, the Jacobian of

$$F(x)=Ax$$

for a constant matrix  $A \in \mathbb{R}^{M \times N}$  is  $\nabla F = A$ . We derive this formally for familiarity with the concepts involved in the next proof.

*Proof.* For matrix

$$A = \begin{pmatrix} a_{11} & \dots & a_{1N} \\ \vdots & & \vdots \\ a_{M1} & \dots & a_{MN} \end{pmatrix},$$

we have

$$Ax = \begin{pmatrix} \sum_{i=1}^{N} a_{1i}x_i \\ \vdots \\ \sum_{i=1}^{N} a_{Mi}x_i \end{pmatrix}.$$

Define  $A_j(x) = \sum_{i=1}^{N} a_{ji}x_i$  for  $j = 1, \dots, M$ . The gradient of each  $A_j$  is given by

$$\nabla A_j = \begin{pmatrix} a_{j1} \\ \vdots \\ a_{jN} \end{pmatrix},$$

so that the Jacobian of  $F$  is

$$\nabla F = \begin{pmatrix} - & \nabla A'_1 & - \\ & \vdots & \\ - & \nabla A'_M & - \end{pmatrix}.$$

Observing, finally, that  $\nabla A_j$  is exactly the transpose of the  $j$ th row of  $A$ ,  $\nabla F$  is exactly  $A$ .

□
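The result $\nabla F = A$ is easy to confirm numerically. The sketch below builds the Jacobian column by column from difference quotients; the helper `numerical_jacobian`, the matrix `A`, and the evaluation point are all illustrative assumptions, not part of the text.

```python
def numerical_jacobian(F, x, h=1e-6):
    """Central-difference Jacobian: entry [j][i] approximates d f_j / d x_i."""
    m = len(F(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for i in range(len(x)):
        forward, backward = list(x), list(x)
        forward[i] += h
        backward[i] -= h
        f_plus, f_minus = F(forward), F(backward)
        for j in range(m):
            jac[j][i] = (f_plus[j] - f_minus[j]) / (2 * h)
    return jac


# Hypothetical constant matrix with M = 2 rows and N = 3 columns.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]


def linear_map(x):
    # F(x) = A x, computed row by row.
    return [sum(a_ji * x_i for a_ji, x_i in zip(row, x)) for row in A]


J = numerical_jacobian(linear_map, [0.5, -1.0, 2.0])
```

Because $F$ is linear, `J` should reproduce `A` up to rounding regardless of the evaluation point.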

The product rule of univariate calculus has a direct analogue as well. Let  $F$  and  $G$  both be functions from  $\mathbb{R}^N \to \mathbb{R}^M$ ,

$$F(x) = \begin{pmatrix} f_1(x) \\ \vdots \\ f_M(x) \end{pmatrix} \quad G(x) = \begin{pmatrix} g_1(x) \\ \vdots \\ g_M(x) \end{pmatrix}.$$

Then we have

$$\nabla(F'G) = \nabla F'G + \nabla G'F. \quad (5.9)$$

Note that on the left hand side we see the gradient operator and on the right the Jacobian.

*Proof.* We have that

$$\begin{aligned}\nabla(F'G) &= \nabla\left(\sum_{i=1}^M f_i g_i\right) \\ &= \left(\begin{array}{c}\sum_{i=1}^M \frac{\partial f_i}{\partial x_1} g_i \\ \vdots \\ \sum_{i=1}^M \frac{\partial f_i}{\partial x_N} g_i\end{array}\right) + \left(\begin{array}{c}\sum_{i=1}^M \frac{\partial g_i}{\partial x_1} f_i \\ \vdots \\ \sum_{i=1}^M \frac{\partial g_i}{\partial x_N} f_i\end{array}\right).\end{aligned}$$

Observing from (5.6) that

$$\nabla F' = \left(\begin{array}{cccc}\frac{\partial f_1}{\partial x_1} & \cdots & \cdots & \frac{\partial f_M}{\partial x_1} \\ \vdots & & & \vdots \\ \frac{\partial f_1}{\partial x_N} & \cdots & \cdots & \frac{\partial f_M}{\partial x_N}\end{array}\right).$$

we see that

$$\nabla F'G = \left(\begin{array}{c}\sum_{i=1}^M \frac{\partial f_i}{\partial x_1} g_i \\ \vdots \\ \sum_{i=1}^M \frac{\partial f_i}{\partial x_N} g_i\end{array}\right),$$

and similarly  $\nabla G'F$  is equal to the second summand above, proving the result.  $\square$

We say that  $q: \mathbb{R}^N \to \mathbb{R}$  is quadratic if  $q$  may be written as

$$q(x) = x'Gx + g'x + c \tag{5.10}$$

for constants  $G$ ,  $g$ , and  $c$ . The gradient of  $q$  requires us to know  $\nabla(x'Gx)$ . This follows immediately from the above results, since

$$\begin{aligned}\nabla(x'Gx) &= \nabla(x)'Gx + \nabla(Gx)'x \\ &= Gx + G'x \\ &= (G + G')x.\end{aligned}$$

This gives that

$$\nabla q = (G + G')x + g. \tag{5.11}$$

In the case that  $G$  is symmetric, this reduces to the familiar  $\nabla q = 2Gx + g$ . The hessian of  $q$  is the Jacobian of (5.11), which, since  $\nabla q$  is linear, is

$$\nabla^2 q = G + G'. \tag{5.12}$$
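As a quick check of (5.11) and (5.12), consider an illustrative, non-symmetric choice of constants (these values are our own, not from the text):

$$G = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix}, \quad g = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad q(x) = x_1^2 + 2x_1x_2 + 3x_2^2 + x_1 - x_2 + c.$$

Differentiating $q$ term by term and comparing with $(G + G')x + g$ gives the same vector,

$$\nabla q = \begin{pmatrix} 2x_1 + 2x_2 + 1 \\ 2x_1 + 6x_2 - 1 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} 1 \\ -1 \end{pmatrix},$$

and the constant matrix on the right is $G + G' = \nabla^2 q$.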

#### 5.1.3 Big $O$ and Little $o$

Big $O$ and little $o$ notation are used to categorize real valued functions with respect to their growth rates: everywhere, at a single point, or in the limit as their independent variable tends to infinity. We consider real valued functions, $f$ and $g$,

$$f: \mathbb{R}^N \to \mathbb{R}$$

$$g: \mathbb{R}^N \to \mathbb{R}.$$

We say that $f$ is big $O$ of $g$ as $x$ approaches $a$ if there exist a $\delta > 0$ and an $M$ such that

$$||f(x)|| \le M||g(x)|| \quad (5.13)$$

for all  $x$  satisfying  $||x - a|| \le \delta$ .

If the limit  $\lim_{x \to a} f(x)$  exists, and

$$\lim_{x \to a} \frac{||f(x)||}{||g(x)||} = c < \infty$$

then  $f$  is big  $O$  of  $g$ . The notation is often abused, stating  $f = O(g)$  to indicate  $f$  is big  $O$  of  $g$ . Of course, since  $O(g)$  is a set, it may be more appropriate to write  $f \in O(g)$ , but the previous notation is ubiquitous.

**Example 5.1.1.** It is easy to show that

- $\pi = O(1)$  everywhere
- $10x^2 + 3x + 17 = O(x^2)$ as $x \to \infty$
- $\sin x = O(1)$  as  $x \to 0$
- $\sin x = O(x)$  as  $x \to 0$
- $\sin x \neq O(x^2)$  as  $x \to 0$
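The last two items can be made concrete by tabulating the ratios $|\sin x| / |x|$ and $|\sin x| / |x^2|$ as $x$ shrinks toward 0. This is a numerical illustration rather than a proof, and the sample points are an arbitrary choice.

```python
import math

# Sample points approaching a = 0 from the right: 0.1, 0.01, ..., 1e-6.
xs = [10.0 ** (-k) for k in range(1, 7)]

# |sin x| / |x| stays bounded (by M = 1), consistent with sin x = O(x).
ratio_to_x = [abs(math.sin(x)) / abs(x) for x in xs]

# |sin x| / |x^2| grows roughly like 1/x, so no single M can bound it near 0.
ratio_to_x_squared = [abs(math.sin(x)) / x ** 2 for x in xs]
```

The first list stays at or below 1, while the second grows by roughly a factor of ten at each step.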

Several useful properties obtain as well, including:

1. $O(f) \cdot O(g) = O(f \cdot g)$
2. $O(f) + O(g) = O(|f| + |g|)$
3. $f + O(g) = O(|f| + |g|)$
4. $O(c \cdot f) = O(f)$  for  $c \in \mathbb{R}$ ,  $c \neq 0$
5. if  $\lim_{x \to a} |g(x)| = C_g < \infty$ , then  $O(f \cdot g) = O(f)$

To give a flavor of how to work with the notation, we prove the first property, leaving several others to the reader in the exercises.

*Proof.* Let  $f_0$  be  $O(f)$ ,  $g_0$  be  $O(g)$ , and  $h_0$  be  $O(f \cdot g)$ . Then we know that there exist constants,  $M_f$ ,  $M_g$ , and  $M_h$  such that

$$\begin{aligned} ||f_0|| &\le M_f ||f|| \\ ||g_0|| &\le M_g ||g|| \\ ||h_0|| &\le M_h ||f \cdot g||. \end{aligned}$$

To prove that  $O(f) \cdot O(g) = O(f \cdot g)$ , we need to show that there exists an  $M_h^*$  such that  $||h_0|| \le M_h^* ||f|| \cdot ||g||$  and an  $M_{fg}^*$  such that  $||f_0|| \cdot ||g_0|| \le M_{fg}^* ||f \cdot g||$ . Each claim is straightforward. We have

$$\begin{aligned} ||h_0|| &\le M_h ||f \cdot g|| \\ &= M_h ||f|| \cdot ||g||, \end{aligned}$$

so we may set  $M_h^* = M_h$ . On the other hand,

$$\begin{aligned} ||f_0 \cdot g_0|| &= ||f_0|| \cdot ||g_0|| \\ &\le M_f ||f|| \cdot M_g ||g|| \\ &= M_f M_g ||f \cdot g||, \end{aligned}$$

and we may set  $M_{fg}^* = M_f M_g$ . □

We say that  $f$  is little  $o$  of  $g$  as  $x$  approaches  $a$  if

$$\lim_{x \to a} \frac{||f(x)||}{||g(x)||} = 0. \quad (5.14)$$

And, again, the notation will be stretched so that we may write  $f = o(g)$  to indicate  $f$  is little  $o$  of  $g$ . Clearly, little  $o$  notation is a stronger statement about the relative growth rates of  $f$  and  $g$  than big  $O$ . As such, if  $f = o(g)$ , then  $f = O(g)$ . We also have

1. $c \cdot o(f) = o(f)$  for  $c \in \mathbb{R}$
2. $o(f) \cdot O(g) = o(f \cdot g)$

These properties are left as exercises for the reader.

It is important to note that if  $f = o(g)$  at  $a$ , then based on the definition, for any  $\epsilon > 0$ , there exists a  $\delta$  such that

$$||f(x)|| \le \epsilon ||g(x)||$$

for all  $||x - a|| \le \delta$ .
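For instance (an illustrative case of our own), $f(x) = x^2$ is $o(x)$ as $x \to 0$, since $||x^2|| / ||x|| = ||x|| \to 0$. Here the bound above holds with the explicit choice $\delta = \epsilon$:

$$||x^2|| = ||x|| \cdot ||x|| \le \epsilon ||x|| \quad \text{for all } ||x - 0|| \le \epsilon.$$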

#### 5.1.4 Taylor Series

The tool we will use to study optimization of smooth functions is Taylor's Theorem. At times we will have to use various formulations, each a result of continuity assumptions of  $f: \mathbb{R}^N \to \mathbb{R}$ . Recall the first and second order Taylor approximations of univariate functions,

$$f(x + \delta) \approx f(x) + \delta f'(x) \quad (5.15)$$

$$f(x + \delta) \approx f(x) + \delta f'(x) + \frac{1}{2} \delta^2 f''(x). \quad (5.16)$$

In the multivariate case,  $f: \mathbb{R}^N \to \mathbb{R}$ , we have:

1. For  $f$  differentiable,

$$f(x) = f(x_0) + \nabla f(x_0)'(x - x_0) + o(||x - x_0||). \quad (5.17)$$

Notice that we may write the above as

$$f(x + \delta) = f(x) + \nabla f(x)' \delta + o(||\delta||).$$

An alternate formulation with this same continuity assumption is

$$f(x + \delta) = f(x) + \nabla f(x + t\delta)' \delta \quad (5.18)$$

for some  $t \in (0, 1)$ .

This second formulation in  $t$  will be extremely useful in many of the proofs that follow.

2. For  $f$  twice differentiable

$$f(x) = f(x_0) + \nabla f(x_0)'(x - x_0) + \frac{1}{2}(x - x_0)' \nabla^2 f(x_0)(x - x_0) + o(||x - x_0||^2). \quad (5.19)$$

And again, we may rewrite this in terms of a perturbation,  $\delta$ , as

$$f(x + \delta) = f(x) + \nabla f(x)' \delta + \frac{1}{2} \delta' \nabla^2 f(x) \delta + o(||\delta||^2),$$

and have a formulation with strict equality as

$$f(x + \delta) = f(x) + \nabla f(x)' \delta + \frac{1}{2} \delta' \nabla^2 f(x + t\delta) \delta \quad (5.20)$$

for some  $t \in (0, 1)$ .

3. For  $f$  twice differentiable with continuous second derivative; i.e.,  $f \in C^2$ ,

$$f(x) = f(x_0) + \nabla f(x_0)'(x - x_0) + O(||x - x_0||^2), \quad (5.21)$$

or, as before

$$f(x + \delta) = f(x) + \nabla f(x)' \delta + O(||\delta||^2).$$

We also have the integral formulation

$$\nabla f(x + \delta) = \nabla f(x) + \int_0^1 \nabla^2 f(x + t\delta) \delta dt \quad (5.22)$$

For each of the above formulations in  $x_0$ , we identify them as *Taylor expansions of  $f$  about  $x_0$* .

Finally, for  $F: \mathbb{R}^N \to \mathbb{R}^M$  and  $F$  differentiable, we also have, for  $\nabla F$  indicating the Jacobian,

$$F(x) = F(x_0) + \nabla F(x_0)(x - x_0) + o(||x - x_0||) \quad (5.23)$$

and

$$F(x+\delta)=F(x)+\nabla F(x+t\delta)\delta \quad (5.24)$$

for some  $t\in(0,1)$ .

This equation also gives an expansion of the gradient of  $f$  for  $f$  twice differentiable. Namely,

$$\nabla f(x)=\nabla f(x_0)+\nabla^2f(x_0)(x-x_0)+o(||x-x_0||) \quad (5.25)$$

since the Jacobian of the gradient is the hessian.
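The remainder terms in (5.17) and (5.19) can be observed numerically: the first-order error divided by $||\delta||$ and the second-order error divided by $||\delta||^2$ should both shrink as $\delta \to 0$. The sketch below assumes a test function, expansion point, and direction chosen purely for illustration, with the gradient and hessian entered by hand.

```python
import math


def example_f(x1, x2):
    # Hypothetical test function: f(x1, x2) = exp(x1) + x1 * x2^2
    return math.exp(x1) + x1 * x2 ** 2


x = (0.0, 1.0)
grad = (2.0, 0.0)                  # analytic gradient (exp(x1) + x2^2, 2*x1*x2) at x
hess = ((1.0, 2.0), (2.0, 0.0))    # analytic hessian at x
direction = (1.0, 1.0)

first_order_ratios = []
second_order_ratios = []
for t in (1e-1, 1e-2, 1e-3):
    delta = (t * direction[0], t * direction[1])
    norm_delta = math.hypot(delta[0], delta[1])
    linear_term = grad[0] * delta[0] + grad[1] * delta[1]
    quadratic_term = 0.5 * sum(delta[i] * hess[i][j] * delta[j]
                               for i in range(2) for j in range(2))
    actual = example_f(x[0] + delta[0], x[1] + delta[1])
    first_order_error = actual - example_f(*x) - linear_term
    second_order_error = first_order_error - quadratic_term
    # Remainder of (5.17) relative to ||delta||, and of (5.19) relative to ||delta||^2.
    first_order_ratios.append(abs(first_order_error) / norm_delta)
    second_order_ratios.append(abs(second_order_error) / norm_delta ** 2)
```

Both lists decrease by about a factor of ten per step here, which is what the $o(||\delta||)$ and $o(||\delta||^2)$ remainders predict for this smooth function.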

### 5.2 Convex Functions

Convex functions occupy a particularly attractive position in optimization. As we shall see, many appealing properties such as uniqueness of global optima are readily obtained when the function under consideration is convex. In this section we establish several equivalent formulations of convexity, each with an accompanying continuity assumption. We begin with the definition for a *convex set*.

We say that a set,  $K$ , of points in  $\mathbb{R}^N$  is convex if for any  $x_0$  and  $x_1$  in  $K$ ,  $x_\theta$ , defined as

$$x_\theta=(1-\theta)x_0+\theta x_1, \quad (5.26)$$

lies in  $K$  as well, for  $\theta\in[0,1]$ . Put another way, a set is convex if every line segment between two points in  $K$  is completely contained in  $K$  as well. One may show that, more generally, if  $K$  is convex, then

$$\sum_{i=1}^N\theta_ix_i\in K \quad (5.27)$$

when each  $x_i\in K$  and  $\sum_i\theta_i=1$  with  $\theta_i>0$  for  $i=1,\dots,N$ . This exercise is left to the reader. We say in the preceding case that  $\sum_i\theta_ix_i$  is a *convex combination* of  $\{x_i\}_{i=1}^N$ .

A function $f:\mathbb{R}^N\rightarrow\mathbb{R}$ is convex if, for all $x_0$ and $x_1$ and all $\theta \in [0, 1]$,

$$f(x_\theta)\le(1-\theta)f(x_0)+\theta f(x_1). \quad (5.28)$$

In the work below, we will further reduce this notation to simply

$$f_\theta\le(1-\theta)f_0+\theta f_1.$$

The definition states that the graph of $f$ between $x_0$ and $x_1$ lies on or below the line segment connecting $(x_0, f(x_0))$ and $(x_1, f(x_1))$.
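Inequality (5.28) can be spot-checked on random pairs of points. The sketch below is a sampling test, so a `True` result is only evidence of convexity, while a single violation does prove non-convexity. The helper name, sampling box, tolerance, and both test functions are assumptions made for illustration.

```python
import random


def is_convex_on_samples(f, dim, trials=1000, seed=0, tol=1e-9):
    """Return False as soon as a sampled (x0, x1, theta) violates (5.28)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x0 = [rng.uniform(-5, 5) for _ in range(dim)]
        x1 = [rng.uniform(-5, 5) for _ in range(dim)]
        theta = rng.random()
        # x_theta as defined in (5.26).
        x_theta = [(1 - theta) * a + theta * b for a, b in zip(x0, x1)]
        if f(x_theta) > (1 - theta) * f(x0) + theta * f(x1) + tol:
            return False
    return True


def bowl(x):
    # Hypothetical convex function: x1^2 + 2*x2^2
    return x[0] ** 2 + 2 * x[1] ** 2


def dome(x):
    # Hypothetical non-convex (concave) function: -(x1^2 + x2^2)
    return -(x[0] ** 2 + x[1] ** 2)


bowl_passes = is_convex_on_samples(bowl, dim=2)
dome_passes = is_convex_on_samples(dome, dim=2)
```

The small tolerance `tol` guards against rounding error being mistaken for a violation when $\theta$ is close to 0 or 1.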

We next prove the following equivalencies:

1. For  $f$  differentiable,  $f$  is convex if and only if

$$\nabla f'_0(x_1-x_0)\le f_1-f_0. \quad (5.29)$$

Or, equivalently,

$$f_0 + \nabla f'_0(x_1 - x_0) \le f_1$$

so that $f$ always lies on or above its linear approximation.

2. For  $f \in C^2$ ,  $f$  is convex if and only if

$$\nabla^2 f \succeq 0. \tag{5.30}$$

In this case the function resembles (at least in low enough dimensions) an upward facing bowl.
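As a one-dimensional illustration (our own example, not from the text), take $f(x) = e^x$. Here $\nabla^2 f = e^x > 0$ everywhere, so the second-order condition holds and $f$ is convex. The first-order condition with $x_0 = 0$ then reads

$$e^{0}(x_1 - 0) \le e^{x_1} - e^{0}, \quad \text{that is,} \quad 1 + x_1 \le e^{x_1},$$

which is the familiar statement that the exponential lies on or above its tangent line at the origin.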

We begin by proving (5.29).

*Proof.* Suppose $f$ is differentiable and, for all $x_0$ and $x_1$,

$$\nabla f'_0(x_1 - x_0) \le f_1 - f_0.$$

Then, using the notation in (5.26) and applying this inequality with $x_\theta$ as the base point, we have that for any $\theta \in (0, 1)$, both

$$\nabla f'_\theta(x_1 - x_\theta) \le f_1 - f_\theta$$

and

$$\nabla f'_\theta(x_0 - x_\theta) \le f_0 - f_\theta.$$

Multiplying this first inequality in  $f_\theta$  by  $\theta$  and the second by  $1 - \theta$  gives

$$\begin{aligned} \theta \nabla f'_\theta(x_1 - x_\theta) & \le \theta(f_1 - f_\theta) \\ \nabla f'_\theta(\theta x_1 - \theta x_\theta) & \le \theta f_1 - \theta f_\theta \end{aligned}$$

and

$$\nabla f'_\theta((1 - \theta)x_0 - (1 - \theta)x_\theta) \le (1 - \theta)f_0 - (1 - \theta)f_\theta.$$

Adding these two resulting inequalities gives

$$\begin{aligned} \nabla f'_\theta(\theta x_1 + (1 - \theta)x_0 - (\theta + 1 - \theta)x_\theta) & \le \theta f_1 + (1 - \theta)f_0 - (\theta + 1 - \theta)f_\theta \\ \nabla f'_\theta(x_\theta - x_\theta) & \le \theta f_1 + (1 - \theta)f_0 - f_\theta \\ 0 & \le \theta f_1 + (1 - \theta)f_0 - f_\theta \end{aligned}$$

so that

$$f_\theta \le (1 - \theta)f_0 + \theta f_1$$

as desired.

To prove the converse, we assume that  $f$  is convex and establish (5.29). Since  $f$  is convex, we know that

$$f_\theta \le (1 - \theta)f_0 + \theta f_1,$$

and so

$$f_\theta - f_0 \le \theta(f_1 - f_0),$$

giving

$$\frac{f_\theta - f_0}{\theta} \le f_1 - f_0.$$

This final formulation gives an indication of how to introduce the gradient. Considering  $x_\theta$  as a point on the line segment connecting fixed  $x_0$  and  $x_1$ , we expand  $f$  in a Taylor Series as in (5.17) about  $x_0$  as

$$f(x_0 + \theta(x_1 - x_0)) = f(x_0) + \nabla f(x_0)'(\theta(x_1 - x_0)) + o(\theta||x_1 - x_0||)$$

or

$$f_\theta = f_0 + \nabla f_0'(\theta(x_1 - x_0)) + o(\theta).$$

Rearranging terms, we have

$$\frac{f_\theta - f_0}{\theta} = \nabla f_0'(x_1 - x_0) + \frac{o(\theta)}{\theta},$$

and taking the limit as  $\theta \downarrow 0$  (taking the limit from the right since  $\theta \in (0, 1)$ ), we see

$$\begin{aligned} \lim_{\theta \downarrow 0} \frac{f_\theta - f_0}{\theta} &= \nabla f_0'(x_1 - x_0) + \lim_{\theta \downarrow 0} \frac{o(\theta)}{\theta} \\ &= \nabla f_0'(x_1 - x_0). \end{aligned}$$

We are left to determine a bound for  $\lim_{\theta \downarrow 0} \frac{f_\theta - f_0}{\theta}$ . But it is clear from our previous observation that

$$\lim_{\theta \downarrow 0} \frac{f_\theta - f_0}{\theta} \le \lim_{\theta \downarrow 0} (f_1 - f_0) = f_1 - f_0,$$

so that

$$\nabla f_0'(x_1 - x_0) \le f_1 - f_0$$

completing the proof.  $\square$

Next, we prove (5.30).

*Proof.* Suppose  $f$  is twice differentiable and that

$$\nabla^2 f(x) \succeq 0$$

for all $x$. Expanding $f$ about $x_0$ as before and using (5.20), we have

$$f_1 = f_0 + \nabla f_0'(x_1 - x_0) + \frac{1}{2}(x_1 - x_0)'\nabla^2 f(x_0 + t(x_1 - x_0))(x_1 - x_0)$$

for some $t \in (0, 1)$. By the positive semidefiniteness of the hessian, the final (quadratic) summand is nonnegative, and hence

$$\begin{aligned} f_1 &\ge f_0 + \nabla f_0'(x_1 - x_0) \\ f_1 - f_0 &\ge \nabla f_0'(x_1 - x_0) \end{aligned}$$

giving that  $f$  is convex by (5.29).

We next assume that  $f$  is convex and twice differentiable. Let  $x_1 = x_0 + \alpha s$  for  $s \neq 0$  and  $\alpha$  a positive scalar, and expand the gradient of  $f$  about  $x_0$  as in (5.25),

$$\nabla f_1 = \nabla f_0 + \nabla^2 f_0(\alpha s) + o(\alpha). \quad (5.31)$$

From (5.29), we know that both

$$\begin{aligned} \nabla f'_0(x_1 - x_0) &\le f_1 - f_0 \\ \nabla f'_1(x_0 - x_1) &\le f_0 - f_1 \end{aligned}$$

giving

$$\nabla f'_0(x_1 - x_0) \le f_1 - f_0 \le \nabla f'_1(x_1 - x_0).$$

Replacing  $x_1$  by  $x_0 + \alpha s$ , we see that

$$\nabla f'_0(\alpha s) \le f_1 - f_0 \le \nabla f'_1(\alpha s). \quad (5.32)$$

Now, premultiplying (5.31) by $(\alpha s)'$, we see that

$$(\alpha s)'\nabla f_1 = (\alpha s)'\nabla f_0 + (\alpha s)'\nabla^2 f_0(\alpha s) + o(\alpha^2).$$

Combining this result with the preceding set of inequalities, we get

$$\begin{aligned} \nabla f'_0(\alpha s) &\le f_1 - f_0 \le (\alpha s)'\nabla f_0 + (\alpha s)'\nabla^2 f_0(\alpha s) + o(\alpha^2) \\ 0 &\le f_1 - f_0 - \nabla f'_0(\alpha s) \le \alpha^2 s'\nabla^2 f_0 s + o(\alpha^2). \end{aligned}$$

Dividing through by  $\alpha^2$  and taking the limit as  $\alpha \downarrow 0$ , we get

$$\begin{aligned} 0 &\le s'\nabla^2 f_0 s + \lim_{\alpha \downarrow 0} \frac{o(\alpha^2)}{\alpha^2} \\ 0 &\le s'\nabla^2 f_0 s, \end{aligned}$$

proving the result since  $x_0$  and  $s$  were arbitrarily chosen.  $\square$

### Exercises

1. Prove  $O(f) + O(g) = O(|f| + |g|)$ .
2. Prove that if  $f$  is  $o(g)$ , then  $f$  is  $O(g)$ .
3. Prove  $o(f) \cdot O(g) = o(f \cdot g)$ .
4. Ledoit and Wolf [19] consider various biased estimators of the mean and covariance. In the following problems we look at some of their preliminary results.
(a) Let $X \in \mathbb{R}^N$ be a multivariate random variable with mean $\mu$ and covariance $\Sigma$. Let $\hat{\mu}$ be the unbiased estimator of the mean,

$$\hat{\mu} = \frac{1}{T} \sum_{t=1}^{T} \mu_t$$

and  $f \in \mathbb{R}^N$  a constant. A shrinkage estimator of the mean is given by

$$(1 - \alpha)\hat{\mu} + \alpha f$$

for some  $\alpha$ . To determine  $\alpha$ , Ledoit and Wolf consider the expected loss function

$$R(\alpha) = \mathbb{E} \left( ||(1 - \alpha)\hat{\mu} + \alpha f - \mu||^2 \right).$$

i. Show that  $R(\alpha)$  is minimized when

$$\alpha^* = \frac{\mathbb{E} \left( ||\hat{\mu} - \mu||^2 \right)}{\mathbb{E} \left( ||\hat{\mu} - \mu||^2 \right) + ||f - \mu||^2}$$

ii. Show that

$$\mathbb{E} \left( ||\hat{\mu} - \mu||^2 \right) = \frac{1}{T} \text{tr}(\Sigma).$$

Use the expectation of quadratic forms rule: for  $Y$  a random vector, and  $A$  a matrix,  $\mathbb{E}(Y'AY) = \text{tr}(A \cdot S) + m'Am$ , where  $\mathbb{E}(Y) = m$ ,  $\text{Cov}(Y) = S$ .

iii. Show that, as a result,

$$\alpha^* = \frac{(N/T)\bar{\sigma}^2}{(N/T)\bar{\sigma}^2 + ||f - \mu||^2}$$

where  $\bar{\sigma}^2 = \frac{1}{N} \text{tr}(\Sigma)$ .

(b) We have defined the condition number of a positive definite matrix,  $A \in \mathbb{R}^{N \times N}$ , as the ratio of the maximum and minimum eigenvalues of  $A$ . Consider

$$\Sigma_s = (1 - \alpha)\hat{\Sigma} + \alpha F.$$

Ledoit and Wolf [20] suggest another shrinkage method (i.e., a biased estimator of the covariance matrix,  $\Sigma$ ) as a convex combination of the sample covariance and a matrix  $F$  of the form

$$F = \sigma^2 I.$$

i. Show that the condition number of  $\Sigma_s$  is

$$k(\alpha) = \kappa(\Sigma_s) = \frac{(1-\alpha)\bar{\lambda} + \alpha\sigma^2}{(1-\alpha)\underline{\lambda} + \alpha\sigma^2}$$

where  $\bar{\lambda}$  and  $\underline{\lambda}$  are the maximum and minimum eigenvalues of  $\hat{\Sigma}$  respectively.

ii. Where is  $k(\alpha)$  increasing or decreasing on  $[0, 1]$ ? Where does it attain its maximum? Its minimum?

5. Find the gradient and Hessian function for  $f(x, y) = 100(y - x^2)^2 + (1 - x)^2$ . Show that for the local minimizer  $x^* = (1, 1)$ , the gradient vanishes and the Hessian is positive definite.

6. Prove that if  $\Sigma$  is a covariance matrix, the function

$$f(x) = x' \Sigma x$$

is convex.

7. Prove that the intersection of finitely many convex sets is convex.

8. Prove (5.27).