In [None]:
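from dialoghelper import add_msg
import re
from fastcore.foundation import Path

def md_to_notes(path):
    "Read markdown file and create a note for each header section"
    txt = Path(path).read_text()
    # The capture group keeps each header (levels 1-4) in the split result:
    # parts = [preamble, header1, body1, header2, body2, ...]
    parts = re.split(r'^(#{1,4}\s+.+)$', txt, flags=re.MULTILINE)
    # Any text before the first header becomes its own note
    if parts[0].strip(): add_msg(content=parts[0].strip())
    # Pair each header (odd index) with the body that follows it
    for i in range(1, len(parts), 2):
        content = parts[i] + (parts[i+1] if i+1 < len(parts) else '')
        if content.strip(): add_msg(content=content.strip())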

In [None]:
md_to_notes('./md/ch06.md')

## Chapter 6

## Unconstrained Optimization

Much of the work we have seen in regression may be framed as examples of unconstrained optimization. That is, the problems obtained solutions by minimizing some loss function; viz., the  $L^2$  norm, or sum of squares. In that setting, we both identified solutions such as the  $\hat{\beta}$  of an OLS regression as well as discussed such a solution's distributional properties. Here, we focus on the former task, even while revisiting examples in OLS and a related problem in index replication.

The work begins by outlining conditions for a point,  $x^*$ , to be a solution of a minimization problem,  $\min_x f(x)$ . We then identify useful necessary and sufficient conditions that  $x^*$  must obtain to be optimal when working with a smooth  $f$  [5]. We also establish some attractive properties that convex  $f$  have in the current context.

Finally, we discuss Newton's method, an algorithm useful in identifying stationary points. As we shall see, under some assumptions, we may prove convergence of the method to a stationary point when  $f \in C^2$ . The algorithm provides a template for future work in constrained optimization as well.

### 6.1 Preliminaries

For a function

$$f: \mathbb{R}^N \to \mathbb{R},$$

we say that  $x^*$  is a *global minimizer* if

$$f(x^*) \le f(x) \tag{6.1}$$

for all $x \in \mathbb{R}^N$. We say that $x^*$ is a *strict global minimizer* if the inequality above is strict for all $x \neq x^*$. Finally, $x^*$ is a *local minimizer* if there exists a neighborhood, $\mathcal{N}(x^*)$, about $x^*$ such that $f(x^*) \le f(x)$ for all $x \in \mathcal{N}(x^*)$. The strict qualifier may be applied here as well.

Our notation will often be abbreviated as

$$\begin{aligned}f^* &= f(x^*) \\ \nabla f^* &= \nabla f(x^*) \\ \nabla^2 f^* &= \nabla^2 f(x^*).\end{aligned}$$

In this section we study *unconstrained minimization problems*; i.e., problems of the form

$$\min_x f(x). \tag{6.2}$$

In this formulation,  $f$  is called the *objective function*, and the solution – should one exist – is a global minimizer of  $f$ .

#### 6.1.1 Necessary and Sufficient Conditions

We will prove various necessary and sufficient conditions that obtain at  $\nabla f^*$  and  $\nabla^2 f^*$ , referred to as first and second order conditions, respectively.

Recall that a *necessary condition* for  $x^*$  to be a minimizer is a condition that  $x^*$  must satisfy if it is to be a minimizer, while a *sufficient condition* for  $x^*$  to be a minimizer is a condition that implies  $x^*$  is a minimizer of  $f$ .

We proceed with proofs of a first order necessary condition, a second order necessary condition, and a second order sufficient condition.

**First Order Necessary Condition** Let $x^*$ be a local minimizer of $f$, with $f \in C^1$ near $x^*$. Then $\nabla f^* = 0$. That is, $x^*$ is a stationary point.

*Proof.* Assume by contradiction that  $\nabla f^* \neq 0$ . Then we may find a  $\delta$  satisfying

$$\delta' \nabla f^* < 0.$$

(For example, choose $\delta = -\nabla f^*$.) By continuity of $\nabla f$, there exists a $\tau > 0$ such that

$$\delta' \nabla f(x^* + \hat{t}\delta) < 0$$

for all $\hat{t} \in [0, \tau]$.

A Taylor expansion of  $f$  about  $x^*$  gives

$$f(x^* + \hat{t}\delta) = f^* + \hat{t}\delta' \nabla f(x^* + t \cdot \hat{t}\delta)$$

for some  $t \in (0, 1)$ . Now since  $t \cdot \hat{t}$  remains in  $[0, \tau]$ ,  $\delta' \nabla f(x^* + t \cdot \hat{t}\delta) < 0$ , and

$$f(x^* + \hat{t}\delta) < f^*,$$

contradicting the fact that  $x^*$  is a local minimizer of  $f$ . Hence  $\nabla f^* = 0$ .  $\square$

**Second Order Necessary Condition** Let $x^*$ be a local minimizer of $f$, with $f \in C^2$ near $x^*$. Then $\nabla^2 f^* \succeq 0$.

The proof proceeds just as before, utilizing the fact that we now know that  $x^*$  is a stationary point as well.

*Proof.* Assume by contradiction that  $\nabla^2 f^*$  is not positive semidefinite. Then we may find a  $\delta$  such that

$$\delta' \nabla^2 f^* \delta < 0.$$

Since $f$ is in $C^2$ near $x^*$, we have by continuity that there exists a $\tau > 0$ such that

$$\delta' \nabla^2 f(x^* + \hat{t}\delta) \delta < 0$$

for all  $\hat{t} \in [0, \tau]$ .

Taking a second order expansion of  $f$  about  $x^*$  gives

$$f(x^* + \hat{t}\delta) = f^* + \hat{t}\delta' \nabla f^* + \frac{1}{2} \hat{t}^2 \delta' \nabla^2 f(x^* + t \cdot \hat{t}\delta) \delta$$

for some  $t \in (0, 1)$ . Since  $x^*$  is a stationary point, this reduces to

$$f(x^* + \hat{t}\delta) = f^* + \frac{1}{2} \hat{t}^2 \delta' \nabla^2 f(x^* + t \cdot \hat{t}\delta) \delta.$$

Finally,  $t \cdot \hat{t} \in [0, \tau]$  so that  $\delta' \nabla^2 f(x^* + t \cdot \hat{t}\delta) \delta < 0$ , and

$$f(x^* + \hat{t}\delta) < f^*,$$

contradicting the fact that  $x^*$  is a local minimizer of  $f$ . Hence  $\nabla^2 f^* \succeq 0$ .  $\square$

**Second Order Sufficient Condition** Let $x^*$ be a stationary point, and assume $\nabla^2 f^* \succ 0$. If $f \in C^2$ in a neighborhood of $x^*$, then $x^*$ is a local minimizer.

*Proof.* By continuity of $\nabla^2 f$, there exists a $\tau > 0$ such that, for every direction $\delta$ with $||\delta|| = 1$,

$$\nabla^2 f(x^* + \hat{t}\delta) \succ 0$$

for all  $\hat{t} \in [0, \tau]$ . Expanding  $f$  about  $x^*$ , there exists a  $t \in (0, 1)$  so that

$$\begin{aligned} f(x^* + \hat{t}\delta) &= f^* + \hat{t}\delta' \nabla f^* + \frac{1}{2} \hat{t}^2 \delta' \nabla^2 f(x^* + t \cdot \hat{t}\delta) \delta \\ f(x^* + \hat{t}\delta) &= f^* + \frac{1}{2} \hat{t}^2 \delta' \nabla^2 f(x^* + t \cdot \hat{t}\delta) \delta \end{aligned}$$

since  $x^*$  is a stationary point.

As before,  $t \cdot \hat{t} \in [0, \tau]$ , so that  $\nabla^2 f(x^* + t \cdot \hat{t}\delta) \succ 0$ , and hence

$$f(x^* + \hat{t}\delta) \ge f^*$$

as desired.  $\square$
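The three conditions can be checked numerically at a candidate point. The following is a minimal sketch in Python with NumPy; the quadratic test function `f`, its hand-coded `grad` and `hess`, and the candidate `x_star` are illustrative assumptions and do not come from the text.

```python
import numpy as np

# Hypothetical test function with a strict local minimizer at (1, -2):
# f(x) = u^2 + 2 v^2 + u v, where u = x[0] - 1 and v = x[1] + 2.
def f(x):
    u, v = x[0] - 1.0, x[1] + 2.0
    return u**2 + 2.0 * v**2 + u * v

def grad(x):
    u, v = x[0] - 1.0, x[1] + 2.0
    return np.array([2.0 * u + v, u + 4.0 * v])

def hess(x):
    # Constant hessian because f is quadratic.
    return np.array([[2.0, 1.0], [1.0, 4.0]])

x_star = np.array([1.0, -2.0])

# First order condition: the gradient vanishes at x_star.
is_stationary = np.allclose(grad(x_star), 0.0)

# Second order sufficient condition: all hessian eigenvalues are positive.
eigs = np.linalg.eigvalsh(hess(x_star))
is_positive_definite = bool(np.all(eigs > 0.0))

# Spot check of local optimality on random nearby points.
rng = np.random.default_rng(0)
nearby = x_star + 1e-2 * rng.normal(size=(1000, 2))
no_lower_value = all(f(x) >= f(x_star) for x in nearby)

print(is_stationary, is_positive_definite, no_lower_value)
```

The random spot check is only a sanity test; it is the eigenvalue check together with stationarity that corresponds to the sufficient condition proved above.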

Several examples of finding extrema are immediately available to us having identified the above necessary and sufficient conditions. For example, we may frame OLS and GLS in terms of an optimization framework. Further, we introduce a version of so-called index tracking.

**Example 6.1.1.** We have already considered the OLS objective function (e.g., (4.2), (4.13))

$$f(\beta) = ||Y - X\beta||^2$$

for  $Y \in \mathbb{R}^N$  and  $X \in \mathbb{R}^{N \times p}$ . We may rewrite  $f$  as

$$\begin{aligned} f(\beta) &= (Y - X\beta)'(Y - X\beta) \\ &= \beta'X'X\beta - 2\beta'X'Y + Y'Y \end{aligned}$$

showing explicitly that  $f$  is a quadratic function of  $\beta$ . Since  $X'X$  is symmetric, we have

$$\nabla f(\beta) = 2X'X\beta - 2X'Y.$$

Setting the gradient equal to zero to find  $\beta^*$  gives

$$\beta^* = (X'X)^{-1}X'Y,$$

an answer seen several times in the preceding work in statistics. Now, since the hessian of  $f$  is

$$\nabla^2 f(\beta) = 2X'X$$

we have that  $\beta^*$  is a minimizer by virtue of the positive semidefiniteness of  $X'X$  and global continuity of  $f$ .

Notice that in the OLS-as-optimization presentation here, no discussion of the random component  $\epsilon$  was considered. In fact, no statistical properties of the estimator  $\beta^*$  are discernible here whatsoever. The reduction to (i.e., justification for) a quadratic minimization problem required an argument about the distributional properties of  $\epsilon$ . The fact that the OLS  $\hat{\beta}$  was a projection is also nowhere to be seen in this casting.
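As a quick numerical illustration of Example 6.1.1, the sketch below solves the normal equations on synthetic data and checks the first and second order conditions. The data, the seed, and the names `beta_true`, `beta_star`, and `beta_lstsq` are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))                      # synthetic design matrix
beta_true = np.array([1.0, -2.0, 0.5])           # hypothetical coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Stationary point: solve (X'X) beta = X'Y instead of forming the inverse.
beta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient 2X'X beta - 2X'Y should vanish at beta_star.
grad_at_star = 2.0 * X.T @ X @ beta_star - 2.0 * X.T @ Y

# Hessian 2X'X: its smallest eigenvalue is positive when X has full column rank.
min_eig = np.linalg.eigvalsh(2.0 * X.T @ X).min()

# Cross-check against NumPy's least squares solver.
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.linalg.norm(grad_at_star), min_eig, np.allclose(beta_star, beta_lstsq))
```

Solving the linear system is generally preferable to computing $(X'X)^{-1}$ explicitly, although both express the same stationary point.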

**Example 6.1.2.** We may also consider the generalized least squares case as an optimization problem. In the GLS case, the objective function becomes

$$f(\beta) = ||Y - X\beta||_{V^{-1}}^2$$

where  $V = \text{Cov}(\epsilon)$  and  $||x||_A^2 = (x, x)_A = x'Ax$  is defined as the inner product with respect to a positive definite matrix  $A$ . Here the expansion of  $f$  yields

$$\begin{aligned} f(\beta) &= (Y - X\beta)'V^{-1}(Y - X\beta) \\ &= \beta'X'V^{-1}X\beta - 2\beta'X'V^{-1}Y + Y'V^{-1}Y \end{aligned}$$

with gradient

$$\nabla f(\beta) = 2X'V^{-1}X\beta - 2X'V^{-1}Y,$$

which gives stationary point

$$\beta^* = (X'V^{-1}X)^{-1}X'V^{-1}Y$$

as in (4.49). Again, $f$ is $C^2$ everywhere with Hessian $2X'V^{-1}X$, which is positive semidefinite when $X$ is full rank and $V$ is a covariance matrix, so the stationarity condition above is sufficient for optimality of $\beta^*$.

As before, the justification of this particular objective function is lacking in the discussion along with distributional properties of the estimators.
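The GLS stationary point can be computed in the same way. The sketch below assumes a synthetic AR(1)-style covariance $V_{ij} = \rho^{|i-j|}$ (a choice made purely for illustration) and compares the closed form with OLS on data whitened by the Cholesky factor of $V$, which minimizes the same objective.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 150, 2
X = rng.normal(size=(N, p))

# Hypothetical covariance of the errors: V_ij = rho^|i - j|.
rho = 0.6
idx = np.arange(N)
V = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(V)                 # V = L L'
Y = X @ np.array([2.0, -1.0]) + L @ rng.normal(size=N)

# Closed form: beta* = (X'V^{-1}X)^{-1} X'V^{-1}Y, using solves rather than inverses.
Vinv_X = np.linalg.solve(V, X)
Vinv_Y = np.linalg.solve(V, Y)
beta_gls = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

# Equivalent route: ||L^{-1}(Y - X beta)||^2 equals the V^{-1}-weighted norm,
# so OLS on the whitened data gives the same minimizer.
Xw = np.linalg.solve(L, X)
Yw = np.linalg.solve(L, Y)
beta_whitened = np.linalg.lstsq(Xw, Yw, rcond=None)[0]

grad_at_star = 2.0 * X.T @ Vinv_X @ beta_gls - 2.0 * X.T @ Vinv_Y
print(beta_gls, beta_whitened, np.linalg.norm(grad_at_star))
```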

In our next example, we look at an index tracking model. That is, given a target security’s returns through time,  $\{s_t\}_{t=1}^T$ , we identify optimal weights for a basket of tradeable securities,  $\{r_{i,t}\}_{t=1}^T$ , for  $i = 1, \dots, N$ .

**Example 6.1.3.** We define

$$r_t = \begin{pmatrix} r_{1,t} \\ \vdots \\ r_{N,t} \end{pmatrix}.$$

We consider the objective function in weights,  $w \in \mathbb{R}^N$ ,

$$f(w) = \sum_{t=1}^T (s_t - r'_t w)^2$$

which may be rewritten as in ordinary least squares as

$$f(w) = ||S - Rw||^2 \tag{6.3}$$

for

$$S = \begin{pmatrix} s_1 \\ \vdots \\ s_T \end{pmatrix}$$

and $R_{ij} = r_{j,i}$; i.e., the $i$th row of $R$ consists of the $N$ securities' returns at time $i$.

We proceed as before, yielding gradient

$$\nabla f(w) = 2R'Rw - 2R'S$$

and optimal solution

$$w^* = (R'R)^{-1}R'S, \tag{6.4}$$

or the OLS solution in another guise.
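A minimal sketch of the weight calculation on synthetic data follows. The simulated returns `R`, the equal-weight `w_true`, and the noise scales are assumptions for illustration and are unrelated to the market data discussed in this chapter; only the dimensions $T = 117$ and $N = 50$ mirror the setting described here.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 117, 50
R = 0.02 * rng.normal(size=(T, N))        # row t holds the N security returns at time t
w_true = np.full(N, 1.0 / N)              # hypothetical index weights
S = R @ w_true + 0.001 * rng.normal(size=T)

# w* minimizes ||S - R w||^2; lstsq solves the same normal equations as (R'R)^{-1} R'S.
w_star, *_ = np.linalg.lstsq(R, S, rcond=None)

# The gradient 2R'Rw - 2R'S vanishes at w*.
grad_at_star = 2.0 * R.T @ (R @ w_star - S)

# In-sample tracking error of the fitted weights.
in_sample_err = S - R @ w_star
print(np.linalg.norm(grad_at_star), np.sqrt(np.mean(in_sample_err**2)))
```

By construction the fitted weights can do no worse in sample than any other weight vector, including `w_true`; out of sample performance is a separate question.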

We may apply the above to a particular set of data. Using the same cross-sectional data as in previous studies, we build index tracking portfolios from weekly return data of the 50 largest companies available each of the 200 months from 10/31/1997 to 5/31/2014. In particular, $T = 117$ weeks and $N = 50$ in the above notation. We then use the optimal weights for each time period given by (6.4) to evaluate performance over the next four weeks.

![](92f8c46f24a606d066065c559752e9b7_img.jpg)

[Top panel: scatter plot "Target Returns to Synthetic Returns Using 50 Largest Companies for Synthetic", with axes "Target Returns" and "Synthetic Returns". Bottom panel: "Histogram of Error Returns: Synthetic-Target", with axes "Returns" and "Frequency".]

Figure 6.1: Summary statistics for the out of sample weekly return performance of an index tracking methodology obtained from an unconstrained quadratic optimization problem.

Summary statistics are presented visually in Figure 6.1. The 95% confidence interval for the CAPM $\alpha$ of the synthetic asset's weekly returns is $[-0.0001, 0.0008]$, while the CAPM $\beta$ confidence interval is $[0.9538, 0.9866]$, significantly different from 1.0. We cannot reject the null hypothesis that the synthetic return's $\alpha$ is zero, the rare case where this is not a disappointing statement to make.

Finally, across the roughly 800 weeks studied, the average tracking error was 3 bps (0.03%). This annualizes to 1.67% over nearly 17 years. These are appealing characteristics by and large. However, the procedure is not without its faults.

![Histogram of Optimal Weights, with x-axis 'Weights' ranging from -0.3 to 0.3 and y-axis 'Frequency'.](2d18a112a734dd3ca78d3b93ea8b30ff_img.jpg)

Figure 6.2: Histogram of all optimal weights found applying the index tracking procedure over 200 distinct months. Large positive and negative weights are apparent as well as a common occurrence of negative weights, generally.

In particular, the optimization is prone to large and negative values of $w_i^*$, as can be seen in Figure 6.2. Approximately 20% of all of the weights found are negative, and position sizing as large as $\pm 28\%$ was seen. For a market tracking portfolio, one may argue that the appearance of negative positions is spurious. In practice, the added difficulty (and cost) of shorting securities makes this more than just an academic observation. Additionally, the near symmetry for outsized positioning bodes poorly for the method, indicating potentially spurious results.

The observations noted here will be seen again when we give mean-variance optimization a formal treatment. Our observations are not novel by any means, however, and are well noted in the current literature. We will see that a parsimonious solution relies on understanding the eigenvalues of the covariance of the returns involved. As a preview: the model results here are sensitive to near-zero eigenvalues. Solutions abound, however. In particular, an immediate and obvious fix is to *constrain* the possible weight sizing available to the model. Another is to ameliorate the eigenvalue sensitivity just identified. This solution leverages random matrix theory, a burgeoning area of interest in the field.
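The sensitivity to near-zero eigenvalues can be previewed on synthetic data. In the sketch below, which is an illustration under assumed data rather than a reproduction of the study above, one column of returns is made nearly identical to another, so that $R'R$ acquires a near-zero eigenvalue. The names `R_base`, `R_coll`, `w_true`, and `fit` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 117, 5
R_base = 0.02 * rng.normal(size=(T, N))   # synthetic, well-conditioned returns
w_true = np.full(N, 0.2)                  # hypothetical target weights

# Make column 1 nearly identical to column 0 to force a near-zero eigenvalue.
R_coll = R_base.copy()
R_coll[:, 1] = R_coll[:, 0] + 1e-5 * rng.normal(size=T)

noise = 0.001 * rng.normal(size=T)

def fit(R):
    """Return least squares weights plus the smallest eigenvalue and condition number of R'R."""
    S = R @ w_true + noise
    w, *_ = np.linalg.lstsq(R, S, rcond=None)
    eigs = np.linalg.eigvalsh(R.T @ R)
    return w, eigs.min(), eigs.max() / eigs.min()

w_base, min_eig_base, cond_base = fit(R_base)
w_coll, min_eig_coll, cond_coll = fit(R_coll)

print("smallest eigenvalue:", min_eig_base, "vs", min_eig_coll)
print("largest |weight|:   ", np.abs(w_base).max(), "vs", np.abs(w_coll).max())
print("sum of the two near-duplicate weights:", w_coll[0] + w_coll[1])
```

In a setup like this, the sum of the weights on the two near-duplicate columns is well determined, while their difference is divided by the tiny eigenvalue and so absorbs most of the noise. This is one way offsetting long and short positions can arise.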

#### 6.1.2 Convex Functions

Convex functions have particularly attractive properties with respect to their extrema. In particular, local minimizers are also global minimizers, and the first order necessary condition for optimality above is also sufficient. We prove each of these claims here.

Let $x^*$ be a local minimizer of a convex function $f$. Then $x^*$ is a global minimizer of $f$.

*Proof.* Assume by contradiction that there exists a $z$ such that $f(z) < f^*$. Then by convexity, for $x_\theta = \theta z + (1-\theta)x^*$ with $\theta \in (0, 1]$,

$$f(x_\theta) \le \theta f(z) + (1-\theta)f^* < f^*.$$

However, any neighborhood of $x^*$ contains $x_\theta$ for sufficiently small $\theta$, contradicting the fact that $x^*$ is a local minimizer. Therefore $x^*$ must be a global minimizer. $\square$

Next, suppose that $f$ is convex and differentiable at $x^*$. If $x^*$ is a stationary point, then $x^*$ is a global minimizer of $f$.

*Proof.* We again proceed by contradiction, assuming there exists a  $z$  such that  $f(z) < f^*$ . By (5.29) we have

$$\nabla f^{*\prime}(z - x^*) \le f(z) - f^*$$

implying since  $x^*$  is a stationary point that

$$f^* \le f(z),$$

a contradiction.  $\square$
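The second proof rests on the gradient inequality for convex functions, $f(z) - f(x) - \nabla f(x)'(z - x) \ge 0$. As a sanity check, the sketch below evaluates this gap on random pairs of points for a convex test function. The function (log-sum-exp plus a quadratic term) is an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical convex test function: log-sum-exp plus a quadratic term.
def f(x):
    m = np.max(x)                          # shift for numerical stability
    return m + np.log(np.sum(np.exp(x - m))) + 0.5 * x @ x

def grad(x):
    e = np.exp(x - np.max(x))
    return e / e.sum() + x                 # softmax(x) + x

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    x = rng.normal(size=4)
    z = rng.normal(size=4)
    # Gap in the gradient inequality; convexity requires it to be nonnegative.
    gaps.append(f(z) - f(x) - grad(x) @ (z - x))

min_gap = min(gaps)
print(min_gap >= -1e-12)
```

When $x$ is a stationary point the gradient term drops out, and the inequality reduces to $f(z) \ge f(x)$ for every $z$, which is the claim proved above.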

### 6.2 Newton's Method

Based on our work above, to find minima of  $f: \mathbb{R}^N \to \mathbb{R}$ , we are motivated to identify stationary points. Our prototype algorithm for doing so will be Newton's Method, an iterative procedure developed to find the roots of functions.

For a general  $F: \mathbb{R}^N \to \mathbb{R}^N$ , Newton's Method utilizes the linear approximation of  $F$  given by

$$F_l(x + \delta) = F(x) + \nabla F(x)\delta$$

and solves where this linear approximation,  $F_l$  is zero; viz.,

$$\delta = -(\nabla F(x))^{-1}F(x).$$

Immediately we see that we must have invertibility of the Jacobian for this particular implementation of the method to be well defined. Iterations continue, updating  $x$  with this particular  $\delta$ .
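The root-finding iteration just described can be sketched in a few lines. The system `F`, its Jacobian `J`, the starting point, and the tolerance and iteration cap are all assumptions made for this example; the step solves $\nabla F(x)\delta = -F(x)$ rather than forming the inverse.

```python
import numpy as np

def newton_root(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's Method for F(x) = 0, with J(x) the Jacobian of F."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        # Zero of the linear approximation: J(x) delta = -F(x).
        delta = np.linalg.solve(J(x), -Fx)
        x = x + delta
    return x

# Hypothetical system: intersection of the circle x^2 + y^2 = 4 with the line x = y.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

root = newton_root(F, J, x0=[1.0, 2.0])
print(root, np.linalg.norm(F(root)))
```

`np.linalg.solve` raises `LinAlgError` when the Jacobian is singular, which is the numerical counterpart of the invertibility requirement noted above.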

Given that we are particularly interested in the roots of the gradient function,  $\nabla f: \mathbb{R}^N \to \mathbb{R}^N$ , we may rewrite the above in terms of the gradient and hessian as

$$\delta = -(\nabla^2 f(x))^{-1} \nabla f(x). \tag{6.5}$$

This  $\delta$  is often referred to as the *Newton step* or *Newton direction*. We may also, for reasons that will become clear in later exposition, call this a *full* Newton step.

We use superscript notation to indicate iterates within an algorithm; e.g.,  $x^k$  as the  $k$ th iterate,  $\delta^k$  as the  $k$ th Newton step, and  $\nabla f^k = \nabla f(x^k)$ , and so on. The algorithm is outlined as follows, initializing with some  $x^0$  and having small and large threshold parameters  $\epsilon$  and  $K$ , respectively:

**Algorithm 1** Newton’s Method

Initialize $x^0$ and set $k \leftarrow 0$  
**while** $||\nabla f^k|| > \epsilon$ and $k < K$ **do**  
 $\delta^k \leftarrow -(\nabla^2 f^k)^{-1} \nabla f^k$  
 $x^{k+1} \leftarrow x^k + \delta^k$  
 $k \leftarrow k + 1$  
**end while**

Of course, Algorithm 1 presupposes some kind of stopping condition in the norm of the gradient is possible; viz., there is a tacit assumption that we will be able to find an iterate  $k$  such that  $||\nabla f^k||$  is in fact smaller than some prescribed  $\epsilon$ . And, of course, in this case we would expect some proximity to a stationary point. The maximum iteration parameter,  $K$ , is just realistic coding.
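Algorithm 1 translates almost line for line into code. The following is a minimal sketch, assuming user-supplied `grad` and `hess` callables; the test objective $f(x, y) = e^{x+y} + x^2 + y^2$, the starting point, and the default values of `eps` and `K` are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def newton_minimize(grad, hess, x0, eps=1e-10, K=50):
    """Full-step Newton's Method following Algorithm 1; returns the final point and all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    k = 0
    while np.linalg.norm(grad(x)) > eps and k < K:
        # Newton step (6.5), computed by solving hess(x) delta = -grad(x).
        delta = np.linalg.solve(hess(x), -grad(x))
        x = x + delta
        iterates.append(x.copy())
        k += 1
    return x, iterates

# Hypothetical strictly convex objective: f(x, y) = exp(x + y) + x^2 + y^2.
def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2.0 * x[0], e + 2.0 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x_min, iterates = newton_minimize(grad, hess, x0=[0.0, 0.0])
print(len(iterates) - 1, x_min, np.linalg.norm(grad(x_min)))
```

Returning the list of iterates is a design choice that makes it easy to plot the path or the gradient norms afterwards. The cap `K` guards against the case where the gradient tolerance is never reached.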

Our next theorem addresses the as yet aspirational view of finding a point  $x^k$  near a stationary point. We outline several assumptions first.

**Theorem 6.2.1.** Let $f \in C^2$ in a neighborhood of a stationary point $x^*$, with hessian satisfying both $\nabla^2 f^* \succ 0$ and $||\nabla^2 f(x_1) - \nabla^2 f(x_2)|| \le M||x_1 - x_2||$ for any $x_1$ and $x_2$ in that neighborhood. Then for $x^k$ sufficiently close to $x^*$, Newton’s method is well defined at each iterate and converges at second order.

In the above, quadratic convergence of a sequence  $\{x^k\}$  to  $x^*$  is defined by the satisfaction of

$$\frac{||x^{k+1} - x^*||}{||x^k - x^*||^2} \le C \tag{6.6}$$

for some constant,  $C$ . Of course, this implies that  $||x^{k+1} - x^*|| = O(||x^k - x^*||^2)$ .

*Proof.* Since

$$\nabla^2 f^* \succ 0$$

and $f \in C^2$ in a neighborhood of $x^*$, there exists a neighborhood about $x^*$ such that the hessian remains positive definite. In addition, since

$$||\nabla^2 f(x_1) - \nabla^2 f(x_2)|| \le M||x_1 - x_2||$$

in a neighborhood of  $x^*$  the inverse of the hessian is bounded in that neighborhood.

These two neighborhoods may be taken jointly to identify a neighborhood,  $\mathcal{N}(x^*)$ , such that

$$\begin{array}{rcl} \nabla^2 f(x) & \succ & 0 \\ ||\nabla^2 f(x)^{-1}|| & < & M' \end{array}$$

for all  $x \in \mathcal{N}(x^*)$  and some  $M'$ .

Suppose  $x^k$  lies in  $\mathcal{N}(x^*)$  for some iterate  $k$ . We may define  $h^k$  by

$$x^* = x^k + h^k.$$

A Taylor expansion of the gradient of  $f$  about  $x^k$  gives

$$\nabla f(x^k + h^k) = \nabla f^k + \nabla^2 f^k h^k + O(||h^k||^2).$$

But since  $\nabla f(x^k + h^k) = \nabla f^* = 0$ , we get

$$0 = \nabla f^k + \nabla^2 f^k h^k + O(||h^k||^2).$$

Multiplying through by  $(\nabla^2 f^k)^{-1}$  gives

$$\begin{array}{rcl} -h^k & = & (\nabla^2 f^k)^{-1} \nabla f^k + O(||h^k||^2) \\ \delta^k - h^k & = & O(||h^k||^2) \end{array}$$

where the big  $O$  term remains the same since the norm of  $(\nabla^2 f^k)^{-1}$  is bounded on  $\mathcal{N}(x^*)$ .

With a little bit of algebraic manipulation, we see that

$$\begin{array}{rcl} x^k + h^k & = & x^{k+1} + h^{k+1} \\ x^k + h^k & = & x^k + \delta^k + h^{k+1} \\ -\delta^k + h^k & = & h^{k+1}. \end{array}$$

So that

$$h^{k+1} = O(||h^k||^2). \tag{6.7}$$

If we can show that  $x^{k+1}$  remains in  $\mathcal{N}(x^*)$ , the proof is complete as we already have second order convergence via (6.7).

Equation (6.7) gives that there exists a constant depending on  $k$ ,  $M_k$ , such that

$$||h^{k+1}|| \le M_k ||h^k||^2.$$

But by the integral formulation of the Taylor expansion given in (5.22) coupled with the derivation above, we may uniformly bound all iterates by some  $\tilde{M}$  by the continuity of  $\nabla^2 f$  as

$$||h^{k+1}|| \le \tilde{M} ||h^k||^2.$$

This exercise is left to the reader. For this  $\tilde{M}$ , assume that there is some  $k_0$  such that

$$||h^{k_0}|| \le \frac{\alpha}{\tilde{M}}$$

for some  $\alpha \in (0, 1)$  sufficiently small to remain in  $\mathcal{N}(x^*)$ . We have then,

$$\begin{aligned}||h^{k_0+1}|| &\le \tilde{M}||h^{k_0}||^2 \\ &\le \tilde{M}\left(\frac{\alpha}{\tilde{M}}\right)||h^{k_0}|| \\ &\le \alpha||h^{k_0}||\end{aligned}$$

so that  $x^{k_0+1}$  remains in  $\mathcal{N}(x^*)$  and all of the previous assumptions hold as well. Finally,

$$\begin{aligned}||h^{k_0+2}|| &\le \tilde{M}||h^{k_0+1}||^2 \\ &\le \tilde{M}\frac{\alpha}{\tilde{M}}\alpha||h^{k_0}|| \\ &\le \alpha^2||h^{k_0}||\end{aligned}$$

and in general,  $||h^{k_0+N}|| \le \alpha^N||h^{k_0}||$ , so that  $\lim_{k \to \infty} ||h^k|| = 0$ .  $\square$
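The ratio in (6.6) can be observed directly on a small example. The sketch below uses the one-dimensional function $f(x) = e^x - 2x$, an assumption chosen because its stationary point $x^* = \ln 2$ is known in closed form, and runs four full Newton steps from an assumed starting point $x^0 = 1.5$.

```python
import math

# Hypothetical example: f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2,
# f''(x) = exp(x) > 0, and the stationary point is x* = ln 2.
x_star = math.log(2.0)
x = 1.5                                    # starting point, assumed close enough to x*
errors = [abs(x - x_star)]
for _ in range(4):
    x = x - (math.exp(x) - 2.0) / math.exp(x)   # full Newton step
    errors.append(abs(x - x_star))

# Ratios ||x^{k+1} - x*|| / ||x^k - x*||^2 as in (6.6).
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(4)]
print(errors)
print(ratios)
```

For this function the ratios stay bounded (below about 1/2), so the error is roughly squared at each step. Only four steps are taken because further iterates reach the limits of floating point precision, where the computed ratio is no longer meaningful.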

### Exercises

1. Consider

$$\begin{aligned}f(\beta) &= ||Y - X\beta||_{V^{-1}}^2 \\&= \beta'X'V^{-1}X\beta - 2\beta'X'V^{-1}Y + Y'V^{-1}Y.\end{aligned}$$

(a) Show that the hessian of  $f$  is exactly  $\nabla^2 f(\beta) = 2X'V^{-1}X$ .  
(b) Prove that  $X'V^{-1}X$  is positive semidefinite when  $X$  is full rank and  $V$  is a covariance matrix.

2. We have seen the definition of a norm for a square matrix  $A$  given by

$$||A||_2 = \max_{||v||=1} ||Av||$$

and proved in an exercise that

$$||A||_2 = \max_{||v|| \neq 0} \frac{||Av||}{||v||}.$$

Suppose  $A$  is invertible. What is  $||A^{-1}||$ ? Prove your claim.

3. Suppose  $(\hat{\beta}^1, \dots, \hat{\beta}^N)$  are the OLS estimates of

$$Y^i = X\beta^i + \epsilon^i$$

for assets  $i = 1, \dots, N$ . Let  $Y = (Y^1 Y^2 \dots Y^N) \in \mathbb{R}^{M \times N}$ , where  $Y^i \in \mathbb{R}^M$ . We have seen that OLS regression is the solution to a minimization of norm.

(a) For some fixed  $w$ , find the solution to

$$\min_{\beta} ||Yw - X\beta||^2$$

in terms of  $w$ ,  $X$ , and  $Y$ .

(b) Conclude that the multifactor  $\beta$  of a portfolio is the weighted sum of asset  $\beta$ s.

4. For this problem, you will need to write code for the multivariate Newton's method. Consider the function

$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$

(a) Using the multivariate version of Newton's method with initial point  $(4, 4)$ , find a stationary point of  $f$ .  
(b) Plot your iterates  $(x_i, y_i)$ . If possible, include the level curves of  $f$  in your plot.  
(c) Plot  $|\nabla f(x_i, y_i)|$  for each of your iterates.

5. In Merton's structural model, we have for  $r$  the risk free rate,  $F$  the debt barrier, and  $E$  and  $\sigma_E$  the given equity value and equity volatility, respectively, that

$$V_t\Phi(d_1) - E - Fe^{-r(T-t)}\Phi(d_2) = 0 \tag{6.8}$$

and

$$\sigma_V V - E\Phi^{-1}(d_1)\sigma_E = 0 \tag{6.9}$$

where

$$d_1 = \frac{\ln\left(\frac{V}{F}\right) + \left(r + \frac{\sigma_V^2}{2}\right)(T-t)}{\sigma_V\sqrt{T-t}},$$
$$d_2 = d_1 - \sigma_V\sqrt{T-t},$$

and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, with derivative the probability density function $\phi(\cdot)$. In (6.9), $\Phi^{-1}(d_1)$ denotes the reciprocal $1/\Phi(d_1)$ rather than the inverse of $\Phi$. For fixed $E$ and $\sigma_E$, let

$$M(V, \sigma_V) = \begin{pmatrix} V_t\Phi(d_1) - E - Fe^{-r(T-t)}\Phi(d_2) \\ \sigma_V V - E\Phi^{-1}(d_1)\sigma_E \end{pmatrix}.$$

(a) Find the Jacobian of  $M$ ,  $\nabla M$ .  
(b) Write a Newton's Method algorithm using your Jacobian function above to find $(V, \sigma_V)$ when $(E, F, \sigma_E, r, T, t) = (1, 0.5, 0.20, 0.0025, 1, 0)$.

6. Consider the shrinkage estimator problem where

$$R(\alpha) = \alpha F + (1 - \alpha)S - \Sigma$$

where  $F = (f_{ij})$  is the shrinkage target,  $S = (s_{ij})$  is the sample covariance, and  $\Sigma = (\sigma_{ij})$  is the covariance. Let

$$L(\alpha) = ||R(\alpha)||^2 = \text{tr}(R(\alpha)'R(\alpha))$$

(a) Show that  $\mathbb{E}(L(\alpha))$  is minimized at

$$\alpha^* = \frac{\sum_{i,j} \text{Var}(s_{ij}) - \text{Cov}(f_{ij}, s_{ij})}{\sum_{i,j} \mathbb{E}(f_{ij} - \sigma_{ij})^2 + \text{Var}(s_{ij}) - 2\text{Cov}(f_{ij}, s_{ij})}$$

(b) Show that you may write the denominator above as

$$\sum_{i,j} \text{Var}(f_{ij} - s_{ij}) + (\mathbb{E}(f_{ij}) - \sigma_{ij})^2$$

![](fde82173162f4bbc081626688b5d9253_img.jpg)