In [None]:
from dialoghelper import add_msg
import re
from fastcore.foundation import Path
def md_to_notes(path):
    "Read markdown file and create a note for each header section"
    txt = Path(path).read_text()
    parts = re.split(r'^(#{1,4}\s+.+)$', txt, flags=re.MULTILINE)
    if parts[0].strip(): add_msg(content=parts[0].strip())
    for i in range(1, len(parts), 2):
        content = parts[i] + (parts[i+1] if i+1 < len(parts) else '')
        if content.strip(): add_msg(content=content.strip())

In [None]:
md_to_notes('./md/ch10.md')

## Chapter 10

## Covariance Modifications

Just as we may alter our particular risk measure, we may also focus directly on the covariance matrix. Of course, as a driver for the elliptical distributions we have seen, this may be an effort which has multiple avenues of impact; e.g., directly in the objective function in a mean-variance setting, or perhaps as the driver of returns in a mean-CVaR setting. In any event, the importance and centrality of the covariance in much of what we have covered should be clear. Furthermore, the difficulties surrounding the sample covariance matrix, in terms of estimation, variation through time, and as an input in an objective function of an optimization problem have been made clear.

In this chapter, we consider several modifications to the covariance matrix; all of which are biased estimators of the covariance, and rely on varying amounts of predefined structure. First, we will formally outline the so-called *factor model* approach. Here, structure is explicit and full. A similar modification relies on a combination of the sample and some factor model based estimator. Broadly, these are *shrinkage estimators*, and while the literature is very robust, we will focus on a small subset of these results. Finally, as we have seen that constrained mean-variance optimization generally outperforms its unconstrained counterpart, we present a modification to the covariance unifying these observations via a *constrained maximum likelihood* correspondence.

### 10.1 Factor Models

In the standard factor model framework, we write

$$r_t = \sum_{k=1}^{K} \beta_k f_{k,t} + \epsilon_t, \quad (10.1)$$

with $f_{k,t}$ the $k^{\text{th}}$ *common factor* as observed at time $t$, $r_t$ the individual stock return being examined (sometimes with a difference of risk-free rate taken out), $\beta_k$ the *factor loading* on the $k^{\text{th}}$ common factor for this individual stock, and $\epsilon_t$ the idiosyncratic error under the model. When several stocks (or, more generally, securities) are considered, a subscript of $i$ is included as $r_{t,i}$, $\beta_{k,i}$, and $\epsilon_{t,i}$ to indicate reference to the $i^{\text{th}}$ stock.

We have already encountered such models by way of CAPM. Further, the relation to the standard OLS framework of (4.11) and (4.12) is immediate, the slight change of notation to  $f$ , notwithstanding. Of course, this interpretation makes a tacit assumption that the model is fit using a time series. More on this below.

In addition to the single factor model of CAPM [22, 31], another very well-regarded factor model is the Fama-French extension to CAPM [10]. This model resolves two apparent anomalies that remain after controlling for market exposure using CAPM: the outperformance of small market cap stocks relative to large, and of higher value stocks relative to lower, as measured by the cross-section of book to price. This is achieved, consistent with our approach so far, by accounting for these exposures via the regression

$$r_t - r_f = \beta_m(m_t - r_f) + \beta_h h_t + \beta_v v_t + \epsilon_t. \quad (10.2)$$

The particular method of construction of the time series for the common factors,  $h_t$  and  $v_t$ , may be seen in the original paper. Broadly, they are constructed via cross-sectional rank ordered portfolios, controlling for remaining factor exposures; viz.,  $h_t$  is determined as the average return (in the cross-section) of ‘small high value’ and ‘big high value’ minus the average return (again, in the cross-section) of ‘small low value’ and ‘big low value’, where ‘small’ and ‘big’ are with respect to the median market capitalization of the cross-section at that particular time.

In every case, we assume that the mean and covariance of the common factors is constant over time. Writing  $\mathbf{f}_t = (f_{1,t}, \dots, f_{K,t})'$ , this implies

$$\mathbb{E}(\mathbf{f}_t) = \mu_f \quad (10.3)$$

$$\text{Cov}(\mathbf{f}_t) = \Omega_f \quad (10.4)$$

for all  $t$ . We will also make a distinction when the  $\beta$ 's of (10.1) are fit using observations obtained from a time series for any given stock, and when the Gauss-Markov assumptions hold for each of these regressions (not necessarily with the distributional assumption). As an example of the distinction when fitting a time series regression, we will assume that  $\text{Cov}(f_{k,t}, \epsilon_{i,t}) = 0$  for all  $k$ ,  $t$ , and  $i$ .

In addition to these usual assumptions, we also assume that the idiosyncratic components for stock  $i$  and  $j$  are uncorrelated; i.e.,

$$\text{Cov}(\epsilon_{i,t}, \epsilon_{j,t}) = \begin{cases} \sigma_i^2 & i = j \\ 0 & i \neq j \end{cases} \quad (10.5)$$

where a lack of time dependence is implicit in the definition.

Notice that if a constant factor is included, the generalization of  $\alpha$  as in the CAPM model with intercept, (4.4), is immediate.

#### 10.1.1 Time Series Models: Observed Common Factors

For stocks  $i = 1, \dots, N$ , each of which is fit to (10.1) using a time series,  $t = 1, \dots, T$ , the model now specifies (omitting a particular time subscript),

$$r = \mathbf{B}f + \epsilon, \quad (10.6)$$

where  $r$  is the vector of returns for the cross-section,  $(r_1, \dots, r_N)$ , and  $B \in \mathbb{R}^{N \times K}$  is the matrix of factor loadings,

$$\mathbf{B} = \begin{pmatrix} - & \beta_1 & - \\ \vdots & \vdots & \vdots \\ - & \beta_N & - \end{pmatrix}, \quad (10.7)$$

with  $\beta_i = (\beta_{1,i}, \dots, \beta_{K,i})'$  for each  $i$ .

Under the assumptions of the model, we have that the covariance of the cross-section is simply

$$\text{Cov}(r) = \mathbf{B}\Omega_f\mathbf{B}' + \mathbf{D}, \quad (10.8)$$

where  $D = \text{diag}(\sigma_1^2, \dots, \sigma_N^2)$ . Further, the OLS estimates of  $\beta_i$  and residual variance,  $\sigma_i^2$  given by  $\hat{\beta}_i$  and  $s_i^2$ , respectively, for each  $i$ , yield the unbiased estimator of the covariance

$$\widehat{\text{Cov}}(r) = \hat{\mathbf{B}}\hat{\Omega}_f\hat{\mathbf{B}}' + \hat{\mathbf{D}}, \quad (10.9)$$

where in addition to inputting estimates, the unbiased estimator of the factor covariance, $\hat{\Omega}_f$, is also used. In general, the number of common factors is far less than the number of securities, $K \ll N$, remedying in small part the issue of insufficient observations in time normally encountered when estimating the sample covariance (the so-called large $N$, small $T$ problem).

Portfolio variance using (10.9) may be obtained directly, and the exercise is left to the reader. Similarly, if one common factor is simply a vector of ones, the model allows for an interpretation of excess return after controlling for factor exposure.
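The construction of (10.9) can be sketched numerically. The following is a minimal illustration on simulated data, not a production estimator; the function name `factor_model_cov`, the inclusion of an intercept in each regression, and the simulated inputs are all assumptions made for the example.

```python
import numpy as np

def factor_model_cov(R, F):
    """Estimate B Omega_f B' + D from T x N returns R and T x K factors F.

    Assumption: an intercept is included in each time series regression.
    """
    T, K = F.shape
    X = np.column_stack([np.ones(T), F])              # T x (K+1) design matrix
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)      # (K+1) x N, one column per stock
    B = coef[1:].T                                    # N x K factor loadings
    resid = R - X @ coef                              # T x N residuals
    s2 = (resid ** 2).sum(axis=0) / (T - K - 1)       # residual variances per stock
    Omega_f = np.atleast_2d(np.cov(F, rowvar=False))  # K x K, divisor T - 1
    return B @ Omega_f @ B.T + np.diag(s2), B, Omega_f, s2

# Simulated data with fewer observations than securities (large N, small T).
rng = np.random.default_rng(0)
T, N, K = 60, 100, 3
F = rng.normal(size=(T, K))
B_true = rng.normal(size=(N, K))
R = F @ B_true.T + 0.5 * rng.normal(size=(T, N))

Sigma_fm, B_hat, Omega_hat, s2 = factor_model_cov(R, F)
S = np.cov(R, rowvar=False)       # sample covariance, rank at most T - 1
w = np.full(N, 1.0 / N)           # equal-weight portfolio
port_var = w @ Sigma_fm @ w       # portfolio variance under the factor model
```

In this simulation the sample covariance is singular because $T < N$, while the factor model covariance is positive definite, since the diagonal matrix of residual variances is added to a low-rank term.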

In the case that every stock has the same range of observed returns, a single multivariate regression may be performed. To see this, let

$$\mathbf{R} = \begin{pmatrix} r_{1,1} & & r_{1,N} \\ \vdots & \dots & \vdots \\ r_{T,1} & & r_{T,N} \end{pmatrix}, \quad \mathbf{F} = \begin{pmatrix} f_{1,1} & & f_{1,K} \\ \vdots & \dots & \vdots \\ f_{T,1} & & f_{T,K} \end{pmatrix} \quad (10.10)$$

and let $\mathbf{B}$ be as before. Finally, define $\mathbf{E}$ similarly, based on the residuals $\epsilon$. Then the simultaneous system of equations given by (10.1) for every stock $i$ is

$$\mathbf{R} = \mathbf{F}\mathbf{B}' + \mathbf{E}, \quad (10.11)$$

where the transpose reflects that $\mathbf{B}$ is $N \times K$ while $\mathbf{F}$ is $T \times K$.

One may verify that the usual approach taken in the OLS setting – namely, a modification of (4.14) – solves the matrix equation under the Gauss-Markov assumptions.
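As a quick numerical check of this claim, the sketch below fits the loadings once for the whole cross-section and once stock by stock, and compares the two. It assumes the OLS solution takes the usual normal-equations form; the simulated data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, K = 120, 5, 2
F = rng.normal(size=(T, K))                                      # T x K factors
R = F @ rng.normal(size=(K, N)) + 0.1 * rng.normal(size=(T, N))  # T x N returns

# One shot: solve the normal equations (F'F) X = F'R for the K x N
# coefficient matrix X, whose transpose has the layout of (10.7).
Bt_joint = np.linalg.solve(F.T @ F, F.T @ R)

# Stock by stock: N separate least-squares fits, one per column of R.
Bt_loop = np.column_stack(
    [np.linalg.lstsq(F, R[:, i], rcond=None)[0] for i in range(N)]
)

B_hat = Bt_joint.T   # N x K, row i holds the loadings of stock i
```

The two results agree because the least-squares objective separates across the columns of $\mathbf{R}$ when every stock shares the same regressor matrix $\mathbf{F}$.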

#### 10.1.2 Cross-Sectional Models: Observed Factor Loadings

The model may also be fit via the cross-section at a specific time. In particular, using the same variables, but adapting to the use in the cross-section, we may write, now assuming that  $\mathbf{B}$  is given rather than  $\mathbf{f}$ ,

$$r_t = \mathbf{B}\mathbf{f}_t + \epsilon_t. \tag{10.12}$$

In this model, the common factors are unobserved and estimated from security features at a given time. Of note is that the residual covariance is no longer homoscedastic. In particular, we must assume (10.5) holds. If residual variances, $\sigma_i^2$, are assumed known, then, using the GLS formula (4.49), we have

$$\hat{\mathbf{f}}_t = (\mathbf{B}'\mathbf{D}^{-1}\mathbf{B})^{-1}\mathbf{B}'\mathbf{D}^{-1}r_t. \tag{10.13}$$

The OLS estimate, which ignores the differing residual variances, remains unbiased under these assumptions but is no longer efficient.

Given  $\{\hat{\mathbf{f}}_t\}_{t=1}^T$ , an estimate of the covariance of unobserved common factors is given by the sample,

$$\hat{\Omega}_f = \frac{1}{T-1} \sum_{t=1}^T (\hat{\mathbf{f}}_t - \bar{\mathbf{f}})(\hat{\mathbf{f}}_t - \bar{\mathbf{f}})', \tag{10.14}$$

where  $\bar{\mathbf{f}}$  is the sample mean,

$$\bar{\mathbf{f}} = \frac{1}{T} \sum_{t=1}^T \hat{\mathbf{f}}_t. \tag{10.15}$$

For the cross-sectional model with unobserved common factors, the estimate of the covariance is again given by (10.9), but with (10.14) as input for  $\hat{\Omega}_f$ .

The astute reader may notice that the residual variances used in the formulation of (10.13) are not directly observable, but nonetheless needed for this model. These variances may be estimated from a simple OLS regression to obtain a time series of residuals  $\{\hat{\epsilon}_t\}_{t=1}^T$ , and sample variances may be obtained from the time series for each security in the cross-section. The diagonal matrix of sample variances  $\hat{\mathbf{D}}$  may then be used in (10.13).

Finally, one may show that the common factors given in (10.13) may be interpreted as the portfolio return where the portfolio weights are the solution to a minimum variance problem (using  $\mathbf{D}$  as covariance) subject to each portfolio factor  $\beta$  being one. The exercise, again, is left to the reader.
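The cross-sectional estimates (10.13) and (10.14) can be sketched as follows. This is an illustration on simulated data with the residual variances taken as known; the helper name `gls_factor_returns` and all inputs are assumptions made for the example.

```python
import numpy as np

def gls_factor_returns(B, d, R):
    """Apply (10.13) at every date; rows of R are r_t, d holds residual variances."""
    DinvB = B / d[:, None]                        # D^{-1} B, shape N x K
    W = np.linalg.solve(B.T @ DinvB, DinvB.T)     # (B'D^{-1}B)^{-1} B'D^{-1}, K x N
    return R @ W.T, W                             # T x K factor estimates, weights

rng = np.random.default_rng(2)
T, N, K = 250, 40, 3
B = rng.normal(size=(N, K))                       # observed factor loadings
d = rng.uniform(0.5, 2.0, size=N)                 # residual variances (assumed known)
f_true = rng.normal(size=(T, K))
R = f_true @ B.T + rng.normal(size=(T, N)) * np.sqrt(d)

f_hat, W = gls_factor_returns(B, d, R)
Omega_f_hat = np.cov(f_hat, rowvar=False)         # (10.14), divisor T - 1
Sigma_hat = B @ Omega_f_hat @ B.T + np.diag(d)    # covariance as in (10.9)
```

Each row of `W` can be read as a set of portfolio weights: `W @ B` is the identity, so the $k^{\text{th}}$ row has unit exposure to factor $k$ and zero exposure to the others, consistent with the minimum variance interpretation noted above.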

#### 10.1.3 Statistical Factor Models: Principal Component Analysis

In the previous subsections, either common factors or factor loadings were observed. In either case, the choice of factors was determined *a priori*. Presently we consider a factor model determined from the time series of returns directly. Our previous results give an indication of an attractive set of factors; namely, the eigenportfolios of (3.22) which we include again here for ease.

For $\hat{\Sigma}\in\mathbb{R}^{N\times N}$, the sample covariance matrix of the observed time series, $\mathbf{R}$, given in (10.10), we denote the eigenvalues and eigenportfolios of $\hat{\Sigma}$ as

$$\lambda_1\ge\cdots\ge\lambda_N\ge 0$$

and

$$e_1,\dots,e_N,$$

respectively. Using the same dimension reduction technique exhibited in (3.25), we may choose  $K$  eigenportfolios explaining some prescribed fraction,  $\tau$ , of the total variance as  $\{e_k\}_{k=1}^K$ . The statistical common factor for this choice is then

$$\mathbf{f}_T=\begin{pmatrix} e'_1r_T \\ \vdots \\ e'_Kr_T \end{pmatrix}, \tag{10.16}$$

with  $r_T=(r_{T,1},\dots,r_{T,N})'$ , and notation reflecting the time window dependence of the definition.

Exactly as in the time series factor model approach given previously, factor loadings per security may be found via (10.1) using these  $K$  statistical factors, and the resulting factor model covariance is identical to that given in the time series with common factors.

One feature of the factors used in this statistical approach is that  $f_k$  and  $f_j$  are orthogonal by construction, with a key benefit being that attribution of returns is non-overlapping. Of course, this feature may be achieved with any set of chosen factors using stepwise regressions. That exercise is left to the reader.
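A minimal sketch of the statistical factor construction follows, assuming simulated returns driven by two latent factors. The function name `pca_factors` and the default $\tau = 0.9$ are choices made for this illustration only.

```python
import numpy as np

def pca_factors(R, tau=0.9):
    """Return statistical factors built from the K leading eigenportfolios,
    with K the smallest number explaining at least a fraction tau of variance."""
    S = np.cov(R, rowvar=False)
    lam, E = np.linalg.eigh(S)                 # eigenvalues in ascending order
    lam, E = lam[::-1], E[:, ::-1]             # reorder so lam[0] >= lam[1] >= ...
    explained = np.cumsum(lam) / lam.sum()
    K = min(int(np.searchsorted(explained, tau)) + 1, len(lam))
    E_K = E[:, :K]                             # N x K, columns e_1, ..., e_K
    return R @ E_K, E_K, lam[:K]               # row t holds (e_1' r_t, ..., e_K' r_t)

rng = np.random.default_rng(3)
T, N = 500, 20
common = rng.normal(size=(T, 2)) @ rng.normal(size=(2, N)) * 2.0
R = common + 0.3 * rng.normal(size=(T, N))

F_stat, E_K, lam_K = pca_factors(R, tau=0.9)
C = np.cov(F_stat, rowvar=False)               # sample covariance of the factors
```

The sample covariance of the resulting factors is diagonal, with the retained eigenvalues on the diagonal, which is the orthogonality property described above.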

### 10.2 Shrinkage Estimators

In the previous section, methods for developing well structured alternatives to the sample covariance were exhibited. An alternative stance would be to retain some information content in the sample covariance itself; viz., using a convex combination of the sample and, say, a time series model with observed common factors. Ledoit and Wolf [19] provide an elegant solution to just such an approach.

For a sample covariance matrix,  $\hat{\Sigma}\in\mathbb{R}^{N\times N}$ , based on returns  $\mathbf{r}$  as in (10.6) with  $T$  observations and a structured covariance matrix alternative estimate (as, for example, in the preceding sections),  $\hat{\Omega}$ , the *shrinkage estimator for the covariance* is defined by

$$\Sigma_s=(1-\alpha)\hat{\Sigma}+\alpha\hat{\Omega}. \tag{10.17}$$

As Ledoit and Wolf note, determining  $\alpha\in[0,1)$  is the technically challenging part. We will follow their approach here.

Throughout, we will continue to use notation for the entries of a given covariance matrix with row-and-column subscripts; viz., the $ij^{\text{th}}$ entry of $\hat{\Sigma}$ ($\hat{\Omega}$) will be denoted $s_{ij}$ ($\hat{\omega}_{ij}$), and the $i^{\text{th}}$ diagonal element will be denoted as $s_i^2$ ($\hat{\omega}_i^2$).

To begin, we define the Frobenius norm on an  $N \times N$  matrix,  $A$ , with entries  $a_{ij}$ , as,

$$\|A\|_F^2 = \sum_{i=1}^N \sum_{j=1}^N a_{ij}^2. \quad (10.18)$$

One may show (and the exercise is left to the reader) that, for symmetric $A$,

$$\|A\|_F^2 = \text{tr}(A^2) = \sum_{i=1}^N \lambda_i^2, \quad (10.19)$$

where  $\{\lambda_i\}_{i=1}^N$  are the eigenvalues of  $A$ . If we assume that the true covariance of returns is denoted by  $\Sigma$ , we may define a loss function in  $\alpha$  by

$$L(\alpha) = \|(1-\alpha)\hat{\Sigma} + \alpha\hat{\Omega} - \Sigma\|_F^2. \quad (10.20)$$

That is, we would like to minimize the distance in Frobenius norm between the shrinkage estimator,  $\Sigma_s$ , as a function of  $\alpha$ , and  $\Sigma$ . This loss function is a random variable, and so we focus on the expectation,

$$\mathbb{E}(L(\alpha)) = R(\alpha) = \mathbb{E}\left(\|(1-\alpha)\hat{\Sigma} + \alpha\hat{\Omega} - \Sigma\|_F^2\right). \quad (10.21)$$
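To make the loss concrete, the sketch below evaluates (10.20) on a grid of $\alpha$ values in a simulation where the true covariance is known. The target used here, a diagonal matrix of sample variances, is a hypothetical choice for illustration only, as are the function names and the simulated data.

```python
import numpy as np

def frob2(A):
    """Squared Frobenius norm, as in (10.18)."""
    return float((A ** 2).sum())

def loss(alpha, S_hat, Omega_hat, Sigma):
    """Loss (10.20) for a given shrinkage intensity alpha."""
    return frob2((1 - alpha) * S_hat + alpha * Omega_hat - Sigma)

rng = np.random.default_rng(4)
N, T = 30, 40
Sigma = 0.5 * np.ones((N, N)) + 0.5 * np.eye(N)   # true covariance, known only in simulation
X = rng.multivariate_normal(np.zeros(N), Sigma, size=T)
S_hat = np.cov(X, rowvar=False)
Omega_hat = np.diag(np.diag(S_hat))               # hypothetical structured target

grid = np.linspace(0.0, 1.0, 101)
losses = np.array([loss(a, S_hat, Omega_hat, Sigma) for a in grid])
alpha_best = grid[losses.argmin()]                # loss-minimizing alpha for this draw
```

Since `alpha_best` depends on the particular sample drawn, it is itself random; this is why the text works with the expected loss $R(\alpha)$ rather than $L(\alpha)$ directly.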

As in a previous exercise, we may see that

$$\begin{aligned} R(\alpha) = \sum_{i=1}^N \sum_{j=1}^N \Big[ &(1-\alpha)^2 \text{Var}(s_{ij}) + \alpha^2 \text{Var}(\hat{\omega}_{ij}) + 2\alpha(1-\alpha) \text{Cov}(\hat{\omega}_{ij}, s_{ij}) \\ &+ \alpha^2 (\omega_{ij} - \sigma_{ij})^2 \Big], \end{aligned}$$

where  $\mathbb{E}(\hat{\Omega}) = \Omega$ , with entries  $\omega_{ij}$  and similarly with  $\Sigma$  and  $\sigma_{ij}$ .

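Taking a first derivative of $R(\alpha)$, each pair $(i,j)$ under the double sum contributes $-2(1-\alpha)\text{Var}(s_{ij}) + 2\alpha\text{Var}(\hat{\omega}_{ij}) + 2(1-2\alpha)\text{Cov}(\hat{\omega}_{ij}, s_{ij}) + 2\alpha(\omega_{ij} - \sigma_{ij})^2$. Setting the sum to zero, collecting the coefficients of $\alpha$, and using $\text{Var}(\hat{\omega}_{ij}) + \text{Var}(s_{ij}) - 2\text{Cov}(\hat{\omega}_{ij}, s_{ij}) = \text{Var}(\hat{\omega}_{ij} - s_{ij})$, the optimal $\alpha$ is found to be

$$\alpha^* = \frac{\sum_{i=1}^N \sum_{j=1}^N \left(\text{Var}(s_{ij}) - \text{Cov}(\hat{\omega}_{ij}, s_{ij})\right)}{\sum_{i=1}^N \sum_{j=1}^N \left(\text{Var}(\hat{\omega}_{ij} - s_{ij}) + (\omega_{ij} - \sigma_{ij})^2\right)}. \quad (10.22)$$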

This, of course, is elegant but not ready for application as  $\Omega$  and  $\Sigma$  – and thus  $\omega_{ij}$  and  $\sigma_{ij}$  – are not known. A consistent estimator for  $\alpha^*$  is needed, and, indeed, Ledoit and Wolf establish that

$$\hat{\alpha}^* = \frac{1}{T} \frac{p - q}{c}, \quad (10.23)$$

for $p$ and $c$ defined as

$$\begin{aligned} p &= \sum_{i=1}^N \sum_{j=1}^N \frac{1}{T} \sum_{t=1}^T ((r_{i,t} - \hat{\mu}_i)(r_{j,t} - \hat{\mu}_j) - s_{ij})^2, \\ c &= \sum_{i=1}^N \sum_{j=1}^N (\hat{\omega}_{ij} - s_{ij})^2. \end{aligned}$$

The form of $p$ and $c$ does not depend on the choice of structured covariance matrix, $\Omega$. The calculation of $q$, being a consistent estimator for the term $\sum_{i=1}^N\sum_{j=1}^N\text{Cov}(\hat{\omega}_{ij},s_{ij})$, is dependent on this choice, however. In addition, it is unlikely, but possible, that $\hat{\alpha}^*$ may not be in $[0, 1]$. In these cases, a simple truncation is needed in practice; that is, $\hat{\alpha}^*=\max\left(0,\min\left(\frac{1}{T}\frac{p-q}{c},1\right)\right)$.
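The pieces of (10.17) and (10.23) can be assembled as in the sketch below, which takes $q$ as an input because it depends on the target. Several choices here are assumptions made for illustration: the function name `shrink`, the divisor $T$ for $s_{ij}$, the sum over all pairs $(i,j)$ in $p$, and the hypothetical diagonal target used in the usage lines, for which $q$ is taken to be the diagonal part of $p$.

```python
import numpy as np

def shrink(R, Omega_hat, q):
    """Combine sample covariance and target via (10.17), with intensity (10.23)."""
    T = R.shape[0]
    X = R - R.mean(axis=0)
    S = X.T @ X / T                              # s_ij with divisor T (assumption)
    prod = X[:, :, None] * X[:, None, :]         # T x N x N cross products per date
    p = ((prod - S) ** 2).mean(axis=0).sum()     # summed over all pairs (i, j)
    c = ((Omega_hat - S) ** 2).sum()
    alpha = (p - q) / (T * c)
    alpha = max(0.0, min(alpha, 1.0))            # truncate to [0, 1]
    return (1 - alpha) * S + alpha * Omega_hat, alpha, p, c

rng = np.random.default_rng(5)
R = rng.normal(size=(200, 8)) @ np.diag(np.linspace(0.5, 1.5, 8))
X = R - R.mean(axis=0)
S = X.T @ X / len(R)

# Hypothetical target: keep sample variances, set all covariances to zero.
Omega_hat = np.diag(np.diag(S))
# For this target only the diagonal entries co-vary with the sample, so q is
# taken as the diagonal part of p (an assumption made for the illustration).
q = ((X ** 2 - np.diag(S)) ** 2).mean(axis=0).sum()

Sigma_s, alpha, p, c = shrink(R, Omega_hat, q)
```

The truncation matters in practice: an extreme value of $q$ in either direction pushes the raw ratio outside $[0,1]$, and the estimator then collapses to the target or to the sample covariance, respectively.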

We consider two structured covariance matrices in turn: first, a constant correlation target; and, second, a CAPM-based single factor model.

### 10.3 Constant Correlation Target

In balancing structure and data, an exemplar of a shrinkage target focusing on simplicity is that of a constant pairwise correlation matrix; viz.,  $\omega_{ij}=\bar{\rho}\sigma_i\sigma_j$  if  $i\neq j$ , otherwise  $\omega_{ii}=\sigma_i^2$ , using the notation in the present section. Here,  $\bar{\rho}$  is simply the average

$$\bar{\rho}=\frac{2}{(N-1)N}\sum_{i=1}^{N-1}\sum_{j=i+1}^N\rho_{ij},$$

where  $\rho_{ij}$  is the usual correlation statistic between returns  $r_i$  and  $r_j$ . The sample of this average is given as one might expect as

$$\hat{\rho}=\frac{2}{(N-1)N}\sum_{i=1}^{N-1}\sum_{j=i+1}^N\hat{\rho}_{ij},$$

which results in population and sample covariance matrix targets,

$$\Omega=\begin{pmatrix} \sigma_1^2 & \cdots & \bar{\rho}\sigma_1\sigma_N \\ \vdots & \ddots & \vdots \\ \bar{\rho}\sigma_1\sigma_N & \cdots & \sigma_N^2 \end{pmatrix},\quad \hat{\Omega}=\begin{pmatrix} s_1^2 & \cdots & \hat{\rho}s_1s_N \\ \vdots & \ddots & \vdots \\ \hat{\rho}s_1s_N & \cdots & s_N^2 \end{pmatrix},$$

respectively.

Under these assumptions, Ledoit and Wolf [20] show that

$$\begin{aligned} q = {} &\frac{1}{T}\sum_{t=1}^T\sum_{i=1}^N\left((r_{i,t}-\hat{\mu}_i)^2-s_i^2\right)^2 \\ &+\frac{1}{T}\sum_{i=1}^N\sum_{j=1,j\neq i}^N\frac{\bar{\rho}}{2}\left(\frac{s_j}{s_i}\zeta_{ii,ij}+\frac{s_i}{s_j}\zeta_{jj,ij}\right), \end{aligned}$$

where

$$\begin{aligned} \zeta_{ii,ij} &= \sum_{t=1}^T\left((r_{i,t}-\hat{\mu}_i)^2-s_i^2\right)\left((r_{i,t}-\hat{\mu}_i)(r_{j,t}-\hat{\mu}_j)-s_{ij}\right) \\ \zeta_{jj,ij} &= \sum_{t=1}^T\left((r_{j,t}-\hat{\mu}_j)^2-s_j^2\right)\left((r_{i,t}-\hat{\mu}_i)(r_{j,t}-\hat{\mu}_j)-s_{ij}\right). \end{aligned}$$
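The target and the quantity $q$ for the constant correlation case can be computed as in the following sketch. It is an illustration under stated assumptions: sample moments use the divisor $T$, the sample average correlation $\hat{\rho}$ stands in for $\bar{\rho}$ in $q$, and the function name and simulated data are hypothetical.

```python
import numpy as np

def constant_correlation_target(R):
    """Return the constant correlation target, q, and the average correlation."""
    T, N = R.shape
    X = R - R.mean(axis=0)
    S = X.T @ X / T                                # s_ij with divisor T (assumption)
    s = np.sqrt(np.diag(S))
    corr = S / np.outer(s, s)
    rho = (corr.sum() - N) / (N * (N - 1))         # average off-diagonal correlation
    Omega = rho * np.outer(s, s)
    np.fill_diagonal(Omega, s ** 2)

    dev = X[:, :, None] * X[:, None, :] - S        # (r_i - mu_i)(r_j - mu_j) - s_ij
    dvar = X ** 2 - s ** 2                         # (r_i - mu_i)^2 - s_i^2
    zeta_ii = np.einsum('ti,tij->ij', dvar, dev)   # zeta_{ii,ij}
    zeta_jj = np.einsum('tj,tij->ij', dvar, dev)   # zeta_{jj,ij}
    ratio = np.outer(1.0 / s, s)                   # ratio[i, j] = s_j / s_i
    off = 0.5 * rho * (ratio * zeta_ii + ratio.T * zeta_jj)
    np.fill_diagonal(off, 0.0)                     # the second sum excludes j = i
    q = (dvar ** 2).sum() / T + off.sum() / T
    return Omega, q, rho

rng = np.random.default_rng(6)
T, N = 300, 6
R = 0.5 * rng.normal(size=(T, 1)) @ np.ones((1, N)) + rng.normal(size=(T, N))
Omega, q, rho = constant_correlation_target(R)
```

By construction the target preserves the sample variances on its diagonal, and every off-diagonal entry implies the same correlation, `rho`.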

### 10.4 Shrinking to CAPM

For the oft-revisited CAPM framework,

$$r_{i,t} - r_f = \alpha_i + \beta_i(m_t - r_f) + \epsilon_{i,t},$$

the structured covariance matrix is of the familiar form

$$\Omega = \sigma_m^2 \beta \beta' + \mathbf{D},$$

with estimator,

$$\hat{\Omega} = s_m^2 \hat{\beta} \hat{\beta}' + \hat{\mathbf{D}},$$

where  $\beta = (\beta_1, \beta_2, \dots, \beta_N)'$  (and similarly for  $\hat{\beta}$ ).

In this case, and again, Ledoit and Wolf [20] prove

$$q = \frac{1}{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{s_{jm}}{s_m^2} \zeta_{im,ij} + \frac{s_{im}}{s_m^2} \zeta_{jm,ij} - \frac{s_{im}s_{jm}}{s_m^2} \zeta_{m,ij}$$

where

$$\begin{aligned}\zeta_{im,ij} &= \sum_{t=1}^{T} ((r_{i,t} - \hat{\mu}_i)(m_t - \hat{\mu}_m) - \hat{\rho}_{im}s_i s_m) ((r_{i,t} - \hat{\mu}_i)(r_{j,t} - \hat{\mu}_j) - s_{ij}) \\ \zeta_{jm,ij} &= \sum_{t=1}^{T} ((r_{j,t} - \hat{\mu}_j)(m_t - \hat{\mu}_m) - \hat{\rho}_{jm}s_j s_m) ((r_{i,t} - \hat{\mu}_i)(r_{j,t} - \hat{\mu}_j) - s_{ij}) \\ \zeta_{m,ij} &= \sum_{t=1}^{T} ((m_t - \hat{\mu}_m)^2 - s_m^2)^2,\end{aligned}$$

with  $\hat{\rho}_{km}$  the sample correlation between return  $r_k$ , and the market return,  $m_t$ , and  $s_m^2$  the sample market variance.
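The CAPM-based target itself is simple to construct; only the corresponding $q$ requires the longer expressions above. The sketch below builds the single-factor target from simulated data, using the sample market variance in place of $\sigma_m^2$. The function name `capm_target`, the divisor $T$ in all sample moments, and the treatment of the risk-free rate as constant (so that it drops out after demeaning) are assumptions of the example.

```python
import numpy as np

def capm_target(R, m):
    """Single-factor target s_m^2 * beta beta' + D, all moments with divisor T."""
    T = R.shape[0]
    X = R - R.mean(axis=0)                   # demeaned stock returns, T x N
    z = m - m.mean()                         # demeaned market return, length T
    s_m2 = z @ z / T                         # sample market variance
    s_im = X.T @ z / T                       # sample covariances with the market
    beta = s_im / s_m2                       # OLS slopes (intercept absorbed by demeaning)
    resid = X - np.outer(z, beta)
    D = np.diag((resid ** 2).sum(axis=0) / T)
    return s_m2 * np.outer(beta, beta) + D, beta, s_im, s_m2

rng = np.random.default_rng(8)
T, N = 250, 5
m = rng.normal(0.0005, 0.01, size=T)                       # hypothetical market returns
beta_true = rng.uniform(0.5, 1.5, size=N)
R = np.outer(m, beta_true) + 0.005 * rng.normal(size=(T, N))

Omega_hat, beta_hat, s_im, s_m2 = capm_target(R, m)
S = np.cov(R, rowvar=False, bias=True)                     # sample covariance, divisor T
```

With a common divisor, the diagonal of this target coincides with the sample variances, while each off-diagonal entry equals $s_{im}s_{jm}/s_m^2$; shrinking toward it therefore alters only the covariances.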

#### 10.4.1 Some Comments on Shrinkage

Typically within this text, derivations of results take precedence, while in the establishment of the two example shrinkage targets above, results are shown without proof. These derivations, while direct, are outside the scope of this text; in particular, we have not covered distributional statements (including variance and covariance) for the general entries of a covariance matrix. However, the usefulness of the shrinkage estimator approach cannot be overstated. Using historical return data, one may show that general mean-variance optimization problems using the shrinkage estimator result in more attractive risk-adjusted returns than, say, using the sample covariance matrix alone with its known inadequacies or any of several standard factor model based approaches.

The interested reader is urged to consult the original reference papers and pursue the statistical background necessary for a full treatment. Further, a generalization of the results shown here is possible, particularly in the identification of  $q$  above when  $\Omega$  has a factor model structure as in (10.8), but this extension is left to the reader.

### 10.5 Constraints as Modifications to the Covariance

Given any covariance matrix as input to a portfolio optimization problem, we have already seen that investors often impose constraints to forcibly yield portfolios that reflect *a priori* position and exposure requirements; viz., no shortsales, diversification, and particular style or sector exposures. Jagannathan and Ma identify the role of some example constraints in acting like a shrinkage estimator for the sample covariance [14]. In a word, they establish why *imposing the wrong constraints helps*.

In this section, we extend the work of Jagannathan and Ma to the case of general constraints, with attention paid to the application of the method as a modifier in the sense of Black and Litterman [3]; i.e., as a way to express investor views via constraints to obtain modified means and covariances (either jointly or individually). In contrast to Black-Litterman, however, we allow for the expression of views as both excess expected returns as well as desired exposures.

We begin by considering the usual constrained mean-variance optimization problem, with input sample covariance matrix,  $S$ , and arbitrary return vector  $m$ :

$$\begin{array}{ll}\min_w & \frac{1}{2}w'Sw \\ \text{s.t.} & Aw = b \\ & Cw \le d \\ & m'w = \mu^*\end{array}\tag{10.24}$$

and assume that no matter the choice of covariance matrix,  $S$ , we have  $S \ge 0$ . The equality and inequality constraints –  $A$  and  $C$ , respectively – reflect the desired properties of the target portfolio; viz.,  $\beta$ -neutral, fully allocated, long or short constraints, or maximum exposure requirements, as we have seen in previous chapters.

The by-now familiar Lagrangian of (10.24) is given by

$$\mathcal{L}(w, \lambda, \eta) = \frac{1}{2}w'Sw + \lambda'(Aw - b) + \eta'(Cw - d) + \lambda_0(m'w - \mu^*)$$

with gradient

$$\nabla_w \mathcal{L} = Sw + A'\lambda + C'\eta + \lambda_0 m$$

and Karush-Kuhn-Tucker (KKT) conditions (both necessary and sufficient in this case) for the optimal solution given by

$$\begin{aligned}\nabla_w\mathcal{L}(\omega^*,\lambda^*,\eta^*) &= 0 \\ A\omega^* &= b \\ C\omega^* &\le d \\ m'\omega^* &= \mu^* \\ \eta^* &\ge 0 \\ \eta_i^*(C_i\omega^* - d_i) &= 0\end{aligned}\tag{10.25}$$

where  $C_i$  is the  $i$ th row of the matrix  $C$ .

The approach identifies a covariance matrix,  $\tilde{\Sigma}$ , based on  $S$ , such that the solution of (10.24) and the solution to the minimally constrained mean-constrained problem

$$\begin{aligned}\min_w & \frac{1}{2}w'\tilde{\Sigma}w \\ \text{s.t.} & \tilde{\mu}'w = \tilde{\mu}^*\end{aligned}\tag{10.26}$$

are the same.

The unique Lagrange multipliers of (10.25),  $(\lambda^*,\eta^*,\lambda_0^*)$ , and solution,  $\omega^*$  of (10.24) will remain fixed in the notation that follows. Further, it is also helpful to define

$$K = A'\lambda^* + C'\eta^*\tag{10.27}$$

$$\kappa = \lambda^{*\prime}b + \eta^{*\prime}d,\tag{10.28}$$

and notice $\omega^{*\prime}K = \kappa$.

Finally, define

$$\Delta = \frac{1}{\mu^*}(Km' + mK') - \frac{2\kappa}{\mu^{*2}}M\tag{10.29}$$

with  $M = mm'$ .

The following propositions hold:

**Proposition 10.5.1.** If  $\omega^*$  is the solution to (10.24),  $\omega^*$  is also a solution of (10.26) with  $\tilde{\mu} = m$  (and necessarily  $\tilde{\mu}^* = \mu^*$ ) and

$$\tilde{\Sigma} = S + \Delta\tag{10.30}$$

with  $\Delta$  as in (10.29).

**Proposition 10.5.2.** If  $\tilde{\Sigma}$  given by (10.30) is invertible, then we may identify  $\tilde{\Sigma}$  as the solution to a constrained maximum likelihood estimation problem with constraints motivated by (10.24) when returns are assumed to be iid normal and  $m$  and  $S$  are the sample mean and covariance, respectively.

**Proposition 10.5.3.** If  $S \ge 0$ , then so is  $\tilde{\Sigma}$ , and if  $S > 0$ , then  $\tilde{\Sigma}$  is positive definite on the feasible set defined by (10.24). Since  $\tilde{\Sigma}$  is symmetric, it is also therefore a covariance matrix.

The above results may be interpreted as a shrinkage estimator on the input covariance,  $S$ .

We next prove the above propositions by first establishing some preliminary maximum likelihood results. Assuming for the proof that returns are jointly normal,  $r \sim N(\mu, \Sigma)$ , and iid, the log likelihood function is

$$l(\mu, \Sigma) \propto -\frac{T}{2} \ln|\Sigma| - \frac{1}{2} \sum_{t=1}^{T} (r_t - \mu)' \Sigma^{-1} (r_t - \mu).$$

This may be written in  $\Lambda = \Sigma^{-1}$  when  $\Sigma$  is invertible:

$$l_0(\mu, \Lambda) = -l(\mu, \Lambda^{-1}). \quad (10.31)$$

The optimization of the log likelihood function may be informed by the constraints of both the (so-called) unconstrained and constrained problem above. We may formulate the constraints in (10.24) in  $\Lambda$  via the relationship

$$\omega^* = \tilde{\mu}^* (\tilde{\mu}' \tilde{\Sigma}^{-1} \tilde{\mu})^{-1} \tilde{\Sigma}^{-1} \tilde{\mu} \quad (10.32)$$

as

$$\begin{aligned} \tilde{\mu}^* A \Lambda \mu - (\mu' \Lambda \mu) b &= 0 \\ \tilde{\mu}^* C \Lambda \mu - (\mu' \Lambda \mu) d &\le 0. \end{aligned}$$

We arrive, finally, at a constrained maximum likelihood problem

$$\begin{aligned} \min_{\mu, \Lambda} \quad & l_0(\mu, \Lambda) \\ \text{s.t.} \quad & \tilde{\mu}^* A \Lambda \mu - (\mu' \Lambda \mu) b = 0 \\ & \tilde{\mu}^* C \Lambda \mu - (\mu' \Lambda \mu) d \le 0. \end{aligned} \quad (10.33)$$

For ease of exposition, we give the partials of  $l_0$  in each of  $\mu$  and  $\Lambda$  here:

$$\frac{\partial l_0}{\partial \mu} = T \Lambda \mu - T \Lambda \hat{\mu} \quad (10.34)$$

$$\begin{aligned} \frac{\partial l_0}{\partial \Lambda} &= T((- \Lambda^{-1} + \hat{\Sigma} + (\mu - \hat{\mu})(\mu - \hat{\mu})')) \\ & - \frac{T}{2} \text{diag}(- \Lambda^{-1} + \hat{\Sigma} + (\mu - \hat{\mu})(\mu - \hat{\mu})'). \end{aligned} \quad (10.35)$$

The Lagrangian of (10.33) is

$$\begin{aligned} \mathcal{L}(\mu, \Lambda, \xi, \delta) &= l_0(\mu, \Lambda) + \xi' (\tilde{\mu}^* A \Lambda \mu - (\mu' \Lambda \mu) b) \\ & + \delta' (\tilde{\mu}^* C \Lambda \mu - (\mu' \Lambda \mu) d), \end{aligned} \quad (10.36)$$

with partials in  $\mu$  and  $\Lambda$

$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\mu} = {} & T\Lambda\mu-T\Lambda\hat{\mu}+\tilde{\mu}^*\Lambda(A'\xi+C'\delta) \\ & -2(\xi'b+\delta'd)\Lambda\mu \end{aligned} \quad (10.37)$$

$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\Lambda} = {} & T(-\Lambda^{-1}+\hat{\Sigma}+(\mu-\hat{\mu})(\mu-\hat{\mu})') \\ & +\tilde{\mu}^*(A'\xi\mu'+\mu\xi'A+C'\delta\mu'+\mu\delta'C) \\ & -2\mu\mu'(\xi'b+\delta'd) \\ & -\frac{T}{2}\text{diag}(-\Lambda^{-1}+\hat{\Sigma}+(\mu-\hat{\mu})(\mu-\hat{\mu})') \\ & -\frac{\tilde{\mu}^*}{2}\text{diag}(A'\xi\mu'+\mu\xi'A+C'\delta\mu'+\mu\delta'C) \\ & +\text{diag}(\mu\mu'(\xi'b+\delta'd)). \end{aligned} \quad (10.38)$$

Begin by setting  $\tilde{\mu}^*=\mu^*$ , and consider, for  $l_m(\Lambda)=l_0(m,\Lambda)$ ,

$$\begin{aligned}\min_{\Lambda} \quad & l_m(\Lambda) \\ \text{s.t.} \quad & \mu^*A\Lambda m-(m'\Lambda m)b=0 \\ & \mu^*C\Lambda m-(m'\Lambda m)d\le 0. \end{aligned} \quad (10.39)$$

The Lagrangian is obtained exactly as in (10.36), replacing $\mu$ with $m$. Letting

$$\Delta(\xi,\delta)=\frac{\mu^*}{T}(A'\xi m'+m\xi'A+C'\delta m'+m\delta'C)-\frac{2}{T}M(\xi'b+\delta'd)$$

with $M=mm'$, we have that the KKT conditions for this problem are, for $\Omega^*=\Lambda^{*-1}$,

$$\begin{aligned}\Omega^* &= \hat{\Sigma}+\Delta(\xi^*,\delta^*)+(m-\hat{\mu})(m-\hat{\mu})' \\ \Lambda^* &\ \ \text{feasible in (10.39)} \\ \delta^* &\ge 0 \\ \delta_i^*(\mu^*C_i\Lambda^* m-(m'\Lambda^* m)d_i) &= 0.\end{aligned}$$

Next consider

$$\frac{1}{T}\xi^*=\frac{1}{\mu^{*2}}\lambda^*, \quad \frac{1}{T}\delta^*=\frac{1}{\mu^{*2}}\eta^*$$

with  $(\lambda^*,\eta^*)$  the optimal Lagrange multipliers from (10.24), where for now we assume that the sample mean and covariance have been used as inputs in the constrained mean-variance optimization problem; that is,  $S=\hat{\Sigma}$  and  $m=\hat{\mu}$  in (10.24). We next define

$$\tilde{\Sigma}=\hat{\Sigma}+\Delta\left(\frac{T}{\mu^{*2}}\lambda^*,\frac{T}{\mu^{*2}}\eta^*\right).$$

Notice that  $\Delta\left(\frac{T}{\mu^{*2}}\lambda^*,\frac{T}{\mu^{*2}}\eta^*\right)$  is the same  $\Delta$  in (10.29). In terms of  $K$  and  $\kappa$  from (10.27) and (10.28), we have

$$\tilde{\Sigma}=\hat{\Sigma}+\frac{1}{\mu^*}(Km'+mK')-\frac{2\kappa}{\mu^{*2}}M,$$

giving

$$\begin{aligned}\tilde{\Sigma}\omega^* &= \hat{\Sigma}\omega^*+\frac{1}{\mu^*}(Km'+mK')\omega^*-\frac{2\kappa}{\mu^{*2}}M\omega^* \\ &= -K-\lambda_0^*m+\frac{1}{\mu^*}(K\mu^*+m\kappa)-\frac{2\kappa}{\mu^*}m \\ &= \left(-\lambda_0^*-\frac{\kappa}{\mu^*}\right)m.\end{aligned}$$

Hence  $\omega^*$  satisfies the functional form required for the mean-constrained minimum variance problem in  $\tilde{\Sigma}$ . Since we know that  $m'\omega^*=\mu^*$ , we conclude that  $\omega^*$  is a solution to (10.26) proving Proposition 10.5.1 when  $S=\hat{\Sigma}$  and  $m=\hat{\mu}$  (so that  $\Omega^*=\tilde{\Sigma}$  in that case as well).

Now, if  $\tilde{\Sigma}$  is nonsingular, the preceding result implies that

$$\omega^*=\mu^*(\tilde{\mu}'\tilde{\Sigma}^{-1}\tilde{\mu})^{-1}\tilde{\Sigma}^{-1}\tilde{\mu}$$

as in (10.32). With this relationship, verifying the feasibility of $\Lambda^*=\tilde{\Sigma}^{-1}$ in the KKT conditions for the constrained maximum likelihood problem is straightforward. Similarly, since $\delta^*=\frac{T}{\mu^{*2}}\eta^*$, the nonnegativity and complementarity conditions are clear, verifying Proposition 10.5.2.

In the case of general  $S$  and  $m$ , the same construction obtains using the Lagrange multipliers from (10.24). In particular, for

$$\tilde{\Sigma}=S+\frac{1}{\mu^*}(Km'+mK')-\frac{2\kappa}{\mu^{*2}}M$$

the same results follow except that the modified covariance no longer coincides with that from the constrained maximum likelihood problem; i.e., Proposition 10.5.1 holds for general  $S$  and  $m$ . Notice, too, that the  $(m-\hat{\mu})(m-\hat{\mu})'$  term in the modified covariance is necessarily omitted - rather than being simply zero in the case of  $m=\hat{\mu}$  - as  $m$  is biased in the general case.

We are left to verify $\tilde{\Sigma}$ is a covariance matrix and do so in the general case of $m$ and $S$ as inputs to the original problem. Since symmetry is immediate, we are only left with the question of definiteness. We have, for arbitrary nonzero $w$, and $S \succeq 0$,

$$\begin{aligned}w'\tilde{\Sigma}w &= w'(S + \Delta)w \\&= w'Sw + \frac{2}{\mu^*}w'Km'w - \frac{2\kappa}{\mu^{*2}}w'Mw \\&= w'Sw + \frac{2}{\mu^*}w'(-S\omega^* - \lambda_0^*m)m'w - \frac{2\kappa}{\mu^{*2}}w'Mw \\&\quad \text{(by first-order KKT)} \\&= w'Sw - \frac{2}{\mu^*}w'S\omega^*m'w - \frac{2\lambda_0^*}{\mu^*}w'Mw - \frac{2\kappa}{\mu^{*2}}w'Mw \\&\ge w'Sw - \frac{2}{\mu^*}|w'S\omega^*| \cdot |m'w| - 2\left(\frac{\lambda_0^*}{\mu^*} + \frac{\kappa}{\mu^{*2}}\right)w'Mw \\&\ge w'Sw - \frac{2}{\mu^*}|w'Sw|^{1/2}|\omega^{*\prime} S\omega^*|^{1/2} \cdot |m'w| - 2\left(\frac{\lambda_0^*}{\mu^*} + \frac{\kappa}{\mu^{*2}}\right)w'Mw \\&\quad \text{(by Cauchy-Schwarz)} \\&= \left((w'Sw)^{1/2} - \frac{|m'w|}{\mu^*}(\omega^{*\prime} S\omega^*)^{1/2}\right)^2 \\&\quad - \frac{|m'w|^2}{\mu^{*2}}(\omega^{*\prime} S\omega^*) - 2\left(\frac{\lambda_0^*}{\mu^*} + \frac{\kappa}{\mu^{*2}}\right)w'Mw.\end{aligned}$$

Now, since

$$\omega^{*\prime} S\omega^* = -\lambda_0^*\mu^* - \kappa,$$

we have

$$\begin{aligned}\frac{|m'w|^2}{\mu^{*2}}(\omega^{*\prime} S\omega^*) &= \frac{|m'w|^2}{\mu^{*2}}(-\lambda_0^*\mu^* - \kappa) \\&= -\left(\frac{\lambda_0^*}{\mu^*} + \frac{\kappa}{\mu^{*2}}\right)w'Mw,\end{aligned}$$

so that

$$\begin{aligned}w'\tilde{\Sigma}w &\ge \left((w'Sw)^{1/2} - \frac{|m'w|}{\mu^*}(\omega^{*\prime} S\omega^*)^{1/2}\right)^2 - \left(\frac{\lambda_0^*}{\mu^*} + \frac{\kappa}{\mu^{*2}}\right)w'Mw \\&\ge \left((w'Sw)^{1/2} - \frac{|m'w|}{\mu^*}(\omega^{*\prime} S\omega^*)^{1/2}\right)^2 + \frac{1}{\mu^{*2}}(\omega^{*\prime} S\omega^*)w'Mw.\end{aligned}$$

Finally, then,

$$w'\tilde{\Sigma}w \ge 0,$$

as desired, with strict inequality on the original feasible set when  $\tilde{\Sigma} \succ 0$ , proving Proposition 10.5.3.

Notice also that for this choice of  $\tilde{\Sigma}$ , we have that the variance of the optimal portfolio in (10.26) is fixed; viz.,

$$\omega^{*\prime} \tilde{\Sigma} \omega^* = \omega^{*\prime} S \omega^*. \quad (10.40)$$

That is, the shrinkage estimator keeps the calculated variance of the optimal portfolio the same.

#### 10.5.1 Value Constraint Modification

Consider the inclusion of a value constraint in portfolio construction. Denote the set of stocks in the upper decile of EBIT/EV (the value anomaly seen in previous chapters) by  $\mathcal{E}$ .

Let  $c \in \mathbb{R}^N$  be defined by

$$c_i = \begin{cases} 1 & \text{if stock } i \text{ is in } \mathcal{E} \\ 0 & \text{otherwise} \end{cases} \quad (10.41)$$

so that a requirement that  $\nu > 0$  percent of portfolio holdings are in the top decile of EBIT/EV may be written as

$$c'w \ge \nu.$$

The resulting statistical arbitrage constrained mean-variance optimization problem is given by

$$\begin{array}{ll} \min_w & \frac{1}{2} w' S w \\ & -c'w \le -\nu \\ & m'w \ge \mu^*. \end{array} \quad (10.42)$$
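As a minimal numerical sketch of (10.42), suppose both inequality constraints bind at the optimum, so that the problem reduces to an equality-constrained quadratic program that can be solved directly from its KKT system. The function name `solve_binding_qp`, the toy inputs, and the sign convention for the multipliers (stationarity written as $S\omega^* = \eta c + \lambda m$) are assumptions made for illustration; in particular, `lam` need not share the sign of $\lambda_0^*$ used in the text.

```python
import numpy as np

def solve_binding_qp(S, c, m, nu, mu_star):
    """Minimize 0.5 * w'Sw subject to c'w = nu and m'w = mu_star.

    Assumes both constraints of the inequality problem bind and S is
    positive definite. Returns the weights and the two multipliers.
    """
    A = np.vstack([c, m])                     # 2 x N matrix of constraint gradients
    b = np.array([nu, mu_star])
    S_inv_At = np.linalg.solve(S, A.T)        # S^{-1} A'
    mult = np.linalg.solve(A @ S_inv_At, b)   # multipliers (eta, lam)
    w = S_inv_At @ mult                       # stationarity: S w = eta * c + lam * m
    return w, mult[0], mult[1]

# Toy inputs (hypothetical): three stocks, only the first is in the value set.
S = np.eye(3)
m = np.array([0.01, 0.02, 0.03])
c = np.array([1.0, 0.0, 0.0])
nu, mu_star = 0.5, 0.02

w_star, eta, lam = solve_binding_qp(S, c, m, nu, mu_star)
print("weights:", w_star)
print("eta:", eta, "lam:", lam)
```

If either multiplier came out negative, the corresponding inequality would not bind and the reduction to equalities would not be valid for that input.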

The modified covariance matrix obtained via (10.30) for this constraint is

$$\begin{aligned} \tilde{\Sigma} &= S + \Delta \\ &= S - \frac{\eta^*}{\mu^*} (c m' + m c') + \frac{2\eta^* \nu}{\mu^{*2}} M \end{aligned}$$

Assuming the constraint is binding and hence  $\eta^* > 0$ , the modified input variance of stock  $i$  becomes

$$\tilde{\sigma}_i^2 = s_i^2 - 2 \frac{\eta^*}{\mu^*} c_i m_i + \frac{2\eta^* \nu}{\mu^{*2}} m_i^2$$

The second-order correction arising from the  $M$  term results in an increase of each stock's input variance, proportional to that stock's input squared mean. For stocks in  $\mathcal{E}$ , however, variance is reduced by  $2\frac{\eta^*}{\mu^*}|m_i|$  if the stock's expected return is positive, and increased by the same amount otherwise. Stocks not in the upper decile do not have this linear correction in  $m_i$ . In the final analysis, the first-order term results in a model preference for high EBIT/EV names with positive expectation, while the second-order term punishes *ex ante* large expectations.
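The modified covariance for the value constraint can be assembled directly from its closed form. The sketch below does so and checks two consequences of that form: the diagonal matches the per-stock variance expression, and any portfolio satisfying both constraints with equality has the same variance under $S$ and $\tilde{\Sigma}$. The function name `value_modified_cov` and all numerical inputs are hypothetical. In particular, `eta` is simply posited rather than obtained from a solved optimization, so properties that rely on optimality, such as definiteness, are not illustrated here.

```python
import numpy as np

def value_modified_cov(S, c, m, eta, nu, mu_star):
    """S - (eta/mu*) (c m' + m c') + (2 eta nu / mu*^2) m m'."""
    M = np.outer(m, m)
    cross = np.outer(c, m) + np.outer(m, c)
    return S - (eta / mu_star) * cross + (2.0 * eta * nu / mu_star**2) * M

# Hypothetical inputs: five stocks, the first two in the value set.
rng = np.random.default_rng(0)
N = 5
S = np.cov(rng.standard_normal((60, N)), rowvar=False)
m = np.array([0.02, -0.01, 0.015, 0.005, -0.02])
c = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
eta, nu, mu_star = 0.4, 0.3, 0.01   # posited multiplier and targets

Sigma_tilde = value_modified_cov(S, c, m, eta, nu, mu_star)

# Per-stock variances from the scalar expression.
diag_formula = (np.diag(S)
                - 2.0 * (eta / mu_star) * c * m
                + (2.0 * eta * nu / mu_star**2) * m**2)

# Any w with c'w = nu and m'w = mu_star (here the minimum-norm one).
A = np.vstack([c, m])
w = np.linalg.pinv(A) @ np.array([nu, mu_star])
var_S = w @ S @ w
var_mod = w @ Sigma_tilde @ w
print(var_S, var_mod)
```

The variance equality follows because $w'\Delta w = -2\eta\nu + 2\eta\nu = 0$ whenever $c'w = \nu$ and $m'w = \mu^*$.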

| Covariance                              | $\alpha$ ($N = 100$) | $\alpha$ ($N = 1,000$) | $\beta_m$ ($N = 100$) | $\beta_m$ ($N = 1,000$) | $\beta_v$ ($N = 100$) | $\beta_v$ ($N = 1,000$) |
|-----------------------------------------|----------|----------|--------|--------|----------|----------|
| $S$                                     | 0.0026   | 0.0044   | 0.3982 | 0.2216 | 0.1096*  | 0.1662** |
| $\Sigma_{LW}$                           | 0.0025   | 0.0044   | 0.3781 | 0.1754 | 0.1119*  | 0.1864** |
| $S_{\mathcal{F}_1,\mathcal{E}}$         | 0.0037   | 0.0057   | 0.1886 | 0.1974 | 0.3822** | 0.3185** |
| $\Sigma_{LW,\mathcal{F}_1,\mathcal{E}}$ | 0.0035   | 0.0054   | 0.1462 | 0.1439 | 0.4099** | 0.3411** |

Table 10.1: Exposures of *ex post* returns using the covariances listed under the feasible set descriptions given in each panel, rebalancing monthly from October 1997 to July 2015. Each quantity is reported for  $N = 100$  and  $N = 1,000$ . Significance of the  $t$  tests is reported at the 1% and 5% levels:  $t$  tests for unmodified estimators are against the null hypothesis of  $\beta_v = 0$  (exposure to value) and indicated using (\*);  $t$  tests for modified estimators are one-sided tests of greater exposure against the relative unmodified estimator and indicated using ( $\star$ ). Statistical significance is only reported for value exposure.

In addition, it may be shown using market data that those portfolios constructed using  $\tilde{\Sigma}$  in the minimally constrained portfolio optimization problem have statistically significant exposure to the value factor that was used in its construction. That is, in the balance between structure and data, the method outlined in this section provides a tool to tailor exactly what exposures might be desired in the structured alternative. This may be seen in Table 10.1, where both sample and optimal shrinkage covariance matrices are modified using (10.30) under the constraint sets,

$$\begin{aligned}\mathcal{F}_1 &= \{w \mid 1'w = 1,\ w \ge 0\} \\ \mathcal{F}_{1,\mathcal{E}} &= \mathcal{F}_1 \cap \mathcal{E}_v.\end{aligned}$$

The data used to construct this table considers  $N = 100$  and  $N = 1,000$  stocks over 200 monthly dates from 10/31/1997 through 5/31/2014, and consists of the 1,000 largest domestic stocks on NYSE and AMEX by market capitalization each month, with a requirement that the share price was greater than \$5 at the close on each month end date. Sample covariance matrices,  $S$ , are constructed using 121 weeks of trailing returns at each monthly rebalance, and shrinkage estimators  $\Sigma_{LW}$  are constructed as in the previous section with a constant correlation target; modifications following the value-constraint modification established in (10.42) are denoted by  $S_{\mathcal{F}_1,\mathcal{E}}$  and  $\Sigma_{LW,\mathcal{F}_1,\mathcal{E}}$ , respectively.
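To make the two unmodified inputs concrete, the following sketch builds a sample covariance from 121 trailing observations and shrinks it toward a constant correlation target. It is run on simulated returns, not the market data described above. The shrinkage intensity `alpha` is a fixed, hypothetical value rather than the optimal intensity derived earlier in the chapter, and the weighting convention (`alpha` on the target, `1 - alpha` on the sample) is likewise an assumption.

```python
import numpy as np

def constant_correlation_target(S):
    """Target with the sample variances and one common correlation."""
    sd = np.sqrt(np.diag(S))
    corr = S / np.outer(sd, sd)
    N = S.shape[0]
    rho_bar = (corr.sum() - N) / (N * (N - 1))   # mean off-diagonal correlation
    return rho_bar * np.outer(sd, sd) + (1.0 - rho_bar) * np.diag(sd**2)

def shrink(S, F, alpha):
    """Convex combination of target F and sample covariance S."""
    return alpha * F + (1.0 - alpha) * S

# Simulated weekly returns with more stocks than observations.
rng = np.random.default_rng(1)
T, N = 121, 150
R = 0.03 * rng.standard_normal((T, N))

S = np.cov(R, rowvar=False)
F = constant_correlation_target(S)
Sigma = shrink(S, F, alpha=0.5)   # hypothetical intensity

print("rank of S:", np.linalg.matrix_rank(S), "of", N)
print("smallest eigenvalue of shrunk estimator:", np.linalg.eigvalsh(Sigma).min())
```

With fewer observations than stocks the sample covariance is singular, while the shrunk estimator in this example is positive definite, which is one practical reason to shrink before optimizing.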

From the table it may be seen that modifying each of the sample,  $S$ , and the shrinkage estimator using constant correlation target,  $\Sigma_{LW}$ , results in an *ex post* increase in exposure to value (measured by a regression on a proxy value factor) from 0.1096 and 0.1119 to 0.3822 and 0.4099, respectively, in the case of  $N = 100$  stocks. Similar results obtain in the case of  $N = 1,000$  stocks where the sample covariance matrix is underdetermined. These increases are statistically significant at the 1% level in each case.

That is, the empirical exercise bolsters the theoretical claim established by the derivations in (10.5.2).
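The exposures reported in Table 10.1 come from regressing *ex post* portfolio returns on factor returns. A minimal sketch of such a regression is given below, assuming the specification $r_t = \alpha + \beta_m f_{m,t} + \beta_v f_{v,t} + \epsilon_t$ with a market factor and a proxy value factor; the exact specification used for the table is not stated here, so this form is an assumption. The data are simulated, the function name `factor_exposures` is hypothetical, and the $t$ statistics shown test each coefficient against zero rather than reproducing the one-sided comparison between modified and unmodified estimators.

```python
import numpy as np

def factor_exposures(r_p, f_mkt, f_val):
    """OLS of portfolio returns on a constant, market, and value factor.

    Returns the coefficients (alpha, beta_m, beta_v) and their t statistics.
    """
    X = np.column_stack([np.ones_like(f_mkt), f_mkt, f_val])
    coef, *_ = np.linalg.lstsq(X, r_p, rcond=None)
    resid = r_p - X @ coef
    dof = len(r_p) - X.shape[1]
    sigma2 = resid @ resid / dof                       # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, coef / se

# Simulated monthly data with a known value exposure of 0.4.
rng = np.random.default_rng(2)
T = 200
f_mkt = rng.normal(0.005, 0.04, T)
f_val = rng.normal(0.002, 0.02, T)
r_p = 0.003 + 0.2 * f_mkt + 0.4 * f_val + rng.normal(0.0, 0.01, T)

coef, t_stats = factor_exposures(r_p, f_mkt, f_val)
print("alpha, beta_m, beta_v:", coef)
print("t statistics:", t_stats)
```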

### Exercises

1. For a portfolio with weights,  $w$ , what is the variance of  $w'r$  if the covariance of  $r$  is estimated using (10.9)?
2. Prove that the usual approach from OLS solves the matrix equation given in (10.11). What is the unbiased estimate of the residual covariance?
3. Explain why (10.13) holds.
4. Rigorously determine  $\hat{\mathbf{D}}$  for use in the cross-sectional model using GLS estimates in (10.13) using the method suggested in the text.
5. The common factors from the cross-sectional model given by (10.13) may be interpreted as factor returns of a constrained quadratic program. Consider

   $$\begin{array}{c} \min_w \frac{1}{2} w' \mathbf{D} w \\ \mathbf{B}' w = 1. \end{array} \quad (10.43)$$

   1. Interpret the constrained optimization problem above.
   2. Solve for  $w^*$ .
   3. Show that  $\hat{\mathbf{f}}_t = w^* r_t$  using the definition given in (10.13).

   When the additional constraint  $1'w = 1$  is included, the resulting portfolio is called the *factor mimicking portfolio*.

6. Prove that the statistical factors given in (10.16) are orthogonal.
7. Given a set of common factors,  $\{f_k\}_{k=1}^K$ , provide a constructive method for creating a set of pairwise orthogonal common factors  $\{\hat{f}_k\}_{k=1}^K$ . What concerns might you have for your approach, specifically related to time sensitivity?
8. Show that, for a symmetric matrix,  $A \in \mathbb{R}^{N \times N}$ , with eigenvalues  $\{\lambda_i\}_{i=1}^N$ , the Frobenius norm satisfies:
   1. $\|A\|_F^2 = \text{tr}(A^2)$ .
   2. $\|A\|_F^2 = \sum_{i=1}^N \lambda_i^2$ .
9. Find  $R'(\alpha)$  and  $R''(\alpha)$  for  $R(\alpha)$  defined in (10.21).
   1. Establish (10.22).
   2. Explain why  $\alpha^*$  is a unique minimum by examining  $R''(\alpha)$ .
10. Following the methodology in Chapter 2, select the fifty largest stocks in the cross-sectional and historical return data on the last date available. Use the full 121 weeks of returns available to:
    1. Shrinking to a constant correlation target, identify the shrinkage target, shrinkage intensity, and shrinkage estimator as established in this chapter.
    2. Compare the distribution of eigenvalues from the sample covariance matrix and your identified shrinkage target. Discuss any observations you may have.
11. Prove (10.32).
12. Prove Proposition (10.5.1).