# The Elements of Statistical Learning - Chapter 3 Exercises

## Exercise 3.1

Show that the $F$ statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding $z$-score (3.12).

### Solution

Without loss of generality, assume that the smaller model has had the final feature $\mathbf{x}_p$ removed. Let $\hat{\mathbf{y}}$ denote the least squares approximation for the larger model and $\hat{\mathbf{y}}^{\prime}$ be that for the smaller. We need to show that

\begin{equation}
    \frac{\text{RSS}_0 - \text{RSS}_1}{\text{RSS}_1 / (N-p-1)} = \frac{\hat{\beta}_p^2}{\hat{\sigma}^2 v_p},
\end{equation}

where $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T \mathbf{X})^{-1}$. 

First note that by definition $\hat{\sigma}^2 = \text{RSS}_1 / (N-p-1)$, where $\text{RSS}_1$ is the residual sum of squares of the larger model. Moreover, since $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$, their difference is orthogonal to any element of the column space. In particular, it is orthogonal to $\hat{\mathbf{y}} - \hat{\mathbf{y}}^{\prime}$, so

\begin{equation}
    \lVert \mathbf{y} - \hat{\mathbf{y}}^{\prime} \rVert^2
         = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2 
        + \lVert \hat{\mathbf{y}} - \hat{\mathbf{y}}^{\prime} \rVert^2
\end{equation}

\begin{equation}
    \Rightarrow \text{RSS}_0 - \text{RSS}_1
         = \lVert \mathbf{y} - \hat{\mathbf{y}}^{\prime} \rVert^2 
            - \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2
         = \lVert \hat{\mathbf{y}} - \hat{\mathbf{y}}^{\prime} \rVert^2.
\end{equation}

Now let $\mathbf{z}_0,\ldots ,\mathbf{z}_p$ denote the orthogonal basis of the column space of $\mathbf{X}$, obtained from $\mathbf{x}_0,\ldots,\mathbf{x}_p$ using the Gram-Schmidt process (Algorithm 3.1). The least squares estimates  $\hat{\mathbf{y}}$ and $\hat{\mathbf{y}}^{\prime}$ are the projections of $\mathbf{y}$ onto the column space of $\mathbf{X}$ and

\begin{equation}
    \text{span}(\{\, \mathbf{x}_j \,\mid\, 0\leq j\leq p-1 \,\}) = \text{span}(\{\, \mathbf{z}_j \,\mid\, 0\leq j\leq p-1 \,\})
\end{equation}

respectively. Since the $\mathbf{z}_j$ are orthogonal, this implies that

\begin{equation}
    \hat{\mathbf{y}} - \hat{\mathbf{y}}^{\prime} = \frac{\langle \mathbf{z}_p, \mathbf{y}\rangle}{\langle \mathbf{z}_p, \mathbf{z}_p\rangle}\mathbf{z}_p = \hat{\beta}_p \mathbf{z}_p.
\end{equation}

Putting these elements together, it just remains to show that $v_p = \lVert \mathbf{z}_p\rVert ^{-2}$. If $\mathbf{X} = \mathbf{Q}\mathbf{R}$ is the QR decomposition of $\mathbf{X}$ then $(\mathbf{X}^T \mathbf{X})^{-1} = \mathbf{R}^{-1}(\mathbf{R}^{-1})^T$. Since $\mathbf{R}$ is upper triangular, so is $\mathbf{R}^{-1}$, and the only non-zero entry in its last row is the diagonal entry $R_{pp}^{-1} = \lVert \mathbf{z}_p\rVert ^{-1}$. Hence the last diagonal element of $\mathbf{R}^{-1}(\mathbf{R}^{-1})^T$ is $R_{pp}^{-2} = \lVert \mathbf{z}_p\rVert ^{-2}$ and the claim follows.
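
As a numerical sanity check of this identity, here is a minimal Python sketch using NumPy. The simulated data, the seed, and the helper name `fit_rss` are illustrative assumptions and not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4
# Design matrix with an intercept column followed by p random features.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

def fit_rss(A, y):
    """Return (residual sum of squares, coefficients) of a least squares fit."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return r @ r, coef

rss1, beta_hat = fit_rss(X, y)        # larger model
rss0, _ = fit_rss(X[:, :-1], y)       # smaller model: last column dropped

sigma2_hat = rss1 / (N - p - 1)
v_p = np.linalg.inv(X.T @ X)[-1, -1]  # last diagonal element of (X^T X)^{-1}
z = beta_hat[-1] / np.sqrt(sigma2_hat * v_p)
F = (rss0 - rss1) / sigma2_hat

# R[-1, -1]^2 equals ||z_p||^2, so v_p should equal 1 / R[-1, -1]^2.
R = np.linalg.qr(X, mode="r")
print(F, z**2, v_p, 1.0 / R[-1, -1] ** 2)
```

The two printed pairs should agree up to floating point error.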

## Exercise 3.2

Given data on two variables $X$ and $Y$, consider fitting a cubic polynomial regression model $f(X) = \sum_{j=0}^3\beta_jX^j$. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

1. At each point $x_0$, form a 95% confidence interval for the linear function $a^T\beta = \sum_{j=0}^3 \beta_jx_0^j$.
2. Form a 95% confidence set for $\beta$ as in (3.15), which in turn generates confidence intervals for $f(x_0)$.

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.

### Solution

#### Construction of Confidence Intervals

Let $p=3$ and $\alpha=0.05$.

**1.** Write $x_0$ also for the vector $(1, x_0, x_0^2, x_0^3)^T$. We have $\hat{\beta}\sim\mathcal{N}(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$, so

\begin{equation}
    x_0^T(\hat{\beta}-\beta)\sim\mathcal{N}(0,x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\sigma^2).
\end{equation}

Let $v = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\in\mathbb{R}$. Then

\begin{equation}
    \frac{x_0^T(\hat{\beta} - \beta)}{\hat{\sigma}\sqrt{v}}\sim t_{N-p-1},
\end{equation}

where $\hat{\sigma}^2$ is the unbiased estimate of $\sigma^2$ on p.47. Therefore, a $100(1-\alpha)$% confidence interval for $f(x_0) = x_0^T\beta$ has endpoints

\begin{equation}
    \hat{f}(x_0) \pm \hat{\sigma}\sqrt{x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0}~t_{N-p-1,\alpha/2}
\end{equation}

where $\hat{f}(x_0) = x_0^T\hat{\beta}$ and $t_{N-p-1,\alpha/2}$ is the upper $\alpha/2$ critical value (the $1-\alpha/2$ quantile) of a $t$ distribution with $N-p-1$ degrees of freedom.

**2.** By the argument on p.49, an approximate 95% confidence set for $f(x_0)$ is the set of $x_0^T\beta$ such that $\beta$ lies in 

\begin{equation}
    C = \big\{\, \beta \,\big| \, \lVert \mathbf{X}(\beta - \hat{\beta})\rVert^2 \leq \hat{\sigma}^2\chi^2 \,\big\},
\end{equation}

where $\chi^2 = \chi_{p+1,\alpha}^2$ is the upper $\alpha$ critical value (the $1-\alpha$ quantile) of a chi-squared distribution with $p+1$ degrees of freedom.

First note that $C$ is an ellipsoid in $\mathbb{R}^{p+1}$. This implies that the restriction of the linear function $x_0^T\beta$ to $C$ achieves its maximum and minimum on the boundary $\partial C$ of $C$ and takes every value in between. In particular, $\{ x_0^T\beta\mid\beta\in C\}$ is an interval and the endpoints of this interval are the maximum and minimum of $x_0^T\beta$ subject to the constraint $\beta\in\partial C$, or equivalently

\begin{equation}
    \lVert \mathbf{X}(\beta - \hat{\beta})\rVert^2 = \hat{\sigma}^2\chi^2.
\end{equation}

We solve this problem using Lagrange multipliers. Let

\begin{equation}
    \mathcal{L}(\beta,\lambda) = x_0^T\beta - \lambda\left(\lVert \mathbf{X}(\beta - \hat{\beta})\rVert^2 - \hat{\sigma}^2\chi^2\right).
\end{equation}

This has gradient $\nabla \mathcal{L} = (\frac{\partial \mathcal{L}}{\partial \beta},\frac{\partial\mathcal{L}}{\partial \lambda})$ with

\begin{equation}
    \frac{\partial \mathcal{L}}{\partial \beta} = x_0 - 2\lambda\mathbf{X}^T(\mathbf{X}\beta - \mathbf{X}\hat{\beta}),
        \qquad \frac{\partial\mathcal{L}}{\partial \lambda} = -\left(\lVert \mathbf{X}(\beta - \hat{\beta})\rVert^2 - \hat{\sigma}^2\chi^2\right).
\end{equation}

Any solution to our optimisation problem will have $\nabla\mathcal{L}=0$. Setting the first partial derivative equation to zero gives

\begin{equation}
    \mathbf{X}^T\mathbf{X}(\beta-\hat{\beta}) = \frac{1}{2\lambda}x_0 
        \quad\Rightarrow\quad \beta = \hat{\beta} + \frac{1}{2\lambda} (\mathbf{X}^T\mathbf{X})^{-1}x_0.
\end{equation}

Setting $\frac{\partial\mathcal{L}}{\partial \lambda}=0$ and substituting this in gives

\begin{equation}
    \Big\lVert \frac{1}{2\lambda}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\Big\rVert^2 = \hat{\sigma}^2\chi^2
        \quad\Rightarrow\quad \frac{1}{2\lambda} = \pm \frac{\hat{\sigma}\chi}{\lVert \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\rVert}.
\end{equation}

So, since we know the optimisation problem has a maximum and a minimum, they must occur at

\begin{equation}
    \beta = \hat{\beta} \pm \hat{\sigma}\frac{(\mathbf{X}^T\mathbf{X})^{-1}x_0}{\lVert \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\rVert}\chi.
\end{equation}

Therefore, the confidence interval for $f(x_0)$ has endpoints

\begin{equation}
    \hat{f}(x_0) \pm \hat{\sigma}\frac{x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0}{\lVert \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\rVert}\chi_{p+1,\alpha}.
\end{equation}

#### Comparison

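The two intervals are centred at the same point and, since $\lVert \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\rVert^2 = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0$, both have half-width proportional to $\hat{\sigma}\sqrt{x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0}$. They differ only in the multiplier: $t_{N-p-1,\alpha/2}$ for the first method and $\chi_{p+1,\alpha} = \sqrt{\chi^2_{p+1,\alpha}}$ for the second.

The first method gives an exact interval (given the assumptions) for $f(x_0)$ at a single point $x_0$. The second is built from a confidence set for the whole vector $\beta$, so the resulting band holds simultaneously for all $x_0$ (approximately, since (3.15) replaces an $F$ distribution by a chi-squared one), and it pays for this with a larger multiplier. For $p=3$ and $\alpha=0.05$ we have $\chi_{4,\alpha}\approx 3.08$, whereas $t_{N-4,\alpha/2}\approx 2$ for moderate $N$, so we should expect the second band to be *wider* than the first by a factor of roughly $1.5$.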

#### Simulation
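
One possible small experiment is sketched below in Python with NumPy. Everything specific in it is an assumption made for illustration: the sample size $N=24$, the uniform design on $[-2,2]$, the true coefficients `beta_true`, the noise level, and the seed. The critical values $t_{20,0.975}\approx 2.086$ and $\chi^2_{4,0.95}\approx 9.488$ are hard-coded from standard tables to avoid extra dependencies. The sketch estimates the pointwise coverage (averaged over a grid) and the simultaneous coverage (the whole true curve inside the band on the grid) of both bands.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p, sigma = 24, 3, 1.0
beta_true = np.array([1.0, -0.5, 0.25, 0.1])  # hypothetical true coefficients
T_CRIT = 2.086      # t_{20, 0.975}, from standard tables (N - p - 1 = 20)
CHI2_CRIT = 9.488   # chi^2_{4, 0.95}, from standard tables (p + 1 = 4)

def design(x):
    """Cubic design matrix with columns 1, x, x^2, x^3."""
    return np.vander(x, p + 1, increasing=True)

x = rng.uniform(-2.0, 2.0, size=N)
X = design(x)
XtX_inv = np.linalg.inv(X.T @ X)

grid = np.linspace(-2.0, 2.0, 41)
G = design(grid)
f_true = G @ beta_true
# sqrt(x0^T (X^T X)^{-1} x0) for every grid point x0
se_factor = np.sqrt(np.einsum("ij,jk,ik->i", G, XtX_inv, G))

n_rep = 2000
point1 = point2 = simul1 = simul2 = 0.0
for _ in range(n_rep):
    y = X @ beta_true + sigma * rng.normal(size=N)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
    err = np.abs(G @ beta_hat - f_true)
    in1 = err <= T_CRIT * sigma_hat * se_factor              # method 1
    in2 = err <= np.sqrt(CHI2_CRIT) * sigma_hat * se_factor  # method 2
    point1 += in1.mean()
    point2 += in2.mean()
    simul1 += in1.all()
    simul2 += in2.all()

point1, point2, simul1, simul2 = (
    v / n_rep for v in (point1, point2, simul1, simul2)
)
width_ratio = np.sqrt(CHI2_CRIT) / T_CRIT
print(f"width ratio (method 2 / method 1): {width_ratio:.3f}")
print(f"pointwise coverage:    {point1:.3f} vs {point2:.3f}")
print(f"simultaneous coverage: {simul1:.3f} vs {simul2:.3f}")
```

Under these assumptions the second band is wider than the first by the constant factor `width_ratio` (about 1.48) at every point of the grid. The pointwise coverage of the first band should come out close to 95%, while only the second band is designed to contain the whole curve at once.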

## Exercise 3.3

Gauss–Markov theorem:

**(a)** Prove the Gauss–Markov theorem: the least squares estimate of a parameter $a^T \beta$ has variance no bigger than that of any other linear unbiased estimate of $a^T \beta$ (Section 3.2.2).

**(b)** The matrix inequality $\mathbf{B} \preccurlyeq \mathbf{A}$ holds if $\mathbf{A} - \mathbf{B}$ is positive semidefinite. Show that if $\hat{\mathbf{V}}$ is the variance-covariance matrix of the least squares estimate of $\beta$ and $\tilde{\mathbf{V}}$ is the variance-covariance matrix of any other linear unbiased estimate, then $\hat{\mathbf{V}} \preccurlyeq \tilde{\mathbf{V}}$.

### Solution

**(a)** Let $\theta = a^T\beta$. The least squares estimator $\hat{\beta}$ of $\beta$ satisfies $\hat{\beta}\sim\mathcal{N}(\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$ so the least squares estimate $\hat{\theta} = a^T\hat{\beta}$ of $\theta$ has

\begin{equation}
    \text{Var}(\hat{\theta}) = \sigma^2a^T(\mathbf{X}^T\mathbf{X})^{-1}a.
\end{equation}

Suppose $\tilde{\theta}=c^T\mathbf{y}$ is a linear unbiased estimator of $\theta$. Then

\begin{equation}
    \mathbf{y}\sim\mathcal{N}(\mathbf{X}\beta,\sigma^2I_N)
        \quad\Rightarrow\quad \tilde{\theta}\sim\mathcal{N}(c^T\mathbf{X}\beta, \sigma^2c^Tc).
\end{equation}

Since $\tilde{\theta}$ is unbiased for $\theta$,

\begin{equation}
    \text{E}(\tilde{\theta})=\theta
        \quad\Rightarrow\quad c^T\mathbf{X}\beta = a^T\beta
        \quad\Rightarrow\quad (c^T\mathbf{X} - a^T)\beta = 0.
\end{equation}

But this must hold for *every* $\beta\in\mathbb{R}^{p+1}$, so $c^T\mathbf{X} = a^T$. 

We have reduced the problem to showing that if $c^T\mathbf{X} = a^T$ then $c^Tc\geq a^T(\mathbf{X}^T\mathbf{X})^{-1}a$. We will establish this by showing that $a^T(\mathbf{X}^T\mathbf{X})^{-1}a$ is a global minimum for $\lVert c\rVert^2$ subject to the constraint $c^T\mathbf{X} = a^T$.

First, observe that a minimum exists: the feasible set is non-empty (consider $c = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}a$) and closed, and $\lVert c\rVert^2$ is continuous, bounded below by 0, and tends to infinity as $\lVert c\rVert\to\infty$. We will find it using Lagrange multipliers.

Let $\lambda\in\mathbb{R}^{p+1}$ be variables and define

\begin{equation}
    \mathcal{L}(c,\lambda) = \lVert c\rVert^2 - \lambda^T(\mathbf{X}^Tc - a).
\end{equation}

This satisfies

\begin{equation}
    \frac{\partial\mathcal{L}}{\partial c} = 2c - \mathbf{X}\lambda,
        \qquad \frac{\partial\mathcal{L}}{\partial \lambda} = -(\mathbf{X}^Tc - a).
\end{equation}

At an extremum of our optimisation problem both of these are zero, so

\begin{align}
    c = \frac{1}{2}\mathbf{X}\lambda
        & \quad\Rightarrow\quad \frac{1}{2}\mathbf{X}^T\mathbf{X}\lambda = a \\
        & \quad\Rightarrow\quad \lambda = 2(\mathbf{X}^T\mathbf{X})^{-1}a \\
        & \quad\Rightarrow\quad c = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}a \\
        & \quad\Rightarrow\quad \tilde{\theta} = c^T\mathbf{y} = \hat{\theta}.
\end{align}

Since the Lagrange conditions have a unique solution, this must be the global minimum of the constrained optimisation problem. In fact we have proved something slightly stronger: that $\hat{\theta}$ is the unique linear unbiased estimator with minimal variance.

**(b)** Our proof is analogous to that in part (a). The covariance matrix of $\hat{\beta}$ is $\text{Var}(\hat{\beta}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. If $\tilde{\beta} = \mathbf{A}\mathbf{y}$ is another linear unbiased estimator for $\beta$ then $\tilde{\beta} \sim \mathcal{N}(\mathbf{A}\mathbf{X}\beta, \sigma^2\mathbf{A}\mathbf{A}^T)$. Since $\tilde{\beta}$ is unbiased, $\mathbf{A}\mathbf{X}\beta = \beta$ for all $\beta\in\mathbb{R}^{p+1}$ and so $\mathbf{A}\mathbf{X} = \mathbf{I}_{p+1}$. Thus we have reduced the problem to showing that if $\mathbf{A}$ is a $(p+1)\times N$ matrix with $\mathbf{A}\mathbf{X} = \mathbf{I}_{p+1}$ then $\mathbf{A}\mathbf{A}^T - (\mathbf{X}^T\mathbf{X})^{-1}$ is positive semi-definite, or equivalently

\begin{equation}
    v^T\mathbf{A}\mathbf{A}^Tv \geq v^T(\mathbf{X}^T\mathbf{X})^{-1}v \quad \text{for all } v\in\mathbb{R}^{p+1}\setminus \{0\}.
\end{equation}

Again we treat this as a constrained optimisation problem and employ Lagrange multipliers. Fix $v\in\mathbb{R}^{p+1}$. The set $\{ \lVert\mathbf{A}^Tv\rVert^2 \mid \mathbf{A}\mathbf{X}=\mathbf{I}_{p+1}\}$ is non-empty (take $\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$), bounded below by zero, and closed so has a minimum.

Let $\Lambda$ be a $(p+1)\times(p+1)$ matrix of variables and define

\begin{equation}
    \mathcal{L}(\mathbf{A},\Lambda) = \lVert \mathbf{A}^T v\rVert^2 - \Lambda \cdot(\mathbf{A}\mathbf{X} - \mathbf{I}_{p+1}),
\end{equation}

where '$\cdot$' denotes the entrywise dot product $\Lambda\cdot\mathbf{B} = \sum_{i,j}\Lambda_{ij}B_{ij}$. This has

\begin{equation}
    \frac{\partial\mathcal{L}}{\partial \mathbf{A}} = 2vv^T\mathbf{A} - \Lambda\mathbf{X}^T,
        \qquad \frac{\partial\mathcal{L}}{\partial \Lambda} = -(\mathbf{A}\mathbf{X} - \mathbf{I}_{p+1}).
\end{equation}

At an extremum of the constrained optimisation problem both of these are zero, so

\begin{align}
    2vv^T\mathbf{A}\mathbf{X} = \Lambda\mathbf{X}^T\mathbf{X}
        & \quad\Rightarrow\quad 2vv^T  = \Lambda\mathbf{X}^T\mathbf{X} \\
        & \quad\Rightarrow\quad \Lambda = 2vv^T(\mathbf{X}^T\mathbf{X})^{-1} \\
        & \quad\Rightarrow\quad vv^T\mathbf{A} = vv^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T.
\end{align}

Multiplying each side on the right by its own transpose yields

\begin{align}
    vv^T\mathbf{A}\mathbf{A}^Tvv^T & = vv^T(\mathbf{X}^T\mathbf{X})^{-1}vv^T \\
    \left[v^T\mathbf{A}\mathbf{A}^Tv\right]vv^T & = \left[v^T(\mathbf{X}^T\mathbf{X})^{-1}v\right]vv^T \\
    v^T\mathbf{A}\mathbf{A}^Tv & = v^T(\mathbf{X}^T\mathbf{X})^{-1}v,
\end{align}

where the second line holds as both terms in square brackets are scalars and so commute with everything. Since the constrained optimisation problem has a minimum this must be it. This establishes the claim.
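
The inequality in part (b) can also be checked numerically. The following is a minimal sketch in Python with NumPy, under assumed simulated data: it builds a competing linear unbiased estimator $\tilde{\beta} = \mathbf{A}\mathbf{y}$ by adding to the least squares matrix a random matrix $\mathbf{C}$ with $\mathbf{C}\mathbf{X} = 0$, then inspects the smallest eigenvalue of $\mathbf{A}\mathbf{A}^T - (\mathbf{X}^T\mathbf{X})^{-1}$. The dimensions, seed, and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, q = 30, 4                       # q = p + 1 columns, including the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, q - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
A_ls = XtX_inv @ X.T               # least squares: beta_hat = A_ls @ y
H = X @ A_ls                       # hat matrix, projects onto the column space of X

# Any C with C @ X = 0 gives another linear unbiased estimator A @ y.
C = rng.normal(size=(q, N)) @ (np.eye(N) - H)
A = A_ls + C

# (Var(beta_tilde) - Var(beta_hat)) / sigma^2
diff = A @ A.T - XtX_inv
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2).min()
print(np.allclose(A @ X, np.eye(q)), min_eig)
```

The constraint $\mathbf{A}\mathbf{X} = \mathbf{I}_{p+1}$ should hold and the smallest eigenvalue should be non-negative up to rounding, consistent with $\hat{\mathbf{V}} \preccurlyeq \tilde{\mathbf{V}}$.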

## Exercise 3.4

Show how the vector of least squares coefficients can be obtained from a single pass of the Gram–Schmidt procedure (Algorithm 3.1). Represent your solution in terms of the $QR$ decomposition of $\mathbf{X}$.

### Solution

To obtain $\hat{\beta}$ from the Gram-Schmidt process add an extra step $2^\prime$. Take $j\in\{1,\ldots, p\}$ and suppose that for $k<j$ we have written

\begin{equation}
    \mathbf{z}_k = \mathbf{x}_k + \sum_{l=0}^{k-1}\hat{\delta}_{lk}\mathbf{x}_l.
\end{equation}

Then we define coefficients $\hat{\delta}_{kj}$ for $k=0,\ldots, j-1$ by

\begin{align}
    \mathbf{z}_j
        & = \mathbf{x}_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\mathbf{z}_k \\
        & = \mathbf{x}_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\left(\mathbf{x}_k + \sum_{l=0}^{k-1}\hat{\delta}_{lk}\mathbf{x}_l\right) \\
        & = \mathbf{x}_j + \sum_{k=0}^{j-1}\hat{\delta}_{kj}\mathbf{x}_k.
\end{align}

Let $\Delta$ be the matrix with $\Delta_{ij}=\hat{\delta}_{ij}$ for $i<j$, ones on the diagonal, and zeros elsewhere. By construction,

\begin{equation}
    \mathbf{Z} = \mathbf{X}\Delta \quad\Rightarrow\quad \Delta = \Gamma^{-1}.
\end{equation}

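By (3.31) and (3.32), $\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} = \Gamma^{-1}\mathbf{D}^{-2}\mathbf{Z}^T\mathbf{y} = \Delta\mathbf{D}^{-2}\mathbf{Z}^T\mathbf{y}$, where $\mathbf{D}$ is the diagonal matrix with $D_{jj} = \lVert\mathbf{z}_j\rVert$. Since $\Delta$ is upper triangular with ones on the diagonal, we can calculate $\hat{\beta}$ explicitly:

\begin{equation}
    \hat{\beta}_j = \frac{\langle\mathbf{z}_j,\mathbf{y}\rangle}{\langle\mathbf{z}_j,\mathbf{z}_j\rangle}+\sum_{k=j+1}^p\hat{\delta}_{jk}\frac{\langle\mathbf{z}_k,\mathbf{y}\rangle}{\langle\mathbf{z}_k,\mathbf{z}_k\rangle}.
\end{equation}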
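
The procedure can be sketched in Python with NumPy as follows. This is a minimal illustration rather than a numerically robust implementation: it uses classical Gram-Schmidt as in Algorithm 3.1, and the function name `gram_schmidt_ls` and the test data are assumptions. The matrix `Delta` is updated alongside `Z` so that `Z = X @ Delta` holds after every step, and the coefficients are recovered as $\Delta\mathbf{D}^{-2}\mathbf{Z}^T\mathbf{y}$.

```python
import numpy as np

def gram_schmidt_ls(X, y):
    """Least squares coefficients from one pass of Gram-Schmidt.

    Returns (beta_hat, Z, Delta) with Z = X @ Delta and Z having
    mutually orthogonal columns.
    """
    N, q = X.shape
    Z = np.zeros((N, q))
    Delta = np.eye(q)
    for j in range(q):
        Z[:, j] = X[:, j]
        for k in range(j):
            # Step 2: regress x_j on z_k.
            gamma = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            Z[:, j] -= gamma * Z[:, k]
            # Extra step 2': z_k = X @ Delta[:, k], so apply the same update.
            Delta[:, j] -= gamma * Delta[:, k]
    # <z_k, y> / <z_k, z_k> for each k, i.e. D^{-2} Z^T y
    coef = (Z.T @ y) / np.sum(Z * Z, axis=0)
    return Delta @ coef, Z, Delta

rng = np.random.default_rng(4)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

beta_gs, Z, Delta = gram_schmidt_ls(X, y)
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_gs, beta_ref))
```

For ill-conditioned designs a modified Gram-Schmidt or Householder QR would be a safer choice; the version above is kept close to the derivation.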

## Exercise 3.5

Consider the ridge regression problem (3.41). Show that this problem is equivalent to the problem

\begin{equation}
\hat{\beta}^c = \underset{\beta^c}{\text{argmin}} \Bigg\{ \sum_{i=1}^N \big[ y_i - \beta_0^c  - \sum_{j=1}^p (x_{ij}-\bar{x}_j)\beta_j^c\big]^2 + \lambda \sum_{j=1}^p (\beta_j^c)^2\Bigg\}.
\end{equation}

Give the correspondence between $\beta^c$ and the original $\beta$ in (3.41). Characterize the solution to this modified criterion. Show that a similar result holds for the lasso.

### Solution

Our solution is the same for both ridge regression and the lasso, since neither penalty involves the intercept. Since

\begin{equation}
    \beta_0^c + \sum_{j=1}^p(x_{ij} - \bar{x}_j)\beta_j^c 
        = \left( \beta_0^c - \sum_{j=1}^p \bar{x}_j\beta_j^c\right) + \sum_{j=1}^p x_{ij}\beta_j^c,
\end{equation}

the two minimisation problems are equivalent with 

\begin{equation}
    \beta_0^c = \beta_0 + \sum_{j=1}^p \bar{x}_j\beta_j = \frac{1}{N}\sum_{i=1}^N \hat{y}_i
\end{equation}

and $\beta_j^c = \beta_j$ for $j\neq 0$.

At the minimum, the derivative with respect to $\beta_0^c$ of the expression in braces is zero, so

\begin{align}
    \sum_{i=1}^N \left[ y_i - \beta_0^c  - \sum_{j=1}^p (x_{ij}-\bar{x}_j)\beta_j^c \right] = 0 \quad
        \Rightarrow \quad \left( \sum_{i=1}^N y_i \right) - N\beta_0^c - 0 = 0
\end{align}

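and thus the solution to the modified criterion has $\hat{\beta}_0^c = \bar{y}$. The remaining coefficients $\hat{\beta}_j^c = \hat{\beta}_j$, $j>0$, are then obtained by a ridge regression (or lasso) without intercept on the centred inputs $x_{ij}-\bar{x}_j$ and the centred response $y_i - \bar{y}$.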
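
The correspondence can be checked numerically for ridge regression. The sketch below, in Python with NumPy, solves the original problem with an unpenalised intercept through its normal equations and compares it with the centred problem. The data, the value of `lam`, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, lam = 40, 3, 2.5
X = rng.normal(loc=1.0, size=(N, p))   # deliberately not centred
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(size=N)

# Original problem (3.41): intercept column included but not penalised.
X1 = np.column_stack([np.ones(N), X])
P = np.diag([0.0] + [1.0] * p)         # penalty matrix, zero for the intercept
beta = np.linalg.solve(X1.T @ X1 + lam * P, X1.T @ y)

# Centred problem: intercept is the mean of y, slopes from centred inputs.
x_bar = X.mean(axis=0)
Xc = X - x_bar
beta0_c = y.mean()
beta_c = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y.mean()))

print(np.allclose(beta_c, beta[1:]))
print(np.isclose(beta0_c, beta[0] + x_bar @ beta[1:]))
```

Both comparisons should hold: the slopes coincide, and the centred intercept equals $\beta_0 + \sum_j \bar{x}_j\beta_j$.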

## Exercise 3.6

Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior $\beta\sim \mathcal{N}(0, \tau^2 \mathbf{I})$, and Gaussian sampling model $y\sim\mathcal{N}(\mathbf{X}\beta, \sigma^2\mathbf{I})$. Find the relationship between the regularization parameter $\lambda$ in the ridge formula, and the variances $\tau^2$ and $\sigma^2$.

### Solution

The posterior distribution for $\beta$ has density function

\begin{align}
    f_{\beta\mid Y_1, \ldots, Y_N}(\beta)
        & \propto f_{Y_1, \ldots, Y_N\mid\beta}(y_1,\ldots, y_N)f_{\beta}(\beta) \\
        & = \left(\prod_{i=1}^N f_{Y\mid\beta}(y_i)\right) f_{\beta}(\beta) \\
        & \propto \text{exp}\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - x_i^T\beta)^2 - \frac{1}{2\tau^2}\lVert \beta\rVert^2\right)\\
        & = \text{exp}\left( -\frac{1}{2\sigma^2}\left( \lVert\mathbf{y} - \mathbf{X}\beta\rVert^2 + \frac{\sigma^2}{\tau^2}\lVert\beta\rVert^2\right)\right).
\end{align}

So the mode of this distribution is

\begin{equation}
    \underset{\beta}{\text{argmax}} \left(f_{\beta\mid Y_1, \ldots, Y_N}(\beta)\right)
        = \underset{\beta}{\text{argmin}}\left(\lVert\mathbf{y} - \mathbf{X}\beta\rVert^2 + \frac{\sigma^2}{\tau^2}\lVert\beta\rVert^2\right).
\end{equation}

This is the ridge regression estimate with $\lambda = \sigma^2/\tau^2$. Moreover, since the posterior is Gaussian (p.64) the mean equals the mode.
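
As an independent check, the posterior mean can be computed from the standard Gaussian conditioning formula applied to the joint distribution of $(\beta, \mathbf{y})$, for which $\text{Cov}(\beta,\mathbf{y}) = \tau^2\mathbf{X}^T$ and $\text{Var}(\mathbf{y}) = \tau^2\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}$, and compared with the ridge estimate at $\lambda = \sigma^2/\tau^2$. The sketch below uses Python with NumPy; the data and the values of `sigma2` and `tau2` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 25, 3
sigma2, tau2 = 0.7, 2.0
X = rng.normal(size=(N, p))
beta = rng.normal(scale=np.sqrt(tau2), size=p)          # draw from the prior
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Ridge estimate with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mean E[beta | y] = Cov(beta, y) Var(y)^{-1} y.
cov_beta_y = tau2 * X.T
var_y = tau2 * X @ X.T + sigma2 * np.eye(N)
post_mean = cov_beta_y @ np.linalg.solve(var_y, y)

print(np.allclose(ridge, post_mean))
```

The two vectors should agree up to floating point error.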

## Exercise 3.7

Assume $y_i\sim\mathcal{N}(\beta_0+x_i^T \beta, \sigma^2)$, $i=1,2,\ldots,N$, and the parameters $\beta_j$, $j = 1,\ldots,p$ are each distributed as $\mathcal{N}(0, \tau^2)$, independently of one another. Assuming $\sigma^2$ and $\tau^2$ are known, and $\beta_0$ is not governed by a prior (or has a flat improper prior), show that the (minus) log-posterior density of $\beta$ is proportional to $\sum_{i=1}^N(y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^p \beta_j^2$ where $\lambda = \sigma^2/\tau^2$.

### Solution

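The argument is the same as in the solution to Exercise 3.6, except that $\beta_0$ has a flat prior and so contributes no factor to the prior density. The posterior density is therefore proportional to

\begin{equation}
    \text{exp}\left( -\frac{1}{2\sigma^2} \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 - \frac{1}{2\tau^2}\sum_{j=1}^p\beta_j^2\right),
\end{equation}

and its negative logarithm equals, up to an additive constant, $1/(2\sigma^2)$ times the stated expression with $\lambda = \sigma^2/\tau^2$.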

## Exercise 3.8

Consider the $QR$ decomposition of the uncentered $N \times (p + 1)$ matrix $\mathbf{X}$ (whose first column is all ones), and the SVD of the $N \times p$ centered matrix $\tilde{\mathbf{X}}$. Show that $Q_2$ and $U$ span the same subspace, where $Q_2$ is the sub-matrix of $Q$ with the first column removed. Under what circumstances will they be the same, up to sign flips?