## (1) Background

Consider the sum of a normal $Z_1 \sim N(\mu_1,\tau_1^2)$ and a truncated normal $Z_2 \sim TN(\mu_2,\tau_2^2,a,b)$, denoted $W=Z_1+Z_2$. How can this normal-truncated sum (NTS) distribution be characterized? One approach for doing inference on $W$ would be to simply sample from $Z_1$ and $Z_2$, as can be done with most statistical software. However, having an analytical or numerical method for calculating the density (PDF) and cumulative distribution (CDF) of $W$ has several benefits over the sampling approach, including vectorization, reproducibility across software programs, and faster compute times (especially for the tails of the distribution). This post will provide `python` code to calculate the PDF and CDF of an arbitrary NTS. The value of the distribution will be highlighted with three use cases: i) quality control, ii) two-stage hypothesis testing, and iii) data carving for selective inference.

This post will make extensive use of the theoretical results from [Arnold et al. (1993)](https://link.springer.com/article/10.1007/BF02294652) (hereafter Arnold) and [Kim (2006)](https://www.kss.or.kr/jounalDown.php?IDX=831) (hereafter Kim). Sections (2) and (3) are not original work, but instead show how to translate these research papers into usable code. In section (4), the two-stage hypothesis testing and data carving applications are original contributions in that they connect these statistical problems to the NTS distribution.

## (2) Integrating a bivariate normal CDF

The CDF of a bivariate normal (BVN) will turn out to be necessary to calculate the CDF of a NTS. While there is no closed-form solution to the CDF of a BVN (unless the point of evaluation is the origin), the integration problem can be quickly solved by numerical methods. Additionally, if the user is willing to accept a non-exact solution, there are approximation methods which yield an analytical solution that can be easily vectorized. This section will briefly review both approaches.

Following the notation of the literature, the orthant probability of a standard BVN[[^1]], $L(h,k,\rho)=P(X_1\geq h, X_2\geq k)$, is equivalent to solving the following integral introduced by [Sheppard](https://royalsocietypublishing.org/doi/10.1098/rsta.1899.0003) more than 120 years ago:

$$
\begin{align}
L(h,k,\rho) &= \frac{1}{2\pi\sqrt{1-\rho^2}} \int_h^\infty \int_k^\infty \exp \Bigg[ - \frac{x^2-2\rho x y + y^2}{2(1-\rho^2)} \Bigg] dx dy  \\
&= \frac{1}{2\pi} \int_{\arccos(\rho)}^{\pi} \exp \Bigg[ - \frac{h^2+k^2-2hk\cos(\theta)}{2\sin^2(\theta)} \Bigg] d\theta \tag{1}
\end{align}
$$
\label{eq:sheppard}

Note that the orthant probability can easily be related to the CDF: $F(h,k,\rho)=P(X_1\leq h, X_2\leq k) = L(h,k,\rho) + \Phi(h) + \Phi(k) - 1$. The CDF and PDF of a standard normal are denoted by $\Phi(\cdot)$ and $\phi(\cdot)$, respectively. What is interesting about $\eqref{eq:sheppard}$ is that the problem of integrating out both $X_1$ and $X_2$ can be reduced to a single variable of integration.
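
As a quick sanity check on $\eqref{eq:sheppard}$, the sketch below evaluates the single integral with `scipy.integrate.quad` and converts the orthant probability into a CDF. The function name `sheppard_orthant` and the evaluation point are my own choices for illustration. It also checks the origin, where the integrand is constant and the integral reduces to the known closed form $L(0,0,\rho)=1/4+\arcsin(\rho)/(2\pi)$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, multivariate_normal

def sheppard_orthant(h, k, rho):
    """L(h, k, rho) = P(X1 >= h, X2 >= k) for a standard BVN via Sheppard's integral."""
    def integrand(theta):
        s2 = np.sin(theta) ** 2
        return np.exp(-0.5 * (h**2 + k**2 - 2*h*k*np.cos(theta)) / s2) / (2 * np.pi)
    # One-dimensional integral over theta in [arccos(rho), pi]
    return quad(integrand, np.arccos(rho), np.pi)[0]

h, k, rho = 0.5, -0.3, 0.4  # illustrative values only

# Orthant probability -> CDF: F = L + Phi(h) + Phi(k) - 1
L_hk = sheppard_orthant(h, k, rho)
F_from_L = L_hk + norm.cdf(h) + norm.cdf(k) - 1
F_scipy = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).cdf([h, k])

# At the origin the integrand is constant, which gives a closed form
L_origin = sheppard_orthant(0.0, 0.0, rho)
closed_form = 0.25 + np.arcsin(rho) / (2 * np.pi)

print(f"F via Sheppard: {F_from_L:.6f}, F via scipy: {F_scipy:.6f}")
print(f"L(0,0,rho): {L_origin:.6f}, closed form: {closed_form:.6f}")
```

The two CDF values should agree up to the tolerance of `scipy`'s multivariate normal integrator.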

In `python`, `scipy.integrate.quad` can be used to quickly solve $\eqref{eq:sheppard}$. As will be shown, the method produces results that match those of the `scipy.stats.multivariate_normal.cdf` function. Even though there is no closed-form solution, [Cox (1991)](https://www.jstor.org/stable/1403446) showed that $L(\cdot)$ could be approximated using a simple Taylor expansion:

$$
\begin{align}
L(h,k,\rho) &= P(X_1 \geq h, X_2 \geq k) = P(X_1 \geq h) P(X_2 \geq k | X_1 \geq h) \\
&= \Phi(-h) E\Bigg[ \Phi\Bigg(\frac{\rho X_1 - k}{\sqrt{1-\rho^2}} \Bigg) \Bigg| X_1 \geq h \Bigg] \\
&\approx \Phi(-h) \Phi\Bigg( \frac{\rho\phi(h)/\Phi(-h)-k}{\sqrt{1-\rho^2}} \Bigg) \tag{2}
\end{align} 
$$
\label{eq:cox}
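
Because $\eqref{eq:cox}$ only involves $\phi$ and $\Phi$, it can be evaluated over whole arrays at once. The following is a minimal sketch (the name `cox_orthant` and the grid values are illustrative assumptions) that broadcasts over a grid of $h$, $k$, and $\rho$ in a single call. When $\rho=0$ the approximation is exact, since it collapses to $\Phi(-h)\Phi(-k)$, which gives a convenient check.

```python
import numpy as np
from scipy.stats import norm

def cox_orthant(h, k, rho):
    """Cox's approximation to L(h, k, rho); inputs broadcast as NumPy arrays."""
    h, k, rho = np.broadcast_arrays(
        *(np.asarray(v, dtype=float) for v in (h, k, rho)))
    tail = norm.sf(h)             # Phi(-h) = P(X1 >= h)
    mu_h = norm.pdf(h) / tail     # E[X1 | X1 >= h]
    xi = (rho * mu_h - k) / np.sqrt(1 - rho**2)
    return tail * norm.cdf(xi)

# A 50 x 50 grid of (h, k) pairs evaluated for three correlations in one call
h_grid = np.linspace(-2, 2, 50)[:, None, None]
k_grid = np.linspace(-2, 2, 50)[None, :, None]
rhos = np.array([-0.5, 0.0, 0.5])[None, None, :]
approx = cox_orthant(h_grid, k_grid, rhos)

print(approx.shape)  # one slice of orthant probabilities per correlation
```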

Cox showed that this approximation method works well for reasonable ranges of $\rho$ ($|\rho| < 0.9$). For users willing to trade off accuracy for speed, $\eqref{eq:cox}$ allows for rapid vectorization across different values of $\rho$, $h$, or $k$. The code block below defines a `BVN` class that can be used to calculate the orthant probability and perform sampling.

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import norm, truncnorm, chi2, t
from scipy.stats import multivariate_normal as MVN
from scipy.linalg import cholesky
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

import warnings
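# VisibleDeprecationWarning lives in np.exceptions from NumPy 1.25 onwards
warnings.filterwarnings("ignore", category=getattr(np, "exceptions", np).VisibleDeprecationWarning)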


class BVN():
    def __init__(self, mu, sigma, rho):
        """
        mu: array of means
        sigma: array of variances
        rho: correlation coefficient
        """
        if isinstance(mu,list):
            mu, sigma = np.array(mu), np.array(sigma)
        assert mu.shape[0]==sigma.shape[0]==2
        assert np.abs(rho) <= 1
        self.mu = mu.reshape([1,2])
        self.sigma = sigma.flatten()
        od = rho*np.sqrt(sigma.prod())
        self.rho = rho
        self.Sigma = np.array([[sigma[0],od],[od, sigma[1]]])
        self.A = cholesky(self.Sigma) # A.T.dot(A) = Sigma

    def rvs(self, size, seed=None):
        """
        size: number of samples to simulate
        seed: to pass onto np.random.seed
        """
        np.random.seed(seed)
        X = np.random.randn(size,2)
        Z = self.A.T.dot(X.T).T + self.mu
        return Z

    def sheppard(self, theta, h, k):
        return (1/(2*np.pi))*np.exp(-0.5*(h**2+k**2-2*h*k*np.cos(theta))/(np.sin(theta)**2))

    def orthant(self, h, k, method='scipy'):
        assert method in ['scipy','cox','sheppard']
        if isinstance(h,int) or isinstance(h, float):
            h, k = np.array([h]), np.array([k])
        else:
            assert isinstance(h,np.ndarray) and isinstance(k,np.ndarray)
        assert len(h) == len(k)
        # Calculate the number of standard deviations away it is        
        Y = (np.c_[h, k] - self.mu)/np.sqrt(self.sigma)
        Y1, Y2 = Y[:,0], Y[:,1]

        # (i) scipy: L(h, k)=1-(F1(h)+F2(k))+F12(h, k)
        if method == 'scipy':
            sp_bvn = MVN([0, 0],[[1,self.rho],[self.rho,1]])
            pval = 1+sp_bvn.cdf(Y)-(norm.cdf(Y1)+norm.cdf(Y2))
            return pval 
        
        if method == 'cox':
            mu_a = norm.pdf(Y1)/norm.cdf(-Y1)
            root = np.sqrt(1-self.rho**2)
            xi = (self.rho * mu_a - Y2) / root
            pval = norm.cdf(-Y1) * norm.cdf(xi)
            return pval

        if method == 'sheppard':
            pval = np.array([quad(self.sheppard, np.arccos(self.rho), np.pi, args=(y1,y2))[0] for y1, y2 in zip(Y1,Y2)])
            return pval

mu = np.array([1,2])
sigma = np.array([0.5,2])
dist_BVN = BVN(mu,sigma,rho=0.4)
h, k = 2, 3
pval_scipy = dist_BVN.orthant(h, k, method='scipy')[0]
pval_sheppard = dist_BVN.orthant(h, k, method='sheppard')[0]
pval_cox = dist_BVN.orthant(h, k, method='cox')[0]
nsim = 1000000
pval_rvs = dist_BVN.rvs(nsim)
pval_rvs = pval_rvs[(pval_rvs[:,0]>h) & (pval_rvs[:,1]>k)].shape[0]/nsim
methods = ['scipy','sheppard','cox','rvs']
pvals = [pval_scipy, pval_sheppard, pval_cox, pval_rvs]
pd.DataFrame({'method':methods,'orthant':pvals})

     method   orthant
0     scipy  0.040615
1  sheppard  0.040615
2       cox  0.040670
3       rvs  0.040866


Notice that the output results are identical using either the `multivariate_normal` class or directly calling integration solvers. Figure 1A below shows that Cox's method is very close to the numerical solution, although there are no definitive error bounds (in his paper, Cox found that the worst-case empirical error was around 10%). Although the figure is not shown, there is no improvement in the run time using the `sheppard` method over `scipy`'s built-in MVN distribution for a single estimate, and `scipy` is faster over a matrix of values. There are, however, substantial gains in run time when vectorizing across a matrix of $h,k$ values using Cox's method (Figure 1B).

<center><h4>Figure 1: BVN comparison </h4></center>
<table>
    <tr>
        <td> <center>A: Orthant probabilities   </center> </td>
        <td> <center>B: Vectorized run-times </center> </td>
    </tr>
    <tr>
        <td> <img src="figures/gg_pval.png" style="width: 80%"/> </td>
        <td> <img src="figures/gg_rvec.png" style="width: 90%"/> </td>
    </tr>
</table>

## (3) Deriving $f_W$ and $F_W$

The first step to characterize the NTS distribution is to understand the distribution of a truncated bivariate normal distribution (TBVN).

$$
\begin{align*}
(X_1, X_2) &\sim \text{TBVN}(\theta, \Sigma, \rho, a, b) \\
E([X_1,X_2]) &= [\theta_1, \theta_2] \\
V([X_1,X_2]) &= \begin{pmatrix} \sigma_1^2 & \rho \sigma_1\sigma_2 \\ \rho \sigma_1\sigma_2 & \sigma_2^2  \end{pmatrix} \\
X_1 &\in \mathbb{R}, \hspace{3mm} X_2 \in [a,b]
\end{align*}
$$

Notice that the TBVN takes the same formulation as a bivariate normal (a mean vector, a covariance matrix, and a correlation coefficient $\rho$) except that the random variable $X_2$ is bounded between $a$ and $b$. Arnold showed that the marginal density of the non-truncated random variable $X_1$ could be written as follows:

$$
\begin{align}
f_W(x) &= f_{X_1}(x) = \frac{\phi(m_1(x)) \Bigg[ \Phi\Big(\frac{\beta-\rho \cdot m_1(x)}{\sqrt{1-\rho^2}}\Big) - \Phi\Big(\frac{\alpha-\rho \cdot m_1(x)}{\sqrt{1-\rho^2}}\Big) \Bigg] }{\sigma_1\cdot Z} \tag{3} \\ 
m_1(x) &= (x - \theta_1)/\sigma_1  \\
\alpha &= \frac{a-\theta_2}{\sigma_2}, \hspace{3mm} \beta = \frac{b-\theta_2}{\sigma_2}  \\
Z &= \Phi(\beta) - \Phi(\alpha)
\end{align}
$$
\label{eq:arnold}
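
Equation $\eqref{eq:arnold}$ translates almost line for line into code. The sketch below (with a hypothetical function name `tbvn_marginal_pdf` and arbitrary illustrative parameters) checks that the density integrates to one, and compares $P(X_1 \leq 1)$ against a brute-force estimate that samples an untruncated BVN and discards draws with $X_2 \notin [a,b]$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def tbvn_marginal_pdf(x, theta1, theta2, sigma1, sigma2, rho, a, b):
    """Density of X1 when (X1, X2) is bivariate normal and X2 is truncated to [a, b]."""
    m1 = (x - theta1) / sigma1
    alpha, beta = (a - theta2) / sigma2, (b - theta2) / sigma2
    Z = norm.cdf(beta) - norm.cdf(alpha)
    root = np.sqrt(1 - rho**2)
    window = norm.cdf((beta - rho * m1) / root) - norm.cdf((alpha - rho * m1) / root)
    return norm.pdf(m1) * window / (sigma1 * Z)

# Arbitrary parameters, chosen only for illustration
pars = dict(theta1=1.0, theta2=-0.5, sigma1=2.0, sigma2=1.5, rho=0.6, a=-1.0, b=2.0)

# (i) The density should integrate to one
total = quad(lambda x: tbvn_marginal_pdf(x, **pars), -np.inf, np.inf)[0]

# (ii) Rejection sampling: draw from the BVN, keep rows with X2 inside [a, b]
rng = np.random.default_rng(0)
cov12 = pars['rho'] * pars['sigma1'] * pars['sigma2']
Sigma = [[pars['sigma1']**2, cov12], [cov12, pars['sigma2']**2]]
draws = rng.multivariate_normal([pars['theta1'], pars['theta2']], Sigma, size=200_000)
kept = draws[(draws[:, 1] >= pars['a']) & (draws[:, 1] <= pars['b'])]
p_emp = np.mean(kept[:, 0] <= 1.0)
p_quad = quad(lambda x: tbvn_marginal_pdf(x, **pars), -np.inf, 1.0)[0]

print(f"integral of density: {total:.6f}")
print(f"P(X1 <= 1): simulated {p_emp:.4f}, integrated {p_quad:.4f}")
```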

Kim showed that a NTS could be written as a TBVN by a simple change of variables:

$$
\begin{align*}
X_1 &= Z_1 + Z_2^u \\ 
X_2 &= Z_2^u, \hspace{3mm} X_2 \in [a,b] \\
Z_2^u &\sim N(\mu_2,\tau_2^2) \\
\theta &= [\mu_1 + \mu_2, \mu_2] \\
\Sigma &= \begin{pmatrix} \tau_1^2 + \tau_2^2 & \rho \sigma_1\sigma_2 \\ \rho \sigma_1\sigma_2 & \tau_2^2  \end{pmatrix} \\
\rho &= \sigma_2 / \sigma_1 = \tau_2/\sqrt{\tau_1^2 + \tau_2^2} \\
Z_1 + Z_2 &= W \sim NTS(\theta(\mu),\Sigma(\tau^2), a, b)
\end{align*}
$$

where $Z_2^u$ is the non-truncated version of $Z_2$. Clearly this specification of the TBVN is equivalent to the NTS, and the marginal density of $X_1$ is the PDF of $W$. Kim showed that the integral of $\eqref{eq:arnold}$ was equivalent to solving the orthant probabilities of a standard BVN:

$$
\begin{align}
F_W(x) &= F_{X_1}(x) = 1 - \frac{L(m_1(x),\alpha,\rho) - L(m_1(x),\beta,\rho)}{Z} \tag{4}
\end{align}
$$
\label{eq:kim}
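
To see that $\eqref{eq:kim}$ and the change of variables fit together, here is a small sketch that maps $(\mu,\tau,a,b)$ to $(\theta_1,\sigma_1,\rho,\alpha,\beta)$, evaluates the CDF through orthant probabilities, and compares it with direct simulation of $Z_1+Z_2$. The helper `nts_cdf` is an illustrative name and assumes finite $a$ and $b$; the parameter values are chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm, truncnorm, multivariate_normal

def nts_cdf(w, mu1, tau1, mu2, tau2, a, b):
    """CDF of W = Z1 + Z2 from equation (4), using Kim's change of variables."""
    theta1 = mu1 + mu2
    sigma1 = np.sqrt(tau1**2 + tau2**2)
    rho = tau2 / sigma1
    alpha, beta = (a - mu2) / tau2, (b - mu2) / tau2
    Z = norm.cdf(beta) - norm.cdf(alpha)
    m1 = (w - theta1) / sigma1
    bvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

    def orthant(h, k):
        # L(h, k, rho) = 1 - Phi(h) - Phi(k) + F(h, k, rho)
        return 1 - norm.cdf(h) - norm.cdf(k) + bvn.cdf([h, k])

    return 1 - (orthant(m1, alpha) - orthant(m1, beta)) / Z

params = (1.0, 1.0, 1.0, 2.0, -1.0, 4.0)  # mu1, tau1, mu2, tau2, a, b
mu1, tau1, mu2, tau2, a, b = params

# Direct simulation of the sum, with standardized bounds for truncnorm
rng = np.random.default_rng(1)
n_sim = 200_000
z1 = rng.normal(mu1, tau1, size=n_sim)
z2 = truncnorm.rvs((a - mu2) / tau2, (b - mu2) / tau2, loc=mu2, scale=tau2,
                   size=n_sim, random_state=rng)
p_sim = np.mean(z1 + z2 <= 2.0)
p_theory = nts_cdf(2.0, *params)

print(f"P(W <= 2): equation (4) {p_theory:.4f}, simulation {p_sim:.4f}")
```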

Hence, the PDF for a NTS can be solved analytically, and the CDF can be solved using either numerical methods for the exact univariate integral in $\eqref{eq:sheppard}$ or Cox's approximation method in $\eqref{eq:cox}$. The code block below provides the `python` code necessary to calculate $\eqref{eq:arnold}$ and $\eqref{eq:kim}$. The `NTS` class provides a quantile function $F^{-1}_W(p)=\sup \{w: F_W(w) \leq p\}$ that relies on bounded univariate optimization (`scipy.optimize.minimize_scalar` with `method='bounded'`, a Brent-type relative of [golden-section search](https://en.wikipedia.org/wiki/Golden-section_search)), querying the `cdf` method until convergence (I am sure there are more efficient ways to carry out this optimization, especially for vectorization).

In [2]:
class NTS():
    def __init__(self, mu, tau, a, b):
        """
        mu: array of means (mu1 for Z1, mu2 for Z2 before truncation)
        tau: array of standard deviations (tau1, tau2)
        a, b: lower and upper truncation bounds for Z2
        """
        assert mu.shape[0]==tau.shape[0]==2
        # Assign parameters
        self.mu, self.tau = mu.flatten(), tau.flatten()
        self.a, self.b = a, b
        # Truncated normal (Z2)
        self.alpha = (self.a - self.mu[1]) / self.tau[1]
        self.beta = (self.b - self.mu[1]) / self.tau[1]
        self.Z = norm.cdf(self.beta) - norm.cdf(self.alpha)
        self.Q = norm.pdf(self.alpha) - norm.pdf(self.beta)
        # Average will be unweighted combination of the two distributions
        self.mu_W = self.mu[0] + self.mu[1] + self.tau[1]*self.Q/self.Z
        # Distributions
        self.dist_X1 = norm(loc=self.mu[0], scale=self.tau[0])
        self.dist_X2 = truncnorm(a=self.alpha, b=self.beta, loc=self.mu[1], scale=self.tau[1])
        # W
        self.theta1 = self.mu.sum()
        self.theta2 = self.mu[1]
        self.sigma1 = np.sqrt(np.sum(self.tau**2))
        self.sigma2 = self.tau[1]
        self.rho = self.sigma2/self.sigma1

    def pdf(self, x):
        term1 = self.sigma1 * self.Z
        m1 = (x - self.theta1) / self.sigma1
        term2 = (self.beta-self.rho*m1)/np.sqrt(1-self.rho**2)
        term3 = (self.alpha-self.rho*m1)/np.sqrt(1-self.rho**2)
        f = norm.pdf(m1)*(norm.cdf(term2) - norm.cdf(term3)) / term1
        return f

    def cdf(self, x, method='scipy'):
        if isinstance(x, list):
            x = np.array(x)
        if isinstance(x, float) or isinstance(x, int):
            x = np.array([x])
        nx = len(x)
        m1 = (x - self.theta1) / self.sigma1
        bvn = BVN(mu=[0,0],sigma=[1,1],rho=self.rho)
        orthant1 = bvn.orthant(m1,np.repeat(self.alpha,nx),method=method)
        orthant2 = bvn.orthant(m1,np.repeat(self.beta,nx),method=method)
        return 1 - (orthant1 - orthant2)/self.Z

    def ppf(self, p):
        if isinstance(p, list):
            p = np.array(p)
        if isinstance(p, float) or isinstance(p, int):
            p = np.array([p])
        # Set up reasonable search bounds around the mean
        lb = self.mu_W - self.sigma1*4
        ub = self.mu_W + self.sigma1*4
        w = np.repeat(np.nan, len(p))
        for i, px in enumerate(p):
            tmp = float(minimize_scalar(fun=lambda w: (self.cdf(w)-px)**2,method='bounded',bounds=(lb,ub)).x)
            w[i] = tmp
        return w

    def rvs(self, n, seed=1234):
        r1 = self.dist_X1.rvs(size=n,random_state=seed)
        r2 = self.dist_X2.rvs(size=n,random_state=seed)
        return r1 + r2
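
As an alternative to minimizing the squared error in `ppf`, the quantile can be found by bracketing the root of $F_W(w)-p$. The sketch below uses `scipy.optimize.brentq` for this. To keep it self-contained, the CDF is computed by integrating $P(Z_1 \leq w - Z_2)$ over the truncated component with `quad` rather than through $\eqref{eq:kim}$; both that shortcut and the names `nts_cdf_quad` and `nts_ppf_root` are assumptions of this sketch, not part of the `NTS` class.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def nts_cdf_quad(w, mu1, tau1, mu2, tau2, a, b):
    """P(Z1 + Z2 <= w), integrating P(Z1 <= w - z2) over the truncated normal Z2."""
    alpha, beta = (a - mu2) / tau2, (b - mu2) / tau2
    Z = norm.cdf(beta) - norm.cdf(alpha)
    # u is the standardized value of Z2, restricted to [alpha, beta]
    integrand = lambda u: norm.cdf((w - mu1 - mu2 - tau2 * u) / tau1) * norm.pdf(u)
    return quad(integrand, alpha, beta)[0] / Z

def nts_ppf_root(p, mu1, tau1, mu2, tau2, a, b, width=10.0):
    """Quantile found by bracketing the root of F_W(w) - p."""
    centre = mu1 + mu2
    spread = width * np.sqrt(tau1**2 + tau2**2)
    return brentq(lambda w: nts_cdf_quad(w, mu1, tau1, mu2, tau2, a, b) - p,
                  centre - spread, centre + spread)

params = (1.0, 1.0, 1.0, 2.0, -1.0, 4.0)  # mu1, tau1, mu2, tau2, a, b
quantiles = {p: nts_ppf_root(p, *params) for p in (0.05, 0.5, 0.95)}
for p, q in quantiles.items():
    print(f"p={p:.2f}: quantile={q:.4f}, round trip={nts_cdf_quad(q, *params):.4f}")
```

Root finding works on the CDF directly, so the tolerance applies to $w$ itself rather than to a squared error that flattens out near the solution.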

Next, let's generate NTS data with the following parameters: $\mu_1=1$, $\tau_1=1$ (for $Z_1$), $\mu_2=1$, $\tau_2=2$, $a=-1$, and $b=4$ (for $Z_2$).

In [5]:
# Demonstrated with example
mu1, tau1 = 1, 1
mu2, tau2, a, b = 1, 2, -1, 4
mu, tau = np.array([mu1, mu2]), np.array([tau1,tau2])
dist_NTS = NTS(mu=mu,tau=tau, a=a, b=b)
n_iter = 100000
W_sim = dist_NTS.rvs(n=n_iter,seed=1)
mu_sim, mu_theory = W_sim.mean(),dist_NTS.mu_W
xx = np.linspace(-5*mu.sum(),5*mu.sum(),n_iter)
mu_quad = np.sum(xx*dist_NTS.pdf(xx)*(xx[1]-xx[0]))
methods = ['Empirical','Theory', 'Quadrature']
mus = [mu_sim, mu_theory, mu_quad]
# Compare
print(pd.DataFrame({'method':methods,'mu':mus}))

       method        mu
0   Empirical  2.292354
1      Theory  2.290375
2  Quadrature  2.290375


The output above compares the empirical mean of the simulated data with the theoretical mean, $E(W)=\mu_1+\mu_2+\tau_2[\phi(\alpha)-\phi(\beta)]/Z$ (the mean of $Z_1$ plus the mean of the truncated normal $Z_2$), as well as what we would estimate using a quadrature procedure with equation $\eqref{eq:arnold}$, as $E(W)=\int w f_W(w) dw \approx \sum w f_W(w) \Delta w$ for a small $\Delta w$. Next, we can compare the theoretical percentiles and quantiles against the empirical ones seen from `W_sim`. As expected, they are visually indistinguishable from each other.[[^2]]

<center><h4>Figure 2: NTS P-P & Q-Q plot </h4></center>
<img src="figures/gg_ppqq.png" style="width: 60%"/>

## (4.A) Application: Quality control

In some manufacturing processes, one of the components may go through a quality control procedure that removes items above or below a certain threshold. For example, this question was posed in a [1964 issue](https://www.jstor.org/stable/1266101?seq=1) of *Technometrics*:

> An item which we make has, among others, two parts which are assembled additively with regard to length. The lengths of both parts are normally distributed but, before assembly, one of the parts is subjected to an inspection which removes all individuals below a specified length. As an example, suppose that X comes from a normal distribution with a mean 100 and a standard deviation of 6, and Y comes from a normal distribution with a mean of 50 and a standard deviation of 3, but with the restriction that Y > 44. How can I find the chance that X + Y is equal to or less than a given value?

Subsequent answers focused on the value of $P(X + Y < 138) \approx 0.03276$. We can confirm this by using the `NTS` class.

In [8]:
mu1, tau1 = 100, 6
mu2, tau2, a, b = 50, 3, 44, np.inf
mu, tau = np.array([mu1, mu2]), np.array([tau1,tau2])
dist_A = NTS(mu=mu,tau=tau, a=a, b=b)
print('P(X+Y<138)=%0.5f' % dist_A.cdf(138)[0])

P(X+Y<138)=0.03276
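
As an independent cross-check that does not rely on the `NTS` class, the probability can also be computed by conditioning on $Y$: $P(X+Y\leq w)=E_Y[\Phi((w-Y-100)/6)]$, with the expectation taken over the truncated normal. The sketch below evaluates this one-dimensional integral and compares it with simulation; the seed and sample size are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, truncnorm

mu_x, sd_x = 100.0, 6.0               # X: untruncated part
mu_y, sd_y, lower = 50.0, 3.0, 44.0   # Y: normal restricted to Y > 44
w = 138.0

# Standardized truncation point and the share of parts that pass inspection
alpha = (lower - mu_y) / sd_y
Z = norm.sf(alpha)

# P(X + Y <= w) = E_Y[P(X <= w - Y)], integrating over the standardized Y
integrand = lambda u: norm.cdf((w - mu_y - sd_y * u - mu_x) / sd_x) * norm.pdf(u)
p_quad = quad(integrand, alpha, np.inf)[0] / Z

# Monte Carlo cross-check
rng = np.random.default_rng(1964)
n_sim = 500_000
x = rng.normal(mu_x, sd_x, size=n_sim)
y = truncnorm.rvs(alpha, np.inf, loc=mu_y, scale=sd_y, size=n_sim, random_state=rng)
p_mc = np.mean(x + y <= w)

print(f"quadrature: {p_quad:.5f}, simulation: {p_mc:.5f}")
```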


## (4.B) Application: Two-stage testing

In a [previous post](http://www.erikdrysdale.com/regression_trial/) I discussed a two-stage testing strategy designed to validate a machine learning regression model. The framework can be generalized as follows: 1) estimate an upper-bound on the mean of a Gaussian, 2) use this upper bound as the null hypothesis on a new sample of data.[[^3]] This is useful because rejecting the null in the second stage lets us say that the mean is *at most* some value. Several simplifying assumptions are made to make the analysis tractable: the data are [IID](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) and from the same normal distribution, $S, T \overset{iid}{\sim} N(\delta,\sigma^2)$, and $n$ and $m$ are large enough that the difference between the normal and student-t distributions is sufficiently small. The statistical pipeline is as follows.

1. Estimate the mean and variance on the first Gaussian sample: $(S_{1},\dots,S_{n}) \sim N(\hat{\delta}_S,\hat{\sigma}^2/n)$
2. Estimate the $1-\gamma$ quantile of this distribution: $\hat{\delta}_0=\hat{\delta}_S+n^{-1/2}\cdot\hat{\sigma}\cdot\Phi^{-1}_{1-\gamma}$ to set the null hypothesis:
$$
\begin{align*}
H_0&: \delta \geq \hat{\delta}_0 \\
H_A&: \delta < \hat{\delta}_0 \\
\end{align*}
$$
3. Estimate mean and variance on second Gaussian sample: $(T_{1},\dots,T_{m}) \sim N(\hat{\delta}_T,\hat{\sigma}^2/m)$
4. Calculate a one-sided test statistic: $\hat{s} = \sqrt{m} \cdot (\hat{\delta}_T - \hat{\delta}_0) / \hat{\sigma}$
$$
\begin{align*}
\text{Reject }H_0&: \hat{s} < t_\alpha
\end{align*}
$$

Remember that when the null is true, the estimate of the upper bound is too low because $\delta$ is larger than what we estimated ($\hat{\delta}_0$). Conversely, the null being false implies that we have successfully bounded the actual mean. If $\gamma < 0.5$, then $E(\hat{s})<0$ because $\hat{\delta}_0$ will tend to be above the true mean $\delta$. The statistical information provided by this procedure is two-fold. First, $\hat{\delta}_0$ has the property that it will be larger than the true mean $100(1-\gamma)$% of the time.[[^4]] Second, when the null is false (i.e. the bound holds) we will be able to compute the power in advance. Now, where does the NTS fit into this? In order to bound the type-I error rate to $\alpha$, we need to know the distribution of $\hat{s}$ when the null is true, i.e. conditional on $\delta > \hat{\delta}_0$:

$$
\begin{align*}
\hat{s} &= \frac{\hat{\delta}_T - \delta}{\hat{\sigma}_m} - \frac{\hat{\delta}_0 - \delta}{\hat{\sigma}_m} \hspace{2mm} \Big| \hspace{2mm} \hat{\delta}_0 < \delta  \\
&= N(0,1) - \frac{\hat{\delta}_S-\delta + \hat{\sigma}_n\cdot\Phi^{-1}_{1-\gamma}}{\hat{\sigma}_m} \hspace{2mm} \Big| \hspace{2mm} \hat{\delta}_S + \hat{\sigma}_n\cdot\Phi^{-1}_{1-\gamma} - \delta < 0 \\
&= N(0,1) + TN\big( - \sqrt{m/n} \cdot \Phi^{-1}_{1-\gamma}, m/n, 0, \infty  \big)
\end{align*}
$$

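Here $\hat{\sigma}_n=\hat{\sigma}/\sqrt{n}$ and $\hat{\sigma}_m=\hat{\sigma}/\sqrt{m}$ denote the standard errors of the two sample means. Hence the critical value $t_\alpha = F^{-1}_W(\alpha)$ can be found by inverting $\eqref{eq:kim}$ (i.e. the quantile function) with the following parameters from our original notation: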

$$
\begin{align*}
\mu &= \big[ 0, - \sqrt{m/n} \cdot \Phi^{-1}_{1-\gamma}\big] \\
\tau^2 &= \big[1, m/n \big] \\
\hat{s} | H_0 &\sim NTS(\theta(\mu), \Sigma(\tau^2), 0, \infty ) \\ 
\hat{s} | H_A &\sim NTS(\theta(\mu), \Sigma(\tau^2), -\infty, 0 )
\end{align*}
$$
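
Since these parameters involve only $m$, $n$, and $\gamma$, the test statistic should not change when $\delta$ or $\sigma$ do. A small sketch of steps 1 to 4 illustrates this by reusing the same standardized noise under two different $(\delta,\sigma)$ pairs. It assumes each stage uses its own sample standard deviation and the normal quantile; the function name `two_stage_stat` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def two_stage_stat(S, T, gamma):
    """Steps 1-4: upper bound from sample S, then the one-sided statistic on sample T."""
    n, m = len(S), len(T)
    delta_S, sigma_S = S.mean(), S.std(ddof=1)                        # step 1
    delta_0 = delta_S + sigma_S / np.sqrt(n) * norm.ppf(1 - gamma)    # step 2
    delta_T, sigma_T = T.mean(), T.std(ddof=1)                        # step 3
    s_hat = np.sqrt(m) * (delta_T - delta_0) / sigma_T                # step 4
    return delta_0, s_hat

rng = np.random.default_rng(0)
n, m, gamma = 100, 100, 0.01
# Standardized noise is shared so that only (delta, sigma) changes between runs
eps_S, eps_T = rng.standard_normal(n), rng.standard_normal(m)

results = []
for delta, sigma in [(2.0, 2.0), (-7.0, 0.3)]:
    d0, s_hat = two_stage_stat(delta + sigma * eps_S, delta + sigma * eps_T, gamma)
    results.append((d0, s_hat))
    print(f"delta={delta}, sigma={sigma}: delta_0={d0:.3f}, s_hat={s_hat:.6f}")
```

The upper bound $\hat{\delta}_0$ moves with $(\delta,\sigma)$, while $\hat{s}$ is unchanged up to floating-point error.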

Notice that the distribution of the NTS test statistic only depends on $m$, $n$, and $\gamma$. This means that the test statistic $\hat{s}$ is a [pivot](https://en.wikipedia.org/wiki/Pivotal_quantity) over all possible values of $\delta$ or $\sigma$ (which are [nuisance parameters](https://en.wikipedia.org/wiki/Nuisance_parameter)). As a result, the researcher can calculate the critical value $t_\alpha$ as well as estimate the power $1-\beta = P(\hat{s}<t_\alpha | H_A)$ in advance of the data. To the extent that there are degrees of freedom in selecting $m$, $n$, and $\gamma$, several trade-offs occur.

1. Smaller values of $\gamma$ increase power but lower the statistical information, since the average upper bound is higher
2. Higher values of $n$ (keeping $m$ and $\gamma$ fixed) reduce power but increase the statistical information, since the average upper bound is lower
3. Higher values of $m$ (keeping $n$ and $\gamma$ fixed) increase statistical power
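
The first trade-off can be illustrated numerically. The sketch below is a minimal, self-contained version of the power calculation: it computes the NTS CDF by integrating over the truncated component (a shortcut used here instead of $\eqref{eq:kim}$), finds $t_\alpha$ under $H_0$ with a root finder, and evaluates the power under $H_A$. The function names and the use of the normal (rather than student-t) quantile are assumptions of this sketch.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def nts_cdf(w, mu2, tau2, a, b):
    """CDF of N(0,1) + TN(mu2, tau2^2, a, b), integrating over the truncated part."""
    alpha, beta = (a - mu2) / tau2, (b - mu2) / tau2
    Z = norm.cdf(beta) - norm.cdf(alpha)
    integrand = lambda u: norm.cdf(w - mu2 - tau2 * u) * norm.pdf(u)
    return quad(integrand, alpha, beta)[0] / Z

def two_stage_power(n, m, gamma, alpha_level):
    """Critical value under H0 (truncation to [0, inf)) and power under HA ((-inf, 0])."""
    tau2 = np.sqrt(m / n)
    mu2 = -tau2 * norm.ppf(1 - gamma)
    # t_alpha solves F_W(t) = alpha_level under the null distribution
    t_alpha = brentq(lambda w: nts_cdf(w, mu2, tau2, 0, np.inf) - alpha_level, -20, 20)
    power = nts_cdf(t_alpha, mu2, tau2, -np.inf, 0)
    return t_alpha, power

for g in (0.01, 0.05, 0.20):
    t_alpha, power = two_stage_power(n=100, m=100, gamma=g, alpha_level=0.05)
    print(f"gamma={g:.2f}: t_alpha={t_alpha:.3f}, power={power:.3f}")
```

With $n=m=100$ and $\alpha=0.05$, the printed power falls as $\gamma$ rises from 0.01 to 0.20.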

If $m+n=k$ is fixed, then a trade-off can be made between the size of the upper bound from the first stage and the power in the second stage. The simulations below check that $\hat{s}$ is characterized by a NTS distribution, and examine how the power and statistical information of the test vary for the following parameters: $\delta=2$, $\sigma^2=4$, $n+m=200$, $\gamma=0.01$, and $\alpha=0.05$.

In [4]:
class two_stage():
    def __init__(self, n, m, gamma, alpha, pool=True, student=True):
        # Assign
        assert (n > 1) and (m >= 1) and (gamma > 0) & (gamma < 1)
        self.n, self.m, self.gamma = n, m, gamma
        self.alpha, self.pool = alpha, pool

        # Calculate degrees of freedom
        self.dof_S, self.dof_T = n - 1, m - 1
        if self.pool:
            self.dof_T = n + m - 1
        if student:
            self.phi_inv = t(df=self.dof_S).ppf(1-gamma)
        else:
            self.phi_inv = norm.ppf(1-gamma)
        mn_ratio = (self.dof_T+1)/(self.dof_S+1)
        mu_2stage = np.array([0, -np.sqrt(mn_ratio)*self.phi_inv])
        tau_2stage = np.sqrt([1, mn_ratio])
        self.H0 = NTS(mu=mu_2stage,tau=tau_2stage, a=0, b=np.inf)
        self.HA = NTS(mu=mu_2stage,tau=tau_2stage, a=-np.inf, b=0)
        self.t_alpha = self.H0.ppf(alpha)[0]
        self.power = self.HA.cdf(self.t_alpha)

    def rvs(self, nsim, delta, sigma2, seed=None):
        if seed is None:
            seed = nsim
        np.random.seed(seed)
        delta1 = delta + np.sqrt(sigma2/(self.dof_S+1))*np.random.randn(nsim)
        delta2 = delta + np.sqrt(sigma2/(self.dof_T+1))*np.random.randn(nsim)
        sigS = np.sqrt(sigma2*chi2(df=self.dof_S).rvs(nsim)/self.dof_S)
        sigT = np.sqrt(sigma2*chi2(df=self.dof_T).rvs(nsim)/self.dof_T)
        delta0 = delta1 + (sigS/np.sqrt(self.n))*self.phi_inv
        shat = (delta2 - delta0)/(sigT/np.sqrt(self.dof_T+1))
        df = pd.DataFrame({'shat':shat, 'd0hat':delta0})
        return df

delta, sigma2 = 2, 4
n, m = 100, 100
gamma, alpha = 0.01, 0.05
nsim = 10000000
p_seq = np.arange(0.01,1,0.01)
two_stage(n=n, m=m, gamma=gamma, alpha=alpha, pool=True).power[0]

# --- (A) CALCULATE N=M=100 PP-QQ PLOT --- #
dist_2s = two_stage(n=n, m=m, gamma=gamma, alpha=alpha, pool=True)
df_2s = dist_2s.rvs(nsim=nsim, delta=delta, sigma2=sigma2)
df_2s = df_2s.assign(Null=lambda x: x.d0hat < delta)
df_2s = df_2s.assign(reject=lambda x: x.shat < dist_2s.t_alpha)

qq_emp = df_2s.groupby('Null').apply(lambda x: pd.DataFrame({'pp':p_seq,'qq':np.quantile(x.shat,p_seq)}))
qq_emp = qq_emp.reset_index().drop(columns='level_1')
qq_theory_H0 = dist_2s.H0.ppf(p_seq)
qq_theory_HA = dist_2s.HA.ppf(p_seq)
tmp1 = pd.DataFrame({'pp':p_seq,'theory':qq_theory_H0,'Null':True})
tmp2 = pd.DataFrame({'pp':p_seq,'theory':qq_theory_HA,'Null':False})
qq_pp = qq_emp.merge(pd.concat([tmp1, tmp2]))
qq_pp = qq_pp.melt(['pp','Null'],['qq','theory'],'tt')

# --- (B) POWER AS GAMMA VARIES --- #
gamma_seq = np.round(np.arange(0.01,0.21,0.01),2)
power_theory = np.array([two_stage(n=n, m=m, gamma=g, alpha=alpha, pool=False).power[0] for g in gamma_seq])
ub_theory = delta + np.sqrt(sigma2/n)*t(df=n-1).ppf(1-gamma_seq)
power_emp, ub_emp = np.zeros(power_theory.shape), np.zeros(ub_theory.shape)
for i, g in enumerate(gamma_seq):
    tmp_dist = two_stage(n=n, m=m, gamma=g, alpha=alpha, pool=False)
    tmp_sim = tmp_dist.rvs(nsim=nsim, delta=delta, sigma2=sigma2)
    tmp_sim = tmp_sim.assign(Null=lambda x: x.d0hat < delta, 
                reject=lambda x: x.shat < tmp_dist.t_alpha)
    power_emp[i] = tmp_sim.query('Null==False').reject.mean()
    ub_emp[i] = tmp_sim.d0hat.mean()

tmp1 = pd.DataFrame({'tt':'theory','gamma':gamma_seq,'power':power_theory,'ub':ub_theory})
tmp2 = pd.DataFrame({'tt':'emp','gamma':gamma_seq,'power':power_emp,'ub':ub_emp})
dat_gamma = pd.concat([tmp1, tmp2]).melt(['tt','gamma'],None,'msr')

# --- (C) POWER AS N = K - M VARIES --- #

k = n + m
n_seq = np.arange(5,k,5)
dat_nm = pd.concat([pd.DataFrame({'n':nn,'m':k-nn,
    'power':two_stage(n=nn, m=k-nn, gamma=gamma, alpha=alpha, pool=True).power[0]},index=[nn]) for nn in n_seq])
dat_nm = dat_nm.reset_index(drop=True).assign(ub=delta + np.sqrt(sigma2/n_seq)*t(df=n_seq-1).ppf(1-gamma))
dat_nm = dat_nm.melt(['n','m'],None,'msr')

<center><h4>Figure 3: Two-stage test statistic </h4></center>
<table>
    <tr>
        <td> <center>A: PP-QQ plot   </center> </td>
        <td> <center>B: $\gamma$ variation </center> </td>
        <td> <center>C: $n,m$ variation </center> </td>
    </tr>
    <tr>
        <td> <img src="figures/gg_qp_2s.png" style="width: 80%"/> </td>
        <td> <img src="figures/gg_gamma.png" style="width: 80%"/> </td>
        <td> <img src="figures/gg_nm.png" style="width: 80%"/> </td>
    </tr>
</table>

The panel of plots in Figure 3 shows the trade-offs inherent in the choice of $m$, $n$, and $\gamma$. Because the actual and expected quantiles line up perfectly (Figure 3A), we can be confident that the NTS distribution correctly describes the two-stage test statistic.

## (4.C) Application: Data carving

The classical instruments of statistical inference like p-values and confidence intervals have probabilistic properties that rely on the assumption that the choice of model and null hypotheses are specified in advance of the data. In other words, which variables will be tested and the form of the null hypothesis are determined independently of any data. In the age of exploratory statistics and large datasets these assumptions are increasingly at odds with empirical practice. For example, researchers who use a [lasso](https://en.wikipedia.org/wiki/Lasso_(statistics)) model to find a sparse set of coefficients will often "investigate" the importance of the features by examining how frequently each variable gets selected by the procedure through bootstrapping or cross-validation.[[^5]] Unfortunately the traditional methods of doing inference on the coefficients of a regression model do not work for the lasso because which coefficients get selected is a function of the data. In other words, both the model and the null hypothesis are determined *ex post* rather than *a priori*.

In the framework of Fithian, Sun, and Taylor (see [*Optimal Inference After Model Selection*](https://arxiv.org/abs/1410.2597)), this amounts to a two-stage process: "1) the **selection stage** determines what questions to ask, and 2) the **inference stage** answers those questions." Once again echoing Fithian et al., this is moving from the non-adaptive to the adaptive selection paradigm (a.k.a. data snooping). Their paper addresses the following question: "what it means for inference to be valid in the presence of adaptive selection and to propose methods that achieve this selective validity."

<br> 
* * *

### Footnotes

[^1]: A standard BVN means that the means are zero, and covariance matrix has a value of one on the diagonals and $\rho$ on the off-diagonals.

[^2]: In reality, there are slight differences, they just cannot be seen on the figure.

[^3]: A lower bound can be studied just as easily as an upper bound.

[^4]: This is only true in the frequentist sense. We can say nothing about any one realization of $\hat{\delta}_0$ itself, only about the random variable $\delta_0$ in general.

[^5]: Here are just three examples of papers that use this frequency property found from the first page of a google search: [here](https://www.scirp.org/html/4-1240172_30157.htm), [here](https://dl.acm.org/doi/10.1145/1553374.1553431), and [here](https://www.frontiersin.org/articles/10.3389/fnagi.2016.00318/full).