# Bhattacharya 2009 JASA

## Problem

Bhattacharya (2009) studies inference on the value of a linear program.
The *population problem* is given by

$$
\max_p \gamma' p \text{ subject to } Ap = b, \; p \geq 0.
$$
Here, we have $m$ decision variables collected in the vector $p$; the matrix $A$ and the vector $b$ are known.

The analogous *sample problem* is given by
$$
\max_p \hat{\gamma}' p \text{ subject to } Ap = b, \; p \geq 0.
$$
The only difference is that $\gamma$ is replaced by an estimator $\hat{\gamma}$.
Importantly, the constraints are nonrandom. That is, the polyhedron $P$ over which we maximize is nonrandom.

In linear programs, if there is a solution, it is attained at an extreme point of the constraint set $P$.
Informally, extreme points of $P$ are points in $P$ that cannot be written as a convex combination of two other distinct points of $P$.
For example, if $P=[0,1]^2$, the extreme points are the set $\{(0,0), (1,0), (0,1), (1,1)\}$.

These extreme points correspond to the basic solutions of a linear program. 
Further, if the program has a solution it is attained at one of the *feasible* basic solutions.

Denote the set of *feasible* basic solutions by $S = \{z_1, \ldots, z_{|S|}\}$.
We have $|S| \leq {m \choose M}$, where $m$ is the number of decision variables and $M$ is the number of equality constraints.

<small>
The bound follows from the requirement that at every basic solution $m-M$ linearly independent constraints need to be active (and the symmetry of the binomial coefficient).
</small>

Hence, if we know $S$ we can simply think of maximizing over the finite set $S$.
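As a minimal sketch of this idea, consider the box $P=[0,1]^2$ from the example above, whose extreme points can be listed directly. The helper name `max_over_extreme_points` and the objective vectors are illustrative assumptions and do not come from the paper.

```python
from itertools import product

import numpy as np


def max_over_extreme_points(
    gamma: np.ndarray,
    points: np.ndarray,
) -> tuple[float, np.ndarray]:
    """Return the maximum of gamma'z over the rows of points and all maximizers."""
    values = points @ gamma
    best = values.max()
    return float(best), points[np.isclose(values, best)]


# Extreme points of [0, 1]^2: the four corners of the box.
corners = np.array(list(product([0.0, 1.0], repeat=2)))

# A generic objective has a unique maximizer among the corners.
value_unique, argmax_unique = max_over_extreme_points(np.array([1.0, -0.5]), corners)

# An objective that is flat in the second coordinate has two maximizers.
value_ties, argmax_ties = max_over_extreme_points(np.array([1.0, 0.0]), corners)
```

The second call illustrates the case that matters for the distribution theory: several extreme points attain the optimal value at the same time.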

## Distribution Theory

The main result is given in propositions 3 and 4.

We assume 
$$
\sqrt{n}(\hat{\gamma} - \gamma) \to_d w = O_p(1)
$$
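Let $v = \max_{z\in S}\gamma'z$ and $\hat{v} = \max_{z\in S}\hat{\gamma}'z$ denote the optimal population and sample values, and denote the set of optimal solutions as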
$$
\Theta_0 = \{z| z \in S, \gamma'z = v\}
$$
which is finite and has $J$ elements, so $\Theta_0 = \{z_1, \ldots, z_J\}$.


**Proposition 3** Assume $\sqrt{n}(\hat{\gamma}-\gamma)\to_d w$ and that elements of $\hat{\gamma}$ are bounded with probability 1. Then
$$
\sqrt{n}(\hat{v} - v) = \max_{z\in\Theta_0}\{w'z\} + o_p(1) = \max_{1\leq j\leq J}\{w'z_j\} + o_p(1).
$$

Note this distribution is not pivotal: it depends on an unknown parameter, namely the set of optimal solutions $\Theta_0$.
Proposition 4 then states that we can instead use an estimator of this set given by
$$
\hat{\Theta}_n = \{z^* \in S: \hat{\gamma}'z^* \geq \max_{z\in S} \hat{\gamma}'z - c_n\}
$$
These are all basic solutions whose sample value is within the tolerance $c_n$ of the best sample value.
$c_n > 0$ is a tuning parameter that needs to be chosen by the researcher.
The asymptotic theory requires $\sqrt{n}c_n \to \infty$ and $c_n \to 0$.

If there is only a single optimal solution $z_0$, so $|\Theta_0| = 1$, then the asymptotic distribution is given by the limit of $\sqrt{n}(\hat{\gamma} - \gamma)'z_0$.
Hence, if $w$ is normal, this is a linear combination of normal random variables with weights $z_0$. Generally, $z_0$ is unknown but can be consistently estimated.
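The single-solution case can be illustrated with a short sketch. It assumes that $w \sim N(0, \Sigma)$; the covariance matrix `cov_w` and the solution `z_0` below are hypothetical values chosen only for illustration. Under this assumption the limit $w'z_0$ is normal with variance $z_0'\Sigma z_0$.

```python
import numpy as np

# Hypothetical asymptotic covariance of sqrt(n) * (gamma_hat - gamma).
cov_w = np.array([[1.0, 0.3], [0.3, 2.0]])

# Hypothetical unique optimal basic solution.
z_0 = np.array([1.0, 1.0])

# Variance of the linear combination w'z_0 under normality of w.
asym_var = float(z_0 @ cov_w @ z_0)

# Monte Carlo check of the same variance.
rng = np.random.default_rng(0)
w = rng.multivariate_normal(np.zeros(2), cov_w, size=200_000)
sim_var = float(np.var(w @ z_0))
```

In practice `z_0` would be replaced by the sample optimal solution and `cov_w` by an estimate of the covariance.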

<small> Check the latter requirement is actually needed for proposition 4 (and maybe 3). </small>

## Similarity to pretesting

The approach above works by allowing for a slightly larger set of optimal solutions.
The non-standard behavior arises from multiple optimal solutions.
We can think of $\hat{\Theta}_n$ in terms of a pre-test of whether the sample optimal solution $\hat{z}$ is the unique solution.
Given some tolerance $c_n>0$, compare the sample value of $\hat{z}$ with that of the next-best basic solution:
- if the difference between the two values is too small, we cannot be sure that $\hat{z}$ is the unique optimal solution;
- if the difference is large enough, we can be fairly certain that $\hat{z}$ is the optimal solution.

### Example in m = 2 dimensions

### Example in m > 2 dimensions

## Procedure

- Step 1: Estimate the set of optimal solutions $\hat{\Theta}_n \equiv \{z^* \in S: \hat{\gamma}'z^* \geq \max_{z\in S} \hat{\gamma}'z - c_n\}$.
- Step 2: For each of $B$ draws $w_i$, $i = 1, \ldots, B$, from the (estimated) asymptotic distribution of $\sqrt{n}(\hat{\gamma} - \gamma)$:
  - Calculate $v_i := \max_{z\in \hat{\Theta}_n}\{w_i'z\}$.
- Step 3: Determine the quantiles of $\{v_i\}_{i=1}^B$.
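The three steps above can be sketched in a few lines for the maximization problem. This is a sketch under stated assumptions, not the paper's implementation: the function name `simulate_limit_quantiles` is hypothetical, the draws are assumed to be normal with an estimated covariance `cov_hat`, and the extreme points are assumed to be available as the rows of `points`.

```python
from itertools import product

import numpy as np


def simulate_limit_quantiles(
    gamma_hat: np.ndarray,
    cov_hat: np.ndarray,
    points: np.ndarray,
    c_n: float,
    n_draws: int,
    alpha: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Return the alpha/2 and 1-alpha/2 quantiles of the simulated limit."""
    # Step 1: keep all basic solutions within c_n of the best sample value.
    values = points @ gamma_hat
    theta_hat = points[values >= values.max() - c_n]

    # Step 2: draw w ~ N(0, cov_hat) and maximize w'z over theta_hat.
    w = rng.multivariate_normal(np.zeros(len(gamma_hat)), cov_hat, size=n_draws)
    v_draws = (w @ theta_hat.T).max(axis=1)

    # Step 3: empirical quantiles of the simulated values.
    return np.quantile(v_draws, [alpha / 2, 1 - alpha / 2])


corners = np.array(list(product([0.0, 1.0], repeat=2)))
gamma_hat = np.array([0.01, -0.5])
cov_hat = np.diag([1.0, 0.25])

# With a positive tolerance, (0, 0) and (1, 0) are both kept.
q_two = simulate_limit_quantiles(
    gamma_hat, cov_hat, corners, 0.05, 50_000, 0.05, np.random.default_rng(1)
)
# With zero tolerance, only the sample maximizer (1, 0) is kept.
q_one = simulate_limit_quantiles(
    gamma_hat, cov_hat, corners, 0.0, 50_000, 0.05, np.random.default_rng(1)
)
```

When both solutions are kept, the simulated distribution is $\max\{0, w_1\}$ and has a point mass at zero; with a single solution it is simply normal.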


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
"""Illustrate Bhattacharya 2009 in a simple example."""

from functools import partial

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from scipy.stats import gaussian_kde, norm

from thesis.bhattacharya.cdd_funcs import (
    find_extreme_points,
    find_extreme_points_box,
    mat_box_constraint,
)
from thesis.config import BLD, RNG

In [None]:
# Illustration of cdd for finding extreme points of a box constraint

# Get the matrix for the box constraint
mat_for_cdd = mat_box_constraint(2)

# V-representation: [t V]
# The library will always output V-representations in this form, i.e. so that all
# components of the first column are zero or one, and so that
# L does not contain rows whose first component is one.
# 1 in the first column means it is an extreme point.
find_extreme_points(mat_for_cdd)

In [None]:
# In low-dimensional problems, finding extreme points is fast, but the number of
# extreme points (and hence the runtime) grows exponentially with the dimension.
# For example, [0,1]^16 has 2^16=65536 extreme points.

times = {}

for dim in [2, 4, 8, 16]:
    times[dim] = %timeit -o find_extreme_points_box(dim)

times

## Simple Simulation

We will now consider the B2009 method in the simplest possible linear program.

The population problem is given by 
$$
\min_{x\in[0,1]^2} c_1 x_1 + c_2 x_2
$$

We will fix $c_2 > 0$ for the analysis.
Hence, the solution has a simple form: $x_2=0$ is optimal and hence the only relevant extreme points are $\{(1,0), (0,0)\}$.

- When $c_1 > 0$, $(0,0)$ is the unique optimal solution.
- When $c_1 = 0$, both $(0, 0)$ and $(1,0)$ are optimal solutions.
- When $c_1 < 0$, $(1,0)$ is the unique optimal solution.

The optimal value viewed as a function of $c_1$ has the following form:
$$
v(c_1) = 0 + I\{c_1 < 0\}\times c_1 = \min\{0, c_1\}
$$
We can see that the point of multiple solutions $c_1 = 0$ corresponds to the kink in the value function.
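The value function and its kink can be checked numerically by enumerating the four corners of the box. This is a small sketch; the grid of $c_1$ values and the choice $c_2 = 0.5$ are illustrative assumptions.

```python
from itertools import product

import numpy as np

corners = np.array(list(product([0.0, 1.0], repeat=2)))
c_2 = 0.5  # assumed known and positive
c_1_grid = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])

# Objective value of every corner (rows) for every c_1 on the grid (columns).
values = corners[:, [0]] * c_1_grid + corners[:, [1]] * c_2

# Optimal value and number of optimal corners for each c_1.
v = values.min(axis=0)
num_optimal = np.isclose(values, v).sum(axis=0)
```

Here `v` coincides with $\min\{0, c_1\}$ on the grid, and `num_optimal` equals 2 only at $c_1 = 0$, which is exactly the kink.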

The feasible set $[0,1]^2$ has four extreme points, namely the corners of the box.
We will treat $c_1$ as unknown and $c_2$ as known.
However, we have $n$ draws from a random variable $y$ with $E[y] = c_1$ and finite variance $\sigma^2$. Hence, we can estimate $c_1$ by $\hat{c}_1 = \frac{1}{n}\sum_{i=1}^n y_i$.

The sample problem is thus given by
$$
\min_{x\in[0,1]^2} \hat{c}_1 x_1 + c_2 x_2.
$$

B2009 applies since the polyhedron $[0,1]^2$ is not stochastic and $\sqrt{n}(\hat{c}_1 - c_1) \to_d N(0, \sigma^2)$.

By the B2009 results, the asymptotic distribution of $\sqrt{n}(\hat{v} - v)$ is given by

- $0$, whenever $c_1 > 0$
- $\min\{N(0, \sigma^2), 0\}$, when $c_1=0$
- $N(0, \sigma^2)$, when $c_1 < 0$.

In [None]:
# First, we illustrate the distribution of \hat{v} for different sample sizes and c_1.


def draw_data(
    num_obs: int,
    c_1: float,
    sigma: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Draw num_obs observations from N(c_1, sigma^2)."""
    return rng.normal(c_1, sigma, num_obs)


def v_hat(data: np.ndarray) -> float:
    """Calculate the sample optimal value min(mean(data), 0) for a given sample."""
    mean = np.mean(data)
    return mean * (mean < 0)


def sim_distribution_v_hat(
    num_reps: int,
    num_obs: int,
    c_1: float,
    sigma: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Simulate finite sample distribution of v_hat."""
    out = np.empty(num_reps)

    for i in range(num_reps):
        data = draw_data(num_obs, c_1, sigma, rng)
        out[i] = v_hat(data)

    return out

In [None]:
def plot_scaled_distr(
    num_obs: int,
    num_reps: int,
    num_grid: int,
    c_1: float,
    sigma: float,
    rng: np.random.Generator,
) -> go.Figure:
    """Plot the scaled distribution of v_hat."""
    grid = np.linspace(-3, 3, num_grid)

    distr = sim_distribution_v_hat(
        num_reps=num_reps,
        num_obs=num_obs,
        c_1=c_1,
        sigma=sigma,
        rng=rng,
    )

    # True optimal value min{c_1, 0}; np.min(c_1, 0) would treat 0 as an axis.
    v = min(c_1, 0)

    scaled_distr = np.sqrt(num_obs) * (distr - v)

    try:
        kde = gaussian_kde(scaled_distr).evaluate(grid)
    except np.linalg.LinAlgError:
        kde = np.zeros(num_grid)

    fig = go.Figure()

    fig.add_trace(
        go.Scatter(
            x=grid,
            y=kde,
            mode="lines",
            name="Finite Sample (KDE)",
            line={"dash": "solid", "color": "blue"},
        ),
    )

    # Add N(0,sigma^2) distribution

    sigma_asympt = sigma if c_1 <= 0 else 0

    if c_1 < 0:
        asym_distr = norm.pdf(grid, loc=0, scale=sigma_asympt)
    if c_1 == 0:
        # Minimum of norm.pdf(grid, loc=0, scale=sigma_asympt) and 0
        asym_distr = np.where(grid <= 0, norm.pdf(grid, loc=0, scale=sigma_asympt), 0)
        # Add zero to grid and set asym_distr to 0.5 at that point
        grid = np.concatenate(([0], grid))
        asym_distr = np.concatenate(([0.5], asym_distr))

        idx = np.argsort(grid)
        grid = grid[idx]
        asym_distr = asym_distr[idx]
    if c_1 > 0:
        asym_distr = np.zeros(num_grid)

    fig.add_trace(
        go.Scatter(
            x=grid,
            y=asym_distr,
            mode="lines",
            name="Asymptotic Distribution",
            line={"dash": "solid", "color": "red"},
        ),
    )

    # Add mean(scaled_distr) and alpha/2 and 1-alpha/2 quantiles
    alpha = 0.05

    mean = np.mean(scaled_distr)
    q1_fs = np.quantile(scaled_distr, alpha / 2)
    q2_fs = np.quantile(scaled_distr, 1 - alpha / 2)

    fig.add_trace(
        go.Scatter(
            x=[mean, mean],
            y=[0, 0.5],
            mode="lines",
            name=f"FS: Mean = {mean:.2f}",
            line={"dash": "dash", "color": "orange"},
        ),
    )

    fig.add_trace(
        go.Scatter(
            x=[q1_fs, q1_fs],
            y=[0, 0.5],
            mode="lines",
            name=f"FS: {alpha/2} quantile = {q1_fs:.2f}",
            line={"dash": "dash", "color": "green"},
        ),
    )

    fig.add_trace(
        go.Scatter(
            x=[q2_fs, q2_fs],
            y=[0, 0.5],
            mode="lines",
            name=f"FS: {1-alpha/2} quantile = {q2_fs:.2f}",
            line={"dash": "dash", "color": "green"},
        ),
    )

    # Add quantiles of the asymptotic distribution
    q1_asym = norm.ppf(alpha / 2, loc=0, scale=sigma_asympt) if c_1 <= 0 else 0

    q2_asym = norm.ppf(1 - alpha / 2, loc=0, scale=sigma_asympt) if c_1 < 0 else 0

    fig.add_trace(
        go.Scatter(
            x=[q1_asym, q1_asym],
            y=[0, 0.5],
            mode="lines",
            name=f"Asym.: {alpha/2} quantile = {q1_asym:.2f}",
            line={"dash": "dot", "color": "black"},
        ),
    )

    fig.add_trace(
        go.Scatter(
            x=[q2_asym, q2_asym],
            y=[0, 0.5],
            mode="lines",
            name=f"Asym.: {1-alpha/2} quantile = {q2_asym:.2f}",
            line={"dash": "dot", "color": "black"},
        ),
    )

    fig.update_layout(
        title=f"N = {num_obs}, c_1 = {c_1}, Sigma = {sigma}, Simulations = {num_reps}",
        xaxis_title="sqrt(n) * (v_hat - v)",
        yaxis_title="Density",
    )

    return fig


plot = partial(plot_scaled_distr, num_reps=25_000, num_grid=1000, sigma=1, rng=RNG)

In [None]:
c_1_for_plot = [-0.2, -0.05, -0.01, 0, 0.05]
num_obs_for_plot = [100, 250, 1000, 10_000]

res_plots = {
    c_1: {num_obs: plot(num_obs=num_obs, c_1=c_1) for num_obs in num_obs_for_plot}
    for c_1 in c_1_for_plot
}

In [None]:
path_to_plots = BLD / "bhattacharya" / "plots"

# Use fresh loop names so that the `plot` partial and `num_obs` are not shadowed.
for c_1, plot_dict in res_plots.items():
    for n_obs, figure in plot_dict.items():
        figure.write_image(path_to_plots / f"dist_v_hat_{c_1}_num_obs_{n_obs}.png")

In [None]:
def bhatta_confidence_interval(
    data: np.ndarray,
    c_n: float,
    n_reps: int,
    alpha: float,
) -> np.ndarray:
    """Calculate the Bhattacharya 2009 confidence interval for a given sample."""
    num_obs = len(data)

    # Step 1: Construct set of optimal solutions up to some tolerance
    gamma_1_hat = np.mean(data)
    gamma_2_hat = 0.5
    gamma = np.array([gamma_1_hat, gamma_2_hat])

    basic_feasible_solutions = find_extreme_points_box(2)[:, 1:]

    values = basic_feasible_solutions @ gamma.T

    v_hat = np.min(values)

    idx = np.where(v_hat + c_n >= values)[0]

    num_solutions = len(idx)

    estimated_optimal_solutions = basic_feasible_solutions[idx, :]

    # Step 2: For each of B draws w from N(0, sigma^2_hat) calculate the minimum
    # value over the estimated optimal solutions
    sigma_hat = np.std(data)

    asym_distr = np.empty(n_reps)

    for i in range(n_reps):
        w1 = RNG.normal(0, sigma_hat)
        w2 = 0
        w = np.array([w1, w2])

        if len(idx) == 1:
            asym_distr[i] = np.dot(estimated_optimal_solutions, w)[0]
        else:
            _values = estimated_optimal_solutions @ w

            asym_distr[i] = np.min(_values)

    z_hi = np.quantile(asym_distr, 1 - alpha / 2)
    z_lo = np.quantile(asym_distr, alpha / 2)

    ci_lo = v_hat - z_hi / np.sqrt(num_obs)
    ci_hi = v_hat - z_lo / np.sqrt(num_obs)

    return np.array([ci_lo, ci_hi, z_lo, z_hi, v_hat, num_solutions])

In [None]:
# Run a small simulation to see how the confidence intervals look on average.
# Choosing c_n large amounts to always selecting both solutions, which is correct
# when c_1 = 0.
# Choosing c_n = 0 amounts to choosing (0, 0) whenever c_1_hat > 0 and (1, 0)
# whenever c_1_hat < 0.


def sim_confidence_interval(
    num_obs: int,
    c_1: float,
    c_n: float,
    sigma: float,
    num_sims: int,
    alpha: float,
    rng: np.random.Generator,
) -> pd.DataFrame:
    """Simulate coverage of the Bhattacharya 2009 confidence interval."""
    # bhatta_confidence_interval returns six entries, including num_solutions.
    res = np.zeros((num_sims, 6))

    for i in range(num_sims):
        data = draw_data(num_obs, c_1, sigma, rng)
        res[i, :] = bhatta_confidence_interval(
            data=data,
            c_n=c_n,
            n_reps=1000,
            alpha=alpha,
        )

    cols = ["ci_lo", "ci_hi", "z_lo", "z_hi", "v_hat", "num_solutions"]

    out = pd.DataFrame(res, columns=cols)

    out["true"] = np.min([c_1, 0])

    out["covers_hi"] = out["ci_hi"] >= out["true"]
    out["covers_lo"] = out["ci_lo"] <= out["true"]

    out["covers"] = out["covers_hi"] & out["covers_lo"]

    return out

In [None]:
c1_grid = np.sort(np.concatenate((np.linspace(-0.1, 0.1, 10), np.zeros(1))))

sigma = 0.5

alpha = 0.05

num_obs = 1000

num_sims = 1000

In [None]:
c_n_for_simulation = {
    # "paper_0.01": 0.01 * np.log(num_obs) / np.sqrt(num_obs),
    # "paper_0.1": 0.1 * np.log(num_obs) / np.sqrt(num_obs),
    "normal_alpha_over_1": sigma * norm.ppf(1 - alpha) / (np.sqrt(num_obs)),
    "normal_alpha_over_2": sigma * norm.ppf(1 - alpha / 2) / (np.sqrt(num_obs)),
    "normal_alpha_over_4": sigma * norm.ppf(1 - alpha / 4) / (np.sqrt(num_obs)),
}

c_n_for_simulation

In [None]:
res = {
    c_n_key: {
        c_1: sim_confidence_interval(
            num_obs=num_obs,
            c_1=c_1,
            c_n=c_n_val,
            sigma=sigma,
            num_sims=num_sims,
            alpha=alpha,
            rng=RNG,
        )
        for c_1 in c1_grid
    }
    for c_n_key, c_n_val in c_n_for_simulation.items()
}

In [None]:
# Concatenate results in the dictionaries
df_res = pd.concat(
    [
        pd.concat([res[c_n_key][c_1].assign(c_n=c_n_key, c_1=c_1) for c_1 in c1_grid])
        for c_n_key in c_n_for_simulation
    ],
)

df_res.head()

In [None]:
data_for_plot = df_res.groupby(["c_1", "c_n"]).mean()

data_for_plot = data_for_plot.reset_index()

c_n_to_color = {
    "inf": "blue",
    "zero": "red",
    "paper_0.01": "green",
    "paper_0.1": "purple",
    "paper_1": "orange",
    "normal_alpha_over_1": "blue",
    "normal_alpha_over_2": "red",
    "normal_alpha_over_4": "green",
}

fig = go.Figure()

for c_n in c_n_for_simulation:
    data = data_for_plot.query(f"c_n == '{c_n}'")

    fig.add_trace(
        go.Scatter(
            x=data["c_1"],
            y=data["covers"],
            mode="lines",
            name=f"c_n={c_n} ({c_n_for_simulation[c_n]:.3f})",
            line={"color": c_n_to_color[c_n]},
        ),
    )

fig.update_layout(
    title=(
        f"Coverage of Confidence Intervals (N = {num_obs}, Simulations = {num_sims},"
        f" Nominal Coverage = {1 - alpha})"
    ),
    xaxis_title="c_1",
    yaxis_title="Coverage",
)

# Add a horizontal line at the nominal coverage level across the range of c_1
fig.add_shape(
    type="line",
    x0=np.min(c1_grid),
    y0=1 - alpha,
    x1=np.max(c1_grid),
    y1=1 - alpha,
    line={"color": "black", "width": 1},
)

fig.write_image(path_to_plots / f"coverage_{num_obs}.png")

fig.show()

**Case 1: $c_n=\infty$.**

In this case we have undercoverage whenever $c_1 < 0$.
When $c_n=\infty$, our estimator of the set of optimal solutions includes all feasible basic solutions.
That is, $\hat{\Theta}_n = S$.

Consider some $c_1 < 0$. As $n\to \infty$, we have $\hat{c}_1 \to_p c_1$.
The correct asymptotic distribution is $\sqrt{n}(\hat{v} - v) = \sqrt{n}(\min(\hat{c}_1, 0) - v) \to_d N(0, \sigma^2)$.
Hence, the correct quantiles for a two-sided CI are $\sigma\Phi^{-1}(\alpha/2)$ and $\sigma\Phi^{-1}(1-\alpha/2)$.

Now note that the plot showing the average CI bounds indicates that undercoverage comes from the lower bound being too large.
That is, $z_{1-\alpha/2} < \sigma\Phi^{-1}(1-\alpha/2)$.

The reason is simple: When we simulate, we simulate $\min_{z\in\hat{\Theta}_n}\{w'z\}$.
But if $\hat{\Theta}_n = S$, it always contains $(0,0)$, which is optimal whenever $w > 0$. Remember $\sqrt{n}(\hat{v}-v) \to_d w$.
Since $w =_d N(0,\sigma^2)$, we have $Pr(w > 0) = 0.5$ and hence our simulated distribution has a point mass of 0.5 at 0.
Hence, asymptotically, for any $\alpha \leq 0.5$ we have $z_{1-\alpha/2} = 0 < \sigma\Phi^{-1}(1-\alpha/2)$, so the confidence interval is too short.
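The point-mass argument can be verified with a short simulation. This is a sketch with assumed values $\sigma = 1$ and $\alpha = 0.05$; only the first coordinate of $w$ is random, so minimizing $w'z$ over all four corners reduces to $\min\{w_1, 0\}$.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
sigma, alpha = 1.0, 0.05

# Draws of the first coordinate of w; the second coordinate is zero.
w_1 = rng.normal(0.0, sigma, 100_000)

# Minimum of w'z over S: (0, 0) and (0, 1) give 0, (1, 0) and (1, 1) give w_1.
simulated = np.minimum(w_1, 0.0)

# Upper critical value from the simulation versus the normal quantile.
z_hi_simulated = float(np.quantile(simulated, 1 - alpha / 2))
z_hi_normal = sigma * NormalDist().inv_cdf(1 - alpha / 2)
share_at_zero = float(np.mean(simulated == 0.0))
```

Roughly half of the simulated draws sit exactly at zero, so the upper critical value is zero instead of the normal quantile of about 1.96.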

This is not a problem in terms of coverage when $c_1 \geq 0$. In this case, the optimal value is $0$, hence a confidence interval with lower bound $0$ covers the true parameter. 

<small> That is at least asymptotically true. I think that in finite samples, for $c_1$ slightly to the right of $0$, the confidence interval might not contain $0$. </small>

**Case 2: $c_n=0$.**

In this case we have correct coverage for $c_1 \ll 0$ and conservative coverage for $c_1 \gg 0$.
The latter is expected since the asymptotic distribution is degenerate with all mass at $0$.

When $c_n=0$, we only consider the sample optimal solution, meaning $\hat{\Theta}_n = \{\hat{z}\}$.

<small> While this should be unique with probability 1 in sample, we can define it to be unique by picking a random element. </small>

The undercoverage does not occur at $c_1 = 0$ itself but just to the left of it, at $c_1 = -\epsilon$ for small $\epsilon > 0$.
The reason is simple: When $\hat{c}_1 > 0$ is observed, the optimal sample solution is $(0,0)$.
Hence the simulated quantiles are 0 and the CI collapses to the point $\hat{v} = 0$, which does not cover the true value $v = c_1 < 0$.
Since $P(\hat{c}_1 > 0) > 0$ in finite samples, we have undercoverage, mostly resulting from this case.

Note this is not an issue at $c_1 = 0$! In this case $\hat{v} = 0$ covers the true parameter $v=0$.

# Comparison to pre-testing

## Pre-test estimator
Note that the approach in B2009 is equivalent to a pre-testing procedure.

A pre-test would determine whether we are at $c_1=0$ and then choose the approximating distribution accordingly.
In particular, we have for the asymptotic distribution $G$

$$
G(c_1) = \begin{cases}
    0 & \text{if } c_1 > 0 \\
    \min\{0, Z\} & \text{if } c_1 = 0 \\
    Z & \text{if } c_1 < 0,
\end{cases}
$$

where $Z\sim N(0,\sigma^2)$.

Hence, an estimator of the asymptotic distribution incorporating a pre-test might look like

$$
\hat{G}(c_1) = \begin{cases}
    0 & \text{if } \frac{\sqrt{n}(\hat{c}_1 - 0)}{\hat{\sigma}} > \kappa_n \\
    \min\{0, Z\} & \text{if } |\frac{\sqrt{n}(\hat{c}_1 - 0)}{\hat{\sigma}}| \leq \kappa_n \\
    Z & \text{if } \frac{\sqrt{n}(\hat{c}_1 - 0)}{\hat{\sigma}} < -\kappa_n.
\end{cases}
$$
Here, $\kappa_n$ has to satisfy $\kappa_n \to \infty$ but more slowly than $\sqrt{n}$, so $\frac{\kappa_n}{\sqrt{n}}\to0$.

We can for example choose $\kappa_n = \Phi^{-1}(1 - \frac{\alpha}{2r_n})$ where $r_n\to \infty, \frac{r_n}{\sqrt{n}}\to 0$.
Then the type I error (wrongly rejecting $c_1=0$) is approximately $\frac{\alpha}{r_n}$ in a large sample.

## B2009 estimator

Now the estimator of B2009 is essentially a pre-test.
Note that an event equivalent to $c_1=0$ is $|\Theta_0| > 1$. That is, we have two optimal basic feasible solutions if and only if $c_1=0$ (given that $c_2 > 0$).
More explicitly, we want to test $\Theta_0 = \{(1, 0), (0,0)\}$.

By using the preliminary estimator $\hat{\Theta}_n = \{z \in S: \hat{\gamma}'z \leq \hat{v} + c_n\}$, where $\hat{v} = \min_{z\in S}\hat{\gamma}'z$, we are conducting this pre-test.
We can work through the cases for type I errors to motivate a tuning parameter choice similar to the pre-test above.

In particular, we can conduct a similar test by choosing
$$
c_n = \frac{\sigma \Phi^{-1}(1 - \frac{\alpha}{2 r_n})}{\sqrt{n}},
$$

with $r_n \to \infty$ and $\frac{r_n}{\sqrt{n}} \to 0$. 
Note this satisfies $c_n \to 0$ and $\sqrt{n}c_n \to \infty$ as required by the theory in B2009.
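The two rate conditions can be checked numerically for one concrete choice of $r_n$. The sketch below assumes $r_n = \log n$, which is only one admissible rate and is not prescribed by the source; the function name `c_n_pretest` is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def c_n_pretest(n: int, sigma: float, alpha: float, r_n: float) -> float:
    """Tolerance c_n = sigma * Phi^{-1}(1 - alpha / (2 r_n)) / sqrt(n)."""
    return sigma * NormalDist().inv_cdf(1 - alpha / (2 * r_n)) / sqrt(n)


sample_sizes = [100, 10_000, 1_000_000]

# Assumption for illustration: r_n = log(n), which diverges more slowly than sqrt(n).
c_n_values = [c_n_pretest(n, sigma=1.0, alpha=0.05, r_n=log(n)) for n in sample_sizes]

# sqrt(n) * c_n equals sigma times the normal critical value and grows with n.
scaled_values = [sqrt(n) * c for n, c in zip(sample_sizes, c_n_values)]
```

With this choice `c_n_values` shrinks toward zero while `scaled_values` increases, although the latter grows very slowly.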

The pre-test has similar consequences:
- If $\hat{\Theta}_n = \{(0,0)\}$, we use $0$ as the asymptotic approximation.
- If $\hat{\Theta}_n = \{(1,0)\}$, we use $Z\sim N(0, \sigma^2)$ as the asymptotic approximation.
- If $\hat{\Theta}_n = \{(1,0), (0,0)\}$, we use $\min\{Z, 0\}$ as the asymptotic approximation.

Issues again arise when $c_1$ is negative but close to zero.

<small> All assuming we are actually correctly centering the interval. </small>

- When $\hat{\Theta}_n = \{(0,0)\}$ we have zero coverage, since the CI collapses to a point. Close to zero, this happens with probability $\approx \alpha/(2r_n)$.
- When $\hat{\Theta}_n = \{(1,0),(0,0)\}$ we will also have coverage below the nominal level. In particular, in this case the sample critical value for the lower CI bound is $\hat{z}_{1-\alpha/2} = 0$, since there is a point mass at zero in the approximation. For $\hat{z}_{\alpha/2}$ we get a normal critical value.
- When $\hat{\Theta}_n = \{(1,0)\}$ we will have correct coverage since the normal approximation $Z$ is the correct asymptotic distribution for $c_1 < 0$.

Is the main problem **finite sample bias**?
That is: If $c_1$ is close to a value at which the asymptotic distribution does not have mean zero, we will be *biased in finite samples*.
This is essentially due to the point mass at zero that comes from $(0,0)$ being close to optimal.

But wouldn't this be taken into account by the construction of the confidence intervals?

## Undercoverage close to 0

The key question is: Why is there undercoverage close to zero?

Remember: For a true value $v_1 < 0$ close to zero, the true distribution of $\sqrt{n}(\hat{v} - v_1)$ will *not* have $z_{1-\alpha/2} = 0$.
$\sqrt{n}\hat{v}$ has a point mass at zero in finite samples, but the centered distribution has this point mass at $-\sqrt{n}v_1 > 0$. Hence, using $\hat{v} - 0$ for constructing the lower (one-sided) CI will lead to undercoverage.

Evidence:
- A one-sided *upper* confidence interval has nominally (close to) correct coverage.
  - The critical value we use is $z_{\alpha/2}$ and the interval is $(-\infty, \hat{v} - z_{\alpha/2}/\sqrt{n}]$.
  - Note that $z_{\alpha/2} = \sigma\Phi^{-1}(\alpha/2)$ since we are in the left tail of the distribution, which is normal.
- A one-sided *lower* confidence interval has coverage (close to) 50%.
  - The lower CI is given by $[\hat{v} - z_{1-\alpha/2}/\sqrt{n}, \infty)$.
  - About half the time, we include $(0,0)$ in $\hat{\Theta}_n$. In this case, $z_{1-\alpha/2} = 0$ but the correct critical value (see above) is $-\sqrt{n}v_1 > 0$.
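The location of the point mass in the centered distribution can be confirmed numerically. The sketch below assumes normal data, so that the sample mean can be drawn directly from its exact distribution $N(c_1, \sigma^2/n)$; the parameter values are illustrative and chosen so that $-\sqrt{n}c_1 = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_1, sigma, alpha = 10_000, -0.01, 1.0, 0.05

# Sample mean of n normal observations, drawn from its exact distribution.
c_1_hat = rng.normal(c_1, sigma / np.sqrt(n), 100_000)

# Sample optimal value and its centered, scaled version (v = c_1 since c_1 < 0).
v_hat = np.minimum(c_1_hat, 0.0)
scaled = np.sqrt(n) * (v_hat - c_1)

# Upper quantile of the true finite-sample distribution and the point-mass location.
q_hi_true = float(np.quantile(scaled, 1 - alpha / 2))
point_mass_location = float(-np.sqrt(n) * c_1)
share_at_point_mass = float(np.mean(v_hat == 0.0))
```

Because about 16% of the draws have $\hat{c}_1 > 0$, which exceeds $\alpha/2$, the upper quantile of the true distribution sits at the point mass $-\sqrt{n}c_1 = 1$ and not at zero.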

In [None]:
# Let's illustrate this discrepancy in the quantiles.
# First, we plot the true finite sample distribution.

plot_scaled_distr(
    num_obs=10000,
    num_reps=25_000,
    num_grid=1_000,
    c_1=-0.01,
    sigma=1,
    rng=RNG,
)