# Introduction

## The Problem: Small, Correlated Datasets in Cost Estimation

Across hundreds of Cost Estimating Relationships (CERs) developed on small datasets of 5 to 30 data points, a recurring pattern emerges: strong fit statistics (R², F-statistics) paired with nonsensical coefficients, including wrong signs, implausible magnitudes, and poor p-values for critical predictors. For Department of Defense analysts, this story may feel too close to home.

Multicollinearity is common in cost analysis datasets and causes models to misbehave. The consequences of multicollinearity in small samples are well-documented ([Flynn and James 2016](#ref-flynn2016multicollinearity)):

-   **“Bouncing β’s”**: unstable coefficient estimates that swing wildly with small changes in the data
-   **Wrong coefficient signs**: estimates flip between positive and negative, contrary to domain knowledge
-   **Unreliable hypothesis testing**: a high F-statistic alongside statistically insignificant individual t-statistics
-   **Hidden extrapolation**: predictions fall outside the convex hull of the training data despite appearing to lie within each variable's range
-   **Inflated variance**: the variance of a coefficient increases by a factor of $1/(1-R^2)$, where $R^2$ is the coefficient of determination from regressing that predictor on the other predictors (the squared correlation when there are only two predictors)
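
To give a sense of scale for the last item, the following worked values assume just two predictors with simple correlation $r$, so that the inflation factor reduces to $1/(1-r^2)$. The three correlations are illustrative choices, not values taken from the cited study.

$$
\frac{1}{1-0.80^2} \approx 2.78, \qquad
\frac{1}{1-0.90^2} \approx 5.26, \qquad
\frac{1}{1-0.95^2} \approx 10.26
$$

Because standard errors scale with the square root of the variance, a correlation of 0.95 roughly triples the standard error of each coefficient ($\sqrt{10.26} \approx 3.2$) relative to uncorrelated predictors.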

### Motivating Example: Cost Improvement Curve with Rate Effect

Learning curves are fundamental to cost estimation in manufacturing and aerospace industries. The classic power-law model describes how unit costs decrease with cumulative production:

$$Y = T_1 \cdot X^b$$

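where $Y$ is the unit cost, $T_1$ is the theoretical first unit cost, $X$ is the cumulative unit number, and $b$ is the learning exponent (typically negative, indicating cost reduction with experience). The learning curve slope, usually quoted as a percentage, is $2^b$: the factor by which unit cost changes each time cumulative quantity doubles.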

When multiple factors affect costs (e.g., production rate effects), the model extends to:

$$Y = T_1 \cdot X_1^b \cdot X_2^c \cdot \varepsilon$$ {#eq-learning-rate}

where $X_1$ represents lot midpoint (the learning variable) and $X_2$ represents lot quantity (the rate variable). In this specification, lot midpoint is inherently correlated with lot size as production ramps up from prototypes to Low-Rate Initial Production (LRIP) to Full-Rate Production (FRP). Correlations ranging from $\rho = -0.3$ to $\rho = 0.88$ have been found in Selected Acquisition Reports comparing lot size to cumulative quantities. Domain knowledge establishes that learning and rate slopes should be $\leq 100\%$, which is equivalent to $b \leq 0$ and $c \leq 0$ (i.e., unit costs should not increase with cumulative production or with production rate).
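
The relationship between percentage slopes and the exponents $b$ and $c$ can be sketched in a few lines of Python. This is an illustration only: the function names and the input values (a first unit cost of 100, a 90% learning slope, a 95% rate slope) are hypothetical. It also checks that taking logarithms turns the multiplicative model into a linear one, $\ln Y = \ln T_1 + b \ln X_1 + c \ln X_2$ (error term omitted), which is the form usually fitted by regression.

```python
import math


def slope_to_exponent(slope):
    """Convert a slope given as a fraction (0.90 for 90%) to an exponent."""
    return math.log2(slope)


def exponent_to_slope(exponent):
    """Convert an exponent back to a slope fraction (cost factor per doubling)."""
    return 2.0 ** exponent


def unit_cost(t1, lot_midpoint, lot_quantity, b, c):
    """Evaluate Y = T1 * X1^b * X2^c without the error term."""
    return t1 * lot_midpoint ** b * lot_quantity ** c


# Hypothetical inputs chosen only for illustration.
b = slope_to_exponent(0.90)  # learning exponent, negative for slopes below 100%
c = slope_to_exponent(0.95)  # rate exponent, negative for slopes below 100%

base = unit_cost(100.0, 10.0, 20.0, b, c)
doubled_midpoint = unit_cost(100.0, 20.0, 20.0, b, c)

# Doubling the lot midpoint multiplies unit cost by the learning slope.
ratio = doubled_midpoint / base

# The same model in log space is linear in the exponents.
log_cost = math.log(100.0) + b * math.log(10.0) + c * math.log(20.0)
```

A slope of exactly 100% maps to an exponent of zero, so the domain requirement that slopes not exceed 100% becomes a simple upper bound of zero on each exponent.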

## Diagnosing Multicollinearity

Standard diagnostic measures for detecting multicollinearity include:

| Diagnostic | Threshold | Interpretation |
|----------------------|---------------------|-----------------------------|
| Simple correlation $r_{X_i,X_j}$ | \> 0.8 | Potential multicollinearity |
| VIF (Variance Inflation Factor) | \> 10 | Likely harmful (CEBoK threshold) |
| Condition number $\kappa$ | \> 30 | Collinearity harmful |
| R² among predictors | \> 0.90 | Harmful if $R^2$ for $X_i \vert \text{other } X$’s \> 0.90 |
| F vs t mismatch | High F, low t | Classic multicollinearity symptom |

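The Variance Inflation Factor for predictor $X_i$ is defined as $\text{VIF}_i = 1/(1-R^2_{X_i|\text{other }X})$, where $R^2_{X_i|\text{other }X}$ is the coefficient of determination from regressing $X_i$ on the remaining predictors. The condition number is $\kappa = \lambda_{\max}/\lambda_{\min}$, the ratio of the largest to the smallest eigenvalue of $X'X$. Conventions vary: some references define the condition number as the square root of this ratio, computed on column-scaled predictors, so it is worth confirming which form a given threshold refers to.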
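
For the two-predictor learning and rate model, these diagnostics can be computed by hand. The sketch below uses only the Python standard library and rests on several assumptions: the lot quantities are hypothetical, lot midpoints are approximated by the arithmetic midpoint of each lot, and the condition number is taken from the 2×2 correlation matrix of the logged predictors (that is, $X'X$ for standardized predictors), whose eigenvalues are $1 \pm |r|$.

```python
import math


def lot_midpoints(quantities):
    """Arithmetic midpoint of each lot's unit numbers (a simple approximation)."""
    mids, cumulative = [], 0
    for q in quantities:
        mids.append(cumulative + (q + 1) / 2)  # midpoint of units cumulative+1 .. cumulative+q
        cumulative += q
    return mids


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_predictor_diagnostics(x1, x2):
    """Correlation, VIF, and condition number for exactly two predictors."""
    r = pearson_r(x1, x2)
    vif = 1.0 / (1.0 - r * r)               # same value for both predictors
    kappa = (1.0 + abs(r)) / (1.0 - abs(r))  # lambda_max / lambda_min of [[1, r], [r, 1]]
    return {"r": r, "vif": vif, "kappa": kappa}


# Hypothetical production ramp: lot sizes grow from prototypes toward full rate.
lot_quantity = [5, 12, 24, 36, 54, 66, 84, 96]
mids = lot_midpoints(lot_quantity)

# The learning curve model is fitted in log space, so diagnose the logged predictors.
log_x1 = [math.log(v) for v in mids]
log_x2 = [math.log(v) for v in lot_quantity]
diag = two_predictor_diagnostics(log_x1, log_x2)
```

With a steadily growing lot size, the logged midpoint and logged quantity move almost in lockstep, so `diag["vif"]` lands well above the threshold of 10 in the table. With more than two predictors, the same quantities would normally come from a linear algebra library rather than closed-form expressions.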

## Traditional Remedies and Their Limitations

Traditional approaches each come with trade-offs ([Flynn and James 2016](#ref-flynn2016multicollinearity)):

| Remedy | Description | Limitation |
|------------------|----------------------------|--------------------------|
| Collect more data | Increases sample size, reduces variance | Often infeasible in defense cost analysis |
| Drop variables | Remove collinear predictors via confluence analysis | May lose domain-required variables |
| Centering/Scaling | Reduces structural multicollinearity | Only helps polynomial/interaction terms |
| Ridge Regression | L2 penalty shrinks coefficients | No sparsity; no domain constraints |
| Lasso Regression | L1 penalty enables variable selection | Arbitrary selection among correlated vars; no constraints |
| Elastic Net | Combined L1+L2 penalties | Still no explicit domain constraints |
| PCA / PLS | Transforms to uncorrelated components | Loses coefficient interpretability |

## Theoretical Foundation

### Why Some Regularization is Always Optimal

> **Theobald-Farebrother Theorem (1974, 1976)**
>
> For any OLS problem, there exists a ridge parameter $\lambda^* > 0$ such that the ridge estimator has strictly lower Mean Squared Error (MSE) than OLS. This result holds for the population MSE (true prediction risk), not merely training error ([Theobald 1974](#ref-theobald1974); [Farebrother 1976](#ref-farebrother1976)).

**Why This Matters**: OLS minimizes training error but may overfit—especially with correlated predictors. Ridge introduces bias but reduces variance. The theorem guarantees that for some $\lambda > 0$, the variance reduction exceeds the bias increase, yielding lower total error.

**The Practical Challenge**: The optimal $\lambda^*$ depends on unknown population parameters ($\beta$, $\sigma^2$). Cross-validation provides an empirical estimate. When CV selects $\lambda \approx 0$, OLS was already near-optimal. The framework adapts automatically.
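
A minimal sketch of choosing $\lambda$ by cross-validation is given below. It is not the implementation used in this paper: it assumes exactly two predictors, leaves them unstandardized, solves the ridge normal equations $(X'X + \lambda I)\beta = X'y$ on centered data with Cramer's rule, and uses leave-one-out cross-validation over a small hypothetical grid. The simulated data and all names are illustrative.

```python
import random


def center(values):
    """Return the centered values and their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean


def ridge_fit(x1, x2, y, lam):
    """Two-predictor ridge fit with an unpenalized intercept."""
    c1, m1 = center(x1)
    c2, m2 = center(x2)
    cy, my = center(y)
    s11 = sum(a * a for a in c1) + lam  # diagonal of X'X + lam * I
    s22 = sum(a * a for a in c2) + lam
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def loo_cv_mse(x1, x2, y, lam):
    """Leave-one-out cross-validated mean squared prediction error."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        b0, b1, b2 = ridge_fit(
            [x1[j] for j in keep], [x2[j] for j in keep], [y[j] for j in keep], lam
        )
        total += (y[i] - (b0 + b1 * x1[i] + b2 * x2[i])) ** 2
    return total / n


# Simulated log-space data with strongly correlated predictors (illustrative only).
rng = random.Random(0)
n = 12
x1 = [rng.uniform(1.0, 6.0) for _ in range(n)]
x2 = [0.6 * a + rng.gauss(0, 0.15) for a in x1]
y = [4.0 - 0.3 * a - 0.1 * b + rng.gauss(0, 0.1) for a, b in zip(x1, x2)]

grid = [0.0, 0.01, 0.1, 1.0, 10.0]
scores = {lam: loo_cv_mse(x1, x2, y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
```

Including $\lambda = 0$ in the grid lets the procedure fall back to OLS when shrinkage does not help, which is the adaptive behavior described above.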

### Constrained Methods Superior to Unconstrained

In developing the Penalized and Constrained (PAC) optimization method, James, Paulson, and Rusmevichientong ([2020](#ref-james2020pac)) report:

> “The results suggest that PAC and relaxed PAC are surprisingly robust to random violations in the constraints. While both methods deteriorated slightly as \[constraint error\] increased, they were still both superior to their unconstrained counterparts for all values of \[error\] and all settings.”

## Research Contribution

This paper provides a practical guide and framework that combines penalized regularization (Lasso, Ridge, Elastic Net) with constrained optimization in the context of cost estimation. Key contributions include:

1.  **Investigation of small-sample applicability**: Most research on penalized methods uses large datasets—does it apply to cost estimation’s typical 5-30 point samples?

2.  **Python package implementation**: The `penalized_constrained` package combines Elastic Net penalties with lower and upper bound constraints on coefficients

3.  **Proper diagnostic adjustments**: Via Generalized Degrees of Freedom (GDF) for constraint-driven regression ([Hu 2010](#ref-hu2010gdf))

4.  **Cross-validation framework**: For likelihood-free hyperparameter selection

5.  **sklearn-compatible implementation**: For practical application in existing workflows

6.  **Comprehensive benchmarks**: Simulation data comparing proposed method against OLS, Ridge, Lasso, and constrained-only approaches across varying sample sizes, correlations, and error variances

7.  **Practical guidance**: On when and how to implement these algorithms
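
To make the combination in item 2 concrete, the sketch below fits an Elastic Net objective subject to lower and upper bounds on each coefficient using proximal gradient descent. It is a hypothetical illustration, not the API or algorithm of the `penalized_constrained` package: the function name, arguments, and data are assumptions, and the objective follows the parametrization $\frac{1}{2n}\lVert y - X\beta \rVert^2 + \alpha\,\rho\,\lVert\beta\rVert_1 + \tfrac{1}{2}\alpha(1-\rho)\lVert\beta\rVert_2^2$ with `l1_ratio` playing the role of $\rho$. Each iteration takes a gradient step on the smooth part, soft-thresholds for the L1 penalty, and clips to the bounds.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_bounded_elastic_net(X, y, alpha, l1_ratio, lower, upper, n_iter=5000):
    """Elastic Net with box constraints on the slopes and a free intercept."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centering removes the unpenalized, unconstrained intercept from the problem.
    C = [[row[j] - col_means[j] for j in range(p)] for row in X]
    cy = [v - y_mean for v in y]
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    # The trace of C'C / n bounds its largest eigenvalue, giving a safe step size.
    lipschitz = sum(c * c for row in C for c in row) / n + l2
    step = 1.0 / lipschitz
    beta = [min(max(0.0, lo), hi) for lo, hi in zip(lower, upper)]  # feasible start
    for _ in range(n_iter):
        resid = [sum(c * b for c, b in zip(row, beta)) - t for row, t in zip(C, cy)]
        grad = [
            sum(row[j] * r for row, r in zip(C, resid)) / n + l2 * beta[j]
            for j in range(p)
        ]
        beta = [
            min(max(soft_threshold(b - step * g, step * l1), lo), hi)
            for b, g, lo, hi in zip(beta, grad, lower, upper)
        ]
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta


# Hypothetical log-space data: columns are ln(lot midpoint) and ln(lot quantity).
log_x = [[1.1, 1.6], [2.4, 2.5], [3.4, 3.2], [4.1, 3.6], [4.7, 4.0], [5.1, 4.2]]
log_y = [4.45, 4.20, 4.02, 3.95, 3.80, 3.78]

# Slopes of at most 100% correspond to exponents of at most zero.
neg_inf = float("-inf")
ln_t1, slopes = fit_bounded_elastic_net(
    log_x, log_y, alpha=0.01, l1_ratio=0.5, lower=[neg_inf, neg_inf], upper=[0.0, 0.0]
)
```

Clipping after soft-thresholding is valid here because both the L1 penalty and the bounds act on each coefficient separately. A production implementation would more likely hand the same objective to a general-purpose bounded optimizer.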

**Future Work**: Validation on actual program data using publicly available Selected Acquisition Reports (SARs), which are not subject to CUI restrictions.

## Paper Organization

The remainder of this paper is organized as follows: the Methodology section presents the mathematical formulation and algorithm details, the Simulation Design section describes our Monte Carlo study design, the Results section presents the empirical findings, the DOE Analysis section provides rigorous statistical analysis using design of experiments (DOE) methodology, and the Discussion section offers practical recommendations and discusses limitations.

Farebrother, R. W. 1976. “Further Results on the Mean Square Error of Ridge Regression.” *Journal of the Royal Statistical Society: Series B* 38 (3): 248–50. <https://doi.org/10.1111/j.2517-6161.1976.tb01588.x>.

Flynn, Bernard, and Andrew James. 2016. “Multicollinearity in CER Development.” In *ICEAA Professional Development & Training Workshop*.

Hu, Shu-Ping. 2010. “Generalized Degrees of Freedom for Constrained CERs.” PRT-191. Tecolote Research.

James, Gareth M., Courtney Paulson, and Paat Rusmevichientong. 2020. “Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising.” *Journal of the American Statistical Association* 115 (529): 107–22. <https://doi.org/10.1080/01621459.2019.1609970>.

Theobald, C. M. 1974. “Generalizations of Mean Square Error Applied to Ridge Regression.” *Journal of the Royal Statistical Society: Series B* 36 (1): 103–6. <https://doi.org/10.1111/j.2517-6161.1974.tb00990.x>.