When ML predictions are used as regressors in downstream models, the prediction error is non-classical and a growing literature proposes purpose-built corrections: GMM (Fong and Tyler 2021), prediction-powered inference (Angelopoulos et al. 2023), instrumental variables (Yang et al. 2022), joint MLE (Battaglia et al. 2025). This note observes that for linear downstream models, regression calibration — replacing the ML prediction with E[X | X̂, Z] estimated on a calibration sample — already eliminates the non-classical error structure under an exogeneity condition these methods also rely on. Two OLS regressions and a two-sample bootstrap give you consistent estimates with valid confidence intervals. For nonlinear downstream models (logistic, Poisson, any GLM), Jensen's inequality breaks the argument and heavier methods are genuinely needed. The linear/nonlinear boundary is the main result.
```sh
# Generate tables, figures, and compile (full: ~30 min, 500 sims)
./build.sh

# Quick mode (~5 min, 100 sims) for verification
./build.sh --quick
```

Requires Python 3.8+ with numpy, scikit-learn, scipy, matplotlib, and a LaTeX distribution with pdflatex and bibtex.
| Experiment | Downstream model | Prediction quality | Learner | Sims |
|---|---|---|---|---|
| 1 | Linear | Heteroskedastic (known + estimated) | Ridge | 500 |
| 1b | Linear | Heteroskedastic (known) | Random Forest | 200 |
| 2 | Linear | Homoskedastic | Ridge | 500 |
| 3 | Linear | Heteroskedastic, varying n_cal | Ridge | 500 |
| 4 | Logistic | Heteroskedastic (known) | Ridge | 500 |
Every method reports 95% CIs: two-sample bootstrap for regression calibration and moment EIV, posterior credible intervals for the Gibbs sampler, Wald for oracle/naive, sum-of-variances for PPI.
**Linear case.** All correction methods are approximately unbiased. Regression calibration with known reliability groups achieves the lowest RMSE of any feasible method, behind only the latent-variable oracle. The two-sample bootstrap (resampling C and U separately) gives near-nominal coverage. Under homoskedasticity the ordering reverses: moment EIV beats regression calibration.

**Nonlinear case.** Regression calibration incurs a Jensen's-inequality bias of ≈ −0.05 in logistic regression, about a quarter of the naive bias. The Metropolis-within-Gibbs latent-variable model removes it. PPI is also unbiased but has higher variance.

**Estimated heteroskedasticity.** Regression calibration remains approximately unbiased when the variance is estimated from calibration residuals. The Gibbs sampler is more sensitive: precision weighting amplifies errors in the variance model.
| File | Description |
|---|---|
| `paper.tex` | LaTeX source |
| `references.bib` | Bibliography (11 entries, validated) |
| `make_tables.py` | Replication script → `tables/*.tex` + `figures/*.pdf` |
| `build.sh` | One-command build |
| `tables/` | Generated LaTeX tables (`\input` from paper.tex) |
| `figures/` | Generated PDF figures (`\includegraphics` from paper.tex) |
| `paper.pdf` | Compiled paper |
For a linear downstream model Y = β_D D + β_x X + β_z Z + ε, where X is observed only through an ML prediction X̂ and β_D is the coefficient of interest:
- Calibration sample C: Regress X on (X̂, Z), possibly separately within each reliability group
- Target sample U: Predict X̃ from the calibration model, regress Y on (D, X̃, Z)
- Inference: Resample C and U independently with replacement, re-run both stages, take percentile CIs
That's it. Consistent for β_D under E[ε | X̂, Z] = 0. No GMM, no IV, no MCMC.
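The three steps above can be sketched in a few lines of numpy. Everything here is illustrative, not the paper's experiment code: the data-generating process, sample sizes, and names (`simulate`, `estimate`, `boot_ci`) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_D, beta_x, beta_z = 1.0, 2.0, -1.0  # illustrative true coefficients

def simulate(n):
    """Toy DGP: latent X, noisy ML prediction Xhat, linear outcome Y."""
    D = rng.normal(size=n)
    Z = rng.normal(size=n)
    X = 0.5 * Z + rng.normal(size=n)           # latent regressor
    Xhat = X + rng.normal(scale=0.7, size=n)   # ML prediction with error
    Y = beta_D * D + beta_x * X + beta_z * Z + rng.normal(size=n)
    return D, Z, X, Xhat, Y

def ols(Xmat, y):
    # least-squares coefficients with an intercept column prepended
    A = np.column_stack([np.ones(len(y)), Xmat])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def estimate(cal, tgt):
    Dc, Zc, Xc, Xhc, _ = cal
    Dt, Zt, _, Xht, Yt = tgt
    # Stage 1 (calibration sample C): regress X on (Xhat, Z)
    g = ols(np.column_stack([Xhc, Zc]), Xc)
    # Stage 2 (target sample U): impute X_tilde, regress Y on (D, X_tilde, Z)
    Xtil = g[0] + g[1] * Xht + g[2] * Zt
    return ols(np.column_stack([Dt, Xtil, Zt]), Yt)[1]  # coefficient on D

def boot_ci(cal, tgt, B=200, alpha=0.05):
    """Two-sample bootstrap: resample C and U independently, percentile CI."""
    n_c, n_t = len(cal[0]), len(tgt[0])
    draws = []
    for _ in range(B):
        ic = rng.integers(0, n_c, n_c)
        it = rng.integers(0, n_t, n_t)
        draws.append(estimate(tuple(a[ic] for a in cal),
                              tuple(a[it] for a in tgt)))
    return np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])

cal, tgt = simulate(500), simulate(2000)
print(estimate(cal, tgt), boot_ci(cal, tgt))
```

The point estimate lands near β_D = 1 even though X̂ carries substantial error, and the only machinery involved is two OLS fits plus resampling.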
When this breaks: logistic regression, Poisson, any GLM with a nonlinear link. Then you need joint estimation or PPI.
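A minimal numerical illustration of why the argument fails for a nonlinear link (a toy DGP, not Experiment 4; `logit_slope` is a name invented for this sketch): even after replacing X̂ with the exact conditional mean E[X | X̂], the logistic slope is attenuated, because E[logit⁻¹(βX) | X̃] ≠ logit⁻¹(βX̃).

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 50_000, 2.0
X = rng.normal(size=n)
Xhat = X + rng.normal(size=n)   # prediction error with variance 1
Xtil = 0.5 * Xhat               # exact E[X | Xhat] under this toy DGP
Y = rng.binomial(1, 1 / (1 + np.exp(-beta * X)))

def logit_slope(x):
    # one-parameter logistic MLE via Newton's method (no intercept in the DGP)
    b = 0.0
    for _ in range(25):
        p = 1 / (1 + np.exp(-b * x))
        b += np.sum((Y - p) * x) / np.sum(p * (1 - p) * x * x)
    return b

print(logit_slope(X))     # oracle on the true X: close to beta
print(logit_slope(Xtil))  # calibrated regressor: visibly attenuated
```

This residual bias is what the joint latent-variable model (or PPI) is needed to remove; no further linear correction of the regressor can fix a nonlinear link.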
- Angelopoulos et al. (2023). Prediction-powered inference. Science, 382(6671), 669–674.
- Battaglia et al. (2025). Inference for regression with variables generated by AI or ML. arXiv:2402.15585v5.
- Boonstra, Little, and Mitani (2021). Bias due to Berkson error. Biostatistics, 23(4), 1063–1078.
- Carroll et al. (2006). Measurement Error in Nonlinear Models. Chapman and Hall/CRC.
- Fong and Tyler (2021). ML predictions as regression covariates. Political Analysis, 29(4), 467–484.
- Wang, McCormick, and Leek (2020). Correcting inference based on predicted outcomes. PNAS, 117(48), 30266–30275.
- Yang et al. (2022). Causal inference with data-mined variables. INFORMS JDS, 1(2), 138–155.
- Zrnic and Candès (2024). Cross-prediction-powered inference. PNAS, 121(15), e2322083121.