When ML predictions are used as regressors in downstream models, the prediction error is non-classical and a growing literature proposes purpose-built corrections: GMM (Fong and Tyler 2021), prediction-powered inference (Angelopoulos et al. 2023), instrumental variables (Yang et al. 2022), joint MLE (Battaglia et al. 2025). This note observes that for linear downstream models, regression calibration — replacing the ML prediction with E[X | X̂, Z] estimated on a calibration sample — already eliminates the non-classical error structure under an exogeneity condition these methods also rely on. Two OLS regressions and a two-sample bootstrap give you consistent estimates with valid confidence intervals. For nonlinear downstream models (logistic, Poisson, any GLM), Jensen's inequality breaks the argument and heavier methods are genuinely needed. The linear/nonlinear boundary is the main result.
```sh
# Generate tables, figures, and compile (full: ~30 min, 500 sims)
./build.sh

# Quick mode (~5 min, 100 sims) for verification
./build.sh --quick
```

Requires Python 3.8+ with numpy, scikit-learn, scipy, matplotlib, and a LaTeX distribution with pdflatex and bibtex.
| Experiment | Downstream model | Prediction quality | Learner | Sims |
|---|---|---|---|---|
| 1 | Linear | Heteroskedastic (known + estimated) | Ridge | 500 |
| 1b | Linear | Heteroskedastic (known) | Random Forest | 200 |
| 2 | Linear | Homoskedastic | Ridge | 500 |
| 3 | Linear | Heteroskedastic, varying n_cal | Ridge | 500 |
| 4 | Logistic | Heteroskedastic (known) | Ridge | 500 |
Every method reports 95% CIs: two-sample bootstrap for regression calibration and moment EIV, posterior credible intervals for the Gibbs sampler, Wald for oracle/naive, sum-of-variances for PPI.
**Linear case.** All correction methods are approximately unbiased. Regression calibration with known reliability groups achieves the lowest RMSE of any feasible method, behind only the latent-variable oracle. The two-sample bootstrap (resampling C and U separately) gives near-nominal coverage. Under homoskedasticity the ordering reverses: moment EIV beats regression calibration.

**Nonlinear case.** Regression calibration incurs a Jensen's-inequality bias of ≈ −0.05 in logistic regression, about a quarter of the naive bias. The Metropolis-within-Gibbs latent-variable model removes it. PPI is also unbiased but has higher variance.

**Estimated heteroskedasticity.** Regression calibration remains approximately unbiased when the variance is estimated from calibration residuals. The Gibbs sampler is more sensitive: precision weighting amplifies errors in the variance model.
| File | Description |
|---|---|
| `paper.tex` | LaTeX source |
| `references.bib` | Bibliography (11 entries, validated) |
| `make_tables.py` | Replication script → `tables/*.tex` + `figures/*.pdf` |
| `build.sh` | One-command build |
| `tables/` | Generated LaTeX tables (`\input` from paper.tex) |
| `figures/` | Generated PDF figures (`\includegraphics` from paper.tex) |
| `paper.pdf` | Compiled paper |
For a linear downstream model Y = β_D D + β_x X + β_z Z + ε, where X is observed only through an ML prediction X̂ and β_D is the coefficient of interest:
- Calibration sample C: Regress X on (X̂, Z), possibly separately within each reliability group
- Target sample U: Predict X̃ from the calibration model, regress Y on (D, X̃, Z)
- Inference: Resample C and U independently with replacement, re-run both stages, take percentile CIs
That's it. Consistent for β_D under E[ε | X̂, Z] = 0. No GMM, no IV, no MCMC.
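The three steps above can be sketched in a few lines of numpy. Everything here is illustrative, not the paper's experiment code: the data-generating process, sample sizes, and names (`simulate`, `estimate`, `boot_ci`) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_D, beta_x, beta_z = 1.0, 2.0, -1.0  # illustrative true coefficients

def simulate(n):
    """Toy DGP: latent X, noisy ML prediction Xhat, linear outcome Y."""
    D = rng.normal(size=n)
    Z = rng.normal(size=n)
    X = 0.5 * Z + rng.normal(size=n)           # latent regressor
    Xhat = X + rng.normal(scale=0.7, size=n)   # ML prediction with error
    Y = beta_D * D + beta_x * X + beta_z * Z + rng.normal(size=n)
    return D, Z, X, Xhat, Y

def ols(Xmat, y):
    # least-squares coefficients with an intercept column prepended
    A = np.column_stack([np.ones(len(y)), Xmat])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def estimate(cal, tgt):
    Dc, Zc, Xc, Xhc, _ = cal
    Dt, Zt, _, Xht, Yt = tgt
    # Stage 1 (calibration sample C): regress X on (Xhat, Z)
    g = ols(np.column_stack([Xhc, Zc]), Xc)
    # Stage 2 (target sample U): impute X_tilde, regress Y on (D, X_tilde, Z)
    Xtil = g[0] + g[1] * Xht + g[2] * Zt
    return ols(np.column_stack([Dt, Xtil, Zt]), Yt)[1]  # coefficient on D

def boot_ci(cal, tgt, B=200, alpha=0.05):
    """Two-sample bootstrap: resample C and U independently, percentile CI."""
    n_c, n_t = len(cal[0]), len(tgt[0])
    draws = []
    for _ in range(B):
        ic = rng.integers(0, n_c, n_c)
        it = rng.integers(0, n_t, n_t)
        draws.append(estimate(tuple(a[ic] for a in cal),
                              tuple(a[it] for a in tgt)))
    return np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])

cal, tgt = simulate(500), simulate(2000)
print(estimate(cal, tgt), boot_ci(cal, tgt))
```

The point estimate lands near β_D = 1 even though X̂ carries substantial error, and the only machinery involved is two OLS fits plus resampling.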
When this breaks: logistic regression, Poisson, any GLM with a nonlinear link. Then you need joint estimation or PPI.
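A minimal numerical illustration of why the argument fails for a nonlinear link (a toy DGP, not Experiment 4; `logit_slope` is a name invented for this sketch): even after replacing X̂ with the exact conditional mean E[X | X̂], the logistic slope is attenuated, because E[logit⁻¹(βX) | X̃] ≠ logit⁻¹(βX̃).

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 50_000, 2.0
X = rng.normal(size=n)
Xhat = X + rng.normal(size=n)   # prediction error with variance 1
Xtil = 0.5 * Xhat               # exact E[X | Xhat] under this toy DGP
Y = rng.binomial(1, 1 / (1 + np.exp(-beta * X)))

def logit_slope(x):
    # one-parameter logistic MLE via Newton's method (no intercept in the DGP)
    b = 0.0
    for _ in range(25):
        p = 1 / (1 + np.exp(-b * x))
        b += np.sum((Y - p) * x) / np.sum(p * (1 - p) * x * x)
    return b

print(logit_slope(X))     # oracle on the true X: close to beta
print(logit_slope(Xtil))  # calibrated regressor: visibly attenuated
```

This residual bias is what the joint latent-variable model (or PPI) is needed to remove; no further linear correction of the regressor can fix a nonlinear link.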
- Angelopoulos et al. (2023). Prediction-powered inference. Science, 382(6671), 669–674.
- Battaglia et al. (2025). Inference for regression with variables generated by AI or ML. arXiv:2402.15585v5.
- Boonstra, Little, and Mitani (2021). Bias due to Berkson error. Biostatistics, 23(4), 1063–1078.
- Carroll et al. (2006). Measurement Error in Nonlinear Models. Chapman and Hall/CRC.
- Fong and Tyler (2021). ML predictions as regression covariates. Political Analysis, 29(4), 467–484.
- Wang, McCormick, and Leek (2020). Correcting inference based on predicted outcomes. PNAS, 117(48), 30266–30275.
- Yang et al. (2022). Causal inference with data-mined variables. INFORMS JDS, 1(2), 138–155.
- Zrnic and Candès (2024). Cross-prediction-powered inference. PNAS, 121(15), e2322083121.