
Simulation Study: Reliability of Predictors in Regression, SEM, and ML

This repository contains a simulation study designed to illustrate, in the context of regression and structural equation modelling (SEM), how predictor reliability (i.e., the degree to which observed predictors reflect their latent counterparts without measurement error) affects model estimation and predictive performance.

The study reproduces, in simplified form, ideas discussed in Beyond the Hype: A Simulation Study Evaluating the Predictive Performance of Machine Learning Models in Psychology (Jankowsky et al., 2024). It places these ideas in the tradition of classical regression and SEM, where measurement error has been studied extensively.


Contents

  • 01_simulate.R
    Generates synthetic datasets for analysis. Specifically:

    • Constructs latent predictors (X*) according to a user-defined correlation structure.
    • Produces observed predictors (X) by adding measurement error such that their reliability equals pre-specified values (rho_X).
    • Creates outcome variables (Y) as linear combinations of the latent predictors with noise adjusted to achieve a target latent R².
    • Saves all datasets and writes a manifest file (manifest.csv) describing dataset ID, reliability, replicate number, seed, and empirical checks.
  • 02_train.R
    Trains predictive models on the simulated datasets. Specifically:

    • Reads datasets from the manifest.
    • Splits each dataset into training and test subsets (default 70/30 split) using reproducible seeds.
    • Fits ordinary least squares (OLS) regression (Y ~ X1 + ... + Xp).
    • Optionally, fits a gradient boosted tree model (XGBoost) if the package is installed and enabled.
    • Saves predictions (true Y and predicted Ŷ for the test set) to CSV files.
    • Produces an index file (pred_index.csv) linking each dataset to its prediction outputs.
  • 03_validate.R
    Evaluates predictive performance and aggregates results. Specifically:

    • Loads prediction files and compares predicted values to true outcomes.
    • Computes two performance metrics: RMSE and R².
    • Saves detailed results per dataset (perf_by_dataset.csv).
    • Aggregates results across replicates by reliability and model (perf_agg.csv).
    • Creates a simple plot (R2_vs_rhoX.png) showing how predictive R² depends on predictor reliability.
  • plots.R
    Produces detailed visualizations of simulation results. Specifically:

    • Loads aggregated results (perf_agg.csv) and replicate-level results (perf_by_dataset.csv).
    • Generates facet plots of R² and RMSE across reliability levels.
    • Creates error bar plots (mean ± SE across replicates).
    • Produces replicate-level plots (jitter + boxplot) to display the distribution of predictive performance across replicates.
    • Saves all plots to data/out/ (e.g., perf_grid.png, perf_grid_with_se.png, perf_replicates_R2_jitter.png, perf_replicates_RMSE_jitter.png).
  • cleanup.R
    Maintains a tidy project structure. Specifically:

    • Removes bulky intermediate files after simulations are complete.
    • In data/sim/: keeps only manifest.csv, deletes per-replicate datasets.
    • In data/pred/: keeps only pred_index.csv, deletes per-replicate prediction files.
    • In data/out/: keeps summary metrics (perf_*.csv) and plots (*.png), removes leftover scratch files.
    • Helps keep the repository lightweight and avoids clutter when running many replications.
  • scripts/utils_pilot.R
    Contains helper functions used across scripts:

    • make_corr() – builds correlation matrices for latent predictors.
    • rmvnorm_simple() – generates multivariate normal data.
    • compute_sigma_for_R2() – determines the error variance needed to achieve the target latent R² (the underlying algebra is sketched after this list).
    • train_test_idx() – creates reproducible train/test splits.
  • data/
    Project output directory, with subfolders:

    • data/sim/ – simulated datasets + manifest.
    • data/pred/ – prediction outputs + index.
    • data/out/ – validation results and plots.
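
The key algebra behind compute_sigma_for_R2() is that the latent outcome variance decomposes as Var(Y) = βᵀΣβ + σ², so the residual variance yielding a target latent R² is σ² = βᵀΣβ · (1 − R²)/R². A minimal R sketch of that calculation (function name and signature here are illustrative, not necessarily the repo's exact code):

```r
# Sketch: choose the residual SD so the latent model attains a target R^2.
# beta: latent coefficients; Sigma: latent predictor correlation matrix.
compute_sigma_for_R2_sketch <- function(beta, Sigma, R2_target) {
  var_signal <- as.numeric(t(beta) %*% Sigma %*% beta)  # Var(X* %*% beta)
  sigma2 <- var_signal * (1 - R2_target) / R2_target    # solve R2 = s / (s + sigma^2)
  sqrt(sigma2)                                          # residual SD
}

# Example: three moderately correlated latent predictors, target latent R^2 = 0.50
Sigma <- matrix(0.4, 3, 3); diag(Sigma) <- 1
beta  <- c(0.5, 0.3, 0.2)
compute_sigma_for_R2_sketch(beta, Sigma, R2_target = 0.50)
```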

How to Run

  1. Run 01_simulate.R to generate synthetic datasets.
    • Main settings (a sketch of this settings block follows the list):
      • N: number of observations (sample size).
      • p: number of predictors.
      • r_betweenX: correlation among latent predictors (0 = independent; 0.40 = moderately correlated).
      • beta: true regression coefficients for latent predictors.
      • R2_target: desired proportion of variance explained in the latent outcome model.
      • rho_grid: reliability levels to simulate (e.g., 0.60, 0.80, 1.00).
      • replicates: number of datasets to generate per reliability level.
    • Outputs: datasets in data/sim/ and manifest.csv.
  2. Run 02_train.R to train models and save predictions.
    • Models: OLS always, XGBoost optional.
    • Outputs: per-dataset prediction files in data/pred/ and pred_index.csv.
  3. Run 03_validate.R to compute metrics and visualize results.
    • Outputs: per-dataset performance (perf_by_dataset.csv), aggregated results (perf_agg.csv), and R² vs reliability plot (R2_vs_rhoX.png).
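
For orientation, the step 1 settings can be collected in a single block like the following; the parameter names match the list above, but the values shown are illustrative defaults, not necessarily the repo's:

```r
# Hypothetical settings block for 01_simulate.R (values are illustrative).
N          <- 1000                  # observations per dataset
p          <- 3                     # number of predictors
r_betweenX <- 0.40                  # correlation among latent predictors
beta       <- c(0.5, 0.3, 0.2)      # true latent regression coefficients
R2_target  <- 0.50                  # target latent R^2
rho_grid   <- c(0.60, 0.80, 1.00)   # reliability levels to simulate
replicates <- 50                    # datasets per reliability level
```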

Notation and Theoretical Background

In the code, rho_X denotes predictor reliability.
This follows psychometric convention where reliability is expressed as ρ (Greek rho).

  • Reliability is formally defined as:

    $$\rho = \frac{\text{Var(True score)}}{\text{Var(Observed score)}}$$

    • ρ = 1.0 → predictors are measured without error (perfect reliability).
    • ρ = 0.8 → 80% of observed variance reflects the true latent predictor, 20% is error variance.
    • ρ = 0.6 → 60% is true signal, 40% is error.
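
One standard construction for hitting a target reliability (a sketch; the repo's 01_simulate.R may differ in detail) keeps the latent predictor standardized and adds independent noise with variance (1 − ρ)/ρ, so that Var(X*)/Var(X) = ρ:

```r
# Sketch: add measurement error so the observed predictor has reliability rho.
# Assumes the latent predictor x_star is standardized (variance ~ 1).
add_error_for_reliability <- function(x_star, rho) {
  stopifnot(rho > 0, rho <= 1)
  e <- rnorm(length(x_star), mean = 0, sd = sqrt((1 - rho) / rho))
  x_star + e
}

set.seed(1)
x_star <- rnorm(1e5)                                  # latent scores, Var ~ 1
x_obs  <- add_error_for_reliability(x_star, rho = 0.8)
var(x_star) / var(x_obs)                              # empirical reliability, ~ 0.8
```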

Why OLS Is Relevant

  • Ordinary Least Squares (OLS) is the most common estimation method for linear regression.

  • It models the outcome as a linear combination of the predictors:

    $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \varepsilon$$

  • Coefficients are estimated by minimizing the sum of squared residuals:

    $$\text{minimize } \sum (Y - \hat{Y})^2$$

  • Interpretation of coefficients:

    β₀ (intercept): the expected outcome when all predictors equal 0.

    β₁ (slope): the expected difference in the outcome for a one-unit increase in that predictor, holding the others constant.

  • In regression theory: measurement error in predictors attenuates regression coefficients and lowers $R^2$ (demonstrated in the sketch after this list).

  • In SEM: predictors are modeled as latent variables with explicit reliabilities; attenuation is represented in the measurement model.

  • In this project: OLS provides the baseline model because its behavior under measurement error is well understood.
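
This attenuation is easy to reproduce directly. Below is a minimal sketch with a single standardized latent predictor (illustrative values, not the repo's pipeline); for reliability ρ, the fitted slope converges to ρβ:

```r
# Sketch: attenuation bias under measurement error (single predictor).
set.seed(42)
n      <- 1e5
beta1  <- 0.7   # true latent slope
rho    <- 0.6   # predictor reliability

x_star <- rnorm(n)                                       # latent predictor, Var = 1
y      <- beta1 * x_star + rnorm(n)                      # latent outcome model
x_obs  <- x_star + rnorm(n, sd = sqrt((1 - rho) / rho))  # observed predictor, reliability rho

coef(lm(y ~ x_star))[["x_star"]]  # ~ 0.70: unattenuated
coef(lm(y ~ x_obs))[["x_obs"]]    # ~ 0.42 = rho * beta1: attenuated
```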


Evaluation Metrics

Predictive performance is evaluated using two standard metrics: Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²).

  • Root Mean Squared Error (RMSE):

    $$RMSE = \sqrt{\tfrac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

    • Represents the average prediction error, expressed in the same units as the outcome variable.
    • Larger errors are penalized more strongly due to squaring.
    • Lower RMSE indicates better predictive accuracy.
  • Coefficient of Determination (R²):

    $$R^2 = 1 - \tfrac{MSE}{Var(Y)}$$

    • Represents the proportion of variance in the outcome explained by predictions.
    • Values closer to 1 indicate stronger predictive performance.
    • Under measurement error, R² decreases systematically, regardless of sample size or model complexity.

Together, RMSE and R² capture complementary aspects of model performance:

  • RMSE quantifies absolute prediction error.
  • R² quantifies relative explanatory power.

Both metrics highlight the consequences of measurement error: even flexible machine learning models cannot achieve high predictive performance when predictor reliability is low.
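
In R, both metrics reduce to one-liners over the test-set predictions (a sketch of the computation, not necessarily 03_validate.R verbatim; the file path and column names below are hypothetical):

```r
# Test-set performance metrics; y is the true outcome, y_hat the prediction.
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
r2   <- function(y, y_hat) 1 - mean((y - y_hat)^2) / var(y)

# Hypothetical usage on one prediction file:
# pred <- read.csv("data/pred/example_pred.csv")
# c(RMSE = rmse(pred$y, pred$y_hat), R2 = r2(pred$y, pred$y_hat))
```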


Results & Interpretation

The simulation study shows how predictor reliability ($\rho_X$) affects regression performance.

Aggregated performance

[Figure: aggregated performance by reliability]

  • As reliability increases, R² goes up and RMSE goes down.
  • With $\rho = 0.6$, predictive R² is about 0.30 (weaker model, more error).
  • With $\rho = 1.0$, predictive R² approaches the latent target of 0.50.

Replicate-level distributions

Predictive R² across replicates:

[Figure: replicate-level R² distribution]

Predictive RMSE across replicates:

[Figure: replicate-level RMSE distribution]

  • Each dot is one simulated dataset (replicate).
  • Boxplots summarize the spread of results at each reliability.
  • Lower reliability produces more attenuated R² and higher RMSE.
  • With perfect reliability (ρ = 1.0), results are closest to the true latent model.

Raincloud Plots: Distributions of Predictive Performance

To visualize how predictive performance varies across simulation replicates, we use raincloud plots.
These combine a density distribution (the “cloud”), a boxplot (median and interquartile range), and individual jittered points (each replicate).
This makes it easy to see both average trends and the spread of outcomes under different levels of predictor reliability; a sketch of how such a plot can be assembled follows the examples below.

  • Predictive R² rainclouds:

    [Figure: raincloud plot of R² across replicates]
    • At ρ = 0.6, distributions are shifted downward, showing that low reliability sharply limits the variance explained.
    • At ρ = 1.0, clouds shift upward, indicating stronger performance closer to the true latent model.
  • Predictive RMSE rainclouds:

    [Figure: raincloud plot of RMSE across replicates]
    • RMSE decreases systematically as reliability improves, mirroring the R² pattern.
    • Distributions are wider at low reliability, indicating instability across replicates.
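
A raincloud-style layering can be assembled from standard ggplot2 geoms. A sketch (the repo's plots.R may build these differently; a perf data frame with columns rho_X and R2 is assumed):

```r
# Sketch: raincloud-style plot of predictive R^2 by reliability level.
library(ggplot2)

raincloud_sketch <- function(perf) {
  ggplot(perf, aes(x = factor(rho_X), y = R2)) +
    geom_violin(alpha = 0.4, trim = FALSE) +           # the "cloud" (density)
    geom_boxplot(width = 0.15, outlier.shape = NA) +   # median and IQR
    geom_jitter(width = 0.08, alpha = 0.5, size = 1) + # one point per replicate
    labs(x = expression(rho[X]), y = expression(R^2))
}
```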

Interpretation:
The plots highlight that measurement error imposes a ceiling on predictive performance.


Summary

This project illustrates one core point: maximum predictive performance is bounded by the reliability of the predictors.

  • In regression, this appears as attenuation bias: coefficients shrink and predictive power decreases when predictors are noisy.
  • In ML, the same limit applies: flexible models cannot exceed the information content of the observed data.
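
For a single standardized predictor, this ceiling can be stated exactly; it is the classical psychometric attenuation identity, not something specific to this repo:

$$\hat{\beta}_{\text{obs}} \xrightarrow{p} \rho\,\beta, \qquad R^2_{\text{obs}} = \rho \cdot R^2_{\text{latent}}$$

With ρ = 0.6 and a latent target of R² = 0.50, this predicts an observed R² of about 0.30, which matches the aggregated results reported above.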
