Skip to content

gcol33/BORG

BORG

R-CMD-check Lifecycle: experimental License: MIT

Bounded Outcome Risk Guard for Model Evaluation

BORG detects data leakage and invalid cross-validation setups before you compute performance metrics. It checks for information reuse between training and test data, and blocks evaluation when problems are found.

Quick Start

library(BORG)

# Validate a train/test split
data <- iris
train_idx <- 1:100
test_idx <- 101:150

borg(data, train_idx = train_idx, test_idx = test_idx)

# Detect overlapping indices
borg(data, train_idx = 1:100, test_idx = 51:150)
#> Error: index_overlap - Train and test indices overlap (50 shared indices)

Statement of Need

A model shows 95% accuracy on test data, then drops to 60% in production. The usual cause: data leakage. Information from the test set contaminated training, and the reported metrics were wrong.

A Princeton meta-analysis found leakage errors in 648 published papers across 30 fields. In civil war prediction research, correcting leakage revealed that "complex ML models do not perform substantively better than decades-old Logistic Regression." The reported gains were artifacts.

BORG catches these errors before metrics are computed.

Features

Core Validation

  • borg(): Main entry point for all validation

    • Validates train/test splits against data
    • Detects preprocessing leakage (scaling, PCA fitted on full data)
    • Checks for target leakage (features derived from outcome)
    • Validates grouped data (same patient in train and test)
    • Validates temporal data (test predates training)
    • Validates spatial data (test points too close to training)
  • borg_inspect(): Detailed inspection of specific objects

    • Works with caret::preProcess, recipes::recipe, prcomp
    • Checks rsample resampling objects
    • Validates fitted models (lm, glm, ranger, etc.)
  • borg_diagnose(): Analyze data for dependency structure

    • Detects spatial autocorrelation (Moran's I)
    • Detects temporal autocorrelation (ACF/Ljung-Box)
    • Detects clustered structure (ICC)
    • Recommends appropriate CV strategy

Risk Categories

Category Impact Response
Hard Violation Results invalid Blocks evaluation
Soft Inflation Results biased Warns, allows with caution

Hard Violations:

  • index_overlap - Same row in train and test
  • duplicate_rows - Identical observations across sets
  • preprocessing_leak - Scaler/PCA fitted on full data
  • target_leakage - Feature with |r| > 0.99 with target
  • group_leakage - Same group in train and test
  • temporal_leak - Test data predates training

Soft Inflation:

  • proxy_leakage - Feature with |r| 0.95-0.99 with target
  • spatial_proximity - Test points close to training
  • spatial_overlap - Test inside training convex hull

Installation

# Install from GitHub
# install.packages("pak")
pak::pak("gcol33/BORG")

Usage Examples

Basic Validation

library(BORG)
data <- iris
train_idx <- 1:100
test_idx <- 101:150

# Returns BorgRisk object with validation results
result <- borg(data, train_idx = train_idx, test_idx = test_idx)
result

Detecting Preprocessing Leakage

# BAD: scale() fitted on all data before splitting
data_scaled <- scale(iris[, 1:4])
borg_inspect(data_scaled, train_idx = 1:100, test_idx = 101:150)
#> Hard violation: preprocessing_leak

Target Leakage Detection

# Feature highly correlated with outcome
leaky_data <- data.frame(
 x = rnorm(100),
 outcome = rnorm(100)
)
leaky_data$leaked <- leaky_data$outcome + rnorm(100, sd = 0.01)

borg_inspect(leaky_data, train_idx = 1:70, test_idx = 71:100, target = "outcome")
#> Hard violation: target_leakage_direct

Grouped Data Validation

# Clinical data with patient IDs
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients
set.seed(123)
idx <- sample(100)
train_idx <- idx[1:70]
test_idx <- idx[71:100]

borg_inspect(clinical, train_idx, test_idx, groups = "patient_id")
#> Hard violation: group_leakage

Spatial Data Validation

spatial_data <- data.frame(
  lon = runif(200, -10, 10),
  lat = runif(200, -10, 10),
  response = rnorm(200)
)

# Let BORG diagnose and generate appropriate CV folds
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response", v = 5)
result$diagnosis@recommended_cv
#> "spatial_block"

Framework Integration

BORG works with common ML frameworks:

# caret
library(caret)
pp <- preProcess(mtcars[, -1], method = c("center", "scale"))
borg_inspect(pp, train_idx = 1:25, test_idx = 26:32, data = mtcars)

# tidymodels
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
borg_inspect(rec, train_idx = 1:25, test_idx = 26:32, data = mtcars)

Interface Summary

Function Purpose
borg() Main entry point - diagnose data or validate splits
borg_inspect() Detailed inspection of objects
borg_diagnose() Analyze data dependencies
borg_validate() Validate complete workflow
borg_rewrite() Attempt automatic repair
plot() Visualize results
summary() Generate methods text
borg_certificate() Create validation certificate
borg_export() Export certificate to YAML/JSON

Documentation

License

MIT (see the LICENSE.md file)

About

Bounded Outcome Risk Guard for model evaluation under dependence

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors