https://doi.org/10.1515/sagmb-2018-0059
The goal of missPLS is to provide a methods-first R package for the
incomplete-data PLS workflows described in Nengsih, Bertrand,
Maumy-Bertrand, and Meyer (2019), Determining the Number of Components in
PLS Regression on Incomplete Data Set, and in Titin Agustin Nengsih's
thesis chapters devoted to NIPALS-PLS with missing predictors.
The package builds on plsRglm for PLS fitting, cross-validation, and
information criteria, and wraps the imputation strategies used in the
published comparisons through mice, VIM, and bcv.
missPLS provides:
- simulation helpers for Li et al.-style PLS data,
- MCAR and MAR missingness generators,
- imputation wrappers for MICE, KNN, and SVD workflows,
- component-selection helpers for Q2, AIC, AIC-DoF, BIC, and BIC-DoF,
- packaged real datasets used in the article and thesis,
- diagnostics and study runners for simulation and real-data analyses.
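
To make the difference between the two missingness mechanisms in the list above concrete, here is a minimal base-R sketch. It is not the missPLS implementation: `make_mcar()` and `make_mar()` are hypothetical helper names, the logistic weighting on `y` is only one possible way to induce MAR, and `prop` is a fraction here, whereas the package examples pass `missing_prop` as a percentage.

```r
# Illustrative sketch only; these helpers are hypothetical and are
# not part of missPLS.
set.seed(42)
n <- 200
p <- 5
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)

# MCAR: every cell is deleted with the same probability `prop`,
# independently of x and y.
make_mcar <- function(x, prop) {
  mask <- matrix(runif(length(x)) < prop, nrow = nrow(x))
  x[mask] <- NA
  x
}

# MAR (assumed mechanism): the deletion probability of a row depends on
# the fully observed response y, rescaled so the average rate is about `prop`.
make_mar <- function(x, y, prop) {
  w <- plogis(scale(y)[, 1])           # row weights in (0, 1)
  prob <- pmin(1, prop * w / mean(w))  # per-row deletion probability
  # rep() repeats the row probabilities once per column (column-major order)
  mask <- matrix(runif(length(x)) < rep(prob, times = ncol(x)),
                 nrow = nrow(x))
  x[mask] <- NA
  x
}

x_mcar <- make_mcar(x, prop = 0.10)
x_mar  <- make_mar(x, y, prop = 0.10)

# Overall missing rates, both expected to be close to 0.10
mean(is.na(x_mcar))
mean(is.na(x_mar))

# Under this MAR sketch, rows with larger y should lose more cells
tapply(rowMeans(is.na(x_mar)), y > median(y), mean)
```

Under MCAR the missing rate should look roughly the same in both halves of `y`; under the MAR sketch it should be higher for rows with above-median `y`.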
Heavy reproduction scripts live under tools/ and are intentionally kept
outside the package examples and tests.
This website and these examples were created by T. A. Nengsih, F. Bertrand, and M. Maumy-Bertrand.
Once missPLS is on CRAN, you will be able to install the released version
with:
install.packages("missPLS")

You can install the development version of missPLS from
GitHub with:
devtools::install_github("fbertran/missPLS")

The package ships with the four real-data studies used in the published work.
| Dataset | Rows | Predictors | Source |
|---|---|---|---|
| bromhexine | 23 | 64 | Pharmaceutical syrup study |
| tetracycline | 107 | 101 | Serum assay study |
| octane | 68 | 493 | NIR gasoline study |
| ozone_complete | 203 | 12 | Complete-case mlbench::Ozone study |

library(missPLS)
set.seed(1)
sim <- simulate_pls_data(n = 30, p = 12, true_ncomp = 2, seed = 1)
miss <- add_missingness(
  sim$x,
  sim$y,
  mechanism = "MCAR",
  missing_prop = 10,
  seed = 2
)
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)
sel_incomplete <- select_ncomp(
  x = miss$x_incomplete,
  y = sim$y,
  method = "nipals_standard",
  criterion = "Q2-10fold",
  max_ncomp = 4,
  seed = 4,
  folds = 5
)
sel_imputed <- select_ncomp(
  x = imp,
  y = sim$y,
  method = "complete",
  criterion = "AIC",
  max_ncomp = 4,
  seed = 5
)
sel_incomplete
#> selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1 nipals_standard Q2-10fold 3 0.9179554 4 4 1
#> status notes
#> 1 ok
sel_imputed
#> selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1 complete_knn AIC 4 55.81119 4 5 1
#> status notes
#> 1 ok Mode across 1 imputed datasets.

bromhexine
#> <misspls_dataset>
#> name: bromhexine
#> dimensions: 23 x 64
#> source: Goicoechea and Olivieri (1999a)
diag_bromhexine <- diagnose_real_data("bromhexine")
head(diag_bromhexine$response_correlations)
#> predictor correlation
#> 1 x7 0.7317195
#> 2 x8 0.8062487
#> 3 x9 0.8601593
#> 4 x10 0.8922337
#> 5 x11 0.9172034
#> 6 x12 0.9334266

The study runners orchestrate smoke runs inside the package and support the
heavier table and figure regeneration scripts under tools/.
results <- run_simulation_study(
  dimensions = list(c(20, 10)),
  true_ncomp = 2,
  missing_props = 10,
  mechanisms = "MCAR",
  reps = 1,
  seed = 10,
  max_ncomp = 3,
  criteria = "AIC",
  incomplete_methods = "nipals_standard",
  imputation_methods = "knn",
  folds = 5
)
results[, c("method", "criterion", "selected_ncomp", "matched_target", "status")]
#> method criterion selected_ncomp matched_target status
#> 1 Complete AIC 3 FALSE ok
#> 2 NIPALS-PLSR (standard) AIC 3 FALSE ok
#> 3 KNNimpute AIC 3 FALSE ok

For a longer walkthrough, start with the package vignette:
vignette("missPLS")