missPLS: Methods and Reproducible Workflows for Partial Least Squares with Missing Data

Titin Agustin Nengsih, Frederic Bertrand and Myriam Maumy-Bertrand

https://doi.org/10.1515/sagmb-2018-0059

missPLS is a methods-first R package for the incomplete-data PLS workflows described in Nengsih, Bertrand, Maumy-Bertrand, and Meyer (2019), "Determining the Number of Components in PLS Regression on Incomplete Data Set", and in the chapters of Titin Agustin Nengsih's thesis devoted to NIPALS-PLS with missing predictors.

The package builds on plsRglm for PLS fitting, cross-validation, and information criteria, and wraps the imputation strategies used in the published comparisons through mice, VIM, and bcv.

missPLS provides:

simulation helpers for Li et al.-style PLS data,
MCAR and MAR missingness generators,
imputation wrappers for MICE, KNN, and SVD workflows,
component-selection helpers for Q2, AIC, AIC-DoF, BIC, and BIC-DoF,
packaged real datasets used in the article and thesis,
diagnostics and study runners for simulation and real-data analyses.
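
To make the MCAR item in the list above concrete, here is a minimal base-R sketch of what "missing completely at random" means for a predictor matrix. The helper `make_mcar()` is hypothetical and is not part of missPLS. It assumes that the missing proportion is given as a percentage of all cells, which may differ from how the package's own generator behaves.

```r
# Hypothetical helper, not part of missPLS: set a fixed share of cells to NA,
# chosen uniformly at random and independently of the observed values (MCAR).
make_mcar <- function(x, missing_prop, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.matrix(x)
  # Assumption: missing_prop is a percentage of all cells in the matrix.
  n_missing <- round(length(x) * missing_prop / 100)
  idx <- sample(length(x), n_missing)  # cell positions drawn without replacement
  x[idx] <- NA
  x
}

x <- matrix(rnorm(30 * 12), nrow = 30, ncol = 12)
x_incomplete <- make_mcar(x, missing_prop = 10, seed = 2)

sum(is.na(x_incomplete))   # number of cells set to NA
mean(is.na(x_incomplete))  # share of cells set to NA
```

Under MAR, by contrast, the probability that a cell is missing would depend on observed values, so the uniform `sample()` call would be replaced by a draw whose weights depend on the data.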

Heavy reproduction scripts live under tools/ and are intentionally kept outside the package examples and tests.

This website and these examples were created by T. A. Nengsih, F. Bertrand, and M. Maumy-Bertrand.

Installation

Once missPLS is released on CRAN, you will be able to install it with:

install.packages("missPLS")

You can install the development version of missPLS from GitHub with:

devtools::install_github("fbertran/missPLS")

Included datasets

The package ships with the four real-data studies used in the published work.

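| Dataset          | Rows | Predictors | Source                               |
|------------------|------|------------|--------------------------------------|
| `bromhexine`     | 23   | 64         | Pharmaceutical syrup study           |
| `tetracycline`   | 107  | 101        | Serum assay study                    |
| `octane`         | 68   | 493        | NIR gasoline study                   |
| `ozone_complete` | 203  | 12         | Complete-case `mlbench::Ozone` study |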

Quick start

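library(missPLS)

# Simulate a complete data set with two true PLS components
set.seed(1)
sim <- simulate_pls_data(n = 30, p = 12, true_ncomp = 2, seed = 1)

# Introduce MCAR missingness into the predictors
miss <- add_missingness(
  sim$x,
  sim$y,
  mechanism = "MCAR",
  missing_prop = 10,
  seed = 2
)

# Impute the incomplete predictors with KNN
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)

# Select the number of components directly on the incomplete data (NIPALS-PLS)
sel_incomplete <- select_ncomp(
  x = miss$x_incomplete,
  y = sim$y,
  method = "nipals_standard",
  criterion = "Q2-10fold",
  max_ncomp = 4,
  seed = 4,
  folds = 5
)

# Select the number of components on the imputed data
sel_imputed <- select_ncomp(
  x = imp,
  y = sim$y,
  method = "complete",
  criterion = "AIC",
  max_ncomp = 4,
  seed = 5
)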

sel_incomplete
#>   selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1  nipals_standard Q2-10fold              3       0.9179554         4    4             1
#>   status notes
#> 1     ok
sel_imputed
#>   selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1     complete_knn       AIC              4        55.81119         4    5             1
#>   status                           notes
#> 1     ok Mode across 1 imputed datasets.
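
The note "Mode across 1 imputed datasets." in the output above suggests that, when several imputed datasets are available, the selected number of components is the most frequent value across them. The base-R sketch below illustrates that aggregation step only. The helper `mode_ncomp()` is hypothetical, and its tie-breaking rule (smallest value wins) is an assumption, not documented package behaviour.

```r
# Hypothetical helper, not part of missPLS: most frequent number of components.
# Assumption: ties are broken in favour of the smallest value.
mode_ncomp <- function(ncomp) {
  counts <- table(ncomp)  # table() orders the distinct values increasingly
  as.integer(names(counts)[which.max(counts)])  # which.max() keeps the first maximum
}

# Example: component counts selected on five imputed datasets
selected <- c(2, 3, 3, 4, 3)
mode_ncomp(selected)
```

With a single imputed dataset, as in the quick start, the mode is simply the one selected value.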

Real-data diagnostics

bromhexine
#> <misspls_dataset>
#>   name: bromhexine 
#>   dimensions: 23 x 64 
#>   source: Goicoechea and Olivieri (1999a)

diag_bromhexine <- diagnose_real_data("bromhexine")
head(diag_bromhexine$response_correlations)
#>   predictor correlation
#> 1        x7   0.7317195
#> 2        x8   0.8062487
#> 3        x9   0.8601593
#> 4       x10   0.8922337
#> 5       x11   0.9172034
#> 6       x12   0.9334266
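
For readers who want to see what a table of this shape contains, the sketch below builds a predictor-by-predictor correlation with the response in base R. It runs on simulated data, not on `bromhexine`, and it only mirrors the two columns shown above. How `diagnose_real_data()` computes, filters, or orders its own table is not assumed here.

```r
# Simulated stand-in data (not a packaged dataset): 23 rows, 5 predictors.
set.seed(1)
x <- matrix(rnorm(23 * 5), nrow = 23,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- x[, "x1"] + rnorm(23, sd = 0.1)  # response driven mainly by x1

# Pearson correlation of each predictor column with the response
response_correlations <- data.frame(
  predictor = colnames(x),
  correlation = as.numeric(cor(x, y))
)

head(response_correlations)
```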

Study runners

The study runners drive small smoke-test runs inside the package and underpin the heavier table and figure regeneration scripts under tools/.

results <- run_simulation_study(
  dimensions = list(c(20, 10)),
  true_ncomp = 2,
  missing_props = 10,
  mechanisms = "MCAR",
  reps = 1,
  seed = 10,
  max_ncomp = 3,
  criteria = "AIC",
  incomplete_methods = "nipals_standard",
  imputation_methods = "knn",
  folds = 5
)

results[, c("method", "criterion", "selected_ncomp", "matched_target", "status")]
#>                   method criterion selected_ncomp matched_target status
#> 1               Complete       AIC              3          FALSE     ok
#> 2 NIPALS-PLSR (standard)       AIC              3          FALSE     ok
#> 3              KNNimpute       AIC              3          FALSE     ok
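
A result table like the one above can be summarised with base R. The sketch below re-creates the three printed rows by hand (so that it runs without the package) and computes, per method and criterion, the share of runs in which `matched_target` is TRUE. Reading `matched_target` as "the selected number of components equals `true_ncomp`" is an interpretation of the output, not a documented definition.

```r
# Hand-copied stand-in for the printed columns of `results`
results <- data.frame(
  method = c("Complete", "NIPALS-PLSR (standard)", "KNNimpute"),
  criterion = "AIC",
  selected_ncomp = c(3, 3, 3),
  matched_target = c(FALSE, FALSE, FALSE),
  status = "ok"
)

# Share of successful runs in which the target was matched
match_rate <- aggregate(
  matched_target ~ method + criterion,
  data = results[results$status == "ok", ],
  FUN = mean
)

match_rate
```

With `reps = 1` each rate can only be 0 or 1. The same call gives proportions once the study is run with more repetitions.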

For a longer walkthrough, start with the package vignette:

vignette("missPLS")
