missPLS: Methods and Reproducible Workflows for Partial Least Squares with Missing Data

Titin Agustin Nengsih, Frederic Bertrand and Myriam Maumy-Bertrand

https://doi.org/10.1515/sagmb-2018-0059

Lifecycle: experimental. Project status: Active (the project has reached a stable, usable state and is being actively developed).

missPLS is a methods-first R package for the incomplete-data PLS workflows described in Nengsih, Bertrand, Maumy-Bertrand, and Meyer (2019), "Determining the Number of Components in PLS Regression on Incomplete Data Set", and in the chapters of Titin Agustin Nengsih's thesis devoted to NIPALS-PLS with missing predictors.

The package builds on plsRglm for PLS fitting, cross-validation, and information criteria, and wraps the imputation strategies used in the published comparisons through mice, VIM, and bcv.

missPLS provides:

  • simulation helpers for Li et al.-style PLS data,
  • MCAR and MAR missingness generators,
  • imputation wrappers for MICE, KNN, and SVD workflows,
  • component-selection helpers for Q2, AIC, AIC-DoF, BIC, and BIC-DoF,
  • packaged real datasets used in the article and thesis,
  • diagnostics and study runners for simulation and real-data analyses.
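
To make the MCAR item above concrete, here is a minimal base-R sketch of what "missing completely at random" means for a predictor matrix. The helper `make_mcar()` and its `missing_pct` argument are hypothetical names used only for illustration; this is not the package's `add_missingness()` implementation, whose internals are not described here.

```r
# Hypothetical helper (not part of missPLS): blank out a fixed percentage of
# the cells of a numeric matrix completely at random (MCAR).
make_mcar <- function(x, missing_pct, seed = NULL) {
  stopifnot(is.matrix(x), missing_pct >= 0, missing_pct < 100)
  if (!is.null(seed)) set.seed(seed)
  n_missing <- round(length(x) * missing_pct / 100)
  # Every cell has the same chance of being removed, whatever its value.
  idx <- sample.int(length(x), size = n_missing)
  x[idx] <- NA
  x
}

set.seed(1)
x <- matrix(rnorm(30 * 12), nrow = 30, ncol = 12)
x_incomplete <- make_mcar(x, missing_pct = 10, seed = 2)
mean(is.na(x_incomplete))  # observed share of missing cells
```

Under MAR, by contrast, the chance that a cell is removed would depend on observed values, so the uniform `sample.int()` draw would be replaced by a draw with value-dependent probabilities.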

Heavy reproduction scripts live under tools/ and are intentionally kept outside the package examples and tests.

This website and these examples were created by T. A. Nengsih, F. Bertrand, and M. Maumy-Bertrand.

Installation

Once missPLS is released on CRAN, you will be able to install the released version with:

install.packages("missPLS")

You can install the development version of missPLS from GitHub with:

devtools::install_github("fbertran/missPLS")

Included datasets

The package ships with the four real-data studies used in the published work.

Dataset         Rows  Predictors  Source
bromhexine        23          64  Pharmaceutical syrup study
tetracycline     107         101  Serum assay study
octane            68         493  NIR gasoline study
ozone_complete   203          12  Complete-case mlbench::Ozone study

Quick start

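library(missPLS)

# Simulate a small PLS data set with two true components.
set.seed(1)
sim <- simulate_pls_data(n = 30, p = 12, true_ncomp = 2, seed = 1)

# Add MCAR missingness to the predictors.
miss <- add_missingness(
  sim$x,
  sim$y,
  mechanism = "MCAR",
  missing_prop = 10,
  seed = 2
)

# Impute the incomplete predictors with the KNN wrapper.
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)

# Select the number of components directly on the incomplete data (NIPALS).
sel_incomplete <- select_ncomp(
  x = miss$x_incomplete,
  y = sim$y,
  method = "nipals_standard",
  criterion = "Q2-10fold",
  max_ncomp = 4,
  seed = 4,
  folds = 5
)

# Select the number of components on the imputed data with AIC.
sel_imputed <- select_ncomp(
  x = imp,
  y = sim$y,
  method = "complete",
  criterion = "AIC",
  max_ncomp = 4,
  seed = 5
)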

sel_incomplete
#>   selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1  nipals_standard Q2-10fold              3       0.9179554         4    4             1
#>   status notes
#> 1     ok
sel_imputed
#>   selection_method criterion selected_ncomp criterion_value max_ncomp seed n_imputations
#> 1     complete_knn       AIC              4        55.81119         4    5             1
#>   status                           notes
#> 1     ok Mode across 1 imputed datasets.
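
The note "Mode across 1 imputed datasets." suggests that, when several imputed datasets are available, the reported number of components is the most frequent choice among them. The base-R sketch below illustrates that idea only. The helper `mode_ncomp()` is hypothetical, and its tie-breaking rule (smallest value wins) is an assumption, not documented package behaviour.

```r
# Hypothetical helper (not part of missPLS): most frequent selected number of
# components; ties go to the smallest value, an assumption made here.
mode_ncomp <- function(selected) {
  counts <- table(selected)
  winners <- as.integer(names(counts)[counts == max(counts)])
  min(winners)
}

mode_ncomp(c(2, 3, 3, 2, 3))  # three of the five imputations chose 3
mode_ncomp(4)                 # a single imputation is its own mode
```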

Real-data diagnostics

bromhexine
#> <misspls_dataset>
#>   name: bromhexine 
#>   dimensions: 23 x 64 
#>   source: Goicoechea and Olivieri (1999a)

diag_bromhexine <- diagnose_real_data("bromhexine")
head(diag_bromhexine$response_correlations)
#>   predictor correlation
#> 1        x7   0.7317195
#> 2        x8   0.8062487
#> 3        x9   0.8601593
#> 4       x10   0.8922337
#> 5       x11   0.9172034
#> 6       x12   0.9334266
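
For readers who want to see how a table shaped like `response_correlations` can be produced, here is a base-R sketch on toy data. It assumes the column holds plain Pearson correlations between each predictor and the response. That is a reading of the output above, not a statement about how `diagnose_real_data()` computes it.

```r
set.seed(1)
# Toy stand-in for a real dataset: 23 rows, 5 predictors named x1..x5.
x <- matrix(rnorm(23 * 5), nrow = 23, dimnames = list(NULL, paste0("x", 1:5)))
y <- x[, 2] + rnorm(23, sd = 0.1)  # response driven mostly by x2

response_correlations <- data.frame(
  predictor = colnames(x),
  # Pearson correlation of each predictor column with the response;
  # pairwise deletion keeps the call usable when x contains NA values.
  correlation = as.numeric(cor(x, y, use = "pairwise.complete.obs")),
  stringsAsFactors = FALSE
)

# Strongest absolute correlations first.
response_correlations[order(-abs(response_correlations$correlation)), ]
```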

Study runners

The study runners orchestrate smoke runs inside the package and support the heavier table and figure regeneration scripts under tools/.

results <- run_simulation_study(
  dimensions = list(c(20, 10)),
  true_ncomp = 2,
  missing_props = 10,
  mechanisms = "MCAR",
  reps = 1,
  seed = 10,
  max_ncomp = 3,
  criteria = "AIC",
  incomplete_methods = "nipals_standard",
  imputation_methods = "knn",
  folds = 5
)

results[, c("method", "criterion", "selected_ncomp", "matched_target", "status")]
#>                   method criterion selected_ncomp matched_target status
#> 1               Complete       AIC              3          FALSE     ok
#> 2 NIPALS-PLSR (standard)       AIC              3          FALSE     ok
#> 3              KNNimpute       AIC              3          FALSE     ok
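
With more repetitions, a table like this is usually summarised as the share of runs in which each method recovered the true number of components. The sketch below rebuilds the three rows shown above by hand and assumes that `matched_target` simply means `selected_ncomp == true_ncomp`. It uses base R only and does not call any missPLS function.

```r
# Rows copied from the output above; true_ncomp = 2 was the target.
results <- data.frame(
  method = c("Complete", "NIPALS-PLSR (standard)", "KNNimpute"),
  criterion = "AIC",
  selected_ncomp = c(3, 3, 3),
  stringsAsFactors = FALSE
)
true_ncomp <- 2

# Assumed meaning of matched_target: the selected count equals the true count.
results$matched_target <- results$selected_ncomp == true_ncomp

# Share of runs that recovered the target, per method and criterion.
hit_rate <- aggregate(matched_target ~ method + criterion,
                      data = results, FUN = mean)
hit_rate
```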

For a longer walkthrough, start with the package vignette:

vignette("missPLS")
