version 0.3.0

cran · Jan 30, 2021 · 4eb2398
1 parent d58c1b7
commit 4eb2398
Showing 13 changed files with 226 additions and 202 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: rMIDAS
 Title: Multiple Imputation with Denoising Autoencoders
-Version: 0.2.0
+Version: 0.3.0
 Authors@R: c(
     person(given = "Thomas",
            family = "Robinson",
@@ -28,11 +28,11 @@ License: Apache License (>= 2.0)
 URL: https://github.com/MIDASverse/rMIDAS
 BugReports: https://github.com/MIDASverse/rMIDAS/issues
 NeedsCompilation: no
-Packaged: 2020-11-02 10:19:04 UTC; tomrobinson
+Packaged: 2021-01-30 00:25:27 UTC; tomrobinson
 Author: Thomas Robinson [aut, cre, cph]
     (<https://orcid.org/0000-0001-7097-1599>),
   Ranjit Lall [aut, cph] (<https://orcid.org/0000-0003-1455-3506>),
   Alex Stenlake [ctb, cph]
 Maintainer: Thomas Robinson <ts.robinson1994@gmail.com>
 Repository: CRAN
-Date/Publication: 2020-11-02 14:30:02 UTC
+Date/Publication: 2021-01-30 05:50:02 UTC
diff --git a/MD5 b/MD5
@@ -1,46 +1,46 @@
-2b171ba05938f12b9ff162ef220997f3 *DESCRIPTION
+450e99720eaab2d1ebe8b385a8afe763 *DESCRIPTION
 4b0e8158ffc6dfd349c09ae2cf78bf79 *NAMESPACE
-c1edbd7e866006350912efce4e6369e0 *NEWS.md
+1ae7dbcfd58ed0252e8747c09a518e87 *NEWS.md
 310b8c5ebd9c9142693bb13d801fb34c *R/load_utils.R
-09ca13e6eb832e0d3522b31ff3027cb0 *R/midas_functions.R
-4757396c6fd7cfc779ae0cc8bf5d5570 *R/pre_processing.R
+54bcecf6a71938796a02be05fb02cd16 *R/midas_functions.R
+35da741b0f6ece834d88d4e2c48ae048 *R/pre_processing.R
 0ef1815140875b09e36db14da6177157 *R/rubin_analysis.R
 5f0f7a477bc22f93c04aa9388a5b1059 *R/setup.R
 dfae92d29cde2eb804d727b91e0b02c2 *R/zzz.R
-729d309d5783d4579450c236877e3f38 *README.md
+675c3fb7b84afdfefcdee30fc47eefb7 *README.md
 a530231b4224e7197f03e30437bf016a *build/vignette.rds
 d49090aff7e7d8f2614da7cfcc8dda64 *inst/CITATION
 5387350a6623a416df9777722d1b85e4 *inst/doc/custom_python_versions.R
 f484f6445b5229eddb00fbd8881b021b *inst/doc/custom_python_versions.Rmd
-d363889d3288b14262b37ed0bc6f30f7 *inst/doc/custom_python_versions.html
+c1938cdce11c36f9c9ead57cee3e6223 *inst/doc/custom_python_versions.html
 072b1f5c9f87b73195cf00bb70bbd106 *inst/doc/imputation_demo.R
 78db257a69cec810b48a1dc8bac655d5 *inst/doc/imputation_demo.Rmd
-e4998ff417b08cead5b57004ddfb0546 *inst/doc/imputation_demo.html
+76e9243e2e6d460bb67fd07555456126 *inst/doc/imputation_demo.html
 64fba0a91b3a21bbbf50fa4f8a600741 *inst/examples/basic_workflow.R
 a67958e3009099ce054dfeb6836ecf22 *inst/examples/overimputation.R
 6a02980078146cc9bdd061054e3b4bfc *inst/python/__pycache__/midas_base.cpython-36.pyc
 393fee7a0486f0c55914ee243ecffa35 *inst/python/__pycache__/midas_base.cpython-37.pyc
 eb00af692b487faec78c83b934afa54f *inst/python/__pycache__/midas_base.cpython-38.pyc
 eb3f66e25ec913f65e7241c00dcfd437 *inst/python/__pycache__/midas_base.cpython-39.pyc
-6c6b9344914d38c01c194da81936c1ae *inst/python/midas_base.py
+73b4dc9395165c4e27482f00d11e8592 *inst/python/midas_base.py
 488fb2cb3ec9334b78245fb4e16f1937 *man/add_bin_labels.Rd
 c108ce6825a14d383201734677a83d54 *man/add_missingness.Rd
 a5a30b2cc7b62e1bd61a17d593e17927 *man/coalesce_one_hot.Rd
 c13c5d8b933d5dd21459c139ebf43538 *man/col_minmax.Rd
 2bb86f17a81bf892ae7e2b191f6f95fe *man/combine.Rd
-52765cb1cd31227c85bcb07578fb1cf4 *man/complete.Rd
-769664053c3840ee3b52e50bb06626b5 *man/convert.Rd
+723a044dfc96ac49267f435af3893427 *man/complete.Rd
+48413a5a794caa38137a9bc36ca02c4c *man/convert.Rd
 ee448b97e60bee367e6e5f98ba823b16 *man/figures/logo.png
 551e6d455d913ad82f2885625e08708b *man/figures/logos.drawio
 0029cb10bab1c771f84aa1892f749efd *man/import_midas.Rd
 a796458731eabf8e9e6bc35b2d828cf0 *man/mid_py_setup.Rd
 94665912e58b1aa50605a3fd9e98b08c *man/midas_setup.Rd
 094823bf4af2a3d7fed849c42a087ee7 *man/na_to_nan.Rd
-7ff82823157f7002afe711ec1330af73 *man/overimpute.Rd
+d4c2d7ae85b4b52fb50e33373fac6211 *man/overimpute.Rd
 67d01f8dd1cf19b84a79632f047922f4 *man/python_init.Rd
 bf582d11c2bbd5bdad111aaad3be8b36 *man/set_python_env.Rd
 2aaaff54cab764411588113f92dbc22a *man/skip_if_no_numpy.Rd
-43fcb71f56acc52f6b0e398235193857 *man/train.Rd
+62467aedb62fbcd1df4133a204f2a7e9 *man/train.Rd
 499c92fc7cdd4e54e555f49b27dd88c9 *man/undo_minmax.Rd
 316d76f801fd61c6d2121a1687c0c4cd *tests/testthat.R
 d7ce0a0c02e69876725d5e39cc609feb *tests/testthat/testAnalysis.R

diff --git a/NEWS.md b/NEWS.md
@@ -1,21 +1,23 @@
-## rMIDAS 0.2
+# rMIDAS 0.3
 
-* rMIDAS now fully supports both Tensorflow 1.X and 2.X
+* Minor updates to underlying Python code to mirror MIDASpy v1.2.1
 
-* Added two vignettes for demonstrating imputation workflow and configuring Python installs/environments
+* Added NULL defaults to cat_cols and bin_cols parameters within `rMIDAS::convert()`
 
-* Streamlined handling of Python configuration and interface with **reticulate**
+* Overimputation legend now plotted in bottom-right corner of figure
 
-* Added a `fast` parameter to the `complete()` function, giving users more flexibility on how to handle predicted probabilities for categorical and binary variables.
+* Minor changes to README
 
-* Added function `add_missingness()` to spike-in missingness for examples
+# rMIDAS 0.2
 
+* rMIDAS now fully supports both Tensorflow 1.X and 2.X
+* Added two vignettes for demonstrating imputation workflow and configuring Python installs/environments
+* Streamlined handling of Python configuration and interface with **reticulate**
+* Added a `fast` parameter to the `complete()` function, giving users more flexibility on how to handle predicted probabilities for categorical and binary variables.
+* Added function `add_missingness()` to spike-in missingness for examples
 * Minor changes to README
-
 * Minor changes to DESCRIPTION including title and description fields
-
 * Replaced all instances of `cat()` with `message()` for better logging
-
 * Bug fixes related to GitHub issues
 
 # rMIDAS 0.1

diff --git a/R/midas_functions.R b/R/midas_functions.R
@@ -25,6 +25,7 @@ import_midas <- function(...) {
 #' @param learn_rate A number, the learning rate \eqn{\gamma} (default = 0.0001), which controls the size of the weight adjustment in each training epoch. In general, higher values reduce training time at the expense of less accurate results.
 #' @param input_drop A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance.
 #' @param seed An integer, the value to which \proglang{Python}'s pseudo-random number generator is initialized. This enables users to ensure that data shuffling, weight and bias initialization, and missingness indicator vectors are reproducible.
+#' @param train_batch An integer, the number of observations in training mini-batches (default = 16).
 #' @param latent_space_size An integer, the number of normal dimensions used to parameterize the latent space.
 #' @param cont_adj A number, weights the importance of continuous variables in the loss function
 #' @param binary_adj A number, weights the importance of binary variables in the loss function
@@ -46,6 +47,7 @@ train <- function(data,
                    learn_rate = 0.0004,
                    input_drop = 0.8,
                    seed=123L,
+                   train_batch = 16L,
                    latent_space_size = 4,
                    cont_adj= 1.0,
                    binary_adj= 1.0,
@@ -56,7 +58,6 @@ train <- function(data,
                    vae_sample_var = 1.0) {
 
   ## Parameters not integrated:
-  # train_batch = 16,
   # output_layers= 'reversed',
   # loss_scale= 1,
   # init_scale= 1,
@@ -81,6 +82,7 @@ train <- function(data,
                            learn_rate = learn_rate,
                            input_drop = input_drop,
                            seed = as.integer(seed),
+                           train_batch = as.integer(train_batch),
                            vae_layer = vae_layer,
                            latent_space_size = as.integer(latent_space_size),
                            cont_adj = cont_adj,
@@ -122,7 +124,7 @@ train <- function(data,
 #' @param mid_obj Object of class `midas`, the result of running `rMIDAS::train()`
 #' @param m An integer, the number of completed datasets required
 #' @param file Path to save completed datasets. If `NULL`, completed datasets are only loaded into memory.
-#' @param file_root A character string, used as the root for all filenames when saving completed datasets if a `filepath` is supplied. If no file_root is provided, saved datasets will be saved as "file/midas_impute_yymmdd_hhmmss_m.csv"
+#' @param file_root A character string, used as the root for all filenames when saving completed datasets if a `filepath` is supplied. If no file_root is provided, completed datasets will be saved as "file/midas_impute_yymmdd_hhmmss_m.csv"
 #' @param unscale Boolean, indicating whether to unscale any columns that were previously minmax scaled between 0 and 1
 #' @param bin_label Boolean, indicating whether to add back labels for binary columns
 #' @param cat_coalesce Boolean, indicating whether to decode the one-hot encoded categorical variables
@@ -216,7 +218,7 @@ complete <- function(mid_obj,
 
     }
 
-    return(df)
+    return(as.data.frame(df))
 
   })
 
@@ -259,6 +261,7 @@ complete <- function(mid_obj,
 #' @param plot_vars Boolean, specifies whether to plot the distribution of original versus overimputed values. This takes the form of a density plot for continuous variables and a barplot for categorical variables (showing proportions of each class).
 #' @param skip_plot Boolean, specifies whether to suppress the main graphical output. This may be desirable when users are conducting a series of overimputation exercises and are primarily interested in the console output. **Note**, when `skip_plot = FALSE`, users must manually close the resulting pyplot window before the code will terminate.
 #' @param spike_seed,seed An integer, to initialize the pseudo-random number generators. Separate seeds can be provided for the spiked-in missingness and imputation, otherwise `spike_seed` is set to `seed` (default = 123L).
+#' @param save_path String, indicating path to directory to save overimputation figures. Users should include a trailing "/" at the end of the path i.e. save_path = "path/to/figures/".
 #' @inheritParams train
 #' @seealso \code{\link{train}} for the main imputation function.
 #' @export
@@ -276,12 +279,14 @@ overimpute <- function(# Input data
                        plot_vars = FALSE,
                        skip_plot = FALSE,
                        spike_seed = NULL,
+                       save_path = "",
 
                        # MIDAS model parameters
                        layer_structure = c(256,256,256),
                        learn_rate = 0.0004,
                        input_drop = 0.8,
                        seed=123L,
+                       train_batch=16L,
                        latent_space_size = 4,
                        cont_adj= 1.0,
                        binary_adj= 1.0,
@@ -316,14 +321,15 @@ overimpute <- function(# Input data
   }
 
   if (plot_vars) {
-    message("**Note**: Plotting variables is enabled.\n Overimputation will not proceed until these graphs are closed.")
+    message("**Note**: Plotting for individual variables is enabled.\nIf your dataset has many variables, this will generate a lot of files!\nTo run without plotting variable graphs, set plot_vars = FALSE\n")
   }
 
 
   mod_inst <- import_midas(layer_structure = as.integer(layer_structure),
                            learn_rate = learn_rate,
                            input_drop = input_drop,
                            seed = as.integer(seed),
+                           train_batch = as.integer(train_batch),
                            vae_layer = vae_layer,
                            latent_space_size = as.integer(latent_space_size),
                            cont_adj = cont_adj,
@@ -358,7 +364,9 @@ overimpute <- function(# Input data
                                       plot_vars = plot_vars,
                                       skip_plot = skip_plot,
                                       plot_main = FALSE,
-                                      spike_seed = as.integer(spike_seed))
+                                      spike_seed = as.integer(spike_seed),
+                                      save_figs = TRUE,
+                                      save_path = save_path)
 
   return(mod_overimp)
 

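The new `save_path` documentation asks users to supply a trailing "/" themselves. A minimal sketch of how a caller might normalise the path before passing it to `overimpute()` follows. The helper `ensure_trailing_slash()` is hypothetical and not part of rMIDAS.

```r
# Hypothetical helper (not part of rMIDAS): append "/" when it is missing.
# An empty string is returned unchanged, since "" is the default save_path.
ensure_trailing_slash <- function(path) {
  if (nzchar(path) && !endsWith(path, "/")) {
    paste0(path, "/")
  } else {
    path
  }
}

with_slash    <- ensure_trailing_slash("path/to/figures")
already_slash <- ensure_trailing_slash("path/to/figures/")
empty_default <- ensure_trailing_slash("")

print(with_slash)
```

The result could then be supplied as `save_path = ensure_trailing_slash(my_dir)`, where `my_dir` is whatever directory the user has chosen.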
diff --git a/R/pre_processing.R b/R/pre_processing.R
@@ -31,7 +31,7 @@
 #' cat <- c("a","f")
 #'
 #' convert(data, bin_cols = bin, cat_cols = cat)
-convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
+convert <- function(data, bin_cols = NULL, cat_cols = NULL, minmax_scale = FALSE) {
 
   # Check data input
 
@@ -72,7 +72,26 @@ convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
     data_cat_oh <- mltools::one_hot(data_cat, cols = names(data_cat))
 
     cat_lists <- lapply(cat_cols,
-                        function(x) c(names(data_cat_oh)[startsWith(names(data_cat_oh),paste0(x,"_"))]))
+                        function(x) {
+
+                          tmp_names <- names(data_cat_oh)
+                          # Locate whether other variables share same root e.g. c("var1", "var1_other")
+                          if (sum(grepl(x, cat_cols)) > 1) {
+
+                            var_matches <- cat_cols[grep(x, cat_cols)]
+
+                            # Get vector of variables to remove from matching
+                            del_vars <- var_matches[!(var_matches == x)]
+
+                            # Loop through and delete
+                            for (del_var in del_vars) {
+                              tmp_names <- tmp_names[!grepl(del_var, tmp_names)]
+                            }
+                          }
+
+                          # Now find one-hot encoded names
+                          c(tmp_names[startsWith(tmp_names,paste0(x,"_"))])
+                        })
 
   }
 
@@ -91,7 +110,7 @@ convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
     if (!(sum(!is.na(b_vals)) == 2)) {
       stop("Column '",bin_col,"' does not have two non-missing values")
 
-    } else if (sum(b_vals[!is.na(b_vals)] %in% c(1,2)) != 2) {
+    } else if (sum(b_vals[!is.na(b_vals)] %in% c(1,0)) != 2) {
 
       data_bin[,bin_col] <- ifelse(data_bin[,bin_col, with = FALSE] == b_vals[!is.na(b_vals)][1],
                                    1L,0L)
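
The change from `c(1,2)` to `c(1,0)` in the hunk above affects which binary columns are recoded. The sketch below reproduces only the condition, assuming `b_vals` holds the unique values of the column (that assignment is not shown in the diff). The function `needs_recode()` is a hypothetical name used for illustration.

```r
# Hypothetical wrapper around the condition in the hunk above.
# Returns TRUE when a two-valued column would be recoded to 1L/0L.
needs_recode <- function(x, allowed) {
  b_vals <- unique(x)               # assumption: mirrors how b_vals is built
  vals <- b_vals[!is.na(b_vals)]
  sum(vals %in% allowed) != 2
}

already_binary <- c(0, 1, NA, 1, 0)

old_check <- needs_recode(already_binary, allowed = c(1, 2))   # old condition
new_check <- needs_recode(already_binary, allowed = c(1, 0))   # new condition
text_check <- needs_recode(c("yes", "no", NA), allowed = c(1, 0))

print(c(old = old_check, new = new_check, text = text_check))
```

Under the old condition a column already coded 0/1 is sent through the `ifelse()` recode, which assigns `1L` to whichever value appears first and so can flip the coding. Under the new condition such a column is left as it is.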

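The one-hot matching change in `convert()` addresses categorical variables that share a root, such as `c("var1", "var1_other")`. The sketch below contrasts the old prefix-only match with the logic added in this commit, using made-up column names. `old_match()` and `new_match()` are illustrative wrappers, not functions exported by rMIDAS.

```r
# Made-up inputs: two categorical variables sharing the root "var1"
cat_cols <- c("var1", "var1_other")
oh_names <- c("var1_a", "var1_b", "var1_other_x", "var1_other_y")

# Old logic: prefix match only
old_match <- function(x) {
  oh_names[startsWith(oh_names, paste0(x, "_"))]
}

# New logic, following the added lines in the hunk above
new_match <- function(x) {
  tmp_names <- oh_names
  # Does another categorical variable contain this name?
  if (sum(grepl(x, cat_cols)) > 1) {
    var_matches <- cat_cols[grep(x, cat_cols)]
    del_vars <- var_matches[!(var_matches == x)]
    # Drop one-hot columns that belong to the longer-named variables
    for (del_var in del_vars) {
      tmp_names <- tmp_names[!grepl(del_var, tmp_names)]
    }
  }
  tmp_names[startsWith(tmp_names, paste0(x, "_"))]
}

old_var1 <- old_match("var1")
new_var1 <- new_match("var1")
new_other <- new_match("var1_other")

print(old_var1)
print(new_var1)
```

With prefix matching alone, `"var1"` also picks up the `var1_other_*` columns. The added loop removes those columns before the prefix match runs.
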
diff --git a/README.md b/README.md
@@ -5,34 +5,33 @@
 
 <!-- badges: start -->
 
-[![CRAN
-status](https://www.r-pkg.org/badges/version/rMIDAS)](https://cran.r-project.org/package=rMIDAS)
-[![R build
-status](https://github.com/MIDASverse/rMIDAS/workflows/R/badge.svg)](https://github.com/tidyverse/dplyr/actions?workflow=R)
 [![R build
 status](https://github.com/tsrobinson/rMIDAS/workflows/R-CMD-check/badge.svg)](https://github.com/MIDASverse/rMIDAS/actions)
+[![CRAN
+status](https://www.r-pkg.org/badges/version/rMIDAS)](https://cran.r-project.org/package=rMIDAS)
+[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)
+[![Last-changedate](https://img.shields.io/badge/last%20change-2021--01--29-yellowgreen.svg)](https://github.com/MIDASverse/rMIDAS/commits/master)
 <!-- badges: end -->
 
 ## Overview
 
-**rMIDAS** is an R package for multiply imputing missing data using an
-accurate and efficient algorithm based on deep learning methods. The
-package provides a simplified workflow for imputing and then analyzing
-data:
+**rMIDAS** is an R package for accurate and efficient multiple
+imputation using deep learning methods. The package provides a
+simplified workflow for imputing and then analyzing data:
 
   - `convert()` carries out all necessary preprocessing steps
-  - `train()` constructs and trains a MIDAS imputation model.
+  - `train()` constructs and trains a MIDAS imputation model
   - `complete()` generates multiple completed datasets from the trained
     model
   - `combine()` runs regression analysis across the complete data,
-    following Rubin’s Rules.
+    following Rubin’s combination rules
 
 **rMIDAS** is based on the Python class
 [MIDASpy](https://github.com/MIDASverse/MIDASpy).
 
 ### Efficient handling of large data
 
-**rMIDAS** also incorporates several features to streamline and improve
+rMIDAS also incorporates several features to streamline and improve the
 the efficiency of multiple imputation analysis:
 
   - Optimisation for large datasets using `data.table` and `mltools`
@@ -52,7 +51,7 @@ to Handle Missing Values in Large and Complex Data.” APSA Preprints.
 
 ## Installation
 
-**rMIDAS** is now available on
+rMIDAS is now available on
 [CRAN](https://cran.r-project.org/package=rMIDAS). To install the
 package in R, you can use the following code:
 
@@ -68,7 +67,7 @@ code:
 devtools::install_github("MIDASverse/rMIDAS")
 ```
 
-Note that **rMIDAS** uses the
+Note that rMIDAS uses the
 [reticulate](https://github.com/rstudio/reticulate) package to interface
 with Python. Users must have Python 3.5 - 3.8 installed in order to run
 MIDAS (Python 3.9 is not yet supported). rMIDAS will automatically try
@@ -82,28 +81,30 @@ library(rMIDAS)
 # Point to a Python binary
 set_python_env(python = "path/to/python/binary")
 
-# Point to a virtualenv binary
+# Or point to a virtualenv binary
 set_python_env(python = "virtual_env", type = "virtualenv")
 
-# Point to a condaenv, where conda can be supplied to choose a specific executable
+# Or point to a condaenv, where conda can be supplied to choose a specific executable
 set_python_env(python = "conda_env", type = "condaenv", conda = "auto")
 
 # Now run rMIDAS::train() and rMIDAS::complete()...
 ```
 
-## Vignettes
+## Vignettes (including example)
 
 **rMIDAS** is packaged with two vignettes:
 
-1.  `vignette("impute-demo", "rMIDAS")` demonstrates the basic workflow
-    of using the **rMIDAS** package
-2.  `vignette("custom-python", "rMIDAS")` provides detailed guidance on
-    configuring Python binaries and environments, including
-    troubleshooting tips
+1.  [`vignette("imputation_demo",
+    "rMIDAS")`](https://github.com/MIDASverse/rMIDAS/blob/master/vignettes/imputation_demo.Rmd)
+    demonstrates the basic workflow and capacities of **rMIDAS**
+2.  [`vignette("custom_python_versions",
+    "rMIDAS")`](https://github.com/MIDASverse/rMIDAS/blob/master/vignettes/custom_python_versions.Rmd)
+    provides detailed guidance on configuring Python binaries and
+    environments, including some troubleshooting tips
 
 ## Getting help
 
-**rMIDAS** is still in development, and we may not have caught all bugs.
-If you come across any difficulties, or have any suggestions for
+rMIDAS is still in development, and we may not have caught all bugs. If
+you come across any difficulties, or have any suggestions for
 improvements, please raise an issue
 [here](https://github.com/MIDASverse/MIDASpy/issues).