version 0.3.0
tsrobinson authored and cran-robot committed Jan 30, 2021
1 parent d58c1b7 commit 4eb2398
Showing 13 changed files with 226 additions and 202 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: rMIDAS
Title: Multiple Imputation with Denoising Autoencoders
Version: 0.2.0
Version: 0.3.0
Authors@R: c(
person(given = "Thomas",
family = "Robinson",
@@ -28,11 +28,11 @@ License: Apache License (>= 2.0)
URL: https://github.com/MIDASverse/rMIDAS
BugReports: https://github.com/MIDASverse/rMIDAS/issues
NeedsCompilation: no
Packaged: 2020-11-02 10:19:04 UTC; tomrobinson
Packaged: 2021-01-30 00:25:27 UTC; tomrobinson
Author: Thomas Robinson [aut, cre, cph]
(<https://orcid.org/0000-0001-7097-1599>),
Ranjit Lall [aut, cph] (<https://orcid.org/0000-0003-1455-3506>),
Alex Stenlake [ctb, cph]
Maintainer: Thomas Robinson <ts.robinson1994@gmail.com>
Repository: CRAN
Date/Publication: 2020-11-02 14:30:02 UTC
Date/Publication: 2021-01-30 05:50:02 UTC
24 changes: 12 additions & 12 deletions MD5
@@ -1,46 +1,46 @@
2b171ba05938f12b9ff162ef220997f3 *DESCRIPTION
450e99720eaab2d1ebe8b385a8afe763 *DESCRIPTION
4b0e8158ffc6dfd349c09ae2cf78bf79 *NAMESPACE
c1edbd7e866006350912efce4e6369e0 *NEWS.md
1ae7dbcfd58ed0252e8747c09a518e87 *NEWS.md
310b8c5ebd9c9142693bb13d801fb34c *R/load_utils.R
09ca13e6eb832e0d3522b31ff3027cb0 *R/midas_functions.R
4757396c6fd7cfc779ae0cc8bf5d5570 *R/pre_processing.R
54bcecf6a71938796a02be05fb02cd16 *R/midas_functions.R
35da741b0f6ece834d88d4e2c48ae048 *R/pre_processing.R
0ef1815140875b09e36db14da6177157 *R/rubin_analysis.R
5f0f7a477bc22f93c04aa9388a5b1059 *R/setup.R
dfae92d29cde2eb804d727b91e0b02c2 *R/zzz.R
729d309d5783d4579450c236877e3f38 *README.md
675c3fb7b84afdfefcdee30fc47eefb7 *README.md
a530231b4224e7197f03e30437bf016a *build/vignette.rds
d49090aff7e7d8f2614da7cfcc8dda64 *inst/CITATION
5387350a6623a416df9777722d1b85e4 *inst/doc/custom_python_versions.R
f484f6445b5229eddb00fbd8881b021b *inst/doc/custom_python_versions.Rmd
d363889d3288b14262b37ed0bc6f30f7 *inst/doc/custom_python_versions.html
c1938cdce11c36f9c9ead57cee3e6223 *inst/doc/custom_python_versions.html
072b1f5c9f87b73195cf00bb70bbd106 *inst/doc/imputation_demo.R
78db257a69cec810b48a1dc8bac655d5 *inst/doc/imputation_demo.Rmd
e4998ff417b08cead5b57004ddfb0546 *inst/doc/imputation_demo.html
76e9243e2e6d460bb67fd07555456126 *inst/doc/imputation_demo.html
64fba0a91b3a21bbbf50fa4f8a600741 *inst/examples/basic_workflow.R
a67958e3009099ce054dfeb6836ecf22 *inst/examples/overimputation.R
6a02980078146cc9bdd061054e3b4bfc *inst/python/__pycache__/midas_base.cpython-36.pyc
393fee7a0486f0c55914ee243ecffa35 *inst/python/__pycache__/midas_base.cpython-37.pyc
eb00af692b487faec78c83b934afa54f *inst/python/__pycache__/midas_base.cpython-38.pyc
eb3f66e25ec913f65e7241c00dcfd437 *inst/python/__pycache__/midas_base.cpython-39.pyc
6c6b9344914d38c01c194da81936c1ae *inst/python/midas_base.py
73b4dc9395165c4e27482f00d11e8592 *inst/python/midas_base.py
488fb2cb3ec9334b78245fb4e16f1937 *man/add_bin_labels.Rd
c108ce6825a14d383201734677a83d54 *man/add_missingness.Rd
a5a30b2cc7b62e1bd61a17d593e17927 *man/coalesce_one_hot.Rd
c13c5d8b933d5dd21459c139ebf43538 *man/col_minmax.Rd
2bb86f17a81bf892ae7e2b191f6f95fe *man/combine.Rd
52765cb1cd31227c85bcb07578fb1cf4 *man/complete.Rd
769664053c3840ee3b52e50bb06626b5 *man/convert.Rd
723a044dfc96ac49267f435af3893427 *man/complete.Rd
48413a5a794caa38137a9bc36ca02c4c *man/convert.Rd
ee448b97e60bee367e6e5f98ba823b16 *man/figures/logo.png
551e6d455d913ad82f2885625e08708b *man/figures/logos.drawio
0029cb10bab1c771f84aa1892f749efd *man/import_midas.Rd
a796458731eabf8e9e6bc35b2d828cf0 *man/mid_py_setup.Rd
94665912e58b1aa50605a3fd9e98b08c *man/midas_setup.Rd
094823bf4af2a3d7fed849c42a087ee7 *man/na_to_nan.Rd
7ff82823157f7002afe711ec1330af73 *man/overimpute.Rd
d4c2d7ae85b4b52fb50e33373fac6211 *man/overimpute.Rd
67d01f8dd1cf19b84a79632f047922f4 *man/python_init.Rd
bf582d11c2bbd5bdad111aaad3be8b36 *man/set_python_env.Rd
2aaaff54cab764411588113f92dbc22a *man/skip_if_no_numpy.Rd
43fcb71f56acc52f6b0e398235193857 *man/train.Rd
62467aedb62fbcd1df4133a204f2a7e9 *man/train.Rd
499c92fc7cdd4e54e555f49b27dd88c9 *man/undo_minmax.Rd
316d76f801fd61c6d2121a1687c0c4cd *tests/testthat.R
d7ce0a0c02e69876725d5e39cc609feb *tests/testthat/testAnalysis.R
20 changes: 11 additions & 9 deletions NEWS.md
@@ -1,21 +1,23 @@
## rMIDAS 0.2
# rMIDAS 0.3

* rMIDAS now fully supports both Tensorflow 1.X and 2.X
* Minor updates to underlying Python code to mirror MIDASpy v1.2.1

* Added two vignettes for demonstrating imputation workflow and configuring Python installs/environments
* Added NULL defaults to cat_cols and bin_cols parameters within `rMIDAS::convert()`

* Streamlined handling of Python configuration and interface with **reticulate**
* Overimputation legend now plotted in bottom-right corner of figure

* Added a `fast` parameter to the `complete()` function, giving users more flexibility on how to handle predicted probabilities for categorical and binary variables.
* Minor changes to README

* Added function `add_missingness()` to spike-in missingness for examples
# rMIDAS 0.2

* rMIDAS now fully supports both Tensorflow 1.X and 2.X
* Added two vignettes for demonstrating imputation workflow and configuring Python installs/environments
* Streamlined handling of Python configuration and interface with **reticulate**
* Added a `fast` parameter to the `complete()` function, giving users more flexibility on how to handle predicted probabilities for categorical and binary variables.
* Added function `add_missingness()` to spike-in missingness for examples
* Minor changes to README

* Minor changes to DESCRIPTION including title and description fields

* Replaced all instances of `cat()` with `message()` for better logging

* Bug fixes related to GitHub issues

# rMIDAS 0.1
18 changes: 13 additions & 5 deletions R/midas_functions.R
@@ -25,6 +25,7 @@ import_midas <- function(...) {
#' @param learn_rate A number, the learning rate \eqn{\gamma} (default = 0.0001), which controls the size of the weight adjustment in each training epoch. In general, higher values reduce training time at the expense of less accurate results.
#' @param input_drop A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance.
#' @param seed An integer, the value to which \proglang{Python}'s pseudo-random number generator is initialized. This enables users to ensure that data shuffling, weight and bias initialization, and missingness indicator vectors are reproducible.
#' @param train_batch An integer, the number of observations in training mini-batches (default = 16).
#' @param latent_space_size An integer, the number of normal dimensions used to parameterize the latent space.
#' @param cont_adj A number, weights the importance of continuous variables in the loss function
#' @param binary_adj A number, weights the importance of binary variables in the loss function
@@ -46,6 +47,7 @@ train <- function(data,
learn_rate = 0.0004,
input_drop = 0.8,
seed=123L,
train_batch = 16L,
latent_space_size = 4,
cont_adj= 1.0,
binary_adj= 1.0,
@@ -56,7 +58,6 @@ train <- function(data,
vae_sample_var = 1.0) {

## Parameters not integrated:
# train_batch = 16,
# output_layers= 'reversed',
# loss_scale= 1,
# init_scale= 1,
@@ -81,6 +82,7 @@ train <- function(data,
learn_rate = learn_rate,
input_drop = input_drop,
seed = as.integer(seed),
train_batch = as.integer(train_batch),
vae_layer = vae_layer,
latent_space_size = as.integer(latent_space_size),
cont_adj = cont_adj,
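
The `as.integer()` wrapper around `train_batch` mirrors the handling of `seed` and `latent_space_size` in the same call. A minimal sketch of why the coercion is likely needed: a number typed without the `L` suffix is a double in R, and reticulate passes R doubles to Python as floats rather than ints. The variable names below are illustrative only.

```r
# A plain R number is stored as a double, even if it looks like an integer
train_batch <- 16
is.integer(train_batch)   # FALSE

# The L suffix or an explicit coercion gives a true integer
is.integer(16L)           # TRUE
batch_int <- as.integer(train_batch)
is.integer(batch_int)     # TRUE
```

Coercing inside `train()` means users can pass either `train_batch = 16` or `train_batch = 16L`.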
@@ -122,7 +124,7 @@ train <- function(data,
#' @param mid_obj Object of class `midas`, the result of running `rMIDAS::train()`
#' @param m An integer, the number of completed datasets required
#' @param file Path to save completed datasets. If `NULL`, completed datasets are only loaded into memory.
#' @param file_root A character string, used as the root for all filenames when saving completed datasets if a `filepath` is supplied. If no file_root is provided, saved datasets will be saved as "file/midas_impute_yymmdd_hhmmss_m.csv"
#' @param file_root A character string, used as the root for all filenames when saving completed datasets if a `filepath` is supplied. If no file_root is provided, completed datasets will be saved as "file/midas_impute_yymmdd_hhmmss_m.csv"
#' @param unscale Boolean, indicating whether to unscale any columns that were previously minmax scaled between 0 and 1
#' @param bin_label Boolean, indicating whether to add back labels for binary columns
#' @param cat_coalesce Boolean, indicating whether to decode the one-hot encoded categorical variables
@@ -216,7 +218,7 @@ complete <- function(mid_obj,

}

return(df)
return(as.data.frame(df))

})

@@ -259,6 +261,7 @@ complete <- function(mid_obj,
#' @param plot_vars Boolean, specifies whether to plot the distribution of original versus overimputed values. This takes the form of a density plot for continuous variables and a barplot for categorical variables (showing proportions of each class).
#' @param skip_plot Boolean, specifies whether to suppress the main graphical output. This may be desirable when users are conducting a series of overimputation exercises and are primarily interested in the console output. **Note**, when `skip_plot = FALSE`, users must manually close the resulting pyplot window before the code will terminate.
#' @param spike_seed,seed An integer, to initialize the pseudo-random number generators. Separate seeds can be provided for the spiked-in missingness and imputation, otherwise `spike_seed` is set to `seed` (default = 123L).
#' @param save_path String, indicating path to directory to save overimputation figures. Users should include a trailing "/" at the end of the path i.e. save_path = "path/to/figures/".
#' @inheritParams train
#' @seealso \code{\link{train}} for the main imputation function.
#' @export
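
The `save_path` documentation asks for a trailing "/", which suggests the string is used as a prefix for the figure filenames (an assumption; the Python side is not shown in this diff). A hypothetical helper, not part of rMIDAS, that a caller could use to normalise the path before passing it to `overimpute()`:

```r
# Hypothetical helper (not an rMIDAS function): append a trailing "/"
# unless the path is empty or already ends with one
ensure_trailing_slash <- function(path) {
  if (nzchar(path) && !endsWith(path, "/")) paste0(path, "/") else path
}

fig_dir <- ensure_trailing_slash("path/to/figures")
```

The empty string is returned unchanged, so it stays compatible with the `save_path = ""` default.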
@@ -276,12 +279,14 @@ overimpute <- function(# Input data
plot_vars = FALSE,
skip_plot = FALSE,
spike_seed = NULL,
save_path = "",

# MIDAS model parameters
layer_structure = c(256,256,256),
learn_rate = 0.0004,
input_drop = 0.8,
seed=123L,
train_batch=16L,
latent_space_size = 4,
cont_adj= 1.0,
binary_adj= 1.0,
@@ -316,14 +321,15 @@ overimpute <- function(# Input data
}

if (plot_vars) {
message("**Note**: Plotting variables is enabled.\n Overimputation will not proceed until these graphs are closed.")
message("**Note**: Plotting for individual variables is enabled.\nIf your dataset has many variables, this will generate a lot of files!\nTo run without plotting variable graphs, set plot_vars = FALSE\n")
}


mod_inst <- import_midas(layer_structure = as.integer(layer_structure),
learn_rate = learn_rate,
input_drop = input_drop,
seed = as.integer(seed),
train_batch = as.integer(train_batch),
vae_layer = vae_layer,
latent_space_size = as.integer(latent_space_size),
cont_adj = cont_adj,
@@ -358,7 +364,9 @@ overimpute <- function(# Input data
plot_vars = plot_vars,
skip_plot = skip_plot,
plot_main = FALSE,
spike_seed = as.integer(spike_seed))
spike_seed = as.integer(spike_seed),
save_figs = TRUE,
save_path = save_path)

return(mod_overimp)

25 changes: 22 additions & 3 deletions R/pre_processing.R
@@ -31,7 +31,7 @@
#' cat <- c("a","f")
#'
#' convert(data, bin_cols = bin, cat_cols = cat)
convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
convert <- function(data, bin_cols = NULL, cat_cols = NULL, minmax_scale = FALSE) {

# Check data input

@@ -72,7 +72,26 @@ convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
data_cat_oh <- mltools::one_hot(data_cat, cols = names(data_cat))

cat_lists <- lapply(cat_cols,
function(x) c(names(data_cat_oh)[startsWith(names(data_cat_oh),paste0(x,"_"))]))
function(x) {

tmp_names <- names(data_cat_oh)
# Locate whether other variables share same root e.g. c("var1", "var1_other")
if (sum(grepl(x, cat_cols)) > 1) {

var_matches <- cat_cols[grep(x, cat_cols)]

# Get vector of variables to remove from matching
del_vars <- var_matches[!(var_matches == x)]

# Loop through and delete
for (del_var in del_vars) {
tmp_names <- tmp_names[!grepl(del_var, tmp_names)]
}
}

# Now find one-hot encoded names
c(tmp_names[startsWith(tmp_names,paste0(x,"_"))])
})

}
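
The hunk above changes how one-hot column names are assigned to categorical variables whose names share a root. A minimal base-R sketch of the problem and of the patched matching logic, using made-up column names in place of the `mltools::one_hot()` output; `match_one_hot` is a hypothetical wrapper around the body of the new anonymous function:

```r
# Hypothetical names, as one-hot encoding might produce them for two
# categorical variables that share the root "var1"
cat_cols <- c("var1", "var1_other")
oh_names <- c("var1_a", "var1_b", "var1_other_x", "var1_other_y")

# Prefix matching alone: "var1" also picks up the "var1_other_*" columns
naive <- oh_names[startsWith(oh_names, "var1_")]

# Matching in the style of the patched code: drop names that belong to
# other variables containing the same root, then match on the prefix
match_one_hot <- function(x, cat_cols, oh_names) {
  tmp_names <- oh_names
  if (sum(grepl(x, cat_cols)) > 1) {
    var_matches <- cat_cols[grep(x, cat_cols)]
    del_vars <- var_matches[!(var_matches == x)]
    for (del_var in del_vars) {
      tmp_names <- tmp_names[!grepl(del_var, tmp_names)]
    }
  }
  tmp_names[startsWith(tmp_names, paste0(x, "_"))]
}

fixed <- lapply(cat_cols, match_one_hot,
                cat_cols = cat_cols, oh_names = oh_names)
```

Here `naive` contains all four names, while `fixed` separates them into two groups. Note that `grepl()` treats `x` as a regular expression, so variable names containing characters such as "." may match more loosely than intended.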

@@ -91,7 +110,7 @@ convert <- function(data, bin_cols, cat_cols, minmax_scale = FALSE) {
if (!(sum(!is.na(b_vals)) == 2)) {
stop("Column '",bin_col,"' does not have two non-missing values")

} else if (sum(b_vals[!is.na(b_vals)] %in% c(1,2)) != 2) {
} else if (sum(b_vals[!is.na(b_vals)] %in% c(1,0)) != 2) {

data_bin[,bin_col] <- ifelse(data_bin[,bin_col, with = FALSE] == b_vals[!is.na(b_vals)][1],
1L,0L)
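
The change from `c(1,2)` to `c(1,0)` affects columns that are already coded 0/1. A small sketch of the difference, assuming `b_vals` holds the unique values of the binary column (its definition is outside this hunk) and using made-up data:

```r
# Hypothetical unique values of a column that is already coded 0/1
b_vals <- c(0, 1, NA)
non_missing <- b_vals[!is.na(b_vals)]

# Old condition: only one of {0, 1} is in {1, 2}, so recoding is triggered
old_check <- sum(non_missing %in% c(1, 2)) != 2

# New condition: both values are in {1, 0}, so the column is left as-is
new_check <- sum(non_missing %in% c(1, 0)) != 2

# What the recoding branch would do to such a column when 0 is listed first
x <- c(0, 1, 1, NA)
recoded <- ifelse(x == non_missing[1], 1L, 0L)
```

Under these assumptions the old check would send a valid 0/1 column into the recoding branch and flip its values; with the new check, only columns using other labels are recoded.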
47 changes: 24 additions & 23 deletions README.md
@@ -5,34 +5,33 @@

<!-- badges: start -->

[![CRAN
status](https://www.r-pkg.org/badges/version/rMIDAS)](https://cran.r-project.org/package=rMIDAS)
[![R build
status](https://github.com/MIDASverse/rMIDAS/workflows/R/badge.svg)](https://github.com/tidyverse/dplyr/actions?workflow=R)
[![R build
status](https://github.com/tsrobinson/rMIDAS/workflows/R-CMD-check/badge.svg)](https://github.com/MIDASverse/rMIDAS/actions)
[![CRAN
status](https://www.r-pkg.org/badges/version/rMIDAS)](https://cran.r-project.org/package=rMIDAS)
[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![Last-changedate](https://img.shields.io/badge/last%20change-2021--01--29-yellowgreen.svg)](https://github.com/MIDASverse/rMIDAS/commits/master)
<!-- badges: end -->

## Overview

**rMIDAS** is an R package for multiply imputing missing data using an
accurate and efficient algorithm based on deep learning methods. The
package provides a simplified workflow for imputing and then analyzing
data:
**rMIDAS** is an R package for accurate and efficient multiple
imputation using deep learning methods. The package provides a
simplified workflow for imputing and then analyzing data:

- `convert()` carries out all necessary preprocessing steps
- `train()` constructs and trains a MIDAS imputation model.
- `train()` constructs and trains a MIDAS imputation model
- `complete()` generates multiple completed datasets from the trained
model
- `combine()` runs regression analysis across the complete data,
following Rubin’s Rules.
following Rubin’s combination rules

**rMIDAS** is based on the Python class
[MIDASpy](https://github.com/MIDASverse/MIDASpy).

### Efficient handling of large data

**rMIDAS** also incorporates several features to streamline and improve
rMIDAS also incorporates several features to streamline and improve the
the efficiency of multiple imputation analysis:

- Optimisation for large datasets using `data.table` and `mltools`
@@ -52,7 +51,7 @@ to Handle Missing Values in Large and Complex Data.” APSA Preprints.

## Installation

**rMIDAS** is now available on
rMIDAS is now available on
[CRAN](https://cran.r-project.org/package=rMIDAS). To install the
package in R, you can use the following code:

@@ -68,7 +67,7 @@ code:
devtools::install_github("MIDASverse/rMIDAS")
```

Note that **rMIDAS** uses the
Note that rMIDAS uses the
[reticulate](https://github.com/rstudio/reticulate) package to interface
with Python. Users must have Python 3.5 - 3.8 installed in order to run
MIDAS (Python 3.9 is not yet supported). rMIDAS will automatically try
@@ -82,28 +81,30 @@ library(rMIDAS)
# Point to a Python binary
set_python_env(python = "path/to/python/binary")

# Point to a virtualenv binary
# Or point to a virtualenv binary
set_python_env(python = "virtual_env", type = "virtualenv")

# Point to a condaenv, where conda can be supplied to choose a specific executable
# Or point to a condaenv, where conda can be supplied to choose a specific executable
set_python_env(python = "conda_env", type = "condaenv", conda = "auto")

# Now run rMIDAS::train() and rMIDAS::complete()...
```

## Vignettes
## Vignettes (including example)

**rMIDAS** is packaged with two vignettes:

1. `vignette("impute-demo", "rMIDAS")` demonstrates the basic workflow
of using the **rMIDAS** package
2. `vignette("custom-python", "rMIDAS")` provides detailed guidance on
configuring Python binaries and environments, including
troubleshooting tips
1. [`vignette("imputation_demo",
"rMIDAS")`](https://github.com/MIDASverse/rMIDAS/blob/master/vignettes/imputation_demo.Rmd)
demonstrates the basic workflow and capacities of **rMIDAS**
2. [`vignette("custom_python_versions",
"rMIDAS")`](https://github.com/MIDASverse/rMIDAS/blob/master/vignettes/custom_python_versions.Rmd)
provides detailed guidance on configuring Python binaries and
environments, including some troubleshooting tips

## Getting help

**rMIDAS** is still in development, and we may not have caught all bugs.
If you come across any difficulties, or have any suggestions for
rMIDAS is still in development, and we may not have caught all bugs. If
you come across any difficulties, or have any suggestions for
improvements, please raise an issue
[here](https://github.com/MIDASverse/MIDASpy/issues).
