RealSurvSim is an R package that provides several methods for simulating survival (time-to-event) datasets. It is intended for simulation studies and methodological research in survival analysis. The package includes non-parametric (kernel density estimation), parametric, and bootstrap-based simulation approaches for generating realistic time-to-event data.
- Parametric Simulation: Fit a distribution (e.g., exponential, Weibull, log-logistic, mixture distributions) to existing data and generate new samples from the fitted distribution.
- Kernel Density Simulation: Non-parametric simulation via kernel density estimation, using an accept-reject approach.
- Bootstrap Methods:
  - Conditional Bootstrap (cond): Splits event and censoring times, then resamples to preserve the observed event/censoring ratio.
  - Case Resampling (case): Simple random resampling of entire observations with replacement.
- Flexible Group/Strata Handling: Simulate data separately by group while preserving group sizes or allowing user-specified sample sizes.
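To make the accept-reject idea behind the kernel density approach concrete, here is a minimal base-R sketch. It is not the package's implementation: the helper name sample_kde_accept_reject is hypothetical, stats::density stands in for the kdensity package, and truncating proposals at zero (so that simulated times stay non-negative) is an assumption of this sketch.

```r
# Hypothetical helper for illustration; not part of RealSurvSim.
sample_kde_accept_reject <- function(orig_vals, n = length(orig_vals)) {
  # Gaussian kernel density estimate evaluated on a grid
  kde <- stats::density(orig_vals, kernel = "gaussian")
  # Linear interpolation of the estimate; zero outside the grid
  f_hat <- stats::approxfun(kde$x, kde$y, yleft = 0, yright = 0)

  lower <- max(0, min(kde$x))  # assumption: survival times cannot be negative
  upper <- max(kde$x)
  f_max <- max(kde$y)          # envelope height for the uniform proposal

  out <- numeric(0)
  while (length(out) < n) {
    cand <- stats::runif(n, lower, upper)  # proposals from the uniform envelope
    u <- stats::runif(n)
    # Accept a proposal with probability f_hat(cand) / f_max
    out <- c(out, cand[u * f_max <= f_hat(cand)])
  }
  out[seq_len(n)]
}

set.seed(42)
orig <- rexp(100, rate = 0.1)
sim <- sample_kde_accept_reject(orig, n = 50)
```

A uniform envelope is simple but can be inefficient for long-tailed data, since most proposals in the tail are rejected.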
If you have downloaded or cloned this repository:
# Install devtools if you don't already have it
install.packages("devtools")
# Then, from the root of the package directory:
devtools::install()
This package uses several R libraries for density estimation, distribution fitting, and survival analysis. They will be automatically installed (if not already present) when installing RealSurvSim. Key dependencies include:
- kdensity (for kernel density estimation)
- fitdistrplus (for fitting various distributions to data)
- flexsurv (for Gompertz and other survival distributions)
- univariateML (for maximum-likelihood estimation of some distributions, e.g., inverse gamma)
- actuar (for distributions like log-logistic and inverse gamma)
- survival (core survival analysis functionality)
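As a rough illustration of the fit-then-sample workflow that the distribution-fitting dependencies support, the sketch below fits an exponential distribution and draws a new sample from it. It does not call RealSurvSim itself. The variable names are illustrative, and the closed-form fallback is an assumption added so that the example also runs when fitdistrplus is not installed.

```r
set.seed(123)
orig_vals <- rexp(200, rate = 0.1)  # stand-in for observed event times

# Maximum-likelihood estimate of the exponential rate
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  fit <- fitdistrplus::fitdist(orig_vals, "exp")
  rate_hat <- unname(fit$estimate["rate"])
} else {
  rate_hat <- 1 / mean(orig_vals)  # closed-form MLE, used as a base-R fallback
}

# Draw a new sample of the same size from the fitted distribution
sim_vals <- rexp(length(orig_vals), rate = rate_hat)
```

For the exponential distribution both branches target the same estimate, so the fallback changes only how the rate is computed, not what is being estimated.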
Below is an overview of the core functions and some example usages. For detailed information on parameters and return values, refer to the function documentation.
- data_simul_KDE(orig_vals, n = NULL, kernel = "gaussian")
  Simulates data via kernel density estimation from a numeric vector of original values.
  - Parameters:
    - orig_vals: Numeric vector of original data values.
    - n: Number of observations to simulate (defaults to the length of orig_vals).
    - kernel: The kernel to use for KDE (currently supports "gaussian").
  - Returns: A numeric vector of simulated values.
- data_simul_Estim(orig_vals, n = NULL, distrib = "exp")
  Fits a specified parametric distribution to orig_vals and draws new samples from the fitted distribution.
  - Supported distributions include: "inverse_gamma", "gompertz", "llogis", "gumbel", "myMix", "exp".
- data_simul_bootstr(dat, n = NULL, type = "cond")
  Bootstrap-based simulation of event and censoring times.
  - Parameters:
    - dat: Data frame containing at least V1 (time) and V2 (censor indicator, 0/1).
    - n: Number of observations to sample. Defaults to the same size as dat.
    - type: "cond" for conditional bootstrap or "case" for case resampling.
  - Returns: A resampled or reconstructed data frame containing simulated times and censor indicators.
- RealSurvSim(dat, col_time, col_status, col_group, reps = 10000, random_seed = 123, n = NULL, simul_type, distribs = c("exp", "exp", "exp", "exp"))
  The main wrapper function for simulating multiple survival datasets using one of four approaches:
  - "cond": Conditional bootstrap
  - "case": Case resampling
  - "distr": Parametric distribution-based simulation
  - "KDE": Kernel density estimation-based simulation
  - Parameters:
    - dat: Original (or reconstructed) dataset with time, status, and group columns.
    - col_time: Column name/index for time.
    - col_status: Column name/index for censoring indicator (1=event, 0=censored).
    - col_group: Column name/index for treatment/group identifier.
    - reps: Number of datasets to simulate (default 10,000).
    - random_seed: Random seed (default 123) for reproducibility.
    - n: Vector specifying sample sizes per group (optional).
    - simul_type: Single string specifying the simulation method ("cond", "case", "distr", "KDE").
    - distribs: Which distributions to use if simul_type = "distr".
  - Returns: A list containing multiple simulated datasets (one for each repetition). Each dataset is a data.frame with columns V1 (time), V2 (status), and V3 (group).
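The conditional bootstrap and the per-group handling described above can be sketched in base R as follows. This is one possible reading of "resample events and censored observations separately while keeping their ratio", not the package's code. The function names cond_bootstrap_sketch and simulate_by_group are hypothetical, and the V1/V2/V3 column layout simply follows the conventions listed above.

```r
# Hypothetical helpers for illustration; not part of RealSurvSim.
cond_bootstrap_sketch <- function(dat, n = nrow(dat)) {
  events   <- dat$V1[dat$V2 == 1]  # observed event times
  censored <- dat$V1[dat$V2 == 0]  # observed censoring times

  # Keep the observed event/censoring ratio (rounded to whole observations)
  n_event <- round(n * length(events) / nrow(dat))
  n_cens  <- n - n_event

  # Index-based resampling avoids sample()'s special case for length-1 vectors
  resample <- function(x, size) x[sample.int(length(x), size, replace = TRUE)]

  data.frame(
    V1 = c(resample(events, n_event), resample(censored, n_cens)),
    V2 = rep(c(1, 0), times = c(n_event, n_cens))
  )
}

simulate_by_group <- function(dat, n = NULL) {
  groups <- split(dat, dat$V3)
  # Default: preserve the original size of each group
  if (is.null(n)) n <- vapply(groups, nrow, integer(1))
  parts <- Map(function(g, size) {
    sim <- cond_bootstrap_sketch(g, size)
    sim$V3 <- rep(g$V3[1], nrow(sim))
    sim
  }, groups, n)
  do.call(rbind, parts)
}

set.seed(1)
toy <- data.frame(
  V1 = rexp(100, rate = 0.1),
  V2 = rbinom(100, 1, 0.7),
  V3 = rep(0:1, each = 50)
)
sim_toy <- simulate_by_group(toy, n = c(40, 50))  # user-specified group sizes
```

Case resampling would instead draw whole rows, for example dat[sample.int(nrow(dat), n, replace = TRUE), ], which lets the event/censoring ratio vary from one simulated dataset to the next.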
Below are brief examples demonstrating how to simulate data. In practice, replace the placeholders (example_data, "time", etc.) with your actual dataset and column names.
library(RealSurvSim)

# Example dataset construction (for demonstration):
set.seed(123)
example_data <- data.frame(
  time   = rexp(100, rate = 0.1),             # Times
  status = sample(0:1, 100, replace = TRUE),  # 0=censored, 1=event
  group  = sample(0:1, 100, replace = TRUE)   # Two groups, 0 or 1
)

# 1. Kernel Density Estimation Simulation
sim_kde <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,            # Simulate 5 datasets
  simul_type = "KDE"   # Use KDE-based simulation
)
str(sim_kde$datasets)  # Check the structure of generated datasets

# 2. Parametric Distribution Simulation
sim_distr <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "distr",
  distribs = c("exp", "exp", "exp", "exp")
)
str(sim_distr$datasets)

# 3. Conditional Bootstrap
sim_cond <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "cond"
)
str(sim_cond$datasets)

# 4. Case Resampling
sim_case <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "case"
)
str(sim_case$datasets)
data(liang)
data(wu)
# 5. liang_kde <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 3, simul_type = "KDE")
# For arbitrary n
# 6. arbliang_distr <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 10, n = c(40, 50), simul_type = "distr", distribs = c("exp", "llogis", "llogis", "exp"))
# 7. arbwu_case <- RealSurvSim(wu, wu$V1, wu$V2, wu$V3, reps = 100, n = c(40, 50), simul_type = "case")
Underlying Paper for the Package
Analysis and Methods for Survival Data (arXiv:2308.07842)
Data Reconstruction Algorithm
Guyot et al. (2012), describing the algorithm for reconstructing survival data from published Kaplan-Meier curves.
WebPlotDigitizer
WebPlotDigitizer for extracting data points from Kaplan-Meier curves.