This repository contains a collection of R scripts to reproduce all examples, simulations, and figures of Touw, Alfons, Groenen & Wilms (2025).
Below, we explain in detail which results from the paper are reproduced by which R script(s). We also report the running time of each script when we conducted the analyses. We used the following two machines:
- A desktop PC with an Intel i9 10-core CPU running Ubuntu 24.04 LTS.
- A laptop with an Apple M3 8-core CPU (4 performance cores and 4 efficiency cores) running macOS Sequoia 15.5.
Both machines were using R version 4.5.1 and our package clusterGGM version 0.1.0. Rather than installing the latest version of our package from CRAN, we recommend installing this specific version with the following commands:
install.packages("remotes")
remotes::install_github("aalfons/clusterGGM", ref = "v0.1.0")
If you already have package remotes installed, you can skip the first line.
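To confirm that the installed version matches the one used for the paper, a minimal check along the following lines may help. It relies only on `utils::packageVersion()` from base R; the variable names `required` and `installed` are illustrative and not part of the repository.

```r
# Version of clusterGGM used for the analyses in the paper
required <- "0.1.0"

# packageVersion() throws an error if the package is not installed,
# so we catch that case and record NA instead
installed <- tryCatch(
  as.character(utils::packageVersion("clusterGGM")),
  error = function(e) NA_character_
)

if (is.na(installed)) {
  message("clusterGGM is not installed.")
} else if (installed != required) {
  message("Installed clusterGGM version is ", installed,
          ", but the paper used ", required, ".")
} else {
  message("clusterGGM version ", required, " is installed.")
}
```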
Figure 1 in Section 2 was drawn with the online diagram editor draw.io, so there is no replication script for it.
Folder illustration contains two scripts that produce the other illustrative figures from the paper:
- simulation_designs.R produces Figure 2 in Section 3.
- covariance_precision.R produces Figure 5 in Section 4.
Folder simulations contains all scripts and output from our extensive simulations. Results are always saved in folder simulations/results and figures are always saved in simulations/figures.
Most scripts were run in parallel, with each script running on a single CPU core. However, some scripts use parallel computing themselves via package parallel.
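For readers unfamiliar with package parallel, the following is a generic sketch of how replications can be spread over several cores. It is not taken from the scripts in this repository: the function `one_replication()`, the number of cores, and the seed are all hypothetical placeholders.

```r
library(parallel)

# Hypothetical toy task standing in for one simulation replication
one_replication <- function(r) {
  x <- rnorm(100)
  mean(x)
}

n_cores <- 2  # assumption: adjust to the number of cores on your machine

# Start worker processes and give them reproducible random number streams
cl <- makeCluster(n_cores)
clusterSetRNGStream(cl, iseed = 20250101)

# Run 10 replications, distributed across the workers
results <- parLapply(cl, 1:10, one_replication)

# Always shut the workers down when finished
stopCluster(cl)

length(results)
```

A socket cluster created with `makeCluster()` works on Linux, macOS, and Windows, which is why it is used in this sketch instead of forking.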
The following scripts produce the results for the four baseline simulation designs:
- simulations_WB2022_random.R: running time was 7h17min using a single core on the desktop PC.
- simulations_WB2022_chain.R: running time was 7h18min using a single core on the desktop PC.
- simulations_WB2022_unbalanced.R: running time was 7h19min using a single core on the desktop PC.
- simulations_WB2022_unstructured.R: running time was 4h24min using a single core on the desktop PC.
The script figure_WB2022_baseline.R then reads in the results and produces Figure 3 in Section 3.
Together with simulations_WB2022_chain.R from above, the following scripts produce the results for the chain design with varying number of variables or clusters:
- simulations_WB2022_chain_p=30_K=3.R: running time was 1d7h40min using a single core on the desktop PC.
- simulations_WB2022_chain_p=30_K=5.R: running time was 1d9h48min using a single core on the desktop PC.
- simulations_WB2022_chain_p=30_K=6.R: running time was 1d9h40min using a single core on the desktop PC.
- simulations_WB2022_chain_p=30_K=10.R: running time was 1d9h9min using a single core on the desktop PC.
- simulations_WB2022_chain_p=60.R: running time was 1d17h26min using five cores on the desktop PC.
- simulations_WB2022_chain_p=120.R: running time was 7d0h47min using ten cores on the desktop PC.
The following scripts then read in the relevant results and produce the following figures:
- figure_WB2022_variables.R produces Figure 1 in online Appendix C.
- figure_WB2022_clusters.R produces Figure 2 in online Appendix C.
The following scripts produce the results for the modification of the baseline simulation designs in which the block structure is not exact but only approximate:
- simulations_approximate_random.R: running time was 7h30min using a single core on the desktop PC.
- simulations_approximate_chain.R: running time was 7h35min using a single core on the desktop PC.
- simulations_approximate_unbalanced.R: running time was 7h37min using a single core on the desktop PC.
- simulations_approximate_unstructured.R: running time was 4h25min using a single core on the desktop PC.
The script figure_approximate.R then reads in the results and produces Figure 3 in online Appendix C.
The following scripts produce the results for the two designs in which the relevant structure is on the diagonal of the precision matrix, as well as the two designs with a noisy blockdiagonal structure:
- simulations_diagonal_balanced.R: running time was 9h26min using a single core on the desktop PC.
- simulations_diagonal_unbalanced.R: running time was 10h02min using a single core on the desktop PC.
- simulations_blockdiagonal_balanced.R: running time was 10h7min using a single core on the desktop PC.
- simulations_blockdiagonal_unbalanced.R: running time was 10h21min using a single core on the desktop PC.
The script figure_diagonal_blockdiagonal.R then reads in the results and produces Figure 4 in Section 3.
The script simulations_computation_time.R measures the computation time of the compared methods on simulated data sets. Running time of this script was 3h25min using a single core on the laptop.
The script figure_computation_time.R then reads in the results and produces Figure 4 in online Appendix C.
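As a rough illustration of how the computation time of an estimator can be measured on simulated data, the sketch below uses `system.time()` from base R. It does not reproduce simulations_computation_time.R: the data dimensions are arbitrary, and the inverse sample covariance matrix is only a stand-in for the methods compared in the paper.

```r
set.seed(1)

# Simulate a data set with n observations and p variables (arbitrary sizes)
n <- 200
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Stand-in estimator (an assumption for illustration): the inverse of the
# sample covariance matrix, which exists here because n > p
timing <- system.time(Theta_hat <- solve(cov(X)))

# Elapsed wall-clock time in seconds
timing[["elapsed"]]
```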
The following scripts produce the results for the two designs in which the structure of interest is on the covariance matrix:
- simulations_Sigma_exact.R: running time was 17h36min using a single core on the desktop PC.
- simulations_Sigma_approximate.R: running time was 19h44min using a single core on the desktop PC.
The script figure_Sigma.R then reads in the results and produces Figure 6 in Section 4.
Folder applications contains all scripts and output from our empirical applications. Some scripts use parallel computing via package parallel.
The relevant files can be found in folder applications/finance:
- The script data_preprocessing.R reads in the raw data in .csv format and stores the processed data in an .RData file. Both the raw data and the preprocessed data are stored in the subfolder data.
- The script applications_finance.R produces the variable clustering results. It reads in the preprocessed data and stores the results in the subfolder output. It also produces Figures 5 and 6 in online Appendix D.1, which are stored in the subfolder figures. Running time was 14h15min using ten cores on the desktop PC.
- The script applications_finance_oos.R produces the results on out-of-sample errors via double cross-validation. It reads in the preprocessed data and stores the results in the subfolder output. Running time was 6d17h using ten cores on the desktop PC.
- The script plot_finance.R then reads in the results and produces Figure 7 in Section 5.1, which is stored in the subfolder figures.
The relevant files can be found in folder applications/oecd:
- The script data_preprocessing.R reads in the raw data in .ods format and stores the processed data in an .RData file. Both the raw data and the preprocessed data are stored in the subfolder data.
- The script applications_oecd.R reads in the preprocessed data and stores the results in the subfolder output. Running time was 3 seconds using a single core on the laptop.
- The script plot_oecd.R then reads in the results and produces Figure 8 in Section 5.2, which is stored in the subfolder figures.
The relevant files can be found in folder applications/HSQ:
- The script HSQ.R reads in the raw data in .csv format, then preprocesses and analyzes the data. Results from Table 1 in Section 5.3 are printed on the R console. The script also stores the results in .RData format and produces Figures 7 and 8 in online Appendix D.3. Running time was 14 minutes using four cores on the laptop.