Execution halted while running loocv with bca on parallel mode #40

Open
trashmai opened this issue Nov 14, 2023 · 8 comments

Comments

@trashmai

Hi,

First off, thanks a lot for this package.

I ran into some issues while running loocv in parallel mode with a bca result of 17,073 rows. After about 9 hours, I got these error messages:

Error in xcoo1[ind1, nax] : subscript out of bounds
Calls: loocv -> loocv.between
In addition: Warning messages:
1: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
scheduled cores 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 did not deliver results, all values of the jobs will be affected
2: In x * w :
longer object length is not a multiple of shorter object length
3: In x * w :
longer object length is not a multiple of shorter object length
4: In x * w :
longer object length is not a multiple of shorter object length
Execution halted

Could you help me figure out what's going wrong?

Thanks!

@thioulouse
Collaborator

Hi

What are the dimensions of your data set (number of rows, columns, and classes)?

Using cross-validation in bca is really useful only when the number of columns is much higher than the number of rows, because of the risk of spurious groups in this case. You say you have 17,073 rows; with an even higher number of columns, you could run into memory availability problems.

Also note that you may use cross-validation on a limited number of bca axes (instead of keeping all axes) to spare memory.

Jean

@trashmai
Author

Hi Jean,

We have 68 classes and 59 columns, and we set nf=3 for both PCA and BCA. So, the number of axes for LOOCV should be 3 as well, right? (I tested nax=0 and nax=3 for LOOCV on a randomly sampled 500 rows and got very similar results).

We have around 100GB of free memory. Although I didn't monitor the memory usage, I ran LOOCV in parallel mode with both 8000 and 5000 randomly sampled rows. It worked with 5000 rows but failed on 8000. I'm now running it with parallel set to FALSE, and I hope to get a proper result. However, the progress bar indicates that the ETA is more than 6 days.

Thanks for your reply.
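
The subsampling tests described above (5000 rows working, 8000 rows failing) are easier to compare when the sample is reproducible and every class stays represented. The following is a minimal base R sketch under stated assumptions: `stratified_rows` is a hypothetical helper, not part of ade4, and the toy table only mimics the dimensions mentioned in this thread (17,073 rows, 59 columns, 68 classes).

```r
# Hypothetical helper (not part of ade4): draw roughly n_target rows
# while keeping at least two rows of every class when available.
stratified_rows <- function(fac, n_target, seed = 1) {
  set.seed(seed)
  fac <- droplevels(as.factor(fac))
  frac <- n_target / length(fac)
  idx <- unlist(lapply(split(seq_along(fac), fac), function(i) {
    k <- min(length(i), max(2, round(length(i) * frac)))
    i[sample.int(length(i), k)]  # index into i to avoid sample() on a scalar
  }), use.names = FALSE)
  sort(idx)
}

# Toy stand-in for the real data: random values, random class labels.
set.seed(42)
n <- 17073
fac <- factor(sample(sprintf("class%02d", 1:68), n, replace = TRUE))
tab <- as.data.frame(matrix(rnorm(n * 59), nrow = n))

rows <- stratified_rows(fac, 5000)
sub_tab <- tab[rows, , drop = FALSE]
sub_fac <- droplevels(fac[rows])  # drop levels that no longer occur
```

Fixing the seed means a failing subsample can be rerun unchanged, and `droplevels` avoids passing empty factor levels to the analysis.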

@trashmai
Author

FYI: Parallel processing also failed on a machine with over 600GB of memory, resulting in the same error.

@thioulouse
Collaborator

Thanks, I am trying to look into this

@thioulouse
Collaborator

Can you check with the current devel version of ade4 on GitHub?

@trashmai
Author

I reinstalled ade4 from GitHub following the instructions in the README, re-ran the full analysis last night, and got exactly the same errors this morning.

@thioulouse
Collaborator

I am sorry but I cannot reproduce the error that you mentioned ("Error in xcoo1[ind1, nax] : subscript out of bounds"). Note that this error happens only after the leave one out cross-validation loop. It happens during the computation of the group overlap index between bca and cross-validation coordinates, so it is not done in parallel computing mode.

I checked with 10,000 rows, 100 columns and 100 groups on an M1 Mac computer with only 8 GB of memory, and all computations went fine. Moreover, computation times are much shorter than the ones you reported: only about 1 hour for 10,000 rows, 100 columns and 100 groups in single core and 20 minutes in multicore (parallel with 8 cores).

What kind of computer system are you using? Can you please give us your sessionInfo() outputs?

Thanks,
Jean
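
A test case with the dimensions quoted above (10,000 rows, 100 columns, 100 groups) can be generated along these lines. This is only a sketch: the thread does not say how the benchmark data were built, so the normal random values, the balanced groups, and the helper name `make_test_case` are all assumptions.

```r
# Hypothetical generator for a synthetic test case (not part of ade4).
make_test_case <- function(n_rows, n_cols, n_groups, seed = 1) {
  set.seed(seed)
  # Balanced group labels in random order, so that no group is empty.
  fac <- factor(sample(rep_len(seq_len(n_groups), n_rows)))
  tab <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))
  list(tab = tab, fac = fac)
}

case <- make_test_case(10000, 100, 100)
dim(case$tab)      # rows and columns of the table
nlevels(case$fac)  # number of groups
```

The `tab` and `fac` elements can then go through the same dudi.pca, bca and loocv steps discussed in this thread, which makes timings on different machines comparable.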

@trashmai
Author

We ran the analysis on 3 computers:

  1. (the machine we used for the last non-parallel run)

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ade4_1.7-22

loaded via a namespace (and not attached):
[1] MASS_7.3-51.5 compiler_3.6.3 Rcpp_1.0.9

  2. (ran in parallel and got the errors)

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 ade4_1.7-22

loaded via a namespace (and not attached):
[1] Rcpp_1.0.11 codetools_0.2-18 prettyunits_1.2.0 foreach_1.5.2
[5] crayon_1.5.2 MASS_7.3-55 R6_2.5.1 lifecycle_1.0.4
[9] rlang_1.1.2 progress_1.2.2 cli_3.6.1 doParallel_1.0.17
[13] vctrs_0.6.4 iterators_1.0.14 hms_1.1.3 parallel_4.1.2
[17] compiler_4.1.2 pkgconfig_2.0.3

  3. (non-parallel computing)

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

Matrix products: default

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

time zone: Asia/Taipei
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 ade4_1.7-22

loaded via a namespace (and not attached):
[1] R6_2.5.1 codetools_0.2-19 doParallel_1.0.17 iterators_1.0.14 parallel_4.3.0
[6] pkgconfig_2.0.3 lifecycle_1.0.3 cli_3.6.1 foreach_1.5.2 vctrs_0.6.2
[11] compiler_4.3.0 prettyunits_1.1.1 tools_4.3.0 hms_1.1.3 Rcpp_1.0.10
[16] rlang_1.1.1 crayon_1.5.2 progress_1.2.2 MASS_7.3-58.4

I've noticed that the error didn't occur during the parallel processing stage, but could it be somehow related to the mclapply warnings? My random guess is that the way groups and sample sizes were split and distributed for parallel processing failed to meet certain conditions. The non-parallel run finished yesterday and gave great results, which leads me to believe that the error was not directly caused by the computation of the group overlap index.
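
One way to probe that guess is to look for failed jobs in a list of parallel results before it is combined. With `parallel::mclapply`, jobs whose worker did not deliver a result typically come back as `NULL`, and jobs that raised an error come back as `try-error` objects. The sketch below uses a hand-built list as a stand-in for such output; `failed_jobs` is a hypothetical helper, and whether loocv handles its workers' results this way is an assumption, not something confirmed in this thread.

```r
# Hypothetical helper: indices of jobs that returned NULL or a try-error.
failed_jobs <- function(res) {
  bad <- vapply(res,
                function(r) is.null(r) || inherits(r, "try-error"),
                logical(1))
  which(bad)
}

# Hand-built stand-in for parallel output where two of four jobs failed.
res <- list(
  c(0.1, 0.2, 0.3),
  NULL,
  c(0.4, 0.5, 0.6),
  structure("Error in FUN(X[[i]], ...) : boom\n", class = "try-error")
)
failed <- failed_jobs(res)

# Combining only the successful jobs gives fewer rows than jobs, the kind of
# mismatch that could later surface as a "subscript out of bounds" error.
good <- do.call(rbind, res[setdiff(seq_along(res), failed)])
nrow(good) < length(res)
```

If a check like this flags missing jobs, lowering the number of cores is a cheap next experiment, since workers killed for lack of memory are one common reason for undelivered results.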
