Low performance when reading a lot of spectra #43

Closed
romanzenka opened this issue Nov 19, 2021 · 17 comments

@romanzenka

rawrr::readSpectrum is very slow, making it unusable for reading files with tens of thousands of spectra.

By slow I mean it takes ~1 second on my 1-year-old MacBook Pro to read a spectrum.
(I do call the function once, with a list of spectrum IDs.)

It would take 3 hours just to read a single file. That renders the package unusable by some two orders of magnitude.

I will be investigating to figure out what is the culprit. It might be necessary to add switches that remove some "advanced" functionality from spectrum reads to get the performance back (?).

@cpanse
Collaborator

cpanse commented Nov 20, 2021

@romanzenka We know about that. The current version tries to fetch everything; we are going to fix it.
A possible workaround is to filter first using rawrr::readIndex('someRawFileName') and fetch only the scans of interest. Why do you want to read all spectra at once?
C
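
The filtering workaround suggested here could look roughly like the sketch below. It assumes the index returned by rawrr::readIndex() is a data frame with `scan` and `MSOrder` columns and that MS2 scans are labelled "Ms2"; those names, and the helper `select_scans()`, are assumptions for illustration, so check them against the index of your own file. A small stand-in data frame is used so the selection logic runs without a raw file.

```r
# Hypothetical helper: return the scan numbers whose MS order matches `order`.
# The column names `scan` and `MSOrder` are assumed, not taken from this thread.
select_scans <- function(idx, order = "Ms2") {
  idx$scan[idx$MSOrder == order]
}

# Stand-in index so the sketch runs without a raw file.
idx <- data.frame(
  scan    = 1:6,
  MSOrder = c("Ms", "Ms2", "Ms2", "Ms", "Ms2", "Ms2")
)
ms2_scans <- select_scans(idx)

# With rawrr installed and a real file, the same selection would feed one call:
# idx <- rawrr::readIndex(rawfile)
# S   <- rawrr::readSpectrum(rawfile, scan = select_scans(idx))
```

Passing all selected scan numbers in a single call keeps the per-call overhead discussed in this thread from being paid once per spectrum.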

cpanse changed the title from "Low performance" to "Low performance when reading a lot of spectra" on Nov 20, 2021
@romanzenka
Author

romanzenka commented Nov 22, 2021

We are essentially making a specialized "search engine" that processes all spectra.

We understand that going to C / .NET would be best for such a job, but R is very convenient otherwise and has a lot of functionality we like. Being able to do these odd jobs in R would be great.

I'd be willing to try to provide a pull request, but I am afraid I'd collide with your design plans as you are already aware of this issue.

@tobiasko
Collaborator

Hi @romanzenka,

Some comments: it is true that fetching a small number of spectra is relatively slow. This is due to a large processing overhead when calling our managed code (the rawrr.exe) through a system call, plus writing tmp files to disk and then reading and parsing the tmp data. I recommend looking at this presentation, especially slide 5. @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs). The first results look very promising, though, and would boost reading speeds, especially for very small and selective data requests on many files!

Hope this helps,
Tobi

@tobiasko
Collaborator

Regarding your plans of implementing a search engine directly in R: I have big doubts that this makes sense! R is an interpreted language and not suited for heavy data lifting. This is why most R functions that crucially depend on performance are implemented in C.

see http://adv-r.had.co.nz/Performance.html

If you still think you are missing a crucial piece of functionality that could be provided by rawrr, please feel free to suggest something and we can think about making it happen, BUT it should make sense from a code design perspective.

@tobiasko
Collaborator

...and because you phrased this statement in a very factual way:

"It would take 3 hours just to read a single file."

No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5.

@romanzenka
Author

"No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the rawrr::readSpectrum() function n times, each time targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5."

I understand that very well, which is why I only call the function once. The speed is still so low that it is not usable. I suspect that is because the function gathers metadata one spectrum at a time, which likely involves many seeks within the .raw file to gather all that info, plus complex parsing and similar.

@romanzenka
Author

"@cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs)."

I agree that having the engine in memory, "heated up and raring to go", would be of great benefit if you can pull it off.

The low speed I am experiencing is most likely not a result of writing/parsing text files: that operation takes a tiny fraction of the time, considering the size of one spectrum. A second is basically an eon in computer time... my hard drive can pump ~100MB into memory in a single second. The inefficiency is likely elsewhere, but I shall not speculate before I have numbers.

@tobiasko
Collaborator

A developer from the ProteoWizard/MSconvert project once told me: "When using vendor libraries you need to know how to pet the cat!" So, if you think you know better than @cpanse, please go ahead and suggest changes to our managed code. The C# source is available here. We are always open to pull requests as long as they comply with the Bioc guidelines and fit into the package scope. An example can be found here.

@romanzenka
Author

romanzenka commented Nov 29, 2021

I think what I have to do is to create a version of the scan reading function that reads only what I need and nothing more. That should cut down on the time spent gathering the additional metadata that my code downstream simply ignores. If that is not going to be good enough, it might be necessary for the vendor to provide some "accelerator" functions, using their deep knowledge of the file format.

Also, I realized that the way data is passed into R at the moment is by generation and subsequent parsing of R source code. So the second trick would be to pass the data maybe as raw bytes, and then disentangle them on the R end using a simpler method than full-blown "eval" which has to be ready for anything an R programmer can throw at it - thus more complex - thus slower.
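
As a rough illustration of the "raw bytes" idea, the sketch below packs one peak list into a raw vector and unpacks it again with base R's writeBin() and readBin(). The record layout (a 4-byte count followed by the m/z values and then the intensities) and the helper names `encode_peaks()` and `decode_peaks()` are assumptions made up for this example; this is not the format rawrr uses.

```r
# Hypothetical record layout (an assumption, not rawrr's transfer format):
# [n as 4-byte integer][n m/z doubles][n intensity doubles]
encode_peaks <- function(mz, intensity) {
  stopifnot(length(mz) == length(intensity))
  c(writeBin(length(mz), raw()),          # length() is an integer: 4 bytes
    writeBin(as.double(mz), raw()),       # 8 bytes per value
    writeBin(as.double(intensity), raw()))
}

decode_peaks <- function(bytes) {
  n      <- readBin(bytes[1:4], what = "integer", n = 1L)
  values <- readBin(bytes[-(1:4)], what = "double", n = 2L * n)
  list(mz = values[seq_len(n)], intensity = values[n + seq_len(n)])
}

mz        <- c(100.5, 200.25, 300.125)
intensity <- c(10, 20, 30)

bytes <- encode_peaks(mz, intensity)
peaks <- decode_peaks(bytes)

length(bytes)             # 52 bytes: 4 + 2 * 3 * 8
identical(peaks$mz, mz)   # TRUE: doubles survive the round trip bit for bit
```

Both functions rely on the platform's default byte order, so a writer on the C# side would need to agree on endianness explicitly.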

@cpanse
Collaborator

cpanse commented Nov 29, 2021

@romanzenka Can you provide more details of your request?

  • What data do you want? E.g., centroided peaks or segments (profile)?

  • How do you want the data to be read by R? E.g., base64-encoded, one peak list per line, using the scan method.

  • Can you provide me access to a raw file you are going to use? (you can also send me an email cp@fgcz.ethz.ch with the download link)

I think #44 is the ultimate way to go. Meanwhile, I can try to provide a code snippet to solve your issue.

@romanzenka
Author

@cpanse

  • At the moment it is incredibly bare-bones. I basically need the precursor m/z and charge, then two arrays (or one interleaved, or whatever) of m/z + intensity pairs, centroided.

  • Since I spoke to you, I did some minor benchmarking.

a <- 1:10000 / 7 # Some numbers
v <- paste0("list(a=c(", paste(a, collapse=", "), ")")
microbenchmark::microbenchmark(eval(v))

... and I am getting about 1.5 microseconds for this.
That could mean that maybe the R parse is fast enough and this is not the culprit, so we could spare ourselves the pain of doing a binary transfer or base64.

  • I will send you an e-mail, just need to check I am not sharing anything "secret" first.
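
A caveat about the benchmark snippet in this comment: eval() applied to a character string returns that string unchanged, so the measurement does not include any parsing, and the generated text is also missing one closing parenthesis. A sketch of a timing that actually parses and evaluates the generated source might look like the following. It uses base R's system.time() in place of microbenchmark purely to stay self-contained, and no timing figures are claimed here.

```r
a <- 1:10000 / 7  # same numbers as in the comment above

# Balanced source text of the form list(a=c(...))
v <- paste0("list(a=c(", paste(a, collapse = ", "), "))")

# eval() on a character vector just hands the string back; nothing is parsed.
unparsed <- eval(v)

# parse() turns the text into an expression; eval() then builds the list.
reps   <- 20
timing <- system.time(
  for (i in seq_len(reps)) parsed <- eval(parse(text = v))
)

identical(unparsed, v)       # TRUE: the string came back untouched
length(parsed$a)             # 10000
timing[["elapsed"]] / reps   # mean seconds per parse + eval on this machine
```

Whether the parse step matters for the overall read time would have to be judged from the number this prints on the machine in question.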

@cpanse
Collaborator

cpanse commented Dec 1, 2021

@romanzenka I hope that helps.

commit 1637d6f on git@git.bioconductor.org:packages/rawrr (check out and R CMD build or wait for two days)

# fetch via ExperimentHub
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()
EH4547 <- normalizePath(eh[["EH4547"]])

(rawfile <- paste0(EH4547, ".raw"))
if (!file.exists(rawfile)){
  file.copy(EH4547, rawfile)
}
R> bm <- lapply(2^(0:14), function(n, ...){
+         m0 <-  microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='default')}, ...)
+         m1 <-  microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='barebone')}, ...)
+         
+         data.frame(time = c(m0$time, m1$time), mode=c('default', 'barebone'), n=n)
+  }, times=1, unit="nanosecond") |> Reduce(f='rbind')
R> bm
          time     mode     n
1    983118992  default     1
2    906433494 barebone     1
3    902113611  default     2
4    871311213 barebone     2
5    890822867  default     4
6    879356766 barebone     4
7    895267636  default     8
8    909109441 barebone     8
9    930387498  default    16
10   881011362 barebone    16
11   929100467  default    32
12   857490072 barebone    32
13   914358999  default    64
14   872367250 barebone    64
15   962366760  default   128
16   876129902 barebone   128
17   996060642  default   256
18   908822154 barebone   256
19  1170730769  default   512
20   925475452 barebone   512
21  1963340186  default  1024
22  1120511427 barebone  1024
23  3557690212  default  2048
24  1409178241 barebone  2048
25  6165030108  default  4096
26  1976297334 barebone  4096
27 10846751392  default  8192
28  3010938648 barebone  8192
29 29449842481  default 16384
30  6763253400 barebone 16384
R> lattice::xyplot(time ~ n, groups=bm$mode, data=bm, type='b', scale=list(log=TRUE), ylab='time [in nanosecond]', xlab='number of spectra')

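[Screenshot, 2021-12-01: log-log plot of time (in nanoseconds) against number of spectra, one curve per mode ('default' and 'barebone')]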
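
To read the table above in terms of throughput, the following sketch converts a few of its rows to spectra per second and compares the measured time for 16384 spectra with the naive "n times the single-spectrum time" extrapolation debated earlier in the thread. Only rows copied from the table are used; the derived column names are chosen here for illustration.

```r
# A subset of the benchmark table above (time in nanoseconds).
bm <- data.frame(
  n    = rep(c(1, 1024, 16384), each = 2),
  mode = rep(c("default", "barebone"), times = 3),
  time = c(983118992, 906433494,
           1963340186, 1120511427,
           29449842481, 6763253400)
)

bm$seconds            <- bm$time / 1e9
bm$spectra_per_second <- bm$n / bm$seconds

# Speed-up of 'barebone' over 'default' at the largest request size.
big     <- bm[bm$n == 16384, ]
speedup <- big$seconds[big$mode == "default"] /
           big$seconds[big$mode == "barebone"]

# Naive extrapolation from one spectrum versus the measured batch time.
naive_s    <- 16384 * bm$seconds[bm$n == 1 & bm$mode == "default"]
measured_s <- big$seconds[big$mode == "default"]

print(bm[, c("n", "mode", "seconds", "spectra_per_second")])
round(speedup, 1)            # about 4.4
round(naive_s / measured_s)  # the naive estimate is several hundred times too high
```

The roughly 0.9 s measured for every request up to 256 spectra suggests that a fixed per-call cost dominates small requests in this benchmark.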

R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] tartare_1.7.2       ExperimentHub_2.1.4 AnnotationHub_3.1.5
[4] BiocFileCache_2.0.0 dbplyr_2.1.1        BiocGenerics_0.39.2

loaded via a namespace (and not attached):
 [1] KEGGREST_1.33.0               tidyselect_1.1.1             
 [3] BiocVersion_3.14.0            purrr_0.3.4                  
 [5] lattice_0.20-44               vctrs_0.3.8                  
 [7] generics_0.1.0                htmltools_0.5.2              
 [9] stats4_4.1.1                  yaml_2.2.1                   
[11] utf8_1.2.2                    interactiveDisplayBase_1.31.2
[13] blob_1.2.2                    rlang_0.4.11                 
[15] pillar_1.6.3                  later_1.3.0                  
[17] withr_2.4.2                   glue_1.4.2                   
[19] DBI_1.1.1                     rappdirs_0.3.3               
[21] bit64_4.0.5                   GenomeInfoDbData_1.2.7       
[23] lifecycle_1.0.1               zlibbioc_1.39.0              
[25] Biostrings_2.61.2             memoise_2.0.0                
[27] Biobase_2.53.0                IRanges_2.27.2               
[29] fastmap_1.1.0                 httpuv_1.6.3                 
[31] GenomeInfoDb_1.29.8           curl_4.3.2                   
[33] fansi_0.5.0                   AnnotationDbi_1.55.1         
[35] Rcpp_1.0.7                    xtable_1.8-4                 
[37] promises_1.2.0.1              filelock_1.0.2               
[39] BiocManager_1.30.16           cachem_1.0.6                 
[41] S4Vectors_0.31.4              XVector_0.33.0               
[43] mime_0.11                     bit_4.0.4                    
[45] microbenchmark_1.4.9          png_0.1-7                    
[47] digest_0.6.27                 dplyr_1.0.7                  
[49] shiny_1.7.0                   grid_4.1.1                   
[51] tools_4.1.1                   bitops_1.0-7                 
[53] magrittr_2.0.1                RCurl_1.98-1.4               
[55] tibble_3.1.4                  RSQLite_2.2.8                
[57] rawrr_1.3.2                   crayon_1.4.1                 
[59] pkgconfig_2.0.3               ellipsis_0.3.2               
[61] rstudioapi_0.13               assertthat_0.2.1             
[63] httr_1.4.2                    R6_2.5.1                     
[65] compiler_4.1.1               

Cheers

cpanse added a commit to cpanse/rawrr that referenced this issue Dec 1, 2021
address fgcz#43

includes:
* new C# method `WriteCentroidSpectrumAsRcode`
* add test case
* roxygen2::roxygenize
* new rawrr.exe assembly
* version bump to 1.3.2
@romanzenka
Author

Thank you! I have achieved very comparable results (modulo the start, some caches were not warm enough):

[Image: replicated benchmark plot]

Testing on our files now.

@romanzenka
Author

I have noticed that if I try to read a non-centroided spectrum with "barebones", I get an error - which is 100% ok with me.

I'm updating the test to a) read only MS2 spectra b) cycle through different files so we do not get overly optimistic results thanks to caching of previously loaded data.

Hopefully I will have plots shortly - what I am curious about seeing is "spectra per second", so I'll modify the plot a bit.

@romanzenka
Author

Below is a chart showing the times (it tops out at 8192 spectra because the code crashed; investigating now).

The difference is that each microbenchmark is run on a completely different .raw file to reduce the effect of caching. I used a 24-fraction set of .raw files to make sure I have a fresh one for each query.

[Image: read time against number of spectra, each measurement on a different .raw file]

Here is the same thing with spectra per second plotted on Y axis. The update you provided did have a dramatic effect on read times. Thank you!

[Image: spectra per second against number of spectra]

@romanzenka
Author

Well, I tracked down the bug. If I load 16,384 spectra from a particular file, my R crashes when it tries to source the resulting 1.1GB of R source code. The extraction itself takes about 1 minute, at an impressive 270 spectra per second... but then R cannot handle the parse on my 32GB RAM laptop. I get:

negative length vectors are not allowed

I think we exceeded the maximum vector length in R. Fixing that might be a future improvement; for now I will simply read the input in chunks big enough to get me speed, but small enough not to kill R.
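
The chunking described here could be sketched as follows. The helper `read_in_chunks()` and the stand-in `fake_reader()` are hypothetical names introduced for this example, and the chunk size in the commented rawrr call is an arbitrary placeholder rather than a tested value.

```r
# Hypothetical helper: call `reader` on consecutive chunks of scan numbers
# and concatenate the per-chunk lists into one flat list.
read_in_chunks <- function(scans, chunk_size, reader) {
  chunks <- split(scans, ceiling(seq_along(scans) / chunk_size))
  unlist(lapply(chunks, reader), recursive = FALSE, use.names = FALSE)
}

# Stand-in reader so the sketch runs without rawrr or a raw file.
fake_reader <- function(ids) lapply(ids, function(i) list(scan = i))

spectra <- read_in_chunks(1:10, chunk_size = 4, reader = fake_reader)
length(spectra)  # 10

# With rawrr, the reader could wrap the call used in this thread:
# reader  <- function(ids) rawrr::readSpectrum(rawfile, scan = ids, mode = "barebone")
# spectra <- read_in_chunks(scans, chunk_size = 2048, reader = reader)
```

Each chunk still pays the fixed per-call cost once, so the chunk size trades that overhead against the size of the generated source that R has to parse. Note that flattening with unlist() returns a plain list, so any class attribute on the reader's return value is not preserved.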

@cpanse
Collaborator

cpanse commented Dec 2, 2021

"I have noticed that if I try to read a non-centroided spectrum with 'barebones', I get an error - which is 100% ok with me."

thanks; I fixed that. commit 36f43e1 C

cpanse self-assigned this on Dec 2, 2021
cpanse added the "enhancement" (New feature or request) label on Dec 2, 2021