Low performance when reading a lot of spectra #43
@romanzenka We know about that. The current version tries to fetch everything; we are going to fix it.
We are essentially building a specialized "search engine" that processes all spectra. We understand that going to C / .NET would be best for such a job, but R is otherwise very convenient and has a lot of functionality we like. Being able to do these odd jobs in R would be great. I'd be willing to try to provide a pull request, but I am afraid I'd collide with your design plans, since you are already aware of this issue.
Hi @romanzenka, some comments: it is true that fetching a small number of spectra is relatively slow. This is due to a big processing overhead when calling our managed code (the rawrr.exe) through a system call, plus writing tmp files to disk and then needing to read and parse the tmp data. I recommend looking at this presentation, especially slide 5. @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs). The first results look very promising and would boost reading speeds, especially for very small and selective data requests on many files! Hope this helps,
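To make that overhead concrete, here is a minimal sketch (not rawrr code) measuring the fixed cost of spawning an external process, which is paid on every system call; it assumes a Unix-like system where the `true` command is on the PATH and that the microbenchmark package is installed.

```r
# Minimal sketch: measure the fixed per-call cost of shelling out to an
# external executable. 'true' does nothing and exits immediately, so the
# median time is almost pure process-startup overhead.
library(microbenchmark)

microbenchmark(
  system2("true"),  # spawn a process, wait for it to exit
  times = 25
)
```

Anything on top of that (writing tmp files, reading and parsing them back) only adds to the per-call price, which is why batching many spectra into a single call helps so much.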
Regarding your plans to implement a search engine directly in R: I have big doubts that this makes sense! R is an interpreted language and not suited for heavy data lifting. This is why most R functions that crucially depend on performance are implemented in C; see http://adv-r.had.co.nz/Performance.html. If you still think you are missing a crucial piece of functionality that could be provided by rawrr, please feel free to suggest something and we can think about making it happen, BUT it should make sense from a code design perspective.
...and because you phrased this statement in a very absolute way: "It would take 3 hours just to read a single file." No, it would NOT, since you cannot multiply the time it takes to read a single spectrum by n. That would only be the case if you called the function once per spectrum.
I understand that very well, which is why I only call the function once. The speed is still so low that it is not usable. I suspect this is because the function gathers metadata one spectrum at a time, which likely involves many seeks within the .raw file to gather all that info, plus complex parsing and the like.
I agree that having the engine in memory, "heated up and raring to go", would be of great benefit if you can pull it off. The low speed I am experiencing is most likely not a result of writing/parsing text files; that operation takes a tiny fraction of the time, considering the size of one spectrum. A second is basically an eon in computer time... my hard drive can pump ~100 MB into memory in a single second. The inefficiency is likely elsewhere, but I shall not speculate before I have numbers.
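One way to get those numbers would be a sketch like the following; the file path is hypothetical, and it assumes `rawrr::readSpectrum` accepts a vector of scan numbers (as it is used elsewhere in this thread).

```r
# Sketch: time one readSpectrum call per batch size, so only the batch
# size varies; the ratio n / elapsed gives spectra per second.
rawfile <- "sample.raw"   # hypothetical path to a local Thermo .raw file
sizes   <- c(1, 10, 100, 1000)

timings <- sapply(sizes, function(n) {
  system.time(rawrr::readSpectrum(rawfile, scan = seq_len(n)))["elapsed"]
})

data.frame(n = sizes, seconds = timings, spectra_per_sec = sizes / timings)
```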
A developer from the ProteoWizard/MSconvert project once told me: "When using vendor libraries you need to know how to pet the cat!" So, if you think you know better than @cpanse, please go ahead and suggest changes to our managed code. The C# source is available here. We are always open to pull requests as long as they comply with the Bioc guidelines and fit into the package scope. An example can be found here.
I think what I have to do is create a version of the scan reading function that reads only what I need and nothing more. That should cut down on the time spent gathering the additional metadata that my code downstream simply ignores. If that is not good enough, it might be necessary for the vendor to provide some "accelerator" functions, using their deep knowledge of the file format. Also, I realized that the way data is currently passed into R is by generating and then parsing R source code. So the second trick would be to pass the data as raw bytes, perhaps, and disentangle them on the R end using a simpler method than full-blown "eval", which has to be ready for anything an R programmer can throw at it, and is thus more complex and slower.
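As a self-contained illustration of that second trick (nothing here is rawrr code; `via_eval` and `via_bin` are hypothetical helpers), one can compare shipping numbers as generated R source against shipping them as raw bytes:

```r
# Sketch: the same 10,000 doubles transferred two ways.
x <- runif(1e4)

# Strategy 1: generate R source code, then parse and eval it.
code <- paste0("c(", paste(x, collapse = ", "), ")")
via_eval <- function() eval(parse(text = code))

# Strategy 2: serialize to raw bytes, then read them back directly.
bytes <- writeBin(x, raw())
via_bin <- function() readBin(bytes, what = "double", n = length(x))

microbenchmark::microbenchmark(via_eval(), via_bin(), times = 20)
```

The binary route sidesteps R's parser entirely, which is exactly the "simpler method than full-blown eval" argued for above.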
@romanzenka Can you provide more details of your request?
I think #44 is the ultimate way to go. Meanwhile, I can try to provide a code snippet to solve your issue.
```r
a <- 1:10000 / 7  # some numbers
v <- paste0("list(a=c(", paste(a, collapse = ", "), "))")
microbenchmark::microbenchmark(eval(parse(text = v)))
```

... and I am getting about 1.5 microseconds for this.
@romanzenka I hope that helps. commit 1637d6f on git@git.bioconductor.org:packages/rawrr (check it out and `R CMD build`, or wait for two days)
Cheers
address fgcz#43

includes:
* new C# method `WriteCentroidSpectrumAsRcode`
* add test case
* roxygen2::roxygenize
* new rawrr.exe assembly
* version bump to 1.3.2
I have noticed that if I try to read a non-centroided spectrum with "barebones", I get an error, which is 100% OK with me. I'm updating the test to a) read only MS2 spectra and b) cycle through different files, so we do not get overly optimistic results thanks to caching of previously loaded data (see the sketch below). Hopefully I will have plots shortly; what I am curious to see is "spectra per second", so I'll modify the plot a bit.
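A sketch of that test setup; the file names are hypothetical, and it assumes `rawrr::readIndex` reports each scan's MS level (the exact column and level names may differ between rawrr versions).

```r
# Sketch: pick only MS2 scans, and take scans from a different fraction
# file on each run so the OS page cache cannot flatter the timings.
rawfiles <- sprintf("fraction_%02d.raw", 1:24)  # hypothetical 24-fraction set

ms2_scans <- function(rawfile) {
  idx <- rawrr::readIndex(rawfile)
  idx$scan[idx$MSOrder == "Ms2"]   # assumed column/level names
}

f <- rawfiles[1]                   # a fresh file for each query
spectra <- rawrr::readSpectrum(f, scan = head(ms2_scans(f), 1000))
```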
Below is a chart (it tops out at 8,192 spectra because the code crashed; investigating now) showing the times. The difference is that each microbenchmark is run on a completely different .raw file to reduce the effect of caching. I used a 24-fraction set of .raw files to make sure I had a fresh one for each query. Here is the same thing with spectra per second plotted on the Y axis. The update you provided had a dramatic effect on read times. Thank you!
Well, I tracked down the bug. If I load 16,384 spectra from a particular file, my R session crashes when it tries to parse the result.
I think we ran over R's maximum vector length. That might be a future improvement; for now I will simply run the input in chunks big enough to get me speed, but small enough not to kill R.
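A sketch of that chunking workaround, assuming `scans` holds all scan numbers of interest and that `rawrr::readSpectrum` returns a list; the chunk size of 8,000 is an arbitrary value below the point where the crash was observed.

```r
# Sketch: split the scan numbers into chunks, read each chunk with a
# separate readSpectrum call, and concatenate the resulting lists.
read_in_chunks <- function(rawfile, scans, chunk_size = 8000) {
  chunks <- split(scans, ceiling(seq_along(scans) / chunk_size))
  do.call(c, lapply(chunks, function(s) rawrr::readSpectrum(rawfile, scan = s)))
}
```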
Thanks; I fixed that: commit 36f43e1. C
`rawrr::readSpectrum` is very slow, making it unusable for reading files with 10,000s of spectra.
By slow I mean it takes ~1 second on my one-year-old MacBook Pro to read a spectrum.
(I do call the function once, with a list of spectrum ids.)
It would take 3 hours just to read a single file (tens of thousands of spectra at ~1 second each adds up to hours). That renders the package unusable by some two orders of magnitude.
I will be investigating to figure out what the culprit is. It might be necessary to add switches that strip some "advanced" functionality from spectrum reads to get the performance back (?).