speed up readIndex/readSpectrum by base::textConnection #70

asepsiswu · 2023-11-08T02:04:13Z

A potential constraint lies in the data input/output speed of the disk.

one solution is to redirect the output of a command executed with system2 to a base::textConnection.

I hope the following code might provide some clues.

rawfile <- "demo.raw"
con <- textConnection( system2(Sys.which("mono"), args = c(shQuote(exe),  shQuote(rawfile),  ... ), stdout = T) )
string <- scan(con,"")
e <- eval(string)

The text was updated successfully, but these errors were encountered:

tobiasko · 2023-11-08T07:55:59Z

That sounds very interesting! Having a text buffer that connects the C# with the R level would save a write and read operation. @cpanse I think you should really consider this solution!

cpanse · 2023-11-08T19:50:55Z

@asepsiswu @tobiasko

R> rbenchmark::benchmark({.readIndexI70(f) |> nrow() -> dump}, replications = 10)
                                      test replications elapsed relative user.self sys.self user.child sys.child
1 {\n    dump <- nrow(.readIndexI70(f))\n}           10  34.621        1     0.769    0.516     35.272     3.772
R> rbenchmark::benchmark({rawrr::readIndex(f) |> nrow() -> dump}, replications = 10)
                                         test replications elapsed relative user.self sys.self user.child sys.child
1 {\n    dump <- nrow(rawrr::readIndex(f))\n}           10  35.044        1     0.293    0.038     35.272     4.101
R> file.size(f) / 1024^3
[1] 2.030649
R> .readIndexI70
function(rawfile, tmpdir = tempdir()){
  rawrr:::.isAssemblyWorking()
  rawrr:::.checkRawFile(rawfile)
    mono <- if(Sys.info()['sysname'] %in% c("Darwin", "Linux")) TRUE else FALSE
    exe <- rawrr:::.rawrrAssembly()

    tfstdout <- tempfile(tmpdir=tmpdir)

    cmd <- exe

    if (mono){
        con <- system2(Sys.which("mono"),
                       args = c(shQuote(exe), shQuote(rawfile),
                                "index", shQuote(tfstdout)),
                       stdout = TRUE) |> textConnection()
    }else{
        con <- system2(exe,
                       args = c( shQuote(rawfile), "index", shQuote(tfstdout)),
                       stdout = TRUE) |> textConnection()
    }

    DF <- read.table(con,
      header = TRUE,
      comment.char = "#",
      sep = ';',
      na.strings = "-1",
      colClasses = c('integer', 'character', 'numeric', 'numeric', 'character',
                     'integer', 'integer', 'integer', 'numeric'))

    DF$dependencyType <- as.logical(DF$dependencyType)

    DF
}
<bytecode: 0x122ae75b8>
R>

Yes, it is more elegant, and we used pipe in rawDiag. So far, it does not seem to improve the read performance on my Apple M1 (reading 10x a 2G raw file saved less than 0.5seconds).

This reverts commit db525f9.

asepsiswu · 2023-11-09T00:17:30Z

The readIndex function generates much smaller intermediate files compared to readSpectrum. The latter, readSpectrum, exhibits poorer performance, especially when used with non-SSD (Solid State Drive) disks.

Based on my experience, data.table::fread demonstrates significantly faster performance compared to read.table, especially when dealing with large files.

cpanse · 2023-11-09T07:18:46Z

We have already gone one step further, #44 , and tried to go directly via libmono-2.so (similar to what python projects, e.g., alphapept do).
It is remarkable that, in my opinion, the file IO has much less impact than we expected. (I assume
every computer uses SSD technology when dealing with super expensive Orbitrap data.) For small amounts of data, the impact of initialization the file handle seems much higher.

Concerning data.table::fread, my philosophy is to use as much as possible R base to make it easy to install and maintain.

tobiasko · 2023-11-09T07:53:10Z

@cpanse: I don't share your assumption. We should also think about the people that have less resources and still depend on conventional hard disks!

asepsiswu · 2023-11-09T07:59:11Z

Below is my experience. The RAW file is only 35M, and it takes almost 110 seconds.

I tried the EH4547 demo but failed when runing R$createObject() with ASSEMBLY PROBLEM.
My primary idea is to redirect rawrr.exe output to Random-access memory(RAM) and bypass the need to write and read. Writing to the file has much more impact on non-SSD.

tobiasko · 2023-11-09T07:59:19Z

@asepsiswu: Pleas don't fell offended. @cpanse is afraid of package dependencies. 👻 I once had to remove code that was using a dplyr function and look for ways to do it in base R.

But, why don't we check if data.table is available on the system. If yes, use it, if not use the base R read.table? At 100 Hz (Astral) we should expect already 360'000 spectra/h recording time. If the fread is really faster it might matter at some point.

asepsiswu · 2023-11-09T08:03:57Z

data.table is an R package without other dependence, which really outperforms well. It is worth a try.
I agree with @cpanse, to use as much as possible R base to make it easy to install and maintain.

cpanse · 2023-11-09T08:06:37Z

Below is my experience. The RAW file is only 35M, and it takes almost 110 seconds.

I tried the EH4547 demo but failed when runing R$createObject() with ASSEMBLY PROBLEM. My primary idea is to redirect rawrr.exe output to Random-access memory(RAM) and bypass the need to write and read. Writing to the file has much more impact on non-SSD.

have you considered:

set tempdir = /dev/shm/
set mode = "barebone" when using readSpectrum

cpanse · 2023-11-09T08:08:48Z

@asepsiswu: Pleas don't fell offended. @cpanse is afraid of package dependencies. 👻 I once had to remove code that was using a dplyr function and look for ways to do it in base R.

But, why don't we check if data.table is available on the system. If yes, use it, if not use the base R read.table? At 100 Hz (Astral) we should expect already 360'000 spectra/h recording time. If the fread is really faster it might matter at some point.

because why do you want to read all Astral spectra? solve #44 and that issue and many more are gone.

asepsiswu · 2023-11-09T08:18:35Z

It seems not good.

cpanse · 2023-11-09T08:41:15Z

It seems not good.

@asepsiswu of note, rawrr::readSpectrum is not using read.table as rawrr::readIndex it uses source parsing R code.

cpanse · 2023-11-09T08:47:38Z

It seems not good.

@asepsiswu of note, rawrr::readSpectrum is not using read.table as rawrr::readIndex it uses source parsing R code.

@asepsiswu and if you set the parameter mode = "barebone" too?

asepsiswu · 2023-11-09T09:07:44Z

It seems not good.

@asepsiswu of note, rawrr::readSpectrum is not using read.table as rawrr::readIndex it uses source parsing R code.

@asepsiswu and if you set the parameter mode = "barebone" too?

I need nosies info. The IO seems to have less impact. If possible, read data from C# interface to R is the best solution.

cpanse self-assigned this Nov 8, 2023

cpanse added the enhancement New feature or request label Nov 8, 2023

cpanse added a commit that referenced this issue Nov 8, 2023

address #70 and bump to 1.10.2

db525f9

cpanse added a commit that referenced this issue Nov 8, 2023

Revert "address #70 and bump to 1.10.2"

235e317

This reverts commit db525f9.

asepsiswu closed this as completed Nov 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

speed up readIndex/readSpectrum by base::textConnection #70

speed up readIndex/readSpectrum by base::textConnection #70

asepsiswu commented Nov 8, 2023

tobiasko commented Nov 8, 2023

cpanse commented Nov 8, 2023

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

tobiasko commented Nov 9, 2023

asepsiswu commented Nov 9, 2023

tobiasko commented Nov 9, 2023 •

edited

Loading

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

cpanse commented Nov 9, 2023 •

edited

Loading

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

cpanse commented Nov 9, 2023

asepsiswu commented Nov 9, 2023

speed up readIndex/readSpectrum by base::textConnection #70

speed up readIndex/readSpectrum by base::textConnection #70

Comments

asepsiswu commented Nov 8, 2023

tobiasko commented Nov 8, 2023

cpanse commented Nov 8, 2023

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

tobiasko commented Nov 9, 2023

asepsiswu commented Nov 9, 2023

tobiasko commented Nov 9, 2023 • edited Loading

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

cpanse commented Nov 9, 2023 • edited Loading

asepsiswu commented Nov 9, 2023

cpanse commented Nov 9, 2023

cpanse commented Nov 9, 2023

asepsiswu commented Nov 9, 2023

tobiasko commented Nov 9, 2023 •

edited

Loading

cpanse commented Nov 9, 2023 •

edited

Loading