
Parquet format: add polarity, and maybe change to long format #116

Open
sorenwacker opened this issue Apr 25, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@sorenwacker

I miss the ionization mode in the parquet format.
Did you consider not storing the intensities and masses as arrays, but exploding them into one row per peak?
I find the data much easier to analyse when it is in long format.
And because parquet compresses the data, it should not blow up the file size too much.
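To illustrate what I mean, here is a toy sketch of the layout change with pandas; the column names (`scan`, `rt`, `mz`, `intensity`) are my own assumptions, not the tool's actual parquet schema:

```python
import pandas as pd

# Current-style layout (as I understand it): one row per scan,
# with m/z and intensity stored as arrays.
wide = pd.DataFrame({
    "scan": [1, 2],
    "rt": [0.5, 1.0],
    "mz": [[100.1, 200.2], [150.3]],
    "intensity": [[1000.0, 2000.0], [1500.0]],
})

# Exploding both array columns yields long format: one row per peak,
# with scan and rt repeated for each (mz, intensity) pair.
long_df = wide.explode(["mz", "intensity"], ignore_index=True)
print(len(long_df))  # 3 rows: two peaks from scan 1, one from scan 2
```

(Exploding several columns at once needs pandas >= 1.3.)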

@sorenwacker sorenwacker changed the title Polarity in parquet file, Parquet format: add polarity, and maybe change to long format Apr 25, 2021
@caetera caetera added the enhancement New feature or request label Apr 26, 2021
@nielshulstaert
Contributor

Hi @soerendip, I should have read this issue before replying to your other one; thanks for your input. @ypriverol, are you using the parquet format output? I think you're more qualified to answer this one. We could add another parquet output option, or add yet another flag.

@caetera
Collaborator

caetera commented Apr 29, 2021

Hi @nielshulstaert, is it worth considering moving the format options into sub-commands? I.e. similarly to xic and query, we introduce mzml, mgf, and parquet sub-commands to switch the format and replace the -f flag? That might help with the growing number of flags. It would be, without any doubt, a significant change in the interface, but if properly "announced" it should be possible. What do you think?

@sorenwacker
Author

I'm not sure what is best for the community. There is also the mzMLb format, which was recently announced. The strength of the parquet format is its columnar layout, which I think is undermined when you compress the data into arrays stored in the cells. I will run some tests next week and get back to you.

@sorenwacker
Author

sorenwacker commented May 6, 2021

[Figure: read speed of MS file formats]
I ran a test with 12 small metabolomics files converted to different formats. For mzXML and mzMLb I used the parsers from pyteomics, and for mzML the one from pymzml. For parquet and feather I used pandas and pyarrow. The orange bars are just reading the data into memory; the blue ones are reading plus converting into long format. Parquet and feather have almost the same read speed, and with feather the data is already in long format. I was puzzled that the mzMLb format takes so much time to read. Maybe that is just an inefficient parser.
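For reference, the timing was done roughly like this best-of-n sketch (a simplification of my actual script; `reader` would be e.g. `pd.read_parquet` or `pd.read_feather`, and the file names are hypothetical):

```python
import time

def time_read(path, reader, n=3):
    """Return the best-of-n wall-clock time (seconds) for reader(path)."""
    best = float("inf")
    for _ in range(n):
        t0 = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage with the converted test files:
# import pandas as pd
# results = {fmt: time_read(f"T01.{fmt}", rdr)
#            for fmt, rdr in [("parquet", pd.read_parquet),
#                             ("feather", pd.read_feather)]}
```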

I wonder what the advantages are, from your point of view, of storing the m/z values and intensities as arrays. Are there access patterns that make it better? I imagine that if you want to slice by m/z rather than by retention time, e.g. extracting a peak over a period of time, the denser format would add a lot of overhead. Or am I wrong? Did you do any benchmarking?
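The kind of m/z slicing I have in mind is trivial in long format; a minimal sketch with made-up data and assumed column names (`rt`, `mz`, `intensity`):

```python
import pandas as pd

# Long-format toy data: one row per peak.
df = pd.DataFrame({
    "rt":        [0.5, 0.5, 1.0, 1.0, 1.5],
    "mz":        [100.1, 200.2, 100.1, 300.3, 100.2],
    "intensity": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# XIC-style query: all intensity inside an m/z window over an rt range.
xic = df[df["mz"].between(100.0, 100.5) & df["rt"].between(0.4, 1.6)]
print(xic["intensity"].sum())  # 10 + 30 + 50 = 90.0
```

With arrays in the cells you would first have to explode or loop over every scan's arrays before such a filter can run.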

@ypriverol
Collaborator

Hi @soerendip:

Sorry to arrive late to this discussion. My thoughts:

1- In PRIDE we are using Avro and parquet for data handling: peptide evidence storage and spectra.
2- While we are not using the current implementation, I think it would be great to continue developing the parquet version here, to enable in the future the development of new algorithms and storage systems for fast spectrum retrieval. As @soerendip points out, mzMLb is a binary mzML file format, but no major application is using it yet. I see some major advantages in parquet associated with its columnar design.
3- @soerendip @nielshulstaert @caetera it would be great to discuss some use cases and advantages of the parquet format for the ultimate design.

@sorenwacker
Author

Sure.

@sorenwacker
Author

BTW, you can find the files that I used below.

The files were downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1569/descriptors. And the converted files can be downloaded from https://soerendip.com/dl/MTBLS1569/

I used the 12 files starting with T for the test.

@sorenwacker
Author

sorenwacker commented May 19, 2021

Apparently, mzMLb can be faster if generated with a better compression type; for some reason the compression is zip in these files. There is another dependency that was not installed (hdf5plugin, I believe), which supposedly makes the file faster to read. However, it has not worked so far and I am not sure what is wrong. I also looked at the file sizes. That is where mzMLb is better than the older mz... formats.

[Figure: file sizes per format]

Apparently, storing the data in long format blows up the parquet file size (compare parquet-Mint with parquet-TRR; TRR = ThermoRawFileReader). This is just the reading time, without formatting the data into long format.

And this was all done with Python. It would be interesting to see how much faster other parsers are, e.g. from OpenMS or XCMS.
