Parquet format: add polarity, and maybe change to long format #116
Comments
Hi @soerendip, I should have read this issue before replying to your other one; thanks for your input. @ypriverol, are you using the parquet format output? I think you're more qualified to answer this one. We could add another parquet output option or add yet another flag. |
Hi @nielshulstaert, is it worth considering moving the format options into sub-commands? I.e., similar to xic and query, we would introduce mzml, mgf, and parquet options to switch the format and substitute |
I am not sure what is best for the community. There is also the mzMLb format, which was recently announced. The strength of the parquet format is its columnar layout, which I think is undermined when the data is compressed into arrays stored in the cells. I will run some tests next week and get back to you. |
Hi @soerendip, sorry to arrive late to this discussion. My idea here: 1- We in PRIDE are using Avro and parquet for data handling Peptide evidence storage and Spectra. |
Sure. |
BTW, here are the files that I used. The files were downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1569/descriptors. The converted files can be downloaded from https://soerendip.com/dl/MTBLS1569/ (I used the 12 files starting with T for the test). |
Apparently, mzMLb can be faster if generated with a better compression type. For some reason the compression in these files is zip. There is another dependency that was not installed (hdf5plugin, I believe), which supposedly makes the files faster to read; however, it has not worked so far and I am not sure what is wrong. I also looked at the file sizes. That is where mzMLb is better than the former mz... formats. Apparently, storing the data in long format blows up the parquet file size (compare parquet-Mint with parquet-TRR; TRR=ThermoRawfileReader). This is just the reading time, without formatting the data to long format, and everything was done with Python. It would be interesting to see how much faster other parsers are, e.g. from OpenMS or XCMS. |
I miss the ionization mode in the parquet format.
Did you consider not storing the intensities and masses as arrays but exploding them?
I find the data much easier to analyse if the data is in long format.
And because parquet compresses the data, it should not blow up the file size too much.
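To illustrate what "exploding" the arrays would mean, here is a minimal sketch with pandas. The column names (`scan`, `polarity`, `mz`, `intensity`) and the helper `to_long_format` are made up for this example and are not the actual schema of the parquet output; the `polarity` column stands in for the ionization mode requested above.

```python
import pandas as pd


def to_long_format(wide, array_cols=("mz", "intensity")):
    """Turn one-row-per-scan data with array cells into one-row-per-peak data."""
    # Explode all array columns together so that the i-th m/z value stays
    # paired with the i-th intensity (multi-column explode needs pandas >= 1.3).
    long = wide.explode(list(array_cols), ignore_index=True)
    # explode() leaves the exploded columns with object dtype; cast to float.
    return long.astype({col: "float64" for col in array_cols})


# Hypothetical wide table: one row per scan, arrays stored in the cells.
wide = pd.DataFrame(
    {
        "scan": [1, 2],
        "polarity": ["+", "-"],
        "mz": [[100.0, 200.0, 300.0], [150.0, 250.0]],
        "intensity": [[10.0, 20.0, 30.0], [5.0, 15.0]],
    }
)

long = to_long_format(wide)
print(long)
```

In the long table the scan-level columns (`scan`, `polarity`) are repeated once per peak. Those repeated values are what the parquet compression would have to absorb to keep the file size down.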