File format #12

cortner · 2020-07-11T16:49:29Z

I am trying to settle on a file format for v1.x and would appreciate feedback on some thoughts:

There are essentially two perspectives:

1. Use a slim format that contains the bare minimum of information to reconstruct the bases and potentials; or
2. Dump the entire Julia type into a file, including everything that could be recomputed at runtime.

I think both have benefits. Option 1 will lead to MUCH smaller files, but those files can only be read and understood with the code that wrote them. Option 2 will include lots of "meta-data" type information that is not needed to reconstruct the types, but that information will make it easy to write a parser at some point that can read the file even if the original code is lost or cannot be made to run for whatever reason.

As I'm writing this I wonder whether there is a third way:

Have a minimal "slim" file format as the default, but provide the option of also saving the meta-data mentioned above, which would be a "human-readable" specification of the potential or basis.
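To make the third option concrete, here is a minimal sketch of a writer that is slim by default and attaches the redundant meta-data only on request. It is written in Python purely for illustration (the project itself is Julia), and every name in it (`serialize_potential`, `basis_labels`, the `meta` key, the polynomial basis) is a hypothetical stand-in rather than anything from the actual code.

```python
import json

def basis_labels(maxdeg):
    # Fully recomputable from `maxdeg`; stored only to help human readers.
    return [f"x^{n}" for n in range(maxdeg + 1)]

def serialize_potential(maxdeg, coeffs, include_meta=False):
    """Write the slim record; optionally attach a redundant description."""
    record = {
        "basis": {"type": "polynomial", "maxdeg": maxdeg},  # hypothetical spec
        "coeffs": list(coeffs),
    }
    if include_meta:
        record["meta"] = {
            "basis_functions": basis_labels(maxdeg),
            "note": "redundant, human-readable description; ignored on load",
        }
    return json.dumps(record, indent=2)

def load_potential(text):
    """Reconstruct from the slim part only; the meta block is never required."""
    record = json.loads(text)
    record.pop("meta", None)
    return record["basis"]["maxdeg"], record["coeffs"]

slim = serialize_potential(3, [1.0, 0.5, 0.0, -0.25])
fat = serialize_potential(3, [1.0, 0.5, 0.0, -0.25], include_meta=True)
```

Because the loader ignores `meta`, both files reconstruct the same object, and the extra block exists only for a future reader or parser.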

@gabor1 your perspective would be particularly appreciated here.

gabor1 · 2020-07-13T16:41:31Z

So the route that @albapa and I have taken is to write a "fat" file with everything in it, even the original training data, i.e. enough to actually rerun the training with a future version of the code. The file is structured in such a way that it can simply be transformed (by removing lines) into a "thin" format that is just enough to evaluate the potential, possibly using some restricted versions of the code (in our case, versions of the code newer than the one that wrote the file, but you could be even thinner than that).

We write the fat version by default, because users often don't mind large files and it helps debugging. If a utility is provided to transform fat files into thin files, then users don't need to carry around large files if that is a problem. Developers who might be creating a huge number of potential files in a short space of time during development will know how to switch on the thin writer.
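A fat-to-thin utility of the kind described above could look roughly like the following. This is only a sketch in Python, assuming the fat file is a keyed record whose retraining-only parts live in clearly named sections; the section names, the example record, and the command line string are all made up for illustration.

```python
import json

# Hypothetical names of the sections that are only needed for retraining.
FAT_SECTIONS = ("training_data", "command_line")

def fat_to_thin(fat_record, fat_sections=FAT_SECTIONS):
    """Return a copy of the record without the retraining-only sections."""
    return {key: value for key, value in fat_record.items()
            if key not in fat_sections}

fat_record = {
    "version": 1,
    "coeffs": [0.1, 0.2],
    "training_data": [{"energy": -1.5, "positions": [[0.0, 0.0, 0.0]]}],
    "command_line": "fit --degree 2",  # invented example value
}
thin_record = fat_to_thin(fat_record)
```

The point of structuring the fat file this way is that thinning is a pure deletion: nothing in the remaining sections has to be rewritten.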

cortner · 2020-07-13T16:56:08Z

Ok so that sounds like some form of mixed thin/fat format would be ideal.

gabor1 · 2020-07-13T17:01:51Z

Or just a structured fat file, so that it is easy to remove the fat…

albapa · 2020-07-13T19:10:35Z

We used to use a binary format, which was very fast but a pain in all other respects. Then we went for XML, with CDATA sections for the meta-data, i.e. training configurations and command line options. I think this was a very good choice, for the reasons above. We also have the option of companion files for storing large numbers of reals, which are slow and cumbersome to read via XML; these are read in C. In the training code there is an option to omit the training data, which is useful for explorations and quick tests, and for distribution we use the full version.
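As a rough illustration of the XML-with-CDATA layout, the sketch below stores numeric parameters as ordinary element text and the free-form training configuration in a CDATA section. It uses Python's standard `xml.dom.minidom`; the element names (`potential`, `coefficients`, `training_data`) and the sample text are assumptions for this example, not the actual schema.

```python
from xml.dom import minidom

def write_potential_xml(coeffs, training_text):
    doc = minidom.Document()
    root = doc.createElement("potential")
    doc.appendChild(root)

    params = doc.createElement("coefficients")
    params.appendChild(doc.createTextNode(" ".join(repr(c) for c in coeffs)))
    root.appendChild(params)

    meta = doc.createElement("training_data")
    # CDATA keeps "<" and "&" verbatim; the text must not contain "]]>".
    meta.appendChild(doc.createCDATASection(training_text))
    root.appendChild(meta)
    return doc.toxml()

def read_potential_xml(xml_text):
    doc = minidom.parseString(xml_text)

    def text_of(tag):
        node = doc.getElementsByTagName(tag)[0]
        # Works for plain text nodes and CDATA sections alike.
        return "".join(child.data for child in node.childNodes)

    coeffs = [float(tok) for tok in text_of("coefficients").split()]
    return coeffs, text_of("training_data")

xml_text = write_potential_xml([0.5, -1.25], "config 1: E < 0 & converged\n")
coeffs, training_text = read_potential_xml(xml_text)
```

Dropping the `training_data` element yields the thin variant, which matches the option to omit training data mentioned above.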

I guess today we would use a JSON file.

cortner · 2020-07-13T19:41:09Z

So far I've stored huge amounts of reals in a separate HDF5 file. So similar to your approach.

What's your view on JSON (or XML), zip-compressed as needed?

albapa · 2020-07-13T19:59:36Z

I was going to say go with BSON, but then I saw your message on Slack... Maybe read/write is still faster though.

I think zipping the JSON would be perfect, although I don't know how the parsing performance (of default libraries) compares to XML.

cortner · 2020-07-14T15:48:15Z

I think that's where I'm going then. Julia has very nice zip format integration via ZipFile.jl.

cortner · 2022-01-14T03:30:45Z

I'm going to close this. Zipped JSON files turn out to be easy to manage in Julia and give exactly the level of flexibility we need.
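For readers unfamiliar with the pattern, a zipped JSON round trip can be sketched as below. This uses Python's standard `zipfile` and `json` modules for illustration only; the project does this in Julia via ZipFile.jl, and the helper names, the member name `potential.json`, and the example record are assumptions.

```python
import io
import json
import zipfile

def save_zipped_json(target, record, member="potential.json"):
    """Write `record` as one compressed JSON member of a zip archive."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, json.dumps(record))

def load_zipped_json(source, member="potential.json"):
    """Read the JSON member back into Python objects."""
    with zipfile.ZipFile(source) as zf:
        return json.loads(zf.read(member).decode("utf-8"))

# `target` may be a file path or a file-like object; a buffer is used here.
record = {"version": 1, "coeffs": [0.0] * 1000}
buffer = io.BytesIO()
save_zipped_json(buffer, record)
buffer.seek(0)
restored = load_zipped_json(buffer)
```

The file stays plain JSON once unzipped, so it remains human-readable, while compression recovers much of the size cost of a text format.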

cortner closed this as completed Jan 14, 2022
