File format #12

Closed
cortner opened this issue Jul 11, 2020 · 8 comments

Comments

@cortner
Member

cortner commented Jul 11, 2020

I am trying to settle on a file format for v1.x and would appreciate feedback on some thoughts:

There are essentially two options:

  1. Use a slim format that contains the bare minimum of information to reconstruct the bases and potentials; or
  2. Dump the entire Julia type into a file including everything that could be recomputed at runtime.

I think both have benefits. Option 1 will lead to MUCH smaller files, but they can only be read and understood with the code that wrote them. Option 2 will carry lots of "meta-data" that is not needed to reconstruct the types, but it will make it easy to write a parser at some point that can read the file even if the original code is lost or cannot be made to run for whatever reason...

As I'm writing this I wonder whether there is a third way:

  3. Have a minimal "slim" file format as the default, but provide the option of also saving the meta-data mentioned above, which would be a "human-readable" specification of the potential or basis.

@gabor1 your perspective would be particularly appreciated here.

@gabor1
Collaborator

gabor1 commented Jul 13, 2020

So the route that @albapa and I have taken is to write a "fat" file with everything in it, even the original training data, i.e. enough to actually rerun the training with a future version of the code. The file is structured in such a way that it can simply be transformed (by removing lines) to a "thin" format that is just enough to evaluate the potential, possibly using some restricted versions of the code (in our case, versions of the code later than the one that wrote the file, but you could go even thinner than that).

We write the fat version by default, because users often don't mind large files and it helps debugging. If a utility is provided to transform fat files to thin files, then users don't need to carry around large files if that is a problem. Developers who might be creating a huge number of potential files in a short space of time during development will know how to switch on the thin writer.
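
The fat-to-thin transformation described above can be sketched as follows. This is only an illustration in Python, not the code used by either project: the key names (`format_version`, `basis`, `coefficients`, `training_data`, `command_line`) and the helper `to_thin` are hypothetical, since the thread does not specify a schema.

```python
import json

# Hypothetical schema: these key names are assumptions for illustration only.
# The "thin" keys are the ones needed to evaluate the potential.
THIN_KEYS = {"format_version", "basis", "coefficients"}


def to_thin(fat: dict) -> dict:
    """Return a copy of a fat record that keeps only the evaluation data."""
    return {key: value for key, value in fat.items() if key in THIN_KEYS}


# A made-up "fat" record: evaluation data plus everything needed to refit.
fat = {
    "format_version": 1,
    "basis": {"type": "polynomial", "degree": 3},
    "coefficients": [0.5, -1.25, 2.0],
    "training_data": [{"positions": [[0.0, 0.0, 0.0]], "energy": -1.0}],
    "command_line": "fit --degree 3",
}

thin = to_thin(fat)
thin_json = json.dumps(thin, sort_keys=True)
fat_json = json.dumps(fat, sort_keys=True)
```

Because the thin record is a strict subset of the fat one, a reader that understands the thin keys can load either file and simply ignore the rest.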

@cortner
Member Author

cortner commented Jul 13, 2020

Ok so that sounds like some form of mixed thin/fat format would be ideal.

@gabor1
Collaborator

gabor1 commented Jul 13, 2020 via email

@albapa

albapa commented Jul 13, 2020

We used to use a binary format, which was very fast but a pain in all other respects. Then we went for XML, with CDATA lines for the meta-data, i.e. training configurations and command line options. I think this was a very good choice, for the reasons above. We also have the option of companion files for storing large arrays of reals, which are slow and cumbersome to read from XML - these are read in C. In the training code there is an option to omit the training data, which is useful for explorations and quick tests; for distribution we use the full version.

I guess today we would use a JSON file.

@cortner
Member Author

cortner commented Jul 13, 2020

So far I've stored large arrays of reals in a separate HDF5 file, so similar to your approach.

What's your view on JSON (or XML), compressed as zip when needed?

@albapa

albapa commented Jul 13, 2020

I was going to say go with BSON, but then I saw your message on Slack... Maybe read/write is still faster though.

I think zipping the JSON would be perfect - although I don't know how parsing performance (of default libraries) compares to XML.

@cortner
Member Author

cortner commented Jul 14, 2020

I think that's where I'm going then. Julia has very nice zip format integration via ZipFile.jl.

@cortner
Member Author

cortner commented Jan 14, 2022

I'm going to close this - zipped JSON files turn out to be easy to manage in Julia and offer exactly the level of flexibility we need.
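
For readers unfamiliar with the approach, a zipped-JSON round trip might look like the sketch below. It uses Python's standard `json` and `zipfile` modules purely for illustration; the project itself does this in Julia. The member name `potential.json`, the helper names, and the example record are assumptions, not the project's actual layout.

```python
import io
import json
import zipfile


def write_zipped_json(obj, archive, member="potential.json"):
    """Serialise obj as JSON into a single deflate-compressed zip member."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, json.dumps(obj))


def read_zipped_json(archive, member="potential.json"):
    """Read the named member back from the archive and parse it as JSON."""
    with zipfile.ZipFile(archive, "r") as zf:
        return json.loads(zf.read(member).decode("utf-8"))


# Made-up record; a real file would hold the basis and potential parameters.
potential = {
    "basis": {"type": "polynomial", "degree": 3},
    "coefficients": [0.001 * i for i in range(1000)],
}

# An in-memory buffer stands in for a file on disk; a path works the same way.
buffer = io.BytesIO()
write_zipped_json(potential, buffer)
buffer.seek(0)
restored = read_zipped_json(buffer)
```

The uncompressed member stays plain JSON, so the archive can be unpacked and inspected with any standard zip tool.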

@cortner cortner closed this as completed Jan 14, 2022