Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Archive format AEP #21

Merged
merged 16 commits into from
Jan 10, 2021
Merged

Conversation

chrisjsewell
Copy link
Member

@chrisjsewell chrisjsewell commented Sep 18, 2020

  • Used AEP template from AEP 0
  • Status is submitted
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

@chrisjsewell chrisjsewell marked this pull request as ready for review September 23, 2020 14:43
@chrisjsewell
Copy link
Member Author

I'm sure there is room for improvement, but I think this is now in a state for review

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good start. I've made a few of suggestions to improve format and content.

Edit: Accidentally submitted comment as "comment", this should be "changes requested".

005_exportformat/readme.md Outdated Show resolved Hide resolved
|------------|------------------------------------------------------------------|
| Title | Implement a new archive export format |
| Authors | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |
| Champions | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please identify a champion for this AEP, it cannot be yourself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giovannipizzi do you want to champion? Is there a specific definition for a champion? Is it just someone else who thinks it is a good idea?


## Proposed Enhancement

The goal of this project is to first develop a set of agreed requirements for a new archive format, followed by a concrete implementation of the format, and accompanying export and import functions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revise the description to focus on the proposed enhancement, we can add detailed notes on its implementation in a later section. This should not be a meta-description of the AEP process itself.

005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
The alternative approach would be to use the newly implemented "packfile" object-store, with coordinating SQLite database.
The pros and cons of this approach have been previously assessed in <https://github.com/aiidateam/AEP/pull/11>.

### Archive compression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the ability to compress the archive should be discussed as part of the requirements and I would also rephrase it such that it is clear that the "compression" is the requirement, not the fact that one can compress the whole archive. That's always a given.


### Archive compression

For portability, it is desirable that the full archive be contained within a single zipped file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think we should rephrase this, because I can always place some directories or files in a zip-file.


## Pros and Cons

For implementing a new format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this AEP is trying to combine two different arguments: 1. Whether a new format should be developed and 2. What format that would be. It is easy to combine these two questions by simply making "keep the current format" one of the alternatives.

In that sense, this AEP should identify a proposed specific solution as a result of the earlier discussion and the "Pros and Cons" of that specific solution should be discussed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, I wasn't quite sure what the initial scope and "completeness" of these AEPs had to be to merge in the submitted state?
The question of "whether to keep the current format", as you say, is somewhat tied to already having decided on a specific solution, which in-turn requires that all potential solutions (including Zarr 😉) are fully assessed, and most likely requires some prototyping of one or more of the solutions with tests/benchmarking to compare against the current format etc

Does all this work need to be completed before a submission is merged? Does it stay as an open PR until this is all done, or is it enough to initially provide an adequate justification for commencing the work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far we just kept AEPs in the PR stage until they appeared ready. I would be open to merge draft with subsequent updates, but that should probably be discussed in plenum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the outcome of the plenum discussion was to bring this AEP into an agreeable state concerning the requirements as soon as possible and then merge it with the status "submitted" and subsequent updates,, but also @ltalirz is going to have another look at this.

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
@chrisjsewell chrisjsewell changed the title Aep importexport AEP Archive format Sep 29, 2020
* `data.json` contains all requisite information from the SQL Database.
* `nodes` contains the "object store" files per node, organised by UUID: `xxyy-zz...`.

Particularly for large export archives, writing to (export) and reading from (import) `data.json` represents a significant bottle-neck in performance for these processes, both in respect to memory usage and process speed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add here the two principle reasons:

  1. The file repository is inefficiently stored. Each file is written as its own file and uncompressed, which requires a lot of inodes and just indexing all file content is slow.
  2. For the validity of data.json to be determined, the entire content has to be read into memory, which becomes a limiting factor for export size very quickly

005_exportformat/readme.md Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
Comment on lines 101 to 102
An additional feature to consider would be delta increments, such that an existing archive file could be amended during an export.
This may allow for a push/pull interface for "syncing" an archive file to a particular AiiDA profile.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but are you referring here to what @giovannipizzi described that he would like to see as a feature? Because I am not sure this is exactly what he meant. What I think he meant was that we should ideally have push/pull functionality on AiiDA databases. And one way of implementing this was through means of export archives. Let's say you have a database, you create an export archive of it and import it in a second db. Then you create additional nodes in db1 and you want to "push" them to db2. The mechanism would be to compute the diff, export those into an archive and then just import that archive in db2. So I don't think we would require directly updating the archives themselves, as this section in the AEP seems to suggest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I guess @giovannipizzi can clarify this

When writing an export file, these issues should not necessarily be present.

[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.
It is unclear though if this would actually provide any performance gains, since in many cases the full JSON will still need to be loaded into memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of "streamable JSON" if it still requires reading everything in memory 😕

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I undersold it here lol. I guess you use it in "searching" for records; e.g. by iterating over each JSON "piece" (loading only that into memory) and checking if it contains the key(s) you are looking for.
I will give a deeper dive on this though

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
@chrisjsewell
Copy link
Member Author

Points to address from meeting:

  • Change Data “accessibility” -> “longevity”

    • Distinguish between longevity and ease of access
  • Use cases required for manual changes, but this is normally due to issues in aiida-core caused by “faulty” migrations

    • There should also be some mention/consideration of simplicity vs complexity
    • This also usually relates to low vs high performance, and ascii vs binary
  • Move compression to design requirements

    • There should be a mention, in the user requirements, for a single file archive
    • Also relates to space on disc and introspection
  • More consideration and explanation of JSON streaming

    • Pro: does not require entire JSON to be kept in memory
    • Likely still needs index mapping
  • Derive a minimal required feature set (probably one table per import/export)

    • Compare to current implementation and discuss what features are new/removed
    • One example is storing logs (new)
  • Inspecting an archive should not require an AiiDA profile loaded (currently the case)

  • API interface should include a contains_uuid method

  • Benchmarking:
    Given 1 million nodes, with UUI key + 1 attribute, how long to write and access or check for existence.

005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved
005_exportformat/readme.md Outdated Show resolved Hide resolved

For small archives, this is most probably the best solution.
However, when considering large archives, single and large JSON-files are an extremely poor database format;
they must be read in full to access any data and don't support concurrency or ACID (atomicity, consistency, isolation, durability) transactions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the note on ACID - the JSON would be dumped during export by a single process, and never edited, so I don't think this comment applies here.


To support this, one could consider extending the current format to move towards a "NoSQL" database type implementation, splitting the JSON into multiple files (see for example [MongoDB](https://en.wikipedia.org/wiki/MongoDB)).

For example, node-level JSONs could be stored in the disk object store, together with a minimal index of UUID -> Hashkey mappings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small technical note - I think I was suggesting this, but in the end I checked and just writing/reading a multiline JSON is much faster, and if you want to keep an index (that would just increase performance but everything would work even without it) this can just be formed by the byte offset where the line of the JSON of a given node (with given UUID) starts.

At a node level, dumping data into a JSON and writing to disk, would also likely be faster than recreating database tables that must handle indexes, ACID, avoiding concurrency problems, etc.
When writing an export file, these issues should not necessarily be present.

[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, more than streaming, I think a multiline JSON would be a very good technical solution.

* It is a very stable and robust format with a clear long-term support plan until at least until 2050 (see <https://www.sqlite.org/lts.html>).

The main drawback of using SQL is that it is a binary format and so inherently not directly human-readable.
The format specification must be known before reading, and also SQLite version should be preserved within the archive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add that

  1. we can limit to SQLite 3 (and readers will know if the format is different)
  2. There are GUIs to inspect and edit graphically SQLite files, so this problem is mitigated.

Co-authored-by: Giovanni Pizzi <gio.piz@gmail.com>
README.md Outdated Show resolved Hide resolved
@chrisjsewell
Copy link
Member Author

I believe I've now addressed all comments, so merging this in "draft" status (see #23)

@chrisjsewell chrisjsewell changed the title AEP Archive format Add Archive format AEP Jan 10, 2021
@chrisjsewell chrisjsewell merged commit 4855195 into aiidateam:master Jan 10, 2021
@chrisjsewell chrisjsewell deleted the aep-importexport branch January 10, 2021 15:20
chrisjsewell added a commit to chrisjsewell/AEP that referenced this pull request Dec 16, 2021
commit b4b4053
Author: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Date:   Wed Dec 15 20:20:05 2021 +0100

    AEP 006 - Efficient object store for the AiiDA repository (aiidateam#11)

commit 0a5675d
Author: Sebastiaan Huber <mail@sphuber.net>
Date:   Fri Sep 10 18:16:30 2021 +0200

    Update README.md (aiidateam#26)

commit 5b45258
Author: Sebastiaan Huber <mail@sphuber.net>
Date:   Fri Sep 10 18:14:31 2021 +0200

    AEP 004: Infrastructure to import completed calculation jobs (aiidateam#12)

commit 4855195
Author: Chris Sewell <chrisj_sewell@hotmail.com>
Date:   Sun Jan 10 15:20:52 2021 +0000

    Add Archive format AEP (aiidateam#21)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants