Add Archive format AEP #21

chrisjsewell · 2020-09-18T16:26:27Z

chrisjsewell · 2020-09-23T14:44:16Z

I'm sure there is room for improvement, but I think this is now in a state for review

csadorf

I think this is a good start. I've made a few suggestions to improve the format and content.

Edit: Accidentally submitted comment as "comment", this should be "changes requested".


csadorf · 2020-09-28T12:08:10Z

005_exportformat/readme.md

+|------------|------------------------------------------------------------------|
+| Title      | Implement a new archive export format                            |
+| Authors    | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |
+| Champions  | [Chris Sewell](mailto:christopher.sewell@epfl.ch) (chrisjsewell) |


Please identify a champion for this AEP; it cannot be yourself.

@giovannipizzi do you want to champion? Is there a specific definition for a champion? Is it just someone else who thinks it is a good idea?

csadorf · 2020-09-28T12:12:05Z

005_exportformat/readme.md

+
+## Proposed Enhancement
+
+The goal of this project is to first develop a set of agreed requirements for a new archive format, followed by a concrete implementation of the format, and accompanying export and import functions.


Please revise the description to focus on the proposed enhancement; we can add detailed notes on its implementation in a later section. This should not be a meta-description of the AEP process itself.


csadorf · 2020-09-28T12:41:42Z

005_exportformat/readme.md

+The alternative approach would be to use the newly implemented "packfile" object-store, with coordinating SQLite database.
+The pros and cons of this approach have been previously assessed in <https://github.com/aiidateam/AEP/pull/11>.
+
+### Archive compression


I think that the ability to compress the archive should be discussed as part of the requirements and I would also rephrase it such that it is clear that the "compression" is the requirement, not the fact that one can compress the whole archive. That's always a given.

csadorf · 2020-09-28T12:42:06Z

005_exportformat/readme.md

+
+### Archive compression
+
+For portability, it is desirable that the full archive be contained within a single zipped file.


Again, I think we should rephrase this, because I can always place some directories or files in a zip-file.

csadorf · 2020-09-28T12:43:49Z

005_exportformat/readme.md

+
+## Pros and Cons
+
+For implementing a new format.


It seems to me that this AEP is trying to combine two different arguments: 1. Whether a new format should be developed and 2. What format that would be. It is easy to combine these two questions by simply making "keep the current format" one of the alternatives.

In that sense, this AEP should identify a proposed specific solution as a result of the earlier discussion and the "Pros and Cons" of that specific solution should be discussed here.

Yeh, I wasn't quite sure what the initial scope and "completeness" of these AEPs had to be to merge in the submitted state?
The question of "whether to keep the current format", as you say, is somewhat tied to already having decided on a specific solution, which in turn requires that all potential solutions (including Zarr 😉) are fully assessed, and most likely requires some prototyping of one or more of the solutions, with tests/benchmarking to compare against the current format, etc.

Does all this work need to be completed before a submission is merged? Does it stay as an open PR until this is all done, or is it enough to initially provide an adequate justification for commencing the work?

So far we just kept AEPs in the PR stage until they appeared ready. I would be open to merge draft with subsequent updates, but that should probably be discussed in plenum.

I think the outcome of the plenum discussion was to bring this AEP into an agreeable state concerning the requirements as soon as possible, then merge it with the status "submitted" and update it subsequently. @ltalirz is also going to have another look at this.


Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

sphuber · 2020-10-01T11:02:15Z

005_exportformat/readme.md

+* `data.json` contains all requisite information from the SQL Database.
+* `nodes` contains the "object store" files per node, organised by UUID: `xxyy-zz...`.
+
+Particularly for large export archives, writing to (export) and reading from (import) `data.json` represents a significant bottle-neck in performance for these processes, both in respect to memory usage and process speed.


Maybe we could add here the two principal reasons:

- The file repository is inefficiently stored. Each file is written as its own file and uncompressed, which requires a lot of inodes, and just indexing all file content is slow.
- For the validity of `data.json` to be determined, the entire content has to be read into memory, which becomes a limiting factor for export size very quickly.


sphuber · 2020-10-01T11:19:30Z

005_exportformat/readme.md

+An additional feature to consider would be delta increments, such that an existing archive file could be amended during an export.
+This may allow for a push/pull interface for "syncing" an archive file to a particular AiiDA profile.


I am not sure, but are you referring here to what @giovannipizzi described that he would like to see as a feature? Because I am not sure this is exactly what he meant. What I think he meant was that we should ideally have push/pull functionality on AiiDA databases. And one way of implementing this was through means of export archives. Let's say you have a database, you create an export archive of it and import it in a second db. Then you create additional nodes in db1 and you want to "push" them to db2. The mechanism would be to compute the diff, export those into an archive and then just import that archive in db2. So I don't think we would require directly updating the archives themselves, as this section in the AEP seems to suggest.

yep I guess @giovannipizzi can clarify this
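
To make the mechanism described above concrete, here is a minimal sketch of the "compute the diff" step, assuming each database can list the UUIDs of its nodes. The function name `uuids_to_push` and the sample UUID strings are hypothetical and are not part of aiida-core; the actual export and import of the resulting set is not shown.

```python
def uuids_to_push(source_uuids, target_uuids):
    """Return the UUIDs present in the source database but missing from the target.

    Hypothetical helper: both arguments are plain iterables of UUID strings.
    """
    return set(source_uuids) - set(target_uuids)


# db1 has gained two nodes since the archive was imported into db2.
db1 = {"a1", "b2", "c3", "d4"}
db2 = {"a1", "b2"}

# Only these nodes would need to be exported into a new archive and imported in db2.
missing = uuids_to_push(db1, db2)
```

With this approach the existing archive file is never modified; a fresh archive containing only `missing` is created for each push.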

sphuber · 2020-10-01T11:23:50Z

005_exportformat/readme.md

+When writing an export file, these issues should not necessarily be present.
+
+[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.
+It is unclear though if this would actually provide any performance gains, since in many cases the full JSON will still need to be loaded into memory.


What is the advantage of "streamable JSON" if it still requires reading everything in memory 😕

Perhaps I undersold it here lol. I guess you use it in "searching" for records; e.g. by iterating over each JSON "piece" (loading only that into memory) and checking if it contains the key(s) you are looking for.
I will give a deeper dive on this though
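
As a rough illustration of the "searching" use case mentioned above, the sketch below iterates over a JSON Lines stream and parses one record at a time, so only a single record is held in memory. The record layout (`uuid`, `label`) and the function name `iter_matching` are assumptions made for the example, not the archive schema.

```python
import io
import json


def iter_matching(stream, key):
    """Yield records from a JSON Lines stream that contain `key`.

    Each line is parsed on its own, so the full file is never loaded.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if key in record:
            yield record


# An in-memory stand-in for a JSONL file on disk.
sample = io.StringIO(
    '{"uuid": "aaa", "label": "first"}\n'
    '{"uuid": "bbb"}\n'
    '{"uuid": "ccc", "label": "third"}\n'
)
labelled = [record["uuid"] for record in iter_matching(sample, "label")]
```

Note that this is still a linear scan: finding a single record requires reading every preceding line unless some index is kept alongside the file.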

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>


chrisjsewell · 2020-10-02T03:46:24Z

Points to address from meeting:

- Change Data “accessibility” -> “longevity”
  - Distinguish between longevity and ease of access
- Use cases required for manual changes, but this is normally due to issues in aiida-core caused by “faulty” migrations
  - There should also be some mention/consideration of simplicity vs complexity
  - This also usually relates to low vs high performance, and ASCII vs binary
- Move compression to design requirements
  - There should be a mention, in the user requirements, of a single-file archive
  - Also relates to space on disk and introspection
- More consideration and explanation of JSON streaming
  - Pro: does not require the entire JSON to be kept in memory
  - Likely still needs index mapping
- Derive a minimal required feature set (probably one table per import/export)
  - Compare to the current implementation and discuss what features are new/removed
  - One example is storing logs (new)
- Inspecting an archive should not require an AiiDA profile to be loaded (currently the case)
- API interface should include a `contains_uuid` method
- Benchmarking:
  - Given 1 million nodes, with a UUID key + 1 attribute, how long does it take to write, and to access or check for existence?
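
A possible shape for the benchmark in the last point is sketched below. It uses an in-memory SQLite database purely as one candidate backend; the table layout, the helper `build_archive`, and the reduced node count are assumptions for illustration, and `contains_uuid` here is only a stand-in for the proposed API method.

```python
import sqlite3
import time
import uuid


def build_archive(n_nodes):
    """Write n_nodes rows (UUID key + 1 attribute) and time the write."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY, attribute INTEGER)")
    rows = [(str(uuid.uuid4()), i) for i in range(n_nodes)]
    start = time.perf_counter()
    with conn:  # commits the transaction on success
        conn.executemany("INSERT INTO nodes VALUES (?, ?)", rows)
    write_seconds = time.perf_counter() - start
    return conn, rows, write_seconds


def contains_uuid(conn, node_uuid):
    """Check for existence of a node via the primary-key index."""
    cur = conn.execute("SELECT 1 FROM nodes WHERE uuid = ? LIMIT 1", (node_uuid,))
    return cur.fetchone() is not None


# Use 1_000_000 for the benchmark proper; 10_000 keeps the demo quick.
conn, rows, write_seconds = build_archive(10_000)

start = time.perf_counter()
found = contains_uuid(conn, rows[5000][0])
lookup_seconds = time.perf_counter() - start

missing = contains_uuid(conn, "not-a-uuid")
print(f"write: {write_seconds:.4f}s, lookup: {lookup_seconds:.6f}s")
```

The same two functions could be re-implemented on top of each candidate format so that the timings are directly comparable.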


giovannipizzi · 2020-10-15T17:35:53Z

005_exportformat/readme.md

+
+For small archives, this is most probably the best solution.
+However, when considering large archives, single and large JSON-files are an extremely poor database format;
+they must be read in full to access any data and don't support concurrency or ACID (atomicity, consistency, isolation, durability) transactions.


I would remove the note on ACID - the JSON would be dumped during export by a single process, and never edited, so I don't think this comment applies here.

giovannipizzi · 2020-10-15T17:38:13Z

005_exportformat/readme.md

+
+To support this, one could consider extending the current format to move towards a "NoSQL" database type implementation, splitting the JSON into multiple files (see for example [MongoDB](https://en.wikipedia.org/wiki/MongoDB)).
+
+For example, node-level JSONs could be stored in the disk object store, together with a minimal index of UUID -> Hashkey mappings.


Small technical note - I think I was suggesting this, but in the end I checked and just writing/reading a multiline JSON is much faster, and if you want to keep an index (that would just increase performance but everything would work even without it) this can just be formed by the byte offset where the line of the JSON of a given node (with given UUID) starts.
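
The byte-offset index described in this comment could look roughly like the following. This is a sketch under assumptions: the node layout (`uuid`, `attributes`) and the helper names `write_nodes` and `read_node` are invented for the example, and an in-memory buffer stands in for the archive file.

```python
import io
import json


def write_nodes(stream, nodes):
    """Write one JSON object per line; return {uuid: byte offset of its line}."""
    index = {}
    for node in nodes:
        index[node["uuid"]] = stream.tell()  # where this node's line starts
        stream.write(json.dumps(node).encode("utf-8") + b"\n")
    return index


def read_node(stream, index, node_uuid):
    """Seek straight to the node's line and parse only that line."""
    stream.seek(index[node_uuid])
    return json.loads(stream.readline())


buffer = io.BytesIO()  # stand-in for a binary file opened on disk
index = write_nodes(
    buffer,
    [
        {"uuid": "aaa", "attributes": {"energy": -1.0}},
        {"uuid": "bbb", "attributes": {"energy": -1.5}},
    ],
)
node = read_node(buffer, index, "bbb")
```

As the comment notes, the index is optional: without it, the file can still be read line by line, only more slowly.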

giovannipizzi · 2020-10-15T17:39:11Z

005_exportformat/readme.md

+At a node level, dumping data into a JSON and writing to disk, would also likely be faster than recreating database tables that must handle indexes, ACID, avoiding concurrency problems, etc.
+When writing an export file, these issues should not necessarily be present.
+
+[JSON streaming](https://en.wikipedia.org/wiki/JSON_streaming) technologies, such as JSONL, also allow for JSON to be streamed, without the need to read the full file into memory.


As mentioned above, more than streaming, I think a multiline JSON would be a very good technical solution.

giovannipizzi · 2020-10-15T17:40:25Z

005_exportformat/readme.md

+* It is a very stable and robust format with a clear long-term support plan until at least until 2050 (see <https://www.sqlite.org/lts.html>).
+
+The main drawback of using SQL is that it is a binary format and so inherently not directly human-readable.
+The format specification must be known before reading, and also SQLite version should be preserved within the archive.


I would add that:

- we can limit to SQLite 3 (and readers will know if the format is different)
- there are GUIs to inspect and edit SQLite files graphically, so this problem is mitigated.
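
One way a reader could tell whether a file is in the SQLite 3 format is to check the 16-byte header string that every SQLite 3 database file starts with. The sketch below assumes the archive's database is a standalone file; the file names and the helper `is_sqlite3_file` are hypothetical.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file begins with this 16-byte header string.
SQLITE3_MAGIC = b"SQLite format 3\x00"


def is_sqlite3_file(path):
    """Return True if the file starts with the SQLite 3 header."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE3_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # A real (tiny) SQLite database; creating a table forces the file to be written.
    db_path = os.path.join(tmp, "archive.sqlite")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY)")
    conn.commit()
    conn.close()

    # Some other file that is not a database.
    other_path = os.path.join(tmp, "notes.txt")
    with open(other_path, "wb") as handle:
        handle.write(b"not a database")

    db_ok = is_sqlite3_file(db_path)
    other_ok = is_sqlite3_file(other_path)
```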

Co-authored-by: Giovanni Pizzi <gio.piz@gmail.com>


chrisjsewell · 2021-01-10T15:18:48Z

I believe I've now addressed all comments, so merging this in "draft" status (see #23)

- commit b4b4053, Author: Giovanni Pizzi <giovanni.pizzi@epfl.ch>, Date: Wed Dec 15 20:20:05 2021 +0100: AEP 006 - Efficient object store for the AiiDA repository (aiidateam#11)
- commit 0a5675d, Author: Sebastiaan Huber <mail@sphuber.net>, Date: Fri Sep 10 18:16:30 2021 +0200: Update README.md (aiidateam#26)
- commit 5b45258, Author: Sebastiaan Huber <mail@sphuber.net>, Date: Fri Sep 10 18:14:31 2021 +0200: AEP 004: Infrastructure to import completed calculation jobs (aiidateam#12)
- commit 4855195, Author: Chris Sewell <chrisj_sewell@hotmail.com>, Date: Sun Jan 10 15:20:52 2021 +0000: Add Archive format AEP (aiidateam#21)

chrisjsewell added 5 commits September 18, 2020 17:16

Initial copy of discussion notes

5ea8b24

add top spec table

ff07e2b

Update readme.md

ebad4ce

Update readme.md

49207b3

Update readme.md

f370692

ltalirz mentioned this pull request Sep 23, 2020

New efficient import format (and import/export functionality) for AiiDA 2.x aiidateam/aiida-core#4384

Closed

chrisjsewell added 2 commits September 23, 2020 15:41

update

ebc0a10

Update README.md

db14c0a

chrisjsewell marked this pull request as ready for review September 23, 2020 14:43

chrisjsewell requested review from csadorf and giovannipizzi September 23, 2020 14:43

csadorf reviewed Sep 28, 2020


chrisjsewell commented Sep 29, 2020


Apply suggestions from code review

f3f288c

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

chrisjsewell changed the title ~~Aep importexport~~ AEP Archive format Sep 29, 2020

chrisjsewell mentioned this pull request Sep 29, 2020

Changes in features for new export format aiidateam/aiida-core#4382

Closed

3 tasks

sphuber reviewed Oct 1, 2020


Update 005_exportformat/readme.md

29e7679

Co-authored-by: Sebastiaan Huber <mail@sphuber.net>

chrisjsewell commented Oct 2, 2020


chrisjsewell added 2 commits October 2, 2020 04:41

Apply suggestions from code review

be3ee36

Update 005_exportformat/readme.md

6c930d0

ltalirz mentioned this pull request Oct 12, 2020

revision of AEP "status" codes #22

Closed

Update readme.md

a874c0b

giovannipizzi reviewed Oct 15, 2020


Apply suggestions from @giovannipizzi

e22010d

Co-authored-by: Giovanni Pizzi <gio.piz@gmail.com>

This was referenced Oct 29, 2020

Export file size. Big exports. aiidateam/aiida-core#2399

Closed

Import / Export without empty directories aiidateam/aiida-core#3764

Closed

chrisjsewell mentioned this pull request Nov 2, 2020

Verdi export merge command and functionality aiidateam/aiida-core#4538

Open

sphuber mentioned this pull request Nov 30, 2020

Implement a new more efficient format for export archives aiidateam/aiida-core#4601

Closed

chrisjsewell commented Jan 10, 2021


Apply suggestions from code review

79c5c41

chrisjsewell commented Jan 10, 2021


Apply suggestions from code review

1f3b39a

chrisjsewell commented Jan 10, 2021


Apply suggestions from code review

dc46f13

chrisjsewell changed the title ~~AEP Archive format~~ Add Archive format AEP Jan 10, 2021

chrisjsewell merged commit 4855195 into aiidateam:master Jan 10, 2021

chrisjsewell deleted the aep-importexport branch January 10, 2021 15:20
