♻️ REFACTOR: New archive format #5145

Merged (14 commits) on Dec 1, 2021

Conversation

@chrisjsewell (Member) commented Sep 22, 2021

PR to implement the new archive format, as discussed in https://github.com/aiidateam/AEP/tree/master/005_exportformat

Note, all the new code is added to aiida/tools/archive, as opposed to the current aiida/tools/importexport,
since I think this is logically the better structure.

Abstraction

The archive format is now fully abstracted into the ArchiveFormatAbstract in aiida/tools/archive/abstract.py, with the sqlite implementation in aiida/tools/archive/implementations/sqlite.

tests/tools/importexport/test_abstract.py provides a good overview of the abstraction capabilities, but essentially you can open the archive in any of 'r', 'x', 'w', 'a' modes:

with archive_format.open(archive_path, mode='w') as writer:
	...

The abstraction is designed to closely resemble that of the "main" Backend, with methods such as bulk_insert, put_object, list_objects, open_object and delete_object, and a QueryBuilder implementation.
Eventually, the archive could become a "proper" backend, and the export/import functionality could be merged into a single transfer_data(in_backend, out_backend, ...) style method.

The 'a' (append) mode allows the archive to be modified and so, it is intended, will eventually allow "pushes" to the archive, analogous to archive imports but in reverse.
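
For illustration, a minimal sketch of the two most common modes (get_format, open and querybuilder are shown as in this PR; the writer method names come from the description above, but their exact signatures are omitted and should be checked against ArchiveFormatAbstract):

from aiida.tools.archive.abstract import get_format

archive_format = get_format()

# 'w' mode: create a new archive; the writer also exposes bulk_insert, put_object,
# list_objects, open_object and delete_object (signatures omitted here)
with archive_format.open('example.aiida', mode='w') as writer:
    ...

# 'r' mode: read-only access, including a QueryBuilder implementation
with archive_format.open('example.aiida', mode='r') as reader:
    qb = reader.querybuilder()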

Structure for sqlite-zip format

  • The file is a zip file
    • Why not tar? Because tar compresses the whole file, whereas zip compresses individual components.
      • whole file compression allows for optimization when compressing duplicated content, however, the AiiDA repository already deduplicates, so the benefit would be limited
      • whole file compression means that the whole file has to be decompressed before accessing a single component, a large disadvantage. By contrast, zip allows for random access of components, after obtaining their byte position in the central directory
  • The central directory is written with the metadata and database records at the top
    • Zip files are read starting from the end, which contains the byte position of the start of the central directory; the reader then scans down the central directory to extract a record for each file (see the sketch below)
    • When extracting the metadata/database only, one can simply scan for that record, then break and directly decompress the byte array for that file
    • In this way, we do not have to scan through all the records of the repository files (made possible by https://github.com/aiidateam/archive-path)
  • The database is in sqlite format
    • The schema is dynamically generated from the SQLAlchemy ORM classes for the "main" database (converting JSONB -> JSON, and UUID -> String)
    • Has a QueryBuilder implementation! (with some restrictions)
  • The repository files are stored directly in the zip (with compression) and named by the hashkey
    • Using a disk-objectstore container in the zip file adds no obvious benefit and overcomplicates the format (also making it potentially less future-proof)

[Figure: zip file internal structure]

(see also https://en.wikipedia.org/wiki/ZIP_(file_format)#/media/File:ZIP-64_Internal_Layout.svg)
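
As a quick illustration of this random-access property, using only Python's standard zipfile module (the member name metadata.json follows the description above; everything else here is illustrative):

import json
import zipfile

with zipfile.ZipFile('archive.aiida') as zf:
    # opening the file reads only the end-of-central-directory record and the
    # central directory; individual members are then decompressed on demand
    metadata = json.loads(zf.read('metadata.json'))
    print(sorted(metadata))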

Archive creation (export)

The archive creation now progresses in a much more "logical" manner:

  • First gather all entity primary keys (per type) that need to be exported.
    • This needs to proceed in the "reverse" order of table relationships
  • If a test run (--test-run CLI option), break here and print the entity counts
  • Now stream the full entities (per type) to the archive writer, in the order of relationships
    • The default batch size is 1000, i.e. read into memory 1000 full DB row dictionaries at a time (id, uuid, attributes, extras, ...) from the "main" DB and bulk insert them into the archive DB. This can be changed on the CLI with --batch-size (see the sketch after this list)
    • This is django/sqlalchemy backend agnostic
  • Finally stream the repository files, for the exported nodes
    • The bytes are streamed directly from the disk-objectstore container into the zip file
    • The compression level default is set at 6 (out of 9), which is the default for Python's zipfile. This can be controlled in the CLI with the --compression option
    • There is a source of inefficiency, in that the bytes (if packed in the container) are first uncompressed, then recompressed in the zip file
      • This is anyhow the case for exporting between containers, i.e. using a container in the archive would not fix this
      • It would only be possible to remove this, if the compressions were the same in container and zip, and would probably be overly complex to achieve
  • On exit of the writer context, the sqlite DB is closed and written to the zip file, then the metadata.json file.
    • The compression level of the DB is the same as for the repo files (see below for size comparison)
    • The metadata.json is not compressed, for quicker access
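
As a rough sketch of the batched streaming step mentioned above (the row source and the writer call are simplified stand-ins, not the actual internal API):

def stream_entities(rows, writer, entity_type, batch_size=1000):
    """Bulk-insert full DB row dictionaries into the archive in fixed-size batches."""
    batch = []
    for row in rows:  # e.g. dicts with id, uuid, attributes, extras, ...
        batch.append(row)
        if len(batch) >= batch_size:
            writer.bulk_insert(entity_type, batch)  # method name from the abstraction; signature assumed
            batch = []
    if batch:
        writer.bulk_insert(entity_type, batch)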

You can now also export all entities, with verdi archive create --all

For profiling, I have created https://github.com/chrisjsewell/process-plot, and tested against the 2d-database on Materials Cloud (TODO add URL)

$ pplot exec -c screen -sh 15 --title "2D-DB (new)" -p memory_rss,cpu_percent "verdi archive create -v info -b 1000 -f -G 19 -- 2d-export-new.aiida"
PPLOT INFO: Output files will be written to: /path/to/pplot_out, with basename: 20210921174712
PPLOT INFO: Staring command: ['verdi', 'archive', 'create', '-v', 'info', '-b', '1000', '-f', '-G', '19', '--', '2d-export-new.aiida']
PPLOT INFO: Running process as PID: 30913
Report: 
Archive Parameters
--------------------  -------------------
Path                  2d-export-new.aiida
Version               1.0
Compression           6

Inclusion rules
----------------------------  --------
Computers/Nodes/Groups/Users  Selected
Computer Authinfos            False
Node Comments                 True
Node Logs                     True

Traversal rules
---------------------------------  -----
Follow links input calc forwards   False
Follow links input calc backwards  True
Follow links create forwards       True
Follow links create backwards      True
Follow links return forwards       True
Follow links return backwards      False
Follow links input work forwards   False
Follow links input work backwards  True
Follow links call calc forwards    True
Follow links call calc backwards   True
Follow links call work forwards    True
Follow links call work backwards   True

Collecting entities: Users               100.0%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8
Report: Validating Nodes
Report: Creating archive with:
-----------  ------
Users             7
Computers        14
Groups            1
Nodes        109547
Links        159905
Group-Nodes  109547
-----------  ------
Archiving database: Group-Nodes          100.0%|███████████████████████████████████████████████████████████████████████████████████████████████████| 379021/379021
Archiving files:                         100.0%|███████████████████████████████████████████████████████████████████████████████████████████████████| 199565/199565
Report: Finalizing archive creation...
Report: Archive created successfully
Success: wrote the export archive file to 2d-export-new.aiida
PPLOT INFO: Total run time: 0 hour(s), 03 minute(s), 50.677877 second(s)
PPLOT INFO: Plotting results to: pplot_out/20210922130459.png
PPLOT SUCCESS!

You can see below that, compared to the existing archive, it is twice as fast (mainly because of disk-objectstore) and uses ~5 times less memory (and the memory will also grow more slowly with the number of nodes)!

version 1.6.5: [memory/CPU profile plot] (see also #4534)

current develop (seb's temporary format): [memory/CPU profile plot]

new format: [memory/CPU profile plot]

Key factors for memory usage:

  • Set object with node PKs (then less so for other entities)
  • List object with tuples of (group_id, node_id)
  • List object with LinkQuadruple items
  • Set object with repository hashkeys
    • note: to reduce memory usage, we delete the group/link list objects above before generating this hashkey set
  • Dict object of filename -> ZipInfo (which is written last to the zip file central directory on context exit)

Comparison of compression levels:

  • v1.6.5 default archive (level 6): 927 MB
  • current develop default archive (level 6): 785 MB
    • Uncompressed data.json: 510 MB
  • No compression: 4.5 GB
    • Uncompressed sqlite DB: 530 MB
  • level 1: 852 MB
  • level 6: 829 MB
  • level 9: 808 MB
  • level 6 repo files, no sqlite DB compression: 1.24 GB

Times to run the QueryBuilder context (see the querying section below):

  • level 1: 3.17 s ± 45.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • level 6: 3.13 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • level 6 repo files, no sqlite DB compression: 1.52 s ± 52.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    • i.e. without compression, you can extract the DB faster to the tempdir, before querying

Archive Inspect (CLI)

Default (fast):

$ verdi archive inspect archive.zip  
---------------  --------------------------
Version archive  1.0
Version aiida    2.0.0a1
Compression      6
Created          2021-10-06T00:48:38.842965
---------------  --------------------------

With database statistics (-d/--database):

$ verdi archive inspect -d archive.zip
---------------  --------------------------
Version archive  1.0
Version aiida    2.0.0a1
Compression      6
Created          2021-10-06T00:48:38.842965
---------------  --------------------------

Database statistics
-------------------
Users:
  count: 7
  emails:
  - a@b.com
  - aiida@theossrv2.epfl.ch
  - aiida@theossrv5.epfl.ch
  - davide.campi@epfl.ch
  - giovanni.pizzi@epfl.ch
  - ivano.castelli@epfl.ch
  - nicolas.mounet@epfl.ch
Computers:
  count: 14
  labels:
  - bellatrix
  - brisi
  - 'daint (Imported #1)'
  - daint-gpu
  - daint-mc
  - daint_aprun
  - daint_mc
  - 'daint_mc (Imported #0)'
  - daint_old
  - dora
  - dora_aprun
  - localhost
  - theospc14-direct_
  - theospc27slurm
Nodes:
  count: 126859
  node_types:
  - data.Data.
  - data.array.kpoints.KpointsData.
  - data.core.array.ArrayData.
  - data.core.array.bands.BandsData.
  - data.core.array.trajectory.TrajectoryData.
  - data.core.cif.CifData.
  - data.core.code.Code.
  - data.core.dict.Dict.
  - data.core.folder.FolderData.
  - data.core.remote.RemoteData.
  - data.core.singlefile.SinglefileData.
  - data.core.structure.StructureData.
  - data.core.upf.UpfData.
  - data.forceconstants.ForceconstantsData.
  - process.calculation.calcfunction.CalcFunctionNode.
  - process.calculation.calcjob.CalcJobNode.
  process_types:
  - aiida.calculations:codtools.ciffilter
  - aiida.calculations:quantumespresso.matdyn
  - aiida.calculations:quantumespresso.ph
  - aiida.calculations:quantumespresso.pw
  - aiida.calculations:quantumespresso.q2r
Groups:
  count: 19
  type_strings:
  - core
  - core.import
Comments:
  count: 0
Logs:
  count: 0
Links:
  count: 177215
Repo Files:
  count: 207783

Archive querying (API)

Use a QueryBuilder implementation, without importing the archive!

In [1]: from aiida.tools.archive.abstract import get_format
In [2]: archive_format = get_format()
In [3]: with archive_format.open("2d-export-new.aiida", "r") as reader:
   ...:     qb = reader.querybuilder()
   ...:     print(qb.append(ProcessNode, tag="tag").append(Code, with_outgoing="tag").distinct().count())
   ...: 
10817

The contextmanager extracts the sqlite database to a temporary directory, then starts an SQLAlchemy session and loads it into a BackendQueryBuilder implementation, as sketched below.

Note, at least currently:

  1. You cannot return actual AiiDA ORM instances (just specific column values)
    • This would require writing a whole backend entity implementation for the archive
    • Currently this fails with NotImplementedError, i.e. it should be clear to the user why it failed
  2. It will fail if you try to use "JSONB specific" query filters: contains, has_key, of_length, longer, shorter, of_type
    • TODO should just make sure these "fail elegantly"
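
A minimal sketch of that general pattern (temporary extraction plus an SQLAlchemy session); the internal database file name db.sqlite3 is an assumption, and the real reader wraps the session in a BackendQueryBuilder rather than exposing it directly:

import tempfile
import zipfile
from pathlib import Path

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

with tempfile.TemporaryDirectory() as tmpdir:
    with zipfile.ZipFile('archive.aiida') as zf:
        zf.extract('db.sqlite3', tmpdir)  # member name is an assumption
    engine = create_engine(f"sqlite:///{Path(tmpdir) / 'db.sqlite3'}")
    with Session(engine) as session:
        ...  # query via the dynamically generated ORM classes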

Archive migration

For migrations from "legacy" archives, the process proceeds as follows:

  1. Extract the metadata.json and data.json
  2. Pass these through the requisite legacy migrations, to convert them in-place
  3. Open the new archive zip file (in a temporary directory)
  4. Stream the repository files, computing their hashkeys and adding only unique files to the new archive (see the sketch after this list)
    • As we do this, we build a mapping of node UUIDs to repo files/folders
  5. Create and open the sqlite database and write the data.json to it (computing and merging in repository_metadata)
  6. Write the sqlite DB and metadata.json to the new archive
  7. Move the new archive from the temporary directory to the final location
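
A simplified sketch of the de-duplicating streaming in step 4 (the hash algorithm and helper names here are assumptions, not the actual implementation):

import hashlib
import zipfile

def stream_unique_files(file_iterator, new_zip: zipfile.ZipFile, seen_keys: set):
    """Add each file to the new archive under its content hash, skipping duplicates."""
    for node_uuid, relpath, content in file_iterator:  # hypothetical iterator over old-archive files
        key = hashlib.sha256(content).hexdigest()  # hash choice is an assumption
        if key not in seen_keys:
            seen_keys.add(key)
            new_zip.writestr(key, content)
        yield node_uuid, relpath, key  # used to build the node UUID -> repository_metadata mapping
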
$ pplot exec -c screen "verdi archive migrate -v info two_dimensional_database.zip migrated.aiida"
PPLOT INFO: Output files will be written to: /Users/chrisjsewell/Documents/GitHub/aiida_core_develop/pplot_out, with basename: 20210924013649
PPLOT INFO: Staring command: ['verdi', 'archive', 'migrate', '-v', 'info', 'two_dimensional_database.zip', 'migrated.aiida']
PPLOT INFO: Running process as PID: 74931
Report: Legacy migrations required
Report: Extracting data.json ...
Report: Legacy migration pathway: 0.9 -> 0.10 -> 0.11 -> 0.12
Performing migrations: 0.11 -> 0.12      100.0%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3
Report: aiida-core v1 -> v2 migration required
Report: Initialising new archive...
Converting repo                          100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614731/614731
Report: Unique files written: 199565
Report: Converting DB to SQLite
Adding Users                             100.0%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6
Adding Computers                         100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14
Adding Groups                            100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16
Adding Nodes                             100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109547/109547
Adding Links                             100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159905/159905
Adding Group-Nodes                       100.0%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4109/4109
Report: Finalising archive
Success: migrated the archive to version 1.0
PPLOT INFO: Total run time: 0 hour(s), 03 minute(s), 43.717983 second(s)
PPLOT INFO: Plotting results to: pplot_out/20210924013649.png
PPLOT SUCCESS!

[Plot: migration memory/CPU profile (20210924013649)]

Naturally, the peak memory usage is of the order of the size of the data.json.
BUT importantly, if the old archive is in ZIP format (not tar), the repository files are never extracted to disk (they are all streamed between the old archive and the new archive), so the disk usage will be much lower and the migration is a lot quicker.
(For TAR, it is a lot slower to randomly access files, so we need to first extract the entire archive.)

IMPORTANT: migrations of legacy versions '0.1', '0.2', and '0.3' are no longer supported.
This is because the '0.3' -> '0.4' migration requires manipulation of the repository files, and so would require extracting the entire old archive to disk.
I think this is an acceptable compromise, and obviously you could still use aiida-core v1 to first migrate up from these very old versions.

For future migrations, we will use alembic to migrate the sqlite DB, with migrations that almost identically mirror those of the main database (and only the database will be extracted to disk).
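
For context, a minimal sketch of driving such a migration programmatically with alembic (the script location and database path here are hypothetical):

from alembic import command
from alembic.config import Config

config = Config()
config.set_main_option('script_location', 'aiida.tools.archive:migrations')  # hypothetical location
config.set_main_option('sqlalchemy.url', 'sqlite:///extracted/db.sqlite3')   # the extracted archive DB
command.upgrade(config, 'head')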

Archive import

Importantly, the import is now "backend agnostic", i.e. there is no longer separate code for django and sqlalchemy (there are now bulk_insert / bulk_update methods on the backend).
The repository files are never extracted to a temp folder, they are directly streamed to the backend container.

The code has also been improved, so that it has more of a "logical flow", and you can also use the --test-run flag to bail out before repo files are imported and the transaction is committed.

The time is down from ~19 minutes to ~5 minutes, and the memory usage down from >1.5 GB to 300 MB!

v1.6.5

$ pplot exec -c screen --title "v1.6.5 (Django)" "verdi -p a-import-django archive import -v DEBUG 2d-export-1-65.zip"
PPLOT INFO: Output files will be written to: /users/pplot_out, with basename: 20210929060007
PPLOT INFO: Staring command: ['verdi', '-p', 'a-import-django', 'archive', 'import', '-v', 'DEBUG', '2d-export-1-65.zip']
PPLOT INFO: Running process as PID: 38417
Info: starting import: 2d-export-1-65.zip
Calling import function import_data_dj for the django backend.
Checking archive version compatibility

IMPORT
--------  ------------------
Archive   2d-export-1-65.zip

Parameters
--------------------------  ------
Comment rules               newest
New Node Extras rules       import
Existing Node Extras rules  kcl
CHECKING IF NODES FROM LINKS ARE IN DB OR ARCHIVE...
CREATING PK-2-UUID/EMAIL MAPPING...
Importing 109569 entities
ASSESSING IMPORT DATA...
Finding existing entities - User         100.0%|█████████████████████████████████████████████████████████████████████| 1/1
Reading archived entities - User         100.0%|█████████████████████████████████████████████████████████████████████| 7/7
Reading archived entities - Computer     100.0%|███████████████████████████████████████████████████████████████████| 14/14
Reading archived entities - Node         100.0%|███████████████████████████████████████████████████████████| 109547/109547
Reading archived entities - Group        100.0%|█████████████████████████████████████████████████████████████████████| 1/1
STORING ENTITIES...
Users -  existing entries                100.0%|█████████████████████████████████████████████████████████████████████| 1/1
Users -  storing new                     100.0%|█████████████████████████████████████████████████████████████████████| 6/6
Computers -  storing new                 100.0%|███████████████████████████████████████████████████████████████████| 14/14
CREATING NEW NODE REPOSITORIES...
Iterating node repositories              100.0%|███████████████████████████████████████████████████████████| 109547/109547
Nodes -  storing new                     100.0%|███████████████████████████████████████████████████████████| 109547/109547
Groups -  storing new                    100.0%|█████████████████████████████████████████████████████████████████████| 1/1
STORING NODE LINKS...
Links - label=output_structure           100.0%|███████████████████████████████████████████████████████████| 159905/159905
   (159905 new links...)
STORING GROUP ELEMENTS...
Groups - label=20210921-193409           100.0%|█████████████████████████████████████████████████████████████████████| 1/1
Done (cleaning up)                       100.0%|███████████████████████████████████████████████████████████| 109547/109547

Summary
-----------------------  ---------------
Auto-import Group label  20210929-061757
User(s)                  6 new
Computer(s)              14 new
Node(s)                  109547 new
Group(s)                 1 new
Link(s)                  159905 new

Success: imported archive 2d-export-1-65.zip
PPLOT INFO: Total run time: 0 hour(s), 18 minute(s), 57.606729 second(s)
PPLOT INFO: Plotting results to: pplot_out/20210929060007.png
PPLOT SUCCESS!

[Plot: v1.6.5 import memory/CPU profile]

New

$ pplot exec -c screen --title "New (Django)" "verdi -p a-import-django archive import -v info archive.zip"
PPLOT INFO: Output files will be written to: /users/pplot_out, with basename: 20210929053941
PPLOT INFO: Staring command: ['verdi', '-p', 'a-import-django', 'archive', 'import', '-v', 'info', 'archive2.zip']
PPLOT INFO: Running process as PID: 37289
Report: starting import: archive.zip
Report: Parameters
------------------------------  ----------------
Archive                         archive2.zip
New Node Extras                 keep
Merge Node Extras (in backend)  (k)eep
Merge Node Extras (in archive)  do (n)ot create
Merge Node Extras (in both)     (l)eave existing

Report: Skipping 1 existing User(s)
Report: Adding 6 new user(s)
Adding new user(s)                       100.0%|██████████████████████████████████████████████████████████████████████████████| 6/6
Report: Adding 14 new computer(s)
Adding new computer(s)                   100.0%|████████████████████████████████████████████████████████████████████████████| 14/14
Report: Adding 109547 new node(s)
Adding new node(s)                       100.0%|████████████████████████████████████████████████████████████████████| 109547/109547
Report: Gathering existing 'create' Link(s)
Processing 'create' Link(s)              100.0%|██████████████████████████████████████████████████████████████████████| 63278/63278
Report: Added 63278 new 'create' Link(s)
Report: Gathering existing 'input_calc' Link(s)
Processing 'input_calc' Link(s)          100.0%|██████████████████████████████████████████████████████████████████████| 96627/96627
Report: Added 96627 new 'input_calc' Link(s)
Report: Adding 1 new group(s)
Adding new group(s)                      100.0%|██████████████████████████████████████████████████████████████████████████████| 1/1
Report: Adding 109547 Node(s) to new Group(s)
Adding new group_node(s)                 100.0%|████████████████████████████████████████████████████████████████████| 109547/109547
Report: Created new import Group: PK=2, label=20210929-054123
Adding all Node(s) to the import Group   100.0%|████████████████████████████████████████████████████████████████████| 109547/109547
Collecting archive file keys             100.0%|████████████████████████████████████████████████████████████████████| 109547/109547
Report: Checking keys against repository ...
Report: Adding 199565 new repository files
Adding archive files to repository       100.0%|████████████████████████████████████████████████████████████████████| 199565/199565
Report: Committing transaction to database...
Success: imported archive archive2.zip
PPLOT INFO: Total run time: 0 hour(s), 04 minute(s), 53.726231 second(s)
PPLOT INFO: Plotting results to: pplot_out/20210929053941.png
PPLOT SUCCESS!

[Plot: new-format import memory/CPU profile]

@chrisjsewell changed the title from "✨ NEW: Archive format" to "♻️ REFACTOR: New archive format" on Sep 22, 2021
@giovannipizzi (Member) commented:

Thanks a lot, this looks great! (Memory, time, disk usage, and also the possibility to query inside!)

I didn't check the implementation, but the logic you describe above seems sound to me.
Just a few feedback comments:

  • regarding the migration from 0.1, 0.2, 0.3 (and possibly later versions in the future): if we drop it, I would strongly encourage moving the code into a separate minimal python package that we release (and which most probably won't require changes), that just does the migration, with minimal dependencies. This is because there are many old formats still around (I think? In which versions were 0.3 and 0.4?), also in the Materials Cloud Archive (so they will continue to be there for quite some time), and asking in a couple of years to install old versions of AiiDA might be tricky (it might require old python versions and dependencies etc.). Hopefully moving out the code should not require much time, and people could then just install that, migrate, and then use the most recent version of AiiDA to import.

  • Regarding the comment on alembic migrations: I guess that there will be one important difference if we migrate JSON fields - then the migration will not look exactly the same. But this is just as a note - I agree that this would make things much more homogeneous, so great.

  • will therefore the export format mirror the DB schema of the version of AiiDA at use? I guess this binds AiiDA to a specific SQL schema and removes a level of abstraction of ORM frontend from DB backend. I think this is OK as keeping too much abstraction that we don't plan to use makes things more complex, but just to double check (I imagine with your PR on parity between Django and SQLA, we now know the schema is the same right?)

  • regarding compression: I think the penalty of decompressing and recompressing is OK (and in any case, different AiiDA instances might, potentially, use different levels or types of compression, so this might be needed anyway). If one wanted, one might think of extending the archive format and streaming the compressed objects directly into the zip file, asking zip not to compress those again (see the sketch below). One would then also need to mark the fact that the file is compressed (and potentially, with which algorithm). This could be encoded directly in the file name with some special syntax, e.g. "_<compression_algo>" vs just "" for uncompressed. This makes exporting faster. If the compression_algo is the same as on the importing instance, then you can also stream while importing; otherwise you move the slowness of decompressing (and recompressing) to the importing side. Also in this case, anyway, this will be faster (as you remove the zip-level compress/decompress). Note, however, that this is just a note for discussion, and most probably it's not worth the complexity it would bring (also, this would move some "knowledge" of an implementation detail of the instance - what is compressed and what is not - into the archive file, which should be as agnostic as feasible, as a compromise with performance).
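
For reference, storing an already-compressed byte stream without re-compressing it is possible with the standard zipfile module; this is only a sketch of the idea discussed above (the "_<compression_algo>" member name is the hypothetical naming scheme), not something implemented in this PR:

import zipfile

already_compressed = b'...'  # bytes taken as-is from e.g. a zlib-compressed pack in the container

with zipfile.ZipFile('archive.aiida', mode='a') as zf:
    # ZIP_STORED writes the member verbatim, so the zip layer does not compress it again
    zf.writestr('repo/abc123_zlib', already_compressed, compress_type=zipfile.ZIP_STORED)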

@chrisjsewell (Member Author) commented Sep 24, 2021

Thanks @giovannipizzi

will therefore the export format mirror the DB schema of the version of AiiDA at use? I guess this binds AiiDA to a specific SQL schema and removes a level of abstraction of ORM frontend from DB backend. I think this is OK as keeping too much abstraction that we don't plan to use makes things more complex, but just to double check (I imagine with your PR on parity between Django and SQLA, we now know the schema is the same right?)

Yes indeed, the export schema will be "identical" to the AiiDA schema (with a few differences in exactly how fields/columns are stored, i.e. JSONB/JSON and UUID/CHAR(36)).

I don't think this "degrades" the abstraction any more than it already is for the archive/QueryBuilder (see e.g. the discussion in #5088 (comment)), i.e. we assume:

  1. A certain set of field (a.k.a column) names on the DB backend entities
  2. That these fields will serialize to / deserialize from a certain Python type, e.g. it does not matter whether you store as JSON/JSONB or UUID/CHAR(36) because they both go to/from Python dicts and strings respectively
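
As a rough illustration of assumption 2 (not the actual conversion code in this PR), the kind of column-type mapping involved when generating the sqlite schema could look like:

from sqlalchemy import String
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.types import JSON

def archive_column_type(coltype):
    """Map PostgreSQL-specific column types onto sqlite-compatible equivalents."""
    if isinstance(coltype, JSONB):
        return JSON()
    if isinstance(coltype, UUID):
        return String(36)
    return coltype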

I imagine with your PR on parity between Django and SQLA, we now know the schema is the same right?

Note, the parity PR does not change anything regarding the above two assumptions; it just changes some indexes on the database, converts a STRING to a TEXT type, and adds more non-nullable constraints
(I guess here technically the SQLA backend could currently return None instead of e.g. a string for certain fields, but this should not be the case practically)

@chrisjsewell (Member Author) commented Sep 24, 2021

I would strongly encourage to move the code into a separate minimal python package

asking in a couple of years to install old versions of AiiDA might be tricky

Hmm, couldn't one just use one of the Docker images, if direct installation is not possible?

It wouldn't be the end of the world to create this extra python package, just extra work and I'm lazy lol

@chrisjsewell (Member Author) commented Sep 24, 2021

Some quick questions arising from writing the importer:

  • Currently, in the import, we remove checkpoints from the node attributes before importing. It feels like we should also do this going forward, when we actually create the archive?
  • Currently, AuthInfo is not added to the archive on creation, nor is it "re-created" when importing Computers. It is unclear to me yet, how this manifests if you actually want to use these imported computers (will the computer just show as unconfigured?), and should there be an option to export/import AuthInfo (I imagine there could be a security/portability concern, so should not be on by default)?

@chrisjsewell (Member Author) commented Sep 24, 2021

Copying here also all the issues currently labelled as topic/archive, for potential closure:

@giovannipizzi (Member) commented:

I don't think this "degrades" the abstraction anymore that it already is for the archive/QueryBuilder

Maybe this is not what you meant, but just in case I wasn't clear, I'm not worried about the type change (e.g. UUID -> string). More that we might want to store things in the archive in a format that might not mirror 100% what's in the DB. But I think @sphuber raised a similar concern for the QueryBuilder in a different issue - as long as we have the same abstraction as the QB, I think we're fine.

Hmm, couldn't one just use one of the Docker images, if direct installation is not possible?

Unfortunately we only store images on DockerHub, that now deletes images not used for 6 months. So unfortunately this means one has to rebuild images from Dockerfiles in the future, most probably, and this has the same issues of dependencies (e.g. not pinned in a dependency of a dependency) that I was mentioning :-(

Currently, AuthInfo is not added to the archive on creation, nor is it "re-created" when importing Computers. It is unclear to me yet, how this manifests if you actually want to use these imported computers (will the computer just show as unconfigured?), and should there be an option to export/import AuthInfo (I imagine there could be a security/portability concern, so should not be on by default)?

Yes - the idea is that a 'configured' computer means a computer with an authinfo entry (for the current user). So they just show up as unconfigured. This is the intended behaviour I had in mind when this was originally implemented: you want to have a reference to the computer, but in most cases you are giving it to someone else (or reimporting on a different machine, possibly with a different type of configuration, e.g. with a proxy, or once you install AiiDA on the same computer so it's a localhost/local transport, while on others it is via SSH). Or maybe it's the same computer but another user is importing it, so they want to use a different user name to log in.

And, indeed, there might be security issues - even if this is the case currently, as we shouldn't store passwords or other credentials, but in the future (or some transport plugins) might decide (incorrectly) to do so.

So - I think by default we should continue not exporting/importing it.
I see the use case of "mirroring" an instance of AiiDA on another computer by the same user, though, so optionally allowing exporting/importing might be useful in some use cases (even if, at the moment, I'd say not crucial - we'll anyway need to think more when we start thinking properly about a push/pull mechanism for AiiDA profiles similar to git repos).

@chrisjsewell (Member Author) commented Sep 28, 2021

I think - as long as we have the same abstraction as the QB, I think we're fine.

I would point you towards #5154, for this more abstract (pun intended) discussion

Unfortunately we only store images on DockerHub, that now deletes images not used for 6 months.

Well if that’s the case, then I would suggest we should start thinking about storing the images that relate to specific versions somewhere controlled by us, e.g. the CSCS OpenStack. Perhaps, for future proofing, they should even be stored in the OCI image format (although I think Docker is anyway compliant with this).

@chrisjsewell (Member Author) commented:

In e8f337d, I've now added a good chunk of the import code.
One of the key things is that it is now "backend agnostic", i.e. there is no separate code for django and sqlalchemy.
I'll add all the explanation / profiling to the initial comment soon though

@chrisjsewell (Member Author) commented:

we'll need to anyway think more when we start thinking properly to a push/pull mechanism for AiiDA profiles similar to git repos

Oh I'm already thinking towards this 😉
Basically, now that the archive acts like a backend and the import is backend agnostic, I envisage that you can eventually "merge" the export/import code as conceptually a (bi-directional) backend transfer, i.e. going from the archive to the profile (and vice versa) you are just going from one backend to another. Something like def export(backend_source, backend_target, **kwargs)
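
To make that concrete, a hypothetical signature sketch (this function does not exist in the codebase; it just illustrates the idea above):

def export(backend_source, backend_target, **kwargs):
    """Copy entities and repository objects from one backend to another.

    With the archive implemented as a backend, both archive creation
    (profile -> archive) and archive import (archive -> profile) become
    instances of this single operation.
    """
    raise NotImplementedError  # conceptual sketch only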

@chrisjsewell (Member Author) commented Sep 29, 2021

Ok, there are numerous TODO comments in the code to address, and I've got to re-write the tests;
but all the components now functionally work (verdi archive create/migrate/inspect/import).
So I would welcome any initial reviews (give it a try!); erring on the side of just commenting/questioning on "architectural" aspects, rather than any nitpicks of code, docstrings, etc.
Cheers!

codecov bot commented Oct 7, 2021

Codecov Report

Merging #5145 (b1a7c46) into develop (d5084d6) will increase coverage by 0.04%.
The diff coverage is 92.26%.


@@             Coverage Diff             @@
##           develop    #5145      +/-   ##
===========================================
+ Coverage    81.27%   81.30%   +0.04%     
===========================================
  Files          534      529       -5     
  Lines        37423    37031     -392     
===========================================
- Hits         30410    30104     -306     
+ Misses        7013     6927      -86     
Flag Coverage Δ
django 76.80% <90.45%> (+0.65%) ⬆️
sqlalchemy 75.75% <89.39%> (+0.52%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/backends/sqlalchemy/models/authinfo.py 88.47% <ø> (ø)
aiida/backends/sqlalchemy/models/comment.py 96.16% <ø> (ø)
aiida/backends/sqlalchemy/models/computer.py 88.47% <ø> (-3.84%) ⬇️
aiida/backends/sqlalchemy/models/node.py 81.82% <ø> (ø)
aiida/backends/sqlalchemy/models/settings.py 82.23% <ø> (ø)
aiida/backends/sqlalchemy/models/user.py 94.74% <ø> (ø)
aiida/cmdline/utils/shell.py 31.15% <ø> (ø)
aiida/orm/__init__.py 100.00% <ø> (ø)
aiida/orm/implementation/__init__.py 100.00% <ø> (ø)
...m/implementation/sqlalchemy/querybuilder/joiner.py 91.31% <ø> (ø)
... and 73 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

chrisjsewell added a commit to chrisjsewell/aiida_core that referenced this pull request Oct 8, 2021
This is essentially an addition to aiidateam#5156 and is required for aiidateam#5145

Without the "optimised" use of `Container.get_objects_stream_and_meta` for the `DiskObjectStoreRepositoryBackend`, the profiled archive creation in aiidateam#5145 goes from 4 minutes to 9 minutes!
@chrisjsewell (Member Author) commented:

Wahoo, the archive now works (just about) as a full backend: https://aiida-archive-demo.readthedocs.io

@chrisjsewell (Member Author) commented:

Also, if you look at the code in aiida/tools/archive/imports.py now, you'll see that it is basically all written as simply a backend -> backend transfer (only using abstract methods from the Backend class)

@chrisjsewell force-pushed the new-archive branch 5 times, most recently from f8f4be2 to 8b3715d on December 1, 2021 03:50
Implement the new archive format,
as discussed in `aiidateam/AEP/005_exportformat`.

To address shortcomings in cpu/memory performance for export/import,
the archive format has been re-designed.
In particular,

1. The `data.json` has been replaced with an sqlite database,
   using the same schema as the SQLAlchemy backend,
   meaning it is no longer required to be fully read into memory.
2. The archive utilises the repository redesign,
   with binary files stored by hashkeys (removing duplication)
3. The archive is only saved as zip (not tar),
   meaning internal files can be decompressed+streamed independently,
   without the need to uncompress the entire archive file.
4. The archive is implemented as a full (read-only) backend,
   meaning it can be queried without the need to import to a profile.

Additionally, the entire export/import code has been re-written
to utilise these changes.

These changes have reduced export times by a factor of ~2.5, export peak RAM by a factor of ~4,
import times by a factor of ~4, and import peak RAM by a factor of ~5.
The changes also allow for future push/pull mechanisms.
Add option to not create import group of imported nodes
`ProcessNode` checkpoint attributes (for running processes)
contain serialized code and so can cause security issues, therefore they are now stripped by default.
`create_archive` anyhow only allows export of sealed `ProcessNode`,
i.e. ones for finalised processes which no longer require the checkpoints.
The `sphinx-sqlalchemy` extension is used to dynamically generate documentation for the database schema.
`BackendEntityAttributesMixin` leaked implementation specific variables and functions.
This has now been moved to the `DjangoNode`/`SqlaNode` classes.
`BackendEntityExtrasMixin` leaked implementation specific variables and functions.
This has now been moved to the `DjangoNode`/`SqlaNode` classes.