
Implement the new file repository #4345

Merged

Conversation

sphuber
Contributor

@sphuber sphuber commented Sep 1, 2020

Fixes #3445
Fixes #1663
Fixes #64
Fixes #1675
Fixes #3437
Fixes #2580
Fixes #3764
Fixes #4020

@codecov

codecov bot commented Sep 3, 2020

Codecov Report

Merging #4345 (44a3c11) into develop (4d90f37) will increase coverage by 0.06%.
The diff coverage is 89.09%.

❗ Current head 44a3c11 differs from pull request most recent head 0e250e6. Consider uploading reports for the commit 0e250e6 to get more accurate results

@@             Coverage Diff             @@
##           develop    #4345      +/-   ##
===========================================
+ Coverage    79.62%   79.67%   +0.06%     
===========================================
  Files          519      523       +4     
  Lines        37111    37166      +55     
===========================================
+ Hits         29546    29609      +63     
+ Misses        7565     7557       -8     
Flag Coverage Δ
django 74.27% <82.96%> (-0.09%) ⬇️
sqlalchemy 73.19% <79.60%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/cmdline/commands/cmd_database.py 62.41% <0.00%> (ø)
aiida/cmdline/commands/cmd_devel.py 43.48% <ø> (+0.63%) ⬆️
aiida/cmdline/commands/cmd_setup.py 49.13% <0.00%> (-3.70%) ⬇️
aiida/manage/database/integrity/duplicate_uuid.py 31.71% <ø> (-61.77%) ⬇️
aiida/orm/nodes/process/calculation/calcjob.py 76.15% <ø> (+1.59%) ⬆️
aiida/orm/nodes/process/process.py 86.09% <ø> (-0.51%) ⬇️
aiida/restapi/translator/nodes/node.py 85.44% <0.00%> (ø)
aiida/tools/importexport/common/utils.py 26.83% <ø> (-4.98%) ⬇️
aiida/backends/manager.py 66.67% <33.34%> (-3.44%) ⬇️
aiida/manage/configuration/__init__.py 78.58% <40.00%> (-0.42%) ⬇️
... and 72 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d90f37...0e250e6. Read the comment docs.

@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 13 times, most recently from 96b49db to 7941131 Compare September 26, 2020 23:09
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 5 times, most recently from 274264a to 4d6f6b1 Compare October 3, 2020 21:12
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 2 times, most recently from 198ea83 to 4ea60c7 Compare October 11, 2020 22:17
@chrisjsewell
Member

I've totally broken your PR now @sphuber 😆
I suggest that when you are ready to pick this up, you revert all changes to aiida/tools/importexport and then rebase (I've made a copy of your branch to preserve these changes for reference); I can then make a PR to your branch with the archive updates.

@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch 2 times, most recently from 9ff5293 to 21cb32f Compare November 5, 2020 12:14
@chrisjsewell
Member

chrisjsewell commented Nov 6, 2020

Discussion point (cc @sphuber, @giovannipizzi, @ltalirz)

In the current PR implementation, for the archive format, a disk-objectstore container is implemented internal to the archive. I think this is an assumption that requires further discussion:

The disk-objectstore, as I understand it, is essentially a compression format with concurrent read/write access and the ability to remove/overwrite objects in-place. These write features are not part of standard compression formats (zip, tar, ...) and are critical for the functioning of the AiiDA repository.
However, it is not necessarily required for the archive, where (at least currently) we only require the archive to be written from scratch (no overwrite) and from a single process (no write concurrency).

The drawback of having a disk-objectstore within the archive is that you are basically storing one compression format inside another. The key cons are:

  1. It increases the complexity of the archive format, making it less easy to manipulate and potentially having implications for long-term support.
  2. You cannot introspect/read these objects without first fully extracting the archive to a folder and then using the disk-objectstore API (which includes initialising an sqlite connection).

An alternative approach would be to simply store the archive as:

metadata.json
data.json
objects/
    <hash-key-1>
    <hash-key-2>
    ...

It should then be possible to stream the object bytes directly between the (compressed) archive and the AiiDA disk-objectstore, and I see no reason why this could not be reasonably performant compared with streaming directly from one container to another (but without the initial decompression overhead).
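
For concreteness, a minimal sketch of how such a flat archive could be written and read with just the standard library (names, layout and the `.aiida` filename here are illustrative, not a final format):

```python
import hashlib
import json
import zipfile

def write_archive(path, metadata, data, objects):
    """Write the proposed flat layout: metadata.json, data.json, objects/<hash-key>.

    `objects` is an iterable of bytes; each object is stored under its sha256 key.
    """
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr('metadata.json', json.dumps(metadata))
        archive.writestr('data.json', json.dumps(data))
        for content in objects:
            hashkey = hashlib.sha256(content).hexdigest()
            archive.writestr(f'objects/{hashkey}', content)

def read_object(path, hashkey):
    """Stream the bytes of a single object directly from the archive, without extracting it."""
    with zipfile.ZipFile(path, 'r') as archive:
        with archive.open(f'objects/{hashkey}') as handle:
            return handle.read()

write_archive('export.aiida', {'export_version': 'x.y'}, {}, [b'some file content'])
print(read_object('export.aiida', hashlib.sha256(b'some file content').hexdigest()))
```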

I'd note that one con/issue I've found with python's zipfile.ZipFile is that it generates a ZipInfo list for all objects (files+folders) on initiation, which can lead to high memory usage for a zip containing a large number of objects (O(n)) (see #4534 (comment)).
Although, this will be partially improved by there no longer being duplication of objects and no nested folders to store (as in the current format).
I also have some ideas to "fix" this (at least for targeted read operations).

Actually I read the memory wrong: the reading of a zip file with ~600,000 objects peaks at 0.4 GB, not 4 GB 🤦, so not quite as big a deal.

@chrisjsewell chrisjsewell linked an issue Nov 11, 2020 that may be closed by this pull request
@giovannipizzi
Member

Main comment to @chrisjsewell: I think your analysis is correct.
The important benchmark is whether adding millions of objects to a ZipFile object is performant or not.
If you do some tests and verify that putting, say, 10 million empty files in a zip is efficient from a memory and time point of view, both when reading and writing, and that it is also performant to move data from a disk-objectstore to a zip in this format and vice versa, then I agree your suggestion is better.
Before any complex implementation, a simple script benchmarking this would be great, and depending on the memory usage and CPU time we can decide what to do.
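
Something along these lines could serve as a first pass (a rough sketch; the object count, content sizes and measurement details would need tuning, and the disk-objectstore side is not covered here):

```python
import time
import tracemalloc
import zipfile

N_OBJECTS = 100_000  # scale up towards millions for a realistic test

tracemalloc.start()

# Write N_OBJECTS tiny members to a zip and time it.
start = time.perf_counter()
with zipfile.ZipFile('benchmark.zip', 'w') as archive:
    for index in range(N_OBJECTS):
        archive.writestr(f'objects/{index:08d}', b'')
print(f'write: {time.perf_counter() - start:.1f}s')

# Re-open and list all members; this is where the ZipInfo list memory shows up.
start = time.perf_counter()
with zipfile.ZipFile('benchmark.zip', 'r') as archive:
    names = archive.namelist()
print(f'read:  {time.perf_counter() - start:.1f}s ({len(names)} members)')

_, peak = tracemalloc.get_traced_memory()
print(f'peak traced memory: {peak / 1e6:.1f} MB')
```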

@sphuber
Contributor Author

sphuber commented Nov 11, 2020

Note that the implementation I chose here was just to get things working, given that we were planning to overhaul the entire archive format anyway. I therefore chose to change as little as possible and to keep the implementation as simple as possible. This at least guarantees that the entire code base works and is operational. Since we wouldn't release this before the new archive format is there anyway, we can even omit this intermediate format and its migrations once the new format and implementation are in place.

The `Repository` class is the new interface to interact with the file
repository. It keeps a file hierarchy in memory to know what is
contained within it. A new instance can be constructed from an existing
hierarchy using the `from_serialized` class method.

The actual storing of the objects in a repository is delegated to a
backend repository instance, represented by `AbstractRepositoryBackend`.
This backend class is only tasked with consuming byte streams and
storing them as objects in some repository under a unique key that
allows the object to be retrieved at a later time. The backend
repository is not expected to keep a mapping of what it contains. It
should merely provide the interface to store byte streams and return
their content, given their unique corresponding key.
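
To make that division of responsibilities concrete, here is a minimal sketch of what such a backend boils down to (method names are illustrative rather than the exact AiiDA signatures):

```python
import abc
import hashlib
import io
import typing as t

class AbstractRepositoryBackend(abc.ABC):
    """Sketch of a backend that stores byte streams under unique keys.

    The backend knows nothing about file hierarchies; it only maps keys to bytes.
    """

    @abc.abstractmethod
    def put_object_from_filelike(self, handle: t.BinaryIO) -> str:
        """Store the byte stream and return the key under which it can be retrieved."""

    @abc.abstractmethod
    def get_object_content(self, key: str) -> bytes:
        """Return the content of the object identified by ``key``."""


class InMemoryRepositoryBackend(AbstractRepositoryBackend):
    """Toy backend that keeps everything in a dict, keyed by the content hash."""

    def __init__(self):
        self._objects = {}

    def put_object_from_filelike(self, handle: t.BinaryIO) -> str:
        content = handle.read()
        key = hashlib.sha256(content).hexdigest()
        self._objects[key] = content
        return key

    def get_object_content(self, key: str) -> bytes:
        return self._objects[key]


backend = InMemoryRepositoryBackend()
key = backend.put_object_from_filelike(io.BytesIO(b'file content'))
assert backend.get_object_content(key) == b'file content'
```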

This also means that the internal virtual hierarchy of a `Repository`
instance does not necessarily represent all the files that are stored by
the repository backend: the repository may expose only a subset of the
file objects stored in the backend. This is why object deletion is
implemented as a soft delete by default, where the files are merely
removed from the internal virtual hierarchy but not from the actual
backend, because those objects may still be referenced by other
instances.

The interface is implemented for the `disk-objectstore` which is an
efficient key-value store on the local file system. This commit also
provides a sandbox implementation which represents a temporary folder on
the local file system.

Since all nodes will now have a full representation of the virtual file
hierarchy of their repository content, stored in the database in the
`repository_metadata` column, it is crucial that the format of that
metadata is optimized to use as little data as possible, to prevent the
database from bloating unnecessarily.

Currently, we only define two file types:

 * Directories: which can contain an arbitrary number of sub objects
 * Files: which cannot contain objects, but instead have a unique hashkey

This means that we can serialize the metadata without encoding the
explicit type: if an entry contains a key, it is necessarily a file; if
it does not, it has to be a directory.

The serialization format can therefore make do with simple nested
dictionaries, where each entry contains either the key `k`, indicating
it is a file, or `o`, itself a dictionary containing the objects within
that directory. Note that we can even safely omit the `o` key for an
empty directory: since there is no `k`, we know it has to be a directory
anyway.
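
As an illustration of this scheme (the hash key is a placeholder), a repository containing a file `input.txt` and an empty directory `logs` could serialize to something like the following, with a small helper that walks the sparse hierarchy back into file paths:

```python
# Hypothetical serialized `repository_metadata`: directories carry `o` (their
# objects), files carry `k` (the hash key); empty directories omit `o`.
serialized = {
    'o': {
        'input.txt': {'k': '2a4b...hashkey...'},
        'logs': {},  # no `k` and no `o`: an empty directory
    }
}

def iter_files(entry, prefix=''):
    """Yield (path, hashkey) pairs by walking the sparse serialized hierarchy."""
    for name, obj in entry.get('o', {}).items():
        path = f'{prefix}{name}'
        if 'k' in obj:
            yield path, obj['k']
        else:
            yield from iter_files(obj, prefix=f'{path}/')

print(list(iter_files(serialized)))  # [('input.txt', '2a4b...hashkey...')]
```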

This logic is implemented in `FileObject.from_serialized`, which can
fully reconstruct all instance properties by inferring them from the
sparse serialized data.

This code was put in place because a naive backup of the file repository
was already impossible for reasonably sized projects. The problem was
the underlying design, where each node had an associated folder on disk,
even if it contained no files, and all files were stored separately,
creating a huge number of files. This meant that just rsync'ing the
folder could take days, even just to calculate the difference with the
backed-up folder.

The ad-hoc solution was this script, which would only transfer the
folders of the nodes added since the last backup date, the list of which
was determined by a simple query. However, now that the new file
repository is implemented, which has been explicitly designed to be
easily backed up by tools like `rsync`, this custom solution is no
longer needed.

The new `Repository` implementation is integrated by wrapping it in a
new class `NodeRepository` that is mixed into the `Node` class. This
approach serves multiple purposes.

 1. The class serves as a backwards compatibility layer between the old
    repository interface, which allowed clients to set the mode for file
    handles when reading as well as to pass in text-like streams, and the
    new interface, which operates exclusively with bytes. The layer will
    decode the bytes when the caller expects normal strings, in order not
    to break the old interface.

 2. It makes it possible to implement the concept of immutability, which
    belongs to the `Node` and is completely unknown to the `Repository`
    class.

 3. It nicely separates the file repository functionality into a separate
    file, preventing the already huge `Node` implementation from getting
    even larger.

The downside of the latter is that the `NodeRepository` class contains
some potentially confusing logic: it accesses `Node` properties, but
from the code it is not clear where they come from. Since this class is
only meant to be used as a mixin for the `Node` class, I think that is a
sacrifice worth the benefits.
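
A stripped-down sketch of the mixin pattern being described, with hypothetical method names and a plain dictionary standing in for the actual `Repository`:

```python
class NodeRepositoryMixin:
    """Sketch: repository methods that rely on attributes defined elsewhere."""

    def get_object_content(self, path, mode='r'):
        # `self._repository` and `self.is_stored` are not defined on this class:
        # they are expected to be provided by the class the mixin is combined
        # with, which is the potentially confusing part mentioned above.
        content = self._repository[path]
        if mode == 'r':
            return content.decode('utf-8')  # backwards compatibility: return str, not bytes
        return content

    def put_object_from_bytes(self, content, path):
        if self.is_stored:
            raise RuntimeError('the node is stored and therefore its repository is immutable')
        self._repository[path] = content


class Node(NodeRepositoryMixin):
    """Sketch: the concrete class supplies the attributes the mixin expects."""

    def __init__(self):
        self._repository = {}  # stand-in for a real `Repository` instance
        self.is_stored = False


node = Node()
node.put_object_from_bytes(b'hello', 'greeting.txt')
print(node.get_object_content('greeting.txt'))        # 'hello' (str)
print(node.get_object_content('greeting.txt', 'rb'))  # b'hello' (bytes)
```
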
This ensures that the repository is still mutable for sealable nodes as
long as they are not yet sealed. After sealing, the repository of even
these nodes becomes fully immutable.

Note that the implementation completely overrides the methods of the
`NodeRepositoryMixin`, because it needs to bypass the check in those
methods that raises if the node is stored. An alternative would have
been to call `super` from the `Sealable`, which, due to the MRO, would
in fact call the underlying method on the `NodeRepositoryMixin`, but
that method would then need to accept an argument, for example `force`,
to skip the mutability check. This is not desirable, however, since
those repository methods are public methods of the `Node` interface, and
so any user would be able to disable the mutability check for any node.

This solution does not suffer from that vulnerability, but the downside
is that the implementation of the method from the `NodeRepositoryMixin`
needs to be copied and kept in sync.
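
A small illustration of the MRO situation described above, with simplified stand-ins for the real classes: calling `super()` from the `Sealable` override would dispatch to the `NodeRepositoryMixin` method, which would then raise for any stored node, sealed or not.

```python
class NodeRepositoryMixin:
    """Sketch: raises on mutation as soon as the node is stored."""

    def delete_object(self, path):
        if self.is_stored:
            raise RuntimeError('the node is stored and therefore immutable')
        print(f'deleting {path}')


class Sealable:
    """Sketch: overrides the check so stored-but-unsealed nodes stay mutable."""

    def delete_object(self, path):
        if self.is_sealed:
            raise RuntimeError('the node is sealed and therefore immutable')
        # Calling `super().delete_object(path)` here would dispatch to
        # `NodeRepositoryMixin.delete_object` (next in the MRO) and raise for a
        # stored node, which is why the body is duplicated in the real code.
        print(f'deleting {path}')


class CalcJobNode(Sealable, NodeRepositoryMixin):
    is_stored = True
    is_sealed = False


CalcJobNode().delete_object('input.txt')  # resolves to Sealable.delete_object
```
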
This section describes the new file repository design in detail,
including the design guidelines that were kept in mind and the
requirements it was designed to respect. It also gives a detailed
overview of how the implementation is integrated into the existing code
base.
The migration of an existing legacy file repository consists of:

 1. Make an inventory of all files in the current file repository.
 2. Write these files to the new backend (the disk object store).
 3. Store the metadata of a node's repository contents, which contains
    the virtual hierarchy, in the corresponding `repository_metadata`
    column in the database.

The repository metadata will contain a hashkey for each file that was
written to the disk object store; the hashkey is generated by the latter
and can be used to retrieve the content of the file.
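
Schematically, and per node, the migration amounts to something like the following sketch (helper and variable names are illustrative; only `add_object(bytes) -> hashkey` is assumed from the `disk-objectstore` container, and the real migration runs inside the database migration machinery):

```python
import os

def migrate_node_repository(node_folder, container):
    """Walk a legacy node folder, write every file to the object store backend
    and build the corresponding `repository_metadata` entry.

    `container` is assumed to expose `add_object(bytes) -> hashkey`; everything
    else here is illustrative.
    """
    metadata = {'o': {}}
    for root, _, filenames in os.walk(node_folder):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            with open(filepath, 'rb') as handle:
                hashkey = container.add_object(handle.read())
            # Insert the file into the virtual hierarchy, creating parent
            # directory entries as needed.
            parts = os.path.relpath(filepath, node_folder).split(os.sep)
            entry = metadata
            for part in parts[:-1]:
                entry = entry['o'].setdefault(part, {'o': {}})
            entry['o'][parts[-1]] = {'k': hashkey}
    return metadata
```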

The migration is performed in a single database migration, meaning that
everything is executed in a single transaction: either the entire
migration succeeds or, in case of an error, it is a no-op. This is why
the migration does not delete the legacy file repository in the same
migration.

The advantage of this approach is that there is no risk of data loss
and/or corruption: if the migration fails, all original data will be
intact. The downside, however, is that the content of the file
repository is more or less duplicated. This means that the migration
will require a potentially significant amount of disk space, at worst
equal to the size of the existing file repository. This should be the
upper limit, since the disk object store will both deduplicate and
compress the content that is written.

Use the `Container.export` method in export/import functionality

The new repository implementation, using the `disk-objectstore`
underneath, now provides a UUID for each repository. Currently, each
database can only be configured to work with a single repository. By
writing the UUID of the repository into the database, it will become
possible to enable a consistency check of the repository and database
that are configured for a particular profile. This will prevent
accidental misconfigurations where the wrong repository is coupled to a
particular database.

The repository UUID is generated automatically by the `disk-objectstore`
library when the container is created, and the library provides an
interface to retrieve it. This value is written to the database by the
`verdi setup` command when a new profile is created. If the database
already has the repository UUID setting defined, it is cross-referenced
with that of the repository to make sure they are compatible. This case
is to facilitate the creation of a new profile for an existing
repository and database. However, if the UUIDs do not match, the setup
command fails.

For databases that were created before this commit, the database
migration that performed the migration of the legacy file repository
includes a statement that inserts the UUID of the new object store
container once it has been created.

Since the migration to the new repository implementation, each file
repository has its own UUID. This UUID is now written to the settings
table of the database that is associated with it. This makes it possible
to check that the repository and database that are configured for a
profile are compatible.

The check is added to the `Manager._load_backend` method, as this is the
central point where the database is loaded for the currently loaded
profile. The check has to be placed here because, in order to retrieve
the repository UUID from the database, the corresponding backend needs
to be loaded first.

If the UUID of the repository and the one stored in the database are
found to differ, a warning is emitted instructing the user to make sure
the repository and database are correctly configured. Since this is new
functionality and its stability is not yet known, a warning is chosen
instead of an exception, in order to prevent AiiDA from becoming
unusable in the case of an unintentional bug in the compatibility check.
In the future, when the check has been proven to be stable and reliable,
the warning can be changed into an exception.
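
In pseudo-code, the check amounts to something like this sketch (how the two UUIDs are obtained from the settings table and from the container is deliberately left out):

```python
import warnings

def check_repository_uuid(uuid_database, uuid_repository):
    """Warn (rather than raise) when the two UUIDs disagree.

    `uuid_database` comes from the settings table of the database and
    `uuid_repository` from the configured container; both are plain strings here.
    """
    if uuid_database is not None and uuid_database != uuid_repository:
        warnings.warn(
            'the database and file repository configured for this profile do not '
            'seem to belong together; please check the profile configuration.'
        )

check_repository_uuid('aaaa-bbbb', 'aaaa-bbbb')  # silent
check_repository_uuid('aaaa-bbbb', 'cccc-dddd')  # emits a UserWarning
```
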
The `from disk_objectstore import Container` import has a significant
loading time. The `aiida.manage.configuration.profile` module imported
it at the top level, and since this module is loaded when `verdi` is
called, the load time of `verdi` was significantly increased. This would
have a detrimental effect on the tab completion speed.

The workaround is to import the module only within the methods that use
it. Ideally, the loading of this library would not be so costly.
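
The pattern is simply a deferred import; the class and method below are illustrative, and only the `disk_objectstore` import itself is the real one:

```python
class Profile:
    """Illustration of the workaround; the class and method here are simplified."""

    def __init__(self, repository_path):
        self._repository_path = repository_path

    def get_repository_container(self):
        # Importing here keeps `verdi` start-up (and thus tab completion) fast,
        # because the costly `disk_objectstore` import only happens when the
        # container is actually needed.
        from disk_objectstore import Container
        return Container(self._repository_path)
```
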
@sphuber sphuber force-pushed the fix/3445/disk-object-store-repository branch from c3eb2d2 to 0e250e6 Compare April 28, 2021 05:53
@chrisjsewell
Member

chrisjsewell commented Apr 28, 2021

Cheers @sphuber
Can you confirm that anything outstanding (bugs/improvements) has been captured in an issue (or issues)?
Or actually, it may be good anyway to open a "meta" discussion outlining the good, the bad and the ugly of the new repository system.
A few things that come to mind:

  • If I remember correctly, the user is now responsible for periodically running container packing
  • There is no longer an "easy" way to look/search through stored files (i.e. you now have to go through the API/CLI if you want to look at simulation raw outputs)
  • the documentation for repository backup has been replaced with a todo
  • the archive format (+ migration) is not finalised

@sphuber
Contributor Author

sphuber commented Apr 28, 2021

As far as I am aware there are no outstanding known bugs with the implementation. There are quite a few features that we still want to add before the v2.0 release though, which are part of the relevant milestone:

#4321
#4601
#695

There might be other things that do not already have an issue, but I would suggest just adding them to the v2.0 milestone, unless it is a non-pressing feature that can be released later, of course.

I noticed someone created the milestone "Preparations for the new repository", but the issues don't seem to be related to the new repository whatsoever. If the bugs need to be fixed for 2.0, I would simply move them to that milestone and close this one.

@chrisjsewell
Member

but the issues don't seem to be related to the new repository whatsoever.

yeh I think @ramirezfranciscof changed the name from the 1.6.x milestone, but they can all go in 2.0

@sphuber sphuber mentioned this pull request Apr 28, 2021
@sphuber
Contributor Author

sphuber commented Apr 28, 2021

Just so we are clear: this branch should be merged with a normal merge commit, not rebase-merged or squashed.

@sphuber
Contributor Author

sphuber commented Apr 28, 2021

yeh I think @ramirezfranciscof changed the name from the 1.6.x milestone, but they can all go in 2.0

I moved them all there and closed that milestone.

@chrisjsewell
Member

chrisjsewell commented Apr 28, 2021

Just so we are clear: this branch should be merged with a normal merge commit, not rebase-merged or squashed.

See now you're just daring me to squash
