
👌 IMPROVE: Add AbstractRepositoryBackend.iter_object_streams #5168

Merged
5 commits merged into aiidateam:develop from add-iter-object-streams on Oct 19, 2021

Conversation

Member

@chrisjsewell chrisjsewell commented Oct 8, 2021

This is essentially an addition to #5156 and required for #5145

Without the "optimised" use of `Container.get_objects_stream_and_meta` for the `DiskObjectStoreRepositoryBackend`, the profiled archive creation in #5145 (comment) goes from 4 minutes to 9 minutes!

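For context, a minimal sketch of what such an optimised override could look like on the `DiskObjectStoreRepositoryBackend` (this assumes the backend exposes the disk-objectstore `Container` as `self.container` and that `Container.get_objects_stream_and_meta` acts as a context manager yielding `(key, stream, meta)` triplets; it is an illustration, not necessarily the exact code merged here):

```python
from typing import BinaryIO, Iterable, Iterator, Tuple


def iter_object_streams(self, keys: Iterable[str]) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield ``(key, stream)`` pairs, opening all requested objects in a single pass over the container.

    Sketch only: ``self.container`` and the exact ``get_objects_stream_and_meta`` semantics are assumptions.
    """
    with self.container.get_objects_stream_and_meta(keys) as triplets:
        for key, stream, _ in triplets:
            yield key, stream
```

Opening the objects in one call lets the container batch its reads, which is presumably where the reported 4-minute versus 9-minute difference comes from.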

codecov bot commented Oct 8, 2021

Codecov Report

Merging #5168 (672bbfb) into develop (bd91bdb) will increase coverage by 0.01%.
The diff coverage is 93.75%.


@@             Coverage Diff             @@
##           develop    #5168      +/-   ##
===========================================
+ Coverage    80.96%   80.97%   +0.01%     
===========================================
  Files          535      535              
  Lines        37188    37201      +13     
===========================================
+ Hits         30107    30119      +12     
- Misses        7081     7082       +1     
Flag         Coverage Δ
django       75.75% <93.75%> (+0.03%) ⬆️
sqlalchemy   74.87% <93.75%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                  Coverage Δ
aiida/backends/general/migrations/utils.py      86.02% <50.00%> (-0.37%) ⬇️
aiida/repository/backend/abstract.py            98.47% <100.00%> (+0.05%) ⬆️
aiida/repository/backend/disk_object_store.py   96.78% <100.00%> (+0.29%) ⬆️
aiida/repository/backend/sandbox.py            100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd91bdb...672bbfb.

Member

@ramirezfranciscof ramirezfranciscof left a comment

Hey @chrisjsewell, great work advancing the cause of batch operations! Haha. I only have a few comments to go over, but this should be ready to go in no problem.

Comment on lines 155 to 157
for key in keys:
    with self.open(key) as handle:  # pylint: disable=not-context-manager
        yield key, handle
Member

I noticed the two implementations of this method have slightly different behaviors. The abstract.iter_object_streams will get each handle in its iteration and close each file when going to the next element, while disk_object_store.iter_object_streams will get all handles at the start and close all files only at the end. This means that, for example, one could use the second to do a "pass of content" from one file to the next, but not the first, right? (weird case, but just to illustrate).

Since the external iteration is something the caller could do on their own, I would personally prefer to implement the version with the internal iteration, just to offer a different path and let the caller decide what is best (one never knows for which unanticipated use case it could matter). However, it is true that this case is very subtle, so I am also ok if you prefer to leave it like this.

In any case, what I do think could be more important is to document this behavior a bit more in the docstring: that the streams will be closed after the iteration finishes, so you can't use this to put them in a list, for example, and (if we go that way) that this function may offer the possibility of upfront opening where better performance would be expected, but this is not guaranteed for all cases (and that each handle is only guaranteed to be available in its own iteration step).
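To make the closing behaviour concrete, a hypothetical caller-side sketch (`backend` and `keys` are assumed names, not code from this PR):

```python
# Each stream is only guaranteed to be readable within its own iteration step.
for key, stream in backend.iter_object_streams(keys):
    data = stream.read()  # fine: the stream is consumed inside the loop body

# Collecting the handles for later use does not work: once the iteration has
# finished, the underlying files have already been closed.
stale_handles = list(backend.iter_object_streams(keys))
```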

Member Author

@chrisjsewell chrisjsewell Oct 8, 2021

Heya, yeah I guess I can just document the behaviour more, but yes, each stream should only be read in its own iteration.

This means that, for example, one could use the second to do a "pass of content" from one file to the next, but not the first, right?

They are read-only handles, so this would not be the case.

Member Author

updated the docs

Comment on lines +101 to +107
def test_iter_object_streams(repository):
    """Test the ``Repository.iter_object_streams`` method."""
    key = repository.put_object_from_filelike(io.BytesIO(b'content'))

    for _key, stream in repository.iter_object_streams([key]):
        assert _key == key
        assert stream.read() == b'content'
Member

I find it curious that you decided to implement a default behavior for the abstract repository backend to override instead of leaving it as an abstract method. I guess it makes sense if having the open method already implemented allows for a fallback implementation, but I'd be interested to know if there were other aspects that you considered here (even considerations in the other direction that you weighed as less important), or was it just because of that?

Anyways, the reason I comment here is that, given that design choice, it would make more sense to put this test in tests/repository/backend/test_abstract.py, no? There is already a mock repository there, so it shouldn't require much adaptation. Also, we could have 2 or 3 files to iterate over, so as to be sure we are covering more of the functionality of the iterator.
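A possible sketch of such an extended test, reusing the fixture and method names from the snippet above (hypothetical, not part of this PR):

```python
import io


def test_iter_object_streams_multiple(repository):
    """Iterate over several objects and check each stream against its own key."""
    contents = [b'content-1', b'content-2', b'content-3']
    keys = [repository.put_object_from_filelike(io.BytesIO(content)) for content in contents]
    expected = dict(zip(keys, contents))

    for key, stream in repository.iter_object_streams(keys):
        assert stream.read() == expected[key]
```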

Member Author

or was it just because of that?

Yes, just the reason that it is already possible to have a default implementation, so there is no need to duplicate this code across implementations.

Member Author

There is already a mock repository there so it shouldn't require much adaptation.

The abstract class, and thus the mock object, does not implement an open method, and so this does not work. In fact the abstract class is faulty in this respect, in that it should make this method abstract; plus, implementations should likely return a context manager rather than an (already within the context) iterator.
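For illustration, a context-manager variant along those lines could look roughly like this (a sketch of the idea only, written as a method on a backend class and using `contextlib.ExitStack`; it is not the interface adopted in this PR):

```python
import contextlib


@contextlib.contextmanager
def iter_object_streams(self, keys):
    """Hypothetical variant that returns a context manager rather than a bare iterator.

    Handles opened during the iteration stay valid until the caller leaves the ``with``
    block, at which point the ``ExitStack`` closes them all.
    """
    with contextlib.ExitStack() as stack:
        yield ((key, stack.enter_context(self.open(key))) for key in keys)
```

The caller would then write `with backend.iter_object_streams(keys) as pairs:` and the lifetime of the handles becomes explicit in the calling code.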

Member

The abstract class, and thus the mock object, does not implement an open method, and so this does not work.

I mean, I didn't say it would require no adaptation 😅, but it should just be a matter of implementing that method and adjusting one or two more things, right? The change would make the test more findable (if I want to identify where we test that implemented logic, I would look in the tests for the abstract), more informative (a failing test ensures the problem is in the abstract class and that there is no contamination from the sandbox) and more robust (if for some reason we make changes to or even remove the sandbox, and thus its tests, we are no longer testing this default implementation).

In fact the abstract class is faulty in this respect, in that it should make this method abstract; plus, implementations should likely return a context manager rather than an (already within the context) iterator.

Yeah, this is a good point! We should probably change that.

Member Author

Ok, I've removed the code from the abstract class, made the method an abstractmethod, and moved the code to the SandboxRepositoryBackend.

Honestly, there shouldn't really be any testing against the AbstractRepositoryBackend, because it is just that, an abstract class; it should merely define the public interface of the class. Yeah, you can sneak in some "shared" code to reduce duplication, but from a "purity" point of view, using Protocols is the way forward: https://www.python.org/dev/peps/pep-0544/
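As a rough illustration of that Protocol-based direction (PEP 544), hypothetical and not part of this PR:

```python
from typing import BinaryIO, Dict, Iterable, Iterator, Protocol, Tuple  # typing.Protocol needs Python 3.8+


class ObjectStreamProvider(Protocol):
    """Anything that can stream repository objects by key satisfies this interface structurally."""

    def iter_object_streams(self, keys: Iterable[str]) -> Iterator[Tuple[str, BinaryIO]]:
        ...


def read_objects(source: ObjectStreamProvider, keys: Iterable[str]) -> Dict[str, bytes]:
    """Example consumer: any backend matching the protocol can be passed, no inheritance required."""
    return {key: stream.read() for key, stream in source.iter_object_streams(keys)}
```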

@chrisjsewell chrisjsewell merged commit 0de3f21 into aiidateam:develop Oct 19, 2021
@chrisjsewell chrisjsewell deleted the add-iter-object-streams branch October 19, 2021 11:51
edan-bainglass pushed a commit to edan-bainglass/aiida-core that referenced this pull request Dec 23, 2023
…team#5168)

This allows for optimised implementations of stream iterations.
In particular, `DiskObjectStoreRepositoryBackend` can implement
a method which is over 2x as fast as iterating with `open`.