
👌 IMPROVE: Add AbstractRepositoryBackend.iter_object_streams #5168

Merged
5 commits merged into aiidateam:develop from add-iter-object-streams on Oct 19, 2021

Conversation

Member

@chrisjsewell chrisjsewell commented Oct 8, 2021

This is essentially an addition to #5156 and required for #5145

Without the "optimised" use of `Container.get_objects_stream_and_meta` for the `DiskObjectStoreRepositoryBackend`, the profiled archive creation in #5145 (comment) goes from 4 minutes to 9 minutes!

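For context, a minimal sketch of what such an optimised override could look like on the `DiskObjectStoreRepositoryBackend` (this assumes the backend exposes the disk-objectstore `Container` as `self.container` and that `Container.get_objects_stream_and_meta` acts as a context manager yielding `(key, stream, meta)` triplets; it is an illustration, not necessarily the exact code merged here):

```python
from typing import BinaryIO, Iterable, Iterator, Tuple


def iter_object_streams(self, keys: Iterable[str]) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield ``(key, stream)`` pairs, opening all requested objects in a single pass over the container.

    Sketch only: ``self.container`` and the exact ``get_objects_stream_and_meta`` semantics are assumptions.
    """
    with self.container.get_objects_stream_and_meta(keys) as triplets:
        for key, stream, _ in triplets:
            yield key, stream
```

Opening the objects in one call lets the container batch its reads, which is presumably where the reported 4-minute versus 9-minute difference comes from.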

codecov bot commented Oct 8, 2021

Codecov Report

Merging #5168 (672bbfb) into develop (bd91bdb) will increase coverage by 0.01%.
The diff coverage is 93.75%.


@@             Coverage Diff             @@
##           develop    #5168      +/-   ##
===========================================
+ Coverage    80.96%   80.97%   +0.01%     
===========================================
  Files          535      535              
  Lines        37188    37201      +13     
===========================================
+ Hits         30107    30119      +12     
- Misses        7081     7082       +1     
Flag         Coverage Δ
django       75.75% <93.75%> (+0.03%) ⬆️
sqlalchemy   74.87% <93.75%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                  Coverage Δ
aiida/backends/general/migrations/utils.py      86.02% <50.00%> (-0.37%) ⬇️
aiida/repository/backend/abstract.py            98.47% <100.00%> (+0.05%) ⬆️
aiida/repository/backend/disk_object_store.py   96.78% <100.00%> (+0.29%) ⬆️
aiida/repository/backend/sandbox.py            100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd91bdb...672bbfb.

Member

@ramirezfranciscof ramirezfranciscof left a comment

Hey @chrisjsewell, great work advancing the cause of batch operations! Haha. I only have a few comments to go over, but this should be ready to go in no problem.

Comment on lines 155 to 157
for key in keys:
    with self.open(key) as handle:  # pylint: disable=not-context-manager
        yield key, handle
Member

I noticed the two implementations of this method have slightly different behaviors. The abstract.iter_object_streams will get each handle in its iteration and close each file when going to the next element, while disk_object_store.iter_object_streams will get all handles at the start and close all files only at the end. This means that, for example, one could use the second to do a "pass of content" from one file to the next, but not the first, right? (weird case, but just to illustrate).

Since the external iteration is something the caller could do on their own, I would personally prefer to implement the version with the internal iteration, just to offer a different path and let the caller decide what is best (one never knows for which unanticipated use case it could matter). However, it is true that this case is very subtle, so I am also ok if you prefer to leave it like this.

In any case, what I do think could be more important is to document this behavior a bit more in the docstring: that the streams will be closed after the iteration finishes, so you can't use this to put them in a list, for example, and (if we go that way) that this function may offer the possibility of upfront opening where better performance would be expected, but this is not guaranteed for all cases (and that each handle is only guaranteed to be available in its own iteration step).
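To make the closing behaviour concrete, a hypothetical caller-side sketch (`backend` and `keys` are assumed names, not code from this PR):

```python
# Each stream is only guaranteed to be readable within its own iteration step.
for key, stream in backend.iter_object_streams(keys):
    data = stream.read()  # fine: the stream is consumed inside the loop body

# Collecting the handles for later use does not work: once the iteration has
# finished, the underlying files have already been closed.
stale_handles = list(backend.iter_object_streams(keys))
```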

Member Author

@chrisjsewell chrisjsewell Oct 8, 2021

Heya, yeah I guess I can just document the behaviour more, but yes, each stream should only be read in its own iteration.

This means that, for example, one could use the second to do a "pass of content" from one file to the next, but not the first, right?

They are read-only handles, so this would not be the case.

Member Author

updated the docs

Comment on lines +101 to +107
def test_iter_object_streams(repository):
    """Test the ``Repository.iter_object_streams`` method."""
    key = repository.put_object_from_filelike(io.BytesIO(b'content'))

    for _key, stream in repository.iter_object_streams([key]):
        assert _key == key
        assert stream.read() == b'content'
Member

I find it curious that you decided to implement a default behavior for the abstract repository backend to override instead of leaving it as an abstract method. I guess it makes sense if having the open method already implemented allows for a fallback implementation, but I'd be interested to know if there were other aspects that you considered here (even considerations in the other direction that you weighed as less important), or was it just because of that?

Anyways, the reason I comment here is that, given that design choice, it would make more sense to put this test in tests/repository/backend/test_abstract.py, no? There is already a mock repository there, so it shouldn't require much adaptation. Also, we could have 2 or 3 files to iterate over, so as to be sure we are covering more of the functionality of the iterator.
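A possible sketch of such an extended test, reusing the fixture and method names from the snippet above (hypothetical, not part of this PR):

```python
import io


def test_iter_object_streams_multiple(repository):
    """Iterate over several objects and check each stream against its own key."""
    contents = [b'content-1', b'content-2', b'content-3']
    keys = [repository.put_object_from_filelike(io.BytesIO(content)) for content in contents]
    expected = dict(zip(keys, contents))

    for key, stream in repository.iter_object_streams(keys):
        assert stream.read() == expected[key]
```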

Member Author

or was it just because of that?

Yes, just the reason that it is already possible to have a default implementation, so there is no need to duplicate this code across implementations.

Member Author

There is already a mock repository there so it shouldn't require much adaptation.

The abstract class, and thus the mock object, does not implement an open method, and so this does not work. In fact the abstract class is faulty in this respect, in that it should make this method abstract; plus, implementations should likely return a context manager rather than an (already within the context) iterator.
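For illustration, a context-manager variant along those lines could look roughly like this (a sketch of the idea only, written as a method on a backend class and using `contextlib.ExitStack`; it is not the interface adopted in this PR):

```python
import contextlib


@contextlib.contextmanager
def iter_object_streams(self, keys):
    """Hypothetical variant that returns a context manager rather than a bare iterator.

    Handles opened during the iteration stay valid until the caller leaves the ``with``
    block, at which point the ``ExitStack`` closes them all.
    """
    with contextlib.ExitStack() as stack:
        yield ((key, stack.enter_context(self.open(key))) for key in keys)
```

The caller would then write `with backend.iter_object_streams(keys) as pairs:` and the lifetime of the handles becomes explicit in the calling code.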

Member

The abstract class, and thus the mock object, does not implement an open method, and so this does not work.

I mean, I didn't say it would require no adaptation 😅, but it should just be a matter of implementing that method and adjusting one or two more things, right? The change would make the test more findable (if I want to identify where we test that implemented logic, I would look in the tests for the abstract), more informative (a failing test ensures the problem is in the abstract class and that there is no contamination from the sandbox) and more robust (if for some reason we make changes to or even remove the sandbox, and thus its tests, we are no longer testing this default implementation).

In fact the abstract class is faulty in this respect, in that it should make this method abstract; plus, implementations should likely return a context manager rather than an (already within the context) iterator.

Yeah, this is a good point! We should probably change that.

Member Author

Ok, I've removed the code from the abstract class, made the method an abstractmethod, and moved the code to the SandboxRepositoryBackend.

Honestly, there shouldn't really be any testing against the AbstractRepositoryBackend, because it is just that, an abstract class; it should merely define the public interface of the class. Yeah, you can sneak in some "shared" code to reduce duplication, but from a "purity" point of view, using Protocols is the way forward: https://www.python.org/dev/peps/pep-0544/
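As a rough illustration of that Protocol-based direction (PEP 544), hypothetical and not part of this PR:

```python
from typing import BinaryIO, Dict, Iterable, Iterator, Protocol, Tuple  # typing.Protocol needs Python 3.8+


class ObjectStreamProvider(Protocol):
    """Anything that can stream repository objects by key satisfies this interface structurally."""

    def iter_object_streams(self, keys: Iterable[str]) -> Iterator[Tuple[str, BinaryIO]]:
        ...


def read_objects(source: ObjectStreamProvider, keys: Iterable[str]) -> Dict[str, bytes]:
    """Example consumer: any backend matching the protocol can be passed, no inheritance required."""
    return {key: stream.read() for key, stream in source.iter_object_streams(keys)}
```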

@chrisjsewell chrisjsewell merged commit 0de3f21 into aiidateam:develop Oct 19, 2021
@chrisjsewell chrisjsewell deleted the add-iter-object-streams branch October 19, 2021 11:51
edan-bainglass pushed a commit to edan-bainglass/aiida-core that referenced this pull request Dec 23, 2023
…team#5168)

This allows for optimised implementations of stream iterations.
In particular, `DiskObjectStoreRepositoryBackend` can implement
a method which is over 2x as fast as iterating with `open`.