-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add lazy_load parameter to open() #573
Conversation
This allows loading ASDF files, then closing the file and still retaining full access to the tree objects. Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
Hi there @vmarkovtsev 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog. I help make sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃. Everything looks good from my point of view! 👍 If there are any issues with this message, please report them here. |
Hi @vmarkovtsev thanks for the PR! We explicitly disallowed access to the tree for a closed ASDF file handle here: #407. But it seems like this is intended to apply to the case where you have opened an ASDF file using another open file handle, is that correct? The ASDF file handle itself remains open, but you want to be able to close the original file handle. This PR does seem to fill a gap: you want to retain memory mapped access to a file that existed outside the scope of the ASDF file handle. I assume that using the existing It's probably really none of my business, but what's wrong with keeping the file handle open? My concern here is that ASDF may be enabling resource leakage since the underlying file is not actually entirely closed: the ASDF object now holds potentially many memory mapped handles to the original file. I assume garbage collection can handle this properly, but I just want to make sure we're not introducing a problem. I'm just thinking out loud and trying to make sure I understand the intent of the PR and any side effects it might present. |
So this PR allows to load all the tree directly into memory so that it no longer requires the file backup at all. The reason why we need this is that our arrays are relatively small and always fit into RAM on one side, and it is hard for us to maintain an open file handle throughout the life of the upstream object on the other. We do leverage The team continues to hit the same problem over and over again: we load the asdf tree, pass the arrays from the tree into faraway parts of the codebase, and then everything breaks at random spots because the file is already closed and it is impossible to |
We store machine learning models in ASDF. We love ASDF because it is way more efficient than pickles, way more flexible than hdf5 and is not nailed to Python in theory. |
Is there a reason the
Maybe we should add this to the user-testimonial section of our documentation 😉 |
I can write a long success story about ASDF fwiw 😄 The |
@vmarkovtsev Please do if you have the time. We are in the process of trying to spread its adoption within astronomy and any support would be quite useful in that regard. Thanks very much for the words of encouragement. |
That makes sense. In light of that, this seems like a reasonable change. Just let me review a bit more and then we should be able to move ahead with it. |
asdf/asdf.py
Outdated
When `True` and the read file is seekable, underlying data arrays | ||
will be materialized when they are used for the first time. | ||
This implies that the file should stay open during the lifetime | ||
of the tree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to nitpick, but I find this a little confusing. Instead I might say the following:
"When True
and the underlying file handle is seekable, data arrays will only be loaded lazily: i.e. when they are accessed for the first time. In this case the underlying file must stay open during the lifetime of the tree. Setting to False
causes all data arrays to be loaded up front, which means that they can be accessed even after the underlying file is closed."
I'm happy to make this change myself, BTW, but just want to make sure it makes sense to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So much better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a changelog entry to the 2.2.0
section for this as well?
@drdavella All the review comments have been addressed, waiting for the CI to pass. |
Done |
asdf/asdf.py
Outdated
@@ -90,7 +90,18 @@ def __init__(self, tree=None, uri=None, extensions=None, version=None, | |||
|
|||
copy_arrays : bool, optional | |||
When `False`, when reading files, attempt to memmap underlying data | |||
arrays when possible. | |||
arrays when possible. Please note that the file must be opened | |||
in "rw" mode; "r" will result a SIGSEGV. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but it seems like we should have better error handling for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely. I had to debug segmentation faults otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is really useful to know, but I think this comment opens a different can of worms. Not all use cases involve passing a file handle to asdf.open
, so I think this comment needs to be a little more specific to clarify. Maybe we should leave it for another PR and put it elsewhere in the documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the current behavior:
asdf.open("/path", mode="r") # SIGSEGV
asdf.open("/path", mode="rw") # OK
with open("/path", mode="r") as file:
asdf.open(file) # SIGSEGV
with open("/path", mode="rw") as file:
asdf.open(file) # OK
So the comment is always effective as long as the option itself is effective (that is, no streaming and non-seekable files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am surprised that this has not been reported yet. In my case I always use compressed blocks so never had that SIGSEGV before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yeah, I should have added that the "rw" requirement is only for files with uncompressed arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think I'm confused by this and will need to investigate further. Your example above suggests that opening any file without explicitly given a mode of 'rw'
leads to segfault, which doesn't make sense since this is most common use case, both in the test suite, and in our internal code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with uncompressed arrays, that's the most common use case since it's the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but can we leave this comment out of this PR? I'm tracking it in #574 but I don't want to hold this PR up while I figure out what the issue is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
CHANGES.rst
Outdated
numpy arrays are materialized. Thus it becomes safe to close the file | ||
and continue using ``AsdfFile.tree``. However, `copy_arrays` parameter | ||
is still effective and the active memory maps may still require the file | ||
to stay open in case `copy_arrays` is ``False``. [#573] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for another nit, but we should have double backticks around lazy_load
and copy_arrays
here so they render properly on github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
Thanks @vmarkovtsev! |
This allows loading ASDF files, then closing the file
and still retaining full access to the tree objects.
Signed-off-by: Vadim Markovtsev vadim@sourced.tech