Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ndarray conversion to a Converter #1537

Merged
merged 154 commits into from
Sep 8, 2023

Conversation

braingram
Copy link
Contributor

@braingram braingram commented May 4, 2023

This PR moves ndarray handling to a converter.

This required extensive changes to the internal block API which had several knock-on effects.

Public changes

AsdfConfig.convert_unknown_ndarray_subclasses

Moving ndarray to a converter means that subclasses are no longer automatically converted.

This change would break stdatamodels (where FITS_rec is automatically converted to ndarray) and breaks one test in asdf-astropy (because of NDArrayMixIn).

It seems likely there are other instances 'in the wild' so an AsdfConfig option, convert_uknown_ndarray_subclasses was added and some internal changes were added to allow the new ndarray converter to continue handling subclasses. If an instance of a subclass of ndarray is encountered during serialization which is not explicitly handled by a converter and the convert_unknown_ndarray_subclasses option is enabled (as is by default), this instance will be coerced into an ndarray and a warning will be issued (this warning means that the test in asdf-astropy will continue to fail as warnings are converted to errors).

A note was added to the docs describing that the default setting for this will be converted to False and the option (and special handling) removed in a future version of asdf.

asdf.Stream moved to asdf.tags.core.Stream and asdf.stream deprecated

This is more consistent with other special tag related objects (asdf.tags.core.NDArrayType, asdf.tags.core.IntegerType etc). As an added benefit, the eventual removal of asdf.stream should allow the docs to build for development installs on systems that are case-insensitive (by removing the ambiguous asdf.Stream and asdf.stream).

Changes to Converter and SerializationContext methods for block access

This changes some of the unreleased public API added in #1508

The updates to the block management to allow moving ndarray to a converter meant that some of the complexity added in #1508 could be removed. With the changes in this PR, the SerializationContext is now aware if a serialization or deserialization is being performed and what object is being handled (and can automatically generate and assign keys for most use cases). This means that asdf.util.BlockKey is no longer needed (extension code still needs to generate keys if an object reads from multiple blocks, by generating keys with SerializationContext.generate_block_key asdf can automatically associate these with the correct object). Key-related changes to SerializationContext are:

  • update find_block_index(key, data_callback=None) to find_available_block_index(data_callback, key=None)
  • update get_block_data_callback(index) to get_block_data_callback(index, key=None)
  • removed assign_block_key (all assignment is now automatic)

These changes also mean that Converter.reserve_blocks is no longer needed and was removed.

array settings copied between AsdfFile instances

To fix #1524 and remove an undocumented behavior, creating an AsdfFile with an existing AsdfFile (AsdfFile(af)) no longer attempts to copy block options from the original AsdfFile. This was previously done inconsistently and could result in files that could not be opened (see #1524).

external blocks are no longer memmapped

In working towards a fix for #1525 memmapping for external blocks was disabled. With this PR external blocks are always lazy loaded and never memmapped. If these are sensible settings I propose we document the behavior and close the issue. If instead further changes are needed to allow control over lazy loading, memmapping or read/write mode than a separate PR will be needed.

AsdfFile.update block organization change

Instead of attempting to limit the number of blocks moved/rewritten (which added significant complexity), AsdfFile.update now writes all new blocks to the end of the file then moves the blocks to the end of the new tree and updates associations between data callbacks and blocks to allow objects that lazy load blocks to read the correct block. This new strategy allows for updating with compressed blocks (where the size is unknown until the data is written) and allows for any change in tree size (even if it completely overwrites all previous blocks). This has one major downside where for a very large file it can temporarily almost double in size. However, the final file size after an update with this PR should be equivalent or even smaller (windows support for truncating was added in this PR to allow the final file to be a sensible size without excessive padding). This seems like a reasonable trade-off (exchanging some storage inefficiency for simplicity in code) to fix the numerous errors related to update found during work on this PR.

Private API changes

Removal of block.py and replacement with _block submodules

The Block and BlockManger classes were removed and replaced with a number of new classes and functions. Documentation was added to the new _block submodules and should serve as the primary description of this new code (and should be updated based on questions and comments to the PR).

The changes to the internal block API do mean that sphinx-asdf is no longer compatible with this PR. A PR was opened for sphinx-asdf to make it compatible with the changes in this PR and previous versions of ASDF: asdf-format/sphinx-asdf#64

To allow docs to build for this PR the requirements were temporarily changed to install sphinx-asdf from the source branch for the sphinx-asdf PR linked above. Once the private block API changes in this PR are decided the sphinx-asdf PR can be finalized, merged, released and the requirements updated here (either prior to or after merging but prior to release of asdf 3.0).

Test strucuture

Numerous issues were discovered during this rewrite:

Fixes or work towards fixes for many issues are included in this PR

To make it more obvious which tests correspond to issues, which are unit tests and which are now 'legacy' tests a new test organization was started in this PR.

'legacy' tests were mostly untouched (left in the same location and only updated to reflect changes to any internal API used and changed in this PR).

Tests that correspond to a specific issue are organized in _tests/_issues. Each issue gets it's own file test_1520.py with a function test_1520 with a docstring linking to the github issue and code that closely reflects any minimal test case in the issue.

New unit tests (like those added for _block) are organized in a directory structure that matches the code layout (asdf._block.key tests exists in asdf._tests._block.key). The goal is to have 100% test coverage for a submodule using only the tests for that module. Measurement of this goal is not currently done in the CI (and has to be checked manually) and this PR does not achieve that goal for all changes (especially asdf._block.manager which attempts to mimic the old BlockManager).

@github-actions github-actions bot added this to the 3.0.0 milestone May 4, 2023
@braingram braingram force-pushed the immutable_block_manager branch 5 times, most recently from 0e65d76 to 5467ac0 Compare May 10, 2023 15:14
@braingram braingram marked this pull request as ready for review May 16, 2023 19:17
@braingram braingram requested a review from a team as a code owner May 16, 2023 19:17
@braingram braingram requested review from eslavich and perrygreenfield and removed request for a team May 16, 2023 19:17
@braingram braingram changed the title Immutable block manager Move ndarray conversion to a Converter May 16, 2023
@braingram
Copy link
Contributor Author

Roman regression tests showing no new errors (19 are failing due to unrelated roman issues)
https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-devdeps/detail/Roman-devdeps/286/pipeline/185

@braingram
Copy link
Contributor Author

Running the benchmarks on this PR vs main I see only one major difference. For 'large' files (these have 26 x 26 = 676 tree nodes) both custom_tree_to_tagged_tree and tagged_tree_to_custom_tree are roughly 2x slower with this PR. My initial suspicion is that this has to do with the addition of the Serialization and Deserialization operations that get created, used and discarded for each object. It probably makes more sense to have these created once for each tree operation and re-used/reset for each object.

@braingram
Copy link
Contributor Author

Running the benchmarks on this PR vs main I see only one major difference. For 'large' files (these have 26 x 26 = 676 tree nodes) both custom_tree_to_tagged_tree and tagged_tree_to_custom_tree are roughly 2x slower with this PR. My initial suspicion is that this has to do with the addition of the Serialization and Deserialization operations that get created, used and discarded for each object. It probably makes more sense to have these created once for each tree operation and re-used/reset for each object.

The changes in 52f7bf8 and 77fb0e6 improve the performance of these operations by:

  • indexing blocks to be written by their data for fast finding of shared blocks (40% improvement)
  • create one Serialization/Deserialization per call to custom_tree_to_tagged_tree/tagged_tree_to_custom_tree and do not use it as a context manager (14% improvement)
  • pre-calculate the context version string (to avoid a call to the ctx.version_string property for every node) (4% improvement)
  • cache converter and AsdfType lookups (3% improvement)
    In total this makes custom_tree_to_tagged_tree about 20% faster compared to main and tagged_tree_to_custom_tree is a few percent faster.

The JWST regression tests finished with 74 fewer errors than the previous run which suggests other changes are confounding the result:
https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-devdeps/662/
I ran a follow up using asdf main:
https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-devdeps/663/
that had 2 fewer errors:

  • test_nirspec_missing_msa_nofail
  • test_nirspec_missing_msa_fail
    These will require some investigation. It appears the jwst downstream tests are also now failing although I don't immediately see how the failures are related to the recent changes.

@braingram
Copy link
Contributor Author

Comparing the jwst CI tests to a run on a PR that includes only documentation changes shows the same failures as the run here:
https://github.com/spacetelescope/jwst/actions/runs/4996866055?pr=7589
I think it's safe to say the 8 failing tests in jwst are not related to the changes in this asdf PR.

@braingram
Copy link
Contributor Author

braingram commented May 17, 2023

I've been so far unable to replicate the 2 failed jwst regression tests.
I reran the regression tests and the 2 differing errors went away.
https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-devdeps/detail/JWST-devdeps/664/pipeline/192

@braingram
Copy link
Contributor Author

@eslavich while doing some testing with reading and writing to (a fake) s3. I encountered 2 new issues that are related to this PR.

All of the tests so far have been with a fake s3 so I need to verify that this is the same for real s3.

The first issue is that reading from s3 (using s3fs) produces a file-like object that has no valid fileno. This means that np.fromfile used in GenericFile.read_into_array will fail. This issue also impacts the code in main so I opened a separate issue: #1552

The second issue is that writing to s3 (using s3fs) produces a file-like object that is not seekable but in this case returns valid results for 'tell'. This differs from #1542 as it appears that os.pipe produces a file that is not seekable and calling tell raises an exception. The changes in this PR skipped attempting to write a block index to a non-seekable file but with this new case (for the file-like object produced by s3fs) we can and likely should write a block index. I'm inclined to fix this issue in this PR since it does not impact main. However I don't want this change to negatively impact any in-progress review of this PR.

Would you like me to open a separate issue (and follow-up PR) for the second issue above or update this PR to fix the issue?

@eslavich
Copy link
Contributor

Would you like me to open a separate issue (and follow-up PR) for the second issue above or update this PR to fix the issue?

I didn't read this message until now, and I expect to have the review done by the end of the weekend, so I think it will be fine to update this PR. Kind of you to ask :)

Copy link
Contributor

@eslavich eslavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to finish the review, but I also don't want to make you wait any longer for feedback, so here's the first batch of comments...

from asdf import generic_io, util


class UseInternal:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it matters, but there's precedent for this kind of constant in the Python standard libraries and they seem to implement it more like this:

class NotImplementedType:
    pass
  
NotImplemented = NotImplementedType()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I updated the usage here:
228e022

from asdf import open as asdf_open

with asdf_open(resolved_uri, lazy_load=False, copy_arrays=True) as af:
self._cache[key] = af._blocks.blocks[0].cached_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'll be! Standard says:

The ability to reference block data in an external ASDF file is intentionally limited to the first block in the external ASDF file...

(no problem here, just documenting my voyage of discovery)


class ExternalBlockCache:
def __init__(self):
self._cache = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an LRU cache? Since we're forcing arrays to be copied into memory, a large exploded file might lead to problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure about this one.

The old code called AsdfFile.open_external which caches every opened AsdfFile (in a dictionary). This means that a repeat call to open_external (and then accessing the first block) always returns the same array as Block caches the data). This behavior is relied upon for matching arrays that share an external block as tested in test_explode_then_implode:

def test_explode_then_implode(tmpdir):
x = np.arange(0, 10, dtype=float)
tree = {
"science_data": x,
"subset": x[3:-3],
"skipping": x[::2],
"not_shared": np.arange(10, 0, -1, dtype=np.uint8),
}
path = os.path.join(str(tmpdir), "original.asdf")
ff = AsdfFile(tree)
# Since we're testing with small arrays, force all arrays to be stored
# in internal blocks rather than letting some of them be automatically put
# inline.
ff.write_to(path, all_array_storage="internal")
with asdf.open(path) as af:
assert len(af._blocks._internal_blocks) == 2
result = main.main_from_args(["explode", path])
assert result == 0
files = get_file_sizes(str(tmpdir))
assert "original.asdf" in files
assert "original_exploded.asdf" in files
assert "original_exploded0000.asdf" in files
assert "original_exploded0001.asdf" in files
assert "original_exploded0002.asdf" not in files
assert files["original.asdf"] > files["original_exploded.asdf"]
path = os.path.join(str(tmpdir), "original_exploded.asdf")
result = main.main_from_args(["implode", path])
assert result == 0
with asdf.open(str(tmpdir.join("original_exploded_all.asdf"))) as af:
assert_tree_match(af.tree, tree)
assert len(af._blocks) == 2

Changing ExternalBlockCache to LRU would mean that when the cache overflows, a call to load would return an array with a different id which could interfere with settings assignment and block sharing.

I'm inclined to view the memory mapping of external blocks as a bug (and fix it in this PR by not memory mapping) but that does create an issue for large exploded files as you pointed out. As ExternalBlockCache is used by the Manger that is aware of if the blocks should be lazy loaded and/or memory mapped another option would be to see how complicated it would be to pass these options to ExternalBlockCache.load and allow the user some control over how the external blocks are loaded. What do you think about this as an option for addressing the concern about large exploded files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent a bit of time looking at this today and I think there may be a few issues with passing along open options.

lazy_load is not an issue (and is currently supported in this PR as the ndarray converter sets up for lazy loading the external block then forces it's loading if lazy_load was disabled).

validate_checksums also should not be an issue but is not currently being passed on for this PR.

copy_arrays is where things get complicated. The main complication is when the external file is closed. For a non-memmapped external block, the external file can be opened, the block data read and the file closed before the block data is returned. This is not true for memmapped blocks where the external file will need to be held open until the AsdfFile instance is closed. This was one of the reasons why the previous BlockManager held a reference to the AsdfFile (so that it could used AsdfFile.open_external which can keep the external files open and close them when the AsdfFile is closed).

I could see a few options to make memory mapping external blocks possible but I'm not sure if the complexity outweighs the benefits.

  1. Duplicate the AsdfFile.open_external functionality in _block.Manager. asdf prior to this PR called close on the block manager and we will need to re-add that call to allow the block manager to close the external files.
  2. Allow for generating pure numpy.memmap objects (which don't guarantee file closing) for external blocks.

I'm going to think about this one some more but any suggestions on how to handle this are appreciated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried 2 above (allow generating a pure numpy.memmap) and it came together pretty cleanly. See:
a488690

Note that this does not support file open mode (so memory maps are always read only). This open mode is currently not passed down to the block manager so would require more work to get it to the numpy.memmap call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me thinking about other structures that might need to be cleaned up on AsdfFile.close. It seems like it is safe to clear (set to None) all cached_data for internal blocks. I made that change here:
d4498e4

asdf/_block/external.py Show resolved Hide resolved
asdf/_block/key.py Show resolved Hide resolved
asdf/_block/manager.py Outdated Show resolved Hide resolved
asdf/_block/io.py Outdated Show resolved Hide resolved
asdf/_block/io.py Outdated Show resolved Hide resolved
asdf/_block/io.py Outdated Show resolved Hide resolved
asdf/_block/io.py Outdated Show resolved Hide resolved
@braingram braingram force-pushed the immutable_block_manager branch 2 times, most recently from f928524 to b65023a Compare May 31, 2023 18:28
@braingram
Copy link
Contributor Author

Thanks for the detailed review @eslavich :)
I believe I've addressed all of your comments. The one outstanding issue is what to do about external blocks (see: #1537 (comment)).

@braingram braingram requested a review from eslavich June 1, 2023 13:45
Copy link
Contributor

@eslavich eslavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@braingram
Copy link
Contributor Author

I updated and release sphinx-asdf and updated pyproject.toml (setting a lower pin and removing the dev requirement) so now the docs build with the pypi version of sphinx asdf.

I've opened PR for:

@braingram braingram merged commit 1c6e5c1 into asdf-format:main Sep 8, 2023
48 of 53 checks passed
@braingram braingram deleted the immutable_block_manager branch September 8, 2023 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment