Move ndarray conversion to a Converter #1537

braingram · 2023-05-04T16:46:00Z

This PR moves ndarray handling to a converter.

This required extensive changes to the internal block API which had several knock-on effects.

Public changes

`AsdfConfig.convert_unknown_ndarray_subclasses`

Moving ndarray to a converter means that subclasses are no longer automatically converted.

This change would break stdatamodels (where FITS_rec is automatically converted to ndarray) and breaks one test in asdf-astropy (because of NdarrayMixin).

It seems likely there are other instances 'in the wild', so an AsdfConfig option, convert_unknown_ndarray_subclasses, was added along with some internal changes that allow the new ndarray converter to continue handling subclasses. If an instance of an ndarray subclass that is not explicitly handled by a converter is encountered during serialization and the convert_unknown_ndarray_subclasses option is enabled (the default), the instance will be coerced into an ndarray and a warning will be issued. Because of this warning, the test in asdf-astropy will continue to fail, as warnings are converted to errors there.

A note was added to the docs describing that the default for this option will be changed to False, and the option (and its special handling) removed, in a future version of asdf.
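To illustrate the fallback described above, here is a minimal, self-contained sketch. It is not the code in this PR: `Array`, `Record`, and `prepare_for_serialization` are hypothetical stand-ins (for ndarray, an unhandled subclass such as FITS_rec, and the converter lookup), and raising `TypeError` when the option is disabled is an assumption made only for the sketch.

```python
import warnings


class Array(list):
    """Hypothetical stand-in for numpy.ndarray."""


class Record(Array):
    """Hypothetical stand-in for an ndarray subclass with no converter."""


def prepare_for_serialization(obj, handled_types, convert_unknown_subclasses=True):
    """Return the object that would be handed to the base-class converter."""
    # An exact type match means some converter explicitly handles this type.
    if type(obj) in handled_types:
        return obj
    # Unknown subclass: coerce to the base class and warn, if the option allows it.
    if isinstance(obj, Array) and convert_unknown_subclasses:
        warnings.warn(
            f"{type(obj).__name__} has no converter and was coerced to Array",
            UserWarning,
        )
        return Array(obj)
    # Assumption for this sketch: with the option disabled the object is rejected.
    raise TypeError(f"no converter for {type(obj).__name__}")
```

With the option enabled, `prepare_for_serialization(Record([1, 2]), {Array})` returns a plain `Array` and emits one warning, which is why a test suite that turns warnings into errors would still fail.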

`asdf.Stream` moved to `asdf.tags.core.Stream` and `asdf.stream` deprecated

This is more consistent with other special tag related objects (asdf.tags.core.NDArrayType, asdf.tags.core.IntegerType etc). As an added benefit, the eventual removal of asdf.stream should allow the docs to build for development installs on systems that are case-insensitive (by removing the ambiguous asdf.Stream and asdf.stream).

Changes to `Converter` and `SerializationContext` methods for block access

This changes some of the unreleased public API added in #1508

The updates to the block management to allow moving ndarray to a converter meant that some of the complexity added in #1508 could be removed. With the changes in this PR, the SerializationContext is now aware of whether a serialization or deserialization is being performed and what object is being handled (and can automatically generate and assign keys for most use cases). This means that asdf.util.BlockKey is no longer needed. Extension code still needs to generate keys if an object reads from multiple blocks; when keys are generated with SerializationContext.generate_block_key, asdf can automatically associate them with the correct object. Key-related changes to SerializationContext are:

updated find_block_index(key, data_callback=None) to find_available_block_index(data_callback, key=None)
updated get_block_data_callback(index) to get_block_data_callback(index, key=None)
removed assign_block_key (all assignment is now automatic)

These changes also mean that Converter.reserve_blocks is no longer needed and was removed.
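As a rough sketch of how a converter might use the two updated methods for the common single-block case: `SketchContext` below is a hypothetical stand-in that only mimics the signatures listed above (it is not asdf's SerializationContext and always allocates a new block), and `Payload` and `PayloadConverter` are made-up names for illustration.

```python
class SketchContext:
    """Hypothetical stand-in mimicking the two method signatures listed above."""

    def __init__(self):
        self._callbacks = []  # one data callback per block index
        self._keys = {}  # optional key -> block index

    def find_available_block_index(self, data_callback, key=None):
        # Simplification: every call allocates a new block.
        index = len(self._callbacks)
        self._callbacks.append(data_callback)
        if key is not None:
            self._keys[key] = index
        return index

    def get_block_data_callback(self, index, key=None):
        return self._callbacks[index]


class Payload:
    """Made-up object whose data is stored in a single block."""

    def __init__(self, data):
        self.data = data


class PayloadConverter:
    """Made-up converter; no explicit key is needed for one block per object."""

    def to_yaml_tree(self, obj, tag, ctx):
        # The callback defers producing the data until the block is written.
        index = ctx.find_available_block_index(lambda: obj.data)
        return {"block_index": index}

    def from_yaml_tree(self, node, tag, ctx):
        callback = ctx.get_block_data_callback(node["block_index"])
        return Payload(callback())
```

An object that reads from several blocks would pass an explicit `key` for each block, which is the case where the PR description says extension code still has to generate keys.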

array settings copied between `AsdfFile` instances

To fix #1524 and remove an undocumented behavior, creating an AsdfFile with an existing AsdfFile (AsdfFile(af)) no longer attempts to copy block options from the original AsdfFile. This was previously done inconsistently and could result in files that could not be opened (see #1524).

external blocks are no longer memmapped

In working towards a fix for #1525, memmapping for external blocks was disabled. With this PR, external blocks are always lazy loaded and never memmapped. If these are sensible settings, I propose we document the behavior and close the issue. If instead further changes are needed to allow control over lazy loading, memmapping or read/write mode, then a separate PR will be needed.

`AsdfFile.update` block organization change

Instead of attempting to limit the number of blocks moved/rewritten (which added significant complexity), AsdfFile.update now writes all new blocks to the end of the file, then moves the blocks to the end of the new tree and updates associations between data callbacks and blocks so that objects that lazy load blocks read the correct block. This new strategy allows for updating with compressed blocks (where the size is unknown until the data is written) and allows for any change in tree size (even if it completely overwrites all previous blocks). The one major downside is that a very large file can temporarily almost double in size. However, the final file size after an update with this PR should be equivalent or even smaller (Windows support for truncating was added in this PR to allow the final file to be a sensible size without excessive padding). This seems like a reasonable trade-off (exchanging some storage inefficiency for simplicity in code) to fix the numerous errors related to update found during work on this PR.
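The update strategy described above can be sketched on a plain byte stream. This is only a model under stated assumptions, not the code in this PR: the "tree" and "blocks" are raw bytes with no headers, `update_in_place` is a hypothetical name, and staging the new blocks at the larger of the end of the file and the end of the new tree is an assumption made so the new tree cannot overwrite the staged blocks.

```python
import io


def update_in_place(fd, new_tree, new_blocks):
    """Rewrite fd as new_tree followed by new_blocks; return the peak size reached."""
    # 1. Stage every new block past the existing data (and past the new tree).
    fd.seek(0, io.SEEK_END)
    staging_start = max(fd.tell(), len(new_tree))
    fd.seek(staging_start)
    for block in new_blocks:
        fd.write(block)
    peak_size = fd.tell()

    # 2. Write the new tree at the start; it may overwrite any old blocks.
    fd.seek(0)
    fd.write(new_tree)

    # 3. Move the staged blocks so they directly follow the new tree.
    #    Copying front to back is safe because the destination never
    #    runs ahead of the source.
    read_pos, write_pos = staging_start, len(new_tree)
    for block in new_blocks:
        fd.seek(read_pos)
        data = fd.read(len(block))
        fd.seek(write_pos)
        fd.write(data)
        read_pos += len(block)
        write_pos += len(block)

    # 4. Truncate so the final file carries no leftover padding.
    fd.truncate(write_pos)
    return peak_size
```

For a file holding a 4-byte tree and 8 bytes of blocks, updating with an 8-byte tree and 6 bytes of blocks peaks at 18 bytes before being truncated to 14, which is the temporary growth traded for simpler code.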

Private API changes

Removal of `block.py` and replacement with `_block` submodules

The Block and BlockManager classes were removed and replaced with a number of new classes and functions. Documentation was added to the new _block submodules and should serve as the primary description of this new code (and should be updated based on questions and comments to the PR).

The changes to the internal block API do mean that sphinx-asdf is no longer compatible with this PR. A PR was opened for sphinx-asdf to make it compatible with the changes in this PR and previous versions of ASDF: asdf-format/sphinx-asdf#64

To allow docs to build for this PR the requirements were temporarily changed to install sphinx-asdf from the source branch for the sphinx-asdf PR linked above. Once the private block API changes in this PR are decided the sphinx-asdf PR can be finalized, merged, released and the requirements updated here (either prior to or after merging but prior to release of asdf 3.0).

Test structure

Numerous issues were discovered during this rewrite:

Fixes or work towards fixes for many issues are included in this PR

controlling array storage in to_yaml_tree methods #1013: a test was added for a possible fix for this; if we decide that SerializationContext should have the same public set_array_storage methods as AsdfFile, then this issue could be solved
fixes A failed update can corrupt the file #1520
fixes update corrupts stream data #1523
fixes array storage settings are inconsistently propagated to new AsdfFile and can produce a file that cannot be opened #1524 by not copying block options between AsdfFiles
fixes Rewriting a file with external blocks fails if arrays are not first accessed #1526
Calling update with memmapped data can create invalid data in memmap views #1530 a test was added and xfailed as this may require subclassing ndarray
fixes Support file truncating on windows #1534
fixes ASDF unable to read empty inline array. #1538
Seek and read from closed MemoryIO #1539 a test was added and xfailed, this will require possibly major changes to generic_io
fixes ASDF writes but fails to read inline structured array #1540
fixes Block checksums are only checked for first block if a block index is present #1541
fixes ASDF fails to write blocks to non-seekable file #1542

To make it more obvious which tests correspond to issues, which are unit tests and which are now 'legacy' tests, a new test organization was started in this PR.

'legacy' tests were mostly untouched (left in the same location and only updated to reflect changes to any internal API used and changed in this PR).

Tests that correspond to a specific issue are organized in _tests/_issues. Each issue gets its own file (for example test_1520.py) with a function test_1520 with a docstring linking to the GitHub issue and code that closely reflects any minimal test case in the issue.

New unit tests (like those added for _block) are organized in a directory structure that matches the code layout (asdf._block.key tests exist in asdf._tests._block.key). The goal is to have 100% test coverage for a submodule using only the tests for that module. Measurement of this goal is not currently done in the CI (and has to be checked manually) and this PR does not achieve that goal for all changes (especially asdf._block.manager which attempts to mimic the old BlockManager).

braingram · 2023-05-16T19:45:14Z

Roman regression tests showing no new errors (19 are failing due to unrelated roman issues)
https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-devdeps/detail/Roman-devdeps/286/pipeline/185

braingram · 2023-05-16T20:26:59Z

Running the benchmarks on this PR vs main I see only one major difference. For 'large' files (these have 26 x 26 = 676 tree nodes) both custom_tree_to_tagged_tree and tagged_tree_to_custom_tree are roughly 2x slower with this PR. My initial suspicion is that this has to do with the addition of the Serialization and Deserialization operations that get created, used and discarded for each object. It probably makes more sense to have these created once for each tree operation and re-used/reset for each object.

braingram · 2023-05-17T16:56:21Z

Running the benchmarks on this PR vs main I see only one major difference. For 'large' files (these have 26 x 26 = 676 tree nodes) both custom_tree_to_tagged_tree and tagged_tree_to_custom_tree are roughly 2x slower with this PR. My initial suspicion is that this has to do with the addition of the Serialization and Deserialization operations that get created, used and discarded for each object. It probably makes more sense to have these created once for each tree operation and re-used/reset for each object.

The changes in 52f7bf8 and 77fb0e6 improve the performance of these operations by:

indexing blocks to be written by their data for fast finding of shared blocks (40% improvement)
create one Serialization/Deserialization per call to custom_tree_to_tagged_tree/tagged_tree_to_custom_tree and do not use it as a context manager (14% improvement)
pre-calculate the context version string (to avoid a call to the ctx.version_string property for every node) (4% improvement)
cache converter and AsdfType lookups (3% improvement)
In total this makes custom_tree_to_tagged_tree about 20% faster compared to main and tagged_tree_to_custom_tree is a few percent faster.
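The first optimization in the list above (indexing blocks by their data) can be illustrated with a small sketch. `BlockIndex` is a hypothetical class written for this comment, not the code in 52f7bf8 or 77fb0e6; it keys a dictionary on the identity of the data object so that finding an already-registered shared block is a single lookup rather than a scan over all blocks.

```python
class BlockIndex:
    """Hypothetical sketch: find shared blocks by the identity of their data."""

    def __init__(self):
        self._blocks = []  # holding the data objects keeps their ids stable
        self._index_by_id = {}  # id(data) -> position in self._blocks

    def find_or_add(self, data):
        """Return the block index for data, registering a new block if needed."""
        key = id(data)
        index = self._index_by_id.get(key)
        if index is None:
            index = len(self._blocks)
            self._blocks.append(data)
            self._index_by_id[key] = index
        return index

    def __len__(self):
        return len(self._blocks)
```

Two distinct objects with equal contents get separate blocks in this sketch, while repeated use of the same object maps back to the same block.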

The JWST regression tests finished with 74 fewer errors than the previous run which suggests other changes are confounding the result:
https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-devdeps/662/
I ran a follow up using asdf main:
https://plwishmaster.stsci.edu:8081/job/RT/job/JWST-devdeps/663/
that had 2 fewer errors:

test_nirspec_missing_msa_nofail
test_nirspec_missing_msa_fail
These will require some investigation. It appears the jwst downstream tests are also now failing although I don't immediately see how the failures are related to the recent changes.

braingram · 2023-05-17T17:02:15Z

Comparing the jwst CI tests to a run on a PR that includes only documentation changes shows the same failures as the run here:
https://github.com/spacetelescope/jwst/actions/runs/4996866055?pr=7589
I think it's safe to say the 8 failing tests in jwst are not related to the changes in this asdf PR.

braingram · 2023-05-17T21:26:13Z

I've been so far unable to replicate the 2 failed jwst regression tests.
I reran the regression tests and the 2 differing errors went away.
https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-devdeps/detail/JWST-devdeps/664/pipeline/192

braingram · 2023-05-24T21:23:46Z

@eslavich while doing some testing with reading from and writing to (a fake) s3, I encountered 2 new issues that are related to this PR.

All of the tests so far have been with a fake s3 so I need to verify that this is the same for real s3.

The first issue is that reading from s3 (using s3fs) produces a file-like object that has no valid fileno. This means that np.fromfile used in GenericFile.read_into_array will fail. This issue also impacts the code in main so I opened a separate issue: #1552

The second issue is that writing to s3 (using s3fs) produces a file-like object that is not seekable but in this case returns valid results for 'tell'. This differs from #1542 as it appears that os.pipe produces a file that is not seekable and calling tell raises an exception. The changes in this PR skipped attempting to write a block index to a non-seekable file but with this new case (for the file-like object produced by s3fs) we can and likely should write a block index. I'm inclined to fix this issue in this PR since it does not impact main. However I don't want this change to negatively impact any in-progress review of this PR.
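The difference between the two non-seekable cases can be sketched with standard-library stream classes. Both writer classes are hypothetical stand-ins (not s3fs or os.pipe objects), and `write_blocks` is a made-up helper that only shows the idea of probing `tell` rather than `seekable` to decide whether block offsets, and therefore a block index, can be recorded.

```python
import io


class PipeLikeWriter(io.RawIOBase):
    """Hypothetical stand-in: not seekable, and tell() raises."""

    def writable(self):
        return True

    def write(self, data):
        return len(data)


class CountingWriter(PipeLikeWriter):
    """Hypothetical stand-in: not seekable, but tell() reports bytes written."""

    def __init__(self):
        super().__init__()
        self._position = 0

    def write(self, data):
        self._position += len(data)
        return len(data)

    def tell(self):
        return self._position


def write_blocks(fd, blocks):
    """Write blocks; return their start offsets, or None if tell() is unusable."""
    try:
        fd.tell()
    except (OSError, ValueError):
        offsets = None  # no way to know where blocks start: skip the index
    else:
        offsets = []
    for block in blocks:
        if offsets is not None:
            offsets.append(fd.tell())
        fd.write(block)
    return offsets
```

Both stand-ins report `seekable()` as False, so a check based only on seekability would skip the block index in both cases.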

Would you like me to open a separate issue (and follow-up PR) for the second issue above or update this PR to fix the issue?

eslavich · 2023-05-28T22:31:50Z

Would you like me to open a separate issue (and follow-up PR) for the second issue above or update this PR to fix the issue?

I didn't read this message until now, and I expect to have the review done by the end of the weekend, so I think it will be fine to update this PR. Kind of you to ask :)

eslavich

I wasn't able to finish the review, but I also don't want to make you wait any longer for feedback, so here's the first batch of comments...

eslavich · 2023-05-28T22:36:37Z

asdf/_block/external.py

+from asdf import generic_io, util
+
+
+class UseInternal:


Not sure if it matters, but there's precedent for this kind of constant in the Python standard libraries and they seem to implement it more like this:

    class NotImplementedType:
        pass

    NotImplemented = NotImplementedType()

Thanks! I updated the usage here:
228e022

eslavich · 2023-05-28T22:45:47Z

asdf/_block/external.py

+            from asdf import open as asdf_open
+
+            with asdf_open(resolved_uri, lazy_load=False, copy_arrays=True) as af:
+                self._cache[key] = af._blocks.blocks[0].cached_data


Well, I'll be! Standard says:

The ability to reference block data in an external ASDF file is intentionally limited to the first block in the external ASDF file...

(no problem here, just documenting my voyage of discovery)

eslavich · 2023-05-28T22:50:02Z

asdf/_block/external.py

+
+class ExternalBlockCache:
+    def __init__(self):
+        self._cache = {}


Should this be an LRU cache? Since we're forcing arrays to be copied into memory, a large exploded file might lead to problems.

I'm not quite sure about this one.

The old code called AsdfFile.open_external, which caches every opened AsdfFile (in a dictionary). This means that a repeat call to open_external (and then accessing the first block) always returns the same array (as Block caches the data). This behavior is relied upon for matching arrays that share an external block, as tested in test_explode_then_implode:

asdf/asdf/_tests/commands/tests/test_exploded.py

Lines 11 to 51 in b0c9a50

    def test_explode_then_implode(tmpdir):
        x = np.arange(0, 10, dtype=float)

        tree = {
            "science_data": x,
            "subset": x[3:-3],
            "skipping": x[::2],
            "not_shared": np.arange(10, 0, -1, dtype=np.uint8),
        }

        path = os.path.join(str(tmpdir), "original.asdf")
        ff = AsdfFile(tree)
        # Since we're testing with small arrays, force all arrays to be stored
        # in internal blocks rather than letting some of them be automatically put
        # inline.
        ff.write_to(path, all_array_storage="internal")
        with asdf.open(path) as af:
            assert len(af._blocks._internal_blocks) == 2

        result = main.main_from_args(["explode", path])

        assert result == 0

        files = get_file_sizes(str(tmpdir))

        assert "original.asdf" in files
        assert "original_exploded.asdf" in files
        assert "original_exploded0000.asdf" in files
        assert "original_exploded0001.asdf" in files
        assert "original_exploded0002.asdf" not in files

        assert files["original.asdf"] > files["original_exploded.asdf"]

        path = os.path.join(str(tmpdir), "original_exploded.asdf")
        result = main.main_from_args(["implode", path])

        assert result == 0

        with asdf.open(str(tmpdir.join("original_exploded_all.asdf"))) as af:
            assert_tree_match(af.tree, tree)
            assert len(af._blocks) == 2

Changing ExternalBlockCache to LRU would mean that when the cache overflows, a call to load would return an array with a different id which could interfere with settings assignment and block sharing.
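The identity concern can be shown with a small standalone sketch. `DictCache` and `LRUCache` are hypothetical classes written for this comment (the first mirrors the plain dictionary in ExternalBlockCache shown in the diff above, the second is one possible LRU variant), and the loader returning a bytearray is a stand-in for reading the first block of an external file.

```python
from collections import OrderedDict


def load_external(uri):
    """Stand-in loader: each call builds a new object, like re-reading a file."""
    return bytearray(uri.encode())


class DictCache:
    """Unbounded cache: a repeated load always returns the same object."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}

    def load(self, key):
        if key not in self._cache:
            self._cache[key] = self._loader(key)
        return self._cache[key]


class LRUCache:
    """Bounded cache: an evicted entry is reloaded as a new object."""

    def __init__(self, loader, maxsize):
        self._loader = loader
        self._maxsize = maxsize
        self._cache = OrderedDict()

    def load(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = self._loader(key)
        self._cache[key] = value
        if len(self._cache) > self._maxsize:
            self._cache.popitem(last=False)  # evict the least recently used
        return value
```

With `maxsize=1`, loading "a", then "b", then "a" again yields an equal but distinct object for the second "a", so any bookkeeping keyed on the identity of the first array would no longer match.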

I'm inclined to view the memory mapping of external blocks as a bug (and fix it in this PR by not memory mapping), but that does create an issue for large exploded files, as you pointed out. As ExternalBlockCache is used by the Manager, which is aware of whether the blocks should be lazy loaded and/or memory mapped, another option would be to see how complicated it would be to pass these options to ExternalBlockCache.load and allow the user some control over how the external blocks are loaded. What do you think about this as an option for addressing the concern about large exploded files?

I spent a bit of time looking at this today and I think there may be a few issues with passing along open options.

lazy_load is not an issue (and is currently supported in this PR, as the ndarray converter sets up for lazy loading the external block and then forces its loading if lazy_load was disabled).

validate_checksums also should not be an issue but is not currently being passed on for this PR.

copy_arrays is where things get complicated. The main complication is when the external file is closed. For a non-memmapped external block, the external file can be opened, the block data read and the file closed before the block data is returned. This is not true for memmapped blocks, where the external file will need to be held open until the AsdfFile instance is closed. This was one of the reasons why the previous BlockManager held a reference to the AsdfFile (so that it could use AsdfFile.open_external, which can keep the external files open and close them when the AsdfFile is closed).

I could see a few options to make memory mapping external blocks possible but I'm not sure if the complexity outweighs the benefits.

1. Duplicate the AsdfFile.open_external functionality in _block.Manager. asdf prior to this PR called close on the block manager and we will need to re-add that call to allow the block manager to close the external files.

2. Allow for generating pure numpy.memmap objects (which don't guarantee file closing) for external blocks.

I'm going to think about this one some more but any suggestions on how to handle this are appreciated.

I tried 2 above (allow generating a pure numpy.memmap) and it came together pretty cleanly. See:
a488690

Note that this does not support file open mode (so memory maps are always read only). This open mode is currently not passed down to the block manager so would require more work to get it to the numpy.memmap call.

This got me thinking about other structures that might need to be cleaned up on AsdfFile.close. It seems like it is safe to clear (set to None) all cached_data for internal blocks. I made that change here:
d4498e4


braingram · 2023-06-01T13:45:42Z

Thanks for the detailed review @eslavich :)
I believe I've addressed all of your comments. The one outstanding issue is what to do about external blocks (see: #1537 (comment)).

eslavich

🎉

braingram · 2023-09-05T19:22:50Z

I updated and released sphinx-asdf and updated pyproject.toml (setting a lower pin and removing the dev requirement), so the docs now build with the PyPI version of sphinx-asdf.

I've opened PRs for:

stdatamodels to convert all fitsrec to arrays prior to passing them to asdf for validation or serialization: Safely convert fitsrec in tree before serializing with asdf spacetelescope/stdatamodels#205 (I took a brief stab at completely removing fitsrec, but there is at least one use of it for storing units and numerous uses of things like fields and the case insensitivity of names which will need to be sorted)
asdf-astropy to add a converter/tag/schema for NdarrayMixin: add NdarrayMixin support, bump astropy extension version to 1.1.0 astropy/asdf-astropy#200
dkist to update files with incorrect block indexes: Remove incorrect block index sections from asdf test data DKISTDC/dkist#296

See discussion at #1572

braingram added no-changelog-entry-needed Downstream CI development No backport required labels May 4, 2023

github-actions bot added this to the 3.0.0 milestone May 4, 2023

braingram force-pushed the immutable_block_manager branch 5 times, most recently from 0e65d76 to 5467ac0 Compare May 10, 2023 15:14

braingram force-pushed the immutable_block_manager branch from e5946d8 to a5170ba Compare May 16, 2023 16:00

braingram mentioned this pull request May 16, 2023

update private block api usage asdf-format/sphinx-asdf#64

Merged

braingram removed the no-changelog-entry-needed label May 16, 2023

braingram marked this pull request as ready for review May 16, 2023 19:17

braingram requested a review from a team as a code owner May 16, 2023 19:17

braingram requested review from eslavich and perrygreenfield and removed request for a team May 16, 2023 19:17

braingram changed the title ~~Immutable block manager~~ Move ndarray conversion to a Converter May 16, 2023

eslavich reviewed May 30, 2023

braingram force-pushed the immutable_block_manager branch 2 times, most recently from f928524 to b65023a Compare May 31, 2023 18:28

braingram requested a review from eslavich June 1, 2023 13:45

braingram added 17 commits August 18, 2023 15:29

keep SerializationContext exposed at asdf.asdf

7cc553c

temporarily use dev sphinx-asdf

c224590

move SerializationContext back into asdf.extension

4148331

deprecate import of asdf.asdf.SerializationContext

ab67c48

move _issues tests to _regtests and rename tests

1802597

typo prevented dev sphinx_asdf install

e6e3c44

move write_to version reset into finally

428dc9e

and add test to make sure a failed write doesn't modify the tree or the version

add missing BlockAccess docstring

75e005d

add asdf.asdf.SerializationContext import deprecation to docs

f740292

add parametrization to 1525 regression test

5a143c0

remove unneeded line

adf9a99

simplify ndarray subclass handling

6651672

remove unnecessary assign_object(None)

3a59907

add warnings to failed block index reading

2caf2c9

remove unnecessary config context

9bf92fe

add test_update_compressed_blocks

688617a

tests that updated compressed block data does not cause a failed update

add AsdfBlockIndexWarning

1534bd9

braingram force-pushed the immutable_block_manager branch from 949d1d1 to 1534bd9 Compare August 18, 2023 19:29

braingram mentioned this pull request Aug 29, 2023

Remove legacy extension api #1637

Merged

eslavich approved these changes Sep 4, 2023

remove sphinx-asdf dev version requirement

8672815

braingram merged commit 1c6e5c1 into asdf-format:main Sep 8, 2023
48 of 53 checks passed

braingram deleted the immutable_block_manager branch September 8, 2023 13:25

This was referenced Sep 18, 2023

Block parsing allows a few (<4) bytes of junk data after the last block #1547

Open

Seek and read from closed MemoryIO #1539

Closed

braingram mentioned this pull request Sep 26, 2023

Fix issue with asdftool diff #1652

Merged

This was referenced Nov 28, 2023

Create a new block manager when writing files #619

Closed

Refactor write_to function #581

Closed
