Avoid duplicate storage of info in serialized columns. #12607

mhvk · 2021-12-18T01:42:21Z

As is, info is store on any primary data column, but also in the serialized column, which makes the header unnecessarily long. (Saw this while working on #12589, and includes a bit of the other small refactoring there.)

I think this is a bug, but could also call it refactoring.

Comparing this PR with current master:

from astropy.table import QTable, Column
QTable({"a": Column([1.], description='description', unit='')}).write('temp.ecsv', overwrite=True)

--> current master
# %ECSV 1.0
# ---
# datatype:
# - {name: a, datatype: float64, description: description}
# meta: !!omap
# - __serialized_columns__:
#     a:
#       __class__: astropy.units.quantity.Quantity
#       __info__: {description: description}
#       unit: !astropy.units.Unit {unit: ''}
#       value: !astropy.table.SerializedColumn {name: a}
# schema: astropy-2.0
a
1.0


--> this PR (__info__ removed)

# %ECSV 1.0
# ---
# datatype:
# - {name: a, datatype: float64, description: description}
# meta: !!omap
# - __serialized_columns__:
#     a:
#       __class__: astropy.units.quantity.Quantity
#       unit: !astropy.units.Unit {unit: ''}
#       value: !astropy.table.SerializedColumn {name: a}
# schema: astropy-2.0
a
1.0

Fixes #

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

Do the proposed changes actually accomplish desired goals?
Do the proposed changes follow the Astropy coding guidelines?
Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label.
Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

taldcroft · 2021-12-18T11:58:31Z

Good plan but I think this can be done with a smaller change footprint. At least this passes existing tests and your new test. I think the existing logic is simpler to understand since there is no recursion for the non-mixin case.

(astropy) ➜  astropy git:(main) ✗ git diff
diff --git a/astropy/table/serialize.py b/astropy/table/serialize.py
index 62d69ccf44..d7b21a0a96 100644
--- a/astropy/table/serialize.py
+++ b/astropy/table/serialize.py
@@ -134,7 +134,7 @@ def _represent_mixin_as_column(col, name, new_cols, mixin_cols,
         # MaskedColumn).  For primary data, we attempt to store any info on
         # the format, etc., on the column, but not for ancillary data (e.g.,
         # no sense to use a float format for a mask).
-        if data_attr == col.info._represent_as_dict_primary_data:
+        if is_primary := (data_attr == col.info._represent_as_dict_primary_data):
             new_name = name
             new_info = info
         else:
@@ -146,6 +146,10 @@ def _represent_mixin_as_column(col, name, new_cols, mixin_cols,
                                        and np.any(data.mask)) else Column
             new_cols.append(col_cls(data, name=new_name, **new_info))
             obj_attrs[data_attr] = SerializedColumn({'name': new_name})
+            if is_primary:
+                # Don't store info in the __serialized_columns__ dict for this column
+                # since this is redundant with info stored on new column itself.
+                info = {}
         else:
             # recurse. This will define obj_attrs[new_name].
             _represent_mixin_as_column(data, new_name, new_cols, obj_attrs)

taldcroft · 2021-12-18T12:03:07Z

Just an FYI, I find this to be an easier way to get ECSV output to the console without making a temp file:

qt = QTable({"a": Column([1.], description='description', unit='')})
ascii.write(qt, format='ecsv')

mhvk · 2021-12-18T18:00:22Z

@taldcroft - yes, that makes sense. I probably should not have combined this with the case in #12589, where I found that recursion was needed, because with structured dtypes one can no longer assume that a Column never needs to be serialized -- it now joins MaskedColumn in sometimes needing it. But probably best to keep that separate; certainly no reason to backport that.

As is, info is store on any primary data column, but also in the serialized column, which makes the header unnecessarily long.

mhvk · 2021-12-18T20:29:22Z

OK, went with your suggestion (though minus the walrus, since is_primary is used quite far from its definition). Better for this to-be-backported PR indeed.

mhvk · 2021-12-19T17:30:44Z

Hmm, failures are real -- for hdf5. Will check.

mhvk · 2021-12-19T18:55:01Z

OK, found that in reconstructing Masked instances, I had made a mistake, which was easily solved. Now the remaining failures are false positives: a time-out and a claim of reduced coverage. So, ready for final review!

taldcroft

I see your point that this will need updating when structured data are serialized, but agree that this way is better for the bug-fix backport.

mhvk · 2021-12-20T20:03:20Z

OK, thanks. Will merge and rebase the other PRs.

…zed columns.

pllim · 2021-12-21T20:33:10Z

@meeseeksdev backport to v5.0.x

…607-on-v5.0.x Backport PR #12607 on branch v5.0.x (Avoid duplicate storage of info in serialized columns.)

mhvk added table Bug 💤 backport-v5.0.x on-merge: backport to v5.0.x labels Dec 18, 2021

mhvk added this to the v5.0.1 milestone Dec 18, 2021

mhvk requested a review from taldcroft December 18, 2021 01:42

Avoid duplicate storage of info in serialized columns.

754aac9

As is, info is store on any primary data column, but also in the serialized column, which makes the header unnecessarily long.

mhvk force-pushed the table-serialization-info-storage branch from b6dac9a to 754aac9 Compare December 18, 2021 20:28

Ensure MaskedNDArray propagates info when reconstructing

f5ac7aa

mhvk added the Ready-for-final-review label Dec 19, 2021

taldcroft approved these changes Dec 20, 2021

View reviewed changes

taldcroft mentioned this pull request Dec 20, 2021

Allow structured dtype for column serialization. #12589

Closed

mhvk merged commit 85da057 into astropy:main Dec 20, 2021

mhvk deleted the table-serialization-info-storage branch December 20, 2021 20:03

meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Dec 20, 2021

Backport PR astropy#12607: Avoid duplicate storage of info in seriali…

2413795

…zed columns.

meeseeksmachine mentioned this pull request Dec 20, 2021

Backport PR #12607 on branch v5.0.x (Avoid duplicate storage of info in serialized columns.) #12613

Closed

meeseeksmachine mentioned this pull request Dec 21, 2021

Backport PR #12607 on branch v5.0.x (Avoid duplicate storage of info in serialized columns.) #12618

Merged

pllim added a commit that referenced this pull request Dec 22, 2021

Merge pull request #12618 from meeseeksmachine/auto-backport-of-pr-12…

e1d7e60

…607-on-v5.0.x Backport PR #12607 on branch v5.0.x (Avoid duplicate storage of info in serialized columns.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid duplicate storage of info in serialized columns. #12607

Avoid duplicate storage of info in serialized columns. #12607

mhvk commented Dec 18, 2021 •

edited by taldcroft

taldcroft commented Dec 18, 2021

taldcroft commented Dec 18, 2021

mhvk commented Dec 18, 2021

mhvk commented Dec 18, 2021

mhvk commented Dec 19, 2021

mhvk commented Dec 19, 2021

taldcroft left a comment

mhvk commented Dec 20, 2021

pllim commented Dec 21, 2021

Avoid duplicate storage of info in serialized columns. #12607

Avoid duplicate storage of info in serialized columns. #12607

Conversation

mhvk commented Dec 18, 2021 • edited by taldcroft

Checklist for package maintainer(s)

taldcroft commented Dec 18, 2021

taldcroft commented Dec 18, 2021

mhvk commented Dec 18, 2021

mhvk commented Dec 18, 2021

mhvk commented Dec 19, 2021

mhvk commented Dec 19, 2021

taldcroft left a comment

Choose a reason for hiding this comment

mhvk commented Dec 20, 2021

pllim commented Dec 21, 2021

mhvk commented Dec 18, 2021 •

edited by taldcroft