Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate storage of info in serialized columns. #12607

Merged
merged 2 commits into from Dec 20, 2021

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Dec 18, 2021

As is, info is store on any primary data column, but also in the serialized column, which makes the header unnecessarily long. (Saw this while working on #12589, and includes a bit of the other small refactoring there.)

I think this is a bug, but could also call it refactoring.

Comparing this PR with current master:

from astropy.table import QTable, Column
QTable({"a": Column([1.], description='description', unit='')}).write('temp.ecsv', overwrite=True)

--> current master
# %ECSV 1.0
# ---
# datatype:
# - {name: a, datatype: float64, description: description}
# meta: !!omap
# - __serialized_columns__:
#     a:
#       __class__: astropy.units.quantity.Quantity
#       __info__: {description: description}
#       unit: !astropy.units.Unit {unit: ''}
#       value: !astropy.table.SerializedColumn {name: a}
# schema: astropy-2.0
a
1.0


--> this PR (__info__ removed)

# %ECSV 1.0
# ---
# datatype:
# - {name: a, datatype: float64, description: description}
# meta: !!omap
# - __serialized_columns__:
#     a:
#       __class__: astropy.units.quantity.Quantity
#       unit: !astropy.units.Unit {unit: ''}
#       value: !astropy.table.SerializedColumn {name: a}
# schema: astropy-2.0
a
1.0

Fixes #

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label.
  • Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
  • Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

@mhvk mhvk added table Bug 💤 backport-v5.0.x on-merge: backport to v5.0.x labels Dec 18, 2021
@mhvk mhvk added this to the v5.0.1 milestone Dec 18, 2021
@mhvk mhvk requested a review from taldcroft December 18, 2021 01:42
@taldcroft
Copy link
Member

Good plan but I think this can be done with a smaller change footprint. At least this passes existing tests and your new test. I think the existing logic is simpler to understand since there is no recursion for the non-mixin case.

(astropy) ➜  astropy git:(main) ✗ git diff
diff --git a/astropy/table/serialize.py b/astropy/table/serialize.py
index 62d69ccf44..d7b21a0a96 100644
--- a/astropy/table/serialize.py
+++ b/astropy/table/serialize.py
@@ -134,7 +134,7 @@ def _represent_mixin_as_column(col, name, new_cols, mixin_cols,
         # MaskedColumn).  For primary data, we attempt to store any info on
         # the format, etc., on the column, but not for ancillary data (e.g.,
         # no sense to use a float format for a mask).
-        if data_attr == col.info._represent_as_dict_primary_data:
+        if is_primary := (data_attr == col.info._represent_as_dict_primary_data):
             new_name = name
             new_info = info
         else:
@@ -146,6 +146,10 @@ def _represent_mixin_as_column(col, name, new_cols, mixin_cols,
                                        and np.any(data.mask)) else Column
             new_cols.append(col_cls(data, name=new_name, **new_info))
             obj_attrs[data_attr] = SerializedColumn({'name': new_name})
+            if is_primary:
+                # Don't store info in the __serialized_columns__ dict for this column
+                # since this is redundant with info stored on new column itself.
+                info = {}
         else:
             # recurse. This will define obj_attrs[new_name].
             _represent_mixin_as_column(data, new_name, new_cols, obj_attrs)

@taldcroft
Copy link
Member

Just an FYI, I find this to be an easier way to get ECSV output to the console without making a temp file:

qt = QTable({"a": Column([1.], description='description', unit='')})
ascii.write(qt, format='ecsv')

@mhvk
Copy link
Contributor Author

mhvk commented Dec 18, 2021

@taldcroft - yes, that makes sense. I probably should not have combined this with the case in #12589, where I found that recursion was needed, because with structured dtypes one can no longer assume that a Column never needs to be serialized -- it now joins MaskedColumn in sometimes needing it. But probably best to keep that separate; certainly no reason to backport that.

As is, info is store on any primary data column, but also in
the serialized column, which makes the header unnecessarily long.
@mhvk mhvk force-pushed the table-serialization-info-storage branch from b6dac9a to 754aac9 Compare December 18, 2021 20:28
@mhvk
Copy link
Contributor Author

mhvk commented Dec 18, 2021

OK, went with your suggestion (though minus the walrus, since is_primary is used quite far from its definition). Better for this to-be-backported PR indeed.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 19, 2021

Hmm, failures are real -- for hdf5. Will check.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 19, 2021

OK, found that in reconstructing Masked instances, I had made a mistake, which was easily solved. Now the remaining failures are false positives: a time-out and a claim of reduced coverage. So, ready for final review!

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point that this will need updating when structured data are serialized, but agree that this way is better for the bug-fix backport.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 20, 2021

OK, thanks. Will merge and rebase the other PRs.

@mhvk mhvk merged commit 85da057 into astropy:main Dec 20, 2021
@mhvk mhvk deleted the table-serialization-info-storage branch December 20, 2021 20:03
meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Dec 20, 2021
@pllim
Copy link
Member

pllim commented Dec 21, 2021

@meeseeksdev backport to v5.0.x

pllim added a commit that referenced this pull request Dec 22, 2021
…607-on-v5.0.x

Backport PR #12607 on branch v5.0.x (Avoid duplicate storage of info in serialized columns.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants