Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

Merged
merged 9 commits into from Dec 12, 2023
Merged

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

merged 9 commits into from Dec 12, 2023

Conversation

jobovy
Copy link
Contributor

@jobovy jobovy commented Nov 15, 2023

Description

Fixes #15608

Fixing this turned out to be a bit tricky, because a multi-line continuation of a column description starting with a number can just get read in like a normal line and have all of the required entries (end, format, units, name, and descr). In particular, the line I was having an issue with is

"                                 200 times the critical density"

and gets matched using the regular expression to

>>> match.groupdict()
{'start': None,
 'end': '200',
 'format': 'times',
 'units': 'the',
 'name': 'critical',
 'descr': 'density'}

So my fix intercepts an error that occurs due to the format being incorrect and combining this with an invalid unit to not just catch any issue with the format.

I can add a test if somebody points me to the right place to put this. Would this consist of adding a new test file to the io/ascii/tests/data/cds directory and adding this in the test_read.py file in the testfiles list? I wasn't sure whether this is the appropriate place, because those files seem to be used in a bunch of tests and maybe you just want to simple test for this issue.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

Copy link

github-actions bot commented Nov 15, 2023

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@pllim pllim added this to the v6.0 milestone Nov 15, 2023
@pllim pllim added backport-v6.0.x on-merge: backport to v6.0.x Bug labels Nov 15, 2023
@pllim pllim requested a review from hamogu November 15, 2023 15:56
@pllim
Copy link
Member

pllim commented Nov 15, 2023

Adding a simple test would be nice but io.ascii maintainers can suggest where/how. But looks like this patch also breaks an existing test, failing the CI.

Also would need a change log.

I milestoned to 6.0 because I feel like this might make it if merged soon enough since we're definitely doing a RC2 at some point.

@jobovy
Copy link
Contributor Author

jobovy commented Nov 15, 2023

Adding a simple test would be nice but io.ascii maintainers can suggest where/how. But looks like this patch also breaks an existing test, failing the CI.

Okay, I'll take a look at the test failure(s). The one I immediately noticed is related to another very small fix that I included, which is to add a space between the lines of continued descriptions. In the current version that's in the current tests, one gets this:

"DAOSPEC quality parameter Q(large values are bad)"

where the line break is between Q and (large, but I think this should really be

"DAOSPEC quality parameter Q (large values are bad)"

which it is in this PR.

Also would need a change log.

Okay, I'll add this as well.

@pllim
Copy link
Member

pllim commented Nov 15, 2023

In general, the diff LGTM but I'll let subpackage maintainers do final approval. Thanks!

@jobovy
Copy link
Contributor Author

jobovy commented Nov 15, 2023

Okay, I think this is now ready. A summary of what I've done:

  • I added a quick test by editing a currently unused column in io/ascii/tests/data/cds/description/ReadMe to have a multi-line description that starts with a number. I did this because I noticed that this ReadMe has already been heavily edited from the original one to test edge cases, so it seemed like a good place to put this.
  • Tests then failed because the test suite runs with warnings as errors, so the normal behavior where an unrecognized unit is just labeled an UnrecognizedUnit does not apply. My fix checked whether the unit is an UnrecognizedUnit, so the fix doesn't apply to the test suite. So I fixed this by catching the warning-turned-into-error when parsing the unit and then just re-running that line to re-raise the unit error for cases where it does not seem like the line is a continued description. Not ideal, but it seems to work.
  • I also slightly changed the logic for how multi-line descriptions are combined, only adding a starting whitespace when the description is not empty from a previous line (this happened in a test)

One question for the reviewers is this: with the current fix, there are two identical lines in the code where a multi-line description is added to the existing description: this new way where a line matches the BYTES/Format/Units/Name/Explanations format because it starts with a number and the existing way where a line doesn't match this format. Ideally, the description would only be updated in a single place, but it's quite awkward to do that in the current code and it's just a single line. But it would be possible. An intermediate solution could be to write a small function that specifies how to add to the description, so it can at least be done consistently if this is ever changed (e.g., if the white-space handling needs to be changed).

@pllim
Copy link
Member

pllim commented Nov 15, 2023

slightly changed the logic for how multi-line descriptions are combined

This one sounds like a behavior change. So if this PR does not make it into 6.0.0, maybe we should not backport to 6.0.1?

@jobovy
Copy link
Contributor Author

jobovy commented Nov 15, 2023

This one sounds like a behavior change. So if this PR does not make it into 6.0.0, maybe we should not backport to 6.0.1?

I guess it's possible that people have been relying on this behavior, but in most cases not having a space between the lines would be a bug in my opinion (because if the line is split between two words, they are now concatenated). Ideally it would go in 6.0.0 so then I guess there's no issue, although I guess I should add it to the changelog. Anyway, maybe I'll wait to hear from the io.ascii.cds maintainers?

@jobovy jobovy changed the title [ascii.io.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number [io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number Nov 15, 2023
Copy link
Contributor

@dhomeier dhomeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this fix! Indeed catching lines like these can be quite tricky, and I think it is becoming somewhat unnecessarily complex when trying to "repair" a line entry after it has been misidentified. In principle the continuation line might still by chance contain a valid format and unit string and thus slip through, unlikely as that may be.
From a human reader's perspective the more obvious feature is that the line is blank over the entire Bytes Format Units Label section, so this could be rather easier fixed by stripping the lines in the prior processing less aggressively:

line = line.strip()

would need to go entirely, and can safely, since all lines are processed again in the next step, and replacing
r"""\s*
(?P<start> \d+ \s* -)? \s*

with

            r"""\s{0,12}
                (?P<start> \d+ \s* -)? \s{0,12}

would catch continuation lines (the minimum width of the Bytes Format Units Label block should be 24, and the first Bytes number should hardly have more than 4-5 spaces ahead of it). Tested this version with your tests and all of io/table.

astropy/io/ascii/tests/data/cds/description/ReadMe Outdated Show resolved Hide resolved
astropy/io/ascii/tests/test_cds_header_from_readme.py Outdated Show resolved Hide resolved
astropy/io/ascii/cds.py Outdated Show resolved Hide resolved
@dhomeier
Copy link
Contributor

I guess it's possible that people have been relying on this behavior, but in most cases not having a space between the lines would be a bug in my opinion (because if the line is split between two words, they are now concatenated).

I noticed that the test ReadMe does have an extra space before the (large values are bad), so the correct parsing might be to strip exactly the width of the first four columns but not more. But that seems rather difficult to implement (thus the proposal to strip anything up to 24 blanks in my alternative solution), and I don't know if there is any CDS specification to that effect.

@jobovy
Copy link
Contributor Author

jobovy commented Nov 16, 2023

Thanks for submitting this fix! Indeed catching lines like these can be quite tricky, and I think it is becoming somewhat unnecessarily complex when trying to "repair" a line entry after it has been misidentified. In principle the continuation line might still by chance contain a valid format and unit string and thus slip through, unlikely as that may be. From a human reader's perspective the more obvious feature is that the line is blank over the entire Bytes Format Units Label section, so this could be rather easier fixed by stripping the lines in the prior processing less aggressively

My proposed fix followed the same philosophy as the current way that continuation lines are identified, by failing the standard processing. I can't say that it's entirely clear to me that your proposed fix is robust enough, although I agree that it probably is. I can't find anything about multi-line explanations in the CDS standard and the mrt standard is not that specific (indent to the start of the Explanation column).

I'll take a look whether your proposed fix works and can switch to that.

I noticed that the test ReadMe does have an extra space before the (large values are bad), so the correct parsing might be to strip exactly the width of the first four columns but not more. But that seems rather difficult to implement (thus the proposal to strip anything up to 24 blanks in my alternative solution), and I don't know if there is any CDS specification to that effect.

Yes, the extra space is currently just stripped. From what I linked to above, I think there is an extra space in the standard, so stripping up to the start of the Explanation column would be the right thing to do.

Maybe this even suggests a better overall fix: since the standard requires the start of the Explanation column to be aligned across all rows, one could use the first row in the Byte-by-Byte description to figure out where the Explanation column starts and then for subsequent lines check whether the line has exactly that many white spaces at the start. Then you know it's a continuation line and can just grab the end of the line as the continued explanation, which per the standard should include the space (assuming the MRT standard is the same as the CDS, because I can only find the extra space in the MRT standard).

@taldcroft
Copy link
Member

Maybe this even suggests a better overall fix: since the standard requires the start of the Explanation column to be aligned across all rows, one could use the first row in the Byte-by-Byte description to figure out where the Explanation column starts and then for subsequent lines check whether the line has exactly that many white spaces at the start.

Agreed something like this is a good approach. You might need to be careful since I am not sure that an Explanation is required by the standard. But certainly you can apply that principle for columns that are definitely required, namely Bytes Format Units Label. So if you find the end of the Label in the first row, that gives a minimum span that must have non-whitespace characters.

   Bytes Format Units     Label  Explanations
--------------------------------------------------------------------------------
   1- 16 A16    ---       ID     Cluster identifier as given in WH15

All these shenanigans are a but frustrating from the lack of a good spec. One question is whether that Bytes Format Units Label Explanations is required to line up with the data below. If so then you can use the position of Explanations to determine unambiguously where the explanations start. In practice that seems to be the case but isn't clear that we can rely on that.

@dhomeier
Copy link
Contributor

dhomeier commented Nov 16, 2023

My proposed fix followed the same philosophy as the current way that continuation lines are identified, by failing the standard processing.

My suggestion actually does that as well, just having the match fail earlier in the processing.

All these shenanigans are a but frustrating from the lack of a good spec. One question is whether that Bytes Format Units Label Explanations is required to line up with the data below. If so then you can use the position of Explanations to determine unambiguously where the explanations start. In practice that seems to be the case but isn't clear that we can rely on that.

Yes, at first seemed a bit cumbersome to me, but the line should be available in lines[i_col_def + 2] so you could improve on my estimate of 24 characters minimum. But I also was not sure if the specification requires the lines to line up exactly, or if e.g. 4 tabs were also possible. Your example file and the ? Wavelength also would fail if the extra space was required.

@saimn saimn modified the milestones: v6.0.0, v6.0.1 Nov 25, 2023
@jobovy
Copy link
Contributor Author

jobovy commented Dec 7, 2023

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

I don't have time to do the alternative fix that was suggested in the review, so I'm just going to leave the PR as it is here, which does fix the original problem. If you don't want to accept it, so be it, then somebody else will have to fix it in a different way.

@hamogu
Copy link
Member

hamogu commented Dec 8, 2023

@jobovy Thanks for taking it this far! We can always add to it in a later PR if there is more we want to do. Unless someone beats me to it, I'll look at this early next week and either merge to edit on top of your changes - but I won't get to it before the weekend.

@dhomeier
Copy link
Contributor

dhomeier commented Dec 9, 2023

Agreed; this will fix the issue and should be ready to go in. The alternative implementation is only a 3-line change, but I cannot enter it here as suggestion since the lines are disconnected from the existing changes. And determining the offset as discussed in #15617 (comment) would require yet some extra work, so it would be OK to do this in a follow-up PR.

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

black has been adopted for the entire codebase as general rule, but as described in APE 20 per-case exceptions are to be made where readability is excessively impacted. The normal way to go would be to skip those sections with # fmt: off, but this is not something I'd block this PR over.

@jobovy
Copy link
Contributor Author

jobovy commented Dec 10, 2023

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

black has been adopted for the entire codebase as general rule, but as described in APE 20 per-case exceptions are to be made where readability is excessively impacted. The normal way to go would be to skip those sections with # fmt: off, but this is not something I'd block this PR over.

Okay, I changed the lines in question to be more readable and not get mangled by black.

Not sure why the coverage project check is failing, it seems a bit flaky; it was fine before the last change and I don't think this change can have done anything to the total project coverage given that the patch coverage is 100%.

@dhomeier
Copy link
Contributor

Thanks! I would not worry about the codecov report, I rarely understand fully how it gets to its results, though it's possible if counts the explicit conditional now as additional (uncovered) line...

@hamogu hamogu merged commit cf35727 into astropy:main Dec 12, 2023
24 of 26 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Dec 12, 2023
…DS descriptions where the continued line starts with a number
@hamogu
Copy link
Member

hamogu commented Dec 12, 2023

I think this is good, too, and Derek, who originally had requested changes, has approved it, so I'm merging. Thank you @jobovy !

@jobovy jobovy deleted the fix-cds-multi-line-explanation branch December 12, 2023 15:43
pllim added a commit that referenced this pull request Dec 12, 2023
…617-on-v6.0.x

Backport PR #15617 on branch v6.0.x ([io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number)
@dhomeier
Copy link
Contributor

Forgot to mark this for squashing, but should be OK, thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v6.0.x on-merge: backport to v6.0.x Bug io.ascii
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect reading of multi-line Explanations in MRT/CDS ReadMes
6 participants