[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

jobovy · 2023-11-15T15:47:14Z

Description

Fixing this turned out to be a bit tricky, because a multi-line continuation of a column description starting with a number can just get read in like a normal line and have all of the required entries (end, format, units, name, and descr). In particular, the line I was having an issue with is

"                                 200 times the critical density"

and gets matched using the regular expression to

>>> match.groupdict()
{'start': None,
 'end': '200',
 'format': 'times',
 'units': 'the',
 'name': 'critical',
 'descr': 'density'}

So my fix intercepts an error that occurs due to the format being incorrect and combining this with an invalid unit to not just catch any issue with the format.

I can add a test if somebody points me to the right place to put this. Would this consist of adding a new test file to the io/ascii/tests/data/cds directory and adding this in the test_read.py file in the testfiles list? I wasn't sure whether this is the appropriate place, because those files seem to be used in a bunch of tests and maybe you just want to simple test for this issue.

By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions · 2023-11-15T15:47:50Z

pllim · 2023-11-15T15:58:20Z

Adding a simple test would be nice but io.ascii maintainers can suggest where/how. But looks like this patch also breaks an existing test, failing the CI.

Also would need a change log.

I milestoned to 6.0 because I feel like this might make it if merged soon enough since we're definitely doing a RC2 at some point.

jobovy · 2023-11-15T16:05:10Z

Adding a simple test would be nice but io.ascii maintainers can suggest where/how. But looks like this patch also breaks an existing test, failing the CI.

Okay, I'll take a look at the test failure(s). The one I immediately noticed is related to another very small fix that I included, which is to add a space between the lines of continued descriptions. In the current version that's in the current tests, one gets this:

"DAOSPEC quality parameter Q(large values are bad)"

where the line break is between Q and (large, but I think this should really be

"DAOSPEC quality parameter Q (large values are bad)"

which it is in this PR.

Also would need a change log.

Okay, I'll add this as well.

pllim · 2023-11-15T16:24:55Z

In general, the diff LGTM but I'll let subpackage maintainers do final approval. Thanks!

jobovy · 2023-11-15T18:24:25Z

Okay, I think this is now ready. A summary of what I've done:

I added a quick test by editing a currently unused column in io/ascii/tests/data/cds/description/ReadMe to have a multi-line description that starts with a number. I did this because I noticed that this ReadMe has already been heavily edited from the original one to test edge cases, so it seemed like a good place to put this.
Tests then failed because the test suite runs with warnings as errors, so the normal behavior where an unrecognized unit is just labeled an UnrecognizedUnit does not apply. My fix checked whether the unit is an UnrecognizedUnit, so the fix doesn't apply to the test suite. So I fixed this by catching the warning-turned-into-error when parsing the unit and then just re-running that line to re-raise the unit error for cases where it does not seem like the line is a continued description. Not ideal, but it seems to work.
I also slightly changed the logic for how multi-line descriptions are combined, only adding a starting whitespace when the description is not empty from a previous line (this happened in a test)

One question for the reviewers is this: with the current fix, there are two identical lines in the code where a multi-line description is added to the existing description: this new way where a line matches the BYTES/Format/Units/Name/Explanations format because it starts with a number and the existing way where a line doesn't match this format. Ideally, the description would only be updated in a single place, but it's quite awkward to do that in the current code and it's just a single line. But it would be possible. An intermediate solution could be to write a small function that specifies how to add to the description, so it can at least be done consistently if this is ever changed (e.g., if the white-space handling needs to be changed).

pllim · 2023-11-15T18:28:05Z

slightly changed the logic for how multi-line descriptions are combined

This one sounds like a behavior change. So if this PR does not make it into 6.0.0, maybe we should not backport to 6.0.1?

jobovy · 2023-11-15T18:39:46Z

This one sounds like a behavior change. So if this PR does not make it into 6.0.0, maybe we should not backport to 6.0.1?

I guess it's possible that people have been relying on this behavior, but in most cases not having a space between the lines would be a bug in my opinion (because if the line is split between two words, they are now concatenated). Ideally it would go in 6.0.0 so then I guess there's no issue, although I guess I should add it to the changelog. Anyway, maybe I'll wait to hear from the io.ascii.cds maintainers?

dhomeier

Thanks for submitting this fix! Indeed catching lines like these can be quite tricky, and I think it is becoming somewhat unnecessarily complex when trying to "repair" a line entry after it has been misidentified. In principle the continuation line might still by chance contain a valid format and unit string and thus slip through, unlikely as that may be.
From a human reader's perspective the more obvious feature is that the line is blank over the entire Bytes Format Units Label section, so this could be rather easier fixed by stripping the lines in the prior processing less aggressively:

astropy/astropy/io/ascii/cds.py

Line 68 in 5f05c10

line = line.strip()

would need to go entirely, and can safely, since all lines are processed again in the next step, and replacing

astropy/astropy/io/ascii/cds.py

Lines 109 to 110 in 5f05c10

    
                       r"""\s* 
        
                           (?P<start> \d+ \s* -)? \s*

with

            r"""\s{0,12}
                (?P<start> \d+ \s* -)? \s{0,12}

would catch continuation lines (the minimum width of the Bytes Format Units Label block should be 24, and the first Bytes number should hardly have more than 4-5 spaces ahead of it). Tested this version with your tests and all of io/table.

astropy/io/ascii/tests/data/cds/description/ReadMe

astropy/io/ascii/tests/test_cds_header_from_readme.py

astropy/io/ascii/cds.py

dhomeier · 2023-11-15T22:38:08Z

I guess it's possible that people have been relying on this behavior, but in most cases not having a space between the lines would be a bug in my opinion (because if the line is split between two words, they are now concatenated).

I noticed that the test ReadMe does have an extra space before the (large values are bad), so the correct parsing might be to strip exactly the width of the first four columns but not more. But that seems rather difficult to implement (thus the proposal to strip anything up to 24 blanks in my alternative solution), and I don't know if there is any CDS specification to that effect.

jobovy · 2023-11-16T02:33:32Z

Thanks for submitting this fix! Indeed catching lines like these can be quite tricky, and I think it is becoming somewhat unnecessarily complex when trying to "repair" a line entry after it has been misidentified. In principle the continuation line might still by chance contain a valid format and unit string and thus slip through, unlikely as that may be. From a human reader's perspective the more obvious feature is that the line is blank over the entire Bytes Format Units Label section, so this could be rather easier fixed by stripping the lines in the prior processing less aggressively

My proposed fix followed the same philosophy as the current way that continuation lines are identified, by failing the standard processing. I can't say that it's entirely clear to me that your proposed fix is robust enough, although I agree that it probably is. I can't find anything about multi-line explanations in the CDS standard and the mrt standard is not that specific (indent to the start of the Explanation column).

I'll take a look whether your proposed fix works and can switch to that.

I noticed that the test ReadMe does have an extra space before the (large values are bad), so the correct parsing might be to strip exactly the width of the first four columns but not more. But that seems rather difficult to implement (thus the proposal to strip anything up to 24 blanks in my alternative solution), and I don't know if there is any CDS specification to that effect.

Yes, the extra space is currently just stripped. From what I linked to above, I think there is an extra space in the standard, so stripping up to the start of the Explanation column would be the right thing to do.

Maybe this even suggests a better overall fix: since the standard requires the start of the Explanation column to be aligned across all rows, one could use the first row in the Byte-by-Byte description to figure out where the Explanation column starts and then for subsequent lines check whether the line has exactly that many white spaces at the start. Then you know it's a continuation line and can just grab the end of the line as the continued explanation, which per the standard should include the space (assuming the MRT standard is the same as the CDS, because I can only find the extra space in the MRT standard).

taldcroft · 2023-11-16T11:39:04Z

Maybe this even suggests a better overall fix: since the standard requires the start of the Explanation column to be aligned across all rows, one could use the first row in the Byte-by-Byte description to figure out where the Explanation column starts and then for subsequent lines check whether the line has exactly that many white spaces at the start.

Agreed something like this is a good approach. You might need to be careful since I am not sure that an Explanation is required by the standard. But certainly you can apply that principle for columns that are definitely required, namely Bytes Format Units Label. So if you find the end of the Label in the first row, that gives a minimum span that must have non-whitespace characters.

   Bytes Format Units     Label  Explanations
--------------------------------------------------------------------------------
   1- 16 A16    ---       ID     Cluster identifier as given in WH15

All these shenanigans are a but frustrating from the lack of a good spec. One question is whether that Bytes Format Units Label Explanations is required to line up with the data below. If so then you can use the position of Explanations to determine unambiguously where the explanations start. In practice that seems to be the case but isn't clear that we can rely on that.

dhomeier · 2023-11-16T15:00:01Z

My proposed fix followed the same philosophy as the current way that continuation lines are identified, by failing the standard processing.

My suggestion actually does that as well, just having the match fail earlier in the processing.

All these shenanigans are a but frustrating from the lack of a good spec. One question is whether that Bytes Format Units Label Explanations is required to line up with the data below. If so then you can use the position of Explanations to determine unambiguously where the explanations start. In practice that seems to be the case but isn't clear that we can rely on that.

Yes, at first seemed a bit cumbersome to me, but the line should be available in lines[i_col_def + 2] so you could improve on my estimate of 24 characters minimum. But I also was not sure if the specification requires the lines to line up exactly, or if e.g. 4 tabs were also possible. Your example file and the ? Wavelength also would fail if the extra space was required.

…tarts with a number

…scription is not empty

Co-authored-by: Derek Homeier <709020+dhomeier@users.noreply.github.com>

jobovy · 2023-12-07T21:09:46Z

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

I don't have time to do the alternative fix that was suggested in the review, so I'm just going to leave the PR as it is here, which does fix the original problem. If you don't want to accept it, so be it, then somebody else will have to fix it in a different way.

hamogu · 2023-12-08T14:08:06Z

@jobovy Thanks for taking it this far! We can always add to it in a later PR if there is more we want to do. Unless someone beats me to it, I'll look at this early next week and either merge to edit on top of your changes - but I won't get to it before the weekend.

dhomeier · 2023-12-09T11:40:18Z

Agreed; this will fix the issue and should be ready to go in. The alternative implementation is only a 3-line change, but I cannot enter it here as suggestion since the lines are disconnected from the existing changes. And determining the offset as discussed in #15617 (comment) would require yet some extra work, so it would be OK to do this in a follow-up PR.

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

black has been adopted for the entire codebase as general rule, but as described in APE 20 per-case exceptions are to be made where readability is excessively impacted. The normal way to go would be to skip those sections with # fmt: off, but this is not something I'd block this PR over.

jobovy · 2023-12-10T19:49:44Z

I applied some of the suggestions from the review, except for the one complaining about black formatting. IMHO, if you're going to be using black, you shouldn't complain about what it does to code. I rebased to the latest main.

black has been adopted for the entire codebase as general rule, but as described in APE 20 per-case exceptions are to be made where readability is excessively impacted. The normal way to go would be to skip those sections with # fmt: off, but this is not something I'd block this PR over.

Okay, I changed the lines in question to be more readable and not get mangled by black.

Not sure why the coverage project check is failing, it seems a bit flaky; it was fine before the last change and I don't think this change can have done anything to the total project coverage given that the patch coverage is 100%.

dhomeier · 2023-12-12T10:15:23Z

Thanks! I would not worry about the codecov report, I rarely understand fully how it gets to its results, though it's possible if counts the explicit conditional now as additional (uncovered) line...

…DS descriptions where the continued line starts with a number

hamogu · 2023-12-12T11:42:03Z

I think this is good, too, and Derek, who originally had requested changes, has approved it, so I'm merging. Thank you @jobovy !

…617-on-v6.0.x Backport PR #15617 on branch v6.0.x ([io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number)

dhomeier · 2023-12-12T16:37:38Z

Forgot to mark this for squashing, but should be OK, thanks everyone!

jobovy requested review from taldcroft and dhomeier as code owners November 15, 2023 15:47

github-actions bot added the io.ascii label Nov 15, 2023

pllim added this to the v6.0 milestone Nov 15, 2023

pllim added backport-v6.0.x on-merge: backport to v6.0.x Bug labels Nov 15, 2023

pllim requested a review from hamogu November 15, 2023 15:56

jobovy changed the title ~~[ascii.io.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number~~ [io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number Nov 15, 2023

dhomeier requested changes Nov 15, 2023

View reviewed changes

saimn modified the milestones: v6.0.0, v6.0.1 Nov 25, 2023

jobovy and others added 8 commits December 7, 2023 16:06

Fix reading of multi-line CDS descriptions where the continued line s…

826d661

…tarts with a number

Fix failing multi-line description tests because of new whitespace

d81864d

Add changelog entry

e84f18b

Add test of multi-line description that starts with a number

3778384

Deal with the case where warnings are turned into errors

4952562

Only add a whitespace to a continued description when the existing de…

04aa800

…scription is not empty

Fix re-raising unit parsing error

c4d7b46

Apply some suggestions from code review

3c1484e

Co-authored-by: Derek Homeier <709020+dhomeier@users.noreply.github.com>

Suggested fix re:black formatting of adding to column description

e70c35e

dhomeier approved these changes Dec 12, 2023

View reviewed changes

hamogu merged commit cf35727 into astropy:main Dec 12, 2023
24 of 26 checks passed

meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Dec 12, 2023

Backport PR astropy#15617: [io.ascii.cds] Fix reading of multi-line C…

fd4daac

…DS descriptions where the continued line starts with a number

meeseeksmachine mentioned this pull request Dec 12, 2023

Backport PR #15617 on branch v6.0.x ([io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number) #15718

Merged

jobovy deleted the fix-cds-multi-line-explanation branch December 12, 2023 15:43

dhomeier mentioned this pull request Dec 13, 2023

[io.ascii.cds] Find "Explanations" position to identify continuation lines in Readme from column headings #15731

Draft

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

jobovy commented Nov 15, 2023

github-actions bot commented Nov 15, 2023 •

edited by hamogu

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

dhomeier left a comment

dhomeier commented Nov 15, 2023

jobovy commented Nov 16, 2023

taldcroft commented Nov 16, 2023

dhomeier commented Nov 16, 2023 •

edited

jobovy commented Dec 7, 2023

hamogu commented Dec 8, 2023

dhomeier commented Dec 9, 2023

jobovy commented Dec 10, 2023

dhomeier commented Dec 12, 2023

hamogu commented Dec 12, 2023

dhomeier commented Dec 12, 2023

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617

Conversation

jobovy commented Nov 15, 2023

Description

github-actions bot commented Nov 15, 2023 • edited by hamogu

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

pllim commented Nov 15, 2023

jobovy commented Nov 15, 2023

dhomeier left a comment

Choose a reason for hiding this comment

dhomeier commented Nov 15, 2023

jobovy commented Nov 16, 2023

taldcroft commented Nov 16, 2023

dhomeier commented Nov 16, 2023 • edited

jobovy commented Dec 7, 2023

hamogu commented Dec 8, 2023

dhomeier commented Dec 9, 2023

jobovy commented Dec 10, 2023

dhomeier commented Dec 12, 2023

hamogu commented Dec 12, 2023

dhomeier commented Dec 12, 2023

github-actions bot commented Nov 15, 2023 •

edited by hamogu

dhomeier commented Nov 16, 2023 •

edited