[io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number #15617
Adding a simple test would be nice, and this would also need a change log entry. I milestoned this to 6.0 because I feel like it might make it if merged soon enough, since we're definitely doing an RC2 at some point.
Okay, I'll take a look at the test failure(s). The one I immediately noticed is related to another very small fix that I included, which is to add a space between the lines of continued descriptions. In the current version that's in the current tests, one gets this:
where the line break is between Q and (large, but I think this should really be
which it is in this PR.
Okay, I'll add this as well.
In general, the diff LGTM but I'll let subpackage maintainers do final approval. Thanks!
Okay, I think this is now ready. A summary of what I've done:
One question for the reviewers: with the current fix, there are two identical lines in the code where a multi-line description is appended to the existing description. One is the new path, where a line matches the BYTES/Format/Units/Name/Explanations format because it starts with a number; the other is the existing path, where a line doesn't match this format. Ideally, the description would only be updated in a single place, but that is quite awkward to do in the current code and it's just a single line. It would be possible, though. An intermediate solution could be to write a small function that specifies how to add to the description, so it can at least be done consistently if this is ever changed (e.g., if the white-space handling needs to be changed).
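As a rough illustration of the intermediate solution, a helper along these lines could keep the joining rule in one place. This is only a minimal sketch: the name `append_description` and its exact behavior (strip the continuation, join with a single space, add no space when the existing description is empty) are assumptions for illustration, not the code in this PR.

```python
def append_description(existing: str, continuation: str) -> str:
    """Join a continuation line onto an existing description.

    Hypothetical helper: the white-space rule lives in one place, so both
    code paths that extend a description stay consistent.
    """
    continuation = continuation.strip()
    if not existing:
        # Nothing to join onto yet, so do not add a leading space.
        return continuation
    if not continuation:
        # A blank continuation leaves the description unchanged.
        return existing
    # A single space keeps words from being concatenated across lines.
    return f"{existing} {continuation}"
```

Both places that currently extend the description could then call the same function, so a later change to the white-space handling would only touch one line.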
This one sounds like a behavior change. So if this PR does not make it into 6.0.0, maybe we should not backport to 6.0.1?
I guess it's possible that people have been relying on this behavior, but in most cases not having a space between the lines would be a bug in my opinion (because if the line is split between two words, they are now concatenated). Ideally it would go in 6.0.0 so then I guess there's no issue, although I guess I should add it to the changelog. Anyway, maybe I'll wait to hear from the |
Thanks for submitting this fix! Indeed catching lines like these can be quite tricky, and I think it is becoming somewhat unnecessarily complex when trying to "repair" a line entry after it has been misidentified. In principle the continuation line might still by chance contain a valid format and unit string and thus slip through, unlikely as that may be.
From a human reader's perspective the more obvious feature is that the line is blank over the entire Bytes Format Units Label section, so this could be fixed more easily by stripping the lines less aggressively in the prior processing:
astropy/astropy/io/ascii/cds.py
Line 68 in 5f05c10
line = line.strip()
would need to go entirely, and can safely, since all lines are processed again in the next step, and replacing
astropy/astropy/io/ascii/cds.py
Lines 109 to 110 in 5f05c10
r"""\s*
(?P<start> \d+ \s* -)? \s*
with
r"""\s{0,12}
(?P<start> \d+ \s* -)? \s{0,12}
would catch continuation lines (the minimum width of the Bytes Format Units Label block should be 24, and the first Bytes number should hardly have more than 4-5 spaces ahead of it). Tested this version with your tests and all of io/table.
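To illustrate the effect of bounding the leading whitespace, here is a small standalone sketch. Only the first two lines of each pattern come from the snippets quoted above; the remaining groups are an assumption modelled on the entries named in the PR description (end, format, units, name, descr), and both sample lines are made up.

```python
import re

# Tail of the pattern: assumed for illustration, not quoted from cds.py.
TAIL = r"""
    (?P<end>    \d+)     \s+
    (?P<format> [\w.]+)  \s+
    (?P<units>  \S+)     \s+
    (?P<name>   \S+)
    (\s+ (?P<descr> \S.*))?"""

# Current behavior: any amount of leading whitespace is accepted.
loose = re.compile(r"""\s* (?P<start> \d+ \s* -)? \s*""" + TAIL, re.VERBOSE)
# Proposed behavior: at most 12 blanks before and after the start byte.
bounded = re.compile(
    r"""\s{0,12} (?P<start> \d+ \s* -)? \s{0,12}""" + TAIL, re.VERBOSE
)

# Hypothetical Byte-by-Byte row and a continuation line starting with a number.
# Neither line is stripped before matching, as the suggestion requires.
col_line = "   1-  3 I3     ---     Index    Running index"
cont_line = " " * 33 + "6 is bad and 7 is worse"

for name, pattern in (("loose", loose), ("bounded", bounded)):
    print(name, bool(pattern.match(col_line)), bool(pattern.match(cont_line)))
```

With the unbounded pattern the continuation line is read as a column definition (end `6`, format `is`, units `bad`, name `and`), while the bounded pattern rejects it because no digit appears within the first 24 characters.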
I noticed that the test ReadMe does have an extra space before the |
My proposed fix followed the same philosophy as the current way that continuation lines are identified, by failing the standard processing. I can't say that it's entirely clear to me that your proposed fix is robust enough, although I agree that it probably is. I can't find anything about multi-line explanations in the CDS standard and the MRT standard is not that specific (indent to the start of the Explanation column). I'll take a look at whether your proposed fix works and can switch to that.
Yes, the extra space is currently just stripped. From what I linked to above, I think there is an extra space in the standard, so stripping up to the start of the Explanation column would be the right thing to do. Maybe this even suggests a better overall fix: since the standard requires the start of the Explanation column to be aligned across all rows, one could use the first row in the Byte-by-Byte description to figure out where the Explanation column starts and then for subsequent lines check whether the line has exactly that many white spaces at the start. Then you know it's a continuation line and can just grab the end of the line as the continued explanation, which per the standard should include the space (assuming the MRT standard is the same as the CDS, because I can only find the extra space in the MRT standard). |
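A minimal sketch of that idea, under several assumptions: the function name `merge_continuations` and the simplified row pattern are hypothetical, the first row is assumed to contain an explanation, and continuation text is joined with a single space rather than preserving whatever spacing the standard prescribes.

```python
import re

# Simplified row pattern (an assumption): Bytes, Format, Units, Label, then
# the explanation, whose start position gives the column offset.
ROW = re.compile(r"\s*(?:\d+\s*-)?\s*\d+\s+[\w.]+\s+\S+\s+\S+\s+(?P<descr>\S.*)")


def merge_continuations(lines):
    """Merge continuation lines into the preceding row's explanation."""
    first = ROW.match(lines[0])
    if first is None:
        raise ValueError("first row has no explanation to align against")
    offset = first.start("descr")  # where the Explanation column begins
    merged = []
    for line in lines:
        blank_prefix = line[:offset].strip() == ""
        rest = line[offset:].strip()
        if merged and blank_prefix and rest:
            # Blank over Bytes/Format/Units/Label: a continued explanation.
            merged[-1] = f"{merged[-1]} {rest}"
        else:
            merged.append(line.rstrip())
    return merged
```

Because the decision depends only on the blank prefix, a continuation that happens to start with a number is handled the same way as any other continuation.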
Agreed something like this is a good approach. You might need to be careful since I am not sure that an Explanation is required by the standard. But certainly you can apply that principle for columns that are definitely required, namely Bytes Format Units Label. So if you find the end of the Label in the first row, that gives a minimum span that must have non-whitespace characters.
All these shenanigans are a bit frustrating given the lack of a good spec. One question is whether that
My suggestion actually does that as well, just having the match fail earlier in the processing.
Yes, at first it seemed a bit cumbersome to me, but the line should be available in
Co-authored-by: Derek Homeier <709020+dhomeier@users.noreply.github.com>
I applied some of the suggestions from the review, except for the one complaining about I don't have time to do the alternative fix that was suggested in the review, so I'm just going to leave the PR as it is here, which does fix the original problem. If you don't want to accept it, so be it, then somebody else will have to fix it in a different way. |
@jobovy Thanks for taking it this far! We can always add to it in a later PR if there is more we want to do. Unless someone beats me to it, I'll look at this early next week and either merge or edit on top of your changes, but I won't get to it before the weekend.
Agreed; this will fix the issue and should be ready to go in. The alternative implementation is only a 3-line change, but I cannot enter it here as a suggestion since the lines are disconnected from the existing changes. And determining the offset as discussed in #15617 (comment) would require some extra work, so it would be OK to do this in a follow-up PR.
Okay, I changed the lines in question to be more readable and not get mangled by Not sure why the coverage project check is failing, it seems a bit flaky; it was fine before the last change and I don't think this change can have done anything to the total project coverage given that the patch coverage is 100%. |
Thanks! I would not worry about the codecov report; I rarely fully understand how it gets to its results, though it's possible it now counts the explicit conditional as an additional (uncovered) line...
I think this is good, too, and Derek, who originally had requested changes, has approved it, so I'm merging. Thank you @jobovy!
…617-on-v6.0.x Backport PR #15617 on branch v6.0.x ([io.ascii.cds] Fix reading of multi-line CDS descriptions where the continued line starts with a number)
Forgot to mark this for squashing, but should be OK, thanks everyone!
Description
Fixes #15608
Fixing this turned out to be a bit tricky, because a multi-line continuation of a column description starting with a number can just get read in like a normal line and have all of the required entries (end, format, units, name, and descr). In particular, the line I was having an issue with is
and gets matched using the regular expression to
So my fix intercepts the error that occurs because the format is incorrect, and combines this with a check for an invalid unit so that it does not catch just any issue with the format.
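The decision rule described here can be sketched in isolation. This is not the code in the PR: the function name `is_misread_continuation`, the simplified format pattern, and the tiny set of accepted unit strings are all stand-ins chosen for illustration, and a real check would rely on the reader's own format and unit parsing.

```python
import re

# Simplified stand-ins (assumptions): a pattern for Fortran-style formats such
# as A10, I3, F5.2 or E10.3, and a tiny set of unit strings treated as valid.
FORTRAN_FORMAT = re.compile(r"[AIFE]\d+(\.\d+)?$", re.IGNORECASE)
KNOWN_UNITS = frozenset({"---", "mag", "deg", "km/s"})


def is_misread_continuation(fmt, units, unit_ok=KNOWN_UNITS.__contains__):
    """Return True only if both the format and the units fail to parse.

    Requiring both failures means a genuine format error on a real column
    definition (valid units, broken format) is still reported as an error.
    """
    bad_format = FORTRAN_FORMAT.match(fmt) is None
    bad_units = not unit_ok(units)
    return bad_format and bad_units


# A made-up continuation "6 is bad and 7 is worse" would be split into
# format "is" and units "bad", so it is flagged as a continuation.
print(is_misread_continuation("is", "bad"))
```

The `unit_ok` parameter is there so that a stricter unit validator can be swapped in without changing the rule itself.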
I can add a test if somebody points me to the right place to put this. Would this consist of adding a new test file to the io/ascii/tests/data/cds directory and adding it to the testfiles list in the test_read.py file? I wasn't sure whether this is the appropriate place, because those files seem to be used in a bunch of tests and maybe you just want a simple test for this issue.