-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update APE-6 for subtype support #68
Conversation
Looks good to me. |
@taldcroft, this basically looks good, but I have couple of suggestions for change. First, that use of So where the
with something like:
A few other knock-on changes elsewhere in the text will be required for consistency. Second, for the "object" subtype, I'd suggest renaming it to "json". This is more descriptive, but also more accurate: in your example file in this section, one of the cells has the value |
Sounds fine and I agree with allowing flexibility for future enhancements.
OK. This requires a small change to the merged ECSV PR (astropy/astropy#11662). |
Now had a look. I like the addition very much. I also like the changes proposed by @mbtaylor - and am happy to quickly review the necessary follow-up to the astropy implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, but at least one change required: the file example in the Object columns section still has subtype: object
instead of subtype: json
. I wonder if it would also make sense to retitle that section "JSON columns" and change the language in the text to avoid the term "object" given that the things in the column can be scalars with JSON representation ... but that might end up being more confusing, so only do that if you think it's a good idea.
There are also a couple of typos inherited from earlier versions of the document:
- "reprented" -> "represented"
- "include column types" -> "including column types"
- "tablular" -> "tabular"
@mbtaylor - my hope was that the "As a point of clarification" sentence would make things clear. I'm certainly coming from a Python or Java perspective where everything (scalars included) is an object. IMO the JSON spec choose poorly in labeling a dict/mapping as "object", so my preference is to stick with the existing text and more common usage of "object". We can always come back to this choice of wording if it ends up causing confusion. It won't change any implementation details. Thanks for the sharp eye I'll fix those. I did put each of the examples into the Python parser, but it turns out that since |
Fine, I'm happy with the existing text then. All OK by me. |
APE6.rst
Outdated
From version 1.0 and later it is possible to embed extended data types beyond | ||
simple typed scalars in the data section. The column data in the ECSV output | ||
shall be consistent with the specified ``datatype``, with additional details of | ||
the data being captured in the``subtype`` keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"subtype" not formatted correctly here - missing space?
One other formatting error I spotted: in the second paragraph of the "Header Details" section, there is a long run of monospaced text after "start with the two characters" |
From @mbtaylor: "I have been creating some ECSV 1.0 test data, for example the attached Good point about blank values as a marker for missing values. I didn't realize that this is not mentioned in APE-6, so that is definitely something to fix in this PR. Right now astropy supports blank values for all scalar column types. Here is the astropy Table representation of just the scalar type data in test file from @mbtaylor:
For the subtype columns, IIRC we had discussed this <somewhere> but I can't seem to find it. Basically I said that for any subtype column, a blank value is not allowed to signify missing data. In particular for fixed or variable array data, missing values should be represented as Again, none of this is reflected in this PR, so I'll add text accordingly if this seems OK with you. |
@mbtaylor - thinking some more, I do see the improved overall consistency in allowing a blank field to indicate all values are missing even for subtype columns. I'll confess that I'm partly worried about the astropy implementation, even though that should not be impacting how the spec is defined. I'm not sure at this point how difficult it would be to support. Is your java version working now with this? What does it do for an empty field in a variable array? Give a zero-length array? |
@taldcroft yes, we did discuss this before at astropy/astropy#11569 (comment), and I said I didn't have strong views about whether blank values should be permitted for array-valued columns. However, on reflection I think blanks should be permitted here, as you say for consistency and also because I think people are going to want to be able to write data like that, for instance a Gaia column containing a flux time series where some rows have time series data and others don't. Yes my java implementation handles this naturally. How the data is stored ... it depends, but as far as the API goes, the content of each cell is a java Object, and that can be returned as either an array object or a null value (the data for the whole column is not available in the API as a free-standlng data structure). If it turns out to be hard or impossible to represent those blank values faithfully within the Astropy implementation I'd say that's OK, just provide the best representation of a blank value that's feasible, maybe a zero-length array or an array of all nan's or whatever. My representation of the ECSV data model is not going to be perfect either, e.g. I can't represent blank array elements of |
@mbtaylor - It turned out to be just a few lines of code to support blank values for masking data (plus tests of course, astropy/astropy#11720), so I'll update the APE-6 PR accordingly. I did end up on a detour with numpy/numpy#18981. Unfortunately in Python if you mask a variable length array with a blank, numpy does something funky to make it appear as an unmasked zero-length array instead of a masked value. We can live with that and it doesn't impact APE-6. |
@mbtaylor - I pushed three new commits that you can review. The new missing values section has some slight waffling about whether a blank entry counts as a missing value. I was always bothered by not being able to represent a valid zero-length string, so for astropy invented an alternate way using a secondary column of the masks. This is nominally part of the (undocumented) Hopefully this is converging. I think the astropy "bug fix" that implements all the missing value support is basically done. I put in support for missing values in variable length arrays this morning. |
BTW, I was not able to understand that issue with the long literal line in the Header Details section. I stared at the source for a long time and AFAIK it was valid RST. For some reason it wasn't rendering right, so I took an end run and changed the words. |
@@ -559,6 +582,9 @@ a double-quote within a string, hence the double-double quotes. | |||
"{""b"":[2.5,null]}" | |||
true | |||
|
|||
In this subtype, the ``null`` marker is decoded by JSON as the language-specific |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence sounds a bit confusing to me (maybe it's a language-specific
use of the term "JSON" to mean "the JSON parsing library"). I'd suggest
instead something like:
The
null
token is the JSON representation of a null value. JSON parsers will typically decode this to some language-specific value, for instanceNone
in Python.
I only have two remaining tiny niggles:
Otherwise, all looks good to go! |
@mbtaylor - looking at this fresh I didn't like the "fixed-dimension" description. I changed to "multidimensional", which I hope is clear enough and definitely more common usage. |
I'm not sure that replacing "fixed-length" with "multidimensional" is an improvement, since "variable-length array" subtypes can be multidimensional too (and "multidimensional array" subtypes can be 1-dimensional). But the terminology is consistent, and the explanatory text and examples are clear, so if you prefer it this way I'm happy to accept it. |
@mbtaylor - thanks. In that case let's stick with what is there. Since @mhvk has approved and nobody else has commented since I posted on astropy-dev, I think we can go to the next step and I will request final approval from the rest of the Coordination Committee. Thanks for the great input which has been immensely helpful. I'm pretty happy with where this has ended up now as a result. |
The Coordination Committee discussed this today and have approved this APE. In the last commit I updated the info accordingly, and will now merge |
Excellent. Thanks a lot @taldcroft et al. for getting this done, I agree it's worked out well. |
@mbtaylor - the updated DOI will be added as part of the standard APE acceptance process, hopefully soon. |
This updates APE-6 to add support for extended data types, in particular multidimensional fixed-dimension columns, variable-length arrays, and object columns. This is done primarily by adding a
subtype
data keyword and specifying the convention for using this keyword for new supported extended types.The discussion in astropy/astropy#11368 was a large driver for this update including details of the new specification.
Writing standards documents is not my forte, so comments are most welcome!
cc: @mbtaylor @mhvk @hamogu @dhomeier