Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update APE-6 for subtype support #68

Merged
merged 9 commits into from
May 25, 2021
Merged

Conversation

taldcroft
Copy link
Member

This updates APE-6 to add support for extended data types, in particular multidimensional fixed-dimension columns, variable-length arrays, and object columns. This is done primarily by adding a subtype data keyword and specifying the convention for using this keyword for new supported extended types.

The discussion in astropy/astropy#11368 was a large driver for this update including details of the new specification.

Writing standards documents is not my forte, so comments are most welcome!

cc: @mbtaylor @mhvk @hamogu @dhomeier

@hamogu
Copy link
Member

hamogu commented Apr 28, 2021

Looks good to me.

@mbtaylor
Copy link

@taldcroft, this basically looks good, but I have couple of suggestions for change.

First, that use of subtype is not restricted to string columns. I agree that most uses, including the array and object ones enumerated here, will be for string columns, but (a) I don't see any cost to relaxing this restriction, and (b) I can imagine possibilities where non-string subtypes might be desirable, e.g. datatype: float64, subtype: MJD; I don't know if python/numpy has specific datatypes for time values, but some languages or parsing environments might do. Usages like this don't have to be written into the APE now or ever, but if either ECSV v1.x or individual users decide something like that would be a good idea in future, it won't conflict with the existing specification.

So where the subtype keyword is introduced I would remove or replace the sentence "If subtype is defined the datatype must be set to string", and in the Subtype data section replace the sentence:

If table readers do not recognize the subtype then the column should be returned as a string.

with something like:

If table readers do not recognise or support the subtype then they may ignore it and use the datatype only.

A few other knock-on changes elsewhere in the text will be required for consistency.

Second, for the "object" subtype, I'd suggest renaming it to "json". This is more descriptive, but also more accurate: in your example file in this section, one of the cells has the value true, which is JSON, but is not a JSON object (which would have to be enclosed in curly brackets).

@taldcroft
Copy link
Member Author

Relaxing the restriction on requiring string

Sounds fine and I agree with allowing flexibility for future enhancements.

Second, for the "object" subtype, I'd suggest renaming it to "json".

OK. This requires a small change to the merged ECSV PR (astropy/astropy#11662).

@mhvk
Copy link
Contributor

mhvk commented Apr 29, 2021

Now had a look. I like the addition very much. I also like the changes proposed by @mbtaylor - and am happy to quickly review the necessary follow-up to the astropy implementation.

@taldcroft
Copy link
Member Author

@mbtaylor @mhvk - I think have addressed the comments and that this is ready for final review. It's worth viewing the rendered file since I picked up some problems I had not previously noticed.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link

@mbtaylor mbtaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, but at least one change required: the file example in the Object columns section still has subtype: object instead of subtype: json. I wonder if it would also make sense to retitle that section "JSON columns" and change the language in the text to avoid the term "object" given that the things in the column can be scalars with JSON representation ... but that might end up being more confusing, so only do that if you think it's a good idea.

There are also a couple of typos inherited from earlier versions of the document:

  • "reprented" -> "represented"
  • "include column types" -> "including column types"
  • "tablular" -> "tabular"

@taldcroft
Copy link
Member Author

@mbtaylor - my hope was that the "As a point of clarification" sentence would make things clear. I'm certainly coming from a Python or Java perspective where everything (scalars included) is an object. IMO the JSON spec choose poorly in labeling a dict/mapping as "object", so my preference is to stick with the existing text and more common usage of "object".

We can always come back to this choice of wording if it ends up causing confusion. It won't change any implementation details.

Thanks for the sharp eye I'll fix those. I did put each of the examples into the Python parser, but it turns out that since object is a valid numpy dtype having "object" there also works for Python.

@mbtaylor
Copy link

mbtaylor commented May 3, 2021

Fine, I'm happy with the existing text then. All OK by me.

APE6.rst Outdated
From version 1.0 and later it is possible to embed extended data types beyond
simple typed scalars in the data section. The column data in the ECSV output
shall be consistent with the specified ``datatype``, with additional details of
the data being captured in the``subtype`` keyword.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"subtype" not formatted correctly here - missing space?

@mbtaylor
Copy link

mbtaylor commented May 7, 2021

One other formatting error I spotted: in the second paragraph of the "Header Details" section, there is a long run of monospaced text after "start with the two characters"

@taldcroft
Copy link
Member Author

taldcroft commented May 8, 2021

From @mbtaylor:

"I have been creating some ECSV 1.0 test data, for example the attached
file. You might disagree with some of the details, in particular
blank values (adjacent delimiters in the file with no text between them)
for some data types - that usage does seem to be supported by the
astropy reader for at least some data types, though it's not
discussed explicitly in APE6, and I'd argue it would be a good idea
to allow blank (masked?) values for any column (though I might have
said something different in the past). But in any case it would
probably be a good idea for us to agree what is and is not legal,
and change things if required, so my output handler is not writing
files that the astropy input handler can't read, and ideally vice versa."

Good point about blank values as a marker for missing values. I didn't realize that this is not mentioned in APE-6, so that is definitely something to fix in this PR.

Right now astropy supports blank values for all scalar column types. Here is the astropy Table representation of just the scalar type data in test file from @mbtaylor:

In [4]: t.pprint_all()
i_index s_byte s_short s_int s_long s_float s_double    s_string    s_boolean
------- ------ ------- ----- ------ ------- -------- -------------- ---------
      0      0       0     0      0     0.0      0.0           zero     False
      1     --       1     1      1     1.0      1.0            one      True
      2      2      --     2      2     2.0      2.0            two     False
      3      3       3    --      3     3.0      3.0          three      True
      4      4       4     4     --     4.0      4.0           four     False
      5      5       5     5      5     nan      5.0           five      True
      6      6       6     6      6     6.0      nan            six     False
      7      7       7     7      7     7.0      7.0             --      True
      8      8       8     8      8     8.0      8.0 ' "\""' ; '&<>        --
      9      9       9     9      9     nan      nan             --      True
     10    -10     -10   -10    -10   -10.0    -10.0             10     False
     11     --     -11   -11    -11   -11.0    -11.0       10 + one      True
     12    -12      --   -12    -12   -12.0    -12.0       10 + two     False
     13    -13     -13    --    -13   -13.0    -13.0     10 + three      True
     14    -14     -14   -14     --   -14.0    -14.0      10 + four     False
     15    -15     -15   -15    -15     nan    -15.0      10 + five      True

For the subtype columns, IIRC we had discussed this <somewhere> but I can't seem to find it. Basically I said that for any subtype column, a blank value is not allowed to signify missing data. In particular for fixed or variable array data, missing values should be represented as null. So instead of using a blank field, use [null,null]. The logic behind this is that being "missing" really applies to just a scalar element.

Again, none of this is reflected in this PR, so I'll add text accordingly if this seems OK with you.

@taldcroft
Copy link
Member Author

@mbtaylor - thinking some more, I do see the improved overall consistency in allowing a blank field to indicate all values are missing even for subtype columns. I'll confess that I'm partly worried about the astropy implementation, even though that should not be impacting how the spec is defined. I'm not sure at this point how difficult it would be to support.

Is your java version working now with this? What does it do for an empty field in a variable array? Give a zero-length array?

@mbtaylor
Copy link

mbtaylor commented May 8, 2021

@taldcroft yes, we did discuss this before at astropy/astropy#11569 (comment), and I said I didn't have strong views about whether blank values should be permitted for array-valued columns.

However, on reflection I think blanks should be permitted here, as you say for consistency and also because I think people are going to want to be able to write data like that, for instance a Gaia column containing a flux time series where some rows have time series data and others don't.

Yes my java implementation handles this naturally. How the data is stored ... it depends, but as far as the API goes, the content of each cell is a java Object, and that can be returned as either an array object or a null value (the data for the whole column is not available in the API as a free-standlng data structure).

If it turns out to be hard or impossible to represent those blank values faithfully within the Astropy implementation I'd say that's OK, just provide the best representation of a blank value that's feasible, maybe a zero-length array or an array of all nan's or whatever. My representation of the ECSV data model is not going to be perfect either, e.g. I can't represent blank array elements of subtype int*[...] array-valued cells, so they'll just have to be zeros.

@taldcroft
Copy link
Member Author

@mbtaylor - It turned out to be just a few lines of code to support blank values for masking data (plus tests of course, astropy/astropy#11720), so I'll update the APE-6 PR accordingly. I did end up on a detour with numpy/numpy#18981. Unfortunately in Python if you mask a variable length array with a blank, numpy does something funky to make it appear as an unmasked zero-length array instead of a masked value. We can live with that and it doesn't impact APE-6.

@taldcroft
Copy link
Member Author

@mbtaylor - I pushed three new commits that you can review. The new missing values section has some slight waffling about whether a blank entry counts as a missing value. I was always bothered by not being able to represent a valid zero-length string, so for astropy invented an alternate way using a secondary column of the masks. This is nominally part of the (undocumented) astropy-2.0 schema. Anyway, suggestions welcome.

Hopefully this is converging. I think the astropy "bug fix" that implements all the missing value support is basically done. I put in support for missing values in variable length arrays this morning.

@taldcroft
Copy link
Member Author

BTW, I was not able to understand that issue with the long literal line in the Header Details section. I stared at the source for a long time and AFAIK it was valid RST. For some reason it wasn't rendering right, so I took an end run and changed the words.

@@ -559,6 +582,9 @@ a double-quote within a string, hence the double-double quotes.
"{""b"":[2.5,null]}"
true

In this subtype, the ``null`` marker is decoded by JSON as the language-specific

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence sounds a bit confusing to me (maybe it's a language-specific
use of the term "JSON" to mean "the JSON parsing library"). I'd suggest
instead something like:

The null token is the JSON representation of a null value. JSON parsers will typically decode this to some language-specific value, for instance None in Python.

@mbtaylor
Copy link

I only have two remaining tiny niggles:

  • The Subtype data section refers to "three types of extended data that can be represented: fixed-dimension array data, variable-length array data, and object data". But the first two subsections are titled "fixed-length" and "variable-length" not "fixed-dimension" and "variable-length" array data. Make this terminology consistent by rewording text or subtitles?
  • The first sentence in the "Variable-length array data" section says "the data cell are arrays", should be "the data cells are arrays".

Otherwise, all looks good to go!

@taldcroft
Copy link
Member Author

@mbtaylor - looking at this fresh I didn't like the "fixed-dimension" description. I changed to "multidimensional", which I hope is clear enough and definitely more common usage.

@mbtaylor
Copy link

I'm not sure that replacing "fixed-length" with "multidimensional" is an improvement, since "variable-length array" subtypes can be multidimensional too (and "multidimensional array" subtypes can be 1-dimensional).

But the terminology is consistent, and the explanatory text and examples are clear, so if you prefer it this way I'm happy to accept it.

@taldcroft
Copy link
Member Author

@mbtaylor - thanks. In that case let's stick with what is there. Since @mhvk has approved and nobody else has commented since I posted on astropy-dev, I think we can go to the next step and I will request final approval from the rest of the Coordination Committee.

Thanks for the great input which has been immensely helpful. I'm pretty happy with where this has ended up now as a result.

@eteq
Copy link
Member

eteq commented May 25, 2021

The Coordination Committee discussed this today and have approved this APE. In the last commit I updated the info accordingly, and will now merge

@eteq eteq merged commit 0c583cf into astropy:main May 25, 2021
@mbtaylor
Copy link

Excellent. Thanks a lot @taldcroft et al. for getting this done, I agree it's worked out well.
One question: ECSV 0.9 had a DOI, will ECSV 1.0 get one, and where should I go looking for it?

@taldcroft
Copy link
Member Author

@mbtaylor - the updated DOI will be added as part of the standard APE acceptance process, hopefully soon.

@taldcroft taldcroft deleted the ape6-subtypes branch May 25, 2021 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants