Update APE-6 for subtype support #68

taldcroft · 2021-04-28T13:32:47Z

This updates APE-6 to add support for extended data types, in particular multidimensional fixed-dimension columns, variable-length arrays, and object columns. This is done primarily by adding a subtype data keyword and specifying the convention for using this keyword for new supported extended types.

The discussion in astropy/astropy#11368 was a large driver for this update, including details of the new specification.

Writing standards documents is not my forte, so comments are most welcome!

cc: @mbtaylor @mhvk @hamogu @dhomeier

hamogu · 2021-04-28T14:09:44Z

Looks good to me.

mbtaylor · 2021-04-29T11:00:48Z

@taldcroft, this basically looks good, but I have a couple of suggestions for changes.

First, that use of subtype is not restricted to string columns. I agree that most uses, including the array and object ones enumerated here, will be for string columns, but (a) I don't see any cost to relaxing this restriction, and (b) I can imagine possibilities where non-string subtypes might be desirable, e.g. datatype: float64, subtype: MJD; I don't know if python/numpy has specific datatypes for time values, but some languages or parsing environments might do. Usages like this don't have to be written into the APE now or ever, but if either ECSV v1.x or individual users decide something like that would be a good idea in future, it won't conflict with the existing specification.

So where the subtype keyword is introduced I would remove or replace the sentence "If subtype is defined the datatype must be set to string", and in the Subtype data section replace the sentence:

If table readers do not recognize the subtype then the column should be returned as a string.

with something like:

If table readers do not recognise or support the subtype then they may ignore it and use the datatype only.

A few other knock-on changes elsewhere in the text will be required for consistency.
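The fallback rule suggested above can be sketched in a few lines. This is only an illustration: `SUPPORTED_SUBTYPES` and `effective_type` are hypothetical names, not part of astropy or of any ECSV reader, and the set of supported subtypes is an assumption made for the example.

```python
# Hypothetical reader-side helper; the names are illustrative only and do
# not correspond to astropy's API or to any existing ECSV implementation.
SUPPORTED_SUBTYPES = {"json"}  # assumption: what this toy reader understands


def effective_type(datatype, subtype=None):
    """Return the (datatype, subtype) pair the reader will actually honor."""
    if subtype is not None and subtype not in SUPPORTED_SUBTYPES:
        # Unrecognized or unsupported subtype: ignore it and use datatype only.
        return datatype, None
    return datatype, subtype


print(effective_type("float64", "MJD"))   # subtype dropped, datatype kept
print(effective_type("string", "json"))   # subtype honored
```

With this rule a hypothetical `datatype: float64, subtype: MJD` column is still readable as plain floats by a reader that has never heard of `MJD`.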

Second, for the "object" subtype, I'd suggest renaming it to "json". This is more descriptive, but also more accurate: in your example file in this section, one of the cells has the value true, which is JSON, but is not a JSON object (which would have to be enclosed in curly brackets).
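The distinction between "JSON" and "JSON object" can be checked with Python's standard `json` module. The helper `is_json_object` is a hypothetical name used only for this sketch.

```python
import json

# Each text below is valid JSON, but only the last one is a JSON *object*.
for text in ['true', '3.5', '"abc"', '[1, 2]', '{"b": [2.5, null]}']:
    value = json.loads(text)
    print(f"{text!r:>22} -> {type(value).__name__}")


def is_json_object(text):
    """True only if the text decodes to a JSON object (a Python dict)."""
    return isinstance(json.loads(text), dict)


print(is_json_object("true"))
print(is_json_object('{"b": [2.5, null]}'))
```

A cell containing `true` decodes to a Python `bool`, so a column that allows it holds JSON values in general rather than JSON objects specifically.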

taldcroft · 2021-04-29T13:58:52Z

Relaxing the restriction on requiring string

Sounds fine and I agree with allowing flexibility for future enhancements.

Second, for the "object" subtype, I'd suggest renaming it to "json".

OK. This requires a small change to the merged ECSV PR (astropy/astropy#11662).

mhvk · 2021-04-29T14:39:34Z

Now had a look. I like the addition very much. I also like the changes proposed by @mbtaylor - and am happy to quickly review the necessary follow-up to the astropy implementation.

taldcroft · 2021-05-02T13:16:06Z

@mbtaylor @mhvk - I think I have addressed the comments and that this is ready for final review. It's worth viewing the rendered file since I picked up some problems I had not previously noticed.

mhvk

Looks good to me!

mbtaylor

Mostly looks good, but at least one change required: the file example in the Object columns section still has subtype: object instead of subtype: json. I wonder if it would also make sense to retitle that section "JSON columns" and change the language in the text to avoid the term "object" given that the things in the column can be scalars with JSON representation ... but that might end up being more confusing, so only do that if you think it's a good idea.

There are also a couple of typos inherited from earlier versions of the document:

"reprented" -> "represented"
"include column types" -> "including column types"
"tablular" -> "tabular"

taldcroft · 2021-05-03T13:45:23Z

@mbtaylor - my hope was that the "As a point of clarification" sentence would make things clear. I'm certainly coming from a Python or Java perspective where everything (scalars included) is an object. IMO the JSON spec chose poorly in labeling a dict/mapping as "object", so my preference is to stick with the existing text and more common usage of "object".

We can always come back to this choice of wording if it ends up causing confusion. It won't change any implementation details.

Thanks for the sharp eye; I'll fix those. I did put each of the examples into the Python parser, but it turns out that since object is a valid numpy dtype, having "object" there also works for Python.

mbtaylor · 2021-05-03T21:34:32Z

Fine, I'm happy with the existing text then. All OK by me.

mbtaylor · 2021-05-07T22:00:20Z

APE6.rst

+From version 1.0 and later it is possible to embed extended data types beyond
+simple typed scalars in the data section. The column data in the ECSV output
+shall be consistent with the specified ``datatype``, with additional details of
+the data being captured in the``subtype`` keyword.


"subtype" not formatted correctly here - missing space?

mbtaylor · 2021-05-07T22:03:30Z

One other formatting error I spotted: in the second paragraph of the "Header Details" section, there is a long run of monospaced text after "start with the two characters"

taldcroft · 2021-05-08T11:15:08Z

From @mbtaylor:

"I have been creating some ECSV 1.0 test data, for example the attached
file. You might disagree with some of the details, in particular
blank values (adjacent delimiters in the file with no text between them)
for some data types - that usage does seem to be supported by the
astropy reader for at least some data types, though it's not
discussed explicitly in APE6, and I'd argue it would be a good idea
to allow blank (masked?) values for any column (though I might have
said something different in the past). But in any case it would
probably be a good idea for us to agree what is and is not legal,
and change things if required, so my output handler is not writing
files that the astropy input handler can't read, and ideally vice versa."

Good point about blank values as a marker for missing values. I didn't realize that this is not mentioned in APE-6, so that is definitely something to fix in this PR.

Right now astropy supports blank values for all scalar column types. Here is the astropy Table representation of just the scalar type data in the test file from @mbtaylor:

In [4]: t.pprint_all()
i_index s_byte s_short s_int s_long s_float s_double    s_string    s_boolean
------- ------ ------- ----- ------ ------- -------- -------------- ---------
      0      0       0     0      0     0.0      0.0           zero     False
      1     --       1     1      1     1.0      1.0            one      True
      2      2      --     2      2     2.0      2.0            two     False
      3      3       3    --      3     3.0      3.0          three      True
      4      4       4     4     --     4.0      4.0           four     False
      5      5       5     5      5     nan      5.0           five      True
      6      6       6     6      6     6.0      nan            six     False
      7      7       7     7      7     7.0      7.0             --      True
      8      8       8     8      8     8.0      8.0 ' "\""' ; '&<>        --
      9      9       9     9      9     nan      nan             --      True
     10    -10     -10   -10    -10   -10.0    -10.0             10     False
     11     --     -11   -11    -11   -11.0    -11.0       10 + one      True
     12    -12      --   -12    -12   -12.0    -12.0       10 + two     False
     13    -13     -13    --    -13   -13.0    -13.0     10 + three      True
     14    -14     -14   -14     --   -14.0    -14.0      10 + four     False
     15    -15     -15   -15    -15     nan    -15.0      10 + five      True

For the subtype columns, IIRC we had discussed this <somewhere> but I can't seem to find it. Basically I said that for any subtype column, a blank value is not allowed to signify missing data. In particular for fixed or variable array data, missing values should be represented as null. So instead of using a blank field, use [null,null]. The logic behind this is that being "missing" really applies to just a scalar element.
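The element-level convention described here, where `null` marks a missing element inside an array cell, could be decoded roughly as follows. This is a sketch using only the standard library: `decode_array_cell` and the `fill` placeholder are hypothetical, and it handles flat (one-dimensional) arrays only.

```python
import json


def decode_array_cell(text, fill=0.0):
    """Decode a flat JSON array cell into parallel (values, mask) lists.

    A JSON null becomes a masked element; `fill` is an arbitrary placeholder
    stored in the values list at masked positions.
    """
    items = json.loads(text)
    mask = [item is None for item in items]
    values = [fill if item is None else item for item in items]
    return values, mask


print(decode_array_cell("[null,null]"))
print(decode_array_cell("[1.5,null]"))
```

Keeping the mask separate from the values mirrors how masked arrays are usually stored, but the choice of fill value here is purely illustrative.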

Again, none of this is reflected in this PR, so I'll add text accordingly if this seems OK with you.

taldcroft · 2021-05-08T12:30:06Z

@mbtaylor - thinking some more, I do see the improved overall consistency in allowing a blank field to indicate all values are missing even for subtype columns. I'll confess that I'm partly worried about the astropy implementation, even though that should not be impacting how the spec is defined. I'm not sure at this point how difficult it would be to support.

Is your java version working now with this? What does it do for an empty field in a variable array? Give a zero-length array?

mbtaylor · 2021-05-08T22:06:25Z

@taldcroft yes, we did discuss this before at astropy/astropy#11569 (comment), and I said I didn't have strong views about whether blank values should be permitted for array-valued columns.

However, on reflection I think blanks should be permitted here, as you say for consistency and also because I think people are going to want to be able to write data like that, for instance a Gaia column containing a flux time series where some rows have time series data and others don't.

Yes, my Java implementation handles this naturally. How the data is stored ... it depends, but as far as the API goes, the content of each cell is a Java Object, and that can be returned as either an array object or a null value (the data for the whole column is not available in the API as a free-standing data structure).

If it turns out to be hard or impossible to represent those blank values faithfully within the Astropy implementation I'd say that's OK, just provide the best representation of a blank value that's feasible, maybe a zero-length array or an array of all nan's or whatever. My representation of the ECSV data model is not going to be perfect either, e.g. I can't represent blank array elements of subtype int*[...] array-valued cells, so they'll just have to be zeros.
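A reader that treats a blank field as "the whole cell is missing" might look like the sketch below. It is an assumption-laden illustration, not the astropy or Java implementation: `MISSING` and `decode_cell` are hypothetical names, and a sentinel object stands in for whatever masked or null representation a real reader would use.

```python
import json

# Sentinel for a cell that is missing as a whole. A dedicated object is used
# so that it cannot be confused with a decoded JSON null (Python None).
MISSING = object()


def decode_cell(text):
    """Decode one array-valued cell; a blank field means the cell is missing."""
    if text == "":
        return MISSING
    return json.loads(text)


cells = ["[1.0,2.0,3.0]", "", "[]"]
decoded = [decode_cell(cell) for cell in cells]

for raw, value in zip(cells, decoded):
    label = "MISSING" if value is MISSING else value
    print(f"{raw!r:>16} -> {label}")
```

The point of the sentinel is that a blank field and an explicit zero-length array `[]` stay distinguishable; an implementation that cannot keep them apart would have to pick the closest available representation.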

taldcroft · 2021-05-10T15:47:21Z

@mbtaylor - It turned out to be just a few lines of code to support blank values for masking data (plus tests of course, astropy/astropy#11720), so I'll update the APE-6 PR accordingly. I did end up on a detour with numpy/numpy#18981. Unfortunately in Python if you mask a variable length array with a blank, numpy does something funky to make it appear as an unmasked zero-length array instead of a masked value. We can live with that and it doesn't impact APE-6.

taldcroft · 2021-05-11T13:29:34Z

@mbtaylor - I pushed three new commits that you can review. The new missing values section has some slight waffling about whether a blank entry counts as a missing value. I was always bothered by not being able to represent a valid zero-length string, so for astropy I invented an alternate way using a secondary column of the masks. This is nominally part of the (undocumented) astropy-2.0 schema. Anyway, suggestions welcome.

Hopefully this is converging. I think the astropy "bug fix" that implements all the missing value support is basically done. I put in support for missing values in variable length arrays this morning.

taldcroft · 2021-05-11T13:33:10Z

BTW, I was not able to understand that issue with the long literal line in the Header Details section. I stared at the source for a long time and AFAIK it was valid RST. For some reason it wasn't rendering right, so I took an end run and changed the words.

mbtaylor · 2021-05-11T14:05:23Z

APE6.rst

@@ -559,6 +582,9 @@ a double-quote within a string, hence the double-double quotes.
  "{""b"":[2.5,null]}"
  true

+In this subtype, the ``null`` marker is decoded by JSON as the language-specific


This sentence sounds a bit confusing to me (maybe it's a language-specific
use of the term "JSON" to mean "the JSON parsing library"). I'd suggest
instead something like:

The null token is the JSON representation of a null value. JSON parsers will typically decode this to some language-specific value, for instance None in Python.
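The two data lines quoted in the diff above can be decoded in two steps: CSV unquoting first, then JSON parsing. The sketch below uses only the standard `csv` and `json` modules and treats each line as a single-column row, so the choice of delimiter does not matter here; it is not how any particular ECSV reader is implemented.

```python
import csv
import io
import json

# The two data lines from the example: a quoted JSON object, then a bare true.
data = '"{""b"":[2.5,null]}"\ntrue\n'

# Step 1: CSV parsing removes the outer quotes and collapses "" into ".
cells = [row[0] for row in csv.reader(io.StringIO(data))]
print(cells)

# Step 2: JSON parsing turns each cell into a language-specific value.
values = [json.loads(cell) for cell in cells]
print(values)
```

In Python the `null` token ends up as `None` and `true` as `True`; other languages' JSON parsers map them to their own equivalents.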

mbtaylor · 2021-05-11T14:17:57Z

I only have two remaining tiny niggles:

The Subtype data section refers to "three types of extended data that can be represented: fixed-dimension array data, variable-length array data, and object data". But the first two subsections are titled "fixed-length" and "variable-length" not "fixed-dimension" and "variable-length" array data. Make this terminology consistent by rewording text or subtitles?
The first sentence in the "Variable-length array data" section says "the data cell are arrays", should be "the data cells are arrays".

Otherwise, all looks good to go!

taldcroft · 2021-05-13T10:35:30Z

@mbtaylor - looking at this fresh I didn't like the "fixed-dimension" description. I changed to "multidimensional", which I hope is clear enough and definitely more common usage.

mbtaylor · 2021-05-13T10:50:15Z

I'm not sure that replacing "fixed-length" with "multidimensional" is an improvement, since "variable-length array" subtypes can be multidimensional too (and "multidimensional array" subtypes can be 1-dimensional).

But the terminology is consistent, and the explanatory text and examples are clear, so if you prefer it this way I'm happy to accept it.

taldcroft · 2021-05-14T18:38:41Z

@mbtaylor - thanks. In that case let's stick with what is there. Since @mhvk has approved and nobody else has commented since I posted on astropy-dev, I think we can go to the next step and I will request final approval from the rest of the Coordination Committee.

Thanks for the great input which has been immensely helpful. I'm pretty happy with where this has ended up now as a result.

eteq · 2021-05-25T16:19:18Z

The Coordination Committee discussed this today and has approved this APE. In the last commit I updated the info accordingly, and will now merge.

mbtaylor · 2021-05-25T16:37:52Z

Excellent. Thanks a lot @taldcroft et al. for getting this done, I agree it's worked out well.
One question: ECSV 0.9 had a DOI, will ECSV 1.0 get one, and where should I go looking for it?

taldcroft · 2021-05-25T17:05:18Z

@mbtaylor - the updated DOI will be added as part of the standard APE acceptance process, hopefully soon.

Update APE-6 for subtype support

632d49e

taldcroft mentioned this pull request Apr 28, 2021

Support reading and writing multidimensional and object columns in ECSV astropy/astropy#11569

Merged

1 task

taldcroft mentioned this pull request Apr 29, 2021

Change ECSV subtype=object to subtype=json astropy/astropy#11662

Merged

taldcroft mentioned this pull request May 2, 2021

Add discussion of support for object columns #64

Closed

taldcroft added 3 commits May 2, 2021 08:57

Implement changes suggested in review

939bdb9

Fix dates and references

4579acf

Fix missing colons

ca687ec

mhvk approved these changes May 2, 2021

mbtaylor suggested changes May 3, 2021

mbtaylor reviewed May 7, 2021

taldcroft added 3 commits May 11, 2021 09:20

Add details for missing value support

588c415

Add other PRs

de44208

Fix various typos and format issues

f1f0eca

mbtaylor reviewed May 11, 2021

Final tweaks, use "multidimensional"

d5edd12

astrofrog approved these changes May 25, 2021

update info for astropy#68

bdb213d

eteq approved these changes May 25, 2021

eteq merged commit 0c583cf into astropy:main May 25, 2021

eteq mentioned this pull request May 25, 2021

Update to using "all version" DOIs #69

Merged

taldcroft deleted the ape6-subtypes branch May 25, 2021 17:05
