Support `delimited` format for the `array` field #34
##### Delimited Format
The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `MUST` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.
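To make the consumer rule above concrete, here is a minimal Python sketch. The function name `split_delimited` is hypothetical and not part of the proposal; the sketch only assumes that "split using the `delimiter` property" means an exact, literal split with nothing stripped or unquoted.

```python
def split_delimited(value: str, delimiter: str) -> list[str]:
    """Split a delimited cell on `delimiter` and do nothing else.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not something defined by the proposed text.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    # str.split with an explicit separator performs a literal split:
    # quotes are not interpreted and surrounding whitespace is kept.
    return value.split(delimiter)


# Quotes and the space after the first comma survive untouched.
print(split_delimited('a, "b;c",d', ","))
```

Under this reading, a cell such as `Item 1, Item 2` split on `,` keeps the leading space in the second item, which is the whitespace concern raised in the review comments.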
Suggested change:
The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `SHOULD` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.
Then how to remove white space? Additional key to trim data or regex as delimiter?
This is a property that we may adopt in Camtrap DP (e.g. for tags which use […]). However, I don't think implementors MUST split the array. It should be fine for them to leave it as a string.
I am generally supportive of this, and would note that use of […]
I support this as well. I was at first a little wary of the "no additional processing part" because of cases involving whitespace after delimiters (e.g. "Item 1, Item 2, Item 3") but then realized these can be handled with
I agree -- I think most features should be opt-in for implementations, especially when there are graceful ways to fall back. Another acceptable way implementors may handle array fields is by exploding the items into separate columns. E.g.:
can be exploded to:
This is exceedingly common for the special case of categorical multiselect items. Here's an example using the "liked_colors" field I define here:
is often exploded to:
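As a hypothetical illustration of the multiselect case (the examples from the comment above are not reproduced here), the following Python sketch explodes a delimited `liked_colors` cell into one boolean column per known level. The function name, the `field.level` column naming, and the sample levels are all assumptions.

```python
def explode_multiselect(cells, delimiter, levels, field_name):
    """Turn delimited multiselect cells into one boolean column per level.

    Hypothetical helper: column names of the form "<field>.<level>" are an
    illustrative convention, not something the specification defines.
    """
    rows = []
    for cell in cells:
        # An empty cell is treated here as "nothing selected" (an assumption).
        chosen = set(cell.split(delimiter)) if cell else set()
        rows.append({f"{field_name}.{level}": level in chosen for level in levels})
    return rows


rows = explode_multiselect(
    ["red|blue", "", "green"], "|", ["red", "green", "blue"], "liked_colors"
)
for row in rows:
    print(row)
```

A fallback like this can stay entirely on the implementation side, which fits the view that such features are opt-in.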
I'd rather expect the
But such transformation could also be kept outside of the scope of the specification. By the way this key is called […]. Example of equivalent (actually?) data in two forms:
{ "name": "foo", "type": "list", "listDelimiter": "|" }
{ "name": "foo", "type": "array" }
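The question of whether the two descriptors above describe equivalent data can be sketched in Python. This assumes that a `list` field with `listDelimiter` is split literally and that an `array` field holds a JSON array; `parse_cell` and the sample cells are hypothetical.

```python
import json

LIST_FIELD = {"name": "foo", "type": "list", "listDelimiter": "|"}
ARRAY_FIELD = {"name": "foo", "type": "array"}


def parse_cell(field, cell):
    """Parse one physical cell according to the two descriptors above."""
    if field["type"] == "list":
        # Delimited physical form: literal split, items stay strings.
        return cell.split(field["listDelimiter"])
    if field["type"] == "array":
        # JSON physical form: items keep their JSON types.
        value = json.loads(cell)
        if not isinstance(value, list):
            raise ValueError("expected a JSON array")
        return value
    raise ValueError(f"unsupported type in this sketch: {field['type']!r}")


print(parse_cell(LIST_FIELD, "a|b|c"))
print(parse_cell(ARRAY_FIELD, '["a", "b", "c"]'))
```

For string items the two forms yield the same logical value. They diverge once items are not strings: the JSON cell `[1, 2]` yields integers, while `1|2` yields the strings `"1"` and `"2"` unless an item type is declared.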
@nichtich I agree, the exploding method can safely be left to the implementation. I'm hesitant to add a […] I don't mind […]
So both differ in physical value representation and in logical representation (unless […])
@nichtich I'm wondering if we really can do what you propose like having 2 types completely tied to JSON semantics (lexically and logically):
And having 1 type tied to SQL-semantics that solves both
where […]
Sorry, I don't get what is meant by "JSON-semantics" and "SQL-semantics". AFAIK the former is the data model of objects, arrays, strings, numbers, boolean, and null as defined by JSON (RFC 8259). SQL is not mentioned at all in the current specification.
@nichtich
I think that's really close to your comment above
I like how @roll is posing the potential distinctions between a […]. Firstly, I think we can agree that […]. The remaining attributes I see that might distinguish these types are: […]
Taking a quick review across implementations I'm familiar with:
R
Python
Python (Polars)
SQL (DuckDB)
SQL (PostgreSQL)
Presently, the […] Adding the […] We could similarly use […]
{
"itemType": [
{ "type": "string" },
{ "type": "number" }
]
}
Nested elements (arrays of arrays) are equally representable by […]
{
"type": "array",
"itemType": {
"type": "array",
"itemType": {
"type": "integer"
}
}
}
So the only ordered collection type we're missing with this scheme is a homogeneous, fixed-length array. But rather than making a completely new field type for this, I would suggest we simply add a […]
Defining […] Now back to […]
That way all four distinct collection variations (heterogeneous+fixed, heterogeneous+variable, homogeneous+fixed, homogeneous+variable) can have their values equally represented via JSON physical value cells (e.g. "[1, 2, 3]"), as well as simple delimited string physical value cells (e.g. "1, 2, 3")
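The `itemType` scheme discussed in this comment can be sketched as a small recursive check in Python. This is only an illustration under stated assumptions: a single `itemType` object is read as "every item has this type", a list of `itemType` objects is read as a positional, fixed-length description, and the helper names `matches` and `matches_array` are hypothetical.

```python
import json


def matches(value, item_type):
    """Return True if a decoded JSON value conforms to one type descriptor."""
    kind = item_type["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        return matches_array(value, item_type.get("itemType"))
    raise ValueError(f"unsupported type in this sketch: {kind!r}")


def matches_array(value, item_type):
    """Check an array against a homogeneous or positional itemType."""
    if not isinstance(value, list):
        return False
    if item_type is None:
        return True  # no itemType given: any items are accepted
    if isinstance(item_type, list):
        # Assumed reading: one descriptor per position, so length is fixed.
        return len(value) == len(item_type) and all(
            matches(v, t) for v, t in zip(value, item_type)
        )
    return all(matches(v, item_type) for v in value)


NESTED = {"type": "array", "itemType": {"type": "array", "itemType": {"type": "integer"}}}
PAIR = [{"type": "string"}, {"type": "number"}]

print(matches(json.loads("[[1, 2], [3]]"), NESTED))
print(matches_array(["a", 1.5], PAIR))
```

The sketch works on decoded JSON values only; a delimited cell would first need its string items converted according to the declared item type.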
Ah now I think I'm seeing @roll and @nichtich's earlier point about nesting & delimited values -- the delimited format cannot represent nested types without grouping characters. E.g. […] Or, just encode that as a limitation of the […]
@khusmann
I felt the same initially but then I realized that actually JSON has a strict and well-defined data model on the logical level as well. So, for example, having on a logical level
I think it's achievable via constraints. For example, if we have this type: list
itemType: string
constraints:
minLength: 5
maxLength: 5
Of course, on the model (programming) level an implementation would do better to have a shortcut for setting the length. Let me create a new proposal for the […]
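The constraint-based approach can be sketched as a length check in Python. The helper `check_length` is hypothetical, and the sketch assumes `minLength` and `maxLength` count array items.

```python
def check_length(items, constraints):
    """Return True if the number of items satisfies minLength/maxLength."""
    count = len(items)
    lower = constraints.get("minLength")
    upper = constraints.get("maxLength")
    return (lower is None or count >= lower) and (upper is None or count <= upper)


# Setting both bounds to the same value emulates a fixed length of 5.
FIXED_FIVE = {"minLength": 5, "maxLength": 5}

print(check_length(["a", "b", "c", "d", "e"], FIXED_FIVE))
print(check_length(["a", "b"], FIXED_FIVE))
```

Note that this verifies the length per value at validation time; it does not by itself declare the length as a static property of the type.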
Yes, but we can still represent […]
{
"name": "dateField",
"type": "array",
"itemType": {
"type": "date",
"format": "yyyy-mm-dd"
}
}
Where it would parse JSON strings of the form […] Equivalently with […]
{
"name": "dateField",
"type": "array",
"format": "delimited",
"delimiter": ",",
"itemType": {
"type": "date",
"format": "yyyy-mm-dd"
}
}
Where it would parse delimited strings of the form:
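A rough Python sketch of how the two `dateField` descriptors could be parsed into the same logical value. The sample cells are invented for illustration and are not the forms from the original comment; the mapping of `yyyy-mm-dd` to `%Y-%m-%d` and the helper name `parse_date_array` are also assumptions.

```python
import json
from datetime import date, datetime

JSON_FIELD = {
    "name": "dateField",
    "type": "array",
    "itemType": {"type": "date", "format": "yyyy-mm-dd"},
}
DELIMITED_FIELD = {
    "name": "dateField",
    "type": "array",
    "format": "delimited",
    "delimiter": ",",
    "itemType": {"type": "date", "format": "yyyy-mm-dd"},
}


def parse_date_array(field, cell):
    """Parse a cell into a list of dates, from JSON or delimited form."""
    if field.get("format") == "delimited":
        items = cell.split(field["delimiter"])  # literal split, no trimming
    else:
        items = json.loads(cell)  # default JSON-based format
    # Assumption: "yyyy-mm-dd" corresponds to the strptime pattern below.
    return [datetime.strptime(item, "%Y-%m-%d").date() for item in items]


print(parse_date_array(JSON_FIELD, '["2024-01-31", "2024-02-29"]'))
print(parse_date_array(DELIMITED_FIELD, "2024-01-31,2024-02-29"))
```

In both cases the item-level parsing is driven by `itemType`; only the outer step of obtaining the item strings differs.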
In the spirit of "parse, don't validate", I would argue it's different. A fixed-length array has different capabilities than a variable-length array with constraints. For example, you know you can always explode a fixed array-column into a separate column for each element, and you can statically determine the exact # of columns you'll need. You also know that you can stack fixed-length arrays of the same size and they will always align. These are important static implications, and are not guarantees of variable-length arrays with constraints. Because we get these special capabilities of the type that occur when the length is fixed, I think it's deserving of a type-level attribute. (But I'll wait for the […]
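The static implication described above can be sketched in Python: when the length is part of the type, the output columns are known before any data is read. The helper `explode_fixed` and the `name_i` column naming are hypothetical.

```python
def explode_fixed(rows, length, name):
    """Explode fixed-length arrays into `length` columns named name_0..name_{n-1}."""
    # The column list depends only on the declared length, not on the data.
    columns = [f"{name}_{i}" for i in range(length)]
    records = []
    for row in rows:
        if len(row) != length:
            raise ValueError(f"expected {length} items, got {len(row)}")
        records.append(dict(zip(columns, row)))
    return columns, records


columns, records = explode_fixed([[1, 2, 3], [4, 5, 6]], 3, "v")
print(columns)
print(records)
```

With a variable-length array plus constraints, the same column list could only be derived after inspecting the constraints or the values, which is the distinction the comment draws.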
CLOSED in favor of #38
Rationale
Please see the attached issue