Support `delimited` format for the `array` field #34

roll · 2024-02-06T17:20:32Z

Fixes: Future possibility for delimiter-separated list for arrays (instead of JSON array)? (specs#736)

Rationale

Please see the attached issue

peterdesmet · 2024-02-07T10:26:57Z

content/docs/specifications/table-schema.md

+
+##### Delimited Format
+
+The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `MUST` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.


Suggested change

The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `MUST` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.

The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `SHOULD` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.

Then how to remove white space? Additional key to trim data or regex as delimiter?

peterdesmet · 2024-02-07T10:28:25Z

This is a property that we may adopt in Camtrap DP (e.g. for tags which use | as delimiter).

However, I don't think implementors MUST split the array. It should be fine for them to leave it as a string.

pschumm · 2024-02-17T21:10:55Z

I am generally supportive of this, and would note that use of | as a delimiter is actually fairly common (though personally I prefer a comma), so it should be supported. And I agree with @peterdesmet that while implementors may wish to split the array, they shouldn't be required to if they are targeting software that doesn't support array values within a variable.

khusmann · 2024-02-19T19:33:03Z

I support this as well. I was at first a little wary of the "no additional processing part" because of cases involving whitespace after delimiters (e.g. "Item 1, Item 2, Item 3") but then realized these can be handled with "delimiter": ", ". In any case I think it's better to start off more strict, as we are doing here.
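A minimal Python sketch of the "split only, no additional processing" rule, showing why a `", "` delimiter handles the whitespace case. The helper name `split_delimited` is hypothetical and not part of any Frictionless library:

```python
def split_delimited(value: str, delimiter: str = ",") -> list[str]:
    """Split a cell on the exact delimiter; no trimming, no quote handling."""
    return value.split(delimiter)


cell = "Item 1, Item 2, Item 3"

# A bare comma keeps the leading spaces, because nothing is trimmed.
with_comma = split_delimited(cell, ",")

# A two-character delimiter absorbs the space after each comma.
with_comma_space = split_delimited(cell, ", ")

# Quotes are not interpreted either: they stay part of the items.
quoted = split_delimited('"a,b",c', ",")
```

Under this reading, items that themselves contain the delimiter cannot be expressed and would need the default JSON-based format.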

> However, I don't think implementors MUST split the array. It should be fine for them to leave it as a string.

I agree -- I think most features should be opt-in for implementations, especially when there are graceful ways to fall back.

Another acceptable way implementors may handle array fields is by exploding the items into separate columns. E.g.:

| foo |
| --- |
| [1, 2] |
| [1] |
| [1, 2, 3] |

can be exploded to:

| foo_1 | foo_2 | foo_3 |
| --- | --- | --- |
| 1 | 2 | null |
| 1 | null | null |
| 1 | 2 | 3 |
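One possible way an implementation might do this positional explode, sketched in Python. The function name and the `name_i` column naming are assumptions for illustration only:

```python
def explode_positional(name, rows):
    """Explode array values into name_1..name_n columns, padding with None."""
    width = max((len(row) for row in rows), default=0)
    header = [f"{name}_{i}" for i in range(1, width + 1)]
    body = [list(row) + [None] * (width - len(row)) for row in rows]
    return header, body


header, body = explode_positional("foo", [[1, 2], [1], [1, 2, 3]])
```

Note that the column count here depends on the longest array in the data, so it cannot be known from the schema alone.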

This is exceedingly common for the special case of categorical multiselect items. Here's an example using the "liked_colors" field I define here:

| liked_colors |
| --- |
| ["red"] |
| ["red", "yellow"] |
| ["orange", "green"] |

is often exploded to:

| liked_colors_red | liked_colors_orange | liked_colors_yellow | liked_colors_green | etc. |
| --- | --- | --- | --- | --- |
| true | false | false | false | ... |
| true | false | true | false | ... |
| false | true | false | true | ... |
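A rough Python sketch of this indicator-column explode. The function name and the category order are assumptions; a real implementation would presumably take the categories from the field definition:

```python
def explode_indicators(name, rows, categories):
    """One boolean column per category: True where the row's array contains it."""
    header = [f"{name}_{category}" for category in categories]
    body = [[category in row for category in categories] for row in rows]
    return header, body


rows = [["red"], ["red", "yellow"], ["orange", "green"]]
categories = ["red", "orange", "yellow", "green"]  # assumed category order
header, body = explode_indicators("liked_colors", rows, categories)
```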

nichtich · 2024-02-19T19:50:50Z

I'd rather expect the foo example to be exploded to

| foo | foo | foo |
| --- | --- | --- |
| 1 | 2 | null |
| 1 | null | null |
| 1 | 2 | 3 |

But such transformation could also be kept outside of the scope of the specification.

By the way this key is called arrayDelimiter in Neo4J CSV import format. It may be a better name to avoid confusion with delimiter from CSV Dialect Specification. However the name "array" is confusing as this does not refer to "format": "array". Maybe call it listDelimiter and use "type": "list" instead of "type": "delimited"?

Example of equivalent (actually?) data in two forms:

{ "name": "foo", "type": "list", "listDelimiter": "|" }

foo
1|2
1
1|2|3

{ "name": "foo", "type": "array" }

foo
["1","2"]
["1"]
["1","2","3"]
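To check the "equivalent (actually?)" question for this particular data, here is a small Python sketch that parses both forms with the standard library only. It is an illustration, not how any specific implementation works:

```python
import json

delimited_cells = ["1|2", "1", "1|2|3"]
json_cells = ['["1","2"]', '["1"]', '["1","2","3"]']

# Delimited form: split on the delimiter, items are always strings.
from_delimited = [cell.split("|") for cell in delimited_cells]

# JSON form: items keep whatever JSON type they were written with.
from_json = [json.loads(cell) for cell in json_cells]
```

The two results match here only because the JSON items are written as strings; a JSON cell such as `[1,2]` would yield integers instead.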

khusmann · 2024-02-20T02:36:13Z

@nichtich I agree, the exploding method can safely be left to the implementation.

I'm hesitant to add a "type": "list", because it is an equivalent logical type to "type": "array"... the difference in question is regarding the physical value representation of the field, right? Or is there a logical difference between array and list fields I'm missing?

I don't mind delimiter used in this context; I think it's concise and descriptive. I think its meaning is clear within the array field scope, just as dialect.delimiter is clear within the CSVDialect scope.

nichtich · 2024-02-20T07:32:27Z

> I'm hesitant to add a "type": "list", because it is an equivalent logical type to "type": "array"... the difference in question is regarding the physical value representation of the field, right?
> Or is there a logical difference between array and list fields I'm missing?

- list or delimited (a name I'd try to avoid, as it is already used in another context) is one cell delimited into parts. How to further define these parts has not been discussed yet, so it sticks to string as the default. The logical representation is a sequence of strings.
- array is one cell containing a JSON array. This JSON array may contain arbitrary JSON elements, so the logical representation is the full spectrum of the JSON data model.

So both differ in physical value representation and in logical representation (unless array is also restricted to a JSON array of plain strings, but this is not the case). Therefore it would be helpful to avoid the term "array" in this issue and use another name such as "list" or "sequence" instead.

roll · 2024-02-20T11:10:03Z

@nichtich
That's a good point regarding logical semantics. It's true that JSON-compatible objects and arrays imply exact "data rules" on the logical level too. Being tied to JSON semantics logically means that e.g. array item types are restricted to JSON types.

I'm wondering whether we could really do what you propose, i.e. have 2 types completely tied to JSON semantics (lexically and logically):

{"key": "value"}

type: object
jsonSchema: {}

["value1", "value2"]

type: array
jsonSchema: {}

And having 1 type tied to SQL-semantics that solves both delimited (#34) and itemType (#31) problems:

value1,value2

type: list
delimiter: "," (default)
itemType: string (default) | number | integer | boolean | datetime | date | time | year | yearmonth

where itemType is required (for compatibility with SQL) but defaults to string. It is also parsed at the Table Schema level, supporting only default type formats, while object/array are parsed by a JSON parser. Another difference between array and list would be typed versus untyped. So the list would mimic an SQL array (we can also adjust the allowed types).
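A rough Python sketch of how such a `list` field might be parsed: split first, then cast every item to the single `itemType`. The `CASTERS` table and `parse_list` are hypothetical and cover only three of the listed types:

```python
# Hypothetical casters for a few of the item types listed above.
CASTERS = {
    "string": str,
    "integer": int,
    "number": float,
}


def parse_list(cell, delimiter=",", item_type="string"):
    """Split a cell on the delimiter, then cast every item to one type."""
    cast = CASTERS[item_type]
    return [cast(item) for item in cell.split(delimiter)]


strings = parse_list("value1,value2")
integers = parse_list("1,2,3", item_type="integer")
numbers = parse_list("1.5|2", delimiter="|", item_type="number")
```

A cell with an item that does not cast (for example `"1,x"` with `integer`) raises `ValueError` in this sketch; how a real implementation reports that is left open.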

nichtich · 2024-02-20T12:46:40Z

Sorry, I don't get what is meant by "JSON-semantics" and "SQL-semantics". AFAIK the former is the data model of objects, arrays, strings, numbers, boolean, and null as defined by JSON (RFC 8259). SQL is not mentioned at all in the current specification.

roll · 2024-02-20T14:29:46Z

@nichtich
I use "SQL-semantics" only for this discussion; I don't think we need to introduce it to the specification. So basically:

array: one cell containing a JSON array. This JSON array may contain arbitrary JSON elements, so the logical representation is the full spectrum of the JSON data model, meaning that the array is:
- untyped
- arbitrarily nested

list: one cell that, if represented lexically, is a string delimited into parts by a delimiter. The logical representation is an array of values that is:
- strongly typed
- not nested

I think that's really close to your comment above

khusmann · 2024-02-20T19:01:15Z

I like how @roll is posing the potential distinctions between a list and array field type, but I'm uncomfortable with some of those distinctions being tied to their physical representation, rather than logical attributes of the collection.

Firstly, I think we can agree that list and array types are ordered collections that allow duplicate elements. (In contrast to a collection type like set).

The remaining attributes I see that might distinguish these types are:

- Heterogeneous or homogeneous child element type
- Fixed length or variable length

Taking a quick review across implementations I'm familiar with:

R

list: heterogeneous elements, variable length
vector: homogeneous elements, variable length

Python

list: heterogeneous elements, variable length
tuple: heterogeneous elements, fixed length
np.array: homogeneous elements, variable length

Python (Polars)

list: homogeneous elements, variable length
struct: heterogeneous elements, fixed length

SQL (DuckDB)

list: homogeneous elements, variable length
array: homogeneous elements, fixed length

SQL (PostgreSQL)

No list type
array: homogeneous elements, variable length

Presently, the array field type in frictionless allows heterogeneous elements, and is variable length.

Adding the itemType property would allow us to define homogeneous element types.

We could similarly use itemType to define heterogeneous, fixed-length types via an array of field types:

{
  "itemType": [
    { "type": "string" },
    { "type": "number" }
  ]
}

Nested elements (arrays of arrays) are equally representable by itemType, but recursive types are not (and I think that's ok… recursive types are a can of worms!) For example a 2D variable length homogeneous array of integers:

{
  "type": "array",
  "itemType": {
     "type": "array",
     "itemType": {
        "type": "integer"
     }
  }
}

So the only ordered collection type we're missing with this scheme is a homogeneous, fixed-length array. But rather than making a completely new field type for this, I would suggest we simply add a length property to the existing array type. In summary, we can safely specify all possibilities for ordered collection types via the array field type:

| | heterogeneous | homogeneous |
| --- | --- | --- |
| variable length | `{ "type": "array" }` | `{ "type": "array", "itemType": FieldObject }` |
| fixed length | `{ "type": "array", "itemType": FieldObject[] }` or `{ "type": "array", "length": number }` (for untyped child elements) | `{ "type": "array", "itemType": FieldObject, "length": number }` |

Defining array in this way allows us to statically type / represent any type of ordered collection…(which would be amazing for validation). At the end of the day, I don't see a need to create a new list field type.
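To make the four variants concrete, here is a minimal Python sketch of a checker that reads `itemType` and `length` as proposed in this comment. Both properties are proposals from this thread, not part of the current specification, and `PY_TYPES` and `conforms` are hypothetical names:

```python
# Hypothetical mapping from field type names to Python types (illustration only).
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "array": list}


def conforms(value, field):
    """Check a parsed value against a field descriptor using itemType / length."""
    expected = PY_TYPES[field["type"]]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, expected):
        return False
    if field["type"] != "array":
        return True
    if "length" in field and len(value) != field["length"]:
        return False
    item_type = field.get("itemType")
    if item_type is None:  # untyped items: heterogeneous
        return True
    if isinstance(item_type, list):  # one descriptor per position: fixed length
        return len(value) == len(item_type) and all(
            conforms(item, sub) for item, sub in zip(value, item_type)
        )
    return all(conforms(item, item_type) for item in value)  # homogeneous
```

Because `conforms` recurses through `itemType`, the nested 2D integer example above is handled without any extra code.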

Now back to delimiter:

delimiter, by contrast, I think is a property of the physical value cell representing the collection (as format does), and should not be conflated into the definition of collection logical type.

That way all four distinct collection variations (heterogeneous+fixed, heterogeneous+variable, homogeneous+fixed, homogeneous+variable) can have their values equally represented via JSON physical value cells (e.g. "[1, 2, 3]"), as well as simple delimited string physical value cells (e.g. "1, 2, 3").

khusmann · 2024-02-20T19:42:00Z

Ah now I think I'm seeing @roll and @nichtich 's earlier point about nesting & delimited values -- the delimited format cannot represent nested types without grouping characters. E.g. [[1, 2], [2, 3], [4, 1]] vs 1,2,2,3,4,1. To do that, we'd need to define additional grouping characters (e.g. with groupPattern = "({})" you could parse (1,2), (2, 3), (4, 1)).

Or, just encode that as a limitation of the array type in that special case as the current proposal basically does already e.g. "When format = "delimited", array.itemType MUST NOT include nested field types (e.g. other arrays)". I'd still prefer that to a new field type, for the reasons I mention above.
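The loss of grouping can be seen in a short Python sketch (the helper name is made up for illustration): two differently shaped nested arrays serialize to the same delimited string, so the original nesting cannot be recovered from the cell alone.

```python
def flatten_to_delimited(nested, delimiter=","):
    """Serialize a 2D array as a flat delimited string (grouping is lost)."""
    return delimiter.join(str(item) for group in nested for item in group)


pairs = [[1, 2], [2, 3], [4, 1]]
triples = [[1, 2, 2], [3, 4, 1]]

pairs_cell = flatten_to_delimited(pairs)
triples_cell = flatten_to_delimited(triples)
```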

roll · 2024-02-21T08:47:44Z

@khusmann
Amazing analysis!

> but I'm uncomfortable with some of those distinctions being tied to their physical representation, rather than logical attributes of the collection.

I felt the same initially, but then I realized that JSON actually has a strict and well-defined data model on the logical level as well. So, for example, [new Date()] on the logical level won't make a valid array field value, and that actually makes a lot of sense, as it can't be serialized anyway.

> So the only ordered collection type we're missing with this scheme is a homogeneous, fixed-length array.

I think it's achievable via constraints. For example, if we have this list type:

type: list
itemType: string
constraints:
  minLength: 5
  maxLength: 5

Of course, at the model (programming) level, an implementation would do well to offer a shortcut for setting the length.

Let me create a new proposal for the list type to summarize the discussion and clarify the idea

roll · 2024-02-21T10:35:09Z

I have created a new PR for the list type, please take a look: #38. Note that #34 (this PR) and #38 are mutually exclusive (choose whichever you prefer).

khusmann · 2024-02-21T16:47:29Z

> JSON has a strict and well-defined data model on the logical level as well. So, for example, [new Date()] on the logical level won't make a valid array field value, and that actually makes a lot of sense, as it can't be serialized anyway.

Yes, but we can still represent DateField[] with the itemType property on a JSON array:

{
  "name": "dateField",
  "type": "array",
  "itemType": {
    "type": "date",
    "format": "yyyy-mm-dd"
  }
}

Where it would parse JSON strings of the form '["2024-02-21", "2024-02-22", "2024-02-23"]', etc.

Equivalently with delimiter:

{
  "name": "dateField",
  "type": "array",
  "format": "delimited",
  "delimiter": ",",
  "itemType": {
    "type": "date",
    "format": "yyyy-mm-dd"
  }
}

Where it would parse delimited strings of the form: "2024-02-21,2024-02-22,2024-02-23"
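As a sketch of how both physical forms could lead to the same logical value, the Python below parses either cell into `date` objects. `parse_date_array` is a hypothetical helper, and `date.fromisoformat` is used here as a stand-in for whatever date parsing the `itemType` would imply:

```python
import json
from datetime import date


def parse_date_array(cell, fmt="default", delimiter=","):
    """Parse a cell into dates from either the JSON or the delimited form."""
    items = cell.split(delimiter) if fmt == "delimited" else json.loads(cell)
    return [date.fromisoformat(item) for item in items]


from_json = parse_date_array('["2024-02-21", "2024-02-22", "2024-02-23"]')
from_delimited = parse_date_array(
    "2024-02-21,2024-02-22,2024-02-23", fmt="delimited"
)
```

Only the first step differs between the two forms; the per-item casting is shared.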

> I think it's achievable via constraints.

In the spirit of "parse, don't validate", I would argue it's different. A fixed-length array has different capabilities than a variable-length array with constraints. For example, you know you can always explode a fixed array-column into a separate column for each element, and you can statically determine the exact # of columns you'll need. You also know that you can stack fixed-length arrays of the same size and they will always align. These are important static implications, and are not guarantees of variable-length arrays with constraints. Because we get these special capabilities of the type that occur when the length is fixed, I think it's deserving of a type-level attribute. (But I'll wait for the list / array discussion to resolve before making a separate proposal)
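A small Python sketch of that static implication, assuming the proposed (not yet specified) `length` property: with a fixed length, the exploded column names follow from the field descriptor alone, without reading any data. The function name and the `name_i` naming are hypothetical:

```python
def exploded_columns(field):
    """Column names for a fixed-length array field, from the schema alone."""
    if "length" not in field:
        raise ValueError("variable-length array: column count depends on the data")
    return [f"{field['name']}_{i}" for i in range(1, field["length"] + 1)]


columns = exploded_columns({"name": "point", "type": "array", "length": 3})
```

For a variable-length array with `minLength`/`maxLength` constraints, the same answer would only hold after validation has passed, which is the distinction being drawn here.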

roll · 2024-03-11T10:10:39Z

CLOSED in favor of #38

Support delimited array

7797313

roll mentioned this pull request Feb 6, 2024

Frictionless Standards (v2). Feb 5 - Feb 11 frictionlessdata/specs#877

Closed

peterdesmet reviewed Feb 7, 2024

View reviewed changes

peterdesmet mentioned this pull request Feb 7, 2024

Support delimited arrays frictionlessdata/frictionless-r#173

Open

roll mentioned this pull request Feb 12, 2024

Frictionless Standards (v2). Live Tracker frictionlessdata/specs#853

Open

roll mentioned this pull request Feb 20, 2024

Added arrayItem property #31

Closed

roll mentioned this pull request Feb 21, 2024

Add new list field type for typed collections, lexically delimiter-based #38

Merged

roll closed this Mar 11, 2024

roll mentioned this pull request Mar 11, 2024

Frictionless Standards (v2). Mar 1 - Mar 31 frictionlessdata/specs#892

Closed
