
Support delimited format for the array field #34

Closed


Member

@roll roll commented Feb 6, 2024


##### Delimited Format

The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `MUST` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.
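A minimal Python sketch of the splitting rule described above. The function name `split_delimited` is hypothetical and not part of any specification or library; the only point illustrated is that the value is split on the delimiter and nothing else is done to it.

```python
def split_delimited(lexical: str, delimiter: str) -> list[str]:
    """Split a delimited array value without any further processing."""
    # str.split with an explicit separator does not trim white space,
    # interpret quotes, or collapse repeated delimiters.
    return lexical.split(delimiter)


# Items are returned exactly as they appear between delimiters.
tags = split_delimited("red|yellow", "|")

# Quotes and spaces are kept verbatim: this is not CSV parsing.
raw = split_delimited('a, "b, c"', ",")
```

Values that need quoting or escaping are out of scope for this sketch, in line with the requirement to use the JSON-based format for them.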
Member


Suggested change
The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `MUST` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.
The `array` field provides an ability to lexically represent data as a delimited value with `format` set to `delimited`. In this case a data consumer `SHOULD` split the lexical representation of the value using the `delimiter` property and `MUST NOT` do any additional processing e.g. quotes interpretation, white space removal, and others. The `delimited` format is not CSV and does not support any Table Dialect options except for the `delimiter`. In case of complex array items that require additional escaping or quotation, a data producer `MUST` use the `default` JSON-based format.


@nichtich nichtich Feb 12, 2024


Then how should white space be removed? With an additional key to trim data, or a regex as the delimiter?

@peterdesmet
Member

This is a property that we may adopt in Camtrap DP (e.g. for tags which use | as delimiter).

However, I don't think implementors MUST split the array. It should be fine for them to leave it as a string.

@pschumm

pschumm commented Feb 17, 2024

I am generally supportive of this, and would note that use of | as a delimiter is actually fairly common (though personally I prefer a comma), so it should be supported. And I agree with @peterdesmet that while implementors may wish to split the array, they shouldn't be required to if they are targeting software that doesn't support array values within a variable.

@khusmann
Contributor

khusmann commented Feb 19, 2024

I support this as well. I was at first a little wary of the "no additional processing" part because of cases involving whitespace after delimiters (e.g. "Item 1, Item 2, Item 3"), but then realized these can be handled with "delimiter": ", ". In any case, I think it's better to start off more strict, as we are doing here.

> However, I don't think implementors MUST split the array. It should be fine for them to leave it as a string.

I agree -- I think most features should be opt-in for implementations, especially when there are graceful ways to fall back.

Another acceptable way implementors may handle array fields is by exploding the items into separate columns. E.g.:

foo
[1, 2]
[1]
[1, 2, 3]

can be exploded to:

| foo_1 | foo_2 | foo_3 |
| --- | --- | --- |
| 1 | 2 | null |
| 1 | null | null |
| 1 | 2 | 3 |
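The positional explosion shown above can be sketched in Python as follows. The helper name `explode_positional` and the `<name>_<index>` column naming are assumptions taken from the example, not from any specification; `None` stands in for null.

```python
def explode_positional(name, rows):
    """Explode a column of arrays into one column per position."""
    # The widest row decides how many columns are needed.
    width = max((len(row) for row in rows), default=0)
    columns = [f"{name}_{i}" for i in range(1, width + 1)]
    # Shorter rows are padded with None on the right.
    table = [list(row) + [None] * (width - len(row)) for row in rows]
    return columns, table


columns, table = explode_positional("foo", [[1, 2], [1], [1, 2, 3]])
```

Note that the number of columns is only known after scanning all rows.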

This is exceedingly common for the special case of categorical multiselect items. Here's an example using the "liked_colors" field I define here:

liked_colors
["red"]
["red", "yellow" ]
["orange", "green"]

is often exploded to:

| liked_colors_red | liked_colors_orange | liked_colors_yellow | liked_colors_green | etc. |
| --- | --- | --- | --- | --- |
| true | false | false | false | ... |
| true | false | true | false | ... |
| false | true | false | true | ... |
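A hedged Python sketch of the multiselect case: each known category becomes a boolean indicator column. The function name `explode_multiselect` and the explicit `categories` argument are assumptions made for illustration; where the category list comes from (for example, a categorical field definition) is left open.

```python
def explode_multiselect(name, rows, categories):
    """Explode a column of category arrays into boolean indicator columns."""
    columns = [f"{name}_{category}" for category in categories]
    # One boolean per category: True when the row contains that category.
    table = [[category in row for category in categories] for row in rows]
    return columns, table


rows = [["red"], ["red", "yellow"], ["orange", "green"]]
categories = ["red", "orange", "yellow", "green"]
columns, table = explode_multiselect("liked_colors", rows, categories)
```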

@nichtich

nichtich commented Feb 19, 2024

I'd rather expect the foo example to be exploded to

| foo | foo | foo |
| --- | --- | --- |
| 1 | 2 | null |
| 1 | null | null |
| 1 | 2 | 3 |

But such a transformation could also be kept outside the scope of the specification.

By the way, this key is called arrayDelimiter in the Neo4j CSV import format. It may be a better name, to avoid confusion with delimiter from the CSV Dialect specification. However, the name "array" is confusing, as this does not refer to "format": "array". Maybe call it listDelimiter and use "type": "list" instead of "type": "delimited"?

Example of equivalent (actually?) data in two forms:

{ "name": "foo", "type": "list", "listDelimiter": "|" }

foo
1|2
1
1|2|3

{ "name": "foo", "type": "array" }

foo
["1","2"]
["1"]
["1","2","3"]
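To check the claimed equivalence of the two forms, here is a small Python sketch that parses both and compares the results. The function names are hypothetical; `listDelimiter` is the key proposed in this comment, not an existing property.

```python
import json


def parse_delimited(cell, list_delimiter):
    """Parse a delimited cell into a list of strings."""
    return cell.split(list_delimiter)


def parse_json_array(cell):
    """Parse a cell holding a JSON array."""
    value = json.loads(cell)
    if not isinstance(value, list):
        raise ValueError("cell does not contain a JSON array")
    return value


delimited_cells = ["1|2", "1", "1|2|3"]
json_cells = ['["1","2"]', '["1"]', '["1","2","3"]']

# Both forms yield the same sequences of strings, row by row.
same = [
    parse_delimited(d, "|") == parse_json_array(j)
    for d, j in zip(delimited_cells, json_cells)
]
```

The equivalence holds here only because every JSON item is a string; a JSON array such as `[1, 2]` would parse to integers, while the delimited form yields strings.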

@khusmann
Contributor

khusmann commented Feb 20, 2024

@nichtich I agree, the exploding method can safely be left to the implementation.

I'm hesitant to add a "type": "list", because it is an equivalent logical type to "type": "array"... the difference in question is regarding the physical value representation of the field, right? Or is there a logical difference between array and list fields I'm missing?

I don't mind delimiter used in this context; I think it's concise and descriptive. I think its meaning is clear within the array field scope, just as dialect.delimiter is clear within the CSVDialect scope.

@nichtich

> I'm hesitant to add a "type": "list", because it is an equivalent logical type to "type": "array"... the difference in question is regarding the physical value representation of the field, right?
> Or is there a logical difference between array and list fields I'm missing?

  • list or delimited (which I'd try to avoid, as this name is already used in another context) is one cell delimited into parts. How to further define these parts has not been discussed yet, so they stick to string as the default. The logical representation is a sequence of strings.
  • array is one cell containing a JSON array. This JSON array may contain arbitrary JSON elements, so the logical representation is the full spectrum of JSON data model.

So both differ in physical value representation and in logical representation (unless array is also restricted to a JSON array of plain strings, but this is not the case). Therefore it would be helpful to avoid the term "array" in this issue and use another name such as "list" or "sequence" instead.

@roll
Member Author

roll commented Feb 20, 2024

@nichtich
That's a good point regarding logical semantics. It's true that JSON-compatible objects and arrays imply exact "data rules" on the logical level too. Being tied to JSON semantics logically means that e.g. array item types are restricted to JSON types.

I'm wondering if we can really do what you propose, i.e. have 2 types completely tied to JSON semantics (lexically and logically):

{"key": "value"}

type: object
jsonSchema: {}

["value1", "value2"]

type: array
jsonSchema: {}

And having 1 type tied to SQL-semantics that solves both delimited (#34) and itemType (#31) problems:

value1,value2

type: list
delimiter: "," (default)
itemType: string (default) | number | integer | boolean | datetime | date | time | year | yearmonth

where itemType is required (to make it compatible with SQL) but defaults to string. Also, it's parsed on the Table Schema level, supporting only default type formats, while object/array are parsed by a JSON parser. The difference between array and list will also be typed/not-typed. So the list will mimic an SQL array (we can also adjust the allowed types).
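A rough Python sketch of how such a typed list could be parsed: split on the delimiter, then parse every part with the same item parser. `parse_list` and `ITEM_PARSERS` are hypothetical names, only three of the proposed item types are covered, and Python's `int()` and `float()` are stand-ins that are more lenient than Table Schema's default formats (for example, they accept surrounding white space).

```python
# Stand-in parsers for a subset of the proposed itemType values.
ITEM_PARSERS = {
    "string": str,
    "integer": int,
    "number": float,
}


def parse_list(cell, item_type="string", delimiter=","):
    """Split a cell on the delimiter and parse each part as item_type."""
    try:
        parse_item = ITEM_PARSERS[item_type]
    except KeyError:
        raise ValueError(f"unsupported itemType in this sketch: {item_type}") from None
    # Every item goes through the same parser, so the result is homogeneous.
    return [parse_item(part) for part in cell.split(delimiter)]


strings = parse_list("value1,value2")
integers = parse_list("1,2,3", item_type="integer")
```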

@roll roll mentioned this pull request Feb 20, 2024
@nichtich

Sorry, I don't get what is meant by "JSON-semantics" and "SQL-semantics". AFAIK the former is the data model of objects, arrays, strings, numbers, boolean, and null as defined by JSON (RFC 8259). SQL is not mentioned at all in the current specification.

@roll
Member Author

roll commented Feb 20, 2024

@nichtich
I use "SQL-semantics" only for this discussion; I don't think we need to introduce it to the specification. So basically:

  • array: is one cell containing a JSON array. This JSON array may contain arbitrary JSON elements, so the logical representation is the full spectrum of JSON data model, meaning that the array is:
    • not-typed
    • arbitrarily nested
  • list: is one cell, if represented lexically, a string delimited into parts by a delimiter. The logical representation is the array of values that is:
    • strongly typed
    • not nested

I think that's really close to your comment above

@khusmann
Contributor

khusmann commented Feb 20, 2024

I like how @roll is posing the potential distinctions between a list and array field type, but I'm uncomfortable with some of those distinctions being tied to their physical representation, rather than logical attributes of the collection.

Firstly, I think we can agree that list and array types are ordered collections that allow duplicate elements. (In contrast to a collection type like set).

The remaining attributes I see that might distinguish these types are:

  • Heterogeneous or homogeneous child element type
  • Fixed length or variable length

Taking a quick review across implementations I'm familiar with:

R

  • list: heterogeneous elements, variable length
  • vector: homogeneous elements, variable length

Python

  • list: heterogeneous elements, variable length
  • tuple: heterogeneous elements, fixed length
  • np.array: homogeneous elements, variable length

Python (Polars)

  • list: homogeneous elements, variable length
  • struct: heterogeneous elements, fixed length

SQL (DuckDB)

  • list: homogeneous elements, variable length
  • array: homogeneous elements, fixed length

SQL (PostgreSQL)

  • No list type
  • array: homogeneous elements, variable length

Presently, the array field type in frictionless allows heterogeneous elements, and is variable length.

Adding the itemType property would allow us to define homogeneous element types.

We could similarly use itemType to define heterogeneous, fixed-length types via an array of field types:

{
  "itemType": [
    { "type": "string" },
    { "type": "number" }
  ]
}
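As an illustration of the heterogeneous, fixed-length case, the following Python sketch checks a parsed value against an `itemType` given as a list of field descriptors. `matches_tuple` and `TYPE_CHECKS` are hypothetical names, and only the two types used in the example are covered.

```python
# Checks for the two types used in the example above. bool is excluded
# from "number" because Python treats bool as a subclass of int.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def matches_tuple(value, item_types):
    """Check that value has one item per descriptor, each of the right type."""
    if not isinstance(value, list) or len(value) != len(item_types):
        return False
    return all(
        TYPE_CHECKS[field["type"]](item)
        for item, field in zip(value, item_types)
    )


item_types = [{"type": "string"}, {"type": "number"}]
ok = matches_tuple(["abc", 1.5], item_types)
too_short = matches_tuple(["abc"], item_types)
```

The length of the `itemType` list doubles as the fixed length of the array, which is why no separate length property is needed in this case.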

Nested elements (arrays of arrays) are equally representable by itemType, but recursive types are not (and I think that's ok… recursive types are a can of worms!) For example a 2D variable length homogeneous array of integers:

{
  "type": "array",
  "itemType": {
     "type": "array",
     "itemType": {
        "type": "integer"
     }
  }
}
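The nested descriptor above lends itself to a recursive check. This Python sketch assumes the proposed `itemType` property and supports only `integer` and `array`; the name `conforms` is hypothetical.

```python
def conforms(value, field):
    """Recursively check a parsed value against a field descriptor."""
    kind = field["type"]
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "array":
        if not isinstance(value, list):
            return False
        item_type = field.get("itemType")
        if item_type is None:
            # No itemType: items are untyped, as with today's array field.
            return True
        return all(conforms(item, item_type) for item in value)
    raise ValueError(f"unsupported type in this sketch: {kind}")


grid = {
    "type": "array",
    "itemType": {"type": "array", "itemType": {"type": "integer"}},
}
valid = conforms([[1, 2], [3]], grid)
invalid = conforms([[1, "2"]], grid)
```

The descriptor nests to a fixed depth, so the recursion always terminates; a truly recursive type would need a self-reference that this scheme does not offer.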

So the only ordered collection type we're missing with this scheme is a homogeneous, fixed-length array. But rather than making a completely new field type for this, I would suggest we simply add a length property to the existing array type. In summary, we can safely specify all possibilities for ordered collection types via the array field type:

| | heterogeneous | homogeneous |
| --- | --- | --- |
| variable length | `{ "type": "array" }` | `{ "type": "array", "itemType": FieldObject }` |
| fixed length | `{ "type": "array", "itemType": FieldObject[] }` or `{ "type": "array", "length": number }` (for untyped child elements) | `{ "type": "array", "itemType": FieldObject, "length": number }` |

Defining array in this way allows us to statically type / represent any type of ordered collection…(which would be amazing for validation). At the end of the day, I don't see a need to create a new list field type.

Now back to delimiter:

delimiter, by contrast, I think is a property of the physical value cell representing the collection (as format is), and should not be conflated with the definition of the collection's logical type.

That way all four distinct collection variations (heterogeneous+fixed, heterogeneous+variable, homogeneous+fixed, homogeneous+variable) can have their values equally represented via JSON physical value cells (e.g. "[1, 2, 3]"), as well as simple delimited string physical value cells (e.g. "1, 2, 3").

@khusmann
Contributor

khusmann commented Feb 20, 2024

Ah now I think I'm seeing @roll and @nichtich 's earlier point about nesting & delimited values -- the delimited format cannot represent nested types without grouping characters. E.g. [[1, 2], [2, 3], [4, 1]] vs 1,2,2,3,4,1. To do that, we'd need to define additional grouping characters (e.g. with groupPattern = "({})" you could parse (1,2), (2, 3), (4, 1)).

Or, just encode that as a limitation of the array type in that special case as the current proposal basically does already e.g. "When format = "delimited", array.itemType MUST NOT include nested field types (e.g. other arrays)". I'd still prefer that to a new field type, for the reasons I mention above.
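The loss of grouping mentioned above can be demonstrated with a short Python sketch: two differently nested arrays serialize to the same delimited string, so the nesting cannot be recovered from it. The function name `flatten_to_delimited` is hypothetical.

```python
def flatten_to_delimited(nested, delimiter=","):
    """Serialize a list of lists as a single flat delimited string."""
    return delimiter.join(str(item) for group in nested for item in group)


pairs = [[1, 2], [2, 3], [4, 1]]
triples = [[1, 2, 2], [3, 4, 1]]

# Both nestings collapse to "1,2,2,3,4,1".
flat_pairs = flatten_to_delimited(pairs)
flat_triples = flatten_to_delimited(triples)
```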

@roll
Member Author

roll commented Feb 21, 2024

@khusmann
Amazing analysis!

> but I'm uncomfortable with some of those distinctions being tied to their physical representation, rather than logical attributes of the collection.

I felt the same initially, but then I realized that JSON actually has a strict and well-defined data model on the logical level as well. So, for example, having [new Date()] on a logical level won't make a valid array field value, and it actually makes a lot of sense, as it can't be serialized anyway.

> So the only ordered collection type we're missing with this scheme is a homogeneous, fixed-length array.

I think it's achievable via constraints. For example, if we have this list type:

type: list
itemType: string
constraints:
  minLength: 5
  maxLength: 5

Of course, on the model (programming) level, an implementation would do well to offer a shortcut for setting the length.
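A minimal Python sketch of how the length constraints above could be checked after parsing. The function name `satisfies_length` is hypothetical; `minLength` and `maxLength` are the constraint names used in the example.

```python
def satisfies_length(items, constraints):
    """Check a parsed list against optional minLength/maxLength constraints."""
    min_length = constraints.get("minLength")
    max_length = constraints.get("maxLength")
    if min_length is not None and len(items) < min_length:
        return False
    if max_length is not None and len(items) > max_length:
        return False
    return True


# Equal bounds pin the list to exactly five items.
fixed_five = {"minLength": 5, "maxLength": 5}
ok = satisfies_length(["a", "b", "c", "d", "e"], fixed_five)
too_short = satisfies_length(["a", "b"], fixed_five)
```

This is a check applied to each value after parsing, not a property of the type itself.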


Let me create a new proposal for the list type to summarize the discussion and clarify the idea

@roll
Member Author

roll commented Feb 21, 2024

I have created a new PR for the list type, please take a look: #38. Note that #34 (this) and #38 are mutually exclusive (choose the one you prefer).

@khusmann
Contributor

khusmann commented Feb 21, 2024

> JSON has a strict and well-defined data model on the logical level as well. So, for example, having [new Date()] on a logical level won't make a valid array field value, and it actually makes a lot of sense, as it can't be serialized anyway.

Yes, but we can still represent DateField[] with the itemType property on a JSON array:

{
  "name": "dateField",
  "type": "array",
  "itemType": {
    "type": "date",
    "format": "yyyy-mm-dd"
  }
}

Where it would parse JSON strings of the form '["2024-02-21", "2024-02-22", "2024-02-23"]', etc.

Equivalently with delimiter:

{
  "name": "dateField",
  "type": "array",
  "format": "delimited",
  "delimiter": ",",
  "itemType": {
    "type": "date",
    "format": "yyyy-mm-dd"
  }
}

Where it would parse delimited strings of the form: "2024-02-21,2024-02-22,2024-02-23"
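To make the two date examples concrete, here is a Python sketch that parses the JSON form and the delimited form into the same list of dates. It assumes the "yyyy-mm-dd" format shown above means ISO 8601 calendar dates; `parse_date_items` is a hypothetical helper, and `itemType` is the proposed property, not an existing one.

```python
import json
from datetime import date


def parse_date_items(items):
    """Parse each string item as an ISO 8601 calendar date (YYYY-MM-DD)."""
    return [date.fromisoformat(item) for item in items]


# JSON physical representation: the JSON parser yields strings, which are
# then parsed as dates according to itemType.
from_json = parse_date_items(
    json.loads('["2024-02-21", "2024-02-22", "2024-02-23"]')
)

# Delimited physical representation: split first, then parse each item.
from_delimited = parse_date_items("2024-02-21,2024-02-22,2024-02-23".split(","))
```

Only the first step differs between the two representations; item parsing is shared.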

> I think it's achievable via constraints.

In the spirit of "parse, don't validate", I would argue it's different. A fixed-length array has different capabilities than a variable-length array with constraints. For example, you know you can always explode a fixed array-column into a separate column for each element, and you can statically determine the exact # of columns you'll need. You also know that you can stack fixed-length arrays of the same size and they will always align. These are important static implications, and are not guarantees of variable-length arrays with constraints. Because we get these special capabilities of the type that occur when the length is fixed, I think it's deserving of a type-level attribute. (But I'll wait for the list / array discussion to resolve before making a separate proposal)
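The static guarantee described above can be sketched in Python: with a type-level length, the exploded column names follow from the field descriptor alone, before any data is read. The `length` property is the one proposed earlier in this thread; `static_columns` and the `<name>_<index>` naming are assumptions.

```python
def static_columns(field):
    """Derive exploded column names from the field descriptor alone."""
    length = field.get("length")
    if length is None:
        # Without a type-level length, the data would have to be scanned.
        raise ValueError("column count cannot be determined statically")
    return [f"{field['name']}_{i}" for i in range(1, length + 1)]


columns = static_columns({"name": "foo", "type": "array", "length": 3})
```

For a variable-length array, even one constrained by minLength and maxLength, this sketch has nothing to read the column count from.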

@roll
Member Author

roll commented Mar 11, 2024

CLOSED in favor of #38


Successfully merging this pull request may close these issues.

Future possibility for delimiter-separated list for arrays (instead of JSON array)?