Allow any type for `true/falseValues`? #621

akariv · 2018-09-18T12:10:35Z

For some reason this has been limited to strings only, and there's no actual need for that.
There are quite a few cases where boolean values are represented with 0/1 (integers) in the source data.

I propose removing the type restriction here, so that you could specify (as an example):

{
   "name": "my_boolean",
   "type": "boolean",
   "trueValues": [1],
   "falseValues": [0]
}
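
To illustrate the proposal, here is a minimal sketch of how an implementation might cast values once `trueValues` and `falseValues` accept any JSON type. The `cast_boolean` helper is hypothetical and not taken from any Frictionless library. It compares type as well as value so that the string "1" is not confused with the integer 1.

```python
def cast_boolean(value, true_values, false_values):
    """Hypothetical helper: map a raw value to a bool using user-supplied lists."""
    # Native booleans are already logical values and pass through unchanged.
    if isinstance(value, bool):
        return value
    # Match on type and value, so "1" (str) never matches 1 (int).
    for candidate in true_values:
        if type(candidate) is type(value) and candidate == value:
            return True
    for candidate in false_values:
        if type(candidate) is type(value) and candidate == value:
            return False
    raise ValueError(f"{value!r} is not a recognised boolean representation")


# Field descriptor from the example above.
field = {"name": "my_boolean", "type": "boolean", "trueValues": [1], "falseValues": [0]}

print(cast_boolean(1, field["trueValues"], field["falseValues"]))  # True
print(cast_boolean(0, field["trueValues"], field["falseValues"]))  # False
```

With this descriptor the string "1" is rejected, because only the integer 1 is listed as a true value.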

akariv · 2018-09-18T12:10:46Z

/cc @roll @zelima

roll · 2018-09-19T07:39:23Z

In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.

Also, this section seems bound to the string representation.

akariv · 2018-09-19T08:13:25Z

Yes, I know, except a physical representation is not necessarily a string...

The physical representation of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation may have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).

I think this is an error in the spec - there's an issue I opened there.

roll · 2024-01-03T15:59:14Z

It took me a while to think about it, and based on the current Table Schema concept of physical and logical separation (I'm not sure "physical" is a good word here; it's more like "textual"), it seems to me that this issue needs to be closed as wontfix.

Of course, there are use cases where boolean fields are represented with integers, but strictly speaking, if a logical value is an integer it must be marked invalid against a boolean field. By my understanding, value substitution is part of the data casting process, and we don't do data casting in general for already typed data.
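
A rough sketch of this stricter reading, for comparison: substitution applies only to string (textual) values, and an already typed integer is rejected. The function name `cast_boolean_strict` is hypothetical, and the default lists are assumed to match the defaults given in the Table Schema spec.

```python
# Assumed Table Schema defaults for boolean fields.
DEFAULT_TRUE_VALUES = ["true", "True", "TRUE", "1"]
DEFAULT_FALSE_VALUES = ["false", "False", "FALSE", "0"]


def cast_boolean_strict(value, true_values=DEFAULT_TRUE_VALUES,
                        false_values=DEFAULT_FALSE_VALUES):
    """Hypothetical helper: only strings are substituted; typed data is not cast."""
    # A native boolean is already a valid logical value.
    if isinstance(value, bool):
        return value
    # Value substitution happens only for textual input.
    if isinstance(value, str):
        if value in true_values:
            return True
        if value in false_values:
            return False
    # Anything else, including the integers 0 and 1, is invalid for a boolean field.
    raise ValueError(f"{value!r} is invalid for a boolean field")


print(cast_boolean_strict("1"))   # True
print(cast_boolean_strict(True))  # True
```

Calling `cast_boolean_strict(1)` raises `ValueError` under this reading, which is the behaviour the original proposal wants to relax.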

If the above is wrong I think the change should affect missingValues as well.

I'll ask the WG to discuss this issue.

peterdesmet · 2024-01-08T14:51:31Z

The PR frictionlessdata/datapackage#5 allows non-string values. Do I understand this issue is solved then?

roll · 2024-01-08T14:57:04Z

Hi @peterdesmet,

Sorry for the confusion. frictionlessdata/datapackage#5 has been reverted and the issue has returned to discussion (I was misled by #864).

peterdesmet · 2024-01-08T16:12:58Z

If I understand correctly, the suggestion by @roll is to keep requiring the values provided in missingValues, trueValues and falseValues to be strings.

If so, I'm fine with that suggestion.

pwalsh · 2024-01-08T19:48:08Z

@roll I think @akariv 's suggestion is preferable

akariv · 2024-01-09T06:10:46Z

This is actually a very good example of why the distinction between logical and representation (physical, lexical...) values is so important, and how making that distinction would have made this issue very simple to resolve.

Generally we use two types of values in the spec, without distinction - on the one hand, we talk about logical values a lot. For example, a datetime value points to a specific point in time. A number will point to a specific point on the real number line. A boolean can have two distinct logical values, a truthy one and a falsey one. And so on and so forth.

The representation of these values in data files that may be described by a data package might also vary. While we're used to thinking about CSV files, where the issues are usually dates with various formats or the decimal character of a number, other data formats use different data types - i.e. not strings - to represent data. For example, Excel might use an integer to represent dates or booleans. JSON files will have native boolean values, but still need help when dates need to be decoded.
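
As a small illustration of that difference, the sketch below reads the same made-up data from CSV and from JSON using only the Python standard library. The sample rows are invented for the example.

```python
import csv
import io
import json

# Invented sample data: the same two rows in CSV and in JSON.
csv_text = "id,active\n1,1\n2,0\n"
json_text = '[{"id": 1, "active": true}, {"id": 2, "active": 0}]'

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# CSV carries no type information: every cell arrives as a string.
print(repr(csv_rows[0]["active"]))   # '1'

# JSON carries primitive types: a native boolean and an integer.
print(repr(json_rows[0]["active"]))  # True
print(repr(json_rows[1]["active"]))  # 0
```

A string-only `trueValues` list can describe the CSV case, while the integer 0 in the JSON case is a representation value that strings cannot match exactly.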

In the spec we sometimes refer to the logical value - for example, when defining constraints for the max value of a number, we don't care how it was represented, only its logical value. In other cases, we refer to the representation of the value - for example, when defining the 'missingValues' field, we declare which representations of values should be ignored.

What we need to do is specify, in every location where a value is to be given, whether it's a logical value or a representation value, and have the same rules apply to all instances of the same kind. Obviously, this specific issue requires a representation value, so if we decide that we allow Excel files to be described by data packages, I think the conclusion on the correct solution here should be pretty clear :)

khusmann · 2024-01-09T20:52:57Z

@akariv I appreciate your distinctions & definitions here. I think making the distinction between logical and representation values is especially salient re: value labels in categorical / ordinal types (as discussed in #844).

Conceptually, a boolean logical type with true/falseValues is a special case of an ordinal / categorical logical type with value labels (i.e. a binary categorical variable with logical values "true" and "false").

So as we consider the approach & language to adopt here with boolean types in distinguishing logical vs representations (i.e. label vs underlying representation values), I think it would be good to be thinking about how the decisions here generalize to categorical / ordinal types and value labels so we can keep the approach / language there consistent.

Right now the categorical extension defines a mapping between representation and label in enumLabels, where the representation values are all strings (which is consistent with how strings are provided for missingValues, trueValues, and falseValues). Allowing any type for the keys of enumLabels opens a can of worms, because in a JSON object the keys need to be strings.
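
The key constraint is easy to see with Python's standard `json` module. The `enum_labels` mapping below is an invented example, not taken from the categorical extension.

```python
import json

# Invented mapping with integer keys (representation value -> label).
enum_labels = {0: "no", 1: "yes"}

# Serialising to JSON coerces the integer keys to strings...
encoded = json.dumps(enum_labels)
print(encoded)        # {"0": "no", "1": "yes"}

# ...and they stay strings after a round trip.
decoded = json.loads(encoded)
print(list(decoded))  # ['0', '1']
```

So a descriptor stored as JSON cannot distinguish the key 0 from the key "0" without some additional convention.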

Another potential point of logical / representation confusion is labels on missing values. If the logical missing label "PARTICIPANT_SKIPPED_ITEM" is represented in the data as -999, should we define missingValues = [-999] or missingValues = ["PARTICIPANT_SKIPPED_ITEM"]?

I'm 100% with you on the idea that making the distinction between logical & representation values from the get-go would have made this issue (and the value labels situation by extension) easy to solve. I think the challenge here is to balance conceptual correctness / internal consistency / historical compatibility. So rather than doing the surgery that would be required across the spec to represent Excel files or other binary formats "correctly", I think it might be easier to just let a limitation of the datapackage format be that it is designed for underlying textual data representations... That way, "representation values" (e.g. missingValues, trueValues, falseValues, enumLabels keys) are always strings.

Not ideal, I know (especially for representing floating point values!). But given that we're not allowing breaking changes in V2, it kind of limits our ability to do deep surgery... I'm open to more thoughts / brainstorming though!

akariv added the Table Schema label Sep 18, 2018

akariv mentioned this issue Sep 18, 2018

Allow casting non-strings to booleans as well frictionlessdata/tableschema-py#221

Merged

roll added this to Specifications in Frictionless General Mar 19, 2019

roll added this to the v2 milestone Apr 7, 2023

roll changed the title ~~allow any type for trueValues and falseValues in boolean types~~ Allow any type for trueValues and falseValues in boolean types Jan 3, 2024

roll added bug and removed bug labels Jan 3, 2024

roll mentioned this issue Jan 3, 2024

Don't require missing/true/falseValues to be a string frictionlessdata/datapackage#5

Merged

roll added the bug label Jan 3, 2024

roll closed this as completed in frictionlessdata/datapackage#5 Jan 3, 2024

roll reopened this Jan 3, 2024

roll removed the bug label Jan 3, 2024

roll changed the title ~~Allow any type for trueValues and falseValues in boolean types~~ Allow any type for missing/true/falseValues? Jan 3, 2024

roll added the help wanted label Jan 3, 2024

roll changed the title ~~Allow any type for missing/true/falseValues?~~ Allow any type for true/falseValues? Jan 3, 2024

roll removed the help wanted label Jan 3, 2024

roll added the proposal label Jan 3, 2024

This was referenced Jan 3, 2024

Frictionless Standards (v2). Jan 1 - Jan 7 #860

Closed

Clarify physical/logical representation in Table Schema #864

Open

roll added discussion and removed proposal labels Jan 4, 2024

roll self-assigned this Jan 5, 2024

roll mentioned this issue Jan 8, 2024

Frictionless Standards (v2). Live Tracker #853

Open

khusmann mentioned this issue Feb 19, 2024

Discourage usage of unnecessary union types in Table Schema frictionlessdata/datapackage#28

Merged

roll modified the milestones: v2-draft, v2-final Mar 11, 2024

roll linked a pull request Mar 18, 2024 that will close this issue

[Draft] Rename 'physical' to 'lexical' values of data frictionlessdata/datapackage#17

Open

roll removed this from the v2-final milestone Mar 27, 2024

roll linked a pull request Apr 3, 2024 that will close this issue

Data representation model frictionlessdata/datapackage#49

Open

roll added this to the v2-final milestone Apr 12, 2024

roll added proposal and removed discussion labels Apr 12, 2024
