Allow any type for true/falseValues? #621

Open
akariv opened this issue Sep 18, 2018 · 10 comments · May be fixed by frictionlessdata/datapackage#17 or frictionlessdata/datapackage#49

Comments

@akariv
Member

akariv commented Sep 18, 2018

For some reason trueValues and falseValues have been limited to strings only, and there's no actual need for that.
There are quite a few cases where boolean values are represented with 0/1 (integers) in the source data.

I propose removing the type restriction here, so that you could specify (as an example):

{
   "name": "my_boolean",
   "type": "boolean",
   "trueValues": [1],
   "falseValues": [0]
}
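
As a rough illustration of what such a descriptor could mean for an implementation, here is a minimal sketch. `cast_boolean` is a hypothetical helper, not part of any Frictionless library, and the type-strict comparison is an assumption about how matching might work:

```python
def cast_boolean(value, true_values, false_values):
    """Map a raw (representation) value to a logical boolean.

    Hypothetical helper for illustration only.
    """
    # Native booleans are already logical values.
    if isinstance(value, bool):
        return value
    # Compare type-strictly so that 1 matches 1 but not "1".
    for candidate in true_values:
        if type(candidate) is type(value) and candidate == value:
            return True
    for candidate in false_values:
        if type(candidate) is type(value) and candidate == value:
            return False
    raise ValueError(f"cannot cast {value!r} to boolean")


field = {
    "name": "my_boolean",
    "type": "boolean",
    "trueValues": [1],
    "falseValues": [0],
}

rows = [1, 0, 1]
result = [cast_boolean(v, field["trueValues"], field["falseValues"]) for v in rows]
```

With integer source values, `result` holds the logical booleans, while the string `"1"` would be rejected under this descriptor.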
@akariv
Member Author

akariv commented Sep 18, 2018

/cc @roll @zelima

@roll
Member

roll commented Sep 19, 2018

In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.

Also, this section seems bound to the string representation.

@akariv
Member Author

akariv commented Sep 19, 2018

Yes, I know, except a physical representation is not necessarily a string...

The physical representation of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation may have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).

I think this is an error in the spec - there's an issue I opened there.
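
To make the quoted distinction concrete, here is a small sketch using only the Python standard library. The sample data and column names are made up for illustration:

```python
import csv
import io
import json

# The same two records, once as CSV and once as JSON (made-up sample data).
csv_text = "id,active\n1,1\n2,0\n"
json_text = '[{"id": 1, "active": 1}, {"id": 2, "active": 0}]'

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# CSV carries no type information: every cell arrives as a string.
csv_types = {type(row["active"]).__name__ for row in csv_rows}
# JSON carries primitive types: the same cells arrive as integers.
json_types = {type(row["active"]).__name__ for row in json_rows}
```

Here `csv_types` contains only `str` and `json_types` only `int`, so a strings-only `trueValues: ["1"]` can match the CSV cells but not the JSON ones.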

@roll roll added this to Specifications in Frictionless General Mar 19, 2019
@roll roll added this to the v2 milestone Apr 7, 2023
@roll roll changed the title allow any type for trueValues and falseValues in boolean types Allow any type for trueValues and falseValues in boolean types Jan 3, 2024
@roll roll added bug and removed bug labels Jan 3, 2024
@roll roll added the bug label Jan 3, 2024
@roll roll reopened this Jan 3, 2024
@roll roll removed the bug label Jan 3, 2024
@roll roll changed the title Allow any type for trueValues and falseValues in boolean types Allow any type for missing/true/falseValues? Jan 3, 2024
@roll roll changed the title Allow any type for missing/true/falseValues? Allow any type for true/falseValues? Jan 3, 2024
@roll roll removed the help wanted label Jan 3, 2024
@roll
Member

roll commented Jan 3, 2024

It took me a while to think about this. Based on Table Schema's current separation of physical and logical representations (I'm not sure "physical" is a good word here; it's more like "textual"), it seems to me that this issue needs to be closed as wontfix.

Of course, there are use cases where boolean fields are represented with integers, but strictly speaking, if a logical value is an integer it must be marked invalid against a boolean field. By my understanding, value substitution is part of the data casting process, and we don't do data casting in general for already typed data.

If the above is wrong I think the change should affect missingValues as well.

I'll ask the WG to discuss this issue.
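
For comparison, the strings-only reading described in this comment could be sketched roughly as follows. `check_boolean_strict` is a hypothetical name, and the choice of exception types is an assumption made for the example:

```python
def check_boolean_strict(value, true_values, false_values):
    """Strings-only reading: substitution applies to textual values only.

    Hypothetical helper for illustration only.
    """
    # A native boolean is already the logical value.
    if isinstance(value, bool):
        return value
    # Textual values go through trueValues / falseValues substitution.
    if isinstance(value, str):
        if value in true_values:
            return True
        if value in false_values:
            return False
        raise ValueError(f"{value!r} is not a listed boolean string")
    # Any other already-typed value (e.g. an integer) is invalid for a boolean field.
    raise TypeError(f"{type(value).__name__} value {value!r} is not a boolean")


from_text = check_boolean_strict("1", ["1"], ["0"])
```

Under this reading the string `"1"` casts to a boolean, while the integer `1` is reported as invalid rather than substituted.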

@peterdesmet
Member

The PR frictionlessdata/datapackage#5 allows non-string values. Do I understand this issue is solved then?

@roll
Member

roll commented Jan 8, 2024

Hi @peterdesmet,

Sorry for the confusion. frictionlessdata/datapackage#5 has been reverted and the issue has been returned to discussion (I was misled by #864).

@peterdesmet
Member

If I understand correctly, the suggestion by @roll is to keep requiring the values provided in missingValues, trueValues and falseValues to be strings.

If so, I'm fine with that suggestion.

@pwalsh
Member

pwalsh commented Jan 8, 2024

@roll I think @akariv's suggestion is preferable.

@akariv
Member Author

akariv commented Jan 9, 2024

This is actually a very good example of why the distinction between logical and representation (physical, lexical...) values is so important, and how making that distinction would have made this issue very simple to resolve.

Generally we use two types of values in the spec, without distinction - on the one hand, we talk about logical values a lot. For example, a datetime value points to a specific point in time. A number will point to a specific point on the real number line. A boolean can have two distinct logical values, a truthy one and a falsey one. And so on and so forth.

The representation of these values in data files that may be described by a data package might also vary. While we're used to thinking about CSV files, where the issues are usually dates with various formats or the decimal character of a number, other data formats use different data types - i.e. not strings - to represent data. For example, Excel might use an integer to represent dates or booleans. JSON files will have native boolean values, but still need help when dates need to be decoded.

In the spec we sometimes refer to the logical value - for example, when defining constraints for the max value of a number, we don't care how it was represented, only its logical value. In other cases, we refer to the representation of the value - for example, when defining the 'missingValues' field, we declare which representations of values should be ignored.

What we need to do is to specify, in every location where a value is to be given, whether it's a logical value or a representation value, and have the same rules apply for all instances of the same kind. Obviously, this specific issue requires a representation value, so if we decide that we allow Excel files to be described by data packages, I think the correct solution here should be pretty clear :)

@khusmann

khusmann commented Jan 9, 2024

@akariv I appreciate your distinctions & definitions here. I think making the distinction between logical and representation values is especially salient re: value labels in categorical / ordinal types (as discussed in #844).

Conceptually, a boolean logical type with true/falseValues is a special case of an ordinal / categorical logical type with value labels (i.e. a binary categorical variable with logical values "true" and "false").

So as we consider the approach & language to adopt here with boolean types in distinguishing logical vs representations (i.e. label vs underlying representation values), I think it would be good to be thinking about how the decisions here generalize to categorical / ordinal types and value labels so we can keep the approach / language there consistent.

Right now the categorical extension defines a mapping between representation and label in enumLabels, where the representation values are all strings (which is consistent with how strings are provided for missingValues, trueValues, and falseValues). Allowing any type for the keys of enumLabels opens a can of worms, because in a JSON object the keys need to be strings.
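
The key constraint is easy to see with a round trip through Python's standard `json` module. The label mapping below is a made-up example, not taken from the categorical extension:

```python
import json

# Made-up label mapping keyed by integer representation values.
enum_labels = {1: "yes", 0: "no"}

# json.dumps coerces the integer keys to strings, since JSON object keys must be strings.
encoded = json.dumps(enum_labels)
decoded = json.loads(encoded)

# After the round trip the integer keys are gone; only their string forms remain.
keys_after_round_trip = sorted(decoded)
```

After the round trip, `decoded` is keyed by `"0"` and `"1"`, so the original integer type of the representation values is lost unless it is recorded somewhere else.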

Another potential point of logical / representation confusion is labels on missing values. If the logical missing label "PARTICIPANT_SKIPPED_ITEM" is represented in the data as -999, should we define missingValues = [-999] or missingValues = ["PARTICIPANT_SKIPPED_ITEM"]?

I'm 100% with you on the idea that making the distinction between logical & representation values from the get-go would have made this issue (and the value labels situation by extension) easy to solve. I think the challenge here is to balance conceptual correctness / internal consistency / historical compatibility. So rather than doing the surgery that would be required across the spec to represent Excel files or other binary formats "correctly", I think it might be easier to just let a limitation of the datapackage format be that it is designed for underlying textual data representations... That way, "representation values" (e.g. missingValues, trueValues, falseValues, enumLabels keys) are always strings.

Not ideal, I know (especially for representing floating point values!). But given that we're not allowing breaking changes in V2, it kind of limits our ability for deep surgery... I'm open to more thoughts / brainstorm though!
