
Data representation model #49

Open · wants to merge 27 commits into main
Conversation

@roll (Member) commented Apr 3, 2024


It might be easier to review using a preview deployment:

cloudflare-pages bot commented Apr 3, 2024

Deploying datapackage with Cloudflare Pages.
Latest commit: 8d7a60d
Status: ✅ Deploy successful!
Preview URL: https://45068655.datapackage.pages.dev
Branch Preview URL: https://data-representation.datapackage.pages.dev


`format`: no options (other than the default).
The list field can be customised with this additional property:
akariv (Member):

  1. Why couldn't the native representation here also be a native JSON array?
  2. The strings we get after separating the delimited string representation are then to be treated according to the itemType property. However, if they are integers/numbers/bools/dates/etc., there are other properties (e.g. format, decimalChar, groupChar, trueValues, etc.) and constraints that might be required to validate the values (e.g. a list of country codes).
    I suggest we add an itemOptions property here to contain these item-specific properties (see the sketch below).

roll (Member Author):

There was a long discussion about whether to extend array or add list, and we ended up with the current state.

akariv (Member):

I see... but do you agree that having itemType (e.g. date) without being able to specify the format makes it a bit pointless (and very hard for implementors to handle)?

roll (Member Author):

@akariv
I consider list more of a serialization feature than a full data description field like the others.

For example, number has a lot of options, so if a CSV contains numbers with a group character, a data publisher can annotate it and an implementation can then read it.

For list, my understanding is that it is one-directional only: the data needs to be created at the logical level, e.g. by reading from SQL or writing Python, and then an implementation will be able to serialize it and the receiver will be able to deserialize it. In this case, you don't need any formats or options other than the defaults.

akariv (Member) commented Apr 4, 2024

I'm not sure I understand (and again, maybe this discussion is out of scope for this PR...)

If I have a list field in a data package with itemType: date and delimiter: "|" and in my CSV file I see the value "11/11/12|11/12/12" - what is the implementation supposed to do with it? According to the spec, it's supposed to provide a list of logical dates, but there isn't enough information here to allow it to do that.

roll (Member Author) commented Apr 4, 2024

Yeah, sure, it's not related to this one, but I think it's very useful, so I hope we don't create too much noise for others =)

So my understanding is that the value "11/11/12|11/12/12" does not exist anywhere and won't ever be created. list values need to be created exclusively by Data Package-aware implementations. So, as it's serialized according to the Table Schema rules by the data publisher, it will be 2012-11-11|2012-12-11 and the data consumer will be able to read it.
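
A rough Python sketch of the one-directional flow described above, assuming the default ISO date notation and the | delimiter from the earlier example (the helper code is illustrative, not any implementation's API):

from datetime import date

delimiter = "|"

# Publisher side: the list only ever exists at the logical level
# (e.g. read from SQL or built in Python) and is serialized by a
# Data Package-aware implementation using the default ISO notation.
logical_value = [date(2012, 11, 11), date(2012, 12, 11)]
cell = delimiter.join(d.isoformat() for d in logical_value)
print(cell)  # 2012-11-11|2012-12-11

# Consumer side: because the serialization is fixed, no extra format
# or options are needed to recover the logical dates.
restored = [date.fromisoformat(item) for item in cell.split(delimiter)]
print(restored)  # [datetime.date(2012, 11, 11), datetime.date(2012, 12, 11)]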

roll (Member Author):

BTW, I started from PR #31, which provides a full field definition for array items, as in frictionless-py, so of course I understand what you mean. I think we're just not sure yet about the use cases and the balance between complexity and demand.

akariv (Member):

Yeah, I now see we had the same discussion 4 years ago :D frictionlessdata/specs#409 (comment). I didn't follow the decision made in February, but I still believe the arrayItem approach was probably the best... (even if it meant we made name mandatory only for field definitions in an object schema and not in an array schema).
Anyway let's leave it here, as that decision was already made.

roll (Member Author):

Hopefully, now that we have open governance in place, we can adjust to user needs much more quickly, so nothing prevents us from extending itemType later.

khusmann:
@akariv I also still think the arrayItem approach was the best -- my use case is multiple-select options using the categorical type: #31 (comment). Also, we could potentially get typed-tuple support if we allowed it to be defined as an array of field descriptors. (And both of these cases have parallels in nested types in implementations like Polars & Parquet.)

But as @roll says, we can extend itemType later, hopefully :)

khusmann commented Apr 9, 2024

As I have mentioned in other threads, I'm very skeptical of the addition of "native values" into the TableSchema. Although it does open the ability to more directly interface with types found in binary formats like SQL / xlsx / etc, I think we lose the simplicity and predictability of defining all field parsing in terms of textual values. With the inclusion of native values, TableSchema becomes a leaky abstraction, with behaviors highly coupled to source format.

I think a safer way to include binary formats into the standard would be to create separate schema descriptor definitions with field types for specific formats (JSONTableSchema, SQLSchema, XLSXSchema, etc.) that could be then converted along with their data to and from textually-backed TableSchemas via explicit native type conversion rules.

That said, if we're all set on supporting binary formats via the inclusion of native types into the TableSchema, this PR is looking good. I'm torn between 👀 and 👎, because while I recognize the importance of resolving the issues this addresses, I am very concerned about the implicit complexity it invites. That said, based on the earlier discussions I think I'm in the minority on this issue, so feel free to override me :)

Quick things to check my understanding of the current definition:

For CSV files, is "missingValues": [-99] a no-op because native CSV values are all strings?

For an integer column in an SQL db, will "missingValues": [-99] match values in the column, even though -99 is a floating-point value in the JSON?

For a string column in an SQL db with no missingValues definition, will empty strings be considered missing (because "missingValues": [""] is the default)?

In an SQL db, will "missingValues": [] generate a validation error on NULL values? To handle NULL, we need to specify "missingValues": [null], right?

pschumm commented Apr 10, 2024

As I have mentioned in other threads, I'm very skeptical of the addition of "native values" into the TableSchema.

FWIW, I initially shared your skepticism, but on further reflection I came to the opposite conclusion. Here's why. I spend a lot of time generating data packages for distribution. Those packages, being Frictionless Tabular Data Packages, contain a Table Schema, and for various reasons the actual resources are written in CSV format. I provide a guarantee that these packages can be validated with frictionless validate (e.g., using the console in the Python reference implementation).

For the most part, the pipelines that generate these packages are written in Python (using Pandas), and validation (against the schema) is one of the final steps. In that case, it is the Pandas data frame that is being validated, as opposed to the ultimate CSV file which is what the user will often validate before reading the data into their analytic package of choice.

Initially, I (and folks working with me) were getting tripped up by the fact that the same schema that is valid while running the pipeline was not necessarily valid when validating the CSV file (e.g., #1599). However, in retrospect, I realized that part of this was likely due to ignoring the difference between the physical (in the case of the CSV files) and native (in the case of the Pandas data frame) representations.

[...] while I recognize the importance of resolving the issues this addresses, I am very concerned about the implicit complexity it invites.

Actually, I think the additional complexity in this case is explicit; specifically, we are trading misleadingly simple implicit behavior for apparently more complex explicit behavior. And per the Zen of Python, the latter is generally preferable.

What I do think is that we should do everything we can to make the resulting solution as simple as possible for the greatest number of users and common use cases. I think @roll's work on this PR is excellent, but I think that we can do a bit more to simplify things for large groups of users. For example, we could add a section near the top named something like "Start Here: Simple Cases" that says things like: "If you are working exclusively with CSV files, then the physical and native formats will always be equivalent and will be equal to the string value represented in the (decoded) version of the file." Or something like that (it's been a long day, so this suggestion could almost certainly benefit from some wordsmithing).

khusmann:

Really appreciate your perspective here, @pschumm!

In that case, it is the Pandas data frame that is being validated, as opposed to the ultimate CSV file which is what the user will often validate before reading the data into their analytic package of choice.

Exactly. TableSchema was designed with definitions of types derived from the features found in delimited textual data. So when a TableSchema is used to validate a given CSV, there are no surprises, because TableSchema field types are defined by way of physical textual features. The surprises start popping up when we start doing lossy conversions on the data to other formats.

When we convert CSV+TableSchema to Pandas data frame (binary data + PandasTypes), we lose information in the conversion (for example, distinct missingValues get converted to None).

Then when we use a TableSchema to validate that Pandas dataframe, we're doing yet another lossy conversion: we have features and values in the binary data + PandasTypes that may not round-trip in the same way back to TableSchema field definitions. Our original TableSchema is not guaranteed to validate.

Using a TableSchema to validate SQL databases (binary data + SqlTypes) requires similarly lossy type conversions. If we use TableSchema as the universal way to validate SQL databases and other typed formats, I think it will lead to situations similar to what you experienced when validating the same data as a CSV vs. a Pandas dataframe, where validation and other behaviors diverge in unexpected ways across different formats.

By having different schema types for different typed data formats (e.g. SQLTableSchema, XLSXTableSchema, etc.), we avoid these implicit type conversions in validation:

  • TableSchema defines logical types representable by the physical features of delimited text, so it can predictably validate delimited text files.
  • SQLTableSchema defines logical types by way of the features of SQL native types, so can predictably validate SQL databases.
  • XLSXTableSchema defines logical types by way of the features of native xlsx types, so it predictably validates xlsx files.
  • etc.

In other words, when we convert a CSV+TableSchema to an SQL db, I think it should result in a SQL db along with a SQLTableSchema that it can be immediately validated against… not a bare SQL db that may or may not validate with the original TableSchema.
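
Purely as a hypothetical illustration of this idea (neither SQLTableSchema nor the sqlType property below exists in any spec; both are made up here):

# The same column described twice, once per schema kind.
field_in_table_schema = {
    "name": "score",
    "type": "integer",          # logical type derived from textual features
    "missingValues": ["-99"],   # physical string values as they appear in the CSV
}

field_in_hypothetical_sql_schema = {
    "name": "score",
    "type": "integer",          # logical type derived from native SQL types
    "sqlType": "INTEGER",       # hypothetical: the native type it maps from
    "missingValues": [-99],     # native integer values in the column
}

A conversion rule between the two schema kinds could then carry "-99" → -99 explicitly, instead of relying on a single Table Schema to validate both representations.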

That said, I recognize I'm the minority (only?) voice taking this position… I fully expect/accept being overruled on this! I don't mean to kill the momentum here, just want to summarize my concerns moving forward.

For example, we could add a section near the top named something like "Start Here: Simple Cases" that says things like: "If you are working exclusively with CSV files, then the physical and native formats will always be equivalent and will be equal to the string value represented in the (decoded) version of the file." Or something like that

I really like this idea!

I think it also illustrates how we're thinking on very similar lines here – We get exactly this when we have explicit schemas for different formats with logical field types built on the format's native types. Somebody using SQL can go straight to the SQLTableSchema page to describe their database with frictionless field type definitions that map to their familiar native SQL types, in the same way someone can go to the TableSchema page to describe their delimited text file with type definitions derived from the physical features of delimited text. If I'm creating an XLSXTableSchema, I can focus on defining my logical types from the native types that exist in XLSX, and I don't need to think about the peculiarities of SQL or delimited textual formats.

Then conversions between formats become well-defined and easy to explicitly represent as standard conversion rules between these schema types… taking us one step closer to a "pandoc" for data & metadata, which I know is a dream we both share :)

ezwelty commented Apr 10, 2024

I do feel that with native values the spec is getting a bit ahead of itself here, because the spec is limited to native JSON data types. There is a suggested workaround:

Note, that whenever a native value is allowed to be provided in this spec, the most similar JSON type should be used to represent it. If no such type exists (e.g. in case there's a native date value), a string representation of that value should be provided.

The problem I see with this idea is that the intention of the spec becomes unclear wherever the logical type cannot be explicitly stated (the proposed categorical field, missingValues, trueValues, and falseValues come to mind), as then either the native type or its string representation could be considered valid. For example, for categorical (#48), this could mean "categories": ["2010-01-01"] passing validation on both a string and a date column, even though the spec already defines a logical date type which would address this.
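
A small Python sketch of that ambiguity, assuming the string-fallback workaround quoted above:

from datetime import date

categories = ["2010-01-01"]  # from the categorical descriptor

native_string_value = "2010-01-01"    # e.g. a text column
native_date_value = date(2010, 1, 1)  # e.g. a date column

# Under the "most similar JSON type / string fallback" reading, both pass:
print(native_string_value in categories)      # True
print(str(native_date_value) in categories)   # True, via the string fallback

# ...so the descriptor alone no longer says whether the categories are
# logically strings or dates, even though the spec has a logical date type.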

roll (Member Author) commented Apr 10, 2024

@khusmann
For example, frictionless-py architecturally works like this:

  • a loader loads a binary stream based on the file protocol
  • a parser converts it to a tabular stream based on the file format
  • a schema does all the remaining post-processing

The thing here is that the schema is completely abstracted from the underlying file protocols and formats, dealing only with the incoming tabular data stream. So basically the schema doesn't know whether the data comes from SQL or CSV, as it's typed based on the computational environment anyway. In the case of frictionless-py these are Python types, but they could be Polars types, etc. It's probably not a perfectly puristic approach, but it enables pluggable implementations that can support multiple file protocols and formats.
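
A very rough, generic Python sketch of that layering (this is not frictionless-py's actual API; all names are made up):

import csv
import io

def load(path):
    """Loader: fetch a binary stream based on the file protocol (here: a local file)."""
    with open(path, "rb") as f:
        return f.read()

def parse_csv(raw):
    """Parser: turn the binary stream into a tabular stream based on the file format."""
    return list(csv.reader(io.StringIO(raw.decode("utf-8"))))

def apply_schema(rows, field_types):
    """Schema: post-process the tabular stream into typed (logical) values.
    It never sees the protocol or format -- only rows typed by the runtime."""
    return [[t(cell) for t, cell in zip(field_types, row)] for row in rows]

# e.g. rows = apply_schema(parse_csv(load("data.csv")), [str, int])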

roll (Member Author) commented Apr 10, 2024

I do feel that with native values the spec is getting a bit ahead of itself here, because the spec is limited to native JSON data types.

I agree it does to some extent, but I think the one real-world scenario we got in 7 years was frictionlessdata/specs#621, so I would say it's OK that missingValues is limited to JSON values; the feature is limited but still serves the majority of needs.

pschumm commented Apr 10, 2024

I haven't had my first coffee of the morning yet so this might not make any sense, but while this exercise may feel a bit like peeling an onion, I do feel it's worthwhile. I feel like we're trying to do two things simultaneously: (1) moving the spec forward to v2 in a timely way by incorporating meaningful additions and improvements while (hopefully) not doing any harm; and (2) rethinking some of our fundamental assumptions and objectives. The key is to balance these two (easy to say, harder to do).

It may sound hokey, but I think our North Star should be the very word "Frictionless" itself. We may (and should) worry about the distinctions between physical, native and logical values, as well as about edge cases, but for the average user our measure of success will be whether Frictionless makes their life better, i.e., whether it makes data easier to use and share (including making data FAIR), whether it improves data quality (e.g., through validation and avoiding error-prone manipulations), and whether it helps avoid common pitfalls. At least that's what Frictionless means to me.

@ezwelty raises a good point above; one could argue that we just need to add a few more type properties to the schema, but that feels a bit kludgy. I wonder if a more mechanical strategy (as opposed to conceptual) might help us to move forward here. Specifically, we create an explicit list of all of the cases where this distinction between physical and native representation creates problems for the schema spec, establish some type of priority ordering, and then come back to ideas for what step(s) in v2 we might want to take to address them.

roll (Member Author) commented Apr 10, 2024

It may sound hokey, but I think our North Star should be the very word "Frictionless" itself. We may (and should) worry about the distinctions between physical, native and logical values, as well as about edge cases, but for the average user our measure of success will be whether Frictionless makes their life better, i.e., whether it makes data easier to use and share (including making data FAIR), whether it improves data quality (e.g., through validation and avoiding error-prone manipulations), and whether it helps avoid common pitfalls. At least that's what Frictionless means to me.

Being pragmatic has always been one of the main principles of Frictionless Data. For me, this PR mostly just writes down what is de facto implemented in basically every existing Data Package implementation. My guess is that, even though we're changing some definitions here, it won't require any change from any existing implementation or user.

khusmann:

while this exercise may feel a bit like peeling an onion, I do feel it's worthwhile.

Agreed!

I think this represents a big clarification of direction and scope for frictionless, analogous to something like choosing duck-typing vs static-typing for a language.

At the end of the day, it's a value judgment – tradeoff between expressiveness and predictability. For validation and large-scale interoperability, I lean towards predictability, because it allows you to make certain guarantees & invariants that users and implementations can count on (e.g. this datapackage will validate in the same way no matter which implementation we validate it with).

But it sounds like the group is wanting to prioritize expressiveness because it'll reduce barriers to adoption, and I can appreciate that too. I think the apparent simplicity of duck-typing is partly why Python got so popular… even if the simplicity came at a cost of predictability / robustness, which is why static typing was eventually retrofitted into the language much later. (So maybe it's something we can revisit down the road, as well)

For me, this PR mostly just writes down what is de facto implemented in basically every existing Data Package implementation.

Right. Up until now, I've treated the handling of formats besides delimited text as somewhat undefined behavior, and this PR makes it more explicit, which is good.

So if the group is in agreement that duck-typing is what we want for the spec, then I do think this PR is the way to go! I've upgraded my vote from 👎 to 👀.


That said, like @ezwelty points out (and I have also pointed out in previous threads) there's the remaining issue of representing native types when they are not representable in JSON. To summarize our options, we have:

  1. "choose the most similar JSON type should be used to represent" (in the current PR)
  2. add more itemType fields (@ezwelty)
  3. enumerate cases & establish some type of priority ordering (@pschumm)

@ezwelty – by adding more itemType fields, did you mean allowing native values to be represented by JSON string, number, or object? So if we wanted to mark a native date as a missing value, we would do the following:

"missingValues": [
   -99,
   "NA",
  {
    "value": "1900-01-01",
    "type": "date"
  }
]

But as @roll points out, this is vanishingly rare in practice -- if we have a native date type then it's most likely going to be null-able anyway, so the above definition is not something you'd see. Similarly, when there is no native type for boolean or categorical, the values we want to specify for trueValues and falseValues and categories are going to be native string or number types.

So I think maybe the key for me in the present PR is to make this a guarantee -- as long as we're ensuring that all places where we're wanting to specify native type values are ALWAYS variations of string or number, and we're not concerned about edge cases (like loss of precision when working with native floating point, etc.), we're golden right? As @roll says, it gives us a good pragmatic representation of how native values are being used by the spec in present implementations.

roll (Member Author) commented Apr 12, 2024

I think this one is something to discuss on the CC.

TBH, I don't see a big (or even any) difference for users/implementations if we just switch from:

  • physical
  • native
  • logical

to

  • physical
  • lexical
  • logical

The bottom line, in my opinion, is that we rely on data formats for data reading anyway, and Table Schema post-processes the data types that are not natively supported and are stored as strings.

Using a native layer instead of a lexical one just seems more generic and simpler to write down.

roll (Member Author) commented Apr 12, 2024

BTW @khusmann what do you think about this one frictionlessdata/specs#709

khusmann commented Apr 12, 2024

I think this one is something to discuss on the CC.

Sounds good!

TBH, I don't see a big (or even any) difference for users/implementations if we just switch from:

For me, the key difference is clarified by whether we say the following two definitions are identical or not:

  1. "missingValues": [-99]

  2. "missingValues": ["-99"]

If we say "native values", I would expect the two definitions to exhibit different behavior on an integer SQL column: In the first definition, numeric -99 will be used as missing value, in the second there will be no missing values defined because an integer column does not have string values.

If we say "lexical values", I would expect these definitions to exhibit identical behavior on an integer SQL column: They both will define string -99 as a missing value, because everything is converted to a string/lexical representation before missingValue comparisons take place.

BTW @khusmann what do you think about this one frictionlessdata/specs#709

I 100% agree with what you said in #33 (comment):

if we clarify float/decimal difference we need to add a new type as it's fundamentally different

(My instinct would be to add a "decimal" type with properties for leftDigits, rightDigits, etc. and use the existing "number" for float types)

roll (Member Author) commented Apr 13, 2024

If we say "native values", I would expect the two definitions to exhibit different behavior on an integer SQL column: In the first definition, numeric -99 will be used as missing value, in the second there will be no missing values defined because an integer column does not have string values.

Yes, -99 and "-99" are different values, and the missing/false/trueValues checks use exact matching. If, for example, schema.missingValues = [-99, "-99"], then it will make null all the "-99" values in string fields and all the -99 values in numeric fields.

If we say "lexical values", I would expect these definitions to exhibit identical behavior on an integer SQL column: They both will define string -99 as a missing value, because everything is converted to a string/lexical representation before missingValue comparisons take place.

In this case, i.e. v1, we just don't allow non-strings in missingValues, so basically it's just a subset of the first example.

ezwelty commented Apr 18, 2024

@ezwelty – As long as we're ensuring that all places where we're wanting to specify native type values are ALWAYS variations of string or number, and we're not concerned about edge cases (like loss of precision when working with native floating point, etc.), we're golden right?

@khusmann I guess it makes me a wee bit uneasy how easily one could generate a Table Schema that behaves differently depending on the data format, since my impression was that Table Schema aimed to be somewhat portable and format-agnostic (the format-specific stuff being the concern of the parent Resource).

Say we want an integer column with a few allowable values, we could write it as:

{
   "type": "integer",
   "constraints": {"enum": [1, 2, 3]},
   "missingValues": [""]
}

constraints.enum is compared to the logical representation of the data (as are all constraints), so this works for either CSV or JSON data source (assuming the JSON uses either null or "" for missing values).

{
    "type": "categorical",
    "categories": [1, 2, 3],
    "missingValues": [""]
}

If, as proposed, the categories are compared directly to the native representation of the data, this works for JSON but not for CSV, so a different Table Schema would be needed depending on the format.

Similar issues arise if we allow numeric missing values (e.g. -99 needed for JSON, but "-99" for CSV), as previously pointed out.
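
To make the format dependence concrete, a small sketch assuming categories are matched against native values with exact, type-sensitive comparison:

categories = [1, 2, 3]  # from the categorical descriptor above

json_native_value = 1   # JSON source: native number
csv_native_value = "1"  # CSV source: native values are all strings

print(json_native_value in categories)  # True
print(csv_native_value in categories)   # False -- the same Table Schema no longer
                                        # describes the CSV version of the data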

@roll This feels weird to me, but maybe diverging behavior is already the reality in common use cases of v1 of Table Schema?

roll (Member Author) commented Apr 19, 2024

e.g. -99 needed for JSON, but "-99" for CSV

I would say that -99 is needed for the integer field and "-99" for the string field. It's not about formats.

constraints.enum is compared to the logical representation of the data (as are all constraints), so this works for either CSV or JSON data source (assuming the JSON uses either null or "" for missing values).

The problem here is that the categorical field type doesn't have a "base" data type, so we can't use the same logic as with constraints, which can be presented as the logical type of their lexical representation.

khusmann commented Apr 19, 2024

e.g. -99 needed for JSON, but "-99" for CSV

I would say that -99 is needed for the integer field and "-99" for the string field. It's not about formats.

What about boolean fields in SQLite vs CSV?

SQLite doesn't have a native bool, and by convention uses integers instead:

{
   "name": "field1",
   "type": "boolean",
   "trueValues": [1],
   "falseValues": [0],
   "missingValues": [-99]
}

Compared to the equivalent bool field coming from CSV:

{
   "name": "field1",
   "type": "boolean",
   "trueValues": ["1"],
   "falseValues": ["0"],
   "missingValues": ["-99"]
}

Side note -- by default trueValues is currently ["true", "True", "TRUE", "1"] and falseValues is ["false", "False", "FALSE", "0"], so the following definition will not work out of the box for an integer column in SQLite representing a boolean field (but works fine in CSV):

{
   "name": "field1",
   "type": "boolean"
}

If we want this definition to work with native values, the default trueValues will need to be ["true", "True", "TRUE", "1", 1], etc.

...so I agree with @ezwelty's unease.

nichtich:

There is a reason why several scripting languages automatically convert between number and string depending on context...

pschumm commented Apr 20, 2024

The problem here is that the categorical field type doesn't have a "base" data type, so we can't use the same logic as with constraints, which can be presented as the logical type of their lexical representation.

Indeed. I believe there are three primary cases:

  1. The implementation has a native categorical type (e.g., Pandas, R, Julia, db with enum type).
  2. The implementation represents categoricals by integers, optionally with attached labels; examples include Stata, SAS and SPSS, all of which accommodate (value) labels, but there are other analytic software packages that represent categoricals by integers but do not accommodate labels.
  3. The implementation doesn't explicitly distinguish between categorical types and either integers or strings.

The only constraints that are relevant here, I think, are required and enum. And if we assume that enum is evaluated after the data have been cast to the target type, I believe that the only way to accommodate all cases would be to require that the enum array contain items that are consistent with the way the categories property is specified. That is, it would either be an array of strings matching the strings in the categories array, or an array of objects matching the objects in the categories array. That would not require any change to the way enum is described in the spec.
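
For example (hypothetical descriptors; the object form with value/label keys is assumed from the categorical proposal discussed in this thread):

# categories given as strings -> enum is an array of matching strings
field_a = {
    "name": "status",
    "type": "categorical",
    "categories": ["single", "married", "divorced"],
    "constraints": {"enum": ["single", "married"]},
}

# categories given as objects -> enum is an array of matching objects
field_b = {
    "name": "status",
    "type": "categorical",
    "categories": [
        {"value": 1, "label": "single"},
        {"value": 2, "label": "married"},
    ],
    "constraints": {"enum": [{"value": 1, "label": "single"}]},
}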

@khusmann, this really isn't the place for this (apologies), but I don't want to forget and I know you'll read this. Your current proposed categorical addition to the spec limits value to string or number, or, if the simplified notation is used, only string. Is there a reason for these limitations? I'd argue that we should avoid unnecessary limitations here, if possible.

Once this issue is resolved, we can add (or not) references to the native-value specification to your addition.

pschumm commented Apr 20, 2024

There is a reason why several scripting languages automatically convert between number and string depending on context...

Well put @nichtich, but I just want to make sure I'm following what you're implying here. Are you suggesting that we formally adopt this as part of the Frictionless spec in order to address the trueValues/falseValues and missingValues issues described above?

pschumm commented Apr 20, 2024

I would say that -99 is needed for the integer field and "-99" for the string field. It's not about formats.

Where I think we want to avoid confusion here is in the distinction between the following two cases:

  1. Using missingValues to indicate invalid values that are dropped prior to casting to the target type (e.g., dropping string values such as "Don't know" or "Refused" when reading from CSV prior to casting to integer), and
  2. Using missingValues to indicate logical values that should be dropped (or ignored) prior to analysis.

At the moment, (1) is potentially dependent on the source of the data (i.e., the native type). And since missingValues is restricted to an array of strings, (2) is not possible for fields other than type string. I think we need to decide whether we want to support (2); note that for full support of categorical fields in software such as Stata and SAS, we do.

I would propose that (1) is more a statement about a specific dataset and/or source, while (2) is, at least potentially, a more general statement. Think something like "NA" which may merely indicate that the data are coming from R (where that is the default way of writing missing values), as compared to "Don't know" which has substantive meaning and which we might expect to see regardless of the source of the data.

Where I'm going with this is that, at least for me, having to worry about the distinction between -99 and "-99" is less of a problem for (1) (since in that case it is often specific to the data source anyway) than for (2). I think it is also less of a problem for resource-level missingValues than for field-specific missingValues.
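
To illustrate the two cases with hypothetical field descriptors (the typed sentinel in the second one assumes native/typed missingValues are allowed, as proposed in this PR):

# (1) Source-specific invalid values dropped before casting to integer:
field_case_1 = {
    "name": "age",
    "type": "integer",
    "missingValues": ["Don't know", "Refused"],  # strings seen in this particular CSV export
}

# (2) Logical values to drop/ignore prior to analysis, regardless of source:
field_case_2 = {
    "name": "age",
    "type": "integer",
    "missingValues": [-99],  # a logical integer sentinel, not a source-specific string
}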

khusmann commented Apr 22, 2024

I would say that -99 is needed for the integer field and "-99" for the string field. It's not about formats.

Where I think we want to avoid confusion here is in the distinction between the following two cases:

I don't think the confusion here is related to the nature of missingValues, as the identical issue comes up with trueValues and falseValues. We run into this situation whenever 1) the logical field type is not supported by a native type (so we need to define it via other native types) AND 2) it is equally valid to derive the logical field type from a native string or a native numeric type.

So for the example of a boolean field in SQLite vs CSV: 1) neither format has a native boolean type, and 2) logical booleans are equally valid to derive from native string or native numeric values. Therefore whether we need to write "missingValues": ["-99"] vs "missingValues": [-99], or "trueValues": ["0"] vs "trueValues": [0], depends on the native type of the source.

To sum up our possible ways forward, I think there are three options that we've covered (to help frame our discussion in the upcoming community call):

  1. "lexical values": This is how v1 of the spec was written. It does not suffer from this JSON typing issue because all native types are explicitly promoted to string.

  2. "native values": This is the current writing of the PR, which has the potential confusion of "-99" != -99.

  3. "native values with automatic coercion". This is what I think @nichtich was implying (please correct me if you meant something else!). We state that the JSON type doesn't matter for native values, and everything automatically gets coerced (i.e. "-99" == -99). With this, we avoid the above confusion. In practice, this gives us behavior very similar to (1), just is less explicit at face value about what kind of coercion will place.

As @ezwelty said earlier, I still think we're getting a bit ahead of ourselves with native values. But if we're really wanting to relax our physical value types in v2, I prefer (3) over (2) because, like (1), it decouples the source native format from the way source values are expressed in the schema.

(@ezwelty @pschumm I'll address your comments re: categorical in the categorical PR in a bit!)

Successfully merging this pull request may close these issues:

  • Clarify physical/logical representation in Table Schema
  • Allow any type for true/falseValues?