Support for server-specific data types. #63

jochenchrist · 2024-05-05T14:18:07Z

The specification has a closed enumeration of logical data types:
https://datacontract.com/#data-types

The enumeration makes it simple to create data contracts with tools like schema store. It is also very useful for checks and other logic in tests, imports, and exports. It also helps if the data provider and data consumer use different technologies: tools like the Data Contract CLI can then convert the logical data type to the appropriate export format.

In some cases, however, the enumeration is not enough, e.g., when a specific type is used (such as TIMESTAMP_LTZ in Snowflake, or SMALLINT) and it is important or helpful to specify this information.
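
To make the conversion and its limitation concrete, here is a minimal Python sketch. The mapping table and the function name `to_snowflake_type` are assumptions for illustration; they are not the mapping the Data Contract CLI actually uses.

```python
# Illustrative mapping only: NOT the table the Data Contract CLI actually uses.
LOGICAL_TO_SNOWFLAKE = {
    "string": "VARCHAR",
    "int": "NUMBER",
    "long": "NUMBER",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP_TZ",
    "timestamp_ntz": "TIMESTAMP_NTZ",
}


def to_snowflake_type(logical_type: str) -> str:
    """Convert a logical data type to a (hypothetical) Snowflake column type."""
    try:
        return LOGICAL_TO_SNOWFLAKE[logical_type]
    except KeyError:
        raise ValueError(f"unsupported logical type: {logical_type!r}") from None


# Every logical "timestamp" resolves to the same physical type, so a contract
# that only carries the logical type cannot say "this column is TIMESTAMP_LTZ".
print(to_snowflake_type("timestamp"))
```

With a closed enumeration and a fixed table like this, TIMESTAMP_LTZ is simply unreachable, which is the gap the options below try to close.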

Option: Do nothing

Keep the enumeration
Keep it simple
The specific type information can be expressed in the description
The Data Contract Specification supports custom fields, so any information may be added, but it is not standardized

Option: Enumeration to String

Allow any data type in the type attribute

Option: Additional attribute `customType`

Add a data type `custom`
The custom type can be defined in an attribute `customType`

my_field:
  description: Some description.
  type: custom
  customType: TIMESTAMP_LTZ

Option: Additional attribute `physicalType`

Any value can be defined in a string attribute `physicalType`

my_field:
  description: Some description.
  type: timestamp
  physicalType: TIMESTAMP_LTZ

Option: Server-specific fields

my_field_1:
  description: Example for AVRO with Timestamp (millisecond precision)
  type: timestamp
  example: 1970 00:00:00.000 UTC
  avroType: long
  avroLogicalType: timestamp-millis
my_field_2:
  description: Example for AVRO with Timestamp (microsecond precision)
  type: timestamp
  example: 1970 00:00:00.000000 UTC
  avroType: long
  avroLogicalType: timestamp-micros
my_field_3:
  description: Example for AVRO with Local timestamp (millisecond precision)
  type: timestamp_ntz
  example: 1970 00:00:00.000
  avroType: long
  avroLogicalType: local-timestamp-millis
my_field_4:
  description: Example for AVRO duration
  type: string
  example: P3DT12H
  avroType: fixed
  avroLogicalType: duration

Snowflake:

my_field_1:
  description: Example for Snowflake TIMESTAMP_LTZ
  type: string # or timestamp??
  snowflakeType: TIMESTAMP_LTZ

Option: Add config map with server-specific fields (dbt-style)

Like above, but put all additional information that may be useful for tooling into a config, meta, ... structure.

my_field_1:
  description: Example for AVRO with Timestamp (millisecond precision)
  type: timestamp
  example: 1970 00:00:00.000 UTC
  config:
    avroType: long
    avroLogicalType: timestamp-millis
my_field_2:
  description: Example for AVRO with Timestamp (microsecond precision)
  type: timestamp
  example: 1970 00:00:00.000000 UTC
  config:
    avroType: long
    avroLogicalType: timestamp-micros
my_field_3:
  description: Example for AVRO with Local timestamp (millisecond precision)
  type: timestamp_ntz
  example: 1970 00:00:00.000
  config:
    avroType: long
    avroLogicalType: local-timestamp-millis
my_field_4:
  description: Example for AVRO duration
  type: string
  example: P3DT12H
  config:
    avroType: fixed
    avroLogicalType: duration
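
As a rough illustration of how tooling could consume such a config map, the following Python sketch turns one contract field into an Avro field definition. The function name `avro_field` is hypothetical, and the field is assumed to be already parsed into a plain dict; only the `avroType` and `avroLogicalType` keys come from the examples above.

```python
def avro_field(name: str, field: dict) -> dict:
    """Build an Avro field definition from a contract field (sketch only)."""
    config = field.get("config", {})
    avro_type = config.get("avroType")
    if avro_type is None:
        raise ValueError(f"{name}: no avroType in config")

    logical = config.get("avroLogicalType")
    # Avro expresses a logical type as an annotation on the underlying type.
    # Note: an Avro `fixed` type would also need a name and a size, which
    # this sketch does not handle.
    if logical:
        type_def = {"type": avro_type, "logicalType": logical}
    else:
        type_def = avro_type

    result = {"name": name, "type": type_def}
    if "description" in field:
        result["doc"] = field["description"]
    return result


my_field_1 = {
    "description": "Example for AVRO with Timestamp (millisecond precision)",
    "type": "timestamp",
    "config": {"avroType": "long", "avroLogicalType": "timestamp-millis"},
}
print(avro_field("my_field_1", my_field_1))
```

The logical `type` stays readable for everyone, while the exporter reads only the keys it understands from `config`.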


ryancollingwood · 2024-05-06T00:42:22Z

Great question, thank you for raising this - some very off the top of my head thoughts - I welcome all disagreement and agreement.

Going with my gut feeling, physicalType seems the most bang for the buck in terms of effort, also if the decision later needs to evolve or be reversed.
Another thought: how much of option 3 could we achieve if we know we are targeting a particular server and have the physical data type as expressed in option 2?

Generally on types I see it potentially in three dimensions:

logical - the highest level of abstraction we can get away with for systems interoperability
physical - the lowest level of abstraction we can get away with for persistence purposes
logical - what does this mean to people

pixie79 · 2024-05-08T06:47:15Z

I like the config map style option. It keeps it simple for those who need simple and allows more refined data types to be added later. It would also allow a parser to easily output, for example, an AVRO schema later, or to start with an AVRO schema and write out the start of a data contract.
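
The reverse direction mentioned here (start with an AVRO schema and write out the start of a data contract) could look roughly like the following Python sketch. The function name and both lookup tables are assumptions for illustration; they are not the CLI's importer.

```python
# Illustrative lookups only; a real importer would cover more types.
AVRO_LOGICAL_TO_CONTRACT = {
    "timestamp-millis": "timestamp",
    "timestamp-micros": "timestamp",
    "local-timestamp-millis": "timestamp_ntz",
    "local-timestamp-micros": "timestamp_ntz",
}
AVRO_PRIMITIVE_TO_CONTRACT = {
    "string": "string",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "bytes": "bytes",
}


def contract_field_from_avro(avro_field: dict) -> dict:
    """Return {field_name: contract_field} for one Avro field (sketch only)."""
    avro_type = avro_field["type"]
    if isinstance(avro_type, dict):
        base, logical = avro_type["type"], avro_type.get("logicalType")
    else:
        base, logical = avro_type, None

    # Prefer the logical type when we know it, else fall back to the base type.
    contract_type = AVRO_LOGICAL_TO_CONTRACT.get(logical) or AVRO_PRIMITIVE_TO_CONTRACT[base]

    # Keep the original Avro details in the config map so nothing is lost.
    config = {"avroType": base}
    if logical:
        config["avroLogicalType"] = logical

    field = {"type": contract_type, "config": config}
    if "doc" in avro_field:
        field["description"] = avro_field["doc"]
    return {avro_field["name"]: field}


print(contract_field_from_avro(
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-micros"}}
))
```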

enriquecatala · 2024-05-09T06:45:15Z

I like the config map style option.

ryancollingwood · 2024-05-09T07:22:31Z

My concern with the config map option is it may introduce a lot of "engineer" talk into a document that ideally should be reviewed and used by a broad audience.

I believe this could be handled either by handling the mapping internally or by modelling the specifics of the data type (e.g. precision, min/max value, time awareness, encoding, scale). Those are things that could apply to a variety of data types and have semantic significance: they make things explicit, potentially without making the document unapproachable. I'd need some worked examples to verify this.

I'll try to carve out some time to work through some examples.

pixie79 · 2024-05-14T07:45:28Z

I don't disagree with the statement about a lot of "engineer" talk. However, I fear that argument may already be shot down, looking at the following support for JSON:

https://github.com/datacontract/datacontract-cli/blob/94ac626e2624f38fa5b7948a6539b44405f63f4d/tests/fixtures/local-json-complex/datacontract.yaml#L11

At the same time, I think it is very important that we define these using a common method, as they are also useful for the data tests, as long as we can map those standards to the correct export types for each of the different export methods.

As for the non-engineers, maybe we offer a --light style option which strips out the technical detail, leaving just a summary for non-technical implementers to review. Or even a toggle in the HTML viewer to switch between both, which would allow multiple parties to review the same document. I do think there is great benefit in keeping all the detail in one place to avoid duplication, but it also means we need to be able to display what is appropriate to each group of users and, at the same time, ensure that when we export to AVRO, JSON, etc. we are putting the maximum detail possible into the schemas (or tests).
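
A "light" view of this kind could be as small as the Python sketch below. It assumes the contract is already parsed into nested dicts and that technical detail lives under `config` keys on models and fields; `light_view` is a hypothetical helper, and `--light` is only the option proposed here, not an existing CLI flag.

```python
import copy


def light_view(contract: dict) -> dict:
    """Return a copy of the contract without model- and field-level config maps."""
    result = copy.deepcopy(contract)  # never mutate the full contract
    for model in result.get("models", {}).values():
        model.pop("config", None)
        for field in model.get("fields", {}).values():
            field.pop("config", None)
    return result


contract = {
    "models": {
        "orders": {
            "config": {"avroNamespace": "my.namespace"},
            "fields": {
                "my_field_1": {
                    "description": "Some description.",
                    "type": "timestamp",
                    "config": {"avroType": "long"},
                }
            },
        }
    }
}
print(light_view(contract))
```

Because the full document stays the single source of truth, exports can still use every detail while reviewers see only the summary.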

ShasTheMass · 2024-05-14T13:26:47Z

This will be a useful PR to add.
Does anyone think following a language agnostic type system could help? I wonder if something like Apache Arrow could be used:

For any other sorts of type I like physicalType and the server-specific fields

simonharrer · 2024-05-15T12:51:27Z

We decided to go with Option: Add config map with server-specific fields (dbt-style).

We decided to support a config map on model and field level. A config map may include any additional key-value pairs and support multiple server type bindings.

Example:

models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        example: 1970 00:00:00.000 UTC
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
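
To illustrate how multiple server type bindings in one config map might be resolved, here is a small Python sketch. It assumes a `<server>Type` key naming pattern, inferred from `avroType` and `snowflakeType` in the example above, and falls back to the logical type when no binding exists; `server_type` is a hypothetical helper, not part of the specification.

```python
def server_type(field: dict, server: str) -> str:
    """Pick the server-specific type for a field, else its logical type."""
    # Assumption: bindings are stored as "<server>Type" keys in the config map.
    return field.get("config", {}).get(f"{server}Type", field["type"])


my_field_1 = {
    "type": "timestamp",
    "config": {
        "avroType": "long",
        "avroLogicalType": "timestamp-millis",
        "snowflakeType": "timestamp_tz",
    },
}

for server in ("avro", "snowflake", "bigquery"):
    print(server, server_type(my_field_1, server))
```

A field without any `config` still works, since the lookup simply returns the logical `type`.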

Closes #63

jochenchrist mentioned this issue May 6, 2024

Missing data type in Snowflake datacontract/datacontract-cli#150

Closed

jochenchrist mentioned this issue May 8, 2024

Bugfix import avro with logicalType and default values datacontract/datacontract-cli#183

Closed

jochenchrist pushed a commit that referenced this issue May 16, 2024

Add a config object.

e1d7298

Closes #63

jochenchrist mentioned this issue May 16, 2024

Add a config object #66

Merged

jochenchrist closed this as completed in #66 May 16, 2024

enriquecatala mentioned this issue May 23, 2024

New feature import avro with logical type default values datacontract/datacontract-cli#217

Merged
