Support for server-specific data types. #63

Closed
jochenchrist opened this issue May 5, 2024 · 7 comments · Fixed by #66

Comments

@jochenchrist
Contributor

jochenchrist commented May 5, 2024

The specification has a closed enumeration of logical data types:
https://datacontract.com/#data-types

The enumeration makes it simple to support the creation of data contracts using tools like schema store. It is also very useful for checks and other logic in tests, imports, and exports. It also helps when the data provider and data consumer use different technologies. Tools like the Data Contract CLI can then convert the logical data type to the appropriate export format.
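The conversion from a logical type to an export format can be sketched as a plain lookup table. This is only an illustration: the mapping values below are assumptions chosen for the example, not the Data Contract CLI's actual conversion table, and `to_server_type` is a hypothetical helper.

```python
# Illustrative mapping only; NOT the Data Contract CLI's real conversion table.
LOGICAL_TO_SERVER = {
    "snowflake": {
        "string": "VARCHAR",
        "timestamp": "TIMESTAMP_TZ",
        "timestamp_ntz": "TIMESTAMP_NTZ",
    },
    "postgres": {
        "string": "text",
        "timestamp": "timestamptz",
        "timestamp_ntz": "timestamp",
    },
}


def to_server_type(logical_type: str, server: str) -> str:
    """Translate a logical type into a server-specific type (hypothetical helper)."""
    try:
        return LOGICAL_TO_SERVER[server][logical_type]
    except KeyError:
        raise ValueError(f"no mapping for {logical_type!r} on {server!r}") from None


print(to_server_type("timestamp", "snowflake"))
print(to_server_type("timestamp", "postgres"))
```

A closed enumeration keeps this table finite, which is what makes the approach attractive. It also shows the limitation raised next: one logical type can map to only one server type per server.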

In some cases, however, the enumeration is not enough, e.g., when a server-specific type is used (such as TIMESTAMP_LTZ in Snowflake, or SMALLINT) and it is important or helpful to specify this information.

Option: Do nothing

  • Keep the enumeration
  • Keep it simple
  • The specific type information can be expressed in the description
  • The Data Contract Specification supports custom fields, so any information may be added, but it is not standardized

Option: Enumeration to String

  • Allow any data type in the type attribute

Option: Additional attribute customType

  • Add a data type custom
  • The custom type can be defined in an attribute customType
my_field:
  description: Some description.
  type: custom
  customType: TIMESTAMP_LTZ

Option: Additional attribute physicalType

  • In a string attribute physicalType any value can be defined
my_field:
  description: Some description.
  type: timestamp
  physicalType: TIMESTAMP_LTZ
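
How tooling might consume a physicalType attribute can be sketched as an override: use the physical type when it is present, otherwise fall back to a default derived from the logical type. The helper name `resolve_type` and the default mapping are assumptions made for this sketch.

```python
# Assumed default mapping for one server; purely illustrative.
SNOWFLAKE_DEFAULTS = {"timestamp": "TIMESTAMP_TZ", "string": "VARCHAR"}


def resolve_type(field: dict, defaults: dict) -> str:
    """Prefer the explicit physicalType; fall back to the logical-type default."""
    physical = field.get("physicalType")
    if physical:
        return physical
    return defaults[field["type"]]


with_override = {"type": "timestamp", "physicalType": "TIMESTAMP_LTZ"}
without_override = {"type": "timestamp"}

print(resolve_type(with_override, SNOWFLAKE_DEFAULTS))
print(resolve_type(without_override, SNOWFLAKE_DEFAULTS))
```

Note that a single physicalType string can only describe one target server, which matters when a contract is bound to several servers.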

Option: Server-specific fields

my_field_1:
  description: Example for AVRO with Timestamp (millisecond precision)
  type: timestamp
  example: 1970 00:00:00.000 UTC
  avroType: long
  avroLogicalType: timestamp-millis
my_field_2:
  description: Example for AVRO with Timestamp (microsecond precision)
  type: timestamp
  example: 1970 00:00:00.000000 UTC
  avroType: long
  avroLogicalType: timestamp-micros
my_field_3:
  description: Example for AVRO with Local timestamp (millisecond precision)
  type: timestamp_ntz
  example: 1970 00:00:00.000
  avroType: long
  avroLogicalType: local-timestamp-millis
my_field_4:
  description: Example for AVRO duration
  type: string
  example: P3DT12H
  avroType: fixed
  avroLogicalType: duration

Snowflake:

my_field_1:
  description: Example for Snowflake TIMESTAMP_LTZ
  type: string # or timestamp??
  snowflakeType: TIMESTAMP_LTZ

Option: Add config map with server-specific fields (dbt-style)

Like above, but put all additional information that may be useful for tooling into a config, meta, ... structure.

my_field_1:
  description: Example for AVRO with Timestamp (millisecond precision)
  type: timestamp
  example: 1970 00:00:00.000 UTC
  config:
    avroType: long
    avroLogicalType: timestamp-millis
my_field_2:
  description: Example for AVRO with Timestamp (microsecond precision)
  type: timestamp
  example: 1970 00:00:00.000000 UTC
  config:
    avroType: long
    avroLogicalType: timestamp-micros
my_field_3:
  description: Example for AVRO with Local timestamp (millisecond precision)
  type: timestamp_ntz
  example: 1970 00:00:00.000
  config:
    avroType: long
    avroLogicalType: local-timestamp-millis
my_field_4:
  description: Example for AVRO duration
  type: string
  example: P3DT12H
  config:
    avroType: fixed
    avroLogicalType: duration
@ryancollingwood

Great question, thank you for raising this - some very off the top of my head thoughts - I welcome all disagreement and agreement.

Going with my gut feeling, physicalType seems the most bang for the buck in terms of effort, and also if the decision needed to evolve or be reversed.
Another thought: how much of option 3 could we achieve if we know we are targeting a particular server and have the physical data type as expressed in option 2?

Generally on types I see it potentially in three dimensions:

  • logical - the highest level of abstraction we can get away with for systems interoperability
  • physical - the lowest level of abstraction we can get away with for persistence purposes
  • semantic - what does this mean to people

@pixie79

pixie79 commented May 8, 2024

I like the config map style option. It keeps it simple for those who need simple and allows more refined data types to be added later. It would also allow a parser to easily output, for example, an AVRO schema later, or to start with an AVRO schema and write out the start of a data contract.
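
As a rough sketch of the parser idea, a field with a config map could be turned into an Avro field definition as follows. The `avroType` and `avroLogicalType` keys come from the examples in this issue; the function `avro_field` is hypothetical, and it only covers primitive Avro types (an Avro `fixed` type would additionally need a name and a size).

```python
import json


def avro_field(name: str, field: dict) -> dict:
    """Build an Avro field definition from a data contract field (hypothetical helper)."""
    config = field.get("config", {})
    avro_type = config.get("avroType")
    if avro_type is None:
        raise ValueError(f"field {name!r} has no avroType in its config map")

    # Avro attaches a logical type as an annotation on the underlying type.
    type_schema = {"type": avro_type}
    if "avroLogicalType" in config:
        type_schema["logicalType"] = config["avroLogicalType"]

    result = {"name": name, "type": type_schema}
    if "description" in field:
        result["doc"] = field["description"]
    return result


field = {
    "description": "Example for AVRO with Timestamp (millisecond precision)",
    "type": "timestamp",
    "config": {"avroType": "long", "avroLogicalType": "timestamp-millis"},
}
print(json.dumps(avro_field("my_field_1", field), indent=2))
```

The reverse direction (starting from an Avro schema and writing a contract) would read `type` and `logicalType` back into the config map in the same way.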

@enriquecatala

I like the config map style option.

@ryancollingwood

My concern with the config map option is it may introduce a lot of "engineer" talk into a document that ideally should be reviewed and used by a broad audience.

I believe this could be handled either by doing the mapping internally or by modelling the specifics of the datatype (e.g. the precision, min/max value, time aware, encoding, scale). Those are things that could apply to a variety of datatypes and have semantic significance: they make things explicit, potentially without making the document unapproachable. I'd need some worked examples to verify this.

I'll try to carve some time to work through some examples

@pixie79

pixie79 commented May 14, 2024

I don't disagree with the statement about a lot of "engineer" talk. However, I fear that argument may already be shot down, looking at the following support for JSON:

https://github.com/datacontract/datacontract-cli/blob/94ac626e2624f38fa5b7948a6539b44405f63f4d/tests/fixtures/local-json-complex/datacontract.yaml#L11

At the same time, I think it is very important that we define these using a common method, as they are also useful for the data tests, as long as we can map those standards to the correct export types for each of the different export methods.

As for the non-engineers, maybe we offer a --light style option that strips out the technical detail, leaving just a summary for non-technical implementers to review. Or even a toggle in the HTML viewer to switch between both, which would allow multiple parties to review the same document. I do think there is great benefit in keeping all the detail in one place to avoid duplication, but it also means we need to be able to display what is appropriate to each group of users and, at the same time, ensure that when we export to AVRO, JSON, etc., we put the maximum detail possible into the schemas (or tests).
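
A "light" view along these lines could be sketched as a function that drops the technical config maps and leaves the rest of the contract untouched. This is purely illustrative: `--light` is a suggestion from this comment, not an existing CLI flag, and `light_view` is a hypothetical helper operating on an already-parsed contract.

```python
import copy


def light_view(contract: dict) -> dict:
    """Return a copy of the contract without model- and field-level config maps."""
    light = copy.deepcopy(contract)  # leave the full contract untouched
    for model in light.get("models", {}).values():
        model.pop("config", None)
        for field in model.get("fields", {}).values():
            field.pop("config", None)
    return light


contract = {
    "models": {
        "orders": {
            "config": {"avroNamespace": "my.namespace"},
            "fields": {
                "my_field_1": {
                    "description": "Order timestamp",
                    "type": "timestamp",
                    "config": {"avroType": "long"},
                }
            },
        }
    }
}

print(light_view(contract))
```

Because the full contract stays the single source of truth, the same document can serve both audiences, and exports can still use every detail.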

@ShasTheMass

This will be a useful PR to add.
Does anyone think following a language-agnostic type system could help? I wonder if something like Apache Arrow could be used.

For any other sort of type, I like physicalType and the server-specific fields.

@simonharrer
Contributor

We decided to go with Option: Add config map with server-specific fields (dbt-style).

We decided to support a config map on model and field level. A config map may include any additional key-value pairs and support multiple server type bindings.

Example:

models:
  orders:
    config:
      avroNamespace: "my.namespace"
    fields:
      my_field_1:
        description: Example for AVRO with Timestamp (millisecond precision)
        type: timestamp
        example: 1970 00:00:00.000 UTC
        config:
          avroType: long
          avroLogicalType: timestamp-millis
          snowflakeType: timestamp_tz
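
To illustrate how multiple server type bindings in one config map might be read, here is a minimal sketch working on the example above as a parsed dictionary. The `<server>Type` key pattern is inferred from the example keys (`avroType`, `snowflakeType`) and is an assumption, as are the helper names `field_server_type` and `model_config`.

```python
from typing import Optional

# The example contract above, as it might look after YAML parsing.
CONTRACT = {
    "models": {
        "orders": {
            "config": {"avroNamespace": "my.namespace"},
            "fields": {
                "my_field_1": {
                    "type": "timestamp",
                    "config": {
                        "avroType": "long",
                        "avroLogicalType": "timestamp-millis",
                        "snowflakeType": "timestamp_tz",
                    },
                }
            },
        }
    }
}


def model_config(contract: dict, model: str) -> dict:
    """Return the model-level config map, or an empty dict if there is none."""
    return contract["models"][model].get("config", {})


def field_server_type(contract: dict, model: str, field: str, server: str) -> Optional[str]:
    """Look up '<server>Type' in the field-level config map (assumed naming pattern)."""
    config = contract["models"][model]["fields"][field].get("config", {})
    return config.get(f"{server}Type")


print(model_config(CONTRACT, "orders").get("avroNamespace"))
print(field_server_type(CONTRACT, "orders", "my_field_1", "snowflake"))
print(field_server_type(CONTRACT, "orders", "my_field_1", "bigquery"))
```

When a binding is missing for a server, as in the last call, tooling could fall back to a default derived from the logical `type`.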


6 participants