This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Support to add with properties when creating Trino/Presto tables #58

Merged
merged 1 commit into from
Aug 11, 2021

Conversation

mdesmet
Contributor

@mdesmet mdesmet commented Jul 23, 2021

Fixes #53

By adding with_props in the model, we enable users to add properties when creating tables. Since Presto/Trino supports many different connectors, a Dict[String, String] type provides the necessary flexibility for now.

```
{{
  config(
    materialized='table',
    with_props={
      "format": "'PARQUET'",
      "partitioning": "ARRAY['bucket(id, 2)']",
    }
  )
}}
select 
  * 
from {{ source('inventory', 'products') }}
WHERE weight < 5
```
The following query is executed on Trino:

```
create table "iceberg"."datalake"."low_weight_products__dbt_tmp"
    WITH (format = 'PARQUET',partitioning = ARRAY['bucket(id, 2)'])
  as (
    
select 
  * 
from "postgres"."inventory"."products"
WHERE weight < 5
  )
```
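The rendering step above can be sketched as follows. This is a minimal illustrative sketch, not the plugin's actual Jinja macro; `build_with_clause` is a hypothetical name, and it assumes property values are passed through verbatim (which is why the model config quotes string literals itself, e.g. `"'PARQUET'"`).

```python
def build_with_clause(props):
    """Render a dict of table properties as a Trino/Presto WITH (...) clause.

    Values are not quoted or validated here: they are emitted exactly as
    given, so callers supply SQL-ready fragments like "'PARQUET'" or
    "ARRAY['bucket(id, 2)']".
    """
    if not props:
        return ""
    rendered = ", ".join(f"{key} = {value}" for key, value in props.items())
    return f"WITH ({rendered})"


with_props = {
    "format": "'PARQUET'",
    "partitioning": "ARRAY['bucket(id, 2)']",
}
print(build_with_clause(with_props))
# WITH (format = 'PARQUET', partitioning = ARRAY['bucket(id, 2)'])
```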

@cla-bot

cla-bot bot commented Jul 23, 2021

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @mdesmet

@bachng2017

hello @mdesmet
Do you think it is necessary to apply the same implementation to __create_csv_table as well? When seeding, that macro is called instead of create_table_as.

@bachng2017

I need a comma here, before '\n', to run with our Trino/Hive configuration. Not sure about others.
650fb27#diff-0b5044213c0ce72456e38a752114fedb09c0623a7256be2a2fa3c6b4a7291f2bR65

@mdesmet
Contributor Author

mdesmet commented Aug 8, 2021

@bachng2017: the comma is indeed necessary. Serves as a good reminder to myself to actually test before committing last minute changes.

Two follow-up points:

  • The seed macro should indeed also be adapted.
  • We should probably add a test for this feature to the integration tests.

@mdesmet
Contributor Author

mdesmet commented Aug 8, 2021

Hi @bachng2017 ,

I have an iceberg docker setup running and there it works for both the seeds and the models.

dbt_project.yml:

```
seeds:
  trino_project:
      +catalog: iceberg
      +schema: datalake
      +with_props:
          format: "'PARQUET'"
          partitioning: ARRAY['bucket(series_reference, 2)']
```

This generates the following CREATE TABLE statement on Trino:

```
create table "iceberg"."datalake"."trx" (Series_reference VARCHAR, Period DOUBLE, Data_value DOUBLE, Suppressed VARCHAR, STATUS VARCHAR, UNITS VARCHAR, Magnitude INTEGER, Subject VARCHAR, Group_name VARCHAR, Series_title_1 VARCHAR, Series_title_2 VARCHAR, Series_title_3 VARCHAR, Series_title_4 INTEGER, Series_title_5 INTEGER)
WITH (format = 'PARQUET',
  partitioning = ARRAY['bucket(series_reference, 2)'])
```

@bachng2017

cool and thanks :) Seems that the maintainers are waiting for your CLA to be signed. Hope this will be merged soon


@mdesmet
Contributor Author

mdesmet commented Aug 9, 2021

The CLA has been signed after the first commit.

@jtcohen6: any other remarks before this PR can be merged?

Contributor

@jtcohen6 jtcohen6 left a comment


@mdesmet Thanks for the PR, and apologies for the delayed follow-up!

The actual substance of the change here looks fine by me. I have two questions:

  • Do you have a strong feeling about naming the config with_props, versus properties or with_properties? I'm trying to find a canonical name for what these things are in Presto/Trino documentation (a la options in BigQuery or Spark), but a really satisfying answer is eluding me.
  • Did you consider implementing this in the way recommended by Support for table properties #53, and as we've done for other adapters, where each with property would be configured + validated independently? The biggest reason to prefer that approach is that it allows users to set/clobber each config independently, so e.g. they could set format for the entire project but then independently set another_property without clobbering the entire with_props dictionary. The biggest reason to prefer the generic dict approach is if there are dozens/hundreds of properties available, and we'd want to support all of them arbitrarily without naming/validating them specifically—as discussed a bit in Support setting table OPTIONS using config dbt-spark#147 (comment).

In any case, you certainly deserve props for the solid PR! :)
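The set/clobber trade-off described above can be illustrated with a small sketch. This is not plugin code; the dict names are hypothetical, and it assumes standard Python dict semantics to contrast the two config-inheritance behaviors.

```python
# Project-level and model-level property configs (illustrative values).
project_level = {"format": "'PARQUET'", "another_property": "'x'"}
model_level = {"partitioning": "ARRAY['bucket(id, 2)']"}

# Generic-dict behavior: a model-level with_props replaces the whole
# project-level dict, so project-level "format" is lost.
clobbered = model_level

# Per-property behavior: each key is inherited and overridden
# independently, so project-level settings survive alongside
# model-level ones.
merged = {**project_level, **model_level}

print(clobbered)  # only the model-level key remains
print(merged)     # format and another_property survive alongside partitioning
```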


```
import agate


@dataclass
class PrestoConfig(AdapterConfig):
```
Contributor

👍

@mdesmet
Contributor Author

mdesmet commented Aug 10, 2021

Hi @jtcohen6,

The concept is called table properties, so maybe properties is a better name.

I have considered the approach mentioned in #53. However, Presto/Trino differs from Athena in that it supports many more connectors.

For example, take the Kudu connector:

```
CREATE TABLE user_events (
  user_id int WITH (primary_key = true),
  event_name varchar WITH (primary_key = true),
  message varchar,
  details varchar WITH (nullable = true, encoding = 'plain')
) WITH (
  partition_by_hash_columns = ARRAY['user_id'],
  partition_by_hash_buckets = 5,
  number_of_replicas = 3
);
```

Another example is the Hive connector:

```
CREATE TABLE hive.avro.avro_data (
   id bigint
)
WITH (
   format = 'AVRO',
   avro_schema_url = '/usr/local/avro_data.avsc'
)
```

Another example is the ClickHouse connector:

```
CREATE TABLE default.trino_ck (
  id int NOT NULL,
  birthday DATE NOT NULL,
  name VARCHAR,
  age BIGINT,
  logdate DATE NOT NULL
)
WITH (
  engine = 'MergeTree',
  order_by = ARRAY['id', 'birthday'],
  partition_by = ARRAY['toYYYYMM(logdate)'],
  primary_key = ARRAY['id'],
  sample_by = 'id'
);
```

It would be difficult to maintain and keep up with all connectors, hence the freedom of the Dict[str, str] approach.

@jtcohen6
Contributor

Thanks for the thorough response @mdesmet! I'm happy with properties as the config name, and I totally buy your rationale about there being too many connector-specific properties to state them all explicitly. That flexibility is worth the cost of losing finer-grained config inheritance + validation.

We don't currently have a suite of custom tests for Presto-specific functionality in this plugin. The code change looks straightforward, and most of the logic lives inside Jinja macros, so users have the ability to override/reimplement if they need to. If this is something you've tested locally, and verified that it's working for the cases you're interested in, I'd be ok with merging this for inclusion in v0.21.

@mdesmet
Contributor Author

mdesmet commented Aug 11, 2021

Great, it's tested locally and working. Automating the tests is definitely possible by including a few Docker images; that's how I've tested it, see https://github.com/innoverio/modern-data-stack

@jtcohen6
Contributor

Very cool repo! I hadn't seen it before; thanks for linking to it.

I'm thinking more along the lines of custom dbt integration tests to automate in CI, such as these ones for dbt-spark. That leverages a testing framework we've played around with, but haven't all the way locked down. If we find we need automated testing for more complex dbt-presto features, we should set up similar scaffolding.

@jtcohen6 jtcohen6 merged commit 43c6e2b into dbt-labs:master Aug 11, 2021
@jtcohen6
Contributor

I'll backport onto 0.21.latest now, so that this feature will become available in the next v0.21 prerelease.

@bachng2017

hello @mdesmet.
Talking about this, do you think we should have two kinds of properties: the current one and something like table_properties?

In the implementing macro, we would then merge those two properties together before creating the final one. This would help solve the issue of common vs. individual settings while making sure users could use various formats for different connectors.

For example, we could have `transactional: true` set at the project level and `partitioned_by` set at the table level of the config file or in each model. However, increasing the number of properties keywords might make users more confused.

@mdesmet
Contributor Author

mdesmet commented Aug 20, 2021

@bachng2017 : Can you create a new issue for further discussion?

I have the following thoughts:

  • How many levels of merging exist (project level, directory level, ...)? Any key of the properties set at a lower level would override or add to the keys from a higher level.
  • Why name it differently when it really is the same set of properties?
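The level-by-level merge described in the first point could be sketched as follows. This is a hypothetical illustration, not dbt internals: `merge_properties` and the level names are made up, and it assumes a simple "lower level wins per key" rule.

```python
def merge_properties(*levels):
    """Merge property dicts from highest to lowest level.

    Each level is a dict (or None); keys set at a lower level override
    the same keys from higher levels, while unrelated keys are kept.
    """
    merged = {}
    for level in levels:
        merged.update(level or {})
    return merged


project = {"transactional": "true", "format": "'PARQUET'"}
directory = {"format": "'ORC'"}        # overrides the project-level format
model = {"partitioned_by": "ARRAY['logdate']"}  # adds a model-level key

print(merge_properties(project, directory, model))
# {'transactional': 'true', 'format': "'ORC'", 'partitioned_by': "ARRAY['logdate']"}
```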
