0.15.0 upgrade #46

Merged
merged 11 commits into dbt-labs:master on Mar 23, 2020

Conversation

SamKosky

Additions/Fixes

  • Incorporated fixes from current pull requests (8)
  • Added missing 0.15.0 functionality
  • Databricks Delta compatibility
  • incremental_strategy = merge (Databricks) or insert_overwrite
  • Additional model config options: partition_by, cluster_by, num_buckets, location (see the sketch below)
  • Cache schema and relations
  • Updated README

dbt-integration-tests all pass. I haven't added many unit tests at this stage, so this is something to work on.
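
For illustration, a model exercising the new config options listed above might look something like the following sketch; the model, column, and path names are hypothetical, and the exact option semantics are described in the README.

{{
  config(
    materialized='table',
    partition_by=['date_day'],
    cluster_by=['customer_id'],
    num_buckets=8,
    location='/mnt/lake/analytics/events_by_day'
  )
}}

select * from {{ ref('stg_events') }}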

@jtcohen6
Contributor

@SamKosky Thank you for your tremendous effort in consolidating and coalescing all the proposed changes to the dbt-spark plugin! I know that we haven't been able to turn around several of these features as quickly as we'd hoped.

From here on, I'm going to be running point as the primary maintainer of this plugin, with support from @drewbanin and @beckjake. Over the next week, I'm going to work through the open PRs one by one. Once all of those PRs are merged, would you be able to rebase this branch on top of the latest master branch? At that point, we'd be able to cut the 0.15.0 release and get it onto PyPI.

@jtcohen6
Contributor

jtcohen6 commented Jan 30, 2020

After spending the start of this week re-familiarizing myself with the open issues and PRs, it's become clear to me that there's a lot of really heroic work baked into this PR. I'm particularly excited about the addition of incremental_strategy for Delta-only merge.

In areas where this PR overlaps with other currently open PRs—the extensions to get_catalog in #39 and #41, and the addition of several config elements in #43—I'm hopeful that we can reconcile any implementation differences and ship these features with the 0.15.0 release.

I'm also interested in whether we can leverage the show table extended in ... like '*' method (outlined in #49 + #50) to improve on the caching that you've implemented, as well as speeding up the get_catalog method.

@beckjake: Would you be able to look at some of the deeper Python elements, particularly the switch to dataclasses for 0.15.0?

Contributor

@beckjake left a comment

Some feedback on the dataclasses parts—they mostly look pretty good. Also a quick note on a potential type-narrowing issue with floats.

self,
column: str,
dtype: str,
comment: str = None
Contributor

If comments can be None, instead just do:

@dataclass
class SparkColumn(Column):
  column: str
  dtype: str
  comment: Optional[str] = None

That will add comment as a new optional str field, which is what it is based on this argument spec.

Author

Perfect, thanks.


@property
def type(self):
    return 'spark'

def _connection_keys(self):
-   return ('host', 'port', 'cluster', 'schema')
+   return 'host', 'port', 'schema', 'organization'
Contributor

Is cluster secret? If it's not, it's probably worth including again so it shows up in dbt debug output.

Author

Cluster is just a random ID assigned by Spark/Databricks. I'll update this now.

@@ -136,6 +108,10 @@ def execute(self, sql, bindings=None):
ThriftState.FINISHED_STATE,
]

# Convert decimal.Decimal to float as PyHive doesn't work with decimals
if bindings:
bindings = [float(x) if isinstance(x, decimal.Decimal) else x for x in bindings]
Contributor

Will this plugin have to handle numbers that double-precision floats can't represent? For example, 2^64 + 1 can't be represented exactly by a float. Does PyHive accept strs? If so, you could use str in place of float to get exact representations.

Author

I think it's fairly unlikely. If it were to happen, chances are strings would already be in use instead of trying to represent it as a decimal. But I'm not opposed to changing this to str if you think it's a better approach.

@dataclass(frozen=True, eq=False, repr=False)
class SparkRelation(BaseRelation):
    quote_policy: SparkQuotePolicy = SparkQuotePolicy()
    include_policy: SparkIncludePolicy = SparkIncludePolicy()
Contributor

Assuming this wasn't intended to change the quote character to ", I think you'll also want to include:

quote_character: str = '`'

Author

I've actually overridden SparkQuotePolicy and set everything to not include quotes. I'm using this adapter with Databricks, and there are restrictions on table names, specifically:

Valid names only contain alphabetic characters, numbers and _.

I'm not sure if this is the same with Spark, but I can only assume so. I'd like to just enforce no quotes if possible.

Contributor

It looks like Spark itself encodes that restriction on object name validity. I checked and saw that it does not enforce the same validation on column names, however.

There's a lot of dbt and dbt-adjacent functionality that uses the quoted versions of columns and relations by default, since it gives us more flexibility across databases. In general, even though quoting is far less important on Spark than on Redshift or Snowflake, I think it's a better idea to keep quoting implemented in this plugin, rather than needing to reimplement dozens of common macros.

A specific example that came up today: the dbt_utils.equality test uses the quoted attribute from get_columns_from_relation, and missing that attribute caused an integration test to fail. I think our answer would be to override the quoted property of the SparkColumn class in the same way BigQuery does here.

Author

Makes sense. I have implemented this change.

Contributor

@jtcohen6 left a comment

@SamKosky Thanks for your patience on this. I think we're very close to merging the three currently open PRs, all of which are narrow in scope. At that point, I'm hopeful that we can test these changes, resolve any conflicts, and ship version 0.15.0.

In the meantime, would you be able to rebase this branch onto master?

@tongqqiu

@SamKosky Thanks for your contribution. I have a question about table overwrites. When a model is a "table", the current behavior is to drop and recreate the table. Since Spark doesn't support transactions, it isn't ideal to drop the table first. I noticed that your change creates a temp table and renames it, which is great! I wonder if it's possible to support the "INSERT OVERWRITE" statement (https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html). That way, we could use Delta tables' time travel feature. Any suggestions on how to make that an option?

@SamKosky
Author

SamKosky commented Feb 28, 2020

@SamKosky I wonder if it's possible to support the "INSERT OVERWRITE" statement?

@tongqqiu If you want to use the Delta functionality, you need to change your materialisations to incremental. Check out the updated README: once you've changed your model config to use materialized='incremental', you then need to choose between an incremental_strategy of insert_overwrite (the default) or merge. The latter takes advantage of the new Databricks Delta functionality. We're using this for one of our clients and it seems to be working well 😄
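
As a rough illustration (the model, key, and column names here are hypothetical; the exact options are documented in the README), an incremental model using the merge strategy might be configured along these lines:

{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id'
  )
}}

select * from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}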

@tongqqiu

tongqqiu commented Mar 2, 2020

@SamKosky Thanks for the information. I'm working on a PR to make snapshots work on the Spark Delta format. Please take a look when you have time: SamKosky#1

Contributor

@jtcohen6 left a comment

@SamKosky I left a few comments, specifically on the view, table, and incremental materializations.

Beyond that, I think the big blocker to shipping this code is rebasing or merging in the changes from master. Is that something you'd be able to do in the near future? I'm hoping to cut a release next week.

-- build model
{% call statement('main') -%}
-   {{ create_table_as(False, target_relation, sql) }}
+   {{ create_table_as(False, intermediate_relation, sql) }}
Contributor

We switched (in #17) from using the default (Postgres/Redshift) approach (creating an intermediate table and performing a rename-swap) to a naive approach (drop the existing table, create it anew) for two main reasons:

  1. The appeal of this clever approach on Postgres/Redshift is the atomicity of running the whole thing inside of a transaction (except for the final drop). Spark doesn't have transactions, so if a query fails, it gets stuck midway through the process. It felt better to have a simple/intuitive process of only two steps (drop + create) than potentially have __dbt_tmp and __dbt_backup relational objects hanging around.
  2. In Databricks (and potentially other hosted Spark providers), if it's a managed table, renaming a table physically moves the contents of that table from one file location to another. We found that the alter ... rename step can take longer than the create step.

I'd be open to hearing arguments in favor of returning to this default, cleverer approach.
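
For reference, the two sequences look roughly like this (table names are illustrative):

-- default (Postgres/Redshift-style) rename-swap approach
create table my_model__dbt_tmp as select ...;
alter table my_model rename to my_model__dbt_backup;
alter table my_model__dbt_tmp rename to my_model;
drop table my_model__dbt_backup;

-- naive approach adopted in #17
drop table if exists my_model;
create table my_model as select ...;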

@jtcohen6 For the drop + create approach, the table is not available before the create completes, right? Any suggestions if I want to use that for a dim table? That table might be temporarily unavailable.

Contributor

@jtcohen6 Mar 24, 2020

@jtcohen6 For the drop + create approach, the table is not available before the create completes, right? Any suggestions if I want to use that for a dim table? That table might be temporarily unavailable.

That's correct, but it's true for the default/Redshift/Postgres approach too (create ... __dbt_tmp + alter ... rename + alter ... rename + drop), because Spark doesn't have transactions. The table will be unavailable while the second alter ... rename is occurring, and we found that many renames (especially for managed tables) take longer than simply recreating the table from scratch. So the drop + create approach is both simpler and, in many cases, results in less overall downtime.

The only atomic operation currently available in Spark for updating/replacing a table is insert overwrite. Though it feels a bit counterintuitive, you could materialize your dim table as an incremental model, and re-select all the data (= overwrite all partitions) every time.

I think the best answer is that create or replace table is coming in Spark 3.0.0.
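
To illustrate that workaround, an incremental model using the insert_overwrite strategy that re-selects all of its data on each run would compile to something along these lines (table names are illustrative):

insert overwrite table analytics.dim_customers
select * from analytics.stg_customers;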

Thanks for your comments.

{%- set identifier = model['alias'] -%}
{%- set tmp_identifier = model['name'] + '__dbt_tmp' -%}
Contributor

In the case of the view materialization, since Spark supports create or replace view, I think we should move in the direction of Snowflake, BigQuery, and Presto (dbt-labs/dbt-presto#13) and away from the clever transactional approach in Postgres/Redshift (default).

I think this materialization would look like:

{% materialization view, adapter='spark' -%}
    {{ return(create_or_replace_view()) }}
{%- endmaterialization %}

And in the adapter macro:

{% macro spark__create_view_as(relation, sql) -%}
  create or replace view
    {{ relation }}
  as
    {{ sql }}
{% endmacro %}

This would slim the materialization code down significantly, and leverage some existing art in dbt-core's create_or_replace_view macro.

{%- set old_relation = none -%}
{%- endif %}
{% macro dbt_spark_get_incremental_sql(strategy, source, target, unique_key) %}
{%- if strategy == 'insert_overwrite' -%}
Contributor

I split the insert_overwrite approach into its own macro, get_insert_overwrite_sql, in #60. I think you should pull out the merge SQL into a get_merge_sql macro, and then dbt_spark_get_incremental_sql can call get_insert_overwrite_sql and get_merge_sql based on the strategy.

A note to myself: dbt-core implements a common get_merge_sql macro here, and the syntax for Spark is roughly identical—it looks like Spark lets you use * for update set + insert, where other databases require an explicit list of columns. We're changing the names of a few of these macros in the 0.16.0 release of dbt, so it probably makes more sense to connect this back to the core macros for the 0.16.0 release of the plugin.
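
As a rough sketch of that split (macro bodies, argument order, and aliases here are illustrative, not the final implementation):

{% macro get_merge_sql(target, source, unique_key) %}
    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
    when matched then update set *
    when not matched then insert *
{% endmacro %}

{% macro dbt_spark_get_incremental_sql(strategy, source, target, unique_key) %}
    {%- if strategy == 'insert_overwrite' -%}
        {{ get_insert_overwrite_sql(source, target) }}
    {%- else -%}
        {{ get_merge_sql(target, source, unique_key) }}
    {%- endif -%}
{% endmacro %}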

@jtcohen6
Contributor

Update: @beckjake has offered to rebase this, in time for a release next week. Let me know if that's okay by you, or if you'd prefer to see it through yourself.

@SamKosky
Author

@jtcohen6 @beckjake that sounds amazing! I've been pretty flat out with work at the moment so might not have the time to get to this. Much appreciated!

@beckjake mentioned this pull request Mar 20, 2020
@beckjake merged commit 10741ff into dbt-labs:master Mar 23, 2020
@jtcohen6 mentioned this pull request Apr 8, 2020