
Pull the owner from the DESCRIBE EXTENDED #39

Merged: 18 commits into master, Mar 16, 2020

Conversation

@Fokko (Contributor) commented Dec 4, 2019

I would like to add the owner to the docs, so I've written a PR that does a DESCRIBE EXTENDED to get the relevant fields. I've added a unit test and tested it against Databricks/Azure.
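
For context, the rows that come back from a describe extended statement look roughly like the sketch below: the regular column rows come first, followed by a blank separator row and a # Detailed Table Information section that carries the Owner among other table properties. This is an illustrative shape only, not output captured from this PR; the col_name / data_type / comment keys mirror the headers of Spark's DESCRIBE output.

# Illustrative shape of `DESCRIBE EXTENDED my_db.my_table` results.
rows = [
    {"col_name": "id",         "data_type": "bigint", "comment": None},
    {"col_name": "first_name", "data_type": "string", "comment": None},
    # ... remaining column rows ...
    {"col_name": "",           "data_type": "",       "comment": ""},  # blank separator row
    {"col_name": "# Detailed Table Information", "data_type": "", "comment": ""},
    {"col_name": "Database",   "data_type": "my_db",  "comment": ""},
    {"col_name": "Owner",      "data_type": "root",   "comment": ""},  # the field this PR extracts
]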

@drewbanin (Contributor)

hey @Fokko - can you merge master into this branch? After you do that, I'll kick off the unit + integration tests. I'll give this a deeper look in the coming week :)

@Fokko (Contributor Author) commented Dec 21, 2019

Sure, I've pulled master into this. Since we have integration testing now as well, I'll check if I can come up with a better test :-)

@jtcohen6 (Contributor)

I'm having difficulty testing this branch: describe extended does not work in my Spark environment (Databricks on AWS). This causes dbt's get_catalog method to raise the error 'NoneType' object has no attribute 'lower'.

describe `my_db`.`my_table`
COLUMN_NAME DATA_TYPE PK NULLABLE DEFAULT AUTOINCREMENT COMPUTED REMARKS POSITION
id BIGINT NO YES   NO NO   1
first_name STRING(255) NO YES   NO NO   2
last_name STRING(255) NO YES   NO NO   3
email STRING(255) NO YES   NO NO   4
gender STRING(255) NO YES   NO NO   5
ip_address STRING(255) NO YES   NO NO   6
describe extended `my_db`.`my_table`
No object named extended `my_db`.`my_table` found!

It seems like the statement is being parsed incorrectly somehow. I wonder if there's a session parameter or cluster configuration I'm missing. I'm running:

Databricks Runtime Version 6.2 (includes Apache Spark 2.4.4, Scala 2.11)

@Fokko I'm curious if you've encountered a similar issue before?

@Fokko (Contributor Author) commented Jan 29, 2020

Thanks @jtcohen6 for giving this a try.

I'm currently on holiday, so I don't have access to a production environment, but I think you've hit an edge case here. Running ANALYZE TABLE my_db.my_table COMPUTE STATISTICS; should populate the stats. We added this as a post-hook:

{{ config(
    file_format='delta',
    post_hook=[
        'OPTIMIZE {{ this }}',
        'ANALYZE TABLE {{ this }} COMPUTE STATISTICS'
    ]
) }}

If you didn't do this, you'll probably get the error that you're seeing. Allow me to push a patch.

@Fokko (Contributor Author) commented Jan 29, 2020

@jtcohen6 Can you share the full stack trace?

@jtcohen6 (Contributor)

@Fokko Ah! Sorry to bother you on holiday.

I figured out what's going wrong on my end: describe extended does work when I run it from dbt or a Databricks notebook, but not from a SQL runner that connects via JDBC driver. The more I know:

(screenshot omitted)

Here's the trace of what's going wrong:

2020-01-29 14:54:25,297 (MainThread): 'NoneType' object has no attribute 'lower'
2020-01-29 14:54:25,310 (MainThread): Traceback (most recent call last):
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 82, in main
    results, succeeded = handle_and_check(args)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 151, in handle_and_check
    task, res = run_from_args(parsed)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 216, in run_from_args
    results = task.run()
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/task/generate.py", line 209, in run
    results = adapter.get_catalog(manifest)
  File "/Users/jerco/dev/product/dbt-spark/dbt/adapters/spark/impl.py", line 184, in get_catalog
    columns += self._parse_relation(relation, table_columns, rel_type, properties)
  File "/Users/jerco/dev/product/dbt-spark/dbt/adapters/spark/impl.py", line 146, in _parse_relation
    logger.info(column.data_type)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/adapters/base/relation.py", line 356, in data_type
    if self.is_string():
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/adapters/base/relation.py", line 365, in is_string
    return self.dtype.lower() in ['text', 'character varying', 'character',
AttributeError: 'NoneType' object has no attribute 'lower'

It looks like this is the offending line. I'm not sure if it's because:

  • We might be passing None as the dtype (the empty line between the last column and # Detailed Table Information?)
  • Spark's string datatype is actually called string, which is not in the default implementation. I know that the BigQuery adapter has to reimplement the Column class for similar reasons.
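
A minimal reproduction of the first bullet, using a simplified stand-in for dbt 0.14.x's Column class rather than the real one:

# Simplified stand-in for dbt 0.14.x's Column; not the real implementation.
class Column:
    def __init__(self, column, dtype):
        self.column = column
        self.dtype = dtype

    def is_string(self):
        # Mirrors the check in dbt/adapters/base/relation.py:
        # a None dtype has no .lower(), so this raises AttributeError.
        return self.dtype.lower() in ['text', 'character varying', 'character']


# The blank separator row before '# Detailed Table Information'
# carries no data type, so dtype ends up as None:
separator = Column(column='', dtype=None)
separator.is_string()  # AttributeError: 'NoneType' object has no attribute 'lower'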

@jtcohen6 (Contributor) commented Feb 6, 2020

@drewbanin Would you be able to take a look at the error in _parse_relation related to None data type? Once we can get dbt docs generate to work, I think this PR is in solid shape.

@Fokko (Contributor Author) commented Feb 6, 2020

I think the empty line is being transformed into a None; I can do additional checks today or tomorrow.

@drewbanin (Contributor)

I think the issue here is indeed that describe extended returns a whole lot of information besides just the column names + types. While @Fokko is correct above that the issue is the blank line in the result set, there's also going to be an issue around all of these other metadata fields!

(screenshot omitted)

In spark__get_columns_in_relation, it's not enough to just add extended to the describe statement. Instead, I think we should rip out the ensuing code and replace it with something more sensible:

{% macro spark__get_columns_in_relation(relation) -%}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
    describe extended {{ relation }}
  {% endcall %}

  {% set table = load_result('get_columns_in_relation').table %}
  -- TODO : Remove this line
  -- {{ return(sql_convert_columns_in_relation(table)) }}

  -- TODO : Call an adapter method (exposed in the compilation context)
  --        which will convert the raw `describe extended` results into a list of columns
  {{ return(adapter.parse_describe_extended(table)) }}

{% endmacro %}

For reference, the dbt Core 0.14.3 implementation of sql_convert_columns_in_relation is:

https://github.com/fishtown-analytics/dbt/blob/b5aff36253f5c6563e942265bbaa4cb366722b14/core/dbt/include/global_project/macros/adapters/common.sql#L109-L115

Ultimately, this logic is going to look pretty similar to the current contents of _parse_relation in this PR. I think the key difference will be that we should parse the raw results of the describe extended statement before wrapping those results up in Column objects.
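
A rough sketch of what such an adapter method could look like, assuming the describe extended rows arrive as dicts keyed by col_name / data_type / comment. The parse_describe_extended name comes from the macro above; the returned dict keys (name, dtype, comment, table_owner) are illustrative only, not necessarily what the PR ends up with.

from typing import Dict, List


def parse_describe_extended(rows: List[Dict]) -> List[Dict]:
    """Split `describe extended` rows into column rows and table metadata,
    returning plain dicts that can be wrapped in Column objects afterwards."""
    # Column rows come first; a blank col_name marks the start of the
    # '# Detailed Table Information' section.
    pos = next(
        (i for i, row in enumerate(rows) if not row['col_name']),
        len(rows),
    )
    metadata = {row['col_name']: row['data_type'] for row in rows[pos + 1:]}
    return [
        {
            'name': row['col_name'],
            'dtype': row['data_type'],
            'comment': row['comment'],
            'table_owner': metadata.get('Owner'),
        }
        for row in rows[:pos]
    ]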

@jtcohen6 (Contributor)

@Fokko Thank you for making these changes! This PR worked for me when I checked it out last night. I'm kicking off tests now.

@Fokko (Contributor Author) commented Feb 17, 2020

Thanks for the pointers, Drew; this made the code a bit nicer.

I'm unable to change the OWNER in Databricks. Last December this was possible; maybe something has changed. Right now, it is stuck at root:
(screenshot omitted)

I can check against AWS EMR to see if this is possible there. I've tested against Azure Databricks, and DESCRIBE TABLE EXTENDED works properly.

@Fokko (Contributor Author) commented Feb 17, 2020

My pleasure @jtcohen6. Let me know what comes out of it.

@Fokko (Contributor Author) commented Feb 20, 2020

@jtcohen6 The error looks unrelated. After this one, I'll continue on two other PRs.

@jtcohen6 (Contributor) commented Feb 24, 2020

@Fokko I figured out why the integration test is failing!

  • Here the test uses the dbt_utils.equality macro.
  • Here is where that macro expects that adapter.get_columns_in_relation returns an array of column dicts, and that those dicts include a quoted attribute

The implementation of parse_describe_extended does not currently return a quoted attribute. It really just needs to be the column name wrapped in backticks, `column_name`.

Other adapters' implementations of get_columns_in_relation handle this by instantiating each result as an api.Column object, which has a quoted property. E.g. the Snowflake version calls the common macro sql_convert_columns_in_relation to do just this.

In the long run, I think the better answer here is to reimplement the Column object as SparkColumn, and update the quoted property to use ` instead of ". (BigQuery does the same thing here in 0.14.x, and here in >= 0.15.x.)
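
A minimal sketch of that idea, written as a standalone class rather than an actual subclass of dbt's base Column (which quotes identifiers with double quotes by default):

class SparkColumn:
    def __init__(self, column: str, dtype: str):
        self.column = column
        self.dtype = dtype

    @property
    def quoted(self) -> str:
        # Spark/Hive identifiers are quoted with backticks, not double quotes.
        return '`{}`'.format(self.column)


print(SparkColumn('first_name', 'string').quoted)  # prints `first_name`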

@jtcohen6 (Contributor) left a review comment:

@Fokko Thank you for making these changes! This is looking really close. I think you'll just need to revise the new unit test test_parse_relation, since you've changed the method names.

@jtcohen6 (Contributor) commented Mar 3, 2020

@drewbanin I'm unable to locally replicate the latest integration test failure. Could you take a look when you have a second?

@Fokko (Contributor Author) commented Mar 3, 2020

@jtcohen6 I was already looking into it. I'm not sure what caused it; could it be a race condition? That could also be wishful thinking.
It's also a bit odd that we don't have any logs. I'm curious about the SQL statements.

@Fokko (Contributor Author) commented Mar 5, 2020

@jtcohen6 Can we trigger the tests once more? :)

@jtcohen6 (Contributor) commented Mar 5, 2020

@Fokko :( I'll try to get to the bottom of this

@Fokko (Contributor Author) commented Mar 5, 2020

I’ll dig into it as well

@drewbanin (Contributor) commented Mar 5, 2020

@Fokko @jtcohen6 I gave this a spin locally and found that the two tables (one created via a seed, one created via an incremental model) were created with the columns in a weird order.

This manifests as an error when the incremental model is run twice. It writes columns out of order, which looks something like this:

(screenshot omitted)

Have you two seen anything like this before?

@Fokko (Contributor Author) commented Mar 7, 2020

I got the integration tests working locally as well. I've noticed that the inserted data is shifted by one: the id column is prepended to the statement.

It looks like the seed is not being created properly:

create table reporting.seed (id BIGINT,first_name STRING,last_name STRING,email STRING,gender STRING,ip_address STRING)

The id column should be at the end.

@jtcohen6 (Contributor) left a review comment:

I think I've gotten to the bottom of this. It turns out that this integration test has been a false negative all along. The change in this PR to correctly define Spark's quote character (`) caused the test to finally work, and made me realize that we need an update to our incremental materialization (#59). Together with the proposed changes in #60, we should finally have a passing integration test.

See my one comment about updating the get_columns_in_relation method to avoid including partition columns twice.

def find_table_information_separator(rows: List[dict]) -> int:
    pos = 0
    for row in rows:
        if not row['col_name']:

To avoid including partition columns twice, could this be something like

Suggested change:
-        if not row['col_name']:
+        if not row['col_name'] or row['col_name'] == '# Partition Information':

I'm not sure if that's the most elegant way of writing it, and it would need at least a line break to satisfy PEP 8.
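
Putting the suggestion together with a PEP 8-friendly line break, the helper could end up looking roughly like this (the loop/return structure is a guess based on the truncated snippet above, not the merged code):

from typing import Dict, List


def find_table_information_separator(rows: List[Dict]) -> int:
    """Index of the first row that is not a regular column row."""
    pos = 0
    for row in rows:
        # Stop at the blank separator row or at the start of the
        # '# Partition Information' section, so partition columns
        # are not counted twice.
        if (not row['col_name']
                or row['col_name'] == '# Partition Information'):
            break
        pos += 1
    return pos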

@jtcohen6 (Contributor)

@Fokko Could you take a look at this when you get a chance? There's one small change I think we should make here to avoid duplicate columns in the get_columns_in_relation result.

@Fokko (Contributor Author) commented Mar 16, 2020

Thanks for pinging me, @jtcohen6. I was away last week and the notification got lost. I've updated the PR.

@Fokko (Contributor Author) commented Mar 16, 2020

Yes, just checked to be sure:
(screenshot omitted)

@Fokko (Contributor Author) commented Mar 16, 2020

We can test against Spark when #60 has been merged :)

@jtcohen6 (Contributor)

Thanks @Fokko!

As it is, #39 depends on #60 and #60 depends on #39 for passing integration tests. I've confirmed locally that the two in combination can pass this integration test.

I think I'm going to take what I see as the simplest approach here:

  1. Squash + merge this PR into master
  2. Pull these changes into #60 (Fix: column order for incremental insert overwrite)
  3. Test #60
  4. Merge when tests pass

Does that sound OK to you?

@Fokko (Contributor Author) commented Mar 16, 2020

Sounds good, I'm also happy to cherry-pick #60 onto this branch.

jtcohen6 merged commit 955e816 into dbt-labs:master on Mar 16, 2020.
Fokko deleted the fd-describe-table-spark branch on March 16, 2020 at 20:30.