
Pull the owner from the DESCRIBE EXTENDED #39

Merged: 18 commits into master, Mar 16, 2020

Conversation

@Fokko (Contributor) commented Dec 4, 2019

I would like to add the owner to the docs, so I've written a PR that does a DESCRIBE EXTENDED to get the relevant fields. I've added a unit test and tested it against Databricks/Azure.
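
For context, the rows that come back from a describe extended statement look roughly like the sketch below: the regular column rows come first, followed by a blank separator row and a # Detailed Table Information section that carries the Owner among other table properties. This is an illustrative shape only, not output captured from this PR; the col_name / data_type / comment keys mirror the headers of Spark's DESCRIBE output.

# Illustrative shape of `DESCRIBE EXTENDED my_db.my_table` results.
rows = [
    {"col_name": "id",         "data_type": "bigint", "comment": None},
    {"col_name": "first_name", "data_type": "string", "comment": None},
    # ... remaining column rows ...
    {"col_name": "",           "data_type": "",       "comment": ""},  # blank separator row
    {"col_name": "# Detailed Table Information", "data_type": "", "comment": ""},
    {"col_name": "Database",   "data_type": "my_db",  "comment": ""},
    {"col_name": "Owner",      "data_type": "root",   "comment": ""},  # the field this PR extracts
]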

@drewbanin (Contributor)

hey @Fokko - can you merge master into this branch? After you do that, I'll kick off the unit + integration tests. I'll give this a deeper look in the coming week :)

@Fokko (Contributor Author) commented Dec 21, 2019

Sure, I've pulled master into this. Since we have integration testing now as well, I'll check if I can come up with a better test :-)

@jtcohen6 (Contributor)

I'm having difficulty testing this branch: describe extended does not work in my Spark environment (Databricks on AWS). This causes dbt's get_catalog method to raise the error 'NoneType' object has no attribute 'lower'.

describe `my_db`.`my_table`
COLUMN_NAME DATA_TYPE PK NULLABLE DEFAULT AUTOINCREMENT COMPUTED REMARKS POSITION
id BIGINT NO YES   NO NO   1
first_name STRING(255) NO YES   NO NO   2
last_name STRING(255) NO YES   NO NO   3
email STRING(255) NO YES   NO NO   4
gender STRING(255) NO YES   NO NO   5
ip_address STRING(255) NO YES   NO NO   6
describe extended `my_db`.`my_table`
No object named extended `my_db`.`my_table` found!

It seems like the statement is being parsed incorrectly somehow. I wonder if there's a session parameter or cluster configuration I'm missing. I'm running:

Databricks Runtime Version 6.2 (includes Apache Spark 2.4.4, Scala 2.11)

@Fokko I'm curious if you've encountered a similar issue before?

@Fokko (Contributor Author) commented Jan 29, 2020

Thanks @jtcohen6 for giving this a try.

I'm currently on holiday, so I don't have access to a production environment, but I think you've hit an edge case here. Running ANALYZE TABLE my_db.my_table COMPUTE STATISTICS; should populate the stats. We added this as a post-hook:

{{ config(
    file_format='delta',
    post_hook=[
        'OPTIMIZE {{ this }}',
        'ANALYZE TABLE {{ this }} COMPUTE STATISTICS'
    ]
) }}

If you didn't do this, you'll probably get the error that you're seeing. Allow me to push a patch.

@Fokko (Contributor Author) commented Jan 29, 2020

@jtcohen6 Can you share the full stack trace?

@jtcohen6 (Contributor)

@Fokko Ah! Sorry to bother you on holiday.

I figured out what's going wrong on my end: describe extended does work when I run it from dbt or a Databricks notebook, but not from a SQL runner that connects via JDBC driver. The more I know:

(screenshot omitted)

Here's the trace of what's going wrong:

2020-01-29 14:54:25,297 (MainThread): 'NoneType' object has no attribute 'lower'
2020-01-29 14:54:25,310 (MainThread): Traceback (most recent call last):
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 82, in main
    results, succeeded = handle_and_check(args)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 151, in handle_and_check
    task, res = run_from_args(parsed)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/main.py", line 216, in run_from_args
    results = task.run()
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/task/generate.py", line 209, in run
    results = adapter.get_catalog(manifest)
  File "/Users/jerco/dev/product/dbt-spark/dbt/adapters/spark/impl.py", line 184, in get_catalog
    columns += self._parse_relation(relation, table_columns, rel_type, properties)
  File "/Users/jerco/dev/product/dbt-spark/dbt/adapters/spark/impl.py", line 146, in _parse_relation
    logger.info(column.data_type)
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/adapters/base/relation.py", line 356, in data_type
    if self.is_string():
  File "/Users/jerco/dev/product/dbt-spark/env/lib/python3.7/site-packages/dbt_core-0.14.3-py3.7.egg/dbt/adapters/base/relation.py", line 365, in is_string
    return self.dtype.lower() in ['text', 'character varying', 'character',
AttributeError: 'NoneType' object has no attribute 'lower'

It looks like this is the offending line. I'm not sure if it's because:

  • We might be passing None as the dtype (the empty line between the last column and # Detailed Table Information?)
  • Spark's string datatype is actually called string, which is not in the default implementation. I know that the BigQuery adapter has to reimplement the Column class for similar reasons.
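
A minimal reproduction of the first bullet, using a simplified stand-in for dbt 0.14.x's Column class rather than the real one:

# Simplified stand-in for dbt 0.14.x's Column; not the real implementation.
class Column:
    def __init__(self, column, dtype):
        self.column = column
        self.dtype = dtype

    def is_string(self):
        # Mirrors the check in dbt/adapters/base/relation.py:
        # a None dtype has no .lower(), so this raises AttributeError.
        return self.dtype.lower() in ['text', 'character varying', 'character']


# The blank separator row before '# Detailed Table Information'
# carries no data type, so dtype ends up as None:
separator = Column(column='', dtype=None)
separator.is_string()  # AttributeError: 'NoneType' object has no attribute 'lower'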

@jtcohen6 (Contributor) commented Feb 6, 2020

@drewbanin Would you be able to take a look at the error in _parse_relation related to None data type? Once we can get dbt docs generate to work, I think this PR is in solid shape.

@Fokko (Contributor Author) commented Feb 6, 2020

I think the empty line is being transformed into a None; I can do additional checks today or tomorrow.

@drewbanin (Contributor)

I think the issue here is indeed that describe extended returns a whole lot of information besides just the column names + types. While @Fokko is correct above that the issue is the blank line in the result set, there's also going to be an issue around all of these other metadata fields!

(screenshot omitted)

In spark__get_columns_in_relation, it's not enough to just add extended to the describe statement. Instead, I think we should rip out the ensuing code and replace it with something more sensible:

{% macro spark__get_columns_in_relation(relation) -%}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
    describe extended {{ relation }}
  {% endcall %}

  {% set table = load_result('get_columns_in_relation').table %}
  -- TODO : Remove this line
  -- {{ return(sql_convert_columns_in_relation(table)) }}

  -- TODO : Call an adapter method (exposed in the compilation context)
  --        which will convert the raw `describe extended` results into a list of columns
  {{ return(adapter.parse_describe_extended(table)) }}

{% endmacro %}

For reference, the dbt Core 0.14.3 implementation of sql_convert_columns_in_relation is:

https://github.com/fishtown-analytics/dbt/blob/b5aff36253f5c6563e942265bbaa4cb366722b14/core/dbt/include/global_project/macros/adapters/common.sql#L109-L115

Ultimately, this logic is going to look pretty similar to the current contents of _parse_relation in this PR. I think the key difference will be that we should parse the raw results of the describe extended statement before wrapping those results up in Column objects.
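
A rough sketch of what such an adapter method could look like, assuming the describe extended rows arrive as dicts keyed by col_name / data_type / comment. The parse_describe_extended name comes from the macro above; the returned dict keys (name, dtype, comment, table_owner) are illustrative only, not necessarily what the PR ends up with.

from typing import Dict, List


def parse_describe_extended(rows: List[Dict]) -> List[Dict]:
    """Split `describe extended` rows into column rows and table metadata,
    returning plain dicts that can be wrapped in Column objects afterwards."""
    # Column rows come first; a blank col_name marks the start of the
    # '# Detailed Table Information' section.
    pos = next(
        (i for i, row in enumerate(rows) if not row['col_name']),
        len(rows),
    )
    metadata = {row['col_name']: row['data_type'] for row in rows[pos + 1:]}
    return [
        {
            'name': row['col_name'],
            'dtype': row['data_type'],
            'comment': row['comment'],
            'table_owner': metadata.get('Owner'),
        }
        for row in rows[:pos]
    ]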

@jtcohen6 (Contributor)

@Fokko Thank you for making these changes! This PR worked for me when I checked it out last night. I'm kicking off tests now.

@Fokko (Contributor Author) commented Feb 17, 2020

Thanks for the pointers, Drew; this made the code a bit nicer.

I'm unable to change the OWNER in Databricks. Last December this was possible; maybe something has changed. Right now, it is stuck at root:
(screenshot omitted)

I can check against AWS EMR to see if this is possible there. I've tested against Azure Databricks, and DESCRIBE TABLE EXTENDED works properly.

@Fokko (Contributor Author) commented Feb 17, 2020

My pleasure @jtcohen6. Let me know what comes out of it.

@Fokko (Contributor Author) commented Feb 20, 2020

@jtcohen6 The error looks unrelated. After this one, I'll continue on two other PRs.

@jtcohen6 (Contributor) commented Feb 24, 2020

@Fokko I figured out why the integration test is failing!

  • Here the test uses the dbt_utils.equality macro.
  • Here is where that macro expects that adapter.get_columns_in_relation returns an array of column dicts, and that those dicts include a quoted attribute

The implementation of parse_describe_extended does not currently return a quoted attribute. It really just needs to be the column name wrapped in backticks, `column_name`.

Other adapters' implementations of get_columns_in_relation handle this by instantiating each result as an api.Column object, which has a quoted property. E.g. the Snowflake version calls the common macro sql_convert_columns_in_relation to do just this.

In the long run, I think the better answer here is to reimplement the Column object as SparkColumn, and update the quoted property to use ` instead of ". (BigQuery does the same thing here in 0.14.x, and here in >= 0.15.x.)
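
A minimal sketch of that idea, written as a standalone class rather than an actual subclass of dbt's base Column (which quotes identifiers with double quotes by default):

class SparkColumn:
    def __init__(self, column: str, dtype: str):
        self.column = column
        self.dtype = dtype

    @property
    def quoted(self) -> str:
        # Spark/Hive identifiers are quoted with backticks, not double quotes.
        return '`{}`'.format(self.column)


print(SparkColumn('first_name', 'string').quoted)  # prints `first_name`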

@jtcohen6 (Contributor) left a review comment:

@Fokko Thank you for making these changes! This is looking really close. I think you'll just need to revise the new unit test test_parse_relation, since you've changed the method names.

@jtcohen6 (Contributor) commented Mar 3, 2020

@drewbanin I'm unable to locally replicate the latest integration test failure. Could you take a look when you have a second?

@Fokko (Contributor Author) commented Mar 3, 2020

@jtcohen6 I was already looking into it. I'm not sure what caused it; could it be a race condition? That could also be wishful thinking.
It's also a bit odd that we don't have any logs. I'm curious about the SQL statements.

@Fokko (Contributor Author) commented Mar 5, 2020

@jtcohen6 Can we trigger the tests once more? :)

@jtcohen6 (Contributor) commented Mar 5, 2020

@Fokko :( I'll try to get to the bottom of this

@Fokko (Contributor Author) commented Mar 5, 2020

I’ll dig into it as well

@drewbanin (Contributor) commented Mar 5, 2020

@Fokko @jtcohen6 I gave this a spin locally and found that the two tables (one created via a seed, one created via an incremental model) were created with the columns in a weird order.

This manifests as an error when the incremental model is run twice. It writes columns out of order, which looks something like this:

(screenshot omitted)

Have you two seen anything like this before?

@Fokko (Contributor Author) commented Mar 7, 2020

I got the integration tests working locally as well. I've noticed that the inserted data is shifted by one: the id column is prepended to the statement.

It looks like the seed is not being created properly:

create table reporting.seed (id BIGINT,first_name STRING,last_name STRING,email STRING,gender STRING,ip_address STRING)

The id column should be at the end.

@jtcohen6 (Contributor) left a review comment:

I think I've gotten to the bottom of this. It turns out that this integration test has been a false negative all along. The change in this PR to correctly define Spark's quote character (`) caused the test to finally work, and made me realize that we need an update to our incremental materialization (#59). Together with the proposed changes in #60, we should finally have a passing integration test.

See my one comment about updating the get_columns_in_relation method to avoid including partition columns twice.

def find_table_information_separator(rows: List[dict]) -> int:
    pos = 0
    for row in rows:
        if not row['col_name']:

To avoid including partition columns twice, could this be something like

Suggested change:
-        if not row['col_name']:
+        if not row['col_name'] or row['col_name'] == '# Partition Information':

I'm not sure if that's the most elegant way of writing it, and it would need at least a line break to satisfy PEP 8.
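
Putting the suggestion together with a PEP 8-friendly line break, the helper could end up looking roughly like this (the loop/return structure is a guess based on the truncated snippet above, not the merged code):

from typing import Dict, List


def find_table_information_separator(rows: List[Dict]) -> int:
    """Index of the first row that is not a regular column row."""
    pos = 0
    for row in rows:
        # Stop at the blank separator row or at the start of the
        # '# Partition Information' section, so partition columns
        # are not counted twice.
        if (not row['col_name']
                or row['col_name'] == '# Partition Information'):
            break
        pos += 1
    return pos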

@jtcohen6 (Contributor)

@Fokko Could you take a look at this when you get a chance? There's one small change I think we should make here to avoid duplicate columns in the get_columns_in_relation result.

@Fokko (Contributor Author) commented Mar 16, 2020

Thanks for pinging me, @jtcohen6. I was away last week and the notification got lost. I've updated the PR.

@Fokko (Contributor Author) commented Mar 16, 2020

Yes, just checked to be sure:
(screenshot omitted)

@Fokko (Contributor Author) commented Mar 16, 2020

We can test against Spark when #60 has been merged :)

@jtcohen6 (Contributor)

Thanks @Fokko!

As it is, #39 depends on #60 and #60 depends on #39 for passing integration tests. I've confirmed locally that the two in combination can pass this integration test.

I think I'm going to take what I see as the simplest approach here:

  1. Squash + merge this PR into master
  2. Pull these changes into #60 (Fix: column order for incremental insert overwrite)
  3. Test #60
  4. Merge when tests pass

Does that sound OK to you?

@Fokko (Contributor Author) commented Mar 16, 2020

Sounds good, I'm also happy to cherry-pick #60 onto this branch.

jtcohen6 merged commit 955e816 into dbt-labs:master on Mar 16, 2020.
Fokko deleted the fd-describe-table-spark branch on March 16, 2020 at 20:30.