
Add support for extracting statistics #41

Merged · 2 commits · Mar 20, 2020

Conversation

Fokko (Contributor) commented Dec 10, 2019

Depends on #39

With Spark you can compute statistics using ANALYZE TABLE <table> COMPUTE STATISTICS. This computes statistics such as the number of rows and the table size in bytes, and stores them in the metastore. With this PR, that data is fetched and shown in the docs:
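For context, here is a minimal sketch of building that statement; the helper name is illustrative and not part of the adapter's actual code:

```python
def analyze_table_sql(table: str, columns=None) -> str:
    """Build the Spark SQL statement that computes table statistics.

    With no columns, Spark records table-level stats (row count, size in
    bytes); FOR COLUMNS additionally computes per-column statistics.
    """
    sql = f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if columns:
        sql += " FOR COLUMNS " + ", ".join(columns)
    return sql

print(analyze_table_sql("my_db.my_table"))
# ANALYZE TABLE my_db.my_table COMPUTE STATISTICS
```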

[Screenshot: table statistics displayed in the generated dbt docs]

if not table_owner and column.name == table_owner_key:
    table_owner = column.data_type

# the Statistics row holds a raw string, e.g. "1109049927 bytes, 14093476 rows"
if column.name == 'Statistics':
    table_stats = column.data_type
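The Statistics value comes back from Spark as a single string; a sketch of how it might be split into the numeric fields shown in the docs (the helper name and exact format handling are assumptions, not the PR's code):

```python
import re

def parse_statistics(raw: str) -> dict:
    """Split a Spark statistics string such as
    '1109049927 bytes, 14093476 rows' into {'bytes': ..., 'rows': ...}."""
    return {unit: int(value) for value, unit in re.findall(r"(\d+)\s+(\w+)", raw)}

print(parse_statistics("1109049927 bytes, 14093476 rows"))
# {'bytes': 1109049927, 'rows': 14093476}
```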
Collaborator

Hi @Fokko, you seem to be missing a test for parsing the Statistics.

Fokko (Contributor Author)

Great catch, Niels. I have a test on my fork: master...Fokko:master#diff-cb0ffa1fa3373b03d243ccd85a226cc6R404

However, these tests are all mocked, which I don't really like, especially with Spark 3.0 coming up. It is better to talk to Spark directly and not rely on mocks that might not reflect the actual interface. It would also be great to be able to update the Spark version when a new one gets released.
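To make the mocking concern concrete, a unit test along these lines can exercise the extraction logic with fake DESCRIBE TABLE EXTENDED rows instead of a live Spark session; the row type, helper, and values are illustrative, not the project's actual test code:

```python
from collections import namedtuple

# A stand-in for the rows returned by DESCRIBE TABLE EXTENDED.
Row = namedtuple("Row", ["name", "data_type"])

def extract_statistics(rows):
    """Return the raw Statistics string from the metadata rows, if present."""
    for row in rows:
        if row.name == "Statistics":
            return row.data_type
    return None

def test_extract_statistics():
    rows = [
        Row("Owner", "fokko"),
        Row("Statistics", "1024 bytes, 14 rows"),
    ]
    assert extract_statistics(rows) == "1024 bytes, 14 rows"
    assert extract_statistics([Row("Owner", "fokko")]) is None

test_extract_statistics()
```

A mocked test like this is fast, but as noted above it only verifies the parsing contract, not Spark's actual output format, which is why an integration environment is still valuable.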

I'm working on a Docker-based CI for running the tests: a Docker container with Hadoop, Hive, and Spark that exposes port 10001 for running queries against Spark over HTTP. With Docker we can test against both 2.4.x, the LTS release, and 3.0, which is currently in preview. This setup requires Hive as well, for storing the metadata. I'm currently occupied with another project and will continue this work at the end of next week.

Fokko (Contributor Author) commented Feb 16, 2020

Before merging this, please wait for #39

jtcohen6 (Contributor)

@Fokko If you can fix the merge conflicts between this branch and master, I'd be happy to take a look at the PR!

@Fokko force-pushed the fd-add-statistics-support branch from 49077a9 to 280be10 on March 18, 2020, 21:31
Fokko (Contributor Author) commented Mar 18, 2020

Thanks @jtcohen6. I've just updated the branch. I'm a bit busy, so sorry for the late response. I've tested against Azure Databricks and it works 👍

jtcohen6 (Contributor)

@Fokko No worries, thank you! I tested on AWS Databricks and the local (dockerized) Spark. It looks great on both.

One tiny flake8 error, and then this is good to merge:

dbt/adapters/spark/impl.py:183:80: E501 line too long (83 > 79 characters)

Fokko (Contributor Author) commented Mar 19, 2020

Thanks @jtcohen6, I've pushed a fix.

jtcohen6 (Contributor) left a comment

Looks great! Thanks for the follow-on work on this, @Fokko

@jtcohen6 jtcohen6 merged commit e6610ae into dbt-labs:master Mar 20, 2020
@Fokko Fokko deleted the fd-add-statistics-support branch March 20, 2020 07:35
Fokko (Contributor Author) commented Mar 20, 2020

My pleasure @jtcohen6 !
