
Add support for extracting statistics #41

Merged · 2 commits · Mar 20, 2020

Conversation

Fokko (Contributor) commented Dec 10, 2019

Depends on #39

With Spark you can compute statistics using ANALYZE TABLE <table> COMPUTE STATISTICS. This computes statistics such as the number of rows and the table size in bytes, and stores them in the metastore. With this PR, that data is fetched and shown in the docs:
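For context, here is a minimal sketch of building that statement; the helper name is illustrative and not part of the adapter's actual code:

```python
def analyze_table_sql(table: str, columns=None) -> str:
    """Build the Spark SQL statement that computes table statistics.

    With no columns, Spark records table-level stats (row count, size in
    bytes); FOR COLUMNS additionally computes per-column statistics.
    """
    sql = f"ANALYZE TABLE {table} COMPUTE STATISTICS"
    if columns:
        sql += " FOR COLUMNS " + ", ".join(columns)
    return sql

print(analyze_table_sql("my_db.my_table"))
# ANALYZE TABLE my_db.my_table COMPUTE STATISTICS
```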

[Screenshot: table statistics displayed in the generated dbt docs]

if not table_owner and column.name == table_owner_key:
    table_owner = column.data_type

# the Statistics row holds a raw string, e.g. "1109049927 bytes, 14093476 rows"
if column.name == 'Statistics':
    table_stats = column.data_type
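The Statistics value comes back from Spark as a single string; a sketch of how it might be split into the numeric fields shown in the docs (the helper name and exact format handling are assumptions, not the PR's code):

```python
import re

def parse_statistics(raw: str) -> dict:
    """Split a Spark statistics string such as
    '1109049927 bytes, 14093476 rows' into {'bytes': ..., 'rows': ...}."""
    return {unit: int(value) for value, unit in re.findall(r"(\d+)\s+(\w+)", raw)}

print(parse_statistics("1109049927 bytes, 14093476 rows"))
# {'bytes': 1109049927, 'rows': 14093476}
```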
Collaborator

Hi @Fokko, you seem to be missing a test for parsing the Statistics.

Fokko (Contributor Author)

Great catch, Niels. I have a test on my fork: master...Fokko:master#diff-cb0ffa1fa3373b03d243ccd85a226cc6R404

However, these tests are all mocked, which I don't really like, especially with Spark 3.0 coming up. It is better to talk to Spark directly and not rely on mocks that might not reflect the actual interface. It would also be great to be able to update the Spark version when a new one gets released.
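To make the mocking concern concrete, a unit test along these lines can exercise the extraction logic with fake DESCRIBE TABLE EXTENDED rows instead of a live Spark session; the row type, helper, and values are illustrative, not the project's actual test code:

```python
from collections import namedtuple

# A stand-in for the rows returned by DESCRIBE TABLE EXTENDED.
Row = namedtuple("Row", ["name", "data_type"])

def extract_statistics(rows):
    """Return the raw Statistics string from the metadata rows, if present."""
    for row in rows:
        if row.name == "Statistics":
            return row.data_type
    return None

def test_extract_statistics():
    rows = [
        Row("Owner", "fokko"),
        Row("Statistics", "1024 bytes, 14 rows"),
    ]
    assert extract_statistics(rows) == "1024 bytes, 14 rows"
    assert extract_statistics([Row("Owner", "fokko")]) is None

test_extract_statistics()
```

A mocked test like this is fast, but as noted above it only verifies the parsing contract, not Spark's actual output format, which is why an integration environment is still valuable.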

I'm working on a Docker-based CI for running the tests: a Docker container with Hadoop, Hive, and Spark that exposes port 10001 for running queries against Spark over HTTP. With Docker we can test against both 2.4.x, the LTS release, and 3.0, which is currently in preview. This setup requires Hive as well, for storing the metadata. I'm currently occupied with another project and will continue this work at the end of next week.

Fokko (Contributor Author) commented Feb 16, 2020

Before merging this, please wait for #39

jtcohen6 (Contributor)

@Fokko If you can fix the merge conflicts between this branch and master, I'd be happy to take a look at the PR!

@Fokko force-pushed the fd-add-statistics-support branch from 49077a9 to 280be10 on March 18, 2020, 21:31
Fokko (Contributor Author) commented Mar 18, 2020

Thanks @jtcohen6. I've just updated the branch. I'm a bit busy, so sorry for the late response. I've tested against Azure Databricks and it works 👍

jtcohen6 (Contributor)

@Fokko No worries, thank you! I tested on AWS Databricks and the local (dockerized) Spark. It looks great on both.

One tiny flake8 error, and then this is good to merge:

dbt/adapters/spark/impl.py:183:80: E501 line too long (83 > 79 characters)

Fokko (Contributor Author) commented Mar 19, 2020

Thanks @jtcohen6, I've pushed a fix.

jtcohen6 (Contributor) left a comment

Looks great! Thanks for the follow-on work on this, @Fokko

@jtcohen6 jtcohen6 merged commit e6610ae into dbt-labs:master Mar 20, 2020
@Fokko Fokko deleted the fd-add-statistics-support branch March 20, 2020 07:35
Fokko (Contributor Author) commented Mar 20, 2020

My pleasure @jtcohen6 !
