Add support for extracting statistics #41
Conversation
dbt/adapters/spark/impl.py (outdated)

    if not table_owner and column.name == table_owner_key:
        table_owner = column.data_type

    if column.name == 'Statistics':
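For context on what the branch above is checking: Spark reports table statistics as a single string value (e.g. "1109049927 bytes, 14093476 rows"). A minimal sketch of how such a value could be parsed follows; the input format and the function name `parse_statistics` are assumptions for illustration, not the PR's actual implementation.

```python
import re

def parse_statistics(raw_stats):
    """Parse a Spark `Statistics` value such as
    '1109049927 bytes, 14093476 rows' into a dict.

    Assumption: the value is one or more '<number> <unit>'
    pairs separated by commas.
    """
    stats = {}
    for number, unit in re.findall(r"(\d+)\s+(\w+)", raw_stats):
        stats[unit] = int(number)
    return stats

print(parse_statistics("1109049927 bytes, 14093476 rows"))
# {'bytes': 1109049927, 'rows': 14093476}
```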
Hi @Fokko, you seem to be missing a test for parsing the Statistics value.
Great catch Niels. I have a test on my fork: master...Fokko:master#diff-cb0ffa1fa3373b03d243ccd85a226cc6R404
However, these tests are all mocked, which I don't really like, especially with Spark 3.0 coming up. It is better to talk to Spark directly and not rely on mocks that might not reflect the actual interface. It would also be great to bump the Spark version when a new one gets released.
I'm working on a Docker-based CI for running the tests. This is a Docker container with Hadoop, Hive, and Spark; it exposes port 10001 over HTTP for running queries against Spark. With Docker, we can test against 2.4.x, which is the LTS release, and 3.0, which is currently in preview. This requires Hive as well for storing the metadata. Currently I'm occupied with another project, and I'll continue the work on this at the end of next week.
Before merging this, please wait for #39
@Fokko If you can fix the merge conflicts between this branch and
Force-pushed from 49077a9 to 280be10
Thanks @jtcohen6. I've just updated the branch. I'm a bit busy, so sorry for the late response. I've tested against Azure Databricks and it works 👍
@Fokko No worries, thank you! I tested on AWS Databricks and the local (dockerized) Spark. It looks great on both. One tiny flake8 error, and then this is good to merge:
Thanks @jtcohen6, I've pushed a fix
Looks great! Thanks for the follow-on work for this @Fokko
My pleasure @jtcohen6!
Depends on #39

With Spark you can compute statistics using

    ANALYZE TABLE <table> COMPUTE STATISTICS

This computes statistics such as the number of rows and the table size, and stores them in the metastore. With this PR, these statistics are fetched and shown in the docs:
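The flow described above can be sketched roughly as follows. This is only an illustration: the DB-API-style `cursor`, the function name, and the three-column shape of the DESCRIBE EXTENDED result set are assumptions, not the adapter's actual code.

```python
def fetch_table_statistics(cursor, table):
    """Compute and read back table statistics over a Spark SQL
    connection. `cursor` is any DB-API-style cursor (an assumption);
    the (name, value, comment) row shape is also an assumption.
    """
    # Ask Spark to compute row count and size; the results are
    # persisted in the metastore.
    cursor.execute("ANALYZE TABLE {} COMPUTE STATISTICS".format(table))
    # DESCRIBE EXTENDED output includes a 'Statistics' row.
    cursor.execute("DESCRIBE EXTENDED {}".format(table))
    for name, value, _comment in cursor.fetchall():
        if name == "Statistics":
            return value
    return None
```

In practice the adapter would then parse the returned string (e.g. "1109049927 bytes, 14093476 rows") into individual metrics before rendering them in the docs.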