
feat: Adds delta-lake-extractor #351

Merged: 1 commit into amundsen-io:master on Oct 28, 2020

Conversation

@samshuster (Contributor) commented on Aug 28, 2020

Summary of Changes

This is the PR for a delta lake Amundsen extractor. This approach requires that the extractor run on a Spark cluster, so it needs to be executed differently from other types of extractors.

It currently emits both table metadata and table last-updated information, but it can be augmented at a later point to also emit:

  • programmatic descriptions with delta-lake-specific information
  • watermarks
  • column stats
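Since the extractor has to run inside a Spark cluster, wiring it into a databuilder job would look roughly like the sketch below. This is a minimal sketch, not the PR's final API: the class name is assumed from the file path shown in the review thread, and the extractor config keys and the `set_spark` hand-off are illustrative guesses; the job, task, and loader classes and the `loader.filesystem_csv_neo4j.*` keys are the existing databuilder ones.

```python
# Minimal sketch, assuming an active Spark session (e.g. launched via
# spark-submit on the cluster). Extractor config keys and set_spark() are
# hypothetical; the loader keys are real databuilder ones.
from pyhocon import ConfigFactory
from pyspark.sql import SparkSession

from databuilder.extractor.delta_lake_metadata_extractor import DeltaLakeMetadataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.task.task import DefaultTask

spark = SparkSession.builder.appName('amundsen-delta-extractor').getOrCreate()

extractor = DeltaLakeMetadataExtractor()
extractor.set_spark(spark)  # hypothetical: hand the extractor the live session

job_config = ConfigFactory.from_dict({
    # hypothetical extractor keys: cluster name to emit, schemas to scan
    'extractor.delta_lake_table_metadata.cluster': 'prod',
    'extractor.delta_lake_table_metadata.schema_list': ['default'],
    # real FsNeo4jCSVLoader keys for staging CSVs before publishing
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/amundsen/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/amundsen/relationships',
    'loader.filesystem_csv_neo4j.delete_created_directories': True,
})

DefaultJob(conf=job_config,
           task=DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader())).launch()
```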

Tests

Tests for all core extractor functionality were added.

Documentation

Initial documentation was added to the main README.md.

Checklist

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain docstrings that explain what they do
  • PR passes make test

@feng-tao added the "keep fresh" label (disables stalebot from closing an issue) on Sep 3, 2020
@samshuster (Contributor, Author)

I have been waiting on feedback from delta-lake users to make sure this solution, as written, would work for them. Given that there has been no feedback, I wonder if I should bring this script in with no testing for now, in the "sample" directory instead?

@feng-tao (Member) commented on Oct 6, 2020

> I have been waiting on feedback from delta-lake users to make sure this solution as written would work for them. Given there has been no feedback, I wonder if I should bring in this script with no testing for now in the "sample" directory instead?

Maybe add a basic unit test to ensure coverage?

@samshuster (Contributor, Author)

OK, this one has taken me a while... apologies for that.

I now have a working first pass of it with unit tests. Next steps:

  • fix all build failures, DCO, and linting
  • complete documentation for the feature where it's needed
  • take a look at what it would take to add the programmatic-description and watermark parts, or whether that should just be a separate story

@samshuster force-pushed the issue-608/delta-lake branch 3 times, most recently from 70b83d9 to 1bf50bf on October 19, 2020 15:24
@samshuster changed the title from "WIP: feat - Adds delta-lake-extractor" to "feat - Adds delta-lake-extractor" on Oct 19, 2020
@samshuster changed the title from "feat - Adds delta-lake-extractor" to "feat: Adds delta-lake-extractor" on Oct 19, 2020
@samshuster (Contributor, Author)

OK @feng-tao, this one is finally ready.

@feng-tao (Member) left a comment

Just a few nits, and I think we could improve iteratively as well. Curious: has Edmunds had a chance to test and switch to this extractor on the Databricks platform, instead of the original CSV approach?

README.md: review thread resolved
LOGGER = logging.getLogger(__name__)


class ScrapedColumnMetadata(object):
@feng-tao (Member):

Curious: why don't we reuse the ColumnMetadata model?

@samshuster (Contributor, Author):

So I initially tried to use ColumnMetadata, but ran into problems: I am also trying to capture as much delta-lake-specific table metadata as possible at the same time, and it doesn't fit well into the current ColumnMetadata class. An example (which actually applies to hive tables in general) would be isPartitionColumn. If I add it to columnTags, then it ends up tagging the table instead.

I think somebody is working on fixing column tagging; perhaps at that point I can use the tag dictionary to store all of this other metadata?
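For illustration, the kind of wrapper being described might look like the sketch below. Only the class name `ScrapedColumnMetadata` comes from the diff context above; the attribute names and the conversion method are assumptions, not the PR's exact fields.

```python
# Sketch of the idea described above: hold delta-lake-specific attributes
# (e.g. whether a column is a partition column) alongside the standard ones,
# and convert to the Amundsen model only when emitting.
from typing import Optional

from databuilder.models.table_metadata import ColumnMetadata


class ScrapedColumnMetadata(object):
    def __init__(self, name: str, data_type: str,
                 description: Optional[str], sort_order: int,
                 is_partition_column: bool = False) -> None:
        self.name = name
        self.data_type = data_type
        self.description = description
        self.sort_order = sort_order
        # Does not fit ColumnMetadata today; tagging it would tag the table.
        self.is_partition_column = is_partition_column

    def to_column_metadata(self) -> ColumnMetadata:
        # Map only the fields the current Amundsen model understands.
        return ColumnMetadata(name=self.name,
                              description=self.description,
                              col_type=self.data_type,
                              sort_order=self.sort_order)
```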

databuilder/extractor/delta_lake_metadata_extractor.py: outdated review thread, resolved
        return [self.scrape_table(table) for table in tables]

    def scrape_table(self, table: Table) -> ScrapedTableMetadata:
        '''Takes a table object and creates a scraped table metadata object.'''
@feng-tao (Member):

Could you share more context on the scraped table metadata object? Is it because it doesn't include all the metadata needed for the full table metadata?

@samshuster (Contributor, Author):

It's a similar problem as with ColumnMetadata. I am trying to batch-fetch all of the metadata I can from delta-lake tables at the same time, and not all parts of that metadata have an Amundsen equivalent yet. It seemed like, at this point in Amundsen's development at least, it would be easier to just scrape it into a custom Python class and then map it to the Amundsen-specific classes later. For example, a ScrapedTableMetadata can actually produce a TableMetadata, a LastUpdatedMetadata, etc.

I can add TODOs saying that it would be good to revisit this, though.
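A sketch of that fan-out, assuming hypothetical attribute names on the scraped class; `TableMetadata` and `TableLastUpdated` are the existing databuilder models (the comment above says "LastUpdatedMetadata", but the databuilder class is `TableLastUpdated`).

```python
# Minimal sketch: one scraped object producing the Amundsen models it can
# currently populate. Attribute names on ScrapedTableMetadata are assumptions.
from databuilder.models.table_last_updated import TableLastUpdated
from databuilder.models.table_metadata import TableMetadata


def to_table_metadata(scraped: 'ScrapedTableMetadata') -> TableMetadata:
    return TableMetadata(database='delta',  # assumed database label
                         cluster=scraped.cluster,
                         schema=scraped.schema,
                         name=scraped.table,
                         description=scraped.description,
                         columns=[c.to_column_metadata() for c in scraped.columns],
                         is_view=scraped.is_view)


def to_table_last_updated(scraped: 'ScrapedTableMetadata') -> TableLastUpdated:
    return TableLastUpdated(table_name=scraped.table,
                            last_updated_time_epoch=scraped.last_modified_epoch,
                            schema=scraped.schema,
                            db='delta',  # assumed
                            cluster=scraped.cluster)
```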

@feng-tao (Member)

^^ @samshuster thanks for the PR!

@samshuster (Contributor, Author)

Thanks for the review! I will take a look at answering your questions tomorrow. As for whether we have switched over to this approach at Edmunds: not yet. I have run it in our QA environment a couple of times and the load looked OK. That said, I might wait on merging this until I swap out our old scraping approach for this one (hopefully in production as well) to weed out any other issues that may pop up.

@samshuster force-pushed the issue-608/delta-lake branch 2 times, most recently from 7fb7009 to bea306a on October 28, 2020 02:47
requirements.txt: outdated review thread, resolved
@samshuster (Contributor, Author)

So, right when I thought it was good to go, I spotted a couple of issues in the running QA job that only appeared when working on all of the tables.

  1. It looks like there are some edge cases where I am not catching the appropriate exceptions. Need to resolve that.
  2. It looks like extra columns are being added in some cases (partition columns are duplicated).
  3. I don't think I am going to solve this in this iteration, but it is pretty slow. I think it could benefit from multithreading (see the sketch below). Something to think about for the future, I guess.

I will work on resolving these, and hopefully it won't be too long of a turnaround.
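The multithreading idea in point 3 could be as simple as the sketch below. This is an illustrative guess, not the parallelism a later commit actually added: the scraping is dominated by blocking Spark SQL round-trips, so a thread pool overlaps the waits even under the GIL.

```python
# Illustrative sketch of 'simple parallelism' for the scraping step: replace
# the serial list comprehension with a thread pool. The scrape_table name
# mirrors the diff context above; the pool size is an arbitrary assumption.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


def scrape_all_tables(extractor: Any, tables: List[Any],
                      max_workers: int = 8) -> List[Any]:
    # scrape_table mostly waits on Spark SQL calls, so threads help here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extractor.scrape_table, tables))
```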

…e script on how it would be used.

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing the table last updated parts

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding pyspark dependency
Fixing formatting for the two deltalake classes

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding delta-lake unit tests and fixing delta lake metadata extractor bugs. Now is a working piece of code.

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding more typing information and changing formatting

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixing column tagging issue which doesn't accomplish what we need it to.

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixing the typing

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint again

Signed-off-by: sshuster <sshuster@edmunds.com>

More documentation

Signed-off-by: sshuster <sshuster@edmunds.com>

adding some todos per code review.
fixing the sample deltalake metadata script

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint

Signed-off-by: sshuster <sshuster@edmunds.com>

- removed pyspark from requirements and moved to extras
- added some simple parallelism to the scraping
- Fixed bug that should allow for parsing the describe statement on both spark 2 and spark 3
- Fixed bug that caused parser to die if table does not exist

Signed-off-by: sshuster <sshuster@edmunds.com>
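For context on the Spark 2 vs Spark 3 describe fix above: `DESCRIBE TABLE EXTENDED` returns rows of (col_name, data_type, comment), with section headers and blank separator rows that differ slightly between versions, and with partition columns repeated under '# Partition Information'. A hedged sketch of parsing that output, not the PR's actual parser:

```python
# Sketch only: split DESCRIBE TABLE EXTENDED output into column rows and
# detailed-table-information rows, skipping the repeated partition-column
# rows that caused the duplicated-column bug noted earlier in the thread.
from typing import Dict, List, Optional, Tuple

from pyspark.sql import SparkSession


def describe_table(spark: SparkSession, schema: str, table: str
                   ) -> Tuple[List[Tuple[str, str, Optional[str]]], Dict[str, str]]:
    columns: List[Tuple[str, str, Optional[str]]] = []
    details: Dict[str, str] = {}
    seen: set = set()
    in_details = False
    for row in spark.sql(f'DESCRIBE TABLE EXTENDED {schema}.{table}').collect():
        name = (row.col_name or '').strip()
        if not name or name.startswith('#'):
            # Blank rows and '# ...' headers separate sections; exactly which
            # ones appear differs between Spark 2 and Spark 3.
            in_details = in_details or 'Detailed Table Information' in name
            continue
        if in_details:
            details[name] = row.data_type
        elif name not in seen:
            # DESCRIBE lists partition columns again under
            # '# Partition Information'; skip the repeats.
            seen.add(name)
            columns.append((name, row.data_type, row.comment))
    return columns, details
```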

Fixing another exception.
Adding spark to all_deps

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixed table parsing so that it doesn't try getting table details on views.

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixed import

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing mypy issues

Signed-off-by: sshuster <sshuster@edmunds.com>
@samshuster (Contributor, Author)

OK, I fixed the issues mentioned and have confirmed it's working OK in our QA environment. Hopefully a better and more seamless solution will come in the future!

@feng-tao (Member)

Thanks @samshuster!

@feng-tao feng-tao merged commit e8679aa into amundsen-io:master Oct 28, 2020