
feat: Adds delta-lake-extractor #351

Merged: 1 commit into amundsen-io:master on Oct 28, 2020

Conversation

@samshuster (Contributor) commented on Aug 28, 2020

Summary of Changes

This is the PR for a delta lake Amundsen extractor. This approach requires that the extractor run on a Spark cluster, so it needs to be executed differently from other types of extractors.

It currently emits both table metadata and table last-updated information, but it can be augmented at a later point to also emit:

  • programmatic descriptions with delta-lake-specific information
  • watermarks
  • column stats
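Since the extractor has to run inside a Spark cluster, wiring it into a databuilder job would look roughly like the sketch below. This is a minimal sketch, not the PR's final API: the class name is assumed from the file path shown in the review thread, and the extractor config keys and the `set_spark` hand-off are illustrative guesses; the job, task, and loader classes and the `loader.filesystem_csv_neo4j.*` keys are the existing databuilder ones.

```python
# Minimal sketch, assuming an active Spark session (e.g. launched via
# spark-submit on the cluster). Extractor config keys and set_spark() are
# hypothetical; the loader keys are real databuilder ones.
from pyhocon import ConfigFactory
from pyspark.sql import SparkSession

from databuilder.extractor.delta_lake_metadata_extractor import DeltaLakeMetadataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.task.task import DefaultTask

spark = SparkSession.builder.appName('amundsen-delta-extractor').getOrCreate()

extractor = DeltaLakeMetadataExtractor()
extractor.set_spark(spark)  # hypothetical: hand the extractor the live session

job_config = ConfigFactory.from_dict({
    # hypothetical extractor keys: cluster name to emit, schemas to scan
    'extractor.delta_lake_table_metadata.cluster': 'prod',
    'extractor.delta_lake_table_metadata.schema_list': ['default'],
    # real FsNeo4jCSVLoader keys for staging CSVs before publishing
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/amundsen/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/amundsen/relationships',
    'loader.filesystem_csv_neo4j.delete_created_directories': True,
})

DefaultJob(conf=job_config,
           task=DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader())).launch()
```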

Tests

Tests for all core extractor functionality were added.

Documentation

Initial documentation was added to the main README.md.

Checklist

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain docstrings that explain what they do
  • PR passes make test

@feng-tao added the "keep fresh" label (disables stalebot from closing an issue) on Sep 3, 2020
@samshuster (Contributor, Author)

I have been waiting on feedback from delta-lake users to make sure this solution, as written, would work for them. Given that there has been no feedback, I wonder if I should bring this script in with no testing for now, in the "sample" directory instead?

@feng-tao (Member) commented on Oct 6, 2020

> I have been waiting on feedback from delta-lake users to make sure this solution as written would work for them. Given there has been no feedback, I wonder if I should bring in this script with no testing for now in the "sample" directory instead?

Maybe add a basic unit test to ensure coverage?

@samshuster (Contributor, Author)

OK, this one has taken me a while... apologies for that.

I now have a working first pass of it with unit tests. Next steps:

  • fix all build failures, DCO, and linting
  • complete documentation for the feature where it's needed
  • take a look at what it would take to add the programmatic-description and watermark parts, or whether that should just be a separate story

@samshuster force-pushed the issue-608/delta-lake branch 3 times, most recently from 70b83d9 to 1bf50bf on October 19, 2020 15:24
@samshuster changed the title from "WIP: feat - Adds delta-lake-extractor" to "feat - Adds delta-lake-extractor" on Oct 19, 2020
@samshuster changed the title from "feat - Adds delta-lake-extractor" to "feat: Adds delta-lake-extractor" on Oct 19, 2020
@samshuster (Contributor, Author)

OK @feng-tao, this one is finally ready.

@feng-tao (Member) left a comment

Just a few nits, and I think we could improve iteratively as well. Curious: has Edmunds had a chance to test and switch to this extractor on the Databricks platform, instead of the original CSV approach?

README.md: review thread resolved
LOGGER = logging.getLogger(__name__)


class ScrapedColumnMetadata(object):
@feng-tao (Member):

Curious: why don't we reuse the ColumnMetadata model?

@samshuster (Contributor, Author):

So I initially tried to use ColumnMetadata, but ran into problems: I am also trying to capture as much delta-lake-specific table metadata as possible at the same time, and it doesn't fit well into the current ColumnMetadata class. An example (which actually applies to hive tables in general) would be isPartitionColumn. If I add it to columnTags, then it ends up tagging the table instead.

I think somebody is working on fixing column tagging; perhaps at that point I can use the tag dictionary to store all of this other metadata?
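For illustration, the kind of wrapper being described might look like the sketch below. Only the class name `ScrapedColumnMetadata` comes from the diff context above; the attribute names and the conversion method are assumptions, not the PR's exact fields.

```python
# Sketch of the idea described above: hold delta-lake-specific attributes
# (e.g. whether a column is a partition column) alongside the standard ones,
# and convert to the Amundsen model only when emitting.
from typing import Optional

from databuilder.models.table_metadata import ColumnMetadata


class ScrapedColumnMetadata(object):
    def __init__(self, name: str, data_type: str,
                 description: Optional[str], sort_order: int,
                 is_partition_column: bool = False) -> None:
        self.name = name
        self.data_type = data_type
        self.description = description
        self.sort_order = sort_order
        # Does not fit ColumnMetadata today; tagging it would tag the table.
        self.is_partition_column = is_partition_column

    def to_column_metadata(self) -> ColumnMetadata:
        # Map only the fields the current Amundsen model understands.
        return ColumnMetadata(name=self.name,
                              description=self.description,
                              col_type=self.data_type,
                              sort_order=self.sort_order)
```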

databuilder/extractor/delta_lake_metadata_extractor.py: outdated review thread, resolved
        return [self.scrape_table(table) for table in tables]

    def scrape_table(self, table: Table) -> ScrapedTableMetadata:
        '''Takes a table object and creates a scraped table metadata object.'''
@feng-tao (Member):

Could you share more context on the scraped table metadata object? Is it because it doesn't include all the metadata needed for the full table metadata?

@samshuster (Contributor, Author):

It's a similar problem as with ColumnMetadata. I am trying to batch-fetch all of the metadata I can from delta-lake tables at the same time, and not all parts of that metadata have an Amundsen equivalent yet. It seemed like, at this point in Amundsen's development at least, it would be easier to just scrape it into a custom Python class and then map it to the Amundsen-specific classes later. For example, a ScrapedTableMetadata can actually produce a TableMetadata, a LastUpdatedMetadata, etc.

I can add TODOs saying that it would be good to revisit this, though.
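A sketch of that fan-out, assuming hypothetical attribute names on the scraped class; `TableMetadata` and `TableLastUpdated` are the existing databuilder models (the comment above says "LastUpdatedMetadata", but the databuilder class is `TableLastUpdated`).

```python
# Minimal sketch: one scraped object producing the Amundsen models it can
# currently populate. Attribute names on ScrapedTableMetadata are assumptions.
from databuilder.models.table_last_updated import TableLastUpdated
from databuilder.models.table_metadata import TableMetadata


def to_table_metadata(scraped: 'ScrapedTableMetadata') -> TableMetadata:
    return TableMetadata(database='delta',  # assumed database label
                         cluster=scraped.cluster,
                         schema=scraped.schema,
                         name=scraped.table,
                         description=scraped.description,
                         columns=[c.to_column_metadata() for c in scraped.columns],
                         is_view=scraped.is_view)


def to_table_last_updated(scraped: 'ScrapedTableMetadata') -> TableLastUpdated:
    return TableLastUpdated(table_name=scraped.table,
                            last_updated_time_epoch=scraped.last_modified_epoch,
                            schema=scraped.schema,
                            db='delta',  # assumed
                            cluster=scraped.cluster)
```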

@feng-tao (Member)

^^ @samshuster thanks for the PR!

@samshuster (Contributor, Author)

Thanks for the review! I will take a look at answering your questions tomorrow. As for whether we have switched over to this approach at Edmunds: not yet. I have run it in our QA environment a couple of times and the load looked OK. That said, I might wait on merging this until I swap out our old scraping approach for this one (hopefully in production as well) to weed out any other issues that may pop up.

@samshuster force-pushed the issue-608/delta-lake branch 2 times, most recently from 7fb7009 to bea306a on October 28, 2020 02:47
requirements.txt: outdated review thread, resolved
@samshuster (Contributor, Author)

So, right when I thought it was good to go, I spotted a couple of issues in the running QA job that only appeared when working on all of the tables.

  1. It looks like there are some edge cases where I am not catching the appropriate exceptions. Need to resolve that.
  2. It looks like extra columns are being added in some cases (partition columns are duplicated).
  3. I don't think I am going to solve this in this iteration, but it is pretty slow. I think it could benefit from multithreading (see the sketch below). Something to think about for the future, I guess.

I will work on resolving these, and hopefully it won't be too long of a turnaround.
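The multithreading idea in point 3 could be as simple as the sketch below. This is an illustrative guess, not the parallelism a later commit actually added: the scraping is dominated by blocking Spark SQL round-trips, so a thread pool overlaps the waits even under the GIL.

```python
# Illustrative sketch of 'simple parallelism' for the scraping step: replace
# the serial list comprehension with a thread pool. The scrape_table name
# mirrors the diff context above; the pool size is an arbitrary assumption.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


def scrape_all_tables(extractor: Any, tables: List[Any],
                      max_workers: int = 8) -> List[Any]:
    # scrape_table mostly waits on Spark SQL calls, so threads help here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extractor.scrape_table, tables))
```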

…e script on how it would be used.

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing the table last updated parts

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding pyspark dependency
Fixing formatting for the two deltalake classes

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding delta-lake unit tests and fixing delta lake metadata extractor bugs. Now is a working piece of code.

Signed-off-by: sshuster <sshuster@edmunds.com>

Adding more typing information and changing formatting

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixing column tagging issue which doesn't accomplish what we need it to.

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixing the typing

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint again

Signed-off-by: sshuster <sshuster@edmunds.com>

More documentation

Signed-off-by: sshuster <sshuster@edmunds.com>

adding some todos per code review.
fixing the sample deltalake metadata script

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing lint

Signed-off-by: sshuster <sshuster@edmunds.com>

- removed pyspark from requirements and moved to extras
- added some simple parallelism to the scraping
- Fixed bug that should allow for parsing the describe statement on both spark 2 and spark 3
- Fixed bug that caused parser to die if table does not exist

Signed-off-by: sshuster <sshuster@edmunds.com>
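For context on the Spark 2 vs Spark 3 describe fix above: `DESCRIBE TABLE EXTENDED` returns rows of (col_name, data_type, comment), with section headers and blank separator rows that differ slightly between versions, and with partition columns repeated under '# Partition Information'. A hedged sketch of parsing that output, not the PR's actual parser:

```python
# Sketch only: split DESCRIBE TABLE EXTENDED output into column rows and
# detailed-table-information rows, skipping the repeated partition-column
# rows that caused the duplicated-column bug noted earlier in the thread.
from typing import Dict, List, Optional, Tuple

from pyspark.sql import SparkSession


def describe_table(spark: SparkSession, schema: str, table: str
                   ) -> Tuple[List[Tuple[str, str, Optional[str]]], Dict[str, str]]:
    columns: List[Tuple[str, str, Optional[str]]] = []
    details: Dict[str, str] = {}
    seen: set = set()
    in_details = False
    for row in spark.sql(f'DESCRIBE TABLE EXTENDED {schema}.{table}').collect():
        name = (row.col_name or '').strip()
        if not name or name.startswith('#'):
            # Blank rows and '# ...' headers separate sections; exactly which
            # ones appear differs between Spark 2 and Spark 3.
            in_details = in_details or 'Detailed Table Information' in name
            continue
        if in_details:
            details[name] = row.data_type
        elif name not in seen:
            # DESCRIBE lists partition columns again under
            # '# Partition Information'; skip the repeats.
            seen.add(name)
            columns.append((name, row.data_type, row.comment))
    return columns, details
```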

Fixing another exception.
Adding spark to all_deps

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixed table parsing so that it doesn't try getting table details on views.

Signed-off-by: sshuster <sshuster@edmunds.com>

Fixed import

Signed-off-by: sshuster <sshuster@edmunds.com>

fixing mypy issues

Signed-off-by: sshuster <sshuster@edmunds.com>
@samshuster (Contributor, Author)

OK, I fixed the issues mentioned and have confirmed it's working OK in our QA environment. Hopefully a better and more seamless solution will come in the future!

@feng-tao (Member)

Thanks @samshuster!

@feng-tao feng-tao merged commit e8679aa into amundsen-io:master Oct 28, 2020