feat: Adds delta-lake-extractor #351
Conversation
I have been waiting on feedback from delta-lake users to make sure this solution, as written, would work for them. Given that there has been no feedback, I wonder if I should bring this script in without tests for now, under the "sample" directory instead?
Maybe add a basic unit test to ensure coverage?
Ok, this one has taken me a while... apologies for that. I now have a working first pass of it with unit tests. Next steps:
Ok @feng-tao, this one is finally ready.
Just a few nits, and I think we can improve iteratively as well. Curious: has Edmunds had a chance to test and switch to this extractor on the Databricks platform instead of the original CSV approach?
LOGGER = logging.getLogger(__name__)


class ScrapedColumnMetadata(object):
Curious why we don't reuse the ColumnMetadata model?
So I initially tried to use ColumnMetadata, but I ran into problems when also trying to capture as much other delta-lake-specific metadata as possible at the same time, since some of it doesn't fit well into the current ColumnMetadata class. One example (which actually applies to Hive tables in general) is isPartitionColumn: if I add it to columnTags, it ends up tagging the table instead.
I think somebody is working on fixing column tagging; perhaps at that point I can use the tag dictionary to store all of this other metadata?
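To illustrate the pattern being described (a hypothetical sketch, not the class in this PR): an intermediate column model that carries the extra delta-lake attributes, with a later mapping step into databuilder's ColumnMetadata. The ColumnMetadata constructor signature used below is an assumption.

from typing import Optional


class ScrapedColumnMetadata:
    """Hypothetical intermediate column model: holds delta-lake-specific
    attributes (e.g. whether the column is a partition column) that have
    no natural home in Amundsen's ColumnMetadata yet."""

    def __init__(self, name: str, data_type: str, description: Optional[str],
                 sort_order: int, is_partition_column: bool = False) -> None:
        self.name = name
        self.data_type = data_type
        self.description = description
        self.sort_order = sort_order
        self.is_partition_column = is_partition_column

    def to_amundsen_column(self):
        # Map to databuilder's ColumnMetadata once the extra attributes (like
        # is_partition_column) have somewhere to go, e.g. badges or a fixed
        # column-tagging mechanism. Assumed signature:
        # ColumnMetadata(name, description, col_type, sort_order).
        from databuilder.models.table_metadata import ColumnMetadata
        return ColumnMetadata(self.name, self.description,
                              self.data_type, self.sort_order)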
return [self.scrape_table(table) for table in tables]


def scrape_table(self, table: Table) -> ScrapedTableMetadata:
    '''Takes a table object and creates a scraped table metadata object.'''
Could you share more context on the scraped table metadata object? Is it because it doesn't include all of the metadata needed for the full table metadata?
It's a similar problem as with ColumnMetadata. I am trying to collect all of the metadata I can from delta-lake tables in a single pass, and not all of it has an Amundsen equivalent yet. At this point in Amundsen's development, it seemed easier to scrape everything into a custom Python class and then map it to the Amundsen-specific classes later. For example, a ScrapedTableMetadata can produce a TableMetadata, a LastUpdatedMetadata, and so on.
I can add TODOs saying that it would be good to revisit this, though.
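As a rough sketch of that fan-out pattern (hypothetical, not the code in this PR; the TableMetadata and TableLastUpdated constructor signatures below are assumptions about databuilder's models):

from typing import List, Optional


class ScrapedTableMetadata:
    """Hypothetical intermediate model: everything scraped from a delta-lake
    table in one pass, whether or not Amundsen has a home for it yet."""

    def __init__(self, schema: str, table: str, description: Optional[str],
                 columns: List['ScrapedColumnMetadata'], is_view: bool,
                 last_updated_epoch: Optional[int]) -> None:
        self.schema = schema
        self.table = table
        self.description = description
        self.columns = columns
        self.is_view = is_view
        self.last_updated_epoch = last_updated_epoch

    def to_table_metadata(self, database: str, cluster: str):
        # Assumed signature:
        # TableMetadata(database, cluster, schema, name, description, columns, is_view)
        from databuilder.models.table_metadata import TableMetadata
        return TableMetadata(database, cluster, self.schema, self.table,
                             self.description,
                             [c.to_amundsen_column() for c in self.columns],
                             self.is_view)

    def to_table_last_updated(self, database: str, cluster: str):
        # Assumed signature:
        # TableLastUpdated(table_name, last_updated_time_epoch, schema, db, cluster)
        if self.last_updated_epoch is None:
            return None
        from databuilder.models.table_last_updated import TableLastUpdated
        return TableLastUpdated(self.table, self.last_updated_epoch,
                                self.schema, database, cluster)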
^^ @samshuster thanks for the PR!
Thanks for the review! I will take a look at answering your questions tomorrow. As for whether we have switched over to this approach at Edmunds: not yet. I have run it in our QA environment a couple of times and the load looked ok. That said, I might wait on merging this until I swap out our old scraping approach for this one (hopefully in production as well), to weed out any other issues that may pop up.
Right when I thought it was good to go, I spotted a couple of issues in the running QA job that only appeared when processing all of the tables.
I will work on resolving these, and hopefully the turnaround won't be too long.
Commit messages in this update (each signed off by sshuster <sshuster@edmunds.com>):
- …e script on how it would be used.
- fixing the table last updated parts
- Adding pyspark dependency
- Fixing formatting for the two deltalake classes
- Adding delta-lake unit tests and fixing delta lake metadata extractor bugs. Now is a working piece of code.
- Adding more typing information and changing formatting
- Fixing column taggging issue which doesn't accomplish what we need it to.
- fixing lint
- Fixing the typing
- fixing lint again
- More documentation
- adding some todos per code review. fixing the sample deltalake metadata script
- fixing lint
- removed pyspark from requirements and moved to extras; added some simple parallelism to the scraping; Fixed bug that should allow for parsing the describe statement on both spark 2 and spark 3; Fixed bug that caused parser to die if table does not exist
- Fixing another exception. Adding spark to all_deps
- Fixed table parsing so that it doesn't try getting table details on views.
- Fixed import
- fixing mypy issues
@samshuster ok, I fixed the issues mentioned and have confirmed it's working ok in our QA environment. Hopefully a better and more seamless solution will come in the future!
Thanks @samshuster!
Summary of Changes
This is the PR for a Delta Lake Amundsen extractor. This approach requires that the extractor runs on a Spark cluster, so it would need to be executed differently from other types of extractors; a rough sketch of that setup is shown below.
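The wiring below is illustrative only: the extractor class name and import path, the set_spark hook, and the config keys are assumptions rather than the extractor's documented interface, while the databuilder job/task/loader classes and pyhocon usage follow the usual databuilder sample pattern.

from pyhocon import ConfigFactory
from pyspark.sql import SparkSession

from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.task.task import DefaultTask

# Assumed import path and class name for the extractor introduced by this PR.
from databuilder.extractor.delta_lake_metadata_extractor import DeltaLakeMetadataExtractor

# Runs inside a Spark application (e.g. spark-submit or a Databricks job),
# unlike most extractors, which run as plain Python processes.
spark = SparkSession.builder.appName('amundsen-delta-lake-extraction').getOrCreate()

extractor = DeltaLakeMetadataExtractor()
extractor.set_spark(spark)  # assumed hook for handing the active SparkSession to the extractor

job_config = ConfigFactory.from_dict({
    # Scope and keys are illustrative, not the extractor's documented config.
    'extractor.delta_lake_table_metadata.cluster_key': 'qa',
    'extractor.delta_lake_table_metadata.schemas': ['default'],
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/amundsen/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/amundsen/relationships',
})

job = DefaultJob(conf=job_config,
                 task=DefaultTask(extractor=extractor, loader=FsNeo4jCSVLoader()))
job.launch()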
It currently emits both table metadata and table last-updated metadata, but it can be augmented at a later point to also emit:
Tests
Tests for all core extractor functionality were added.
Documentation
Initial documentation was added to the main README.md.
Checklist
Make sure you have checked all steps below to ensure a timely review.
make test