feat: support complex & nested column extraction for delta #1329

jroof88 · 2021-07-09T22:21:50Z

Summary of Changes

This change implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This PR also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:

I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.
Nested Fields have the ability to be commented. This DDL statement works for delta:
CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;
But there is not way to actually extract the nested fields comments:

from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)

Prints just the top level struct comment:

[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]

Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Tests

I added an entirely new test suite within test_deltalake_extractor.py that tests for lots of different possibilities of complex types and verifies it extracts all columns correctly.

Documentation

I added some comments in delta_metadata_extractor.py detailing the new configuration value and how to use. I also updated the databuilder docs for the delta_extractor to tell about the configuration value and show a quick example of how it would work on a complex type.

CheckList

Make sure you have checked all steps below to ensure a timely review.

PR title addresses the issue accurately and concisely.
PR includes a summary of changes.
PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
In case of new functionality, my PR adds documentation that describes how to use it.

jroof88 · 2021-07-14T21:00:07Z

@feng-tao Here is the backend PR for extracting complex delta types so they can be properly indexed for frontend search

feng-tao

looks great! I put a few comments, let me know what you think

databuilder/README.md

databuilder/databuilder/extractor/delta_lake_metadata_extractor.py

feng-tao · 2021-07-16T01:10:27Z

databuilder/databuilder/extractor/delta_lake_metadata_extractor.py

+                name=col_name,
+                data_type=curr_field.dataType.simpleString(),
+                sort_order=total_cols,
+                description=None,


assume no description for nested file?

At first pass I was under the impression that you couldn't. I am going to double check to be sure.

Interesting. This works for DDL:

CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;

but I'm unsure how to extract the nested comment:

from pprint import pprint raw = spark.sql("DESCRIBE db.jack_test").collect() pprint(raw) print("***********") fields = spark.table("db.jack_test").schema pprint(fields)

Output:

[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'), Row(col_name='', data_type='', comment=''), Row(col_name='# Partitioning', data_type='', comment=''), Row(col_name='Not partitioned', data_type='', comment='')] *********** StructType(List(StructField(struct_col,StructType(List(StructField(a,StringType,true),StructField(b,IntegerType,true))),true)))

@feng-tao I updated the commit message with this as a design decision. I am also going to file a Support Ticket with Databricks about this to see if I am missing something or if it is a current limitation. This PR is ready for another pass.

databuilder/databuilder/extractor/delta_lake_metadata_extractor.py

This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types. This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly. I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values. Two Design Descisions: 1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well. 2. Nested Fields have the ability to be commented. This DDL statement works for delta: `CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;` But there is not way to actually extract the nested fields comments: ``` from pprint import pprint raw = spark.sql("DESCRIBE db.jack_test").collect() pprint(raw) ``` Prints just the top level struct comment: ``` [Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'), Row(col_name='', data_type='', comment=''), Row(col_name='# Partitioning', data_type='', comment=''), Row(col_name='Not partitioned', data_type='', comment='')] ``` Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it. Signed-off-by: jroof88 <jack.roof@samsara.com>

feng-tao · 2021-07-17T04:34:36Z

thanks!

…io#1329) This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types. This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly. I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values. Two Design Descisions: 1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well. 2. Nested Fields have the ability to be commented. This DDL statement works for delta: `CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;` But there is not way to actually extract the nested fields comments: ``` from pprint import pprint raw = spark.sql("DESCRIBE db.jack_test").collect() pprint(raw) ``` Prints just the top level struct comment: ``` [Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'), Row(col_name='', data_type='', comment=''), Row(col_name='# Partitioning', data_type='', comment=''), Row(col_name='Not partitioned', data_type='', comment='')] ``` Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it. Signed-off-by: jroof88 <jack.roof@samsara.com> Signed-off-by: Amom <amommendes@hotmail.com>

…io#1329) This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types. This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly. I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values. Two Design Descisions: 1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well. 2. Nested Fields have the ability to be commented. This DDL statement works for delta: `CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;` But there is not way to actually extract the nested fields comments: ``` from pprint import pprint raw = spark.sql("DESCRIBE db.jack_test").collect() pprint(raw) ``` Prints just the top level struct comment: ``` [Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'), Row(col_name='', data_type='', comment=''), Row(col_name='# Partitioning', data_type='', comment=''), Row(col_name='Not partitioned', data_type='', comment='')] ``` Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it. Signed-off-by: jroof88 <jack.roof@samsara.com>

jroof88 requested a review from a team as a code owner July 9, 2021 22:21

boring-cyborg bot added the area:databuilder From databuilder folder label Jul 9, 2021

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 3 times, most recently from 3121422 to 05b7558 Compare July 14, 2021 16:46

jroof88 changed the title ~~WIP FOR NESTED DELTA SCRAPING~~ NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta ) Jul 14, 2021

jroof88 changed the title ~~NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta )~~ NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta) Jul 14, 2021

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from 05b7558 to b7435a2 Compare July 14, 2021 16:59

jroof88 changed the title ~~NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta)~~ feat databuilder: support complex & nested column extraction for delta [NOT READY FOR REVIEW] Jul 14, 2021

jroof88 changed the title ~~feat databuilder: support complex & nested column extraction for delta [NOT READY FOR REVIEW]~~ feat: support complex & nested column extraction for delta [NOT READY FOR REVIEW] Jul 14, 2021

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from b7435a2 to 3b67a17 Compare July 14, 2021 20:49

jroof88 changed the title ~~feat: support complex & nested column extraction for delta [NOT READY FOR REVIEW]~~ feat: support complex & nested column extraction for delta Jul 14, 2021

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from 3b67a17 to 6e55fd4 Compare July 14, 2021 20:58

boring-cyborg bot added area:all Related to all the project area:docs labels Jul 14, 2021

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 5 times, most recently from 30377e0 to ddababb Compare July 16, 2021 00:27

feng-tao reviewed Jul 16, 2021

View reviewed changes

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 2 times, most recently from 97e0e08 to b63bb28 Compare July 16, 2021 16:28

jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from b63bb28 to 6a66f41 Compare July 16, 2021 16:47

jroof88 requested a review from feng-tao July 16, 2021 16:47

feng-tao approved these changes Jul 17, 2021

View reviewed changes

feng-tao merged commit 9db18bc into amundsen-io:main Jul 17, 2021

cooper6581 mentioned this pull request Aug 6, 2021

fix: Handle ParseException when reading schema property #1422

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support complex & nested column extraction for delta #1329

feat: support complex & nested column extraction for delta #1329

jroof88 commented Jul 9, 2021 •

edited

jroof88 commented Jul 14, 2021 •

edited

feng-tao left a comment

feng-tao Jul 16, 2021

jroof88 Jul 16, 2021

jroof88 Jul 16, 2021 •

edited

jroof88 Jul 16, 2021

feng-tao commented Jul 17, 2021

feat: support complex & nested column extraction for delta #1329

feat: support complex & nested column extraction for delta #1329

Conversation

jroof88 commented Jul 9, 2021 • edited

Summary of Changes

Tests

Documentation

CheckList

jroof88 commented Jul 14, 2021 • edited

feng-tao left a comment

Choose a reason for hiding this comment

feng-tao Jul 16, 2021

Choose a reason for hiding this comment

jroof88 Jul 16, 2021

Choose a reason for hiding this comment

jroof88 Jul 16, 2021 • edited

Choose a reason for hiding this comment

jroof88 Jul 16, 2021

Choose a reason for hiding this comment

feng-tao commented Jul 17, 2021

jroof88 commented Jul 9, 2021 •

edited

jroof88 commented Jul 14, 2021 •

edited

jroof88 Jul 16, 2021 •

edited