feat: support complex & nested column extraction for delta #1329
Conversation
@feng-tao Here is the backend PR for extracting complex delta types so they can be properly indexed for frontend search.
Looks great! I left a few comments, let me know what you think.
```python
name=col_name,
data_type=curr_field.dataType.simpleString(),
sort_order=total_cols,
description=None,
```
Assume no description for nested fields?
At first pass I was under the impression that you couldn't. I am going to double check to be sure.
Interesting. This works for DDL:

```sql
CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;
```

but I'm unsure how to extract the nested comment:

```python
from pprint import pprint

raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
print("***********")
fields = spark.table("db.jack_test").schema
pprint(fields)
```

Output:

```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
***********
StructType(List(StructField(struct_col,StructType(List(StructField(a,StringType,true),StructField(b,IntegerType,true))),true)))
```
@feng-tao I updated the commit message with this as a design decision. I am also going to file a Support Ticket with Databricks about this to see if I am missing something or if it is a current limitation. This PR is ready for another pass.
This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. It was loosely modeled on the bigquery_metadata_extractor, which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is somewhat more complex because delta supports arrays and maps whose elements, keys, and values can themselves be nested types.

This commit also includes a new test suite covering delta tables with these complex types to make sure the parsing works properly.

I also put this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring to suddenly see many more columns once their complex types are properly parsed. The configuration value is checked before recursing into a nested type, preserving backwards compatibility. Folks who use the delta extractor but want nested columns can easily opt in by setting the configuration value.

Two design decisions:

1. I toyed with how to render the name for MapTypes. Delta lets you use complex types as a map's key (e.g. `map<struct<b:array<struct<c:int,d:string>>,e:double>,int>`). Although that is a valid delta table, I doubt arrays or maps are often used as map keys. As a result, I chose not to recurse into a map's key and display its nested columns as their own columns. People who use maps are likely more concerned with complex types in the value, and can read the key from the full type string printed for the map. I added tests for this as well.

2. Nested fields have the ability to be commented. This DDL statement works for delta:

```sql
CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;
```

But there is no way to actually extract the nested fields' comments:

```python
from pprint import pprint

raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```

This prints just the top-level struct comment:

```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```

Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revisit if they implement it.

Signed-off-by: jroof88 <jack.roof@samsara.com>
thanks!
Summary of Changes
This change implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. It was loosely modeled on the bigquery_metadata_extractor, which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is somewhat more complex because delta supports arrays and maps whose elements, keys, and values can themselves be nested types.
This PR also includes a new test suite covering delta tables with these complex types to make sure the parsing works properly.
I also put this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring to suddenly see many more columns once their complex types are properly parsed. The configuration value is checked before recursing into a nested type, preserving backwards compatibility. Folks who use the delta extractor but want nested columns can easily opt in by setting the configuration value.
Two design decisions:
1. I toyed with how to render the name for MapTypes. Delta lets you use complex types as a map's key (e.g. `map<struct<b:array<struct<c:int,d:string>>,e:double>,int>`). Although that is a valid delta table, I doubt arrays or maps are often used as map keys. As a result, I chose not to recurse into a map's key and display its nested columns as their own columns. People who use maps are likely more concerned with complex types in the value, and can read the key from the full type string printed for the map. I added tests for this as well.
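The recursion described above can be sketched roughly as follows. This is an illustrative sketch, not the extractor's actual code: plain dicts stand in for Spark's StructType/ArrayType/MapType, and the names `flatten_columns`, `type_string`, and the `extract_nested` flag are my own.

```python
# Sketch of nested-column flattening for struct/array/map types.
# Plain dicts stand in for Spark's schema objects; all names are illustrative.

def flatten_columns(name, dtype, extract_nested=True, prefix=""):
    """Yield (column_name, type_string) pairs, recursing into nested types."""
    full_name = f"{prefix}{name}"
    yield full_name, type_string(dtype)
    if not extract_nested:  # backwards-compatible default: top-level columns only
        return
    kind = dtype.get("type") if isinstance(dtype, dict) else None
    if kind == "struct":
        for field in dtype["fields"]:
            yield from flatten_columns(field["name"], field["type"],
                                       extract_nested, prefix=full_name + ".")
    elif kind == "array":
        yield from flatten_columns("element", dtype["element"],
                                   extract_nested, prefix=full_name + ".")
    elif kind == "map":
        # Per the design decision above: do not recurse into the map's key,
        # only into its value; the key stays visible in the map's type string.
        yield from flatten_columns("value", dtype["value"],
                                   extract_nested, prefix=full_name + ".")

def type_string(dtype):
    """Render a delta-style type string such as struct<a:string,b:int>."""
    if not isinstance(dtype, dict):
        return dtype
    kind = dtype["type"]
    if kind == "struct":
        inner = ",".join(f"{f['name']}:{type_string(f['type'])}"
                         for f in dtype["fields"])
        return f"struct<{inner}>"
    if kind == "array":
        return f"array<{type_string(dtype['element'])}>"
    return f"map<{type_string(dtype['key'])},{type_string(dtype['value'])}>"
```

For a column `struct_col` of type `struct<a:string,b:array<struct<c:int>>>`, this yields names like `struct_col`, `struct_col.a`, `struct_col.b`, `struct_col.b.element`, and `struct_col.b.element.c`, while `extract_nested=False` yields only `struct_col`.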
2. Nested fields have the ability to be commented. This DDL statement works for delta:

```sql
CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;
```

But there is no way to actually extract the nested fields' comments:

```python
from pprint import pprint

raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```

This prints just the top-level struct comment:

```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```

Since this is the current state of delta, this change sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revisit if they implement it.
Tests
I added an entirely new test suite within test_deltalake_extractor.py that tests many different possibilities of complex types and verifies that all columns are extracted correctly.
Documentation
I added some comments in delta_metadata_extractor.py detailing the new configuration value and how to use it. I also updated the databuilder docs for the delta extractor to describe the configuration value and show a quick example of how it would work on a complex type.
Checklist
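Opting in might look roughly like this in a databuilder job config. Note this is a sketch under assumptions: the key name `extract_nested_columns` and the `extractor.delta_lake_table_metadata` scope are my guesses, not necessarily the extractor's actual constants; check delta_metadata_extractor.py for the real ones.

```python
from pyhocon import ConfigFactory

# Hypothetical opt-in; the exact config key is an assumption, and the real
# constant is defined in the delta lake metadata extractor module.
job_config = ConfigFactory.from_dict({
    'extractor.delta_lake_table_metadata.extract_nested_columns': True,
})
```

With the flag unset, the extractor keeps its previous top-level-only behavior, so existing users see no change in column counts.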
Make sure you have checked all steps below to ensure a timely review.