Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support complex & nested column extraction for delta #1329

Merged

Conversation

jroof88
Copy link
Contributor

@jroof88 jroof88 commented Jul 9, 2021

Summary of Changes

This change implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This PR also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:

  1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.

  2. Nested Fields have the ability to be commented. This DDL statement works for delta:
    CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;
    But there is not way to actually extract the nested fields comments:

from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)

Prints just the top level struct comment:

[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]

Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Tests

I added an entirely new test suite within test_deltalake_extractor.py that tests for lots of different possibilities of complex types and verifies it extracts all columns correctly.

Documentation

I added some comments in delta_metadata_extractor.py detailing the new configuration value and how to use. I also updated the databuilder docs for the delta_extractor to tell about the configuration value and show a quick example of how it would work on a complex type.

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely.
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.

@jroof88 jroof88 requested a review from a team as a code owner July 9, 2021 22:21
@boring-cyborg boring-cyborg bot added the area:databuilder From databuilder folder label Jul 9, 2021
@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 3 times, most recently from 3121422 to 05b7558 Compare July 14, 2021 16:46
@jroof88 jroof88 changed the title WIP FOR NESTED DELTA SCRAPING NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta ) Jul 14, 2021
@jroof88 jroof88 changed the title NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta ) NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta) Jul 14, 2021
@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from 05b7558 to b7435a2 Compare July 14, 2021 16:59
@jroof88 jroof88 changed the title NOT READY FOR REVIEW (feat databuilder: support complex & nested column extraction for delta) feat databuilder: support complex & nested column extraction for delta [NOT READY FOR REVIEW] Jul 14, 2021
@jroof88 jroof88 changed the title feat databuilder: support complex & nested column extraction for delta [NOT READY FOR REVIEW] feat: support complex & nested column extraction for delta [NOT READY FOR REVIEW] Jul 14, 2021
@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from b7435a2 to 3b67a17 Compare July 14, 2021 20:49
@jroof88 jroof88 changed the title feat: support complex & nested column extraction for delta [NOT READY FOR REVIEW] feat: support complex & nested column extraction for delta Jul 14, 2021
@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from 3b67a17 to 6e55fd4 Compare July 14, 2021 20:58
@boring-cyborg boring-cyborg bot added area:all Related to all the project area:docs labels Jul 14, 2021
@jroof88
Copy link
Contributor Author

jroof88 commented Jul 14, 2021

@feng-tao Here is the backend PR for extracting complex delta types so they can be properly indexed for frontend search

@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 5 times, most recently from 30377e0 to ddababb Compare July 16, 2021 00:27
Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! I put a few comments, let me know what you think

databuilder/README.md Outdated Show resolved Hide resolved
name=col_name,
data_type=curr_field.dataType.simpleString(),
sort_order=total_cols,
description=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume no description for nested file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first pass I was under the impression that you couldn't. I am going to double check to be sure.

Copy link
Contributor Author

@jroof88 jroof88 Jul 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. This works for DDL:

CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;

but I'm unsure how to extract the nested comment:

from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
print("***********")
fields = spark.table("db.jack_test").schema
pprint(fields)

Output:

[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
***********
StructType(List(StructField(struct_col,StructType(List(StructField(a,StringType,true),StructField(b,IntegerType,true))),true)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feng-tao I updated the commit message with this as a design decision. I am also going to file a Support Ticket with Databricks about this to see if I am missing something or if it is a current limitation. This PR is ready for another pass.

@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch 2 times, most recently from 97e0e08 to b63bb28 Compare July 16, 2021 16:28
This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:
1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.

2. Nested Fields have the ability to be commented. This DDL statement works for delta:
`CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;`
But there is not way to actually extract the nested fields comments:
```
from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```
Prints just the top level struct comment:
```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```
Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Signed-off-by: jroof88 <jack.roof@samsara.com>
@jroof88 jroof88 force-pushed the jroof-amundsen-nestedDeltaColumnScrape branch from b63bb28 to 6a66f41 Compare July 16, 2021 16:47
@jroof88 jroof88 requested a review from feng-tao July 16, 2021 16:47
@feng-tao
Copy link
Member

thanks!

@feng-tao feng-tao merged commit 9db18bc into amundsen-io:main Jul 17, 2021
amommendes pushed a commit to amommendes/amundsen that referenced this pull request Jul 22, 2021
…io#1329)

This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:
1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.

2. Nested Fields have the ability to be commented. This DDL statement works for delta:
`CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;`
But there is not way to actually extract the nested fields comments:
```
from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```
Prints just the top level struct comment:
```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```
Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Signed-off-by: jroof88 <jack.roof@samsara.com>
Signed-off-by: Amom <amommendes@hotmail.com>
zacr pushed a commit to SaltIO/amundsen that referenced this pull request May 13, 2022
…io#1329)

This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:
1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.

2. Nested Fields have the ability to be commented. This DDL statement works for delta:
`CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;`
But there is not way to actually extract the nested fields comments:
```
from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```
Prints just the top level struct comment:
```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```
Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Signed-off-by: jroof88 <jack.roof@samsara.com>
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this pull request Jun 30, 2022
…io#1329)

This commit implements proper extraction of complex & nested types for delta tables in the delta_metadata_extractor. This commit was slightly modeled off of the bigquery_metadata_extractor which supports nested columns (https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/bigquery_metadata_extractor.py). However, this implementation is slightly more complex because delta allows support for arrays and maps whose elements, keys, and values can also be nested types.

This commit also includes a new test suite to test delta tables with these complex types and make sure the parsing works properly.

I also added this behind a configuration value. Since users of amundsen might be using the delta extractor as is, it could be jarring that one day they have X number more columns since their complex types are properly parsed. This configuration value is checked before recursing into a nested type to verify we have backwards compatibility. Folks who are using the delta extractor but want nested columns can easily opt in by setting the configuration values.

Two Design Descisions:
1. I toyed with how to render the name for MapTypes. Delta allows you have complex types as the maps key (ex: 'map<struct<b:array<struct<c:int,d:string>>,e:double>,int>'). Although this is a valid delta table I wonder how used an array or a map is as a maps key. As a result, I chose not to parse through a maps key and display all nested columns as it's own columns. I figured people who are using maps are more concerning with complex types in the value and can simply construct the key from the regular string that is printed out when showing the whole map type. I added tests for this as well.

2. Nested Fields have the ability to be commented. This DDL statement works for delta:
`CREATE TABLE db.jack_test (struct_col struct<a:string COMMENT "NESTED STRING COL", b:int COMMENT "NESTED INT COL"> COMMENT "THIS IS A STRUCT COL") USING delta;`
But there is not way to actually extract the nested fields comments:
```
from pprint import pprint
raw = spark.sql("DESCRIBE db.jack_test").collect()
pprint(raw)
```
Prints just the top level struct comment:
```
[Row(col_name='struct_col', data_type='struct<a:string,b:int>', comment='THIS IS A STRUCT COL'),
 Row(col_name='', data_type='', comment=''),
 Row(col_name='# Partitioning', data_type='', comment=''),
 Row(col_name='Not partitioned', data_type='', comment='')]
```
Since this is the current state of delta, this commit sets all descriptions of nested fields to None. I filed a bug report with delta about this and will revist if they implement it.

Signed-off-by: jroof88 <jack.roof@samsara.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:all Related to all the project area:databuilder From databuilder folder
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants