feat: Column lineage implementation & sample ingest scripts #470

sewardgw · 2021-04-08T14:04:35Z

This PR implements a column level lineage model per the RFC: amundsen-io/rfcs#32

This PR will not be merged until after the RFC's final comment period has closed.

Summary of Changes

This is an initial implementation with thoughts below on how this code may be curated going forward.

Current implementation

Added a column lineage model
Added CSV extracts for both table and column lineage + sample data
Updated the existing TableLineage to take a table key instead of the low-level inputs for table metadata (e.g. db, cluster, schema & name).
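As a rough illustration of the key-based interface, the sketch below groups the rows of a lineage CSV by source key so that each source yields one lineage record. The column names `source_table_key` and `target_table_key`, the `LineageRecord` stand-in and the `group_lineage_rows` helper are assumptions made for this example and are not necessarily the names used in this PR.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageRecord:
    """Hypothetical stand-in for a key-based lineage model."""
    source_key: str
    downstream_deps: List[str] = field(default_factory=list)


def group_lineage_rows(csv_text: str,
                       source_col: str = 'source_table_key',
                       target_col: str = 'target_table_key') -> List[LineageRecord]:
    """Group (source, target) rows into one record per source key."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[source_col]].append(row[target_col])
    # dicts preserve insertion order, so sources come out in first-seen order
    return [LineageRecord(src, deps) for src, deps in grouped.items()]


# Assumed CSV layout: one source/target key pair per row
SAMPLE = """source_table_key,target_table_key
hive://gold.test_schema/test_table1,dynamo://gold.test_schema/test_table2
hive://gold.test_schema/test_table1,hive://gold.test_schema/test_view1
"""

records = group_lineage_rows(SAMPLE)
```

Because only keys are passed around, the same grouping would work for column lineage by pointing `source_col` and `target_col` at column-key fields.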

Additional thoughts:

There are 2 reasons why the TableLineage was updated to take a key instead of dependencies:

Creation of the table key shouldn't be duplicated from the TableMetadata class
Any time that lineage is being created, the keys for the nodes to be connected should already be available, as users will likely be creating lineage after creating table/column metadata. In the worst case, users can create a Table/Column metadata object and have that object create the corresponding key.

Lineage probably should be created using objects as inputs instead of the keys / strings. For example, that would make the implementation for Column Lineage change from:

class ColumnLineage(GraphSerializable):
    ...
    def __init__(self,
                 column_key: str,
                 downstream_deps: List[str] = None,  # List of column keys
                 ) -> None:
    ...

to

class ColumnLineage(GraphSerializable):
    ...
    def __init__(self,
                 column: ColumnMetadata,
                 downstream_deps: List[ColumnMetadata] = None,  # List of column metadata objects
                 ) -> None:
    ...

The reason behind this abstraction layer is that it will allow lineage to grow and be maintained more easily in the future. Here are some benefits / capabilities that this could more easily enable:

Dynamically create lineage between different types of nodes (e.g. allow tables to be connected to other tables or to "ELT jobs" through lineage but not to columns)
Create a more generic and reusable Lineage class
More intuitive development (IMHO)

This class-based approach for the inputs would require a larger set of changes, which is why the current implementation relies on the keys. Consider how the _create_rel_iterator function might work for a generic Lineage class:

def _create_rel_iterator(self) -> Iterator[GraphRelationship]:
    """
    Create relations between source table and all the downstream tables
    :return:
    """
    for downstream in self.downstream_deps:
        relationship = GraphRelationship(
            start_key=self.source.get_key(),
            start_label=self.source.get_node_label(),
            end_label=downstream.get_node_label(),
            end_key=downstream.get_key(),
            type=LineageBase.ORIGIN_DEPENDENCY_RELATION_TYPE,
            reverse_type=LineageBase.DEPENDENCY_ORIGIN_RELATION_TYPE,
            attributes={}
        )
        yield relationship

Currently there is no consistent way to access the node label for different node types: Columns have an attribute called COLUMN_NODE_LABEL and Tables have the attribute TABLE_NODE_LABEL. Similarly, there is no consistent way to access the keys. Each class has a tightly coupled function (e.g. _get_table_key, _get_col_key), and Columns cannot currently generate their key on their own without the Table. These items, and a few others, would need to be addressed to enable the class-based approach.
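One way to picture the uniform interface described above is a small protocol that every node type implements, so a generic lineage class never needs to know whether it is linking tables or columns. This is a minimal sketch under assumptions: `LineageNode`, `Table`, `Column`, `Lineage` and the plain-tuple relationship are hypothetical stand-ins rather than databuilder classes, and the key formats are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Tuple


class LineageNode(Protocol):
    """Hypothetical interface: anything that can take part in lineage."""
    def get_key(self) -> str: ...
    def get_node_label(self) -> str: ...


@dataclass(frozen=True)
class Table:
    database: str
    cluster: str
    schema: str
    name: str

    def get_key(self) -> str:
        return f'{self.database}://{self.cluster}.{self.schema}/{self.name}'

    def get_node_label(self) -> str:
        return 'Table'


@dataclass(frozen=True)
class Column:
    table: Table  # in this sketch the column key is derived from its parent table
    name: str

    def get_key(self) -> str:
        return f'{self.table.get_key()}/{self.name}'

    def get_node_label(self) -> str:
        return 'Column'


class Lineage:
    """Generic lineage between any nodes that expose a key and a label."""
    def __init__(self, source: LineageNode, downstream_deps: List[LineageNode]) -> None:
        self.source = source
        self.downstream_deps = downstream_deps

    def create_rel_iterator(self) -> Iterator[Tuple[str, str, str, str]]:
        # Yields (start_label, start_key, end_label, end_key) per downstream node
        for downstream in self.downstream_deps:
            yield (self.source.get_node_label(), self.source.get_key(),
                   downstream.get_node_label(), downstream.get_key())


table1 = Table('hive', 'gold', 'test_schema', 'test_table1')
view1 = Table('hive', 'gold', 'test_schema', 'test_view1')
rels = list(Lineage(Column(table1, 'id'), [Column(view1, 'id')]).create_rel_iterator())
```

With this shape, restricting which node types may be connected (for example tables to tables but not to columns) could be a simple label check in the constructor.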

Tests

New tests added

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

CheckList

Make sure you have checked all steps below to ensure a timely review.

PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
PR includes a summary of changes.
PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
PR passes make test


verdan · 2021-04-09T19:33:33Z

databuilder/models/table_lineage.py

    """
    LABEL = 'Lineage'
-    ORIGIN_DEPENDENCY_RELATION_TYPE = 'UPSTREAM'
-    DEPENDENCY_ORIGIN_RELATION_TYPE = 'DOWNSTREAM'
+    ORIGIN_DEPENDENCY_RELATION_TYPE = 'HAS_DOWNSTREAM'


I think we should make it

ORIGIN_DEPENDENCY_RELATION_TYPE = 'DOWNSTREAM'
DEPENDENCY_ORIGIN_RELATION_TYPE = 'UPSTREAM'

That would keep it simple and consistent with the UI.

For Example:

origin: hive://gold.test_schema/test_table1
dependency: dynamo://gold.test_schema/test_table2

origin: hive://gold.test_schema/test_table1
dependency: hive://gold.test_schema/test_view1

Then the relation should look something like this.

origin -->[DOWNSTREAM] dependency
origin <--[UPSTREAM] dependency

I agree, these semantics can be confusing 😕

I do think it is important to have either a verb or preposition to help provide context to the words "upstream" and "downstream". I personally also agree with your directionality which is why I chose HAS_DOWNSTREAM for the origin --> dependency relationship and HAS_UPSTREAM for the dependency --> origin relationship. By doing this, the nodes retrieved from a table using the HAS_DOWNSTREAM relationship will also show up in the downstream tab in the UI.

Another approach could be origin -- [UPSTREAM_TO] --> dependency and dependency -- [DOWNSTREAM_OF] --> origin. While this provides the same level of semantic context, retrieving a node with the UPSTREAM_TO relationship would actually show up in the downstream tab in the UI which could be slightly confusing.

Above all, I think that consistency here is probably the most important and would love to get input from a broader audience.
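To make the directionality concrete, the sketch below emits both edges for one of the example pairs using the HAS_DOWNSTREAM / HAS_UPSTREAM naming. The `Edge` tuple and the `lineage_edges` helper are hypothetical and exist only to show which label sits on which direction.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    start_key: str
    rel_type: str
    end_key: str


def lineage_edges(origin: str, dependencies: List[str]) -> List[Edge]:
    """Return the forward and reverse edge for each origin/dependency pair."""
    edges: List[Edge] = []
    for dep in dependencies:
        edges.append(Edge(origin, 'HAS_DOWNSTREAM', dep))  # origin --> dependency
        edges.append(Edge(dep, 'HAS_UPSTREAM', origin))    # dependency --> origin
    return edges


ORIGIN = 'hive://gold.test_schema/test_table1'
edges = lineage_edges(ORIGIN, ['dynamo://gold.test_schema/test_table2'])

# Following HAS_DOWNSTREAM from the origin returns its downstream nodes
downstream = [e.end_key for e in edges
              if e.start_key == ORIGIN and e.rel_type == 'HAS_DOWNSTREAM']
```

Under the alternative UPSTREAM_TO / DOWNSTREAM_OF naming, the same query would have to filter on UPSTREAM_TO to fetch downstream nodes, which is the mismatch noted above.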

no strong feelings, but my 2c: the uplink (down -> up) would be HAS_UPSTREAM, but the downlink would be DOWNSTREAM_TO, because the downstream link is causal, but the upstream link is inferred/denormalized (due to the downstream). https://neo4j.com/developer/guide-data-modeling/

When I read the relationships with the proposed conventions, the semantics don't resonate with me. To me, DOWNSTREAM_TO reads as if the node is itself downstream, rather than indicating that the related node is downstream of the source. I put it into a quick picture just to baseline some of the options:

I'm happy to switch it up since consistency is really the key aspect here - I'll post this to the OSS chat to see if we can get some additional perspectives / votes.


dorianj

LGTM.

Note that this is blocked on amundsen-io/rfcs#32

dorianj · 2021-04-10T17:53:27Z

databuilder/models/table_lineage.py



-class TableLineage(GraphSerializable):
+class BaseLineage(GraphSerializable):


dorianj · 2021-04-10T17:55:11Z

databuilder/models/table_lineage.py

+    """
+
+    def __init__(self,
+                 table_key: str,


passing this key (vs properties) seems better indeed 👍

dorianj · 2021-04-10T17:57:43Z

databuilder/models/table_lineage.py

    """
    LABEL = 'Lineage'
-    ORIGIN_DEPENDENCY_RELATION_TYPE = 'UPSTREAM'
-    DEPENDENCY_ORIGIN_RELATION_TYPE = 'DOWNSTREAM'
+    ORIGIN_DEPENDENCY_RELATION_TYPE = 'HAS_DOWNSTREAM'


no strong feelings, but my 2c: the uplink (down -> up) would be HAS_UPSTREAM, but the downlink would be DOWNSTREAM_TO, because the downstream link is causal, but the upstream link is inferred/denormalized (due to the downstream). https://neo4j.com/developer/guide-data-modeling/

dorianj · 2021-04-16T22:52:09Z

amundsen-io/rfcs#32 has landed -- please ping me once the relationship label question is resolved and the conflict is fixed, and I'll land this


sewardgw · 2021-04-22T18:59:05Z

@dorianj - I would consider the relationship label as "resolved" unless you really want to change it. IMHO, the values @verdan and I came up with originally feel the most intuitive (see the simple graph examples in the image above) and, without a clear majority in any particular direction from the community, I would be inclined to keep it as is.

sewardgw added 2 commits April 8, 2021 10:03

Slight refactor to table lineage interface, added csv extract to impo…

62636a0

…rt table lineage Signed-off-by: Grant Seward <grant@stemma.ai>

Removed whitespace

20b003d

Signed-off-by: Grant Seward <grant@stemma.ai>

sewardgw changed the title ~~Slight refactor to table lineage interface, added csv extract to impo…~~ WIP: Table lineage change to receive table key & CSV extract for table lineage Apr 8, 2021

sewardgw added 4 commits April 8, 2021 12:03

Local fork test

d9cc905

Signed-off-by: Grant Seward <grant@stemma.ai>

Fixed linting

13bc766

Signed-off-by: Grant Seward <grant@stemma.ai>

Fixed isort

4b4568b

Signed-off-by: Grant Seward <grant@stemma.ai>

Additional test data

2e690bc

Signed-off-by: Grant Seward <grant@stemma.ai>

sewardgw changed the title ~~WIP: Table lineage change to receive table key & CSV extract for table lineage~~ feat: Column lineage implementation & sample ingest scripts Apr 8, 2021

sewardgw marked this pull request as ready for review April 8, 2021 18:28

sewardgw requested review from allisonsuarez, dikshathakur3119, feng-tao, jinhyukchang and a team as code owners April 8, 2021 18:28

Created generic lineage interface, changed upstream/downstream wordin…

b83dad8

…g to be explicit Signed-off-by: Grant Seward <grant@stemma.ai>

verdan reviewed Apr 9, 2021

sewardgw added 2 commits April 9, 2021 15:35

removed white space...

4b1092d

Signed-off-by: Grant Seward <grant@stemma.ai>

Fixed static typing

b71ced4

Signed-off-by: Grant Seward <grant@stemma.ai>

dorianj reviewed Apr 10, 2021

sewardgw added 2 commits April 19, 2021 12:33

Merge branch 'master' of github.com:sewardgw/amundsendatabuilder into…

e51d82d

… table-lineage-csv-ingest Signed-off-by: Grant Seward <grant@stemma.ai> # Conflicts: # tests/unit/extractor/test_csv_extractor.py

Fixed test from upstream merge

ab51ac2

Signed-off-by: Grant Seward <grant@stemma.ai>

dorianj approved these changes Apr 22, 2021

dorianj merged commit f5e6ba4 into amundsen-io:master Apr 25, 2021
