Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Column lineage implementation & sample ingest scripts #470

Merged
merged 11 commits into from
Apr 25, 2021
Merged

feat: Column lineage implementation & sample ingest scripts #470

merged 11 commits into from
Apr 25, 2021

Conversation

sewardgw
Copy link
Contributor

@sewardgw sewardgw commented Apr 8, 2021

This PR implements a column level lineage model per the RFC: amundsen-io/rfcs#32

This PR will not be merged until after the RFC's final comment period has closed.

Summary of Changes

This is an initial implementation with thoughts below on how this code may be curated going forward.

Current implementation

  • Added a column lineage model
  • Added CSV extracts for both table and column lineage + sample data
  • Updated the existing TableLineage to take a table key instead of the low-level inputs for table metadata (e.g. db, cluster, schema & name).

Additional thoughts:

There are 2 reasons why the TableLineage was updated to take a key instead of dependencies:

  1. Creation of the table key shouldn't be duplicated from the TableMetadata class
  2. Any time that lineage is being created, the keys for the nodes to be connected will should already be available as users will likely be creating lineage after creating table/column metadata. Worst case scenario, users can create a Table/Column metadata object and have that object create the corresponding key.

Lineage probably should be created using objects as inputs instead of the keys / strings. For example, that would make the implementation for Column Lineage change from:

class ColumnLineage(GraphSerializable):
    ...
    def __init__(self,
                 column_key: str,
                 downstream_deps: List[str] = None,  # List of column keys
                 ) -> None:
    ...

to

class ColumnLineage(GraphSerializable):
    ...
    def __init__(self,
                 column: ColumnMetadata,
                 downstream_deps: List[ColumnMetadata] = None,  # List of column metadata objects
                 ) -> None:
    ...

The reason behind this abstraction layer is that it will allow lineage to grow and be maintained more easily in the future. Here are some benefits / capabilities that this could more easily enable:

  • Dynamically create lineage between different types of nodes (e.g. allow tables to be connected to other tables or to "ELT jobs" through lineage but not to columns)
  • Create a more generic and reusable Lineage class
  • More intuitive development (IMHO)

This class-based approach for the inputs would require a larger set of changes, which is why the current implementation relies on the keys. Consider how the _create_rel_iterator function might work for a generic Lineage class:

def _create_rel_iterator(self) -> Iterator[GraphRelationship]:
    """
    Create relations between source table and all the downstream tables
    :return:
    """
    for downstream in self.downstream_deps:
        relationship = GraphRelationship(
            start_key=self.source.get_key(),
            start_label=self.source.get_node_label(),
            end_label=downstream.get_node_label(),
            end_key=downstream.get_key(),
            type=LineageBase.ORIGIN_DEPENDENCY_RELATION_TYPE,
            reverse_type=LineageBase.DEPENDENCY_ORIGIN_RELATION_TYPE,
            attributes={}
        )
        yield relationship

Currently there is not a consistent way to access the node label for different node types. Columns have an attribute called COLUMN_NODE_LABEL and Tables have the attribute TABLE_NODE_LABEL; similarly, there is not a consistent way to access the keys. Each class has a tightly coupled function (e.g: _get_table_key, _get_col_key) and Columns are not currently able to generate this key on their own without the Table as well. These items, and a few others, would need to be addressed to enable this.

Tests

New tests added

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

…rt table lineage

Signed-off-by: Grant Seward <grant@stemma.ai>
Signed-off-by: Grant Seward <grant@stemma.ai>
@sewardgw sewardgw changed the title Slight refactor to table lineage interface, added csv extract to impo… WIP: Table lineage change to receive table key & CSV extract for table lineage Apr 8, 2021
Signed-off-by: Grant Seward <grant@stemma.ai>
Signed-off-by: Grant Seward <grant@stemma.ai>
Signed-off-by: Grant Seward <grant@stemma.ai>
Signed-off-by: Grant Seward <grant@stemma.ai>
@sewardgw sewardgw changed the title WIP: Table lineage change to receive table key & CSV extract for table lineage feat: Column lineage implementation & sample ingest scripts Apr 8, 2021
@sewardgw sewardgw marked this pull request as ready for review April 8, 2021 18:28
…g to be explicit

Signed-off-by: Grant Seward <grant@stemma.ai>
"""
LABEL = 'Lineage'
ORIGIN_DEPENDENCY_RELATION_TYPE = 'UPSTREAM'
DEPENDENCY_ORIGIN_RELATION_TYPE = 'DOWNSTREAM'
ORIGIN_DEPENDENCY_RELATION_TYPE = 'HAS_DOWNSTREAM'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make it

    ORIGIN_DEPENDENCY_RELATION_TYPE = 'DOWNSTREAM'
    DEPENDENCY_ORIGIN_RELATION_TYPE = 'UPSTREAM'

Will make it simple, and clear in terms of the same as UI.

For Example:

origin: hive://gold.test_schema/test_table1
dependency: dynamo://gold.test_schema/test_table2

origin: hive://gold.test_schema/test_table1
dependency: hive://gold.test_schema/test_view1

hen the relation should look something like this.

origin -->[DOWNSTREAM] dependency
origin <--[UPSTREAM] dependency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, these semantics can be confusing 😕

I do think it is important to have either a verb or preposition to help provide context to the words "upstream" and "downstream". I personally also agree with your directionality which is why I chose HAS_DOWNSTREAM for the origin --> dependency relationship and HAS_UPSTREAM for the dependency --> origin relationship. By doing this, the nodes retrieved from a table using the HAS_DOWNSTREAM relationship will also show up in the downstream tab in the UI.

Another approach could be origin -- [UPSTREAM_TO] --> dependency and dependency -- [DOWNSTREAM_OF] --> origin. While this provides the same level of semantic context, retrieving a node with the UPSTREAM_TO relationship would actually show up in the downstream tab in the UI which could be slightly confusing.

Above all, I think that consistency here is probably the most important and would love to get input from a broader audience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong feelings, but my 2c: the uplink (down -> up) would be HAS_UPSTREAM, but the downlink would be DOWNSTREAM_TO, because the downstream link is causal, but the upstream link is inferred/denormalized (due to the downstream). https://neo4j.com/developer/guide-data-modeling/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read the relationship with the conventions proposed the semantic relationships don't seem to resonate with me, to me DOWNSTREAM_TO reads as if it is downstream, rather than the relationship created is downstream to the source. I put it into a quick picture just to baseline some of the options:

lineage

I'm happy to switch it up since consistency is really the key aspect here - I'll post this to the OSS chat to see if we can get some additional perspectives / votes.

Signed-off-by: Grant Seward <grant@stemma.ai>
Signed-off-by: Grant Seward <grant@stemma.ai>
Copy link
Contributor

@dorianj dorianj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Note that this is blocked on amundsen-io/rfcs#32



class TableLineage(GraphSerializable):
class BaseLineage(GraphSerializable):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

"""

def __init__(self,
table_key: str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing this key (vs properties) seems better indeed 👍

"""
LABEL = 'Lineage'
ORIGIN_DEPENDENCY_RELATION_TYPE = 'UPSTREAM'
DEPENDENCY_ORIGIN_RELATION_TYPE = 'DOWNSTREAM'
ORIGIN_DEPENDENCY_RELATION_TYPE = 'HAS_DOWNSTREAM'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong feelings, but my 2c: the uplink (down -> up) would be HAS_UPSTREAM, but the downlink would be DOWNSTREAM_TO, because the downstream link is causal, but the upstream link is inferred/denormalized (due to the downstream). https://neo4j.com/developer/guide-data-modeling/

@dorianj
Copy link
Contributor

dorianj commented Apr 16, 2021

amundsen-io/rfcs#32 has landed -- please ping when the relationship label thing is resolved, and the conflict is fixed and I'll land

… table-lineage-csv-ingest

Signed-off-by: Grant Seward <grant@stemma.ai>

# Conflicts:
#	tests/unit/extractor/test_csv_extractor.py
Signed-off-by: Grant Seward <grant@stemma.ai>
@sewardgw
Copy link
Contributor Author

@dorianj - I would consider the relationship label as "resolved" unless you really want to change it. IMHO, the values @verdan and I came up with originally feel the most intuitive (see the simple graph examples in the image above) and, without a clear majority in any particular direction from the community, I would be inclined to keep it as is.

@dorianj dorianj merged commit f5e6ba4 into amundsen-io:master Apr 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants