@vladhlinsky
Contributor

What changes were proposed in this pull request?

Create the spark_column_lineage type and relationship definitions to add support for column-level lineage for CREATE TABLE AS SELECT ... statements and views. Column-level lineage is lineage recorded between the input columns and the output columns of a statement.
For example:

hive > create table employee_ctas as select id from employee;

For the above query, lineage is created from employee to employee_ctas, and also from employee.id to employee_ctas.id.
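
To make the two levels of lineage concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not code from Atlas or the Spark Atlas Connector: the names `LineageEdge` and `ctas_lineage` are hypothetical, and it assumes every selected column keeps its name in the target table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One directed lineage link from a source to a target (hypothetical model)."""
    source: str
    target: str
    level: str  # "table" or "column"


def ctas_lineage(source_table, target_table, columns):
    """Return lineage edges for CREATE TABLE target AS SELECT columns FROM source.

    Assumption: each selected column keeps its name in the target table.
    """
    # One table-level edge, then one column-level edge per selected column.
    edges = [LineageEdge(source_table, target_table, "table")]
    for col in columns:
        edges.append(
            LineageEdge(f"{source_table}.{col}", f"{target_table}.{col}", "column")
        )
    return edges


edges = ctas_lineage("employee", "employee_ctas", ["id"])
for edge in edges:
    print(f"{edge.level}: {edge.source} -> {edge.target}")
```

For the query above, this yields one table-level edge (employee to employee_ctas) and one column-level edge (employee.id to employee_ctas.id).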

How was this patch tested?

Manually, using a modified version of the Spark Atlas Connector:

  • Installed and started Atlas.
  • Updated 1100-spark_model.json with the proposed changes and restarted Atlas.
  • Executed the following statements using spark-shell:
spark.sql("create table sparkemployee_1_2(id int,name string)");
spark.sql("create table sparkemployee_ctas_1_2 as select id from sparkemployee_1_2");
  • Verified that each table has column entities and that a spark_column_lineage entity is created.
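
The verification in the last step could also be scripted rather than checked in the UI. The sketch below is an assumption-laden example, not part of this patch: it uses the Atlas v2 basic search endpoint (`/api/atlas/v2/search/basic` with a `typeName` parameter) as I understand it, and the base URL, user, and password are placeholders for your own installation. The helper names `search_url` and `find_entities` are hypothetical.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def search_url(base_url, type_name, limit=25):
    """Build a basic-search URL for all entities of one type."""
    query = urlencode({"typeName": type_name, "limit": limit})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{query}"


def find_entities(base_url, type_name, user, password):
    """Return the entity headers found for type_name (empty list if none).

    Assumption: the server accepts HTTP basic authentication.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = Request(
        search_url(base_url, type_name),
        headers={"Authorization": f"Basic {token}"},
    )
    with urlopen(request) as response:
        # The "entities" key may be absent when nothing matches.
        return json.load(response).get("entities", [])


# Example (needs a running Atlas; URL and credentials are placeholders):
# lineage = find_entities("http://localhost:21000", "spark_column_lineage", "admin", "admin")
# print(len(lineage), "spark_column_lineage entities found")
```

Only `find_entities` touches the network, so the URL construction can be checked without a running server.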

@vladhlinsky
Contributor Author

Attaching screenshots.
Screenshot from 2020-03-11 23-39-45
Screenshot from 2020-03-10 21-51-16
Screenshot from 2020-03-10 21-51-26
Screenshot from 2020-03-10 21-51-47
Screenshot from 2020-03-11 23-40-34

@vladhlinsky
Contributor Author

cc @HeartSaVioR @sarathsubramanian

Contributor

@sarathsubramanian sarathsubramanian left a comment

Changes look good. +1. Thanks @vladhlinsky.

Contributor

@HeartSaVioR HeartSaVioR left a comment

LGTM

@nixonrodrigues
Collaborator

+1 for the PR. Thanks, @vladhlinsky.

@nixonrodrigues nixonrodrigues merged commit 9d6f1f6 into apache:master Mar 16, 2020
asfgit pushed a commit that referenced this pull request Mar 16, 2020
…inition to add support of column level lineage (#93)

(cherry picked from commit 9d6f1f6)
@lyyprean

Which version of spark-atlas-connector was used?

@lyyprean

What version does spark-atlas-connector use?

@CavalierHE

What version does spark-atlas-connector use?

@pPanda-beta

@lyyprean @CavalierHE

I think Cloudera has not open-sourced it yet and is completely hiding the implementation of the column lineage harvester!

@CavalierHE

> @lyyprean @CavalierHE
>
> I think cloudera has not open sourced it yet, and completely hiding the implementation of column lineage harvester!

Maybe. How about Apache?
