GitHub - bernhard-42/Spark-ETL-Atlas: A small project to show how to add lineage to Atlas when using Spark as ETL tool

Spark as ETL

If we use Spark as an ETL tool, no lineage information is written to Atlas.

Assume we want to combine three Hive tables into a single denormalized flat table. The Spark code could look like this (more details in spark-etl.scala):

import org.apache.spark.sql.functions.{concat, lit}

// Load the three source Hive tables
val employees = sqlContext.sql("select * from employees.employees")
val departments = sqlContext.sql("select * from employees.departments")
val dept_emp = sqlContext.sql("select * from employees.dept_emp")

// Build full_name, then join through dept_emp to attach the department columns
val flat = employees.withColumn("full_name", concat(employees("last_name"), lit(", "), employees("first_name"))).
                     select("full_name", "emp_no").
                     join(dept_emp, "emp_no").
                     join(departments, "dept_no")

// tempTable must hold the temporary table name; its definition is not shown in this excerpt
flat.registerTempTable(tempTable)
sqlContext.sql(s"create table default.employees_flat3 stored as ORC as select * from ${tempTable}")

Adding Lineage

From a lineage perspective, "employees_flat3" is derived from the other three tables.

The Jupyter notebook StoreSparkLineage.ipynb shows how to add this information to Atlas using the REST API.
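To make the idea concrete, here is a minimal sketch of the kind of JSON body such a REST call could carry. It is not taken from the notebook, which may use a different Atlas API version or a custom process type. The sketch assumes the Atlas v2 entity endpoint (`POST /api/atlas/v2/entity`), the built-in `Process` type, and Hive tables already registered under qualified names of the form `db.table@cluster`. The cluster name `cl1`, the process name `spark-etl-employees_flat3`, and the object and helper names are hypothetical.

```scala
// Sketch only: builds the JSON body for an Atlas "Process" entity that links
// the three source tables to employees_flat3. Values marked "assumed" or
// "hypothetical" are not taken from the repository.
object SparkLineage {
  // Atlas identifies Hive tables by a qualifiedName "<db>.<table>@<cluster>".
  def qualifiedName(db: String, table: String, cluster: String): String =
    s"$db.$table@$cluster"

  // Reference to an existing hive_table entity via its unique attribute.
  def tableRef(qName: String): String =
    s"""{"typeName": "hive_table", "uniqueAttributes": {"qualifiedName": "$qName"}}"""

  // Entity of the built-in Process type; inputs and outputs carry the lineage.
  def processEntity(name: String, inputs: Seq[String], outputs: Seq[String]): String = {
    val in = inputs.map(tableRef).mkString("[", ", ", "]")
    val out = outputs.map(tableRef).mkString("[", ", ", "]")
    s"""{"entity": {"typeName": "Process", "attributes": {""" +
      s""""qualifiedName": "$name", "name": "$name", """ +
      s""""inputs": $in, "outputs": $out}}}"""
  }

  val cluster = "cl1" // assumed cluster name

  // The three source tables read by the Spark job.
  val sources: Seq[String] = Seq("employees", "departments", "dept_emp")
    .map(t => qualifiedName("employees", t, cluster))

  // The flat table written by the Spark job.
  val target: Seq[String] = Seq(qualifiedName("default", "employees_flat3", cluster))

  // "spark-etl-employees_flat3" is a hypothetical process name.
  val payload: String = processEntity("spark-etl-employees_flat3", sources, target)
}
```

Sending `SparkLineage.payload` to the Atlas server with any HTTP client (with authentication as required by the installation) would register the process, after which Atlas can draw the lineage from the three inputs to the output table.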

Lineage Graph

Atlas view of the Spark Processor

![Spark Process in Atlas.png](Spark Process in Atlas.png)

