Multi-Stage SQL-Based ETL Processing Framework Written in PySpark
process_sql_statements.py is a PySpark application which reads its configuration from a YAML document (see config.yml in this project). The configuration specifies a set of input sources in the sources section - these are table objects available from the catalog of the current SparkSession (for instance an AWS Glue Catalog). The transforms section is a list of transformations written as SQL statements using temporary views in Spark SQL; this is akin to using CTEs (common table expressions) or volatile tables when performing typical multi-stage, complex ETL routines on traditional relational database systems. The targets section defines the location to write the final output object to.
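The sketch below illustrates the three sections described above. All table names, column names, and keys shown here are hypothetical stand-ins; see config.yml in this project for the actual schema.

```yaml
# Hypothetical config sketch - consult config.yml for the real key names.
sources:
  # Tables resolved from the current SparkSession catalog (e.g. AWS Glue)
  - table: tickit.sales
    alias: sales
  - table: tickit.event
    alias: event

transforms:
  # Each stage is a SQL statement materialised as a temporary view,
  # so later stages can reference earlier ones (like chained CTEs)
  - name: sales_by_event
    sql: >
      SELECT e.eventname, SUM(s.pricepaid) AS revenue
      FROM sales s JOIN event e ON s.eventid = e.eventid
      GROUP BY e.eventname

targets:
  # Where the final output object is written
  - source: sales_by_event
    path: s3://my-bucket/output/sales_by_event/
    format: parquet
```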
The sample configuration uses the framework (process_sql_statements.py) to process a multi-stage SQL ETL routine against data from the AWS sample Tickit database, which has been stored as S3 objects and catalogued using Hive/AWS Glue.
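For orientation, the core loop of such a framework might look like the following minimal sketch. This is not the code in process_sql_statements.py; it assumes the hypothetical config keys from the sketch above and a SparkSession whose catalog (e.g. Hive/AWS Glue) can resolve the source tables.

```python
import sys

import yaml  # PyYAML
from pyspark.sql import SparkSession


def main(config_path: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Expose each catalog table under its alias as a temporary view
    for source in config["sources"]:
        spark.table(source["table"]).createOrReplaceTempView(source["alias"])

    # Run each transform stage; registering the result as a temp view
    # lets subsequent stages build on it, like chained CTEs
    for transform in config["transforms"]:
        spark.sql(transform["sql"]).createOrReplaceTempView(transform["name"])

    # Write each final view out to its target location
    for target in config["targets"]:
        (spark.table(target["source"])
              .write.format(target["format"])
              .mode("overwrite")
              .save(target["path"]))


if __name__ == "__main__":
    main(sys.argv[1])
```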
Modify the config.yml file to specify your targets, projections, filters, and transformations, and run as follows:
spark-submit process_sql_statements.py config.yml
Requires Spark 2.x.