ETL Data Pipelines

  • Author: Amelia Tang

This is the GitHub repository to document various ETL data pipelines I designed for different projects.

What is ETL?

Extract, Transform and Load (ETL) is a fundamental framework for streamlining data processing workflows. ETL pipelines facilitate the efficient extraction of data from diverse sources, transformation into a usable format, and loading into designated destinations for analysis.

Extract

This step involves gathering data from various sources such as databases, APIs, or files.
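As a minimal sketch of the extract step, the snippet below reads rows from a CSV source into plain dictionaries. The data and field names are hypothetical, standing in for whatever database, API, or file a real pipeline pulls from.

```python
import csv
import io

# Hypothetical raw export; in a real pipeline this would come from a
# database query, an API response, or a file on disk.
RAW_CSV = """listing_id,title,price
101,Vintage Camera,49.99
102,Mountain Bike,310.00
"""

def extract(source) -> list[dict]:
    """Read rows from a CSV text source into a list of dicts."""
    return list(csv.DictReader(source))

rows = extract(io.StringIO(RAW_CSV))
print(rows[0]["title"])  # Vintage Camera
```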

Transform

In this step, the extracted data is cleaned, validated, and transformed into a consistent format suitable for analysis.
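A small illustrative transform, assuming the extracted rows are dicts with `title` and `price` fields (both names hypothetical): it strips whitespace, casts prices to floats, and drops rows that fail validation.

```python
def transform(rows: list[dict]) -> list[dict]:
    """Clean and validate extracted rows: trim whitespace, cast price
    to float, and drop rows missing required fields."""
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        try:
            price = float(row.get("price", ""))
        except ValueError:
            continue  # drop rows with an unparseable price
        if title:
            cleaned.append({"title": title, "price": price})
    return cleaned

raw = [
    {"title": " Vintage Camera ", "price": "49.99"},
    {"title": "", "price": "10"},        # missing title -> dropped
    {"title": "Bike", "price": "n/a"},   # bad price -> dropped
]
print(transform(raw))  # [{'title': 'Vintage Camera', 'price': 49.99}]
```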

Load

The transformed data is loaded into a target database or data warehouse, where it can be stored and accessed for further analysis or reporting.
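To keep the load step concrete without standing up a warehouse, here is a sketch that writes transformed rows into an in-memory SQLite table; the table name and schema are assumptions for illustration only.

```python
import sqlite3

def load(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Insert transformed rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (title TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO listings VALUES (:title, :price)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load([{"title": "Vintage Camera", "price": 49.99}], conn)
count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(count)  # 1
```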

Project 1 (Python BeautifulSoup, AWS EC2, S3, Glue and Athena)

ETL Diagrams

Extracted public listing data from eBay.com using a Python script, transformed the data in AWS Glue, and loaded the transformed data into AWS Athena for further analysis.
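The scraping side of the extract step can be sketched with BeautifulSoup as below. The markup and CSS classes here are invented for illustration; real eBay pages use a different, more complex structure, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical listing markup, standing in for a fetched page.
HTML = """
<ul>
  <li class="listing"><span class="title">Vintage Camera</span>
      <span class="price">$49.99</span></li>
  <li class="listing"><span class="title">Mountain Bike</span>
      <span class="price">$310.00</span></li>
</ul>
"""

def parse_listings(html: str) -> list[dict]:
    """Pull title/price pairs out of listing markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": item.select_one(".title").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        }
        for item in soup.select("li.listing")
    ]

listings = parse_listings(HTML)
print(listings[0])  # {'title': 'Vintage Camera', 'price': '$49.99'}
```

In the pipeline itself, the parsed records would be written to S3 as CSV or JSON, where the AWS Glue job picks them up.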

ETL Data Pipeline Implementation on AWS

To demonstrate the implementation of the ETL data pipelines on AWS, I have written blog posts on Medium.com documenting the process.

Project 2 (Apache Airflow, Python and PostgreSQL)

TBU

Project 3 (Databricks)

TBU