ETL Data Pipelines

  • Author: Amelia Tang

This is the GitHub repository to document various ETL data pipelines I designed for different projects.

What is ETL?

Extract, Transform and Load (ETL) is a fundamental framework for streamlining data processing workflows. ETL pipelines facilitate the efficient extraction of data from diverse sources, transformation into a usable format, and loading into designated destinations for analysis.

Extract

This step involves gathering data from various sources such as databases, APIs, or files.
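As a minimal sketch of the extract step, the snippet below reads rows from a CSV source into plain dictionaries. The data and field names are hypothetical, standing in for whatever database, API, or file a real pipeline pulls from.

```python
import csv
import io

# Hypothetical raw export; in a real pipeline this would come from a
# database query, an API response, or a file on disk.
RAW_CSV = """listing_id,title,price
101,Vintage Camera,49.99
102,Mountain Bike,310.00
"""

def extract(source) -> list[dict]:
    """Read rows from a CSV text source into a list of dicts."""
    return list(csv.DictReader(source))

rows = extract(io.StringIO(RAW_CSV))
print(rows[0]["title"])  # Vintage Camera
```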

Transform

In this step, the extracted data is cleaned, validated, and transformed into a consistent format suitable for analysis.
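A small illustrative transform, assuming the extracted rows are dicts with `title` and `price` fields (both names hypothetical): it strips whitespace, casts prices to floats, and drops rows that fail validation.

```python
def transform(rows: list[dict]) -> list[dict]:
    """Clean and validate extracted rows: trim whitespace, cast price
    to float, and drop rows missing required fields."""
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        try:
            price = float(row.get("price", ""))
        except ValueError:
            continue  # drop rows with an unparseable price
        if title:
            cleaned.append({"title": title, "price": price})
    return cleaned

raw = [
    {"title": " Vintage Camera ", "price": "49.99"},
    {"title": "", "price": "10"},        # missing title -> dropped
    {"title": "Bike", "price": "n/a"},   # bad price -> dropped
]
print(transform(raw))  # [{'title': 'Vintage Camera', 'price': 49.99}]
```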

Load

The transformed data is loaded into a target database or data warehouse, where it can be stored and accessed for further analysis or reporting.
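To keep the load step concrete without standing up a warehouse, here is a sketch that writes transformed rows into an in-memory SQLite table; the table name and schema are assumptions for illustration only.

```python
import sqlite3

def load(rows: list[dict], conn: sqlite3.Connection) -> None:
    """Insert transformed rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (title TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO listings VALUES (:title, :price)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load([{"title": "Vintage Camera", "price": 49.99}], conn)
count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(count)  # 1
```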

Project 1 (Python BeautifulSoup, AWS EC2, S3, Glue and Athena)

ETL Diagrams

Extracted public listing data from eBay.com using a Python script, transformed the data in AWS Glue, and loaded the transformed data into AWS Athena for further analysis.
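The scraping side of the extract step can be sketched with BeautifulSoup as below. The markup and CSS classes here are invented for illustration; real eBay pages use a different, more complex structure, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical listing markup, standing in for a fetched page.
HTML = """
<ul>
  <li class="listing"><span class="title">Vintage Camera</span>
      <span class="price">$49.99</span></li>
  <li class="listing"><span class="title">Mountain Bike</span>
      <span class="price">$310.00</span></li>
</ul>
"""

def parse_listings(html: str) -> list[dict]:
    """Pull title/price pairs out of listing markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": item.select_one(".title").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        }
        for item in soup.select("li.listing")
    ]

listings = parse_listings(HTML)
print(listings[0])  # {'title': 'Vintage Camera', 'price': '$49.99'}
```

In the pipeline itself, the parsed records would be written to S3 as CSV or JSON, where the AWS Glue job picks them up.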

ETL Data Pipeline Implementation on AWS

To demonstrate the implementation of the ETL data pipelines on AWS, I have written blog posts on Medium.com documenting the process.

Project 2 (Apache Airflow, Python and PostgreSQL)

TBU

Project 3 (Databricks)

TBU