Udacity Data Engineering Nanodegree

Projects completed in the Udacity Data Engineering Nanodegree (DEND) program. The underlying theme of the projects is to build ETL data pipelines for an imaginary music streaming startup, using PostgreSQL, Apache Cassandra, Amazon Redshift, PySpark, and Apache Airflow.

Developed a PostgreSQL relational database and built an Extract-Transform-Load (ETL) pipeline for a music streaming startup. Tasks include:

  • Created fact and dimension tables in PostgreSQL following the star schema
  • Normalized the tables
  • Constructed an ETL pipeline to facilitate queries on which songs users listen to

Proficiencies used: Python pandas, PostgreSQL, Star schema, Database normalization
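
A minimal sketch of the kind of star-schema DDL this project builds, using psycopg2. The database, table, and column names below are illustrative assumptions, not necessarily the project's exact schema:

```python
# Sketch of a star schema: one fact table plus dimension tables.
# Database/table/column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Fact table: one row per song-play event.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        level       VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")

# Dimension table: users (analogous tables would exist for songs, artists, and time).
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
""")

conn.commit()
conn.close()
```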

Created a NoSQL database using Apache Cassandra to facilitate a few specific queries on users and songs played in the music app. Tasks include:

  • Developed denormalized tables in Apache Cassandra optimized for a set of queries and business needs

Proficiencies used: NoSQL database, Apache Cassandra, Database denormalization
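
A rough sketch of the query-first table design in Apache Cassandra, using the DataStax Python driver. The keyspace, table, columns, and example query are assumptions for illustration:

```python
# Sketch of a denormalized, query-first Cassandra table
# (keyspace/table/column names are illustrative assumptions).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music_app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("music_app")

# Table modeled around one query: "artist, song, and length for a given
# session_id and item_in_session" -- so those two columns form the primary
# key (partition key + clustering column).
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song_title      TEXT,
        song_length     FLOAT,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

row = session.execute(
    "SELECT artist, song_title, song_length FROM songs_by_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
).one()
if row:
    print(row.artist, row.song_title, row.song_length)

cluster.shutdown()
```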

Created a data warehouse using Amazon Redshift. Tasks include:

  • Created an AWS Redshift cluster, IAM roles, and security groups
  • Developed a star schema database in Redshift optimized for analyzing the songs users played
  • Built an ETL Pipeline in Python and SQL that copies data from S3 buckets into staging tables in Redshift

Technologies used: Python, Amazon Redshift, SQL, PostgreSQL
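
A sketch of the S3-to-Redshift staging load via a COPY command issued through psycopg2. The cluster endpoint, bucket, staging table, and IAM role ARN are placeholders, not the project's actual resources:

```python
# Sketch of bulk-loading raw JSON events from S3 into a Redshift staging table.
# Endpoint, bucket, table, and role ARN are placeholder assumptions.
import os
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    dbname="dev",
    user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],
    port=5439,
)
cur = conn.cursor()

# COPY pulls data from S3 into a staging table; transformation into the
# star schema happens in a later INSERT ... SELECT step.
cur.execute("""
    COPY staging_events
    FROM 's3://example-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")
conn.commit()
conn.close()
```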

Enabled distributed computing in the ETL pipeline in project 3 by utilizing PySpark and moving the data to a data lake. Tasks include:

  • Extracted data from S3, processed it with Spark, and loaded it back into a data lake on S3

Technologies used: PySpark, S3
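
A minimal sketch of the Spark-based data-lake ETL: read raw JSON from S3, shape it into an analytics table, and write it back to S3 as partitioned Parquet. Bucket paths and the songs table columns are illustrative assumptions:

```python
# Sketch of a data-lake ETL step with PySpark.
# Bucket paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-etl").getOrCreate()

# Raw song metadata lands in S3 as JSON.
song_data = spark.read.json("s3a://example-bucket/song_data/*/*/*/*.json")

# Carve out a dimension table and de-duplicate it.
songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Write to the data lake, partitioned for downstream query pruning.
(
    songs_table
    .write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://example-output-bucket/songs/")
)

spark.stop()
```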

Automated the creation of the data warehouse, the ETL process, and data monitoring tasks using Apache Airflow:

  • Automated ETL pipelines with Airflow, Python, and Amazon Redshift
  • Wrote custom operators to perform tasks such as creating tables, staging data, loading data, and running data quality checks
  • Transformed data from various sources into a star schema optimized for the analytics team's use cases

Technologies used: Apache Airflow, S3, Amazon Redshift.
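
A small sketch of an Airflow DAG with a custom data-quality operator, assuming Airflow 2.x with the Postgres provider installed. The connection ID, table name, and schedule are illustrative assumptions rather than the project's actual configuration:

```python
# Sketch of a custom Airflow operator plus a DAG that uses it.
# Assumes Airflow 2.x and the Postgres provider; names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if a target table in Redshift is empty."""

    def __init__(self, redshift_conn_id, table, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info("Data quality check passed for %s", self.table)


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    check_songplays = DataQualityOperator(
        task_id="check_songplays",
        redshift_conn_id="redshift",  # placeholder Airflow connection ID
        table="songplays",
    )
```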

README template reference: Daniel Diamond of Pymetrics
