GitHub - devgoalposts/Udacity-Data-Warehouse

ETL Pipeline using S3 and Amazon Redshift using JSON

Take logs from a music streaming app and contextualize that data with a subset of the Million Song Dataset.

Motivation

Learn how to create an ETL pipeline using data stored in the cloud. This involved taking data from AWS S3, loading it into staging tables in AWS Redshift, then transforming the data from the staging tables into the appropriate tables within the designed schema.

Schema

Tests

[✅] Run python3 etl.py without getting any errors
- ~7 min runtime on a 2-node cluster

Explanation of the files in the repository

sql_queries.py is where the core code of this project resides. It includes the queries that create the tables, copy the data from S3 into Redshift, and the SQL insert queries that transform the data within Redshift. The file also organizes the queries into lists so they can easily be run together.
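As a rough illustration of that layout, here is a minimal sketch of how a sql_queries.py-style module might group its queries. The table names, columns, S3 path, and IAM role ARN are all placeholders invented for this example and are not taken from the actual file.

```python
# Hypothetical sketch of a sql_queries.py-style module.
# Every table, column, S3 path, and role ARN here is a placeholder.

S3_LOG_DATA = "s3://example-bucket/log_data"  # placeholder bucket
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-role"  # placeholder

# Queries that drop and create the tables
staging_events_table_drop = "DROP TABLE IF EXISTS staging_events;"
artists_table_drop = "DROP TABLE IF EXISTS artists;"

staging_events_table_create = """
CREATE TABLE IF NOT EXISTS staging_events (
    artist VARCHAR,
    song   VARCHAR,
    ts     BIGINT
);
"""

artists_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    name VARCHAR
);
"""

# Query that copies JSON data from S3 into the staging table
staging_events_copy = f"""
COPY staging_events
FROM '{S3_LOG_DATA}'
IAM_ROLE '{IAM_ROLE_ARN}'
FORMAT AS JSON 'auto';
"""

# Query that transforms staged rows into the final schema
artists_table_insert = """
INSERT INTO artists (name)
SELECT DISTINCT artist
FROM staging_events
WHERE artist IS NOT NULL;
"""

# Lists so a caller can run each group of queries together
drop_table_queries = [staging_events_table_drop, artists_table_drop]
create_table_queries = [staging_events_table_create, artists_table_create]
copy_table_queries = [staging_events_copy]
insert_table_queries = [artists_table_insert]
```

Grouping the queries into lists keeps the calling script small: it only needs to loop over each list rather than know about individual tables.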

etl.py is where all the required SQL queries from sql_queries.py get called in order to get the data from S3 into the correct schema inside Redshift.
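The control flow described above can be sketched roughly as follows. This is not the repository's etl.py: to keep the example runnable without a cluster, it uses an in-memory SQLite database in place of Redshift, and a plain INSERT stands in for the Redshift COPY from S3. The helper name run_queries and all table names and rows are assumptions made for illustration.

```python
import sqlite3


def run_queries(cur, conn, queries):
    """Execute each query in list order, committing after each one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


# Stand-in query lists; in the real project these would be imported
# from sql_queries.py and run against Redshift.
create_table_queries = [
    "CREATE TABLE staging_events (artist TEXT, song TEXT)",
    "CREATE TABLE artists (name TEXT)",
]
copy_table_queries = [
    # Placeholder for a Redshift COPY that loads JSON files from S3
    "INSERT INTO staging_events VALUES "
    "('Artist A', 'Song 1'), ('Artist A', 'Song 2'), ('Artist B', 'Song 3')",
]
insert_table_queries = [
    "INSERT INTO artists (name) SELECT DISTINCT artist FROM staging_events",
]


def main():
    # Assumption: SQLite replaces the Redshift connection for this sketch
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    run_queries(cur, conn, create_table_queries)  # 1. create tables
    run_queries(cur, conn, copy_table_queries)    # 2. load staging tables
    run_queries(cur, conn, insert_table_queries)  # 3. transform into schema

    cur.execute("SELECT name FROM artists ORDER BY name")
    names = [row[0] for row in cur.fetchall()]
    conn.close()
    return names


if __name__ == "__main__":
    print(main())
```

The order matters: the staging tables must be loaded before the insert queries run, since the inserts select from the staged rows.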

The purpose of the Python notebooks is to help me understand the data I am working with, so I can run small ETL experiments that do not require iterating over all the files in the S3 bucket. I also used them to document my thinking process while coming up with these queries, for example by writing select queries separately before writing the insert query.

Folders and files

.ipynb_checkpoints
.gitignore
README.md
RedShiftProject.png
brainstorming_queries.ipynb
connect_redshift.ipynb
create_tables.py
dwh.cfg.tmp
etl.py
rubric_checklist.md
sql_queries.py
troubleshooting_sql_queries.ipynb
