A fictional startup Sparkify in the music streaming industry looks to provide a great service to the community that enjoys its music streaming app. Sparkify holds a collection of event data and metadata (logs in JSON format) output from the activity of its users. As data engineers we have been asked to perform the necessary data wrangling and analytics in order to support data teams (scientists and analysts) needs, discovering the insights (i.e. behaviours, interactions) between users and the music. For this job we have adopted Postgres as the database platform, built a data model under the star schema (dimensions and fact tables), and built an ETL pipeline to extract-transform-load data from the local files to a sparkifydb (Postgres) database.
sparkify-postgres
| .gitignore # Config file for Git
| README.md # Repository description
| analytics.ipynb # Sample queries to run on sparkifydb
| create_tables.py # Creates sparkifydb tables (drops content if exists)
| etl.py # Runs ETL pipeline to process data
| requirements.txt # Contains libraries needed to run scripts
| sql_queries.py # Queries to build sparkifydb tables
| test_etl.ipynb # Tests pipeline operations
| test_sql_queries.ipynb # Tests queries on sparkifydb
| transaction_postgres.py # Defines connection & methods to sparkifydb
|
└--graphics
| docker_hub.png # Postgres image reference in Docker Hub
| erd_sparkifydb.png # Sparkifydb star schema (Postgres)
| logo.png # Sparkify logo
Clone this project, and to up and running it locally you can use Docker, because in its repository/hub there is an image with Postgres installed, database authentication, and the Sparkify raw data to be manipulated (image ref: postgres-student-image, thanks to user onekenken). For the sake of this project the data and metadata under munging is part of the Million Song Dataset1.
- Install Docker
- Login Docker Hub
- Pull image: postgres-student-image
- Run container: postgres-student-image
- On terminal run:
python3 create_tables.py
python3 etl.py
- Stop container (& Remove container)