Skip to content

dit-x/data_modeling_with_postgres

Repository files navigation

Table of Contents

  1. Project Description.
  2. Project Objective.
  3. Project Development
  4. References

📌Project Description

Project source - udacity


This end-to-end data modeling project entails designing and implementing a data warehouse solution for a music streaming industry. The core responsibility is to model an analytical database for the data analytics team interested in understanding users' activity to improve business decisions. The current data available is operational data, JSON metadata files on the songs in the app.


✅ Project Objective

The project objective includes

  1. The design of an optimized data ingestion pipeline architecture
  2. The design of the data warehouse ERD (Entity Relationship Diagram) for the data analytics team.
  3. The implementation of Data Extraction, Transformation and Loading (ETL).

🛠Project Development

Data Pipeline Architecture

for architecture design - https://app.diagrams.net/

Data-model-ETL

The architecture is purposely designed to achieve the project objective:

  • Data Source: User log and song data are archived in file directories.
  • Processing: The data are loaded into pandas for data extraction and transformation
  • Ingestion: Since the DWH is in Postgres, SQL insert commands together with python are used to load the data in DWH because of SQL readability and simplicity across the data team.
  • Storage: Data is stored in PostgreSQL for analytical purposes
  • Visualization: Analytics dashboard for business decisions.

ERD (Entity Relationship Diagram) Design

The data model below shows a star schema optimized for queries on the song play analysis utilizing 3 levels of normalization (1NF, 2NF, and 3NF), where appropriate, using the song and log datasets.

Fact Table

songplays — records in log data associated with song plays i.e. records with page NextSong songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users — users in the app user_id, first_name, last_name, gender, level

songs — songs in the music database song_id, title, artist_id, year, duration

artists — artists in the music database artist_id, name, location, latitude, longitude

time — timestamps of records in songplays broken down into the specific unit start_time, hour, day, week, month, year, weekday


Screenshot (148)


Project Template

In addition to the data files, the project workspace includes five files:

  1. test.ipynb displays the first few rows of each table to check the database.
  2. create_tables.py drops and creates the postgres table. This drops (if exists) and create tables. Run at the start of the project
  3. etl.ipynb reads and processes a single file from song_data and log_data and loads the data into the tables. It contains a detailed explanation of the ETL process to the tables.
  4. etl.py reads and processes files from song_data and log_data and loads them into the tables.
  5. sql_queries.py contains all the SQL queries used to perform all SQL command to ingest data to DB

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published