Project: Data Modeling with Postgres

A startup named Sparkify wants to analyze user activities using their song and user data. The current data is spread among several JSON files, making it hard to query and analyze.

This project aims to create an ETL pipeline to load song and user data to a Postgres database, making it easier to query and analyze data.

Datasets

Data is currently collected for song and user activities, in two directories: data/log_data and data/song_data, using JSON files.

Song dataset format

{
  "num_songs": 1,
  "artist_id": "ARGSJW91187B9B1D6B",
  "artist_latitude": 35.21962,
  "artist_longitude": -80.01955,
  "artist_location": "North Carolina",
  "artist_name": "JennyAnyKind",
  "song_id": "SOQHXMF12AB0182363",
  "title": "Young Boy Blues",
  "duration": 218.77506,
  "year": 0
}

Log dataset format

{
  "artist": "Survivor",
  "auth": "Logged In",
  "firstName": "Jayden",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Fox",
  "length": 245.36771,
  "level": "free",
  "location": "New Orleans-Metairie, LA",
  "method": "PUT",
  "page": "NextSong",
  "registration": 1541033612796,
  "sessionId": 100,
  "song": "Eye Of The Tiger",
  "status": 200,
  "ts": 1541110994796,
  "userAgent": "\"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
  "userId": "101"
}

Schema

Fact tables

Songplays

Records in log data associated with song plays i.e. records with page set to NextSong.

Column	Type	Nullable
songplay_id	integer	not null
start_time	timestamp without time zone	not null
user_id	integer	not null
level	character varying	not null
song_id	character varying(18)
artist_id	character varying(18)
session_id	integer	not null
location	character varying	not null
user_agent	character varying	not null

Primary key: songplay_id

Dimension tables

Users

Users in the app.

Column	Type	Nullable
user_id	integer	not null
first_name	character varying	not null
last_name	character varying	not null
gender	character(1)	not null
level	character varying	not null

Primary key: user_id

Songs

Songs in music database.

Column	Type	Nullable
song_id	character varying(18)	not null
title	character varying	not null
artist_id	character varying(18)	not null
year	integer	not null
duration	double precision	not null

Primary key: song_id

Artists

Artists in music database.

Column	Type	Nullable
artist_id	character varying(18)	not null
name	character varying	not null
location	character varying	not null
latitude	double precision
longitude	double precision

Primary key: artist_id

Time

Timestamps of records in songplays broken down into specific units.

Column	Type	Nullable
start_time	timestamp without time zone	not null
hour	integer	not null
day	integer	not null
week	integer	not null
month	integer	not null
year	integer	not null
weekday	integer	not null

Build

Pre-requisites:

Python 3
pipenv
pyenv (optional)
PostgreSQL Database

To install project python dependencies, you should run:

pipenv install

Database

The database can be installed locally or ran using Docker, which is the preferred method.

To use docker to run Postgres, you should run:

docker run --net=host --name postgres -e POSTGRES_PASSWORD=your_password -d postgres

Access and user setup

To initially access the database, you should run:

psql -h localhost -U postgres

You should run the following commands under psql to setup user access to Postgres and create the initial sparkifydb database:

CREATE ROLE student WITH ENCRYPTED PASSWORD 'student';
ALTER ROLE student WITH LOGIN;
ALTER ROLE student CREATEDB;
CREATE DATABASE sparkifydb OWNER student;

Running

To run the project locally, use pipenv to activate the virtual environment:

pipenv shell

And run the scripts to create database tables:

./create_tables.py

and populate data into tables:

./etl.py

Data can be verified using the provided test.ipynb jupyter notebook:

jupyter notebook

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
data		data
.gitignore		.gitignore
.python-version		.python-version
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
create_tables.py		create_tables.py
db.py		db.py
etl.ipynb		etl.ipynb
etl.py		etl.py
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project: Data Modeling with Postgres

Datasets

Song dataset format

Log dataset format

Schema

Fact tables

Songplays

Dimension tables

Users

Songs

Artists

Time

Build

Database

Access and user setup

Running

About

Releases

Packages

Contributors 2

Languages

danielmt/udacity-dataeng-project1

Folders and files

Latest commit

History

Repository files navigation

Project: Data Modeling with Postgres

Datasets

Song dataset format

Log dataset format

Schema

Fact tables

Songplays

Dimension tables

Users

Songs

Artists

Time

Build

Database

Access and user setup

Running

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages