# Staging Data

These are the staging tables:
* staging_songs
* staging_events

Once the ETL job has loaded them, we can analyze the data.

In [35]:
from configparser import ConfigParser
from sqlalchemy import create_engine
import pandas as pd

# Load the connection settings from the project's config file.
config = ConfigParser()
config.read('./../dwh.cfg')

# Connection parameters, read from the [CLUSTER] section of dwh.cfg.
HOST = config.get('CLUSTER', 'HOST')
DB_NAME = config.get('CLUSTER', 'DB_NAME')
DB_USER = config.get('CLUSTER', 'DB_USER')
DB_PASSWORD = config.get('CLUSTER', 'DB_PASSWORD')
DB_PORT = config.get('CLUSTER', 'DB_PORT')

conn_string = f"postgresql://{DB_USER}:{DB_PASSWORD}@{HOST}:{DB_PORT}/{DB_NAME}"
conn = create_engine(conn_string, client_encoding="UTF-8")
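
The connection string above interpolates the credentials directly. If the password could ever contain characters such as `@`, `:` or `/`, the URL would be parsed incorrectly. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library is acceptable here; `build_conn_string` is a hypothetical helper and the example values are placeholders, not values from `dwh.cfg`:

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db_name):
    """Return a postgresql:// URL with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Placeholder values for illustration only; they are not taken from dwh.cfg.
example_url = build_conn_string("awsuser", "p@ss:word/1", "example-host", "5439", "dwh")
print(example_url)
```

With the notebook's variables, the call would be `build_conn_string(DB_USER, DB_PASSWORD, HOST, DB_PORT, DB_NAME)`.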

## 1. Staging Tables Stats

In [36]:
pd.read_sql(f"""
SELECT 'staging_songs' as table_name, count(*) as count from staging_songs
UNION
SELECT 'staging_events' as table_name, count(*) as count from staging_events;
""", con = conn)

| | table_name | count |
|---|---|---|
| 0 | staging_songs | 385252 |
| 1 | staging_events | 8056 |


## 2. Duplicate Record Check

Rows in the two staging tables are matched on the following columns:

* staging_events.artist   <-> staging_songs.artist_name
* staging_events.song     <-> staging_songs.title
* staging_events.length   <-> staging_songs.duration
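
The three criteria above can be read as a single join condition. The following is a rough sketch of that join, run against an in-memory SQLite database as a stand-in for the cluster so that it is self-contained. The reduced table schemas and all rows are made-up toy data, not values from the real staging tables:

```python
import sqlite3

# In-memory SQLite stands in for the cluster database (assumption for this sketch).
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE staging_events (artist TEXT, song TEXT, length REAL, page TEXT);
CREATE TABLE staging_songs (song_id TEXT, artist_id TEXT,
                            artist_name TEXT, title TEXT, duration REAL);
""")

# Toy rows: one event has a matching song, one does not, one is not a song play.
demo.executemany("INSERT INTO staging_events VALUES (?, ?, ?, ?)", [
    ("Toy Artist A", "Toy Song X", 200.5, "NextSong"),
    ("Toy Artist B", "Toy Song Y", 180.0, "NextSong"),
    (None, None, None, "Home"),
])
demo.executemany("INSERT INTO staging_songs VALUES (?, ?, ?, ?, ?)", [
    ("SOTOY1", "ARTOY1", "Toy Artist A", "Toy Song X", 200.5),
])

# Join on all three criteria; LEFT JOIN keeps plays that have no matching song.
matched = demo.execute("""
    SELECT e.artist, e.song, s.song_id, s.artist_id
    FROM staging_events e
    LEFT JOIN staging_songs s
      ON e.artist = s.artist_name
     AND e.song   = s.title
     AND e.length = s.duration
    WHERE e.page = 'NextSong'
    ORDER BY e.artist
""").fetchall()

for row in matched:
    print(row)
```

Note that if `staging_songs` holds several rows with the same artist name, title and duration, this join returns one output row per copy, which is why the duplicate check below matters.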

In [37]:
print("Song playing counts in staging_events:")
pd.read_sql(f"""
SELECT artist, song, length, COUNT(*) as total FROM staging_events
WHERE page = 'NextSong'
GROUP BY artist, song, length
ORDER BY total DESC
LIMIT 5
""", con = conn)

Song playing counts in staging_events:


| | artist | song | length | total |
|---|---|---|---|---|
| 0 | Dwight Yoakam | You're The One | 239.3073 | 37 |
| 1 | BjÃÂ¶rk | Undo | 348.57751 | 28 |
| 2 | Kings Of Leon | Revelry | 201.79546 | 27 |
| 3 | Harmonia | Sehr kosmisch | 655.77751 | 21 |
| 4 | Barry Tuckwell/Academy of St Martin-in-the-Fie... | Horn Concerto No. 4 in E flat K495: II. Romanc... | 277.15873 | 19 |


In [38]:
print("Song counts with the same artist_name, title, duration values in staging_songs:")
pd.read_sql(f"""
SELECT artist_name, title, duration, COUNT(*) as total FROM staging_songs
GROUP BY artist_name, title, duration
ORDER BY total DESC
LIMIT 5
""", con = conn).head(5)

Song counts with the same artist_name, title, duration values in staging_songs:


| | artist_name | title | duration | total |
|---|---|---|---|---|
| 0 | Foo Fighters | The Deepest Blues Are Black | 238.34077 | 3 |
| 1 | The All-American Rejects | Real World | 235.04934 | 3 |
| 2 | Aerosmith | Pink | 235.36281 | 3 |
| 3 | Kings Of Leon | Taper Jean Girl | 185.28608 | 3 |
| 4 | Foo Fighters | Hell | 117.002 | 3 |


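Repeated rows in staging_events are expected, since each row is one play of a song. In staging_songs, however, the same (artist_name, title, duration) combination appears up to three times, so it seems we may have duplicate song data on the match criteria between these two staging tables.
One approach is to select the first row in staging_songs that matches the criteria.
Before settling on that, let's check the other fields of the duplicated staging_songs rows.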
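
As an illustration of the "first match" idea, here is a minimal sketch that keeps one `song_id` per (artist_name, title, duration) combination. It assumes that "first" can be taken to mean the smallest `song_id`, which is only one possible reading. It runs on an in-memory SQLite stand-in with made-up toy rows and a reduced schema:

```python
import sqlite3

# In-memory SQLite stands in for the cluster database (assumption for this sketch).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE staging_songs "
    "(song_id TEXT, artist_name TEXT, title TEXT, duration REAL)"
)

# Toy rows: the first two share the same match key but have different ids.
demo.executemany("INSERT INTO staging_songs VALUES (?, ?, ?, ?)", [
    ("SOTOY2", "Toy Artist", "Toy Song", 200.5),
    ("SOTOY1", "Toy Artist", "Toy Song", 200.5),
    ("SOTOY3", "Other Toy Artist", "Other Toy Song", 180.0),
])

# One row per match key; MIN(song_id) makes the choice deterministic.
first_match = demo.execute("""
    SELECT artist_name, title, duration, MIN(song_id) AS song_id
    FROM staging_songs
    GROUP BY artist_name, title, duration
    ORDER BY artist_name
""").fetchall()

for row in first_match:
    print(row)
```

Grouping on the match key guarantees at most one candidate per key, so a later join on those three columns cannot multiply play events.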

In [39]:
pd.read_sql(f"""
SELECT * FROM staging_songs
WHERE artist_name = 'Foo Fighters' and title = 'The Deepest Blues Are Black' and duration = 238.34077
""", con = conn).head(10)

| | song_id | num_songs | title | artist_name | artist_latitude | year | duration | artist_id | artist_longitude | artist_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |
| 1 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |
| 2 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |


All three rows are identical in every column. Lastly, let's query by song_id = 'SOVSGJB12A8C13F772' to make sure no other song shares the same ID.

In [40]:
pd.read_sql(f"""
SELECT * FROM staging_songs
WHERE song_id='SOVSGJB12A8C13F772'
""", con = conn).head(10)

| | song_id | num_songs | title | artist_name | artist_latitude | year | duration | artist_id | artist_longitude | artist_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |
| 1 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |
| 2 | SOVSGJB12A8C13F772 | 1 | The Deepest Blues Are Black | Foo Fighters | | 2005 | 238.34077 | AR6XPWV1187B9ADAEB | | Seattle, WA |
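
The check above covers a single song. A possible way to extend it to the whole table is to compare the number of fully distinct rows with the number of distinct `song_id` values: if the two are equal, every repeated `song_id` consists of exact copies only. The sketch below demonstrates the idea on an in-memory SQLite stand-in with a reduced schema. The Foo Fighters row mirrors the output above, while the second song is a made-up toy row:

```python
import sqlite3

# In-memory SQLite stands in for the cluster database (assumption for this sketch).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE staging_songs "
    "(song_id TEXT, title TEXT, artist_name TEXT, artist_latitude REAL, duration REAL)"
)

# Three exact copies of the song inspected above, plus one toy row.
rows = [
    ("SOVSGJB12A8C13F772", "The Deepest Blues Are Black", "Foo Fighters", None, 238.34077),
] * 3 + [
    ("SOTOY0000000000001", "Toy Song", "Toy Artist", None, 100.0),
]
demo.executemany("INSERT INTO staging_songs VALUES (?, ?, ?, ?, ?)", rows)

total_rows = demo.execute("SELECT COUNT(*) FROM staging_songs").fetchone()[0]
distinct_rows = demo.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM staging_songs) AS d"
).fetchone()[0]
distinct_ids = demo.execute(
    "SELECT COUNT(DISTINCT song_id) FROM staging_songs"
).fetchone()[0]

# Equal counts mean no song_id maps to two rows that differ in any column.
all_exact = distinct_rows == distinct_ids
print(total_rows, distinct_rows, distinct_ids, all_exact)
```

If the same comparison holds on the real table, `SELECT DISTINCT` would be enough to remove the duplicates; if it does not, some `song_id` values carry conflicting rows and need a closer look.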
