### Milestone 10

The first query shows the artists that are present in both datasets. It is particularly useful as a view for quickly querying those shared artists. MusicBrainz stores individual artists and bands in one table (Artist_Beam_DF), while Discogs has separate entities for bands and individual artists. A view over this query lets us query both datasets without repeating multiple joins.
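
The matching rule can be sketched in plain Python as a rough illustration. This is not the query itself: the function name `shared_artist_names` and the toy rows (including the sort names) are made up for the example. It only assumes that a Discogs name counts as shared when it equals either the MusicBrainz `artist_name` or `sort_name`.

```python
def shared_artist_names(discogs_names, musicbrainz_artists):
    """Return Discogs names that match a MusicBrainz artist_name or sort_name.

    discogs_names: names from the Discogs Artist and Band tables combined.
    musicbrainz_artists: (artist_name, sort_name) pairs (hypothetical toy rows).
    """
    known = set()
    for artist_name, sort_name in musicbrainz_artists:
        known.add(artist_name)
        known.add(sort_name)
    # A set comprehension removes duplicate names, like "select distinct".
    return sorted({name for name in discogs_names if name in known})


# Toy data, invented for illustration only.
discogs = ["Terre Thaemlitz", "Beatles, The", "Simo Soo"]
musicbrainz = [
    ("Terre Thaemlitz", "Thaemlitz, Terre"),
    ("The Beatles", "Beatles, The"),
]
shared = shared_artist_names(discogs, musicbrainz)
print(shared)
```

In this toy input one name matches on `artist_name` and one on `sort_name`, while the third has no MusicBrainz counterpart and is dropped.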

In [17]:
%%bigquery
select distinct d.name discogs_name from discogs_modeled.Artist d
join musicbrainz_modeled.Artist_Beam_DF m 
on m.artist_name = d.name or m.sort_name = d.name
group by discogs_name
union all
select distinct b.name discogs_name from discogs_modeled.Band b
join musicbrainz_modeled.Artist_Beam_DF a 
on a.artist_name = b.name or a.sort_name = b.name
limit 10

Unnamed: 0,discogs_name
0,Denise Karen Dyson
1,Tim Costello
2,Wayne Rogers
3,Sean Armstrong
4,Terre Thaemlitz
5,David Brown
6,Felix
7,Simo Soo
8,Michael Phillips
9,Liam Jason


For visualization purposes, we would like to quantify how much data each dataset holds and how much the two share. The view below classifies each artist as MusicBrainz-only, Discogs-only, or shared; the approach is explained in detail with the third query, which applies the same technique to labels.

In [1]:
%%bigquery
create or replace view reporting.v_classified_artists as 
(
    select mA.artist_name as artist_name, "MusicBrainz Artists Only" as domain,
    from `earnest-keep-266820.musicbrainz_modeled.Artist_Beam_DF` as mA
    where mA.artist_id not in
    (
      select a.artist_id as mID,
      from `earnest-keep-266820.discogs_modeled.Artist` as d
      join `earnest-keep-266820.musicbrainz_modeled.Artist_Beam_DF` as a
      on d.name = a.artist_name
    )
    group by artist_name

    union all

    select dA.name as artist_name, "Discogs Artists Only" as domain,
    from `earnest-keep-266820.discogs_modeled.Artist` as dA
    where dA.id not in
    (
      select d.id as dID,
      from `earnest-keep-266820.discogs_modeled.Artist` as d
      join `earnest-keep-266820.musicbrainz_modeled.Artist_Beam_DF` as m
      on d.name = m.artist_name
    )
    group by artist_name

    union all

    select d.name as artist_name, "Shared Artists" as domain,
    from `earnest-keep-266820.discogs_modeled.Artist` as d
    join `earnest-keep-266820.musicbrainz_modeled.Artist_Beam_DF` as m
    on d.name = m.artist_name
    group by artist_name
)

The second query correlates artists and labels with URLs in the MusicBrainz dataset. Both Discogs and MusicBrainz store URLs, but only the Discogs dataset ties each URL to the artist or label it references. Joining the two on the URL string lets us enrich the MusicBrainz URL table with that information.
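
As a minimal sketch of this enrichment, the join can be mimicked with a dictionary lookup keyed on the URL string. The function name `link_urls` and the tuple layouts are assumptions made for the example; the matched sample URLs and ids are taken from the query output in this notebook, and the unmatched URL is invented.

```python
def link_urls(musicbrainz_urls, discogs_artist_urls, discogs_label_urls):
    """Attach Discogs artist/label ids to MusicBrainz URL ids via the URL string.

    musicbrainz_urls: (url_id, link) pairs.
    discogs_artist_urls / discogs_label_urls: (discogs_id, url) pairs.
    """
    # Index MusicBrainz URL ids by link so each Discogs URL is one lookup.
    by_link = {}
    for url_id, link in musicbrainz_urls:
        by_link.setdefault(link, []).append(url_id)

    linked = set()  # a set drops duplicate rows, like the "group by"
    sources = (("Artist", discogs_artist_urls), ("Label", discogs_label_urls))
    for url_type, pairs in sources:
        for discogs_id, url in pairs:
            for url_id in by_link.get(url, []):
                linked.add((url, url_id, discogs_id, url_type))
    # Order by discogs_id first; the remaining fields only break ties.
    return sorted(linked, key=lambda r: (r[2], r[3], r[0], r[1]))


musicbrainz_urls = [
    (187876, "http://www.seasonsrecordings.com/"),
    (8036, "http://www.joshwink.com/"),
    (188025, "http://music.hyperreal.org/labels/fax/"),
]
discogs_artist_urls = [
    (3, "http://www.joshwink.com/"),
    (3, "http://example.com/not-in-musicbrainz"),  # invented, has no match
]
discogs_label_urls = [(2, "http://www.seasonsrecordings.com/")]

rows = link_urls(musicbrainz_urls, discogs_artist_urls, discogs_label_urls)
for row in rows:
    print(row)
```

Only URLs present on both sides produce a row, which mirrors the inner join used in the query.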

In [10]:
%%bigquery
select d.url as url, m.url_id as musicbrainz_id, d.artist_id as discogs_id, "Artist" as url_type
from discogs_modeled.Artist_URL d
join musicbrainz_modeled.URL_Beam_DF m on d.url = m.link
group by d.url, musicbrainz_id, discogs_id
union all
(
    select L.url as url, m.url_id as musicbrainz_id, L.label_id as discogs_id, "Label" as url_type
    from discogs_modeled.LabelURL L
    join musicbrainz_modeled.URL_Beam_DF m on L.url = m.link
    group by L.url, musicbrainz_id, discogs_id
)
order by discogs_id
limit 10

Unnamed: 0,url,musicbrainz_id,discogs_id,url_type
0,http://www.seasonsrecordings.com/,187876,2,Label
1,http://www.whosampled.com/Josh-Wink/,3323931,3,Artist
2,http://www.joshwink.com/,8036,3,Artist
3,http://music.hyperreal.org/labels/rather_inter...,187883,8,Label
4,http://www.residentadvisor.net/dj/christiansmith,3885337,16,Artist
5,http://www.somarecords.com/,187864,18,Label
6,http://music.hyperreal.org/artists/tetsu/,4245821,25,Artist
7,http://www.groovecollective.com/,23572,30,Artist
8,http://music.hyperreal.org/labels/fax/,188025,60,Label
9,http://www.instinctrecords.com/,190317,63,Label


In [3]:
%%bigquery
create or replace view reporting.v_shared_urls as 
(
    select d.url as url, m.url_id as musicbrainz_id, d.artist_id as discogs_id, "Artist" as url_type
    from `earnest-keep-266820.discogs_modeled.Artist_URL` d
    join `earnest-keep-266820.musicbrainz_modeled.URL_Beam_DF` m on d.url = m.link
    group by d.url, musicbrainz_id, discogs_id

    union all
    (
        select L.url as url, m.url_id as musicbrainz_id, L.label_id as discogs_id, "Label" as url_type
        from `earnest-keep-266820.discogs_modeled.LabelURL` L
        join `earnest-keep-266820.musicbrainz_modeled.URL_Beam_DF` m on L.url = m.link
        group by L.url, musicbrainz_id, discogs_id
    )
)

We have linked 45,372 URLs to artists and labels, enriching the MusicBrainz dataset.

In [4]:
%%bigquery
select count(*) from reporting.v_shared_urls

Unnamed: 0,f0_
0,45372


Our third query connects the labels in the Discogs dataset to those in the MusicBrainz dataset by joining the two Label tables on shared label names. The group by is needed because the MusicBrainz dataset has multiple records, with distinct label ids, for some labels, whereas the Discogs dataset has only one label id per label name. The group by is performed on the Discogs label name because MusicBrainz appears to include redundant data.
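
The effect of the group by can be reproduced on a toy example. The sketch below uses Python's built-in `sqlite3` module rather than BigQuery, and the table names `discogs_label` and `musicbrainz_label` and their rows are hypothetical stand-ins for the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table discogs_label (label_id integer, label_name text);
    create table musicbrainz_label (label_id integer, label_name text);
    -- Toy rows: MusicBrainz holds two records for the same label name.
    insert into discogs_label values (1, 'Fax'), (2, 'Soma');
    insert into musicbrainz_label values (10, 'Fax'), (11, 'Fax'), (12, 'Warp');
""")

# Without group by: one output row per matching MusicBrainz record.
joined = conn.execute("""
    select d.label_name
    from discogs_label as d
    join musicbrainz_label as m on m.label_name = d.label_name
""").fetchall()

# With group by: one output row per shared label name.
grouped = conn.execute("""
    select d.label_name
    from discogs_label as d
    join musicbrainz_label as m on m.label_name = d.label_name
    group by d.label_name
""").fetchall()

print(len(joined), len(grouped))
conn.close()
```

The plain join repeats a shared name once per MusicBrainz record, while the grouped version returns each shared name once, which is what the counts later in this milestone rely on.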

In [2]:
%%bigquery
select l.label_name as discogs_label_name
from discogs_modeled.Label_SQL_Final as l
join musicbrainz_modeled.Label_Beam_DF as m
on m.label_name = l.label_name
group by l.label_name
limit 10

Unnamed: 0,discogs_label_name
0,Cutty Shark
1,CBS Direct
2,I.L.S.
3,Gerth Medien
4,Mökkitie Records
5,New World Records
6,Trance Communications Records
7,Nuf Sed
8,Steady Beat Records
9,6K Music


To make a good visualization in Data Studio, I would like to show what portion of all the distinct labels in our database is shared between the MusicBrainz and Discogs datasets. I will use the query above, which gives the intersection of the two datasets, to partition the union of all the labels into three categories: those that belong only to MusicBrainz, those that belong only to Discogs, and those that are shared by both.

The first part of this union finds the labels in MusicBrainz that do not exist in the intersection and classifies them as exclusive to the MusicBrainz set. The second part finds the labels that exist only in the Discogs dataset and classifies them as such. The third part is the intersection itself, the same join that the first two parts exclude in order to find the symmetric difference (XOR) of the whole label domain.
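
The three-way partition described above can be sketched with Python sets. This is an illustration of the idea, not the view itself: the function name `classify_labels` and the toy label names are assumptions, and matching is done on the label name alone.

```python
def classify_labels(musicbrainz_names, discogs_names):
    """Partition label names into MusicBrainz-only, Discogs-only, and shared."""
    mb = set(musicbrainz_names)  # sets collapse duplicate names
    dc = set(discogs_names)
    shared = mb & dc  # the intersection used by all three parts

    rows = []
    rows += [(name, "MusicBrainz Labels Only") for name in sorted(mb - shared)]
    rows += [(name, "Discogs Labels Only") for name in sorted(dc - shared)]
    rows += [(name, "Shared Labels") for name in sorted(shared)]
    return rows


# Toy names, invented for illustration only.
rows = classify_labels(["Fax", "Fax", "Warp"], ["Fax", "Soma"])
for name, domain in rows:
    print(name, domain)
```

Because the three categories are disjoint and together cover every name, the total row count equals the number of distinct label names across both inputs.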

In [2]:
%%bigquery
create or replace view reporting.v_classified_labels as 
(
    select mL.label_name as label_name, "MusicBrainz Labels Only" as domain,
    from `earnest-keep-266820.musicbrainz_modeled.Label_Beam_DF` as mL
    where mL.label_id not in
    (
      select m.label_id as mID,
      from `earnest-keep-266820.discogs_modeled.Label_SQL_Final` as d
      join `earnest-keep-266820.musicbrainz_modeled.Label_Beam_DF` as m
      on d.label_name = m.label_name
    )
    group by label_name

    union all

    select dL.label_name as label_name, "Discogs Labels Only" as domain,
    from `earnest-keep-266820.discogs_modeled.Label_SQL_Final` as dL
    where dL.label_id not in
    (
      select d.label_id as dID,
      from `earnest-keep-266820.discogs_modeled.Label_SQL_Final` as d
      join `earnest-keep-266820.musicbrainz_modeled.Label_Beam_DF` as m
      on d.label_name = m.label_name
    )
    group by label_name

    union all

    select d.label_name as label_name, "Shared Labels" as domain,
    from `earnest-keep-266820.discogs_modeled.Label_SQL_Final` as d
    join `earnest-keep-266820.musicbrainz_modeled.Label_Beam_DF` as m
    on d.label_name = m.label_name
    group by label_name
)

There are 1,560,508 distinct music labels in total in the database.

In [3]:
%%bigquery
select count(*) from reporting.v_classified_labels

Unnamed: 0,f0_
0,1560508


Of these, 114,138 labels are shared between the two datasets.

In [4]:
%%bigquery
select count(*) from reporting.v_classified_labels
where domain = "Shared Labels"

Unnamed: 0,f0_
0,114138
