# Artist-Artist relationships

By Alejandro Fernández Sánchez

## Setting up the connection

In [1]:
# Just in case you're the host and it's not already started
!service postgresql start

In [2]:
# Imports
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv
import json
load_dotenv()

True

In [3]:
DB_NAME = os.getenv("DB_NAME")
DB_HOST = os.getenv("DB_HOST")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_PORT = os.getenv("DB_PORT")

In [4]:
# Establishing a connection via PostgreSQL's Python driver (psycopg2)
conn = psycopg2.connect(
    database=DB_NAME,
    host=DB_HOST,
    user=DB_USER,
    password=DB_PASS,
    port=DB_PORT
)
conn

<connection object at 0x7ff3ecfdee80; dsn: 'user=musicbrainz password=xxx dbname=musicbrainz_db host=localhost port=5432', closed: 0>

In [5]:
cursor = conn.cursor()  # Cursor used to execute queries and iterate over the result rows
cursor

<cursor object at 0x7ff3ecd827a0; closed: 0>

In [6]:
# Helper function
def query_with_cursor(c, q, column_names=False, head=False):
    conn.rollback()  # This is needed if a previous query fails
    c.execute(q)
    if column_names:
        print([col[0] for col in c.description])
    count = 0
    for r in c:
        print(r)
        count += 1
        if head and count == 10:
            break

In [7]:
# Used for saving results to pandas dataframes
engine_url = f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(engine_url, pool_size=10, max_overflow=0)
engine

Engine(postgresql://musicbrainz:***@localhost:5432/musicbrainz_db)
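
The engine URL above is built by plain string interpolation. If a credential ever contains characters such as `@` or `/`, the URL would be parsed incorrectly. A minimal sketch of percent-encoding the credentials with the standard library follows; `build_engine_url` is a hypothetical helper and the password is an invented example, neither comes from this notebook.

```python
from urllib.parse import quote_plus


def build_engine_url(user, password, host, port, db_name):
    """Build a PostgreSQL URL, percent-encoding user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Invented credentials, only to show the encoding of '@' and '/'
example_url = build_engine_url(
    "musicbrainz", "p@ss/word", "localhost", "5432", "musicbrainz_db"
)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `engine_url`.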

## Types of artist relationships

How many are there?

In [8]:
query_with_cursor(
    cursor,
    "SELECT COUNT(*) FROM l_artist_artist"
)

(710951,)


That is a fairly large number. Let's check which types of relationship exist between artists.

In [9]:
query =\
"""
SELECT id, name, description, long_link_phrase
FROM link_type
WHERE entity_type0 = 'artist'
  AND entity_type1 = 'artist'
ORDER BY id
"""
pd.read_sql_query(query, engine)

Unnamed: 0,id,name,description,long_link_phrase
0,102,collaboration,"This is used to specify that an <a href=""/doc/...",collaborated {minor:minorly} {additional:addit...
1,103,member of band,This indicates a person is a member of a group.,is/was {additional:an|a} {additional} {origina...
2,104,supporting musician,Indicates an artist doing long-time instrument...,is/was a supporting artist for
3,105,instrumental supporting musician,Indicates a musician doing long-time instrumen...,does/did {instrument} support for
4,106,musical relationships,,musical relationship
5,107,vocal supporting musician,Indicates a musician doing long-time vocal sup...,does/did {vocal:%|vocals} support for
6,108,is person,This links an artist's performance name (a sta...,performs as
7,109,parent,Indicates a parent-child relationship.,is the {step}parent of
8,110,sibling,This links two siblings (brothers or sisters).,has {half:half-}{step}sibling
9,111,married,This links artists who were married.,is/was married to


It seems like we have 22 possible relationships. They are all important, but there are three that differ from the rest.

Ids 1079 and 108. As I understand them, both link several entities that belong to the same artist, so the final CSVs should contain only one entity per artist. I'm going to store all occurrences of an artist in a list and keep the most used instance.

Id 292. This relationship links a voice actor with their character.

## Artist dataset

We now have all the information needed to tackle the task of generating an artist dataset. Using relationships 108, 292 and 1079, we'll create a list of known ids and names for each artist and store them following the JSONL convention.

The first subtask is to find pairs of entities that represent the same artist.

In [10]:
query =\
"""
SELECT a0.id AS a0_id, a0.name AS a0_name, a1.id AS a1_id, a1.name AS a1_name
FROM l_artist_artist laa
JOIN artist a0 ON a0.id = laa.entity0
JOIN artist a1 ON a1.id = laa.entity1
WHERE laa.link IN (
    SELECT id
    FROM link
    WHERE link_type IN (1079, 108, 292)
)
"""
pairs = pd.read_sql_query(query, engine, dtype=str)
pairs.drop_duplicates(inplace=True)
pairs

Unnamed: 0,a0_id,a0_name,a1_id,a1_name
0,510355,Tom Salta,510353,Atlas Plug
1,515380,Sara Nicholas,512604,DJ Ginger Snapp
2,1816108,Alex Bilowitz,1303285,Alex Bilo
3,285421,Edward Upton,839935,BBII
4,567288,Charles Hilton Jr.,567286,CJ
...,...,...,...,...
75500,2251310,iNFO,2855446,FACTSIMILE
75501,2033492,橘田ほのか,2464746,Mai
75502,2163165,松ケンスケ,701594,パインツリー
75503,2855355,Milena Krstevska,2239357,Klara


We now group the entities (id and name) of each artist into their own list.

In [11]:
# I've iterated through some algorithms that I came up with and this is the fastest one (that works)
# This algorithm groups all the different (same) artists in a list
seen_dict = dict()
artists_lists = list()
last_idx = -1
for _, row in pairs.iterrows():
    artist0 = {"id": row["a0_id"], "name": row["a0_name"]}
    artist1 = {"id": row["a1_id"], "name": row["a1_name"]}
    if artist0["id"] in seen_dict:
        if artist1["id"] in seen_dict:
            continue
        artist0_idx = seen_dict[artist0["id"]]
        artists_lists[artist0_idx].append(artist1)
        seen_dict[artist1["id"]] = artist0_idx
    elif artist1["id"] in seen_dict:
        artist1_idx = seen_dict[artist1["id"]]
        artists_lists[artist1_idx].append(artist0)
        seen_dict[artist0["id"]] = artist1_idx
    else:
        last_idx += 1
        artists_lists.append([artist0, artist1])
        seen_dict[artist0["id"]] = last_idx
        seen_dict[artist1["id"]] = last_idx
artists_lists[:5]

[[{'id': '510355', 'name': 'Tom Salta'},
  {'id': '510353', 'name': 'Atlas Plug'}],
 [{'id': '515380', 'name': 'Sara Nicholas'},
  {'id': '512604', 'name': 'DJ Ginger Snapp'}],
 [{'id': '1816108', 'name': 'Alex Bilowitz'},
  {'id': '1303285', 'name': 'Alex Bilo'}],
 [{'id': '285421', 'name': 'Edward Upton'},
  {'id': '839935', 'name': 'BBII'},
  {'id': '167634', 'name': 'EDMX'},
  {'id': '119129', 'name': 'Bass Potato'},
  {'id': '329054', 'name': 'David Michael Cross'},
  {'id': '105617', 'name': 'Computor Rockers'},
  {'id': '64070', 'name': 'DMX Krew'},
  {'id': '329053', 'name': 'Ed DMX'},
  {'id': '204935', 'name': 'Michael Knight'},
  {'id': '690681', 'name': '101 Force'},
  {'id': '1043503', 'name': 'Asylum Seekers'}],
 [{'id': '567288', 'name': 'Charles Hilton Jr.'},
  {'id': '567286', 'name': 'CJ'}]]
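
One property of the loop above is worth noting: when both ids of a pair were already seen, the pair is skipped, so two lists that turn out to describe the same artist are not merged. Whether this happens in the actual data is not checked here. As a sketch of an alternative, a union-find structure merges groups regardless of the order in which pairs arrive. `group_aliases` and the toy pairs are hypothetical and work on bare ids only.

```python
def group_aliases(pairs):
    """Group ids that are connected by any chain of pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


# The third pair links two groups that were created separately
toy_pairs = [("1", "2"), ("3", "4"), ("2", "4")]
toy_groups = group_aliases(toy_pairs)
print(toy_groups)
```

With the notebook's loop, the same three pairs would end up as two separate lists, because the third pair is skipped.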

We now need to find which artist instance is the most common. For that we're going to be using the `tracks_no_va.csv` dataset.

In [12]:
tracks = pd.read_csv("../data/tracks_no_va.csv", dtype=str)
tracks.fillna("", inplace=True)
tracks

Unnamed: 0,name,date,year,month,artist_count,a0_id,a0_name,tags,a1_id,a1_name,a2_id,a2_name,a3_id,a3_name,a4_id,a4_name
0,*~ƒint_vœr!~*,201612,2016,12,1,2808021,Julius Androide,,,,,,,,,
1,round midnight,200706,2007,6,1,493882,Frédéric Loiseau,,,,,,,,,
2,round midnight,200900,2009,0,1,491037,Hank Jones Trio,,,,,,,,,
3,round midnight,198300,1983,0,1,490780,The Oscar Peterson Big 4,,,,,,,,,
4,round midnight,198500,1985,0,1,487956,Harvie Swartz,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24371887,hakuna matata,199400,1994,0,5,835571,Ryan van den Akker,,824938,Jurrian van Dongen,824937,David Verbeek,824936,Door Van Boeckel,56496,B.B. Queen
24371888,hakuna matata,201400,2014,0,5,883117,Krzysztof Tyniec,,702626,Urszula Janowska,702625,Paweł Tucholski,702622,Michał Mech,702621,Emilian Kamiński
24371889,hakuna matata,201500,2015,0,5,2627409,Alfonso Borbolla,,2627397,Oscarín Aguilar,1875805,Sergio Carranza,833798,Carlos Rivera,23323,[theatre]
24371890,hall of fame,202200,2022,0,5,2224956,Yoshio Furukawa,"605, 11, 4190, 359, 14, 2057, 21101, 78699",2224954,Yusup Dalmaz,356101,Vladimír Šimůnek,97700,岩田匡治,97699,崎元仁


In [16]:
len(tracks)

24371892

In [13]:
id_columns = [f"a{i}_id" for i in range(5)]
artist_freqs = tracks[id_columns].melt().groupby("value").size().to_dict()  # {artist id: number of appearances}
artist_freqs

{'': 94018318,
 '10': 48,
 '1000': 160,
 '100000': 1,
 '1000000': 2,
 '1000001': 1,
 '1000002': 1,
 '1000003': 3,
 '1000004': 2,
 '1000005': 1,
 '1000006': 1,
 '1000007': 5,
 '1000008': 2,
 '1000010': 3,
 '1000011': 1,
 '1000012': 1,
 '1000013': 1,
 '1000014': 1,
 '1000015': 1,
 '1000016': 1,
 '1000017': 3,
 '1000018': 1,
 '1000019': 2,
 '100002': 5,
 '1000020': 3,
 '1000021': 3,
 '1000022': 1,
 '1000023': 1,
 '1000024': 3,
 '1000025': 3,
 '1000026': 2,
 '1000027': 2,
 '100003': 8,
 '1000032': 5,
 '1000034': 1,
 '1000035': 2,
 '100004': 20,
 '1000046': 1,
 '1000049': 2,
 '100005': 17,
 '1000056': 2,
 '1000057': 4,
 '1000058': 13,
 '100006': 1,
 '1000061': 1,
 '1000066': 5,
 '1000067': 42,
 '1000068': 8,
 '1000070': 1,
 '1000071': 2,
 '1000073': 4,
 '1000074': 1,
 '1000075': 6,
 '1000076': 6,
 '1000077': 6,
 '1000078': 10,
 '1000079': 10,
 '1000080': 32,
 '1000081': 40,
 '1000082': 25,
 '1000083': 19,
 '1000085': 15,
 '1000086': 5,
 '1000087': 3,
 '1000088': 3,
 '100009': 3,
 ...
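
The same counting can be illustrated on a toy example with `collections.Counter`. The rows below are invented. The empty-string key corresponds to unused artist slots, since missing values were filled with `""`, and it can be dropped once the counts are built.

```python
from collections import Counter

# Invented rows with two artist slots each; "" marks an empty slot
toy_rows = [
    {"a0_id": "64070", "a1_id": ""},
    {"a0_id": "64070", "a1_id": "329053"},
    {"a0_id": "285421", "a1_id": ""},
]
toy_id_columns = ["a0_id", "a1_id"]

# Count every id across all slots, like melt() followed by a group count
toy_freqs = Counter(row[col] for row in toy_rows for col in toy_id_columns)
toy_freqs.pop("", None)  # discard the count of empty slots
print(dict(toy_freqs))
```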

We are now ready to sort the lists so that the most common instance is the first element of the list.

In [14]:
# First element of the list will be the "main" instance of the artist
artists_lists = list(map(
    lambda artist_list: sorted(artist_list, key=lambda artist: artist_freqs.get(artist["id"], 0), reverse=True),
    artists_lists
))
artists_lists[:5]

[[{'id': '510355', 'name': 'Tom Salta'},
  {'id': '510353', 'name': 'Atlas Plug'}],
 [{'id': '515380', 'name': 'Sara Nicholas'},
  {'id': '512604', 'name': 'DJ Ginger Snapp'}],
 [{'id': '1816108', 'name': 'Alex Bilowitz'},
  {'id': '1303285', 'name': 'Alex Bilo'}],
 [{'id': '64070', 'name': 'DMX Krew'},
  {'id': '329053', 'name': 'Ed DMX'},
  {'id': '105617', 'name': 'Computor Rockers'},
  {'id': '167634', 'name': 'EDMX'},
  {'id': '329054', 'name': 'David Michael Cross'},
  {'id': '204935', 'name': 'Michael Knight'},
  {'id': '690681', 'name': '101 Force'},
  {'id': '839935', 'name': 'BBII'},
  {'id': '1043503', 'name': 'Asylum Seekers'},
  {'id': '119129', 'name': 'Bass Potato'},
  {'id': '285421', 'name': 'Edward Upton'}],
 [{'id': '567286', 'name': 'CJ'},
  {'id': '567288', 'name': 'Charles Hilton Jr.'}]]
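
A small sketch of how this sort picks the main instance. The frequencies below are invented for illustration. Ids missing from the frequency dictionary count as 0, and because Python's sort is stable (also with `reverse=True`), entries with equal frequency keep their original relative order.

```python
# Invented frequencies; ids absent from the dict are treated as 0
toy_freqs = {"64070": 120, "329053": 7}
toy_group = [
    {"id": "285421", "name": "Edward Upton"},
    {"id": "329053", "name": "Ed DMX"},
    {"id": "839935", "name": "BBII"},
    {"id": "64070", "name": "DMX Krew"},
]

ranked = sorted(
    toy_group,
    key=lambda artist: toy_freqs.get(artist["id"], 0),
    reverse=True,
)
main_instance = ranked[0]
print(main_instance["name"], [artist["id"] for artist in ranked])
```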

So far we've only handled the "multiple instances" artists. The following block of code adds the "single instance" artists to the list. It doesn't matter that we do this after the sort, since these lists are going to have only one instance of the artist anyway.

In [17]:
# There has to be a better way to do this
# WARNING: ~3m execution time
seen_set = set(seen_dict.keys())
seen_set.add("")
for _, row in tracks.iterrows():
    artists_in_row = (
        {
            "id": row[f"a{i}_id"],
            "name": row[f"a{i}_name"],
        }
        for i in range(5) if row[f"a{i}_id"] not in seen_set
    )
    for artist in artists_in_row:
        artists_lists.append([artist])
        seen_set.add(artist["id"])
artists_lists[-5:]

KeyboardInterrupt: 
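
As a possible alternative to `iterrows`, which builds a Series object for every row, the id and name columns can be walked pairwise. The sketch below uses plain Python on column pairs; `collect_unseen` is a hypothetical helper, and whether it is actually faster on the full dataset has not been measured here. It visits the data column by column, so the resulting order differs from the row-wise loop.

```python
def collect_unseen(id_name_columns, seen_ids):
    """Return one single-element list per artist id not seen before.

    id_name_columns: iterable of (ids, names) pairs, one pair per artist slot.
    seen_ids: ids that already belong to a multi-instance list.
    """
    seen = set(seen_ids)
    seen.add("")  # empty slots are never artists
    new_lists = []
    for ids, names in id_name_columns:
        for artist_id, name in zip(ids, names):
            if artist_id not in seen:
                seen.add(artist_id)
                new_lists.append([{"id": artist_id, "name": name}])
    return new_lists


# Toy columns: two artist slots, three tracks
toy_columns = [
    (["1", "2", "1"], ["A", "B", "A"]),
    (["", "3", "2"], ["", "C", "B"]),
]
toy_new = collect_unseen(toy_columns, seen_ids={"2"})
print(toy_new)

# Hypothetical use with the notebook's objects:
# columns = ((tracks[f"a{i}_id"], tracks[f"a{i}_name"]) for i in range(5))
# artists_lists.extend(collect_unseen(columns, seen_dict))
```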

In [19]:
len(artists_lists)
len(seen_dict)

905510

At this point we have a list of lists. The following cell transforms it into a list of dictionaries, ready to be serialized.

In [16]:
artists = list()
for artists_list in artists_lists:
    current_artist = dict()
    current_artist["main_id"] = artists_list[0]["id"]
    current_artist["known_ids"] = [artist["id"] for artist in artists_list]
    current_artist["known_names"] = [artist["name"] for artist in artists_list]
    artists.append(current_artist)
artists[:5]

In [17]:
artists[-5:]

The only thing left is to actually serialize what we have.

In [18]:
jsons = [json.dumps(artist, ensure_ascii=False) for artist in artists]
unique_jsons = list(set(jsons))

with open("artists.jsonl", "w", encoding="utf-8") as out_file:
    for unique_json in unique_jsons:
        out_file.write(unique_json + "\n")

In [19]:
!wc -l artists.jsonl

We can now retrieve the data whenever we want.

In [20]:
with open("artists.jsonl", "r", encoding="utf-8") as in_file:
    artist_data = [json.loads(line) for line in in_file]
print(artist_data[:5])
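
The write and read steps can be sketched as a round trip on toy data. One detail: deduplicating with `set` does not preserve the original order of the records. If a stable order matters, `dict.fromkeys` is one option, as shown below with an in-memory buffer standing in for the file. The records are toy values in the same shape as the ones built above.

```python
import io
import json

toy_artists = [
    {"main_id": "64070", "known_ids": ["64070", "329053"], "known_names": ["DMX Krew", "Ed DMX"]},
    {"main_id": "567286", "known_ids": ["567286"], "known_names": ["CJ"]},
    {"main_id": "64070", "known_ids": ["64070", "329053"], "known_names": ["DMX Krew", "Ed DMX"]},
]

lines = [json.dumps(artist, ensure_ascii=False) for artist in toy_artists]
unique_lines = list(dict.fromkeys(lines))  # drops duplicates, keeps first-seen order

buffer = io.StringIO()  # stands in for artists.jsonl
for line in unique_lines:
    buffer.write(line + "\n")

buffer.seek(0)
restored = [json.loads(line) for line in buffer]
print(restored)
```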

## Relationships dataset

The next step is to save the relationships between the artists in a CSV file. Let's start collecting the different relationships.

In [21]:
link_types = pd.read_sql_query("SELECT DISTINCT id FROM link_type  WHERE entity_type0 = 'artist' AND entity_type1 = 'artist'", engine)
relationships = pd.DataFrame({
    'id0': [],
    'name0': [],
    'id1': [],
    'name1': [],
    'relationship_type': [],
})
for link_type in filter(lambda lt: lt not in (108, 292, 1079), link_types.id):
    query =\
f"""
SELECT a0.id AS id0, a0.name AS name0, a1.id AS id1, a1.name AS name1, {link_type} AS relationship_type
FROM l_artist_artist laa
JOIN artist a0 ON a0.id = laa.entity0
JOIN artist a1 ON a1.id = laa.entity1
WHERE laa.link IN (
    SELECT id
    FROM link
    WHERE link_type = {link_type}
);
"""
    result = pd.read_sql_query(query, engine, dtype=str)
    if result.empty:
        continue
    relationships = pd.concat([relationships, result])
del result
relationships.drop_duplicates(inplace=True)
relationships
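
The loop above issues one query per link type. A single query that joins `link` directly and excludes the three "same artist" types might return the same rows in one pass. The sketch below demonstrates the idea on an in-memory SQLite database with a toy schema that mirrors only the columns used in this notebook. It assumes that every row of `l_artist_artist` points to an artist-artist link type, and it has not been run against the real database.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE link (id INTEGER PRIMARY KEY, link_type INTEGER);
CREATE TABLE l_artist_artist (entity0 INTEGER, entity1 INTEGER, link INTEGER);
INSERT INTO artist VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO link VALUES (10, 103), (11, 108), (12, 102);
INSERT INTO l_artist_artist VALUES (1, 2, 10), (1, 3, 11), (2, 3, 12), (1, 2, 10);
""")

single_query = """
SELECT DISTINCT a0.id AS id0, a0.name AS name0,
       a1.id AS id1, a1.name AS name1,
       l.link_type AS relationship_type
FROM l_artist_artist laa
JOIN link l ON l.id = laa.link
JOIN artist a0 ON a0.id = laa.entity0
JOIN artist a1 ON a1.id = laa.entity1
WHERE l.link_type NOT IN (108, 292, 1079)
ORDER BY id0, id1, relationship_type
"""
demo_rows = demo_conn.execute(single_query).fetchall()
demo_conn.close()
print(demo_rows)
```

In the notebook, the same query text could be passed to `pd.read_sql_query(single_query, engine, dtype=str)`.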

Now we filter the relationships, keeping only those in which both artists appear in the tracks dataset.

In [22]:
mask = relationships[["id0", "id1"]].isin(artist_freqs.keys()).all(axis=1)
filtered_relationships = relationships[mask].copy()  # Copy so that later assignments don't target a view
filtered_relationships

We now have the relationships, but some of them still reference secondary ids of an artist. The artist dataset we built lets us replace every secondary id with the main id of that artist. For this task we create a dictionary that maps each secondary id to its main id.

In [23]:
changes_dict = dict()
for artist in artist_data:
    if len(artist["known_ids"]) > 1:
        for known_id in artist["known_ids"]:
            if known_id != artist["main_id"]:
                changes_dict[known_id] = artist["main_id"]
len(changes_dict)

These are the relationships that we need to modify.

In [24]:
mask = filtered_relationships[["id0", "id1"]].isin(changes_dict.keys()).any(axis=1)
filtered_relationships[mask]

In [25]:
for col in ["id0", "id1"]:
    # Ids that are not in changes_dict are left unchanged
    filtered_relationships[col] = filtered_relationships[col].map(lambda artist_id: changes_dict.get(artist_id, artist_id))

filtered_relationships.loc[mask]
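
A row is selected by the mask as soon as one of its two ids has a replacement, so the other id of that row may have no entry in the dictionary. The toy example below, with an invented mapping, shows a replacement that falls back to the original id in that case.

```python
toy_changes = {"329053": "64070"}  # secondary id -> main id (invented)
toy_pairs = [("329053", "567286"), ("510355", "510353")]

# dict.get(key, key) returns the key itself when there is no replacement
remapped = [
    tuple(toy_changes.get(artist_id, artist_id) for artist_id in pair)
    for pair in toy_pairs
]
print(remapped)

# A plain lookup would lose the id instead
lost = toy_changes.get("567286")
```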

In [26]:
changes_dict["426487"]

We can make sure that no secondary id is left, for example because of chained or cyclic references, this way (hoping for a False return):

In [27]:
filtered_relationships[["id0", "id1"]].isin(changes_dict.keys()).any(axis=1).any()
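
If this check ever returned True, one possible remedy would be to flatten the mapping first, so that every secondary id points directly at its final main id. The sketch below follows the chain and raises an error on a cycle. `resolve_main_id` and the toy mapping are hypothetical and not part of the notebook.

```python
def resolve_main_id(artist_id, changes):
    """Follow secondary -> main links until an id without a replacement is reached."""
    visited = {artist_id}
    while artist_id in changes:
        artist_id = changes[artist_id]
        if artist_id in visited:
            raise ValueError(f"cyclic mapping involving id {artist_id}")
        visited.add(artist_id)
    return artist_id


# Toy chain: "a" is replaced by "b", which is itself replaced by "c"
toy_chain = {"a": "b", "b": "c"}
flattened = {key: resolve_main_id(key, toy_chain) for key in toy_chain}
print(flattened)
```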

Now we can finally save our relationships CSV.

In [28]:
filtered_relationships.to_csv("relationships.csv", index=False)

In [29]:
!wc -l relationships.csv

## Tags

We can now get the tags from the database.

In [30]:
query = """
SELECT artist, STRING_AGG(tag::VARCHAR, ', ') as tags
FROM artist_tag
GROUP BY artist;
"""
tags = pd.read_sql_query(query, engine, dtype=str)
tags = tags[tags["artist"].isin([artist["id"] for artists_list in artists_lists for artist in artists_list])]
tags

In [31]:
mask = tags["artist"].isin(changes_dict.keys())
tags[mask]

In [32]:
tags.loc[mask, "artist"] = tags.loc[mask, "artist"].map(changes_dict)

We can now save the results for the future.

In [33]:
tags.to_csv("artist_tags.csv", index=False)
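
After the ids are replaced, several rows can share the same main id, one for each former instance that had tags. The CSV above keeps them as separate rows. If a single row per artist is preferred, the tag strings could be merged as in the sketch below. `merge_tag_rows` is a hypothetical helper and the rows are invented.

```python
def merge_tag_rows(rows):
    """Merge (artist_id, "tag, tag, ...") rows into one tag string per artist."""
    merged = {}
    for artist_id, tag_string in rows:
        tags_for_artist = merged.setdefault(artist_id, {})
        for tag in tag_string.split(", "):
            if tag:
                tags_for_artist[tag] = None  # dict keys act as an ordered set
    return {artist_id: ", ".join(tags) for artist_id, tags in merged.items()}


toy_tag_rows = [
    ("64070", "605, 11"),
    ("64070", "11, 359"),
    ("567286", "14"),
]
merged_tags = merge_tag_rows(toy_tag_rows)
print(merged_tags)
```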

## Tracks dataset with main IDs

Now that we're here, why not do the same with the tracks dataset, which we already have in memory?

In [34]:
mask = tracks[id_columns].isin(changes_dict.keys()).any(axis=1)
tracks.loc[mask, id_columns]

In [35]:
# WARNING: ~10m execution time
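for col in id_columns:
    # Rows selected by the mask can also hold ids without a replacement; those stay unchanged
    tracks.loc[mask, col] = tracks.loc[mask, col].map(lambda artist_id: changes_dict.get(artist_id, artist_id))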

tracks.loc[mask, id_columns]

In [36]:
tracks[id_columns].isin(changes_dict.keys()).any(axis=1).any()

In [37]:
tracks.to_csv("../data/tracks_no_va_merged.csv", index=False)

In [38]:
!wc -l ../data/tracks_no_va_merged.csv

## Cleanup

In [39]:
engine.dispose()
conn.close()

In [40]:
!service postgresql stop