# Artist-Artist relationships

By Alejandro Fernández Sánchez

## Setting up the connection

In [1]:
# Just in case you're the host and it's not already started
!service postgresql start

In [2]:
# Imports
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv
import json
load_dotenv()

True

In [3]:
DB_NAME = os.getenv("DB_NAME")
DB_HOST = os.getenv("DB_HOST")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_PORT = os.getenv("DB_PORT")

In [4]:
# Establishing a connection via PostgreSQL's Python driver (psycopg2)
conn = psycopg2.connect(
    database=DB_NAME,
    host=DB_HOST,
    user=DB_USER,
    password=DB_PASS,
    port=DB_PORT
)
conn

<connection object at 0x7ff39cef6980; dsn: 'user=musicbrainz password=xxx dbname=musicbrainz_db host=localhost port=5432', closed: 0>

In [5]:
cursor = conn.cursor()  # Cursor used to execute queries and iterate over their rows
cursor

<cursor object at 0x7ff39ce0ef20; closed: 0>

In [6]:
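# Helper: run a query and print its rows
# (column_names=True prints the column names first, head=True stops after 10 rows)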
def query_with_cursor(c, q, column_names=False, head=False):
    conn.rollback()  # This is needed if a previous query fails
    c.execute(q)
    if column_names:
        print([col[0] for col in c.description])
    count = 0
    for r in c:
        print(r)
        count += 1
        if head and count == 10:
            break
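
The same pattern can be tried without a PostgreSQL server. The following is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for `psycopg2` (both follow the Python DB-API). The function name `fetch_head`, the demo table and its rows are made up for illustration. It returns the rows instead of printing them and uses `itertools.islice` in place of the manual counter.

```python
import sqlite3
from itertools import islice


def fetch_head(cursor, query, limit=10):
    """Run `query` and return (column_names, first `limit` rows)."""
    cursor.execute(query)
    column_names = [col[0] for col in cursor.description]
    rows = list(islice(cursor, limit))  # stops reading after `limit` rows
    return column_names, rows


# Hypothetical in-memory table, only for demonstration
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE artist (id INTEGER, name TEXT)")
demo_cursor.executemany(
    "INSERT INTO artist VALUES (?, ?)",
    [(i, f"artist {i}") for i in range(25)],
)

columns, rows = fetch_head(
    demo_cursor, "SELECT id, name FROM artist ORDER BY id", limit=3
)
print(columns)
print(rows)
demo_conn.close()
```

Returning the rows makes the helper easier to reuse in later cells, at the cost of holding the requested rows in a list.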

In [7]:
# SQLAlchemy engine, used to load query results into pandas DataFrames
engine_url = f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(engine_url)
engine

Engine(postgresql://musicbrainz:***@localhost:5432/musicbrainz_db)

## Types of artist relationships

How many are there?

In [8]:
query_with_cursor(
    cursor,
    "SELECT COUNT(*) FROM l_artist_artist"
)

(671957,)


That's a fairly big number. Let's check which types of relationship exist between artists.

In [9]:
query =\
"""
SELECT id, name, description, long_link_phrase
FROM link_type
WHERE entity_type0 = 'artist'
  AND entity_type1 = 'artist'
ORDER BY id
"""
pd.read_sql_query(query, engine)

Unnamed: 0,id,name,description,long_link_phrase
0,102,collaboration,"This is used to specify that an <a href=""/doc/...",collaborated {minor:minorly} {additional:addit...
1,103,member of band,This indicates a person is a member of a group.,is/was {additional:an|a} {additional} {origina...
2,104,supporting musician,Indicates an artist doing long-time instrument...,is/was a supporting artist for
3,105,instrumental supporting musician,Indicates a musician doing long-time instrumen...,does/did {instrument} support for
4,106,musical relationships,,musical relationship
5,107,vocal supporting musician,Indicates a musician doing long-time vocal sup...,does/did {vocal:%|vocals} support for
6,108,is person,This links an artist's performance name (a sta...,performs as
7,109,parent,Indicates a parent-child relationship.,is the {step}parent of
8,110,sibling,This links two siblings (brothers or sisters).,has {half:half-}{step}sibling
9,111,married,This links artists who were married.,is/was married to


It seems like we have 22 possible relationships (the table above shows only the first ten). They are all important, but three of them differ from the rest.

Ids 1079 and 108. As I understand them, they link entities that represent the same artist, and we should only have one entity per artist in the final CSVs. I'm going to store all occurrences of an artist in a list and keep the most used instance.

Id 292. This relationship links a voice actor with their character. It is handled the same way.

## Artist dataset

We now have all the information needed to tackle the task of generating an artist dataset. Using relationships 108, 292 and 1079, we'll create a list of known ids and names for each artist and store them in JSON Lines (JSONL) format.

The first subtask is to find pairs of entities that represent the same artist.

In [10]:
query =\
f"""
SELECT a0.id AS a0_id, a0.name AS a0_name, a1.id AS a1_id, a1.name AS a1_name
FROM l_artist_artist laa
JOIN artist a0 ON a0.id = laa.entity0
JOIN artist a1 ON a1.id = laa.entity1
WHERE laa.link IN (
    SELECT id
    FROM link
    WHERE link_type IN (1079, 108, 292)
)
"""
pairs = pd.read_sql_query(query, engine, dtype=str)
pairs.drop_duplicates(inplace=True)
pairs

Unnamed: 0,a0_id,a0_name,a1_id,a1_name
0,510355,Tom Salta,510353,Atlas Plug
1,515380,Sara Nicholas,512604,DJ Ginger Snapp
2,805193,Péter Takács,182397,Deto
3,1816108,Alex Bilowitz,1303285,Alex Bilo
4,366859,Tobias Lützenkirchen,134438,LXR
...,...,...,...,...
71552,2733295,Jan Pettersen,2733293,Jinx
71553,107069,Chris Cowie,2516613,Q
71554,308020,Michael Baur,2733227,Code-22
71555,1823042,glass beach,2733316,glass beach 2


We now group the entities (id and name) that belong to the same artist into one list per artist.

In [11]:
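# I've iterated through some algorithms that I came up with and this is the fastest one (that works)
# It collects all the entities that represent the same artist in one list per artist
# Note: a pair whose two ids were both seen already is skipped, even if they sit in different lists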
seen_dict = dict()
artists_lists = list()
last_idx = -1
for _, row in pairs.iterrows():
    artist0 = {"id": row["a0_id"], "name": row["a0_name"]}
    artist1 = {"id": row["a1_id"], "name": row["a1_name"]}
    if artist0["id"] in seen_dict:
        if artist1["id"] in seen_dict:
            continue
        artist0_idx = seen_dict[artist0["id"]]
        artists_lists[artist0_idx].append(artist1)
        seen_dict[artist1["id"]] = artist0_idx
    elif artist1["id"] in seen_dict:
        artist1_idx = seen_dict[artist1["id"]]
        artists_lists[artist1_idx].append(artist0)
        seen_dict[artist0["id"]] = artist1_idx
    else:
        last_idx += 1
        artists_lists.append([artist0, artist1])
        seen_dict[artist0["id"]] = last_idx
        seen_dict[artist1["id"]] = last_idx
artists_lists[:5]

[[{'id': '510355', 'name': 'Tom Salta'},
  {'id': '510353', 'name': 'Atlas Plug'}],
 [{'id': '515380', 'name': 'Sara Nicholas'},
  {'id': '512604', 'name': 'DJ Ginger Snapp'}],
 [{'id': '805193', 'name': 'Péter Takács'}, {'id': '182397', 'name': 'Deto'}],
 [{'id': '1816108', 'name': 'Alex Bilowitz'},
  {'id': '1303285', 'name': 'Alex Bilo'}],
 [{'id': '366859', 'name': 'Tobias Lützenkirchen'},
  {'id': '134438', 'name': 'LXR'},
  {'id': '131031', 'name': 'Karosa'},
  {'id': '121353', 'name': 'Richthoven'},
  {'id': '408293', 'name': 'L.Y.T.Z.'},
  {'id': '258876', 'name': 'Lützenkirchen'},
  {'id': '411300', 'name': 'Lu Tracks'},
  {'id': '165609', 'name': '7-7-0'},
  {'id': '159973', 'name': 'Toby Lee Connor'},
  {'id': '690598', 'name': 'Paratopic'}]]
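
The loop above skips a pair when both ids have already been seen, so two lists that a later pair would connect stay separate. A union-find (disjoint-set) structure handles that case. This is a minimal sketch, not the method used in this notebook: the names `group_pairs`, `find` and `union` are hypothetical, and the pairs are reduced to plain id tuples.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group ids connected by any chain of (id_a, id_b) pairs."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = defaultdict(list)
    for x in parent:
        groups[find(x)].append(x)
    return list(groups.values())


# ("2", "4") connects the first two groups; the simple loop would skip it
example_pairs = [("1", "2"), ("3", "4"), ("2", "4"), ("5", "6")]
print(group_pairs(example_pairs))
```

Names can be attached afterwards with an id-to-name dictionary built from the `pairs` DataFrame.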

We now need to find which instance of each artist is the most common, that is, the one that appears most often in the releases. For that we'll use the `releases_no_va.csv` dataset.

In [12]:
releases = pd.read_csv("releases_no_va.csv", dtype=str)
releases.fillna("", inplace=True)
releases

Unnamed: 0,name,date,artist_credit,artist_count,a0_id,a0_name,a1_id,a1_name,a2_id,a2_name,a3_id,a3_name,a4_id,a4_name
0,!,2020-08-06,119635,1,119635,Kevin Drumm,,,,,,,,
1,Sabr,2009-02-16,2094632,1,1450753,Shahram Solati,,,,,,,,
2,Sabr Aur Shukr,2023-09-08,351688,1,351688,Shekhar Ravjiani,,,,,,,,
3,Sabra Shatila 1982,2019-12-28,1288322,1,1096452,Geography of Hell,,,,,,,,
4,Sabrana Djela 1976. - 1987.,2019-03-28,414860,1,414860,Paraf,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2841551,Glorious Percussion / In tempus praesens,2011-10-28,1085468,5,947108,Luzerner Sinfonieorchester,947105,Glorious Percussion,538910,Vadim Gluzman,420382,Jonathan Nott,153732,София Асгатовна Губайдулина
2841552,Glow of Benares,2017-11-24,2962164,5,1496607,Abhijit Banerjee,1058331,Randers Kammerorkester,494877,Kala Ramnath,310652,Aarhus Jazz Orchestra,23800,Lars Møller
2841553,Glowing Up,2021-06-11,3222555,5,2290023,TUSO,2290022,Tudor,2118382,Milwin,1886127,Discrete,1184231,Sofia Karlberg
2841554,Gloria and Other Choral Music,1988-12-28,2062383,5,551005,Donna Deam,476955,City of London Sinfonia,129845,The Cambridge Singers,36433,John Rutter,30462,Francis Poulenc


In [13]:
id_columns = [f"a{i}_id" for i in range(5)]
# Stack the five id columns into one and count how often each id appears
artist_freqs = releases[id_columns].melt().groupby("value").size().to_dict()
artist_freqs

{'': 10909872,
 '10': 6,
 '1000': 44,
 '1000007': 1,
 '1000008': 2,
 '1000017': 1,
 '100002': 1,
 '1000025': 1,
 '100003': 1,
 '100004': 4,
 '1000049': 1,
 '100005': 2,
 '1000058': 1,
 '1000060': 3,
 '1000067': 7,
 '1000068': 2,
 '1000073': 1,
 '1000077': 1,
 '1000078': 1,
 '1000079': 1,
 '1000080': 16,
 '1000081': 3,
 '1000082': 3,
 '1000083': 1,
 '1000085': 2,
 '100009': 1,
 '1000095': 1,
 '1000096': 2,
 '1000107': 1,
 '1000112': 1,
 '100012': 15,
 '1000120': 2,
 '1000121': 1,
 '1000130': 3,
 '1000131': 1,
 '1000135': 12,
 '1000138': 6,
 '100014': 1,
 '1000143': 3,
 '1000144': 1,
 '1000146': 1,
 '1000147': 2,
 '1000148': 1,
 '1000150': 1,
 '1000151': 1,
 '1000154': 1,
 '1000155': 2,
 '1000156': 1,
 '1000157': 1,
 '1000158': 3,
 '1000159': 2,
 '1000161': 4,
 '1000164': 1,
 '1000167': 1,
 '1000168': 3,
 '100017': 3,
 '1000170': 1,
 '1000171': 11,
 '100018': 2,
 '1000182': 1,
 '1000187': 1,
 '1000188': 4,
 '1000196': 3,
 '1000199': 2,
 '100020': 4,
 '1000200': 1,
 '1000202': 5,
 ...
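
The `''` key in the output counts the empty artist slots of releases with fewer than five artists. It is harmless here because no artist id is empty. The counting step itself can be sketched with `collections.Counter`; the `release_artist_ids` list below is a hypothetical miniature of the id columns, not real data from the file.

```python
from collections import Counter

# Hypothetical miniature of the id columns: one list of artist ids per release
release_artist_ids = [
    ["119635", "", ""],
    ["1450753", "119635", ""],
    ["119635", "", ""],
]

artist_counts = Counter(
    artist_id
    for row in release_artist_ids
    for artist_id in row
    if artist_id  # skip the empty padding values
)
print(artist_counts)
```

Looking up an id that never appears returns 0 from a `Counter`, which matches the `artist_freqs.get(..., 0)` default used when sorting.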

We are now ready to sort the lists so that the most common instance is the first element of the list.

In [14]:
# First element of the list will be the "main" instance of the artist
artists_lists = list(map(
    lambda artist_list: sorted(artist_list, key=lambda artist: artist_freqs.get(artist["id"], 0), reverse=True),
    artists_lists
))
artists_lists[:5]

[[{'id': '510355', 'name': 'Tom Salta'},
  {'id': '510353', 'name': 'Atlas Plug'}],
 [{'id': '515380', 'name': 'Sara Nicholas'},
  {'id': '512604', 'name': 'DJ Ginger Snapp'}],
 [{'id': '805193', 'name': 'Péter Takács'}, {'id': '182397', 'name': 'Deto'}],
 [{'id': '1816108', 'name': 'Alex Bilowitz'},
  {'id': '1303285', 'name': 'Alex Bilo'}],
 [{'id': '258876', 'name': 'Lützenkirchen'},
  {'id': '366859', 'name': 'Tobias Lützenkirchen'},
  {'id': '134438', 'name': 'LXR'},
  {'id': '690598', 'name': 'Paratopic'},
  {'id': '408293', 'name': 'L.Y.T.Z.'},
  {'id': '411300', 'name': 'Lu Tracks'},
  {'id': '159973', 'name': 'Toby Lee Connor'},
  {'id': '131031', 'name': 'Karosa'},
  {'id': '121353', 'name': 'Richthoven'},
  {'id': '165609', 'name': '7-7-0'}]]
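
One detail worth knowing about this step: Python's `sorted` is stable, also with `reverse=True`, so entities with the same frequency (for example, several that never appear in a release) keep their original relative order. A small sketch with invented ids and counts:

```python
freqs = {"a": 3, "b": 7}  # hypothetical release counts per id
group = [
    {"id": "a", "name": "Alias A"},
    {"id": "c", "name": "Alias C"},  # never appears in a release
    {"id": "b", "name": "Alias B"},
    {"id": "d", "name": "Alias D"},  # never appears in a release
]

# Most frequent first; ties ("c" and "d") keep their original order
ranked = sorted(group, key=lambda artist: freqs.get(artist["id"], 0), reverse=True)
print([artist["id"] for artist in ranked])
```

This means the choice of main instance among tied entities depends on the order in which the pairs were read.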

So far we've only handled the "multiple instances" artists. The following block of code adds the "single instance" artists to the list. Doing this after the sort is fine, because each of these lists contains only one instance of the artist.

In [15]:
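# There has to be a better way to do this
# Note: an artist that appears in several releases is appended once per appearance;
# identical duplicates are removed later, when serializing
# WARNING: ~3m execution time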
for _, row in releases.iterrows():
    artists_in_row = (
        {
            "id": row[f"a{i}_id"],
            "name": row[f"a{i}_name"],
        }
        for i in range(5) if row[f"a{i}_id"] != ""
    )
    for artist in artists_in_row:
        if artist["id"] not in seen_dict:
            artists_lists.append([artist])
artists_lists[-5:]

[[{'id': '30462', 'name': 'Francis Poulenc'}],
 [{'id': '1205355', 'name': 'Loota'}],
 [{'id': '1205354', 'name': 'JayAllDay'}],
 [{'id': '1129217', 'name': 'KOHH'}],
 [{'id': '1033122', 'name': 'Okasian'}]]
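
Because the loop does not record the ids it has already appended, an artist that appears in many releases is appended many times. A possible alternative is to collect the single-instance artists in a dictionary keyed by id, so each one is stored once. The sketch below uses plain dictionaries as hypothetical stand-ins for the rows of `releases` and a set in place of `seen_dict`; it only illustrates the logic.

```python
# Hypothetical stand-ins for `releases` rows and for the ids already grouped
rows = [
    {"a0_id": "30462", "a0_name": "Francis Poulenc", "a1_id": "", "a1_name": ""},
    {"a0_id": "30462", "a0_name": "Francis Poulenc", "a1_id": "510355", "a1_name": "Tom Salta"},
]
already_grouped = {"510355", "510353"}

singles = {}  # id -> name, insertion-ordered, one entry per id
for row in rows:
    for i in range(2):  # the real table has five artist slots
        artist_id = row[f"a{i}_id"]
        if artist_id and artist_id not in already_grouped:
            singles.setdefault(artist_id, row[f"a{i}_name"])  # keep the first name seen

single_lists = [[{"id": k, "name": v}] for k, v in singles.items()]
print(single_lists)
```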

At this point we have a list of lists. The following cell transforms it into a list of dictionaries, ready to be serialized.

In [16]:
artists = list()
for artists_list in artists_lists:
    current_artist = dict()
    current_artist["main_id"] = artists_list[0]["id"]
    current_artist["known_ids"] = [artist["id"] for artist in artists_list]
    current_artist["known_names"] = [artist["name"] for artist in artists_list]
    artists.append(current_artist)
artists[:5]

[{'main_id': '510355',
  'known_ids': ['510355', '510353'],
  'known_names': ['Tom Salta', 'Atlas Plug']},
 {'main_id': '515380',
  'known_ids': ['515380', '512604'],
  'known_names': ['Sara Nicholas', 'DJ Ginger Snapp']},
 {'main_id': '805193',
  'known_ids': ['805193', '182397'],
  'known_names': ['Péter Takács', 'Deto']},
 {'main_id': '1816108',
  'known_ids': ['1816108', '1303285'],
  'known_names': ['Alex Bilowitz', 'Alex Bilo']},
 {'main_id': '258876',
  'known_ids': ['258876',
   '366859',
   '134438',
   '690598',
   '408293',
   '411300',
   '159973',
   '131031',
   '121353',
   '165609'],
  'known_names': ['Lützenkirchen',
   'Tobias Lützenkirchen',
   'LXR',
   'Paratopic',
   'L.Y.T.Z.',
   'Lu Tracks',
   'Toby Lee Connor',
   'Karosa',
   'Richthoven',
   '7-7-0']}]

In [17]:
artists[-5:]

[{'main_id': '30462',
  'known_ids': ['30462'],
  'known_names': ['Francis Poulenc']},
 {'main_id': '1205355', 'known_ids': ['1205355'], 'known_names': ['Loota']},
 {'main_id': '1205354',
  'known_ids': ['1205354'],
  'known_names': ['JayAllDay']},
 {'main_id': '1129217', 'known_ids': ['1129217'], 'known_names': ['KOHH']},
 {'main_id': '1033122', 'known_ids': ['1033122'], 'known_names': ['Okasian']}]

The only thing left is to actually serialize what we have.

In [18]:
jsons = [json.dumps(artist, ensure_ascii=False) for artist in artists]
unique_jsons = list(set(jsons))

with open("artists.jsonl", "w", encoding="utf-8") as out_file:
    for unique_json in unique_jsons:
        out_file.write(unique_json + "\n")

In [19]:
!wc -l artists.jsonl

783597 artists.jsonl


We can now load the data back whenever we want.

In [20]:
with open("artists.jsonl", "r", encoding="utf-8") as in_file:
    artist_data = [json.loads(line) for line in in_file]
print(artist_data[:5])

[{'main_id': '2121669', 'known_ids': ['2121669'], 'known_names': ['Nicholas Fairbank']}, {'main_id': '1110377', 'known_ids': ['1110377'], 'known_names': ["I've Lost"]}, {'main_id': '1169734', 'known_ids': ['1169734'], 'known_names': ['windy hill']}, {'main_id': '2323344', 'known_ids': ['2323344'], 'known_names': ['Aphangak']}, {'main_id': '60138', 'known_ids': ['60138'], 'known_names': ['Flat 6']}]
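
Deduplicating through `set` discards the original order, which is why the first records read back are not the first ones built. If a stable order matters, `dict.fromkeys` removes duplicates while keeping the first occurrence of each line. The following is a minimal round-trip sketch; the file name `artists_demo.jsonl` and the temporary directory are assumptions made for the example.

```python
import json
import os
import tempfile

records = [
    {"main_id": "510355", "known_ids": ["510355", "510353"], "known_names": ["Tom Salta", "Atlas Plug"]},
    {"main_id": "30462", "known_ids": ["30462"], "known_names": ["Francis Poulenc"]},
    {"main_id": "30462", "known_ids": ["30462"], "known_names": ["Francis Poulenc"]},  # duplicate
]

# dict.fromkeys keeps the first occurrence of each line and preserves order
lines = list(dict.fromkeys(json.dumps(r, ensure_ascii=False) for r in records))

path = os.path.join(tempfile.mkdtemp(), "artists_demo.jsonl")  # hypothetical file name
with open(path, "w", encoding="utf-8") as out_file:
    out_file.writelines(line + "\n" for line in lines)

# Each line is an independent JSON document
with open(path, "r", encoding="utf-8") as in_file:
    loaded = [json.loads(line) for line in in_file]
print(loaded)
```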


## Relationships dataset

The next step is to save the relationships between the artists in a CSV file. Let's start by collecting the different relationships, leaving out the three "same artist" types handled above.

In [21]:
link_types = pd.read_sql_query("SELECT DISTINCT id FROM link_type  WHERE entity_type0 = 'artist' AND entity_type1 = 'artist'", engine)
relationships = pd.DataFrame({
    'id0': [],
    'name0': [],
    'id1': [],
    'name1': [],
    'relationship_type': [],
})
for link_type in filter(lambda lt: lt not in (108, 292, 1079), link_types.id):
    query =\
f"""
SELECT a0.id AS id0, a0.name AS name0, a1.id AS id1, a1.name AS name1, {link_type} AS relationship_type
FROM l_artist_artist laa
JOIN artist a0 ON a0.id = laa.entity0
JOIN artist a1 ON a1.id = laa.entity1
WHERE laa.link IN (
    SELECT id
    FROM link
    WHERE link_type = {link_type}
);
"""
    result = pd.read_sql_query(query, engine, dtype=str)
    if result.empty:
        continue
    relationships = pd.concat([relationships, result])
del result
relationships.drop_duplicates(inplace=True)
relationships

Unnamed: 0,id0,name0,id1,name1,relationship_type
0,448102,Xoel López,248824,Lovely Luna,102
1,359330,Miley Cyrus,686291,Helping Haiti,102
2,129154,Jay-J,472106,Jay-J & Macari,102
3,267439,Andrew Macari,472106,Jay-J & Macari,102
4,77944,Michael Bublé,686291,Helping Haiti,102
...,...,...,...,...,...
563,1237428,Curb Cobain,236309,Kurt Cobain,973
564,2731567,JELEE,2723888,橘ののか,973
565,1625587,Chor und Orchester Mantovani,210790,Mantovani,973
566,242,The Chemical Brothers,1468,The Dust Brothers,973


Now we filter the relationships, keeping only those in which both artists appear in the releases dataset.

In [22]:
mask = relationships[["id0", "id1"]].isin(artist_freqs.keys()).all(axis=1)
filtered_relationships = relationships[mask].copy()  # copy, because we assign into it later
filtered_relationships

Unnamed: 0,id0,name0,id1,name1,relationship_type
0,448102,Xoel López,248824,Lovely Luna,102
1,359330,Miley Cyrus,686291,Helping Haiti,102
2,129154,Jay-J,472106,Jay-J & Macari,102
3,267439,Andrew Macari,472106,Jay-J & Macari,102
4,77944,Michael Bublé,686291,Helping Haiti,102
...,...,...,...,...,...
556,1004547,SpongeBOZZ,43102,SpongeBob SquarePants,973
557,2685255,Trio Messiaen,10371,Olivier Messiaen,973
558,2705719,GUNRINGER-Y,1603935,Gunslinger-R,973
563,1237428,Curb Cobain,236309,Kurt Cobain,973


We now have the relationships, but some of them still reference secondary ids. The artist dataset lets us replace every secondary id with the main id of its artist. For this task we build a dictionary that maps each secondary id to its main id.

In [23]:
changes_dict = dict()
for artist in artist_data:
    if len(artist["known_ids"]) > 1:
        for known_id in artist["known_ids"]:
            if known_id != artist["main_id"]:
                changes_dict[known_id] = artist["main_id"]
len(changes_dict)

70660
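
The mapping can also be written as a dictionary comprehension and applied with `dict.get`, which leaves unknown ids untouched. In this sketch the name `id_to_main` is hypothetical and the records are shortened: the pair 426487 → 344890 is the one looked up later in this notebook, and the rest of that artist's known ids are omitted.

```python
# Shortened, partly hypothetical records in the shape of artists.jsonl
artist_records = [
    {"main_id": "344890", "known_ids": ["344890", "426487"]},
    {"main_id": "30462", "known_ids": ["30462"]},
]

# Map every secondary id to the main id of its artist
id_to_main = {
    known_id: record["main_id"]
    for record in artist_records
    for known_id in record["known_ids"]
    if known_id != record["main_id"]
}

# Ids without an entry are already main ids and stay as they are
pair = ("426487", "541571")
remapped = tuple(id_to_main.get(artist_id, artist_id) for artist_id in pair)
print(id_to_main, remapped)
```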

These are the relationships that we need to modify.

In [24]:
mask = filtered_relationships[["id0", "id1"]].isin(changes_dict.keys()).any(axis=1)
filtered_relationships.loc[mask]

Unnamed: 0,id0,name0,id1,name1,relationship_type
122,426487,The Count of Monte Cristal,541571,The Count & Sinden,102
123,184688,Frank Tovey,195090,Mkultra,102
127,57716,Filippo “Naughty” Moscatello,337131,Naughty & Tolis,102
160,239310,Sharon den Adel,54309,Avantasia,102
190,436287,Dominick Martin,595980,St. Cal,102
...,...,...,...,...,...
392,2281011,KEN PLUS,923360,Macintosh Plus,973
402,2458694,fluttershy will never die,1162788,Fluttershy,973
456,2629185,月見英子,2316189,月見英子,973
556,1004547,SpongeBOZZ,43102,SpongeBob SquarePants,973


In [25]:
filtered_relationships.loc[mask, ["id0", "id1"]] = filtered_relationships.loc[mask, ["id0", "id1"]].replace(changes_dict)

filtered_relationships.loc[mask]

Unnamed: 0,id0,name0,id1,name1,relationship_type
122,344890,The Count of Monte Cristal,541571,The Count & Sinden,102
123,46990,Frank Tovey,195090,Mkultra,102
127,117552,Filippo “Naughty” Moscatello,337131,Naughty & Tolis,102
160,1588050,Sharon den Adel,54309,Avantasia,102
190,71451,Dominick Martin,595980,St. Cal,102
...,...,...,...,...,...
392,2281011,KEN PLUS,860871,Macintosh Plus,973
402,2458697,fluttershy will never die,1162788,Fluttershy,973
456,1577273,月見英子,1423112,月見英子,973
556,1700445,SpongeBOZZ,43102,SpongeBob SquarePants,973


In [26]:
changes_dict["426487"]

'344890'

We can make sure that no secondary id is left after the replacement, which also rules out chained or cyclic replacements (we expect `False`):

In [27]:
filtered_relationships[["id0", "id1"]].isin(changes_dict.keys()).any(axis=1).any()

False
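
A complementary check can be run on the mapping itself: a single replacement pass is enough only if no replacement target is itself a key of the mapping. The helper name `needs_second_pass` is hypothetical, and the second mapping is an invented counterexample.

```python
def needs_second_pass(mapping):
    """True if some replacement target is itself scheduled for replacement."""
    return any(target in mapping for target in mapping.values())


final_targets = {"426487": "344890"}  # the target is not a key: one pass suffices
chained = {"1": "2", "2": "3"}        # "2" would have to be replaced again

print(needs_second_pass(final_targets))
print(needs_second_pass(chained))
```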

Now we can finally save our relationships CSV.

In [28]:
filtered_relationships.to_csv("relationships.csv", index=False)

In [29]:
!wc -l relationships.csv

125813 relationships.csv


## Releases dataset with main IDs

While we're at it, let's apply the same replacement to the releases dataset, which is already in memory.

In [30]:
mask = releases[id_columns].isin(changes_dict.keys()).any(axis=1)
releases.loc[mask, id_columns]

Unnamed: 0,a0_id,a1_id,a2_id,a3_id,a4_id
457,2630975,,,,
813,604380,,,,
900,1875555,,,,
902,2119905,,,,
962,1400830,,,,
...,...,...,...,...,...
2841284,2641987,2641986,2641985,2641984,1679364
2841297,2104904,2104903,2041481,1810635,1141182
2841349,741223,535356,352163,284443,180306
2841525,1181250,688686,469344,333071,137868


In [31]:
# WARNING: ~10m execution time
releases.loc[mask, id_columns] = releases.loc[mask, id_columns].replace(changes_dict)

releases.loc[mask, id_columns]

Unnamed: 0,a0_id,a1_id,a2_id,a3_id,a4_id
457,1245245,,,,
813,284452,,,,
900,1875579,,,,
902,741269,,,,
962,531216,,,,
...,...,...,...,...,...
2841284,2641987,1679364,2641984,2641984,1679364
2841297,1141182,1810635,2041481,1810635,1141182
2841349,741223,203282,158068,190601,180306
2841525,779144,688686,469344,333071,137868


In [32]:
releases[id_columns].isin(changes_dict.keys()).any(axis=1).any()

False

In [33]:
releases.to_csv("releases_no_va_merged.csv", index=False)

In [34]:
!wc -l releases_no_va_merged.csv

2841557 releases_no_va_merged.csv


## Cleanup

In [35]:
engine.dispose()
conn.close()

In [36]:
!service postgresql stop