<a href="https://www.kaggle.com/code/fraserwtt/metal-subgenre-album-art-data?scriptVersionId=130654813" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Metal Album Web Scraper

Using the Wikipedia API and its relatively well-structured tree, I should be able to easily get enough data to build a basic classifier for various subgenres. A few reasons why Wikipedia was used instead of a more specialist website (e.g. metal-archives):

* Ease of use - Wikipedia has a good API, so it makes sense to use it.
* Because of the nature of Wikipedia moderation, there is an inherent notability filter, so bands who released 1 EP and were never seen again are left out. The bands that remain probably also have the resources to hire a proper artist for their artwork, so there'll be less... "noise" in the dataset. The caveat here is something like black metal, where the lo-fi aesthetic is something we do want to pick up on.
* Could be replicated for other kinds of music. Is there a difference in the art direction of different subgenres of jazz? Between different subgenres (and eras!) of rap?
* A specialist website is more likely to split hairs over crossovers between subgenres. For the purposes of this exercise (and personally, in general), we don't really care that Blind Guardian are influenced by bay-area thrash (and are in danger of being labelled as "Thrash/Power Metal") or that the use of orchestras would classify Fleshgod Apocalypse as Symphonic Death Metal -- we only want a few data classes to work with!

In [1]:
!pip install wget
!pip install spotipy

# Step 1: Import relevant libraries
import os
import re
import wget
import spotipy
from time import sleep
from shutil import rmtree
from pathlib import Path
from typing import List, Tuple
from kaggle_secrets import UserSecretsClient
from spotipy.oauth2 import SpotifyClientCredentials


### Heavy Metal Subgenres

I will find a playlist that has a decent number of bands (e.g. [the Power Metal playlist](https://open.spotify.com/playlist/1cxAqhSGeWZ5vvQAitKpoJ?si=77f825b797cc4c7f) has XXX songs) by browsing Spotify manually, get the playlist ID from the URL, and then use the Spotify API to extract the band IDs for all the bands on it. From here it is pretty simple to scrape the artwork from their discogs page.
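As a rough sketch of the playlist route described above: the helper names `playlist_id_from_url` and `artist_ids_from_playlist` are made up for illustration, and `client` is assumed to be a `spotipy.Spotify` instance (anything exposing compatible `playlist_items()` and `next()` methods would do). It pulls the ID out of the playlist URL, then pages through the playlist collecting each track's artists.

```python
from urllib.parse import urlparse


def playlist_id_from_url(url: str) -> str:
    """Return the ID segment of an open.spotify.com playlist URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /playlist/<id>; the ?si=... query string is ignored
    if len(parts) < 2 or parts[-2] != "playlist":
        raise ValueError(f"Not a playlist URL: {url}")
    return parts[-1]


def artist_ids_from_playlist(client, playlist_id: str) -> dict:
    """Map artist ID -> artist name for every track on the playlist.

    Assumption: `client` behaves like spotipy.Spotify.
    """
    artists = {}
    page = client.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track") or {}  # unavailable tracks can come back as None
            for artist in track.get("artists", []):
                if artist.get("id"):
                    artists[artist["id"]] = artist["name"]
        # Follow the paging cursor until there are no more pages
        page = client.next(page) if page.get("next") else None
    return artists
```

With the client from the next cell this would be called as `artist_ids_from_playlist(sp, playlist_id_from_url(url))`. Keying on the artist ID means a band with several tracks on the playlist only shows up once.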

In [2]:
# Initialize Spotify API

user_secrets = UserSecretsClient()
client_id = user_secrets.get_secret("SPOTIFY_CLIENT")
client_secret = user_secrets.get_secret("SPOTIFY_SECRET")
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [3]:
# Defining sub genres
genre_list = [
    'power metal'
    , 'black metal'
#     , 'death metal'
#     , 'thrash metal'
]

### Get Band Names

In [4]:
def get_artists_by_genre(genre, limit=50, pages=4):
    results = []

    for i in range(pages):
        results_page = sp.search(q=f'genre:"{genre}"', type='artist', limit=limit, offset=limit * i)
        results += results_page['artists']['items']

    return results

In [5]:
exclusion_lists = {
    'power metal': ['Accept', 'Amon Amarth', 'Children Of Bodom', 'Arch Enemy', 'Amorphis', 'Lordi', 'Equilibrium', 'Machinae Supremacy', 'Metal Church', 'Nevermore', 'X JAPAN', 'Queensrÿche', 'Stryper', 'Helloween', 'Mafumafu', 'Dimmu Borgir', 'Imperial Circus Dead Decadence', 'Savatage', 'Saxon', 'Moonspell', "Therion", 'Armored Saint', 'Sanctuary', 'Manilla Road'],
#     'death metal': ['Children of Bodom', 'Municipal Waste', 'DevilDriver', 'Exodus', 'Behemoth', 'Eluveitie', 'Kreator', 'Dimmu Borgir', 'Equilibrium', 'Scar Symmetry', 'Mercyful Fate', 'Æther Realm', 'Darkthrone', 'Dissection', 'Candlemass', 'Venom', 'Mors Principium Est', 'Slayer', 'Imperial Circus Dead Decadence', 'Toxic Holocaust', 'Belphegor', 'Ensiferum', 'Goatwhore', 'Gojira', 'In Flames', 'Children Of Bodom', 'Annihilator', 'Hypocrisy', 'Septicflesh', 'Sepultura', 'Havok', 'Sodom'],
#     'thrash metal': ['W.A.S.P.', 'Accept', 'Dying Fetus', 'Helloween', 'King Diamond', 'Armored Saint', 'Death Angel', 'At The Gates', 'Mercyful Fate', 'Decapitated', 'Iced Earth', 'Immortal', 'Bathory', 'Savatage', 'Venom', 'Nile', 'Metal Church', 'Anvil', 'Six Feet Under', 'Crimson Glory', 'Hellhammer', 'Cryptopsy', 'Immolation', 'Nevermore', 'Vader', 'Bloodbath', 'Possessed', 'Sanctuary', 'Manilla Road', 'Vicious Rumors', 'Sarcófago', 'Asphyx', 'Unleashed', 'Vital Remains', 'Grave', 'Incantation', 'Sinister', 'Deströyer 666', 'Cirith Ungol'],
    'black metal': ['Vader', 'Dystopia', 'Morbid Angel', 'Deicide', 'Entombed', 'Suffocation', 'Bloodbath', 'Gorguts', 'Hypocrisy', 'Immolation', 'Inferi', 'Cryptopsy', 'Dismember', "Therion", 'Sarcófago', 'Asphyx', 'Unleashed', 'Vital Remains', 'Grave', 'Incantation', 'Sinister']
}

In [6]:
# Step 3: Create hierarchy for each subgenre

genre_data = {}

try:
    rmtree('data')
except FileNotFoundError:
    pass
path = Path('data')
path.mkdir(exist_ok=True, parents=True)

for subgenre in genre_list:
    # One output folder per subgenre, e.g. data/power_metal
    os.makedirs(os.path.join('data', subgenre.replace(' ', '_')), exist_ok=True)

for genre in genre_list:
    results = get_artists_by_genre(genre)
    genre_data[genre] = [artist['name'] for artist in results if artist['name'] not in exclusion_lists.get(genre, [])]

# for genre in genre_data:
#     print(genre_data[genre])

In [7]:
# Function for checking there aren't bands listed under >1 genre - if so manually added to exclusion list above.
def find_bands_duplicate_genres(d: dict):
    # Initialize an empty dictionary to store the strings as keys and the keys they belong to as values
    string_dict = {}
    
    # Iterate through the key-value pairs in the input dictionary
    for key, val_list in d.items():
        # Iterate through the strings in the current list
        for val in val_list:
            # If the string is already in the string_dict, append the current key to its value list
            if val in string_dict:
                string_dict[val].append(key)
            # If the string is not in the string_dict, create a new key-value pair with the string as the key and a list containing the current key as the value
            else:
                string_dict[val] = [key]

    # Now iterate through the string_dict to print out the strings that belong to more than one key
    for string, keys in string_dict.items():
        if len(keys) > 1:
            print(f'The string "{string}" is in the keys: {", ".join(keys)}')

find_bands_duplicate_genres(d=genre_data)
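A more compact take on the same check, as a sketch only: it returns the overlaps instead of printing them, which makes it easier to feed them straight into the exclusion lists. The function name `bands_in_multiple_genres` is hypothetical, and the band names in the usage comment are placeholders.

```python
from collections import defaultdict


def bands_in_multiple_genres(genre_to_bands: dict) -> dict:
    """Return {band: [genres]} for every band listed under more than one genre."""
    genres_by_band = defaultdict(list)
    for genre, bands in genre_to_bands.items():
        for band in bands:
            # Search pages can overlap, so ignore repeats within the same genre
            if genre not in genres_by_band[band]:
                genres_by_band[band].append(genre)
    return {band: genres for band, genres in genres_by_band.items() if len(genres) > 1}


# Usage with placeholder data:
# bands_in_multiple_genres({'power metal': ['A', 'B'], 'black metal': ['B', 'C']})
```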

### Get Album Names

Much like the band lists themselves, most discographies are in list form, but some are in table form. Fortunately the "Discography" section has an HTML id, which seems to be followed pretty uniformly by a series of `<ul>` elements (sometimes in a div, sometimes not). I can revisit this later if I don't have enough data, but as the first list always seems to be albums (as opposed to singles, live albums, etc.), I can cut off at that for simplicity. There are other good reasons for this too (your first EP might not have any art at all, for example).

In [8]:
def get_albums_of_artist(artist_name: str) -> List[dict]:
    # Initialise the list we're going to return and the list of titles already seen
    output = []
    dedup_list = []
    
    # Fetch the artist's Spotify URI (top search hit only)
    results = sp.search(q='artist:' + artist_name, type='artist', limit=1)
    try:
        artist_uri = results['artists']['items'][0]['uri']
    except IndexError:
        # No artist found under that name
        return []

    # Fetch the artist's albums
    albums = sp.artist_albums(artist_uri, album_type='album', limit=50)

    for album in albums['items']:
        # Strip bracketed suffixes such as " (Deluxe Edition)"
        album_name = re.sub(r' \([^)]*\)', '', album['name'])
        
        # Skip live albums
        if 'live' in album_name.lower():
            continue

        # Skip reissues whose title contains a title we've already kept
        if any(substring.lower() in album_name.lower() for substring in dedup_list):
            continue

        # Append album name to dedup list as larger bands will often have several versions of same album
        dedup_list.append(album_name)

        # Append to output a dict in the format {'title': album_name, 'url': artwork_url}
        try:
            output.append({'title': album_name, 'url': album['images'][0]['url']})
        except IndexError:
            # This album has no artwork: skip it but keep the rest
            continue
    
    return output
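The title filtering can be pulled out and tried on its own without touching the API. The sketch below is a standalone approximation, not the function above: the name `dedupe_album_titles` and the sample titles are made up, and it matches "live" as a whole word (`\blive\b`) rather than as a plain substring, so a title like "Deliverance" would not be dropped.

```python
import re


def dedupe_album_titles(titles):
    """Keep the first version of each studio album title, dropping live albums."""
    kept = []
    for raw in titles:
        # Strip bracketed suffixes such as " (Deluxe Edition)"
        title = re.sub(r" \([^)]*\)", "", raw)
        lowered = title.lower()
        # Whole-word match; an assumption that differs from a plain substring test
        if re.search(r"\blive\b", lowered):
            continue
        # Drop titles that contain a title we have already kept
        if any(seen.lower() in lowered for seen in kept):
            continue
        kept.append(title)
    return kept


# Made-up titles for illustration
sample = [
    "First Album",
    "First Album (Deluxe Edition)",
    "Live at the Arena",
    "Deliverance",
    "First Album - Remastered",
]
print(dedupe_album_titles(sample))
```

The containment check depends on ordering: it only catches a reissue if the shorter, original title was seen first.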

In [9]:
def save_image(url: str, filename: str, subgenre: str):
    subdir = f'data/{subgenre}'
    image_path = f"{subdir}/{filename}.jpeg"
    try:
        wget.download(url, image_path)
    except FileNotFoundError:
        # Most likely an invalid target path (e.g. a "/" in the band or album name): skip this image
        sleep(1)
        return


### Loop through functions

In [10]:
!rm -f /kaggle/working/.cache

for genre in genre_data:
    genre_formatted = genre.replace(' ', '_')
    for artist in genre_data[genre]:
        artist_albums = get_albums_of_artist(artist)
        print(f'Saving {len(artist_albums)} albums for {artist}...')
        for album in artist_albums:
            save_image(
                url=album['url']
                , filename=f"{artist} - {album['title']}"
                , subgenre=genre_formatted
            )
        # Want to wait a bit for any rate limiting to calm down
        sleep(1)
    sleep(30)

Saving 16 albums for Sabaton...
Saving 18 albums for Powerwolf...
Saving 11 albums for DragonForce...
Saving 3 albums for Beast In Black...
Saving 21 albums for Nightwish...
Saving 2 albums for Brothers of Metal...
Saving 8 albums for Alestorm...
Saving 7 albums for Amaranthe...
Saving 5 albums for Wind Rose...
Saving 3 albums for Gloryhammer...
Saving 3 albums for Pentakill...
Saving 7 albums for Battle Beast...
Saving 18 albums for HammerFall...
Saving 5 albums for Unleash The Archers...
Saving 14 albums for Blind Guardian...
Saving 8 albums for Eluveitie...
Saving 15 albums for Manowar...
Saving 7 albums for Orden Ogan...
Saving 19 albums for Kamelot...
Saving 14 albums for Epica...
Saving 8 albums for Falconshield...
Saving 9 albums for Bloodbound...
Saving 12 albums for Korpiklaani...
Saving 18 albums for GALNERYUS...
Saving 8 albums for Dynazty...
Saving 12 albums for Avantasia...
Saving 18 albums for Stratovarius...
Saving 12 albums for Elvenking...
Saving 20 albums for Iced Ear
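The fixed `sleep` calls in the loop above are a blunt way of dealing with rate limiting. One alternative, sketched here under assumptions, is to retry a failing call with exponentially growing pauses. `call_with_backoff` is a hypothetical helper, not part of spotipy; in practice `retry_on` would probably be narrowed to something like `spotipy.exceptions.SpotifyException` rather than every exception.

```python
import time


def call_with_backoff(fn, *args, retries=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                # Out of retries: let the last error propagate
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)


# Possible usage in the loop above (assumes get_albums_of_artist is defined):
# artist_albums = call_with_backoff(get_albums_of_artist, artist)
```

The `sleep` parameter is only there so the waiting can be swapped out when trying the helper without real delays.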

### Export

Now that the data has been scraped, the last step is to replace any invalid characters in the file names, then save and export everything as a Kaggle Dataset!

In [11]:
!rm -f /kaggle/working/.cache

import os
import re

# Convert a file or directory name to lowercase alphanumerics, "." and "_"
def convert_filename(filename):
    # Replace any character that is not alphanumeric or "." (spaces included) with "_"
    new_name = re.sub('[^0-9a-zA-Z.]', '_', filename)
    return new_name.lower()

# Recursively rename every file and directory under dir_path
def rename_files_in_dir(dir_path):
    # Walk bottom-up so a directory's contents are handled before the directory itself is renamed
    for root, dirs, files in os.walk(dir_path, topdown=False):
        for file in files:
            old_file_path = os.path.join(root, file)
            base, ext = os.path.splitext(file)
            new_file_path = os.path.join(root, convert_filename(base) + ext)
            if old_file_path != new_file_path:
                os.rename(old_file_path, new_file_path)
        for dir_name in dirs:
            old_dir_path = os.path.join(root, dir_name)
            new_dir_path = os.path.join(root, convert_filename(dir_name))
            if old_dir_path != new_dir_path:
                os.rename(old_dir_path, new_dir_path)

# Run the function on the scraped data directory
rename_files_in_dir('data')