<a href="https://www.kaggle.com/code/fraserwtt/metal-subgenre-album-art-data?scriptVersionId=135935602" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Metal Album Web Scraper

Using the Spotify API, I should be able to easily get enough data to build a basic classifier for a few subgenres. A few reasons why Spotify was used instead of a more specialist website (e.g. metal-archives):

* Ease of use - Spotify has a good API, and there is a really useful Python wrapper for it.
* Because we're searching by biggest bands, there is an inherent notability filter, so bands who released 1 EP and were never seen again are left out. The bigger bands probably also have the resources to hire a proper artist for their artwork, so there'll be less... "noise" in the dataset. The caveat here is something like black metal, where the lo-fi aesthetic is something we do want to pick up on.
* Could be replicated for other kinds of music. Is there a difference in the art direction of different subgenres of jazz? Between different subgenres (and eras!) of rap?
* A specialist website is more likely to split hairs over crossovers between subgenres. For the purposes of this exercise (and personally, in general), we don't really care that Blind Guardian are influenced by bay-area thrash (and are in danger of being labelled as "Thrash/Power Metal") or that the use of orchestras would classify Fleshgod Apocalypse as Symphonic Death Metal -- we only want a few data classes to work with! _(Edit: I actually found the opposite problem to be an issue - e.g. classic rock being classed as power metal)_

In [1]:
!pip install wget
!pip install spotipy

# Step 1: Import relevant libraries
import os
import re
import wget
import spotipy
from time import sleep
from shutil import rmtree
from pathlib import Path
from typing import List, Tuple
from kaggle_secrets import UserSecretsClient
from spotipy.oauth2 import SpotifyClientCredentials

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=6b63f66d68875bbeb3b7cc2f3c33ad4b1609ebbcc3be11f750793422fcb46649
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Collecting spotipy
  Downloading spotipy-2.23.0-py3-none-any.whl (29 kB)
Collecting redis>=3.5.3
  Downloading redis-4.6.0-py3-none-any.whl (241 kB)
Installing collected packages: redis, spotipy
Successfully installed redis-4.6.0 spotipy-2.23.0

### Heavy Metal Subgenres

In [2]:
# Initialize Spotify API

user_secrets = UserSecretsClient()
client_id = user_secrets.get_secret("SPOTIFY_CLIENT")
client_secret = user_secrets.get_secret("SPOTIFY_SECRET")
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [3]:
# Defining sub genres
genre_list = [
    'power metal'
    , 'black metal'
    , 'thrash metal'
    , 'nu-metal'
]

### Get Band Names

In [4]:
def get_artists_by_genre(genre, limit=50, pages=10):
    results = []

    for i in range(pages):
        results_page = sp.search(q=f'genre:"{genre}"', type='artist', limit=limit, offset=limit * i)
        results += results_page['artists']['items']

    return results
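
The paging loop above always requests every page, even if a genre runs out of artists early. A minimal sketch of the same `offset = limit * i` pattern with an early exit is shown below. It uses a made-up in-memory catalogue instead of the Spotify API so it can run anywhere; `collect_pages`, `fake_fetch` and `fake_catalogue` are hypothetical names, and in the notebook the fetch function would presumably wrap `sp.search(...)['artists']['items']`.

```python
from typing import Callable, Dict, List


def collect_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    limit: int = 50,
    pages: int = 10,
) -> List[Dict]:
    """Collect up to `pages` pages of `limit` items, stopping at a short page."""
    results: List[Dict] = []
    for i in range(pages):
        items = fetch_page(limit, limit * i)  # same offset arithmetic as above
        results += items
        if len(items) < limit:
            # A short (or empty) page means there is nothing left to fetch
            break
    return results


# Stand-in data source (assumption: 120 fake artists, not real API output)
fake_catalogue = [{'name': f'Band {n}'} for n in range(120)]


def fake_fetch(limit: int, offset: int) -> List[Dict]:
    return fake_catalogue[offset:offset + limit]


artists = collect_pages(fake_fetch)
print(len(artists))
```

With the fake catalogue, the loop stops after the third page instead of issuing all ten requests.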

In [5]:
exclusion_lists = {
    'power metal': ['Accept', 'Amon Amarth', 'Children Of Bodom', 'Arch Enemy', 'Amorphis', 'Lordi', 'Equilibrium', 'Machinae Supremacy', 'Metal Church', 'Nevermore', 'X JAPAN', 'Queensrÿche', 'Stryper', 'Mafumafu', 'Dimmu Borgir', 'Imperial Circus Dead Decadence', 'Savatage', 'Saxon', 'Moonspell', "Therion", 'Armored Saint', 'Sanctuary', 'Manilla Road', 'Aldious', 'A Sound of Thunder', 'Asura', 'Bonfire', 'Bruce Dickinson', 'Civil War', 'Dark Sarah', 'Doro', 'Dust in Mind', 'ELFENSJóN', 'Fates Warning', 'Helstar', 'KAMIJO', 'Lizzy Borden', "Mary's Blood", 'Narnia', 'Persuader', 'Rage', 'Quiet Riot', 'The Murder of My Sweet','Unlucky Morpheus', 'U.D.O.', 'Finntroll', 'Elyose', 'Fates Warning', 'Fraser Edwards', 'Heavenly', 'Whitecross', 'Eluveitie'],
    'black metal': ['Vader', 'Dystopia', 'Morbid Angel', 'Deicide', 'Entombed', 'Suffocation', 'Bloodbath', 'Gorguts', 'Bewitcher', 'Hypocrisy', 'Immolation', 'Inferi', 'Cryptopsy', 'Dismember', "Therion", 'Sarcófago', 'Asphyx', 'Unleashed', 'Vital Remains', 'Grave', 'Incantation', 'Sinister', 'Aara', 'Asunojokei', 'Craft', 'Deadlife', 'Grima'],
    'thrash metal': ['Helloween', 'Iced Earth', 'Crimson Glory', 'Jag Panzer', 'Cirith Ungol', 'Vicious Rumors', 'Omen', 'Agent Steel', 'Immortal', 'Bathory', 'Venom', 'Possessed', 'Hellhammer', 'Deströyer 666', 'Wraith', 'Sadistik Exekution', 'Bewitcher', 'Cruel Force', 'Desaster', 'Dark Angel'],
    'nu-metal': ['Lonewolf', 'Begotten', 'Megadeth', 'Slayer', 'Sepultura', 'Suicidal Tendencies', 'Anthrax', 'Cavalera Conspiracy']
}

In [6]:
# Step 3: Create hierarchy for each subgenre

genre_data = {}

try:
    rmtree('data')
except FileNotFoundError:
    pass
path = Path('data')
path.mkdir(exist_ok=True, parents=True)

for subgenre in genre_list:
    output_dir = os.path.join('data', subgenre.replace(' ', '_'))
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

for genre in genre_list:
    results = get_artists_by_genre(genre)
    genre_data[genre] = [artist['name'] for artist in results if artist['name'] not in exclusion_lists.get(genre, [])]

for genre in genre_data:
    print(genre_data[genre])

['Sabaton', 'Powerwolf', 'DragonForce', 'Beast In Black', 'Nightwish', 'Brothers of Metal', 'Amaranthe', 'Alestorm', 'Gloryhammer', 'Wind Rose', 'Battle Beast', 'Pentakill', 'Unleash The Archers', 'Blind Guardian', 'HammerFall', 'Manowar', 'Orden Ogan', 'Helloween', 'Kamelot', 'Epica', 'Bloodbound', 'Falconshield', 'Korpiklaani', 'GALNERYUS', 'Dynazty', 'Avantasia', 'Stratovarius', 'Iced Earth', 'Delain', 'Sonata Arctica', 'Elvenking', 'Turisas', 'Jorn Lande', 'Dream Evil', 'Týr', 'Symphony X', 'Twilight Force', 'Mystic Prophecy', 'Rhapsody', 'Rhapsody Of Fire', 'Arion', 'Follow The Cipher', 'Theocracy', 'Sirenia', 'Masterplan', 'Ensiferum', 'Edguy', 'Tarja', 'Van Canto', 'Visions of Atlantis', 'Xandria', 'Dionysus', 'Freedom Call', 'Metalite', 'Hulkoff', 'Majestica', 'Evergrey', 'Falconer', 'Wintersun', 'Pyramaze', 'At Vance', 'Primal Fear', 'Demons & Wizards', 'Versailles', 'ANGRA', 'Temperance', 'Riot', 'Visigoth', 'Gamma Ray', "Cain's Offering", 'Firewind', 'Warlock', 'Running Wild

In [7]:
# Function for checking there aren't bands listed under >1 genre - if so manually added to exclusion list above.
def find_bands_duplicate_genres(d: dict):
    # Initialize an empty dictionary to store the strings as keys and the keys they belong to as values
    string_dict = {}
    
    # Iterate through the key-value pairs in the input dictionary
    for key, val_list in d.items():
        # Iterate through the strings in the current list
        for val in val_list:
            # If the string is already in the string_dict, append the current key to its value list
            if val in string_dict:
                string_dict[val].append(key)
            # If the string is not in the string_dict, create a new key-value pair with the string as the key and a list containing the current key as the value
            else:
                string_dict[val] = [key]

    # Now iterate through the string_dict to print out the strings that belong to more than one key
    output = []
    for string, keys in string_dict.items():
        if len(keys) > 1:
            output.append(string)
            print(f'The string "{string}" is in the keys: {", ".join(keys)}')
    return output

find_bands_duplicate_genres(d=genre_data)

[]
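
For reference, the same check can be written more compactly with `collections.defaultdict`. This is only a sketch on toy data: `bands_in_multiple_genres` and the band names are made up for illustration. One deliberate difference from the function above is that each genre's list is de-duplicated first, so a band that appears twice within one genre is not reported.

```python
from collections import defaultdict
from typing import Dict, List


def bands_in_multiple_genres(genre_to_bands: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each band listed under more than one genre to those genres."""
    band_to_genres = defaultdict(list)
    for genre, bands in genre_to_bands.items():
        # set() ignores repeats of the same band inside a single genre
        for band in set(bands):
            band_to_genres[band].append(genre)
    return {band: genres for band, genres in band_to_genres.items() if len(genres) > 1}


# Toy input (made-up names, not the scraped data)
toy_genre_data = {
    'power metal': ['Band A', 'Band B'],
    'thrash metal': ['Band B', 'Band C'],
    'black metal': ['Band C', 'Band C'],
}

overlaps = bands_in_multiple_genres(toy_genre_data)
for band, genres in overlaps.items():
    print(f'{band} is listed under: {", ".join(genres)}')
```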

### Get Album Names

In [8]:
def get_albums_of_artist(artist_name: str) -> List[dict]:
    # Initialise the list we're going to return and the list of titles already seen
    output = []
    dedup_list = []
    
    # Fetch the artist's Spotify URI
    results = sp.search(q=f'{artist_name} genre:metal', type='artist', limit=1)
    try:
        artist_uri = results['artists']['items'][0]['uri']
    except IndexError:
        # No matching artist was found
        return []

    # Fetch the artist's albums
    albums = sp.artist_albums(artist_uri, album_type='album', limit=50)

    # Filter the albums and collect their artwork URLs
    for album in albums['items']:
        
        # Strip parenthesised suffixes such as " (Deluxe Edition)"
        album_name = re.sub(r' \([^)]*\)', '', album['name'])
        
        # Skip live albums
        if 'live' in album_name.lower():
            continue

        # Skip titles that contain a title we've already kept
        if any(substring.lower() in album_name.lower() for substring in dedup_list):
            continue

        # Append album name to dedup list as larger bands will often have several versions of same album
        dedup_list.append(album_name)

        # Append to output a dict in the format {'title': album_name, 'url': artwork_url}
        try:
            output.append({'title': album_name, 'url': album['images'][0]['url']})
        except IndexError:
            # This album has no artwork: skip it rather than discarding the whole discography
            continue
    
    return output
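
To see what the title filter keeps and drops without calling the API, here is a small sketch that applies the same three rules (strip parenthesised suffixes, skip anything containing "live", skip titles that contain an already-kept title) to a made-up list. `dedupe_album_titles` and the sample titles are hypothetical.

```python
import re
from typing import Iterable, List


def dedupe_album_titles(titles: Iterable[str]) -> List[str]:
    """Apply the same title rules as get_albums_of_artist to plain strings."""
    kept: List[str] = []
    for raw_title in titles:
        # Drop parenthesised suffixes such as " (Deluxe Edition)"
        title = re.sub(r' \([^)]*\)', '', raw_title)
        lowered = title.lower()
        # Substring test, so it matches "live" anywhere in the title
        if 'live' in lowered:
            continue
        # Skip titles that contain a title we have already kept
        if any(seen.lower() in lowered for seen in kept):
            continue
        kept.append(title)
    return kept


# Made-up titles for illustration only
sample_titles = [
    'First Album',
    'First Album (Deluxe Edition)',
    'Live in Concert',
    'Second Album',
    'Deliverance',
]

print(dedupe_album_titles(sample_titles))
```

Two side effects are worth knowing about. Because `'live'` is matched as a substring, a studio album such as the made-up "Deliverance" is dropped too. And because the parenthesised suffix is stripped before the check, a title like "Some Album (Live)" would not be caught by it.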

In [9]:
def save_image(url:str, filename: str, subgenre:str):
    subdir = f'data/{subgenre}'
    image_path = f"{subdir}/{filename}.jpeg"
    try:
        wget.download(url, image_path)
    except FileNotFoundError:
        sleep(1)
        return


## Loop through functions

In [10]:
!rm -f /kaggle/working/.cache

for genre in genre_data:
    genre_formatted = genre.replace(' ', '_')
    for artist in genre_data[genre]:
        artist_albums = get_albums_of_artist(artist)
        print(f'Saving {len(artist_albums)} albums for {artist}...')
        for album in artist_albums:
            save_image(
                url=album['url']
                , filename=f"{artist} - {album['title']}"
                , subgenre=genre_formatted
            )
        # Want to wait a bit for any rate limiting to calm down
        sleep(0.5)
    sleep(30)

Saving 16 albums for Sabaton...
Saving 18 albums for Powerwolf...
Saving 9 albums for DragonForce...
Saving 3 albums for Beast In Black...
Saving 19 albums for Nightwish...
Saving 2 albums for Brothers of Metal...
Saving 7 albums for Amaranthe...
Saving 8 albums for Alestorm...
Saving 4 albums for Gloryhammer...
Saving 5 albums for Wind Rose...
Saving 6 albums for Battle Beast...
Saving 3 albums for Pentakill...
Saving 5 albums for Unleash The Archers...
Saving 14 albums for Blind Guardian...
Saving 18 albums for HammerFall...
Saving 15 albums for Manowar...
Saving 7 albums for Orden Ogan...
Saving 15 albums for Helloween...
Saving 15 albums for Kamelot...
Saving 14 albums for Epica...
Saving 9 albums for Bloodbound...
Saving 8 albums for Falconshield...
Saving 11 albums for Korpiklaani...
Saving 17 albums for GALNERYUS...
Saving 5 albums for Dynazty...
Saving 11 albums for Avantasia...
Saving 17 albums for Stratovarius...
Saving 20 albums for Iced Earth...
Saving 10 albums for Delain.

### Export

Now that we've scraped the data, we can replace any invalid characters in the file names, then save and export the result as a Kaggle Dataset!

In [11]:
!rm -f /kaggle/working/.cache

import os
import re

# This function converts a file name to lowercase ASCII letters, digits, "." and "_"
def convert_filename(filename):
    # Replace any character that is not alphanumeric or "." (including spaces) with "_"
    new_name = re.sub('[^0-9a-zA-Z.]', '_', filename)
    new_name = new_name.lower()  # make lowercase
    return new_name

# This function goes through all directories and subdirectories bottom-up,
# so the contents of a directory are renamed before the directory itself.
# (Renaming a directory during a top-down walk would stop os.walk descending into it.)
def rename_files_in_dir(dir_path):
    for root, dirs, files in os.walk(dir_path, topdown=False):
        for file in files:
            old_file_path = os.path.join(root, file)
            base, ext = os.path.splitext(file)
            new_file_path = os.path.join(root, convert_filename(base) + ext)
            if old_file_path != new_file_path:
                os.rename(old_file_path, new_file_path)
        for dirname in dirs:
            old_dir_path = os.path.join(root, dirname)
            new_dir_path = os.path.join(root, convert_filename(dirname))
            if old_dir_path != new_dir_path:
                os.rename(old_dir_path, new_dir_path)

# Rename everything under the scraped data directory
rename_files_in_dir('data')
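
One thing to watch with this renaming rule: two different names can collapse to the same converted name, and on Linux (which Kaggle uses) `os.rename` silently replaces an existing file. A minimal sketch of a pre-check is below. `find_name_collisions` and the sample names are made up, and `convert_filename` is repeated here only so the snippet runs on its own.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def convert_filename(filename: str) -> str:
    # Same rule as above: anything except ASCII letters, digits and "." becomes "_"
    return re.sub('[^0-9a-zA-Z.]', '_', filename).lower()


def find_name_collisions(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group original names by converted name and keep groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[convert_filename(name)].append(name)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


# Made-up file names (without extension) for illustration only
sample_names = ['Band A! - Title', 'Band A? - Title', 'Band B - Title']

collisions = find_name_collisions(sample_names)
for new_name, originals in collisions.items():
    print(f'{originals} would all become "{new_name}"')
```

Running a check like this over the file names in each subgenre folder before renaming would show whether any artwork is about to be overwritten.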