# Metal Album Web Scraper

Using the Wikipedia API and its relatively well-structured page tree, I should be able to gather enough data to build a basic classifier for various subgenres. A few reasons for using Wikipedia rather than a more specialist website (e.g. metal-archives):

* Ease of use - Wikipedia has a good API, so it makes sense to use it.
* Because of the nature of Wikipedia moderation, there is an inherent notability filter, so bands who released one EP and were never seen again are excluded. The bands that remain probably also have the resources to hire a proper artist for their artwork, so there'll be less... "noise" in the dataset. The caveat here is something like black metal, where the lo-fi aesthetic is something we do want to pick up on.
* Could be replicated for other kinds of music. Is there a difference in the art direction of different subgenres of jazz? Between different subgenres (and eras!) of rap?
* A specialist website is more likely to split hairs over crossovers between subgenres. For the purposes of this exercise (and personally, in general), we don't really care that Blind Guardian are influenced by Bay Area thrash (and are in danger of being labelled as "Thrash/Power Metal") or that the use of orchestras would classify Fleshgod Apocalypse as Symphonic Death Metal -- we only want a few data classes to work with!

In [413]:
# Step 1: Import relevant libraries
import os
import re
import json
import wget
from PIL import Image
from urllib import request, error
from bs4 import BeautifulSoup
from typing import List

### Heavy Metal Subgenres

There are enough Death and Black Metal bands that Wikipedia has split each list across two pages. This is nice, as it lets us work with imbalanced classes when our classifier is learning the artwork. It is also annoying, as it means there's a whole different page layout to scrape.

I would imagine that there will be some difficulty differentiating between Thrash and Death metal, given that the latter was an outgrowth of the former and there is likely to be a lot of overlap in the early '90s.

In [414]:

# Step 2: Create a list of URLs to cycle through
url_list = [
    'https://en.wikipedia.org/wiki/List_of_black_metal_bands,_0-K',
    'https://en.wikipedia.org/wiki/List_of_black_metal_bands,_L-Z',
    # 'https://en.wikipedia.org/wiki/List_of_thrash_metal_bands',
    'https://en.wikipedia.org/wiki/List_of_power_metal_bands',
    # 'https://en.wikipedia.org/wiki/List_of_death_metal_bands,_!-K',
    # 'https://en.wikipedia.org/wiki/List_of_death_metal_bands,_L-Z',
]

In [415]:
exclusion_lists = {
    'List_of_black_metal_bands,_0-K': [
        '/wiki/Equilibrium_(band)', '/wiki/Falconer_(band)', '/wiki/Finntroll', '/wiki/Sepultura', '/wiki/Skeletonwitch', '/wiki/Sodom_(band)', '/wiki/Sunn_O)))', '/wiki/Wintersun'
    ],
    'List_of_black_metal_bands,_L-Z': [
        '/wiki/Sepultura', '/wiki/Skeletonwitch', '/wiki/Sodom_(band)', '/wiki/Sunn_O)))', '/wiki/Wintersun'
    ],
    'List_of_thrash_metal_bands': [
        '/wiki/3_Inches_of_Blood', '/wiki/Black_Tide', '/wiki/Bullet_for_My_Valentine', '/wiki/Celtic_Frost', '/wiki/Children_of_Bodom', '/wiki/God_Forbid',
        '/wiki/The_Haunted_(Swedish_band)', '/wiki/Hellhammer', '/wiki/Iced_Earth', '/wiki/Lamb_of_God_(band)', '/wiki/Machine_Head_(band)',
        '/wiki/Nevermore', '/wiki/Pantera', '/wiki/Soulfly', '/wiki/Sepultura', '/wiki/Sylosis', '/wiki/Trivium_(band)'
    ],
    'List_of_death_metal_bands,_!-K': [
        '/wiki/Akercocke', '/wiki/As_I_Lay_Dying_(band)', '/wiki/Battlelore', '/wiki/Behold..._The_Arctopus', '/wiki/Between_the_Buried_and_Me',
        '/wiki/Celtic_Frost', '/wiki/God_Forbid', '/wiki/Born_of_Osiris', '/wiki/Children_of_Bodom',
        '/wiki/Cradle_of_Filth', '/wiki/Cynic_(band)', '/wiki/Darkthrone', '/wiki/Hellhammer', '/wiki/Fear_Factory',
        '/wiki/Ensiferum', '/wiki/Divine_Heresy', '/wiki/DevilDriver', '/wiki/Sepultura', '/wiki/Sylosis', '/wiki/Trivium_(band)'
    ],
    'List_of_death_metal_bands,_L-Z': [
        '/wiki/Lamb_of_God_(band)', '/wiki/Meshuggah', '/wiki/Mr._Bungle', '/wiki/My_Dying_Bride', '/wiki/Paradise_Lost_(band)',
        '/wiki/Scar_Symmetry', '/wiki/Sepultura', '/wiki/Soilwork', '/wiki/Sonic_Syndicate',
        '/wiki/Soulfly', '/wiki/Cynic_(band)', '/wiki/Strapping_Young_Lad', '/wiki/Sylosis'
    ],
    'List_of_power_metal_bands': [
        '/wiki/Children_of_Bodom', '/wiki/Iron_Maiden', '/wiki/Judas_Priest', '/wiki/Machinae_Supremacy', '/wiki/Rainbow_(rock_band)',
        '/wiki/Scorpions_(band)', '/wiki/Trauma_(American_band)', '/wiki/X_Japan'
    ]
}

In [416]:
# Step 3: Create subdirectory hierarchy

parent_output_dir = 'data'
os.makedirs(parent_output_dir, exist_ok=True)

for url in url_list:
    # Extract the subgenre using regex
    subgenre = re.search(r'List_of_(.+)_bands', url.split('/')[-1]).group(1)

    # Create a subfolder for each subgenre
    output_dir = os.path.join(parent_output_dir, subgenre)
    os.makedirs(output_dir, exist_ok=True)
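
As a quick sanity check on the regex used above, the sketch below pulls the subgenre out of each URL in isolation. The helper name `subgenre_from_url` is made up for illustration, and returning `None` for a non-matching URL is an assumption (the cell above would raise instead). Note that both halves of a split list map to the same folder name.

```python
import re
from typing import Optional


def subgenre_from_url(url: str) -> Optional[str]:
    """Return the subgenre embedded in a 'List_of_<subgenre>_bands' page title."""
    page_title = url.rsplit("/", 1)[-1]
    match = re.search(r"List_of_(.+)_bands", page_title)
    # Assumption: return None instead of raising when the URL is not a band list
    return match.group(1) if match else None


for example_url in [
    "https://en.wikipedia.org/wiki/List_of_black_metal_bands,_0-K",
    "https://en.wikipedia.org/wiki/List_of_black_metal_bands,_L-Z",
    "https://en.wikipedia.org/wiki/List_of_power_metal_bands",
]:
    print(subgenre_from_url(example_url))
```

The two black metal pages both yield `black_metal`, so their images end up in the same `data/black_metal` folder.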

### Get Band Names

Two page formats have to be handled. The first is a table, which takes a few nested loops to navigate, which is annoying. The second is simpler: a series of h2 or h3 elements (depending on the page!), each immediately followed by a ul element.
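
To make the table format concrete, here is a minimal sketch run on a hand-written HTML fragment. It uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup. The class name `FirstCellLinkParser` and the sample markup are made up for illustration, and unlike the real scraper it does not check for the `wikitable` class.

```python
from html.parser import HTMLParser
from typing import List


class FirstCellLinkParser(HTMLParser):
    """Collect the href of the first link in the first <td> of each table row."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []
        self._cell_index = -1    # position of the current <td> within its row
        self._in_cell = False    # True while inside a <td>
        self._row_done = False   # True once the row's first link has been seen

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1
            self._row_done = False
        elif tag == "td":
            self._cell_index += 1
            self._in_cell = True
        elif tag == "a" and self._in_cell and self._cell_index == 0 and not self._row_done:
            self._row_done = True
            attr_map = dict(attrs)
            # Only keep links that carry a title attribute, as the scraper does
            if "title" in attr_map and "href" in attr_map:
                self.hrefs.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


# Hand-written sample resembling one of the band-list tables (illustrative only)
SAMPLE_TABLE = """
<table class="wikitable">
<tr><th>Band</th><th>Country</th></tr>
<tr><td><a href="/wiki/1349_(band)" title="1349 (band)">1349</a></td>
    <td><a href="/wiki/Norway" title="Norway">Norway</a></td></tr>
<tr><td><a href="/wiki/Abbath" title="Abbath">Abbath</a></td><td>Norway</td></tr>
</table>
"""

parser = FirstCellLinkParser()
parser.feed(SAMPLE_TABLE)
print(parser.hrefs)
```

The header row contributes nothing because it has no `<td>`, and the country link is skipped because it sits in the second cell.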

In [417]:
def get_band_names(url: str) -> List[str]:

    html = request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table', {"class": "wikitable"})

    band_names = []
    
    # There seem to be two formats for these lists, one in table form and the other in list form.
    # Here is the code for the table format:
    for table in tables:
        for row in table.find_all('tr'):
            if first_td := row.find('td'):
                a_element = first_td.find('a')
                if a_element and a_element.has_attr('title'):
                    band_names.append(a_element['href'])

    # If the wiki page is in list form, band_names will still be empty at this point
    if not band_names:
        lists = soup.find_all('table', {"class": "multicol", "role": "presentation"})[0]
        for header in lists.findChildren(["h2", "h3"], recursive = True):
            band_names.extend([
                a['href'] for a in header.find_all_next('ul')[0].findChildren(['a']) if a and a.has_attr('title')
            ])

    return band_names


### Get Album Names

Much like the band lists themselves, most discographies are in list form, but some are in table form. Fortunately the "Discography" section has an HTML id, which seems to be followed pretty uniformly by a series of `<ul>` elements (sometimes in a div, sometimes not). I can revisit this later if I don't have enough data, but as the first list always seems to be albums (as opposed to singles, live albums etc), I can cut off there for simplicity. There are other good reasons for this too: a band's first EP might not have any art at all, for example.

In [418]:
def get_album_names(band: str) -> List[str]:
    band_discography = []

    try:
        html = request.urlopen(f"https://en.wikipedia.org{band}")
        soup = BeautifulSoup(html, 'html.parser')
    except error.HTTPError:
        return band_discography
    
    # Locate the "Discography" heading anchor; give up if the page has none
    heading = soup.find('span', {"id": "Discography"})
    if heading is None:
        return band_discography

    # The first <ul> after the heading is assumed to be the album list
    album_ul = heading.find_next("ul")
    if album_ul is None:
        return band_discography

    band_discography.extend([
        a['href'] for a in album_ul.find_all('a') if a.has_attr('title')
    ])

    return band_discography


### Get Album Artwork

In [419]:
def get_featured_image_url(wikipedia_url: str) -> str:
    page_title = wikipedia_url.split("/")[-1]
    WIKI_REQUEST = f'https://en.wikipedia.org/w/api.php?action=query&prop=images&titles={page_title}&format=json'

    try:
        response = request.urlopen(WIKI_REQUEST)
        data = json.loads(response.read().decode())
    except (error.HTTPError, ValueError):
        return 'None'

    if 'query' not in data:
        return 'None'

    page_id = list(data['query']['pages'].keys())[0]
    page_data = data['query']['pages'][page_id]

    if 'images' in page_data:
        image_filenames = [img['title'] for img in page_data['images'] if not img['title'].startswith('File:Wiki')]

        if image_filenames:
            image_title = image_filenames[0].replace(' ', '_')
            image_info_request = f'https://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url&titles={image_title}&format=json'

            try:
                img = request.urlopen(image_info_request)
                response = img.read()
                image_data = json.loads(response.decode())

            except (error.HTTPError, ValueError):
                return 'None'

            # Use .get throughout so a response without 'query' or 'pages' does not raise
            pages = image_data.get('query', {}).get('pages', {})
            if not pages:
                return 'None'
            image_info = next(iter(pages.values())).get('imageinfo')

            if not image_info:
                return 'None'

            image_url = image_info[0]['url']

            return image_url
    return 'None'


def save_image(url:str, subgenre:str):

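    def clean_non_album_images(path: str) -> bool:
        # Open the image just long enough to read its dimensions, so the file
        # handle is closed before we try to delete the file.
        with Image.open(path) as img:
            width, height = img.size

        # Album covers are (roughly) square; delete anything that is not.
        if not 0.9 <= width / height <= 1.1:
            os.remove(path)
            return True
        return False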

    try:
        subdir = f'data/{subgenre}'
        image_path = f"{subdir}/{url.split('/')[-1]}"
        wget.download(url, image_path)
        
        if not clean_non_album_images(path=image_path):
            print(f'Image saved to {image_path}.')

    except error.HTTPError:
        pass
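
The API requests above are built by dropping the page title straight into an f-string. A title containing a character such as `&` would then be read as the start of a new query parameter. A possible alternative, sketched here as an assumption rather than what the notebook does, is to let `urllib.parse.urlencode` handle the escaping. The names `API_ENDPOINT` and `build_images_query` are made up for illustration.

```python
from urllib.parse import unquote, urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_images_query(wikipedia_url: str) -> str:
    """Build a prop=images API request for the page behind a /wiki/... link."""
    # A scraped href may already be percent-encoded, so decode it first to
    # avoid encoding it twice.
    page_title = unquote(wikipedia_url.rsplit("/", 1)[-1])
    params = {
        "action": "query",
        "prop": "images",
        "titles": page_title,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_images_query("/wiki/Sunn_O)))"))
```

The parentheses in the title come out as `%29`, and the rest of the query string is unchanged from the hand-built version.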


### Loop through functions

In [420]:


def check_for_junk_data(url:str, excl:List) -> bool:
    return any(s in url for s in excl)

junk = [
    '12in-Vinyl-LP-Record-Angle',
    '45_record',
    '45rpm',
    'Logo_in_use',
    '12_Inch_Single_BBQ_Band',
    '1966_The_Supremes',
    'Alternative_Tentacles_Logo',
    'Antestor-BoE-2012-2',
    'INDIE_LOGO_WHITE_BOX_outline',
    'Gummo_Album'
]
saved_albums = []

for genre_list in url_list:
    for band in get_band_names(url=genre_list):

        if band in exclusion_lists[genre_list.split('/')[-1]]:
            continue

        for album in get_album_names(band=band):
            album_url = get_featured_image_url(wikipedia_url=album)

            if album_url == "None" or album_url.lower()[-3:] not in ["jpg", "png"]:
                continue
            
            if check_for_junk_data(url=album_url, excl=junk):
                continue
            
            # Deduping
            if check_for_junk_data(url=album_url, excl=saved_albums):
                continue
            saved_albums.append(album_url.split('/')[-1])

            subgenre = re.search(r'List_of_(.+)_bands', genre_list.split('/')[-1]).group(1)
            save_image(url=album_url, subgenre=subgenre)

Image saved to data/black_metal/1349_-_Liberation.jpg.
Image saved to data/black_metal/1349_Beyond_The_Apocalypse.jpg.
Image saved to data/black_metal/1349_-_BlackFlame.jpg.
Image saved to data/black_metal/1349_-_Demonoir.jpg.
Image saved to data/black_metal/1349Cauldron.png.
Image saved to data/black_metal/AbbathCover.png.
Image saved to data/black_metal/Abigail_Williams_-_In_the_Shadow_of_a_Thousand_Suns.jpg.


KeyboardInterrupt: 
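
One thing to watch in the loop above: the de-duplication step reuses the substring check, so a saved filename like `Cover.jpg` would also block any later URL that merely ends in `AlbumCover.jpg`. A minimal sketch of an exact-match alternative using a set is below. The function name `dedupe_by_filename` and the `example.org` URLs are made up for illustration.

```python
from typing import Iterable, List, Set


def dedupe_by_filename(image_urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each exact filename, preserving order."""
    seen: Set[str] = set()
    unique: List[str] = []
    for url in image_urls:
        filename = url.rsplit("/", 1)[-1]
        if filename in seen:
            continue
        seen.add(filename)
        unique.append(url)
    return unique


# Placeholder URLs, not real image locations
candidates = [
    "https://example.org/a/Cover.jpg",
    "https://example.org/b/AlbumCover.jpg",  # a substring check would drop this
    "https://example.org/c/Cover.jpg",       # true duplicate filename
]
print(dedupe_by_filename(candidates))
```

Only the third URL is dropped, since it is the only one whose filename exactly matches an earlier one.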

### Data Cleaning

There are probably a few dud images in there, and I will have a quick look through, but an easy win would be to get rid of anything that isn't album art. It makes sense to get rid of anything that isn't (roughly) square.
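
The squareness test can be sketched as a pure function on the image dimensions, which makes it easy to check without touching any files. The name `is_roughly_square` and the `tolerance` parameter are made up for illustration; the default of 0.1 mirrors the 0.9 to 1.1 aspect-ratio window used in `save_image`.

```python
def is_roughly_square(width: int, height: int, tolerance: float = 0.1) -> bool:
    """Return True if the width/height ratio is within `tolerance` of 1."""
    # Assumption: treat degenerate dimensions as "not album art"
    if width <= 0 or height <= 0:
        return False
    return 1 - tolerance <= width / height <= 1 + tolerance


print(is_roughly_square(500, 500))   # a typical album cover
print(is_roughly_square(320, 300))   # slightly off-square scan
print(is_roughly_square(1000, 600))  # a band photo or banner
```

With Pillow, the dimensions would come from `Image.open(path).size`, as in the `save_image` helper.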