# Data Collection: Scrape the Met Website's Object Metadata & Images

The Met dataset currently available on Kaggle (https://www.kaggle.com/datasets/metmuseum/the-metropolitan-museum-of-art-open-access) includes over 400K rows of data conducive to meaningful analysis and ML modeling.

The Met also now offers an API for its collection data (https://metmuseum.github.io/). Let's compare the data from scraping with the data from the API.

However, the Kaggle and API data are not exactly the same as what is offered directly on the Met's website, which suggests that the client-facing data comes from a non-public API or backend. Since the Met's website contains the most complete and error-free information about its collection, the scraper created below is designed to acquire this data for comparison with the Kaggle and free API data.
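For the API side of that comparison, a minimal sketch like the one below could retrieve a single object record. It assumes the object endpoint described in the Met's API documentation; the helper names `build_object_url` and `fetch_object` are hypothetical and are not used elsewhere in this notebook.

```python
import json
from urllib.request import urlopen

# Base URL of the Met's public collection API (see https://metmuseum.github.io/)
API_BASE = "https://collectionapi.metmuseum.org/public/collection/v1"


def build_object_url(object_id):
    """Return the API URL for a single object record (hypothetical helper)."""
    return f"{API_BASE}/objects/{int(object_id)}"


def fetch_object(object_id, timeout=30):
    """Download one object record and return it as a dict (hypothetical helper)."""
    with urlopen(build_object_url(object_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_object(object_id)` with an ID taken from the objects table should return a dictionary of metadata fields that can be set side by side with the scraped HTML for the same object.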

One important consideration when scraping any website is copyright. The Met states that the data from its public API is available for public use without fees, but content scraped directly from the Met website does not necessarily carry the same protections, as the site also features images that are still under copyright.

Therefore, while data scraped from the Met website can offer a wealth of insights, the scraper below is intended for educational purposes only, not for commercial use.

In [18]:
import os
import pandas as pd
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
import time

In [19]:
# Scrape HTML from URL based on selectors

import asyncio

async def get_html(url, *selectors, sleep=5, retries=3):
    html = None

    # Allow up to `retries` attempts
    for i in range(1, retries + 1):

        # Wait longer before each attempt to avoid overloading the server
        # (asyncio.sleep does not block the event loop the way time.sleep does)
        await asyncio.sleep(sleep * i)

        try:
            async with async_playwright() as p:
                browser = await p.firefox.launch()
                page = await browser.new_page()
                await page.goto(url)

                # Print page title as confirmation
                title = await page.title()
                print(title)

                # Iterate over *selectors and collect the scraped HTML
                parts = []
                for s in selectors:
                    parts.append(await page.inner_html(s))

                # Join the pieces only after every selector has succeeded
                html = '\n'.join(parts)

                await browser.close()

        # Catch timeout errors and try again
        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue

        else:
            break

    return html

### Add \*args for selectors as necessary

The \*selectors parameter in the *get_html()* function collects any number of positional arguments into a tuple. It accepts as few or as many selectors as are passed whenever the function is called.

This example uses two (2) selector arguments:
1. '#artwork-section': top-level details (Title, Date, Nationality, 'On View') and image data
2. '#details #overview': object details (Acq. Date, Medium, Credit, Artist, Country)

These selectors can be swapped for any others, and in any number, within reason (please respect the server you're scraping).
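As a small illustration of how \*args behaves, the sketch below uses a hypothetical stand-in function, `collect_selectors()`, that has the same signature shape as *get_html()* but only reports what it received instead of launching a browser.

```python
def collect_selectors(url, *selectors, sleep=5):
    """Hypothetical stand-in for get_html(): report the arguments it received."""
    # Inside the function, `selectors` is a tuple holding every extra positional argument
    return {"url": url, "selectors": selectors, "sleep": sleep}


# One selector
one = collect_selectors("https://example.org", "#artwork-section")

# Two selectors; keyword arguments such as sleep must be passed by name
two = collect_selectors("https://example.org", "#artwork-section", "#details #overview", sleep=2)

# A list of selectors can be unpacked into separate arguments with *
my_selectors = ["#artwork-section", "#details #overview"]
unpacked = collect_selectors("https://example.org", *my_selectors)

print(one["selectors"])
print(two["selectors"], two["sleep"])
print(unpacked["selectors"])
```

Because `sleep` and `retries` come after \*selectors in the signature of *get_html()*, they are keyword-only and can never be mistaken for a selector.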

In [20]:
# Handle scraping requests for each URL in list

async def scrape_data(links, save_paths, DIR, *selectors):

    # Iterate over the URLs and their matching file names in parallel
    for url, file_name in zip(links, save_paths):

        # Generate save path from directory
        save_path = os.path.join(DIR, file_name)

        # Skip pages that were already scraped in a previous run
        if os.path.exists(save_path):
            continue

        # Get HTML from URL based on specified selectors
        html = await get_html(url, *selectors)

        # Skip pages that returned nothing (e.g. every attempt timed out)
        if not html:
            continue

        # Save scraped HTML to file
        with open(save_path, "w", encoding="utf-8") as f:
            f.write(html)

In [22]:
# Main function

async def main():
    
    # Create directory for the scraped data
    OBJECTS_DIR = "data/met/scrapes/objects"
    os.makedirs(OBJECTS_DIR, exist_ok=True)
    
    # Import existing objects table to extract links and object ids
    met_objects = pd.read_csv('met_objects.csv')
    
    # Create list of URLs to objects
    met_links = met_objects['Link_Resource']

    # Create list of HTML paths to save scraped HTML content (one path per object)
    # regex=False treats " " and "." as literal characters in every pandas version
    met_names = met_objects['Object_Name'].str.replace(" ", "-", regex=False).str.lower()
    met_numbers = met_objects['Object_Number'].str.replace(".", "-", regex=False)
    met_paths = "met_" + met_names + "_" + met_numbers + ".html"
    
    # Scrape data
    await scrape_data(met_links, met_paths, OBJECTS_DIR, '#artwork-section', '#details #overview')

In [None]:
# Run program

if __name__ == '__main__':
    await main()

## Test: Scrape a Single Page

The test case below calls *get_html(url, \*selectors)* for a single URL and saves the returned content to a file.

In [None]:
# Scrape test page

# Create directory for the scraped data
OBJECTS_DIR = "data/met/scrapes/objects"
os.makedirs(OBJECTS_DIR, exist_ok=True)

# Import existing objects table to extract links and object ids
met_objects = pd.read_csv('met_objects.csv')

# Create list of URLs to objects
met_links = met_objects['Link_Resource']

# Scrape a single object page
html = await get_html(met_links[1789], '#artwork-section', '#details #overview')

# Save the result only if the scrape returned content (get_html returns None on failure)
if html:
    with open(os.path.join(OBJECTS_DIR, "test.html"), "w", encoding="utf-8") as f:
        f.write(html)