# Get referenced art

We explore the Mondriaan letters for references to artworks in order
to look them up at the
[RKD = Netherlands Institute for Art History](https://rkd.nl/en/about-the-rkd/organisation).

The RKD hosts images under the URL `https://rkd.nl/explore/images/`.

If you add an artwork *key* to that URL you get to the landing page of an image,
from which metadata can be obtained, as well as thumbnails.
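
As a minimal sketch of how such a landing page URL is composed: the helper name `landingUrl` is hypothetical and only serves this illustration; the base URL is the one given above, and the key is one that occurs in the corpus.

```python
RKD_URL_BASE = "https://rkd.nl/explore/images/"


def landingUrl(key):
    # Strip a trailing slash from the base so that we do not end up with "//"
    return f"{RKD_URL_BASE.rstrip('/')}/{key}"


print(landingUrl("277201"))
```

This prints `https://rkd.nl/explore/images/277201`.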

The Mondriaan - Letters dataset contains nodes of type `rs`, and some of them
contain such an artwork key.

We will gather the keys from the corpus, visit the landing pages, store them locally,
scrape them for metadata and thumbnail URLs, and download the thumbnails.

We store the thumbnails in the `source` directory in this repo, together with a YAML
file that stores the metadata for all these thumbnails.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import collections
import re
import urllib.request as ur
from urllib.error import HTTPError
import yaml
from lxml.html import parse as htmlParse
from lxml import etree

In [3]:
# work around certificate errors on https URLs by disabling certificate verification

import ssl

ssl._create_default_https_context = ssl._create_unverified_context

In [4]:
from tf.app import use
from tf.core.files import expanduser, initTree

# Corpus search

We load the corpus and walk through it in search of any node that has
a non-empty `key` feature and a type with the string `art` in it.

In [5]:
A = use("annotation/mondriaan:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

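| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 1 | 13761.0 | 100 |
| letter | 14 | 982.93 | 100 |
| body | 14 | 849.93 | 86 |
| text | 14 | 849.93 | 86 |
| chunk | 86 | 160.0 | 100 |
| div | 93 | 219.99 | 149 |
| teiHeader | 14 | 124.57 | 13 |
| p | 95 | 73.39 | 51 |
| postscript | 6 | 62.83 | 3 |
| revisionDesc | 14 | 61.0 | 6 |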


In [6]:
artworks = collections.defaultdict(list)
artkeys = {}

for n in N.walk():
    nType = F.otype.v(n)
    typ = F.type.v(n) or ""
    if "art" in typ:
        key = F.key.v(n)
        if key:
            artworks[typ].append((key, nType, n))
            if key.replace("0", ""):
                artkeys[key] = n

In [7]:
for (typ, works) in artworks.items():
    print(f"{typ}: {len(works)} references")
    for (key, nType, n) in works:
        print(f"\t{nType} node {n} has key {key}")

artwork-m: 11 references
	rs node 15163 has key 277201
	rs node 15164 has key 68554
	rs node 15215 has key 62319
	rs node 15216 has key 68733
	rs node 15222 has key 277201
	rs node 15241 has key 000000
	rs node 15242 has key 000000
	rs node 15243 has key 000000
	rs node 15284 has key 68728
	rs node 15305 has key 268864
	rs node 15312 has key 268864


It appears that those nodes are all `rs` nodes (`<rs>` elements of TEI).

Note that there are a few null keys (`000000`). They do not point to an artwork,
so we skip them.
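
The test for a null key in the walk above is `key.replace("0", "")`: the result is empty, hence falsy, exactly when the key consists of zeros only. A small illustration with keys taken from the listing; the helper name `isNullKey` is made up for this sketch.

```python
def isNullKey(key):
    # A key is null when nothing is left after removing all zeros
    return not key.replace("0", "")


for key in ("000000", "68554", "277201"):
    print(key, "null" if isNullKey(key) else "real")
```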

In [8]:
print(f"{len(artkeys)} distinct keys:")
for (key, n) in sorted(artkeys.items()):
    print(f"key {key:<6} in node {n}")

6 distinct keys:
key 268864 in node 15312
key 277201 in node 15222
key 62319  in node 15215
key 68554  in node 15164
key 68728  in node 15284
key 68733  in node 15216


That leaves us with 6 distinct artworks referenced in the corpus.

# Retrieving landing pages

First we set up the locations where we'll save information.

In [9]:
aContext = A.context
ORG = aContext.org
REPO = aContext.repo
GRAPHICS_RELATIVE = aContext.provenanceSpec["graphicsRelative"]
REPO_DIR = f"{expanduser('~')}/{A.backend}/{ORG}/{REPO}"
LOCAL_DIR = f"{REPO_DIR}/_local"
GRAPHICS_DIR = f"{REPO_DIR}/{GRAPHICS_RELATIVE}"
RKD_DIR = f"{LOCAL_DIR}/RKD"
RKD_RAW = f"{RKD_DIR}/raw"
RKD_SPECS = f"{GRAPHICS_DIR}/images.yaml"

RKD_URL_BASE = "https://rkd.nl/explore/images/"

initTree(RKD_RAW)
initTree(GRAPHICS_DIR)

We define a generic function to visit a url and retrieve the content behind it.

In [10]:
def retrieve(url, binary=False):
    response = None
    msg = None
    
    try:
        response = ur.urlopen(url)
        status = response.status
    except HTTPError as e:
        # keep the real HTTP status (e.g. 404) instead of a generic 500
        status = e.code
        msg = str(e)
        
    if status != 200:
        result = (False, status, msg or "ERROR")
    else:
        data = response.read()
        if not binary:
            data = data.decode("utf-8")
        result = (True, status, data)
    return result
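
A possible variant, sketched under assumptions: `urllib` raises `URLError` (the parent class of `HTTPError`) when no HTTP answer arrives at all, for instance on a DNS failure. The function name `retrieveSafe` is hypothetical; it returns the same kind of triple, with `None` as status when there was no HTTP response.

```python
import urllib.request as ur
from urllib.error import HTTPError, URLError


def retrieveSafe(url, binary=False):
    try:
        with ur.urlopen(url) as response:
            status = response.status
            data = response.read()
    except HTTPError as e:
        # The server answered, but with an error status
        return (False, e.code, str(e))
    except URLError as e:
        # No HTTP answer at all (DNS failure, refused connection, bad scheme)
        return (False, None, str(e.reason))

    if status != 200:
        return (False, status, "ERROR")
    return (True, status, data if binary else data.decode("utf-8"))


# Offline demonstration: an unknown URL scheme triggers the URLError branch
print(retrieveSafe("nosuch://example.org/x"))
```

`HTTPError` has to be caught before `URLError`, otherwise the more general clause would swallow it and the status code would be lost.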

We use the `retrieve()` function to define a function that gets
the landing page of an artwork.

In [11]:
def getRefPage(key):
    # RKD_URL_BASE already ends with a slash
    (good, status, content) = retrieve(f"{RKD_URL_BASE}{key}")
    if good:
        with open(f"{RKD_RAW}/{key}.html", "w", encoding="utf-8") as fh:
            fh.write(content)
        content = "OK"
    return (good, status, content)

And we use that to define the function that retrieves all artworks
that are referenced in the corpus.

In [12]:
def getRefPages(keys):
    good = True
    
    for key in keys:
        (thisGood, status, msg) = getRefPage(key)
        print(f"{key}: {status} {msg}")
        if not thisGood:
            good = False
    
    return good

Finally, we *execute* the function, which saves the HTML under `RKD/raw`
in our `_local` directory. That directory is not sent to GitLab.

In [13]:
getRefPages(sorted(artkeys))

268864: 200 OK
277201: 200 OK
62319: 200 OK
68554: 200 OK
68728: 200 OK
68733: 200 OK


True

# Distilling

We analyse the retrieved landing pages for useful information.

## Thumbnails

It turns out that there are thumbnail links in the `content` attributes
of `<meta>` elements that also have `property="og:image"`.

There are several of those; we look for the ones that have the string
`thumb/650x650` in their `content` attribute.

For some artworks we find multiple thumbnails; in that case we choose the first one
we encounter.
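
The selection rule can be illustrated in isolation. This sketch uses the standard library `html.parser` rather than `lxml`, so that it runs without extra packages; the class name `ThumbFinder`, the page fragment and its URLs are all made up for the illustration.

```python
from html.parser import HTMLParser


class ThumbFinder(HTMLParser):
    """Collects og:image URLs that point to a 650x650 thumbnail."""

    def __init__(self):
        super().__init__()
        self.thumbs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        atts = dict(attrs)
        if atts.get("property") != "og:image":
            return
        # content may be missing, so fall back to an empty string
        image = atts.get("content") or ""
        if "thumb/650x650/" in image:
            self.thumbs.append(image)


# Made-up page fragment, the URLs are placeholders
SAMPLE = """
<head>
<meta property="og:title" content="Some title">
<meta property="og:image" content="https://example.org/logo.png">
<meta property="og:image" content="https://example.org/thumb/650x650/a.jpg">
<meta property="og:image" content="https://example.org/thumb/650x650/b.jpg">
</head>
"""

finder = ThumbFinder()
finder.feed(SAMPLE)
# Choose the first matching thumbnail, or None if there is none
firstThumb = finder.thumbs[0] if finder.thumbs else None
print(firstThumb)
```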

## Metadata

We find metadata in elements with an attribute `itemprop`.
The value of this attribute is the metadata key, and the metadata itself is the
element content.
We normalize spaces in this content, and store it under the key.

It appears that some keys have multiple values, so we store all metadata as a list
of values under their key.
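
A minimal sketch of this bookkeeping, with made-up key/value pairs standing in for what a page walk might yield. The pattern `\s+` is a choice of this sketch: it collapses every run of whitespace, including a single tab or newline, into one space.

```python
import re

WHITE_RE = re.compile(r"\s+")

# Made-up (key, raw text) pairs, as they might come out of a page walk
found = [
    ("name", "  Composition \n   with  lines "),
    ("artist", "Some\tArtist"),
    ("artist", "Another   Artist"),
]

info = {}
for (prop, text) in found:
    value = WHITE_RE.sub(" ", text.strip())
    # Every key holds a list, so repeated keys keep all their values
    info.setdefault(prop, []).append(value)

print(info)
```

Storing every value in a list, even when there is only one, keeps the shape of the metadata uniform for whoever reads it back later.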

We use the module `lxml` to walk recursively through the content and pick up the
info during the walk.

In [14]:
def analyse(root, info, foundProps):
    WHITE_RE = re.compile(r"\s\s+")

    ignoreProps = set(
        """
        itemListElement
    """.strip().split()
    )

    def walk(elem):
        tag = elem.tag
        atts = elem.attrib
        if tag == "meta":
            if atts.get("property", None) != "og:image":
                return
            # default to "" so the substring test below cannot fail on None
            image = atts.get("content", "")
            if "thumb/650x650/" not in image:
                return
            info.setdefault("thumb", []).append(image)
            return
        
        itemProp = atts.get("itemprop", None)

        if itemProp:
            if itemProp not in ignoreProps:
                foundProps.add(itemProp)
                text = elem.text_content()
                info.setdefault(itemProp, []).append(WHITE_RE.sub(" ", text.strip()))
            return
        
        for child in elem.iterchildren(tag=etree.Element):
            walk(child)

    walk(root)

We then define a function that gets the info for all artworks.

In [15]:
def distillPages():
    files = []
    
    with os.scandir(RKD_RAW) as dh:
        for entry in dh:
            if entry.is_file():
                name = entry.name
                if name.endswith(".html"):
                    files.append(name)
                    
    foundProps = set()
    imageSpecs = {}
    
    for file in files:
        bareName = file.removesuffix(".html")
        tree = htmlParse(f"{RKD_RAW}/{file}")
        root = tree.getroot()
        # the artwork key is the file name without extension
        info = dict(url=f"{RKD_URL_BASE}{bareName}")
        analyse(root, info, foundProps)
        if "thumb" in info:
            info["thumb"] = info["thumb"][0]
        else:
            print(f"{bareName:<6}: No thumbnail")
        imageSpecs[bareName] = info
    
    with open(RKD_SPECS, "w") as fh:
        yaml.dump(imageSpecs, fh, allow_unicode=True)
        
    print(f"Found {len(foundProps)} props:")
    for prop in foundProps:
        print(f"\t{prop}")

Finally, we run the function, and list the distinct metadata keys
that we have encountered.

We collect the info for all images and store it in the source directory,
in the file `images.yaml`.

In [16]:
distillPages()

Found 5 props:
	description
	artist
	artworkSurface
	artMedium
	name


# Retrieve thumbnails

We collect the thumbnail urls from the `images.yaml` file and retrieve them.

First a function to retrieve a single thumbnail.

In [17]:
def getThumbnail(key, url):
    (good, status, content) = retrieve(url, binary=True)
    ext = url.rsplit(".", 1)[-1]
    
    if good:
        with open(f"{GRAPHICS_DIR}/{key}.{ext}", "wb") as fh:
            fh.write(content)
        content = "OK"
    return (good, status, content)
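
Taking the text after the last dot works when the URL ends in a plain file name. A sketch of a more defensive alternative, in case a thumbnail URL carries a query string or has no extension at all; the helper name `urlExtension` and the fallback `jpg` are assumptions of this sketch, and the URLs are placeholders.

```python
import posixpath
from urllib.parse import urlparse


def urlExtension(url, default="jpg"):
    # Look only at the path part, so query strings and fragments are ignored
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lstrip(".")
    return ext or default


print(urlExtension("https://example.org/thumb/650x650/a.png?size=large"))
print(urlExtension("https://example.org/thumb/650x650/a"))
```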

Then the function to get them all.

In [18]:
def getThumbnails():
    with open(RKD_SPECS) as fh:
        imageSpecs = yaml.load(fh, Loader=yaml.FullLoader)
    
    good = True
    
    for (key, info) in sorted(imageSpecs.items()):
        url = info.get("thumb", None)
        if url is None:
            continue
        (thisGood, status, msg) = getThumbnail(key, url)
        print(f"{key}: {status} {msg}")
        if not thisGood:
            good = False
            
    return good

Finally, we run the function.

All images end up in the source directory, next to the `images.yaml` file.

In [19]:
getThumbnails()

268864: 200 OK
277201: 200 OK
62319: 200 OK
68554: 200 OK
68728: 200 OK
68733: 200 OK


True

# Images in the corpus

We test whether we can include the images in the corpus.

In [21]:
A = use("annotation/mondriaan:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

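| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| folder | 1 | 13761.0 | 100 |
| letter | 14 | 982.93 | 100 |
| body | 14 | 849.93 | 86 |
| text | 14 | 849.93 | 86 |
| chunk | 86 | 160.0 | 100 |
| div | 93 | 219.99 | 149 |
| teiHeader | 14 | 124.57 | 13 |
| p | 95 | 73.39 | 51 |
| postscript | 6 | 62.83 | 3 |
| revisionDesc | 14 | 61.0 | 6 |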


Now let's find them:

In [22]:
query = """
rs key type=artwork-m
"""

results = A.search(query)

  0.00s 11 results


And display them:

In [23]:
for r in results[0:2]:
    A.pretty(r[0])