Getting the data: there is an API (see https://developers.wellcomecollection.org/docs/examples), but it is limited to 10,000 results. In any case, it is straightforward to work with a snapshot of the catalogue. We can use some lightly adapted code from https://developers.wellcomecollection.org/docs/examples/working-with-snapshots-of-the-api to acquire the snapshot.

In [None]:
from pathlib import Path
import requests
from tqdm.auto import tqdm
import gzip
import os
import io
import sys

snapshot_url = "https://data.wellcomecollection.org/catalogue/v2/works.json.gz"

data_dir = Path("./data").resolve()
data_dir.mkdir(exist_ok=True)

file_name = Path(snapshot_url).parts[-1]
zipped_path = data_dir / file_name
unzipped_path = zipped_path.with_suffix("")

# check whether the file already exists before doing any work
if not unzipped_path.exists():
  if not zipped_path.exists():

    # make a request to the snapshot URL and stream the response
    r = requests.get(snapshot_url, stream=True)
    r.raise_for_status()  # stop early if the download request failed
    
    # use the length of the response to create a progress bar for the download
    download_progress_bar = tqdm(
      unit="bytes",
      total=int(r.headers["Content-Length"]),
      desc=f"Downloading {file_name}",
    )

    # write the streamed response to our file in chunks of 1024 bytes
    with open(zipped_path, "wb") as f:
      for chunk in r.iter_content(chunk_size=1024):
        if chunk:
          f.write(chunk)
          download_progress_bar.update(len(chunk))

      download_progress_bar.close()

  # open the zipped file, and the unzipped file
  with gzip.open(zipped_path, "rb") as f_in, open(unzipped_path, "wb") as f_out:
    unzip_progress_bar = tqdm(
      unit="bytes",
      total=f_in.seek(0, io.SEEK_END), # measure the unzipped length of the zipped file using `.seek()`
      desc=f"Unzipping {file_name}",
    )

    # we used `.seek()` to move the cursor to the end of the file, so we need to
    # move it back to the start before we start reading
    f_in.seek(0)

    # read the zipped file in chunks of 1MB
    for chunk in iter(lambda: f_in.read(1024 * 1024), b""):
      f_out.write(chunk)
      unzip_progress_bar.update(len(chunk))

    unzip_progress_bar.close()

Next we'll collect a set of works of interest. In the interests of speed we'll go for something small: let's find all the books with typhoid as a subject. We start by iterating over works.json looking for such books. works.json is a JSONL file, so each line is a separate JSON record. This makes line-by-line iteration a memory-efficient way to read it, but reading the whole file is still slow. So I could either pre-cook a special works.json, or use the web API after all.
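
As a rough sketch of the line-by-line approach, the function below streams a JSONL file and yields only the records whose subject labels mention a keyword. The function name `iter_works_with_subject` is my own, and I am assuming that snapshot records carry an optional `subjects` list whose entries have a `label` string, as the API responses do.

```python
import json


def iter_works_with_subject(jsonl_path, keyword):
    """Yield each work whose subject labels mention `keyword` (case-insensitive)."""
    keyword = keyword.casefold()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            work = json.loads(line)  # one JSON record per line
            # ASSUMPTION: each record has an optional "subjects" list of
            # objects with a "label" string, as in the API responses.
            labels = [(s.get("label") or "") for s in work.get("subjects", [])]
            if any(keyword in label.casefold() for label in labels):
                yield work
```

With the snapshot unzipped as above, `list(iter_works_with_subject(unzipped_path, "typhoid"))` would collect the matches, holding only one record in memory at a time while it scans.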

Restarting: let's try using the API.
Statistical methods like topic modelling should really be used with large volumes of data.
To keep this example reasonably small (and therefore fast), we'll work with a smallish set of books that is still large enough to give reasonable results.
To begin with, we'll use the catalogue API to search for "typhoid".

(see https://developers.wellcomecollection.org/docs/examples for much more about working with the API)

In [None]:
import requests

catalogue_base_url = 'https://api.wellcomecollection.org/catalogue/v2/'

response = requests.get(
    catalogue_base_url + 'works',
    params={
        'include': 'identifiers,subjects',
        'pageSize': 100,
        'query': 'typhoid',
    },
)
response.raise_for_status()  # fail loudly rather than carrying on with an error response
response_data = response.json()
for k, v in response_data.items():
    if k == 'results':
        continue  # skip the results themselves; there will be loads of them
    print(f'{k}: {v}')


When I ran this code, I got 1099 `totalResults`. Your results may differ, depending upon how Wellcome's collection has changed in the meantime. Anyway, this feels like a nice number of texts to start working with. Let's learn some more about them. We'll start by downloading the catalogue data for all of the pages of results.

In [None]:
from tqdm.auto import tqdm

#let's have a progress bar
catalogue_bar = tqdm(
  unit = 'pages',
  total = response_data['totalPages'],
  desc = 'downloading catalogue data',
)

#We already got the first page of results in the previous cell
catalogue_bar.update(1)
works = response_data['results']

#Now we'll add all of the other pages of results to the list "works"
while 'nextPage' in response_data:
  response = requests.get(response_data['nextPage'])
  response.raise_for_status()  # stop if a page fails to download
  response_data = response.json()
  works.extend(response_data['results'])
  catalogue_bar.update(1)

catalogue_bar.close()
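
The same "follow `nextPage` until it disappears" pattern can be pulled out into a small generator, which makes it easier to reuse and to test without touching the network. This is only a sketch: `iter_pages`, `collect_results` and the `fetch` callable are names I have made up for illustration, not part of the Wellcome API or of `requests`.

```python
def iter_pages(first_page, fetch):
    """Yield every page of results, following `nextPage` links until they run out.

    `fetch` is a hypothetical callable that takes a URL and returns the
    decoded JSON for that page as a dict.
    """
    page = first_page
    yield page
    while "nextPage" in page:
        page = fetch(page["nextPage"])
        yield page


def collect_results(first_page, fetch):
    """Gather the `results` lists from every page into one new list."""
    collected = []
    for page in iter_pages(first_page, fetch):
        collected.extend(page["results"])
    return collected
```

In this notebook, `fetch` could be something like `lambda url: requests.get(url).json()`, and `collect_results(response_data, fetch)` would then stand in for the loop above (minus the progress bar).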


Now that we have all of the catalogue data for our "typhoid" works, let's get a sense of what this covers. We'll just look at the contents of the first record.

In [None]:
from IPython.display import JSON as json_display
json_display(works[0], expanded = True)

This is quite a lot of data! We are interested in text about typhoid, so let's focus on the type of work that this is (is it something written, or something else, like a drawing or a photograph?) and the subject matter. We can use JSONPath to look this up.

We'll start with the "type" of the work. The last entry in the above JSON is workType. The label and type look relevant. Let's examine the values that these take across all of our search results.

[Might want to add a cell about filtering out non-Wellcome items.]

In [None]:
from jsonpath_ng.ext import parse as json_parse
from collections import Counter

def count(query, data_list):
  """Count the values matched by a JSONPath query across a list of records.

  Returns the number of records with no match, and a Counter of matched values.
  """
  empty = 0
  counter = Counter()
  searcher = json_parse(query)
  for datum in data_list:
    results = searcher.find(datum)  # a list of DatumInContext
    if len(results) == 0:
      empty += 1
    else:
      for result in results:
        # This assumes each value is hashable, so it does not handle all
        # queries: it will fail if "value" is a dict or a list.
        counter[result.value] += 1
  return empty, counter

def dumpCount(query, data_list, min_proportion = 0):
  emptyCount, counter = count(query, data_list)
  total = len(data_list)
  below_min = 0
  for k, v in counter.most_common():
    proportion = v/total
    if proportion >= min_proportion:
      print(f'{v:4}/{total} ({100 * proportion:3.0f}%) {k}')
    else:
      below_min += 1
  if below_min > 0:
    print(f'{below_min} results hidden as below minimum proportion of {min_proportion * 100:.0f}%')
  if emptyCount > 0:
    print(f'{emptyCount:4}/{total} ({100 * emptyCount/total:3.0f}%) have no value')

print('workType types:')
dumpCount('$.workType.type', works)
print()
print('workType labels:')
dumpCount('$.workType.label', works)
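
The `count` function only works when the matched values are hashable. If a query ever returns dicts or lists, one possible workaround (a sketch, not something the code above does) is to turn such values into a deterministic JSON string before counting them. The helper name `as_hashable` is hypothetical.

```python
import json
from collections import Counter


def as_hashable(value):
    """Return `value` unchanged if it can be a Counter key, else a JSON string of it."""
    try:
        hash(value)
        return value
    except TypeError:
        # dicts and lists are unhashable; serialise them with sorted keys so
        # that equal dicts always produce the same string
        return json.dumps(value, sort_keys=True)


counter = Counter()
examples = ["Books", {"id": "a", "label": "Books"}, {"label": "Books", "id": "a"}, ["x"]]
for value in examples:
    counter[as_hashable(value)] += 1
```

Here the two dicts have the same content in a different key order, so they are counted under a single key.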


We can see that there is a range of types of works in our results. At the time of writing, three quarters of the works are books, and several others are of types that could reasonably have text (e.g. "Archives and manuscripts", "Student dissertations", "E-books", "Manuscripts", "Journals"). However, given that text is provided by an OCR pipeline, it is only printed texts that are likely to have online text available.
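
If we wanted to narrow the results down to the types most likely to have text, a minimal sketch might look like the following. The function name `filter_by_work_type` is my own, and which labels to keep is a judgment call; the label "Books" in the usage note is an assumption about how books are labelled, so check it against the output above.

```python
def filter_by_work_type(works, wanted_labels):
    """Keep only the works whose workType label is in `wanted_labels`."""
    wanted = set(wanted_labels)
    return [
        work
        for work in works
        # works without a workType are dropped rather than raising an error
        if (work.get("workType") or {}).get("label") in wanted
    ]
```

For example, `filter_by_work_type(works, {"Books", "E-books", "Journals"})` would keep just those three types.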

Working out the catalogue subject of a work is more complicated. Works in Wellcome Collection are classified according to a range of schemes. If we look in the above JSON, we can also see that the structure is fairly complex, involving a mixture of "Subjects" and "Concepts". Rather than unpick all this, we'll just look at a part of the structure to get a sense of how things are classified.

In [None]:
print('Subjects')
#label of every member of the subjects array which has a type of Subject
dumpCount('$.subjects[?(@.type=="Subject")].label', works, 0.02)

print()
print('Concepts')
#label of every node at any depth beneath subjects which has a type of concept
dumpCount('$.subjects..*[?(@.type=="Concept")].label', works, 0.02)

Straight away, we can see that neither Subjects nor Concepts are available for about a fifth of our results (e.g. 204/1100 ( 19%) have no value).

We can also see that there are a lot of possible values here, so many that I've written the code to hide all results applying to fewer than 2% of the works on typhoid.

We can also see that the phrase "typhoid fever" (with varying capitalization) covers 50% of the Subjects and 63% of the Concepts. This suggests that these specific values will get pretty good results in a search. What we cannot tell from this is how many of the works covered by other concepts are actually relevant.
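
Because the capitalization varies, it can help to fold case before comparing. The sketch below counts works rather than labels, so its figure is not directly comparable with the percentages above: it reports the fraction of works that have at least one top-level subject label containing a phrase. The function name `proportion_with_subject` is hypothetical, and it only looks at the `label` of each entry in `subjects`, not at the nested Concepts.

```python
def proportion_with_subject(works, phrase):
    """Fraction of works with at least one subject label containing `phrase`."""
    if not works:
        return 0.0
    phrase = phrase.casefold()  # case-insensitive comparison
    matching = sum(
        1
        for work in works
        if any(
            phrase in (subject.get("label") or "").casefold()
            for subject in work.get("subjects", [])
        )
    )
    return matching / len(works)
```

Calling `proportion_with_subject(works, "typhoid fever")` would then give a per-work view of how much of the result set the phrase covers.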

[This could be a good place to introduce the difference between Wellcome and non-Wellcome works and to see what effect filtering down to just Wellcome has.]