# Filtering museum objects

We only want items from the Harvard Art Museums that suit our application. For example, the museums hold a large collection of coin images, which don't work well with CLIP, so we filter them out in this step.

Other objects have no images at all, which makes them useless for our purposes. Let's explore what the data looks like!

In [1]:
import json

with open("../data/artmuseums.json", "r") as f:
    dataset = json.load(f)

In [2]:
len(dataset)

236120

In [3]:
dataset[0]

{'copyright': None,
 'contextualtextcount': 0,
 'creditline': 'Harvard Art Museums/Arthur M. Sackler Museum, Loan from the Trustees of the Arthur Stone Dewing Greek Numismatic Foundation',
 'accesslevel': 1,
 'dateoflastpageview': '2020-11-25',
 'classificationid': 50,
 'division': 'Asian and Mediterranean Art',
 'markscount': 0,
 'publicationcount': 0,
 'totaluniquepageviews': 7,
 'contact': 'am_asianmediterranean@harvard.edu',
 'colorcount': 4,
 'rank': 1003,
 'details': {'coins': {'reverseinscription': None,
   'dieaxis': '8',
   'metal': 'AR',
   'obverseinscription': None,
   'denomination': 'litra',
   'dateonobject': None}},
 'state': None,
 'id': 189263,
 'verificationleveldescription': 'Good. Object is well described and information is vetted',
 'period': 'Classical period, Early',
 'images': [{'date': '2004-06-03',
   'copyright': 'President and Fellows of Harvard College',
   'imageid': 38027,
   'idsid': 18778012,
   'format': 'image/jpeg',
   'description': None,
   'techn

In [5]:
sum("primaryimageurl" not in item for item in dataset)

22994

In [6]:
len({item["primaryimageurl"] for item in dataset if "primaryimageurl" in item})

210681

In [7]:
210681 + 22994

233675

In [9]:
len(dataset)

236120

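Okay, it looks like about 10% of the items in the dataset (22,994 of 236,120) don't have any image associated with them. Furthermore, the unique URLs and the missing images together account for only 233,675 records, so roughly 2,400 records (236,120 - 233,675 = 2,445) reuse an image URL that already appears on another record. Let's look at some of these duplicates.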
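The arithmetic above can be folded into one helper. This is a minimal sketch, not part of the original notebook: `duplicate_url_summary` is a hypothetical name, the sample records are made up, and unlike the cells above it also skips records whose `primaryimageurl` is present but empty or `None`.

```python
from collections import Counter


def duplicate_url_summary(items, key="primaryimageurl"):
    """Return (records with a URL, distinct URLs, surplus records sharing a URL)."""
    # Assumption: a missing key and an empty/None value both mean "no image".
    urls = [item[key] for item in items if item.get(key)]
    counts = Counter(urls)
    # Every occurrence beyond the first one of a URL counts as surplus.
    surplus = sum(n - 1 for n in counts.values())
    return len(urls), len(counts), surplus


# Tiny made-up sample, not real museum data.
sample = [
    {"id": 1, "primaryimageurl": "https://example.org/a"},
    {"id": 2, "primaryimageurl": "https://example.org/a"},
    {"id": 3, "primaryimageurl": "https://example.org/b"},
    {"id": 4},
]

with_url, distinct, surplus = duplicate_url_summary(sample)
print(with_url, distinct, surplus)
```

Run against `dataset`, the third number should land near the 2,445 computed above, possibly off by a little if some records carry an empty URL.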

In [13]:
from collections import Counter

Counter([item["primaryimageurl"] for item in dataset if "primaryimageurl" in item]).most_common(40)

[('https://nrs.harvard.edu/urn-3:HUAM:INV180154_dynmc', 48),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180030_dynmc', 38),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180019_dynmc', 36),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180031_dynmc', 34),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180197_dynmc', 32),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180165_dynmc', 28),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180032_dynmc', 27),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV186007_dynmc', 26),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180113_dynmc', 25),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180100_dynmc', 25),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180157_dynmc', 23),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV179871_dynmc', 23),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180153_dynmc', 23),
 ('https://nrs.harvard.edu/urn-3:HUAM:VRS35829_dynmc', 22),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV141481_dynmc', 22),
 ('https://nrs.harvard.edu/urn-3:HUAM:INV180155_dynmc', 22),
 ('https://nrs.harvard.ed

In [20]:
[item["dimensions"] for item in dataset[0:50] + dataset[10000:10050]]

['0.71 g',
 '16.82 g',
 '17.33 g',
 '7.94 g',
 '7.63 g',
 '7.67 g',
 '7.49 g',
 '7.54 g',
 '2.35 g',
 '2.47 g',
 '1.06 g',
 '0.5 g',
 '0.7 g',
 '0.89 g',
 '8.43 g',
 '8.62 g',
 '8.47 g',
 '8.92 g',
 '8.7 g',
 '8.73 g',
 '8.43 g',
 '17.27 g',
 '8.44 g',
 '0.6 g',
 '4.79 g',
 '17.19 g',
 '17.17 g',
 '16.65 g',
 '16.69 g',
 '17.08 g',
 '16.77 g',
 '1.88 g',
 '2.1 g',
 '2.13 g',
 '1.34 g',
 '1.35 g',
 '1.34 g',
 '21.37 g',
 '1.46 g',
 '3.36 g',
 '0.7 g',
 '0.71 g',
 '0.69 g',
 '17.12 g',
 '8.03 g',
 '0.43 g',
 '16.32 g',
 '17.33 g',
 '16.93 g',
 '16.81 g',
 'image: 27.9 x 21.6 cm (11 x 8 1/2 in.)',
 'image: 15.1 x 8.9 cm (5 15/16 x 3 1/2 in.)',
 'image: 28.2 x 21.6 cm (11 1/8 x 8 1/2 in.)',
 'image: 29.1 x 21.7 cm (11 7/16 x 8 9/16 in.)',
 'image: 27.8 x 21.5 cm (10 15/16 x 8 7/16 in.)',
 'mount: 35.5 x 56 cm (14 x 22 1/16 in.)',
 'image: 28.2 x 21.6 cm (11 1/8 x 8 1/2 in.)',
 'image: 28.2 x 21.6 cm (11 1/8 x 8 1/2 in.)',
 'mount: 35.5 x 56 cm (14 x 22 1/16 in.)',
 'image: 28.1 x 21.5 cm (

Okay, looks like the coins in this sample all have their dimensions given in grams, that is, a weight rather than a size. The weird print also has a `None` dimension. Maybe for now I'll just filter for all items that have an image and some 2D dimensions.
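One way such a filter could look is sketched below. It is an assumption-laden illustration rather than the filter used later in the notebook: `has_image_and_2d_dims` and the `TWO_D` pattern are hypothetical names, and the pattern only checks for two numbers joined by `x` or `×` and followed by `cm`, so three-dimensional objects such as `10 x 20 x 30 cm` would pass as well.

```python
import re

# Assumption: usable dimensions look like "<number> x <number> cm",
# with either a plain "x" or the multiplication sign as separator.
TWO_D = re.compile(r"\d+(?:\.\d+)?\s*[x×]\s*\d+(?:\.\d+)?\s*cm")


def has_image_and_2d_dims(item):
    """True if the record has an image URL and at least two lengths in cm."""
    url = item.get("primaryimageurl")
    dims = item.get("dimensions")
    return bool(url) and isinstance(dims, str) and TWO_D.search(dims) is not None


# Made-up records that mirror the shapes seen in the output above.
examples = [
    {"primaryimageurl": "u", "dimensions": "image: 27.9 x 21.6 cm (11 x 8 1/2 in.)"},
    {"primaryimageurl": "u", "dimensions": "0.71 g"},
    {"primaryimageurl": "u", "dimensions": None},
    {"dimensions": "image: 27.9 x 21.6 cm (11 x 8 1/2 in.)"},
]
kept = [item for item in examples if has_image_and_2d_dims(item)]
print(len(kept))
```

Only the first example record survives: the coin is rejected for its gram-only dimension, the third for its `None` dimension, and the fourth for its missing image URL.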

In [29]:
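# Tally the "kind" of each dimensions string.
dim_type = Counter()
for item in dataset:
    kind = None  # stays None when the item has no dimensions string
    dims = item["dimensions"]
    if isinstance(dims, str):
        idx = dims.find(":")
        if idx != -1:
            # Labelled value such as "image: 27.9 x 21.6 cm": keep the label.
            kind = dims[:idx]
        elif dims.endswith("g"):
            # Unlabelled weight such as "0.71 g" (the coins).
            kind = "weight grams"
        elif " cm" in dims:
            # Unlabelled measurement in centimetres.
            kind = "other cm"
        else:
            # Anything else is counted under its raw string.
            kind = dims
    dim_type[kind] += 1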

dim_type

Counter({'weight grams': 17437,
         'other cm': 83909,
         'sheet': 6767,
         '33.3 × 69.5 cm (13 1/8 × 27 3/8 in.)\r\nframe': 1,
         None: 53404,
         'actual': 5560,
         '19.3 x 22.8 cm (7 5/8 x 9 in.)\r\nmount': 1,
         'overall for matted triptych': 28,
         'Paper': 1758,
         'image': 34804,
         '22.6 x 47 cm (8 7/8 x 18 1/2 in.)\r\nmount': 1,
         '22.8 x 48 cm (9 x 18 7/8 in.)\r\nsheet': 1,
         '22.1 x 47.8 cm (8 11/16 x 18 13/16 in.)\r\nmount': 1,
         '56 x 47.8 cm (22 1/16 x 18 13/16 in.)\r\nmount': 1,
         '22.6 x 46.9 cm (8 7/8 x 18 7/16 in.)\r\nmount': 1,
         '22.5 x 47.6 cm (8 7/8 x 18 3/4 in.)\r\nsheet': 1,
         '40.6 x 58.5 cm (16 x 23 1/16 in.)\r\nmount': 1,
         '59.2 x 39.3 cm (23 5/16 x 15 1/2 in.)\r\nmount': 1,
         '59.6 x 42.1 cm (23 7/16 x 16 9/16 in.)\r\nmount': 1,
         '40.7 x 59.9 cm (16 x 23 9/16 in.)\r\nmount': 1,
         '36 x 47.2 cm (14 3/16 x 18 9/16 in.)\r\nmount': 1,

Okay, looks like this part of the data is super messy. Let's try to consolidate it a bit.
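Many of the one-off keys above come from strings that pack several measurements into one field, separated by `\r\n`, so everything before the first colon gets lumped into the label. A possible first step, sketched here under assumptions, is to split each string into one `(label, measurement)` pair per line. `split_dimensions` is a hypothetical helper, and the `frame` measurement in the example is invented for illustration.

```python
def split_dimensions(dims):
    """Split a raw dimensions string into (label, measurement) pairs.

    Assumes parts are separated by line breaks and that a label,
    when present, is the text before the first colon of a part.
    """
    if not isinstance(dims, str):
        return []
    pairs = []
    for part in dims.splitlines():  # splitlines treats "\r\n" as one break
        part = part.strip()
        if not part:
            continue
        label, sep, rest = part.partition(":")
        if sep:
            # Normalise case so "Paper" and "paper" end up in one bucket.
            pairs.append((label.strip().lower(), rest.strip()))
        else:
            pairs.append((None, part))
    return pairs


# Invented example in the shape of the messy keys above.
raw = "33.3 × 69.5 cm (13 1/8 × 27 3/8 in.)\r\nframe: 40 x 80 cm"
print(split_dimensions(raw))
```

Counting the labels returned by this helper instead of the raw prefix should collapse entries such as `'...\r\nmount'` into a single `mount` bucket, though the real data may hold other separators that this sketch does not handle.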