An exercise in browsing The Tate Collection in a CompuTATEional, Linguistically-motivated way.
The Tate Collection and the English Language

This is a small-ish project touching on data organization, Apache Spark, database creation, applied Natural Language Processing, Flask app construction, and cloud deployment, all with the goal of browsing The Tate Collection in (what is hopefully) a fun and interesting way. The end goal is a website in which the user can browse said collection via metadata supplied by the Tate, including artist, year, other "sub-collections" the particular work may be part of (e.g. an artist's sketchbook), and "keywords", which include single words ("sleepwalking"), multi-word expressions ("cement mixer"), and proper nouns ("World War II").

Most of the "exciting" stuff here involves the keywords.

Notes

I do not own this data. The artwork and all related metadata are the property of the Tate. You should go to London and view them in person because they're quite spectacular. This project is just for fun. The artwork itself remains hosted on and accessible via the Tate's servers; that said, definitely get at me if you stumble across any dead links.

To the best of my knowledge, all keywords assigned to all artworks were human-assigned by Tate staff, although I performed automatic categorization of them into single words vs. multi-word expressions. Tate staff had already categorized proper nouns as such, although I did clean them up a bit during preprocessing, which will be discussed later.

Fun with Keywords

Currently, clicking on any keyword on an artwork's page will bring you to a random artwork corresponding to the particular keyword relationship.

So what's a "keyword relationship"? 🤓

The Same Keyword

All keywords will be presented, but only those that occur in the collection at least twice will be clickable. So for example, "Ackergill Tower" would just "randomly" bring you back to the artwork you're presently looking at, while "townscape" will present you with one of 11,153 different pieces of art.

Keywords with the Same Frequency of Occurrence

I computed raw counts and relative frequencies across all keywords. 8,197 of them occur exactly once (that's also the most frequently occurring frequency of occurrence), so now you're in luck if you're on the artwork labeled "Ackergill Tower".
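
As a minimal sketch (with a toy keyword list standing in for the real data), both the per-keyword counts and that "frequency of occurrence" histogram fall out of two nested `Counter`s:

```python
from collections import Counter

# Toy stand-in for the full list of keyword assignments in the collection.
keyword_assignments = [
    "townscape", "townscape", "Ackergill Tower",
    "sunset", "sunset", "sunset", "afterlife",
]

# How often each keyword occurs across the collection.
keyword_counts = Counter(keyword_assignments)

# How many keywords share each count -- the "frequency of
# occurrence" histogram described above.
freq_of_freq = Counter(keyword_counts.values())

print(keyword_counts["townscape"])  # 2
print(freq_of_freq[1])              # 2 ("Ackergill Tower" and "afterlife")
```

In the real data, `freq_of_freq[1]` is where the 8,197 once-only keywords show up.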

Single-Word Keywords that are Synonyms, Antonyms, or Hypernyms of Each Other

Insert description here!

Keyword-Level Cosine Similarity Using BNC word2vec Vectors

The word2vec vectors were computed over the British National Corpus by Marek Rei and Ted Briscoe. I used the vectors_nnet_100 set with gensim to extract the vector for every keyword and compared each keyword to every other using cosine similarity.

Multi-word expression and proper noun vectors were computed in the same way: for each word in the keyword, extract its vector, then sum all of the extracted vectors.

In an attempt to handle out-of-vocabulary (OOV) words, I converted every keyword to lower-case to match the words' representation in the vectors, and also applied the Porter stemmer; however, I'm considering redoing this with a lemmatizer instead. 1,320 keywords (7.63%) are OOV.
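
A minimal sketch of that lookup chain, with a toy dictionary and a crude suffix-stripper standing in for the real gensim KeyedVectors and the Porter stemmer:

```python
import numpy as np

# Toy stand-in for the gensim KeyedVectors loaded from vectors_nnet_100.
toy_vectors = {
    "cement": np.array([0.1, 0.9]),
    "mixer": np.array([0.8, 0.2]),
    "sleepwalk": np.array([0.3, 0.3]),  # stemmed form present, full form not
}

def crude_stem(word):
    """Very crude suffix-stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def keyword_vector(keyword):
    """Sum the vectors of a keyword's words, lower-casing and then
    stemming to dodge OOV misses; returns None if every word is OOV."""
    parts = []
    for word in keyword.lower().split():
        if word in toy_vectors:
            parts.append(toy_vectors[word])
        elif crude_stem(word) in toy_vectors:
            parts.append(toy_vectors[crude_stem(word)])
    return sum(parts) if parts else None

mwe_vec = keyword_vector("cement mixer")  # sum of the two word vectors
oov_vec = keyword_vector("sleepwalking")  # falls back to the stemmed form
```

Keywords whose every word misses even after stemming are the OOV cases counted above.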

I then kept, for every keyword x, one of the following:

  • all keywords with a cosine similarity >= 0.7 (where 1 indicates that the two keywords are identical); or,
  • if there were fewer than 20 such keywords, the top 20 ranked by decreasing cosine similarity; or,
  • if there were fewer than 20 keywords with a cosine similarity > 0 at all, every keyword with a cosine similarity > 0.
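
The selection rules above could be sketched like this (`related_keywords` and its toy inputs are hypothetical, and this is my reading of the thresholds):

```python
def related_keywords(sims, high=0.7, k=20):
    """Pick the related keywords kept for a single keyword x.

    sims: dict mapping candidate keyword -> cosine similarity to x.
    """
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    # Rule 1: keep everything at or above the high-similarity threshold...
    above_high = [(kw, s) for kw, s in ranked if s >= high]
    if len(above_high) >= k:
        return above_high
    # Rule 2: ...otherwise take the top k by decreasing similarity,
    # Rule 3: unless fewer than k candidates have positive similarity,
    # in which case keep all the positive ones.
    positive = [(kw, s) for kw, s in ranked if s > 0]
    return positive[:k] if len(positive) >= k else positive
```

This is what gives the floor of 20 related keywords per keyword reported below (barring keywords with fewer than 20 positive matches).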

This resulted in a minimum number of related keywords per keyword of 20, a maximum number of 1,660 ("Joseph Beuys, dernier espace avec introspecteur', Anthony d'Offay, London, 1982"), and an average of 211. 6,283 keywords (36.3%) have exactly 20 related keywords.

Because word2vec is amazing, you'll notice some really fun relationships between keywords. For example, the keyword "afterlife", which occurs 15 times in the collection, has the following 20 related keywords:

| Keyword | Cosine Similarity |
| --- | --- |
| procreation | 0.728117585182189 |
| eroticism | 0.656886279582977 |
| laziness | 0.656573057174682 |
| piety | 0.636407136917114 |
| narcissism | 0.635571479797363 |
| aura | 0.63342660665512 |
| pessimism | 0.631868779659271 |
| humanity | 0.619377076625824 |
| oneness | 0.618755936622619 |
| escapism | 0.615229189395904 |
| futility | 0.613060653209686 |
| morality | 0.608505070209503 |
| reincarnation | 0.603296399116516 |
| devotion | 0.602857828140258 |
| patriotism | 0.602445840835571 |
| paranoia | 0.600905776023864 |
| gluttony | 0.600560069084167 |
| modernity | 0.599378824234008 |
| illogicality | 0.599243938922882 |
| humility | 0.595150113105773 |

Keep in mind that this is based on the reference corpus; if our corpus consisted of, say, only philosophical and religious texts, we might see different words, or these same words presented in a different order, perhaps with higher cosine similarity scores. The goal of the BNC is to be a representative sample of British English, so, have fun speculating what "the representative British person" is writing about the afterlife. 😄

Artwork-Level Cosine Similarity Using BNC word2vec Vectors

This is basically the same as the keyword-level cosine similarity above, except that for every artwork with keywords x_1...x_n, the artwork's word2vec vector is the sum of the word2vec vectors for x_1 through x_n. These vectors are then compared to every other artwork's vector, and the five artworks with the highest cosine similarity to the given artwork are stored. Due to OOV issues, vectors could only be computed for 58,799 artworks; that is, for 10,403 artworks (15%), all of their keywords are OOV. For any artwork with n keywords and fewer than n OOV keywords, the OOV keywords were simply omitted from that artwork's vector sum.
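
A minimal sketch of the artwork-level step, with toy keyword vectors standing in for the BNC word2vec vectors (`None` marks an OOV keyword; all names here are hypothetical):

```python
import numpy as np

kw_vecs = {
    "sunset": np.array([1.0, 0.0]),
    "sea": np.array([0.9, 0.1]),
    "cliff": np.array([0.0, 1.0]),
    "mystery": None,  # OOV keyword
}

def artwork_vector(keywords):
    """Sum the artwork's keyword vectors, skipping OOV keywords;
    returns None when every keyword is OOV."""
    vecs = [kw_vecs[k] for k in keywords if kw_vecs.get(k) is not None]
    return np.sum(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

artworks = {
    "A": ["sunset", "sea"],
    "B": ["sunset", "mystery"],  # the OOV keyword is simply dropped
    "C": ["cliff"],
    "D": ["mystery"],            # all keywords OOV -> no vector at all
}
vectors = {aid: artwork_vector(kws) for aid, kws in artworks.items()}

# Rank the other artworks by similarity to "A" (top five in the real run).
target = vectors["A"]
ranked = sorted(
    ((aid, cosine(target, v)) for aid, v in vectors.items()
     if aid != "A" and v is not None),
    key=lambda kv: kv[1], reverse=True,
)
```

Artwork "D" plays the role of the 10,403 artworks for which no vector could be computed at all.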

One thing I discovered during this process was a high number of artworks with keyword cosine similarity scores of 1 despite not being the same artwork. This tends to happen when an artist has made multiple works of roughly the same subject as part of a series, as is the case with the über-prolific Joseph Mallord William Turner. His work makes up 57% (!) of the Tate's collection and is spread across 394 sub-collections. Here are the 10 sub-collections with the most artworks:

| Sub-Collection | # of Artworks |
| --- | --- |
| Holland Sketchbook | 562 |
| Rivers Meuse and Moselle Sketchbook | 537 |
| Devonshire Coast, No.1 Sketchbook | 452 |
| Scotch Lakes Sketchbook | 354 |
| Paris and Environs Sketchbook | 336 |
| Yorkshire 2 Sketchbook | 333 |
| Yorkshire 6 Sketchbook | 314 |
| Edinburgh Sketchbook | 281 |
| Plymouth, Hamoaze Sketchbook | 269 |
| Hastings Sketchbook | 230 |

Believe me when I say, every artwork in the "Holland Sketchbook" has roughly the same keywords. Not that there's anything wrong with that; it's just not particularly interesting when using this kind of mechanism to browse through art. You can just click on the keywords themselves and navigate through the "Holland Sketchbook" that way. Instead, the goal of this endeavour is to present you with, "I never would have thought that these artworks were at all related! Thanks, word2vec!"

To demonstrate, here are the top five most-similar artworks to "Shakespeare's Cliff at Dover" (D18842) from Turner's Holland Sketchbook:

| Accession # | Artist | Artwork | Sub-Collection | Cosine Similarity |
| --- | --- | --- | --- | --- |
| D35657 | Turner | Houses on Coast with Cliff in Distance (?Shakespeare’s Cliff, Dover) | Dieppe and Kent Sketchbook | 0.9999998807907104 |
| D10474 | Turner | Shakespeare Cliff, Dover | Richmond Hill; Hastings to Margate Sketchbook | 0.9713542461395264 |
| D35344 | Turner | Cliffs on the Coast; ?Walmer Castle | Hythe and Walmer Sketchbook | 0.9395263195037842 |
| D19408 | Turner | Shakespeare’s Cliff, Dover | Holland, Meuse and Cologne Sketchbook | 0.9308038353919983 |
| D17211 | Turner | Cliffs on Coast, ?near Folkstone or Dover | Folkestone Sketchbook | 0.9231697916984558 |

So...cool, Turner's artwork is most-similar to Turner's artwork, based on the keywords assigned to it.

But what about someone like Roger Ackling, with only one piece of art in the entire Tate Collection (T03562, "Five Sunsets in One Hour")? The keywords assigned to that artwork are "sunlight", "environment", "sunset", "text", "time", "nature", "Chillerton Down", "Isle Of Wight", and "England". (For reference, "England" appears as a keyword across 8% of artworks in the collection.)

| Accession # | Artist | Artwork | Cosine Similarity |
| --- | --- | --- | --- |
| N02001 | Turner | Study of Sea and Sky, Isle of Wight | 0.8201642632484436 |
| T02251 | Read, David Charles | Ryde, Isle of Wight | 0.7976629734039307 |
| T02329 | Pringle, John Quinton | The Window | 0.7943012714385986 |
| D20789 | Turner | View of the Solent | 0.7913135290145874 |
| D08176 | Turner | Moonlight at Sea (The Needles) | 0.7871695756912231 |

41,062 artworks (59.3%) have at least one work by Turner among their top five most-similar artworks by keywords. So, in an attempt to introduce some variation into the mix, I reran this omitting all of Turner's work; the top five artworks most similar to Ackling's "Five Sunsets in One Hour" now include the same two non-Turner works listed above, along with a few others:

| Accession # | Artist | Artwork | Cosine Similarity |
| --- | --- | --- | --- |
| T02251 | Read, David Charles | Ryde, Isle of Wight | 0.7976629734039307 |
| T02329 | Pringle, John Quinton | The Window | 0.7943012714385986 |
| T02971 | Daniell, William | Ryde | 0.7834650874137878 |
| P78605 | Cooper, Thomas Joshua | South-most Arrival - The English Channel; At the hour of the Total Solar Eclipse, but on the Day Before; Bumble Rock, Lizard Point, Cornwall, Great Britain; The South-most point of mainland Great Britain | 0.7771701812744141 |
| T02975 | Daniell, William | Needles Cliff, & Needles, Isle of Wight | 0.7647179961204529 |

Nothing against Turner, of course; brilliant artist, impressively prolific. 😅

Single-Word and Multi-Word Keyword Clustering Using (surprise) BNC word2vec Vectors

This was actually pretty challenging, since every keyword here essentially exists in isolation. We can compare keywords just fine, but without any additional context it's basically impossible to say that, for example, "anatomy" relates to "human" and thus that both should appear in the same cluster.

To mitigate this, I essentially seeded the (for example) "human" cluster with the top 100 most similar unigram vectors from the language model. If "anatomy" was among those top 100, it immediately became a member of the "human" cluster; otherwise, I compared "anatomy"'s word2vec vector to the combined vector of every cluster created thus far. A cosine similarity greater than or equal to 0.45 meant "anatomy" was in; otherwise, a new cluster was created from "anatomy" and its top 100 most similar unigram vectors. These initial clusters were later merged using a similar cosine-similarity-based approach and a threshold of 0.55. Finally, any words that appeared in multiple clusters were made permanent members of the cluster with which they shared the highest cosine similarity.
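
My reading of that seeding-and-thresholding loop, sketched with toy vectors (the `Cluster` class and `assign` helper are hypothetical; in the real pipeline each word's top-100 neighbours come from the language model):

```python
import numpy as np

JOIN_THRESHOLD = 0.45  # second-pass cluster merging used 0.55

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

class Cluster:
    def __init__(self, seed_words, vec):
        self.seeds = set(seed_words)            # seed word + its neighbours
        self.vec = np.array(vec, dtype=float)   # running sum of member vectors

    def add(self, vec):
        self.vec = self.vec + vec

def assign(word, vec, neighbours, clusters):
    """Place `word` in an existing cluster or start a new one."""
    for cluster in clusters:
        if word in cluster.seeds:  # seeded membership wins outright
            cluster.add(vec)
            return cluster
    best = max(clusters, key=lambda c: cosine(c.vec, vec), default=None)
    if best is not None and cosine(best.vec, vec) >= JOIN_THRESHOLD:
        best.add(vec)
        return best
    new = Cluster({word, *neighbours}, vec)
    clusters.append(new)
    return new

clusters = []
human = assign("human", np.array([1.0, 0.0]), ["anatomy", "body"], clusters)
merged = assign("anatomy", np.array([0.0, 1.0]), [], clusters)   # seeded: joins "human"
lonely = assign("sunset", np.array([-1.0, 0.0]), ["dusk"], clusters)  # too dissimilar
```

The later 0.55-threshold merge pass and the duplicate-member cleanup would run over the resulting `clusters` list in the same spirit.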

This approach gave me a fantastic set of 80 clusters...but only for the single-word and multi-word keywords. If I include the proper nouns in this "clustering" procedure, I get a lot of noisy, not-so-nice clusters.

So instead...

Proper Noun Keyword Named Entity Recognition

Using spaCy's named entity recognizer with the multi-language Wikipedia model, I was able to get an impressive set of initial results for 95% of proper noun keywords, which I could then tweak and refine. This particular model classifies named entities as one of PER, LOC, ORG, or MISC; I then came up with rules to automatically reassign some of the more event-y MISC entities as EVENT, and added rules to further refine the other categories.
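
The MISC-to-EVENT reassignment might be sketched like so; the cue words here are hypothetical stand-ins for my actual rules, and the input labels are assumed to come from spaCy's multi-language model:

```python
# Hypothetical "event-y" cue words; the real rules are more involved.
EVENT_CUES = ("war", "battle", "festival", "exhibition", "coronation")

def refine_label(text, label):
    """Re-label event-like MISC entities as EVENT; everything else
    passes through. `label` is one of PER, LOC, ORG, or MISC."""
    if label == "MISC" and any(cue in text.lower() for cue in EVENT_CUES):
        return "EVENT"
    return label

event = refine_label("World War II", "MISC")            # becomes EVENT
kept = refine_label("I Sing the Body Electric", "MISC")  # stays MISC
```

Similar pass-through-with-exceptions rules refine the PER/LOC/ORG buckets.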

I wound up with a set of 849 proper noun keywords that I decided to manually review, clean, and create additional proper noun keywords for while keeping the originals intact. For example, consider the keyword "Whitman, Walt, 'I Sing the Body Electric'". spaCy identified not one but two named entities in here: "Whitman, Walt" as PER and "I Sing the Body Electric" as MISC. I decided that both of those entities were valuable as keywords and entities, so any artwork with that one keyword now has three, two of which have named entity classifications. Note that I didn't do this for keywords like "Young, William Drummond and Edward Drummond, photograph, 'Portrait Photograph of Walter Richard Sickert' 1923", as the keyword clearly states that it is actually about a photograph, and thus my rules label it MISC.

Also note that while you may find these entity classifications useful for work outside of this application (yay!), classification decisions were made with the context in which the entity appears in mind. For example, the proper noun keyword "St Martin" is classified here as LOC and not PER because it is used as a keyword for two artworks, both of which depict locations, not Saint Martin the person.

While I'm not thrilled with all of the manual work that went into this, it wasn't a ton and TBH, I really needed to move on from the preprocessing phase of this project. 😅

Preprocessing

Since the data is basically static, preprocessing of the data obtained from the Tate's repo was done offline; the initial output was a series of CSV files. These were then loaded into a Postgres database on Google Cloud, and subsequent preprocessing outputs were written directly to the database. I'll be the first to admit that the preprocessing scripts aren't...the most efficient. Basically, this part of the project was something I poked at in my spare time: I'd write a script, let it run in the background, and then check on the output however many hours later. This is particularly true of the earlier scripts. 😬 The scripts are numbered to indicate the order in which they were run.

For the most part, I was able to perform all of the preprocessing in a reasonable amount of time using either (a) a MacBook Pro with a 2.6GHz i7 and 16GB of RAM while also using the same Mac for other low-ish-memory tasks, or (b) a 2017 12" MacBook while doing essentially nothing else. The exception was script #07, which computes artwork-level cosine similarity. Next to the initial preprocessing of the Tate data from JSON to CSV, this was honestly the single preprocessing step that took the longest to run: 69,202 artworks, each of which had its cosine similarity computed against the other 69,201 artworks...yeah. So, I reworked my original script to use Apache Spark and ran it on a small, low-memory, two-node Google Cloud Dataproc cluster. The README.md in the preprocessing/ subdirectory has more information about exactly how I ran the script. On the cluster, running this on all of the artwork took just under two hours; without Turner's artwork, it finished in about 26 minutes.
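
For a sense of why this step hurts: the all-pairs comparison can be written as a single matrix product over length-normalized vectors, but at 69,202 artworks the full similarity matrix alone would need roughly 38 GB as float64, hence chunking the work out to Spark. A toy numpy sketch (`top_k_similar` is hypothetical, not the actual script):

```python
import numpy as np

def top_k_similar(vectors, k=5):
    """All-pairs cosine similarity via one matrix product, then the
    indices of the top-k most similar rows per row (self excluded)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T               # n x n cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)    # never match an artwork with itself
    return np.argsort(-sims, axis=1)[:, :k]

vecs = np.array([
    [1.0, 0.0],    # 0: very similar to 1
    [0.9, 0.1],    # 1
    [0.0, 1.0],    # 2
    [-1.0, 0.0],   # 3: opposite of 0
])
nearest = top_k_similar(vecs, k=1)
```

The Spark version effectively computes slices of that matrix across the two-node cluster instead of materializing it all at once.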

Future Work

  • do something with OOV keywords when looking up synonyms, antonyms, hypernyms, and word2vec vectors; could lemmatizing vs. stemming help?
  • more cool stuff with multi-word expression keywords; e.g. classification into idiomatic vs. fixed/semi-fixed/syntactically-flexible/institutionalized?
  • "tame" the Wikipedia categories by (a) doing something about the 20% of proper noun keywords for which there are no Wikipedia categories (for one of many reasons), and (b) attempt to "cluster" them in some useful way so that rather than having categories like "1929 United States House of Representatives" and "1930 United States House of Representatives", one can browse artwork using just "United States House of Representatives".