# Introduction

This notebook outlines an experimental pipeline for matching museum objects with academic research articles.

It was developed as part of the 'Building an Object-Enriched Bibliography: Experiments in Linking Museum Objects and Academic Literature' investigation during the [Congruence Engine](https://perma.cc/58MF-4XWV) project at the Science Museum.

In this experiment, we were interested in finding ways of linking academic literature with museum collections relevant to the textile industry. We therefore chose three open-access academic articles that discuss textiles, as well as a sample dataset of textile-related collection items from the Science Museum Group's [Collections Online](https://collection.sciencemuseumgroup.org.uk/).

#### Pipeline summary


*   Acquire full text articles from the [CORE API](https://api.core.ac.uk/docs/v3)
*   Load Science Museum Group collections data
*   Identify entities in one article using the named entity recognition (NER) model [GLiNER](https://github.com/urchade/GLiNER)
*   Create embeddings for entities and museum objects using [Sentence Transformers](https://sbert.net/)
*   Visualise entities and objects in vector space

Parts of this code were written with the help of ChatGPT (GPT-4o).




### Install the required packages

In [None]:
!pip install umap-learn gliner

### Import packages

In [None]:
import pandas as pd
from gliner import GLiNER
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import requests
import json
import plotly.express as px
from umap.umap_ import UMAP

### Acquire Full Text Articles
For this example we are going to use the CORE API, 'The world’s largest collection of open access research papers'. You can read the [API documentation here](https://api.core.ac.uk/docs/v3#section/Welcome!).

We will pull just three articles from the API in this case, but this could be performed at a much larger scale if desired. To make things easier, we have pre-selected the identifiers for our three articles in question.

Although not strictly necessary, we recommend that you use a free [API Key](https://core.ac.uk/services/api#what-is-included). The code below assumes that you have already acquired an API key.
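Since the key is optional, one way to handle both cases is to read it from an environment variable and send the `Authorization` header only when a key is present. The sketch below assumes a hypothetical environment variable name, `CORE_API_KEY`, and a hypothetical helper, `build_headers`; neither is part of the CORE API.

```python
import os

def build_headers(api_key=None):
    """Return request headers, adding the Bearer token only when a key is given."""
    if api_key:
        return {"Authorization": f"Bearer {api_key}"}
    return {}

# CORE_API_KEY is an assumed variable name; set it in your environment beforehand
headers = build_headers(os.environ.get("CORE_API_KEY"))
```

This keeps the key out of the notebook itself, which is helpful if you plan to share it.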

In [None]:
api_key = "YOUR_API_KEY" # Replace with your API key

identifiers = ["52191", "7314055", "573860020"] # These are the identifiers for the three articles that we have selected
                                                # Feel free to add more!

url_template = "https://api.core.ac.uk/v3/outputs/{identifier}"

headers = {
    "Authorization": f"Bearer {api_key}",
}

data = []

for identifier in identifiers:
    url = url_template.format(identifier=identifier)
    response = requests.get(url, headers=headers, timeout=30)  # time out rather than hang on a stalled request

    if response.status_code == 200:
        article_data = response.json()
        data.append(article_data)
    else:
        print(f"Failed to fetch article {identifier}: {response.status_code} {response.reason}")

# Save the collected data to a JSON file
output_file = "/content/articles_data.json" # This will save in your Colab environment. You may wish to change the directory
with open(output_file, "w") as file:
    json.dump(data, file, indent=4)

print(f"Data for {len(data)} articles saved to {output_file}")
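The later steps read each article's `fullText` field, so it can be worth confirming that every record actually contains one before moving on. A minimal sketch, using a hypothetical helper `missing_full_text` and made-up sample records in place of real API responses:

```python
def missing_full_text(records):
    """Return the positions of records whose 'fullText' field is absent or blank."""
    return [
        i for i, record in enumerate(records)
        if not (record.get("fullText") or "").strip()
    ]

# Made-up records standing in for the `data` list built above
sample = [
    {"id": 1, "fullText": "Spinning mules were widely used..."},
    {"id": 2, "fullText": None},
    {"id": 3},
]
print(missing_full_text(sample))  # prints [1, 2]
```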

### Load objects and articles data



First, we will load the JSON file of article records that we retrieved from the CORE API.

In [None]:
articles_df = pd.read_json('/content/articles_data.json')


In [None]:
articles_df.head()

Now we can load our museum objects data. While you could retrieve this from the [Science Museum Group API](https://www.sciencemuseumgroup.org.uk/our-work/our-collection/using-our-collection-api), we have sped things up by adding a CSV file to GitHub. Bear in mind that this data dates from August 2024, and the most up-to-date information will always be accessible via the API.

In [None]:
objects_df = pd.read_csv('https://raw.githubusercontent.com/congruence-engine/Object-Enriched-Bibliography/refs/heads/main/datasets/objects_metadata.csv')

In [None]:
objects_df.head()

In [None]:
object_texts = objects_df['Collection Online Title'].tolist()


### Identify 'object' entities using GLiNER
We do this using GLiNER, a 'universal' named entity recognition model.
Here we have used **[gliner_medium-v2.1](https://huggingface.co/urchade/gliner_medium-v2.1)**.

In this instance we extract entities that the model recognises as 'objects' in a single article. You can run this separately on each of the three articles that we pulled from the CORE database if you like, to see how the extracted entities differ.

To see a full list of models in the GLiNER family, visit the project's [GitHub repository](https://github.com/urchade/GLiNER).

In [None]:
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

In [None]:
# Retrieve the article text. In this case, we'll use index 0 which is the first article.
# You can change this number to 1 or 2 to test the other articles later on.
text = articles_df['fullText'].iloc[0]

# The code below splits the text into word-based chunks so that each one stays under the model's maximum context length (384 tokens)
def chunk_text(text, max_words=200):
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return chunks

chunks = chunk_text(text, max_words=120)
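To see what the chunking step does, here is the same word-based splitting applied to a toy sentence with a small `max_words`. The function is repeated so that the example runs on its own; the sample text is invented.

```python
def chunk_text(text, max_words=200):
    # Split on whitespace, then rejoin the words in groups of max_words
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

sample_text = "the power loom replaced the hand loom in many weaving sheds"
for chunk in chunk_text(sample_text, max_words=4):
    print(chunk)
```

Because chunks are cut at fixed word counts, an entity that straddles a boundary may be split across two chunks and missed.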

In [None]:
labels = ["object"] # This defines the entities that we will be looking for in the text.
                    # Here we have gone for a very broad 'object' label, but you can be as specific as you like
                    # You can also include multiple labels for different entities

all_entities = []

# Run GLiNER on each chunk and collect the entities
for chunk in chunks:
    entities = model.predict_entities(chunk, labels, threshold=0.4)
    all_entities.extend(entities)

In [None]:
# Extract unique object terms from the entities
article_objects = list(set(entity['text'] for entity in all_entities if entity['label'] == "object"))
print("Identified Objects:", article_objects)

In [None]:
len(article_objects)
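Note that `set()` treats 'Loom' and 'loom' as different terms. If that produces near-duplicates, one option is to normalise case and keep the highest-scoring mention of each term. The sketch below assumes each entity is a dictionary with `text`, `label` and `score` keys, as in the output of `predict_entities`; the helper name `unique_entities` and the sample entities are invented for illustration.

```python
def unique_entities(entities, label="object"):
    """Keep one entry per lower-cased term, retaining the highest score seen."""
    best = {}
    for entity in entities:
        if entity["label"] != label:
            continue
        key = entity["text"].strip().lower()
        if key not in best or entity["score"] > best[key]:
            best[key] = entity["score"]
    # Sort terms from most to least confident
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Invented entities in the assumed GLiNER output shape
sample_entities = [
    {"text": "Loom", "label": "object", "score": 0.62},
    {"text": "loom", "label": "object", "score": 0.81},
    {"text": "spinning jenny", "label": "object", "score": 0.74},
]
print(unique_entities(sample_entities))
```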

### Create embeddings
This section of code will create embeddings for:

*   the 'object' entities extracted using GLiNER
*   the titles of all objects in our sample dataset of textile-related objects from the SMG collection

To generate these embeddings, we will be using the Sentence Transformer **all-MiniLM-L6-v2**.



In [None]:
# Initialize embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# Generate embeddings for each identified 'object' entity
article_embeddings = {term: embedding_model.encode(term) for term in article_objects}

# Encode museum object titles with the same model
object_texts = objects_df['Collection Online Title'].tolist()
objects_embeddings = embedding_model.encode(object_texts)

# Find the closest museum object for each entity extracted from the article
matches = []
for term, term_embedding in article_embeddings.items():
    similarities = cosine_similarity([term_embedding], objects_embeddings)[0]
    best_match_idx = np.argmax(similarities)
    best_match_score = similarities[best_match_idx]

    matches.append({
        'article_object': term,
        'best_match_object': objects_df['Collection Online Title'].iloc[best_match_idx],
        'similarity_score': best_match_score
    })

# Convert matches to DataFrame for easy viewing
matches_df = pd.DataFrame(matches)
matches_df = matches_df.sort_values(by='similarity_score', ascending=False)

In [None]:
matches_df
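The matching loop keeps only the single best match for each term. To make the similarity score concrete, and to show how several candidates could be kept instead, here is a plain-Python sketch of cosine similarity and a top-k ranking. The vectors and titles are toy values rather than real embeddings, and `top_k_matches` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_matches(term_vector, object_vectors, titles, k=2):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [(title, cosine(term_vector, vec)) for title, vec in zip(titles, object_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy two-dimensional vectors; real embeddings have many more dimensions
titles = ["Hand loom", "Spinning wheel", "Steam engine"]
object_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k_matches([1.0, 0.0], object_vectors, titles))
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means they are unrelated. Keeping a few candidates per term can make it easier to spot cases where the top match is only marginally ahead of the rest.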

### Visualise the results

Here we visualise the results in vector space. This will help us to see how closely the extracted entities match the objects from the SMG dataset.

You can try the full process with each of the three articles that we loaded originally from CORE. You'll notice that one of the articles returns many more candidates than the others!

In [None]:
# Combine the extracted-entity embeddings and the museum object embeddings
combined_embeddings = np.vstack((list(article_embeddings.values()), objects_embeddings))

# Create labels for visualization.
# A separate name is used so that the GLiNER `labels` list defined earlier is not overwritten
# if you re-run the entity extraction cells for another article.
point_labels = ['NER Object'] * len(article_embeddings) + ['SMG Object'] * len(objects_embeddings)

In [None]:

reducer = UMAP(n_components=3, n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)
embedding_3d = reducer.fit_transform(combined_embeddings)

In [None]:
object_name = list(article_objects) + objects_df['Collection Online Title'].tolist()


In [None]:
embedding_df = pd.DataFrame({
    'x': embedding_3d[:, 0],
    'y': embedding_3d[:, 1],
    'z': embedding_3d[:, 2],
    'label': point_labels,
    'name': object_name
})

fig = px.scatter_3d(
    embedding_df,
    x='x',
    y='y',
    z='z',
    color='label',
    hover_data={
        'x': False,
        'y': False,
        'z': False,
        'label': True,
        'name': True
    },
    title='NER Terms and Museum Objects',
    opacity=0.7
)

fig.update_traces(marker=dict(size=4))
fig.show()