# ATRIUM Task 4.1.3 Experiment

This notebook demonstrates applying the USW vocabulary-based NER pipeline to records in a JSONL data file supplied by ATHENA, and merging the results back into those records for subsequent processing.
The pipeline identifies terms originating from the following three Linked Open Data controlled vocabularies:

* [FISH Event Types Thesaurus](http://purl.org/heritagedata/schemes/agl_et)
* [FISH Archaeological Sciences Thesaurus](http://purl.org/heritagedata/schemes/560)
* [Getty Art & Architecture Thesaurus - Activities facet](https://vocab.getty.edu/aat/300404112)

The [Forum on Information Standards in Heritage (FISH)](https://heritage-standards.org.uk/) terminology working group administer and maintain UK national standard controlled terminologies for cultural heritage - known as the [FISH Vocabularies](https://heritage-standards.org.uk/fish-vocabularies/). These are made freely available for download and use as Linked Open Data via the [Heritage Data](https://www.heritagedata.org/) site. 

The [Getty Art & Architecture Thesaurus (AAT)](http://vocab.getty.edu/aat/) is made available online as Linked Open Data. The (poly)hierarchical structure is subdivided at the top level into a number of facets. We have isolated terms originating from the [AAT Activities facet](https://vocab.getty.edu/aat/300404112) for this exercise.

The listing that follows the source code cells below shows the actual output of the process. These results are also merged into the original input data under each record's 'spans' section and saved as a new file. The following example illustrates how the spans identified in the third result record are merged into the existing 'spans' section:

![Span results added to the existing 'spans' section](img/added-span-results.png "Added span results")
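
The merge step illustrated above can be sketched in isolation, without spaCy or the vocabulary components. This is a minimal sketch under stated assumptions, not the notebook's actual code: the record, its text, and the `EXISTING_LABEL` span are hypothetical, and the helper names `span_key` and `merge_spans` are introduced here for illustration only. The sketch treats two spans as the same when their start, end, label and id all match, which is an assumption of this example.

```python
def span_key(span: dict) -> tuple:
    # Assumption of this sketch: a span is identified by its
    # character offsets, its label and its vocabulary concept id
    return (span.get("start"), span.get("end"), span.get("label"), span.get("id"))


def merge_spans(existing: list, new: list) -> list:
    """Return the existing spans plus any new spans not already present."""
    merged = list(existing)
    seen = {span_key(span) for span in merged}
    for span in new:
        key = span_key(span)
        if key not in seen:
            seen.add(key)
            merged.append(span)
    return merged


# Hypothetical input record with one pre-existing span
record = {
    "text": "Samples were radiocarbon dated.",
    "spans": [{"start": 0, "end": 7, "label": "EXISTING_LABEL"}],
}

# Hypothetical NER results: two vocabularies match the same text
new_spans = [
    {"start": 13, "end": 30, "label": "FISH_ARCHSCIENCE",
     "id": "http://purl.org/heritagedata/schemes/560/concepts/142188",
     "text": "radiocarbon dated"},
    {"start": 13, "end": 30, "label": "AAT_ACTIVITY",
     "id": "http://vocab.getty.edu/aat/300054717",
     "text": "radiocarbon dated"},
]

# Merging twice simulates running the notebook more than once
record["spans"] = merge_spans(record["spans"], new_spans)
record["spans"] = merge_spans(record["spans"], new_spans)
print(len(record["spans"]))
```

Because already-merged spans are skipped, repeating the merge leaves the record unchanged, so the pre-existing span and the two new spans are each stored once.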



In [1]:
%%capture
import warnings
# suppress user warnings during execution
warnings.filterwarnings(action='ignore', category=UserWarning)

# install prerequisites
%pip install spacy
%pip install srsly
%sx python -m spacy download en_core_web_sm

In [5]:
import spacy # for NER processing
from spacy import displacy # for visualisation of NER tagged text
import srsly # for JSONL serialization/deserialization
from rematch2 import VocabularyRuler, DocSummary # custom vocabulary-based NER components
from IPython.display import display, HTML # for displaying HTML output in Python notebook

# read records from a JSONL file
def read_data(file_name: str="") -> list:
    return list(srsly.read_jsonl(file_name))

# write records to a JSONL file
# (None rather than [] as the default avoids a shared mutable default argument)
def write_data(file_name: str="", data: list=None):
    srsly.write_jsonl(file_name, data or [])

# check if a given span exists in a list of spans
# comparing start position, end position, label and concept id
# (the id is included so that different vocabulary concepts
# matched on the same text are not treated as duplicates)
def span_exists(span: dict, lst: list) -> bool:
    keys = ("start", "end", "label", "id")
    return any(
        all(item.get(key) == span.get(key) for key in keys)
        for item in lst
    )

if __name__ == '__main__':
    # read input data from file
    FILE1 = "./data/input/sample_input.jsonl"
    FILE2 = "./data/input/sample_annotated_output.jsonl"
    FILE3 = "./data/input/journal_metadata.jsonl"
    FILE4 = "./data/input/report_metadata.jsonl"    
    input_filename = FILE2
    data = read_data(input_filename)

    # set up default base NER pipeline (English)
    nlp = spacy.load("en_core_web_sm", disable = ['ner'])

    # add custom pipeline NER components
    nlp.add_pipe("fish_event_types_ruler", last=True)   
    nlp.add_pipe("fish_archsciences_ruler", last=True)   
    nlp.add_pipe("aat_activities_ruler", last=True)  
    
    for item in data:
        identifier = item.get("meta", {}).get("id", "")

        # run NER pipeline against input text
        doc = nlp(item.get("text",""))
        
        # display HTML summary of NER results (see below)
        summary = DocSummary(doc)
        display(HTML(f"<h3>[ID: {identifier}]</h3>"))
        if not summary.spans(format="list"):
            display(doc.text)
        else:
            display(HTML(summary.doctext(format="html")))
        
        display(HTML(summary.spans(format="html")))
        display(HTML("<hr>"))

        # add new spans to the existing spans array,
        # checking for duplicates (in case multiple runs)
        the_spans = item.get("spans", []) 
        new_spans = summary.spans(format="list")
        for span in new_spans:
            if not span_exists(span, the_spans):
                the_spans.append(span)
        item["spans"] = the_spans
    
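    # output the modified structure to a (new) JSONL file,
    # dropping the input's ".jsonl" suffix to avoid a doubled extension
    # (str.removesuffix requires Python 3.9+)
    output_filename = f"{input_filename.removesuffix('.jsonl')}_plus_vocab_ner.jsonl"
    write_data(output_filename, data)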


start,end,token_start,token_end,label,id,text
22,38,2,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300223990,spatial analysis


start,end,token_start,token_end,label,id,text
13,22,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077121,collected
69,79,9,9,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379618,carbonized
173,182,41,41,AAT_ACTIVITY,http://vocab.getty.edu/aat/300138076,processed
173,182,41,41,AAT_ACTIVITY,http://vocab.getty.edu/aat/300078065,processed


start,end,token_start,token_end,label,id,text
37,60,7,8,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142158,magnetic susceptibility
97,104,14,14,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provide


start,end,token_start,token_end,label,id,text
76,93,15,16,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dated
76,93,15,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,radiocarbon dated
88,93,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dated
88,93,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,dated
218,222,37,37,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,date
218,222,37,37,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,date


start,end,token_start,token_end,label,id,text
0,10,0,0,AAT_ACTIVITY,http://vocab.getty.edu/aat/300380461,Calibrated
188,215,31,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300081742,neutron activation analysis
196,215,32,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300251478,activation analysis


"This situation has changed with recent archaeological, paleontological, and wetland coring research conducted on O'ahu's 'Ewa Plain, a hot, dry emerged limestone reef characterized by numerous sinkholes."

start,end,token_start,token_end,label,id,text
29,47,3,4,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating
29,47,3,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,radiocarbon dating


start,end,token_start,token_end,label,id,text
3,14,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137546,investigate
81,89,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analysed
161,169,30,30,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analysed


start,end,token_start,token_end,label,id,text
128,136,22,22,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137483,employed


start,end,token_start,token_end,label,id,text
30,36,6,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379805,plowed


start,end,token_start,token_end,label,id,text
153,171,26,27,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating
102,113,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300262794,demonstrate
153,171,26,27,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,radiocarbon dating
165,171,27,27,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dating
165,171,27,27,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,dating
212,220,36,36,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provides


start,end,token_start,token_end,label,id,text
8,14,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300055545,assess


start,end,token_start,token_end,label,id,text
60,70,9,10,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142177,OSL dating
105,112,18,18,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142215,working
60,70,9,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300266050,OSL dating
64,70,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dating
64,70,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,dating
105,112,18,18,AAT_ACTIVITY,http://vocab.getty.edu/aat/300412186,working
143,154,24,24,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054608,constructed


start,end,token_start,token_end,label,id,text
148,156,23,23,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379429,sampling


'A small (one liter) flotation sample was taken from the soil surrounding and inside the skull.'

start,end,token_start,token_end,label,id,text
46,52,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077610,record


start,end,token_start,token_end,label,id,text
0,23,0,2,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142100,Amino acid racemisation
56,63,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provide


'Pollen studies concluded that the pre-human island Rapa Nui was dominated by a now extinct palm, Paschalococos disperta.'

start,end,token_start,token_end,label,id,text
78,82,17,17,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142111,burn
78,82,17,17,AAT_ACTIVITY,http://vocab.getty.edu/aat/300228062,burn
84,93,19,19,AAT_ACTIVITY,http://vocab.getty.edu/aat/300080091,describes


start,end,token_start,token_end,label,id,text
74,85,13,13,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137546,investigate


start,end,token_start,token_end,label,id,text
102,110,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300226216,examined


start,end,token_start,token_end,label,id,text
152,170,24,25,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating
97,105,14,14,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analyses
152,170,24,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,radiocarbon dating
164,170,25,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dating
164,170,25,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,dating


start,end,token_start,token_end,label,id,text
22,29,4,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300343813,studied
215,223,40,40,AAT_ACTIVITY,http://vocab.getty.edu/aat/300237969,simulate


start,end,token_start,token_end,label,id,text
147,153,23,23,AAT_ACTIVITY,http://vocab.getty.edu/aat/300182748,washed
147,153,23,23,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053042,washed
155,160,25,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053758,dried
199,206,33,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300194584,assayed


start,end,token_start,token_end,label,id,text
35,42,5,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404521,showing
35,42,5,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054766,showing


'The 17 radiocarbon determinations from the Pulemelei mound site were used to generate a local prehistoric sequence for the Letolo area.'

start,end,token_start,token_end,label,id,text
0,17,0,1,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,Radiocarbon dates
0,17,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,Radiocarbon dates
12,17,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dates
12,17,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054714,dates


start,end,token_start,token_end,label,id,text
0,16,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300223990,Spatial analysis


start,end,token_start,token_end,label,id,text
3,9,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137552,report


'A series of AMS 14C dates indicate that most of the ochres and all pieces of facetted ochre were deposited between 1200 and 1400 years ago.'

start,end,token_start,token_end,label,id,text
114,121,19,19,AAT_ACTIVITY,http://vocab.getty.edu/aat/300224146,removed
157,163,28,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053564,reduce
243,251,43,43,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077124,compared


'Background In 1997, we conducted exca vations on the banks of Ain Soda, a pool in the marshland of the Azraq Oasis in eastern Jordan, along the shore of what was once a large Pleistocene lake.'

start,end,token_start,token_end,label,id,text
41,69,6,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379555,principal component analysis


start,end,token_start,token_end,label,id,text
39,46,9,9,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077121,collect


start,end,token_start,token_end,label,id,text
91,109,20,21,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,Radiocarbon Dating
13,19,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077506,listed
13,19,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300079658,listed
91,109,20,21,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,Radiocarbon Dating
123,152,26,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300264220,accelerator mass spectrometry
135,152,27,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300225881,mass spectrometry


start,end,token_start,token_end,label,id,text
44,62,5,6,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating
44,62,5,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054717,radiocarbon dating
63,69,7,7,AAT_ACTIVITY,http://vocab.getty.edu/aat/300393207,served


'As a further test of sample integrity for the bioapatite stable isotope data, infrared analyses were performed on a subset of 33 samples of bone bioapatite that included all samples with greater than 1.2 percent or less than .7 percent carbonate carbon by weight.'

start,end,token_start,token_end,label,id,text
10,17,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137062,joining
72,81,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053768,supported


start,end,token_start,token_end,label,id,text
131,138,27,27,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379753,fencing
335,342,67,67,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054711,planted
427,439,88,88,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137847,representing


start,end,token_start,token_end,label,id,text
19,27,4,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300080091,describe
65,94,11,13,AAT_ACTIVITY,http://vocab.getty.edu/aat/300264220,accelerator mass spectrometry
77,94,12,13,AAT_ACTIVITY,http://vocab.getty.edu/aat/300225881,mass spectrometry
144,152,25,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053890,tempered


start,end,token_start,token_end,label,id,text
16,24,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provides
65,71,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300417305,felled
65,71,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300445429,felled


start,end,token_start,token_end,label,id,text
278,283,49,49,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404521,shown
278,283,49,49,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054766,shown


start,end,token_start,token_end,label,id,text
15,23,6,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analyzed


start,end,token_start,token_end,label,id,text
169,175,29,29,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054695,tested


start,end,token_start,token_end,label,id,text
3,9,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137552,report
165,182,27,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300225881,Mass Spectrometry
194,202,33,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137570,identify


start,end,token_start,token_end,label,id,text
16,24,3,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300226216,examined


start,end,token_start,token_end,label,id,text
14,20,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053130,models
14,20,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300155050,models
136,141,29,29,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053097,fixed


start,end,token_start,token_end,label,id,text
127,135,17,17,AAT_ACTIVITY,http://vocab.getty.edu/aat/300236367,adjusted


start,end,token_start,token_end,label,id,text
13,35,3,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300081693,trace element analysis
218,229,38,38,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053892,transported


start,end,token_start,token_end,label,id,text
0,7,0,0,AAT_ACTIVITY,http://vocab.getty.edu/aat/300239496,Running
30,37,5,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053130,modeled
30,37,5,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300155050,modeled
48,56,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analyzed
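
The rows in the listings above can be read as comma-separated records. The following is a minimal sketch, assuming the column layout shown in each listing header; it uses only the Python standard library, and the helper name `group_concepts` is hypothetical. It groups the concept URIs matched at each character range, using the rows of the second listing above, where the word "processed" matches two different AAT concepts. In these rows the character offsets behave as end-exclusive (the text length equals `end - start`), while the token offsets appear to be inclusive (a single-token match has `token_start` equal to `token_end`).

```python
import csv
import io
from collections import defaultdict

# Rows copied from the second result listing above
LISTING = """\
start,end,token_start,token_end,label,id,text
13,22,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077121,collected
69,79,9,9,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379618,carbonized
173,182,41,41,AAT_ACTIVITY,http://vocab.getty.edu/aat/300138076,processed
173,182,41,41,AAT_ACTIVITY,http://vocab.getty.edu/aat/300078065,processed
"""


def group_concepts(listing: str) -> dict:
    """Map (start, end, text) to the list of concept URIs matched there."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(listing)):
        start, end = int(row["start"]), int(row["end"])
        # character offsets are end-exclusive in these rows
        assert end - start == len(row["text"])
        grouped[(start, end, row["text"])].append(row["id"])
    return dict(grouped)


concepts = group_concepts(LISTING)
for (start, end, text), ids in concepts.items():
    print(f"{text} [{start}:{end}] -> {len(ids)} concept(s)")
```

Grouping in this way makes ambiguous matches visible, which may be useful when deciding how later processing should treat text that maps to more than one vocabulary concept.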
