# ATRIUM Task 4.1.3 Experiment

This notebook applies the USW vocabulary-based NER pipeline to records in a JSONL data file supplied by ATHENA, then merges the results back into the input records for subsequent processing.
The pipeline identifies terms originating from the following three Linked Open Data controlled vocabularies:

* [FISH Event Types Thesaurus](http://purl.org/heritagedata/schemes/agl_et)
* [FISH Archaeological Sciences Thesaurus](http://purl.org/heritagedata/schemes/560)
* [Getty Art & Architecture Thesaurus - Activities facet](https://vocab.getty.edu/aat/300404112)

The [Forum on Information Standards in Heritage (FISH)](https://heritage-standards.org.uk/) terminology working group administers and maintains the UK national standard controlled terminologies for cultural heritage, known as the [FISH Vocabularies](https://heritage-standards.org.uk/fish-vocabularies/). These are freely available for download and use as Linked Open Data via the [Heritage Data](https://www.heritagedata.org/) site.

The [Getty Art & Architecture Thesaurus (AAT)](http://vocab.getty.edu/aat/) is made available online as Linked Open Data. The (poly)hierarchical structure is subdivided at the top level into a number of facets. We have isolated terms originating from the [AAT Activities facet](https://vocab.getty.edu/aat/300404112) for this exercise.
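
As a rough illustration of what vocabulary-based matching means, the sketch below looks up a toy two-term vocabulary in a sentence using plain regular expressions. It is not the rematch2 implementation, which runs as spaCy pipeline components. The `VOCAB` table, the `find_vocab_spans` helper and the sample sentence are made up for this example; the two concept URIs are copied from the results listing in this notebook.

```python
import re

# Toy vocabulary (hypothetical structure): term -> (label, concept URI).
# The URIs are taken from the results listing in this notebook.
VOCAB = {
    "radiocarbon dating": (
        "FISH_ARCHSCIENCE",
        "http://purl.org/heritagedata/schemes/560/concepts/142188",
    ),
    "spatial analysis": (
        "AAT_ACTIVITY",
        "http://vocab.getty.edu/aat/300223990",
    ),
}


def find_vocab_spans(text: str, vocab: dict) -> list:
    """Return one dict per case-insensitive whole-phrase match, ordered by start offset."""
    spans = []
    for term, (label, uri) in vocab.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append({
                "start": match.start(),  # character offset of the first character
                "end": match.end(),      # character offset just past the last character
                "label": label,
                "id": uri,
                "text": match.group(0),
            })
    return sorted(spans, key=lambda s: s["start"])


# Hypothetical sample sentence
example_spans = find_vocab_spans(
    "Spatial analysis and radiocarbon dating were used.", VOCAB
)
for span in example_spans:
    print(span)
```

Unlike this exact-phrase lookup, the real pipeline also matches inflected forms (the listing pairs "analysed" and "radiocarbon dated" with vocabulary concepts), which a plain string match would miss.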

The listing that follows the source code below shows the actual output of the process. The results are also merged into each originating input record under its 'spans' section and saved as a new file. The following example illustrates how the spans identified in the third result record are merged into the existing 'spans' section:

![Added span results](img/added-span-results.png "Added span results")
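
The merge step can be sketched in isolation as follows. This is a minimal stand-alone version, assuming each span is a dict with `start`, `end` and `label` keys; the `merge_spans` helper, the record text and the pre-existing `METHOD` span are hypothetical and only serve to show that re-running the merge does not create duplicates.

```python
def merge_spans(existing: list, new: list) -> list:
    """Return the existing spans plus any new spans not already present.

    Two spans count as the same when start, end and label all match.
    """
    seen = {(s.get("start"), s.get("end"), s.get("label")) for s in existing}
    merged = list(existing)
    for span in new:
        key = (span.get("start"), span.get("end"), span.get("label"))
        if key not in seen:
            seen.add(key)
            merged.append(span)
    return merged


# Hypothetical input record with one pre-existing annotation
record = {
    "text": "Palaeobotanical analysis and radiocarbon dating were undertaken.",
    "spans": [{"start": 0, "end": 24, "label": "METHOD"}],
}

# Spans shaped like the rows in the results listing
new_spans = [
    {"start": 0, "end": 24, "label": "AAT_ACTIVITY",
     "id": "http://vocab.getty.edu/aat/300054595"},
    {"start": 29, "end": 47, "label": "FISH_ARCHSCIENCE",
     "id": "http://purl.org/heritagedata/schemes/560/concepts/142188"},
]

record["spans"] = merge_spans(record["spans"], new_spans)
count_after_first_run = len(record["spans"])

# A second run with the same results adds nothing
record["spans"] = merge_spans(record["spans"], new_spans)
count_after_second_run = len(record["spans"])
```

Because the label is part of the comparison key, a new span that covers the same characters as an existing one but carries a different label is still added.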



In [1]:
%%capture
import warnings
# suppress user warnings during execution
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

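# install prerequisites
# (rematch2, which provides the custom NER components, is not installed here
# and is assumed to be importable already)
%pip install spacy
%pip install srsly
%pip install python-slugify
%sx python -m spacy download en_core_web_sm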

In [2]:
import spacy # for NER processing
from spacy import displacy # for visualisation of NER tagged text
from spacy.tokens import Span
import srsly # for JSONL serialization/deserialization
import os
import json
from slugify import slugify # for valid filenames
from rematch2 import VocabularyRuler, DocSummary, TextNormalizer, child_span_remover # custom vocabulary-based NER components
from IPython.display import display, HTML


# check whether a given span already exists in a list of spans,
# comparing start/end character positions and label
def span_exists(span: dict, lst: list) -> bool:
    start = span.get("start", 0)
    end = span.get("end", 0)
    label = span.get("label", "")
    return any(
        item.get("start") == start
        and item.get("end") == end
        and item.get("label") == label
        for item in lst
    )

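# read a supplementary list from a JSON file;
# returns an empty list (and prints the problem) if the file cannot be read or parsed
def read_json(file_name: str) -> list:
    data = []
    try:
        with open(file_name, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        print(f"Problem reading \"{file_name}\": {e}")
    return data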

if __name__ == '__main__':
    # read JSONL input data from file
    input_file_path = "./data/athena"
    input_file_name1 = "sample_input.jsonl"
    input_file_name2 = "sample_annotated_output.jsonl"
    #input_file_name3 = "journal_metadata.jsonl"
    #input_file_name4 = "report_metadata.jsonl"

    supp_list_act = read_json("./supp_list_en_AAT_ACTIVITIES.json")

    # change this to process other files   
    input_file_name = input_file_name2 

    input_file_full = os.path.join(input_file_path, input_file_name) 
    data: list = list(srsly.read_jsonl(input_file_full))

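    # load the base English pipeline with its built-in statistical NER disabled;
    # entities come only from the vocabulary-based components added below
    nlp = spacy.load("en_core_web_sm", disable=["ner"])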
    
    # add custom pipeline NER components
    nlp.add_pipe("normalize_text", before = "tagger")
    nlp.add_pipe("fish_event_types_ruler", last=True)   
    nlp.add_pipe("fish_archsciences_ruler", last=True) 
    nlp.add_pipe("aat_activities_ruler", last=True, config={"supp_list": supp_list_act}) 
    nlp.add_pipe("child_span_remover", last=True) 
    
    # process each item in the input data
    for item in data:
        identifier = item.get("meta", {}).get("id", "").strip()
        text = item.get("text", "")
        # run NER pipeline against input text
        doc = nlp(text)
        
        # display HTML summary of NER results (see below)
        summary = DocSummary(doc)
        display(HTML(f"<h3>[ID: {identifier}]</h3>"))
        if len(summary.spans) == 0:
            display(text)
        else:
            display(HTML(summary.doctext_to_html()))
        
        display(HTML(summary.spans_to_html()))
        #display(HTML(summary.tokens(format="html")))
        display(HTML("<hr>"))

        # add new spans to the existing spans array,
        # skipping duplicates (in case of multiple runs)
        the_spans: list = item.get("spans", []) 
        new_spans = summary.spans_to_list()
        for span in new_spans:
            if not span_exists(span, the_spans):
                the_spans.append(span)
        item["spans"] = the_spans
        #item["tokens2"] = summary.tokens_to_list()
    
    # create the output directory if it does not already exist
    output_file_path = os.path.join(input_file_path, "output")
    os.makedirs(output_file_path, exist_ok=True)

    # output the modified structure to a (new) JSONL file    
    output_file_name = os.path.join(output_file_path, f"{slugify(input_file_name)}-plus-vocab-ner.jsonl") 
    srsly.write_jsonl(output_file_name, data) 


[Substitution(find='ﬀ', repl='ff', ignoreCase=True), Substitution(find='ﬁ', repl='fi', ignoreCase=True), Substitution(find='ﬂ', repl='fl', ignoreCase=True), Substitution(find='ﬃ', repl='ffi', ignoreCase=True), Substitution(find='ﬄ', repl='ffl', ignoreCase=True), Substitution(find='ﬅ', repl='ft', ignoreCase=True), Substitution(find='ﬆ', repl='st', ignoreCase=True), Substitution(find='ß', repl='s', ignoreCase=True), Substitution(find='Ꜳ', repl='AA', ignoreCase=False), Substitution(find='ꜳ', repl='aa', ignoreCase=False), Substitution(find='Æ', repl='AE', ignoreCase=False), Substitution(find='æ', repl='ae', ignoreCase=False), Substitution(find='Œ', repl='OE', ignoreCase=False), Substitution(find='œ', repl='oe', ignoreCase=False), Substitution(find='(\\p{Letter})\\p{Separator}*[\\r\\n]([a-z])', repl='\\1 \\2', ignoreCase=True), Substitution(find='\\p{Dash_Punctuation}\\p{Separator}*[\\r\\n]([a-z])', repl='\\1', ignoreCase=True), Substitution(find='([^.])\\p{Separator}*[\\r\\n]', repl='\\1 '

start,end,token_start,token_end,label,id,text,span
22,38,2,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300223990,spatial analysis,spatial analysis


start,end,token_start,token_end,label,id,text,span
13,22,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077121,collected,collected
69,79,9,9,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142118,carbonised,carbonised
173,182,41,41,AAT_ACTIVITY,http://vocab.getty.edu/aat/300138076,processed,processed


start,end,token_start,token_end,label,id,text,span
41,64,7,8,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142158,magnetic susceptibility,magnetic susceptibility
101,108,14,14,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provide,provide


start,end,token_start,token_end,label,id,text,span
76,93,15,16,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dated,radiocarbon dated
218,222,37,37,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,date,date


start,end,token_start,token_end,label,id,text,span
190,217,31,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300081742,neutron activation analysis,neutron activation analysis


"This situation has changed with recent archaeological, paleontological, and wetland coring research conducted on O'ahu's 'Ewa Plain, a hot, dry emerged limestone reef characterized by numerous sinkholes."

start,end,token_start,token_end,label,id,text,span
0,24,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,Palaeobotanical analysis,Palaeobotanical analysis
29,47,3,4,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating,radiocarbon dating


start,end,token_start,token_end,label,id,text,span
3,14,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137546,investigate,investigate
83,91,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analysed,analysed
163,171,30,30,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analysed,analysed


start,end,token_start,token_end,label,id,text,span
0,20,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,Statistical analysis,Statistical analysis
34,51,5,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,measures analysis,measures analysis
134,142,22,22,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137483,employed,employed


start,end,token_start,token_end,label,id,text,span
30,36,6,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379805,plowed,plowed


start,end,token_start,token_end,label,id,text,span
102,113,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300262794,demonstrate,demonstrate
153,171,26,27,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating,radiocarbon dating
212,220,36,36,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provides,provides


start,end,token_start,token_end,label,id,text,span
8,14,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300055545,assess,assess
51,58,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053431,squared,squared


start,end,token_start,token_end,label,id,text,span
60,70,9,10,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142177,OSL dating,OSL dating
105,112,18,18,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142215,working,working
143,154,24,24,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054608,constructed,constructed


'One way ANOVA reveals a significant difference in alongshore meiofaunal density between the sampling station fronting the bulkhead at Site 1 and at sampling stations at a similar profile elevation at Sites 2 and 3.'

'A small (one liter) flotation sample was taken from the soil surrounding and inside the skull.'

start,end,token_start,token_end,label,id,text,span
46,52,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077610,record,record


start,end,token_start,token_end,label,id,text,span
0,23,0,2,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142100,Amino acid racemisation,Amino acid racemisation
49,54,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404795,dates,dates
56,63,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provide,provide


'Pollen studies concluded that the pre-human island Rapa Nui was dominated by a now extinct palm, Paschalococos disperta.'

start,end,token_start,token_end,label,id,text,span
88,97,19,19,AAT_ACTIVITY,http://vocab.getty.edu/aat/300080091,describes,describes


start,end,token_start,token_end,label,id,text,span
76,87,13,13,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137546,investigate,investigate


start,end,token_start,token_end,label,id,text,span
102,110,16,16,AAT_ACTIVITY,http://vocab.getty.edu/aat/300226216,examined,examined


start,end,token_start,token_end,label,id,text,span
87,107,13,14,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,macrofossil analyses,macrofossil analyses
154,172,24,25,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating,radiocarbon dating


start,end,token_start,token_end,label,id,text,span
22,29,4,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300343813,studied,studied
217,225,40,40,AAT_ACTIVITY,http://vocab.getty.edu/aat/300237969,simulate,simulate


start,end,token_start,token_end,label,id,text,span
149,155,26,26,AAT_ACTIVITY,http://vocab.getty.edu/aat/300182748,washed,washed
157,162,28,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053758,dried,dried
201,208,36,36,AAT_ACTIVITY,http://vocab.getty.edu/aat/300194584,assayed,assayed


start,end,token_start,token_end,label,id,text,span
35,42,5,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404521,showing,showing
374,388,62,63,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,model analyses,model analyses


'The 17 radiocarbon determinations from the Pulemelei mound site were used to generate a local prehistoric sequence for the Letolo area.'

start,end,token_start,token_end,label,id,text,span
0,17,0,1,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,Radiocarbon dates,Radiocarbon dates
22,37,3,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,pollen analysis,pollen analysis


start,end,token_start,token_end,label,id,text,span
0,16,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300223990,Spatial analysis,Spatial analysis


start,end,token_start,token_end,label,id,text,span
3,9,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137552,report,report
27,48,6,7,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,petrographie analysis,petrographie analysis


'A series of AMS 14C dates indicate that most of the ochres and all pieces of facetted ochre were deposited between 1200 and 1400 years ago.'

start,end,token_start,token_end,label,id,text,span
114,121,19,19,AAT_ACTIVITY,http://vocab.getty.edu/aat/300224146,removed,removed
157,163,28,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053564,reduce,reduce
243,251,43,43,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077124,compared,compared


'Background In 1997, we conducted exca vations on the banks of Ain Soda, a pool in the marshland of the Azraq Oasis in eastern Jordan, along the shore of what was once a large Pleistocene lake.'

start,end,token_start,token_end,label,id,text,span
41,69,6,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379555,principal component analysis,principal component analysis
135,139,20,20,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054769,open,open


start,end,token_start,token_end,label,id,text,span
42,49,11,11,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077121,collect,collect


start,end,token_start,token_end,label,id,text,span
13,19,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300077506,listed,listed
91,109,20,21,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,Radiocarbon Dating,Radiocarbon Dating
123,152,26,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300264220,accelerator mass spectrometry,accelerator mass spectrometry


start,end,token_start,token_end,label,id,text,span
14,42,2,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,dendrochronological analyses,dendrochronological analyses
44,62,5,6,FISH_ARCHSCIENCE,http://purl.org/heritagedata/schemes/560/concepts/142188,radiocarbon dating,radiocarbon dating
63,69,7,7,AAT_ACTIVITY,http://vocab.getty.edu/aat/300393207,served,served


start,end,token_start,token_end,label,id,text,span
78,95,14,15,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,infrared analyses,infrared analyses


start,end,token_start,token_end,label,id,text,span
10,17,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137062,joining,joining
32,51,4,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,coordinate analyses,coordinate analyses
74,83,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053768,supported,supported


start,end,token_start,token_end,label,id,text,span
135,142,27,27,AAT_ACTIVITY,http://vocab.getty.edu/aat/300379753,fencing,fencing
340,347,67,67,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054711,planted,planted
434,446,88,88,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137847,representing,representing


start,end,token_start,token_end,label,id,text,span
19,27,4,4,AAT_ACTIVITY,http://vocab.getty.edu/aat/300080091,describe,describe
65,94,11,13,AAT_ACTIVITY,http://vocab.getty.edu/aat/300264220,accelerator mass spectrometry,accelerator mass spectrometry
146,154,25,25,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053890,tempered,tempered


start,end,token_start,token_end,label,id,text,span
0,15,0,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,Pollen analysis,Pollen analysis
16,24,2,2,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137584,provides,provides
65,71,10,10,AAT_ACTIVITY,http://vocab.getty.edu/aat/300417305,felled,felled


start,end,token_start,token_end,label,id,text,span
280,285,49,49,AAT_ACTIVITY,http://vocab.getty.edu/aat/300404521,shown,shown
332,346,57,58,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,trace analyses,trace analyses


start,end,token_start,token_end,label,id,text,span
17,25,6,6,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analyzed,analyzed


'Coring proved an accurate and cost-effective alternative to traditional test-excavation, and its application in only two short field seasons doubled the number of sites tested in this region.'

start,end,token_start,token_end,label,id,text,span
3,9,1,1,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137552,report,report
56,72,10,11,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,isotope analysis,isotope analysis
165,182,27,28,AAT_ACTIVITY,http://vocab.getty.edu/aat/300225881,Mass Spectrometry,Mass Spectrometry
194,202,33,33,AAT_ACTIVITY,http://vocab.getty.edu/aat/300137570,identify,identify


start,end,token_start,token_end,label,id,text,span
16,24,3,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300226216,examined,examined


'Since CENTURY models only the top 20 cm of the soil, a uniform bulk density of 1.35 g cm-3, based on the mean of five measurements from fixed-depth 15 to 23 cm samples, was used throughout to calculate outputs on a per unit area basis.'

start,end,token_start,token_end,label,id,text,span
14,40,2,3,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,phytosociological analysis,phytosociological analysis
127,135,17,17,AAT_ACTIVITY,http://vocab.getty.edu/aat/300236367,adjusted,adjusted


start,end,token_start,token_end,label,id,text,span
13,35,3,5,AAT_ACTIVITY,http://vocab.getty.edu/aat/300081693,trace element analysis,trace element analysis
218,229,38,38,AAT_ACTIVITY,http://vocab.getty.edu/aat/300053892,transported,transported


start,end,token_start,token_end,label,id,text,span
0,7,0,0,AAT_ACTIVITY,http://vocab.getty.edu/aat/300239496,Running,Running
49,57,8,8,AAT_ACTIVITY,http://vocab.getty.edu/aat/300054595,analyzed,analyzed
