# Applying temporal information extraction techniques to GoTriple abstracts
This work has been produced as part of the [ATRIUM](https://atrium-research.eu/) project for Work Package 3 (Facilitating Discoverability of and Access to Humanities Resources), Task 3.4.2 (Translation of textual content). The code performs information extraction on multilingual abstracts sourced from the GoTriple portal, to identify temporal entities - named periods (from a specified authority of the [Perio.do](https://perio.do/) gazetteer) or year spans (phrases such as "*early 15th century*") mentioned anywhere within the abstract text.


In [1]:
%%capture

import warnings
# suppress user warnings during execution
warnings.filterwarnings(action='ignore', category=UserWarning)

# load required dependencies
%pip install --upgrade pip
%pip install spacy
%pip install srsly

%sx python -m spacy download en_core_web_sm
%sx python -m spacy download es_core_news_sm
%sx python -m spacy download fr_core_news_sm

## Obtaining GoTriple abstracts
Referencing the workflow [Using GoTriple data for your SSH data science tasks](https://marketplace.sshopencloud.eu/workflow/3rVKH7), there are two main methods to obtain abstracts from the [GoTriple portal](https://www.gotriple.eu/): either via API calls, or via bulk download of GoTriple metadata records (data dumps). The available [data dumps](https://zenodo.org/records/15784401) are grouped by domain/discipline. The source code below downloads (and caches) an 89MB compressed JSONL GZ format file for the chosen domain 'archeo' (Archaeology and Prehistory). It then reads and parses the compressed JSONL data file, extracting abstracts in the specified language for subsequent processing.
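
The selection step described above can be sketched using only the Python standard library. The record layout assumed here (an `id` plus an `abstract` list of `{lang, text}` objects) mirrors the fields read by the notebook code; the sample records and the helper name `first_abstract_for_language` are invented for illustration and are not taken from the GoTriple dump.

```python
import gzip
import json
import os
import tempfile
from typing import Optional

# hypothetical sample records, shaped like the fields read by this notebook
sample_records = [
    {"id": "a1", "abstract": [{"lang": "en", "text": "Early 15th century pottery."},
                              {"lang": "fr", "text": "Poterie du XVe siècle."}]},
    {"id": "a2", "abstract": [{"lang": "es", "text": "Cerámica del siglo XI."}]},
    {"id": "a3"},  # record with no abstract at all
]

def first_abstract_for_language(record: dict, language: str) -> Optional[dict]:
    # return the first abstract entry in the requested language, if any
    for abstract in record.get("abstract", []):
        if abstract.get("lang", "") == language:
            return abstract
    return None

# write the samples as gzipped JSON Lines (one JSON object per line)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for rec in sample_records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# read the file back line by line, keeping only English abstracts
english = []
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        abstract = first_abstract_for_language(record, "en")
        if abstract is not None:
            english.append({"id": record["id"], "lang": "en", "text": abstract["text"]})

print(english)
```

Reading line by line keeps memory use low for a large dump, since only the selected abstracts are held in the list.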

In [None]:
import os
import spacy # for NER processing
import srsly # for JSONL serialization/deserialization
import requests # for downloading files from URL
import urllib.parse # for URL encoding/decoding
from rematch2 import DocSummary
from rematch2.Util import *
from functools import reduce
from spacy.language import Language

# download file from URL to output path and return filename including path
# use a previously cached file if it exists, rather than downloading again
def download_file_from_url(url: str, output_path: str=".") -> str:    

    # ensure the intended output path exists   
    os.makedirs(output_path, exist_ok=True)

    # just use previously cached file if it exists
    file_name = url.split("?")[0].split("/")[-1]
    file_path = f"{output_path}/{file_name}" 

    if os.path.exists(file_path):
        print(f"Using cached file {file_path}")        
    else:
        print(f"Downloading file {url} to {file_path}")
        response = requests.get(url, stream=True)
        # fail early on HTTP errors so an error page is never cached as data
        response.raise_for_status()
        with open(file_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return file_path

# download GoTriple JSONL.GZ file for specified domain to output path and return filename including path
def download_gotriple_jsonl_gz_file_for_domain(domain: str="", output_path: str=".") -> str:
    disciplines = [
        "anthro-bio", "anthro-se", "archeo", "archi", "art", "class", 
        "demo", "droit", "eco", "edu", "envir", "genre", "geo", 
        "hisphilso", "hist", "info", "lang", "litt", "manag", "museo", 
        "musiq", "phil", "psy", "relig", "scipo", "socio", "stat"
    ]   
    discipline = domain.strip().lower()
    if discipline not in disciplines:
        raise ValueError(f"Domain '{domain}' not recognised - must be one of {disciplines}")

    url = f"https://zenodo.org/records/15784401/files/{urllib.parse.quote_plus(discipline)}_merged.jsonl.gz?download=1"
    return download_file_from_url(url, output_path)


# retrieve abstracts for specified language from a previously downloaded GoTriple JSONL.GZ file
# returns an array of records: [{id, lang, text}, {id, lang, text}, ...]
def get_abstracts_from_gotriple_jsonl_gz_file(file_path: str, language: str="en") -> list[dict]:
    data: list[dict] = []
    for record in srsly.read_gzip_jsonl(file_path, True):
        # check for a missing id before converting (str(None) would never be None)
        raw_id = record.get("id", None)
        if raw_id is not None:
            id: str = str(raw_id)
            # get abstract only for specified language            
            abstracts: list = record.get("abstract", [])
            abstract = next(filter(lambda a: a.get("lang", "") == language, abstracts), None)
            if abstract is not None:
                data.append({
                    "id": id, 
                    "lang": abstract.get("lang", ""), 
                    "text": abstract.get("text", "") 
                })           

    return data


## Information extraction pipeline
We use the open-source [spaCy](https://spacy.io/) Natural Language Processing library, but not its default NER functionality. Instead we create language-specific (English, French and Spanish) pipelines made up of specialised components for temporal information extraction, each component having a language-specific implementation. Each pipeline has the following components:

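* `tok2vec` - token-to-vector layer. Produces token embeddings used by the downstream components; splitting the text into tokens is done by the spaCy tokenizer before any pipeline component runs. Part of the base spaCy pipeline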
* `tagger` - Part of speech tagger to identify instances of NOUN, VERB etc. Part of the base spaCy pipeline
* `normalize_text` - Text normalisation to improve subsequent chances of matching. Normalises spelling, whitespace and punctuation
* `parser` - Dependency parser. Part of the base spaCy pipeline
* `attribute_ruler` - Part of the base spaCy pipeline
* `lemmatizer` - Assigns a base form to tokens to improve matching. Part of the base spaCy pipeline
* `ordinal_ruler` - identifies ordinals (e.g. "*Fifteenth*", "*15th*", "*15°*", "*XV*")
* `dateprefix_ruler` - identifies date prefixes (e.g. "*Early*", "*mediales*", "*intorno al*")
* `datesuffix_ruler` - identifies date suffixes (e.g. "*A.D.*", "*après J.C.*", "*D.C.*")
* `dateseparator_ruler` - identifies common separators (e.g. "*to*", "*and*", "*-*")
* `monthname_ruler` - identifies month names (e.g. "*March*", "*marzo*", "*mars*")
* `seasonname_ruler` - identifies season names (e.g. "*Autumn*", "*automne*", "*autunno*", "*otoño*")
* `yearspan_ruler` - identifies year spans (e.g. "*Early to mid 15th century*")
* `periodo_ruler` - identifies terms from the specified Perio.do authority (e.g. "*[Epipaléolithique](http://n2t.net/ark:/99152/p02chr452hz)*", "*[Bronze Age](http://n2t.net/ark:/99152/p0kh9ds7q8m)*", "*[Prima età imperiale](http://n2t.net/ark:/99152/p03dzfbztr7)*")
* `child_span_remover` - removes match results encompassed by a more specific larger match (e.g. "*[Bronze Age](http://n2t.net/ark:/99152/p0kh9ds7q8m)*" within match on "*[Early Bronze Age](http://n2t.net/ark:/99152/p0kh9ds7q8m)*")
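
The idea behind the final component can be sketched in plain Python: a match is dropped when a longer match fully covers it. This is only an illustration of the principle, not the actual `child_span_remover` implementation; the `Match` type and the `remove_child_spans` function are hypothetical names introduced here.

```python
from typing import NamedTuple

class Match(NamedTuple):
    start: int   # zero-based start character offset
    end: int     # offset of the character following the match
    label: str
    text: str

def remove_child_spans(matches: list[Match]) -> list[Match]:
    # keep a match only if no other, strictly longer match fully covers it
    kept = []
    for m in matches:
        covered = any(
            o.start <= m.start and m.end <= o.end
            and (o.end - o.start) > (m.end - m.start)
            for o in matches
        )
        if not covered:
            kept.append(m)
    return kept

text = "Finds from the Early Bronze Age"
matches = [
    Match(15, 31, "PERIOD", "Early Bronze Age"),
    Match(21, 31, "PERIOD", "Bronze Age"),  # lies inside the longer match
]
result = remove_child_spans(matches)
print([m.text for m in result])
```

Requiring the covering match to be strictly longer means two identical spans do not eliminate each other.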


The output will be an array of records each containing a `spans` element which is an array of temporal entities identified in the given text:  

```json
[{ "id": "123", "lang": "en", "text": "xxx", "spans": [] }, ...]
```

* `id` - identifier of the record from the data source
* `lang` - language code indicating the language of the text
* `text` - the text to be processed
* `spans` - array of temporal entities identified in the given text. 

An example individual 'span' element and a description of its properties are shown below:

```json
{ "start": 103, "end": 130, "token_start": 21, "token_end": 25, "label": "YEARSPAN", "id": "", "text": "primera mitad del siglo XIX" }
```

* `start` - the zero-based starting character position of the identified entity in the text.
* `end` - the zero-based position of the character following the identified entity in the text.
* `token_start` - the starting token position of the identified entity in the text. A token usually equates to either a word or punctuation. 
* `token_end` - the ending token position of the identified entity in the text.
* `label` - a label indicating the 'type' of entity identified. In this case the value will be either "YEARSPAN" (matched on a regular expression pattern) or "PERIOD" (matched on a [Perio.do](https://perio.do/en/) named period)
* `id` - identifier of the entity identified where applicable (a perio.do match will have an associated identifier)
* `text` - the text of the identified entity, as matched in the processed text
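
Because `end` points at the character after the entity, the character offsets follow Python's slicing convention. The following minimal sketch checks a span against its source text; it assumes the offsets refer to the text as processed by the pipeline (the `normalize_text` component may alter the original abstract). The abstract text and the helper name `span_matches_text` are invented for illustration.

```python
def span_matches_text(abstract_text: str, span: dict) -> bool:
    # 'end' is exclusive, so a plain slice should reproduce the span text
    return abstract_text[span["start"]:span["end"]] == span["text"]

# invented abstract text, for illustration only
abstract_text = "La iglesia fue reconstruida en la primera mitad del siglo XIX."
phrase = "primera mitad del siglo XIX"
start = abstract_text.index(phrase)

span = {
    "start": start,
    "end": start + len(phrase),
    "label": "YEARSPAN",
    "id": "",
    "text": phrase,
}

is_consistent = span_matches_text(abstract_text, span)
span_length = span["end"] - span["start"]
print(is_consistent, span_length)
```

The same arithmetic holds for the example span above: 130 - 103 = 27, which is the length of the matched phrase.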

In [4]:
# prepare English language spaCy pipeline
nlp_en: Language = spacy.load("en_core_web_sm", disable = ['ner'])
nlp_en.add_pipe("normalize_text", before = "parser")
nlp_en.add_pipe("yearspan_ruler", last=True) 
# using 'HE Cultural Periods' authority
nlp_en.add_pipe("periodo_ruler", last=True, config={ "periodo_authority_id": "p0kh9ds" })
nlp_en.add_pipe("child_span_remover", last=True) 

# prepare French language spaCy pipeline
nlp_fr: Language = spacy.load("fr_core_news_sm", disable = ['ner'])
nlp_fr.add_pipe("normalize_text", before = "parser")
nlp_fr.add_pipe("yearspan_ruler", last=True)  
# using 'PACTOLS chronology periods used in DOLIA data' authority
nlp_fr.add_pipe("periodo_ruler", last=True, config={ "periodo_authority_id": "p02chr4" })
nlp_fr.add_pipe("child_span_remover", last=True) 

# prepare Spanish language spaCy pipeline
nlp_es: Language = spacy.load("es_core_news_sm", disable = ['ner'])
nlp_es.add_pipe("normalize_text", before = "parser")
nlp_es.add_pipe("yearspan_ruler", last=True)  
# using 'SIA+ Chrono-Cultural Categories' authority
nlp_es.add_pipe("periodo_ruler", last=True, config={ "periodo_authority_id": "p07h9k6" })
nlp_es.add_pipe("child_span_remover", last=True) 

# prepare I/O paths
data_directory: str = "./data/gotriple"
if not os.path.exists(data_directory): os.makedirs(data_directory)

# download GoTriple data file for domain 'archeo' (Archaeology and Prehistory)
print("downloading GoTriple data file for domain 'archeo'")
input_file_path = download_gotriple_jsonl_gz_file_for_domain("archeo", data_directory)

# process abstracts for each specified language
languages = ["en", "fr", "es"]

for language in languages:
    # extract abstracts from the downloaded file
    print(f"extracting abstracts for language '{language}'")
    abstracts = get_abstracts_from_gotriple_jsonl_gz_file(input_file_path, language)
    
    #timestamp: str = DT.now().strftime('%Y%m%d')
    output_file_name: str = f"ner-output-gotriple-abstracts-{language}.jsonl.gz"
    output_file_path: str = os.path.join(data_directory, output_file_name)

    output_data = []
    record_count = len(abstracts)
    print(f"processing {record_count} records ({language}) from GoTriple abstracts data file")
    current_record_index = 0
    for record in abstracts: 
        # progress notification every 1000 records
        if current_record_index % 1000 == 0:
            print(f"processing record {current_record_index} of {record_count}")

        # temp break after 3000 records for testing
        if current_record_index == 3000:
            break
        
        lang = record.get("lang", "")      
        text = record.get("text", "")        
        record["spans"] = []

        if(len(text) > 0):
            # run the pipeline on the input text
            if(lang == "fr"):
                doc = nlp_fr(text)
            elif (lang == "es"):
                doc = nlp_es(text)
            else:
                doc = nlp_en(text)             
            
            # get the spans identified by the pipeline
            summary = DocSummary(doc)
            record["spans"] = summary.spans_to_list()
            
        current_record_index += 1 
    
    counter1 = reduce(lambda acc, r: acc + (1 if len(r.get("spans", [])) > 0 else 0), abstracts, 0)
    print(f"identified {counter1} records with spans for language '{language}'")
    counter2 = reduce(lambda acc, r: acc + len(r.get("spans", [])), abstracts, 0)
    print(f"identified {counter2} spans in total for language '{language}'")

    # write the id, abstract and spans to JSON file (only records with identified spans)
    filtered_abstracts = list(filter(lambda rec: len(rec.get("spans", [])) > 0, abstracts))
    print (f"writing {len(filtered_abstracts)} abstracts with spans for language '{language}' to '{output_file_path}'")         
    srsly.write_gzip_jsonl(output_file_path, filtered_abstracts)
    print("done")   


downloading GoTriple data file for domain 'archeo'
Using cached file ./data/gotriple/archeo_merged.jsonl.gz
extracting abstracts for language 'en'
processing 57449 records (en) from GoTriple abstracts data file
processing record 0 of 57449
processing record 1000 of 57449
processing record 2000 of 57449
processing record 3000 of 57449
identified 1365 records with spans for language 'en'
identified 3584 spans in total for language 'en'
writing 1365 abstracts with spans for language 'en' to './data/gotriple/ner-output-gotriple-abstracts-en.jsonl.gz'
done
extracting abstracts for language 'fr'
processing 1066 records (fr) from GoTriple abstracts data file
processing record 0 of 1066
processing record 1000 of 1066
identified 436 records with spans for language 'fr'
identified 1482 spans in total for language 'fr'
writing 436 abstracts with spans for language 'fr' to './data/gotriple/ner-output-gotriple-abstracts-fr.jsonl.gz'
done
extracting abstracts for language 'es'
processing 2586 record

## Results
In an experimental run on 09/10/2025 the code extracted a total of 57,449 English, 1,066 French and 2,586 Spanish abstracts. 
For performance reasons the subsequent processing on the English abstracts was limited to only the first 3,000 records.

* 3,584 temporal entities were identified within 1,365 out of 3,000 English abstracts.
* 1,482 temporal entities were identified within 436 out of 1,066 French abstracts.
* 1,803 temporal entities were identified within 1,057 out of 2,586 Spanish abstracts.
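
To compare the three languages on an equal footing, the reported counts can be turned into proportions. The short sketch below uses only the figures listed above; the dictionary layout and variable names are illustrative.

```python
# figures reported above: (abstracts processed, abstracts with entities, entities found)
results = {
    "en": (3000, 1365, 3584),
    "fr": (1066, 436, 1482),
    "es": (2586, 1057, 1803),
}

summary = {}
for language, (processed, with_spans, entities) in results.items():
    summary[language] = {
        # share of processed abstracts containing at least one temporal entity
        "percent_with_entities": round(100 * with_spans / processed, 1),
        # mean number of entities per abstract that had any
        "entities_per_matched_abstract": round(entities / with_spans, 2),
    }

for language, stats in summary.items():
    print(language, stats)
```

On these figures, roughly 41 to 46 percent of the processed abstracts in each language contain at least one temporal entity.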

In [7]:
# Get stats on each of the processed files   
for language in ["en", "fr", "es"]:
    data_directory: str = "./data/gotriple/"
    data_file_name: str = f"ner-output-gotriple-abstracts-{language}.jsonl.gz"
    data_file_path: str = os.path.join(data_directory, data_file_name)

    # Display the stats for the processed file
    if os.path.exists(data_file_path):
        print(f"Stats for file '{data_file_path}':")
        # read the data from the JSONL.GZ file (use read_gzip_jsonl for JSON Lines)
        data: list[dict] = []
        for record in srsly.read_gzip_jsonl(data_file_path, True):
            if record is not None:
                data.append(record)

        print(f"Total records: {len(data)}")
        total_spans = sum(len(record.get("spans", [])) for record in data)
        print(f"Total temporal entities identified: {total_spans}")
    else:
        print(f"File '{data_file_path}' does not exist.")
        print("No data to display stats for.")


Stats for file './data/gotriple/ner-output-gotriple-abstracts-en.jsonl.gz':
Total records: 1365
Total temporal entities identified: 3584
Stats for file './data/gotriple/ner-output-gotriple-abstracts-fr.jsonl.gz':
Total records: 436
Total temporal entities identified: 1482
Stats for file './data/gotriple/ner-output-gotriple-abstracts-es.jsonl.gz':
Total records: 1057
Total temporal entities identified: 1803


The output data files were then processed using a separate tool ([yearspans](https://github.com/cbinding/yearspans)) to add `minYear`, `maxYear`, `isoSpan` and `duration` properties to the identified temporal entities - see https://github.com/cbinding/yearspans/blob/main/ATRIUM-T3-2-process-gotriple-ner-abstracts.ipynb for details. Some example results from this further processing are shown below:

English:
```json
{
	"start": 786,
	"end": 794,
	"token_start": 163,
	"token_end": 164,
	"label": "PERIOD",
	"id": "http://n2t.net/ark:/99152/p0kh9dsbd33",
	"text": "Cold War",
	"minYear": "1946",
	"maxYear": "1991",
	"isoSpan": "1946/1991",
	"duration": 46
},
{
	"start": 81,
	"end": 106,
	"token_start": 17,
	"token_end": 21,
	"label": "YEARSPAN",
	"id": "",
	"text": "during the 4th century AD",
	"minYear": "0301",
	"maxYear": "0400",
	"isoSpan": "0301/0400",
	"duration": 100
},
{
	"start": 387,
	"end": 398,
	"token_start": 84,
	"token_end": 85,
	"label": "PERIOD",
	"id": "http://n2t.net/ark:/99152/p0kh9dscbkd",
	"text": "Middle Ages",
	"minYear": "1066",
	"maxYear": "1540",
	"isoSpan": "1066/1540",
	"duration": 475
},
{
	"start": 1029,
	"end": 1051,
	"token_start": 204,
	"token_end": 208,
	"label": "YEARSPAN",
	"id": "",
	"text": "6th and 8th century AD",
	"minYear": "0501",
	"maxYear": "0800",
	"isoSpan": "0501/0800",
	"duration": 300
},
```

Spanish:
```json
{
	"start": 197,
	"end": 208,
	"token_start": 43,
	"token_end": 45,
	"label": "YEARSPAN",
	"id": "",
	"text": "1404 y 1403",
	"minYear": "1403",
	"maxYear": "1404",
	"isoSpan": "1403/1404",
	"duration": 2
},
{
	"start": 157,
	"end": 167,
	"token_start": 27,
	"token_end": 28,
	"label": "PERIOD",
	"id": "http://n2t.net/ark:/99152/p07h9k6f5tz",
	"text": "Edad Media",
	"minYear": "0400",
	"maxYear": "1500",
	"isoSpan": "0400/1500",
	"duration": 1101
},
{
	"start": 517,
	"end": 525,
	"token_start": 96,
	"token_end": 97,
	"label": "YEARSPAN",
	"id": "",
	"text": "siglo XI",
	"minYear": "1001",
	"maxYear": "1100",
	"isoSpan": "1001/1100",
	"duration": 100
},
```

French:
```json
{
	"start": 1507,
	"end": 1528,
	"token_start": 250,
	"token_end": 254,
	"label": "YEARSPAN",
	"id": "",
	"text": "XIIe siecle av. J.-C.",
	"minYear": "-1199",
	"maxYear": "-1100",
	"isoSpan": "-1199/-1100",
	"duration": 100
},
{
	"start": 74,
	"end": 96,
	"token_start": 13,
	"token_end": 15,
	"label": "PERIOD",
	"id": "http://n2t.net/ark:/99152/p02chr45b7p",
	"text": "epoque romaine tardive",
	"minYear": "0300",
	"maxYear": "0499",
	"isoSpan": "0300/0499",
	"duration": 200
},
```
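
The century-based values in the examples above follow a regular pattern, which can be sketched as below. This is not the `yearspans` tool's implementation. The function name `century_to_year_span` is hypothetical, and the handling of BC centuries (signed years with 1 BC counted as year 0) is an assumption inferred from the "*XIIe siecle av. J.-C.*" example.

```python
def century_to_year_span(century: int, bc: bool = False) -> dict:
    # convert an ordinal century to first and last years
    if century < 1:
        raise ValueError("century must be 1 or greater")
    if bc:
        # assumption: 1 BC is year 0, so 1200 BC..1101 BC becomes -1199..-1100
        min_year = -(century * 100) + 1
        max_year = -((century - 1) * 100)
    else:
        # e.g. 4th century AD = 301..400
        min_year = (century - 1) * 100 + 1
        max_year = century * 100

    def fmt(year: int) -> str:
        # zero-pad to four digits, keeping a leading minus sign for negative years
        return f"-{abs(year):04d}" if year < 0 else f"{year:04d}"

    return {
        "minYear": fmt(min_year),
        "maxYear": fmt(max_year),
        "isoSpan": f"{fmt(min_year)}/{fmt(max_year)}",
        "duration": max_year - min_year + 1,  # inclusive count of years
    }

print(century_to_year_span(4))             # "during the 4th century AD"
print(century_to_year_span(12, bc=True))   # "XIIe siecle av. J.-C."
```

The `duration` is an inclusive count of years, which is why a span such as 1403/1404 in the Spanish example has a duration of 2.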

There are better (graphical) ways to visualise these results, such as plotting periods on a timeline or drawing a histogram of the periods found. As this was only a short exercise to demonstrate use of the information extraction tools, further use and visualisation of the results is left to future work.
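
As a starting point for such a visualisation, a plain-text histogram can be produced with the standard library alone. This is a minimal sketch: the `spans` list is a hand-made sample reusing `minYear` values from the examples above, and the bucketing function `hundred_year_bucket` is a hypothetical helper, not part of the tools used in this notebook.

```python
from collections import Counter

# hand-made sample, reusing minYear values from the examples above
spans = [
    {"text": "Cold War", "minYear": "1946"},
    {"text": "during the 4th century AD", "minYear": "0301"},
    {"text": "Middle Ages", "minYear": "1066"},
    {"text": "siglo XI", "minYear": "1001"},
    {"text": "XIIe siecle av. J.-C.", "minYear": "-1199"},
]

def hundred_year_bucket(min_year: str) -> int:
    # bucket by the hundred-year block in which the span starts
    # (floor division also rounds negative years down, e.g. -1199 -> -1200)
    return int(min_year) // 100 * 100

counts = Counter(hundred_year_bucket(s["minYear"]) for s in spans)

# print one bar per bucket, in chronological order
for bucket in sorted(counts):
    print(f"{bucket:>6} | {'#' * counts[bucket]}")
```

Bucketing on the start year only is a simplification; a long period such as "*Middle Ages*" would more fairly be counted in every block it spans.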