# NER for geographical place names

This notebook demonstrates a vocabulary-driven Named Entity Recognition (NER) pipeline component for the [spaCy NLP library](https://spacy.io/) that locates place names within text passages. The *geonames_ruler* component performs NER, producing a list of spans that identify the textual positions of matches within the input text. It uses an NLP pipeline to restrict matches to proper nouns (e.g. "*Wells*" not "*wells*"; "*Street*" not "*street*").

## Data sources
This custom pipeline component identifies place names originating from the following data sources. It does not currently look for named features (e.g. mountains, lakes or rivers), though this could be accommodated in future if it were seen as a specific requirement:

* GeoNames [admin1CodesASCII.txt](https://download.geonames.org/export/dump/readme.txt) file - "names in English for admin divisions"
* GeoNames [admin2Codes.txt](https://download.geonames.org/export/dump/readme.txt) file - "names for administrative subdivisions"
* GeoNames [cities500.zip](https://download.geonames.org/export/dump/readme.txt) file - "cities with a population > 500"

The [GeoNames](https://www.geonames.org/) data files listed above are available for download under a [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/) from https://download.geonames.org/export/dump/

The use of GeoNames data in the pipeline component facilitates **semantic linking** by providing identifiers for the place names found within the input text. 
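
As a rough illustration of how such identifiers can be derived, the sketch below parses one tab-separated record in the layout the GeoNames readme gives for `admin1CodesASCII.txt` (code, name, ASCII name, geonameid) and builds a URI of the form that appears in the results later in this notebook. The function name `parse_admin1_line`, its `country_codes` parameter and the sample record are illustrative assumptions; this is not the *geonames_ruler* implementation.

```python
from typing import Optional, Sequence

# URI pattern as seen in the results tables of this notebook
GEONAMES_URI = "http://sws.geonames.org/{id}/"


def parse_admin1_line(line: str, country_codes: Sequence[str] = ("GB",)) -> Optional[dict]:
    """Parse one admin1 record (hypothetical helper); None if filtered out."""
    code, name, ascii_name, geoname_id = line.rstrip("\n").split("\t")
    country = code.split(".")[0]  # codes look like "<country>.<admin1>"
    if country not in country_codes:
        return None
    return {
        "name": name,
        "country": country,
        "id": GEONAMES_URI.format(id=geoname_id),
    }


# sample record written in the documented layout (assumed for illustration)
sample = "GB.ENG\tEngland\tEngland\t6269131"
record = parse_admin1_line(sample)
print(record)
```

Passing `country_codes=("FR",)` to the same call returns `None`, which is the kind of country restriction described in the Configuration section.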

## Configuration
The component is configured with an array of ISO country codes. To restrict matches to place names within the UK use ["GB"]; this is the default if no country codes are supplied. To include additional countries (e.g. *France*) use ["GB", "FR"]. For a full list of country codes see the GeoNames [countryInfo.txt](https://download.geonames.org/export/dump/countryInfo.txt) file. Restricting the component to specific country codes can reduce (though not entirely eliminate) ambiguity. Any ambiguous names (e.g. *Newport*) in the examples below would display with multiple "PLACE" tags in the marked-up text passage. The listing of the identified spans then shows the GeoNames identifier for each of the alternative matches.
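
The following sketch shows one way a caller might tidy a list of country codes before passing it as the `country_codes` setting. The helper `resolve_country_codes` is hypothetical and is not part of the component; only the `["GB"]` default is taken from the description above, and the two-letter check is an assumption about what counts as a valid code.

```python
from typing import Iterable, List, Optional

DEFAULT_COUNTRY_CODES = ["GB"]  # default stated in this notebook


def resolve_country_codes(codes: Optional[Iterable[str]] = None) -> List[str]:
    """Return a cleaned list of two-letter country codes (hypothetical helper)."""
    if not codes:
        return list(DEFAULT_COUNTRY_CODES)
    resolved: List[str] = []
    for code in codes:
        cleaned = code.strip().upper()
        if len(cleaned) != 2 or not cleaned.isalpha():
            raise ValueError(f"not a two-letter country code: {code!r}")
        if cleaned not in resolved:  # drop duplicates, keep first-seen order
            resolved.append(cleaned)
    return resolved


config = {"country_codes": resolve_country_codes(["gb", "FR", "GB"])}
print(config)  # {'country_codes': ['GB', 'FR']}
```

The resulting dictionary has the same shape as the `config` argument used in the usage examples in this notebook.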

In [1]:
%%capture
import warnings
# suppress user and future warnings during execution
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

# install prerequisites for this demonstration
%pip install -U spacy # spaCy NLP library
%sx python -m spacy download en_core_web_sm # English language trained pipeline 

# dependencies used by subsequent code cells
import spacy # NLP processing library
from spacy.tokens import Doc # main class for pipeline component I/O
from spacy import displacy # for visualisation of resultant marked up text
from IPython.display import display, HTML # for displaying results in this notebook
import pandas as pd  # for DataFrame display

import rematch2 # USW custom spaCy NER pipeline components


# for inline display of marked up input text     
def display_highlighted(doc: Doc) -> None:    
    displacy.render(
        docs = doc, 
        style = "span", 
        jupyter = True, 
        options = { 
            "spans_key": "rematch",
            "colors": { 
                "GPE": "palegreen",
                "PLACE": "palegreen"
            }
        } 
    )


# for tabular display of identified spans
def display_spans_table(doc: Doc) -> None:
    # get identified spans from the spaCy Doc object
    spans = doc.spans.get("rematch", [])

    # create DataFrame with required columns
    df = pd.DataFrame([{
        "start": span.start_char,
        "end": span.end_char,
        "token_start": span.start,
        "token_end": span.end - 1,            
        "label": span.label_,
        "id": span.id_,
        "text": span.text
        } for span in spans])

    # render DataFrame as html table
    display(HTML(df.to_html(index=False, border=True)))
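
To make the table columns produced by `display_spans_table` concrete, the sketch below checks a few rows from the usage example 1 results against the start of its input text. It assumes only what the helper above shows: `start` and `end` are character offsets with an exclusive end, while `token_end` is inclusive because it is computed as `span.end - 1`. It uses plain Python and does not call spaCy.

```python
text = "Ashby-de-la-Zouch is a town located in Leicestershire (England)."

# (start, end, token_start, token_end, text) taken from the usage example 1 results
rows = [
    (0, 17, 0, 6, "Ashby-de-la-Zouch"),
    (39, 53, 12, 12, "Leicestershire"),
    (55, 62, 14, 14, "England"),
]

for start, end, token_start, token_end, expected in rows:
    covered = text[start:end]               # "end" is exclusive
    n_tokens = token_end - token_start + 1  # "token_end" is inclusive
    print(f"{covered!r}: {n_tokens} token(s), matches={covered == expected}")
```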


## Usage example 1
The following Python code demonstrates basic usage of the *geonames_ruler* custom pipeline component. Note in the results that *Cardiff* occurs twice in GeoNames, as both a city and an administrative division, each with a different identifier; it is left to the user to decide how to use these two (valid) results. Note also that *Leicestershire* is found via two routes: by the default spaCy NER functionality (tagged as "GPE", Geopolitical Entity, without an associated identifier) and by the custom pipeline component (tagged as "PLACE" with a GeoNames identifier). *Rhondda* appears as a match within *Rhondda Cynon Taf*; in such cases you may choose to suppress spans encompassed by other spans.

In [10]:
# set up default base NER pipeline (English)
# this performs tokenisation, tagging, lemmatization etc.
nlp = spacy.load("en_core_web_sm")   

# add custom pipeline NER component(s) to the end of the default pipeline
nlp.add_pipe("geonames_ruler", last=True, config={"country_codes": ["GB"]})  

# run full NER pipeline on example test text
input_text = "Ashby-de-la-Zouch is a town located in Leicestershire (England). The USW university campus is located within Rhondda Cynon Taf about 10 miles North-West of Cardiff (near Pontypridd)."
doc = nlp(input_text)

# (optionally) also include any place entities identified by
# the default spaCy NER functionality for subsequent display 
for ent in filter(lambda e: e.label_ == "GPE", doc.ents):
    doc.spans["rematch"].append(ent)

# display the NER results
display_highlighted(doc)
display_spans_table(doc)


start,end,token_start,token_end,label,id,text
0,17,0,6,PLACE,http://sws.geonames.org/2656970/,Ashby-de-la-Zouch
39,53,12,12,PLACE,http://sws.geonames.org/2644667/,Leicestershire
55,62,14,14,PLACE,http://sws.geonames.org/6269131/,England
109,116,24,24,PLACE,http://sws.geonames.org/2639447/,Rhondda
109,126,24,26,PLACE,http://sws.geonames.org/3333247/,Rhondda Cynon Taf
156,163,34,34,PLACE,http://sws.geonames.org/3333241/,Cardiff
156,163,34,34,PLACE,http://sws.geonames.org/2653822/,Cardiff
170,180,37,37,PLACE,http://sws.geonames.org/2640104/,Pontypridd
39,53,12,12,GPE,,Leicestershire
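
The two situations noted for this example (a span encompassed by a longer span, and one span with several alternative identifiers) can be handled with simple post-processing. The sketch below works on plain tuples copied from a subset of the PLACE rows above. The function names are illustrative, and this is not the implementation of the `child_span_remover` component used in usage example 2; it only shows the general idea.

```python
from collections import defaultdict

# (start, end, label, id, text): a subset of the PLACE rows above
rows = [
    (0, 17, "PLACE", "http://sws.geonames.org/2656970/", "Ashby-de-la-Zouch"),
    (109, 116, "PLACE", "http://sws.geonames.org/2639447/", "Rhondda"),
    (109, 126, "PLACE", "http://sws.geonames.org/3333247/", "Rhondda Cynon Taf"),
    (156, 163, "PLACE", "http://sws.geonames.org/3333241/", "Cardiff"),
    (156, 163, "PLACE", "http://sws.geonames.org/2653822/", "Cardiff"),
]


def is_strictly_inside(inner, outer):
    """True if inner's character range lies within outer's and is shorter."""
    return (
        outer[0] <= inner[0]
        and inner[1] <= outer[1]
        and (inner[1] - inner[0]) < (outer[1] - outer[0])
    )


def remove_child_spans(rows):
    """Drop rows whose range is encompassed by a longer row."""
    return [r for r in rows if not any(is_strictly_inside(r, o) for o in rows)]


def group_alternatives(rows):
    """Collect all identifiers that share the same character range and text."""
    grouped = defaultdict(list)
    for start, end, _label, uri, text in rows:
        grouped[(start, end, text)].append(uri)
    return dict(grouped)


kept = remove_child_spans(rows)
alternatives = group_alternatives(kept)
for key, uris in alternatives.items():
    print(key, uris)
```

Spans with identical ranges, such as the two *Cardiff* rows, are deliberately kept because neither is shorter than the other; they end up grouped as alternatives instead.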


## Usage example 2
The test data used here is a set of 20 example records provided by the Archaeology Data Service (ADS) for use in ATRIUM task T4.1 (text based workflows). The original data file is a Google Sheet located in the (privately shared) ATRIUM Google Drive under ATRIUM / WP4 / T4.1 / Subtask 4.1.2 / ADS-UoY_samples / oasis_reports_july_2024 / report_metadata. The results immediately following the source code below illustrate HTML-rendered output of the pipeline, but the results may be expressed in many different formats. For each result we include a link back to the original record so you can also see the data in context.

This functionality is developed using pipeline components to maximise flexibility and reusability, allowing custom components to be combined as required. The components may quite easily be incorporated into a larger application.

In [14]:
# read a set of test records from source (CSV) data file 
# source data file columns are "file", "title", "abstract", "doi"
file_path = "./data/ner-input/oasis-report-metadata/report_metadata.csv" 

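# read CSV input file; replace missing values (NaN) with empty
# strings so that the .strip() calls below do not fail on empty cells
df = pd.read_csv(file_path, skip_blank_lines=True).fillna("")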
test_records = df.to_dict(orient="records")   

# set up default base NER pipeline (English)
nlp = spacy.load("en_core_web_sm")   

# add custom pipeline NER component(s) to the end of the default pipeline
nlp.add_pipe("geonames_ruler", last=True, config={"country_codes": ["GB"]})  

# (optionally) suppress matches on encompassed spans 
nlp.add_pipe("child_span_remover", last=True) 
   
# process each test record using the custom pipeline  
for record in test_records:
    test_id = record.get("doi","").strip()
    test_title = record.get("title","").strip()
    test_text = record.get("abstract","").strip()
    
    # run NER pipeline on title and text combined
    input_text = f"{test_title}. {test_text}"
    doc = nlp(input_text)

    # (optionally) add any place entities identified by the
    # default spaCy NER functionality for subsequent display. 
    # Note you may wish to omit this step; these will not have 
    # identifiers as they did not originate from GeoNames.
    for ent in filter(lambda e: e.label_ == "GPE", doc.ents):
        doc.spans["rematch"].append(ent) 

    # the identified spans are stored here
    # spans = doc.spans.get("rematch", [])

    # display record identifier. In this case the identifiers are URIs
    display(HTML(f"<strong><a href='{test_id}'>{test_id}</a></strong><br>"))
    
    # display input text highlighted with identified places
    display_highlighted(doc)

    # display table with identifiers and locations within the text
    display_spans_table(doc)
    
    # display horizontal rule before next record
    display(HTML("<hr>"))           

start,end,token_start,token_end,label,id,text
23,31,5,5,PLACE,http://sws.geonames.org/2635761/,Tiverton
23,31,5,5,PLACE,http://sws.geonames.org/2635762/,Tiverton
33,38,7,7,PLACE,http://sws.geonames.org/2651292/,Devon
200,208,34,34,PLACE,http://sws.geonames.org/2635761/,Tiverton
200,208,34,34,PLACE,http://sws.geonames.org/2635762/,Tiverton
210,215,36,36,PLACE,http://sws.geonames.org/2651292/,Devon
200,208,34,34,GPE,,Tiverton


start,end,token_start,token_end,label,id,text
54,64,8,8,PLACE,http://sws.geonames.org/2638017/,Sherington
71,79,11,11,PLACE,http://sws.geonames.org/2644826/,Lathbury
81,96,13,13,PLACE,http://sws.geonames.org/2654408/,Buckinghamshire
241,251,33,33,PLACE,http://sws.geonames.org/2638017/,Sherington
258,266,36,36,PLACE,http://sws.geonames.org/2644826/,Lathbury
268,283,38,38,PLACE,http://sws.geonames.org/2654408/,Buckinghamshire
1066,1071,173,173,PLACE,http://sws.geonames.org/2639376/,Ridge
71,79,11,11,GPE,,Lathbury
81,96,13,13,GPE,,Buckinghamshire
258,266,36,36,GPE,,Lathbury


start,end,token_start,token_end,label,id,text
24,32,5,5,PLACE,http://sws.geonames.org/2635978/,Thornley
34,47,7,8,PLACE,http://sws.geonames.org/2650629/,County Durham


start,end,token_start,token_end,label,id,text
24,33,5,5,PLACE,http://sws.geonames.org/2640131/,Ponteland
129,135,21,21,GPE,,c.2.64
452,462,70,70,GPE,,Overburden


start,end,token_start,token_end,label,id,text
54,62,9,9,PLACE,http://sws.geonames.org/2654928/,Bramford
64,71,11,11,PLACE,http://sws.geonames.org/2636561/,Suffolk
2008,2016,356,356,PLACE,http://sws.geonames.org/2654928/,Bramford
64,71,11,11,GPE,,Suffolk
701,709,124,124,GPE,,Medieval


start,end,token_start,token_end,label,id,text
94,104,17,17,PLACE,http://sws.geonames.org/2637183/,Sproughton
106,113,19,19,PLACE,http://sws.geonames.org/2646057/,Ipswich
115,122,21,21,PLACE,http://sws.geonames.org/2636561/,Suffolk
94,104,17,17,GPE,,Sproughton
106,113,19,19,GPE,,Ipswich
115,122,21,21,GPE,,Suffolk
278,283,55,55,GPE,,Field


start,end,token_start,token_end,label,id,text
62,71,11,11,PLACE,http://sws.geonames.org/2635597/,Towcester
73,89,13,13,PLACE,http://sws.geonames.org/2641429/,Northamptonshire
1422,1438,237,237,PLACE,http://sws.geonames.org/2641429/,Northamptonshire
2081,2086,353,353,PLACE,http://sws.geonames.org/2643071/,March
2222,2231,376,376,PLACE,http://sws.geonames.org/2635597/,Towcester
2233,2249,378,378,PLACE,http://sws.geonames.org/2641429/,Northamptonshire
73,89,13,13,GPE,,Northamptonshire
629,639,115,115,GPE,,Overburden
2064,2071,350,350,GPE,,Britain
2233,2249,378,378,GPE,,Northamptonshire


start,end,token_start,token_end,label,id,text
85,101,14,14,PLACE,http://sws.geonames.org/2641429/,Northamptonshire
640,656,127,127,PLACE,http://sws.geonames.org/2641429/,Northamptonshire
640,656,127,127,GPE,,Northamptonshire


start,end,token_start,token_end,label,id,text
114,120,17,17,PLACE,http://sws.geonames.org/2644516/,Lifton
122,127,19,19,PLACE,http://sws.geonames.org/2651292/,Devon
199,209,35,35,GPE,,Bartington


start,end,token_start,token_end,label,id,text
44,52,6,6,PLACE,http://sws.geonames.org/2655729/,Bicester
2179,2185,420,420,PLACE,http://sws.geonames.org/2640729/,Oxford
2325,2333,443,443,PLACE,http://sws.geonames.org/2655729/,Bicester
2335,2346,445,445,PLACE,http://sws.geonames.org/2640726/,Oxfordshire
2356,2361,448,448,PLACE,http://sws.geonames.org/2643071/,March
3158,3164,583,583,PLACE,http://sws.geonames.org/2640729/,Oxford
2325,2333,443,443,GPE,,Bicester
2335,2346,445,445,GPE,,Oxfordshire
2624,2633,491,491,GPE,,Alchester


start,end,token_start,token_end,label,id,text
58,62,11,11,PLACE,http://sws.geonames.org/2646900/,Hill
64,74,13,14,PLACE,http://sws.geonames.org/2645456/,Kings Lynn
76,83,16,16,PLACE,http://sws.geonames.org/2641455/,Norfolk
134,140,27,27,PLACE,http://sws.geonames.org/2640729/,Oxford
222,226,43,43,PLACE,http://sws.geonames.org/2646900/,Hill
228,238,45,46,PLACE,http://sws.geonames.org/2645456/,Kings Lynn
240,247,48,48,PLACE,http://sws.geonames.org/2641455/,Norfolk
76,83,16,16,GPE,,Norfolk
154,156,30,30,GPE,,OA
240,247,48,48,GPE,,Norfolk


start,end,token_start,token_end,label,id,text
35,46,8,9,PLACE,http://sws.geonames.org/2634258/,West Sussex
288,299,59,60,PLACE,http://sws.geonames.org/2634258/,West Sussex
27,33,6,6,GPE,,Pagham
35,46,8,9,GPE,,West Sussex
280,286,57,57,GPE,,Pagham
288,299,59,60,GPE,,West Sussex
1391,1401,256,257,GPE,,the County


start,end,token_start,token_end,label,id,text
0,7,0,0,PLACE,http://sws.geonames.org/2639690/,Rainham
0,7,0,0,PLACE,http://sws.geonames.org/2639691/,Rainham
20,30,4,4,PLACE,http://sws.geonames.org/6690863/,Hornchurch
32,38,6,6,PLACE,http://sws.geonames.org/2643743/,London


start,end,token_start,token_end,label,id,text
39,61,11,13,PLACE,http://sws.geonames.org/8224216/,Letchworth Garden City
63,76,15,15,PLACE,http://sws.geonames.org/2647043/,Hertfordshire
1371,1385,233,235,GPE,,Romano-British
2447,2461,419,421,GPE,,Romano-British


start,end,token_start,token_end,label,id,text
0,10,0,0,PLACE,http://sws.geonames.org/2639119/,Rossington
32,41,6,6,PLACE,http://sws.geonames.org/2651123/,Doncaster
32,41,6,6,PLACE,http://sws.geonames.org/3333143/,Doncaster
264,274,45,45,PLACE,http://sws.geonames.org/2639119/,Rossington
287,295,49,49,PLACE,http://sws.geonames.org/2634916/,Wadworth
424,434,72,72,PLACE,http://sws.geonames.org/2639119/,Rossington
856,865,145,145,PLACE,http://sws.geonames.org/2651123/,Doncaster
856,865,145,145,PLACE,http://sws.geonames.org/3333143/,Doncaster
938,948,155,155,PLACE,http://sws.geonames.org/2639119/,Rossington
953,962,157,157,PLACE,http://sws.geonames.org/2651123/,Doncaster


start,end,token_start,token_end,label,id,text
18,24,4,4,PLACE,http://sws.geonames.org/2633755/,Wistow
26,40,6,6,PLACE,http://sws.geonames.org/2644667/,Leicestershire


start,end,token_start,token_end,label,id,text
5,9,1,1,PLACE,http://sws.geonames.org/2646900/,Hill
16,23,4,4,PLACE,http://sws.geonames.org/2633916/,Willand
25,30,6,6,PLACE,http://sws.geonames.org/2651292/,Devon
243,247,41,41,PLACE,http://sws.geonames.org/2646900/,Hill
259,266,45,45,PLACE,http://sws.geonames.org/2633916/,Willand
270,275,47,47,PLACE,http://sws.geonames.org/2651292/,Devon
646,651,112,112,PLACE,http://sws.geonames.org/2643071/,March
659,664,115,115,PLACE,http://sws.geonames.org/2643071/,March
259,266,45,45,GPE,,Willand


start,end,token_start,token_end,label,id,text
0,14,0,0,PLACE,http://sws.geonames.org/2641235/,Northumberland


start,end,token_start,token_end,label,id,text
45,49,7,7,PLACE,http://sws.geonames.org/2634877/,Wall
72,78,12,12,PLACE,http://sws.geonames.org/2653049/,Church
572,579,99,99,PLACE,http://sws.geonames.org/6269131/,England
2547,2551,452,452,PLACE,http://sws.geonames.org/2634877/,Wall
2573,2579,459,459,GPE,,Trench
2609,2615,468,468,GPE,,Trench


start,end,token_start,token_end,label,id,text
33,37,3,3,PLACE,http://sws.geonames.org/2633352/,York
130,134,22,22,PLACE,http://sws.geonames.org/2633352/,York
185,191,31,31,PLACE,http://sws.geonames.org/2636671/,Street
193,197,33,33,PLACE,http://sws.geonames.org/2633352/,York
241,245,45,45,PLACE,http://sws.geonames.org/2633352/,York
281,285,53,53,PLACE,http://sws.geonames.org/2633352/,York
318,324,61,61,PLACE,http://sws.geonames.org/2636671/,Street
338,342,66,66,PLACE,http://sws.geonames.org/2633352/,York
427,431,81,81,PLACE,http://sws.geonames.org/2633352/,York
1264,1268,224,224,PLACE,http://sws.geonames.org/2633352/,York
