In [None]:
# pre-install required libraries
#%pip install --upgrade pip
%pip install spacy
%pip install ipywidgets
#%pip install -U jupyter

# suppress user warnings during execution
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)


# Using the Vocabulary Annotation Tool
## Introduction
The VocabularyAnnotator provides a series of custom spaCy pipeline components that perform vocabulary-based lookup matching on archaeological terms. The components are based on existing monolingual (English) thesauri, so they can currently identify and tag English-language terms only. Entity types can be selectively included or excluded, and the pipeline components re-ordered, during initialisation. The following entity types can be tagged by the VocabularyAnnotator:

| Entity Type   | Description   | Examples  |
|---------------|---------------| ----------|
| NAMEDPERIOD   | Match terms from Perio.do [Historic England Periods Authority File](http://n2t.net/ark:/99152/p0kh9ds) | *Medieval, Bronze Age* |
| OBJECT        | Match terms from the [FISH Archaeological Objects Thesaurus](http://purl.org/heritagedata/schemes/mda_obj) | *axe, sherds, ring* |
| COMPONENT     | Match terms from the [HE Components Thesaurus](http://purl.org/heritagedata/schemes/eh_com) | *rafter, truss, flue* |
| ARCHSCIENCE   | Match terms from the [FISH Archaeological Sciences Thesaurus](http://purl.org/heritagedata/schemes/560) | *lead isotope dating, palynology* |
| EVIDENCE      | Match terms from the [HE Evidence Thesaurus](http://purl.org/heritagedata/schemes/eh_evd) | *cropmark, artefact scatter* |
| EVENTTYPE     | Match terms from the [FISH Event Types Thesaurus](http://purl.org/heritagedata/schemes/agl_et) | *core sampling, geophysical survey, evaluation* |
| MATERIAL      | Match terms from the [FISH Building Materials Thesaurus](http://purl.org/heritagedata/schemes/eh_tbm) | *brass, quartz, pine, bone, leather* |
| MARITIME      | Match terms from the [FISH Maritime Craft Types Thesaurus](http://purl.org/heritagedata/schemes/eh_tmc) | *galley, salvage tug, dredger* |
| MONUMENT      | Match terms from the [FISH Thesaurus of Monument Types](http://purl.org/heritagedata/schemes/eh_tmt2) | *midden, weighbridge, kiln* |
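
The idea of vocabulary-based lookup matching can be illustrated with a minimal plain-Python sketch. This is not the VocabularyAnnotator implementation (which uses spaCy pipeline components and the full thesauri); the `VOCAB` dictionary and the `lookup_matches` function are hypothetical names introduced here for illustration, and the terms are just the examples from the table above. It shows how a list of entity types selects which vocabularies are applied to a text.

```python
import re

# Hypothetical mini-vocabularies built from the table's example terms.
# The real tool draws its terms from the thesauri listed above.
VOCAB = {
    "NAMEDPERIOD": ["Medieval", "Bronze Age"],
    "OBJECT": ["axe", "sherds", "ring"],
    "MONUMENT": ["midden", "kiln"],
}

def lookup_matches(text, entity_types):
    """Return sorted (start, end, label, matched_text) tuples for vocabulary terms in text."""
    matches = []
    for label in entity_types:  # only the requested entity types are matched
        for term in VOCAB[label]:
            # whole-word, case-insensitive match of the vocabulary term
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                matches.append((m.start(), m.end(), label, m.group(0)))
    return sorted(matches)

sample = "A Bronze Age axe and medieval sherds were found near the kiln."
for start, end, label, term in lookup_matches(sample, ["NAMEDPERIOD", "OBJECT", "MONUMENT"]):
    print(f"{start:>3}-{end:<3} {label:<12} {term}")
```

Leaving an entity type out of the list passed to `lookup_matches` simply skips its vocabulary, which mirrors the include/exclude behaviour described above. A real pipeline would also need to handle inflected forms and overlapping matches, which this sketch ignores.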

In [2]:
# example input text from https://doi.org/10.5284/1100095
txt1 = """This collection comprises site data (images, CAD and reports) from an archaeological evaluation, an abacus or an abutment comprising the excavation of thirty-three trenches, at Handley Park, near Abthorpe, Northamptonshire In May 2014, carried out by Cotswold Archaeology. The evaluation was commissioned by Pegasus Planning Group, acting on behalf of Haymaker Energy Ltd, and was carried out prior to the submission of a planning application for the construction of a solar park on the site. Evidence for Late Bronze Age/Early Iron Age activity, comprising a pit and a ditch from which a small assemblage of pottery was recovered, was encountered on the south-east facing slope overlooking the valley of Silverstone Brook. In the same area the remains of a small Roman settlement, probably a farmstead and associated field system, were identified. The Roman features contained pottery, animal bone and fragments of Roman roof tile, the latter indicating that there may have been a building in the vicinity. The Middle/Late Saxon remains comprised a circular or oval enclosure, although no evidence was encountered for features within the enclosure. Pottery dateable to the 7th to 10th centuries, a fragment of an iron pin or bobbin and a metal fragment, possibly part of a bucket with mineralised wood fibres adhering to its surface, were recovered from the enclosure ditch. The archaeological features broadly corresponded with anomalies detected by a geophysical survey of the site, although in a number of instances there was only an approximate correlation with the geophysical survey results, possibly due to the highly variable geology. Many of the anomalies interpreted as being of possible archaeological significance were confirmed as geological in origin and several features were identified that were not detected by the geophysical survey."""

# example input text from ?? (the same passage appears within txt9 below)
txt2 = """Aside from three residual flints, none closely datable, the earliest remains comprised a small assemblage of Roman pottery and ceramic building material, also residual and most likely derived from a Roman farmstead found immediately to the north within the Phase II excavation area. A single sherd of Anglo-Saxon grass-tempered pottery was also residual. The earliest features, which accounted for the majority of the remains on site, relate to medieval agricultural activity focused within a large enclosure. There was little to suggest domestic occupation within the site: the pottery assemblage was modest and well abraded, whilst charred plant remains were sparse, and, as with some metallurgical residues, point to waste disposal rather than the locations of processing or consumption. A focus of occupation within the Rodley Manor site, on higher ground 160m to the north-west, seems likely, with the currently site having lain beyond this and providing agricultural facilities, most likely corrals and pens for livestock. Animal bone was absent, but the damp, low-lying ground would have been best suited to cattle. An assemblage of medieval coins recovered from the subsoil during a metal detector survey may represent a dispersed hoard."""

# example input text from https://doi.org/10.5284/1017435
txt3 = """The project was a condition survey of the early 20th century submarine HMS/M A1, carried out for English Heritage by Wessex Archaeology in 2005.
The project involved routine diver survey of the HMS/M A1 and a site plan of the main elements of the wreck was produced supplemented by photographic and video recording. The impact of illegal diving activities on the wreck was assessed and recommendations were made for its future management. In addition, two geophysical anomalies located in proximity to the wreck were investigated to establish if they were associated with the wreck.
The dataset comprises databases relating to the diver recording system (DIVA) and ROV-track diver tracking, GIS shapefiles for the diver tracklogs and photographs.
HM Submarine A1
HMS/M A1 was built by Vickers in 1902 and was the first submarine which was British designed and built used by the Royal Navy. She was commissioned in 1903, but tragically lost with all hands in a collision off the Nab Light in 1904.
The submarine was raised in 1904 and then used for training and experimental purposes. HMS/M A1 was finally lost during an unmanned exercise. Despite the extensive searches, the Royal Navy was unable to locate the submarine. The wreck was rediscovered in 1989 and designated a historic wreck under the Protection of Wrecks Act 1973 in 1998.
Wessex Archaeology conducted a magnetometer survey, a sub bottom profiler survey and a multibeam sonar survey of the wreck site in 2003. More information on this can be found in the Wrecks on the Seabed archive. The geophysical survey data is archived through the MEDIN Data Archive Centres (British Geological Survey and United Kingdom Hydrographic Office)."""

# example input text from https://doi.org/10.5284/1100096
txt4 = """This collection comprises images, CAD, spreadsheets and a report from an Archaeological Evaluation of Land off Ellen Aldous Avenue, Hadleigh. This work was undertaken by Archaeology South-East between January to February 2021.
A preceding geophysical survey detected a range of anomalies of possible or probable archaeological origin, mainly concentrated in the western part of the site, indicating the potential presence of a series of ditched enclosures.
A total of fifty-five evaluation trenches were investigated across the northern 8.8ha of the overall 18.4ha site. Archaeological features were recorded in thirty-nine trenches and comprised ditches, pits and possible postholes. A close correspondence between the archaeological evaluation and geophysical survey results was evident, though smaller features such as pits and postholes had generally not been detected as geophysical anomalies.
Remains of Early Iron Age ditched enclosures, a possible trackway and a few pits were found in two distinct concentrations in the west and east of the evaluated area. Remains of Roman ditched field/ enclosure systems were recorded across the west half of the evaluated area. A further Roman ditch was found in the east. The significant quantity and range of artefacts and plant remains recovered from these Roman period features (especially from a few ditches in the west) suggests that they relate to a rural settlement, such as a farmstead, located in the near vicinity.
A number of ditches defining former field boundaries, along with quarries and other pits, relate to the agricultural use of this landscape in the late post-medieval and early modern periods. The boundary ditches are shown on historic mapping from the earlier 19th century onwards."""

# example input text from https://doi.org/10.5284/1100097
txt5 = """This collection comprises site data (images and GIS) from an archaeological evaluation of the land East of Willen Road, Newport Pagnell, Buckinghamshire, undertaken by Cotswold Archaeology in January 2021. A total of 41 trenches out of the proposed 63 trenches were excavated. A very high water table, large areas of standing water and near-constant rainfall prevented the remainder if the trenches from being opened.
Trial trenching revealed a very limited number of archaeological features, finds or deposits across the area; mainly consisting of Medieval to post-Medieval drainage ditches and field boundaries. The results suggest an extended period of use as pastoral land, latterly in association with Caldecote Farm and Mill, with a brief period of arable agriculture in the high Medieval period, evidenced by the standing remains of Ridge and Furrow across the central, southern part of the site. Very few archaeological artefacts were recovered, making precise dating of features difficult, and further supporting low levels of human activity or input within the area. Small quantities of artefacts were recovered including four flints, four pieces of animal bone, six sherds of Late Iron Age / Roman pottery, nineteen sherds of Medieval pottery - largely dating from between the late 11th-14th centuries - and five post-medieval artefacts. 80% of the finds were recovered from the pasture land in the central, eastern part of the site towards the river (Trenches 43-49)."""

# example input text from https://doi.org/10.5284/1100092
txt6 = """This collection comprises site data (reports, images, GIS data and a project database) from an archaeological excavation at Lydney B Phase II, Archers Walk, Lydney, Gloucestershire undertaken by Cotswold Archaeology between February and May 2018. An area of 1.47ha was excavated within this part of a wider development area.
The earliest remains comprised three broadly datable flints, all found as residual finds. An Early Bronze Age collared urn within a small pit may be the remains of a grave, although no human remains were found. The first evidence for occupation is from the Roman period, with finds spanning the 1st to 3rd centuries AD, with a clear focus within the 2nd to 3rd centuries. Two phases of Roman activity were identified, the first comprising cereal-processing ovens and two crescent-shaped ditches, one associated with metalworking debris. The later phase comprised stone founded buildings associated with wells, enclosures, trackways and a single cremation deposit. These seem to indicate a Romanised farm below the status of a villa. Little animal bone survived, but the enclosures are suggestive of livestock farming. Occupation seems to have ended in the mid 3rd century, although the reasons for this are not apparent.
Further use of the site dates to the medieval period, between the late 12th and 15th centuries, when an agricultural building was constructed, probably an outlier of a manorial farm previously excavated to the west.
"""

# example input text from https://doi.org/10.5284/1100098
txt7 = """This collection comprises site data (images and GIS) from an archaeological evaluation at Northfield Hostel, Littlemore, Oxford, In November 2021, undertaken by Cotswold Archaeology.
A total of 7 trenches were excavated across the 0.61ha site, which is located approximately 100m to the east of a Roman pottery production site identified during the construction of the Oxford Eastern Bypass. Archaeological remains were identified in trenches 3, 4 and 5. No evidence for any activity pre-dating the Roman period was identified, with the earliest dated feature being ditch 403, in trench 4, which produced pottery of mid-Roman date. The fill of this ditch was cut by a small pit or posthole, 405, that produced pottery of late Roman date, as did ditch / gully 503, to the southwest, indicating a C3 - C4 phase of activity and collectively suggesting use of the site from as early as the mid-2nd to the 4th century. No evidence for industrial activity on the site was seen in the form of either pottery wasters or kiln furniture, or industrial residue dumps, although the quantity of pottery recovered from the features investigated suggests a proximity to settlement or a working area. Consequently, it is possible that the activity identified on the site is related, albeit perhaps being on the periphery, to the kilns/ pottery production site noted during construction of the Eastern Bypass.
No evidence for Early Medieval (Saxon) or Medieval activity was identified during the evaluation, while a single sherd of post-medieval pottery recovered from the subsoil in trench 3 was most likely introduced onto the site via agricultural manuring practices.
Modern disturbance/ truncation to the natural substrate/ archaeological horizon was only seen in trench 7, where part of a large modern pit was seen, while in trench 6 the original subsoil and topsoil appear to have been left in-situ, buried beneath a thin layer of imported redeposited silt-clay and topsoil seemingly laid as part of the landscaping works associated with the construction of the Hostel complex.
"""

# example input text from https://doi.org/10.5284/1100091
txt8 = """This collection comprises site data (images and CAD site plans) from two phases of archaeological work undertaken by Cotswold archaeology at Mount Mill Farm, Wicken, Northamptonshire in May and June 2014 (MOM14) and January to March 2015 (MMW15).
MOM14:
An archaeological evaluation was undertaken by Cotswold Archaeology in May and June 2014 at Mount Mill Farm, Wicken, Northamptonshire. Forty-five trenches were excavated.
The evaluation recorded a series of substantial and well-preserved Iron Age ditches, corresponding to the three concentrations of enclosures and associated features recorded previously by a geophysical survey. It is likely that these three concentrations of features represent Iron Age farmsteads. The evaluation generally confirmed the limits of the three concentrations of Iron Age activity as defined by the geophysical survey. There was no evidence for activity pre-dating the Iron Age, and there were very few later features.
MMW15:
In January to March 2015, a programme of archaeological observation, investigation and recording was undertaken by Cotswold Archaeology during groundworks associated with the construction of a solar farm at Mount Mill Farm, Wicken, Northamptonshire. A previous archaeological evaluation and a geophysical survey of the site had recorded an Iron Age farmstead within the site's eastern field.
The programme of archaeological observation, investigation and recording identified a single undated ditch. The monitored groundworks were outside of the main concentration of Iron Age features recorded by the previous archaeological investigations. The lack of features exposed during the present works confirms that the previous investigations accurately defined the limits of the archaeological remains at the site.
"""

# example input text from https://doi.org/10.5284/1100093
txt9 = """This collection comprises site data (images, a report, a project database and GIS data) from an archaeological excavation undertaken by Cotswold Archaeology between January and February 2020 at Lydney B Phase III, Archers Walk, Lydney, Gloucestershire. An area of 0.6ha was excavated within this phase (Phase III) of a wider development area.
Aside from three residual flints, none closely datable, the earliest remains comprised a small assemblage of Roman pottery and ceramic building material, also residual and most likely derived from a Roman farmstead found immediately to the north within the Phase II excavation area. A single sherd of Anglo-Saxon grass-tempered pottery was also residual.
The earliest features, which accounted for the majority of the remains on site, relate to medieval agricultural activity focused within a large enclosure. There was little to suggest domestic occupation within the site: the pottery assemblage was modest and well abraded, whilst charred plant remains were sparse, and, as with some metallurgical residues, point to waste disposal rather than the locations of processing or consumption. A focus of occupation within the Rodley Manor site, on higher ground 160m to the north-west, seems likely, with the currently site having lain beyond this and providing agricultural facilities, most likely corrals and pens for livestock. Animal bone was absent, but the damp, low-lying ground would have been best suited to cattle. An assemblage of medieval coins recovered from the subsoil during a metal detector survey may represent a dispersed hoard.
"""

# example input text from https://doi.org/10.5284/1100086
txt10 = """This collection comprises site data (reports, images, spreadsheets, GIS data and site records) from two phases of archaeological evaluation undertaken by Oxford Archaeology in June 2018 (SAWR18) and February 2021 (SAWR21) at West Road, Sawbridgeworth, Hertfordshire.
SAWR18
In June 2018, Oxford Archaeology were commissioned by Taylor Wimpey to undertake an archaeological evaluation on the site of a proposed housing development to the north of West Road, Sawbridgeworth (TL 47842 15448). A programme of 19 trenches was undertaken to ground truth the results of a geophysical survey and to assess the archaeological potential of the site.
The evaluation confirmed the presence of archaeological remains in areas identified on the geophysics. Parts of a NW-SE‐aligned trackway were found in Trenches 1 and 2. Field boundaries identified by geophysics (also present on the 1839 tithe map) were found in Trenches 5 and 7, towards the south of the site, and in Trenches 12 and 16, in the centre of the site.
Geophysical anomalies identified in the northern part of the site were investigated and identified as geological. The archaeology is consistent with the geophysical survey results and it is likely that much of it has been truncated by modern agricultural activity.
SAWR21
Oxford Archaeology carried out an archaeological evaluation on the site of proposed residential development north of West Road, Sawbridgeworth, Hertfordshire, in February 2021. The fieldwork was commissioned by Taylor Wimpey as a condition of planning permission.
Preceding geophysical survey of the c 5.7ha development site was undertaken in 2016 and identified a concentration of linear and curvilinear anomalies in the north-east corner of the site and two areas of several broadly NW-SE aligned anomalies in the southern half of the site. Subsequent trial trench evaluation, comprising the investigation of 19 trenches, was undertaken by Oxford Archaeology in 2018, targeted upon the geophysical survey results. The evaluation revealed a small number of ditches in the centre and south of the site, correlating with the geophysical anomalies. Although generally undated, the ditches were suggestive of a trackway and associated enclosure/field boundaries. Other ditches encountered on site correlated with post-medieval field boundaries depicted on 19th-century mapping.
Given the results of the 2018 evaluation, in conjunction with those of the 2018 investigations at nearby Chalk's Farm, which uncovered the remains of late Bronze Age-early Iron Age and early Roman settlement and agricultural activity, it was deemed necessary to undertake a further phase of evaluation at the site. Four additional trenches were excavated in the southern half of the site to further investigate the previously revealed ditches.
The continuations of the trackway ditches were revealed in the centre of the site, with remnants of a metalled surface also identified. Adjacent ditches may demonstrate the maintenance and modification of the trackway or perhaps associated enclosure/field boundaries. Artefactual dating evidence recovered from these ditches was limited and of mixed date, comprising small pottery sherds of late Bronze Age-early Iron Age date and fragments of Roman ceramic building material. It is probable that these remains provide evidence of outlying agricultural activity associated with the later prehistoric and early Roman settlement evidence at Chalk's Farm.
A further undated ditch and a parallel early Roman ditch were revealed in the south of the site, suggestive of additional land divisions, probably agricultural features. A post-medieval field boundary ditch and modern land drains are demonstrative of agricultural use of the landscape during these periods.
"""

In [6]:
#import ipywidgets as widgets
from VocabularyAnnotator import VocabularyAnnotator

# output format: "html" | "csv" | "json" | "ttl" | "dataframe"
# (named output_format to avoid shadowing the built-in format())
output_format = "html"

# entity types to include, in pipeline order
entities = [
    "MONUMENT", "EVIDENCE", "MATERIAL",
    "MARITIME", "EVENTTYPE", "ARCHSCIENCE",
    "OBJECT", "COMPONENT", "NAMEDPERIOD"
]
# entities = ["OBJECT", "MONUMENT"] # can include/exclude and re-order pipe components
va = VocabularyAnnotator(entities)
results = va.annotateText(txt7, format=output_format)
display(results)


