In [5]:
from pathlib import Path
import spacy
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [14]:
doc.ents

(The World Bank,
 17,473,211,
 MEXICO,
 UMBRELLA,
 LA VENTA II,
 April 24, 2006,
 January 2006,
 Mexican,
 1,
 $=$ $\mathrm,
 U S,
 0.095$ $\begin{array,
 { l l l } { 1 \mathrm,
 U S } \mathbb,
 0,
 January 1 December 31,
 BLT,
 BOT,
 CAS,
 CER,
 CFE   
 CM   
 CO2,
 DOE,
 ER,
 GEF,
 GoM,
 GHG,
 IMN,
 INEGI,
 IRR,
 MW,
 NPV,
 Build-Lease-Transfer   
 Build Margin,
 Build-Operate-Transfer   
 Country Assistance Strategy   
 Clean Development Mechanism   
 National Center of Energy Control (Centro Nacional de Control de Energía,
 Certified Emissions Reduction   
 National Electric Commission,
 Comisión Nacional de Electricidad,
 Carbon Dioxide,
 Interconnected Mexican National,
 National Institute of Statistics,
 Geography and Computer Science,
 Instituto Nacional de Estadística,
 Geografía,
 Informática,
 Infrastructure Public Expenditure Review   
 Independent Power Producers,
 Operation,
 Maintenance   
 Publicly Finance Works,
 Mexican Petroleum (Petróleos Mexicanos,
 Proyectos de Im

## Named Entity Recognition
This notebook extracts named entities from a PDF with two approaches, spaCy's out-of-the-box pipeline and an LLM, and compares the list of entities each one produces.
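
A minimal sketch of such a comparison, assuming each approach yields a plain list of entity strings. The helper names `normalize` and `compare_entities` and the sample lists are hypothetical; matching is on case-folded, whitespace-normalized text only, so labels and character offsets are ignored.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and case-fold so trivially different strings match."""
    return " ".join(text.split()).casefold()


def compare_entities(spacy_ents, llm_ents):
    """Split two entity lists into shared, spaCy-only, and LLM-only groups."""
    spacy_set = {normalize(e) for e in spacy_ents}
    llm_set = {normalize(e) for e in llm_ents}
    return {
        "both": sorted(spacy_set & llm_set),
        "spacy_only": sorted(spacy_set - llm_set),
        "llm_only": sorted(llm_set - spacy_set),
    }


# Hypothetical sample lists, not real output from either system.
spacy_ents = ["The World Bank", "MEXICO", "U S"]
llm_ents = ["the world bank", "Mexico", "Clean Development Mechanism"]

result = compare_entities(spacy_ents, llm_ents)
print(result)
```

Exact string matching is deliberately strict; partial overlaps such as "World Bank" versus "The World Bank" would land in the two "only" groups and need a fuzzier rule.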

An LLM step then reconciles the two lists and returns a JSONL file that is passed back into the pipeline for another round of NER.
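
As a rough illustration of the JSONL hand-off, the sketch below writes one JSON object per line and reads it back. The record fields (`text`, `label`, `source`), the file name, and the helper names are assumptions made for this example, not the notebook's actual schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical reconciled records; the field names are an assumption.
records = [
    {"text": "The World Bank", "label": "ORG", "source": "both"},
    {"text": "Comisión Nacional de Electricidad", "label": "ORG", "source": "llm"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip through a temporary file to check nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "entities.jsonl"
    write_jsonl(out_path, records)
    loaded = read_jsonl(out_path)

print(loaded == records)
```

Passing `ensure_ascii=False` keeps Spanish names such as "Energía" legible in the file instead of escaping them.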

The desired output is a spaCy `doc` annotated with both the spaCy and the LLM entities. After Named Entity Linking, an entity resolution step may be used to reduce the entities to those that can be resolved back to Wikidata items. Once the doc is annotated, the annotations can be used for more complete triple extraction.
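
Attaching LLM entities to a `doc` requires character offsets. The sketch below assumes the LLM returns surface strings only and locates each one by exact match; the helper name `find_offsets` is hypothetical. Offsets like these could then be passed to spaCy's `Doc.char_span` and stored in a span group such as `doc.spans["llm"]` (the key name is an assumption). Note that `Doc.char_span` returns `None` when the offsets do not align with token boundaries.

```python
import re


def find_offsets(text, entity):
    """Return (start, end) character offsets of every exact match of entity."""
    pattern = re.compile(re.escape(entity))
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


text = "Apple is looking at buying U.K. startup for $1 billion"
# Hypothetical LLM output: surface strings only, no offsets.
llm_entities = ["Apple", "U.K.", "$1 billion"]

for entity in llm_entities:
    print(entity, find_offsets(text, entity))
```

For this sentence the offsets agree with the ones spaCy reports for the same three entities: (0, 5), (27, 31), and (44, 54).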


### First pass: doc as a whole
The first pass uses spaCy to process the document as a whole.

In [6]:
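fpath = "../output/test/auto/test.md"
text = Path(fpath).read_text(encoding="utf-8")

# Run the full pipeline once over the whole document.
doc = nlp(text)
# Print each entity with its character offsets and label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)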

The World Bank 12 26 ORG
17,473,211 202 212 CARDINAL
MEXICO 258 264 ORG
UMBRELLA 281 289 ORG
LA VENTA II 291 302 EVENT
April 24, 2006 313 327 DATE
January 2006 379 391 DATE
Mexican 412 419 NORP
1 425 426 CARDINAL
$=$ $\mathrm 440 452 MONEY
U S 455 458 ORG
0.095$ $\begin{array 464 484 MONEY
{ l l l } { 1 \mathrm 486 507 GPE
U S } \mathbb 510 523 ORG
0 544 545 CARDINAL
January 1 December 31 587 608 DATE
BLT 640 643 PERSON
BOT 653 656 ORG
CAS 660 663 ORG
CER 684 687 ORG
CFE   
CM   
CO2 691 707 ORG
DOE 719 722 ORG
ER 733 735 ORG
GEF 747 750 ORG
GoM 754 757 NORP
GHG 761 764 ORG
IMN 781 784 PERSON
INEGI 790 795 NORP
IRR 814 817 ORG
MW 821 823 ORG
NPV 827 830 ORG
Build-Lease-Transfer   
Build Margin 876 912 WORK_OF_ART
Build-Operate-Transfer   
Country Assistance Strategy   
Clean Development Mechanism   
National Center of Energy Control (Centro Nacional de Control de Energía 932 1092 ORG
Certified Emissions Reduction   
National Electric Commission 1097 1158 ORG
Comisión Nacional de Electr