# Construction Named Entity Recognition
A machine learning notebook for training a named entity recognition (NER) model.

## Tagging Schemes
The tagged dataset provides both `BILUO` (Beginning, Inside, Last, Unit, Outside) and `IOB` (Inside, Outside, Beginning) tags. The schemes differ in how they mark entity boundaries: `BILUO` explicitly tags the last token of a multi-token entity and gives single-token entities their own tag, which can lead to better performance in some cases, while `IOB` is simpler and can be easier to implement.

### IOB Scheme
- `I` – Token is inside an entity.
- `O` – Token is outside an entity.
- `B` – Token is the beginning of an entity.

### BILUO Scheme
- `B` – Token is the beginning of a multi-token entity.
- `I` – Token is inside a multi-token entity.
- `L` – Token is the last token of a multi-token entity.
- `U` – Token is a single-token unit entity.
- `O` – Token is outside an entity.
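
To make the difference concrete, the sketch below hand-tags a short sentence in `BILUO` and converts it to `IOB` with a small helper. The sentence, the `PERSON` and `ORG` labels, and the helper name `biluo_to_iob_sketch` are illustrative assumptions; the helper only mirrors the idea behind spaCy's `biluo_to_iob` (`L` becomes `I`, `U` becomes `B`) and is not its implementation.

```python
# Hand-tagged example; the tokens and labels are illustrative assumptions.
tokens = ["Steve", "Jobs", "worked", "at", "Apple"]
biluo = ["B-PERSON", "L-PERSON", "O", "O", "U-ORG"]


def biluo_to_iob_sketch(tags):
    """Map BILUO tags to IOB: the L- prefix becomes I-, the U- prefix becomes B-."""
    prefix_map = {"L": "I", "U": "B"}
    iob_tags = []
    for tag in tags:
        if tag == "O":
            iob_tags.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        iob_tags.append(f"{prefix_map.get(prefix, prefix)}-{label}")
    return iob_tags


iob = biluo_to_iob_sketch(biluo)

# Print the two schemes side by side for each token.
for token, biluo_tag, iob_tag in zip(tokens, biluo, iob):
    print(f"{token:<8}{biluo_tag:<12}{iob_tag}")
```

Note that the conversion loses information: after it, a single-token entity and the first token of a multi-token entity both carry a `B` tag.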

### Imports

In [None]:
import logging
import spacy
from spacy import displacy
from spacy.training import (
    offsets_to_biluo_tags,
    biluo_to_iob
)
from utilities import safe_make_dir

# Require a GPU; this raises an error if none is available
# (spacy.prefer_gpu() falls back to the CPU instead)
spacy.require_gpu()

# Load the transformer-based English pipeline
nlp = spacy.load("en_core_web_trf")

### Simple Example

In [None]:
text = "What amazing device did Steve Jobs create at Apple in 2007?" 
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

### Complex Example

In [None]:
text = "Show all Single Pole Light Switches that Homer J. Simpson installed on Level 1 at Apple between April and June of 2024."
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

### Apply rule-based entity recognition with named entity patterns
See the [spaCy documentation](https://spacy.io/usage/rule-based-matching#entityruler) for details on how to apply rules with patterns.

**NOTE**: If you haven't already, generate the patterns with the `named_entity_patterns` notebook first.

In [None]:
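# Add the entity ruler only once so that re-running the cell is safe
if not nlp.has_pipe("entity_ruler"):
    # Let the ruler's matches overwrite overlapping entities already set in the doc
    config = {"overwrite_ents": True}
    ruler = nlp.add_pipe("entity_ruler", config=config).from_disk("./data/patterns.jsonl")
    print("✅ Added custom patterns:", len(ruler.patterns))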
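
For orientation, each line of an entity ruler pattern file is one JSON object with a `label` and a `pattern`, where the pattern is either an exact phrase string or a list of token attribute dictionaries. The sketch below writes such a file with the standard library only. The `EQUIPMENT` and `LOCATION` labels, the example patterns, and the temporary output path are assumptions for illustration; they are not the contents of `./data/patterns.jsonl`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical patterns; the labels and phrases are assumptions.
patterns = [
    # Phrase pattern: matches the exact string
    {"label": "EQUIPMENT", "pattern": "light switch"},
    # Token pattern: "level" in any casing followed by a digit token
    {"label": "LOCATION", "pattern": [{"LOWER": "level"}, {"IS_DIGIT": True}]},
]

# Write to a temporary directory so no existing pattern file is overwritten.
out_path = Path(tempfile.mkdtemp()) / "patterns.jsonl"
with out_path.open("w", encoding="utf-8") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")

# Read the file back to confirm that each line is a standalone JSON object.
loaded = [
    json.loads(line)
    for line in out_path.read_text(encoding="utf-8").splitlines()
]
print(len(loaded), "patterns written to", out_path)
```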

### Visualize our new named entity patterns

In [None]:
text = "Show me all single pole light switches and all other electrical equipment that James Bond installed in the kitchen area on Level 1 from May to June."
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

### Extracting Document Tags Programmatically

In [None]:
# Extract the recognized entities as (start_char, end_char, label) tuples
tags = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
print('1️⃣', tags)

# Extract the list of tokens
tokens = [token.text for token in doc]
# Extract the fine-grained part-of-speech tag for each token
pos_tags = [token.tag_ for token in doc]
# Convert the character offsets to per-token BILUO tags
biluo_tags = offsets_to_biluo_tags(doc, tags)
# Convert the BILUO tags to IOB tags
iob_tags = biluo_to_iob(biluo_tags)

print('2️⃣', tokens)
print('3️⃣', biluo_tags)

# Validate that every per-token list has the same length
assert len(tokens) == len(pos_tags) == len(biluo_tags) == len(iob_tags), "The lengths of the arrays must be equal"
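
The conversion from character offsets to per-token `BILUO` tags can be pictured with the simplified sketch below. It assumes whitespace tokenization and entity spans that line up exactly with token boundaries; spaCy's `offsets_to_biluo_tags` works on the pipeline's own tokenization and marks misaligned spans with `-`, which this sketch does not attempt. The function name `offsets_to_biluo_sketch`, the sample sentence, and the labels are hypothetical.

```python
def offsets_to_biluo_sketch(text, entities):
    """Tag whitespace tokens with BILUO labels from (start, end, label) spans."""
    # Compute the character offsets of each whitespace-delimited token.
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside the entity span.
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= ent_start and end <= ent_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        else:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return [word for word, _, _ in tokens], tags


sample = "James Bond installed switches on Level 1"
level_start = sample.index("Level 1")
spans = [
    (0, len("James Bond"), "PERSON"),
    (level_start, level_start + len("Level 1"), "LOCATION"),
]
sample_tokens, sample_tags = offsets_to_biluo_sketch(sample, spans)
for token, tag in zip(sample_tokens, sample_tags):
    print(f"{token:<12}{tag}")
```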

### Save the model to disk

In [None]:
# Create the training directory using the local safe_make_dir helper
safe_make_dir('./training')
# Serialize the pipeline, including the entity ruler, to disk
nlp.to_disk('./training/ner')