How to change alignment_mode in a hf pipelines in code? #13548

mikrzol · 2024-06-26T10:06:29Z

I'm having the same problem described here: microsoft/presidio#1262 - some of the annotations are skipped because of alignment problems between the spaCy pipeline and the hf pipeline wrapper.

I would like to simply substitute one of the components of the spaCy pipeline with a HF model I trained for NER and use it for this task. I'm trying to do this using this code:

import spacy
nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("ner")

nlp.add_pipe("hf_token_pipe", config={"model": "mikrz/bert-vir_naeus-ner"})

The model loads correctly and technically works fine, but some of the tokens are skipped, e.g.:

text = 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was isolated.'
doc = nlp(text)

For the code above I'm getting this warning:

spacy_huggingface_pipelines/token_classification.py:129: UserWarning: Skipping annotation, {'entity_group': 'VIR', 'score': 0.67677, 'word': '_ SauS _ SA2', 'start': 24, 'end': 33} is overlapping or can't be aligned for doc 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was ...'
  warnings.warn(
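
For context on why this span is rejected: offsets 24 to 33 cover only `_SauS_SA2`, which starts in the middle of a token if spaCy keeps `vB_SauS_SA2` (characters 22 to 33) as a single token. The sketch below is a plain-Python imitation of the three alignment modes documented for `Doc.char_span` (`strict`, `contract`, `expand`). The `align` helper and the hard-coded token offsets are illustrative assumptions, not spaCy code.

```python
from typing import List, Optional, Tuple


def align(tokens: List[Tuple[int, int]], start: int, end: int,
          mode: str = "strict") -> Optional[Tuple[int, int]]:
    """Map a character span onto token indices (start inclusive, end exclusive).

    `tokens` holds (char_start, char_end) for each token, in document order.
    Returns None when the span cannot be aligned under the chosen mode.
    """
    if mode == "strict":
        # Both ends must sit exactly on token boundaries.
        starts = {s: i for i, (s, _) in enumerate(tokens)}
        ends = {e: i + 1 for i, (_, e) in enumerate(tokens)}
        if start in starts and end in ends:
            return starts[start], ends[end]
        return None
    if mode == "contract":
        # Keep only tokens that lie completely inside the character span.
        inside = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that is at least partially covered.
        inside = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode!r}")
    return (inside[0], inside[-1] + 1) if inside else None


# Character offsets of the first four tokens of the example sentence:
# "A", "novel", "bacteriophage", "vB_SauS_SA2"
tokens = [(0, 1), (2, 7), (8, 21), (22, 33)]

strict_result = align(tokens, 24, 33, "strict")      # None: 24 is not a token start
contract_result = align(tokens, 24, 33, "contract")  # None: no token fits inside 24..33
expand_result = align(tokens, 24, 33, "expand")      # (3, 4): widened to the whole token
```

Under these assumptions only `expand` keeps the annotation, which is consistent with the warning appearing in the stricter modes.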

I saw that in microsoft/presidio#1262 someone recommended changing the alignment_mode of the hf pipeline component. How can I do this in code when using the default en_core_web_sm model?
For my use case, I need the spans of the named entities in the form of word numbers, so in the example above it would be:

vB_SauS_SA2: (3,3),
Staphylococcus aureus: (11,12)

where we get the positions of the start and end of the named entities.
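
The word numbers above match token indices with an inclusive end, which in spaCy would correspond to `(ent.start, ent.end - 1)` for each entity in `doc.ents`. Below is a minimal sketch of that conversion from character offsets. It uses a regex tokenizer as a rough stand-in for spaCy's tokenizer, an assumption made so the example runs without any model; real spaCy token boundaries can differ on other inputs. The helper names `token_offsets` and `word_span` are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Split into word-like runs and single punctuation marks, keeping offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def word_span(offsets: List[Tuple[int, int]], start: int, end: int) -> Tuple[int, int]:
    """Return inclusive (first, last) indices of tokens overlapping [start, end)."""
    hits = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    if not hits:
        raise ValueError(f"no token overlaps characters {start}..{end}")
    return hits[0], hits[-1]


text = ('A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) '
        'that infects Staphylococcus aureus was isolated.')
offsets = token_offsets(text)

spans: Dict[str, Tuple[int, int]] = {}
for name in ("vB_SauS_SA2", "Staphylococcus aureus"):
    start = text.index(name)  # character offset of the entity in the text
    spans[name] = word_span(offsets, start, start + len(name))

for name, (first, last) in spans.items():
    print(f"{name}: ({first},{last})")
```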

I'd like to change just this part - I would like to avoid training a custom pipeline which, if I understand https://spacy.io/usage/training correctly, seems to be necessary when creating a new spaCy pipeline from the config file. Or did I misunderstand and there is an option to just change parts of the config file? If that's the case, could you instruct me how to create a config file, where to put it and how to use it?


Siddharth-Latthe-07 · 2024-06-27T12:42:44Z

@mikrzol I think the issue arises from the fact that the tokens produced by the Hugging Face model might not align perfectly with the tokens produced by the spaCy model. This misalignment can lead to warnings about overlapping annotations or annotations that can't be aligned.
Try out this sample code snippet, which adjusts the spans returned by the Hugging Face model to match the token boundaries of the spaCy model more accurately.

import spacy
from spacy.language import Language
from spacy.util import filter_spans
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Remove default NER pipeline component
nlp.remove_pipe("ner")

# Load Hugging Face model and tokenizer
model_name = "mikrz/bert-vir_naeus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# aggregation_strategy merges word pieces into whole entities and
# reports the label under "entity_group" instead of "entity"
hf_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Custom pipeline component for Hugging Face NER
# (spaCy v3 requires components to be registered and added by name)
@Language.component("hf_ner_component")
def hf_ner_component(doc):
    # Run the model on the original text so that the character offsets
    # it returns refer to the same string as the Doc
    hf_ner_results = hf_pipeline(doc.text)

    # Convert Hugging Face NER results to spaCy entities
    entities = []
    for ent in hf_ner_results:
        start = ent['start']
        end = ent['end']
        label = ent['entity_group']
        # 'expand' widens offsets that fall inside a token to the token
        # boundaries; 'contract' would drop such spans entirely
        span = doc.char_span(start, end, label=label, alignment_mode='expand')
        if span is not None:
            entities.append(span)

    # Assign entities to the doc; doc.ents rejects overlapping spans,
    # so keep only the longest non-overlapping ones
    doc.ents = filter_spans(entities)
    return doc

# Add custom component to spaCy pipeline
nlp.add_pipe("hf_ner_component", last=True)

# Example usage
text = 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was isolated.'
doc = nlp(text)

# Print named entities detected by the new NER component
for ent in doc.ents:
    print(ent.text, ent.label_)

Please let me know if the above works.
Thanks
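
Regarding the original question: if the installed version of spacy-huggingface-pipelines exposes an `alignment_mode` setting on the `hf_token_pipe` factory (its README lists "strict", "contract" and "expand", with "strict" as the default; this is worth checking against your version), it can be passed in the same `config` dict as `model`, with no config file and no retraining. A minimal sketch under that assumption follows; `hf_token_pipe_config` and `build_pipeline` are hypothetical helper names, not part of either library.

```python
VALID_ALIGNMENT_MODES = ("strict", "contract", "expand")


def hf_token_pipe_config(model: str, alignment_mode: str = "expand") -> dict:
    """Build the config dict passed to nlp.add_pipe("hf_token_pipe", ...)."""
    if alignment_mode not in VALID_ALIGNMENT_MODES:
        raise ValueError(
            f"alignment_mode must be one of {VALID_ALIGNMENT_MODES}, got {alignment_mode!r}"
        )
    # Assumption: the factory accepts an "alignment_mode" key next to "model".
    return {"model": model, "alignment_mode": alignment_mode}


def build_pipeline(model: str, alignment_mode: str = "expand"):
    """Swap the stock NER component of en_core_web_sm for the HF model."""
    # Imported here so the config helper above works without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    nlp.remove_pipe("ner")
    nlp.add_pipe("hf_token_pipe", config=hf_token_pipe_config(model, alignment_mode))
    return nlp


config = hf_token_pipe_config("mikrz/bert-vir_naeus-ner")
```

Calling `build_pipeline("mikrz/bert-vir_naeus-ner")` would then give an `nlp` object whose entities can be read from `doc.ents` as usual. With "expand", a partially matched token such as `vB_SauS_SA2` should be annotated in full rather than skipped.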
