
Spacy Tokenizer Boundary Issue. #7592

Closed · ZohaibRamzan opened this issue Mar 27, 2021 · 6 comments
Labels: feat / tokenizer (Feature: Tokenizer), resolved (The issue was addressed / answered)

Comments

@ZohaibRamzan commented Mar 27, 2021

I am using the spaCy tokenizer within a stanza pipeline. In some sentences, the spaCy tokenizer does not split the sentence-final period '.' into a separate token, which in my case is needed.
Here is my code:

import stanza
from unidecode import unidecode

nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})

sentence = 'To 10-30mm2 section of stained material in a 2ml microfuge tube, add 600µl Lysis Buffer and 10µl Proteinase K.'
sentence = sentence.rstrip()
doc = nlp(unidecode(sentence))  # run the pipeline on the ASCII-folded sentence
token = [word.text for sent in doc.sentences for word in sent.words]

The result is:

["To","10","-","30mm2","section","of","stained","material","in","a","2ml","microfuge","tube",",","add","600ul","Lysis","Buffer","and","10ul","Proteinase","K."]

I want the last two tokens to be 'K' and '.'.
Can I do that?

@polm (Contributor) commented Mar 29, 2021

You can get what you want like this:

import spacy

nlp = spacy.load("en_core_web_sm")

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'

# add periods to the suffix search
suffixes = nlp.Defaults.suffixes + [r"\."]
sregex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = sregex.search

for word in nlp(text):
    print(word)

For more details, see the docs on tokenizer exceptions.

Ah wait, this doesn't work as-is with Stanza - let me see how to apply it. It should work since you're using the spaCy tokenizer.

@polm (Contributor) commented Mar 29, 2021

Hm, so that was more complicated than I expected.

My example above works with the standard spaCy tokenizer, but it turns out the Stanza tokenizer has a somewhat different implementation and doesn't have the hooks for tokenizer exceptions. So what you can do is replace the tokenizer in your pipeline with a standard spaCy tokenizer, like below.

import spacy
import stanza
import spacy_stanza

from spacy.tokenizer import Tokenizer

nlp = spacy_stanza.load_pipeline("en")
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'

# add periods to the suffix search
suffixes = nlp.Defaults.suffixes + [r"\."]
sregex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = sregex.search

for word in nlp(text):
    print(word)

That said, you should check the performance of this setup. I think it will make the tokenization different from what the Stanza models were trained on, which could affect accuracy, though the overall changes are likely minor enough that it shouldn't matter.

@adrianeboyd (Contributor) commented:

This example that replaces the tokenizer with nlp.tokenizer = X is not going to work because the entire stanza pipeline is actually implemented as a custom tokenizer, so replacing the tokenizer removes the stanza processing.

If you use processors={'tokenize':'spacy'}, then the default English tokenizer is getting initialized within stanza and you don't have much control over it, which is also why this option only works for English.

If you really want customized spacy tokenization with the stanza pipeline, then you'll have to provide pretokenized texts (whitespace tokenization) with the tokenize_pretokenized=True option instead.
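As a rough sketch of that pretokenized route (not from the thread; the final "K." is split by hand here), tokenize_pretokenized=True makes stanza treat whitespace as the token boundary:

import stanza

# Sketch only: tokenize the text yourself (here with the final "K." already
# split into "K" and ".") and let stanza skip its own tokenization.
nlp = stanza.Pipeline('en', tokenize_pretokenized=True)

pretokenized = "Add 600ul Lysis Buffer and 10ul Proteinase K ."
doc = nlp(pretokenized)
print([word.text for sent in doc.sentences for word in sent.words])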

@adrianeboyd (Contributor) commented:

Ah, I think you could customize the tokenizer directly. It's a little buried, but it looks like you can access it like this to modify the suffixes:

nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp.tokenizer
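Putting this together with the earlier suffix tweak, a rough sketch might look like the following. The ._variant path is internal and undocumented, so it may change between versions, and forwarding processors={"tokenize": "spacy"} through spacy_stanza is an assumption here:

import spacy
import spacy_stanza

# assumes the processors option is passed through to the underlying stanza.Pipeline
nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})

# internal, undocumented path quoted above - may break across stanza versions
inner_nlp = nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp
inner_tokenizer = inner_nlp.tokenizer

# reapply the suffix tweak from the earlier comment, but on the inner spaCy tokenizer
suffixes = list(inner_nlp.Defaults.suffixes) + [r"\."]
inner_tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'
print([t.text for t in nlp(text)])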

@svlandeg added the resolved (The issue was addressed / answered) label on Mar 31, 2021
github-actions bot commented Apr 7, 2021

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions bot closed this as completed Apr 7, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2021