
Spacy Tokenizer Boundary Issue. #7592

Closed · ZohaibRamzan opened this issue Mar 27, 2021 · 6 comments
Labels: feat / tokenizer (Feature: Tokenizer), resolved (The issue was addressed / answered)

Comments

@ZohaibRamzan commented Mar 27, 2021

I am using the spaCy tokenizer within a stanza pipeline. In some sentences, the spaCy tokenizer does not split the sentence-final period '.' into a separate token, which in my case is needed.
Here is my code:

import stanza
from unidecode import unidecode

nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})

sentence = 'To 10-30mm2 section of stained material in a 2ml microfuge tube, add 600µl Lysis Buffer and 10µl Proteinase K.'
sentence = sentence.rstrip()
doc = nlp(unidecode(sentence))  # run the pipeline on the ASCII-folded sentence
token = [word.text for sent in doc.sentences for word in sent.words]

The result is:

["To","10","-","30mm2","section","of","stained","material","in","a","2ml","microfuge","tube",",","add","600ul","Lysis","Buffer","and","10ul","Proteinase","K."]

I want the last two tokens to be 'K' and '.'.
Can I do that?

@polm (Contributor) commented Mar 29, 2021

You can get what you want like this:

import spacy

nlp = spacy.load("en_core_web_sm")

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'

# add periods to the suffix search
suffixes = nlp.Defaults.suffixes + [r"\."]
sregex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = sregex.search

for word in nlp(text):
    print(word)

For more details, see the docs on tokenizer exceptions.

Ah wait, this doesn't work as-is with Stanza - let me see how to apply it. It should work since you're using the spaCy tokenizer.

@polm (Contributor) commented Mar 29, 2021

Hm, so that was more complicated than I expected.

My example above works with the standard spaCy tokenizer, but it turns out the Stanza tokenizer has a somewhat different implementation and doesn't have the hooks for tokenizer exceptions. So what you can do is replace the tokenizer in your pipeline with a standard spaCy tokenizer, like below.

import spacy
import stanza
import spacy_stanza

from spacy.tokenizer import Tokenizer

nlp = spacy_stanza.load_pipeline("en")
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'

# add periods to the suffix search
suffixes = nlp.Defaults.suffixes + [r"\."]
sregex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = sregex.search

for word in nlp(text):
    print(word)

That said, you should check the performance of this setup. I think it will make the tokenization different from what the Stanza models were trained on, which could affect accuracy, though the overall changes are likely minor enough that it shouldn't matter.

@adrianeboyd (Contributor) commented:

This example that replaces the tokenizer with nlp.tokenizer = X is not going to work because the entire stanza pipeline is actually implemented as a custom tokenizer, so replacing the tokenizer removes the stanza processing.

If you use processors={'tokenize':'spacy'}, then the default English tokenizer is getting initialized within stanza and you don't have much control over it, which is also why this option only works for English.

If you really want customized spacy tokenization with the stanza pipeline, then you'll have to provide pretokenized texts (whitespace tokenization) with the tokenize_pretokenized=True option instead.
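As a rough sketch of that pretokenized route (not from the thread; the final "K." is split by hand here), tokenize_pretokenized=True makes stanza treat whitespace as the token boundary:

import stanza

# Sketch only: tokenize the text yourself (here with the final "K." already
# split into "K" and ".") and let stanza skip its own tokenization.
nlp = stanza.Pipeline('en', tokenize_pretokenized=True)

pretokenized = "Add 600ul Lysis Buffer and 10ul Proteinase K ."
doc = nlp(pretokenized)
print([word.text for sent in doc.sentences for word in sent.words])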

@adrianeboyd (Contributor) commented:

Ah, I think you could customize the tokenizer directly. It's a little buried, but it looks like you can access it like this to modify the suffixes:

nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp.tokenizer
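Putting this together with the earlier suffix tweak, a rough sketch might look like the following. The ._variant path is internal and undocumented, so it may change between versions, and forwarding processors={"tokenize": "spacy"} through spacy_stanza is an assumption here:

import spacy
import spacy_stanza

# assumes the processors option is passed through to the underlying stanza.Pipeline
nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})

# internal, undocumented path quoted above - may break across stanza versions
inner_nlp = nlp.tokenizer.snlp.processors["tokenize"]._variant.nlp
inner_tokenizer = inner_nlp.tokenizer

# reapply the suffix tweak from the earlier comment, but on the inner spaCy tokenizer
suffixes = list(inner_nlp.Defaults.suffixes) + [r"\."]
inner_tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

text = 'Add 600µl Lysis Buffer and 10µl Proteinase K.'
print([t.text for t in nlp(text)])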

@svlandeg added the resolved (The issue was addressed / answered) label on Mar 31, 2021
github-actions bot commented Apr 7, 2021

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions bot closed this as completed Apr 7, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2021