
Tokenizer: English: non-breaking exclamations #1251

kwhumphreys opened this issue Aug 9, 2017 · 5 comments


@kwhumphreys kwhumphreys commented Aug 9, 2017

>>> import en_core_web_sm
>>> nlp = en_core_web_sm.load()
>>> doc = nlp('Difference!.')
>>> for tok in doc: print(tok)
>>> doc = nlp('Candidate!Welcome!')
>>> for tok in doc: print(tok)

Your Environment

  • Operating System:
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.9.0
  • Environment Information:

@erip erip commented Aug 9, 2017

What's the expected output? Split on punct?


@ines ines commented Aug 10, 2017

Thanks! Splitting the exclamation mark as an infix looks reasonable – I can't immediately think of cases where this would be problematic, or reasons why we would have decided against including it in the infixes rules. But we'll have to check it and see if all existing tests pass with this modification.

In the meantime, if this is very relevant to the data you're working with, you can always write custom tokenization rules and extend the infix patterns.
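For intuition, the infix mechanism splits a token wherever one of the compiled infix patterns matches inside it. Below is a minimal pure-`re` sketch of that idea (an illustration only, not spaCy's actual implementation), using a hypothetical infix character class that includes `!`:

```python
import re

# Hypothetical infix pattern that treats "!" as a breaking character.
infix_re = re.compile(r'[!&:,()]')

def split_on_infixes(token):
    """Split a token at every infix match, keeping the matched chars."""
    pieces, start = [], 0
    for match in infix_re.finditer(token):
        if match.start() > start:
            pieces.append(token[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces

print(split_on_infixes('Candidate!Welcome!'))
# → ['Candidate', '!', 'Welcome', '!']
```

With `!` in the infix patterns, `'Candidate!Welcome!'` comes apart into separate word and punctuation pieces instead of staying a single token.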


@macks22 macks22 commented Aug 16, 2017

@kwhumphreys as an example of writing custom infix rules, I did something similar to add braces as infixes to correct for bad typing such as "bad(parentheses)spacing":

from spacy import language
from spacy import util as lang_util
from spacy.tokenizer import Tokenizer

# Brace characters to add as infixes (defined here so the example is
# self-contained).
LIST_BRACES = [r'\(', r'\)', r'\[', r'\]', r'\{', r'\}']

def create_tokenizer(cls=language.BaseDefaults, nlp=None, special_cases=None):
    """Override the default tokenizer to add in grouping chars as possible infixes.

    This corrects for grouping chars not surrounded by spaces. The default tokenizer
    will consider "bad(spacing)word" as a single token, whereas this tokenizer will
    correctly distinguish the individual words.
    """
    rules = cls.tokenizer_exceptions
    if special_cases is not None:
        special_cases = dict(special_cases)
        rules = special_cases

    token_match = cls.token_match if cls.token_match else None
    if cls.prefixes:
        prefix_search = lang_util.compile_prefix_regex(cls.prefixes).search
    else:
        prefix_search = None
    if cls.suffixes:
        suffix_search = lang_util.compile_suffix_regex(cls.suffixes).search
    else:
        suffix_search = None
    if cls.infixes:
        # This whole function is to add this one line to the original function:
        infixes = tuple(list(cls.infixes) + LIST_BRACES)
        infix_finditer = lang_util.compile_infix_regex(infixes).finditer
    else:
        infix_finditer = None
    vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
    return Tokenizer(vocab, rules=rules,
                     prefix_search=prefix_search, suffix_search=suffix_search,
                     infix_finditer=infix_finditer, token_match=token_match)

nlp = en_core_web_sm.load()
nlp.tokenizer = create_tokenizer(nlp.Defaults, nlp)

Note the commented line indicating where I made the change. There may be a simpler way to go about this, but this was the best I came up with.


@kwhumphreys kwhumphreys commented Aug 16, 2017

Thanks @macks22
I currently have:

def custom_tokenizer(nlp):
    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
    custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[!&:,()]']
    infix_re = spacy.util.compile_infix_regex(custom_infixes)

    tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab,
                                          rules=nlp.Defaults.tokenizer_exceptions,
                                          prefix_search=prefix_re.search,
                                          suffix_search=suffix_re.search,
                                          infix_finditer=infix_re.finditer)
    return lambda text: tokenizer(text)

nlp = en_core_web_sm.load(create_make_doc=custom_tokenizer)
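As a sanity check, the patterns in `custom_infixes` can be exercised in isolation with plain `re` before wiring them into the tokenizer. Note the lookbehind/lookahead in the hyphen rule, which only fires between digits, so hyphenated words stay intact:

```python
import re

custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[!&:,()]']
infix_re = re.compile('|'.join(custom_infixes))

# The hyphen rule matches only between digits, not inside words.
assert infix_re.search('2017-08') is not None
assert infix_re.search('non-breaking') is None

# "!" now matches inside a token, so "Candidate!Welcome!" would be
# split at each exclamation mark.
assert infix_re.search('Candidate!Welcome') is not None
```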

@honnibal honnibal closed this Nov 10, 2017

@lock lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018