
Tokenizer: English: non-breaking exclamations #1251

Closed
kwhumphreys opened this issue Aug 9, 2017 · 5 comments

@kwhumphreys
Contributor

commented Aug 9, 2017

>>> import en_core_web_sm
>>> nlp = en_core_web_sm.load()
>>> doc = nlp('Difference!.')
>>> for tok in doc: print(tok)
... 
Difference!.
>>> doc = nlp('Candidate!Welcome!')
>>> for tok in doc: print(tok)
... 
Candidate!Welcome
!

Your Environment

  • Operating System:
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.9.0
  • Environment Information:
@erip

commented Aug 9, 2017

What's the expected output? Split on punct?

@ines

Member

commented Aug 10, 2017

Thanks! Splitting the exclamation mark as an infix looks reasonable – I can't immediately think of cases where this would be problematic, or of a reason we would have decided against including it in the infix rules. But we'll have to check and see whether all existing tests still pass with this modification.

In the meantime, if this is very relevant to the data you're working with, you can always write custom tokenization rules and extend the infix patterns.
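
For reference, a minimal sketch of that approach (untested; the helper name is just for illustration, and it only appends "!" to the default infix patterns while reusing the default prefixes, suffixes and exceptions, via the same spacy.util helpers and Tokenizer signature used in the comments below):

    import spacy
    from spacy.tokenizer import Tokenizer

    def tokenizer_with_bang_infix(nlp):
        # Keep the default prefixes, suffixes and exceptions; only extend the infixes with "!".
        infixes = tuple(nlp.Defaults.infixes) + (r'!',)
        prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
        suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
        infix_re = spacy.util.compile_infix_regex(infixes)
        return Tokenizer(nlp.vocab, rules=nlp.Defaults.tokenizer_exceptions,
                         prefix_search=prefix_re.search, suffix_search=suffix_re.search,
                         infix_finditer=infix_re.finditer)

    # usage: nlp.tokenizer = tokenizer_with_bang_infix(nlp)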

@ines ines added the performance label Aug 10, 2017

@macks22

commented Aug 16, 2017

@kwhumphreys as an example of writing custom infix rules, I did something similar to add braces as infixes to correct for bad typing such as "bad(parentheses)spacing":

from spacy import language
from spacy import util as lang_util
from spacy.tokenizer import Tokenizer

# Brace/parenthesis characters to treat as possible infixes (example definition).
LIST_BRACES = [r'\(', r'\)', r'\[', r'\]', r'\{', r'\}']

def create_tokenizer(cls=language.BaseDefaults, nlp=None, special_cases=None):
    """Override the default tokenizer to add in grouping chars as possible infixes.
    This corrects for grouping chars not surrounded by spaces. The default tokenizer
    will consider "bad(spacing)word" as a single token, whereas this tokenizer will
    correctly distinguish the individual words.
    """
    rules = cls.tokenizer_exceptions
    if special_cases is not None:
        special_cases = dict(special_cases)
        special_cases.update(rules)
        rules = special_cases

    if cls.token_match:
        token_match = cls.token_match
    else:
        token_match = None
    if cls.prefixes:
        prefix_search = lang_util.compile_prefix_regex(cls.prefixes).search
    else:
        prefix_search = None
    if cls.suffixes:
        suffix_search = lang_util.compile_suffix_regex(cls.suffixes).search
    else:
        suffix_search = None
    if cls.infixes:
        # This whole function exists to add this one line to the original function:
        infixes = tuple(list(cls.infixes) + LIST_BRACES)
        infix_finditer = lang_util.compile_infix_regex(infixes).finditer
    else:
        infix_finditer = None
    vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
    return Tokenizer(vocab, rules=rules,
                     prefix_search=prefix_search, suffix_search=suffix_search,
                     infix_finditer=infix_finditer, token_match=token_match)

...
nlp = en_core_web_sm.load()
nlp.tokenizer = create_tokenizer(nlp.Defaults, nlp)

Note the commented line indicating where I made the change. There may be a simpler way to go about this, but this was the best I came up with.
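
A possibly simpler route, on spaCy versions where the tokenizer's infix_finditer attribute is writable, is to recompile only the infix pattern and reassign it on the existing tokenizer; an untested sketch:

    import spacy
    import en_core_web_sm

    nlp = en_core_web_sm.load()
    # Append brace characters to the existing infix patterns and swap in the new matcher.
    infixes = tuple(nlp.Defaults.infixes) + (r'[\(\)\[\]\{\}]',)
    nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer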

@kwhumphreys

Contributor Author

commented Aug 16, 2017

Thanks @macks22
I currently have:

import spacy
import spacy.tokenizer
import en_core_web_sm

def custom_tokenizer(nlp):
    # Reuse the default prefix/suffix rules, but compile custom infix patterns so
    # that "!", "&", ":", ",", "(" and ")" split the words they join together.
    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
    custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[!&:,()]']
    infix_re = spacy.util.compile_infix_regex(custom_infixes)

    tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab,
                                          nlp.Defaults.tokenizer_exceptions,
                                          prefix_re.search,
                                          suffix_re.search,
                                          infix_re.finditer,
                                          token_match=None)
    return lambda text: tokenizer(text)

nlp = en_core_web_sm.load(create_make_doc=custom_tokenizer)
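
With "!" included in the infixes, the examples from the original report should then split along these lines (expected output, not verified against a particular model version):

    >>> doc = nlp('Difference!.')
    >>> [tok.text for tok in doc]
    ['Difference', '!', '.']
    >>> doc = nlp('Candidate!Welcome!')
    >>> [tok.text for tok in doc]
    ['Candidate', '!', 'Welcome', '!']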

@honnibal honnibal closed this Nov 10, 2017
