Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing regex library with re to increase tokenization speed #3218

Merged
merged 20 commits into from Feb 1, 2019

Conversation

Projects
None yet
3 participants
@svlandeg
Copy link
Member

commented Jan 31, 2019

Description

First step to address issue #1642 :
Implemented the minimal changes required to replace the regex library back with re:

  • Removed regex from imports and requirements
  • Replaced class shortcuts like \p{Bengali} (not supported by re) with their actual unicode ranges
  • Converted _quotesand ALPHA variables to single-char concatenated strings, allowing the use of character classes [] instead of alternations with |
  • Rewrote part of the URL expression that was causing Issue #957 (pathological backtracking caused by greedy operators)

Some additional small changes:

  • Removed a few unnecessary (parts of) regular expressions
  • Removed some unnecessary backslashes in character ranges
  • Started unifying regular expressions, using the same parameterized format across languages
  • Added/enabled/expanded some unit tests, incl. those to flag pathological backtracking (test_issue957 and test_issue2835)
  • Fixed few typos in CONTRIBUTING.md

Results

Tested on the UD train corpora:

  • Tokenization speed in WPS is 2-3 times faster
  • Tokenization accuracy of the blank models remains virtually the same

image

TODO

  • Evaluation of re-trained language models with re instead of regex
  • Run for ja & th (with CLI, as they're not installed properly on my system)

Open issues (to discuss)

  • Temporarily disabled some unit tests with double punctuation, such as "Hello''" and "Тест''", as these now oversegment
  • Oversegmentation of double punctuation was found to introduce a few more errors for UD training corpora for de, id and ur specifically. This could be fixed by introducing a new regular expression, unless we want to keep the initial oversegmentation
  • The back-tracking issue may creep in again in the future if new expressions with overlapping greedy operators would be introduced

Future work (next PR)

  • Investigate whether the unicode ranges can be further reduced (they are quite ugly now)
  • Investigate the necessity & speed impact of e.g. non-latin characters in latin texts
  • Use more character classes instead of alternations for the other variables in char_classes
  • Clean up code further to enhance maintainability of the punctuation rules

Types of change

Enhancement: tokenization speed

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.
@svlandeg

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2019

Oh, looks like French unicode characters É and Î are tripping up the unit tests for Python 2.7 ...
[edit: fixed!]

@honnibal honnibal merged commit 46dfe77 into explosion:develop Feb 1, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@svlandeg svlandeg deleted the svlandeg:feature/remove_regex branch Feb 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.