spaCy tokenizer does not split correctly tokens separated by a slash (/) ending in a digit #2926

Closed
arimbr opened this issue Nov 14, 2018 · 2 comments
Labels: feat / tokenizer (Feature: Tokenizer), perf / accuracy (Performance: accuracy)

arimbr commented Nov 14, 2018

The spaCy tokenizer does not correctly split tokens separated by a slash (/) when some of them end with a digit.

How to reproduce the behaviour

In [57]: import spacy
In [58]: nlp = spacy.load('fr')

In [59]: [t for t in nlp('Learn html5/css3/javascript/jquery')]
Out[59]: [Learn, html5/css3/javascript, /, jquery] # UNEXPECTED

In [60]: [t for t in nlp('Learn html/css/javascript/jquery')]
Out[60]: [Learn, html, /, css, /, javascript, /, jquery] # EXPECTED
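Both outputs above are consistent with an infix rule that only splits on a slash when it has a letter on each side. The sketch below reproduces that behaviour with the standard `re` module and shows how allowing a digit before the slash changes the result. It is an illustration under that assumption, not spaCy's actual code: the two patterns and the `split_on_infix` helper are hypothetical and were written for this example.

```python
import re

# Hypothetical infix rules for illustration only; not copied from spaCy.
# Splits on "/" only when a letter sits on both sides.
LETTERS_ONLY = re.compile(r"(?<=[a-zA-Z])/(?=[a-zA-Z])")
# Same rule, but a digit is also accepted before the slash.
LETTERS_OR_DIGITS = re.compile(r"(?<=[a-zA-Z0-9])/(?=[a-zA-Z])")


def split_on_infix(token, infix_re):
    """Split token at every infix match, keeping each infix as its own piece."""
    pieces = []
    start = 0
    for match in infix_re.finditer(token):
        if match.start() > start:
            pieces.append(token[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces


text = "html5/css3/javascript/jquery"
print(split_on_infix(text, LETTERS_ONLY))
# ['html5/css3/javascript', '/', 'jquery']
print(split_on_infix(text, LETTERS_OR_DIGITS))
# ['html5', '/', 'css3', '/', 'javascript', '/', 'jquery']
```

In spaCy itself, a comparable change would go through the tokenizer's infix rules (for example `spacy.util.compile_infix_regex` and `nlp.tokenizer.infix_finditer`). Whether that matches the fix tracked in #1642 is not confirmed by this thread.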

Your Environment

  • spaCy version: 2.0.11
  • Platform: Linux-4.15.0-36-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.5
  • Models: fr, en

Related issue #891

ines added the feat / tokenizer and perf / accuracy labels Nov 14, 2018

ines (Member) commented Jan 7, 2019

Merging this with the master issue in #1642!

ines closed this as completed Jan 7, 2019

lock bot commented Feb 6, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Feb 6, 2019