For hindi text period '.' punctuation not splitting text #3625

Closed
gauravgr opened this issue Apr 22, 2019 · 2 comments
Labels
feat / tokenizer Feature: Tokenizer help wanted (easy) Contributions welcome! (also suited for spaCy beginners) help wanted Contributions welcome! lang / hi Hindi language data and models perf / accuracy Performance: accuracy

Comments

@gauravgr

How to reproduce the behaviour

from spacy.lang.hi import Hindi
nlp = Hindi()
doc = nlp(u"hi. how हुए. होटल, होटल")
print([token.text for token in doc])

Output
['hi', '.', 'how', 'हुए.', 'होटल', ',', 'होटल']

Issue
In 'हुए.', the '.' should be split off as its own token, but after tokenization it remains attached to 'हुए'. By contrast, the ',' punctuation is split correctly.

Your Environment

  • spaCy version: 2.1.3
  • Platform: Windows-10-10.0.10240-SP0
  • Python version: 3.6.4
@gauravgr gauravgr changed the title For hindi text . punctuation not splitting text For hindi text period '.' punctuation not splitting text Apr 22, 2019
@ines
Member

ines commented Apr 22, 2019

Hmmm, maybe the character classes used in the default punctuation rules currently don't include certain unicode characters, so the rules that say "split . after a letter" aren't applied? It should be okay, though – see here:

_uncased = _bengali + _hebrew + _persian + _sinhala
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
ALPHA_LOWER = group_chars(_lower + _uncased)
ALPHA_UPPER = group_chars(_upper + _uncased)

This is definitely worth investigating. Maybe it even makes sense to implement more specific punctuation rules for Hindi – I'm not sure which punctuation characters are common and which aren't, but we might even be able to use a much simpler rule set in this case.
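To make the hypothesis concrete, here is a minimal sketch using only Python's `re` module. It is not spaCy's actual rule set: `LATIN_ONLY`, `DEVANAGARI`, `make_suffix_splitter` and `split_trailing_period` are hypothetical names invented for illustration, and the single pattern is a simplified stand-in for a "split . after a letter" suffix rule. It only shows how such a rule can silently skip a word when the letter class does not cover the script. Note that the character groups quoted above do not name a Devanagari group, though whether that is the real cause here is unconfirmed.

```python
import re

# Hypothetical, simplified stand-ins for tokenizer character classes.
LATIN_ONLY = "a-zA-Z"
DEVANAGARI = "\u0900-\u097F"  # Unicode Devanagari block (assumption: enough for Hindi)


def make_suffix_splitter(alpha):
    # "Split . after a letter": a trailing period preceded by a character in `alpha`.
    return re.compile(r"(?<=[{a}])\.$".format(a=alpha))


def split_trailing_period(word, pattern):
    # Return [word] unchanged if the rule does not fire, else [stem, "."].
    match = pattern.search(word)
    if match is None:
        return [word]
    return [word[: match.start()], word[match.start():]]


without_hindi = make_suffix_splitter(LATIN_ONLY)
with_hindi = make_suffix_splitter(LATIN_ONLY + DEVANAGARI)

print(split_trailing_period("hi.", without_hindi))   # ['hi', '.']
print(split_trailing_period("हुए.", without_hindi))  # ['हुए.']
print(split_trailing_period("हुए.", with_hindi))     # ['हुए', '.']
```

With the Latin-only class, the lookbehind never matches the final Devanagari letter, so the period stays attached, which mirrors the output reported above. Extending the class to cover the Devanagari block lets the same rule split it.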

@ines ines added feat / tokenizer Feature: Tokenizer help wanted Contributions welcome! help wanted (easy) Contributions welcome! (also suited for spaCy beginners) lang / hi Hindi language data and models perf / accuracy Performance: accuracy labels Apr 22, 2019
@ines ines closed this as completed Jul 11, 2019
@lock

lock bot commented Aug 10, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 10, 2019