For hindi text period '.' punctuation not splitting text #3625

gauravgr · 2019-04-22T07:08:17Z

How to reproduce the behaviour

from spacy.lang.hi import Hindi
nlp = Hindi()
doc = nlp(u"hi. how हुए. होटल, होटल")
print([token.text for token in doc])

Output
['hi', '.', 'how', 'हुए.', 'होटल', ',', 'होटल']

Issue
For 'हुए.', the '.' should be split off as its own token, but after tokenization it is still attached to 'हुए'. The ',' punctuation, by contrast, is split correctly.

Your Environment

spaCy version: 2.1.3
Platform: Windows-10-10.0.10240-SP0
Python version: 3.6.4

ines · 2019-04-22T13:05:03Z

Hmmm, maybe the character classes used in the default punctuation rules currently don't include certain unicode characters, so the rules that say "split . after a letter" aren't applied? It should be okay, though – see here:

spaCy/spacy/lang/char_classes.py

Lines 196 to 200 in ec0d840

    
_uncased = _bengali + _hebrew + _persian + _sinhala
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
ALPHA_LOWER = group_chars(_lower + _uncased)
ALPHA_UPPER = group_chars(_upper + _uncased)

This is definitely worth investigating. Maybe it even makes sense to implement more specific punctuation rules for Hindi – I'm not sure which punctuation characters are common and which aren't, but we might even be able to use a much simpler rule set in this case.
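
To illustrate the hypothesis above, here is a minimal sketch using only Python's `re` module. It is not spaCy's actual implementation: the two character ranges and the helper `split_trailing_period` are hypothetical, simplified stand-ins for the `ALPHA` class and the "split . after a letter" suffix rule. The point is only that a lookbehind restricted to a letter class will not fire if that class lacks Devanagari characters.

```python
import re

# Hypothetical, simplified alphabets (NOT spaCy's real character classes).
LATIN_ONLY = "a-zA-Z"
# Same class extended with the Devanagari Unicode block (U+0900 to U+097F).
WITH_DEVANAGARI = LATIN_ONLY + "\u0900-\u097F"


def split_trailing_period(word, alpha):
    """Split a trailing '.' off `word` only if the character before it is in `alpha`."""
    # Lookbehind: the period must directly follow a character from the class.
    pattern = re.compile(r"(?<=[{a}])\.$".format(a=alpha))
    match = pattern.search(word)
    if match is None:
        # Rule did not apply, so the word stays as one token.
        return [word]
    return [word[:match.start()], "."]


for alpha in (LATIN_ONLY, WITH_DEVANAGARI):
    print([split_trailing_period(w, alpha) for w in ("hi.", "हुए.")])
```

With the Latin-only class, 'हुए.' stays as one token, which matches the reported behaviour; once the class covers Devanagari, the period is split off. In spaCy itself, a comparable change would presumably go through the language's suffix rules (for example by rebuilding the suffix regex with `spacy.util.compile_suffix_regex` and assigning it to `nlp.tokenizer.suffix_search`), but the exact fix is an assumption here and is not shown in this thread.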

lock · 2019-08-10T11:42:27Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

gauravgr changed the title ~~For hindi text . punctuation not splitting text~~ For hindi text period '.' punctuation not splitting text Apr 22, 2019

ines added the "feat / tokenizer", "help wanted", "help wanted (easy)", "lang / hi", and "perf / accuracy" labels Apr 22, 2019

yash1994 mentioned this issue Jul 11, 2019 in pull request #3948, "Fix default punctuation rules for splitting Hindi text" (merged)

ines closed this as completed Jul 11, 2019

lock bot locked as resolved and limited conversation to collaborators Aug 10, 2019
