Hindi text: period '.' is not split off as punctuation #3625
Labels: feat / tokenizer, help wanted (easy), help wanted, lang / hi, perf / accuracy
How to reproduce the behaviour
from spacy.lang.hi import Hindi
nlp = Hindi()
doc = nlp(u"hi. how हुए. होटल, होटल")
print([token.text for token in doc])
Output
['hi', '.', 'how', 'हुए.', 'होटल', ',', 'होटल']
Issue
In 'हुए.', the '.' should be split off as a separate token, but after tokenization it stays attached to 'हुए'. Splitting works correctly for the ',' punctuation and for a '.' that follows Latin text ('hi.').
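One possible explanation, stated here as an assumption rather than something this report confirms, is that the suffix rule for '.' only fires after Latin letters, so a period following a Devanagari character never matches. The sketch below illustrates that idea with plain `re`. The two patterns and the helper `split_trailing_period` are hypothetical and are not spaCy's actual rules or API.

```python
import re

# Hypothetical suffix rules for illustration only (not spaCy's real patterns).
# Rule 1: a trailing '.' counts as a suffix only after a lowercase Latin letter.
LATIN_ONLY_SUFFIX = re.compile(r"(?<=[a-z])\.$")
# Rule 2: same rule, but the lookbehind also accepts the Devanagari block.
WITH_DEVANAGARI_SUFFIX = re.compile(r"(?<=[a-z\u0900-\u097F])\.$")


def split_trailing_period(word, suffix_re):
    """Split a trailing '.' off `word` if `suffix_re` matches, else keep it whole."""
    match = suffix_re.search(word)
    if match is None:
        return [word]
    return [word[:match.start()], word[match.start():]]


print(split_trailing_period("hi.", LATIN_ONLY_SUFFIX))        # ['hi', '.']
print(split_trailing_period("हुए.", LATIN_ONLY_SUFFIX))        # ['हुए.']
print(split_trailing_period("हुए.", WITH_DEVANAGARI_SUFFIX))   # ['हुए', '.']
```

Under the Latin-only rule the sketch reproduces the reported output: 'hi.' is split while 'हुए.' is not. If the real cause is similar, extending the character class used by the Hindi suffix rules to cover Devanagari would be one way to address it.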
Your Environment