svlandeg and ines Improve Italian & Urdu tokenization accuracy (#3228)
## Description

1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), raising the F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added a unit test to check this behaviour.

2. Added an Urdu-specific punctuation character as a suffix, raising the F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added a unit test to check this behaviour.
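
To illustrate the two rules above, here is a minimal standalone sketch using only Python's `re` module. It is not the code from this PR and does not use spaCy's tokenizer classes. The names `split_elision` and `split_suffix` are hypothetical, and the description does not name the Urdu character, so the sketch assumes it is U+06D4 (ARABIC FULL STOP) purely for illustration.

```python
import re

# Apostrophes treated as elision marks (assumption: straight and curly).
ELISION = "'\u2019"

# Zero-width boundary: a letter plus an elision mark on the left,
# a letter on the right. [^\W\d_] matches any Unicode letter.
_BOUNDARY = re.compile(r"(?<=[^\W\d_][{el}])(?=[^\W\d_])".format(el=ELISION))

# Assumed Urdu sentence-final mark (not named in the description).
URDU_FULL_STOP = "\u06d4"


def split_elision(token):
    """Split an elided form after the apostrophe, e.g. c'è -> c' + è."""
    pieces, start = [], 0
    for match in _BOUNDARY.finditer(token):
        pieces.append(token[start:match.start()])
        start = match.start()
    pieces.append(token[start:])
    return pieces


def split_suffix(token, suffix=URDU_FULL_STOP):
    """Detach a trailing punctuation mark from the rest of the token."""
    if len(token) > len(suffix) and token.endswith(suffix):
        return [token[:-len(suffix)], suffix]
    return [token]


if __name__ == "__main__":
    print(split_elision("c'è"))
    print(split_elision("l'ha"))
    print(split_suffix("ہے" + URDU_FULL_STOP))
```

In this sketch the apostrophe stays attached to the first piece, and tokens without a letter on both sides of the apostrophe are left unsplit.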

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Latest commit 9745b0d Feb 4, 2019
Type Name Latest commit message Commit time
__init__.py Improve Italian & Urdu tokenization accuracy (#3228) Feb 4, 2019
examples.py Add Urdu Language Support (#2430) Jun 22, 2018
lemmatizer.py
lex_attrs.py
punctuation.py
stop_words.py 💫 Tidy up and auto-format .py files (#2983) Nov 30, 2018
tag_map.py 💫 Tidy up and auto-format .py files (#2983) Nov 30, 2018