Fixes #628 Greek word tokenizer added #629
Smart addition. FYI, the build server errored out on the pass (looks like a problem on Travis CI's side), so I'll merge as soon as I see that it has finished.
Codecov Report
@@            Coverage Diff            @@
##           master    #629     +/-   ##
=========================================
+ Coverage   86.55%  86.57%  +0.01%
=========================================
  Files         138     138
  Lines        8385    8394      +9
=========================================
+ Hits         7258    7267      +9
  Misses       1127    1127
Saw that last night (the problem seems to have been with Travis this time?). It looks like it's passing on my end now.
FWIW, I have seen a lot of interest in the Greek tools in recent months and hope to spend more time, starting with this PR, on building up and advising a Greek core development team.
API addition of Greek by Patrick Burns PR #629
Yes, the hangups were all on Travis's side. Everything passed fine, eventually.
* add odia language (cltk#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes cltk#628 Greek word tokenizer added (cltk#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR cltk#629
* Fix Typo (cltk#627)
* Fix Typo mrathi -> marathi
* Update stops.py

* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Fix typo UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS followed the usual convention of CLTK to separate with an _ (underscore)
* sync (#1)
* add odia language (#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes #628 Greek word tokenizer added (#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR #629
* Fix Typo (#627)
* Fix Typo mrathi -> marathi
* Update stops.py

* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Fix typo UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS followed the usual convention of CLTK to separate with an _ (underscore)
* sync (#1)
* add odia language (#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes #628 Greek word tokenizer added (#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR #629
* Fix Typo (#627)
* Fix Typo mrathi -> marathi
* Update stops.py
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>

* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Fix typo UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS followed the usual convention of CLTK to separate with an _ (underscore)
* sync (#1)
* add odia language (#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes #628 Greek word tokenizer added (#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR #629
* Fix Typo (#627)
* Fix Typo mrathi -> marathi
* Update stops.py
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add swadesh_hi words and docs Signed-off-by: inishchith <inishchith@gmail.com>
* Update hindi.rst
* Update hindi.rst
* Update hindi.rst

* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Fix typo UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS followed the usual convention of CLTK to separate with an _ (underscore)
* sync (#1)
* add odia language (#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes #628 Greek word tokenizer added (#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR #629
* Fix Typo (#627)
* Fix Typo mrathi -> marathi
* Update stops.py
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add swadesh_hi words and docs Signed-off-by: inishchith <inishchith@gmail.com>
* remove duplicates
* remove duplicates #2 remove duplicates + docs formatting

This PR adds a default word tokenizer for Greek (NB: I could have sworn this already existed, but apparently not; e.g. it is not in the docs). Until language-specific options are added, this word tokenizer just uses the existing NLTK word tokenizer wrapper to split a string into a list of tokens. It also adds a unit test and documentation.
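For anyone skimming, here is a rough, self-contained sketch of the "default tokenizer per language" idea described above. It is not the code in this PR: `SketchWordTokenizer` and `default_tokenize_words` are hypothetical names, and a plain regex splitter stands in for the NLTK wrapper so the snippet runs with the standard library only.

```python
import re
import unicodedata

# Assumption: a simple regex splitter stands in for the NLTK word tokenizer
# wrapper mentioned in the PR description. It is NOT the CLTK implementation.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def default_tokenize_words(text):
    """Split text into word tokens, with each punctuation mark as its own token."""
    # NFC composes base letter + combining accent into one code point where a
    # precomposed form exists, so \w+ does not break a word at an accent.
    normalized = unicodedata.normalize("NFC", text)
    return _TOKEN_PATTERN.findall(normalized)


class SketchWordTokenizer:
    """Hypothetical dispatcher: the language is chosen at construction time."""

    def __init__(self, language):
        self.language = language.lower()

    def tokenize(self, text):
        if self.language == "greek":
            # No Greek-specific rules yet, so defer to the generic splitter.
            return default_tokenize_words(text)
        raise ValueError(f"No tokenizer sketched for {self.language!r}")


tokens = SketchWordTokenizer("greek").tokenize("ἐν ἀρχῇ ἦν ὁ λόγος.")
print(tokens)
```

The sample sentence comes back as six tokens, with the final period split off as its own token. Language-specific handling (elision marks, crasis, and so on) would presumably slot into the `greek` branch later, which is the kind of follow-up the PR description leaves open.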