Fixes #628 Greek word tokenizer added #629

Merged
merged 25 commits into from Jan 28, 2018

Conversation

diyclassics
Collaborator

This PR adds a default word tokenizer for Greek (nb: I could have sworn this already existed, but apparently not; e.g. it is not in the docs). Until language-specific options are added, this word tokenizer simply uses the existing NLTK word tokenizer wrapper to split a string into a list of tokens. It also adds a unit test and documentation.
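
For readers unfamiliar with this kind of default tokenizer, here is a rough sketch of the idea, not the code in this PR. The class name `SimpleWordTokenizer` and the regex are assumptions made for illustration, and a plain `re` pattern stands in for the NLTK wrapper so that the example runs without extra dependencies.

```python
import re
import unicodedata

# Hypothetical illustration only: this regex stands in for the NLTK wrapper.
# It matches either a run of word characters or a single punctuation mark.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


class SimpleWordTokenizer:
    """Language-aware tokenizer with a generic fallback (illustrative name)."""

    def __init__(self, language):
        # Stored so that language-specific rules could be added later.
        self.language = language.lower()

    def tokenize(self, text):
        # Compose accents and breathings so each Greek word matches \w+.
        text = unicodedata.normalize("NFC", text)
        # No Greek-specific rules yet, so Greek uses the generic splitter.
        return _TOKEN_PATTERN.findall(text)


tokenizer = SimpleWordTokenizer("greek")
tokens = tokenizer.tokenize("Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον.")
print(tokens)
```

The sentence is split into five word tokens plus the final period as its own token. Handling of elision, crasis, or enclitics would be the sort of language-specific option the description defers.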

@kylepjohnson
Member

Smart addition.

FYI the build server errored out on this pass (looks like a problem on Travis CI's side), so I'll merge as soon as I see that it has finished.

@kylepjohnson kylepjohnson mentioned this pull request Jan 28, 2018
@codecov-io

Codecov Report

Merging #629 into master will increase coverage by 0.01%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   86.55%   86.57%   +0.01%     
==========================================
  Files         138      138              
  Lines        8385     8394       +9     
==========================================
+ Hits         7258     7267       +9     
  Misses       1127     1127

| Impacted Files | Coverage Δ | |
|---|---|---|
| cltk/tests/test_tokenize.py | 98.26% <100%> (+0.07%) | ⬆️ |
| cltk/lemmatize/latin/backoff.py | 96.48% <100%> (ø) | ⬆️ |
| cltk/tokenize/word.py | 93.33% <95.45%> (+0.2%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4339605...8fcd3e4.

@diyclassics
Collaborator Author

Saw that last night (problem seems to have been with Travis this time?)—looks like it’s passing on my end now.

@diyclassics
Collaborator Author

Fwiw I have seen a lot of interest in the Greek tools in recent months and hope to spend more time—starting with this PR—on building up/advising a Greek core development team.

@kylepjohnson kylepjohnson merged commit 094f001 into cltk:master Jan 28, 2018
kylepjohnson added a commit that referenced this pull request Jan 28, 2018
API addition of Greek by Patrick Burns PR #629
@kylepjohnson
Member

Yes, the hangups were all on Travis's side. Everything passed fine, eventually.

inishchith added a commit to inishchith/cltk that referenced this pull request Jan 29, 2018
*  add odia language (cltk#606)

*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Bump vers for Odia alphabet

* alpha odia

* Fixes cltk#628 Greek word tokenizer added (cltk#629)

* Check that tokens exist before handling them in Latin word tokenizer

* Update files

* Reset master

* Reset master

* rm whitespace

* Add default Greek tokenizer

* Comment on word tokenizer

* Cleanup order of languages/functions, alphabetical

* Update docs for Greek word tokenizer

* Add unittest for Greek word tokenizer

* Bump vers for tokenizer

API addition of Greek by Patrick Burns PR cltk#629

* Fix Typo (cltk#627)

* Fix Typo

mrathi -> marathi

* Update stops.py
kylepjohnson pushed a commit that referenced this pull request Jan 30, 2018
*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Fix typo 

UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS
followed the usual convention of CLTK to separate with an _ ( underscore)

* sync (#1)

*  add odia language (#606)

*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Bump vers for Odia alphabet

* alpha odia

* Fixes #628 Greek word tokenizer added (#629)

* Check that tokens exist before handling them in Latin word tokenizer

* Update files

* Reset master

* Reset master

* rm whitespace

* Add default Greek tokenizer

* Comment on word tokenizer

* Cleanup order of languages/functions, alphabetical

* Update docs for Greek word tokenizer

* Add unittest for Greek word tokenizer

* Bump vers for tokenizer

API addition of Greek by Patrick Burns PR #629

* Fix Typo (#627)

* Fix Typo

mrathi -> marathi

* Update stops.py
@inishchith inishchith mentioned this pull request Jan 31, 2018
inishchith added a commit to inishchith/cltk that referenced this pull request Jan 31, 2018
kylepjohnson pushed a commit that referenced this pull request Feb 8, 2018
*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Fix typo 

UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS
followed the usual convention of CLTK to separate with an _ ( underscore)

* sync (#1)

*  add odia language (#606)

*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Bump vers for Odia alphabet

* alpha odia

* Fixes #628 Greek word tokenizer added (#629)

* Check that tokens exist before handling them in Latin word tokenizer

* Update files

* Reset master

* Reset master

* rm whitespace

* Add default Greek tokenizer

* Comment on word tokenizer

* Cleanup order of languages/functions, alphabetical

* Update docs for Greek word tokenizer

* Add unittest for Greek word tokenizer

* Bump vers for tokenizer

API addition of Greek by Patrick Burns PR #629

* Fix Typo (#627)

* Fix Typo

mrathi -> marathi

* Update stops.py

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>
@diyclassics diyclassics deleted the greek-word-tokenizer branch February 9, 2018 15:56
kylepjohnson pushed a commit that referenced this pull request Feb 19, 2018
*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Fix typo 

UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS
followed the usual convention of CLTK to separate with an _ ( underscore)

* sync (#1)

*  add odia language (#606)

*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Bump vers for Odia alphabet

* alpha odia

* Fixes #628 Greek word tokenizer added (#629)

* Check that tokens exist before handling them in Latin word tokenizer

* Update files

* Reset master

* Reset master

* rm whitespace

* Add default Greek tokenizer

* Comment on word tokenizer

* Cleanup order of languages/functions, alphabetical

* Update docs for Greek word tokenizer

* Add unittest for Greek word tokenizer

* Bump vers for tokenizer

API addition of Greek by Patrick Burns PR #629

* Fix Typo (#627)

* Fix Typo

mrathi -> marathi

* Update stops.py

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>

*  add swadesh_hi words and docs

Signed-off-by: inishchith <inishchith@gmail.com>

* Update hindi.rst

* Update hindi.rst

* Update hindi.rst
kylepjohnson pushed a commit that referenced this pull request Feb 22, 2018
*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Fix typo 

UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS
followed the usual convention of CLTK to separate with an _ ( underscore)

* sync (#1)

*  add odia language (#606)

*  add odia language

* Fix-Typo

*  add odia docs

* Update wiki summary to emphasize history

* Bump vers for Odia alphabet

* alpha odia

* Fixes #628 Greek word tokenizer added (#629)

* Check that tokens exist before handling them in Latin word tokenizer

* Update files

* Reset master

* Reset master

* rm whitespace

* Add default Greek tokenizer

* Comment on word tokenizer

* Cleanup order of languages/functions, alphabetical

* Update docs for Greek word tokenizer

* Add unittest for Greek word tokenizer

* Bump vers for tokenizer

API addition of Greek by Patrick Burns PR #629

* Fix Typo (#627)

* Fix Typo

mrathi -> marathi

* Update stops.py

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>

*  add classical_hindi stops.py

Signed-off-by: inishchith <inishchith@gmail.com>

*  add swadesh_hi words and docs

Signed-off-by: inishchith <inishchith@gmail.com>

* remove duplicates

* remove duplicates #2

remove duplicates + docs formatting