Dev #2

kylepjohnson · 2014-06-15T21:38:12Z

Dev branch merge for the training set fix.

…418)

* Fixes #2, adds English words to exceptions list for Latin tokenizer
* Fixes #417: Check token against full regex before iterating in Latin backoff lemmatizer

* Check that tokens exist before handling them in Latin word tokenizer
* Add helper function 'tokenize' to sentence tokenizer
* Add custom word/sentence tokenizers to Latinlibrary 'book' loader
* Add alternative sentence tokenizer for latinlibrary loader if CLTK_DATA is not present
* Add GSoC Lemmatize files
* Minor updates to lookup files, tests for main file
* Improved docs for backoff.py; more consistent naming conventions throughout
* More general updates to GSoC lemmatizer code and models.
* Add test module for lemmatizer
* Add more tests for base lemmatizer classes
* Remove test_distance_sentences temporarily
* Added a readme with GSoC 2016 info
* Fixed markdown link
* Update readme.md
* do light cleanup
* Fixes #2, adds English words to exceptions list for Latin tokenizer
* Remove * import for nltk tag module; load old lemma model from cltk_data as pickle.
* Deleted old_model; moved to cltk_data
* Renamed ModelLemmatizer to TrainLemmatizer tests
* Make all model imports check cltk_data
* Fixed default settings--mostly preloading regex patterns--for various lemmatizers; modified tests accordingly
* Updated docstrings for backoff.py
* Moved model files out of module
* Fix bad print statement
* Clean up Greek lemmatize module
* Remove extra readme files
* Comment out TrigramPOSLemmatizer
* Cleanup backoff.py; add test for BigramPOS lemmatizer
* Remove function for old version of lemmatizer
* Updated tests
* Fixed typo in test_lemmatize imports
* Fixes #417: Check token against full regex before iterating in Latin backoff lemmatizer
* Updated docs for Latin lemmatzier
* Add seed parameter so that Lemmatizer runs can be replicable

* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Fix typo UNSTRUCTURESCONSONANTS -> UNSTRUCTURED_CONSONANTS followed the usual convention of CLTK to separate with an _ ( underscore)
* sync (#1)
* add odia language (#606)
* add odia language
* Fix-Typo
* add odia docs
* Update wiki summary to emphasize history
* Bump vers for Odia alphabet
* alpha odia
* Fixes #628 Greek word tokenizer added (#629)
* Check that tokens exist before handling them in Latin word tokenizer
* Update files
* Reset master
* Reset master
* rm whitespace
* Add default Greek tokenizer
* Comment on word tokenizer
* Cleanup order of languages/functions, alphabetical
* Update docs for Greek word tokenizer
* Add unittest for Greek word tokenizer
* Bump vers for tokenizer API addition of Greek by Patrick Burns PR #629
* Fix Typo (#627)
* Fix Typo mrathi -> marathi
* Update stops.py
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add classical_hindi stops.py Signed-off-by: inishchith <inishchith@gmail.com>
* add swadesh_hi words and docs Signed-off-by: inishchith <inishchith@gmail.com>
* remove duplicates
* remove duplicates #2 remove duplicates + docs formatting

kylepjohnson added 2 commits June 15, 2014 17:09

merged latin sent trainer and tokenizer

cefd6ef

format

10c0b82

kylepjohnson added a commit that referenced this pull request Jun 15, 2014

Merge pull request #2 from kylepjohnson/dev

44d2708

Dev

kylepjohnson merged commit 44d2708 into master Jun 15, 2014

kylepjohnson deleted the dev branch June 15, 2014 21:38

kylepjohnson pushed a commit that referenced this pull request Sep 6, 2015

Merge pull request #2 from kylepjohnson/master

350a7c4

update

kylepjohnson pushed a commit that referenced this pull request Aug 29, 2016

Fixes #2, adds English words to exceptions list for Latin tokenizer (#…

3cacd3f

…372)

kylepjohnson mentioned this pull request Aug 3, 2018

Add a function for Persian Word2Vec preprocessing. #815

Closed

clemsciences pushed a commit that referenced this pull request Sep 3, 2018

remove duplicates #2

dfdc69a

remove duplicates + docs formatting

kylepjohnson pushed a commit to kylepjohnson/cltk that referenced this pull request Aug 21, 2020

Merge pull request cltk#2 from kylepjohnson/update-prosody-johns

c967f70

fully refactor greek prosody, also reformat
