
Fix Syllabification for gu[aeiou] (#621)

* Initial release with unit tests and doctests

* Added sections and preliminary documentation for:
Scansion of Poetry
About the use of macrons in poetry
HexameterScanner
Hexameter
ScansionConstants
Syllabifier
Metrical Validator
ScansionFormatter
StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst file during testing and verification.

* corrected documentation and doctest comments that were causing errors.
doctests run with an added command-line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest

* fixing broken doctest comment

* correcting a documentation comment that caused a doctest error

* Corrections to make the build pass:
1. Added gensim to the Travis build script; its absence was causing an error in word2vec.py during the build.
2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class rather than at the module level; the macronizer file is 32 MB, which also seems to cause an error on Travis, since GitHub does not render large files and the file may not be available for the build. The macronizer object is now a component of "self" (see the sketch below).
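
A minimal sketch of change 2, assuming hypothetical constructor arguments (the real transcription.py signatures may differ):

    # Import path assumed; the commit message names only "the macronizer".
    from cltk.prosody.latin.macronizer import Macronizer

    # Before: module-level initialization ran on import, so the 32 MB data
    # file had to be present for the build even if no Transcriber was used.
    # macronizer = Macronizer('tagger')              # hypothetical argument

    class Transcriber:
        def __init__(self, dialect, reconstruction):  # parameters assumed
            self.dialect = dialect
            self.reconstruction = reconstruction
            # After: initialize on instantiation and keep the macronizer
            # as a component of "self".
            self.macronizer = Macronizer('tagger')    # hypothetical argument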

* moved the package import inside of main so that it does not prevent the build from completing (see the sketch below);
soon we should update the dependencies of word2vec: gensim pulls in boto, which isn't Python 3 compliant; there is a boto3 version we may be able to slot in, but the larger question is whether boto is necessary at all.
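
A minimal sketch of the deferred-import pattern described in this bullet (the function body is illustrative, not the actual word2vec.py code):

    # Before: a module-level import failed at build time whenever gensim
    # (and its transitive boto dependency) was unavailable.
    # import gensim

    def main():
        # After: defer the import until this code path actually runs, so
        # merely importing the module can no longer break the build.
        import gensim
        print(gensim.__version__)

    if __name__ == '__main__':
        main()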

* correcting documentation


* Added PentameterScanner and HendecasyllableScanner, more unit tests, and bug fixes; refactored the Hexameter class into a Verse class; pulled out VerseScanner; updated documentation
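
A hedged sketch of how the refactor might arrange the classes named above (the inheritance shown is an inference from this bullet, not read from the code):

    class Verse:
        """Refactored from the old Hexameter class: holds a scanned line."""

    class VerseScanner:
        """Common scanning logic pulled out of the meter-specific scanners."""

    class HexameterScanner(VerseScanner):
        """Scans dactylic hexameter lines."""

    class PentameterScanner(VerseScanner):
        """Scans pentameter lines."""

    class HendecasyllableScanner(VerseScanner):
        """Scans hendecasyllabic lines."""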

* updating contributors

* Additional testing and small bug fixes based on integration tests.

* Corrections for unittest

* Adding additional unit tests; catching errors if an invalid/unconfigured charset is used with Syllabifier

* fix for gu[aeiou] syllabification
todd-cook authored and kylepjohnson committed Jan 10, 2018
1 parent ecc89b8 commit fe1d250f4a125855812717e35ae6f777190a4cfb
Showing with 35 additions and 11 deletions.
  1. +24 −0 cltk/prosody/latin/Syllabifier.py
  2. +11 −11 docs/latin.rst
@@ -83,10 +83,28 @@ def syllabify(self, words: str) -> list:
        ['ru', 'ptus']
        >>> print(syllabifier.syllabify("Bīthÿnus"))
        ['Bī', 'thÿ', 'nus']
        >>> print(syllabifier.syllabify("sanguen"))
        ['san', 'guen']
        >>> print(syllabifier.syllabify("unguentum"))
        ['un', 'guen', 'tum']
        >>> print(syllabifier.syllabify("lingua"))
        ['lin', 'gua']
        >>> print(syllabifier.syllabify("languidus"))
        ['lan', 'gui', 'dus']
        """
        cleaned = words.translate(self.remove_punct_map)
        cleaned = cleaned.replace("qu", "kw")
        cleaned = cleaned.replace("Qu", "Kw")
        cleaned = cleaned.replace("gua", "gwa")
        cleaned = cleaned.replace("Gua", "Gwa")
        cleaned = cleaned.replace("gue", "gwe")
        cleaned = cleaned.replace("Gue", "Gwe")
        cleaned = cleaned.replace("gui", "gwi")
        cleaned = cleaned.replace("Gui", "Gwi")
        cleaned = cleaned.replace("guo", "gwo")
        cleaned = cleaned.replace("Guo", "Gwo")
        cleaned = cleaned.replace("guu", "gwu")
        cleaned = cleaned.replace("Guu", "Gwu")
        items = cleaned.strip().split(" ")
        for char in cleaned:
            if not char in self.ACCEPTABLE_CHARS:
@@ -102,6 +120,12 @@ def syllabify(self, words: str) -> list:
if "Kw" in syl:
syl = syl.replace("Kw", "Qu")
syllables[idx] = syl
if "gw" in syl:
syl = syl.replace("gw", "gu")
syllables[idx] = syl
if "Gw" in syl:
syl = syl.replace("Gw", "Gu")
syllables[idx] = syl
return StringUtils.remove_blank_spaces(syllables)
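
The hunk above masks "qu" and "gu" + vowel as "kw"/"gw" so the splitter treats the u as a consonantal glide rather than a vowel, then restores the original spelling syllable by syllable. A minimal usage sketch, assuming the import path matches the file shown in the diff and a no-argument constructor:

    # Import path assumed from cltk/prosody/latin/Syllabifier.py above.
    from cltk.prosody.latin.Syllabifier import Syllabifier

    syllabifier = Syllabifier()                  # default charset assumed
    print(syllabifier.syllabify("unguentum"))    # ['un', 'guen', 'tum']
    print(syllabifier.syllabify("lingua"))       # ['lin', 'gua']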
@@ -235,15 +235,15 @@ The backoff module also offers IdentityLemmatizer which returns the given token
With the TrainLemmatizer, the backoff module allows you to provide a dictionary of the form {'TOKEN1': 'LEMMA1', 'TOKEN2': 'LEMMA2'} for lemmatization.
.. code-block:: python

   In [10]: tokens = ['arma', 'uirum', '-que', 'cano', ',', 'troiae', 'qui', 'primus', 'ab', 'oris']

   In [11]: dict = {'arma': 'arma', 'uirum': 'uir', 'troiae': 'troia', 'oris': 'ora'}

   In [12]: from cltk.lemmatize.latin.backoff import TrainLemmatizer

   In [13]: lemmatizer = TrainLemmatizer(dict)

   In [14]: lemmatizer.lemmatize(tokens)
   Out[14]: [('arma', 'arma'), ('uirum', 'uir'), ('-que', None), ('cano', None), (',', None), ('troiae', 'troia'), ('qui', None), ('primus', None), ('ab', None), ('oris', 'ora')]
@@ -252,7 +252,7 @@ The TrainLemmatizer—like all of the lemmatizers in this module—can take a se
.. code-block:: python

   In [15]: default = DefaultLemmatizer('UNK')

   In [16]: lemmatizer = TrainLemmatizer(dict, backoff=default)

   In [17]: lemmatizer.lemmatize(tokens)
@@ -263,22 +263,22 @@ With the ContextLemmatizer, the backoff module allows you to provide a list of l
There are subclasses included in the backoff lemmatizer for unigram and bigram context. Here is an example of the UnigramLemmatizer():
.. code-block:: python

   In [18]: train_data = [[('cum', 'cum2'), ('esset', 'sum'), ('caesar', 'caesar'), ('in', 'in'), ('citeriore', 'citer'), ('gallia', 'gallia'), ('in', 'in'), ('hibernis', 'hibernus'), (',', 'punc'), ('ita', 'ita'), ('uti', 'ut'), ('supra', 'supra'), ('demonstrauimus', 'demonstro'), (',', 'punc'), ('crebri', 'creber'), ('ad', 'ad'), ('eum', 'is'), ('rumores', 'rumor'), ('adferebantur', 'affero'), ('litteris', 'littera'), ('-que', '-que'), ('item', 'item'), ('labieni', 'labienus'), ('certior', 'certus'), ('fiebat', 'fio'), ('omnes', 'omnis'), ('belgas', 'belgae'), (',', 'punc'), ('quam', 'qui'), ('tertiam', 'tertius'), ('esse', 'sum'), ('galliae', 'gallia'), ('partem', 'pars'), ('dixeramus', 'dico'), (',', 'punc'), ('contra', 'contra'), ('populum', 'populus'), ('romanum', 'romanus'), ('coniurare', 'coniuro'), ('obsides', 'obses'), ('-que', '-que'), ('inter', 'inter'), ('se', 'sui'), ('dare', 'do'), ('.', 'punc')], [('coniurandi', 'coniuro'), ('has', 'hic'), ('esse', 'sum'), ('causas', 'causa'), ('primum', 'primus'), ('quod', 'quod'), ('uererentur', 'uereor'), ('ne', 'ne'), (',', 'punc'), ('omni', 'omnis'), ('pacata', 'paco'), ('gallia', 'gallia'), (',', 'punc'), ('ad', 'ad'), ('eos', 'is'), ('exercitus', 'exercitus'), ('noster', 'noster'), ('adduceretur', 'adduco'), (';', 'punc')]]

   In [19]: default = DefaultLemmatizer('UNK')

   In [20]: lemmatizer = UnigramLemmatizer(train_data, backoff=default)

   In [21]: lemmatizer.lemmatize(tokens)
   Out[21]: [('arma', 'UNK'), ('uirum', 'UNK'), ('-que', '-que'), ('cano', 'UNK'), (',', 'punc'), ('troiae', 'UNK'), ('qui', 'UNK'), ('primus', 'UNK'), ('ab', 'UNK'), ('oris', 'UNK')]
NB: Documentation is still being written for the remaining backoff lemmatizers, i.e., RegexpLemmatizer() and ContextPOSLemmatizer().
Line Tokenization
=================
The line tokenizer takes a string input into ``tokenize()`` and returns a list of strings.
.. code-block:: python
@@ -289,7 +289,7 @@ The line tokenizer takes a string input into ``tokenize()`` and returns a list o
   In [3]: untokenized_text = """49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

   In [4]: tokenizer.tokenize(untokenized_text)
   Out[4]: ['49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
The line tokenizer removes multiple line breaks by default. If you wish to retain blank lines in the returned list, set ``include_blanks`` to ``True``.
@@ -299,7 +299,7 @@ The line tokenizer by default removes multiple line breaks. If you wish to retai
   In [5]: untokenized_text = """48. Cum tibi contigerit studio cognoscere multa,\nFac discas multa, vita nil discere velle.\n\n49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""

   In [6]: tokenizer.tokenize(untokenized_text, include_blanks=True)
   Out[6]: ['48. Cum tibi contigerit studio cognoscere multa,','Fac discas multa, vita nil discere velle.','','49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
Macronizer
