Adds method to get valid onsets
Sedictious committed Jul 14, 2018
1 parent aebf29d commit a232607
Showing 1 changed file with 54 additions and 0 deletions.
54 changes: 54 additions & 0 deletions cltk/phonology/syllabify.py
@@ -12,6 +12,60 @@
LOG.addHandler(logging.NullHandler())


def get_onsets(text, vowels="aeiou", threshold=0.0002):
    """
    Source: Resonances in Middle High German: New Methodologies in Prosody,
    2017, C. L. Hench
    :param text: str list: text to be analysed
    :param vowels: str: valid vowels constituting the syllable
    :param threshold: minimum frequency count for valid onset, C. Hench noted
    that the algorithm produces the best result for an untagged wordset of MHG,
    when retaining onsets which appear in at least 0.02% of the words
    Example:
    Let's test it on the opening lines of Nibelungenlied
    >>> text = ['uns', 'ist', 'in', 'alten', 'mæren', 'wunders', 'vil', 'geseit', 'von', 'helden', 'lobebæren',\\
    'von', 'grôzer', 'arebeit', 'von', 'fröuden', 'hôchgezîten', 'von', 'weinen', 'und', 'von', 'klagen', 'von',\\
    'küener', 'recken', 'strîten', 'muget', 'ir', 'nu', 'wunder', 'hœren', 'sagen']
    >>> vowels = "aeiouæœôîöü"
    >>> get_onsets(text, vowels=vowels)
    ['lt', 'm', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'ld', 'l', 'b', 'gr', 'z', 'fr', 'd', 'chg', 't', 'n', 'kl', 'k', 'ck', 'str']
    Of course, this is an insignificant sample, but we could try and see
    how modifying the threshold affects the returned onset:
    >>> get_onsets(text, threshold = 0.05, vowels=vowels)
    ['m', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'b', 'z', 't', 'n']
    """
    onset_dict = defaultdict(lambda: 0)
    n = len(text)

    for word in text:
        onset = ''
        candidates = []

        for l in word:

            if l not in vowels:
                onset += l

            else:
                if onset != '':
                    candidates.append(onset)
                    onset = ''

        for c in candidates:
            onset_dict[c] += 1

    return [onset for onset, i in zip(onset_dict.keys(), onset_dict.values()) if i/n > threshold]

@clemsciences

clemsciences Jul 14, 2018

Member

I think

return [onset for onset, i in onset_dict if i/n > threshold]

is correct and more concise but I'm on my phone so I can't test it.

@Sedictious

Sedictious Jul 14, 2018

Author Member

This is what I thought, but apparently it raises a ValueError. There may be a Pythonism we are ignoring, though...
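
A likely explanation, sketched under the assumption that `onset_dict` maps onset strings to counts as in the commit: iterating over a dict yields only its keys, so `for onset, i in onset_dict` tries to unpack each key string into two names. That happens to work for a two-character key such as `'lt'`, but not for `'m'` or `'str'`. The dictionary contents below are made up for illustration.

```python
# Hypothetical counts, standing in for the onset_dict built in get_onsets.
onset_dict = {"lt": 1, "m": 2, "str": 1}

# A two-character key unpacks by accident into its two characters.
onset, i = "lt"  # onset == "l", i == "t", which is not a (key, count) pair

# Collect the keys for which the same unpacking fails.
errors = []
for key in onset_dict:  # iterating a dict yields its keys only
    try:
        first, second = key  # what `for onset, i in onset_dict` does per key
    except ValueError as exc:
        errors.append((key, str(exc)))

for key, message in errors:
    print(key, "->", message)
```

Even when no exception is raised, the unpacked values are characters of the key rather than the key and its count, so the comparison against the threshold would not be meaningful.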

@clemsciences

clemsciences Jul 14, 2018

Member
return [onset for onset in onset_dict if onset_dict[onset]/n > threshold]

or

return [onset for onset, i in onset_dict.items() if i/n > threshold]
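
A quick check, on made-up counts, that both suggestions select the same onsets as the committed `zip` version. Here `n` and `threshold` play the same role as in `get_onsets`; the values are illustrative only.

```python
# Illustrative counts and settings; not taken from a real corpus.
onset_dict = {"m": 2, "lt": 1, "v": 7, "str": 1}
n = 32            # number of words in the sample
threshold = 0.05  # count / n must exceed this value

# The expression as committed.
committed = [onset for onset, i in zip(onset_dict.keys(), onset_dict.values()) if i / n > threshold]

# The two suggested alternatives.
by_key = [onset for onset in onset_dict if onset_dict[onset] / n > threshold]
by_items = [onset for onset, i in onset_dict.items() if i / n > threshold]

print(committed, by_key, by_items)
```

With 32 words, a threshold of 0.05 keeps only onsets counted at least twice, since 1/32 ≈ 0.031 and 2/32 = 0.0625.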


class Syllabifier:

    def __init__(self, low_vowels=None, mid_vowels=None, high_vowels=None, flaps=None, laterals=None, nasals=None,