refined functionality of tokenizer for interaction with other libraries #3

LinguList · 2016-11-07T12:08:34Z

I introduce a couple of refinements (at least I consider them as refinements):

One can now also pass an orthoprofile as a list, which makes it easy to integrate orthoprofiles into software without loading them.
The transform command now also handles unrecognized items: it follows a greedy search for the best matches and marks whatever it cannot convert. This helps when creating orthoprofiles, as it shows the user where a particular conversion failed.
One can specify how missing items should be displayed, so that they do not pass as normally converted items; by default they are wrapped in angle brackets, as in <k> in the example.

Example:

In [1]: from segments.tokenizer import Tokenizer
In [2]: t = Tokenizer([['graphemes', 'ipa'], ['th', 'T'], ['kh', 'K'], ['a', 'a'], ['aa', 'A']])
In [3]: t.transform('khakha', 'ipa')
Out[3]: 'K a K a'
In [4]: t.transform('khaka', 'ipa')
Out[4]: 'K a <k> a'
In [7]: t.transform('khaakaa', 'ipa', missing=lambda x: '('+x+')')
Out[7]: 'K A (k) a a'
In [9]: t.transform('khaak aapa', 'ipa', missing=lambda x: '('+x+')', separator=' + ')
Out[9]: 'K A (k) + A (p) a'

Note that the behaviour is still not quite as intended, as seen in Out[7]: the mapping is no longer greedy after the failed match of (k), so the final aa comes out as 'a a' rather than 'A'.
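
For readers who want to see the idea in isolation, here is a minimal sketch of a greedy longest-match conversion with a `missing` callback. It is not the code in this PR: the function name `greedy_transform`, the plain dict `profile`, and the default separator are assumptions made for illustration. Unlike Out[7], this sketch stays greedy after an unmatched character.

```python
def greedy_transform(text, mapping, missing=lambda g: "<" + g + ">", separator=" # "):
    """Convert text word by word using greedy longest-match lookup in `mapping`.

    `mapping` is a hypothetical dict from grapheme to target symbol;
    the default separator is an assumption, not taken from the library.
    """
    max_len = max(len(g) for g in mapping)
    words = []
    for word in text.split():
        out, i = [], 0
        while i < len(word):
            # Try the longest candidate first, then shrink the window.
            for size in range(min(max_len, len(word) - i), 0, -1):
                chunk = word[i:i + size]
                if chunk in mapping:
                    out.append(mapping[chunk])
                    i += size
                    break
            else:
                # Nothing in the mapping starts here: flag one character
                # as missing and resume greedy matching right after it.
                out.append(missing(word[i]))
                i += 1
        words.append(" ".join(out))
    return separator.join(words)


profile = {"th": "T", "kh": "K", "a": "a", "aa": "A"}
print(greedy_transform("khaka", profile))
print(greedy_transform("khaakaa", profile, missing=lambda x: "(" + x + ")"))
```

With this sketch, 'khaakaa' yields 'K A (k) A', because the longest-match loop restarts at every position regardless of what happened before.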

xrotwang · 2016-11-10T09:33:37Z

segments/tokenizer.py

@@ -130,7 +142,7 @@ def _init_profile(self):
            if grapheme not in self.op_graphemes:
                self.op_graphemes[grapheme] = 1
            else:
-                raise Exception("You have a duplicate in your orthography profile.")
+                raise Exception("{0} is duplicate in your orthography profile.".format(grapheme))


non-ASCII in error messages is problematic

Yep, I agree, now that I saw the problems in lingpy. One could also discuss whether one actually wants to throw an error there, as one could also ignore the duplicate and just issue a warning. But the message needs to say which characters cause the problem, as it is incredibly problematic not to know which character is responsible. How about a class attribute that stores duplicates, plus a warning that is issued whenever one is encountered: would that be a good pragmatic solution?

I think the pragmatic solution is logging for any kind of debug output.
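
A rough sketch of how the two suggestions could be combined: duplicates are collected rather than raised, and each one is reported through the standard `logging` module. The function name `collect_graphemes`, the logger name, and the row layout (grapheme first, header row already removed) are assumptions for illustration, not the package's actual code. The sketch assumes Python 3, where `ascii()` escapes non-ASCII characters so the message itself stays ASCII-only.

```python
import logging

log = logging.getLogger("orthoprofile")  # hypothetical logger name


def collect_graphemes(rows):
    """Return (graphemes, duplicates) for profile rows without the header row."""
    graphemes, duplicates = {}, []
    for row in rows:
        grapheme = row[0]
        if grapheme in graphemes:
            # Keep the first definition, remember the duplicate, and log it.
            duplicates.append(grapheme)
            # ascii() escapes non-ASCII characters in the reported grapheme.
            log.warning("duplicate grapheme in orthography profile: %s", ascii(grapheme))
        else:
            graphemes[grapheme] = row[1:]
    return graphemes, duplicates


graphemes, duplicates = collect_graphemes([["a", "a"], ["ä", "E"], ["a", "b"]])
```

The caller can then inspect `duplicates` after loading and decide whether to abort or continue.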

xrotwang · 2016-11-10T10:12:10Z

Unfortunately, I started doing some streamlining of the package overlapping with your changes:
https://github.com/bambooforest/segments/pull/4
Except for the handling of missing graphemes, I should have incorporated all of your changes, though.

LinguList · 2016-11-10T10:22:17Z

I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. Regarding the missing-graphemes code, I was thinking anyway that it is still suboptimal and should rather go into the general algorithm that searches for grapheme matches. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm that searches for the best suboptimal combination among all combinations of ngrams, I don't know how to do this in a "complete" and non-approximative way. So in short: the current solution is pragmatic rather than exact, but I think it is helpful for creating profiles.
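
To make the "complete" alternative concrete, here is one possible exact formulation, sketched as dynamic programming over string positions rather than as Dijkstra proper. It assumes that "best" means leaving as few characters unmatched as possible, with ties broken by using fewer segments; that objective, the function name `best_segmentation`, and the set-based profile are all assumptions, not something agreed on in this thread.

```python
def best_segmentation(word, graphemes):
    """Split `word` so that as few characters as possible stay unmatched.

    Returns a list of (segment, matched) pairs. `graphemes` is a set of
    known graphemes (hypothetical stand-in for an orthography profile).
    """
    n = len(word)
    max_len = max(len(g) for g in graphemes)
    # best[i] = (unmatched characters, segment count) for the best split of word[i:]
    best = [None] * n + [(0, 0)]
    choice = [0] * n  # length of the grapheme taken at i; 0 means "unmatched"
    for i in range(n - 1, -1, -1):
        # Fallback: treat word[i] as a single unmatched character.
        cost = (best[i + 1][0] + 1, best[i + 1][1] + 1)
        taken = 0
        for size in range(1, min(max_len, n - i) + 1):
            if word[i:i + size] in graphemes:
                candidate = (best[i + size][0], best[i + size][1] + 1)
                if candidate <= cost:
                    cost, taken = candidate, size
        best[i], choice[i] = cost, taken
    # Walk forward through the recorded choices to rebuild the segmentation.
    segments, i = [], 0
    while i < n:
        size = choice[i] or 1
        segments.append((word[i:i + size], choice[i] > 0))
        i += size
    return segments


print(best_segmentation("abcd", {"ab", "abc", "cd"}))
```

For the toy profile above, a greedy longest match would take 'abc' and then fail on 'd', whereas this search finds 'ab' + 'cd' with nothing left over. The cost is proportional to the word length times the longest grapheme, so the search space stays small under this particular objective.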

LinguList · 2016-11-21T20:24:57Z

Okay, no errors and conflicts with the merge I just created.

refined functionality of tokenizer for interaction with other libraries

ab017b3

xrotwang reviewed Nov 10, 2016

merged the approaches

a716ac0

bambooforest merged commit cf9eb5f into cldf:master Jan 9, 2017
