refined functionality of tokenizer for interaction with other libraries #3

Merged on Jan 9, 2017 (2 commits)

LinguList
Contributor

I introduce a couple of refinements (at least I consider them as refinements):

  • one can now also pass an orthoprofile as a list, which makes it easy to integrate orthoprofiles in software without loading them from a file
  • the transform command now also converts strings containing non-recognized items, following a greedy search for the best matches. This is helpful for the creation of orthoprofiles, as it tells the user where a particular conversion failed.
  • one can specify how missing items should be displayed, so that they do not pass as normally converted items; the default wraps them in angle brackets, as in <k> in Out[4] of the example.

Example:

In [1]: from segments.tokenizer import Tokenizer
In [2]: t = Tokenizer([['graphemes', 'ipa'], ['th', 'T'], ['kh', 'K'], ['a', 'a'], ['aa', 'A']])
In [3]: t.transform('khakha', 'ipa')
Out[3]: 'K a K a'
In [4]: t.transform('khaka', 'ipa')
Out[4]: 'K a <k> a'
In [7]: t.transform('khaakaa', 'ipa', missing=lambda x: '('+x+')')
Out[7]: 'K A (k) a a'
In [9]: t.transform('khaak aapa', 'ipa', missing=lambda x: '('+x+')', separator=' + ')
Out[9]: 'K A (k) + A (p) a'

Note that the behaviour is still not completely as wanted, as seen in Out[7]: the mapping is no longer greedy after the failed match of (k), so the following 'aa' is rendered as 'a a' rather than 'A'.
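
For comparison, here is a minimal sketch of the behaviour that note asks for: a longest-match scan that goes back to matching greedily right after a missing item. It is not the code in this PR. The function name `greedy_transform`, the dict-based `profile` and the default separator are assumptions made for the illustration.

```python
def greedy_transform(text, mapping, missing=lambda x: "<" + x + ">", separator=" # "):
    """Longest-match conversion of whitespace-separated words (illustrative only)."""
    max_len = max(len(grapheme) for grapheme in mapping)
    words = []
    for word in text.split():
        out, i = [], 0
        while i < len(word):
            # Try the longest candidate first, then shorter ones.
            for size in range(min(max_len, len(word) - i), 0, -1):
                chunk = word[i:i + size]
                if chunk in mapping:
                    out.append(mapping[chunk])
                    i += size
                    break
            else:
                # Nothing matched here: flag one character and carry on greedily.
                out.append(missing(word[i]))
                i += 1
        words.append(" ".join(out))
    return separator.join(words)


# Hypothetical stand-in for the list-based profile used in the example above.
profile = {"th": "T", "kh": "K", "a": "a", "aa": "A"}
```

With this sketch, `greedy_transform('khaakaa', profile, missing=lambda x: '('+x+')')` gives 'K A (k) A', while the other three calls from the example give the same strings as shown in the session.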

@@ -130,7 +142,7 @@ def _init_profile(self):
             if grapheme not in self.op_graphemes:
                 self.op_graphemes[grapheme] = 1
             else:
-                raise Exception("You have a duplicate in your orthography profile.")
+                raise Exception("{0} is duplicate in your orthography profile.".format(grapheme))
Contributor

non-ASCII in error messages is problematic

Contributor Author

Yep, I agree, now that I have seen the problems in lingpy. One could also discuss whether an error should be thrown there at all, since one could instead ignore the duplicate and just issue a warning. But the message needs to say which characters cause the problem, as it is incredibly problematic not to know which one it is. How about a class attribute that stores the duplicates, plus a warning that is issued whenever one is encountered: would that be a good pragmatic solution?

Contributor

I think the pragmatic solution is logging for any kind of debug output.
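
One way the two suggestions could be combined, as a rough sketch only: collect the duplicates and report each one through the standard `logging` module, using `ascii()` so that the message itself stays ASCII-only. The function `collect_graphemes` and the logger name are hypothetical and not part of the package.

```python
import logging

# Hypothetical logger name, chosen for this illustration.
log = logging.getLogger("profile_sketch")


def collect_graphemes(graphemes):
    """Return (seen, duplicates); warn about duplicates instead of raising."""
    seen = {}
    duplicates = []
    for grapheme in graphemes:
        if grapheme in seen:
            duplicates.append(grapheme)
            # ascii() escapes non-ASCII characters, so the log message is
            # ASCII-only while still identifying the offending grapheme.
            log.warning("duplicate grapheme in orthography profile: %s",
                        ascii(grapheme))
        else:
            seen[grapheme] = 1
    return seen, duplicates
```

The returned `duplicates` list plays the role of the class attribute proposed above, so callers can inspect it after loading a profile.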

@xrotwang
Contributor

Unfortunately, I started doing some streamlining of the package overlapping with your changes:
https://github.com/bambooforest/segments/pull/4
Except for the handling of missing graphemes, I should have incorporated all of your changes, though.

@LinguList
Contributor Author

I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. As for that missing-graphemes code, I was thinking anyway that it is still suboptimal and should rather go into the algorithm that searches for grapheme matches in general. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm that searches for the best suboptimal combination among all combinations of n-grams, I don't know how to do this in a "complete", non-approximate way. In short: the current solution is pragmatic rather than exact, but I think it is helpful for creating profiles.
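
To make the "Dijkstra-like" idea concrete, here is a hedged sketch that swaps in a simpler technique: dynamic programming over string positions, which is a shortest-path search on a left-to-right graph and so needs no priority queue. Matched graphemes cost nothing, each unmatched character costs `miss_cost`, and ties go to the segmentation with fewer pieces. The name `best_segmentation` and the cost model are assumptions for this example, not anything the package provides.

```python
def best_segmentation(word, graphemes, miss_cost=1.0):
    """Return (cost, [(chunk, matched), ...]) with minimal unmatched cost."""
    n = len(word)
    max_len = max(len(g) for g in graphemes)
    # best[j] = (cost, number of segments, start of last segment, matched?)
    best = [None] * (n + 1)
    best[0] = (0, 0, None, None)
    for i in range(n):
        cost, count = best[i][0], best[i][1]
        # Edges for every profile grapheme starting at position i (cost 0).
        edges = [(i + size, cost, True)
                 for size in range(1, min(max_len, n - i) + 1)
                 if word[i:i + size] in graphemes]
        # Fallback edge: skip one unmatched character at a penalty.
        edges.append((i + 1, cost + miss_cost, False))
        for j, new_cost, matched in edges:
            entry = (new_cost, count + 1, i, matched)
            if best[j] is None or entry[:2] < best[j][:2]:
                best[j] = entry
    # Walk the back-pointers from the end of the word to recover the segments.
    segments, j = [], n
    while j > 0:
        _, _, i, matched = best[j]
        segments.append((word[i:j], matched))
        j = i
    segments.reverse()
    return best[n][0], segments
```

A small case where this differs from a greedy scan: with the graphemes {'ab', 'a', 'bc'} and the word 'abc', a longest-match scan takes 'ab' and is left with an unmatched 'c', whereas the search above finds 'a' + 'bc' with no misses. The work grows with the word length times the longest grapheme length, so the search space stays small for this particular cost model.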

@LinguList
Contributor Author

Okay, no errors or conflicts with the merge I just created.

@bambooforest bambooforest merged commit cf9eb5f into cldf:master Jan 9, 2017