-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refined functionality of tokenizer for interaction with other libraries #3
Conversation
@@ -130,7 +142,7 @@ def _init_profile(self): | |||
if grapheme not in self.op_graphemes: | |||
self.op_graphemes[grapheme] = 1 | |||
else: | |||
raise Exception("You have a duplicate in your orthography profile.") | |||
raise Exception("{0} is duplicate in your orthography profile.".format(grapheme)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-ASCII in error messages is problematic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I agree, now that I saw the problems in lingpy. One could also discuss whether one wants to actually throw an error there, as one could also ignore it and just issue a warning. But there needs to be some verbosity of to which characters cause the problem, as it is incredibly problematic to not know which character is causing the problem. Having something like a class-attribute that stores duplicates, and a warning that is thrown if this is encountered: would that be a good pragmatic solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pragmatic solution is logging for any kind of debug output.
Unfortunately, I started doing some streamlining of the package overlapping with your changes: |
I think I can merge your #4 and then re-submit this PR with the modification for the missing graphemes. Regarding that missing graphemes code, I was anyway thinking that it is suboptimal still, and should rather go to the algorithm that searches the grapheme for matches in general. But reporting suboptimal matches directly may also blow up the search space, and apart from running a Dijkstra-like algorithm searching for the best suboptimal combination of all combinations of ngrams, I don't know how to do this in a "complete" and non-approximative way. So in short: current solution is pragmatic rather than exact, but I think it is helpful for creating profiles. |
Okay, no errors and conflicts with the merge I just created. |
I introduce a couple of refinements (at least I consider them as refinements):
Example:
Note that the behaviour is still not completely as wanted, as seen in out[7], as I the mapping is not gready after the wrong match of (k).