Documentation and discoverability of additional algorithms #19

Open
5 tasks
bbqsrc opened this issue Mar 19, 2020 · 1 comment
bbqsrc commented Mar 19, 2020

We have two algorithms at play in divvunspell that don't exist in hfst-ospell:

  • Case handling
  • Penalty weighting for a differing first letter, a differing last letter, and Damerau–Levenshtein distance for the middle letters
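
To make the second item concrete, here is a minimal sketch of how such a reweighting could work. It is not taken from the divvunspell source: the function names and the penalty values (10 for a differing first letter, 10 for a differing last letter, 5 per edit in the middle) are assumptions, chosen only because they reproduce the weights reported in the comment further down this thread. The distance used is the optimal string alignment variant of Damerau–Levenshtein, and "letter" here means a single Unicode code point.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
            # Transposition of two adjacent code points.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def reweight(word: str, suggestion: str, weight: float,
             start_penalty: float = 10.0,   # assumed value
             end_penalty: float = 10.0,     # assumed value
             mid_penalty: float = 5.0) -> float:  # assumed value
    """Add hypothetical first/last/middle penalties to a speller weight."""
    penalty = 0.0
    if word[:1] != suggestion[:1]:
        penalty += start_penalty
    if word[-1:] != suggestion[-1:]:
        penalty += end_penalty
    penalty += mid_penalty * osa_distance(word[1:-1], suggestion[1:-1])
    return weight + penalty


# Unmodified weights as reported by hfst-ospell in the comment below.
for suggestion, weight in [("кера\u0304", 1.0), ("кека", 10.0), ("кераз", 10.0)]:
    print(suggestion, weight, reweight("кера", suggestion, weight))
```

Under these assumed values, a suggestion that only changes the end of the word is pushed below suggestions that change a middle letter, even when its unmodified weight is much lower.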

Things to do to make this good:

  • Document somewhere sane how the algorithms behave
  • Add some information to --help either with a link or with the information itself
  • In divvunspell's suggestion output, show the penalties and the unmodified weights alongside the modified weights
  • Document how to add the weight information to BHFST files so it can be controlled by the linguist
  • If possible, add a flag to disable the penalty weighting algorithm (--no-case-handling already does this to some extent, but case handling and penalty weighting should be controlled by separate flags)
bbqsrc self-assigned this Mar 19, 2020
bbqsrc added this to the 1.0 milestone Mar 19, 2020
nlhowell commented

Just a ping: this is important for me; I have orthographic corrections that
specifically apply to the beginnings and ends of words; these are given low (or
even zero!) weight.

hfst-ospell makes the correct suggestions, but divvunspell overrides some of
these with much less appropriate corrections. It would be great if I could add
some information to the .bhfst to modify this.

Here's an example.

Input: кера (final glyph is "cyrillic a")
Correct spelling: кера̄ (final glyph is "cyrillic a" + "combining macron")
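
As a small illustration of the example above, the sketch below checks with Python's standard `unicodedata` module that the correct spelling is the input plus one trailing code point. The variable names are only for illustration, and nothing here says how divvunspell itself segments letters.

```python
import unicodedata

typed = "кера"
correct = "кера\u0304"  # same letters followed by U+0304

# The correct spelling extends the input by exactly one code point.
print(correct.startswith(typed), len(typed), len(correct))

# That final code point is a combining mark, not a base letter.
last = correct[-1]
print(unicodedata.name(last), unicodedata.combining(last))

# NFC normalization does not merge the pair into a single code point,
# so a per-code-point comparison sees a different final element.
nfc = unicodedata.normalize("NFC", correct)
print(len(nfc), nfc[-1] == typed[-1])
```

So any comparison that works on code points treats this correction as a change to the last position of the word.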

Suggested spellings (hfst-ospell):

$ echo 'кера' | hfst-ospell tsez.zhfst -S | head
"кера" is NOT in the lexicon:
Corrections for "кера":
кера̄    1.000000
кека    10.000000
кекра    10.000000
кеза    10.000000
кура    10.000000
кераз    10.000000
кеца    10.000000
кецра    10.000000

Suggested spellings (divvunspell):

$ echo 'кера' | divvunspell -b ddo.bhfst -s | head
Reading from stdin...
Input: кера		[INCORRECT]
кеза		15
кека		15
кекра		15
кеца		15
кецра		15
кура		15
кера̄		16
кераз		25
керо		25
