Is this possible? Scan dic file and obtain all forms of all files #1
Comments
This is possible, but really hard! The easiest approach, which I think is what you want, is to just slap some affixes onto some roots. This is not totally correct, but it should get you somewhere. There are loads of corner cases that I don't understand, and this example does not even touch on compound words. Hope this helps: https://gist.github.com/aarondandy/aaa622afeeb0cb86b0d4efe697c23be5
Ty for this code. But I am rather interested in a perfectly working one,
especially one that would work with a UTF-8 database.
Currently there is the unmunch command, which works on the English dataset, but I
need another solution for Arabic and Turkish :D
There is another bash script, but it only prints to the screen and requires a
keyword to work.
I worked around this by using Hunspell's unmunch, which will generate all forms of all words. This is probably quicker/easier for a one-off job. (And it enables really fast comparisons against a HashSet<string>, at least an order of magnitude faster than Hunspell itself.)
unmunch doesn't work for UTF-8, e.g. Arabic.
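The HashSet<string> comparison mentioned above can be sketched roughly as follows, here in Python with a built-in set standing in for .NET's HashSet<string>. This is only a sketch: it assumes the generated word list has already been saved as one word per line, and the helper names build_word_set and load_words are hypothetical, not part of Hunspell or any library.

```python
def build_word_set(lines):
    """Collect non-empty, whitespace-trimmed lines into a set.

    Set membership is a hash lookup, so checking a word does not
    require running any affix logic at query time.
    """
    return {line.strip() for line in lines if line.strip()}


def load_words(path, encoding="utf-8"):
    """Read a one-word-per-line file (e.g. saved word-list output) into a set.

    The path and the default encoding are assumptions; pass whatever
    encoding the list was actually written in.
    """
    with open(path, encoding=encoding) as handle:
        return build_word_set(handle)


if __name__ == "__main__":
    # Hypothetical sample standing in for a saved word list.
    words = build_word_set(["make\n", "makes\n", "making\n", "\n"])
    print("making" in words)
    print("maked" in words)
```

The trade-off is memory: every generated form is held in the set, which is fine for a one-off job but may be large for languages with rich morphology.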
OK, let's say it doesn't work for non-Latin characters, then.
On Sun, May 28, 2017 at 4:24 PM, Rian Stockbower wrote:
> unmunch doesnt work for UTF8
unmunch may not work for non-ASCII characters, or non-Latin characters,
or unusual character encodings, but it absolutely works on UTF-8 files. You
may want to read up on Unicode and character encodings:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
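To make the distinction between file encoding and script concrete, here is a minimal Python sketch that reads the SET line of an .aff file and decodes dictionary bytes with the matching codec. The function name detect_aff_encoding is hypothetical, the sample .aff and .dic contents are made up for illustration, and the ISO8859-1 fallback is my understanding of Hunspell's default rather than something stated in this thread.

```python
import codecs


def detect_aff_encoding(aff_bytes, default="ISO8859-1"):
    """Return the Python codec name for the encoding declared by a SET line.

    Raises LookupError if the declared name is not a codec Python knows.
    The default is an assumption about what applies when SET is absent.
    """
    # A UTF-8 byte order mark would otherwise be glued to the first field.
    if aff_bytes.startswith(codecs.BOM_UTF8):
        aff_bytes = aff_bytes[len(codecs.BOM_UTF8):]
    for raw_line in aff_bytes.splitlines():
        fields = raw_line.split()
        if len(fields) >= 2 and fields[0] == b"SET":
            return codecs.lookup(fields[1].decode("ascii")).name
    return codecs.lookup(default).name


if __name__ == "__main__":
    # Made-up sample data: a UTF-8 affix file and a one-entry dictionary.
    aff = "SET UTF-8\nSFX X Y 1\nSFX X 0 ler .\n".encode("utf-8")
    dic = "1\nçiçek/X\n".encode("utf-8")
    encoding = detect_aff_encoding(aff)
    # Skip the first line, which holds the entry count.
    print(dic.decode(encoding).splitlines()[1])
```

Once the bytes are decoded into Python strings, later processing works on characters, so Latin, Arabic, or Turkish words are handled the same way.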
What I want is simple.
I would like to obtain all words that can be formed from a given dictionary entry.
E.g.
make/UAGS
in the us.dic file.
So I want to obtain all words that can be generated from this word/affix combination,
e.g. the results are: made, making, makes, etc.
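As a rough illustration of the request, the Python sketch below expands one entry such as make/UAGS by applying each flag's affix rules to the root. It is a simplification under stated assumptions: the rules in AFF_SAMPLE are illustrative lines written in the Hunspell affix format, not copied from a real .aff file; flags are assumed to be single characters; and prefix/suffix cross-products, compound words, and other corner cases are ignored. The names AffixRule, parse_rules, apply_rule, and expand are hypothetical.

```python
import re
from dataclasses import dataclass

# Illustrative rules only (an assumption, not a real dictionary's rule set).
AFF_SAMPLE = """\
PFX U Y 1
PFX U 0 un .
PFX A Y 1
PFX A 0 re .
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
"""


@dataclass(frozen=True)
class AffixRule:
    kind: str       # "PFX" or "SFX"
    strip: str      # characters removed from the root ("" for none)
    add: str        # characters attached to the root
    condition: str  # regex-like condition; "." matches anything


def parse_rules(aff_text):
    """Map each flag to its rules; 4-field header lines are skipped."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] not in ("PFX", "SFX"):
            continue
        kind, flag, strip, add, condition = parts[:5]
        strip = "" if strip == "0" else strip
        add = "" if add == "0" else add
        rules.setdefault(flag, []).append(AffixRule(kind, strip, add, condition))
    return rules


def apply_rule(word, rule):
    """Return the derived form, or None if the rule does not apply."""
    if rule.kind == "SFX":
        if not re.search("(?:" + rule.condition + ")$", word):
            return None
        if not word.endswith(rule.strip):
            return None
        return word[:len(word) - len(rule.strip)] + rule.add
    if not re.match(rule.condition, word) or not word.startswith(rule.strip):
        return None
    return rule.add + word[len(rule.strip):]


def expand(entry, rules):
    """Expand 'root/FLAGS' into the root plus every single-affix form."""
    word, _, flags = entry.partition("/")
    forms = {word}
    for flag in flags:  # assumes one character per flag
        for rule in rules.get(flag, []):
            form = apply_rule(word, rule)
            if form is not None:
                forms.add(form)
    return forms


if __name__ == "__main__":
    print(sorted(expand("make/UAGS", parse_rules(AFF_SAMPLE))))
```

With these sample rules the entry yields forms such as making, makes, unmake, and remake. An irregular form like made is not produced by any rule here; presumably it would have to come from its own dictionary entry. Because Python strings are Unicode, the same logic applies unchanged to non-Latin roots once the files have been decoded correctly.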