Redundant and miscategorized stems in apertium-kaz.kaz.lexc #11

Open
IlnarSelimcan opened this issue Feb 18, 2019 · 4 comments

@IlnarSelimcan
Member

IlnarSelimcan commented Feb 18, 2019

The vocabulary of apertium-kaz.kaz.lexc requires checking for redundancy, consistency and miscategorizations. Here are some examples:

кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
жонылған:жонылған A1 ; ! ""
сүрілген:сүрілген A1 ; ! ""

Along with that, the reasons why these are considered mistakes and, more generally, the choices made should be documented in apertium-kaz/docs so that these kinds of issues don't happen in the future.

At that point, since the coverage of apertium-kaz is relatively high, that documentation will probably be more useful for other (Turkic) languages than for Kazakh.
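
A rough way to surface more candidates like the four entries above is sketched below. It assumes, and this is my reading rather than something stated in the issue, that the concern is forms ending in a past-participle suffix (-ған/-ген/-қан/-кен) being listed as adjective stems. The regular expression only handles simple `lemma:surface CLASS ;` lines, and the function names are hypothetical. The output is a list for manual review, not a verdict, since genuine adjectives may share these endings.

```python
import re

# Matches a simple lexc entry such as: кептірген:кептірген A1 ; ! ""
ENTRY_RE = re.compile(r'^\s*([^\s:!]+):([^\s:!]+)\s+(\S+)\s*;')

# Assumption: endings that suggest a participle rather than an adjective stem.
PARTICIPLE_ENDINGS = ("ған", "ген", "қан", "кен")


def parse_entry(line):
    """Return (lemma, surface, continuation_class), or None for other lines."""
    match = ENTRY_RE.match(line)
    return match.groups() if match else None


def suspicious_adjectives(lines, adjective_classes=("A1",)):
    """Collect lemmas in adjective classes that end like participles."""
    flagged = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        lemma, _surface, cont = entry
        if cont in adjective_classes and lemma.endswith(PARTICIPLE_ENDINGS):
            flagged.append(lemma)
    return flagged
```

Running `suspicious_adjectives(open("apertium-kaz.kaz.lexc", encoding="utf-8"))` would then give a starting list to go through by hand.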

@IlnarSelimcan IlnarSelimcan self-assigned this Feb 18, 2019
IlnarSelimcan added a commit that referenced this issue Feb 20, 2019
IlnarSelimcan added a commit to apertium/apertium-kaz-tat that referenced this issue Mar 9, 2019
IlnarSelimcan added a commit that referenced this issue Mar 9, 2019
@IlnarSelimcan
Member Author

IlnarSelimcan commented Mar 12, 2019

Instead of going over the list of stems found in kaz.lexc and checking them, I decided to start with surface forms from a frequency list made out of the Kazakh translation of the Little Prince (rather than the subset of kitap.kz books that are presumably all in the public domain, just to try the idea on something smaller). You can find the words from that frequency list (the ones I have already tested) here: https://github.com/taruen/apertiumpp/blob/master/data4apertium/vocabulary/kaz.rkt

My logic here was that:

  • checking the analyses of high frequency surface forms is more useful
  • it will reveal mis-categorizations and redundant entries in kaz.lexc anyway
  • kaz.lexc will get new stems in the process
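
For illustration, a frequency list of surface forms can be built in a few lines. This is a minimal sketch, not the script actually used for kaz.rkt; the tokenization (lowercasing, splitting on non-word characters) is an assumption and would split hyphenated words.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (surface_form, count) pairs, most frequent first."""
    # \w+ is Unicode-aware for str patterns, so Cyrillic tokens are matched.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()
```

The most frequent forms at the top of the list are the ones whose analyses would be checked first.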

Still, all stems currently in apertium-kaz.kaz.lexc will have to be checked. Once I'm done with surface forms from the Little Prince (and maybe the public domain subset of kitap.kz), I'll just take the difference between the wordlist in https://github.com/taruen/apertiumpp/blob/master/data4apertium/vocabulary/kaz.rkt and the stemlist in apertium-kaz.kaz.lexc as what remains to be checked.

This is a reminder for myself to do that.
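
The difference mentioned above could be computed roughly as follows. This sketch assumes the checked words are available as a plain collection of strings (the actual kaz.rkt is a Racket file whose layout is not shown here) and that "what remains" means lexc stems absent from the checked list. The stem extraction is deliberately naive and ignores lexc escapes such as `%`.

```python
def lexc_stems(lines):
    """Collect the lemma side of simple `lemma:surface CLASS ;` entries."""
    stems = set()
    for line in lines:
        entry = line.split("!", 1)[0].strip()  # drop trailing lexc comments
        if ":" not in entry or not entry.endswith(";"):
            continue  # skip LEXICON headers, comments, and other lines
        lemma = entry.split(":", 1)[0].strip()
        if lemma:
            stems.add(lemma)
    return stems


def remaining_to_check(lexc_lines, checked_words):
    """Stems present in the lexc file but not yet in the checked wordlist."""
    return sorted(lexc_stems(lexc_lines) - set(checked_words))
```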

IlnarSelimcan added a commit to apertium/apertium-tat that referenced this issue Mar 13, 2019
IlnarSelimcan added a commit to apertium/apertium-kaz-tat that referenced this issue Mar 13, 2019
IlnarSelimcan added a commit to apertium/apertium-kaz-tat that referenced this issue Mar 13, 2019
IlnarSelimcan added a commit to apertium/apertium-tat that referenced this issue Mar 13, 2019
@jonorthwash
Member

Note, a GCI student wrote a lexc parser and lexicon deduplicator a couple years ago. Let me know if you want help digging it up.
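
To give an idea of what such a deduplication pass involves, here is a minimal sketch. It is not the GCI student's tool, whose interface is not described here; the function name is hypothetical, and entries are compared only after whitespace normalization, with comments ignored.

```python
from collections import defaultdict


def duplicate_entries(lines):
    """Map each repeated entry to the 1-based line numbers where it occurs."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        entry = line.split("!", 1)[0]  # ignore trailing lexc comments
        if ";" not in entry or ":" not in entry:
            continue
        # Normalize whitespace so spacing differences do not hide duplicates.
        key = " ".join(entry.split(";", 1)[0].split())
        seen[key].append(number)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}
```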

@jonorthwash
Member

Relevant tools: apertium/apertium-on-github#51

@IlnarSelimcan
Member Author

IlnarSelimcan commented Apr 10, 2019

It turns out that the explanatory dictionary of Kazakh has been put online on kitap.kz. So the task is, at a minimum, to check the POS of apertium-kaz.kaz.lexc stems against that dictionary.

However, that dictionary might be under some CC license, as other things on kitap.kz seem to be. If it is, then example sentences and explanations could be used in the apertium project too. I'll need to figure out which particular license that dictionary is published under. Also see: https://yvision.kz/post/416129
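
The POS check described in this comment could be sketched as below. Everything about the dictionary side is an assumption: its data is taken to be already available as a mapping from lemma to a set of POS labels, and the `CLASS_TO_POS` table mapping lexc continuation classes to those labels is hypothetical and would need to be filled in to match the real lexicon.

```python
# Hypothetical mapping from lexc continuation classes to dictionary POS labels.
CLASS_TO_POS = {"N1": "noun", "A1": "adjective", "V-TV": "verb", "V-IV": "verb"}


def pos_mismatches(lexc_entries, dictionary_pos):
    """Return (lemma, lexc_pos, dictionary_pos_list) where the two disagree.

    lexc_entries: iterable of (lemma, continuation_class) pairs.
    dictionary_pos: dict mapping lemma -> set of POS labels.
    """
    mismatches = []
    for lemma, cont in lexc_entries:
        expected = dictionary_pos.get(lemma)
        actual = CLASS_TO_POS.get(cont)
        if expected is None or actual is None:
            continue  # lemma not in the dictionary, or class not mapped
        if actual not in expected:
            mismatches.append((lemma, actual, sorted(expected)))
    return mismatches
```

Lemmas missing from the dictionary are skipped rather than reported, so they would need a separate pass.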

IlnarSelimcan added a commit that referenced this issue Jun 18, 2019
…ed', 'abbreviations', 'punctuation', 'proper' 2. sort entries alphabetically #11
IlnarSelimcan added a commit that referenced this issue Jul 6, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 17, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 17, 2019
IlnarSelimcan added a commit to IlnarSelimcan/dot that referenced this issue Jul 17, 2019
@IlnarSelimcan IlnarSelimcan reopened this Jul 17, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 19, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 24, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 24, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Jul 24, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Aug 30, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 6, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 6, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 7, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 7, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 8, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 8, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 8, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Sep 12, 2019
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Apr 7, 2020
IlnarSelimcan added a commit to taruen/apertiumpp that referenced this issue Apr 7, 2020