
transducer no longer meets Apertium Turkic standards #15

Open
jonorthwash opened this issue Sep 2, 2019 · 13 comments

Comments

@jonorthwash
Member

jonorthwash commented Sep 2, 2019

The issue with the reorganisation of the lexicon in de4c77a is that different parts of speech are all lumped together.

Every single other Turkic transducer uses the lexicon names Nouns, Adjectives, Verbs, ProperNouns, etc. This is standardised for several reasons, one of which is that it gives us an easy way to count the number of stems of a particular type. Note, for example, that the countstems script was broken by your changes.
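
For context, a minimal sketch of that kind of per-lexicon stem counting (an illustration only, not the actual countstems script; it assumes one entry per non-empty line under each LEXICON header):

# Sketch: count entry lines per LEXICON section in a .lexc file.
# Not the real countstems script; continuation-only lines under Root
# are counted as entries too, since this is just an illustration.
import re
import sys
from collections import Counter

def count_stems(path):
    counts = Counter()
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("!", 1)[0].strip()  # drop lexc comments
            if not line:
                continue
            m = re.match(r"LEXICON\s+(\S+)", line)
            if m:
                current = m.group(1)
            elif current is not None:
                counts[current] += 1  # one non-empty entry line = one stem
    return counts

if __name__ == "__main__":
    for lexicon, n in sorted(count_stems(sys.argv[1]).items()):
        print(f"{lexicon}\t{n}")

Run against a .lexc file, it would print one count per lexicon name, which only works as intended if stems are kept in POS-named lexicons.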

@IlnarSelimcan, could you justify why you did this reorganisation? Also, in principle this sort of major restructuring should be done in consultation with and by consensus among everyone it affects—that is, everyone who's committed to this repo, or at least the apertium-turkic mailing list.

@IlnarSelimcan
Member

IlnarSelimcan commented Sep 2, 2019

First of all, apologies for having broken the old workflows.

Apertium-uzb and apertium-kaa are also affected by this.

If we decide to restore the old organisation, apertium-kaz can simply be reverted to d9ee49d. All subsequent changes were also made in https://raw.githubusercontent.com/taruen/apertiumpp/master/apertiumpp-kaz/lexicon.rkt (the stems from which I plan to merge back into apertium-kaz in some sensible way, once I finish proofreading them against the explanatory dictionary).

Apertium-uzb and apertium-kaa had that organisation before GSoC, but committers didn't seem to be careful enough to avoid putting adjectives into LEXICON Nouns, nouns into LEXICON Adjectives, etc.

In short, the reasons why I reduced the lexicons to Common, Proper, Punctuation and Abbreviations were that:

  1. people didn't seem to respect the separation into POS-based lexicons anyway
  2. duplications, duplications, duplications: the same word added as both N1 and N5 (OK, same lexicon), as both CS and CC, or to both Adjectives and Nouns. A plain wordlist, kept in alphabetical order, makes such duplications jump out at you immediately (see the sketch after this list).
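
A minimal sketch of the kind of duplication check a flat, alphabetised wordlist makes easy (the entry layout it assumes is "stem:form ContinuationLexicon ;"; it is an illustration, not an existing script):

# Sketch: flag stems that appear more than once in a .lexc file, listing the
# continuation lexicons they were given. Entry layout is assumed.
import re
import sys
from collections import defaultdict

def find_duplicates(path):
    seen = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("!", 1)[0].strip()  # drop lexc comments
            m = re.match(r"(\S+?)(?::\S+)?\s+(\S+)\s*;", line)
            if m:
                seen[m.group(1)].append(m.group(2))
    return {stem: conts for stem, conts in seen.items() if len(conts) > 1}

if __name__ == "__main__":
    for stem, conts in sorted(find_duplicates(sys.argv[1]).items()):
        print(f"{stem}\t{', '.join(conts)}")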

IIRC, even the creators of .lexc admit (in the FSM book) that some more computationally processable format should be used for storing the lexicons (from which the .lexc files are then derived). Either lexc2dix should be polished up so that we can easily query lexicons (to count stems, etc.), or we should just write lexicons in some other format. I see that as a real problem, but that's only my opinion.
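
As an illustration of the "keep the lexicon in a machine-friendly format and derive the .lexc from it" idea (the TSV layout and the single flat LEXICON Common here are assumptions for the sketch, not what apertiumpp or lexc2dix actually do):

# Sketch: generate a flat .lexc lexicon from a TSV of stem<TAB>continuation
# pairs. Layout and lexicon name are assumptions, not an existing tool.
import csv

def tsv_to_lexc(tsv_path, lexc_path, lexicon_name="Common"):
    with open(tsv_path, encoding="utf-8", newline="") as f:
        entries = sorted(csv.reader(f, delimiter="\t"))
    with open(lexc_path, "w", encoding="utf-8") as out:
        out.write(f"LEXICON {lexicon_name}\n\n")
        for stem, cont in entries:
            out.write(f'{stem}:{stem} {cont} ; ! ""\n')

Keeping the source data in a format like this also makes counting and querying a one-liner, which is the point being made above.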

@IlnarSelimcan
Member

Last time I looked at it, lexc2dix was making some errors which I don't recall anymore.

@mansayk
Member

mansayk commented Sep 2, 2019 via email

@jonorthwash
Member Author

@mansayk, thank you for sharing your view on this—it's very helpful.

I'd just like to clarify one point. You say:

Jumping all the time through the file is not an option here and the search also doesn't help that much, unfortunately. It slows you down significantly.

I'm not sure I understand what the problem is. Could you provide more information on what you're having trouble with?

@IlnarSelimcan
Member

IlnarSelimcan commented Sep 2, 2019

When checking a lexc file for miscategorized stems (maybe with an alphabetically sorted reference dictionary at hand, maybe not, but especially if you have one), you must see all occurrences of the particular stem in the lexc file (to see which continuation lexicons it has). That implies manual search.

That would mean typing Control-S in Emacs, then typing in the word you're looking for and jumping through the file. Or selecting the word and then searching for it: https://stackoverflow.com/questions/202803/searching-for-marked-selected-text-in-emacs

(I doubt that it's any faster in Vi(m) :P )

IMO that's significantly slower than going through an alphabetically sorted list and just deleting the lines where the stem has the wrong continuation lexicon.

@mansayk , did you mean that?
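
For reference, a minimal sketch of the lookup being described: print every entry for a given stem in a .lexc file together with its continuation lexicon (the entry layout is assumed; this is an illustration, not an existing tool):

# Sketch: list all entries for a stem and the continuation lexicon of each.
import re
import sys

def occurrences(path, stem):
    pat = re.compile(rf"{re.escape(stem)}(?::\S+)?\s+(\S+)\s*;")
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            m = pat.match(line.strip())
            if m:
                print(f"{path}:{n}: {stem} -> {m.group(1)}")

if __name__ == "__main__":
    occurrences(sys.argv[1], sys.argv[2])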

@ftyers
Member

ftyers commented Sep 2, 2019

22:41 <spectie> actually, ilnar and mansur have a point here
22:41 <spectie> for the kind of work they are doing their system is better 
22:41 <spectie> how about a compromise like
22:41 <spectie> Open (N, V, Adv, A)
22:42 <spectie> then Closed
22:42 <spectie> and within Closed 
22:42 <spectie> Pronouns ; Determiners ; ...
22:42 <spectie> and then have separate lexicons for each of the closed categories 
22:42 <spectie> i think i would be happy with that 
22:42 <spectie> also, "weird irregular stuff" usually happens in closed categories 

@IlnarSelimcan
Member

IlnarSelimcan commented Sep 2, 2019

Another thing is that categories are not independent of each other, so to speak. Sure, some stems can belong to several categories at once, but there are also cases where belonging to one category excludes belonging to another.

In my worldview at least, "foo A1" makes "foo ADV" redundant, as "hargle CC" would make "hargle CS" redundant (or incorrect).

Yet another issue is improperly lexicalised wordforms. Seeing "алдында ADV" right after "ал{д} N1" should make any conscientious lexicographer think.

@IlnarSelimcan
Member

IlnarSelimcan commented Sep 2, 2019

I think I like what Fran suggested. Indeed, pronouns especially tend to have lots of hardcoded entries anyway, so it makes sense to keep them and other closed categories separate.

@mansayk
Member

mansayk commented Sep 2, 2019 via email

@jonorthwash
Member Author

Okay, I have a better sense now of what the reasoning is. These are valid reasons, and I've experienced these issues myself. I like Fran's proposal—to keep "open" and "closed" categories separate. I would argue that closed categories should be broken down much the way we had them—or we could include conjunctions and the like with the open categories so they're near adverbs. Pronouns and determiners should definitely go together. Numbers should probably remain separate.

In any case, I'm okay lumping various categories together for the reasons stated, but I also think there are certain ways that we should keep things separate. Does this make sense? Is my general philosophy towards it compatible with everyone else's?

@ftyers
Member

ftyers commented Sep 3, 2019

I was thinking something like:

LEXICON Root

Open ;
Closed ; 
Proper ; 
Punctuation ; 
Numerals ;

LEXICON Open

bar:bar N1 ; ! ""
foo:foo N1 ; ! ""
foo:foo V-TV ;  ! ""

LEXICON Closed 

Pronouns ;
Determiners ;
Conjunctions ;
Postpositions ; 

LEXICON Pronouns 

blah:blah PRON-PERS ; ! ""

LEXICON Proper 

LEXICON Punctuation

LEXICON Numerals 

@mansayk
Member

mansayk commented Sep 3, 2019 via email

@jonorthwash
Member Author

jonorthwash commented Sep 3, 2019

I would suggest placing LEXICON Open at the very end of the file, so it is easier to find where it ends when we sort it.

I'm used to having Punctuation and Numerals (and Guesser) at the end of the file, but it doesn't much matter. I think the reason these are normally at the end is that they're kind of "afterthoughts", and once you have them set up, you're not going to touch them much. The latter is probably true of pronouns and determiners too, though, and those are usually at the beginning.

In any case, finding the end of a lexicon isn't difficult with vim: you just enter visual mode (v) at the top of the lexicon, and search (/) for LEXICON, and then go up one line to exclude that. Then sort (:sort). (I'm not at a computer now, so I might be misremembering a detail or two, but it's still doable.)
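
For comparison, here is a rough scripted equivalent of that sort step: a sketch (not an existing tool) that sorts the entry lines inside every LEXICON block at once, with the caveat that blank lines and comments get sorted along with the entries.

# Sketch: sort the lines inside each LEXICON block of a .lexc file, keeping
# the LEXICON headers where they are. Blank/comment lines are sorted too, so
# this is only a rough equivalent of the vim workflow described above.
import sys

def sort_lexicons(path):
    out, block = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("LEXICON"):
                out.extend(sorted(block))
                block = []
                out.append(line)
            else:
                block.append(line)
    out.extend(sorted(block))
    sys.stdout.write("".join(out))

if __name__ == "__main__":
    sort_lexicons(sys.argv[1])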

But I certainly don't mind having Open and Proper at the end of the file. It certainly makes sense if they're the main lexicons that are going to change after some level of development. The main issue will be that you can't have them both as the last lexicon of the file...

I propose a couple of adjustments to @ftyers's proposal:

  • Keep determiners and pronouns together, or at least close, since there's often a certain amount of overlap; and
  • Put postpositions in with Open, so they're there with ADV, Nouns, and Verbs—other categories they typically "derive from" or "overlap with" (and in this way they're much less of a closed class than others).
