Skip to content
This repository was archived by the owner on Feb 7, 2020. It is now read-only.

Conversation

balmas
Copy link
Member

@balmas balmas commented Jul 30, 2018

This is the first in a set of changes to enable disambiguation of morphological parser results with explicit annotations.

This is a porting of functionality from V1, which allowed us to disambiguate with data from treebanks. The treebanks contain morpho-syntactic annotations of specific words in specific texts. These annotations normally identify a specific lemma and inflection for a word. But they do not provide all of the information the parser does, such as the stem and suffix of the word, the conjugation, etc.

So, when we have this data, we normally want to provide a merged view of the lexeme, which retains the information from the parser which wasn't provided in the annotation, but which also narrows the inflections down to the one identified in the annotation.

There are corresponding changes in the components, morph-client and embed-lib repositories.

@balmas balmas requested review from irina060981 and kirlat July 30, 2018 18:38
@coveralls
Copy link

coveralls commented Jul 30, 2018

Coverage Status

Coverage increased (+0.3%) to 83.136% when pulling 28af398 on feature-treebank into 153e1e1 on master.

Copy link
Member

@kirlat kirlat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, but have some questions about business logic.

src/lexeme.js Outdated
let keepInflections = []
// iterate through this lexemes inflections and keep only thoes that are disambiguatedBy by the supplied lexeme's inflection
for (let inflection of this.inflections) {
if (inflection.disambiguatedBy(lexeme.inflections[0])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it check only the first inflection of lexeme?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in my use case, the homonym that is used as a disambiguator specifies that it must be X inflection, so it will have only one. But, I see your point, particularly given your comments on the Components PR about the disambiguate method in Homonym, and think the code will be more flexible if we assume we might have a narrowing down from multiple possibilities, to a few, to a single one. I will go ahead and change this code accordingly. Thank you!

src/lemma.js Outdated
* @param {Lemma} lemma the lemma to compare
* @return {Boolean} true or false
*/
isEqual (lemma) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, it does check that word and part of speech are the same only. It does not check for full feature equality. So I'm thinking that isEqual name might be confusing in this case. Maybe it's better to rename it to something that more fully reflects a partiality of lemma comparison (and we might want to use isEqual later for full feature comparison)? What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kirlat Yes, that's a good point. Is there a good naming convention here that you can suggest? I could use 'matches' or 'mayEqual' but if there is a more standard way to represent this I'd like to use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good if name would reflect an essence of a comparison. So if we're comparing a word and POFS what do really want to know about being equal between the two (from business logic standpoint)? Is it a word? Maybe we can call it matchesWord or matchesWordPOFS, or maybe isSameWord or maybe isSameWordPOFS.

It's really not easy. I think naming is probably the hardest part of software development :-).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little searching and apparently linguistically, this scenario is known as a "full homonym" - two words spelled the same with the same part of speech. Since that's what I'm testing here I will name the method accordingly - i.e. isFullHomonym . Sound good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think isFullHomonym sounds perfect!

*/
disambiguatedBy (infl) {
let matched = true
for (let feature of infl.constraints.obligatoryMatches) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since obligatoryMatches are set separately, I'm thinking if we can handle situations when obligatoryMatches are not set by mistake gracefully, because that seems to be a thing that is easy to miss. How do we set obligatoryMatches usually? Do we rely on a language model for that? Can we set constraints on a first disambiguatedBy call if they are not set yet? Maybe we could have a flag that will indicate if this field has been initialized yet (a flag might be an overkill, but do we need to distinguish between situations when there is no obligatory matches or when obligatory matches have not been initialized)?
Also, what should a function return if no obligatoryMatches are set (i.e. there is no information about obligatory matches within an inflection at the moment of comparison). Right now in that case a function will return true. Is this correct from business logic point of view? What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't actually set obligatoryMatches on an Inflection in any other cases right now. It's something that was added during some of the inflection table refactoring but didn't get used. I was piggy backing on it for convenience... you are correct in calling this out. I need to think about the possible scenarios a bit.

Copy link
Member

@irina060981 irina060981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have additional questions besides Kirill has asked.

@balmas
Copy link
Member Author

balmas commented Aug 2, 2018

@kirlat I made a significant number of changes in response to your feedback and would appreciate another review! Important changes to note:

  1. I eliminated the use of the obligatoryMatches property on inflections for disambiguation. In thinking about the business logic behind this I realized that I was using that because we didn't have an easy way to iterate the defined features of an inflection. Really what I wanted to do was iterate through the features on the inflection being used to disambiguate - anything that is defined there has to match, but it can be less specific than the thing it's used to disambiguate. It can't be more specific because then it's a different inflection. So I added a features property on the inflection which keeps the names of all the defined features for easy iteration.

  2. I have changed both Lexeme.disambiguate and Homonym.disambiguate to static functions which return new objects, rather than instance methods which update an existing object. I can imagine scenarios where we might want to retain the original object, so it seems safer to create a new object. I also added the ability for the disambiguation to use a sequence as you suggested. I'm not sure we have a use for it, but the effort of making the code work for it helped me to identify some serious business logic flaws in the previous version. We need to also account for the possibility, for example, that an annotation which is used for disambiguation doesn't match anything in the morphological parser results (because sometimes the parser is just wrong or missing information) and in that case the process of disambiguation actually adds to the object rather than reducing it.

Let me know what you think when you have a chance. Thanks!

Copy link
Member

@kirlat kirlat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Bridget, your changes look really good to me!

@balmas balmas merged commit e14d7ad into master Aug 3, 2018
@balmas balmas deleted the feature-treebank branch August 3, 2018 12:22
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants