Add gender to singular word forms in German #58

Closed
highsource opened this issue Aug 3, 2018 · 9 comments

Comments

@highsource
Contributor

Please see the discussion in #57.

  • Scope is only German language.
  • Add GrammaticalGender getGender() to the IWiktionaryWordForm.
  • When parsing the noun table, consider labels:
    • Genus
    • Genus 1
    • Genus 2
    • Genus 3
    • Genus 4
    • If a Genus label has a value other than m, n or f, log a warning.
  • When parsing the noun table, consider labels:
    • Singular
    • Singular 1, Singular 1*, Singular 1**
    • Singular 2, Singular 2*, Singular 2**
    • Singular 3, Singular 3*, Singular 3**
    • Singular 4, Singular 4*, Singular 4**
  • Assign to each singular word form the gender with the corresponding index. If there is no gender with the corresponding index, log a warning and assign null as gender to the word form.
  • For word forms with a "plural" label, assign null as gender to the word form.
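
The rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the referenced commits: `NounTableGenderSketch`, `handleGenus`, `genderFor` and the nested `Gender` enum are hypothetical names (the enum stands in for `GrammaticalGender`), labels without an index are treated as index `1`, and a case prefix before `Singular` is assumed to be allowed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class and method names are illustrative, not JWKTL API.
class NounTableGenderSketch {

    /** Stand-in for JWKTL's GrammaticalGender. */
    enum Gender { MASCULINE, FEMININE, NEUTER }

    // Matches "Genus" and "Genus 1" .. "Genus 4".
    private static final Pattern GENUS = Pattern.compile("^Genus(?: ([1-4]))?$");

    // Matches "Singular", "Singular 1", "Singular 1*", "Singular 1**", ...
    // It is applied with find(), so a case prefix may precede the label (assumption).
    private static final Pattern SINGULAR =
            Pattern.compile("Singular(?: ([1-4]))?\\*{0,2}$");

    private final Map<Integer, Gender> genderByIndex = new HashMap<>();

    /** Labels without an explicit index are treated as index 1. */
    static int indexOf(Matcher matcher) {
        String digits = matcher.group(1);
        return digits == null ? 1 : Integer.parseInt(digits);
    }

    /** Maps m, f, n to a gender; any other value is logged and yields null. */
    static Gender parseGender(String value) {
        switch (value.trim()) {
            case "m": return Gender.MASCULINE;
            case "f": return Gender.FEMININE;
            case "n": return Gender.NEUTER;
            default:
                System.err.println("WARN: unrecognized gender value: " + value);
                return null;
        }
    }

    /** Stores the gender of a Genus parameter; returns false for other labels. */
    boolean handleGenus(String label, String value) {
        Matcher matcher = GENUS.matcher(label.trim());
        if (!matcher.matches()) {
            return false;
        }
        Gender gender = parseGender(value);
        if (gender != null) {
            genderByIndex.put(indexOf(matcher), gender);
        }
        return true;
    }

    /** Returns the gender for a word form label, or null (plural, or no matching Genus). */
    Gender genderFor(String wordFormLabel) {
        Matcher matcher = SINGULAR.matcher(wordFormLabel.trim());
        if (!matcher.find()) {
            return null; // plural and other labels carry no gender
        }
        int index = indexOf(matcher);
        Gender gender = genderByIndex.get(index);
        if (gender == null) {
            System.err.println("WARN: no Genus with index " + index + " for " + wordFormLabel);
        }
        return gender;
    }

    public static void main(String[] args) {
        NounTableGenderSketch table = new NounTableGenderSketch();
        table.handleGenus("Genus 1", "m");
        table.handleGenus("Genus 2", "f");
        System.out.println(table.genderFor("Singular 2*"));
        System.out.println(table.genderFor("Plural"));
    }
}
```

The sketch assumes that all `Genus` parameters have been seen before the word form labels are resolved; if the template lists them in a different order, the lookup would have to be deferred until the whole table has been read.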
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 3, 2018
- Extended `ParsingContext` to temporarily hold the gender index
(`Genus=`, `Genus 1=`, `Genus 2=`, `Genus 3=`, `Genus 4=`).
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 3, 2018
- Added license to the `ParsingContextText`.
@highsource
Contributor Author

@chmeyer By the way, there was a bug in gender parsing: the forms mn and mnf were previously not considered.
I've fixed this as part of this issue, or should I create a new one dedicated only to this?

highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 3, 2018
- Moved gender parsing to a dedicated class.
- Fixed a bug where gender strings `mn` and `mnf` were not supported.
- Added logging for unrecognized gender strings.
@chmeyer
Member

chmeyer commented Aug 6, 2018

I've fixed this as part of this issue, or should I create a new one dedicated only to this?

Fine in this issue and PR, I guess.

@highsource
Contributor Author

@chmeyer Another question. While working on #58 I found that some words have no gender specified, for several reasons, such as having only a plural form. Wiktionary authors use the genus values x, 0 and pl to denote this.

I think it would be good to introduce something like a NO_GENDER enum value in GrammaticalGender. This would make it possible to distinguish "not specified" from "specified as no gender".
It would help me check whether the parsing rules cover all cases.

What do you think?

This would probably not be backwards compatible.

@chmeyer
Member

chmeyer commented Aug 6, 2018

Theoretically, this is not a property related to gender, so I'm not in favor of the NO_GENDER solution. In fact, words having no singular or no plural forms do have a gender as well. It is often not straightforward to identify the gender for plural-only nouns (since they use the same plural articles), but it is definitely easy for singular-only, e.g., Abscheu (MASC), Brisanz (FEM), ABC (NEUT).

Currently, this entry-related property is encoded in PartOfSpeech.SINGULARE_TANTUM and PartOfSpeech.PLURALE_TANTUM as two special word class labels. This is also not the best way of modeling it, but as it is already in the API, let's keep it this way.

Thus, I suggest changing the part of speech of nouns to SINGULARE_TANTUM if there are only singular forms and to PLURALE_TANTUM if there are only plural forms based on the word-form-parsing component. Mind that the part of speech property is also set at other code locations, so we need to make sure in the tests that it won't get overridden.

(Or if this yields chaos, we can think about separating out this morphological property into a separate attribute. In the long run, this would be the cleanest option.)
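
A minimal sketch of the suggestion, assuming the word-form parser can report which grammatical numbers actually occur among the parsed forms. `TantumSketch`, `refine` and the nested enums are hypothetical stand-ins; only the constant names `SINGULARE_TANTUM` and `PLURALE_TANTUM` come from the existing API as described above. The guard on the current value is one possible way to avoid overriding a part of speech set at another code location.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch: not the JWKTL implementation.
class TantumSketch {

    /** Stand-in for the grammatical number of a parsed word form. */
    enum GrammaticalNumber { SINGULAR, PLURAL }

    /** Reduced stand-in for JWKTL's PartOfSpeech. */
    enum PartOfSpeech { NOUN, SINGULARE_TANTUM, PLURALE_TANTUM }

    /**
     * Refines NOUN to SINGULARE_TANTUM or PLURALE_TANTUM based on the numbers
     * of the word forms that were found. Any other part of speech is returned
     * unchanged, so a value set elsewhere is not overridden.
     */
    static PartOfSpeech refine(PartOfSpeech current, Collection<GrammaticalNumber> formNumbers) {
        if (current != PartOfSpeech.NOUN) {
            return current;
        }
        boolean hasSingular = formNumbers.contains(GrammaticalNumber.SINGULAR);
        boolean hasPlural = formNumbers.contains(GrammaticalNumber.PLURAL);
        if (hasSingular && !hasPlural) {
            return PartOfSpeech.SINGULARE_TANTUM;
        }
        if (hasPlural && !hasSingular) {
            return PartOfSpeech.PLURALE_TANTUM;
        }
        return current; // both numbers present, or no forms at all
    }

    public static void main(String[] args) {
        System.out.println(refine(PartOfSpeech.NOUN,
                Arrays.asList(GrammaticalNumber.SINGULAR, GrammaticalNumber.SINGULAR)));
    }
}
```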

@highsource
Contributor Author

Ok, I see.

The reason why I would like to do this is to ensure the completeness of parsing. At the moment "unknown" values are simply mapped to null. In some cases these were valid values which were not handled. In other cases these were invalid values in Wiktionary.

I am interested in fixing both cases, either by fixing the JWKTL code or by correcting articles in Wiktionary.

But to do this, I first need to have these problems reported. For this I'd need to distinguish null for a missing value from null for a non-handled value. This is why I thought having NO_GENDER would be practical.

I'll think about a different solution. Maybe introduce a lower-level GrammaticalGenderTag enum which contains all the possible values. It would only be used during parsing and then mapped to GrammaticalGender for the resulting model.
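
One way such a parse-time enum might look is sketched below. Everything here is hypothetical: `GenderTagSketch`, `GenderTag`, `parse` and `toGenders` are illustrative names, the nested `Gender` enum stands in for `GrammaticalGender`, and mapping `mn` and `mnf` to several genders is an assumption rather than something decided in this thread. The point is that the parser can tell three cases apart (a tag with genders, a tag that explicitly means "no gender", and an unrecognized value) while the public model stays unchanged.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a parse-time tag enum; not the JWKTL implementation.
class GenderTagSketch {

    /** Stand-in for JWKTL's GrammaticalGender. */
    enum Gender { MASCULINE, FEMININE, NEUTER }

    /** Raw genus values as they appear in the noun table. */
    enum GenderTag {
        M("m", Gender.MASCULINE),
        F("f", Gender.FEMININE),
        N("n", Gender.NEUTER),
        MN("mn", Gender.MASCULINE, Gender.NEUTER),
        MNF("mnf", Gender.MASCULINE, Gender.NEUTER, Gender.FEMININE),
        // Values used by Wiktionary authors to state that there is no gender.
        X("x"),
        ZERO("0"),
        PL("pl");

        private final String raw;
        private final List<Gender> genders;

        GenderTag(String raw, Gender... genders) {
            this.raw = raw;
            this.genders = Collections.unmodifiableList(Arrays.asList(genders));
        }

        /** Genders for the resulting model; empty for x, 0 and pl. */
        List<Gender> toGenders() {
            return genders;
        }

        /** Returns the tag for a raw value, or empty if the value is not handled. */
        static Optional<GenderTag> parse(String value) {
            if (value == null) {
                return Optional.empty();
            }
            String trimmed = value.trim();
            for (GenderTag tag : values()) {
                if (tag.raw.equals(trimmed)) {
                    return Optional.of(tag);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"mn", "pl", "q"}) {
            Optional<GenderTag> tag = GenderTag.parse(raw);
            if (tag.isPresent()) {
                System.out.println(raw + " -> " + tag.get().toGenders());
            } else {
                System.err.println("WARN: unhandled genus value: " + raw);
            }
        }
    }
}
```

With this shape, an empty `Optional` is the signal to report a value as unhandled, whereas `x`, `0` and `pl` are recognized and simply contribute no gender to the model.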

@chmeyer
Member

chmeyer commented Aug 6, 2018

Got it. How about GrammaticalGender.UNSPECIFIED then? I just lean against GrammaticalGender.NO_GENDER for word forms actually having a gender property...

@highsource
Contributor Author

GrammaticalGender.UNSPECIFIED if gender is not specified and null if it is specified as x, 0, pl and so on? I don't know. I don't think it would be elegant, and it would definitely not be backwards-compatible.

I think having a special parse-time enum would be better.

Here's a suggestion. I'll file an issue concerning mn and mnf and implement it using an "intermediate" enum. I'll also implement x, 0, pl there. Then we'll have code to discuss and can decide whether this is the way to go.

What do you think?

highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 6, 2018
  - Added a gender property to the Wiktionary word form
@chmeyer
Member

chmeyer commented Aug 7, 2018

OK, using the enum at parsing time is fine.

@chmeyer chmeyer added this to the JWKTL 1.2.0 milestone Aug 7, 2018
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Corrected the URLs in the integration test
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Moved noun table extraction to a separate class.
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - DEWiktionaryEntryParserTest now sets file name as page title
  - Moved `(Einzahl)` and `(Mehrzahl)` to their own patterns.
  - Parsing `Genus` using a pattern as well.
  - Added tests for `Singular?`, `Singular i*`/`Singular i**`,
`Singular* i`.
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Added a test for Fote to check the label `Singular`
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Moved index extracted from the matcher to a utility class
  - Added a unit test for the pattern/matcher utility class
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Added a unit test for `Singular 1` referencing `Genus`
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Added a unit test for the pattern/matcher utility class
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Reworked Genus processing
  - Moved handling Genus, Singular, Einzahl, Plural, Mehrzahl into
separate methods
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Added Gams test where `Singular` referred to `Genus 1`
  - Refactored noun table extractor and moved `Genus`, `Singular`,
`Einzahl`, `Plural`, `Mehrzahl` to separate handler classes.
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Moved word form case handling into separate classes
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 10, 2018
  - Added forgotten license header
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Introduced `ITemplateParameterHandler`
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added forgotten license header
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added unit tests for genus and number handlers
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Fixing the test which was failing due to " " at the end of the file
name
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added tests for case handlers
highsource pushed a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Using index `1` for labels without index
  - Added tests to check setting and getting genus in noun table handler
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Introduced `ITemplateParameterHandler`
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added forgotten license header
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added unit tests for genus and number handlers
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Fixing the test which was failing due to " " at the end of the file
name
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Added tests for case handlers
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
  - Using index `1` for labels without index
  - Added tests to check setting and getting genus in noun table handler
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 11, 2018
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 24, 2018
  - Applied suggestion from code review to case handler tests
highsource added a commit to highsource/dkpro-jwktl that referenced this issue Aug 24, 2018
  - Renamed `ITemplateParameterHandler` to reflect its specifics to
`WiktionaryWordForm`