Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #58 Gender for singular german word forms #65

Merged

Conversation

highsource
Copy link
Contributor

No description provided.

  - Added a gender property to the Wiktionary word form
@highsource
Copy link
Contributor Author

This is a WIP PR, please do not merge yet.

@codecov-io
Copy link

codecov-io commented Aug 6, 2018

Codecov Report

Merging #65 into master will increase coverage by 0.21%.
The diff coverage is 88.18%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master      #65      +/-   ##
============================================
+ Coverage      76.3%   76.52%   +0.21%     
- Complexity     1323     1352      +29     
============================================
  Files           102      117      +15     
  Lines          4643     4749     +106     
  Branches        795      798       +3     
============================================
+ Hits           3543     3634      +91     
- Misses          890      902      +12     
- Partials        210      213       +3
Impacted Files Coverage Δ Complexity Δ
...arser/de/components/nountable/MehrzahlHandler.java 100% <100%> (ø) 2 <2> (?)
...arser/de/components/nountable/GenitiveHandler.java 100% <100%> (ø) 1 <1> (?)
...parser/de/components/nountable/EinzahlHandler.java 100% <100%> (ø) 3 <3> (?)
...mstadt/ukp/jwktl/api/entry/WiktionaryWordForm.java 94.11% <100%> (+0.56%) 21 <2> (+2) ⬆️
.../jwktl/parser/de/components/DEWordFormHandler.java 75.18% <100%> (-2.89%) 54 <0> (-18)
...ser/de/components/nountable/NominativeHandler.java 100% <100%> (ø) 1 <1> (?)
...tl/parser/de/components/nountable/CaseHandler.java 100% <100%> (ø) 2 <2> (?)
...ser/de/components/nountable/AccusativeHandler.java 100% <100%> (ø) 1 <1> (?)
.../parser/de/components/nountable/DativeHandler.java 100% <100%> (ø) 1 <1> (?)
.../parser/de/components/nountable/PluralHandler.java 100% <100%> (ø) 2 <2> (?)
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b63cf81...7d0e549. Read the comment docs.

  - Corrected the URLs in the integration test
  - Moved noun table extraction to a separate class.
  - DEWiktionaryEntryParserTest now sets file name as page title
  - Moved `(Einzahl)` and `(Mehrzahl)` to their own patterns.
  - Parsing `Genus` using a pattern as well.
  - Added tests for `Singular?`, `Singular i*`/`Singular i**`,
`Singular* i`.
  - Added a test for Fote to check the label `Singular`
  - Moved index extracted from the matcher to an utility class
  - Added a unit test for the pattern/matcher utility class
  - Added a unit test for `Singular 1` referencing `Genus`
  - Added a unit test for the pattern/matcher utility class
  - Reworked Genus processing
  - Moved handling Genus, Singular, Einzahl, Plural, Mehrzahl into
separate methods
  - Added Gams test where `Singular` refered to `Genus 1`
  - Refactored noun table extractor and moved `Genus`, `Singular`,
`Einzahl`, `Plural`, `Mehrzahl` to separate handler classes.
  - Moved word form case handling into separate classes
  - Added forgotten license header
@highsource
Copy link
Contributor Author

Please don't merge yet, I still have to polish a few things.

  - Introduced `ITemplateParameterHandler`
  - Added forgotten license header
  - Added unit tests for genus and number handlers
  - Fixing the test which was failing due to " " at the end of the file
name
  - Added tests for case handlers
  - Using index `1` for labels without index
  - Added tests to check setting and getting genus in noun table handler
@highsource highsource force-pushed the feature/58-gender-for-singular-German-word-forms branch from f4dfa16 to bc97e73 Compare August 11, 2018 21:18
@highsource
Copy link
Contributor Author

I think I'm done here for the moment.
@chmeyer Could you please review the code?

assertFalse(accusativeHandler.canHandle(" Wen? (Einzahl)", null, null, null));
}

public void testGenitivSingular() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be testAkkusativSingular

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

assertEquals(GrammaticalCase.ACCUSATIVE, wordForm.getCase());
}

public void testWerOderWasEinzahl() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be testWenEinzahl

same applies to other handlers...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (other handler tests as well).

pluralHandler = new PluralHandler(nounTableHandler);
}

protected static final String PLURAL_PATTERN =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be necessary here, remove...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

import de.tudarmstadt.ukp.jwktl.api.util.GrammaticalCase;

public class DativeHandler extends CaseHandler {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally not a big fan of hundreds of tiny class files, so I would probably just instanciate CaseHandler with the pattern and the enum value within DEWordFormNounTableHandler. But of course this is a matter of taste.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree on this one. I think several classes are quite appropriate here. Right, we could pass pattern to the CaseHandler, but then we'd have to keep these patterns somewhere. They'd probably land as string constants in some constants class which will be huge and messy.


public class PluralHandler extends PatternBasedIndexedParameterHandler {

protected static final String PLURAL_PATTERN =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be \\s*Plural\s*([1-4])?\\?\\*{0,2}$ (with case insensitive flag), i.e., optional whitespace, the term "plural", optional whitespace, optional index number, optional ?, up to two asterisks. This regex would be a bit more relaxed and thus keep working if the Wiktionary template evolves, e.g., allowing Plural 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I a little bit dislike "relaxed" or "lenient" approach.
We're parsing a specific Wiktionary template here, according to the rules defined in the template. If the template evolves we should update the code. Defining a relaxed regex and hoping that it will handle future cases is, from my point of view, not the right approach. I think we should be strict here.

Being strict also helps to detect errors made by authors in Wiktionary articles. I've personally corrected a few dozen of a little bit incorrectly defined declination tables.

I'd like to leave PLURAL_PATTERN as is.

*
* @author Alexey Valikov
*/
public interface ITemplateParameterHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface is particularly designed for WiktionaryWordForms, although it is not indicated by its name. Thus, the interface is not very generic and can hardly be reused. I suggest removing the interface or revising it such that other components can reuse it effectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point but not quite sure how to address this.
I want to keep separate handler classes, so I need some common interface for them.
This is, indeed, specific to WiktionaryWordForms. I'm not quite sure how to allow other components to reuse it effectively. At the moment, there's neither need nor use case for that. The interface is also quite simple and will hardly be refactoring-resistant in the future.

Suggestion: just rename it to IWiktionaryWordFormTemplateParameterHandler to make the name reflect the specifics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with renaming it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

  - Applied suggestion from code review to case handler tests
  - Renamed `ITemplateParameterHandler` to reflect its specifics to
`WiktionaryWordForm`
@chmeyer chmeyer merged commit 0e09454 into dkpro:master Sep 26, 2018
@highsource highsource deleted the feature/58-gender-for-singular-German-word-forms branch December 19, 2018 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants