Fixes #58 Gender for singular german word forms #65
- Added a gender property to the Wiktionary word form
This is a WIP PR, please do not merge yet.
Codecov Report
@@ Coverage Diff @@
## master #65 +/- ##
============================================
+ Coverage 76.3% 76.52% +0.21%
- Complexity 1323 1352 +29
============================================
Files 102 117 +15
Lines 4643 4749 +106
Branches 795 798 +3
============================================
+ Hits 3543 3634 +91
- Misses 890 902 +12
- Partials 210 213 +3
- Corrected the URLs in the integration test
- Moved noun table extraction to a separate class.
- DEWiktionaryEntryParserTest now sets file name as page title
- Moved `(Einzahl)` and `(Mehrzahl)` to their own patterns.
- Parsing `Genus` using a pattern as well.
- Added tests for `Singular?`, `Singular i*`/`Singular i**`, `Singular* i`.
- Added a test for Fote to check the label `Singular`
- Moved index extracted from the matcher to a utility class
- Added a unit test for the pattern/matcher utility class
- Added a unit test for `Singular 1` referencing `Genus`
- Added a unit test for the pattern/matcher utility class
- Reworked Genus processing
- Moved handling Genus, Singular, Einzahl, Plural, Mehrzahl into separate methods
- Added Gams test where `Singular` referred to `Genus 1`
- Refactored noun table extractor and moved `Genus`, `Singular`, `Einzahl`, `Plural`, `Mehrzahl` to separate handler classes.
- Moved word form case handling into separate classes
- Added forgotten license header
Please don't merge yet, I still have to polish a few things.
- Introduced `ITemplateParameterHandler`
- Added forgotten license header
- Added unit tests for genus and number handlers
- Fixing the test which was failing due to " " at the end of the file name
- Added tests for case handlers
- Using index `1` for labels without index
- Added tests to check setting and getting genus in noun table handler
- Added Javadocs
f4dfa16 to bc97e73
I think I'm done here for the moment.
assertFalse(accusativeHandler.canHandle(" Wen? (Einzahl)", null, null, null));
}

public void testGenitivSingular() {
should be testAkkusativSingular
Fixed.
assertEquals(GrammaticalCase.ACCUSATIVE, wordForm.getCase());
}

public void testWerOderWasEinzahl() {
should be testWenEinzahl
same applies to other handlers...
Fixed (other handler tests as well).
pluralHandler = new PluralHandler(nounTableHandler);
}

protected static final String PLURAL_PATTERN =
should not be necessary here, remove...
Fixed.
import de.tudarmstadt.ukp.jwktl.api.util.GrammaticalCase;

public class DativeHandler extends CaseHandler {
I am personally not a big fan of hundreds of tiny class files, so I would probably just instantiate CaseHandler with the pattern and the enum value within DEWordFormNounTableHandler. But of course this is a matter of taste.
Disagree on this one. I think several classes are quite appropriate here. Right, we could pass the pattern to the CaseHandler, but then we'd have to keep these patterns somewhere. They'd probably end up as string constants in some constants class, which would be huge and messy.
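The two options discussed above can be sketched side by side. This is only an illustration: the `...Sketch` types are stand-ins, the single-argument `canHandle` is simplified, and both regular expressions are hypothetical, not the patterns used in the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for de.tudarmstadt.ukp.jwktl.api.util.GrammaticalCase.
enum GrammaticalCaseSketch { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE }

// Simplified base handler: a pattern plus the case it maps to.
class CaseHandlerSketch {
    private final Pattern pattern;
    private final GrammaticalCaseSketch grammaticalCase;

    CaseHandlerSketch(String regex, GrammaticalCaseSketch grammaticalCase) {
        this.pattern = Pattern.compile(regex);
        this.grammaticalCase = grammaticalCase;
    }

    // Strict match on the whole label, no trimming.
    boolean canHandle(String label) {
        return label != null && pattern.matcher(label).matches();
    }

    GrammaticalCaseSketch getCase() {
        return grammaticalCase;
    }
}

// Option A (the PR's approach): one small subclass per case that owns its pattern.
class DativeHandlerSketch extends CaseHandlerSketch {
    // Hypothetical pattern, for illustration only.
    private static final String DATIVE_PATTERN = "Dativ (Singular|Plural)";

    DativeHandlerSketch() {
        super(DATIVE_PATTERN, GrammaticalCaseSketch.DATIVE);
    }
}

// Option B (the reviewer's suggestion): instantiate the base class directly,
// which means the patterns live in the class that builds the handlers.
class NounTableHandlerSketch {
    static final List<CaseHandlerSketch> CASE_HANDLERS = Arrays.asList(
        new CaseHandlerSketch("Dativ (Singular|Plural)", GrammaticalCaseSketch.DATIVE),
        new CaseHandlerSketch("Akkusativ (Singular|Plural)", GrammaticalCaseSketch.ACCUSATIVE));
}
```

Both variants behave the same at run time; the difference is only where the pattern strings live, which is the trade-off the two comments argue about.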
public class PluralHandler extends PatternBasedIndexedParameterHandler {

protected static final String PLURAL_PATTERN =
An alternative would be \\s*Plural\\s*([1-4])?\\??\\*{0,2}$ (with the case-insensitive flag), i.e., optional whitespace, the term "plural", optional whitespace, an optional index number, an optional ?, and up to two asterisks. This regex would be a bit more relaxed and thus keep working if the Wiktionary template evolves, e.g., by allowing Plural 2?
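A minimal sketch of how the relaxed pattern would behave, assuming the `?` is meant to be optional as the description says, and assuming labels without an index default to `1` as one of the commits above states. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelaxedPluralPatternSketch {
    // The suggested regex as a Java string literal, compiled case-insensitively.
    static final Pattern RELAXED_PLURAL =
        Pattern.compile("\\s*Plural\\s*([1-4])?\\??\\*{0,2}$", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the index found in the label, 1 if the label carries no index,
     * or -1 if the label does not match the pattern at all.
     */
    static int pluralIndex(String label) {
        Matcher matcher = RELAXED_PLURAL.matcher(label);
        if (!matcher.matches()) {
            return -1;
        }
        String index = matcher.group(1);
        return index == null ? 1 : Integer.parseInt(index);
    }

    public static void main(String[] args) {
        for (String label : new String[] {"Plural", "Plural 2?", " plural 3**", "Plural 5"}) {
            System.out.println("'" + label + "' -> " + pluralIndex(label));
        }
    }
}
```

With this pattern, `Plural`, `Plural 2?` and ` plural 3**` are all accepted, while `Plural 5` is rejected because the index class only covers 1 to 4.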
I'm not a fan of the "relaxed" or "lenient" approach.
We're parsing a specific Wiktionary template here, according to the rules defined in the template. If the template evolves, we should update the code. Defining a relaxed regex and hoping that it will handle future cases is, from my point of view, not the right approach. I think we should be strict here.
Being strict also helps to detect errors made by authors in Wiktionary articles. I've personally corrected a few dozen slightly incorrect declension tables.
I'd like to leave PLURAL_PATTERN as is.
 *
 * @author Alexey Valikov
 */
public interface ITemplateParameterHandler {
This interface is designed specifically for WiktionaryWordForms, although its name does not indicate this. Thus, the interface is not very generic and can hardly be reused. I suggest removing the interface or revising it so that other components can reuse it effectively.
I see your point, but I'm not quite sure how to address it.
I want to keep separate handler classes, so I need some common interface for them.
This is, indeed, specific to WiktionaryWordForms. I'm not quite sure how to allow other components to reuse it effectively. At the moment, there's neither a need nor a use case for that. The interface is also quite simple and should be easy to refactor in the future.
Suggestion: just rename it to IWiktionaryWordFormTemplateParameterHandler to make the name reflect the specifics.
Agree with renaming it
Done.
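For readers following the thread, here is a rough sketch of what a word-form-specific handler interface under the new name could look like. Only the interface name comes from the discussion; the method signatures, the `WordFormSketch` stand-in, and the dispatch loop are assumptions made for illustration and are simpler than the four-argument `canHandle` seen in the tests.

```java
import java.util.List;

// Stand-in for the real WiktionaryWordForm; holds only what this sketch needs.
class WordFormSketch {
    String genus; // hypothetical field, e.g. "m", "f", "n"
}

// Name taken from the review thread; the method signatures are assumed.
interface IWiktionaryWordFormTemplateParameterHandler {
    boolean canHandle(String label, String value, WordFormSketch wordForm);

    void handle(String label, String value, WordFormSketch wordForm);
}

// Hypothetical handler that reacts to the `Genus` template parameter.
class GenusHandlerSketch implements IWiktionaryWordFormTemplateParameterHandler {
    @Override
    public boolean canHandle(String label, String value, WordFormSketch wordForm) {
        return "Genus".equals(label);
    }

    @Override
    public void handle(String label, String value, WordFormSketch wordForm) {
        wordForm.genus = value;
    }
}

class HandlerDispatchSketch {
    // Passes the parameter to the first handler that accepts it.
    // Returns false if no handler recognised the label.
    static boolean dispatch(
            List<? extends IWiktionaryWordFormTemplateParameterHandler> handlers,
            String label, String value, WordFormSketch wordForm) {
        for (IWiktionaryWordFormTemplateParameterHandler handler : handlers) {
            if (handler.canHandle(label, value, wordForm)) {
                handler.handle(label, value, wordForm);
                return true;
            }
        }
        return false;
    }
}
```

The longer name makes the coupling to word forms explicit, which addresses the naming concern without making the interface more generic than it needs to be.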
- Applied suggestion from code review to case handler tests
- Renamed `ITemplateParameterHandler` to reflect its specifics to `WiktionaryWordForm`