-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809
LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809
Conversation
…concatenated segment is same as its surface form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a reasonable restriction; people can always use synonym filter to modify forms later. I'm just curious though: does the tokenizer make the assumption that rules of are this form? It would lead to errors to have entries that violate this?
@@ -104,6 +104,8 @@ public int compare(String[] left, String[] right) { | |||
long ord = 0; | |||
|
|||
for (String[] values : featureEntries) { | |||
String surface = values[0].replaceAll(" ", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe replace all white space here (ie including tabs) by using \s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah never mind my question, I see from the email discussion that is the case: we shouldn't allow this kind of entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe replace all white space here (ie including tabs) by using \s?
Thanks, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one comment regarding the tests but it looks good to me @mocobeta .
@@ -77,4 +78,19 @@ public void testRead() throws IOException { | |||
UserDictionary dictionary = TestJapaneseTokenizer.readDict(); | |||
assertNotNull(dictionary); | |||
} | |||
|
|||
@Test(expected = RuntimeException.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use expectThrows
and check that the message is correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean assertThrows
that was introduced from JUnit 4.13
https://junit.org/junit4/javadoc/latest/org/junit/Assert.html#assertThrows(java.lang.String,%20java.lang.Class,%20org.junit.function.ThrowingRunnable)
Lucene uses junit 4.12, so we cannot use it (Am I missing something?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed LuceneTestCase#expectThrows. Will fix this in another patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #830
UserDictionary dictionary = UserDictionary.open(new StringReader(invalidEntry)); | ||
} | ||
|
||
@Test(expected = RuntimeException.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Thanks, I will merge it soon. |
…cusSorealheis/lucene-solr into enhancement_blockUnkwon-default-true * 'enhancement_blockUnkwon-default-true' of github.com:MarcusSorealheis/lucene-solr: (51 commits) Harden AliasIntegrationTest.testClusterStateProviderAPI SOLR-13694: IndexSizeEstimator NullPointerException. adding <SpanPositionRange> into XML Query Parser SOLR-13693: Use strongly-typed setters for cache parameters. LUCENE-8933: Use 'expectThrows' instead of 'expected'. (apache#830) LUCENE-8933: Validate JapaneseTokenizer user dictionary entry (apache#809) SOLR-13240: make operation-not-null checks consistent in TestPolicy.testNodeLostMultipleReplica (Richard Goodman via Christine Poerschke) SOLR-13688: Run the bin/solr export command multithreaded SOLR-13464: fix javadoc typo that precommit somehow missed? SOLR-13464: Test work arounds SOLR-13399: Adding splitByPrefix param to IndexSizeTrigger; some splitByPrefix test and code cleanup SOLR-13647: Default solr.in.sh contains incorrect default value SOLR-13568: Precommit fail Java var until 9x. Fail var... SOLR-13573: Add SolrRangeQuery getters for bounds SOLR-13593: Allow to look up analyzer components by their SPI names in field type configuration. LUCENE-8948: Change 'name' argument in ICU factories to 'form'. SOLR-13680: use try-with-resource to close closeable resources SOLR-13682: command line option to export documents to a file SOLR-13682: precommit errors SOLR-13682: command line option to export documents to a file ...
Description
This adds a format check to Kuromoji user dictionary if the concatenated segment is same as its surface form. See https://issues.apache.org/jira/browse/LUCENE-8933