LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809

mocobeta · 2019-07-27T08:19:18Z

Description

This adds a format check to the Kuromoji user dictionary that verifies that each entry's concatenated segments are the same as its surface form. See https://issues.apache.org/jira/browse/LUCENE-8933

msokolov

It seems like a reasonable restriction; people can always use a synonym filter to modify forms later. I'm just curious though: does the tokenizer assume that rules are of this form? Would entries that violate it lead to errors?

msokolov · 2019-07-27T20:08:37Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/UserDictionary.java

@@ -104,6 +104,8 @@ public int compare(String[] left, String[] right) {
    long ord = 0;

    for (String[] values : featureEntries) {
+      String surface = values[0].replaceAll(" ", "");


maybe replace all white space here (ie including tabs) by using \s?

Ah, never mind my question; I see from the email discussion that this is the case: we shouldn't allow this kind of entry.

maybe replace all white space here (ie including tabs) by using \s?

Thanks, fixed.
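
For readers following the thread, here is a minimal sketch of the check being discussed, written as a stand-alone class rather than the actual `UserDictionary` code. It assumes `values[0]` holds the surface form (as in the diff above) and `values[1]` holds the space-separated segmentation; the class name `UserDictionaryEntryCheck`, the error message wording, and the romanized sample entries are hypothetical. It uses the `\s` character class so that tabs are stripped as well as spaces.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone helper for illustration; not the code merged in this PR.
final class UserDictionaryEntryCheck {
  // \s matches spaces, tabs and other ASCII whitespace, unlike the literal " ".
  private static final Pattern WHITESPACE = Pattern.compile("\\s");

  /**
   * Assumes values[0] is the surface form and values[1] is the
   * whitespace-separated segmentation of that surface form.
   */
  static void validate(String[] values) {
    String surface = WHITESPACE.matcher(values[0]).replaceAll("");
    String concatenatedSegment = WHITESPACE.matcher(values[1]).replaceAll("");
    if (!surface.equals(concatenatedSegment)) {
      // Illustrative message wording only.
      throw new RuntimeException(
          "Illegal user dictionary entry " + values[0]
              + " - the concatenated segmentation (" + concatenatedSegment
              + ") does not match the surface form (" + surface + ")");
    }
  }

  public static void main(String[] args) {
    // Segments joined by a space and a tab still concatenate to the surface form.
    validate(new String[] {"kansaikokusaikuukou", "kansai\tkokusai kuukou"});

    // The segments here do not add up to the surface form, so the entry is rejected.
    try {
      validate(new String[] {"kansaikuukou", "kansai kokusai kuukou"});
    } catch (RuntimeException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With a literal `" "` in place of `\s`, the first entry would fail because the tab would survive in the concatenated segmentation.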

jimczi

I left one comment regarding the tests, but it looks good to me, @mocobeta.

jimczi · 2019-08-13T07:17:00Z

lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/dict/UserDictionaryTest.java

@@ -77,4 +78,19 @@ public void testRead() throws IOException {
    UserDictionary dictionary = TestJapaneseTokenizer.readDict();
    assertNotNull(dictionary);
  }
+
+  @Test(expected = RuntimeException.class)


Can you use expectThrows and check that the message is correct?

I think you mean assertThrows, which was introduced in JUnit 4.13:
https://junit.org/junit4/javadoc/latest/org/junit/Assert.html#assertThrows(java.lang.String,%20java.lang.Class,%20org.junit.function.ThrowingRunnable)

Lucene uses JUnit 4.12, so we cannot use it (am I missing something?).

Sorry, I missed LuceneTestCase#expectThrows. Will fix this in another patch.

Fixed in #830
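
To illustrate the reviewer's request, the sketch below contrasts with `@Test(expected = RuntimeException.class)` by capturing the exception and asserting on its message. It does not depend on the Lucene test framework: the `expectThrows` helper here is a simplified, hypothetical stand-in for `LuceneTestCase#expectThrows`, `openDictionary` is a hypothetical stand-in for `UserDictionary.open`, and the entry text and message wording are assumptions made for the example.

```java
// Hypothetical, self-contained illustration; not the test added in #830.
final class ExpectThrowsSketch {
  /** Code under test that is allowed to throw. */
  interface ThrowingRunnable {
    void run() throws Throwable;
  }

  /** Simplified stand-in for LuceneTestCase#expectThrows: returns the caught exception. */
  static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable runnable) {
    try {
      runnable.run();
    } catch (Throwable t) {
      if (expectedType.isInstance(t)) {
        return expectedType.cast(t);
      }
      throw new AssertionError("Unexpected exception type: " + t.getClass().getName(), t);
    }
    throw new AssertionError(
        "Expected " + expectedType.getSimpleName() + " but nothing was thrown");
  }

  /** Stand-in for UserDictionary.open; assumes a "surface,segmentation,..." entry layout. */
  static void openDictionary(String entry) {
    String[] values = entry.split(",");
    String surface = values[0].replaceAll("\\s", "");
    String segmentation = values[1].replaceAll("\\s", "");
    if (!surface.equals(segmentation)) {
      throw new RuntimeException("does not match the surface form: " + entry);
    }
  }

  public static void main(String[] args) {
    String invalidEntry = "abc,ab d,x y,noun";
    RuntimeException e =
        expectThrows(RuntimeException.class, () -> openDictionary(invalidEntry));
    // Unlike @Test(expected = ...), the message can now be verified as well.
    if (!e.getMessage().contains("does not match the surface form")) {
      throw new AssertionError("Unexpected message: " + e.getMessage());
    }
  }
}
```

The design point is that `expected = ...` passes for any `RuntimeException` thrown anywhere in the test method, while capturing the exception ties the assertion to the specific call and its message.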

jimczi · 2019-08-13T07:17:06Z

lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/dict/UserDictionaryTest.java

+    UserDictionary dictionary = UserDictionary.open(new StringReader(invalidEntry));
+  }
+
+  @Test(expected = RuntimeException.class)


mocobeta · 2019-08-14T02:55:28Z

Thanks, I will merge it soon.

mocobeta added 2 commits July 27, 2019 17:16

LUCENE-8933: Validate JapaneseTokenizer user dictionary entry if the concatenated segment is same as its surface form.

5f6df64

fix typo.

f7cca9e

msokolov approved these changes Jul 27, 2019

Use whitespace character class to remove all whitespace.

090bdc7

jimczi approved these changes Aug 13, 2019

jimczi mentioned this pull request Aug 13, 2019

Improve error handling/logging for illegal_argument_exception|failed to build synonyms errors elastic/elasticsearch#44243

Closed

mocobeta added 2 commits August 14, 2019 11:52

lucene/MIGRATE.txt

8e7e5a8

Merge branch 'master' into jira/LUCENE-8933-kuromoji-userdic-to-master

ea914b7

Update changes

df390db

mocobeta merged commit 73ba88a into apache:master Aug 14, 2019

mocobeta deleted the jira/LUCENE-8933-kuromoji-userdic-to-master branch August 16, 2019 09:23
