Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809

Merged
merged 6 commits into from
Aug 14, 2019
Merged

LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809

merged 6 commits into from
Aug 14, 2019

Conversation

mocobeta
Copy link
Contributor

Description

This adds a format check to Kuromoji user dictionary if the concatenated segment is same as its surface form. See https://issues.apache.org/jira/browse/LUCENE-8933

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a reasonable restriction; people can always use synonym filter to modify forms later. I'm just curious though: does the tokenizer make the assumption that rules of are this form? It would lead to errors to have entries that violate this?

@@ -104,6 +104,8 @@ public int compare(String[] left, String[] right) {
long ord = 0;

for (String[] values : featureEntries) {
String surface = values[0].replaceAll(" ", "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe replace all white space here (ie including tabs) by using \s?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah never mind my question, I see from the email discussion that is the case: we shouldn't allow this kind of entry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe replace all white space here (ie including tabs) by using \s?

Thanks, fixed.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment regarding the tests but it looks good to me @mocobeta .

@@ -77,4 +78,19 @@ public void testRead() throws IOException {
UserDictionary dictionary = TestJapaneseTokenizer.readDict();
assertNotNull(dictionary);
}

@Test(expected = RuntimeException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use expectThrows and check that the message is correct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean assertThrows that was introduced from JUnit 4.13
https://junit.org/junit4/javadoc/latest/org/junit/Assert.html#assertThrows(java.lang.String,%20java.lang.Class,%20org.junit.function.ThrowingRunnable)

Lucene uses junit 4.12, so we cannot use it (Am I missing something?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed LuceneTestCase#expectThrows. Will fix this in another patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #830

UserDictionary dictionary = UserDictionary.open(new StringReader(invalidEntry));
}

@Test(expected = RuntimeException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@mocobeta
Copy link
Contributor Author

Thanks, I will merge it soon.

@mocobeta mocobeta merged commit 73ba88a into apache:master Aug 14, 2019
@mocobeta mocobeta deleted the jira/LUCENE-8933-kuromoji-userdic-to-master branch August 16, 2019 09:23
MarcusSorealheis added a commit to MarcusSorealheis/lucene-solr that referenced this pull request Aug 20, 2019
…cusSorealheis/lucene-solr into enhancement_blockUnkwon-default-true

* 'enhancement_blockUnkwon-default-true' of github.com:MarcusSorealheis/lucene-solr: (51 commits)
  Harden AliasIntegrationTest.testClusterStateProviderAPI
  SOLR-13694: IndexSizeEstimator NullPointerException.
  adding <SpanPositionRange> into XML Query Parser
  SOLR-13693: Use strongly-typed setters for cache parameters.
  LUCENE-8933: Use 'expectThrows' instead of 'expected'. (apache#830)
  LUCENE-8933: Validate JapaneseTokenizer user dictionary entry (apache#809)
  SOLR-13240: make operation-not-null checks consistent in TestPolicy.testNodeLostMultipleReplica (Richard Goodman via Christine Poerschke)
  SOLR-13688: Run the bin/solr export command multithreaded
  SOLR-13464: fix javadoc typo that precommit somehow missed?
  SOLR-13464: Test work arounds
  SOLR-13399: Adding splitByPrefix param to IndexSizeTrigger; some splitByPrefix test and code cleanup
  SOLR-13647: Default solr.in.sh contains incorrect default value
  SOLR-13568: Precommit fail Java var until 9x. Fail var...
  SOLR-13573: Add SolrRangeQuery getters for bounds
  SOLR-13593: Allow to look up analyzer components by their SPI names in field type configuration.
  LUCENE-8948: Change 'name' argument in ICU factories to 'form'.
  SOLR-13680: use try-with-resource to close closeable resources
  SOLR-13682: command line option to export documents to a file
  SOLR-13682: precommit errors
  SOLR-13682: command line option to export documents to a file
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants