Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8933: Validate JapaneseTokenizer user dictionary entry #809

Merged
merged 6 commits into from
Aug 14, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions lucene/MIGRATE.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,17 @@
# Apache Lucene Migration Guide

## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933) ##

User dictionary now strictly validates if the (concatenated) segment is the same as the surface form. This change avoids
unexpected runtime exceptions or behaviours.
For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

# concatenated "日本経済新聞" does not match the surface form "日経新聞"
日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

# concatenated "日経新聞" does not match the surface form "日本経済新聞"
日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞

## TermsEnum is now fully abstract (LUCENE-8292) ##

TermsEnum has been changed to be fully abstract, so non-abstract subclass must implement all it's methods.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,8 @@ public int compare(String[] left, String[] right) {
long ord = 0;

for (String[] values : featureEntries) {
String surface = values[0].replaceAll("\\s", "");
String concatenatedSegment = values[1].replaceAll("\\s", "");
String[] segmentation = values[1].replaceAll(" *", " ").split(" ");
String[] readings = values[2].replaceAll(" *", " ").split(" ");
String pos = values[3];
Expand All @@ -113,6 +115,12 @@ public int compare(String[] left, String[] right) {
" - the number of segmentations (" + segmentation.length + ")" +
" does not the match number of readings (" + readings.length + ")");
}

if (!surface.equals(concatenatedSegment)) {
throw new RuntimeException("Illegal user dictionary entry " + values[0] +
" - the concatenated segmentation (" + concatenatedSegment + ")" +
" does not match the surface form (" + surface + ")");
}

int[] wordIdAndLength = new int[segmentation.length + 1]; // wordId offset, length, length....
wordIdAndLength[0] = wordId;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@


import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.TestJapaneseTokenizer;
import org.apache.lucene.util.LuceneTestCase;
Expand Down Expand Up @@ -77,4 +78,19 @@ public void testRead() throws IOException {
UserDictionary dictionary = TestJapaneseTokenizer.readDict();
assertNotNull(dictionary);
}

@Test(expected = RuntimeException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use expectThrows and check that the message is correct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean assertThrows that was introduced from JUnit 4.13
https://junit.org/junit4/javadoc/latest/org/junit/Assert.html#assertThrows(java.lang.String,%20java.lang.Class,%20org.junit.function.ThrowingRunnable)

Lucene uses junit 4.12, so we cannot use it (Am I missing something?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed LuceneTestCase#expectThrows. Will fix this in another patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #830

public void testReadInvalid1() throws IOException {
// the concatenated segment must be the same as the surface form
String invalidEntry = "日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞";
UserDictionary dictionary = UserDictionary.open(new StringReader(invalidEntry));
}

@Test(expected = RuntimeException.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

public void testReadInvalid2() throws IOException {
// the concatenated segment must be the same as the surface form
String invalidEntry = "日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞";
UserDictionary dictionary = UserDictionary.open(new StringReader(invalidEntry));
}

}