-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for inlined user dictionary in Nori #36123
Changes from 8 commits
fd4a36a
41babe7
450a578
adecda4
adcee29
e618198
4ac0894
3a97e16
cb3ca2a
39db1d9
a2af275
5fcfad4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,10 +29,16 @@ | |
|
||
import java.io.IOException; | ||
import java.io.Reader; | ||
import java.io.StringReader; | ||
import java.util.Collections; | ||
import java.util.HashSet; | ||
import java.util.List; | ||
import java.util.Locale; | ||
import java.util.Set; | ||
|
||
public class NoriTokenizerFactory extends AbstractTokenizerFactory { | ||
private static final String USER_DICT_OPTION = "user_dictionary"; | ||
private static final String USER_DICT_PATH_OPTION = "user_dictionary"; | ||
private static final String USER_DICT_RULES_OPTION = "user_dictionary_rules"; | ||
|
||
private final UserDictionary userDictionary; | ||
private final KoreanTokenizer.DecompoundMode decompoundMode; | ||
|
@@ -44,12 +50,22 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String | |
} | ||
|
||
public static UserDictionary getUserDictionary(Environment env, Settings settings) { | ||
try (Reader reader = Analysis.getReaderFromFile(env, settings, USER_DICT_OPTION)) { | ||
if (reader == null) { | ||
return null; | ||
} else { | ||
return UserDictionary.open(reader); | ||
List<String> ruleList = Analysis.getWordList(env, settings, USER_DICT_PATH_OPTION, USER_DICT_RULES_OPTION); | ||
StringBuilder sb = new StringBuilder(); | ||
if (ruleList == null || ruleList.isEmpty()) { | ||
return null; | ||
} | ||
// check for duplicate terms | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Where does the requirement for checking for duplicates come from? Would it make sense to simply ignore them if they happen or to log it only instead of throwing an error? I can imagine ppl compiling larger dictionaries where duplicates might sneak in over time, maybe this shouldn't block loading the whole analyzer? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wanted to align the Kuromoji and Nori regarding duplicates in the user dictionary. The Japanese tokenizer doesn't accept duplicates (#36100) while Nori just ignores them. However I agree that this is not the scope of this pr so I removed the duplicate detection and will open a new pr in a follow up. |
||
Set<String> terms = new HashSet<>(); | ||
for (String line : ruleList) { | ||
String[] split = line.split("\\s+"); | ||
if (terms.add(split[0]) == false) { | ||
throw new IllegalArgumentException("Found duplicate term: [" + split[0] + "] in user dictionary. "); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this stays an error or maybe gets converted to a log message, the line number would be a helpful debugging information for the user the file. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, I'll make sure we report the line number in the follow up pr. |
||
} | ||
sb.append(line).append(System.lineSeparator()); | ||
} | ||
try (Reader rulesReader = new StringReader(sb.toString())) { | ||
return UserDictionary.open(rulesReader); | ||
} catch (IOException e) { | ||
throw new ElasticsearchException("failed to load nori user dictionary", e); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason this whole sections doesn't render when I build the docs locally. I played around with it a bit but couldn't get it to work but its probably worth taking another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks. I forgot to add the end of section (e.g.
--
) so the whole section was not displayed. I pushed adcee29 to fix this.