Adding support for customizing the rule file in ICU tokenizer #13651

Merged: 1 commit merged into elastic:master on Apr 22, 2016

@xuzha xuzha commented Sep 18, 2015

Lucene allows creating an ICUTokenizer with a special config argument
that enables customization of the rule-based break iterator through
custom rule files.

This commit enables this feature. Users can provide a list of RBBI rule
files to the ICU tokenizer.

closes #13146
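
The description does not say what an entry in the rule file list looks like. Purely as an illustration, the sketch below assumes each entry pairs a script code with a rule file name in the form `script:file`; that format, the class name `RuleFileSpecDemo`, and the method `parseRuleFiles` are assumptions for this example and are not taken from the PR. It only shows how such a list might be validated and grouped before any rule file is loaded.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RuleFileSpecDemo {

    // Hypothetical helper: maps a script code to a rule file name.
    // Assumes entries look like "Latn:my_rules.rbbi" (an assumed format).
    static Map<String, String> parseRuleFiles(List<String> entries) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String entry : entries) {
            // Split on the first colon only, so the file name may contain colons.
            String[] parts = entry.split(":", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException(
                        "expected 'script:file' but got '" + entry + "'");
            }
            byScript.put(parts[0].trim(), parts[1].trim());
        }
        return byScript;
    }

    public static void main(String[] args) {
        Map<String, String> rules =
                parseRuleFiles(Arrays.asList("Latn:latin_rules.rbbi", "Cyrl:cyrillic_rules.rbbi"));
        System.out.println(rules);
    }
}
```

Rejecting malformed entries up front keeps a configuration mistake from surfacing later as a tokenizer that silently falls back to default rules.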

super(index, indexSettings, name, settings);
config = getIcuConfig(env, settings);
Member

Is it necessary to call a public non-final method from a constructor? This can cause issues when somebody subclasses IcuTokenizerFactory, because initialization order then becomes important. I'd opt for making #getIcuConfig() private, which avoids the issue. Alternatives: declare #getIcuConfig() or the class as final.
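
To make the concern concrete, here is a minimal, self-contained sketch of the hazard. The classes `BaseFactory` and `CustomFactory` and the method `loadConfig` are hypothetical stand-ins, not the actual IcuTokenizerFactory code. The superclass constructor calls an overridable method, so the subclass override runs before the subclass's own fields have been assigned.

```java
public class ConstructorOverrideDemo {

    // Hypothetical stand-in for a factory whose constructor calls an overridable method.
    static class BaseFactory {
        final String config;

        BaseFactory() {
            // Dispatches to the subclass override before the subclass constructor body runs.
            config = loadConfig();
        }

        public String loadConfig() {
            return "default";
        }
    }

    static class CustomFactory extends BaseFactory {
        private final String ruleFile;

        CustomFactory(String ruleFile) {
            super();                  // loadConfig() is invoked in here
            this.ruleFile = ruleFile; // assigned only after super() returns
        }

        @Override
        public String loadConfig() {
            // ruleFile still holds its default value (null) when called from BaseFactory().
            return "rules=" + ruleFile;
        }
    }

    public static void main(String[] args) {
        CustomFactory factory = new CustomFactory("custom.rbbi");
        System.out.println(factory.config); // prints "rules=null"
    }
}
```

Making `loadConfig` private or final, or making the class final, rules out the override, so the constructor can no longer observe a half-initialized subclass.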

@danielmitterdorfer danielmitterdorfer self-assigned this Feb 26, 2016
@danielmitterdorfer
Member

@xuzha: Sorry that it took so long for somebody to review your PR. I left a few comments. Just let me know if you have further questions. I think it would make sense to resolve conflicts first, given that this branch is now quite out of date.

@xuzha
Contributor Author

xuzha commented Feb 26, 2016

Thx @danielmitterdorfer for the review.
The PR is so outdated, I will give it another look this weekend.

@dakrone
Member

dakrone commented Apr 6, 2016

ping @xuzha, I think this needs updating and then another review by @danielmitterdorfer ?

@xuzha
Contributor Author

xuzha commented Apr 6, 2016

Sorry @dakrone, I was sick last week. I will try to update this ASAP.

@xuzha
Contributor Author

xuzha commented Apr 11, 2016

Thanks @danielmitterdorfer for the review.

I just pushed another commit to address the comments. Sorry that it took a while to update this, it looks so hacky to me :-)

@danielmitterdorfer
Member

@xuzha I answered your questions. I hope that helps now.

@xuzha
Contributor Author

xuzha commented Apr 11, 2016

Thanks @danielmitterdorfer, I just pushed another commit.

private final ICUTokenizerConfig config;
private static final String RULE_FILES = "rule_files";

public static final Setting<List<String>> SETTING_RULE_FILES =
Setting.listSetting(RULE_FILES, new ArrayList<>(), Function.identity(), Setting.Property.IndexScope);
Member

Nit: you could use Collections.emptyList() instead of new ArrayList<>()
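
A short sketch of why the nit matters for a shared default value. The class `EmptyDefaultDemo`, its fields, and the helper `tryAdd` are hypothetical and exist only for this illustration. A `new ArrayList<>()` used as a default can be modified by any caller that gets hold of it, while `Collections.emptyList()` returns an immutable list and need not allocate a new object on each call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyDefaultDemo {

    // Hypothetical defaults, standing in for the default value passed to a list setting.
    static final List<String> MUTABLE_DEFAULT = new ArrayList<>();
    static final List<String> IMMUTABLE_DEFAULT = Collections.emptyList();

    // Returns true if the list accepted the element, false if it is unmodifiable.
    static boolean tryAdd(List<String> list, String value) {
        try {
            list.add(value);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryAdd(MUTABLE_DEFAULT, "oops.rbbi"));   // prints "true"
        System.out.println(tryAdd(IMMUTABLE_DEFAULT, "oops.rbbi")); // prints "false"
        System.out.println(MUTABLE_DEFAULT.size());                 // prints "1"
    }
}
```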

Contributor Author

Ha thanks @danielmitterdorfer ;-)

@danielmitterdorfer
Member

Left a minor comment, otherwise LGTM.

@danielmitterdorfer danielmitterdorfer removed their assignment Apr 12, 2016
@danielmitterdorfer
Member

@xuzha Can you please add an appropriate version label?

@xuzha
Contributor Author

xuzha commented Apr 12, 2016

Thanks @danielmitterdorfer so much for the review. Will add a version when merging the PR.

I will leave this PR open for another few days. Comments are very welcome. <3 <3 <3

@xuzha xuzha merged commit 9ae723d into elastic:master Apr 22, 2016
@xuzha
Contributor Author

xuzha commented Apr 22, 2016

This PR broke the integration tests ;-( so I changed the use of the new setting back to the old version: 3e4b470

Successfully merging this pull request may close these issues.

Add support for ICUTokenizerFactory and customizing the rule file in ICU tokenizer
5 participants