add support for the char_group tokenizer #3427

Merged
merged 9 commits into from
Oct 17, 2018

Mpdreamz
Member

pending #3424

…ses to do full blown endpoint testing

(cherry picked from commit d6f1ae5)
(cherry picked from commit f5f0c437871589b1fb90b6c4c6f09f0dfc296d7e)
(cherry picked from commit c74ed51e2c30804ffc1d50f95a17893a93bfa6ea)
(cherry picked from commit f2da9f51b43b188cc1b2d09f616fbf87ca268344)
(cherry picked from commit 7ecbee5435df02810ede7f07985e7bb13f66b6f3)
…ch implements the bulk of the setup and tests

(cherry picked from commit 8a6e99493a4174a87cc8680609afd0c482cf10d7)
Contributor

@russcam russcam left a comment

I think the documentation should be updated.

{
/// <summary>
/// The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.
/// </summary>
Contributor

I think the documentation should be:

A list containing a list of characters to tokenize the string on. Whenever a character from this list is encountered, a
new token is started. This accepts either single characters, e.g. -, or character groups: whitespace, letter, digit,
punctuation, symbol.
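
To make the suggested wording concrete, here is a minimal standalone sketch of the splitting behaviour it describes. It does not use NEST or Elasticsearch; the class and method names (CharGroupSketch, IsBreak, Tokenize) are hypothetical, and mapping the group names onto the .NET char.Is* checks is an assumption that only approximates what Elasticsearch does on the server.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharGroupSketch
{
    // Hypothetical helper: true when c should end the current token.
    // Assumption: group names map onto the .NET char.Is* categories.
    private static bool IsBreak(char c, ISet<string> tokenizeOnChars)
    {
        if (tokenizeOnChars.Contains(c.ToString())) return true; // single character, e.g. "-"
        if (tokenizeOnChars.Contains("whitespace") && char.IsWhiteSpace(c)) return true;
        if (tokenizeOnChars.Contains("letter") && char.IsLetter(c)) return true;
        if (tokenizeOnChars.Contains("digit") && char.IsDigit(c)) return true;
        if (tokenizeOnChars.Contains("punctuation") && char.IsPunctuation(c)) return true;
        if (tokenizeOnChars.Contains("symbol") && char.IsSymbol(c)) return true;
        return false;
    }

    // Splits text wherever a listed character or character group is seen.
    // Empty tokens between consecutive break characters are dropped.
    public static List<string> Tokenize(string text, IEnumerable<string> tokenizeOnChars)
    {
        var breaks = new HashSet<string>(tokenizeOnChars);
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (var c in text)
        {
            if (IsBreak(c, breaks))
            {
                if (current.Length > 0)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }

        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }

    public static void Main()
    {
        var tokens = Tokenize("The QUICK brown-fox", new[] { "whitespace", "-" });
        Console.WriteLine(string.Join(", ", tokens)); // prints: The, QUICK, brown, fox
    }
}
```

The entries are plain strings here, mirroring the IEnumerable<string> property in the diff, so a single character and a group name go through the same list.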

/// The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.
/// </summary>
[JsonProperty("tokenize_on_chars")]
IEnumerable<string> TokenizeOnCharacters { get; set; }
Contributor

Should this be a specialized type that takes a union of enum and char? string is no doubt easier to use.

@russcam russcam changed the base branch from refactor/analysis-tests to 6.4 October 17, 2018 03:30
@russcam russcam merged commit 9ab4384 into 6.4 Oct 17, 2018
@russcam russcam deleted the feature/char-group-tokenizer branch October 18, 2018 09:45
russcam pushed a commit that referenced this pull request Oct 26, 2018