LUCENE-8526: Add javadocs in CJKBigramFilter explaining the behavior of the StandardTokenizer on Hangul syllables.
jimczi committed Oct 11, 2018
1 parent 971a0e3 commit c87778c
Showing 1 changed file with 8 additions and 0 deletions.
@@ -43,6 +43,14 @@
  * flag in {@link CJKBigramFilter#CJKBigramFilter(TokenStream, int, boolean)}.
  * This can be used for a combined unigram+bigram approach.
  * <p>
+ * Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries.
+ * Korean Hangul characters are treated the same as many other scripts'
+ * letters, and as a result, StandardTokenizer can produce tokens that mix
+ * Hangul and non-Hangul characters, e.g. "한국abc". Such mixed-script tokens
+ * are typed as <code>&lt;ALPHANUM&gt;</code> rather than
+ * <code>&lt;HANGUL&gt;</code>, and as a result, will not be converted to
+ * bigrams by CJKBigramFilter.
+ *
  * In all cases, all non-CJK input is passed thru unmodified.
  */
 public final class CJKBigramFilter extends TokenFilter {
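To make the documented behavior concrete, here is a minimal, illustrative sketch that is not part of this commit: it feeds StandardTokenizer output through CJKBigramFilter and prints each term with its token type. The class name HangulBigramDemo, the sample input strings, and the assumption that the Lucene core and common analysis jars are on the classpath are my own choices for illustration, not taken from the patch.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class HangulBigramDemo {
  public static void main(String[] args) throws Exception {
    // "한국abc" mixes Hangul and Latin letters; StandardTokenizer keeps it as a
    // single token typed <ALPHANUM>, so CJKBigramFilter passes it through
    // untouched. A pure-Hangul token such as "붕어빵" is typed <HANGUL> and is
    // turned into syllable bigrams by the filter.
    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("한국abc 붕어빵"));

    TokenStream stream = new CJKBigramFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString() + "\t" + type.type());
    }
    stream.end();
    stream.close();
  }
}

Swapping in ICUTokenizer (from the ICU analysis module) would split the mixed input at the script boundary, so the Hangul portion would be typed <HANGUL> and bigrammed, which is the contrast the new javadoc paragraph above describes.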
