.. _sql-analyzer:
==================
Fulltext analyzers
==================
.. rubric:: Table of contents
.. contents::
:local:
.. _analyzer-overview:
Overview
========
Analyzers are used for creating fulltext indices. They take the content of a
field and split it into tokens, which are then searched. Analyzers filter,
reorder, and/or transform the content of a field before it becomes the final
stream of tokens.
An analyzer consists of one tokenizer, zero or more token-filters, and zero or
more char-filters.
When field content is analyzed to become a stream of tokens, the char-filters
are applied first. They filter out or replace special characters in the stream
of characters that makes up the content.
Tokenizers take a possibly filtered stream of characters and split it into a
stream of tokens.
Token-filters can add tokens, delete tokens or transform them.
With these elements in place, analyzers provide fine-grained control over
building a token stream used for fulltext search. For example, you can use
language-specific analyzers, tokenizers, and token-filters to get proper search
results for data provided in a certain language.
The built-in analyzers, tokenizers, token-filters, and char-filters are listed
below. They can be used as is or extended.
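For instance, a custom analyzer combining all three building blocks might be
declared like this (the analyzer name and the chosen components are
illustrative; see :ref:`ref-create-analyzer` for the full syntax)::

    CREATE ANALYZER myanalyzer (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            lowercase,
            kstem
        ),
        CHAR_FILTERS (
            html_strip
        )
    );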
.. SEEALSO::
:ref:`fulltext-indices` for examples showing how to create tables which make
use of analyzers.
:ref:`create_custom_analyzer` for an example showing how to create a custom
analyzer.
:ref:`ref-create-analyzer` for the syntax reference.
.. _builtin-analyzer:
Built-in analyzers
==================
.. _standard-analyzer:
``standard``
------------
``type='standard'``
An analyzer of type standard is built using the :ref:`standard-tokenizer`
tokenizer with the :ref:`standard-tokenfilter` Token Filter,
:ref:`lowercase-tokenfilter` Token Filter, and :ref:`stop-tokenfilter` Token
Filter.
It lowercases all tokens, uses *no* stopwords, and excludes tokens longer than
255 characters. This analyzer uses Unicode text segmentation, as defined by
`UAX#29`_.
For example, the standard analyzer converts the sentence
::
The quick brown fox jumps Over the lAzY DOG.
into the following tokens
::
quick, brown, fox, jumps, lazy, dog
.. rubric:: Parameters
stopwords
A list of stopwords to initialize the :ref:`stop-tokenfilter` filter with.
Defaults to the English stop words.
max_token_length
The maximum token length. If a token exceeds this length, it is split into
chunks of ``max_token_length``. Defaults to ``255``.
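For example, the built-in ``standard`` analyzer can be extended with these
parameters; a minimal sketch (the analyzer name and values are illustrative)::

    CREATE ANALYZER myanalyzer EXTENDS standard WITH (
        stopwords = ['the', 'a', 'an'],
        max_token_length = 100
    );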
.. _default-analyzer:
``default``
-----------
``type='default'``
This is the same as the `standard-analyzer`_ analyzer.
.. _simple-analyzer:
``simple``
----------
``type='simple'``
Uses the :ref:`lowercase-tokenizer` tokenizer.
.. _plain-analyzer:
``plain``
----------
``type='plain'``
The plain analyzer is an alias for the keyword_ analyzer and cannot be extended.
You must extend the keyword_ analyzer instead.
.. _whitespace-analyzer:
``whitespace``
--------------
``type='whitespace'``
Uses a :ref:`whitespace-tokenizer` tokenizer.
.. _stop-analyzer:
``stop``
--------
``type='stop'``
Uses a :ref:`lowercase-tokenizer` tokenizer, with :ref:`stop-tokenfilter` Token
Filter.
.. rubric:: Parameters
stopwords
A list of stopwords to initialize the :ref:`stop-tokenfilter` filter with.
Defaults to the English stop words.
stopwords_path
A path (either relative to configuration location, or absolute) to a
stopwords file configuration.
.. _keyword-analyzer:
``keyword``
-----------
``type='keyword'``
Creates a single token from the field contents.
.. _pattern-analyzer:
``pattern``
-----------
``type='pattern'``
An analyzer of type ``pattern`` can flexibly separate text into terms via a
:ref:`regular expression <gloss-regular-expression>`.
.. rubric:: Parameters
lowercase
Whether terms should be lowercased. Defaults to ``true``.
pattern
The regular expression pattern. Defaults to ``\W+``.
flags
The regular expression flags.
.. NOTE::
The regular expression should match the token separators, not the tokens
themselves.
Flags should be pipe-separated, e.g. ``CASE_INSENSITIVE|COMMENTS``. See the
`Java Pattern API`_ for more details about flag options.
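For example, a sketch of a pattern analyzer that tokenizes comma-separated
values (the analyzer name and pattern are illustrative)::

    CREATE ANALYZER csv_analyzer EXTENDS pattern WITH (
        pattern = ',\s*',
        lowercase = true
    );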
.. _language-analyzer:
``language``
------------
``type='<language-name>'``
The following types are supported:
``arabic``, ``armenian``, ``basque``, ``brazilian``, ``bengali``,
``bulgarian``, ``catalan``, ``chinese``, ``cjk``, ``czech``, ``danish``,
``dutch``, ``english``, ``finnish``, ``french``, ``galician``, ``german``,
``greek``, ``hindi``, ``hungarian``, ``indonesian``, ``italian``, ``latvian``,
``lithuanian``, ``norwegian``, ``persian``, ``portuguese``, ``romanian``,
``russian``, ``sorani``, ``spanish``, ``swedish``, ``turkish``, ``thai``.
.. rubric:: Parameters
stopwords
A list of stopwords to initialize the stop filter with. Defaults to the
English stop words.
stopwords_path
A path (either relative to configuration location, or absolute) to a
stopwords file configuration.
stem_exclusion
The stem_exclusion parameter allows you to specify an array of lowercase words
that should not be stemmed. The following analyzers support setting
stem_exclusion:
``arabic``, ``armenian``, ``basque``, ``brazilian``, ``bengali``,
``bulgarian``, ``catalan``, ``czech``, ``danish``, ``dutch``, ``english``,
``finnish``, ``french``, ``galician``, ``german``, ``hindi``, ``hungarian``,
``indonesian``, ``italian``, ``latvian``, ``lithuanian``, ``norwegian``,
``portuguese``, ``romanian``, ``russian``, ``spanish``, ``swedish``,
``turkish``.
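For example, a German analyzer that protects certain terms from stemming might
look like this (the analyzer name and word list are illustrative)::

    CREATE ANALYZER german_products EXTENDS german WITH (
        stem_exclusion = ['autobahn', 'bratwurst']
    );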
.. _snowball-analyzer:
``snowball``
------------
``type='snowball'``
Uses the :ref:`standard-tokenizer` tokenizer, with :ref:`standard-tokenfilter`
filter, :ref:`lowercase-tokenfilter` filter, :ref:`stop-tokenfilter` filter,
and :ref:`snowball-tokenfilter` filter.
.. rubric:: Parameters
stopwords
A list of stopwords to initialize the stop filter with. Defaults to the
English stop words.
language
See the language-parameter of :ref:`snowball-tokenfilter`.
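A sketch of a snowball analyzer configured for German (assuming the capitalized
language values listed for the :ref:`snowball-tokenfilter` filter)::

    CREATE ANALYZER german_snowball EXTENDS snowball WITH (
        language = 'German'
    );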
.. _fingerprint-analyzer:
``fingerprint``
---------------
``type='fingerprint'``
The fingerprint analyzer implements a fingerprinting algorithm which is used by
the OpenRefine project to assist in clustering. Input text is lowercased,
normalized to remove extended characters, sorted, de-duplicated, and
concatenated into a single token. If a stopword list is configured, stop words
will also be removed. It uses the :ref:`standard-tokenizer` tokenizer and the
following filters: :ref:`lowercase-tokenfilter`, :ref:`asciifolding-tokenfilter`,
:ref:`fingerprint-tokenfilter`, and :ref:`stop-tokenfilter`.
.. rubric:: Parameters
separator
The character to use to concatenate the terms. Defaults to a space.
max_output_size
The maximum token size to emit, tokens larger than this size will be
discarded. Defaults to ``255``.
stopwords
A pre-defined stop words list like ``_english_`` or an array containing a list
of stop words. Defaults to ``_none_``.
stopwords_path
The path to a file containing stop words.
.. _builtin-tokenizer:
Built-in tokenizers
===================
.. _standard-tokenizer:
Standard tokenizer
------------------
``type='standard'``
The tokenizer of type ``standard`` provides a grammar-based tokenizer that
works well for most European-language documents. The tokenizer implements the
Unicode text segmentation algorithm, as specified in Unicode Standard
Annex #29.
.. rubric:: Parameters
max_token_length
The maximum token length. If a token exceeds this length, it is split into
chunks of ``max_token_length``. Defaults to ``255``.
.. _classic-tokenizer:
Classic tokenizer
-----------------
``type='classic'``
The ``classic`` tokenizer is a grammar based tokenizer that is good for English
language documents. This tokenizer has heuristics for special treatment of
acronyms, company names, email addresses, and internet host names. However,
these rules don't always work, and the tokenizer doesn't work well for most
languages other than English.
.. rubric:: Parameters
max_token_length
The maximum token length. If a token exceeds this length, it is split into
chunks of ``max_token_length``. Defaults to ``255``.
.. _thai-tokenizer:
Thai tokenizer
--------------
``type='thai'``
The ``thai`` tokenizer splits Thai text correctly and treats all other
languages like the :ref:`standard-tokenizer` does.
.. _letter-tokenizer:
Letter tokenizer
----------------
``type='letter'``
The ``letter`` tokenizer splits text at non-letters.
.. _lowercase-tokenizer:
Lowercase tokenizer
-------------------
``type='lowercase'``
The ``lowercase`` tokenizer performs the function of the :ref:`letter-tokenizer`
and the :ref:`lowercase-tokenfilter` together. It divides text at non-letters
and converts the resulting tokens to lower case.
.. _whitespace-tokenizer:
Whitespace tokenizer
--------------------
``type='whitespace'``
The ``whitespace`` tokenizer splits text at whitespace.
.. rubric:: Parameters
max_token_length
The maximum token length. If a token exceeds this length, it is split into
chunks of ``max_token_length``. Defaults to ``255``.
.. _uaxemailurl-tokenizer:
UAX URL email tokenizer
-----------------------
``type='uax_url_email'``
The ``uax_url_email`` tokenizer behaves like the :ref:`standard-tokenizer`, but
tokenizes emails and URLs as single tokens.
.. rubric:: Parameters
max_token_length
The maximum token length. If a token exceeds this length, it is split into
chunks of ``max_token_length``. Defaults to ``255``.
.. _ngram-tokenizer:
N-gram tokenizer
----------------
``type='ngram'``
.. rubric:: Parameters
min_gram
Minimum length of characters in a gram. Defaults to ``1``.
max_gram
Maximum length of characters in a gram. Defaults to ``2``.
token_chars
Character classes to keep in the tokens. The tokenizer splits on characters
that don't belong to any of these classes. Defaults to ``[]`` (keep all
characters).
**Classes:** letter, digit, whitespace, punctuation, symbol
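For example, a sketch of a custom analyzer built around a trigram tokenizer
(the analyzer and tokenizer names are illustrative)::

    CREATE ANALYZER trigram_analyzer (
        TOKENIZER mytok WITH (
            type = 'ngram',
            min_gram = 3,
            max_gram = 3,
            token_chars = ['letter', 'digit']
        ),
        TOKEN_FILTERS (
            lowercase
        )
    );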
.. _edgengram-tokenizer:
Edge n-gram tokenizer
---------------------
``type='edge_ngram'``
The ``edge_ngram`` tokenizer is very similar to :ref:`ngram-tokenizer` but only
keeps n-grams which start at the beginning of a token.
.. rubric:: Parameters
min_gram
Minimum length of characters in a gram. Defaults to ``1``.
max_gram
Maximum length of characters in a gram. Defaults to ``2``.
token_chars
Character classes to keep in the tokens. The tokenizer splits on characters
that don't belong to any of these classes. Defaults to ``[]`` (keep all
characters).
**Classes:** letter, digit, whitespace, punctuation, symbol
.. _keyword-tokenizer:
Keyword tokenizer
-----------------
``type='keyword'``
The ``keyword`` tokenizer emits the entire input as a single token.
.. rubric:: Parameters
buffer_size
The term buffer size. Defaults to ``256``.
.. _pattern-tokenizer:
Pattern tokenizer
-----------------
``type='pattern'``
The ``pattern`` tokenizer separates text into terms via a :ref:`regular
expression <gloss-regular-expression>`.
.. rubric:: Parameters
pattern
The regular expression pattern. Defaults to ``\W+``.
flags
The regular expression flags.
group
Which group to extract into tokens. Defaults to -1 (split).
.. NOTE::
The regular expression should match the token separators, not the tokens
themselves.
Flags should be pipe-separated, e.g. ``CASE_INSENSITIVE|COMMENTS``. See the
`Java Pattern API`_ for more details about flag options.
.. _simple_pattern-tokenizer:
Simple pattern tokenizer
------------------------
``type='simple_pattern'``
Similar to the ``pattern`` tokenizer, this tokenizer uses a :ref:`regular
expression <gloss-regular-expression>` to produce terms, but it captures the
matching text itself as the terms and supports only a limited, more restrictive
subset of expressions. It is generally faster than the ``pattern`` tokenizer,
but it does not support splitting the input on a pattern match.
.. rubric:: Parameters
pattern
A `Lucene regular expression`_, defaults to empty string.
.. _simple_pattern_split-tokenizer:
Simple pattern split tokenizer
------------------------------
``type='simple_pattern_split'``
The ``simple_pattern_split`` tokenizer operates with the same restricted subset
of :ref:`regular expressions <gloss-regular-expression>` as the
``simple_pattern`` tokenizer, but it splits the input at pattern matches rather
than returning the matches as terms.
.. rubric:: Parameters
pattern
A `Lucene regular expression`_, defaults to empty string.
.. _pathhierarchy-tokenizer:
Path hierarchy tokenizer
------------------------
``type='path_hierarchy'``
Takes something like this::
/something/something/else
And produces tokens::
/something
/something/something
/something/something/else
.. rubric:: Parameters
delimiter
The character delimiter to use. Defaults to ``/``.
replacement
An optional replacement character to use. Defaults to the delimiter.
buffer_size
The buffer size to use. Defaults to ``1024``.
reverse
If ``true``, generates tokens in reverse order. Defaults to ``false``.
skip
The number of initial tokens to skip. Defaults to ``0``.
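A sketch of a custom analyzer using this tokenizer to index file system paths
(the analyzer and tokenizer names are illustrative)::

    CREATE ANALYZER path_analyzer (
        TOKENIZER path_tok WITH (
            type = 'path_hierarchy',
            delimiter = '/'
        )
    );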
.. _analyzers_char_group:
Char group tokenizer
--------------------
``type='char_group'``
Breaks text into terms whenever it encounters a character that is part of a
predefined set.
.. rubric:: Parameters
tokenize_on_chars
A list containing characters to tokenize on.
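For example, a sketch of a tokenizer splitting on whitespace and punctuation
characters (the names and the character set are illustrative)::

    CREATE ANALYZER char_group_analyzer (
        TOKENIZER mytok WITH (
            type = 'char_group',
            tokenize_on_chars = ['whitespace', ',', ';']
        )
    );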
.. _builtin-token-filter:
Built-in token filters
======================
.. _standard-tokenfilter:
``standard``
------------
``type='standard'``
Normalizes tokens extracted with the :ref:`standard-tokenizer` tokenizer.
.. _classic-tokenfilter:
``classic``
-----------
``type='classic'``
Does optional post-processing of terms that are generated by the classic
tokenizer. It removes the English possessive from the end of words, and it
removes dots from acronyms.
.. _apostrophe-tokenfilter:
``apostrophe``
--------------
``type='apostrophe'``
Strips all characters after an apostrophe, and the apostrophe itself.
.. _asciifolding-tokenfilter:
``asciifolding``
----------------
``type='asciifolding'``
Converts alphabetic, numeric, and symbolic Unicode characters which are not in
the first 127 ASCII characters (the "Basic Latin" Unicode block) into their
ASCII equivalents, if one exists.
.. _length-tokenfilter:
``length``
----------
``type='length'``
Removes words that are too long or too short for the stream.
.. rubric:: Parameters
min
The minimum token length. Defaults to ``0``.
max
The maximum token length. Defaults to ``Integer.MAX_VALUE``.
.. _lowercase-tokenfilter:
``lowercase``
-------------
``type='lowercase'``
Normalizes token text to lower case.
.. rubric:: Parameters
language
For options, see :ref:`language-analyzer` analyzer.
.. _ngram-tokenfilter:
``ngram``
---------
``type='ngram'``
.. rubric:: Parameters
min_gram
Defaults to 1.
max_gram
Defaults to 2.
.. _edgengram-tokenfilter:
``edge_ngram``
--------------
``type='edge_ngram'``
.. rubric:: Parameters
min_gram
Defaults to 1.
max_gram
Defaults to 2.
side
Either front or back. Defaults to front.
.. _porterstem-tokenfilter:
``porter_stem``
---------------
``type='porter_stem'``
Transforms the token stream as per the Porter stemming algorithm.
.. NOTE::
The input to the stemming filter must already be in lower case, so you will
need to use the :ref:`lowercase-tokenfilter` filter or the
:ref:`lowercase-tokenizer` tokenizer earlier in the analyzer chain for this to
work properly. For example, when using a custom analyzer, make sure the
``lowercase`` filter comes before the ``porter_stem`` filter in the list of
filters, as sketched below.
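A sketch of a custom analyzer that respects this ordering (the analyzer name is
illustrative)::

    CREATE ANALYZER stemming_analyzer (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,      -- must run before porter_stem
            porter_stem
        )
    );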
.. _shingle-tokenfilter:
``shingle``
-----------
``type='shingle'``
Constructs shingles (token n-grams) from a token stream, i.e. combinations of
adjacent tokens emitted as single tokens.
.. rubric:: Parameters
max_shingle_size
The maximum shingle size. Defaults to 2.
min_shingle_size
The minimum shingle size. Defaults to 2.
output_unigrams
If true the output will contain the input tokens (unigrams) as well as the
shingles. Defaults to true.
output_unigrams_if_no_shingles
If ``output_unigrams`` is ``false``, the output will contain the input tokens
(unigrams) if no shingles are available. If ``output_unigrams`` is ``true``,
this setting has no effect. Defaults to ``false``.
token_separator
The string to use when joining adjacent tokens to form a shingle. Defaults
to " ".
.. _stop-tokenfilter:
``stop``
--------
``type='stop'``
Removes stop words from token streams.
.. rubric:: Parameters
stopwords
A list of stop words to use. Defaults to the English stop words.
stopwords_path
A path (either relative to configuration location, or absolute) to a
stopwords file configuration. Each stop word should be on its own line
(separated by a line break). The file must be UTF-8 encoded.
ignore_case
Set to true to lower case all words first. Defaults to false.
remove_trailing
Set to ``false`` in order to not ignore the last term of a search if it is a
stop word. Defaults to ``true``.
.. _worddelimiter-tokenfilter:
``word_delimiter``
------------------
``type='word_delimiter'``
Splits words into subwords and performs optional transformations on subword
groups.
.. rubric:: Parameters
generate_word_parts
If true causes parts of words to be generated: "PowerShot" ⇒ "Power"
"Shot". Defaults to true.
generate_number_parts
If true causes number subwords to be generated: "500-42" ⇒ "500" "42".
Defaults to true.
catenate_words
If true causes maximum runs of word parts to be catenated: ``wi-fi`` ⇒
``wifi``. Defaults to false.
catenate_numbers
If true causes maximum runs of number parts to be catenated: "500-42" ⇒
"50042". Defaults to false.
catenate_all
If true causes all subword parts to be catenated: "wi-fi-4000" ⇒
"wifi4000". Defaults to false.
split_on_case_change
If true, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts
regardless). Defaults to true.
preserve_original
If true includes original words in subwords: "500-42" ⇒ "500-42" "500"
"42". Defaults to false.
split_on_numerics
If true causes ``j2se`` to be three tokens; ``j`` ``2`` ``se``. Defaults to true.
stem_english_possessive
If true causes trailing "'s" to be removed for each subword: "O'Neil's" ⇒
"O", "Neil". Defaults to true.
protected_words
A list of words protected from being split.
protected_words_path
A relative or absolute path to a file configured with protected words (one
per line). If relative, it is automatically resolved against the ``config/``
location if it exists.
type_table
A custom type mapping table.
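For example, a sketch of a filter configuration that additionally emits
catenated and original tokens (the analyzer and filter names are illustrative)::

    CREATE ANALYZER product_code_analyzer (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            mydelimiter WITH (
                type = 'word_delimiter',
                catenate_words = true,
                preserve_original = true
            )
        )
    );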
.. _stemmer-tokenfilter:
``stemmer``
-----------
``type='stemmer'``
A filter that stems words (similar to :ref:`snowball-tokenfilter`, but with
more options).
.. rubric:: Parameters
.. vale off
language/name
arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish,
dutch, english, finnish, french, german, german2, greek, hungarian,
italian, kp, kstem, lovins, latvian, norwegian, minimal_norwegian, porter,
portuguese, romanian, russian, spanish, swedish, turkish, minimal_english,
possessive_english, light_finnish, light_french, minimal_french,
light_german, minimal_german, hindi, light_hungarian, indonesian,
light_italian, light_portuguese, minimal_portuguese, portuguese,
light_russian, light_spanish, light_swedish.
.. vale on
.. _keywordmarker-tokenfilter:
``keyword_marker``
------------------
``type='keyword_marker'``
Protects words from being modified by stemmers. Must be placed before any
stemming filters.
.. rubric:: Parameters
keywords
A list of words to use.
keywords_path
A path (either relative to configuration location, or absolute) to a list
of words.
ignore_case
Set to true to lower case all words first. Defaults to false.
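A sketch placing ``keyword_marker`` before a stemmer, so that the listed words
pass through unstemmed (the names and the word list are illustrative)::

    CREATE ANALYZER protect_keywords (
        TOKENIZER standard,
        TOKEN_FILTERS (
            lowercase,
            mymarker WITH (
                type = 'keyword_marker',
                keywords = ['crate', 'cratedb']
            ),
            porter_stem
        )
    );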
.. _kstem-tokenfilter:
``kstem``
---------
``type='kstem'``
High-performance stemming filter for English.
All terms must already be lowercased (use :ref:`lowercase-tokenfilter` filter)
for this filter to work correctly.
.. _snowball-tokenfilter:
``snowball``
------------
``type='snowball'``
A filter that stems words using a Snowball-generated stemmer.
.. rubric:: Parameters
.. vale off
language
Possible values: Armenian, Basque, Catalan, Danish, Dutch, English,
Finnish, French, German, German2, Hungarian, Italian, Kp, Lovins,
Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish,
Turkish.
.. vale on
.. _synonym-tokenfilter:
``synonym``
-----------
``type='synonym'``
Handles synonyms during the analysis process. Synonyms are configured using a
file in the Solr/WordNet synonym format.
.. rubric:: Parameters
synonyms_path
Path to synonyms configuration file, relative to the configuration
directory.
ignore_case
Defaults to ``false``
expand
Defaults to ``true``
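A sketch of a custom analyzer using a synonym file (the file name is an
assumption; the file must exist relative to the configuration directory)::

    CREATE ANALYZER synonym_analyzer (
        TOKENIZER whitespace,
        TOKEN_FILTERS (
            lowercase,
            mysynonyms WITH (
                type = 'synonym',
                synonyms_path = 'synonyms.txt'
            )
        )
    );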
.. _compoundword-tokenfilter:
``*_decompounder``
------------------
``type='dictionary_decompounder'`` or ``type='hyphenation_decompounder'``
Decomposes compound words.
.. rubric:: Parameters
word_list
A list of words to use.
word_list_path
A path (either relative to configuration location, or absolute) to a list
of words.
min_word_size
Minimum word size (integer). Defaults to ``5``.
min_subword_size
Minimum subword size (integer). Defaults to ``2``.
max_subword_size
Maximum subword size (integer). Defaults to ``15``.
only_longest_match
Whether to only keep the longest match (boolean). Defaults to ``false``.
.. _reverse-tokenfilter:
``reverse``
-----------
``type='reverse'``
Reverses each token.
.. _elision-tokenfilter:
``elision``
-----------
``type='elision'``
Removes elisions.
.. rubric:: Parameters
articles
A set of article stop words, for example ``['j', 'l']`` for content like
``J'aime l'odeur.``
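For example, a sketch of a French-oriented analyzer removing elisions before
lowercasing (the names and the article list are illustrative)::

    CREATE ANALYZER french_elision (
        TOKENIZER standard,
        TOKEN_FILTERS (
            myelision WITH (
                type = 'elision',
                articles = ['j', 'l', 'd']
            ),
            lowercase
        )
    );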
.. _truncate-tokenfilter:
``truncate``
------------
``type='truncate'``
Truncates tokens to a specific length.
.. rubric:: Parameters
length
The number of characters to truncate tokens to. Defaults to ``10``.
.. _unique-tokenfilter:
``unique``
----------
``type='unique'``
Only indexes unique tokens during analysis. By default, it is applied to the
whole token stream.
.. rubric:: Parameters
only_on_same_position
If set to true, it will only remove duplicate tokens on the same position.
.. _patterncapture-tokenfilter:
``pattern_capture``
-------------------
``type='pattern_capture'``
Emits a token for every capture group in the :ref:`regular expression
<gloss-regular-expression>`.
.. rubric:: Parameters
preserve_original
If set to ``true`` (the default), the original token is also emitted.
.. _patternreplace-tokenfilter:
``pattern_replace``
-------------------