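# Token filters

Token filters accept a stream of tokens from a tokenizer and can:
- modify tokens (for example, lowercasing them);
- delete tokens (for example, removing stop words);
- add tokens (for example, synonyms).

Elasticsearch has a number of built-in token filters that you can use to build custom analyzers.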

## 1. Built-in token filters

### 1.1. Apostrophe token filter

Strips all characters after an apostrophe, including the apostrophe itself.

This filter is included in Elasticsearch's built-in [Turkish language analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#turkish-analyzer). It uses Lucene's [ApostropheFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html), which was built for the Turkish language.
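The truncation rule can be tried locally without a running cluster. The sketch below is only an approximation: the function name `apostrophe_filter` is hypothetical, whitespace splitting stands in for the `standard` tokenizer, and only the ASCII apostrophe is handled.

```shell
# Local emulation of the apostrophe rule (assumption: tokens are separated by spaces).
apostrophe_filter() {
    printf '%s\n' "$1" |
        tr -s ' ' '\n' |        # one token per line
        sed "s/'.*//" |         # drop the first apostrophe and everything after it
        paste -sd' ' -          # join the tokens back with single spaces
}

apostrophe_filter "Istanbul'a veya Istanbul'dan"
```

With the sample text used in this section, the function prints `Istanbul veya Istanbul`.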

- Filter

In [None]:
text=$'{
    "tokenizer": "standard",
    "filter": [
        "apostrophe"
    ],
    "text": "Istanbul\'a veya Istanbul\'dan"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_with_apostrophe_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "apostrophe"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "standard_with_apostrophe_analyzer",
    "text": "Istanbul\'a veya Istanbul\'dan"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.2. ASCII folding token filter

Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. For example, the filter changes `à` to `a`.<br>
This filter uses Lucene's [ASCIIFoldingFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html).

- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        "asciifolding"
    ],
    "text": "açaí à la carte"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_with_asciifolding_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "asciifolding"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "standard_with_asciifolding_analyzer",
    "text": "açaí à la carte"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.3. CJK bigram token filter

Forms bigrams out of CJK (Chinese, Japanese, and Korean) tokens.<br>
This filter is included in Elasticsearch's built-in [CJK language analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#cjk-analyzer). It uses Lucene's [CJKBigramFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html).

- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        "cjk_bigram"
    ],
    "text": "東京都は、日本の首都であり"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_with_cjk_bigram_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "cjk_bigram"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "standard_with_cjk_bigram_analyzer",
    "text": "東京都は、日本の首都であり"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

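### 1.4. CJK width token filter

Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters as follows:

- folds full-width ASCII character variants into the equivalent Basic Latin characters;
- folds half-width Katakana character variants into the equivalent Kana characters.

This filter is included in Elasticsearch's built-in [CJK language analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#cjk-analyzer). It uses Lucene's [CJKWidthFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html).

- Filter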

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        "cjk_width"
    ],
    "text": "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_with_cjk_width_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "cjk_width"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "standard_with_cjk_width_analyzer",
    "text": "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.5. Classic token filter

Performs optional post-processing of terms generated by the [`classic` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-classic-tokenizer.html).<br>
This filter removes the English possessive (`'s`) from the end of words and removes the dots from acronyms. It uses Lucene's [ClassicFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html).
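The two rewrites described above can be sketched locally as follows. This is an approximation under stated assumptions: `classic_filter` is a hypothetical helper, tokens are split on spaces, and acronyms are guessed from a "letter followed by a dot" pattern, whereas the real filter relies on the token types assigned by the `classic` tokenizer.

```shell
# Local emulation of the two classic-filter rewrites (not the Lucene implementation).
classic_filter() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | awk -v q="'" '
        {
            sub(q "[sS]$", "")                          # strip a trailing possessive
            if ($0 ~ /^([A-Za-z]\.)+$/) gsub(/\./, "")  # acronym: remove the dots
            print
        }' | paste -sd' ' -
}

classic_filter "Q.U.I.C.K. dog's bone"
```

For this input the function prints `QUICK dog bone`.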

- Filter

In [None]:
text=$'{
    "tokenizer": "classic",
    "filter": [
        "classic"
    ],
    "text": "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_classic_analyzer": {
                    "tokenizer": "classic",
                    "filter": [
                        "classic"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "standard_classic_analyzer",
    "text": "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.6. Common grams token filter

Generates [bigrams](https://en.wikipedia.org/wiki/Bigram) for a specified set of common words. For example, you can specify `is` and `the` as common words. The filter then converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick, fox, fox_is, is, is_brown, brown]`.

You can use the `common_grams` filter in place of the `stop` filter when you don't want to completely ignore common words.

This filter uses Lucene's [CommonGramsFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html).
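To make the bigram rule concrete, here is a minimal local sketch of the default (non-query) mode. The helper name `common_grams` and its two arguments are assumptions for illustration; matching is case-sensitive and tokens are split on whitespace.

```shell
# Local emulation of common_grams in its default mode (not the Lucene implementation).
common_grams() {
    # $1: input text, $2: space-separated list of common words
    printf '%s\n' "$1" | awk -v words="$2" '
        BEGIN {
            n = split(words, w, " ")
            for (i = 1; i <= n; i++) common[w[i]] = 1
        }
        {
            out = ""; sep = ""
            for (i = 1; i <= NF; i++) {
                out = out sep $i; sep = " "
                # A bigram is emitted when either neighbour is a common word.
                if (i < NF && (($i in common) || ($(i + 1) in common)))
                    out = out " " $i "_" $(i + 1)
            }
            print out
        }'
}

common_grams "the quick fox is brown" "is the"
```

The call prints `the the_quick quick fox fox_is is is_brown brown`, matching the token list given above.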

- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "common_grams",
            "common_words": [
                "is",
                "the"
            ]
        }
    ],
    "text": "the quick fox is brown"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

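- Parameters
    - `common_words`: (Required*, array of strings) A list of tokens. The filter generates bigrams for these tokens. Either this or the `common_words_path` parameter is required.
    - `common_words_path`: (Required*, string) Path to a file containing a list of tokens. The filter generates bigrams for these tokens. The path must be absolute or relative to the config location. The file must be UTF-8 encoded, and each token in the file must be separated by a line break. Either this or the `common_words` parameter is required.
    - `ignore_case`: (Optional, boolean) If `true`, matching of common words is case-insensitive. Defaults to `false`.
    - `query_mode`: (Optional, boolean) If `true`, the filter excludes the following tokens from the output:
        - unigrams for common words;
        - unigrams for terms followed by common words.

      Defaults to `false`. We recommend enabling this parameter for search analyzers.

      For example, you can enable this parameter and specify `is` and `the` as common words. The filter then converts the tokens `[the, quick, fox, is, brown]` to `[the_quick, quick, fox_is, is_brown]`.

- Apply to an index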

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "common_grams_filter_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "common_grams_filter"
                    ]
                }
            },
            "filter": {
                "common_grams_filter": {
                    "type": "common_grams",
                    "common_words": [
                        "a", 
                        "is",
                        "the"
                    ],
                    "ignore_case": true,
                    "query_mode": true
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "common_grams_filter_analyzer",
    "text": "the quick fox is brown"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.7. Conditional token filter

Applies a set of token filters to tokens that match the conditions in a provided predicate script. For example, you can apply the `lowercase` filter to tokens that match `token.getTerm().length() < 5`.

This filter uses Lucene's [ConditionalTokenFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html).
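The same predicate can be mimicked locally to see which tokens it selects. In this sketch the function name `condition_lowercase` is hypothetical, the condition `length < 5` is hard-coded rather than read from a Painless script, and tokens are split on whitespace.

```shell
# Local emulation: lowercase only the tokens shorter than 5 characters.
condition_lowercase() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) < 5) $i = tolower($i)   # the predicate decides per token
        print
    }'
}

condition_lowercase "THE QUICK BROWN FOX"
```

The call prints `the QUICK BROWN fox`: only the three-letter tokens satisfy the predicate.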

- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "condition",
            "filter": [
                "lowercase"
            ],
            "script": {
                "source": "token.getTerm().length() < 5"
            }
        }
    ],
    "text": "THE QUICK BROWN FOX"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

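- Parameters
    - `filter`: (Required, array of token filters) Array of token filters. If a token matches the predicate script in the `script` parameter, these filters are applied to the token in the order provided. These filters can include custom token filters defined in the index mapping.
    - `script`: (Required, [script object](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html)) Painless script that expresses the condition. If a token matches this script, the filters in the `filter` parameter are applied to the token. For valid parameters, see [script parameters](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html#_script_parameters). Only inline scripts are supported. Painless scripts are executed in the analysis predicate context and require a `token` property.


- Apply to an index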

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "condition_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "condition_filter"
                    ]
                }
            },
            "filter": {
                "condition_filter": {
                    "type": "condition",
                    "filter": [
                        "lowercase"
                    ],
                    "script": {
                        "source": "token.getTerm().length() < 5"
                    }
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "condition_filter_analyzer",
    "text": "THE QUICK BROWN FOX"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.8. Decimal digit token filter

Converts all digits in the Unicode `Decimal_Number` General Category to `0`-`9`. For example, the filter changes the Bengali numeral `৩` to `3`.

This filter uses Lucene's [DecimalDigitFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/DecimalDigitFilter.html).

- Filter

In [None]:
text='{
    "tokenizer": "whitespace",
    "filter": [
        "decimal_digit"
    ],
    "text": "१-one two-२ ३"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "decimal_digit_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "decimal_digit"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "decimal_digit_analyzer",
    "text": "१-one two-२ ३"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

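### 1.9. Delimited payload token filter

> The older name `delimited_payload_filter` is deprecated and should not be used with new indices. Use `delimited_payload` instead.

Separates a token stream into tokens and payloads based on a specified delimiter. For example, you can use the `delimited_payload` filter with a `|` delimiter to split `the|1 quick|2 fox|3` into the tokens `the`, `quick`, and `fox`, with respective payloads of `1`, `2`, and `3`.

This filter uses Lucene's [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html).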
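The split itself is simple enough to reproduce locally. The sketch below is illustrative only: `split_payloads` is a hypothetical helper, it splits each whitespace-separated token at its first delimiter, and it prints the payload as plain text instead of encoding it as bytes.

```shell
# Local emulation: print "token<TAB>payload" for each whitespace-separated token.
split_payloads() {
    # $1: input text, $2: delimiter character (defaults to "|")
    printf '%s\n' "$1" | tr -s ' ' '\n' | awk -v d="${2:-|}" '{
        p = index($0, d)                 # position of the first delimiter, 0 if absent
        if (p > 0) print substr($0, 1, p - 1) "\t" substr($0, p + 1)
        else       print $0 "\t"         # no delimiter: token with an empty payload
    }'
}

split_payloads 'the|1 quick|2 fox|3'
```

Each output line holds one token and its payload, for example `the` and `1` separated by a tab.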

----

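#### Payloads

A payload is user-defined binary data associated with a token position and stored as base64-encoded bytes.<br>
Elasticsearch does not store token payloads by default. To store payloads, you must:
- set the `term_vector` mapping parameter to `with_positions_payloads` or `with_positions_offsets_payloads` for any field that stores payloads;
- use an index analyzer that includes the `delimited_payload` filter.


You can view stored payloads using the [term vectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html).

- Filter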

In [None]:
text='{
    "tokenizer": "whitespace",
    "filter": [
        "delimited_payload"
    ],
    "text": "the|0 brown|10 fox|5 is|0 quick|10"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "delimited_payload_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "delimited_payload"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "delimited_payload_analyzer",
    "text": "the|0 brown|10 fox|5 is|0 quick|10"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

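### 1.10. Dictionary decompounder token filter

> In most cases, we recommend using the faster `hyphenation_decompounder` token filter in place of this filter. However, you can use the `dictionary_decompounder` filter to check the quality of a word list before implementing it in the `hyphenation_decompounder` filter.

Uses a specified list of words and a brute force approach to find subwords in compound words. If found, these subwords are included in the token output.

This filter uses Lucene's [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html), which was built for Germanic languages.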
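The brute force approach amounts to testing every substring of the token against the word list. The following sketch shows that idea under simplifying assumptions: `decompound` is a hypothetical helper, matching is case-sensitive, the subword length bounds are fixed at 2 and 15, and `min_word_size` and `only_longest_match` are not modelled.

```shell
# Local emulation of the brute force subword search (not the Lucene implementation).
decompound() {
    # $1: a single token, $2: space-separated word list
    printf '%s\n' "$1" | awk -v words="$2" -v min=2 -v max=15 '
        BEGIN {
            n = split(words, w, " ")
            for (i = 1; i <= n; i++) dict[w[i]] = 1
        }
        {
            out = $0                                    # the original token is kept
            for (s = 1; s <= length($0); s++)           # every start offset
                for (l = min; l <= max && s + l - 1 <= length($0); l++) {
                    part = substr($0, s, l)             # candidate subword
                    if (part in dict) out = out " " part
                }
            print out
        }'
}

decompound "Donaudampfschiff" "Donau dampf meer schiff"
```

For the word list used in this section, the call prints `Donaudampfschiff Donau dampf schiff`; `meer` does not occur in the token and is therefore absent.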

- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "dictionary_decompounder",
            "word_list": [
                "Donau", 
                "dampf",
                "meer",
                "schiff"
            ]
        }
    ],
    "text": "Donaudampfschiff"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

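- Parameters
    - `word_list`: (Required*, array of strings) A list of subwords to look for. If found, the subword is included in the token output. Either this parameter or `word_list_path` must be specified.
    - `word_list_path`: (Required*, string) Path to a file that contains a list of subwords to look for. If found, the subword is included in the token output.
    - `max_subword_size`: (Optional, integer) Maximum subword character length. Longer subword tokens are excluded from the output. Defaults to `15`.
    - `min_subword_size`: (Optional, integer) Minimum subword character length. Shorter subword tokens are excluded from the output. Defaults to `2`.
    - `min_word_size`: (Optional, integer) Minimum word character length. Shorter word tokens are excluded from the output. Defaults to `5`.
    - `only_longest_match`: (Optional, boolean) If `true`, only the longest matching subword is included. Defaults to `false`.


- Apply to an index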

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "dictionary_decompounder_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "dictionary_decompounder_filter"
                    ]
                }
            },
            "filter": {
                "dictionary_decompounder_filter": {
                    "type": "dictionary_decompounder",
                    "word_list": [
                        "Donau", 
                        "dampf",
                        "meer",
                        "schiff"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "dictionary_decompounder_filter_analyzer",
    "text": "Donaudampfschiff"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.11. Edge n-gram token filter

Forms an n-gram of a specified length from the beginning of a token. For example, you can use the `edge_ngram` token filter to change `quick` to `qu`. When not customized, the filter creates 1-character edge n-grams by default.

This filter uses Lucene's [EdgeNGramTokenFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html).
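A quick way to see what the filter emits is to generate the prefixes by hand. This is a minimal sketch, assuming whitespace-separated tokens; `edge_ngrams` is a hypothetical helper whose second and third arguments play the role of `min_gram` and `max_gram`.

```shell
# Local emulation: emit the prefixes of each token from min_gram to max_gram characters.
edge_ngrams() {
    # $1: input text, $2: min_gram, $3: max_gram
    printf '%s\n' "$1" | awk -v min="$2" -v max="$3" '{
        out = ""; sep = ""
        for (i = 1; i <= NF; i++)
            for (n = min; n <= max && n <= length($i); n++) {
                out = out sep substr($i, 1, n)   # prefix of length n
                sep = " "
            }
        print out
    }'
}

edge_ngrams "the quick brown fox jumps" 1 2
```

With `min_gram` 1 and `max_gram` 2, the call prints `t th q qu b br f fo j ju`.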

> The `edge_ngram` filter is similar to the [ngram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html). However, `edge_ngram` only outputs n-grams that start at the beginning of a token. These edge n-grams are useful for [search-as-you-type](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html) queries.


- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 2
        }
    ],
    "text": "the quick brown fox jumps"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

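- Parameters
    - `max_gram`: (Optional, integer) Maximum character length of a gram. Defaults to `2` for custom token filters and to `1` for the built-in `edge_ngram` filter.
    - `min_gram`: (Optional, integer) Minimum character length of a gram. Defaults to `1`.
    - `side`: (Optional, string) Deprecated. Indicates whether to truncate tokens from the `front` or the `back`. Defaults to `front`.


- Apply to an index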

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "edge_ngram_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "edge_ngram_filter"
                    ]
                }
            },
            "filter": {
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 2
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "edge_ngram_filter_analyzer",
    "text": "the quick brown fox jumps"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.12. Elision token filter

Removes specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, you can use this filter to change `l'avion` to `avion`.

When not customized, the filter removes the following French elisions by default: `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, `puisqu'`

Customized versions of this filter are included in several of Elasticsearch's built-in language analyzers:

- [Catalan](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#catalan-analyzer)
- [French](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#french-analyzer)
- [Irish](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#irish-analyzer)
- [Italian](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#italian-analyzer)

This filter uses Lucene's [ElisionFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html).
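The removal rule can be approximated locally as shown below. The helper `elision_filter` is hypothetical, its second argument is an illustrative article list written without the apostrophe, matching is case-sensitive, and only the ASCII apostrophe is recognized.

```shell
# Local emulation: strip a leading article plus apostrophe from each token.
elision_filter() {
    # $1: input text, $2: space-separated articles, without the apostrophe
    printf '%s\n' "$1" | awk -v arts="$2" -v q="'" '
        BEGIN {
            n = split(arts, a, " ")
            for (i = 1; i <= n; i++) article[a[i]] = 1
        }
        {
            for (i = 1; i <= NF; i++) {
                p = index($i, q)                    # first apostrophe in the token
                if (p > 1 && (substr($i, 1, p - 1) in article))
                    $i = substr($i, p + 1)          # drop the article and the apostrophe
            }
            print
        }'
}

elision_filter "l'avion arrive" "l m t"
```

The call prints `avion arrive`. A token such as `x'y` is left untouched because `x` is not in the article list.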


- Filter

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        "elision"
    ],
    "text": "j’examine près du wharf"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

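- Parameters
    - `articles`: (Required*, array of strings) List of elisions to remove. To be removed, the elision must be at the beginning of a token and be immediately followed by an apostrophe. Both the elision and the apostrophe are removed. For custom `elision` filters, either this parameter or `articles_path` must be specified.
    - `articles_path`: (Required*, string) Path to a file that contains a list of elisions to remove. The path must be absolute or relative to the config location, the file must be UTF-8 encoded, and each elision in the file must be separated by a line break. For custom `elision` filters, either this parameter or `articles` must be specified.
    - `articles_case`: (Optional, boolean) If `true`, elision matching is case-insensitive. Defaults to `false`.


- Apply to an index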

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "elision_filter_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "elision"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "elision_filter_analyzer",
    "text": "j’examine près du wharf"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

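### 1.13. Fingerprint token filter

Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.

For example, this filter changes the `[ the, fox, was, very, very, quick ]` token stream as follows:

1. sorts the tokens alphabetically to `[ fox, quick, the, very, very, was ]`;
2. removes the duplicate `very` token;
3. concatenates the token stream into a single output token: `[ fox quick the very was ]`.

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text, as described in the [OpenRefine](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint) project.

This filter uses Lucene's [FingerprintFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html).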
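The sort, de-duplicate, and join steps map directly onto standard command-line tools. This is a sketch rather than the Lucene implementation: `fingerprint` is a hypothetical helper, byte-wise ordering (`LC_ALL=C`) is assumed for the sort, and the `max_output_size` limit is not modelled.

```shell
# Local emulation: sort the tokens, remove duplicates, and join them into one string.
fingerprint() {
    # $1: input text, $2: separator character (defaults to a space)
    printf '%s\n' "$1" |
        tr -s ' ' '\n' |            # one token per line
        LC_ALL=C sort -u |          # sort and drop duplicates
        paste -sd"${2:- }" -        # concatenate into a single output token
}

fingerprint "the fox was very very quick"
```

The call prints `fox quick the very was`. Passing `-` as the second argument joins the tokens with hyphens instead of spaces.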


- Filter

In [None]:
text='{
    "tokenizer": "whitespace",
    "filter": [
        "fingerprint"
    ],
    "text": "zebra jumps over resting resting dog"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

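- Parameters
    - `max_output_size`: (Optional, integer) Maximum character length of the output token, including whitespace. Defaults to `255`. Concatenated tokens longer than this produce no token output.
    - `separator`: (Optional, string) Character used to concatenate the token stream input. Defaults to a space.


- Apply to an index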

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "fingerprint_filter_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "fingerprint_filter"
                    ]
                }
            },
            "filter": {
                "fingerprint_filter": {
                    "type": "fingerprint",
                    "separator": "-"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "fingerprint_filter_analyzer",
    "text": "zebra jumps over resting resting dog"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

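### 1.14. Flatten graph token filter

Flattens a token graph produced by a graph token filter, such as [synonym_graph](https://www.elastic.co/guide/en/elasticsearch/reference/current/token-graphs.html) or [word_delimiter_graph](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html).

Flattening a token graph that contains multi-position tokens makes the graph suitable for indexing. Indexing does not support token graphs that contain multi-position tokens.

> Flattening graphs is a lossy process.<br>
> If possible, avoid using the `flatten_graph` filter. Instead, use graph token filters in search analyzers only. This eliminates the need for the `flatten_graph` filter.

Flattening turns the following token graph<br>
![Token Graph](./assets/token-graph-dns-synonym-ex.svg)

into<br>
![Token Graph](./assets/token-graph-dns-invalid-ex.svg)

- Filter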

In [None]:
text='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "synonym_graph",
            "synonyms": [
                "dns, domain name system"
            ]
        },
        "flatten_graph"
    ],
    "text": "domain name system is fragile"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Parameters: the `flatten_graph` filter has no configurable parameters.


- Apply to an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "synonym_graph_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "synonym_graph_filter",
                        "flatten_graph"
                    ]
                }
            },
            "filter": {
                "synonym_graph_filter": {
                    "type": "synonym_graph",
                    "synonyms": [
                        "dns, domain name system"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "synonym_graph_filter_analyzer",
    "text": "domain name system is fragile"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

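### 1.15. Hunspell token filter

A token filter that stems tokens based on a dictionary. The filter picks up Hunspell dictionaries from a dedicated directory on the filesystem (`<path.conf>/hunspell`). Each dictionary is expected to have its own directory, named after its associated locale (language), for example `zh_CN`. This dictionary directory is expected to hold a single `.aff` file and one or more `.dic` files, all of which are picked up automatically.<br>
For example, assuming the default Hunspell location is used, the following directory layout defines the `en_US` dictionary: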
```plain
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
```
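Each dictionary can be configured with one setting:
- `ignore_case`: if `true`, dictionary matching is case-insensitive (defaults to `false`).

This setting can be configured globally in the `elasticsearch.yml` configuration file:
- `indices.analysis.hunspell.dictionary.ignore_case`

or for a specific dictionary:
- `indices.analysis.hunspell.dictionary.en_US.ignore_case`

It can also be placed in a `settings.yml` file in the dictionary directory, which overrides any setting defined in `elasticsearch.yml`.


- Filter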

In [None]:
text=$'{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "hunspell",
            "locale": "en_US",
            "dedup": true
        }
    ],
    "text": "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

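- Parameters
    - `locale`: The locale for this filter. If this is unset, the `lang` or `language` parameter is used instead, so at least one of these three must be set.
    - `dictionary`: The name of a dictionary. The path to the Hunspell dictionaries should be configured beforehand via `indices.analysis.hunspell.dictionary.location`.
    - `dedup`: Set to `true` if only unique terms should be returned. Defaults to `true`.
    - `longest_only`: Set to `true` if only the longest term should be returned. Defaults to `false`, which returns all possible stems.

> In contrast to the snowball stemmers (which are algorithm based), this is a dictionary-lookup-based stemmer, so the quality of the stemming is determined by the quality of the dictionary.

- Apply to an index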

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "hunspell_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "hunspell_filter"
                    ]
                }
            },
            "filter": {
                "hunspell_filter": {
                    "type": "hunspell",
                    "locale": "en_US",
                    "dedup": true
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "hunspell_filter_analyzer",
    "text": "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';