# Character filters

Character filters preprocess the stream of characters before it is passed to the tokenizer ([Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html)).

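A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters. For example, a character filter could convert Hindu-Arabic numerals `(٠‎١٢٣٤٥٦٧٨‎٩‎)` into their Arabic-Latin equivalents `(0123456789)`, or strip HTML elements such as `<b>` from the stream.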
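
As a rough illustration of the numeral conversion described above, the sketch below sends an `_analyze` request that defines a `mapping` character filter inline. It assumes a node is listening at `http://localhost:9200`; the `ES_URL` and `digits_request` names and the sample sentence are illustrative choices, not part of the original text.

```shell
# Assumption: Elasticsearch is reachable here; export ES_URL to point elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# Inline `mapping` character filter: each entry rewrites one Hindu-Arabic digit
# to its Latin equivalent before the tokenizer sees the text.
digits_request='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "mapping",
            "mappings": [
                "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
                "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
            ]
        }
    ],
    "text": "My license plate is ٢٥٠١٥"
}'

# Send the request; print a readable hint if no node answers.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
     -X POST "$ES_URL/_analyze?pretty" -d "$digits_request" \
    || echo "request failed: is Elasticsearch running at $ES_URL?" >&2
```

Because the `keyword` tokenizer emits the whole input as one token, the response against a running node should contain a single token in which the digits read `25015`.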

Elasticsearch ships with a number of built-in character filters that can be used to build custom analyzers ([Custom analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html)).


## 1. Built-in character filters

### 1.1. HTML Strip Char Filter

The `html_strip` filter strips HTML elements from the text and replaces HTML entities with their decoded values (for example, it replaces `&amp;` with `&`).

- Using the filter directly in an `_analyze` request

In [None]:
text='{
    "tokenizer": "keyword",
    "char_filter": [
        "html_strip"
    ],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}';

# Pass "$text" quoted: an unquoted $(echo $text) is subject to word splitting and globbing.
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$text";

- Configuring the filter in an index

In [None]:
# Create an index whose custom analyzer applies html_strip
echo -e "* create index:";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "html_strip_analyzer": {
                    "tokenizer": "keyword",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$settings";

# Test the analyzer defined on the index
echo -e "\n* analyze with the index analyzer:";
text='{
    "analyzer": "html_strip_analyzer",
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$text";

# Delete the index
echo -e "\n* delete index:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';
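
The `html_strip` filter can also be defined with parameters instead of being referenced by name. The sketch below, which again assumes a node at `http://localhost:9200`, uses the `escaped_tags` parameter so that `<b>` elements are left in place while the other elements are still stripped. `ES_URL` and `escaped_request` are illustrative names.

```shell
# Assumption: Elasticsearch is reachable here; export ES_URL to point elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# Inline html_strip definition; `escaped_tags` lists element names
# (without angle brackets) that the filter should leave untouched.
escaped_request='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "html_strip",
            "escaped_tags": ["b"]
        }
    ],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}'

# Send the request; print a readable hint if no node answers.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
     -X POST "$ES_URL/_analyze?pretty" -d "$escaped_request" \
    || echo "request failed: is Elasticsearch running at $ES_URL?" >&2
```

Against a running node, the returned token should keep `<b>happy</b>` intact, with the `<p>` element removed and `&apos;` decoded to an apostrophe.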