# Character filters

Character filters preprocess the stream of characters before it is passed to the [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html).

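A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters. For example, a character filter can convert Hindu-Arabic numerals `(٠١٢٣٤٥٦٧٨٩)` into their Arabic-Latin equivalents `(0123456789)`, or strip HTML elements such as `<b>` from the stream.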

Elasticsearch ships with a number of built-in character filters that can be used to build [custom analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html).


## 1. Built-in character filters

### 1.1. HTML Strip Char Filter

The `html_strip` character filter strips HTML elements from the text and replaces HTML entities with their decoded values (for example, `&amp;` becomes `&`).

- Using the filter

In [None]:
text='{
    "tokenizer": "keyword",
    "char_filter": [
        "html_strip"
    ],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configuring it on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "html_strip_analyzer": {
                    "tokenizer": "keyword",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "html_strip_analyzer",
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';
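
The filter also accepts an optional `escaped_tags` parameter: an array of element names, written without angle brackets, that it should leave in place. The request below is a minimal sketch that keeps `<b>` while stripping the other tags. It assumes the same local node at `localhost:9200` as the cells above; the `--max-time` option and the fallback message after `||` are additions for illustration only.

```shell
# Inline html_strip definition that leaves <b> elements untouched.
text='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "html_strip",
            "escaped_tags": ["b"]
        }
    ],
    "text": "<p>I&apos;m so <b>happy</b>!</p>"
}';

# Assumes Elasticsearch is listening on localhost:9200.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$text" \
     || echo "request failed: is Elasticsearch running on localhost:9200?";
```

With `b` escaped, the single token returned should still contain `<b>happy</b>`, while the `<p>` tags are stripped and `&apos;` is decoded.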

### 1.2. Mapping Char Filter

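The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that matches a key, it replaces that string with the value associated with the key.

Matching is greedy: at any given point, the longest matching key wins. The replacement value may also be an empty string.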

- Using the filter

In [None]:
text='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "mapping",
            "mappings": [
                "٠ => 0",
                "١ => 1",
                "٢ => 2",
                "٣ => 3",
                "٤ => 4",
                "٥ => 5",
                "٦ => 6",
                "٧ => 7",
                "٨ => 8",
                "٩ => 9"
            ]
        }
    ],
    "text": "My license plate is ٢٥٠١٥"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";
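
To see the longest-match rule at work, the sketch below defines two overlapping keys, `a` and `ab`. The keys and the sample text are made up for illustration, and the request again assumes a local node at `localhost:9200`.

```shell
# Two overlapping keys: "a" and the longer "ab".
text='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "mapping",
            "mappings": [
                "a => 1",
                "ab => 2"
            ]
        }
    ],
    "text": "ab ac"
}';

# Assumes Elasticsearch is listening on localhost:9200.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$text" \
     || echo "request failed: is Elasticsearch running on localhost:9200?";
```

Under the longest-match rule, `ab` is expected to become `2` (not `1b`), while the `a` in `ac` falls back to the shorter key, so the token should read `2 1c`.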

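- Parameters:
    - `mappings`: an array of mappings, each of the form `key => value`;
    - `mappings_path`: path to a UTF-8 encoded text file that contains one `key => value` mapping per line. The path is either absolute or relative to the config directory.

> Either `mappings` or `mappings_path` must be provided.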
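
As a rough sketch of the `mappings_path` variant, the snippet below writes the digit mappings to a file and prints a character filter definition that points at it. The file name `analysis/digits.txt` is hypothetical, and a temporary directory stands in for the node's real config directory.

```shell
# Stand-in for the Elasticsearch config directory (hypothetical location).
conf_dir="$(mktemp -d)"
mkdir -p "$conf_dir/analysis"

# One "key => value" mapping per line, saved as UTF-8.
cat > "$conf_dir/analysis/digits.txt" <<'EOF'
٠ => 0
١ => 1
٢ => 2
٣ => 3
٤ => 4
٥ => 5
٦ => 6
٧ => 7
٨ => 8
٩ => 9
EOF

# Character filter definition that references the file
# with a path relative to the config directory.
char_filter='{
    "type": "mapping",
    "mappings_path": "analysis/digits.txt"
}'
echo "$char_filter"
```

This definition would take the place of the inline `mappings` array inside the `char_filter` section of the index settings.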

- Configuring it on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "mapping_filter_analyzer": {
                    "tokenizer": "keyword",
                    "char_filter": [
                        "mapping_filter"
                    ]
                }
            },
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings": [
                        "٠ => 0",
                        "١ => 1",
                        "٢ => 2",
                        "٣ => 3",
                        "٤ => 4",
                        "٥ => 5",
                        "٦ => 6",
                        "٧ => 7",
                        "٨ => 8",
                        "٩ => 9"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "mapping_filter_analyzer",
    "text": "My license plate is ٢٥٠١٥"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.3. Pattern Replace Char Filter

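The `pattern_replace` character filter uses a regular expression to match characters that should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.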

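Quantifiers such as `+` are greedy by default, so each match extends as far as the pattern allows. The replacement may also be an empty string, which removes the matched text.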

- Using the filter

In [None]:
text='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "pattern_replace",
            "pattern": "(\\d+)-(?=\\d)",
            "replacement": "$1_"
        }
    ],
    "text": "My credit card is 123-456-789"
}';

# Pass the body quoted: an unquoted `echo $text` could glob-expand the `?` in the pattern.
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$text";

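- Parameters:
    - `pattern`: (Required) a Java regular expression;
    - `replacement`: the replacement string, which can reference capture groups with the placeholders `$1`..`$9`, as explained in the [`Matcher.appendReplacement` documentation](http://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-);
    - `flags`: Java regular expression flags, separated by `|`, for example `"CASE_INSENSITIVE|COMMENTS"`.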

> A replacement string that changes the length of the original text works for search purposes, but it results in incorrect highlighting.
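
The `flags` parameter can be tried out with a request like the following sketch, which makes the pattern case-insensitive. The pattern `foo`, the replacement `bar`, and the sample text are made up for illustration, and the request assumes a local node at `localhost:9200`.

```shell
# Case-insensitive replacement of "foo" with "bar".
text='{
    "tokenizer": "keyword",
    "char_filter": [
        {
            "type": "pattern_replace",
            "pattern": "foo",
            "flags": "CASE_INSENSITIVE",
            "replacement": "bar"
        }
    ],
    "text": "Foo FOO foo"
}';

# Assumes Elasticsearch is listening on localhost:9200.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$text" \
     || echo "request failed: is Elasticsearch running on localhost:9200?";
```

With the flag set, all three spellings should be rewritten, giving the token `bar bar bar`. Without it, only the lowercase `foo` would match.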

- Configuring it on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "pattern_replace_filter_analyzer": {
                    "tokenizer": "keyword",
                    "char_filter": [
                        "pattern_replace_filter"
                    ]
                }
            },
            "char_filter": {
                "pattern_replace_filter": {
                    "type": "pattern_replace",
                    "pattern": "(\\d+)-(?=\\d)",
                    "replacement": "$1_"
                }
            }
        }
    }
}';
# Pass the body quoted: an unquoted `echo $settings` could glob-expand the `?` in the pattern.
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$settings";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "pattern_replace_filter_analyzer",
    "text": "My credit card is 123-456-789"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';