# Text analyzers

## 1. Built-in analyzers

### 1.1. Standard analyzer (`standard`)

- Analyze text

In [None]:
text=$'{
    "analyzer": "standard",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "standard_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "standard_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';
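
To see what `max_token_length` and `stopwords` do without a running cluster, the following is a rough local approximation in plain shell. It is not Elasticsearch's tokenizer: the function name `standard_like_tokens`, the ASCII-only splitting rule, and the one-word stop list are assumptions made for illustration (the real `_english_` list is longer).

```shell
# Rough local approximation of the `standard_analyzer` configured above.
# Assumptions: ASCII input only; the stop list is cut down to "the".
standard_like_tokens() {
    printf '%s\n' "$1" |
        LC_ALL=C tr -cs "[:alnum:]'" '\n' |     # split at anything except letters, digits, apostrophes
        LC_ALL=C tr '[:upper:]' '[:lower:]' |   # lowercase every token
        fold -w 5 |                             # max_token_length 5: cut longer tokens every 5 characters
        grep -vx -e 'the'                       # drop stop words (abbreviated list)
}

standard_like_tokens "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
```

The sketch prints one token per line. Note how `jumped` is cut into `jumpe` and `d`, and both occurrences of `the` disappear.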

### 1.2. Simple analyzer (`simple`)

Splits text into tokens at every non-letter character and lowercases each token.
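
The rule can be imitated locally with `tr`. This is only a sketch for ASCII text, not the analyzer's actual implementation; `simple_like_tokens` is a hypothetical helper name.

```shell
# Local imitation of the `simple` analyzer for ASCII input (assumption: no Unicode letters).
simple_like_tokens() {
    printf '%s\n' "$1" |
        LC_ALL=C tr -cs '[:alpha:]' '\n' |      # every run of non-letters becomes one token boundary
        LC_ALL=C tr '[:upper:]' '[:lower:]'     # lowercase each token
}

simple_like_tokens "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
```

The digit `2` is dropped and `dog's` becomes the two tokens `dog` and `s`, because digits and apostrophes are not letters.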

- Analyze text

In [None]:
text=$'{
    "analyzer": "simple",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "simple_analyzer": {
                    "type": "simple"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "simple_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.3. Whitespace analyzer (`whitespace`)

Splits text into tokens at whitespace characters; case and punctuation are left unchanged.
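
A minimal local sketch of the same rule, assuming ASCII whitespace only; `whitespace_like_tokens` is a hypothetical helper name and not part of Elasticsearch.

```shell
# Local imitation of the `whitespace` analyzer: split at whitespace, change nothing else.
whitespace_like_tokens() {
    printf '%s\n' "$1" |
        LC_ALL=C tr -s '[:space:]' '\n' |   # each run of whitespace becomes one line break
        sed '/^$/d'                         # discard the empty line caused by leading whitespace
}

whitespace_like_tokens "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
```

Unlike the `simple` analyzer, tokens such as `Brown-Foxes`, `dog's` and `bone.` keep their case and punctuation.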

- Analyze text

In [None]:
text=$'{
    "analyzer": "whitespace",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "whitespace_analyzer": {
                    "type": "whitespace"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "whitespace_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

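### 1.4. Stop analyzer (`stop`)

Tokenizes like the `simple` analyzer (splitting at non-letter characters and lowercasing), then removes every token that appears in a stop word list.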
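
As a rough local illustration, the sketch below chains the `simple`-style split with a stop word filter. It assumes ASCII input; `stop_like_tokens` is a hypothetical helper, and the stop words are passed as extra arguments rather than read from a built-in list.

```shell
# Local imitation of the `stop` analyzer.
# $1 is the text; all remaining arguments are the stop words to remove.
stop_like_tokens() {
    local text=$1
    shift
    local stop_list
    stop_list="$(printf '%s\n' "$@")"           # one stop word per line
    printf '%s\n' "$text" |
        LC_ALL=C tr -cs '[:alpha:]' '\n' |      # split at non-letters
        LC_ALL=C tr '[:upper:]' '[:lower:]' |   # lowercase
        grep -vxF -e "$stop_list"               # drop tokens that equal a stop word exactly
}

stop_like_tokens "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." the over
```

With the stop words `the` and `over`, the same two words used in the index configuration below, the remaining tokens are `quick`, `brown`, `foxes`, `jumped`, `lazy`, `dog`, `s`, `bone`.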

- Analyze text

In [None]:
text=$'{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "stop_analyzer": {
                    "type": "stop",
                    "stopwords": [
                        "the", "over"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "stop_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.5. Keyword analyzer (`keyword`)

Does not split the text; the entire input is emitted as a single token.

- Analyze text

In [None]:
text=$'{
    "analyzer": "keyword",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "keyword_analyzer": {
                    "type": "keyword"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "keyword_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.6. Pattern analyzer (`pattern`)

Splits text at every match of a regular expression; it can optionally lowercase the tokens and remove stop words.
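
The splitting step can be sketched locally with `awk`. This is an approximation under stated assumptions: `pattern_like_tokens` is a hypothetical helper, the separator is written as the POSIX expression `[^A-Za-z0-9]` (the ASCII equivalent of `\W|_`), lowercasing is always on, and stop word removal is left out.

```shell
# Local imitation of the `pattern` analyzer's split-and-lowercase behaviour.
# $1 is the text, $2 is an extended regular expression that matches separators.
pattern_like_tokens() {
    printf '%s\n' "$1" |
        awk -v re="$2" '{
            n = split(tolower($0), parts, re)       # lowercase, then split at each separator match
            for (i = 1; i <= n; i++)
                if (parts[i] != "") print parts[i]  # skip empty pieces between adjacent separators
        }'
}

pattern_like_tokens "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." '[^A-Za-z0-9]'
```

In contrast to the `simple` analyzer, digits count as token characters here, so `2` is kept.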

- Analyze text

In [None]:
text=$'{
    "analyzer": "pattern",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "pattern_analyzer": {
                    "type": "pattern",
                    "pattern": "\\W|_",
                    "lowercase": true,
                    "stopwords": [
                        "the",
                        "over"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text=$'{
    "analyzer": "pattern_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 1.7. Language analyzers (see the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html))

Analyzes text with a language-specific analyzer; the examples below use `cjk`.

- Analyze text

In [None]:
text='{
    "analyzer": "cjk",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "cjk_analyzer": {
                    "type": "cjk",
                    "stopwords": "_cjk_"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "cjk_analyzer",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

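### 1.8. Fingerprint analyzer (`fingerprint`)

Computes a fingerprint of the text: the tokens are lowercased, normalized, sorted, deduplicated, and concatenated into a single token.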
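
The sort-and-deduplicate idea can be sketched locally as follows. The helper name `fingerprint_like` is hypothetical, the sketch assumes ASCII input, and it joins tokens with a single space.

```shell
# Local imitation of the `fingerprint` analyzer for ASCII input.
fingerprint_like() {
    printf '%s\n' "$1" |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |   # lowercase
        LC_ALL=C tr -cs '[:alnum:]' '\n' |      # one token per line
        sed '/^$/d' |                           # drop empty lines
        LC_ALL=C sort -u |                      # sort and remove duplicates
        paste -sd ' ' -                         # join everything into one line
}

fingerprint_like "Yes yes, this sentence is consistent and."
```

Both spellings of `yes` collapse into one entry, and the order of words in the input no longer matters.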

- Analyze text

In [None]:
text='{
    "analyzer": "fingerprint",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/_analyze?pretty' -d "$(echo $text)";

- Configure on an index

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "fingerprint_analyzer": {
                    "type": "fingerprint",
                    "max_output_size": 50,
                    "separator": ",",
                    "stopwords": "_cjk_"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "fingerprint_analyzer",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

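## 2. Custom analyzers

A custom analyzer accepts the following parameters:
- `tokenizer`: the tokenizer to use
- `char_filter`: character filters, applied to the raw text before tokenization
- `filter`: token filters, applied to the tokens produced by the tokenizer
- `position_increment_gap`: the position gap inserted between the values of a multi-valued text field so that phrase queries do not match across values; defaults to `100` (see the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/position-increment-gap.html))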
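
The order in which these stages run (character filters, then the tokenizer, then token filters) can be sketched as a local shell pipeline. Every stage here is a crude stand-in and not the real component: the tag-stripping `sed`, the ASCII punctuation split, and the two-character folding table are assumptions chosen to fit one sample sentence, and `custom_like_tokens` is a hypothetical helper name.

```shell
# Local sketch of the three stages of a custom analyzer.
custom_like_tokens() {
    printf '%s\n' "$1" |
        sed 's/<[^>]*>//g' |                        # char_filter: crude stand-in for html_strip
        LC_ALL=C tr -s '[:space:][:punct:]' '\n' |  # tokenizer: split at ASCII whitespace and punctuation
        sed '/^$/d' |                               # discard empty lines
        LC_ALL=C tr '[:upper:]' '[:lower:]' |       # filter: lowercase (ASCII letters only)
        sed 's/é/e/g; s/à/a/g'                      # filter: tiny stand-in for asciifolding
}

custom_like_tokens 'Is this <b>déjà vu</b>?'
```

For this sample the sketch prints `is`, `this`, `deja`, `vu`, one per line.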

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": [
                        "html_strip"
                    ],
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "custom_analyzer",
    "text": "Is this <b>déjà vu</b>?"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

## 3. Updating the analyzers of an index

### 3.1. Create the index

In [None]:
echo -e "* create index as: ";

settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1
    }
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

### 3.2. Update the index analyzer

In [None]:
echo -e "* change index analyzer as: ";

settings='{
    "analysis": {
        "analyzer": {
            "standard_analyzer": {
                "type": "standard",
                "max_token_length": 5,
                "stopwords": "_english_"
            }
        }
    }
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_close?pretty';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer/_settings?pretty' -d "$(echo $settings)";

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_open?pretty';

### 3.3. Test the index analyzer

In [None]:
echo -e "* analyzed with index as: ";

text=$'{
    "analyzer": "standard_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\'s bone."
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

### 3.4. Delete the index

In [None]:
echo -e "* delete index as:";

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

## 4. Third-party analyzers

### 4.1. Installing plugins

#### 4.1.1. Online installation

- Direct install

In [None]:
sudo bin/elasticsearch-plugin install analysis-icu
sudo bin/elasticsearch-plugin install analysis-smartcn

- Docker install

In [None]:
docker exec elastic bin/elasticsearch-plugin install analysis-icu
docker exec elastic bin/elasticsearch-plugin install analysis-smartcn

#### 4.1.2. Offline installation

1. Download the plugins: [smartcn](https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-7.6.2.zip), [icu](https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-7.6.2.zip);
2. Copy the downloaded files to a path that the Docker container can access (for example, the logs directory);
3. Run the commands below to install the plugins

- Direct install

In [None]:
sudo bin/elasticsearch-plugin install file:///usr/share/elasticsearch/logs/analysis-smartcn-7.6.2.zip
sudo bin/elasticsearch-plugin install file:///usr/share/elasticsearch/logs/analysis-icu-7.6.2.zip

- Docker install

In [None]:
docker exec elastic bin/elasticsearch-plugin install file:///usr/share/elasticsearch/logs/analysis-smartcn-7.6.2.zip
docker exec elastic bin/elasticsearch-plugin install file:///usr/share/elasticsearch/logs/analysis-icu-7.6.2.zip
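
The download links in step 1 share one URL pattern in which the plugin name appears twice and is followed by the version. The sketch below rebuilds those links from a version variable; `plugin_url` and `ES_VERSION` are hypothetical names, and the version is assumed to equal that of the Elasticsearch node, since a plugin has to match the node's version.

```shell
# Rebuild the plugin download URLs used in step 1 from a single version variable.
ES_VERSION="7.6.2"   # assumption: same version as the running Elasticsearch node

plugin_url() {
    # $1: plugin name, $2: Elasticsearch version
    printf 'https://artifacts.elastic.co/downloads/elasticsearch-plugins/%s/%s-%s.zip\n' "$1" "$1" "$2"
}

for plugin in analysis-smartcn analysis-icu; do
    plugin_url "$plugin" "$ES_VERSION"
done
```

Changing `ES_VERSION` in one place keeps both archive names consistent with each other.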

#### 4.1.3. Removing plugins

- Direct removal

In [None]:
sudo bin/elasticsearch-plugin remove analysis-icu
sudo bin/elasticsearch-plugin remove analysis-smartcn

- Docker removal

In [None]:
docker exec elastic bin/elasticsearch-plugin remove analysis-icu
docker exec elastic bin/elasticsearch-plugin remove analysis-smartcn

### 4.2. SmartCN analysis plugin

#### 4.2.1. Create the index

In [None]:
echo -e "* create index as: ";

settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "smartcn_with_stop_analyzer": {
                    "tokenizer": "smartcn_tokenizer",
                    "filter": [
                        "porter_stem",
                        "smartcn_stop_filter"
                    ]
                }
            },
            "filter": {
                "smartcn_stop_filter": {
                    "type": "smartcn_stop",
                    "stopwords": [
                        "_smartcn_",
                        "stack",
                        "的"
                    ]
                }
            }
        }
    }
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

#### 4.2.2. Test the index analyzer

In [None]:
echo -e "* analyzed with index as: ";

text='{
    "analyzer": "smartcn_with_stop_analyzer",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

#### 4.2.3. Delete the index

In [None]:
echo -e "* delete index as:";

curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

### 4.3. ICU analysis plugin

#### 4.3.1. Create an index with the built-in `icu_analyzer`

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "default_icu_analyzer": {
                    "type": "icu_analyzer",
                    "method": "nfkc_cf",
                    "mode": "compose"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "default_icu_analyzer",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';

#### 4.3.2. Using the ICU normalization character filter

In [None]:
# create index
echo -e "* create index as: ";
settings='{
    "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "nfkc_cf_normalized": { 
                    "tokenizer": "icu_tokenizer",
                    "char_filter": [
                    "icu_normalizer"
                    ]
                },
                "nfd_normalized": { 
                    "tokenizer": "icu_tokenizer",
                    "char_filter": [
                        "nfd_normalizer"
                    ]
                }
            },
            "char_filter": {
                "nfd_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfc",
                    "mode": "decompose"
                }
            }
        }
    }
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X PUT 'http://localhost:9200/analyzer?pretty' -d "$(echo $settings)";

# test index analyzer
echo -e "\n* analyzed with index as: ";
text='{
    "analyzer": "nfkc_cf_normalized",
    "text": "阿美首脑会议将讨论巴以和平等问题"
}';
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X POST 'http://localhost:9200/analyzer/_analyze?pretty' -d "$(echo $text)";

# delete index
echo -e "\n* delete index as:";
curl -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' \
     -X DELETE 'http://localhost:9200/analyzer?pretty';