# BRICK Entity Retrieval Examples (Fuzzy Search + Hybrid Search)

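This notebook demonstrates how to call the string-channel fuzzy search and the hybrid search pipeline through the `entity_index` package, so that other modules such as the knowledge graph or agents can reuse the existing index capabilities.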

## Environment Setup

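- Build and write the entity indexes before running, so that the string and vector indexes exist in Elasticsearch.
- Provide `ES_*`, `EMBEDDING_*`, `HYBRID_*`, and related settings through `.env` or external environment variables; note in particular that `HYBRID_TYPE_MIX` must be a JSON object (dictionary) serialized as a string.
- Run the notebook from the `entity_index/` directory if possible; when running from another path, adjust `project_root` accordingly.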
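
Because `HYBRID_TYPE_MIX` has to be a JSON object serialized as a string, it can help to validate the value before any search code reads it. The following is a minimal sketch under stated assumptions: `load_type_mix` and the `DEMO_HYBRID_TYPE_MIX` variable are hypothetical names used only for illustration, and the `entity_index` package may parse and validate this setting differently.

```python
import json
import os


def load_type_mix(env_var: str) -> dict:
    """Parse a JSON-object string of per-type weights from an environment variable.

    Hypothetical helper for illustration; not part of entity_index.
    """
    raw = os.environ.get(env_var, "")
    if not raw:
        raise ValueError(f"{env_var} is not set")
    weights = json.loads(raw)
    if not isinstance(weights, dict):
        raise ValueError(f"{env_var} must be a JSON object, got {type(weights).__name__}")
    # Coerce values to float so that 1 and 1.0 are treated alike.
    return {str(key): float(value) for key, value in weights.items()}


# Demo variable (an assumption) so a real HYBRID_TYPE_MIX value is left untouched.
os.environ["DEMO_HYBRID_TYPE_MIX"] = '{"Gene|Protein": 0.6, "Disease|Phenotype": 0.4}'
type_mix = load_type_mix("DEMO_HYBRID_TYPE_MIX")
print(type_mix)
```

A value such as `[0.6, 0.4]` is valid JSON but not an object, so the sketch rejects it with a `ValueError` instead of letting it fail later during a search.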

In [1]:

# -*- coding: utf-8 -*-
from __future__ import annotations

import json
import os
import sys
from pathlib import Path

# Add the repository root to sys.path so the entity_index package can be imported directly
project_root = Path.cwd().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

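# Optional: load environment variables from .env for local debugging (deployments can rely on the external environment)
env_path = project_root / ".env"
if env_path.exists():
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Strip whitespace around the key and value so "KEY = value" lines parse cleanly
        os.environ.setdefault(key.strip(), value.strip())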

# For this demo, supply a default JSON value if the type weights are not set explicitly
os.environ.setdefault("HYBRID_TYPE_MIX", json.dumps({
    "Gene|Protein": 0.25,
    "Disease|Phenotype": 0.25,
    "Process|Function|Pathway|Cell_Component": 0.20,
    "Chemical": 0.10,
    "Species": 0.10,
    "Cell|Tissue": 0.10,
    "Mutation": 0.0,
}))

import pandas as pd

from entity_index.search.settings import get_search_config
from entity_index.search import adapters, string_client
from entity_index.search.hybrid_searcher import HybridEntitySearcher
from entity_index.search.schema import HYBRID_TYPE_KEYS

search_config = get_search_config()
es_client = search_config.es.create_client()

print(f"String index: {search_config.string_index_name}")
print(f"Vector index: {search_config.vector_index_name or 'disabled'}")


String index: brick_entities_v1_string
Vector index: brick_entities_v1_vector




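## Fuzzy Search (String Channel)

This section shows how to run fuzzy matching against the string index alone, which suits fast-recall needs such as aliases and pinyin.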

In [2]:

# Build the fuzzy-search payload and make sure all seven entity-type fields are present
payload_fuzzy = {
    "query_id": "demo-string-001",
    "options": {"top_k": 10, "return_diagnostics": True},
}
for type_key in HYBRID_TYPE_KEYS:
    payload_fuzzy.setdefault(type_key, [])

# Fill in candidate terms as needed; leave the other types as empty lists
payload_fuzzy["Gene|Protein"] = ["EGFR", "表皮生长因子受体"]
payload_fuzzy["Disease|Phenotype"] = ["非小细胞肺癌"]

payload_fuzzy


{'query_id': 'demo-string-001',
 'options': {'top_k': 10, 'return_diagnostics': True},
 'Gene|Protein': ['EGFR', '表皮生长因子受体'],
 'Mutation': [],
 'Chemical': [],
 'Disease|Phenotype': ['非小细胞肺癌'],
 'Process|Function|Pathway|Cell_Component': [],
 'Species': [],
 'Cell|Tissue': []}

In [3]:

# Normalize the payload and run the string-channel search, which returns a list of HybridHit objects
normalized_fuzzy = adapters.normalize_payload(payload_fuzzy, search_config)
string_hits = string_client.search_string_channel(
    es_client,
    normalized_fuzzy,
    search_config.string_index_name,
)

# Collect the main fields into a DataFrame for display
string_df = pd.DataFrame([
    {
        "entity_id": hit.entity_id,
        "primary_name": hit.primary_name,
        "type_key": hit.type_key,
        "node_type": hit.node_type,
        "string_score": getattr(hit.scores, "string_score", 0.0),
        "matched_alias": hit.matched_alias,
    }
    for hit in string_hits
])
string_df.head(search_config.top_k)


String search failed for type Gene|Protein on index brick_entities_v1_string: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158afe030>: Failed to establish a new connection: [Errno 61] Connection refused))


String search failed for type Disease|Phenotype on index brick_entities_v1_string: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158afe7b0>: Failed to establish a new connection: [Errno 61] Connection refused))


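The string search returns the ID, primary name, matched alias, and score of each hit entity; upstream callers can filter on these fields or pass the results on to knowledge-graph question answering.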
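
As an illustration of such downstream filtering, the sketch below keeps the highest-scoring hit per entity type above a threshold. It uses plain dictionaries with made-up placeholder rows that mirror the `string_df` columns; `best_hit_per_type` and `min_score` are hypothetical names, not part of `entity_index`.

```python
# Placeholder rows (invented for illustration) with the same keys as the string_df columns.
rows = [
    {"entity_id": "demo:1", "type_key": "Gene|Protein", "string_score": 12.4, "matched_alias": "alias-a"},
    {"entity_id": "demo:2", "type_key": "Gene|Protein", "string_score": 7.1, "matched_alias": "alias-b"},
    {"entity_id": "demo:3", "type_key": "Disease|Phenotype", "string_score": 3.0, "matched_alias": "alias-c"},
]


def best_hit_per_type(hit_rows, min_score):
    """Return the top-scoring row for each type_key, ignoring rows below min_score."""
    best = {}
    for row in hit_rows:
        if row["string_score"] < min_score:
            continue
        current = best.get(row["type_key"])
        if current is None or row["string_score"] > current["string_score"]:
            best[row["type_key"]] = row
    return best


best = best_hit_per_type(rows, min_score=5.0)
for type_key, row in best.items():
    print(type_key, row["entity_id"], row["string_score"])
```

A sensible threshold depends on how the string index scores matches, so `min_score=5.0` here is arbitrary.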

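## Hybrid Search (String + Vector Fusion)

The following cells walk through the full hybrid search flow: candidates are recalled from both the string and vector channels, then combined by weighted fusion to produce the final ranking.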
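
One simple way to picture weighted fusion is a linear interpolation of the two channel scores. The sketch below is an illustration only: `fuse_scores`, the `string_weight` parameter, and the candidate tuples are hypothetical, and `HybridEntitySearcher` may normalize scores or combine channels differently.

```python
def fuse_scores(string_score: float, vector_score: float, string_weight: float = 0.5) -> float:
    """Linearly interpolate two channel scores (a hypothetical fusion rule)."""
    if not 0.0 <= string_weight <= 1.0:
        raise ValueError("string_weight must lie in [0, 1]")
    return string_weight * string_score + (1.0 - string_weight) * vector_score


# Invented candidates: (entity_id, string_score, vector_score), scores assumed to be in [0, 1].
candidates = [
    ("E1", 0.9, 0.2),
    ("E2", 0.4, 0.8),
    ("E3", 0.1, 0.3),
]

# Rank candidates by fused score, highest first.
ranked = sorted(
    ((entity_id, fuse_scores(s, v, string_weight=0.5)) for entity_id, s, v in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for entity_id, score in ranked:
    print(entity_id, round(score, 3))
```

With equal weights, a candidate that is only moderately strong in both channels can outrank one that is strong in a single channel, which is the usual motivation for fusing channels at all.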

In [4]:

import copy

hybrid_searcher = HybridEntitySearcher(es_client, search_config)

payload_hybrid = copy.deepcopy(payload_fuzzy)
payload_hybrid["query_id"] = "demo-hybrid-001"
payload_hybrid["options"].update({
    "top_k": 5,
    "return_diagnostics": True,
    "debug": False,
    "type_mix_override": {
        "Gene|Protein": 0.6,
        "Disease|Phenotype": 0.4,
    },
})
payload_hybrid


{'query_id': 'demo-hybrid-001',
 'options': {'top_k': 5,
  'return_diagnostics': True,
  'debug': False,
  'type_mix_override': {'Gene|Protein': 0.6, 'Disease|Phenotype': 0.4}},
 'Gene|Protein': ['EGFR', '表皮生长因子受体'],
 'Mutation': [],
 'Chemical': [],
 'Disease|Phenotype': ['非小细胞肺癌'],
 'Process|Function|Pathway|Cell_Component': [],
 'Species': [],
 'Cell|Tissue': []}

In [5]:

# Run the hybrid search; the result is a HybridResponse object
hybrid_response = hybrid_searcher.search(payload_hybrid)
hybrid_response


String search failed for type Gene|Protein on index brick_entities_v1_string: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158d00bf0>: Failed to establish a new connection: [Errno 61] Connection refused))


String search failed for type Disease|Phenotype on index brick_entities_v1_string: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158d014c0>: Failed to establish a new connection: [Errno 61] Connection refused))


Vector knn search failed for type Gene|Protein on index brick_entities_v1_vector, fallback to script_score: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x12f777410>: Failed to establish a new connection: [Errno 61] Connection refused))


Vector search failed for type Gene|Protein on index brick_entities_v1_vector: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158d03f20>: Failed to establish a new connection: [Errno 61] Connection refused))


Vector knn search failed for type Disease|Phenotype on index brick_entities_v1_vector, fallback to script_score: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158d5e090>: Failed to establish a new connection: [Errno 61] Connection refused))


Vector search failed for type Disease|Phenotype on index brick_entities_v1_vector: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<elastic_transport._node._urllib3_chain_certs.HTTPSConnection object at 0x158d5e0f0>: Failed to establish a new connection: [Errno 61] Connection refused))


HybridResponse(query_id='demo-hybrid-001', standardized={}, diagnostics=[], logs=None)
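
Every channel above logged `Connection refused`, and the response came back with empty `standardized` and `diagnostics` fields, so it can be useful to verify connectivity before issuing searches. The sketch below relies on `ping()`, which the official Elasticsearch Python client exposes and which returns `False` rather than raising when the cluster cannot be reached; `require_reachable` and `_OfflineClient` are hypothetical helpers, with the latter standing in for a client whose cluster is down.

```python
def require_reachable(client, label: str = "Elasticsearch") -> None:
    """Raise early if client.ping() reports that the cluster cannot be reached."""
    if not client.ping():
        raise RuntimeError(f"{label} is unreachable; check the host, port, and scheme settings")


class _OfflineClient:
    """Stand-in for a client whose cluster is down (demonstration only)."""

    def ping(self) -> bool:
        return False


try:
    require_reachable(_OfflineClient())
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

In the notebook, `require_reachable(es_client)` could be called right after the client is created, so that a misconfigured host fails loudly instead of producing an empty result.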

In [6]:

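# Convert the standardized output and diagnostics to DataFrames for easier analysis
from IPython.display import display

standardized_rows = [
    {"type_key": type_key, "entity_name": entity_name}
    for type_key, entities in hybrid_response.standardized.items()
    for entity_name in entities
]
standardized_df = pd.DataFrame(standardized_rows)
# Only the last expression in a cell is rendered automatically, so display this one explicitly
display(standardized_df)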

diagnostics = hybrid_response.diagnostics or []
diagnostics_df = pd.DataFrame([
    {
        "type_key": item.type_key,
        "entity_id": item.entity_id,
        "primary_name": item.primary_name,
        "final_score": item.final_score,
        "string_score": getattr(item.channel_scores, "string_score", 0.0),
        "vector_score": getattr(item.channel_scores, "vector_score", 0.0),
        "matched_alias": item.matched_alias,
    }
    for item in diagnostics
])
diagnostics_df


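## Aggregated Dictionary Lookup (all_name2id)

The `all_name2id` index maps any alias to its canonical entity ID, which makes it convenient to normalize names before running a full search. The example below queries the index directly to see which entities the alias `EGFR` resolves to.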

In [None]:
from entity_index.text_utils import to_pinyin_tokens

def query_all_name2id(alias: str, size: int = 5):
    alias = alias.strip()
    if not alias:
        raise ValueError("alias must not be empty")
    should = [
        {"term": {"alias": {"value": alias, "boost": 6.0}}},
        {"match": {"search_terms": {"query": alias, "boost": 3.0}}},
        {"match": {"search_terms.ngram": {"query": alias, "boost": 1.5}}},
        {"match": {"search_terms.prefix": {"query": alias, "boost": 1.0}}},
    ]
    for token in to_pinyin_tokens(alias):
        should.append({"term": {"pinyin_terms": {"value": token, "boost": 1.0}}})
    response = es_client.search(
        index=search_config.alias_index_name,
        body={
            "size": size,
            "track_total_hits": False,
            "query": {"bool": {"should": should, "minimum_should_match": 1}},
        },
    )
    rows = [
        {
            "alias": hit.get("_source", {}).get("alias"),
            "entity_id": hit.get("_source", {}).get("entity_id"),
            "ctype": hit.get("_source", {}).get("ctype"),
            "score": float(hit.get("_score") or 0.0),
        }
        for hit in response.get("hits", {}).get("hits", [])
    ]
    return pd.DataFrame(rows)

query_all_name2id("EGFR", size=8)

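`HybridResponse.standardized` provides the final, business-facing candidates; when you need to investigate weights or troubleshoot a problem, compare against the channel scores and match information in `diagnostics_df`.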