# Keyword search
Keyword search, also called "BM25 (Best Match 25)" or "sparse vector" search, returns the objects with the highest BM25F scores.
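
To make the scoring idea concrete, here is a minimal sketch of plain BM25 over a toy in-memory corpus. It is not Weaviate's implementation: the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the parameter defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration. BM25F additionally combines term frequencies across several properties, which this sketch leaves out.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with plain BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalised through b
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up documents, not taken from the dataset
docs = [
    "This vector DB is OSS",
    "A relational DB stores rows in a DB table",
    "Food safety rules",
]
scores = bm25_scores("DB", docs)
```

Documents that do not contain any query term score zero, which is why a keyword search only returns objects that share at least one token with the query.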

In [1]:
import json
import weaviate
from weaviate.auth import AuthApiKey

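# Connect to a locally deployed Weaviate instance.
# Note: weaviate.Client is the v3 client API; the V4 cells below need a v4
# client object (for example, one returned by weaviate.connect_to_local()).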
client = weaviate.Client(
    url="http://127.0.0.1:8080",
    auth_client_secret=AuthApiKey("WVF5YThaHlkYwhGUSmCRgsX3tD5ngdN8pkih")
)

## Basic BM25 search

In [4]:
### V3

response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
      query="DB"
    )
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "answer": null,
          "question": "This vector DB is OSS & supports automatic property type inference on import"
        },
        {
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        },
        {
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        }
      ]
    }
  }
}


In [None]:
## V4
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)

for o in response.objects:
    print(o.properties)

## Retrieve BM25F scores

You can retrieve the BM25F `score` value for each returned object.

In [7]:
response = (
    client.query
    .get("Test", ["text", "source"])
    .with_bm25(
      query="白术"
    )
    .with_additional("score")
    .with_limit(3)
    .do()
)

print(json.dumps(response, ensure_ascii=False, indent=2))

{
  "data": {
    "Get": {
      "Test": [
        {
          "_additional": {
            "score": "0.86408603"
          },
          "source": "tests/药食同源养生方药集粹 乔铁，梁可，马进主编 辽宁科学技术出版社.txt",
          "text": "气郁体质：气机不顺，情绪郁结。气郁体质是肝脏的疏泄条达功能 相对不足造成的。常见表现为胃脘、胸腹、胁肋、乳房等部位的胀满疼 痛。气郁体质的养生原则是疏肝理气、补益肝血，应选用具有理气解郁、调理脾胃功能的食物，如大麦、荞麦、高粱、刀豆、蘑菇、豆豉、 苦瓜、萝 卜、洋葱、菊花、玫瑰、青椒、芹菜、茉莉、生姜、山楂、陈 皮以及佛手、橙子、柑皮、韭菜、茴香菜、大蒜、火腿、高粱皮、香橼 等具有行气作用的食物。少食收敛酸涩之物，如乌梅、南瓜、泡菜、石 榴、青梅、杨梅、草莓、阳桃、酸枣、李子、柠檬等。忌食奶油、肥肉、糯米等肥腻食物。亦不可多食冰冷食品，如雪糕、冰激凌、冰冻饮 料等。\n\n特禀体质：先天禀赋不足，以生理缺陷、过敏反应等为主要特征。 特禀体质常见表现为过敏质者常见哮喘、风团、咽痒、鼻塞、喷嚏等； 患遗传性疾病者有垂直遗传、先天性、家族性特征；患胎传性疾病者具 有母体影响胎儿个体生长发育及相关疾病特征。特禀体质的养生原则是 益气固表、温补肺脾肾。避免食用各种致敏食物，减少发作机会。饮食 调养要因时因人因地，并结合过敏原。选用性质平和、清淡、温补类食 物及补养肺气的食材，可降低过敏的发生。常用药物有黄芪、白术、荆 芥、防风、蝉蜕、乌梅、益母草、当归、生地、黄芩、丹皮等。饮食宜 清淡、均衡，粗细搭配适当，荤素配伍合理。少食荞麦（含致敏物质荞 麦荧光素）、蚕豆、白扁豆、牛肉、鹅肉、鲤鱼、虾、蟹、茄子、酒、 辣椒等，更应避免腥膻发物及含致敏物质的食物，减少发作机会。避免 接触致敏物质，如尘螨、花粉、油漆等。居室应通风良好，保持室内清 洁，被褥、床单要经常洗晒。不宜养宠物，以免对动物皮毛过敏。起居 应有规律，保证充足的睡眠，积极参加各种体育锻炼，增强体质。\n\n\n1. 四气\n四气又称四性，是指寒、凉、温、热四种性质。其中温热与寒凉属 于两类不同的性质。温与热、

In [None]:
### V4
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)

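## Search on selected properties only

A keyword search can be restricted to a subset of object properties. In this example, the BM25 search uses only the `question` property to compute the BM25F score.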

In [8]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
      query="DB",
      properties=["question"]
    )
    .with_additional("score")
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "0.16569251"
          },
          "answer": null,
          "question": "This vector DB is OSS & supports automatic property type inference on import"
        },
        {
          "_additional": {
            "score": "0.1603982"
          },
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        },
        {
          "_additional": {
            "score": "0.1603982"
          },
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        }
      ]
    }
  }
}
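
The V3 response shown above nests each score as a string under `_additional`. A small helper can pull the scores out as floats for sorting or thresholding. This is only a sketch: `extract_scores` is a hypothetical name, and the `sample` dict is a trimmed copy of the response shape with placeholder question text.

```python
def extract_scores(response, class_name):
    """Return (question, score) pairs from a V3-style response dict."""
    objects = response["data"]["Get"][class_name]
    return [
        (obj.get("question"), float(obj["_additional"]["score"]))
        for obj in objects
    ]

# Trimmed copy of the response shape; "Q1" and "Q2" are placeholders
sample = {
    "data": {"Get": {"JeopardyQuestion": [
        {"_additional": {"score": "0.16569251"}, "answer": None, "question": "Q1"},
        {"_additional": {"score": "0.1603982"}, "answer": "Weaviate", "question": "Q2"},
    ]}}
}
pairs = extract_scores(sample, "JeopardyQuestion")
```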


In [None]:
## V4
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    query_properties=["question"],
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)

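## Use weights to boost properties

You can weight how much each property contributes to the overall BM25F score. This example boosts the `question` property by a factor of 2, while the `answer` property is left unchanged.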

In [9]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
      query="DB",
      properties=["question^2", "answer"]
    )
    .with_additional("score")
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "0.18382995"
          },
          "answer": null,
          "question": "This vector DB is OSS & supports automatic property type inference on import"
        },
        {
          "_additional": {
            "score": "0.17779541"
          },
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        },
        {
          "_additional": {
            "score": "0.17779541"
          },
          "answer": "Weaviate",
          "question": "This vector DB is OSS and supports automatic property type inference on import"
        }
      ]
    }
  }
}


In [None]:
## V4
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],
    limit=3
)

for o in response.objects:
    print(o.properties)
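
The `property^weight` strings used above can be read as "multiply this property's contribution by the weight". The sketch below unpacks that notation and applies it to per-property term counts. It is a simplified illustration rather than Weaviate's internal scoring: `parse_boost`, `weighted_term_frequency`, and the counts are hypothetical, and real BM25F also normalises for property length.

```python
def parse_boost(spec):
    """Split 'name^weight' into (name, weight); the weight defaults to 1.0."""
    name, sep, weight = spec.partition("^")
    return name, (float(weight) if sep else 1.0)

def weighted_term_frequency(term_counts, specs):
    """Combine per-property term counts into one weighted count."""
    total = 0.0
    for spec in specs:
        name, weight = parse_boost(spec)
        total += weight * term_counts.get(name, 0)
    return total

# Hypothetical counts of one query term in each property of a single object
counts = {"question": 1, "answer": 1}
boosted = weighted_term_frequency(counts, ["question^2", "answer"])
unboosted = weighted_term_frequency(counts, ["question", "answer"])
```

With these made-up counts, a match in `question` counts twice as much as a match in `answer`, so objects that match in the boosted property move up the ranking.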

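## Set tokenization
The BM25 query string is tokenized before it is used to search for objects through the inverted index.

You must specify the tokenization method for each property in the collection definition.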

In [11]:
class_obj = {
    "class": "Test2",
    "vectorizer": "text2vec-huggingface",  # this could be any vectorizer
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "tokenization": "lowercase",  # tokenization is set on the property itself
            "moduleConfig": {
                "text2vec-huggingface": {  # this must match the vectorizer used
                    "vectorizePropertyName": True
                }
            }
        },
        {
            "name": "body",
            "dataType": ["text"],
            "tokenization": "whitespace",  # tokenization is set on the property itself
            "moduleConfig": {
                "text2vec-huggingface": {  # this must match the vectorizer used
                    "skip": True  # Don't vectorize body
                }
            }
        },
    ],
}

client.schema.create_class(class_obj)

UnexpectedStatusCodeException: Create class! Unexpected status code: 422, with response body: {'error': [{'message': 'vectorizer: no module with name "text2vec-huggingface" present'}]}.

The 422 response above indicates that the `text2vec-huggingface` module is not enabled on this Weaviate instance. Enable the module, or use a vectorizer that the instance provides, before creating the class.

In [None]:
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vectorizer_config=Configure.Vectorizer.text2vec_huggingface(),

    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            vectorize_property_name=True,  # Use "title" as part of the value to vectorize
            tokenization=Tokenization.LOWERCASE  # Use "lowercase" tokenization
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # Don't vectorize this property
            tokenization=Tokenization.WHITESPACE  # Use "whitespace" tokenization
        ),
    ]
)
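
To see why the choice of tokenization matters for keyword matching, the sketch below approximates four tokenization options in plain Python. It is an illustration under assumptions, not Weaviate's tokenizer: the `tokenize` helper is hypothetical, and treating only ASCII letters and digits as alphanumeric is a simplification.

```python
import re

def tokenize(text, method):
    """Approximate four tokenization options (illustrative only)."""
    if method == "word":
        # Split on non-alphanumeric characters, then lowercase
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    if method == "lowercase":
        # Split on whitespace, then lowercase; punctuation is kept
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case and punctuation are kept
        return text.split()
    if method == "field":
        # Treat the whole trimmed value as a single token
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")

sample = "Hello, (beautiful) world"
by_method = {m: tokenize(sample, m) for m in ("word", "lowercase", "whitespace", "field")}
```

With `whitespace` tokenization the token `Hello,` keeps its capital letter and comma, so a query for `hello` would not match it, whereas it would match under `word` tokenization.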

## `limit` & `offset`
Use `limit` to set a fixed maximum number of objects to return.

Optionally, use `offset` to paginate the results.

In [None]:
## V3
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
      query="safety"
    )
    .with_additional("score")
    .with_limit(3)
    .with_offset(1)
    .do()
)

print(json.dumps(response, indent=2))

In [None]:
## V4
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3,
    offset=1
)

for o in response.objects:
    print(o.properties)
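
The combined effect of `limit` and `offset` can be pictured as a slice over the ranked result list: skip the first `offset` results, then return at most `limit`. The sketch below shows that behaviour on a plain Python list. The helper names and the result IDs are hypothetical.

```python
def paginate(ranked, limit, offset=0):
    """Skip `offset` results, then return at most `limit` of them."""
    return ranked[offset:offset + limit]

def offset_for_page(page_number, limit):
    """Offset of a zero-based page when every page holds `limit` results."""
    return page_number * limit

ranked = ["a", "b", "c", "d", "e"]  # hypothetical result IDs, best match first
page = paginate(ranked, limit=3, offset=1)
second_page = paginate(ranked, limit=2, offset=offset_for_page(1, 2))
```

With `limit=3` and `offset=1`, the best match is skipped and the next three results are returned.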

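## Limit result groups
To limit results to groups of objects at similar distances from the query, use the `autocut` filter to set the number of groups to return.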

In [None]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_bm25(
      query="safety"
    )
    .with_additional("score")
    .with_autocut(1)
    .do()
)

print(json.dumps(response, indent=2))

In [None]:
### V4
jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1
)

for o in response.objects:
    print(o.properties)
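
The idea behind cutting results at a "jump" can be sketched as follows. This is a deliberately simplified illustration and not Weaviate's actual autocut algorithm: the `autocut` function, the `min_jump` threshold, and the scores are assumptions made for the example.

```python
def autocut(scores, groups, min_jump=0.2):
    """Keep results up to the `groups`-th jump in a descending score list.

    A "jump" here is a drop between neighbours larger than `min_jump`
    times the top score; this threshold is an assumption.
    """
    if not scores:
        return []
    top = scores[0]
    jumps = 0
    for i in range(1, len(scores)):
        if top > 0 and (scores[i - 1] - scores[i]) / top > min_jump:
            jumps += 1
            if jumps == groups:
                return scores[:i]
    return scores

# Made-up scores with two visible jumps (after 0.87 and after 0.48)
scores = [0.9, 0.88, 0.87, 0.5, 0.48, 0.1]
first_group = autocut(scores, groups=1)
two_groups = autocut(scores, groups=2)
```

With one group, only the cluster of closely scored results at the top is kept. Unlike `limit`, the number of returned objects depends on where the scores drop off.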

## Group results
Define criteria to group search results.

In [None]:
## V4
from weaviate.classes.query import GroupBy

jeopardy = client.collections.get("JeopardyQuestion")

# Grouping parameters
group_by = GroupBy(
    prop="round",  # group by this property
    objects_per_group=3,  # maximum objects per group
    number_of_groups=2,  # maximum number of groups
)

# Query
response = jeopardy.query.bm25(
    query="California",
    group_by=group_by
)

for grp_name, grp_content in response.groups.items():
    print(grp_name, grp_content.objects)
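
The two caps in the grouping parameters above, `objects_per_group` and `number_of_groups`, can be illustrated on a plain list of ranked objects. This is a sketch of the idea only: `group_results` and the sample objects are hypothetical, and the server-side grouping may differ in detail.

```python
def group_results(objects, prop, objects_per_group, number_of_groups):
    """Group ranked objects by `prop`, capping group count and group size."""
    groups = {}
    for obj in objects:  # objects are assumed to be in ranked order
        key = obj[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                continue  # group limit reached, ignore new group keys
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(obj)
    return groups

# Made-up ranked results, best match first
ranked = [
    {"round": "Jeopardy!", "answer": "A1"},
    {"round": "Double Jeopardy!", "answer": "A2"},
    {"round": "Jeopardy!", "answer": "A3"},
    {"round": "Final Jeopardy!", "answer": "A4"},
    {"round": "Jeopardy!", "answer": "A5"},
]
groups = group_results(ranked, "round", objects_per_group=2, number_of_groups=2)
```

Here the third distinct `round` value is dropped because only two groups are allowed, and the third object in the first group is dropped because each group holds at most two objects.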

## Filter results
For more specific results, use a `filter` to narrow the search.

In [None]:
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer", "round"])
    .with_bm25(
      query="food"
    )
    .with_where({
        "path": ["round"],
        "operator": "Equal",
        "valueText": "Double Jeopardy!"
    })
    .with_additional("score")
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=2))

In [None]:
from weaviate.classes.query import Filter

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=Filter.by_property("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"], # return these properties
    limit=3
)

for o in response.objects:
    print(o.properties)
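
Conceptually, a filter restricts which objects are eligible, and the keyword score then ranks only those. The sketch below mimics the equality filter used above on a list of already scored objects. It is an illustration only: `filter_then_rank`, the objects, and their scores are made up, and it says nothing about how Weaviate executes filters internally.

```python
def filter_then_rank(scored_objects, prop, value, limit):
    """Keep objects whose `prop` equals `value`, then return the top `limit` by score."""
    matching = [obj for obj in scored_objects if obj.get(prop) == value]
    matching.sort(key=lambda obj: obj["score"], reverse=True)
    return matching[:limit]

# Hypothetical scored objects; the values are made up for illustration
scored = [
    {"answer": "A1", "round": "Jeopardy!", "score": 0.9},
    {"answer": "A2", "round": "Double Jeopardy!", "score": 0.7},
    {"answer": "A3", "round": "Double Jeopardy!", "score": 0.8},
]
top = filter_then_rank(scored, "round", "Double Jeopardy!", limit=3)
```

The highest-scoring object overall is excluded because it fails the filter, so fewer than `limit` objects can come back.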

### Tokenization