# Knowledge Base Q&A Demo


This tutorial demonstrates how to use the Qianfan API and ERNIE models to implement a basic knowledge base question-answering (Q&A) system.

The system has two main components: knowledge base construction and knowledge base-driven Q&A. By combining text embeddings with embedding-based retrieval, it grounds the model's answers in passages retrieved from the knowledge base.

## 1. Environment Setup
Before starting, make sure the following requirements are met:
- Python 3.10-3.12 is installed.
- The third-party packages `openai`, `numpy`, and `faiss` (installed as `faiss-cpu`) are available. `hashlib`, `json`, and `textwrap` are part of the Python standard library and need no installation.
- An [ERNIE-4.5](https://github.com/PaddlePaddle/FastDeploy) series model service is deployed, and its service address `host_url` is configured correctly.
- The embedding model parameters are set: the service address `embedding_service_url`, the model name `embedding_model`, and the API key `qianfan_api_key`.
    - For the embedding models supported by Qianfan, see [Qianfan Embedding Models](https://cloud.baidu.com/doc/qianfan-docs/s/Um8r1tpwy). After choosing `embedding_model`, set `embedding_dim` to match that model's output dimension.
    - You can log in to [Qianfan](https://console.bce.baidu.com/iam/#/iam/apikey/list) to create your API key. API keys are sensitive information and should be stored securely.
- The `test_file` text file used for testing is set.
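
Because the API key is sensitive, one option is to read it from an environment variable instead of hard-coding it in the notebook. The sketch below assumes a variable named `QIANFAN_API_KEY`; that name and the helper `load_api_key` are hypothetical, chosen for illustration, and are not something Qianfan requires.

```python
import os


def load_api_key(env_var: str = "QIANFAN_API_KEY", default: str = "") -> str:
    """Return the API key stored in `env_var`, or `default` if it is unset.

    `QIANFAN_API_KEY` is an illustrative name, not an official convention.
    """
    key = os.environ.get(env_var, default).strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

With this helper, the configuration cell could assign `qianfan_api_key = load_api_key()` so that the key never appears in the notebook itself.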

In [4]:
# Model service address configuration (replace `port` with the port of your deployed service)
host_url = "http://localhost:port/v1"

# Embedding service configuration
embedding_service_url = "https://qianfan.baidubce.com/v2"
qianfan_api_key = "bce-v3/xxx"  # Replace with your real API key
embedding_model = "embedding-v1"
embedding_dim = 384  # Embedding dimension, adjust according to the model
top_k = 3  # Number of retrieved texts returned per query
model_api_key = "api_key"  # API key for the model service; any placeholder value works for a locally deployed model

# Test file
test_file = "../data/coffee.txt"

### 1.1. Install Dependencies

In [None]:
!pip install openai numpy faiss-cpu

## 2. Main Implementation Structure Overview

- **Knowledge Base Construction**: Building an embedding-based knowledge base is the core preparatory work for a knowledge base Q&A system. It consists of two steps.

    1) **Text Chunking Processing**: Split the original document into chunks, turning lengthy documents into knowledge units suitable for semantic retrieval.

    2) **Text Chunk Embedding and Database Ingestion**: Convert the text chunks into high-dimensional embeddings with an embedding model, and store each original text alongside its embedding so that the knowledge base holds a retrievable text-embedding pair for every chunk.

    With the knowledge base in place, the system can quickly locate the most relevant text fragments by computing the similarity between the query embedding and the stored embeddings. These fragments give the model a factual basis for its answers and help reduce hallucination.

- **Knowledge Base-based Q&A**: Use the ERNIE model API to build a knowledge base Q&A system. The Q&A process consists of three steps.

    1) **Retrieval Query Rewriting**: Analyze whether the user's query requires retrieval from the knowledge base. If it does, rewrite the query into one or more standalone search queries.

    2) **Knowledge Base Retrieval**: Generate an embedding for each query, compare it with the text embeddings stored in the knowledge base, and obtain the relevant text passages.

    3) **Generate the Final Answer**: The LLM synthesizes the final response from the retrieved text passages.

## 3. Knowledge Base Construction

This example uses the `faiss` library to index and retrieve embeddings.

### 3.1. Text Chunking Processing

Suppose you have a document about coffee and want to build an embedding database for it, so that relevant information can be retrieved for a given query.

Before embedding the text, split it into chunks. This example sets `chunk_size` to 512 characters and chunks the text line by line, so that retrieval returns fine-grained, relevant fragments.
- If a line is shorter than 512 characters, the following lines are appended until the chunk would exceed 512 characters; if a single line is longer than 512 characters, it is cut at the nearest preceding punctuation mark.

In [5]:
def split_oversized_line(line: str, chunk_size: int) -> tuple[str, str]:
    """Split `line` into a head of at most `chunk_size` characters and the remainder."""
    PUNCTUATIONS = {".", "。", "!", "！", "?", "？", ",", "，", ";", "；", ":", "："}

    if len(line) <= chunk_size:
        return line, ""

    # Search backwards from the last position that still fits in the chunk
    split_pos = -1
    for i in range(chunk_size - 1, 0, -1):
        if line[i] in PUNCTUATIONS:
            split_pos = i + 1  # Include the punctuation mark in the head
            break

    # Fall back to the last whitespace if no punctuation was found
    if split_pos == -1:
        # Start at 1 so the head is never empty
        split_pos = line.rfind(" ", 1, chunk_size)
        if split_pos == -1:
            split_pos = chunk_size  # Hard split

    return line[:split_pos], line[split_pos:]

def split_text_into_chunks(text: str, chunk_size: int) -> list[str]:
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    chunks = []
    current_chunk = []
    current_length = 0

    for line in lines:
        # If adding this line would exceed chunk size (and we have content)
        if current_length + len(line) > chunk_size and current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk = []
            current_length = 0

        # Process oversized lines first
        while len(line) > chunk_size:
            head, line = split_oversized_line(line, chunk_size)
            chunks.append(head)

        # Add remaining line content
        if line:
            current_chunk.append(line)
            current_length += len(line) + 1

    if current_chunk:
        chunks.append("\n".join(current_chunk))
    return chunks

with open(test_file, "r", encoding="utf-8") as f:
    test_text = f.read()

segments = split_text_into_chunks(test_text, chunk_size=512)
print(segments[:5])

['咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世界上流行范围最为广泛的饮料之一。咖啡在人类饮食中一般为日常的饮品，人们通常会为了提振精神，或在用餐和社交、阅读时饮用。咖啡原产于非洲东岸的埃塞俄比亚，咖啡起源于15-16世纪，从也门被传播至穆斯林世界，16世纪的威尼斯商人将咖啡引入意大利，随后17-18世纪由于欧洲对咖啡的需求，促使殖民者将咖啡树传播并栽种到美洲、东南亚和印度等热带地区，现今有超过70个国家种植咖啡树。未经烘焙的 咖啡生豆作为世界上最大的出口农产品，以及世界上交易量为广泛的热带农产品之一，也是发展中国家出口中最有价值的商品之一。采收的成熟咖啡果会经过剥离果肉的初步加工，再经过烘焙的工序，而成为能制作咖啡的咖啡豆。透过不同的冲泡方式与成分比例，咖啡有浓缩咖啡、卡布奇诺和拿铁咖啡等变化。咖啡豆的品种可大致分为两种：最为普遍的小果咖啡（阿拉比卡），以及颗粒较粗且酸味较低而苦味较浓的中果咖啡（罗布斯塔）。一些争议指咖啡的种植与它环境影响有关，例如肯亚咖啡豆在移植种植后失去了独有的肯亚酸，而肯亚的原种地土壤含有较高浓度的磷酸。因此，', '公平贸易咖啡与有机咖啡是一个不断扩大的市场。\n传说9世纪的埃塞俄比亚的牧羊人发现并咀嚼了咖啡果实，随后将咖啡果实带给了附近修道院的僧侣，但僧侣起初不愿食用果实，并把果实扔进火里，经过火烤的咖啡果中冒出香气引来僧侣前来查看，僧侣从余烬中捞出咖啡豆，并将其磨碎溶解在热水中，这才制成了世界上第一杯咖啡。但此故事截至1671年并没有得到任何记载，因此可能是杜撰的。亦有研究认为最初栽培的咖啡源自埃塞俄比亚的哈勒尔。埃塞俄比亚的阿克苏姆王国兴盛时曾一度占据也门南部，6世纪中期，萨珊帝国攻占也门后将阿克苏姆赶出南阿拉伯半岛，可以肯定的是咖啡是从埃塞俄比亚传播到也门的。', '咖啡传播到穆斯林世界后伊斯兰医学认可了咖啡的好处，认为其可以提振精神并防止酒和大麻对穆斯林的诱惑，15世纪的也门苏菲派修道院在祈祷时使用咖啡来帮助集中注意力。 16世纪初咖啡从也门的摩卡港传播到埃及，随后咖啡馆还出现在叙利亚阿勒颇，并于1554年在奥斯曼帝国首都伊斯坦布尔开业。1511年，由于也门麦加的宗教领袖认为咖啡具有刺激作用，便开始禁止穆斯林饮用咖啡，造成其余阿拉伯世界的苏丹和宗教领袖也相继效仿；其中两位奥斯曼帝国

### 3.2. Text Chunk Embedding and Database Ingestion

#### 3.2.1. Embedding Generation Example

This example uses Qianfan API to generate text embeddings.

An embedding is a distributed numerical representation of text, composed of an array of floating-point numbers. This representation is highly useful for various natural language processing tasks. In this example, we will use embeddings to calculate the similarity between queries and knowledge base texts.
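
To make the similarity calculation concrete, the sketch below computes cosine similarity between toy 3-dimensional vectors in plain Python. The vectors and names are made up for illustration; real embeddings have `embedding_dim` components. Note that the `IndexFlatIP` index used later in this tutorial ranks by inner product, which matches cosine similarity when the vectors have unit length.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for a query and two knowledge base chunks
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {"chunk_a": [2.0, 0.0, 2.0], "chunk_b": [0.0, 1.0, 0.0]}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Here `chunk_a` points in the same direction as the query and receives the highest score, while `chunk_b` is orthogonal to it and scores zero.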

In [6]:
from openai import OpenAI


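# Create the client once and reuse it for every embedding request
embedding_client = OpenAI(base_url=embedding_service_url, api_key=qianfan_api_key)


def embed_fn(text: str) -> list[float]:
    """Return the embedding of `text` as a list of floats."""
    response = embedding_client.embeddings.create(input=[text], model=embedding_model)
    return response.data[0].embedding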

print(embed_fn("hello, world!"))

[0.06902910023927689, -0.02064981311559677, -0.09815074503421783, 0.08861374109983444, -0.05157714709639549, -0.13549675047397614, 0.027453964576125145, -0.023358309641480446, -0.026020705699920654, -0.01652238890528679, -0.0017636652337387204, -0.08663472533226013, 0.060991719365119934, -0.001433665631338954, -0.07765280455350876, 0.010916685685515404, 0.0037388403434306383, -0.008097093552350998, -0.004761330783367157, -0.03509196266531944, 0.025797778740525246, -0.053582921624183655, 0.0033174229320138693, -0.04033001884818077, 0.03428077697753906, 0.006707459222525358, 0.05069943889975548, 0.08020070195198059, 0.02807926945388317, 0.012830601073801517, 0.005108212120831013, 0.0682472214102745, 0.08067227900028229, -0.08299519866704941, -0.041390180587768555, 0.06293655186891556, -0.022655190899968147, 0.000996474758721888, -0.026314564049243927, 0.08683788776397705, 0.02757883071899414, -0.0027709805872291327, 0.028281809762120247, -0.018270548433065414, 0.08003634959459305, -0.121

#### 3.2.2. Text Chunk Embedding and Database Ingestion

Generate an embedding for each text chunk in the file and add it to the `faiss` index, while keeping the chunk text and metadata in the in-memory `text_db` dictionary, which is saved to a `.jsonl` file in the next step.

In [7]:
import hashlib
import json

import faiss
import numpy as np


def add_embeddings(file_path: str, segments: list[str]) -> bool:
    """Embed `segments` and add them to the index; return False if the file was already processed."""
    # Use the file's MD5 digest to detect files that were already ingested
    with open(file_path, "rb") as f:
        file_md5 = hashlib.md5(f.read()).hexdigest()
    if file_md5 in text_db["file_md5s"]:
        print(f"File already processed: {file_path} (MD5: {file_md5})")
        return False

    # Generate embeddings
    vectors = np.array([embed_fn(segment) for segment in segments])
    print("embedding:")
    print(vectors)
    index.add(vectors.astype('float32'))

    # Record each chunk together with the position of its vector in the index
    start_id = len(text_db["chunks"])
    for i, text in enumerate(segments):
        text_db["chunks"].append({
            "file_md5": file_md5,
            "text": text,
            "vector_id": start_id + i
        })

    text_db["file_md5s"].append(file_md5)
    return True

index = faiss.IndexFlatIP(embedding_dim)
text_db = {
    "file_md5s": [],  # Save file_md5s to avoid duplicates
    "chunks": []      # Save chunks
}
add_embeddings(test_file, segments)
print("-------------------------------------------------------------------")
print("text_db:")
print(text_db)


embedding:
[[ 0.15310541  0.04237456  0.12356884 ...  0.02131988  0.01291032
  -0.00546286]
 [ 0.11144318  0.09504584  0.0567933  ...  0.          0.
   0.        ]
 [ 0.1457673   0.10907461  0.08814128 ...  0.02718093  0.
   0.03284565]
 ...
 [ 0.10487412 -0.02126087  0.08551556 ...  0.          0.
   0.        ]
 [ 0.13665409  0.0698209   0.04600666 ... -0.03795466  0.
   0.        ]
 [ 0.08973876 -0.02730799  0.00867874 ... -0.0553185   0.
   0.04129887]]
-------------------------------------------------------------------
text_db:
{'file_md5s': ['338f7fd3003e6f1d59f8ee92739ed88d'], 'chunks': [{'file_md5': '338f7fd3003e6f1d59f8ee92739ed88d', 'text': '咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世界上流行范围最为广泛的饮料之一。咖啡在人类饮食中一般为日常的饮品，人们通常会为了提振精神，或在用餐和社交、阅读时饮用。咖啡原产于非洲东岸的埃塞俄比亚，咖啡起源于15-16世纪，从也门被传播至穆斯林世界，16世纪的威尼斯商人将咖啡引入意大利，随后17-18世纪由于欧洲对咖啡的需求，促使殖民者将咖啡树传播并栽种到美洲、东南亚和印度等热带地区，现今有超过70个国家种植咖啡树。未经烘焙的 咖啡生豆作为世界上最大的出口农产品，以及世界上交易量为广泛的热带农产品之一，也是发展中国家出口中最有价值的商品之一。采收的成熟咖啡果会经过剥离果肉的初步加工，再经过烘焙的工序，而成为能制作咖啡的咖啡豆。

#### 3.2.3. Knowledge Base Persistence

In [8]:
faiss.write_index(index, "index.faiss")
with open("text_db.jsonl", 'w', encoding='utf-8') as f:
    json.dump(text_db, f, ensure_ascii=False, indent=2)
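
To reuse the knowledge base in a later session, the saved files need to be loaded back. The sketch below covers the text side with the standard library and checks that chunk IDs line up with vector positions; `load_text_db` and `check_text_db` are hypothetical helper names. Despite its `.jsonl` extension, the file written above is a single JSON document, so `json.load` reads it in one call. The FAISS index itself can be restored with `faiss.read_index("index.faiss")`, and its `ntotal` attribute gives the number of stored vectors.

```python
import json


def load_text_db(path: str) -> dict:
    """Load the text store written by json.dump (one JSON document, not JSON Lines)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def check_text_db(text_db: dict, vector_count: int) -> bool:
    """Return True if chunk i carries vector_id i and the counts agree.

    `vector_count` would be `index.ntotal` for a FAISS index.
    """
    chunks = text_db["chunks"]
    if len(chunks) != vector_count:
        return False
    return all(chunk["vector_id"] == i for i, chunk in enumerate(chunks))
```

A mismatch between the two counts would suggest that the index file and the text file were saved at different times and should be rebuilt together.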

## 4. Knowledge Base-based Q&A
### 4.1. Rewrite the Retrieval Query
Determine if retrieval from the knowledge base is needed. If required, rewrite the query for retrieval. A prompt needs to be prepared to guide the model to complete the task and return results in standardized JSON format.

In [9]:
import textwrap
from datetime import datetime

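# Prompt (in Chinese) asking the model to rewrite the user's last turn into standalone
# retrieval queries and to return them as JSON: {"query": [...]}.
# An empty list means no rewriting is needed.
QUERY_REWRITE_PROMPT = textwrap.dedent("""\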
    【当前时间】
    {TIMESTAMP}

    【对话内容】
    {CONVERSATION}

    你的任务是根据上面user与assistant的对话内容，理解user意图，改写user的最后一轮对话，以便更高效地从知识库查找相关知识。具体的改写要求如下：
    1. 如果user的问题包括几个小问题，请将它们分成多个单独的问题。
    2. 如果user的问题涉及到之前对话的信息，请将这些信息融入问题中，形成一个不需要上下文就可以理解的完整问题。
    3. 如果user的问题是在比较或关联多个事物时，先将其拆分为单个事物的问题，例如‘A与B比起来怎么样’，拆分为：‘A怎么样’以及‘B怎么样’。
    4. 如果user的问题中描述事物的限定词有多个，请将多个限定词拆分成单个限定词。
    5. 如果user的问题具有**时效性（需要包含当前时间信息，才能得到正确的回复）**的时候，需要将当前时间信息添加到改写的query中；否则不加入当前时间信息。
    6. 只在**确有必要**的情况下改写，不需要改写时query输出[]。输出不超过 5 个改写问题，不要为了凑满数量而输出冗余问题。

    【输出格式】只输出 JSON ，不要给出多余内容
    ```json
    {{
    "query": ["改写问题1", "改写问题2"...]
    }}```
    """
)

query = "1675 年时，英格兰有多少家咖啡馆？"
conversation_str = f"user:\n{query}\n"
search_info_input = QUERY_REWRITE_PROMPT.format(
    TIMESTAMP=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    CONVERSATION=conversation_str
)
print(search_info_input)

【当前时间】
2025-06-25 14:54:36

【对话内容】
user:
1675 年时，英格兰有多少家咖啡馆？


你的任务是根据上面user与assistant的对话内容，理解user意图，改写user的最后一轮对话，以便更高效地从知识库查找相关知识。具体的改写要求如下：
1. 如果user的问题包括几个小问题，请将它们分成多个单独的问题。
2. 如果user的问题涉及到之前对话的信息，请将这些信息融入问题中，形成一个不需要上下文就可以理解的完整问题。
3. 如果user的问题是在比较或关联多个事物时，先将其拆分为单个事物的问题，例如‘A与B比起来怎么样’，拆分为：‘A怎么样’以及‘B怎么样’。
4. 如果user的问题中描述事物的限定词有多个，请将多个限定词拆分成单个限定词。
5. 如果user的问题具有**时效性（需要包含当前时间信息，才能得到正确的回复）**的时候，需要将当前时间信息添加到改写的query中；否则不加入当前时间信息。
6. 只在**确有必要**的情况下改写，不需要改写时query输出[]。输出不超过 5 个改写问题，不要为了凑满数量而输出冗余问题。

【输出格式】只输出 JSON ，不要给出多余内容
```json
{
"query": ["改写问题1", "改写问题2"...]
}```



Call the model interface for judgment.

In [10]:
judge_search_messages = [{"role": "user", "content": search_info_input}]

client = OpenAI(base_url=host_url, api_key=model_api_key)
search_info_res = client.chat.completions.create(
    model="default",
    messages=judge_search_messages
)

search_info_res = search_info_res.model_dump()
print(search_info_res)

{'id': 'chatcmpl-1a50b468-f25f-4d48-91d8-ff48c86e86a1', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': '```json\n{\n    "query": ["1675年 英格兰咖啡馆数量"]\n}\n```</s></s>', 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': None, 'reasoning_content': None}}], 'created': 1750834480, 'model': 'default', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': {'completion_tokens': 26, 'prompt_tokens': 330, 'total_tokens': 356, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}}


Parse the model's response as JSON.
- `query`: A list of queries to be retrieved.

In [11]:
import re

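search_info_res = search_info_res["choices"][0]["message"]["content"]
# Extract the JSON inside the ```json fence; fall back to the raw content if there is no fence
json_match = re.search(r'```json\s*(.*?)\s*```', search_info_res, re.DOTALL)
json_str = json_match.group(1) if json_match else search_info_res
search_info_res = json.loads(json_str)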

print(search_info_res)

{'query': ['1675年 英格兰咖啡馆数量']}
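
Model output does not always follow the requested format exactly, so a more defensive parser can be useful. The following is a minimal sketch, not part of the original tutorial: `extract_queries` is a hypothetical helper that strips `</s>` markers, accepts replies with or without a code fence, and returns an empty list when nothing valid can be parsed.

```python
import json
import re

FENCE = "`" * 3  # Three backticks, built here to keep this example easy to embed


def extract_queries(raw: str) -> list[str]:
    """Extract the `query` list from a model reply; return [] if parsing fails."""
    # Drop end-of-sequence markers such as </s> that appear in the sample output
    cleaned = raw.replace("</s>", "").strip()
    # Prefer the content of a fenced block; otherwise use the whole reply
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, cleaned, re.DOTALL)
    candidate = fenced.group(1) if fenced else cleaned
    # Keep only the outermost {...} span
    braces = re.search(r"\{.*\}", candidate, re.DOTALL)
    if not braces:
        return []
    try:
        parsed = json.loads(braces.group(0))
    except json.JSONDecodeError:
        return []
    queries = parsed.get("query", []) if isinstance(parsed, dict) else []
    if not isinstance(queries, list):
        return []
    return [q for q in queries if isinstance(q, str) and q.strip()]
```

Returning an empty list on failure means the pipeline falls back to answering without retrieval rather than raising an exception; whether that is the right trade-off depends on the application.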


### 4.2. Knowledge Base Retrieval

If the previous step returns a non-empty `query` list, retrieve relevant texts from the knowledge base.

The retrieval process is as follows:

1. Query Retrieval: Embed each query, use FAISS index to retrieve the `top_k` results with highest similarity, and collect all result indices into a unified list.

2. Result Deduplication: Deduplicate and sort the collected indices to eliminate duplicate retrieval results, preparing an ordered index sequence for subsequent processing.

3. Context Expansion: For each target index, dynamically determine the boundary range of its source file, expand the context window (±context_size) within file boundary limits, and generate a new index set with context included.

4. Contiguous Chunk Merging: Sort the expanded indices by position, detect runs of consecutive indices belonging to the same file, and group them into contiguous text chunk units.

5. Result Generation: Merge all text content within each group and output the complete text with structured markup.
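
The deduplication, context expansion, and merging steps above can be illustrated on plain integer indices, without FAISS or embeddings. This is a simplified sketch: `expand_and_group` and `file_ids` are hypothetical names, and it assumes that all chunks of one file occupy adjacent positions.

```python
def expand_and_group(hits: list[int], file_ids: list[str], context_size: int = 2) -> list[list[int]]:
    """Expand each hit by ±context_size within its file, then group contiguous indices.

    `file_ids[i]` identifies the source file of chunk i (chunks of a file are adjacent).
    """
    expanded = set()
    for hit in sorted(set(hits)):  # Deduplicate the retrieved indices
        lo = max(0, hit - context_size)
        hi = min(len(file_ids) - 1, hit + context_size)
        # Keep only neighbours that come from the same file as the hit
        expanded.update(i for i in range(lo, hi + 1) if file_ids[i] == file_ids[hit])

    groups: list[list[int]] = []
    for i in sorted(expanded):
        last = groups[-1][-1] if groups else None
        if last is not None and i == last + 1 and file_ids[i] == file_ids[last]:
            groups[-1].append(i)  # Extend the current contiguous run
        else:
            groups.append([i])  # Start a new group
    return groups


# Seven chunks from file "a" followed by three chunks from file "b"
file_ids = ["a"] * 7 + ["b"] * 3
print(expand_and_group([3, 2, 4], file_ids))
print(expand_and_group([6, 7], file_ids, context_size=1))
```

In the first call the three hits expand into one contiguous group covering indices 0 to 6. In the second call the hits sit on either side of a file boundary, so the result is two separate groups even though the indices are consecutive.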

In [12]:
def search_with_context(query_list: list[str], context_size: int = 2) -> str:
    # Step 1: Retrieve top_k results for each query and collect all indices
    all_indices = []
    for query in query_list:
        query_vector = np.array([embed_fn(query)]).astype('float32')
        _, indices = index.search(query_vector, top_k)
        # FAISS pads the result with -1 when fewer than top_k vectors are available
        all_indices.extend(i for i in indices[0].tolist() if i >= 0)

    # Step 2: Remove duplicate indices
    unique_indices = sorted(set(all_indices))
    print(f"Retrieved indices: {all_indices}")
    print(f"Unique indices after deduplication: {unique_indices}")
    if not unique_indices:
        return ""

    # Step 3: Expand each index with context (within same file boundaries)
    expanded_indices = set()
    file_boundaries = {}  # {file_md5: (start_idx, end_idx)}
    for target_idx in unique_indices:
        target_chunk = text_db["chunks"][target_idx]
        target_file_md5 = target_chunk["file_md5"]

        if target_file_md5 not in file_boundaries:
            file_start = target_idx
            while file_start > 0 and text_db["chunks"][file_start - 1]["file_md5"] == target_file_md5:
                file_start -= 1
            file_end = target_idx
            while (file_end < len(text_db["chunks"]) - 1 and
                   text_db["chunks"][file_end + 1]["file_md5"] == target_file_md5):
                file_end += 1
            # Cache the boundaries so each file is scanned only once
            file_boundaries[target_file_md5] = (file_start, file_end)
        else:
            file_start, file_end = file_boundaries[target_file_md5]

        # Calculate context range within file boundaries
        start = max(file_start, target_idx - context_size)
        end = min(file_end, target_idx + context_size)

        for pos in range(start, end + 1):
            expanded_indices.add(pos)

    # Step 4: Sort and merge contiguous chunks
    sorted_indices = sorted(expanded_indices)
    groups = []
    current_group = [sorted_indices[0]]
    for i in range(1, len(sorted_indices)):
        if (sorted_indices[i] == sorted_indices[i-1] + 1 and
            text_db["chunks"][sorted_indices[i]]["file_md5"] ==
            text_db["chunks"][sorted_indices[i-1]]["file_md5"]):
            current_group.append(sorted_indices[i])
        else:
            groups.append(current_group)
            current_group = [sorted_indices[i]]
    groups.append(current_group)

    # Step 5: Create merged text for each group
    result = ""
    for group_no, group in enumerate(groups, start=1):
        result += f"\n段落{group_no}:\n"
        for chunk_idx in group:
            result += text_db["chunks"][chunk_idx]["text"] + "\n"
        print(f"Merged chunk range: {group[0]}-{group[-1]}")

    return result

relevant_passages = ""
if search_info_res.get("query", []):
    relevant_passages = search_with_context(search_info_res["query"])
print(relevant_passages)

Retrieved indices: [3, 2, 4]
Unique indices after deduplication: [2, 3, 4]
Merged chunk range: 0-6

段落1:
咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世界上流行范围最为广泛的饮料之一。咖啡在人类饮食中一般为日常的饮品，人们通常会为了提振精神，或在用餐和社交、阅读时饮用。咖啡原产于非洲东岸的埃塞俄比亚，咖啡起源于15-16世纪，从也门被传播至穆斯林世界，16世纪的威尼斯商人将咖啡引入意大利，随后17-18世纪由于欧洲对咖啡的需求，促使殖民者将咖啡树传播并栽种到美洲、东南亚和印度等热带地区，现今有超过70个国家种植咖啡树。未经烘焙的 咖啡生豆作为世界上最大的出口农产品，以及世界上交易量为广泛的热带农产品之一，也是发展中国家出口中最有价值的商品之一。采收的成熟咖啡果会经过剥离果肉的初步加工，再经过烘焙的工序，而成为能制作咖啡的咖啡豆。透过不同的冲泡方式与成分比例，咖啡有浓缩咖啡、卡布奇诺和拿铁咖啡等变化。咖啡豆的品种可大致分为两种：最为普遍的小果咖啡（阿拉比卡），以及颗粒较粗且酸味较低而苦味较浓的中果咖啡（罗布斯塔）。一些争议指咖啡的种植与它环境影响有关，例如肯亚咖啡豆在移植种植后失去了独有的肯亚酸，而肯亚的原种地土壤含有较高浓度的磷酸。因此，
公平贸易咖啡与有机咖啡是一个不断扩大的市场。
传说9世纪的埃塞俄比亚的牧羊人发现并咀嚼了咖啡果实，随后将咖啡果实带给了附近修道院的僧侣，但僧侣起初不愿食用果实，并把果实扔进火里，经过火烤的咖啡果中冒出香气引来僧侣前来查看，僧侣从余烬中捞出咖啡豆，并将其磨碎溶解在热水中，这才制成了世界上第一杯咖啡。但此故事截至1671年并没有得到任何记载，因此可能是杜撰的。亦有研究认为最初栽培的咖啡源自埃塞俄比亚的哈勒尔。埃塞俄比亚的阿克苏姆王国兴盛时曾一度占据也门南部，6世纪中期，萨珊帝国攻占也门后将阿克苏姆赶出南阿拉伯半岛，可以肯定的是咖啡是从埃塞俄比亚传播到也门的。
咖啡传播到穆斯林世界后伊斯兰医学认可了咖啡的好处，认为其可以提振精神并防止酒和大麻对穆斯林的诱惑，15世纪的也门苏菲派修道院在祈祷时使用咖啡来帮助集中注意力。 16世纪初咖啡从也门的摩卡港传播到埃及，随后咖啡馆还出现在叙利亚阿勒颇，

### 4.3. Generate Final Answer
#### 4.3.1. Model Input
The model's input is a message list that represents the context history of the conversation. Each message is a dictionary containing the following fields:
- `role`: Represents the role of the message sender, which can be:
    - `user`: User message, indicating user input
    - `assistant`: Model message, indicating the model's reply
- `content`: Specific text content

The input has the following characteristics:
- Knowledge Base Search: The retrieved passages are inserted into `ANSWER_PROMPT` and provided to the model as context.
- Multi-turn Dialogue: The historical dialogue context can be preserved across turns.
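
The cells in this tutorial only pass a single user turn. To carry earlier turns into the prompt, the message list can be rendered into the same text layout. The sketch below is an assumption about how that could look: `build_conversation` is a hypothetical helper, and the `assistant:` layout simply mirrors the `user:` layout used for `conversation_str`.

```python
def build_conversation(history: list[dict]) -> str:
    """Render a message list in the role/content text layout used by the prompts."""
    parts = []
    for message in history:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"Unexpected role: {role}")
        # Each turn becomes "<role>:" on one line followed by its content
        parts.append(f"{role}:\n{message['content']}\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "Where does coffee come from?"},
    {"role": "assistant", "content": "It originated in Ethiopia."},
    {"role": "user", "content": "When did it reach Italy?"},
]
print(build_conversation(history))
```

For a single user message this produces the same string as `conversation_str`, so the result could be passed as the `CONVERSATION` field of either prompt.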

In [13]:
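# Prompt (in Chinese) instructing the model to answer from the retrieved passages,
# to say so when they are insufficient or contradictory, and not to mention
# that documents were consulted.
ANSWER_PROMPT = textwrap.dedent(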
    """\
    你是阅读理解问答专家。

    【文档知识】
    {DOC_CONTENT}

    你的任务是根据对话内容，理解用户需求，参考文档知识回答用户问题，知识参考详细原则如下：
    - 对于同一信息点，如文档知识与模型通用知识均可支撑，应优先以文档知识为主，并对信息进行验证和综合。
    - 如果文档知识不足或信息冲突，必须指出“根据资料无法确定”或“不同资料存在矛盾”，不得引入文档知识与通识之外的主观推测。

    同时，回答问题需要综合考虑规则要求中的各项内容，详细要求如下：
    【规则要求】
    * 回答问题时，应优先参考与问题紧密相关的文档知识，不要在答案中引入任何与问题无关的文档内容。
    * 回答中不可以让用户知道你查询了相关文档。
    * 回复答案不要出现'根据文档知识'，'根据当前时间'等表述。
    * 论述突出重点内容，以分点条理清晰的结构化格式输出。

    【当前时间】
    {TIMESTAMP}

    【对话内容】
    {CONVERSATION}

    直接输出回复内容即可。
    """
)

# Use `model_input` rather than `input` to avoid shadowing the built-in function
if search_info_res.get("query", []):
    model_input = ANSWER_PROMPT.format(
        DOC_CONTENT=relevant_passages,
        TIMESTAMP=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        CONVERSATION=conversation_str
    )
else:
    model_input = query

messages = [{"role": "user", "content": model_input}]
print(messages)

[{'role': 'user', 'content': "你是阅读理解问答专家。\n\n【文档知识】\n\n段落1:\n咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世界上流行范围最为广泛的饮料之一。咖啡在人类饮食中一般为日常的饮品，人们通常会为了提振精神，或在用餐和社交、阅读时饮用。咖啡原产于非洲东岸的埃塞俄比亚，咖啡起源于15-16世纪，从也门被传播至穆斯林世界，16世纪的威尼斯商人将咖啡引入意大利，随后17-18世纪由于欧洲对咖啡的需求，促使殖民者将咖啡树传播并栽种到美洲、东南亚和印度等热带地区，现今有超过70个国家种植咖啡树。未经烘焙的 咖啡生豆作为世界上最大的出口农产品，以及世界上交易量为广泛的热带农产品之一，也是发展中国家出口中最有价值的商品之一。采收的成熟咖啡果会经过剥离果肉的初步加工，再经过烘焙的工序，而成为能制作咖啡的咖啡豆。透过不同的冲泡方式与成分比例，咖啡有浓缩咖啡、卡布奇诺和拿铁咖啡等变化。咖啡豆的品种可大致分为两种：最为普遍的小果咖啡（阿拉比卡），以及颗粒较粗且酸味较低而苦味较浓的中果咖啡（罗布斯塔）。一些争议指咖啡的种植与它环境影响有关，例如肯亚咖啡豆在移植种植后失去了独有的肯亚酸，而肯亚的原种地土壤含有较高浓度的磷酸。因此，\n公平贸易咖啡与有机咖啡是一个不断扩大的市场。\n传说9世纪的埃塞俄比亚的牧羊人发现并咀嚼了咖啡果实，随后将咖啡果实带给了附近修道院的僧侣，但僧侣起初不愿食用果实，并把果实扔进火里，经过火烤的咖啡果中冒出香气引来僧侣前来查看，僧侣从余烬中捞出咖啡豆，并将其磨碎溶解在热水中，这才制成了世界上第一杯咖啡。但此故事截至1671年并没有得到任何记载，因此可能是杜撰的。亦有研究认为最初栽培的咖啡源自埃塞俄比亚的哈勒尔。埃塞俄比亚的阿克苏姆王国兴盛时曾一度占据也门南部，6世纪中期，萨珊帝国攻占也门后将阿克苏姆赶出南阿拉伯半岛，可以肯定的是咖啡是从埃塞俄比亚传播到也门的。\n咖啡传播到穆斯林世界后伊斯兰医学认可了咖啡的好处，认为其可以提振精神并防止酒和大麻对穆斯林的诱惑，15世纪的也门苏菲派修道院在祈祷时使用咖啡来帮助集中注意力。 16世纪初咖啡从也门的摩卡港传播到埃及，随后咖啡馆还出现在叙利亚阿勒颇，并于1554年在奥斯曼帝国首都伊斯坦布尔开业。1511年，由于也门麦加的宗教领袖认

#### 4.3.2. Non-Streaming Request
##### Request Model
When sending a request to the API, the main parameters are:
- `messages` (required): List of conversation messages
- `max_tokens` (optional): Maximum number of tokens to generate
- `temperature` (optional): Controls the randomness of the generated output
- `top_p` (optional): Nucleus sampling threshold

In [17]:
client = OpenAI(base_url=host_url, api_key=model_api_key)
response = client.chat.completions.create(
    model="default",
    messages=messages,
    temperature=1.0,
    max_tokens=2048,
    top_p=0.7
)
response = response.model_dump()
print(response)

{'id': 'chatcmpl-546e2fc9-9ec1-4d3d-bd2b-c800bb616556', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': '1675年时，英格兰有3000多家咖啡馆。</s></s>', 'refusal': None, 'role': 'assistant', 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': None, 'reasoning_content': None}}], 'created': 1750763172, 'model': 'default', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': {'completion_tokens': 18, 'prompt_tokens': 1978, 'total_tokens': 1996, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}}}


##### Model Output
- `content`: Final answer

In [18]:
content = response["choices"][0]["message"]["content"]
print(content)

1675年时，英格兰有3000多家咖啡馆。</s></s>
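
The raw content above ends with `</s>` markers. Whether a service emits them depends on the deployment, so the following is only a sketch of one way to clean the text before showing it to a user; `strip_end_tokens` is a hypothetical helper, not part of any library used here.

```python
def strip_end_tokens(text: str, end_token: str = "</s>") -> str:
    """Remove trailing repetitions of `end_token` and surrounding whitespace."""
    text = text.rstrip()
    while end_token and text.endswith(end_token):
        text = text[: -len(end_token)].rstrip()
    return text


print(strip_end_tokens("1675年时，英格兰有3000多家咖啡馆。</s></s>"))
```

Only trailing markers are removed, so an occurrence of `</s>` in the middle of an answer is left untouched.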


#### 4.3.3. Streaming Request
##### Request Model
When sending a request to the API, the main parameters are:
- `messages` (required): List of conversation messages
- `max_tokens` (optional): Maximum number of tokens to generate
- `temperature` (optional): Controls the randomness of the generated output
- `top_p` (optional): Nucleus sampling threshold
- `stream` (optional): Whether to return the response as a stream

In [31]:
response = client.chat.completions.create(
    model="default",
    messages=messages,
    temperature=1.0,
    max_tokens=2048,
    top_p=0.7,
    stream=True
)
response_stream = []
for chunk in response:
    if not chunk.choices:
        continue
    response_stream.append(chunk.model_dump())

print(response_stream[:3])

[{'id': 'chatcmpl-bf801edc-abce-496a-8329-0994ec9c1a5b', 'choices': [{'delta': {'content': '', 'function_call': None, 'refusal': None, 'role': 'assistant', 'tool_calls': None, 'reasoning_content': ''}, 'finish_reason': None, 'index': 0, 'logprobs': None}], 'created': 1750411440, 'model': 'default', 'object': 'chat.completion.chunk', 'service_tier': None, 'system_fingerprint': None, 'usage': None}, {'id': 'chatcmpl-bf801edc-abce-496a-8329-0994ec9c1a5b', 'choices': [{'delta': {'content': '1', 'function_call': None, 'refusal': None, 'role': None, 'tool_calls': None, 'token_ids': [4], 'reasoning_content': None}, 'finish_reason': None, 'index': 0, 'logprobs': None, 'arrival_time': 0.12418651580810547}], 'created': 1750411440, 'model': 'default', 'object': 'chat.completion.chunk', 'service_tier': None, 'system_fingerprint': None, 'usage': None}, {'id': 'chatcmpl-bf801edc-abce-496a-8329-0994ec9c1a5b', 'choices': [{'delta': {'content': '6', 'function_call': None, 'refusal': None, 'role': None,

##### Model Output
- `content`: Final answer

In [32]:
content = ""
for res in response_stream:
    # `content` may be None in some chunks (for example the final one), so default to ""
    content += res["choices"][0]["delta"]["content"] or ""
print(content)

1675年时，英格兰有3000多家咖啡馆。</s>
