# 2.5 Optimizing RAG Applications to Improve Question-Answer Accuracy  

## 🚄 Preface  

In the previous lessons, you discovered some issues with the RAG chatbot through automated evaluation. Optimizing the prompt cannot fix incorrect answers caused by inaccurate retrieval, just as it is hard to give the correct answer in an open-book exam if you brought the wrong reference book. In this section, you will gain a deeper understanding of the RAG workflow and try to improve the question-answering accuracy of your RAG application.  

## 🍁 Course Objectives

After completing this course, you will be able to:

* Gain a deeper understanding of the implementation principles and technical details of RAG
* Understand common issues with RAG applications and recommended solutions
* Improve the performance of RAG applications through hands-on case studies

## 1. Review of the Previous Content

In the previous chapter, you discovered that the Q&A bot was unable to adequately answer the question: "Which department is Zhang Wei from?" You can reproduce the issue using the following code:  

In [2]:
# Import the required dependency packages
from config.load_key import load_key
import os
import logging
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex, PromptTemplate
from llama_index.embeddings.dashscope import DashScopeEmbedding, DashScopeTextEmbeddingModels
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    SentenceWindowNodeParser,
    MarkdownNodeParser,
    TokenTextSplitter
)
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from langchain_community.llms.tongyi import Tongyi
from langchain_community.embeddings import DashScopeEmbeddings
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision, answer_correctness
from chatbot import rag
from IPython.display import display



In [3]:
# Set log level
logging.basicConfig(level=logging.ERROR)

In [4]:
# Load API key
load_key()
# Do not print the API Key to logs in production environment to avoid leakage
print(f"Your configured API Key is: {os.environ['DASHSCOPE_API_KEY'][:5] + '*' * 5}")

Your configured API Key is: sk-76*****


In [5]:
# Configure the Qwen large language model
Settings.llm = OpenAILike(
    model="qwen-plus-0919",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    is_chat_model=True
)

In [6]:
# Configure text vector model, set batch size and maximum input length
Settings.embed_model = DashScopeEmbedding(
    model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V3,
    embed_batch_size=6,
    embed_input_length=8192
)

In [7]:
# Define the question-answering function
def ask(question, query_engine):
    # Update the prompt template
    rag.update_prompt_template(query_engine=query_engine)

    # Output the question
    print('=' * 50)  # Generate a dividing line using multiplication
    print(f'🤔 Question: {question}')
    print('=' * 50 + '\n')

    # Get the answer
    response = query_engine.query(question)

    # Output the answer
    print('🤖 Answer:')
    if hasattr(response, 'print_response_stream') and callable(response.print_response_stream):
        response.print_response_stream()
    else:
        print(str(response))

    # Output reference documents
    print('\n' + '-' * 50)
    print('📚 Reference Documents:\n')
    for i, source_node in enumerate(response.source_nodes, start=1):
        print(f'Document {i}:')
        print(source_node)
        print()
    print('-' * 50)

    return response

In [8]:
query_engine = rag.create_query_engine(rag.load_index())
response = ask('Which department is Zhang Wei in?', query_engine)

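🤔 Question: Which department is Zhang Wei in?

🤖 Answer:
Based on the reference information provided, no employee named Zhang Wei was found. If you can provide more details, such as the department or other contact information, I can help you search further. If several employees share the same name, please also provide specific information so that the right person can be found.
--------------------------------------------------
📚 Reference Documents:

Document 1:
Node ID: cedac307-9961-4989-93b4-0e5d29817f0e
Text: support.  Performance Management Department Han Shan Li Fei I902 041 Human Resources Performance Specialist 13800000041
hanshan@educompany.com Establish and maintain employee performance records, regularly organize performance review meetings, coordinate feedback from all departments, define assessment procedures and standards, ensure performance
Score:  0.169


Document 2:
Node ID: 0a4d5beb-9653-4230-a702-0b77826dfa0d
Text: assessment, provide administrative management and coordination support, optimize administrative workflows.  Administration Department Qin Fei Cai Jing G705 034 Administration Administrative Specialist 13800000034
qinf@educompany.com Maintain company records and information systems, publish company notices and announcements,
Score:  0.154


--------------------------------------------------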


The issue arises because the correct reference information (document chunks) was not recalled during the retrieval phase. How can this be improved? You can start with a few simple strategies to improve retrieval.  

## 2. Initial Optimization of Retrieval Effectiveness

As mentioned in the preface, you need to ensure that the large language model (LLM) has access to the correct "reference materials" to provide accurate "answers." Therefore, you can try increasing the number of "reference materials" retrieved (increasing the number of document chunks recalled) or organizing the "knowledge points in the reference materials" into structured tables (structuring the document content). You can start with the former:

### 2.1 Allowing Large Language Models (LLMs) to Access More Reference Information

Since the knowledge base contains information about Zhang Wei's employment, you can expand the search scope and increase the probability of finding relevant information by increasing the number of document slices recalled at once. In the previous code, you only recalled 2 document slices. Now, you can increase the recall quantity to 5 and observe whether the recall performance has improved.  
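
To make the effect of the recall quantity concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The two-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `top_k` are hypothetical helper names, not part of LlamaIndex. The point is only that a relevant chunk ranked fourth is invisible when k is 2 and visible when k is 5.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical 2-D "embeddings"; real embedding vectors have many more dimensions.
chunks = [
    ("Performance Management Department: Han Shan", [0.9, 0.1]),
    ("Administration Department: Qin Fei", [0.8, 0.3]),
    ("Performance Management Department (garbled text)", [0.7, 0.4]),
    ("IT Department: Zhang Wei", [0.6, 0.5]),
    ("Administration Department: Li Han", [0.5, 0.7]),
]
query = [1.0, 0.0]

for k in (2, 5):
    texts = [text for _, text in top_k(query, chunks, k)]
    found = any("Zhang Wei" in text for text in texts)
    print(f"k={k}: Zhang Wei chunk retrieved: {found}")
```

A larger k raises the chance of including the right chunk, but every extra chunk is also passed to the LLM, which is the trade-off discussed in the rest of this section.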

#### 2.1.1 Adjusting the Code

You can configure the following settings to allow the retrieval engine to recall the top 5 most relevant document slices.

In [9]:
index = rag.load_index()
query_engine = index.as_query_engine(
    streaming=True,
    # Retrieve 5 document slices at once, default is 2
    similarity_top_k=5
)

In [10]:
response = ask('Which department is Zhang Wei in?', query_engine)

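🤔 Question: Which department is Zhang Wei in?

🤖 Answer:
Zhang Wei is an IT Specialist in the IT Department.
--------------------------------------------------
📚 Reference Documents:

Document 1:
Node ID: cedac307-9961-4989-93b4-0e5d29817f0e
Text: support.  Performance Management Department Han Shan Li Fei I902 041 Human Resources Performance Specialist 13800000041
hanshan@educompany.com Establish and maintain employee performance records, regularly organize performance review meetings, coordinate feedback from all departments, define assessment procedures and standards, ensure performance
Score:  0.169


Document 2:
Node ID: 0a4d5beb-9653-4230-a702-0b77826dfa0d
Text: assessment, provide administrative management and coordination support, optimize administrative workflows.  Administration Department Qin Fei Cai Jing G705 034 Administration Administrative Specialist 13800000034
qinf@educompany.com Maintain company records and information systems, publish company notices and announcements,
Score:  0.154


Document 3:
Node ID: c4141cbe-98bb-4503-b5d5-9ca023d7456c
Text: ...ance Management Department Nan Fei 0 ...Resources ...ance Specialist 0
Responsible for designing the performance appraisal system, organizing the implementation of and feedback on performance evaluations, writing evaluation reports, analyzing performance data to propose improvements, providing decision
Score:  0.144


Document 4:
Node ID: 44cd3101-53be-403e-b8b9-d4f0b89fe9e0
Text: Organize the preparation and follow-up evaluation of company events, ensuring that all company work proceeds smoothly.  IT Department Zhang Wei Ma Yun H802 036 IT Support IT Specialist
13800000036 zhangwei036@educompany.com Configure the company network and hardware equipment
Score:  0.138


Document 5:
Node ID: bb9cc9d3-e247-414b-8d66-86e605933d64
Text: support, ensuring the smooth operation of the human resources department.  Administration Department Li Han Cai Jing G704 033 Administration Administrative Specialist 13800000033
lih@educompany.com Purchase office equipment and supplies, register and manage company fixed assets, assist in implementing performance assess
Score:  0.121
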
---------------

As you can see, after increasing the number of recalled slices, your Q&A bot is now able to answer "Which department is Zhang Wei in?" This is because the recalled document slices now contain information about Zhang Wei and his department.

However, simply increasing the number of recalled slices is not a good solution. If this method could solve the problem, why not recall the entire knowledge base so that no information is ever missed? Doing so would not only exceed the input length limit of large language models (LLMs), but also reduce the efficiency and accuracy of the model's responses because of the excessive irrelevant information.

Moreover, there may be several colleagues named Zhang Wei in your company, which leads to another issue: when a user asks "Which department is Zhang Wei in?", the system cannot determine which Zhang Wei the user is referring to. Simply increasing the number of recalled slices might retrieve information about multiple Zhang Weis, but the system would still be unable to decide whose information to return. Therefore, we need other methods to further improve the RAG chatbot.

#### 2.1.2 Evaluate Improvement Effectiveness

To quantify the effectiveness of improvements in subsequent enhancements, you can continue to use Ragas from the previous chapter for evaluation. Suppose your company has three colleagues named Zhang Wei who work in the Teaching and Research Department, Course Development Department, and IT Department, respectively.

In [None]:
# Define evaluation function
def evaluate_result(question, response, ground_truth):
    # Get the response content
    if hasattr(response, 'response_txt'):
        answer = response.response_txt
    else:
        answer = str(response)
    # Get the retrieved context
    context = [source_node.get_content() for source_node in response.source_nodes]

    # Construct evaluation dataset
    data_samples = {
        'question': [question],
        'answer': [answer],
        'ground_truth': [ground_truth],
        'contexts': [context],
    }
    dataset = Dataset.from_dict(data_samples)

    # Evaluate using Ragas
    score = evaluate(
        dataset=dataset,
        metrics=[answer_correctness, context_recall, context_precision],
        llm=Tongyi(model_name="qwen-plus-0919"),
        embeddings=DashScopeEmbeddings(model="text-embedding-v3")
    )
    return score.to_pandas()

In [12]:
question = 'Which department is Zhang Wei in?'
ground_truth = '''There are three employees named Zhang Wei in the company:
- Zhang Wei in the Teaching and Research Department: Position is Teaching and Research Specialist, email zhangwei@educompany.com.
- Zhang Wei in the Course Development Department: Position is Course Development Specialist, email zhangwei01@educompany.com.
- Zhang Wei in the IT Department: Position is IT Specialist, email zhangwei036@educompany.com.'''

In [13]:
evaluate_result(question=question, response=response, ground_truth=ground_truth)

Evaluating: 100%|██████████| 3/3 [00:34<00:00, 11.51s/it]


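| | question | answer | ground_truth | contexts | answer_correctness | context_recall | context_precision |
|---|---|---|---|---|---|---|---|
| 0 | Which department is Zhang Wei in? | Zhang Wei is an IT Specialist in the IT Department. | There are three employees named Zhang Wei in the company:\n- Zhang Wei in the Teaching and Research Department: Position is Teaching and Research Specialist, email zhangwei@edu... | [support. \nPerformance Management Department Han Shan Li Fei I902 041 Human Resources Performance Specialist 13800000... | 0.447801 | 0.25 | 0.25 |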


As you can see, the current RAG system still does not perform well. The retrieved document chunks contain irrelevant information, and the relevant information has not been fully recalled, so the final answer covers only one of the three Zhang Weis. You need to consider other improvement strategies.  
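
To build intuition for the two retrieval metrics, the sketch below computes simplified versions of them from hand-labeled relevance flags. This is an approximation of the idea, not the Ragas implementation: Ragas asks an LLM to judge relevance and support, whereas here the 0/1 flags are supplied by hand, and the function names simply mirror the metric names.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks that hold a relevant chunk.

    relevance: 0/1 flags for the retrieved chunks, in rank order.
    """
    hits = 0
    total = 0.0
    for rank, flag in enumerate(relevance, start=1):
        if flag:
            hits += 1
            total += hits / rank  # precision@rank at a relevant position
    return total / hits if hits else 0.0

def context_recall(supported):
    """Share of ground-truth statements that the retrieved chunks support.

    supported: 0/1 flags, one per statement in the ground truth.
    """
    return sum(supported) / len(supported) if supported else 0.0

# Only the 4th of 5 retrieved chunks is relevant (as with the Zhang Wei chunk above).
print(round(context_precision([0, 0, 0, 1, 0]), 4))  # 0.25

# Hypothetical flags: relevant chunks at ranks 1, 2, 4 and 5.
print(round(context_precision([1, 1, 0, 1, 1]), 4))  # 0.8875

# Hypothetical flags: one of four ground-truth statements is supported.
print(context_recall([0, 1, 0, 0]))  # 0.25
```

Because precision@k is averaged only over relevant positions, the metric rewards placing relevant chunks near the top of the ranking, not merely including them somewhere in the result.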

### 2.2 Provide Large Language Models with More Structured Reference Information

In practical applications, the organizational structure of a document has a significant impact on retrieval performance. Imagine this: the same information is placed either in a well-structured table or scattered throughout a block of plain text. Which one would be easier to locate and understand? Clearly, the former.

The same applies to large language models (LLMs). When information originally presented in a table is converted into plain text, although no information is lost, its structure is diminished. This is akin to turning an organized drawer into a pile of scattered items: everything is still there, but it becomes less convenient to find things.  
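
As a small illustration of keeping structure, the sketch below renders a list of records as a Markdown table, so that each value stays attached to its column name. `to_markdown_table` is a hypothetical helper written for this example, and the rows reuse the three Zhang Wei entries from the ground truth above.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a Markdown pipe table with the given columns."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])

rows = [
    {"Department": "Teaching and Research Department", "Name": "Zhang Wei",
     "Email": "zhangwei@educompany.com"},
    {"Department": "Course Development Department", "Name": "Zhang Wei",
     "Email": "zhangwei01@educompany.com"},
    {"Department": "IT Department", "Name": "Zhang Wei",
     "Email": "zhangwei036@educompany.com"},
]

print(to_markdown_table(rows, ["Department", "Name", "Email"]))
```

In the flattened PDF text, a name, a department, and an email are just neighbouring words. In the table, the header row tells the model which value plays which role.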

#### 2.2.1 Rebuild Index

Markdown format is a great choice because it:

* Has a clear structure and well-defined hierarchy
* Has simple syntax that is easy to read and maintain
* Is particularly suitable for document organization in RAG chatbot scenarios

To validate the effectiveness of structured documents, the course has prepared an optimized Markdown format file. Next, you will:

1. Add this Markdown file to the docs directory
2. Rebuild the index
3. Test the improvement in retrieval performance

In [14]:
# Copy the markdown formatted employee information document to the ./docs directory
! mkdir -p ./docs/2_5
! cp ./resources/2_4/Summary_of_departments_duties_and_key_roles_contact_information.md ./docs/2_5

In [15]:
print('=' * 50)
print('📂 Loading documents...')
print('=' * 50 + '\n')

# Load documents
documents = SimpleDirectoryReader('./docs/2_5').load_data()
print('✅ Document loading completed.\n')

print('=' * 50)
print('🛠️ Rebuilding index...')
print('=' * 50 + '\n')

# Rebuild index
index = VectorStoreIndex.from_documents(
    documents
)
print('✅ Index rebuilding completed!')
print('=' * 50)

📂 Loading documents...

✅ Document loading completed.

🛠️ Rebuilding index...

✅ Index rebuilding completed!


In [16]:
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=5
)

In [17]:
response = ask('Which department is Zhang Wei in?', query_engine)

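🤔 Question: Which department is Zhang Wei in?

🤖 Answer:
Zhang Wei appears in three different departments:

1. Teaching and Research Department - Content Design, Teaching and Research Specialist
2. Course Development Department - Content Development, Course Development Specialist
3. IT Department - IT Support, IT Specialist

To find the Zhang Wei you need to contact, please confirm further based on the specific work content or department. If you have more details or a specific matter to ask about, let me know and I will provide more precise help.
--------------------------------------------------
📚 Reference Documents:

Document 1:
Node ID: c9c72bb3-439d-430d-8c2e-cbd0b6f07768
Text: Department introductions  Teaching and Research Department: educational course research and design  Course Development Department: technical content requirements development  Textbook Writing Department: compilation and revision of textbooks, exercises, and other content
Evaluation Department: content quality inspection  Marketing Department: marketing activities and promotion  Human Resources Department: human resources management  IT Department: IT technical support  Performance Management Department: staff performance appraisal design
Score:  0.501


Document 2:
Node ID: c7f58946-4a40-4736-b82a-2c079f582c4c
Text: Key role contacts in each department  | Department|Employee Name|Supervisor|Workstation|Employee ID|Role|Position|Phone|Email|Responsibilities| |
---|---|---|---|---|---|---|---|---|---| | Teaching and Research Department|Zhang Wei|Li Lin|A101|001|Content Design|Teaching and Research Specialist|13800000001|zhangwei@educompany.com|Responsible for researching and developing educational courses, analyzing teaching effectiveness, organizing lesson plans, assisting with course optimization,
and participating in the evaluation of and feedback on educational projects.| | Teaching and Research Department|Wang Fang|Li Lin|A102|002|Content Design|Teaching and Research Specialist|13800000002|wangfang@educompany.com|Responsible for developing subject teaching plans, planning teaching activities, writing lesson plans, collecting student feedback, attending course improvement meetings, and providing professional advice.| |
Teaching and Research Department|Liu Jie|Li Lin|A103|003|Con...
Score:  0.498


Document 3:
Node ID: 7a8310d8-7853-4ec8-9caa-6e4c4b6afe98
Text: com|Configure and maintain the company network and hardware equipment, monitor system status, promptly handle technical problems and failures, provide technical support and training on tool usage.| | IT Department
|Xie Yu|Ma Yun|H803|037|I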

#### 2.2.2 Evaluate Improvement Effect

You can see that your Q&A bot can accurately answer this question. You can run the Ragas evaluation again, and the evaluation data will also show that the answer accuracy has improved.

In [18]:
evaluate_result(question=question, response=response, ground_truth=ground_truth)

Evaluating: 100%|██████████| 3/3 [00:44<00:00, 14.94s/it]


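| | question | answer | ground_truth | contexts | answer_correctness | context_recall | context_precision |
|---|---|---|---|---|---|---|---|
| 0 | Which department is Zhang Wei in? | Zhang Wei appears in three different departments:\n\n1. Teaching and Research Department - Content Design, Teaching and Research Specialist\n2. Course Development Department... | There are three employees named Zhang Wei in the company:\n- Zhang Wei in the Teaching and Research Department: Position is Teaching and Research Specialist, email zhangwei@edu... | [Department introductions\n\nTeaching and Research Department: educational course research and design\n\nCourse Development Department: technical content requirements development\n\nTextbook... | 0.745308 | 1.0 | 0.8875 |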


## 3. Familiarize Yourself with the RAG Workflow

So far, you have made some improvements to increase the accuracy of the Q&A for the RAG chatbot. However, in a real production environment, the problems you may encounter go far beyond this. Previously, you have already learned about some of the RAG workflow. Here, you can review the important steps to help you identify new areas for improvement:

RAG (Retrieval Augmented Generation) is a technology that combines information retrieval and generative models, allowing it to leverage relevant information from an external knowledge base when generating answers. Its workflow can be divided into several key steps: parsing and slicing, vector storage, retrieval recall, answer generation, etc. You can refer back to the section 'Expanding the Knowledge Scope of the RAG chatbot' for specific concepts.

<img src="https://img.alicdn.com/imgextra/i4/O1CN018d8e9G1V0jDAZMRXp_!!6000000002591-0-tps-1463-997.jpg" alt="RAG Working Principle" width="700px">

Next, we will focus on each part of RAG and try to optimize its performance.

## 4. Various Stages of RAG Chatbot and Improvement Strategies

### 4.1 Document Preparation Stage

In traditional customer service systems, customer service personnel accumulate a knowledge base based on the questions raised by users and share it with other customer service staff for reference. This process is equally indispensable when building a RAG chatbot.

- **Intent Space**: We can map the needs behind user questions as points, which together form a user intent space.
- **Knowledge Space**: The knowledge points you have accumulated in the knowledge base documents constitute a knowledge space. These knowledge points can be a paragraph or a chapter.

When we project the intent space and knowledge space together, we find that there are overlaps and differences between the two spaces. These areas correspond to three optimization strategies:

1. **Overlapping Area**:
   - This refers to parts where user questions can be answered based on the content of the knowledge base, forming the foundation of ensuring the effectiveness of the RAG chatbot.
   - For these user intents, you can continuously improve the quality of responses through **optimizing content quality, engineering, and algorithms**.
2. **Uncovered Intent Space**:
   - Due to the lack of supporting content in the knowledge base, large language models (LLMs) tend to generate "hallucinations". For example, if the company has added a new "Data Analysis Department," but there are no related documents in the knowledge base, no matter how much you improve the engineering and algorithms, the RAG chatbot will not be able to accurately answer questions about it.
   - What you need to do is proactively **supplement the missing knowledge** and continually track changes in the user intent space.
3. **Unused Knowledge Space**:
   - Recalling irrelevant knowledge points may interfere with the LLM's responses.
   - Therefore, you need to **optimize the recall algorithm** to avoid recalling unrelated content. Additionally, you should periodically check the knowledge base and **remove irrelevant content**.

<img src="https://img.alicdn.com/imgextra/i1/O1CN01ZPlyjW1WQCudS8kcr_!!6000000002782-2-tps-2004-1152.png" alt="RAG Intent Space to Knowledge Space" width="700px">

Before attempting to optimize engineering or algorithms, you should prioritize building a mechanism that continuously collects user intents. By systematically gathering real user needs to enrich the knowledge base content and inviting domain experts with deep understanding of user intents to participate in effect evaluation, a closed-loop optimization process of "data collection - knowledge update - expert validation" is formed to ensure the effectiveness of the RAG chatbot.

Once you have prepared these, you can further optimize various stages of the RAG chatbot.
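
The three areas described above can be pictured with plain set operations. The sketch below is only an illustration: it assumes that user intents and knowledge points have already been reduced to comparable topic labels, which in practice takes clustering or manual tagging, and all labels and the `compare_spaces` helper are hypothetical.

```python
def compare_spaces(user_intents, knowledge_topics):
    """Split topics into the three areas of the intent/knowledge projection."""
    return {
        "overlap": user_intents & knowledge_topics,           # answerable today
        "uncovered_intents": user_intents - knowledge_topics,  # knowledge to add
        "unused_knowledge": knowledge_topics - user_intents,   # review or remove
    }

# Hypothetical topic labels for illustration only.
user_intents = {
    "employee department lookup",
    "leave policy",
    "data analysis department contacts",
}
knowledge_topics = {
    "employee department lookup",
    "leave policy",
    "old office relocation notice",
}

for area, topics in compare_spaces(user_intents, knowledge_topics).items():
    print(f"{area}: {sorted(topics)}")
```

Tracking the `uncovered_intents` set over time is one simple way to drive the "data collection - knowledge update - expert validation" loop.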

### 4.2 Document Parsing and Chunking Phase

First, the RAG application will parse the content of your document, and then slice the document content into chunks.

If the document chunks that the large language model (LLM) receives when answering questions lack key information, the response may be inaccurate. Similarly, if the document chunks contain too much irrelevant information (noise), it will also affect the quality of the response. In other words, either too little or too much information can impact the model's response effectiveness.

Therefore, when parsing and chunking documents, it is necessary to ensure that the final chunks contain complete information but do not include excessive interfering information.  

#### 4.2.1 Problem Classification and Improvement Strategies

During the document parsing and chunking phase, you may encounter the following issues:

<table border="1">
  <thead>
    <tr>
      <th>Category</th>
      <th>Subtype</th>
      <th>Improvement Strategy</th>
      <th>Scenario Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3">Document Parsing</td>
      <td>Non-uniform document types, some formats are not supported for parsing <em>e.g., SimpleDirectoryReader used earlier does not support Keynote files</em></td>
      <td>Develop a parser for the corresponding format or convert the document format</td>
      <td>For example, a company uses a large number of Keynote files to store employee information, but the existing parser does not support the Keynote format. A Keynote parser can be developed, or the files can be converted into a supported format (e.g., PDF).</td>
    </tr>
    <tr>
      <td>Within the already supported document formats, there is some special content <em>e.g., embedded tables, images, videos, etc.</em></td>
      <td>Improve the document parser</td>
      <td>For example, a document contains many tables and images, and the current parser cannot correctly extract information from the tables. The parser can be improved to handle tables and images.</td>
    </tr>
    <tr>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <td rowspan="4">Document Chunking</td>
      <td>The document contains much content with similar themes <em>e.g., in a work manual, each stage like requirements analysis, development, and release has precautions and operational guidance</em></td>
      <td>Expand document titles and subtitles <em>"Precautions" => "Requirements Analysis > Precautions"</em>, create document metadata (tagging)</td>
      <td>For example, a document contains precautions for multiple stages. When a user asks, "What are the precautions for requirements analysis?", the system returns precautions for all stages. Titles can be expanded and tagging can be used to distinguish content across different stages.</td>
    </tr>
    <tr>
      <td>Document chunks are too long, introducing excessive noise</td>
      <td>Reduce chunk length, or develop more suitable chunking strategies based on specific business needs</td>
      <td>For example, a document's chunks are too long and contain multiple unrelated topics, resulting in irrelevant information being returned during retrieval. Chunk length can be reduced to ensure that each chunk contains only one topic.</td>
    </tr>
    <tr>
      <td>Document chunks are too short, truncating useful information</td>
      <td>Increase chunk length, or develop more suitable chunking strategies based on specific business needs</td>
      <td>For example, each chunk in a document contains only one sentence, making it impossible to retrieve complete context during search. Chunk length can be increased to ensure that each chunk contains complete context.</td>
    </tr>
    <tr>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
  </tbody>
</table>
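
The title expansion strategy from the table ("Precautions" => "Requirements Analysis > Precautions") can be sketched as a small preprocessing step applied before chunking. This is a minimal sketch under two assumptions: the document is Markdown with `#`-style headings, and it contains no code blocks with lines that start with `#`. The `expand_headings` helper is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def expand_headings(markdown_text, separator=" > "):
    """Prefix every heading with the titles of its parent headings."""
    path = []  # titles from the top-level heading down to the current one
    out = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if not match:
            out.append(line)
            continue
        level = len(match.group(1))
        title = match.group(2).strip()
        del path[level - 1:]  # drop titles at this depth or deeper
        path.append(title)
        out.append(f"{match.group(1)} {separator.join(path)}")
    return "\n".join(out)

doc = """# Requirements Analysis
## Precautions
Confirm the scope with stakeholders.
# Development
## Precautions
Write tests before merging."""

print(expand_headings(doc))
```

After expansion, the two "Precautions" sections carry different titles, so a chunk that starts at either heading states which stage it belongs to.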

#### 4.2.2 Parsing PDF Files Using Model Studio

In the previous learning process, to allow you to quickly see the effects of format conversion, this course directly provided a Markdown document converted from a PDF. However, in real-world work scenarios, writing code to properly convert PDFs into Markdown is not an easy task.

In practical work, you can also use DashScopeParse provided by Model Studio to parse files in formats such as PDF and Word. Behind DashScopeParse lies Alibaba Cloud's [Document Intelligence](https://www.aliyun.com/product/ai/docmind) service, which helps you recognize images within documents and extract structured text information from files in formats like PDF and Word.  

In [None]:
from llama_index.readers.dashscope.utils import ResultType
from llama_index.readers.dashscope.base import DashScopeParse
import os
import json
import nest_asyncio

In [None]:
nest_asyncio.apply()
# Use environment variables
os.environ['DASHSCOPE_API_KEY'] = os.getenv('DASHSCOPE_API_KEY')

In [None]:
# Create a silent logger to replace the original logger
silent_logger = logging.getLogger(__name__)
# Set the log level to ERROR to avoid outputting irrelevant information.
# If you need to view more detailed log information, set it to INFO
silent_logger.setLevel(logging.ERROR)

class SilentDashScopeParse(DashScopeParse):
    def __init__(self, *args, **kwargs):
        # Replace the logger in all related modules
        import llama_index.readers.dashscope.base as base_module
        import llama_index.readers.dashscope.domain.lease_domains as lease_domains_module
        import llama_index.readers.dashscope.utils as utils_module

        base_module.logger = silent_logger
        lease_domains_module.logger = silent_logger
        utils_module.logger = silent_logger

        # Call the parent class initialization
        super().__init__(*args, **kwargs)

In [None]:
# The file is parsed into markdown text that is easy for programs and large models to process via the DashScopeParse interface.
def file_to_md(file, category_id):
    parse = SilentDashScopeParse(
        result_type=ResultType.DASHSCOPE_DOCMIND,
        category_id=category_id
    )
    documents = parse.load_data(file_path=file)
    # Initialize an empty string to store Markdown content
    markdown_content = ""
    for doc in documents:
        doc_json = json.loads(json.loads(doc.text))
        for item in doc_json["layouts"]:
            if item["text"] in item["markdownContent"]:
                markdown_content += item["markdownContent"]
            else:
                # When DashScopeParse processes, it will also parse the text information inside document images into the initial markdown text (similar to OCR). This is sufficient for command-line screenshots and text screenshots in the example files of this article. No deep parsing of images is required in this example.
                # For actual knowledge base documents, if they involve irregular, complex information in images and require a deeper understanding of the image content, you can call a vision model to further understand the meaning of the image.
                # (In the data structure returned by DashScopeParse, for image data, the markdownContent field is the image URL, and the text field is the parsed text.)
                # if ".jpg" in item["markdownContent"] or ".jpeg" in item["markdownContent"] or ".png" in item["markdownContent"]:
                #     image_url = re.findall(r'\!\[.*?\]\((https?://.*?)\)', item["markdownContent"])[0]
                #     print(image_url)
                #     markdown_content = markdown_content + parse_image_to_text(image_url)+"\n"
                # else:
                #     markdown_content = markdown_content + item["text"]+"\n"
                markdown_content = markdown_content + item["text"]+"\n"
    return markdown_content

### Example usage
# 1. Optional configuration.
# On the Bailian platform, different business spaces can be configured for different projects. By default, the default business space is used.
# If you need to use a non-default space, go to [Bailian Console - Business Space Management](https://bailian.console.aliyun.com/?admin=1#/efm/business_management), configure the business space, and obtain the Workspace ID.
# After completion, uncomment and modify this code to the actual value:
# os.environ['DASHSCOPE_WORKSPACE_ID'] = "<Your Workspace id, Default workspace is empty.>"

# 2. Optional configuration.
# When files are parsed through DashScopeParse, the uploaded data directory ID needs to be configured. Go to [Bailian Console - Data Management](https://bailian.console.aliyun.com/#/data-center), configure categories, and obtain the ID.
category_id = "default"  # It is recommended to modify this to a custom category ID for better file classification management.

md_content = file_to_md(['./docs/Summary of responsibilities and key role contact information of various departments in the content company.pdf'], category_id)
print("Parsed Markdown text:")
print("-"*100)
print(md_content)

Because files such as PDF and docx come from many different sources, parsing them into Markdown may introduce minor formatting issues. For instance, a table row that spans two pages in a PDF might be parsed into multiple rows.

Large language models can be used to refine the generated Markdown text, correcting issues such as wrong heading levels and missing information.  

In [None]:
from dashscope import Generation

In [None]:
def md_polisher(data):
    messages = [
        {
            'role': 'user',
            'content': (
                'The following text is converted from PDF to markdown, and there may be some issues with the format and content. I need you to optimize it:\n'
                '1. Directory levels: If the directory level order is incorrect, please complete or modify it in markdown format;\n'
                '2. Content errors: If there are inconsistencies in the context, please correct them;\n'
                '3. Tables: Pay attention to inconsistencies between rows;\n'
                '4. The overall output should not differ significantly from the input; do not create content on your own, I need to polish the original text;\n'
                '5. Output format requirement: Markdown text, all your responses should be placed inside a markdown file.\n'
                'Special Note: Only output the converted markdown content itself, without any other information.\n'
                'The content to be processed is: ' + data
            )
        }
    ]
    response = Generation.call(
        model="qwen-plus-0919",
        messages=messages,
        result_format='message',
        stream=True,
        incremental_output=True
    )
    result = ""
    print("Polished Markdown Text:")
    print("-"*100)
    # Print each streamed fragment as it arrives and accumulate the full text
    for chunk in response:
        print(chunk.output.choices[0].message.content, end='')
        result += chunk.output.choices[0].message.content
    return result

Through the above steps, you have successfully converted the PDF into Markdown and made some formatting corrections. At the same time, even if there are images in the document, the information in the images can also be extracted to build a knowledge base that is more conducive to search performance.  

#### 4.2.3 Using Multiple Document Chunking Methods

During the document chunking process, the chunking method can affect the effectiveness of retrieval recall. Let's understand the characteristics of different chunking methods through specific examples. First, create a general evaluation function.

In [None]:
def evaluate_splitter(splitter, documents, question, ground_truth, splitter_name):
    """Evaluate the effectiveness of different document splitting methods"""
    print(f"\n{'='*50}")
    print(f"🔍 Testing with {splitter_name} method...")
    print(f"{'='*50}\n")

    # Build index
    print("📑 Processing documents...")
    nodes = splitter.get_nodes_from_documents(documents)
    index = VectorStoreIndex(nodes, embed_model=Settings.embed_model)

    # Create query engine
    query_engine = index.as_query_engine(
        similarity_top_k=5,
        streaming=True
    )

    # Execute query
    print(f"\n❓ Test question: {question}")
    print("\n🤖 Model response:")
    response = query_engine.query(question)
    response.print_response_stream()

    # Output reference snippets
    print(f"\n📚 Reference snippets recalled by {splitter_name}:")
    for i, node in enumerate(response.source_nodes, 1):
        print(f"\nDocument snippet {i}:")
        print("-" * 40)
        print(node)

    # Evaluate results
    print(f"\n📊 Evaluation results for {splitter_name}:")
    print("-" * 40)
    display(evaluate_result(question, response, ground_truth))

Next, let's look at the characteristics and examples of various slicing methods:

#### 4.2.3.1 Token Slicing

Suitable for scenarios with strict requirements on the number of tokens, such as when using models with smaller context lengths.

Example text: "LlamaIndex is a powerful RAG (Retrieval-Augmented Generation) framework. It provides various document processing methods. Users can choose the appropriate method based on their needs."

Possible results after applying token slicing (chunk_size=10):

* Slice 1: "LlamaIndex is a powerful RAG"
* Slice 2: "framework. It provides various doc"
* Slice 3: "ument processing methods. Users can"

In [None]:
token_splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=20
)
evaluate_splitter(token_splitter, documents, question, ground_truth, "Token")
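
To show what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of fixed-size slicing with overlap. It is not the `TokenTextSplitter` implementation: as a simplifying assumption it treats whitespace-separated words as tokens, whereas a real splitter counts tokens produced by a model tokenizer and can therefore cut inside a word, as in the example slices above. `token_chunks` is a hypothetical helper.

```python
def token_chunks(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size tokens that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: one whitespace-separated word = one token
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

text = ("LlamaIndex is a powerful RAG framework. "
        "It provides various document processing methods.")
for chunk in token_chunks(text, chunk_size=5, chunk_overlap=1):
    print(repr(chunk))
```

The overlap repeats the tail of one slice at the head of the next, which reduces the chance that a fact is cut in half at a slice boundary, at the cost of some duplicated text in the index.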

#### 4.2.3.2 Sentence Slicing

This is the default slicing strategy, which maintains the integrity of sentences.

The same text after sentence slicing:

* Slice 1: "LlamaIndex is a powerful RAG framework."
* Slice 2: "It provides various document processing methods."
* Slice 3: "Users can choose the appropriate method based on their needs."

In [None]:
sentence_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)
evaluate_splitter(sentence_splitter, documents, question, ground_truth, "Sentence")
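
The idea behind sentence slicing can be sketched as "split into sentences, then pack whole sentences into chunks up to a size limit". The sketch below makes two simplifying assumptions that differ from `SentenceSplitter`: sentences are detected with a simple punctuation regex, and size is measured in characters instead of tokens. Both helper functions are hypothetical.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_chunks(text, max_chars):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("LlamaIndex is a powerful RAG framework. "
        "It provides various document processing methods. "
        "Users can choose the appropriate method based on their needs.")

for chunk in sentence_chunks(text, max_chars=50):
    print(repr(chunk))
```

With a small limit each sentence becomes its own slice, as in the example above. With a larger limit several sentences share a slice, but no sentence is ever split in the middle.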

#### 4.2.3.3 Sentence Window Slicing

Each slice contains surrounding sentences as the context window.

Example text after using sentence window slicing (window_size=1):

* Slice 1: "LlamaIndex is a powerful RAG framework." Context: "It provides various document processing methods."
* Slice 2: "It provides various document processing methods." Context: "LlamaIndex is a powerful RAG framework. Users can choose the appropriate method based on their needs."
* Slice 3: "Users can choose the appropriate method based on their needs." Context: "It provides various document processing methods."
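The windowing idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: sentences are split with a simple regular expression, the helper name `sentence_windows` is hypothetical, and in this sketch the stored window includes the sentence itself together with its neighbors.

```python
import re


def sentence_windows(text, window_size=1):
    """Return one record per sentence together with its surrounding window.

    Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        before = sentences[max(0, i - window_size):i]
        after = sentences[i + 1:i + 1 + window_size]
        records.append({
            # The single sentence is what gets embedded and matched.
            "original_text": sentence,
            # The wider window is what would be handed to the LLM afterwards.
            "window": " ".join(before + [sentence] + after),
        })
    return records


text = (
    "LlamaIndex is a powerful RAG framework. "
    "It provides various document processing methods. "
    "Users can choose the appropriate method based on their needs."
)
records = sentence_windows(text, window_size=1)
for record in records:
    print(record["original_text"], "=>", record["window"])
```

Matching on a single sentence keeps the embedding focused, while replacing it with the window at answer time gives the model enough context to respond.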

In [None]:
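sentence_window_splitter = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

# Note: sentence window slicing requires a special post-processor that replaces each
# retrieved sentence with its surrounding window. evaluate_splitter builds its own
# query engine without one, so the index and query engine are built here directly.
nodes = sentence_window_splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, embed_model=Settings.embed_model)
query_engine = index.as_query_engine(
    similarity_top_k=5,
    streaming=True,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")]
)

# Execute the query and evaluate the result
response = query_engine.query(question)
response.print_response_stream()
print("\n📊 Evaluation results for Sentence Window:")
display(evaluate_result(question, response, ground_truth))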

#### 4.2.3.4 Semantic Chunking

Adaptively select chunk points based on semantic relevance.

Example text: "LlamaIndex is a powerful RAG framework. It provides various document processing methods. Users can choose the appropriate method according to their needs. Additionally, it supports vector-based retrieval. This retrieval method is highly efficient."

Possible results of semantic chunking:

* Chunk 1: "LlamaIndex is a powerful RAG framework. It provides various document processing methods. Users can choose the appropriate method according to their needs."
* Chunk 2: "Additionally, it supports vector-based retrieval. This retrieval method is highly efficient." (Note that this is grouped by semantic relevance.)
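The sketch below shows roughly how a percentile-based breakpoint can decide where one chunk ends and the next begins. It is an assumption-laden illustration rather than the library's code: the 2-dimensional sentence vectors are made up by hand to stand in for real embeddings, and `percentile` and `semantic_chunks` are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, vectors, breakpoint_percentile=95):
    """Start a new chunk wherever adjacent sentences are unusually far apart."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "LlamaIndex is a powerful RAG framework.",
    "It provides various document processing methods.",
    "Users can choose the appropriate method according to their needs.",
    "Additionally, it supports vector-based retrieval.",
    "This retrieval method is highly efficient.",
]
# Hand-made toy vectors: the first three point one way, the last two another.
vectors = [[0.9, 0.1], [0.85, 0.2], [0.8, 0.25], [0.2, 0.9], [0.15, 0.95]]

chunks = semantic_chunks(sentences, vectors)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
```

With these toy vectors the only distance above the threshold is the one between the third and fourth sentences, so the text splits into the same two chunks as in the example above. A lower percentile would produce more, smaller chunks.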

In [None]:
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=Settings.embed_model
)
evaluate_splitter(semantic_splitter, documents, question, ground_truth, "Semantic")

#### 4.2.3.5 Markdown Chunking

A chunking method specifically optimized for Markdown documents.

Example Markdown text:

```markdown
# RAG Framework
LlamaIndex is a powerful RAG framework.

## Features
- Provides multiple document processing methods
- Supports vector retrieval
- Easy and convenient to use

### Detailed Description
Users can choose the appropriate method based on their needs.
```

Markdown chunking divides the document along its heading levels (`\n` marks a line break inside a slice):

* Slice 1: `"# RAG Framework\nLlamaIndex is a powerful RAG framework."`
* Slice 2: `"## Features\n- Provides multiple document processing methods\n- Supports vector retrieval\n- Easy and convenient to use"`
* Slice 3: `"### Detailed Description\nUsers can choose the appropriate method based on their needs."`
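Heading-based splitting can be approximated with a short loop. The following is a simplified sketch, not how `MarkdownNodeParser` is implemented: the helper `split_markdown_by_headings` is hypothetical, and it ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re


def split_markdown_by_headings(markdown_text):
    """Start a new chunk at every heading line ('#' through '######')."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


markdown_text = "\n".join([
    "# RAG Framework",
    "LlamaIndex is a powerful RAG framework.",
    "",
    "## Features",
    "- Provides multiple document processing methods",
    "- Supports vector retrieval",
    "- Easy and convenient to use",
    "",
    "### Detailed Description",
    "Users can choose the appropriate method based on their needs.",
])
md_chunks = split_markdown_by_headings(markdown_text)
for i, chunk in enumerate(md_chunks, 1):
    print(f"Slice {i}:\n{chunk}\n")
```

Because every slice begins with its own heading, the heading text travels with the content it describes, which tends to help retrieval on well-structured documents.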

In [None]:
markdown_splitter = MarkdownNodeParser()
evaluate_splitter(markdown_splitter, documents, question, ground_truth, "Markdown")

In practical applications, there's no need to overthink the choice of chunking method. You can consider it this way:

* If you are new to RAG, start with the default sentence chunking method, which provides good results in most scenarios.
* When you find that the retrieval results are not ideal, you can try:
  * Handling long documents and needing to maintain context? Try sentence window chunking.
  * Is the document logically structured and highly specialized? Semantic chunking may be helpful.
  * Is the model always reporting that token limits are exceeded? Token chunking gives you precise control over chunk size.
  * Processing Markdown documents? Don't forget there's dedicated Markdown chunking.

There is no best chunking method, only the one most suitable for your scenario. You can experiment with different chunking methods, observe the Ragas evaluation results, and find the solution that best fits your needs. The learning process is all about constant trial and adjustment!

### 4.3 Vectorization and Storage Phase for Slices

After document slicing, you also need to index the slices for subsequent retrieval. A common approach is to use an embedding model to vectorize the slices and store them in a vector database.

In this phase, you need to choose an appropriate embedding model and vector database, which is crucial for improving retrieval performance.

#### 4.3.1 Understanding Embedding and Vectorization

An embedding model converts text into high-dimensional vectors that represent its semantics. Similar texts are mapped to nearby vectors, so during retrieval, documents with high similarity can be found based on the vector representation of the query.

_A directed line segment in a plane coordinate system is a 2-dimensional vector. For example, the directed line segment from the origin (0, 0) to point A (xa, ya) can be called vector A. The smaller the angle between vector A and vector B, the higher their similarity._

<img src="https://img.alicdn.com/imgextra/i4/O1CN01wKAL7C1bhDgbxr2Aa_!!6000000003496-0-tps-1556-1382.jpg" width="400" >

In [None]:
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example vectors
a = np.array([0.2, 0.8])
b = np.array([0.3, 0.7])
c = np.array([0.8, 0.2])

print(f"Cosine similarity between A and B: {cosine_similarity(a, b)}")
print(f"Cosine similarity between B and C: {cosine_similarity(b, c)}")

#### 4.3.2 Selecting the Appropriate Embedding Model

Different embedding models can produce very different vectors for the same text. Generally, newer embedding models perform better. For example, in the previous section, we used the text-embedding-v2 model provided by Alibaba Cloud's Bailian platform. If you switch to a newer version, [text-embedding-v3](https://help.aliyun.com/zh/model-studio/user-guide/embedding), you will notice that even without performing the earlier optimizations, the retrieval performance still improves to some extent.

For instance, running the following code shows that different versions of the embedding model yield different similarity scores between the question "Which department is Zhang Wei in?" and the same document chunks.

In [None]:
def compare_embeddings(query, chunks, embedding_models):
    """Compare text similarity across different embedding models

    Args:
        query: Query text
        chunks: List of text chunks to compare
        embedding_models: Dictionary of embedding models, format {model_name: model_instance}
    """
    # Print input texts
    print(f"Query: {query}")
    for i, chunk in enumerate(chunks, 1):
        print(f"Text {i}: {chunk}")

    # Calculate and display similarity results for each model
    for model_name, model in embedding_models.items():
        print(f"\n{'='*20} {model_name} {'='*20}")
        query_embedding = (model.get_query_embedding(query) if hasattr(model, 'get_query_embedding')
                           else model.get_text_embedding(query))
        for i, chunk in enumerate(chunks, 1):
            chunk_embedding = model.get_text_embedding(chunk)
            similarity = cosine_similarity(query_embedding, chunk_embedding)
            print(f"Similarity between query and text {i}: {similarity:.4f}")

# Prepare test data
query = "Which department is Zhang Wei in?"
chunks = [
    "Core, providing administrative management and coordination support, optimizing administrative workflows. Administrative Department, Qin Fei, Cai Jing, G705, 034, Administrative, Administrative Specialist, 13800000034, qinf@educompany.com, maintaining company archives and information systems, responsible for company notices and announcements.",
    "Organizing the preliminary preparation and post-assessment of company activities, ensuring smooth progress of all company tasks. IT Department, Zhang Wei, Ma Yun, H802, 036, IT Support, IT Specialist, 13800000036, zhangwei036@educompany.com, configuring company networks and hardware devices."
]

# Define embedding models to be tested
embedding_models = {
    "text-embedding-v2": DashScopeEmbedding(model_name="text-embedding-v2"),
    "text-embedding-v3": DashScopeEmbedding(model_name="text-embedding-v3")
}

# Perform comparison
compare_embeddings(query, chunks, embedding_models)

In addition to evaluating the performance of different Embedding models through similarity comparisons, you can also assess them from a practical application perspective. Below, you will use the Ragas evaluation tool to compare the actual performance of the text-embedding-v2 and text-embedding-v3 models within a RAG chatbot.

By running the following code, you can clearly see that under the same RAG chatbot strategy, the overall performance of the text-embedding-v3 model is better than that of text-embedding-v2. Let's take a look at the specific evaluation process and results:

In [None]:
def compare_embedding_models(documents, question, ground_truth, sentence_splitter):
    """Compare the performance of different embedding models in RAG

    Args:
        documents: List of documents
        question: Query question
        ground_truth: Standard answer
        sentence_splitter: Text splitter
    """
    # Document splitting
    print("📑 Processing documents...")
    nodes = sentence_splitter.get_nodes_from_documents(documents)

    # Define the embedding model configurations to be tested
    embedding_models = {
        "text-embedding-v2": DashScopeEmbedding(
            model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2
        ),
        "text-embedding-v3": DashScopeEmbedding(
            model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V3,
            embed_batch_size=6,
            embed_input_length=8192
        )
    }

    # Test each model
    for model_name, embed_model in embedding_models.items():
        print(f"\n{'='*50}")
        print(f"🔍 Testing {model_name}...")
        print(f"{'='*50}")

        # Build index and query engine
        index = VectorStoreIndex(nodes, embed_model=embed_model)
        query_engine = index.as_query_engine(streaming=True, similarity_top_k=5)

        # Execute query
        print(f"\n❓ Test question: {question}")
        print("\n🤖 Model response:")
        response = query_engine.query(question)
        response.print_response_stream()

        # Display recalled document fragments
        print(f"\n📚 Recalled reference fragments:")
        for i, node in enumerate(response.source_nodes, 1):
            print(f"\nDocument fragment {i}:")
            print("-" * 40)
            print(node)

        # Evaluate results
        print(f"\n📊 Evaluation results for {model_name}:")
        print("-" * 40)
        evaluation_score = evaluate_result(question, response, ground_truth)
        display(evaluation_score)

# Prepare test data
documents = SimpleDirectoryReader('./docs/2_5').load_data()
sentence_splitter = SentenceSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# Perform comparison
compare_embedding_models(
    documents=documents,
    question=question,
    ground_truth=ground_truth,
    sentence_splitter=sentence_splitter
)

You can see that:

*   Newer versions of Embedding models generally yield better results (e.g., text-embedding-v3 performs better than v2).
*   In practice, simply upgrading the Embedding model can significantly improve retrieval quality.
*   We recommend you first try the latest text-embedding-v3 model, which delivers good performance across most tasks. Meanwhile, you can keep an eye on updates to DashScopeEmbedding models and choose to upgrade to a higher-performing version based on your actual needs.

#### 4.3.3 Choosing the Right Vector Database

When building a RAG chatbot, you have multiple vector storage options to choose from, ranging from simple to complex:

##### 4.3.3.1 In-memory Vector Storage

The simplest approach is to use the vector database built into LlamaIndex. Simply install the llama-index package, and with no additional configuration, you can quickly develop and test your RAG chatbot:

In [None]:
from llama_index.core import VectorStoreIndex

# Create in-memory vector index
index = VectorStoreIndex.from_documents(documents)

The advantage is that it is quick to get started and well suited to development and testing; the disadvantage is that the data is held in memory, is not persisted by default, and is limited by memory size.

##### 4.3.3.2 Local Vector Database

When the data volume increases, you can use open-source vector databases such as Milvus or Qdrant. These databases provide data persistence and efficient retrieval capabilities.

The advantage is that the functionality is complete and highly controllable; the disadvantage is that it requires self-deployment and maintenance.

##### 4.3.3.3 Cloud Service Vector Storage

For production environments, it is recommended to use vector storage capabilities provided by cloud services. Alibaba Cloud offers multiple options:

*   **Vector Retrieval Service (DashVector)**: Pay-as-you-go, automatic scaling, suitable for quickly starting projects. For detailed functionalities, please refer to [Vector Retrieval Service (DashVector)](https://www.aliyun.com/product/ai/dashvector).
*   **Vector Retrieval Service Milvus Edition**: Compatible with open-source Milvus, making it convenient to migrate existing applications. For detailed functionalities, please refer to [Vector Retrieval Service Milvus Edition](https://www.aliyun.com/product/milvus).
*   **Vector Capabilities of Existing Databases**: If you are already using Alibaba Cloud databases (RDS, PolarDB, etc.), you can directly utilize their vector functionalities.

The advantages of cloud services include:

*   No need to worry about operations and maintenance, automatic scaling.
*   Comprehensive monitoring and management tools are provided.
*   Pay-as-you-go, cost control.
*   Support for hybrid retrieval of vectors + scalars, improving retrieval accuracy.

Recommendations:

1.  Use in-memory vector storage during development and testing.
2.  Small-scale applications can use local vector databases.
3.  For production environments, it is recommended to use cloud services, and choose the appropriate service type based on specific needs.

<figure align="center">
  <img src="https://img.alicdn.com/imgextra/i4/O1CN01ked0xy1y2DM02aWzJ_!!6000000006520-0-tps-1942-932.jpg" width="600"/>
  <figcaption style="color: #999">DashVector supports tag filtering such as age, name, etc., + vector similarity search.</figcaption>
</figure>

### 4.4 Retrieval Recall Phase

The main issue encountered during the retrieval phase is the difficulty in finding, from a large number of document slices, the fragment that is most relevant to the user's question and contains the correct answer information.

From the perspective of intervention timing, solutions can be divided into two main categories:

1. Before executing the retrieval, many user queries are incomplete or even ambiguous. You need to find ways to reconstruct the user's true intent to improve retrieval effectiveness.
2. After executing the retrieval, you may discover some irrelevant information that needs to be filtered out to avoid interference with the subsequent answer generation.

<table border="1">
  <thead>
    <tr>
      <th>Timing</th>
      <th>Improvement Strategy</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7">Before Retrieval</td>
      <td>Question Rewriting</td>
      <td>"Are there any good restaurants nearby?" => "Please recommend a few highly-rated restaurants near me."</td>
    </tr>
    <tr>
      <td>Question Expansion <em>Adding more information to make the search results more comprehensive</em></td>
      <td>"Which department does Zhang Wei belong to?" => "Which department does Zhang Wei belong to? What are his contact details, responsibilities, and work objectives?"</td>
    </tr>
    <tr>
      <td>Context Expansion Based on User Profile <em>Expanding the question based on user information and behavior data</em></td>
      <td>Content Engineer asks "Work Precautions" => "What precautions should a content engineer take at work?" Project Manager asks "Work Precautions" => "What precautions should a project manager take at work?"</td>
    </tr>
    <tr>
      <td>Tag Extraction <em>Extracting tags for subsequent tag filtering + vector similarity search</em></td>
      <td>"What precautions should a content engineer take at work?" => <ul><li>Tag Filtering: {"Position": "Content Engineer"}</li><li>Vector Search: "What precautions should a content engineer take at work?"</li></ul></td>
    </tr>
    <tr>
      <td>Ask the User</td>
      <td>"What are the job responsibilities?" => Large Language Model (LLM) asks back: "May I ask which position’s job responsibilities you want to know about?" <em>Prompt examples for asking back can be found here:</em><a href="https://help.aliyun.com/zh/model-studio/use-cases/create-an-ai-shopping-assistant">Build an AI Shopping Assistant in 10 Minutes</a></td>
    </tr>
    <tr>
      <td>Think and Plan Multiple Searches</td>
      <td>"Zhang Wei is not available, who can I contact?" => LLM thinks and plans: => task_1: What are Zhang Wei's responsibilities, task_2: Who else has ${task_1_result} responsibilities => Execute multiple searches in sequence.</td>
    </tr>
    <tr>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <td rowspan="3">After Retrieval</td>
      <td>Reranking + Filtering <em>Most vector databases consider efficiency and sacrifice some accuracy; the retrieved slices may contain items with low relevance.</em></td>
      <td>chunk1, chunk2..., chunk10 => chunk2, chunk4, chunk5</td>
    </tr>
    <tr>
      <td>Sliding Window Retrieval <em>After retrieving a slice, supplement it with several adjacent slices before and after. This is because adjacent slices often have semantic connections, and looking at a single slice might lose important information.</em> <em>Sliding window retrieval ensures that semantic connections between texts are not lost due to excessive segmentation.</em></td>
      <td>A common implementation is sentence sliding windows. You can understand it using the simplified form below: Assume the original text is ABCDEFG (each letter represents a sentence). When the retrieved slice is D, after supplementing adjacent slices, it becomes BCDEF (taking 2 slices before and after). Here, BC and EF are the context of D. For example:<ul><li>BC may contain background information explaining D</li><li>EF may contain subsequent developments or results of D</li><li>These contextual pieces of information help you understand the full meaning of D more accurately</li></ul>By recalling these related context slices, you can improve the accuracy and completeness of the retrieval results.</td>
    </tr>
    <tr>
      <td>...</td>
      <td>...</td>
    </tr>
  </tbody>
</table>
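The ABCDEFG example from the sliding window row of the table can be written as a tiny function. This is a minimal sketch; the helper name `expand_with_neighbors` is hypothetical and the slices are single letters standing in for sentences.

```python
def expand_with_neighbors(slices, hit_index, window=2):
    """Return the retrieved slice together with `window` neighbors on each side."""
    start = max(0, hit_index - window)               # do not run past the beginning
    end = min(len(slices), hit_index + window + 1)   # do not run past the end
    return slices[start:end]


slices = list("ABCDEFG")  # each letter represents one sentence
expanded = expand_with_neighbors(slices, slices.index("D"), window=2)
print("".join(expanded))  # prints BCDEF
```

Near the start or end of a document the window is simply cut short, so a hit on A with `window=2` returns ABC.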

#### 4.4.1 Question Rewriting

🤔 **Why is question rewriting necessary?**

Imagine you are searching for keywords like "Find Zhang Wei" or "Zhang Wei Department." It seems simple, but for a RAG system, such scattered search terms might be difficult to handle. This is because, in real-world scenarios, there may be multiple colleagues named Zhang Wei, and the keywords entered by users are often too simplistic, lacking necessary contextual information.

In [None]:
question = "Find Zhang Wei"

✨ **What can question rewriting bring?**

Question rewriting helps the system better understand user intent. For example, when you ask "Find Zhang Wei," the system can rewrite the question into a more complete form, such as "Please tell me all employees named Zhang Wei in the company and their departments." Such rewriting not only improves the accuracy of retrieval but also makes the answers more comprehensive.

Next, you can experience different question rewriting strategies through practical examples. In this case, you will use the following configuration:

* Document: Markdown format
* Chunking: Default sentence chunking strategy
* Model: text-embedding-v3
* Storage: Default vector storage

In [None]:
# Configure embedding model
Settings.embed_model = DashScopeEmbedding(
    model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V3,
    embed_batch_size=6,
    embed_input_length=8192
)

# Load documents
documents = SimpleDirectoryReader('./docs/2_5').load_data()

# Configure document splitter
sentence_splitter = SentenceSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

# Document splitting
sentence_nodes = sentence_splitter.get_nodes_from_documents(documents)

# Build index
sentence_index = VectorStoreIndex(sentence_nodes, embed_model=Settings.embed_model)

**[Conventional Method: Direct Retrieval without Rewriting the Question]**

Before you attempt to rewrite the question, take a look at the results of using the original question for retrieval. This comparison will give you a more intuitive sense of the improvements that question rewriting can bring:

In [None]:
# Create query engine
query_engine = sentence_index.as_query_engine(
    streaming=True,
    similarity_top_k=5
)

# Execute query
print(f"❓ User question: {question}\n")
streaming_response = query_engine.query(question)
print("\n💭 AI Response:")
print("-" * 40)
streaming_response.print_response_stream()
print("\n")

# Display reference documents
print("\n📚 Reference Sources:")
print("-" * 40)
for i, node in enumerate(streaming_response.source_nodes, 1):
    print(f"\nDocument snippet {i}:")
    print(f"Relevance score: {node.score:.4f}")
    print("-" * 30)
    print(node.text)

# Evaluate results
print("\n📊 Response Quality Evaluation:")
print("-" * 40)
evaluation_score = evaluate_result(question, streaming_response, ground_truth)
display(evaluation_score)

After running this code, you may find the results less than ideal. Although the system retrieved five relevant snippets, it did not find all the information about 'Zhang Wei.' Why is that?

The issue lies in the way the question was asked. When a user asks, 'Which department is Zhang Wei in?' the question is easy for a person to understand but lacks important context for a large language model (LLM): there is more than one Zhang Wei in the company! This is similar to going to a school with multiple teachers named Wang and asking, 'Which office is Teacher Wang in?' Someone is bound to ask back, 'Which Teacher Wang are you referring to?'

So, what if we made the question more complete? For example, by clearly stating that you want to know the department information of 'all colleagues named Zhang Wei in the company.' Next, you can try using an LLM to rephrase the question and see if the results improve.

**[Method 1: Using Large Language Models to Expand User Questions]**

You can let the large language model (LLM) act as a question-rewriting assistant. It will help you rewrite simple questions to make them more complete and clear. For example, it will not only consider the possibility of multiple individuals named Zhang Wei but also supplement all related contextual information. Here’s how to do it specifically:

In [None]:
query_gen_str = """System role setting:
You are a professional question rewriting assistant. Your task is to expand the user's original question into a more complete and comprehensive question.

Rules:
1. Integrate possible ambiguities, related concepts, and contextual information into a complete question
2. Use parentheses to supplement explanations for ambiguous concepts
3. Add key qualifiers and modifiers
4. Ensure that the rewritten question is clear and semantically complete
5. For vague concepts, list the main possibilities in parentheses

Original question:
{query}

Please generate a comprehensive rewritten question, ensuring:
- Contains the core intent of the original question
- Covers possible interpretations of ambiguities
- Uses clear logical connectives to link different aspects
- When necessary, use parentheses to provide supplementary explanations

Output format:
[Comprehensive rewrite] - The rewritten question
"""
query_gen_prompt = PromptTemplate(query_gen_str)

In [None]:
def generate_queries(query: str):
    response = Settings.llm.predict(
        query_gen_prompt, query=query
    )
    return response

In [None]:
# Generate extended queries
print("\n🔍 Original question:")
print(f"   {question}")
query = generate_queries(question)
print("\n📝 Extended queries:")
print(f"   {query}\n")

# Create query engine
query_engine = sentence_index.as_query_engine(
    streaming=True,
    similarity_top_k=5
)

# Execute query
response = query_engine.query(query)
print("💭 AI Response:")
print("-" * 40)
response.print_response_stream()
print("\n")

# Display reference documents
print("\n📚 Reference sources:")
print("-" * 40)
for i, node in enumerate(response.source_nodes, 1):
    print(f"\nDocument snippet {i}:")
    print(f"Relevance score: {node.score:.4f}")
    print("-" * 30)
    print(node.text)

# Evaluate results
print("\n📊 Response quality evaluation:")
print("-" * 40)
evaluation_score = evaluate_result(query, response, ground_truth)
display(evaluation_score)

Running the code above, you will find that questions rewritten by large language models (LLMs) can achieve better retrieval results. However, for some complex questions, rewriting alone may not be sufficient.
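One practical detail in the rewriting flow: the prompt asks the model to answer in the form `[Comprehensive rewrite] - ...`, so that label is passed to the retriever along with the question itself. The sketch below shows one way to strip it before querying. The helper `clean_rewritten_query` is hypothetical, and it assumes the model follows the requested format; if nothing usable remains, it falls back to the original question.

```python
import re


def clean_rewritten_query(llm_output, fallback):
    """Remove the '[Comprehensive rewrite] -' label from the rewritten question.

    Returns `fallback` (the original question) if the output is empty after cleaning.
    """
    text = re.sub(
        r"^\s*\[Comprehensive rewrite\]\s*[-:]?\s*",
        "",
        llm_output.strip(),
        flags=re.IGNORECASE,
    )
    return text.strip() or fallback


raw_output = "[Comprehensive rewrite] - Which departments do all employees named Zhang Wei belong to?"
cleaned = clean_rewritten_query(raw_output, fallback="Find Zhang Wei")
print(cleaned)
```

Whether stripping the label changes retrieval quality depends on the embedding model, so it is worth comparing the Ragas scores with and without this step.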

**[Method 2: Rewriting a Single Query into Multi-step Queries]**

In addition to rewriting the question, you can also try another approach: breaking down complex questions into simpler steps. LlamaIndex provides two tools for this:

* StepDecomposeQueryTransform: This tool breaks a complex question down into multiple sub-questions. For example, for "Which department does Zhang Wei belong to?", it might first decompose it into:
  1. "How many employees named Zhang Wei are there in the company?"
  2. "Which departments do these Zhang Weis belong to?"

  This allows for a more comprehensive retrieval of all information about Zhang Wei.
* MultiStepQueryEngine: This query engine processes these sub-questions sequentially. It first retrieves information on all Zhang Weis in the company, then queries the department information for each Zhang Wei, and finally integrates the answers into a complete response, informing the user: "There are three Zhang Weis in the company, working in the Teaching and Research Department, Course Development Department, and IT Department, respectively."

This method is like solving a math problem: breaking down a big problem into smaller ones often makes it easier to arrive at an accurate answer. However, note that this method involves multiple calls to large language models (LLMs), so it consumes more tokens.

In [None]:
from llama_index.core.indices.query.query_transform.base import (
    StepDecomposeQueryTransform,
)
from llama_index.core.query_engine import MultiStepQueryEngine

# Set logging to DEBUG for more detailed outputs
step_decompose_transform = StepDecomposeQueryTransform(verbose=True)

query_engine = sentence_index.as_query_engine(streaming=True, similarity_top_k=5)
query_engine = MultiStepQueryEngine(
    query_engine=query_engine,
    query_transform=step_decompose_transform,
    index_summary="Company personnel information"
)

print(f"❓ User question: {question}\n")
print("🤖 AI is performing multi-step query...")
response = query_engine.query(question)

print("\n📚 Reference basis:")
print("-" * 40)
for i, node in enumerate(response.source_nodes, 1):
    print(f"\nDocument fragment {i}:")
    print("-" * 30)
    print(node.text)

# Evaluation results
print("\n📊 Multi-step query evaluation results:")
print("-" * 40)
evaluation_score = evaluate_result(question, response, ground_truth)
display(evaluation_score)

In this way, the system will first understand the overall goal of the question and then break it down into several small steps to solve one by one. For example, for the question 'Which department is Zhang Wei in?', the system might first find all the individuals named Zhang Wei and then query their department information separately.  

**[Method 3: Enhance Retrieval with Hypothetical Documents (HyDE)]**

The previous methods have all been about adjusting the question itself. Now, let's try a different approach: what if we first assume a possible answer? This is the unique aspect of the HyDE (Hypothetical Document Embeddings) method.

Its working mechanism is quite interesting:

1. First, let the large language model generate a "hypothetical answer document" based on the question.
2. Use this hypothetical document to retrieve real documents.
3. Finally, use the retrieved real documents to generate an actual answer.

This is similar to when you're looking for a book and already have a rough outline of its content in mind, then use that outline to match similar books in the library. Let's see how this can be implemented specifically:

In [None]:
from llama_index.core.indices.query.query_transform.base import (
    HyDEQueryTransform,
)
from llama_index.core.query_engine import TransformQueryEngine

# Run query with HyDE query transform
hyde = HyDEQueryTransform(include_original=True)
query_engine = sentence_index.as_query_engine(streaming=True, similarity_top_k=5)
query_engine = TransformQueryEngine(query_engine, query_transform=hyde)

print(f"❓ User question: {question}\n")
print("🤖 AI is analyzing using HyDE...")
streaming_response = query_engine.query(question)

print("\n💭 AI response:")
print("-" * 40)
streaming_response.print_response_stream()

# Display reference documents
print("\n📚 Reference sources:")
print("-" * 40)
for i, node in enumerate(streaming_response.source_nodes, 1):
    print(f"\nDocument snippet {i}:")
    print("-" * 30)
    print(node.text)

# Evaluate results
print("\n📊 HyDE Query Evaluation Results:")
print("-" * 40)
evaluation_score = evaluate_result(question, streaming_response, ground_truth)
display(evaluation_score)

As you can see from the evaluation results, this method has indeed brought some improvements. You may be wondering: how does the system generate this 'hypothetical document'? Let’s take a look at what content the AI actually generated during this process:  

In [None]:
query_bundle = hyde(question)
hyde_doc = query_bundle.embedding_strs[0]
print(f"🤖 AI-generated hypothetical document:\n{hyde_doc}\n")

Interestingly, although this "hypothetical document" is entirely fabricated by AI, its structure and style are very similar to real company employee information. LlamaIndex provides flexible control mechanisms to optimize this process:

The HyDEQueryTransform class allows us to precisely control the generation of hypothetical documents in the following ways:

* Custom LLM: By passing different configurations of large language models through the llm parameter, you can choose a more suitable language model for generating hypothetical documents.
* Prompt template: Customize the prompt template via the hyde_prompt parameter to precisely control the format and content of the output.
* Query strategy: Use the include_original parameter to decide whether to combine the original query with the hypothetical document.

TransformQueryEngine acts as a wrapper for the query engine, which will:

1. First call HyDEQueryTransform to generate the hypothetical document.
2. Use the hypothetical document for vector retrieval.
3. Finally return the query results.

This architecture allows us to optimize the retrieval effect by adjusting the parameters of HyDEQueryTransform without modifying the underlying query engine. Even though the specific content of the hypothetical document may not be entirely accurate, a well-designed configuration can help the system retrieve relevant information more accurately.
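The HyDE flow can also be sketched without any library, which makes it easier to see why an inaccurate hypothetical document can still retrieve the right chunk. Everything here is a stand-in: `embed` is a toy word-count embedding rather than a real embedding model, `fake_llm_hypothetical_answer` is a stub for the LLM call, and the two chunks are shortened versions of the sample data.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Assumption: stands in for a real embedding model."""
    for mark in ",.?":
        text = text.replace(mark, " ")
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm_hypothetical_answer(question):
    """Stub for the LLM call; a real system would generate this text from the question."""
    return "Zhang Wei works in the Marketing Department as a specialist, phone 13800000000."


def hyde_retrieve(question, chunks, top_k=1):
    hypothetical_doc = fake_llm_hypothetical_answer(question)  # step 1: imagine an answer
    query_vector = embed(hypothetical_doc)                     # step 2: embed the imagined answer
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]                                      # step 3: real chunks go to the LLM


chunks = [
    "Administrative Department, Qin Fei, Administrative Specialist, 13800000034, maintaining company archives.",
    "IT Department, Zhang Wei, IT Specialist, 13800000036, configuring company networks.",
]
top_chunks = hyde_retrieve("Which department is Zhang Wei in?", chunks)
print(top_chunks[0])
```

The stub's answer names the wrong department, yet the retrieved chunk is the one about Zhang Wei, because the hypothetical document shares the shape and vocabulary of a real employee record. The final answer is then generated from the real chunk, not from the imagined one.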

#### 4.4.2 Extracting Tags to Enhance Retrieval

On the basis of vector retrieval, we can also add tag filtering to improve retrieval accuracy. This method is similar to a library having both title search and a classification numbering system, which allows for more precise retrieval.

There are two key scenarios for tag extraction:

1. When building an index, extract structured tags from document slices
2. During retrieval, extract corresponding tags from user queries for filtering

Let's look at two examples to understand how to extract tags from different types of text:

In [None]:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

system_message = """You are a tag extraction expert. Please extract structured information from the text and output tags as required.
---
[Supported Tag Types]
- Person Name
- Department Name
- Job Title
- Technical Field
- Product Name
---
[Output Requirements]
1. Please output in JSON format, such as: [{"key": "Department Name", "value": "Teaching and Research Department"}]
2. If a certain type of tag is not identified, do not output that type
---
The text to be analyzed is as follows:
"""

def extract_tags(text):
    completion = client.chat.completions.create(
        model="qwen-turbo",
        messages=[
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': text}
        ],
        response_format={"type": "json_object"}
    )
    return completion.choices[0].message.content

In [None]:
# Example 1: HR Document
hr_text = """Zhang Ming is the technical director of our AI R&D department. He led the team to develop a new generation of intelligent dialogue platform ChatMax and has rich experience in the field of natural language processing. If you need to know the project details, you can contact him directly."""
print("HR Document Tag Extraction Results:")
print(extract_tags(hr_text))

# Example 2: Technical Document
tech_text = """This paper proposes a deep learning-based image recognition algorithm, which has made breakthrough progress in medical image analysis. The algorithm has been applied in the CT diagnosis system of Peking Union Medical College Hospital."""
print("\nTechnical Document Tag Extraction Results:")
print(extract_tags(tech_text))
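The string returned by extract_tags still has to be parsed before the tags can be stored or used for filtering. The following is a minimal sketch of such a step. `parse_tags` is a hypothetical helper, and the exact shape of the model output is an assumption: the sketch accepts either a JSON list of key/value objects (as in the prompt's example) or a JSON object that wraps such a list.

```python
import json


def parse_tags(raw):
    """Parse the model's JSON string into a list of (key, value) pairs.

    Assumes the output is either a list of {"key": ..., "value": ...} objects
    or a JSON object that wraps such a list under a single field.
    Anything else yields an empty list instead of raising.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if isinstance(data, dict):
        # A single tag object, or a wrapper such as {"tags": [...]}.
        if "key" in data and "value" in data:
            data = [data]
        else:
            data = next((v for v in data.values() if isinstance(v, list)), [])
    if not isinstance(data, list):
        return []
    return [
        (item["key"], item["value"])
        for item in data
        if isinstance(item, dict) and "key" in item and "value" in item
    ]


sample = '[{"key": "Person Name", "value": "Zhang Ming"}, {"key": "Department Name", "value": "AI R&D department"}]'
print(parse_tags(sample))
```

Returning an empty list on malformed output is a design choice for this sketch: retrieval can then fall back to plain vector search instead of failing.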

When we build the index, we can store these tags together with the document chunks. Then, during retrieval, for example when a user asks "Which department is Zhang Wei in?", we can:

1. Extract the name tag {"key": "Person Name", "value": "Zhang Wei"} from the question.
2. First, use the tag to keep only the document chunks that contain "Zhang Wei".
3. Then, use vector similarity search to find the most relevant content among them.

This combination of "tag filtering + vector retrieval" can improve retrieval accuracy, and it works especially well for highly structured enterprise documents.
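The filter-then-rank flow described above can be sketched without any external service. Everything here is illustrative: `filtered_search` is a hypothetical helper, the chunk texts and two-dimensional vectors are hand-made toy data rather than real embeddings, and falling back to all chunks when the filter matches nothing is an assumed design choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, query_tags, chunks, top_k=2):
    """Keep chunks that carry every query tag, then rank them by similarity."""
    candidates = [c for c in chunks if query_tags <= c["tags"]]
    # Fall back to all chunks if the filter removes everything (an assumption).
    if not candidates:
        candidates = chunks
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:top_k]]


# Hand-made toy data; the vectors are illustrative, not real embeddings.
chunks = [
    {"text": "Zhang Wei: department and contact details",
     "tags": {("Person Name", "Zhang Wei")}, "vector": [0.9, 0.1]},
    {"text": "Zhang Ming: department and contact details",
     "tags": {("Person Name", "Zhang Ming")}, "vector": [0.95, 0.05]},
    {"text": "Zhang Wei: onboarding checklist",
     "tags": {("Person Name", "Zhang Wei")}, "vector": [0.2, 0.8]},
]
query_tags = {("Person Name", "Zhang Wei")}
print(filtered_search([1.0, 0.0], query_tags, chunks))
```

In this toy data the chunk about Zhang Ming has the vector closest to the query, so pure vector search would rank it first; the tag filter removes it before similarity is ever computed.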

#### 4.4.3 Re-ranking

You can delete the previously constructed markdown file to reproduce the state at the beginning of this chapter, where the response to "Which department is Zhang Wei in?" was poor.

In [None]:
![ -f ./docs/Summary_of_Department_Responsibilities_and_Key_Role_Contacts_in_Content_Company.md ] && rm ./docs/Summary_of_Department_Responsibilities_and_Key_Role_Contacts_in_Content_Company.md && echo "File deleted." || echo "File does not exist, no need to delete."

After deleting the file, you can execute the following code. As you can see, the query engine is set to retrieve 3 relevant document slices from the vector database. The results show that these 3 slices are not sufficiently relevant, and the Q&A bot is unable to correctly answer the question "Which department is Zhang Wei from?".

In [None]:
from llama_index.llms.dashscope import DashScope
from chatbot import rag

In [None]:
index = rag.create_index('./docs')
query_engine = index.as_query_engine(
    similarity_top_k=3,
    streaming=True,
)

In [None]:
response = ask("Which department is Zhang Wei in", query_engine=query_engine)

In [None]:
display(evaluate_result(question, response, ground_truth))

You can adjust the code to first retrieve 20 document slices from the vector database, then use the [text rerank](https://help.aliyun.com/zh/model-studio/getting-started/models#eafbfdceb7n03) model provided by Alibaba Cloud Model Studio to rerank them and keep the three most relevant pieces of reference information. After running the code, you will notice that, still with only three pieces of reference information, the LLM is now able to answer the question accurately.

In [None]:
from llama_index.postprocessor.dashscope_rerank import DashScopeRerank
from llama_index.core.postprocessor import SimilarityPostprocessor

In [None]:
query_engine = index.as_query_engine(
    # First, set a larger number of recall slices
    similarity_top_k=20,
    streaming=True,
    node_postprocessors=[
        # In the rerank model, select the final number of slices you want to recall. Use the gte-rerank model from Tongyi Lab for reranking.
        DashScopeRerank(top_n=3, model="gte-rerank"),
        # Set a similarity threshold; slices below this threshold will be filtered out
        SimilarityPostprocessor(similarity_cutoff=0.2)
    ]
)

In [None]:
response = ask("Which department is Zhang Wei in", query_engine=query_engine)

In [None]:
display(evaluate_result(question, response, ground_truth))
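The retrieve-then-rerank pattern used in this section can be sketched independently of any service. In the sketch below, `retrieve_then_rerank`, `coarse_score`, and `fine_score` are hypothetical names: word overlap stands in for vector similarity, and Jaccard similarity stands in for the rerank model. Neither reflects how DashScopeRerank scores documents; only the three-stage control flow is the point.

```python
def retrieve_then_rerank(query, docs, coarse_score, fine_score,
                         recall_k=20, top_n=3, cutoff=0.2):
    # Stage 1: cheap scoring over everything, keep a generous candidate set.
    candidates = sorted(docs, key=lambda d: coarse_score(query, d), reverse=True)[:recall_k]
    # Stage 2: more precise scoring on the candidates only, keep the best top_n.
    rescored = sorted(((fine_score(query, d), d) for d in candidates),
                      key=lambda pair: pair[0], reverse=True)[:top_n]
    # Stage 3: drop anything whose rerank score falls below the cutoff.
    return [d for score, d in rescored if score >= cutoff]


def coarse_score(query, doc):
    # Toy stand-in for vector similarity: number of shared words.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def fine_score(query, doc):
    # Toy stand-in for a rerank model: Jaccard similarity, which penalizes
    # long documents that only mention the query terms in passing.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "zhang wei department it",
    "zhang wei zhang ming department list of all staff in the company and department heads",
    "cafeteria menu",
]
query = "zhang wei department"
print(retrieve_then_rerank(query, docs, coarse_score, fine_score, recall_k=2, top_n=1))
```

The first two documents tie under the coarse score, which is why a small similarity_top_k alone can surface the wrong slice; the second stage separates them.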

### 4.5 Answer Generation Phase

Now the large language model (LLM) generates the final answer based on your question and the retrieved content. However, this answer may still not meet your expectations. The issues you might encounter include:

1. No relevant information was retrieved, and the LLM fabricated an answer.
2. Relevant information was retrieved, but the LLM did not generate the answer as required.
3. Relevant information was retrieved, and the LLM provided an answer, but you expect the AI to give a more comprehensive response.

To address these issues, you can analyze and resolve them from the following perspectives:

1. Choose the right LLM:
   1. For simple information queries and summaries, a small-parameter model is sufficient, such as [qwen-turbo](https://help.aliyun.com/zh/model-studio/getting-started/models#ff492e2c10lub).
   2. If you want the Q&A bot to perform complex logical reasoning, it is recommended to choose a larger-parameter LLM with stronger reasoning capabilities, such as [qwen-plus](https://help.aliyun.com/zh/model-studio/getting-started/models#bb0ffee88bwnk) or even [Qwen-Max](https://help.aliyun.com/zh/model-studio/getting-started/models#cf6cc4aa2aokf).
   3. If your question requires reviewing a large number of document fragments, it is recommended to choose a model with a longer context length, such as [qwen-long](https://help.aliyun.com/zh/model-studio/getting-started/models#27b2b3a15d5c6), [qwen-turbo](https://help.aliyun.com/zh/model-studio/getting-started/models#ff492e2c10lub), or [qwen-plus](https://help.aliyun.com/zh/model-studio/getting-started/models#bb0ffee88bwnk).
   4. If the RAG chatbot you are building is for a specialized domain such as the legal field, it is recommended to use a model trained specifically for that domain, such as [Tongyi Fawei](https://help.aliyun.com/zh/model-studio/getting-started/models#f0436273ef1xm).
2. Fully optimize the prompt template, for example:
   1. Clearly request no fabrication of answers: LLMs may produce inaccurate content, commonly referred to as hallucination. You can reduce the likelihood of hallucinations by requiring in the prompt: "If the provided information is insufficient to answer the question, please explicitly state 'Based on the available information, I cannot answer this question.' Do not fabricate answers."
   2. Add content delimiters: If the retrieved document slices are mixed into the prompt without structure, it is difficult for humans to see the structure of the entire prompt, and the LLM is affected as well. It is recommended to clearly separate the prompt and the retrieved slices so that the LLM can correctly understand your intent.
   3. Adjust the template according to the type of question: Different types of questions may require different response paradigms. You can use the LLM to identify the question type and then map it to a matching prompt template. For example, for some questions, you may want the LLM to first output the overall framework and then the details; for other questions, you may prefer concise conclusions.
3. Adjust the parameters of the LLM, for example:
   1. If you want the LLM to produce the same output for the same question, you can pass the same seed value each time the model is invoked.
   2. If you want the LLM to avoid repeating the same sentences when answering user questions, you can appropriately increase the presence_penalty value.
   3. If you are querying factual content, you can appropriately decrease the temperature or top_p values; conversely, when querying creative content, you can appropriately increase them.
   4. If you need to limit the word count (such as when generating summaries or keywords), control costs, or reduce response time, you can appropriately lower the max_tokens value. However, if max_tokens is too low, the output may be truncated. Conversely, when generating long text, you can increase its value.
   5. You can also refer to the [Qwen API Reference](https://help.aliyun.com/zh/model-studio/developer-reference/use-qwen-by-calling-api) to learn more about the usage of the various parameters.

In [None]:
from llama_index.llms.openai_like import OpenAILike
from llama_index.core import Settings
import os

In [None]:
# Factual query scenario - Low temperature, high certainty
factual_llm = OpenAILike(
    model="qwen-plus-0919",  # Use the Qwen-Plus model
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    is_chat_model=True,
    temperature=0.1,      # Lower temperature for more deterministic output
    max_tokens=512,       # Control output length; however, if max_tokens is too small, it may lead to truncated output
    presence_penalty=0.0, # Default presence_penalty
    seed=42               # Fixed seed for reproducible output
)

In [None]:
# Creative scenario - High temperature, more diversity
creative_llm = OpenAILike(
    model="qwen-plus-0919",
    api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    is_chat_model=True,
    temperature=0.7,      # Increase temperature to make the output more creative
    max_tokens=1024,      # Allow longer output
    presence_penalty=0.6  # Increase presence_penalty to reduce repetition
)

4. Model fine-tuning for large language models (LLMs): If all the above methods have been thoroughly attempted but still fall short of expectations, or if you hope to achieve further performance improvements, you can also try model fine-tuning tailored to your specific scenario. In subsequent chapters, you will learn and practice this process.  
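The prompt-template advice in item 2 above (a no-fabrication instruction plus content delimiters) can be sketched as a small helper. `build_prompt` and the `<chunk>` delimiter format are hypothetical choices for illustration, not the template used by the course's chatbot; any unambiguous delimiter would serve the same purpose.

```python
def build_prompt(question, chunks):
    """Assemble a RAG prompt with explicit delimiters and a no-fabrication rule."""
    # Wrap every retrieved slice in its own numbered delimiter.
    context = "\n".join(
        f"<chunk id={i}>\n{chunk}\n</chunk>" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the reference information below.\n"
        "If the provided information is insufficient to answer the question, "
        "please explicitly state 'Based on the available information, "
        "I cannot answer this question.' Do not fabricate answers.\n"
        "---\n"
        f"Reference information:\n{context}\n"
        "---\n"
        f"Question: {question}\n"
    )


prompt = build_prompt(
    "Which department is Zhang Wei in?",
    ["First retrieved slice.", "Second retrieved slice."],
)
print(prompt)
```

Keeping the instructions, the reference slices, and the question in separate, clearly marked regions makes the prompt easier to inspect when an answer goes wrong.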

## ✅ Summary of this section

In this section, you learned the workflow of a simple RAG application and common optimization techniques. You can also combine what you have learned with your actual needs and route different questions to different RAG chatbots, thereby building a more powerful modular RAG chatbot. You should also have realized that large language models (LLMs) are useful for more than building question-answering systems: using LLMs to identify user intent and extract structured information (such as extracting tags from user questions, as shown earlier) can also play a role in many other application scenarios.

Of course, the optimization methods for RAG go far beyond those introduced in this course. The industry continues to research and explore RAG, and there are still many advanced RAG topics worth your study. As this section shows, building a well-rounded, high-performing RAG chatbot is not simple. In practical work, you may need to seize business opportunities more quickly and may not have time to delve into these details. Below are some directions worth exploring:

* GraphRAG combines retrieval-augmented generation (RAG) with query-focused summarization (QFS) to handle large-scale text data. RAG excels at finding precise details, while QFS is better at understanding and summarizing the overall content of a document. By combining the two, GraphRAG can answer specific questions accurately and also handle complex queries that require deeper understanding, which makes it well suited to building intelligent question-answering systems. If you want to delve deeper into how to apply GraphRAG in practice, you can refer to the detailed tutorial provided by LlamaIndex: [Building a GraphRAG Application with LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/cookbooks/GraphRAG_v2/).
* With Model Studio, you can refer to the document [Build a Private Knowledge Question-Answering Application Without Coding](https://help.aliyun.com/zh/model-studio/getting-started/build-knowledge-base-qa-assistant-without-coding) to quickly build a fairly effective RAG chatbot.
* If your business processes are more complex, you can also leverage [Visual Workflow, agent orchestration application](https://help.aliyun.com/zh/model-studio/user-guide/application-introduction) on Model Studio to build a more powerful application.
* Model Studio also provides a series of [LlamaIndex components](https://help.aliyun.com/zh/model-studio/developer-reference/llamaindex/), allowing you to make full use of Model Studio capabilities while continuing to use the familiar LlamaIndex API to build RAG chatbots.

## 🔥 Post-class Quiz

### 🔍 Single-choice Question

<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>In RAG applications, the length and content of document chunks significantly impact retrieval performance. If the chunk size is too large, introducing excessive noise, how should it be addressed❓</b>

- A. Increase the number of documents
- B. Reduce the chunk size, or develop a more reasonable chunking strategy based on business characteristics
- C. Use a more advanced retrieval algorithm
- D. Improve the training level of the large model

**[Click to view the answer]**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: B**  
📝 **Explanation**:

- Excessively long document chunks may include too much irrelevant information (noise), directly affecting retrieval accuracy.
- For example, if a single chunk contains multiple topics, unrelated content may be retrieved during searches.
- Optimizing the chunking strategy is the fundamental solution to address noise, as it controls input quality rather than relying on subsequent algorithmic or model compensation.

</div>
</details>

## ✉️ Evaluation and Feedback

Thank you for studying the AliCloud Large Model ACP Certification Course. If you find any part of the course well-written or in need of improvement, we look forward to your [evaluation and feedback through this questionnaire](https://survey.aliyun.com/apps/zhiliao/Mo5O9vuie).

Your criticism and encouragement are our motivation to move forward.