# 2.2 Expanding the Knowledge Scope of the Q&A Bot

## 🚄 Preface

You have already learned that a RAG chatbot is an effective way to expand the knowledge scope of large language models (LLMs). In this section, you will learn how a RAG chatbot works and how to build a RAG chatbot application that answers questions based on the company's policy documents.

## 🍁 Course Objectives

After completing this course, you will be able to:

* Understand the workflow of a RAG chatbot
* Create a RAG chatbot application



## 1. How RAG Works

You might lose points in an exam because you forgot a concept or formula, but if the exam is open-book, you only need to find the most relevant knowledge point and add your understanding to answer the question.

The same applies to large language models (LLMs). If a model never saw certain knowledge during training (for example, your company's policy documents), asking it related questions directly is likely to produce inaccurate answers. However, if the relevant knowledge is supplied as reference material at generation time, much like an open-book exam, the quality of the LLM's responses improves significantly.

Retrieval Augmented Generation (RAG) is a solution that provides reference materials for LLMs. RAG applications typically consist of two parts: **indexing** and **retrieval generation**.

### 1.1 Indexing
You might mark reference materials before an exam to help you quickly locate relevant information during the test. Similarly, RAG applications often pre-mark references, a process called **indexing**, which includes four steps:<br>
1. **Document Parsing**<br>
Just as you convert visual information from books into text, RAG applications also need to load and parse knowledge base documents into a textual format that LLMs can understand.
2. **Text Chunking**<br>
You usually don't flip through an entire book when solving a problem; instead, you look for the most relevant paragraphs. Similarly, RAG applications segment the parsed documents to quickly retrieve the most relevant content later.
3. **Text Vectorization**<br>
During an open-book exam, you first search for the most relevant paragraphs in the reference materials before answering. In RAG applications, embedding models are used to digitally represent both the paragraphs and the question. After comparing their similarity, the most relevant paragraph is identified. This process is called text vectorization.<br>
    > If you're interested in the details of this process, you can explore the extended reading section of this tutorial.
4. **Index Storage**<br>
Index storage saves the vectorized paragraphs into a vector database, so RAG applications don't need to repeat these steps every time they respond, thus increasing response speed.

    <img src="https://img.alicdn.com/imgextra/i3/O1CN01h0y0Uy1WH30Q7FRDJ_!!6000000002762-2-tps-1592-503.png" width="800"><br>

    After indexing, RAG applications can retrieve relevant text segments based on user questions.
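
The chunking step above can be illustrated with a deliberately simple sketch. It splits text into fixed-size character windows that overlap slightly, so that a sentence cut at a boundary still appears whole in at least one chunk. The function name `chunk_text` and its parameters are illustrative assumptions, not part of LlamaIndex; production splitters usually respect sentence or token boundaries instead of raw character counts.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example so the overlap is easy to see
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk shares its first character with the last character of the previous one; larger overlaps trade extra storage for a lower chance of splitting a key sentence.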

### 1.2 Retrieval Generation
Retrieval and generation correspond to the `Retrieval` and `Generation` stages in RAG. **Retrieval** is like searching for materials during an open-book exam, while **generation** involves answering based on the retrieved materials and the question.<br>
1. **Retrieval**<br>
The retrieval phase recalls the most relevant text segments. The question is vectorized using an embedding model, and semantic similarity is compared with the paragraphs in the vector database to identify the most relevant ones. Retrieval is the most critical part of a RAG application. Imagine finding the wrong material during an exam—your answer would be inaccurate. To improve retrieval accuracy, besides using powerful embedding models, techniques like reranking and sentence window retrieval can be applied. You can learn more about these in the next chapter.
2. **Generation**<br>
After retrieving the relevant text segments, the RAG application builds the final prompt by combining the question and the retrieved segments through a prompt template. The LLM then generates the response, relying mainly on its summarization ability rather than solely on its internal knowledge.
    > A typical prompt template is: `Please answer the user's question based on the following information: {retrieved text segments}. The user's question is: {question}.`

    <img src="https://img.alicdn.com/imgextra/i1/O1CN01vbkBXC1HQ0SBrC1Ii_!!6000000000751-2-tps-1776-639.png" width="600"><br>
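
As a rough illustration of the generation step, the sketch below fills a prompt template with retrieved segments and a question. The template wording follows the typical template quoted above; the helper name `build_prompt`, the `---` separator, and the sample segments are assumptions made for this example, and the real prompt used by a framework such as LlamaIndex will differ.

```python
PROMPT_TEMPLATE = (
    "Please answer the user's question based on the following information:\n"
    "{context}\n"
    "The user's question is: {question}"
)


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine the retrieved text segments and the question into one prompt."""
    # Separate segments visibly so the model can tell them apart
    context = "\n---\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder segments, standing in for what the retrieval step would return
segments = [
    "<retrieved text segment 1>",
    "<retrieved text segment 2>",
]
print(build_prompt("What tools should our company use for project management?", segments))
```

The resulting string is what gets sent to the LLM, which is why retrieval quality directly bounds answer quality: the model only sees the segments placed in `{context}`.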

## 2. Creating a RAG Application

Building a RAG application requires implementing the above functionalities, and this process is not easy. However, with LlamaIndex, you can achieve the aforementioned functionalities without writing too much code.  



### 2.1 Confirm Your Current Python Environment



Before running the code in this section of the course, please make sure you have switched to the newly created Python environment, such as the `Python (llm_learn)` environment created in the previous lessons.

<img src="https://img.alicdn.com/imgextra/i1/O1CN01B9bNMT27MDFvpBmnc_!!6000000007782-2-tps-1944-448.png" width="800">

**Note: In each subsequent lesson, you should check whether you need to manually switch the Notebook environment.**



### 2.2 A Simple RAG Chatbot

As in the previous section, run the following code to load the Model Studio API key into the environment.



In [4]:
from config.load_key import load_key
import os

load_key()
# In production environments, do not output the API Key to logs to avoid leakage
# Single quotes inside the f-string keep this line valid on Python versions before 3.12
print(f"Your configured API Key is: {os.environ['DASHSCOPE_API_KEY'][:5] + '*' * 5}")

Your configured API Key is: sk-4b*****


We have prepared some fictional company policy documents in the docs folder, and next you will create a RAG application based on these documents.  



In [5]:
!pip install -r ../requirements.txt



In [6]:
# Import dependencies
from llama_index.embeddings.dashscope import DashScopeEmbedding, DashScopeTextEmbeddingModels
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai_like import OpenAILike

# These two lines of code are used to suppress WARNING messages to avoid interference with reading and learning. It is recommended to set the log level as needed in a production environment.
import logging
logging.basicConfig(level=logging.ERROR)

print("Parsing files...")
# LlamaIndex provides the SimpleDirectoryReader method, which can directly load files from a specified folder into document objects, corresponding to the parsing process.
documents = SimpleDirectoryReader('./docs').load_data()

print("Creating index...")
# The from_documents method covers the chunking and index-creation steps.
index = VectorStoreIndex.from_documents(
    documents,
    # Specify embedding model
    embed_model=DashScopeEmbedding(
        # You can also use other embedding models provided by Alibaba Cloud: https://help.aliyun.com/zh/model-studio/getting-started/models#3383780daf8hw
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2
    ))
print("Creating query engine...")
query_engine = index.as_query_engine(
    # Set to streaming output
    streaming=True,
    # Here we use the qwen-plus-0919 model. You can also use other Qwen text generation models provided by Alibaba Cloud: https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u
    llm=OpenAILike(
        model="qwen-plus-0919",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        is_chat_model=True
        ))
print("Generating response...")
streaming_response = query_engine.query('What tools should our company use for project management?')
print("The answer is:")
# Use streaming output
streaming_response.print_response_stream()

Parsing files...
Creating index...
Creating query engine...
Generating response...
The answer is:
For project management, your company should consider using tools such as Jira or Trello. These tools help in organizing tasks, tracking progress, and ensuring that projects adhere to the set timelines and requirements. Additionally, they facilitate better communication and collaboration among team members.

### 2.3 Saving and Loading Index
You may find that creating an index takes a relatively long time. If you can save the index locally and load it directly when needed, instead of rebuilding the index, this can significantly improve the response speed. LlamaIndex provides an easy-to-implement method for saving and loading indexes.  



In [8]:
# Save the index as a local file
index.storage_context.persist("knowledge_base/test")
print("Index files saved to knowledge_base/test")

Index files saved to knowledge_base/test


In [9]:
# Load the local index file as an index
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="knowledge_base/test")
index = load_index_from_storage(storage_context, embed_model=DashScopeEmbedding(
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2
    ))
print("Successfully loaded index from knowledge_base/test path")

Successfully loaded index from knowledge_base/test path
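
The save-then-load pattern above is often wrapped in a small "load if present, otherwise build" helper. The sketch below shows only that control flow, with plain callables standing in for the real work; `load_or_build`, `demo_build`, and `demo_load` are hypothetical names for this illustration and are not LlamaIndex APIs.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str,
                  load: Callable[[str], T],
                  build: Callable[[str], T]) -> T:
    """Call load(persist_dir) if the directory holds files, else build(persist_dir)."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


# Stand-ins for the expensive build step and the cheap load step
def demo_build(path: str) -> str:
    os.makedirs(path)
    with open(os.path.join(path, "index.json"), "w") as f:
        f.write("{}")
    return "built"


def demo_load(path: str) -> str:
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "knowledge_base")
    print(load_or_build(target, demo_load, demo_build))  # prints built
    print(load_or_build(target, demo_load, demo_build))  # prints loaded
```

With the calls from the cells above, `load` would wrap `StorageContext.from_defaults(persist_dir=...)` plus `load_index_from_storage(...)`, and `build` would wrap `VectorStoreIndex.from_documents(...)` followed by `index.storage_context.persist(...)`.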


After loading the index locally, you can test it again by asking questions to see if it works properly.  



In [10]:
print("Creating the query engine...")
query_engine = index.as_query_engine(
    # Set to streaming output
    streaming=True,
    # Use the qwen-plus-0919 model here. You can also use other text generation models provided by Alibaba Cloud: https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u
    llm=OpenAILike(
        model="qwen-plus-0919",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        is_chat_model=True
        ))
print("Generating response...")
streaming_response = query_engine.query('What tools should our company use for project management?')
print("The answer is:")
streaming_response.print_response_stream()

Creating the query engine...
Generating response...
The answer is:
For project management, your company should consider using tools such as Jira or Trello. These tools help in organizing tasks, tracking progress, and ensuring that projects adhere to the set timelines and requirements. Additionally, they facilitate better collaboration among team members and stakeholders.

You can encapsulate the above code so that it can be quickly reused in subsequent iterations.  



In [11]:
from chatbot import rag

# The documents were already indexed in the previous steps, so the index can be loaded directly here. If you need to rebuild the index, you can add a line of code: rag.indexing()
index = rag.load_index(persist_path='./knowledge_base/test')
query_engine = rag.create_query_engine(index=index)

rag.ask('What tools should our company use for project management?', query_engine=query_engine)

For project management, your company should consider using tools such as Jira or Trello. These tools help in organizing tasks, tracking progress, and ensuring that projects adhere to the set timelines and requirements. Additionally, they facilitate better communication and collaboration among team members.

### 2.4 Multi-round Conversation
Multi-round conversation in a RAG application works slightly differently from multi-round conversation held directly with an LLM. In the tutorial in section 2.1, you learned that an LLM can refer to earlier turns when the historical dialogue is added to the messages list.

During the retrieval phase in RAG applications, the system usually compares the semantic similarity between the user's input and text segments. However, directly comparing the user's input with text segments may lose historical dialogue information, leading to inaccurate retrieval results.

Suppose a user asks "Where is Zhang San's workstation?" in the first round of dialogue, and then asks "Who is his supervisor?" in the second round. If the question in the second round is directly compared with text segments for similarity, the retrieval system will not know who "he" refers to, thus likely retrieving incorrect text segments.

If you feed both the complete dialogue history and the question into the retrieval system, the input may become too long for it to handle well (embedding models perform worse on long texts than on short ones). A common solution in the industry is:

1. Use the LLM to rewrite the query based on the historical dialogue, so that the new query includes the key information from earlier turns.
2. Use the new query to follow the original process for retrieval and generation.
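
Before turning to a framework, the two steps above can be sketched without any library. The example below formats the chat history into a rewriting prompt and passes it to an LLM callable; `fake_llm` is a hard-coded stand-in for a real model call, and `condense_question`, `format_history`, and the template text are assumptions made for this illustration.

```python
from typing import Callable

CONDENSE_TEMPLATE = (
    "Given the conversation below and a follow-up message, rewrite the "
    "follow-up as a standalone question.\n\n"
    "<Chat History>\n{chat_history}\n\n"
    "<Follow-up Message>\n{question}\n\n"
    "<Standalone Question>\n"
)


def format_history(history: list[tuple[str, str]]) -> str:
    """Render (role, content) pairs as one 'role: content' line each."""
    return "\n".join(f"{role}: {content}" for role, content in history)


def condense_question(history: list[tuple[str, str]],
                      question: str,
                      complete: Callable[[str], str]) -> str:
    """Step 1: ask the LLM for a standalone version of the question."""
    if not history:
        return question  # nothing to resolve on the first turn
    prompt = CONDENSE_TEMPLATE.format(
        chat_history=format_history(history), question=question
    )
    return complete(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; a real model would produce this rewrite itself
    return "Who is Zhang San's supervisor?"


history = [
    ("user", "Where is Zhang San's workstation?"),
    ("assistant", "<answer from the first round>"),
]
standalone = condense_question(history, "Who is his supervisor?", fake_llm)
print(standalone)  # Step 2 would send this rewritten query to retrieval and generation
```

Only the short rewritten question reaches the embedding model, so retrieval keeps the context of earlier turns without having to embed the whole conversation.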

LlamaIndex provides convenient tools that can quickly implement multi-round conversations in RAG applications.



In [12]:
from llama_index.core import PromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.chat_engine import CondenseQuestionChatEngine

custom_prompt = PromptTemplate(
    """
Given a conversation (between a human and an assistant) and a follow-up message from the human,
rewrite the message as a standalone question that includes all relevant context from the conversation.

<Chat History>
{chat_history}

<Follow-up Message>
{question}

<Standalone Question>
"""
)

# Historical conversation information
custom_chat_history = [
    ChatMessage(role=MessageRole.USER,content="What are the subtypes of content development engineers?"),
    ChatMessage(role=MessageRole.ASSISTANT, content="Comprehensive technical positions."),
]

query_engine = index.as_query_engine(
    # Set to streaming output
    streaming=True,
    # Use the qwen-plus-0919 model here; you can also use other text generation models provided by Alibaba Cloud: https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u
    llm=OpenAILike(
        model="qwen-plus-0919",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        is_chat_model=True
        ))
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    chat_history=custom_chat_history,
    llm=OpenAILike(
        model="qwen-plus-0919",
        api_base="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        is_chat_model=True
        ),
    verbose=True
)

streaming_response = chat_engine.stream_chat("What are the core responsibilities?")
for token in streaming_response.response_gen:
    print(token, end="")


Querying with: What are the core responsibilities of the subtypes of content development engineers?
The core responsibilities of content development engineers involve combining educational theory with technical practice to support the growth and development of learners through the creation of high-quality content. This role encompasses several detailed responsibilities:

1. Conducting in-depth research on the latest trends in educational technology, learning theories, and market demands. This involves analyzing competitors’ products, evaluating the effectiveness of existing educational resources, and exploring ways to integrate emerging technologies such as artificial intelligence and virtual reality into educational content.

2. Designing and developing high-quality educational materials and courses based on research and market feedback. This includes writing syllabi, creating courseware, and designing assessment tools, while ensuring that the content aligns with educational standards

Although the follow-up question did not mention "content development engineers," the LLM rewrote it, based on the historical dialogue information, as "What are the core responsibilities of the subtypes of content development engineers?" (shown in the `Querying with:` line of the output) and then provided the correct answer.  



## 📝 3. Summary of This Section
In this section, you have learned the following content:
1. **The working principle of RAG**<br>
A complete RAG application usually includes two phases: indexing and retrieval generation. Indexing consists of four steps: document parsing, text chunking, text vectorization, and index storage. The retrieval generation phase includes two steps: retrieval and generation. After understanding the working principle of RAG, you can optimize and iterate on the RAG chatbot more effectively.
2. **Creating a RAG application**<br>
Using the highly integrated tools provided by LlamaIndex, you created a RAG application, and mastered the methods for saving and loading indexes. You also learned how to implement multi-round conversation in a RAG application.

Although the RAG chatbot can already answer questions like "What tools should our company use for project management?" quite well, its current functionality is still relatively simple. In subsequent tutorials, we will introduce methods to expand the capabilities of the RAG chatbot. The next section will cover how to improve the quality of the RAG chatbot's responses by optimizing prompts.



### Further Reading

#### Text Vectorization
Computers cannot directly understand how similar the two sentences "I like to eat apples" and "I love to eat apples" are, but they can understand the similarity between two vectors of the same dimension (usually measured using cosine similarity). Text vectorization converts natural language into numerical forms that computers can understand through embedding models.

The training of embedding models typically includes a phase of **contrastive learning**, where the input data consists of many text pairs (s1, s2) labeled as either related or unrelated. The model's training objective is to make the vector similarity of related text pairs as high as possible and the vector similarity of unrelated text pairs as low as possible.

In the **indexing** phase, assuming n chunks [c1, c2, c3, ..., cn] have been obtained through text chunking, the embedding model converts these n chunks into vectors [v1, v2, v3, ..., vn], which are then stored in a vector database.

In the **retrieval** phase, assuming the user’s question is q, the embedding model converts q into a vector vq and finds the k vectors most similar to vq in the vector database (you can configure k). Through the index relationship between vectors and text segments, the corresponding text segments are returned as the search results.
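
The retrieval step described above can be sketched in a few lines of plain Python: compute the cosine similarity between the question vector and every chunk vector, then keep the highest-scoring chunks. The three-dimensional vectors and chunk labels below are made up for illustration (real embeddings have hundreds or thousands of dimensions), the number of results is called `k` here, and a vector database would use an optimized index rather than this linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          chunk_vecs: list[list[float]],
          chunks: list[str],
          k: int = 2) -> list[tuple[float, str]]:
    """Return the k (score, chunk) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: hand-written vectors, not the output of a real embedding model
chunks = ["chunk about project tools", "chunk about leave policy", "chunk about expenses"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]
vq = [1.0, 0.0, 0.0]

for score, chunk in top_k(vq, chunk_vecs, chunks, k=2):
    print(f"{score:.3f}  {chunk}")
```

The chunk whose vector points in nearly the same direction as `vq` ranks first, which is the same ordering a vector database returns for the query.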

## 🔥 Post-class Quiz
### 🔍 Multiple Choice Question

<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>How should retrieval be conducted during multi-turn conversations in RAG applications? ❓</b>

- A. Input the complete historical dialogue information during the retrieval phase<br>
- B. Rewrite the input question based on historical dialogue information before entering the retrieval phase<br>
- C. Input the latest question during the retrieval phase<br>
- D. Migrate the text segments recalled from the previous round<br>

**[Click to view the answer]**
</summary>

<div style="margin-top: 10px; padding: 15px;  border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: B**  
📝 **Explanation**:  
- In multi-turn conversations, directly using the original question (Option C) or the full history (Option A) can lead to retrieval noise or information redundancy.
- Option B dynamically rewrites the current question, maintaining conversational coherence while avoiding the outdated text migration issue of Option D, making it the optimal solution balancing efficiency and accuracy.

</div>
</details>  



## ✅ Evaluation Feedback
We welcome you to participate in the [Alibaba Cloud Large Language Model ACP Course Survey](https://survey.aliyun.com/apps/zhiliao/Mo5O9vuie) to provide feedback on your learning experience and course evaluation.
Your criticism and encouragement are our motivation to move forward!  

