# 2.1 Building a New Employee Q&A Robot with Large Language Models

## 🚄 Preface
You work at an educational content development company. As new employees keep joining, answering their recurring questions consumes significant time and resources. You decide to use large language model (LLM) technology to build a Q&A robot that answers these questions more accurately and efficiently.

## 🍁 Course Objectives
After completing this course, you will be able to:

* Call Qwen-Max through its API
* Understand how LLMs work
* Understand the limitations of LLMs and how to address them

## ⚠️ Environment Preparation
To ensure a smoother experience with the tutorial, we recommend that you first complete the [**Environment Preparation**](https://edu.aliyun.com/course/3130200/lesson/343310285) chapter in the **Alibaba Cloud Large Language Model Senior Engineer ACP Certification Course**, and ensure that the required environment for the course is correctly installed.


## 💻 1. Calling Qwen-Max via API

The most direct way to experience LLMs is to interact with one through a web interface (such as [Qwen-Max](https://tongyi.aliyun.com/qianwen/)). As a developer, however, you often need to integrate LLM capabilities into your own applications. You can use the widely adopted OpenAI Python SDK to call Qwen models; the examples in this section use `qwen-plus-0919`. You already installed the necessary dependencies in `1.0 Computing Environment Setup`. Before running the following code, confirm that you have switched to the newly created Python environment, such as the `Python (llm_learn)` environment created in this example.

<a href="https://img.alicdn.com/imgextra/i4/O1CN01nf2EYI1pwhhMbOWHe_!!6000000005425-0-tps-2258-1004.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i4/O1CN01nf2EYI1pwhhMbOWHe_!!6000000005425-0-tps-2258-1004.jpg" width="600">
</a>

<a href="https://img.alicdn.com/imgextra/i3/O1CN01rn0jvB1Z1QJXUWaG2_!!6000000003134-0-tps-3138-914.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01rn0jvB1Z1QJXUWaG2_!!6000000003134-0-tps-3138-914.jpg" width="600">
</a>  



To call Qwen, go to Model Studio, Alibaba Cloud's large model service platform, then [activate the model invocation service](https://bailian.console.aliyun.com/#/model-market) and [create an API key](https://bailian.console.aliyun.com/?apiKey=1#/api-key).

> If the following is displayed at the top of the page, it means that you have not yet activated the Model Studio invocation service. After activating the service, you can invoke the model.
> <img src="https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/5298748271/p856749.png" width="600">

Before using the API, make sure you handle the API key securely. Hardcoding the key in your code is a bad habit: the key leaks easily when you share the code, and whenever you replace the key you must update every place where it appears in plaintext. A safer and more convenient approach is to store the API key in an environment variable.

The following code will load your API key from the configuration file and set it as an environment variable. After the code is executed, the first five characters of the API key will be displayed (followed by asterisks), so you can confirm whether the configuration is correct without exposing the complete key.

In [None]:
# Load the API Key for invoking the Qwen large model
import os
from config.load_key import load_key
load_key()
print(f'''Your configured API Key is: {os.environ["DASHSCOPE_API_KEY"][:5]+"*"*5}''')

Press Enter to confirm the API key you entered. After successful input, you will see a message such as `Your configured API Key is: sk-88*****`.  
If you need to change the API key, edit the `KEY.json` file in the parent directory.  
If you are using VS Code, the input box for the API key appears at the **top** of the window.  
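
As a minimal sketch of the idea behind this cell, the snippet below reads a key from a mapping of environment variables and masks it for display. The helper names `read_api_key` and `mask_key` are hypothetical; they are not part of the course's `config.load_key` module, whose implementation is not shown here.

```python
import os


def mask_key(key: str, visible: int = 5) -> str:
    """Show only the first `visible` characters, followed by five asterisks."""
    return key[:visible] + "*" * 5


def read_api_key(env=os.environ, name: str = "DASHSCOPE_API_KEY") -> str:
    """Return the key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Demonstration with a fake environment so that no real key is needed
fake_env = {"DASHSCOPE_API_KEY": "sk-example-not-a-real-key"}
masked = mask_key(read_api_key(fake_env))
print(masked)
```

Failing early with a clear error when the variable is missing tends to be easier to debug than an authentication failure later in the API call.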



Let's start with a simple conversation. The following code creates an assistant, named "Company Xiaomi" in the system prompt, that answers questions about company operations. A common question, choosing a project management tool, serves as the example:  



In [2]:
from openai import OpenAI
import os
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
def get_qwen_response(prompt):
    response = client.chat.completions.create(
        model="qwen-plus-0919",
        messages=[
            # system message is used to set the role and task of the large model
            {"role": "system", "content": "You are responsible for answering questions in an educational content development company. Your name is Company Xiaomi, and you need to answer your colleagues' questions."},
            # user message is used to input the user's question
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
response = get_qwen_response("What tools should our company use for project management?")
print(response)

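When choosing a project management tool, consider factors such as the team's specific needs, the type of project, the budget, and how readily team members will adopt a new tool. Here are some commonly used project management tools for your reference:

1. **Trello**: Suitable for small teams and simple project management. Its card-based interface gives an intuitive view of project progress.

2. **Jira**: Especially suited to software development projects, with support for agile development methods. It is powerful but may take some time to learn to use efficiently.

3. **Asana**: Suitable for teams of all sizes and different types of projects. It provides task assignment, timeline planning, and other features, and is easy to get started with.

4. **Monday.com**: Friendly interface and highly customizable; suitable for creative teams and marketing project management.

5. **Wrike**: Provides powerful project management and collaboration features; suitable for medium and large enterprises.

6. **Notion**: Works not only as a project management tool but also for knowledge management, document editing, and other purposes; a good fit for teams that need to manage information comprehensively.

7. **Microsoft Project**: Traditional enterprise-grade project management software with comprehensive features; suitable for managing large, complex projects.

We suggest first identifying your team's specific needs (such as project scale, team size, and required features), then choosing the most suitable tools from the options above to try out. Many tools offer a free trial period, which lets team members try them and give feedback so that you can make a better-informed choice.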


If you want to implement multi-round conversations (allowing the large model to reference historical dialogue information for replies), you can refer to [multi-round conversation](https://help.aliyun.com/zh/model-studio/user-guide/text-generation#865b38621dwin).
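
A minimal sketch of the multi-round idea: keep every earlier message in a list and send the whole list on each call, so that the model can see the history. The function `chat_turn` and the offline `FakeClient` stand-in are hypothetical names introduced for illustration; with a real client you would pass the `client` object created earlier instead.

```python
from types import SimpleNamespace


def chat_turn(client, history, user_text, model="qwen-plus"):
    """Send the full history plus the new user message, then record the reply."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply


class FakeClient:
    """Offline stand-in for the OpenAI client; it reports how many messages it received."""

    def __init__(self):
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model, messages):
        text = f"received {len(messages)} messages"
        return SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content=text))])


history = [{"role": "system", "content": "You are a helpful assistant."}]
first = chat_turn(FakeClient(), history, "What is Jira?")
second = chat_turn(FakeClient(), history, "Is it suitable for small teams?")
print(first, "|", second)
```

Because the history grows with every turn, long conversations eventually approach the model's input length limit, so real applications often trim or summarize older turns.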

After running the code above, you will notice that it takes some time (about 20 seconds) to see the complete response. This is because, by default, the API waits until the model has generated all the content before returning the result in one go. In practical applications, this waiting period can hurt the user experience: imagine users staring at a blank interface for 20 seconds!

Fortunately, you can use "streaming output" to address this issue. With streaming output, the model returns the response progressively as it is generated, similar to how humans type, so users see partial responses immediately. This significantly improves the interactive experience. Next, let's look at how to implement streaming output.
> 💡 Tip: Streaming output only changes the way content is displayed; the model's reasoning process and the quality of the final answer remain unchanged. You can confidently use this feature to improve your application's user experience.

To implement streaming output, simply add the `stream=True` parameter to the previous code and adjust the output method:

In [3]:
def get_qwen_stream_response(user_prompt, system_prompt):
    response = client.chat.completions.create(
        model="qwen-plus-0919",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        stream=True
    )
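    for chunk in response:
        # Skip chunks that carry no text (for example, a final chunk whose content is None)
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content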

response = get_qwen_stream_response(user_prompt="What tools should our company use for project management?", system_prompt="You are responsible for answering questions related to educational content development in the company. Your name is Company Xiaomi, and you need to answer your colleagues' questions.")
for chunk in response:
    print(chunk, end="")

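Choosing a suitable project management tool for your company depends mainly on your specific needs, team size, budget, and the particular problems you want to solve. Here are some popular project management tools that you can consider based on your own situation:

1. **Trello**: Well suited to agile development and small teams. The interface is intuitive and easy to pick up, and it supports Kanban-style task management.

2. **Jira**: A very powerful tool for software development teams. It not only supports agile development workflows but also provides detailed reporting and tracking features.

3. **Asana**: Suitable for teams of all sizes. It provides task assignment, progress tracking, and other features, and supports multiple views (such as list and board) to fit different working scenarios.

4. **Monday.com**: A highly customizable project management platform, suitable for work environments that need a high degree of flexibility. It supports custom workflows, reports, and more.

5. **Teambition**: A collaboration tool that is fairly popular in China. Beyond basic project management features, it also includes scheduling, file sharing, and other capabilities, making it a good fit for remote teams.

6. **Wrike**: Provides a comprehensive project management solution, including timeline planning and resource management; suitable for medium and large enterprises.

We suggest first determining which features your team needs most, for example whether agile development support matters, whether you have complex permission requirements, or whether you need integrations with other services, and then choosing the most suitable tool from the options above to try out. Most tools offer a free trial period, and it is important to use that time to test thoroughly whether the tool meets your needs.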
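
The streaming loop above can be simulated offline to see how partial pieces are assembled into the full answer. This is a sketch under assumptions: `collect_stream` and `fake_chunk` are hypothetical helpers, and the simulated objects only mimic the `choices[0].delta.content` shape used in the code above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Yield non-empty text pieces from an iterable of streaming chunks."""
    for chunk in chunks:
        if not chunk.choices:          # e.g. a chunk that carries no choices at all
            continue
        piece = chunk.choices[0].delta.content
        if piece:                      # skip None or empty deltas
            yield piece


def fake_chunk(text):
    """Build an object shaped like a streaming chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


simulated = [fake_chunk("Tre"), fake_chunk("llo"), fake_chunk(None), SimpleNamespace(choices=[])]
full_text = "".join(collect_stream(simulated))
print(full_text)
```

Filtering out empty pieces keeps string concatenation safe when a chunk arrives without text.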

By asking the model the same question twice, you may notice two interesting phenomena:
1. Even when the questions are exactly the same, each response is slightly different. This is a characteristic of large language models (LLMs): like humans, they express similar ideas in different ways.
2. The model's suggestions focus on general-purpose project management tools, such as Jira, Trello, or Asana. Although LLMs are highly knowledgeable, they do not know the specific situation of your company, such as its existing toolchain, team size, or budget constraints.
> Information about the project management software used by the company can be found in the file docs/Content_Developer_Job_Guide.pdf.

These two phenomena are actually very interesting! Why do LLMs exhibit such characteristics? To understand this, we need to lift the "mysterious veil" of LLMs and see how they think and work. Don't worry, you will understand these concepts through simple and intuitive explanations.  



## 📚 2. How Large Language Models Work
In recent decades, artificial intelligence has undergone a profound evolution from basic algorithms to generative AI. Generative AI can create entirely new content, such as text, images, audio, and video, by learning from vast amounts of data, greatly promoting the widespread application of AI technology. Common application scenarios include intelligent question answering (such as Qwen-Max, GPT), creative drawing (such as Stable Diffusion), and code generation (such as Lingma), covering various fields and making AI accessible.

<a href="https://img.alicdn.com/imgextra/i3/O1CN01XhNFzh1bj3EybLhgk_!!6000000003500-0-tps-2090-1138.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01XhNFzh1bj3EybLhgk_!!6000000003500-0-tps-2090-1138.jpg" width="600">
</a>

Intelligent question answering is one of the most classic and widely used applications of LLMs, which makes it a good example for exploring how they work. The following introduces the LLM workflow in question-answering scenarios to help you better understand the underlying technical principles.

### 2.1 The Question-Answering Workflow of LLMs
The example below sends the input text “ACP is a very” to the LLM. The diagram shows the complete process from submitting the query to outputting the text.

<a href="https://img.alicdn.com/imgextra/i1/O1CN01yLBhyu1gSAlt3oI0p_!!6000000004140-2-tps-2212-1070.png" target="_blank">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01yLBhyu1gSAlt3oI0p_!!6000000004140-2-tps-2212-1070.png" width="800">
</a>

The question-answering workflow of an LLM consists of five stages:

**Stage 1: Tokenization of Input Text**

A token is the basic unit of text processing in LLMs, typically representing a word, phrase, or symbol. The sentence “ACP is a very” must be broken down into smaller units with independent semantic meaning (tokens), and each token is assigned an ID. If necessary, you can use the [Tokenizer API](https://help.aliyun.com/zh/dashscope/developer-reference/tokenization-api?spm=5176.28197632.0.0.2130607dUIVd7Y&disableWebsiteRedirect=true) to count tokens.

<a href="https://gw.alicdn.com/imgextra/i1/O1CN019gAS3k1DrhwpIdHl6_!!6000000000270-0-tps-2414-546.jpg" target="_blank">
<img src="https://gw.alicdn.com/imgextra/i1/O1CN019gAS3k1DrhwpIdHl6_!!6000000000270-0-tps-2414-546.jpg" width="800">
</a>
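
To make Stage 1 concrete, here is a toy tokenizer that splits on whitespace and assigns each distinct word an integer ID. It is only a sketch: production tokenizers typically work on subword units with a much larger vocabulary, and the function names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer ID to each distinct whitespace-separated word, in first-seen order."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each word of `text` to its ID in `vocab`."""
    return [vocab[word] for word in text.split()]


vocab = build_vocab(["ACP is a very informative course"])
token_ids = encode("ACP is a very", vocab)
print(token_ids)
```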

**Stage 2: Token Embedding**

Computers can only process numbers and cannot directly comprehend the meaning of tokens. Therefore, tokens must be converted into numerical representations (that is, vectors). Token embedding transforms each token into a vector of fixed dimension.
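
Stage 2 can be pictured as a table lookup: each token ID selects one row of fixed length. The sketch below fills the table with random numbers purely for illustration; in a real model the values are learned during training, and `make_embedding_table` and `embed` are hypothetical names.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create one random vector of length `dim` per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the vector for each token ID."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=6, dim=4)
vectors = embed([0, 1, 2, 3], table)
print(len(vectors), len(vectors[0]))
```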


**Stage 3: Inference by the LLM**

The LLM learns knowledge from vast amounts of pre-training data. When we input new content, such as “ACP is a very,” the LLM draws on what it has learned to make a prediction: it calculates a probability for every possible next token and produces a set of candidate tokens. Finally, the LLM selects one token as the next output based on these probabilities.

This explains why, when asked about a company's internal project management tools, the model cannot suggest them: its predictions are based solely on its pre-training data, and it has no knowledge of information it has never been exposed to. Therefore, when a Q&A bot must answer domain-specific questions, this issue needs to be addressed specifically, as Section 3 of this chapter explains.
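
The probability calculation in Stage 3 is commonly implemented with a softmax over raw scores (logits). The sketch below assumes that approach; the scores are made-up values for illustration, not output from any real model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical scores for the token that follows "ACP is a very"
logits = {"informative": 2.0, "fun": 1.3, "enlightening": 1.1, "boring": -0.3}
probs = softmax(logits)
for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token}: {p:.2f}")
```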

**Stage 4: Output Tokens**

Because the LLM samples randomly from the candidate tokens according to their probabilities, “even if the question is exactly the same, the answers are slightly different each time.” To control the randomness of the generated content, you commonly adjust parameters such as temperature and top-p.

For example, in the figure below, the first set of candidate tokens output by the LLM is “informative (50%),” “fun (25%),” “enlightening (20%),” and “boring (5%).” Adjusting the temperature or top-p parameter influences which candidate the LLM tends to select, for example making it more likely to choose the highest-probability option, “informative.” You can learn more about these two parameters in Section 2.2 of this chapter.

<a href="https://img.alicdn.com/imgextra/i3/O1CN01N93ZE81e6zAZA4TiK_!!6000000003823-0-tps-582-340.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i3/O1CN01N93ZE81e6zAZA4TiK_!!6000000003823-0-tps-582-340.jpg" width="180">
</a>
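
Probability-weighted selection can be simulated with the standard library. The sketch below uses the candidate probabilities from the example above and draws many samples to show that higher-probability tokens are chosen more often, but not always; `sample_token` is a hypothetical helper name.

```python
import random
from collections import Counter

candidates = {"informative": 0.50, "fun": 0.25, "enlightening": 0.20, "boring": 0.05}


def sample_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed only so that the demo is repeatable
counts = Counter(sample_token(candidates, rng) for _ in range(10_000))
print(counts.most_common())
```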

Specifically, “informative” is then fed back into the LLM to generate the next set of candidate tokens. This process is called autoregressive generation: each step uses both the input text and the previously generated text. The LLM generates tokens one after another in this way.

**Stage 5: Output Text**

Stage 3 and Stage 4 are repeated until a special token (such as `<EOS>`, end of sequence) is output or the output length reaches a threshold, which concludes the question-answering session. The LLM then outputs all generated content. You can also use the streaming output capability, which returns tokens as soon as they are predicted. In this example, the final output would be “ACP is a very informative course.”
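
The repeat-until-stop loop of Stages 3 to 5 can be sketched with a toy "model" that is just a lookup table from the last token to the next one. The table `NEXT_TOKEN` and the function `generate` are hypothetical; a real LLM conditions on the whole context and samples from a probability distribution instead.

```python
# Hypothetical "model": for each last token, the single next token it predicts
NEXT_TOKEN = {"very": "informative", "informative": "course", "course": ".", ".": "<EOS>"}


def generate(prompt_tokens, max_new_tokens=10):
    """Append predicted tokens until <EOS> appears or the length limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<EOS>")
        if next_token == "<EOS>":
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens


output = generate(["ACP", "is", "a", "very"])
print(" ".join(output))
```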

### 2.2 Parameters Affecting the Randomness of Content Generation in Large Language Models (LLMs)

Assume a question-and-answer scenario where the user asks: "What can you learn in the LLM ACP course?" To simulate the content generation process of an LLM, we have preset a candidate token set consisting of the following tokens: "RAG", "prompt", "model", "writing", and "drawing". The LLM will select one of these five candidate tokens as the output (next-token), as shown below.
> User question: What can you learn in the LLM ACP course?<br><br>
> LLM response: RAG <br>

In this process, two important parameters influence the LLM's output: temperature and top_p. These parameters control the randomness and diversity of the content generated by the LLM. Below, we introduce how these two parameters work and how to use them.

#### 2.2.1 Temperature: Adjusting the Probability Distribution of the Candidate Token Set

Before generating the next word (next-token), the LLM first calculates an initial probability distribution for the candidate tokens. This distribution represents the probability of each candidate token being selected as the next-token. Temperature acts as a regulator that alters the probability distribution of the candidate tokens, influencing the content generation of the LLM. By adjusting this parameter, you can flexibly control the diversity and creativity of the generated text.

To better understand, the figure below illustrates the impact of different temperature values on the probability distribution of the candidate tokens. The plotting code is located in the /resources/2_1 directory.

<a href="https://img.alicdn.com/imgextra/i4/O1CN0137QeqL1o3uhFmaXHU_!!6000000005170-0-tps-3538-1242.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i4/O1CN0137QeqL1o3uhFmaXHU_!!6000000005170-0-tps-3538-1242.jpg" width="1000">
</a>

The low, medium, and high temperatures in the figure are based on the range [0, 2) of the Qwen-Plus model.

As shown in the figure, as the temperature increases from low to high (0.1 -> 0.7 -> 1.2), the probability distribution becomes flatter. The probability of the candidate token "RAG" decreases from 0.8 -> 0.6 -> 0.3. Although it remains the most probable token, its probability is now closer to that of other candidate tokens. Consequently, the final output transitions from relatively fixed to increasingly diverse.
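
The flattening effect can be reproduced with a temperature-scaled softmax, which divides each raw score by the temperature before normalizing. This is a sketch of the common formulation; the logits below are hypothetical and are not the values behind the figure, so the printed probabilities differ from 0.8, 0.6, and 0.3.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["RAG", "prompt", "model", "writing", "drawing"]
logits = [2.0, 1.2, 0.8, 0.3, 0.0]  # hypothetical scores, not taken from the figure
for t in (0.1, 0.7, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The sketch rejects a temperature of 0 because the division is undefined there; how a particular service treats that value is outside what this example shows.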

For different use cases, you can refer to the following recommendations for setting the temperature parameter:
- Clear answers (e.g., generating code): Lower the temperature.
- Creative variety (e.g., ad copy): Increase the temperature.
- No special requirements: Use the default temperature (usually within the medium range).

Note that when temperature = 0, although randomness is minimized, it does not guarantee identical outputs every time. For a deeper understanding, you can refer to [the underlying algorithm implementation of temperature](https://github.com/huggingface/transformers/blob/v4.49.0/src/transformers/generation/logits_process.py#L226).

Next, let’s experience the effect of temperature. By adjusting the temperature value, ask the same question 10 times and observe the fluctuations in the responses.
> The example code for temperature is similar to the upcoming explanation of top_p, so it has been encapsulated for subsequent use.

In [4]:
import time

def get_qwen_stream_response(user_prompt, system_prompt, temperature, top_p):
    response = client.chat.completions.create(
        model="qwen-plus-0919",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=temperature,
        top_p=top_p,
        stream=True
    )
    
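    for chunk in response:
        # Skip chunks that carry no text; adding None to a string would raise a TypeError
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content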

# The default values of temperature and top_p use the default values of the Qwen-Plus model
def print_qwen_stream_response(user_prompt, system_prompt, temperature=0.7, top_p=0.8, iterations=10):
    for i in range(iterations):
        print(f"Output {i + 1} : ", end="")
        ## Add delay to prevent rate limiting
        time.sleep(0.5)
        response = get_qwen_stream_response(user_prompt, system_prompt, temperature, top_p)
        output_content = ''
        for chunk in response:
            output_content += chunk
        print(output_content)

# Qwen-Plus model: The value range of temperature is [0, 2), with a default value of 0.7
# Set temperature=0
print_qwen_stream_response(user_prompt="A horse can also be called", system_prompt="Please help me continue writing, with a word count requirement of 4 Chinese characters or less.", temperature=0)

Output 1 : 骥、骊、駮、骝。
Output 2 : 骏马
Output 3 : 驷、驹、骝、骥。
Output 4 : 赤兔
Output 5 : 驷、驹、骝、骥。
Output 6 : 驹子
Output 7 : 骏马
Output 8 : 驹子
Output 9 : 驷、驹、骥、騋。
Output 10 : 驹子


In [5]:
# Set temperature=1.9
print_qwen_stream_response(user_prompt="A horse can also be called", system_prompt="Please help me continue writing, the word count requirement is within 4 Chinese characters.", temperature=1.9)

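Output 1 : 千里马。
Output 2 : 四蹄兽。
Output 3 : 驹子。
Output 4 : 千里马。
Output 5 : 驹，骉。
Output 6 : 骏马。
Output 7 : 赤兔
Output 8 : 四蹄动物
Output 9 : 赤兔黄忠. In fact, though, there are many names for a horse, such as 骏马, racehorse, Mustang (note: here Mustang refers to a breed of wild horse), mount, and so on. If you only want a two-character answer, it should be "骏马". But following your requirement of four Chinese characters or fewer, the answer I give is "赤兔". It should be explained that "赤兔" refers to a famous horse from ancient legend, while "黄忠" is a figure from the Three Kingdoms period; it is removed here to meet your character limit. Note, however, that names for horses differ by context, for example by color, build, or speed. If you are interested in what horses are called in a particular respect or context, please let me know and I will do my best to help. Based on the requirement you gave, though, the answer is: 赤兔.
Output 10 : 骝马。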


It can be clearly observed from the experiment that the higher the temperature value, the more varied and diverse the content generated by the model becomes.

#### 2.2.2 top_p: Controlling the Sampling Range of the Candidate Token Set

top_p is a filtering mechanism that selects a small subset of the candidate token set. It works as follows: sort the tokens by probability from high to low, then keep adding tokens until their cumulative probability reaches the set threshold. The selected tokens form a new, narrower candidate set.

The figure below shows the sampling effect of different top_p values on the candidate token set.

<a href="https://img.alicdn.com/imgextra/i1/O1CN01xmkonv21sNL6VtQpi_!!6000000007040-0-tps-2732-1282.jpg" target="_blank">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01xmkonv21sNL6VtQpi_!!6000000007040-0-tps-2732-1282.jpg" width="700">
</a>

In the illustration, the blue part represents tokens whose cumulative probability reaches the top_p threshold (e.g., 0.5 or 0.9), forming a new candidate set; the gray part represents tokens that are not selected.

When top_p=0.5, the model prioritizes selecting the highest-probability token, i.e., "RAG"; when top_p=0.9, the model randomly selects one among "RAG," "Prompt," and "Model" to generate output.
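
The filtering step can be sketched in a few lines. The probabilities below are hypothetical (the exact values in the figure may differ), chosen so that the two thresholds reproduce the outcome described above; `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, top_p):
    """Keep the highest-probability tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalize the survivors


# Hypothetical probabilities; the values in the figure may differ
probs = {"RAG": 0.60, "prompt": 0.20, "model": 0.12, "writing": 0.05, "drawing": 0.03}
print(list(top_p_filter(probs, 0.5)))
print(list(top_p_filter(probs, 0.9)))
```

After filtering, the model samples from the renormalized survivors, which is why a single remaining token yields a fixed choice.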


From this, it can be seen that the impact of the top_p value on the content generated by large language models (LLMs) can be summarized as follows:
- Larger value: Wider candidate range, more diverse content, suitable for creative writing, poetry generation, and other scenarios.
- Smaller value: Narrower candidate range, more stable output, suitable for news drafts, code generation, and other scenarios requiring clear answers.
- Extremely small value (e.g., 0.0001): Theoretically, the model only selects the highest-probability token, resulting in very stable output. However, in practice, due to factors such as distributed systems and additional adjustments to model outputs, slight randomness may still be introduced, making it impossible to guarantee completely consistent output every time.


Below, experience the effect of top_p. By adjusting the top_p value, ask the same question 10 times and observe the fluctuations in the response content.

In [6]:
# Qwen-Plus model: The value range of top_p is (0,1], with a default value of 0.8.
# Set top_p=0.001
print_qwen_stream_response(user_prompt="Name an intelligent gaming smartphone, it could be", system_prompt="Please help me name it, the requirement is within 4 Chinese characters.", top_p=0.001)

Output 1 : "智游无界"
Output 2 : "智游无界"
Output 3 : "智游无界"
Output 4 : "智游无界"
Output 5 : "智游无界"
Output 6 : "智游无界"
Output 7 : "智游无界"
Output 8 : "智游无界"
Output 9 : "智游无界"
Output 10 : "智游无界"


In [7]:
# Set top_p=0.9
print_qwen_stream_response(user_prompt="Name an intelligent gaming smartphone, it could be", system_prompt="Please help me name it, the requirement is within 4 Chinese characters.", top_p=0.9)

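Output 1 : "智玩无界"
Output 2 : "智游无界"
Output 3 : "智胜游戏王" or "极智战神". Both names highlight the phone's intelligence and strong gaming performance, and can attract the attention of target consumers.
Output 4 : "智胜游戏王" or "电竞性能者". Both names emphasize the phone's intelligence and strong gaming performance. If you need something more concise, you could also consider "智胜王" or "性能者".
Output 5 : "智游无界"
Output 6 : "智游无界"
Output 7 : "智游无界"
Output 8 : "智游无界"
Output 9 : "智游无界"
Output 10 : "智胜游戏王" or "极智战神". Both names emphasize the phone's intelligence and strong gaming performance, and can attract consumers who enjoy mobile games. I hope this helps!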


Based on the experimental results, it can be observed that the higher the top_p value, the greater the randomness in the output of large language models (LLMs).



#### 2.2.3 Summary

**Should temperature and top_p be adjusted simultaneously?**

To ensure the controllability of the generated content, it is recommended not to adjust top_p and temperature at the same time. Simultaneous adjustment may lead to unpredictable output results. You can prioritize adjusting one parameter, observe its impact on the results, and then fine-tune gradually.

><br>**Knowledge Extension: top_k**<br>In the Qwen series models, the top_k parameter plays a role similar to top_p. Refer to the [Qwen API Documentation](https://help.aliyun.com/zh/model-studio/developer-reference/use-qwen-by-calling-api?spm=a2c4g.11186623.help-menu-2400256.d_3_3_0.68332bdb2Afk2s&scm=20140722.H_2712576._.OR_help-V_1). It is a sampling mechanism that randomly selects one token from the k most probable tokens for output. Generally speaking, the larger the top_k, the more diverse the generated content; the smaller the top_k, the more fixed the content. When top_k is set to 1, the model selects only the token with the highest probability, making the output more stable but also less varied and creative.<br><br>
>**Knowledge Extension: seed**<br>In the Qwen series models, the seed parameter also helps control the determinism of the generated content. Refer to the [Qwen API Documentation](https://help.aliyun.com/zh/model-studio/developer-reference/use-qwen-by-calling-api?spm=a2c4g.11186623.help-menu-2400256.d_3_3_0.68332bdb2Afk2s&scm=20140722.H_2712576._.OR_help-V_1). Passing the same seed value on each model invocation while keeping other parameters unchanged makes the model return the same response as far as possible, but it does not guarantee completely identical results every time.<br><br>
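
The two ideas above can be sketched together: top_k keeps only the k most probable tokens, and a fixed seed makes the pseudo-random choice repeatable within one program. The function `top_k_sample` and the probabilities are hypothetical; this local simulation says nothing about how the service implements either parameter.

```python
import random


def top_k_sample(probs, k, seed=None):
    """Sample one token from the k most probable candidates."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    rng = random.Random(seed)  # the same seed gives the same pseudo-random choice
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"RAG": 0.60, "prompt": 0.20, "model": 0.12, "writing": 0.05, "drawing": 0.03}
print(top_k_sample(probs, k=1))  # only one candidate remains, so the choice is fixed
print(top_k_sample(probs, k=3, seed=7) == top_k_sample(probs, k=3, seed=7))
```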

**Why does randomness still exist when controlling the output of LLMs by setting temperature, top_p, and seed?**

Even if temperature is set to 0, top_p is set to an extremely small value (e.g., 0.0001), and the same seed is used, the generated results for the same question may still be inconsistent. This is because some complex factors may introduce slight randomness, such as the LLM running in a distributed system or optimizations being applied to the model's output.

For example:
A distributed system is like slicing bread with different machines. Although each machine operates according to the same settings, subtle differences between devices may still result in slightly different slices of bread.  
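
One general numerical effect in the same spirit as the bread analogy: floating-point addition is not associative, so adding the same numbers in a different order can give slightly different results. The snippet only illustrates that general property; it is an assumption, not something the source states, that this contributes to the variation seen with any particular model service.

```python
# The same three numbers, added in two different orders
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)
print(left_to_right, right_to_left)
```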



## ⚙️ 3. Enable Large Language Models (LLMs) to Answer Private Knowledge Questions
To enable large language models (LLMs) to answer private knowledge questions, you can choose one of the following two approaches:
- **Without modifying the model**<br>
    Directly provide private knowledge-related reference information when asking questions.
- **Modifying the model**<br>
    Fine-tune the model, or even train a new one, so that it can better understand and answer questions in specific domains.

Considering the high cost of fine-tuning and training new models, in private knowledge question-answering scenarios you can prioritize passing private knowledge through the prompt. This method is simpler and more efficient.

In [8]:
# User question
user_question = "I'm from software group one. What tool should be used for project management?"

# Company project management tool related knowledge
knowledge = """There are two options for company project management tools:
  1. **Jira**: For software development teams, Jira is a very powerful tool that supports agile development methods such as Scrum and Kanban. It provides rich features including issue tracking, time tracking, etc.

  2. **Microsoft Project**: For large enterprises or complex projects, Microsoft Project offers detailed planning, resource allocation, and cost control functions. It is more suitable for scenarios where strict control over project timelines and costs is required.

  In general, please use Microsoft Project as the company has purchased full licenses. Software development groups one, three, and four are currently using Jira and are planned to gradually switch to Microsoft Project before 2026.
"""

response = get_qwen_stream_response(
    user_prompt=user_question,
    # Pass the company project management tool related knowledge as background information into the system prompt
    system_prompt="You are responsible for answering questions in an educational content development company. Your name is Xiao Mi. You need to answer students' questions.\n\n" + knowledge,
    temperature=0.7,
    top_p=0.8
)

for chunk in response:
    print(chunk, end="")

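Hello! As a member of software group one, according to the company's arrangements, your group is currently still using **Jira** for project management. However, the company plans to gradually switch all teams to **Microsoft Project** before 2026.

If you have any questions about using Jira, or want to know the specific plan for gradually switching to Microsoft Project, feel free to ask me at any time. I hope this helps!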

After reference information is passed in with the question, the LLM can answer questions about private knowledge. However, the disadvantage of this method is also obvious: the prompt length is limited. When the amount of private data is large, passing in all of it as background information may make the prompt excessively long, which can reduce the model's processing efficiency or hit the length limit.
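
The length concern can be made explicit when assembling the prompt. The sketch below builds the message list and rejects a system prompt that exceeds a character budget; `build_messages` and the `max_chars` value are hypothetical, and a character count is only a rough stand-in for the model's actual token limit.

```python
def build_messages(role_prompt, knowledge, question, max_chars=2000):
    """Combine role instructions and reference knowledge into a system message."""
    system_prompt = f"{role_prompt}\n\nReference information:\n{knowledge}"
    if len(system_prompt) > max_chars:
        raise ValueError(
            f"System prompt has {len(system_prompt)} characters; the budget is {max_chars}"
        )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "You answer colleagues' questions at an educational content development company.",
    "Software development groups one, three, and four currently use Jira.",
    "I'm from software group one. What tool should be used for project management?",
)
print(messages[0]["content"])
```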

To solve this problem, you can automatically retrieve relevant private knowledge when users ask questions, then merge the retrieved document snippets with user input before sending them to the LLM to generate the final answer. This avoids directly passing a large amount of background information into the prompt. This implementation approach is also known as Retrieval-Augmented Generation (RAG).

Building a RAG application typically involves two phases:

### 3.1 Indexing Phase
<img src="https://gw.alicdn.com/imgextra/i2/O1CN010zLf411zVoZQ9cWsI_!!6000000006720-2-tps-1592-503.png" width="600">

Indexing converts private knowledge documents into a form that can be retrieved efficiently. The file content is split into fragments, each fragment is transformed into a text embedding (a vector) by a dedicated embedding model, and the vectors are stored so that similarity can be calculated later while the semantic information is retained. Vectorization lets the system retrieve and match relevant content efficiently, which matters especially for large-scale knowledge bases, where it improves both query accuracy and response speed.

Because the vectors produced by the embedding model capture the semantics of the text in a standardized numerical form, they can later be compared directly with the vector of a search query to calculate relevance.
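
A toy version of the indexing phase: split the document into fixed-size word chunks and store each chunk next to a vector. Here the "embedding" is just a word-count vector so that the example runs without any model; a real system would call an embedding model instead, and the names `split_into_chunks` and `embed` are hypothetical.

```python
from collections import Counter


def split_into_chunks(text, max_words=12):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(word.strip(".,").lower() for word in text.split())


document = (
    "In general, please use Microsoft Project as the company has purchased full licenses. "
    "Software development groups one, three, and four are currently using Jira."
)
index = [(chunk, embed(chunk)) for chunk in split_into_chunks(document)]
for chunk, vector in index:
    print(len(vector), chunk)
```

Splitting by a fixed word count is the simplest choice; it can cut a sentence in half, which is why practical systems often split on sentence or paragraph boundaries.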

### 3.2 Retrieval and Generation Phase
<img src="https://img.alicdn.com/imgextra/i1/O1CN01vbkBXC1HQ0SBrC1Ii_!!6000000000751-2-tps-1776-639.png" width="600">

Retrieval and generation involve retrieving relevant document fragments from the index based on the user’s question. These fragments will be input together with the question into the LLM to generate the final response. In this way, the LLM can answer questions about private knowledge.
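
The retrieval step can be sketched as a similarity search followed by prompt assembly. This example uses cosine similarity over toy word-count vectors; the helper names and the sample chunks are hypothetical, and a real application would compare vectors produced by an embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(word.strip(".,?").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_n=1):
    """Return the `top_n` chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_n]


chunks = [
    "Software development groups one, three, and four are currently using Jira.",
    "The cafeteria opens at noon on weekdays.",
]
question = "Which project tool does software group one use?"
context = retrieve(question, chunks)[0]
prompt = f"Answer using this reference:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Only the retrieved fragment is placed in the prompt, which keeps the input short no matter how large the knowledge base grows.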

In summary, applications based on the RAG structure avoid various issues caused by inputting entire reference documents as background information while extracting the most relevant parts through retrieval, thus improving the accuracy and relevance of the LLM output.

## 📝 4. Summary of this Section
In this section, we learned the following:

- **How to use the LLM API**<br>
    Through practical code examples, we learned how to call the LLM API and experience its capabilities in question-answering tasks.
- **A preliminary understanding of how LLMs work**<br>
    We explored how LLMs understand questions and generate responses, while also discussing the limitations of randomness and knowledge scope, and how to address these shortcomings.

Beyond the tasks demonstrated in this section, LLMs can handle more types of tasks such as content generation, structured information extraction, text classification, and sentiment analysis. Additionally, introducing the RAG solution into your LLM applications can expand the scope of knowledge they can handle. In the next section, we will introduce methods for creating RAG applications.

### Further Reading
- While studying this course, if you want to learn more about related concepts and principles, you can try asking the LLM to provide further explanations or learning suggestions:
> Qwen-Max supports the `enable_search` parameter, which lets the LLM enrich its responses with internet search results while generating a response.

In [9]:
completion = client.chat.completions.create(
    model="qwen-plus",  # This example uses qwen-plus. You can replace it with other model names as needed. Model list: https://help.aliyun.com/zh/model-studio/getting-started/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Number of gold medals won by Team China at the Paris Olympics"},
    ],
    extra_body={"enable_search": True},
)
print(completion.choices[0].message.content)

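At the 2024 Paris Olympics, the Chinese delegation won **40 gold medals**, ranking second in the gold medal table with the same number of gold medals as the first-place United States team. This result also set a new record for the number of gold medals China has won at an Olympics held abroad.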


Alibaba Cloud's reasoning model [QwQ](https://help.aliyun.com/zh/model-studio/user-guide/qwq) has strong reasoning capabilities. It outputs its thought process first and then provides the response content.



In [10]:
reasoning_content = ""  # Define the complete reasoning process
answer_content = ""     # Define the complete response
is_answering = False   # Determine whether the reasoning process has ended and the response has started

# Create a chat completion request
completion = client.chat.completions.create(
    model="qwq-32b",  # Here, qwq-32b is used as an example; you can replace it with another model name as needed
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + " Reasoning Process " + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print the reasoning process
        if getattr(delta, 'reasoning_content', None) is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        elif delta.content:
            # Start responding: print the header once, before the first piece of the answer
            if not is_answering:
                print("\n" + "=" * 20 + " Complete Response " + "=" * 20 + "\n")
                is_answering = True
            # Print the response process (chunks with empty or missing content are skipped)
            print(delta.content, end='', flush=True)
            answer_content += delta.content



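Hmm, the user is asking "which is larger, 9.9 or 9.11". First, I need to confirm that the user's question is about comparing numerical values. Here "9.9" and "9.11" look like two decimals, but the user may have a specific context in mind, such as dates or version numbers. Usually, though, comparing numbers simply means comparing their values.

First, I should recall how decimals are compared. When comparing decimals, compare the integer parts first; the number with the larger integer part is larger. If the integer parts are the same, compare the fractional parts digit by digit, starting from the tenths place, until a difference is found.

Now look at these two numbers, 9.9 and 9.11. Their integer parts are both 9, so the integer parts are equal. Next, the fractional parts need to be compared. Note that 9.9 can be seen as 9.90, while 9.11 is 9.11, so this becomes a comparison of two-digit fractional parts. So the tenths digits are 9 and 1? No, wait, maybe I got something wrong.

Let me look more carefully: the fractional part of 9.9 is 0.9, that is, the tenths digit is 9 and the hundredths digit is 0. The fractional part of 9.11 is 0.11, that is, the tenths digit is 1 and the hundredths digit is 1. Comparing them, the tenths digit of 9.9 is 9 and the tenths digit of 9.11 is 1. Clearly 9 is greater than 1, so 9.9 as a whole should be larger than 9.11?

But this seems off, because with version numbers or dates, 9.11 might mean September 11. If these are numerical values, though, the user probably wants to know which is larger. By mathematical comparison, 9.9 is indeed larger, because the first digit after the decimal point, 9, is greater than 1. But might the user have other considerations?

Or the user might take 9.11 as 9.11 and 9.9 as 9.90. Comparing 0.90 and 0.11, 0.90 is clearly larger, so 9.9 is larger. However, in some cases, such as version numbers like 2.10 and 2.9, 2.10 is usually greater than 2.9, because when version numbers have different numbers of digits they may need to be zero-padded for comparison, for example treating 2.9 as 2.09, which makes 2.10 larger. But here we have 9.9 and 9.11; could the user be asking in a version-number context?

However, the question does not mention version numbers, so it should be treated as an ordinary numerical comparison. In that case, the correct result is that 9.9 is larger than 9.11, because the tenths digit 9 is greater than 1. The user might be confused, though, because people sometimes mistakenly think 11 is greater than 9, when in fact the digits are in different positions after the decimal point.

Or the user might take 9.11 as the decimal 9.11 and 9.9 as 9.9. In that case 9.9 is indeed larger, because the first digit after the decimal point, 9, is greater than 1. So the conclusion should be that 9.9 is larger. But I need to check carefully once more.

For example, convert both to the same number of decimal places: 9.9 = 9.90, and 9.11 = 9.11. When comparing,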

- If you are interested in multi-modal large models, you can refer to:

    * [Visual Understanding](https://help.aliyun.com/zh/model-studio/user-guide/vision)
    * [Audio Understanding](https://help.aliyun.com/zh/model-studio/user-guide/audio-language-model)
    * [Omni-modal](https://help.aliyun.com/zh/model-studio/user-guide/qwen-omni)
    * [Text-to-Image](https://help.aliyun.com/zh/model-studio/user-guide/text-to-image)
    * [AI Video Generation](https://help.aliyun.com/zh/model-studio/user-guide/video-generation)  



## 🔥 Post-class Quiz

### 🔍 Multiple Choice Question 2.1.1
<details>
<summary style="cursor: pointer; padding: 12px; border: 1px solid #dee2e6; border-radius: 6px;">
<b>What is the purpose of the following code snippet❓</b>
<pre style="margin: 10px 0;">
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
</pre>

- A. Load the API key from disk  
- B. Store the API key in memory  
- C. Set the API key as an environment variable  
- D. Create a new API key  

**[Click to view answer]**
</summary>

<div style="margin-top: 10px; padding: 15px; border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

✅ **Reference Answer: C**  
📝 **Explanation**:  
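The assignment sets the `OPENAI_API_KEY` environment variable for the current Python process through the operating system's environment variable interface. The value lives only in that process (and in child processes started afterwards); nothing is loaded from disk and no new key is created.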
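
A minimal sketch to check this behavior yourself. The key value is the same placeholder used in the question, not a real key, and the child process is only there to show inheritance:

```python
import os
import subprocess
import sys

os.environ["OPENAI_API_KEY"] = "your-api-key-here"  # placeholder, not a real key

# The value is now visible to the current process...
print(os.getenv("OPENAI_API_KEY"))

# ...and is inherited by child processes started afterwards.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getenv('OPENAI_API_KEY'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print(child_value)
```

The variable disappears when the process exits, so the assignment has to run again (or the variable has to be set in the shell) in every new session.
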
</div>
</details>

---

### 📝 Case Analysis Question 2.1.2
<details>
<summary style="cursor: pointer; padding: 12px;  border: 1px solid #dee2e6; border-radius: 6px; font-color:#000;">
<b>Xiaoming, while developing a writing assistant, encounters the following two scenarios. How should he solve these problems❓</b> 

- Scenario 🅰️ Lack of creativity in generated content: Every time he asks the model to write an article about "the development of artificial intelligence," the generated content is very similar.  
- Scenario 🅱️ Generated content deviates from the topic: When asking the model to write a technical document, the generated content often includes irrelevant information.  

**Question:**  
1. Based on the LLM workflow learned in this section, what might be causing the issues in these two scenarios?  
2. How should the temperature or top_p parameters be adjusted to address these problems?  

**[Click to view answer]**
</summary>


<div style="margin-top: 10px;  padding: 15px;  border: 1px solid #dee2e6; border-radius: 0 0 6px 6px;">

#### 🎯 Solution for Scenario A
**🔍 Cause Analysis**  
The `temperature` value is too low (e.g., 0.3), so the model almost always picks the highest-probability tokens, resulting in a lack of diversity in the generated content.

**⚙️ Parameter Adjustment**  



```python
temperature = 0.8  # Increase creativity (suggested range: 0.7 to 0.9)
top_p = 0.9        # Expand the range of word selection
```

#### 🎯 Solution for Scenario B
**🔍 Cause Analysis**  
`temperature` is too high (e.g., 1.2) or `top_p` is too large, so low-probability tokens are sampled more often and the content drifts away from the topic.

**⚙️ Parameter Adjustment**  



```python
temperature = 0.6  # Reduce randomness (suggested range: 0.5 to 0.7)
top_p = 0.75       # Focus on high-probability words (suggested range: 0.7 to 0.8)
```
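
To see why these adjustments have the described effect, here is a toy sketch of temperature scaling and top-p (nucleus) filtering over a made-up four-token vocabulary. The tokens and scores are invented for illustration, and the functions follow the common textbook definitions; the exact implementation inside any particular model service is an assumption.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

tokens = ["AI", "robots", "history", "cooking"]  # made-up candidate tokens
logits = [2.0, 1.0, 0.5, -1.0]                   # made-up scores

# Lower temperature concentrates probability on the top token; higher temperature spreads it out.
for temperature in (0.3, 0.8, 1.2):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# A smaller top_p removes the low-probability tail before sampling.
candidates = top_p_filter(apply_temperature(logits, 0.8), top_p=0.7)
print([tokens[i] for i in candidates])
```

In this toy setup, the low temperature leaves almost no chance for anything but the top token (Scenario A), while a high temperature combined with a large `top_p` keeps unlikely tokens such as "cooking" in play (Scenario B).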

🌟 **Parameter Tuning Tips**
> Adjust the parameters by about ±0.2 at a time and compare the results through A/B testing.
> If you need to balance Scenario A and Scenario B, try the combination `temperature=0.6` + `top_p=0.8`.</div>
</details>  



## ✅ Evaluation and Feedback
We welcome you to participate in the [Alibaba Cloud Large Language Model ACP Course Survey](https://survey.aliyun.com/apps/zhiliao/Mo5O9vuie) to provide feedback on your learning experience and course evaluation.
Your criticism and encouragement are our motivation to move forward!  

