<center><img src="/files/images/DLI_Header.png" /></center>

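# Star Bikes AI Assistant

In this notebook, you will create an AI assistant that helps customers make the best decision when buying a bicycle from Star Bikes. You will also learn about the **token limit** of the model you are using and how it affects retaining conversation history.

## Learning Objectives

By the time you complete this notebook, you will be able to:
* Explain the **token limit** and how it affects LLM behavior.
* Build an AI assistant with (limited) conversational memory that does not exceed the **token limit**.

## Video Walkthrough

Execute the cell below to load the video walkthrough for this notebook.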

In [None]:
from IPython.display import HTML, display

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-12-v1/v2/07-assistant.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

## Create LLaMA-2 Pipeline

In [None]:
from transformers import pipeline
model = "TheBloke/Llama-2-13B-chat-GPTQ"
# model = "TheBloke/Llama-2-7B-chat-GPTQ"

llama_pipe = pipeline("text-generation", model=model, device_map="auto");

## Get LLaMA-2 Tokenizer

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model)

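## Helper Functions and Classes

In this notebook, we will use the following functions and classes to support our interaction with the LLM. Feel free to skim over them now, since they are introduced in more detail as they are used below.

### Generate Model Responses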

In [None]:
def generate(prompt, max_length=4096, pipe=llama_pipe, **kwargs):
    """
    Generates a response to the given prompt using a specified language model pipeline.

    This function takes a prompt and passes it to a language model pipeline, such as LLaMA, 
    to generate a text response. The function is designed to allow customization of the 
    generation process through various parameters and keyword arguments.

    Parameters:
    - prompt (str): The input text prompt to generate a response for.
    - max_length (int): The maximum total length in tokens, prompt plus generated response. Default is 4096 tokens.
    - pipe (callable): The language model pipeline function used for generation. Default is llama_pipe.
    - **kwargs: Additional keyword arguments that are passed to the pipeline function.

    Returns:
    - str: The generated text response from the model, trimmed of leading and trailing whitespace.

    Example usage:
    ```
    prompt_text = "Explain the theory of relativity."
    response = generate(prompt_text, max_length=512, pipe=my_custom_pipeline, temperature=0.7)
    print(response)
    ```
    """

    def_kwargs = dict(return_full_text=False, return_dict=False)
    response = pipe(prompt.strip(), max_length=max_length, **kwargs, **def_kwargs)
    return response[0]['generated_text'].strip()

### Construct Prompt, Optionally With System Context and/or Examples

In [None]:
def construct_prompt_with_context(main_prompt, system_context="", conversation_examples=[]):
    """
    Constructs a complete structured prompt for a language model, including optional system context and conversation examples.

    This function compiles a prompt that can be directly used for generating responses from a language model. 
    It creates a structured format that begins with an optional system context message, appends a series of conversational 
    examples as prior interactions, and ends with the main user prompt. If no system context or conversation examples are provided,
    it will return only the main prompt.

    Parameters:
    - main_prompt (str): The core question or statement for the language model to respond to.
    - system_context (str, optional): Additional context or information about the scenario or environment. Defaults to an empty string.
    - conversation_examples (list of tuples, optional): Prior exchanges provided as context, where each tuple contains a user message 
      and a corresponding agent response. Defaults to an empty list.

    Returns:
    - str: A string formatted as a complete prompt ready for language model input. If no system context or examples are provided, returns the main prompt.

    Example usage:
    ```
    main_prompt = "I'm looking to improve my dialogue writing skills for my next short story. Any suggestions?"
    system_context = "User is an aspiring author seeking to enhance dialogue writing techniques."
    conversation_examples = [
        ("How can dialogue contribute to character development?", "Dialogue should reveal character traits and show personal growth over the story arc."),
        ("What are some common pitfalls in writing dialogue?", "Avoid exposition dumps in dialogue and make sure each character's voice is distinct.")
    ]

    full_prompt = construct_prompt_with_context(main_prompt, system_context, conversation_examples)
    print(full_prompt)
    ```
    """
    
    # Return the main prompt if no system context or conversation examples are provided
    if not system_context and not conversation_examples:
        return main_prompt

    # Start with the initial part of the prompt including the system context, if provided
    full_prompt = f"<s>[INST] <<SYS>>{system_context}<</SYS>>\n" if system_context else "<s>[INST]\n"

    # Add each example from the conversation_examples to the prompt
    for user_msg, agent_response in conversation_examples:
        full_prompt += f"{user_msg} [/INST] {agent_response} </s><s>[INST]"

    # Add the main user prompt at the end
    full_prompt += f"{main_prompt} [/INST]"

    return full_prompt
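
To see the layout this function produces, the following minimal sketch rebuilds the same string with a condensed, hypothetical helper named `build_llama2_prompt` (it is not part of the notebook) and one made-up example exchange. It mirrors the template used above and is meant only as an illustration.

```python
def build_llama2_prompt(main_prompt, system_context="", examples=()):
    """Condensed, hypothetical copy of the template used above."""
    # With no system context and no examples, the prompt is passed through unchanged
    if not system_context and not examples:
        return main_prompt
    # Open the first instruction block, with the system context if there is one
    prompt = f"<s>[INST] <<SYS>>{system_context}<</SYS>>\n" if system_context else "<s>[INST]\n"
    # Each earlier exchange becomes a completed turn followed by a new open block
    for user_msg, agent_response in examples:
        prompt += f"{user_msg} [/INST] {agent_response} </s><s>[INST]"
    # The new user message closes the final block, leaving the reply to the model
    return prompt + f"{main_prompt} [/INST]"

prompt = build_llama2_prompt(
    "How much is it?",
    system_context="Be brief.",
    examples=[("Hi", "Hello!")],
)
print(prompt)
```

Each earlier exchange appears as a completed `[INST] ... [/INST]` turn, and the prompt ends with `[/INST]` so that the model writes the next reply.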

### LlamaChatbot Class

In [None]:
class LlamaChatbot:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
    - system_context (str): Contextual information to provide to the language model for all conversations.
    - conversation_history (list of tuples): Stores the history of the conversation, where each
      tuple contains a user message and the corresponding agent response.
    """

    def __init__(self, system_context):
        """
        Initializes a new instance of the LlamaChatbot class.

        Parameters:
        - system_context (str): A string that sets the initial context for the language model.
        """
        self.system_context = system_context
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history.

        Parameters:
        - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
        - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        return agent_response

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []

### LlamaChatbotWithHistoryLimit Class

In [None]:
class LlamaChatbotWithHistoryLimit:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
        - system_context (str): Contextual information to provide to the language model for all conversations.
        - conversation_history (list of tuples): Stores the history of the conversation, where each
          tuple contains a user message and the corresponding agent response.
        - tokenizer: The tokenizer used to tokenize the conversation for maintaining the history limit.
        - max_tokens (int): The maximum number of tokens allowed in the conversation history.
    """

    def __init__(self, system_context, tokenizer, max_tokens=2048):
        """
        Initializes a new instance of the LlamaChatbotWithHistoryLimit class with a tokenizer and token limit.

        Parameters:
            - system_context (str): A string that sets the initial context for the language model.
            - tokenizer: The tokenizer used to process the input and output for the language model.
            - max_tokens (int): The maximum number of tokens to retain in the conversation history.
        """
        self.system_context = system_context
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history, ensuring that the history does not exceed the specified token limit.

        Parameters:
            - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
            - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        # Check and maintain the conversation history within the token limit
        self._trim_conversation_history()

        return agent_response

    def _trim_conversation_history(self):
        """
        Trims the conversation history to maintain the number of tokens below the specified limit.
        """
        # Concatenate the conversation history into a single string
        history_string = ''.join(user + agent for user, agent in self.conversation_history)
        
        # Calculate the number of tokens in the conversation history
        history_tokens = len(self.tokenizer.encode(history_string))

        # While the history exceeds the maximum token limit, remove the oldest items
        while history_tokens > self.max_tokens:
            # Always check if there's at least one item to pop
            if self.conversation_history:
                # Remove the oldest conversation tuple
                self.conversation_history.pop(0)
                # Recalculate the history string and its tokens
                history_string = ''.join(user + agent for user, agent in self.conversation_history)
                history_tokens = len(self.tokenizer.encode(history_string))
            else:
                # If the conversation history is empty, break out of the loop
                break

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []

### Count the Tokens in a Given String

In [None]:
def print_token_count(text, tokenizer):
    """
    Calculate and return the number of tokens in a given text using a specified tokenizer.

    This function takes a string of text and a tokenizer. It uses the tokenizer to encode the text
    into tokens and then returns the count of these tokens.

    Parameters:
    - text (str): The input string to be tokenized.
    - tokenizer: A tokenizer instance capable of encoding text into tokens.

    Returns:
    - int: The number of tokens in the input text as determined by the tokenizer.
    """
    return len(tokenizer.encode(text))

### Concatenate Conversation History

In [None]:
def concat_history(tuples_list):
    """
    Concatenates texts from a list of 2-tuples.

    Each tuple in the list is expected to contain two strings. The function
    concatenates the two elements of each tuple, tuple by tuple, in their
    order of appearance in the list (no separator is inserted).

    Parameters:
    - tuples_list (list of 2-tuples): A list where each element is a tuple of two strings.

    Returns:
    - str: A single string that is the result of concatenating all the texts from the tuples.

    Example usage:
    ```
    conversation_tuples = [
        ('Question 1', 'Answer 1'),
        ('Question 2', 'Answer 2'),
        ('Question 3', 'Answer 3')
    ]

    concatenated_text = concat_history(conversation_tuples)
    print(concatenated_text)
    ```
    """
    # Concatenate all the first and second elements of the tuples
    return ''.join(question + response for question, response in tuples_list)

## Data

### Star Bikes Details

In [None]:
bikes = [
    {
        "model": "Galaxy Rider",
        "type": "Mountain",
        "features": {
            "frame": "Aluminum alloy",
            "gears": "21-speed Shimano",
            "brakes": "Hydraulic disc",
            "tires": "27.5-inch all-terrain",
            "suspension": "Full, adjustable",
            "color": "Matte black with green accents"
        },
        "usps": ["Lightweight frame", "Quick gear shift", "Durable tires"],
        "price": 799.95,
        "internal_id": "GR2321",
        "weight": "15.3 kg",
        "manufacturer_location": "Taiwan"
    },
    {
        "model": "Nebula Navigator",
        "type": "Hybrid",
        "features": {
            "frame": "Carbon fiber",
            "gears": "18-speed Nexus",
            "brakes": "Mechanical disc",
            "tires": "26-inch city slick",
            "suspension": "Front only",
            "color": "Glossy white"
        },
        "usps": ["Sleek design", "Efficient on both roads and trails", "Ultra-lightweight"],
        "price": 649.99,
        "internal_id": "NN4120",
        "weight": "13.5 kg",
        "manufacturer_location": "Germany"
    },
    {
        "model": "Cosmic Comet",
        "type": "Road",
        "features": {
            "frame": "Titanium",
            "gears": "24-speed Campagnolo",
            "brakes": "Rim brakes",
            "tires": "700C road",
            "suspension": "None",
            "color": "Metallic blue"
        },
        "usps": ["Super aerodynamic", "High-speed performance", "Professional-grade components"],
        "price": 1199.50,
        "internal_id": "CC5678",
        "weight": "11 kg",
        "manufacturer_location": "Italy"
    }
]

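## Bike AI Assistant

In this section, we will create an AI customer support assistant that helps potential customers buy a Star Bikes bicycle.

We begin by setting an appropriate **system context** and instantiating a chatbot.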

In [None]:
system_context = """
You are a friendly chatbot knowledgeable about bicycles. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less.
"""

chatbot = LlamaChatbot(system_context)

Let's ask the model to tell us about the latest bikes.

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

Not bad, but of course we want the assistant to tell us about Star Bikes models!

In [None]:
chatbot.reset()

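## Star Bikes AI Assistant

Let's create a new chatbot that includes the `bikes` data above for its reference. In the **system context** below, we also give the model a **cue** to ask what else it can help with at the end of every turn. Besides being good behavior for an AI assistant, in practice this also keeps the model from carrying on indefinitely, or from generating several extra turns when only one is needed.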

In [None]:
system_context = f"""
You are a friendly chatbot knowledgeable about these bicycles from Star Bikes {bikes}. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less. \
You always end by asking what else you can help with.
"""

chatbot = LlamaChatbot(system_context)

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

That looks good. Let's see how it responds when asked for specific details about the bikes.

In [None]:
print(chatbot.chat("How much do each of the models cost?"))

---

Very good. Now let's see how it responds to a vaguer query.

In [None]:
print(chatbot.chat("I am more interested in biking around town."))

---

All in all, our assistant already appears to be performing well.

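## About Token Counts

When we pass text to a language model such as LLaMA-2, the text is first converted into **tokens**: the units of text, roughly words, word pieces, or punctuation marks, that the model processes and generates.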
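
To make the idea of a token concrete without loading the model, here is a minimal sketch using a hypothetical `toy_tokenize` function that emits one token per word or punctuation mark. It is only a stand-in: the real LLaMA-2 tokenizer splits text into subword pieces, so its token counts will generally differ.

```python
import re

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one token per word or punctuation mark."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Which bike is best for commuting?")
print(tokens)       # the individual tokens
print(len(tokens))  # the token count for this string
```

With the real tokenizer loaded in this notebook, the equivalent count is `len(tokenizer.encode(text))`, which is what the `print_token_count` helper returns.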

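Language models like LLaMA-2 have a built-in **token limit**: a fixed upper bound on the number of tokens the model can process in a single prompt-response cycle. The limit is determined by the model's design and by the computational resources required to process tokens. For LLaMA-2 models, the **token limit** is `4096`. A model's token limit can be found in its documentation, and the effective limit can be lowered, though not raised beyond the model's own limit. When using a `transformers` pipeline, this is done with the `max_length` parameter.

The model's own token limit or the `max_length` parameter, whichever is smaller, caps *the total number of input and output tokens combined*.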
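
The arithmetic behind this rule can be sketched as follows. The helper `output_token_budget` and the prompt size of 900 tokens are hypothetical, chosen only to illustrate how the prompt and the response share one budget.

```python
MODEL_TOKEN_LIMIT = 4096  # LLaMA-2 token limit, as stated above

def output_token_budget(prompt_tokens, max_length, model_limit=MODEL_TOKEN_LIMIT):
    """Tokens left for the response once the prompt has been counted."""
    # The smaller of the two limits is the one that actually applies
    effective_limit = min(max_length, model_limit)
    # A prompt that already fills the limit leaves no room for a response
    return max(effective_limit - prompt_tokens, 0)

print(output_token_budget(prompt_tokens=900, max_length=4096))   # 3196
print(output_token_budget(prompt_tokens=900, max_length=1024))   # 124
print(output_token_budget(prompt_tokens=5000, max_length=4096))  # 0
```

The longer the prompt, the fewer tokens remain for the model's response.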

Since we have not cleared the history of the conversation above, we can look directly at the `conversation_history` of our `chatbot` instance.

In [None]:
print(chatbot.conversation_history)

---

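To work out how many tokens all of these strings contain in total, we use the `concat_history` helper function defined above to join every string in the conversation history.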

In [None]:
conv_history = concat_history(chatbot.conversation_history)

In [None]:
print(conv_history)

---

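Now we use another helper function defined above, `print_token_count`, to **tokenize** the conversation history string with the LLaMA-2 **tokenizer** we loaded earlier and count the resulting tokens.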

In [None]:
print_token_count(conv_history, tokenizer)

---

Let's see how the token count of the conversation history changes as we exchange more messages with the chatbot.

In [None]:
print(chatbot.chat("What kind of bike would be best if I'm on a budget?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("What's the next most expensive bike after the Galaxy Rider?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Why is titanium so good for a frame?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Do you remember where I said I was most interested in riding?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Can you please summarize our conversation for me?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

Finally, let's reset the chatbot and print the token count once more.

In [None]:
chatbot.reset()

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

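Because our chatbot implementation stores the conversation by including its history in the prompt, every exchange with the model brings the input closer to the **token limit**.

As mentioned above, the **token limit** of the model we are using is `4096`, and if you look at the `generate` function above you will see that `4096` is also the value we pass for `max_length`. We are therefore not yet close to the **token limit**, but we should still consider how to make sure this hard limit does not affect our chatbot.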
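
As a rough back-of-the-envelope illustration of how quickly a growing history reaches the limit, consider the sketch below. The helper `turns_until_limit` is hypothetical, as are the figures of 600 tokens for the system context and 100 tokens per exchange, and it makes the simplifying assumption that every exchange adds the same number of tokens.

```python
def turns_until_limit(system_tokens, tokens_per_turn, limit=4096):
    """Number of exchanges after which system context plus history exceeds `limit`.

    Simplifying assumption: every exchange adds exactly `tokens_per_turn` tokens.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    total_tokens = system_tokens
    turns = 0
    # Keep adding exchanges for as long as the running total still fits
    while total_tokens <= limit:
        turns += 1
        total_tokens += tokens_per_turn
    return turns

print(turns_until_limit(system_tokens=600, tokens_per_turn=100))  # 35
```

Under these assumed figures, the total passes `4096` tokens after 35 exchanges. A longer system context or longer responses would bring that point closer.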

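## Limiting Conversation History

Below is a modified chat class, `LlamaChatbotWithHistoryLimit`. It accepts a `max_tokens` argument, along with a `tokenizer` used to track the number of tokens in the conversation history.

After every turn, `_trim_conversation_history` is called. If the conversation history exceeds `max_tokens`, it pops off the oldest exchanges until the total length is no greater than `max_tokens`.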

In [None]:
class LlamaChatbotWithHistoryLimit:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
        - system_context (str): Contextual information to provide to the language model for all conversations.
        - conversation_history (list of tuples): Stores the history of the conversation, where each
          tuple contains a user message and the corresponding agent response.
        - tokenizer: The tokenizer used to tokenize the conversation for maintaining the history limit.
        - max_tokens (int): The maximum number of tokens allowed in the conversation history.
    """

    def __init__(self, system_context, tokenizer, max_tokens=2048):
        """
        Initializes a new instance of the LlamaChatbotWithHistoryLimit class with a tokenizer and token limit.

        Parameters:
            - system_context (str): A string that sets the initial context for the language model.
            - tokenizer: The tokenizer used to process the input and output for the language model.
            - max_tokens (int): The maximum number of tokens to retain in the conversation history.
        """
        self.system_context = system_context
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history, ensuring that the history does not exceed the specified token limit.

        Parameters:
            - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
            - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        # Check and maintain the conversation history within the token limit
        self._trim_conversation_history()

        return agent_response

    def _trim_conversation_history(self):
        """
        Trims the conversation history to maintain the number of tokens below the specified limit.
        """
        # Concatenate the conversation history into a single string
        history_string = ''.join(user + agent for user, agent in self.conversation_history)
        
        # Calculate the number of tokens in the conversation history
        history_tokens = len(self.tokenizer.encode(history_string))

        # While the history exceeds the maximum token limit, remove the oldest items
        while history_tokens > self.max_tokens:
            # Always check if there's at least one item to pop
            if self.conversation_history:
                # Remove the oldest conversation tuple
                self.conversation_history.pop(0)
                # Recalculate the history string and its tokens
                history_string = ''.join(user + agent for user, agent in self.conversation_history)
                history_tokens = len(self.tokenizer.encode(history_string))
            else:
                # If the conversation history is empty, break out of the loop
                break

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []

Let's create a new chatbot instance, this time with `max_tokens` limited to `200` tokens.

In [None]:
system_context = f"""
You are a friendly chatbot knowledgeable about these bicycles from Star Bikes {bikes}. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less. \
You always end by asking what else you can help with.
"""

chatbot = LlamaChatbotWithHistoryLimit(system_context, tokenizer=tokenizer, max_tokens=200)

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

Let's run a few more turns and track the number of tokens in the conversation history. Remember that we have set `max_tokens` to `200`.

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("How much do each of the models cost?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

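You can see that the token count has dropped (to `96` in our run; your exact count may differ), which keeps us from exceeding the limit of `200`. Let's observe a few more turns.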

In [None]:
print(chatbot.chat("I am more interested in biking around town."))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("What kind of bike would be best if I'm on a budget?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

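Our chatbot has successfully popped off the earlier turns to avoid exceeding the token limit.

Of course, this protection comes at the cost of no longer retaining the full conversation history. Here we can see that, unlike before, when we ask for a summary of the whole conversation so far, we only get a summary of the most recent exchanges.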
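
The trade-off can be traced by hand with a small, self-contained sketch of the same trimming idea. The names `trim_history` and `count_words` are hypothetical, the three exchanges are made up for illustration, and `count_words` is only a whitespace-based stand-in for `len(tokenizer.encode(text))`.

```python
def count_words(text):
    # Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word
    return len(text.split())

def trim_history(history, count_tokens, max_tokens):
    """Drop the oldest (user, agent) pairs until the history fits in max_tokens."""
    trimmed = list(history)  # copy so the caller's list is left untouched
    while trimmed and count_tokens("".join(u + a for u, a in trimmed)) > max_tokens:
        trimmed.pop(0)  # the oldest exchange is forgotten first
    return trimmed

# Trailing spaces keep words apart when the strings are joined without a separator
history = [
    ("I mostly ride around town. ", "Great, a hybrid suits city riding. "),
    ("What about price? ", "The hybrid is the cheapest model. "),
    ("Is it light? ", "Yes, it weighs 13.5 kg. "),
]

kept = trim_history(history, count_words, max_tokens=20)
for user_msg, agent_response in kept:
    print(user_msg.strip(), "->", agent_response.strip())
```

The full history is 28 words, so the first exchange is dropped to bring it down to 17. The remark about riding around town is no longer part of the prompt, so a later question about it could not be answered from the history.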

In [None]:
print(chatbot.chat("Can you summarize our conversation?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
chatbot.reset()

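## Final Exercise: Create an AI Assistant for Your Own Fictitious Company

Using everything you have learned so far, create an AI assistant for a fictitious company of your own. Your work should include the following main steps.
1. Come up with a company, including its name and what it sells.
2. Use our LLaMA-2 model to generate synthetic data for the products your company sells. Refer to the "Star Bikes Details" section above, or the `bikes` list of dictionaries, as an example. If you have trouble generating synthetic JSON data, see the notebook *3-Review Analyst.ipynb*.
3. Create the AI assistant and provide it with the synthetic data you generated in the previous step. Feel free to use the `LlamaChatbotWithHistoryLimit` class provided in this notebook.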

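## Key Concept Review

The following key concepts were introduced in this notebook:
* **Token**: A chunk of text, such as a word or punctuation mark, that a language model can process.
* **Token limit**: The maximum number of tokens a language model can process in a single prompt-response cycle, counting input and output together.
* **Tokenizer**: A tool that converts text into tokens for a language model to consume.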

## Restart the Kernel

To free up GPU memory for the next notebook, run the following cell.

In [None]:
from IPython import get_ipython

get_ipython().kernel.do_shutdown(restart=True)