# Testing the `extract_relevance_info_from_docs_with_conversation` Function

This notebook demonstrates how to define and test the `extract_relevance_info_from_docs_with_conversation` prompt function, built with the `@byzerllm.prompt()` decorator from the ByzerLLM library.

In [1]:
import byzerllm
from byzerllm import ByzerLLM

# Initialize ByzerLLM
llm = ByzerLLM.from_default_model("deepseek_chat")

2024-08-28 23:44:14.074 | INFO     | byzerllm.utils.connect_ray:connect_cluster:48 - JDK 21 will be used (/Users/allwefantasy/.auto-coder/jdk-21.0.2.jdk/Contents/Home)...
2024-08-28 23:44:14,113	INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2024-08-28 23:44:14,128	INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265


## Prepare Test Data

Let's create some sample documents and a conversation history to test our function.

In [4]:
# Sample documents
documents = [
    "The capital of France is Paris. It is known for its beautiful architecture and cuisine.",
    "Python is a popular programming language used in data science and web development.",
    "The Great Wall of China is one of the most famous landmarks in the world."
]

# Sample conversation history
conversations = [
    {"role": "user", "content": "Tell me about France."},
    {"role": "assistant", "content": "France is a country in Western Europe known for its rich history and culture."},
    {"role": "user", "content": "What's the capital of France?"}
]
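
Because the prompt relies on "the last question" in the history, it can help to check the shape of the test data before sending it to the model. The following is a minimal sketch; `validate_conversations` and `ALLOWED_ROLES` are hypothetical names introduced here for illustration, not part of byzerllm.

```python
from typing import Dict, List

# Assumption: the usual chat roles; adjust to whatever your model accepts.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_conversations(conversations: List[Dict[str, str]]) -> None:
    """Raise ValueError if the history is malformed or does not end with a user turn."""
    if not conversations:
        raise ValueError("conversation history is empty")
    for i, msg in enumerate(conversations):
        missing = {"role", "content"} - set(msg)
        if missing:
            raise ValueError(f"message {i} is missing keys: {sorted(missing)}")
        if msg["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has unknown role {msg['role']!r}")
    if conversations[-1]["role"] != "user":
        raise ValueError("the last message should be the user's question")


sample = [
    {"role": "user", "content": "Tell me about France."},
    {"role": "assistant", "content": "France is a country in Western Europe."},
    {"role": "user", "content": "What's the capital of France?"},
]
validate_conversations(sample)  # passes silently for a well-formed history
```

Running this check first makes a malformed history fail fast, rather than producing a confusing model response.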

## Define and Test the Function

In [None]:
@byzerllm.prompt()
def extract_relevance_info_from_docs_with_conversation(conversations, documents) -> str:
    """
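    Use the following documents and conversation history to extract relevant information.

    Documents:
    {% for doc in documents %}
    {{ doc }}
    {% endfor %}

    Conversation history:
    {% for msg in conversations %}
    <{{ msg.role }}>: {{ msg.content }}
    {% endfor %}

    Based on the provided documents, the conversation history, and the user's last question, extract and summarize the important information in the documents that is relevant to the question.
    If the documents contain no relevant information, reply "This document contains no information relevant to the question."
    Keep the extracted information as close to the original wording as possible, and output only that information.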
    """

# Test the function
result = extract_relevance_info_from_docs_with_conversation.with_llm(llm).run(conversations=conversations, documents=documents)
print(result)
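
The docstring uses Jinja-style `{% for %}` loops, so each document and each message appears to be expanded onto its own line before the prompt reaches the model. The sketch below approximates that expansion with plain Python, to make it easier to picture what the model sees. `render_prompt_preview` is a hypothetical helper, not a byzerllm API, and its section labels are illustrative rather than the exact text of the prompt.

```python
from typing import Dict, List


def render_prompt_preview(conversations: List[Dict[str, str]], documents: List[str]) -> str:
    """Approximate the documents and conversation sections of the prompt."""
    lines = ["Documents:"]
    lines.extend(documents)  # one document per line
    lines.append("")
    lines.append("Conversation history:")
    # Mirror the "<role>: content" layout used in the template.
    lines.extend(f"<{msg['role']}>: {msg['content']}" for msg in conversations)
    return "\n".join(lines)


preview = render_prompt_preview(
    [{"role": "user", "content": "What's the capital of France?"}],
    ["The capital of France is Paris."],
)
print(preview)
```

This is only a rough preview; whitespace produced by the real template engine may differ.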

## Analyze the Results

Let's analyze the output of our function to see if it correctly extracted the relevant information from the documents based on the conversation history.

In [None]:
print("Extracted relevant information:")
print(result)

print("\nDoes the extracted information answer the user's last question?")
print("User's last question:", conversations[-1]['content'])
print("Relevant information found:", "Yes" if "Paris" in result else "No")
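
A plain substring test such as `"Paris" in result` is case-sensitive and cannot tell a wrong answer apart from an explicit "nothing relevant" reply. The sketch below shows one possible way to separate those cases. `summarize_result` and `NO_INFO_MARKERS` are hypothetical names, and the marker strings are assumptions that should be adjusted to whatever sentinel your prompt actually requests.

```python
from typing import Iterable

# Assumption: substrings that signal "nothing relevant was found".
NO_INFO_MARKERS = (
    "no information relevant to the question",
    "没有与问题相关的信息",
)


def summarize_result(result: str, expected_keywords: Iterable[str]) -> str:
    """Classify a model response as 'no_info', 'hit', or 'miss'."""
    text = result.casefold()
    if any(marker.casefold() in text for marker in NO_INFO_MARKERS):
        return "no_info"
    if all(keyword.casefold() in text for keyword in expected_keywords):
        return "hit"
    return "miss"


print(summarize_result("The capital of France is Paris.", ["paris"]))
```

With the notebook's data, a call such as `summarize_result(result, ["Paris"])` could replace the inline `"Yes" if ... else "No"` check.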

## Test with Different Scenarios

Let's test our function with different scenarios to ensure it works correctly in various situations.

In [None]:
# Scenario 1: Question about a topic not in the documents
conversations_scenario1 = [
    {"role": "user", "content": "What's the capital of Germany?"},
]

result_scenario1 = extract_relevance_info_from_docs_with_conversation.with_llm(llm).run(conversations=conversations_scenario1, documents=documents)
print("Scenario 1 Result:")
print(result_scenario1)

# Scenario 2: Question about a topic in the documents, but not related to France
conversations_scenario2 = [
    {"role": "user", "content": "Tell me about Python programming."},
]

result_scenario2 = extract_relevance_info_from_docs_with_conversation.with_llm(llm).run(conversations=conversations_scenario2, documents=documents)
print("\nScenario 2 Result:")
print(result_scenario2)
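
As the number of scenarios grows, a table-driven loop keeps each case and its expectation in one place. The sketch below takes the extractor as a callable so the loop can be exercised without a model. `run_scenarios`, `fake_extract`, and the `NO_RELEVANT_INFO` sentinel are hypothetical, and `fake_extract` is only a crude keyword matcher standing in for the LLM call.

```python
from typing import Callable, Dict, List, Set, Tuple

Message = Dict[str, str]
Extractor = Callable[[List[Message], List[str]], str]
Scenario = Tuple[str, List[Message], str]  # (name, conversations, expected substring)


def run_scenarios(
    extract: Extractor, documents: List[str], scenarios: List[Scenario]
) -> List[Tuple[str, bool]]:
    """Run each scenario and report whether the expected substring appears in the output."""
    report = []
    for name, conversations, expected in scenarios:
        output = extract(conversations, documents)
        report.append((name, expected.casefold() in output.casefold()))
    return report


def _proper_nouns(text: str) -> Set[str]:
    """Capitalized words with surrounding punctuation removed (a toy heuristic)."""
    return {w.strip("?.,!") for w in text.split() if w[:1].isupper()}


def fake_extract(conversations: List[Message], documents: List[str]) -> str:
    """Hypothetical stand-in for the LLM: match capitalized words in the last question."""
    wanted = _proper_nouns(conversations[-1]["content"])
    hits = [doc for doc in documents if wanted & _proper_nouns(doc)]
    return "\n".join(hits) if hits else "NO_RELEVANT_INFO"


documents = [
    "The capital of France is Paris. It is known for its beautiful architecture and cuisine.",
    "Python is a popular programming language used in data science and web development.",
    "The Great Wall of China is one of the most famous landmarks in the world.",
]
scenarios = [
    ("france", [{"role": "user", "content": "What's the capital of France?"}], "Paris"),
    ("germany", [{"role": "user", "content": "What's the capital of Germany?"}], "NO_RELEVANT_INFO"),
    ("python", [{"role": "user", "content": "Tell me about Python programming."}], "data science"),
]
report = run_scenarios(fake_extract, documents, scenarios)
print(report)
```

To run the same table against the real model, pass a small wrapper around `extract_relevance_info_from_docs_with_conversation.with_llm(llm).run(conversations=..., documents=...)` as `extract`, and replace the `NO_RELEVANT_INFO` expectation with the sentinel your prompt requests.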

## Conclusion

In this notebook, we tested the `extract_relevance_info_from_docs_with_conversation` function across several scenarios. The function is expected to extract the information in the provided documents that is relevant to the conversation history and the user's last question.

By analyzing the results, we can see how well the function performs in different situations, such as when the relevant information is present in the documents or when it's not available.

In [None]:
from typing import List, Dict
from autocoder.common import AutoCoderArgs, SourceCode
from byzerllm.utils.client.code_utils import extract_code
import json
from loguru import logger

@byzerllm.prompt()
def extract_relevance_range_from_docs_with_conversation(
    conversations: List[Dict[str, str]], documents: List[str]
) -> str:
    """
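    Use the following documents and conversation history to extract relevant information.

    Documents:
    {% for doc in documents %}
    {{ doc }}
    {% endfor %}

    Conversation history:
    {% for msg in conversations %}
    <{{ msg.role }}>: {{ msg.content }}
    {% endfor %}

    Based on the provided documents, the conversation history, and the user's last question, extract one or more pieces of important information from the documents that are relevant to the question.
    Each piece of important information is delimited by a start_str and an end_str.
    Return a JSON array in which each element contains "start_str" and "end_str", the strings at which the piece of important information starts and ends.
    Make sure start_str and end_str are each unique in the original text (they must not occur more than once) and that they do not overlap.
    start_str and end_str should be as short as possible while remaining unique in the original text.

    If the documents contain no relevant important information, return an empty array [].

    Example 1:
    Document: This is a sample document. Elephants are among the largest land animals. They live in Africa and Asia. Cats are common pets, and they like to catch mice.
    Question: Where do elephants live?
    Return: [{"start_str": "Elephants are", "end_str": "Africa and Asia."}]

    Example 2:
    Document: The solar system has eight planets. Earth is the third planet in the solar system; it has oceans and deserts, a moderate temperature, and a small day-night temperature difference, and it is the only planet currently known to harbor life. The Moon is Earth's only natural satellite.
    Question: What are the characteristics of Earth?
    Return: [{"start_str": "Earth is the third", "end_str": "to harbor life."}]

    Example 3:
    Document: Apples are a common fruit. They are rich in vitamins and dietary fiber. Bananas are also a popular fruit and contain a lot of potassium.
    Question: What are the characteristics of oranges?
    Return: []

    Return strict JSON. Do not include any extra text or explanation.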
    """  

# result = extract_relevance_range_from_docs_with_conversation.with_llm(llm).run(conversations=conversations, documents=documents)


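# Load a recorded test case from /tmp/rag.json (one JSON object per line).
# Each non-empty line has the form:
#   {"conversation": conversations, "doc": [doc.source_code]}
# If the file holds several lines, the last one is used.
conversations = None
documents = None
with open("/tmp/rag.json", "r") as f:
    for line in f:
        line = line.strip()
        if line:
            v = json.loads(line)
            conversations = v["conversation"]
            documents = v["doc"]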

def process_range_doc(doc, max_retries=3):
    """Extract the parts of `doc` relevant to `conversations`, retrying on failure."""
    for attempt in range(max_retries):
        content = ""
        try:
            extracted_info = extract_relevance_range_from_docs_with_conversation.with_llm(
                llm
            ).run(conversations, [doc.source_code])

            print(extracted_info)
            # Take the body of the first code block found in the model's response.
            json_str = extract_code(extracted_info)[0][1]
            json_objs = json.loads(json_str)

            for json_obj in json_objs:
                start_str = json_obj["start_str"]
                end_str = json_obj["end_str"]
                start_index = doc.source_code.index(start_str)
                # Look for end_str only after start_str so the slice cannot be inverted.
                end_index = doc.source_code.index(end_str, start_index) + len(end_str)
                content += doc.source_code[start_index:end_index] + "\n"
                print(f"{start_str} - {end_str} : {doc.source_code[start_index:end_index]}")

            return SourceCode(module_name=doc.module_name, source_code=content.strip())
        except Exception as e:
            if attempt < max_retries - 1:
                logger.warning(
                    f"Error processing doc {doc.module_name}, retrying "
                    f"(attempt {attempt + 1} of {max_retries}): {e}"
                )
            else:
                logger.error(
                    f"Failed to process doc {doc.module_name} after {max_retries} attempts: {e}"
                )
                # Return whatever was collected before the final failure.
                return SourceCode(module_name=doc.module_name, source_code=content.strip())

print(documents[0])
m = process_range_doc(SourceCode(module_name="test", source_code=documents[0]))


print(m.source_code)
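
The range-extraction step can also be exercised without a model or the `SourceCode` class. The sketch below parses a reply that may or may not be wrapped in a code fence, then slices the text between each `start_str` and `end_str`. `parse_ranges` and `slice_ranges` are hypothetical helpers written for illustration, and the fence handling is an assumption about how the model formats its reply; neither is byzerllm's `extract_code`.

```python
import json
import re
from typing import Dict, List

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def parse_ranges(raw: str) -> List[Dict[str, str]]:
    """Parse the reply as a JSON array, tolerating an optional surrounding code fence."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, re.DOTALL)
    payload = match.group(1) if match else raw
    ranges = json.loads(payload.strip())
    if not isinstance(ranges, list):
        raise ValueError("expected a JSON array of ranges")
    return ranges


def slice_ranges(text: str, ranges: List[Dict[str, str]]) -> List[str]:
    """Return the text from each start_str through its end_str, inclusive."""
    pieces = []
    for item in ranges:
        start_str, end_str = item["start_str"], item["end_str"]
        if text.count(start_str) != 1:
            raise ValueError(f"start_str must occur exactly once: {start_str!r}")
        start = text.index(start_str)
        # Search only after the start; raises ValueError if end_str is absent there.
        end = text.index(end_str, start) + len(end_str)
        pieces.append(text[start:end])
    return pieces


doc = (
    "This is a sample document. Elephants are among the largest land animals. "
    "They live in Africa and Asia. Cats are common pets."
)
reply = '[{"start_str": "Elephants are", "end_str": "Africa and Asia."}]'
print(slice_ranges(doc, parse_ranges(reply)))
```

Checking uniqueness explicitly turns an ambiguous `start_str` into an error that a retry loop can act on, instead of a silently wrong slice.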
