<a href="https://colab.research.google.com/github/edisonjoao1/ai4u/blob/main/open_deep_research_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline for an Open-Source Deep Research
This baseline combines a reasoning model with some tool usage, but the agent does not yet control any conditional execution.

In [None]:
import pymilvus
import transformers
from transformers import TextStreamer
from unsloth import FastLanguageModel
import regex as re
import wikipediaapi
import json
from tqdm import tqdm
from langchain_huggingface import HuggingFacePipeline as Pipeline
from langchain_huggingface import ChatHuggingFace as Chat
from langchain_core.messages import HumanMessage
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_milvus import Milvus, Zilliz
from typing import Any, Set
from json_repair import repair_json

## Load reasoning and embedding models

In [None]:
model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit"
max_seq_length = 4048
dtype = None
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)

embeddings = SentenceTransformerEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.988 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0.dev20240829+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28+d444815.d20240829. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




## Helper methods
Some useful methods for chatting with the reasoning model, extracting JSON from the output, and converting the JSON into a list of strings.

In [None]:
default_system_prompt = "You are a helpful assistant who answers questions truthfully to the best of your knowledge."

def ask_model(prompt, system_prompt=default_system_prompt):
    # Build a two-turn chat: system instructions followed by the user prompt
    chat = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]

    # Render the chat template to a plain string; tokenization happens in the next step
    formatted_prompt = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")

    # Stream generated tokens to stdout as they are produced
    streamer = TextStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    results = model.generate(**inputs, streamer=streamer, max_new_tokens=4048)
    # Decode the full sequence (prompt plus completion) of the single batch item
    return tokenizer.decode(results[0])

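# Match a fenced ```json ... ``` block; the scoped (?s:.) lets "." span newlines
json_re = re.compile(r"```json\n(?s:.)*\n```")

def extract_json(response):
    """Return the parsed contents of the fenced JSON block, or {} if none can be parsed."""
    match = json_re.search(response)
    if match is None:
        return {}
    # Drop the opening and closing fence lines
    json_results = '\n'.join(match.group().splitlines()[1:-1])
    try:
        return json.loads(repair_json(json_results))
    except json.JSONDecodeError:
        return {}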

def leaves(struct: Any) -> Set[Any]:
    """Return a set of leaf values found in nested dicts and lists excluding None values."""
    # Ref: https://stackoverflow.com/a/59832594/
    values = set()

    def add_leaves(struct_: Any) -> None:
        if isinstance(struct_, dict):
            for sub_struct in struct_.values():
                add_leaves(sub_struct)
        elif isinstance(struct_, list):
            for sub_struct in struct_:
                add_leaves(sub_struct)
        elif struct_ is not None:
            values.add(struct_)

    add_leaves(struct)
    return values
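
As a rough illustration of how the JSON extraction and the leaf flattening fit together, the sketch below runs a simplified, standard-library-only stand-in on a made-up model response. It does not use `json_repair`, so malformed JSON is not fixed here. The sample response and the `*_sketch` names are assumptions for this example, not part of the notebook.

```python
import json
import re
from typing import Any, Set

# Build the fence marker programmatically so this example can sit inside a fenced block
FENCE = "`" * 3

# Hypothetical model output: a reasoning trace followed by a fenced JSON block
sample_response = (
    "<think>\nList the aspects that may have changed.\n</think>\n"
    f"{FENCE}json\n"
    '{"sub_questions": ["How has the cast changed?", "How has the humor changed?", null]}\n'
    f"{FENCE}"
)

json_block_re = re.compile(FENCE + r"json\n(.*)\n" + FENCE, re.DOTALL)

def extract_json_sketch(response: str) -> Any:
    """Simplified stand-in for extract_json: parse the fenced JSON block, or return {}."""
    match = json_block_re.search(response)
    if match is None:
        return {}
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}

def leaves_sketch(struct: Any) -> Set[Any]:
    """Collect the non-None leaf values of nested dicts and lists."""
    if isinstance(struct, dict):
        return set().union(*(leaves_sketch(v) for v in struct.values()))
    if isinstance(struct, list):
        return set().union(*(leaves_sketch(v) for v in struct))
    return set() if struct is None else {struct}

parsed = extract_json_sketch(sample_response)
# Sets are unordered, so sort when a deterministic order matters
questions = sorted(leaves_sketch(parsed))
print(questions)
```

Because the leaves are collected into a set, duplicate questions collapse and the original order is not preserved. Sorting the result, as above, is one way to get repeatable output.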

## Define / refine question
Break the question down into sub-questions, sub-sub-questions, and so on. Convert the main question into a report title.
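
The notebook performs this decomposition two levels deep, one cell at a time. A minimal sketch of the same idea as a depth-limited recursion is shown here. `build_breakdown`, `fake_decompose`, and the canned questions are hypothetical stand-ins. In practice `decompose` would wrap a model call plus JSON extraction.

```python
from typing import Callable, Dict, List

def build_breakdown(
    question: str,
    decompose: Callable[[str], List[str]],
    max_depth: int = 2,
) -> Dict[str, dict]:
    """Recursively expand a question into a nested dict of sub-questions."""
    if max_depth == 0:
        return {}
    return {
        sub: build_breakdown(sub, decompose, max_depth - 1)
        for sub in decompose(question)
    }

# Hypothetical canned decompositions standing in for the reasoning model
canned = {
    "How has The Simpsons changed over time?": [
        "How has the cast changed over time?",
        "How has the animation style changed?",
    ],
    "How has the cast changed over time?": [
        "Which original voice actors have left the show?",
    ],
}

def fake_decompose(question: str) -> List[str]:
    # Questions the "model" cannot break down yield an empty list
    return canned.get(question, [])

tree = build_breakdown("How has The Simpsons changed over time?", fake_decompose)
print(tree)
```

An empty dict marks a question that is answered directly, which mirrors the empty lists in the notebook's `breakdown` dictionary.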

In [None]:
query = "How has The Simpsons changed over time?"
page_title = "The Simpsons"

In [None]:
prompt = f"""What is the topic of the following question? Respond in JSON format.

Question: {query}"""

response = ask_model(prompt)

<think>
Alright, so I need to figure out the topic of the question "How has The Simpsons changed over time?" Hmm, okay. Let me break this down. The question is asking about changes in The Simpsons, so it's probably about the evolution of the show. That could mean a lot of things—storylines, characters, art style, themes, cultural impact, or how the humor has developed over the years.

First, I'll think about the show itself. The Simpsons started in the late '80s and is still going strong. It's a popular animated series known for its humor and character development. Over time, shows often change to stay relevant, so maybe the topic is about how the show has evolved in terms of its content, themes, or production.

I remember that the show started with a simple setup in Springfield, but as it went on, it tackled more complex issues and satire. So maybe the topic is about the show's evolution in terms of addressing social issues or becoming more mature. Or perhaps it's about the changes in

In [None]:
prompt = f"""Break down the following question into intermediate sub-questions to approach answering it. Provide a list of intermediate sub-questions and respond with JSON format. If you cannot produce sub-questions then say so. Do not directly answer the following question and only return the sub-questions in JSON format. Your answer must contain JSON.

Question: {query}"""

response = ask_model(prompt)

<think>
Okay, so I need to break down the question "How has The Simpsons changed over time?" into sub-questions. Hmm, where do I start? Well, the question is asking about changes in The Simpsons, so I should think about different aspects of the show that could have evolved. Maybe the characters, the humor, the animation style, the themes, the popularity, and maybe even the cultural impact. 

Wait, but I'm not sure if I'm covering all the possible areas. Let me think more. The show has been on for a while, so there might be changes in the tone, the way they handle storylines, the number of seasons, the voice acting, maybe even the production quality. Oh, and the cast has changed too—some original voice actors have left, and new ones have come in, right?

So, to make sure I cover everything, I should list out each of these aspects as sub-questions. Let me see: changes in characters, changes in humor, changes in animation style, changes in themes or messages, changes in popularity, change

In [None]:
sub_questions = list(leaves(extract_json(response)))

In [None]:
breakdown = {}

# Hardcode topic for now from output above
topic = "The evolution of The Simpsons as a show over time, covering changes in content, humor, character development, animation, and its role in society."

# Break sub-questions into sub-sub-questions
for q in sub_questions:
    prompt = f"""You are researching the following topic. Break down the following question into intermediate sub-questions to approach answering it. Provide a list of intermediate sub-questions and respond with JSON format. If you cannot produce sub-questions then say so. Do not directly answer the following question and only return the sub-questions in JSON format. Your answer must contain JSON.

    Topic: {topic}

    Question: {q}"""

    response = ask_model(prompt)
    sub_sub_questions = list(leaves(extract_json(response)))
    breakdown[q] = sub_sub_questions


<think>
Alright, so I need to break down the question "How has the cast changed over time?" related to the evolution of The Simpsons. Let me think about what aspects are involved here.

First, I know that the cast has changed a lot, but I need to figure out the intermediate questions to approach this. The main question is about changes in the cast over time, so I should consider different areas that contribute to this change.

I guess the first sub-question would be about the original cast members. Who were the main voices and how did they evolve? Then, there might be new cast additions over the years, so another sub-question about that.

Also, some original voice actors have left, so I should include a sub-question about departures. Then, new voice actors joining would be another point.

The show has been popular for a long time, so recurring roles changing might be another aspect. Additionally, the role of the show in society might have influenced casting choices, so a sub-question a

In [None]:
breakdown

{'How has the cast changed over time?': ['How have recurring roles in The Simpsons changed over the years?',
  'Who are the new cast members added later to The Simpsons and what roles do they play?',
  'What is the role of The Simpsons in shaping the casting choices for animated shows?',
  'Who are the current voice actors for The Simpsons and what roles do they play now?',
  'Which original voice actors have left the show and what has their departure meant for the cast?',
  'What were the original cast members of The Simpsons and how have their roles evolved over time?',
  'How has the casting of The Simpsons influenced the careers of its voice actors?'],
 'How has the animation style of The Simpsons changed?': [],
 'How have the characters in The Simpsons evolved over time?': ['How has the humor in The Simpsons changed as the show progressed?',
  'How have the storylines become more complex and relevant over time?',
  'How have the character arcs of individual characters evolved over

## Search

### Build vector database on Wikipedia article

In [None]:
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MilvusDeepResearchBot (<insert your email>)', language='en')
page_py = wiki_wiki.page(page_title)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = text_splitter.create_documents([page_py.text])
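
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that slices text into fixed-width windows. It is not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as paragraphs, lines, and spaces. The function name `chunk_with_overlap` is an assumption for this example.

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    # Each window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.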

In [None]:
vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    drop_old=True,  # Drop the old Milvus collection if it exists
    index_params={
        "metric_type": "COSINE",
        "index_type": "FLAT",  # <= NOTE: Currently a bug where langchain_milvus defaults to "HNSW" index, which doesn't work with Milvus Lite
        "params": {},
    },
)
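
A `FLAT` index with the `COSINE` metric compares the query against every stored vector. The following sketch shows that brute-force search in plain Python on toy two-dimensional vectors. The vectors and the names `cosine_similarity` and `flat_search` are assumptions for illustration and do not use the Milvus API.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flat_search(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Score every document vector and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy embeddings standing in for real sentence-transformer vectors
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = flat_search([1.0, 0.1], doc_vecs, k=2)
print(top)
```

Exhaustive search is exact but scales linearly with the number of chunks, which is acceptable for a single Wikipedia article.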

### Break down the question into sub-questions, and so on

In [None]:
q = 'How has the cast changed over time?'

def question_to_header(question, topic=None):
    # Interpolate the `question` argument rather than the global `q`
    if topic is None:
        prompt = f"""Rewrite the following question as a header title. Be concise. Respond in JSON format with an escaped code block.

Question: {question}"""
    else:
        prompt = f"""Rewrite the following question with given context as a header title. Be concise. Respond in JSON format with an escaped code block.

Context: {topic}
Question: {question}"""

    response = ask_model(prompt)
    return list(leaves(extract_json(response)))[0]

In [None]:
header = question_to_header(q, topic)

<think>
Alright, I need to help the user by rewriting a question into a header title based on the provided context. The context is about the evolution of The Simpsons as a show, covering changes in content, humor, character development, animation, and its role in society. The original question is "How has the cast changed over time?"

First, I should focus on the key elements from the context. The show's evolution includes the cast, so the question is about changes in the cast. I need to make the title concise and relevant.

I'll start by identifying the main subject, which is the cast, and the aspect of change over time. So, "Evolution of the Cast" seems fitting. Next, I should include the timeframe, which is "Over Time," and tie it to the broader context of the show's evolution. 

Putting it together, "The Evolution of the Cast Over Time: Changes in The Simpsons' Ensemble" seems to capture both the cast's development and the show's overall changes. It's concise and includes the neces

In [None]:
FastLanguageModel.for_inference(model)

hf_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    # device="cuda",
    # repetition_penalty=1.15,
    return_full_text=False,
    max_new_tokens=4048,
    # output_scores=True,
    # use_cache=False,
    # truncation=True
)

llm = Pipeline(pipeline=hf_pipeline)
chat = Chat(llm=llm)

Device set to use cuda:0


## Analyze
Answer (sub-)sub-questions

In [None]:
# Build a RAG chain that answers each question from retrieved Wikipedia context
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template for generating AI responses
PROMPT_TEMPLATE = """
You are an AI assistant that provides answers to questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in $question$ tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Answer in a single short paragraph.
$context$
{context}
$/context$

$question$
{question}
$/question$
"""

# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()

# Define a function to format the retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# rag_chain.get_graph().print_ascii()
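
The data flow of the chain can be sketched without LangChain: retrieve chunks, join them into a context string, fill the template, and generate. The retriever and generator below are hypothetical stubs, and `run_rag` is an assumed name for this example. The last line shows one way to drop a reasoning trace that ends with `</think>`.

```python
from typing import Callable, List

# Shortened version of the prompt layout, using the same $...$ tag convention
TEMPLATE = (
    "$context$\n{context}\n$/context$\n\n"
    "$question$\n{question}\n$/question$"
)

def run_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve chunks, format them into the prompt, and call the generator."""
    docs = retrieve(question)
    context = "\n\n".join(docs)
    prompt = TEMPLATE.format(context=context, question=question)
    return generate(prompt)

def fake_retrieve(question: str) -> List[str]:
    # Stub retriever: always returns the same two chunks
    return ["Chunk A about the cast.", "Chunk B about salaries."]

def fake_generate(prompt: str) -> str:
    # Stub model: emits a reasoning trace, then a short answer
    n_chunks = prompt.count("Chunk")
    return f"<think>reasoning</think>\nAnswer based on: {n_chunks} chunks"

raw = run_rag("How has the cast changed?", fake_retrieve, fake_generate)
# Keep only the text after the reasoning trace
final = raw.split("</think>")[-1].strip()
print(final)
```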

In [None]:
# Prompt the RAG for each question
answers = {}
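# One RAG call per sub-sub-question, or one per sub-question that has no breakdown
total = sum(max(len(v), 1) for v in breakdown.values())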

pbar = tqdm(total=total)
for k, v in breakdown.items():
    if v == []:
        print(k)
        answers[k] = rag_chain.invoke(k).split('</think>')[-1].strip()
        pbar.update(1)
    else:
        for q in v:
            print(q)
            answers[q] = rag_chain.invoke(q).split('</think>')[-1].strip()
            pbar.update(1)

33it [15:41, 28.53s/it]

How have recurring roles in The Simpsons changed over the years?
Who are the new cast members added later to The Simpsons and what roles do they play?
What is the role of The Simpsons in shaping the casting choices for animated shows?
Who are the current voice actors for The Simpsons and what roles do they play now?
Which original voice actors have left the show and what has their departure meant for the cast?
What were the original cast members of The Simpsons and how have their roles evolved over time?
How has the casting of The Simpsons influenced the careers of its voice actors?
How has the animation style of The Simpsons changed?
How has the humor in The Simpsons changed as the show progressed?
How have the storylines become more complex and relevant over time?
How have the character arcs of individual characters evolved over time?
How has the animation style of The Simpsons evolved?
What were the initial characteristics of the characters in The Simpsons?
How did the addition of new characters affect the show's format?
How did the storytelling style of The Simpsons change from episodic to serialized?
How did the role of The Simpsons in society influence its format?
How did the initial structure of The Simpsons differ from its later structure?
How did the animation style of The Simpsons evolve over the years?
How did the humor and tone of The Simpsons evolve over time?
How have the themes and messages of The Simpsons evolved?
How have the characters in The Simpsons developed in terms of depth and complexity?
How has the animation quality of The Simpsons evolved over time?
What improvements have been made in special effects and technical aspects of The Simpsons?
How has the production quality, such as funding and resources, impacted The Simpsons?
How has the humor and writing in The Simpsons changed from its beginnings to later seasons?
How has The Simpsons inspired other television shows?
What is the role of The Simpsons in shaping societal perceptions and commentary?
How has the humor of The Simpsons impacted pop culture?
How has The Simpsons influenced the animation style in media?
How has the content of The Simpsons evolved over time?
How have the characters of The Simpsons influenced other shows and pop culture?
How has the humor and tone of The Simpsons changed?
How has the popularity of The Simpsons grown over the years?

In [None]:
answers

{'How have recurring roles in The Simpsons changed over the years?': "The cast of *The Simpsons* has evolved significantly over time. Initially, the main voice actors were paid $30,000 per episode until a 1998 pay dispute led to a salary increase to $125,000 by 2004. Further negotiations in 2004 and 2008 raised their pay to between $250,000 to $360,000 and eventually $400,000 per episode. Despite a 30% pay cut in 2016, they still earned just over $300,000 per episode.\n\nIn 2020, following the Black Lives Matter protests, Fox announced that recurring characters of color, like Carl Carlson and Dr. Hibbert, would no longer be voiced by white actors. These roles are now performed by black actors, marking a significant change in voice casting.\n\nThe show's supporting cast has also expanded, with characters gaining more prominent roles and even their own episodes. Inspired by SCTV, the series features a large ensemble of quirky characters, many of whom have grown into more prominent roles 

In [None]:
import pickle
with open('answers.pkl', 'wb') as f:
    pickle.dump(answers, f)
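
Since generating the answers takes several minutes, it can help to reload the saved file instead of rerunning the chain. A minimal sketch follows. The helper names `save_answers` and `load_answers` are assumptions, and the example writes to a temporary directory rather than `answers.pkl`.

```python
import os
import pickle
import tempfile
from typing import Dict

def save_answers(answers: Dict[str, str], path: str) -> None:
    """Write the answers dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(answers, f)

def load_answers(path: str) -> Dict[str, str]:
    """Return the saved answers, or an empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "answers.pkl")
    before = load_answers(path)  # nothing saved yet
    save_answers({"How has the cast changed over time?": "Example answer."}, path)
    restored = load_answers(path)

print(restored)
```

Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.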

## Synthesize

In [None]:
report = [f'# {topic}\n\n']
for k, v in breakdown.items():
    report.append(f'## {k}\n')
    if v == []:
        report.append(answers[k] + '\n\n')

    else:
        for q in v:
            report.append(f'### {q}\n')
            report.append(answers[q] + '\n\n')
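
The same assembly logic can be wrapped in a function that tolerates missing answers, for example when a run was interrupted. This is a sketch under that assumption. `build_report` and the placeholder text are hypothetical names, and the toy inputs stand in for the real `breakdown` and `answers`.

```python
from typing import Dict, List

def build_report(
    topic: str,
    breakdown: Dict[str, List[str]],
    answers: Dict[str, str],
    missing: str = "(no answer available)",
) -> str:
    """Assemble a markdown report: topic as title, questions as nested headers."""
    parts = [f"# {topic}\n\n"]
    for question, subs in breakdown.items():
        parts.append(f"## {question}\n")
        if not subs:
            # No sub-sub-questions: the sub-question was answered directly
            parts.append(answers.get(question, missing) + "\n\n")
        else:
            for sub in subs:
                parts.append(f"### {sub}\n")
                parts.append(answers.get(sub, missing) + "\n\n")
    return "".join(parts)

toy_breakdown = {"Q1": ["Q1a"], "Q2": []}
toy_answers = {"Q1a": "A1a"}
toy_report = build_report("T", toy_breakdown, toy_answers)
print(toy_report)
```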

In [None]:
md = ''.join(report)
with open('report.md', 'w') as f:
    print(md, file=f)

In [None]:
md

'# The evolution of The Simpsons as a show over time, covering changes in content, humor, character development, animation, and its role in society.\n\n## How has the cast changed over time?\nThe Simpsons has undergone significant changes since its debut in 1989. Initially praised for its sharp humor, character-driven plots, and satirical take on life, the show\'s early seasons (often referred to as its "golden era") were celebrated for their wit and realism. However, by the late 1990s, especially around the time of season nine, the show began to shift towards more zany and controversial humor, which some critics found exhausting. This change included an increased reliance on celebrity cameos and cultural references rather than deeper character development.\n\nDespite this perceived decline in quality, the show has continued to evolve, experimenting with new formats and staying relevant through its innovative storytelling and satirical commentary on modern life. However, many fans and 