# Project 4: **Build a Deep Research System**
Welcome to Project 4! For this project, we shift our focus from tool use and agents to *reasoning* models. You will practice state-of-the-art inference-time scaling methods such as *Chain-of-Thought* prompting and *Tree-of-Thoughts*, and briefly explore high-level concepts of training reasoning models using techniques like **STaR**.


Finally, you will put everything together to build a *deep research agent* that can browse the web, reason over what it finds, and give structured answers.

## Learning Objectives  
* Apply common inference-time scaling methods: **zero-shot / few-shot CoT, self-consistency, sequential decoding, tree-of-thoughts**  
* Gain intuition for **training** reasoning-capable models following the **STaR** approach  
* Build a minimal **deep-research agent** that combines step-by-step reasoning with live web search  
* Practice extending deep research to a multi-agent system  

## Roadmap  
0. Environment setup  
1. Inference-time scaling  
  1.1 Few-shot CoT  
  1.2 Zero-shot CoT  
  1.3 Self-consistency  
  1.4 Sequential revision  
  1.5 Tree-of-Thoughts (ToT)
2. Training reasoning models and inspecting deepseek-r1 
3. Deep-research agent  
4. (Optional) Multi-agent deep-research

# 0- Environment setup

### Step 1: Create your environment and install dependencies 
Before we start coding, you need a reproducible setup. Open a terminal in the same directory as this notebook, and use Conda or uv to install the project dependencies.

#### Option 1: Conda
```bash
# Create and activate the conda environment
conda env create -f environment.yaml && conda activate deep_research
```

#### Option 2: uv (Fast alternative)
If you prefer [uv](https://docs.astral.sh/uv/) over Conda:

```bash
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install dependencies
uv venv .venv-deep-research && source .venv-deep-research/bin/activate
uv pip install -r requirements.txt
```

### Step 2: Register this environment as a Jupyter kernel
```bash
python -m ipykernel install --user --name=deep_research --display-name "deep_research"
```
Now open your notebook and switch to the `deep_research` kernel (Kernel → Change Kernel).

### Step 3: Set up and run the Ollama server

In this project we use the `llama3.2:3b`, `qwen2.5:3b-instruct` and `deepseek-r1:1.5b` models. You can try other smaller or larger reasoning LLMs such as `phi4-mini` to compare performance. Explore available models here: https://ollama.com/library.

Open a terminal and start the Ollama server:
```bash
ollama serve
```
Then open another terminal and pull the required models:
```bash
ollama pull llama3.2:3b
ollama pull deepseek-r1:1.5b
ollama pull qwen2.5:3b-instruct
# Additional small reasoning models to compare
# ollama pull phi4-mini
```

---  
# 1- Inference-time scaling

Inference-time scaling refers to techniques that make an existing model reason better without retraining it. Instead of changing the model's weights, we improve reasoning by adjusting how we prompt the model and how we sample or aggregate its outputs.

In this section, we'll explore several inference-time strategies that improve reasoning quality using a non-reasoning base model. You will experiment with and compare methods such as:

- Few-shot Chain-of-Thought (CoT)
- Zero-shot CoT
- Self-consistency
- Sequential revision
- Tree-of-Thoughts (ToT)

## 1.1: Few-Shot CoT

Few-shot prompting provides examples before asking a new question. The model learns from the pattern and applies it to new inputs.

We'll explore this with two models to understand how few-shot interacts with model capabilities:

1. **GPT-2** (no instruction tuning): Doesn't reason by default. We'll see if few-shot examples can elicit reasoning.
2. **Llama 3.2** (instruction-tuned): Already reasons naturally. We'll use few-shot to control the output format.
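
Before running either model, it may help to see how a few-shot prompt is assembled. The sketch below is only an illustration: `build_few_shot_prompt` and the toy arithmetic demonstrations are hypothetical names and data, not part of the project code. It joins each demonstration in the same `Q:` / `A:` layout used in the cells that follow and leaves the final `A:` open for the model to complete.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) demonstrations, then append the new question.

    `examples` is a list of (question, worked_answer) pairs. The layout
    mirrors the "Q: ... / A: ..." format used elsewhere in this notebook.
    """
    parts = [f"Q: {q}\nA: {a}\n" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # leave the last answer open
    return "\n".join(parts)


# Toy demonstrations (illustrative only)
examples = [
    ("What is 2 + 3?", "Step 1: Add the numbers. 2 + 3 = 5.\nTherefore, the answer is 5."),
    ("What is 4 x 6?", "Step 1: Multiply the numbers. 4 x 6 = 24.\nTherefore, the answer is 24."),
]

prompt = build_few_shot_prompt(examples, "What is 7 + 8?")
print(prompt)
```

Keeping the demonstrations in a list makes it easy to vary the number of shots and observe how the model's output changes.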

### GPT-2: Can few-shot examples elicit reasoning?

GPT-2 is a base language model that just predicts the next token. It wasn't trained to follow instructions or reason step-by-step. Let's see what happens with and without few-shot examples.

In [1]:
import os
import torch
from transformers import pipeline

# MPS setup for Apple Silicon (use CPU if not on Mac)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
if device.type == "mps":
    torch.mps.empty_cache()

generator = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=device)

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

# --- Without few-shot examples ---
output_zero = generator(f"Q: {question}\nA:", max_new_tokens=100, do_sample=True, temperature=0.8)[0]["generated_text"]

print("=== GPT-2 WITHOUT few-shot examples ===")
print(output_zero)
print()

# --- With few-shot examples ---
few_shot = """Q: A store sells apples for $2 each. If I buy 3 apples and pay with a $10 bill, how much change do I get?
A: Step 1: Calculate total cost. 3 x $2 = $6.
Step 2: Calculate change. $10 - $6 = $4.
Therefore, the answer is $4.

Q: A train leaves at 9:15 AM and the journey takes 2 hours 30 minutes. What time does it arrive?
A: Step 1: Add hours. 9:15 + 2:00 = 11:15.
Step 2: Add minutes. 11:15 + 0:30 = 11:45.
Therefore, the answer is 11:45 AM.
"""

prompt = few_shot + f"Q: {question}\nA:"
output_few = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.8)[0]["generated_text"]

print("=== GPT-2 WITH few-shot examples ===")
print(output_few[len(few_shot):])

=== GPT-2 WITHOUT few-shot examples ===
Q: A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?
A: The square in the top right corner is the area of the rectangle.
Q: But the width is two times the width?
A: No, the width is not two times the width. It's just that the same area is only twice the width. Also, if two people cross each other and try to eat the same food, the surface of the rectangle will have a thickness of a fraction of the width, so that on the other hand the rectangle is twice as wide and the surface

=== GPT-2 WITH few-shot examples ===
Q: A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?
A: Step 1: Add corners. 36 = 36 cm x 36 = 36.3 x 36.3 = 36.39
Therefore, the answer is 36:39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x 36 = 36.39 x


### Llama 3.2: Using few-shot to control output format

Unlike GPT-2, Llama 3.2 is instruction-tuned and already produces reasoning traces by default. So what's the point of few-shot examples?

**The power of few-shot with instruction-tuned models is controlling the output format.** We can make the model follow a specific structure like `[GIVEN]/[FIND]/[SOLVE]/[ANSWER]` that it wouldn't use naturally.

In [None]:
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

# --- Without few-shot examples (model's default format) ---
response_zero = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": question}],
    temperature=0.7
)
print("=== WITHOUT few-shot examples ===")
print(response_zero.choices[0].message.content)
print()

# --- With few-shot examples (enforcing a specific format) ---
few_shot_examples = """Q: A store sells apples for $2 each. If I buy 3 apples and pay with a $10 bill, how much change do I get?
A: [GIVEN] apples cost $2, buying 3, paying with $10
[FIND] change received
[SOLVE] total = 3 x $2 = $6; change = $10 - $6 = $4
[ANSWER] $4

Q: A train leaves at 9:15 AM and the journey takes 2 hours 30 minutes. What time does it arrive?
A: [GIVEN] departure 9:15 AM, duration 2h 30m
[FIND] arrival time
[SOLVE] 9:15 + 2:30 = 11:45
[ANSWER] 11:45 AM
"""

prompt = few_shot_examples + f"Q: {question}\nA:"
response_few = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7
)
print("=== WITH few-shot examples ===")
print(response_few.choices[0].message.content)

=== WITHOUT few-shot examples ===
To find the area, we need to first find the dimensions of the rectangle.

Let's call the width "w" and the length "2w" since it's twice the width. The perimeter of a rectangle is given by:

Perimeter = 2(length + width)
36 = 2(2w + w)

Combine like terms:
36 = 6w

Divide both sides by 6:
w = 6

So, the width is 6 cm and the length is 2w = 12 cm.

Now that we have the dimensions, we can find the area of the rectangle:

Area = Length x Width
= 12 x 6
= 72

The area of the rectangle is 72 square centimeters.

=== WITH few-shot examples ===
To solve this problem, we need to find the dimensions of the rectangle first.

Let's denote the width as W and the length as L. We are given that the length is twice the width, so:

L = 2W

We also know that the perimeter of a rectangle is given by:

Perimeter = 2L + 2W

Substituting L = 2W into the equation above, we get:

36 = 2(2W) + 2W
36 = 4W + 2W
36 = 6W

Now, divide both sides by 6 to find W:

W = 36/6
W = 6 cm



## 1.2: Zero-Shot Chain-of-Thought
Zero-shot CoT encourages the model to reason without examples by adding a short cue such as "Let's think step by step." This simple phrase often activates the model's latent reasoning ability even when no demonstrations are provided. It serves as a baseline to compare with few-shot and other inference-time scaling methods.

In [13]:
from openai import OpenAI

# Step 1: Write the question and a zero-shot CoT cue (e.g., "Let's think step by step.")
# Step 2: Build a single prompt string that includes brief role guidance plus the question
# Step 3: Call your Ollama or OpenAI client to get a response from llama3.2:3b  # e.g., client.chat.completions.create(...)
# Step 4: Print the chain and the final answer

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

question = "Why do we use neural network to build LLMs?"

prompt = f"""You are a knowledgeable tutor. Answer the question. 
Question: {question}
Let's think step by step."""

MODEL = "llama3.2:3b"
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role":"user","content": prompt}],
    temperature=0
)
print(response.choices[0].message.content)

To understand why neural networks are used to build Large Language Models (LLMs), let's break down the process step by step:

1. **Understanding the Problem**: The primary goal of building an LLM is to create a model that can generate human-like text, answer questions, or perform other natural language processing tasks.

2. **Traditional Approaches**: Before neural networks, traditional approaches to NLP involved rule-based systems and statistical models. These methods were limited in their ability to handle complex linguistic structures and nuances of human language.

3. **The Rise of Neural Networks**: In the 1980s and 1990s, researchers began exploring the use of neural networks for NLP tasks. The key innovation was the development of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which allowed models to capture sequential dependencies in language.

4. **Why Neural Networks?**: Neural networks are particularly well-suited for LLMs because they can:
   -

## 1.3: Self-Consistency
Self-consistency enhances reasoning accuracy by sampling multiple independent reasoning paths for the same question instead of relying on a single deterministic answer. Each run may follow a slightly different logical chain, and the diversity helps correct individual mistakes. After generating several reasoning traces, you then aggregate the final answers using majority voting.

This approach is especially useful when tasks involve multi-step reasoning or arithmetic, where single-path outputs may be incorrect.
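
The aggregation step can be sketched on its own, without any LLM calls. The example below is a minimal illustration under stated assumptions: `normalize_answer` and `majority_vote` are hypothetical helper names, the sample answers are made up, and the normalization rule (keep the first number found) is only one simple choice. It shows why normalizing before voting matters: `"$100"` and `"100.00 dollars"` should count as the same vote, and unparsed samples (`None`) should not vote at all.

```python
import collections
import re


def normalize_answer(ans):
    """Reduce an answer string to a comparable form.

    Assumption for this sketch: if the answer contains a number,
    the first number is the answer; otherwise compare lowercased text.
    """
    if ans is None:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", ans.replace(",", ""))
    if match:
        num = float(match.group(0))
        return str(int(num)) if num.is_integer() else str(num)
    return ans.strip().lower()


def majority_vote(answers):
    """Return (winner, votes), ignoring samples whose answer could not be parsed."""
    normalized = (normalize_answer(a) for a in answers)
    votes = collections.Counter(a for a in normalized if a is not None)
    winner = votes.most_common(1)[0][0] if votes else None
    return winner, votes


# Made-up sample answers standing in for five sampled reasoning paths
sampled = ["$100", "100.00 dollars", None, "$96", "The original price is 100"]
winner, votes = majority_vote(sampled)
print("Votes:", dict(votes))
print("Chosen answer:", winner)
```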

In [None]:
from openai import OpenAI
import re, collections

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def cot_answer(question, temperature=1.2):
    prompt = f"""Answer the following question with step-by-step reasoning.
End your answer with exactly: "Therefore, the answer is [X]" where [X] is just the final answer.

Question: {question}
Let's think step by step."""
    
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )

    content = r.choices[0].message.content
    match = re.search(r"the answer is[:\s]*([^\n\.]+)", content, re.IGNORECASE)
    return content, match.group(1).strip() if match else None


def self_consistent(question, n=5):
    traces = []
    answers = []
    for i in range(n):
        reasoning, ans = cot_answer(question)
        traces.append((i + 1, reasoning, ans))
        answers.append(ans)
    
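    # Vote only over answers that were successfully parsed (skip None)
    counter = collections.Counter(a for a in answers if a is not None)
    winner = counter.most_common(1)[0][0] if counter else None
    return winner, counter, traces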


question = "A store has a 20% off sale. If an item costs $80 after the discount, what was the original price?"

winner, counter, traces = self_consistent(question, n=5)

# Print each reasoning trace
for i, reasoning, ans in traces:
    print(f"=== Sample {i} (Answer: {ans}) ===")
    print(reasoning[:500] + "..." if len(reasoning) > 500 else reasoning)
    print()

print("=" * 50)
print("Votes:", counter)
print("Chosen answer:", winner)
print("Correct answer: $100")

=== Sample 1 (Answer: $100) ===
To find the original price of the item before the discount, we can start with the price after the discount and work our way backwards.

Step 1: The item costs $80 after the discount and has a 20% off sale. Let's use this information to set up an equation.
Let x be the original price of the item.
Since there is a 20% discount, the discount amount can be represented as 0.2x (20% of x).
The sale price after the discount can be calculated by subtracting the discount amount from the original price:

...

=== Sample 2 (Answer: None) ===
To find the original price of the item before the discount, we need to calculate the amount of the discount and add it back to the final price.

Step 1: Calculate the original price without considering the discount.
Item costs $80 after a 20% off sale.

Step 2: Apply the percentage change due to the discount:
Let x be the value multiplied by the number 0.20 (or 20%) and add that value back into the final result of $80, set equa

## 1.4: Sequential Revision

Sequential revision iteratively improves an answer by generating a first draft, critiquing it, and producing revised drafts that condition on prior answers. Each round should be short and focused, so improvements accumulate without drifting from the question.
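
The draft, critique, revise cycle described above can be sketched with pluggable callables, so the control flow is visible without a model. This is only an illustration: `revise_with_critique` and the three stub functions are hypothetical, and the stubs hard-code a toy rectangle problem instead of calling an LLM. In practice each callable would wrap a chat-completion request.

```python
def revise_with_critique(question, generate, critique, revise, max_rounds=3):
    """Draft an answer, then alternate critique and revision.

    Stops early when the critic returns None (no objections).
    Returns the final draft and the list of all drafts.
    """
    draft = generate(question)
    history = [draft]
    for _ in range(max_rounds):
        feedback = critique(question, draft)
        if feedback is None:
            break
        draft = revise(question, draft, feedback)
        history.append(draft)
    return draft, history


# --- Stubs standing in for LLM calls (toy logic, not a real model) ---
def stub_generate(question):
    return "The area is 30 cm^2."  # deliberately wrong first draft


def stub_critique(question, draft):
    # Perimeter 30 with length = 2 x width gives 6w = 30, so w = 5 and area = 50
    return None if "50" in draft else "The area should be width x length = 5 x 10."


def stub_revise(question, draft, feedback):
    return "Width is 5 cm and length is 10 cm, so the area is 50 cm^2."


question = "A rectangle is twice as long as it is wide and the perimeter is 30 cm. What is the area?"
final, history = revise_with_critique(question, stub_generate, stub_critique, stub_revise)
for i, d in enumerate(history, start=1):
    print(f"Draft {i}: {d}")
```

Separating the critic from the reviser makes it possible to stop as soon as the critic is satisfied, rather than always running a fixed number of rounds.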

In [None]:
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def sequential_revision(question: str, max_steps: int = 3) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Keep your answers clear and correct."},
        {"role": "user", "content": question}
    ]
   
    draft = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=0.7,
    ).choices[0].message.content.strip()
    print(f"Draft 1: {draft}")

    # Iterative revision
    for idx in range(1, max_steps):
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Improve answers by making them clearer and more accurate."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Please revise your answer. Make it clearer, more accurate, and better written. Only include the new answer."}
        ]
        draft = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=0.7,
        ).choices[0].message.content.strip()
        print(f"Draft {idx+1}: {draft}")

    return draft


output = sequential_revision("If a rectangle is twice as long as it is wide and the perimeter is 30 cm, what is the area?")

## 1.5 Tree-of-Thoughts

Tree-of-Thoughts (ToT) reframes reasoning as a search problem. Instead of generating one linear chain of thoughts, the model:
1. Generates multiple candidate "thoughts" at each step
2. Evaluates how promising each thought is
3. Expands only the best candidates (beam search)
4. Backtracks if needed

This mirrors how humans solve hard problems: brainstorm options, evaluate them, pursue the best, and backtrack when stuck.

### Example 1: Word Ladder (Algorithmic ToT)

This example shows ToT as pure beam search without LLM calls. Each "thought" is a candidate word that differs from the current word by one letter. We score each candidate by the number of letter positions in which it differs from the goal (Hamming distance) and keep the best candidates.

This demonstrates the **core algorithm** behind ToT: expand, score, prune.

In [None]:
###### Word Ladder Puzzle ##########

def neighbors(word, vocabulary):
    for i, c1 in enumerate(word):
        for c2 in 'abcdefghijklmnopqrstuvwxyz':
            if c1 != c2:
                candidate = word[:i] + c2 + word[i+1:]
                if candidate in vocabulary:
                    yield candidate


def tree_of_thought(start, goal, vocab, max_depth=5, beam_width=4):
    frontier = [[start]]
    for depth in range(max_depth):
        candidates = []
        for path in frontier:
            for nxt in neighbors(path[-1], vocab):
                if nxt in path:  # avoid loops
                    continue
                candidates.append(path + [nxt])
        # score: number of letter positions that differ from the goal (lower is better)
        scored = sorted(candidates, key=lambda p: sum(a!=b for a,b in zip(p[-1], goal)))
        frontier = scored[:beam_width]
        if any(p[-1] == goal for p in frontier):
            return [p for p in frontier if p[-1]==goal][0]
    return None


vocab = {"hit","dot","cog","log","dog","lot","lit","hot"}
print(tree_of_thought("hit", "cog", vocab))


['hit', 'hot', 'dot', 'dog', 'cog']


### Example 2: Generic ToT for Open-Ended Problems

For open-ended problems without verifiable answers, we can still apply ToT by having the LLM both propose and evaluate thoughts.

In [2]:
###### Generic ToT Search ##########

import re
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"

def propose_thoughts(question, state, k=2):
    prompt = f"""You are exploring solutions.
Problem: {question}
Current partial solution: {state}

Propose {k} different next thoughts, numbered 1 to {k}."""
    
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    # NOTE: the full reply is returned as ONE thought; it is not split into k separate thoughts
    return [r.choices[0].message.content.strip()]


def score_state(question, state):
    prompt = f"""Problem: {question}
Rate from 1-10 how promising this partial solution is: {state}
Reply with just a number."""
    
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    nums = re.findall(r"\d+", r.choices[0].message.content)
    return int(nums[0]) if nums else 5


def tree_of_thoughts(question, depth=2, width=2):
    frontier = [("", 0)]
    for _ in range(depth):
        new_frontier = []
        for state, _ in frontier:
            for thought in propose_thoughts(question, state, k=width):
                new_state = (state + "\n" + thought).strip()
                score = score_state(question, new_state)
                new_frontier.append((new_state, score))
        new_frontier.sort(key=lambda x: x[1], reverse=True)
        frontier = new_frontier[:width]
    best_state, best_score = frontier[0]
    return best_state, best_score


question = "Design a plan for a weekend science workshop for 12-year-olds."
solution, score = tree_of_thoughts(question)

print(f"Best solution (score {score}):\n{solution}")

Best solution (score 8):
Here are two potential next steps:

1. **Define the Workshop Theme and Objectives**: Develop a clear theme and set of objectives for the weekend science workshop. This will help guide the planning process and ensure that all activities and experiments align with the goals of engaging and educating the participants (12-year-olds). Some possible themes could include:
 * "Environmental Science" (focusing on sustainability, conservation, and ecology)
 * "Physics and Engineering" (exploring simple machines, circuits, and robotics)
 * "Chemistry and Materials" (introducing basic chemistry concepts through hands-on experiments)

By defining a clear theme and set of objectives, we can start to generate ideas for activities, experiments, and resources that will make the workshop engaging and effective.

2. **Identify a Venue and Logistics**: Secure a suitable venue for the weekend science workshop, taking into account factors such as:
 * Availability (dates and times)
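
In the run above, the whole numbered reply is carried forward as a single thought. One possible refinement is to split the reply into its numbered items so each can be scored and expanded separately. The helper below is a hedged sketch, not part of the notebook's code: `split_numbered_thoughts` is a hypothetical name, and it assumes the model numbers its thoughts as `1.` or `1)` at the start of a line.

```python
import re


def split_numbered_thoughts(text, k):
    """Split a reply such as '1. ...\\n2. ...' into at most k separate thoughts.

    Text before the first numbered item (e.g. "Here are two ideas:") is dropped.
    If no numbered items are found, the whole reply is returned as one thought.
    """
    parts = re.split(r"(?m)^\s*\d+[.)]\s+", text)
    thoughts = [p.strip() for p in parts[1:] if p.strip()]
    return thoughts[:k] if thoughts else [text.strip()]


reply = (
    "Here are two potential next steps:\n\n"
    "1. **Define the theme**: pick a topic.\n"
    " * Physics\n"
    "2. **Identify a venue**: book a room."
)
for t in split_numbered_thoughts(reply, k=2):
    print("THOUGHT:", t)
```

A helper like this could replace the final `return` in `propose_thoughts`, at the cost of depending on the model following the numbering instruction.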
 

---  
# 3- Training Models for Reasoning

### 3.1: CoT Training
Chain-of-Thought (CoT) training conditions the model on explicit rationales during fine-tuning. Instead of teaching the model to output only the final answer, we train on (question, rationale, answer) triples so the model internalizes multi-step reasoning patterns. A practical recipe is STaR (Self-Taught Reasoner), in which the model bootstraps from rationales it generates itself, keeping only those that reach the correct answer. A common variant, used in the workflow below, instead has a stronger teacher model produce the rationales that a smaller student learns from.

For tasks that require multi-hop reasoning, models fine-tuned on rationales often achieve higher accuracy and are more stable at inference time than models trained on direct answers only. 

Training a full language model is beyond the scope of this notebook, but here is the high-level workflow, followed by short pseudocode:
- Collect questions: Prepare a dataset of questions and correct answers.
- Generate rationales: Use a strong LLM to produce step-by-step reasoning ending with the correct answer.
- Filter and clean: Discard incorrect or low-quality rationales.
- Prepare training data: Format triples (question, rationale, answer) for supervised fine-tuning.
- Fine-tune: Fine-tune the LLM on rationales.
- Iterate: Refine prompts, improve data quality, and retrain for stronger reasoning.
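
The "filter and clean" and "prepare training data" steps can be sketched in plain Python. This is a minimal illustration under assumptions: the sample data is made up, the `prompt` / `completion` field names are a hypothetical record layout (fine-tuning libraries differ in what they expect), and correctness is checked by exact string match on the text after "the answer is".

```python
import json
import re


def final_answer(rationale):
    """Extract the text after 'the answer is' from a rationale, if present."""
    match = re.search(r"the answer is[:\s]*([^\n]+)", rationale, re.IGNORECASE)
    return match.group(1).strip().rstrip(".") if match else None


def build_sft_records(samples):
    """Keep only rationales whose final answer matches the gold answer."""
    records = []
    for s in samples:
        if final_answer(s["rationale"]) == s["gold"]:
            records.append({
                "prompt": f"Question: {s['question']}\nLet's think step by step.",
                "completion": s["rationale"],
            })
    return records


# Made-up generated rationales: the second one reaches a wrong answer
samples = [
    {"question": "What do 3 apples cost at $2 each?",
     "rationale": "Step 1: 3 x $2 = $6.\nTherefore, the answer is $6.",
     "gold": "$6"},
    {"question": "What do 3 apples cost at $2 each?",
     "rationale": "Step 1: 3 + $2 = $5.\nTherefore, the answer is $5.",
     "gold": "$6"},
]

records = build_sft_records(samples)
for r in records:
    print(json.dumps(r))  # one JSON object per line (JSONL)
```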

In [None]:
# Pseudocode (STaR loop)
# for round in 1 ... iters:
#     STEP 1: generate reasoning (the model itself, or a stronger teacher, writes rationale + answer)
#     STEP 2: keep only correct, high-quality traces
#     STEP 3: fine-tune on the kept (question, rationale, answer) data

### 3.2: ORM vs PRM + RL
Training a Reward Model (RM) allows large language models to be improved through reinforcement learning (RL). Instead of fine-tuning directly on examples, we train a separate model that can score or rank model outputs, and use those scores as feedback signals to refine the policy model.

The two main reward modeling approaches are the Outcome Reward Model (ORM), which predicts a scalar reward for the final answer, and the Process Reward Model (PRM), which scores the individual reasoning steps instead of just the outcome.



| Approach | What it does | When to use |
|-----------|-------------|-------------|
| *Outcome Reward Model (ORM)* | Predicts one scalar reward for the final answer | Training data is easy to collect using verifiers |
| *Process Reward Model (PRM)* | Predicts a reward for each reasoning step | Training data is difficult to collect, but the signal is more accurate |
| *RLHF* | Uses the RM as the reward in **RL** fine-tuning | Aligns the model policy with human or synthetic preferences |
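
To make the ORM / PRM distinction concrete, the toy sketch below scores one reasoning trace both ways. Everything here is a stand-in: real reward models are learned networks, whereas `outcome_reward`, `process_rewards`, and `check_arithmetic_step` are hypothetical hand-written verifiers used only to show the shape of the signal (one number for the outcome versus one number per step).

```python
import re


def outcome_reward(final_answer, gold):
    """ORM-style signal: a single scalar for the final answer."""
    return 1.0 if final_answer == gold else 0.0


def check_arithmetic_step(step):
    """Toy step verifier: accepts steps of the form 'a op b = c' that are arithmetically true."""
    m = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)\s*", step)
    if not m:
        return False
    a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return result == c


def process_rewards(steps):
    """PRM-style signal: one scalar per reasoning step."""
    return [1.0 if check_arithmetic_step(s) else 0.0 for s in steps]


# A trace with a mistake in the second step (10 - 6 is 4, not 5)
steps = ["3 * 2 = 6", "10 - 6 = 5"]
final, gold = "5", "4"

orm = outcome_reward(final, gold)
prm = process_rewards(steps)
print("ORM reward:", orm)
print("PRM rewards per step:", prm)
print("First faulty step:", prm.index(0.0) + 1 if 0.0 in prm else None)
```

The outcome signal only says the answer is wrong, while the per-step signal also locates where the reasoning went wrong.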




In [None]:
# Pseudocode (RL fine-tuning with a reward model)
# for round in 1 ... iters:
#     STEP 1: Generate reasoning
#         sample a minibatch of questions
#         policy roll-out (actions + log-probs)
#     STEP 2: Score the trajectory
#         ORM: scalar reward for the final answer / PRM: scalar reward for the thought process
#     STEP 3: Reinforce the policy (PPO)

### 3.3 Inspect a reasoning model

Now that we've discussed how reasoning models are trained, let's see one in action. We'll use **DeepSeek-R1**, a reasoning model that produces explicit *thinking tokens* before giving its final answer. The model wraps its internal chain-of-thought inside `<think>...</think>` tags, followed by a clean final response.

In the cell below we send a question to DeepSeek-R1 and parse the output to separate:
- **Thinking tokens**: the model's internal reasoning process (hidden from the end user in production).
- **Final answer**: the polished response the user actually sees.

We use `deepseek-r1:1.5b` here for speed. You can switch to `deepseek-r1:8b` for higher-quality reasoning, but it will take longer to run. Pull whichever variant you want to try with `ollama pull`, as in Step 3 of the environment setup.

In [9]:
import re
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
model = "deepseek-r1:1.5b"

question = "A rectangle has a perimeter of 36 cm. If the length is twice the width, what is the area?"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": question}],
    temperature=0.6,
)

raw_output = response.choices[0].message.content

# DeepSeek-R1 wraps its reasoning in <think>...</think> tags
think_match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
thinking = think_match.group(1).strip() if think_match else "(no thinking tokens found)"
final_answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()

print("=" * 60)
print("THINKING TOKENS (internal reasoning)")
print("=" * 60)
print(thinking)
print()
print("=" * 60)
print("FINAL ANSWER (returned to the user)")
print("=" * 60)
print(final_answer)

THINKING TOKENS (internal reasoning)
Okay, so I have this problem about a rectangle with a perimeter of 36 cm. It says that the length is twice the width, and I need to find the area. Hmm, let me think step by step how to approach this.

First off, I remember that the perimeter of a rectangle is calculated using the formula: P = 2*(length + width). Since they've given the perimeter as 36 cm, I can plug that into the equation.

But before I do that, maybe I should define variables for length and width to make it clearer. Let me call the width 'w'. Then, since the length is twice the width, the length would be '2w'. That makes sense because they said length is twice as long as the width.

So now, substituting these into the perimeter formula: 36 = 2*(length + width) which becomes 36 = 2*(2w + w). Let me write that out:

36 = 2*(2w + w)

Simplify inside the parentheses first. 2w plus w is 3w, so now it's:

36 = 2*3w

Which simplifies to:

36 = 6w

Hmm, okay, so if I divide both sides by 6

---  
# 4- A Deep Research Agent

A deep-research agent pairs a reasoning model with external tools for web search and retrieval. We will follow the ReAct pattern: the model writes short thoughts, decides when to call tools, reads observations, and continues reasoning until it can answer or reaches a step limit.

Concretely, we combine a **search tool** with an LLM in a multi-step loop (reason → tool → observation):

1. The model reasons and decides to use tools
2. The agent searches and feeds condensed snippets back as context
3. Iterate until the model answers or hits a step limit
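
The loop above can be sketched in a few lines with a scripted stand-in for the model, which makes the control flow easy to follow before handing it over to a framework. This is an illustration only: `react_loop`, `scripted_llm`, and `fake_search` are hypothetical names, the "model" is a hard-coded function, and the search result is a canned string, not real web output.

```python
def react_loop(question, llm_step, tools, max_steps=3):
    """Run reason -> tool -> observation until the model answers or the step limit is hit."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        # llm_step returns {"thought", "tool", "input"} or {"thought", "answer"}
        action = llm_step(transcript)
        transcript.append(f"Thought: {action['thought']}")
        if "answer" in action:
            return action["answer"], transcript
        observation = tools[action["tool"]](action["input"])
        transcript.append(f"Action: {action['tool']}[{action['input']}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step limit reached without an answer


def fake_search(query):
    """Canned tool output standing in for a real web search."""
    return "Paris is the capital of France."


def scripted_llm(transcript):
    """Stand-in for the model: search once, then answer from the observation."""
    observations = [t for t in transcript if t.startswith("Observation: ")]
    if not observations:
        return {"thought": "I need to look this up.", "tool": "search", "input": "capital of France"}
    return {"thought": "The observation contains the answer.",
            "answer": observations[-1].removeprefix("Observation: ")}


answer, transcript = react_loop("What is the capital of France?", scripted_llm, {"search": fake_search})
print("\n".join(transcript))
print("Answer:", answer)
```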

We use `create_agent` from `langchain.agents`, which builds a ReAct-style agent graph. Note: the agent model must support **tool calling** (e.g., `llama3.2:3b`). Models like `deepseek-r1` are reasoning models that do not support native tool calling and cannot be used directly as the agent LLM. For this section, we stick with `llama3.2:3b` or `qwen2.5:3b-instruct`.

In [3]:
from ddgs import DDGS
from langchain_core.tools import tool


@tool
def ddg_search(query: str, k: int = 5) -> str:
    """Basic DuckDuckGo web search that returns a concatenated text snippet."""
    with DDGS() as ddgs:
        results = [hit["body"] for hit in ddgs.text(query, max_results=k)]
    return "\n".join(results)


search_tool = ddg_search


In [None]:
from langchain.agents import create_agent
from langchain_ollama import ChatOllama

MODEL = "qwen2.5:3b-instruct"

question = "What are the best resources to learn machine learning in 2025?"

# Step 1: Initialize the LLM via ChatOllama (must support tool calling)
llm = ChatOllama(model=MODEL, temperature=0.2)

# Step 2: Build a tool-calling agent with DuckDuckGo search
agent = create_agent(llm, tools=[search_tool])

# Step 3: Ask a query and let the agent search + reason to produce an answer
result = agent.invoke({"messages": [("user", question)]})
print(result["messages"][-1].content)

Based on the information provided by DuckDuckGo search:

1. There's an upcoming course titled "MachineLearning for Bioinformatics & Systems Biology" that might be relevant to learning about machine learning resources in 2025.

Additionally, there are several topics related to machine learning techniques and their applications:
- Digital marketers use machine learning to analyze user behavior on websites.
- Cross-validation is a technique used in deep learning models like Keras, PyTorch, and MxNet. It helps evaluate the performance of these models by splitting data into training and validation sets multiple times.

For resources related to machine learning in 2025 or beyond, you might want to look for more specific courses or events that are scheduled for that year. The course mentioned could be a good starting point if it aligns with your interests and goals.


# 5- (Optional) Multi-Agent Deep Research

Instead of a single agent, we can design multiple collaborating agents that work in parallel:

1. **Planner**: Analyzes the query and breaks it into sub-questions
2. **Researchers**: Run in parallel, each searching and summarizing findings for one sub-question  
3. **Synthesizer**: Combines all research into a coherent final report

This setup improves coverage and speed by parallelizing the research phase.

In [10]:
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
from ddgs import DDGS

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def plan_research(query: str) -> list[str]:
    """Planner agent: breaks query into sub-questions and decides scale (1, 3, or up to 5 sub-queries)."""
    prompt = f"""You are a research planner. Given a query, break it into 1-5 focused sub-questions.
- Simple factual queries: 1 sub-question
- Moderate topics: 3 sub-questions  
- Complex topics needing multiple angles: 5 sub-questions

Query: {query}

Return ONLY the sub-questions, one per line, no numbering or bullets."""

    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
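    content = r.choices[0].message.content or ""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    # Cap at 5; fall back to the original query so the researcher pool is never empty
    # (ThreadPoolExecutor raises ValueError when max_workers is 0)
    return lines[:5] or [query]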


def search_and_summarize(sub_question: str) -> dict:
    """Researcher agent: searches web and summarizes findings for one sub-question."""
    # Search
    with DDGS() as ddgs:
        results = [hit["body"] for hit in ddgs.text(sub_question, max_results=3)]
    snippets = "\n".join(results)
    
    # Summarize
    prompt = f"""Based on these search results, write a concise summary answering: {sub_question}

Search results:
{snippets}

Summary:"""

    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return {"question": sub_question, "summary": r.choices[0].message.content.strip()}


def synthesize_report(query: str, findings: list[dict]) -> str:
    """Synthesizer agent: combines all findings into a final report."""
    findings_text = "\n\n".join([f"### {f['question']}\n{f['summary']}" for f in findings])
    
    prompt = f"""You are a research synthesizer. Combine these findings into a coherent report.

Original query: {query}

Research findings:
{findings_text}

Write a well-structured report that answers the original query. Use markdown formatting."""

    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4
    )
    return r.choices[0].message.content.strip()


def deep_research(query: str) -> str:
    """Run the full multi-agent deep research pipeline."""
    print(f"Planning research for: {query}\n")
    
    # Step 1: Plan
    sub_questions = plan_research(query)
    print(f"Sub-questions ({len(sub_questions)}):")
    for sq in sub_questions:
        print(f"  - {sq}")
    
    # Step 2: Research in parallel
    print("\nResearching in parallel...")
    with ThreadPoolExecutor(max_workers=len(sub_questions)) as executor:
        findings = list(executor.map(search_and_summarize, sub_questions))
    print(f"Collected {len(findings)} research summaries.")
    
    # Step 3: Synthesize
    print("\nSynthesizing final report...\n")
    report = synthesize_report(query, findings)
    return report


# Run the multi-agent research
query = "What are the best resources to learn machine learning in 2025?"
report = deep_research(query)
print("=" * 60)
print("FINAL REPORT")
print("=" * 60)
print(report)

Planning research for: What are the best resources to learn machine learning in 2025?

Sub-questions (5):
  - What are the top online courses for machine learning in 2025?
  - What are some popular books on machine learning that were published in 2024 or later?
  - What are some of the most influential researchers and their current projects in machine learning?
  - How do different machine learning frameworks (e.g. TensorFlow, PyTorch) compare to each other in terms of ease of use and performance for beginners?
  - What role do online communities (e.g. Kaggle, Reddit's r/MachineLearning) play in supporting the growth of machine learning practitioners in 2025?

Researching in parallel...
Collected 5 research summaries.

Synthesizing final report...

FINAL REPORT
# Best Resources to Learn Machine Learning in 2025

As machine learning continues to evolve and become increasingly important across various industries, it's essential for individuals to stay up-to-date with the latest developme
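
In `deep_research`, `executor.map` re-raises the first exception when its results are collected, so a single failed search discards the work of every other researcher. Below is a minimal sketch of a more forgiving fan-out, under the assumption that a failed sub-question should become a placeholder finding rather than stop the run. `research_all` and `flaky_researcher` are hypothetical names, and the stub stands in for `search_and_summarize` so the block runs without network or model access.

```python
from concurrent.futures import ThreadPoolExecutor


def research_all(sub_questions, researcher, max_workers=5):
    """Run `researcher` on every sub-question in parallel, keeping input order.

    A failure in one worker is recorded as an error entry instead of
    aborting the whole batch.
    """
    if not sub_questions:
        return []
    workers = min(max_workers, len(sub_questions))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(researcher, sq) for sq in sub_questions]
        findings = []
        for sq, future in zip(sub_questions, futures):
            try:
                findings.append(future.result())
            except Exception as exc:  # keep going and note what went wrong
                findings.append({"question": sq, "summary": f"(research failed: {exc})"})
    return findings


def flaky_researcher(sub_question):
    """Stub researcher (hypothetical): fails on one input to show the fallback."""
    if "fail" in sub_question:
        raise RuntimeError("search unavailable")
    return {"question": sub_question, "summary": f"summary of {sub_question}"}


findings = research_all(["courses", "please fail", "books"], flaky_researcher)
for f in findings:
    print(f["question"], "->", f["summary"])
```

Passing `search_and_summarize` as `researcher` would slot this into the pipeline above. The synthesizer then sees the failure note in place of a summary, and it is a design choice whether to filter such entries out before synthesis.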

## 🎉 Congratulations!

You have:
* Practiced various inference-time reasoning methods (CoT, self-consistency, sequential revision, ToT)
* Gained intuition about training reasoning models (STaR, ORM/PRM)
* Built a **deep-research agent** with tool calling and ReAct-style reasoning
* Implemented a **multi-agent system** with parallel research and report synthesis


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.