# MAS-Zero: Multi-Agent Systems with Zero Supervision

MAS-Zero ("Designing Multi-Agent Systems with Zero Supervision") is a [paper by Ke et al.](https://arxiv.org/pdf/2505.14996) that explores how to design multi-agent systems (MAS) without supervision.

Unlike automated MAS design methods that rely on validation data (e.g., ADAS) and optimize a single design for an entire task distribution, MAS-Zero designs a solution *per problem instance* without using any additional validation set.

![](assets/mas-zero.png)

MAS-Zero has three stages:

1. **MAS-Init**: Run a set of established, human-designed prompting strategies (e.g., Chain-of-Thought, Self-Consistency, Debate, Self-Refine) to generate initial candidate answers.
2. **MAS-Evolve**: Iteratively improve the solution via two alternating phases—meta-design and meta-feedback.
    1. **meta-design**: The meta-agent decomposes the question into sub-tasks and proposes a MAS using the available building blocks and any accumulated experience.
    2. **meta-feedback**: Evaluate the proposed MAS and its outputs (including intermediate outputs) for:
        - **Solvability**: Each sub-task is independently and completely solvable.
        - **Completeness**: The set of sub-tasks covers all critical information from the original question, so their answers can be aggregated into a correct final answer.
    3. Store the meta-feedback as experience and feed it back into subsequent meta-design iterations.
3. **MAS-Verify**: Collect candidate answers from MAS-Init and MAS-Evolve, rank them by frequency (a majority-vote prior), filter out invalid answers, and select the final answer from the remaining candidates.
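
The ranking and filtering part of MAS-Verify can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `verify` and the `valid` parameter are hypothetical, and the final selection is simplified to "take the most frequent valid candidate".

```python
from collections import Counter


def verify(candidates, valid=("A", "B", "C", "D")):
    """Rank candidates by frequency, drop invalid ones, return the top one.

    `valid` is an assumption for multiple-choice tasks; other tasks need
    their own validity check.
    """
    # most_common() sorts by count; ties keep first-seen order
    ranked = [answer for answer, _ in Counter(candidates).most_common()]
    remaining = [answer for answer in ranked if answer in valid]
    return remaining[0] if remaining else None


# Hypothetical candidate answers collected from MAS-Init and MAS-Evolve
final = verify(["D", "B", "D", None, "E", "B", "D"])
print(final)
```

Because the ranking is frequency-based, a wrong answer repeated by several blocks can still win, which is why invalid candidates are filtered before selection.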

```
@misc{ke2025maszero,
      title={MAS-Zero: Designing Multi-Agent Systems with Zero Supervision}, 
      author={Zixuan Ke and Austin Xu and Yifei Ming and Xuan-Phi Nguyen and Caiming Xiong and Shafiq Joty},
      year={2025},
      eprint={2505.14996},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14996}, 
}
```

This notebook shows how to implement MAS-Zero using the `agenticblocks` library.

## Setup

### Imports

In [None]:
import collections
import inspect
import io
import json
import random
import re
import urllib.request
import traceback
import zipfile

import agenticblocks as ab
import pandas as pd
from tqdm.auto import tqdm


### Model Access

We need to configure access to the language model(s) we want to use.

`agenticblocks` supports all OpenAI-API-compatible providers.

You can set the base URL and API key via the `OPENAI_API_URL` and `OPENAI_API_KEY` environment variables.

For more details, see the [getting started example](01_getting_started.ipynb).
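
Before creating a model, it can help to fail fast when credentials are missing. The sketch below only checks the two environment variables named above; the helper name `missing_env_vars` is hypothetical and not part of `agenticblocks`.

```python
import os

REQUIRED_VARS = ("OPENAI_API_URL", "OPENAI_API_KEY")


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```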

In [2]:
import dotenv
dotenv.load_dotenv()

#!export OPENAI_API_URL=
#!export OPENAI_API_KEY=

True

This example executes LLM-generated code. With `YOLO = False`, you are prompted to review and confirm each code snippet before it runs. With `YOLO = True`, the code runs without confirmation.
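
The confirmation step can be isolated in a small helper. This is a sketch with a hypothetical name (`confirm_execution`); unlike the loop further below, it only accepts an empty reply, `y`, or `yes`, which is a slightly stricter choice made here for illustration.

```python
def confirm_execution(code: str, yolo: bool, ask=input) -> bool:
    """Return True if the given code may be executed.

    `ask` is injectable so the helper can be tested without a terminal.
    """
    if yolo:
        return True  # YOLO mode: run without review
    print(code)
    answer = ask("Run? [Y/n]: ").strip().lower()
    return answer in ("", "y", "yes")
```

Passing a stub such as `ask=lambda _: "n"` makes it easy to exercise both branches in a test.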

In [None]:
YOLO = False

In [None]:
# We use llama-3.3-70b-instruct, which is available for free on OpenRouter
MODEL_NAME = "meta-llama/llama-3.3-70b-instruct:free"

In [4]:
model = ab.Model(MODEL_NAME, cost_tracking="ignore_errors")

### Data

Let's download the GPQA (Graduate-Level Google-Proof Q&A) dataset. GPQA consists of challenging multiple-choice questions in biology, physics, and chemistry, written by domain experts and designed to be hard to answer even with unrestricted web access.

We will use a subset of 50 examples to evaluate MAS-Zero.
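
One way to pick such a subset reproducibly is to sample row positions with a fixed seed. The helper name `sample_indices`, the seed, and the row count used below are assumptions for illustration; once `df` is loaded, the result could be used as `df.iloc[subset_idx]`.

```python
import random


def sample_indices(n_rows: int, k: int = 50, seed: int = 0) -> list[int]:
    """Pick k distinct row positions reproducibly (all rows if fewer than k)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(n_rows), min(k, n_rows)))


# 448 is a placeholder row count; use len(df) in the notebook
subset_idx = sample_indices(448)
print(len(subset_idx))
```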

In [5]:
with urllib.request.urlopen("https://github.com/idavidrein/gpqa/raw/main/dataset.zip") as response:
    zip_data = io.BytesIO(response.read())

with zipfile.ZipFile(zip_data, 'r') as zf:
    with zf.open('dataset/gpqa_main.csv', pwd=b"deserted-untie-orchid") as csv_file:
        df = pd.read_csv(csv_file)

In [6]:
def format_input(row):
    correct = row['Correct Answer']
    incorrect = [row['Incorrect Answer 1'], row['Incorrect Answer 2'], row['Incorrect Answer 3']]
    choices = [correct] + incorrect
    random.shuffle(choices)
    input_text = f"""{row['Question']}

{chr(10).join([f"{chr(ord('A')+i)}) {choice}" for i, choice in enumerate(choices)])}
"""
    correct_idx = choices.index(correct)
    return pd.Series([input_text, correct_idx], index=["input", "correct_index"])

df[["input", "correct_index"]] = df.apply(format_input, axis=1)

## MAS-Zero Implementation

Let's implement the three stages of MAS-Zero using the `agenticblocks` library. For demonstration purposes, we will use a hard example from the GPQA dataset.

In [7]:
example = df.loc[
    df["Writer's Difficulty Estimate"].str.startswith("Post-graduate level or harder")
    & df["Question Difficulty_EV_1"].str.startswith("Post-graduate level or harder")
    & df["Question Difficulty_EV_2"].str.startswith("Post-graduate level or harder")
].iloc[3]

print(example.input)
print(example.correct_index)
print(example["Correct Answer"])

In a quantum dialog protocol a 4-mode continuous variable GHZ state is distributed among 3-parties, and a bell measurement is performed on these states, what would be the measurement output if the three parties encode in the following way using a displacement operator D(alpha): 
P1: (xa,pa) 
P2: (xb,pb)
P3: (xc,pc)
Here, (x,p) correspond to the amplitude and phase, such that 
alpha= x +ip, is the argument of displacement operator.
In the scheme, the 2nd and 3rd mode are encoded by P2. The 1st and 4th mode are encoded by P1 and P3.

A) (xa -xb,pa -pb), (xb-xc,pb-pc)
B) (xa +xb,pa -pb), (xb+xc,pb-pc)
C) (xa +xb,pa +pb), (xb+xc,pb+pc)
D) (xa -xb,pa +pb), (xb-xc,pb+pc)

3
(xa -xb,pa +pb), (xb-xc,pb+pc)


### Stage 1: MAS-Init

First, we run a set of established building blocks (direct prompting, Chain-of-Thought, Self-Consistency, Multi-Agent Debate, Self-Refine) to generate initial candidate answers. We use the `agenticblocks` built-ins for this.

In [8]:
builtin_blocks = [ab.IO, ab.ChainOfThought, ab.SelfConsistency, ab.MultiAgentDebate, ab.SelfRefine]

In [9]:
# Build the initial archive by running each built-in block on the example question
archive = []
for block_class in tqdm(builtin_blocks, total=len(builtin_blocks)):
    model = ab.Model(MODEL_NAME, cost_tracking="ignore_errors")
    if block_class == ab.MultiAgentDebate:
        block = block_class(agents=[ab.Model(MODEL_NAME, cost_tracking="ignore_errors") for _ in range(3)])
    elif block_class == ab.SelfConsistency:
        block = block_class(ab.ChainOfThought(model), n=3, temperature=0.7, aggregator=model)
    else:
        block = block_class(model)
    with ab.trace() as t:
        out = block(example.input)
    archive += [{
        "name": f"ab.{block_class.__name__}",
        "thought": "agenticblocks built-in",
        "code": inspect.getsource(block_class),
        "output": out,
        "trace": t.to_dict(),
        "fitness": {"95% Bootstrap Confidence Interval": f"({0:.1f}%, {0:.1f}%)", "median": round(0, 1)},  # fitness is always set to 0
    }]

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [16:43<00:00, 200.77s/it]


In [10]:
archive

[{'name': 'ab.IO',
  'thought': 'agenticblocks built-in',
  'code': 'class IO(Block):\n    """IO block - simple pass-through to the model."""\n\n    def __init__(self, model: Model | str):\n        self.model = Model(model) if isinstance(model, str) else model\n\n    def __repr__(self):\n        return f"IO({self.model!r})"\n\n    def forward(self, prompt: str, **kwargs: Any) -> str:\n        return self.model(prompt, **kwargs)\n',
  'output': "To determine the measurement output in a quantum dialog protocol using a 4-mode continuous variable GHZ state with bell measurements, we need to understand how the displacement operators applied by each party affect the modes. The GHZ state is a highly entangled state, and the displacement operators encode information onto the modes.\n\nGiven the encoding scheme:\n- P1 encodes on the 1st and 4th modes with $(x_a, p_a)$, which corresponds to a displacement operator $D(\\alpha_a)$ where $\\alpha_a = x_a + ip_a$.\n- P2 encodes on the 2nd and 3rd mo

### Stage 2: MAS-Evolve


Let's define the prompts used in MAS-Zero. These are adapted from the [MAS-Zero repository](https://github.com/SalesforceAIResearch/MAS-Zero).

In [11]:
system_prompt = """You are a helpful assistant.\n\nReply EXACTLY with the following JSON format.\n{"reflection": "Your reflection (if applicable).", "thought": "Your thought.", "name": "Your name.", "code": "Your code."}\nDO NOT MISS ANY REQUEST FIELDS and ensure that your response is a well-formed JSON object!"""

In [12]:
meta_design_prompt = f"""# Overview
You are an expert machine learning researcher testing various agentic systems. You are given a set of architectures in the archive and a question. Note that an architecture can contain multiple agents, and an agent means an LLM that is used for a specific objective with specified settings (instruction, temperature, ...).

Your objective is to 

(1) Perform task decomposition. Specifically, decompose the given question significantly so that the sub-architecture (or node or block) can perform each of the sub-tasks. The output should be sub-task 1, sub-task 2, ... sub-task n. Do not solve the task for the sub-architecture and do not leak the expected answer in your sub-task description/instruction/question (a shortcut like 'output exactly the following...' is also leakage and should be avoided). Instead, decompose the task so that it is easy enough for the sub-architecture to solve. You need to justify how these sub-tasks can achieve the final answer to the original question.

Make sure 

(a) Include the sub-task ID and 'Based on (task i)' in the instruction of a sub-task. 
For example, if sub-task 2 requires the output of task 1, then sub-task 2's instruction should be
'Sub-task 2: Based on the outputs from sub-task 1, ....(original sub-task 2's instruction)'

Similarly, if sub-task 3 requires the output of task 1 and 2, then sub-task 3's instruction should be
'Sub-task 3: Based on the outputs from sub-task 1 and 2, ....(original sub-task 3's instruction)'
This helps each sub-task connect to its prerequisite sub-tasks so that there is enough information to solve it.

(b) Each sub-task should be specific and detailed enough to solve and to help achieve the final answer to the given question. The output should be helpful to solve the next sub-task. You need to include detailed steps (but not the answer) for the sub-task 
For example,
`Sub-task 3: Based on the output of sub-task 1 and 2....`
You can see it clearly states 'based on what sub-tasks'

(c) Use the following prompt template for the sub-task blocks (replace [INSTRUCTION] with your actual instruction):
'''Given the above, answer the following question: [INSTRUCTION]

  If the question is too complicated or information is missing, you still need to give your best answer but add 
  (1) an additional mark [TOO_HARD] in the next line of your final answer 
  (2) information request or decomposition suggestion in the next line of the [TOO_HARD] mark, 
  in the "answer" entry (for example, 300\n[TOO_HARD]\nSuggestion:...) 
  and justify why you think so in the "thinking" entry'''

(d) The answer to the last sub-task should be the same as the answer to the final question, so that the architecture successfully solves the complex question by solving each of the sub-tasks. 

(2) Given the resulting sub-task 1, sub-task 2, ... sub-task n, design connections between existing blocks to address each of them. 
You should structure the architecture as a multi-layered network. Each existing architecture (or blocks) serves as a node, while connections between them act as edges, forming a structured hierarchy of interactions. Additionally, you must determine the number of layers in the network.

For example, if the existing architectures are 'ab.IO, ab.ChainOfThought, ab.SelfConsistency, ab.MultiAgentDebate, ab.SelfRefine' and you determine that there can be 3 layers. There are 3 resulting sub-tasks from (1): sub-task 1, sub-task 2, sub-task 1, sub-task 3:

Example Setup

Resulting sub-tasks:
sub-task 1, sub-task 2, sub-task 1, sub-task 3

Available architectures:
ab.IO, ab.ChainOfThought, ab.SelfConsistency, ab.MultiAgentDebate, ab.SelfRefine

Network with 3 Layers:

Layer 1: ab.IO  ab.ChainOfThought  ab.SelfConsistency  ab.MultiAgentDebate  ab.SelfRefine  
Layer 2: ab.IO  ab.ChainOfThought  ab.SelfConsistency  ab.MultiAgentDebate  ab.SelfRefine   
Layer 3: ab.IO  ab.ChainOfThought  ab.SelfConsistency  ab.MultiAgentDebate  ab.SelfRefine  

Connection Strategies:

1. Linear Connection: Directly link two blocks to pass information forward.
Example: [ab.ChainOfThought] (address sub-task 1) -> [ab.MultiAgentDebate] (address sub-task 2) (Single connection and exit)

2. Multi-Layer Connection: A block can appear in multiple layers, forming deeper reasoning structures.
Example: [ab.ChainOfThought] (address sub-task 1) -> [ab.MultiAgentDebate] (address sub-task 2) -> [ab.ChainOfThought -> ab.SelfRefine] (address sub-task 3) (ab.ChainOfThought appears in both Layer 1 and Layer 3) (the whole [ab.ChainOfThought -> ab.SelfRefine] is a sub-task architecture that aims to address sub-task 3)

IMPORTANT:

1. Decomposition itself should not be included in the architecture as the question has been decomposed at step (1). Do not assign one block to perform all the sub-tasks (if you put all decomposed sub-tasks into a single instruction for a block, it is very wrong). Instead, assign a different block to address each of the sub-tasks.

2. If your previous attempts in the Discovered architecture archive are incorrect (fitness value equals 0), it means the sub-tasks are still too difficult for the corresponding blocks. Please further decompose the question into easier sub-tasks. 

Your aim is to design an optimal block connection that can perform well on each of the sub-tasks. Your code should implement the existing blocks given in the archive (the 'code' entry of blocks) as-is without modification: Do not propose new blocks or modify existing ones and only change the connections between the given blocks, but block settings like instruction and temperature are allowed to be modified.

# How to implement your agentic systems
You can use the agenticblocks library to create and manage models:

```python
import agenticblocks as ab
model = ab.Model("{MODEL_NAME}", system_prompt="You are a helpful assistant.")
```

Always use the {MODEL_NAME} model when you create a new model.
A model can be prompted in the following way:

```python
model("Your prompt here", temperature=0.7)
```

blocks define how models are prompted and how they interact with each other.
agenticblocks offers some built-in blocks that you can use to create agentic systems:

The built-in blocks can be imported using `from agenticblocks import BlockName`.
Use these blocks as building blocks to solve the sub-tasks.
You can see how built-in blocks are implemented in the discovered architecture archive below.

# Discovered architecture archive
Here is the archive of the discovered architectures:

[ARCHIVE]

The trace shows the output of intermediate blocks and sub-tasks.
The fitness value is the median and 95% Bootstrap Confidence Interval of the correct rate on the given question. Your GOAL is to maximize the "fitness".

# Output Instruction and Example:
You need to output a JSON object with three keys: "thought", "name", and "code".
The first key should be ("thought"), and it should capture your thought process for reconnecting the existing blocks in the archive. 

In the "(thought)" section, include the following:

(1) **Decomposition**: Given the new sub-tasks from (1), form your final decomposition, which should include all of the new sub-tasks. Explain in detail how you decompose the question and how such a decomposition is easy enough that each sub-task is solvable by the given agents, blocks, and architecture. 

(2) **Overall Architecture**: 

Given the resulting sub-task 1, sub-task 2, ... sub-task n, design connections between existing blocks to address each of them. Describe your reasoning and the overall concept behind the connection design and finally detail the implementation steps. All connections must be between existing blocks in the archive and no new blocks can be made. The format must strictly follow: 

(a) Use '->' for connection. For example, 'ChainOfThought (address sub-task 1) (existing block name) -> MultiAgentDebate (address sub-task 2) (another existing block name)' means connect the ChainOfThought block and the MultiAgentDebate block to address sub-task 1 and sub-task 2 correspondingly.
The second key ("name") corresponds to the class name of your next agent architecture. Make sure the name is unique and descriptive of the architecture you are proposing. 
Finally, the last key ("code") corresponds to the exact Python code of the class that you would like to try. You must write a COMPLETE CODE in "code": Your code will be part of the entire project, so please implement complete, reliable, reusable code snippets.

Here is an example of the output format for the next agent architecture:

{{
    "thought": "**Insights:**\nYour insights on what should be the next interesting agent.\n**Overall Idea:**\nyour reasoning and the overall concept behind the agent design.\n**Implementation:**\ndescribe the implementation step by step.",
    "name": "YourCustomAgenticBlock",
    "code": '''class YourCustomAgenticBlock(ab.Block):
    def __init__(self):
        # Your code here

    def __call__(self, question):
        # Your code here
'''
}}

You must use the exact interface used above. You need to specify the instruction, input information, and the required output fields for various LLM agents to do their specific part of the architecture. DON'T try to use functions that don't exist.
Also, it could be helpful to set the LLM’s system_prompt and temperature to further control the LLM’s response. Note that only the question will be passed to the instance of the block class. Everything else needs to be already implemented in the __init__ and __call__ methods of the block class.
DO NOT FORGET to pass the task description to models inside the block class if you think it is needed, otherwise the models will not know about the task.

The key "code" from your output JSON will be saved to a custom_blocks.py file and will be called like this:

```python
from custom_blocks import YourCustomAgenticBlock

question = '''The constellation ... is a bright W-shaped constellation in the northern sky.

A) Centaurus
B) Cygnus
C) Cassiopeia
D) Cepheus
'''

block = YourCustomAgenticBlock()
result = block(question)
```

# Your task
You are deeply familiar with LLM prompting techniques and LLM agent works from the literature. Your goal is to maximize the specified performance metrics by reconnecting the existing blocks in the archive. Do not try to propose new blocks or modify the existing blocks, and only change the connections, but block settings like instruction and temperature are allowed to be modified.
Observe the discovered blocks carefully and think about what insights, lessons, or stepping stones can be learned from them.
You are encouraged to draw inspiration from related agent papers or academic papers from other research areas.
Use the knowledge from the archive and inspiration from academic literature to propose the new connection.

Below is the question to solve:\n\n[QUESTION]
"""

In [13]:
meta_feedback_prompt = """{previous_response}
Carefully review the proposed new architectures ("code" entry), the answer of each sub-task, the agents and the final response in the trace and the fitness ("fitness" entry, i.e., the median and 95% Bootstrap Confidence Interval of the correct rate on the given question), and the 'memory' (previous final answers extracted from their responses and the corresponding fitness scores, in the format of a list of dictionaries final answer: fitness) in all the history user and assistant answers. Reflect on the following points:

1. **Solvable**: Assess whether all sub-tasks are solvable by the corresponding block via checking the output answer of each sub-task. 

- If the answer of the sub-task explicitly contains [TOO_HARD], this clearly states that the task is too hard and needs to be further decomposed. Consider the suggestion given after the [TOO_HARD] (you can see the 'Suggestion:' next to the [TOO_HARD]) and improve your decomposition accordingly. See (a) below for what needs to be ensured.

- If the sub-task answer is incorrect, that means it is not solvable and needs to be improved. It may be because 

(a) the task is still too difficult for the block, then the sub-task need to be further decomposed. 

When proposing new sub-task, make sure 
(i) it is specific and detailed enough to solve and to help achieve the final answer to the given question.
(ii) all information required to answer the question is provided by the previous answers or the instruction. 
(iii) the related sub-tasks' thinking and answers are correctly passed to the current sub-task by including them in the prompt when calling the block. 
(iv) The output should be helpful to solve the next sub-task. Also make sure the sub-task connection is clear by clearly stating 'Based on the output of sub-task i..' in the sub-task instruction

(b) some agents in the block are malfunctioning or the underlying LLM is too weak to solve the sub-task alone. This can be determined by checking the agents' output to decide whether it works as expected. If this is the case, then we need to get rid of the block and use another block in the architecture. There are then two possibilities
    (i) the agent in the block is not optimal to solve the sub-task, and its settings need to be improved (instruction, temperature...)
    (ii) the agent architecture in the block is not optimal, and a new block that combines existing blocks in a different way or with different settings needs to be proposed
Please justify whether it is (a), the decomposition issue, or (b) (and (i) or (ii)), the block and agent issue. It could also be both.

2. **Completeness**: Do the sub-tasks include all necessary information from the original query, so that the aggregation of sub-task responses can effectively yield a comprehensive answer to the user query? Note that while a sub-task might include only part of the necessary information, it is not allowable for any particular piece of critical information to be omitted from all sub-tasks. Make sure the sub-tasks are connected to their prerequisite sub-tasks so that there is enough information to solve them.

3. **Fitness**: Your final goal is to optimize the fitness score after updating the architectures and/or task decomposition based on (1) and (2). The fitness is computed based on the final response. If it is low, it indicates that your final answer is incorrect. In your updated architecture or task decomposition, you need to make sure they will update your final response accordingly. 

And then, you need to improve or revise the implementation, or implement the new proposed architecture based on the reflection.

Your response should add new entries to the previous answers:

"reflection": 
(1) Provide your thoughts on the Solvable, Completeness and Fitness of the architecture and/or task decomposition (which sub-tasks are incorrect? which agents in which blocks are malfunctioning?)
(2) Identify anything inappropriate in the implementation, and suggest improvements (why can the improvements lead to a better final answer? Explain in detail)

"thought": Revise your previous proposal or propose a new architecture if necessary, using the same format as the example response. 

For case (a), Give the 

(1) **Further Decomposition**: Compared to your previous decomposition attempts in the Discovered architecture archive (see the 'thought' entries), how do you further decompose the question? Please give details in the following format: 'last sub-task 1 -> (further decompose to) new sub-task 2, new sub-task 3..., new sub-task n)'. Give a detailed comparison and justify how the new sub-tasks are easier than the old ones. Do not give the answer or a shortcut (an example of a shortcut: 'output exactly the following:..', which is not allowed) in the sub-task instruction in any format, but only do the planning. Justify (1) why the new sub-tasks are solvable and (2) how the sub-tasks can achieve the final answer to the original question.


For case (b), Give the 

(2) **Improved sub-task architecture**: Compared to your last block attempts in the history (last answer), which sub-task architecture needs to be improved? How do you further connect them in a different way so that the resulting sub-task architecture is able to solve the corresponding sub-task? Please give details in the following format: 'last sub-task architecture (what architecture was it?) (aims to address sub-task i) -> (improve to) new sub-task architecture (what is the main difference?)'. Give a detailed comparison and justify how the new connection improves on the old one. Note that the new connection still follows the rules that you need to determine the number of layers as well as the connection, but do not propose new blocks or modify existing ones in the sub-task architecture, and just change the connections among the blocks, but block settings like instruction and temperature are allowed to be modified

For case where the final response is not updated and still the same mistaken answer, Give the

(3) **Updated Subtask Instruction**. Read the 'memory' entry, improve the sub-task instruction so that it can know explicitly that some answers should be avoided. For example, you can add `It is known that (wrong answers, include all wrong answers from the 'memory', i.e., all final answers with 0 fitness score) is not correct` to the last sub-task so that the sub-architecture knows it needs to avoid them.


"name": Provide a name for the revised or new architecture. (Don't put words like "new" or "improved" in the name.)

"code": Update the code entry based on your reflection and thought. Make sure you actually implement all the improvements mentioned in the reflection and thought in this code. Make sure the block only returns the final answer. All the requirements on code are still valid: You must write a COMPLETE CODE in "code": Your code will be part of the entire project (so do not implement any other part), so please implement complete, reliable, reusable code snippets. Do not make any syntactic mistakes. For example, 
if single quote (') is used in string, then double quote (") should be used for the whole string.

This is WRONG
`f'CoT-SC agent ABC, on the purpose of determining changes to Maxwell's'`. 
This is wrong because a single quote is used (Maxwell's) within the string but single quotes are used again for the f-string (f''). This will cause an unterminated string error. To correct it, one should use double quotes for the f-string, i.e., `f"CoT-SC agent ABC, on the purpose of determining changes to Maxwell's"`
"""
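
The `[TOO_HARD]` convention used in the prompts can also be checked programmatically when inspecting a trace. The sketch below assumes the answer layout from the prompt's own example (`300\n[TOO_HARD]\nSuggestion:...`); the helper name `parse_subtask_answer` is hypothetical.

```python
def parse_subtask_answer(answer: str):
    """Split a sub-task answer into (answer, too_hard, suggestion)."""
    marker = "[TOO_HARD]"
    if marker not in answer:
        return answer.strip(), False, ""
    head, _, tail = answer.partition(marker)
    suggestion = tail.strip()
    # Drop the optional "Suggestion:" label in front of the actual text
    if suggestion.lower().startswith("suggestion:"):
        suggestion = suggestion[len("suggestion:"):].strip()
    return head.strip(), True, suggestion


print(parse_subtask_answer("300\n[TOO_HARD]\nSuggestion: split step 2"))
```

A flagged sub-task signals a solvability problem, so it is a natural candidate for further decomposition in the next meta-design iteration.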

The prompts will be used in the following order:

1. The `meta_design_prompt` is passed to the model. This will result in the first suggestion of a new block.
2. The suggested MAS design is then evaluated on the example. If there is a mistake in the code, the meta-agent gets `n_retries` chances to correct it.
3. The trace from the new design and the `meta_feedback_prompt` are passed to the meta-agent to generate a new design.
4. Steps 2 and 3 are repeated `n_generations - 1` times.

This will lead to `n_generations` new block designs and candidate answers.

In [14]:
def extract_json_obj(text: str):
    """Best-effort extraction of a JSON object from a model response."""
    m = re.search(r"```json\s*([\s\S]*?)\s*```", text)
    if m:
        return json.loads(m.group(1))

    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end != -1 and end > start:
        return json.loads(text[start : end + 1])

    return json.loads(text)
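
Model replies often contain literal newlines inside JSON string values (for example in a `"code"` field), which strict JSON parsing rejects with an "Invalid control character" error. A more tolerant variant, sketched here under the hypothetical name `loads_lenient`, passes `strict=False` to `json.loads`:

```python
import json


def loads_lenient(text: str) -> dict:
    """Parse the outermost {...} span, allowing raw control characters in strings."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    # strict=False lets json accept literal newlines/tabs inside string values
    return json.loads(text[start : end + 1], strict=False)


# Hypothetical model reply: the "code" value contains a real newline character
raw = 'Here you go:\n{"name": "Demo", "code": "class Demo:\n    pass"}'
spec = loads_lenient(raw)
print(spec["name"])
```

This does not repair other malformed output such as unescaped quotes, so a retry loop is still needed.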

In [None]:
n_generations = 5
n_retries = 5
prev_response = None  # JSON dump of the most recent successfully evaluated design

meta_agent = ab.Model(MODEL_NAME, keep_history=True, cost_tracking="ignore_errors", system_prompt=system_prompt)

for gen in range(n_generations):
    print("="*30, f"Generation {gen+1} of {n_generations}", "="*30)

    # Step 1 / 3: propose a design
    if prev_response is None:
        # No evaluated design yet (first generation, or earlier ones failed or were skipped)
        prompt = (
            meta_design_prompt.replace("[ARCHIVE]", json.dumps(archive))
            .replace("[QUESTION]", example.input)
        )
        draft = meta_agent(prompt)
    else:
        draft = meta_agent(meta_feedback_prompt.format(previous_response=prev_response))

    # Step 2: evaluate the proposed design (with retries on code errors)
    last_err = None
    for retry in range(n_retries):
        try:
            spec = extract_json_obj(draft)

            # Unless YOLO mode is on, require explicit approval before executing model-generated code
            if not YOLO:
                print(f"\n{spec['name']}:\n{spec['code']}\n")
                if input("Run? [Y/n]: ").strip().lower() == "n":
                    print("Skipped by user.")
                    break

            namespace = {"ab": ab}
            exec(spec["code"], namespace)
            BlockCls = namespace[spec["name"]]
            block = BlockCls()

            with ab.trace() as t:
                out = block(example.input)

            record = {
                "name": spec["name"],
                "thought": spec.get("thought", ""),
                "code": spec["code"],
                "output": out,
                "trace": t.to_dict(),
                "fitness": {  # fitness is 0 here as we cannot use info from the validation set
                    "95% Bootstrap Confidence Interval": "(0.0%, 0.0%)",
                    "median": "0.0%",
                },
            }

            # Add the new block to the archive and prepare it as input for the next iteration
            archive += [record]
            prev_response = json.dumps(record, indent=2)

            print("Successfully found a new block:")
            print(prev_response)

            break

        except Exception as e:  # noqa: BLE001
            last_err = e
            print(f"Error (attempt {retry + 1}/{n_retries}): {e}")
            draft = meta_agent(
                "The following exception occurred when trying to test the block: "
                f"{e}.\n\nTraceback:\n{traceback.format_exc()}\n\n"
                "Return a valid JSON object with keys: thought, name, code. "
                "The name must match the class name defined in code."
            )
            # Drop the error prompt and the failed draft from the message history to save tokens
            meta_agent.messages.pop(-2)
            meta_agent.messages.pop(-2)

    else:
        print(f"Failed to produce a runnable design after {n_retries} retries. Skipping to next generation.")

Error (attempt 1/5): Expecting ',' delimiter: line 1 column 2265 (char 2264)
Error (attempt 2/5): Invalid control character at: line 1 column 3734 (char 3733)
Error (attempt 3/5): No valid cost information available from the API response for model meta-llama/llama-3.3-70b-instruct:free: Usage {'completion_tokens': 1087, 'prompt_tokens': 817, 'total_tokens': 1904, 'completion_tokens_details': {'reasoning_tokens': 0, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0, 'upstream_inference_completions_cost': 0}}, cost 0. Cost must be > 0.0. Set cost_tracking: 'ignore_errors'
Successfully found a new block:
{
  "name": "QuantumDialogProtocol",
  "thought": "To address the question effectively, let's break down the key components and apply the principles of continuous variable quantum information processing. The GHZ state is 

We have now collected candidate answers from the built-in blocks and the blocks designed by the meta-agent. Let's use an LLM to extract a final answer from each output string.
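
The errors logged above (`Expecting ',' delimiter`, `Invalid control character`) show how fragile JSON parsing of free-form model output can be. As a rough illustration only, the sketch below shows one way to pull the first JSON object out of a response using the standard library. The name `parse_first_json_object` is hypothetical; it is not the `extract_json_obj` helper used in this notebook, whose implementation may differ.

```python
import json


def parse_first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in free-form text (hypothetical helper)."""
    # strict=False tolerates raw control characters (e.g. newlines) inside strings
    decoder = json.JSONDecoder(strict=False)
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Sure, here it is: {"choice": "D", "index": 3, "text": null} Hope that helps.'
print(parse_first_json_object(reply))
```

JSON `null` is decoded to Python `None`, which is what a later `is not None` filter would rely on.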

In [36]:
model = ab.Model(MODEL_NAME, keep_history=True, cost_tracking="ignore_errors")
prompt = """You are given the output of an LLM to the following question:

{example}

Return a JSON object with the following fields:

- choice: the choice letter (A, B, C or D), or null
- index: the zero-based index of the chosen answer (0 for A, 1 for B, 2 for C, 3 for D), or null
- text: the text of the chosen answer, or null
- explanation: the explanation of the answer

Use null in the choice, index and text fields if the output does not contain a valid answer.
Do not include any other text in your response.

Here is the output of the LLM:

{output}
"""

for c in archive:
    model.reset_history()
    # Fallback used when no attempt yields a parseable JSON object
    parsed = {key: None for key in ["choice", "index", "text", "explanation"]}
    n_attempts = 3
    for retry in range(n_attempts):
        try:
            raw = model(prompt.format(example=example.input, output=c["output"]))
            # Keep the raw reply separate so a parse failure cannot overwrite the fallback
            parsed = extract_json_obj(raw)
            break
        except Exception as e:  # noqa: BLE001
            print(f"Error (attempt {retry + 1}/{n_attempts}): {e}")
            model(f"Error when attempting to parse the JSON object: {traceback.format_exc()}")

    print(c["name"], "->", parsed)
    c["candidate_answer"] = parsed

ab.IO -> {'choice': 'D', 'index': 3, 'text': '(xa -xb,pa +pb), (xb-xc,pb+pc)', 'explanation': "Given the typical nature of Bell measurements and without further details on the measurement implementation, it's difficult to select the correct option with certainty. However, considering the standard approach to continuous variable quantum information and the typical implementation of Bell measurements, Option D might be considered a plausible representation of measurement outcomes under specific implementations, as it reflects a combination of relative amplitude and phase measurements that could be relevant in certain quantum dialog protocols."}
ab.ChainOfThought -> {'choice': 'C', 'index': 2, 'text': '(xa +xb,pa +pb), (xb+xc,pb+pc)', 'explanation': 'The displacements add to the quadratures (both amplitude and phase), so the correct measurement output reflects the addition of both the amplitude and phase quadratures due to the displacement operators applied for encoding.'}
ab.SelfConsiste

### Stage 3: MAS-Verify

Now that we have collected all candidate answers, MAS-Verify uses the following steps to select a final answer:

1. Filter out invalid answers.
2. Rank the remaining candidate answers by frequency, which biases the selection toward the majority answer.
3. Use an LLM to pick the final answer.
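
Steps 1 and 2 can be tried in isolation on toy data. The sketch below is a simplified stand-in, not the notebook's own cells: the function name `rank_candidates` and the toy candidates are made up for illustration, and it assumes valid choices are the letters A to D, as in the extraction prompt.

```python
from collections import Counter

VALID_CHOICES = {"A", "B", "C", "D"}  # assumption: four-option multiple choice


def rank_candidates(candidates):
    """Drop invalid candidates, then order the rest by how often their choice occurs."""
    valid = [
        c for c in candidates
        if isinstance(c, dict) and c.get("choice") in VALID_CHOICES
    ]
    freq = Counter(c["choice"] for c in valid)
    # sorted() is stable, so candidates with equally frequent choices keep their order
    return sorted(valid, key=lambda c: -freq[c["choice"]])


toy = [{"choice": "C"}, {"choice": "D"}, {"choice": None}, {"choice": "D"}, "unparsed text"]
ranked = rank_candidates(toy)
print([c["choice"] for c in ranked])  # ['D', 'D', 'C']
```

Ranking only reorders the list. The final decision is still left to the LLM in step 3, which sees the most frequent answers first.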


In [None]:
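# filter out invalid answers (unparsed outputs, or a missing / "None" choice)
candidate_answers = [
    c["candidate_answer"]
    for c in archive
    if isinstance(c["candidate_answer"], dict)
    and c["candidate_answer"].get("choice") not in (None, "None")
]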

In [None]:
# sort by frequency, most common choice first (the sort is stable, so ties keep archive order)
import collections

freq = collections.Counter(c.get("choice") for c in candidate_answers)
candidate_answers.sort(key=lambda c: -freq[c.get("choice")])

In [51]:
model = ab.Model(MODEL_NAME, cost_tracking="ignore_errors", system_prompt="Return a JSON object with the following fields: thinking, selection")
verify_prompt = """Given the problem and a list of candidate thinking steps and final answers,
do not solve the task yourself, but look carefully at the reasoning steps and final answers
and select the best answer among the candidates.
In the "thinking" entry, compare the selected answer with each unselected answer one by one,
identify the erroneous steps in each unselected answer and explain in detail why it is incorrect.
In the "selection" entry, give the id of the best answer. \n\n Problem: \n {problem} \n\n Answer List: {answer_list}"""
final_choice = model(verify_prompt.format(problem=example.input, answer_list=json.dumps(candidate_answers, indent=2)))

In [52]:
print(final_choice)

The task requires analyzing the given options and explanations to determine the most accurate answer without directly solving the problem. Let's examine the explanations provided for each choice to identify the correct answer and explain why the other options are incorrect.

### Thinking

1. **Option A: (xa -xb,pa -pb), (xb-xc,pb-pc)**
   - The explanations provided for Option A suggest that the measurement output is derived from the differences in both amplitudes and phases between the modes encoded by different parties. However, this explanation does not fully align with the typical effects of displacement operators in quantum information protocols, especially considering how phases are usually combined.

2. **Option B: (xa +xb,pa -pb), (xb+xc,pb-pc)**
   - There is no explanation provided for Option B in the list, making it difficult to assess its validity directly. However, based on the structure of the options, it seems to mix the addition of amplitudes with the subtraction of pha

In [55]:
# compare to ground truth
print(example.input)

print(f"Correct Answer: {chr(ord('A') + example['correct_index'])}) {example['Correct Answer']}")

In a quantum dialog protocol a 4-mode continuous variable GHZ state is distributed among 3-parties, and a bell measurement is performed on these states, what would be the measurement output if the three parties encode in the following way using a displacement operator D(alpha): 
P1: (xa,pa) 
P2: (xb,pb)
P3: (xc,pc)
Here, (x,p) correspond to the amplitude and phase, such that 
alpha= x +ip, is the argument of displacement operator.
In the scheme, the 2nd and 3rd mode are encoded by P2. The 1st and 4th mode are encoded by P1 and P3.

A) (xa -xb,pa -pb), (xb-xc,pb-pc)
B) (xa +xb,pa -pb), (xb+xc,pb-pc)
C) (xa +xb,pa +pb), (xb+xc,pb+pc)
D) (xa -xb,pa +pb), (xb-xc,pb+pc)

Correct Answer: D) (xa -xb,pa +pb), (xb-xc,pb+pc)
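
To make the comparison explicit, the extracted choices can be checked against the ground-truth index programmatically. This is a minimal sketch: the two choices are copied from the extraction output shown earlier, `correct_index = 3` corresponds to the "Correct Answer: D)" line above, and the helper name `index_to_letter` is made up for illustration.

```python
def index_to_letter(index: int) -> str:
    """Map a zero-based option index to its letter (0 -> 'A', 3 -> 'D')."""
    return chr(ord("A") + index)


# Choices copied from the extraction output printed earlier in this notebook
extracted_choices = {"ab.IO": "D", "ab.ChainOfThought": "C"}
correct_index = 3  # ground truth for this example (option D)

correct_letter = index_to_letter(correct_index)
is_correct = {name: choice == correct_letter for name, choice in extracted_choices.items()}

for name, ok in is_correct.items():
    print(f"{name}: {'correct' if ok else 'incorrect'}")
```

On these two candidates, `ab.IO` matches the ground truth and `ab.ChainOfThought` does not.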


## Conclusion

This notebook demonstrated an implementation of MAS-Zero using the `agenticblocks` framework. Key takeaways:

1. **MAS-Init** provides diverse initial answers using established prompting strategies (Chain-of-Thought, Self-Consistency, Multi-Agent Debate, Self-Refine).

2. **MAS-Evolve** iteratively refines solutions through:
   - task decomposition by the meta-agent
   - feedback on solvability and completeness
   - experience accumulation across iterations

3. **MAS-Verify** filters out invalid candidates, ranks the rest by frequency (a majority-vote bias), and asks an LLM to select the final answer.

Unlike ADAS, MAS-Zero operates on each question independently and does not require a validation dataset for meta-learning. This makes it well suited to scenarios where labeled validation data is unavailable.