# MAS-Zero: Multi-Agent Systems with Zero Supervision

MAS-Zero: Designing Multi-Agent Systems with Zero Supervision is a [paper by Ke et al.](https://arxiv.org/pdf/2505.14996) and one of the first works to explore designing Multi-Agent Systems (MAS) in an unsupervised manner.

Other automated MAS design systems (e.g. ADAS) often rely on validation data to guide the design process and produce a single design for a whole task. MAS-Zero instead designs a solution for each problem instance automatically, without relying on any additional validation data.

![](assets/mas-zero.png)

MAS-Zero is made of 3 steps:

1. **MAS-Init**: Entry point to MAS-Zero in which a set of established human-designed strategies (e.g. Chain of Thought, Self-Consistency, Debate, Self-Refine) are run to generate initial candidate solutions.
2. **MAS-Evolve**: MAS-Evolve is an iterative process which consists of two alternating phases - meta-design and meta-feedback. The goal of this stage is to learn about the strengths of the component agents and refine the solution.
    1. **meta-design**: the meta agent decomposes the question into subtasks and proposes a MAS based on the building blocks and any accumulated experience from prior iterations.
    2. **meta-feedback**: The designed MAS and its output, including all intermediate outputs, are evaluated against the following two criteria:
        * Solvability: Requires that each sub-task is independently and completely solvable.
        * Completeness: Requires that the complete set of sub-tasks covers all necessary information from the original input, ensuring that the answers to the sub-tasks can be aggregated to answer the original task.
    3. The natural language feedback provided in the meta-feedback step is added to an experience library and added to the input of the meta-design stage in subsequent rounds.
3. **MAS-Verify**: Each round of MAS-Evolve produces multiple intermediate outputs and one candidate answer. After multiple rounds of MAS-Evolve the final answer to the task is selected from the candidate answers and the answers from the initial building blocks. In order to do so all answers are first ranked by their frequency, then clearly invalid answers are filtered out and finally the best answer is selected from the remaining candidates.

```
@misc{ke2025maszero,
      title={MAS-Zero: Designing Multi-Agent Systems with Zero Supervision}, 
      author={Zixuan Ke and Austin Xu and Yifei Ming and Xuan-Phi Nguyen and Caiming Xiong and Shafiq Joty},
      year={2025},
      eprint={2505.14996},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14996}, 
}
```

This notebook shows how to implement the ideas from MAS-Zero using the agenticblocks library.

## Setup

### Imports

In [1]:
import io
import random
import urllib.request
import zipfile

import agenticblocks as ab
import pandas as pd

### Model Access

We need to set up access to the language model(s) we want to use.
agenticblocks supports all OpenAI API compatible providers.

You can set the base url and api key via the `OPENAI_API_URL` and `OPENAI_API_KEY` environment variables.

For more details check the [getting started example](01_getting_started.ipynb).
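
Before creating any model it can help to confirm that both variables are actually set. The check below is a minimal sketch in plain Python; the helper name `missing_env_vars` is hypothetical and not part of agenticblocks.

```python
import os

# The two variable names come from the text above; the helper itself is an
# illustrative assumption, not an agenticblocks function.
REQUIRED_VARS = ("OPENAI_API_URL", "OPENAI_API_KEY")


def missing_env_vars(env=os.environ):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Set these environment variables before running the notebook: {missing}")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.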

In [2]:
import dotenv
dotenv.load_dotenv()

#!export OPENAI_API_URL=
#!export OPENAI_API_KEY=

True

In [None]:
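# The same model serves as both the meta agent and the sub-agents.
# OpenAI's gpt-4o is kept as a commented-out alternative; the active setting
# is Llama 3.3 70B Instruct.
#MODEL_NAME = "openai/gpt-4o"
MODEL_NAME = "meta-llama/llama-3.3-70b-instruct:free"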

### Data

Let's download the GPQA (Graduate-Level Google-Proof Q&A) dataset. GPQA consists of challenging multiple choice questions written by domain experts that are designed to be difficult even for experts to answer correctly.

We will use a subset of 50 examples to evaluate MAS-Zero.

In [45]:
with urllib.request.urlopen("https://github.com/idavidrein/gpqa/raw/main/dataset.zip") as response:
    zip_data = io.BytesIO(response.read())

with zipfile.ZipFile(zip_data, 'r') as zf:
    with zf.open('dataset/gpqa_main.csv', pwd=b"deserted-untie-orchid") as csv_file:
        df = pd.read_csv(csv_file)

In [46]:
def format_input(row):
    correct = row['Correct Answer']
    incorrect = [row['Incorrect Answer 1'], row['Incorrect Answer 2'], row['Incorrect Answer 3']]
    choices = [correct] + incorrect
    random.shuffle(choices)
    input_text = f"""{row['Question']}

{chr(10).join([f"{chr(ord('A')+i)}) {choice}" for i, choice in enumerate(choices)])}
"""
    correct_idx = choices.index(correct)
    return pd.Series([input_text, correct_idx], index=["input", "correct_index"])

df[["input", "correct_index"]] = df.apply(format_input, axis=1)
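
Because `random.shuffle` is unseeded, the letter assigned to the correct answer changes on every run. The sketch below shows the same formatting idea on a toy question with a seeded generator so that the result is reproducible. The function name `format_choices` and the toy question are assumptions for illustration only.

```python
import random


def format_choices(question, correct, incorrect, rng):
    """Shuffle the answer options and return (prompt_text, index_of_correct)."""
    choices = [correct, *incorrect]
    rng.shuffle(choices)  # rng is a seeded random.Random, so the order is repeatable
    lines = [f"{chr(ord('A') + i)}) {choice}" for i, choice in enumerate(choices)]
    text = question + "\n\n" + "\n".join(lines) + "\n"
    return text, choices.index(correct)


# Toy data (hypothetical), not taken from GPQA.
text, correct_index = format_choices(
    "What is 2 + 2?", "4", ["3", "5", "22"], random.Random(0)
)
correct_letter = "ABCD"[correct_index]  # map the index to its answer letter
print(text)
print(correct_letter)
```

The index-to-letter mapping in the last step is the same conversion needed later when comparing extracted answer letters against `correct_index`.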

## MAS-Zero Implementation

Let's implement the three stages of MAS-Zero using the agenticblocks library. For demonstration purposes we will use a hard example from the GPQA dataset.

In [59]:
example = df.loc[
    df["Writer's Difficulty Estimate"].str.startswith("Post-graduate level or harder")
    & df["Question Difficulty_EV_1"].str.startswith("Post-graduate level or harder")
    & df["Question Difficulty_EV_2"].str.startswith("Post-graduate level or harder")
].iloc[3]

print(example.input)
print(example.correct_index)
print(example["Correct Answer"])

In a quantum dialog protocol a 4-mode continuous variable GHZ state is distributed among 3-parties, and a bell measurement is performed on these states, what would be the measurement output if the three parties encode in the following way using a displacement operator D(alpha): 
P1: (xa,pa) 
P2: (xb,pb)
P3: (xc,pc)
Here, (x,p) correspond to the amplitude and phase, such that 
alpha= x +ip, is the argument of displacement operator.
In the scheme, the 2nd and 3rd mode are encoded by P2. The 1st and 4th mode are encoded by P1 and P3.

A) (xa -xb,pa -pb), (xb-xc,pb-pc)
B) (xa +xb,pa +pb), (xb+xc,pb+pc)
C) (xa +xb,pa -pb), (xb+xc,pb-pc)
D) (xa -xb,pa +pb), (xb-xc,pb+pc)

3
(xa -xb,pa +pb), (xb-xc,pb+pc)


### MAS-Init

First we run a set of established building blocks (Chain of Thought, Self-Consistency, Multi-Agent Debate, Self-Refine) to generate initial candidate solutions. We use the agenticblocks built-ins for this.

In [60]:
building_blocks = [ab.IO, ab.ChainOfThought, ab.SelfConsistency, ab.MultiAgentDebate, ab.SelfRefine]

In [63]:
responses = []
for block_class in building_blocks:
    model = ab.Model(MODEL_NAME, cost_tracking="ignore_errors")
    if block_class == ab.MultiAgentDebate:
        block = block_class(agents=[ab.Model(MODEL_NAME, cost_tracking="ignore_errors") for _ in range(3)])
    else:
        block = block_class(model)
    responses += [block(example.input)]

In [65]:
responses[0]

{'content': "To solve this, let's understand how a Bell measurement works in the context of continuous variable quantum information, particularly with a 4-mode GHZ state distributed among three parties. The GHZ state for continuous variables can be represented in a way that involves correlations between the quadratures (amplitude and phase) of the modes. A Bell measurement typically involves measuring the difference in amplitude and the sum of phase quadratures between two modes.\n\nGiven the encoding:\n- Party 1 (P1) encodes on modes 1 and 4 with \\( \\alpha_1 = x_a + ip_a \\) and \\( \\alpha_4 = x_c + ip_c \\), respectively.\n- Party 2 (P2) encodes on modes 2 and 3 with \\( \\alpha_2 = x_b + ip_b \\) and \\( \\alpha_3 = x_c + ip_c \\), respectively.\n\nHowever, the question states that P2 encodes the 2nd and 3rd modes, and P1 and P3 encode the 1st and 4th modes, which might have been a source of confusion in the initial setup. Let's clarify and correct this based on standard protocol

In [66]:
responses[1]

{'content': "To determine the correct answer, let's break down the process step by step, focusing on how the encoding and measurement process works in a continuous variable quantum system, particularly with a 4-mode continuous variable GHZ state and the use of displacement operators for encoding.\n\n1. **Understanding the GHZ State**: A 4-mode continuous variable GHZ state is an entangled state among four modes, which can be thought of as a superposition of states where all modes are correlated in their quadratures (amplitude and phase). This state is a resource for quantum information protocols, including quantum teleportation and superdense coding.\n\n2. **Encoding with Displacement Operators**: The displacement operator \\(D(\\alpha)\\) is used to encode information onto the modes. Here, \\(\\alpha = x + ip\\), where \\(x\\) and \\(p\\) represent the amplitude and phase quadratures, respectively. Each party (P1, P2, P3) encodes their information by displacing their respective modes.

In [67]:
responses[2]

{'content': "A nice question about continuous variable quantum information processing!\n\nTo determine the measurement output, let's first understand how the displacement operators act on the 4-mode GHZ state. The GHZ state can be written as:\n\n|GHZ⟩ = (|0000⟩ + |1111⟩) / √2\n\nwhere |0⟩ and |1⟩ are the vacuum states of the four modes.\n\nThe displacement operators act on the modes as follows:\n\nD1(xa, pa)  D2(xb, pb)  D3(xc, pc)  D4(xd, pd) |GHZ⟩\n\nSince P1 encodes the 1st and 4th modes, and P2 encodes the 2nd and 3rd modes, we have:\n\nD1(xa, pa)  D2(xb, pb)  D3(xb, pb)  D4(xc, pc) |GHZ⟩\n\nThe Bell measurement is a joint measurement on two modes, which can be represented by the operators:\n\n(X1 - X2, P1 + P2) and (X2 - X3, P2 + P3)\n\nwhere Xk and Pk are the quadrature operators for mode k.\n\nApplying these operators to the encoded state, we get:\n\n(X1 - X2, P1 + P2) = (xa - xb, pa + pb)\n(X2 - X3, P2 + P3) = (xb - xc, pb + pc)\n\nHowever, note that the question asks for the o

In [12]:
df.input.iloc[0]

"A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?\n\nA) polyA tail\nB) R-loops\nC) antisense\nD) lariat\n"



With the initial candidates from MAS-Init in hand, the two remaining stages are:

2. **MAS-Evolve**: An iterative meta-design and meta-feedback loop where:
   - The meta-agent decomposes the question into sub-tasks and designs a MAS
   - The MAS output is evaluated for **solvability** (each sub-task is independently solvable) and **completeness** (sub-tasks cover all necessary information)
   - Feedback is accumulated in an experience library for subsequent iterations

3. **MAS-Verify**: Select the final answer from all candidates by ranking by frequency and filtering invalid answers.


Let's define some prompts for the MAS-Zero stages. These are adapted from the [MAS-Zero repository](https://github.com/SalesforceAIResearch/MAS-Zero).

### Building Blocks

First, let's define the building blocks that MAS-Init will use. These are the established prompting strategies:


In [None]:
import inspect

# Define building blocks
building_blocks = [ab.IO, ab.ChainOfThought, ab.SelfConsistency, ab.MultiAgentDebate, ab.SelfRefine]

# Show the source of each block for reference
for block in building_blocks:
    print(f"=== {block.__name__} ===")
    print(inspect.getsource(block))
    print()


### Stage 1: MAS-Init

MAS-Init runs each building block on the question to generate initial candidate answers. These serve as a baseline and also provide diverse perspectives for the MAS-Evolve stage.


In [None]:
import re


def extract_answer(response_text):
    """Extract the answer letter (A, B, C, or D) from a response."""
    # Look for patterns like "A)", "A.", "Answer: A", etc.

    # Try to find the last mention of a letter choice
    for line in reversed(response_text.strip().split('\n')):
        for letter in ['A', 'B', 'C', 'D']:
            if f"{letter})" in line or f"{letter}." in line:
                return letter
            # Check for "Answer: X" pattern
            if re.search(rf'\b{letter}\b', line) and ('answer' in line.lower() or 'choice' in line.lower()):
                return letter
    
    # Fallback: look for any standalone letter
    for letter in ['A', 'B', 'C', 'D']:
        if f"{letter})" in response_text or f" {letter} " in response_text:
            return letter
    
    return None

def mas_init(question_formatted, model_name):
    """
    MAS-Init: Run building blocks to generate initial candidate answers.
    Returns a list of (block_name, response_text, extracted_answer) tuples.
    """
    candidates = []
    
    for block_class in building_blocks:
        try:
            model = ab.Model(model_name)
            
            if block_class == ab.MultiAgentDebate:
                block = block_class(agents=[ab.Model(model_name) for _ in range(3)])
            elif block_class == ab.SelfConsistency:
                block = block_class(ab.ChainOfThought(model), n=3, temperature=0.7, aggregator=model)
            else:
                block = block_class(model)
            
            prompt = f"{question_formatted}\n\nThe last line of your answer should be the correct choice, e.g., A)"
            result = block(prompt)
            response_text = result["content"]
            answer = extract_answer(response_text)
            
            candidates.append({
                'block': block_class.__name__,
                'response': response_text,
                'answer': answer
            })
        except Exception as e:
            print(f"Error running {block_class.__name__}: {e}")
            candidates.append({
                'block': block_class.__name__,
                'response': str(e),
                'answer': None
            })
    
    return candidates
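
The heuristic in `extract_answer` accepts any line that contains a letter followed by `)` or `.`, which is permissive. Since the prompt asks the model to put its choice on the last line, a stricter alternative is to inspect only the final non-empty line. The sketch below is one possible variant; `extract_final_choice` and its pattern are assumptions for illustration, not part of the MAS-Zero repository or agenticblocks.

```python
import re

# Optional "Answer:" / "The final answer is" prefix (case-insensitive), then an
# uppercase choice letter that is followed by ")", ".", ":" or the end of line.
FINAL_LINE = re.compile(
    r"^(?i:(?:the\s+)?(?:final\s+)?answer(?:\s+is)?\s*:?\s*)?\(?([A-D])(?:[).:]|\s*$)"
)


def extract_final_choice(response_text):
    """Return the choice letter on the last non-empty line, or None."""
    lines = [ln.strip().strip("*").strip() for ln in response_text.splitlines()]
    lines = [ln for ln in lines if ln]
    if not lines:
        return None
    match = FINAL_LINE.match(lines[-1])
    return match.group(1) if match else None


print(extract_final_choice("Some reasoning...\nD)"))
print(extract_final_choice("A nice question, but I cannot decide."))
```

The trade-off is recall: a response that ignores the formatting instruction yields `None` here, so the candidate is simply dropped during verification instead of contributing a guessed letter.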


### Stage 2: MAS-Evolve

MAS-Evolve iteratively refines the solution through:
1. **Meta-Design**: The meta-agent decomposes the question into sub-tasks and proposes a solution approach
2. **Meta-Feedback**: Evaluates the design for solvability and completeness, providing feedback for the next iteration


In [None]:
import json
import traceback

# Meta-Design prompt template
META_DESIGN_PROMPT = """You are an expert problem solver. Your task is to analyze a complex question and design a multi-agent approach to solve it.

## Question
{question}

## Initial Candidate Answers from Building Blocks
{init_candidates}

## Experience from Previous Iterations
{experience}

## Your Task
1. Analyze the question and identify what knowledge or reasoning steps are needed
2. Decompose the question into 2-4 sub-tasks that together can answer the main question
3. For each sub-task, explain what needs to be determined
4. Provide a final answer based on your analysis

Output your response in the following JSON format:
```json
{{
    "analysis": "Your analysis of the question and what makes it challenging",
    "sub_tasks": [
        {{"id": 1, "task": "Description of sub-task 1", "reasoning": "Why this sub-task is needed"}},
        {{"id": 2, "task": "Description of sub-task 2", "reasoning": "Why this sub-task is needed"}}
    ],
    "sub_task_answers": [
        {{"id": 1, "answer": "Answer to sub-task 1"}},
        {{"id": 2, "answer": "Answer to sub-task 2"}}
    ],
    "final_reasoning": "How the sub-task answers combine to answer the main question",
    "final_answer": "A, B, C, or D"
}}
```"""

# Meta-Feedback prompt template  
META_FEEDBACK_PROMPT = """You are an evaluator for a multi-agent problem-solving system. Evaluate the following solution approach:

## Original Question
{question}

## Proposed Solution
{solution}

## Evaluation Criteria

1. **Solvability**: Is each sub-task independently solvable? Can the sub-tasks be answered with available knowledge?

2. **Completeness**: Do the sub-tasks cover all necessary information from the original question? Can the answers to all sub-tasks be combined to answer the original question?

Provide your evaluation in JSON format:
```json
{{
    "solvability_score": 1-10,
    "solvability_feedback": "Specific feedback on solvability issues",
    "completeness_score": 1-10,
    "completeness_feedback": "Specific feedback on completeness issues",
    "overall_feedback": "Suggestions for improvement",
    "is_satisfactory": true/false
}}
```"""

def mas_evolve(question_formatted, init_candidates, model_name, num_iterations=2):
    """
    MAS-Evolve: Iteratively refine the solution through meta-design and meta-feedback.
    """
    meta_agent = ab.Model(model_name, keep_history=True)
    feedback_agent = ab.Model(model_name)
    
    # Format initial candidates for the prompt
    init_summary = "\n".join([
        f"- {c['block']}: {c['answer']} - {c['response'][:200]}..."
        for c in init_candidates if c['answer']
    ])
    
    experience_library = []
    evolve_candidates = []
    
    for iteration in range(num_iterations):
        # Format experience
        experience_str = "\n".join(experience_library) if experience_library else "No previous experience."
        
        # Meta-Design phase
        design_prompt = META_DESIGN_PROMPT.format(
            question=question_formatted,
            init_candidates=init_summary,
            experience=experience_str
        )
        
        try:
            design_response = meta_agent(design_prompt)
            design_text = design_response["content"]
            
            # Try to parse JSON from response
            try:
                # Extract JSON from response
                json_start = design_text.find('{')
                json_end = design_text.rfind('}') + 1
                if json_start >= 0 and json_end > json_start:
                    design_json = json.loads(design_text[json_start:json_end])
                    answer = design_json.get('final_answer', '').strip().upper()
                    if answer and answer[0] in 'ABCD':
                        answer = answer[0]
                    else:
                        answer = extract_answer(design_text)
                else:
                    design_json = {}
                    answer = extract_answer(design_text)
            except json.JSONDecodeError:
                design_json = {}
                answer = extract_answer(design_text)
            
            evolve_candidates.append({
                'iteration': iteration + 1,
                'response': design_text,
                'answer': answer,
                'design': design_json
            })
            
            # Meta-Feedback phase
            feedback_prompt = META_FEEDBACK_PROMPT.format(
                question=question_formatted,
                solution=design_text
            )
            
            feedback_response = feedback_agent(feedback_prompt)
            feedback_text = feedback_response["content"]
            
            # Parse feedback
            try:
                json_start = feedback_text.find('{')
                json_end = feedback_text.rfind('}') + 1
                if json_start >= 0 and json_end > json_start:
                    feedback_json = json.loads(feedback_text[json_start:json_end])
                else:
                    feedback_json = {}
            except json.JSONDecodeError:
                feedback_json = {}
            
            # Add to experience library
            experience_entry = f"Iteration {iteration + 1}: {feedback_json.get('overall_feedback', feedback_text[:200])}"
            experience_library.append(experience_entry)
            
            # Check if satisfactory
            if feedback_json.get('is_satisfactory', False):
                break
                
        except Exception as e:
            print(f"Error in MAS-Evolve iteration {iteration + 1}: {e}")
            traceback.print_exc()
    
    return evolve_candidates, experience_library
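
`mas_evolve` repeats the same JSON extraction logic for the design and the feedback response. A small helper could factor this out and also prefer fenced `json` blocks when the model emits them. The following is a sketch under that assumption; `parse_json_reply` is a hypothetical name, and the fence marker is built with `"`" * 3` only to avoid nesting literal fences inside this example.

```python
import json
import re

FENCE_MARK = "`" * 3  # three backticks, built programmatically
FENCED_BLOCK = re.compile(
    FENCE_MARK + r"(?:json)?\s*(.*?)" + FENCE_MARK, re.DOTALL
)


def parse_json_reply(text):
    """Return the first JSON object found in a model reply, or {} if none parses."""
    # Prefer the contents of fenced blocks, then fall back to the outermost braces.
    candidates = [m.group(1) for m in FENCED_BLOCK.finditer(text)]
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}


print(parse_json_reply('Sure! {"final_answer": "B"} Hope that helps.'))
```

Returning an empty dict on failure matches how `mas_evolve` already treats unparseable responses, so the `.get(...)` lookups that follow would keep working unchanged.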


### Stage 3: MAS-Verify

MAS-Verify selects the final answer from all candidates by:
1. Collecting all answers from MAS-Init and MAS-Evolve
2. Ranking by frequency (majority voting)
3. Filtering out invalid answers
4. Selecting the most common valid answer


In [None]:
from collections import Counter

VALID_ANSWERS = {'A', 'B', 'C', 'D'}


def mas_verify(init_candidates, evolve_candidates):
    """
    MAS-Verify: Select the final answer from all candidates using majority voting.
    """
    # Collect all valid answers from MAS-Init and MAS-Evolve.
    # Set membership is used because a substring test such as 'AB' in 'ABCD'
    # would also accept multi-letter strings.
    all_answers = [
        c['answer']
        for c in list(init_candidates) + list(evolve_candidates)
        if c['answer'] in VALID_ANSWERS
    ]

    if not all_answers:
        return None, {}

    # Count frequencies
    answer_counts = Counter(all_answers)

    # Get the most common answer
    final_answer = answer_counts.most_common(1)[0][0]

    return final_answer, dict(answer_counts)
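
With only a handful of candidates, ties between answers are common. `Counter.most_common` orders equal counts by first appearance, so a tie is silently resolved in favour of whichever answer was collected first. The sketch below makes this visible by reporting whether the winner was tied. The names `rank_answers` and `pick_answer` are hypothetical, and how to break a tie is left open as a design decision.

```python
from collections import Counter

VALID_CHOICES = {"A", "B", "C", "D"}


def rank_answers(answers):
    """Return (answer, count) pairs, most frequent first, after dropping invalid entries."""
    valid = [a for a in answers if a in VALID_CHOICES]
    return Counter(valid).most_common()


def pick_answer(answers):
    """Return (winner, was_tied); winner is None when no valid answer exists."""
    ranked = rank_answers(answers)
    if not ranked:
        return None, False
    top_count = ranked[0][1]
    was_tied = sum(1 for _, count in ranked if count == top_count) > 1
    return ranked[0][0], was_tied


print(pick_answer(["D", "B", "D", None, "AB", "B"]))
```

When `was_tied` is true, one option is to run an extra MAS-Evolve round or ask the model to choose between the tied answers. Both are assumptions about possible extensions, not something this notebook implements.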


### Complete MAS-Zero Pipeline

Now we combine all three stages into a complete MAS-Zero function:


In [None]:
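# Evaluation subset: 50 questions sampled from the GPQA frame prepared above.
# random_state=0 is an arbitrary choice that keeps the sample reproducible.
questions_data = [
    {
        'question': row['input'],
        'correct_idx': int(row['correct_index']),
        'correct_letter': 'ABCD'[int(row['correct_index'])],
    }
    for _, row in df.sample(n=50, random_state=0).iterrows()
]


def format_question(question_data):
    """Return the question text; format_input already embedded the answer choices."""
    return question_data['question']


def mas_zero(question_data, model_name, evolve_iterations=2, verbose=False):
    """
    Complete MAS-Zero pipeline for a single question.
    
    Args:
        question_data: Dict with 'question' (text including the lettered choices),
            'correct_idx', and 'correct_letter'
        model_name: Name of the model to use
        evolve_iterations: Number of MAS-Evolve iterations
        verbose: Whether to print detailed output
    
    Returns:
        Dict with final_answer, correct_answer, is_correct, and detailed results
    """
    question_formatted = format_question(question_data)
    correct_answer = question_data['correct_letter']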
    
    if verbose:
        print("=" * 60)
        print("QUESTION:")
        print(question_formatted)
        print(f"\nCorrect answer: {correct_answer}")
        print("=" * 60)
    
    # Stage 1: MAS-Init
    if verbose:
        print("\n--- Stage 1: MAS-Init ---")
    init_candidates = mas_init(question_formatted, model_name)
    
    if verbose:
        for c in init_candidates:
            print(f"{c['block']}: {c['answer']}")
    
    # Stage 2: MAS-Evolve
    if verbose:
        print("\n--- Stage 2: MAS-Evolve ---")
    evolve_candidates, experience = mas_evolve(
        question_formatted, init_candidates, model_name, 
        num_iterations=evolve_iterations
    )
    
    if verbose:
        for c in evolve_candidates:
            print(f"Iteration {c['iteration']}: {c['answer']}")
    
    # Stage 3: MAS-Verify
    if verbose:
        print("\n--- Stage 3: MAS-Verify ---")
    final_answer, answer_counts = mas_verify(init_candidates, evolve_candidates)
    
    if verbose:
        print(f"Answer distribution: {answer_counts}")
        print(f"Final answer: {final_answer}")
        print(f"Correct answer: {correct_answer}")
        print(f"Is correct: {final_answer == correct_answer}")
    
    return {
        'final_answer': final_answer,
        'correct_answer': correct_answer,
        'is_correct': final_answer == correct_answer,
        'init_candidates': init_candidates,
        'evolve_candidates': evolve_candidates,
        'experience': experience,
        'answer_counts': answer_counts
    }


## Evaluation

Let's run MAS-Zero on our sample of GPQA questions and measure performance.


In [None]:
from tqdm.auto import tqdm

# Run MAS-Zero on all questions
results = []

for i, q in enumerate(tqdm(questions_data, desc="Evaluating MAS-Zero")):
    print(f"\n{'='*60}")
    print(f"Question {i+1}/{len(questions_data)}")
    
    result = mas_zero(q, MODEL_NAME, evolve_iterations=2, verbose=True)
    results.append(result)
    
    print(f"\nRunning accuracy: {sum(r['is_correct'] for r in results)}/{len(results)} = {sum(r['is_correct'] for r in results)/len(results)*100:.1f}%")


### Results Analysis


In [None]:
# Calculate overall accuracy
total_correct = sum(r['is_correct'] for r in results)
total_questions = len(results)
overall_accuracy = total_correct / total_questions * 100

print(f"\n{'='*60}")
print("FINAL RESULTS")
print(f"{'='*60}")
print(f"Total questions: {total_questions}")
print(f"Correct answers: {total_correct}")
print(f"Overall accuracy: {overall_accuracy:.1f}%")

# Analyze performance by stage
print(f"\n--- Performance by Stage ---")

# MAS-Init accuracy (what would we get with just the building blocks?)
init_accuracies = {}
for block in building_blocks:
    block_name = block.__name__
    block_correct = 0
    block_total = 0
    for r in results:
        for c in r['init_candidates']:
            if c['block'] == block_name:
                block_total += 1
                if c['answer'] == r['correct_answer']:
                    block_correct += 1
    if block_total > 0:
        init_accuracies[block_name] = block_correct / block_total * 100
        print(f"{block_name}: {block_correct}/{block_total} = {init_accuracies[block_name]:.1f}%")

# MAS-Evolve accuracy (final iteration)
evolve_correct = 0
evolve_total = 0
for r in results:
    if r['evolve_candidates']:
        final_evolve = r['evolve_candidates'][-1]
        evolve_total += 1
        if final_evolve['answer'] == r['correct_answer']:
            evolve_correct += 1

if evolve_total > 0:
    print(f"\nMAS-Evolve (final iteration): {evolve_correct}/{evolve_total} = {evolve_correct/evolve_total*100:.1f}%")

print(f"\nMAS-Zero (full pipeline): {total_correct}/{total_questions} = {overall_accuracy:.1f}%")


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy comparison by method
ax1 = axes[0]
methods = list(init_accuracies.keys()) + ['MAS-Evolve', 'MAS-Zero']
accuracies = list(init_accuracies.values())
if evolve_total > 0:
    accuracies.append(evolve_correct/evolve_total*100)
else:
    accuracies.append(0)
accuracies.append(overall_accuracy)

colors = plt.cm.tab10(np.linspace(0, 1, len(methods)))
bars = ax1.bar(range(len(methods)), accuracies, color=colors)
ax1.set_xticks(range(len(methods)))
ax1.set_xticklabels(methods, rotation=45, ha='right')
ax1.set_ylabel('Accuracy (%)')
ax1.set_title('Accuracy Comparison: Building Blocks vs MAS-Zero')
ax1.set_ylim(0, 100)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             f'{acc:.1f}%', ha='center', va='bottom', fontsize=9)

# Plot 2: Running accuracy over questions
ax2 = axes[1]
running_acc = []
correct_count = 0
for i, r in enumerate(results):
    if r['is_correct']:
        correct_count += 1
    running_acc.append(correct_count / (i + 1) * 100)

ax2.plot(range(1, len(results) + 1), running_acc, 'b-o', markersize=4)
ax2.axhline(y=25, color='r', linestyle='--', label='Random baseline (25%)')
ax2.set_xlabel('Question Number')
ax2.set_ylabel('Cumulative Accuracy (%)')
ax2.set_title('MAS-Zero Running Accuracy')
ax2.set_ylim(0, 100)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## Conclusion

This notebook demonstrated an implementation of MAS-Zero using the agenticblocks framework. Key takeaways:

1. **MAS-Init** provides diverse initial answers using established prompting strategies (Chain of Thought, Self-Consistency, Multi-Agent Debate, Self-Refine)

2. **MAS-Evolve** iteratively refines solutions through:
   - Task decomposition by the meta-agent
   - Feedback on solvability and completeness
   - Experience accumulation across iterations

3. **MAS-Verify** uses majority voting to select the final answer from all candidates

Unlike ADAS, MAS-Zero operates on each question independently without requiring a validation dataset for meta-learning. This makes it particularly suited for scenarios where labeled validation data is unavailable.

For better results, consider:
- Increasing the number of MAS-Evolve iterations
- Using more diverse building blocks in MAS-Init
- Fine-tuning the meta-design and meta-feedback prompts for your specific domain
