# Solving Hanoi Tower with AI

Reading The Illusion of Thinking (from Apple's research) resonated with my own experience of solving problems for fun. Beyond a certain level of complexity, I give up unless I am equipped with the right tools. Perhaps, then, my problem solving is less about thinking about the problem and more about thinking about the tools that solve it. This led me to two questions:

- Does a non-thinking AI solve a puzzle better than a thinking one when provided with the right tools?
- Is a thinking AI able to pick the right tools to solve it?

In this notebook, I explore the first question, building on what I read about MCP (link to medium), the Hanoi algorithm (link to medium), and the Illusion of Thinking article (link to article).


In this experiment, we'll:
1. Set up an MCP (Model Context Protocol) server for puzzle validation
2. Configure an AI agent to solve the Tower of Hanoi puzzle
3. Compare different approaches (with/without MCP, with/without pseudocode)
4. Analyze the agent's problem-solving strategies

The Tower of Hanoi serves as an excellent test case because:
- It has a clear, well-defined solution
- It requires systematic thinking and planning
- It can be validated step-by-step
- It demonstrates both recursive and iterative approaches
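
To make the "validated step-by-step" point concrete, here is a minimal sketch of a move checker. It is not the `validate_solution` method used later in this notebook; the function name `validate_moves` is hypothetical, and it assumes the `[disk_id, source, target]` move format from the pseudocode below, with all disks starting on peg 0 and the goal on peg 2.

```python
def validate_moves(n_disks, moves):
    """Replay moves on three pegs and return (is_valid, reason).

    Hypothetical helper: assumes moves are [disk_id, source, target],
    disks are numbered 1 (smallest) to n_disks (largest), all disks
    start on peg 0, and the goal is to have them all on peg 2.
    """
    # Each peg is a stack; the last element is the top disk.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for step, (disk, source, target) in enumerate(moves, start=1):
        if source not in (0, 1, 2) or target not in (0, 1, 2):
            return False, f"step {step}: peg index out of range"
        # Rule: only the top disk of a peg can be moved.
        if not pegs[source] or pegs[source][-1] != disk:
            return False, f"step {step}: disk {disk} is not on top of peg {source}"
        # Rule: a larger disk may not sit on a smaller one.
        if pegs[target] and pegs[target][-1] < disk:
            return False, f"step {step}: disk {disk} placed on a smaller disk"
        pegs[target].append(pegs[source].pop())
    if pegs[2] != list(range(n_disks, 0, -1)):
        return False, "moves are legal but the goal configuration is not reached"
    return True, "ok"
```

Replaying the moves against an explicit peg state means the first illegal step is reported, rather than only a pass/fail verdict on the whole sequence.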

## Hanoi MCP server

The server exposes a Tower of Hanoi solver, a Python implementation of the following pseudocode:

```
ALGORITHM Solve(n, source, target, auxiliary, moves)
    // n = number of disks to move
    // source = starting peg (0, 1, or 2)
    // target = destination peg (0, 1, or 2)
    // auxiliary = the unused peg (0, 1, or 2)
    // moves = list to store the sequence of moves

    IF n equals 1 THEN
        // Get the top disk from source peg
        disk = the top disk on the source peg
        // Add the move to our list: [disk_id, source, target]
        ADD [disk, source, target] to moves
        RETURN
    END IF

    // Move n-1 disks from source to auxiliary peg
    Solve(n-1, source, auxiliary, target, moves)

    // Move the nth disk from source to target
    disk = the top disk on the source peg
    ADD [disk, source, target] to moves

    // Move n-1 disks from auxiliary to target
    Solve(n-1, auxiliary, target, source, moves)

END ALGORITHM
```
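
For reference, the pseudocode can be sketched in Python as follows. This is an illustration only, not the code in `server/hanoi.py`: the `pegs` argument and the `hanoi_moves` wrapper are assumptions added here so that "the top disk on the source peg" has a concrete state to read from.

```python
def solve(n, source, target, auxiliary, pegs, moves):
    """Recursive solver following the pseudocode; pegs tracks the disk stacks."""
    if n == 1:
        # Base case: move the top disk of the source peg directly.
        disk = pegs[source].pop()
        pegs[target].append(disk)
        moves.append([disk, source, target])
        return
    # Move n-1 disks out of the way, onto the auxiliary peg.
    solve(n - 1, source, auxiliary, target, pegs, moves)
    # Move the nth (largest remaining) disk to the target peg.
    disk = pegs[source].pop()
    pegs[target].append(disk)
    moves.append([disk, source, target])
    # Move the n-1 disks from the auxiliary peg onto the target peg.
    solve(n - 1, auxiliary, target, source, pegs, moves)


def hanoi_moves(n_disks):
    """Hypothetical wrapper: all disks start on peg 0, goal is peg 2."""
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    pegs = [list(range(n_disks, 0, -1)), [], []]
    moves = []
    solve(n_disks, 0, 2, 1, pegs, moves)
    return moves
```

The recursion produces 2^n - 1 moves for n disks, so `hanoi_moves(10)` returns 1023 moves. That count is what a model has to write out correctly when it has no tool to call.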

In [1]:
import multiprocessing
from server.hanoi import run_mcp_server

multiprocessing.Process(target=run_mcp_server).start() 

## Example 

In [2]:
from config.hanoi_config import HanoiConfig, HanoiSolution, HanoiMove
from client.client_hanoi_tower import run_agent
from itertools import product
import pandas as pd
from datetime import datetime
import os
from tqdm import tqdm
import pickle

config = HanoiConfig(n_disks = 2)
config.use_mcp = True
config.add_pseudocode = True
config.model_name = 'o3-mini'# "gpt-4o-mini"
config.server_command = "python"
config.server_args = ["server/hanoi.py"]

[06/20/25 15:43:03] INFO     Starting Hanoi MCP server               hanoi.py:57


In [3]:
# Use await instead of asyncio.run() in Jupyter notebooks
result = await run_agent(config=config)

In [4]:
if "structured_response" in result:
    print("\nStructured solution:")
    solution = result["structured_response"]
    print(f"Total moves: {solution.total_moves}")
    valid_solution = solution.validate_solution(config.n_disks)
    if not valid_solution['is_valid']:
        print(f"Invalid solution: {valid_solution}")
    else:
        print("Valid solution")
else:
    print('No structured solution')


Structured solution:
Total moves: 3
Valid solution


In [5]:
result


{'messages': [HumanMessage(content='\n    I have a puzzle with 2 disks of different sizes with\n    Initial configuration:\n    • Peg 0: 2 (bottom), ... 2, 1 (top)\n    • Peg 1: (empty)\n    • Peg 2: (empty)\n    Goal configuration:\n    • Peg 0: (empty)\n    • Peg 1: (empty)\n    • Peg 2: 2 (bottom), ... 2, 1 (top)\n    Rules:\n    • Only one disk can be moved at a time.\n    • Only the top disk from any stack can be moved.\n    • A larger disk may not be placed on top of a smaller disk.\n    Find the sequence of moves to transform the initial configuration into the goal configuration.\n    ', additional_kwargs={}, response_metadata={}, id='b61bb698-36a1-4cba-90f3-5b699fe4824f'),
  AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_DrIHpiS6xZS7LiWCTRQWsyf5', 'function': {'arguments': '{"n": 2}', 'name': 'hanoi_solver'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 414, 'prompt_tokens': 1007, 'total_tokens': 1421, 'co

## Experiment

In [6]:
saving_result_file = 'data/hanoi_results.csv'
if os.path.exists(saving_result_file):
    completed_results = pd.read_csv(saving_result_file)
    completed_results = completed_results.loc[
        :, ['model_name', 'use_mcp', 'add_pseudocode', 'n_disks', 'ith_try']
    ]
else:
    completed_results = None


In [7]:

disks = [10, 8, 6, 5, 4, 3, 2]
tries = range(15)   
models = ['o4-mini', 'gpt-4.1-mini']
helpers = [
    dict(use_mcp = False, add_pseudocode = False),
    dict(use_mcp = False, add_pseudocode = True),
    dict(use_mcp = True, add_pseudocode = False)
]


for n_disk, model, ith_try, helper in tqdm(product(disks, models, tries, helpers)):
    config = HanoiConfig(n_disks = n_disk)
    config.use_mcp = helper['use_mcp']
    config.add_pseudocode = helper['add_pseudocode']
    config.model_name = model
    config.server_command = "python"
    config.server_args = ["server/hanoi.py"]
    # Skip configurations already recorded in the results CSV, so an
    # interrupted experiment can be resumed without repeating runs.
    if completed_results is not None and not completed_results.loc[
        (completed_results['model_name'] == model) &
        (completed_results['use_mcp'] == helper['use_mcp']) &
        (completed_results['add_pseudocode'] == helper['add_pseudocode']) &
        (completed_results['n_disks'] == n_disk) &
        (completed_results['ith_try'] == ith_try)
    ].empty:
        continue

    run_start = datetime.now()
    result = await run_agent(config)
    run_end = datetime.now()
    # Attach the run timestamps so each pickled result can be matched to its CSV row
    result['run_start'] = run_start
    result['run_end'] = run_end
    with open('data/hanoi_results.pkl', 'ab') as f:
        pickle.dump(result, f)
    has_structured_response = "structured_response" in result
    if has_structured_response:
        solution = result["structured_response"]
        valid_solution = solution.validate_solution(config.n_disks)['is_valid']
    else:
        valid_solution = False

    to_save = pd.DataFrame(
        [dict(
            model_name = config.model_name,
            use_mcp = config.use_mcp,
            add_pseudocode = config.add_pseudocode,
            n_disks = config.n_disks,
            ith_try = ith_try,
            has_structured_response = has_structured_response,
            valid_solution = valid_solution,
            total_moves = solution.total_moves if has_structured_response else None,
            run_start = run_start,
            run_end = run_end
        )]
    )
    to_save.to_csv(saving_result_file, mode='a', header=not os.path.exists(saving_result_file))

0it [00:00, ?it/s]
...
630it [3:42:28, 21.19s/it]
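
Because each run is appended to `data/hanoi_results.pkl` with a separate `pickle.dump` call, the file holds a sequence of pickled objects rather than a single one. A minimal sketch for reading them all back is shown below. The helper name `load_all_pickles` is hypothetical, and unpickling the real results presumably requires the same message classes to be importable as when they were saved.

```python
import pickle


def load_all_pickles(path):
    """Load every object from a file written by repeated pickle.dump calls."""
    results = []
    with open(path, "rb") as f:
        while True:
            try:
                # Each call reads exactly one pickled object and advances the file.
                results.append(pickle.load(f))
            except EOFError:
                # Raised once no further object is left in the file.
                break
    return results
```

For example, `raw_results = load_all_pickles('data/hanoi_results.pkl')` gives one entry per run, which can be matched to the CSV rows through `run_start` and `run_end`. Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.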


## Results

In [8]:
results = pd.read_csv(saving_result_file)
# Calculate success rate for each configuration
success_rates = results.groupby(['model_name', 'use_mcp', 'add_pseudocode', 'n_disks']).agg({
    'valid_solution': ['count', 'sum', lambda x: (x.sum() / x.count() * 100).round(2)]
}).round(2)

# Rename columns for clarity
success_rates.columns = ['total_attempts', 'successful_attempts', 'success_rate_percent']
success_rates = success_rates.reset_index()
success_rates

| | model_name | use_mcp | add_pseudocode | n_disks | total_attempts | successful_attempts | success_rate_percent |
|---|---|---|---|---|---|---|---|
| 0 | gpt-4.1-mini | False | False | 2 | 15 | 15 | 100.0 |
| 1 | gpt-4.1-mini | False | False | 3 | 15 | 15 | 100.0 |
| 2 | gpt-4.1-mini | False | False | 4 | 15 | 4 | 26.67 |
| 3 | gpt-4.1-mini | False | False | 5 | 15 | 13 | 86.67 |
| 4 | gpt-4.1-mini | False | False | 6 | 15 | 0 | 0.0 |
| 5 | gpt-4.1-mini | False | False | 8 | 15 | 0 | 0.0 |
| 6 | gpt-4.1-mini | False | False | 10 | 15 | 0 | 0.0 |
| 7 | gpt-4.1-mini | False | True | 2 | 15 | 15 | 100.0 |
| 8 | gpt-4.1-mini | False | True | 3 | 15 | 15 | 100.0 |
| 9 | gpt-4.1-mini | False | True | 4 | 15 | 15 | 100.0 |


In [9]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots for the three different configurations
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=('Tool ✗, Pseudocode ✗', 'Tool ✗, Pseudocode ✓', 'Tool ✓, Pseudocode ✗'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}]]
)

# Define the three configurations
configs = [
    {'use_mcp': False, 'add_pseudocode': False},
    {'use_mcp': False, 'add_pseudocode': True},
    {'use_mcp': True, 'add_pseudocode': False}
]

# Colors for the models
colors = {'o4-mini': 'blue', 'gpt-4.1-mini': 'red'}

for col, config in enumerate(configs, 1):
    # Filter data for this configuration
    mask = (success_rates['use_mcp'] == config['use_mcp']) & \
           (success_rates['add_pseudocode'] == config['add_pseudocode'])
    config_data = success_rates[mask]
    
    # Plot each model
    for model in ['o4-mini', 'gpt-4.1-mini']:
        model_data = config_data[config_data['model_name'] == model]
        
        fig.add_trace(
            go.Scatter(
                x=model_data['n_disks'],
                y=model_data['success_rate_percent'],
                mode='lines+markers',
                name=f'{model}',
                line=dict(color=colors[model]),
                showlegend=(col == 1),  # Only show legend for first subplot
                hovertemplate='<b>%{fullData.name}</b><br>' +
                            'Disks: %{x}<br>' +
                            'Success Rate: %{y:.1f}%<br>' +
                            '<extra></extra>'
            ),
            row=1, col=col
        )

# Update layout
fig.update_layout(
    title='Hanoi Tower Success Rates by Configuration',
    height=500,
    width=1200,
    showlegend=True,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

# Update axes labels
for i in range(1, 4):
    fig.update_xaxes(title_text="Number of Disks", row=1, col=i)
    fig.update_yaxes(title_text="Success Rate (%)", row=1, col=i)

fig.show()
