An enterprise-grade AI agent powered by Claude, GPT, Gemini and CustomGPT.ai
CustomGPT.ai Research Lab is #1 on the GAIA benchmark: the top-ranked general AI assistant agent in the world, ahead of every major lab.
| Rank | Agent | Organisation | Average score |
|---|---|---|---|
| 🥇 1 | CustomGPT.ai Research Lab v44 | CustomGPT.ai | 93.36% |
| 2 | Co-Sight Pro v1.0.1 | ZTE-AICloud | 93.02% |
| 3 | OPS-Agentic-Search | Alibaba Cloud | 92.36% |
| 4 | Lemon | LR AILab, Lenovo CTO Org | 91.36% |
| 5 | Nemotron-ToolOrchestra-0106 | NVIDIA | 90.37% |
| 6 | SU Zero - Shuqian Series Pro MAX | Suzhou AI Lab / Shuqian Tech | 90.03% |
| 7 | HALO V1217-1 | Microsoft AI Asia - Ads | 89.37% |
Leading entries on the GAIA Leaderboard as of 5 June 2026. CustomGPT.ai holds the #1 position.
CustomGPT Agent is a multi-agent AI system that combines CustomGPT.ai's enterprise RAG platform with frontier-model reasoning and secure sandboxed code execution. It is designed to solve complex, real-world tasks that require multi-step reasoning, web research, file analysis, visual browser navigation, and computation.
The agent ranks #1 on the GAIA Benchmark (General AI Assistants), a challenging benchmark that tests AI systems on tasks requiring real-world tool use, multi-step reasoning, and web browsing, going far beyond simple question answering.
The orchestrator is provider-agnostic: it runs on Claude Opus 4.6/4.7 via the Claude Agent SDK or on OpenAI GPT-5.5 through a parallel orchestrator, with a configurable thinking level per run. For tasks that require seeing a page (Google Street View, 3D model viewers, interactive maps), the agent drives a real headed Chromium browser through a vision-first computer-use tool suite, with Google Gemini 3.1 Pro as a secondary vision provider. Every step is wrapped by a quality & verification plane that re-checks claims against primary sources before any answer is returned.
The system is organized into four planes: a provider-agnostic orchestrator, a set of specialist subagents, a shared MCP tool layer, and a quality & verification plane that wraps every step.
┌─────────────────────────────────────────────────────────────────────────┐
│              ORCHESTRATOR (provider-agnostic, ReAct loop)               │
│           Claude Opus 4.6 / 4.7 (Agent SDK) • OpenAI GPT-5.5            │
│            Configurable thinking level • Plan → Act → Verify            │
│    Experience memory + runtime-error memory injected into the prompt    │
└──┬────────┬────────┬────────┬────────┬─────────┬────────────────────────┘
   │        │        │        │        │         │  delegates via Task tool
┌──▼───┐ ┌──▼───┐ ┌──▼────┐ ┌─▼─────┐ ┌▼──────┐ ┌▼───────┐
│PLAN- │ │LOOKUP│ │COMPUTA│ │ FILE  │ │VISUAL │ │ CRITIC │
│NER   │ │(Opus)│ │-TION  │ │ANALY- │ │ANALYST│ │ (Opus) │
│      │ │web/  │ │(Sonnet│ │SIS    │ │(headed│ │grounded│
│plan  │ │wiki  │ │double-│ │(Sonnet│ │browser│ │re-check│
│      │ │rsrch │ │compute│ │ vision│ │ cu_*) │ │vs src  │
└──┬───┘ └──┬───┘ └──┬────┘ └──┬────┘ └──┬────┘ └─┬──────┘
   │        │        │         │         │        │
┌──▼────────▼────────▼─────────▼─────────▼────────▼───────────────────────┐
│                       MCP TOOL LAYER (~80 tools)                        │
│ Search/Browse: web_search · search_wiki · search_google · query_arxiv   │
│                · ask_perplexity · jina_search · jina_read ·             │
│                browse_webpage · search_archive · find_paper ·           │
│                fetch_via_wayback                                        │
│ Media/Video:   youtube_metadata · youtube_captions · youtube_search ·   │
│                identify_song · extract_video_frames · extract_audio     │
│ Execution:     execute_python · execute_shell · read_file · write_file  │
│                (E2B microVM, Playwright)                                │
│ Vision/CU:     describe_image · image_question · cu_screenshot ·        │
│                cu_click · cu_type · cu_scroll · cu_zoom · cu_navigate   │
│ Knowledge:     search_knowledge_base (CustomGPT.ai RAG)                 │
│ Reasoning/Mem: deep_think · fetch_experiences · submit_answer           │
└──┬──────────────────────────────────────────────────────────────────────┘
   │
┌──▼──────────────────────────────────────────────────────────────────────┐
│              QUALITY & VERIFICATION PLANE (hooks + loops)               │
│  • Quadruple-Verification hooks: 4 cycles on every tool output          │
│      C1 code quality · C2 security · C3 stop-quality · C4 research      │
│  • Self-Correction loop: render → screenshot → analyze → fix (×3)       │
│  • Majority-vote self-consistency across N runs                         │
│  • Budget / loop-detection / audit-log / safety hooks                   │
└─────────────────────────────────────────────────────────────────────────┘
| Component | Technology | Role |
|---|---|---|
| Orchestrator | Claude Opus 4.6 / 4.7 (Agent SDK) or OpenAI GPT-5.5 | Task decomposition, routing, evidence-backed reasoning, and final answer synthesis; model & thinking level selectable per run; prompt augmented with experience + runtime-error memory |
| PLANNER Subagent | Claude / GPT-5.5 | Produces a structured plan (question restatement, strategy, micro-step queue) before execution begins |
| LOOKUP Specialist | Claude Opus + search/browse tools | Web/wiki/arXiv/Perplexity research and fact retrieval, page navigation |
| COMPUTATION Specialist | Claude Sonnet 4.6 + E2B sandbox | Python/shell execution, math, and data processing with double-compute verification |
| FILE ANALYSIS Specialist | Claude Sonnet 4.6 + E2B + vision | Document parsing (PDF, JATS XML, Excel, images, audio), transcription, and file manipulation |
| VISUAL ANALYST Subagent | Claude / Gemini 3.1 Pro + headed Chromium | Vision-first navigation of Street View, 3D viewers, and interactive maps via the cu_* computer-use tools |
| CRITIC Verifier | Claude Opus (real research tools) | Grounded verification: independently re-checks the answer's source quote against primary sources before approval |
| E2B Sandbox | Firecracker microVM + pooled warm starts | Isolated Python/shell execution, Playwright browser automation, file workspace |
The quality & verification plane wraps every step and is the main driver of the agent's accuracy gains:
| Mechanism | What it does |
|---|---|
| Quadruple Verification | Claude Agent SDK hooks running four cycles on tool output (C1 code-quality, C2 security, C3 stop-quality, C4 research-claim verification) with an immutable audit log |
| Grounded CRITIC | Re-checks the answer's cited source against primary sources with real research tools, not self-reflection, before any answer is submitted |
| Self-Correction Loop | Create → render → screenshot → analyze → fix loop (max 3 iterations) that makes generated artifacts executive-ready |
| Self-Consistency Voting | Runs a task N times and selects the majority answer (normalized grouping), with early-stop on consensus |
| Experience & Runtime Memory | Deterministic matchers inject lessons from past successes (experiences.json) and past failures (runtime_memory.json) into the orchestrator prompt |
| Sandbox Pool | Keeps warm E2B microVMs to remove cold-start latency across parallel runs |
| Operational hooks | Per-agent / global tool-call budgets, loop detection, safety gates, and cost tracking |
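The self-consistency row can be illustrated with a minimal sketch. It assumes answers are plain strings, that "normalized grouping" means comparing answers case-insensitively with whitespace collapsed and trailing punctuation stripped, and that early stop fires once one group holds a strict majority of the planned runs. The names `normalize`, `vote`, and `run_once` are hypothetical and are not taken from the project's code.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Assumed grouping rule: lowercase, collapse whitespace, drop trailing punctuation."""
    return " ".join(answer.lower().split()).rstrip(".!?")


def vote(run_once: Callable[[], str], n_runs: int) -> str:
    """Run the task up to n_runs times and return the majority answer.

    Stops early once the leading group holds a strict majority of n_runs,
    because no other answer can overtake it after that point.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    counts = Counter()
    first_seen = {}  # normalized key -> first raw answer seen for that group
    for _ in range(n_runs):
        raw = run_once()
        key = normalize(raw)
        counts[key] += 1
        first_seen.setdefault(key, raw)
        if counts[key] > n_runs // 2:  # consensus reached, skip remaining runs
            return first_seen[key]
    # No strict majority: fall back to the largest group.
    best_key, _ = counts.most_common(1)[0]
    return first_seen[best_key]
```

In this sketch `run_once` stands in for one full agent run; returning the first raw answer of the winning group keeps the original formatting while still grouping on the normalized form.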
- Multi-Provider Orchestrator: The same task can be driven by Claude Opus 4.6/4.7 (via the Claude Agent SDK) or OpenAI GPT-5.5, with a per-run thinking-level knob, enabling apples-to-apples cross-model comparison on identical specialists
- Plan → Act → Verify Loop: A dedicated PLANNER subagent writes a structured plan.md before execution; the orchestrator acts against it and the CRITIC verifies the result, with durable notes.md observations for human visibility
- Vision-First Computer Use: A real headed Chromium browser (persistent cookie profile, 16:9 viewport) is driven through a cu_* tool suite, letting the agent see and operate Street View, 3D model viewers, and interactive maps exactly as a human would
- Dual Vision Stack: Claude vision plus Google Gemini 3.1 Pro as a secondary provider for exhaustive image description and strict image-vs-image verification
- First-Party Media Tooling: YouTube metadata/captions/channel search, song identification, and frame/audio extraction give the agent direct access to video content instead of brittle scraping fallbacks
- Grounded CRITIC Verification: The CRITIC re-checks the answer's cited source against primary sources with real research tools before any answer is submitted
- Quadruple-Verification Hooks: Four independent cycles (code quality, security, stop-quality, research-claim) run automatically on tool output, with an immutable audit trail
- Self-Correction + Self-Consistency: A render → screenshot → analyze → fix loop produces executive-ready artifacts; majority-vote across N runs reduces variance on hard tasks
- Experience & Runtime Memory: Learns across runs by injecting prior successes and prior failures into the orchestrator prompt
- Enterprise RAG Integration: CustomGPT.ai's 40+ API endpoints provide citation-backed knowledge retrieval from proprietary data sources
- Full Observability: A model-I/O logger captures every request/response to JSONL and a progress bus emits real-time task events for live monitoring and replay
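The JSONL logging mentioned under Full Observability can be sketched as follows. This is an illustration only: the record fields (`ts`, `request`, `response`) and the function names `log_model_io` and `replay` are assumptions, not the project's actual logger schema or API.

```python
from __future__ import annotations

import json
import time
from pathlib import Path
from typing import Any, Iterator


def log_model_io(path: Path, request: dict[str, Any], response: dict[str, Any]) -> None:
    """Append one request/response pair to the log as a single JSON line."""
    record = {"ts": time.time(), "request": request, "response": response}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def replay(path: Path) -> Iterator[dict[str, Any]]:
    """Yield logged records in the order they were written, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Appending one self-contained JSON object per line keeps the log readable while a run is still in progress, which is what makes later replay straightforward.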
User Question
       │
       ▼
┌──────────────┐
│   PLANNER    │◄─── Writes plan.md: restate question, choose strategy, queue steps
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Orchestrator │◄─── Routes each step to the right specialist (Claude or GPT-5.5);
└──────┬───────┘     injects matched experience + runtime-error memory
       │
       ├──► LOOKUP:         Search web/wiki/arXiv/Perplexity, browse pages
       ├──► COMPUTATION:    Write & run Python in the E2B sandbox (double-compute)
       ├──► FILE ANALYSIS:  Parse PDFs, spreadsheets, images, audio
       ├──► VISUAL ANALYST: Drive headed Chromium: screenshot, click, zoom, pan
       ├──► CustomGPT KB:   Query the enterprise knowledge base
       │
       ▼  (every tool output passes the 4 verification cycles)
┌──────────────┐
│    CRITIC    │◄─── Re-check cited source against primary sources
└──────┬───────┘
       │
       ▼
┌──────────────────────┐
│  Self-correction ×3  │◄─── Render → screenshot → analyze → fix
└──────┬───────────────┘
       │
       ▼  (optional) majority vote across N runs
Final Answer (with plan.md, notes.md & evidence trace)
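The bounded self-correction step in the flow above can be sketched as a small loop. The three callables (`render`, `analyze`, `fix`) are hypothetical stand-ins for the render, screenshot-analysis, and repair stages; only the cap of three iterations comes from the description above.

```python
from __future__ import annotations

from typing import Callable, Optional

MAX_ITERATIONS = 3  # the flow above caps self-correction at three passes


def self_correct(
    artifact: str,
    render: Callable[[str], bytes],
    analyze: Callable[[bytes], Optional[str]],
    fix: Callable[[str, str], str],
    max_iterations: int = MAX_ITERATIONS,
) -> tuple[str, int]:
    """Render, inspect, and repair an artifact until no issue is reported.

    Returns the final artifact and the number of fixes that were applied.
    """
    fixes = 0
    for _ in range(max_iterations):
        screenshot = render(artifact)    # render the artifact and capture it
        issue = analyze(screenshot)      # None means nothing left to fix
        if issue is None:
            break
        artifact = fix(artifact, issue)  # apply one targeted repair
        fixes += 1
    return artifact, fixes
```

Capping the loop means an artifact that never passes analysis is still returned after the last fix, so a caller that needs a guarantee would have to re-check the result itself.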
CustomGPT.ai is an enterprise AI platform that enables businesses to build custom AI agents powered by their own data, with no hallucinations and full source citations.
Key Capabilities:
- 1,400+ data formats: Ingest websites, documents, helpdesks, videos, and more
- Anti-hallucination technology: Third-party verified #1 for accuracy
- 92 languages supported
- 40+ REST API endpoints: Full programmatic access for agent integration
- SOC-2 Type II & GDPR compliant
- Enterprise-grade security: Data never used for LLM training
In this agent, CustomGPT.ai serves as the knowledge base layer, providing semantic search, ranked context retrieval, and citation-backed answers from proprietary enterprise data.
"Business AI for trusted answers - no hallucinations, no guessing."
| Name | Role |
|---|---|
| Alden Do Rosario | CEO, CustomGPT.ai |
| Name | Role & Contributions |
|---|---|
| Felipe Pires | Technical Product Manager & Project Manager: hiring and team building; task breakdown, distribution, and dispatch; evaluation strategy & methodology; regression/canary validation harness design; competitive benchmark research; multi-model integration and agent architecture direction |
| Developer | Focus |
|---|---|
| Aleksa Stojanović | Research, methodology, and team coordination: identified tooling and infrastructure gaps; built verification pipelines and analysis tooling; supported L3 edge-case tooling directions (Street View, multimedia, Wikipedia API); helped focus team effort on high-leverage work; led the final merge to main |
| Ramzi Mo | YouTube & media MCP tool suite (metadata, captions, song ID, frame/audio extraction) and orchestrator routing for video/audio; reasoning-rule additions (percent-change, formula delegation, count-with-exclusion, ranked extraction, temporal anchoring, multi-source/batch); RSV video-counting pattern; browser/PDF/figure error recovery; CRITIC verification rules (anti-sycophancy, FORMAT A/B, shortcut detection) |
| Hussein Younes | Vision pipeline: self-consistency voting in describe_image, the image_question narrow-question wrapper, describe-image caps / anti-fixation, and Gemini 3.1 vision routing for spatial questions; grounded-CRITIC verification; find_paper + fetch_via_wayback research tools; tool-delegation robustness and error-recovery fixes; quadruple-verification audit logging; and the agent test suite (voting, experience memory, infra-retry, subagents) |
| Arnav Gupta | Multi-provider orchestrator (Claude Opus 4.6/4.7 + GPT-5.5, per-run thinking level); vision-first computer use (headed Chromium + cu_* suite for Street View, 3D viewers, maps); Plan → Act → Verify loop (PLANNER + VISUAL ANALYST subagents, Gemini 3.1 Pro secondary vision); evidence tooling & observability (source-traceable research tools, model-I/O logging, real-time progress bus) |
| Dennis Yavuz | Early engineering on the initial implementation: orchestrator scaffolding, specialist subagents, MCP tool layer, and E2B sandbox integration |
- Anthropic: Claude Opus 4.6 / 4.7, Claude Sonnet 4.6, and the Claude Agent SDK
- OpenAI: GPT-5.5 orchestrator provider
- Google: Gemini 3.1 Pro vision provider
- CustomGPT.ai: Enterprise RAG platform and knowledge base API
- E2B: Secure sandboxed code execution
- Playwright: Headed Chromium browser automation
- GAIA Benchmark: General AI Assistants evaluation framework
@misc{customgpt-agent-2026,
title={CustomGPT Agent: Enterprise AI Agent with RAG-Augmented Multi-Agent Architecture},
author={CustomGPT.ai Research},
year={2026},
url={https://github.com/adorosario/customgpt-agent}
}
Built with Claude Agent SDK and CustomGPT.ai
