adorosario/customgpt-agent

CustomGPT Agent

An enterprise-grade AI agent powered by Claude, GPT, Gemini and CustomGPT.ai

CustomGPT.ai · Claude Agent SDK · E2B Sandbox · Playwright · GAIA Benchmark

Multi-Provider Orchestrator · Claude Opus 4.6 / 4.7 · GPT-5.5 · Gemini 3.1 Pro

GAIA Leaderboard #1 · Average score 93.36%


πŸ† Results β€” #1 on the GAIA Leaderboard

CustomGPT.ai Research Lab is #1 on the GAIA benchmark β€” the top-ranked general AI assistant agent in the world, ahead of every major lab.

| Rank | Agent | Organisation | Average score |
|---|---|---|---|
| 🥇 1 | CustomGPT.ai Research Lab v44 | CustomGPT.ai | 93.36% |
| 2 | Co-Sight Pro v1.0.1 | ZTE-AICloud | 93.02% |
| 3 | OPS-Agentic-Search | Alibaba Cloud | 92.36% |
| 4 | Lemon | LR AILab, Lenovo CTO Org | 91.36% |
| 5 | Nemotron-ToolOrchestra-0106 | NVIDIA | 90.37% |
| 6 | SU Zero – Shuqian Series Pro MAX | Suzhou AI Lab / Shuqian Tech | 90.03% |
| 7 | HALO V1217-1 | Microsoft AI Asia – Ads | 89.37% |

Leading entries on the GAIA Leaderboard as of 5 June 2026. CustomGPT.ai holds the #1 position.


Overview

CustomGPT Agent is a multi-agent AI system that combines CustomGPT.ai's enterprise RAG platform with frontier-model reasoning and secure sandboxed code execution. It is designed to solve complex, real-world tasks that require multi-step reasoning, web research, file analysis, visual browser navigation, and computation.

The agent ranks #1 on the GAIA Benchmark (General AI Assistants), a challenging benchmark that tests AI systems on real-world tool use, multi-step reasoning, and web browsing, well beyond simple question answering.

The orchestrator is provider-agnostic: it runs on Claude Opus 4.6/4.7 via the Claude Agent SDK or on OpenAI GPT-5.5 through a parallel orchestrator, with a configurable thinking level per run. For tasks that require seeing a page, such as Google Street View, 3D model viewers, and interactive maps, the agent drives a real headed Chromium browser through a vision-first computer-use tool suite, with Google Gemini 3.1 Pro as a secondary vision provider. Every step is wrapped by a quality & verification plane that re-checks claims against primary sources before any answer is returned.
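
As a rough illustration of what "provider-agnostic" with a per-run thinking level can mean in code, the sketch below defines one interface and swaps stub backends behind it. Everything here is hypothetical: the `Provider` protocol, `StubProvider`, `run_task`, and the three thinking-level names are assumptions for illustration, and no real model SDK is called.

```python
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Hypothetical interface that every orchestrator backend would satisfy."""
    name: str

    def complete(self, prompt: str, thinking_level: str) -> str: ...


@dataclass
class StubProvider:
    """Stand-in for a real model client; it only echoes its inputs."""
    name: str

    def complete(self, prompt: str, thinking_level: str) -> str:
        return f"[{self.name}/{thinking_level}] {prompt}"


# Assumed level names, for illustration only.
THINKING_LEVELS = {"low", "medium", "high"}


def run_task(provider: Provider, question: str, thinking_level: str = "medium") -> str:
    """Run one task against whichever provider was selected for this run."""
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {thinking_level}")
    return provider.complete(question, thinking_level)


PROVIDERS = {
    "claude": StubProvider("claude-opus"),
    "gpt": StubProvider("gpt-5.5"),
}
```

Because the caller only depends on `complete`, the same question can be sent to either backend, which is the property that makes cross-model comparison on identical specialists possible.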


Architecture

The system is organized into four planes: a provider-agnostic orchestrator, a set of specialist subagents, a shared MCP tool layer, and a quality & verification plane that wraps every step.

┌──────────────────────────────────────────────────────────────────────────┐
│              ORCHESTRATOR  (provider-agnostic, ReAct loop)               │
│     Claude Opus 4.6 / 4.7 (Agent SDK)   •   OpenAI GPT-5.5               │
│     Configurable thinking level  •  Plan → Act → Verify                  │
│     Experience memory + runtime-error memory injected into the prompt    │
└──┬────────┬────────┬────────┬────────┬────────┬──────────────────────────┘
   │        │        │        │        │        │   delegates via Task tool
┌──▼───┐ ┌──▼───┐ ┌──▼────┐ ┌─▼─────┐ ┌▼──────┐ ┌▼───────┐
│PLAN- │ │LOOKUP│ │COMPUTA│ │ FILE  │ │VISUAL │ │ CRITIC │
│NER   │ │(Opus)│ │-TION  │ │ANALY- │ │ANALYST│ │ (Opus) │
│      │ │web/  │ │(Sonnet│ │SIS    │ │(headed│ │grounded│
│plan  │ │wiki  │ │double-│ │(Sonnet│ │browser│ │re-check│
│      │ │rsrch │ │compute│ │ vision│ │ cu_*) │ │vs src  │
└──┬───┘ └──┬───┘ └──┬────┘ └──┬────┘ └──┬────┘ └─┬──────┘
   │        │        │         │         │        │
┌──▼────────▼────────▼─────────▼─────────▼────────▼────────────────────────┐
│                       MCP TOOL LAYER  (~80 tools)                        │
│  Search/Browse:  web_search · search_wiki · search_google · query_arxiv  │
│                  · ask_perplexity · jina_search · jina_read ·            │
│                  browse_webpage · search_archive · find_paper ·          │
│                  fetch_via_wayback                                       │
│  Media/Video:    youtube_metadata · youtube_captions · youtube_search ·  │
│                  identify_song · extract_video_frames · extract_audio    │
│  Execution:      execute_python · execute_shell · read_file · write_file │
│                  (E2B microVM, Playwright)                               │
│  Vision/CU:      describe_image · image_question · cu_screenshot ·       │
│                  cu_click · cu_type · cu_scroll · cu_zoom · cu_navigate  │
│  Knowledge:      search_knowledge_base  (CustomGPT.ai RAG)               │
│  Reasoning/Mem:  deep_think · fetch_experiences · submit_answer          │
└──┬───────────────────────────────────────────────────────────────────────┘
   │
┌──▼───────────────────────────────────────────────────────────────────────┐
│              QUALITY & VERIFICATION PLANE  (hooks + loops)               │
│  • Quadruple-Verification hooks: 4 cycles on every tool output           │
│      C1 code quality · C2 security · C3 stop-quality · C4 research       │
│  • Self-Correction loop: render → screenshot → analyze → fix (×3)        │
│  • Majority-vote self-consistency across N runs                          │
│  • Budget / loop-detection / audit-log / safety hooks                    │
└──────────────────────────────────────────────────────────────────────────┘
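
The last line of the quality plane in the diagram lists budget and loop-detection hooks. The README does not show how they are implemented, so the following is only a sketch of one plausible approach: count calls against a cap, and flag an identical tool call that repeats too often. `ToolCallGuard` and its return strings are hypothetical names, and argument values are assumed to be hashable.

```python
from collections import Counter


class ToolCallGuard:
    """Hypothetical guard: enforces a call budget and flags repeated calls."""

    def __init__(self, max_calls: int, max_repeats: int) -> None:
        self.max_calls = max_calls
        self.max_repeats = max_repeats
        self.calls = 0
        self.seen: Counter = Counter()

    def check(self, tool: str, args: dict) -> str:
        """Return 'ok', 'budget_exceeded', or 'loop_detected' for one call."""
        self.calls += 1
        if self.calls > self.max_calls:
            return "budget_exceeded"
        # The same tool with the same arguments, repeated too often,
        # is treated as a sign that the agent is stuck in a loop.
        key = (tool, tuple(sorted(args.items())))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "loop_detected"
        return "ok"
```

A caller would run `guard.check(tool, args)` before dispatching each tool call and stop or re-plan on anything other than `"ok"`.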

Components

| Component | Technology | Role |
|---|---|---|
| Orchestrator | Claude Opus 4.6 / 4.7 (Agent SDK) or OpenAI GPT-5.5 | Task decomposition, routing, evidence-backed reasoning, and final answer synthesis; model and thinking level selectable per run; prompt augmented with experience + runtime-error memory |
| PLANNER Subagent | Claude / GPT-5.5 | Produces a structured plan (question restatement, strategy, micro-step queue) before execution begins |
| LOOKUP Specialist | Claude Opus + search/browse tools | Web/wiki/arXiv/Perplexity research and fact retrieval, page navigation |
| COMPUTATION Specialist | Claude Sonnet 4.6 + E2B sandbox | Python/shell execution, math, and data processing with double-compute verification |
| FILE ANALYSIS Specialist | Claude Sonnet 4.6 + E2B + vision | Document parsing (PDF, JATS XML, Excel, images, audio), transcription, and file manipulation |
| VISUAL ANALYST Subagent | Claude / Gemini 3.1 Pro + headed Chromium | Vision-first navigation of Street View, 3D viewers, and interactive maps via the cu_* computer-use tools |
| CRITIC Verifier | Claude Opus (real research tools) | Grounded verification: independently re-checks the answer's source quote against primary sources before approval |
| E2B Sandbox | Firecracker microVM + pooled warm starts | Isolated Python/shell execution, Playwright browser automation, file workspace |
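
The COMPUTATION row mentions double-compute verification without defining it. One reading, assumed here, is that a result is computed twice by independent methods and accepted only when both agree. The sketch below follows that reading; `double_compute`, `sum_loop`, and `sum_formula` are hypothetical names, not functions from the repository.

```python
import math
from typing import Callable


def double_compute(primary: Callable[[], float],
                   secondary: Callable[[], float],
                   rel_tol: float = 1e-9) -> float:
    """Run two independent computations and accept only if they agree."""
    a, b = primary(), secondary()
    if not math.isclose(a, b, rel_tol=rel_tol):
        raise ValueError(f"results disagree: {a!r} vs {b!r}")
    return a


def sum_loop(n: int) -> float:
    """Sum 1..n by brute-force iteration."""
    return float(sum(range(1, n + 1)))


def sum_formula(n: int) -> float:
    """Sum 1..n with the closed form n(n+1)/2."""
    return n * (n + 1) / 2
```

For example, `double_compute(lambda: sum_loop(100), lambda: sum_formula(100))` returns the shared value, while two methods that disagree raise instead of silently returning one of them.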

Quality & Verification Plane

This layer wraps every step and is the main driver of the agent's accuracy gains:

| Mechanism | What it does |
|---|---|
| Quadruple Verification | Claude Agent SDK hooks running four cycles on tool output (C1 code-quality, C2 security, C3 stop-quality, C4 research-claim verification), with an immutable audit log |
| Grounded CRITIC | Re-checks the answer's cited source against primary sources with real research tools, not self-reflection, before any answer is submitted |
| Self-Correction Loop | Create → render → screenshot → analyze → fix loop (max 3 iterations) that makes generated artifacts executive-ready |
| Self-Consistency Voting | Runs a task N times and selects the majority answer (normalized grouping), with early-stop on consensus |
| Experience & Runtime Memory | Deterministic matchers inject lessons from past successes (experiences.json) and past failures (runtime_memory.json) into the orchestrator prompt |
| Sandbox Pool | Keeps warm E2B microVMs to remove cold-start latency across parallel runs |
| Operational hooks | Per-agent / global tool-call budgets, loop detection, safety gates, and cost tracking |
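
The Self-Consistency Voting row can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `normalize` guesses at what "normalized grouping" means (case, whitespace, and a trailing period), and "early-stop on consensus" is taken to mean stopping once one answer holds a strict majority of the N planned runs.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Group trivially different answers: case, whitespace, trailing period."""
    return " ".join(answer.strip().lower().rstrip(".").split())


def vote(run_once: Callable[[int], str], n: int) -> tuple[str, int]:
    """Run up to n times; return (majority answer, number of runs used)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for i in range(n):
        counts[normalize(run_once(i))] += 1
        leader, top = counts.most_common(1)[0]
        # Early stop: the leader can no longer be overtaken.
        if top > n // 2:
            return leader, i + 1
    return counts.most_common(1)[0][0], n
```

With five planned runs whose first three answers are `"Paris"`, `"paris."`, and `" PARIS "`, the three spellings fall into one group and the vote ends after the third run.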

Key Features

  1. Multi-Provider Orchestrator: The same task can be driven by Claude Opus 4.6/4.7 (via the Claude Agent SDK) or OpenAI GPT-5.5, with a per-run thinking-level knob, enabling apples-to-apples cross-model comparison on identical specialists

  2. Plan → Act → Verify Loop: A dedicated PLANNER subagent writes a structured plan.md before execution; the orchestrator acts against it and the CRITIC verifies the result, with durable notes.md observations for human visibility

  3. Vision-First Computer Use: A real headed Chromium browser (persistent cookie profile, 16:9 viewport) is driven through a cu_* tool suite, letting the agent see and operate Street View, 3D model viewers, and interactive maps much as a human would

  4. Dual Vision Stack: Claude vision plus Google Gemini 3.1 Pro as a secondary provider for exhaustive image description and strict image-vs-image verification

  5. First-Party Media Tooling: YouTube metadata/captions/channel search, song identification, and frame/audio extraction give the agent direct access to video content instead of brittle scraping fallbacks

  6. Grounded CRITIC Verification: The CRITIC re-checks the answer's cited source against primary sources with real research tools before any answer is submitted

  7. Quadruple-Verification Hooks: Four independent cycles (code quality, security, stop-quality, research-claim) run automatically on tool output, with an immutable audit trail

  8. Self-Correction + Self-Consistency: A render → screenshot → analyze → fix loop produces executive-ready artifacts; majority-vote across N runs reduces variance on hard tasks

  9. Experience & Runtime Memory: Learns across runs by injecting prior successes and prior failures into the orchestrator prompt

  10. Enterprise RAG Integration: CustomGPT.ai's 40+ API endpoints provide citation-backed knowledge retrieval from proprietary data sources

  11. Full Observability: A model-I/O logger captures every request/response to JSONL and a progress bus emits real-time task events for live monitoring and replay
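
Feature 9 can be pictured as a keyword matcher that runs before the model is called. The README does not document the layout of experiences.json or runtime_memory.json, so the schema below (a list of objects with `keywords` and `lesson`) is an assumption, as are the function names and the sample entries.

```python
import json


def match_lessons(question: str, entries: list[dict]) -> list[str]:
    """Return lessons whose keywords all occur in the question (no model call)."""
    words = set(question.lower().split())
    return [e["lesson"] for e in entries
            if all(k.lower() in words for k in e["keywords"])]


def build_prompt(question: str, experiences: list[dict], failures: list[dict]) -> str:
    """Prepend matched lessons from past successes and failures to the question."""
    lines = [f"- {lesson}" for lesson in match_lessons(question, experiences)]
    lines += [f"- AVOID: {lesson}" for lesson in match_lessons(question, failures)]
    if not lines:
        return f"Question: {question}"
    return "Lessons from earlier runs:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


# Assumed schema and made-up entries, for illustration only.
EXPERIENCES = json.loads('[{"keywords": ["youtube"], "lesson": "Prefer captions over frames."}]')
FAILURES = json.loads('[{"keywords": ["pdf", "table"], "lesson": "OCR fallback dropped columns."}]')
```

The matcher is deterministic in the sense that the same question and the same memory files always produce the same prompt, which keeps injected lessons reproducible across runs.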


How It Works

User Question
     │
     ▼
┌──────────────┐
│   PLANNER    │──── Writes plan.md: restate question, choose strategy, queue steps
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Orchestrator │──── Routes each step to the right specialist (Claude or GPT-5.5);
└──────┬───────┘     injects matched experience + runtime-error memory
       │
       ├──► LOOKUP:        Search web/wiki/arXiv/Perplexity, browse pages
       ├──► COMPUTATION:   Write & run Python in the E2B sandbox (double-compute)
       ├──► FILE ANALYSIS: Parse PDFs, spreadsheets, images, audio
       ├──► VISUAL ANALYST: Drive headed Chromium: screenshot, click, zoom, pan
       └──► CustomGPT KB:  Query the enterprise knowledge base
              │
              ▼   (every tool output passes the 4 verification cycles)
       ┌──────────────┐
       │    CRITIC    │──── Re-check cited source against primary sources
       └──────┬───────┘
              │
              ▼
       ┌──────────────────────┐
       │  Self-correction ×3  │──── Render → screenshot → analyze → fix
       └──────┬───────────────┘
              │
              ▼   (optional) majority vote across N runs
        Final Answer (with plan.md, notes.md & evidence trace)
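
The self-correction stage in the flow above is a bounded loop: check the artifact, fix what was found, and stop after three attempts at most. The sketch below shows only that control flow. The `analyze` callback stands in for the render, screenshot, and analyze steps, and all names here are hypothetical.

```python
from typing import Callable


def self_correct(artifact: str,
                 analyze: Callable[[str], list[str]],
                 fix: Callable[[str, list[str]], str],
                 max_iterations: int = 3) -> tuple[str, int]:
    """Analyze and fix until no issues remain or the iteration cap is hit.

    Returns the final artifact and the number of fixes applied.
    """
    for used in range(max_iterations):
        issues = analyze(artifact)  # stands in for render -> screenshot -> analyze
        if not issues:
            return artifact, used
        artifact = fix(artifact, issues)
    return artifact, max_iterations


def find_issues(text: str) -> list[str]:
    """Toy analyzer: flags any leftover TODO marker."""
    return ["TODO left in output"] if "TODO" in text else []


def remove_one_todo(text: str, issues: list[str]) -> str:
    """Toy fixer: removes a single TODO marker per pass."""
    return text.replace("TODO", "", 1).strip()
```

The cap matters because a fixer can fail to converge; with four markers and three iterations, the loop returns an artifact that still contains one, rather than running indefinitely.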

About CustomGPT.ai

CustomGPT.ai is an enterprise AI platform that enables businesses to build custom AI agents powered by their own data, with no hallucinations and full source citations.

Key Capabilities:

  • 1,400+ data formats: Ingest websites, documents, helpdesks, videos, and more
  • Anti-hallucination technology: Third-party verified #1 for accuracy
  • 92 languages supported
  • 40+ REST API endpoints: Full programmatic access for agent integration
  • SOC-2 Type II & GDPR compliant
  • Enterprise-grade security: Data never used for LLM training

In this agent, CustomGPT.ai serves as the knowledge base layer, providing semantic search, ranked context retrieval, and citation-backed answers from proprietary enterprise data.

"Business AI for trusted answers: no hallucinations, no guessing."


Team

Leadership

| Name | Role |
|---|---|
| Alden Do Rosario | CEO, CustomGPT.ai |

Project & Product Management

| Name | Role & Contributions |
|---|---|
| Felipe Pires | Technical Product Manager & Project Manager: hiring and team building; task breakdown, distribution, and dispatch; evaluation strategy & methodology; regression/canary validation harness design; competitive benchmark research; multi-model integration and agent architecture direction |

Developers

| Developer | Focus |
|---|---|
| Aleksa Stojanović | Research, methodology, and team coordination: identified tooling and infrastructure gaps; built verification pipelines and analysis tooling; supported L3 edge-case tooling directions (Street View, multimedia, Wikipedia API); helped focus team effort on high-leverage work; led the final merge to main |
| Ramzi Mo | YouTube & media MCP tool suite (metadata, captions, song ID, frame/audio extraction) and orchestrator routing for video/audio; reasoning-rule additions (percent-change, formula delegation, count-with-exclusion, ranked extraction, temporal anchoring, multi-source/batch); RSV video-counting pattern; browser/PDF/figure error recovery; CRITIC verification rules (anti-sycophancy, FORMAT A/B, shortcut detection) |
| Hussein Younes | Vision pipeline: self-consistency voting in describe_image, the image_question narrow-question wrapper, describe-image caps / anti-fixation, and Gemini 3.1 vision routing for spatial questions; grounded-CRITIC verification; find_paper + fetch_via_wayback research tools; tool-delegation robustness and error-recovery fixes; quadruple-verification audit logging; and the agent test suite (voting, experience memory, infra-retry, subagents) |
| Arnav Gupta | Multi-provider orchestrator (Claude Opus 4.6/4.7 + GPT-5.5, per-run thinking level); vision-first computer use (headed Chromium + cu_* suite for Street View, 3D viewers, maps); Plan → Act → Verify loop (PLANNER + VISUAL ANALYST subagents, Gemini 3.1 Pro secondary vision); evidence tooling & observability (source-traceable research tools, model-I/O logging, real-time progress bus) |
| Dennis Yavuz | Early engineering on the initial implementation: orchestrator scaffolding, specialist subagents, MCP tool layer, and E2B sandbox integration |

Acknowledgements

  • Anthropic: Claude Opus 4.6 / 4.7, Claude Sonnet 4.6, and the Claude Agent SDK
  • OpenAI: GPT-5.5 orchestrator provider
  • Google: Gemini 3.1 Pro vision provider
  • CustomGPT.ai: Enterprise RAG platform and knowledge base API
  • E2B: Secure sandboxed code execution
  • Playwright: Headed Chromium browser automation
  • GAIA Benchmark: General AI Assistants evaluation framework

Citation

@misc{customgpt-agent-2026,
  title={CustomGPT Agent: Enterprise AI Agent with RAG-Augmented Multi-Agent Architecture},
  author={CustomGPT.ai Research},
  year={2026},
  url={https://github.com/adorosario/customgpt-agent}
}

Built with Claude Agent SDK and CustomGPT.ai
