feat: add multimodal support (voice, camera, screen, MCP) by JacobFV · Pull Request #36 · agi-inc/agi-cli

JacobFV · 2026-02-10T10:02:20Z

Multimodal Support - AGI CLI Updates

This update adds multimodal support to the AGI CLI: voice input/output, camera and screen capture, and MCP server loading.

New Features

Voice Mode (`--voice`)

Audio input from microphone
Automatic turn detection
Text-to-speech output
Requires: OPENAI_API_KEY environment variable

Camera Mode (`--camera`)

Webcam video feed
30-second rolling buffer
Agent can see you

Screen Mode (`--screen`)

Screen recording
30-second rolling buffer
Agent can see your screen
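
The 30-second rolling buffer used by camera and screen mode can be pictured with a short sketch. This is an illustration only: the `Frame` shape and the `RollingBuffer` class are hypothetical names, and the PR does not show how the driver actually stores frames.

```typescript
// Hypothetical frame record; the real driver's frame type is not shown in this PR.
interface Frame {
  timestampMs: number; // capture time in milliseconds
  data: Uint8Array; // encoded image bytes
}

class RollingBuffer {
  private frames: Frame[] = [];
  private readonly windowMs: number;

  // Default window of 30 seconds, matching the description above.
  constructor(windowMs: number = 30000) {
    this.windowMs = windowMs;
  }

  // Append a frame, then drop every frame older than the window
  // measured back from the newest frame's timestamp.
  push(frame: Frame): void {
    this.frames.push(frame);
    const cutoff = frame.timestampMs - this.windowMs;
    this.frames = this.frames.filter((f) => f.timestampMs >= cutoff);
  }

  // Copy of the frames currently inside the window, oldest first.
  snapshot(): Frame[] {
    return [...this.frames];
  }
}
```

With this shape, memory use is bounded by the frame rate times the window length, regardless of how long the session runs.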

MCP Support (`--mcp`)

Load MCP servers from config
Default config: ~/.agi/mcp.json
Custom config: --mcp-config /path/to/mcp.json
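
As a rough sketch of the path rule above (custom path if given, otherwise the default under the home directory), assuming a hypothetical helper named `resolveMcpConfigPath`; the actual resolution logic in src/cli.ts is not shown in this PR:

```typescript
// Hypothetical helper, not taken from the PR.
// homeDir is passed in so the function stays pure and easy to test.
function resolveMcpConfigPath(homeDir: string, customPath?: string): string {
  if (customPath !== undefined && customPath.length > 0) {
    return customPath; // a --mcp-config value takes precedence
  }
  // Default documented above: ~/.agi/mcp.json
  const trimmedHome = homeDir.replace(/\/+$/, "");
  return `${trimmedHome}/.agi/mcp.json`;
}
```

A caller would typically pass the current user's home directory as `homeDir`.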

Usage Examples

Voice Mode

agi --voice "What's the current time?"

Voice + Screen

agi --voice --screen "What's on my screen?"

Full Multimodal

agi --voice --camera --screen "Can you see me and my screen?"

MCP Servers

# Set up MCP config
mkdir -p ~/.agi
cat > ~/.agi/mcp.json << 'EOF'
{
  "filesystem": {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/Documents"]
  }
}
EOF

# Use MCP
agi --mcp "List my documents"

Everything Combined

agi --voice --camera --screen --mcp "Help me with my work"

Configuration

Environment Variables

AGI_API_KEY: Your AGI API key (required)
OPENAI_API_KEY: OpenAI key for voice features (required for --voice)
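
The requirement rules above could be checked up front before a session starts. The following is a minimal sketch under that assumption; `missingEnvVars` and `ModeFlags` are hypothetical names, and the PR does not say whether the CLI performs such a check.

```typescript
// Hypothetical subset of the CLI flags that affects which variables are needed.
interface ModeFlags {
  voice: boolean;
}

// Returns the names of required variables that are unset or empty.
function missingEnvVars(
  env: Record<string, string | undefined>,
  flags: ModeFlags
): string[] {
  const required = ["AGI_API_KEY"]; // always required
  if (flags.voice) {
    required.push("OPENAI_API_KEY"); // only needed for --voice
  }
  return required.filter((key) => !env[key]);
}
```

In a Node process the first argument would usually be `process.env`, and a non-empty result would be reported to the user before starting the agent.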

MCP Config Format

{
  "server-name": {
    "command": "executable",
    "args": ["arg1", "arg2"],
    "env": {
      "ENV_VAR": "value"
    }
  }
}
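
The format above maps naturally onto a small set of TypeScript types. The sketch below assumes `args` and `env` are optional (the filesystem example earlier omits `env`); the type and function names are hypothetical and not taken from the driver.

```typescript
// Hypothetical types mirroring the documented config format.
interface McpServerConfig {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}
type McpConfig = Record<string, McpServerConfig>;

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function isStringRecord(v: unknown): v is Record<string, string> {
  return (
    typeof v === "object" &&
    v !== null &&
    !Array.isArray(v) &&
    Object.values(v).every((x) => typeof x === "string")
  );
}

// Parse and validate the JSON text of an MCP config file.
function parseMcpConfig(json: string): McpConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("MCP config must be a JSON object keyed by server name");
  }
  const config: McpConfig = {};
  for (const [serverName, value] of Object.entries(raw)) {
    const entry = value as
      | { command?: unknown; args?: unknown; env?: unknown }
      | null;
    if (typeof entry !== "object" || entry === null || typeof entry.command !== "string") {
      throw new Error(`Server "${serverName}" needs a string "command"`);
    }
    const server: McpServerConfig = { command: entry.command };
    if (entry.args !== undefined) {
      if (!isStringArray(entry.args)) {
        throw new Error(`Server "${serverName}": "args" must be an array of strings`);
      }
      server.args = entry.args;
    }
    if (entry.env !== undefined) {
      if (!isStringRecord(entry.env)) {
        throw new Error(`Server "${serverName}": "env" must map names to strings`);
      }
      server.env = entry.env;
    }
    config[serverName] = server;
  }
  return config;
}
```

Validating at load time turns a malformed config into a clear error message instead of a failure when the server process is spawned.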

CLI Options

| Option | Description |
| --- | --- |
| `--voice` | Enable voice input/output |
| `--camera` | Enable camera video |
| `--screen` | Enable screen recording |
| `--mcp` | Load MCP servers from config |
| `--mcp-config PATH` | Custom MCP config path |
| `-m, --model` | Model to use (default: claude-sonnet) |
| `-v, --verbose` | Show agent thinking |
| `--no-confirm` | Auto-approve confirmations |
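
To make the option semantics concrete, here is a hand-rolled parser sketch covering only the options in the table. It is not the code in src/cli.ts (the PR does not show which argument-parsing approach the CLI uses), and `ParsedOptions` and `parseOptions` are hypothetical names. It also assumes `--mcp-config` does not by itself enable `--mcp`, which the PR does not specify.

```typescript
// Hypothetical result shape; field names are assumptions.
interface ParsedOptions {
  voice: boolean;
  camera: boolean;
  screen: boolean;
  mcp: boolean;
  mcpConfig?: string;
  model: string;
  verbose: boolean;
  noConfirm: boolean;
  goal: string; // the positional prompt, e.g. "What's on my screen?"
}

function parseOptions(argv: string[]): ParsedOptions {
  const opts: ParsedOptions = {
    voice: false,
    camera: false,
    screen: false,
    mcp: false,
    model: "claude-sonnet", // default from the table
    verbose: false,
    noConfirm: false,
    goal: "",
  };
  const positional: string[] = [];
  const rest = [...argv];

  // Consume the value that must follow an option such as --model.
  const takeValue = (flag: string): string => {
    const value = rest.shift();
    if (value === undefined) throw new Error(`${flag} expects a value`);
    return value;
  };

  while (rest.length > 0) {
    const token = rest.shift() as string;
    switch (token) {
      case "--voice": opts.voice = true; break;
      case "--camera": opts.camera = true; break;
      case "--screen": opts.screen = true; break;
      case "--mcp": opts.mcp = true; break;
      case "--mcp-config": opts.mcpConfig = takeValue(token); break;
      case "-m":
      case "--model": opts.model = takeValue(token); break;
      case "-v":
      case "--verbose": opts.verbose = true; break;
      case "--no-confirm": opts.noConfirm = true; break;
      default:
        if (token.startsWith("-")) throw new Error(`Unknown option: ${token}`);
        positional.push(token);
    }
  }
  opts.goal = positional.join(" ");
  return opts;
}
```

For example, `parseOptions(["--voice", "--screen", "What's on my screen?"])` yields `voice` and `screen` set to `true`, the default model, and the quoted text as `goal`.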

Implementation

Changes made:

Updated src/cli.ts to add multimodal options
Updated src/hooks/useAgent.ts to pass multimodal config to driver
Added UI components for multimodal events
Updated examples in help text

Testing

# Install dependencies
npm install

# Build
npm run build

# Test voice mode
agi --voice "Hello"

# Test full multimodal
agi --voice --camera --screen --mcp "What do you see?"

Related PRs

agi-api (driver): https://github.com/agi-inc/agents/pull/344
agi-python: https://github.com/agi-inc/agi-python/pull/8
agi-node: agi-inc/agi-node#11 (feat: add comprehensive multimodal driver support)
agi-csharp: agi-inc/agi-csharp#8 (feat: add comprehensive multimodal driver support documentation)

Add comprehensive multimodal features to AGI CLI:

## New CLI Options

- --voice: Enable voice input/output (requires OPENAI_API_KEY)
- --camera: Enable camera video feed
- --screen: Enable screen recording
- --mcp: Load MCP servers from config
- --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json)

## Features

- Voice input with automatic turn detection
- Text-to-speech output
- Camera and screen video buffers
- MCP server integration for extended tools
- All features work together seamlessly

## Usage Examples

agi --voice "What's the time?"
agi --voice --screen "What's on my screen?"
agi --voice --camera --screen --mcp "Help me with my work"

## Related PRs

- agi-api (driver): agi-inc/agents#344
- agi-python: agi-inc/agi-python#8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Allows users to specify a custom AGI API endpoint URL:

- Added apiUrl to CliArgs interface
- Added --api-url CLI option
- Pass apiUrl to useAgent hook

Usage: agi --api-url http://localhost:8000 "your goal"

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Update App.tsx to pass voice, camera, screen, mcp, mcpConfig to useAgent
- Update UseAgentOptions interface to accept multimodal options
- Pass all multimodal options to AgentDriver constructor
- Complete end-to-end wiring: CLI args → App → useAgent → AgentDriver → API

Now the --voice, --camera, --screen, --mcp flags are fully functional!

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

JacobFV · 2026-02-10T17:40:18Z

Merge Order

This PR depends on @agi_inc/agi-js@0.5.0 which has the multimodal DriverOptions types. CI will fail until the Node SDK PR is merged and published:

agi-inc/agents#344: merge first (driver binary)
agi-inc/agi-node#11 (feat: redesign TUI and add update command): merge, then publish @agi_inc/agi-js@0.5.0 to npm
agi-inc/agi-cli#36 (feat: add multimodal support (voice, camera, screen, MCP)): merge after agi-js 0.5.0 is published
agi-inc/agi-python#8 and agi-inc/agi-csharp#8 (both titled feat(auth): add device code login flow and credential management): can merge independently

The import.meta typecheck error is a pre-existing tsconfig issue: tsup emits ESM correctly, and the error reported by tsc --noEmit is a false positive.

JacobFV and others added 7 commits February 10, 2026 02:02

fix(hooks): add missing multimodal deps to useCallback array

be5d92f

Add voice, camera, screen, mcp, mcpConfig to the start callback dependency array so React captures the correct values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
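
Why the dependency array matters can be shown without React. The sketch below uses a hypothetical `createCallbackCache` helper as a simplified stand-in for `useCallback` (it keeps the previous function unless a dependency changed under `Object.is`); it illustrates the stale-closure problem and is not the code in src/hooks/useAgent.ts.

```typescript
// Simplified stand-in for React's useCallback: returns the cached function
// unless the dependency list changed (compared element-wise with Object.is).
function createCallbackCache<T>() {
  let cached: { fn: T; deps: readonly unknown[] } | undefined;
  return (fn: T, deps: readonly unknown[]): T => {
    const previous = cached;
    const unchanged =
      previous !== undefined &&
      previous.deps.length === deps.length &&
      previous.deps.every((d, i) => Object.is(d, deps[i]));
    if (previous !== undefined && unchanged) {
      return previous.fn; // reuse the old closure
    }
    cached = { fn, deps };
    return fn;
  };
}

type Memo = (fn: () => string, deps: readonly unknown[]) => () => string;

// One "render": builds a start callback that closes over the current voice flag.
function renderStart(memo: Memo, voice: boolean, includeVoiceInDeps: boolean): () => string {
  const deps = includeVoiceInDeps ? [voice] : [];
  return memo(() => `voice=${voice}`, deps);
}

// Missing dependency: the callback from the first render is reused forever.
const buggyMemo = createCallbackCache<() => string>();
renderStart(buggyMemo, false, false);
const staleStart = renderStart(buggyMemo, true, false);
console.log(staleStart()); // "voice=false" (stale value)

// Complete dependency list: a changed flag produces a fresh callback.
const fixedMemo = createCallbackCache<() => string>();
renderStart(fixedMemo, false, true);
const freshStart = renderStart(fixedMemo, true, true);
console.log(freshStart()); // "voice=true"
```

The same reasoning applies to each of voice, camera, screen, mcp, and mcpConfig: any value read inside the callback but omitted from the array can be captured stale.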

chore: bump version to 0.6.0 for multimodal release

ab1b47b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix(cli): remove unused imports and bump to 0.5.15

c665ee5

Remove unused mkdirSync, join, and color variable that caused ESLint failures in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore(deps): require @agi_inc/agi-js ^0.5.0 for multimodal support

c342832

CI will pass once agi-node 0.5.0 is published to npm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
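
For context on the `^0.5.0` requirement: under npm's semver rules a caret range pins the leftmost non-zero component, so `^0.5.0` accepts 0.5.x but neither 0.4.x nor 0.6.0. A simplified sketch of that rule follows; `satisfiesCaret` is a hypothetical helper that handles only plain `x.y.z` versions (no prerelease or build tags) and is not how npm itself is implemented.

```typescript
// Parse a plain x.y.z version into numeric components.
function parseVersion(v: string): [number, number, number] {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (m === null) throw new Error(`Unsupported version: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// True if `version` falls inside the caret range `^base`.
function satisfiesCaret(version: string, base: string): boolean {
  const [maj, min, pat] = parseVersion(version);
  const [bMaj, bMin, bPat] = parseVersion(base);

  // The version must not be lower than the base.
  const atLeastBase =
    maj > bMaj ||
    (maj === bMaj && (min > bMin || (min === bMin && pat >= bPat)));
  if (!atLeastBase) return false;

  // The leftmost non-zero component of the base must stay fixed.
  if (bMaj > 0) return maj === bMaj;
  if (bMin > 0) return maj === 0 && min === bMin;
  return maj === 0 && min === 0 && pat === bPat;
}
```

Under this rule no 0.4.x release can satisfy the dependency, which is consistent with the note that CI passes only once 0.5.0 is published.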

JacobFV merged commit b035bc8 into main Feb 10, 2026
1 of 4 checks passed

JacobFV mentioned this pull request Feb 10, 2026

chore(main): release agi-cli 0.5.15 #37

Merged

JacobFV merged 7 commits into main from jacob/multimodal-support
