Skip to content

feat: add multimodal support (voice, camera, screen, MCP)#36

Merged
JacobFV merged 7 commits intomainfrom
jacob/multimodal-support
Feb 10, 2026
Merged

feat: add multimodal support (voice, camera, screen, MCP)#36
JacobFV merged 7 commits intomainfrom
jacob/multimodal-support

Conversation

@JacobFV
Copy link
Collaborator

@JacobFV JacobFV commented Feb 10, 2026

Multimodal Support - AGI CLI Updates

This update adds comprehensive multimodal support to the AGI CLI.

New Features

Voice Mode (--voice)

  • Audio input from microphone
  • Automatic turn detection
  • Text-to-speech output
  • Requires: OPENAI_API_KEY environment variable

Camera Mode (--camera)

  • Webcam video feed
  • 30-second rolling buffer
  • Agent can see you

Screen Mode (--screen)

  • Screen recording
  • 30-second rolling buffer
  • Agent can see your screen

MCP Support (--mcp)

  • Load MCP servers from config
  • Default config: ~/.agi/mcp.json
  • Custom config: --mcp-config /path/to/mcp.json

Usage Examples

Voice Mode

agi --voice "What's the current time?"

Voice + Screen

agi --voice --screen "What's on my screen?"

Full Multimodal

agi --voice --camera --screen "Can you see me and my screen?"

MCP Servers

# Set up MCP config
mkdir -p ~/.agi
cat > ~/.agi/mcp.json << 'EOF'
{
  "filesystem": {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/Documents"]
  }
}
EOF

# Use MCP
agi --mcp "List my documents"

Everything Combined

agi --voice --camera --screen --mcp "Help me with my work"

Configuration

Environment Variables

  • AGI_API_KEY: Your AGI API key (required)
  • OPENAI_API_KEY: OpenAI key for voice features (required for --voice)

MCP Config Format

{
  "server-name": {
    "command": "executable",
    "args": ["arg1", "arg2"],
    "env": {
      "ENV_VAR": "value"
    }
  }
}

CLI Options

Option Description
--voice Enable voice input/output
--camera Enable camera video
--screen Enable screen recording
--mcp Load MCP servers from config
--mcp-config PATH Custom MCP config path
-m, --model Model to use (default: claude-sonnet)
-v, --verbose Show agent thinking
--no-confirm Auto-approve confirmations

Implementation

Changes made:

  • Updated src/cli.ts to add multimodal options
  • Updated src/hooks/useAgent.ts to pass multimodal config to driver
  • Added UI components for multimodal events
  • Updated examples in help text

Testing

# Install dependencies
npm install

# Build
npm run build

# Test voice mode
agi --voice "Hello"

# Test full multimodal
agi --voice --camera --screen --mcp "What do you see?"

Related PRs

JacobFV and others added 7 commits February 10, 2026 02:02
Add comprehensive multimodal features to AGI CLI:

## New CLI Options
- --voice: Enable voice input/output (requires OPENAI_API_KEY)
- --camera: Enable camera video feed
- --screen: Enable screen recording
- --mcp: Load MCP servers from config
- --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json)

## Features
- Voice input with automatic turn detection
- Text-to-speech output
- Camera and screen video buffers
- MCP server integration for extended tools
- All features work together seamlessly

## Usage Examples
agi --voice "What's the time?"
agi --voice --screen "What's on my screen?"
agi --voice --camera --screen --mcp "Help me with my work"

## Related PRs
- agi-api (driver): agi-inc/agents#344
- agi-python: agi-inc/agi-python#8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Allows users to specify a custom AGI API endpoint URL:
- Added apiUrl to CliArgs interface
- Added --api-url CLI option
- Pass apiUrl to useAgent hook

Usage: agi --api-url http://localhost:8000 "your goal"

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Update App.tsx to pass voice, camera, screen, mcp, mcpConfig to useAgent
- Update UseAgentOptions interface to accept multimodal options
- Pass all multimodal options to AgentDriver constructor
- Complete end-to-end wiring: CLI args → App → useAgent → AgentDriver → API

Now the --voice, --camera, --screen, --mcp flags are fully functional!

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add voice, camera, screen, mcp, mcpConfig to the start callback
dependency array so React captures the correct values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused mkdirSync, join, and color variable that caused
ESLint failures in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI will pass once agi-node 0.5.0 is published to npm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JacobFV
Copy link
Collaborator Author

JacobFV commented Feb 10, 2026

Merge Order

This PR depends on @agi_inc/agi-js@0.5.0 which has the multimodal DriverOptions types. CI will fail until the Node SDK PR is merged and published:

  1. agi-inc/agents #344 — merge first (driver binary)
  2. agi-inc/agi-node feat: redesign TUI and add update command #11 → publish @agi_inc/agi-js@0.5.0 to npm
  3. agi-inc/agi-cli feat: add multimodal support (voice, camera, screen, MCP) #36 — merge after agi-js 0.5.0 is published
  4. agi-inc/agi-python feat(auth): add device code login flow and credential management #8 and agi-inc/agi-csharp feat(auth): add device code login flow and credential management #8 — can merge independently

The import.meta typecheck error is a pre-existing tsconfig issue (tsup outputs ESM correctly, tsc --noEmit false positive).

@JacobFV JacobFV merged commit b035bc8 into main Feb 10, 2026
1 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant