A voice-controlled web interface for a browser automation agent. The project combines in-browser voice activity detection and speech-to-text transcription with browser automation, so you can control web browsing hands-free.
- Voice Input: Uses @ricky0123/vad-react for browser-based Voice Activity Detection (VAD)
- Transcription: Configurable to use different Speech-to-Text services (Groq Whisper, OpenAI Whisper API)
- Agent Backend: Communicates with the Python backend API in the /backend directory
- Browser Control: Uses the browser-use library to automate browser actions based on voice commands
- Real-time Updates: WebSockets for live agent status, goals, actions, and browser screenshots
- Modern Stack: Built with Next.js (React) and Tailwind CSS, deployable on Vercel or similar platforms
This repository contains both the frontend UI and the Python backend API in a monorepo structure:
swift-browser-ui/ (Repo root)
├── app/                      <-- Next.js frontend code
├── backend/                  <-- Python FastAPI backend code
├── public/                   <-- Frontend public assets (VAD files)
├── node_modules/             <-- Node.js dependencies
├── backend/.venv/            <-- Python virtual environment (created by uv/venv)
├── package.json              <-- Node.js dependencies
├── backend/pyproject.toml    <-- Python project config
├── backend/requirements.txt  <-- Python dependencies (for pip)
├── .env.local                <-- Frontend & Shared Env Vars (Root)
├── backend/.env              <-- Backend-specific Env Vars
└── ... (other config files)
The system works through these components:
- Frontend (Browser):
  - Captures voice, using VAD to detect speech
  - Sends the audio blob to the Next.js API route (/api/route.ts)
  - Displays agent progress via WebSocket updates
- Next.js API Route:
  - Receives the audio and sends it to the configured STT service
  - Gets the transcription and forwards it to the Python backend
- Python Backend:
  - Manages the browser-use Agent instance using the configured LLM
  - Executes browser actions based on commands
  - Sends status updates and screenshots to the frontend
+-----------------------+   POST /api    +-----------------------+  POST /agent/task   +---------------------+   +-------------+
| Frontend (Browser)    |--------------->| Next.js API Route     |-------------------->| Python Backend      |-->| browser-use |
| - VAD (Detects Speech)|  (Audio Blob)  | - STT Transcription   |     (Task Text)     | - FastAPI/WebSocket |   |    Agent    |
| - WebSocket Client    |                | - Calls Python Backend|                     | - Manages Agent     |   |             |
| - Displays Agent View |<---------------+                       |<--------------------+                     |<--+             |
+-----------------------+ WebSocket JSON +-----------------------+    JSON Response    +---------------------+   +-------------+
                       (Status, Screenshot)                    (Task Accepted + SessionID)
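The request path in the diagram can be sketched end to end with stand-in functions. This is an illustration only, not the project's code: `transcribe`, `submit_task`, `handle_audio`, and the JSON field names (`status`, `session_id`, `task`) are assumptions, and the STT and backend calls are replaced by stubs so the example runs offline.

```python
import json
import uuid
from typing import Callable

# Hypothetical stubs standing in for the Groq / OpenAI Whisper calls.
def _fake_groq(audio: bytes) -> str:
    return "open google"

def _fake_openai(audio: bytes) -> str:
    return "open google"

STT_PROVIDERS: dict[str, Callable[[bytes], str]] = {
    "groq": _fake_groq,
    "openai": _fake_openai,
}

def transcribe(audio: bytes, provider: str) -> str:
    """The API route picks the STT service named by STT_PROVIDER."""
    try:
        return STT_PROVIDERS[provider](audio)
    except KeyError:
        raise ValueError(f"unsupported STT_PROVIDER: {provider!r}") from None

def submit_task(task_text: str) -> dict:
    """Stub for POST /agent/task; the response shape is assumed."""
    return {"status": "accepted", "session_id": str(uuid.uuid4()), "task": task_text}

def handle_audio(audio: bytes, provider: str = "openai") -> str:
    """Audio blob in, JSON response out, mirroring POST /api in the diagram."""
    text = transcribe(audio, provider)
    return json.dumps(submit_task(text))
```

The session ID returned here is what a client would presumably use to match later WebSocket updates to the task it submitted.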
- Node.js: v18 or later recommended
- pnpm: npm install -g pnpm for frontend dependencies
- Python: v3.11 or later
- uv (Recommended) or pip: For Python dependencies
  - Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh or pip install uv
- API Keys:
- STT provider (Groq or OpenAI) for frontend
- LLM provider (OpenAI, Anthropic, Google, Azure) for backend
- Ollama (Optional): For local models
- Microphone: A working microphone for your browser
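A quick way to confirm the command-line prerequisites before installing is a small script like the one below. It is a sketch under assumptions: the executable names (`node`, `pnpm`, `uv`, `ollama`) are the usual ones but may differ on your system, the function name `check_prerequisites` is made up for this example, and the script does not verify the Node.js version, API keys, or the microphone.

```python
import shutil
import sys

REQUIRED_TOOLS = ("node", "pnpm")   # needed by the frontend
OPTIONAL_TOOLS = ("uv", "ollama")   # uv is recommended, Ollama is optional
MIN_PYTHON = (3, 11)

def check_prerequisites(which=shutil.which, version=sys.version_info[:2]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(version) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"missing required tool: {tool}")
    for tool in OPTIONAL_TOOLS:
        if which(tool) is None:
            problems.append(f"optional tool not found: {tool}")
    return problems

if __name__ == "__main__":
    for line in check_prerequisites() or ["all prerequisites found"]:
        print(line)
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real environment.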
# Replace with your fork's URL
git clone https://github.com/e8Complete/swift-browser-use.git
cd swift-browser-use
From the root directory:
pnpm install
pnpm add openai # Required if using OpenAI for STT
Navigate to the backend directory:
cd backend
Using uv (recommended):
# Create .venv if it does not exist yet, then install from pyproject.toml
uv venv
uv pip install -e .
# Generate requirements.txt if needed
uv pip freeze > requirements.txt
Or, using pip:
# Create a virtual environment and install
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
After activating the virtual environment:
playwright install chromium
cd ..
Copy .env.example to .env.local:
cp .env.example .env.local
Edit .env.local:
# ./ai-ng-swift/.env.local
# --- STT Configuration for Next.js API Route ---
# Choose one: 'groq' or 'openai'
STT_PROVIDER=openai
# Required if STT_PROVIDER=groq
GROQ_API_KEY=gr_...
# Required if STT_PROVIDER=openai
OPENAI_API_KEY=sk_...
# --- Backend URLs ---
# URL for Next.js API route to call Python backend
PYTHON_BACKEND_URL=http://localhost:8000
# WebSocket URL for the browser client (JS) to connect
# Use ws:// locally, wss:// when deployed with SSL
# Needs NEXT_PUBLIC_ prefix!
NEXT_PUBLIC_PYTHON_WS_URL=ws://localhost:8000
Create this file inside the backend/ directory:
# ./ai-ng-swift/backend/.env
# --- LLM Configuration for Python Agent ---
# Choose provider: 'openai', 'azure', 'anthropic', 'google', 'ollama'
AGENT_LLM_PROVIDER=ollama
# Model name appropriate for the provider
AGENT_LLM_MODEL=llama3
# --- Provider Specific Settings (only need those for selected provider) ---
# OPENAI_API_KEY=sk_... # If using openai LLM
# ANTHROPIC_API_KEY=...
# GEMINI_API_KEY=...
# AZURE_ENDPOINT=...
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_API_VERSION=...
OLLAMA_BASE_URL=http://localhost:11434 # Default if using ollama
# Optional backend logging level
# BROWSER_USE_LOGGING_LEVEL=debug
From the root directory:
pnpm dev
Access at http://localhost:3000
From the backend/ directory:
cd backend
# If using pip/venv, activate it first (source .venv/bin/activate) and drop the "uv run" prefix
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
API runs at http://localhost:8000
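A mismatch between AGENT_LLM_PROVIDER and the keys set in backend/.env is easy to miss until the first task fails. The check below is a minimal sketch, assuming each provider needs the variables listed for it in the backend/.env template; the provider-to-variable mapping and the function name `missing_settings` are assumptions for illustration, not the backend's actual validation logic.

```python
import os
from typing import Mapping

# Assumed mapping, derived from the variable names in the backend/.env template.
REQUIRED_BY_PROVIDER = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GEMINI_API_KEY"],
    "azure": ["AZURE_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_API_VERSION"],
    "ollama": [],  # OLLAMA_BASE_URL has a default, so nothing extra is mandatory
}

def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    provider = env.get("AGENT_LLM_PROVIDER", "").strip().lower()
    if provider not in REQUIRED_BY_PROVIDER:
        raise ValueError(f"unknown AGENT_LLM_PROVIDER: {provider!r}")
    required = ["AGENT_LLM_MODEL", *REQUIRED_BY_PROVIDER[provider]]
    return [name for name in required if not env.get(name)]
```

Called with no arguments it inspects the current process environment, so the values from backend/.env need to be loaded or exported first.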
- Open http://localhost:3000
- Grant microphone permissions
- Speak a command (e.g., "Open Google and search for cute cat videos")
- Watch the browser UI for transcription, status updates, and screenshots
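The status updates and screenshots shown in the UI arrive as JSON over the WebSocket. The sketch below shows one way a client could fold such messages into display state. The message schema (`type`, `status`, `goal`, `action`, and a base64-encoded `data` field for screenshots) is assumed for illustration and may not match what the backend actually sends; the real client is JavaScript, and Python is used here only to keep the example short.

```python
import base64
import json
from dataclasses import dataclass, field

@dataclass
class AgentView:
    """Minimal client-side state: what the UI would render."""
    status: str = "idle"
    goal: str = ""
    actions: list[str] = field(default_factory=list)
    screenshot_png: bytes = b""

def apply_update(view: AgentView, raw: str) -> AgentView:
    """Fold one WebSocket JSON message into the view (assumed schema)."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "status":
        view.status = msg["status"]
    elif kind == "goal":
        view.goal = msg["goal"]
    elif kind == "action":
        view.actions.append(msg["action"])
    elif kind == "screenshot":
        view.screenshot_png = base64.b64decode(msg["data"])
    # Unknown message types are ignored so new backend events do not break the client.
    return view
```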
Deploying a monorepo with mixed languages requires separate configurations:
Deploy the Next.js app to Vercel, Netlify, etc:
- Configure platform to use Node.js with pnpm
- Set environment variables in the platform settings:
  - STT_PROVIDER
  - GROQ_API_KEY or OPENAI_API_KEY
  - PYTHON_BACKEND_URL (must point to deployed backend)
  - NEXT_PUBLIC_PYTHON_WS_URL (must point to deployed backend)
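Since both URLs point at the same deployed backend and differ mainly in scheme (ws:// locally, wss:// behind SSL), the WebSocket URL can be derived from the HTTP one to avoid typos. This is a small sketch using only the standard library; the helper name `ws_url_from_backend` is made up for this example, and it assumes the WebSocket endpoint shares the backend's host, port, and base path.

```python
from urllib.parse import urlsplit, urlunsplit

def ws_url_from_backend(backend_url: str) -> str:
    """Map http -> ws and https -> wss, keeping host, port, and path."""
    parts = urlsplit(backend_url)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"expected an http(s) URL, got {backend_url!r}")
    # Drop any trailing slash, query, and fragment; they are not part of the base URL.
    return urlunsplit((scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
```

For example, the value of PYTHON_BACKEND_URL can be passed in and the result used for NEXT_PUBLIC_PYTHON_WS_URL.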
Deploy to a platform suitable for long-running Python processes:
- Fly.io, Render, Cloud Run, Railway, DigitalOcean Apps, AWS EC2/ECS
- Often deployed via Docker
# Use an official Python base image
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies needed by Playwright/Browsers
RUN apt-get update && apt-get install -y --no-install-recommends \
# Playwright dependencies
libnss3 libnspr4 libdbus-1-3 libatk1.0-0 libatk-bridge2.0-0 \
libcups2 libdrm2 libexpat1 libgbm1 libgcc1 libglib2.0-0 \
libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 \
libxdamage1 libxext6 libxfixes3 libxrandr2 libxtst6 \
ca-certificates fonts-liberation libappindicator3-1 \
libasound2 libatspi2.0-0 libcairo2 libfontconfig1 \
libgtk-3-0 libpangoft2-1.0-0 libstdc++6 \
lsb-release wget xdg-utils \
# Clean up
&& rm -rf /var/lib/apt/lists/*
# Install uv
RUN pip install uv
# Copy only dependency definition files first for caching
COPY pyproject.toml ./
# Optional: If using requirements.txt
# COPY requirements.txt ./
# Install Python dependencies
RUN uv pip install --system --no-cache -e .
# Or if using requirements.txt:
# RUN uv pip install --system --no-cache -r requirements.txt
# Install Playwright browsers
RUN playwright install chromium --with-deps
# Copy the rest of the application
COPY . .
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
- browser-use for browser automation
- @ricky0123/vad-react for voice activity detection
- Next.js and FastAPI frameworks