webmcp is an MCP server for web search and content extraction. LLM agents can use it to:
- search the web with DuckDuckGo (default) or SearXNG (optional)
- fetch and clean page content from one or more URLs
- send cleaned content to a local LLM for structured extraction

Tools:
- search_web(query, limit=10) returns web results (title, URL, description)
- extract(urls, prompt=None, schema=None, use_browser=True) extracts data from pages

Features:
- browser-based fetching with Playwright for JavaScript-heavy sites
- lightweight HTTP fetching mode for faster/simple pages
- persistent tool-call logging to tool_calls.log.json
- configurable search provider: DDG by default, optional SearXNG
For the main researcher llama.cpp server, include --webui-mcp-proxy in launch parameters. Without this flag, this workflow will not function correctly.
For best results, use research_prompt.txt as your system prompt. This prompt is a core part of the intended workflow and quality; it is effectively half of how this repository is meant to function.
Tested setup:
- Main researcher LLM: Qwen3.5:27b-Q3_K_M.gguf via llama.cpp on an RTX 4090, context length 200,000, about 40 tok/s.
- Extract tool LLM: Qwen3.5:9b-Q4_K_M.gguf via llama.cpp on a GTX 1080 Ti, context length 32,768, about 40 tok/s.
- This workflow has been tested with the llama.cpp WebUI specifically, and has not been validated with other MCP clients yet.
- Python 3.10+
- A local OpenAI-compatible LLM endpoint (for example, llama.cpp, LM Studio, vLLM, or Ollama)
The app reads LLM settings from environment variables and supports a local .env file.
- Copy .env.example to .env
- Set values:
LLM_URL=http://localhost:1234
LLM_MODEL=your-model-name
SEARCH_PROVIDER=ddg
# Optional when SEARCH_PROVIDER=searxng
SEARXNG_URL=http://localhost:8080

LLM_URL and LLM_MODEL are required at startup.
SEARCH_PROVIDER defaults to ddg. Set it to searxng to replace DDG, and provide SEARXNG_URL.
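
The rules above can be sketched as a small validation helper. This is not the code in app.py: `Settings` and `load_settings` are hypothetical names, the sketch reads only from a mapping such as `os.environ` (it does not parse the .env file), and normalizing the provider name to lowercase is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables described above."""
    llm_url: str
    llm_model: str
    search_provider: str
    searxng_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Validate settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # LLM_URL and LLM_MODEL are required at startup.
    missing = [name for name in ("LLM_URL", "LLM_MODEL") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")

    # SEARCH_PROVIDER defaults to ddg (lowercasing is an assumption).
    provider = env.get("SEARCH_PROVIDER", "ddg").strip().lower()
    if provider not in ("ddg", "searxng"):
        raise RuntimeError(f"Unsupported SEARCH_PROVIDER: {provider!r}")

    # SEARXNG_URL only matters when the provider is searxng.
    searxng_url = env.get("SEARXNG_URL") or None
    if provider == "searxng" and not searxng_url:
        raise RuntimeError("SEARXNG_URL is required when SEARCH_PROVIDER=searxng")

    return Settings(env["LLM_URL"], env["LLM_MODEL"], provider, searxng_url)
```

Failing early on a missing variable keeps a misconfigured server from starting and then failing on the first tool call.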
search_web supports two providers:
- ddg (default): uses DuckDuckGo via ddgs
- searxng: uses your SearXNG instance
SearXNG notes:
- Set SEARCH_PROVIDER=searxng
- Set SEARXNG_URL to your instance base URL (for example, http://192.168.0.55:8888)
- webmcp calls <SEARXNG_URL>/search with format=json
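
As a rough illustration of the request described in these notes, the sketch below builds the search URL and maps a SearXNG JSON response to the (title, URL, description) shape that search_web returns. Both function names are hypothetical and this is not necessarily how webmcp implements it; `q` is SearXNG's query parameter, and mapping its `content` field to the description is an assumption.

```python
from urllib.parse import urlencode


def build_searxng_url(base_url: str, query: str) -> str:
    """Return <base_url>/search with the query and format=json."""
    params = urlencode({"q": query, "format": "json"})
    # Strip a trailing slash so the path never contains "//search".
    return f"{base_url.rstrip('/')}/search?{params}"


def normalize_results(payload: dict, limit: int = 10) -> list[dict]:
    """Map SearXNG result entries to title/url/description dicts."""
    results = payload.get("results", [])[:limit]
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            # Assumption: SearXNG's snippet field ("content") serves as the description.
            "description": item.get("content", ""),
        }
        for item in results
    ]
```

If the instance rejects these requests, it is worth checking that json is enabled under search.formats in the SearXNG settings, since many instances ship with only HTML output enabled.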
Install dependencies from the pinned requirements file:
pip install -r requirements.txt
python -m playwright install chromium

python app.py

Server starts on:
http://0.0.0.0:8642
- extract(..., use_browser=True) is best for dynamic pages that require JS rendering.
- extract(..., use_browser=False) is faster for static pages.
- If extraction quality is poor, the LLM should provide a more specific prompt and/or a stricter schema.
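
To make the last note concrete, here is one possible shape for a more specific prompt paired with a stricter schema. It assumes the schema argument accepts a JSON Schema object, which this README does not confirm; the URL and field names are purely illustrative.

```python
import json

# Hypothetical arguments for an extract call. The schema format (JSON Schema)
# and every field name below are assumptions made for illustration.
extract_args = {
    "urls": ["https://example.com/pricing"],
    "prompt": "List every plan with its monthly price in USD. Ignore annual pricing.",
    "schema": {
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "monthly_price_usd": {"type": "number"},
                    },
                    # Requiring both fields and forbidding extras keeps output uniform.
                    "required": ["name", "monthly_price_usd"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["plans"],
        "additionalProperties": False,
    },
    # A static pricing page likely does not need JS rendering.
    "use_browser": False,
}

print(json.dumps(extract_args, indent=2))
```

Naming exact fields and types gives the extraction model less room to improvise than a vague prompt such as "get the pricing info".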
- Revisit JS page rendering and extraction strategy. Right now, roughly 25-30% of pages return little or no usable content even when fetched successfully.
- Improve anti-bot handling for page fetches. Many targets still return 400-range errors, so investigate stronger browser mimicry (Playwright/Chromium behavior, headers, fingerprinting, and potentially user-agent/profile rotation).
MIT. See LICENSE.