Skip to content

AuthBits/webmcp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

webmcp

webmcp is an MCP server for web search and content extraction. LLM agents can use it to:

  • search the web with DuckDuckGo (default) or SearXNG (optional)
  • fetch and clean page content from one or more URLs
  • send cleaned content to a local LLM for structured extraction

Features

  • search_web(query, limit=10) returns web results (title, URL, description)
  • extract(urls, prompt=None, schema=None, use_browser=True) extracts data from pages
  • browser-based fetching with Playwright for JavaScript-heavy sites
  • lightweight HTTP fetching mode for faster/simple pages
  • persistent tool-call logging to tool_calls.log.json
  • configurable search provider: DDG by default, optional SearXNG

Critical Requirement

For the main researcher llama.cpp server, include --webui-mcp-proxy in launch parameters. Without this flag, this workflow will not function correctly.

Prompting And Tested Setup

For best results, use research_prompt.txt as your system prompt. This prompt is a core part of the intended workflow and quality; it is effectively half of how this repository is meant to function.

Tested setup:

  • Main researcher LLM: Qwen3.5:27b-Q3_K_M.gguf via llama.cpp on an RTX 4090, context length 200,000, about 40 tok/s.
  • Extract tool LLM: Qwen3.5:9b-Q4_K_M.gguf via llama.cpp on a GTX 1080 Ti, context length 32,768, about 40 tok/s.
  • This workflow has been tested with the llama.cpp WebUI specifically, and has not been validated with other MCP clients yet.

Requirements

  • Python 3.10+
  • A local OpenAI-compatible LLM endpoint (for example, llama.cpp, LM Studio, vLLM, ollama, etc)

Configuration

The app reads LLM settings from environment variables and supports a local .env file.

  1. Copy .env.example to .env
  2. Set values:
LLM_URL=http://localhost:1234
LLM_MODEL=your-model-name
SEARCH_PROVIDER=ddg
# Optional when SEARCH_PROVIDER=searxng
SEARXNG_URL=http://localhost:8080

LLM_URL and LLM_MODEL are required at startup. SEARCH_PROVIDER defaults to ddg. Set it to searxng to replace DDG, and provide SEARXNG_URL.

Search Providers

search_web supports two providers:

  • ddg (default): uses DuckDuckGo via ddgs
  • searxng: uses your SearXNG instance

SearXNG notes:

  • Set SEARCH_PROVIDER=searxng
  • Set SEARXNG_URL to your instance base URL (for example, http://192.168.0.55:8888)
  • webmcp calls <SEARXNG_URL>/search with format=json

Install

Install dependencies from the pinned requirements file:

pip install -r requirements.txt
python -m playwright install chromium

Run

python app.py

Server starts on:

  • http://0.0.0.0:8642

MCP Usage Notes

  • extract(..., use_browser=True) is best for dynamic pages that require JS rendering.
  • extract(..., use_browser=False) is faster for static pages.
  • If extraction quality is poor, the LLM should provide a more specific prompt and/or a stricter schema.

TODO

  • Revisit JS page rendering and extraction strategy. Right now, roughly 25-30% of pages return little or no usable content even when fetched successfully.
  • Improve anti-bot handling for page fetches. Many targets still return 400-range errors, so investigate stronger browser mimicry (Playwright/Chromium behavior, headers, fingerprinting, and potentially user-agent/profile rotation).

License

MIT. See LICENSE.

About

A lightweight, prompt-driven MCP web research server for high-quality LLM powered information extraction.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages