Agentic web domain analyzer powered by Google Gemini and Firecrawl.
uv is a fast Python package and project manager from Astral. It installs Python (if needed), creates an isolated environment, and manages dependencies so the CLI runs consistently across machines.
Pick one option, then confirm with `uv --version`.
macOS (Homebrew)
```bash
brew install uv
```

Linux/macOS (installer script)

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Windows (PowerShell)

```powershell
irm https://astral.sh/uv/install.ps1 | iex
```

Windows (winget)

```powershell
winget install --id Astral.UV
```

Install the CLI directly from GitHub:

```bash
uv tool install git+https://github.com/andreifoldes/htmlminer.git
```

This makes the `htmlminer` command available globally.
Pull the latest version with:
```bash
uv tool upgrade htmlminer
```

From the project root (the folder that contains `pyproject.toml`), install HTMLMiner in editable mode:
```bash
uv pip install -e .
```

This creates a local virtual environment (if needed), installs dependencies, and links the package to your working copy so changes in `src/` are picked up immediately without reinstalling.
HTMLMiner runs on Windows, macOS, and Linux. uv takes care of Python, virtual environments, and dependencies across platforms.
HTMLMiner needs API keys to function. You have two options:
Option 1: Interactive Prompts (Recommended for first-time users)
- Just run any command - if API keys are missing, you'll be prompted to enter them
- The CLI will securely ask for your keys and offer to save them to `.env` automatically
- Keys are hidden during input for security
Option 2: Manual Setup
- Copy `.env.template` to `.env`
- Add your API keys:

```
GEMINI_API_KEY=your_key_here
FIRECRAWL_API_KEY=your_key_here  # Optional for some modes
```
Required Keys:
- `GEMINI_API_KEY` - Required for all extraction modes. Get it from Google AI Studio
- `FIRECRAWL_API_KEY` - Required for `--agent` mode, optional but recommended for `--engine firecrawl`. Get it from Firecrawl
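Before kicking off a long run, it can be worth confirming that the keys are actually visible. The snippet below is a minimal standalone check, not part of HTMLMiner; it assumes `python-dotenv` is installed and that `.env` sits in the directory you run it from:

```python
# check_keys.py - standalone sanity check, not part of HTMLMiner.
# Assumes python-dotenv is installed (e.g. `uv pip install python-dotenv`)
# and that .env lives in the current directory.
import os

from dotenv import load_dotenv

load_dotenv()  # merge values from .env into the environment, if the file exists

for key in ("GEMINI_API_KEY", "FIRECRAWL_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```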
The following diagram illustrates the decision tree and logic flow used by HTMLMiner to extract features from websites:
```mermaid
graph TD
    Start[Start: htmlminer process] --> CheckMode{Mode: --agent?}

    %% Firecrawl Agent Branch
    CheckMode -- Yes --> FirecrawlAgent[Run Firecrawl Agent SDK]
    FirecrawlAgent --> BuildSchema[Build Dynamic Schema from Config]
    BuildSchema --> CallAgent[Call firecrawl.app.agent]
    CallAgent --> Output[Save Results]

    %% LangGraph Workflow Branch
    CheckMode -- No --> LangGraph[LangGraph Workflow]

    subgraph LangGraph Workflow
        CheckEngine{Engine & Smart Mode?}
        CheckEngine -- "Firecrawl + Smart" --> FetchSitemap[fetch_sitemap]
        FetchSitemap --> FilterURLs[filter_urls]
        FilterURLs --> SelectPages[select_pages<br/>LLM via LangChain]
        SelectPages --> ScrapePages[scrape_pages]
        CheckEngine -- "Trafilatura / Simple" --> SimpleCrawl[simple_crawl]
        SimpleCrawl --> ExtractPages
        ScrapePages --> ExtractPages
        ExtractPages[extract_pages<br/>LangExtract per page] --> Synthesize[synthesize<br/>LLM via LangChain]
    end

    Synthesize --> Output
```
```bash
htmlminer process --file test_urls.md
htmlminer process --url https://deepmind.google/about/
htmlminer process --url https://openai.com/safety/ --engine firecrawl
```

Limit the max paragraphs per dimension (Risk, Goal, Method) in the final summary (default is 3):

```bash
htmlminer process --url https://anthropic.com/ --max-paragraphs 2
```

Cap the number of longest snippets per feature that are fed into synthesis (default is 50):

```bash
htmlminer process --url https://anthropic.com/ --synthesis-top 50
```

Select a cheaper or more capable model for extraction and synthesis:

```bash
htmlminer process --url https://anthropic.com/ --gemini-tier expensive
```

Use Firecrawl's Agent SDK for autonomous web research. This mode uses Spark 1 models to intelligently search and extract data without manual crawling:
```bash
# Default: spark-1-mini (cost-efficient)
htmlminer process --url https://example.com --agent

# For complex tasks: spark-1-pro (better accuracy)
htmlminer process --url https://example.com --agent --spark-model pro
```

Note: Agent mode requires `FIRECRAWL_API_KEY` and uses Firecrawl's credit-based billing.
After a run, the CLI prints:
- Agentic Extraction Results: a table with one row per URL and one column per configured feature (e.g., Risk/Goal/Method). The `Counts` column shows how many raw extracts were found for each feature.
- Token Usage Report: per-step breakdown of model usage, including call counts and total duration.
  - Prompt Tokens: input tokens sent to the model (the scraped content plus instructions).
  - Completion Tokens: output tokens generated by the model.
  - Total Tokens: the sum of prompt + completion tokens.
`config.json` controls what the agent extracts. Each entry in `features` defines:

- `name`: the label for the dimension in results
- `description`: what the extractor should look for
- `synthesis_topic`: how the summary for that dimension should be framed
Example configuration:
```json
{
  "features": [
    {
      "name": "Risk",
      "description": "Any mentioned risks, dangers, or negative impacts of AI development.",
      "synthesis_topic": "Their risk assessment (ie what risks does AI development pose)"
    },
    {
      "name": "Goal",
      "description": "High-level goals, missions, or objectives (e.g., 'AI alignment' or global AI agreement).",
      "synthesis_topic": "The goals (often pretty high-level e.g. 'AI alignment')"
    },
    {
      "name": "Method",
      "description": "Strategies, activities, or actions taken to achieve the goals (research, grantmaking, policy work, etc.).",
      "synthesis_topic": "The methods used in service of the goals"
    }
  ]
}
```

If you add or edit features, keep valid JSON and the same field names. A malformed `config.json` will stop the run with a parse error.
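To catch problems before a long run, a quick standalone check of the file can help. This is an illustrative sketch, not part of HTMLMiner; it assumes `config.json` is in the current directory and checks only the three field names shown above:

```python
# validate_config.py - illustrative helper, not part of HTMLMiner.
# Assumes config.json sits in the current directory.
import json
import sys

REQUIRED_FIELDS = {"name", "description", "synthesis_topic"}

try:
    with open("config.json", encoding="utf-8") as f:
        config = json.load(f)
except FileNotFoundError:
    sys.exit("config.json not found in the current directory")
except json.JSONDecodeError as exc:
    sys.exit(f"config.json is not valid JSON: {exc}")

features = config.get("features", [])
if not features:
    sys.exit("config.json has no 'features' entries")

for i, feature in enumerate(features):
    missing = REQUIRED_FIELDS - feature.keys()
    if missing:
        sys.exit(f"feature #{i} is missing: {', '.join(sorted(missing))}")

print(f"OK: {len(features)} feature(s): {[f['name'] for f in features]}")
```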
Try small CLI tweaks before changing code:
- Model tier: `--gemini-tier expensive` yields better extraction quality at higher cost.
- Smart crawling: Use `--smart` to automatically find and include content from sub-pages (e.g., `/about`, `/research`). This vastly improves context for the agent.
- Feature page limit: Use `--limit` to cap how many pages per feature are selected from the sitemap when smart crawling is enabled.
- Engine choice: `--engine firecrawl` often captures richer content; `--engine trafilatura` can be cleaner for text-heavy pages.
- Summary depth: increase `--max-paragraphs` for more detail (or reduce it for faster, tighter outputs).
- Input scope: use `--file` with a curated URL list to avoid low-signal pages.
When using smart crawling with Firecrawl, HTMLMiner assigns a relevance score (1-10) to each page selected for scraping. These scores indicate how relevant the LLM believes each page is for extracting the configured features (Risk, Goal, Method by default).
- Scores are saved to the `page_relevance` table in `htmlminer_logs.db`
- Scores are included in `results.json` under `metadata.relevance_scores`
- A score of 10 means highly relevant; 5 is the default for heuristically selected pages
To query relevance scores:
```bash
sqlite3 logs/htmlminer_logs.db "SELECT page_url, feature, relevance_score FROM page_relevance ORDER BY relevance_score DESC LIMIT 10;"
```

Note: there is no dedicated verbosity flag yet. For troubleshooting, check `htmlminer_logs.db` and consider adding a verbosity option if you need more console detail.
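If the sqlite3 CLI is not installed, the same query can be run with Python's built-in `sqlite3` module. A minimal sketch, assuming only the `logs/htmlminer_logs.db` path and `page_relevance` columns shown above:

```python
# Illustrative sketch: read relevance scores with Python's standard-library sqlite3.
# Assumes the logs/htmlminer_logs.db path and page_relevance columns shown above.
import sqlite3

conn = sqlite3.connect("logs/htmlminer_logs.db")
rows = conn.execute(
    "SELECT page_url, feature, relevance_score "
    "FROM page_relevance ORDER BY relevance_score DESC LIMIT 10"
).fetchall()
conn.close()

for page_url, feature, score in rows:
    print(f"{score:>2}  {feature:<10}  {page_url}")
```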
If you want to view the raw HTML/Markdown content that was scraped for a specific URL, you can query the internal SQLite database using the command line:
```bash
sqlite3 logs/htmlminer_logs.db "SELECT content FROM snapshots WHERE url = 'https://example.com' LIMIT 1;"
```

Or list the latest 5 snapshots:

```bash
sqlite3 logs/htmlminer_logs.db "SELECT url, timestamp FROM snapshots ORDER BY timestamp DESC LIMIT 5;"
```

Windows Note: If sqlite3 is not available on your Windows system, you can:
- Install it via `winget install SQLite.SQLite` or download from sqlite.org
- Use a GUI tool like DB Browser for SQLite
- Query the database using Python:

```bash
python -c "import sqlite3; conn = sqlite3.connect('logs/htmlminer_logs.db'); print(conn.execute('SELECT url FROM snapshots').fetchall())"
```
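For anything beyond a quick listing, the one-liner is easier to adapt as a short script. This is an illustrative sketch that assumes only the `snapshots` columns (`url`, `timestamp`, `content`) used in the queries above; swap in a URL you have actually processed:

```python
# Illustrative sketch: dump the scraped content stored for one URL.
# Assumes the snapshots columns (url, timestamp, content) used in the queries above.
import sqlite3

url = "https://example.com"  # replace with a URL you have processed

conn = sqlite3.connect("logs/htmlminer_logs.db")
row = conn.execute(
    "SELECT content FROM snapshots WHERE url = ? ORDER BY timestamp DESC LIMIT 1",
    (url,),
).fetchone()
conn.close()

print(row[0] if row else f"No snapshot found for {url}")
```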
```
--file TEXT                        Path to markdown file containing URLs
--url TEXT                         Single URL to process
--output TEXT                      Path to output file [default: results.json]
--engine TEXT                      Engine to use: 'firecrawl' or 'trafilatura' [default: firecrawl]
--max-paragraphs INT               Max paragraphs per dimension in agentic summary [default: 3]
--llm-timeout INT                  Timeout in seconds for LLM requests (Gemini/DSPy), capped at 600 [default: 600]
--gemini-tier TEXT                 Gemini model tier: 'cheap' or 'expensive' [default: cheap]
--smart                            Enable smart crawling to include sub-pages [default: True]
--limit INT                        Max pages per feature from sitemap when using --smart [default: 10]
--agent                            Use Firecrawl Agent SDK for extraction (requires FIRECRAWL_API_KEY)
--spark-model TEXT                 Spark model for --agent mode: 'mini' or 'pro' [default: mini]
--langextract                      Enable LangExtract for intermediate extraction. If disabled (default), full page content is used for synthesis.
--langextract-max-char-buffer INT  Max chars per chunk for LangExtract [default: 50000]
--help                             Show this message and exit.
```