HTMLMiner

Agentic web domain analyzer powered by Google Gemini and Firecrawl.

Installation

What is uv?

uv is a fast Python package and project manager from Astral. It installs Python (if needed), creates an isolated environment, and manages dependencies so the CLI runs consistently across machines.

Install uv

Pick one option, then confirm with uv --version.

macOS (Homebrew)

brew install uv

Linux/macOS (installer script)

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell)

irm https://astral.sh/uv/install.ps1 | iex

Windows (winget)

winget install --id Astral.UV

Install HTMLMiner (recommended)

Install the CLI directly from GitHub:

uv tool install git+https://github.com/andreifoldes/htmlminer.git

This makes the htmlminer command available globally.

Update HTMLMiner

Pull the latest version with:

uv tool upgrade htmlminer

Install for development

From the project root (the folder that contains pyproject.toml), install HTMLMiner in editable mode:

uv pip install -e .

This creates a local virtual environment (if needed), installs dependencies, and links the package to your working copy so changes in src/ are picked up immediately without reinstalling.

Platform notes

HTMLMiner runs on Windows, macOS, and Linux. uv takes care of Python, virtual environments, and dependencies across platforms.

API Key Setup

HTMLMiner needs API keys to function. You have two options:

Option 1: Interactive Prompts (Recommended for first-time users)

  • Just run any command - if API keys are missing, you'll be prompted to enter them
  • The CLI will securely ask for your keys and offer to save them to .env automatically
  • Keys are hidden during input for security

Option 2: Manual Setup

  • Copy .env.template to .env
  • Add your API keys:
    GEMINI_API_KEY=your_key_here
    FIRECRAWL_API_KEY=your_key_here  # Optional for some modes

Required Keys:

  • GEMINI_API_KEY - Required for all extraction modes. Get it from Google AI Studio
  • FIRECRAWL_API_KEY - Required for --agent mode, optional but recommended for --engine firecrawl. Get it from Firecrawl
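
To double-check that both keys are visible to the CLI, here is a minimal sketch. It assumes the python-dotenv package is available and that your keys live in .env; neither detail is part of HTMLMiner's documented API, this is just a convenient way to verify your setup:

import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Pull variables from a local .env file (if present) into the process environment.
load_dotenv()

for key in ("GEMINI_API_KEY", "FIRECRAWL_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")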

How it Works

The following diagram illustrates the decision tree and logic flow used by HTMLMiner to extract features from websites:

graph TD
    Start[Start: htmlminer process] --> CheckMode{Mode: --agent?}
    
    %% Firecrawl Agent Branch
    CheckMode -- Yes --> FirecrawlAgent[Run Firecrawl Agent SDK]
    FirecrawlAgent --> BuildSchema[Build Dynamic Schema from Config]
    BuildSchema --> CallAgent[Call firecrawl.app.agent]
    CallAgent --> Output[Save Results]

    %% LangGraph Workflow Branch
    CheckMode -- No --> LangGraph[LangGraph Workflow]
    
    subgraph LangGraph Workflow
        CheckEngine{Engine & Smart Mode?}
        
        CheckEngine -- "Firecrawl + Smart" --> FetchSitemap[fetch_sitemap]
        FetchSitemap --> FilterURLs[filter_urls]
        FilterURLs --> SelectPages[select_pages<br/>LLM via LangChain]
        SelectPages --> ScrapePages[scrape_pages]
        
        CheckEngine -- "Trafilatura / Simple" --> SimpleCrawl[simple_crawl]
        SimpleCrawl --> ExtractPages
        ScrapePages --> ExtractPages
        
        ExtractPages[extract_pages<br/>LangExtract per page] --> Synthesize[synthesize<br/>LLM via LangChain]
    end
    
    Synthesize --> Output

Usage

Batch Processing (File)

htmlminer process --file test_urls.md

Single URL Processing

htmlminer process --url https://deepmind.google/about/

Using Firecrawl Engine

htmlminer process --url https://openai.com/safety/ --engine firecrawl

Controlling Summary Length

Limit the maximum number of paragraphs per dimension (Risk, Goal, Method) in the final summary (default: 3):

htmlminer process --url https://anthropic.com/ --max-paragraphs 2

Limiting Synthesis Context

Cap how many of the longest snippets per feature are fed into synthesis (default: 50):

htmlminer process --url https://anthropic.com/ --synthesis-top 50

Choosing Gemini Model Tier

Select a cheaper or more capable model for extraction and synthesis:

htmlminer process --url https://anthropic.com/ --gemini-tier expensive

Agent Mode (Firecrawl Agent SDK)

Use Firecrawl's Agent SDK for autonomous web research. This mode uses Spark 1 models to intelligently search and extract data without manual crawling:

# Default: spark-1-mini (cost-efficient)
htmlminer process --url https://example.com --agent

# For complex tasks: spark-1-pro (better accuracy)
htmlminer process --url https://example.com --agent --spark-model pro

Note: Agent mode requires FIRECRAWL_API_KEY and uses Firecrawl's credit-based billing.

CLI Output (Results + Token Usage)

After a run, the CLI prints:

  • Agentic Extraction Results: a table with one row per URL and one column per configured feature (e.g., Risk/Goal/Method). The Counts column shows how many raw extracts were found for each feature.
  • Token Usage Report: per-step breakdown of model usage, including call counts and total duration.
    • Prompt Tokens: input tokens sent to the model (the scraped content plus instructions).
    • Completion Tokens: output tokens generated by the model.
    • Total Tokens: the sum of prompt + completion tokens.

Configuration

config.json

config.json controls what the agent extracts. Each entry in features defines:

  • name: the label for the dimension in results
  • description: what the extractor should look for
  • synthesis_topic: how the summary for that dimension should be framed

Example configuration:

{
    "features": [
        {
            "name": "Risk",
            "description": "Any mentioned risks, dangers, or negative impacts of AI development.",
            "synthesis_topic": "Their risk assessment (ie what risks does AI development pose)"
        },
        {
            "name": "Goal",
            "description": "High-level goals, missions, or objectives (e.g., 'AI alignment' or global AI agreement).",
            "synthesis_topic": "The goals (often pretty high-level e.g. 'AI alignment')"
        },
        {
            "name": "Method",
            "description": "Strategies, activities, or actions taken to achieve the goals (research, grantmaking, policy work, etc.).",
            "synthesis_topic": "The methods used in service of the goals"
        }
    ]
}

If you add or edit features, keep valid JSON and the same field names. A malformed config.json will stop the run with a parse error.
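
A quick way to catch problems before a run is to load the file with Python's json module. This is only a sketch: the config.json file name and the three field names come from this section, while the validation logic itself is not part of HTMLMiner:

import json
import sys

REQUIRED_FIELDS = {"name", "description", "synthesis_topic"}

with open("config.json", encoding="utf-8") as fh:
    try:
        config = json.load(fh)  # a malformed file raises json.JSONDecodeError here
    except json.JSONDecodeError as exc:
        sys.exit(f"config.json is not valid JSON: {exc}")

features = config.get("features", [])
for feature in features:
    missing = REQUIRED_FIELDS - feature.keys()
    if missing:
        sys.exit(f"feature {feature.get('name', '<unnamed>')!r} is missing: {sorted(missing)}")

print(f"OK: {len(features)} features defined")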

Improving Extraction Quality

Try small CLI tweaks before changing code; a combined example follows this list:

  • Model tier: --gemini-tier expensive yields better extraction quality at higher cost.
  • Smart crawling: Use --smart to automatically find and include content from sub-pages (e.g., /about, /research). This vastly improves context for the agent.
  • Feature page limit: Use --limit to cap how many pages per feature are selected from the sitemap when smart crawling is enabled.
  • Engine choice: --engine firecrawl often captures richer content; --engine trafilatura can be cleaner for text-heavy pages.
  • Summary depth: increase --max-paragraphs for more detail (or reduce it for faster, tighter outputs).
  • Input scope: use --file with a curated URL list to avoid low-signal pages.
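
For example, several of these tweaks can be combined in a single run; every flag below is documented under Full CLI Options:

htmlminer process --url https://anthropic.com/ --engine firecrawl --smart --limit 5 --gemini-tier expensive --max-paragraphs 5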

Page Relevance Scores

When using smart crawling with Firecrawl, HTMLMiner assigns a relevance score (1-10) to each page selected for scraping. These scores indicate how relevant the LLM believes each page is for extracting the configured features (Risk, Goal, Method by default).

  • Scores are saved to the page_relevance table in htmlminer_logs.db
  • Scores are included in results.json under metadata.relevance_scores
  • A score of 10 means highly relevant; 5 is the default for heuristically selected pages

To query relevance scores:

sqlite3 logs/htmlminer_logs.db "SELECT page_url, feature, relevance_score FROM page_relevance ORDER BY relevance_score DESC LIMIT 10;"

Note: there is no dedicated verbosity flag yet. For troubleshooting, check htmlminer_logs.db and consider adding a verbosity option if you need more console detail.

Inspecting Raw Snapshots

If you want to view the raw HTML/Markdown content that was scraped for a specific URL, you can query the internal SQLite database using the command line:

sqlite3 logs/htmlminer_logs.db "SELECT content FROM snapshots WHERE url = 'https://example.com' LIMIT 1;"

Or list the latest 5 snapshots:

sqlite3 logs/htmlminer_logs.db "SELECT url, timestamp FROM snapshots ORDER BY timestamp DESC LIMIT 5;"

Windows Note: If sqlite3 is not available on your Windows system, you can:

  • Install it via winget install SQLite.SQLite or download from sqlite.org
  • Use a GUI tool like DB Browser for SQLite
  • Query the database using Python: python -c "import sqlite3; conn = sqlite3.connect('logs/htmlminer_logs.db'); print(conn.execute('SELECT url FROM snapshots').fetchall())"
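
The same queries can also be wrapped in a short script instead of a one-liner; the snapshots table and its url, timestamp, and content columns are the ones used in the commands above:

import sqlite3

conn = sqlite3.connect("logs/htmlminer_logs.db")

# Mirror of the "latest 5 snapshots" query shown above.
for url, timestamp in conn.execute(
    "SELECT url, timestamp FROM snapshots ORDER BY timestamp DESC LIMIT 5"
):
    print(timestamp, url)

conn.close()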

Full CLI Options

  --file TEXT             Path to markdown file containing URLs
  --url TEXT              Single URL to process
  --output TEXT           Path to output file [default: results.json]
  --engine TEXT           Engine to use: 'firecrawl' or 'trafilatura' [default: firecrawl]
  --max-paragraphs INT    Max paragraphs per dimension in agentic summary [default: 3]
  --llm-timeout INT       Timeout in seconds for LLM requests (Gemini/DSPy), capped at 600 [default: 600]
  --gemini-tier TEXT      Gemini model tier: 'cheap' or 'expensive' [default: cheap]
  --smart                 Enable smart crawling to include sub-pages [default: True]
  --limit INT             Max pages per feature from sitemap when using --smart [default: 10]
  --agent                 Use Firecrawl Agent SDK for extraction (requires FIRECRAWL_API_KEY)
  --spark-model TEXT      Spark model for --agent mode: 'mini' or 'pro' [default: mini]
  --langextract           Enable LangExtract for intermediate extraction. If disabled (default), full page content is used for synthesis.
  --langextract-max-char-buffer INT  Max chars per chunk for LangExtract [default: 50000]
  --help                  Show this message and exit.
