feat(api): add image extraction support to v2 scrape endpoint#2008

Merged
mogery merged 5 commits into firecrawl:main from vishkrish200:vishnu/eng-3180-add-scraping-feature-to-return-array-of-images
Aug 27, 2025

Conversation

@vishkrish200
Contributor

@vishkrish200 vishkrish200 commented Aug 22, 2025

Add support for extracting all images from webpages

Summary

Adds a new images format to v2 scrape endpoints that extracts all images found on a webpage, regardless of file extension or source type. This addresses cases where images don't appear in the regular links array and don't have typical image file extensions.

Problem Solved

"One thing if you also add in response for array of images option otherwise we separately find images via links array and some time that links not ending with image extension like jpeg,png and more if you add this feature also then it is Helpful"

Our solution finds ALL images on a page, including those that wouldn't be discoverable through the links format.

Key Changes

  • New format type: "images" added to v2 API formats array
  • Comprehensive extraction: Extracts from 8+ different image sources
  • Smart URL resolution: Handles relative, absolute, protocol-relative URLs
  • Security filtering: Blocks dangerous javascript: URLs
  • No extension dependency: Finds images regardless of URL structure
  • Rust performance: Non-blocking HTML parsing with Cheerio fallback
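
The URL-resolution and filtering rules listed above can be sketched in a few lines. This is a minimal illustration rather than the PR's Rust code: `resolveImageUrl` is a hypothetical name, and the pass-through for `data:` and `blob:` URLs follows the behaviour described in the review summary of this PR.

```typescript
// Hypothetical helper (not the PR's actual implementation): resolves a raw
// image reference against a base URL and drops javascript: URLs.
function resolveImageUrl(raw: string, baseUrl: string): string | null {
  const trimmed = raw.trim();
  if (trimmed === "") return null;

  // Block script URLs before any resolution is attempted.
  if (trimmed.toLowerCase().startsWith("javascript:")) return null;

  // data: and blob: URLs are self-contained, so pass them through unchanged.
  if (/^(data|blob):/i.test(trimmed)) return trimmed;

  try {
    // The WHATWG URL constructor handles relative, absolute, and
    // protocol-relative ("//cdn.example.com/...") inputs against the base.
    return new URL(trimmed, baseUrl).href;
  } catch {
    // Unparseable input is skipped rather than propagated as an error.
    return null;
  }
}
```

When a page declares a `<base>` tag, its `href` would be passed as `baseUrl` in place of the page URL.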

Usage Example

{
  "url": "https://example.com",
  "formats": ["images"]
}
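
For callers not using an SDK, the request body above could be assembled roughly as follows. The endpoint URL and the Bearer-token header are assumptions made for illustration and are not stated in this PR. `buildImagesRequest` is a hypothetical helper that only prepares the request and does not send it.

```typescript
// Assumption: endpoint path and Bearer auth are illustrative, not confirmed by this PR.
const SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v2/scrape";

interface ScrapeImagesRequest {
  url: string;
  formats: string[];
}

interface PreparedRequest {
  endpoint: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the request for the "images" format without sending it.
function buildImagesRequest(pageUrl: string, apiKey: string): PreparedRequest {
  const payload: ScrapeImagesRequest = { url: pageUrl, formats: ["images"] };
  return {
    endpoint: SCRAPE_ENDPOINT,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be handed to `fetch(req.endpoint, { method: req.method, headers: req.headers, body: req.body })`.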

Response Format

{
  "success": true,
  "data": {
    "images": [
      "https://example.com/logo.png",           // From <img> tag
      "https://example.com/og-image.jpg",       // From meta tag  
      "https://example.com/favicon.ico",        // From link tag
      "https://example.com/hero-bg.webp",       // From CSS background
      "https://cdn.example.com/product.avif"    // Modern formats
    ]
  }
}
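
A client might read the `images` array defensively, along these lines. The type and function names are hypothetical, and only the `success` and `data.images` fields shown in the example above are assumed.

```typescript
// Hypothetical shape covering only the fields shown in the example response.
interface ImagesPayload {
  success?: boolean;
  data?: { images?: unknown };
}

// Returns the image URLs from a parsed response, or an empty array when the
// request failed or the "images" format was not requested.
function imagesFrom(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) return [];
  const { success, data } = payload as ImagesPayload;
  if (success !== true || data === undefined || !Array.isArray(data.images)) {
    return [];
  }
  // Keep only string entries so callers can rely on string[].
  return data.images.filter((u: unknown): u is string => typeof u === "string");
}
```

The `//` annotations in the example response are explanatory only; a real response is plain JSON, and `JSON.parse` would reject those comments.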

Image Sources Supported

  • HTML Elements: <img> tags (src, data-src, srcset), <picture> elements, <video poster>
  • Meta Tags: Open Graph (og:image), Twitter Cards (twitter:image), Schema.org
  • Link Tags: Favicons, apple-touch-icons, image_src
  • CSS Styles: Inline background-image properties
  • Modern Web: Responsive images, lazy loading, WebP/AVIF formats
  • URL Handling: Data URIs, protocol-relative URLs, base tag support
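
Two of the sources above, `srcset` and inline CSS, need a little parsing before URL resolution. The sketch below is a simplified illustration with hypothetical helper names, not the PR's Rust extractor. It splits `srcset` on commas, so it would mishandle URLs that contain commas (such as data URIs), and it collects every `url(...)` in a style string rather than only those in background properties.

```typescript
// Splits a srcset value into candidate URLs, dropping width/density descriptors.
// Simplification: splitting on "," breaks URLs that themselves contain commas.
function parseSrcset(srcset: string): string[] {
  return srcset
    .split(",")
    .map((candidate) => candidate.trim().split(/\s+/)[0])
    .filter((url): url is string => typeof url === "string" && url.length > 0);
}

// Collects url(...) references from an inline style attribute value.
// Simplification: matches any url(), not only background-image declarations.
function styleUrls(style: string): string[] {
  const urls: string[] = [];
  const pattern = /url\(\s*(['"]?)([^'")]+)\1\s*\)/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(style)) !== null) {
    urls.push(match[2].trim());
  }
  return urls;
}
```

Each returned value is still a raw reference and would be resolved against the page or `<base>` URL afterwards.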

Implementation

  • Rust core: extract_images function in html-transformer shared library
  • TypeScript wrapper: Graceful fallback to Cheerio if Rust fails
  • Transformer integration: deriveImagesFromHTML in scraping pipeline
  • Format-based: Added to formats array, not as separate property
  • SDK support: Added to both JavaScript and Python SDKs
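
The fallback behaviour described for the TypeScript wrapper can be pictured as a small higher-order function. This is a sketch under stated assumptions: `withFallback` and `ImageExtractor` are hypothetical names, and the extractors are kept synchronous for brevity even though the PR describes the Rust path as non-blocking.

```typescript
// Hypothetical signature shared by the primary (native) and fallback extractors.
type ImageExtractor = (html: string, baseUrl: string) => string[];

// Wraps a primary extractor so that any thrown error routes the call to the
// fallback instead of failing the scrape. onFailure is an optional logging hook.
function withFallback(
  primary: ImageExtractor,
  fallback: ImageExtractor,
  onFailure?: (err: unknown) => void,
): ImageExtractor {
  return (html, baseUrl) => {
    try {
      return primary(html, baseUrl);
    } catch (err) {
      if (onFailure) onFailure(err);
      return fallback(html, baseUrl);
    }
  };
}
```

In this sketch the caller never sees the primary failure. It only reaches `onFailure`, which is one place structured logging could hook in.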

Performance Benefits

  • Non-blocking: Rust implementation prevents event loop freezing
  • Memory efficient: Lower memory footprint vs pure JavaScript
  • Graceful degradation: Cheerio fallback ensures reliability

Live Test Results

| Website | Images Found | Examples |
| --- | --- | --- |
| news.ycombinator.com | 2 | Y Combinator logo, tracking pixel |
| github.com | 22 | CDN assets, logos, modern WebP/SVG formats |
| httpbin.org/html | 0 | Simple page (correct behavior) |

Breaking Changes

None - fully backward compatible addition.

Comparison: Images vs Links

# GitHub.com test results:
Links found: 24 (navigation, pages, anchors)
Images found: 22 (visual assets, completely different content)

# Value: the images format discovers content not available via the links format

Summary by cubic

Adds an "images" format to v2 scrape that returns all images on a page, including ones not discoverable via links or file extensions. Addresses ENG-3180 with a Rust-based extractor and a safe JS fallback.

  • New Features
    • New format: images → string[] of resolved image URLs.
    • Sources: img (src, data-src, srcset), picture, meta (og/twitter/schema), link icons, video poster, inline CSS backgrounds.
    • URL handling: respects base tag; supports relative/absolute/protocol-relative; allows data/blob; filters javascript:.
    • Integrated in transformer pipeline and types (v1/v2); added to JS and Python SDKs.
    • Tests: unit and e2e in API and SDKs. Backward compatible.

- Add comprehensive image extraction from HTML using Rust
- Extract from img tags (src, data-src, srcset), picture elements
- Extract from meta tags (og:image, twitter:image, schema.org)
- Extract from link tags (favicons, apple-touch-icons)
- Extract from video poster attributes
- Handle URL resolution (relative, absolute, protocol-relative)
- Filter out javascript: URLs for security
- Use proper .join() method for URL resolution (addresses PR feedback)
- Add extractImages function to html-transformer.ts
- Create extractImages.ts library with Cheerio fallback
- Graceful error handling with fallback to pure JavaScript
- Comprehensive URL resolution and image source support
- Add 'images' to Format union type and FormatObject
- Add images field to Document types (v1 needed for shared transformers)
- Integrate deriveImagesFromHTML in transformer pipeline
- Add proper format validation and coercion
- Include structured logging context (addresses PR feedback)
- Add 'images' to FormatString type definitions
- Add images field to Document interfaces
- Add images support to ScrapeFormats class
- Maintain backward compatibility
- Add unit tests for extractImages function
- Add API integration tests for images format
- Add JS SDK e2e tests for images extraction
- Add Python SDK e2e tests for images extraction
- Test multiple format combinations and edge cases
- Follow PR feedback: tests in SDKs instead of example files
@vishkrish200 vishkrish200 requested a review from mogery as a code owner August 22, 2025 15:12
@vishkrish200
Contributor Author

Hi @mogery! I've implemented all the changes you requested on PR #2003 in my fresh fork:

Re: "No need to add to v1" - I had to add images?: string[] to the v1 Document type because:

  • The transformers (apps/api/src/scraper/scrapeURL/transformers/index.ts) import from v1/types.ts
  • These transformers are shared between both v1 and v2 APIs
  • Without the field in v1 Document, TypeScript throws: Property 'images' does not exist on type 'Document'

Is this approach acceptable, or would you prefer a different solution for the shared transformer architecture?
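
To make the constraint concrete, here is a reduced sketch of a transformer typed against a shared document type. `SharedDocument` and `deriveImagesSketch` are hypothetical stand-ins (the real `Document` type and `deriveImagesFromHTML` are not reproduced here), and the `<img src>` regex is only a placeholder for the real extractor.

```typescript
// Simplified stand-in for the shared document type; the real one has many more fields.
interface SharedDocument {
  rawHtml?: string;
  markdown?: string;
  // Optional, so documents produced without the "images" format are unaffected.
  // Removing this line makes the assignment below a compile error.
  images?: string[];
}

// Hypothetical transformer: fills `images` only when the format was requested.
function deriveImagesSketch(doc: SharedDocument, formats: string[]): SharedDocument {
  const html = doc.rawHtml;
  if (formats.indexOf("images") === -1 || html === undefined) return doc;

  const images: string[] = [];
  // Placeholder extraction: quoted src attributes on <img> tags only.
  const imgSrc = /<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = imgSrc.exec(html)) !== null) {
    images.push(match[1]);
  }
  return { ...doc, images };
}
```

Because the field is optional, callers that never request the format keep compiling and see `images` as `undefined`.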

I added these tests as requested:

  • JS SDK: apps/js-sdk/firecrawl/src/__tests__/e2e/v2/scrape.test.ts
  • Python SDK: apps/python-sdk/firecrawl/__tests__/e2e/v2/test_scrape.py

Are these sufficient, or should I add image format testing to existing test cases instead of separate test methods?

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 12 files


Comment thread apps/api/src/__tests__/snips/v2/scrape.test.ts
Comment thread apps/js-sdk/firecrawl/src/__tests__/e2e/v2/scrape.test.ts
Comment thread apps/api/sharedLibs/html-transformer/src/lib.rs
Comment thread apps/api/sharedLibs/html-transformer/src/lib.rs
@mogery mogery merged commit 30c6bdd into firecrawl:main Aug 27, 2025
8 of 12 checks passed