feat(api): add image extraction support to v2 scrape endpoint by vishkrish200 · Pull Request #2008 · firecrawl/firecrawl

vishkrish200 · 2025-08-22T15:12:49Z

Add support for extracting all images from webpages

Summary

Adds a new images format to v2 scrape endpoints that extracts all images found on a webpage, regardless of file extension or source type. This addresses cases where images don't appear in the regular links array and don't have typical image file extensions.

Problem Solved

"One thing if you also add in response for array of images option otherwise we separately find images via links array and some time that links not ending with image extension like jpeg,png and more if you add this feature also then it is Helpful"

Our solution finds ALL images on a page, including those that wouldn't be discoverable through the links format.

Key Changes

New format type: "images" added to v2 API formats array
Comprehensive extraction: Extracts from 8+ different image sources
Smart URL resolution: Handles relative, absolute, protocol-relative URLs
Security filtering: Blocks dangerous javascript: URLs
No extension dependency: Finds images regardless of URL structure
Rust performance: Non-blocking HTML parsing with Cheerio fallback
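
The URL resolution and filtering rules listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the Rust code from this PR; the function name `resolve_image_url` is hypothetical, and passing `data:`/`blob:` URLs through unchanged follows the behavior described in the cubic summary.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_image_url(raw: str, base_url: str) -> Optional[str]:
    """Resolve a raw attribute value against base_url, or return None to drop it.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    candidate = raw.strip()
    if not candidate:
        return None
    lowered = candidate.lower()
    # Block script URLs regardless of letter case.
    if lowered.startswith("javascript:"):
        return None
    # data: and blob: URLs are self-contained, so pass them through untouched.
    if lowered.startswith(("data:", "blob:")):
        return candidate
    # urljoin covers relative, absolute, and protocol-relative inputs.
    return urljoin(base_url, candidate)
```

For instance, `resolve_image_url("//cdn.example.com/a.webp", "https://example.com/")` inherits the `https` scheme from the base URL, while a `javascript:` value is dropped.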

Usage Example

{
  "url": "https://example.com",
  "formats": ["images"]
}

Response Format

{
  "success": true,
  "data": {
    "images": [
      "https://example.com/logo.png",           // From <img> tag
      "https://example.com/og-image.jpg",       // From meta tag  
      "https://example.com/favicon.ico",        // From link tag
      "https://example.com/hero-bg.webp",       // From CSS background
      "https://cdn.example.com/product.avif"    // Modern formats
    ]
  }
}
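
A client could build the request body and read the response shape shown above along these lines. This is only a sketch: the helper names `build_scrape_body` and `images_from_response` are hypothetical, and sending the request (endpoint, authentication) is deliberately left out.

```python
import json
from typing import Any, Dict, List


def build_scrape_body(url: str) -> str:
    """Serialize a scrape request body that asks only for the images format."""
    return json.dumps({"url": url, "formats": ["images"]})


def images_from_response(payload: Dict[str, Any]) -> List[str]:
    """Return data.images from a decoded response, or [] if absent or unsuccessful."""
    if not payload.get("success"):
        return []
    images = (payload.get("data") or {}).get("images") or []
    # Keep only string entries, since the format is described as a string array.
    return [u for u in images if isinstance(u, str)]
```

Treating a missing `images` key as an empty list keeps callers simple when a page has no images, as in the httpbin.org/html result reported below.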

Image Sources Supported

HTML Elements: <img> tags (src, data-src, srcset), <picture> elements, <video poster>
Meta Tags: Open Graph (og:image), Twitter Cards (twitter:image), Schema.org
Link Tags: Favicons, apple-touch-icons, image_src
CSS Styles: Inline background-image properties
Modern Web: Responsive images, lazy loading, WebP/AVIF formats
URL Handling: Data URIs, protocol-relative URLs, base tag support
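
To make the source list concrete, here is a rough Python sketch that collects images from most of the sources above using only the standard library. It is not the Rust `extract_images` function or the Cheerio fallback from this PR; the class and function names are hypothetical, Schema.org markup is omitted, and the `srcset` handling is a simplification that splits on commas (so it would mishandle URLs that themselves contain commas).

```python
import re
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin

# Matches url(...) inside an inline style value, with optional quotes.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")
IMAGE_META = {"og:image", "twitter:image"}
ICON_RELS = {"icon", "apple-touch-icon", "image_src"}


class ImageCollector(HTMLParser):
    """Hypothetical collector; gathers image URLs in first-seen order."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.found: List[str] = []

    def _add(self, raw: Optional[str]) -> None:
        if not raw or raw.strip().lower().startswith("javascript:"):
            return
        url = urljoin(self.base_url, raw.strip())
        if url not in self.found:  # drop duplicates, keep order
            self.found.append(url)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "base" and a.get("href"):
            # A <base> tag changes how later relative URLs resolve.
            self.base_url = urljoin(self.base_url, a["href"])
        elif tag in ("img", "source"):
            if tag == "img":
                self._add(a.get("src"))
                self._add(a.get("data-src"))  # common lazy-loading attribute
            # srcset: comma-separated "url descriptor" entries (simplified).
            for part in (a.get("srcset") or "").split(","):
                tokens = part.split()
                if tokens:
                    self._add(tokens[0])
        elif tag == "video":
            self._add(a.get("poster"))
        elif tag == "meta":
            if (a.get("property") or a.get("name")) in IMAGE_META:
                self._add(a.get("content"))
        elif tag == "link":
            if set((a.get("rel") or "").lower().split()) & ICON_RELS:
                self._add(a.get("href"))
        # Inline CSS backgrounds can appear on any element.
        for match in CSS_URL.findall(a.get("style") or ""):
            self._add(match)


def extract_images(html: str, base_url: str) -> List[str]:
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.found
```

Note that `<source>` only contributes its `srcset` here, so a `<source src="movie.mp4">` inside a `<video>` is not mistaken for an image.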

Implementation

Rust core: extract_images function in html-transformer shared library
TypeScript wrapper: Graceful fallback to Cheerio if Rust fails
Transformer integration: deriveImagesFromHTML in scraping pipeline
Format-based: Added to formats array, not as separate property
SDK support: Added to both JavaScript and Python SDKs

Performance Benefits

Non-blocking: Rust implementation prevents event loop freezing
Memory efficient: Lower memory footprint vs pure JavaScript
Graceful degradation: Cheerio fallback ensures reliability
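
The graceful-degradation idea can be illustrated with a small wrapper that tries a fast extractor first and falls back to a slower one on failure. This is a generic Python sketch of the pattern, not the TypeScript wrapper from this PR; `extract_with_fallback` and the two extractor arguments are hypothetical stand-ins for the Rust and Cheerio paths.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

# An extractor takes (html, base_url) and returns resolved image URLs.
Extractor = Callable[[str, str], List[str]]


def extract_with_fallback(
    html: str, base_url: str, primary: Extractor, fallback: Extractor
) -> List[str]:
    """Try the primary extractor; on any error, log it and use the fallback."""
    try:
        return primary(html, base_url)
    except Exception as exc:
        logger.warning("primary image extractor failed, using fallback: %s", exc)
        return fallback(html, base_url)
```

Logging the failure before falling back keeps the degraded path visible in monitoring rather than silently masking a broken primary extractor.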

Live Test Results

| Website | Images Found | Examples |
| --- | --- | --- |
| news.ycombinator.com | 2 | Y Combinator logo, tracking pixel |
| github.com | 22 | CDN assets, logos, modern WebP/SVG formats |
| httpbin.org/html | 0 | Simple page (correct behavior) |

Breaking Changes

None - fully backward compatible addition.

Comparison: Images vs Links

# GitHub.com test results:
Links found: 24 (navigation, pages, anchors)
Images found: 22 (visual assets, completely different content)

# Value: the images format discovers content not available via the links format
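
One way to check this claim for a given page is to request both formats and list the image URLs that never appear among the links. The helper below is a hypothetical sketch, and any URLs passed to it in examples are made up rather than taken from the test run above.

```python
from typing import Iterable, List


def images_missing_from_links(images: Iterable[str], links: Iterable[str]) -> List[str]:
    """Return image URLs that do not appear in the links list, preserving order."""
    link_set = set(links)
    return [url for url in images if url not in link_set]
```

If the result is non-empty, the images format surfaced URLs that a links-only client would have missed.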

Summary by cubic

Adds an "images" format to v2 scrape that returns all images on a page, including ones not discoverable via links or file extensions. Addresses ENG-3180 with a Rust-based extractor and a safe JS fallback.

New Features
- New format: images → string[] of resolved image URLs.
- Sources: img (src, data-src, srcset), picture, meta (og/twitter/schema), link icons, video poster, inline CSS backgrounds.
- URL handling: respects base tag; supports relative/absolute/protocol-relative; allows data/blob; filters javascript:.
- Integrated in transformer pipeline and types (v1/v2); added to JS and Python SDKs.
- Tests: unit and e2e in API and SDKs. Backward compatible.

- Add comprehensive image extraction from HTML using Rust
- Extract from img tags (src, data-src, srcset), picture elements
- Extract from meta tags (og:image, twitter:image, schema.org)
- Extract from link tags (favicons, apple-touch-icons)
- Extract from video poster attributes
- Handle URL resolution (relative, absolute, protocol-relative)
- Filter out javascript: URLs for security
- Use proper .join() method for URL resolution (addresses PR feedback)

- Add extractImages function to html-transformer.ts
- Create extractImages.ts library with Cheerio fallback
- Graceful error handling with fallback to pure JavaScript
- Comprehensive URL resolution and image source support

- Add 'images' to Format union type and FormatObject
- Add images field to Document types (v1 needed for shared transformers)
- Integrate deriveImagesFromHTML in transformer pipeline
- Add proper format validation and coercion
- Include structured logging context (addresses PR feedback)

- Add 'images' to FormatString type definitions
- Add images field to Document interfaces
- Add images support to ScrapeFormats class
- Maintain backward compatibility

- Add unit tests for extractImages function
- Add API integration tests for images format
- Add JS SDK e2e tests for images extraction
- Add Python SDK e2e tests for images extraction
- Test multiple format combinations and edge cases
- Follow PR feedback: tests in SDKs instead of example files

vishkrish200 · 2025-08-22T15:13:51Z

Hi @mogery! I've implemented all the changes you requested in PR #2003 in my fresh fork:

Re: "No need to add to v1" - I had to add images?: string[] to the v1 Document type because:

The transformers (apps/api/src/scraper/scrapeURL/transformers/index.ts) import from v1/types.ts
These transformers are shared between both v1 and v2 APIs
Without the field in v1 Document, TypeScript throws: Property 'images' does not exist on type 'Document'

Is this approach acceptable, or would you prefer a different solution for the shared transformer architecture?

I added these tests as requested:

JS SDK: apps/js-sdk/firecrawl/src/__tests__/e2e/v2/scrape.test.ts
Python SDK: apps/python-sdk/firecrawl/__tests__/e2e/v2/test_scrape.py

Are these sufficient, or should I add image format testing to existing test cases instead of separate test methods?

cubic-dev-ai

4 issues found across 12 files


vishkrish200 added 5 commits August 22, 2025 20:39

feat(api): add TypeScript wrapper for image extraction

fbf0755


feat(sdks): add images format support to JS and Python SDKs

67321f0


vishkrish200 requested a review from mogery as a code owner August 22, 2025 15:12

cubic-dev-ai Bot reviewed Aug 22, 2025


Comment thread apps/api/src/__tests__/snips/v2/scrape.test.ts

Comment thread apps/js-sdk/firecrawl/src/__tests__/e2e/v2/scrape.test.ts

Comment thread apps/api/sharedLibs/html-transformer/src/lib.rs

Comment thread apps/api/sharedLibs/html-transformer/src/lib.rs

mogery approved these changes Aug 27, 2025


mogery merged commit 30c6bdd into firecrawl:main Aug 27, 2025
8 of 12 checks passed

mogery merged 5 commits into firecrawl:main from vishkrish200:vishnu/eng-3180-add-scraping-feature-to-return-array-of-images


2 participants
