feat(api): add image extraction support to v2 scrape endpoint #2008
Merged
mogery merged 5 commits into firecrawl:main (Aug 27, 2025)
- Add comprehensive image extraction from HTML using Rust
- Extract from img tags (src, data-src, srcset), picture elements
- Extract from meta tags (og:image, twitter:image, schema.org)
- Extract from link tags (favicons, apple-touch-icons)
- Extract from video poster attributes
- Handle URL resolution (relative, absolute, protocol-relative)
- Filter out javascript: URLs for security
- Use proper .join() method for URL resolution (addresses PR feedback)
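The URL handling listed in this commit can be illustrated with a short TypeScript sketch. It is not the PR's Rust code: the helper names (`parseSrcset`, `resolveImageUrl`, `collectImageUrls`) are hypothetical, the built-in WHATWG `URL` class stands in for the Rust `.join()` call, and the deduplication step is an assumption rather than something the commit states.

```typescript
// Hypothetical helpers illustrating the behaviour described in the commit notes.

/** Split a srcset value into candidate URLs, dropping width/density descriptors. */
function parseSrcset(srcset: string): string[] {
  // Naive comma split: URLs that themselves contain commas (e.g. data: URIs)
  // would need a stricter parser.
  return srcset
    .split(",")
    .map((candidate) => candidate.trim().split(/\s+/)[0] ?? "")
    .filter((url) => url.length > 0);
}

/** Resolve a raw attribute value against the page URL, or return null to skip it. */
function resolveImageUrl(raw: string, pageUrl: string): string | null {
  const trimmed = raw.trim();
  if (trimmed === "") return null;
  try {
    // Handles relative, absolute, and protocol-relative inputs uniformly.
    const resolved = new URL(trimmed, pageUrl);
    // Script URLs are never images; drop them for safety.
    if (resolved.protocol === "javascript:") return null;
    return resolved.href;
  } catch {
    return null; // Value could not be parsed as a URL.
  }
}

/** Resolve every raw value and keep the first occurrence of each result (assumed behaviour). */
function collectImageUrls(rawValues: string[], pageUrl: string): string[] {
  const seen = new Set<string>();
  for (const raw of rawValues) {
    const resolved = resolveImageUrl(raw, pageUrl);
    if (resolved !== null) seen.add(resolved);
  }
  return Array.from(seen);
}
```

With a page URL of `https://example.com/blog/post`, the value `//cdn.example.com/c.webp` resolves to `https://cdn.example.com/c.webp`, while `javascript:void(0)` is skipped.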
- Add extractImages function to html-transformer.ts
- Create extractImages.ts library with Cheerio fallback
- Graceful error handling with fallback to pure JavaScript
- Comprehensive URL resolution and image source support
- Add 'images' to Format union type and FormatObject
- Add images field to Document types (v1 needed for shared transformers)
- Integrate deriveImagesFromHTML in transformer pipeline
- Add proper format validation and coercion
- Include structured logging context (addresses PR feedback)
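To make the "format validation and coercion" item concrete, here is a rough TypeScript sketch. The real `Format` and `FormatObject` definitions are not shown on this page, so the shapes below (a string union plus a `{ type }` object form) and the helper names `isFormat` and `coerceFormat` are assumptions for illustration only.

```typescript
// Hypothetical shapes; the actual Format/FormatObject types in the codebase may differ.
const KNOWN_FORMATS = ["markdown", "html", "links", "images"] as const;
type Format = (typeof KNOWN_FORMATS)[number];

interface FormatObject {
  type: Format;
}

/** Type guard: is this string one of the known format names? */
function isFormat(value: string): value is Format {
  return (KNOWN_FORMATS as readonly string[]).indexOf(value) !== -1;
}

/** Accept either "images" or { type: "images" } and normalise to the object form. */
function coerceFormat(input: unknown): FormatObject {
  let name: unknown;
  if (typeof input === "string") {
    name = input;
  } else if (typeof input === "object" && input !== null) {
    name = (input as { type?: unknown }).type;
  }
  if (typeof name !== "string" || !isFormat(name)) {
    throw new Error(`Unsupported format: ${JSON.stringify(input)}`);
  }
  return { type: name };
}
```

Normalising to one shape early means later pipeline stages only need to check `type === "images"` instead of handling both spellings.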
- Add 'images' to FormatString type definitions
- Add images field to Document interfaces
- Add images support to ScrapeFormats class
- Maintain backward compatibility
- Add unit tests for extractImages function
- Add API integration tests for images format
- Add JS SDK e2e tests for images extraction
- Add Python SDK e2e tests for images extraction
- Test multiple format combinations and edge cases
- Follow PR feedback: tests in SDKs instead of example files
Hi @mogery! I've implemented all the changes you requested from PR #2003 in my fresh fork:
Re: "No need to add to v1" - I had to add
Is this approach acceptable, or would you prefer a different solution for the shared transformer architecture?
I added these tests as requested:
Are these sufficient, or should I add image format testing to existing test cases instead of separate test methods?
mogery approved these changes on Aug 27, 2025
Add support for extracting all images from webpages
Summary
Adds a new "images" format to v2 scrape endpoints that extracts all images found on a webpage, regardless of file extension or source type. This addresses cases where images don't appear in the regular links array and don't have typical image file extensions.
Problem Solved
Our solution finds ALL images on a page, including those that wouldn't be discoverable through the links format.
Key Changes
- "images" added to v2 API formats array
- Filters out javascript: URLs
Usage Example
{
  "url": "https://example.com",
  "formats": ["images"]
}
Response Format
{
  "success": true,
  "data": {
    "images": [
      "https://example.com/logo.png",          // From <img> tag
      "https://example.com/og-image.jpg",      // From meta tag
      "https://example.com/favicon.ico",       // From link tag
      "https://example.com/hero-bg.webp",      // From CSS background
      "https://cdn.example.com/product.avif"   // Modern formats
    ]
  }
}
Image Sources Supported
- <img> tags (src, data-src, srcset), <picture> elements, <video poster>
- Open Graph (og:image), Twitter Cards (twitter:image), Schema.org
- CSS background-image properties
Implementation
- extract_images function in html-transformer shared library
- deriveImagesFromHTML in scraping pipeline
- Requested through the formats array, not as a separate property
Performance Benefits
Live Test Results
Breaking Changes
None - fully backward compatible addition.
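Because the field only appears when "images" is requested, a client can treat it as optional. The TypeScript sketch below combines the usage example and response format from this description. The interface and function names (`ScrapeRequest`, `ScrapeResponse`, `buildImagesRequest`, `readImages`) are hypothetical, and the transport (HTTP call, authentication) is deliberately left out.

```typescript
// Assumed shapes, modelled on the request and response examples in this description.
interface ScrapeRequest {
  url: string;
  formats: string[];
}

interface ScrapeResponse {
  success: boolean;
  data?: {
    images?: string[]; // Present only when "images" was requested (assumption).
  };
}

/** Build a request body that asks for images alongside any other formats. */
function buildImagesRequest(url: string, otherFormats: string[] = []): ScrapeRequest {
  const formats = otherFormats.filter((format) => format !== "images").concat(["images"]);
  return { url, formats };
}

/** Read the images array defensively; older responses simply have no such field. */
function readImages(response: ScrapeResponse): string[] {
  if (!response.success) return [];
  return response.data?.images ?? [];
}
```

A caller that never requests "images" gets an empty array from `readImages`, which matches the "no breaking changes" claim: existing consumers see no new required fields.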
Comparison: Images vs Links
Summary by cubic
Adds an "images" format to v2 scrape that returns all images on a page, including ones not discoverable via links or file extensions. Addresses ENG-3180 with a Rust-based extractor and a safe JS fallback.
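The "Rust-based extractor with a safe JS fallback" arrangement can be sketched generically in TypeScript. This is an illustration of the pattern only: the `Extractor` type and `extractWithFallback` name are hypothetical, the extractors are synchronous for brevity (the real native call may well be asynchronous), and the actual Rust and Cheerio implementations are passed in rather than reproduced.

```typescript
// Hypothetical signature shared by the native extractor and the pure-JS one.
type Extractor = (html: string, baseUrl: string) => string[];

/**
 * Try the primary (native) extractor first; if it throws, report the error
 * and fall back to the secondary (pure JavaScript) extractor.
 */
function extractWithFallback(
  html: string,
  baseUrl: string,
  primary: Extractor,
  fallback: Extractor,
  onError?: (error: unknown) => void,
): string[] {
  try {
    return primary(html, baseUrl);
  } catch (error) {
    // Surface the failure (e.g. to a structured logger) instead of hiding it.
    if (onError) onError(error);
    return fallback(html, baseUrl);
  }
}
```

Passing both extractors as parameters keeps the fallback path easy to unit test, since a test can supply a primary that always throws.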