feat: add `FileDownload` "crawler" #2435

barjin · 2024-04-29T12:56:26Z

Adds a new package @crawlee/file-download, which overrides the HttpCrawler's MIME type limitations and allows the users to download arbitrary files.

Aside from the regular requestHandler, this crawler introduces streamHandler, which passes a ReadableStream with the downloaded data to the user handler.

B4nan · 2024-04-29T13:03:51Z

Do we really need a new package here? If there are no additional dependencies, I would just expose the new class in the HTTP crawler package.

barjin · 2024-04-29T13:05:24Z

Yup, I don't have any strong opinion here - the (very short) discussion was at https://github.com/apify/store-website-content-crawler/issues/242

packages/file-download/src/internals/file-download.ts

packages/http-crawler/src/internals/file-download.ts

barjin · 2024-05-02T14:10:36Z

The latest commit adds streamHandler, which is mutually exclusive with the requestHandler in the constructor and allows the users to work with the data stream instead of the fully downloaded data.

const crawler = new FileDownload({
    async streamHandler({ stream }) {
        const file = createWriteStream('test.webm');
        pipeline(stream, file, (err) => {
            if (err) {
                console.error('Pipeline failed', err);
            }
        });
    },
})

This is a direct port from WCC. What do you think, is it worth it to keep it here? Or should we rely on the requestHandler only?

B4nan

Can we add some tests for this?

I dont mind having the stream handler.

packages/http-crawler/src/internals/file-download.ts

janbuchar

This is actually pretty impressive! Could you also add an example to the docs? Or a whole page even...

packages/http-crawler/src/internals/file-download.ts

janbuchar · 2024-05-09T09:37:11Z

packages/http-crawler/src/internals/file-download.ts

+ *
+ * The source URLs are represented using [Request](https://crawlee.dev/api/core/class/Request) objects that are fed from [RequestList](https://crawlee.dev/api/core/class/RequestList) or [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue) instances provided by the [FileCrawlerOptions.requestList](https://crawlee.dev/api/file-crawler/interface/FileCrawlerOptions#requestList) or [FileCrawlerOptions.requestQueue](https://crawlee.dev/api/file-crawler/interface/FileCrawlerOptions#requestQueue) constructor options, respectively.
+ *
+ * If both [FileCrawlerOptions.requestList](https://crawlee.dev/api/file-crawler/interface/FileCrawlerOptions#requestList) and [FileCrawlerOptions.requestQueue](https://crawlee.dev/api/file-crawler/interface/FileCrawlerOptions#requestQueue) are used, the instance first processes URLs from the [RequestList](https://crawlee.dev/api/core/class/RequestList) and automatically enqueues all of them to [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue) before it starts their processing. This ensures that a single URL is not crawled multiple times.


Could we just... not support this in new crawlers?

Why so? The last time I checked (last week), the Delivery team loves a good RequestList (for whatever reason).

Plus, this behaviour is coming from BasicCrawler (snippet here), so all our crawlers are doing this right now. I'm not here to gatekeep this, just curious about your motivation to remove this :)

for whatever reason

The cost is indeed a valid reason, since with RL there is no cost for the RQ writes (and they can be indeed costly for a high number of requests, e.g. in this run the RQ writes are about 70% of the cost). I will recheck this with RL in place to see what is the actual cost difference.

Well, it's a piece of non-trivial logic that I thought was not really necessary for this use case. And in general, less features means less code to maintain 🤷

was not really necessary for this use case

Right, but it's free (inherited from BasicCrawler). I copied (and edited) the JSDoc from HttpCrawler, so it may be too descriptive in some places, happy to get some feedback for that too. This part is mentioned in all the crawlers we have, though.

The cost is indeed a valid reason, since with RL there is no cost for the RQ writes (and they can be indeed costly for a high number of requests, e.g. in this run the RQ writes are about 70% of the cost). I will recheck this with RL in place to see what is the actual cost difference.

Btw here is the same actor, but with RL in place. The cost is ~5$ while with RQ it's 26$, this is surely a good enough reason to keep this abstraction (unfortunately, lol).

test/core/crawlers/file_download.test.ts

packages/http-crawler/src/internals/file-download.ts

Co-authored-by: Martin Adámek <banan23@gmail.com>

Co-authored-by: Jan Buchar <jan.buchar@apify.com>

@B4nan

* chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * ci: test on node 22 (apify#2438) * chore: use node 20 in templates * chore(deps): update yarn to v4.2.1 * chore(deps): lock file maintenance * fix: return true when robots.isAllowed returns undefined (apify#2439) `undefined` means that there is no explicit rule for the requested route. No rules means no disallow, therefore it's allowed. Fixes apify#2437 --------- Co-authored-by: Jan Buchar <Teyras@gmail.com> * chore(deps): update patch/minor dependencies to v3.3.0 * chore(deps): update patch/minor dependencies to v3.3.2 * chore(deps): lock file maintenance * chore(deps): lock file maintenance * docs: Should be "Same Domain" not "Same Subdomain" (apify#2445) The docs appear to be a bit misleading. If people want "Same Subdomain" they should actually use "Same Hostname". ![image](https://github.com/apify/crawlee/assets/10026538/2b5452c5-e313-404b-812d-811e0764bd2d) * chore(docker): update docker state [skip ci] * docs: fix two typos (array or requests -> array of requests, no much -> not much) (apify#2451) * fix: sitemap `content-type` check breaks on `content-type` parameters (apify#2442) According to the [RFC1341](https://www.w3.org/Protocols/rfc1341/4_Content-Type.html), the Content-type header can contain additional string parameters. * chore(docker): update docker state [skip ci] * chore(deps): lock file maintenance * fix(core): fire local `SystemInfo` events every second (apify#2454) During local development, we are firing events for the AutoscaledPool about current system resources like memory or CPU. We were firing them once a minute by default, but we remove those snapshots older than 30s, so we never had anything to compare and always used only the very last piece of information. This PR changes the interval to 1s, aligning this with how the Apify platform fires events. * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): lock file maintenance * chore(deps): update dependency linkedom to ^0.18.0 (apify#2457) * chore(docker): update docker state [skip ci] * perf: optimize adding large amount of requests via `crawler.addRequests()` (apify#2456) This PR resolves three main issues with adding large amount of requests into the queue: - Every requests added to the queue was automatically added to the LRU requests cache, which has a size of 1 million items. this makes sense for enqueuing a few items, but if we try to add more than the limit, we end up with overloading the LRU cache for no reason. Now we only add the first 1000 requests to the cache (plus any requests added via separate calls, e.g. when doing `enqueueLinks` from inside a request handler, again with a limit of the first 1000 links). - We used to validate the whole requests array via `ow`, and since the shape can vary, it was very slow (e.g. 20s just for the `ow` validation). Now we use a tailored validation for the array that does the same but resolves within 100ms or so. - We always created the `Request` objects out of everything, which had a significant impact on memory usage. Now we skip this completely and let the objects be created later when needed (when calling `RQ.addRequests()` which only receives the actual batch and not the whole array) Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079 * perf: improve scaling based on memory (apify#2459) We only allowed to use 70% of the available memory, this PR changes the limit to 90%. Tested with a low memory options and it did not have any effect, while it allows to use more memory on the large memory setups - where the 30% could mean 2gb or so, we dont need such a huge buffer. Also increases the scaling steps to 10% instead of 5% so speed up the scaling. Related: [apify.slack.com/archives/C0L33UM7Z/p1715109984834079](https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079) * feat: make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) (apify#2390) Closes apify#2388 --------- Co-authored-by: drobnikj <drobnik.j@gmail.com> Co-authored-by: Martin Adámek <banan23@gmail.com> * fix: do not drop statistics on migration/resurrection/resume (apify#2462) This fixes a bug that was introduced with apify#1844 and apify#2083 - we reset the persisted state for statistics and session pool each time a crawler is started, which prevents their restoration. --------- Co-authored-by: Martin Adámek <banan23@gmail.com> * chore(deps): update patch/minor dependencies (apify#2450) * chore(docker): update docker state [skip ci] * fix: double tier decrement in tiered proxy (apify#2468) * docs: scrapy-vs-crawlee blog (apify#2431) Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> * perf: optimize `RequestList` memory footprint (apify#2466) The request list now delays the conversion of the source items into the `Request` objects, resulting in a significantly less memory footprint. Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079 * fix: `EnqueueStrategy.All` erroring with links using unsupported protocols (apify#2389) This changes `EnqueueStrategy.All` to filter out non-http and non-https URLs (`mailto:` links were causing the crawler to error). Let me know if there's a better fix or if you want me to change something. Thanks! ``` Request failed and reached maximum retries. Error: Received one or more errors at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17) at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2) at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36) at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46) at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22) at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42) at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37) at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22) at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2) at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21) ``` * fix(core): use createSessionFunction when loading Session from persisted state (apify#2444) Changes SessionPool's new Session loading behavior in the core module to utilize the configured createSessionFunction if specified. This ensures that new Sessions are instantiated using the custom session creation logic provided by the user, improving flexibility and adherence to user configurations. * fix(core): conversion between tough cookies and browser pool cookies (apify#2443) Fixes the conversion from tough cookies to browser pool cookies and vice versa, by correctly handling cookies where the domain has a leading dot versus when it doesn't. * test: fix e2e tests for zero concurrency * chore(deps): update dependency puppeteer to v22.8.2 * chore(docker): update docker state [skip ci] * docs: fixes (apify#2469) @B4nan minor fixes * chore(deps): update dependency puppeteer to v22.9.0 * feat: implement ErrorSnapshotter for error context capture (apify#2332) This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling. This functionality is opt-in, and can be enabled via the crawler options: ```ts const crawler = new BasicCrawler({ // ... statisticsOptions: { saveErrorSnapshots: true, }, }); ``` Closes apify#2280 --------- Co-authored-by: Martin Adámek <banan23@gmail.com> * test: fix e2e tests for error snapshotter * feat: add `FileDownload` "crawler" (apify#2435) Adds a new package `@crawlee/file-download`, which overrides the `HttpCrawler`'s MIME type limitations and allows the users to download arbitrary files. Aside from the regular `requestHandler`, this crawler introduces `streamHandler`, which passes a `ReadableStream` with the downloaded data to the user handler. --------- Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: Jan Buchar <jan.buchar@apify.com> * chore(release): v3.10.0 * chore(release): update internal dependencies [skip ci] * chore(docker): update docker state [skip ci] * docs: add v3.10 snapshot * docs: fix broken link for a moved content * chore(deps): lock file maintenance * docs: improve crawlee seo ranking (apify#2472) * chore(deps): lock file maintenance * refactor: Remove redundant fields from `StatisticsPersistedState` (apify#2475) Those fields are duplicated in the base class anyway. * chore(deps): lock file maintenance * fix: provide URLs to the error snapshot (apify#2482) This will respect the Actor SDK override automatically since importing the SDK will fire this side effect: https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L25 * docs: update keywords (apify#2481) Co-authored-by: Saurav Jain <sauain@SauravApify.local> * docs: add feedback from community. (apify#2478) Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> * chore: use biome for code formatting (apify#2301) This takes ~50ms on my machine 🤯 - closes apify#2366 - Replacing spaces with tabs won't be done right here, right now. - eslint and biome are reconciled - ~biome check fails because of typescript errors - we can either fix those or find a way to ignore it~ * chore(docker): update docker state [skip ci] * test: Check if the proxy tier drops after an amount of successful requests (apify#2490) * chore: ignore docker state when checking formatting (apify#2491) * chore: remove unused eslint ignore directives * chore: fix formatting * chore: run biome as a pre-commit hook (apify#2493) * fix: adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames (apify#2492) Closes apify#2487 * fix: investigate and temp fix for possible 0-concurrency bug in RQv2 (apify#2494) * test: add e2e test for zero concurrency with RQ v2 * chore: update biome * chore(docker): update docker state [skip ci] * chore(deps): lock file maintenance (apify#2495) * chore(release): v3.10.1 * chore(release): update internal dependencies [skip ci] * chore(docker): update docker state [skip ci] * chore: add undeclared dependency * chore(deps): update patch/minor dependencies to v1.44.1 * chore(deps): lock file maintenance * chore(docker): update docker state [skip ci] * feat: Loading sitemaps from string (apify#2496) - closes apify#2460 * docs: fix homepage gradients (apify#2500) * fix: Autodetect sitemap filetype from content (apify#2497) - closes apify#2461 * chore(deps): update dependency puppeteer to v22.10.0 * chore(deps): lock file maintenance --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: Gigino Chianese <Sajito@users.noreply.github.com> Co-authored-by: Jan Buchar <Teyras@gmail.com> Co-authored-by: Connor Adams <connorads@users.noreply.github.com> Co-authored-by: Apify Release Bot <noreply@apify.com> Co-authored-by: Jiří Spilka <jiri.spilka@apify.com> Co-authored-by: Jindřich Bär <jindrichbar@gmail.com> Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com> Co-authored-by: drobnikj <drobnik.j@gmail.com> Co-authored-by: Jan Buchar <jan.buchar@apify.com> Co-authored-by: Saurav Jain <souravjain540@gmail.com> Co-authored-by: Saurav Jain <sauain@SauravApify.local> Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com> Co-authored-by: Stefan Sundin <git@stefansundin.com> Co-authored-by: Gustavo Silva <silva95gustavo@gmail.com> Co-authored-by: Hamza Alwan <ihamzaalwan@gmail.com>

barjin added feature Issues that represent new features or improvements to existing features. adhoc Ad-hoc unplanned task added during the sprint. labels Apr 29, 2024

barjin self-assigned this Apr 29, 2024

github-actions bot added this to the 88th sprint - Tooling team milestone Apr 29, 2024

github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Apr 29, 2024

B4nan reviewed Apr 29, 2024

View reviewed changes

packages/file-download/src/internals/file-download.ts Outdated Show resolved Hide resolved

barjin commented May 2, 2024

View reviewed changes

packages/http-crawler/src/internals/file-download.ts Outdated Show resolved Hide resolved

B4nan reviewed May 3, 2024

View reviewed changes

B4nan requested a review from janbuchar May 6, 2024 14:03

janbuchar reviewed May 9, 2024

View reviewed changes

github-actions bot added the tested Temporary label used only programatically for some analytics. label May 13, 2024

barjin and others added 16 commits May 16, 2024 13:53

feat: add FileDownload "crawler"

62fc7aa

chore: update lockfile

ea8b795

chore: move file-download to @crawlee/http

b1216e0

feat: add streamHandler for streaming downloads

b871120

fix: premature stream close

ac3441a

chore: formatting

3246ff7

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: formatting

77ab695

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: formatting

5f6ccce

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: formatting

f760781

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: formatting

4ba9f38

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: formatting

6b642d9

Co-authored-by: Martin Adámek <banan23@gmail.com>

chore: add tests

4e8473b

docs: rewording

04e5891

Co-authored-by: Jan Buchar <jan.buchar@apify.com>

docs: rewording

46d96ff

Co-authored-by: Jan Buchar <jan.buchar@apify.com>

chore: nits from PR

24be5ca

chore: rewrite file download test with express

f7c01ae

barjin added 2 commits May 16, 2024 13:53

chore: regenerate lockfile

8e0aa54

chore: revert yarn.lock

54c7640

barjin force-pushed the feat/file-downloader branch from 7d074cb to 54c7640 Compare May 16, 2024 11:54

barjin added 3 commits May 16, 2024 14:18

chore: fix broken docs links

2eaa2b1

chore: fix more broken links

673aa63

fix: use @apilink instead of markdown links

653f5f8

B4nan merged commit d73756b into master May 16, 2024
9 checks passed

B4nan deleted the feat/file-downloader branch May 16, 2024 13:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `FileDownload` "crawler" #2435

feat: add `FileDownload` "crawler" #2435

barjin commented Apr 29, 2024 •

edited

B4nan commented Apr 29, 2024 •

edited

barjin commented Apr 29, 2024

barjin commented May 2, 2024 •

edited

B4nan left a comment

janbuchar left a comment

janbuchar May 9, 2024

barjin May 13, 2024

B4nan May 13, 2024

janbuchar May 13, 2024

barjin May 13, 2024

B4nan May 13, 2024

feat: add FileDownload "crawler" #2435

feat: add FileDownload "crawler" #2435

Conversation

barjin commented Apr 29, 2024 • edited

B4nan commented Apr 29, 2024 • edited

barjin commented Apr 29, 2024

barjin commented May 2, 2024 • edited

B4nan left a comment

Choose a reason for hiding this comment

janbuchar left a comment

Choose a reason for hiding this comment

janbuchar May 9, 2024

Choose a reason for hiding this comment

barjin May 13, 2024

Choose a reason for hiding this comment

B4nan May 13, 2024

Choose a reason for hiding this comment

janbuchar May 13, 2024

Choose a reason for hiding this comment

barjin May 13, 2024

Choose a reason for hiding this comment

B4nan May 13, 2024

Choose a reason for hiding this comment

feat: add `FileDownload` "crawler" #2435

feat: add `FileDownload` "crawler" #2435

barjin commented Apr 29, 2024 •

edited

B4nan commented Apr 29, 2024 •

edited

barjin commented May 2, 2024 •

edited