Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make RequestQueue v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) #2390

Merged
merged 13 commits into from
May 14, 2024

Conversation

vladfrangu
Copy link
Member

Closes #2388

@vladfrangu vladfrangu self-assigned this Mar 22, 2024
@github-actions github-actions bot added this to the 85th sprint - Tooling team milestone Mar 22, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 22, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

None of the pull request and linked issue has estimate

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 22, 2024
@vladfrangu vladfrangu requested a review from B4nan March 26, 2024 08:42
docs/experiments/request_locking.mdx Outdated Show resolved Hide resolved
packages/basic-crawler/src/internals/basic-crawler.ts Outdated Show resolved Hide resolved
packages/core/src/storages/request_queue_v2.ts Outdated Show resolved Hide resolved
packages/core/src/storages/request_queue_v2.ts Outdated Show resolved Hide resolved
packages/core/src/storages/request_queue_v2.ts Outdated Show resolved Hide resolved
@drobnikj
Copy link
Member

drobnikj commented Apr 2, 2024

Just note this, we will update docs and create separate tutorial to using rqv2. Let's wait to release this to have it ready.

@B4nan
Copy link
Member

B4nan commented Apr 4, 2024

@drobnikj I'd be up for merging this now, so we can test it better (and we'll hold the release till the docs are ready).

@drobnikj drobnikj changed the title feat: make RequestQueueV2 the default queue feat: Make RequestQueueV2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) May 13, 2024
Copy link
Member

@drobnikj drobnikj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content for new RQ storage system is ready:

If it is code-wise ready, let's merge this @B4nan @vladfrangu

docs/experiments/request_locking.mdx Outdated Show resolved Hide resolved
@B4nan B4nan changed the title feat: Make RequestQueueV2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) feat: make RequestQueue v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) May 13, 2024
@B4nan B4nan merged commit 41ae8ab into master May 14, 2024
10 checks passed
@B4nan B4nan deleted the vlad/make-rqv2-default branch May 14, 2024 09:04
gitworkflows pushed a commit to threatcode/crawlee that referenced this pull request May 25, 2024
* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* ci: test on node 22 (apify#2438)

* chore: use node 20 in templates

* chore(deps): update yarn to v4.2.1

* chore(deps): lock file maintenance

* fix: return true when robots.isAllowed returns undefined (apify#2439)

`undefined` means that there is no explicit rule for the requested
route. No rules means no disallow, therefore it's allowed.

Fixes apify#2437

---------

Co-authored-by: Jan Buchar <Teyras@gmail.com>

* chore(deps): update patch/minor dependencies to v3.3.0

* chore(deps): update patch/minor dependencies to v3.3.2

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* docs: Should be "Same Domain" not "Same Subdomain" (apify#2445)

The docs appear to be a bit misleading. If people want "Same Subdomain"
they should actually use "Same Hostname".

![image](https://github.com/apify/crawlee/assets/10026538/2b5452c5-e313-404b-812d-811e0764bd2d)

* chore(docker): update docker state [skip ci]

* docs: fix two typos (array or requests -> array of requests, no much -> not much) (apify#2451)

* fix: sitemap `content-type` check breaks on `content-type` parameters (apify#2442)

According to the
[RFC1341](https://www.w3.org/Protocols/rfc1341/4_Content-Type.html), the
Content-type header can contain additional string parameters.

* chore(docker): update docker state [skip ci]

* chore(deps): lock file maintenance

* fix(core): fire local `SystemInfo` events every second (apify#2454)

During local development, we are firing events for the AutoscaledPool
about current system resources like memory or CPU. We were firing them
once a minute by default, but we remove those snapshots older than 30s,
so we never had anything to compare and always used only the very last
piece of information.

This PR changes the interval to 1s, aligning this with how the Apify
platform fires events.

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): lock file maintenance

* chore(deps): update dependency linkedom to ^0.18.0 (apify#2457)

* chore(docker): update docker state [skip ci]

* perf: optimize adding large amount of requests via `crawler.addRequests()` (apify#2456)

This PR resolves three main issues with adding large amount of requests
into the queue:
- Every requests added to the queue was automatically added to the LRU
requests cache, which has a size of 1 million items. this makes sense
for enqueuing a few items, but if we try to add more than the limit, we
end up with overloading the LRU cache for no reason. Now we only add the
first 1000 requests to the cache (plus any requests added via separate
calls, e.g. when doing `enqueueLinks` from inside a request handler,
again with a limit of the first 1000 links).
- We used to validate the whole requests array via `ow`, and since the
shape can vary, it was very slow (e.g. 20s just for the `ow`
validation). Now we use a tailored validation for the array that does
the same but resolves within 100ms or so.
- We always created the `Request` objects out of everything, which had a
significant impact on memory usage. Now we skip this completely and let
the objects be created later when needed (when calling
`RQ.addRequests()` which only receives the actual batch and not the
whole array)

Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079

* perf: improve scaling based on memory (apify#2459)

We only allowed to use 70% of the available memory, this PR changes the
limit to 90%. Tested with a low memory options and it did not have any
effect, while it allows to use more memory on the large memory setups -
where the 30% could mean 2gb or so, we dont need such a huge buffer.

Also increases the scaling steps to 10% instead of 5% so speed up the
scaling.

Related:
[apify.slack.com/archives/C0L33UM7Z/p1715109984834079](https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079)

* feat: make `RequestQueue` v2 the default queue, see more on [Apify blog](https://blog.apify.com/new-apify-request-queue/) (apify#2390)

Closes apify#2388

---------

Co-authored-by: drobnikj <drobnik.j@gmail.com>
Co-authored-by: Martin Adámek <banan23@gmail.com>

* fix: do not drop statistics on migration/resurrection/resume (apify#2462)

This fixes a bug that was introduced with
apify#1844 and
apify#2083 - we reset the persisted
state for statistics and session pool each time a crawler is started,
which prevents their restoration.

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>

* chore(deps): update patch/minor dependencies (apify#2450)

* chore(docker): update docker state [skip ci]

* fix: double tier decrement in tiered proxy (apify#2468)

* docs: scrapy-vs-crawlee blog (apify#2431)

Co-authored-by: Saurav Jain <sauain@SauravApify.local>
Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>

* perf: optimize `RequestList` memory footprint (apify#2466)

The request list now delays the conversion of the source items into the
`Request` objects, resulting in a significantly less memory footprint.

Related: https://apify.slack.com/archives/C0L33UM7Z/p1715109984834079

* fix: `EnqueueStrategy.All` erroring with links using unsupported protocols (apify#2389)

This changes `EnqueueStrategy.All` to filter out non-http and non-https
URLs (`mailto:` links were causing the crawler to error).

Let me know if there's a better fix or if you want me to change
something.

Thanks!


```
Request failed and reached maximum retries. Error: Received one or more errors
    at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
    at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
    at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22)
    at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22)
    at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2)
    at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21)
```

* fix(core): use createSessionFunction when loading Session from persisted state (apify#2444)

Changes SessionPool's new Session loading behavior in the core module to
utilize the configured createSessionFunction if specified. This ensures
that new Sessions are instantiated using the custom session creation
logic provided by the user, improving flexibility and adherence to user
configurations.

* fix(core): conversion between tough cookies and browser pool cookies (apify#2443)

Fixes the conversion from tough cookies to browser pool cookies and vice
versa, by correctly handling cookies where the domain has a leading dot
versus when it doesn't.

* test: fix e2e tests for zero concurrency

* chore(deps): update dependency puppeteer to v22.8.2

* chore(docker): update docker state [skip ci]

* docs: fixes (apify#2469)

@B4nan  minor fixes

* chore(deps): update dependency puppeteer to v22.9.0

* feat: implement ErrorSnapshotter for error context capture (apify#2332)

This commit introduces the ErrorSnapshotter class to the crawlee
package, providing functionality to capture screenshots and HTML
snapshots when an error occurs during web crawling.

This functionality is opt-in, and can be enabled via the crawler
options:

```ts
const crawler = new BasicCrawler({
  // ...
  statisticsOptions: {
    saveErrorSnapshots: true,
  },
});
```

Closes apify#2280

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>

* test: fix e2e tests for error snapshotter

* feat: add `FileDownload` "crawler" (apify#2435)

Adds a new package `@crawlee/file-download`, which overrides the
`HttpCrawler`'s MIME type limitations and allows the users to download
arbitrary files.

Aside from the regular `requestHandler`, this crawler introduces
`streamHandler`, which passes a `ReadableStream` with the downloaded
data to the user handler.

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
Co-authored-by: Jan Buchar <jan.buchar@apify.com>

* chore(release): v3.10.0

* chore(release): update internal dependencies [skip ci]

* chore(docker): update docker state [skip ci]

* docs: add v3.10 snapshot

* docs: fix broken link for a moved content

* chore(deps): lock file maintenance

* docs: improve crawlee seo ranking (apify#2472)

* chore(deps): lock file maintenance

* refactor: Remove redundant fields from `StatisticsPersistedState` (apify#2475)

Those fields are duplicated in the base class anyway.

* chore(deps): lock file maintenance

* fix: provide URLs to the error snapshot (apify#2482)

This will respect the Actor SDK override automatically since importing
the SDK will fire this side effect:

https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L25

* docs: update keywords (apify#2481)

Co-authored-by: Saurav Jain <sauain@SauravApify.local>

* docs: add feedback from community. (apify#2478)

Co-authored-by: Saurav Jain <sauain@SauravApify.local>
Co-authored-by: Martin Adámek <banan23@gmail.com>
Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>

* chore: use biome for code formatting (apify#2301)

This takes ~50ms on my machine 🤯 

- closes apify#2366 
- Replacing spaces with tabs won't be done right here, right now.
- eslint and biome are reconciled
- ~biome check fails because of typescript errors - we can either fix
those or find a way to ignore it~

* chore(docker): update docker state [skip ci]

* test: Check if the proxy tier drops after an amount of successful requests (apify#2490)

* chore: ignore docker state when checking formatting (apify#2491)

* chore: remove unused eslint ignore directives

* chore: fix formatting

* chore: run biome as a pre-commit hook (apify#2493)

* fix: adjust `URL_NO_COMMAS_REGEX` regexp to allow single character hostnames (apify#2492)

Closes apify#2487

* fix: investigate and temp fix for possible 0-concurrency bug in RQv2 (apify#2494)

* test: add e2e test for zero concurrency with RQ v2

* chore: update biome

* chore(docker): update docker state [skip ci]

* chore(deps): lock file maintenance (apify#2495)

* chore(release): v3.10.1

* chore(release): update internal dependencies [skip ci]

* chore(docker): update docker state [skip ci]

* chore: add undeclared dependency

* chore(deps): update patch/minor dependencies to v1.44.1

* chore(deps): lock file maintenance

* chore(docker): update docker state [skip ci]

* feat: Loading sitemaps from string (apify#2496)

- closes apify#2460

* docs: fix homepage gradients (apify#2500)

* fix: Autodetect sitemap filetype from content (apify#2497)

- closes apify#2461

* chore(deps): update dependency puppeteer to v22.10.0

* chore(deps): lock file maintenance

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Martin Adámek <banan23@gmail.com>
Co-authored-by: Gigino Chianese <Sajito@users.noreply.github.com>
Co-authored-by: Jan Buchar <Teyras@gmail.com>
Co-authored-by: Connor Adams <connorads@users.noreply.github.com>
Co-authored-by: Apify Release Bot <noreply@apify.com>
Co-authored-by: Jiří Spilka <jiri.spilka@apify.com>
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
Co-authored-by: Vlad Frangu <kingdgrizzle@gmail.com>
Co-authored-by: drobnikj <drobnik.j@gmail.com>
Co-authored-by: Jan Buchar <jan.buchar@apify.com>
Co-authored-by: Saurav Jain <souravjain540@gmail.com>
Co-authored-by: Saurav Jain <sauain@SauravApify.local>
Co-authored-by: davidjohnbarton <41335923+davidjohnbarton@users.noreply.github.com>
Co-authored-by: Stefan Sundin <git@stefansundin.com>
Co-authored-by: Gustavo Silva <silva95gustavo@gmail.com>
Co-authored-by: Hamza Alwan <ihamzaalwan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make RequestQueueV2 default
3 participants