
refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically #3388

Open
barjin wants to merge 29 commits into v4 from chore/more-context-pipeline

Conversation

@barjin (Member) commented Feb 5, 2026

Extracts all CrawlingContext initialization into ContextPipeline steps to tighten control over the CrawlingContext contents.

Blocks #3380

@barjin barjin self-assigned this Feb 5, 2026
@barjin barjin added the "adhoc" label (Ad-hoc unplanned task added during the sprint) Feb 5, 2026

Copilot AI left a comment


Pull request overview

This pull request refactors the context initialization logic in Crawlee's crawler architecture by moving all CrawlingContext setup into the ContextPipeline. This change provides tighter control over context construction and prepares the codebase for the upcoming session pool exclusivity changes in PR #3380.

Changes:

  • Introduces a new buildContextPipeline() method in BasicCrawler that handles all core context initialization (helpers, request fetching, session management, etc.)
  • Moves context pipeline invocation from runRequestHandler() to the runTaskFunction level in AutoscaledPool
  • Updates subclasses (HttpCrawler, BrowserCrawler, FileDownload) to call super.buildContextPipeline() and extend the pipeline idiomatically
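The override pattern from the last bullet can be sketched roughly as follows. This is a stand-in illustration with invented minimal types (`Pipeline`, `BasicCrawlerSketch`, `HttpCrawlerSketch`), not Crawlee's actual `ContextPipeline` signatures:

```typescript
// Minimal stand-in pipeline; the real ContextPipeline lives in Crawlee.
type Step<In, Extra> = (ctx: In) => Extra;

class Pipeline<Ctx extends object = {}> {
  constructor(private readonly steps: Step<any, any>[] = []) {}

  compose<Extra extends object>(step: Step<Ctx, Extra>): Pipeline<Ctx & Extra> {
    return new Pipeline([...this.steps, step]);
  }

  call(): Ctx {
    return this.steps.reduce((ctx, step) => ({ ...ctx, ...step(ctx) }), {} as any);
  }
}

class BasicCrawlerSketch {
  protected buildContextPipeline() {
    // Core context members every crawler shares.
    return new Pipeline().compose(() => ({ request: { url: 'https://example.com' } }));
  }

  run() {
    return this.buildContextPipeline().call();
  }
}

class HttpCrawlerSketch extends BasicCrawlerSketch {
  // Subclasses extend the inherited pipeline instead of building a new one.
  protected override buildContextPipeline() {
    return super.buildContextPipeline().compose((ctx) => ({ loadedUrl: ctx.request.url }));
  }
}
```

With this shape, every subclass context is guaranteed to contain the base context members, because the base pipeline always runs first.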

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| packages/basic-crawler/src/internals/basic-crawler.ts | Adds buildContextPipeline() method for idiomatic context initialization; refactors runTaskFunction to invoke the pipeline at a higher level with improved error handling |
| packages/browser-crawler/src/internals/browser-crawler.ts | Updates to call super.buildContextPipeline() and adds the override keyword for type safety |
| packages/http-crawler/src/internals/http-crawler.ts | Updates to call super.buildContextPipeline() instead of creating a new pipeline; moves the ContextPipeline import to a type-only import |
| packages/http-crawler/src/internals/file-download.ts | Updates to call this.buildContextPipeline() for consistency with the new architecture |
| packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts | Applies result-bound helpers after pipeline execution to avoid them being overwritten by base crawler helpers |


@barjin barjin marked this pull request as ready for review February 6, 2026 15:50
@barjin barjin requested review from janbuchar February 6, 2026 15:50
@janbuchar (Contributor) left a comment


this is more of a refactor, I'd say...

…extHelpers

The enqueueLinks helper was accidentally removed from the resultBoundContextHelpers, causing links to not be enqueued correctly through the RequestHandlerResult in the adaptive crawler.

…line building

Start context pipelines from {} instead of lying about an empty object being a CrawlingContext. The pipeline gradually extends the type through compose() calls until it reaches the final CrawlingContext shape.
@barjin barjin changed the title from "chore: use ContextPipeline to initialize BasicCrawler's context idiomatically" to "refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically" Feb 9, 2026
@janbuchar (Contributor) left a comment


Just a bunch of nits, good stuff overall!

@barjin barjin requested a review from janbuchar February 13, 2026 12:18
@janbuchar (Contributor) left a comment


Only three comments, two of them are fairly important.

@barjin barjin force-pushed the chore/more-context-pipeline branch from c3c8474 to 4f5d471 February 16, 2026 12:01
@barjin barjin requested review from janbuchar and removed request for janbuchar February 16, 2026 12:39
@barjin barjin requested a review from janbuchar February 16, 2026 13:03
```diff
  * then retries them in a case of an error, etc.
  */
-protected async _runTaskFunction() {
+protected async _runTaskFunction(crawlingContext: ExtendedContext) {
```
Contributor commented:

Calling contextPipeline.call with this from autoscaledPoolOptions in run makes the split between runTaskFunction and runRequestHandler awkward. Do we even need both?

Also, originally, adaptive crawler would override runRequestHandler with the two-pipeline mechanism. Now, this mechanism is a part of one more, "outer" context pipeline. Is that intentional? I guess it shouldn't have any unforeseeable consequences, but still, it is an unexpected pattern.

Member Author (@barjin) commented:

Updated in the last three commits. The basic pipeline is now separate, and all the crawler "subclass" pipelines expect its output as their input.

This is further enabled by the new .chain() API from 797003b.

This means (among other things) that the AdaptivePlaywrightCrawler http / browser approach will both run on the same "base" context, but won't run the BasicCrawler's pipeline again.
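A loose sketch of that chaining shape (invented `Chainable` class and signatures, not the actual .chain() API from 797003b): one base pipeline produces a context once, and each crawler-specific pipeline consumes that same output as its input.

```typescript
// Invented sketch: a base pipeline feeds its output into either of two
// crawler-specific pipelines, which accept it as their input type.
class Chainable<In, Out> {
  constructor(private readonly fn: (input: In) => Out) {}

  // The next pipeline must accept exactly what this one produces.
  chain<Next>(next: Chainable<Out, Next>): Chainable<In, Next> {
    return new Chainable((input: In) => next.run(this.run(input)));
  }

  run(input: In): Out {
    return this.fn(input);
  }
}

type BaseContext = { url: string; session: { id: string } };

const basePipeline = new Chainable((input: { url: string }): BaseContext => ({
  ...input,
  session: { id: 'session-1' },
}));

// Both modes consume the same BaseContext shape, without re-running base steps themselves.
const staticPipeline = new Chainable((ctx: BaseContext) => ({ ...ctx, mode: 'static' as const }));
const browserPipeline = new Chainable((ctx: BaseContext) => ({ ...ctx, mode: 'browser' as const }));

const staticRun = basePipeline.chain(staticPipeline).run({ url: 'https://example.com' });
const browserRun = basePipeline.chain(browserPipeline).run({ url: 'https://example.com' });
```

Typing chain() this way is what lets both the static and browser pipelines share one request/session context without either of them repeating BasicCrawler's initialization.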

Contributor commented:

Okay, I like the changes, buuut... BasicCrawler._runTaskFunction is called by the arrow function passed to AutoscaledPool, and delegates to BasicCrawler.runRequestHandler, right? And the chained pipeline is between the AutoscaledPool arrow function and BasicCrawler._runTaskFunction. So when AdaptivePlaywrightCrawler overrides runRequestHandler, it still runs the two pipelines "inside" another pipeline?

Member Author (@barjin) commented:

I'm not sure I fully understand, but I believe what you're describing is correct and in line with the recent changes.

[image: diagram of the base and crawler-specific context pipelines]

BasicCrawler will create the basic context (turquoise), which should be the same for the entirety of the request processing (request data, session (proxy, etc.), helpers...). This context is then used as the base for both the static and browser processing (note that staticContextPipeline / browserContextPipeline now start with the BasicContext and only add the crawler-specific bits). These are the green areas. Note that both modes share the same request/session, etc.

AdaptivePlaywrightCrawler doesn't have its own pipeline extension (buildContextPipeline implementation), so its native crawling context === base crawling context - even after the .chain() call.

What am I missing? 👀

Contributor commented:

Oh I think I finally understand how the adaptive crawler did not break with your changes 😁 See, previously, it did not call the "outer" pipeline at all - https://github.com/apify/crawlee/pull/3388/changes#diff-f409afe36a2511464bd45cfdf042c4f0a2e47717f2a55f951b4457757c95ff58R309 just returned a null that would crash it if it did. Instead, it put its context pipeline logic in the runRequestHandler override.

Your changes substitute the "basic" pipeline in case the contextPipelineBuilder returns null. This is problematic from a type safety perspective (the crawler uses a pipeline that works with a smaller context type than what its type parameters require).

Any ideas what to do about that?

Member Author (@barjin) commented Feb 25, 2026:

Can we use AdaptivePlaywrightCrawler.buildContextPipeline to return a bunch of bogus Proxy objects, throwing errors on property access/call (in case the user somehow gets to these directly)? All of them would then get overridden in the inner pipelines with the correct equivalents. wdyt?
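The proposal above, sketched with hypothetical helper names (this illustrates the suggestion, not merged code): placeholder context members backed by a Proxy that throws on any property access or call, so reaching them before an inner pipeline replaces them fails loudly.

```typescript
// Hypothetical placeholder helpers: Proxy objects that throw on any property
// access or call. Inner pipelines are expected to override them with real ones.
function bogusHelper(name: string): any {
  return new Proxy(() => {}, {
    get(_target, prop) {
      throw new Error(`${name}.${String(prop)} is not available in this context yet`);
    },
    apply() {
      throw new Error(`${name} cannot be called before the inner pipeline provides it`);
    },
  });
}

// The outer pipeline would seed these placeholders.
const placeholderContext = {
  page: bogusHelper('page'),
  enqueueLinks: bogusHelper('enqueueLinks'),
};

try {
  placeholderContext.enqueueLinks();
} catch (e) {
  console.log((e as Error).message);
}
```

Using a function as the Proxy target lets the same placeholder trap both property reads (the `get` trap) and direct invocation (the `apply` trap).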



Copilot AI commented Feb 24, 2026

@barjin I've opened a new pull request, #3436, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Feb 24, 2026

@barjin I've opened a new pull request, #3437, to work on those changes. Once the pull request is ready, I'll request review from you.

```diff
-const subCrawlerContext = { ...context, ...resultBoundContextHelpers };
+const subCrawlerContext = { ...context };
+
+for (const [key, descriptor] of Object.entries(Object.getOwnPropertyDescriptors(resultBoundContextHelpers))) {
```
Contributor commented:

Can you add a comment to explain this please?


