refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically #3388
Pull request overview
This pull request refactors the context initialization logic in Crawlee's crawler architecture by moving all CrawlingContext setup into the ContextPipeline. This change provides tighter control over context construction and prepares the codebase for the upcoming session pool exclusivity changes in PR #3380.
Changes:
- Introduces a new `buildContextPipeline()` method in `BasicCrawler` that handles all core context initialization (helpers, request fetching, session management, etc.)
- Moves context pipeline invocation from `runRequestHandler()` to the `runTaskFunction` level in `AutoscaledPool`
- Updates subclasses (`HttpCrawler`, `BrowserCrawler`, `FileDownload`) to call `super.buildContextPipeline()` and extend the pipeline idiomatically
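The subclass extension pattern described above might be sketched like this. All class and method bodies are simplified stand-ins, not Crawlee's actual `ContextPipeline` or crawler implementations:

```typescript
type Step<In, Out> = (input: In) => Promise<Out>;

// Simplified stand-in for a context pipeline: an async step chain that can be
// extended with compose(). Not Crawlee's actual ContextPipeline implementation.
class ContextPipeline<In, Out> {
    constructor(private readonly steps: Step<In, Out>) {}

    compose<Next>(step: Step<Out, Next>): ContextPipeline<In, Next> {
        return new ContextPipeline(async (input: In) => step(await this.steps(input)));
    }

    async call(input: In): Promise<Out> {
        return this.steps(input);
    }
}

interface BasicContext { request: { url: string }; sessionId: string }
interface HttpContext extends BasicContext { body: string }

class BasicCrawlerSketch {
    // All core context initialization lives in one overridable method.
    buildContextPipeline(): ContextPipeline<{ url: string }, BasicContext> {
        return new ContextPipeline(async ({ url }) => ({
            request: { url },
            sessionId: 'session-1', // placeholder for real session management
        }));
    }
}

class HttpCrawlerSketch extends BasicCrawlerSketch {
    // Subclasses call super.buildContextPipeline() and extend it idiomatically.
    override buildContextPipeline(): ContextPipeline<{ url: string }, HttpContext> {
        return super.buildContextPipeline().compose(async (ctx) => ({
            ...ctx,
            body: `<fetched ${ctx.request.url}>`, // placeholder for the real HTTP request
        }));
    }
}
```

Each subclass only adds its own steps, so the core initialization stays in one place.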
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/basic-crawler/src/internals/basic-crawler.ts | Adds buildContextPipeline() method for idiomatic context initialization; refactors runTaskFunction to invoke the pipeline at a higher level with improved error handling |
| packages/browser-crawler/src/internals/browser-crawler.ts | Updates to call super.buildContextPipeline() and adds override keyword for type safety |
| packages/http-crawler/src/internals/http-crawler.ts | Updates to call super.buildContextPipeline() instead of creating a new pipeline; moves ContextPipeline import to type-only import |
| packages/http-crawler/src/internals/file-download.ts | Updates to call this.buildContextPipeline() for consistency with the new architecture |
| packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts | Updates to apply result-bound helpers after pipeline execution to avoid being overwritten by base crawler helpers |
**janbuchar** left a comment:

this is more of a refactor, I'd say...
…extHelpers

The `enqueueLinks` helper was accidentally removed from the `resultBoundContextHelpers`, causing links not to be enqueued correctly through the `RequestHandlerResult` in the adaptive crawler.

…line building

Start context pipelines from `{}` instead of lying about an empty object being a `CrawlingContext`. The pipeline gradually extends the type through `compose()` calls until it reaches the final `CrawlingContext` shape.
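The typing idea in this commit message can be illustrated with a minimal sketch. The `Pipeline` class below is illustrative only, not Crawlee's implementation:

```typescript
// Sketch: the pipeline starts from an honest `{}` and each compose() step
// widens the context type, instead of casting an empty object to the final
// CrawlingContext shape up front.
class Pipeline<Ctx> {
    constructor(private readonly build: () => Ctx) {}

    static empty(): Pipeline<{}> {
        return new Pipeline(() => ({}));
    }

    compose<Extra>(step: (ctx: Ctx) => Ctx & Extra): Pipeline<Ctx & Extra> {
        return new Pipeline(() => step(this.build()));
    }

    call(): Ctx {
        return this.build();
    }
}

// The type parameter grows with each step until it reaches the final shape:
const pipeline = Pipeline.empty()
    .compose((ctx) => ({ ...ctx, request: { url: 'https://example.com' } }))
    .compose((ctx) => ({ ...ctx, session: { id: 's-1' } }));

const context = pipeline.call();
```

The compiler tracks exactly which fields exist after each step, so no step can read a field that an earlier step has not produced.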
**janbuchar** left a comment:

Just a bunch of nits, good stuff overall!
**janbuchar** left a comment:

Only three comments, two of them are fairly important.
Force-pushed from c3c8474 to 4f5d471.
```diff
  * then retries them in a case of an error, etc.
  */
-protected async _runTaskFunction() {
+protected async _runTaskFunction(crawlingContext: ExtendedContext) {
```
Calling contextPipeline.call with this from autoscaledPoolOptions in run makes the split between runTaskFunction and runRequestHandler awkward. Do we even need both?
Also, originally, adaptive crawler would override runRequestHandler with the two-pipeline mechanism. Now, this mechanism is a part of one more, "outer" context pipeline. Is that intentional? I guess it shouldn't have any unforeseeable consequences, but still, it is an unexpected pattern.
Updated in the last three commits. The basic pipeline is now separate, and all the crawler "subclass" pipelines expect its output as their input.
This is further enabled by the new .chain() API from 797003b.
This means (among other things) that the AdaptivePlaywrightCrawler http / browser approach will both run on the same "base" context, but won't run the BasicCrawler's pipeline again.
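A minimal sketch of what such a chaining operation could look like, using a hypothetical `Pipe` class (the real `.chain()` API from commit 797003b may differ):

```typescript
// Sketch: chaining joins two pipelines so the second consumes the first one's
// output, letting subclass pipelines start from the BasicCrawler context
// without re-running its steps.
class Pipe<In, Out> {
    constructor(readonly run: (input: In) => Out) {}

    chain<Next>(next: Pipe<Out, Next>): Pipe<In, Next> {
        return new Pipe((input: In) => next.run(this.run(input)));
    }
}

// Hypothetical base pipeline producing the shared "basic" context:
const basic = new Pipe((url: string) => ({ url, session: 's-1' }));

// Hypothetical crawler-specific extensions, each expecting the basic context:
const browser = new Pipe((ctx: { url: string; session: string }) => ({
    ...ctx,
    page: `page for ${ctx.url}`, // placeholder for a real browser page
}));
const http = new Pipe((ctx: { url: string; session: string }) => ({
    ...ctx,
    body: `body of ${ctx.url}`, // placeholder for a real HTTP response
}));

// Both modes branch off the same base pipeline, sharing request/session state:
const browserPipeline = basic.chain(browser);
const httpPipeline = basic.chain(http);
```

This mirrors the adaptive crawler situation described above: both branches build on one base context without repeating the `BasicCrawler` steps.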
Okay, I like the changes, buuut... BasicCrawler._runTaskFunction is called by the arrow function passed to AutoscaledPool, and delegates to BasicCrawler.runRequestHandler, right? And the chained pipeline is between the AutoscaledPool arrow function and BasicCrawler._runTaskFunction. So when AdaptivePlaywrightCrawler overrides runRequestHandler, it still runs the two pipelines "inside" another pipeline?
I'm not sure I fully understand, but I believe what you're describing is correct and in line with the recent changes.
BasicCrawler will create the basic context (turquoise), which should be the same for the entirety of the request processing (request data, session (proxy, etc.), helpers...). This context is then used as the base for both the static and browser processing (note that staticContextPipeline / browserContextPipeline now start with the BasicContext and only add the crawler-specific bits). These are the green areas. Note that both modes share the same request/session, etc.
AdaptivePlaywrightCrawler doesn't have its own pipeline extension (buildContextPipeline implementation), so its native crawling context === base crawling context - even after the .chain() call.
What am I missing? 👀
Oh I think I finally understand how the adaptive crawler did not break with your changes 😁 See, previously, it did not call the "outer" pipeline at all - https://github.com/apify/crawlee/pull/3388/changes#diff-f409afe36a2511464bd45cfdf042c4f0a2e47717f2a55f951b4457757c95ff58R309 just returned a null that would crash it if it did. Instead, it put its context pipeline logic in the runRequestHandler override.
Your changes substitute the "basic" pipeline in case the contextPipelineBuilder returns null. This is problematic from a type safety perspective (the crawler uses a pipeline that works with a smaller context type than what its type parameters require).
Any ideas what to do about that?
Can we use AdaptivePlaywrightCrawler.buildContextPipeline to return a bunch of bogus Proxy objects, throwing errors on property access/call (in case the user somehow gets to these directly)? All of them would then get overridden in the inner pipelines with the correct equivalents. wdyt?
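The suggested Proxy trick could look roughly like this. The `bogus()` helper below is hypothetical, not existing Crawlee code:

```typescript
// Sketch: a placeholder object that throws on any property access or call,
// so accidental direct use fails loudly until the inner pipelines replace it
// with a real implementation.
function bogus<T extends object>(name: string): T {
    const fail = (detail: string): never => {
        throw new Error(`${name}${detail} is a placeholder and not available in this context`);
    };
    // A function target keeps the proxy callable, so the apply trap works too.
    return new Proxy(function () {}, {
        get: (_target, prop) => fail(`.${String(prop)}`),
        apply: () => fail('()'),
    }) as unknown as T;
}

// Hypothetical stand-in for a browser page, later overridden by an inner pipeline:
const page = bogus<{ goto(url: string): Promise<void> }>('page');
```

Any read like `page.goto` would throw immediately with a message naming the missing helper, instead of failing later with an opaque `undefined` error.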
Force-pushed from 9f17fb8 to 075af99.
```diff
-const subCrawlerContext = { ...context, ...resultBoundContextHelpers };
+const subCrawlerContext = { ...context };
+
+for (const [key, descriptor] of Object.entries(Object.getOwnPropertyDescriptors(resultBoundContextHelpers))) {
```
Can you add a comment to explain this please?
Extracts all `CrawlingContext` initialization to `ContextPipeline` steps to tighten the control over the `CrawlingContext` contents.

Blocks #3380