
Passing context to crawler handlers #103

Closed

janbuchar opened this issue Apr 11, 2024 · 9 comments

Labels: solutioning (the issue is not being implemented but only analyzed and planned), t-tooling (issues with this label are in the ownership of the tooling team)

Comments

@janbuchar (Collaborator) commented Apr 11, 2024

The problem

We want to pick the best approach for passing context data and helpers to the various handler functions in crawlee.py. We already have an implementation in place, but if there's a better way, we should switch sooner rather than later.

What OG Crawlee does

```ts
new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks, TypeScript
    // ...
  }
})
```

  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

```python
crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
    await context.push_data(...)
```

  • if a type checker and annotations are used, types are correct (it can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types (see the sketch below)
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.
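
To illustrate the inflexibility: without an intersection type, every combination of capabilities has to be expressed as an explicitly merged context class. A minimal sketch, with hypothetical class and attribute names:

```python
# Without intersection types, context "merging" has to be spelled out as an
# inheritance chain of explicit classes (names here are hypothetical):
from dataclasses import dataclass


@dataclass
class BasicCrawlingContext:
    request: "Request"  # the request currently being processed


@dataclass
class HttpCrawlingContext(BasicCrawlingContext):
    http_response: "HttpResponse"  # adds the raw HTTP response


@dataclass
class BeautifulSoupCrawlingContext(HttpCrawlingContext):
    soup: "BeautifulSoup"  # adds the parsed document


# Each new capability needs a fixed place in this chain; there is no way to
# express "HttpCrawlingContext & WithSoup" the way TypeScript can.
```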

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

```python
crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
    await push_data(...)
```

  • no context. prefix
  • the function signature is not checked by a type checker, but we can check it ourselves when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use them, they miss out on type safety and autocompletion) - see the sketch below
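
A minimal sketch of how such injection could work (a hypothetical helper, not the actual implementation), matching handler parameters by name against the available context values:

```python
# Hypothetical sketch: resolve handler parameters by name from a dict of
# available context values, then call the handler with them.
import inspect
from typing import Any, Awaitable, Callable


async def call_handler(handler: Callable[..., Awaitable[None]], available: dict[str, Any]) -> None:
    kwargs: dict[str, Any] = {}
    for name in inspect.signature(handler).parameters:
        if name not in available:
            # hints in error messages help users discover what can be injected
            raise TypeError(f"Cannot inject parameter {name!r}; available: {sorted(available)}")
        kwargs[name] = available[name]
    await handler(**kwargs)
```

Matching by type annotation would work analogously, reading `inspect.signature(handler).parameters[name].annotation` instead of the parameter name.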

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. solutioning The issue is not being implemented but only analyzed and planned. labels Apr 11, 2024
@B4nan (Member) commented Apr 11, 2024

> implementation is iffy

curious which part of the current js implementation you see as iffy? :]

> @crawler.default_request_handler

we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

@janbuchar (Collaborator, Author) commented Apr 11, 2024

> > implementation is iffy
>
> curious which part of the current js implementation you see as iffy? :]

The part where we disable type checking so that we can push an invalid object through an inheritance chain and hope that it will come out valid 🙂
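
In Python terms, the equivalent trick would look something like this (illustrative only, not real Crawlee code):

```python
# Rough Python analogue of the "iffy" TypeScript trick: construct an
# incomplete object, silence the type checker with a cast, and trust that
# later steps of the inheritance chain fill in the missing attributes.
from typing import Any, cast

partial: dict[str, Any] = {"request": ...}  # not a valid context yet
context = cast("BeautifulSoupCrawlingContext", partial)  # checker appeased, object still invalid
```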

> > @crawler.default_request_handler
>
> we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

Yes, but the example was slightly incorrect - it should have been crawler.router.default_request_handler. We use a router by default; we can stop doing that, but let's not go off the rails in this issue.

Passing a request_handler is possible (even in the current implementation), but since lambdas (anonymous functions) are limited to a single expression in Python, it kinda sucks - you need to define a named function before you instantiate the crawler.
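
For illustration, a minimal sketch assuming a request_handler constructor argument (hypothetical spelling):

```python
# Constructor-style registration: the handler must be a named function
# defined before the crawler is instantiated (hypothetical API spelling).
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.push_data(...)

crawler = BeautifulSoupCrawler(request_handler=handler)
```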

> i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

We could, but I have a bad feeling about it - users being confused and all that.

@B4nan (Member) commented Apr 11, 2024

Ok good, for router the decorators actually make a lot of sense to me.

@vdusek (Collaborator) commented Apr 11, 2024

A question regarding version B: as a user, how would I determine which arguments are available for injection into a handler? Will there be a mechanism in the Crawler (e.g. a protocol or abstract method) that informs users about the handler interface, or will this information be provided only through documentation?

In version A, I suppose the module with the Crawler would also contain a Context class to expect in the handler.

@janbuchar (Collaborator, Author):

> A question regarding version B: as a user, how would I determine which arguments are available for injection into a handler? Will there be a mechanism in the Crawler (e.g. a protocol or abstract method) that informs users about the handler interface, or will this information be provided only through documentation?

I believe that documentation, and possibly hints in exception messages when you attempt to use something we can't provide, are sadly our only options here. I'd be overjoyed if we could think of some way to advertise this with Python types.

> In version A, I suppose the module with the Crawler would also contain a Context class to expect in the handler.

Exactly. Though I'm afraid state-of-the-art Python type checkers are not capable of inferring the argument type; you have to specify it manually. But it's one look in the docs instead of several.

@vdusek (Collaborator) commented Apr 11, 2024

Handler decorator API

> we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

Using handler registration as a decorator reflects a more Pythonic approach to this kind of functionality. It aligns with the practices seen in various frameworks: the web frameworks Flask and FastAPI use decorators to register handlers for routes, Airflow uses them to register tasks in DAGs, and pytest uses them to register fixtures and marks. I would rather give priority to doing things in a Pythonic way instead of striving for a uniform API with the JS version of the library.
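
For reference, the decorator-registration pattern as it appears in Flask:

```python
# Route registration via decorator in Flask - the same pattern the proposed
# @crawler.router.default_request_handler decorator follows.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index() -> str:
    return "Hello, world!"
```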

More handler interfaces

> i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

To be honest, I don't see an advantage in offering multiple ways to achieve the same result, especially if it means maintaining a larger and more complex codebase. I would prefer to settle on one approach - even if it's not my preferred one - rather than complicate the library.

So which way to go

From the user's perspective, version A seems to be the winner to me. It simply provides a better developer experience: an explicit declaration of the handler's interface, which means better type checking and code suggestions. Now it depends on whether the complexity and reduced flexibility outweigh its benefits.

We already discussed this with Honza yesterday, and the main issue is the absence of an intersection type in Python. Because of that, we are compelled to define explicit merged context classes, with the potential boilerplate around them. It's not great, but IMO it remains manageable. So I slightly prefer version A (though I'm not 100% sure).

@B4nan (Member) commented Apr 11, 2024

> Using handler registration as a decorator reflects a more Pythonic approach to this kind of functionality. It aligns with the practices seen in various frameworks: the web frameworks Flask and FastAPI use decorators to register handlers for routes, Airflow uses them to register tasks in DAGs, and pytest uses them to register fixtures and marks. I would rather give priority to doing things in a Pythonic way instead of striving for a uniform API with the JS version of the library.

Understood, no problem with that, as long as we keep the same/similar DX - which I don't think would be the case with variant B.

> Because of that, we are compelled to define explicit merged context classes

Sounds good to me, a bit worse for maintenance, but much better observability for new users.

@jirimoravcik (Member):

I prefer version A.
As for intersection types, I checked and it seems the PEP is still a work in progress (https://github.com/CarliJoy/intersection_examples/blob/main/docs/specification.rst), so I don't think we'll be getting them anytime soon 😢

@vdusek vdusek added this to the 87th sprint - Tooling team milestone Apr 12, 2024
@fnesveda fnesveda removed their assignment Apr 12, 2024
@janbuchar (Collaborator, Author):

I believe it's clear that version A is the winner 🙂
