
Passing context to crawler handlers #103

Closed

janbuchar opened this issue Apr 11, 2024 · 9 comments

Labels: solutioning (the issue is not being implemented but only analyzed and planned), t-tooling (issues with this label are in the ownership of the tooling team)

Comments

@janbuchar (Collaborator) commented Apr 11, 2024

The problem

We want to pick the best approach for passing context data and helpers to the various handler functions in crawlee.py. We already have an implementation in place, but if there's a better way, we should switch sooner rather than later.

What OG Crawlee does

```ts
new CheerioCrawler({
  requestHandler: async ({ request, pushData, enqueueLinks }) => { // types of helpers are inferred correctly - thanks, TypeScript
    // ...
  }
})
```

  • types are correct
  • we get suggestions for context helpers
  • implementation is iffy from a type-safety perspective, but salvageable

Python version A (+/- current implementation)

```python
crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:  # explicit type annotation is necessary for type checking and suggestions
    await context.push_data(...)
```

  • if a type checker and annotations are used, types are correct (it can't get better than that in Python)
  • we get suggestions for context helpers
  • the implementation is type-safe enough, but very inflexible
    • in contrast to TypeScript, it won't be salvageable anytime soon - not until we have intersection types (see the sketch below)
  • there is no "object destructuring" in Python, so everything needs to be prefixed with context.
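
To illustrate the inflexibility: without an intersection type, every combination of capabilities has to be expressed as an explicitly merged context class. A minimal sketch, with hypothetical class and attribute names:

```python
# Without intersection types, context "merging" has to be spelled out as an
# inheritance chain of explicit classes (names here are hypothetical):
from dataclasses import dataclass


@dataclass
class BasicCrawlingContext:
    request: "Request"  # the request currently being processed


@dataclass
class HttpCrawlingContext(BasicCrawlingContext):
    http_response: "HttpResponse"  # adds the raw HTTP response


@dataclass
class BeautifulSoupCrawlingContext(HttpCrawlingContext):
    soup: "BeautifulSoup"  # adds the parsed document


# Each new capability needs a fixed place in this chain; there is no way to
# express "HttpCrawlingContext & WithSoup" the way TypeScript can.
```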

Python version B

This proposal is similar to how pytest fixtures or FastAPI dependencies work.

```python
crawler = BeautifulSoupCrawler(...)

@crawler.router.default_request_handler
async def default_handler(push_data: PushData, soup: BeautifulSoup) -> None:  # explicit type annotation is necessary for type checking
    await push_data(...)
```

  • no context. prefix
  • the function signature is not checked by a type checker, but we can check it ourselves when the handler is registered, which should be fine as well
  • allows for a more flexible implementation with easier code reuse
  • no suggestions of parameter names
  • the "injection" from Crawlee's side can be based on both parameter name and type annotation, so the type annotations are optional for users (but if they don't use them, they miss out on type safety and autocompletion) - see the sketch below
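
A minimal sketch of how such injection could work (a hypothetical helper, not the actual implementation), matching handler parameters by name against the available context values:

```python
# Hypothetical sketch: resolve handler parameters by name from a dict of
# available context values, then call the handler with them.
import inspect
from typing import Any, Awaitable, Callable


async def call_handler(handler: Callable[..., Awaitable[None]], available: dict[str, Any]) -> None:
    kwargs: dict[str, Any] = {}
    for name in inspect.signature(handler).parameters:
        if name not in available:
            # hints in error messages help users discover what can be injected
            raise TypeError(f"Cannot inject parameter {name!r}; available: {sorted(available)}")
        kwargs[name] = available[name]
    await handler(**kwargs)
```

Matching by type annotation would work analogously, reading `inspect.signature(handler).parameters[name].annotation` instead of the parameter name.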

Please voice your opinions on the matter 🙂 We also welcome any alternative approaches, of course.

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. solutioning The issue is not being implemented but only analyzed and planned. labels Apr 11, 2024
@B4nan (Member) commented Apr 11, 2024

> implementation is iffy

curious which part of the current js implementation you see as iffy? :]

> @crawler.default_request_handler

we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

@janbuchar (Collaborator, Author) commented Apr 11, 2024

> > implementation is iffy
>
> curious which part of the current js implementation you see as iffy? :]

The part where we disable type checking so that we can push an invalid object through an inheritance chain and hope that it will come out valid 🙂
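
In Python terms, the equivalent trick would look something like this (illustrative only, not real Crawlee code):

```python
# Rough Python analogue of the "iffy" TypeScript trick: construct an
# incomplete object, silence the type checker with a cast, and trust that
# later steps of the inheritance chain fill in the missing attributes.
from typing import Any, cast

partial: dict[str, Any] = {"request": ...}  # not a valid context yet
context = cast("BeautifulSoupCrawlingContext", partial)  # checker appeased, object still invalid
```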

> > @crawler.default_request_handler
>
> we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

Yes, but the example was slightly incorrect - it should have been crawler.router.default_request_handler. We use a router by default; we can stop doing that, but let's not go off the rails in this issue.

Passing a request_handler is possible (even in the current implementation), but since lambdas (anonymous functions) are limited to a single expression in Python, it kinda sucks - you need to define a named function before you instantiate the crawler.
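
For illustration, a minimal sketch assuming a request_handler constructor argument (hypothetical spelling):

```python
# Constructor-style registration: the handler must be a named function
# defined before the crawler is instantiated (hypothetical API spelling).
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.push_data(...)

crawler = BeautifulSoupCrawler(request_handler=handler)
```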

> i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

We could, but I have a bad feeling about it - users being confused and all that.

@B4nan (Member) commented Apr 11, 2024

Ok good, for router the decorators actually make a lot of sense to me.

@vdusek (Collaborator) commented Apr 11, 2024

A question regarding version B: as a user, how would I determine which arguments are available for injection into a handler? Will there be a mechanism in the Crawler (e.g. a protocol or abstract method) that informs users about the handler interface, or will this information be provided only through documentation?

In version A, I suppose the module with the Crawler would also contain a Context class to expect in the handler.

@janbuchar (Collaborator, Author):

> A question regarding version B: as a user, how would I determine which arguments are available for injection into a handler? Will there be a mechanism in the Crawler (e.g. a protocol or abstract method) that informs users about the handler interface, or will this information be provided only through documentation?

I believe that documentation, and possibly hints in exception messages when you attempt to use something we can't provide, are sadly our only options here. I'd be overjoyed if we could think of some way to advertise this with Python types.

> In version A, I suppose the module with the Crawler would also contain a Context class to expect in the handler.

Exactly. Though I'm afraid state-of-the-art Python type checkers are not capable of inferring the argument type; you have to specify it manually. But it's one look in the docs instead of several.

@vdusek (Collaborator) commented Apr 11, 2024

Handler decorator API

> we use this kind of API currently? you create the crawler class and decorate a global function with that? i recall some discussion around decorators, but honestly, i was expecting the same API as we have in the JS version, so an options object with the request handler as a key. if we want this, i don't understand why it is called "default request handler", as there can be only one request handler within a single crawler instance

Using handler registration as a decorator reflects a more Pythonic approach to this kind of functionality. It aligns with the practices seen in various frameworks: the web frameworks Flask and FastAPI use decorators to register handlers for routes, Airflow uses them to register tasks in DAGs, and pytest uses them to register fixtures and marks. I would rather give priority to doing things in a Pythonic way instead of striving for a uniform API with the JS version of the library.
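
For reference, the decorator-registration pattern as it appears in Flask:

```python
# Route registration via decorator in Flask - the same pattern the proposed
# @crawler.router.default_request_handler decorator follows.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index() -> str:
    return "Hello, world!"
```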

More handler interfaces

> i like the first variant better, as that way you only need to know what a single interface is called and then get intellisense on what it provides, as opposed to the second one where you need to know everything yourself (both the key and its type). but the second one is also an interesting idea. could we support both at the same time? e.g. via different decorators?

To be honest, I don't see an advantage in offering multiple ways to achieve the same result, especially if it means maintaining a larger and more complex codebase. I would prefer to settle on one approach - even if it's not my preferred one - rather than complicate the library.

So which way to go

From the user's perspective, version A seems to be the winner to me. It simply provides a better developer experience: an explicit declaration of the handler's interface, which means better type checking and code suggestions. Now it depends on whether the complexity and reduced flexibility outweigh its benefits.

We already discussed this with Honza yesterday, and the main issue is the absence of an intersection type in Python. Because of that, we are compelled to define explicit merged context classes, with the potential boilerplate around them. It's not great, but IMO it remains manageable. So I slightly prefer version A (though I'm not 100% sure).

@B4nan (Member) commented Apr 11, 2024

> Using handler registration as a decorator reflects a more Pythonic approach to this kind of functionality. It aligns with the practices seen in various frameworks: the web frameworks Flask and FastAPI use decorators to register handlers for routes, Airflow uses them to register tasks in DAGs, and pytest uses them to register fixtures and marks. I would rather give priority to doing things in a Pythonic way instead of striving for a uniform API with the JS version of the library.

Understood, no problem with that, as long as we keep the same/similar DX - which I don't think would be the case with variant B.

> Because of that, we are compelled to define explicit merged context classes

Sounds good to me, a bit worse for maintenance, but much better observability for new users.

@jirimoravcik (Member):

I prefer version A.
As for intersection types, I checked and it seems the PEP is still a work in progress (https://github.com/CarliJoy/intersection_examples/blob/main/docs/specification.rst), so I don't think we'll be getting them anytime soon 😢

@vdusek vdusek added this to the 87th sprint - Tooling team milestone Apr 12, 2024
@fnesveda fnesveda removed their assignment Apr 12, 2024
@janbuchar (Collaborator, Author):

I believe it's clear that version A is the winner 🙂
