Write an example of Crawly with splash integrated as a proxy #27
Might need to consider that in a multi-spider situation, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
Related to #18
Yes, in general we need to be able to set up different HTTP clients for different spiders.
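As a sketch of what per-spider client selection might look like (the `:fetcher` option, `override_settings/0` hook, and `FetchWithHTTPoison` module are hypothetical names for this discussion, not confirmed Crawly API):

```elixir
# Hypothetical sketch: a spider opts out of browser rendering by
# declaring a plain-HTTP fetcher; `:fetcher` is an assumed option name.
defmodule StaticPagesSpider do
  use Crawly.Spider

  # This spider scrapes static HTML, so no Splash/browser rendering is needed.
  def override_settings(), do: [fetcher: FetchWithHTTPoison]

  def base_url(), do: "https://example.com"
  def init(), do: [start_urls: ["https://example.com/"]]
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
```

A spider that does need JavaScript rendering would declare a Splash-backed fetcher instead, and the rest of the pipeline would be unaffected.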
Thinking a bit more:
I think that using the request makes the most sense. The spider is in a good position to decide what kind of request should be made on its behalf. Thinking out loud, what about returning plain URL strings in the
This would also minimize the construction of unnecessary request structs when the request turns out to be a duplicate (or is dropped for some other reason). Crawly would only call the spider's
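One way to read this lazy-construction idea (all names below are hypothetical): deduplicate on the bare URL strings first, and only build full request structs, via a spider-supplied callback, for the URLs that survive the filter:

```elixir
# Hypothetical sketch: filter plain URL strings against the set of
# already-seen URLs before constructing any request structs at all.
defmodule LazyRequests do
  def new_requests(urls, seen, build_request) do
    urls
    |> Enum.reject(&MapSet.member?(seen, &1))
    |> Enum.uniq()
    |> Enum.map(build_request)
  end
end

# Usage:
#   seen = MapSet.new(["https://a.test/1"])
#   LazyRequests.new_requests(["https://a.test/1", "https://a.test/2"], seen, &%{url: &1})
# builds a request struct only for the second (unseen) URL.
```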
@jallum I think improving Maybe a
Agreed.
It turns out that Splash does not yet have a proxy interface :(. I don't really like this way of doing requests:
So I will probably switch back to the original idea of headless browsers.
But this would all be within a middleware, right? It would be transparent to the user. I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.
Yes, technically it's possible to build a middleware, similarly to what Scrapy does: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py What needs to be done in this case:
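Following the scrapy-splash pattern, such a middleware would rewrite the outgoing request so that it targets Splash's `render.html` endpoint, passing the original URL as a query parameter (the module name and middleware contract below are hypothetical sketches):

```elixir
defmodule Middlewares.SplashRender do
  # Hypothetical middleware sketch: rewrites a request so it is fetched
  # through a local Splash instance's render.html endpoint, the way
  # scrapy-splash's middleware does. Assumes Splash listens on port 8050.
  @splash_base "http://localhost:8050/render.html"

  def run(request, state) do
    query = URI.encode_query(%{"url" => request.url, "wait" => "2"})
    {%{request | url: @splash_base <> "?" <> query}, state}
  end
end
```

The drawback discussed above applies: the middleware fundamentally changes the request URL, which every later pipeline stage then sees.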
I am ok with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
@oltarasenko I think we shouldn't put it in the middleware, since it shouldn't fundamentally change the request, but only the way the data is fetched. I think we can abstract the response fetching out into a configurable module. What we can do is to make the
Thus, we can have a
The fetcher can be declared in the config like so:

```elixir
config :crawly,
  ....
  fetcher: FetchWithSplash
  ....
```

Expanding on this idea, we can then allow configuration to take place (once PR #31 is complete), thereby allowing configuration for issue #33:

```elixir
config :crawly,
  fetcher: {FetchWithHTTPoison, options: blah}
```

Let me know what you think. With this proposal, the received
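A minimal sketch of such a pluggable fetcher (the behaviour and module names are hypothetical illustrations of this proposal, not the API Crawly eventually shipped):

```elixir
defmodule Crawly.Fetcher do
  # Hypothetical behaviour: every fetcher takes a request (plus options)
  # and returns a response, hiding how the data was actually retrieved.
  @callback fetch(request :: map(), options :: keyword()) ::
              {:ok, response :: term()} | {:error, term()}
end

defmodule FetchWithHTTPoison do
  @behaviour Crawly.Fetcher

  # Plain HTTP fetch via HTTPoison.
  @impl true
  def fetch(request, options) do
    HTTPoison.get(request.url, [], options)
  end
end

defmodule FetchWithSplash do
  @behaviour Crawly.Fetcher

  # Fetches the page through Splash's render.html endpoint, so the
  # response body contains the JavaScript-rendered HTML.
  @impl true
  def fetch(request, options) do
    splash = Keyword.get(options, :splash_url, "http://localhost:8050/render.html")
    HTTPoison.get(splash <> "?" <> URI.encode_query(%{"url" => request.url}), [], [])
  end
end
```

With this shape, swapping fetchers is purely a config change, and middlewares and pipelines keep seeing the original, unmodified request.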
I also think Docker would be unnecessary overhead, and even more for the end user to learn to set up.
I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).
@oltarasenko yup, I saw that you'd added the line for HTTPoison in the
The example project is located here: https://github.com/oltarasenko/autosites |