Write an example of Crawly with splash integrated as a proxy #27

Closed

oltarasenko opened this issue Nov 29, 2019 · 15 comments

Comments

@oltarasenko
Collaborator

No description provided.

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

We might need to consider that in a multi-spider situation, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
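For example, purely as a hypothetical sketch of the spider-level idea (this callback does not exist in Crawly; the name is made up):

defmodule MySpider do
  use Crawly.Spider

  # Hypothetical flag telling Crawly that this spider's pages need
  # browser rendering (e.g. via Splash), while other spiders keep
  # using the plain HTTP client.
  def browser_rendering?(), do: true

  def base_url(), do: "https://example.com"
  def init(), do: [start_urls: ["https://example.com"]]
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end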

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

related to #18

@oltarasenko
Collaborator Author

Yes, in general we need to be able to set up different HTTP clients for different spiders.

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

Thinking a bit more:
Could it be at the request level? What if certain parts of a site's structure require JS and certain parts don't? It would be unnecessary and inefficient to use a browser for the parts that don't require JS.

@jallum
Contributor

jallum commented Nov 29, 2019

I think that using the request makes the most sense. The spider is in a good position to make the decision as to what kind of request should be made on its behalf.

Thinking out loud, what about returning plain URL strings in the ParsedItem, and then Crawly could pass those URL strings back to a new (optional) callback on the spider to produce the request for a given URL? At that point, the spider could configure a request and return it. The default implementation of build_request (or whatever better name people can think of) would also eliminate a bit of boilerplate that seems to find its way into all my spiders:

# Convert URLs into requests
requests =
  urls
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(&Crawly.Utils.request_from_url/1)

This would also minimize the construction of unnecessary request data when the request turns out to be a duplicate (or is dropped for some other reason); Crawly would only call the spider's build_request on URLs it actually intends to follow.
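To illustrate, a rough sketch of what such an optional callback could look like on a spider (build_request/1 is hypothetical and not part of Crawly's current API; the per-request option used here is only a placeholder):

defmodule MySpider do
  use Crawly.Spider

  # Hypothetical optional callback: Crawly would invoke this only for URLs
  # it actually intends to follow, so the spider can decide per request
  # whether browser rendering (or any other option) is needed.
  def build_request(url) do
    request = Crawly.Utils.request_from_url(url)

    if String.contains?(url, "/catalogue/") do
      # e.g. mark this request as needing a JS-capable fetcher
      %{request | options: [browser_rendering: true]}
    else
      request
    end
  end
end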

@Ziinc
Collaborator

Ziinc commented Dec 2, 2019

@jallum I think improving Crawly.Utils.request_from_url/1 could be split out into another issue, where relative/absolute URL checks and building are done automatically.

Maybe a :browser boolean flag in Crawly.Request?

@jallum
Contributor

jallum commented Dec 2, 2019

Agreed.

@oltarasenko
Collaborator Author

oltarasenko commented Dec 7, 2019

It turns out that splash does not yet have a proxy interface :(. I don't really like this way of doing requests:

curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

So I will probably switch back to the original idea of headless browsers.

@Ziinc
Collaborator

Ziinc commented Dec 8, 2019

But this would be all within a middleware, right? It would be transparent to the user.

I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.

Ziinc closed this as completed Dec 8, 2019
Ziinc reopened this Dec 8, 2019
@oltarasenko
Collaborator Author

Yes, technically it's possible to build a middleware, similar to what Scrapy does: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py

What needs to be done in this case:

  1. Modify the request URL into a Splash-based URL
  2. Unwrap the URL so that we get the original URL back at the spider level

I am ok with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
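For reference, a minimal sketch of what step 1 could look like as a middleware, similar in spirit to scrapy-splash (the module name and Splash endpoint are assumptions, and it deliberately ignores the unwrapping problem from step 2):

defmodule Middlewares.SplashWrapper do
  @moduledoc "Hypothetical middleware that rewrites the request URL to go through Splash's render.html endpoint."
  @splash_base "http://localhost:8050/render.html"

  def run(request, state) do
    splash_url = @splash_base <> "?" <> URI.encode_query(url: request.url, wait: 0.5)
    {%{request | url: splash_url}, state}
  end
end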

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

@oltarasenko I think we shouldn't put it in the middleware, since it shouldn't fundamentally change the request, but only the way that the data is fetched.

I think we can abstract out the response fetching to a configurable module.
Considering that the worker in lib/crawly/worker.ex fetches the response like so:
[screenshot of the response-fetching (get_response) step in lib/crawly/worker.ex]

What we can do is make the get_response function "pluggable" (not referring to Plug). This module would be responsible for converting a request into a response.

Thus, we could have a Fetcher protocol, with out-of-the-box support for FetchWithHTTPoison and FetchWithSplash, and typespecs that require an HTTPoison.Response struct to be returned.
I think Fetcher or something similar would be more intuitive and clearer than referring to it as an HTTP client (since "client" can refer to many things), and it allows for more flexibility if there are other protocols that users want to implement.
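A rough sketch of what that could look like as an Elixir behaviour (the fetch/1 callback name and the Splash endpoint are assumptions, not an agreed API):

defmodule Crawly.Fetcher do
  @callback fetch(Crawly.Request.t()) ::
              {:ok, HTTPoison.Response.t()} | {:error, HTTPoison.Error.t()}
end

defmodule FetchWithHTTPoison do
  @behaviour Crawly.Fetcher

  def fetch(request) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end

defmodule FetchWithSplash do
  @behaviour Crawly.Fetcher

  # Asks a local Splash instance to render the page, returning a normal
  # HTTPoison.Response so the rest of the pipeline stays unchanged.
  def fetch(request) do
    splash_url =
      "http://localhost:8050/render.html?" <> URI.encode_query(url: request.url, wait: 0.5)

    HTTPoison.get(splash_url, request.headers, request.options)
  end
end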

The fetcher could then be declared like so in the config:

config :crawly,
    ....
    fetcher: FetchWithSplash
    ....

Expanding on this idea, we can then allow configuration to be passed in (once PR #31 is complete), thereby allowing the configuration needed for issue #33:

config :crawly, 
    fetcher: {FetchWithHTTPoison, options: blah }

Let me know what you think. With this proposal, the received Request does not need to be modified, as it would have to be if implemented as a middleware.

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

I also think Docker would be unnecessary overhead, and would be even more for the end user to learn and set up.

@oltarasenko
Collaborator Author

I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

@oltarasenko yup, I saw that you'd added the line for HTTPoison in config.exs. I'll check through #32 today, then once it's merged in, I'll start on updating the docs for #31.

@oltarasenko
Collaborator Author

The example project is located here: https://github.com/oltarasenko/autosites
