Write an example of Crawly with splash integrated as a proxy #27

Closed

oltarasenko opened this issue Nov 29, 2019 · 15 comments

Comments

@oltarasenko
Collaborator

No description provided.

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

We might need to consider that in a multi-spider situation, some spiders do not need browser rendering. Maybe it can be handled at the spider implementation level, where browser rendering is declared with a flag (or something like that).
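For example, purely as a hypothetical sketch of the spider-level idea (this callback does not exist in Crawly; the name is made up):

defmodule MySpider do
  use Crawly.Spider

  # Hypothetical flag telling Crawly that this spider's pages need
  # browser rendering (e.g. via Splash), while other spiders keep
  # using the plain HTTP client.
  def browser_rendering?(), do: true

  def base_url(), do: "https://example.com"
  def init(), do: [start_urls: ["https://example.com"]]
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end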

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

related to #18

@oltarasenko
Collaborator Author

Yes, in general we need to be able to set up different HTTP clients for different spiders.

@Ziinc
Collaborator

Ziinc commented Nov 29, 2019

Thinking a bit more:
Could it be at the request level? What if certain parts of a site's structure require JS and certain parts don't? It would be unnecessary and inefficient to use a browser for the parts that don't require JS.

@jallum
Contributor

jallum commented Nov 29, 2019

I think that using the request makes the most sense. The spider is in a good position to make the decision as to what kind of request should be made on its behalf.

Thinking out loud, what about returning plain URL strings in the ParsedItem, and then Crawly could pass those URL strings back to a new (optional) callback on the spider to produce the request for a given URL? At that point, the spider could configure a request and return it. The default implementation of build_request (or whatever better name people can think of) would also eliminate a bit of boilerplate that seems to find its way into all my spiders:

# Convert URLs into requests
requests =
  urls
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(&Crawly.Utils.request_from_url/1)

This would also minimize the construction of unnecessary request data when the request turns out to be a duplicate (or is dropped for some other reason); Crawly would only call the spider's build_request on URLs it actually intends to follow.
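To illustrate, a rough sketch of what such an optional callback could look like on a spider (build_request/1 is hypothetical and not part of Crawly's current API; the per-request option used here is only a placeholder):

defmodule MySpider do
  use Crawly.Spider

  # Hypothetical optional callback: Crawly would invoke this only for URLs
  # it actually intends to follow, so the spider can decide per request
  # whether browser rendering (or any other option) is needed.
  def build_request(url) do
    request = Crawly.Utils.request_from_url(url)

    if String.contains?(url, "/catalogue/") do
      # e.g. mark this request as needing a JS-capable fetcher
      %{request | options: [browser_rendering: true]}
    else
      request
    end
  end
end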

@Ziinc
Collaborator

Ziinc commented Dec 2, 2019

@jallum I think improving Crawly.Utils.request_from_url/1 could be split out into another issue, where relative/absolute URL checks and building are done automatically.

Maybe a :browser boolean flag in Crawly.Request?

@jallum
Contributor

jallum commented Dec 2, 2019

Agreed.

@oltarasenko
Collaborator Author

oltarasenko commented Dec 7, 2019

It turns out that splash does not yet have a proxy interface :(. I don't really like this way of doing requests:

curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

So I will probably switch back to the original idea of headless browsers.

@Ziinc
Collaborator

Ziinc commented Dec 8, 2019

But this would be all within a middleware, right? It would be transparent to the user.

I think a drawback of the headless browser solution, whether it is Hound or Puppeteer, is that there are a lot of additional dependencies and moving parts.

Ziinc closed this as completed Dec 8, 2019
Ziinc reopened this Dec 8, 2019
@oltarasenko
Collaborator Author

Yes, technically it's possible to build a middleware, similar to what Scrapy does: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py

What needs to be done in this case:

  1. Modify the request URL into a Splash-based URL
  2. Unwrap the URL so that we get the original URL back at the spider level

I am ok with 1, but 2 looks like a hack to me. I wonder whether it would be easier to wrap Splash in a proxy interface inside Docker as an alternative?
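For reference, a minimal sketch of what step 1 could look like as a middleware, similar in spirit to scrapy-splash (the module name and Splash endpoint are assumptions, and it deliberately ignores the unwrapping problem from step 2):

defmodule Middlewares.SplashWrapper do
  @moduledoc "Hypothetical middleware that rewrites the request URL to go through Splash's render.html endpoint."
  @splash_base "http://localhost:8050/render.html"

  def run(request, state) do
    splash_url = @splash_base <> "?" <> URI.encode_query(url: request.url, wait: 0.5)
    {%{request | url: splash_url}, state}
  end
end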

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

@oltarasenko I think we shouldn't put it in the middleware, since it shouldn't fundamentally change the request, but only the way that the data is fetched.

I think we can abstract out the response fetching to a configurable module.
Considering that the worker in lib/crawly/worker.ex fetches the response like so:
[screenshot of the response-fetching (get_response) step in lib/crawly/worker.ex]

What we can do is make the get_response function "pluggable" (not referring to Plug). This module would be responsible for converting a request into a response.

Thus, we could have a Fetcher protocol, with out-of-the-box support for FetchWithHTTPoison and FetchWithSplash, and typespecs that require an HTTPoison.Response struct to be returned.
I think Fetcher or something similar would be more intuitive and clearer than referring to it as an HTTP client (since "client" can refer to many things), and it allows for more flexibility if there are other protocols that users want to implement.
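A rough sketch of what that could look like as an Elixir behaviour (the fetch/1 callback name and the Splash endpoint are assumptions, not an agreed API):

defmodule Crawly.Fetcher do
  @callback fetch(Crawly.Request.t()) ::
              {:ok, HTTPoison.Response.t()} | {:error, HTTPoison.Error.t()}
end

defmodule FetchWithHTTPoison do
  @behaviour Crawly.Fetcher

  def fetch(request) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end

defmodule FetchWithSplash do
  @behaviour Crawly.Fetcher

  # Asks a local Splash instance to render the page, returning a normal
  # HTTPoison.Response so the rest of the pipeline stays unchanged.
  def fetch(request) do
    splash_url =
      "http://localhost:8050/render.html?" <> URI.encode_query(url: request.url, wait: 0.5)

    HTTPoison.get(splash_url, request.headers, request.options)
  end
end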

The fetcher could then be declared like so in the config:

config :crawly,
    ....
    fetcher: FetchWithSplash
    ....

Expanding on this idea, we can then allow configuration to be passed in (once PR #31 is complete), thereby allowing the configuration needed for issue #33:

config :crawly, 
    fetcher: {FetchWithHTTPoison, options: blah }

Let me know what you think. With this proposal, the received Request does not need to be modified, as it would have to be if implemented as a middleware.

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

I also think Docker would be unnecessary overhead, and would be even more for the end user to learn and set up.

@oltarasenko
Collaborator Author

I kind of like the idea of pluggable HTTP clients! Actually, this was already in the plans for Crawly. And I also agree about the middleware-based approach (i.e. I would rather skip it).

@Ziinc
Collaborator

Ziinc commented Dec 10, 2019

@oltarasenko yup, I saw that you'd added the line for HTTPoison in config.exs. I'll check through #32 today, then once it's merged in, I'll start on updating the docs for #31.

@oltarasenko
Collaborator Author

The example project is located here: https://github.com/oltarasenko/autosites
