
Pluggable fetchers #34

Merged: 4 commits into master, Dec 30, 2019

Conversation

@Ziinc (Collaborator) commented Dec 10, 2019

For #27 and #33

@oltarasenko marked this pull request as ready for review December 23, 2019 10:36
@oltarasenko (Collaborator)

@Ziinc I have updated it to the implementation I have in mind. For now, we don't take any per-spider configuration into account; I will address that separately.

@Ziinc mentioned this pull request Dec 25, 2019
@oltarasenko (Collaborator)

@Ziinc any feedback here? I want to merge it, as I also have a second PR depending on this one, namely the Splash fetcher.

@Ziinc (Collaborator, Author) left a review comment

Looks good to merge, besides the improvements mentioned. 👍

lib/crawly/fetchers/fetcher.ex (outdated)
body: term(),
headers: list(),
request: Crawly.Request.t(),
request_url: Crawly.Request.url(),
@Ziinc (Collaborator, Author)

Is the request_url field necessary? It is already in the Request map. For convenience?

@oltarasenko (Collaborator)

I think it's driven by HTTPoison. My understanding is that the request URL might be different, for example in the case of a redirect:

```
iex(1)> HTTPoison.get("http://meta.ua")
{:ok,
 %HTTPoison.Response{
   body: "<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx/1.14.0</center>\r\n</body>\r\n</html>\r\n",
   headers: [
     {"Server", "nginx/1.14.0"},
     {"Date", "Mon, 30 Dec 2019 13:00:44 GMT"},
     {"Content-Type", "text/html"},
     {"Content-Length", "185"},
     {"Connection", "keep-alive"},
     {"Location", "https://meta.ua/"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "http://meta.ua"
   },
   request_url: "http://meta.ua",
   status_code: 301
 }}
```

@Ziinc (Collaborator, Author) commented Dec 30, 2019

I did some digging; it turns out that HTTPoison mimics Python's requests library, such that the request_url key reflects the final requested URL, after modifications such as params are applied.

relevant HTTPoison issue:
edgurgel/httpoison#270

relevant python library:
https://2.python-requests.org//en/master/user/quickstart/#passing-parameters-in-urls

I think it would be good to add docs on what request_url is used for in the Crawly.Response struct, as it also applies to how fetchers should build the Response.
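
As an illustration only, such documentation could look roughly like the sketch below; the field names mirror the diff context above, while the @typedoc wording, defaults, and overall module layout are assumptions rather than the actual Crawly source:

```elixir
defmodule Crawly.Response do
  # Sketch only: field names taken from the diff above; docs and defaults are assumed.
  defstruct body: nil, headers: [], request: nil, request_url: nil

  @typedoc """
  `request` is the original `Crawly.Request.t()` as it was scheduled.

  `request_url` is the URL the fetcher actually requested, after
  modifications such as query params have been applied (mirroring
  HTTPoison's `request_url`), so it may differ from `request.url`.
  """
  @type t :: %__MODULE__{
          body: term(),
          headers: list(),
          request: Crawly.Request.t(),
          request_url: Crawly.Request.url()
        }
end
```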

lib/crawly/worker.ex (outdated)
Ziinc and others added 4 commits December 30, 2019 14:24
This commit introduces a Crawly.Fetchers.Fetcher behaviour and
an implementation based on HTTPoison.

Currently only the global config is taken into account (this part
is going to be switched to per-spider config soon).
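
For orientation, a behaviour of this kind could be declared roughly as in the sketch below; the callback name fetch/2 and its exact signature are assumptions made for illustration, not necessarily what this commit defines:

```elixir
defmodule Crawly.Fetchers.Fetcher do
  @moduledoc """
  Sketch of a pluggable fetcher contract: a fetcher takes a Crawly.Request
  plus its configured options and returns a Crawly.Response or an error.
  """

  # Hypothetical callback shape; the actual callback in this PR may differ.
  @callback fetch(request :: Crawly.Request.t(), options :: keyword()) ::
              {:ok, Crawly.Response.t()} | {:error, term()}
end
```

An HTTPoison-based implementation would then mostly delegate to HTTPoison.get/3 (or HTTPoison.request/5) and map the result onto a Crawly.Response.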
@oltarasenko (Collaborator)

@Ziinc I have addressed your comments in the most recent commit.

@Ziinc (Collaborator, Author) left a review comment

Looks good to merge. The docs on the request_url can be moved to a separate issue, or to #27

body: term(),
headers: list(),
request: Crawly.Request.t(),
request_url: Crawly.Request.url(),
Copy link
Collaborator Author

@Ziinc Ziinc Dec 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some digging, turns out that HTTPoison mimics python's request library, such that the request_url key reflects the final requested url made, after applying modifications such as params.

relevant HTTPoison issue:
edgurgel/httpoison#270

relevant python library:
https://2.python-requests.org//en/master/user/quickstart/#passing-parameters-in-urls

I think it will be good to add in docs on what request_url is used for in the Crawly.Response struct, as it will also apply to how Fetchers should output the Response.

@Ziinc (Collaborator, Author) commented Dec 30, 2019

Oh, just thought of another point on the tuple-based config passing.

I think that when introducing beginner users we should use the plain module (atom) based definition, while the tuple-based configuration should be reserved for more advanced users.

This is because, from a beginner's point of view, the default options are more than enough to get up and running, while more advanced users will of course want to tweak them. Having to learn about tuple-based options adds unnecessary friction when all a beginner wants is to try it out.

So, for example, when declaring the fetcher in the tutorials, we should use:

config :crawly,
  fetcher: Crawly.Fetchers.HTTPoisonFetcher

while also allowing advanced users to use:

    fetcher: {Crawly.Fetchers.HTTPoisonFetcher, [my: :options]}

It will mean having to create a util function to standardize the tuple-based config unwrapping across the codebase (a possible shape is sketched below).

For now, we can merge this PR first (since it blocks other issues), and create a separate issue for this.
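
For reference, a minimal sketch of such an unwrapping helper, assuming the config value is either a bare module or a {module, options} tuple; the module and function names below are hypothetical and not part of this PR:

```elixir
defmodule Crawly.Utils do
  @doc """
  Normalizes a pluggable-component config entry so the rest of the
  codebase always works with a `{module, options}` tuple.
  """
  @spec unwrap_module_and_options(module() | {module(), keyword()}) ::
          {module(), keyword()}
  def unwrap_module_and_options({module, options})
      when is_atom(module) and is_list(options),
      do: {module, options}

  def unwrap_module_and_options(module) when is_atom(module),
    do: {module, []}
end
```

With that, whatever Application.get_env(:crawly, :fetcher, Crawly.Fetchers.HTTPoisonFetcher) returns can be passed through the helper, regardless of which form the user chose.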

@oltarasenko merged commit d57a1b9 into master Dec 30, 2019
@oltarasenko deleted the pluggable-fetchers branch January 13, 2020 09:44