Pluggable fetchers #34
Conversation
Force-pushed from 995af30 to 1e9769a
@Ziinc I have updated it to the implementation I have in mind. For now we don't take any sort of per-spider configuration into account; I will address that issue separately.
@Ziinc any feedback here? I want to merge it, as I also have the second PR ready on top of it, namely the Splash fetcher.
Looks good to merge, besides the improvements mentioned. 👍
```elixir
body: term(),
headers: list(),
request: Crawly.Request.t(),
request_url: Crawly.Request.url(),
```
Is the request_url field necessary? It is already in the Request map. For convenience?
I think it's driven by HTTPoison. My understanding is that the request URL might differ from the response URL, for example in the case of a redirect:
```elixir
iex(1)> HTTPoison.get("http://meta.ua")
{:ok,
 %HTTPoison.Response{
   body: "<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx/1.14.0</center>\r\n</body>\r\n</html>\r\n",
   headers: [
     {"Server", "nginx/1.14.0"},
     {"Date", "Mon, 30 Dec 2019 13:00:44 GMT"},
     {"Content-Type", "text/html"},
     {"Content-Length", "185"},
     {"Connection", "keep-alive"},
     {"Location", "https://meta.ua/"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "http://meta.ua"
   },
   request_url: "http://meta.ua",
   status_code: 301
 }}
```
I did some digging; it turns out that HTTPoison mimics Python's requests library, such that the request_url key reflects the final requested URL, after modifications such as params are applied.
Relevant HTTPoison issue: edgurgel/httpoison#270
Relevant Python library docs: https://2.python-requests.org//en/master/user/quickstart/#passing-parameters-in-urls
I think it would be good to document what request_url is used for in the Crawly.Response struct, as it will also apply to how fetchers should populate the Response.
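The distinction being discussed can be made concrete with a small sketch. This is an illustration of my understanding of HTTPoison's behavior (per issue #270 above), not code from the PR; it assumes an HTTPoison dependency and uses httpbin.org as a stand-in endpoint:

```elixir
# When the :params option is given, HTTPoison merges it into the URL before
# sending the request. The URL as originally passed stays in
# response.request.url (with params kept separately in request.params),
# while response.request_url carries the final URL that was actually
# requested -- mirroring python-requests' r.url.
{:ok, response} = HTTPoison.get("https://httpbin.org/get", [], params: [key: "value"])

response.request.url  # the URL as originally passed
response.request_url  # the final URL, including the merged query string
```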
Current commit introduces a Crawly.Fetchers.Fetcher behaviour and an implementation based on HTTPoison. Currently only the global config is taken into account (however this part is going to be switched to per-spider config soon).
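For context, a behaviour along these lines is what the commit describes; the callback name, arity, and return shape below are assumptions for illustration, not copied from the diff:

```elixir
defmodule Crawly.Fetchers.Fetcher do
  @moduledoc """
  A behaviour for pluggable fetchers: any module that can turn a
  Crawly.Request into a Crawly.Response can be plugged in via config.
  """

  # Callback name and option handling are assumed, not taken from the PR.
  @callback fetch(request :: Crawly.Request.t(), options :: keyword()) ::
              {:ok, Crawly.Response.t()} | {:error, term()}
end
```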
Force-pushed from 2aacd25 to abee884
@Ziinc I have addressed your comments in the most recent commit.
Looks good to merge. The docs on the request_url can be moved to a separate issue, or to #27
Oh, just thought of another point on the tuple-based config passing. I think that for beginner users we should use the normal atom-based definitions, while the tuple-based configuration should be allowed for more advanced users only. From a beginner's point of view, the default options are more than enough to get up and running, while more advanced users will of course want to tweak the options; having to learn about tuple-based options just to try the library out adds unnecessary friction. So, for example, when declaring the fetcher in the tutorials, we should use:

```elixir
config :crawly,
  fetcher: Crawly.Fetchers.HTTPoisonFetcher
```

while also allowing advanced users to use:

```elixir
config :crawly,
  fetcher: {Crawly.Fetchers.HTTPoisonFetcher, [my: options]}
```

It will mean having to create a util function to standardize the tuple-based config unwrapping across the codebase. For now, we can merge this PR first (since it blocks other issues), and create a separate issue for this.
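Such a util function could be a small two-clause normalizer; the module and function names below are hypothetical, just to show the shape of the unwrapping:

```elixir
defmodule Crawly.ConfigUtils do
  # Hypothetical helper (not part of this PR): accept either `MyFetcher`
  # or `{MyFetcher, opts}` and always return a `{module, options}` pair,
  # so the rest of the codebase never branches on the config shape.
  @spec unwrap_module_and_options(module() | {module(), keyword()}) ::
          {module(), keyword()}
  def unwrap_module_and_options({module, options})
      when is_atom(module) and is_list(options),
      do: {module, options}

  def unwrap_module_and_options(module) when is_atom(module),
    do: {module, []}
end
```

Call sites would then do `{fetcher, options} = Crawly.ConfigUtils.unwrap_module_and_options(Application.get_env(:crawly, :fetcher))` regardless of which form the user configured.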
For #27 and #33