
Pluggable fetchers #34

Merged: 4 commits into master, Dec 30, 2019

Conversation

@Ziinc (Collaborator) commented Dec 10, 2019

For #27 and #33

@oltarasenko marked this pull request as ready for review December 23, 2019 10:36
@oltarasenko (Collaborator)

@Ziinc I have updated it to the implementation I have in mind. For now, we don't take any per-spider configuration into account; I will address that separately.

@Ziinc mentioned this pull request Dec 25, 2019
@oltarasenko (Collaborator)

@Ziinc any feedback here? I want to merge it, as I also have a second PR depending on this one, namely the Splash fetcher.

@Ziinc (Collaborator, Author) left a review comment

Looks good to merge, besides the improvements mentioned. 👍

lib/crawly/fetchers/fetcher.ex (outdated)
body: term(),
headers: list(),
request: Crawly.Request.t(),
request_url: Crawly.Request.url(),
@Ziinc (Collaborator, Author)

Is the request_url field necessary? It is already in the Request map. For convenience?

@oltarasenko (Collaborator)

I think it's driven by HTTPoison. My understanding is that the request URL might be different, for example in the case of a redirect:

```
iex(1)> HTTPoison.get("http://meta.ua")
{:ok,
 %HTTPoison.Response{
   body: "<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx/1.14.0</center>\r\n</body>\r\n</html>\r\n",
   headers: [
     {"Server", "nginx/1.14.0"},
     {"Date", "Mon, 30 Dec 2019 13:00:44 GMT"},
     {"Content-Type", "text/html"},
     {"Content-Length", "185"},
     {"Connection", "keep-alive"},
     {"Location", "https://meta.ua/"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "http://meta.ua"
   },
   request_url: "http://meta.ua",
   status_code: 301
 }}
```

@Ziinc (Collaborator, Author) commented Dec 30, 2019

I did some digging; it turns out that HTTPoison mimics Python's requests library, such that the request_url key reflects the final requested URL, after modifications such as params are applied.

relevant HTTPoison issue:
edgurgel/httpoison#270

relevant python library:
https://2.python-requests.org//en/master/user/quickstart/#passing-parameters-in-urls

I think it would be good to add docs on what request_url is used for in the Crawly.Response struct, as it also applies to how fetchers should build the Response.
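
As an illustration only, such documentation could look roughly like the sketch below; the field names mirror the diff context above, while the @typedoc wording, defaults, and overall module layout are assumptions rather than the actual Crawly source:

```elixir
defmodule Crawly.Response do
  # Sketch only: field names taken from the diff above; docs and defaults are assumed.
  defstruct body: nil, headers: [], request: nil, request_url: nil

  @typedoc """
  `request` is the original `Crawly.Request.t()` as it was scheduled.

  `request_url` is the URL the fetcher actually requested, after
  modifications such as query params have been applied (mirroring
  HTTPoison's `request_url`), so it may differ from `request.url`.
  """
  @type t :: %__MODULE__{
          body: term(),
          headers: list(),
          request: Crawly.Request.t(),
          request_url: Crawly.Request.url()
        }
end
```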

lib/crawly/worker.ex (outdated)
Ziinc and others added 4 commits December 30, 2019 14:24
This commit introduces a Crawly.Fetchers.Fetcher behaviour and
an implementation based on HTTPoison.

Currently only the global config is taken into account (this part
is going to be switched to per-spider config soon).
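
For orientation, a behaviour of this kind could be declared roughly as in the sketch below; the callback name fetch/2 and its exact signature are assumptions made for illustration, not necessarily what this commit defines:

```elixir
defmodule Crawly.Fetchers.Fetcher do
  @moduledoc """
  Sketch of a pluggable fetcher contract: a fetcher takes a Crawly.Request
  plus its configured options and returns a Crawly.Response or an error.
  """

  # Hypothetical callback shape; the actual callback in this PR may differ.
  @callback fetch(request :: Crawly.Request.t(), options :: keyword()) ::
              {:ok, Crawly.Response.t()} | {:error, term()}
end
```

An HTTPoison-based implementation would then mostly delegate to HTTPoison.get/3 (or HTTPoison.request/5) and map the result onto a Crawly.Response.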
@oltarasenko (Collaborator)

@Ziinc I have addressed your comments in the most recent commit.

@Ziinc (Collaborator, Author) left a review comment

Looks good to merge. The docs on the request_url can be moved to a separate issue, or to #27

body: term(),
headers: list(),
request: Crawly.Request.t(),
request_url: Crawly.Request.url(),
Copy link
Collaborator Author

@Ziinc Ziinc Dec 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some digging, turns out that HTTPoison mimics python's request library, such that the request_url key reflects the final requested url made, after applying modifications such as params.

relevant HTTPoison issue:
edgurgel/httpoison#270

relevant python library:
https://2.python-requests.org//en/master/user/quickstart/#passing-parameters-in-urls

I think it will be good to add in docs on what request_url is used for in the Crawly.Response struct, as it will also apply to how Fetchers should output the Response.

@Ziinc (Collaborator, Author) commented Dec 30, 2019

Oh, just thought of another point on the tuple-based config passing.

I think that when introducing beginner users we should use the plain module (atom) based definition, while the tuple-based configuration should be reserved for more advanced users.

This is because, from a beginner's point of view, the default options are more than enough to get up and running, while more advanced users will of course want to tweak them. Having to learn about tuple-based options adds unnecessary friction when all a beginner wants is to try it out.

So, for example, when declaring the fetcher in the tutorials, we should use:

config :crawly,
  fetcher: Crawly.Fetchers.HTTPoisonFetcher

while also allowing advanced users to use:

    fetcher: {Crawly.Fetchers.HTTPoisonFetcher, [my: :options]}

It will mean having to create a util function to standardize the tuple-based config unwrapping across the codebase (a possible shape is sketched below).

For now, we can merge this PR first (since it blocks other issues), and create a separate issue for this.
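
For reference, a minimal sketch of such an unwrapping helper, assuming the config value is either a bare module or a {module, options} tuple; the module and function names below are hypothetical and not part of this PR:

```elixir
defmodule Crawly.Utils do
  @doc """
  Normalizes a pluggable-component config entry so the rest of the
  codebase always works with a `{module, options}` tuple.
  """
  @spec unwrap_module_and_options(module() | {module(), keyword()}) ::
          {module(), keyword()}
  def unwrap_module_and_options({module, options})
      when is_atom(module) and is_list(options),
      do: {module, options}

  def unwrap_module_and_options(module) when is_atom(module),
    do: {module, []}
end
```

With that, whatever Application.get_env(:crawly, :fetcher, Crawly.Fetchers.HTTPoisonFetcher) returns can be passed through the helper, regardless of which form the user chose.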

@oltarasenko merged commit d57a1b9 into master Dec 30, 2019
@oltarasenko deleted the pluggable-fetchers branch January 13, 2020 09:44