Switch Crawly.fetch to use the plugged http client #51

Merged
merged 1 commit into master from fix_crawly_fetch on Jan 14, 2020

Conversation

oltarasenko (Collaborator)

No description provided.

@Ziinc (Collaborator) commented Jan 13, 2020

I think that since Crawly.fetch/1 is used for debugging/REPL work, it would be useful for it to simulate the request being piped through the declared middlewares before being fetched.
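Roughly something like this, just as a sketch (the `:middlewares` config key and the `run/2` middleware contract here are my assumptions, not code from this PR):

```elixir
# Rough sketch, not from this PR: run the request through the configured
# middlewares first, then fetch it. Config key and contract are assumed.
def fetch(url) do
  request = %Crawly.Request{url: url, headers: [], options: []}
  middlewares = Application.get_env(:crawly, :middlewares, [])

  result =
    Enum.reduce_while(middlewares, {request, %{}}, fn middleware, {req, state} ->
      case middleware.run(req, state) do
        # a middleware dropped the request, stop the pipeline
        {false, state} -> {:halt, {false, state}}
        {req, state} -> {:cont, {req, state}}
      end
    end)

  case result do
    {false, _state} -> {:error, :dropped_by_middleware}
    {req, _state} -> HTTPoison.get(req.url, req.headers, req.options)
  end
end
```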

Another thing: it just came to my attention that the fetcher is called with fetch/2. Thinking aloud here, would it be better to use the same/similar behaviour as pipelines for fetchers, so that run/2 or run/3 is called instead? It's just a semantics thing, but I think it helps keep the idea of "everything is a pipeline". The state passed to the fetcher could be the worker state or the Crawly engine state.
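In behaviour terms, I'm imagining something like this (typespecs are guesses on my part, not the actual Crawly code):

```elixir
# Sketch of the idea: fetchers share the pipeline-style run/2 contract,
# taking a request plus some state and returning a response plus state.
defmodule Crawly.Fetchers.Fetcher do
  @callback run(request :: Crawly.Request.t(), state :: map()) ::
              {response :: HTTPoison.Response.t(), state :: map()}
end
```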

The code itself looks fine, just wanna discuss this further

@oltarasenko (Collaborator, Author)

Hey @Ziinc

> I think that since Crawly.fetch/1 is used for debugging/REPL work, it would be useful for it to simulate the request being piped through the declared middlewares before being fetched.

It can be done, of course. But maybe we could make it similar to Scrapy's crawl?
E.g. I am thinking of a command Crawly.crawl(url) which would fetch a page and show what's extracted (including middlewares/item pipelines), using a given spider. (One of the complex parts here is looking up a spider by URL.)
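Purely as a sketch of the idea (list_spiders/0 and run_through_pipelines/2 are hypothetical helpers here; only base_url/0 is the real spider callback):

```elixir
# Hypothetical Crawly.crawl/1: find a spider whose base_url matches the
# given URL, then run the full fetch + middlewares + parse + item
# pipelines path for that one page.
def crawl(url) do
  spider =
    Enum.find(list_spiders(), fn spider ->
      String.starts_with?(url, spider.base_url())
    end)

  case spider do
    nil -> {:error, :no_spider_for_url}
    # run_through_pipelines/2 is a hypothetical helper, not real code
    spider -> run_through_pipelines(url, spider)
  end
end
```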

> Another thing: it just came to my attention that the fetcher is called with fetch/2. Thinking aloud here, would it be better to use the same/similar behaviour as pipelines for fetchers, so that run/2 or run/3 is called instead? It's just a semantics thing, but I think it helps keep the idea of "everything is a pipeline". The state passed to the fetcher could be the worker state or the Crawly engine state.

I was even thinking of defining the fetcher as a middleware. However, what stops me is the fact that fetchers have to be able to perform GET/POST requests at the end of the pipeline, so calling it run might not be semantically correct. Let's see, though. I am in the middle of adding the first unusual fetcher; let's see if we can make it more generic afterwards.
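For reference, the current shape is roughly this (simplified, and the return spec is my guess):

```elixir
# fetch/2 takes the request plus fetcher-specific client options, which
# is why a generic run/2 name feels slightly off for a callback that
# must issue real GET/POST calls.
defmodule Crawly.Fetchers.Fetcher do
  @callback fetch(request :: Crawly.Request.t(), client_options :: Keyword.t()) ::
              {:ok, HTTPoison.Response.t()} | {:error, term()}
end
```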

> The code itself looks fine, just wanna discuss this further

I think we could raise a ticket for part 1 (i.e. a Crawly.crawl/1 or crawl_url/1 command).

@oltarasenko
Copy link
Collaborator Author

I have created a separate issue for the comment above, merging this code now.

@oltarasenko merged commit ff4f1b4 into master on Jan 14, 2020
@oltarasenko deleted the fix_crawly_fetch branch on January 14, 2020 12:10