Switch Crawly.fetch to use the plugged http client #51

Merged
merged 1 commit into master from fix_crawly_fetch on Jan 14, 2020

Conversation

oltarasenko (Collaborator)

No description provided.

@Ziinc (Collaborator) commented Jan 13, 2020

I think that since Crawly.fetch/1 is used for debugging/REPL work, it would be useful for it to simulate the request being piped through the declared middlewares before being fetched.
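Roughly something like this, just as a sketch (the `:middlewares` config key and the `run/2` middleware contract here are my assumptions, not code from this PR):

```elixir
# Rough sketch, not from this PR: run the request through the configured
# middlewares first, then fetch it. Config key and contract are assumed.
def fetch(url) do
  request = %Crawly.Request{url: url, headers: [], options: []}
  middlewares = Application.get_env(:crawly, :middlewares, [])

  result =
    Enum.reduce_while(middlewares, {request, %{}}, fn middleware, {req, state} ->
      case middleware.run(req, state) do
        # a middleware dropped the request, stop the pipeline
        {false, state} -> {:halt, {false, state}}
        {req, state} -> {:cont, {req, state}}
      end
    end)

  case result do
    {false, _state} -> {:error, :dropped_by_middleware}
    {req, _state} -> HTTPoison.get(req.url, req.headers, req.options)
  end
end
```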

Another thing: it just came to my attention that the fetcher is called with fetch/2. Thinking aloud here, would it be better to use the same/similar behaviour as pipelines for fetchers, so that run/2 or run/3 is called instead? It's just a semantics thing, but I think it helps keep the idea of "everything is a pipeline". The state passed to the fetcher could be the worker state or the Crawly engine state.
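In behaviour terms, I'm imagining something like this (typespecs are guesses on my part, not the actual Crawly code):

```elixir
# Sketch of the idea: fetchers share the pipeline-style run/2 contract,
# taking a request plus some state and returning a response plus state.
defmodule Crawly.Fetchers.Fetcher do
  @callback run(request :: Crawly.Request.t(), state :: map()) ::
              {response :: HTTPoison.Response.t(), state :: map()}
end
```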

The code itself looks fine, just wanna discuss this further

@oltarasenko (Collaborator, Author)

Hey @Ziinc

> I think that since Crawly.fetch/1 is used for debugging/REPL work, it would be useful for it to simulate the request being piped through the declared middlewares before being fetched.

It can be done, of course. But maybe we could make it similar to Scrapy's crawl?
E.g. I am thinking of a command Crawly.crawl(url) which would fetch a page and show what's extracted (including middlewares/item pipelines), using a given spider. (One of the complex parts here is looking up a spider by URL.)
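Purely as a sketch of the idea (list_spiders/0 and run_through_pipelines/2 are hypothetical helpers here; only base_url/0 is the real spider callback):

```elixir
# Hypothetical Crawly.crawl/1: find a spider whose base_url matches the
# given URL, then run the full fetch + middlewares + parse + item
# pipelines path for that one page.
def crawl(url) do
  spider =
    Enum.find(list_spiders(), fn spider ->
      String.starts_with?(url, spider.base_url())
    end)

  case spider do
    nil -> {:error, :no_spider_for_url}
    # run_through_pipelines/2 is a hypothetical helper, not real code
    spider -> run_through_pipelines(url, spider)
  end
end
```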

> Another thing: it just came to my attention that the fetcher is called with fetch/2. Thinking aloud here, would it be better to use the same/similar behaviour as pipelines for fetchers, so that run/2 or run/3 is called instead? It's just a semantics thing, but I think it helps keep the idea of "everything is a pipeline". The state passed to the fetcher could be the worker state or the Crawly engine state.

I was even thinking of defining the fetcher as a middleware. However, what stops me is the fact that fetchers have to be able to perform GET/POST requests at the end of the pipeline, so calling it run might not be semantically correct. Let's see, though. I am in the middle of adding the first unusual fetcher; let's see if we can make it more generic afterwards.
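For reference, the current shape is roughly this (simplified, and the return spec is my guess):

```elixir
# fetch/2 takes the request plus fetcher-specific client options, which
# is why a generic run/2 name feels slightly off for a callback that
# must issue real GET/POST calls.
defmodule Crawly.Fetchers.Fetcher do
  @callback fetch(request :: Crawly.Request.t(), client_options :: Keyword.t()) ::
              {:ok, HTTPoison.Response.t()} | {:error, term()}
end
```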

> The code itself looks fine, just wanna discuss this further

I think we could raise a ticket for part 1 (i.e. a Crawly.crawl/1 or crawl_url/1 command).

@oltarasenko
Copy link
Collaborator Author

I have created a separate issue for the comment above, merging this code now.

@oltarasenko merged commit ff4f1b4 into master on Jan 14, 2020
@oltarasenko deleted the fix_crawly_fetch branch on January 14, 2020 12:10