Parse pipelines #150
Conversation
Implemented parsers through a settings declaration. I'm thinking of eventually passing the spider's init arguments to …
I like the idea in general; however, at this point I can't clearly see how these Parsers are going to be added (e.g. I don't see a good way to explain it to others), so maybe I will request examples here as well.
Also, as I understand it, parsers are a set of rules that populate both requests and items in a distributed way. For example, a parser may produce multiple items on step 1, and it's not clear to me how to write the next parser so it appends fields to each of those items on step 2 (e.g. I can't see how parsers are going to find the related item ids).
Regarding CrawlyUI:
One of the problems I am trying to solve (maybe this PR can help with that) is that one spider may define several templates for page extraction. In those cases I need to choose the parsers that can produce more items (or items with the largest number of fields). So at least for CrawlyUI I need something like a competition of parsers (the best one wins).
Otherwise, comments to the code are quite minor.
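The "competition of parsers" idea could be sketched roughly as follows. This is only an illustration, not Crawly's API: the module name, the `run/1` parser interface, and the `%{items: ...}` result shape are all invented for the example.

```elixir
# Hypothetical sketch: run every candidate parser over the same response
# and keep the result that extracts the most items, breaking ties by the
# total number of fields across items. Names here are illustrative only.
defmodule CrawlyUI.ParserCompetition do
  def best_result(parsers, response) do
    parsers
    |> Enum.map(fn parser -> parser.run(response) end)
    |> Enum.max_by(fn %{items: items} ->
      {length(items), items |> Enum.map(&map_size/1) |> Enum.sum()}
    end)
  end
end
```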
lib/crawly/worker.ex (outdated)
        spider_name: spider_name,
        response: response
      }) do
    {false, _} ->
Do we drop the entire item if one of the parsers returns false?
Will add logging for now. Maybe we can add an :on_parsed_item_drop_callback later.
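A minimal sketch of the drop-with-logging idea, assuming the worker pipes the parsed data through the configured parsers with `Crawly.Utils.pipe/3` (the exact call site and variable names in this PR may differ):

```elixir
# Sketch only: when any parser in the chain returns false, log the drop
# instead of failing silently. `parsers`, `state`, and `process/2` are
# assumptions standing in for the worker's real context.
case Crawly.Utils.pipe(parsers, %Crawly.ParsedItem{items: [], requests: []}, state) do
  {false, _state} ->
    Logger.info("Parsed item dropped by a parser for #{inspect(spider_name)}")
    {:ok, state}

  {parsed_item, new_state} ->
    process(parsed_item, new_state)
end
```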
I understand what you mean. I didn't add docs on how to use it, as I am myself still thinking of the best way to explain the parsers. The main point of this feature is to allow logic reuse, which becomes key when I implement runtime spiders, as different sites will have the same extraction process but different extraction rules (e.g. different xpaths). Parser A and Parser B may extract and append more requests but have different extraction rules, while Parser C and Parser D may extract different item types (like blog articles and comments). When you refer to incremental data extraction (like extracting data and adding it to a map), that level of thinking belongs to an individual parser, where the parser receives parameters (extraction rules for different items) and implements the logic (append to the …). Examples make everything clearer:
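The example that originally followed this comment did not survive the page extraction. A hedged sketch of the A/B/C/D split described above, with invented module names and option keys, assuming each parser is declared as a `{module, opts}` tuple in the settings:

```elixir
# Invented names for illustration only. Each parser appends to a shared
# parsed result as it flows through the pipeline.
parsers = [
  # Parsers A and B: same request-extraction logic, different rules
  {MyApp.Parsers.ExtractLinks, xpath: "//a[@class='article']/@href"},
  {MyApp.Parsers.ExtractLinks, xpath: "//a[@class='next-page']/@href"},
  # Parsers C and D: different item types from the same page
  {MyApp.Parsers.ExtractItems, type: :blog_article},
  {MyApp.Parsers.ExtractItems, type: :comment}
]
```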
And imagine if we could parameterize all of this with runtime spiders 🤯 Creating spiders would be so quick and easy, with hardly any custom code at all. And even if there were custom code, it could be abstracted and reused.
@Ziinc As I have said, I like the idea; however, it might be complex for people, so we need good documentation and examples. Maybe I will write an article about it (or maybe you can). It would be nice if it could also help me solve my problem with CrawlyUI, though that may be done at a later stage. For some reason, to me this looks like item_loaders in scrapy.
@Ziinc I have finally built some vision of how we may evolve the parsers idea. I can take over this PR (e.g. in a separate branch) to demonstrate it, if you are short on time at the moment.
@oltarasenko What do you mean by further evolution? I will be able to clean up the PR with docs for review today, within a few hours.
Oh, you asked me to refactor the general-purpose extractor, so I cherry-picked some of your code and, as often happens, modified some parts. Please have a glance so we can discuss it.
@oltarasenko I removed the …
@oltarasenko Could we have this merged over the weekend?
Sorry, it's hard for me on weekends, as the kids are at home. I will have time during the week; let's aim for the beginning of the week.
lib/crawly/manager.ex (outdated)
@@ -55,6 +55,21 @@ defmodule Crawly.Manager do
  def init([spider_name, options]) do
    crawl_id = Keyword.get(options, :crawl_id)
    Logger.metadata(spider_name: spider_name, crawl_id: crawl_id)

    itemcount_limit =
+1
I think this PR implements everything we have discussed. I also wanted to ask if you could contribute a blog post describing how to use the feature, in the form of a short tutorial?
Parse pipelines (elixir-crawly#150)
This PR aims to introduce pipelines to the extraction process, allowing for extraction helper modules. It introduces a Parse struct to be piped through the extraction modules.
Considered a success if: … ParsedItem …
Further areas of exploration: …
Initial discussion here: #141
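As a rough illustration of the shape a parser could take under this design (module name, URL, and the assumption that parsers follow the `Crawly.Pipeline` behaviour with a `run/3` callback threading a `Crawly.ParsedItem` are all invented for the sketch):

```elixir
# Hedged sketch of a single parser module in the proposed pipeline style.
defmodule MyApp.Parsers.ExtractLinks do
  @behaviour Crawly.Pipeline

  # Appends follow-up requests to the parsed result and passes it on
  # to the next parser in the chain.
  @impl Crawly.Pipeline
  def run(%Crawly.ParsedItem{} = parsed, state, _opts \\ []) do
    requests =
      Enum.map(["https://example.com/page/2"], &Crawly.Utils.request_from_url/1)

    {%{parsed | requests: parsed.requests ++ requests}, state}
  end
end
```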