
Parse pipelines #150

Merged: 11 commits merged into master on Jan 26, 2021

Conversation

@Ziinc Ziinc commented Dec 10, 2020

This PR aims to introduce pipelines to the extraction process, allowing for extraction helper modules. It introduces a Parse struct to be piped through the extraction modules.

Considered success if:

  1. Extraction modules can implement pipeline behaviour and be used to pipe a ParsedItem
  2. Extraction modules can be declared in config, at global and spider level.

Further areas of exploration:

  • parameterizing extraction rules through runtime spider config

Initial discussion here: #141
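
For concreteness, a parser under this proposal could be an ordinary pipeline module: it receives the accumulated parsed item together with the worker state (spider name, response) and returns it with more requests or items appended, or false to drop it. Below is a minimal sketch, assuming the Crawly.Pipeline run/3 contract and Floki for extraction; the module name and options are illustrative, not part of this PR:

defmodule MyApp.Parsers.ExtractLinks do
  @moduledoc """
  Illustrative parser sketch: extracts links from the response and appends
  them to the parsed item's requests. Module name and options are
  assumptions, not the API introduced by this PR.
  """
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a")

    requests =
      response.body
      |> Floki.parse_document!()
      |> Floki.attribute(selector, "href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end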

Ziinc commented Dec 24, 2020

Implemented parsers through settings declaration.

I'm thinking of eventually passing the spider's init arguments to override_settings, so that the parser options can be parameterized.
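
Roughly, such a settings declaration might look like the following sketch; the parser module and spider are placeholders, and the spider-level part assumes the override_settings/0 callback mentioned above:

# config/config.exs: hypothetical global parser declaration for all spiders
import Config

config :crawly,
  parsers: [
    {MyApp.Parsers.ExtractLinks, selector: "a"}
  ]

# Spider-level override; the spider's init arguments could later feed these options
defmodule MyApp.BlogSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/blog"]]

  @impl Crawly.Spider
  def parse_item(_response), do: %{items: [], requests: []}

  @impl Crawly.Spider
  def override_settings() do
    [parsers: [{MyApp.Parsers.ExtractLinks, selector: "article a"}]]
  end
end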

@Ziinc Ziinc marked this pull request as ready for review December 24, 2020 11:34

@oltarasenko oltarasenko left a comment

I like the idea in general; however, at this point I can't clearly see how these Parsers are going to be added (e.g. I don't see a good way to explain it to others), so maybe I will request examples here as well.

Also, as I understand it, parsers are a set of rules which populate both requests and items in a distributed way; for example, a parser may produce multiple items on step 1. So it's not clear to me how to write the next parser so that it appends fields to each of those items on step 2 (e.g. I can't see how parsers are going to find the related item ids).

Regarding CrawlyUI:
One of the problems I am trying to solve (maybe this PR can help with that) is that one spider may define a couple of templates for page extraction, and in these cases I need to choose the parser that produces more items (or items with the largest number of fields). So at least for CrawlyUI I need something like a competition of parsers (the best one wins).

Otherwise, comments to the code are quite minor.

Outdated review threads (resolved): test/worker_test.exs (two threads), lib/crawly/spider.ex

Review thread on lib/crawly/worker.ex (outdated, resolved):

    spider_name: spider_name,
    response: response
  }) do
    {false, _} ->

@oltarasenko (Collaborator): Do we drop the entire item if one of the parsers returns false?

@Ziinc (Collaborator Author): Will add in logging for now. Maybe we can add an :on_parsed_item_drop_callback.
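
For reference, here is a minimal sketch of the worker-side branch under discussion, assuming parsers are piped with Crawly.Utils.pipe/3 and that a false result drops the parsed item; the wrapper module and log message are illustrative only:

defmodule ParserDropSketch do
  require Logger

  # Sketch only: pipe the parsed item through the configured parsers and log
  # when one of them returns false, i.e. when the whole item is dropped.
  def run_parsers(parsers, parsed_item, spider_name, response) do
    state = %{spider_name: spider_name, response: response}

    case Crawly.Utils.pipe(parsers, parsed_item, state) do
      {false, _state} ->
        Logger.info("ParsedItem dropped by a parser for spider #{inspect(spider_name)}")
        :dropped

      {new_parsed_item, _state} ->
        {:ok, new_parsed_item}
    end
  end
end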


Ziinc commented Dec 25, 2020

I understand what you mean. I didn't add docs on how to use it, as I myself am still thinking of the best way to explain how to use the parsers.

The main point of this feature is to allow logic reuse, which becomes key when I implement runtime spiders, as different sites will have the same extraction process but simply different extraction rules (e.g. different xpaths). Parser A and Parser B may both extract and append more requests but with different extraction rules, while Parser C and Parser D may extract different item types (like blog articles and comments).

When you refer to incremental data extraction (like extracting data and adding it to a map), this level of thinking is more at the individual parser level, where the parser receives parameters (extraction rules for different items) and implements logic (append to the :items key the template that has the highest extracted count). This parser can then be reused for multiple spiders. In this case, you would technically be implementing parsers within parsers (performing multiple extractions).

Examples make everything clearer:

parsers: [
  {ExtractRequests, xpath: "//a[@class]"},
  {ExtractRequests, css: "a li"},
  # specify only one extraction rule
  {ExtractItems, strategy: :append, rule: %ScrapedProduct{id: "//h1"}},
  # specify multiple extraction rules, append by default
  {ExtractItems, rules: [%ScrapedBlogPost{id: "//h1"}, %ScrapedComment{body: "//p"}]},
  # custom logic, select items with most struct count
  FilterItemsByHighestCount
]

ExtractRequests and ExtractItems can be provided by Crawly (as you have done in #164), to make extraction even easier/faster.

And imagine if we could parameterize all of this with runtime spiders 🤯 creating spiders would be so quick and easy, with hardly any custom code at all. And even if there were custom code, it could be abstracted and re-used.
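
To illustrate the last entry in that example, a custom parser like FilterItemsByHighestCount might look roughly like this; a sketch assuming parsers follow the Crawly.Pipeline run/3 contract and that extracted items are structs:

defmodule MyApp.Parsers.FilterItemsByHighestCount do
  @moduledoc """
  Sketch of the custom parser named in the example above: keeps only the
  items of whichever struct type was extracted most often, so the best
  template wins. Illustrative only, not code from this PR.
  """
  @behaviour Crawly.Pipeline

  def run(parsed_item, state, opts \\ [])

  def run(%{items: []} = parsed_item, state, _opts), do: {parsed_item, state}

  def run(parsed_item, state, _opts) do
    {_winning_struct, best_items} =
      parsed_item.items
      |> Enum.group_by(& &1.__struct__)
      |> Enum.max_by(fn {_struct, items} -> length(items) end)

    {%{parsed_item | items: best_items}, state}
  end
end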

@oltarasenko

@Ziinc As I have said, I like the idea; however, it might be complex for people, so we need to have good documentation and examples. Also, maybe I will write an article about it (or maybe you can do it).

It would be nice if it could also help me solve my problem with CrawlyUI, however that may be done at a later stage.

For some reason, to me it looks like item_loaders in Scrapy.

@oltarasenko

@Ziinc I have finally built some vision of how we may evolve the parsers idea. I can take over this PR (e.g. in a separate branch) to demonstrate it, if you are short of time at the moment.

Ziinc commented Dec 30, 2020

@oltarasenko what do you mean by further evolution? I will be able to clean up the PR with docs for review today, within a few hours.

@oltarasenko

> @oltarasenko what do you mean by further evolution? I will be able to clean up the PR with docs for review today, within a few hours.

Oh, you asked me to refactor the general-purpose extractor, so I have cherry-picked some of your code and, as it often happens, modified some parts. Please have a glance so we can discuss it.

Ziinc commented Dec 30, 2020

@oltarasenko I removed the Parse struct, which would have prevented custom information from being placed in the parser's state. Added docs and rebased against master.

Ziinc commented Jan 23, 2021

@oltarasenko could we have this merged in over the weekend?

@oltarasenko

> @oltarasenko could we have this merged in over the weekend?

Sorry, it's hard for me on weekends, as the kids are at home. I will have time during the week; let's aim for the beginning of the week.

Review thread on the Crawly.Manager diff:

@@ -55,6 +55,21 @@ defmodule Crawly.Manager do
   def init([spider_name, options]) do
     crawl_id = Keyword.get(options, :crawl_id)
     Logger.metadata(spider_name: spider_name, crawl_id: crawl_id)

     itemcount_limit =

@oltarasenko (Collaborator): +1

@oltarasenko oltarasenko left a comment

I think this PR implements everything we have discussed. I also wanted to ask if you could contribute a blog post describing how to use the feature in the form of a short tutorial?

@Ziinc Ziinc merged commit 1ec03d7 into master Jan 26, 2021
oshosanya added a commit to oshosanya/crawly that referenced this pull request Jan 30, 2021