
Parse pipelines #150

Merged: 11 commits merged into master on Jan 26, 2021

Conversation

@Ziinc Ziinc commented Dec 10, 2020

This PR aims to introduce pipelines to the extraction process, allowing for extraction helper modules. It introduces a Parse struct to be piped through the extraction modules.

Considered success if:

  1. Extraction modules can implement pipeline behaviour and be used to pipe a ParsedItem
  2. Extraction modules can be declared in config, at global and spider level.

Further areas of exploration:

  • parameterizing extraction rules through runtime spider config

Initial discussion here: #141
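
For concreteness, a parser under this proposal could be an ordinary pipeline module: it receives the accumulated parsed item together with the worker state (spider name, response) and returns it with more requests or items appended, or false to drop it. Below is a minimal sketch, assuming the Crawly.Pipeline run/3 contract and Floki for extraction; the module name and options are illustrative, not part of this PR:

defmodule MyApp.Parsers.ExtractLinks do
  @moduledoc """
  Illustrative parser sketch: extracts links from the response and appends
  them to the parsed item's requests. Module name and options are
  assumptions, not the API introduced by this PR.
  """
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a")

    requests =
      response.body
      |> Floki.parse_document!()
      |> Floki.attribute(selector, "href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end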

Ziinc commented Dec 24, 2020

Implemented parsers through settings declaration.

I'm thinking of eventually passing the spider's init arguments to override_settings, so that the parser options can be parameterized.
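
Roughly, such a settings declaration might look like the following sketch; the parser module and spider are placeholders, and the spider-level part assumes the override_settings/0 callback mentioned above:

# config/config.exs: hypothetical global parser declaration for all spiders
import Config

config :crawly,
  parsers: [
    {MyApp.Parsers.ExtractLinks, selector: "a"}
  ]

# Spider-level override; the spider's init arguments could later feed these options
defmodule MyApp.BlogSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/blog"]]

  @impl Crawly.Spider
  def parse_item(_response), do: %{items: [], requests: []}

  @impl Crawly.Spider
  def override_settings() do
    [parsers: [{MyApp.Parsers.ExtractLinks, selector: "article a"}]]
  end
end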

@Ziinc Ziinc marked this pull request as ready for review December 24, 2020 11:34

@oltarasenko oltarasenko left a comment

I like the idea in general; however, at this point I can't clearly see how these Parsers are going to be added (e.g. I don't see a good way to explain it to others), so maybe I will request examples here as well.

Also, as I understand it, parsers are a set of rules which populate both requests and items in a distributed way; for example, a parser may produce multiple items on step 1. So it's not clear to me how to write the next parser so that it appends fields to each of those items on step 2 (e.g. I can't see how parsers are going to find the related item ids).

Regarding CrawlyUI:
One of the problems I am trying to solve (maybe this PR can help with that) is that one spider may define a couple of templates for page extraction, and in these cases I need to choose the parser that produces more items (or items with the largest number of fields). So at least for CrawlyUI I need something like a competition of parsers (the best one wins).

Otherwise, comments to the code are quite minor.

Outdated review threads (resolved): test/worker_test.exs (two threads), lib/crawly/spider.ex

Review thread on lib/crawly/worker.ex (outdated, resolved):

    spider_name: spider_name,
    response: response
  }) do
    {false, _} ->

@oltarasenko (Collaborator): Do we drop the entire item if one of the parsers returns false?

@Ziinc (Collaborator Author): Will add in logging for now. Maybe we can add an :on_parsed_item_drop_callback.
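
For reference, here is a minimal sketch of the worker-side branch under discussion, assuming parsers are piped with Crawly.Utils.pipe/3 and that a false result drops the parsed item; the wrapper module and log message are illustrative only:

defmodule ParserDropSketch do
  require Logger

  # Sketch only: pipe the parsed item through the configured parsers and log
  # when one of them returns false, i.e. when the whole item is dropped.
  def run_parsers(parsers, parsed_item, spider_name, response) do
    state = %{spider_name: spider_name, response: response}

    case Crawly.Utils.pipe(parsers, parsed_item, state) do
      {false, _state} ->
        Logger.info("ParsedItem dropped by a parser for spider #{inspect(spider_name)}")
        :dropped

      {new_parsed_item, _state} ->
        {:ok, new_parsed_item}
    end
  end
end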


Ziinc commented Dec 25, 2020

I understand what you mean. I didn't add docs on how to use it, as I myself am still thinking of the best way to explain how to use the parsers.

The main point of this feature is to allow logic reuse, which becomes key when I implement runtime spiders, as different sites will have the same extraction process but simply different extraction rules (e.g. different xpaths). Parser A and Parser B may both extract and append more requests but with different extraction rules, while Parser C and Parser D may extract different item types (like blog articles and comments).

When you refer to incremental data extraction (like extracting data and adding it to a map), this level of thinking is more at the individual parser level, where the parser receives parameters (extraction rules for different items) and implements logic (append to the :items key the template that has the highest extracted count). This parser can then be reused for multiple spiders. In this case, you would technically be implementing parsers within parsers (performing multiple extractions).

Examples make everything clearer:

parsers: [
  {ExtractRequests, xpath: "//a[@class]"},
  {ExtractRequests, css: "a li"},
  # specify only one extraction rule
  {ExtractItems, strategy: :append, rule: %ScrapedProduct{id: "//h1"}},
  # specify multiple extraction rules, append by default
  {ExtractItems, rules: [%ScrapedBlogPost{id: "//h1"}, %ScrapedComment{body: "//p"}]},
  # custom logic, select items with most struct count
  FilterItemsByHighestCount
]

ExtractRequests and ExtractItems can be provided by Crawly (as you have done in #164), to make extraction even easier/faster.

And imagine if we could parameterize all of this with runtime spiders 🤯 creating spiders would be so quick and easy, with hardly any custom code at all. And even if there were custom code, it could be abstracted and re-used.
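
To illustrate the last entry in that example, a custom parser like FilterItemsByHighestCount might look roughly like this; a sketch assuming parsers follow the Crawly.Pipeline run/3 contract and that extracted items are structs:

defmodule MyApp.Parsers.FilterItemsByHighestCount do
  @moduledoc """
  Sketch of the custom parser named in the example above: keeps only the
  items of whichever struct type was extracted most often, so the best
  template wins. Illustrative only, not code from this PR.
  """
  @behaviour Crawly.Pipeline

  def run(parsed_item, state, opts \\ [])

  def run(%{items: []} = parsed_item, state, _opts), do: {parsed_item, state}

  def run(parsed_item, state, _opts) do
    {_winning_struct, best_items} =
      parsed_item.items
      |> Enum.group_by(& &1.__struct__)
      |> Enum.max_by(fn {_struct, items} -> length(items) end)

    {%{parsed_item | items: best_items}, state}
  end
end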

@oltarasenko

@Ziinc As I have said, I like the idea; however, it might be complex for people, so we need to have good documentation and examples. Also, maybe I will write an article about it (or maybe you can do it).

It would be nice if it could also help me solve my problem with CrawlyUI, however that may be done at a later stage.

For some reason, to me it looks like item_loaders in Scrapy.

@oltarasenko

@Ziinc I have finally built some vision of how we may evolve the parsers idea. I can take over this PR (e.g. in a separate branch) to demonstrate it, if you are short of time at the moment.

Ziinc commented Dec 30, 2020

@oltarasenko what do you mean by further evolution? I will be able to clean up the PR with docs for review today, within a few hours.

@oltarasenko

> @oltarasenko what do you mean by further evolution? I will be able to clean up the PR with docs for review today, within a few hours.

Oh, you asked me to refactor the general-purpose extractor, so I have cherry-picked some of your code and, as it often happens, modified some parts. Please have a glance so we can discuss it.

Ziinc commented Dec 30, 2020

@oltarasenko I removed the Parse struct, which would have prevented custom information from being placed in the parser's state. Added docs and rebased against master.

Ziinc commented Jan 23, 2021

@oltarasenko could we have this merged in over the weekend?

@oltarasenko

> @oltarasenko could we have this merged in over the weekend?

Sorry, it's hard for me on weekends, as the kids are at home. I will have time during the week; let's aim for the beginning of the week.

Review thread on the Crawly.Manager diff:

@@ -55,6 +55,21 @@ defmodule Crawly.Manager do
   def init([spider_name, options]) do
     crawl_id = Keyword.get(options, :crawl_id)
     Logger.metadata(spider_name: spider_name, crawl_id: crawl_id)

     itemcount_limit =

@oltarasenko (Collaborator): +1

@oltarasenko oltarasenko left a comment

I think this PR implements everything we have discussed. I also wanted to ask if you could contribute a blog post describing how to use the feature in the form of a short tutorial?

@Ziinc Ziinc merged commit 1ec03d7 into master Jan 26, 2021
oshosanya added a commit to oshosanya/crawly that referenced this pull request Jan 30, 2021