
Figure out story for queries that span multiple sub-trees. #48

Closed
fimad opened this issue Oct 16, 2016 · 7 comments

Labels
enhancement, fixed at head (The issue has been addressed but the fix is not yet included in a released version of the library.)

fimad commented Oct 16, 2016

Right now scalpel makes the assumption that an HTML document is a tree (possibly a malformed one) and that scraping involves selecting one or more sub-trees and extracting (the same) data from each sub-tree.

There are use cases (#41, #45) that don't fit into this model but would be nice to support.

This issue is for brainstorming API/architecture changes that would support these types of queries.
#41 could be solved by extending selectors to allow jumping between sibling sub-trees, maybe something that looked like:

<p class="something">Here</p>
<p>Other stuff that matters</p>
chroot "p" @: [hasClass "something"] $ do
  here <- text AnyTag
  otherStuff <- text $ rightSibling "p"

The problem with this is that it doesn't really extend to more complicated scenarios like the one in #45, which involves collecting a sequence of siblings until a certain condition is met.

fimad commented Oct 16, 2016

It seems like both of these may be solvable if we were able to insert new fake nodes into the tag tree.

What if there were a group/groups scraper that took a list of selectors and created a new tree where the nodes matched by each selector in the list became children?

#41

<p class="something">Here</p>
<p>Other stuff that matters</p>
[
  "Here", 
  "Other stuff that matters"
] <- group ["p" @: [hasClass "something"], "p"] texts

#45

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>
[
  ("title2 1", ["text 1", "text 2"]),
  ("title2 2", ["text 3"]),
  ("title2 3", []),
] <- chroot "body" $ do
  groups ["h2", many "p"] $ do
    title <- text "h2"
    body  <- texts "p"
    return (title, body)

fimad modified the milestone: 0.4.1 Oct 17, 2016

3noch commented Jun 12, 2017

It'd also be very useful to refer to siblings that have naked text (i.e. not part of a subtree).

ocramz commented Feb 2, 2019

@3noch @fimad I think I just encountered one such edge case: I have a bit of HTML like the following:

<h2 class="...">
    <span class="...">
               .... 
    </span>
    info
</h2>

And I just need to retrieve info (or rather, ignore the span).

3noch commented Feb 4, 2019

The workaround for now is to chroot to the h2 and use position IIRC.

fimad commented Feb 9, 2019

Unfortunately I don't think position would help with that example since there is currently no way to select bare text nodes. One of the assumptions scalpel makes is that anything you'd want to select is between <tags>.

It's also not immediately clear how to expose bare text selection in a way that would be backwards compatible. My current thinking is to create an additional value for SelectNode for text nodes. That would let you do something like the following to grab the second text node under an <h2>:

chroot "h2" $ 
  chroots textSelector $ do
    p <- position
    guard (p == 1)
    text textSelector

With an API like the one proposed in #21 you could do something even more snazzy like: text ("h2" /// textSelector) to grab just the text nodes that are direct children of the <h2>.

The potential issue here though is that allowing selection of bare text nodes would create a breaking change in the behavior of anySelector. For example, scrapeStringLike "<a>text</a>" $ texts anySelector currently returns Just ["text"] but if we treated each text node as selectable then it would return Just ["text", "text"].

This might be an OK breaking change though since I think the most useful use of anySelector is to select the current root node in a chroot block, like the examples in the README.
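
As a point of reference, here is a minimal sketch (not taken from this thread) of that anySelector-as-current-root pattern; the div tag and comment class are made up for illustration:

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Each chroots iteration narrows the context to one matched <div>, and
-- anySelector then refers to that current root, so `text anySelector`
-- returns the text content of each matched div.
commentTexts :: Scraper String [String]
commentTexts = chroots ("div" @: [hasClass "comment"]) $ text anySelector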

fimad commented Feb 10, 2019

Was starting to look into this and I'm now thinking of trying to do the following:

Have some way to toggle the behavior of Scraper so that instead of starting from the current root, each subsequent scraper would only "see" the first child of the current context, and that child would be "consumed" on a match. This could be toggled with a new function serial :: Scraper a -> Scraper a.

#41

<p class="something">Here</p>
<p>Other stuff that matters</p>
("Here", "Other stuff that matters") <- serial $ do
  first <- text "p" @: [hasClass "something"]
  second <- text "p"
  (first, second)

#45

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>
[
  ("title2 1", ["text 1", "text 2"]),
  ("title2 2", ["text 3"]),
  ("title2 3", []),
] <- chroot "body" $ serial $ many $ do
    title <- text "h2"
    body <- many $ text "p"
    return (title, body)

One detail that still needs to be worked out is how to treat nested matches: should the next scraper pick up from the sibling of the root of the match or from the sibling of the leaf of the match? For example:

<a>
  <b>1</b>
  <c>2</c>
</a>
<c>3</c>
serial $ do
  "1" <- text "a" // "b"
  foo <- text "c"

Should foo be "2" or "3"?

Let's say you wanted to select all three values. Either behavior could support this:

Start from the sibling of the root of the match:

serial $ do
  ("1", "2") <- chroot "a" $ (,) <$> text "b" <*> text "c"
  "3" <- text "c"

Start from the sibling of the leaf of the match:

serial $ do
  "1" <- text "b"
  "2" <- text "c"
  "3" <- text "c"

Starting from the root is more verbose, but also requires the user to be more precise about the structure of the match.

fimad added a commit that referenced this issue Feb 13, 2019
A new selector, textSelector, is added that matches raw text nodes.
This can currently be used to capture floating text that is not directly
nested within other tags.

This is probably not super useful without the contextual awareness that
will be provided by #48.

Issue #70
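
As an illustration, a minimal sketch of how this could pick up the bare "info" text from the <h2> example earlier in the thread; the exact result is an assumption based on the description above rather than something verified here:

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Assumption: texts textSelector returns every raw text node under the
-- <h2>, including the bare "info" that no plain tag selector can reach.
bareTexts :: Maybe [String]
bareTexts = scrapeStringLike "<h2><span>ignored</span>info</h2>" $
    chroot "h2" $ texts textSelector
-- Expected under that assumption: Just ["ignored", "info"]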
fimad modified the milestone: 0.6.0 Feb 14, 2019
fimad added a commit that referenced this issue Feb 16, 2019
The approach utilizes a zipper-like API to focus on sibling nodes and
execute scrapers on the focused nodes. Regression tests have been added
that show that this solves the requirements in issues #41 and #45.

The documentation needs to be cleaned up a bit and I'm not quite happy
with the names `visitSerially` and `visitChildrenSerially`.

Also there is a bad interaction with the newly added textSelector that
makes stepNext/Back behave unexpectedly. Whitespace between tags counts
as nodes, which leads to unintuitive behavior.

Issue #48
fimad added the "fixed at head" label (The issue has been addressed but the fix is not yet included in a released version of the library.) Feb 17, 2019

fimad commented Feb 18, 2019

Supported in 0.6.0.

Ended up going with a solution similar to that proposed above where only the immediate children are visited.

The solution utilizes a new type SerialScraper where the user explicitly moves the cursor with functions like stepNext and seekNext.
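
As a sketch of what the #45 example from this thread looks like against that API (this assumes the inSerial, untilNext, and matches combinators in addition to stepNext/seekNext, and uses seekNext rather than stepNext so that whitespace text nodes between tags are skipped over):

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import Text.HTML.Scalpel

exampleHtml :: String
exampleHtml =
    "<body><h1>title1</h1>\
    \<h2>title2 1</h2><p>text 1</p><p>text 2</p>\
    \<h2>title2 2</h2><p>text 3</p>\
    \<h2>title2 3</h2></body>"

sections :: Maybe [(String, [String])]
sections = scrapeStringLike exampleHtml $
    chroot "body" $ inSerial $ do
        -- Skip past the leading <h1>.
        seekNext $ matches "h1"
        -- Scrape each <h2> together with every <p> up to the next <h2>.
        many $ do
            title <- seekNext $ text "h2"
            body  <- untilNext (matches "h2") (many $ seekNext $ text "p")
            return (title, body)
-- Expected under those assumptions:
-- Just [("title2 1",["text 1","text 2"]),("title2 2",["text 3"]),("title2 3",[])]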

fimad closed this as completed Feb 18, 2019