
Figure out story for queries that span multiple sub-trees. #48

Closed
fimad opened this issue Oct 16, 2016 · 7 comments

Labels
enhancement, fixed at head (The issue has been addressed but the fix is not yet included in a released version of the library.)

fimad commented Oct 16, 2016

Right now scalpel makes the assumption that an HTML document is a tree (possibly a malformed one) and that scraping involves selecting one or more sub-trees and extracting (the same) data from each sub-tree.

There are use cases (#41, #45) that don't fit into this model but would be nice to support.

This issue is for brainstorming API/architecture changes that would support these types of queries.
#41 could be solved by extending selectors to allow jumping between sibling sub-trees, maybe something that looked like:

<p class="something">Here</p>
<p>Other stuff that matters</p>
chroot "p" @: [hasClass "something"] $ do
  here <- text AnyTag
  otherStuff <- text $ rightSibling "p"

The problem with this is that it doesn't really extend to more complicated scenarios like the one in #45, which involves collecting a sequence of siblings until a certain condition is met.

fimad commented Oct 16, 2016

It seems like both of these may be solvable if we were able to insert new fake nodes into the tag tree.

What if there were a group/groups scraper that took a list of selectors and created a new tree where the nodes matched by each selector in the list became children?

#41

<p class="something">Here</p>
<p>Other stuff that matters</p>
[
  "Here", 
  "Other stuff that matters"
] <- group ["p" @: [hasClass "something"], "p"] texts

#45

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>
[
  ("title2 1", ["text 1", "text 2"]),
  ("title2 2", ["text 3"]),
  ("title2 3", []),
] <- chroot "body" $ do
  groups ["h2", many "p"] $ do
    title <- text "h2"
    body  <- texts "p"
    return (title, body)

fimad modified the milestone: 0.4.1 Oct 17, 2016

3noch commented Jun 12, 2017

It'd also be very useful to refer to siblings that have naked text (i.e. not part of a subtree).

ocramz commented Feb 2, 2019

@3noch @fimad I think I just encountered one such edge case: I have a bit of HTML like the following:

<h2 class="...">
    <span class="...">
               .... 
    </span>
    info
</h2>

And I just need to retrieve info (or rather, ignore the span).

3noch commented Feb 4, 2019

The workaround for now is to chroot to the h2 and use position IIRC.

fimad commented Feb 9, 2019

Unfortunately I don't think position would help with that example since there is currently no way to select bare text nodes. One of the assumptions scalpel makes is that anything you'd want to select is between <tags>.

It's also not immediately clear how to expose bare text selection in a way that would be backwards compatible. My current thinking is to create an additional value for SelectNode for text nodes. That would let you do something like the following to grab the second text node under an <h2>:

chroot "h2" $ 
  chroots textSelector $ do
    p <- position
    guard (p == 1)
    text textSelector

With an API like the one proposed in #21 you could do something even more snazzy like: text ("h2" /// textSelector) to grab just the text nodes that are direct children of the <h2>.

The potential issue here though is that allowing selection of bare text nodes would create a breaking change in the behavior of anySelector. For example, scrapeStringLike "<a>text</a>" $ texts anySelector currently returns Just ["text"] but if we treated each text node as selectable then it would return Just ["text", "text"].

This might be an OK breaking change though since I think the most useful use of anySelector is to select the current root node in a chroot block, like the examples in the README.
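
As a point of reference, here is a minimal sketch (not taken from this thread) of that anySelector-as-current-root pattern; the div tag and comment class are made up for illustration:

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Each chroots iteration narrows the context to one matched <div>, and
-- anySelector then refers to that current root, so `text anySelector`
-- returns the text content of each matched div.
commentTexts :: Scraper String [String]
commentTexts = chroots ("div" @: [hasClass "comment"]) $ text anySelector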

fimad commented Feb 10, 2019

Was starting to look into this and I'm now thinking of trying to do the following:

Have some way to toggle the behavior of Scraper so that instead of starting from the current root, each subsequent scraper would only "see" the first child of the current context, and that child would be "consumed" on a match. This could be toggled with a new function serial :: Scraper a -> Scraper a.

#41

<p class="something">Here</p>
<p>Other stuff that matters</p>
("Here", "Other stuff that matters") <- serial $ do
  first <- text "p" @: [hasClass "something"]
  second <- text "p"
  (first, second)

#45

<body>
  <h1>title1</h1>
  <h2>title2 1</h2>
  <p>text 1</p>
  <p>text 2</p>
  <h2>title2 2</h2>
  <p>text 3</p>
  <h2>title2 3</h2>
</body>
[
  ("title2 1", ["text 1", "text 2"]),
  ("title2 2", ["text 3"]),
  ("title2 3", []),
] <- chroot "body" $ serial $ many $ do
    title <- text "h2"
    body <- many $ text "p"
    return (title, body)

One detail that still needs to be worked out is how to treat nested matches: should the next scraper pick up from the sibling of the root of the match or from the sibling of the leaf of the match? For example:

<a>
  <b>1</b>
  <c>2</c>
</a>
<c>3</c>
serial $ do
  "1" <- text "a" // "b"
  foo <- text "c"

Should foo be "2" or "3"?

Let's say you wanted to select all three values. Either behavior could support this:

Start from the sibling of the root of the match:

serial $ do
  ("1", "2") <- chroot "a" $ (,) <$> text "b" <*> text "c"
  "3" <- text "c"

Start from the sibling of the leaf of the match:

serial $ do
  "1" <- text "b"
  "2" <- text "c"
  "3" <- text "c"

Starting from the root is more verbose, but also requires the user to be more precise about the structure of the match.

fimad added a commit that referenced this issue Feb 13, 2019
A new selector, textSelector, is added that matches raw text nodes.
This can currently be used to capture floating text that is not directly
nested within other tags.

This is probably not super useful without the contextual awareness that
will be provided by #48.

Issue #70
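
As an illustration, a minimal sketch of how this could pick up the bare "info" text from the <h2> example earlier in the thread; the exact result is an assumption based on the description above rather than something verified here:

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Assumption: texts textSelector returns every raw text node under the
-- <h2>, including the bare "info" that no plain tag selector can reach.
bareTexts :: Maybe [String]
bareTexts = scrapeStringLike "<h2><span>ignored</span>info</h2>" $
    chroot "h2" $ texts textSelector
-- Expected under that assumption: Just ["ignored", "info"]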
fimad modified the milestone: 0.6.0 Feb 14, 2019
fimad added a commit that referenced this issue Feb 16, 2019
The approach utilizes a zipper-like API to focus on sibling nodes and
execute scrapers on the focused nodes. Regression tests have been added
that show that this solves the requirements in issues #41 and #45.

The documentation needs to be cleaned up a bit and I'm not quite happy
with the names `visitSerially` and `visitChildrenSerially`.

Also there is a bad interaction with the newly added textSelector that
makes stepNext/Back behave unexpectedly. Whitespace between tags counts
as nodes, which leads to unintuitive behavior.

Issue #48
fimad added the "fixed at head" label (The issue has been addressed but the fix is not yet included in a released version of the library.) Feb 17, 2019

fimad commented Feb 18, 2019

Supported in 0.6.0.

Ended up going with a solution similar to that proposed above where only the immediate children are visited.

The solution utilizes a new type SerialScraper where the user explicitly moves the cursor with functions like stepNext and seekNext.
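
As a sketch of what the #45 example from this thread looks like against that API (this assumes the inSerial, untilNext, and matches combinators in addition to stepNext/seekNext, and uses seekNext rather than stepNext so that whitespace text nodes between tags are skipped over):

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import Text.HTML.Scalpel

exampleHtml :: String
exampleHtml =
    "<body><h1>title1</h1>\
    \<h2>title2 1</h2><p>text 1</p><p>text 2</p>\
    \<h2>title2 2</h2><p>text 3</p>\
    \<h2>title2 3</h2></body>"

sections :: Maybe [(String, [String])]
sections = scrapeStringLike exampleHtml $
    chroot "body" $ inSerial $ do
        -- Skip past the leading <h1>.
        seekNext $ matches "h1"
        -- Scrape each <h2> together with every <p> up to the next <h2>.
        many $ do
            title <- seekNext $ text "h2"
            body  <- untilNext (matches "h2") (many $ seekNext $ text "p")
            return (title, body)
-- Expected under those assumptions:
-- Just [("title2 1",["text 1","text 2"]),("title2 2",["text 3"]),("title2 3",[])]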

fimad closed this as completed Feb 18, 2019