Figure out story for queries that span multiple sub-trees. #48
Comments
It seems like both of these may be solvable if we were able to insert new fake nodes into the tag tree. What if there was a grouping function along the lines of the `group` and `groups` used in the sketches below?

For #41:

    <p class="something">Here</p>
    <p>Other stuff that matters</p>

    [
      "Here",
      "Other stuff that matters"
    ] <- group ["p" @: [hasClass "something"], "p"] texts

For #45:

    <body>
    <h1>title1</h1>
    <h2>title2 1</h2>
    <p>text 1</p>
    <p>text 2</p>
    <h2>title2 2</h2>
    <p>text 3</p>
    <h2>title2 3</h2>
    </body>

    [
      ("title2 1", ["text 1", "text 2"]),
      ("title2 2", ["text 3"]),
      ("title2 3", [])
    ] <- chroot "body" $
      groups ["h2", many "p"] $ do
        title <- text "h2"
        body <- texts "p"
        return (title, body)
It'd also be very useful to refer to siblings that have naked text (i.e. not part of a subtree).

The workaround for now is to
Unfortunately I don't think

It's also not immediately clear how to expose bare text selection in a way that would be backwards compatible. My current thinking is to create an additional value for SelectNode for text nodes. That would let you do something like the following to grab the second text node under an `h2`:

    chroot "h2" $
      chroots textSelector $ do
        p <- position
        guard (p == 1)
        text textSelector

With an API like the one proposed in #21 you could do something even more snazzy like:

The potential issue here though is that allowing selection of bare text nodes would create a breaking change in the behavior of

This might be an OK breaking change though since I think the most useful use of
Was starting to look into this and I'm now thinking of trying to do the following: Have some way to toggle the behavior of

For #41:

    <p class="something">Here</p>
    <p>Other stuff that matters</p>

    ("Here", "Other stuff that matters") <- serial $ do
      first <- text ("p" @: [hasClass "something"])
      second <- text "p"
      return (first, second)

For #45:

    <body>
    <h1>title1</h1>
    <h2>title2 1</h2>
    <p>text 1</p>
    <p>text 2</p>
    <h2>title2 2</h2>
    <p>text 3</p>
    <h2>title2 3</h2>
    </body>

    [
      ("title2 1", ["text 1", "text 2"]),
      ("title2 2", ["text 3"]),
      ("title2 3", [])
    ] <- chroot "body" $ serial $ many $ do
      title <- text "h2"
      body <- many $ text "p"
      return (title, body)

Some details that still need to be worked out are how to treat nested matches. Should the next scraper pick up from the sibling of the root or the match? For example:

    <a>
    <b>1</b>
    <c>2</c>
    </a>
    <c>3</c>

    serial $ do
      "1" <- text ("a" // "b")
      foo <- text "c"

Should `foo` be "2" or "3"?

Let's say you wanted to select all 3 values. Both cases would support this.

Start from the sibling of the root of the match:

    serial $ do
      ("1", "2") <- chroot "a" $ (,) <$> text "b" <*> text "c"
      "3" <- text "c"

Start from the sibling of the leaf of the match:

    serial $ do
      "1" <- text "b"
      "2" <- text "c"
      "3" <- text "c"

Starting from the root is more verbose, but also requires the user to be more precise about the structure of the match.
The approach uses a zipper-like API to focus on sibling nodes and execute scrapers on the focused nodes. Regression tests have been added that show that this solves the requirements in issues #41 and #45. The documentation needs to be cleaned up a bit and I'm not quite happy with the names `visitSerially` and `visitChildrenSerially`. Also there is a bad interaction with the newly added textSelector that makes stepNext/Back behave unexpectedly. Whitespace between tags counts as nodes, which leads to unintuitive behavior. Issue #48
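To make the zipper idea and the whitespace problem concrete, here is a minimal sketch using only `base`. It is a toy model, not scalpel's implementation: the `Node`, `Zipper` and `NodeScraper` types and the `seekForward` helper are hypothetical names invented for illustration, and `stepNext` here only borrows the name mentioned above.

```haskell
-- A toy sibling node: either an element (tag name, inner text) or bare text.
data Node = Elem String String | Text String
  deriving (Show, Eq)

-- A list zipper over siblings: those already visited (nearest first)
-- and those still ahead of the focus.
data Zipper = Zipper { before :: [Node], ahead :: [Node] }

-- In this toy model a "scraper" just tries to extract a value from one node.
type NodeScraper a = Node -> Maybe a

-- Match an element by tag name and return its text.
textOf :: String -> NodeScraper String
textOf t (Elem t' s) | t == t' = Just s
textOf _ _ = Nothing

-- Run the scraper on exactly the next sibling; fail if it does not match.
stepNext :: NodeScraper a -> Zipper -> Maybe (a, Zipper)
stepNext _ (Zipper _ []) = Nothing
stepNext f (Zipper ls (n : rs)) = do
  a <- f n
  Just (a, Zipper (n : ls) rs)

-- Hypothetical helper: skip siblings until one matches, then move past it.
seekForward :: NodeScraper a -> Zipper -> Maybe (a, Zipper)
seekForward _ (Zipper _ []) = Nothing
seekForward f z@(Zipper ls (n : rs)) =
  case stepNext f z of
    Just r  -> Just r
    Nothing -> seekForward f (Zipper (n : ls) rs)

-- The siblings from #41, including the newline between the two tags.
siblings41 :: [Node]
siblings41 =
  [ Elem "p" "Here"
  , Text "\n"
  , Elem "p" "Other stuff that matters"
  ]
```

In this model, `stepNext (textOf "p")` succeeds on the first `<p>`, but a second `stepNext (textOf "p")` fails because the focus lands on the whitespace text node; `seekForward (textOf "p")` skips it and reaches the second `<p>`. That is the kind of surprise described above.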
Supported in

Ended up going with a solution similar to that proposed above where only the immediate children are visited. The solution utilizes a new type
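For readers landing here later, the following is a rough sketch of how the #45 case might look with the serial scraping functions. It assumes scalpel 0.6 or later, where I believe `Text.HTML.Scalpel` exports `inSerial`, `seekNext`, `untilNext` and `matches`; the names `html45`, `sections` and `result45` are made up for this example, so check the details against the haddocks for your version.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many)
import Text.HTML.Scalpel

-- The markup from #45, written without whitespace between tags.
html45 :: String
html45 = concat
  [ "<body>"
  , "<h1>title1</h1>"
  , "<h2>title2 1</h2>"
  , "<p>text 1</p>"
  , "<p>text 2</p>"
  , "<h2>title2 2</h2>"
  , "<p>text 3</p>"
  , "<h2>title2 3</h2>"
  , "</body>"
  ]

-- Visit the immediate children of <body> in order. For each <h2>,
-- collect the <p> siblings that appear before the next <h2>.
sections :: Scraper String [(String, [String])]
sections =
  chroot "body" $ inSerial $ many $ do
    title <- seekNext $ text "h2"
    ps    <- untilNext (matches "h2") (many $ seekNext $ text "p")
    return (title, ps)

-- Assumed entry point: run the scraper over the raw string.
result45 :: Maybe [(String, [String])]
result45 = scrapeStringLike html45 sections
```

Here `seekNext` is used rather than `stepNext` so that siblings that do not match (the leading `<h1>`, or whitespace text nodes in real markup) are skipped instead of causing a failure.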
Right now scalpel makes the assumption that an HTML document is a tree (possibly a malformed one) and that scraping involves selecting one or more sub-trees and extracting (the same) data from each sub-tree.
There are use cases (#41, #45) that don't fit into this model but would be nice to support.
This issue is for brainstorming API/architecture changes that would support these types of queries.
#41 could be solved by extending selectors to allow jumping between sibling sub-trees, maybe something that looked like:
The problem with this is that it doesn't really extend to more complicated scenarios like those in #45, which involve collecting a sequence of siblings until a certain condition is met.
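To pin down what "collecting a sequence of siblings until a certain condition is met" means for #45, here is a small sketch over a flat list of siblings using only `base`. It is not scalpel code: the `Node` type, `groupSections` and `example45` are hypothetical names for illustration, and the model ignores nesting and text nodes.

```haskell
-- A toy model of flat sibling elements; not scalpel's real representation.
data Node = Node { tag :: String, content :: String }
  deriving (Show, Eq)

-- Pair each <h2> with the run of <p> siblings that immediately follows it.
-- Siblings before the next <h2> that are not part of a run are skipped.
groupSections :: [Node] -> [(String, [String])]
groupSections nodes =
  case dropWhile ((/= "h2") . tag) nodes of
    []         -> []
    (h : rest) ->
      -- Collect <p> siblings until the first non-<p> sibling.
      let (ps, rest') = span ((== "p") . tag) rest
      in  (content h, map content ps) : groupSections rest'

-- The children of <body> from the #45 example.
example45 :: [Node]
example45 =
  [ Node "h1" "title1"
  , Node "h2" "title2 1"
  , Node "p"  "text 1"
  , Node "p"  "text 2"
  , Node "h2" "title2 2"
  , Node "p"  "text 3"
  , Node "h2" "title2 3"
  ]
```

Evaluating `groupSections example45` in GHCi gives one pair per `<h2>`, with an empty list for the last heading. The difficulty is expressing this kind of stateful walk over siblings with selectors alone, since each group ends wherever the next heading happens to start.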