Single Level Selectors #21

Closed
SavageMessiah opened this issue Feb 14, 2016 · 12 comments
Labels
enhancement fixed at head The issue has been addressed but the fix is not yet included in a released version of the library.
Comments

@SavageMessiah

I can't see any way to select tags that are immediate children of a parent rather than ones at an arbitrary nesting. Basically I have some html like this:

<div id="a">
   <div>
      <span></span>
      <span></span>
   </div>
   <span></span>
</div>

I'd like to chroot into the nested div and do stuff that takes into account the nested spans. I'd ALSO like to process all the spans at the top level of div "a" without touching those under the other div.

In my actual use case I was able to work around this because of what I was doing. In general, though, this is a feature I would expect from any scraping API. If you were super-motivated and just added css selectors that would help a lot :p

I hope you can solve this. This is one of the most pleasant scraping APIs I've ever used, much more pleasant than HandsomeSoup.

@fimad
Owner

fimad commented Feb 16, 2016

Thanks for using the library and I'm glad you like the APIs :)

We can definitely make single level selectors happen. After some thinking, I'm leaning toward adding something like the following two methods:

-- | Constrains a selector to only match tags that are at the top level
-- of the current context.
top :: Selectable a => a -> Selector

-- | Short hand for `a // top b`.
(///) :: (Selectable a, Selectable b) => a -> b -> Selector 

As for full-on CSS selectors, if we were to add them I think it'd be best to preserve the syntax as opposed to trying to massage the expressions into valid Haskell. The only ways I know to make that happen are to either use quasiquotes or parse strings at run-time. I'd lean toward the former for the type safety, but it would be a large departure from the library as it is today.

@SavageMessiah
Author

That sounds pretty good. Combining top and any would provide an easy way to walk the immediate children of a node as well.

@rpglover64

Every time I've reached for this library over the past year or so, I've been disappointed by the lack of this feature. Just now I was about to file a feature request, and someone's beaten me to the punch 😄.

I look forward to seeing it implemented.

I like the interface for the most part, but I have a few suggestions to consider:

  • (///) should be the arbitrary depth one, (//) should be the shallow one
  • top is a special case of a depth :: Selectable a => Int -> a -> Selector (though it probably disallows negative numbers) which matches only if some element at the specified depth (0 for top level, etc.) matches the selector.
  • CSS selectors should be a separate issue (possibly a separate library), and a good number of them don't apply.
    It doesn't seem worthwhile to adopt their syntax, though; it would be enough to look through them for ideas of selectors that the library does not yet provide but could.

Thank you for developing such a useful tool!

@SavageMessiah
Author

Yeah, I don't think CSS selectors are really that important; I'd rather write Haskell anyway.

I'd agree that (//) being the shallow one and (///) being the deep one would make more sense, but is it worth making a breaking change over?

I also agree that top being a special case of depth would be nice, though I think depth specifying a maximum depth rather than a specific one would be more useful. I'm not basing that belief on anything other than a gut feeling though.

@rpglover64

> [I]s it worth making a breaking change over?

I think so (with the appropriate version bump, of course); no library depends on scalpel (okay... acme-everything does, but that doesn't count), and web scraping is notoriously fragile anyway. I don't imagine there are a lot of meticulously maintained applications depending on scalpel.

@fimad
Owner

fimad commented Feb 19, 2016

I agree that (///) makes more sense as the arbitrary depth operator but was hesitant to make a breaking change... but... since scalpel's small, if the non-trivial fraction of users on this thread thinks it's a good idea, I'd be down :)

As for depth, I think it might be worthwhile to have one method for depth up to a limit and one for exact depth. I can't think of an elegant way to implement one given the other, so it seems like the library should provide both.
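
To make the two variants concrete, here is a minimal sketch on a plain rose tree from `Data.Tree` (the containers package) rather than on scalpel's own types. The names `atExactly`, `upTo`, and `sample` are hypothetical and exist only for this illustration; nothing here is scalpel's API or implementation.

```haskell
import Data.Tree (Tree (..), levels)

-- A stand-in for the markup in the original report: the labels are just
-- names, and depth 0 is the root.
sample :: Tree String
sample =
  Node "div#a"
    [ Node "div" [Node "span-1" [], Node "span-2" []]
    , Node "span-3" []
    ]

-- Labels of the nodes exactly n levels below the root (n = 1: children).
atExactly :: Int -> Tree a -> [a]
atExactly n t
  | n < 0     = []
  | otherwise = concat (take 1 (drop n (levels t)))

-- Labels of the nodes between 1 and n levels below the root, listed
-- level by level (breadth-first), not in document order.
upTo :: Int -> Tree a -> [a]
upTo n t = concat (take n (drop 1 (levels t)))
```

Loaded in GHCi, `atExactly 1 sample` gives the immediate children only, while `upTo 2 sample` also includes the nested spans. The exact-depth result is what the maximum-depth result at `n` has beyond the one at `n - 1`, which is one way to read the "intersection" idea, though not an efficient one.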

@rpglover64

> I can't think of an elegant way to implement one given the other [...]

Perhaps not elegant, but if we have a "consider only nodes this deep or deeper" and any sort of intersection...

The library should provide both in any case.

@fimad fimad added this to the 0.4.0 milestone May 28, 2016
@fimad fimad modified the milestones: 0.4.1, 0.4.0 Oct 17, 2016
@sordina

sordina commented Dec 14, 2017

How would this be implemented? It seems like currently there's a list of elements that forms a 'fuzzy path'. Would there be a new kind of element introduced to express adjacency?

@fimad
Owner

fimad commented Dec 15, 2017

My latest thinking on this is to have a new scraper, depth, which would return the depth of the match. This would be similar to the already existing position function.

You could get single level selectors by doing:

chroots ("div" @: ["id" @= "a"]) $ chroots "span" $ do
   guard =<< (1 ==) <$> depth
   text anySelector

This isn't as concise as the originally proposed (///) but would be more flexible in that it would allow for conditions on arbitrary depths and would compose well with position.

As far as how this would actually be implemented, the depth of the current node could be added to the SelectContext type which holds ephemeral meta-data for nodes that can change depending on the context.

@typesanitizer
Contributor

I think having an API like

chroots ("div" @: ["id" @= "a"]) $ chroots "span" $ do
   guard =<< (1 ==) <$> depth
   text anySelector

could lead to inefficient code in the presence of lots of nodes/nesting. IIUC, what is going to happen is that the entire tree will be flattened and then you will filter it, so we are not exploiting the fact that depth monotonically increases to trim the deeper branches. Instead, we are processing the whole tree every time.

Is my understanding correct?

@fimad
Owner

fimad commented Oct 4, 2018

That's a good point. In the use case of filtering to a constant depth this would be less efficient than having a selector which has a chance to short circuit DFS paths.
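
The difference between filtering after the fact and short-circuiting the search can be sketched on a plain rose tree. This is an illustration under simplifying assumptions, not scalpel's traversal code; `withDepth`, `naiveAtDepth`, and `prunedAtDepth` are hypothetical names.

```haskell
import Data.Tree (Tree (..))

-- Pair every node with its depth (root = 0), in document order.
withDepth :: Tree a -> [(Int, a)]
withDepth = go 0
  where
    go d (Node x kids) = (d, x) : concatMap (go (d + 1)) kids

-- Flatten-then-filter: every node is enumerated, whatever n is.
naiveAtDepth :: Int -> Tree a -> [a]
naiveAtDepth n t = [x | (d, x) <- withDepth t, d == n]

-- Pruned search: recursion stops once depth n is reached, so subtrees
-- below the target depth are never visited.
prunedAtDepth :: Int -> Tree a -> [a]
prunedAtDepth n (Node x kids)
  | n < 0     = []
  | n == 0    = [x]
  | otherwise = concatMap (prunedAtDepth (n - 1)) kids
```

On a finite tree both functions return the same nodes in the same order; the pruned version simply does less work when most of the tree lies below the requested depth. It even terminates on an infinitely deep tree, where the flatten-then-filter version would not.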

I am also open to alternate APIs and/or supporting multiple APIs here. I think there is value in being able to read the current depth, but it may not be the best way to enforce depth.

fimad added a commit that referenced this issue Feb 10, 2019
atDepth allows for specifying the depth that a Selector must be at in
relation to the previous Selector or root node. For example the below
will select anchor tags that are direct children of a div tag.

  "div" // "a" `atDepth` 1

Issue #21
fimad added a commit that referenced this issue Feb 10, 2019
This is an optimization where parts of the search space are culled when
the current node's depth is greater than the depth required for a
successful match.

Issue #21
fimad added a commit that referenced this issue Feb 13, 2019
atDepth allows for specifying the depth that a Selector must be at in
relation to the previous Selector or root node. For example the below
will select anchor tags that are direct children of a div tag.

  "div" // "a" `atDepth` 1

Issue #21
fimad added a commit that referenced this issue Feb 13, 2019
This is an optimization where parts of the search space are culled when
the current node's depth is greater than the depth required for a
successful match.

Issue #21
@fimad fimad added the fixed at head The issue has been addressed but the fix is not yet included in a released version of the library. label Feb 13, 2019
@fimad fimad modified the milestone: 0.6.0 Feb 14, 2019
@fimad
Owner

fimad commented Feb 18, 2019

atDepth, which constrains matches by their depth relative to the previous selector, has been added in version 0.6.0.

The selector to select <b> tags one level under <a> tags would be

"a" // "b" `atDepth` 1

Any additional functionality can be addressed in future issues if it proves necessary.
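
Applied to the markup from the original report, a usage sketch might look like the following. It assumes scalpel 0.6.0 or later with `OverloadedStrings`; the span text and the names `page`, `topLevelSpans`, and `allSpans` are additions made for this example so the matches can be told apart.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- The markup from the original report, with text added to each span.
page :: String
page = concat
  [ "<div id=\"a\">"
  ,   "<div><span>nested 1</span><span>nested 2</span></div>"
  ,   "<span>top</span>"
  , "</div>"
  ]

-- Spans that are direct children of the div with id "a".
topLevelSpans :: Maybe [String]
topLevelSpans =
  scrapeStringLike page $
    texts ("div" @: ["id" @= "a"] // "span" `atDepth` 1)

-- Spans at any depth below the div with id "a", for comparison.
allSpans :: Maybe [String]
allSpans =
  scrapeStringLike page $
    texts ("div" @: ["id" @= "a"] // "span")
```

With this input, `topLevelSpans` should contain only the span sitting directly under div "a", while `allSpans` should also pick up the two nested ones.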

@fimad fimad closed this as completed Feb 18, 2019