Single Level Selectors #21

SavageMessiah · 2016-02-14T04:34:02Z

I can't see any way to select tags that are immediate children of a parent rather than ones at an arbitrary nesting. Basically I have some html like this:

<div id="a">
   <div>
      <span></span>
      <span></span>
   </div>
   <span></span>
</div>

I'd like to chroot into the nested div and do stuff that takes into account the nested spans. I'd ALSO like to process all the spans at the top level of div "a" without touching those under the other div.
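To make the distinction concrete, here is a minimal sketch over a toy tree type. The `Node` type and every function name below are made up for illustration and are not part of scalpel; the tree mirrors the HTML above.

```haskell
-- Toy model of the HTML above; this is not scalpel's representation.
data Node = Node { tag :: String, kids :: [Node] }

-- div#a containing a nested div (with two spans) and one top-level span.
sample :: Node
sample = Node "div"
  [ Node "div" [Node "span" [], Node "span" []]
  , Node "span" []
  ]

-- Every node strictly below the given one, at any depth.
descendants :: Node -> [Node]
descendants n = concatMap (\k -> k : descendants k) (kids n)

-- Spans at arbitrary nesting (what a descendant-style selector gives).
spansAnywhere :: Node -> Int
spansAnywhere = length . filter ((== "span") . tag) . descendants

-- Spans among the immediate children only (what this issue asks for).
spansDirect :: Node -> Int
spansDirect = length . filter ((== "span") . tag) . kids
```

On `sample`, `spansAnywhere` counts 3 spans while `spansDirect` counts 1, which is the gap being described.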

In my actual use case I was able to work around this because of what I was doing. In general, though, this is a feature I would expect from any scraping API. If you were super-motivated and just added CSS selectors that would help a lot :p

I hope you can solve this. This is one of the most pleasant scraping APIs I've ever used, much more pleasant than HandsomeSoup.

fimad · 2016-02-16T01:37:36Z

Thanks for using the library and I'm glad you like the APIs :)

We can definitely make single level selectors happen. After some thinking, I'm leaning toward adding something like the following two methods:

-- | Constrains a selector to only match tags that are at the top level
-- of the current context.
top :: Selectable a => a -> Selector

-- | Short hand for `a // top b`.
(///) :: (Selectable a, Selectable b) => a -> b -> Selector

As for full-on CSS selectors, if we were to add them I think it'd be best to preserve the syntax as opposed to trying to massage the expressions into valid Haskell. The only ways I know how to make that happen are to either use quasiquotes or parse strings at run-time. I'd lean toward the former for the type safety, but it would be a large departure from the library as it is today.

SavageMessiah · 2016-02-18T04:13:54Z

That sounds pretty good. Combining top and any would provide an easy way to walk the immediate children of a node as well.

rpglover64 · 2016-02-18T09:52:38Z

Every time I've reached for this library over the past year or so, I've been disappointed by the lack of this feature. Just now I was about to file a feature request, and someone's beaten me to the punch 😄.

I look forward to seeing it implemented.

I like the interface for the most part, but I have a few suggestions to consider:

- (///) should be the arbitrary depth one, (//) should be the shallow one.
- top is a special case of a depth :: Selectable a => Int -> a -> Selector (though it probably disallows negative numbers), which matches only if some element at the specified depth (0 for top level, etc.) matches the selector.
- CSS selectors should be a separate issue (possibly a separate library), and a good number of them don't apply.
  - It doesn't seem worthwhile to take their syntax, though; just to look through them for ideas of selectors that the library does not yet provide but could.

Thank you for developing such a useful tool!

SavageMessiah · 2016-02-18T17:57:15Z

Yeah, I don't think CSS selectors are really that important; I'd rather write Haskell anyway.

I'd agree that (//) being the shallow one and (///) being the deep one would make more sense but is it worth making a breaking change over?

I also agree that top being a special case of depth would be nice, though I think depth specifying a maximum depth rather than a specific one would be more useful. I'm not basing that belief on anything other than a gut feeling though.

rpglover64 · 2016-02-18T19:20:55Z

> [I]s it worth making a breaking change over?

I think so (with the appropriate version bump, of course); no library depends on scalpel (okay... acme-everything does, but that doesn't count), and web scraping is notoriously fragile anyway. I don't imagine there's a lot of meticulously maintained applications depending on scalpel.

fimad · 2016-02-19T06:52:35Z

I agree that (///) makes more sense as the arbitrary depth operator but was hesitant to make a breaking change... but... since scalpel's small and if the non-trivial fraction of users on this thread think it's a good idea I'd be down :)

As for depth, I think it might be worthwhile to have one method for "depth up to n" and one for "exactly depth n". I can't think of an elegant way to implement one given the other, so it seems like the library should provide both.

rpglover64 · 2016-02-19T07:59:43Z

> I can't think of an elegant way to implement one given the other [...]

Perhaps not elegant, but if we have a "consider only nodes this deep or deeper" and any sort of intersection...

The library should provide both in any case.
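For what it's worth, the intersection idea can be sketched in a few lines if depth constraints are modelled as plain predicates on an Int. All names here (`DepthFilter`, `atMost`, `atLeast`, `both`, `exactly`) are hypothetical and exist only for this illustration; nothing like them is claimed to be in scalpel.

```haskell
-- A depth constraint modelled as a predicate on the depth of a node.
-- Hypothetical names throughout; none of these exist in scalpel.
type DepthFilter = Int -> Bool

-- "Depth up to n": accept nodes at depth n or shallower.
atMost :: Int -> DepthFilter
atMost n d = d <= n

-- "This deep or deeper": accept nodes at depth n or below.
atLeast :: Int -> DepthFilter
atLeast n d = d >= n

-- Intersection of two constraints: a depth passes only if it passes both.
both :: DepthFilter -> DepthFilter -> DepthFilter
both f g d = f d && g d

-- Exact depth recovered from the two bounded forms plus intersection.
exactly :: Int -> DepthFilter
exactly n = atLeast n `both` atMost n
```

So with "at most", "at least", and an intersection, the exact form falls out, though a dedicated exact-depth operation would likely still read better at call sites.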

sordina · 2017-12-14T20:51:03Z

How would this be implemented? It seems like currently there's a list of elements that forms a 'fuzzy path'. Would there be a new kind of element introduced to express adjacency?

fimad · 2017-12-15T07:05:45Z

My latest thinking on this is to have a new scraper, depth, which would return the depth of the match. This would be similar to the already existing position function.

You could get single level selectors by doing:

chroots ("div" @: ["id" @= "a"]) $ chroots "span" $ do
   guard =<< (1 ==) <$> depth
   text anySelector

This isn't as concise as the originally proposed (///) but would be more flexible in that it would allow for conditions on arbitrary depths and would compose well with position.

As far as how this would actually be implemented, the depth of the current node could be added to the SelectContext type which holds ephemeral meta-data for nodes that can change depending on the context.
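As a rough illustration of carrying the depth alongside each match, here is a sketch that uses the list monad as a stand-in for the scraper. `Ctx` is a made-up record playing the role of a per-match context, and it is an assumption of this sketch rather than the shape of scalpel's SelectContext.

```haskell
import Control.Monad (guard)

-- Toy tree; not scalpel's representation.
data Node = Node { tag :: String, kids :: [Node] }

-- Hypothetical per-match context carrying ephemeral metadata.
data Ctx = Ctx { ctxDepth :: Int, ctxNode :: Node }

-- The markup from the original report: div#a > (div > span, span), span.
sample :: Node
sample = Node "div"
  [ Node "div" [Node "span" [], Node "span" []]
  , Node "span" []
  ]

-- Pair every node below the root with its depth (direct children are 1).
contexts :: Node -> [Ctx]
contexts = go 1
  where
    go d n = concatMap (\k -> Ctx d k : go (d + 1) k) (kids n)

-- List-monad analogue of the snippet above: match spans, keep depth 1.
directSpans :: Node -> [Node]
directSpans root = do
  c <- contexts root
  guard (tag (ctxNode c) == "span")
  guard (ctxDepth c == 1)
  return (ctxNode c)
```

Because the depth is just data in the context, the same guard can express any condition on it (`<= 2`, `even`, and so on), which is the flexibility argument made above.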

typesanitizer · 2018-09-30T22:34:35Z

I think having an API like

chroots ("div" @: ["id" @= "a"]) $ chroots "span" $ do
   guard =<< (1 ==) <$> depth
   text anySelector

could lead to inefficient code in the presence of lots of nodes/nesting. IIUC, what is going to happen is that the entire tree will be flattened and then you will filter it, so we are not exploiting the fact that depth monotonically increases to trim the deeper branches. Instead, we are processing the whole tree every time.

Is my understanding correct?

fimad · 2018-10-04T00:19:28Z

That's a good point. In the use case of filtering to a constant depth this would be less efficient than having a selector which has a chance to short circuit DFS paths.
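A small sketch of the difference, again over a toy tree and with made-up function names (this is not how scalpel is implemented): one version flattens every node and filters afterwards, the other stops descending once it is past the target depth. Each returns the number of nodes it looked at together with the matches.

```haskell
-- Toy tree; not scalpel's representation.
data Node = Node { tag :: String, kids :: [Node] }

-- The markup from the original report: div#a > (div > span, span), span.
sample :: Node
sample = Node "div"
  [ Node "div" [Node "span" [], Node "span" []]
  , Node "span" []
  ]

-- Flatten everything, then filter: looks at every node below the root.
flattenThenFilter :: Int -> Node -> (Int, [Node])
flattenThenFilter target root =
  (length allNodes, [n | (d, n) <- allNodes, d == target])
  where
    allNodes = go 1 root
    go d n = concatMap (\k -> (d, k) : go (d + 1) k) (kids n)

-- Pruned depth-first search: never looks below the target depth.
prunedSearch :: Int -> Node -> (Int, [Node])
prunedSearch target = go 1
  where
    go d n
      | d > target = (0, [])
      | otherwise =
          let here  = if d == target then kids n else []
              below = map (go (d + 1)) (kids n)
          in  (length (kids n) + sum (map fst below), here ++ concatMap snd below)
```

For a target depth of 1 on `sample`, the flattening version looks at 4 nodes and the pruned version at 2, with the same matches. The gap grows with the amount of nesting below the target depth.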

I am also open to alternate APIs and/or supporting multiple APIs here. I think there is value in being able to read the current depth, but it may not be the best way to enforce depth.

atDepth allows for specifying the depth that a Selector must be at in relation to the previous Selector or root node. For example, the following selects anchor tags that are direct children of a div tag:

"div" // "a" `atDepth` 1

Issue #21

This is an optimization where parts of the search space are culled when the current node's depth is greater than the depth required for a successful match. Issue #21

fimad · 2019-02-18T05:08:43Z

atDepth, which confines matches based on their depth, has been added in version 0.6.0.

The selector to select <b> tags one level under <a> tags would be

"a" // "b" `atDepth` 1

Any additional functionality can be addressed in future issues if they prove necessary.
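Applied to the markup from the original report, usage might look like the sketch below. It assumes scalpel 0.6.0 or later is installed; the `page` string and the text inside the spans are placeholders added here so the matches can be told apart, and the names `allSpans` and `directSpans` are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- The markup from the original report, with placeholder text added.
page :: String
page = "<div id=\"a\"><div><span>inner 1</span><span>inner 2</span></div><span>outer</span></div>"

-- Spans at any depth under the div with id "a".
allSpans :: Maybe [String]
allSpans = scrapeStringLike page $
  texts ("div" @: ["id" @= "a"] // "span")

-- Spans that are direct children of the div with id "a".
directSpans :: Maybe [String]
directSpans = scrapeStringLike page $
  texts ("div" @: ["id" @= "a"] // "span" `atDepth` 1)
```

Note that no extra parentheses are written around `"span" `atDepth` 1`, mirroring the selector shown above, which relies on atDepth binding tighter than (//).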

fimad added this to the 0.4.0 milestone May 28, 2016

fimad added the enhancement label May 28, 2016

fimad modified the milestones: 0.4.1, 0.4.0 Oct 17, 2016

typesanitizer mentioned this issue Sep 29, 2018

Please expose Internal modules :) #65

Closed

fimad mentioned this issue Feb 9, 2019

Figure out story for queries that span multiple sub-trees. #48

Closed

fimad added a commit that referenced this issue Feb 10, 2019

Short circuit when atDepth cannot be satisfied

6ff3148

This is an optimization where parts of the search space are culled when the current node's depth is greater than the depth required for a successful match. Issue #21

This was referenced Feb 10, 2019

Add atDepth operator and misc fixes and optimizations #69

Merged

Allow selecting bare text nodes #70

Closed

fimad added a commit that referenced this issue Feb 13, 2019

Short circuit when atDepth cannot be satisfied

cec0268

This is an optimization where parts of the search space are culled when the current node's depth is greater than the depth required for a successful match. Issue #21

fimad added the "fixed at head" label (the issue has been addressed, but the fix is not yet included in a released version of the library) Feb 13, 2019

fimad modified the milestone: 0.6.0 Feb 14, 2019

fimad closed this as completed Feb 18, 2019
