any suggestions for something similar to cheerio with xpath support? #152

Closed
mreinstein opened this issue Jan 29, 2013 · 17 comments

@mreinstein

@matthewmueller cheerio is freaking awesome. Great work man!

One thing I'm curious about is using xpath selectors instead of jquery. It seems like most browsers have a firebug/web inspector that expresses DOM node positions in xpath. So while I know and love jquery, I'm tired of having to manually convert my xpath selector to a jquery equivalent. I'm wondering if cheerio supports this? I'm assuming it doesn't, so I'm wondering if there is another module similar to cheerio that has similarly awesome perf characteristics, forgiving html parsing, and support for xpath that you would recommend. Or maybe I'm just doing it wrong, and there is some other tool where I can click a DOM element and get a selector expressed in cheerio compatible syntax?

@mreinstein
Author

I've also heard that xpath (in browsers at least) is remarkably more efficient than the css selectors based on DOM traversal. John Resig has a post that is a bit dated on this topic, but it seems compelling. http://ejohn.org/blog/xpath-overnight/

@matthewmueller
Member

I'm really not sure of anything. does jquery support xpath selectors? it could be a nice feature to plug into the lib.

This article is interesting, but definitely dated. do you have any recent benchmarks to compare?

@mreinstein
Author

> does jquery support xpath selectors?

From what I've read, jquery supported it but it was removed a long time ago (1.2!) because it was an unpopular/underused feature. It looks like it was moved to a plugin, but even that seems dubiously stale. http://archive.plugins.jquery.com/project/xpath

Maybe this is worth trying, but the fact that it's based on an xml parser scares the bejesus out of me and I question its suitability for parsing HTML: https://github.com/goto100/xpath

> do you have any recent benchmarks to compare

I don't sadly. It may be irrelevant since Resig was talking about browser implementations, well before node (or V8 for that matter) were prime time.

@mreinstein
Author

Maybe I'm just going about this with the wrong strategy? My original thinking was "hey, I have a ton of different sites to scrape, surely there must be a better way than manually translating each xpath selector or manually eye-balling the DOM and writing jQuery traversals/selectors. I know, I'll use a library that handles xpath!"

Maybe this is wrong headed.. how do people handle defining large volumes of robust selectors? I find it hard to believe most people are manually translating them. Or maybe node and npm's awesomeness has just made me lazy? :)
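For the simple absolute paths that browser inspectors tend to emit, the manual translation described above can be partly automated. The following is only a minimal sketch: `simpleXPathToCss` is a hypothetical helper, not part of cheerio or any library mentioned here, and it assumes the selector engine in use understands `:nth-of-type`. Anything beyond plain element steps with an optional numeric index (attributes, `//`, axes, functions) is rejected rather than guessed at.

```javascript
// Hypothetical helper (an assumption for illustration, not a cheerio API):
// converts a simple absolute XPath such as /html/body/div[2]/a
// into an equivalent CSS selector.
function simpleXPathToCss(xpath) {
    // "//" (descendant-or-self) is not handled by this sketch.
    if (xpath.indexOf("//") !== -1) {
        throw new Error("Unsupported XPath (contains //): " + xpath);
    }
    var steps = xpath.split("/").filter(function (s) { return s.length > 0; });
    var parts = steps.map(function (step) {
        // Accept only "name" or "name[n]".
        var m = /^([A-Za-z][\w-]*)(?:\[(\d+)\])?$/.exec(step);
        if (!m) throw new Error("Unsupported XPath step: " + step);
        // XPath's div[2] is the second div among its siblings,
        // which is what :nth-of-type(2) selects in CSS.
        return m[2] ? m[1] + ":nth-of-type(" + m[2] + ")" : m[1];
    });
    // Each "/" is a parent-to-child step, i.e. the CSS child combinator.
    return parts.join(" > ");
}

console.log(simpleXPathToCss("/html/body/div[2]/ul/li[1]/a"));
// prints "html > body > div:nth-of-type(2) > ul > li:nth-of-type(1) > a"
```

The resulting string could then be passed to a jQuery-style selector call. Paths copied from a browser reflect the browser's repaired DOM, so they may not line up exactly with what another parser builds from the same markup.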

@mreinstein
Author

actually I just found this: http://plugins.jquery.com/xpath looks like it has indeed been moved to the fancy new jquery builds. I guess the question now is how to use this plugin with node.

@fb55
Member

fb55 commented Jan 29, 2013

As far as performance is concerned, css selectors as they are implemented by CSSselect are way faster than how xpath queries are evaluated (O(n) vs O(n^m)). The blogpost, as far as I understand it, reports an advantage due to combined queries, which, at that time, couldn't be done using CSS selectors (querySelectorAll() didn't exist in 2007). And that was done by translating CSS selectors to xpath queries.

I thought about implementing an xpath engine a while back, but being fully spec compliant is much more complicated than with CSS selectors. But in case you're interested here's the spec, and you are free to use parts of CSSselect whenever you'd like to (it's open-source, anyway).
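To make the single-pass idea concrete, here is a toy sketch of matching a descendant selector right to left: each node is tested once, and a test only walks up that node's ancestors. This is an illustration under stated assumptions, not CSSselect's actual implementation; the `el`, `matchesChain`, and `selectAll` names and the tree shape are made up for the example.

```javascript
// Toy tree node; `parent` links let a match walk upward.
function el(name, children) {
    var node = { name: name, parent: null, children: children || [] };
    node.children.forEach(function (c) { c.parent = node; });
    return node;
}

// Does `node` match a descendant chain such as ["div", "ul", "a"]
// (the selector "div ul a")? Checked right to left: the node itself
// must carry the last name, then its ancestors must supply the
// remaining names in order.
function matchesChain(node, names) {
    var i = names.length - 1;
    if (node.name !== names[i]) return false;
    i--;
    for (var p = node.parent; p && i >= 0; p = p.parent) {
        if (p.name === names[i]) i--;
    }
    return i < 0;
}

// One traversal of the tree, testing every node exactly once.
function selectAll(root, names) {
    var out = [];
    (function walk(node) {
        if (matchesChain(node, names)) out.push(node);
        node.children.forEach(walk);
    })(root);
    return out;
}

var tree = el("div", [
    el("ul", [el("li", [el("a")]), el("a")]),
    el("p", [el("a")])
]);
console.log(selectAll(tree, ["div", "ul", "a"]).length); // prints 2
```

An XPath evaluator, by contrast, typically works step by step from the left, producing an intermediate node set at each step and re-querying the tree from every node in it, which is where repeated work can pile up.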

@mreinstein
Author

@fb55 Ah that's really interesting, thanks for the context. :)

@khrome

khrome commented Oct 24, 2013

if you want any kind of speed, using libxml is your only option in node

// Cache for lazily loaded modules.
var libs = {};

// Return the serialized nodes matching an XPath selector in an HTML string.
var xpathText = function(selector, value){
    if(!libs.libxmljs) libs.libxmljs = require("libxmljs");
    var xmlDoc = libs.libxmljs.parseHtmlString(value);
    var result = xmlDoc.find(selector);
    var results = [];
    // Normalize to an array in case a single node comes back.
    if(!Array.isArray(result)) result = [result];
    result.forEach(function(node){
        if(node) results.push(node.toString());
    });
    return results;
};

I use this as part of two symmetrical functions for blending xpath and regex. If you aren't trying to match the return form of regex, your code can be even simpler. Happy scraping.

@fb55
Member

fb55 commented Oct 24, 2013

> if you want any kind of speed, using libxml is your only option in node

Either we're living in distinct universes, or you simply haven't done any benchmarking. At least for parsing, htmlparser2 is faster than libxml even when building a DOM tree (which is optional). The benchmark results are computed on Travis CI, with equal testing environments for all parsers. If you think this benchmark is biased (which could definitely be true), please create your own benchmark and share it!

But of course, libxml offers XPath support, so when you need that, it's definitely worth the bias. (Plus, when you're really dealing with XML, it's also the better tool for the job.)

@khrome

khrome commented Oct 24, 2013

Right, the question was about xpath. While I agree htmlparser2 is faster at DOM parsing, among the xpath options libxml is going to be the fastest, most compatible one. So from the perspective where you are checking parsing benchmarks and I'm answering this guy's question based on xpath selector speed, we are, indeed, from two different universes.

@fb55
Member

fb55 commented Oct 25, 2013

My point wasn't about the parsing performance; it was meant to be a refutation of the "if you want any kind of speed" claim. I'm sure an optimized XPath engine implemented in plain JS will outperform libxml with ease, but I agree, it's currently the best solution available.

@khrome

khrome commented Oct 26, 2013

So having tried a number of solutions, I can say that libxml is not only faster than any other solution, it is also more compatible, ensuring that expressions which execute in other environments execute with the expected results. So I stand by my "if you want any kind of speed" claim. I look forward to the day when it's not true, but for now, despite the fact your parser is great, it IS NOT USEFUL FOR XPATH. I'm just trying to answer the question; I'm certainly not making the claim libxml is ideal for all tasks. cheerio is awesome, jsdom is awesome, phantomjs is awesome, all for different purposes. The same goes for parsing and querying. For xpath, libxml is the fastest (not to mention the best syntax coverage).

I'm not here to disparage anybody's code, but there are pure js xpath evaluators, and the performance sucks no matter what DOM you plug it into. I'm guessing depending on a micro lib to do naive traversals will never be as performant as a DOM which indexes with the express purpose of optimizing such node traversals, which will slow down the basic parse. So I'm unconvinced an independent xpath lib is ever going to do the trick. I'd love it if you proved me wrong... take that as a challenge.

@fb55
Member

fb55 commented Oct 26, 2013

The problem with XPath is that it's essentially a programming language, although it isn't Turing complete. Just looking at the examples of ilinsky/xpath.js gives me the creeps.

The biggest problems with the three libraries I could find were polymorphic return types & too many object allocations (especially in a way that V8 couldn't optimize for). Also, they all have to query the DOM multiple times, which isn't ideal.

In an ideal world, I would now have the time to write a common querying kernel, which is able to process queries in both directions and can be used by both CSSselect (in order to support jQuery extensions such as :first, :last etc.) & an XPath engine, but I don't. Maybe in February or something =P
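As a rough illustration of the "polymorphic return types" point: an XPath 1.0 expression can evaluate to a node set, a string, a number, or a boolean, so a naive evaluator returns a different kind of value depending on the expression. The sketch below contrasts that with a result object that always has the same fields. Both functions and the `kind` parameter are hypothetical stand-ins, not code from any of the libraries discussed, and whether the uniform shape actually helps V8 in a given engine version is an assumption, not a measurement.

```javascript
// Hypothetical evaluator whose return type depends on the expression,
// mirroring XPath 1.0's four result types.
function evaluatePolymorphic(kind, nodes) {
    if (kind === "count") return nodes.length;       // number
    if (kind === "exists") return nodes.length > 0;  // boolean
    if (kind === "text") return nodes.join("");      // string
    return nodes;                                    // node set (array)
}

// Alternative sketch: always return an object with the same fields in
// the same order, so every caller sees one consistent shape.
function evaluateUniform(kind, nodes) {
    return {
        nodes: nodes,
        number: kind === "count" ? nodes.length : 0,
        string: kind === "text" ? nodes.join("") : "",
        boolean: kind === "exists" ? nodes.length > 0 : false
    };
}

var sample = ["a", "b"];
console.log(typeof evaluatePolymorphic("count", sample)); // prints "number"
console.log(typeof evaluatePolymorphic("text", sample));  // prints "string"
console.log(evaluateUniform("count", sample).number);     // prints 2
```

The uniform version trades a slightly larger result for call sites that never have to branch on `typeof`.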

@khrome

khrome commented Oct 26, 2013

yeah, looking at that code isn't exactly pleasant, and using it isn't very fun either.

I like your idea of a common query kernel, that would not only solve the 'isolated libs' problem I was referring to, but could make all types of selectors more performant. Hit me up if that happens or I could be useful! :)

@shirk3y

shirk3y commented Oct 20, 2014

There is a hidden xpath implementation in jsdom: https://github.com/fmap/jsdom-xpath

@softmarshmallow

softmarshmallow commented Jul 15, 2020

why wouldn't cheerio support xpath by default? (since it's used a lot for crawling/page parsing)

@fb55
Member

fb55 commented Jul 15, 2020

Xpath is a complicated beast. Cheerio is pretty well scoped at the moment; adding xpath support would complicate things a lot. Happy to link to an xpath extension tho, should someone want to invest the time.

@fb55 fb55 closed this as completed Jul 15, 2020