
Suggestion: adding soupsieve.strain #219

Closed
Yomguithereal opened this issue Mar 22, 2021 · 29 comments
Labels: P: maybe (Pending approval of low priority request.) · T: feature (Feature.)

Comments

@Yomguithereal commented Mar 22, 2021

Hello @facelessuser,

I need to create BeautifulSoup strainers to optimize scrapers for my tool here. But as a convenience for my users, I am going to build them from simple css selectors such as li.item[align=left].

I can do so by using your CSSParser class, which processes selectors and returns Selector instances, and assessing whether the selector is simple enough to be usable as a strainer. If so, I can build a function that will "apply" this selector to tell the strainer whether it should parse the current node, etc.

I will implement this in my tool either way, but I was wondering if you'd like me to contribute it to this lib instead, by adding something like soupsieve.strain basically. It would return an argument (typically a function) you can give to bs4.SoupStrainer, and it should raise a custom error if the selector is found to be too complex for the task. If this is of any interest I can open a PR for it.

Have a good day and thanks for your work,

@gir-bot gir-bot added the S: triage Issue needs triage. label Mar 22, 2021
@facelessuser (Owner)

@Yomguithereal, you shouldn't use the CSSParser class directly. Why not use the compile API command? https://facelessuser.github.io/soupsieve/api/#soupsievecompile

As for strainers, let me look into that. I haven't considered how selectors would/could be used in strainers. Let me see if there is already a way to do so, and if not, I can see if there is a way to make it possible.

@facelessuser (Owner)

@gir-bot remove S: triage
@gir-bot add T: feature

@gir-bot gir-bot added T: feature Feature. and removed S: triage Issue needs triage. labels Mar 23, 2021
@facelessuser (Owner)

Maybe before I attempt anything here, can you give me an example of what you are trying to accomplish?

@Yomguithereal (Author) commented Mar 23, 2021

Why not use the compile API command?

Because I need to "analyze" the parsed Selector (from css_types) instance to be able to assess whether the CSS selection can be used to strain a soup (e.g. I can only strain based on some tag name and attributes, so for instance ul > li is not strainable but li.item[align=left] is). I did not find a way to access this Selector instance through the Soupsieve instance returned by sv.compile. I guess it would be nice (but maybe I just haven't found it) to expose your robust parsing logic in some way, so that people may use it to perform analysis/introspection of raw CSS selectors for various purposes.
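For illustration, the kind of introspection I mean might look like this (a rough sketch against soupsieve's internal, unstable API; the module path and Selector field names are assumptions based on the current source and may differ between versions):

from soupsieve import css_parser

# Internal, unstable API: not something to rely on in production code.
selector_list = css_parser.CSSParser('li.item[align=left]').process_selectors()

for sel in selector_list.selectors:
    # Deem a selector "strainable" only if it relies solely on the tag name,
    # ids, classes and attributes available at straining time.
    strainable = not sel.relation and not sel.selectors and not sel.nth
    print(sel.tag, sel.ids, sel.classes, sel.attributes, strainable)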

Maybe before I attempt anything here, can you give me an example of what you are trying to accomplish?

Sure, here is the gist of it:

import bs4
import soupsieve

strainer_function = soupsieve.strain('li.item[align=left]')
strainer = bs4.SoupStrainer(strainer_function)
soup = bs4.BeautifulSoup(some_html, 'lxml', parse_only=strainer)

In this example, strainer_function would in fact do something of the kind (this is a toy example, I am fully aware of the limitations of the following code):

def strainer_function(tag, attrs):
    if tag != 'li':
        return False

    # Split the class string to avoid matching 'item' as a substring
    # of some other class name.
    if 'item' not in attrs.get('class', '').split():
        return False

    if attrs.get('align') != 'left':
        return False

    return True

It would be built through a closure or a callable instance (I am not speaking of building the function through code templating and evaluation, I am not that crazy :) ).
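For instance, a minimal sketch of the closure approach (make_strainer and its parameters are just illustrative names, not a proposed API):

import bs4

def make_strainer(tag_name, required_classes, required_attrs):
    # Build the strainer callback as a closure over the parsed constraints.
    def strainer_function(tag, attrs):
        if tag_name is not None and tag != tag_name:
            return False
        classes = attrs.get('class', '')
        # Depending on the builder, 'class' may arrive as a single string.
        if isinstance(classes, str):
            classes = classes.split()
        if not required_classes.issubset(classes):
            return False
        return all(attrs.get(k) == v for k, v in required_attrs.items())
    return strainer_function

strainer = bs4.SoupStrainer(make_strainer('li', {'item'}, {'align': 'left'}))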

Finally this:

soupsieve.strain('ul > li')

would need to raise some kind of error since we cannot make a strainer out of this selector.

Typically my use case is that either:

  1. I will let users give me simple css selectors to strain their soups when scraping to optimize CPU time & RAM to scrape millions of documents through multiprocessing.
  2. I can sometimes infer this simple css selector myself from the static analysis of my scraping DSL.

@facelessuser (Owner)

It would be built through a closure or a callable instance (I am not speaking of building the function through code templating and evaluation, I am not that crazy :) ).

I don't think you are, but the strainer really doesn't seem to be built with CSS selectors in mind, which is why it also doesn't natively support them.

TBH, I haven't delved too much into SoupStrainers. Does the tree actually exist at the time SoupStrainer is run? Does it partially exist (is it being run while building up the tree, so it might not all be there)? Are the elements being evaluated completely detached from the tree even if it does exist? Is SoupStrainer ensuring the physical final tree doesn't have the things not matched in SoupStrainer, or is it just hiding them?

Depending on the state of the tree and how the SoupStrainers work, we could potentially just subclass SoupStrainer and add a way to pass things to SoupSieve more properly.

The alternative would be to have a really dumbed-down selector handling that only supports things like tag, .class, #id, [attr="value"], :not(), :is(), and selector lists (selector, selector). Can we even handle namespaces at the time SoupStrainer runs (namespace|tag)?

This creates a lot of redundant handling.

If we could subclass a special SoupStrainer that passes in the Tag object, we could reuse most of the code and just let the selectors that can't be supported match nothing.

@Yomguithereal (Author)

TBH, I haven't delved too much into SoupStrainers. Does the tree actually exist at the time SoupStrainer is run?

Nope, the tree does not exist; it is being built. I think we are at the SAX tag level (it does not work with the pure Python html parser, by the way). So the only thing you have is the name of the tag and its attributes, and you must tell whether you want to parse the underlying data or skip it until the tag is closed, by returning True or False from the callback given to the SoupStrainer. Or at least this is how I understand it and how it would make sense. This means that the function does not have access to bs4 Tags, for instance, so most of soupsieve's internals probably cannot work directly on the data given to the function.

This also means the result will be a valid tree but without the parts you skipped with your straining function. The following doc:

<html>
<body>
<div>
  <p>
    <a href="whatever1">Link 1 <em>important</em></a>
  <p>
  <a href="whatever2">Link 2</a>
</div>
</body>

would result in a tree similar to this other HTML, if you "strain" a tags:

<html>
 <body>
  <a href="whatever1">Link 1 <em>important</em></a>
  <a href="whatever2">Link 2</a>
</body>
</html>

which is fine if you only care about scraping information about links, for instance (yes, you could use a regex instead of parsing HTML for this precise use case, but this is, again, a toy example :) ), and results in less CPU time spent parsing and less RAM used overall.
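For completeness, a self-contained version of that toy straining (assuming lxml is installed):

import bs4

HTML = """<html><body><div>
<p><a href="whatever1">Link 1 <em>important</em></a></p>
<a href="whatever2">Link 2</a>
</div></body></html>"""

def keep_links(tag, attrs):
    # Called while the tree is being built: only the tag name and its
    # attribute dict are available, never a full bs4 Tag.
    return tag == 'a'

soup = bs4.BeautifulSoup(HTML, 'lxml', parse_only=bs4.SoupStrainer(keep_links))
print(soup)  # only the <a> subtrees survive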

So basically, yes, we can probably just handle matches on tag names and attributes (including class and id), and we can probably forget about relations and things like :contains etc. But it is just easier for people to declare those matches with a simple CSS selector like a[href^=whatever] rather than using the sometimes byzantine and fragmented bs4 API.

But I also agree it is possible that the feature I describe is not useful to soupsieve itself, as it is too far from its concerns. And I can totally implement what I need outside of the library in its current state. The only thing I would need, but for comfort only, would be a way to access the library's CSS parsing utilities more officially, to turn a raw CSS selector string into a Selector I can work with.

What I can do, if you want, is implement what I mean on my end and show it to you afterwards, so you can have a fully clear idea of the whole thing; it would then be easier to assess what you do or don't want to do with it.

@facelessuser (Owner)

But it is just easier for people to declare those matches with a simple CSS selector like a[href^=whatever] rather than using the sometimes byzantine and fragmented bs4 API.

I agree. I really only use the find and find_all API if I want to use regular expressions.

What I can do, if you want, is implement what I mean on my end and show it to you afterwards, so you can have a fully clear idea of the whole thing; it would then be easier to assess what you do or don't want to do with it.

I'm okay with you taking a stab at this. It seems you have a use case for it and will move forward anyways, so sharing your discoveries would be helpful.

With that said, as I have time, I can also dig into what I think we could do. It may be easier to let the parser compile all valid CSS and send it through a different matcher that takes a name and attrs and returns False for anything compiled that is not supported.
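To make that concrete, a toy matcher that only understands tag, .class, and #id compounds might look like this (an illustration of the idea, not SoupSieve's actual implementation):

import re

TOKEN = re.compile(r'(?P<kind>[.#]?)(?P<name>[\w-]+)')

def make_matcher(selector):
    tokens = [(m.group('kind'), m.group('name')) for m in TOKEN.finditer(selector)]

    def match(name, attrs):
        classes = attrs.get('class', '')
        if isinstance(classes, str):
            classes = classes.split()
        for kind, value in tokens:
            if kind == '' and name != value:
                return False  # tag name mismatch
            if kind == '.' and value not in classes:
                return False  # missing class
            if kind == '#' and attrs.get('id') != value:
                return False  # wrong id
        return True

    return match

match = make_matcher('li.item')
print(match('li', {'class': 'item selected'}))  # True
print(match('ul', {}))                          # False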

@facelessuser (Owner)

I will add: this is no small feature, and doing it well will take time, assuming that, after investigation, SoupSieve decides to move forward with this.

This must work reasonably in XML and HTML (not sure yet if we can properly pull off namespaces during this process, but maybe). We may not have context for XHTML, but let's be honest, HTML5 is the way to go these days.

Assuming we get this working reasonably well, we'll have to test it pretty thoroughly. The thing that makes SoupSieve easy for me to maintain is the fact that we do a lot of testing. I imagine such a feature will add a fair amount of maintenance burden, but I don't think the selector support will change much from what we initially define, so if we test it really well, I imagine it won't be too bad.

This is why I am also leaning more towards a separate matcher than hacking up the parser. I think it will be much easier to maintain a dumbed-down attribute-and-name matcher than to maintain or augment the parser.

facelessuser added a commit that referenced this issue Mar 23, 2021
@facelessuser (Owner)

@Yomguithereal, I've put together a very, very, very rough proof of concept for strain on https://github.com/facelessuser/soupsieve/tree/feature/experimental-strain.

Given this example:

from bs4 import BeautifulSoup as BS
from bs4 import SoupStrainer
import soupsieve as sv

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

p = SoupStrainer(sv.compile(':is(#link1, a#link2)').strain)

print(BS(html_doc, "html.parser", parse_only=p).prettify())

I get this result:

<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>

No mucking with css_parser.py.

It "should" reject a tag that doesn't match unsupported selectors. Will it handle XML properly? Nope. Does this guarantee this feature will make it in? Nope. But hey, it's a possible start 🙂.

@facelessuser (Owner)

As we may not be able to gather context on whether what we are straining is XML, HTML, or even XHTML, we can maybe do that with flags: STRAIN_HTML and STRAIN_XML. I don't know whether, for the limited subset that we'd parse, we need to know about XHTML or not, but if we did, I suppose you could pass in both to tell the strainer it is XHTML.
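Hypothetical usage of such flags (STRAIN_XML is the flag proposed above, and .strain only exists on the experimental branch; neither is released soupsieve API):

import soupsieve as sv

# Purely illustrative; do not expect this to run against a released version.
pattern = sv.compile('foo|bar', namespaces={'foo': 'http://example.com/foo'},
                     flags=sv.STRAIN_XML)
strainer_callback = pattern.strain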

@Yomguithereal (Author) commented Mar 23, 2021

Hello again @facelessuser. Like you, I experimented a bit with the idea and created a function here: https://github.com/medialab/minet/blob/master/minet/scrape/straining.py

It is being unit tested here: https://github.com/medialab/minet/blob/master/test/scrape_test.py#L594

It handles tag names, ids, classes, attributes, and commas, and can ignore the relations to only work on the top-level selection in the path. I am sure there are a lot of other cases I did not think about, like :is, as it seems :)

I will read the commits on your branch now.

@facelessuser (Owner)

Cool, I'd be careful with the CSSParser stuff though. I do not guarantee behavior or API on that class. That is not public, and I won't hesitate to break it if I need to 🙃.

@Yomguithereal (Author) commented Mar 23, 2021

Regarding my ignore_relations kwarg, the rationale is the following: when declaring scrapers using my DSL, users will naturally indicate which top-level selectors they are interested in, and I can leverage this to automatically strain the documents by keeping only the first selected item of the CSS path.

Cool, I'd be careful with the CSSParser stuff though. I do not guarantee behavior or API on that class. That is not public, and I won't hesitate to break it if I need to 🙃.

Sure, I don't expect you to guarantee this, I am trespassing into the lib's internals here after all :)

@facelessuser (Owner)

Regarding my ignore_relations kwarg, the rationale is the following: when declaring scrapers using my DSL, users will naturally indicate which top-level selectors they are interested in, and I can leverage this to automatically strain the documents by keeping only the first selected item of the CSS path.

I guess I'm not sure I understand. Maybe an example would help?

@Yomguithereal (Author) commented Mar 23, 2021

Here is an example of the desired behavior being tested: https://github.com/medialab/minet/blob/master/test/scrape_test.py#L652-L657

For a real-life example, here is how one would write a scraper for the EchoJS website:

---
iterator: article
fields:
    title:
        sel: h2 > a
    url:
        sel: h2 > a
        attr: href
    author:
        sel: username > a
    up:
        sel: .upvotes
    down:
        sel: .downvotes

The "toppest" level CSS selector is this iterator: article part. By analyzing this scraper, I can infer that the document can be strained to target this CSS selection since the user will never access parts which are not contained within the article tags.

But if the person wrote #newslist article instead, then I cannot infer the same thing and need to strain on #newslist instead, it being the first step of this CSS selector path. People can still give me the strainer selection themselves for better performance if they know how to, but in this case I cannot infer it for them further than straining #newslist.
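The truncation I have in mind is roughly this (a naive sketch; real CSS tokenization would have to handle combinators nested inside :is() and friends):

import re

def first_step(selector):
    # Keep only the leading compound selector, dropping everything after
    # the first combinator (descendant, >, + or ~).
    return re.split(r'\s*[>+~]\s*|\s+', selector.strip(), maxsplit=1)[0]

print(first_step('#newslist article'))  # '#newslist'
print(first_step('ul > li.item'))       # 'ul'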

@facelessuser (Owner)

Yeah, such behavior may be too "magic" for a SoupSieve strainer. This adds a number of rules which are out of scope for what we would be trying to provide. In general, we are just sticking to CSS selectors and their intentions; we try not to infer anything extra. There are probably other ways you could let the user define context like that 🤷🏻.

Assuming you still find SoupSieve providing a strainer useful, it would simply be a limited CSS selector subset based on what we can parse with the context we have. We would support as many selectors as are practical.

Do let me know though what direction you plan to take. If you end up thinking your homegrown way is the way you wish to go long term, and that what we'd provide would not fulfill your needs, that is fine. It will help me to prioritize my time with such a feature. If one person is asking for it but has no intention to use it, it is definitely lower on my priority 🙂.

@Yomguithereal (Author)

Yeah, such behavior may be too "magic" for a SoupSieve strainer.

Yes, I can see why this feels too "magic". But if I can still access your CSSParser, I can sort of "truncate" the CSS selector before giving it to a potentially lib-sanctioned strainer utility.

Do let me know though what direction you plan to take. If you end up thinking your homegrown way is the way you wish to go long term, and that what we'd provide would not fulfill your needs, that is fine. It will help me to prioritize my time with such a feature. If one person is asking for it but has no intention to use it, it is definitely lower on my priority 🙂.

I will definitely test things using my homegrown code in the near future, to see if this is all worth the hassle performance-wise. And seeing that I can work satisfyingly with the current state of the library, and since I am currently the only person asking for this kind of feature, I would say it should indeed remain low on your priority list, except of course if you have an intellectual or practical interest in implementing this for your own benefit :) My goal was not to give you more work maintaining the lib, but rather to offer to contribute to your work and help with the lib's maintenance subsequently, if it turns out this feature can be useful to others. The fact that it is slightly divorced from CSS's complete scope does not make that easy to assess yet, of course.

@facelessuser (Owner)

One thing to note: we cannot check namespaces, because SoupStrainer actually only passes the name and not the prefix. If they passed the original markup for the tag or attribute value, we'd have no trouble, but they don't. The only way to pull this off would be to subclass SoupStrainer and make sure it gave us the prefix. We would either have to accept no access to namespaces or force it with our own strainer class.

@Yomguithereal (Author)

Or we could somehow contribute to bs4 by making a strainer that has namespace information (which it can access in the scope that actually calls the function, here: https://github.com/wention/BeautifulSoup4/blob/03a2b3a9d1fc5877212d9d382a512663f24c887d/bs4/__init__.py#L338), but this seems strenuous, and I don't even know how to contribute to bs4 since it is not on GitHub, etc. :)

@facelessuser (Owner)

I've contributed before. It is awkward to learn enough about Bazaar to get your stuff up in a pull request, but I know enough now. I would have to think about how I could present the changes in a backwards-compatible way.

I honestly think I have most of what I need to pull off a strainer, but I'd like to explore a bit more with namespaces as I feel that's a needed thing for straining XML. The strainer class itself handles them, they just don't pass them to the callback.

@facelessuser (Owner)

I dug in, and the truth is that the treebuilder never has the context for namespaces. And the only way to get the prefixes (without the namespace) would be to subclass the SoupStrainer, but even then, it is not nearly as useful. But, when you pass a strainer as a find_all/find parameter, you get access to the full tag and can then do namespaces. So a little adjustment, and we can handle both cases:

from bs4 import BeautifulSoup as BS
from bs4 import SoupStrainer
import soupsieve as sv

MARKUP = """
<?xml version="1.0" encoding="UTF-8"?>
<tag id="root">
  <head id="0"></head>
  <foo:other id="1" xmlns:foo="http://me.com/namespaces/foofoo"
         xmlns:bar="http://me.com/namespaces/foobar">
  <foo:head id="2">
    <foo:title id="3"></foo:title>
    <bar:title id="4"></bar:title>
  </foo:head>
  <body id="5">
    <foo:e1 id="6"></foo:e1>
    <bar:e1 id="7"></bar:e1>
    <e1 id="8"></e1>
    <foo:e2 id="9"></foo:e2>
    <bar:e2 id="10"></bar:e2>
    <e2 id="11"></e2>
    <foo:e3 id="12"></foo:e3>
    <bar:e3 id="13"></bar:e3>
    <e3 id="14"></e3>
  </body>
  </foo:other>
  <other id="15" xmlns="http://me.com/namespaces/other">
    <e4 id="16">Inherit</er>
  </other>
</tag>
"""

# Tree builder
print('========== Builder ==========')
p = SoupStrainer(sv.strainer('*:not(tag, head, body, other)', 'xml'))
print(BS(MARKUP, "lxml-xml", parse_only=p).prettify())

print('========== Find ==========')
# Normal find/find_all support
namespaces={
    "foo": "http://me.com/namespaces/foofoo",
    "bar": "http://me.com/namespaces/foobar"
}
p = SoupStrainer(sv.strainer('body > foo|*:is(e1, e3)', namespaces=namespaces))
soup = BS(MARKUP, "lxml-xml")
print(soup.find_all(p))

Output:

========== Builder ==========
<?xml version="1.0" encoding="utf-8"?>
<foo:title id="3"/>
<bar:title id="4"/>
<foo:e1 id="6"/>
<bar:e1 id="7"/>
<e1 id="8"/>
<foo:e2 id="9"/>
<bar:e2 id="10"/>
<e2 id="11"/>
<foo:e3 id="12"/>
<bar:e3 id="13"/>
<e3 id="14"/>
<e4 id="16">
 Inherit
</e4>
========== Find ==========
[<foo:e1 id="6"/>, <foo:e3 id="12"/>]

All of this is working and available on the aforementioned branch. Testing for the tree builder has not been done yet, but I think the code is complete.

Whether we pull this in or not to the main branch is a question, but I at least had fun putting this together 🙂.

@facelessuser (Owner)

Technically, attributes get the prefix, as the name is passed unaltered, but tags don't. Tag namespaces are never passed to the SoupStrainer class during building, so there is no way to get them in the builder.

Not sure if I should strip the prefixes in XML attributes for consistency or not 🤷🏻.

@facelessuser (Owner) commented Mar 24, 2021

I will say that unless you are combining selectors with something else in a SoupStrainer, using select or select_one makes way more sense. I'm still not sure if we should have the strainer behave differently in find_all/find or if it should just be consistent. I think that "magic" may be surprising, so if we included this, we'd have to decide whether we keep the limited CSS support in both use cases or if it intelligently uses the best route.

@facelessuser (Owner)

It seems that SoupStrainer only gives the prefix if the builder doesn't strip the prefix, so SoupSieve won't remove the prefixes either. It's all dependent on the tree builder.

html.parser and lxml will send in tag prefixes, while lxml-xml won't. html5lib doesn't support strainers during the build, so 🤷🏻. So during the build process, we'd just take whatever is given to us and parse it generically. The only exception is if the user has stated they want XML handling, which will treat names with case sensitivity.

@facelessuser facelessuser added the P: maybe Pending approval of low priority request. label Mar 26, 2021
@facelessuser (Owner)

I've marked this issue as a maybe. I think I have a fully working prototype (sans tests), but I would need more feedback/interest to pursue this further. I'll keep the branch alive for now. Assuming more feedback starts rolling in and/or more interest in having this developed directly into soupsieve, we may pick this up again. I guess if I had a strong need for this, that could also influence things.

@Yomguithereal (Author)

Hello @facelessuser, sorry I have been missing lately. I have started to experiment with the concept in real use cases, and I must admit it is somewhat hit-and-miss. Sometimes, for simple cases, it can yield a significant advantage (think a 20-40% boost in performance), and sometimes it is completely negligible. I will do more tests and report here afterwards.

@facelessuser (Owner)

@Yomguithereal I imagine there are scenarios that lend themselves better to this approach than others.

  1. I think how well this works will depend on the size of the HTML snippet and on how deeply nested the content you care about is.

  2. There will be some additional overhead when parsing an HTML document this way. While the end result is smaller and quicker to parse, initially you are doing a lot of extra work to get it to this point. So there is a hump to get over before you see the "payoff", if you will.

  3. If you reuse that HTML snippet for a lot of referencing and have to search it multiple times, I imagine, in the long run, it will be faster.

@Yomguithereal (Author)

One thing I have not benchmarked yet is memory usage. I expect straining to consume less memory, which is also a good argument for straining in some scenarios, even if the perf boost is not that impressive.
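A first-order way to measure that (a sketch; note that tracemalloc only sees Python-level allocations, so lxml's C-side memory won't show up, which is why html.parser gives a fairer comparison):

import tracemalloc
import bs4

def peak_parse_memory(html, strainer=None):
    # Measure peak Python-level memory used while building the tree.
    tracemalloc.start()
    bs4.BeautifulSoup(html, 'html.parser', parse_only=strainer)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Compare: peak_parse_memory(doc) vs peak_parse_memory(doc, bs4.SoupStrainer('a'))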

@facelessuser (Owner) commented Sep 10, 2021

It seems that this issue is now stale. Closing for housekeeping. If this is desired in the future, we can reopen and continue discussing.
