Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the :not selector don't work as expected. #212

Closed
jimages opened this issue Feb 9, 2021 · 16 comments · Fixed by #213
Closed

the :not selector don't work as expected. #212

jimages opened this issue Feb 9, 2021 · 16 comments · Fixed by #213
Labels

Comments

@jimages
Copy link

jimages commented Feb 9, 2021

the minimal code which can reproduce the bug lists below

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = None  # str(b) ->  <html><body><a foo href="http://www.example.com"></a></body></html>
b.select("a:not([foo])")  # -> [<a foo href="http://www.example.com"></a>]

in this case, the tag a shouldn't be selected.

@gir-bot gir-bot added the S: triage Issue needs triage. label Feb 9, 2021
@facelessuser
Copy link
Owner

facelessuser commented Feb 9, 2021

I do not view this as a bug, but as a consequence of doing an undefined action. Basically, you've stuffed nonsense into the attribute. Let me demonstrate what an empty attribute actually looks like and why it behaves as it does:

import bs4
b = bs4.BeautifulSoup("<a foo href=\"http://www.example.com\"></a>") 
print(b.body.a.attrs)
print(b.select("a:not([foo])"))

Notice that when an attribute has no value that it is assigned an empty string, not a None.

{'foo': '', 'href': 'http://www.example.com'}
[]

Now, in your case you've placed essentially something that doesn't belong, a None. While I can see how it may seem intuitive, it does not belong. You've created an undefined state. Just because BS doesn't choke on it, doesn't mean you should be doing that. So, does it work as you intend? Maybe not, but this is an unexpected case.

The fact that BS happens to convert the random value is frankly dumb luck:

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = {'1': 1}
print(b.body.a)
<a foo="{'1': 1}" href="http://www.example.com"></a>

As you can see, BeautifulSoup is a bit too forgiving of what gets shoved into these attributes.


So, now we have a None as the value of the attribute. If we are using the following logic to see if a key is present, we expect that None is returned via get if name is not found in the dictionary.

el.attrs.get(name)

And now you see why what you are doing is throwing a wrench into the machine.

I think it is beyond the scope of SoupSieve to try and anticipate the ways in which a user may abuse the BS structure. We check the elements based on the expected internals. Attribute values are either strings or a list of strings (BS breaks classes up into a list of strings). Anything above and beyond that is undefined.

@facelessuser
Copy link
Owner

I think there is sufficient reasoning to assert there is nothing to fix here. I am aware of no cases that BS ever inserts None as an attribute value itself (though I am open to being proven wrong 🙂). As things stand now, I am flagging this as a wontfix. Maybe there is some case I am overlooking?

@gir-bot remove S: triage
@gir-bot add S: wontfix

@gir-bot gir-bot added S: wontfix The issue will not be fixed for the stated reasons. and removed S: triage Issue needs triage. labels Feb 9, 2021
@facelessuser
Copy link
Owner

Closing the issue for now.

@facelessuser
Copy link
Owner

I wanted to provide one more case that adds validity to our decision not to address this. The official BS documentation also demonstrates removing and checking attributes with get and how it will return None when you remove attributes: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes.

I think by assigning None, to an attribute, you break the user's expectation of the attribute behavior and such assignments should be avoided, regardless of BS's forgiving behavior. As we've shown earlier, BS will take almost anything as an attribute value sometimes, accidentally giving you what you expect and sometimes not at all.

@facelessuser
Copy link
Owner

facelessuser commented Feb 9, 2021

So, I was looking through the BS code. And their search normalizes both search terms and attributes and such through a _normalize_search_value: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1943.

Now it is confusing as to why they would normalize their attribute searches the same way they normalize their attribute values. For instance, a regular expression pattern assigned to an attribute will be pass through as the regex object, but when you output the HTML, it will convert that to a string. So it will never match via SoupStrainer, but it will be output as an attribute value string on export. This is really wrong, and maybe we should submit a Pull request sometime in the future. BS also doesn't handle dicts correctly in matching but outputs them safely as a string.

So, maybe we should at least normalize our attribute values to at least be on par or better than BS.

Since BS does it a little buggy, we'll write our own.

@facelessuser
Copy link
Owner

Not a huge fan of having to handle these weird cases, but I can see BS really gets users accustomed to the idea that they can put anything into attributes and it should still come out reasonable. It turned out that it was quite trivial to do due to the way we wrote things, so it all worked out in the end 🙂.

Pull #213 will fix this issue.

@gir-bot remove S: wontfix
@gir-bot add T: bug

@gir-bot gir-bot added T: bug Bug. and removed S: wontfix The issue will not be fixed for the stated reasons. labels Feb 9, 2021
facelessuser added a commit that referenced this issue Feb 9, 2021
@facelessuser
Copy link
Owner

@jimages Took me a while to come to the conclusion we needed to fix this, but it is now available in the 2.2 release.

@jimages
Copy link
Author

jimages commented Feb 11, 2021

Sorry for the late reply since I am on a long journey. First of all, the reason why I do the undefined action is that the only possible way to get attributes without a value. I will show you an example.

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></>", "html5lib")
b.body.a['foo']= "" # so we get <a foo="" href="http://www.example.com"></a>

as you can see, it only outputs an attribute with a value an empty string. but what I want is a bare attribute without a value, though these are somehow equivalent.

ref: https://stackoverflow.com/questions/49991604/python-beautifulsoup-add-attribute-to-tag-without-value/49991727

In the end, it's my pleasure to see that this bug is fixed. Thank you!

@jimages
Copy link
Author

jimages commented Feb 11, 2021 via email

@facelessuser
Copy link
Owner

facelessuser commented Feb 11, 2021

as you can see, it only outputs an attribute with a value an empty string. but what I want is a bare attribute without a value, though these are somehow equivalent.

Yeah, I would argue this is a bug with BS as BS imports a bare attribute with an empty string, but then you have to trick it into outputting it the same way you imported it.

maybe the None method is intended.

If it was intended, I'd argue it would assign None to attributes itself, but it doesn't. BS tries everything in its power to output a sane HTML output, and that means trying its best to convert to something that doesn't fail, that doesn't mean it is supposed to allow users to do that. I illustrated that earlier with:

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = {'1': 1}
print(b.body.a)

But it'll even take something weird like functions:

import bs4

def func():
    pass

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = func
print(b.body.a)
``

Output:

```html
<a foo="&lt;function func at 0x10aedc160&gt;" href="http://www.example.com"></a>

Just because it is handled, doesn't mean you should ever do something like this.

With that said, out of everything, None, at the very least, seems like a more sane thing to handle than things like functions and dictionaries, but I'd argue, that if this is intentional, BS should also convert <a foo></a> to a['foo']=None, but it doesn't. At least for attributes, on input, it should handle <a foo=''></a> and <a foo></a> the same in the output for these is very different currently.

@facelessuser
Copy link
Owner

I guess you could force empty attributes with something like this:

import bs4
from bs4.formatter import HTMLFormatter

class EmptyAttr(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            if v == '':
                v = None
            yield k, v

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>", 'html5lib')
b.body.a['foo'] = ''
print(b.body.a.encode(formatter=EmptyAttr()))

Output

b'<a href="http://www.example.com" foo></a>'

I may bring up this whole None handling over with the author of BS.

@facelessuser
Copy link
Owner

Testing it out in real CSS, as far as HTML is concerned, foo="" and foo are both equivalent. and [foo] and [foo=""] will match both, so I don't see why BS can't export foo="" as a bare foo by default. I may request that over at BS and see what the author says.

@facelessuser
Copy link
Owner

Here is the issue I created over at BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1915424.

@facelessuser
Copy link
Owner

In general, I guess now None will work, so you can keep doing that. Generally, I don't think people should use None and this should be changed in the HTML formatter to output bare attributes. I think overriding the HTML formatter is probably a cleaner option to do this on output, especially if the author doesn't want to change the output in HTML moving forward (XML should not do this though).

I guess we'll see what he decides. But I guess you have options now.

@jimages
Copy link
Author

jimages commented Feb 12, 2021

If it was intended, I'd argue it would assign None to attributes itself, but it doesn't. BS tries everything in its power to output a sane HTML output, and that means trying its best to convert to something that doesn't fail, that doesn't mean it is supposed to allow users to do that.

Yeah, this may be reasonable. But I still want to argue about the undefined action. It's true that it doesn't assign None to attributes itself nor set similar test cases. if it is an undefined behavior it means BS wouldn't assign None to attributes itself. But if not, it doesn't mean BS should assign. let's see what the author thinks.

@facelessuser
Copy link
Owner

Sounds like he's open to trying this out on the default HTML5 formatter, so that is a start. I'll try to get a pull in there so at least on the HTML5 formatter, empty strings should output as bare attributes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants