Help: is soupsieve case-insensitive? #95

Closed
dimaqq opened this issue Jan 22, 2019 · 11 comments
Labels
S: confirmed Confirmed bug report or approved feature request. T: bug Bug.

@dimaqq

dimaqq commented Jan 22, 2019

In [122]: xml = """<Envelope><Header>...</Header></Envelope>"""

In [123]: s = BeautifulSoup(xml, "xml")

In [124]: s.select("header")
Out[124]: [<Header>...</Header>]

In [125]: s.select("Header")
Out[125]: []

Previously, BeautifulSoup accepted (and, I think, required) case-sensitive tag names in selectors.

Now that BeautifulSoup uses soupsieve, it seems that only lower-case selectors are supported.

I'm really not sure why or if I can change this behaviour.

@dimaqq
Author

dimaqq commented Jan 22, 2019

Might be a duplicate of #87

@facelessuser
Owner

facelessuser commented Jan 22, 2019

@dimaqq, this is not a duplicate of #87. #87 was specifically fixing a bug where in an HTML document, the attribute value of type was not being treated case insensitively. For instance, in HTML, type="submit" and type="SUBMIT" should both be recognized the same. This is a very HTML specific thing. In HTML, type is the only attribute treated this way.


As for your specific issue, you are using XML. XML is a case-sensitive language. If I have <Header> and <header> in an XML document, these two tags are not the same thing. In HTML, they are. If you used the html.parser, lxml (not lxml-xml), or html5lib parser, tag names and attribute names (along with type values) would be treated case-insensitively. Also, if you use xml (which is the same as lxml-xml) but the first tag is in the XHTML namespace, the document will be treated as if it were an HTML document.
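
A minimal sketch of the XML half of this, using the standard library's xml.etree.ElementTree instead of Soup Sieve (so it illustrates XML's own rule, not Soup Sieve's behavior). The document and variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: in XML, <Header> and <header> are distinct elements.
xml = "<Envelope><Header>upper</Header><header>lower</header></Envelope>"
root = ET.fromstring(xml)

# Tag lookups are exact, case-sensitive matches.
upper = [el.text for el in root.findall("Header")]
lower = [el.text for el in root.findall("header")]
missing = root.findall("HEADER")

print(upper)    # ['upper']
print(lower)    # ['lower']
print(missing)  # []
```

Each spelling selects only the element whose name matches it exactly, which is why a case-insensitive match would make the two elements indistinguishable.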

Soup Sieve handles case sensitivity differently for XML and HTML, because the document type requires it. HTML tags and attribute names will be treated with case insensitivity, while in XML they will be treated with case sensitivity. These differences are specifically documented here: https://facelessuser.github.io/soupsieve/api/#api.

Here is the thing: if I make tag recognition case insensitive in XML, how do I select just <header>, which is a very different tag than <Header> in XML?

I personally view the behavior of the old select method as an oversight for XML because it doesn't respect the document's rules.

If people really wanted it, I could add a flag to force case insensitivity in XML documents, but that feels counterintuitive for XML. If it were strongly desired, I might consider it.

@dimaqq
Author

dimaqq commented Jan 23, 2019

Hi @facelessuser, and thank you so much for the detailed explanation.

I've also dug into soupsieve source code and XML vs HTML switches are clearly visible.

Perhaps I didn't describe my problem clearly:

  • I have an XML document
  • It has CamelCase tags
  • I want to query the document using case-sensitive CamelCase selectors
  • I am unable to do that with the new BeautifulSoup and soupsieve
  • (I was able to do that in older BeautifulSoup)

There are two problems with this:

  1. Why is my XML document interpreted, seemingly, as if it were HTML?

  2. If someone parses an HTML document that has a CamelCase Tag, should they not be able to query said document using both a lowercase tag and a CamelCase Tag? In my case, only lowercase appears to work.

I'll set up a MRE repo and post a link in this thread.

@facelessuser
Owner

A simple reproduction example will definitely help. HTML should be case insensitive, meaning select('tag') would select tag, TAG, etc. In XML, it should not.
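
As a rough illustration of that rule, here is a sketch of the intended matching behavior. The helper `tag_matches` and its `is_xml` parameter are hypothetical names invented for this example, not Soup Sieve's actual implementation:

```python
def tag_matches(selector_name: str, element_name: str, is_xml: bool) -> bool:
    """Hypothetical helper: does a selector's tag name match an element's name?"""
    if is_xml:
        # XML: names must match exactly, including case.
        return selector_name == element_name
    # HTML: names are compared case-insensitively.
    return selector_name.lower() == element_name.lower()


names = ["Tag", "tag", "TAG"]

# The selector "tag" against each element name, in HTML mode and in XML mode.
html_hits = [n for n in names if tag_matches("tag", n, is_xml=False)]
xml_hits = [n for n in names if tag_matches("tag", n, is_xml=True)]

print(html_hits)  # ['Tag', 'tag', 'TAG']
print(xml_hits)   # ['tag']
```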

It turns out I did not have a test explicitly testing tag case, but I just added this one locally, and it passed:

    def test_tag_xml(self):
        """Test tag for XML."""

        self.assert_selector(
            """
            <Tag id="1">
            <tag id="2"></tag>
            <TAG id="3"></TAG>
            </Tag>
            """,
            "tag",
            ["2"],
            flags=util.XML
        )

        self.assert_selector(
            """
            <Tag id="1">
            <tag id="2"></tag>
            <TAG id="3"></TAG>
            </Tag>
            """,
            "Tag",
            ["1"],
            flags=util.XML
        )

        self.assert_selector(
            """
            <Tag id="1">
            <tag id="2"></tag>
            <TAG id="3"></TAG>
            </Tag>
            """,
            "TAG",
            ["3"],
            flags=util.XML
        )

When you provide the simple reproduction, please also include what version of soupsieve you are using as well as what version of BeautifulSoup.

@facelessuser
Owner

facelessuser commented Jan 23, 2019

Oh, and do post whether you have lxml installed and what version. I've never attempted to run the XML parser when lxml is not installed. I am assuming you didn't forget to install lxml, but you never know, and I don't know what BS4 does in that case.

@facelessuser
Owner

@dimaqq, I was testing something that wasn't on tip. There was indeed a regression. And since I didn't have a test in place to catch it, I wasn't aware. I have a fix coming.

@dimaqq
Author

dimaqq commented Jan 23, 2019

MRE is at https://github.com/dimaqq/mre-bs4-soupsieve-xml-case

The last test, which uses namespaces=, obviously didn't work with older BeautifulSoup4 because namespaces were not supported in select*().

@facelessuser
Owner

Thanks, I'll put in appropriate tests this time to make sure I don't break this again in the future.

@facelessuser facelessuser added T: bug Bug. selectors S: confirmed Confirmed bug report or approved feature request. labels Jan 23, 2019
@facelessuser facelessuser added this to the 1.7.3 milestone Jan 23, 2019
@facelessuser
Owner

@dimaqq, thanks for the MRE. #96 will fix case related issues.

I've made sure that all XML documents will use case sensitivity for attribute values and tag names. There are tests to prevent future breakage.

I've ensured that CSS-defined prefixes are always treated as case sensitive, even in HTML5, because per the spec they are always case sensitive. There was no test for this either, but now there is.

I'm hoping this fixes all case related issues 🤞.

@facelessuser
Owner

1.7.3 has been released. Hopefully that gets you where you need to be.

@dimaqq
Author

dimaqq commented Jan 23, 2019

Yes, it does, thank you so much!
