Use "encoding" keyword argument for lxml's tostring() #2885

Merged
merged 1 commit into from Sep 6, 2021

eht16
Member

@eht16 eht16 commented Sep 3, 2021

Newer versions of libxml2 (used by lxml) crash in tostring() when no encoding argument is given. Passing "unicode" as the encoding makes tostring() return a Python unicode string directly, so we no longer need to decode it.
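
The following is a minimal sketch of the call pattern described above, not the actual patch. `element_text` is a hypothetical helper, and the sample markup is made up for illustration; the real script feeds `etree.HTML(html)` into `tostring()`. The fallback import assumes that the standard library's `xml.etree.ElementTree.tostring()` accepts the same `method` and `encoding` keywords for this particular call, so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library the generation script uses
except ImportError:
    # Fallback for environments without lxml (assumption: same keywords work).
    import xml.etree.ElementTree as etree


def element_text(markup):
    """Return the text content of a markup fragment as a Python str."""
    root = etree.fromstring(markup)
    # encoding="unicode" makes tostring() return str instead of bytes,
    # so no .decode("utf-8") is needed afterwards.
    return etree.tostring(root, method="text", encoding="unicode")


# Non-ASCII input, similar in spirit to the '\xe1' from the traceback.
text = element_text("<p>caf\u00e9 <b>au lait</b></p>")
```

Here `text` is already a `str`, which is why the trailing `.decode("utf-8")` can be dropped once the encoding is passed explicitly.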

On Debian Sid where libxml2 2.9.12 is included, the following error occurs without the change:

/usr/bin/python3 ../scripts/gen-api-gtkdoc.py xml -d . -o geany-gtkdoc.h \
		--sci-output geany-sciwrappers-gtkdoc.h
Traceback (most recent call last):
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 460, in <module>
    sys.exit(main(sys.argv))
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 389, in main
    e = DoxyStruct.from_compounddef(n0)
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 321, in from_compounddef
    e.add_member(p)
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 233, in add_member
    proc.process_element(xml.find("detaileddescription"))
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 136, in process_element
    s = self.__process_element(xml)
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 163, in __process_element
    s += self.__process_element(n) + "\n"
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 167, in __process_element
    ss = self.at.cb(n.get("kind"), self.__process_element(n))
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 163, in __process_element
    s += self.__process_element(n) + "\n"
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 170, in __process_element
    s += self.get_program_listing(n)
  File "/build/geany-1.37.1-1+20210903gitb7bd5fa/doc/../scripts/gen-api-gtkdoc.py", line 126, in get_program_listing
    arr.append("  " + tostring(etree.HTML(html), method="text").decode("utf-8"))
  File "src/lxml/etree.pyx", line 3437, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 103, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 75, in lxml.etree._textToString
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 130970: ordinal not in range(128)

I'm not completely sure why this happens with libxml2 2.9.12 (2.9.10 works fine); the XML contents processed here should be plain ASCII. In any case, setting the encoding explicitly shouldn't hurt.

To reproduce, start a Docker container with a Debian Sid image, e.g. with docker run --rm -it debian:sid, and execute the following within the container:

apt-get update && apt-get install --no-install-recommends -y git intltool libtool build-essential libgtk-3-dev  python3-docutils rst2pdf doxygen python3-lxml nano
git clone https://github.com/geany/geany
cd geany
./autogen.sh
make -C doc

@eht16 eht16 added the build-system Related to the build system(s) label Sep 3, 2021
@eht16 eht16 added this to the 1.38 milestone Sep 3, 2021
@b4n
Member

b4n commented Sep 5, 2021

Does that work with older (yet not so ancient they don't matter) libxml?

@eht16
Member Author

eht16 commented Sep 5, 2021

I just checked again with Debian Stable (Bullseye), Old-Old-Stable (Stretch) and Ubuntu 18.04.

@b4n
Member

b4n commented Sep 6, 2021

Works as well on oldstable (Buster) here, so I guess we're safe enough. And Travis is happy as well.

@b4n b4n merged commit 6d9e24e into geany:master Sep 6, 2021
@eht16
Member Author

eht16 commented Sep 11, 2021

It turned out this change fixed only one symptom of the real underlying problem:
https://gitlab.gnome.org/GNOME/libxml2/-/issues/255 - some changes in libxml2 2.9.12 in combination with lxml broke the lxml.etree.tostring() method which we also use.
This leads to extra content in the extracted XML elements, which is why we suddenly got non-ASCII content in the generated GTK-Doc header (and hence why this change was necessary).
Recent nightly builds failed at generating translation files with:

The following files contain translations and are currently not in use. Please
consider adding these to the POTFILES.in file, located in the po/ directory.

doc/geany-gtkdoc.h

but this was also only a symptom: the generated file now contained more than the content filtered by the generation script.

Luckily, libxml2 got a workaround for this bug and Debian already included this workaround in the latest libxml2 package, so future nightly builds on Debian Unstable should work again.

The change in this PR is good anyway and can be kept, I'd say.

@eht16 eht16 deleted the set_encoding_for_lxml_tostring branch October 9, 2021 11:14