Error on --xmlrevisions: lxml.etree.SerialisationError: IO_ENCODER #363

Closed
nemobis opened this issue Feb 11, 2020 · 3 comments

nemobis commented Feb 11, 2020

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2456, in <module>
    main()
  File "./dumpgenerator.py", line 2446, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 2099, in resumePreviousDump
    config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 723, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 979, in getXMLRevisions
    xml = makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 1050, in makeXmlFromPage
    return etree.tostring(p, pretty_print=True)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: IO_ENCODER
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
@nemobis nemobis added the bug label Feb 11, 2020
@nemobis nemobis added this to the 0.4 milestone Feb 11, 2020
@nemobis nemobis changed the title Error on resume: lxml.etree.SerialisationError: IO_ENCODER Error on --xmlrevisions: lxml.etree.SerialisationError: IO_ENCODER Feb 13, 2020
nemobis commented Feb 13, 2020

Actually, this doesn't have to do with resuming. Example on a small wiki:

983 page titles loaded
https://girlfriend-kari.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
27 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
    Anime
3 more revisions exported
    Battle (バトル)
4 more revisions exported
    Browser Edition
2 more revisions exported
    Clubs (部活)
16 more revisions exported
    Cupid
13 more revisions exported
    Date
3 more revisions exported
    Download
12 more revisions exported
    Event Boss Life
27 more revisions exported
    Events
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2458, in <module>
    main()
  File "./dumpgenerator.py", line 2450, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 2017, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 723, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 976, in getXMLRevisions
    xml = makeXmlFromPage(pages[pageid])
  File "./dumpgenerator.py", line 1052, in makeXmlFromPage
    return etree.tostring(p, pretty_print=True)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: IO_ENCODER
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

nemobis commented Feb 13, 2020

The basic problem here is that we make up the XML out of thin air and don't really respect what the wiki's own XML export declares in its header. However, I wouldn't want to trust everything the wikis say about their encoding, because it's often broken.

So it's probably best to try to fit everything into Python unicode strings, so that at least we can concatenate them and write them to the UTF-8 file.
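A minimal sketch of that direction (not the project's actual fix; the element p and the file name dump.xml are stand-ins for the page tree built by makeXmlFromPage and the real dump file). Passing encoding="unicode" to lxml's tostring() makes it return a Python unicode string instead of bytes, so no byte encoder runs during serialisation, and the encoding to UTF-8 happens exactly once, when writing the file:

# -*- coding: utf-8 -*-
from lxml import etree

# Hypothetical stand-in for the <page> tree built by makeXmlFromPage().
p = etree.Element("page")
etree.SubElement(p, "title").text = u"Battle (バトル)"

# encoding="unicode" tells lxml to return a Python unicode string
# rather than bytes, sidestepping the serialiser's byte encoder:
xml = etree.tostring(p, pretty_print=True, encoding="unicode")

# The string can be concatenated with the rest of the dump and
# encoded once, when appending to the UTF-8 output file:
with open("dump.xml", "ab") as f:
    f.write(xml.encode("utf-8"))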
