Error on --xmlrevisions: lxml.etree.SerialisationError: IO_ENCODER #363

Closed
nemobis opened this issue Feb 11, 2020 · 3 comments

nemobis commented Feb 11, 2020

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2456, in <module>
    main()
  File "./dumpgenerator.py", line 2446, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 2099, in resumePreviousDump
    config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 723, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 979, in getXMLRevisions
    xml = makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 1050, in makeXmlFromPage
    return etree.tostring(p, pretty_print=True)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: IO_ENCODER
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
@nemobis nemobis added the bug label Feb 11, 2020
@nemobis nemobis added this to the 0.4 milestone Feb 11, 2020
@nemobis nemobis changed the title Error on resume: lxml.etree.SerialisationError: IO_ENCODER Error on --xmlrevisions: lxml.etree.SerialisationError: IO_ENCODER Feb 13, 2020
nemobis commented Feb 13, 2020

Actually, this doesn't have to do with resuming. Example on a small wiki:

983 page titles loaded
https://girlfriend-kari.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
27 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
    Anime
3 more revisions exported
    Battle (バトル)
4 more revisions exported
    Browser Edition
2 more revisions exported
    Clubs (部活)
16 more revisions exported
    Cupid
13 more revisions exported
    Date
3 more revisions exported
    Download
12 more revisions exported
    Event Boss Life
27 more revisions exported
    Events
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2458, in <module>
    main()
  File "./dumpgenerator.py", line 2450, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 2017, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 723, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 976, in getXMLRevisions
    xml = makeXmlFromPage(pages[pageid])
  File "./dumpgenerator.py", line 1052, in makeXmlFromPage
    return etree.tostring(p, pretty_print=True)
  File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: IO_ENCODER
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

nemobis commented Feb 13, 2020

The basic problem here is that we make up the XML out of thin air and don't really respect what the wiki's own XML export declares in its header. However, I wouldn't want to trust everything the wikis say about their encoding, because it's often broken.

So it's probably best to try to fit everything into Python unicode strings, so that at least we can concatenate them and write them to the UTF-8 file.
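A minimal sketch of that direction (not the project's actual fix; the element p and the file name dump.xml are stand-ins for the page tree built by makeXmlFromPage and the real dump file). Passing encoding="unicode" to lxml's tostring() makes it return a Python unicode string instead of bytes, so no byte encoder runs during serialisation, and the encoding to UTF-8 happens exactly once, when writing the file:

# -*- coding: utf-8 -*-
from lxml import etree

# Hypothetical stand-in for the <page> tree built by makeXmlFromPage().
p = etree.Element("page")
etree.SubElement(p, "title").text = u"Battle (バトル)"

# encoding="unicode" tells lxml to return a Python unicode string
# rather than bytes, sidestepping the serialiser's byte encoder:
xml = etree.tostring(p, pretty_print=True, encoding="unicode")

# The string can be concatenated with the rest of the dump and
# encoded once, when appending to the UTF-8 output file:
with open("dump.xml", "ab") as f:
    f.write(xml.encode("utf-8"))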
