API-only export option (without Special:Export) #311

Open
nemobis opened this issue May 7, 2018 · 35 comments

nemobis commented May 7, 2018

Wanted for various reasons. Current implementation: --xmlrevisions, off by default. If the default method of downloading wikis doesn't work for you, please try the --xmlrevisions flag and let us know how it goes.
https://groups.google.com/forum/#!topic/wikiteam-discuss/ba2K-WeRJ-0
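For reference, the gist of the API-only approach is paging through the revisions with the API itself. A minimal sketch, assuming a MediaWiki 1.27+ wiki with list=allrevisions available (the endpoint URL and arvprop fields below are illustrative, not the actual dumpgenerator.py code):

# Minimal sketch, not the dumpgenerator.py implementation: walk every revision
# through list=allrevisions (MediaWiki 1.27+) using standard API continuation.
import requests

API = "https://wiki.example.org/w/api.php"  # hypothetical endpoint

def iter_allrevisions(session, api_url):
    params = {
        "action": "query",
        "list": "allrevisions",
        "arvlimit": 50,                                   # 50 is the anonymous maximum
        "arvprop": "ids|timestamp|user|comment|content",  # illustrative field choice
        "format": "json",
    }
    while True:
        data = session.get(api_url, params=params).json()
        # Each entry is a page dict carrying a "revisions" list.
        for page in data.get("query", {}).get("allrevisions", []):
            yield page
        if "continue" not in data:
            break
        params.update(data["continue"])  # e.g. arvcontinue

session = requests.Session()
for page in iter_allrevisions(session, API):
    print("%s: %d revisions in this batch" % (page["title"], len(page["revisions"])))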

Previous takes:
#195
#280

@nemobis nemobis added this to the 0.4 milestone May 7, 2018

nemobis commented May 7, 2018

Does not yet work for Wikia, partly because they return a blank page for the exportnowrap request used in getXMLHeader(). Do we have to use wikitools there as well?

  File "./dumpgenerator.py", line 2195, in <module>
    main()
  File "./dumpgenerator.py", line 2187, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1756, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 717, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 792, in getXMLRevisions
    for page in result['query']['allrevisions']:
KeyError: 'query'
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
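A hypothetical defensive wrapper for this failure mode (not the committed fix): check what the API actually returned before indexing into result['query']['allrevisions'].

# Hypothetical defensive wrapper, not the committed fix: report what the API
# actually returned instead of crashing with a bare KeyError on 'query'.
import requests

def get_allrevisions_batch(api_url, params):
    result = requests.get(api_url, params=params).json()
    if "error" in result:
        raise RuntimeError("API error: %s" % result["error"].get("info", "unknown"))
    if "query" not in result or "allrevisions" not in result["query"]:
        raise RuntimeError("Unexpected API response keys: %r" % list(result))
    return result["query"]["allrevisions"]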


nemobis commented May 9, 2018

Before even downloading the first revisions, on some wikis the export gets stuck in an endless loop of "Invalid JSON, trying request again" or similar messages:

Analysing http://www.haplozone.net/wiki/index.php
Trying generating a new dump into a new directory...
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
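One way to avoid the endless loop would be to cap the retries, along these lines (a sketch, not what the script currently does):

# Sketch of a retry cap, not the current behaviour: give up after a few
# invalid JSON responses instead of looping forever.
import time
import requests

def get_json_with_retries(session, url, params, max_retries=5):
    for attempt in range(max_retries):
        r = session.get(url, params=params)
        try:
            return r.json()
        except ValueError:
            print("Invalid JSON, trying request again")
            time.sleep(5 * (attempt + 1))
    raise RuntimeError("Giving up on %s after %d invalid JSON responses" % (url, max_retries))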


nemobis commented May 18, 2018

For Wikia, the API export works without exportnowrap: http://00eggsontoast00.wikia.com/api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&export&format=json

But, facepalm: where the API help says "Export the current revisions of all given or generated pages", it really means that any revision other than the current one is ignored: http://00eggsontoast00.wikia.com/api.php?action=query&revids=3|80|85&export returns the same as http://00eggsontoast00.wikia.com/api.php?action=query&revids=85&export
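So for Wikia the full history has to be fetched per page with prop=revisions and the XML assembled on our side. A rough sketch of such a query, with illustrative parameters (not necessarily the exact ones dumpgenerator.py ends up using):

# Rough sketch: fetch a page's full history via prop=revisions instead of
# &export (which only ever returns the current revision). Parameters are
# illustrative.
import requests

def iter_page_revisions(api_url, title, session=None):
    session = session or requests.Session()
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 50,
        "rvdir": "newer",                                 # oldest revision first
        "rvprop": "ids|timestamp|user|comment|content",
        "format": "json",
    }
    while True:
        data = session.get(api_url, params=params).json()
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                yield rev
        if "continue" not in data:
            break
        params.update(data["continue"])                   # rvcontinue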


nemobis commented May 19, 2018

Here we go: 7143f7e

It's very fast on most wikis, because it makes far fewer requests when the average number of revisions per page is below 50.
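Back-of-the-envelope numbers behind that claim (made up purely for illustration):

# Made-up numbers: allrevisions packs up to 50 revisions per request across
# page boundaries, while the per-page method needs at least one request per page.
pages = 10000
revisions = 30000                                    # ~3 revisions per page on average
arvlimit = 50

allrevisions_requests = -(-revisions // arvlimit)    # ceiling division: 600 requests
per_page_requests = pages                            # at least 10000 requests
print(allrevisions_requests, per_page_requests)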

The first dump produced with this method is: https://archive.org/download/wiki-ferstaberindecom_f2_en/ferstaberindecom_f2_en-20180519-history.xml.7z


nemobis commented May 19, 2018

And now also Wikia, without the allrevisions module: https://archive.org/details/wiki-00eggsontoast00wikiacom

The XML built "manually" with --xmlrevisions is almost the same as usual (at the cost of making at least one request per page), but it's missing parentid and, for the moment, the minor-edit flag.
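For the record, building that XML boils down to something like the sketch below with xml.etree (this is not the project's makeXmlFromPage; parentid and the minor flag are emitted only when the API response carries them, which is exactly the gap noted above):

# Rough sketch, not the project's makeXmlFromPage: turn an API "page" dict
# into <page>/<revision> XML; parentid and <minor/> appear only when present.
from xml.etree import ElementTree as ET

def page_to_xml(page):
    p = ET.Element("page")
    ET.SubElement(p, "title").text = page["title"]
    ET.SubElement(p, "ns").text = str(page["ns"])
    ET.SubElement(p, "id").text = str(page["pageid"])
    for rev in page.get("revisions", []):
        r = ET.SubElement(p, "revision")
        ET.SubElement(r, "id").text = str(rev["revid"])
        if "parentid" in rev:
            ET.SubElement(r, "parentid").text = str(rev["parentid"])
        ET.SubElement(r, "timestamp").text = rev["timestamp"]
        contributor = ET.SubElement(r, "contributor")
        ET.SubElement(contributor, "username").text = rev.get("user", "")
        ET.SubElement(r, "comment").text = rev.get("comment", "")
        if "minor" in rev:
            ET.SubElement(r, "minor")
        ET.SubElement(r, "text").text = rev.get("*", "")  # content is under "*" in JSON
    return ET.tostring(p)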


nemobis commented May 19, 2018

Analysing http://nimiarkisto.fi/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
29 namespaces found
    Retrieving titles in the namespace 0
.Traceback (most recent call last):
  File "./dumpgenerator.py", line 2288, in <module>
    main()
  File "./dumpgenerator.py", line 2280, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1844, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "./dumpgenerator.py", line 416, in getPageTitles
    for title in titles:
  File "./dumpgenerator.py", line 292, in getPageTitlesAPI
    allpages = jsontitles['query']['allpages']
KeyError: 'query'


nemobis commented May 21, 2018

In testing this for Wikia, remember that the number of edits on Special:Statistics isn't always truthful (this is normal on MediaWiki). For instance http://themodifyers.wikia.com/wiki/Special:Statistics says 2333 edits, but dumpgenerator.py exports 1864, and that's the right amount: entering all the titles on themodifyers.wikia.com/wiki/Special:Export and exporting all revisions gives the same amount.

Also, a page with 53 revisions on that wiki was correctly exported, which means that API continuation works; that's something!


nemobis commented May 21, 2018

Not sure what's going on at http://zh.asoiaf.wikia.com/api.php

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2308, in <module>
    main()
  File "./dumpgenerator.py", line 2300, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1864, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "./dumpgenerator.py", line 429, in getPageTitles
    for title in titles:
  File "./dumpgenerator.py", line 252, in getPageTitlesAPI
    config=config, session=session)
TypeError: 'NoneType' object is not iterable
tail: cannot open 'zhasoiafwikiacom-20180521-wikidump/zhasoiafwikiacom-20180521-history.xml' for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

http://zhpad.wikia.com/api.php seems to eventually fail as well


nemobis commented May 22, 2018

Next step: implementing resuming. I'll probably take the readTitles() part out of getXMLRevisions() to make things clearer.

I think this would be a good occasion to make sure that we log something to errors.log when we catch an exception or call sys.exit(1), so that it's easier to inspect failed dumps and see what happened when they stopped. I have almost 4k interrupted Wikia dumps.
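A minimal sketch of that logging idea, assuming errors.log lives in the dump directory:

# Hypothetical helper, assuming errors.log sits in the dump directory: record
# the reason before exiting so interrupted dumps can be inspected later.
import datetime
import sys

def logerror(dump_path, text):
    with open(dump_path + "/errors.log", "a") as outfile:
        outfile.write("%s: %s\n" % (datetime.datetime.utcnow().isoformat(), text))

def abort(dump_path, reason):
    logerror(dump_path, reason)
    sys.exit(1)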


nemobis commented May 25, 2018

Later I'll post a series of errors.log from failed dumps.

For now I tend to believe that, when the dump runs to the end, the XML really is as complete as possible. For instance, on a biggish wiki like http://finalfantasy.wikia.com/wiki/Special:Statistics :

$ grep -c "<revision>" finalfantasywikiacom-20180523-history.xml
1638424
$ grep -c "<page>" finalfantasywikiacom-20180523-history.xml
311259

That's over a million "missing" revisions compared to what Special:Statistics says, which however cannot really be trusted. The number of pages is pretty close.

On the other hand, it could be that the continuation is not working in some cases... In clubpenguinwikiacom-20180523-history.xml, I'm not sure I see the 3200 revisions that the main page ought to have.

nemobis added a commit that referenced this issue May 25, 2018
Otherwise the query continuation may fail and only the top revisions
will be exported. Tested with Wikia:
http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki

Also add parentid since it's available after all.

#311 (comment)

nemobis commented May 27, 2018

Some wiki might be in a loop...

1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1061 more revisions exported
1062 more revisions exported
1061 more revisions exported
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1062 more revisions exported

Or not: it seems legit, some bot is editing a series of pages every day. http://runescape.wikia.com/wiki/Module:Exchange/Dragon_crossbow_(u)/Data?limit=1000&action=history


nemobis commented Feb 8, 2020

Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 5 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 10 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 15 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 20 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 25 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 30 seconds

nemobis added a commit to nemobis/wikiteam that referenced this issue Feb 8, 2020
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem to not like it at all, see also issue WikiTeam#311.

nemobis commented Feb 9, 2020

Sometimes allpages works until it doesn't:

Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Warning!: "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump" path exists
There is a dump in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? n
You have selected: NO
Trying to use path "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump-2"...
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
16 namespaces found
    Retrieving titles in the namespace 0
..    602 titles retrieved in the namespace 0
    Retrieving titles in the namespace 1
.    1 titles retrieved in the namespace 1
    Retrieving titles in the namespace 2
.    3 titles retrieved in the namespace 2
    Retrieving titles in the namespace 3
.    3 titles retrieved in the namespace 3
    Retrieving titles in the namespace 4
.The allpages API returned nothing. Exit.


nemobis commented Feb 9, 2020

How nice some webservers are:

Titles saved at... halachipediacom-20200209-titles.txt
2364 page titles loaded
http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 503: Service Unavailable trying request again in 5 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 10 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 15 seconds
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 20 seconds
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 25 seconds


nemobis commented Feb 9, 2020

Gotta check for actual presence of the export field in the response:

Titles saved at... aroundisleofwightinfo-20200209-titles.txt
3230 page titles loaded
http://www.aroundisleofwight.info/api.php
Getting the XML header from the API
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2323, in <module>
    main()
  File "./dumpgenerator.py", line 2315, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1882, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 731, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "./dumpgenerator.py", line 471, in getXMLHeader
    xml = r.json()['query']['export']['*']
KeyError: 'export'
tail: cannot open ‘aroundisleofwightinfo-20200208-wikidump/aroundisleofwightinfo-20200208-history.xml’ for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
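A hypothetical guard for getXMLHeader(), not the committed fix: bail out gracefully when the response carries no query/export.

# Hypothetical guard, not the committed fix: check for query/export before
# indexing, and let the caller fall back to Special:Export or a canned header.
import requests

def get_xml_header(api_url, session=None):
    session = session or requests.Session()
    r = session.get(api_url, params={
        "action": "query",
        "export": 1,
        "titles": "Main Page",   # illustrative; any existing title would do
        "format": "json",
    })
    try:
        return r.json()["query"]["export"]["*"]
    except (ValueError, KeyError):
        return None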


nemobis commented Feb 9, 2020

HTTP 405:

Titles saved at... wikiainigmaeu-20200209-titles.txt
139 page titles loaded
http://wiki.ainigma.eu/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 10 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 15 seconds


nemobis commented Feb 9, 2020

Or even for the query field:

Titles saved at... masu6fsk-20200209-titles.txt
247 page titles loaded
http://masu.6f.sk/api.php
Getting the XML header from the API
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2323, in <module>
    main()
  File "./dumpgenerator.py", line 2315, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1882, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 731, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
  File "./dumpgenerator.py", line 471, in getXMLHeader
    xml = r.json()['query']['export']['*']
KeyError: 'query'


nemobis commented Feb 10, 2020

HTTP Error 493 :o

Titles saved at... opendiagnostixorg-20200210-titles.txt
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
16 namespaces found
Trying to export all revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
/home/federico/.local/lib/python2.7/site-packages/wikitools/api.py:155: FutureWarning: The querycontinue option is deprecated and will be removed
in a future release, use the new queryGen function instead
for queries requring multiple requests
  for queries requring multiple requests""", FutureWarning)
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
3 more revisions exported
1 more revisions exported
1 more revisions exported
4 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
5 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
HTTPError: HTTP Error 493: Forbidden WAF trying request again in 5 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 10 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 15 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 20 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 25 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 30 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 35 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 40 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 45 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 50 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 55 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 60 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 65 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 70 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 75 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 80 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 85 seconds


nemobis commented Feb 10, 2020

I'm not quite sure why this happens in my latest local code, will need to check:

  <page>
    <title>Main Page</title>
    <ns>0</ns>
    <id>1</id>
    <redirect title="Main page" />
    <revision>
      <id>3677</id>
      <parentid>1</parentid>
      <timestamp>2018-12-19T22:15:31Z</timestamp>
      <contributor>
        <username>Wiki-admin</username>
        <id>45</id>
      </contributor>
      <comment>Redirected page to [[Main page]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="23">#REDIRECT [[Main_page]]</text>
      <sha1>o2jw5c565achwt31azfnu9zc2zxgqpr</sha1>
    </revision>
  </page>
<page>
  <title>Main Page</title>
  <ns>0</ns>
  <id>1</id>
  <revision>
    <id>3677</id>
    <parentid>1</parentid>
    <timestamp>2018-12-19T22:15:31Z</timestamp>
    <contributor>
      <id>45</id>
      <username>Wiki-admin</username>
    </contributor>
    <comment>Redirected page to [[Main page]]</comment>
    <text bytes="23" space="preserve">#REDIRECT [[Main_page]]</text>
    <model>wikitext</model>
    <sha1>ce111c28c158bacd1ad89fbacb33e48d0e2e383f</sha1>
  </revision>
  <revision>
    <id>1</id>
    <parentid>0</parentid>
    <timestamp>2018-12-13T21:14:03Z</timestamp>
    <contributor>
      <id>0</id>
      <username>MediaWiki default</username>
    </contributor>
    <comment></comment>
    <text bytes="735" space="preserve">&lt;strong&gt;MediaWiki has been installed.&lt;/strong&gt;

Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.

== Getting started ==
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
* [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]</text>
    <model>wikitext</model>
    <sha1>5702e4d5fd9173246331a889294caf01a3ad3706</sha1>
  </revision>
</page>


nemobis commented Feb 10, 2020

28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2363, in <module>
    main()
  File "./dumpgenerator.py", line 2355, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1922, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 814, in getXMLRevisions
    site = mwclient.Site(apiurl.netloc, apiurl.path.replace("api.php", ""))
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 131, in __init__
    self.site_init()
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 153, in site_init
    retry_on_error=False)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 235, in get
    return self.api(action, 'GET', *args, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
    http_method=http_method)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
    stream = self.connection.request(http_method, url, **args)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 314, in _raise_timeout
    if 'timed out' in str(err) or 'did not complete (read)' in str(err):  # Python 2.6
TypeError: __str__ returned non-string (type SysCallError)
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.


nemobis commented Feb 10, 2020

mwclient doesn't seem to handle retries very well, need to check:

Traceback (most recent call last):
  File "dumpgenerator.py", line 2375, in <module>
    
  File "dumpgenerator.py", line 2367, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1934, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 875, in getXMLRevisions
    exportrequest = site.api(**exportparams)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
    info = self.raw_api(action, http_method, **kwargs)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
    http_method=http_method)
  File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
    stream = self.connection.request(http_method, url, **args)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gobblerpedia.org', port=443): Read timed out. (read timeout=30)


nemobis commented Feb 10, 2020

Seems fine now on a MediaWiki 1.16 wiki. There are some differences in what we get for some optional fields like parentid, userid, size of a revision; and our XML made by etree is less eager to escape Unicode characters. Hopefully doesn't matter, although we should ideally test an import on a recent MediaWiki.
wikirabenthalnet-20200210-history-test.zip


nemobis commented Feb 10, 2020

> HTTP Error 493 :o

This comes and goes, could try adding to status_forcelist together with 406 seen for other wikis.
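A sketch of the status_forcelist idea with requests/urllib3 (which codes to include is exactly the open question):

# Sketch only: let urllib3's Retry back off and retry on the flaky status
# codes seen in this thread, instead of a hand-rolled loop.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=5, status_forcelist=[406, 493, 502, 503])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))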

> http://masu.6f.sk/api.php

Here we can do little, the index.php and api.php responses confuse the script but indeed there isn't much we can do as even the most basic response gets a DB error:

internal_api_error_DBQueryError
http://masu.6f.sk/api.php?action=query&meta=siteinfo&siprop=general

> HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds

This is not helped by setting http_method="GET" (https://mwclient.readthedocs.io/en/latest/reference/site.html#mwclient.client.Site.api ). It's a MediaWiki 1.21.1 wiki so allrevisions is not available, but the HTTPError prevented the exception from making us switch to the next strategy. Once we catch that, it works via GET: 49017e3 . Ideally we'd need to check this only once at the beginning, but it seems that the webservers do not want to afford us this luxury.
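The fallback amounts to something like this (host and path are illustrative; the exact exception to catch depends on the mwclient version in use):

# Illustrative sketch of the GET fallback; host/path are made up and the
# exception type to catch depends on the mwclient/requests versions in use.
import mwclient

site = mwclient.Site("wiki.example.org", path="/")

try:
    siteinfo = site.api("query", http_method="POST", meta="siteinfo")
except Exception:
    # Some webservers reject POST to api.php (HTTP 405 Method Not Allowed).
    siteinfo = site.api("query", http_method="GET", meta="siteinfo")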

> http://www.aroundisleofwight.info/api.php

This is a misconfigured wiki, see #355 (comment)

> http://www.halachipedia.com/api.php
> Getting the XML header from the API
> Retrieving the XML for every page from the beginning
> Invalid JSON, trying request again

This one now (MediaWiki 1.31.1) gives:

http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
20 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
This mwclient version seems not to work for us. Exiting.

> Sometimes allpages works until it doesn't:
>
> Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Still broken (MediaWiki 1.23)

> Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Still broken (MediaWiki 1.27).

> Analysing http://nimiarkisto.fi/w/api.php

Still broken (MediaWiki 1.31)


nemobis commented Feb 13, 2020

The number of revisions cannot always be a multiple of 50 (example from https://villainsrpg.fandom.com/ ):

    Eve Man
4 more revisions exported
    Event Horizon
50 more revisions exported
50 more revisions exported
    Evil
50 more revisions exported
50 more revisions exported
    Existence (Secret)
9 more revisions exported
Downloaded 400 pages
    Extinction
10 more revisions exported

It should be 51 in https://villainsrpg.fandom.com/wiki/Evil?offset=20111224190533&action=history
We're getting 49 revisions again and then the 1 we were missing. Not a big deal but not ideal either.

Ouch no, we were not using the new batch at all. Ahem.


nemobis commented Feb 13, 2020

The XML doesn't validate against the respective schema:

$ xmllint --schema ../export-0.10.xsd --noout girlfriend_karifandomcom-20200213-history.xml
...
girlfriend_karifandomcom-20200213-history.xml:76504: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
girlfriend_karifandomcom-20200213-history.xml fails to validate

But then even the vanilla Special:Export output doesn't. Makes me sad.

$ xmllint --schema export-0.10.xsd --noout /tmp/Girlfriend+Kari+Wiki-20200213070422.xml
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml:52: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml fails to validate
$ xmllint --version
xmllint: using libxml version 20909
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma


nemobis commented Feb 13, 2020

> http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php

Fine now

> Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)

Fixed with API limit 50 at b162e7b

> Analysing http://nimiarkisto.fi/w/api.php

Fixed with automatic switch to HTTPS at d543f7d


nemobis commented Feb 14, 2020

Still have to implement resume:

Analysing https://gundam.fandom.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "File:G Saviour Bugu2 rear view.JPG"
https://gundam.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
40 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
    "Kurenai Musha" Red Warrior Amazing
        1 more revisions exported
    ...So We Meet Again
        5 more revisions exported
    0-Riser
        3 more revisions exported

It should just be a matter of passing start to getXMLRevisions() in generateXMLDump().
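A minimal sketch of that missing piece, assuming start is the last title already written to the XML:

# Hypothetical sketch: once generateXMLDump() passes "start" down,
# getXMLRevisions() can skip titles until the resume point is reached.
def skip_until(titles, start=None):
    skipping = start is not None
    for title in titles:
        if skipping:
            if title == start:
                skipping = False   # resume from the title after "start"
            continue
        yield title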


nemobis commented Feb 14, 2020

I'm happy to see that we sometimes receive fewer than the requested 50 revisions and nothing bad happens:

"This result was truncated because it would otherwise be larger than the limit of 8388608 bytes"

https://tinyvillage.fandom.com/api.php?action=query&prop=revisions&rvprop=ids|timestamp|user|userid|size|sha1|contentmodel|comment|content&rvlimit=50&format=json&titles=Fusion_Reports


nemobis commented Feb 15, 2020

> nothing bad happens

Except that they didn't check whether they had revisions bigger than that:
https://pvx.fandom.com/wiki/User_talk:PVX-Misfate?offset=20071116000000&limit=20&action=history


nemobis commented Feb 24, 2020

Hm, I wonder why there are so many errors on this MediaWiki 1.25 wiki (the XML came out half the size of the previous round's): https://archive.org/download/wiki-wikimarionorg/wikimarionorg-20200224-history.xml.7z/errors.log


nemobis commented Mar 2, 2020

        2 more revisions exported
'*'
Traceback (most recent call last):
  File "dumpgenerator.py", line 2528, in <module>
    main()
  File "dumpgenerator.py", line 2518, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 2165, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 727, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "dumpgenerator.py", line 829, in getXMLRevisions
    yield makeXmlFromPage(page)
  File "dumpgenerator.py", line 1083, in makeXmlFromPage
    raise PageMissingError(page['title'], e)
__main__.PageMissingError: page 'DevStack' not found


nemobis commented Mar 7, 2020

http://www.veikkos-archiv.com/api.php fails completely


nemobis commented Mar 7, 2020

A simple command with which I found some XML files that were actually empty (only the header):

find -maxdepth 1 -type f -name "*7z" -size -500k -print0 | xargs -0 -P32 -n1 7z l | grep xml | grep -E " [0-9]{4} " | grep -Ev " [0-9]{5,} "  | grep -Eo "[^ ]+$" | sed 's,.xml$,.xml.7z,g'
