dumpgenerator.py --xmlrevisions creates Error:list index out of range on pokewiki.de #430

Closed
GERZAC1002 opened this issue Apr 12, 2022 · 3 comments

Comments

@GERZAC1002

Full command that was used:

./dumpgenerator.py --xmlrevisions --images --xml --curonly https://pokewiki.de --namespace 0
I ran the command without '--namespace 0' before, with the same result; I only added it to reproduce the error without putting too much stress on the wiki itself.

Expected behaviour:

creating a dump of https://pokewiki.de

Actual behaviour after a few minutes:

Traceback (most recent call last):
 File "./dumpgenerator.py", line 2569, in <module>
   main()
 File "./dumpgenerator.py", line 2561, in main
   createNewDump(config=config, other=other)
 File "./dumpgenerator.py", line 2128, in createNewDump
   generateXMLDump(config=config, titles=titles, session=other['session'])
 File "./dumpgenerator.py", line 741, in generateXMLDump
   for xml in getXMLRevisions(config=config, session=session, start=start):
 File "./dumpgenerator.py", line 877, in getXMLRevisions
   print "        %d more revisions listed, until %s" % (len(revids), revids[-1])
IndexError: list index out of range
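
The last frame of the traceback evaluates `revids[-1]`, which raises IndexError when the list is empty. A minimal sketch of a guard, assuming `revids` is a plain list of revision IDs from one batch; the helper name `describe_batch` is hypothetical and not part of dumpgenerator.py, and the sketch is written for Python 3:

```python
def describe_batch(revids):
    """Return a progress message for one batch of revision IDs.

    Hypothetical helper, not taken from dumpgenerator.py. Indexing
    revids[-1] raises IndexError on an empty list, so the empty case
    is handled before the last element is read.
    """
    if not revids:
        return "        0 more revisions listed"
    return "        %d more revisions listed, until %s" % (len(revids), revids[-1])


print(describe_batch([101, 102, 103]))
print(describe_batch([]))
```

This only avoids the crash in the progress message; it does not explain why the batch came back empty on this wiki.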

Full log:
dumgenerator.py_xmlrevisions.log

Tail of the output file:

{{Karte Designs/Zeile|typ=Farblos|Damythir-V (Time Gazer 059)|illus=aky CG Works|seltenheit=RR|num=1}}
{{Karte Designs/Zeile|typ=Farblos|Damythir-V (Time Gazer 076)|illus=aky CG Works|seltenheit=SR|num=2}}
&lt;/div&gt;

[[en:Wyrdeer V (Time Gazer 59)]]
[[ja:&#12450;&#12516;&#12471;&#12471;V (S10D)]]</text>
      <sha1>ip8lev6wdaqnyxpyw926h46ktlmtoup</sha1>
    </revision>
  </page> 

Quick 'integrity' check on the output file

 grep "<title>" -c *-current.xml ; grep "<page" -c *-current.xml ; grep "</page>" -c *-20220412-current.xml 
2231
2231
2231

Number of page titles inside *-titles.txt: 86796
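
The same check can be sketched in Python, assuming it is enough to count lines containing each marker (as `grep -c` does) and compare the counts with the expected number of titles. The names `count_markers` and `is_complete` are hypothetical:

```python
MARKERS = ("<title>", "<page", "</page>")


def count_markers(lines):
    """Count how many lines contain each marker, like `grep -c`."""
    counts = dict.fromkeys(MARKERS, 0)
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def is_complete(counts, expected_titles):
    """True only if every marker count equals the expected title count."""
    return all(n == expected_titles for n in counts.values())


# Tiny in-memory example; for a real dump, pass an open file object instead.
sample = ["  <page>", "    <title>Main Page</title>", "  </page>"]
counts = count_markers(sample)
print(counts, is_complete(counts, 1))
```

With the numbers reported above (2231 pages written against roughly 86796 listed titles), such a check would flag the dump as incomplete.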

Test without '--xmlrevisions'

./dumpgenerator.py --xmlrevisions --images --xml --curonly https://pokewiki.de --namespace 0
Checking API... https://www.pokewiki.de/api.php
API is OK: https://www.pokewiki.de/api.php
Checking index.php... https://www.pokewiki.de/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://www.pokewiki.de/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 0
Excluding titles from namespaces = None
1 namespaces found
    Retrieving titles in the namespace 0
    86795 titles retrieved in the namespace 0
Titles saved at... pokewikide-20220412-titles.txt
86795 page titles loaded
https://www.pokewiki.de/api.php
HTTP Error 404.
Not found. Is Special:Export enabled for this wiki?
https://www.pokewiki.de/index.php?action=submit&curonly=1&limit=1&pages=Main_Page&title=Special%3AExport

After taking pull request #280 from 2016 and integrating it into a current version (pull request #429), I managed to get a full dump of the wiki mentioned above.

@nemobis
Member

nemobis commented Apr 12, 2022 via email

@GERZAC1002
Author

GERZAC1002 commented Apr 12, 2022

Oh okay, that is understandable considering that the whole dump ended up at over 30 GB. I actually considered asking them for a dump if I hadn't found the alternative.
The alternative to using this tool would have been mirroring the whole site with httrack, which would have had a much bigger overhead: the last time I tried that on a wiki, it tried to download the complete history of every page and had no option to easily exclude namespaces.
Any recommendations on how to put it on the Internet Archive, as it is huge with all the images?
(Even compressed, the folder would still exceed the maximum file size of a FAT32-formatted drive, which sadly is still a common standard, so I don't know how viable it is.)
So I guess once the above question is answered this issue can be closed, as it seems the features were intentionally disabled by the administrators of the wiki.
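
On the FAT32 limit: a single file there can be at most 4 GiB minus one byte, so one workaround is to split the compressed archive into parts. A minimal sketch, assuming the archive is a single file on disk; the function name `split_file` and the `.000`, `.001` suffix scheme are hypothetical choices, and this says nothing about what the Internet Archive itself requires:

```python
import os

FAT32_MAX_FILE_SIZE = 2**32 - 1  # largest single file FAT32 can store, in bytes


def split_file(path, part_size=FAT32_MAX_FILE_SIZE, chunk_size=1024 * 1024):
    """Split `path` into path.000, path.001, ... of at most part_size bytes each.

    Returns the list of part paths in order. The source file is left untouched.
    """
    parts = []
    index = 0
    with open(path, "rb") as src:
        while True:
            part_path = "%s.%03d" % (path, index)
            written = 0
            with open(part_path, "wb") as dst:
                while written < part_size:
                    # Never read past the end of the current part.
                    data = src.read(min(chunk_size, part_size - written))
                    if not data:
                        break
                    dst.write(data)
                    written += len(data)
            if written == 0:
                # Source exhausted: drop the empty trailing part and stop.
                os.remove(part_path)
                break
            parts.append(part_path)
            index += 1
    return parts
```

Concatenating the parts in order reproduces the original file byte for byte.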

EDIT: I found https://archive.org/download/wiki-pokewikide, so is there a way to add the dump that I already have (after I have compressed it)?

@nemobis
Member

nemobis commented Apr 12, 2022 via email
