dumpgenerator.py: check if the filename actually contains a file extension (Wikia) #212

Closed
Southparkfan opened this issue Jan 4, 2015 · 11 comments

Comments

@Southparkfan
Contributor

A wiki founder at Orain wants the images from their wiki at Wikia (http://donjon.wikia.com) imported into their wiki at Orain. I tried to download the images with "python dumpgenerator.py --api http://donjon.wikia.com --images", but without success. The file names in the images/ folder look like "latest?cb=XXXXXXXXXXXXXX", where the X string is a timestamp in YmdHis format.

When I looked at the donjonwikiacom-20150104-images.txt file I saw entries like these:
latest?cb=20120816112532 http://vignette4.wikia.nocookie.net/donjonbd/images/8/89/Wiki-wordmark.png/revision/latest?cb=20120816112532 Nclm
latest?cb=20120816114055 http://vignette3.wikia.nocookie.net/donjonbd/images/6/64/Favicon.ico/revision/latest?cb=20120816114055 Nclm

This is because the file URLs returned by the API include the "/revision/latest?cb=XXXXXXXXXXXXXX" part:
http://donjon.wikia.com/api.php?action=query&list=allimages

To avoid this problem, I'll see if I can write a PHP script to download images from Wikia.

@nemobis
Member

nemobis commented Jan 4, 2015

Southparkfan, 04/01/2015 20:18:

To avoid this problem, I'll see if I can write a PHP script to download
images from Wikia.

For individual wikis it's easier to visit Special:Statistics and click
"request dump". Images will then be available at
http://s3.amazonaws.com/wikia_xml_dumps/d/do/donjon_images.tar

@nclm

nclm commented Jan 4, 2015

Hi, I am the admin of this wiki.
I already requested a dump a day ago, but only the XML files were generated.
The Wikia help page about this feature says that it “does not include private user data or images”.

@nemobis
Member

nemobis commented Jan 4, 2015

nicolas, 04/01/2015 22:53:

The Wikia help page
http://community.wikia.com/wiki/Help:Database_download about this
feature says that it “does not include private user data or images”.

The help page is a pile of lies. Check
http://archiveteam.org/index.php?title=Wikia for true information.

@nclm

nclm commented Jan 4, 2015

Thanks, okay, it’s good to know.
However, http://s3.amazonaws.com/wikia_xml_dumps/d/do/donjon_images.tar (or http://s3.amazonaws.com/wikia_xml_dumps/d/do/donjonbd_images.tar using the old name of the wiki) doesn’t seem to be available.

@nclm

nclm commented Jan 4, 2015

For the naming issue Southparkfan found, the options could be:

  • Make an edited version of dumpgenerator.py which gets the file names differently;
  • Or write a script which renames all the downloaded pictures, using the generated images.txt file as a reference to get the actual file names. The file looks like:
latest?cb=20120816112532    http://vignette4.wikia.nocookie.net/donjonbd/images/8/89/Wiki-wordmark.png/revision/latest?cb=20120816112532    Nclm
latest?cb=20120816114055    http://vignette3.wikia.nocookie.net/donjonbd/images/6/64/Favicon.ico/revision/latest?cb=20120816114055  Nclm
latest?cb=20120816115626    http://vignette4.wikia.nocookie.net/donjonbd/images/0/0a/Logo_Donjon_Simple.png/revision/latest?cb=20120816115626   Nclm

so it’s something that looks possible.

@nemobis
Member

nemobis commented Jan 4, 2015

Don't give up hope for the images.tar; from what I can see, it's
usually created only a month or more after the XML.

nicolas, 05/01/2015 00:21:

so it’s something that looks possible.

It sure is. :) We know the true filename, so we can just use -O to force
the filename; however we need to decide on an escaping format for that
(wget picks its own).

I wish we could just use content-disposition
https://superuser.com/a/327254/283120 but unsurprisingly Wikia doesn't
provide that.

@nclm

nclm commented Jan 4, 2015

Right, using this regex (surely improvable):

(latest\?cb=\d+)\thttp.+/images/[\d\w]+/[\d\w]+/(.+)/revision(.+)

and this substitution rule:

mv $1 $2

I was able to turn *images.txt into a .sh script that renames all the misnamed pictures downloaded from Wikia, and it seems to work correctly.
The names still contain some URL encoding (like %28 for an opening parenthesis). This could probably be fixed with a second script, but maybe MediaWiki accepts it directly.
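
The regex and substitution above could also be applied directly from Python, which avoids shell quoting problems with names containing "?" and allows the URL encoding to be undone in the same pass. This is only a sketch under assumptions: `rename_pairs` is a hypothetical helper, the tab in the pattern is relaxed to any run of whitespace, and it targets Python 3 (for `urllib.parse.unquote`).

```python
import re
from urllib.parse import unquote

# The pattern from the comment above, with the tab relaxed to any whitespace run.
LINE_RE = re.compile(
    r"(latest\?cb=\d+)\s+http.+/images/[\d\w]+/[\d\w]+/(.+)/revision(.+)"
)


def rename_pairs(lines):
    """Yield (downloaded_name, real_name) for each matching images.txt line."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # Group 1 is the misnamed file on disk, group 2 the real name;
            # unquote() turns sequences such as %28 back into "(".
            yield match.group(1), unquote(match.group(2))


sample = [
    "latest?cb=20120816112532\thttp://vignette4.wikia.nocookie.net/donjonbd/"
    "images/8/89/Wiki-wordmark.png/revision/latest?cb=20120816112532\tNclm",
]
pairs = list(rename_pairs(sample))
```

Each resulting pair could then be passed to `os.rename` inside the images/ folder. Whether decoded names are what MediaWiki expects on import is not established here, so the `unquote` step may need to be dropped.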

@brunosso

brunosso commented Jun 26, 2016

I have the same problem: the txt file is created, but the images aren't downloaded because the filename contains "?".
How can I fix dumpgenerator.py?
I think it's around line 1104, where filename3 is declared.

I fixed this problem! I edited line 998, changing
url.split('/')[-1]) to url.split('/')[-3]), and now the filenames of the images are correct.
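
Taking index -3 unconditionally assumes every URL ends in "/revision/<version>", which would pick the wrong segment for a plain image URL. A more defensive variant might look like the sketch below. It is not the code that was committed: `filename_from_url` is a hypothetical helper, the `unquote` call is an addition for illustration, the example.org URL is made up, and Python 3 is assumed.

```python
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the image file name from a plain or Wikia-style image URL."""
    # urlsplit() separates the query string ("?cb=...") from the path.
    parts = urlsplit(url).path.split("/")
    # Wikia-style URLs end in .../<name>/revision/<version>, so the name is
    # the segment just before "revision" rather than the last one.
    if len(parts) >= 3 and parts[-2] == "revision":
        return unquote(parts[-3])
    return unquote(parts[-1])


wikia_url = (
    "http://vignette4.wikia.nocookie.net/donjonbd/images/8/89/"
    "Wiki-wordmark.png/revision/latest?cb=20120816112532"
)
plain_url = "http://example.org/w/images/a/ab/Example.png"  # hypothetical URL

wikia_name = filename_from_url(wikia_url)
plain_name = filename_from_url(plain_url)
```

Checking for the "revision" segment instead of the host name means the same function would apply to any domain that serves URLs of this shape.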

emijrp added a commit that referenced this issue Sep 17, 2016
@emijrp
Member

emijrp commented Sep 17, 2016

Thanks @brunosso for the patch. I fixed it and it works fine!

@emijrp emijrp closed this as completed Sep 17, 2016
@moll

moll commented Nov 27, 2018

I think this needs to be reopened. Wikia now has the following new URLs with the same /revision/ nuance:

  • wikia.com
  • wikia.nocookie.net
  • fandom.com

@Markel

Markel commented Dec 25, 2019

Yeah, it gives errors again... (This is a fandom site)

Traceback (most recent call last):
  File "dumpgenerator.py", line 2323, in <module>
    main()
  File "dumpgenerator.py", line 2313, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 2030, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 1299, in generateImageDump
    imagefile = open(filename3, 'wb')
IOError: [Errno 22] invalid mode ('wb') or filename: u'./gravityfallsfandomcom-20191224-wikidump/images/latest?cb=20110210205046'
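
The open() call fails on the name "latest?cb=20110210205046", presumably because "?" is not allowed in Windows file names. In the spirit of the issue title, a guard along these lines could catch such names before the file is opened. This is a sketch only: `looks_like_image_filename` is a hypothetical helper and is not part of dumpgenerator.py.

```python
import os

# Characters that Windows rejects in file names.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def looks_like_image_filename(filename):
    """Return True if the name has an extension and no forbidden characters."""
    name = os.path.basename(filename)
    has_extension = os.path.splitext(name)[1] != ""
    has_forbidden = any(ch in WINDOWS_FORBIDDEN for ch in name)
    return has_extension and not has_forbidden


bad = "./gravityfallsfandomcom-20191224-wikidump/images/latest?cb=20110210205046"
good = "./gravityfallsfandomcom-20191224-wikidump/images/Wiki-wordmark.png"

bad_ok = looks_like_image_filename(bad)
good_ok = looks_like_image_filename(good)
```

A name that fails the check could be logged and skipped, or re-derived from the URL, rather than letting the whole dump abort.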

7 participants