python-mwdump-tools

Quick parsing of MediaWiki XML dumps: reads an XML dump from stdin, using simple string searching to locate each <page> node and Python's C implementation of ElementTree to parse it.

TODO

Docs
Examples of parsing tasks other than image downloading
Packaging for pip

Features

Fast

The outermost parser does not try to parse the whole XML dump. It simply moves from <page> to </page>, which allows for small buffers and quick deployment of jobs.
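The page-by-page scanning described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `iter_pages` is hypothetical, and it assumes each `<page>` and `</page>` tag sits on its own line, as in standard MediaWiki exports.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(stream):
    """Yield one parsed <page> element at a time from a line-oriented dump."""
    buffer = []
    inside = False
    for line in stream:
        if not inside and "<page>" in line:
            inside = True
            # Drop anything on the line before the opening tag.
            line = line[line.index("<page>"):]
        if inside:
            if "</page>" in line:
                end = line.index("</page>") + len("</page>")
                buffer.append(line[:end])
                # Only this small fragment is handed to the XML parser.
                yield ET.fromstring("".join(buffer))
                buffer = []
                inside = False
            else:
                buffer.append(line)


# Tiny stand-in for a dump; real input would come from sys.stdin.
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n"
    "    <title>File:Example.jpg</title>\n"
    "    <ns>6</ns>\n"
    "  </page>\n"
    "  <page>\n"
    "    <title>Main Page</title>\n"
    "    <ns>0</ns>\n"
    "  </page>\n"
    "</mediawiki>\n"
)
titles = [page.findtext("title") for page in iter_pages(sample)]
```

Because only one page is buffered at a time, memory use stays bounded by the largest single page rather than by the size of the dump.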

Multiprocessing

Python 3's concurrent.futures lets jobs run in parallel, so the I/O-heavy tasks of parsing revision texts, downloading related files, etc. can make full use of a single server.
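As a rough illustration of running per-page jobs in parallel with the standard library's concurrent.futures, consider the sketch below. The job function `process_revision` and the helper `run_jobs` are hypothetical stand-ins, and the choice of a thread pool is an assumption made for the example, not a description of what the tool does internally.

```python
from concurrent.futures import ThreadPoolExecutor


def process_revision(text):
    """Hypothetical per-page job: here it only counts words."""
    return len(text.split())


def run_jobs(texts, max_workers=4):
    # Threads suit I/O-bound work such as downloads; ProcessPoolExecutor
    # offers the same interface for CPU-bound work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(process_revision, texts))


results = run_jobs(["one two", "three", "four five six"])
```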

Resuming and skipping

Where applicable, jobs can be resumed by passing in the line number from which the job should start.

If a job finds that something has already been processed, it skips it.
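One way to picture resuming and skipping is shown below. Both helper names are hypothetical, and treating an existing output file as "already processed" is an assumption made for the sketch; the actual jobs may use a different check.

```python
import itertools
import os


def lines_from(stream, start_line=1):
    """Yield (line_number, line) pairs, skipping lines before start_line (1-based)."""
    numbered = enumerate(stream, start=1)
    return itertools.islice(numbered, start_line - 1, None)


def needs_processing(target_path):
    """Assume an existing output file means the work was already done."""
    return not os.path.exists(target_path)


# Resume a three-line input from line 2.
resumed = list(lines_from(["a\n", "b\n", "c\n"], start_line=2))
```

Keeping the original line numbers in the output makes it easy to log where a job stopped, so the same number can be passed back in on the next run.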

Super configurable

Most behaviour can be configured.

Commands

imagedownloader

Downloads and downsamples images found in an XML dump.

Usage:

./imagedownloader --help

Takes a MediaWiki dump from stdin and parses all titles as file names, so you need to give it the namespace of the File:XXX pages. For instance:

./imagedownloader --namespaces=6 < mywiki.dump

It downloads the images, places them in the destination directory, and outputs SQL INSERT statements for populating the images table.
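The namespace filtering and title-to-file-name step can be sketched as below. This is only an assumption about how such a step might look: the function `file_name_for` is hypothetical, and the sketch covers neither the download nor the SQL output.

```python
import xml.etree.ElementTree as ET


def file_name_for(page, namespaces=("6",)):
    """Return the file name for a <page> element, or None if its namespace is not wanted."""
    if page.findtext("ns") not in namespaces:
        return None
    title = page.findtext("title") or ""
    # "File:Example.jpg" -> "Example.jpg"; titles without a prefix are kept whole.
    return title.split(":", 1)[-1]


file_page = ET.fromstring("<page><title>File:Example.jpg</title><ns>6</ns></page>")
article_page = ET.fromstring("<page><title>Main Page</title><ns>0</ns></page>")

wanted = file_name_for(file_page)
skipped = file_name_for(article_page)
```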

patternmatcher (TODO)

Reads a list of Python regular expressions and counts their occurrences in a dump.
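Since this command is not implemented yet, the following is only a guess at what the counting step could look like. The function name `count_patterns` is hypothetical, and the texts are passed in as plain strings rather than read from a dump.

```python
import re


def count_patterns(patterns, texts):
    """Count non-overlapping matches of each regular expression across all texts."""
    compiled = [re.compile(p) for p in patterns]
    counts = {p: 0 for p in patterns}
    for text in texts:
        for regex in compiled:
            counts[regex.pattern] += sum(1 for _ in regex.finditer(text))
    return counts


counts = count_patterns(
    [r"\[\[File:", r"\{\{"],
    ["[[File:A.jpg]] {{stub}}", "{{infobox}} {{cite}}"],
)
```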

patternreplacer (TODO)

Replaces a list of (search, replace) pairs of Python

autotranslator (TODO)

Idea sketch: Call online API or other cloud translation service for translation of article text to output a translated XML dump.

Python 3

You need Python 3 to use this because it relies on concurrent.futures for parallel processing.

You need pip for Python 3; consider setting up a virtual environment.

The following libraries are essential for your Pillow install to process images with imagedownloader:

libjpeg provides JPEG functionality.
zlib provides access to compressed PNGs.
libtiff provides Group 4 TIFF functionality.

sudo apt-get install libjpeg-dev libtiff4-dev

To reinstall Pillow after adding dependencies, run:

pip install pillow --force-reinstall --upgrade
