# Wikipedia dump to JSON converter
I wrote this project to convert `enwiki-20150304-pages-articles.xml.bz2` into a JSON file that can be properly loaded by Spark's `jsonFile` method. Dumps from other dates and other languages may also work (untested).
Only Python 3 is required. There are two ways to run the scripts. The first is the easiest, but you can only run the scripts from this repository's root folder. With the second, you are free to run them from any folder you want. The examples in the next sections are based on the second option.
## Running from repo's root
No extra configuration is needed to run the scripts as long as you are in the repository root. In the examples below, you will have to prefix the scripts with `scripts/` (e.g., change `w2j.py` to `scripts/w2j.py`).
## Running from any directory
To run the scripts from any directory, set the variables below (you may prefer to add these commands to `~/.bashrc` so you don't need to execute them in every shell):

```sh
export PYTHONPATH="$PYTHONPATH:/full/path/to/wikipedia2json"
export PATH="$PATH:/full/path/to/wikipedia2json/scripts"
```
## w2j.py

Converts pages in the XML file (`<page>.*</page>`) to JSON format, one page per line.
To validate the output, use `check_json.py`; to limit the output size into one or many files, use `split.py` (see the Tips section).
- Input: Wikipedia XML dump via stdin or filename
- Output: one-line JSON per `<page>` element
```sh
# Convert the whole XML to JSON
bzcat enwiki-*.xml.bz2 | w2j.py >enwiki.json

# Compressed output
bzcat enwiki-*.xml.bz2 | w2j.py | \
    bzip2 >enwiki.json.bz2
```
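The core of such a converter can be sketched with Python's streaming XML parser. This is a minimal illustration, not the actual `w2j.py` implementation: the `pages_to_json` name, the `title`/`text` output fields, and the namespace URL are assumptions of mine.

```python
import json
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the exact version varies between dumps.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def pages_to_json(stream, out):
    """Stream <page> elements from `stream` and write one JSON object per line to `out`."""
    # iterparse keeps memory bounded: each <page> subtree is cleared after use,
    # so even a multi-gigabyte dump can be processed.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page" or elem.tag == "page":
            title = elem.findtext(NS + "title") or elem.findtext("title") or ""
            text = (elem.findtext(NS + "revision/" + NS + "text")
                    or elem.findtext("revision/text") or "")
            out.write(json.dumps({"title": title, "text": text}) + "\n")
            elem.clear()  # release the finished subtree
```

Reading from stdin and writing to stdout (as the pipelines above do) amounts to calling `pages_to_json(sys.stdin.buffer, sys.stdout)`.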
## check_json.py

Filters valid one-line JSONs.
- Input: one-line JSONs via stdin or filename
- Stdout: valid JSONs
- Stderr: invalid JSONs and, at the end, the total number of processed lines
```sh
# Filename
check_json.py enwiki.json 2>not_valid.json >/dev/null

# stdin
bzcat enwiki.json.bz2 | \
    check_json.py 2>not_valid.json >/dev/null
```
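The validation itself boils down to attempting `json.loads` on every line. A hedged sketch of the idea (not `check_json.py` itself; the `filter_json_lines` helper is hypothetical):

```python
import json

def filter_json_lines(lines, good, bad):
    """Write parseable lines to `good`, unparseable ones to `bad`; return the line count."""
    total = 0
    for line in lines:
        total += 1
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            bad.write(line)
        else:
            good.write(line)
    return total
```

Wired to the standard streams (`filter_json_lines(sys.stdin, sys.stdout, sys.stderr)`), this reproduces the stdout/stderr split described above.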
## split.py

Splits the input into many custom-sized files, line by line. If appending the next line would exceed the specified size, the line is written to the next file instead.
- Input: multiline text via stdin
- Output: writes the specified files
```sh
# 100M.json will contain the first 100 MB.
# 924M.json will contain the remaining 924 MB, completing 1 GB.
cat enwiki.json | \
    split.py 100M 100M.json 1G 924M.json
```
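The splitting rule (a line is never cut in half; one that would overflow the current file starts the next one) can be sketched as follows. This is an illustration under assumptions, not the real `split.py`: it takes sizes in bytes rather than parsing suffixes like `100M`, and `split_by_size` is a name I made up.

```python
def split_by_size(lines, specs):
    """specs: list of (max_bytes, writable) pairs, in order.

    Lines are never truncated: a line that would push the current file
    past its size limit is written to the next file instead.
    """
    it = iter(specs)
    max_bytes, out = next(it)
    written = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        # Advance to the next file whenever this line would overflow the current one.
        while written + size > max_bytes:
            try:
                max_bytes, out = next(it)
            except StopIteration:
                return  # all files full: stop, effectively limiting total output size
            written = 0
        out.write(line)
        written += size
```

The early `return` when all files are full is also what makes the "limit conversion by size" tip below work: whatever doesn't fit is simply discarded.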
## Tips

### Limit conversion by size
You can use the `split.py` script to limit the conversion:
```sh
# Convert up to 10 MB of JSON
bzcat enwiki-*.xml.bz2 | w2j.py | \
    split.py 10M enwiki_small.json
```
### Validation while converting
You can use `check_json.py` together with `w2j.py` to convert and check for errors at the same time:
```sh
bzcat enwiki-*.xml.bz2 | w2j.py | \
    check_json.py 2>not_valid.json >enwiki_valid.json
```
### Different sizes of the same data - Saving storage
If you want to run an application with different sizes of the same data (e.g. a scalability experiment), you can save storage space by splitting the data and then `cat`ing the files up to the desired size.
For example, suppose we want to use the first 4 and 16 GB of enwiki and put them in the Hadoop Distributed File System (HDFS):
```sh
# enwiki04G.json will contain the first 4 GB of data
# enwiki16G.json will contain the next 12 GB
cat enwiki.json | split.py 4G enwiki04G.json 16G enwiki16G.json

# Upload 4 GB to HDFS
hadoop fs -copyFromLocal enwiki04G.json /enwiki/04.json

# Upload 16 GB to HDFS
cat *.json | hadoop fs -put - hdfs://master/enwiki/16.json
```
To save much more space, you can use gzip or bzip2. As `split.py` does not compress data, we will use Linux named pipes:
```sh
# Create named pipes
mkfifo enwiki04G.json enwiki16G.json

# Compress data from the named pipes in the background
cat enwiki04G.json | bzip2 >enwiki04G.json.bz2 &
cat enwiki16G.json | bzip2 >enwiki16G.json.bz2 &

# Unaltered split.py command
cat enwiki.json | split.py 4G enwiki04G.json 16G enwiki16G.json

# Delete the named pipes (they are just files)
rm enwiki04G.json enwiki16G.json

# Upload to HDFS using bzcat
bzcat enwiki04G.json.bz2 | hadoop fs -put - hdfs://master/enwiki/04.json
bzcat *.json.bz2 | hadoop fs -put - hdfs://master/enwiki/16.json
```
This way, you will need only about 4.3 GB of storage to generate files totaling 20 GB. You may want to use pbzip2 for parallel (de)compression.
## Pipe party :)

To convert the enwiki XML to JSON, verify the output, split it line by line, and compress the output all at once:
```sh
# Create named pipes
mkfifo enwiki04G.json enwiki16G.json

# Compress data from the named pipes in the background
cat enwiki04G.json | pbzip2 >enwiki04G.json.bz2 &
cat enwiki16G.json | pbzip2 >enwiki16G.json.bz2 &

# Everything but the output compression
bzcat enwiki-*.xml.bz2 | w2j.py | \
    check_json.py 2>not_valid.json | \
    split.py 4G enwiki04G.json 16G enwiki16G.json

# Delete the named pipes (they are just files)
rm enwiki04G.json enwiki16G.json
```