
Research code to extract raw text from Wikipedia dump files. The output contains no wiki markup or HTML tags. The results are a number of text files, each containing articles separated by newlines.


⚠️ For lab-internal use only. Requires access to the UIUC Learning & Language Lab file server.

Running the script

Corpus creation is computationally expensive and is designed to run in parallel across multiple machines. ludwig is used for this purpose. Jobs are submitted by invoking its command-line interface:

ludwig -r 1 -c data/ wikiExtractor/

The -r flag indicates how many times to run a job on each ludwig machine. This should always be set to 1.

The -c flag ensures that all folders required for execution are available to each machine. By default, ludwig automatically uploads source code if it is in a folder with the same name as the root folder. Third-party code for Wikipedia template expansion, available here, is not part of the source code folder and must therefore be uploaded explicitly. Note that folders specified by the -c flag are not copied to each machine individually; they are uploaded to the remote root folder (on the shared drive), which every machine can access. Modules inside such folders are nonetheless importable because ludwig automatically appends the remote root folder to sys.path.

Note that -c data/ wikiExtractor/ only needs to be specified once and can be omitted on subsequent submissions to save time. Don't forget to add -c data/ wikiExtractor/ again whenever files in either folder have changed.
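The import behavior described above relies on a general Python mechanism: any folder appended to sys.path becomes a place where modules and packages are looked up. The following is a minimal sketch of that mechanism only, not ludwig's actual code. The helper import_from_folder, the temporary folder standing in for the remote root folder, and the package name demo_uploaded_pkg are all hypothetical and chosen purely for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_folder(root_folder, module_name):
    """Append root_folder to sys.path (if absent) and import module_name from it."""
    root = str(root_folder)
    if root not in sys.path:
        sys.path.append(root)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# A temporary folder plays the role of the remote root folder (assumption).
with tempfile.TemporaryDirectory() as remote_root:
    # "demo_uploaded_pkg" is a hypothetical stand-in for an uploaded folder.
    package_dir = Path(remote_root) / "demo_uploaded_pkg"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("GREETING = 'importable'\n")

    module = import_from_folder(remote_root, "demo_uploaded_pkg")
    print(module.GREETING)  # prints "importable"

    # Clean up so the stale temporary path does not linger in sys.path.
    sys.path.remove(remote_root)
```

The folder itself never has to sit next to the running script; it only has to be reachable through an entry in sys.path.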

Verifying output

A simple way to verify the output is to count the number of articles:

cd /media/research_data/CreateWikiCorpus
find runs/ -name titles.txt | xargs wc -l

Verify that the total number of lines is close to the total number of articles in the Wikipedia dump file.
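The same count can also be computed from Python, which may be convenient inside a notebook or a test. This is a minimal sketch, assuming that each line of a titles.txt file corresponds to one article and that the files are UTF-8 encoded; the function name count_titles is hypothetical and not part of this repository.

```python
from pathlib import Path


def count_titles(runs_dir):
    """Return the total number of lines across all titles.txt files under runs_dir."""
    total = 0
    # rglob searches runs_dir recursively, like `find runs/ -name titles.txt`.
    for titles_path in Path(runs_dir).rglob("titles.txt"):
        with titles_path.open(encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total
```

Calling count_titles("/media/research_data/CreateWikiCorpus/runs") should give roughly the same total as the shell command. A small difference is possible because wc -l counts newline characters, whereas this sketch also counts a final line that lacks a trailing newline.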

If there are multiple corpora in runs/, you need to specify the subset of folders corresponding to the corpus of interest. For example, if parameter configurations 15-21 are associated with a corpus, the total number of articles can be calculated with:

cd /media/research_data/CreateWikiCorpus/runs
find param_15 param_16 param_17 param_18 param_19 param_20 param_21 -name titles.txt | xargs wc -l
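When the range of parameter configurations is long, typing out every folder name is error-prone. The sketch below builds the param_<id> folder names from a range and reports a per-folder breakdown. It assumes the folder naming shown in the command above and one article title per line; the function names count_lines and count_titles_for_params are hypothetical and not part of this repository.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a single text file."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def count_titles_for_params(runs_dir, param_ids):
    """Map each param_<id> folder name to the line count of its titles.txt files."""
    counts = {}
    for param_id in param_ids:
        folder = Path(runs_dir) / "param_{}".format(param_id)
        # Sum over every titles.txt found anywhere below this param folder.
        counts[folder.name] = sum(
            count_lines(path) for path in folder.rglob("titles.txt")
        )
    return counts
```

For the example above, count_titles_for_params("/media/research_data/CreateWikiCorpus/runs", range(15, 22)) covers configurations 15 through 21, and sum(counts.values()) on the returned dictionary gives the corpus total. The per-folder numbers also make it easier to spot a configuration whose jobs produced no output.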

Technical Notes

Tested on Ubuntu 16.04 and macOS using Python >= 3.6.








