GitHub - aklement/wikiextractor: Extraction of Wikipedia plain text bi-lingual page pairs

aklement / wikiextractor Public

Notifications You must be signed in to change notification settings
Fork 0
Star 3

Extraction of Wikipedia plain text bi-lingual page pairs

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.settings		.settings
lib		lib
src		src
tools		tools
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README		README

Repository files navigation

This project contains some code and scripts to extract plain text pairs of 
linked wikipedia articles in English and L2.

1. Get wikipedia dumps from wikimedia.org:

  * Download (tools/download.pl) and unzip (tools/unzip.pl) wikipedia dumps
    for a set of languages.

2. Extracting interlinked plain text page pairs

  * Compile the wikiextractor project (tools/build.sh).  The script generates
    ../lib/wikiextractor.jar.
   
  * Extract plain text pages (extractAll.pl) for a set of languages.  Pages are
    extracted if they have inter-language links to corresponding pages in 
    English.  Filenames are the names of the corresponding English articles
    (both plain text articles and original markup are saved).

  * Pair up the articles in English and L2 (allpairs.pl), saving names of the
  	pairs in list.txt.

E.g. (for en-es):

cd tools
perl download.pl languages dumps
perl unzip.pl languages dumps

bash build.sh
perl extractAll.pl languages dumps out

mkdir -p out/pagepairs/logs
perl allpairs.pl en es out/pages out/pagepairs/en-es/pairs &> out/pagepairs/logs/en-es.pairs.log