aklement/wikiextractor

This project contains code and scripts to extract plain text pairs of
interlinked Wikipedia articles in English and a second language (L2).

1. Get Wikipedia dumps from wikimedia.org:

  * Download (tools/download.pl) and unzip (tools/unzip.pl) Wikipedia dumps
    for a set of languages.
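
As a rough sketch of what the download step amounts to (this is not the
contents of tools/download.pl), the snippet below builds the URL of the latest
article dump for a language code, fetches it, and decompresses it.  The helper
names dump_url and fetch_dump, the choice of the pages-articles dump, and the
use of curl and bunzip2 are all assumptions made for illustration.

```shell
# Hypothetical helper: print the URL of the latest pages-articles dump
# for a Wikipedia language code such as "en" or "es".
dump_url() {
    printf 'https://dumps.wikimedia.org/%swiki/latest/%swiki-latest-pages-articles.xml.bz2\n' "$1" "$1"
}

# Hypothetical helper: download one dump into a directory and decompress it.
# Usage: fetch_dump LANG DIR
fetch_dump() {
    fd_lang="$1"
    fd_dir="$2"
    fd_file="$fd_dir/${fd_lang}wiki-latest-pages-articles.xml.bz2"
    mkdir -p "$fd_dir" || return 1
    # -f: fail on HTTP errors, -L: follow redirects
    curl -fL -o "$fd_file" "$(dump_url "$fd_lang")" || return 1
    # -f: overwrite an existing decompressed file
    bunzip2 -f "$fd_file"
}
```

With these definitions loaded, "fetch_dump es dumps" would leave
dumps/eswiki-latest-pages-articles.xml on disk.  The actual scripts take a
languages file and a dumps directory, as shown in the example further down.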

2. Extract interlinked plain text page pairs:

  * Compile the wikiextractor project (tools/build.sh).  The script generates
    ../lib/wikiextractor.jar.
   
  * Extract plain text pages (extractAll.pl) for a set of languages.  Pages are
    extracted if they have inter-language links to corresponding pages in 
    English.  Filenames are the names of the corresponding English articles
    (both plain text articles and original markup are saved).

  * Pair up the articles in English and L2 (allpairs.pl), saving names of the
    pairs in list.txt.
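
Because extracted pages are named after the corresponding English article, the
pairing step is essentially an intersection of file names.  The sketch below
illustrates that idea only; it is not the logic of allpairs.pl.  The helper
name list_pairs and the assumption that each language has its own directory of
identically named files are hypothetical.

```shell
# Hypothetical helper: print every file name that exists in both
# directories, one per line, in the shell's glob order.
# Usage: list_pairs EN_DIR L2_DIR
list_pairs() {
    lp_en_dir="$1"
    lp_l2_dir="$2"
    for lp_path in "$lp_en_dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$lp_path" ] || continue
        lp_name=$(basename "$lp_path")
        # A pair exists when the L2 directory holds a file of the same name.
        if [ -f "$lp_l2_dir/$lp_name" ]; then
            printf '%s\n' "$lp_name"
        fi
    done
}
```

Under the assumed layout, "list_pairs out/pages/en out/pages/es > list.txt"
would write the names of the paired articles to list.txt.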

E.g. (for en-es):

cd tools
perl download.pl languages dumps
perl unzip.pl languages dumps

bash build.sh
perl extractAll.pl languages dumps out

mkdir -p out/pagepairs/logs
perl allpairs.pl en es out/pages out/pagepairs/en-es/pairs &> out/pagepairs/logs/en-es.pairs.log

About

Extraction of Wikipedia plain text bi-lingual page pairs
