-
Notifications
You must be signed in to change notification settings - Fork 0
aklement/wikiextractor
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This project contains some code and scripts to extract plain text pairs of linked wikipedia articles in English and L2. 1. Get wikipedia dumps from wikimedia.org: * Download (tools/download.pl) and unzip (tools/unzip.pl) wikipedia dumps for a set of languages. 2. Extracting interlinked plain text page pairs * Compile the wikiextractor project (tools/build.sh). The script generates ../lib/wikiextractor.jar. * Extract plain text pages (extractAll.pl) for a set of languages. Pages are extracted if they have inter-language links to corresponding pages in English. Filenames are the names of the corresponding English articles (both plain text articles and original markup are saved). * Pair up the articles in English and L2 (allpairs.pl), saving names of the pairs in list.txt. E.g. (for en-es): cd tools perl download.pl languages dumps perl unzip.pl languages dumps bash build.sh perl extractAll.pl languages dumps out mkdir -p out/pagepairs/logs perl allpairs.pl en es out/pages out/pagepairs/en-es/pairs &> out/pagepairs/logs/en-es.pairs.log
About
Extraction of Wikipedia plain text bi-lingual page pairs
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published