Skip to content

Your à-la-carte in-domain corpora extraction tool from Wikipedia

Notifications You must be signed in to change notification settings

cristinae/WikiTailor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WikiTailor

WikiTailor is temporarily out of business as it is refurbishing to use up-to-date Lucene libraries. We are currently working on it and hoping it will be ready soon.

Your à-la-carte corpora extraction tool

WikiTailor is a tool for extracting in-domain corpora from Wikipedia. A domain must be defined as an existing category in Wikipedia (or in Vikipèdia, or in ويكيبيديا or in Βικιπαίδεια) and the articles belonging to that domain are extracted even if they are not tagged as such. Two extraction methods are implemented: the main system is based on the exploration of Wikipedia's category graph and a secondary one based information retrieval techniques is also included.

WikiTailor 1.0 functionalities

  • Monolingual in-domain corpora extraction
  • Multilingual in-domain comparable corpora extraction
  • Multilingual in-domain parallel corpora extraction build with the articles' titles

Available languages: Arabic, Basque, Catalan, English, French, German, Greek, Romanian, Portuguese and Spanish.

Upcoming

  • New available languages: Czech and Bulgarian
  • Bilingual in-domain parallel corpora extraction build with the articles' content
  • Evaluation of the quality of the extractions

Usage

For the main functionality, that is, the extraction of a corpus of a specific domain do:

java -jar wikiTailor.v1.0.0.jar [-c <arg> | -n <arg>] [-d <arg>] [-e <arg>] [-h]
          -i <FILE> -l <arg> [-m <arg>]  [-o <arg>] [-s <arg>] [-t <arg>] -y <arg>

where the arguments are:
 -c,--category <arg>      Name of the main category (with '_' instead of ' ';
                          you can use -n instead)
 -d,--depth <arg>         depth obtained in a previous execution
                          (default: 0)
 -e,--end <arg>           Last step for the process
                          (default: 7)
 -h,--help                This help
 -i,--ini <FILE>          Global config file for WikiTailor
 -l,--language <arg>      Language of interest (e.g., en, es, ca)
 -m,--model <arg>         Percentage of in-domain categories
                          (default: 0.5)
 -n,--numcategory <arg>   Numerical identifier of the category (you can use -c instead)
 -o,--outpath <arg>       Save the output into this directory
                          (default: current)
 -s,--start <arg>         Initial step for the process
                          (default: 1)
 -t,--top <arg>           Number of vocabulary terms within the 10%
                          (default: 100, all: -1)
 -y,--year <arg>          Wikipedia year edition (2013, 2015, 2016)

Ex: java -jar wikiTailor.v1.0.0.jar -l en -y 2015 -i wikiTailor.ini -c Science

For other uses see the manual and the project webpage.

References

For a complete analysis of the methods implemented see:

  • Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. Tailoring Wikipedia for in-Domain Comparable Corpora Extraction. Preprint arXiv:2005.01177

  • Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez. A Factory of Comparable Corpora from Wikipedia. In Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC 2015), pages 3-13, July 2015, Beijing, China

About

Your à-la-carte in-domain corpora extraction tool from Wikipedia

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages