List-Extractor - Extract Data from Wikipedia Lists
List-Extractor is a tool that extracts information from Wikipedia lists and forms appropriate RDF triples from the list data.
How to run the tools
This project contains 2 different tools:
- `rulesGenerator.py`: run it first to generate the desired rules, then use `listExtractor.py` to extract triples for wiki resources.
- Alternatively, you can use only `listExtractor.py` and extract with the existing default settings.
For more details, refer to the documentation present in the `docs` folder. The sample generated datasets can be found here. Some example triples for different domains are also included.
`python listExtractor.py [collect_mode] [source] [language] [-c class_name]`

- `collect_mode`: use `s` to specify a single resource or `a` for a class of resources in the next parameter.
- `source`: a string representing a class of resources from the DBpedia ontology (find supported domains below), or a single Wikipedia page of an actor/writer.
- `language`: `en`, `it`, `de` etc. (for now, available only for some languages, for selected domains) - a two-letter prefix corresponding to the desired language of the Wikipedia pages and the SPARQL endpoint to be queried.
- `-c`, `--classname`: a string representing the class name you want to associate your resource with. Applicable only for `collect_mode = s`.
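The `language` parameter selects both the Wikipedia edition and the SPARQL endpoint to query. As a rough illustration (the endpoint URL pattern and query shape below are assumptions for this sketch, not taken from the extractor's source):

```python
# Hypothetical sketch: derive the language-specific DBpedia SPARQL endpoint
# and build a query selecting resources of a given ontology class.

def endpoint_for(language):
    """Return a DBpedia SPARQL endpoint URL for a two-letter language prefix."""
    # English DBpedia lives at dbpedia.org; other chapters use a language prefix.
    return ("https://dbpedia.org/sparql" if language == "en"
            else "https://%s.dbpedia.org/sparql" % language)

def resources_query(class_name, limit=100):
    """Build a SPARQL query listing resources of a DBpedia ontology class."""
    return ("SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/%s> } LIMIT %d"
            % (class_name, limit))

print(endpoint_for("it"))            # https://it.dbpedia.org/sparql
print(resources_query("Writer"))
```

Running the tool with `a Writer it`, for example, would conceptually target the Italian endpoint with a query like the one above.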
NOTE: While extracting triples from multiple resources in a domain (`collect_mode = a`), using `Ctrl + C` will skip the current resource and move on to the next resource. To quit the extractor, use `Ctrl + \`.
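The skip-on-interrupt behavior described above can be sketched as follows; `extract_resource` is a hypothetical placeholder for the per-resource extraction step, not the extractor's actual function:

```python
def extract_all(resources, extract_resource):
    """Process each resource in turn; Ctrl+C (SIGINT) skips only the
    current resource instead of aborting the whole run."""
    extracted = []
    for resource in resources:
        try:
            extracted.append(extract_resource(resource))
        except KeyboardInterrupt:
            # Ctrl+C raises KeyboardInterrupt in Python: abandon this
            # resource and continue with the next one.
            print("Skipping %s" % resource)
    return extracted
```

`Ctrl + \` sends SIGQUIT, which Python does not translate into an exception, so it still terminates the whole process.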
- `python listExtractor.py a Writer it`
- `python listExtractor.py s William_Gibson en`: uses the default inbuilt mapper functions.
- `python listExtractor.py s William_Gibson en -c CUSTOM_WRITER`: uses the `CUSTOM_WRITER` mapping only to extract list elements.
If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called
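A Turtle (.ttl) file is a plain-text RDF serialization. The following minimal, stdlib-only sketch shows how such a file could be produced; the output directory name, `write_ttl` helper, and triple shapes are placeholders for illustration, not the extractor's actual code:

```python
import os

def write_ttl(resource, triples, out_dir="extracted_output"):
    """Write (subject, predicate, literal) triples for one resource
    as simple Turtle statements into a subdirectory.

    Illustrative only: the real extractor builds richer RDF output.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, resource + ".ttl")
    with open(path, "w") as f:
        for s, p, o in triples:
            # Each statement: <subject> <predicate> "literal" .
            f.write('<%s> <%s> "%s" .\n' % (s, p, o))
    return path
```

For instance, extracting `William_Gibson` might yield statements pairing the resource with works listed on his Wikipedia page.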
- This is an interactive tool; select the options given in the menu to use the rules generator.
- While creating new mapping rules or mapper functions, make sure to follow the required format as suggested by the tool.
- Upon successful addition/modification, it will update `custom_mapper.json` so that the new user-defined rules/functions can run with the extractor.
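The exact schema of `custom_mapper.json` is defined by the tool itself; the structure below is a made-up example used only to illustrate how user-defined rules could be loaded and looked up by name:

```python
import json

# Hypothetical rule layout for illustration only -- the real
# custom_mapper.json schema is defined by rulesGenerator.py.
SAMPLE_RULES = """
{
  "CUSTOM_WRITER": {
    "headers":  {"works": ["Bibliography", "Novels"]},
    "ontology": {"works": "notableWork"}
  }
}
"""

def load_mapper(name, raw_json):
    """Parse the rules JSON and return the named mapper's rules.

    Raises KeyError if no mapper with that name was defined.
    """
    rules = json.loads(raw_json)
    return rules[name]

mapper = load_mapper("CUSTOM_WRITER", SAMPLE_RULES)
```

Running the extractor with `-c CUSTOM_WRITER` would then apply only this mapper's rules instead of the default ones.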
Default Mapped Domains:
More domains can be added using the `rulesGenerator.py` tool.
Attributions for 3rd party tools:
This project uses two other existing open-source projects:
- JSONpedia, a framework designed to simplify access to MediaWiki contents by transforming everything into JSON. The framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.
- JCommander, a very small Java framework that makes it trivial to parse command line parameters.