Extracting RDF Relations from Wikipedia’s Tables
This project aims to extract relationships as RDF triples from tables found in Wikipedia's articles Link.
We propose that existing RDF knowledge-bases can be leveraged to extract facts (in the form of RDF triples) from relational HTML tables on the Web with high accuracy. In particular, we propose methods using the DBpedia knowledge-base to extract facts from tables embedded in Wikipedia articles (henceforth "Wikitables"), effectively enriching DBpedia with additional triples. We first survey the Wikitables from a recent dump of Wikipedia to see how much raw data can potentially be exploited by our methods. We then propose methods to extract RDF from these tables: we map table cells to DBpedia entities and, for cells in the same row, we isolate a set of candidate relationships based on existing information in DBpedia for other rows. To improve accuracy, we investigate various machine learning methods to classify extracted triples as correct or incorrect. We ultimately extract 7.9 million unique novel triples from one million Wikitables at an estimated precision of 81.5%.
All the codes are written using Java 6.0 and eclipse framework. To compile, each package contains a
build.xml file to be used by ant.
We use English-language data from DBpedia v3.8, describing 9.4 million entities. For answering queries that looks for relation
between a pair of resources, we use local indexes of this DBpedia knowledge-base, and for each pair, we perform two atomic on-disk lookups for relations in either direction.
Important: The used indexes are not included in the repository given their size (ca. 9G), but are available under request. See contact information.
The system is modularized into the following packages:
The web application built using Spring MVC that integrates the extraction and classification of RDF triples.
The core or engine that performs the extraction of the RDF triples.
For a given set of extracted triples, this performs the prediction and returns correct or incorrect label for each triple.
Contains the classes that represent the model of the entire application and the access to indexes.
Extract statistics from DBpedia to help in the features definition.
An index used to fill the autocomplete data in the web application.
We also make available in this repository the training set used to build our machine learning models comprising 503 examples in two formats.
These can be used to validate our results and try new machine learning schemas.
wikitables-training-set file shows the feature vectors extracted for each example which are formatted using SVMLight format as follows:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info> <target> .=. +1 | -1 <feature> .=. <integer> <value> .=. <float> <info> .=. <string>
where the target value and each of the feature/value pairs are separated by a space character. The
<info> field contains the URL from where the
example cames from and the
<s,p,o> RDF triple. We also publish an ARFF version
wikitables-training-set.arff to be used in Weka.
How to use it?
You can deploy the demo
wikitables-demo-release-1.0 in for example a Tomcat 6 server. Follow this checklist:
- Update the paths in the file
build.propertiesaccording your tomcat configuration.
- Update the paths in the file
/WebContent/WEB-INFwith the paths to the root folder, models and indexes.
- Build all the packages with the supplied script
- Build and copy the web project into tomcat using
- Restart your tomcat.
- Go to http://localhost/wikitables-demo-1.0 in your browser.
We have developed an on-line demo of our approach, where we extract RDF relations for a given Wikipedia article.
Our system receives a Wikipedia article's title as parameter and uses a selected (or default) machine-learning model to filter the best candidate triples.
Go to our demo page, search for some article already in Wikipedia, select a model see how it works.
The program can be used under the terms of the Apache License, 2.0.
Please do not hesitate to contact us if you have any further questions about this project:
Emir Muñoz email@example.com and Aidan Hogan firstname.lastname@example.org
Digital Enterprise Research Institute
National University of Ireland