Introduction to table extractor
Wikipedia is full of data hidden in tables. The aim of this project is to explore the possibilities of exploiting all the data represented in tables in Wiki pages, in order to populate the different chapters of DBpedia with new data of interest. The Table Extractor is the engine of this data "revolution": its final purpose is to extract the semi-structured data from all the tables now scattered across most Wiki pages.
Idea behind table extractor
As said previously, Wikipedia is full of data hidden in tables, so we need general software that can run over different domains and languages. That is the target of this year's project (by "domain" we mean a set of resources, which can range from basketball players to TV series).
To reach this objective, the user's help is needed to define all the mapping rules for a particular domain (a mapping rule is an association between a table header and a DBpedia ontology property).
I have built two modules, which can be summarized as follows:
pyDomainExplorer: this module surfs the resources of the domain chosen by the user (a single resource, a DBpedia mapping class, or a SPARQL where clause) and collects every section and table it meets along the way. It then prints an output file (called domain_settings.py by default, created in the domain_explorer folder) that contains all the sections (grouped to simplify the user's work) and the table headers. When you open this file, you can see that each section has its own dictionary that has to be filled in. Some fields are already filled: this means that pyDomainExplorer has found that header in the pyTableExtractor dictionary, or that a DBpedia property with the same name exists. Once you have written the mapping rules you need, you can start pyTableExtractor (which doesn't need any parameters).
pyTableExtractor: this module takes its parameters (such as the chapter or the output format value) from domain_settings.py. It reads all the rules defined by the user and updates its own dictionary. After that, it picks up the resources analyzed by pyDomainExplorer and extracts all the tables from the Wikipedia pages. With a simple search over the dictionary of mapping rules it can create RDF triples: first it creates a bridge between the current resource and the table rows (this is where domain_settings is used), and then it maps each cell of a row onto the corresponding row resource.
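The two-step triple creation described above can be sketched in plain Python (a minimal illustration, not the module's actual code — the real project serializes with an RDF library; the helper function and the specific URIs mirror the Larry Bird example further down):

```python
# Minimal sketch of the two-step triple creation (names are illustrative).
DBR = "http://dbpedia.org/resource/"
ONT = "http://dbpedia.org/ontology/"

def triple(s, p, o):
    """Format one N-Triples statement from a subject, predicate, object."""
    return "<%s> <%s> %s ." % (s, p, o)

resource = DBR + "Larry_Bird"
row = resource + "__10"   # reified resource standing for one table row

triples = [
    # step 1: bridge between the page resource and the table row
    triple(resource, ONT + "regularSeason", "<%s>" % row),
    # step 2: map one cell of that row to its ontology property
    triple(row, ONT + "pointsPerGame",
           '"28.7"^^<http://www.w3.org/2001/XMLSchema#float>'),
]
for t in triples:
    print(t)
```

Each table row thus becomes its own resource (reification), so every cell can be attached to it as a separate triple.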
Table Extractor features
It checks the parameters given by the user in order to perform a correct exploration and extraction of the domain (for example, it verifies whether the mapping class written by the user exists in DBpedia).
The lists of resources involved in your extractions are stored, to help the user understand whether the targeted scope has been correctly hit.
domain_settings.py contains many comments to help in filling in all the required fields, so that the extraction process is more effective.
Mapping rules are easy to write thanks to the intuitive structure of domain_settings.py ("table header": "ontology property"). For each section, an example Wikipedia page where it was found is also given.
Both modules produce log files containing information about exploring and extracting the domains selected by the user. You can see each operation for each resource involved.
domain_settings.py allows you to analyze any domain in several languages.
At the end of each log file there is a statistics section that reports how well the exploration and extraction went over the domain.
Scripts can be customized to your needs through settings.py. For example, you can enable or disable the filter on table data, and you can disable the check on the properties written by the user.
Use the output format parameter to change the output of pyDomainExplorer to fit your work. (I recommend using output = 2 and Notepad++ to fill in the resulting file.)
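To make that structure concrete, a filled-in fragment of domain_settings.py might look roughly like this (the section name, headers, and property values are invented for illustration; only the "header": "property" shape and the role of the parameters come from the description above):

```python
# Parameters read back by pyTableExtractor (illustrative values).
CHAPTER = "en"        # language chapter to work on
OUTPUT_FORMAT = 2     # 2 = also write less strict rules

# One dictionary per section found in the explored resources.
# pyDomainExplorer pre-fills the values it can resolve; empty
# strings are mapping rules left for the user to complete.
SECTION_REGULAR_SEASON = {
    "Year": "year",            # pre-filled: matching property found
    "PPG": "pointsPerGame",    # pre-filled: found in the dictionary
    "Team": "",                # left for the user to fill in
}
```

Filling the empty values is the only manual step between exploration and extraction.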
Example of result
Basketball domain - English
pyDomainExplorer log file
REPORT:
Total # resources analyzed: 100
Total # tables found: 83
Total # tables analyzed: 83
Total # of rows extracted: 534
Total # of data cells extracted: 5788
Total # of exceptions extracting data: 0
Total # of 'header not resolved' errors: 0
Total # of 'no headers' errors: 9
pyTableExtractor log file
REPORT:
Total # resources analyzed: 100
Total # tables found: 83
Total # tables analyzed: 83
Total # of rows extracted: 534
Total # of data cells extracted: 5788
Total # of exceptions extracting data: 0
Total # of 'header not resolved' errors: 0
Total # of 'no headers' errors: 9
Total # of 'no mapping rule' errors for sections: 0
Total # of 'no mapping rule' errors for headers: 0
Total # of data cells extracted that need to be mapped: 5788
Total # of table row triples serialized: 534
Total # of table cell triples serialized: 5788
Total # of triples serialized: 6322
Percentage of mapping effectiveness: 1.000
Note that the effectiveness of the project mostly depends on how many properties the user has filled in. (Effectiveness is calculated as the ratio between the data cells extracted that need to be mapped and the table-cell triples serialized.)
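That ratio is easy to reproduce from the report figures above (a sketch using the numbers from the basketball run; the function name is illustrative):

```python
def mapping_effectiveness(cells_to_map, cells_serialized):
    """Ratio between data cells that needed mapping and the
    table-cell triples actually serialized from them."""
    return cells_serialized / float(cells_to_map)

# Figures from the pyTableExtractor report above:
# 5788 cells needed mapping, 5788 cell triples were serialized.
print("%.3f" % mapping_effectiveness(5788, 5788))  # 1.000
```

A missing mapping rule lowers the serialized count, so the ratio drops below 1.000 and points directly at unfilled headers.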
Example of reification
<http://dbpedia.org/resource/Larry_Bird> ns1:regularSeason
    <http://dbpedia.org/resource/Larry_Bird__10>,
    <http://dbpedia.org/resource/Larry_Bird__5>,
    <http://dbpedia.org/resource/Larry_Bird__6>,
    <http://dbpedia.org/resource/Larry_Bird__7>,
    <http://dbpedia.org/resource/Larry_Bird__8>,
    <http://dbpedia.org/resource/Larry_Bird__9> .
Example of RDF triple
<http://dbpedia.org/resource/Larry_Bird__10>
    ns1:Year "1984–85"^^xsd:string ;
    ns1:assistsPerGame "6.6"^^xsd:float ;
    ns1:blocksPerGame "1.2"^^xsd:float ;
    ns1:fieldGoal "0.522"^^xsd:float ;
    ns1:freeThrow "0.882"^^xsd:float ;
    ns1:gamesPlayed "80.0"^^xsd:float ;
    ns1:gamesStarted "77.0"^^xsd:float ;
    ns1:minutesPerGame "39.5*"^^xsd:string ;
    ns1:pointsPerGame "28.7"^^xsd:float ;
    ns1:reboundsPerGame "10.5"^^xsd:float ;
    ns1:stolePerGame "1.6"^^xsd:float ;
    ns1:team <http://dbpedia.org/resource/Boston> ;
    ns1:threePoints "0.427"^^xsd:float .
Small digression on searching for mapping rules
It is worth explaining how the Mapper class searches for mapping rules. There are two types of rule for table headers: one is "header": "property" and the other is "section name + _ + header": "property". When Mapper analyzes a header, it first searches the dictionary for a key named section name + _ + header. If it doesn't find such a key, it searches for the header string alone.
In this way, if the user hasn't defined a strict rule (section name + _ + header), the module falls back to a less strict rule (the header alone) that could have been defined previously, in another exploration or extraction.
These different rules (strict and less strict) are written depending on the output format parameter: -f 1 defines only strict rules, while -f 2 also writes less strict rules.
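The lookup order just described can be sketched as follows (a minimal illustration; the function and dictionary are hypothetical stand-ins for the Mapper class and its rules dictionary):

```python
def find_property(rules, section, header):
    """Return the ontology property mapped to a table header.

    Try the strict key ("section name + _ + header") first,
    then fall back to the less strict key (the header alone).
    Returns None when no rule matches.
    """
    strict_key = section + "_" + header
    if strict_key in rules:
        return rules[strict_key]
    return rules.get(header)

# Hypothetical rules dictionary: one strict and one less strict rule.
rules = {
    "Regular season_Year": "year",   # strict: tied to one section
    "PPG": "pointsPerGame",          # less strict: any section
}

print(find_property(rules, "Regular season", "Year"))  # strict match
print(find_property(rules, "Playoffs", "PPG"))         # fallback match
```

Because the fallback key is section-independent, a rule written once can be reused across every section (and every later extraction) where the same header appears.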
In my opinion, there are two points of the project that can be improved:
- Facilitating the user's work: I have already added comments, progress bars, and a search over DBpedia ontology properties, but maybe there are other ways to help the user in filling in domain_settings.py.
- HTML Table Parser: I think the current parser (first implemented by Simone, then improved a bit by me) performs really well, but like all parsers it can be improved. For example, it could filter out tables that are used as legends instead of producing errors (many E2 errors depend on this aspect).
Note on languages: my project works on languages that use the Latin alphabet; Greek and Russian, for example, are not supported. If you need to extract data in those languages, you have to add them to the scripts.
If you want to read about all the steps taken in this two-year project, take a look at: