Luca Virgili edited this page Aug 26, 2017 · 23 revisions

Introduction to table extractor

Wikipedia is full of data hidden in tables. The aim of this project is to explore ways of exploiting the data presented as tables in Wiki pages, in order to populate the different chapters of DBpedia with new data of interest. The Table Extractor is meant to be the engine of this data “revolution”: its final purpose is to extract the semi-structured data from all the tables now scattered across most Wiki pages.

Idea behind table extractor

As said previously, Wikipedia is full of data hidden in tables, so we need general-purpose software that can run over different domains and languages. That is the target of this year's project. (By domain we mean a set of resources, which can range from basketball players to TV series.) To reach this objective, I need help from the user in order to know all the mapping rules defined for a particular domain. (Remember that a mapping rule is an association between a table's header and a DBpedia ontology property.) I have built two modules, which can be summarized as follows:

  • pyDomainExplorer: this module explores the resources of the domain chosen by the user (a single resource, a DBpedia mapping class or a SPARQL where clause) and collects every section and table it meets along the way. It then writes an output file (called domain_settings.py by default and created in the domain_explorer folder) that contains all the sections (grouped to simplify the user's work) and the tables' headers. When you open this file you can see that each section has its own dictionary to be filled in. Some fields are already filled: this means that pyDomainExplorer has found that header in the pyTableExtractor dictionary, or that there is a DBpedia property with the same name. After you have written the mapping rules you need, you can start pyTableExtractor (which doesn't need any parameters).
  • pyTableExtractor: this module takes its parameters (such as the chapter or the output format value) from domain_settings.py. It then reads all the rules defined by the user and updates its own dictionary. After that, it picks up the resources analyzed by pyDomainExplorer and extracts all the tables from their Wikipedia pages. With a simple lookup in the dictionary of mapping rules it can create RDF triples: first it creates a bridge between the current resource and the table's rows (here I'm using the sectionProperty of domain_settings), and then it maps each row's data to the respective table row.
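
The two-step triple creation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name row_triples, the flat mapping_rules dictionary, the "PPG" header and the start_index parameter are all hypothetical, and triples are returned as plain tuples rather than through an RDF library. The row URI pattern (resource + "__" + number) follows the Larry_Bird__10 example shown later on this page.

```python
# Hypothetical sketch: names and data structures are illustrative only.
RESOURCE_NS = "http://dbpedia.org/resource/"


def row_triples(resource, section_property, rows, mapping_rules, start_index=1):
    """Return (subject, predicate, object) tuples for the rows of one table."""
    triples = []
    subject = RESOURCE_NS + resource
    for offset, row in enumerate(rows):
        # Step 1: bridge between the page resource and a per-row resource.
        row_uri = "{}{}__{}".format(RESOURCE_NS, resource, start_index + offset)
        triples.append((subject, section_property, row_uri))
        # Step 2: attach every cell that has a mapping rule to the row resource.
        for header, value in row.items():
            prop = mapping_rules.get(header)
            if prop is None:
                continue  # assumption: cells without a mapping rule are skipped
            triples.append((row_uri, prop, value))
    return triples


if __name__ == "__main__":
    rules = {"Year": "Year", "PPG": "pointsPerGame"}  # "PPG" is a made-up header
    rows = [{"Year": "1984-85", "PPG": "28.7"}]
    for triple in row_triples("Larry_Bird", "regularSeason", rows, rules, start_index=10):
        print(triple)
```

Keeping the bridge triple separate from the cell triples mirrors the two counters in the log report (rows serialized and cells serialized).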

Table Extractor features

  • It checks the parameters given by the user in order to explore and extract the domain correctly (for example, I verify whether the mapping class written by the user exists in DBpedia).

  • The lists of resources involved in your extractions are stored, to help the user understand whether the targeted scope has been hit correctly.

  • domain_settings.py contains a lot of comments to help you fill in all the required fields and so obtain a more effective extraction process.

  • Mapping rules are easy to write thanks to the intuitive structure of domain_settings.py ("table's header":"ontology property"). You are also given an example Wikipedia page where a particular section was found.

  • Both modules produce log files that contain information about the exploration and extraction of the domains selected by the user. You can see each operation for each resource involved.

  • domain_settings.py allows you to analyze any domain in several languages.

  • At the end of each log file there is a statistics section that reports how well the exploration and extraction went over the domain.

  • You can customize the scripts to your needs through settings.py. For example, you can enable or disable the filter on table data, and you can also disable the check on properties written by the user.

  • Use the output format parameter to change the output of pyDomainExplorer to suit your work. (I recommend using output=2 and Notepad++ to fill in domain_settings.py.)

Example of result

Basketball domain - English

pyDomainExplorer log file

REPORT:
Total # resources analyzed: 100
Total # tables found : 83
Total # tables analyzed : 83
Total # of rows extracted: 534
Total # of data cells extracted : 5788
Total # of exceptions extracting data : 0
Total # of 'header not resolved' errors : 0
Total # of 'no headers' errors : 9

pyTableExtractor log file

REPORT:
Total # resources analyzed: 100
Total # tables found : 83
Total # tables analyzed : 83
Total # of rows extracted: 534
Total # of data cells extracted : 5788
Total # of exceptions extracting data : 0
Total # of 'header not resolved' errors : 0
Total # of 'no headers' errors : 9
Total # of 'no mapping rule' errors for section : 0
Total # of 'no mapping rule' errors for headers : 0
Total # of data cells extracted that needs to be mapped: 5788
Total # of table's rows triples serialized : 534
Total # of table's cells triples serialized : 5788
Total # of triples serialized : 6322
Percentage of mapping effectiveness  : 1.000

Note that the effectiveness of the project mostly depends on how many properties are written in the domain_settings.py file. (Effectiveness is calculated as the ratio of table-cell triples serialized to data cells extracted that need to be mapped.)
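
As a quick check, the figures in the pyTableExtractor report above can be reproduced with a few lines. This is a minimal sketch: the function name mapping_effectiveness is hypothetical, the direction of the ratio (serialized cells over cells to be mapped) is assumed because it is the reading that stays at or below 1, and returning 0.0 when there is nothing to map is my own assumption.

```python
def mapping_effectiveness(cells_to_map, cell_triples_serialized):
    """Fraction of the cells needing a mapping that ended up as triples."""
    if cells_to_map == 0:
        return 0.0  # assumption: nothing to map counts as zero effectiveness
    return float(cell_triples_serialized) / cells_to_map


# Figures copied from the basketball report above.
row_triples_serialized = 534
cell_triples_serialized = 5788
cells_to_map = 5788

total_triples = row_triples_serialized + cell_triples_serialized
print(total_triples)  # 6322
print("{:.3f}".format(mapping_effectiveness(cells_to_map, cell_triples_serialized)))  # 1.000
```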

Example of reification

<http://dbpedia.org/resource/Larry_Bird> ns1:regularSeason <http://dbpedia.org/resource/Larry_Bird__10>,
        <http://dbpedia.org/resource/Larry_Bird__5>,
        <http://dbpedia.org/resource/Larry_Bird__6>,
        <http://dbpedia.org/resource/Larry_Bird__7>,
        <http://dbpedia.org/resource/Larry_Bird__8>,
        <http://dbpedia.org/resource/Larry_Bird__9> .

Example of RDF triple

<http://dbpedia.org/resource/Larry_Bird__10> ns1:Year "1984–85"^^xsd:string ;
    ns1:assistsPerGame "6.6"^^xsd:float ;
    ns1:blocksPerGame "1.2"^^xsd:float ;
    ns1:fieldGoal "0.522"^^xsd:float ;
    ns1:freeThrow "0.882"^^xsd:float ;
    ns1:gamesPlayed "80.0"^^xsd:float ;
    ns1:gamesStarted "77.0"^^xsd:float ;
    ns1:minutesPerGame "39.5*"^^xsd:string ;
    ns1:pointsPerGame "28.7"^^xsd:float ;
    ns1:reboundsPerGame "10.5"^^xsd:float ;
    ns1:stolePerGame "1.6"^^xsd:float ;
    ns1:team <http://dbpedia.org/resource/Boston> ;
    ns1:threePoints "0.427"^^xsd:float .
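
In the example above, numeric cells carry xsd:float (even whole numbers such as 80.0), while cells that are not plain numbers (such as "39.5*" or "1984–85") stay xsd:string. One plausible way to obtain this kind of typing is sketched below. It is only an assumption drawn from the output shown here, not a description of the project's code, and typed_literal is a hypothetical name.

```python
def typed_literal(cell):
    """Return (lexical value, datatype) for one table cell."""
    text = cell.strip()
    try:
        # Any cell that parses as a number is emitted as a float literal.
        return (str(float(text)), "xsd:float")
    except ValueError:
        # Everything else (footnote markers, ranges, names) stays a string.
        return (text, "xsd:string")


if __name__ == "__main__":
    for cell in ["80", "0.522", "39.5*"]:
        print(typed_literal(cell))
```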

Small digression on searching mapping rules

I think it is worth explaining how the Mapper class searches for mapping rules. In my work there are two types of rule for table headers: one is "header":"property" and the other is "section name + _ + header":"property". When Mapper analyzes a header, it first searches the dictionary for a key named section name + _ + header. If it doesn't find such a key, it searches again for the header string alone. In this way, if the user hasn't defined a strict rule (section name + _ + header), I fall back to a less strict rule (header only) that may have been defined previously in another exploration or extraction.

Which kind of rule (strict or less strict) is written depends on the output format parameter.

-f equal to 1 defines only strict rules, while -f equal to 2 writes less strict rules.
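
The lookup order described in this section can be sketched as follows. This is a minimal illustration, not the actual Mapper class: the function name find_property, the flat dictionary and the example section and header names are hypothetical.

```python
def find_property(mapping_rules, section, header):
    """Look up a header: strict rule first, then the less strict one."""
    strict_key = section + "_" + header
    if strict_key in mapping_rules:
        return mapping_rules[strict_key]
    # Fall back to a header-only rule; returns None if no rule exists at all.
    return mapping_rules.get(header)


if __name__ == "__main__":
    rules = {
        "Regular season_Team": "team",  # strict rule (section + _ + header)
        "Year": "Year",                 # less strict rule (header only)
    }
    print(find_property(rules, "Regular season", "Team"))  # team
    print(find_property(rules, "Playoffs", "Year"))        # Year
    print(find_property(rules, "Playoffs", "Team"))        # None
```

Trying the strict key first means a section-specific rule always wins over a generic header rule with the same header text.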

Future development

In my opinion there are two aspects of the project that can be improved:

  • Make the user's work easier: I have already added comments, progress bars and a search over DBpedia ontology properties, but there may be other ways to help the user fill in the domain_settings.py file.
  • HTML table parser: I think the current parser (first implemented by Simone, then slightly improved by me) performs really well but, like all parsers, it can be improved. For example, it could filter out tables that are used as legends instead of raising errors (many E2 errors depend on this).

Note on languages: my project works on languages that use the Latin alphabet. Greek and Russian, for example, are not supported. If you need to extract data in those languages, you have to add them to my scripts.

Progress

If you want to read about all the steps done in this two-year project, take a look at:
