How To Contribute Links to DBpedia


About

Links are the key enabler for retrieval of related information on the Web of Data. Currently, DBpedia is one of the central interlinking hubs in the Linked Open Data (LOD) cloud. With over 28 million described and localized things, it is one of the largest open datasets. With the increasing number of linked datasets, there is a need for proper maintenance of these links. The DBpedia-Links repository maintains linksets between DBpedia and other LOD datasets. Systems for maintenance, update, and quality assurance of the linksets are in place and can be explored further.

We urge you to read not only the README but also this wiki carefully. If questions remain, please use the GitHub issue tracker. If you want to give us more feedback, feel free to use the DBpedia Discussion mailing list.

Disclaimer

In order to allow the widest possible dissemination, all data and code in this repository is to be treated as public domain or CC0. We assume that you are aware of this when contributing to the repository: your links and scripts will be re-used, hosted, and mixed with other data.

We expect that anybody using data from this repository will give proper attribution to the work of:

  • DBpedia as a community
  • This repository and its contributors as a whole
  • The individual contributions

However, we will send friendly emails instead of lawyers if we think attribution is not given properly.

How to contribute links

To contribute your links successfully, follow the guidelines, modalities, and structure described here.

The DBpedia-Links Lifecycle

(Figure: diagram of the DBpedia-Links lifecycle)

Guide

Please submit a GitHub pull request so that we can check your contribution.

  1. Choose an appropriate folder.
  2. We link by domain and subdomain, so please check whether your domain/subdomain already exists. Examples are:
    • viaf.org - links/dbpedia.org/viaf.org
    • lobid.org - links/dbpedia.org/lobid.org
    • lobid.org - links/xxx.dbpedia.org/de/lobid.org
  3. Submit the links. Please note that in this repo you can submit in one of the following ways:
  • [RECOMMENDED] a download link to a dump with the links (see the minimal metadata sketch after this list)

    • N-Triples, one triple per line, with the DBpedia URL as subject. You can use bzip2 or gzip compression.
    • The link should be provided in the metadata.ttl file as dbp:ntriplefilelocation
    • Multiple download links can be provided
    • Updates: all ntriplefilelocations are updated based on the Last-Modified header
    • Example metadata file of the Amsterdam Museum
  • [RECOMMENDED] a SPARQL endpoint from which they can be retrieved. In this scenario, the user should provide a SPARQL endpoint and one or several CONSTRUCT queries within the metadata file (see the minimal metadata sketch after this list)

    • DBpedia URL as subject
    • Updates: SPARQL results are cached for daily builds and refreshed every 7 days by default; please use dbp:updateFrequencyInDays to change the frequency.
    • Example metadata file of the data.camera.it entry
  • [RECOMMENDED] a script which is used to re-generate the linkset (a script will usually download and unpack the files, but in some cases it could implement more advanced logic for generating the links).

    • Parameters: the script has to take the output file as the first parameter; this allows us to call it in a way that we can find the output and integrate it.
    • The output file should have the DBpedia URL as subject
    • Updates: Scripts are run every 7 days unless otherwise stated in the dbp:updateFrequencyInDays property
    • We have only implemented calls to bash scripts at the moment. However, you can run Java or Python from this bash script.
    • Example in the scripts directory of the GADM entry
  • [RECOMMENDED] a configuration file (ttl or XML) for LIMES, a link discovery framework.

    • Use DBpedia as the source dataset
    • Use NT as the output format
    • The acceptance condition determines which triples will be included in the repository
    • An example LIMES configuration file can be found in the links-specs directory of the Airport entry
  • [NOT WORKING AT THE MOMENT] an XML configuration file for SILK

  • [DISCOURAGED] the links file itself is submitted with the git commit in the contribution.

    • Note that if you provide both a SPARQL endpoint and query and a static dump file, these will be merged.
    • The filename should be provided in the metadata.ttl file as dbp:ntriplefilelocation <filename.nt>
    • Updates: only you update the links, via git commits. As the links here are static, this form of contribution is discouraged
    • Example metadata file of the EUNIS entry
    • you can also provide multiple linksets per contribution, see, e.g., the metadata file of the www4.wiwiss.fu-berlin.de entry
  • [NOT WORKING AT THE MOMENT] a patch which provides all needed update information
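To make the two recommended options above concrete, here is a minimal, hypothetical metadata.ttl sketch that combines a dump download link with a SPARQL endpoint (all URLs, the file name, and the query are placeholders; a detailed example follows under Conventions below). As noted above, the dump file and the SPARQL results are merged when both are given:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dbp: <http://dbpedia.org/property/> .

<#1> a void:Linkset ;
    # [RECOMMENDED] download link to a dump with the links, DBpedia URL as subject
    dbp:ntriplefilelocation <http://example.org/dumps/example_links.nt.bz2> ;
    # [RECOMMENDED] SPARQL endpoint plus a CONSTRUCT query
    dbp:endpoint "http://example.org/sparql" ;
    dbp:constructquery "CONSTRUCT { ?s <http://www.w3.org/2002/07/owl#sameAs> ?o } WHERE { ?s <http://www.w3.org/2002/07/owl#sameAs> ?o . FILTER(REGEX(STR(?s), 'http://dbpedia.org')) }" ;
    # optional: override the default 7-day refresh interval
    dbp:updateFrequencyInDays "10" .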

Contribution Structure

Within the folder mentioned in step 2, please adhere to the following structure:

  • metadata.ttl - a metadata linkset file
  • README.md - a brief description of the linkset
  • scripts/ - (optional) a folder containing a script that produces a linkset
  • <name>_links.nt or <name>_links.nt.bz2 - (optional) a file containing the links
  • patches/ - (optional) a folder containing patches, including whitelists and/or blacklists of links
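Put together, a hypothetical contribution for example.org could look like this (all file names are placeholders):

links/dbpedia.org/example.org/
├── metadata.ttl
├── README.md
├── example_links.nt.bz2
├── scripts/
│   └── makeLinks.sh
└── patches/
    └── pat_1/
        └── patch.ttl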

Conventions

Please adhere to the following conventions for your contributions.

README.md

The README.md file is very important and should document who created the links and how the links were created.

metadata.ttl

The `metadata.ttl` is an integral part of each contribution; it provides a machine-readable description of the linkset contribution.

Here is a detailed example:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix ex: <http://example.org/> .

<#1> a void:Linkset ;
    dbp:script <scripts/makeLinks.sh> ;
    dbp:endpoint "http://example.org/sparql" ;
    dbp:constructquery "CONSTRUCT {?b <http://www.w3.org/2002/07/owl#sameAs> ?o} WHERE { ?o <http://www.w3.org/2002/07/owl#sameAs> ?b. FILTER(REGEX(STR(?b),'http://dbpedia.org')) }" ;
    dbp:ntriplefilelocation <links.nt> ;
    dbp:approvedPatch <patches/pat_1/patch.ttl> ;
    dbp:approvedPatch <patches/pat_2/patch.ttl> ;
    dbp:optionalPatch <patches/opt/unofficial.ttl> ;
    dbp:outputFile <out.nt> ;
    dbp:updateFrequencyInDays "10" ;
    void:objectsTarget <http://example.org/> ;
    dct:author ex:dave-horn ;
    dct:description "These links link DBpedia with the RichData dataset." ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .

<name>_links.nt

If you just have the link file, you can submit it to the appropriate folder. The file must:

  • be in N-Triples format
  • have the DBpedia IRI as subject
  • use either the <name>_links.nt or the compressed <name>_links.nt.bz2 naming scheme
  • If the file is larger than 200,000 triples or 20 MB, please compress it by using bzip2
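For example, to deduplicate and compress a finished link file before submission (file names are placeholders; the sort step is a suggestion, not a requirement):

sort -u example_links_raw.nt > example_links.nt
bzip2 example_links.nt   # produces example_links.nt.bz2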

**Example:**

[...]
<http://dbpedia.org/resource/Lardal> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/2_25145> .
<http://dbpedia.org/resource/Lardaro> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/3_35304> .
<http://dbpedia.org/resource/Lardirago> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/3_32118> .
<http://dbpedia.org/resource/Lardjem_District> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/2_13215> .
<http://dbpedia.org/resource/Larecaja_Province> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/2_3549> .
<http://dbpedia.org/resource/Laredo,_Cantabria> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/4_10790> .
<http://dbpedia.org/resource/Laredo,_La_Libertad,_Peru> <http://www.w3.org/2002/07/owl#sameAs> <http://gadm.geovocab.org/id/3_45524> .
[...]

link-specs/

You can submit XML configurations for SILK or LIMES. For SILK examples, take a look at www.geonames.org/link-specs

scripts/

This folder should contain a simple script that generates the link file. We use command-line Linux to run it, so bash is greatly preferred.

Furthermore, if your linkset generation needs to download more than 1 GB of data, please contact the maintainer.
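Because we call each script with the output file as its first parameter (see the Guide above), a minimal skeleton could look as follows; the download URL is a placeholder and the filtering is just one plausible cleanup step:

#!/bin/bash
# the harness passes the output file as the first parameter
OUTFILE="$1"

# hypothetical source dump: download, keep only triples with a DBpedia subject,
# then sort and deduplicate
curl -sL http://example.org/dumps/links.nt.bz2 | bzcat \
  | grep '^<http://dbpedia.org/' | sort -u > "$OUTFILE"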

Examples:

  1. Java program started with a shell script:
#!/bin/sh
#generate links to gadm
#first argument should contain DBpedia endpoint URL, otherwise the default http://dbpedia.org/sparql will be used
java -jar db2gadm.jar "$1"

#alphabetically sort the triples
sort -u gadm-linksRaw.nt > gadm-links.nt

#remove the initial file
rm gadm-linksRaw.nt
  2. Shell script downloading the links:
#!/bin/bash
# get the data from datahub.io and transform it to the expected links

rapper -i turtle http://lobid.org/download/dumps/DE-605/enrich/2dbpedia.ttl | perl -pe 's|(<.*?>).*(<.*?>).*|\2 <http://rdvocab.info/RDARelationshipsWEMI/manifestationOfWork> \1 .\n\2 <http://www.w3.org/2000/01/rdf-schema#seeAlso> \1 .\n\2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://vocab.org/frbr/core#Work> .|' | grep '^<http://dbpedia.org/' | sort -u  > manifestation_links.nt
# validation:
rapper -c -i ntriples manifestation_links.nt

  3. Shell script doing a SPARQL CONSTRUCT query to retrieve links:

#!/bin/bash
# get the data from a SPARQL endpoint of lobid. Filter out the
# links to de.dbpedia.org and convert the file to utf8 (not really needed
# by now, but may be in future).

curl -L -H "Accept: text/turtle"  --data-urlencode "query=
CONSTRUCT {
    ?o <http://umbel.org/umbel#isLike> ?s .
    ?o  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Organization> .
}
WHERE {
  graph <http://lobid.org/organisation/> {
  ?s <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?o.
  }
}
LIMIT 50000" http://aither.hbz-nrw.de:8000/sparql/ | grep -v "de.dbpedia" | sort -u > organisation_links_ascii.nt
native2ascii -encoding UTF-8 -reverse organisation_links_ascii.nt ../organisation_links.nt
rm organisation_links_ascii.nt

patches/

A patch can be the removal of invalid triples or the introduction of new triples. You can contribute such updates by providing a so-called "patch" within the patches/ folder:

  • each patch should be described with a simple metadata description file
  • provide a whitelist and/or a blacklist dump file in the N-Triples format (hypothetical file contents are sketched below, after the patch description example)
    • Whitelists should contain triples which should be added to the linkset
    • Blacklists should contain invalid triples which should be removed from the linkset

Example of the patch description file:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix dbp: <http://dbpedia.org/property/> .

<patch.ttl> a dbp:Patch ;
    dbp:whitelistFile <xx.wl.ttl> ;
    dbp:blacklistFile <xx.bl.ttl> ;
    dct:author <http://example.org/dave-horn> ;
    dct:description "Patch for the RichData." ;
    dct:license <http://creativecommons.org/publicdomain/zero/1.0/> .
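The whitelist and blacklist files referenced above are themselves plain N-Triples; hypothetical contents for xx.wl.ttl and xx.bl.ttl could look like this (the link targets are placeholders):

# xx.wl.ttl - a triple to be added to the linkset
<http://dbpedia.org/resource/Example> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/id/123> .

# xx.bl.ttl - an invalid triple to be removed from the linkset
<http://dbpedia.org/resource/Example> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/id/999> .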

Detailed patch example (prefix declarations omitted for brevity):

repo:MyPatch a pro:Patch ;
    pro:update [
        a guo:UpdateInstruction ;
        guo:target_subject dbpedia:Achille_Starace ;
        guo:delete [
          owl:sameAs <http://dati.camera.it/ocd/persona.rdf/p9970> ]
    ] ;
    pro:update [
        a guo:UpdateInstruction ;
        guo:target_subject dbpedia:Achille_Starace ;
        guo:insert [
            owl:sameAs <http://dati.camera.it/ocd/persona.rdf/pr11089> ]
    ] ;
    pro:appliesTo <http://example.org/void.ttl> ;
    prov:wasGeneratedBy [
        a prv:DataCreation ;
        prv:involvedActor repo:Author ;
        prv:performedAt "2016-04-20T21:32:52"^^xsd:dateTime  ] .

Submission of alternate classifications

Submit the N-Triples file compressed as bzip2 to the types folder.

  1. create a folder with the domain the types are from
  2. the file should be in N-Triples format and should only contain triples having rdf:type as property

Example:

[...]
<http://dbpedia.org/resource/Marino_Defendi> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/SoccerPlayer> .
<http://dbpedia.org/resource/Marino_de_Luanco> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/Club_Organization> .
<http://dbpedia.org/resource/Marino_Lejarreta> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/Athlete> .
<http://dbpedia.org/resource/Marino_Lejarreta> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/Cyclist> .
<http://dbpedia.org/resource/Marin_Or%C5%A1uli%C4%87> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/SoccerPlayer> .
<http://dbpedia.org/resource/Marin_Organic> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/NonProfitOrganization> .
<http://dbpedia.org/resource/Marino_Salas> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/BaseballPlayer> .
<http://dbpedia.org/resource/Marinos_Ouzounidis> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/SoccerCoach> .
<http://dbpedia.org/resource/Marinosphaera> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://umbel.org/umbel/rc/Fungus> .
[...]

For the full file, you can go here.

Automated process

An automated process generates links for all entries in the DBpedia-links repository, using this script.

For entries that use dynamic methods (SPARQL, script, LIMES), links are generated according to an update interval, which is either specified in the metadata (dbp:updateFrequencyInDays) or assigned a default value of seven days. For external dump files and static links, we use a HEAD request to check their Last-Modified header and update accordingly.
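As a sketch of the Last-Modified check described above (not the actual maintenance script; the URL and cache file are placeholders):

#!/bin/bash
# re-download a dump only if its Last-Modified header has changed
URL=http://example.org/dumps/example_links.nt.bz2
NEW=$(curl -sIL "$URL" | grep -i '^Last-Modified:')
OLD=$(cat last-modified.cache 2>/dev/null)
if [ "$NEW" != "$OLD" ]; then
    curl -sL "$URL" -o example_links.nt.bz2
    echo "$NEW" > last-modified.cache
fi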

The script, which runs as a cronjob, generates a daily snapshot that captures the state of the linksets in the repository, as well as monthly releases.