Serialization format properties

Andrea Di Menna edited this page Feb 5, 2014 · 27 revisions

This page explains the meaning of the lines that start with uri-policy. and format. (each prefix ends with a dot) in extraction configuration files such as extraction.default.properties.

Note: Unless otherwise noted, in this text the term URI also includes IRI.

URI policies

URI policies modify the URI serialization format in four ways (as of May 2013 - there may be more in the future):

  1. uri: Write URIs, not IRIs. IRIs are the default. URIs are hardly human-readable for many languages because non-ASCII characters have to be escaped.
  2. generic: Use the generic DBpedia domain, i.e. http://dbpedia.org/... instead of local DBpedia domains like http://de.dbpedia.org/... Local domains are the default.
  3. xml-safe: Make URIs safe for use in RDF/XML by adding an underscore at the end of URIs that could otherwise not be serialized as predicates in RDF/XML. Only necessary for the predicate position, i.e. xml-safe-predicates. By default, no URIs are made safe for RDF/XML.
  4. reject-long: Reject URIs longer than 500 characters. Overly long URIs are most often caused by faulty extraction.

By appending one of the following positions to a policy prefix, the user can specify to which URIs in a triple/quad the policy should be applied. If no suffix is given, the policy is applied to all URIs.

  1. -subjects: apply policy only to subject URIs
  2. -predicates: apply policy only to predicate URIs
  3. -objects: apply policy only to object URIs
  4. -datatypes: apply policy only to datatype URIs
  5. -contexts: apply policy only to context URIs (only useful if an RDF quad format is used, ignored otherwise)
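The combination of policy prefix and optional position suffix can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above, not the framework's own parser; the function name split_specifier is hypothetical.

```python
# Hypothetical helper, not part of the DBpedia extraction framework.
# Policy and position names are the ones listed on this page.
POLICIES = ("uri", "generic", "xml-safe", "reject-long")
POSITIONS = ("subjects", "predicates", "objects", "datatypes", "contexts")


def split_specifier(specifier):
    """Split e.g. 'xml-safe-predicates' into ('xml-safe', 'predicates').

    A position of None means the policy applies to all URIs in a triple/quad.
    """
    for position in POSITIONS:
        suffix = "-" + position
        if specifier.endswith(suffix):
            policy = specifier[: -len(suffix)]
            break
    else:
        # No position suffix was found: the policy applies everywhere.
        policy, position = specifier, None
    if policy not in POLICIES:
        raise ValueError("unknown URI policy: " + specifier)
    return policy, position
```

For example, split_specifier("xml-safe-predicates") yields ("xml-safe", "predicates"), while split_specifier("reject-long") yields ("reject-long", None).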

URI policy specifiers like xml-safe-predicates are used as keys, while their values define for which languages a policy should be used. An asterisk * means that a policy is used for all languages. Note: The article-count ranges and the @mappings keyword, which can specify language sets in other DBpedia configuration properties, cannot be used here.

If the URI policy value is empty, the default policies are used: IRIs with local domains and arbitrary length, no xml-safe predicates.

Examples for key-value-pairs

uri:en; generic:en

means that the uri and generic policies should be applied to URIs in all triple/quad positions, but only for language en. In other words, English DBpedia quads are serialized using the generic domain and URIs (not IRIs), e.g. http://dbpedia.org/resource/%C3%85hus, while all other languages use the default policies, namely local domains and IRIs (not URIs), e.g. http://de.dbpedia.org/resource/Åhus.
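The effect of the two policies on a single IRI can be approximated with the Python standard library. This is a rough sketch under stated assumptions: the function names to_generic and to_uri are hypothetical, the set of characters left unescaped is a guess, and the framework's actual escaping rules may differ.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_generic(iri):
    """Replace a local DBpedia domain such as de.dbpedia.org by dbpedia.org."""
    parts = urlsplit(iri)
    host = parts.netloc
    if host.endswith(".dbpedia.org"):
        host = "dbpedia.org"
    return urlunsplit(parts._replace(netloc=host))


def to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 bytes.

    Assumption: reserved ASCII characters and existing '%' escapes are kept.
    """
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")


iri = "http://de.dbpedia.org/resource/Åhus"
print(to_uri(to_generic(iri)))  # http://dbpedia.org/resource/%C3%85hus
```

Applying only to_generic keeps the IRI form (http://dbpedia.org/resource/Åhus), which corresponds to a policy that lists generic but not uri.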

xml-safe-predicates:*

means that the xml-safe policy should be applied for all languages, but only to predicate URIs, not to subjects etc.

Examples for complete lines

uri-policy.uri=uri:en; generic:en; xml-safe-predicates:*
uri-policy.iri=generic:en; xml-safe-predicates:*

The part after uri-policy. gives each URI policy a unique name; in this case the full names are uri-policy.uri and uri-policy.iri. These URI policy names can be used in the format configuration lines. See below.
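To make the structure of such a value concrete, the following Python sketch splits it into specifiers and languages. It is not the framework's parser. The function names are hypothetical, and the comma as separator between several languages is an assumption, since this page only shows single languages and the asterisk.

```python
def parse_uri_policy(value):
    """Parse 'uri:en; generic:en; xml-safe-predicates:*' into a dict.

    Keys are policy specifiers, values are lists of language codes.
    Assumption: several languages would be separated by commas.
    """
    policies = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # empty value or trailing semicolon
        specifier, _, languages = entry.partition(":")
        policies[specifier.strip()] = [
            lang.strip() for lang in languages.split(",") if lang.strip()
        ]
    return policies


def policy_applies(languages, language):
    """Return True if a policy with these languages covers the given one."""
    return "*" in languages or language in languages


policy = parse_uri_policy("uri:en; generic:en; xml-safe-predicates:*")
print(policy_applies(policy["uri"], "de"))                  # False
print(policy_applies(policy["xml-safe-predicates"], "de"))  # True
```

An empty value produces an empty dict, which corresponds to using the default policies.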

Formats

The format configuration lines define which files are written and which RDF format (N-Triples, N-Quads, etc.), URI policy and compression they use.

The part after format. is the file suffix. If the last part of the suffix is .gz or .bz2, the file will be compressed using gzip or bzip2. Other parts of the suffix, e.g. .nt or .ttl, have no meaning for the configuration, but should match the given RDF format.

The value of a format. property has two parts, separated by a semicolon: the RDF format and the name of a URI policy.

The following RDF formats are available (as of April 2013 - there may be more in the future):

  1. n-triples: write N-Triples. Not human-readable for many languages because non-ASCII characters have to be escaped. Recommended filename extension: .nt.
  2. n-quads: write N-Quads. Not human-readable for many languages because non-ASCII characters have to be escaped. Recommended filename extension: .nq.
  3. turtle-triples: write Turtle triples. Like N-Triples, but human-readable because very few Unicode characters have to be escaped. Recommended filename extension: .ttl.
  4. turtle-quads: write Turtle quads. Like N-Quads, but human-readable because very few Unicode characters have to be escaped. Recommended filename extension: .tql.
  5. trix-triples: write TriX triples.
  6. trix-quads: write TriX quads. The context URI is given as the graph URI.
  7. rdf-json: write RDF/JSON triples. Support is experimental and currently available only through the live module.

It is an error if the URI policy name is not defined in the current settings.

Examples for complete lines

format.nq.bz2=n-quads;uri-policy.iri

means that extracted triples/quads are written to bzip2-compressed files with suffix .nq.bz2 as N-Quads, using URI policy uri-policy.iri.

format.nt.gz=n-triples;uri-policy.uri

means that extracted triples/quads are written to gzip-compressed files with suffix .nt.gz as N-Triples, using URI policy uri-policy.uri.
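The way a format line breaks down into suffix, compression, RDF format and URI policy can be sketched in Python as follows. This is an illustration only, not the framework's implementation. The function name parse_format and the returned dict layout are hypothetical, and treating a missing policy name as "use the defaults" is an assumption.

```python
def parse_format(key, value, policy_names):
    """Split a line such as format.nq.bz2=n-quads;uri-policy.iri into parts.

    policy_names is the set of URI policy names defined in the same settings.
    """
    prefix = "format."
    if not key.startswith(prefix):
        raise ValueError("not a format property: " + key)
    suffix = key[len(prefix):]

    # Only the last part of the suffix selects the compression.
    if suffix.endswith(".gz"):
        compression = "gzip"
    elif suffix.endswith(".bz2"):
        compression = "bzip2"
    else:
        compression = None

    rdf_format, _, policy = (part.strip() for part in value.partition(";"))
    # Assumption: an absent policy name means the default policies.
    if policy and policy not in policy_names:
        raise ValueError("undefined URI policy: " + policy)

    return {
        "suffix": suffix,
        "compression": compression,
        "rdf_format": rdf_format,
        "policy": policy or None,
    }


defined = {"uri-policy.uri", "uri-policy.iri"}
print(parse_format("format.nq.bz2", "n-quads;uri-policy.iri", defined))
```

Passing a policy name that is not in the defined set raises a ValueError, mirroring the rule that an undefined URI policy name is an error.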