Sarven Capadisli edited this page Feb 15, 2013 · 27 revisions

Before you start

You are reading this because you have some familiarity with SDMX-ML or the RDF Data Cube vocabulary. Some knowledge of Linked Data practices, XML, and XSLT would be handy as well.

What can it do?

Given Generic SDMX-ML data or metadata as input, the XSLT 2.0 templates transform it to RDF/XML. The output uses vocabularies such as RDF Data Cube, SDMX-RDF, SKOS, XKOS, and PROV-O.

The transformation follows some common Linked Data practices as well as other ones out of thin air :) If you disagree or would like to propose alternatives, please either contact me or better yet, create an issue. Relevant changes will then be reflected here.

Configuration

The scripts/config.rdf file is used to configure the transformations. Here is an outline of the noteworthy settings used by the templates.

Agency identifiers and URIs

agencies.ttl is used to track some of the mappings for maintenance agencies. It includes the maintenance agency's (i.e., the SDMX publisher's) identifier as it appears in the SDMX Registry, as well as the base URI for that agency. This file allows references to external agency identifiers to be resolved to their base URIs and used in the transformations. Currently, an agency is recognised either as "SDMX" or as some agency that publishes the actual statistics.

In the case of SDMX, when there is a reference to SDMX CodeLists and Codes, it is typically indicated by the component agency being set to SDMX, e.g., codelistAgency="SDMX" on a structure:Component and/or agencyID="SDMX" on a CodeList with id="CL_FREQ". When this is detected, the corresponding URIs from the SDMX-RDF vocabulary are used, e.g., http://purl.org/linked-data/sdmx/2009/code#freq for metadata and http://purl.org/linked-data/sdmx/2009/code#freq-A for data.

Similarly, an agency might use some other agency's codes. By following the same URI pattern conventions, the agency file is used to find the corresponding base URI in order to make a reference. For example, here is a coded property that's used by European Central Bank (4F0) to associate a code list that's defined by Eurostat (4D0):

<http://4F0.270a.info/property/OBS_STATUS>
     <http://purl.org/linked-data/cube#codeList> <http://4D0.270a.info/code/CL_OBS_STATUS> .

Naturally, the transformation does not re-define metadata that's from an external agency as the owners of the data would define them under their authority.
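The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not the XSLT that the templates use: the dictionaries are hypothetical stand-ins for agencies.ttl and for the SDMX-RDF code list names, and the function name code_list_uri is made up for the example.

```python
# Hypothetical stand-in for agencies.ttl: agency identifier -> base URI.
AGENCY_BASE_URIS = {
    "4F0": "http://4F0.270a.info/",  # European Central Bank
    "4D0": "http://4D0.270a.info/",  # Eurostat
}

# Namespace of the SDMX-RDF code vocabulary, as used in the text above.
SDMX_CODE_NS = "http://purl.org/linked-data/sdmx/2009/code#"

# Assumption: an explicit table from SDMX CodeList ids to SDMX-RDF names.
# Only the CL_FREQ entry is taken from the text above.
SDMX_CODE_LISTS = {"CL_FREQ": "freq"}


def code_list_uri(agency_id, code_list_id, thing_separator="/"):
    """Return the URI to reference for a code list owned by agency_id."""
    if agency_id == "SDMX":
        # SDMX-maintained lists point at the SDMX-RDF vocabulary.
        return SDMX_CODE_NS + SDMX_CODE_LISTS[code_list_id]
    # Any other agency: look up its base URI and follow the same pattern.
    base = AGENCY_BASE_URIS[agency_id]
    return f"{base}code{thing_separator}{code_list_id}"


ecb_obs_status_code_list = code_list_uri("4D0", "CL_OBS_STATUS")
```

An unknown agency identifier raises a KeyError here; how the real templates handle a missing agency is not covered by this sketch.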

URI configurations

Base URIs can be set for classes, codelists, concept schemes, datasets, slices, properties, provenance, as well as for the source SDMX data.

The value for uriThingSeparator, e.g., /, lets one set the delimiter that separates the "thing" from the rest of the URI. In the Linked Data community, this is typically either a / or a #. For example, if slash is used, a URI would end up like http://example.org/code/CL_GEO (note the last slash before CL_GEO). If hash is used, a URI would end up like http://example.org/code#CL_GEO.

Similarly, uriDimensionSeparator can be set to separate the dimension values that are used in RDF Data Cube observation URIs. As each observation should have its own unique URI, the URI is constructed by taking the dimension values as safe terms and joining them with the value of uriDimensionSeparator. For example, here is a crazy looking observation URI where uriDimensionSeparator is set to /: http://example.org/dataset/DSD_T_PERSON_STATTAB-01-2A01/5938/1/15497/4/21/1/2011/2011-12-31. But with uriThingSeparator set to # and uriDimensionSeparator set to -, it could end up like http://example.org/dataset/DSD_T_PERSON_STATTAB-01-2A01#5938-1-15497-4-21-1-2011-2011-12-31. If you are wondering about DSD_T_PERSON_STATTAB-01-2A01, that's the KeyFamily (DSD) id, and http://example.org/dataset/ would be the value that can be set in config for the base URI for dataset.
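To make the two separators concrete, here is a minimal Python sketch that rebuilds the observation URIs from the paragraph above. The function name observation_uri and its parameters are made up for illustration, and percent-encoding is only an assumption about what "safe terms" means; the actual templates do this in XSLT.

```python
from urllib.parse import quote


def observation_uri(dataset_base, key_family_id, dimension_values,
                    thing_separator="/", dimension_separator="/"):
    """Build an observation URI from a dataset base URI, DSD id and dimension values."""
    # Assumption: "safe terms" is approximated by percent-encoding each value.
    safe_values = [quote(str(value), safe="") for value in dimension_values]
    return (dataset_base + key_family_id + thing_separator
            + dimension_separator.join(safe_values))


dimensions = ["5938", "1", "15497", "4", "21", "1", "2011", "2011-12-31"]

slash_uri = observation_uri(
    "http://example.org/dataset/", "DSD_T_PERSON_STATTAB-01-2A01", dimensions)

hash_uri = observation_uri(
    "http://example.org/dataset/", "DSD_T_PERSON_STATTAB-01-2A01", dimensions,
    thing_separator="#", dimension_separator="-")
```

With these inputs, slash_uri and hash_uri come out as the two example URIs given in the paragraph above.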

The creator's URI can also be set; it is also used for the provenance data.

Default to language

A default xml:lang can be forced on skos:prefLabel and skos:definition when the language is not present in the source data. If config.rdf contains a non-empty lang value, it is used, e.g.:

<rdf:Description>
    <rdf:value>en</rdf:value>
    <rdfs:label>lang</rdfs:label>
</rdf:Description>

Default language may also be applied in the case of Annotations. See Interlinking SDMX Annotations for example.

Interlinking SDMX Annotations

SDMX Annotations contain important information that can be put to use by the publisher. Data in AnnotationTypes typically follow a publisher's internal conventions; hence, there is no standardization on how they are used across SDMX publishers. In order not to leave this information behind in the final transformation, the configuration allows publishers to define the way they should be transformed. This is done by setting interlinkAnnotationTypes: the AnnotationType to detect (in rdfs:label), the predicate (as an XML QName) to use (in rdf:predicate), and the instances of Concepts or Codes to apply to (in rdf:type). Currently this feature is only applied to Annotations in Concepts and Codes. For example, given the following SDMX snippet:

<structure:CodeList id="CL_HGDE_GDE" agencyID="CH1_RN">
  <structure:Code value="13256">
    <structure:Description>Aeugst am Albis</structure:Description>
    <structure:Annotations>
      <common:Annotation>
        <common:AnnotationType>CODE_OFS</common:AnnotationType>
        <common:AnnotationText>1</common:AnnotationText>
      </common:Annotation>
      <common:Annotation>
        <common:AnnotationType>ABBREV</common:AnnotationType>
        <common:AnnotationText>A.a.A.</common:AnnotationText>
      </common:Annotation>
      <common:Annotation>
        <common:AnnotationType>REC_TYPE</common:AnnotationType>
        <common:AnnotationTitle>11</common:AnnotationTitle>
      </common:Annotation>
    </structure:Annotations>
  </structure:Code>
</structure:CodeList>

and the following configuration in config.rdf:

<rdf:value>
  <rdf:Description>
    <rdf:value>http://example.org/property/</rdf:value>
    <rdfs:label>property</rdfs:label>
  </rdf:Description>
</rdf:value>

<rdf:value>
  <rdf:Description>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>xkos:hasPart</rdf:predicate>
        <rdf:type>CODE_OFS</rdf:type>
        <rdfs:label>AnnotationText</rdfs:label>
        <rdfs:range>http://example.org/code/CL_HGDE_GDE</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>skos:altLabel</rdf:predicate>
        <rdf:type>ABBREV</rdf:type>
        <rdfs:label>AnnotationText</rdfs:label>
        <rdfs:range>Literal</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdf:value>
      <rdf:Description>
        <rdf:predicate>property:GDE_GARTE</rdf:predicate>
        <rdf:type>REC_TYPE</rdf:type>
        <rdfs:label>AnnotationTitle</rdfs:label>
        <rdfs:range>http://example.org/code/CL_HGDE_MODALITY</rdfs:range>
      </rdf:Description>
    </rdf:value>
    <rdfs:label>interlinkAnnotationTypes</rdfs:label>
  </rdf:Description>
</rdf:value>

would result in the final RDF/XML transformation like:

<rdf:Description rdf:about="http://example.org/code/CL_HGDE_GDE/13256">
  <xkos:hasPart rdf:resource="http://example.org/code/CL_HGDE_GDE/1"/>
  <skos:altLabel>A.a.A.</skos:altLabel>
  <property:GDE_GARTE rdf:resource="http://example.org/code/CL_HGDE_MODALITY/11"/>
</rdf:Description>

Only the AnnotationTypes with a corresponding configuration will be applied, and unspecified ones will be skipped.

If the default language had been set, the output would have contained xml:lang="{$lang}".
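The mapping in this example can be sketched in Python as below. This is an illustration only, not the XSLT implementation. It reads the example config as follows, which is an assumption drawn from the example itself: rdf:type names the AnnotationType to match, rdfs:label names the element that holds the value, and rdfs:range is either Literal or the code list URI that the value is appended to. The names INTERLINK_RULES and interlink are hypothetical.

```python
# Hypothetical in-memory form of the interlinkAnnotationTypes example config.
INTERLINK_RULES = {
    "CODE_OFS": {"predicate": "xkos:hasPart", "field": "AnnotationText",
                 "range": "http://example.org/code/CL_HGDE_GDE"},
    "ABBREV": {"predicate": "skos:altLabel", "field": "AnnotationText",
               "range": "Literal"},
    "REC_TYPE": {"predicate": "property:GDE_GARTE", "field": "AnnotationTitle",
                 "range": "http://example.org/code/CL_HGDE_MODALITY"},
}


def interlink(subject, annotations, rules=INTERLINK_RULES,
              thing_separator="/", lang=None):
    """Turn configured annotations into (subject, predicate, object) triples."""
    triples = []
    for annotation in annotations:
        rule = rules.get(annotation.get("AnnotationType"))
        if rule is None:
            continue  # AnnotationTypes without a configuration are skipped
        value = annotation.get(rule["field"])
        if value is None:
            continue  # the configured element is missing from this annotation
        if rule["range"] == "Literal":
            obj = ("literal", value, lang)
        else:
            obj = ("uri", rule["range"] + thing_separator + value, None)
        triples.append((subject, rule["predicate"], obj))
    return triples


annotations = [
    {"AnnotationType": "CODE_OFS", "AnnotationText": "1"},
    {"AnnotationType": "ABBREV", "AnnotationText": "A.a.A."},
    {"AnnotationType": "REC_TYPE", "AnnotationTitle": "11"},
    {"AnnotationType": "UNCONFIGURED", "AnnotationText": "ignored"},
]
triples = interlink("http://example.org/code/CL_HGDE_GDE/13256", annotations)
```

Passing lang="en" attaches the language to literal objects only, which mirrors the default-language behaviour mentioned above.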

Omitting components

There are cases in which certain parts of the data contain errors. To get around this until the data is fixed at the source, without giving up on the rest of the data at hand and without making any significant assumptions about or changes to the remaining data, omitComponents is a configuration option to explicitly skip over those parts. For example, if the Attribute values in a DataSet don't correspond to coded values (say, because they contain whitespace), they can be skipped without damaging the rest of the data. This obviously gives up on precision in favour of still making use of the data. The configuration looks like this (in Turtle):

[ rdfs:label "omitComponents" ;
    rdf:value [ rdf:type "structure:Attribute" ;
                rdf:value "UNIT"
    ]
]

See also issue #30.
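The effect of this setting can be pictured with a small Python sketch. It is only an illustration under assumptions: the component dictionaries, the OMIT_COMPONENTS set, and the drop_omitted function are hypothetical and do not reflect how the XSLT templates represent data internally.

```python
# Hypothetical in-memory form of the omitComponents setting:
# (component type, component id) pairs that should be skipped.
OMIT_COMPONENTS = {("structure:Attribute", "UNIT")}


def drop_omitted(components, omit=OMIT_COMPONENTS):
    """Return only the components that are not listed in omit."""
    return [c for c in components if (c["type"], c["id"]) not in omit]


observation_components = [
    {"type": "structure:Dimension", "id": "FREQ", "value": "A"},
    {"type": "structure:Attribute", "id": "UNIT", "value": " CHF "},
    {"type": "structure:Attribute", "id": "OBS_STATUS", "value": "A"},
]
kept = drop_omitted(observation_components)
```

Only the UNIT attribute is dropped; the remaining components pass through unchanged.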

Vocabularies

Besides the common vocabularies (RDF, RDFS, OWL, XSD), the RDF Data Cube vocabulary is used to describe multi-dimensional statistical data, and SDMX-RDF for the statistical information model. PROV-O is used for provenance coverage. And of course SKOS and XKOS are used to cover concepts, concept schemes and their relationships to one another. XKOS is currently applied primarily for hierarchical lists here (I hope I understood the vocabulary correctly).

Provenance

There is provenance level data:

Resources of type qb:DataStructureDefinition, qb:DataSet, skos:ConceptScheme are also typed with prov:Entity, and given prov:wasAttributedTo with the value from creator (which is typed with prov:Agent) in config.rdf.

There is a unique prov:Activity for each transformation. It has a dcterms:title and contains values for prov:startedAtTime, prov:wasAssociatedWith (the creator), prov:used (i.e., the source XML and the XSL used to transform it), and prov:generated (the output, which in turn prov:wasDerivedFrom the source data URI). It also declares the licensing information (taken from config.rdf) using dcterms:license.

A provenance document may be provided to the transformer. This XML document would contain prov:Activity information which indicates the location, on the local filesystem, of the XML document that will later be transformed. It contains other provenance data such as when it was retrieved, with what tools, and so on.

If that provenance document is provided to the transformer, the provenance template looks into it to see whether there is provenance information about the XML that is being transformed. If there is, it links the current provenance activity (the transformation) to the earlier provenance activity (the retrieval) using prov:wasInformedBy.

Versions

As SDMX data publishers version their classifications, and in turn the Cubes that are generated refer to particular versions of those classifications, versions need to be explicitly part of URIs in order to uniquely identify classifications. Although this goes against the general recommendation out there for not including the version in the URI, it is a good exception here. Otherwise, how would creating new terms for URIs without the version information be any different? For some background, see also #31.

URI Patterns

Here is an outline of the URI patterns that are used. example.org is used as an example domain (see also: Agency identifiers and URIs), followed by class, code, concept, dataset, property, provenance, or slice as example path segments (i.e., they can be changed from config). /s are used to separate the things and dimensions in URIs, which can also be changed from config. Variable values are derived directly from the source SDMX. Some skos:ConceptScheme URIs also combine the date validity information, joined with uriValidFromToSeparator, when both validFrom and validTo are provided.

qb:DataStructureDefinition

http://example.org/dataset/{$KeyFamilyRef}/structure

qb:Observation

http://example.org/dataset/{$KeyFamilyRef}/{dimension-1}/../{dimension-n}

qb:Slice

http://example.org/slice/{$KeyFamilyRef}/{dimension-1}/../{dimension-n-excluding-FREQ-concept}

skos:Collection

http://example.org/code/{$version}/{$hierarchicalCodeListID}
http://example.org/code/{$version}/{$hierarchyID}

sdmx:CodeList

http://example.org/code/{$version}/{$codeListID}

skos:ConceptScheme

http://example.org/concept/{$version}/{$conceptSchemeID}

skos:Concept , sdmx:Concept

http://example.org/code/{$version}/{$codeListID}/{@codeID}
http://example.org/concept/{$version}/{$conceptSchemeID}/{@conceptID}

owl:Class and rdfs:Class

http://example.org/class/{$version}/{$codeListID}

rdf:Property , qb:DimensionProperty , qb:MeasureProperty , qb:AttributeProperty

http://example.org/property/{$conceptID}

Properties

Properties used in structure (DSD, codelists, ..) and data (observations) are listed below:

Structure

http://example.org/property/{$conceptID}
http://purl.org/dc/terms/identifier
http://purl.org/dc/terms/references
http://purl.org/linked-data/cube#attribute
http://purl.org/linked-data/cube#codeList
http://purl.org/linked-data/cube#component
http://purl.org/linked-data/cube#componentAttachment
http://purl.org/linked-data/cube#componentProperty
http://purl.org/linked-data/cube#concept
http://purl.org/linked-data/cube#dimension
http://purl.org/linked-data/cube#measure
http://purl.org/linked-data/cube#order
http://purl.org/linked-data/cube#sliceKey
http://purl.org/linked-data/sdmx/2009/concept#dataRev
http://purl.org/linked-data/sdmx/2009/concept#dsi
http://purl.org/linked-data/sdmx/2009/concept#mAgency
http://purl.org/linked-data/sdmx/2009/concept#validFrom
http://purl.org/linked-data/sdmx/2009/concept#validTo
http://purl.org/linked-data/xkos#hasPart
http://purl.org/linked-data/xkos#isPartOf
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/2000/01/rdf-schema#comment
http://www.w3.org/2000/01/rdf-schema#range
http://www.w3.org/2000/01/rdf-schema#seeAlso
http://www.w3.org/2000/01/rdf-schema#subClassOf
http://www.w3.org/2004/02/skos/core#definition
http://www.w3.org/2004/02/skos/core#hasTopConcept
http://www.w3.org/2004/02/skos/core#inScheme
http://www.w3.org/2004/02/skos/core#member
http://www.w3.org/2004/02/skos/core#notation
http://www.w3.org/2004/02/skos/core#prefLabel
http://www.w3.org/2004/02/skos/core#topConceptOf
http://www.w3.org/ns/prov#generated
http://www.w3.org/ns/prov#startedAtTime
http://www.w3.org/ns/prov#used
http://www.w3.org/ns/prov#wasAssociatedWith
http://www.w3.org/ns/prov#wasAttributedTo
http://www.w3.org/ns/prov#wasDerivedFrom

Data

http://example.org/property/{$conceptID}
http://purl.org/linked-data/cube#dataSet
http://purl.org/linked-data/cube#observation
http://purl.org/linked-data/cube#slice
http://purl.org/linked-data/cube#sliceStructure
http://purl.org/linked-data/cube#structure
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://www.w3.org/ns/prov#generated
http://www.w3.org/ns/prov#startedAtTime
http://www.w3.org/ns/prov#used
http://www.w3.org/ns/prov#wasAssociatedWith
http://www.w3.org/ns/prov#wasAttributedTo
http://www.w3.org/ns/prov#wasDerivedFrom

Types of resources

Types of resources in the structure (DSD, codelists, ..) and data (observations) are listed below:

Structure

http://example.org/class/{$version}/{$codeListID}
http://purl.org/linked-data/cube#AttributeProperty
http://purl.org/linked-data/cube#ComponentSpecification
http://purl.org/linked-data/cube#DataStructureDefinition
http://purl.org/linked-data/cube#DimensionProperty
http://purl.org/linked-data/cube#MeasureProperty
http://purl.org/linked-data/sdmx#CodeList
http://purl.org/linked-data/sdmx#Concept
http://purl.org/linked-data/sdmx#DataStructureDefinition
http://www.w3.org/1999/02/22-rdf-syntax-ns#Property
http://www.w3.org/2000/01/rdf-schema#Class
http://www.w3.org/2002/07/owl#Class
http://www.w3.org/2004/02/skos/core#Collection
http://www.w3.org/2004/02/skos/core#Concept
http://www.w3.org/2004/02/skos/core#ConceptScheme
http://www.w3.org/ns/prov#Activity
http://www.w3.org/ns/prov#Agent
http://www.w3.org/ns/prov#Entity

Data

http://purl.org/linked-data/cube#DataSet
http://purl.org/linked-data/cube#Observation
http://www.w3.org/ns/prov#Activity
http://www.w3.org/ns/prov#Agent
http://www.w3.org/ns/prov#Entity

Datatypes

Some of the XSD datatypes are applied to object resources based on SDMX structure:TextFormat/@textType. See also issues #3 and #9, and the coverage below.

How to run:

  1. Edit scripts/config.rdf to configure things like base URIs, delimiters to use in URIs, or even how to put SDMX AnnotationTypes into good use. If you don't edit, it will work with defaults (e.g., example.org, /).

  2. Either use the provided scripts/generic.sh to transform generic SDMX-ML in data/ to RDF/XML, or use it on your own data with an XSLT 2.0 processor, with a command along the lines of the following (using Debian's saxonb-xslt as an example):

The following takes the metadata from generic.structure.xml and uses the scripts/generic.xsl template to create the corresponding RDF/XML in generic.structure.rdf. The xmlDocument parameter tells the processor which file is being transformed (it is also used for provenance data); just reuse the same value as the input XML given in -s. The pathToGenericStructure parameter has the same value as xmlDocument in this case because we are transforming the SDMX KeyFamily (DSD) itself:

saxonb-xslt -s generic.structure.xml -xsl generic.xsl xmlDocument=generic.structure.xml pathToGenericStructure=generic.structure.xml > generic.structure.rdf

Similar to above, but this time we are going to use the generic.structure.xml for the generic data. The following generates the RDF/XML generic.data.rdf from generic.data.xml by making use of the generic structure data in generic.structure.xml with parameter pathToGenericStructure:

saxonb-xslt -t -tree:linked -s generic.data.xml -xsl generic.xsl xmlDocument=generic.data.xml pathToGenericStructure=generic.structure.xml > generic.data.rdf

-tree:linked in saxonb-xslt helps with large files, as does giving more memory to the processor.

Optionally, pathToProvDocument (for extra provenance information) and pathToConfig (to use a custom config, otherwise default config.rdf is used) parameters can be passed in.
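If you prefer to drive the transformation from a script, the commands above can be assembled along these lines. This is a hedged Python sketch: the function names saxon_command and run_transform are made up, only the flags and stylesheet parameters mentioned in this section are used, and xmlDocument is set to the same file as -s, as described earlier in this section.

```python
import subprocess


def saxon_command(source_xml, structure_xml, stylesheet="generic.xsl",
                  prov_document=None, config=None, linked_tree=False):
    """Assemble the saxonb-xslt argument list for one transformation."""
    command = ["saxonb-xslt"]
    if linked_tree:
        # Same flags as in the data transformation command above.
        command += ["-t", "-tree:linked"]
    command += [
        "-s", source_xml,
        "-xsl", stylesheet,
        f"xmlDocument={source_xml}",  # same value as the -s input
        f"pathToGenericStructure={structure_xml}",
    ]
    if prov_document:
        command.append(f"pathToProvDocument={prov_document}")
    if config:
        command.append(f"pathToConfig={config}")
    return command


def run_transform(command, output_rdf):
    """Run the command and write its standard output to output_rdf."""
    with open(output_rdf, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


structure_command = saxon_command("generic.structure.xml", "generic.structure.xml")
data_command = saxon_command("generic.data.xml", "generic.structure.xml",
                             linked_tree=True)
```

Calling run_transform(data_command, "generic.data.rdf") would then play the role of the shell redirection; it needs saxonb-xslt on the PATH and is not invoked in the sketch itself.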

Coverage

The following is a coverage (in progress) based on sample data.

BIS OECD UN ECB WB IMF FAO EUROSTAT BFS
"External agencies" refers to agencies in which the SDMX publisher is using an external agency's concepts, codelists etc.
External Agencies SDMX EUROSTAT IAEG SDMX OECD
Annotation(Type) Y Y Y
Hierarchical CodeLists Y Y Y Y Y Y
Datatype (OBS_VALUE) String Double Double Double
Datatype (TIME_FORMAT) String String
Datatype (TIME_PERIOD) String
Datatype (OBS_STATUS) String String
Datatype (OBS_CONF) String
SDMX Version 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0