# DataDictionary_RDF_Data_Cube

This notebook steps through the development of a method to convert a UKDS data dictionary .rtf file to an RDF file using the Data Cube vocabulary (https://www.w3.org/TR/vocab-data-cube/).

## Initial setup

### Import packages

In [1]:
import os, ukds
import pandas as pd

### Set filepaths

This sets a filepath to an example data dictionary on a local file system, in this case the 'uktus15_household_ukda_data_dictionary.rtf' file.

In [2]:
base_dir=os.path.join(*[os.pardir]*4,r'_Data\United_Kingdom_Time_Use_Survey_2014-2015\UKDA-8128-tab')
dd_fp=os.path.join(base_dir,r'mrdoc\allissue\uktus15_household_ukda_data_dictionary.rtf')

### Create DataDictionary

A ukds.DataDictionary instance is created and the .rtf file is read into it.

In [32]:
dd=ukds.DataDictionary()
dd.read_rtf(dd_fp)
dd.variable_list[0:4]

[{'pos': '1',
  'variable': 'serial',
  'variable_label': 'Household number',
  'variable_type': 'numeric',
  'SPSS_measurement_level': 'SCALE',
  'SPSS_user_missing_values': '',
  'value_labels': ''},
 {'pos': '2',
  'variable': 'strata',
  'variable_label': 'Strata',
  'variable_type': 'numeric',
  'SPSS_measurement_level': 'SCALE',
  'SPSS_user_missing_values': '',
  'value_labels': {-2.0: 'Schedule not applicable'}},
 {'pos': '3',
  'variable': 'psu',
  'variable_label': 'Primary sampling unit',
  'variable_type': 'numeric',
  'SPSS_measurement_level': 'SCALE',
  'SPSS_user_missing_values': '',
  'value_labels': {-2.0: 'Schedule not applicable'}},
 {'pos': '4',
  'variable': 'HhOut',
  'variable_label': 'Final outcome - household',
  'variable_type': 'numeric',
  'SPSS_measurement_level': 'SCALE',
  'SPSS_user_missing_values': '',
  'value_labels': {0.0: 'Outstanding',
   640.0: 'Unknown whether address is residential: No contact after 6+ calls',
   214.0: 'Productive : Household q
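
Each entry in `variable_list` is a plain dictionary, and its `value_labels` item is either an empty string or a dictionary mapping codes to labels. A minimal sketch of how coded and uncoded variables might be separated, using a hand-written stand-in for the first two entries shown above (the names `sample_variable_list` and `split_by_coding` are illustrative and not part of the `ukds` package):

```python
# Stand-in for dd.variable_list, abbreviated from the first two entries shown above.
sample_variable_list = [
    {'pos': '1', 'variable': 'serial', 'variable_label': 'Household number',
     'variable_type': 'numeric', 'value_labels': ''},
    {'pos': '2', 'variable': 'strata', 'variable_label': 'Strata',
     'variable_type': 'numeric', 'value_labels': {-2.0: 'Schedule not applicable'}},
]

def split_by_coding(variable_list):
    """Return (coded, uncoded) lists of variable names."""
    coded, uncoded = [], []
    for entry in variable_list:
        # An empty string is falsy, so only entries with value labels count as coded.
        if entry['value_labels']:
            coded.append(entry['variable'])
        else:
            uncoded.append(entry['variable'])
    return coded, uncoded

coded, uncoded = split_by_coding(sample_variable_list)
print(coded)    # ['strata']
print(uncoded)  # ['serial']
```

This distinction matters later on, because only coded variables need a code list in the RDF output.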

## Discussion

### Aim

The aim of this notebook is to develop a method to convert the information in UKDS data dictionary files into the concept schemes and code lists of the RDF Data Cube vocabulary (https://www.w3.org/TR/vocab-data-cube/#schemes).

Once converted, this RDF data can be combined with an RDF Data Cube representation of the UKDS data table files.

### Sample call

Sample code could look like:

```python
t=dd.to_rdf_data_cube() # dd is a DataDictionary instance
```

where `t` is a string containing the contents of a Turtle file.
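
Because the result is a plain string, it could be written straight to a `.ttl` file. A minimal sketch, using a short stand-in string in place of the real method output (the `save_turtle` helper and the file name are hypothetical):

```python
import os
import tempfile

def save_turtle(turtle_string, filepath):
    """Write a Turtle string to filepath as UTF-8 and return the path."""
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(turtle_string)
    return filepath

# Stand-in for: t = dd.to_rdf_data_cube()
t = '@prefix qb: <http://purl.org/linked-data/cube#> .\n'

out_fp = save_turtle(t, os.path.join(tempfile.gettempdir(), 'example_data_dictionary.ttl'))

# Read the file back to confirm the round trip.
with open(out_fp, encoding='utf-8') as f:
    print(f.read() == t)
```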

### Sample RDF file

The RDF Data Cube webpage gives, in Example 14, an example of a concept (or variable) and its code list using the `skos` vocabulary.

The proposal is that the RDF file would look like the example below, which shows the data in Turtle (.ttl) format for the variable *serial*:

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix ukds8128-code: <http://purl.org/berg/ukds8128/code/> . # a prefix for the UKDS Time Use Survey 2014-2015 dataset
@prefix ukds8128-measure: <http://purl.org/berg/ukds8128/measure/> .

ukds8128-measure:serial a rdf:Property, qb:MeasureProperty ;
    rdfs:label "serial"@en ;
    rdfs:subPropertyOf sdmx-measure:obsValue ;
    rdfs:range xsd:decimal . 

ukds8128-code:serial a skos:ConceptScheme ;
    skos:prefLabel "serial"@en ; # the 'variable' value
    rdfs:label "serial"@en ; # the 'variable' value
    skos:notation "serial" ; # the 'variable' value
    skos:note "Household number"@en ; # the 'variable_label' value
    skos:definition <ukds8128:uktus15_household_ukda_data_dictionary> ; #  a uri based on the file name
    .
```

This shows the data in turtle (.ttl) format for the variable *strata*:

```turtle
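@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ukds8128: <http://purl.org/berg/ukds8128/> .
@prefix ukds8128-code: <http://purl.org/berg/ukds8128/code/> .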

ukds8128-code:strata a skos:ConceptScheme ;
    skos:prefLabel "strata"@en ;
    rdfs:label "strata"@en ;
    skos:notation "strata" ;
    skos:note "Strata"@en ;
    skos:definition <ukds8128:uktus15_household_ukda_data_dictionary> ;
    rdfs:seeAlso ukds8128-code:Strata ;
    skos:hasTopConcept ukds8128-code:strata_code_-2.0 .

ukds8128-code:Strata a rdfs:Class, owl:Class;
    rdfs:subClassOf skos:Concept ;
    rdfs:label "strata"@en;
    rdfs:comment "Strata"@en;
    rdfs:seeAlso ukds8128-code:strata .

ukds8128-code:strata_code_-2.0 a skos:Concept, ukds8128-code:Strata;
    skos:topConceptOf ukds8128-code:strata;
    skos:prefLabel "Schedule not applicable"@en ;
    skos:notation -2.0 ;
    skos:inScheme ukds8128-code:strata .

```

Here the code list includes a single 'value label' code (-2.0, 'Schedule not applicable').
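
The *strata* example follows a naming convention: the concept scheme uses the variable name with a lower-case first letter, the class uses an upper-case first letter, and each concept is named from the scheme name, `_code_` and the code value. As a small sketch of that convention (the `codelist_names` helper is illustrative only):

```python
def codelist_names(variable, code):
    """Return the (scheme, class, concept) local names for a variable and one of its codes."""
    scheme = variable[0].lower() + variable[1:]       # e.g. 'strata'
    code_class = variable[0].upper() + variable[1:]   # e.g. 'Strata'
    concept = '%s_code_%s' % (scheme, code)           # e.g. 'strata_code_-2.0'
    return scheme, code_class, concept

print(codelist_names('strata', -2.0))  # ('strata', 'Strata', 'strata_code_-2.0')
print(codelist_names('HhOut', 0.0))    # ('hhOut', 'HhOut', 'hhOut_code_0.0')
```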


## Developing the method

### to_data_structure_definition

In [96]:
def to_data_structure_definition(self,prefix,dataset_name):
    """Returns an RDF Turtle string of the qb:DataStructureDefinition using the Data Cube and skos vocabulary
    
    Arguments:
        - self: the DataDictionary instance
        - prefix (str): the prefix to use for the data dictionary uris   
        - dataset_name (str): the name of the dataset
    
    """
    
    l=['%s:%s-dsd a qb:DataStructureDefinition' % (prefix,dataset_name)]
        
    for i,variable in enumerate(self.get_variable_names()): # use the instance passed in, not the global dd
        l.append('qb:component [ qb:measure %s-measure:%s; qb:order %s ]' % (prefix,variable,i+1))
        
    st=' ;\n\t'.join(l) + ' .\n\n'
    
    return st

print(to_data_structure_definition(dd,'ukds8128','uktus15_household'))

ukds8128:uktus15_household-dsd a qb:DataStructureDefinition ;
	qb:component [ qb:measure ukds8128-measure:serial; qb:order 1 ] ;
	qb:component [ qb:measure ukds8128-measure:strata; qb:order 2 ] ;
	qb:component [ qb:measure ukds8128-measure:psu; qb:order 3 ] ;
	qb:component [ qb:measure ukds8128-measure:HhOut; qb:order 4 ] ;
	qb:component [ qb:measure ukds8128-measure:hh_wt; qb:order 5 ] ;
	qb:component [ qb:measure ukds8128-measure:IMonth; qb:order 6 ] ;
	qb:component [ qb:measure ukds8128-measure:IYear; qb:order 7 ] ;
	qb:component [ qb:measure ukds8128-measure:DM014; qb:order 8 ] ;
	qb:component [ qb:measure ukds8128-measure:DM016; qb:order 9 ] ;
	qb:component [ qb:measure ukds8128-measure:DM510; qb:order 10 ] ;
	qb:component [ qb:measure ukds8128-measure:DM1115; qb:order 11 ] ;
	qb:component [ qb:measure ukds8128-measure:DM1619; qb:order 12 ] ;
	qb:component [ qb:measure ukds8128-measure:NumAdult; qb:order 13 ] ;
	qb:component [ qb:measure ukds8128-measure:NumChild; qb:order 14 ] ;


### to_measure_property

In [94]:
def to_measure_property(self,variable,prefix):
    """Returns an RDF Turtle string of the qb:MeasureProperty using the Data Cube and skos vocabulary
    
    Arguments:
        - self: the DataDictionary instance
        - variable (str): the variable to convert to RDF
        - prefix (str): the prefix to use for the data dictionary uris   
    
    """
    d=self.get_variable_dict(variable)
    
    if d['value_labels']:
        x=', qb:CodedProperty'
    else:
        x=''
    
    l=[
        '%s-measure:%s a rdf:Property, qb:MeasureProperty%s' % (prefix,variable,x) ,
        'rdfs:label "%s"@en' % variable , # label each property with its own variable name
        'rdfs:subPropertyOf sdmx-measure:obsValue',
    ]
    
    if d['value_labels']:
        # the Data Cube property is qb:codeList; the scheme name starts lower-case as in to_codelist
        l.append('qb:codeList %s-code:%s' % (prefix,variable[0].lower()+variable[1:]))
        l.append('rdfs:range %s-code:%s' % (prefix,variable[0].upper()+variable[1:]))
    else:
        if d['variable_type']=='numeric':
            l.append('rdfs:range xsd:decimal')
        
    st=' ;\n\t'.join(l) + ' .\n\n'
    
    return st

print(to_measure_property(dd,'serial','ukds8128')) 
print(to_measure_property(dd,'strata','ukds8128'))    

ukds8128-measure:serial a rdf:Property, qb:MeasureProperty ;
	rdfs:label "serial"@en ;
	rdfs:subPropertyOf sdmx-measure:obsValue ;
	rdfs:range xsd:decimal .


ukds8128-measure:strata a rdf:Property, qb:MeasureProperty, qb:CodedProperty ;
	rdfs:label "strata"@en ;
	rdfs:subPropertyOf sdmx-measure:obsValue ;
	qb:codeList ukds8128-code:strata ;
	rdfs:range ukds8128-code:Strata .
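
The function above only assigns an `rdfs:range` to uncoded variables whose `variable_type` is 'numeric'. One possible way to extend this is a lookup table from variable type to XSD datatype. The sketch below is an assumption rather than part of the method: only 'numeric' appears in the data dictionary output shown in this notebook, and the 'string' key is a guessed type name.

```python
# Hypothetical mapping from data dictionary variable types to XSD datatypes.
XSD_RANGES = {
    'numeric': 'xsd:decimal',  # matches the behaviour of to_measure_property
    'string': 'xsd:string',    # assumed type name, not confirmed by the source
}

def range_statement(variable_type):
    """Return an rdfs:range predicate-object pair, or None for an unmapped type."""
    xsd_type = XSD_RANGES.get(variable_type)
    if xsd_type is None:
        return None
    return 'rdfs:range %s' % xsd_type

print(range_statement('numeric'))  # rdfs:range xsd:decimal
print(range_statement('date'))     # None
```

Returning `None` for unmapped types keeps the current behaviour of emitting no range rather than guessing one.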




### to_codelist

In [75]:
def to_codelist(self,variable,prefix,filename_no_ext):
    """Returns an RDF Turtle string of the codelist using the Data Cube and skos vocabulary
    
    Arguments:
        - self: the DataDictionary instance
        - variable (str): the variable to convert to RDF
        - prefix (str): the prefix to use for the data dictionary uris
        - filename_no_ext (str): the filename of the Data Dictionary file with no extension included.
    
    """
    d=self.get_variable_dict(variable)
    variable_lower=d['variable'][0].lower() + d['variable'][1:]
    variable_upper=d['variable'][0].upper() + d['variable'][1:]
    
    # ConceptScheme
    l=[]
    l+=[
        '%s-code:%s a skos:ConceptScheme' % (prefix,variable_lower),
        'skos:prefLabel "%s"@en' % d['variable'],
        'rdfs:label "%s"@en' % d['variable'],
        'skos:notation "%s"' % d['variable'],
        'skos:note "%s"@en' % d['variable_label'],
        'skos:definition <%s:%s>' % (prefix,filename_no_ext),
        ]
    
    if d['value_labels']:
        
        l.append('rdfs:seeAlso %s-code:%s' % (prefix,variable_upper))
        for k in d['value_labels']:
            # use the -code prefix so the top concept matches the concept declared below
            l.append('skos:hasTopConcept %s-code:%s_code_%s' % (prefix,variable_lower,k))
    
    st=' ;\n\t'.join(l) + ' .\n\n'
    
    # Code
    
    if d['value_labels']:
        
        l=[
            '%s-code:%s a rdfs:Class, owl:Class ' % (prefix,variable_upper),
            'rdfs:subClassOf skos:Concept ',
            'rdfs:label "%s"@en ' % d['variable'],
            'rdfs:comment "%s"@en ' % d['variable_label'],
            'rdfs:seeAlso %s-code:%s ' % (prefix,variable_lower),
        ]
        st+=' ;\n\t'.join(l) + ' .\n\n'
        
        for k,v in d['value_labels'].items():
            l=[
                '%s-code:%s_code_%s a skos:Concept, %s-code:%s' % (prefix,variable_lower,k,prefix,variable_upper),
                'skos:topConceptOf %s-code:%s' % (prefix,variable_lower),
                'skos:prefLabel "%s"@en' % v,
                'skos:notation %s' % k,
                'skos:inScheme %s-code:%s'  % (prefix,variable_lower),
            ]
            st+=' ;\n\t'.join(l) + ' .\n\n'
            
    return st
    
print(to_codelist(dd,'serial','ukds8128','uktus15_household_ukda_data_dictionary'))
print(to_codelist(dd,'strata','ukds8128','uktus15_household_ukda_data_dictionary'))

ukds8128-code:serial a skos:ConceptScheme ;
	skos:prefLabel "serial"@en ;
	rdfs:label "serial"@en ;
	skos:notation "serial" ;
	skos:note "Household number"@en ;
	skos:definition <ukds8128:uktus15_household_ukda_data_dictionary> .


ukds8128-code:strata a skos:ConceptScheme ;
	skos:prefLabel "strata"@en ;
	rdfs:label "strata"@en ;
	skos:notation "strata" ;
	skos:note "Strata"@en ;
	skos:definition <ukds8128:uktus15_household_ukda_data_dictionary> ;
	rdfs:seeAlso ukds8128-code:Strata ;
	skos:hasTopConcept ukds8128-code:strata_code_-2.0 .

ukds8128-code:Strata a rdfs:Class, owl:Class  ;
	rdfs:subClassOf skos:Concept  ;
	rdfs:label "strata"@en  ;
	rdfs:comment "Strata"@en  ;
	rdfs:seeAlso ukds8128-code:strata  .

ukds8128-code:strata_code_-2.0 a skos:Concept, ukds8128-code:Strata ;
	skos:topConceptOf ukds8128-code:strata ;
	skos:prefLabel "Schedule not applicable"@en ;
	skos:notation -2.0 ;
	skos:inScheme ukds8128-code:strata .
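
The labels are inserted into the Turtle string with plain `%s` formatting, so a label containing a double quote or a backslash would produce invalid Turtle. A minimal sketch of an escaping helper that could wrap the label values (the `turtle_literal` name is hypothetical, and the second label below is a made-up example rather than one from the data dictionary):

```python
def turtle_literal(text, lang='en'):
    """Return text as a double-quoted Turtle literal with a language tag."""
    escaped = (
        text.replace('\\', '\\\\')  # backslashes first, so later escapes are not doubled
            .replace('"', '\\"')
            .replace('\n', '\\n')
            .replace('\r', '\\r')
            .replace('\t', '\\t')
    )
    return '"%s"@%s' % (escaped, lang)

print(turtle_literal('Schedule not applicable'))  # "Schedule not applicable"@en
print(turtle_literal('Said "no" at the door'))    # "Said \"no\" at the door"@en
```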




### to_rdf_data_cube

In [103]:
def to_rdf_data_cube(self,prefix,base_uri,filename_no_ext,dataset_name):
    """Returns an RDF Turtle string using the Data Cube vocabulary
    
    Arguments:
        - self: the DataDictionary instance
        - prefix (str): the prefix to use for the data dictionary base uri
        - base_uri (str): the data dictionary base uri, without angle brackets
        - filename_no_ext (str): the filename of the Data Dictionary file with no extension included.
        - dataset_name (str): the name of the dataset
    
    """
    
    # owl and sdmx-measure are declared because the generated statements use them
    st="""
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix %s: <%s> .
@prefix %s-code: <%scode#> .
@prefix %s-measure: <%smeasure#> .

""" % (prefix,base_uri,prefix,base_uri,prefix,base_uri)
    
    st+=to_data_structure_definition(self,prefix,dataset_name)
    
    # convert every variable, using the instance passed in rather than the global dd
    for variable in self.get_variable_names():
        st+=to_measure_property(self,variable,prefix)
        st+=to_codelist(self,variable,prefix,filename_no_ext)
    return st

In [104]:
print(to_rdf_data_cube(dd,'ukds8128','http://purl.org/berg/ukds8128/','uktus15_household_ukda_data_dictionary','uktus15_household'))


@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix ukds8128: <http://purl.org/berg/ukds8128/> .
@prefix ukds8128-code: <http://purl.org/berg/ukds8128/code#> .
@prefix ukds8128-measure: <http://purl.org/berg/ukds8128/measure#> .

ukds8128:uktus15_household-dsd a qb:DataStructureDefinition ;
	qb:component [ qb:measure ukds8128-measure:serial; qb:order 1 ] ;
	qb:component [ qb:measure ukds8128-measure:strata; qb:order 2 ] ;
	qb:component [ qb:measure ukds8128-measure:psu; qb:order 3 ] ;
	qb:component [ qb:measure ukds8128-measure:HhOut; qb:order 4 ] ;
	qb:component [ qb:measure ukds8128-measure:hh_wt; qb:order 5 ] ;
	qb:component [ qb:measure ukds8128-measure:IMonth; qb:order 6 ] ;
	qb:component [ qb:measure ukds8128-measure:IYear; qb:order 7 ] ;
	qb:componen