# create_metadata

This notebook creates the CSVW metadata file for the sensor CSV file.

## Setup

Imports packages.

Information on the `csvw_functions` package is available here: https://github.com/stevenkfirth/csvw_functions

In [1]:
import csvw_functions
import json

## Initial data processing

Initially, opening the CSV file raised a UnicodeDecodeError. This is because this particular CSV file is encoded in ANSI rather than the default encoding of UTF-8.

You can check the encoding of a CSV file by opening it in Notepad, selecting 'Save As' and looking at the 'Encoding' box.

The solution was to open the file in Python and resave it with UTF-8 encoding, which is the standard encoding used by the CSVW standards.

In [2]:
# 'ANSI' maps to the Windows system code page, so this cell only runs on Windows.
with open('ABCE_atrium_U12.csv',encoding='ANSI') as f:
    with open('ABCE_atrium_U12_UTF8.csv','w',encoding='UTF8') as f1:
        f1.write(f.read())
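
The same conversion can be wrapped in a small helper that names the source code page explicitly, which avoids relying on the Windows-only 'ANSI' alias. This is a minimal sketch: the function name `reencode_to_utf8` and the default of `cp1252` (the usual Western European 'ANSI' code page) are assumptions and have not been confirmed for this particular file.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding='cp1252'):
    """Read `src` using `source_encoding` and write the same text to `dst` as UTF-8.

    Hypothetical helper; 'cp1252' is an assumed default, not a detected value.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding='utf-8')
    return dst
```

A call such as `reencode_to_utf8('ABCE_atrium_U12.csv', 'ABCE_atrium_U12_UTF8.csv')` would then play the role of the cell above on any platform, provided the assumed code page is the right one.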

## Get embedded metadata

Reads the CSV file and extracts the information from the column headings to form an initial CSVW metadata document.

In [3]:
metadata_table_dict=\
    csvw_functions.get_embedded_metadata(
        'ABCE_atrium_U12_UTF8.csv',
        skip_rows=1,  # skips the first row of the CSV file
        relative_path=True  # sets the 'url' table property to a path relative to the current working directory.
)
metadata_table_dict

{'@context': 'http://www.w3.org/ns/csvw',
 'rdfs:comment': [{'@value': '"Plot Title: 01_10_13 "'}],
 'tableSchema': {'columns': [{'titles': {'und': ['#']}, 'name': '%23'},
   {'titles': {'und': ['Time, GMT+01:00']}, 'name': 'Time%2C%20GMT%2B01%3A00'},
   {'titles': {'und': ['Temp, °C()']}, 'name': 'Temp%2C%20%C2%B0C%28%29'},
   {'titles': {'und': ['RH, %()']}, 'name': 'RH%2C%20%25%28%29'},
   {'titles': {'und': ['Intensity, Lux()']},
    'name': 'Intensity%2C%20Lux%28%29'},
   {'titles': {'und': ['Bad Battery()']}, 'name': 'Bad%20Battery%28%29'},
   {'titles': {'und': ['Host Connected()']}, 'name': 'Host%20Connected%28%29'},
   {'titles': {'und': ['Stopped()']}, 'name': 'Stopped%28%29'},
   {'titles': {'und': ['End Of File()']}, 'name': 'End%20Of%20File%28%29'}]},
 'url': 'ABCE_atrium_U12_UTF8.csv'}
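
The `name` values in the output above are the percent-encoded forms of the column titles. The standard library can reproduce them, as the short sketch below shows. It illustrates the encoding only and is not taken from the `csvw_functions` source.

```python
from urllib.parse import quote, unquote

titles = ['#', 'Time, GMT+01:00', 'Temp, °C()', 'RH, %()']

# safe='' percent-encodes every character outside letters, digits and '_.-~'
names = [quote(title, safe='') for title in titles]
print(names)

# Decoding recovers the original titles, so no information is lost.
assert [unquote(name) for name in names] == titles
```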

## Add new information to the metadata document

This section adds additional information to create a complete metadata document.

### set properties on Table object

Adds descriptive properties (title, description, location and creator) and the CSV dialect to the table object.

In [4]:
metadata_table_dict.update(
    {
        "dc:title": "Indoor conditions in the atrium of the Sir Frank Gibb building",
        "dc:description":
            "Air temperature and Relative humidity data from Hobo U12 sensor placed in the atrium of the Sir Frank Gibb building",
        "dc:location": "Atrium, Sir Frank Gibb building, Loughborough University, LE11 3TU, UK",
        "dc:creator": "ABCE Open Research Team",
        "dialect": {
            "skipRows": 1
        }
    }
)
metadata_table_dict

{'@context': 'http://www.w3.org/ns/csvw',
 'rdfs:comment': [{'@value': '"Plot Title: 01_10_13 "'}],
 'tableSchema': {'columns': [{'titles': {'und': ['#']}, 'name': '%23'},
   {'titles': {'und': ['Time, GMT+01:00']}, 'name': 'Time%2C%20GMT%2B01%3A00'},
   {'titles': {'und': ['Temp, °C()']}, 'name': 'Temp%2C%20%C2%B0C%28%29'},
   {'titles': {'und': ['RH, %()']}, 'name': 'RH%2C%20%25%28%29'},
   {'titles': {'und': ['Intensity, Lux()']},
    'name': 'Intensity%2C%20Lux%28%29'},
   {'titles': {'und': ['Bad Battery()']}, 'name': 'Bad%20Battery%28%29'},
   {'titles': {'und': ['Host Connected()']}, 'name': 'Host%20Connected%28%29'},
   {'titles': {'und': ['Stopped()']}, 'name': 'Stopped%28%29'},
   {'titles': {'und': ['End Of File()']}, 'name': 'End%20Of%20File%28%29'}]},
 'url': 'ABCE_atrium_U12_UTF8.csv',
 'dc:title': 'Indoor conditions in the atrium of the Sir Frank Gibb building',
 'dc:description': 'Air temperature and Relative humidity data from Hobo U12 sensor placed in the atrium of th
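
The `skipRows` dialect property tells a CSVW processor to ignore the first row of the file before it looks for the header row. The sketch below mimics that behaviour with the standard `csv` module on a small in-memory sample. The sample is assembled for illustration only, and the layout of the real file may differ.

```python
import csv
import io

# Illustrative sample: a plot-title row, then the header row, then one data row.
sample = (
    '"Plot Title: 01_10_13 "\n'
    '"#","Time, GMT+01:00","RH, %()"\n'
    '1,10/02/13 06:00:00 AM,59.728\n'
)

skip_rows = 1  # same value as the 'skipRows' dialect property

reader = csv.reader(io.StringIO(sample))
for _ in range(skip_rows):
    next(reader)  # discard the rows that precede the header

header = next(reader)
first_row = next(reader)
print(header)
print(first_row)
```

Without the skip, the plot-title row would be read as the header and the table would appear to have a single column.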

### update column names

Replaces the percent-encoded column names with shorter, more readable identifiers derived from the column titles.

In [5]:
for col_dict in metadata_table_dict['tableSchema']['columns']:
    col_dict['name']=col_dict['titles']['und'][0].split(',')[0].split('(')[0].lower().replace(' ','_')
metadata_table_dict['tableSchema']['columns'][0]['name']='index'
metadata_table_dict

{'@context': 'http://www.w3.org/ns/csvw',
 'rdfs:comment': [{'@value': '"Plot Title: 01_10_13 "'}],
 'tableSchema': {'columns': [{'titles': {'und': ['#']}, 'name': 'index'},
   {'titles': {'und': ['Time, GMT+01:00']}, 'name': 'time'},
   {'titles': {'und': ['Temp, °C()']}, 'name': 'temp'},
   {'titles': {'und': ['RH, %()']}, 'name': 'rh'},
   {'titles': {'und': ['Intensity, Lux()']}, 'name': 'intensity'},
   {'titles': {'und': ['Bad Battery()']}, 'name': 'bad_battery'},
   {'titles': {'und': ['Host Connected()']}, 'name': 'host_connected'},
   {'titles': {'und': ['Stopped()']}, 'name': 'stopped'},
   {'titles': {'und': ['End Of File()']}, 'name': 'end_of_file'}]},
 'url': 'ABCE_atrium_U12_UTF8.csv',
 'dc:title': 'Indoor conditions in the atrium of the Sir Frank Gibb building',
 'dc:description': 'Air temperature and Relative humidity data from Hobo U12 sensor placed in the atrium of the Sir Frank Gibb building',
 'dc:location': 'Atrium, Sir Frank Gibb building, Loughborough University,
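
The chained string operations in the cell above can be easier to follow as a named function. The sketch below applies the same steps one at a time; `tidy_name` is a hypothetical helper name introduced here for illustration.

```python
def tidy_name(title):
    """Return a short identifier derived from a column title."""
    name = title.split(',')[0]   # keep the text before the first comma (drops units, time zone)
    name = name.split('(')[0]    # drop a trailing '()'
    return name.lower().replace(' ', '_')


for title in ['Time, GMT+01:00', 'Temp, °C()', 'Bad Battery()', '#']:
    print(repr(title), '->', repr(tidy_name(title)))
```

The title `'#'` passes through unchanged, which is why the first column is renamed to `'index'` by hand.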

### add column descriptions, datatypes and units

Adds a description, a datatype and, where applicable, units to each column.

In [6]:
data={
    'index':{
        'dc:description':'Integer count of sensor observations, starting at 1.',
        'datatype':'integer'
    },
    'time':{
        'dc:description':'Date and time of a sensor observation, recorded at time zone Greenwich Mean Time +01:00.',
        'datatype':'string',
        "rdfs:comment": "The 'Time, GMT+01:00' column contains values such as '10/02/13 06:00:00 AM'. This cannot be represented by the format options available in CSVW. The format of this column approximately corresponds to 'MM/dd/yy HH:mm:ss' plus a 'AM' or 'PM' flag.",
        "schema:variableMeasured": "Timestamp",
    },
    'temp':{
        "dc:description": "Half hourly air temperature (C).",
        "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
            "@id": "http://qudt.org/vocab/unit/DEG_C"
        },
        "schema:variableMeasured": "Air temperature",
        "schema:duration": "30M",
        "schema:unitText": "C",
        "datatype": "number"
    },
    'rh':{
        "dc:description": "Half hourly air relative humidity.",
        "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
            "@id": "http://qudt.org/vocab/unit/PERCENT_RH"
        },
        "schema:variableMeasured": "Air relative humidity",
        "schema:duration": "30M",
        "schema:unitText": "%",
        "datatype": "number"
    },
    'intensity':{
        "dc:description": "Half hourly light intensity (lux)",
        "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {
            "@id": "http://qudt.org/vocab/unit/LUX"
        },
        "schema:variableMeasured": "Light intensity",
        "schema:duration": "30M",
        "schema:unitText": "Lux",
        "datatype": "number"
    },
    'bad_battery':{
        "dc:description": "Half hourly sensor internal diagnostic test for bad battery.",
        "schema:duration": "30M",
        "datatype": {
            "base": "string",
            "format": "Logged"
        }
    },
    'host_connected':{
        "dc:description": "Half hourly sensor internal diagnostic test for the connection of a 'host' - i.e. connection of a PC or similar.",
        "schema:duration": "30M",
        "datatype": {
            "base": "string",
            "format": "Logged"
        }
    },
    'stopped':{
        "dc:description": "Half hourly sensor internal diagnostic test for the command to stop recording.",
        "schema:duration": "30M",
        "datatype": {
            "base": "string",
            "format": "Logged"
        }
    },
    'end_of_file':{
        "dc:description": "Half hourly value indicating the end of the downloaded data file.",
        "schema:duration": "30M",
        "datatype": {
            "base": "string",
            "format": "Logged"
        }
    }
}
for col_dict in metadata_table_dict['tableSchema']['columns']:
    for k,v in data[col_dict['name']].items():
        col_dict[k]=v
metadata_table_dict

{'@context': 'http://www.w3.org/ns/csvw',
 'rdfs:comment': [{'@value': '"Plot Title: 01_10_13 "'}],
 'tableSchema': {'columns': [{'titles': {'und': ['#']},
    'name': 'index',
    'dc:description': 'Integer count of sensor observations, starting at 1.',
    'datatype': 'integer'},
   {'titles': {'und': ['Time, GMT+01:00']},
    'name': 'time',
    'dc:description': 'Date and time of a sensor observation, recorded at time zone Greenwich Mean Time +01:00.',
    'datatype': 'string',
    'rdfs:comment': "The 'Time, GMT+01:00' column contains values such as '10/02/13 06:00:00 AM'. This cannot be represented by the format options available in CSVW. The format of this column approximately corresponds to 'MM/dd/yy HH:mm:ss' plus a 'AM' or 'PM' flag.",
    'schema:variableMeasured': 'Timestamp'},
   {'titles': {'und': ['Temp, °C()']},
    'name': 'temp',
    'dc:description': 'Half hourly air temperature (C).',
    'datatype': 'number',
    'http://purl.org/linked-data/sdmx/2009/attribute#uni
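
The merge loop in the cell above raises a `KeyError` if a column name has no entry in `data`. A slightly more defensive variant is sketched below. `apply_column_properties` is a hypothetical helper, not part of `csvw_functions`, and it returns the names it could not match so that they can be checked.

```python
def apply_column_properties(columns, properties):
    """Merge per-column properties into CSVW column dicts, in place.

    Returns the column names that have no entry in `properties`.
    """
    missing = []
    for col in columns:
        extra = properties.get(col['name'])
        if extra is None:
            missing.append(col['name'])
        else:
            col.update(extra)
    return missing


# Small stand-in data; 'pressure' is a made-up column used to show the unmatched case.
columns = [{'name': 'index'}, {'name': 'temp'}, {'name': 'pressure'}]
properties = {
    'index': {'datatype': 'integer'},
    'temp': {'datatype': 'number', 'schema:unitText': 'C'},
}
missing = apply_column_properties(columns, properties)
print(missing)
```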

## Save the newly created metadata table object

In [7]:
with open('ABCE_atrium_U12_UTF8.csv-metadata.json','w') as f:
    json.dump(metadata_table_dict,f,indent=4)
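
A saved metadata file can be given a quick structural check using only the standard library. The sketch below writes a small stand-in metadata dict to a temporary directory, reads it back and confirms that the column names are unique. The helper name `duplicate_column_names` and the stand-in dict are illustrative assumptions.

```python
import json
import os
import tempfile
from collections import Counter


def duplicate_column_names(metadata):
    """Return column names that occur more than once in a table metadata dict."""
    counts = Counter(col['name'] for col in metadata['tableSchema']['columns'])
    return [name for name, n in counts.items() if n > 1]


# Stand-in for metadata_table_dict, kept small for illustration.
metadata = {
    '@context': 'http://www.w3.org/ns/csvw',
    'url': 'ABCE_atrium_U12_UTF8.csv',
    'tableSchema': {'columns': [{'name': 'index'}, {'name': 'time'}]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, metadata['url'] + '-metadata.json')
    with open(path, 'w') as f:
        json.dump(metadata, f, indent=4)
    with open(path) as f:
        reloaded = json.load(f)

assert reloaded == metadata  # the JSON round trip preserves the document
print(duplicate_column_names(reloaded))
```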

## Testing

To test the newly created metadata file, we can use the `csvw_functions` package to create an annotated table group object and check it for errors. We can also convert the data to JSON-LD to confirm that the conversion works.


In [8]:
annotated_table_group_dict=csvw_functions.create_annotated_table_group(
    'ABCE_atrium_U12_UTF8.csv-metadata.json'
)

*(No runtime errors)*

In [9]:
csvw_functions.get_errors(annotated_table_group_dict)

[]

*(No errors stored in the annotated table group object)*

In [10]:
json_ld=csvw_functions.create_json_ld(
    annotated_table_group_dict,
    mode='minimal'
)
json_ld[0:2]

[{'index': 1,
  'time': '10/02/13 06:00:00 AM',
  'temp': 19.865,
  'rh': 59.728,
  'intensity': 11.8},
 {'index': 2,
  'time': '10/02/13 06:30:00 AM',
  'temp': 19.817,
  'rh': 59.781,
  'intensity': 11.8}]

*(No runtime errors. Conversion looks fine.)*
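
Because the `time` column is declared as a string, its values pass through the conversion unchanged and any consumer has to parse them. A minimal sketch with the standard library follows. It assumes the month-first 'MM/dd/yy' reading given in the column's `rdfs:comment` and a fixed +01:00 offset, and `parse_logger_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

LOGGER_TZ = timezone(timedelta(hours=1))  # 'GMT+01:00' from the column title


def parse_logger_time(value):
    """Parse a logger timestamp such as '10/02/13 06:00:00 AM' into an aware datetime."""
    # %I is the 12-hour clock and %p the AM/PM flag (%p is locale-dependent;
    # an English locale is assumed here).
    naive = datetime.strptime(value, '%m/%d/%y %I:%M:%S %p')
    return naive.replace(tzinfo=LOGGER_TZ)


print(parse_logger_time('10/02/13 06:00:00 AM').isoformat())
```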