# Wrangle OpenStreetMap data
[Cédric Campguilhem](https://github.com/ccampguilhem/Udacity-DataAnalyst)

<a id="Top"/>

## Table of contents
- [Introduction](#Introduction)
- [Map area selection](#Area selection)
- [XML data structure](#XML data structure)
- [Data quality audit](#Data quality)
    - [Validity](#Data validity)
    - [Accuracy](#Data accuracy)
    - [Completeness](#Data completeness)
    - [Consistency](#Data consistency)
    - [Uniformity](#Data uniformity)
- [Appendix](#Appendix)

<a id="Introduction"/>

## Introduction *[top](#Top)*

This project is part of the Data Wrangling with MongoDB course of the Udacity Data Analyst Nanodegree program.
The purpose of this project is to clean data from [OpenStreetMap](https://www.openstreetmap.org).

OpenStreetMap is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF). 

This project covers various aspects of the data wrangling phase:
- **screen scraping** with [Requests](http://requests.readthedocs.io/en/master/), a Python HTTP library for making requests to web services,
- **parsing** XML files with iterative and SAX parsers with Python standard library [xml.etree.ElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html?highlight=iterparse#module-xml.etree.ElementTree) and [xml.sax](https://docs.python.org/2/library/xml.sax.html),
- **auditing** (validity, accuracy, completeness, consistency and uniformity) and **cleaning** data with Python,
    - validity: does the data conform to a schema?
    - accuracy: does the data conform to a gold standard (a dataset we trust)?
    - completeness: do we have all records?
    - consistency: does the dataset provide contradictory information?
    - uniformity: are all data provided in the same units?
- **storing** data into a SQL database (SQLite) with the Python [sqlite3](https://docs.python.org/2/library/sqlite3.html) module and into the [MongoDB](https://www.mongodb.com/) NoSQL database,
- exploring dataset **statistics** as per project requirements (size of the file, number of unique users, number of nodes and ways, number of chosen type of nodes, like cafes, shops etc.)

The storing step will make use of [csv](https://docs.python.org/2/library/csv.html?highlight=csv#module-csv) and [json](https://docs.python.org/2/library/json.html?highlight=json#module-json) formats respectively for SQL and MongoDB exports.

Although I am already familiar with SQL, I will provide SQL output in addition to MongoDB output for the cleaned dataset.

<a id="Area selection"/>

## Map area selection *[top](#Top)*

If you are not interested in how the data is retrieved from OpenStreetMap, you can skip this section. At the end of the processing, you should have a *data.osm* file in the same directory as this notebook.

I have made the map area selection dynamic. By configuring a few variables, a different map area may be extracted from OpenStreetMap. Some pre-selections are available:

| Pre-selection | Description               | Usage               | File size (bytes) | OpenStreetMap link |
|:------------- |:------------------------- |:------------------- | -----------------:|:------------------ |
| Tournefeuille | The city I live in        | Project review      | 103 143 437       | [link](https://www.openstreetmap.org/relation/35735)
| City center   | Tournefeuille city center | Testing, debugging  | 583 419           | [link](https://www.openstreetmap.org/export#map=14/43.5848/1.3516)
| Toulouse      | Toulouse and surroundings | Benchmark           | 1 271 859 210     | [link](https://www.openstreetmap.org/search?query=toulouse#map=11/43.6047/1.4442)

The box variables are in the following order (south-west to north-east):

- minimum latitude
- minimum longitude
- maximum latitude
- maximum longitude

**Note:** The data cleaning provided in this project is designed for French areas. If you select a non-French area, no data cleaning will be performed.

In [240]:
SELECTION = "PRESELECTED" #Update the PRESELECTION variable
#SELECTION = "USER" #Update the USER_SELECTION with the box you want
#SELECTION = "CACHE" #Use any data file present in directory
USER_SELECTION = (43.5799, 1.3434, 43.5838, 1.3496)
PRESELECTIONS = {"Tournefeuille": (43.5475, 1.2767, 43.6019, 1.3909),
                 "City center": (43.5799, 1.3434, 43.5838, 1.3496),
                 "Toulouse": (43.3871, 0.9874, 43.8221, 1.9006)}
PRESELECTION = "Tournefeuille"
TEMPLATE = \
"""
(
   node({},{},{},{});
   <;
);
out meta;
"""

I have used the screen scraping techniques presented throughout the course to extract data from OpenStreetMap:

- I use the Overpass API (http://wiki.openstreetmap.org/wiki/Overpass_API)
- The query form (http://overpass-api.de/query_form.html) sends a POST request to http://overpass-api.de/api/interpreter
- From the api/interpreter we can just make a GET request which takes a data parameter containing the box selection:

```
(
   node(51.249,7.148,51.251,7.152);
   <;
);
out meta;
```
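
As a rough illustration of how a box tuple fills this query, the sketch below formats a template of the same shape as `TEMPLATE` and URL-encodes it as the `data` parameter. The helper names `build_query` and `build_url` are hypothetical (they are not part of this notebook), and no request is actually sent.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Same shape as the TEMPLATE variable used in this notebook
QUERY_TEMPLATE = """
(
   node({},{},{},{});
   <;
);
out meta;
"""

def build_query(box):
    """Fill the template with (min lat, min lon, max lat, max lon)."""
    min_lat, min_lon, max_lat, max_lon = box
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("Box must be ordered south-west to north-east.")
    return QUERY_TEMPLATE.format(min_lat, min_lon, max_lat, max_lon)

def build_url(box, endpoint="http://overpass-api.de/api/interpreter"):
    """Return the GET URL carrying the query in its data parameter."""
    return endpoint + "?" + urlencode({"data": build_query(box)})

url = build_url((43.5799, 1.3434, 43.5838, 1.3496))
print(url)
```

Checking the box order before formatting is an addition of this sketch; it catches swapped corners before a request is made.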

The idea is to send an HTTP GET request using [Requests](http://requests.readthedocs.io/en/master/) and collect the results as a stream, because the response may be huge and may not fit into memory.

The following function `download_map_area` downloads the map area data and stores it in a *data.osm* file:

In [241]:
import os
import shutil
import requests


def download_map_area():
    """
    Download the map area in a file named data.osm.
    
    This function takes into account the following global variables: SELECTION, USER_SELECTION, PRESELECTIONS, 
    PRESELECTION and TEMPLATE
    
    If an HTTP request is made, the response status code is returned, otherwise None is returned.
    If SELECTION is set to CACHE and no file is present an exception is raised.
    
    - raise ValueError: if SELECTION=CACHE and there is no cached file
    - raise ValueError: if SELECTION is not [PRESELECTED, USER, CACHE]
    - raise NameError if either of SELECTION, PRESELECTION, PRESELECTIONS, USER_SELECTION or TEMPLATE does not exist.
    - return: tuple:
        - status code or None
        - path to dataset
        - dataset file size (in bytes)
    """
    filename = "data.osm"
    if SELECTION == "CACHE":
        if not os.path.exists(filename):
            raise ValueError("Cannot use SELECTION=CACHE if no {} file exists.".format(filename))
        else:
            return None, filename, os.path.getsize(filename)
    elif SELECTION == "PRESELECTED":
        data = TEMPLATE.format(*PRESELECTIONS[PRESELECTION])
    elif SELECTION == "USER":
        data = TEMPLATE.format(*USER_SELECTION)
    else:
        raise ValueError("SELECTION={} is not one of PRESELECTED, USER, CACHE.".format(SELECTION))
        
    #Get XML data
    r = requests.get('http://overpass-api.de/api/interpreter', params={"data": data}, stream=True)
    with open(filename, 'wb') as fobj:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk:
                fobj.write(chunk)
    return r.status_code, filename, os.path.getsize(filename)

In [328]:
#Download dataset
status_code, dataset_path, dataset_size = download_map_area()
if status_code is None:
    print "The file {} is re-used from a previous download. Its size is {} bytes.".format(dataset_path, dataset_size)
elif status_code == 200:
    print "The file {} has been successfully downloaded. Its size is {} bytes.".format(dataset_path, dataset_size)
else:
    print "An error occurred while downloading the file. Http status code is {}.".format(status_code)

The file data.osm has been successfully downloaded. Its size is 103240060 bytes.


<a id="XML data structure"/>

## XML data structure *[top](#Top)*

In the previous section, we have downloaded a dataset from OpenStreetMap web service. The XML file retrieved this way is stored in the file named *data.osm*.

In this section we are going to familiarize ourselves with the dataset to understand how it is built. As the dataset may be a very large file (depending on the map area extracted), we are going to use an iterative parser that processes elements as they are read instead of loading the entire document first.

In [343]:
#Import the XML library
import xml.etree.cElementTree as et

from collections import Counter, defaultdict
from pprint import pprint

from IPython.core.display import display, HTML

In [344]:
#Iterative parsing
element_tags = Counter()
for (event, elem) in et.iterparse(dataset_path):
    element_tags[elem.tag] += 1
pprint(dict(element_tags))

{'member': 17264,
 'meta': 1,
 'nd': 602009,
 'node': 428784,
 'note': 1,
 'osm': 1,
 'relation': 469,
 'tag': 218904,
 'way': 71960}
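
One caveat: `iterparse` still builds the element tree as it goes, so on very large extracts memory can keep growing unless finished elements are released. The following is a minimal sketch of that idea on a tiny hand-written sample (the sample content and the `count_tags` helper are assumptions for illustration, not part of the original notebook).

```python
import io
import xml.etree.ElementTree as et
from collections import Counter

# Tiny hand-written sample standing in for data.osm (hypothetical values)
SAMPLE = b"""<osm>
  <node id="1" lat="43.58" lon="1.34"><tag k="place" v="town"/></node>
  <node id="2" lat="43.59" lon="1.35"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

def count_tags(source):
    """Count element tags, releasing each element once it has been read."""
    counts = Counter()
    for event, elem in et.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the children and attributes of the finished element
    return counts

print(dict(count_tags(io.BytesIO(SAMPLE))))
```

Clearing is safe here because only the tag name is needed. An audit that inspects children would have to read them before calling `clear()`.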


In OpenStreetMap, data is structured this way:
- A **node** is a location in space defined by its latitude and longitude. It might indicate a standalone point and/or can be used to define shape of a way.
- A **way** can be either a polyline to represent roads, rivers... or a closed polygon to delimit areas (buildings, parks...).
- A **nd** is used within way to reference nodes.
- A **relation** can be defined from **member** nodes, ways and other relations to represent routes or larger areas such as regions or city boundaries.
- A **member** is a subpart of a relation pointing to a node, a way or another relation.
- A **tag** is a (key, value) information attached to nodes, ways and relations to document in more detail the item.
- **osm** is the root node in .osm files.
- **note** and **meta** are metadata.
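
To make the link between **way**, **nd** and **node** concrete, here is a small sketch that resolves the node references of a way into coordinates. The sample document and its identifiers are invented for illustration, and the whole sample is loaded in memory, which is only reasonable for such a tiny file.

```python
import xml.etree.ElementTree as et

# Hand-written sample (hypothetical identifiers and coordinates)
SAMPLE = """<osm>
  <node id="1" lat="43.5800" lon="1.3400"/>
  <node id="2" lat="43.5810" lon="1.3410"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

root = et.fromstring(SAMPLE)

# Index nodes by identifier so that nd references can be resolved
nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
         for n in root.findall("node")}

ways = {}
for way in root.findall("way"):
    ways[way.get("id")] = {
        # Each nd points to a node through its ref attribute
        "shape": [nodes[nd.get("ref")] for nd in way.findall("nd")],
        # Tags document the way as (key, value) pairs
        "tags": {t.get("k"): t.get("v") for t in way.findall("tag")},
    }

print(ways)
```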

We are now going to parse the XML file again to get the full path of each tag in the dataset. We need to use a SAX parser with a custom handler. You can ignore the callbacks for now; they are explained in the next sections.

In [397]:
import xml.sax
import tempfile
import shutil
#This optional library improves performance on big files through disk caching
try:
    import diskcache
except ImportError:
    WITH_DISKCACHE = False
else:
    WITH_DISKCACHE = True

"""
Custom handlers for parsing OpenStreetMap XML files.

While parsing the XML file, handler keeps trace of:

- tags count
- tags ancestors

It is possible to register callback functions for start or end events.
The callbacks for start event will be called passing the following arguments:
- element name, 
- element attributes,
- locator,
- ancestor

The callbacks for end event will be called passing the following arguments:
- element name 
- locator,
- element children

Return value of callbacks is ignored by the handler class.

This makes it possible to enhance the parser with 'on the fly' data quality audits.
"""
class OpenStreetMapXmlHandler(xml.sax.ContentHandler):
    def __init__(self, caching=False):
        """
        Constructor.
        
        The object state keeps track of the stack of open elements while parsing. This makes it possible to collect
        information from children. An element is removed from the stack when its end event occurs, which limits
        memory usage while parsing.
        
        The _stack internal variable stores tuples
        - element unique identifier
        - element name (as provided by start event)
        - element attributes (as provided by start event)
        """
        xml.sax.ContentHandler.__init__(self)      #super() is not usable: ContentHandler is an old-style class in Python 2
        self._caching = caching
        
    def __enter__(self):
        """
        Context manager entry point.
        """
        self._id = 0                               #unique identifier incremented at each new element
        self._stack = [ ]                          #current stack of element being read
        self._element_tags = Counter()             #counter of element tags
        self._element_ancestors = defaultdict(set) #collection of ancestors per tag
        self._start_callbacks = [ ]                #start event callbacks
        self._end_callbacks = [ ]                  #end event callbacks
        #Disk caching ?
        if self._caching and WITH_DISKCACHE:
            self._tmpdir = tempfile.mkdtemp()
            self._children = diskcache.Cache(self._tmpdir)
        else:
            self._children = { }                   #children elements of elements being read
        return self
            
    def __exit__(self, *args):
        """
        Context manager exit point.
        
        Clean up temporary directories and cache.
        """
        if self._caching and WITH_DISKCACHE:
            self._children.close()
            shutil.rmtree(self._tmpdir)

    def startElement(self, name, attrs):
        """
        Method invoked when starting to read an element in XML dataset.

        This method is part of the xml.sax.ContentHandler interface and is overridden here.

        - name: tag of element being read
        - attrs: element attributes
        """
        #Get identifier for current element
        identifier = self._requestUniqueIdentifier()

        #Has element a parent? If yes get the id.
        try:
            parent = self._stack[-1][0]
        except IndexError:
            parent = None
                    
        #Exploit current stack to get ancestor
        ancestor = ".".join([s[1] for s in self._stack])
        self._element_ancestors[name].add(ancestor)
        
        #Update tag counter
        self._element_tags[name] += 1
        
        #Update parent children (if any)
        if parent is not None:
            self._children[parent].append((name, attrs))
            
        #Initialisation of own children
        self._children[identifier] = [ ]
        
        #Update stack
        self._stack.append((identifier, name, attrs))
        
        #Use registered callbacks
        for callback in self._start_callbacks:
            callback(name, attrs, self._locator, ancestor)
        
    def endElement(self, name):
        """
        Method invoked when an element of the XML dataset has been fully read.

        This method is part of the xml.sax.ContentHandler interface and is overridden here.

        - name: tag of element being read
        """        
        #Get identifier
        identifier = self._stack[-1][0]
        
        #Use registered callbacks before element is cleaned        
        for callback in self._end_callbacks:
            callback(name, self._locator, self._children[identifier])
            
        #Cleaning
        identifier, name, attrs = self._stack.pop(-1)
        del self._children[identifier]
            
    def getTagsCount(self):
        """
        Get a dictionary with tags count.

        - return: dictionary where keys are tags and values are counts
        """
        return dict(self._element_tags)

    def getTagsAncestors(self):
        """
        Get a dictionary with tags ancestors.

        - return: dictionary where keys are tags and values are sets of all the different ancestor paths
        """
        return dict(self._element_ancestors)
    
    def registerStartEventCallback(self, func):
        """
        Register a callback for start event.

        Note that the return value of the callback is ignored. Any exception raised by the callback is not caught by
        the handler, so you should take care of catching all exceptions within the callback itself.

        - func: a callable object taking element name, element attributes, locator and ancestor as arguments.
        """
        self._start_callbacks.append(func)
        
    def registerEndEventCallback(self, func):
        """
        Register a callback for end event.

        Note that the return value of the callback is ignored. Any exception raised by the callback is not caught by
        the handler, so you should take care of catching all exceptions within the callback itself.

        - func: a callable object taking element name, locator and element children as arguments.
        """
        self._end_callbacks.append(func)
        
    def clearCallbacks(self):
        """
        Remove all registered callbacks.
        """
        self._end_callbacks = [ ]
        self._start_callbacks = [ ]
        
    def _requestUniqueIdentifier(self):
        """
        Return a unique identifier used at parsing time.
        
        - return: identifier
        """
        self._id += 1
        return self._id

We can now use the handler in SAX parsing:

In [389]:
parser = xml.sax.make_parser()
with OpenStreetMapXmlHandler(caching=False) as handler:
    parser.setContentHandler(handler)
    parser.parse(dataset_path)

In [390]:
#Get tag counts
pprint(handler.getTagsCount())

{u'member': 17264,
 u'meta': 1,
 u'nd': 602009,
 u'node': 428784,
 u'note': 1,
 u'osm': 1,
 u'relation': 469,
 u'tag': 218904,
 u'way': 71960}


The returned tag count is the same as the one we have calculated using `et.iterparse`.

In [391]:
#Get tag ancestors
pprint(handler.getTagsAncestors())

{u'member': set([u'osm.relation']),
 u'meta': set([u'osm']),
 u'nd': set([u'osm.way']),
 u'node': set([u'osm']),
 u'note': set([u'osm']),
 u'osm': set(['']),
 u'relation': set([u'osm']),
 u'tag': set([u'osm.node', u'osm.relation', u'osm.way']),
 u'way': set([u'osm'])}


As we discussed earlier:
- **osm** element has no ancestor (it's root element)
- **meta** and **note** only appear in **osm** element
- **node**, **way** and **relation** are direct children of **osm**
- **tag** can be used to document any of **node**, **way** and **relation**
- **member** are only used in **relation** elements (to reference either nodes, ways or other relations)
- **nd** are only used in **way** elements (to reference nodes)

These results will help us a lot when auditing [data quality](#Data quality).
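
The stack-based bookkeeping behind these ancestor paths can be reduced to a few lines. The sketch below is a stripped-down stand-in for the handler above (no callbacks, no caching), run on a hand-written sample. The `AncestorHandler` name and the sample are assumptions made for illustration.

```python
import xml.sax
from collections import defaultdict

# Hand-written sample (hypothetical values)
SAMPLE = b"""<osm>
  <node id="1" lat="43.58" lon="1.34"><tag k="place" v="town"/></node>
  <way id="10"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>"""

class AncestorHandler(xml.sax.ContentHandler):
    """Record, for each tag, the dotted paths of its enclosing elements."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.stack = []                     # names of currently open elements
        self.ancestors = defaultdict(set)   # tag -> set of ancestor paths

    def startElement(self, name, attrs):
        # The ancestor path is the stack as it is before the element is pushed
        self.ancestors[name].add(".".join(self.stack))
        self.stack.append(name)

    def endElement(self, name):
        # The element is closed: remove it from the stack
        self.stack.pop()

handler = AncestorHandler()
xml.sax.parseString(SAMPLE, handler)
for tag in sorted(handler.ancestors):
    print("{}: {}".format(tag, sorted(handler.ancestors[tag])))
```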

<a id='Data quality'/>

## Data quality audit *[top](#Top)*

This chapter is divided into five sections, one for each kind of data quality audit:
- [Validity](#Data validity)
- [Accuracy](#Data accuracy)
- [Completeness](#Data completeness)
- [Consistency](#Data consistency)
- [Uniformity](#Data uniformity)

<a id='Data validity'/>

### Validity *[audit](#Data quality)*

Validity is about compliance with a schema. The data we have retrieved from the OpenStreetMap servers is an XML file. There are techniques to validate XML structures, such as XML Schema. We won't use them here because the schema is relatively simple and because the XML files can be large, so we want to stick to the SAX parser.

Actually, the SAX content handler that has been introduced in previous [section](#XML data structure) will be helpful here as it's already able to list ancestors for each element. We can then define a schema in a similar form and compare both to see if there is any issue.

The schema is a dictionary structured this way:
- key: element tag
- value: dictionary with the following keys / values:
    - *ancestors*: collection of acceptable ancestor paths. For example, the path 'osm.way' means that the element shall be a child of a way element, which is itself a child of an osm element.
    - *minOccurences*: minimum number of elements in the dataset (greater than or equal to 0), optional
    - *maxOccurences*: maximum number of elements in the dataset (greater than or equal to 1), optional
    - *requiredAttributes*: list of attribute names that shall be defined for the element
    - *requiredChildren*: list of required child elements
    - *attributesFuncs*: list of callable objects to be run on the element attributes for further checks

In [392]:
import functools

#Function to check numbers
check_digit = lambda name, attr: attr[name].isdigit()
check_id_digit = functools.partial(check_digit, 'id')
check_ref_digit = functools.partial(check_digit, 'ref')

#Define a schema
schema = {
    #osm is the root node. There shall be exactly one.
    'osm': { 
        'ancestors': {''}, 
        'minOccurences': 1,
        'maxOccurences': 1},
    #meta shall be within osm element. There shall be exactly one of those.
    'meta': {
        'ancestors': {'osm'},
        'minOccurences': 1,
        'maxOccurences': 1},
    #note shall be within osm element. There shall be exactly one of those.
    'note': {
        'ancestors': {'osm'},
        'minOccurences': 1,
        'maxOccurences': 1},        
    #node shall be within osm element. A node shall have id, lat (latitude) and lon (longitude) attributes.
    #Additionally, lat shall be in the range [-90, 90] and lon in the range [-180, 180]. id shall contain only
    #digits.
    'node': {
        'ancestors': {'osm'},
        'requiredAttributes': ['id', 'lat', 'lon'],
        'attributesFuncs': [lambda attr: -90 <= float(attr['lat']) <= 90, 
                            lambda attr: -180 <= float(attr['lon']) <= 180,
                            check_id_digit]},
    #way shall be within osm element. A way shall have an id attribute and at least one nd child.
    #id shall contain only digits.
    'way': {
        'ancestors': {'osm'},
        'requiredAttributes': ['id'],
        'requiredChildren': ['nd'],
        'attributesFuncs': [check_id_digit]},
    #nd shall be within way element. A nd shall have a ref attribute, which shall contain only digits.
    'nd': {
        'ancestors': {'osm.way'},
        'requiredAttributes': ['ref'],
        'attributesFuncs': [check_ref_digit]},
    #relation shall be within an osm element. It shall have an id attribute and at least one member child. id shall
    #contain only digits.
    'relation': {
        'ancestors': {'osm'},
        'requiredAttributes': ['id'],
        'requiredChildren': ['member'],
        'attributesFuncs': [check_id_digit]},
    #member shall be within a relation element. It shall have type, ref and role attributes. The type attribute shall
    #be either way, node or relation. The ref attribute shall contain only digits.
    'member': {
        'ancestors': {'osm.relation'},
        'requiredAttributes': ['type', 'ref', 'role'],
        'attributesFuncs': [lambda attr: attr['type'] in ['way', 'node', 'relation'],
                            check_ref_digit]},
        
    #tag shall be within node, way or relation. It shall have k and v attributes.
    'tag': {
        'ancestors': {'osm.node', 'osm.way', 'osm.relation'},
        'requiredAttributes': ['k', 'v']},
    }
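
To show how the *attributesFuncs* entries are meant to be applied, here is a minimal sketch that runs a reduced version of the node checks on two attribute dictionaries. The `run_checks` helper and the invalid sample node are hypothetical. The audit class of this notebook records a message for each failure rather than returning indices.

```python
import functools

def check_digit(name, attr):
    """True when the named attribute contains only digits."""
    return attr[name].isdigit()

check_id_digit = functools.partial(check_digit, "id")

# Reduced version of the node entry of the schema
node_checks = [
    lambda attr: -90 <= float(attr["lat"]) <= 90,
    lambda attr: -180 <= float(attr["lon"]) <= 180,
    check_id_digit,
]

def run_checks(attrs, funcs):
    """Return the indices of the checks that fail or raise an exception."""
    failed = []
    for i, func in enumerate(funcs):
        try:
            ok = func(attrs)
        except Exception:
            ok = False  # a missing or malformed attribute counts as a failure
        if not ok:
            failed.append(i)
    return failed

good = {"id": "26691412", "lat": "43.5827846", "lon": "1.3466543"}
bad = {"id": "-12", "lat": "143.58", "lon": "1.34"}  # hypothetical invalid node

print(run_checks(good, node_checks))  # → []
print(run_checks(bad, node_checks))   # → [0, 2]
```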

In order to have this schema validated, we are going to create a callback to be passed to SAX content handler we have created earlier:

In [393]:
"""
Data validity audit object in a form of a callback for SAX content handler.

This audit class checks validity against a schema. The nonconformities can be requested after parsing.
"""
class DataValidityAudit(object):
    """
    Constructor.
    
    The specified schema has the following structure:
    
    - key: element tag
    - value: dictionary with the following keys / values:
        - *ancestors*: collection of acceptable ancestor paths. For example, the path 'osm.way' means that the element
        shall be a child of a way element, which is itself a child of an osm element.
        - *minOccurences*: minimum number of elements in the dataset (greater than or equal to 0), optional
        - *maxOccurences*: maximum number of elements in the dataset (greater than or equal to 1), optional
        - *requiredAttributes*: list of attribute names that shall be defined for the element
        - *requiredChildren*: list of required child elements
        - *attributesFuncs*: list of callable objects to be run on the element attributes for further checks
    
    - schema: dictionary with schema to be checked.
    """
    def __init__(self, schema):
        self._schema = schema
        self._count_tags = Counter()
        self._nonconformities = [ ]
    
    """
    Method called back when a start event is encountered.
    
    - name: element name
    - attrs: element attributes
    - locator: locator object from SAX parser
    - ancestor: ancestor in the form of a string
    """
    def startEventCallback(self, name, attrs, locator, ancestor):
        #Update counter
        self._count_tags[name] += 1
        
        #Check ancestors
        try:
            ancestors = self._schema[name]['ancestors']
        except KeyError:
            pass
        else:
            if ancestor not in ancestors:
                message = "{} element at line {} and column {} has an invalid ancestor: {}".format(
                    name, locator.getLineNumber(), locator.getColumnNumber(), ancestor)
                self._nonconformities.append(('Validity', message))
                
        #Check attributes
        try:
            required_attributes = self._schema[name]['requiredAttributes']
        except KeyError:
            pass
        else:
            for attribute in required_attributes:
                try:
                    attrs[attribute]
                except KeyError:
                    message = "{} element at line {} and column {} is missing a required attribute {}.".format(
                        name, locator.getLineNumber(), locator.getColumnNumber(), attribute)
                    self._nonconformities.append(('Validity', message))
                    
        #Special checks for attributes
        try:
            funcs = self._schema[name]['attributesFuncs']
        except KeyError:
            pass
        else:
            for i, func in enumerate(funcs):
                try:
                    status = func(attrs)
                except Exception as e:
                    exception = "{}({})".format(type(e).__name__, e)
                    message = "An exception {} has been raised while checking attributes with function {} " \
                            "for element {} at line {} and column {}.".format(
                            exception, i, name, locator.getLineNumber(), locator.getColumnNumber())
                    self._nonconformities.append(('Validity', message))
                else:
                    if not status:
                        message = "A custom attribute check failed with function {} for element {} at line {} " \
                            "and column {}.".format(i, name, locator.getLineNumber(), locator.getColumnNumber())
                        self._nonconformities.append(('Validity', message))
                        
    """
    Method called back when an end event is encountered.
    
    - name: element name
    - locator: locator object from SAX parser
    - children: element children
    """
    def endEventCallback(self, name, locator, children):
        #Check required children
        try:
            required_children = self._schema[name]['requiredChildren']
        except KeyError:
            pass
        else:
            actual_children = {c[0] for c in children}
            for r in required_children:
                if r not in actual_children:
                    message = "An element {} is missing in element {} at line {} and column {}.".format(
                            r, name, locator.getLineNumber(), locator.getColumnNumber())
                    self._nonconformities.append(('Validity', message))

    """
    Return nonconformities.
    
    A list of tuple is returned:
    - type of audit
    - nonconformity description
    """
    def getNonconformities(self):
        #Initialization
        nonconformities = self._nonconformities[:]
        
        #Check occurrences (we cannot do that on the fly)
        for tag, conf in self._schema.iteritems():
            try:
                min_occurs = conf['minOccurences']
            except KeyError:
                pass
            else:
                if self._count_tags[tag] < min_occurs:
                    message = "The minOccurences criteria failed for {} element. " \
                        "Found {} element(s) while {} is the minimum.".format(tag, self._count_tags[tag], min_occurs)
                    nonconformities.append(('Validity', message))
            try:
                max_occurs = conf['maxOccurences']
            except KeyError:
                pass
            else:
                if self._count_tags[tag] > max_occurs:
                    message = "The maxOccurences criteria failed for {} element. " \
                        "Found {} element(s) while {} is the maximum.".format(tag, self._count_tags[tag], max_occurs)
                    nonconformities.append(('Validity', message))
            
        #End of post-processing
        return nonconformities

In [398]:
import tabulate

#Define a method to parse and audit
def parse_and_audit(dataset_path, audit=None):
    """
    Parse XML dataset and perform quality audit.
    
    - dataset_path: path to the dataset to be parsed and audited
    - audit: a sequence of audit objects
    - return: sequence of nonconformities
    """
    with OpenStreetMapXmlHandler() as handler:
        if audit is not None:
            for obj in audit:
                handler.registerStartEventCallback(obj.startEventCallback)
                handler.registerEndEventCallback(obj.endEventCallback)
        parser = xml.sax.make_parser()
        parser.setContentHandler(handler)
        parser.parse(dataset_path)
    nonconformities = []
    if audit is not None:
        for obj in audit:
            nonconformities.extend(obj.getNonconformities())
    return nonconformities

In [400]:
#Parse and audit
%time nonconformities = parse_and_audit(dataset_path, [DataValidityAudit(schema)])
display(HTML(tabulate.tabulate(nonconformities, tablefmt='html')))

CPU times: user 13.1 s, sys: 148 ms, total: 13.2 s
Wall time: 13.3 s


The table above is expected to be empty, and it is: no nonconformity has been detected by the validity audit. The data we get from OpenStreetMap may be trusted in terms of schema compliance.

The `%time` Jupyter magic command reports how long it takes to parse and audit the data. As a reference, it takes approximately 13 seconds to parse and audit the dataset of around 100 MB.

<a id='Data accuracy'/>

### Accuracy *[audit](#Data quality)*

Accuracy is a measure of conformity with a gold standard. For a dataset such as the one from OpenStreetMap, it may be difficult to find a gold standard. We are therefore going to limit this audit to values that are sometimes provided in the dataset for items which represent a town:
- INSEE identifier (ref:INSEE in the example below)
- Population
- Date of last census (source:population in the example below)

Here is an example:

```xml
<node id="26691412" lat="43.5827846" lon="1.3466543" version="16" timestamp="2017-08-21T22:29:38Z"  changeset="51321527" uid="6523296" user="ccampguilhem">
    <tag k="addr:postcode" v="31170"/>
    <tag k="name" v="Tournefeuille"/>
    <tag k="name:fr" v="Tournefeuille"/>
    <tag k="name:oc" v="Tornafuèlha"/>
    <tag k="place" v="town"/>
    <tag k="population" v="26 674"/>
    <tag k="ref:FR:SIREN" v="213105570"/>
    <tag k="ref:INSEE" v="31557"/>
    <tag k="source:population" v="INSEE 2014"/>
    <tag k="wikidata" v="Q328022"/>
    <tag k="wikipedia" v="fr:Tournefeuille"/>
</node>
```

For this example, I have updated the OpenStreetMap database manually to match official data published by [INSEE](https://www.insee.fr/en/accueil). I will use INSEE data as gold standard (see [here](https://www.insee.fr/fr/statistiques/1405599?geo=COM-31557+COM-31291+COM-31149+COM-31424+COM-31157+COM-31417)). The last census in my region is from 2014.

We are going to define a gold standard in a dictionary for a few towns surrounding Tournefeuille. If you have selected a user-defined map area, it may not be suitable for you:

In [401]:
#Used to convert a number with thousand separators in the XML into a Python integer
convert_to_int = lambda x: int(x.replace(" ", ""))

gold_standard = {
    u'Tournefeuille': {
        'population': (convert_to_int, 26674),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31557)},
    u'Léguevin': {
        'population': (convert_to_int, 8892),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31291)},
    u'Colomiers': {
        'population': (convert_to_int, 38541),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31149)},
    u'Plaisance-du-Touch': {
        'population': (convert_to_int, 17278),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31424)},
    u'Cugnaux': {
        'population': (convert_to_int, 17004),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31157)},
    u'Pibrac': {
        'population': (convert_to_int, 8226),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31417)},
    u'Toulouse': {
        'population': (convert_to_int, 466297),
        'source:population': (str, 'INSEE 2014'),
        'ref:INSEE': (convert_to_int, 31555)},       
}
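As a minimal sketch of how one gold standard entry is meant to be read, the snippet below applies each conversion function to a raw tag value and compares the result with the expected value. The helper name `find_mismatches` and the outdated Pibrac record are illustrative assumptions, not part of the notebook; the Tournefeuille values are copied from the XML example above.

```python
def convert_to_int(text):
    """Convert a number written with space separators ("26 674") into an int."""
    return int(text.replace(" ", ""))


def find_mismatches(tags, entry):
    """Return the sorted keys whose converted tag value differs from the expected one.

    find_mismatches is a hypothetical helper used for illustration only.
    """
    return sorted(key for key, (convert, expected) in entry.items()
                  if convert(tags[key]) != expected)


# Same structure as one entry of gold_standard: key -> (conversion function, expected value)
tournefeuille_entry = {
    "population": (convert_to_int, 26674),
    "source:population": (str, "INSEE 2014"),
    "ref:INSEE": (convert_to_int, 31557),
}
pibrac_entry = {
    "population": (convert_to_int, 8226),
    "source:population": (str, "INSEE 2014"),
    "ref:INSEE": (convert_to_int, 31417),
}

# Raw tag values as they appear in the Tournefeuille <node> example
tournefeuille_tags = {
    "population": "26 674",
    "source:population": "INSEE 2014",
    "ref:INSEE": "31557",
}
# Assumed outdated record, for illustration only
pibrac_tags = {
    "population": "8091",
    "source:population": "INSEE 2013",
    "ref:INSEE": "31417",
}

up_to_date = find_mismatches(tournefeuille_tags, tournefeuille_entry)
outdated = find_mismatches(pibrac_tags, pibrac_entry)
print(up_to_date)
print(outdated)
```

Note that `replace(" ", "")` only removes ordinary spaces. If a value used another separator, such as a no-break space, `int` would raise `ValueError` and the conversion function would need to be extended.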

Let's create an audit class for accuracy. It compares each piece of information from elements having a "population" tag to the gold standard above.

In [402]:
"""
Data accuracy audit object in a form of a callback for SAX content handler.

This audit class checks compliance to gold standard. The nonconformities can be requested after parsing.
This audit is only applied to elements which have a tag child element with k = population.
"""
class DataAccuracyAudit(object):
    """
    Constructor.
    
    The specified standard has the following structure:
    
    - key: town name
    - value: dictionary with the following keys / values. Each value is a tuple of conversion function and expected
    value:
        - *population*: population as measured during the last census
        - *source:population*: source of last census
        - *ref:INSEE*: identifier of town in gold standard (INSEE)
        
    - standard: gold standard dictionary
    - fix: toggle automatic fixing of data
    """
    def __init__(self, standard, fix=False):
        self._standard = standard
        self._nonconformities = [ ]
        self._fix = fix
    
    """
    Method called back when a start event is encountered.
    
    - name: element name
    - attrs: element attributes
    - locator: locator object from SAX parser
    - ancestor: ancestor in the form of a string
    """
    def startEventCallback(self, name, attrs, locator, ancestor):
        pass
                                
    """
    Method called back when an end event is encountered.
    
    - name: element name
    - locator: locator object from SAX parser
    - children: element children
    """
    def endEventCallback(self, name, locator, children):
        #Find item with a tag child having population as k value and compare to standard
        match = self._findTagInChildren(children, 'population')
        if match is not None:
            town = self._findTagInChildren(children, 'name:fr')
            try:
                standard = self._standard[town]
            except KeyError:
                message = "Town {} has been found but is not in the standard. Accuracy cannot be assessed.".format(town)
                self._nonconformities.append(('Accuracy', message))
            else:
                for key, value in standard.iteritems():
                    dataset_value = value[0](self._findTagInChildren(children, key))
                    if dataset_value != value[1]:
                        message = '"{}" value provided for "{}" of town {} is inaccurate. '\
                                'Expected value is "{}".'.format(dataset_value, key, town, value[1])
                        self._nonconformities.append(('Accuracy', message))
        
    """
    Return nonconformities.
    
    A list of tuple is returned:
    - type of audit
    - nonconformity description
    """
    def getNonconformities(self):
        return self._nonconformities[:]
    
    def _findTagInChildren(self, children, key, value=None):
        """
        Find in children a tag element with specified attribute key.
        
        If value is set to None, the value is returned. If value is specified, name and attrs of child are returned.
        In case no element or value is found, None is returned.
        
        - children: list of tuples (name of element, element attributes)
        - return: value, (name, attributes) or None
        """
        #Look for a tag element whose k attribute equals the specified key
        for name, attrs in children:
            #Skip if this is not a tag
            if name != "tag":
                continue
            #It's a tag
            try:
                k = attrs['k']
            except KeyError:
                continue
            else:
                if k != key:
                    continue
                else:
                    try:
                        v = attrs['v']
                    except KeyError:
                        continue
                    else:
                        if value is None:
                            return v
                        elif v == value:
                            return name, attrs
        #No matching tag has been found in any child
        return None
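The `children` argument received by the end-event callback is a list of `(element name, attributes)` tuples. The sketch below shows that structure and a compact version of the tag lookup on hand-written data. The function name `find_tag_value` is hypothetical, and plain dictionaries stand in for the attribute objects, assuming those behave like read-only mappings.

```python
def find_tag_value(children, key):
    """Return the v attribute of the first <tag> child whose k equals key, else None.

    Hypothetical, simplified counterpart of the lookup done by the audit class.
    """
    for name, attrs in children:
        # Only <tag> children carrying both k and v attributes are candidates
        if name == "tag" and attrs.get("k") == key and "v" in attrs:
            return attrs["v"]
    return None


# Children of the Tournefeuille node, reduced to three tags for brevity
children = [
    ("tag", {"k": "name:fr", "v": "Tournefeuille"}),
    ("tag", {"k": "place", "v": "town"}),
    ("tag", {"k": "population", "v": "26 674"}),
]

population = find_tag_value(children, "population")
missing = find_tag_value(children, "source:population")
print(population)
print(missing)
```

A missing tag yields `None`. A conversion function such as `convert_to_int` does not accept `None`, so a caller may want to test for it before converting.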

In [403]:
#Parse and audit
%time nonconformities = parse_and_audit(dataset_path, [DataValidityAudit(schema), DataAccuracyAudit(gold_standard)])
display(HTML(tabulate.tabulate(nonconformities, tablefmt='html')))

CPU times: user 14 s, sys: 192 ms, total: 14.2 s
Wall time: 14.2 s


| Audit type | Nonconformity |
|---|---|
| Accuracy | "INSEE 2013" value provided for "source:population" of town Plaisance-du-Touch is inaccurate. Expected value is "INSEE 2014". |
| Accuracy | "16091" value provided for "population" of town Plaisance-du-Touch is inaccurate. Expected value is "17278". |
| Accuracy | "INSEE 2013" value provided for "source:population" of town Colomiers is inaccurate. Expected value is "INSEE 2014". |
| Accuracy | "35186" value provided for "population" of town Colomiers is inaccurate. Expected value is "38541". |
| Accuracy | "INSEE 2013" value provided for "source:population" of town Toulouse is inaccurate. Expected value is "INSEE 2014". |
| Accuracy | "441802" value provided for "population" of town Toulouse is inaccurate. Expected value is "466297". |
| Accuracy | "INSEE 2013" value provided for "source:population" of town Pibrac is inaccurate. Expected value is "INSEE 2014". |
| Accuracy | "8091" value provided for "population" of town Pibrac is inaccurate. Expected value is "8226". |


Some accuracy issues are reported because the OpenStreetMap data has not been updated since the 2014 census.
No issue is reported for Tournefeuille because I have manually updated the OpenStreetMap database.
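To experiment with an end-event callback on a small document rather than the full extract, a SAX content handler can collect the direct children of each element and hand them over when the element closes. This is only a sketch under assumptions: `ChildCollector`, `record_population` and the sample XML are hypothetical, the callback signature is simplified (no locator), and this is not the `parse_and_audit` function used in the notebook.

```python
import xml.sax

# Hypothetical two-node sample, for illustration only
SAMPLE = b"""<osm>
  <node id="1">
    <tag k="name:fr" v="Tournefeuille"/>
    <tag k="population" v="26 674"/>
  </node>
  <node id="2">
    <tag k="amenity" v="cafe"/>
  </node>
</osm>"""


class ChildCollector(xml.sax.ContentHandler):
    """Collect the direct children of each element and pass them to a callback."""

    def __init__(self, end_callback):
        xml.sax.ContentHandler.__init__(self)
        self._end_callback = end_callback
        self._stack = []  # one list of children per currently open element

    def startElement(self, name, attrs):
        # Register this element as a child of its parent, if it has one
        if self._stack:
            self._stack[-1].append((name, dict(attrs.items())))
        self._stack.append([])

    def endElement(self, name):
        children = self._stack.pop()
        self._end_callback(name, children)


towns = []


def record_population(name, children):
    """Keep (name:fr, population) for every element that has a population tag."""
    tags = dict((attrs["k"], attrs["v"]) for child, attrs in children if child == "tag")
    if "population" in tags:
        towns.append((tags.get("name:fr"), tags["population"]))


xml.sax.parseString(SAMPLE, ChildCollector(record_population))
print(towns)
```

Only one list of children per open element is kept in memory, so the approach stays compatible with large files.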

<a id='Data completeness'/>

### Completeness *[audit](#Data quality)*

<a id='Data consistency'/>

### Consistency *[audit](#Data quality)*

<a id='Data uniformity'/>

### Uniformity *[audit](#Data quality)*

<a id="Appendix"/>

## Appendix *[top](#Top)*

### References

[OpenStreetMap wiki](http://wiki.openstreetmap.org/wiki/Main_Page)<hr>
[INSEE](https://www.insee.fr/en/accueil) is the French National Institute of Statistics and Economic Studies. In this project, it is used as the *gold* standard.<hr>
Validating an XML tree with [XML Schema](https://www.w3schools.com/xml/schema_intro.asp) can be done with the [lxml](http://lxml.de/validation.html) library. This technique has not been used here because the XML structure is simple enough. Additionally, XML Schema validation requires loading the XML data into memory and may not be suitable for large files like the ones we might have here.<hr>
Get line number in a content handler with SAX parser on [StackOverflow](https://stackoverflow.com/a/15477803/8500344)<hr>
Display lists as html tables in notebook on [StackOverflow](https://stackoverflow.com/a/42323522/8500344)<hr>
[Diskcache](http://www.grantjenks.com/docs/diskcache/tutorial.html), a disk and file backed cache library<hr>
