# P3: Wrangle OpenStreetMap Data

## 1. Choose Your Map Area

I chose the Spanish town of Santander as the map area. All data was downloaded from https://www.openstreetmap.org as an OSM XML dataset, using the Overpass API to download a rectangular area around Santander. The file size is 62.4 MB. The downloaded area is bounded by N: 43.4978, S: 43.3893, E: -3.7096, W: -3.9791.
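
The bounding box above can be turned into a download request. The following is a minimal sketch, assuming the public Overpass endpoint `https://overpass-api.de/api/map` and its `bbox=west,south,east,north` parameter order. The helper name `overpass_bbox_url` is hypothetical, and the original download may have been made in a different way (for example through the export page).

```python
# Sketch: build an Overpass "map" request for the Santander bounding box.
# The endpoint constant and the helper name are assumptions for illustration.
OVERPASS_MAP_ENDPOINT = "https://overpass-api.de/api/map"


def overpass_bbox_url(west, south, east, north):
    """Return a URL requesting all OSM data inside the bounding box."""
    if not (west < east and south < north):
        raise ValueError("bounding box must satisfy west < east and south < north")
    return "{0}?bbox={1},{2},{3},{4}".format(
        OVERPASS_MAP_ENDPOINT, west, south, east, north)


url = overpass_bbox_url(west=-3.9791, south=43.3893, east=-3.7096, north=43.4978)
print(url)
```

The function only builds the URL; fetching it and saving the response as mapSantander.osm is left out on purpose.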

In this project we use data munging techniques to clean OpenStreetMap data for a part of the world that we care about, with the help of MongoDB. We start by thoroughly auditing and cleaning the dataset and converting it from OSM XML to JSON. Then we import the cleaned .json file into a MongoDB database and run some queries.


## 2. Process and Audit Data


I will use the following code to take a systematic sample of elements from the original OSM region. It creates a file called sample.osm, which we will use for testing the functions that follow.

In [58]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET  # Use cElementTree or lxml if too slow

OSM_FILE = "mapSantander.osm"  # Replace this with your osm file
SAMPLE_FILE = "sample.osm"

k = 200 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write('</osm>')

### 2.1. Tags

First of all, I will import the needed libraries and count how many times each element type occurs.

In [59]:
import pprint
import re

def count_tags(filename):
    tags = {}
    for _, element in ET.iterparse(filename):

        if element.tag in tags:
            tags[element.tag] += 1
        else:    
            tags[element.tag] = 1
        
    return tags


tags = count_tags(OSM_FILE)
pprint.pprint(tags)


{'bounds': 1,
 'member': 27161,
 'meta': 1,
 'nd': 320009,
 'node': 262105,
 'note': 1,
 'osm': 1,
 'relation': 579,
 'tag': 244418,
 'way': 32434}


These counts give us an overall understanding of the dataset. Now we will look for problems in the tag key names and try to solve them.

In [60]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


tags = process_map(OSM_FILE)
pprint.pprint(tags)



{'lower': 145664, 'lower_colon': 97828, 'other': 926, 'problemchars': 0}


The only category that could cause problems is 'problemchars', but we do not have any keys in it. If we had, we would ignore them.
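
To make the four categories concrete, here is a small sketch that applies the same three regular expressions to a handful of keys. The keys are made up for illustration and are not taken from the Santander dataset; `classify_key` is a hypothetical helper that mirrors the order of checks in `key_type`.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category that key_type would count for key k."""
    if lower.search(k):
        return 'lower'
    if lower_colon.search(k):
        return 'lower_colon'
    if problemchars.search(k):
        return 'problemchars'
    return 'other'


# Illustrative keys only
for key in ['highway', 'addr:street', 'opening hours', 'name:es_1', 'FIXME']:
    print(key, '->', classify_key(key))
```

Keys with digits or capital letters fall into 'other', which is harmless for MongoDB; only the 'problemchars' category would need to be dropped.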

### 2.2. Street Names

We found problems with some abbreviations in the dataset. In Spanish it is very common to write c/ instead of calle (street). We start by replacing these abbreviations and misspellings with the correct words. We have also found other curiosities, such as accent marks (tildes) and capital letters in some data, which we have left as they are.

We are processing only two types of top level tags, "node" and "way".

In [61]:
from collections import defaultdict

street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

expected = ["Calle", "CALLE", u"Barrio", u"Centro", "Calleja", "Centro Comercial", "Avenida", "Plaza", "Camino", "Estacion", "Parking", "Campus", "Carretera", 
            "Glorieta", "Paseo", "Rotonda", "Juan", "Gran", "Dante", "Maria", "Pasaje", u'Le\xf3n', u'Comisar\xeda', 
            "Edificio", "Vivero", "CARRETERA", "Centro", "Lope", u'pol\xedgono', u'Pol\xedgono', "Bajada", "Subida", "Grupo", "Rampa", 
            "Barrio", "AREA", "La", "Acceso", "POLIGONO", "Mercado", "Cuesta", u"Urbanizaci\xf3n", "Ernest", "Pol", "Puerto", "Jardines",
            "San",u"Autov\xeda", u"V\xeda", "MercaSantander", u"Traves\xeda", u"ISLA", u"Playa", "N-611", "BARRIO", "Las"]

# Maps misspelled or abbreviated street types to their corrected form
mapping = { "C/": "Calle",
            "Barrio": "Barrio",
            "Calle": "Calle",
            "Calles": "Calle",
            "Avenidad" : "Avenida",
            "Avda.": "Avenida",
            u"Calla": "Calle",
            "name=Avenida": "Avenida",
            "name=Calle": "Calle",
            "AREA,": "Area",
            "Bajade": "Bajada",
            "Ramapa": "Rampa"
          }


As you can see, we take the first word of the street name, because in Spanish the street type (Street, Avenue) comes at the beginning. However, it is very common for names to start with other words.

We will create four functions to help us clean the data.

* audit_street_type finds street names whose first word is not in expected and adds them to a set.
* is_street_name says whether a tag holds a street name.
* auditstreet returns a dictionary that groups the street names that do not match expected by their first word.
* update_name replaces the street type in a name with its corrected form.

In [62]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def auditstreet(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are fully parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag): # Only addr:street tags are audited
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types


def update_name(name, mapping):
    m = street_type_re.search(name)
    better_name = name
    # Only replace street types that have a known correction in mapping
    if m and m.group() in mapping:
        better_street_type = mapping[m.group()]
        better_name = street_type_re.sub(better_street_type, name)

    return better_name


street_types = auditstreet(OSM_FILE)

pprint.pprint(dict(street_types))
#for st_type, ways in street_types.iteritems():
#    for name in ways:
#        better_name = update_name(name, mapping)
#        print name, "=>", better_name




{'AREA,': set(['AREA, ARRABAL PUERTO DE RAOS']),
 'Avda.': set(['Avda. de la Reina Victoria']),
 u'Avenidad': set([u'Avenidad de P\xe9rez Gald\xf3s']),
 'Bajade': set(['Bajade del Caleruco']),
 u'Calla': set([u'Calla de San Mart\xedn', 'Calla del Convento']),
 'Calles': set(['Calles de los Abedules']),
 'Ramapa': set(['Ramapa de Sotileza']),
 'name=Avenida': set(['name=Avenida del Cardenal Herrera Oria']),
 'name=Calle': set(['name=Calle de Luis Salgado Lodeiro'])}
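
As a worked example, the sketch below applies a subset of the mapping to some of the audited names. It is self-contained and only illustrative: `fix_street` is a hypothetical helper, and 'Paseo de Pereda' is an extra name added here to show a value that is left untouched.

```python
import re

street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

# Subset of the mapping defined above
mapping = {"Avda.": "Avenida", "Calla": "Calle", "Ramapa": "Rampa",
           "name=Calle": "Calle"}


def fix_street(name, mapping):
    """Replace the first word of name when it has a known correction."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        # Slicing avoids treating the replacement text as a regex template
        return mapping[m.group()] + name[m.end():]
    return name


names = ['Avda. de la Reina Victoria', 'Calla del Convento',
         'Ramapa de Sotileza', 'name=Calle de Luis Salgado Lodeiro',
         'Paseo de Pereda']
for name in names:
    print(name, '=>', fix_street(name, mapping))
```

Names whose first word is not in the mapping are returned unchanged, so the helper is safe to call on every street value.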


### 2.3. Postcode

The only kind of error we see in postcodes is a value that is not exactly five digits long (too short, too long, or not a number at all). We replace such postcodes with s/n (unknown value).

In [63]:
def audit_postcode(postcode_types, postcode):
    postcode_types[postcode].add(postcode)


def is_postcode(elem):
    return (elem.attrib['k'] == "addr:postcode" and len(elem.attrib['v']) != 5)

def auditpostcode(osmfile):
    postcode_types = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are fully parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_postcode(tag): # Postcodes that are not 5 characters long
                        audit_postcode(postcode_types, tag.attrib['v'])
    return postcode_types


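def update_postcode(postcode):
    # Keep valid five-digit postcodes; replace anything else with 's/n' (unknown)
    if len(postcode) == 5 and postcode.isdigit():
        return postcode
    return 's/n'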

postcode_types = auditpostcode(OSM_FILE)

pprint.pprint(dict(postcode_types))

#for st_type, ways in postcode_types.iteritems():
#    for name in ways:
#        better_name = update_postcode(name)
#        print name, "=>", better_name


{'3012': set(['3012']),
 '390012': set(['390012']),
 '<diferente>': set(['<diferente>']),
 'Santander': set(['Santander'])}
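
The four audited values fail for different reasons. The sketch below labels each one, assuming that a valid postcode is exactly five digits; `postcode_problem` is a hypothetical helper, and '39001' is added only as an example of a value that passes.

```python
def postcode_problem(postcode):
    """Return None for a five-digit postcode, otherwise a short reason."""
    if not postcode.isdigit():
        return 'not numeric'
    if len(postcode) < 5:
        return 'too short'
    if len(postcode) > 5:
        return 'too long'
    return None


audited = ['3012', '390012', '<diferente>', 'Santander', '39001']
problems = {p: postcode_problem(p) for p in audited}
for postcode in audited:
    print(postcode, '->', problems[postcode])
```

Any value with a reason other than None is a candidate for replacement with s/n.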


### 2.4. House Number

House numbers appear in different formats. We will use the plain number format, so values like 'Numero...' will be replaced by the number alone. Other formats such as a range '15-17', a number plus letter '21D', or two numbers '2, 4' are accepted.

In [87]:
def is_housenummer(elem):
    return (elem.attrib['k'] == "addr:housenumber")


def audit_housenummer(no_housenummer, house_nummer):
    if not house_nummer.isdigit():
        no_housenummer[house_nummer].add(house_nummer)
    

def audithousenumber(osmfile):
    no_housenummer = defaultdict(set)
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that the child <tag> elements are fully parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_housenummer(tag): # Checking house number
                        audit_housenummer(no_housenummer, tag.attrib['v'])
    return no_housenummer


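def update_housenumber(housenumber):
    # Leading run of letters, e.g. 'Numero' in a value that starts with 'Numero'
    prefix = re.match('[a-zA-Z]*', housenumber).group()

    if prefix == "Numero":
        housen = re.findall(r'\d+', housenumber)
        if housen:
            # Keep only the number(s), as a string
            return ', '.join(housen)

    # Every other format ('15-17', '21D', '2, 4') is accepted unchanged
    return housenumber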


housenummer = audithousenumber(OSM_FILE)

#pprint.pprint(dict(housenummer))

#for st_type, ways in housenummer.iteritems():
#    for name in ways:
#        better_name = update_housenumber(name)
#        print name, "=>", better_name
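
The accepted formats can be described with one regular expression each. This is a rough sketch rather than the audit used above: the format names and the `housenumber_format` helper are hypothetical, and 'Numero 12' is an assumed example of the 'Numero...' style, since the exact values in the dataset are not shown.

```python
import re

# Each accepted or known format, checked in order
HOUSENUMBER_FORMATS = [
    ('number', re.compile(r'^\d+$')),
    ('range', re.compile(r'^\d+-\d+$')),
    ('number_letter', re.compile(r'^\d+[A-Za-z]$')),
    ('list', re.compile(r'^\d+(, ?\d+)+$')),
    ('numero_prefix', re.compile(r'^Numero\s*\d+$')),
]


def housenumber_format(value):
    """Return the name of the first format that matches value."""
    for name, pattern in HOUSENUMBER_FORMATS:
        if pattern.match(value):
            return name
    return 'other'


for value in ['7', '15-17', '21D', '2, 4', 'Numero 12', 's/n']:
    print(value, '->', housenumber_format(value))
```

Only the 'numero_prefix' format needs rewriting; the remaining formats are kept as they are.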







## 3. Preparing for MongoDB

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
"""
Your task is to wrangle the data and transform the shape of the data
into the model we mentioned earlier. The output should be a list of dictionaries
that look like this:

{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

The function 'shape_element' receives a parsed element and returns a dictionary
containing the shaped data for that element. The function 'process_map' parses
the map file, calls 'shape_element' on every element and saves the results to a
file, so that the shaped data can be imported into MongoDB later on.

Street names, postcodes and house numbers are cleaned with the update functions
from section 2 before they are saved to JSON.

In particular the following things should be done:
- you should process only 2 types of top level tags: "node" and "way"
- all attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - attributes in the CREATED array should be added under a key "created"
    - attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside "pos" array are floats
      and not strings. 
- if the second level tag "k" value contains problematic characters, it should be ignored
- if the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
- if the second level tag "k" value does not start with "addr:", but contains ":", you can
  process it in a way that you feel is best. For example, you might split it into a two-level
  dictionary like with "addr:", or otherwise convert the ":" to create a valid key.
- if there is a second ":" that separates the type/direction of a street,
  the tag should be ignored, for example:

<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>

  should be turned into:

{...
"address": {
    "housenumber": "5158",
    "street": "North Lincoln Avenue"
},
"amenity": "pharmacy",
...
}

- for "way" specifically:

  <nd ref="305896090"/>
  <nd ref="1719825889"/>

should be turned into
"node_refs": ["305896090", "1719825889"]
"""


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    # Only the top level tags "node" and "way" are processed
    if element.tag != "node" and element.tag != "way":
        return None

    node = {"type": element.tag}
    for key, val in element.attrib.items():
        if key in CREATED:
            node.setdefault("created", {})[key] = val
        elif key == "lat" or key == "lon":
            # "pos" holds [latitude, longitude] as floats
            pos = node.setdefault("pos", [0.0, 0.0])
            pos[0 if key == "lat" else 1] = float(val)
        else:
            node[key] = val

    # Second level tags are processed once per element, not once per attribute
    for tag in element.iter("tag"):
        tag_key = tag.attrib['k']
        tag_val = tag.attrib['v']
        if problemchars.search(tag_key):
            # search() finds a problematic character anywhere in the key
            continue
        elif tag_key.startswith("addr:"):
            addr_key = tag_key[len("addr:"):]
            if lower_colon.match(addr_key):
                # A second ":" (for example addr:street:name) is ignored
                continue
            address = node.setdefault("address", {})
            if tag_val.split(' ')[0] in expected:
                address[addr_key] = tag_val
            elif tag_key.endswith("street"):
                address[addr_key] = update_name(tag_val, mapping)
            elif tag_key.endswith("postcode"):
                address[addr_key] = update_postcode(tag_val)
            elif tag_key.endswith("housenumber"):
                address[addr_key] = update_housenumber(tag_val)
            else:
                address[addr_key] = tag_val
        else:
            node[tag_key] = tag_val

    # Ways list the nodes they are made of
    for nd in element.iter("nd"):
        node.setdefault("node_refs", []).append(nd.attrib["ref"])

    return node


def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        fo.write("[")
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                # Write the separator before every document except the first,
                # so the file is a valid JSON array without a trailing comma
                if data:
                    fo.write(",\n")
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2))
                else:
                    fo.write(json.dumps(el))
        fo.write("]")

    return data

data = process_map(OSM_FILE, True)
pprint.pprint(data[0])
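
To show the shaping rules on a concrete input, the sketch below shapes one made-up node. It is a simplified illustration, not the full function: `shape_sample` is a hypothetical helper that skips the problematic-character check and the cleaning functions, and the node values are invented.

```python
import xml.etree.ElementTree as ET

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

# Made-up node used only for illustration
SAMPLE_XML = '''<node id="1" lat="43.4623" lon="-3.8100" version="2"
      changeset="100" timestamp="2015-01-01T00:00:00Z" user="example" uid="7">
  <tag k="addr:street" v="Calle Alta"/>
  <tag k="addr:street:name" v="Alta"/>
  <tag k="amenity" v="cafe"/>
</node>'''


def shape_sample(element):
    """Simplified shaping: attributes first, then second level tags."""
    doc = {"type": element.tag}
    for key, val in element.attrib.items():
        if key in CREATED:
            doc.setdefault("created", {})[key] = val
        elif key in ("lat", "lon"):
            pos = doc.setdefault("pos", [0.0, 0.0])
            pos[0 if key == "lat" else 1] = float(val)
        else:
            doc[key] = val
    for tag in element.iter("tag"):
        k, v = tag.attrib["k"], tag.attrib["v"]
        if k.startswith("addr:"):
            sub = k[len("addr:"):]
            if ":" not in sub:  # skip addr:street:name and similar keys
                doc.setdefault("address", {})[sub] = v
        else:
            doc[k] = v
    return doc


shaped = shape_sample(ET.fromstring(SAMPLE_XML))
print(shaped)
```

The tag addr:street:name is dropped because of its second colon, while addr:street ends up under the "address" key.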


In [None]:
import json
import pymongo

from pymongo import MongoClient

In [None]:
def insert_data(data, db):
    # Insert all documents in one call; printing every row would flood the output
    db.maps.insert_many(data)
        
client = MongoClient("mongodb://localhost:27017")
db = client.maps

with open('mapSantander.osm.json') as f:
    data = json.loads(f.read())
    insert_data(data, db)

## 4. Explore Database with MongoDB

In [57]:
# Size of the file
import os
def get_size(file_name):
    wd = %pwd
    return os.stat(wd + '/' + file_name).st_size/1000.0/1000.0

file_size = get_size('mapSantander.osm.json')
print "{} is {} MB in size.".format("mapSantander.osm.json", file_size)

# Number of documents
print "Number of Documents: " + str(db.maps.find().count())

# Number of unique users
print "Number of unique Users: " + str(len(db.maps.distinct("created.user")))

# Number of nodes and ways
print "Number of Nodes: " + str(db.maps.find({'type': "node"}).count())
print "Number of Ways: " + str(db.maps.find({'type': "way"}).count())

# Number of chosen type of nodes, like cafes, shops etc.
print "Number of Cafes: " + str(db.maps.find({'amenity':u"cafe",'type':"node"}).count())

print "Number of Restaurants: " + str(db.maps.find({'amenity':"restaurant", 'type':"node"}).count())

print "Top 10 amenities: "
pprint.pprint([doc for doc in db.maps.aggregate([{'$match':{"amenity":{"$exists":1},"type":"node"}},
                        {"$group":{"_id":"$amenity","count":{"$sum":1}}},
                        {'$sort':{"count":-1}},
                        {"$limit":10}])])

print "Top 1 contributing user" 
pprint.pprint([doc for doc in db.maps.aggregate([
    {"$match":{"type":"node"}},
    {"$group":{"_id":"$created.user","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
])])



mapSantander.osm.json is 94.341991 MB in size.
Number of Documents: 593497
Number of unique Users: 293
Number of Nodes: 528139
Number of Ways: 65304
Number of Cafes: 170
Number of Restaurants: 420
Top 10 amenities: 
[{u'_id': u'recycling', u'count': 737},
 {u'_id': u'restaurant', u'count': 420},
 {u'_id': u'bench', u'count': 263},
 {u'_id': u'bar', u'count': 247},
 {u'_id': u'drinking_water', u'count': 230},
 {u'_id': u'waste_disposal', u'count': 210},
 {u'_id': u'cafe', u'count': 170},
 {u'_id': u'waste_basket', u'count': 131},
 {u'_id': u'bank', u'count': 128},
 {u'_id': u'place_of_worship', u'count': 125}]
Top 1 contributing user
[{u'_id': u'Emilio Gomez', u'count': 268610}]
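
For readers who are new to aggregation pipelines, the stages of the "Top 10 amenities" query can be imitated in plain Python. This is only a sketch of the idea on a few made-up documents; it does not talk to MongoDB, and `top_amenities` is a hypothetical helper.

```python
from collections import Counter

# Made-up documents in the same shape as the imported collection
docs = [
    {"type": "node", "amenity": "cafe"},
    {"type": "node", "amenity": "restaurant"},
    {"type": "node", "amenity": "restaurant"},
    {"type": "way", "amenity": "restaurant"},  # dropped by the $match stage
    {"type": "node"},                          # no amenity, also dropped
]


def top_amenities(docs, limit=10):
    """Plain-Python equivalent of $match, $group, $sort and $limit."""
    # $match: keep nodes that have an amenity field
    matched = [d for d in docs if "amenity" in d and d.get("type") == "node"]
    # $group with $sum: count documents per amenity
    counts = Counter(d["amenity"] for d in matched)
    # $sort by count (descending) and $limit
    return [{"_id": name, "count": n} for name, n in counts.most_common(limit)]


result = top_amenities(docs)
print(result)
```

The "Top 1 contributing user" query follows the same pattern, grouping on created.user instead of amenity and limiting the result to one document.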
