# P3: Wrangle OpenStreetMap Data

## 1. Choose Your Map Area

I chose the Spanish town of Santander as the map area. All data was downloaded from https://www.openstreetmap.org as an OSM XML dataset, using the Overpass API to export a rectangular area around Santander. The file size is 62.4 MB. The bounding box of the downloaded area is N: 43.4978, S: 43.3893, E: -3.7096, W: -3.9791.

In this project we use data munging techniques to clean OpenStreetMap data for a part of the world that we care about, with the help of MongoDB. We start by thoroughly auditing and cleaning the dataset and converting it from OSM XML to JSON. Then we import the cleaned .json file into a MongoDB database and run some queries against it.


## 2. Process and Audit Data


I use the following code, provided by Udacity, to create a file called sample.osm, which we will use to test later functions. You can find this code in audit.py.

In [1]:
from audit import get_element

OSM_FILE = "mapSantander.osm"  # Replace this with your osm file
SAMPLE_FILE = "sample.osm"
k = 200 # Parameter: take every k-th top level element

import xml.etree.ElementTree as ET  # Use cElementTree or lxml if too slow

with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))
    output.write('</osm>')
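
The `get_element` helper imported above lives in audit.py and is not reproduced in this report. As a rough sketch of what such a helper usually looks like, assuming it streams the file with `iterparse` and yields only top-level elements, it could be written as follows. The tiny sample document and the temporary file are hypothetical and only exercise the generator.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each element whose tag is in `tags`, freeing memory as it goes."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the <osm> root
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop children that were already processed


# Hypothetical two-element OSM document, written to a temporary file.
sample = (
    '<osm>'
    '<node id="1" lat="43.46" lon="-3.80"><tag k="name" v="A"/></node>'
    '<way id="2"><nd ref="1"/></way>'
    '</osm>'
)
with tempfile.NamedTemporaryFile('w', suffix='.osm', delete=False) as tmp:
    tmp.write(sample)

found = [elem.tag for elem in get_element(tmp.name)]
os.remove(tmp.name)
print(found)
```

Clearing the root after each yielded element keeps memory use roughly constant, which matters for a 62.4 MB file.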



### 2.1. Tags

First, I import the needed libraries and count how many times each element type appears in the file.

In [2]:
import pprint
import re

def count_tags(filename):
    tags = {}
    for _, element in ET.iterparse(filename):

        if element.tag in tags:
            tags[element.tag] += 1
        else:    
            tags[element.tag] = 1
        
    return tags


tags = count_tags(OSM_FILE)
pprint.pprint(tags)


{'bounds': 1,
 'member': 27161,
 'meta': 1,
 'nd': 320009,
 'node': 262105,
 'note': 1,
 'osm': 1,
 'relation': 579,
 'tag': 244418,
 'way': 32434}


These counts give us an overall picture of the dataset. Next we look for problems in the tag key names and try to solve them.

In [3]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


tags = process_map(OSM_FILE)
pprint.pprint(tags)



{'lower': 145664, 'lower_colon': 97828, 'other': 926, 'problemchars': 0}


The only category that could cause problems is 'problemchars', but we do not have any such keys. If we had, we would ignore them.
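
To make the four categories concrete, here is a small sketch that applies the same three regular expressions to a few keys. The example keys and the `classify_key` helper are hypothetical and were not taken from the dataset.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category a tag key falls into, in the same order as key_type."""
    if lower.search(k):
        return 'lower'
    if lower_colon.search(k):
        return 'lower_colon'
    if problemchars.search(k):
        return 'problemchars'
    return 'other'


# Hypothetical keys, one per category.
examples = ['highway', 'addr:street', 'name 1', 'FIXME']
result = {k: classify_key(k) for k in examples}
print(result)
```

Keys end up in 'other' when they match none of the three patterns, for instance because they contain uppercase letters, digits, or more than one colon.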

### 2.2. Street Names

We found problems with some abbreviations in the dataset. In Spanish it is very common to write c/ instead of calle (street). We start by replacing these variants with the correct form. We also found other irregularities, such as accent marks and capital letters in some entries, which we have left as they are.

We process only two types of top-level elements: "node" and "way".

We take the first word of each street name, because in Spanish the street type (Calle, Avenida) usually comes at the beginning. However, other leading words are also very common.

We will create four functions to help us clean the data.

* audit_street_type finds street names whose first word is not in the expected list and adds them to a set.
* is_street_name tells whether an element holds a street name.
* auditstreet returns the street names whose type is not in the expected list, grouped by type.
* update_name changes the street name string.
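
The actual helpers are in audit.py. As a minimal sketch of how the first two might look, assuming the street type is the first whitespace-delimited word and that street names are stored under the `addr:street` key: the `expected` list below is hypothetical, and the street names in the usage lines are only examples.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# The street type is assumed to be the first word of the name.
street_type_re = re.compile(r'^\S+')

# Hypothetical list of accepted street types.
expected = ["Calle", "Avenida", "Paseo", "Plaza"]


def audit_street_type(street_types, street_name):
    """Record street_name under its first word if that word is unexpected."""
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


def is_street_name(elem):
    """True if a <tag> element holds a street name."""
    return elem.attrib.get('k') == "addr:street"


street_types = defaultdict(set)
for name in ["Calla del Convento", "Calle Alta"]:
    audit_street_type(street_types, name)

tag = ET.fromstring('<tag k="addr:street" v="Calle Alta"/>')
print(dict(street_types))
print(is_street_name(tag))
```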

In [4]:
from collections import defaultdict
from audit import auditstreet

In [5]:
street_types = auditstreet(OSM_FILE)

pprint.pprint(dict(street_types))

#for st_type, ways in street_types.iteritems():
#    for name in ways:
#        better_name = update_name(name, mapping)
#        print name, "=>", better_name


{'AREA,': set(['AREA, ARRABAL PUERTO DE RAOS']),
 'Avda.': set(['Avda. de la Reina Victoria']),
 u'Avenidad': set([u'Avenidad de P\xe9rez Gald\xf3s']),
 'Bajade': set(['Bajade del Caleruco']),
 u'Calla': set([u'Calla de San Mart\xedn', 'Calla del Convento']),
 'Calles': set(['Calles de los Abedules']),
 'Ramapa': set(['Ramapa de Sotileza']),
 'name=Avenida': set(['name=Avenida del Cardenal Herrera Oria']),
 'name=Calle': set(['name=Calle de Luis Salgado Lodeiro'])}
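
The prefixes flagged above could be corrected with a mapping along these lines. This is only a sketch: the `mapping` entries are assumptions derived from the audit output and the c/ abbreviation mentioned earlier, and the real `update_name` in audit.py may work differently. 'AREA,' is deliberately left unmapped.

```python
# Hypothetical mapping from flagged prefixes to corrected street types.
# The corrected spellings are assumptions and should be checked by hand.
mapping = {
    "c/": "Calle",
    "Avda.": "Avenida",
    "Avenidad": "Avenida",
    "Calla": "Calle",
    "Calles": "Calle",
    "Bajade": "Bajada",
    "Ramapa": "Rampa",
    "name=Avenida": "Avenida",
    "name=Calle": "Calle",
}


def update_name(name, mapping):
    """Replace the first word of a street name if it appears in `mapping`."""
    first, sep, rest = name.partition(" ")
    return mapping.get(first, first) + sep + rest


for raw in ["Avda. de la Reina Victoria",
            "name=Calle de Luis Salgado Lodeiro",
            "Calle Alta"]:  # the last name is a made-up, already clean example
    print(raw + " => " + update_name(raw, mapping))
```

Only the first word is touched, so accents and capitalization in the rest of the name are preserved.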


### 2.3. Postcode

The audit flags postcodes that are not five-digit numbers, for example a postcode that is longer than five digits. We replace such a postcode with s/n (unknown value).

In [6]:
from audit import auditpostcode

In [7]:
postcode_types = auditpostcode(OSM_FILE)

pprint.pprint(dict(postcode_types))

#for st_type, ways in postcode_types.iteritems():
#    for name in ways:
#        better_name = update_postcode(name)
#        print name, "=>", better_name

{'3012': set(['3012']),
 '390012': set(['390012']),
 '<diferente>': set(['<diferente>']),
 'Santander': set(['Santander'])}
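
A minimal sketch of the replacement rule, assuming that a valid postcode is exactly five digits and that every other value becomes 's/n'. The real `update_postcode` in audit.py may apply a narrower rule.

```python
import re

# Assumption: a valid postcode consists of exactly five digits.
postcode_re = re.compile(r'^[0-9]{5}$')


def update_postcode(postcode):
    """Return the postcode unchanged if it is valid, otherwise 's/n'."""
    if postcode_re.match(postcode):
        return postcode
    return "s/n"


# '39012' is a made-up well-formed value; the others come from the audit output.
for raw in ["39012", "3012", "390012", "<diferente>", "Santander"]:
    print(raw + " => " + update_postcode(raw))
```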


### 2.4. House Number

House numbers appear in several different formats. We keep the plain number format, so values like 'Numero...' will be replaced. Other formats, such as a range ('15-17'), a number plus a letter ('21D'), or two numbers ('2, 4'), are accepted.
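
As a sketch of this rule, one option is to strip a leading 'Numero' word and leave every other format untouched. How the real `update_housenumber` in audit.py replaces such values is not shown here, so the prefix pattern below is an assumption.

```python
import re

# Assumption: the unwanted format is a leading "Numero" word before the number.
numero_prefix_re = re.compile(r'^\s*numero\.?\s*', re.IGNORECASE)


def update_housenumber(value):
    """Drop a leading 'Numero' prefix; keep ranges, letters and lists as they are."""
    return numero_prefix_re.sub('', value).strip()


# 'Numero 12' is a made-up example; the other formats are the accepted ones.
for raw in ["Numero 12", "15-17", "21D", "2, 4"]:
    print(raw + " => " + update_housenumber(raw))
```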

In [8]:
from audit import audithousenumber

housenummer = audithousenumber(OSM_FILE)

pprint.pprint(dict(housenummer))

#for st_type, ways in housenummer.iteritems():
#    for name in ways:
#        better_name = update_housenumber(name)
#        print name, "=>", better_name


{'.': set(['.']),
 '1 Bajo': set(['1 Bajo']),
 '1-3': set(['1-3']),
 '1-C': set(['1-C']),
 '10 A': set(['10 A']),
 '10-1': set(['10-1']),
 '10-10': set(['10-10']),
 '10-11': set(['10-11']),
 '10-12': set(['10-12']),
 '10-13': set(['10-13']),
 '10-14': set(['10-14']),
 '10-15': set(['10-15']),
 '10-16': set(['10-16']),
 '10-17': set(['10-17']),
 '10-18': set(['10-18']),
 '10-19': set(['10-19']),
 '10-2': set(['10-2']),
 '10-20': set(['10-20']),
 '10-21': set(['10-21']),
 '10-22': set(['10-22']),
 '10-23': set(['10-23']),
 '10-24': set(['10-24']),
 '10-3': set(['10-3']),
 '10-4': set(['10-4']),
 '10-5': set(['10-5']),
 '10-6': set(['10-6']),
 '10-7': set(['10-7']),
 '10-8': set(['10-8']),
 '10-9': set(['10-9']),
 '10-A': set(['10-A']),
 '10-E': set(['10-E']),
 '100A': set(['100A']),
 '100B': set(['100B']),
 '101A': set(['101A']),
 '103A': set(['103A']),
 '105A': set(['105A']),
 '106A': set(['106A']),
 '108A': set(['108A']),
 '108B': set(['108B']),
 '108C': set(['108C']),
 '109A': set(['1

## 3. Preparing for MongoDB

In [9]:
from data import process_map

data = process_map(OSM_FILE, True)
pprint.pprint(data[0])


{'created': {'changeset': '22006064',
             'timestamp': '2014-04-28T16:51:42Z',
             'uid': '2904',
             'user': 'Emilio Gomez',
             'version': '2'},
 'id': '26347361',
 'pos': [43.4466522, -3.8327673],
 'type': 'node'}
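
data.py is not reproduced in this report, so the following is only a guess at how `process_map` might shape each XML element into a document like the one printed above. The `shape_element` name and the `CREATED` list are assumptions, and handling of child `tag` and `nd` elements is left out.

```python
import xml.etree.ElementTree as ET

# Attributes assumed to be grouped under the "created" sub-document.
CREATED = ["version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    """Turn a node or way element into a dict; return None for other tags."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    # Latitude and longitude are stored together as a [lat, lon] pair.
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    return doc


node = ET.fromstring(
    '<node id="26347361" lat="43.4466522" lon="-3.8327673" version="2" '
    'changeset="22006064" timestamp="2014-04-28T16:51:42Z" '
    'user="Emilio Gomez" uid="2904"/>'
)
doc = shape_element(node)
print(doc)
```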


In [10]:
import json
import pymongo

from pymongo import MongoClient

In [17]:
def insert_data(data, db):
    for row in data:
        #print row
        db.maps.insert(row)
        
client = MongoClient("mongodb://localhost:27017")
db = client.maps

with open('mapSantander.osm.json') as f:
    data = json.loads(f.read())
    insert_data(data, db)



## 4. Explore Database with MongoDB

In [18]:
# Size of the file
import os
def get_size(file_name):
    # Size in MB (10^6 bytes) of a file in the current working directory
    return os.path.getsize(os.path.join(os.getcwd(), file_name))/1000.0/1000.0

file_size = get_size('mapSantander.osm.json')
print "{} is {} MB in size.".format("mapSantander.osm.json", file_size)

# Number of documents
print "Number of Documents: " + str(db.maps.find().count())

# Number of unique users
print "Numebr of uniques Users: " + str(len(db.maps.distinct("created.user")))

# Number of nodes and ways
print "Number of Nodes: " + str(db.maps.find({'type': "node"}).count())
print "Number of Ways: " + str(db.maps.find({'type': "way"}).count())

# Number of chosen type of nodes, like cafes, shops etc.
print "Number of Cafes: " + str(db.maps.find({'amenity':u"cafe",'type':"node"}).count())

print "Number of Restaurantes: " + str(db.maps.find({'amenity':"restaurant", 'type':"node"}).count())

print "Top 10 amenities: "
pprint.pprint([doc for doc in db.maps.aggregate([{'$match':{"amenity":{"$exists":1},"type":"node"}},
                        {"$group":{"_id":"$amenity","count":{"$sum":1}}},
                        {'$sort':{"count":-1}},
                        {"$limit":10}])])

print "Top 1 contributing user" 
pprint.pprint([doc for doc in db.maps.aggregate([
    {"$match":{"type":"node"}},
    {"$group":{"_id":"$created.user","count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
])])



mapSantander.osm.json is 94.359568 MB in size.
Number of Documents: 1477114
Numebr of uniques Users: 293
Number of Nodes: 1314448
Number of Ways: 162531
Number of Cafes: 425
Number of Restaurantes: 1050
Top 10 amenities: 
[{u'_id': u'recycling', u'count': 1820},
 {u'_id': u'restaurant', u'count': 1050},
 {u'_id': u'bench', u'count': 653},
 {u'_id': u'bar', u'count': 613},
 {u'_id': u'drinking_water', u'count': 575},
 {u'_id': u'waste_disposal', u'count': 525},
 {u'_id': u'cafe', u'count': 425},
 {u'_id': u'waste_basket', u'count': 323},
 {u'_id': u'bank', u'count': 320},
 {u'_id': u'place_of_worship', u'count': 308}]
Top 1 contributing user
[{u'_id': u'Emilio Gomez', u'count': 668519}]
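
The notebook uses `find().count()` and `insert()`, which come from older PyMongo releases; both were removed in PyMongo 4.0. A sketch of equivalent helpers, assuming PyMongo 3.7 or later and a collection object such as `db.maps`, could look like this. The helper names are hypothetical.

```python
def insert_data(data, collection):
    """Insert a list of documents in one batch (insert_many, PyMongo 3.0+)."""
    if data:
        collection.insert_many(data)


def basic_counts(collection):
    """Return the headline counts using count_documents (PyMongo 3.7+)."""
    return {
        "documents": collection.count_documents({}),
        "nodes": collection.count_documents({"type": "node"}),
        "ways": collection.count_documents({"type": "way"}),
        "users": len(collection.distinct("created.user")),
    }
```

Both functions take the collection as an argument, so they would be called as `insert_data(data, db.maps)` and `basic_counts(db.maps)`.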


## 5. Additional Ideas

After reviewing the data for Santander, it is clear that it is not complete; however, I find it very useful.

First of all, I see a lack of information in many entries, which suggests that OpenStreetMap could be improved: when adding a business, a house, or any other object, entering a minimum amount of information should be mandatory.

Another problem is accented characters. Depending on the country's language, names may contain written accents, and these appear as escape codes (for example, \xe9 for é) in the output. A program that fixes this problem would be helpful.
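
If the escape codes come from how the strings are serialized rather than from the data itself, which is an assumption here, the accents are not lost and can be kept readable when writing JSON. A small sketch with the standard `json` module, using a hypothetical document:

```python
import json

# Hypothetical document with accented characters.
doc = {"name": u"Avenida de P\u00e9rez Gald\u00f3s"}

escaped = json.dumps(doc)                       # non-ASCII written as \uXXXX
readable = json.dumps(doc, ensure_ascii=False)  # accents kept as they are

# Both forms decode to the same data; only the text representation differs.
same = json.loads(escaped) == json.loads(readable)
print(escaped)
print(same)
```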