# Project 4: Wrangle OpenStreetMap Data

## Summary

**Map area:**
+ Location: Mumbai, India - [https://mapzen.com/data/metro-extracts/your-extracts/693a2a74b296](https://mapzen.com/data/metro-extracts/your-extracts/693a2a74b296)

Objective: Audit and clean the OSM dataset, convert it from XML to JSON format, and analyze the data for insights.

## 1. Data Audit

In [1]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
import collections
import pymongo

In [2]:
import os
datadir = "data"
datafile = "mumbai.osm"
cal_data = os.path.join(datadir, datafile)

The count_tags function parses the Mumbai dataset with ElementTree and counts how many times each element type occurs, to give an overview of the data. The result is printed with pprint.

In [3]:
def count_tags(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag in tags:
            tags[elem.tag] += 1
        else:
            tags[elem.tag] = 1
    return tags
cal_tags = count_tags(cal_data)
pprint.pprint(cal_tags)

{'bounds': 1,
 'member': 1842,
 'nd': 307388,
 'node': 262994,
 'osm': 1,
 'relation': 585,
 'tag': 75186,
 'way': 40296}
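
For larger extracts, the same count can be done while releasing each element once it has been seen. The following is a minimal sketch, not the code used for the numbers above: the name count_tags_streaming and the tiny sample document are made up for illustration.

```python
import collections
import io
import xml.etree.ElementTree as ET


def count_tags_streaming(source):
    """Count element tags, clearing each element after it has been counted."""
    counts = collections.Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children, text and attributes
    return counts


# Tiny made-up OSM fragment, only for illustration.
SAMPLE_OSM = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='name' v='A'/></node>"
    b"<node id='2'/>"
    b"<way id='3'><nd ref='1'/><nd ref='2'/></way>"
    b"</osm>"
)

sample_counts = count_tags_streaming(SAMPLE_OSM)
print(dict(sample_counts))
```

The counts are the same as with a plain dictionary; the only difference is that finished elements no longer hold their contents in memory.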


The next two functions, key_type and process_map, check the "k" value of each tag to see whether it can be used as a key in MongoDB and whether it has any other potential problems. We also want to change the data model and expand keys of the "addr:street" type into a nested dictionary like this:
{"address": {"street": "Some value"}}
So we need to find out whether such tags exist and whether any tag keys contain problematic characters.

In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

In [5]:
def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
    return keys

The key_type function keeps a count of each of four tag categories in a dictionary:
  - "lower", for tags that contain only lowercase letters and underscores and are valid,
  - "lower_colon", for otherwise valid tags with a colon in their names,
  - "problemchars", for tags with problematic characters,
  - "other", for tags that do not fall into any of the three categories above.
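
To see how the three regular expressions sort individual keys, here is a small sketch that applies them in the same order as key_type. The helper name classify_key and the sample keys are chosen for this example only.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category that key_type would increment for key k."""
    if lower.search(k):
        return 'lower'
    elif lower_colon.search(k):
        return 'lower_colon'
    elif problemchars.search(k):
        return 'problemchars'
    return 'other'


# Illustrative keys, not taken from the audit output.
sample_keys = ['highway', 'addr:street', 'my key', 'FIXME', 'is_in:iso_3166_2']
for k in sample_keys:
    print('{0} -> {1}'.format(k, classify_key(k)))
```

Keys with uppercase letters, digits or more than one colon match none of the patterns and end up in "other".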

In [6]:
def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

cal_keys = process_map(cal_data)
pprint.pprint(cal_keys)

{'lower': 68019, 'lower_colon': 7069, 'other': 97, 'problemchars': 1}


Next, we find how many unique users have contributed to the map.

In [7]:
#people involved in the map editing.
def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        for e in element:
            if 'uid' in e.attrib:
                users.add(e.attrib['uid'])
    return users
users = process_map(cal_data)
len(users)

783

## 2. Problems encountered

<h3><a name="street"></a> **2.1 Street address abbreviation**</h3>

One of the problems is the inconsistent abbreviation of street names. Below we build a regex that matches the last token in the string, which is usually where the street type appears.

In [8]:
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Road"]
mapping = {'Rd'   : 'Road',
           'Rd.'   : 'Road',
           'road'   : 'Road',
           'road No.' :"Road",
           'Road No.' :"Road",
           }

+ The audit_street_type function searches the input string with the regex. If there is a match and it is not in the "expected" list, the match is added as a key and the street name is added to that key's set.


In [9]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
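
As a quick check of what the regex captures, the sketch below runs the same audit logic on a handful of names. 'Road No. 13' and 'Saras Baug' appear in the dataset; the other two names are made up for illustration.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Road"]


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


sample_types = defaultdict(set)
for name in ['Linking Road', 'Road No. 13', 'S.V. Rd', 'Saras Baug']:
    audit_street_type(sample_types, name)

# Print each captured last token with the names it came from.
for street_type in sorted(sample_types):
    print('{0}: {1}'.format(street_type, sorted(sample_types[street_type])))
```

Only the final token is captured, so a numbered road is grouped under its number rather than under "Road".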

+ The is_street_name function checks whether the tag's k attribute equals "addr:street".

In [10]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

+ The audit function returns the street types collected with the previous two functions. We then pretty-print the result. With the list of all the abbreviated street types, we can understand the data and fill in our "mapping" dictionary.

In [11]:
def audit(osmfile):
    street_types = defaultdict(set)
    # Open the file in a context manager so the handle is always closed.
    with open(osmfile, "r") as osm_file:
        for event, elem in ET.iterparse(osm_file, events=("start",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])

    return street_types
cal_street_types = audit(cal_data)
pprint.pprint(dict(cal_street_types))

{'1': set(['Road No 1']),
 '10': set(['Road No 10']),
 '13': set(['Road No 13', 'Road No. 13', 'road No. 13']),
 '19': set(['Road No 19']),
 '2': set(['L. J. Cross Road No. 2', 'Road No 2', 'Road Number 2']),
 '25': set(['Road No 25']),
 '26': set(['Road No 26']),
 '3': set(['Road No 3', 'Road Number 3']),
 '4': set(['RCF Colony Type 4', 'Road No 4', 'Road Number 4']),
 '5': set(['Road Number 5']),
 '8': set(['Road No. 8']),
 '9': set(['Road No. 9']),
 'Avenue)': set(['D Saraswati Marg (Central Avenue)']),
 'Baug': set(['Saras Baug']),
 'Bhavan': set(['Vidhan Bhavan']),
 'Bridge': set(['A Irani Bridge', 'Elphinstone Bridge']),
 'COLONY': set(['P.L. Lokhande Marg,MHADA COLONY', 'THAKKAR BAPPA COLONY']),
 'CST': set(['Bora Bazaar Street, Siddhi Vinayak CHS, Fort, CST']),
 'Chawl': set(['Sutar Chawl']),
 'Chembur': set(['Amar Mahal , Chembur', 'Tilak Nagar, Chembur']),
 'Chowk': set(['Doctor Shyama Prasad Mukherjee Chowk']),
 'Colony': set(['Mysore Colony', 'R.C.F. Colony', 'RCF Colony'])

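The update_name function takes a street name and replaces its last token with the better form from the mapping. Because street_type_re matches only the final run of non-space characters, mapping keys that contain a space, such as 'road No.', can never match.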

In [12]:
def update_name(name, mapping, regex):
    m = regex.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            name = re.sub(regex, mapping[street_type], name)

    return name

for street_type, ways in cal_street_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping, street_type_re)
        print name, "=>", better_name

D Saraswati Marg (Central Avenue) => D Saraswati Marg (Central Avenue)
Laxman Umaji Gadkari Marg => Laxman Umaji Gadkari Marg
Dr Madhukar B. Raut Marg => Dr Madhukar B. Raut Marg
Dr Babasaheb Ambedkar Marg => Dr Babasaheb Ambedkar Marg
V N Purav Marg => V N Purav Marg
Lockmanya Talik Marg => Lockmanya Talik Marg
Shahid Bhagat Singh Marg => Shahid Bhagat Singh Marg
Keshavrao Khadye Marg => Keshavrao Khadye Marg
R N Goenka Marg => R N Goenka Marg
Pradip Dattatra Samant Marg => Pradip Dattatra Samant Marg
Jagannath Shankarsheth Marg => Jagannath Shankarsheth Marg
D K Sandu Marg => D K Sandu Marg
Sulochana Shetty Marg => Sulochana Shetty Marg
G.D. Ambedkar Marg => G.D. Ambedkar Marg
Abaji Marg => Abaji Marg
Tukaram Javji Marg => Tukaram Javji Marg
Prakash Thorat Marg => Prakash Thorat Marg
K C Marg => K C Marg
Pandurang Budhkar Marg => Pandurang Budhkar Marg
Pradip Dattatraya Samant Marg => Pradip Dattatraya Samant Marg
Amit Keshav Nayak Marg => Amit Keshav Nayak Marg
N.M Joshi Marg => N.M
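
One way to also catch abbreviations that are not the last token is to normalize the name token by token. This is only a sketch under assumptions: TOKEN_MAPPING and normalize_street_name are hypothetical names, and treating "No." as the preferred spelling is a choice made for this example, not something decided in the audit.

```python
# Hypothetical token-level mapping; extend it after reviewing the audit output.
TOKEN_MAPPING = {
    'rd': 'Road',
    'rd.': 'Road',
    'road': 'Road',
    'no': 'No.',
    'no.': 'No.',
    'number': 'No.',
}


def normalize_street_name(name):
    """Replace known abbreviations token by token, wherever they occur."""
    tokens = name.split()
    fixed = [TOKEN_MAPPING.get(tok.lower(), tok) for tok in tokens]
    return ' '.join(fixed)


for name in ['road No. 13', 'Road Number 2', 'Road No 1', 'N.M Joshi Marg']:
    print('{0} => {1}'.format(name, normalize_street_name(name)))
```

Names without a known abbreviation pass through unchanged, so the function can be applied to every street value.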

<h3><a name="postal"></a> **2.2 Zip codes**</h3>

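We can reuse part of the code above to audit the postal codes. Indian PIN codes have six digits, and nearly all values in this extract start with 40. Note that the check twoDigits != 95 in audit_zipcode compares a string with an integer, so it is always true: every postcode is added to the result, grouped by its first two characters, whether or not it is valid.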

In [13]:
from collections import defaultdict

def audit_zipcode(invalid_zipcodes, zipcode):
    twoDigits = zipcode[0:2]
    
    if not twoDigits.isdigit():
        invalid_zipcodes[twoDigits].add(zipcode)
    
    elif twoDigits != 95:
        invalid_zipcodes[twoDigits].add(zipcode)
        
def is_zipcode(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit_zip(osmfile):
    invalid_zipcodes = defaultdict(set)
    # Open the file in a context manager so the handle is always closed.
    with open(osmfile, "r") as osm_file:
        for event, elem in ET.iterparse(osm_file, events=("start",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_zipcode(tag):
                        audit_zipcode(invalid_zipcodes, tag.attrib['v'])

    return invalid_zipcodes

cal_zipcode = audit_zip(cal_data)
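
An audit that flags only suspicious values could look like the sketch below. It assumes that a valid code for this extract is six digits starting with "40"; that pattern, and the names MUMBAI_PIN_RE and audit_pin, are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumption: valid codes for this extract are six digits starting with "40".
MUMBAI_PIN_RE = re.compile(r'^40\d{4}$')


def audit_pin(invalid_pins, pin):
    """Collect pin under its first two characters unless it looks valid."""
    if not MUMBAI_PIN_RE.match(pin):
        invalid_pins[pin[0:2]].add(pin)


invalid = defaultdict(set)
for pin in ['400050', '400 021', '40001', '110092', '123']:
    audit_pin(invalid, pin)

for prefix in sorted(invalid):
    print('{0}: {1}'.format(prefix, sorted(invalid[prefix])))
```

With this check, well-formed codes such as 400050 are skipped and only values with spaces, a wrong length or a different prefix are reported.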



In [14]:
pprint.pprint(dict(cal_zipcode))

{'11': set(['110092']),
 '12': set(['123']),
 '40': set(['400 021',
            '400 022',
            '400 051',
            '400 071',
            '400001',
            '400002',
            '400003',
            '400004',
            '400005',
            '400007',
            '400008',
            '400009',
            '40001',
            '400010',
            '400011',
            '400012',
            '400013',
            '400014',
            '400015',
            '400016',
            '400017',
            '400018',
            '400019',
            '400020',
            '400021',
            '400022',
            '400023',
            '400024',
            '400025',
            '400027',
            '400028',
            '400030',
            '400031',
            '400033',
            '400034',
            '400036',
            '400037',
            '400038',
            '400039',
            '400043',
            '400050',
            '400051',
            '400070',
      

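The cleaning function and its output are below. The function only handles values that start with the letters "CA", which none of the Mumbai postcodes do, so it returns None for every value and no postcode is actually changed.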

In [15]:

def update_name(zipcode):
    testNum = re.findall('[a-zA-Z]*', zipcode)
    if testNum:
        testNum = testNum[0]
    testNum.strip()
    if testNum == "CA":
        convertedZipcode = (re.findall(r'\d+', zipcode))
        if convertedZipcode:
            if convertedZipcode.__len__() == 2:
                return (re.findall(r'\d+', zipcode))[0] + "-" +(re.findall(r'\d+', zipcode))[1]
            else:
                return (re.findall(r'\d+', zipcode))[0]

for street_type, ways in cal_zipcode.iteritems():
    for name in ways:
        better_name = update_name(name)
        print name, "=>", better_name



110092 => None
123 => None
400012 => None
400088 => None
400089 => None
400 021 => None
400 022 => None
400024 => None
400025 => None
400027 => None
400020 => None
400021 => None
400022 => None
400023 => None
400070 => None
400028 => None
400008 => None
400009 => None
400043 => None
400002 => None
400003 => None
400001 => None
400007 => None
400004 => None
400005 => None
40001 => None
400017 => None
40051 => None
400094 => None
400 071 => None
400098 => None
400033 => None
400031 => None
400030 => None
400037 => None
400036 => None
400034 => None
400039 => None
400038 => None
400 051 => None
400019 => None
400018 => None
400051 => None
400050 => None
400011 => None
400010 => None
400013 => None
400074 => None
400015 => None
400014 => None
400071 => None
400016 => None
400950 => None
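
A cleaning step that fits the values seen above might remove embedded spaces and keep only six-digit results. This is a hedged sketch: clean_pin is a hypothetical helper, and returning None for values that cannot be repaired is a design choice made for this example.

```python
import re


def clean_pin(pin):
    """Return a six-digit PIN string, or None if the value cannot be repaired."""
    digits = re.sub(r'\s+', '', pin)  # "400 021" becomes "400021"
    if re.match(r'^\d{6}$', digits):
        return digits
    return None


for pin in ['400 021', '400050', '40001', '123']:
    print('{0} => {1}'.format(pin, clean_pin(pin)))
```

Values that come back as None, such as 40001 and 123, need manual review because the missing digits cannot be guessed.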


##### Preparing for MongoDB by converting XML to JSON

To transform the data from XML to JSON, we should follow these rules:
+ Process only 2 types of top level tags: "node" and "way"
+ All attributes of "node" and "way" should be turned into regular key/value pairs, with two exceptions: attributes listed in the CREATED array are added under a key "created", and the latitude and longitude attributes are added to a "pos" array for use in geospatial indexing. The values inside the "pos" array must be floats, not strings.
+ If second level tag "k" value contains problematic characters, it should be ignored
+ If second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
+ If second level tag "k" value does not start with "addr:", but contains ":", process it the same as any other tag.
+ If there is a second ":" that separates the type/direction of a street, ignore this tag
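
The tag rules above can be sketched in a few lines. This is a simplified illustration, not the shape_element function used in this project: shape_tags is a hypothetical helper that takes plain (k, v) pairs, and the sample values are made up.

```python
import re

problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_tags(tags):
    """Apply the tag rules to a list of (k, v) pairs and return a dict."""
    doc = {}
    address = {}
    for k, v in tags:
        if problemchars.search(k):
            continue                  # rule: skip keys with problematic characters
        if k.startswith('addr:'):
            parts = k.split(':')
            if len(parts) == 2:       # rule: ignore addr keys with a second ':'
                address[parts[1]] = v
        else:
            doc[k] = v                # rule: every other key is stored as is
    if address:
        doc['address'] = address
    return doc


sample_doc = shape_tags([
    ('addr:street', 'Linking Road'),
    ('addr:postcode', '400050'),
    ('addr:street:type', 'Road'),
    ('name:en', 'Mumbai'),
    ('bad key', 'x'),
    ('amenity', 'cafe'),
])
print(sample_doc)
```

The two "addr:" keys end up nested under "address", while "name:en" keeps its colon because it does not start with "addr:".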

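With the audit complete, we use the process_map function to convert the file from XML into JSON, writing one JSON document per line. Note that shape_element, as written below, does not call the update_name functions, so the values are exported as they appear in the source file.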

In [16]:
import re
import codecs
import json

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
address_regex = re.compile(r'^addr\:')
street_regex = re.compile(r'^street')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    node = {}
    if element.tag == "node" or element.tag == "way" :
        node['type'] = element.tag
        address = {}
        # parsing through attributes
        for a in element.attrib:
            if a in CREATED:
                if 'created' not in node:
                    node['created'] = {}
                node['created'][a] = element.get(a)
            elif a in ['lat', 'lon']:
                continue
            else:
                node[a] = element.get(a)
        if 'lat' in element.attrib and 'lon' in element.attrib:
            node['pos'] = [float(element.get('lat')), float(element.get('lon'))]

        # parse second-level tags for nodes
        for e in element:
            # parse second-level tags for ways and populate `node_refs`
            if e.tag == 'nd':
                if 'node_refs' not in node:
                    node['node_refs'] = []
                if 'ref' in e.attrib:
                    node['node_refs'].append(e.get('ref'))

            # throw out not-tag elements and elements without `k` or `v`
            if e.tag != 'tag' or 'k' not in e.attrib or 'v' not in e.attrib:
                continue
            key = e.get('k')
            val = e.get('v')

            # skip problematic characters
            if problemchars.search(key):
                continue

            # parse address k-v pairs
            elif address_regex.search(key):
                key = key.replace('addr:', '')
                address[key] = val
            # catch-all
            else:
                node[key] = val
        # compile address
        if len(address) > 0:
            node['address'] = {}
            street_full = None
            street_dict = {}
            street_format = ['prefix', 'name', 'type']
            # parse through address objects
            for key in address:
                val = address[key]
                if street_regex.search(key):
                    if key == 'street':
                        street_full = val
                    elif 'street:' in key:
                        street_dict[key.replace('street:', '')] = val
                else:
                    node['address'][key] = val
            # assign street_full or fallback to compile street dict
            if street_full:
                node['address']['street'] = street_full
            elif len(street_dict) > 0:
                # join only the parts that are present to avoid a KeyError
                node['address']['street'] = ' '.join([street_dict[key] for key in street_format if key in street_dict])
        return node
    else:
        return None


def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data
process_map(cal_data)

[{'capital': '4',
  'created': {'changeset': '47074993',
   'timestamp': '2017-03-22T18:40:18Z',
   'uid': '2815653',
   'user': 'Srihari Thalla',
   'version': '42'},
  'ele': '8',
  'id': '16173235',
  'is_capital': 'state',
  'is_in:iso_3166_2': 'IN-MH',
  'name': 'Mumbai',
  'name:bn': u'\u09ae\u09c1\u09ae\u09cd\u09ac\u0987',
  'name:cs': 'Bombaj',
  'name:de': 'Mumbai',
  'name:en': 'Mumbai',
  'name:eo': 'Mumbajo',
  'name:es': 'Bombay',
  'name:fr': 'Bombay',
  'name:gu': u'\u0aae\u0ac1\u0a82\u0aac\u0a88',
  'name:hi': u'\u092e\u0941\u0902\u092c\u0908',
  'name:ia': 'Mumbai',
  'name:io': 'Mumbai',
  'name:ja': u'\u30e0\u30f3\u30d0\u30a4',
  'name:jbo': '.mumbais.',
  'name:kn': u'\u0cae\u0cc1\u0c82\u0cac\u0cc8',
  'name:lt': 'Mumbajus',
  'name:ml': u'\u0d2e\u0d41\u0d02\u0d2c\u0d48',
  'name:mr': u'\u092e\u0941\u0902\u092c\u0908',
  'name:pl': 'Mumbaj',
  'name:ru': u'\u041c\u0443\u043c\u0431\u0430\u0438',
  'name:sk': 'Bombaj',
  'name:sr': u'\u041c\u0443\u043c\u0431\u0430\u04

<h2><a name="data_overview"></a> **3. Data Overview with MongoDB**</h2>

In [17]:
import signal
import subprocess
pro = subprocess.Popen('mongod')#, preexec_fn = os.setsid)

In [18]:
from pymongo import MongoClient
db_name = 'openstreetmap'

# Connect to Mongo DB
client = MongoClient('localhost:27017')
db = client[db_name]

In [19]:
# Build mongoimport command
collection = cal_data[:cal_data.find('.')]
json_file = cal_data + '.json'

mongoimport_cmd = 'mongoimport -h 127.0.0.1:27017 ' + '--db ' + db_name + ' --collection ' + collection + ' --file ' + json_file

# Before importing, drop the collection if it already exists
if collection in db.collection_names():
    print 'Dropping collection: ' + collection
    db[collection].drop()
    
# Execute the command
print 'Executing: ' + mongoimport_cmd
subprocess.call(mongoimport_cmd.split())

Executing: mongoimport -h 127.0.0.1:27017 --db openstreetmap --collection data\mumbai --file data\mumbai.osm.json


0
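
Splitting the command string on whitespace breaks as soon as a path contains a space. A possible alternative is to build the argument list directly, as in this sketch. The helper name build_mongoimport_cmd and the sample path are made up; the --drop option of mongoimport replaces an existing collection, which would make the manual drop step unnecessary.

```python
def build_mongoimport_cmd(db_name, collection, json_file, host='127.0.0.1:27017'):
    """Return the mongoimport command as an argument list."""
    return [
        'mongoimport',
        '-h', host,
        '--db', db_name,
        '--collection', collection,
        '--drop',            # drop the collection first if it already exists
        '--file', json_file,
    ]


# Hypothetical path with a space, to show that it stays a single argument.
cmd = build_mongoimport_cmd('openstreetmap', 'mumbai', 'data/my extract.osm.json')
print(cmd)
```

The list can be passed to subprocess.call(cmd) unchanged. Using a plain collection name such as 'mumbai' also avoids carrying the directory name into the collection name.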

In [20]:
mumbai = db[collection]

#### File sizes

In [21]:
import os
print 'The original OSM file is {} MB'.format(os.path.getsize(cal_data)/1.0e6) # convert from bytes to mb
print 'The JSON file is {} MB'.format(os.path.getsize(cal_data + ".json")/1.0e6) # convert from bytes to mb

The original OSM file is 56.364893 MB
The JSON file is 65.88346 MB


#### Number of documents

In [22]:
mumbai.find().count()

303290

#### Number of unique users

In [23]:
len(mumbai.distinct('created.user'))

778

#### Number of Nodes and Ways

In [24]:
print "Number of nodes:",mumbai.find({'type':'node'}).count()
print "Number of ways:",mumbai.find({'type':'way'}).count()

Number of nodes: 262990
Number of ways: 40191
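
These two counts add up to 303181, which is 109 fewer than the 303290 documents in the collection. The audit in section 1 counted 262994 nodes and 40296 ways, which do add up to 303290. One plausible explanation, not verified against the database, is that the catch-all branch of shape_element stores every tag under its own key, so a tag with k="type" overwrites the "type" field that was set from the element name. The sketch below reproduces that behaviour with a simplified stand-in (shape_minimal is a hypothetical function, and the sample way is made up).

```python
import xml.etree.ElementTree as ET


def shape_minimal(element):
    """Simplified stand-in for shape_element: element name first, then tags."""
    doc = {'type': element.tag}
    for tag in element.iter('tag'):
        doc[tag.get('k')] = tag.get('v')  # a k="type" tag replaces doc['type']
    return doc


# Made-up way carrying a tag whose key is "type".
way = ET.fromstring("<way id='1'><tag k='type' v='multipolygon'/></way>")
shaped = shape_minimal(way)
print(shaped['type'])
```

Running mumbai.distinct('type') would show whether values other than "node" and "way" are present in the collection.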


#### Name of top 5 contributors

In [25]:
result = mumbai.aggregate([
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5}])
pprint.pprint(list(result))

[{u'_id': u'PlaneMad', u'count': 31353},
 {u'_id': u'anthony1', u'count': 18584},
 {u'_id': u'Zulfiqarib', u'count': 16677},
 {u'_id': u'Nagarjunreddy', u'count': 15081},
 {u'_id': u'venkatkotha', u'count': 14686}]


<h2><a name="exploration"></a> **4. Further data exploration with MongoDB**</h2>

#### List of top 10 amenities in Mumbai

In [26]:
amenity = mumbai.aggregate([{'$match': {'amenity': {'$exists': 1}}}, \
                                {'$group': {'_id': '$amenity', \
                                            'count': {'$sum': 1}}}, \
                                {'$sort': {'count': -1}}, \
                                {'$limit': 10}])
pprint.pprint(list(amenity))

[{u'_id': u'place_of_worship', u'count': 207},
 {u'_id': u'restaurant', u'count': 199},
 {u'_id': u'bank', u'count': 125},
 {u'_id': u'school', u'count': 117},
 {u'_id': u'hospital', u'count': 75},
 {u'_id': u'cafe', u'count': 57},
 {u'_id': u'college', u'count': 51},
 {u'_id': u'parking', u'count': 49},
 {u'_id': u'police', u'count': 43},
 {u'_id': u'fuel', u'count': 43}]


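#### List of top 5 cuisines in Mumbai

The query keeps six rows because the first one, with Food set to None, counts the restaurants that have no cuisine tag.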

In [27]:
cuisine = mumbai.aggregate([{"$match":{"amenity":{"$exists":1},
                                 "amenity":"restaurant",}},      
                      {"$group":{"_id":{"Food":"$cuisine"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "Food":"$_id.Food",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":6}])
pprint.pprint(list(cuisine))

[{u'Count': 115, u'Food': None},
 {u'Count': 29, u'Food': u'indian'},
 {u'Count': 7, u'Food': u'pizza'},
 {u'Count': 6, u'Food': u'vegetarian'},
 {u'Count': 5, u'Food': u'regional'},
 {u'Count': 4, u'Food': u'italian'}]


#### List of top 10 postcodes in Mumbai

In [28]:
postcode = mumbai.aggregate( [ 
    { "$match" : { "address.postcode" : { "$exists" : 1} } }, 
    { "$group" : { "_id" : "$address.postcode", "count" : { "$sum" : 1} } },  
    { "$sort" : { "count" : -1}},
      {"$limit":10}] )
pprint.pprint(list(postcode))

[{u'_id': u'400050', u'count': 635},
 {u'_id': u'400043', u'count': 81},
 {u'_id': u'400005', u'count': 79},
 {u'_id': u'400089', u'count': 54},
 {u'_id': u'400074', u'count': 42},
 {u'_id': u'400071', u'count': 38},
 {u'_id': u'400001', u'count': 30},
 {u'_id': u'400002', u'count': 28},
 {u'_id': u'400051', u'count': 22},
 {u'_id': u'400004', u'count': 21}]


#### Number of users who contributed only once

In [29]:
users = mumbai.aggregate( [
    { "$group" : {"_id" : "$created.user", 
                "count" : { "$sum" : 1} } },
    { "$group" : {"_id" : "$count",
                "num_users": { "$sum" : 1} } },
    { "$sort" : {"_id" : 1} },
    { "$limit" : 1} ] )
print(list(users))

[{u'num_users': 195, u'_id': 1}]
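
The two $group stages can be hard to read at first. The following pure-Python sketch mirrors the same logic with collections.Counter on a made-up list of contributor names; it is an illustration of the idea, not a replacement for the query.

```python
from collections import Counter

# Made-up list of contributor names, one entry per document.
created_users = ['a', 'b', 'a', 'c', 'd', 'a', 'b']

# First $group: number of documents per user.
docs_per_user = Counter(created_users)

# Second $group: number of users for each document count.
users_per_count = Counter(docs_per_user.values())

# $sort ascending on the count and $limit 1: keep the smallest count.
smallest = min(users_per_count)
summary = {'_id': smallest, 'num_users': users_per_count[smallest]}
print(summary)
```

Note that the pipeline returns the group with the smallest document count. That group has _id equal to 1 only when at least one user contributed exactly once, as is the case here.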


#### List of top 5 building types

In [30]:
building = mumbai.aggregate([
       {'$match': {'building': { '$exists': 1}}}, 
        {'$group': {'_id': '$building',
                    'count': {'$sum': 1}}}, 
        {'$sort': {'count': -1}},
        {'$limit': 5}])
pprint.pprint(list(building))

[{u'_id': u'yes', u'count': 26328},
 {u'_id': u'apartments', u'count': 387},
 {u'_id': u'residential', u'count': 250},
 {u'_id': u'commercial', u'count': 76},
 {u'_id': u'house', u'count': 69}]


<h2><a name="conclusion"></a> **5. Conclusion**</h2>

**_Ideas to improve data quality of OSM:_**

While auditing the data, we found that the dataset is fairly clean apart from minor human input errors. To reduce such errors, contributors could enter data through a structured input form so that everyone uses the same format. We could also incentivize users to contribute and then build a recommendation engine on top of this data. Because OpenStreetMap is an open source project, many areas are still unexplored: people tend to focus on a few key areas and leave other parts outdated. Since each node has a coordinate (latitude and longitude), this could be addressed by cross-referencing and cross-validating missing data against other databases such as the Google API.

**_Potential cost of the implementation:_**

A few issues may arise from implementing this solution. One is the engineering effort involved: creating, auditing and maintaining such a system has a cost and may require a dedicated team.

**References:**

Udacity "Data Wrangling with MongoDB"

<a href="https://docs.mongodb.org/manual/reference/program/mongoimport/">MongoDB mongoimport reference</a>