#### Python code used for project "OpenStreetMap Data Wrangling with MongoDB"

For this project, I downloaded a compressed XML file from MapZen (www.mapzen.com/data/metro-extracts/), which provides weekly extracts of preselected metro areas in OpenStreetMap. The extract covers approximately 2,500 square miles in and around the Milwaukee, Wisconsin metro area. OpenStreetMap (OSM) is a contributor-maintained map of the world. OSM users are encouraged to contribute local knowledge such as store opening hours, the number of floors in an apartment building, whether or not a business has a drive-through, etc.
After running validation checks on the data to ensure the XML fit the OSM data model (described here: http://wiki.openstreetmap.org/wiki/OSM_XML), I used Python's cElementTree to iteratively parse the data for export to JSON and loading into a local MongoDB instance.

Map area: Milwaukee, Wisconsin https://mapzen.com/data/metro-extracts/#milwaukee-wisconsin

In [1]:
#tree-based XML parsing: read all of the XML into memory, transform it into a tree, then work on the nodes of that tree
#alternative to in-memory parsing: iterative parsing, as with a SAX parser
#parse one tag at a time; each tag the parser sees is an "event"
#we will use cElementTree to iteratively parse our osm XML file
import xml.etree.cElementTree as ET
import pprint

#next save the osm XML file input and targeted json output paths here
#use codecs and json libraries later to write to a .json file
filename = 'milwaukee_wisconsin.osm'
file_out = 'milwaukee_wisconsin.json'
import codecs
import json

#use pymongo for connecting to a local mongodb database called "osm"
#this database is initially empty, but we will load our milwaukee osm data into a mongo collection called "p3" (as in "project 3")
from pymongo import MongoClient
client = MongoClient('localhost:27017')
db = client.osm

First, let's get a feel for the osm XML by taking a look at just the first 20 or so elements of the data file:

In [3]:
counter = 0
limit = 20

for event,elem in ET.iterparse(filename, events=("start",)):
    counter += 1
    if counter >= limit:
        break
    #parse iteratively, looking for specific things ('events') in the xml
    #event is always "end" if you don't pass the "events=" tuple parameter
    #remember, a one-element tuple in python needs a trailing comma: ("start",) not ("start")
    #elem is the xml element (its name is elem.tag). get attributes using elem.attrib
    
    main_loop_tagname = elem.tag
    if main_loop_tagname in ['tag']:
        counter -= 1
        continue #iterparse (main loop) will parse over the tags we already processed cause they were nested under nodes and ways
    
    if main_loop_tagname != 'node':
        print_main_attrib = ": {}".format(elem.attrib)
    else:
        print_main_attrib = ": uid {}, user {}".format(elem.attrib['uid'],elem.attrib['user'])
    print "main loop {}, {}{}".format(counter,main_loop_tagname,print_main_attrib)
    if main_loop_tagname == 'node':
        for tag in elem.iter():
            if tag.tag == elem.tag:
                #interesting that elem.iter() actually starts at the element node itself...
                continue
            #loop thru this parent node, to find only tags w name "tag" use elem.iter("tag")
            print "     {}: {}".format(tag.tag,tag.attrib)
        elem.clear()
        continue
    elem.clear()
    #ways are ordered lists of nodes (streets, building outlines, etc.). both nodes and ways can contain nested "tag" elements holding key/value attribs for that way or node.

main loop 1, osm: {'timestamp': '2016-02-13T00:26:02Z', 'version': '0.6', 'generator': 'osmconvert 0.7T'}
main loop 2, bounds: {'minlat': '42.656', 'maxlon': '-87.522', 'minlon': '-88.511', 'maxlat': '43.389'}
main loop 3, node: uid 108601, user whammypower788
main loop 4, node: uid 1751737, user Skybunny
     tag: {'k': 'ref', 'v': '73B'}
     tag: {'k': 'highway', 'v': 'motorway_junction'}
main loop 5, node: uid 108601, user whammypower788
main loop 6, node: uid 69864, user Ivan Komarov
main loop 7, node: uid 69864, user Ivan Komarov
main loop 8, node: uid 69864, user Ivan Komarov
main loop 9, node: uid 69864, user Ivan Komarov
main loop 10, node: uid 207745, user NE2
     tag: {'k': 'ref', 'v': '83'}
     tag: {'k': 'highway', 'v': 'motorway_junction'}
main loop 11, node: uid 108601, user whammypower788
main loop 12, node: uid 108601, user whammypower788
main loop 13, node: uid 108601, user whammypower788
main loop 14, node: uid 4732, user iandees
main loop 15, node: uid 108601, use

Next, looping over the entire 166MB osm file, print out the total count of each different XML tag:

In [4]:
results = {}

for event,elem in ET.iterparse(filename, events=("start",)):
    main_loop_tagname = elem.tag
    if main_loop_tagname in results:
        results[main_loop_tagname] += 1
    else:
        results[main_loop_tagname] = 1
pprint.pprint(results) #all tags, ignoring parent/child relationships

{'bounds': 1,
 'member': 7338,
 'nd': 911537,
 'node': 738460,
 'osm': 1,
 'relation': 657,
 'tag': 468592,
 'tax-yearg': 1,
 'way': 83158}


We now know there are 738,460 nodes and 83,158 ways we will want to load as documents into MongoDB. Somewhere in the neighborhood of 468,592 tags will also be loaded as fields on our documents (tags nested under 'relations' will not be loaded). The data model for the osm XML is described here: https://wiki.openstreetmap.org/wiki/OSM_XML
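
The tally above can also be written with `collections.Counter`. The following is a minimal sketch rather than the notebook's own code: it uses Python 3 syntax, and `SAMPLE_OSM` and `count_tags` are hypothetical names introduced here so the example runs without the Milwaukee extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for an OSM extract (made-up ids and coordinates, for illustration only).
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.0" lon="-87.9" user="a" uid="1">
    <tag k="highway" v="motorway_junction"/>
  </node>
  <node id="2" lat="43.1" lon="-87.8" user="b" uid="2"/>
  <way id="10" user="a" uid="1">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

def count_tags(source):
    """Count every element name in the file, ignoring parent/child nesting."""
    counts = Counter()
    # The element name is already known at "start", so one event type is enough here.
    for _event, elem in ET.iterparse(source, events=("start",)):
        counts[elem.tag] += 1
    return dict(counts)

print(count_tags(io.BytesIO(SAMPLE_OSM)))
```

`Counter` removes the need for the explicit "is the key already there" branch; passing a filename instead of a `BytesIO` object should work the same way.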

Now that we know the totals, let's try getting a feel for the XML's tag nesting structure so that we can confirm the data conforms to the data model documentation linked above.

In [6]:
results = {}

for event,elem in ET.iterparse(filename, events=("start",)):
    main_loop_tagname = elem.tag
    #elem.iter() yields the element itself first, then every descendant parsed so far
    #to find only children named "tag", use elem.iter("tag")
    for tag in elem.iter():
        key_string = main_loop_tagname + "." + tag.tag
        results[key_string] = results.get(key_string, 0) + 1
pprint.pprint(results)

{'bounds.bounds': 1,
 'member.member': 7338,
 'nd.nd': 911537,
 'node.node': 738460,
 'node.tag': 48534,
 'osm.bounds': 1,
 'osm.node': 103,
 'osm.osm': 1,
 'osm.tag': 11,
 'relation.member': 6657,
 'relation.relation': 657,
 'relation.tag': 4189,
 'tag.tag': 468592,
 'tax-yearg.tax-yearg': 1,
 'way.nd': 875872,
 'way.tag': 403732,
 'way.tax-yearg': 1,
 'way.way': 83158}


**It seems that something is wrong with our XML parser.** First off, we already know that there are over 700,000 nodes in the file, but our first attempt at parsing the XML nesting structure found only 103 nodes nested under the root ('osm')! 

Next, we also know there are 468,593 "tag" tags. The OSM documentation states that "tags" are always nested under a parent node, way, or relation, yet our nesting-search parser is more than 12,000 tags short of the 468,593 tags that the document-total parser found:

In [5]:
#um... why when just looking at tags we see that there are 738,460 nodes and 468,593 "tag" tags...
#but when i run my "for tag in elem.iter()" loop, it says the count of osm.node = 103, should osm.node be 738,460??
#and the count of all the tags are ~12k short of the total tags??
print 468593 
print 468593 - 48534 - 4190 - 403778

468593
12091


The answer, as described in the Project Summary document, is that when a file of this size is parsed iteratively we cannot iterate over nested (child) tags until we reach the "end" event of a given element. At a "start" event, `iterparse` only guarantees that the opening tag has been read, so the element's children may or may not have been parsed yet. That means we cannot know how many "nodes" are nested under the root "osm" element until we reach the `"</osm>"`. Similarly, we cannot know how many "tag" elements are nested under a node until we reach the `"</node>"` of that node.
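
A small sketch of the idea, under stated assumptions: it uses Python 3 syntax, a made-up in-memory sample (`SAMPLE_OSM`), and a hypothetical helper `count_children_at_end`. Children are only examined once the parent's "end" event arrives, which is the point at which the parser guarantees they are complete.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature extract, used only so the example is self-contained.
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.0" lon="-87.9">
    <tag k="ref" v="73B"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

def count_children_at_end(source, parents=("node", "way", "relation")):
    """Tally parent.child pairs, looking at children only after the parent has closed."""
    counts = Counter()
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # "start" events are skipped: the children may not have been read yet.
        if event == "end" and elem.tag in parents:
            for child in elem:  # direct children, complete at "end"
                counts[elem.tag + "." + child.tag] += 1
    return dict(counts)

print(count_children_at_end(io.BytesIO(SAMPLE_OSM)))
```

On a sample this small the whole document is usually read in one chunk, so the undercount seen on the full extract will not necessarily reproduce at "start" events; the sketch only shows the pattern that is safe regardless of file size.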

I wish I could simply change the "events" tuple parameter passed into our parser, but I wanted the parser to keep examining the "start" events, because that is where this code reads the attribute information stored in start tags such as `"<node>"` and `"<way>"`. (Strictly speaking, `elem.attrib` is still populated when the "end" event fires, so an end-only parser would also be able to read it.)

To accommodate all of this, I configured the parser to emit **both** start and end events in the XML, pulling attribute data at start events and only iterating over child elements at end events. With this modification to the code, the parser finally returned an accurate summary of the XML and its nesting structure:

In [14]:
results = {}

for event,elem in ET.iterparse(filename, events=("start","end")):
    main_loop_tagname = elem.tag
    #elem.iter() yields the element itself first, then every descendant parsed so far
    for tag in elem.iter():
        #one {"start":n,"end":n} counter per parent/descendant pair, created on first sight
        counts = results.setdefault(main_loop_tagname, {}).setdefault(tag.tag, {"start":0,"end":0})
        counts[event] += 1
pprint.pprint(results)

{'bounds': {'bounds': {'end': 1, 'start': 1}},
 'member': {'member': {'end': 7338, 'start': 7338}},
 'nd': {'nd': {'end': 911537, 'start': 911537}},
 'node': {'node': {'end': 738460, 'start': 738460},
          'tag': {'end': 48834, 'start': 48534}},
 'osm': {'bounds': {'end': 1, 'start': 1},
         'member': {'end': 7338, 'start': 0},
         'nd': {'end': 911537, 'start': 0},
         'node': {'end': 738460, 'start': 103},
         'osm': {'end': 1, 'start': 1},
         'relation': {'end': 657, 'start': 0},
         'tag': {'end': 468592, 'start': 11},
         'tax-yearg': {'end': 1, 'start': 0},
         'way': {'end': 83158, 'start': 0}},
 'relation': {'member': {'end': 7338, 'start': 6657},
              'relation': {'end': 657, 'start': 657},
              'tag': {'end': 4425, 'start': 4189}},
 'tag': {'tag': {'end': 468592, 'start': 468592}},
 'tax-yearg': {'tax-yearg': {'end': 1, 'start': 1}},
 'way': {'nd': {'end': 911537, 'start': 875872},
         'tag': {'end': 415333,

In [15]:
#using the above, we can see the datamodel in our file conforms to the documentation provided in the wiki
    #at https://wiki.openstreetmap.org/wiki/OSM_XML
'''
osm is root node
    bounds (only appears 1 time)
    node (738k)
        tag (49k)
    way (83k)
        nd (911k)
        tag (415k)
    relation (657)
        member (7k)
        tag (4k)
'''

#and we can verify that we are fully capturing all "tag" children that can be found in the root "osm" node
print 468593 #tags found under osm root in document-wide parser
print 468593 - 48834 - 4425 - 415334 #subtract "tag" nested in node, relation, and way found by our nested-structure parser 

468593
0


The following are my raw notes, written shortly after finally solving the mystery of why iterparse() returned different counts when run document-wide versus when iterating through nested elements:

In [16]:
#ah ha! so we see our first issue with the data: (not counting elem.iter() starting at the element.self tag...)
#the osm file is so big (103+ MB) that we cannot fully parse it using xml.etree.cElementTree.iterparse with only "start" events
#this is a known issue with cElementTree's iterparse method (see: green text box at http://effbot.org/zone/element-iterparse.htm#usage)
#full reference here: https://mail.python.org/pipermail/xml-sig/2005-January/010838.html
#python docs on iterparse "start" method issue here: https://docs.python.org/2/library/xml.etree.elementtree.html
#essentially, this means that you cant iterate over an elements children using a 'start' event. must use 'end' event to do so.
#this is unfortunate for us because we MUST use start events to capture the attributes stored in the xml start tags
#so we must write more complex code that iterparse's over both start AND end events:
    #start events to retrieve the attributes stored in XML start tags of nodes/ways/etc
    #and end events so we can element iterate (elem.iter()) over the child tags ("tag" and "nd") of our nodes/ways/etc
#how big of an issue is this? for the milwaukee OSM xml file, the count of node tags using 'start' events was equal to 'end' events
    #but the count of node's child "tag" jumped 300 and way's child nd jumped from 875,955 to 911,537 by waiting until an 'end' event to elem.iter()
#data at the end of the file (like relation data) is improved the most when switching to only use end events
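
As an alternative to handling both event types, the sketch below listens to "end" events only. It relies on the fact that an element's attributes are still available when its "end" event fires. This is not the approach the notebook takes; it uses Python 3 syntax, and `iter_parents` and `SAMPLE_OSM` are hypothetical names with made-up ids and coordinates.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature extract, used only so the example is self-contained.
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="43.0" lon="-87.9" user="Skybunny" uid="1751737">
    <tag k="ref" v="73B"/>
    <tag k="highway" v="motorway_junction"/>
  </node>
  <way id="10" user="NE2" uid="207745">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

def iter_parents(source, targets=("node", "way")):
    """Yield (element name, attributes, child tags) for each target element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in targets:
            continue
        attrib = dict(elem.attrib)  # copy, because clear() below empties elem.attrib
        tags = {t.get("k"): t.get("v") for t in elem.iter("tag")}
        yield elem.tag, attrib, tags
        elem.clear()  # release the children once the element has been handled

for name, attrib, tags in iter_parents(io.BytesIO(SAMPLE_OSM)):
    print(name, attrib["id"], tags)
```

Child elements such as `tag` and `nd` are deliberately not cleared at their own "end" events, since the parent still needs them when it closes.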

### Export data to JSON and Load into MongoDB

In [17]:
#this function will set up the dictionary we can use to evaluate the volume of various issues that might be present in the data
def initialize_errors():
    errors = {}
    errors['num_key_problemchars'] = 0
    errors['key_problemchars'] = {}
    
    errors['num_attrib_parse'] = 0
    errors['attrib_parse'] = {}
    
    errors['num_badkey'] = 0
    errors['badkey'] = {}
    
    errors['num_badvalue'] = 0
    errors['badvalue'] = {}
    
    errors['num_addr_subkeys'] = 0
    errors['addr_subkeys'] = {}
    errors['num_gnis_subkeys'] = 0
    errors['gnis_subkeys'] = {}
    errors['num_tiger_subkeys'] = 0
    errors['tiger_subkeys'] = {}
    errors['num_seamark_subkeys'] = 0
    errors['seamark_subkeys'] = {}
    
    errors['num_colonkeys'] = 0
    errors['colonkeys'] = {}

    errors['colon_key_tails'] = {}
    errors['colon_value_tails'] = {}
    
    errors['num_colonvalues'] = 0
    errors['colonvalues'] = {}
    return errors
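
The same log can be built from a list of category names, which keeps each counter and its detail dictionary in step. This is a sketch in Python 3 syntax; `CATEGORIES` and `initialize_errors_compact` are hypothetical names, and the resulting keys are meant to match the ones set up by `initialize_errors` above.

```python
# Each category gets a "num_<name>" counter and a "<name>" detail dictionary.
CATEGORIES = (
    "key_problemchars", "attrib_parse", "badkey", "badvalue",
    "addr_subkeys", "gnis_subkeys", "tiger_subkeys", "seamark_subkeys",
    "colonkeys", "colonvalues",
)

def initialize_errors_compact():
    """Build the same error log as initialize_errors, driven by CATEGORIES."""
    errors = {}
    for name in CATEGORIES:
        errors["num_" + name] = 0
        errors[name] = {}
    # These two dictionaries have no matching "num_" counter in the original.
    errors["colon_key_tails"] = {}
    errors["colon_value_tails"] = {}
    return errors

print(sorted(initialize_errors_compact()))
```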

In [19]:
#this function will be used to parse the attributes stored in <node> and <way> tags 
    #(a.k.a. "parent" tags because the "tag" and "nd" tags are 'children' nested under these parent <node> and <way>)

CREATED = ['version','changeset','timestamp','user','uid']

def parse_parent_tag(tag_type,attrib,errors):
    document = {}
    document['created'] = {}
    document['k_v_tag_count'] = 0 #initialize the count of children "tags" right here
    lon = None
    lat = None
    
    document['document_tag_type'] = tag_type
    for k,val in attrib.iteritems():
        if k in ['id','visible']:
            document[k] = val
            continue
        elif k in CREATED:
            document['created'][k] = val
            continue
        elif k == 'lon':
            lon = float(val)
        elif k == 'lat':
            lat = float(val)
        else:
            errors['num_attrib_parse'] += 1
            errors['attrib_parse'][attrib['id']] = {tag_type:attrib}
            #print "UNKNOWN ATTRIBUTE PARSE {} [{}] {},{}".format(tag_type,attrib['id'],k,val)
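    if lat is not None and lon is not None:
        #store latitude and longitude in a list for 2d geospatial indexing in mongo
        #(compare against None so that a coordinate of exactly 0.0 is not treated as missing)
        #note: MongoDB's documentation recommends [longitude, latitude] order for coordinate pairs; this notebook stores [lat,lon]
        document['pos'] = [lat,lon]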
    return document
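
One detail of the position check is worth illustrating: a truthiness test such as `if lat and lon` treats a coordinate of exactly 0.0 as missing. The sketch below uses a hypothetical helper, `build_pos`, written in Python 3 syntax; it keeps the notebook's `[lat, lon]` order and compares against `None` instead.

```python
def build_pos(lat, lon):
    """Return [lat, lon] when both coordinates are present, otherwise None."""
    # 0.0 is falsy in Python, so compare against None rather than testing truthiness.
    if lat is None or lon is None:
        return None
    return [lat, lon]

print(build_pos(43.0, -87.9))
print(build_pos(0.0, -87.9))
print(build_pos(None, -87.9))
```

None of the Milwaukee coordinates are zero, so this does not change the results reported in this notebook; it matters only if the code is reused on an extract that touches the equator or the prime meridian.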

In [20]:
#set up regular expressions to vet the strings that compose the keys and values for nested tags 
import re
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
valid = re.compile(r'^[a-zA-Z0-9_-]+$')
#these last two regex are used in the course code but I have chosen to use the regex "valid" instead
#lower = re.compile(r'^([a-z]|_)*$')
#lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
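
To see how the two patterns sort tag keys, the sketch below applies them in the same order the export code uses. It is written in Python 3 syntax, `classify_key` is a hypothetical helper, and the sample keys are made up for illustration.

```python
import re

# Same patterns as in the cell above.
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
valid = re.compile(r'^[a-zA-Z0-9_-]+$')

def classify_key(key):
    """Label a tag key the way the export code branches on it."""
    if problemchars.search(key):
        return "problemchars"            # e.g. contains a space or a dot
    if not valid.search(key.replace(":", "")):
        return "badkey"                  # characters outside a-z, A-Z, 0-9, _ and -
    if key.find(":") > 0:
        return "colon"                   # namespaced key such as addr:street
    return "plain"

for sample in ["highway", "addr:street", "opening hours", "ref(old)"]:
    print(sample, "->", classify_key(sample))
```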


In [21]:
#this function was originally used to remove certain characters from nested tag "v" values
#it was later decided not to clean the nested tag values in this way but I leave the code here for future reference
def format_value(value,errors):
    if value.find(':')>0:
        errors['num_colonvalues'] += 1
        value_head = value[:value.find(':')]
        if not (value_head in errors['colonvalues']):
            errors['colonvalues'][value_head] = 1
        else:
            errors['colonvalues'][value_head] += 1
        value = value.replace(':','-')
    value = value.replace(' ','_')
    value = value.replace(';','_')
    return value

In [23]:
#this function will perform the transform and formatting of the nested tag keys and values
    #much code has been added to log errors into an "errors" dictionary
#nested tag keys starting with "gnis:", "addr:", "tiger:", and "seamark:" are added into document level dictionaries
    #example: a value of "Milwaukee" with a key of "gnis:county" 
        #would be saved into the document under the "gnis" dictionary...
        #'gnis':{'county':'Milwaukee'}
    #these operations are also logged to the same 'errors' dictionary (though they aren't literally "errors")
#if a key other than gnis, addr, tiger, or seamark has a ":" in it, save to the document level dictionary "bad_keys"
    #these tag keys and values should be examined further in the future to best determine how to transform and incorporate into the wider document

def add_tag_vals(document, key, value, counter ,errors):
    #value = format_value(value,errors) #edit: do not need to remove special characters from the tag values
    if problemchars.search(key):
        errors['num_key_problemchars'] += 1
        errors['key_problemchars'][document['id']] = {'key':key,'value':value}
        
    elif not valid.search(key.replace(':','')):
        errors['num_badkey'] += 1
        errors['badkey'][document['id']] = {'key':key,'value':value}
    
    elif key.find(':')>0:
        key_head = key[:key.find(':')]
        key_tail = key[key.find(':')+1:].replace(':','_')
        
        if key_head == 'gnis':
            if not ('gnis' in document):
                document['gnis'] = {}
            #k="gnis:County",v="Milwaukee" will be saved into document['gnis']['County']= Milwaukee...
            document['gnis'][key_tail] = value
            document['k_v_tag_count'] += 1
            errors['num_gnis_subkeys'] += 1
            if not (key_tail in errors['gnis_subkeys']):
                errors['gnis_subkeys'][key_tail] = 1
            else:
                errors['gnis_subkeys'][key_tail] += 1
        elif key_head == 'addr':
            if not ('addr' in document):
                document['addr'] = {}
            document['addr'][key_tail] = value
            document['k_v_tag_count'] += 1
            errors['num_addr_subkeys'] += 1
            if not (key_tail in errors['addr_subkeys']):
                errors['addr_subkeys'][key_tail] = 1
            else:
                errors['addr_subkeys'][key_tail] += 1
        elif key_head == 'tiger':
            if not ('tiger' in document):
                document['tiger'] = {}
            document['tiger'][key_tail] = value
            document['k_v_tag_count'] += 1
            errors['num_tiger_subkeys'] += 1
            if not (key_tail in errors['tiger_subkeys']):
                errors['tiger_subkeys'][key_tail] = 1
            else:
                errors['tiger_subkeys'][key_tail] += 1
        elif key_head == 'seamark':
            if not ('seamark' in document):
                document['seamark'] = {}
            document['seamark'][key_tail] = value
            document['k_v_tag_count'] += 1
            errors['num_seamark_subkeys'] += 1
            if not (key_tail in errors['seamark_subkeys']):
                errors['seamark_subkeys'][key_tail] = 1
            else:
                errors['seamark_subkeys'][key_tail] += 1
        else:
            errors['num_colonkeys'] += 1

            if not (key_head in errors['colonkeys']):
                errors['colonkeys'][key_head] = 1
            else:
                errors['colonkeys'][key_head] += 1

            if not (key_tail in errors['colon_key_tails']):
                errors['colon_key_tails'][key_tail] = {}
                errors['colon_key_tails'][key_tail][key_head] = 1
            elif not (key_head in errors['colon_key_tails'][key_tail]):
                errors['colon_key_tails'][key_tail][key_head] = 1
            else:
                errors['colon_key_tails'][key_tail][key_head] += 1
            
            if not ('bad_keys' in document):
                document['bad_keys'] = {}
            document['bad_keys'][key.replace(':','--')] = value
    else:
        #print "add to top level of results: {},{}".format(k,v)
        document[key] = value
        document['k_v_tag_count'] += 1
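
The four prefix branches above follow the same pattern, so the nesting step could be driven by a tuple of known prefixes. This is a simplified sketch in Python 3 syntax, not a drop-in replacement: `nest_tag` and `NESTED_PREFIXES` are hypothetical names, and the key validation, error logging and `k_v_tag_count` bookkeeping are left out.

```python
NESTED_PREFIXES = ("gnis", "addr", "tiger", "seamark")

def nest_tag(document, key, value, prefixes=NESTED_PREFIXES):
    """Store one tag on the document, nesting keys with a known prefix."""
    head, sep, tail = key.partition(":")
    if not sep:
        # No colon: the tag becomes a top-level field.
        document[key] = value
    elif head in prefixes:
        # "gnis:County" is stored as document["gnis"]["County"].
        document.setdefault(head, {})[tail.replace(":", "_")] = value
    else:
        # Unknown prefix: park it under "bad_keys" with ":" replaced by "--".
        document.setdefault("bad_keys", {})[key.replace(":", "--")] = value
    return document

doc = {}
nest_tag(doc, "gnis:County", "Milwaukee")
nest_tag(doc, "highway", "residential")
nest_tag(doc, "lanes:forward", "2")
print(doc)
```

Adding another prefix then only requires extending `NESTED_PREFIXES`.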

In [25]:
#finally, this is the code that will iterate over the osm XML
    #this function will loop over the tags specified in "targets" (typically: nodes and ways)
    #the function can load into the local mongo database and collection "osm.p3"
    #the function can also save the transformed data into a .json file
def export_osm(file_out,targets,errors,limit=0,to_mongo=None,to_file=None):
    if to_mongo:
        db.p3.drop()
    with codecs.open(file_out, "w") as fo: 
        data = []
        counter = 0
        #note: this replaces whatever was passed in as "errors" with a fresh, empty log
        errors = initialize_errors()

        for event,elem in ET.iterparse(filename, events=("start","end")):
            main_loop_tagname = elem.tag
            if main_loop_tagname in ['osm','bounds']:
                #these are the header info from our xml data file
                if event == 'start':
                    print '{}: {}'.format(main_loop_tagname,elem.attrib)
                continue
            elif main_loop_tagname in ['tag','nd','member']:
                #these are all "children" tags that will be parsed when dealing with the end event of their parent, 
                    #no need to parse here in the main loop
                continue
            elif not (main_loop_tagname in targets): #typically, we will deal with nodes and ways only
                continue

            elif event == "start":
                #when the parser reaches a 'start' tag, the only thing we can retrieve is the attrib dict stored in the start tag.
                #save the iteration over the children tags until we reach the 'end' event of the tag
                document = {} #start new document (row in our db)
                document = parse_parent_tag(main_loop_tagname,elem.attrib,errors)
                continue

            elif event == 'end' and main_loop_tagname in targets:
                #now that we are at the "end" event of the top level tag, we can iterate thru its child tags
                for tag in elem.iter("tag"): 
                    #^specify the target child tag, or else elem.iter() will start at the parent elem itself!
                    add_tag_vals(document, tag.attrib['k'], tag.attrib['v'] ,counter ,errors)
                    
                if main_loop_tagname == 'way':
                    #for "way" tags, we also need to process the list of nodes that are linked to this "way"
                    nd_refs = []
                    for nd in elem.iter("nd"):
                        nd_refs.append(nd.attrib['ref'])
                    if len(nd_refs) > 0:
                        document["node_refs"] = nd_refs
                counter += 1
                if to_file:
                    fo.write(json.dumps(document)+"\n")
                if to_mongo:
                    db.p3.insert_one(document)
                elem.clear()
                if limit > 0 and counter >= limit:
                    #for debugging, we can use the limit parameter to set a hard stop on the number of documents parsed
                    break
    return errors
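
Inserting one document at a time means one round trip to the database per document. A possible variation is to collect documents into batches first. The sketch below is in Python 3 syntax; `write_in_batches` and its `flush` callback are hypothetical, and the batch size of 1000 is an arbitrary assumption.

```python
def write_in_batches(documents, flush, batch_size=1000):
    """Pass documents to flush() in lists of at most batch_size; return the total written."""
    batch = []
    written = 0
    for doc in documents:
        batch.append(doc)
        if len(batch) >= batch_size:
            flush(batch)
            written += len(batch)
            batch = []  # start a new list so earlier batches are left untouched
    if batch:
        flush(batch)  # final partial batch
        written += len(batch)
    return written

# Stand-in for a database call: just remember each batch that was flushed.
seen = []
total = write_in_batches(({"id": i} for i in range(5)), seen.append, batch_size=2)
print(total, [len(b) for b in seen])
```

With pymongo, `flush` could be the collection's `insert_many` method, for example `db.p3.insert_many` for the collection used in this notebook.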

In [26]:
errors = {}
targets = ['node','way']
errors = export_osm(file_out,targets,errors,to_mongo=True,to_file=True)

osm: {'timestamp': '2016-02-13T00:26:02Z', 'version': '0.6', 'generator': 'osmconvert 0.7T'}
bounds: {'minlat': '42.656', 'maxlon': '-87.522', 'minlon': '-88.511', 'maxlat': '43.389'}


### OSM Data Export Log

In [27]:
print errors['num_badvalue']
print errors['num_key_problemchars']
print errors['num_attrib_parse']
print errors['num_badkey']
print errors['num_colonkeys']
print errors['num_gnis_subkeys']
print errors['num_addr_subkeys']
print errors['num_tiger_subkeys']
print errors['num_seamark_subkeys']
print errors['num_colonvalues']

0
0
0
0
4621
7687
14442
225441
104
0


#### ^From the above, we can see there were...
* 7,687 nested tags saved into "gnis" dictionaries
* 14,442 nested tags saved into "addr" dictionaries
* 225,441 nested tags saved into "tiger" dictionaries
* 104 nested tags saved into "seamark" dictionaries


* And 4,621 nested tags saved into the "bad_keys" dictionaries


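* There were also 4,263 nested tag *values* that contained ":" (this figure is not reflected in the log above, where `num_colonvalues` is 0 because the `format_value` call that counts them is commented out in `add_tag_vals`)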

The following is an exploration of the sub keys (example: "city" is the subkey for the raw key "addr:city") saved to the address, gnis, tiger, and seamark document level dictionaries:

In [28]:
print "addr subkeys: {}".format(errors['num_addr_subkeys'])
print errors['addr_subkeys']
print ""
print "gnis subkeys: {}".format(errors['num_gnis_subkeys'])
print errors['gnis_subkeys']
print ""
print "tiger subkeys: {}".format(errors['num_tiger_subkeys'])
pprint.pprint(errors['tiger_subkeys'])
print ""
print "seamark subkeys: {}".format(errors['num_seamark_subkeys'])
pprint.pprint(errors['seamark_subkeys'])

addr subkeys: 14442
{'city': 2553, 'full': 32, 'country': 926, 'historic': 1, 'state': 2396, 'street': 3115, 'housename': 66, 'postcode': 2125, 'suite': 5, 'housenumber': 3147, 'unit': 74, 'interpolation': 2}

gnis subkeys: 7687
{'Class': 170, 'feature_type': 49, 'created': 1413, 'import_uuid': 214, 'edited': 102, 'county_name': 263, 'ST_alpha': 170, 'id': 171, 'County': 170, 'feature_id': 1686, 'county_id': 1366, 'state_id': 1364, 'reviewed': 209, 'County_num': 170, 'ST_num': 170}

tiger subkeys: 225441
{'cfcc': 31489,
 'county': 31551,
 'name_base': 27656,
 'name_base_1': 1938,
 'name_base_2': 215,
 'name_base_3': 25,
 'name_base_4': 2,
 'name_direction_prefix': 11308,
 'name_direction_prefix_1': 448,
 'name_direction_prefix_2': 36,
 'name_direction_prefix_3': 5,
 'name_direction_suffix': 221,
 'name_direction_suffix_1': 10,
 'name_direction_suffix_2': 12,
 'name_type': 25638,
 'name_type_1': 1266,
 'name_type_2': 91,
 'name_type_3': 13,
 'reviewed': 31052,
 'separated': 2945,
 'sour

It should be noted that "tiger:" only appears on ways. 

For nodes, the two major ":"-containing tag keys are "addr:" and "gnis:". The above shows that there is no overlap in the subkeys for these two types of keys. Based on this, I decided to add "addr" and "gnis" tag values as **separate** document level dictionaries.
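
The overlap check can be done mechanically with a set intersection. The sketch below copies the "addr" and "gnis" subkey counts from the output above; `shared_subkeys` is a hypothetical helper, and the code uses Python 3 syntax.

```python
# Subkey counts copied from the export log printed above.
addr_subkeys = {'city': 2553, 'full': 32, 'country': 926, 'historic': 1,
                'state': 2396, 'street': 3115, 'housename': 66, 'postcode': 2125,
                'suite': 5, 'housenumber': 3147, 'unit': 74, 'interpolation': 2}
gnis_subkeys = {'Class': 170, 'feature_type': 49, 'created': 1413, 'import_uuid': 214,
                'edited': 102, 'county_name': 263, 'ST_alpha': 170, 'id': 171,
                'County': 170, 'feature_id': 1686, 'county_id': 1366, 'state_id': 1364,
                'reviewed': 209, 'County_num': 170, 'ST_num': 170}

def shared_subkeys(first, second):
    """Return the subkeys that appear in both dictionaries, sorted."""
    return sorted(set(first) & set(second))

print(shared_subkeys(addr_subkeys, gnis_subkeys))
print(sum(addr_subkeys.values()), sum(gnis_subkeys.values()))
```

The two sums are a quick consistency check against the `num_addr_subkeys` and `num_gnis_subkeys` totals in the log.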

Here is a breakdown of the keys for the 4,621 tags saved into the "bad_keys" document-level dictionaries. The first pretty-printed cell shows the distribution of the bad_keys by the text appearing before the ":" in the raw key. The second pretty-printed cell shows the distribution by subkey (the text after the ":"), broken down further by the pre-":" text. Use the first cell to find prominent "key:" prefixes to incorporate into the Mongo database next, and use the second cell to make sure you aren't siloing data in one document-level dictionary that might also be saved in a different document-level dictionary (see "lanes", for example).

In [29]:
pprint.pprint(errors['colonkeys'])

{'FIXME': 2,
 'NHD': 429,
 'abandoned': 1,
 'access': 13,
 'aerialway': 8,
 'alt_name': 1,
 'area': 2,
 'bell': 7,
 'boatyard': 1,
 'bridge': 35,
 'building': 1246,
 'bus': 10,
 'camera': 1,
 'capacity': 11,
 'census': 68,
 'communication': 1,
 'community_centre': 1,
 'contact': 1386,
 'crossing': 1,
 'cycleway': 8,
 'dance': 6,
 'demolished': 3,
 'description': 1,
 'destination': 247,
 'diet': 1,
 'dimensions': 3,
 'disused': 33,
 'entrance': 22,
 'fire_hydrant': 4,
 'flag': 25,
 'fuel': 10,
 'generator': 17,
 'health_facility': 4,
 'healthcare': 4,
 'heritage': 3,
 'hgv': 96,
 'hov': 27,
 'internet_access': 1,
 'is_in': 2,
 'isced': 15,
 'junction': 19,
 'key': 1,
 'lanes': 23,
 'maxheight': 2,
 'maxspeed': 12,
 'motor_vehicle': 10,
 'name': 192,
 'note': 18,
 'odbl': 6,
 'parking': 2,
 'payment': 17,
 'phone': 1,
 'piste': 10,
 'plant': 10,
 'population': 1,
 'railway': 7,
 'ramp': 9,
 'recycling': 68,
 'ref': 6,
 'restaurant': 1,
 'roof': 38,
 'service': 8,
 'social_facility': 9,
 

In [46]:
pprint.pprint(errors['colon_key_tails'])

{'1800': {'name': 1},
 '1850': {'name': 1},
 '1860': {'name': 1},
 '1863': {'name': 1},
 '1880': {'name': 2},
 '1885-1886': {'name': 1},
 '1886-1914': {'name': 1},
 '1892': {'name': 1},
 '1892-1895': {'name': 1},
 '1895': {'name': 2},
 '1895-1900': {'name': 1},
 '1900': {'name': 1},
 '1901': {'name': 2},
 '1903': {'name': 1},
 '1907': {'name': 1},
 '1909': {'name': 1},
 '1911': {'name': 1},
 '1913': {'name': 1},
 '1914': {'name': 1},
 '1917-1927': {'name': 1},
 '1922': {'name': 1},
 '1928-1950': {'name': 1},
 '1931-1945': {'name': 1},
 '1934': {'name': 1},
 '1935': {'name': 1},
 '1945': {'name': 1},
 '1948': {'name': 1},
 '1949': {'name': 1},
 '1951': {'name': 1},
 '1955': {'name': 1},
 '1958': {'name': 1},
 '1960': {'name': 1},
 '1961-2014': {'name': 1},
 '1962': {'name': 1},
 '1963': {'name': 2},
 '1964': {'name': 1},
 '1965': {'name': 1},
 '1968': {'name': 1},
 '1969': {'name': 1},
 '1971': {'name': 1},
 '1972': {'name': 1},
 '1972-1995': {'name': 1},
 '1976': {'name': 1},
 '1980': 

Finally, I also checked the tag values that contained ":". I initially logged these ":"-containing tag values because I was worried there could be messy data in the tag values. Perhaps contributors used ":" to nest information in tag values the same way they use it to nest information in tag keys?

Thankfully, most of the tag values that contained ":" did so for valid reasons, the most prominent being web addresses (example: "http://www.google.com").

In [31]:
#most of the 790 values (v) [from nodes] that had colons in them were web addresses
#pprint.pprint(errors['colonvalues'])
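
One way to separate web addresses from other ":"-containing values is to let the standard library parse them. This is a sketch in Python 3 syntax; `colon_value_kind` is a hypothetical helper, the opening-hours string is a made-up example, and only `http` and `https` schemes are treated as web addresses here.

```python
from urllib.parse import urlparse

def colon_value_kind(value):
    """Classify a tag value as 'no_colon', 'url', or 'other'."""
    if ":" not in value:
        return "no_colon"
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse can reject malformed input; treat that as a non-URL value.
        return "other"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "other"

for sample in ["http://www.google.com", "Mo-Fr 08:00-17:00", "residential"]:
    print(sample, "->", colon_value_kind(sample))
```

Counting the "other" bucket would show how many colon-containing values still need a closer look.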