In [1]:
%%javascript
//IPython.load_extensions('calico-spell-check') //uncomment if spell checker not present


<IPython.core.display.Javascript object>

#OpenStreetMap Project
##Zach Farmer


##Table of Contents: 

1. [Data Auditing, Transforming and Loading](#Data Auditing, Transforming and Loading)
    + [Discovering tags](#Discovering tags)
    + [Check Compatibility of Tag Values](#Check Compatibility of Tag Values)
    + [Exploring Unique Users and Keys](#Exploring Unique Users and Keys) 
    + [Auditing and cleaning some of the data](#Auditing and cleaning some of the data)
    + [Preparing for Database Insertion](#Preparing for Database Insertion)
    + [Inserting into Mongo DB](#Inserting into Mongo DB)
    + [Quering MongoDB](#Quering MongoDB)
    + [Document tag overview](#Document tag overview)
2. [Problems Encountered in OSM Data](#Problems Encountered in OSM Data) 
    * [Brief Overview of Documents containing audited tags](#Brief Overview of Documents containing audited tags)
    * [Abbreviated Street Names](#Abbreviated Street Names)
    * [Incorrect and Inconsistent Postcodes](#Incorrect and Inconsistent Postcodes)
    * [Incorrect State Abbreviations](#Incorrect State Abbreviations)
    * [Incorrect City Names](#Incorrect City Names)
3. [Overview of The Data](#Overview of The Data)
    * [Sizes](#Sizes)
    * [Number of Ways and Nodes](#Number of Ways and Nodes)
    * [Uniques](#Uniques)
    * [Exploring the Data](#Exploring the Data)
4. [Further Thoughts on the dataset and OSM data collection methods](#Further Thoughts on the dataset and OSM data collection methods)







###**Map Area:** Seattle-East-Side, WA. specifically Bellevue, Kirkland, Redmond, Mercer Island, Issaquah, Sammamish   
![Image of project_area](openstreetmap_project_area.png)    

OpenStreetMap data for the this project including the above map can be exported from here: [https://www.openstreetmap.org/export#map=10/47.5937/-122.0931](https://www.openstreetmap.org/export#map=10/47.5937/-122.0931)   

OpenStreetMap is an open source mapping service, more information about their service can be found on their [about page.](https://www.openstreetmap.org/about) The specific data used for this project was downloaded using the OpenStreetMap Overpass API with the following bounds.

I choose this particular region because I lived around here while attending university, In addition there are major technology companies in this region, i.e. Microsoft, INRIX, Google, Amazon, Expedia, etc. These companies employ many tech savvy individuals, it was my belief that this area might be well documented with meta-data as a result of some many people living in the area who understand the value of rich meta-data.

```<bounds minlat="47.5024" minlon="-122.256" maxlat="47.7144" maxlon="-121.974"/>```   

> More detailed information can be found on the returned XML data from the API call at the [openstreetmap wiki.](https://wiki.openstreetmap.org/wiki/OSM_XML)   

####Import Modules

In [18]:
# Imported Modules required for the project
import os
import pprint
import json
import codecs
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
from pymongo import MongoClient
import string
from operator import itemgetter

<a id='Data Auditing, Transforming and Loading'></a>

##1. Data Auditing, Transforming and Loading
***    
The following sections involve steps similar to Lesson 6 data auditing procedure, however this audit is provisional to the area selected and I have added additional output to provide a broad overview of the osm data that supplements the MongoDB queries that occur later in the analysis. 

<a id='Discovering tags'></a>

###Discovering tags

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author Zach Farmer
#
# Assumes the data file has been downloaded, named appropriately and is in the local directory

def count_tags(filename):
    '''Discover what tags there are and how many of each them exist.
The output should be a dictionary with the tag name as the key
and number of times this tag is encountered in the dataset as the value.
    '''
    dict_of_tags = {}
    tree = ET.iterparse(filename)
    for event,elem in ET.iterparse(filename):
        
        if elem.tag not in dict_of_tags.keys():
            dict_of_tags[elem.tag] = 1
            elem.clear()
        else:
            dict_of_tags[elem.tag] += 1
            elem.clear()
                
    return dict_of_tags



if __name__ == "__main__":
    tags = count_tags("seattle-area_east-side.osm")
    pprint.pprint(tags)

{'bounds': 1,
 'member': 2684,
 'meta': 1,
 'nd': 732744,
 'node': 657718,
 'note': 1,
 'osm': 1,
 'relation': 275,
 'tag': 350124,
 'way': 67173}


<a id='Check Compatibility of Tag Values'></a>

###Check Compatibility of Tag Values

In [35]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author Zach Farmer
#
# Check the "k" value for each "<tag>" and determine if they are valid keys for our MongoDB instance,
# as well as determining if there are any other potential problems.

def key_type(element, keys):
    ''' 
    Take 'tag' element and using regex patterns established check for
    types of characters in the tags keys.
    '''
   
    if lower.match(element.attrib['k']):
        keys['lower'] +=1
        
    elif lower_colon.match(element.attrib['k']):
        keys['lower_colon'] += 1
        
    elif problemchars.match(element.attrib['k']):
        keys['problemchars'] += 1
                
    else:
        keys['other'] += 1
        
    return keys

def process_map(filename):
    ''' 
    find elements with 'tag' children add character make-up to keys dictionary
    '''
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    
    for event, element in ET.iterparse(filename,events=("start",)):
        if element.tag == "tag":
            keys = key_type(element, keys)
        element.clear()
         
    return keys

if __name__ == "__main__":
    # For pattern matching the tag keys
    lower = re.compile(r'^([a-z]|_)*$')
    lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
    problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
    
    keys = process_map("seattle-area_east-side.osm")
    pprint.pprint(keys)

{'lower': 138733, 'lower_colon': 208566, 'other': 2825, 'problemchars': 0}


<a id='Exploring Unique Users and Keys'></a>

###Exploring Unique Users and Keys 

In [21]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author Zach Farmer
#
"""
find out how many unique users have contributed to the map in this particular area!
List out all the unique keys in the data set

The function process_map should return a set of unique user IDs ("uid") and a set of unique keys.
"""
def process_map(filename):
    '''return a set of unique user IDs ("uid") and a set of unique keys.'''
    users = set()
    keys = set()
    total_entries = 0 # for auditing overview purposes
    for _, element in ET.iterparse(filename):
        if 'uid' in element.attrib.keys():
            users.add((element.attrib['uid'],element.attrib['user']))
            total_entries += 1
            if element.tag == "tag" and element.attrib['v']:
                keys.add(element.attrib['k'])
            element.clear()
        if element.tag == "tag" and element.attrib['v']:
            keys.add(element.attrib['k']) 
            element.clear()
        else:
            element.clear()
        
    return users,keys,total_entries

if __name__ == "__main__":
    uniq_users, uniq_keys, total_entries = process_map("seattle-area_east-side.osm")
    
    print "Total number of entries by all Users: {0}".format(total_entries),'\n',\
    "Number of Unique Users: {0}".format(len(uniq_users)),'\n',\
    "Unique User ID's and User Names: ",'\n',pprint.pprint(uniq_users),'\n',\
    "Number of Unique tag keys: {0}".format(len(uniq_keys)),'\n',"Unique keys:",'\n',\
    pprint.pprint(uniq_keys)   

Total number of entries by all Users: 725166 
Number of Unique Users: 594 
Unique User ID's and User Names:  
set([('1', 'Steve'),
     ('100023', 'skwash'),
     ('1007057', 'Ceema'),
     ('1007509', 'JDong'),
     ('1012861', 'SalD'),
     ('103253', 'gormur'),
     ('10371', 'msiebuhr'),
     ('1044834', 'Jesse Phillips'),
     ('104962', 'techlady'),
     ('1051550', 'shravan91'),
     ('1058308', 'henningpohl'),
     ('1059812', 'Matthew Kennedy'),
     ('10786', 'stucki1'),
     ('1080997', 'filmfan2206'),
     ('1083211', 'Theo D O Lite'),
     ('109570', 'Vincent Broman'),
     ('110046', 'freietonne-db'),
     ('110263', 'werner2101'),
     ('11039', 'Rob Lanphier'),
     ('1106095', 'mziehm'),
     ('11126', 'seav'),
     ('11131', 'Amoebabadass'),
     ('113624', 'mojodna'),
     ('1137491', 'Pandu Rao'),
     ('114615', 'DuncLaw'),
     ('1168086', 'Extramiler'),
     ('1168707', 'animeigo'),
     ('1177412', 'BallardMapper2012'),
     ('118021', 'maggot27'),
     ('118262

<a id='Auditing and cleaning some of the data'></a>

###Auditing and cleaning the data

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author Zach Farmer
"""
- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix 
    the unexpected street types to the appropriate ones in the expected list. In addition 
    refelct the changes needed to fix the postcodes, city names and state abbrivations
- write the update_name function, to actually fix the street name, postcodes,city names and state abbrv.
"""

OSMFILE = "seattle-area_east-side.osm"
street_name_re = re.compile(r'\S+\.?', re.IGNORECASE) # look for all words in street addresses

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

mapping = { "av" : "Avenue",
            "ave" : "Avenue",
            "st" : "Street",
            "ste" : "Suite",
            "sr" : "State Route",
            "ave" : "Avenue",
            "blvd" : "Boulevard",
            "ct" : "Court",
            "dr" : "Drive",
            "hwy" : "Highway",
            "ln" : "Lane",
            "rd" : "Road",
            "n" : "North",
            "pl" : "Place",
            "ext" : "Extension",
            "s" : "South",
            "e" : "East",
            "se" : "South-East",
            "sq" : "Square",
            "w" : "West",
            "ne" : "North-East",
            "nw" : "North-West",
            "pkwy" : "Parkway",
            "wy" : "Way"
            }

def audit_street_type(street_types, street_name): 
    '''
    Match all words in an street address that are not expected
    '''
    m = street_name_re.findall(street_name)
    for word in m:
        if word not in expected:
            street_types[word].add(street_name)
            
def is_addr(elem):
    '''
    Determine if tag key is an address
    '''
    return ('addr' in elem.attrib['k'])

def addrs_audit(street_types,tag):
    ''' 
    Perform a provisional address audit on the relevent tags, records all problematic
    tag/value pairs for postcodes,city names and state abbreviations and return them to the audit function.
    Additionally where possible correct bad values and replace non-correctable ones with 'FIXME' place-
    holders.
    '''
    bad_postcodes = {} # for auditing overview purposes
    bad_city_names = {} # for auditing overview purposes
    bad_state_abbr = {} # for auditing overview purposes
    
    if tag.attrib['k'] == "addr:street": # find problem addresses, e.i. abbreviated words. Fix Them
        audit_street_type(street_types, tag.attrib['v'])
    
    elif tag.attrib['k'] == "addr:postcode": # find, if any exist, invalid or inappropriate postcodes
        if len(tag.attrib['v']) > 5 or len(tag.attrib['v']) < 5:
            bad_postcodes[tag.attrib['k']] = tag.attrib['v']
            if len(tag.attrib['v']) == 10:
                tag.attrib['v'] = tag.attrib['v'][0:5]
            else:
                tag.attrib['v'] = "FIXME"

            return bad_postcodes
        else:
            pass
    
    elif tag.attrib['k'] == "addr:city": # find, if any exist, bad or invalid city names
        if ',' in tag.attrib['v']:
            bad_city_names[tag.attrib['k']] =  tag.attrib['v']
            tag.attrib['v'] = (tag.attrib['v'].split(","))[0]
        else:
            pass
        
        return bad_city_names
    
    elif tag.attrib['k'] == "addr:state": # find, if any exist, bad or invalid state abbrvs.
        if len(tag.attrib['v']) > 2:
            bad_state_abbr[tag.attrib['k']] = tag.attrib['v']
            tag.attrib['v'] = "FIXME"
        
        return bad_state_abbr

def audit(osmfile):
    '''
    open file and look through tags one at a time using a sax parser. For the specified tags('addr:*'),
    pass on to a more specific address audit function. Addtionally keep track of the number of unique keys
    with and without colons.
    '''
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    # record following for visual auditing examples
    colon_keys = set()
    single_keys = set()
    bad_postcodes = []
    bad_city_names =[]
    bad_state_abbrs =[]
    total_address_tags = 0
    
    for event, elem in ET.iterparse(osm_file, events=("start",)): # cElementTree sax parser for reduced RAM demand
        # only look at osm data primitives; nodes and ways
        if elem.tag == "node" or elem.tag == "way":
            # find all tags associated with the node and way primitives
            for tag in elem.iter("tag"):
                # find all tag keys with colons in them
                if ':' in tag.attrib['k']:
                    colon_keys.add(tag.attrib['k'])
                    # look at tag keys concerned with addresses
                    if is_addr(tag):
                        total_address_tags +=1
                        address_audit = addrs_audit(street_types,tag)
                        if type(address_audit) == type({}): # record bad entries in address tags
                            if 'addr:postcode' in address_audit.keys():
                                total_address_tags -=1
                                bad_postcodes.append(address_audit.values())
                            elif 'addr:city' in address_audit.keys():
                                total_address_tags -=1
                                bad_city_names.append(address_audit.values())
                            elif 'addr:state' in address_audit.keys():
                                total_address_tags -=1
                                bad_state_abbrs.append(address_audit.values())
                    else:
                        pass
                elif ':' not in tag.attrib['k']:
                    single_keys.add(tag.attrib['k'])
                
            elem.clear()
        elem.clear()
    
    print "Total number of 'addr' tags: {0}".format(total_address_tags),'\n',\
    "Number of unique multi-keys: ",len(colon_keys),'\n',"Unique multi-keys:",'\n',\
    pprint.pprint(colon_keys),'\n',"Number of unique single-keys: ",len(single_keys),'\n',\
    "Unique single-keys:",'\n', pprint.pprint(single_keys),'\n',"Number of bad Postcodes: ",\
    len(bad_postcodes),'\n', "Example of bad postcode: ", bad_postcodes[0:3], '\n',\
    "Number of a bad city names: ", len(bad_city_names),'\n', "Example of a bad city name: ",\
    bad_city_names[0:3], '\n', "Number of bad state abbreviations:",len(bad_state_abbrs),'\n',\
    "Example of a bad state abbreviations: ", bad_state_abbrs[0:3]
    return street_types

def update_street_name(name, mapping):
    '''
    Find abbr. in street names, record offending name and then replace the name with the full
    non-abbreviated version.
    '''
    update_name = name

    for word in street_name_re.findall(name):
        #print "pattern matched word: ", word
        word_striped =  word.translate(None,string.punctuation).lower()
        if word_striped in mapping.keys():
            update_name = re.sub(word, mapping[word_striped],update_name)
        else:
            continue
    if name == update_name:
        return None
    else:
        return update_name
    
def test():
    st_types = audit(OSMFILE)
    fixed_street_names = {}
    for st_type, ways in st_types.iteritems():
        for name in ways:
            better_name = update_street_name(name, mapping)
            if better_name != None:
                fixed_street_names[name] = better_name
    
    print "Number of unique bad street names: ", len(fixed_street_names), '\n',\
    pprint.pprint(fixed_street_names)

if __name__ == '__main__':
    test()

Total number of 'addr' tags: 136361 
Number of unique multi-keys:  150 
Unique multi-keys: 
set(['FIXME:access',
     'abandoned:aeroway',
     'addr:city',
     'addr:country',
     'addr:full',
     'addr:housename',
     'addr:housenumber',
     'addr:postcode',
     'addr:state',
     'addr:street',
     'addr:suite',
     'addr:unit',
     'bridge:name',
     'bridge:structure',
     'building:levels',
     'building:min_levels',
     'building:part',
     'bus:lanes',
     'bus:lanes:conditional',
     'capacity:disabled',
     'census:population',
     'contact:phone',
     'cycleway:left',
     'cycleway:right',
     'disused:aeroway',
     'disused:amenity',
     'disused:shop',
     'exit_to:left',
     'exit_to:right',
     'fuel:diesel',
     'fuel:lpg',
     'fuel:octane_91',
     'fuel:octane_95',
     'fuel:octane_98',
     'gnis:Cell',
     'gnis:Class',
     'gnis:County',
     'gnis:County_num',
     'gnis:ST_alph',
     'gnis:ST_alpha',
     'gnis:ST_num',
     'gnis

<a id='Preparing for Database Insertion'></a>

###Preparing for Database Insertion


In [33]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer

"""
convert osm data file from html to python dictionaries, and finally to Json format for loading 
into a MongoDB instance.
"""
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
street_name_re = re.compile(r'\S+\.?', re.IGNORECASE)
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

mapping = { "av" : "Avenue",
            "ave" : "Avenue",
            "st" : "Street",
            "ste" : "Suite",
            "sr" : "State Route",
            "ave" : "Avenue",
            "blvd" : "Boulevard",
            "ct" : "Court",
            "dr" : "Drive",
            "hwy" : "Highway",
            "ln" : "Lane",
            "rd" : "Road",
            "n" : "North",
            "pl" : "Place",
            "ext" : "Extension",
            "s" : "South",
            "e" : "East",
            "se" : "South-East",
            "sq" : "Square",
            "w" : "West",
            "ne" : "North-East",
            "nw" : "North-West",
            "pkwy" : "Parkway"
            }

def update_name(tag,mapping):
    '''
    audit and update address related tags, specifically the street, postcode,
    city and state.
    '''
    street_name_re = re.compile(r'\S+\.?', re.IGNORECASE)
    
    if tag.attrib['k'] == "addr:postcode":
        if len(tag.attrib['v']) > 5 or len(tag.attrib['v']) < 5:
            if len(tag.attrib['v']) == 10:
                return tag.attrib['v'][0:5]
            else:
                return "FIXME"
        else:
            return tag.attrib['v']
    elif tag.attrib['k'] == "addr:city":
        if "," in tag.attrib['v']:
            return (tag.attrib['v'].split(","))[0].lower()
        else:
            return tag.attrib['v'].lower()
    elif tag.attrib['k'] == "addr:state":
        if len(tag.attrib['v']) > 2:
            return "FIXME"
        else:
            return tag.attrib['v'].upper()
    elif tag.attrib['k'] == "addr:street":
        update_name = tag.attrib['v']
        for word in street_name_re.findall(tag.attrib['v']):
            word_striped =  word.translate(None,string.punctuation).lower()
            if word_striped in mapping.keys():
                update_name = re.sub(word,mapping[word_striped],update_name) 
            else:
                continue
        return update_name
    else:
        return tag.attrib['v']

def shape_element(elem):
    '''
    Pass in elements from a single event, look at only osm primitives 'node' and 'way',
    return element, updated where appropriate.
    '''
    if elem.tag =="way" or elem.tag =="node":
        document = {} # Establish the format (python dictionary) to be converted to a json object
        document["type"] = elem.tag # make variable for type of data primitive
        created = {} #record when event was created
        position = [] #record lat/lon coords for nodes
        node_refs = []
        address = {} # fix and update the address information
        # first level key/values, present for all events
        for key in elem.attrib.keys():
            if key =='lon' or key =='lat':
                position.append(float(elem.attrib[key]))
            elif key in CREATED:
                created[key] = elem.attrib[key]
            else:
                document[key] = elem.attrib[key]

        # second level key/values for events with child tags
        for tags in elem.iter("tag"):
            #Ignore all tags with keys that contain invalid characters
            if problemchars.search(tags.attrib['k']): 
                continue
            elif re.match(r'addr:\w+$',tags.attrib['k']):
                keys = tags.attrib['k'].split(":")
                address[keys[1]] = update_name(tags,mapping)
            elif tags.attrib['k'] == "ref": 
                node_refs.append(tags.attrib['v'])
            else: # re.match(r'[^addr]', tags.attrib['k']):
                #print tags.attrib['k']
                keys = tags.attrib['k'].split(":")
                if len(keys) == 1:
                    document[tags.attrib['k']] = tags.attrib['v']
                if len(keys) == 2:
                    document[keys[0]] = {keys[1]:tags.attrib['v']}
                elif len(keys) > 2:
                    document[tags.attrib['k']] = tags.attrib['v']
            
        # Add transformed and updated key/values to doucment dictionary
        if len(created) != 0:
            document['created'] = created
        if len(position) != 0:
            document['position'] = position
        if len(address) != 0:
            document['address'] = address
        if len(node_refs) != 0:
            document['node_refs'] = node_refs
        
        return document
    
    else: #if not data primitive 'node' or 'way' then ignore it
        pass
    
def process_map(file_in):
    '''
    Input osm xml file, step through with sax parser, return json document
    for mongo db import
    '''
    file_out = "{0}.json".format(file_in)
    #data =[] # for testing purposes

    with codecs.open(file_out, "w") as fo:
        # use cElement sax parser to look at events one at a time
        for event, element in ET.iterparse(file_in, events=("start",)):
            #send each element to be audited and where appropriate corrected
            el = shape_element(element)
            #clear sax event data
            element.clear()

            if el: # if element is valid write the data as a json object
                #data.append(el) # for testing
                fo.write(json.dumps(el) + '\n')


if __name__=="__main__":
    data = process_map("seattle-area_east-side.osm")

<a id='Inserting into Mongo DB'></a>

###Inserting into Mongo DB

In [34]:
%%bash
mongoimport -db DAproject_2 -c osm_data --file seattle-area_east-side.osm.json 

connected to: 127.0.0.1
2015-06-26T17:03:51.088-0700 		Progress: 3822405/154768287	2%
2015-06-26T17:03:51.089-0700 			18400	6133/second
2015-06-26T17:03:54.062-0700 		Progress: 13360709/154768287	8%
2015-06-26T17:03:54.062-0700 			64800	10800/second
2015-06-26T17:03:57.001-0700 		Progress: 22785056/154768287	14%
2015-06-26T17:03:57.001-0700 			110800	12311/second
2015-06-26T17:04:01.774-0700 		Progress: 29757982/154768287	19%
2015-06-26T17:04:01.774-0700 			145000	11153/second
2015-06-26T17:04:04.000-0700 		Progress: 37062450/154768287	23%
2015-06-26T17:04:04.000-0700 			180700	11293/second
2015-06-26T17:04:07.036-0700 		Progress: 46706328/154768287	30%
2015-06-26T17:04:07.036-0700 			227500	11973/second
2015-06-26T17:04:10.062-0700 		Progress: 56772229/154768287	36%
2015-06-26T17:04:10.062-0700 			276400	12563/second
2015-06-26T17:04:13.031-0700 		Progress: 66621357/154768287	43%
2015-06-26T17:04:13.031-0700 			323500	12940/second
2015-06-26T17:04:16.044-0700 		Progress: 76154297/1547

<a id='Quering MongoDB'></a>

###Quering MongoDB 

In [63]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer
# Test the insert into mongodb, check existence of audited field address

def find(collection,query,count=True):
    '''append count method to all find request by default, pass false in order to return normal
    mongodb find method
    '''
    if count == True:
        results = collection.find({"{0}".format(query):{"$exists": True}}).count()
    else:
        results = collection.find({"{0}".format(query):{"$exists": True}})
    return results

def get_db(db_name):
    '''Establish connection to mongo db using python api '''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

if __name__=="__main__":
    db = get_db("DAproject_2")
    find_results = find(db.osm_data,"address",True)
    print "Total number of documents: ", db.osm_data.find().count(),'\n',\
    "Number of documents with address data: ", find_results,'\n',\
    "Number of documents with street: ", find(db.osm_data,"address.street"),'\n',\
    "Number of doucments with city names: ", find(db.osm_data,"address.city"),'\n',\
    "Number of doucments with postcodes: ",  find(db.osm_data,"address.postcode"),'\n',\
    "Number of documents with State abbrv.: ", find(db.osm_data,"address.state")
    documents = db.osm_data.find({"address":{"$exists":True}})[8995:9000]
    print "\nExample of 5 documents from the database: "
    for document in documents:
        pprint.pprint(document)
        

Total number of documents:  724891 
Number of documents with address data:  31672 
Number of documents with street:  30943 
Number of doucments with city names:  30215 
Number of doucments with postcodes:  30354 
Number of documents with State abbrv.:  510

Example of 5 documents from the database: 
{u'_id': ObjectId('558de88bec989cce76aaefce'),
 u'address': {u'city': u'kirkland',
              u'housenumber': u'8630',
              u'postcode': u'98033',
              u'street': u'113th Lane Northeast'},
 u'created': {u'changeset': u'28592592',
              u'timestamp': u'2015-02-03T16:12:36Z',
              u'uid': u'2604212',
              u'user': u'Glassman_Import',
              u'version': u'1'},
 u'id': u'3328590148',
 u'position': [-122.1888728, 47.6804031],
 u'type': u'node'}
{u'_id': ObjectId('558de88bec989cce76aaefcf'),
 u'address': {u'city': u'kirkland',
              u'housenumber': u'8631',
              u'postcode': u'98033',
              u'street': u'113th Lane Nort

<a id='Document tag overview'></a>

###Document tag overview

In [64]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer
#For the purpose of making sure the tags where successfully included in the mongodb document.


taglist = ['tiger:source', 'maxspeed', 'WIDTH', 'snowmobile', 'is_in', 'seamark:type', 'created_by',\
           'gnis:ST_num', 'suite', 'hov:minimum', 'attribution', 'hov:lanes:conditional', 'recycling:magazines',\
           'tiger:county', 'name:uk', 'motor_vehicle', 'Shape_ST_1', 'Shape_ST_2', 'lanes:hov:conditional',\
           'artist_last_name', 'addr:street', 'source:name', 'school', 'level', 'name:zh', 'bridge:structure',\
           'is_in:continent', 'sidewalk', 'cmt', 'disused', 'recycling:paper', 'addr:state', 'gnis:state_id',\
           'bicycle', 'gnis:county_name', 'fuel:lpg', 'source:license', 'name:it', 'MTFCC', 'cycleway:left',\
           'gnis:id', 'wikipedia:en', 'access', 'toll', 'voltage-high', 'water', 'census:population', 'baseball',\
           'address', 'name:bg', 'hoops', 'microbrewery', 'SurfaceTyp', 'TrailMaint', 'gnis:create', 'residents',\
           'payment:dogecoin', 'name:pt', 'gnis:County_num', 'classification', 'tiger:zip_right', 'amenity', 'tower:type',\
           'STATEFP', 'fee', 'tiger:name_type_2', 'tiger:name_type_1', 'destination', 'type', 'start_date', 'addr:unit',\
           'tiger:PLCIDFP', 'club', 'visibility', 'site', 'phone', 'abandoned:aeroway', 'traffic_calming', 'room',\
           'tunnel', 'tiger:NAME', 'service:bicycle:chain_tool', 'noref', 'source:url', 'history', 'Shape_STLe',\
           'Shape_area', 'fax', 'rcn_ref', 'information', 'KC_FAC_FID', 'SITENAME', 'swing_gate:type', 'hov', 'name:fr',\
           'animal', 'exit_to:left', 'old_railway_operator', 'FIXME', 'description', 'name:fa', 'second_hand', 'TrailName',\
           'date', 'Geolocation', 'natural', 'wheelchair', 'outdoor_seating', 'healthcare', 'restriction', 'office',\
           'building:part', 'OWNERTYPE', 'motorcycle', 'name:ta', 'exit_to:right', 'addr:housenumber', 'drive_in',\
           'tiger:name_type', 'capacity:disabled', 'covered', 'old_ref', 'junction', 'food', 'date_off', 'material',\
           'cutting', 'recycling:newspaper', 'foot', 'tourism', 'gnis:edited', 'payment:bitcoin', 'fixme', 'odbl',\
           'name', 'designation', 'tiger:zip_left', 'testing point', 'Functional', 'embankment', 'crossing', 'name_2',\
           'name_1', 'seamark:harbour:category', 'name:ar', 'naptan:Bearing', 'frequency', 'COUNTYFP', 'seamark:small_craft_facility:category',\
           'network', 'diesel', 'MANAGER', 'mountain_pass', 'ref', 'highway', 'barrier', 'hairdresser', 'import_uuid',\
           'seamark:information', 'tiger:reviewed', 'name:sv', 'brushless', 'sac_id', 'noexit', 'organic', 'segregated',\
           'route', 'atm', 'shelter_type', 'place', 'opening_hours', 'toilets:wheelchair', 'tiger:PLACENS',\
           'artist_first_name', 'recycling:glass', 'sac_scale', 'construction_date', 'horse', 'tiger:upload_uuid',\
           'service', 'width', 'unknown', 'addr:housename', 'rcn', 'border_type', 'is_in:county', 'motorcar',\
           'park_ride', 'dogs', 'store', 'MANAGETYPE', 'enforcement', 'noname', 'gym', 'name:he', 'artist_name',\
           'is_in:state', 'name:ja', 'stairs', 'NHD:ComID', 'population', 'tiger:name_direction_suffix', 'project',\
           'aeroway', 'landuse', 'tracktype', 'bridge', 'tiger:PCICBSA', 'mtb:scale:imba', 'sdot:bike_rack:type',\
           'recycling:cardboard', 'sport', 'building:levels', 'lit', 'thermometer', 'disused:amenity', 'addr:full',\
           'exit_to', 'ref:left', 'maxspeed:hgv', 'payment:coins', 'url', 'SITETYPE', 'tiger:MTFCC', 'tactile_paving',\
           'tiger:name_direction_suffix_2', 'tiger:name_direction_suffix_1', 'is_in:iso_3166_2', 'shop', 'disused:aeroway',\
           'golf', 'name:ko', 'automated', 'social_facility', 'trail_visibility', 'tiger:zip_left_4', 'tiger:zip_left_3',\
           'tiger:zip_left_2', 'tiger:zip_left_1', 'gnis:feature_type', 'gnis:Cell', 'is_in:city', 'Trail_Name', 'addr:suite',\
           'biodiesel', 'stop', 'leisure', 'Shape_len', 'gnis:County', 'hour_off', 'name:en', 'tiger:cfcc', 'addr:postcode',\
           'COMMENTS', 'internet_access:fee', 'tiger:zip_right_2', 'public_transport', 'RTTYP', 'name:es', 'voltage',\
           'tiger:LSAD', 'NHD:ReachCode', 'LINEARID', 'tiger:STATEFP', 'wetland', 'parking', 'name:ru', 'traffic_signals:sound',\
           'capacity', 'fuel:diesel', 'wikipedia', 'ele', 'boundary', 'MAINTD_BY', 'email', 'protect_id', 'denomination',\
           'board_type', 'substation', 'source:addr:id', 'tiger:PLACEFP', 'country', 'recycling:plastic', 'cycleway:right',\
           'planned', 'maxlength', 'railway', 'source_ref', 'comment', 'drive_through', 'MAINTTYPE', 'SiteNbr', 'harbour',\
           'height', 'church', 'bench', 'boat', 'recycling:cartons', 'tiger:name_direction_prefix_1', 'wheelchair:description:en',\
           'tiger:name_direction_prefix_2', 'short_name', 'ref:right', 'NHD:FCode', 'basin', 'bicycle_parking', 'website',\
           'direction', 'lanes', 'lanes:hov', 'craft', 'official_name', 'hobby', 'smoking', 'legal:video', 'sound', 'bridge:name',\
           'PROPERTY', 'OWNER', 'layer', 'backrest', 'gnis:county_id', 'surface', 'sloped_curb', 'name:vi', 'waterway',\
           'cuisine', 'sdot:bike_rack:facility', 'doctor', 'media', 'SysChangeD', 'fuel:octane_98', 'collection_times',\
           'fuel:octane_91', 'fuel:octane_95', 'wires', 'mtb:type', 'fence_type', 'sym', 'sdot:bike_rack:id',\
           'motorcycle:lanes', 'gnis:ST_alpha', 'intermittent', 'hgv', 'historic', 'oneway', 'landmark', 'FULLNAME',\
           'TrailID', 'addr:country', 'SysChangeU', 'NHD:FTYPE', 'rest', 'tiger:zip_right_1', 'tiger:zip_right_3', 'date_on',\
           'tiger:tlid', 'tiger:zip_right_4', 'ref:store_number', 'seats', 'note', 'recycling:aluminium', 'Company', 'except',\
           'name:de', 'source', 'location', 'bollard', 'usage', 'gnis:feature_id', 'ski', 'emergency', 'hov:lanes',\
           'tiger:separated', 'disused:shop', 'bridge_name', 'gnis:ST_alph', 'lcn', 'psv', 'addr:city', 'tiger:name_base',\
           'vending', 'motor_vehicle:lanes:conditional', 'building:min_levels', 'recycling:cans', 'internet_access',\
           'Management', 'alt_name', 'amenity_1', 'local_ref', 'measurements', 'man_made', 'religion', 'bus:lanes:conditional',\
           'tiger:CLASSFP', 'artwork_type', 'gas_station', 'power', 'takeaway', 'nails', 'motor_vehicle:lanes',\
           'tiger:NAMELSAD', 'is_in:state_code', 'incline', 'footway', 'gnis:Class', 'supervised', 'tiger:name_direction_prefix',\
           'step_count', 'MgmtNotes', 'sdot:bike_rack:condition', 'operator', 'tiger:FUNCSTAT', 'tiger:CPI', 'KCPARKFID',\
           'existed', 'area', 'vac', 'support', 'contact:phone', 'tiger:name_base_1', 'tiger:name_base_2', 'stars',\
           'tiger:PCINECTA', 'function', 'admin_level', 'bus', 'brand', 'delivery', 'construction', 'old_name', 'bus:lanes',\
           'recycling:glass_bottles', 'voltage-low', 'dispensing', 'display', 'crossing_ref', 'entrance', 'maxheight',\
           'floating', 'cables', 'is_in:country', 'FIXME:access', 'cycleway', 'nq', 'self_service', 'STATUS', 'shelter',\
           'symbol', 'is_in:country_code', 'pumps', 'maxstay', 'gnis:created', 'structure', 'hour_on', 'building', 'trail',\
           'wifi', 'closest_town', 'passing_places']

def get_db(db_name):
    '''Establish connection to Mongo DB with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

if __name__=="__main__":
    db = get_db("DAproject_2")
    results = []
    for tag in taglist:
        results.append([tag, db.osm_data.find({tag.replace(':','.'):{"$exists":True}}).count()])
    print "Query Mongo DB for all unique tags from python audit and their respective counts: " 
    pprint.pprint(sorted(results, key=itemgetter(1), reverse=True))
    
                       

Query Mongo DB for all unique tags from python audit and there respective counts: 
[['type', 724891],
 ['building', 36922],
 ['address', 31672],
 ['highway', 27410],
 ['name', 19698],
 ['source', 11685],
 ['tiger:zip_right', 6853],
 ['power', 3763],
 ['service', 3704],
 ['amenity', 2812],
 ['surface', 2458],
 ['oneway', 2273],
 ['bicycle', 2074],
 ['created_by', 1949],
 ['foot', 1842],
 ['leisure', 1355],
 ['access', 882],
 ['natural', 797],
 ['man_made', 748],
 ['landuse', 736],
 ['tiger:reviewed', 735],
 ['maxspeed', 709],
 ['sport', 640],
 ['shop', 631],
 ['lanes', 597],
 ['footway', 577],
 ['barrier', 568],
 ['width', 556],
 ['bridge', 516],
 ['website', 456],
 ['crossing', 423],
 ['layer', 407],
 ['cycleway', 379],
 ['area', 367],
 ['ele', 365],
 ['name_1', 320],
 ['fixme', 289],
 ['motor_vehicle', 267],
 ['cuisine', 261],
 ['segregated', 255],
 ['gnis:state_id', 244],
 ['horse', 243],
 ['tiger:zip_left', 240],
 ['tiger:tlid', 231],
 ['phone', 224],
 ['building:levels', 219],
 ['p

<a id='Problems Encountered in OSM Data'></a>

##2. Problems Encountered in OSM Data
***

Utilizing lesson 6 auditing procedures as a guide I created my own provisional auditing code process and analyzed the OSM data for Seattle WA, USA. More specifically the East-side of the greater Seattle area and discovered roughly 725,000 entries containing 460 unique tag key/value pairs for OSM data primitives 'ways' and 'nodes'. Similar to the lesson 6 auditing challenges I focused my attention on the address related tags to perform a more in-depth analysis and auditing. In this data set 'addr'(address) tags along with 149 other unique tags were hierarchical, specifically the 'parent' tag 'addr' contained further components. The 'addr' tags included 10 'child' tags related to addresses, I specifically audited:   
* addr:street  
* addr:city   
* addr:postcode  
* addr:state    

tags while leaving the rest alone. during the course of this audit I found several issues with the data, I would posit that similar errors would likely be found in other tags within the dataset and would suggest that caution be used when utilizing user entered data outside of geo-location(GPS) data without auditing beforehand.

The key issues I focused on in this audit were abbreviated street addresses, inconsistent and invalid postcode values, Invalid state name abbreviations, and invalid city names. Most of the errors outside of the street addresses and postcodes were a result of including extra data or the wrong data in a tag that belonged to a different tag.   


<a id='Brief Overview of Documents containing audited tags'></a>

###Brief Overview of Documents containing audited tags
***   

Given that there are 724,891 documents in this mongoDB I am shocked to find so few (4.3%) of the documents containing any type of address related meta-data, considering that most places in and around the city are likely to have addresses. This suggests to me that priority has been placed on the GPS location data first and secondary meta-data is only inserted by especially motivated or enthusiastic users.

####Total Number of mongodb documents for selected OSM area: 724,891    
```python
db.osm_data.find().count()
```  

####Number of mongodb documents containing address data: 31,670    
```python
db.osm_data.find({"address":{"$exists": True } } ).count()   
```   

####Number of mongodb documents containing street address data: 30,943
```python   
db.osm_data.find({"address.street":{"$exists": True } } ).count()
```

####Number of mongodb documents containing city names data: 30,215
```python
db.osm_data.find({"address.city":{"$exists":True}}).count()
```   

####Number of mongodb documents containing postcode data: 30,354
```python
db.osm_data.find({"address.postcode":{"$exists":True}}).count()
``` 

####Number of mongodb documents containing state abbrv. data: 510
```python
db.osm_data.find({"address.state":{"$exists":True}}).count()
``` 



<a id='Abbreviated Street Names'></a>

###Abbreviated Street Names
***

In [129]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer


#Explore Address information inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$match":{"address.street":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ]
    return pipeline

if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

[{u'_id': u'kirkland', u'count': 27881},
 {u'_id': u'seattle', u'count': 1059},
 {u'_id': None, u'count': 1024},
 {u'_id': u'bellevue', u'count': 383},
 {u'_id': u'redmond', u'count': 328},
 {u'_id': u'hunts point', u'count': 192},
 {u'_id': u'sammamish', u'count': 33},
 {u'_id': u'issaquah', u'count': 26},
 {u'_id': u'mercer island', u'count': 7},
 {u'_id': u'newcastle', u'count': 4},
 {u'_id': u'renton', u'count': 2},
 {u'_id': u'clyde hill', u'count': 1},
 {u'_id': u'belevue', u'count': 1},
 {u'_id': u'lynwood', u'count': 1},
 {u'_id': u'kirkalnd', u'count': 1}]


####Number of unique address-street tags with over-abbreviated street names: *158*
> Results of provisional python auditing code

Original | Corrected
---------|-----------
 102nd Ave SE | 102nd Avenue South-East
 105th Avenue NE | 105th Avenue North-East
 106th Ave NE | 106th Avenue North-East
 106th St | 106th Street
 107th Avene NE | 107th Avene North-East
 ...  | ...

After running my provisional python auditing code I discovered a number of abbreviations in the addr:street tag values. Above is an shortened example of the types of street name abbreviations and the fixes implemented to correct them. Considering that there existed almost 31,000 address street values and my auditing code found 158 abbreviated unique names there is only a small number of abbreviated street names. It seems likely that this data was already cleaned or that it was entered after OSM issued some guidelines on data entry as it pertained to street addresses.

####Number of addresses with street tags: *30,943*


####Top 5 contributors by user name of Street Addresses
```python
db.osm_data.aggregate([
                {"$match":{"address.street":{"$exists":1}}},\
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 5}
                 ])
```

####Results: 
1. {user: Glassman_Import, count: 19,069}     
2. {user: sctrojan79-import, count: 5,220}     
3. {user: seattlefyi_import, count: 2,185}    
4. {user: Geodesy99, count: 693}    
5. {user: bryceco, count: 627}   

It appears that when it comes to the tags related to addresses there are only a few major contributors who contributed meta-data(top 5: 27,794 of 30,943). Further analysis will find that most of these addresses reside in Kirkland WA. 

####Reported cities which contain street address information  
```python
db.osm_data.aggregate([
                {"$match":{"address.street":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ])
```   

####Results:   
* {city: kirkland, count: 27,881}   
* {city: seattle, count: 1,059}   
* {city: None, count: 1,024}    
* {city: bellevue, count: 383}    
* {city: redmond, count: 328}     
* {city: hunts point, count: 192}    
* {city: sammamish, count: 33}     
* {city: issaquah, count: 26}    
* {city: mercer island, count: 7}   
* {city: newcastle, count: 4}    
* {city: renton, count: 2}    
* {city: clyde hill, count: 1}   
* {city: belevue, count: 1}   
* {city: lynwood, count: 1}   
* {city: kirkalnd, count: 1}   

These results reaffirm my observations that much of the meta-data contributed through tags that are address related have been contributed by a relatively small number of individuals for a relatively small physical area. 


<a id='Incorrect and Inconsistent Postcodes'></a>

###Incorrect and Inconsistent Postcodes
***

####Number of documents with address-postcodes: *30,354*      
####Number of bad Postcodes: *16*      
####Example of bad or invalid postcodes: 
*[['W Lake Sammamish Pkwy NE'], ['98004-4452'], ['98004-5002']]*      
  
There were almost as many postcodes recorded as street address, the 'error' rate was much lower then for street addresses. As nearly every recorded postcode was the standard 5 digit zip code with only a couple of entries containing the 4 digit extension. I aggregated up to the least common denominator and removed all digits beyond the 5-digit zip. Finally where there were street addresses present in the postcode tags I replaced them with the placeholder 'FIXME'. 

> Note: it might be a good idea for OSM to implement field frameworks for inserting meta-data for common fields, such as postcodes. A component of these frameworks would be to audit at time and point of entry, implementing basic test functions in order to prevent things like street addresses accidentally being placed in the postcode tags.

In [70]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer


#Explore postcode information inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$match":{"address.postcode":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ]
    return pipeline

if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

[{u'_id': u'kirkland', u'count': 27972},
 {u'_id': u'seattle', u'count': 1061},
 {u'_id': u'bellevue', u'count': 392},
 {u'_id': None, u'count': 343},
 {u'_id': u'redmond', u'count': 325},
 {u'_id': u'hunts point', u'count': 193},
 {u'_id': u'sammamish', u'count': 33},
 {u'_id': u'issaquah', u'count': 20},
 {u'_id': u'mercer island', u'count': 7},
 {u'_id': u'newcastle', u'count': 3},
 {u'_id': u'clyde hill', u'count': 1},
 {u'_id': u'renton', u'count': 1},
 {u'_id': u'belevue', u'count': 1},
 {u'_id': u'lynwood', u'count': 1},
 {u'_id': u'kirkalnd', u'count': 1}]


####Top 5 reported Post codes 
```python
db.osm_data.aggregate([{"$match":{"address.postcode":{"$exists":1}}},\
                {"$group":{"_id":"$address.postcode", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit" : 5}
                ])
```  

####Results:   
1. {Postcode: 98033, count: 18,982}   
2. {Postcode: 98034, count: 9,011}    
3. {Postcode: 98178, count: 775}    
4. {Postcode: 98004, count: 561}    
5. {Postcode: 98052, count: 276}    


Interestingly of the just over 30,000 reported postcodes over half of them 18,982 are for a postcode within the city of Kirkland, WA (98033). This postcode area accounts for only a small percentage of the total area looked at, and a small percentage of the population of the total observed area.    
<img src="98033_area_code.png">    
If we look at all of the reported postcodes and the city for which they were tagged, we see an even greater concentration of postcodes being reported for the city of Kirkland WA, then for any of the other cities in the observed area.    

####List of city name where postcode information was also posted  
```python
db.osm_data.aggregate([{"$match":{"address.postcode":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                ])
```  
####Results from query: 
* {city: kirkland, count: 27,972}   
* {city: seattle, count: 1,061} # only a sliver of the city, captured accidentally in the OSM data set  
* {city: bellevue, count: 392}   
* {city: None, count: 343}   # Instances where postcode provided but the city name was not    
* {city: redmond, count: 325}   
* {city: hunts point, count: 193}   
* {city: sammamish, count: 33}   
* {city: issaquah, count: 20}   
* {city: mercer island, count: 7}   
* {city: newcastle, count: 3}   
* {city: clyde hill, count: 1}   
* {city: renton, count: 1}     
* {city: belevue, count: 1}  # Misspelled    
* {city: lynwood, count: 1}        
* {city: kirkalnd, count: 1} # Misspelled        

If we look at all of the cities that were reported, also just over 30,000, we will find that if the postcode has been provided then it is likely the city name was also provided, or visa versa. 

####List of city names reported and the their respective counts  
```python
db.osm_data.aggregate([{"$match":{"address.city":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                ])
```    
####Results from query:  
* {city: kirkland, count: 28,116}   
* {city: seattle, count: 1,066} # only a sliver of the city, captured accidentally in the OSM data set  
* {city: bellevue, count: 432}   
* {city: redmond, count: 330}   
* {city: hunts point, count: 194}   
* {city: sammamish, count: 33}   
* {city: issaquah, count: 26}   
* {city: mercer island, count: 7}   
* {city: newcastle, count: 5}   
* {city: renton, count: 2}   
* {city: clyde hill, count: 1}   
* {city: belevue, count: 1}   # Misspelled     
* {city: lynwood, count: 1}     
* {city: kirkalnd, count: 1}  # Misspelled     

> If we are concerned with encouraging users to enter more meta-data then we should take note of the fact that users who entered city names are very likely to also enter postcodes. This fact could be leveraged to increase contributions to meta-data.    

####Top Five postcode contributors to Kirkland by user name and count 
```python
db.osm_data.aggregate([{"$match":{"address.postcode":{"$exists":1},\
                "address.city": "kirkland"}},\
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 5} ])
```

####Results:   
1. {user: Glassman_Import, count: 19,123}    
2. {user: sctrojan79-import, count: 5,243}    
3. {user: seattlefyi_import, count: 2,190}    
4. {user: sctrojan79, count: 412}    
5. {user: Debbie Bull, count: 402}      


It looks as if the vast majority of submitted meta-data regarding the postcodes came from just a couple of sources (Glassman_import, sctrojan79-import,seattlefyi_import), which all appear to be imports into OSM, possibly from large databases containing geographic related data. If a goal of OSM is to encourage all its users to contribute meta-data they should be looking into methods of encouraging 'average' users to include more meta-data with their submissions. 


<a id='Incorrect State Abbreviations'></a>

###Incorrect State Abbreviations
***

In [55]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer


#Explore state abbrv. tags inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$match":{"address.state":{"$exists":1}}},\
                {"$group":{"_id":"$address.city", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ]
    return pipeline


if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

{u'ok': 1.0,
 u'result': [{u'_id': u'bellevue', u'count': 270},
             {u'_id': u'redmond', u'count': 135},
             {u'_id': u'kirkland', u'count': 43},
             {u'_id': None, u'count': 41},
             {u'_id': u'sammamish', u'count': 9},
             {u'_id': u'newcastle', u'count': 5},
             {u'_id': u'issaquah', u'count': 4},
             {u'_id': u'clyde hill', u'count': 1},
             {u'_id': u'seattle', u'count': 1},
             {u'_id': u'mercer island', u'count': 1}]}


####Number of documents with address-State abbrv.: *510*      
####Number of bad or invalid state abbreviations: *3*    
####Example of bad or invalid state abbreviations:    
*[['NE 15th Street'], ['156th Avenue NE'], ['NE 18th Street']]*   

While only a small number of documents that contained addresses included the the state abbreviations I ran a simply audit on these tags in order to insure only two characters and found that once again like the postcodes, street addresses had been entered by accident or confusion into this field. Given that the user is submitting geographic data the state, country, postcode and etc. could easily be inferred by location. OSM could automatically fill out this information when a submission is made.

I suspect that for those user who do include meta-data many naturally assume that the state designation would be obvious and therefore don't include it with the rest of their meta-data. 



<a id='Incorrect City Names'></a>

###Incorrect City Names
***

In [57]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer


#Explore city information inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$match":{"address.city":{"$exists":1}}},\
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 5}
                 ]
    return pipeline

#def aggregate(collection,pipeline):
#    '''use mongodb aggregate method for quering provided database'''
#    results = collection.aggregate(pipeline)
#    return results

if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

{u'ok': 1.0,
 u'result': [{u'_id': u'Glassman_Import', u'count': 19221},
             {u'_id': u'sctrojan79-import', u'count': 5268},
             {u'_id': u'seattlefyi_import', u'count': 2204},
             {u'_id': u'Geodesy99', u'count': 697},
             {u'_id': u'bryceco', u'count': 549}]}


####Number of documents with city names: *30,215*    
####Number of bad or invalid city names: *31*      
####Example of bad or invalid city name:
*[['kirkland,wa'], ['Bellevue, WA'], ['Bellevue, WA']]*

Finally I choose to audit the city names field which was almost as well documented as street and postcodes in documents where any address data was recorded. I found again only a few mistakes present, consisting mainly of including the state abbreviations right after the city name, I found and removed those state abbreviations. This should be an easily solvable problem by implementing some basic auditing procedures at time of entry to prevent anything longer then 2 characters. 

####Top five contributors to city names   
```python
db.osm.aggregate([{"$match":{"address.city":{"$exists":1}}},\
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 5} ]) 
```

####Results:    
1. {city: Glassman_Import, count: 19,221}    
2. {city: sctrojan79-import, count: 5,268}    
3. {city: seattlefyi_import, count: 2,204}     
4. {city: Geodesy99, u'count': 697}    
5. {city: bryceco, u'count': 549}    

> We see a close relationship again between users who provide postcode data providing city name data as well.   



<a id='Overview of The Data'></a>

##3. Overview of The Data
***

<a id='Sizes'></a>

###Sizes
***

In [21]:
%%bash
echo -n "Size of osm file: "; ls -lh *.osm | cut -c 26-33
echo -n 'Size of osm.json file: '; ls -lh *.osm.json | cut -c 26-33

Size of osm file:    141M 
Size of osm.json file:    148M 


####Size of the OSM data file for the Seattle-WA-Eastside-Region US:    *141MB * 

####Size of the OSM.json data file for the Seattle-WA-Eastside-Region US:    *148MB * 

####Size of the Mongo DB for the OSM data:  *453MB*


<a id='Number of Ways and Nodes'></a>

###Number of type Ways and Nodes
***

In [74]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer

#Explore inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$match":{"type":{"$exists":1}}},\
                {"$group":{"_id":"$type", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ]
    return pipeline

if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

[{u'_id': u'node', u'count': 657716},
 {u'_id': u'way', u'count': 67170},
 {u'_id': u'Chevron', u'count': 1},
 {u'_id': u'parking_aisle', u'count': 1},
 {u'_id': u'gas', u'count': 1},
 {u'_id': u'shaft', u'count': 1},
 {u'_id': u'defunct', u'count': 1}]


####Number of documents of type node: *657,718*

####Number of documents of type way: *67,173*   

####Mongo DB Query:
```python
db.osm_data.aggregate([{"$match":{"type":{"$exists":1}}},\
                {"$group":{"_id":"$type", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                 ])
```

<a id='Uniques'></a>

###Uniques values in documents 
***

####Number of unique single-tag-keys: *278*    

####Number of unique multi-tag-keys: *150*     

####Number of Unique Users: *594*   

<a id='Exploring the Data'></a>

###Exploring the Data
***

In [127]:
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# Author Zach Farmer

#Explore inside Mongodb

def get_db(db_name):
    '''establish connection to mongodb with python api'''
    client = MongoClient("mongodb://localhost:27017")
    db = client[db_name]
    return db

def make_pipeline():
    '''create and store queries in a pipeline'''
    pipeline = [
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$group": {"_id": "$count", "num_users":{"$sum": 1}}},\
                {"$match":{"_id":{"$gte":10000}}},\
                {"$group":{"_id":"num_users","count":{"$sum":1}}},\
                {"$sort":{"_id":1}},\
                 ]
    return pipeline

if __name__=="__main__":
    db = get_db("DAproject_2")
    query =  make_pipeline()
    results = db.osm_data.aggregate(query)
    pprint.pprint(results['result'])
                                                

[{u'_id': u'num_users', u'count': 13}]


####Total Number of mongodb documents: *724,891*    

####Top 10 Contributors by count 
```python
db.osm_data.aggregate([
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 10}
                ])
```

####Results:  
    1. {user: Glassman_Import, count: 169,296}   
    2. {user: STBrenden, count: 70,400}     
    3. {user: sctrojan79-import, count: 65,594}    
    4. {user: zephyr, count: 47,942}    
    5. {user: Extramiler, count: 34,213}   
    6. {user: csytsma, count: 31,572}    
    7. {user: Heptazane, count: 30,893}   
    8. {user: seattlefyi_import, count: 23,737}   
    9. {user: Djido, count: 22,105}   
    10. {user: Glassman, count: 19,979}  
    
Top ten contributors (1.6% of all unique users) account for 515,731 of the 724,891 documents (71%). Reinforcing the notation that most of the content and therefore value contributed to OSM is contributed by a very small percentage of all the unique contributing users. 

####Number of users having contributed 1-10 times
```python
db.osm_data.aggregate([
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$group": {"_id": "$count", "num_users":{"$sum": 1}}},\
                {"$sort":{"_id":1}},\
                {"$limit": 10} ])
```

####Results:  
* {number_contributions: 1, num_users: 109}   
* {number_contributions: 2, num_users: 51}  
* {number_contributions: 3, num_users: 32}   
* {number_contributions: 4, num_users: 25}    
* {number_contributions: 5, num_users: 20}    
* {number_contributions: 6, num_users: 10}    
* {number_contributions: 7, num_users: 14}    
* {number_contributions: 8, num_users: 9}    
* {number_contributions: 9, num_users: 9}    
* {number_contributions: 10, num_users: 10}   

Of the 594 unique users 289 ( nearly 50%) of them contributed less then ten documents each. There are 136 users who contributed over 100 documents, 44 who contributed over 1,000 and just 13 to contribute to more then 10,000 documents. Suggesting that the top contributors might not be 'normal' everyday people, but businesses whose core mission could involve the collection of geo-located data. 

####Mongo DB Query for the number of users who contributed over a certain amount(10000,1000,100) of documents    
```python
db.osm_data.aggregate([
                {"$group":{"_id":"$created.user", "count":{"$sum":1}}},\
                {"$group": {"_id": "$count", "num_users":{"$sum": 1}}},\
                {"$match":{"_id":{"$gte":100}}},\ # 1000, 10000, etc
                {"$group":{"_id":"num_users","count":{"$sum":1}}},\
                {"$sort":{"_id":1}},\
                 ])
```

The following are exploratory queries on tags with a fair number of documents containing them. Of the 460 unique tags many of them are so specialized that only a few documents contain them. We will explore some of the more common tags, they will be more informative then tags contained in only a few documents. However it is important to keep in mind that even these tags possess at most 36,922 occurrences which represents only 5% of all the documents. I would mention again that without more thorough and exhaustive meta-data, analysis such as the following have to be taken with a large grain of salt.

####Number of Building tags: *36,922*   

####Most popular buildings reported
```python
db.osm_data.aggregate([
                {"$match":{"building":{"$exists":1}}},\
                {"$group":{"_id":"$building", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 10}
                ]}
```   

####Results, slightly edited for logicalness:    
* {building: house, count: 5,107}    
* {building: apartments, count: 338}    
* {building: residential, count: 269}    
* {building: commercial, count: 240}    
* {building: detached, count: 192}   
* {building: retail, count: 64}  

####Number of Amenity tags: *2,812*      

####Top 10 amenities tagged by users
```python
db.osm_data.aggregate([
                {"$match":{"amenity":{"$exists":1}}},\
                {"$group":{"_id":"$amenity", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 10}
                ])
```

####Results: 
1. {amenity: parking, count: 984}     
2. {amenity: restaurant, count: 328}    
3. {amenity: school, count: 249}    
4. {amenity: bench, count: 116}    
5. {amenity: fast_food, count: 112}    
6. {amenity: cafe, count: 111}    
7. {amenity: bank, count: 91}     
8. {amenity: toilets, count: 86}     
9. {amenity: fuel, count: 69}     
10. {amenity: bicycle_parking, count: 66}   

####Number of leisure tags: *1,355*   

####Top 10 leisure spots
```python
db.osm_data.aggregate([
                {"$match":{"leisure":{"$exists":1}}},\
                {"$group":{"_id":"$leisure", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 10}
                ])
```

####Results:
1. {leisure: pitch, count: 484}    
2. {leisure: park, count: 340}    
3. {leisure: swimming_pool, count: 179}    
4. {leisure: playground, count: 177}    
5. {leisure: sports_centre, count: 33}    
6. {leisure: track, count: 31}     
7. {leisure: garden, count: 24}     
8. {leisure: golf_course, count: 23}    
9. {leisure: slipway, count: 14}    
10. {leisure: picnic_table, count: 10} 

####Number of sport tags: *640*   

####Most Popular Sport
```python
db.osm_data.aggregate([
                {"$match":{"sport":{"$exists":1}}},\
                {"$group":{"_id":"$sport", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 1}
                ])
```
####Result:   
1. {sport: tennis, count: 177}   

####Number of cuisine tags: *261*      

####Top 10 Popular cuisines
```python
db.osm_data.aggregate([
                {"$match":{"cuisine":{"$exists":1}}},\
                {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},\
                {"$sort":{"count": -1}},\
                {"$limit": 10}
                ])
```

####Results:  
1. {cuisine: burger, count: 28}
2. {cuisine: mexican, count: 24}
3. {cuisine: sandwich, count: 22}
4. {cuisine: pizza, count: 22}
5. {cuisine: coffee_shop, count: 21}
6. {cuisine: thai, count: 13}
7. {cuisine: chinese, count: 13}
8. {cuisine: sushi, count: 13}
9. {cuisine: american, count: 12}
10. {cuisine: japanese, count: 11}

<a id='Further Thoughts on the dataset and OSM data collection methods'></a>

##4. Further Thoughts on the dataset and OSM data collection methods
***


After auditing, inserting and reviewing this data set I am left with several thoughts. First, meta-data is often not inserted along with GPS coordinates, there were nearly 725,000 total documents but when reviewing what and how many tags are to be found in these documents the most frequent tags outside of `type` are seen less then 40,000 times. This conclusion was reaffirmed when reviewing the address tags, which suggested that most submitted GPS coordinates do not contain additional tags providing meta-data about the GPS coordinates. If we are looking to use OSM data for more then just GPS directions much more work will be necessary to provide greater depth and value. For the meta-data that I audited a great amount of it was added by several users, likely these documents were inserted in bulk from another geo-locational database.   

Second, I found that at least as it concerns address information that OSM could presumably with little effort automatically fill out zip-code, city name, state, and country information using nothing more then the submitted GPS data and an API call to a government website containing location data, which are free and easily accessible. Additionally offering perhaps a uniform  and semi-standard list of tag keys and a framework or guideline for tag values for users would likely create a much cleaner and more easily searchable database. Finally given that businesses or organizations whose mission revolves around the collection and storage of geo-located data with strong meta-data can contribute the most value to OSM, developing inducements for these types of organization to export their data to OSM seems worthwhile. Crowd sourcing geo-located data is fairly straight forward, but crowd sourcing the meta-data is much messier and clearly not as effective given the low submission rates of additional meta-data. I could be that providing a framework and some guidelines that are simple to fill out might encourage otherwise uninterested contributors to spend just a few moments more to provide meta-data about their GPS coordinates. Incentivizing this behavior would benefit the entire OSM community, making it a more valuable resource.

The OSM dataset has the potential to offer a lot of value to individuals who may otherwise be priced out of the type of information that could be contained in these datasets. If OSM were to explain the potential value of rich meta-data and how it can benefit the users of OSM maps, they might be able to achieve higher rates of meta-data submission. In terms of providing clean standard data, Implementing auditing methods after the fact will work as evidenced by my auditing process above but it would certainly be easier and likely cheaper to simply enforce certain standards and guidelines for inserting data to begin with. Providing some incentive to those users who not only contribute but also adhere to meta-data guidelines could increase the submission of clean meta-data.


In [17]:
# Apply css style to notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()