# Project 3: Wrangle OpenStreetMap Data

## Greater Seattle Region

#### Whitney King

This project will wrangle OpenStreetMap data pertaining to the Greater Seattle region in Washington State, USA. Seattle was chosen, as it's my home city, and it will be interesting to work with data that contains locations I'm personally familiar with. The dataset was obtained from a preselected region of OSM data hosted by MapZen:
https://mapzen.com/data/metro-extracts/metro/seattle_washington/

The main objective of this project will be to ensure that pertinent dirty data has been cleaned up and corrected prior to converting the data to JSON. Data wrangling will be done in MongoDB using Python3. We'll begin by importing modules that may come in handy, as well as setting up variables.

### Sample Data

Since the dataset for the Greater Seattle Region is extremely large (1.6 GB), we'll first generate a sample of the data set to assist with preliminary development and analysis. By getting a sample of the data that is less than 100MB, we'll still have a good chunk of the data from the region, but processing time will be significantly less. For the purposes of this stage of the data wrangling, this sample set will work just fine.

In [78]:
# Python Modules
import xml.etree.ElementTree as xET  # cElementTree was removed in Python 3.9
from collections import defaultdict
import csv
import os
import pprint
import re
import codecs
import json
import ast

# OSM Files
OSM_NAME = "seattle_washington.osm"
OSM_FILE = open(OSM_NAME, "rb")

SAMPLE_NAME = "seattle_sample.osm"  # k = 30
SAMPLE_FILE = open(SAMPLE_NAME, "rb")

SMALL_SAMPLE_NAME = "seattle_small_sample.osm"  # k = 900
SMALL_SAMPLE_FILE = open(SMALL_SAMPLE_NAME, "rb")

# Street Types in Addresses
st_types = defaultdict(set)


# Parameter: keep every k-th top-level element
k = 900  # Larger k, smaller sample

def get_element(OSM_NAME, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = xET.iterparse(OSM_NAME, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def create_sample():
    with open(SMALL_SAMPLE_NAME, 'wb') as output:
        output.write(bytes('<?xml version="1.0" encoding="UTF-8"?>\n', encoding="utf-8"))
        output.write(bytes('<osm>\n  ', encoding="utf-8"))

        # Write every k-th top-level element
        for i, element in enumerate(get_element(OSM_NAME)):
            if i % k == 0:
                output.write(xET.tostring(element, encoding='utf-8'))

        output.write(bytes('</osm>', encoding="utf-8"))
        
        print(SMALL_SAMPLE_NAME, 'created:')
        print('File size: ', file_size(SMALL_SAMPLE_NAME))
        
       
    
def convert_bytes(num):
    """
    this function will convert bytes to MB.... GB... etc
    
    Reference:
    http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python
    """
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            size = "%3.1f %s" % (num, x)
            return size
        num /= 1024.0
        
        

def file_size(SMALL_SAMPLE_NAME):
    """
    this function will return the file size
    
    Reference:
    http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python
    """
    file_info = os.stat(SMALL_SAMPLE_NAME)
    size = convert_bytes(file_info.st_size)
    return size

create_sample()

seattle_small_sample.osm created:
File size:  1.8 MB
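
As a rough sanity check on the sample size, keeping every k-th top-level element should shrink the file by roughly a factor of k. The sketch below is back-of-the-envelope arithmetic only. It assumes top-level elements are about the same size on average, and `estimate_sample_mb` is a hypothetical helper that is not part of the notebook's pipeline.

```python
def estimate_sample_mb(full_size_gb, k):
    """Estimate the sample size in MB when every k-th element is kept.

    Assumes all top-level elements are roughly the same size on average.
    """
    return full_size_gb * 1024 / k


# 1.6 GB source file with the two k values used in this notebook
for k in (30, 900):
    print("k = %d: about %.1f MB" % (k, estimate_sample_mb(1.6, k)))
```

With k = 900 the estimate comes out at about 1.8 MB, which agrees with the file size reported above. With k = 30 it is roughly 55 MB, which stays under the 100MB target.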


### Count Element Tags

After we've imported the OSM XML data, we'll want to take a look at how the elements are broken down. Counting the tags that show up in the dataset will allow us to gain an understanding of the size and structure of the data we're working with.

In [5]:
def count_tags(f):
    '''
    Reference:  Udacity
    '''
    tags = {}
    for ev, elem in xET.iterparse(f):
        tag = elem.tag
        if tag not in tags.keys():
            tags[tag] = 1
        else:
            tags[tag] = tags[tag] + 1    
    return tags


def tag_count():
    # Prints the tags found in the OSM XML file, along with their counts
    tags = count_tags(SAMPLE_NAME)
    pprint.pprint(tags)
    
    
#May take in excess of several minutes to run due to the size of the data file
tag_count()

{'member': 3023,
 'nd': 287779,
 'node': 257664,
 'osm': 1,
 'relation': 317,
 'tag': 157875,
 'way': 25509}
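
The same tally can also be sketched with `collections.Counter`, which removes the need for the explicit "is this key already present" check. This is only an alternative illustration, not the code that produced the counts above. The XML fragment and the name `count_tags_counter` are made up for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up OSM fragment, used only to illustrate the counting
SAMPLE_XML = b"""<osm>
  <node id="1" lat="47.6" lon="-122.3">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="47.7" lon="-122.4"/>
  <way id="3">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def count_tags_counter(source):
    """Count how often each element tag occurs in an OSM XML source."""
    counts = Counter()
    for _, elem in ET.iterparse(source):  # default: 'end' events only
        counts[elem.tag] += 1
        elem.clear()  # free the element's contents once it has been counted
    return counts


tag_counts = count_tags_counter(io.BytesIO(SAMPLE_XML))
print(tag_counts.most_common())
```

Calling `elem.clear()` after each element keeps memory use down on large files, which matters for an extract of this size.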


### List Element Tag Names

It's interesting to see the counts of occurrences for each type of element tag in the XML (particularly ```node``` and ```way```, as they contain the most interesting fields); however, these counts only show a top-level list of elements, not the tags/values nested therein. To get a better feel for the information available, we'll list the tag keys nested under each element type and count how often each one occurs.

In [117]:
def tag_values(f):
    tags = {}
    for ev, elem in xET.iterparse(f):
        tag = elem.tag
        if tag not in tags.keys():
            # Create an empty dict of nested tag-key counts for this element type
            tags[tag] = {}
        if elem.tag == 'way' or elem.tag == 'node': # Specify parent tag to filter nested tag list
            for t in elem.iter('tag'):
                # Count each nested tag key under its parent element type
                if t.attrib['k'] in tags[tag].keys():
                    tags[tag][t.attrib['k']] = tags[tag][t.attrib['k']] + 1
                else:
                    tags[tag][t.attrib['k']] = 1
    return tags


def get_tag_values():
    tags = tag_values(SAMPLE_NAME)
    print(tags)
    
    
#May take in excess of several minutes to run due to the size of the data file
get_tag_values()

{'node': {'highway': 1468, 'created_by': 2315, 'name': 1200, 'amenity': 763, 'name:en': 27, 'ref': 259, 'crossing': 189, 'button_operated': 1, 'tiger:tzid': 16, 'source': 6828, 'exit_to': 5, 'exit_to:left': 2, 'exit_to:right': 2, 'noref': 4, 'atm': 9, 'power': 1383, 'odbl': 3, 'railway': 184, 'traffic_signals': 7, 'place': 42, 'wikipedia': 5, 'population': 6, 'census:population': 2, 'leisure': 74, 'bicycle': 30, 'level_crossing': 1, 'traffic_calming': 95, 'stop': 3, 'note': 96, 'landuse': 10, 'man_made': 37, 'noexit': 22, 'bus': 101, 'gtfs:stop_id': 203, 'public_transport': 215, 'junction': 13, 'access': 28, 'barrier': 110, 'direction': 24, 'is_in:state_code': 24, 'website': 103, 'ele': 139, 'is_in': 32, 'gnis:id': 51, 'gnis:Class': 51, 'gnis:County': 51, 'gnis:ST_num': 51, 'import_uuid': 31, 'gnis:ST_alpha': 31, 'gnis:County_num': 51, 'addr:city': 3696, 'addr:state': 89, 'wikidata': 3, 'natural': 192, 'name:sal': 1, 'aeroway': 14, 'ref:left': 1, 'ref:right': 1, 'attribution': 1071, 'c

Investigating the fields that occur most often in nodes and ways gives us a good indication of what data occurs frequently in the XML and would be worth investigating for cleanup and parsing into JSON/CSV. In the way data, there are 6193 total occurrences of ```addr:street```.

The data also contains high numbers of other address elements such as ```:city```, ```:housenumber```, and ```:postcode```. Other fields, such as the ```tiger:``` fields, appear interesting because they look like they contain segments of addresses and directions that could be used for uniformity when cleaning the data, but we'll need to look at the data itself to be sure. This data was imported from the US Census Bureau, but has been heavily edited by users. For the purposes of this audit and import, we will stick to the standard address fields.

>Reference: http://wiki.openstreetmap.org/wiki/TIGER

Additionally, looking at specific attributes under the ```node``` tag can give us an indication of the type of building or business we're looking at, such as there being 274 denoted shops, or 10907 denoted buildings. ```way``` contains 89 denoted shops, so this will be worth investigating.

#### Count Shop and Building Types

In [118]:
ATTRIBUTES = ['shop', 'building']

def count_types(f):
    types = {}
    
    for a in ATTRIBUTES:
        types[a] = {}
        
        for ev, elem in xET.iterparse(f):
            tag = elem.tag
            if elem.tag == 'way':
                for t in elem.iter('tag'):
                    if t.attrib['k'] == a:
                        if t.attrib['v'] not in types[a].keys():
                            types[a][t.attrib['v']] = 1
                        else:
                            types[a][t.attrib['v']] = types[a][t.attrib['v']] + 1
            
    return types

def get_type_count():
    types = count_types(SAMPLE_NAME)
    print(types)
    
#May take in excess of several minutes to run due to the size of the data file
get_type_count()

{'shop': {'outdoor': 2, 'second_hand': 2, 'car_repair': 9, 'craft': 1, 'mall': 2, 'funeral_directors': 1, 'tyres': 1, 'clothes': 2, 'tanning': 1, 'convenience': 12, 'hardware': 2, 'doityourself': 1, 'car': 9, 'supermarket': 7, 'chemist': 1, 'department_store': 2, 'beauty': 3, 'furniture': 2, 'greengrocer': 1, 'scuba_diving': 1, 'pet': 1, 'fabric': 1, 'sports': 1, 'vacant': 2, 'hairdresser': 3, 'deli': 1, 'yes': 3, 'butcher': 1, 'crafts': 1, 'variety_store': 2, 'hobby': 1, 'car_parts': 2, 'gift': 1, 'garden_centre': 2, 'stationery': 2, 'dry_cleaning': 1, 'ticket': 1, 'no': 1}, 'building': {'university': 7, 'yes': 9498, 'commercial': 61, 'school': 37, 'residential': 369, 'house': 610, 'apartments': 92, 'industrial': 13, 'hangar': 5, 'roof': 27, 'office': 4, 'retail': 29, 'dormitory': 2, 'service': 1, 'bunker': 1, 'terrace': 14, 'warehouse': 5, 'detached': 10, 'garages': 7, 'college': 2, 'supermarket': 1, 'public': 4, 'floating_home': 1, 'shed': 16, 'garage': 33, 'church': 6, 'storage': 1

This script gives us a really detailed breakdown of the types of shops and buildings that are in the sample data. Judging by how random these tags are, it's apparent these keywords are neglected by a lot of users, and in many cases aren't consistent when trying to describe similar businesses or buildings (such as supermarket and greengrocer).

Additionally, it seems like most users use the building attribute as a boolean yes/no value, while others use it as a description of the type of building. This type of inconsistent record keeping could really throw off any understanding of how this field should be tracked.
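
To make the yes/no versus descriptive split concrete, here is a minimal sketch that separates the two kinds of values. It uses only a handful of the building counts printed above (that output is truncated, so this is a partial tally). `BOOLEAN_VALUES` and `split_boolean_values` are hypothetical names introduced for the illustration.

```python
# A few of the building counts printed above; the printed output is
# truncated, so this is only a partial tally used for illustration.
building_counts = {
    'yes': 9498,
    'house': 610,
    'residential': 369,
    'apartments': 92,
    'commercial': 61,
}

BOOLEAN_VALUES = {'yes', 'no'}


def split_boolean_values(counts):
    """Separate yes/no style values from descriptive ones."""
    boolean = {v: n for v, n in counts.items() if v in BOOLEAN_VALUES}
    descriptive = {v: n for v, n in counts.items() if v not in BOOLEAN_VALUES}
    return boolean, descriptive


boolean, descriptive = split_boolean_values(building_counts)
total = sum(building_counts.values())
print('boolean-style share: %.1f%%' % (100.0 * sum(boolean.values()) / total))
print('descriptive values:', sorted(descriptive))
```

A split like this would make it easier to decide whether the descriptive values should be kept as a separate field or collapsed into a simple flag.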

### Preview Data

The fields listed so far describe the ways themselves, but they do not include metadata such as the contributing user or the version and timestamp of each entry. Digging further into the shape of this data will also allow us to figure out the metadata that's available to go with this map data.

In [119]:
# Metadata
METADATA = [ 'version', 'changeset', 'timestamp', 'user', 'uid', 'id']

# Street Types in Addresses
st_types = defaultdict(set)


def preview_data(tag):
    i = 0
    s = 20
    n = s + 3  # Number of tags to preview
    pv = {}
    for event, elem in xET.iterparse(SAMPLE_NAME, events=('start',)):
        if elem.tag == tag:
            i += 1
            if i >= s and i<= n:
                pv[tag + str(i)] = {}                
                for a in METADATA:
                    if a in elem.attrib:  # Skip metadata attributes the element doesn't have
                        pv[tag + str(i)][a] = elem.attrib[a]
                        
                pv = get_tag_list(pv, elem, tag, i)
                pv = get_node_list(pv, elem, tag, i)  
                
                if i == n:
                    return pv
                
                
def get_tag_list(pv, elem, tag, i):
    pv[tag + str(i)]['tags'] = {}
    for t in elem.iter('tag'):
        #Iterate through tags, adding new ones to the set for the parent tag
        if t.attrib['k'] not in pv[tag + str(i)]['tags'].keys():
            pv[tag + str(i)]['tags'][t.attrib['k']] = t.get('v')
    return pv


def get_node_list(pv, elem, tag, i):
    pv[tag + str(i)]['nodes'] = set()
    for nd in elem.iter('nd'):
    #Iterate through nodes, adding new ones to the set for the parent tag
        pv[tag + str(i)]['nodes'].add(nd.attrib['ref'])   
    return pv

pprint.pprint(preview_data('way'))       

{'way20': {'changeset': '90945',
           'id': '4736364',
           'nodes': {'30176530',
                     '30178412',
                     '30183676',
                     '30196045',
                     '30197347'},
           'tags': {'created_by': 'JOSM',
                    'from_address_left': '198',
                    'from_address_right': '199',
                    'highway': 'residential',
                    'name': 'Gretchen Way',
                    'name_base': 'Gretchen',
                    'name_type': 'Way',
                    'reviewed': 'no',
                    'separated': 'no',
                    'source': 'tiger_import_20070610',
                    'tiger:cfcc': 'A41',
                    'tiger:tlid': '152178976',
                    'to_address_left': '170',
                    'to_address_right': '171',
                    'zip_left': '98250',
                    'zip_right': '98250'},
           'timestamp': '2007-06-11T11:01:27Z',
           'ui

Here we can see the metadata is now included, and we have a small sample of the shape of the ```way``` data. We can also see that the ```name``` attribute contains street address information. Some way entries don't have all the information about an address, so this is something we'll need to consider when parsing and cleaning the data.

To get an idea of how many people have been involved in the creation of this sample data, we'll take a quick peek at some aggregated user information.

### User Data

In [120]:
# Metadata
METADATA = [ 'version', 'changeset', 'timestamp', 'user', 'uid', 'id']


'''
    Reference:  Udacity
'''

def get_user(elem, users):
    if elem.tag == "node":
        uid = elem.get('user')
        if uid not in users and uid != None:
            users.add(uid)
    return users


def user_contributors(filename):
    users = set()
    for _, elem in xET.iterparse(filename):
        user = get_user(elem, users)
    return users


def get_user_contributors():
    users = user_contributors(SAMPLE_NAME)
    print ('Total Users: ', len(users))
    print(users)
    
get_user_contributors()

Total Users:  1558
{'Torsang', 'keepright! ler', 'Syrn', 'KarlaQat', 'godfd379', 'rando67', 'jinalfoflia', 'dethme0w', 'kona314', 'japerry', 'dankgnu', 'emmdoerr', 'AlexRu', 'Natfoot', 'Whitt-E', 'JJMAR', 'sea duck', 'Dilys', 'jBeata', 'MarcEscape', 'kasims', 'djholman', 'Sappe', 'SydneyCarpenter', 'buckey206', 'sebastic', 'Bigcamper', 'Dampee', 'bal_agates', 'Contre', 'neuhausr', 'bwhill', 'MisterOblivious', 'Darryl Karleen', 'cullanp', 'Wrenling', 'Armin Zimmermann', 'Brian Gant', 'schann16', 'Maxim Velichko', 'aaronracicot', 'MikeGost', 'will simms', 'CartographerC', 'nstarksen', 'Paul Buxton', 'Luciola', 'StephenMangum', 'Jyoti Naik', 'Brian2112', 'Chris Lawrence', 'jamesholio', 'Ropino', 'euxneks', 'MappingJunkie', 'DennisL', 'Paul McCombs_Import', 'IanRoskelley', 'nbolten_import', 'WBSKI', 'cdbreiland', 'jacalata', 'FailMeh', 'The Rev', 'Tagalongs', 'DCD1', 'Constable', 'dkav', 'charles92', 'csytsma', 'PhilNi', 'CalliBrown', 'AlexZolotarev', 'Komяpa', 'petersfreeman', 'Chris Roge

For this sample set, there were 1558 unique contributors. When there are so many cooks in the kitchen, data is bound to have formatting issues, as well as inconsistent taxonomy.
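
Beyond the number of unique contributors, it can help to see how unevenly the edits are spread across them. The sketch below counts elements per user with `collections.Counter`. It is an illustration under stated assumptions: the XML fragment and user names are placeholders, and `edits_per_user` is a hypothetical helper rather than part of the audit above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up fragment; the user names are placeholders, not real contributors
USERS_XML = b"""<osm>
  <node id="1" user="mapper_a" uid="10"/>
  <node id="2" user="mapper_b" uid="11"/>
  <node id="3" user="mapper_a" uid="10"/>
  <way id="4" user="mapper_a" uid="10"/>
</osm>"""


def edits_per_user(source, tags=('node', 'way', 'relation')):
    """Count how many top-level elements each user last edited."""
    counts = Counter()
    for _, elem in ET.iterparse(source):
        user = elem.get('user')
        if elem.tag in tags and user is not None:
            counts[user] += 1
    return counts


user_counts = edits_per_user(io.BytesIO(USERS_XML))
print('Total users:', len(user_counts))
print(user_counts.most_common(1))
```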

### Audit Street Addresses

As we can see from the list of tag elements that contain field data, there is a lot of information that could be generated from this data set. Since we're working with map data, addresses will be one of the most important pieces of information we will wrangle. Part of the process for validating addresses will be checking for uniformity and consistency amongst commonly used street types. 

In [93]:

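# Matches the last whitespace-delimited token of a street name (the street type)
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected_street_types = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]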



'''
    Reference:  Udacity
'''

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street_types:
            street_types[street_type].add(street_name)
            
            
def print_sorted_dict(d):
    keys = d.keys()
    keys = sorted(keys, key=lambda s: s.lower())
    for k in keys:
        v=d[k]
        print('%s: %d' % (k, v))

        
def is_street_name(elem):
    return (elem.attrib['k'] == 'addr:street')



def audit_streets():
    for event, elem in xET.iterparse(SMALL_SAMPLE_FILE, events=('start',)):
        if elem.tag == "node" or elem.tag == 'way':
            for tag in elem.iter('tag'):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    print('Total Ways: ', len(street_types)) #Total Number of Different Types of Streets
    pprint.pprint(dict(street_types))#List of Street Types
    

def is_zipcode(elem):
    return (elem.attrib['k'] == 'addr:postcode')
    
    
def audit_zips():
    zipcodes = {}
    inv_zipcodes = {}
    for event, elem in xET.iterparse(SAMPLE_FILE, events=('start',)):
         if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_zipcode(tag): #and tag.attrib['v'].startswith('98')
                    zipc = tag.attrib['v']
                    if zipc not in zipcodes.keys():
                        zipcodes[zipc] = 1
                    else:
                        zipcodes[zipc] = zipcodes[zipc] + 1
                        
                    if not tag.attrib['v'].startswith('98'):
                        inv_zipc = tag.attrib['v']
                        if inv_zipc not in inv_zipcodes.keys():
                            inv_zipcodes[inv_zipc] = 1
                        else:
                            inv_zipcodes[inv_zipc] = inv_zipcodes[inv_zipc] + 1
                    #else:
                    #    print(tag.attrib['v'])
                    #    zipcodes[tag.attrib['v']] += 1
    print('Total Zipcodes: ', len(zipcodes)) #Total Number of Zipcodes
    print('Invalid Zipcodes: ', len(inv_zipcodes)) #Total Number of Invalid Zipcodes
    pprint.pprint(dict(inv_zipcodes)) #List of Invalid Zipcodes

audit_zips()
audit_streets()

Total Zipcodes:  144
Invalid Zipcodes:  20
{'Olympia, 98501': 1,
 'V8N 3E2': 1,
 'V8P 2P4': 1,
 'V8R 5V6': 1,
 'V8S2J8': 1,
 'V8T 1G1': 1,
 'V8T4K7': 1,
 'V8W 1H6': 1,
 'V8Y 3H1': 1,
 'V8Z6E4': 1,
 'V8Z6E6': 1,
 'V9A 6N7': 1,
 'V9A 7N6': 1,
 'V9B 1H2': 1,
 'V9B 1R6': 1,
 'V9B1L8': 2,
 'V9B1V7': 1,
 'V9Z 1B2': 1,
 'v8Z 1H1': 1,
 'v8r 5e9': 1}
Total Ways:  15
{'East': {'13th Avenue East',
          '15th Avenue East',
          '22nd Avenue East',
          '25th Avenue East',
          'Belmont Place East',
          'Boylston Avenue East',
          'Broadmoor Drive East',
          'Dorffel Drive East',
          'Minor Avenue East',
          'Yale Avenue East'},
 'Fir': {'East Fir'},
 'Highway': {'Patricia Bay Highway'},
 'NE': {'161st Avenue NE'},
 'North': {'1st Avenue North',
           'Ashworth Avenue North',
           'Burke Avenue North',
           'Dayton Avenue North',
           'Dexter Avenue North',
           'East Green Lake Way North',
           'Fremont Avenue Nor

This dataset shows us how many variations there are when it comes to how users enter abbreviations for street names. This is one opportunity we'll have to clean up the information before it's parsed into JSON/CSV. Some of the main problems are:

* Abbreviations for street types and directions are both extremely inconsistent
 * We'll want to decide whether abbreviations are appropriate or full words are better, and then stick with one format.
* Some addresses contain junk
 * Junk data should be scrubbed out if it cannot be corrected.
* Some streets do not contain street types
 * This could be a data entry error or by design, so these are worth looking into individually.
 
Using additional regexes will help automatically identify patterns that may be associated with problems in data.
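
As a small illustration of how the street-type pattern behaves, the sketch below applies the same regex used in the audit to a few of the street names printed above, plus one made-up name (`Pine Street`) that ends in an expected type. `last_token` is a hypothetical helper name.

```python
import re

# Same pattern the notebook uses to pull the last token off a street name
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Square", "Lane", "Road", "Trail", "Parkway", "Commons"}


def last_token(street_name):
    """Return the final whitespace-delimited token, or None if there is none."""
    m = street_type_re.search(street_name)
    return m.group() if m else None


names = ['13th Avenue East', 'Laventure Rd', 'East Fir', 'Pine Street']
for name in names:
    token = last_token(name)
    status = 'ok' if token in expected else 'needs review'
    print('%-18s -> %-7s (%s)' % (name, token, status))
```

Note that the pattern only looks at the final token, so a name like `13th Avenue East` is flagged because of its direction suffix even though its street type is fine.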

When we look at zip codes that might be invalid for Seattle (not beginning with 98-), we can see that the OSM data for the Greater Seattle Region includes parts of British Columbia, Canada. These zip codes begin with V8 or V9. Overall, this is okay since we can update the rule we're using to validate zip codes to include these Canadian zip codes. What we _will_ want to clean up is the zipcode that includes the city of Olympia, as well as ensure that zip codes for BC all contain a space in the middle, and are all caps.
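
One way to express that updated rule is with two regexes, one for US zip codes and one for Canadian postal codes. The sketch below is an assumption-laden illustration: `US_ZIP_RE`, `CA_POSTAL_RE`, and `normalize_postcode` are hypothetical names, the check is on format only (it does not verify the `98`, `V8`, or `V9` prefix), and dropping a ZIP+4 suffix is a choice made for the example.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix (assumed to be dropped)
US_ZIP_RE = re.compile(r'\b(\d{5})(?:-\d{4})?\b')
# Canadian format: letter-digit-letter, optional space, digit-letter-digit
CA_POSTAL_RE = re.compile(r'^([A-Z]\d[A-Z])\s?(\d[A-Z]\d)$')


def normalize_postcode(raw):
    """Return a cleaned postcode, or None if the value matches neither format."""
    value = raw.strip().upper()
    ca = CA_POSTAL_RE.match(value)
    if ca:
        # Canadian codes: uppercase with a single space in the middle
        return ca.group(1) + ' ' + ca.group(2)
    us = US_ZIP_RE.search(value)
    if us:
        # US codes: keep only the five-digit part, wherever it appears
        return us.group(1)
    return None


for raw in ['Olympia, 98501', 'V8S2J8', 'v8r 5e9', '98101']:
    print(repr(raw), '=>', normalize_postcode(raw))
```

Returning `None` for anything that matches neither pattern makes it easy to collect the leftovers for manual review instead of silently passing them through.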

#### Finding Problems via Regex

In [180]:
# Regexes
REGEX_ST_TYPE = re.compile(r'\b\S+\.?$', re.IGNORECASE)
REGEX_LOWER = re.compile(r'^([a-z]|_)*$')
REGEX_LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
REGEX_PROBLEMCHAR = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Street Types in Addresses
st_types = defaultdict(set)
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
street_types = defaultdict(set)

expected_street_types = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

'''
    Reference:  Udacity
'''

def key_type(elem, keys):
    if elem.tag == "tag":
        k = elem.get('k')
        if REGEX_LOWER.search(k):# tags that contain only lowercase letters and are valid
            if 'lower' in keys:
                keys['lower'] += 1
            else:
                keys['lower'] = 1
        elif REGEX_LOWER_COLON.search(k): # valid tags with a colon in their names
            if 'lower_colon' in keys:
                keys['lower_colon'] += 1
            else:
                keys['lower_colon'] = 1
        elif REGEX_PROBLEMCHAR.search(k): # tags with problematic characters
            if 'problemchars' in keys:
                keys['problemchars'] += 1
            else:
                keys['problemchars'] = 1
        else:
            keys['other'] += 1
    return keys



def identify_problem_street_types(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, elem in xET.iterparse(filename):
        keys = key_type(elem, keys)
    return keys



def print_problem_street_types():
    # Classify every tag key in the full OSM file and print the category counts
    keys = identify_problem_street_types(OSM_FILE)
    pprint.pprint(keys)

    
print_problem_street_types()

{'lower': 2306963, 'lower_colon': 2362363, 'other': 81841, 'problemchars': 3}


The counts from this audit show that most tag keys are well-formed: roughly 2.3 million keys are all lowercase, and another 2.4 million are lowercase with a single colon. Only three keys contain problem characters, while 81,841 fall into the ```other``` category (for example, keys with uppercase letters, digits, or more than one colon) and may need a closer look.
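
To see which bucket a given key lands in, the sketch below applies the same three regexes, in the same order, to a few keys. Most of the keys appear in the tag listings earlier in this notebook; `bad key` is a made-up example, and `classify_key` is a hypothetical helper name.

```python
import re

REGEX_LOWER = re.compile(r'^([a-z]|_)*$')
REGEX_LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
REGEX_PROBLEMCHAR = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the audit category for a tag key, checking patterns in order."""
    if REGEX_LOWER.search(key):
        return 'lower'
    if REGEX_LOWER_COLON.search(key):
        return 'lower_colon'
    if REGEX_PROBLEMCHAR.search(key):
        return 'problemchars'
    return 'other'


for key in ['highway', 'addr:street', 'gnis:ST_num', 'tiger:name_base_1', 'bad key']:
    print('%-18s -> %s' % (key, classify_key(key)))
```

Keys such as `gnis:ST_num` end up in `other` simply because they contain uppercase letters, so that category is a mix of unusual but harmless keys rather than a list of errors.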

### Data Clean Up

* Mapped desirable values for street types, and ensured street names matched those values before conversion to JSON
 * Added frequently occurring attributes to the shape of the data
 * Many attributes that have interesting pieces of information are very sparsely populated, so to ensure the shape of the data is consistent, any attributes that didn't have a key/value pair for a certain node were populated with None.
* Regexes were run to validate tag format, and street names were cleaned up to match the format specified in the mapping dictionary
 * Invalid tags were ignored
 * Tags with colons were split out into their own dictionaries by tag type (addr and tiger)
* Zip codes were validated against the ```98xxx``` format, and British Columbia postal codes were normalized to uppercase with a space in the middle
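
The colon-splitting step in the list above can be sketched roughly as follows. This is a simplified illustration, not the notebook's conversion code: `group_tags` is a hypothetical helper, the output dictionaries are named after the raw prefixes, and the key/value pairs are examples only.

```python
def group_tags(pairs, prefixes=('addr', 'tiger')):
    """Split 'prefix:field' keys into one nested dictionary per prefix."""
    grouped = {p: {} for p in prefixes}
    plain = {}
    for key, value in pairs:
        prefix, sep, field = key.partition(':')
        if sep and prefix in grouped and ':' not in field:
            grouped[prefix][field] = value
        elif not sep:
            plain[key] = value
        # Keys with other prefixes or more than one colon are skipped here
    return plain, grouped


# Example key/value pairs, for illustration only
pairs = [
    ('highway', 'residential'),
    ('name', 'Gretchen Way'),
    ('addr:city', 'Seattle'),
    ('addr:street', 'Pine Street'),
    ('tiger:cfcc', 'A41'),
    ('gnis:id', '12345'),
]

plain, grouped = group_tags(pairs)
print(plain)
print(grouped)
```

Grouping by prefix keeps the top level of each document small and makes the address fields easy to query as a unit.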

In [1]:
expected_street_types = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

mapping = { "St": "Street",
            "St.": "Street",
            'AVE': 'Avenue',
            'Ave': 'Avenue',
            'Ave.': 'Avenue',
            'Av.': 'Avenue',
            'ave': 'Avenue',
            'Blvd': 'Boulevard',
            'Blvd.': 'Boulevard',
            'boulevard': 'Boulevard',
            'CT': 'Court',
            'Ct': 'Court',
            'Dr': 'Drive',
            'Dr.': 'Drive',
            'E': 'East',
            'E.Division': 'East Division',
            'FI': 'Fox Drive',
            'Hwy': 'Highway',
            'K10': 'NE 8th Street',
            'MainStreet': 'N Main Street',
            'N': 'North',
            'NE': 'Northeast',
            'NW': 'Northwest',
            'nw': 'Northwest',
            'PL': 'Place',
            'Pl': 'Place',
            'Rd': 'Road',
            'RD': 'Road',
            'Rd.': 'Road',
            'S': 'South',
            'S.': 'South',
            'S.E.': 'Southeast',
            'SE': 'Southeast',
            'ST': 'Street',
            'SW': 'Southwest',
            'SW,': 'Southwest',
            'Se': 'Southeast',
            'southeast': 'Southeast',
            'st': 'Street',
            'street': 'Street',
            'Ter': 'Terrace',
            'W': 'West',
            'west': 'West',
            'WA': '17625 140th Avenue Southeast',
            'WA)': 'US 101',
            'WY': 'Way'
            }

'''
    Reference:  Udacity
'''

def street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street_types:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street") 
            
def print_sorted_dict(d):
    keys = d.keys()
    keys = sorted(keys, key=lambda s: s.lower())
    for k in keys:
        v=d[k]
        print('%s: %d' % (k, v))
        
def audit_streets(osm_file):
    street_types = defaultdict(set)
    for event, elem in xET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

def update_name(name, mapping):
    # Replace the last token of the street name if the mapping has an entry for it
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name

def audit_update_street_types():
    st_types = audit_streets(open(SAMPLE_NAME, "rb"))
    
    print('Total Ways: ', len(st_types)) #Total Number of street types
    
    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name, mapping)
            if name != better_name:
                #Preview updated data
                print (name, "=>", better_name)
            name = better_name
            
def is_zipcode(elem):
    return (elem.attrib['k'] == 'addr:postcode')


def incorrect_zips():
    zipcodes = {}
    for event, elem in xET.iterparse(SAMPLE_FILE, events=('start',)):
         if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_zipcode(tag): #and tag.attrib['v'].startswith('98')
                    zipc = tag.attrib['v']
                    if zipc not in zipcodes.keys():
                        zipcodes[zipc] = 1
                    else:
                        zipcodes[zipc] = zipcodes[zipc] + 1
    return zipcodes
    
def correct_zipcodes():
    # Preview every zip code that update_zip would change
    for zipcode in incorrect_zips():
        better_zipcode = update_zip(zipcode)
        if zipcode != better_zipcode:
            print (zipcode, "=>", better_zipcode)
            
def update_zip(zipcode):
    if zipcode == "Olympia, 98501":
        zipcode = "98501"
    elif zipcode.upper().startswith('V'):
        # British Columbia postal code: uppercase, with a space in the middle
        z1 = zipcode[:3]
        z2 = zipcode[-3:]
        zipcode = (z1 + ' ' + z2).upper()
    return zipcode


print("Corrected Zip Codes: ")
correct_zipcodes()
print('\n')
print("Corrected Street Names: ")
audit_update_street_types()

Corrected Zip Codes: 
V9B1V7 => V9B 1V7
v8r 5e9 => V8R 5E9
Olympia, 98501 => 98501
V9B1L8 => V9B 1L8
v8Z 1H1 => V8Z 1H1
V8Z6E4 => V8Z 6E4
V8Z6E6 => V8Z 6E6
V8S2J8 => V8S 2J8
V8T4K7 => V8T 4K7


Corrected Street Names: 
Total Ways:  61
Carnation-Duvall Rd NE => Carnation-Duvall Rd Northeast
234th Place NE => 234th Place Northeast
Bellevue Way NE => Bellevue Way Northeast
236th Ave NE => 236th Ave Northeast
University Way NE => University Way Northeast
156th Pl NE => 156th Pl Northeast
180th Pl NE => 180th Pl Northeast
127th PL NE => 127th PL Northeast
Sand Point Way NE => Sand Point Way Northeast
155th Place NE => 155th Place Northeast
161st Avenue NE => 161st Avenue Northeast
Limited lane NW => Limited lane Northwest
south 58th street => south 58th Street
Laventure Rd => Laventure Road
Echo Lake Rd => Echo Lake Road
S River Rd => S River Road
112th Ave SE => 112th Ave Southeast
224th St SE => 224th St Southeast
241 LN SE => 241 LN Southeast
230 Lane SE => 230 Lane Southeast
NE 110th St

* Manually mapping this data for the cleanup only corrects a portion of the data, but seems to catch most of the glaring problems with street suffixes and zip codes.
* The corrections are subjective, since they depend on how I chose to enter the key/value pairs in the mapping.
 * As there are other attribute fields available that contain address parts, ideally these catalogued abbreviations would be referenced and used when the system validates street addresses.
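
The preview above also shows that only the last token of each name is rewritten, so `236th Ave NE` becomes `236th Ave Northeast` and the `Ave` stays abbreviated. One possible refinement, sketched here under assumptions, is to apply the mapping to every token. `expand_all_tokens` and `mapping_subset` are hypothetical names, and the subset repeats a few entries from the mapping so the example runs on its own.

```python
# A small subset of the mapping above, repeated so the sketch runs on its own
mapping_subset = {
    'Ave': 'Avenue',
    'Pl': 'Place',
    'PL': 'Place',
    'Rd': 'Road',
    'St': 'Street',
    'NE': 'Northeast',
    'SE': 'Southeast',
    'S': 'South',
}


def expand_all_tokens(name, mapping):
    """Replace every whitespace-separated token that has a mapping entry."""
    return ' '.join(mapping.get(token, token) for token in name.split())


for name in ['236th Ave NE', 'S River Rd', '127th PL NE']:
    print(name, '=>', expand_all_tokens(name, mapping_subset))
```

Expanding every token is more aggressive than the last-token approach: single-letter entries such as `S` or `E` could also match tokens that are part of a proper name, so the results would still need a review pass.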

### Convert Data to JSON

In [6]:
'''
    Reference:  Udacity
'''
tiger = {}

def file_size(SMALL_SAMPLE_NAME):
    """
    this function will return the file size
    
    Reference:
    http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python
    """
    file_info = os.stat(SMALL_SAMPLE_NAME)
    size = convert_bytes(file_info.st_size)
    return size

def convert_bytes(num):
    """
    this function will convert bytes to MB.... GB... etc
    
    Reference:
    http://stackoverflow.com/questions/2104080/how-to-check-file-size-in-python
    """
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            size = "%3.1f %s" % (num, x)
            return size
        num /= 1024.0

def tag_attributes(f):
    attribs = []
    for ev, elem in xET.iterparse(f):
        if elem.tag == 'way' or elem.tag == 'node': # Specify parent tag to filter nested tag list
            for t in elem.iter('tag'):
            #Iterate through tag types, adding new ones to the set for the parent tag
                if t.attrib['k'] not in attribs:
                    if REGEX_LOWER.search(t.attrib['k']): # Only add values for lowercase attributes
                        attribs.append(t.attrib['k'])
    return attribs

#attributes = tag_attributes(SMALL_SAMPLE_NAME)
#pprint.pprint(sorted(attributes))

def shape_element(element):
    node = {
        "id": None, 
        "type": None,
        'name': None,
        'amenity': None,
        'building' : None,
        'shop' : None,
        'cuisine' : None,
        'phone' : None,
        "created": {
            "changeset": None, 
            "user": None, 
            "version": None, 
            "uid": None, 
            "timestamp": None
        },
        "pos": [None, None],
        "refs": [None],
        "address": {
                  "housenumber": None,
                  "postcode": None,
                  "street": None,
                  "state": None,
                  "city": None,
                },
        "tiger": { 
                 'country': None,
                 'name_base': None,
                 'name_base_1': None,
                 'name_base_2': None,
                 'name_base_3': None,
                 'name_direction_prefix': None,
                 'name_direction_prefix_1': None,
                 'name_direction_prefix_2': None,
                 'name_direction_prefix_3': None,
                 'name_direction_suffix': None,
                 'name_direction_suffix_1': None,
                 'name_direction_suffix_2': None,
                 'name_direction_suffix_3': None,
                 'name_type': None,
                 'name_type_1': None,
                 'name_type_2': None,
                 'name_type_3': None,
                 'zip_left': None,
                 'zip_right': None,
                }
        }       
    refs = []

    if element.tag == "node" or element.tag == "way":
        node['id'] = element.attrib['id'] # Get node ID
        node['type'] = element.tag        # Get node type (node or way)
        
        if 'lat' in element.attrib:       # Get node position (lat/lon)
            node['pos'] = [float(element.attrib['lat']),  # Parse coordinates as floats
                           float(element.attrib['lon'])]
            
        for m in METADATA:                # Get 'created' metadata
            if m in element.attrib:
                node['created'][m] = element.attrib[m]
        
        for nd in element.iter('nd'):     # Iterate through node references
            refs.append(nd.attrib['ref'])
        node['refs'] = refs if refs else None  # Nodes have no references
            
        for a in element.iter('tag'):             # Check each child tag for attributes
            k = a.attrib['k']
            if not(REGEX_PROBLEMCHAR.search(k)):  # Filter out tags with invalid characters
                if k.startswith('addr:'):
                    el = k.split(':')
                    if el[1] == 'street':
                        clean_street = update_name(a.attrib['v'], mapping) #Clean Street Types
                        node['address'][el[1]] = clean_street
                    elif el[1] == 'postcode':
                        cleanzip = update_zip(a.attrib['v'])
                        node['address'][el[1]] = cleanzip
                    else:
                        node['address'][el[1]] = a.attrib['v'] # Add other address parts
                elif k.startswith('tiger:'):      # Get TIGER address info
                    el = k.split(':')
                    if el[1] == 'name_type':
                        clean_type = update_name(a.attrib['v'], mapping) #Clean Street Types
                        node['tiger'][el[1]] = clean_type
                    elif el[1] in node['tiger'].keys():
                        node['tiger'][el[1]] = a.attrib['v'] # Add other tiger parts
                else:
                    if k in node.keys():
                        node[k] = a.attrib['v'] # Add key/value for all other attributes found  
                        
                #if node['tiger']['name_type'] != None:
                #    pprint.pprint(node)
        return node
    else:
        return None
    
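def update_zip(zipcode):
    """Normalize known postcode irregularities."""
    if zipcode == "Olympia, 98501":        # City name entered along with the ZIP code
        zipcode = "98501"
    elif zipcode.upper().startswith('V'):  # Canadian (British Columbia) postal code
        z1 = zipcode[:3]
        z2 = zipcode[-3:]
        zipcode = (z1 + ' ' + z2).upper()  # Upper-case, with one space between halves
    return zipcode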

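def update_name(name, mapping):
    """Replace the trailing street type in `name` with its entry in `mapping`, if any."""
    match = street_type_re.search(name)
    if match is None:  # No street type found; leave the name unchanged
        return name
    street_type = match.group()
    if street_type in mapping:
        name = name[:-len(street_type)] + mapping[street_type]
    return name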
    
def process_map(file_in, pretty = False):
    json_file = file_in + ".json"
  
    data = []
    with codecs.open(json_file, "w") as fo:
        for _, element in xET.iterparse(file_in + '.osm'):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    
    #file_info = os.stat(json_file)
    #size = convert_bytes(file_info.st_size)

    return data

def shape_data():
    file_name = SMALL_SAMPLE_NAME.split('.')
    json_file = file_name[0] + ".json"
    
    data = process_map(file_name[0], False)
    
    print(json_file, 'created:')
    print('File size: ', file_size(json_file))
    print ('\n\n\n')
    print('Sample Shape:')    
    pprint.pprint(data[1])
    #pprint.pprint(data[5020:5240])
    pprint.pprint(data[-1])
    
shape_data()

seattle_small_sample.json created:
File size:  8.4 MB




Sample Shape:
{'address': {'city': None,
             'housenumber': None,
             'postcode': None,
             'state': None,
             'street': None},
 'amenity': None,
 'building': None,
 'created': {'changeset': '214775',
             'id': '25840363',
             'timestamp': '2007-02-12T03:19:19Z',
             'uid': '6009',
             'user': 'CoreyBurger',
             'version': '1'},
 'cuisine': None,
 'id': '25840363',
 'name': None,
 'phone': None,
 'pos': [48.4621984, -123.3248062],
 'refs': None,
 'shop': None,
 'tiger': {'country': None,
           'name_base': None,
           'name_base_1': None,
           'name_base_2': None,
           'name_base_3': None,
           'name_direction_prefix': None,
           'name_direction_prefix_1': None,
           'name_direction_prefix_2': None,
           'name_direction_prefix_3': None,
           'name_direction_suffix': None,
           'name_direction

Many of the most interesting attributes are very sparsely populated. To keep the shape of the data consistent, any attribute that didn't have a key/value pair for a given node or way was populated with ```None```.

Adding these extra null-valued keys makes the cleaned and shaped JSON file larger than the original OSM sample file.
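
To get a feel for how much the ```None``` padding adds, the sketch below serializes one record twice: once with only the keys that were actually present, and once padded against a fixed template. The template is a shortened, hypothetical stand-in for the one in ```shape_element```, and the sample record is made up for illustration.

```python
import json

# Shortened stand-in for the template in shape_element (assumption: the real
# template has many more keys, so the real overhead is larger).
TEMPLATE_KEYS = ["id", "type", "name", "amenity", "building", "shop",
                 "cuisine", "phone"]

def pad_record(record, keys=TEMPLATE_KEYS):
    """Return a copy of `record` with every template key present (None if missing)."""
    return {key: record.get(key) for key in keys}

# Made-up record for illustration only
sparse = {"id": "1", "type": "node"}
padded = pad_record(sparse)

sparse_size = len(json.dumps(sparse))
padded_size = len(json.dumps(padded))
print("sparse:", sparse_size, "bytes; padded:", padded_size, "bytes")
```

Every padded key costs its name plus ```null``` in each document, which is why the overhead grows with both the template size and the number of elements.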

Additionally, while the data was being staged for the JSON file, regexes were run to validate tag keys, and street names were cleaned up to match the format specified in the mapping dictionary.
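
As a rough illustration of those two checks, here is a minimal, standalone sketch. The notebook defines ```REGEX_PROBLEMCHAR```, ```street_type_re``` and ```mapping``` elsewhere, so the patterns and the small mapping below are assumed stand-ins, and ```is_valid_key``` and ```clean_street``` are hypothetical helper names.

```python
import re

# Assumed patterns and mapping; the notebook's own definitions may differ.
PROBLEM_CHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
STREET_TYPE = re.compile(r'\b\S+\.?$')
EXAMPLE_MAPPING = {"St": "Street", "St.": "Street", "Ave": "Avenue"}

def is_valid_key(key):
    """True when a tag key contains none of the problem characters."""
    return PROBLEM_CHARS.search(key) is None

def clean_street(name, mapping=EXAMPLE_MAPPING):
    """Expand an abbreviated trailing street type; leave other names alone."""
    match = STREET_TYPE.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name

print(is_valid_key("addr:street"), is_valid_key("bad key"))
print(clean_street("Pine St"), "|", clean_street("Broadway"))
```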

## Exploring the Data in MongoDB

All of the exploration that was done in the initial phases is made much simpler once the data is cleaned up and imported into MongoDB. Additionally, MongoDB will enable us to take a much more in-depth look at the information with a lot less code.

We'll look at some of the previously scraped data points to see how they compare when queried via pymongo now that we've normalized the shape of the most interesting pieces of data.

### Inserting the Data

In [84]:
'''
Reference:  Udacity
'''
from pymongo import MongoClient

file_name = SAMPLE_NAME.split('.')

data = process_map(file_name[0], False)
client = MongoClient()
db = client.SeattleOSM
collection = db.Sample
collection.insert_many(data)

<pymongo.results.InsertManyResult at 0x201693d88b8>

In [64]:
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'SeattleOSM'), 'Sample')
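
Note that ```insert_many``` appends to whatever is already in the collection, so running the insert cell a second time would store a second copy of every document and inflate each count. (The document counts reported in the statistics below are all multiples of four, which would be consistent with repeated runs, although the notebook does not confirm this.) A minimal sketch of a guard, assuming it is acceptable to discard the collection's existing contents; ```reload_collection``` is a hypothetical helper, not part of pymongo:

```python
def reload_collection(collection, documents):
    """Empty `collection`, insert `documents`, and return the new document count.

    `collection` is expected to behave like a pymongo Collection; only
    drop, insert_many and count_documents are used.
    """
    collection.drop()                  # Remove documents left by earlier runs
    collection.insert_many(documents)  # Load a single fresh copy
    return collection.count_documents({})
```

In this notebook it would be called as ```reload_collection(db.Sample, data)``` in place of the bare ```insert_many``` call.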

### Statistics

##### File Size

In [70]:
json_file = file_name[0] + ".json"

# Original XML sample file size
print('XML File: ', os.path.getsize(SAMPLE_NAME)/1024/1024, 'MB')

# JSON file size
print('JSON File: ', os.path.getsize(json_file)/1024/1024, 'MB')

XML File:  55.321757316589355 MB
JSON File:  246.0293664932251 MB


This simple command takes something previously coded out in its own function and brings it down to one line that gives us the size of the files we're working with in MB.

These sizes align with the earlier observations, but the code is much more concise.

##### Count Nodes and Ways

In [99]:
print('Number of nodes:', collection.count_documents({"type":"node"}))
print('Number of ways:  ', collection.count_documents({"type":"way"}))
print('Total entries:  ', collection.count_documents({}))

Number of nodes: 1030656
Number of ways:   102036
Total entries:   1132692


##### Unique Contributors

In [81]:
# Number of unique users
print('Unique Contributors: ', len(db.Sample.distinct("created.uid")))

Unique Contributors:  1677


##### Top 10 Contributors

In [94]:
pl = [{"$group":{"_id": "$created.user",
                 "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 10}]
result = list(collection.aggregate(pl))
print('Top 10 User Contributors: ')
pprint.pprint(result)

Top 10 User Contributors: 
[{'_id': 'Glassman', 'count': 171028},
 {'_id': 'SeattleImport', 'count': 98436},
 {'_id': 'tylerritchie', 'count': 87804},
 {'_id': 'woodpeck_fixbot', 'count': 78332},
 {'_id': 'alester', 'count': 48320},
 {'_id': 'Omnific', 'count': 36588},
 {'_id': 'Glassman_Import', 'count': 30096},
 {'_id': 'STBrenden', 'count': 28616},
 {'_id': 'CarniLvr79', 'count': 28420},
 {'_id': 'Brad Meteor', 'count': 23244}]
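
For readers less familiar with the aggregation framework, the ```$group``` / ```$sort``` / ```$limit``` pipeline above does roughly what ```collections.Counter``` does in plain Python. The sketch below runs on a few made-up documents rather than on the collection, and ```top_contributors``` is a hypothetical helper; the order of tied groups may differ from MongoDB's.

```python
from collections import Counter

def top_contributors(documents, limit=10):
    """Count documents per created.user and return the `limit` largest groups."""
    counts = Counter(doc["created"]["user"] for doc in documents)
    return [{"_id": user, "count": n} for user, n in counts.most_common(limit)]

# Made-up documents for illustration only
docs = [{"created": {"user": "a"}},
        {"created": {"user": "b"}},
        {"created": {"user": "a"}}]
print(top_contributors(docs, limit=1))
```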


##### Count Non-Null Attributes

In [109]:
fields = ['id',
     'name',
     'amenity',
     'building',
     'shop',
     'phone',
     'pos']

for field in fields: # Iterate through basic fields to count non-null occurrences of each
    print(field + ':', db.Sample.count_documents({field:{"$ne": None}}))

id: 1132692
name: 29400
amenity: 5492
building: 43672
shop: 1452
phone: 484
pos: 1030656


##### Types of Buildings

In [111]:
pl = [{"$group":{"_id": "$building",
                 "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}]
result = list(collection.aggregate(pl))
print('Counts of Building Types: ')
pprint.pprint(result)

Counts of Building Types: 
[{'_id': None, 'count': 1089020},
 {'_id': 'yes', 'count': 38008},
 {'_id': 'house', 'count': 2448},
 {'_id': 'residential', 'count': 1476},
 {'_id': 'apartments', 'count': 368},
 {'_id': 'commercial', 'count': 244},
 {'_id': 'school', 'count': 152},
 {'_id': 'garage', 'count': 132},
 {'_id': 'retail', 'count': 116},
 {'_id': 'roof', 'count': 112},
 {'_id': 'mobile_home', 'count': 104},
 {'_id': 'shed', 'count': 64},
 {'_id': 'terrace', 'count': 56},
 {'_id': 'industrial', 'count': 52},
 {'_id': 'carport', 'count': 40},
 {'_id': 'detached', 'count': 40},
 {'_id': 'university', 'count': 28},
 {'_id': 'garages', 'count': 28},
 {'_id': 'church', 'count': 24},
 {'_id': 'warehouse', 'count': 20},
 {'_id': 'hangar', 'count': 20},
 {'_id': 'public', 'count': 16},
 {'_id': 'office', 'count': 16},
 {'_id': 'static_caravan', 'count': 12},
 {'_id': 'dormitory', 'count': 12},
 {'_id': 'greenhouse', 'count': 12},
 {'_id': 'college', 'count': 8},
 {'_id': 'cabin', 'count':

##### Types of Shops

In [113]:
pl = [{"$group":{"_id": "$shop",
                 "count": {"$sum": 1}}},
            {"$sort": {"count": -1}}]
result = list(collection.aggregate(pl))
print('Counts of Shop Types: ')
pprint.pprint(result)

Counts of Shop Types: 
[{'_id': None, 'count': 1131240},
 {'_id': 'convenience', 'count': 212},
 {'_id': 'car_repair', 'count': 116},
 {'_id': 'hairdresser', 'count': 84},
 {'_id': 'beauty', 'count': 80},
 {'_id': 'clothes', 'count': 80},
 {'_id': 'supermarket', 'count': 72},
 {'_id': 'car', 'count': 64},
 {'_id': 'yes', 'count': 52},
 {'_id': 'mobile_phone', 'count': 44},
 {'_id': 'pet', 'count': 36},
 {'_id': 'dry_cleaning', 'count': 28},
 {'_id': 'furniture', 'count': 28},
 {'_id': 'car_parts', 'count': 24},
 {'_id': 'garden_centre', 'count': 20},
 {'_id': 'tobacco', 'count': 16},
 {'_id': 'massage', 'count': 16},
 {'_id': 'bicycle', 'count': 16},
 {'_id': 'department_store', 'count': 16},
 {'_id': 'hardware', 'count': 16},
 {'_id': 'greengrocer', 'count': 12},
 {'_id': 'confectionery', 'count': 12},
 {'_id': 'shoes', 'count': 12},
 {'_id': 'tattoo', 'count': 12},
 {'_id': 'vacant', 'count': 12},
 {'_id': 'bakery', 'count': 12},
 {'_id': 'antiques', 'count': 12},
 {'_id': 'outdoor',

##### Popular Cuisines

In [116]:
pl = [{"$match": {"amenity":"restaurant", 
                  "cuisine": {"$ne":None}}}, 
            {"$group":{"_id":"$cuisine", 
                       "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":10}]
result = list(collection.aggregate(pl))
print('Most Popular Types of Food: ')
pprint.pprint(result)

Most Popular Types of Food: 
[{'_id': 'pizza', 'count': 48},
 {'_id': 'mexican', 'count': 44},
 {'_id': 'chinese', 'count': 24},
 {'_id': 'japanese', 'count': 20},
 {'_id': 'asian', 'count': 20},
 {'_id': 'italian', 'count': 16},
 {'_id': 'burger', 'count': 16},
 {'_id': 'american', 'count': 16},
 {'_id': 'indian', 'count': 16},
 {'_id': 'thai', 'count': 12}]


### Problems and Challenges with Dataset

* The investigation done on this dataset is based on a limited understanding of the overall structure behind the OpenStreetMap data.
* To avoid making assumptions about the meaning or connections between obscure information, this audit chose to focus on cleanup of data with obvious meanings, such as address and user information.
* Many tags are inconsistent and appear to be subjectively entered by users, so getting them uniform would require a large amount of investigatory work.
* Data points show up under more than one tag, so it's possible for some counts to have duplicate entries.

### Additional Suggestions

If I were wrangling this data for a more structured database, I would take advantage of the parted-out address data contained within the ```tiger``` fields and use it to cross-check and correct the ```addr``` information for each node/way.
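
A minimal sketch of what such a cross-check might look like on the shaped documents. It assumes the TIGER parts combine in the order prefix, base, type, suffix, and that a document carries both ```tiger``` and ```address``` values; ```tiger_street``` and ```street_mismatch``` are hypothetical helpers, and the real data may need looser matching.

```python
def tiger_street(tiger):
    """Join the TIGER name parts into a full street name, skipping empty parts."""
    parts = [tiger.get("name_direction_prefix"),
             tiger.get("name_base"),
             tiger.get("name_type"),
             tiger.get("name_direction_suffix")]
    return " ".join(p for p in parts if p)

def street_mismatch(doc):
    """Return (addr_street, tiger_street) when both exist and differ, else None."""
    addr = (doc.get("address") or {}).get("street")
    tiger = tiger_street(doc.get("tiger") or {})
    if addr and tiger and addr.lower() != tiger.lower():
        return addr, tiger
    return None

# Made-up document for illustration only
doc = {"address": {"street": "Pine St"},
       "tiger": {"name_base": "Pine", "name_type": "Street"}}
print(street_mismatch(doc))
```

Documents flagged this way would still need a manual look, since either source could be the one that is wrong.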

Additionally, I would include all attributes for each item being transferred to JSON. However, this requires keeping track of a lot of additional data, so for the purposes of this audit, only the most interesting fields were included in the transfer to MongoDB.
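
One way to keep every attribute without maintaining a large template is to store only the tags that are actually present. The sketch below is an assumption about how that could look, not the approach used in this audit; ```shape_all_tags``` is a hypothetical helper and the XML element is made up.

```python
import xml.etree.ElementTree as ET

def shape_all_tags(element):
    """Return a sparse document holding every usable tag on `element`."""
    doc = {"id": element.attrib.get("id"), "type": element.tag, "tags": {}}
    for tag in element.iter("tag"):
        key = tag.attrib["k"]
        # Skip keys that are awkward as MongoDB field names ('.' or a leading '$')
        if "." in key or key.startswith("$"):
            continue
        doc["tags"][key] = tag.attrib["v"]
    return doc

# Made-up element for illustration only
xml = '<node id="1"><tag k="amenity" v="cafe"/><tag k="addr:city" v="Seattle"/></node>'
print(shape_all_tags(ET.fromstring(xml)))
```

With documents shaped this way, a filter such as ```{"tags.amenity": {"$exists": True}}``` would take the place of the ```{"$ne": None}``` checks used above.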

Finally, there is a lot of inconsistency in how attributes are used and assigned, so a more regulated and well-documented system detailing how users are expected to enter descriptive data would make the data much easier to study.