# Wrangle OpenStreetMap Data

## 1 Introduction

In this report, I will wrangle the OpenStreetMap data of Manhattan, New York, United States.

First, I will audit the dataset to find any problems that need to be fixed. Next, I will use SQL queries to obtain an overview of the dataset. Last, I will suggest some ideas for further improving and analyzing the dataset.

### Map Area

New York (Manhattan), New York, United States

I obtained [a custom extract](Manhattan_NewYork_US.osm.bz2) covering the Manhattan borough of New York City through Mapzen. I chose this area because I lived in New York City for several years and enjoyed strolling along its streets. I would like to see whether the OpenStreetMap data reveals any interesting facts about the city I love.

In [1]:
OSM_FILE = 'Manhattan_NewYork_US.osm'

## 2 Auditing and Problems Encountered in the Map

In [2]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from collections import defaultdict
import re
import pprint

### 2.1 Map Extracts Included Surrounding Areas and Inconsistent Zip Codes

**Examples: "10001", "10001-2062", "NY 11106", "New York, NY 10065"**

Because of the way the data extract is generated, areas surrounding Manhattan are also included in this dataset. I suspect that it includes parts of other New York City boroughs and parts of New Jersey. To confirm this, I will look at the distribution of zip codes in the dataset.

Because the zip code formats are inconsistent, I will fix them before aggregating the zip codes. I will use an `update_zip_code` function to convert each value to a 5-digit zip code (e.g. "10001") so that queries are more consistent. If more than one zip code is listed for an address, I will keep only the first one.
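To make the target format concrete, here is a minimal sketch of the intended mapping for the example values above. `first_zip` is a hypothetical stand-in, not the function used in this report: it takes the first run of five digits with a single regex search, and the `'10001;10002'` value is a made-up example of an address that lists two zip codes.

```python
import re

def first_zip(raw):
    """Return the first run of five digits in `raw`, or None if there is none."""
    match = re.search(r'\d{5}', raw)
    return match.group(0) if match else None

# The four formats listed above, plus a made-up multi-value entry
examples = ['10001', '10001-2062', 'NY 11106', 'New York, NY 10065', '10001;10002']
cleaned = {raw: first_zip(raw) for raw in examples}

for raw, zip_code in cleaned.items():
    print('{!r:>24} -> {}'.format(raw, zip_code))
```

Unlike the function used in this report, this sketch returns `None` when no five-digit run is present, so unusable values are easy to spot.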

In [None]:
# ================================================== #
#      Helper Functions for Auditing Zip Codes       #
# ================================================== #
def is_zip_code(elem):
    return (elem.attrib['k'] == 'addr:postcode')

def audit_zip_codes(zip_code_formats, zip_codes_distribution, zip_code):
    '''Audit zip codes
    
    This function updates two dictionaries showing the distribution of zip code formats
    and the distribution of zip code areas.
    
    Args:
    zip_code_formats: A dictionary of zip code format: counts of zip code in that format
    zip_codes_distribution: A dictionary of zip code area name: counts of zip codes in that area
    zip_code: A zip code
    '''
    
    # Audit zip code formats
    # Convert any digit to an 'X' sign (e.g. 'NY 10001' becomes 'NY XXXXX')
    zip_code_format = re.sub(r'\d', 'X', zip_code)
    zip_code_formats[zip_code_format] += 1
    
    # Audit zip code areas
    # Convert zip code to its corresponding area name
    zip_code = re.sub(r'\D', '', zip_code) # Only look at zip code digits
    if re.match(r'^10[0-2]', zip_code): # Manhattan: 100XX, 101XX, 102XX
        zip_codes_distribution['Manhattan'] += 1
    elif re.match(r'^104', zip_code): # Bronx: 104XX
        zip_codes_distribution['Bronx'] += 1
    elif re.match(r'^112', zip_code): # Brooklyn: 112XX
        zip_codes_distribution['Brooklyn'] += 1
    elif re.match(r'^103', zip_code): # Staten Island: 103XX
        zip_codes_distribution['Staten Island'] += 1
    elif re.match(r'^11', zip_code): # Queens: remaining 11XXX (also matches Long Island prefixes such as 115XX)
        zip_codes_distribution['Queens'] += 1
    elif re.match(r'^07', zip_code): # New Jersey: 07XXX
        zip_codes_distribution['New Jersey'] += 1
    else:
        zip_codes_distribution['Other'] += 1
        
# ================================================== #
#         Functions for Updating Zip Codes           #
# ================================================== #
def update_zip_code(zip_code):
    '''Update zip code format to five digits only

    This function is used to correct inconsistent zip code formats during XML to CSV conversion.

    Args:
    zip_code: A raw zip code from the dataset

    Returns:
    zip_code: An updated zip code consisting of 5 digits
    '''
    # Update zip code format to five digits "XXXXX"
    if ';' in zip_code:
        zip_code = zip_code.split(';')[0] # Keep the first zip code for 'XXXXX;XXXXX' format
    digits = re.sub(r'\D', '', zip_code)
    if len(digits) == 5:
        zip_code = digits # 'XXXXX' stays the same
    elif len(digits) == 9:
        zip_code = digits[:5] # 'XXXXX-XXXX' only keeps the first 5 digits
    # Any other digit count is left unchanged
    return zip_code
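
The helpers above still need something to feed them zip code values. The following is a minimal sketch of one way to do that, assuming the values live in `<tag k="addr:postcode" v="..."/>` children of `node` and `way` elements. `iter_zip_codes` and the inline `SAMPLE` document are hypothetical and only for illustration; they are not part of the original notebook.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

def iter_zip_codes(source):
    """Yield raw addr:postcode values found under node and way elements.

    `source` is a file path or a binary file-like object (hypothetical helper).
    """
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                if tag.get('k') == 'addr:postcode':
                    yield tag.get('v')
            elem.clear()  # Drop processed children to limit memory use

# Tiny made-up sample standing in for the real OSM extract
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:postcode" v="10001"/></node>
  <node id="2"><tag k="addr:postcode" v="10001-2062"/></node>
  <way id="3"><nd ref="1"/><tag k="addr:postcode" v="NY 11106"/><tag k="name" v="x"/></way>
  <node id="4"><tag k="amenity" v="cafe"/></node>
</osm>"""

raw_zip_codes = list(iter_zip_codes(io.BytesIO(SAMPLE)))

# Same digit-masking idea as in audit_zip_codes: every digit becomes 'X'
format_counts = Counter(re.sub(r'\d', 'X', z) for z in raw_zip_codes)

print(raw_zip_codes)
print(format_counts)
```

On the real data, the same generator could be called as `iter_zip_codes(OSM_FILE)`, with each value passed to `audit_zip_codes` together with two `defaultdict(int)` dictionaries, or to `update_zip_code` during conversion.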
