# OpenStreetMap Data Case Study

## Map Area

Pittsburgh, PA, United States (and greater metropolitan area)

https://mapzen.com/data/metro-extracts/metro/pittsburgh_pennsylvania/

This map is of the Pittsburgh metropolitan area. It includes Pittsburgh and my hometown (a small town about 30 miles north of Pittsburgh).

https://en.wikipedia.org/wiki/Pittsburgh_metropolitan_area

## Problems in the map data

* Inconsistent zip codes (e.g. '15213', '15423-1048', 'PA 15033', 'unknown')

* Inconsistent city names (e.g. 'Pittsburgh', 'Pittsburgh, PA', 'Pittsburg', etc.)


In [4]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint

# Returns a dictionary mapping each value of the given key (the 'k' attribute
# of <tag> elements) to the number of times that value occurs in the file
def count_tag_key_values(filename, key_value):
    data = dict()

    tree = ET.parse(filename)
    root = tree.getroot()

    for element in root.iter():
        tag = element.tag

        if tag == 'tag':
            key = element.attrib['k']

            if key == key_value:
                value = element.attrib['v']

                if value not in data:
                    data[value] = 0

                data[value] += 1

    return data

filename = "../ex_JJ1igoeA8K3YfUHPuiifFYRg8LHik.osm"
#filename = "../pittsburgh.osm"
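
The function above reads the whole file into memory with `ET.parse`, which is fine for a sample but may be heavy for the full extract. A streaming variant could look like the sketch below. It is not part of the original analysis: the name `count_tag_key_values_streaming` is hypothetical, and it assumes the same `<tag k="..." v="..."/>` layout as above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tag_key_values_streaming(source, key_value):
    """Count values of `key_value` in <tag> elements without building the full tree.

    `source` may be a filename or a binary file object.
    """
    counts = Counter()

    # 'end' events fire once an element has been read completely
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag == 'tag' and element.get('k') == key_value:
            counts[element.get('v')] += 1

        # Drop the element's attributes and children so their memory can be
        # reclaimed; the emptied element shells stay attached to the root
        element.clear()

    return dict(counts)
```

It takes the same arguments as `count_tag_key_values`, so it could be swapped in for the full `../pittsburgh.osm` file if memory becomes a concern.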


### Inconsistent zip codes

Most of the zip codes are already in the expected format, i.e. 5-digit numbers. The rest can be validated and corrected programmatically.

A considerable number of zip codes are in the zip code + 4 format, e.g. '26062-4598'. For consistency, I'm going to keep only the 5-digit part for my data analysis. This should make SQL queries that use something like a `GROUP BY` expression quicker and easier to write.

Finally, a small number of zip code values are invalid, e.g. 'California PA, 15419'. To fix these, I'll use a regular expression that tries to extract a valid 5-digit zip code from the value.
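
The three groups described above (plain 5-digit codes, zip code + 4, and everything else) can be separated with a small audit step before fixing anything. This is a minimal sketch; `audit_zip_codes` and the group names are hypothetical and not taken from the original `data.py`.

```python
import re
from collections import defaultdict

FIVE_DIGIT = re.compile(r'\d{5}')         # e.g. '15213'
ZIP_PLUS_4 = re.compile(r'\d{5}-\d{4}')   # e.g. '26062-4598'


def audit_zip_codes(values):
    """Group zip code strings by format so the problem cases can be inspected."""
    groups = defaultdict(list)

    for value in values:
        # fullmatch requires the entire string to fit the pattern
        if FIVE_DIGIT.fullmatch(value):
            groups['five_digit'].append(value)
        elif ZIP_PLUS_4.fullmatch(value):
            groups['zip_plus_4'].append(value)
        else:
            groups['other'].append(value)

    return dict(groups)


sample = ['15213', '26062-4598', 'California PA, 15419', 'unknown']
print(audit_zip_codes(sample))
```

Passing the keys of the dictionary returned by `count_tag_key_values(filename, 'addr:postcode')` would list every value that needs attention.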

In [5]:
pprint.pprint(count_tag_key_values(filename, 'addr:postcode'))

{'15001': 116,
 '15003': 239,
 '15005': 207,
 '15006': 24,
 '15007': 69,
 '15009': 8,
 '15010': 14,
 '15010-4503': 2,
 '15012': 30,
 '15014': 863,
 '15015': 299,
 '15017': 3468,
 '15018': 174,
 '15020': 23,
 '15021': 3,
 '15022': 1,
 '15024': 1860,
 '15025': 4960,
 '15026': 161,
 '15030': 371,
 '15031': 377,
 '15034': 522,
 '15035': 632,
 '15037': 2573,
 '15044': 5355,
 '15045': 5,
 '15050': 1,
 '15056': 4,
 '15057': 176,
 '15061': 2,
 '15063': 3,
 '15066': 2,
 '15068': 17,
 '15071': 1,
 '15074': 2,
 '15076': 1,
 '15083': 1,
 '15084': 5,
 '15085': 1,
 '15086': 1,
 '15088': 89,
 '15090': 52,
 '15091': 1,
 '15095': 1,
 '15101': 21,
 '15102': 14,
 '15104': 1,
 '15106': 12,
 '15108': 23,
 '15110': 1,
 '15112': 421,
 '15116': 4,
 '15120': 19,
 '15122': 6,
 '15126': 2,
 '15129': 5,
 '15132': 24,
 '15133': 10,
 '15136': 6,
 '15137': 54,
 '15139': 40,
 '15142': 1,
 '15143': 66,
 '15145': 25,
 '15146': 35,
 '15147': 28,
 '15147-1423': 1,
 '15148': 1,
 '15201': 29,
 '15202': 163,
 '15203': 513,


In [9]:
# code to fix zips
# snippet from a larger data.py file

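import re

# Exactly five digits, e.g. '15213'
ZIP_FULL = re.compile(r'^\d{5}$')
# First run of five digits anywhere in the value, e.g. 'PA 15033' or '15423-1048'
ZIP = re.compile(r'\d{5}')

def fix_zip(value, zip_full_regex=ZIP_FULL, zip_regex=ZIP):
    # Already a plain 5-digit zip code: keep it as-is
    if zip_full_regex.match(value):
        return value

    # Otherwise extract the first 5-digit run, if there is one
    match = zip_regex.search(value)
    if match:
        return match.group(0)

    # No zip code could be recovered (e.g. 'unknown')
    return False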
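
To see the effect of this cleanup on the counts printed earlier, the same 5-digit extraction can be applied to the whole dictionary so that zip code + 4 entries fold into their 5-digit counterparts. The helper below is a hypothetical sketch, not part of `data.py`; it mirrors the `ZIP` pattern rather than calling `fix_zip`, and the count for 'unknown' in the sample is made up for illustration.

```python
import re
from collections import Counter

FIVE_DIGITS = re.compile(r'\d{5}')  # same idea as the ZIP pattern in fix_zip


def merge_zip_counts(counts):
    """Fold counts of raw zip values into counts per 5-digit zip code.

    Returns (merged, unfixable): merged maps 5-digit codes to summed counts,
    unfixable keeps values that contain no 5-digit run.
    """
    merged = Counter()
    unfixable = {}

    for value, count in counts.items():
        match = FIVE_DIGITS.search(value)
        if match:
            merged[match.group(0)] += count
        else:
            unfixable[value] = count

    return dict(merged), unfixable


raw_counts = {
    '15010': 14,
    '15010-4503': 2,
    '15147': 28,
    '15147-1423': 1,
    'unknown': 1,  # illustrative count, not from the real output
}
merged, unfixable = merge_zip_counts(raw_counts)
print(merged)
print(unfixable)
```

Values that end up in `unfixable` correspond to the cases where `fix_zip` returns `False`.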

### Inconsistent city names

Again, most city (town) names look good. However, there are some obvious problems that stand out.

First, some values carry extra information, such as the state, e.g. 'Pittsburgh, PA' instead of 'Pittsburgh'.

Second, the same city appears under multiple values. For example:
* 'Pittsburgh' and 'Pittsburg' (misspelling)
* 'Moon', 'Moon Township', and 'Moon Townshop' (misspelling, different names)
* 'Cranberry', 'Cranberry Township', and 'Cranberry Twp' (different names, abbreviations)
* 'Leetsdale' and 'Leetsdale ' (extra whitespace)

To fix these, I'll create a mapping of bad city names to the right ones. This is manageable because there are only so many cities or towns in this map data.
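
The mapping in this write-up was put together by reading through the list of names. As a possible aid for spotting misspellings, near-duplicate names can be suggested with the standard library's `difflib`. This is only a sketch under assumptions: `suggest_city_fixes` is a hypothetical helper, the 0.9 cutoff is an arbitrary choice, the sample counts are made up, and every suggestion would still need a manual check.

```python
import difflib


def suggest_city_fixes(counts, cutoff=0.9):
    """Suggest, for each name, a more frequent name that is spelled almost the same."""
    # Most frequent names first: they are assumed to be the correct spellings
    names = sorted(counts, key=counts.get, reverse=True)
    suggestions = {}

    for i, name in enumerate(names):
        # Compare only against names that occur at least as often
        candidates = names[:i]
        match = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
        if match:
            suggestions[name] = match[0]

    return suggestions


# Illustrative counts only, not taken from the real data
sample_counts = {
    'Pittsburgh': 100,
    'Moon Township': 10,
    'Butler': 5,
    'Pittsburg': 2,
    'Moon Townshop': 1,
}
print(suggest_city_fixes(sample_counts))
```

A similarity check like this will not catch differences such as 'Cranberry' versus 'Cranberry Township' or abbreviations like 'Twp', so a hand-written mapping is still needed for those.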


In [6]:
pprint.pprint(count_tag_key_values(filename, 'addr:city'))

{'1936 5th Ave. Pittsburgh, PA 15219': 1,
 'Acme': 1,
 'Aliquippa': 6,
 'Allison Park': 20,
 'Allison Park, PA': 1,
 'Ambridge': 6,
 'Apollo': 1,
 'Arnold': 4,
 'Aspinwall': 2,
 'Baden': 2,
 'Bakerstown': 3,
 'Beallsville': 1,
 'Beaver': 6,
 'Beaver Falls': 3,
 'Belle Vernon': 1,
 'Bellevue': 5,
 'Bethel Park': 10,
 'Brackenridge': 2,
 'Bradford Woods': 3,
 'Bradfordwoods': 2,
 'Bridgeville': 4,
 'Bridgewater': 2,
 'Brownsville': 4,
 'Buffalo Township': 1,
 'Burgettstown': 3,
 'Butler': 1,
 'Butler, PA': 27,
 'Cabot': 3,
 'California': 21,
 'Canonsburg': 2432,
 'Carnegie': 9,
 'Castle Shannon': 1,
 'Cecil': 12,
 'Chester': 2,
 'Cheswick': 5,
 'Churchill': 1,
 'Clairton': 7,
 'Coal Center': 29,
 'Connellsville': 3,
 'Coraopolis': 3,
 'Crabtree': 1,
 'Cranberry': 4,
 'Cranberry Township': 173,
 'Cranberry Twp': 6,
 'Darlington': 1,
 'Donora': 1,
 'Dormont': 2,
 'Downieville': 2,
 'Dravosburg': 47,
 'Duquesne': 1,
 'East Deer Township': 2,
 'East Liverpool': 22,
 'East Palestine': 1,
 'Ed

In [10]:
# code to fix city names
# snippet from a larger data.py file

CITY_NAME_MAPPING = {
    '1936 5th Ave. Pittsburgh, PA 15219': 'Pittsburgh',
    'Allison Park, PA': 'Allison Park',
    'Bradfordwoods': 'Bradford Woods',
    'Butler, PA': 'Butler',
    'Cranberry': 'Cranberry Township',
    'Cranberry Twp': 'Cranberry Township',
    'Evans City, PA': 'Evans City',
    'Leetsdale ': 'Leetsdale',
    'McKees rocks': 'McKees Rocks',
    'Mckees Rocks': 'McKees Rocks',
    'Moon': 'Moon Township',
    'Moon Townshop': 'Moon Township',
    'Mt. Washington': 'Mount Washington',
    'Pittburgh': 'Pittsburgh',
    'Pittsburg': 'Pittsburgh',
    'Pittsburgh, PA': 'Pittsburgh',
    'Renfrew, PA': 'Renfrew',
    'Renfrew,PA': 'Renfrew',
    'Renfrew,pa': 'Renfrew',
    'South Park Township': 'South Park'
}

def fix_city(value, city_name_mapping=CITY_NAME_MAPPING):
    # Return the corrected name if one is mapped; otherwise keep the value unchanged
    return city_name_mapping.get(value, value)
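
Several entries in the mapping differ only by a trailing state suffix or stray whitespace ('Renfrew, PA', 'Renfrew,PA', 'Renfrew,pa', 'Leetsdale '). One option, sketched below as an assumption rather than something the original `data.py` does, is to normalize the value before the lookup. `normalize_city` is a hypothetical helper, and it hard-codes 'PA' as the only state suffix to strip.

```python
import re

# A trailing ', PA' in any letter case, with optional spaces around the comma
STATE_SUFFIX = re.compile(r'\s*,\s*PA\s*$', re.IGNORECASE)


def normalize_city(value):
    """Strip surrounding whitespace and a trailing ', PA' style state suffix."""
    return STATE_SUFFIX.sub('', value.strip())


for raw in ['Pittsburgh, PA', 'Renfrew,pa', 'Leetsdale ', 'Butler']:
    print(repr(raw), '->', repr(normalize_city(raw)))
```

Values such as '1936 5th Ave. Pittsburgh, PA 15219' are left untouched by this step and still need an explicit mapping entry.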