# Data Wrangling with OpenStreetMap (OSM)

## 1. Choose a city
- OSM covers cities all over the world.
- I could choose a city in Asia, but much of the information there is written in local languages.
- I am going to examine **Seattle, WA, USA**, since it is a suitable place to start wrangling OSM data.
- After finishing this one, I will move on to other cities such as Seoul in South Korea or Tokyo in Japan.

![map_region_seattle](map_region_seattle.png)

## 1-1. Map Information
- I am going to download the map data (.osm) from MapZen. The website provides prepared data files for popular cities.
  - (https://goo.gl/kXjffY for Seattle WA, USA)
- The downloaded OSM file is compressed.
- I need to uncompress the file first.
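
The uncompress step can also be scripted. Below is a minimal sketch, assuming the download is a bzip2 archive; the helper name `decompress_bz2` and the archive file name in the usage comment are assumptions, not taken from the download page.

```python
import bz2
import shutil

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file so the whole archive never sits in memory."""
    with bz2.BZ2File(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Hypothetical usage (the archive name is an assumption):
# decompress_bz2("seattle_washington.osm.bz2", "seattle_washington.osm")
```

Copying in chunks matters here because the uncompressed file is larger than 1GB.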

## 2. Extract sample data from the original
- Because the uncompressed file is too large (> 1GB), it is hard to test code against it.
- To test the functions in my code, I need a sample file extracted from the original.
- The functions below will do the trick.

In [19]:
import os
import xml.etree.ElementTree as ET

In [20]:
OSM_FILE = "seattle_washington.osm"  

# Sample generation related
SAMPLE_FILE = "seattle_washington_sample_500.osm"

In [21]:
# Parameter: take every k-th top level element
k = 500

In [22]:
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top-level element whose tag is in `tags`."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()  # drop already-processed elements to limit memory use

def generate_sample():
    with open(SAMPLE_FILE, 'wb') as output:
        # The file is opened in binary mode, so write bytes literals
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b'<osm>\n  ')

        # Write every k-th top-level element
        for i, element in enumerate(get_element(OSM_FILE)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding='utf-8'))

        output.write(b'</osm>')

In [23]:
# generate sample file
if not os.path.exists(SAMPLE_FILE):
    generate_sample()

Running the code above produces the sample file "seattle_washington_sample_500.osm", where 500 is the value of k.
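
To check that the sample holds roughly 1/k of the original elements, the top-level elements can be counted in a streaming fashion. This is a minimal sketch: `count_elements` is a hypothetical helper, and the small in-memory document only stands in for a real .osm file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(osm_source, tags=('node', 'way', 'relation')):
    """Count top-level OSM elements by tag while streaming through the file."""
    counts = Counter()
    for _, elem in ET.iterparse(osm_source, events=('end',)):
        if elem.tag in tags:
            counts[elem.tag] += 1
            elem.clear()  # drop attributes and children once counted
    return counts

# Tiny in-memory document standing in for a real .osm file
demo = io.BytesIO(b'<osm><node id="1"/><node id="2"/>'
                  b'<way id="3"><nd ref="1"/><nd ref="2"/></way></osm>')
counts = count_elements(demo)
# counts['node'] is 2 and counts['way'] is 1
```

Running it on both OSM_FILE and SAMPLE_FILE and comparing the totals would show whether the ratio is close to k.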

## 3. OSM's XML Structure

- A full description of OSM's XML format is available here (http://wiki.openstreetmap.org/wiki/OSM_XML)
- The data examined here consists of four kinds of elements: node, way, tag, and nd. (OSM also defines relation elements, which are not audited here.)
- A node describes a single point with a latitude and a longitude.
- A way describes an ordered connection of nodes.
- A tag gives additional information, as a key/value pair, for node and way elements.
- An nd is a child of a way element that references a node by its id.
- A sample structure is shown below.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
    <bounds minlat="54.0889580" minlon="12.2487570" maxlat="54.0913900" maxlon="12.2524800"/>
     <node id="298884269" lat="54.0901746" lon="12.2482632" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
     <node id="261728686" lat="54.0906309" lon="12.2441924" user="PikoWinter" uid="36744" visible="true" version="1" changeset="323878" timestamp="2008-05-03T13:39:23Z"/>
     <node id="1831881213" version="1" changeset="12370172" lat="54.0900666" lon="12.2539381" user="lafkor" uid="75625" visible="true" timestamp="2012-07-20T09:43:19Z">
         <tag k="name" v="Neu Broderstorf"/>
         <tag k="traffic_sign" v="city_limit"/>
     </node>
     ...
     <node id="298884272" lat="54.0901447" lon="12.2516513" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
     <way id="26659127" user="Masch" uid="55988" visible="true" version="5" changeset="4142606" timestamp="2010-03-16T11:47:08Z">
          <nd ref="292403538"/>
          <nd ref="298884289"/>
          ...
          <nd ref="261728686"/>
          <tag k="highway" v="unclassified"/>
          <tag k="name" v="Pastower Straße"/>
      </way>
</osm>
```
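
To make the relationship between these elements concrete, here is a minimal sketch that parses a shortened snippet. The ids and coordinates are simplified placeholders, not values from the Seattle data.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up snippet with the same shape as the sample above
SNIPPET = """<osm version="0.6">
  <node id="1" lat="54.09" lon="12.25">
    <tag k="name" v="Neu Broderstorf"/>
  </node>
  <node id="2" lat="54.10" lon="12.26"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="unclassified"/>
  </way>
</osm>"""

root = ET.fromstring(SNIPPET)

# Each node carries its own coordinates plus optional tag children
node_tags = {n.get('id'): {t.get('k'): t.get('v') for t in n.findall('tag')}
             for n in root.findall('node')}

# A way lists the ids of the nodes it connects, in order
way_refs = {w.get('id'): [nd.get('ref') for nd in w.findall('nd')]
            for w in root.findall('way')}
# node_tags is {'1': {'name': 'Neu Broderstorf'}, '2': {}}
# way_refs is {'10': ['1', '2']}
```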

## 4. Audit elements

## 4-1. Auditing node and way elements
- A node element has the attributes...
  - id, lat, lon, user, uid, visible, version, changeset, timestamp

- A way element has the attributes...
  - id, user, uid, visible, version, changeset, timestamp

- These attributes contain little human-edited information.
- id, user, uid, visible, version, changeset, and timestamp are all machine-generated.
- I am not going to audit these attributes for now.
- However, if I find any problems, I will revise this post later.

## 4-2. Auditing tag element
- The tag element gives additional information to node and way elements.
- This is mostly where user contributions come in, so there may be mistakes or inconsistencies.
- Since there are too many tags available, I will choose a few of them to audit for now.
  - (you can find the entire tag set here: http://wiki.openstreetmap.org/wiki/Map_Features)
- I think the best way to audit tag elements is to
  - list all values that occur in the current data
  - correct as many inconsistencies as possible
  - go back to the first step whenever I run into a new issue while running a program
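
One way to decide which tags to audit is to tally how often each key occurs. The following is a minimal sketch under that assumption; `count_tag_keys` is a hypothetical helper and the in-memory document stands in for the sample file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tag_keys(osm_source):
    """Tally the k attribute of every tag under node and way elements."""
    keys = Counter()
    for _, elem in ET.iterparse(osm_source, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                keys[tag.attrib['k']] += 1
            elem.clear()  # release the element once its tags are counted
    return keys

# Made-up document standing in for the sample file
demo = io.BytesIO(b'<osm>'
                  b'<node id="1"><tag k="phone" v="206-220-4240"/></node>'
                  b'<way id="2"><tag k="maxspeed" v="25 mph"/>'
                  b'<tag k="phone" v="+1 206-633-3411"/></way>'
                  b'</osm>')
key_counts = count_tag_keys(demo)
# key_counts.most_common(1) gives [('phone', 2)]
```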

## 4-2-1. tag element where k=[maxspeed|minspeed]
- Since mph is the standard speed unit in the US, values that do not specify it have to be fixed.
- First, I am going to look up what values there are.

In [25]:
def audit_speed(filename):
    speed_types = []

    # Use "end" events: an element's children are only guaranteed to be parsed by then
    for event, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                key = tag.attrib['k']
                if key == 'maxspeed' or key == 'minspeed':
                    speed_types.append(tag.attrib['v'])

    return speed_types

In [28]:
print(audit_speed(SAMPLE_FILE))

['40 mph', '25 mph', '25 mph', '25 mph', '25 mph', '35 mph', '25 mph', '25 mph', '25 mph', '25 mph', '30 mph', '100', '50', '15 mph', '50', '40 mph', '35 mph', '55 mph', '35 mph', '30 mph', '40 mph', '25 mph', '30 mph', '50', '25 mph', '25 mph', '45 mph', '35 mph', '25 mph', '25 mph', '25 mph', '15 mph', '10 mph', '35 mph', '50 mph', '25 mph', '30 mph', '30 mph', '35 mph', '55 mph', '30', '25 mph', '35 mph', '30 mph', '35 mph', '45 mph', '30 mph', '40 mph', '40 mph', '35 mph', '40 mph', '35 mph', '25 mph', '50', '45 mph', '30 mph']


- As you can see, some values are missing the speed unit.
- When the data grows, for example by combining data from other countries, it becomes important to state which unit a measurement uses.
- I am going to write a helper function that adds 'mph' to the values missing it.

In [29]:
def update_speed_unit(value):
    # Leave values that already carry the unit untouched
    if 'mph' in value:
        return value
    return '{} mph'.format(value)

In [32]:
print(update_speed_unit('20 mph'))
print(update_speed_unit('20'))

20 mph
20 mph
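
A slightly stricter variant is sketched below, assuming that only bare numbers should receive the unit and that anything else is better left for manual review. The name `update_speed_unit_strict` and the non-numeric example value are hypothetical.

```python
import re

# Matches values that are only digits, e.g. '30' but not '30 mph' or 'signals'
BARE_NUMBER_RE = re.compile(r'^\d+$')

def update_speed_unit_strict(value):
    """Append ' mph' only to bare numbers; return everything else unchanged."""
    value = value.strip()
    if BARE_NUMBER_RE.match(value):
        return '{} mph'.format(value)
    return value

# update_speed_unit_strict('30') returns '30 mph'
# update_speed_unit_strict('25 mph') returns '25 mph'
# update_speed_unit_strict('signals') returns 'signals'
```

The design choice is to avoid producing values like 'signals mph' if a non-numeric value ever shows up in the full data.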


## 4-2-2. tag element where k=[phone]
- Phone numbers can be written in many different ways.
- Some people include the national code, while others put parentheses around the local code.
- It is better if all values are stored in the same form.

In [97]:
import re
from collections import defaultdict

- First, I am going to remove all special characters, including '+', '-', '(', and ')', as well as whitespace, since some people use them and others don't.
- Then, I will look at the lengths of the remaining phone numbers.
- For normal phone numbers, the length should be 10 or 11; it is 11 when the national code is included.

In [98]:
def audit_phone(filename):
    phone_len = defaultdict(list)

    # Use "end" events: an element's children are only guaranteed to be parsed by then
    for event, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                key = tag.attrib['k']
                if key == 'phone':
                    # Group the raw values by their length after stripping separators
                    phone_num = re.sub(r'[\+\(\)\-\s]', '', tag.attrib['v'])
                    phone_len[len(phone_num)].append(tag.attrib['v'])

    return phone_len

In [99]:
phone_audit_data = audit_phone(SAMPLE_FILE)

In [100]:
phone_audit_data

defaultdict(list,
            {4: ['+1-253-'],
             10: ['206-220-4240', '206-524-7951', '(425) 917-1417'],
             11: ['+1 206-633-3411',
              '+1-206-547-1961',
              '+1 206 448-8677',
              '+1-425-497-8868',
              '+1 206-467-9200',
              '+1 206 659-4043']})

- As you can see, there are 3 different lengths of phone numbers.
- The length 4 is abnormal: '+1-253-' is not a complete number, so I should get rid of it.
- As expected, numbers of length 10 don't include the national code, while those of length 11 do.

- I am going to define a function that converts the inconsistently written phone numbers to a uniform shape.
  - The shape is going to be '+NationalCode LocalCode-3Digits-4Digits'.
- For abnormal cases, I will just return None so that I can ignore them later.

In [101]:
def update_phone(phone_num):
    # Strip '+', parentheses, hyphens, and whitespace
    phone_num = re.sub(r'[\+\(\)\-\s]', '', phone_num)

    # Anything that is not 10 or 11 characters long is treated as abnormal
    if len(phone_num) not in (10, 11):
        return None

    # Prepend the US national code when it is missing
    if len(phone_num) == 10:
        phone_num = '1{}'.format(phone_num)

    return '+{} {}-{}-{}'.format(phone_num[:1], phone_num[1:4], phone_num[4:7], phone_num[7:])

In [105]:
updated_phone_nums = []
for length, phone_nums in phone_audit_data.items():
    for phone_num in phone_nums:
        updated = update_phone(phone_num)

        # update_phone returns None for abnormal numbers; skip those
        if updated is not None:
            updated_phone_nums.append(updated)

print(updated_phone_nums)

['+1 206-220-4240', '+1 206-524-7951', '+1 425-917-1417', '+1 206-633-3411', '+1 206-547-1961', '+1 206-448-8677', '+1 425-497-8868', '+1 206-467-9200', '+1 206-659-4043']
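
A quick way to verify the result is to test each value against the target shape. This is a minimal sketch, assuming the national code is always 1 (US); `NORMALIZED_PHONE_RE` and `is_normalized` are hypothetical names.

```python
import re

# Target shape from this section: +<national code> <local code>-<3 digits>-<4 digits>
NORMALIZED_PHONE_RE = re.compile(r'^\+1 \d{3}-\d{3}-\d{4}$')

def is_normalized(phone_num):
    """Return True when phone_num already has the '+1 AAA-BBB-CCCC' shape."""
    return bool(NORMALIZED_PHONE_RE.match(phone_num))

samples = ['+1 206-220-4240', '(425) 917-1417', '+1-253-']
not_normalized = [p for p in samples if not is_normalized(p)]
# not_normalized is ['(425) 917-1417', '+1-253-']
```

Applied to updated_phone_nums, the same check should leave nothing in the "not normalized" list.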


## 4-2-3. tag element where k=[addr:street]
- Street names use many different street types.
- They are often abbreviated, which sometimes makes them hard to read.
- I am going to look at which street types are used and which ones make the data inconsistent.

In [106]:
STREET_TYPES_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [107]:
def audit_street_name(filename):
    street_types = defaultdict(set)

    # Use "end" events: an element's children are only guaranteed to be parsed by then
    for event, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                key = tag.attrib['k']
                if key == "addr:street":
                    street_name = tag.attrib['v']
                    # STREET_TYPES_RE captures the last word of the street name
                    match = STREET_TYPES_RE.search(street_name)
                    if match:
                        street_type = match.group()
                        street_types[street_type].add(street_name)

    return street_types

In [109]:
audit_street_name(SAMPLE_FILE).keys()

['Northeast',
 'Court',
 'South',
 'West',
 'Boulevard',
 'Northwest',
 'Way',
 'East',
 'Highway',
 'Southwest',
 'North',
 'Southeast',
 'Road',
 'Spur',
 'NW',
 'Loop',
 'Lane',
 'N.',
 'Drive',
 'Place',
 '104',
 'Point',
 'WY',
 'SW',
 'Street',
 'Crescent',
 'Avenue']

- The values listed above are the street types that appear as the last word of the full street names.
- OK. Now I am going to look through the full street names for each of them.

In [117]:
#audit_street_name(SAMPLE_FILE)

- Here are some of the abbreviations used in the sample data and the full names they map to.
  - NW: Northwest
  - N.: North
  - WY: Way
  - SW: Southwest
  
- Here are some of the commonly used abbreviations
  - NE: Northeast
  - SE: Southeast
  - S.: South
  - St/St.: Street
  - Rd/Rd.: Road
  - Ave: Avenue

- In order to update those abbreviated street names to their full names, I need to create a mapping table.

In [111]:
STREET_TYPE_MAPPINGS = { "St"    :  "Street",
                         "St."   :  "Street",
                         "Rd"    :  "Road",
                         "Rd."   :  "Road",
                         "Ave"   :  "Avenue",
                         "SW"    :  "Southwest",
                         "NW"    :  "Northwest",
                         "SE"    :  "Southeast",
                         "NE"    :  "Northeast",
                         "S."    :  "South",
                         "N."    :  "North",
                         "WY"    :  "Way"}

- Now I am going to write a function to update the street names.

In [112]:
STREET_TYPES_RE = re.compile(r'\b\S+\.?$', re.IGNORECASE)

In [115]:
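def update_name(name):
    m = STREET_TYPES_RE.search(name)

    if m:
        street_type = m.group()

        if street_type in STREET_TYPE_MAPPINGS:
            # Replace only the trailing street type; using it as a regex pattern
            # would also match elsewhere in the name and treat '.' as a wildcard
            name = name[:m.start()] + STREET_TYPE_MAPPINGS[street_type]

    # Names without a match or without a mapping are returned unchanged
    return name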

In [116]:
updated_street_name = []

for street_name in audit_street_name(SAMPLE_FILE).keys():
    updated_street_name.append(update_name(street_name))
    
updated_street_name

['Northeast',
 'Court',
 'South',
 'West',
 'Boulevard',
 'Northwest',
 'Way',
 'East',
 'Highway',
 'Southwest',
 'North',
 'Southeast',
 'Road',
 'Spur',
 'Northwest',
 'Loop',
 'Lane',
 'North',
 'Drive',
 'Place',
 '104',
 'Point',
 'Way',
 'Southwest',
 'Street',
 'Crescent',
 'Avenue']
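
The same idea applies to full street names, not just to the street types. The sketch below is self-contained, so it repeats a subset of the mapping table; the street names and the helper name `expand_last_token` are made up for illustration.

```python
import re

STREET_TYPE_RE = re.compile(r'\b\S+\.?$')

# A subset of the mapping table, repeated so the sketch is self-contained
MAPPINGS = {"St": "Street", "St.": "Street", "Ave": "Avenue", "NW": "Northwest"}

def expand_last_token(name):
    """Expand the final token of a street name if it is a known abbreviation."""
    m = STREET_TYPE_RE.search(name)
    if m and m.group() in MAPPINGS:
        # Slice at the match position so only the trailing token changes
        return name[:m.start()] + MAPPINGS[m.group()]
    return name

examples = ["Pine St", "Stone Way", "8th Ave NW", "Main St."]
expanded = [expand_last_token(n) for n in examples]
# expanded is ['Pine Street', 'Stone Way', '8th Ave Northwest', 'Main Street']
```

Note that only the last token is expanded, so an abbreviation in the middle of a name (the "Ave" in "8th Ave NW") stays as it is with this approach.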

## 5. Reorganize OSM into CSV (prepare for SQL)