# Data Wrangling with OpenStreetMap (OSM)

## 1. Choose a city
- OSM covers cities all over the world.
- I could choose a city in Asia, but much of the information there is written in the local language.
- I am going to examine **Seattle, WA, USA**, since it is a suitable place to start wrangling OSM data.
- Once I am done with this one, I will move on to other cities such as Seoul in South Korea or Tokyo in Japan.

![map_region_seattle](map_region_seattle.png)

## 1-1. Map Information
- I am going to download the map data (.osm) from Mapzen, which provides prepared data files for popular cities.
  - (https://goo.gl/kXjffY for Seattle WA, USA)
- The OSM file is compressed when downloaded, so I need to uncompress it first.
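
A minimal sketch of the uncompress step, assuming the download is a bzip2 archive (the format is an assumption, and the file names in the usage comment are placeholders). It streams the data in chunks so the large uncompressed result never has to fit in memory.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path, one chunk at a time."""
    with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Hypothetical file names; adjust them to match the actual download.
# decompress_bz2("seattle_washington.osm.bz2", "seattle_washington.osm")
```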

##  2. Extract sample data from original
- Because the uncompressed file is too large (> 1GB), it is hard to test code against it.
- To test the functions in my code, I need a smaller sample file extracted from the original.
- The function below will do the trick.

```python
import xml.etree.ElementTree as ET

OSM_FILE = "seattle_washington.osm"

# Sample generation related
SAMPLE_FILE = "seattle_washington_sample.osm"

# Parameter: take every k-th top level element
k = 1000


def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield each top level element whose tag name is in `tags`."""
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            # Free the elements parsed so far to keep memory use low
            root.clear()


def generate_sample():
    # The file is opened in binary mode, so every write must be bytes
    with open(SAMPLE_FILE, 'wb') as output:
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b'<osm>\n  ')

        # Write every k-th top level element
        for i, element in enumerate(get_element(OSM_FILE)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding='utf-8'))

        output.write(b'</osm>')
```

Calling `generate_sample()` produces a sample data file named "seattle_washington_sample.osm".
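
One way to sanity-check the sample is to count its elements by tag name. The helper below is only a sketch; `count_tags` is a name chosen here for illustration and is not part of the original code.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(osm_path):
    """Return a Counter mapping each element tag name to its number of occurrences."""
    counts = Counter()
    # With the default events, iterparse reports each element once, when it ends.
    for _, elem in ET.iterparse(osm_path):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and attributes
    return counts


# Example (file name taken from the sampling step):
# print(count_tags("seattle_washington_sample.osm"))
```

If the sampling worked, the counts for node, way, and relation together should be roughly the original total divided by `k`.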

## 3. OSM's XML Structure

- You can find the full description of OSM's XML format here (http://wiki.openstreetmap.org/wiki/OSM_XML)
- The elements examined here are of four kinds: node, way, tag, and nd. (OSM XML also has a relation element, which groups other elements.)
- A node describes a single point, located by its latitude and longitude.
- A way describes an ordered connection between nodes, such as a road.
- A tag gives additional information, as a key/value pair, for node and way elements.
- An nd is a child of a way element that references a node by its id.
- A sample structure is shown below.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
    <bounds minlat="54.0889580" minlon="12.2487570" maxlat="54.0913900" maxlon="12.2524800"/>
     <node id="298884269" lat="54.0901746" lon="12.2482632" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
     <node id="261728686" lat="54.0906309" lon="12.2441924" user="PikoWinter" uid="36744" visible="true" version="1" changeset="323878" timestamp="2008-05-03T13:39:23Z"/>
     <node id="1831881213" version="1" changeset="12370172" lat="54.0900666" lon="12.2539381" user="lafkor" uid="75625" visible="true" timestamp="2012-07-20T09:43:19Z">
         <tag k="name" v="Neu Broderstorf"/>
         <tag k="traffic_sign" v="city_limit"/>
     </node>
     ...
     <node id="298884272" lat="54.0901447" lon="12.2516513" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
     <way id="26659127" user="Masch" uid="55988" visible="true" version="5" changeset="4142606" timestamp="2010-03-16T11:47:08Z">
          <nd ref="292403538"/>
          <nd ref="298884289"/>
          ...
          <nd ref="261728686"/>
          <tag k="highway" v="unclassified"/>
          <tag k="name" v="Pastower Straße"/>
      </way>
</osm>
```
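
To make these relationships concrete, the sketch below parses a trimmed copy of the sample above and shows how `tag` children attach to a node and how `nd` children reference node ids. The XML string is shortened for illustration: most attributes and elements are omitted.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<osm version="0.6">
  <node id="261728686" lat="54.0906309" lon="12.2441924"/>
  <node id="1831881213" lat="54.0900666" lon="12.2539381">
    <tag k="name" v="Neu Broderstorf"/>
    <tag k="traffic_sign" v="city_limit"/>
  </node>
  <way id="26659127">
    <nd ref="292403538"/>
    <nd ref="261728686"/>
    <tag k="highway" v="unclassified"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE_XML)

# Each node carries its position as attributes and its metadata as <tag> children.
for node in root.findall('node'):
    tags = {t.get('k'): t.get('v') for t in node.findall('tag')}
    print(node.get('id'), node.get('lat'), node.get('lon'), tags)

# Each way lists the ids of the nodes it connects, in order, as <nd ref="..."/>.
for way in root.findall('way'):
    refs = [nd.get('ref') for nd in way.findall('nd')]
    tags = {t.get('k'): t.get('v') for t in way.findall('tag')}
    print(way.get('id'), refs, tags)
```

Note that all attribute values come back as strings, so numeric fields such as `lat` and `lon` need an explicit conversion before use.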

## 4. Audit elements

## 4-1. Auditing node and way elements
- node element includes attributes...
  - id, lat, lon, user, uid, visible, version, changeset, timestamp
  
- way element includes attributes...
  - id, user, uid, visible, version, changeset, timestamp
  
- Their attributes contain little human-edited information.
- id, user, uid, visible, version, changeset, and timestamp are all machine-generated data.
- I am not going to audit these elements for now.
- If I find problems in them later, I will revise this post.

## 4-2. Auditing tag element
- The tag element gives additional information to node and way elements.
- This is mostly where users' contributions come in, so there could be mistakes or inconsistencies.
- Since there are too many tags available, I will choose a few of them to audit for now.
  - (you can find the entire tag set here: http://wiki.openstreetmap.org/wiki/Map_Features)
- I think the best way to audit tag elements is to
  - list all values that appear in the current data
  - correct inconsistencies as far as possible for now
  - then, if I encounter other unknown issues while running a program, go back to the first step
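
The first step, listing every value seen for a chosen key, could look like the sketch below. `collect_tag_values` and the keys in the usage comment are illustrative assumptions, not the final audit code.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def collect_tag_values(osm_path, keys):
    """Count the values of <tag> elements whose k attribute is in `keys`.

    Only tags directly under node or way elements are considered.
    Returns a dict mapping each key to a Counter of its values.
    """
    values = defaultdict(Counter)
    for _, elem in ET.iterparse(osm_path, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.findall('tag'):
                if tag.get('k') in keys:
                    values[tag.get('k')][tag.get('v')] += 1
            elem.clear()  # release the processed element's children
    return values


# Example usage (file name from the sampling step; keys chosen for illustration):
# found = collect_tag_values("seattle_washington_sample.osm", {"maxspeed", "phone"})
# for key, counter in found.items():
#     print(key, counter.most_common(10))
```

Sorting the values by frequency tends to surface the dominant format first, which makes the rarer, inconsistent variants easier to spot.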

##  4-2-1.  tag element where k=[maxspeed|minspeed]

##  4-2-2.  tag element where k=[country]

##  4-2-3.  tag element where k=[phone]

##  4-2-4.  tag element where k=[addr:street]