# Wrangle OpenStreetMap Data

In this project, I process and explore the map of the metropolitan area where I live, using MongoDB.
I downloaded the XML map data from [Mapzen](https://mapzen.com/data/metro-extracts/).
I live in Taipei, Taiwan.
The greater Taipei area map data is available for download on Mapzen.

![](taipei_taiwan.png)

The uncompressed OSM XML file is 165 MB.
The map shows that the data covers the whole of Taipei City, most districts of New Taipei City, and small parts of both Keelung City and Taoyuan City.
I am really excited to wrangle this dataset.

Steps:

1. Define the data model.<br>
   To import XML map data into MongoDB, I need to convert the dataset from XML to JSON format.
   The data model is how I map an XML data element to a JSON object.
1. Take a sample from the full dataset using *sample.py*.<br>
   By changing the parameter `k` in *sample.py*, I can create sample datasets of different sizes.
1. Audit the sample data.<br>
   Find problems in the sample data, and create a data cleaning plan.
1. Clean the sample data.<br>
   Execute the data cleaning plan, and check whether the problems are fixed.
1. Repeat steps 2 to 4 with a larger sample.
1. Apply the process to the full dataset.<br>
   Process the map data and export the result to a JSON file.
1. Import the data to MongoDB.<br>
   Use the `mongoimport` command line tool to import a JSON file into MongoDB.
1. Explore the map data using PyMongo.
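
The sampling in step 2 can be sketched roughly as follows.
This is only a minimal illustration, not the actual contents of *sample.py*: the function names `iter_top_level` and `take_sample` are hypothetical, and I assume that `k` means "keep every k-th top-level element".

```python
import io
import xml.etree.ElementTree as ET

TOP_LEVEL_TAGS = ("node", "way", "relation")


def iter_top_level(osm_source, tags=TOP_LEVEL_TAGS):
    """Yield each top-level OSM element without loading the whole file."""
    context = ET.iterparse(osm_source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # free elements that were already handed out


def take_sample(osm_source, k):
    """Return every k-th top-level element, serialized as an XML string."""
    return [
        ET.tostring(elem, encoding="unicode")
        for i, elem in enumerate(iter_top_level(osm_source))
        if i % k == 0
    ]


# Tiny in-memory example; a real run would pass the path of the OSM file.
xml = (b"<osm><node id='1'/><node id='2'/><way id='3'/>"
       b"<node id='4'/><relation id='5'/></osm>")
sample = take_sample(io.BytesIO(xml), 2)
print(len(sample))
```

A larger `k` gives a smaller sample, so I can start with a big `k` and lower it as the cleaning plan stabilizes.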

## Data Model

There are three kinds of [data element](http://wiki.openstreetmap.org/wiki/Elements) in OpenStreetMap.
They share these common attributes:

- **id**
- **user**
- **uid**
- **timestamp**
- **visible**
- **version**
- **changeset**

Each kind of element also has its own data:

- **node** (defining points in space)
  - **lat** attribute (latitude)
  - **lon** attribute (longitude)
- **way** (defining linear features and area boundaries)
  - **nd** child element (node)
- **relation** (sometimes used to explain how other elements work together)
  - **member** child element (could be a node, way, or relation)

Every element can also have associated tags as child elements.
A tag consists of a key and a value.

I define the data model like this:

```
{
    "type": str,
    "id": str,
    "edited": {
        "user": str,
        "uid": str,
        "timestamp": str,
        "version": str,
        "changeset": str
    },
    "coordinate": (float, float),  # only for node elements
    "node_refs": [str],  # only for way elements
    "members": [
        {
            "type": str,
            "ref": str,
            "role": str
        }
    ],  # only for relation elements
    "tags": {
        key: value
    }  # key and value are both str type
}
```

I put the editing-related attributes in the `edited` field.
I also discard the `visible` attribute because it is only useful in data generated by time-based queries.
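
The mapping from an XML element to this data model could look roughly like the sketch below.
The function name `shape_element` is hypothetical (it is not necessarily what `osm.py` uses), and I assume the coordinate is stored as `[lat, lon]`, since JSON has lists rather than tuples.

```python
import xml.etree.ElementTree as ET

# Attributes that go into the "edited" sub-document; "visible" is left out.
EDITED_KEYS = ("user", "uid", "timestamp", "version", "changeset")


def shape_element(elem):
    """Convert one node, way, or relation element to a dictionary."""
    doc = {
        "type": elem.tag,
        "id": elem.get("id"),
        "edited": {key: elem.get(key) for key in EDITED_KEYS
                   if elem.get(key) is not None},
    }
    if elem.tag == "node":
        # Assumed order: latitude first, then longitude.
        doc["coordinate"] = [float(elem.get("lat")), float(elem.get("lon"))]
    elif elem.tag == "way":
        doc["node_refs"] = [nd.get("ref") for nd in elem.iter("nd")]
    elif elem.tag == "relation":
        doc["members"] = [
            {"type": m.get("type"), "ref": m.get("ref"), "role": m.get("role")}
            for m in elem.iter("member")
        ]
    tags = {tag.get("k"): tag.get("v") for tag in elem.iter("tag")}
    if tags:
        doc["tags"] = tags
    return doc


node = ET.fromstring(
    '<node id="1" lat="25.03" lon="121.56" user="a" uid="9" timestamp="t" '
    'version="2" changeset="7" visible="true">'
    '<tag k="amenity" v="bicycle_rental"/></node>'
)
way = ET.fromstring('<way id="2"><nd ref="1"/><nd ref="5"/></way>')
node_doc = shape_element(node)
way_doc = shape_element(way)
print(node_doc)
```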

## Audit Data

In the "Case Study: OpenStreetMap Data" lesson, I learned how to audit and clean the street type in a street name.
But I expect local OSM contributors in Taiwan to use traditional Chinese, for example 信義路 (Xinyi Road).
The common street types in traditional Chinese characters are 路 (road), 街 (street), and 大道 (avenue).

By auditing the sample dataset in XML format, I found other street types.
Those street types are 巷, 弄, and 段.
Both 巷 (alley) and 弄 (lane) refer to narrow streets.
The character 段 means a section of a road.
Depending on local government regulations, an avenue or street can also be divided into multiple sections.
I also confirmed that most address tag values (98.9% to 99.6% of the sample) are in traditional Chinese characters.
So I add 段, 巷, and 弄 to the expected street types.
The `audit_osm` function in `osm.py` contains the code to audit OSM XML data.
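
As an illustration of the idea, the street type check can be sketched like this.
It is not the actual `audit_osm` code: the helper name `audit_street_names` is hypothetical, and I assume the audit only looks at how a street name ends.

```python
import re
from collections import defaultdict

# Expected street types; 大道 comes first only for readability.
EXPECTED_STREET_TYPES = ("大道", "路", "街", "段", "巷", "弄")
STREET_TYPE_RE = re.compile(r"(?:%s)$" % "|".join(EXPECTED_STREET_TYPES))


def audit_street_names(names):
    """Group names that do not end with an expected type by last character."""
    unexpected = defaultdict(set)
    for name in names:
        name = name.strip()
        if name and not STREET_TYPE_RE.search(name):
            unexpected[name[-1]].add(name)
    return dict(unexpected)


names = ["信義路五段", "迪化街", "市民大道", "中山北路二段10號", "Xinyi Road"]
problems = audit_street_names(names)
for ending, examples in problems.items():
    print(ending, sorted(examples))
```

Grouping by the last character makes it easy to spot recurring problems, such as values that end with 號 because a house number was typed into the street tag.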

## Problems

By repeatedly auditing the sample data using the `audit` function I wrote in `osm.py`, I encountered several problems:

- City name, district name, house number, or floor in street names
- Confusing street names
- Inconsistent city names
- Very rare, but unexpected English addresses

The characters 號 (number) and 樓 (floor) were found in some street names.
Some problematic values contain a city name such as 台北市 (Taipei City) or 新北市 (New Taipei City).
I also found district names, like 大同區 (Datong District), in a few street names.

There is also a strange street name: 試院路(雙號).
Its English translation is "Shiyuan Road (even no.)".
This is an actual street name, and there is another street named "Shiyuan Road (odd no.)" in my city.
I could not find out why these street names exist.
So 試院路(雙號) is not an error, but it is a confusing name.

![](shiyuan_road_odd.png)![](shiyuan_road_even.png)

Another confusing street name is 淡水老街內.
It is not a real street name.
The phrase means "in the old street of Tamsui".
In Taiwan, people call a historic commercial district 老街 (old street).
So this name only tells me that the location is in an area, not on a specific street.

I audited the city and district names as well.
Though most city names end with 市 (city), a few of them do not.
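
A check like the one below is enough to surface the inconsistent city names.
It is a minimal sketch with a hypothetical function name, assuming the city values have already been collected from the `addr:city` tags.

```python
from collections import Counter


def audit_city_names(values):
    """Count city values and keep only those that do not end with 市."""
    counts = Counter(value.strip() for value in values)
    return {name: n for name, n in counts.items() if not name.endswith("市")}


cities = ["台北市", "台北市", "台北", "Taipei", "新北市"]
odd_cities = audit_city_names(cities)
print(odd_cities)
```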

## Cleaning Data

In `osm.py`, I write functions to clean problematic tag values.
These functions are called by the code that parses the tags.
They use regular expressions to search for a bad name and replace it with the expected name.
If the tag value does not contain an expected name, I delete the tag.
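
The cleaning rule for street names can be sketched as follows.
This is a simplified illustration, not the code in `osm.py`: the function name and the regular expressions are my own assumptions, the city list only covers the four cities in this extract, and returning `None` stands for "delete the tag".

```python
import re

# Assumed patterns for the problems found during auditing.
CITY_RE = re.compile(r"^(?:台北市|臺北市|新北市|基隆市|桃園市)")
DISTRICT_RE = re.compile(r"^.{2}區")  # e.g. 大同區 at the start of the value
HOUSE_NUMBER_RE = re.compile(r"\d+(?:之\d+)?[號樓].*$")  # house number or floor
EXPECTED_END_RE = re.compile(r"(?:大道|路|街|段|巷|弄)$")


def clean_street_name(value):
    """Return the cleaned street name, or None if the tag should be dropped."""
    name = value.strip()
    name = CITY_RE.sub("", name)          # drop a leading city name
    name = DISTRICT_RE.sub("", name)      # drop a leading district name
    name = HOUSE_NUMBER_RE.sub("", name)  # drop house number and floor
    return name if EXPECTED_END_RE.search(name) else None


for raw in ["台北市大同區迪化街一段21號2樓", "信義路五段", "淡水老街內"]:
    print(raw, "->", clean_street_name(raw))
```

Values that still do not end with an expected street type after stripping, such as 淡水老街內 or an English address, are dropped instead of being guessed at.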

I write a function called `process_map_data` in `osm.py` to perform the OSM XML mapping.
It parses each XML data element, converts it to a Python dictionary, cleans the data, and writes all results to a JSON file.

To validate my cleaning plan, I write a function `audit_json` to check whether the output JSON data is cleaned correctly.
I compare the information printed before and after processing.

## Process Greater Taipei Area Map

Finally, I process the full dataset with my `process_map_data` function.
Then I run the command

`mongoimport -d udacity -c osm --file taipei_taiwan.json --jsonArray`

to import the map data.
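
Because the command uses `--jsonArray`, the file has to contain one JSON array of documents.
One way to produce such a file without holding every document in memory is sketched below; `write_json_array` is a hypothetical helper, not necessarily how `process_map_data` writes its output.

```python
import io
import json


def write_json_array(documents, out):
    """Stream an iterable of dictionaries to `out` as a single JSON array."""
    out.write("[\n")
    for i, doc in enumerate(documents):
        if i:
            out.write(",\n")
        # ensure_ascii=False keeps Chinese characters readable in the file.
        out.write(json.dumps(doc, ensure_ascii=False))
    out.write("\n]\n")


docs = [{"id": "1"}, {"id": "2", "tags": {"addr:street": "信義路"}}]
buffer = io.StringIO()  # a real run would open the output file with UTF-8 encoding
write_json_array(iter(docs), buffer)
print(buffer.getvalue())
```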

## Greater Taipei Area Overview

Using PyMongo, I can query MongoDB to get an overview of the map data.
I also care about whether Taipei is bicycle friendly.
So I would like to see a summary of the [cycle](https://wiki.openstreetmap.org/wiki/Bicycle) features as well.
I learned how to find cycle-related tags from the wiki page.

```python
from pymongo import MongoClient


def query_collection(collection, query):
    """Return all documents matching the query as a list."""
    return list(collection.find(query))


def aggregate_collection(collection, pipeline):
    """Run an aggregation pipeline and return the results as a list."""
    return list(collection.aggregate(pipeline))


client = MongoClient()
db = client.udacity

# count_documents replaces Collection.count(), which was removed in PyMongo 4.
print("Number of OSM documents: {}".format(db.osm.count_documents({})))

# One group per distinct user name.
pipeline = [{"$group": {"_id": "$edited.user"}}]
result = aggregate_collection(db.osm, pipeline)
print("Number of unique users: {}".format(len(result)))

summation = {"$sum": 1}

# Count the documents of each element type (node, way, relation).
pipeline = [{"$group": {"_id": "$type", "count": summation}}]
for result in aggregate_collection(db.osm, pipeline):
    print("Number of {}s: {}".format(
        result['_id'], result['count']))

bike_rental_query = {"tags.amenity": "bicycle_rental"}
result = query_collection(db.osm, bike_rental_query)
print("Number of bike rental stations: {}".format(len(result)))

bike_parking_query = {"tags.amenity": "bicycle_parking"}
result = query_collection(db.osm, bike_parking_query)
print("Number of bike parking locations: {}".format(len(result)))

# The tag exists and its value is not "no".
is_true = {"$exists": True, "$ne": "no"}
# All tag keys live under the "tags" field of the data model.
cycleway_query = {"$or": [
    {"tags.highway": "cycleway"},
    {"tags.cycleway": is_true},
    {"tags.cycleway:right": is_true},
    {"tags.cycleway:left": is_true},
    {"tags.bicycle": is_true},
    {"tags.bicycle:lanes": is_true},
    {"tags.oneway:bicycle": is_true}
]}
result = query_collection(db.osm, cycleway_query)
print("Number of cycle ways: {}".format(len(result)))
```

## Additional Improvements

I also want to compare the cycle [POIs](http://wiki.openstreetmap.org/wiki/Points_of_interest) across different divisions.
So I run the following code.

```python
from pprint import pprint

# Group documents by city and district, and count each group.
group_by_division = {
    "$group": {
        "_id": {
            "city": "$tags.addr:city",
            "district": "$tags.addr:district"
        },
        "count": summation
    }
}

pipeline = [
    {"$match": bike_rental_query},
    group_by_division
]
result = aggregate_collection(db.osm, pipeline)
print("Bike rental stations in divisions:")
pprint(result)

pipeline = [
    {"$match": bike_parking_query},
    group_by_division
]
result = aggregate_collection(db.osm, pipeline)
print("Bike parking locations in divisions:")
pprint(result)

pipeline = [
    {"$match": cycleway_query},
    group_by_division
]
result = aggregate_collection(db.osm, pipeline)
print("Cycle ways in divisions:")
pprint(result)
```

The result surprised me.
Most of the POIs do not have the city and district tags.
This makes the analysis I want to do very difficult.
Maybe I could take the coordinate (if it exists) and use a geocoding service such as a Google Maps API to find out which division it belongs to.
But I can't complete that task in MongoDB alone.
So I think a possible improvement is to add the address tags to those POIs.
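
To track how far such an improvement has come, the share of POIs that already carry both tags could be measured with a small helper like the one below.
This is only a sketch: `address_coverage` is a hypothetical function, and the documents are assumed to follow the data model defined earlier.

```python
def address_coverage(documents, keys=("addr:city", "addr:district")):
    """Return (tagged, total): how many documents carry all the given tags."""
    tagged = 0
    total = 0
    for doc in documents:
        total += 1
        tags = doc.get("tags", {})
        if all(key in tags for key in keys):
            tagged += 1
    return tagged, total


# Made-up example documents; a real run could pass a PyMongo cursor instead.
pois = [
    {"id": "1", "tags": {"amenity": "bicycle_rental"}},
    {"id": "2", "tags": {"amenity": "bicycle_rental",
                         "addr:city": "台北市", "addr:district": "大安區"}},
    {"id": "3", "tags": {"amenity": "bicycle_parking", "addr:city": "新北市"}},
    {"id": "4"},
]
tagged, total = address_coverage(pois)
print("{} of {} POIs have both address tags".format(tagged, total))
```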