In [30]:
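# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET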
from pprint import pprint

#Setting up MongoDB connection
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.osm
#Creating db.bergen as a variable for the sake of brevity
bergen = db.bergen

The aim of this document is to give a summary of the wrangling and analysis process performed in the OSM project, and to highlight interesting findings from that process. For more details about the data wrangling and data analysis processes, see <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_analysis.html">osm_analysis.html</a> and <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_data_wrangling.html">osm_data_wrangling.html</a>.  

Before starting on the project, my overall goal was to analyze the OpenStreetMap (OSM) data for my hometown Bergen, and hopefully make some interesting discoveries along the way. As I will show later in the report, the main discoveries relate to data errors and the structure of the user community.

I started by going to openstreetmap.org and finding the correct entity for the area of Bergen I wanted to analyze. I settled on the boundary-type entity with the boundary variable set to "administrative". Next I went to https://mapzen.com/data/metro-extracts/ to generate and download the necessary data file for Bergen.  

Once I had the bergen.osm data file it was time for the pre-import wrangling process (full details <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_data_wrangling.html">here</a>). I started by doing some experiments on a generated sample file, and settled on 2 focus areas for ensuring data quality in the pre-import cleaning phase:  
1) ensuring good quality of postcodes, and correcting erroneous ones.  
2) ensuring good quality of street names within Bergen.

During the analysis, I discovered a few addresses with an incorrect postcode format:

In [24]:
for _, element in ET.iterparse('data/bergen.osm'):
    if element.tag == 'node':
        tags = element.findall('tag')
        for el in tags:
            attrib_dict = el.attrib
            if attrib_dict['k'] == 'addr:postcode':
                if attrib_dict['v'][0:2] == 'NO':
                    print("id:",element.attrib['id'])
                    for addr in tags:
                        print("{0}: {1}".format( addr.attrib['k'],addr.attrib['v'] ) )
                    print('----')


id: 21641553
name: Kiwi minipris
shop: supermarket
amenity: post_office
nat_name: Kiwi minipris Frekhaug
addr:city: Frekhaug
wheelchair: yes
addr:street: Havneveien
addr:country: NO
addr:postcode: NO-5918
addr:housenumber: 36
----
id: 2698046129
name: Data Respons AS (Bergen office)
phone: +47 55 38 30 40
source: http://datarespons.com/Company-test/Offices-and-people/Norway/Bergen/
website: http://datarespons.com
addr:city: Bergen
addr:street: Edvard Griegs vei
addr:postcode: NO-5059
addr:housenumber: 3A
----
id: 2698046139
name: Itslearning HQ
phone: +47 55 23 60 70
office: yes
source: http://www.itslearning.eu/itslearning-hq-bergen-norway
website: http://www.itslearning.eu
addr:city: Bergen
addr:street: Edvard Griegs vei
addr:postcode: NO-5059
addr:housenumber: 3A
----
id: 3645588506
name: Circle K
brand: Circle K
amenity: fuel
operator: Circle K Norge AS
addr:city: Bergen
addr:street: Helleveien
addr:postcode: NO-5035
addr:housenumber: 34
----


These four addresses have an incorrect postcode format: they include the country abbreviation "NO", while only the four-digit postcode itself is of interest. I cleaned these erroneous records before importing the OSM JSON file to MongoDB, and during the analysis I verified that the postcodes had in fact been corrected.

In [77]:
#After corrections, result from mongodb

def print_address(_id):
    # Wrap a single id (str or int) in a list; `(str or int)` evaluates to just `str`
    if isinstance(_id, (str, int)):
        _id = [_id]
    for item in _id:
        item = str(item)
        search = bergen.find_one( {'id': item} )
        print(search['id'], search['address'] )
        print('----')

print_address([21641553, 2698046129,2698046139,3645588506])

21641553 {'postcode': '5918', 'housenumber': '36', 'street': 'Havneveien', 'country': 'NO', 'city': 'Frekhaug'}
----
2698046129 {'postcode': '5059', 'housenumber': '3A', 'street': 'Edvard Griegs vei', 'city': 'Bergen'}
----
2698046139 {'postcode': '5059', 'housenumber': '3A', 'street': 'Edvard Griegs vei', 'city': 'Bergen'}
----
3645588506 {'postcode': '5035', 'housenumber': '34', 'street': 'Helleveien', 'city': 'Bergen'}
----
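
A correction of this kind could look roughly like the sketch below. The function name `clean_postcode` and the regular expression are illustrative assumptions for this summary, not the exact code from the wrangling notebook.

```python
import re

# Hypothetical pattern: an optional "NO" country prefix, optionally followed by
# a hyphen or a space, and then exactly four digits.
POSTCODE_RE = re.compile(r'^(?:NO[-\s]?)?(\d{4})$')

def clean_postcode(raw):
    """Return the four-digit postcode, or None if the value does not match."""
    match = POSTCODE_RE.match(raw.strip())
    return match.group(1) if match else None

print(clean_postcode('NO-5918'))  # 5918
print(clean_postcode('5059'))     # 5059
```

Returning `None` for anything that is not four digits makes unexpected formats easy to spot instead of silently passing them on to the database.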


<img src="data/wikipedia_bergen_streets.jpg" align="right" width="290">

Next I had to come up with a good way of measuring street name quality. Since words and names in Norwegian are often concatenated, I could not reuse the regex-based strategy from the lecture videos. Instead, I put the most common street name endings into a list and compared all the street names in the data file to it. I quickly discovered that far too many street names in Bergen do not follow a common naming structure, so I needed an additional strategy for the quality audit.
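
The ending-based check can be sketched as follows. The endings in `COMMON_ENDINGS` and the sample names are illustrative assumptions that only demonstrate the matching technique; the list used in the actual audit is in the wrangling notebook.

```python
# Assumed, non-exhaustive list of common Norwegian street name endings
COMMON_ENDINGS = ('veien', 'vei', 'gaten', 'gate')

def has_common_ending(street_name):
    """True if the name ends with one of the common endings (case-insensitive)."""
    return street_name.lower().endswith(COMMON_ENDINGS)

# Sample names for illustration only
sample = ['Havneveien', 'Edvard Griegs vei', 'Helleveien', 'Bryggen']
unexpected = [name for name in sample if not has_common_ending(name)]
print(unexpected)  # ['Bryggen']
```

A name such as "Bryggen" is flagged even though it is perfectly valid, which illustrates why an ending-based check alone was not sufficient.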

I decided on a strategy of comparing the street names in the OSM data file with street names from more or less official sources. I first scraped a Norwegian Wikipedia article listing all the street names in Bergen, whose original data source was the Norwegian Mapping Authority. Since the list was quite dated (from 2005), I looked for alternative sources as well. I discovered that the Norwegian Public Roads Administration (NPRA) has a public API containing all the official Norwegian street names, which I used to generate a second list of Bergen street names.

I then combined the street names from the two sources and removed any duplicates, ending up with a list of 2093 Bergen street names. At that point I felt I had a good foundation to continue with the quality audit.
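
Combining the two sources and checking OSM names against the result can be sketched like this. The helper names, the lowercase/whitespace normalization, and the placeholder lists are assumptions made for illustration; the real lists came from the Wikipedia scrape and the NPRA API.

```python
def normalize(name):
    """Collapse whitespace and lowercase so trivial differences do not create duplicates."""
    return ' '.join(name.split()).lower()

def combine_sources(*sources):
    """Merge several name lists into one set of normalized, unique names."""
    return {normalize(name) for source in sources for name in source}

def unmatched_names(osm_names, reference):
    """Return the OSM street names that are missing from the reference set."""
    return sorted({name for name in osm_names if normalize(name) not in reference})

# Placeholder data standing in for the two scraped lists
wikipedia_names = ['Helleveien', 'Edvard Griegs vei']
npra_names = ['Edvard Griegs vei', 'helleveien ']

reference = combine_sources(wikipedia_names, npra_names)
print(len(reference))                                          # 2
print(unmatched_names(['Helleveien', 'Havneveien'], reference))  # ['Havneveien']
```

Using a set for the reference list removes duplicates on construction and keeps each membership test cheap when auditing many OSM names.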





When I first compared the OSM data set to my list of Bergen street names, the search returned an unexpectedly high number of unmatched OSM street names. While spot-checking the street names, I noticed that several of them were located outside the city of Bergen.

In [89]:
# Print the attributes and child tags of a single node, looked up by its id
for _, element in ET.iterparse('data/bergen.osm'):
    if element.tag == 'node' and element.attrib['id'] == '21641553':
        print(element.attrib)
        for el in element:
            print(el.attrib)

{'timestamp': '2012-07-22T21:23:45Z', 'changeset': '12440624', 'lat': '60.5176144', 'id': '21641553', 'lon': '5.2397614', 'version': '6', 'uid': '1694', 'user': 'M E Menk'}
{'k': 'name', 'v': 'Kiwi minipris'}
{'k': 'shop', 'v': 'supermarket'}
{'k': 'amenity', 'v': 'post_office'}
{'k': 'nat_name', 'v': 'Kiwi minipris Frekhaug'}
{'k': 'addr:city', 'v': 'Frekhaug'}
{'k': 'wheelchair', 'v': 'yes'}
{'k': 'addr:street', 'v': 'Havneveien'}
{'k': 'addr:country', 'v': 'NO'}
{'k': 'addr:postcode', 'v': 'NO-5918'}
{'k': 'addr:housenumber', 'v': '36'}
