In [2]:
#importing classes from display and pretty print modules
from pprint import pprint
from IPython.display import HTML
from IPython.display import display
#importing other necessary modules and packages
import pandas as pd
from collections import defaultdict
from pymongo import MongoClient
from operator import itemgetter
import difflib
from fuzzywuzzy import fuzz
from matplotlib import pyplot as plt
import seaborn as sns

#For convenience imports are also included in individual cells where relevant

In [3]:
#cElementTree was removed in Python 3.9; ElementTree picks up the C accelerator automatically
import xml.etree.ElementTree as ET
from pprint import pprint

#Setting up MongoDB connection
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.osm
#Creating db.bergen as a variable for the sake of brevity
bergen = db.bergen

The aim of this document is to give a summary of the wrangling and analysis process performed in the OSM project, and to highlight interesting findings from that process. For more details about the data wrangling and data analysis processes, see <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_analysis.html">osm_analysis.html</a> and <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_data_wrangling.html">osm_data_wrangling.html</a>.  

Before starting the project, my overall goal was to analyze the OpenStreetMap (OSM) data for my hometown, Bergen, and hopefully make some interesting discoveries along the way. As you will see later in the report, the main findings relate to data errors and the structure of the user community.

## Pre-import Data Wrangling (audit and cleaning)

I started by going to openstreetmap.org and finding the correct entity for the Bergen area I wanted to analyze. I settled on the boundary-type entity with the boundary variable set to "administrative". Next I went to https://mapzen.com/data/metro-extracts/ to generate and download the data file for Bergen.  

Once I had the bergen.osm data file it was time for the pre-import wrangling process (full details <a href="http://htmlpreview.github.io/?https://github.com/gisledb/udacity_nanodegree/blob/master/ipython_notebooks/p3/osm_data_wrangling.html">here</a>). I started by doing some experiments on a generated sample file, and settled on 2 focus areas for ensuring data quality in the pre-import cleaning phase:  
1) ensuring good quality of postcodes, and correcting erroneous ones.  
2) ensuring good quality of street names within Bergen.

During the analysis, I discovered a few addresses with incorrect postcode format:

In [4]:
for _, element in ET.iterparse('data/bergen.osm'):
    if element.tag == 'node':
        tags = element.findall('tag')
        for el in tags:
            attrib_dict = el.attrib
            if attrib_dict['k'] == 'addr:postcode':
                if attrib_dict['v'][0:2] == 'NO':
                    print("id:",element.attrib['id'])
                    for addr in tags:
                        print("{0}: {1}".format( addr.attrib['k'],addr.attrib['v'] ) )
                    print('----')


id: 21641553
name: Kiwi minipris
shop: supermarket
amenity: post_office
nat_name: Kiwi minipris Frekhaug
addr:city: Frekhaug
wheelchair: yes
addr:street: Havneveien
addr:country: NO
addr:postcode: NO-5918
addr:housenumber: 36
----
id: 2698046129
name: Data Respons AS (Bergen office)
phone: +47 55 38 30 40
source: http://datarespons.com/Company-test/Offices-and-people/Norway/Bergen/
website: http://datarespons.com
addr:city: Bergen
addr:street: Edvard Griegs vei
addr:postcode: NO-5059
addr:housenumber: 3A
----
id: 2698046139
name: Itslearning HQ
phone: +47 55 23 60 70
office: yes
source: http://www.itslearning.eu/itslearning-hq-bergen-norway
website: http://www.itslearning.eu
addr:city: Bergen
addr:street: Edvard Griegs vei
addr:postcode: NO-5059
addr:housenumber: 3A
----
id: 3645588506
name: Circle K
brand: Circle K
amenity: fuel
operator: Circle K Norge AS
addr:city: Bergen
addr:street: Helleveien
addr:postcode: NO-5035
addr:housenumber: 34
----


We see that these four addresses have an incorrect postcode format: they include the country abbreviation "NO", while we are only interested in the four-digit postcode itself. I cleaned these erroneous records before importing the OSM JSON file into MongoDB. During my analysis, I verified that the postcodes had in fact been corrected.
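
A minimal sketch of how such a postcode could be normalized. The helper name `clean_postcode` and the regular expression are assumptions made for illustration; the cleaning function actually used in the project is documented in osm_data_wrangling.html and is not reproduced here.

```python
import re

# Hypothetical pattern: an optional "NO" or "NO-" prefix followed by exactly four digits.
POSTCODE_RE = re.compile(r'^(?:NO-?)?(\d{4})$')

def clean_postcode(raw):
    """Return the bare four-digit postcode, or None if the value is not recognizable."""
    match = POSTCODE_RE.match(raw.strip())
    return match.group(1) if match else None

for value in ['NO-5918', 'NO-5059', '5035']:
    print(value, '->', clean_postcode(value))
```

Returning `None` for anything that does not look like a postcode keeps unexpected values visible for manual review instead of silently passing them through.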

In [5]:
#After corrections, result from mongodb

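def print_address(_id):
    #Accept a single id (str or int) as well as a list of ids
    if isinstance(_id, (str, int)):
        _id = [_id]
    for item in _id:
        #Ids are stored as strings in the database
        item = str(item)
        search = bergen.find_one( {'id': item} )
        print(search['id'], search['address'] )
        print('----')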

print_address([21641553, 2698046129,2698046139,3645588506])

21641553 {'country': 'NO', 'postcode': '5918', 'city': 'Frekhaug', 'housenumber': '36', 'street': 'Havneveien'}
----
2698046129 {'postcode': '5059', 'city': 'Bergen', 'housenumber': '3A', 'street': 'Edvard Griegs vei'}
----
2698046139 {'postcode': '5059', 'city': 'Bergen', 'housenumber': '3A', 'street': 'Edvard Griegs vei'}
----
3645588506 {'postcode': '5035', 'city': 'Bergen', 'housenumber': '34', 'street': 'Helleveien'}
----


<img src="data/wikipedia_bergen_streets.jpg" align="right" width="290">

Next I had to come up with a good way to measure street name quality. Since words and names in Norwegian are often concatenated, I could not reuse the regex-based strategy from the lecture videos. Instead, I put the most common street name endings into a list and compared all the street names in the data file to it. I quickly discovered that far too many street names in Bergen do not follow a common naming structure, so I needed an additional strategy for the quality audit.
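
The ending-based check can be sketched as follows. The endings in `COMMON_ENDINGS` are illustrative assumptions, not the list used in the project, and `has_common_ending` is a hypothetical helper name.

```python
# Illustrative endings only; the project's actual list is not reproduced here.
COMMON_ENDINGS = ('veien', 'vegen', 'gaten', 'gate', 'vei', 'plass', 'allé')

def has_common_ending(street_name, endings=COMMON_ENDINGS):
    """True if the name ends with one of the known endings (case-insensitive)."""
    return street_name.lower().endswith(tuple(endings))

streets = ['Havneveien', 'Edvard Griegs vei', 'Tokanten', 'Christies Gate']
unmatched = [name for name in streets if not has_common_ending(name)]
print(unmatched)
```

Because Norwegian endings are usually glued onto the name ("Havneveien"), a plain `str.endswith` test fits better than splitting on whitespace. Note that the case-insensitive comparison lets "Christies Gate" pass, so capitalization errors need a separate check.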

I decided on a strategy of comparing the street names in the OSM data file with street names from more or less official sources. I first scraped a Norwegian Wikipedia article listing all the street names in Bergen, whose original data source is the Norwegian Mapping Authority. Since the list was quite dated (from 2005), I decided to look for alternative sources as well. I discovered that the Norwegian Public Roads Administration (NPRA) has a public API containing all the official Norwegian street names, which I used to generate a second list of Bergen street names.

I then combined the street names from the two sources and removed any duplicates, ending up with a list of 2093 Bergen street names. At that point I felt I had a good foundation to continue with the quality audit.
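
Combining the two lists and dropping duplicates amounts to a set union. The sketch below assumes each source has already been reduced to a plain list of names; `combine_street_lists` and the sample values are hypothetical.

```python
def combine_street_lists(*sources):
    """Merge several lists of street names into one sorted, duplicate-free list."""
    combined = set()
    for source in sources:
        # Strip stray whitespace so that "Helleveien " and "Helleveien" count as one name.
        combined.update(name.strip() for name in source if name.strip())
    return sorted(combined)

# Made-up sample input standing in for the scraped and API-based lists.
wikipedia_names = ['Helleveien', 'Christies gate ', 'Lars Hilles gate']
npra_names = ['Helleveien', 'Edvard Griegs vei', '']

official_streets = combine_street_lists(wikipedia_names, npra_names)
print(len(official_streets), official_streets)
```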





When I first compared the OSM dataset to my list of Bergen street names, the search returned an unexpectedly high number of unmatched OSM street names. While spot-checking the street names, I noticed that several of them were located outside the city of Bergen.

In [6]:
for _, element in ET.iterparse('data/bergen.osm'):
    if element.tag == 'node':
        if element.attrib['id'] == '21641553':
            print(element.attrib)
            for el in element:
                if el.attrib['k'][0:4] == 'addr':
                    print(el.attrib)
            break

{'uid': '1694', 'version': '6', 'lat': '60.5176144', 'user': 'M E Menk', 'id': '21641553', 'changeset': '12440624', 'lon': '5.2397614', 'timestamp': '2012-07-22T21:23:45Z'}
{'k': 'addr:city', 'v': 'Frekhaug'}
{'k': 'addr:street', 'v': 'Havneveien'}
{'k': 'addr:country', 'v': 'NO'}
{'k': 'addr:postcode', 'v': 'NO-5918'}
{'k': 'addr:housenumber', 'v': '36'}


To dig deeper into this, I downloaded an official dataset from the Norwegian postal service containing every postcode in Norway along with its postcode name and municipality. When I compared the postal service postcodes to all the postcodes in the OSM dataset, I discovered that some of the OSM documents are located in places outside the municipality of Bergen.

Since I mainly focus on Bergen in this project, I decided to limit my address corrections to addresses within Bergen.

In [7]:
import pandas as pd

postcodes_per_municipality = pd.read_csv('data/Postnummerregister_ansi.tsv', encoding='utf-8',delimiter='\t',header=0, names=[
        'postal_code','postal_place','muni_number','muni_name','category'],
            #dtype keys must match the column names given in names
            dtype = {'postal_code': str, 'muni_number': str})

print("Top rows of official postcode records:")
display(postcodes_per_municipality.head(2))

def is_postcode(elem):
    #get() reads the key without modifying the element's attributes
    return (elem.attrib.get('k') == "addr:postcode")

postcodes_bergen = set(postcodes_per_municipality[
    postcodes_per_municipality['muni_name'] == 'BERGEN']['postal_code'] )
postcodes_outside_bergen = set(postcodes_per_municipality[postcodes_per_municipality[
    'muni_name'] != 'BERGEN']['postal_code'] )


in_bergen_count = 0
outside_bergen_count = 0
postcodes_not_found = list()
outside_dict = defaultdict(int)


for _, element in ET.iterparse('data/bergen.osm'):
    if element.tag in ['way','node']:
        for tag in element.iter("tag"):
            if is_postcode(tag):
                postcode = tag.attrib['v']
                if postcode in postcodes_bergen:
                    in_bergen_count += 1
                elif postcode in postcodes_outside_bergen:
                    outside_bergen_count += 1
                    outside_dict[postcode] += 1
                else:
                    postcodes_not_found.append(postcode)
                    
print("Addresses in source file located in Bergen:",in_bergen_count)
print("Addresses in source file located outside Bergen:",outside_bergen_count)
print("Incorrect postcodes in source file:",postcodes_not_found)

Top rows of official postcode records:


Unnamed: 0,postal_code,postal_place,muni_number,muni_name,category
0,10,OSLO,301,OSLO,B
1,15,OSLO,301,OSLO,B


Addresses in source file located in Bergen: 70286
Addresses in source file located outside Bergen: 14099
Incorrect postcodes in source file: ['NO-5918', 'NO-5059', 'NO-5059', 'NO-5035', 'NO-5059']


I used the postal service postcode dataset to improve my postcode audit, and except for the already mentioned postcode errors, all the postcodes in the osm dataset matched postcodes in the official postcode dataset.

After deciding on the criteria for the street name audit, I compared the OSM street names to the street names from the official sources and to the common street name suffixes. This resulted in 15 OSM street names within Bergen that were not found in the official street name dataset. I manually reviewed these 15 street names and excluded 6 of them from further corrections. One is actually a correct street name, verified through online research, while the other 5 cannot easily be corrected (post box address, name of a shopping center, unknown street names).

In [8]:
mapping = { " Gate": " gate", " alle": " allé", "vn.": "vegen",
           "Tokanten": "Nesttunveien", "vei 4-10": "vei"
            }
error_streets = ['Hesthaugvn.',
 'Christies Gate',
 'Steinsvikvegen 430',
 'Smøråshøgda 9',
 'Minde alle',
 'Thormøhlens Gate',
 'Vilhelm Bjerknesvei 4-10',
 'Tokanten',
 'Laguneveien 1']

print("The 9 remaining street names:")
pprint(error_streets)

print("Corrected using the following criteria:")
pprint(mapping)

The 9 remaining street names:
['Hesthaugvn.',
 'Christies Gate',
 'Steinsvikvegen 430',
 'Smøråshøgda 9',
 'Minde alle',
 'Thormøhlens Gate',
 'Vilhelm Bjerknesvei 4-10',
 'Tokanten',
 'Laguneveien 1']
Corrected using the following criteria:
{' Gate': ' gate',
 ' alle': ' allé',
 'Tokanten': 'Nesttunveien',
 'vei 4-10': 'vei',
 'vn.': 'vegen'}
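
One way the mapping above could be applied is as a suffix replacement. This is a sketch, not the function from osm_data_wrangling.html: `update_street_name` is a hypothetical name, and stripping a trailing house number is an assumption added here to cover the three names the mapping does not touch.

```python
import re

mapping = {" Gate": " gate", " alle": " allé", "vn.": "vegen",
           "Tokanten": "Nesttunveien", "vei 4-10": "vei"}

# Assumption: a trailing house number such as " 430" does not belong in the street name.
TRAILING_NUMBER_RE = re.compile(r'\s+\d+$')

def update_street_name(name, mapping):
    """Replace a known bad ending, then drop a trailing house number if present."""
    for wrong, right in mapping.items():
        if name.endswith(wrong):
            name = name[:-len(wrong)] + right
            break
    return TRAILING_NUMBER_RE.sub('', name)

for street in ['Hesthaugvn.', 'Christies Gate', 'Steinsvikvegen 430', 'Tokanten']:
    print(street, '->', update_street_name(street, mapping))
```

Matching only at the end of the name avoids accidental replacements in the middle of a longer street name.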


Once the criteria for the address audit were finalized and implemented, I created and ran a function to generate a JSON file for import into MongoDB.

In [9]:
#Checking if errors have been imported to MongoDB (expecting 0 results)

#count_documents replaces Cursor.count(), which was removed in PyMongo 4
errors_in_mongodb = bergen.count_documents( { 'address.street': { '$in': error_streets } } )

print("Bergen street name errors imported to MongoDB:",errors_in_mongodb)

Bergen street name errors imported to MongoDB: 0


## Data Analysis

The following section is a summary of the analysis in osm_analysis.ipynb/osm_analysis.html. My main focus is the quality of addresses in Bergen.

#### Basic Statistics
  
  

  
**File Sizes**  

bergen.osm ........ 142.99 MB (original file)  
bergen.osm.json ... 157.24 MB (file imported to mongodb)

**Document Types**

In [10]:
#count_documents replaces the count() methods removed in PyMongo 4
print("Documents in database:",bergen.count_documents({}) )
print("Nodes in database:",bergen.count_documents({ 'type': 'node' }) )
print("Ways in database:",bergen.count_documents({ 'type': 'way' }) )

Documents in database: 681172
Nodes in database: 628754
Ways in database: 52379


**Addresses**

In [11]:
#count_documents replaces Cursor.count(), which was removed in PyMongo 4
address_count = bergen.count_documents( { 'address': {'$exists': True} } )
bergen_count = bergen.count_documents( { 'address.postcode': {'$in': list(postcodes_bergen) } } )

print("Documents in database with address information:", address_count)
print("Documents in database with addresses within Bergen municipality:",bergen_count)

Documents in database with address information: 84625
Documents in database with addresses within Bergen municipality: 70290


#### Error Hunting

After getting a general feel for the address data in the dataset, I went on to look for misspelled street names. My strategy was to look at streets with low address counts whose names are similar to those of other streets in the database. To find near-matching street names I used a Python library named FuzzyWuzzy. After experimenting a little, I settled on a fairly high fuzzy score of 90 and an address count of 10 as criteria for further analysis. 
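
The search for near-matching names can be sketched as below. To stay within the standard library, this sketch swaps in `difflib.SequenceMatcher` for the fuzzy scoring library, so its scores will not necessarily equal the ones from the analysis. `find_suspects` is a hypothetical helper, and reading the address-count criterion as "at most 10 addresses" is an assumption. The counts in `street_counts` are taken from the verification output later in this report.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score from 0 to 100 (difflib stand-in for a fuzzy ratio)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def find_suspects(street_counts, min_score=90, max_count=10):
    """Pair each rarely used street name with more common, near-identical names."""
    suspects = []
    for rare, rare_count in street_counts.items():
        if rare_count > max_count:
            continue
        for common, common_count in street_counts.items():
            # Only compare against names that are used more often than the rare one.
            if common == rare or common_count <= rare_count:
                continue
            score = similarity(rare, common)
            if score >= min_score:
                suspects.append((rare, common, score))
    return suspects

street_counts = {'Totlandsvegen': 211, 'Totlandsveien': 1,
                 'Bønesskogen': 222, 'Børnesskogen': 1, 'Lars Hilles gate': 26}

for rare, common, score in find_suspects(street_counts):
    print(rare, '~', common, score)
```

Every pair found this way still needs manual review, since two genuinely different streets can have nearly identical names.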

In the end I corrected the street names in 28 documents containing 15 different misspelled street names. I also exported 4 additional street names, which require more research than the scope of this project allows, to research_street_spellings.csv.

*Code from osm_analysis.html (and .ipynb) used to correct the misspelled street names in the database.*  
  
`
for inx,name in df_misspelled_streets.iterrows():
    bergen.update_many( { 'address.street': name[2]},
                  { '$set': {'address.street': name[0] } 
                  } )
 `     

In [12]:
#Ensuring all is corrected. First result should return count 0
misspelled_street_names = [{'correct_name': 'Totlandsvegen', 'wrong_name': 'Totlandsveien'},
 {'correct_name': 'Haakon Sheteligs plass',
  'wrong_name': 'Haakon Shetelings plass'},
 {'correct_name': 'Dreggsallmenningen', 'wrong_name': 'Dreggsallmenning'},
 {'correct_name': 'Vestre Murallmenningen',
  'wrong_name': 'Vestre murallmenningen'},
 {'correct_name': 'Haakon Sheteligs plass',
  'wrong_name': 'Haakon Shetelings plass'},
 {'correct_name': 'Dreggsallmenningen', 'wrong_name': 'Dreggsallmenning'},
 {'correct_name': 'Vilhelm Bjerknes’ vei',
  'wrong_name': 'Vilhelm Bjerknesvei'},
 {'correct_name': 'Travparkvegen', 'wrong_name': 'Travparkveien'},
 {'correct_name': 'Travparkvegen', 'wrong_name': 'Travparkveien'},
 {'correct_name': 'Bønesskogen', 'wrong_name': 'Børnesskogen'},
 {'correct_name': 'C. Sundts gate', 'wrong_name': 'C.Sundtsgate'},
 {'correct_name': 'Lars Hilles gate', 'wrong_name': 'Lars Hillesgate'},
 {'correct_name': 'Vestre Mulelvsmauet', 'wrong_name': 'Østre Mulelvsmauet'},
 {'correct_name': 'Siljustølvegen', 'wrong_name': 'Siljustølveien'},
 {'correct_name': 'Herman Foss’ gate', 'wrong_name': "Herman Foss' gate"}]

#count_documents replaces Cursor.count(), which was removed in PyMongo 4
for name in misspelled_street_names:
    print(name['wrong_name'],bergen.count_documents( { 'address.street': name['wrong_name']}) )
    print(name['correct_name'],bergen.count_documents( { 'address.street': name['correct_name']}) )
    print('----')

Totlandsveien 1
Totlandsvegen 211
----
Haakon Shetelings plass 2
Haakon Sheteligs plass 6
----
Dreggsallmenning 1
Dreggsallmenningen 10
----
Vestre murallmenningen 1
Vestre Murallmenningen 27
----
Haakon Shetelings plass 2
Haakon Sheteligs plass 6
----
Dreggsallmenning 1
Dreggsallmenningen 10
----
Vilhelm Bjerknesvei 9
Vilhelm Bjerknes’ vei 111
----
Travparkveien 1
Travparkvegen 3
----
Travparkveien 1
Travparkvegen 3
----
Børnesskogen 1
Bønesskogen 222
----
C.Sundtsgate 2
C. Sundts gate 53
----
Lars Hillesgate 1
Lars Hilles gate 26
----
Østre Mulelvsmauet 3
Vestre Mulelvsmauet 4
----
Siljustølveien 1
Siljustølvegen 14
----
Herman Foss' gate 1
Herman Foss’ gate 9
----


Next I looked for duplicate addresses in the dataset.

In [13]:
#Finding duplicate addresses

pipeline = [
    { '$group': { 
            '_id': { 'street': '$address.street', 'housenumber': '$address.housenumber' }, 
                'postcodes': { '$addToSet': '$address.postcode' }, 
            'count': {'$sum': 1 }
            }
        },
    { '$match': {'count': {'$gt': 1} } },
    { '$group': { '_id': 'null', 'count': { '$sum': 1 } } } ]

for row in bergen.aggregate(pipeline):
    print(row)

{'_id': 'null', 'count': 904}


I found quite a few duplicate addresses in the Bergen OSM dataset. In many cases the duplicates seem to arise because each business located at a given address is listed with its own copy of the address, instead of having a node that references a shared address document.

Due to vague OSM policies, it is unclear whether this should be considered incorrect data. Regarding multiple businesses that share the same address, the OSM wiki states: "However, there is still some debate on that point (see for example Address information in POI *and* building? on help.openstreetmap.org). Also, the community in some countries has established their own rules."

To address the duplicate issue in detail, I suggest following up by looking at the individual duplicate addresses. One could for example start by looking at the three streets with the most individual duplicate addresses to see if there are any useful patterns to be found.
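
A possible starting point for that follow-up, sketched in plain Python rather than as a MongoDB pipeline. It assumes the documents have already been fetched into a list of dicts shaped like the database records; `streets_with_most_duplicates` is a hypothetical helper, and the sample documents reuse addresses shown earlier in this report.

```python
from collections import Counter

def streets_with_most_duplicates(documents, top_n=3):
    """Rank streets by how many of their (street, housenumber) pairs occur more than once."""
    address_counts = Counter(
        (doc['address']['street'], doc['address']['housenumber'])
        for doc in documents
        if 'street' in doc.get('address', {}) and 'housenumber' in doc.get('address', {})
    )
    per_street = Counter(
        street for (street, _), count in address_counts.items() if count > 1
    )
    return per_street.most_common(top_n)

sample_documents = [
    {'id': '2698046129', 'address': {'street': 'Edvard Griegs vei', 'housenumber': '3A'}},
    {'id': '2698046139', 'address': {'street': 'Edvard Griegs vei', 'housenumber': '3A'}},
    {'id': '3645588506', 'address': {'street': 'Helleveien', 'housenumber': '34'}},
    {'id': '1', 'type': 'node'},  # made-up document without address information
]

print(streets_with_most_duplicates(sample_documents))
```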

#### User Contributions

Next I had a look at the distribution of user edits in the Bergen OSM data.

In [14]:
from collections import defaultdict

user_count_query = bergen.aggregate( [
   {
     '$group': {
        '_id' : { 'uid': '$created.uid', 'username': '$created.user' }
           }
        },
   {
     '$group': {
        '_id': 'null',
        'count': { '$sum': 1 }
     }
   }
] )

for doc in user_count_query:
    user_count = doc['count']

average_contributions = bergen.aggregate( [
   {
          '$group': 
            {
                '_id' : 
                { 'uid': '$created.uid', 'username': '$created.user' },
                'count': { '$sum': 1 } 
            } 
    },
    { 
            '$group': 
            {
                '_id': 'null',
                'avg': { '$avg': '$count' } 
            }
    }
] )

for doc in average_contributions:
    user_average = round(doc['avg'],2)
    
grouped_users = list(bergen.aggregate([  
        { 
            "$group" : 
            { 
                "_id" : { "uid": "$created.uid", "username": "$created.user" },
                "count" : { "$sum" : 1} 
            } 
        },
        { "$sort" : { "count" : 1 } }
        ]))

user_no = 0
halfway = round(user_count / 2)
mode_dict = defaultdict(int)

for doc in grouped_users:
    user_no += 1
    val = doc['count']
    if user_no == halfway:
        user_median = val

    mode_dict[val] += 1


        
user_mode = max(mode_dict.items(), key=lambda a: a[1])
mode_percentage = round((user_mode[1] / user_count) * 100,2)
            
print("Total user count:",user_count)
print("Average contributions per user:",user_average)
print("Median contributions per user:",user_median)
print("Mode of contribution count: {0} contributors ({1}%) submitted {2} edit.".format(
    user_mode[1],mode_percentage,user_mode[0] ) )

Total user count: 399
Average contributions per user: 1707.2
Median contributions per user: 11
Mode of contribution count: 76 contributors (19.05%) submitted 1 edit.
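
For comparison, the same three statistics can be computed with the standard library once the per-user counts are available as a plain list. This is a sketch under that assumption: `summarize` is a hypothetical helper and the sample counts are made up to mimic a skewed distribution.

```python
from collections import Counter
from statistics import mean, median

def summarize(counts):
    """Return mean, median and mode (with the number of users sharing the mode)."""
    mode_value, mode_users = Counter(counts).most_common(1)[0]
    return {
        'mean': round(mean(counts), 2),
        'median': median(counts),
        'mode': mode_value,
        'mode_users': mode_users,
    }

# Made-up contribution counts: many small contributors and one very heavy one.
sample_counts = [1, 1, 1, 2, 5, 11, 40, 300, 9000]
summary = summarize(sample_counts)
print(summary)
```

`statistics.median` also handles an even number of users by averaging the two middle values, which the position-based loop above does not do.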


Based on the large gap between the median (11) and the average (1707.2), I suspected that a few heavy contributors account for most of the Bergen data. Further investigation confirmed this: the top 10 users in the dataset contributed more than 80% of the data edits. 

In [15]:
top_users = list(bergen.aggregate([  
        { "$group" : { 
                "_id" : { "uid": "$created.uid", "username": "$created.user" },"count" : { "$sum" : 1} } },
        { "$sort" : { "count" : -1 } },
         { "$limit" : 10 }
    ]))

for doc in top_users:
    print(doc)

{'_id': {'username': 'FredrikLindseth_import', 'uid': '2114448'}, 'count': 140794}
{'_id': {'username': 'frokor_import', 'uid': '2836853'}, 'count': 133655}
{'_id': {'username': 'gormur', 'uid': '103253'}, 'count': 80243}
{'_id': {'username': 'Christian Madsen', 'uid': '992708'}, 'count': 39789}
{'_id': {'username': 'daviesp12', 'uid': '722193'}, 'count': 36440}
{'_id': {'username': 'frokor', 'uid': '170061'}, 'count': 31427}
{'_id': {'username': 'FredrikLindseth', 'uid': '1965308'}, 'count': 29969}
{'_id': {'username': 'Gazer75', 'uid': '715936'}, 'count': 22168}
{'_id': {'username': 'cmeeren_import', 'uid': '3119148'}, 'count': 19287}
{'_id': {'username': 'gisle', 'uid': '8313'}, 'count': 16081}


Three of the top ten users have user names ending in \_import, indicating their contributions were automated in some way.
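
The shares behind these observations can be checked with a few lines of arithmetic. The counts below are copied from the outputs in this report (top 10 users and total document count); the variable names are only illustrative.

```python
# Counts copied from the top-10 output and the total document count shown in this report.
top_users = {
    'FredrikLindseth_import': 140794, 'frokor_import': 133655, 'gormur': 80243,
    'Christian Madsen': 39789, 'daviesp12': 36440, 'frokor': 31427,
    'FredrikLindseth': 29969, 'Gazer75': 22168, 'cmeeren_import': 19287, 'gisle': 16081,
}
total_documents = 681172

top10_share = 100 * sum(top_users.values()) / total_documents
import_share = 100 * sum(
    count for name, count in top_users.items() if name.endswith('_import')
) / total_documents

print("Top 10 share of all documents: {0:.1f}%".format(top10_share))
print("Share from *_import accounts in the top 10: {0:.1f}%".format(import_share))
```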

#### Feature types

While finishing up my analysis, I got curious about the distribution of the various feature types for the documents in the MongoDB database, so I decided to take a quick look.

*To be continued.*

#### Eateries and Types of Cuisine

While investigating the feature types, I got interested in the cuisine information for eateries (mainly fast food, restaurants and cafés).

In [None]:
#Most common food served at eateries in Bergen

#Counts of different type of eateries
pipeline = [ 
    {'$match': {'cuisine': { '$exists': 1 } } },
    { '$group': { 
        '_id': {'cuisine': '$cuisine' }, 
        'count': {'$sum': 1 } } 
    },
    { '$sort': {'count': -1 } },
    { '$limit': 10 }
]

print("Cuisine, all eateries")
for doc in bergen.aggregate(pipeline):
    print(doc)


pipeline = [ 
    {'$match': {'amenity': 'restaurant', 'cuisine': {'$exists': 1} } },
    { '$group': { 
        '_id': {'cuisine': '$cuisine' }, 
        'count': {'$sum': 1 } } 
    },
    { '$sort': {'count': -1 } },
    { '$limit': 10 }
]
print('    ')
print("Cuisine, restaurants only")
for doc in bergen.aggregate(pipeline):
    print(doc)

Among Bergen restaurants, pizza, sushi and Chinese food are the top cuisines. Among all eateries with cuisine information, excluding coffee_shop, the top 3 cuisines are burger, pizza and sushi.

In [None]:
#Counts of eatery types with cuisine information
pipeline = [ 
    {'$match': {'cuisine':{ '$exists': 1 }, 'amenity': {'$in': ['cafe','restaurant', 'fast_food'] } } },
    { '$group': { 
        '_id': {'eatery_type': '$amenity' }, 
        'count': {'$sum': 1 } } 
    },
    {'$sort': {'count': -1 } }
]

print('Eateries with cuisine information')
for doc in bergen.aggregate(pipeline):
    print(doc)

print('    ')
#Counts of eatery types without cuisine information
pipeline = [ 
    {'$match': {'cuisine':{ '$exists': 0 }, 'amenity': {'$in': ['cafe','restaurant', 'fast_food'] } } },
    { '$group': { 
        '_id': {'eatery_type': '$amenity' }, 
        'count': {'$sum': 1 } } 
    },
    {'$sort': {'count': -1 } }
]

print('Eateries missing cuisine information')
for doc in bergen.aggregate(pipeline):
    print(doc)

Quite a few restaurants, cafés, and fast food places lack cuisine information. Filling this in could be a low-effort, high-yield task for the active Bergen OSM contributors.

## Final Thoughts

There are 184 address documents without street names in the dataset. Further investigation into those documents is recommended. 
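
A sketch of how such documents could be selected for that investigation. The query that produced the count of 184 is not shown in this report, so the filter below is an assumption about its shape, and `count_missing_streets` is a hypothetical helper that expects a PyMongo collection such as `bergen`.

```python
# Assumed filter: the document has an address sub-document but no street field in it.
MISSING_STREET_FILTER = {
    'address': {'$exists': True},
    'address.street': {'$exists': False},
}

def count_missing_streets(collection):
    """Count address documents that lack a street name."""
    return collection.count_documents(MISSING_STREET_FILTER)

# Usage against the database from this report would look like:
#     count_missing_streets(bergen)
```

Passing the same filter to `find()` would return the documents themselves, for example to group them by contributing user.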

To address the duplicate issue in detail, I suggest following up by looking at the individual duplicate addresses. One could, for example, start by looking at the three streets with the most individual duplicate addresses to see if there are any useful patterns to be found. Since a few users are very heavy contributors to the Bergen OSM data, it might also be worth searching for user patterns regarding the duplicate addresses.