The primary goal of this analysis is to determine the quality of OpenStreetMap address data in the Bergen, Norway region. The main focus will be on postcode accuracy and duplicate discovery. It is not within the scope of this project to correct any errors, but rather to point out discovered errors and areas which should be investigated further.

In [1]:
#importing classes from display and pretty print modules
from pprint import pprint
from IPython.display import HTML
from IPython.display import display

In [2]:
#Setting up MongoDB connection
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.osm
#Creating db.bergen as a variable for the sake of brewity
bergen = db.bergen

In [3]:
#Getting an initial overview of the data
display(HTML('<b>Count of documents in database:</b>'),bergen.count())
display(HTML('<b>First record:</b>'))
pprint(bergen.find_one())

681172

{'_id': ObjectId('58a02e478ac10fae74bea034'),
 'created': {'changeset': '6007582',
             'timestamp': '2010-10-10T22:29:55Z',
             'uid': '114230',
             'user': 'danerikk',
             'version': '3'},
 'id': '358070',
 'pos': [60.531027, 5.2545927],
 'type': 'node'}


In [4]:
#Creating indexes

from pymongo import ASCENDING

bergen.create_index([('address', ASCENDING),('address.street', ASCENDING),('address.housenumber', ASCENDING)])


'address_1_address.street_1_address.housenumber_1'

In [5]:
#Getting count of documents with address field

address_query = { 'address' : {'$exists' : True }}
address_documents = bergen.find(address_query)
address_count = address_documents.count()

display(HTML('<b>Number of addresses in dataset:</b>'),address_count)


84625

In [6]:
#Getting counts for streetnames and addresses

aggregated = bergen.aggregate([  
        {'$match' : {'address': {'$exists' : True } } },
        { "$group" : { 
                "_id" : "$address.street","count" : { "$sum" : 1} } }
    ])

household_count = 0
unique_street_count = 0
addresses_on_street = {}

for doc in aggregated:
    household_count += doc['count']
    unique_street_count += 1
    
    addresses_on_street[doc['_id']] = doc['count']

print("total addresses in Bergen:", household_count)
print("number of streetnames:", unique_street_count)

total addresses in Bergen: 84625
number of streetnames: 2231


According to January 2016 data from Statistics Norway (SSB), there are 134,328 households in Bergen. The data used by Statistics Norway is collected from the National Registry, and the data include unit numbers for minimum 95% of the addresses where such a number exists. The available OSM data does not contain unit numbers. Several addresses in Bergen contain multiple home units, and although the OSM data also contains non-household addresses (businesses, public institutions etc.) the number of addresses in the OSM data seems reasonable.

Next I will take a look at the streets with the most addresses on them, to see if any of the top 10 streets are surprising, and if any of the streets have a surprisingly high number of addresses.

In [7]:
#Taking a look at the streets with the most addresses

from operator import itemgetter

streetnames_sorted_dict = dict(sorted(addresses_on_street.items(), key=lambda x: x[1], reverse=True)[:10])
streetnames_sorted_list = sorted(addresses_on_street.items(), key=lambda x: x[1], reverse=True)


print("Streets with most addresses on them:")

for street,count in streetnames_sorted_list[0:10]:
    print(street,count)

Streets with most addresses on them:
Myrdalskogen 442
Askvegen 397
Søråshøgda 377
Kringlebotn 304
Flaktveitvegen 293
Stongafjellsvegen 289
Hjellestadvegen 277
Hetlevikåsen 276
Langarinden 273
Nipedalen 250


Based on local knowledge, the list above is not very surprising. None of the streets have a higher number of addresses than I expected.

### DONE Question for 1:1 
Elaborate on this? Keep/remove?

Specify why I am looking at top address streets. Mention that all streets look accurate.


In [8]:
#Checking for potential duplicate data due to misspelled street names

import difflib
from fuzzywuzzy import fuzz

def fuzzy_streets(ratio,house_count):
    
    fuzzy_matches = list()
    compare_count = 0
    
    for k1 in streetnames_sorted_list:

        if k1[0] is None:
            print("Addresses without street name:",k1[1])

        #Only comparing street names with less addresses than house_count
        elif k1[1] <= house_count:
            
            compare_count += 1

            for k2 in streetnames_sorted_list:

                if k2[0] is None:
                    pass

                elif k2[0] == k1[0]:
                    pass

                else:                    
                    
                    fuzz_ratio = fuzz.ratio(k1[0],k2[0])
                    
                    if fuzz_ratio >= ratio:
                        fuzzy_matches.append({k1: k2,"fuzz ratio": fuzz_ratio})

    print("Number of street names compared: {0} of {1}".format(compare_count,len(streetnames_sorted_list)))
    
    return fuzzy_matches

In [9]:
#Lower than 90 fuzzy ratio gives too many false positives. Same goes for higher than 10 addresses on the street.
potential_misspellings = fuzzy_streets(92,10)

Addresses without street name: 184
Number of street names compared: 526 of 2231


### DONE Question for 1:1
Should I take a closer look at the addresses without street name?

Write in conclusion that they should be looked closer into. Good for last question in rubric.



In [10]:
import pandas as pd


In [12]:
#Printing out the potential misspellings

df_potential_misspellings = pd.DataFrame(columns = [
        'high_spelling','high_count','low_spelling','low_count','fuzz_ratio'])

#Adding index to make it easier to sort out the items I need to investigate further
count = 0

for spellings in potential_misspellings:
    count += 1
    df_potential_misspellings.loc[count] = None
    for key, val in spellings.items():
        if type(key) == tuple:
            if key[1] > val[1]:
                df_potential_misspellings.loc[count]['high_spelling'] = key[0]
                df_potential_misspellings.loc[count]['low_spelling'] = val[0]
                df_potential_misspellings.loc[count]['high_count'] = key[1]
                df_potential_misspellings.loc[count]['low_count'] = val[1]

            else:
                df_potential_misspellings.loc[count]['high_spelling'] = val[0]
                df_potential_misspellings.loc[count]['low_spelling'] = key[0]
                df_potential_misspellings.loc[count]['high_count'] = val[1]
                df_potential_misspellings.loc[count]['low_count'] = key[1]
        else:
            df_potential_misspellings.loc[count]['fuzz_ratio'] = val

df_potential_misspellings.sort_values('high_spelling',ascending=True)

Unnamed: 0,high_spelling,high_count,low_spelling,low_count,fuzz_ratio
7,Austrevågen,21,Austevågen,5,95
25,Bønesskogen,222,Børnesskogen,1,96
12,C. Sundts gate,53,C.Sundtsgate,2,92
15,Dreggsallmenningen,10,Dreggsallmenning,1,94
1,Dreggsallmenningen,10,Dreggsallmenning,1,94
16,Espelandsvegen,76,Espelandsveien,1,93
5,Flyplassvegen,27,Flyplassveien,7,92
6,Haakon Sheteligs plass,6,Haakon Shetelings plass,2,98
14,Haakon Sheteligs plass,6,Haakon Shetelings plass,2,98
17,Hallvardsvegen,14,Halvardsvegen,1,96


In [13]:
#Ensuring corrected street names in cleaning script are in fact corrected in the database
for street,count in streetnames_sorted_list:
    
    if street is None:
        pass
    
    elif ('Thormøhlens' or 'Smøråshøgda 9' or 'Laguneveien 1' or 'Gate' or '.' or 'Tokanten') in street:
        #expecting 1 result
        print(street,count)

Thormøhlens gate 47


### DONE Question for 1:1
Should I remove the text explanation above? what about the import test?

Keep test, remove explanation.

In [16]:
#INCOMPLETE. I will use this to filter out what I need to take a closer look at
true_duplicates = []
investigate_further = []

# for index, item in enumerate(potential_misspellings):
#     if item[0] in (6,17,1,)

#### From 1on1
OK to stop at listing the street names. Extra points for investigations.

Above I have performed some QA on the street names from the Bergen OSM dataset. I have taken a closer look at the street names with less than 10 house numbers, and I have compared those street with the other street names to spot potential misspelled and duplicate street names.

I have manually reviewed the returned list of (fuzzy) matched street names, and I have added what I consider true duplicates (based on local knowledge) to a new list, `true_duplicates`. Some of the matched street names require further investigation, and I have therefore created a separate list for those items, called `investigate_further`.

In [26]:
#Creating functions for printing individual address search results

#without postal code
def search_one_address(street, housenumber):
    
    housenumber = str(housenumber)
    
    query = { 'address.street': street, 'address.housenumber': housenumber }

    for doc in bergen.find(query):
        pprint(doc)
    
    return
#with postal code
def search_one_address_with_postal_code(street, housenumber):
    query = { 'address.street': street, 'address.housenumber': housenumber }

    for doc in bergen.find(query):
        pprint(doc)
    
    return

In [28]:
#Searching for duplicates of Laguneveien 1

search_one_address('Laguneveien',1)

{'_id': ObjectId('58a02e4f8ac10fae74c1157c'),
 'address': {'city': 'Rådal',
             'housenumber': '1',
             'postcode': '5239',
             'street': 'Laguneveien'},
 'created': {'changeset': '26026343',
             'timestamp': '2014-10-12T14:10:49Z',
             'uid': '103253',
             'user': 'gormur',
             'version': '1'},
 'id': '3125931672',
 'pos': [60.2968652, 5.3311546],
 'type': 'node'}
{'_id': ObjectId('58a02e5e8ac10fae74c5d1b2'),
 'address': {'city': 'Rådal',
             'floor': '1',
             'housenumber': '1',
             'postcode': '5239',
             'street': 'Laguneveien'},
 'contact': {'facebook': 'https://www.facebook.com/arnasomogstrikkas'},
 'created': {'changeset': '36459796',
             'timestamp': '2016-01-09T09:12:57Z',
             'uid': '1965308',
             'user': 'FredrikLindseth',
             'version': '1'},
 'id': '3935489347',
 'pos': [60.2968112, 5.3317375],
 'type': 'node'}
{'_id': ObjectId('58a02e648ac

### Question for 1:1
I need help interpreting why there are multiple documents for the same address. Are they duplicates, or is this ok? What about the node references?

In [29]:
pipeline = [
    { '$match': { 'address.street': 'Laguneveien' } },
    { '$group': { 
            '_id': '$address.postcode', 'count' : {'$sum': 1 } 
        } 
    },
    {'$sort' : {'count' : -1} }
    
]

for doc in bergen.aggregate(pipeline):
    pprint(doc)

{'_id': '5239', 'count': 19}
{'_id': '5235', 'count': 1}


In [30]:
query = { 'address.street': 'Laguneveien', 'address.housenumber': '1' }

for doc in bergen.find(query):
    pprint(doc)

{'_id': ObjectId('58a02e4f8ac10fae74c1157c'),
 'address': {'city': 'Rådal',
             'housenumber': '1',
             'postcode': '5239',
             'street': 'Laguneveien'},
 'created': {'changeset': '26026343',
             'timestamp': '2014-10-12T14:10:49Z',
             'uid': '103253',
             'user': 'gormur',
             'version': '1'},
 'id': '3125931672',
 'pos': [60.2968652, 5.3311546],
 'type': 'node'}
{'_id': ObjectId('58a02e5e8ac10fae74c5d1b2'),
 'address': {'city': 'Rådal',
             'floor': '1',
             'housenumber': '1',
             'postcode': '5239',
             'street': 'Laguneveien'},
 'contact': {'facebook': 'https://www.facebook.com/arnasomogstrikkas'},
 'created': {'changeset': '36459796',
             'timestamp': '2016-01-09T09:12:57Z',
             'uid': '1965308',
             'user': 'FredrikLindseth',
             'version': '1'},
 'id': '3935489347',
 'pos': [60.2968112, 5.3317375],
 'type': 'node'}
{'_id': ObjectId('58a02e648ac

In [31]:
query = { 'address.street': 'Laguneveien', 'address.postcode': '5235', 'address.housenumber': 1 }

for doc in bergen.find(query):
    pprint(doc)

{'_id': ObjectId('58a02e4d8ac10fae74c05144'),
 'address': {'city': 'Rådal',
             'housenumber': 1,
             'postcode': '5235',
             'street': 'Laguneveien'},
 'created': {'changeset': '39294271',
             'timestamp': '2016-05-13T14:59:10Z',
             'uid': '1965308',
             'user': 'FredrikLindseth',
             'version': '6'},
 'id': '1652908136',
 'pos': [60.2962144, 5.3301382],
 'type': 'node'}


According to The Norwegian Mapping Authority, the correct postal code for Laguneveien is 5239. The 5235 document is incorrect.

In [33]:
#Finding duplicate addresses

pipeline = [
    { '$group': { 
            '_id': { 
                'street': '$address.street', 'housenumber': '$address.housenumber' 
            }, 
            'count' : {'$sum': 1 } 
        } 
    },
    { '$match': {'count': {'$gt': 1} } },
    {'$sort' : {'count' : -1} } ]

duplicate_addresses = []

for doc in bergen.aggregate(pipeline):
    duplicate_addresses.append(doc)

print("Number of potential duplicate addresses:", len(duplicate_addresses))

Number of potential duplicate addresses: 904


In [35]:
#bergen.aggregate( { '$group': '_id': { 'street': '$address.street' }})

In [37]:
#REMOVE
tweets = client.examples.tweets.aggregate([
        { "$group" : { "_id" : "$user.screen_name",
                      "unique_hashtags" : {
                            "$addToSet" : "$user.id"
                }
                     } },
        { "$sort" : { "_id" : -1} } ] )

for doc in tweets:
    print(doc)

{'_id': 'Catherinemull', 'unique_hashtags': [37486277]}


In [38]:
tmp_adr = bergen.aggregate([
        { "$group" : { "_id" : "$address.street", 
                      "postcodes" : {
                            "$addToSet" : "$address.postcode"
                }
                     } },
        { "$sort" : { "_id" : -1} } ] )

count = 0

for doc in tmp_adr:
    print(doc)
    count += 1
    if count > 10:
        break

{'_id': 'Øysteins gate', 'postcodes': ['5007']}
{'_id': 'Øykjeneset', 'postcodes': ['5258']}
{'_id': 'Øyjordsåsen', 'postcodes': ['5038']}
{'_id': 'Øyjordsveien', 'postcodes': ['5038']}
{'_id': 'Øyjordslien', 'postcodes': ['5038']}
{'_id': 'Øyjordsbotn', 'postcodes': ['5038']}
{'_id': 'Øvsttunåsen', 'postcodes': ['5223']}
{'_id': 'Øvsttunvegen', 'postcodes': ['5223']}
{'_id': 'Øvsttunlia', 'postcodes': ['5223']}
{'_id': 'Øvsttunbrekka', 'postcodes': ['5223']}
{'_id': 'Øvretveitvegen', 'postcodes': ['5239']}


In [39]:
tmp_adr = bergen.aggregate([
        { '$group' : { '_id' : '$address.street', 
                      'postcodes' : {
                            '$addToSet' : '$address.postcode'
                }
                     } },
        { '$sort' : { '_id' : -1} } ] )

count = 0

for doc in tmp_adr:
    print(doc)
    count += 1
    if count > 10:
        break

{'_id': 'Øysteins gate', 'postcodes': ['5007']}
{'_id': 'Øykjeneset', 'postcodes': ['5258']}
{'_id': 'Øyjordsåsen', 'postcodes': ['5038']}
{'_id': 'Øyjordsveien', 'postcodes': ['5038']}
{'_id': 'Øyjordslien', 'postcodes': ['5038']}
{'_id': 'Øyjordsbotn', 'postcodes': ['5038']}
{'_id': 'Øvsttunåsen', 'postcodes': ['5223']}
{'_id': 'Øvsttunvegen', 'postcodes': ['5223']}
{'_id': 'Øvsttunlia', 'postcodes': ['5223']}
{'_id': 'Øvsttunbrekka', 'postcodes': ['5223']}
{'_id': 'Øvretveitvegen', 'postcodes': ['5239']}


In [40]:
#Finding duplicate addresses

pipeline = [
    { '$group': { 
            '_id': { 'street': '$address.street', 'housenumber': '$address.housenumber' }, 
                'postcodes': { '$addToSet': '$address.postcode' }, 
            'count': {'$sum': 1 }
            }
        },
    { '$match': {'count': {'$gt': 1} } },
    {'$sort' : {'count' : -1} } ]

duplicate_addresses = []

for doc in bergen.aggregate(pipeline):
    duplicate_addresses.append(doc)

print("Number of potential duplicate addresses:", len(duplicate_addresses))

Number of potential duplicate addresses: 904


In [166]:
for item in duplicate_addresses:
    for pc in item['postcodes']:
        if pc[0:2] == 'NO':
            print(item)

{'_id': {'street': 'Edvard Griegs vei', 'housenumber': '3A'}, 'count': 4, 'postcodes': ['5059', 'NO-5059']}
{'_id': {'street': 'Helleveien', 'housenumber': '34'}, 'count': 2, 'postcodes': ['NO-5035', '5042']}


In [62]:
#Converting result of duplicate address query to Pandas dataframe for easier view

from pandas.io.json import json_normalize

df_duplicate_addresses = json_normalize(duplicate_addresses)

#Changing column names
df_duplicate_addresses.rename(columns={'_id.housenumber': 'housenumber','_id.street':'street'},inplace=True)
#Changing column order
df_duplicate_addresses = df_duplicate_addresses[['street', 'housenumber', 'count', 'postcodes']]

df_duplicate_addresses[0:10]

Unnamed: 0,street,housenumber,count,postcodes
0,,,596731,"[5021, 5835, 5281, 5116, 5918]"
1,Kanalveien,66.0,13,[5068]
2,Lyngmarka,,12,[5302]
3,Valkendorfsgaten,6.0,12,[5012]
4,Kalfarveien,37.0,11,[5022]
5,Kanalveien,64.0,10,[5068]
6,Sandslihaugen,10.0,10,[5254]
7,Lilandsveien,3.0,10,[5258]
8,Kanalveien,62.0,9,[5068]
9,Vetrlidsallmenningen,2.0,8,[5014]


In [215]:
#Getting some information about duplicate postcodes

df_different_postcodes = df_duplicate_addresses[df_duplicate_addresses['postcodes'].apply(lambda x: len(x) > 1)]
# Adding column for count of postcodes for the address
# df_different_postcodes['postcode_count'] = df_different_postcodes['postcodes'].str.len()
df_different_postcodes = df_different_postcodes.assign(postcode_count=df_different_postcodes['postcodes'].str.len())

print('Number of duplicate addresses with different postcodes:',len(df_different_postcodes))

df_different_postcodes.sort_values('postcode_count', ascending=False)[0:10]

Number of duplicate addresses with different postcodes: 252


Unnamed: 0,street,housenumber,count,postcodes,postcode_count
0,,,596731,"[5021, 5835, 5281, 5116, 5918]",5
136,Liavegen,12.0,3,"[5307, 5132, 5378]",3
39,Liavegen,10.0,4,"[5307, 5132, 5378]",3
84,Haugane,4.0,3,"[5307, 5212, 5360]",3
137,Liavegen,17.0,3,"[5307, 5132, 5378]",3
662,Liavegen,4.0,2,"[5307, 5378]",2
660,Djupedalen,22.0,2,"[5310, 5124]",2
656,Heiane,14.0,2,"[5131, 5350]",2
654,Jonas Lies vei,77.0,2,"[5053, 5021]",2
649,Lynghaugen,5.0,2,"[5038, 5350]",2


In [88]:
df_duplicate_addresses[df_duplicate_addresses['street'] == 'Laguneveien']

Unnamed: 0,street,housenumber,count,postcodes
15,Laguneveien,13,5,[5239]
105,Laguneveien,1,3,[5239]
193,Laguneveien,9,2,[5239]
384,Laguneveien,21,2,[5239]
461,Laguneveien,1,2,"[5239, 5235]"


In [244]:

df_duplicate_addresses[df_duplicate_addresses['postcodes'].str.join(',').isin(['5239'])]

Unnamed: 0,street,housenumber,count,postcodes
15,Laguneveien,13,5,[5239]
105,Laguneveien,1,3,[5239]
186,Rådalslien,95,2,[5239]
193,Laguneveien,9,2,[5239]
373,Grimseidvegen,60,2,[5239]
384,Laguneveien,21,2,[5239]
665,Fanavegen,113,2,[5239]
838,Råtun,1,2,[5239]
868,Steinsvikvegen,393,2,[5239]


In [65]:
for address in duplicate_addresses[0:100]:
    if len(address['postcodes']) > 1:
        print(address)

{'_id': {}, 'count': 596731, 'postcodes': ['5021', '5835', '5281', '5116', '5918']}
{'_id': {'street': 'Østre Murallmenningen', 'housenumber': '7'}, 'count': 5, 'postcodes': ['5013', '5012']}
{'_id': {'street': 'Liavegen', 'housenumber': '10'}, 'count': 4, 'postcodes': ['5307', '5132', '5378']}
{'_id': {'street': 'Edvard Griegs vei', 'housenumber': '3A'}, 'count': 4, 'postcodes': ['5059', 'NO-5059']}
{'_id': {'street': 'Parkveien', 'housenumber': '1'}, 'count': 3, 'postcodes': ['5007', '5063']}
{'_id': {'street': 'Vetrlidsallmenningen', 'housenumber': '6'}, 'count': 3, 'postcodes': ['5014', '5003']}
{'_id': {'street': 'Bryggen'}, 'count': 3, 'postcodes': ['5003', '5835']}
{'_id': {'street': 'Christies gate', 'housenumber': '11'}, 'count': 3, 'postcodes': ['5018', '5015']}
{'_id': {'street': 'Haugane', 'housenumber': '4'}, 'count': 3, 'postcodes': ['5307', '5212', '5360']}


In [63]:
#Printing out top 10 addresses with duplicates
pprint(duplicate_addresses[0:10])

[{'_id': {},
  'count': 596731,
  'postcodes': ['5021', '5835', '5281', '5116', '5918']},
 {'_id': {'housenumber': '66', 'street': 'Kanalveien'},
  'count': 13,
  'postcodes': ['5068']},
 {'_id': {'street': 'Lyngmarka'}, 'count': 12, 'postcodes': ['5302']},
 {'_id': {'housenumber': '6', 'street': 'Valkendorfsgaten'},
  'count': 12,
  'postcodes': ['5012']},
 {'_id': {'housenumber': '37', 'street': 'Kalfarveien'},
  'count': 11,
  'postcodes': ['5022']},
 {'_id': {'housenumber': '64', 'street': 'Kanalveien'},
  'count': 10,
  'postcodes': ['5068']},
 {'_id': {'housenumber': '10', 'street': 'Sandslihaugen'},
  'count': 10,
  'postcodes': ['5254']},
 {'_id': {'housenumber': '3', 'street': 'Lilandsveien'},
  'count': 10,
  'postcodes': ['5258']},
 {'_id': {'housenumber': '62', 'street': 'Kanalveien'},
  'count': 9,
  'postcodes': ['5068']},
 {'_id': {'housenumber': '2', 'street': 'Vetrlidsallmenningen'},
  'count': 8,
  'postcodes': ['5014']}]


### From 1on1 DONE

make this table for easier view. Suggest further investigation if Kanalveien is a housing project, or if there are other reasons for having many duplicate addresses on the street.

In [43]:
#Looking at all documents with the top duplicate address
search_one_address('Kanalveien','66')

{'_id': ObjectId('58a02e658ac10fae74c7c244'),
 'address': {'city': 'Bergen',
             'housenumber': '66',
             'postcode': '5068',
             'street': 'Kanalveien'},
 'created': {'changeset': '40278280',
             'timestamp': '2016-06-25T09:11:13Z',
             'uid': '1965308',
             'user': 'FredrikLindseth',
             'version': '1'},
 'id': '4264196717',
 'pos': [60.3621861, 5.3469652],
 'type': 'node'}
{'_id': ObjectId('58a02e658ac10fae74c7c245'),
 'address': {'city': 'Bergen',
             'housenumber': '66',
             'postcode': '5068',
             'street': 'Kanalveien'},
 'created': {'changeset': '40278280',
             'timestamp': '2016-06-25T09:11:13Z',
             'uid': '1965308',
             'user': 'FredrikLindseth',
             'version': '1'},
 'id': '4264196718',
 'pos': [60.3620529, 5.3469736],
 'type': 'node'}
{'_id': ObjectId('58a02e658ac10fae74c7c247'),
 'address': {'city': 'Bergen',
             'housenumber': '66',
     

In [44]:
#INCOMPLETE looking at one of the nodes

bergen.find_one({'id': '4264197029'})

{'_id': ObjectId('58a02e658ac10fae74c7c318'),
 'created': {'changeset': '40278280',
  'timestamp': '2016-06-25T09:11:20Z',
  'uid': '1965308',
  'user': 'FredrikLindseth',
  'version': '1'},
 'id': '4264197029',
 'pos': [60.3620325, 5.3469011],
 'type': 'node'}

### Question for 1:1 DONE
OK to use Pandas?
Yes

But do aggregations in MongoDB. Use Pandas mainly for nicer view.

[] Find out why Kanalveien is not in the list below.

In [245]:
df_duplicate_addresses.sort_values('street')

#Counting number of duplicate housenumbers per street
df_duplicate_addresses.groupby('street').agg({'count':'count'}).sort_values('count',ascending=False)

Unnamed: 0_level_0,count
street,Unnamed: 1_level_1
Strandgaten,42
Storevarden,32
Kong Oscars gate,29
Marken,20
Storhaugen,20
Solåsen,18
Djupedalen,17
Fagerbakken,16
St. Hanshaugen,15
Liavegen,14


In [48]:
#Looking at the duplicate addresses of the street with the most duplicates
df_duplicate_addresses.where(df_duplicate_addresses['street'] == 'Strandgaten').dropna()

Unnamed: 0,street,housenumber,count
35,Strandgaten,18,4.0
42,Strandgaten,68,4.0
89,Strandgaten,74,3.0
198,Strandgaten,84,2.0
210,Strandgaten,77,2.0
213,Strandgaten,72,2.0
235,Strandgaten,3,2.0
295,Strandgaten,212,2.0
316,Strandgaten,201,2.0
332,Strandgaten,71,2.0


Remove above, too deep investigations at this point. Explain potential plan for further research.

In [49]:
# #Taking a look at one of the duplicate addresses of Standgaten
# query = {'address.street': 'Strandgaten', 'address.housenumber': '74'}

# for doc in bergen.find(query):
#     pprint(doc)

<img src="data/kanalveien_66_duplicates.jpg" align="right" width="300">

Looking up the adress with the most duplicates (13), Kanalveien 66, on OpenStreetMap.org, it becomes apparent that the reason for the many duplicates is that there are multiple business located at that address, and each business seems to have gotten its own address. 

According to the [OSM wiki](http://wiki.openstreetmap.org/wiki/Addresses#How_to_map_addresses), the policy on duplicate addresses is unclear in such cases: "However, there is still some debate on that point (see for example Address information in POI *and* building? on help.openstreetmap.org). Also, the community in some countries has established their own rules."

According to the address page of the OSM wiki, in mid-2014 all Norwegian official addresses were released to the public. Efforts is being made by OSM volunteers to include the released data in OSM, and the progress is being tracked using a tool called [Beebeetle](http://osm.beebeetle.com/addrnodeimportstatus.php). As of January 7, 2017, the Bergen import is listed as 99.84% complete. 1 known address duplicate is listed on the site, for Solheimsgaten [SJEKK].

*FOR CONCLUSSION*

There are 184 address documents without street name in the dataset. Further investigation into those documments is recommended.


To address the duplicate issue in detail, I suggest following up by looking at the individual duplicate addresses. You could for example start by looking at the three streets with the most individual duplicate addresses to see if there are any useful patterns to be found.

In [50]:
#INCOMPLETE Looking at contributors

tmp_agg = bergen.aggregate([  
        { "$group" : { 
                "_id" : { "uid": "$created.uid", "username": "$created.user" },"count" : { "$sum" : 1} } },
        { "$sort" : { "count" : -1 } },
         { "$limit" : 10 }
#         { "$project" : { "_id": 0, "user": "$created.user" } } 
    ])

for doc in tmp_agg:
    print(doc)

{'_id': {'username': 'FredrikLindseth_import', 'uid': '2114448'}, 'count': 140794}
{'_id': {'username': 'frokor_import', 'uid': '2836853'}, 'count': 133655}
{'_id': {'username': 'gormur', 'uid': '103253'}, 'count': 80243}
{'_id': {'username': 'Christian Madsen', 'uid': '992708'}, 'count': 39789}
{'_id': {'username': 'daviesp12', 'uid': '722193'}, 'count': 36440}
{'_id': {'username': 'frokor', 'uid': '170061'}, 'count': 31427}
{'_id': {'username': 'FredrikLindseth', 'uid': '1965308'}, 'count': 29969}
{'_id': {'username': 'Gazer75', 'uid': '715936'}, 'count': 22168}
{'_id': {'username': 'cmeeren_import', 'uid': '3119148'}, 'count': 19287}
{'_id': {'username': 'gisle', 'uid': '8313'}, 'count': 16081}


### Sources

SSB: https://ssb.no/befolkning/statistikker/familie/aar/2016-04-14  
SSB: https://www.ssb.no/befolkning/samordnet-statistikk-for-husholdninger-og-boliger  
Kartverket: http://www.seeiendom.no/

OSM resource links:  
http://wiki.openstreetmap.org/wiki/Addresses#How_to_map_addresses  
http://wiki.openstreetmap.org/wiki/Addresses#Norway  
http://osm.beebeetle.com/addrnodeimportstatus.php

OSM links:  
Kanalveien 66 http://www.openstreetmap.org/search?query=kanalveien%2066#map=19/60.36224/5.34696  
