### Problems Encountered in the Map

                                                
After downloading the Utrecht map from mapzen.com, I created a sample from it and ran my audit script against the sample to inspect the data. The dataset is already fairly clean; the minor issues I found are discussed below.

* Abbreviated street names. The last part of a street name is sometimes abbreviated, for example "W.Z.", while elsewhere it is written in full as "Westzijde". The same holds for "O.Z." and "Oostzijde". I decided to expand the abbreviations to their full names.

* “Incorrect” postal codes. Postal codes in the Utrecht area all begin with “35”, yet a large portion of the postal codes in the dataset fall outside this range. It appears that small villages close to Utrecht are also included in the dataset. I analysed this below with pymongo.
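
The abbreviation fix from the first bullet could look roughly like the sketch below. It is not the actual audit script: the mapping covers only the two abbreviations named above, `expand_street_name` is a hypothetical helper, and the street name in the usage note is made up.

```python
import re

# Assumed mapping: only the two abbreviations mentioned in the text.
ABBREVIATIONS = {
    "W.Z.": "Westzijde",
    "O.Z.": "Oostzijde",
}

# Match an abbreviation only when it is the last part of the street name.
SUFFIX_RE = re.compile(r"\s*(W\.Z\.|O\.Z\.)$")


def expand_street_name(name):
    """Return the street name with a trailing abbreviation written in full."""
    match = SUFFIX_RE.search(name)
    if match is None:
        return name  # nothing to expand
    prefix = name[:match.start()]
    return (prefix + " " + ABBREVIATIONS[match.group(1)]).strip()
```

For a made-up name such as `"Voorbeeldstraat W.Z."`, the helper returns `"Voorbeeldstraat Westzijde"`. Names that are already written in full are returned unchanged.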

In [21]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.test
pipeline = ([{"$match":{"address.postcode":{"$exists":1}}}, 
            {"$group":{"_id":"$address.postcode", "count":{"$sum":1}}}, 
            {"$sort":{"count":-1}},
            {"$limit":3 }])
top_pc = db.utrecht.aggregate(pipeline)
for pc in top_pc:
    print pc

{u'count': 1016, u'_id': u'3706AA'}
{u'count': 701, u'_id': u'3621VC'}
{u'count': 588, u'_id': u'3513EW'}


In [22]:
pipeline = ([{"$match":{"address.city":{"$exists":1}}}, 
             {"$group":{"_id":"$address.city", "count":{"$sum":1}}}, 
             {"$sort":{"count":-1}},
             {"$limit":3 }])
top_cities = db.utrecht.aggregate(pipeline)
for city in top_cities:
    print city

{u'count': 289772, u'_id': u'Utrecht'}
{u'count': 61854, u'_id': u'Nieuwegein'}
{u'count': 52800, u'_id': u'Zeist'}


It turns out that most addresses are located in the city of Utrecht, but a substantial share belongs to surrounding villages. At first sight it is strange that the top 2 postal codes are not from the city of Utrecht (they do not start with 35), but this may simply be because some postal codes cover more addresses than others.
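
To see how the addresses are spread over postal code areas, the counts can be summed per two-digit prefix. The sketch below does this in plain Python for the three postal codes listed above. `count_by_prefix` is a hypothetical helper, and for a full picture its input would have to come from the same aggregation without the `$limit` stage.

```python
from collections import Counter


def count_by_prefix(postcode_counts, prefix_length=2):
    """Sum address counts per leading postal code digits."""
    totals = Counter()
    for postcode, count in postcode_counts.items():
        totals[postcode.strip()[:prefix_length]] += count
    return totals


# The three most frequent postal codes reported above.
top_postcodes = {"3706AA": 1016, "3621VC": 701, "3513EW": 588}

by_prefix = count_by_prefix(top_postcodes)
for prefix, count in by_prefix.most_common():
    print("%s: %d" % (prefix, count))
```

Comparing the total for the "35" prefix with the other prefixes shows how much of the data falls outside that range.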

### Overview of the Data

Number of objects

In [29]:
totalRecords = db.utrecht.find().count()
print totalRecords

7120051


Number of nodes

In [15]:
db.utrecht.find({"type":"node"}).count()

6193678

Number of ways

In [16]:
db.utrecht.find({"type":"way"}).count()

926373

Unique users

In [17]:
len(db.utrecht.distinct("created.user")) 

838

Top 3 users and their contribution percentage

In [37]:
pipeline = ([{"$match":{"created.user":{"$exists":1}}}, 
             {"$group":{"_id":"$created.user", "count":{"$sum":1}}}, 
             {"$sort":{"count":-1}},
             {"$limit":3 }])
top_users = db.utrecht.aggregate(pipeline)
topUserCount = 0
for user in top_users:
    print user
    topUserCount = topUserCount + user["count"]
print "Contribution percentage:", (topUserCount * 100 / totalRecords), "%"


{u'count': 1440411, u'_id': u'Gertjan Idema_BAG'}
{u'count': 939629, u'_id': u'PeeWee32_BAG'}
{u'count': 932991, u'_id': u'3dShapes'}
Contribution percentage: 46 %
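
The cell above prints only the combined percentage, and because it runs under Python 2 the division is truncated to a whole number. The sketch below computes a share per user with float division, using the counts printed above. `contribution_shares` is a hypothetical helper and not part of the original notebook.

```python
def contribution_shares(user_counts, total_records):
    """Return (user, percentage) pairs, using float division."""
    return [(user, 100.0 * count / total_records)
            for user, count in user_counts]


# Counts copied from the output above.
top_users = [
    ("Gertjan Idema_BAG", 1440411),
    ("PeeWee32_BAG", 939629),
    ("3dShapes", 932991),
]
total_records = 7120051

shares = contribution_shares(top_users, total_records)
for user, pct in shares:
    print("%s: %.1f %%" % (user, pct))

combined = sum(pct for _, pct in shares)
print("Combined: %.1f %%" % combined)
```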


Top amenities

In [24]:
pipeline = ([{"$match":{"amenity":{"$exists":1}}}, 
             {"$group":{"_id":"$amenity", "count":{"$sum":1}}}, 
             {"$sort":{"count":-1}},
             {"$limit":3 }])
top_amenities = db.utrecht.aggregate(pipeline)
for amenity in top_amenities:
    print amenity

{u'count': 2702, u'_id': u'parking'}
{u'count': 1896, u'_id': u'bench'}
{u'count': 806, u'_id': u'restaurant'}


Top cuisines

I am interested in the top 10 cuisines in Utrecht. The dataset provides a "cuisine" tag that can be used for that purpose. After retrieving the information from Mongo, it is plotted with pyplot.

In [25]:
pipeline = ([{"$match":{"cuisine":{"$exists":1}}}, 
             {"$group":{"_id":"$cuisine", "count":{"$sum":1}}}, 
             {"$sort":{"count":-1}}, 
             {"$limit":10}])
top_cuisine = db.utrecht.aggregate(pipeline)
for cuisine in top_cuisine:
    print cuisine

{u'count': 29, u'_id': u'chinese'}
{u'count': 28, u'_id': u'burger'}
{u'count': 26, u'_id': u'italian'}
{u'count': 25, u'_id': u'pizza'}
{u'count': 20, u'_id': u'regional'}
{u'count': 18, u'_id': u'kebab'}
{u'count': 17, u'_id': u'greek'}
{u'count': 14, u'_id': u'sandwich'}
{u'count': 12, u'_id': u'asian'}
{u'count': 12, u'_id': u'japanese'}


![Top 10 cuisines in Utrecht](plots/utrecht_top_10_cuisines.jpg)
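
The plotting code itself is not shown in the notebook. A minimal sketch of how such a chart might be produced with pyplot is given below. It assumes matplotlib is installed. The helper names, the axis label, and the horizontal bar chart are assumptions and may differ from the original plot.

```python
def to_bar_data(rows):
    """Split aggregation rows into parallel lists of labels and counts."""
    labels = [row["_id"] for row in rows]
    counts = [row["count"] for row in rows]
    return labels, counts


def plot_top_cuisines(rows, out_path):
    """Draw a horizontal bar chart and save it to out_path."""
    # Imported here so that to_bar_data works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    labels, counts = to_bar_data(rows)
    positions = list(range(len(labels)))
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(positions, counts)
    ax.set_yticks(positions)
    ax.set_yticklabels(labels)
    ax.invert_yaxis()  # put the largest count at the top
    ax.set_xlabel("Number of tagged places")  # assumed label
    ax.set_title("Top 10 cuisines in Utrecht")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Rows copied from the aggregation output above.
rows = [
    {"_id": "chinese", "count": 29},
    {"_id": "burger", "count": 28},
    {"_id": "italian", "count": 26},
    {"_id": "pizza", "count": 25},
    {"_id": "regional", "count": 20},
    {"_id": "kebab", "count": 18},
    {"_id": "greek", "count": 17},
    {"_id": "sandwich", "count": 14},
    {"_id": "asian", "count": 12},
    {"_id": "japanese", "count": 12},
]
```

Calling `plot_top_cuisines(rows, out_path)` with a path of your choice writes the chart to disk. The file format follows the extension of `out_path`.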

### Other ideas about the datasets

Encourage user participation through gamification

When viewing the top users of the Utrecht data, I noticed the following:

* The top 2 users both have the word "BAG" in their name
* The top 3 users together contribute 46 % of the data
* The total number of users is 838
* User data is not emphasized on the OpenStreetMap website.

The word "BAG" refers to "Basisregistraties Adressen en Gebouwen", which is Dutch for "basic registrations of addresses and buildings". It contains all official address and building data in the Netherlands and is maintained by all municipalities and [Kadaster](http://www.kadaster.nl/web/Themas/Registraties/BAG-1.htm). So, the top 2 users are probably working for Kadaster. The top 3 users' contribution percentage is quite high, which means the quality of the dataset depends on a small group of people. It would be better to have a more evenly spread distribution. 
To accomplish this, gamification could be used, for example a leaderboard of top users for a certain region. 
When a user adds data to a region, they get points based on the amount and quality of that data. An indication of the quality could be obtained by running the uploaded data against a basic validation template. 
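
The scoring idea could be prototyped as in the sketch below. Everything in it is hypothetical: the validation template, the one-point-per-valid-field rule, the user names, and the street name are placeholders. The only fact it relies on is that Dutch postal codes consist of four digits followed by two letters.

```python
import re
from collections import defaultdict

# Dutch postal codes: four digits followed by two letters.
POSTCODE_RE = re.compile(r"^\d{4} ?[A-Z]{2}$")

# Hypothetical validation template: field name -> check function.
VALIDATION_TEMPLATE = {
    "postcode": lambda value: bool(POSTCODE_RE.match(value)),
    "street": lambda value: bool(value.strip()),
}


def score_contribution(address):
    """Award one point per template field that is present and valid."""
    points = 0
    for field, is_valid in VALIDATION_TEMPLATE.items():
        value = address.get(field)
        if value is not None and is_valid(value):
            points += 1
    return points


def build_leaderboard(contributions):
    """Sum points per user; return (user, points) with the highest first."""
    totals = defaultdict(int)
    for user, address in contributions:
        totals[user] += score_contribution(address)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up contributions for illustration only.
contributions = [
    ("alice", {"street": "Voorbeeldstraat", "postcode": "3513EW"}),
    ("bob", {"street": "Voorbeeldstraat", "postcode": "35"}),
    ("alice", {"street": "", "postcode": "3706AA"}),
]
leaderboard = build_leaderboard(contributions)
```

Because points are tied to validation, a user who uploads many incomplete records scores lower than one who uploads fewer complete ones. That matches the goal of rewarding quality as well as amount.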
