# Open Street Map Analysis
### Course: Data Wrangling with MongoDB

Content: Processing of OSM Dataset
- Problems encountered in your map
- Overview of the data
- Other ideas about the dataset


Note: The report for this project is stored in a separate file (Data_Analyst_ND_Project_3_ProcessData.html).

## Project Report

### Problems encountered in your map
__Guideline:__ Student response describes the challenges encountered while auditing, fixing and processing the dataset for the area of their choice. Some of the problems encountered during data audit are cleaned programmatically.

#### Data Problems
- Updating/cleaning street names - abbreviations as part of the complete name: Early in the project I had issues with specific abbreviations of street names. E.g. the replacement "Str" -> "Street" (abbreviation -> street name) also matched inside names that were already complete and turned "Street" into "Streeteet". Also, if a street name contained more than one abbreviation, my first update function replaced only the first one and left the others unchanged.


- Updating/cleaning street names - misspelled abbreviations: If an abbreviation is misspelled, it is not recognized and the name is not updated correctly.


- Postal codes are not in the correct format: Some post codes are in an incorrect format, for example a "from" - "to" range. Most post codes are fine, though.


- Postal codes are not only from within the selected county: Surprisingly, many data points in the data set lie outside the map area that was selected.


- Amenities count (for different amenity values) for XML and JSON format not equal: This is due to the migration and restructuring of the information, which depends on the different types of "objects" (nodes, ways, relations) and the information they include.
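
The street-name problem from the first bullet above can be avoided by matching abbreviations only as whole words and by replacing every match, not just the first. The following is a minimal sketch of that idea, not the project's actual update function; the `ABBREVIATIONS` mapping and the name `update_street_name` are assumptions made for illustration.

```python
import re

# Hypothetical mapping for illustration; the project used its own list.
ABBREVIATIONS = {
    "Str": "Street",
    "St": "Street",
    "Ave": "Avenue",
    "Rd": "Road",
    "Dr": "Drive",
}

# \b restricts matches to whole words, so "Str" no longer matches inside
# "Street". An optional trailing dot ("St.") is consumed with the match.
_ABBR_RE = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")\b\.?"
)


def update_street_name(name):
    """Expand every known abbreviation in a street name."""
    return _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], name)


print(update_street_name("Main Str"))     # abbreviation is expanded
print(update_street_name("Main Street"))  # complete name stays unchanged
print(update_street_name("Oak St. Rd"))   # both abbreviations are expanded
```

A whole-word match still cannot tell "St" (Street) from "St" (Saint), and misspelled abbreviations are not caught, so a manual audit of the results remains necessary.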

#### Other Problems
- Complexity of formatting XML data to JSON strings: Converting data from XML to JSON is quite a hard job, especially if you want a cleaner data structure in the target format.


- Issues during restructuring of data - e.g. schools vs buildings: Schools are schools, and schools are also buildings. Therefore one needs to apply one consistent rule through all the conversion tasks. I decided to flag schools as "addinfo.type" = "schools" and, where present, to add the building information in a sub-document. Similar problems occurred for other objects as well. (An example of such a case can be seen in the file issue_example.txt.)


- Diversity of data: The data is very diverse, and it is not easy to find a common format which fits all needs.


- Import of data to MongoDB: Early in the process I had some issues getting the structure of the JSON string right, because the data contains so many records, different cases, and sub-documents.


- Issues while processing the data file due to file size and resulting RAM utilization: I have a business laptop which is not really meant for heavy workloads, so RAM utilization was at its peak. However, I managed to get things done by freeing up variables as needed.
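
One common way to limit RAM usage with large OSM files is to parse the XML incrementally and release each element once it has been handled. The sketch below illustrates this with `xml.etree.ElementTree.iterparse`; it is not the code used in the project, and the function name and the in-memory sample data are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_top_level_elements(source):
    """Count node/way/relation elements while parsing the file incrementally."""
    counts = {"node": 0, "way": 0, "relation": 0}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in counts:
            counts[elem.tag] += 1
            # Release the element's children and attributes after processing.
            elem.clear()
    return counts


# Tiny in-memory stand-in for an OSM file (illustrative data only).
sample = io.BytesIO(
    b'<osm>'
    b'<node id="1" lat="43.0" lon="-88.2"/>'
    b'<node id="2" lat="43.1" lon="-88.3"/>'
    b'<way id="3"><nd ref="1"/><nd ref="2"/></way>'
    b'</osm>'
)
print(count_top_level_elements(sample))
```

For a real file, the same function can be called with the path of the OSM XML file instead of the `BytesIO` object.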

### Overview of the Data based on XML File

All the sections below are the output of programmatic investigations of the data. The code itself can be found in the _Data_Analyst_ND_Project_3_ProcessData.html_ file. The data has been retrieved using __Python and various Python modules supporting XML processing__.

Size of OSM XML data file: 
192404 KB

### Overview of the Data based on MongoDB DB

All the sections below are the output of programmatic investigations of the data. The code itself can be found in the _Data_Analyst_ND_Project_3_ProcessData.html_ file. The data has been retrieved using __Python, MongoDB and the pymongo Python module__.

A brief overview of how the JSON data is structured can be found in the json_structure.txt file.

Size of OSM JSON data file: 
243607 KB

#### Number of Documents

In [None]:
# Code Example
# count_documents() is used because PyMongo 4 removed the cursor count() method
num_docs  = db.wau_county.count_documents({})
num_nodes = db.wau_county.count_documents({"type":"node"})
num_ways  = db.wau_county.count_documents({"type":"way"})
num_rels  = db.wau_county.count_documents({"type":"relation"})

The numbers for Nodes, Ways and Relations are consistent with the numbers gathered earlier on the XML basis (see Key Types & Count). So far, this indicates that the migration from XML to JSON as well as the import into MongoDB was successful.

#### Users

In [None]:
# Code Example
print('--------------- Users ------------------')
unique_users = len(db.wau_county.distinct("creation.user"))
print('Number of unique users: ' + str(unique_users))
print('----------------------------------------')
aggregation_query = [{"$group" : {"_id" : "$creation.user", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}},
                     {"$limit" : 10}]
top_users = db.wau_county.aggregate(aggregation_query)
print('Top 10 users by number of posts:')
for user in top_users:
    print(str(user["_id"]) + ": " + str(user["count"]))
print('----------------------------------------')

The top 10 users and the number of their posts are consistent with the numbers gathered earlier on the XML basis (see Users). Again, this indicates that the migration from XML to JSON as well as the import into MongoDB was successful.

#### Amenity Type

In [None]:
# Code Example
print('------------------ Amenity Types -------------------')
print('Below is a list of different amenities and their')
print('occurrence in the data.')
print('----------------------------------------------------')
aggregation_query = [
                     {"$match" : { "addinfo.type" : {"$exists" : True } }},
                     {"$group" : {"_id" : "$addinfo.type", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}},
                     {"$limit" : 10}
                    ]
info_types = db.wau_county.aggregate(aggregation_query)
print('First 10 amenity types in the list:')
for info in info_types:
    print(str(info["_id"]) + ": " + str(info["count"]))
print('----------------------------------------')

The amenities list looks a bit different now. The reason is that the data has been restructured: amenity information, but also other information such as highway, building, leisure, natural, etc., has been combined into a new addinfo.type attribute. That attribute describes what the data point is, independent of the main type (node, relation, way). It provides a more comprehensive insight into the data and makes searching for specific types of nodes, ways and relations easier.

However, there was an issue I had to investigate in detail. Sometimes the count in the XML data was higher than the one in MongoDB. Examples are school, restaurant and shops. During my investigation I found that the migration procedure from XML to JSON sometimes overwrote values for "addinfo.type" in the JSON format, for example if a restaurant was also a butcher shop.

One example, in which an "amenity" = "parking" was migrated to "addinfo.type" = "leisure", is included in the submission in the file issue_example.txt. Here, due to the restructuring of some data during the migration process, the value for "addinfo.type" was overwritten by "addinfo.type" = "leisure".
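
One possible way to avoid losing values when several type-like tags are present is to keep all of them in a list next to the primary type. The following is only a sketch of that idea and not the migration code used in this project; the `TYPE_KEYS` list, the `all_types` field and both function names are assumptions.

```python
# Tag keys that may describe what an object is. This list is an assumption.
TYPE_KEYS = ["amenity", "shop", "leisure", "building", "highway", "natural"]


def collect_types(tags):
    """Return every type-like value found in the tags, in TYPE_KEYS order."""
    types = []
    for key in TYPE_KEYS:
        value = tags.get(key)
        if value is not None and value not in types:
            types.append(value)
    return types


def shape_addinfo(tags):
    """Build an addinfo sub-document that keeps all values, not only one."""
    types = collect_types(tags)
    addinfo = {}
    if types:
        addinfo["type"] = types[0]    # primary type: the first match wins
        addinfo["all_types"] = types  # hypothetical field keeping every value
    return addinfo


print(shape_addinfo({"amenity": "restaurant", "shop": "butcher"}))
```

Because MongoDB matches an equality condition against each item of an array field, a query such as `{"addinfo.all_types": "restaurant"}` would then also find the restaurant that is a butcher shop as well.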

#### Post Codes

In [None]:
print('------------------ Post Codes -------------------')
print('Below is a list of all post codes and their')
print('occurrence in the data.')
print('-------------------------------------------------')
aggregation_query = [
                     {"$match" : { "address.postcode" : {"$exists" : True }}},
                     {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}},
                     {"$limit" : 10}
                    ]
post_codes = db.wau_county.aggregate(aggregation_query)
print('First 10 post codes in the list:')
for post_code in post_codes:
    print(str(post_code["_id"]) + ": " + str(post_code["count"]))
print('----------------------------------------')

Result is equal to XML based result.

In [None]:
print('-------------- Correct Post Code Types ---------------')
print('Below is a list of post codes which belong to')
print('Waukesha County and their occurrence in the data.')
print('----------------------------------------------------')
aggregation_query = [
                     {"$match" : { "address.postcode" : {"$exists" : True } } },
                     {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1} } },
                     {"$sort" : {"count":-1} }
                    ]
post_codes = db.wau_county.aggregate(aggregation_query)
print('Post codes inside the county:')
for post_code in post_codes:
    # this checks if the post code is in the post codes of waukesha county area
    # (post_code_expected is defined outside this excerpt)
    if post_code["_id"] in post_code_expected:
        print(str(post_code["_id"]) + ": " + str(post_code["count"]))
print('----------------------------------------')

Result is equal to XML based result.

In [None]:
print('-------------- Wrong Post Code Types ---------------')
print('Below is a list of post codes which do not belong to')
print('Waukesha County and their occurrence in the data.')
print('----------------------------------------------------')
c = 0
aggregation_query = [
                     {"$match" : { "address.postcode" : {"$exists" : True }}},
                     {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}}
                    ]
post_codes = db.wau_county.aggregate(aggregation_query)
print('First 10 post codes in the list:')
for post_code in post_codes:
    # this checks if the post code is outside the post codes of waukesha county area
    if post_code["_id"] not in post_code_expected:
        print(str(post_code["_id"]) + ": " + str(post_code["count"]))
        # count only the post codes that were actually printed
        c = c+1

    # exit loop after 10 printed post codes
    if c == 10:
        break
print('----------------------------------------')

Result is equal to XML based result.
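
Post codes in an unexpected format, such as the "from" - "to" ranges mentioned in the problems section, can be detected with a simple pattern check. The sketch below assumes US ZIP codes (five digits, optionally followed by a four-digit ZIP+4 suffix); the function name `clean_postcode` is hypothetical and not part of the project code.

```python
import re

# Five-digit ZIP code, optionally followed by a ZIP+4 suffix.
_ZIP_RE = re.compile(r"^\s*(\d{5})(?:-\d{4})?\s*$")


def clean_postcode(raw):
    """Return the five-digit ZIP code, or None if the format is different."""
    match = _ZIP_RE.match(raw)
    return match.group(1) if match else None


# Illustrative values only; a range is rejected instead of being guessed at.
for value in ["53204", "53204-1234", "53186-53188"]:
    print(value + " -> " + str(clean_postcode(value)))
```

Values for which the function returns `None` could be collected in a separate list for manual review rather than being dropped silently.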

In [None]:
print('-------------- Cities in the dataset ---------------')
print('Below is a list of different cities mentioned in the')
print('data and their occurrence.')
print('----------------------------------------------------')
aggregation_query = [{"$match" : { "address.city" : {"$exists" : True }}},
                     {"$group" : {"_id" : "$address.city", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}},
                     {"$limit" : 10}
                    ]
cities = db.wau_county.aggregate(aggregation_query)
print('First 10 cities in the list:')
for city in cities:
    print(str(city["_id"]) + ": " + str(city["count"]))
print('----------------------------------------')

The counts for the different cities changed between the earlier analysis of the XML data and now. The reason for the different numbers is the data cleaning done when transferring the data from XML to JSON.

Example: Waukesha
- Before data cleaning:
    - Waukesa: 1
    - Waukesha: 31
    - Waukesha, WI: 1
    - waukesha: 1
- After data cleaning:
    - Waukesha: 34
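
The kind of cleaning that merges the four variants above into one value can be sketched as follows. This is an illustration under assumptions, not the project's cleaning code: the `CITY_FIXES` mapping and the function name `clean_city` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical corrections for misspellings found during the audit.
CITY_FIXES = {"Waukesa": "Waukesha"}


def clean_city(raw):
    """Normalise a city value: drop a state suffix, fix case and known typos."""
    name = re.sub(r",\s*[A-Za-z]{2}$", "", raw.strip())  # "Waukesha, WI"
    name = name.title()                                   # "waukesha"
    return CITY_FIXES.get(name, name)                     # "Waukesa"


# Counts taken from the example above.
before = {"Waukesa": 1, "Waukesha": 31, "Waukesha, WI": 1, "waukesha": 1}
after = Counter()
for city, count in before.items():
    after[clean_city(city)] += count
print(dict(after))  # {'Waukesha': 34}
```

Note that `str.title()` is a blunt tool: it would also change names with inner capitals (for example "McFarland" becomes "Mcfarland"), so such cases would need entries in the corrections mapping.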
        

However, what is strange when looking at the different cities in the data is that the cities with the highest occurrences, e.g. Milwaukee, do not even belong to Waukesha County. To be sure that this is not an error on my side, I double-checked everything from the beginning again: the selection on OpenStreetMap, the list of post codes, and so on.

In [None]:
aggregation_query = [{"$match" : {"addinfo.religion" : {"$exists":1} , "addinfo.type" : "place of worship"}},
                     {"$group" : {"_id" : "$addinfo.religion", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}}
                    ]
places_of_worship = db.wau_county.aggregate(aggregation_query)

print('-----------------------------------------')
print('---------- Places of Worship ------------')
print('- Information on the RELIGION of Places -')
print('-----------------------------------------')
for place in places_of_worship:
    print(str(place["_id"]).title() + ": " + str(place["count"]))
print('----------------------------------------')

The output above displays all the religions of the places of worship mentioned in the data. By far, Christian places are the most frequent, with 52 appearances in the data. In the next section we take a closer look at those Christian places and the denominations they follow.

In [None]:
aggregation_query = [{"$match" : {"addinfo.religion" : "christian" , "addinfo.type" : "place of worship"}},
                     {"$group" : {"_id" : "$addinfo.denomination", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}}
                    ]
places_of_worship = db.wau_county.aggregate(aggregation_query)

print('---------------------------------------------------')
print('--------------- Places of Worship -----------------')
print('- Information on the DENOMINATION of Christian\'s -')
print('---------------------------------------------------')
for place in places_of_worship:
    print(str(place["_id"]).title() + ": " + str(place["count"]))
print('---------------------------------------------------')

For a good portion of all Christian places, no information about the denomination is provided. However, from the data we have, we can see that the Lutheran and Catholic denominations are the most common.

__Note:__ What we can see here as well is a good example of bad data quality. One denomination record has the value "Wisconsin\_Evangelical\_Lutheran\_Synod\_(Wels)". It names a specific church body rather than a general denomination, so it is counted separately from the other Lutheran records.

### Other ideas about the datasets

#### Restaurant Objects Improvement

In [None]:
aggregation_query = [{"$match" : {"addinfo.type" : "restaurant"}},
                     {"$group" : {"_id" : "$addinfo.cuisine", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}},
                     {"$limit" : 10}
                    ]
restaurants = db.wau_county.aggregate(aggregation_query)

c=0
print('----------------------------------------')
print('-------- Cuisine in Restaurants --------')
print('----------------------------------------')
for restaurant in restaurants:
    print(str(restaurant["_id"]).title() + ": " + str(restaurant["count"]))
    c = c+restaurant["count"]
print('----------------------------------------')

# note: because of the $limit above, c covers only the ten largest cuisine groups
print("Number of Restaurants in the data: " + str(c))

__What needs to be improved?__
What we can see here is that the map already contains a good number of restaurants for the Waukesha County area: 372. However, there is definitely room for improvement in the specification of the cuisine. Only about 130 restaurants provide cuisine information, whereas about 240 do not.


__How to improve?__
It is easy to find out which restaurants do not have cuisine information and to add further information such as website, phone, name and operator to that report, so that potential "data cleaners" can request the missing data from a suitable source. While the report itself (a sample report is provided below; it can be improved to further support the "data cleaner") is a simple job for a data analyst, retrieving and cleaning the data requires manual effort and might therefore be quite a challenge. _On the sample report below: The name of most restaurants already tells a lot about the cuisine offered, or at least about part of it. E.g. Cold Spoons Gelato, Cranky Al's Bakery and Pizza, Crisp Pizza Bar and Lounge, Depot Snack Shop, Chiang Mai Thai, Chipotle Bar and Grill, Pitch's BBQ,..._


__What are the benefits?__
As restaurants in particular are likely to be interesting targets for map users, improved information would help to increase user satisfaction with OpenStreetMap.

In [None]:
# SAMPLE REPORT FOR DATA CLEANERS
restaurants_wo_cusine = db.wau_county.find({"addinfo.type"    :"restaurant", 
                                            "addinfo.cuisine" : {"$exists":0}
                                           })

all_attrs = {"id" : "ID",
        "name" : "Name",
        "operator" : "Operator",
        "phone" : "Phone",
        "website" : "Website",
        "opening_hours" : "Opening Hours"
       }

# as this is only a sample report, the output lines are limited to 15
c = 0
c_max = 15

print("---------------------------------------------------")
print("---- SAMPLE REPORT FOR FILLING RESTAURANT INFO ----")
print("---------------------------------------------------")

for restaurant in restaurants_wo_cusine:
    
    # counter for sample report
    c = c+1
    if c > c_max:
        break
    
    # write results
    for attr in all_attrs:
        # skip attributes which are missing for this restaurant
        if attr in restaurant:
            print("%s: %s" % (all_attrs[attr], restaurant[attr]))
    print('--------------------------------------------')
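
The observation that restaurant names often hint at the cuisine could be turned into suggestions for the "data cleaner". The sketch below is only an illustration of that idea: the keyword list `CUISINE_HINTS` and the function name `suggest_cuisines` are assumptions, and any suggestion would still need manual confirmation before it is written back to the map.

```python
# Hypothetical keyword -> cuisine hints; a real list needs manual curation.
CUISINE_HINTS = [
    ("pizza", "pizza"),
    ("thai", "thai"),
    ("bbq", "barbecue"),
    ("gelato", "ice_cream"),
    ("bakery", "bakery"),
]


def suggest_cuisines(name):
    """Return cuisine suggestions for keywords found in a restaurant name."""
    lowered = name.lower()
    return [cuisine for keyword, cuisine in CUISINE_HINTS if keyword in lowered]


# Restaurant names taken from the sample report discussion above.
for name in ["Chiang Mai Thai", "Cranky Al's Bakery and Pizza", "Depot Snack Shop"]:
    print(name + ": " + ", ".join(suggest_cuisines(name)))
```

Names without a matching keyword simply get no suggestion, which keeps the approach conservative.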

#### WAY Objects Improvement

In [None]:
aggregation_query = [{"$match" : {"addinfo.type" : {"$exists":1}, "type" : "way" } },
                     {"$group" : {"_id" :"$addinfo.type", "count" : {"$sum" : 1}}},
                     {"$sort" : {"count":-1}}
                    ]
ways = db.wau_county.aggregate(aggregation_query)

print('----------------------------------------')
print('------- Different Types of Ways --------')
print('----------------------------------------')
for way in ways:
    print(str(way["_id"]).title() + ": " + str(way["count"]))
print('----------------------------------------')

__What needs to be improved?__

Above you can see a list of the different types of ways inside the dataset. Obviously, ways are not only streets, roads, avenues, paths, and so on. All kinds of amenities have found their way into the way tag information, although, as per the definition of Way objects in the OpenStreetMap Wiki, Way objects are not designed to store such data. See the link to the OpenStreetMap Wiki below.

__How to improve?__

It seems feasible to clean the data programmatically to a certain degree. By combining different pieces of information (the Way object itself and its related/referenced objects), the information required to build Node objects should be available. However, a broader conceptual design is required to fully understand the scope of such a conversion project. Quickly drafting my thoughts: it looks like a lot of the information within the <tag> elements can be used to build the Node object info such as "creation", "maininfo" and "addinfo" (based on how I classify objects in JSON), and from the referenced objects it should be possible to gather information about "geopos".

In [None]:
# EXAMPLE OF WAY ELEMENT WHICH 
<way id="381049573" version="1" timestamp="2015-11-18T18:31:24Z" changeset="35414977" uid="1952296" user="shuui">
    <nd ref="1008172080"/> # SEE EXAMPLE BELOW FOR REFERENCE
    <nd ref="3843184249"/>
    <nd ref="3843184250"/>
    <nd ref="1008172388"/>
    <nd ref="1008171880"/>
    <nd ref="3843184251"/>
    <nd ref="1008172049"/>
    <nd ref="1008172080"/> # SEE EXAMPLE BELOW FOR REFERENCE
    <tag k="addr:city" v="Milwaukee"/>
    <tag k="addr:housenumber" v="209"/>
    <tag k="addr:postcode" v="53204"/>
    <tag k="addr:state" v="WI"/>
    <tag k="addr:street" v="South Water Street"/>
    <tag k="amenity" v="architect_office"/>
    <tag k="building" v="commercial"/>
    <tag k="building:levels" v="1"/>
    <tag k="building:year_built" v="2015"/>
    <tag k="name" v="pra Plunkett Raysich Archetects, LLP"/>
    <tag k="phone" v="+1.800.208.7078"/>
    <tag k="website" v="http://prarch.com"/>
</way>

# EXAMPLE OF NODE REFERENCED BY WAY OBJECT
<node id="1008172080" lat="43.0292647" lon="-87.9083026" version="3" timestamp="2015-11-18T18:31:24Z" changeset="35414977" uid="1952296" user="shuui"/>
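
The idea of deriving a point-like object from a way and its referenced nodes can be sketched as follows. This is a rough illustration under assumptions, not a finished conversion: the function name `way_to_point`, the `tags` field and the lookup dictionary `nodes_by_id` are hypothetical, and the position is simply the mean of the referenced node coordinates, which is not a true geometric centroid.

```python
import xml.etree.ElementTree as ET


def way_to_point(way, nodes_by_id):
    """Derive a point-like document from a way by averaging its nodes."""
    refs = [nd.get("ref") for nd in way.findall("nd")]
    # A closed way repeats its first node at the end; use it only once.
    if len(refs) > 1 and refs[0] == refs[-1]:
        refs = refs[:-1]
    coords = [nodes_by_id[ref] for ref in refs if ref in nodes_by_id]
    if not coords:
        return None
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    tags = {tag.get("k"): tag.get("v") for tag in way.findall("tag")}
    return {"id": way.get("id"), "geopos": [lat, lon], "tags": tags}


# Illustrative data: a closed square way and its four corner nodes.
way = ET.fromstring(
    '<way id="9">'
    '<nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="4"/><nd ref="1"/>'
    '<tag k="amenity" v="cafe"/>'
    '</way>'
)
nodes = {"1": (0.0, 0.0), "2": (0.0, 2.0), "3": (2.0, 2.0), "4": (2.0, 0.0)}
print(way_to_point(way, nodes))
```

In a real conversion, the `tags` dictionary would still have to be mapped to the "creation", "maininfo" and "addinfo" structure used for the JSON documents.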

__What are the benefits?__

Cleaning up the way tag information might help to achieve more standardized data and thereby make it easier to search for data, describe/explain certain data objects, build reports on the data, build navigation/view layers based on the data, and so on.


_Definition of a "Way" in OpenStreetMap:_ https://wiki.openstreetmap.org/wiki/Way

## Conclusion

After completing my investigations on the data set, I am quite surprised by the amount of data which is available for that rather small piece of land on the map. In total the data sums up to about 950K records. Most of them are Nodes (about 850K records), quite a lot are Ways (about 100K), and a few are Relations. Apart from the number of records, I was also impressed by the amount of data which is tied to single records. Especially places which tend to be interesting for people, like parking lots, restaurants, bars, shops, and so on, are well described in the database. Details like opening times, telephone, website, or operator provide additional information beyond just the position and name of the place. Having such data available can really make a positive difference to the lives of the people who use OpenStreetMap.


Apart from that, I have also seen that OpenStreetMap is well integrated with other initiatives like GNIS and TIGER. It is a great idea to use and integrate information which is already available. Ultimately, that results in more user satisfaction and therefore a broader usage of the map.


What I found irritating is that so many data points which are not inside Waukesha County were in the data set.