Data Munging and Analyzing Barcelona OSM Data With MongoDB

The XML data for the city boundary of Barcelona was downloaded from OSM, then cleaned and transformed into a JSON-encodable structure so it could be loaded into MongoDB, which provides storage and arbitrary querying for further analysis of the Barcelona OSM dataset. I chose the Barcelona dataset because I recently spent a few weeks there and noticed some pretty cool urban planning features of the city (e.g. Avinguda Diagonal) that could be interesting to explore further.

<img src="http://static1.squarespace.com/static/52b3aae6e4b00492bb71aa3d/t/54776cbde4b019f8929d0684/1414769949925/?format=1500w" style="width: 650px;">

From [dailyoverview.com](http://www.dailyoverview.com/six/)

Problems Encountered in the Dataset  
After reviewing samples of the Barcelona OSM dataset, a few data quality issues were observed, including:
1. Inconsistent house numbers - single numbers (e.g. "120"), ranges (e.g. "119-121") or lists (e.g. "119,121")  
2. Inconsistent source dates - various combinations of year, month and day  
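
One possible way to handle the second issue is to parse each source date into separate year, month and day parts, leaving any missing part empty. The sketch below is only an illustration: the function name `normalize_source_date` is hypothetical, and the input shapes it accepts ("YYYY", "YYYY-MM", "YYYY-MM-DD") are an assumption, since the exact formats present in the dataset are not listed here.

```python
import re

# Assumed input shapes (not confirmed against the dataset):
# "YYYY", "YYYY-MM" or "YYYY-MM-DD".
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?$")


def normalize_source_date(raw):
    """Return a dict with year/month/day keys, or None if the value does not match.

    Parts that are absent from the input are stored as None.
    Month and day ranges are not validated in this sketch.
    """
    match = _DATE_RE.match(raw.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    return {
        "year": int(year),
        "month": int(month) if month else None,
        "day": int(day) if day else None,
    }
```

Storing the parts separately keeps a year-only date distinguishable from a full date, rather than silently padding it to January 1st.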

Standardizing house number data  
As part of auditing the OSM dataset for Barcelona, it was observed that house numbers are recorded inconsistently. Values of the "addr:housenumber" attribute delimited by "-" or "," were standardized into lists of the individual numbers; all other values were transformed into a single-item list. Once standardized, a query was run to find the 11 most common house numbers in Barcelona.
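
The standardization rule described above can be sketched roughly as follows. This is a minimal illustration rather than the code in `OSM_data_wrangling`; the function name `standardize_house_number` is hypothetical, and stripping whitespace around each part is an added assumption.

```python
import re


def standardize_house_number(raw):
    """Split a raw addr:housenumber value on "-" or "," into a list of strings.

    A value with no delimiter becomes a single-item list. Ranges are split
    at the delimiter, not expanded ("119-121" does not produce "120").
    """
    parts = re.split(r"[-,]", raw)
    # Drop empty pieces left by stray delimiters or surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]
```

For example, `standardize_house_number("119-121")` gives `["119", "121"]` and `standardize_house_number("120")` gives `["120"]`, so every document ends up with a list that `$unwind` can process uniformly.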

In [92]:
import sys
import json
# Make the project's helper module importable from its local folder.
sys.path.insert(0,"C:/Users/Cole/Desktop/Udacity/Data Analyst Nano Degree/Project 3")
import OSM_data_wrangling as OSMDW
osm = OSMDW.connect_OSM_collection()
# Count each house number (one per list element) and keep the 11 most common.
house_nums = osm.aggregate([
    {"$match": {"addr:housenumber": {"$exists": 1}}},
    {"$unwind": "$addr:housenumber"},
    {"$group": {"_id": "$addr:housenumber", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 11}])
for house_num in house_nums:
    print "house_num = " + house_num['_id'] + " count = " + str(house_num['count'])

house_num = 1 count = 304
house_num = 2 count = 294
house_num = 3 count = 278
house_num = 5 count = 268
house_num = 8 count = 252
house_num = 4 count = 242
house_num = 6 count = 237
house_num = 7 count = 231
house_num = 10 count = 226
house_num = 9 count = 218
house_num = 13 count = 195


From this data, house number frequency appears to track numerical value: lower numbers are more common. The correspondence is not exact, though. Rank matches house number only for the top 3, after which the order begins to jump around slightly (for example, house number 5 ranks 4th and house number 8 ranks 5th).


Further Data Analysis

Basic statistics of the Barcelona OSM dataset were obtained using various MongoDB queries.  

Barcelona OSM dataset size = 184MB

In [135]:
# Total documents, distinct addr:city values, and the most frequent city other than Barcelona.
print "dataset record count = " + str(osm.find().count())
print "dataset cities=" + str(len(osm.distinct("addr:city")))
top_other_city = osm.aggregate([
    {"$match": {"addr:city": {"$exists": 1, "$nin": ["Barcelona"]}}},
    {"$group": {"_id": "$addr:city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1}]).next()['_id']
print "most common non-Barcelona city = " + top_other_city.encode("utf-8")

dataset record count = 961553
dataset cities=72
most common non-Barcelona city = Santa Coloma de Cervelló


Additional Ideas  
Continuing from the above analysis of the most common house numbers, a natural question is whether the pattern of rank tracking house number continues throughout the dataset, or whether the two separate at higher values. We could explore this by producing a scatterplot of house number versus rank and checking visually for any departure from the linear trend. Defining the exact point of separation (if one exists) could prove challenging, as we have seen that rank and house number do not line up exactly 1 to 1 (e.g. rank 4 is house number 5, not 4). We would have to define a "separation" criterion (e.g. rank - house number > 10) or do some statistical analysis to test whether a change in the pattern has in fact occurred.  
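
A threshold-based separation check of the kind suggested above might look like the sketch below. The counts are copied from the query output earlier in this report; the function name `first_separation` is hypothetical, and using the absolute difference between rank and house number (rather than the signed difference) is an assumption made for illustration.

```python
# (house number, count) pairs from the earlier query output, sorted by count.
top_house_numbers = [
    (1, 304), (2, 294), (3, 278), (5, 268), (8, 252), (4, 242),
    (6, 237), (7, 231), (10, 226), (9, 218), (13, 195),
]


def first_separation(ranked, threshold):
    """Return (rank, house_number) for the first entry where
    |rank - house_number| exceeds threshold, or None if no entry does.

    `ranked` must already be sorted from most to least common.
    """
    for rank, (number, _count) in enumerate(ranked, start=1):
        if abs(rank - number) > threshold:
            return rank, number
    return None
```

With a threshold of 10, none of the 11 house numbers listed in the output separates, so the check would need to run over the full ranked list to be informative. A much stricter threshold of 1 already flags rank 5 (house number 8), which shows how sensitive the result is to the chosen criterion.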

Conclusions   
Although there is missing data in the Barcelona OSM dataset, it is the data inconsistencies that seem most prevalent and that can prove the most difficult to deal with. Simply checking that a value exists is not enough to catch inconsistencies; the data must also be confirmed to be in the expected form (e.g. a single house number or a list of house numbers) before proper analysis can be performed, as the house number example in this report shows.