# Data Wrangle OpenStreetMap Data

Map Area: Jersey City, New Jersey, USA

Map links for Jersey City     
https://www.openstreetmap.org/relation/170953    
http://overpass-api.de/api/map?bbox=-74.2432,40.6343,-73.8944,40.7961

## 1. Introduction

OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. It is similar to Wikipedia in the way it sources data.
Data from OSM can be exported in XML format (.osm), then analyzed and used in other projects. The data is available under the Open Database License.

More details regarding OSM and .osm data format can be found below:
https://en.wikipedia.org/wiki/OpenStreetMap
http://wiki.openstreetmap.org/wiki/OSM_XML

### Map Area

In this project, I analyze the data quality (DQ) of the selected map area (Jersey City, NJ) for validity, accuracy, consistency and uniformity. I wrangle the XML data using Python. After some DQ checks and cleaning, I store the data in MongoDB. I then run queries and aggregations to learn more about the data.

I selected Jersey City (JC) because it is my current work location. Jersey City is one of the most ethnically diverse cities in the world and the fourth most densely populated city in the United States. It is also part of the New York metropolitan area.
I tried downloading the JC data with the OpenStreetMap export option, but it failed, so I decided to use the Overpass API instead. I use the Python HTTP library Requests to get the data.

The code below downloads the Jersey City map data into the jersey_city.osm file.


In [17]:
import requests

# overpass url for jersey city, nj
url = 'http://overpass-api.de/api/map?bbox=-74.2432,40.6343,-73.8944,40.7961'
filename = 'jersey_city.osm'


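def download_osm(url, filename):
    # stream the response so the large file is never held fully in memory
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop early on an HTTP error instead of saving an error page
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)

download_osm(url, filename)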

I am including the map below to show the area covered in this analysis. When I got the coordinate range for Jersey City from OpenStreetMap, I did not realize that it was not limited to Jersey City and covered many neighbouring areas. By the time I realized it, I was far along in the project, so I decided to continue and use these coordinates as my area of interest. Please note that the term Jersey City in this project includes Jersey City and the other areas within these coordinates.

In [1]:
from IPython.core.display import HTML
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-74.25212860107422%2C40.64730356252251%2C-73.98914337158203%2C40.809131953785965&amp;layer=mapnik" style="border: 1px solid black"></iframe><br/><small><a href="http://www.openstreetmap.org/#map=12/40.7283/-74.1206">View Larger Map</a></small>')

## 2. Problems Encountered in the Map
In this section, I parse the OSM map file, discuss different problems with the data and fix them.
Finally, I prepare the data for loading into MongoDB.

### Map Parsing

I will parse the OSM file using ElementTree to find out which tags are present in the file. Keeping the size of the file in mind, I will use iterative, SAX-style parsing with ElementTree's iterparse.

In [18]:
import xml.etree.ElementTree as ET  # cElementTree is deprecated and was removed in Python 3.9
import pprint

def count_tags(filename):
    tags = {}
    # the with-block closes the file once parsing is done
    with open(filename, "r", encoding="utf-8") as osm_file:
        for event, elem in ET.iterparse(osm_file):
            if elem.tag in tags:
                tags[elem.tag] += 1
            else:
                tags[elem.tag] = 1
    return tags

tags = count_tags(filename)
pprint.pprint(tags)

{'bounds': 1,
 'member': 84098,
 'meta': 1,
 'nd': 2457737,
 'node': 1721029,
 'note': 1,
 'osm': 1,
 'relation': 2450,
 'tag': 2000510,
 'way': 312552}


We can see that about 1.7 million nodes are defined, along with a large number of ways (312552).
There are 2000510 tags present in the file. These tags are key-value pairs that define attributes of nodes or ways.
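
Because the file is close to half a gigabyte, it can help to release each element once it has been counted. The following is a minimal sketch of that idea, not the code that produced the numbers above; the sample XML and the name `count_tags_streaming` are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for an .osm file (illustrative data, not taken from the real export)
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="40.7" lon="-74.0">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="2">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

def count_tags_streaming(source):
    """Count element names, clearing each element after it has been counted."""
    counts = Counter()
    # by default iterparse yields ("end", element) pairs
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children to limit memory growth
    return dict(counts)

sample_counts = count_tags_streaming(io.BytesIO(SAMPLE_OSM))
print(sample_counts)
```

The same function should accept a file path or an open file object in place of the in-memory sample.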

Also, in my final data model for MongoDB, I will group similar tags (such as address tags). I will audit the tags further to find patterns and to remove problematic or invalid data, checking whether the tag keys can be used as valid keys in MongoDB.

I am using Python regular expressions to perform this audit.

In [19]:
import re

# define regular expressions
lower = re.compile(r'^([a-z]|_)*$') # reg-ex for lower case
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$') # reg-ex for lower case and presence of colon (:)
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]') # reg-ex for problem chars not allowed for MongoDB keys

def key_type(element, keys):
    if element.tag == "tag":
        if lower.search(element.attrib["k"]):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib["k"]):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib["k"]):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = process_map(filename)
pprint.pprint(keys)

{'lower': 742444,
 'lower_colon': 1222919,
 'other': 15181,
 'problemchars': 19966}


There are 19966 tags with problem characters, which can cause issues while loading to MongoDB.
With some further exploring, I found tags whose keys start with "addr:".
I will group these "addr:" tags under an "address" key in the data model. The lower_colon pattern helps to find these keys.
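
To make the four categories concrete, here is a small sketch that applies the same three patterns to a handful of keys. The helper name `classify_key` and the sample keys are chosen for illustration only; they are not counts or keys taken from the audit above.

```python
import re

# Same patterns as in the audit cell above
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the audit category for a single tag key."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

# Illustrative keys (assumed examples)
sample_keys = ["highway", "addr:street", "addr:street:name", "opening hours", "name_1"]
categories = {key: classify_key(key) for key in sample_keys}
for key, category in categories.items():
    print(key, "->", category)
```

Note that a key with two colons, such as "addr:street:name", does not match lower_colon and ends up in "other".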

Next, I will explore the data further, this time to find out about users.

In [20]:
def get_user(element):
    # return the "user" attribute, or "" when the element has none
    return element.attrib.get("user", "")

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        user = get_user(element)
        if user != "":
            users.add(user)
    return users

users = process_map(filename)
print("Number of Unique Users: {}".format(len(users))) # Getting number of Unique Users

Number of Unique Users: 1556


### Inconsistent Street Names
Street names in this map data are inconsistent. Since the data is crowd-sourced, there is no standard for writing street names and addresses. Many street names are abbreviated.
In this first problem-finding exercise, I will try to find such inconsistencies in street names.


In [68]:
from collections import defaultdict

# RegEx to get last string in street names. It usually gives street types.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# List of Valid and Expected Street Types
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Plaza", "Turnpike", "Alley", "Walk", "Way", "Terrace" ]

# Function to check whether a tag holds a street name
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

# Function to collect street types which are not in the expected list of street types
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

# Function to parse the map osm file and call audit_street_type function. This will return dict
# of different street types and their occurrences
def audit(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "r", encoding="utf-8") as osm_file:
        # use "end" events: on "start" the child tags may not have been parsed yet
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])

    return street_types

street_types = audit(filename)

Below are examples from the output of the above code snippet:
```javascript
"Ave": {" Westminster Ave",
         "4th Ave",
         "5th Ave",
         "64th St and 5th Ave",
         "6th Ave",
         "Hudson Ave",
         "Norman Ave",
         "Park Ave",
         "Third Ave",
         "Willow Ave"},
 "Ave.": {"Springfield Ave.", "Washington Ave."},
 "Avene": {"Nostrand Avene", "Madison Avene"},
 "Blv.": {"John F. Kennedy Blv."},
 "Blvd": {"Marin Blvd", "Queens Blvd"},
 "Park": {"FDR Four Freedoms Park",
          "Fort Hill Park",
          "Gramercy Park",
          "Washington Park"},
 "Piers": {"Northside Piers"},
 "Plz": {"University Plz"},
 "Rd": {"43rd Rd"},
 "St.": {"11th St.",
         "9th St.",
         "Devoe St.",
         "E. 54th St.",
         "East 73rd St.",
         "East 86th St.",
         "Henry St.",
         "South 4th St.",
         "Warren St.",
         "Washington St.",
         "West 44th St."},
 "Steet": {"West 8th Steet"},
 "Streeet": {"Johnson Streeet"}
 ```

I noticed a lot of inconsistencies. For streets, for example, I found the following types:
'St', 'St.', 'Steet', 'Streeet', 'st', 'street', 'ST'
Some are typos and some are abbreviations.

I noticed the same for Avenue, Boulevard and Plaza.
In the first iteration, Plaza, Turnpike, Walk and Way were not included in the expected list. I added them to the expected list in later iterations and reran the audit script.
By resolving these inconsistencies, I will make my data more uniform.
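
As a small illustration of how the audit groups names by their last word, the sketch below runs the same regular expression over a few street names. The shortened `expected_types` set, the helper `group_unexpected` and the sample list are assumptions made for this example.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Shortened expected list, for illustration only
expected_types = {"Street", "Avenue", "Boulevard"}

sample_names = ["Park Ave", "Washington Ave.", "West 8th Steet",
                "Marin Boulevard", "Mott street"]

def group_unexpected(names):
    """Group street names by trailing street type when that type is unexpected."""
    groups = defaultdict(set)
    for name in names:
        m = street_type_re.search(name)
        if m and m.group() not in expected_types:
            groups[m.group()].add(name)
    return dict(groups)

unexpected = group_unexpected(sample_names)
for street_type, names in sorted(unexpected.items()):
    print(street_type, sorted(names))
```

The comparison against the expected list is case-sensitive, which is why "street" shows up as its own group while "Boulevard" is skipped.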


### Non-Uniformity in City Names

Because the entries are crowd-sourced, city names in the map data are not consistent. NYC consists of several large boroughs, so in many places the city name is replaced with a neighbourhood or borough name. In some places city names are misspelled or abbreviated. During my initial load to MongoDB I found the examples below:

{'Manhattan NYC', 'Brooklyn, New York', 'new York', 'Queens, NY', 'NY', 'Manhattan', 'New York NY', 'Waterbury', 'Bowery Bay',
 'NEW YORK CITY'}
The above are only a few samples; there are more. Now I will reload the data, and before loading I will clean up the city names to make them as consistent as possible.

The mapping dictionary below is used to map inconsistent or incorrect city names to a uniform city name.

In [57]:
city_mapping = { 'New York' : 'New York City',
                'Brooklyn' : 'New York City',
                'Astoria' : 'New York City',
                'Long Island City' : 'New York City',
                'Sunnyside' : 'New York City',
                'Staten Island' : 'New York City',
                'Woodside' : 'New York City',
                'brooklyn' : 'New York City',
                'New York, NY' : 'New York City',
                'Queens' : 'New York City',
                'Brooklyn, NY' : 'New York City',
                'Ridgewood' : 'New York City',
                'NEW YORK CITY' : 'New York City',
                'New York NY' : 'New York City',
                'Bowery Bay, NY' : 'New York City',
                'Roosevelt Island' : 'New York City',
                'new york' : 'New York City',
                'new York' : 'New York City',
                'Queens, NY' : 'New York City',
                'NY' : 'New York City',
                'Manhattan' : 'New York City',
                'Middle Village' : 'New York City',
                'Manhattan NYC' : 'New York City',
                'Brooklyn, New York' : 'New York City',
                'end of pier' : 'New York City' }

# function to verify city name tag
def is_city_name(elem):
    return (elem.attrib['k'] == "addr:city")

# Below function will accept city name, after looking up in city_mapping dict, it will return consistent name
def update_city(city_name, city_mapping):
    if city_name in city_mapping:
        updated_city_name = city_mapping[city_name]
        return updated_city_name
    else:
        return city_name

#  Test update_city for city name from map data
new_city_name = update_city("Manhattan", city_mapping)
print(new_city_name)

New York City
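
Several entries in the mapping differ only in letter case or spacing ('new york', 'new York', 'NEW YORK CITY'). One possible alternative, sketched here as an assumption rather than the approach used in this project, is to normalize the name before the lookup so that fewer entries are needed. The names `canonical_city` and `normalize_city` are hypothetical, and the dictionary is only a shortened subset.

```python
# Hypothetical, shortened mapping keyed by lower-cased names
canonical_city = {
    "new york": "New York City",
    "new york city": "New York City",
    "manhattan": "New York City",
    "brooklyn": "New York City",
    "ny": "New York City",
}

def normalize_city(city_name, canonical=canonical_city):
    """Look up a city name after collapsing whitespace and ignoring case."""
    key = " ".join(city_name.split()).lower()
    # unknown names are returned unchanged
    return canonical.get(key, city_name)

for raw in ["new York", "NEW YORK CITY", "  brooklyn ", "Hoboken"]:
    print(repr(raw), "=>", normalize_city(raw))
```

The trade-off is that an exact-match dictionary makes every accepted variant explicit, while normalization silently accepts variants that were never reviewed.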


### Preparing for Database

I will load the JC OSM XML dataset into a MongoDB database for further analysis. To load it into MongoDB, I will convert the XML OSM file to a JSON file. I have selected the data model below for my database.
```javascript
{  
"id": "2406124091",  
"type": "node",  
"visible":"true",  
"created": {  
          "version":"2",  
          "changeset":"17206049",  
          "timestamp":"2013-08-03T16:43:42Z",  
          "user":"linuxUser16",  
          "uid":"1219059"  
        },  
"pos": [41.9757030, -87.6921867],  
"address": {  
          "housenumber": "5157",  
          "postcode": "60625",  
          "street": "North Lincoln Ave"  
        },  
"amenity": "restaurant",  
"cuisine": "mexican",  
"name": "La Cabana De Don Luis",  
"phone": "1 (773)-271-5176"  
}  
```

All metadata about a node or way entry goes under the "created" key. Latitude and longitude are stored under "pos". As discussed before, all address-related details are stored under "address".

The following rules are used to model and transform the data:

* Only the 2 top-level tags "node" and "way" are processed.
* All attributes of "node" and "way" are turned into regular key/value pairs, except:
    * attributes in the CREATED array are added under a key "created"
    CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
    * attributes for latitude and longitude are added to a "pos" array,
      for use in geospatial indexing. The values inside the "pos" array must be floats
      and not strings.
* if a second-level tag "k" value contains problematic characters, it is ignored
* if a second-level tag "k" value starts with "addr:", it is added to a dictionary "address"
* if a second-level tag "k" value does not start with "addr:" but contains ":", it is processed the same as any other tag.
* if there is a second ":" that separates the type/direction of a street, the tag is ignored, for example:

```xml
<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>
```
  should be turned into:

``` javascript
{...
"address": {
    "housenumber": "5158",
    "street": "North Lincoln Avenue"
},
"amenity": "pharmacy",
...
}
```

* for "way" specifically:
```xml
  <nd ref="305896090"/>
  <nd ref="1719825889"/>
```

should be turned into
```javascript
"node_refs": ["305896090", "1719825889"]
```
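
The tag and nd rules above can be sketched on a small sample as follows. This is a simplified illustration under stated assumptions: the helper name `shape_tags` and the sample element are made up for the example, and the problem-character filter and the attribute handling are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Illustrative element combining the examples from the rules above
SAMPLE_WAY = """<way id="1">
  <nd ref="305896090"/>
  <nd ref="1719825889"/>
  <tag k="addr:housenumber" v="5158"/>
  <tag k="addr:street" v="North Lincoln Avenue"/>
  <tag k="addr:street:name" v="Lincoln"/>
  <tag k="amenity" v="pharmacy"/>
</way>"""

def shape_tags(element):
    """Apply the "addr:" and "nd" rules to one element (simplified sketch)."""
    doc = {}
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:  # keys with a second colon are ignored
                doc.setdefault("address", {})[parts[1]] = value
        else:
            doc[key] = value
    refs = [nd.attrib["ref"] for nd in element.iter("nd")]
    if refs:
        doc["node_refs"] = refs
    return doc

shaped = shape_tags(ET.fromstring(SAMPLE_WAY))
print(shaped)
```

Values are kept as strings, so the house number stays "5158" rather than becoming a number.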

Below are code snippets that convert street names to consistent names (as identified in the section above) and then shape the data into the required format.

First, I will write an update_name function to change street names to better names.
I will define the mapping between inconsistent and better street types in the mapping dictionary.

In [73]:
# dictionary to store mapping between inconsistent and better street types
mapping = { "St": "Street",
            "St.": "Street",
            "Steet" : "Street",
            "Streeet" : "Street",
            "street" : "Street",
            "ST" : "Street",
            "st" : "Street",
            "Rd." : "Road",
            "Rd" : "Road",
            "Ave" : "Avenue",
            "ave" : "Avenue",
            "avenue" : "Avenue",
            "PKWY" : "Parkway",
            "Pl" : "Place",
            "Plz" : "Plaza"
            }

# function accepts a street name, looks up its street type in the mapping dict and returns the updated name
def update_name(name, mapping):

    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            # rewrite only the trailing street type; str.replace would also change
            # earlier occurrences of the same text inside the name
            name = name[:m.start()] + mapping[street_type]

    return name

# Running test to make sure street names are getting updated as expected (for few street types only)
for st_type, ways in street_types.items():
    if st_type in mapping:
        i = 0
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)
            i += 1
            if i == 5:
                break
            
        

Bloomfield St => Bloomfield Street
6th St => 6th Street
Madison St => Madison Street
9th St => 9th Street
2nd St => 2nd Street
N 9th ST => N 9th Street
West 8th Steet => West 8th Street
Johnson Streeet => Johnson Street
6th Ave => 6th Avenue
64th St and 5th Ave => 64th St and 5th Avenue
Park Ave => Park Avenue
 Westminster Ave =>  Westminster Avenue
Willow Ave => Willow Avenue
Steinway street => Steinway Street
Mott street => Mott Street
Hudson street => Hudson Street
west 55th street => west 55th Street
E 45th street => E 45th Street
E. 54th St. => E. 54th Street
East 73rd St. => East 73rd Street
9th St. => 9th Street
11th St. => 11th Street
Devoe St. => Devoe Street
University Plz => University Plaza
43rd Rd => 43rd Road
Bedford avenue => Bedford Avenue
Utica avenue => Utica Avenue
2nd avenue => 2nd Avenue
5th ave => 5th Avenue
6th ave => 6th Avenue
Union st => Union Street
South 4th st => South 4th Street
W 35th st => W 35th Street
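
When rewriting the street type, it matters whether only the trailing match is changed or every occurrence of the matched text. The sketch below compares the two on a made-up name; `update_suffix`, `sample_mapping` and the street name are illustrative assumptions, not data from the map.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
sample_mapping = {"St": "Street", "Ave": "Avenue"}  # shortened mapping for the example

def update_suffix(name, mapping):
    """Replace only the trailing street type, leaving the rest of the name alone."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name

name = "Stuyvesant St"  # hypothetical name that also starts with "St"
via_replace = name.replace("St", "Street")       # changes both occurrences
via_suffix = update_suffix(name, sample_mapping)  # changes only the ending

print(via_replace)
print(via_suffix)
```

For names such as "64th St and 5th Ave", the suffix-only version still leaves the inner "St" untouched, which matches the test output above.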


Below is the shape_element function, which transforms the data into the required dictionary format. shape_element uses the update_name function to translate street names to better names.

In [58]:
# list contains all elements which will be included to "created" key
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    '''
    this function converts osm elements to node dictionary. It will format all attributes 
    and sub-elements in required model defined above section.
    '''
    node = {} # return dict with nodes and ways data in our data model format
    created = {} # temp dict to store "created" key, this will be added to node dict
    pos = [None, None] # temp list to store lat and lon, this will be added to node dict
    
    if element.tag == "node" or element.tag == "way" :  # process only "node" and "way"
        # store element type
        node["type"] = element.tag 
        
        # loop through element's attributes
        for at in element.attrib:
            
            # all elements in CREATED list will be grouped and stored under created key
            if at in CREATED:
                created[at] = element.attrib[at]
                node["created"] = created
            
            # store latitudes and longitudes in pos key
            elif at in ['lat','lon']:
                if at == "lat":
                    pos[0] = float(element.attrib[at])
                else:
                    pos[1] = float(element.attrib[at])
            else:
                node[at] = element.attrib[at]

        if not None in pos:
            node["pos"] = pos
        
        # processing inner "tag" element for nodes and ways
        for tag in element.iter("tag"):
            if not problemchars.search(tag.attrib["k"]): # filetering problem chars
                
                # selecting tags starting with "addr:" to get all "address" key fields
                if lower_colon.search(tag.attrib["k"]) and tag.attrib["k"].startswith("addr:"):
                    if "address" not in node:
                        node["address"] = {}
                            
                    key = tag.attrib["k"].split(":")[1]
                    if is_street_name(tag):
                        better_name = update_name(tag.attrib["v"], mapping)
                        node["address"][key] = better_name
                    elif is_city_name(tag):
                        uniform_city_name = update_city(tag.attrib["v"], city_mapping)
                        node["address"][key] = uniform_city_name
                    else:
                        node["address"][key] = tag.attrib["v"]
                        
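                # skip remaining "addr:" keys (e.g. "addr:street:name" with a second colon);
                # store all other "tag" as normal name-value pair
                elif not tag.attrib["k"].startswith("addr:"):
                    node[tag.attrib["k"]] = tag.attrib["v"]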
        
        # to store nd elements under "node_refs" list for ways
        for nd in element.iter("nd"):
            if "node_refs" not in node:
                node["node_refs"] = []
            node["node_refs"].append(nd.attrib["ref"])
            
        #pprint.pprint(node)

        return node
    else:
        return None
    

In [59]:
import json

def process_map(file_in, pretty = False):
    '''
    This function gets osm file as input, parses the file and then using shape_element
    function transforms data in required format. Then write data to JSON file.
    '''
    file_out = "{0}.json".format(file_in)
    # write each document as soon as it is shaped instead of collecting
    # about two million dicts in a list that is never used
    with open(file_out, "w", encoding="utf-8") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")

In [60]:
# Call process map function, passing jersey_city.osm file as input.
process_map(filename)

The output file jersey_city.osm.json is created.
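
Before importing, it can be useful to confirm that every line of the output parses as JSON. The following sketch assumes the file was written with one document per line (the default, non-pretty mode); the helper name `summarize_json_lines` and the sample documents are illustrative.

```python
import json
import os
import tempfile
from collections import Counter

def summarize_json_lines(path):
    """Parse a file with one JSON document per line and count documents by "type"."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                counts[json.loads(line).get("type")] += 1
    return dict(counts)

# Write a few sample documents to a temporary file (illustrative data)
sample_docs = [{"type": "node", "id": "1"},
               {"type": "way", "id": "2"},
               {"type": "node", "id": "3"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for doc in sample_docs:
        tmp.write(json.dumps(doc) + "\n")

summary = summarize_json_lines(tmp.name)
os.remove(tmp.name)
print(summary)
```

A `json.loads` error on any line would point to a malformed document before mongoimport is run.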

## 3. Data Overview

In this section I include a few statistics about the data and show the process of loading the data into MongoDB.
I then run some MongoDB queries to get more information about the data.

### File Sizes
Size of the original XML OSM file downloaded for Jersey City

In [26]:
import os
osm_file_size = os.path.getsize(filename)/1.0e6
print("Size of the downloaded OSM file - {}: {} MB".format(filename, osm_file_size))

Size of the downloaded OSM file - jersey_city.osm: 474.760615 MB


In [61]:
output_filename = filename +".json"
json_file_size = os.path.getsize(output_filename)/1.0e6
print("Size of the created JSON file - {}: {} MB".format(output_filename, json_file_size))

Size of the created JSON file - jersey_city.osm.json: 501.609482 MB


### Loading Data to MongoDB
Now we will load jersey_city.osm.json into MongoDB.
A MongoDB instance is running locally on my machine. I will use PyMongo to connect to MongoDB.

In [30]:
from pymongo import MongoClient

client = MongoClient("localhost:27017")
db = client.osm


This sets up a handle to the osm database. To load the JSON file into MongoDB, I used the *mongoimport* utility. I call mongoimport from the command line to load the data into the database. This step can easily be scripted using Python or shell scripts.

mongoimport takes 3 arguments:

* db : database name -> osm
* collection : collection name -> jersey_city
* file : json format file -> jersey_city.osm.json
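
Scripting this step from Python could look roughly like the sketch below. It only builds and prints the command; the helper names `build_mongoimport_cmd` and `run_import` are hypothetical, and actually running the import assumes that mongoimport is on the PATH and that a MongoDB instance is reachable.

```python
import subprocess

def build_mongoimport_cmd(db, collection, path):
    """Build the mongoimport argument list from the three arguments above."""
    return ["mongoimport", "--db", db, "--collection", collection, "--file", path]

def run_import(cmd):
    # Assumption: mongoimport is installed and MongoDB is running locally
    return subprocess.run(cmd, check=True)

cmd = build_mongoimport_cmd("osm", "jersey_city", "jersey_city.osm.json")
print(" ".join(cmd))
# run_import(cmd)  # uncomment to execute the import
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.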

Below is the output of my import:

```bash
C:\Program Files\MongoDB\Server\3.0\bin>mongoimport --db "osm" --collection "jersey_city" --file "C:\Users\Prashant\Dropbox\Udacity\Data_Analyst_ND\Data_Wrangling_with_MongoDB\jersey_city.osm.json"
2015-09-07T17:12:03.073-0400    connected to: localhost
2015-09-07T17:12:06.071-0400    [#.......................] osm.jersey_city      20.7 MB/478.3 MB (4.3%)
2015-09-07T17:12:09.066-0400    [##......................] osm.jersey_city      41.5 MB/478.3 MB (8.7%)
2015-09-07T17:12:12.066-0400    [###.....................] osm.jersey_city      62.2 MB/478.3 MB (13.0%)
2015-09-07T17:12:15.067-0400    [####....................] osm.jersey_city      83.4 MB/478.3 MB (17.4%)
2015-09-07T17:12:18.066-0400    [#####...................] osm.jersey_city      105.3 MB/478.3 MB (22.0%)
2015-09-07T17:12:21.066-0400    [######..................] osm.jersey_city      125.3 MB/478.3 MB (26.2%)
2015-09-07T17:12:24.067-0400    [#######.................] osm.jersey_city      143.9 MB/478.3 MB (30.1%)
2015-09-07T17:12:27.066-0400    [########................] osm.jersey_city      164.3 MB/478.3 MB (34.3%)
2015-09-07T17:12:30.066-0400    [#########...............] osm.jersey_city      185.8 MB/478.3 MB (38.8%)
2015-09-07T17:12:33.067-0400    [##########..............] osm.jersey_city      207.0 MB/478.3 MB (43.3%)
2015-09-07T17:12:36.067-0400    [###########.............] osm.jersey_city      228.2 MB/478.3 MB (47.7%)
2015-09-07T17:12:39.066-0400    [############............] osm.jersey_city      249.9 MB/478.3 MB (52.2%)
2015-09-07T17:12:42.067-0400    [#############...........] osm.jersey_city      270.7 MB/478.3 MB (56.6%)
2015-09-07T17:12:45.067-0400    [##############..........] osm.jersey_city      291.7 MB/478.3 MB (61.0%)
2015-09-07T17:12:48.066-0400    [###############.........] osm.jersey_city      311.5 MB/478.3 MB (65.1%)
2015-09-07T17:12:51.066-0400    [################........] osm.jersey_city      332.2 MB/478.3 MB (69.4%)
2015-09-07T17:12:54.067-0400    [#################.......] osm.jersey_city      354.2 MB/478.3 MB (74.0%)
2015-09-07T17:12:57.066-0400    [###################.....] osm.jersey_city      380.6 MB/478.3 MB (79.6%)
2015-09-07T17:13:00.066-0400    [####################....] osm.jersey_city      406.9 MB/478.3 MB (85.1%)
2015-09-07T17:13:03.068-0400    [#####################...] osm.jersey_city      433.9 MB/478.3 MB (90.7%)
2015-09-07T17:13:06.066-0400    [#######################.] osm.jersey_city      461.3 MB/478.3 MB (96.4%)
2015-09-07T17:13:08.038-0400    imported 2033581 documents
```
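
The import log reports a total of 478.3 MB, while the file size computed earlier was 501.6 MB. A likely explanation, which I have not confirmed in the mongoimport documentation, is that the log uses 1024-based units. The arithmetic below is consistent with that reading.

```python
# Size in bytes, from the 501.609482 MB reported in the File Sizes section
json_bytes = 501609482

decimal_mb = json_bytes / 1e6     # 1 MB  = 1,000,000 bytes
binary_mib = json_bytes / 2**20   # 1 MiB = 1,048,576 bytes

print("decimal MB: {:.2f}".format(decimal_mb))
print("binary MiB: {:.2f}".format(binary_mib))
```

The binary figure comes out at roughly 478.4, which is in line with the total shown in the log.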

I then ran a few queries on the jersey_city collection to gather some statistics and information about the data.

#### Number of Documents in Collections

In [64]:
db.jersey_city.find().count()

2033581

#### Number of Nodes

In [37]:
db.jersey_city.find({"type" : "node"}).count()

1720878

#### Number of Ways

In [38]:
db.jersey_city.find({"type" : "way"}).count()

312508
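
The node and way counts do not quite add up to the total number of documents. The short check below compares them with the element counts from the XML file. One plausible explanation, which I have not verified against the database, is that shape_element stores every OSM tag as a top-level key, so an OSM tag whose key is "type" would overwrite the element type.

```python
# Element counts from count_tags on the XML file
xml_nodes, xml_ways = 1721029, 312552

# Document counts from the MongoDB queries above
db_total = 2033581
db_nodes, db_ways = 1720878, 312508

unaccounted = db_total - (db_nodes + db_ways)
print("documents that are neither node nor way:", unaccounted)
print("XML nodes + ways:", xml_nodes + xml_ways)
print("missing nodes:", xml_nodes - db_nodes, "missing ways:", xml_ways - db_ways)
```

The XML counts sum exactly to the number of imported documents, so no element was lost during import; the gap only appears when filtering on "type". Listing the distinct values of "type" in the collection would be one way to confirm this.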

#### Number of Unique Users

In [39]:
len(db.jersey_city.distinct("created.user"))

1523

#### Top 5 contributing Users

In [40]:
db.jersey_city.aggregate([{ "$group" : {"_id" : "$created.user", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}, 
        {"$limit" : 5} 
        ])["result"]

[{'_id': 'Rub21_nycbuildings', 'count': 1073036},
 {'_id': 'lxbarth_nycbuildings', 'count': 135794},
 {'_id': 'ediyes_nycbuildings', 'count': 112386},
 {'_id': 'ingalls_nycbuildings', 'count': 108182},
 {'_id': 'celosia_nycbuildings', 'count': 81625}]
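
For readers less familiar with the aggregation framework, the same group, sort and limit steps can be mimicked in plain Python. This is only an illustration on made-up documents; the user names and the helper `top_users` are assumptions for the example.

```python
from collections import Counter

# Made-up documents that follow the "created.user" layout of the data model
sample_docs = [
    {"created": {"user": "alice"}},
    {"created": {"user": "bob"}},
    {"created": {"user": "alice"}},
    {"created": {"user": "carol"}},
    {"created": {"user": "alice"}},
    {"created": {"user": "bob"}},
]

def top_users(docs, limit=5):
    """Count documents per user and return the most frequent ones."""
    counts = Counter(doc["created"]["user"] for doc in docs)  # like $group with $sum: 1
    # most_common sorts by count descending and truncates, like $sort + $limit
    return [{"_id": user, "count": n} for user, n in counts.most_common(limit)]

top_two = top_users(sample_docs, limit=2)
print(top_two)
```

The `["result"]` indexing used in these cells reflects the PyMongo version in use at the time; to my knowledge, newer PyMongo releases return a cursor from aggregate, so the results would be collected with list() instead.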

#### Top 5 Zip Codes Counts

In [67]:
db.jersey_city.aggregate([{ "$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, 
                          {"$match" : {"_id" : {"$ne" : None}}},
                          {"$sort" : {"count" : -1}},
                          {"$limit" : 5}
                         ])["result"]

[{'_id': '11203', 'count': 11773},
 {'_id': '11215', 'count': 9533},
 {'_id': '11221', 'count': 9402},
 {'_id': '11236', 'count': 8797},
 {'_id': '11220', 'count': 8612}]

#### Only Jersey City Zip Codes
I noticed that the coverage of Jersey City zip codes alone is very limited in OSM.

In [42]:
# Get counts of Only list of Jersey City Zip Codes.
db.jersey_city.aggregate([{"$match" : {"address.postcode" : {"$in" : ["07097","07302","07303","07304","07305","07306","07307","07308","07310","07311"]} }},
                          {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, 
                          {"$sort" : {"count" : -1}}
                         ])["result"]

[{'_id': '07302', 'count': 61},
 {'_id': '07306', 'count': 19},
 {'_id': '07311', 'count': 2},
 {'_id': '07304', 'count': 2},
 {'_id': '07310', 'count': 1}]

#### Top 5 City Names

In [66]:
db.jersey_city.aggregate([{"$group" : {"_id" : "$address.city", "count" : {"$sum" : 1}}}, 
                          {"$match" : {"_id" : {"$ne" : None}}},
                          {"$sort" : {"count" : -1}},
                          {"$limit" : 5}
                         ])["result"]

[{'_id': 'New York City', 'count': 4998},
 {'_id': 'Hoboken', 'count': 504},
 {'_id': 'Jersey City', 'count': 73},
 {'_id': 'Newark', 'count': 33},
 {'_id': 'Orange', 'count': 11}]

#### Top 5 Amenities

In [51]:
db.jersey_city.aggregate([{"$match" : {"amenity" : {"$exists" : 1}}},
                          {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1 }}},
                          {"$sort" : {"count" : -1}},
                          {"$limit" : 5}
    ])["result"]

[{'_id': 'bicycle_parking', 'count': 3895},
 {'_id': 'restaurant', 'count': 1361},
 {'_id': 'place_of_worship', 'count': 1320},
 {'_id': 'school', 'count': 954},
 {'_id': 'parking', 'count': 915}]

## 4. Additional Ideas

### nycbuildings

In the previous section, I found that the top 5 users all follow a naming pattern ending in "nycbuildings".
I researched this further; it looks like the bulk of this data was loaded into OSM last year as part of the NYC open data initiative.

The article below contains details on this:
https://www.mapbox.com/blog/nyc-buildings-openstreetmap/

After querying with a MongoDB regex, I found that about 78% of this dataset comes from nycbuildings users.

In [44]:
db.jersey_city.find( {"created.user" : {"$regex" : "nycbuildings$" } } ).count()

1596652

These nycbuildings documents do not contain much useful information. Only a small share of them have an address. Other useful attributes, such as amenities, are missing.

In [45]:
db.jersey_city.find( {"created.user" : {"$regex" : "nycbuildings$" }, "address" : {"$exists" : 1} } ).count()

230417
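
The shares behind these statements can be worked out directly from the counts reported in this section and in the Data Overview section:

```python
total_docs = 2033581        # all documents in the collection
nyc_docs = 1596652          # documents created by *nycbuildings users
nyc_with_address = 230417   # of those, documents that have an "address" key

share_nyc = 100 * nyc_docs / total_docs
share_with_address = 100 * nyc_with_address / nyc_docs

print("nycbuildings share of dataset: {:.1f}%".format(share_nyc))
print("nycbuildings documents with address: {:.1f}%".format(share_with_address))
```

So roughly 78.5% of the documents come from nycbuildings users, and only about 14.4% of those carry an address.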

### Number of documents for TIGER and GNIS

Looking further into this dataset, it contains many "way" entries from TIGER (Topologically Integrated Geographic Encoding and Referencing system):
http://wiki.openstreetmap.org/wiki/TIGER

There are also many "node" entries from GNIS (USGS Geographic Names Information System):
http://wiki.openstreetmap.org/wiki/USGS_GNIS

According to the documentation of both sources, the data seems to be outdated or incorrect.
GNIS:ID is supposed to map to OSM amenity tags, but no corresponding amenity tags are present for these GNIS:ID entries.

In [46]:
# Way document from tiger
db.jersey_city.find( { "type" : "way", "tiger:cfcc" : { "$exists": 1 } } ).count()

17388

In [47]:
# node documents from gnis
db.jersey_city.find( { "type" : "node", "gnis:created" : { "$exists": 1 } } ).count()

2117

####  Node and Way with Address

In [48]:
db.jersey_city.find({"type" : "node", "address.street" : {"$exists" : 1}}).count()

54044

In [49]:
db.jersey_city.find({"type" : "way", "address.street" : {"$exists" : 1}}).count()

200000

Because of the incomplete nycbuildings, TIGER and GNIS data, my dataset does not give much useful information.
Below are some ideas that could make this OSM data more useful:

#### Standardizing City Names and Zip Codes
City names and zip codes can be standardized fairly easily. An API could be created, or the current OSM API could be expanded, to provide a lookup based on coordinates that returns uniform city names and zip codes. This API could be referenced while contributing data to OSM.
The same lookup could be used to add city names and zip codes to nodes that are missing this information.

#### Cleaning up outdated and invalid TIGER and GNIS data
TIGER and GNIS data was loaded a few years back, and much of it needs cleaning. Outdated data needs to be identified and cleaned. We could use the Google Maps API to cross-reference OSM data and clean it wherever possible.

#### nycbuildings Data can be leveraged
For my selected map area, the nycbuildings data seems to be the most recent. However, it still needs more attributes and information.
A game or competition could be started on a social network, where players act as explorers. They would be given coordinates and would need to find information about those coordinates, earning points for what they find. Geo-tagging and hash-tagging could be leveraged for that.

#### More crowd-sourced apps
Many more crowd-sourced apps could be developed using OSM data, which in turn would help improve OSM data quality.
For example, we could create a social street-parking app where users geo-tag street parking all over the city. Once the data is more mature, it could be developed into a more advanced app that provides real-time updates on parking spots.


### Conclusion
The data for Jersey City and the neighbouring areas needs cleaning and mapping. The nycbuildings data has created a nice skeleton, which can be leveraged to make this data more useful.
In its current state, it is difficult to gain any intelligence from this data.