# Data Wrangle OpenStreetMap Data

Map Area: Jersey City, New Jersey, USA

Map links for Jersey City     
https://www.openstreetmap.org/relation/170953    
http://overpass-api.de/api/map?bbox=-74.2432,40.6343,-73.8944,40.7961

## 1. Introduction

OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. It is similar to Wikipedia in how it sources data.
Data from OSM can be exported in XML format (.osm) and then analyzed and used in other projects. The data is available under the Open Database License.

More details regarding OSM and .osm data format can be found below:
https://en.wikipedia.org/wiki/OpenStreetMap
http://wiki.openstreetmap.org/wiki/OSM_XML

### Map Area

In this project, I will analyze the data quality (DQ) of the selected map area (Jersey City, NJ) for validity, accuracy, consistency and uniformity. I will wrangle the XML data using Python. After some DQ checks and cleaning, I will store the data in MongoDB. I will then run some queries and aggregations to learn more about the data.

I selected Jersey City (JC) because it is my current work location. Jersey City is one of the most ethnically diverse cities in the world and the fourth most densely populated city in the United States. It is also part of the New York metropolitan area.
I tried downloading the JC data with the OpenStreetMap export option, but it failed, so I decided to use the Overpass API instead. I am using the Python HTTP library Requests to fetch the data.

The code below downloads the Jersey City map data into the jersey_city.osm file.


In [2]:
import requests

# overpass url for jersey city, nj
url = 'http://overpass-api.de/api/map?bbox=-74.2432,40.6343,-73.8944,40.7961'
filename = 'jersey_city.osm'


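def download_osm(url, filename):
    # stream the response so the large file is never held fully in memory
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop early on an HTTP error instead of saving an error page
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)

download_osm(url, filename)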

I am including the map below to show the area covered in the analysis. When I took the coordinate range for Jersey City from OpenStreetMap, I did not realize that it was not limited to Jersey City but also covered many neighbouring areas. By the time I realized this, I was already far into the project, so I decided to continue and use these coordinates as my area of interest. Please note that the term "Jersey City" in this project includes Jersey City and the other areas within these coordinates.

In [85]:
from IPython.core.display import HTML
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-74.25212860107422%2C40.64730356252251%2C-73.98914337158203%2C40.809131953785965&amp;layer=mapnik" style="border: 1px solid black"></iframe><br/><small><a href="http://www.openstreetmap.org/#map=12/40.7283/-74.1206">View Larger Map</a></small>')

## 2. Problems Encountered in the Map

### Map Parsing

I will parse the OSM file using ElementTree to find out which tags are present in the file. Keeping the size of the file in mind, I will parse it iteratively with ET.iterparse, which processes the file element by element in a SAX-like fashion.

In [3]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint

def count_tags(filename):
    tags = {}
    with open(filename, "r", encoding="utf-8") as osm_file:
        # iterparse yields each element once its closing tag has been read
        for event, elem in ET.iterparse(osm_file):
            tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags

tags = count_tags(filename)
pprint.pprint(tags)

{'bounds': 1,
 'member': 84098,
 'meta': 1,
 'nd': 2457737,
 'node': 1721029,
 'note': 1,
 'osm': 1,
 'relation': 2450,
 'tag': 2000510,
 'way': 312552}


We can see that about 1.7 million nodes are defined, along with a large number of ways (312,552).
There are 2,000,510 tags in the file. These tags are key-value pairs that define the attributes of elements such as nodes and ways.
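
Since file size is a concern here, the same count can also be taken while releasing each element after it has been seen. The following is a minimal sketch, not the code used for the results above: `count_tags_streaming` is a hypothetical name and the tiny in-memory XML sample is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_streaming(source):
    """Count element tags, clearing each element once it has been counted."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return counts

# Illustrative sample only; in the notebook the source would be the .osm file
sample = io.BytesIO(b"""<osm>
  <node id="1" lat="40.7" lon="-74.0"><tag k="amenity" v="cafe"/></node>
  <node id="2" lat="40.8" lon="-74.1"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>""")

counts = count_tags_streaming(sample)
print(dict(counts))
```

Clearing on "end" events keeps already-processed elements small; the root element still holds references to the emptied children, so memory use is reduced rather than constant.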

For my final MongoDB data model, I will group similar tags (such as address tags) together. I will audit the tags further to find patterns and to remove data that is problematic or invalid for MongoDB, checking whether the tag keys can be used as valid MongoDB keys.

I am using Python regular expressions to perform this audit.

In [4]:
import re

# define regular expressions
lower = re.compile(r'^([a-z]|_)*$') # keys made only of lowercase letters and underscores
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$') # lowercase keys containing exactly one colon (:)
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]') # characters that are problematic in MongoDB keys

def key_type(element, keys):
    if element.tag == "tag":
        if lower.search(element.attrib["k"]):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib["k"]):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib["k"]):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = process_map(filename)
pprint.pprint(keys)

{'lower': 742444,
 'lower_colon': 1222919,
 'other': 15181,
 'problemchars': 19966}


There are 19,966 tags with problem characters, which can cause issues when loading into MongoDB.
Exploring further, I found tags whose keys start with "addr:".
I will group these "addr:" keys under an "address" key in the data model. The lower_colon pattern helps to find these keys.
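
To make the four categories concrete, here is a small sketch that applies the same three patterns, in the same order as key_type, to a handful of keys. `classify_key` is a hypothetical helper and the sample keys are illustrative, not taken from the dataset.

```python
import re

# Same patterns as in the audit cell above
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the audit category for a single tag key."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

# Illustrative keys (assumed examples)
for key in ["amenity", "addr:street", "addr:street:type", "cityracks.street", "FIXME"]:
    print(key, "->", classify_key(key))
```

Keys with a second colon, such as "addr:street:type", fall into "other" because lower_colon allows exactly one colon.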

Next, I will explore the data further, this time to find out about the users.

In [5]:
def get_user(element, user):
    user = ""
    at = element.attrib
    
    for key in at:
        if key == "user" and at["user"] != "":
            user = at["user"]
    return user

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        user = get_user(element, users)
        if user != "":
            users.add(user)
    return users 

users = process_map(filename)
print("Number of Unique Users: {}".format(len(users))) # Getting number of Unique Users

Number of Unique Users: 1556


### Inconsistent Street Names
Street names in this map data are inconsistent. Since the data is crowdsourced, there is no standard for writing street names and addresses, and many street names are over-abbreviated.
In this first problem-finding exercise, I will try to find such inconsistencies in street names.


In [6]:
from collections import defaultdict

# RegEx to get last string in street names. It usually gives street types.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# List of Valid and Expected Street Types
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Plaza", "Turnpike", "Alley", "Walk", "Way", "Terrace" ]

# Function to check whether a tag holds a street name
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

# Function to collect street types that are not in the expected list of street types
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

# Function to parse the OSM file and call audit_street_type for each street name.
# Returns a dict mapping each unexpected street type to the street names that use it.
def audit(osmfile):
    osm_file = open(osmfile, "r", encoding="utf-8")
    street_types = defaultdict(set)
    # use "end" events so that each element's child <tag> elements are fully parsed
    for event, elem in ET.iterparse(osm_file, events=("end",)):

        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])

    return street_types

street_types = audit(filename)
pprint.pprint(street_types)

{'10024': {'West 80th Street NYC 10024'},
 '1801': {'505th 8th Avenue Suite 1801'},
 '1st': {'1st'},
 '27th': {'W 27th'},
 '29th': {'29th'},
 '2N': {'400th West 20th St., Suite 2N'},
 '3': {'Hanover Square #3'},
 '300': {'Ste 300'},
 '306': {'West 30th Street Suite 306'},
 '41st': {'41st'},
 '42nd': {'West 42nd'},
 '4B': {'Union Avenue 4B'},
 '500': {'Main St., Suite 500'},
 '633': {'633'},
 '861': {'861'},
 'A': {'Avenue A'},
 'Americas': {'Avenue Of The Americas',
              'Avenue of Americas',
              'Avenue of the Americas'},
 'Atrium': {'Broadway Atrium'},
 'Ave': {' Westminster Ave',
         '4th Ave',
         '5th Ave',
         '64th St and 5th Ave',
         '6th Ave',
         'Hudson Ave',
         'Norman Ave',
         'Park Ave',
         'Third Ave',
         'Willow Ave'},
 'Ave.': {'Washington Ave.', 'Springfield Ave.'},
 'Avene': {'Nostrand Avene', 'Madison Avene'},
 'Avenue,#392': {'Columbus Avenue,#392'},
 'B': {'Avenue B'},
 'Blv.': {'John F. Kennedy 

I noticed a lot of inconsistencies. For example, streets appear with the following types:
'St', 'St.', 'Steet', 'Streeet', 'st', 'street', 'ST'
These are a mix of typos and abbreviations.

I noticed the same thing for Avenue, Boulevard and Plaza.
In the first iteration, Plaza, Turnpike, Walk and Way were not in the expected list. I added them in later iterations and reran the audit script above.
Resolving these inconsistencies will make my data more uniform.


### Preparing for Database

I will load the JC OSM XML dataset into a MongoDB database for further analysis. To do so, I will convert the XML OSM file to a JSON file. I have selected the data model below for my database.
```javascript
{  
"id": "2406124091",  
"type": "node",  
"visible":"true",  
"created": {  
          "version":"2",  
          "changeset":"17206049",  
          "timestamp":"2013-08-03T16:43:42Z",  
          "user":"linuxUser16",  
          "uid":"1219059"  
        },  
"pos": [41.9757030, -87.6921867],  
"address": {  
          "housenumber": "5157",  
          "postcode": "60625",  
          "street": "North Lincoln Ave"  
        },  
"amenity": "restaurant",  
"cuisine": "mexican",  
"name": "La Cabana De Don Luis",  
"phone": "1 (773)-271-5176"  
}  
```

All metadata about a node or way entry is stored under the "created" key. Latitude and longitude are stored under "pos". As discussed before, all address-related details are stored under "address".

The following rules are used to model and transform the data:

* Only the 2 top-level tags "node" and "way" will be processed.
* All attributes of "node" and "way" should be turned into regular key/value pairs, except:
    * attributes in the CREATED array should be added under a key "created"
    CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
    * attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside the "pos" array are floats
      and not strings.
* if second level tag "k" value contains problematic characters, it should be ignored
* if second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
* if second level tag "k" value does not start with "addr:", but contains ":", you can process it same as any other tag.
* if there is a second ":" that separates the type/direction of a street, the tag should be ignored, for example:

```xml
<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>
```
  should be turned into:

``` javascript
{...
"address": {
    "housenumber": "5158",
    "street": "North Lincoln Avenue"
},
"amenity": "pharmacy",
...
}
```

* for "way" specifically:
```xml
  <nd ref="305896090"/>
  <nd ref="1719825889"/>
```

should be turned into
```javascript
"node_refs": ["305896090", "1719825889"]
```

Below are code snippets that convert street names to consistent names (as identified in the section above) and then shape the data into the required format.

First, I will write an update_name function to change street names to better names.
The mapping between inconsistent and better street types is defined in the mapping dictionary.

In [13]:
# dictionary to store mapping between inconsistent and better street types
mapping = { "St": "Street",
            "St.": "Street",
            "Steet" : "Street",
            "Streeet" : "Street",
            "street" : "Street",
            "ST" : "Street",
            "st" : "Street",
            "Rd." : "Road",
            "Rd" : "Road",
            "Ave" : "Avenue",
            "ave" : "Avenue",
            "avenue" : "Avenue",
            "PKWY" : "Parkway",
            "Pl" : "Place",
            "Plz" : "Plaza"
            }

# function accepts a street name, looks up its street type in the mapping dict and returns the updated name
def update_name(name, mapping):

    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        try:
            new_street_type = mapping[street_type]
            name = name.replace(street_type, new_street_type)
        except KeyError:
            pass
        
    return name

# Running test to make sure street names are getting updated as expected
for st_type, ways in street_types.items():
    if st_type in mapping:
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)
        

9th St. => 9th Street
West 44th St. => West 44th Street
E. 54th St. => E. 54th Street
East 73rd St. => East 73rd Street
East 86th St. => East 86th Street
Devoe St. => Devoe Street
Henry St. => Henry Street
South 4th St. => South 4th Street
Washington St. => Washington Street
11th St. => 11th Street
Warren St. => Warren Street
South 4th st => South 4th Street
W 35th st => W 35th Street
Union st => Union Street
Johnson Streeet => Johnson Street
3rd St => 3rd Street
Bloomfield St => Bloomfield Street
6th St => 6th Street
Hudson St => Hudson Street
1st St => 1st Street
40 W 94th St => 40 W 94th Street
Adams St => Adams Street
Grand St => Grand Street
Smith St & Bergen St => Smith Street & Bergen Street
8th St => 8th Street
2nd St => 2nd Street
Garden St => Garden Street
Jackson St => Jackson Street
Washington St => Washington Street
West 32nd St => West 32nd Street
7th St => 7th Street
Court St => Court Street
Monroe St => Monroe Street
362nd Grand St => 362nd Grand Street
Jefferson St => 
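
Because update_name relies on str.replace, every occurrence of the matched token in the name is replaced, which can also change the token where it appears earlier in the name, for instance as the prefix of another word. The following is a minimal alternative sketch that rewrites only the trailing token. `update_last_token` is a hypothetical name, the mapping is a shortened copy of the one above, and the sample names are illustrative.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Shortened copy of the mapping above, for illustration only
mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue"}

def update_last_token(name, mapping):
    """Replace only the final street-type token, leaving the rest of the name untouched."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name

for name in ["9th St.", "Stuyvesant St", "Park Ave", "Broadway"]:
    print(name, "=>", update_last_token(name, mapping))
```

The trade-off is that a name holding two street types, such as "Smith St & Bergen St", would have only its last token expanded with this variant.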

Below is the shape_element function, which transforms the data into the required dictionary format. shape_element uses the update_name function to translate street names to better names.

In [14]:
# list contains all elements which will be included to "created" key
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    '''
    This function converts an osm element to a node dictionary. It formats all attributes
    and sub-elements according to the data model defined in the section above.
    '''
    node = {} # return dict with nodes and ways data in our data model format
    created = {} # temp dict to store "created" key, this will be added to node dict
    pos = [None, None] # temp list to store lat and lon, this will be added to node dict
    
    if element.tag == "node" or element.tag == "way" :  # process only "node" and "way"
        # store element type
        node["type"] = element.tag 
        
        # loop through element's attributes
        for at in element.attrib:
            
            # all elements in CREATED list will be grouped and stored under created key
            if at in CREATED:
                created[at] = element.attrib[at]
                node["created"] = created
            
            # store latitudes and longitudes in pos key
            elif at in ['lat','lon']:
                if at == "lat":
                    pos[0] = float(element.attrib[at])
                else:
                    pos[1] = float(element.attrib[at])
            else:
                node[at] = element.attrib[at]

        if not None in pos:
            node["pos"] = pos
        
        # processing inner "tag" element for nodes and ways
        for tag in element.iter("tag"):
            if not problemchars.search(tag.attrib["k"]): # filtering out problem chars

                # selecting tags starting with "addr:" to get all "address" key fields
                if lower_colon.search(tag.attrib["k"]) and tag.attrib["k"].startswith("addr:"):
                    if "address" not in node:
                        node["address"] = {}
                            
                    key = tag.attrib["k"].split(":")[1]
                    if is_street_name(tag):
                        better_name = update_name(tag.attrib["v"], mapping)
                        node["address"][key] = better_name
                    else:
                        node["address"][key] = tag.attrib["v"]
                        
                # store other tags whose key contains a single colon as normal name-value pairs
                elif lower_colon.search(tag.attrib["k"]) and not tag.attrib["k"].startswith("addr:"):
                    node[tag.attrib["k"]] = tag.attrib["v"]
        
        # to store nd elements under "node_refs" list for ways
        for nd in element.iter("nd"):
            if "node_refs" not in node:
                node["node_refs"] = []
            node["node_refs"].append(nd.attrib["ref"])
            
        #pprint.pprint(node)

        return node
    else:
        return None
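
As written, the second branch of the tag loop only matches keys that contain a colon, so plain keys such as "amenity" or "name" are not copied into the document. The following is a minimal sketch of a tag handler that also keeps plain lowercase keys, assuming that is what the data model above intends. `shape_tags` is a hypothetical helper that works on (key, value) pairs rather than XML elements, and the sample pairs are illustrative.

```python
import re

# Same patterns as in the audit cell
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_tags(tags):
    """Fold (key, value) tag pairs into a document dict following the rules above."""
    doc = {}
    for key, value in tags:
        if problemchars.search(key):
            continue  # skip keys with problematic characters
        if key.startswith("addr:"):
            if lower_colon.search(key):  # exactly one colon, e.g. addr:street
                doc.setdefault("address", {})[key.split(":", 1)[1]] = value
            # keys with a second colon (addr:street:type) are ignored
        elif lower.search(key) or lower_colon.search(key):
            doc[key] = value  # plain keys and single-colon keys are both kept
    return doc

sample = [
    ("addr:housenumber", "5158"),
    ("addr:street", "North Lincoln Avenue"),
    ("addr:street:type", "Avenue"),
    ("amenity", "pharmacy"),
    ("tiger:cfcc", "A41"),
]
shaped = shape_tags(sample)
print(shaped)
```

This sketch leaves out the street-name cleanup; in shape_element the "street" value would still go through update_name.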
    

In [16]:
import json

def process_map(file_in, pretty = False):
    '''
    This function takes an osm file as input, parses it and uses shape_element
    to transform each element into the required format. Each document is then
    written to a JSON file, one per line (or indented when pretty is True).
    '''
    file_out = "{0}.json".format(file_in)
    with open(file_out, "w", encoding="utf-8") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")

In [17]:
# Call process map function, passing jersey_city.osm file as input.
process_map(filename)

The output file jersey_city.osm.json is created.

## 3. Data Overview

In this section I include a few statistics about the data and show the process of loading the data into MongoDB.
I then run some MongoDB queries to get more information about the data.

### File Sizes
Size of the original XML OSM file downloaded for Jersey City

In [22]:
import os
osm_file_size = os.path.getsize(filename)/1.0e6
print("Size of the downloaded OSM file - {}: {} MB".format(filename, osm_file_size))

Size of the downloaded OSM file - jersey_city.osm: 474.760615 MB


In [23]:
output_filename = filename +".json"
json_file_size = os.path.getsize(output_filename)/1.0e6
print("Size of the created JSON file - {}: {} MB".format(output_filename, json_file_size))

Size of the created JSON file - jersey_city.osm.json: 486.374754 MB


### Loading Data to MongoDB
Now we will load jersey_city.osm.json into MongoDB.
A MongoDB instance is running locally on my machine. I will use PyMongo to connect to it.

In [26]:
from pymongo import MongoClient

client = MongoClient("localhost:27017")
db = client.osm


The osm db is created. To load the JSON file into MongoDB, I used the *mongoimport* utility, calling it from the command line. This step could easily be scripted using Python or a shell script.

I passed 3 options to mongoimport:

* db: database name -> osm
* collection: collection name -> jersey_city
* file: JSON format file -> jersey_city.osm.json

Below is a snippet of the output of my import:

```bash
C:\Program Files\MongoDB\Server\3.0\bin>mongoimport --db "osm" --collection "jersey_city" --file "C:\Users\Prashant\Dropbox\Udacity\Data_Analyst_ND\jersey_city.osm.json"
2015-08-30T18:02:21.207-0400    connected to: localhost
2015-08-30T18:02:24.177-0400    [........................] osm.jersey_city      16.2 MB/463.8 MB (3.5%)
2015-08-30T18:02:27.151-0400    [#.......................] osm.jersey_city      36.4 MB/463.8 MB (7.8%)
2015-08-30T18:02:30.151-0400    [##......................] osm.jersey_city      56.8 MB/463.8 MB (12.2%)
2015-08-30T18:02:33.153-0400    [####....................] osm.jersey_city      78.1 MB/463.8 MB (16.8%)
2015-08-30T18:02:36.151-0400    [#####...................] osm.jersey_city      99.6 MB/463.8 MB (21.5%)
2015-08-30T18:02:39.151-0400    [######..................] osm.jersey_city      118.1 MB/463.8 MB (25.5%)
2015-08-30T18:02:42.152-0400    [#######.................] osm.jersey_city      136.4 MB/463.8 MB (29.4%)
2015-08-30T18:02:45.151-0400    [########................] osm.jersey_city      156.5 MB/463.8 MB (33.7%)
2015-08-30T18:02:48.151-0400    [#########...............] osm.jersey_city      176.7 MB/463.8 MB (38.1%)
2015-08-30T18:02:51.154-0400    [##########..............] osm.jersey_city      194.9 MB/463.8 MB (42.0%)
2015-08-30T18:02:54.151-0400    [##########..............] osm.jersey_city      212.5 MB/463.8 MB (45.8%)
2015-08-30T18:02:57.154-0400    [############............] osm.jersey_city      232.8 MB/463.8 MB (50.2%)
2015-08-30T18:03:00.156-0400    [#############...........] osm.jersey_city      253.0 MB/463.8 MB (54.5%)
2015-08-30T18:03:03.151-0400    [##############..........] osm.jersey_city      273.1 MB/463.8 MB (58.9%)
2015-08-30T18:03:06.151-0400    [###############.........] osm.jersey_city      291.6 MB/463.8 MB (62.9%)
2015-08-30T18:03:09.152-0400    [################........] osm.jersey_city      311.8 MB/463.8 MB (67.2%)
2015-08-30T18:03:12.151-0400    [#################.......] osm.jersey_city      331.1 MB/463.8 MB (71.4%)
2015-08-30T18:03:15.151-0400    [##################......] osm.jersey_city      350.5 MB/463.8 MB (75.6%)
2015-08-30T18:03:18.157-0400    [###################.....] osm.jersey_city      370.4 MB/463.8 MB (79.9%)
2015-08-30T18:03:21.153-0400    [####################....] osm.jersey_city      394.5 MB/463.8 MB (85.0%)
2015-08-30T18:03:24.151-0400    [#####################...] osm.jersey_city      418.4 MB/463.8 MB (90.2%)
2015-08-30T18:03:27.152-0400    [#######################.] osm.jersey_city      444.8 MB/463.8 MB (95.9%)
2015-08-30T18:03:29.561-0400    imported 2033581 documents
```
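
The import above was run by hand. A minimal sketch of scripting the same step from Python with the standard subprocess module could look like this. `build_mongoimport_cmd` and `run_mongoimport` are hypothetical helper names, and the sketch assumes mongoimport is available on the PATH.

```python
import shutil
import subprocess

def build_mongoimport_cmd(db, collection, path):
    """Build the argument list for the three options used above."""
    return ["mongoimport", "--db", db, "--collection", collection, "--file", path]

def run_mongoimport(db, collection, path):
    """Run mongoimport and raise if the tool is missing or the import fails."""
    cmd = build_mongoimport_cmd(db, collection, path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("mongoimport was not found on PATH")
    subprocess.run(cmd, check=True)

cmd = build_mongoimport_cmd("osm", "jersey_city", "jersey_city.osm.json")
print(" ".join(cmd))
# run_mongoimport("osm", "jersey_city", "jersey_city.osm.json")  # needs a running MongoDB
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.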

I then ran a few queries on the jersey_city collection to gather some statistics and information about the data.

#### Number of Documents in Collections

In [28]:
db.jersey_city.find().count()

2033581

#### Number of Nodes

In [30]:
db.jersey_city.find({"type" : "node"}).count()

1721029

#### Number of Ways

In [31]:
db.jersey_city.find({"type" : "way"}).count()

312552

#### Number of Unique Users

In [34]:
len(db.jersey_city.distinct("created.user"))

1523

#### Top 5 contributing Users

In [46]:
db.jersey_city.aggregate([{ "$group" : {"_id" : "$created.user", "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}}, 
        {"$limit" : 5} 
        ])["result"]

[{'_id': 'Rub21_nycbuildings', 'count': 1073036},
 {'_id': 'lxbarth_nycbuildings', 'count': 135794},
 {'_id': 'ediyes_nycbuildings', 'count': 112386},
 {'_id': 'ingalls_nycbuildings', 'count': 108182},
 {'_id': 'celosia_nycbuildings', 'count': 81625}]
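
The queries in this section use the PyMongo API as it was when this notebook was written: counts via find().count() and aggregation results read from a "result" key. In PyMongo 3 and later, aggregate returns a cursor, and recent releases count with count_documents instead. The following is a sketch of the same top-users query in that style; `top_users_pipeline` and `top_users` are hypothetical helper names, and the database calls are shown only as comments because they need a live MongoDB instance.

```python
def top_users_pipeline(limit=5):
    """Aggregation pipeline: count documents per user, highest counts first."""
    return [
        {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]

def top_users(collection, limit=5):
    # aggregate() returns a cursor in PyMongo 3+, so materialize it with list()
    return list(collection.aggregate(top_users_pipeline(limit)))

# Usage against a live database (not run here):
# from pymongo import MongoClient
# db = MongoClient("localhost:27017").osm
# print(db.jersey_city.count_documents({"type": "node"}))
# print(top_users(db.jersey_city))
```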

#### Different Zip Codes Counts

In [45]:
db.jersey_city.aggregate([{ "$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, 
                          {"$match" : {"_id" : {"$ne" : None}}},
                          {"$sort" : {"count" : -1}}
                         ])["result"]

[{'_id': '11203', 'count': 11773},
 {'_id': '11215', 'count': 9533},
 {'_id': '11221', 'count': 9402},
 {'_id': '11236', 'count': 8797},
 {'_id': '11220', 'count': 8612},
 {'_id': '11377', 'count': 8374},
 {'_id': '11385', 'count': 8282},
 {'_id': '11233', 'count': 8015},
 {'_id': '11218', 'count': 6880},
 {'_id': '11212', 'count': 6751},
 {'_id': '11216', 'count': 6381},
 {'_id': '11211', 'count': 6339},
 {'_id': '11226', 'count': 6140},
 {'_id': '11222', 'count': 6102},
 {'_id': '11105', 'count': 5949},
 {'_id': '11378', 'count': 5675},
 {'_id': '11206', 'count': 5467},
 {'_id': '11213', 'count': 5315},
 {'_id': '11231', 'count': 4881},
 {'_id': '11103', 'count': 4761},
 {'_id': '11238', 'count': 4691},
 {'_id': '11237', 'count': 4565},
 {'_id': '11217', 'count': 4509},
 {'_id': '11207', 'count': 4505},
 {'_id': '11101', 'count': 4148},
 {'_id': '11225', 'count': 4099},
 {'_id': '11219', 'count': 3946},
 {'_id': '11201', 'count': 3944},
 {'_id': '10301', 'count': 3674},
 {'_id': '111

#### Only Jersey City Zip Codes
I noticed that the coverage of Jersey City zip codes alone is very limited in OSM.

In [50]:
# Get counts of Only list of Jersey City Zip Codes.
db.jersey_city.aggregate([{"$match" : {"address.postcode" : {"$in" : ["07097","07302","07303","07304","07305","07306","07307","07308","07310","07311"]} }},
                          {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}, 
                          {"$sort" : {"count" : -1}}
                         ])["result"]

[{'_id': '07302', 'count': 61},
 {'_id': '07306', 'count': 19},
 {'_id': '07311', 'count': 2},
 {'_id': '07304', 'count': 2},
 {'_id': '07310', 'count': 1}]

#### All Cities
All cities in my selected area of interest

In [53]:
db.jersey_city.aggregate([{"$group" : {"_id" : "$address.city", "count" : {"$sum" : 1}}}, 
                          {"$match" : {"_id" : {"$ne" : None}}},
                          {"$sort" : {"count" : -1}}
                         ])["result"]

[{'_id': 'New York', 'count': 3733},
 {'_id': 'Brooklyn', 'count': 965},
 {'_id': 'Hoboken', 'count': 504},
 {'_id': 'New York City', 'count': 127},
 {'_id': 'Jersey City', 'count': 73},
 {'_id': 'Astoria', 'count': 69},
 {'_id': 'Long Island City', 'count': 34},
 {'_id': 'Newark', 'count': 33},
 {'_id': 'Sunnyside', 'count': 11},
 {'_id': 'Orange', 'count': 11},
 {'_id': 'Union City', 'count': 10},
 {'_id': 'Elizabeth', 'count': 9},
 {'_id': 'Staten Island', 'count': 8},
 {'_id': 'Woodside', 'count': 8},
 {'_id': 'brooklyn', 'count': 7},
 {'_id': 'New York, NY', 'count': 6},
 {'_id': 'Queens', 'count': 6},
 {'_id': 'Weehawken', 'count': 6},
 {'_id': 'Brooklyn, NY', 'count': 5},
 {'_id': 'Ridgewood', 'count': 4},
 {'_id': 'West New York', 'count': 4},
 {'_id': 'Bloomfield', 'count': 3},
 {'_id': 'NEW YORK CITY', 'count': 3},
 {'_id': 'North Bergen', 'count': 2},
 {'_id': 'North Arlington', 'count': 1},
 {'_id': 'newark', 'count': 1},
 {'_id': 'New York NY', 'count': 1},
 {'_id': 'Water

## 4. Additional Ideas

### nycbuildings

In the previous section, I found that the top 5 users all follow a naming pattern ending in "nycbuildings".
I researched this further. It looks like the bulk of this data was loaded into OSM last year as part of the NYC open data initiative.

The article below contains details on this:
https://www.mapbox.com/blog/nyc-buildings-openstreetmap/

After querying with a MongoDB regex, I found that about 78% of this dataset comes from nycbuildings users.

In [80]:
db.jersey_city.find( {"created.user" : {"$regex" : "nycbuildings$" } } ).count()

1596652

These nycbuildings documents do not carry much useful information. Only a few of them have an address, and other useful attributes such as amenities are missing.

In [81]:
db.jersey_city.find( {"created.user" : {"$regex" : "nycbuildings$" }, "address" : {"$exists" : 1} } ).count()

230417
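
The shares behind these statements follow directly from the counts reported above. A short sketch of the arithmetic, using only numbers already shown in this notebook:

```python
total_docs = 2033581         # all documents in the collection
nyc_docs = 1596652           # documents created by *nycbuildings users
nyc_with_address = 230417    # nycbuildings documents that have an address

nyc_share = nyc_docs / total_docs
address_share = nyc_with_address / nyc_docs

print("nycbuildings share of dataset: {:.1%}".format(nyc_share))
print("nycbuildings documents with an address: {:.1%}".format(address_share))
```

This gives roughly 78.5% for the nycbuildings share and roughly 14.4% for the fraction of those documents that carry an address.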

### Number of documents for TIGER and GNIS

After looking further into this dataset, I found many "way" entries from TIGER (Topologically Integrated Geographic Encoding and Referencing system).
http://wiki.openstreetmap.org/wiki/TIGER

There are also many entries for "node" from GNIS (USGS Geographic Names Information System).
http://wiki.openstreetmap.org/wiki/USGS_GNIS

According to the documentation for both of these sources, their data may be outdated or incorrect.
GNIS:ID is supposed to map to OSM amenity tags, but no corresponding amenity tags are present for these GNIS:ID entries.

In [61]:
# Way document from tiger
db.jersey_city.find( { "type" : "way", "tiger:cfcc" : { "$exists": 1 } } ).count()

17388

In [59]:
# node documents from gnis
db.jersey_city.find( { "type" : "node", "gnis:created" : { "$exists": 1 } } ).count()

2117

####  Node and Way with Address

In [67]:
db.jersey_city.find({"type" : "node", "address.street" : {"$exists" : 1}}).count()

54045

In [83]:
db.jersey_city.find({"type" : "way", "address.street" : {"$exists" : 1}}).count()

200005

Because of the incomplete nycbuildings, TIGER and GNIS data, my dataset does not provide much useful information.
Below are some ideas that would make this OSM data more useful:

* Cleaning up invalid and outdated data from TIGER and GNIS.
* Mapping valid TIGER/GNIS data keys/attributes to OSM keys/attributes.
* Adding more information and attributes to nycbuildings data.
* Attracting more users and developers to OSM.

### Conclusion
The data for Jersey City and the neighbouring areas needs cleaning and mapping. The nycbuildings data has created a nice skeleton that can be leveraged to make this data more useful.
In its current state, it is difficult to gain any intelligence from this data.