# OpenStreetMap Project
## Ashir Amin

https://mapzen.com/data/metro-extracts/#austin-texas

1. Problems Encountered 
   * Street Names Abbreviated
   * Phone Numbers not correctly formatted
2. Data Overview
3. Additional Ideas


In [10]:
from pymongo import MongoClient
import os
client = MongoClient()
db = client.testdb["osm"]

## Problems Encountered in the Data

I ran count_tags.py to get a rough idea of the kinds of keys in the tag elements, dumped the counts into a JSON file, and examined how the data was distributed. The majority of the keys have the addr prefix.
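The key-counting step could look roughly like the sketch below. count_tags.py itself is not shown here, so this is only an assumption about its general shape: the function name `count_tag_keys` and the tiny sample XML are made up for illustration, and the real script would read the Austin extract from disk instead.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tag_keys(osm_source):
    """Count how often each `k` attribute appears on <tag> elements."""
    counts = Counter()
    # iterparse streams the input, so a large extract need not fit in memory
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "tag" and "k" in elem.attrib:
            counts[elem.attrib["k"]] += 1
        elem.clear()  # release the element once it has been inspected
    return counts


# Tiny made-up sample standing in for the real extract
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:street" v="Lamar Blvd"/><tag k="phone" v="5125550100"/></node>
  <way id="2"><tag k="addr:street" v="Burnet Rd"/><tag k="FIXME" v="check name"/></way>
</osm>"""

key_counts = count_tag_keys(io.BytesIO(SAMPLE))
print(key_counts.most_common())
```

Passing a file path instead of the `BytesIO` object should work the same way, since `iterparse` accepts either.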

The next step was to sample some values for each of these keys to see where there was an opportunity to clean the data before importing it into MongoDB.

Examining the data showed that two geographic encoding systems were being used, GNIS and TIGER, with TIGER being the more comprehensive.

There were some tags with just one occurrence and some that I couldn't really make sense of, such as 'FIXME', which is a flag or comment left by users who felt the information might be incorrect.

For the purposes of this exercise I cleaned two fields within the tag element:
* Abbreviated street names
* Phone numbers

For the street names I wrote scrape.py, which scrapes a webpage to get the USPS mapping of street names to their abbreviations. I added some custom prefixes such as IH and I35 (Interstate Highway 35) and used the resulting mapping to clean the street names.
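As a rough illustration of how such a mapping can be applied, here is a minimal sketch. The mapping below is a small hand-written excerpt, not the full table produced by scrape.py, and `clean_street_name` and `INTERSTATE_RE` are hypothetical names chosen for this example.

```python
import re

# Hypothetical excerpt of the abbreviation -> full name mapping;
# the scraped USPS table would be much longer.
STREET_MAPPING = {
    "St": "Street",
    "Ave": "Avenue",
    "Blvd": "Boulevard",
    "Rd": "Road",
    "Dr": "Drive",
    "IH": "Interstate Highway",
}

# Matches custom interstate tokens such as "I35" or "I-35"
INTERSTATE_RE = re.compile(r"^I-?(\d+)$", re.IGNORECASE)


def clean_street_name(name, mapping=STREET_MAPPING):
    """Expand known abbreviations word by word; leave other words alone."""
    cleaned = []
    for word in name.split():
        token = word.rstrip(".,")  # "Blvd." should match the key "Blvd"
        match = INTERSTATE_RE.match(token)
        if match:
            cleaned.append("Interstate Highway " + match.group(1))
        else:
            cleaned.append(mapping.get(token, word))
    return " ".join(cleaned)


print(clean_street_name("N. Lamar Blvd."))
print(clean_street_name("S I35 Frontage Rd"))
```

A word-by-word lookup like this is simple but blunt: it would also expand "St" in a name like "St Johns Ave", so a real cleaning pass may want to restrict some replacements to the last word of the name.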

I found that the phone numbers in the dataset did not follow any set formatting standard, so I used the Python package phonenumbers to format them as (xxx) xxx-xxxx.
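The target format can be illustrated with a standard-library stand-in. This is not the phonenumbers package used for the actual cleaning: it is a simplified regex-based sketch that assumes US numbers only and does no validation of area codes, and `format_us_phone` is a hypothetical helper name.

```python
import re


def format_us_phone(raw):
    """Return '(xxx) xxx-xxxx' for a 10-digit US number, else None."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    # Drop a leading US country code such as "+1" or "1-"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # leave unparseable values for manual review
    return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])


for raw in ["+1 512-555-0123", "512.555.0123", "555-0123"]:
    print(raw, "->", format_us_phone(raw))
```

Returning `None` for anything that is not ten digits keeps bad values visible instead of silently reformatting them.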



## Data Overview

#### Basic statistics regarding the data set imported into MongoDB

In [26]:
# Number of Documents
print db.count()

701231


In [25]:
# Number of Nodes and Ways

# Nodes Count
print db.find({'type':'node'}).count()

# Way Count
print db.find({ 'type' : 'way'}).count()



635008
66223


In [33]:
# Unique Users
print len(db.distinct("created.user"))

697


In [47]:
# Top 5 Users by Contribution

query = db.aggregate([
        {
            '$group': {
                '_id': '$created.user',
                'count': {
                    '$sum': 1
                    }
            }
        }, {
            '$sort': {
                'count': -1
            }
        }, {
            '$limit' : 5
        } 
        
    ])
    
list(query)

[{u'_id': u'patisilva_atxbuildings', u'count': 274500},
 {u'_id': u'ccjjmartin_atxbuildings', u'count': 130117},
 {u'_id': u'ccjjmartin__atxbuildings', u'count': 93953},
 {u'_id': u'wilsaj_atxbuildings', u'count': 35852},
 {u'_id': u'jseppi_atxbuildings', u'count': 30121}]

In [49]:
# Total Users with one contribution

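query = db.aggregate([{
            # Count documents per user
           '$group': {
                '_id': '$created.user',
                'count': {
                    '$sum': 1
                    }
            } 
        }, {
            # Keep only users with exactly one contribution, rather than
            # relying on 1 being the smallest count after sorting
            '$match': {
                'count': 1
            }
        }, {
            # Count how many such users there are
            '$group' : {
                '_id': '$count',
                'num_users': {
                    '$sum' : 1
                }
            }
        }
        
    ])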

print list(query)

[{u'num_users': 175, u'_id': 1}]


In [71]:
# Top 5 Zipcodes

query = db.aggregate([ {
            '$match' : {
                'zipcode' : {
                    '$exists' : 1
                }
            }
        }, {
            '$group' : {
                '_id' : '$zipcode',
                'count' : {
                    '$sum' : 1
                }
            }
        }, {
            '$sort' : {
                'count' : -1
            }
        }, {
            '$limit' : 5
        }
    ])

print list(query)

[{u'count': 1080, u'_id': [u'78645']}, {u'count': 561, u'_id': [u'78734']}, {u'count': 350, u'_id': [u'78660']}, {u'count': 344, u'_id': [u'78653']}, {u'count': 315, u'_id': [u'78669']}]


In [78]:
# Documents without a phone number

query = db.aggregate([ {
            '$match' : {
                'contact_number' : {
                    '$exists' : 0
                }
            }
        }, {
            '$group' : {
                '_id' : 'contact_number',
                'count': {
                    '$sum' : 1
                }
            }
        }
        
    ])

print list(query)

[{u'count': 701177, u'_id': u'contact_number'}]


## Additional Ideas
 
* Contributions are heavily skewed toward the top few users. This is evident from the fact that, out of 697 users, 175 have contributed just once.
* According to http://wiki.openstreetmap.org/wiki/TIGER the last data import was in 2005, which is a considerable time ago, and a lot may have changed since then. It would be nice if this data were cross-validated against Google Maps data, because some of the places that existed back then may not exist anymore.
* In terms of validation there is a lot more room for improvement. I validated phone numbers and street addresses, but with more exploration other fields could be cleaned as well.
* I felt most of the queries revolved around places and not essentially directions between a source and a destination.
* While the data can be cleaned further, I believe I did sufficient cleaning for the purposes of this exercise.
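
The skew mentioned in the first bullet can be put into numbers using the query outputs from the Data Overview section. The snippet below is a small arithmetic sketch; the variable names are chosen for this example, and every figure is copied from the outputs above.

```python
# Figures copied from the query outputs in the Data Overview section
total_documents = 701231
top5_counts = [274500, 130117, 93953, 35852, 30121]
total_users = 697
single_contribution_users = 175

# float() keeps the division correct under both Python 2 and Python 3
top5_share = sum(top5_counts) / float(total_documents)
single_share = single_contribution_users / float(total_users)

print("Top 5 users: {:.1%} of all documents".format(top5_share))
print("Single-contribution users: {:.1%} of all users".format(single_share))
```

By these figures the five most active users account for roughly four fifths of all documents, while about a quarter of all users appear exactly once.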