# Project 3 - Open Street Map Data Wrangling

## Abstract
Open Street Map is a project to create an open source map of the world. It is most easily understood as the Wikipedia of maps: anyone in the world can add to the mapping dataset. The result is a rich dataset, interesting both for learning about geographic areas around the world and for what the data reveals about itself, such as how many people contributed to the dataset for a particular area. As with any project reliant on human input, though, the data is sometimes inconsistent.

### Los Angeles, California
Los Angeles, California was chosen for two primary reasons. First, Los Angeles is one of the largest cities in America and the second most populous. It is therefore likely to have more records and more contributors than most other American cities. Second, I was born close to Los Angeles, so the dataset is more interesting to me.

### Problems Encountered
Three problems were found in the data file. All work is documented in the Cleaning Iteration Jupyter notebook.
  
**Address Suffixes**  
Street names in the dataset were inconsistently abbreviated. For instance, "Street" was sometimes represented as "st." or "st", and in one occurrence misspelled as "sttreet". Other suffixes looked like zip codes or plain numbers. The obvious inconsistencies were documented and corrected before the addresses were loaded into MongoDB.
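
The suffix cleanup described above can be sketched as a lookup on the last word of each street name. This is a minimal illustration, not the notebook's actual code: `SUFFIX_FIXES` and `fix_street_name` are hypothetical names, the mapping holds only the variants mentioned in this report, and expanding "ave" to "Avenue" is an assumption.

```python
import re

# Hypothetical mapping; the real list was built by hand from the audit.
SUFFIX_FIXES = {
    "st": "Street",
    "st.": "Street",
    "sttreet": "Street",
    "ave": "Avenue",
    "ave.": "Avenue",
}

# Matches the last whitespace-delimited token of a street name.
LAST_TOKEN = re.compile(r"\S+$")


def fix_street_name(name, fixes=SUFFIX_FIXES):
    """Return name with its final token replaced when a fix is known."""
    name = name.strip()
    match = LAST_TOKEN.search(name)
    if match is None:
        return name
    replacement = fixes.get(match.group().lower())
    if replacement is None:
        return name
    return name[:match.start()] + replacement


for raw in ("Sunset st.", "Main Sttreet", "Wilshire Boulevard"):
    print(raw, "->", fix_street_name(raw))
```

Matching on the lowercased last token keeps the table small, at the cost of missing abbreviations that appear in the middle of a name.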

**Postal Codes**  
The postcodes were verified in MongoDB with the following query.
> db.la_map.find({\$and:[{"address.postcode":{\$exists: true}},{"address.postcode" :{$not: /\d+-?\d+/}}]},{"address.postcode":1, "pos":1})

<img src="postcodes.png">

This found all postcodes that did not conform to the **digit-digit** representation (with the hyphen optional) that is expected in America. Two distinct issues arose among the non-conforming values. The first was straightforward: someone had labeled the postcode as Disneyland for 10 nodes. The fix was to replace this value with Disneyland's actual postcode. The second involved four other instances where Ca, California's abbreviation, or the city name was used as the postcode. For these, the actual postcode was imputed from the latitude and longitude of the node. The results were added to a Python dictionary and substituted as the JSON was processed.

In [1]:
zip_mapping = {(34.0572495, -118.2751067): 90057, (33.6561412, -117.7536445): 92618,
               (34.375206, -118.5295727): 91321, (34.2623343, -118.3200911): 91040}
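
One way the dictionary could be applied while the JSON is processed is sketched below. It is an illustration under stated assumptions, not the notebook's code: `fix_postcode` is a hypothetical helper, each document is assumed to carry `pos` as `[lat, lon]` and the postcode as a string under `address.postcode`, and storing the replacement as a string is also an assumption.

```python
import re

# Same unanchored pattern as the MongoDB query above.
POSTCODE_OK = re.compile(r"\d+-?\d+")

# Coordinates and ZIP codes copied from zip_mapping above.
zip_mapping = {(34.0572495, -118.2751067): 90057, (33.6561412, -117.7536445): 92618,
               (34.375206, -118.5295727): 91321, (34.2623343, -118.3200911): 91040}


def fix_postcode(document, mapping=zip_mapping):
    """Replace a non-conforming postcode when the node position is in mapping."""
    postcode = document.get("address", {}).get("postcode")
    if postcode is None or POSTCODE_OK.search(postcode):
        return document  # nothing to fix
    key = tuple(document.get("pos", ()))
    if key in mapping:
        document["address"]["postcode"] = str(mapping[key])
    return document


node = {"pos": [34.0572495, -118.2751067], "address": {"postcode": "Ca"}}
print(fix_postcode(node)["address"]["postcode"])
```

Keying on exact coordinates only works because the affected nodes are few and known in advance; a general fix would need a proper reverse-geocoding step.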

### Additional Ideas

#### Import establishments from Environmental Health Data
Los Angeles rates every dining establishment on a regular basis, so a public, high-quality list of the name, facility type, and address of each of these facilities is available. Importing this dataset would add many relevant points of interest to OSM.

Unfortunately this is limited to the Los Angeles region and does not provide a consistent solution that can be used globally. Another issue is that while mass imports may sound simple, they can cause trouble on the initial import, and reimports to update the data over time may not be possible. Issues like those that arose from the TIGER import would likely occur with an Environmental Health Data import as well.

#### Machine Learning to autocorrect or suggest fixes
It would be possible to create a machine learning algorithm that predicts attributes or determines which attributes may be incorrect. For instance, a machine learning algorithm could perform an analysis to determine zip codes. If one node has a different postcode than all its surrounding nodes, a clustering analysis may detect this. Similarly, a machine learning algorithm could perhaps learn where nodes of interest are likely to exist.
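
As a rough illustration of the postcode idea, the sketch below flags a node whose postcode differs from a unanimous vote of its nearest neighbours. It swaps in a simple nearest-neighbour heuristic rather than a trained model or a true clustering method. The function name, the flat `postcode` key, and plain Euclidean distance on degrees are all assumptions made for brevity.

```python
from collections import Counter
from math import hypot


def suspicious_postcodes(nodes, k=3):
    """Return (node, neighbour_postcode) pairs for nodes that disagree with
    all k nearest neighbours.

    nodes is a list of dicts with "pos" ([lat, lon]) and "postcode" keys.
    The brute-force search is quadratic, so this suits small samples only.
    """
    flagged = []
    for node in nodes:
        others = [n for n in nodes if n is not node]
        others.sort(key=lambda n: hypot(n["pos"][0] - node["pos"][0],
                                        n["pos"][1] - node["pos"][1]))
        neighbours = [n["postcode"] for n in others[:k]]
        if len(neighbours) < k:
            continue  # not enough neighbours to vote
        majority, votes = Counter(neighbours).most_common(1)[0]
        if votes == k and majority != node["postcode"]:
            flagged.append((node, majority))
    return flagged


sample = [
    {"pos": [34.000, -118.0000], "postcode": "90012"},
    {"pos": [34.001, -118.0010], "postcode": "90012"},
    {"pos": [34.002, -118.0000], "postcode": "90012"},
    {"pos": [34.001, -118.0005], "postcode": "90210"},  # the odd one out
]
for node, suggestion in suspicious_postcodes(sample):
    print(node["postcode"], "looks wrong; neighbours say", suggestion)
```

Requiring a unanimous vote keeps false positives down near genuine postcode boundaries, where neighbours legitimately disagree.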

Similar to the issue above, however, this machine learning algorithm would have to generalize across all regions of the world, as postal codes and points of interest vary by country. Additionally, the human and computational resources needed to develop and train a model may be prohibitive for an effort that is largely sustained by donations. It may also stand against the goals of the organization, which on its webpage proclaims "OpenStreetMap emphasizes local knowledge" and "OpenStreetMap is built by a community of mappers that contribute and maintain data". It may go against the culture of the organization to have an algorithm contribute data.


## Conclusion
The LA dataset is quite large, over 1 gigabyte as JSON, and took numerous hours to process on my laptop. Querying with MongoDB was much quicker than parsing in Python, however, and I would expect MongoDB on an external server to be quicker still.

The OSM street map data for Los Angeles was surprisingly clean; most fields were correct. It is nonetheless evident that multiple people are contributing, as shown by small differences such as **ave** vs **ave.** and by mistaken postcodes. In general it is quite amazing that, without financial incentive, Open Street Map has been able to create such a robust dataset of the things around us.

## Methodology
The map data was initially downloaded from MapZen as a bz2 file. The data was then preprocessed into a JSON file using Python, before being imported into a MongoDB instance for further analysis.
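
The decompression step is not shown in this report. A minimal sketch using only the standard library might look like the following. `decompress_bz2` is a hypothetical helper, and the original work may just as well have used a command-line tool.

```python
import bz2
import shutil


def decompress_bz2(source_path, target_path, chunk_size=1024 * 1024):
    """Stream a .bz2 archive to an uncompressed file without loading it all into memory."""
    with bz2.open(source_path, "rb") as source, open(target_path, "wb") as target:
        shutil.copyfileobj(source, target, chunk_size)
    return target_path
```

Streaming in chunks matters here because the uncompressed OSM file is over a gigabyte.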

## Data Overview

### Verify Tags

In [1]:
file_location = r"data\los-angeles_california.osm"  # raw string so the backslash is not read as an escape

In [2]:
from programs import tags
tag = tags.process_map(file_location)

In [3]:
tag

{'lower': 1804949, 'lower_colon': 2122407, 'other': 176260, 'problemchars': 0}

Out of the 4,103,616 tags, luckily none contain any of the characters labeled as problem characters in the Lesson 6 example.
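
For readers without the `programs.tags` module, the classification behind these four buckets can be sketched as below. The three regular expressions are assumed to mirror the Lesson 6 exercise and may differ from the ones actually used. `key_type` and `count_key_types` are hypothetical names.

```python
import re
from collections import Counter

# Patterns assumed to mirror the Lesson 6 exercise.
LOWER = re.compile(r"^([a-z]|_)*$")
LOWER_COLON = re.compile(r"^([a-z]|_)*:([a-z]|_)*$")
PROBLEMCHARS = re.compile(r"[=\+/&<>;'\"\?%#$@,\. \t\r\n]")


def key_type(key):
    """Classify a tag key into one of the four buckets reported above."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


def count_key_types(keys):
    """Tally the bucket of every key, always reporting all four buckets."""
    counts = Counter({"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0})
    counts.update(key_type(key) for key in keys)
    return dict(counts)


print(count_key_types(["highway", "addr:street", "FIXME", "tiger:name_base_1"]))
```

Keys with uppercase letters, digits, or more than one colon fall into `other`, which is why that bucket is not empty even in clean data.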

### Number of contributors and elements

In [4]:
from programs import users as users_module  # alias so the result below does not shadow the module
users, ids = users_module.process_map(file_location)

In [5]:
"Number of Users: {0}     Number of Ids: {1}".format(len(users),len(ids))

'Number of Users: 3026     Number of Ids: 5952617'

In [6]:
list(ids)[:5]

['54466833', '2170105523', '123699544', '7318336', '1118375156']

Out of the millions of people that have visited or live in LA, it seems that only 3026 are responsible for all the points in the LA Open Street Map project. Further analysis will be done after the MongoDB database has been created.

### Count Nodes and Ways
A function was added to the data module that counts the way and node elements in the original LA OSM file. This count is used later to verify that the import into MongoDB was successful.

In [8]:
from programs import data
count = data.count_elements(file_location)
print(count)
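
The body of `data.count_elements` is not reproduced here. A counter along these lines, built on a streaming XML parse, would do the job. This is a sketch under that assumption, with the hypothetical name `count_nodes_and_ways`, not the module's actual implementation.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_nodes_and_ways(source, wanted=("node", "way")):
    """Count the wanted top-level OSM elements in a file path or file object."""
    counts = Counter()
    # iterparse streams the file, so the whole document is never held at once.
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag in wanted:
            counts[element.tag] += 1
        element.clear()  # drop attributes and children that are no longer needed
    return dict(counts)


sample = io.BytesIO(
    b'<osm><node id="1"><tag k="a" v="b"/></node><node id="2"/>'
    b'<way id="3"><nd ref="1"/></way></osm>'
)
print(count_nodes_and_ways(sample))
```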

### Create json file
For import into MongoDB, the OSM file, an XML file, is converted into JSON using Python. During the conversion process street names are checked and converted. The list of conversions was generated manually by reviewing a list of all street suffixes to check for repeats or typos. Processing the map takes significant resources and was precomputed outside of this notebook.

| File Type | Size (MB) |
|-----------|-----------|
| bz2       | 89        |
| OSM       | 1211      |
| json      | 1568      |



In [9]:
if False:  # Prevents execution
    data.process_map(file_location)
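
To make the conversion concrete, the sketch below shapes one XML element into the kind of document the later queries rely on (`osm_type`, `created.user`, `pos`, `address.postcode`, `cuisine`). It is a simplified guess at the structure, not the code in `data.process_map`: `shape_element` is named after the Lesson 6 exercise, and the handling of `addr:` keys is an assumption.

```python
import json
import xml.etree.ElementTree as ET


def shape_element(element):
    """Turn a <node> or <way> element into a JSON-ready dict, else return None."""
    if element.tag not in ("node", "way"):
        return None
    doc = {
        "osm_type": element.tag,  # "type" is left free for OSM's own tag
        "id": element.get("id"),
        "created": {"user": element.get("user")},
    }
    if element.get("lat") is not None and element.get("lon") is not None:
        doc["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    for tag in element.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:  # skip deeper keys such as addr:street:name
                doc.setdefault("address", {})[parts[1]] = value
        else:
            doc[key] = value  # note: a tag named "id" or "pos" would overwrite
    return doc


xml = ('<node id="1" lat="34.05" lon="-118.25" user="alice">'
       '<tag k="addr:postcode" v="90012"/><tag k="cuisine" v="coffee_shop"/></node>')
print(json.dumps(shape_element(ET.fromstring(xml))))
```

Writing one such JSON object per line gives a file that `mongoimport` can read directly.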

# MongoDB
MongoDB is a popular NoSQL database that stores its data in collections of documents. Documents have a flexible schema, which works well for the OpenStreetMap data as not every node and way has the same "columns". If the Open Street Map data were stored in a tabular database, many fields would be null; for instance, "Outside Seating" would be irrelevant for most businesses. Additionally, adding extra data would be burdensome, as new columns would have to be added to an ever-growing table.

### Loading Data
The JSON file generated by the Python step above is first loaded into MongoDB using the following command in the terminal:

>mongoimport --db test --collection la_map --file los-angeles_california.osm.json

After some exploratory analysis it was found that there were further issues in the OSM database. After making the necessary fixes the following command was used to load a collection in the final database.

>mongoimport --db final --collection la_map --file los-angeles_california.osm.json


## Queries
### Loading Python Driver

In [10]:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client.final

### Verifying all documents imported

In [11]:
db.la_map.count_documents({})  # Cursor.count() was removed in PyMongo 4

5953758

### Counting Number of Elements in entire collection

In [12]:
counter = 0
for document in db.la_map.find():
    counter += len(document)
print(counter)

45403693


### Number of Nodes and Ways
The Lesson 6 files initially used the **type** attribute to denote nodes and ways. After an initial exploration it was found that the OSM data already contained a type field. As such, a new, otherwise unused attribute named **osm_type** was created to denote nodes and ways.

In [13]:
db.la_map.distinct("osm_type")

['node', 'way']

In [14]:
list(db.la_map.aggregate([{"$group":{"_id":"$osm_type", "Count":{"$sum":1}}}]))

[{'Count': 579785, '_id': 'way'}, {'Count': 5373973, '_id': 'node'}]

### See what types of cuisines are available

In [15]:
db.la_map.distinct("cuisine")

['burger',
 'japanese',
 'american',
 'thai',
 'korean',
 'vietnamese',
 'sushi',
 'mexican',
 'italian',
 'roast_beef',
 'coffee',
 'sandwich',
 'hawaiian',
 'ice_cream',
 'pizza',
 'donut',
 'fish_and_chips',
 'chinese',
 'steak_house',
 'chicken;mexican',
 'chicken',
 'Japanese Ramen',
 'Northern Chinese',
 'taiwanese',
 'indian',
 'mediterranean',
 'cantonese',
 'regional',
 'coffee_shop',
 'french',
 'gastropub',
 'noodle',
 'asian',
 'peruvian',
 'greek',
 'steak;seafood',
 'italian;mediterranean',
 'deli',
 'burger;american',
 'barbecue',
 'american;bakery',
 'seafood',
 'Californian',
 'breakfast',
 'burger;mexican',
 'american;brewpub',
 'garlic',
 'indonesian',
 'pizza;chicken',
 'catering',
 'juice',
 'sushi;steak;japanese',
 'seafood;steak',
 'mexican;pizza',
 'seafood;sushi;steak',
 'greek;burger',
 'seafood;california',
 'chinese;sushi',
 'seafood;steak;hawaiian',
 'sushi;japanese;steak',
 'mexican;steak;seafood',
 'sushi;california',
 'chicken;ice_cream',
 'seafood;brewp
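
The list mixes single values, semicolon-separated combinations, and free-text entries such as 'Japanese Ramen'. A small post-processing step could split and normalize these before counting, as sketched below. The helper names are hypothetical, and turning spaces into underscores is an assumed convention that follows the style of most values in the list.

```python
from collections import Counter


def split_cuisines(value):
    """Split a raw cuisine value on ';' and normalize each part."""
    return [part.strip().lower().replace(" ", "_")
            for part in value.split(";") if part.strip()]


def cuisine_counts(values):
    """Count how often each individual cuisine appears across raw values."""
    counts = Counter()
    for value in values:
        counts.update(split_cuisines(value))
    return counts


raw = ["chicken;mexican", "Japanese Ramen", "mexican", "sushi;japanese;steak"]
print(cuisine_counts(raw).most_common(3))
```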

### Count Number of Coffee Shops
Reading through the list I realized that I am actually very curious about the coffee shops in Los Angeles. Google Maps returns many results, as does Yelp, so I would like to see if that's reflected here as well.

In [18]:
db.la_map.count_documents({"cuisine":{"$regex": u".*coffee.*"}})  # Cursor.count() was removed in PyMongo 4

174

For these coffee shops I would like to see how many users submitted entries

In [17]:
list(db.la_map.aggregate([{"$match":{"cuisine":{"$regex": u".*coffee.*"}}},
                         {"$group": {"_id":"$created.user", "contributions": {"$sum":1}}},
                         {"$sort":{"contributions":-1}},
                         {"$limit":20}
                         ]))

[{'_id': 'Brian@Brea', 'contributions': 19},
 {'_id': 'andrewpmk', 'contributions': 7},
 {'_id': 'ponzu', 'contributions': 7},
 {'_id': 'kisaa', 'contributions': 7},
 {'_id': 'Clarke22', 'contributions': 6},
 {'_id': 'youngbasedallah', 'contributions': 4},
 {'_id': 'DMaximus', 'contributions': 4},
 {'_id': 'michael_kirk', 'contributions': 3},
 {'_id': 'freewillisanillusion', 'contributions': 3},
 {'_id': 'kdano', 'contributions': 3},
 {'_id': 'bondah', 'contributions': 3},
 {'_id': 'Oleg Shalaev', 'contributions': 3},
 {'_id': 'release_candidate', 'contributions': 3},
 {'_id': 'thevirginian', 'contributions': 3},
 {'_id': 'StellanL', 'contributions': 3},
 {'_id': 'jgpacker', 'contributions': 3},
 {'_id': 'sankeytm', 'contributions': 2},
 {'_id': 'palewire', 'contributions': 2},
 {'_id': 'pverik', 'contributions': 2},
 {'_id': 'Peejster', 'contributions': 2}]

# Appendix

## Lesson 6 Exercises
All exercises are located in the programs folder. Modifications and extra functions outside of the required exercises are included for use with the LA OSM file. Additional modules exist for data munging the LA OSM file before conversion

## Reference
Map Source - https://mapzen.com/data/metro-extracts  
MongoDB Manual - https://docs.mongodb.org/manual/  
FourSquare and OSM - https://www.mapbox.com/blog/connecting-foursquare-openstreetmap/  
Los Angeles County Department of Public Health - https://ehservices.publichealth.lacounty.gov/ezsearch