# Data Wrangle OpenStreetMap Data
#### Mahlon Barrault
#### August 28, 2015
#### Map Area: Calgary, Alberta ([Mapzen metro extract](https://mapzen.com/data/metro-extracts), includes suburbs)

## Table of Contents

[Problems Encountered in the Map](#Problems-Encountered-in-the-Map)
    
* [Directional Suffixes](#Directional-Suffixes)
* [Postal Codes](#Postal-Codes)
* [Rural Roads](#Rural-Roads)

[Data Overview](#Data-Overview)

[Additional Ideas](#Additional-Ideas)

### Problems Encountered in the Map

In [2]:
#%run tags.py

No characters that could cause issues creating keys in MongoDB were discovered. However, there appear to be some keys that are inconsistent.
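
The code in tags.py is not reproduced here. The following is a minimal sketch of this kind of key check, assuming it follows the Lesson 6 pattern of classifying each tag key with regular expressions. The patterns and the `key_type` and `audit_keys` names are assumptions for illustration, and the sample keys are made up.

```python
import re
from collections import Counter

# Assumed patterns, modelled on the Lesson 6 exercise; tags.py may differ.
LOWER = re.compile(r'^[a-z_]+$')
LOWER_COLON = re.compile(r'^[a-z_]+(:[a-z_]+)+$')
PROBLEM_CHARS = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')


def key_type(key):
    """Classify a tag key by how safe it is to use as a MongoDB field name."""
    if PROBLEM_CHARS.search(key):
        return 'problemchars'
    if LOWER.match(key):
        return 'lower'
    if LOWER_COLON.match(key):
        return 'lower_colon'
    return 'other'


def audit_keys(keys):
    """Count how many keys fall into each category."""
    return Counter(key_type(k) for k in keys)


# Made-up keys: one per category.
print(audit_keys(['highway', 'addr:street', 'addr.street', 'FIXME']))
```

Keys that land in the 'other' bucket (mixed case, digits) are a quick way to spot the inconsistent keys mentioned above.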

In [3]:
%run mapparser.py

Tags:
defaultdict(<type 'int'>, {'node': 779009, 'nd': 935489, 'bounds': 1, 'member': 22520, 'tag': 325867, 'relation': 537, 'way': 83492, 'osm': 1})

Attributes:
defaultdict(<type 'int'>, {'changeset': 863038, 'maxlon': 1, 'type': 22520, 'uid': 863038, 'generator': 1, 'timestamp': 863039, 'k': 325867, 'v': 325867, 'lon': 779009, 'minlat': 1, 'version': 863039, 'role': 22520, 'user': 863038, 'maxlat': 1, 'lat': 779009, 'ref': 958009, 'id': 863038, 'minlon': 1})


Most of these attributes were expected. The 'type' attribute would have conflicted with the 'type' key that the Lesson 6 code adds to each document, so that key was renamed to 'node_type'. The 'role' attribute was not expected; it belongs to 'member' tags. Processing of 'member' tags is covered further on.
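
Counts like the ones above can be gathered in a single streaming pass over the file. This is a sketch under the assumption that mapparser.py uses `xml.etree.ElementTree.iterparse`; the function name and the tiny inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_tags_and_attributes(source):
    """Stream an OSM XML file and count element names and attribute names."""
    tags = defaultdict(int)
    attributes = defaultdict(int)
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            tags[elem.tag] += 1
            for name in elem.attrib:
                attributes[name] += 1
        else:
            elem.clear()  # release the element's contents once it is closed
    return tags, attributes


# Tiny made-up sample standing in for calgary_canada.osm.
SAMPLE = b"""<osm>
  <node id="1" lat="51.0" lon="-114.0"/>
  <relation id="2">
    <member type="node" ref="1" role="stop"/>
  </relation>
</osm>"""

tags, attributes = count_tags_and_attributes(io.BytesIO(SAMPLE))
print(dict(tags))
print(dict(attributes))
```

For the real extract, a file path can be passed in place of the `BytesIO` object.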

In [2]:
#%run audit.py

DIR_MAPPING = {'East' : 'E',
               'N.E.' : 'NE',
               'N.E' : 'NE',
               'N.W' : 'NW',
               'N.W.' : 'NW',
               'North' : 'N',
               'Northeast' : 'NE',
               'Northwest' : 'NW',
               'S.E' : 'SE',
               'S.E.' : 'SE',
               'S.W' : 'SW',
               'S.W.' : 'SW',
               'South' : 'S',
               'South-west' : 'SW',
               'Southeast' : 'SE',
               'South-east' : 'SE',
               'Southwest' : 'SW',
               'West' : 'W'
               }

ST_TYPE_MAPPING = { "St": "Street",
                   "St.": "Street",
                   'street' : 'Street',
                   "Rd." : 'Road',
                   "Rd" : 'Road',
                   'Ave' : "Avenue",
                   'Ave.' : "Avenue",
                   'Cres' : 'Crescent',
                   'Cres.' : 'Crescent',
                   'Blvd' : 'Boulevard',
                   'Blvd.' : 'Boulevard'
                   }

The output of audit.py was used to build DIR_MAPPING and ST_TYPE_MAPPING, which were then used to clean the directional suffixes and street types.

The following is an explanation of my process of creating clean.py.

#### Directional Suffixes

While auditing street names, the Lesson 6 audit functions were altered to account for directional suffixes. The initials of the direction (for example NE or SW) were chosen as the standard, since that is the notation used on street signs in Calgary. Some 'addr:street' values also include the city and province, so those values were split on ',' and only the first item was kept. Values like '400123 Highway 66' and 'Township Road  204A' were left unaltered because they are already valid.
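
The suffix handling described above can be sketched as follows. This is not the code from clean.py: the `clean_direction` name and the `LAST_TOKEN` pattern are assumptions, the mapping is a subset of DIR_MAPPING repeated so the example runs on its own, and the first sample value is made up.

```python
import re

# Subset of the DIR_MAPPING shown earlier, repeated so the sketch runs alone.
DIR_MAPPING = {'North': 'N', 'South': 'S', 'East': 'E', 'West': 'W',
               'Northeast': 'NE', 'Northwest': 'NW',
               'Southeast': 'SE', 'Southwest': 'SW',
               'N.E.': 'NE', 'N.W.': 'NW', 'S.E.': 'SE', 'S.W.': 'SW'}

# Last whitespace-delimited token of the street name.
LAST_TOKEN = re.compile(r'\S+$')


def clean_direction(value):
    """Drop any city/province part, then abbreviate a trailing direction."""
    name = value.split(',')[0].strip()
    match = LAST_TOKEN.search(name)
    if match and match.group() in DIR_MAPPING:
        name = name[:match.start()] + DIR_MAPPING[match.group()]
    return name


print(clean_direction('17 Avenue Southwest, Calgary, AB'))  # 17 Avenue SW
print(clean_direction('Township Road 204A'))                # unchanged
```

Looking the last token up in the dictionary, rather than pattern-matching directions, keeps names such as '400123 Highway 66' untouched by construction.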

A function was added to audit.py to count the tags that use a specific attribute. It revealed 'member' tags with 'role' attributes, which had to be accounted for in the design of the cleaning functions. Several tags also carried a 'type' attribute that would conflict with the 'type' key used in shape_node(), so that key was renamed to 'node_type'. To help visualize the structure of tags with 'member' children, member_prototype.json was produced, and the section of shape_node() that handles 'member' tags was coded from it.
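
The layout in member_prototype.json is not shown in this report, so the following is only a hypothetical sketch of how 'member' children might be folded into a document. The `shape_members` and `shape_relation` helpers and the 'members' field name are assumptions; only 'node_type' comes from the text above.

```python
import xml.etree.ElementTree as ET


def shape_members(element):
    """Collect the 'member' children of an element as a list of dicts."""
    members = []
    for member in element.iter('member'):
        members.append({
            'type': member.get('type'),
            'ref': member.get('ref'),
            'role': member.get('role'),
        })
    return members


def shape_relation(element):
    """Build a document for a relation. The element name is stored under
    'node_type' so it cannot clash with OSM's own 'type' attribute."""
    doc = {'id': element.get('id'), 'node_type': element.tag}
    members = shape_members(element)
    if members:
        doc['members'] = members  # hypothetical field name
    return doc


# Made-up relation with two members.
relation = ET.fromstring(
    '<relation id="2">'
    '<member type="way" ref="10" role="outer"/>'
    '<member type="node" ref="11" role=""/>'
    '</relation>')
print(shape_relation(relation))
```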

While working on the street name cleaning function, trailing whitespace was preventing the regex from finding the directional suffixes. Adding strip() to the values passed to the regex fixed this.
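
A short illustration of the whitespace problem, assuming a last-token pattern like the one used in the Lesson 6 audit code. The `SUFFIX_RE` and `find_suffix` names and the sample value are made up.

```python
import re

# Assumed pattern: the last token of the name, as in the Lesson 6 audit code.
SUFFIX_RE = re.compile(r'\b\S+\.?$')


def find_suffix(name):
    """Return the last token of a street name, ignoring surrounding spaces."""
    match = SUFFIX_RE.search(name.strip())
    return match.group() if match else None


raw = '16 Avenue NW '           # made-up value with a trailing space
print(SUFFIX_RE.search(raw))    # None: the last character is not part of a token
print(find_suffix(raw))         # NW
```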

An initial test after developing clean.py revealed that the "tag" tags were being processed in the top-level shape_base as well as in shape_node. The condition on the call to shape_base in shape_element needs to be recoded to include the relation tags. Some "tag" tags have a created_by k value; these need to be added to the created dictionary.

After the cleaned data was imported, the audit functions were run against the data in MongoDB and detected several None values. Examination of the update_st_name function showed that a corner case had not been accounted for: street values like 'Township Road  204A' did not need cleaning, but the if-elif block ignored them and returned None. After the fix, the cleaning functions were run again and the data was re-imported into MongoDB, dropping the existing collection first.
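
The corner case comes down to a function whose branches do not cover every input. The sketch below is not the actual update_st_name from clean.py; it assumes a simplified version that only expands street types, using a subset of ST_TYPE_MAPPING, to show the explicit fall-through that avoids the None values.

```python
import re

# Subset of the ST_TYPE_MAPPING shown earlier, repeated so the sketch runs alone.
ST_TYPE_MAPPING = {'St': 'Street', 'St.': 'Street', 'Rd': 'Road',
                   'Ave': 'Avenue', 'Blvd': 'Boulevard'}
LAST_TOKEN = re.compile(r'\S+$')


def update_st_name(name):
    """Expand an abbreviated street type; leave every other name untouched."""
    name = name.strip()
    match = LAST_TOKEN.search(name)
    if match and match.group() in ST_TYPE_MAPPING:
        return name[:match.start()] + ST_TYPE_MAPPING[match.group()]
    # Explicit fall-through: without this return, names that match no
    # branch (such as rural roads) would implicitly come back as None.
    return name


print(update_st_name('4 St'))                 # 4 Street
print(update_st_name('Township Road  204A'))  # returned unchanged
```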

After the second import the address data was audited again. Along with expected uncleaned values like 'West Creek Court 200', the audit found 'Rivercrest Drive South-east'. The DIR_MAPPING dictionary in clean.py was amended to include a mapping for this dirty value. Instead of extracting and importing all of the data again, post_import_clean.py was used to correct it in MongoDB.
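
post_import_clean.py is not shown here. An in-place correction of this kind might look like the sketch below, which assumes PyMongo 3 or later (`Collection.update_many`) and assumes the street is stored under 'address.street' as in the Lesson 6 document shape. The `fix_street_value` helper and the database name in the usage comment are hypothetical.

```python
def fix_street_value(collection, dirty, clean, field='address.street'):
    """Rewrite one known-bad street value in place and return how many
    documents changed. `collection` is expected to be a pymongo Collection;
    the default field path is an assumption about the document layout."""
    result = collection.update_many({field: dirty}, {'$set': {field: clean}})
    return result.modified_count


# Usage against a running MongoDB (database name is hypothetical):
# from pymongo import MongoClient
# db = MongoClient().osm
# fix_street_value(db.DANDP3,
#                  'Rivercrest Drive South-east', 'Rivercrest Drive SE')
```

Updating by exact value touches only the affected documents, which is much cheaper than re-running the full extract and import.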

#### Postal Codes

Some postal codes were not in the official format A1B 2C3, so they were cleaned directly in MongoDB. The bad postcodes were audited in post_import_audit.py, which found 49 malformed postcodes. The ids of these documents were collected into a list for cleaning in post_import_clean.py.
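
The report does not say how the 49 postcodes were malformed, so the following is only a sketch of a format check and a simple repair. It assumes the common cases are wrong letter case or a missing or hyphen separator; the function names and sample values are made up.

```python
import re

# Official format: letter-digit-letter, space, digit-letter-digit.
POSTCODE_RE = re.compile(r'^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]\Z')
# Looser pattern that tolerates lower case and a missing or '-' separator.
LOOSE_RE = re.compile(r'^([A-Za-z][0-9][A-Za-z])[ -]?([0-9][A-Za-z][0-9])\Z')


def is_valid_postcode(code):
    """True when the code is already in the 'A1B 2C3' format."""
    return bool(POSTCODE_RE.match(code))


def normalize_postcode(code):
    """Return the code as 'A1B 2C3', or None when it cannot be repaired."""
    match = LOOSE_RE.match(code.strip())
    if match is None:
        return None
    return '{} {}'.format(match.group(1), match.group(2)).upper()


print(is_valid_postcode('T2P 1J9'))    # True
print(normalize_postcode('t2p1j9'))    # T2P 1J9
print(normalize_postcode('Calgary'))   # None
```

Values that come back as None cannot be fixed mechanically and would need manual review.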

#### Rural Roads

This dataset encompasses a very large area and includes large parts of the farmland around the City of Calgary. As a result there are several rural roads like 'Township Road 204A' which aren't strictly part of the city. Their inclusion did not cause significant issues, but they do not add much value to the dataset.

### Data Overview

##### File Sizes
calgary_canada.osm : 159 MB

calgary_canada.osm.json : 184 MB

##### Number of Documents
db.DANDP3.count() : 863038

###### See analyze.py for full code for the following

##### Largest Document
print get_largest_doc(get_all_docs(db)) : 112886 characters, 'Proposed West Stoney Trail'

##### Number of Unique Users
print len(get_users(docs)) : 767

##### Top Three Contributors
print users[0:3] : [{u'count': 309818, u'_id': u'sbrown'}, {u'count': 89769, u'_id': u'Zippanova'}, {u'count': 46296, u'_id': u'markbegbie'}]
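
analyze.py is not reproduced in this report. A contributor ranking with this output shape could be produced by an aggregation pipeline such as the one sketched below, assuming the user name is stored under 'created.user' as in the Lesson 6 document shape. The `top_users_pipeline` helper is hypothetical.

```python
def top_users_pipeline(limit=3):
    """Build a MongoDB aggregation pipeline that ranks contributors.
    The 'created.user' field path is an assumption about the documents."""
    return [
        {'$group': {'_id': '$created.user', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': limit},
    ]


# Usage against a running MongoDB:
# users = list(db.DANDP3.aggregate(top_users_pipeline()))
print(top_users_pipeline())
```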

##### Rank of My Contributions
print 'My Rank ' + mb_rank : 157

##### Number of Ways
print 'Number of Ways: ' + str(db.DANDP3.find({"node_type":"way"}).count()) : 83492

##### Number of Nodes
print 'Number of Nodes: ' + str(db.DANDP3.find({"node_type":"node"}).count()) : 779009

### Additional Ideas

#### Postal Code Standardization
Write code to validate all postal codes against Canada Post data as a gold standard.

#### Analyze User Contributions by Area

I thought it might be interesting to analyze which areas of the city users tend to contribute to. Are there patterns in a user's contributions that would indicate where that user lives or works? Are there more contributions in newer areas of the city, or are they evenly distributed? How much does the City of Calgary contribute? These are just a few questions that I would be interested in digging into further.