# Data Wrangling with MongoDB OpenStreetMap Project
Author: Brian Novak

## Map Area
Around Baton Rouge, LA, United States.

## Problems Encountered in the Map

### Find types of tags, particularly ones with problematic characters.

In [1]:
run tag_types.py

{'lower': 70020, 'lower_colon': 121529, 'other': 11680, 'problemchars': 0}

0 tags were found with problematic characters
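
The four categories in the output above can be reproduced with a few regular expressions. The following is a minimal sketch, not the actual contents of tag_types.py: the patterns and the helper names `key_type` and `count_key_types` are assumptions modeled on the course material.

```python
import re
from collections import Counter

# Assumed patterns; the ones used in tag_types.py may differ.
LOWER = re.compile(r'^([a-z]|_)*$')                      # e.g. "highway"
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')     # e.g. "addr:street"
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(key):
    """Classify a tag key into one of the four categories."""
    if LOWER.search(key):
        return 'lower'
    if LOWER_COLON.search(key):
        return 'lower_colon'
    if PROBLEMCHARS.search(key):
        return 'problemchars'
    return 'other'


def count_key_types(keys):
    """Count how many keys fall into each category."""
    counts = Counter({'lower': 0, 'lower_colon': 0, 'other': 0, 'problemchars': 0})
    for key in keys:
        counts[key_type(key)] += 1
    return dict(counts)


print(count_key_types(['highway', 'addr:street', 'tiger:name_base_1', 'bad key']))
# {'lower': 1, 'lower_colon': 1, 'other': 1, 'problemchars': 1}
```

Keys with digits or more than one colon fall into 'other' under these patterns.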


### Street Types

The code from lesson 6 of the course was used with some small modifications. It checks the last word of each street name, which is usually the street type, for abbreviations. Some additions were made to the expected street types and to the mapping dictionary. Street names whose last word could be represented as an integer were ignored, which skipped 'Highway 42' and 'Highway 44' in this data set.

In [3]:
run improving_street_names.py

There are 81 streets in the data set

Street names with abbreviated types:

Essen Ln => Essen Lane
Burbank Dr => Burbank Drive
Juban Rd => Juban Road
Jefferson Hwy => Jefferson Highway
East Parker Blvd => East Parker Boulevard
O'Donovan Blvd => O'Donovan Boulevard
Hazelwood dr => Hazelwood Drive

7 street name abbreviations were found
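
The approach can be sketched as follows. This is an illustration under stated assumptions rather than the code in improving_street_names.py: `EXPECTED` and `MAPPING` are small illustrative subsets, and `update_name` is a hypothetical helper name.

```python
import re

# Last whitespace-delimited word of the street name.
STREET_TYPE_RE = re.compile(r'\b\S+\.?$')

# Illustrative subsets only; the real lists are longer.
EXPECTED = {'Street', 'Avenue', 'Boulevard', 'Drive', 'Lane', 'Road', 'Highway'}
MAPPING = {'Ln': 'Lane', 'Dr': 'Drive', 'dr': 'Drive', 'Rd': 'Road',
           'Hwy': 'Highway', 'Blvd': 'Boulevard'}


def update_name(name):
    """Expand an abbreviated street type in the last word of a street name."""
    match = STREET_TYPE_RE.search(name)
    if match is None:
        return name
    street_type = match.group()
    # Names ending in a number (such as 'Highway 42') are left alone.
    if street_type.isdigit() or street_type in EXPECTED:
        return name
    key = street_type.rstrip('.')
    if key in MAPPING:
        return name[:match.start()] + MAPPING[key]
    return name


for street in ['Essen Ln', 'Hazelwood dr', 'Highway 42']:
    print(street, '=>', update_name(street))
```

Names whose last word is neither expected nor in the mapping are returned unchanged, so they can be reviewed by hand.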


### City Names

Since there were only a few cities in the data set, the names were checked manually, and none were found to be misspelled or abbreviated. If multiple data sets or larger areas were to be analyzed, it would be worth comparing the city names to a database such as GeoNames (<http://download.geonames.org/export/dump/cities1000.zip>). The cities in the data set were:

In [4]:
run find_city_names.py

Cities included in the data:

Baton Rouge
Livingston
Denham Springs
Central
Walker
Gonzales
Brusly
Port Allen


### Zip Codes

The zip codes in the addr:postcode fields were checked against the GeoNames zip code data (<http://download.geonames.org/export/zip/US.zip>), to make sure they were consistent with the city name. No problems were found in the OSM data, but I noticed that the city of Central was missing from the GeoNames data. Central officially became a separate city from Baton Rouge in 2005, but the zip codes in that area remained as 70837. This was not a problem for checking the OSM data since there was only one entry for Central in it and it did not contain the addr:postcode field. Central was also added to the GeoNames data file.

In [29]:
run check_zip_codes.py

No problems with zip codes
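
A check of this kind might look like the sketch below. It assumes the GeoNames postal code file is tab-separated with the postal code in the second column and the place name in the third. The function names and the two sample rows (truncated to five columns) are illustrative and are not taken from check_zip_codes.py.

```python
import csv
from collections import defaultdict


def load_zip_places(lines):
    """Map each postal code to the set of place names listed for it."""
    places = defaultdict(set)
    for row in csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(row) >= 3:
            places[row[1]].add(row[2])
    return places


def inconsistent(records, places):
    """Return the (city, postcode) pairs that the lookup table does not support."""
    return [(city, code) for city, code in records
            if city not in places.get(code, set())]


# Illustrative rows; with the real file, pass an open file object instead:
#   with open('US.txt', encoding='utf-8', newline='') as f:
#       places = load_zip_places(f)
sample = ['US\t70808\tBaton Rouge\tLouisiana\tLA',
          'US\t70726\tDenham Springs\tLouisiana\tLA']
places = load_zip_places(sample)
print(inconsistent([('Baton Rouge', '70808'), ('Walker', '70726')], places))
# [('Walker', '70726')]
```

An empty list corresponds to the "No problems with zip codes" message above.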


### County (parish) names

The GeoNames data also contains the names and id numbers of the counties (parishes, in the case of Louisiana). Since the OSM records containing city names sometimes also contain the county id number, the county id numbers in the OSM records were checked to make sure they were consistent with the city names. Records containing both a city name and a county id number seem to be rare; there were only two in this data set, and both were consistent.

In [30]:
run check_county_ids.py

No problems with county id numbers


### State name abbreviations

Although the state name abbreviations in the addr:state tag are unlikely to be incorrect, a quick check is easy to do. Instead of using Python, grep can be used to pull out the lines to check, and awk can be used to pull the state abbreviations out of those lines. The lack of any output below indicates that all of the state name abbreviations in the addr:state tags are correct.

In [37]:
%%bash

cat Baton_Rouge.osm | grep "addr:state" | awk '{print $3}' | awk -F= '{print $2}' | awk -F/ '{print $1}' \
                    | awk '$1 !~ "LA"'
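
For comparison, the same check could be done in Python with the standard library XML parser. This is a sketch, and `unexpected_states` is a hypothetical helper name. It compares values exactly, so it is slightly stricter than the awk pattern match above.

```python
import io
import xml.etree.ElementTree as ET


def unexpected_states(osm_source, expected='LA'):
    """Collect addr:state values that differ from the expected abbreviation."""
    found = []
    # iterparse streams the file, so the whole document is never held in memory.
    for _, elem in ET.iterparse(osm_source, events=('end',)):
        if elem.tag == 'tag' and elem.get('k') == 'addr:state':
            if elem.get('v') != expected:
                found.append(elem.get('v'))
        elem.clear()  # release the element once it has been inspected
    return found


# Small in-memory example; for the real data pass the path 'Baton_Rouge.osm'.
sample = (b'<osm><node id="1"><tag k="addr:state" v="LA"/></node>'
          b'<node id="2"><tag k="addr:state" v="La"/></node></osm>')
print(unexpected_states(io.BytesIO(sample)))  # ['La']
```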

### Consistency of Street Address with City and Latitude and Longitude 

Although the zip codes and cities are consistent in the OSM data, there is no guarantee that a street address is not actually in a different city with a different zip code. Likewise, although the data set was chosen by location, the latitude and longitude of a record may fall inside the region of interest and still be inconsistent with its street address. Neither of these was checked, but both could be checked using reverse geocoding (<http://www.geonames.org/export/web-services.html#findNearbyPlaceName>, <https://developers.google.com/maps/documentation/geocoding/intro?csw=1#ReverseGeocoding>).
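
A rough sketch of such a check against the GeoNames findNearbyPlaceName service is given below. It was not run as part of this project. The function names are hypothetical, a registered GeoNames username is required, and the response is assumed to be JSON with a "geonames" list whose entries carry a "name" field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEONAMES_URL = 'http://api.geonames.org/findNearbyPlaceNameJSON'


def build_url(lat, lon, username):
    """Build the request URL for a reverse geocoding lookup."""
    return GEONAMES_URL + '?' + urlencode({'lat': lat, 'lng': lon, 'username': username})


def nearest_place(lat, lon, username):
    """Return the name of the nearest populated place, or None if none is found."""
    with urlopen(build_url(lat, lon, username), timeout=10) as response:
        payload = json.load(response)
    places = payload.get('geonames', [])
    return places[0].get('name') if places else None


def city_matches(osm_city, lat, lon, username):
    """Compare the addr:city value of a record with the reverse geocoded place."""
    place = nearest_place(lat, lon, username)
    return place is not None and place.lower() == osm_city.lower()


# Illustrative coordinates near Baton Rouge; no request is sent here.
print(build_url(30.45, -91.15, 'demo'))
```

The nearest populated place is not always the city used in a postal address, so mismatches would still need manual review.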

## Data Overview

### Number of documents
                                                
> db.char.find().count()                     
1555851

### Number of nodes
                                                
> db.char.find({"type":"node"}).count()
1471349

### Number of ways
                                                
> db.char.find({"type":"way"}).count()
84502

### Number of unique users
                                                
> db.char.distinct("created.user").length
336

### Top 1 contributing user
                                                
> db.char.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":1}])
[ { "_id" : "jumbanho", "count" : 823324 } ]                

### Number of users appearing only once (having 1 post)

> db.char.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$group":{"_id":"$count", "num_users":{"$sum":1}}}, {"$sort":{"_id":1}}, {"$limit":1}])
[ {"_id":1,"num_users":56} ]

Here "_id" is the post count and "num_users" is the number of users with that many posts.
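
The two-stage grouping in the pipeline above can be mirrored in plain Python to make the logic easier to follow. This is a sketch on a small made-up list of documents, not a query against the database, and `users_per_post_count` is a hypothetical helper name.

```python
from collections import Counter


def users_per_post_count(documents):
    """Return a mapping from post count to the number of users with that count."""
    # Stage 1 ($group by created.user): number of documents per user.
    posts_per_user = Counter(doc['created']['user'] for doc in documents)
    # Stage 2 ($group by count): number of users per post count.
    return Counter(posts_per_user.values())


# Made-up users: 'b' and 'd' each appear once.
docs = [{'created': {'user': u}} for u in ['a', 'a', 'b', 'c', 'c', 'c', 'd']]
histogram = users_per_post_count(docs)

# $sort on _id ascending followed by $limit 1 keeps the smallest post count.
smallest = min(histogram)
print({'_id': smallest, 'num_users': histogram[smallest]})
# {'_id': 1, 'num_users': 2}
```

The pipeline returns the users with a single post only because 1 is the smallest post count present; if every user had at least two posts, it would report that larger count instead.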


### Other

## Additional Ideas or Observations

Addresses could be checked against a zip code database to make sure that at least the city and zip code are consistent.

Standards could be developed for different countries or regions. Entries that do not conform could then be changed either automatically or after prompting the user.