# Wrangle OpenStreetMap Data
## Christopher Burrow

## Area

Memphis, Tennessee

https://overpass-api.de/api/map?bbox=-90.0783,35.0688,-89.8654,35.1968

Memphis is the city I currently live in, so I thought it would be an interesting city to investigate.

## Initial investigation of the data

Before importing the data into the database, I ran some queries against it. The first issue I found was that some street names are abbreviated, such as Ave., Blvd., and others. These street types need to be corrected while parsing the XML file for import into the database. I wrote the `update_name` function, which is called during the import, to correct these abbreviations.

```python
# Expand the first abbreviation found in a street name (Python 2)
def update_name(name, mapping):
    for key, value in mapping.iteritems():
        if key in name:
            return name.replace(key, value)
    return name

# audit() and mapping are defined elsewhere in the project
with open('memphismap.xml', 'r') as mapfile:
    s_types = audit()

for s_type, ways in s_types.iteritems():
    for name in ways:
        correct_name = update_name(name, mapping)
        print name, "->", correct_name
```

Running this printed each audited street name next to its corrected form:

```python
Covington Pike -> Covington Pike
Mississippi -> Mississippi
Perkins Extended -> Perkins Extended
Jackson -> Jackson
Front -> Front
Avon Rd -> Avon Road 
E Brookhaven Cir -> E Brookhaven Circle 
Poplar -> Poplar
Ridge Lake Blvd -> Ridge Lake Boulevard
B.B. King Blvd -> B.B. King Boulevard
Lamar Ave -> Lamar Avenue 
Shadyac Ave -> Shadyac Avenue 
W G E Patterson Ave -> W G East Patterson Ave
Chelsea Ave -> Chelsea Avenue 
Central Ave -> Central Avenue 
Lynnfield Road Suite 236 -> Lynnfield Road Suite 236
Main -> Main
Clarke Rd. -> Clarke Road
```
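
The output above shows a limitation of substring matching: `W G E Patterson Ave` becomes `W G East Patterson Ave`, because `update_name` returns after the first key it finds and never reaches `Ave`. One possible alternative, sketched below for Python 3, only rewrites the last word of the name. The `STREET_TYPE_MAPPING` dictionary and the name `update_street_type` are hypothetical and are not part of the original project.

```python
import re

# Hypothetical mapping for illustration; the project's real mapping is not shown.
STREET_TYPE_MAPPING = {
    "Ave": "Avenue",
    "Blvd": "Boulevard",
    "Cir": "Circle",
    "Rd": "Road",
}

# Matches the final word of the name, with an optional trailing period.
LAST_WORD_RE = re.compile(r"(\S+?)\.?$")


def update_street_type(name, mapping=STREET_TYPE_MAPPING):
    """Expand an abbreviated street type only when it is the last word."""
    name = name.strip()
    match = LAST_WORD_RE.search(name)
    if match and match.group(1) in mapping:
        return name[:match.start()] + mapping[match.group(1)]
    return name


for street in ["W G E Patterson Ave", "Clarke Rd.", "Poplar"]:
    print(street, "->", update_street_type(street))
```

Restricting the match to the final word leaves directional prefixes such as `E` untouched, at the cost of missing abbreviations that appear mid-name.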

Another issue I discovered was that two ZIP codes were incomplete: each had only four digits instead of five. Using the same process I used to clean the street abbreviations, I printed the ZIP codes and cleaned up the two that had issues by setting them to 0.

```python
# Replace any zipcode shorter than five digits with the placeholder 0
def update_zipcode(zipcode):
    if len(str(zipcode)) < 5:
        zipcode = 0
    return zipcode

with open ('map.osm', 'r') as mapfile:
    s_types = audit_zip()
    
    for s_type, ways in s_types.iteritems():
        for name in ways:
            correct_name = update_zipcode(name)
            print name, "->", correct_name
```

```python
38111 -> 38111
38112 -> 38112
38114 -> 38114
38115 -> 38115
38117 -> 38117
38118 -> 38118
38119 -> 38119
38132 -> 38132
38134 -> 38134
3813 -> 0
3951 -> 0
38107 -> 38107
38106 -> 38106
38105 -> 38105
38104 -> 38104
38103 -> 38103
38109 -> 38109
38108 -> 38108
38128 -> 38128
38152 -> 38152
38163 -> 38163
38120 -> 38120
38122 -> 38122
38127 -> 38127
38126 -> 38126
```
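
Setting an invalid code to 0 keeps the column populated but mixes a placeholder with real values. As a hedged alternative, the Python 3 sketch below returns `None` for anything that is not a five-digit ZIP code, optionally followed by a ZIP+4 suffix. The name `clean_zipcode` and the choice of `None` are assumptions made for illustration, not what the project used.

```python
import re

# Five digits, optionally followed by a hyphen and four more (ZIP+4).
ZIP_RE = re.compile(r"^(\d{5})(?:-\d{4})?$")


def clean_zipcode(zipcode):
    """Return the five-digit ZIP code as a string, or None if it is malformed."""
    match = ZIP_RE.match(str(zipcode).strip())
    return match.group(1) if match else None


for raw in ["38111", "3813", "3951"]:
    print(raw, "->", clean_zipcode(raw))
```

Returning `None` lets the import step store a SQL `NULL`, which is easy to filter out in later queries.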

# Queries

## File Sizes

map.osm ..... 257 MB <br>
sample.osm ..... 26 MB <br>
map.db ..... 127 MB <br>
nodes.csv ..... 105 MB <br>
nodes_tags.csv ..... 0 MB <br>
ways.csv ..... 6 MB <br>
ways_tags.csv ..... 6 MB <br>
ways_nodes.csv ..... 35 MB <br>

## Number of Nodes and Ways

```python
#Number of nodes
query = cur.execute('SELECT COUNT(*) FROM nodes')
print query.fetchall()

#Number of ways
query = cur.execute('SELECT COUNT(*) FROM ways')
print query.fetchall()
```

1331083 Nodes <br>
118315 Ways <br>
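
The query snippets in this section use a cursor named `cur` without showing how it is created. A minimal sketch of that setup with Python 3's `sqlite3` module follows. It uses an in-memory database with a few made-up rows so that it runs on its own, and the table columns are simplified assumptions rather than the project's schema. For the real data, the connection would presumably point at `map.db` instead.

```python
import sqlite3

# In-memory database with toy rows; replace ":memory:" with the real file path.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, user TEXT, uid INTEGER)")
cur.execute("CREATE TABLE ways (id INTEGER PRIMARY KEY, user TEXT, uid INTEGER)")
cur.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                [(1, "alice", 10), (2, "alice", 10), (3, "bob", 20)])
cur.executemany("INSERT INTO ways VALUES (?, ?, ?)", [(100, "bob", 20)])


def count_rows(cursor, table):
    """Return the number of rows in one of the known tables."""
    if table not in ("nodes", "ways"):
        raise ValueError("unexpected table name: %s" % table)
    return cursor.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]


node_count = count_rows(cur, "nodes")
way_count = count_rows(cur, "ways")
print(node_count, way_count)
```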

Next, I looked at the counts of node tags by type. I was most interested in the `historic` type, since the downtown area of Memphis has lots of historic sites.

```python
# Count of node tags by type
query = cur.execute('SELECT type , COUNT(*) AS num FROM nodes_tags GROUP BY type ORDER BY num DESC;')
pprint.pprint(query.fetchall())
```

```python
[(u'regular', 13225),
 (u'gnis', 4120),
 (u'addr', 968),
 (u'brand', 134),
 (u'tower', 39),
 (u'ref', 15),
 (u'contact', 15),
 (u'service', 12),
 (u'communication', 11),
 (u'social_facility', 9),
 (u'operator', 9),
 (u'historic', 9),
 (u'was', 8),
 (u'payment', 8),
 (u'traffic_signals', 4),
 (u'source', 4),
 (u'name', 4),
 (u'socket', 3),
 (u'nrhp', 3),
 (u'healthcare', 3),
 (u'fuel', 3),
 (u'toilets', 2),
 (u'removed', 2),
 (u'railway', 2),
 (u'internet_access', 2),
 (u'demolished', 2),
 (u'heritage', 1),
 (u'disused', 1),
 (u'description', 1),
 (u'census', 1)]
```

## Number of Unique Users

```python
query = cur.execute('SELECT COUNT(distinct(uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)')
pprint.pprint(query.fetchone())
```

694 Unique Users

## User with the most submissions

```python
query = cur.execute('SELECT e.user, COUNT(*) AS num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) AS e GROUP BY user ORDER BY num DESC LIMIT 1;')
pprint.pprint(query.fetchall())
```

User: OSM901 <br>
Submissions: 1291600
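
To put that number in context, the node and way counts reported earlier add up to 1,449,398 rows, so OSM901 accounts for roughly 89% of all submissions. The arithmetic, as a small Python 3 sketch with illustrative variable names:

```python
# Counts reported by the queries in this document.
total_nodes = 1331083
total_ways = 118315
top_user_submissions = 1291600

# Fraction of all node and way rows attributed to the top user.
share = top_user_submissions / (total_nodes + total_ways)
print("OSM901 share of submissions: {:.1%}".format(share))
```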

## Number and Types of Religious Locations

Memphis is located in the Bible Belt of America, which means the population is fairly religious. I expected to find a large number of churches in the data.

```python
query= cur.execute("SELECT value, COUNT(*) AS num FROM (SELECT key,value FROM nodes_tags UNION ALL SELECT key,value FROM ways_tags) AS e WHERE key='religion' GROUP BY value ORDER BY num DESC;")
pprint.pprint(cur.fetchall())

[(u'christian', 724),
 (u'jewish', 7),
 (u'muslim', 5),
 (u'multifaith', 1),
 (u'hindu', 1)]
```
724 is quite a lot of churches. If you've ever driven around Memphis, you'll know that it's very hard to find a street without a church located nearby. 

## Types of Restaurants

When I first ran this query, I expected barbecue to be the most common restaurant type. I was surprised to find only a handful of results related to barbecue. As with churches, it's very common to find multiple barbecue restaurants on the same street in Memphis. I know from experience that the downtown area has at least 4 barbecue restaurants on Beale Street. I suspect that a lot of restaurants are not labeled correctly.

```python
query=cur.execute("SELECT value, COUNT(*) AS num FROM (SELECT key,value FROM nodes_tags UNION ALL SELECT key,value FROM ways_tags) AS e WHERE e.key LIKE '%cuisine%' GROUP BY value ORDER BY num desc;")
pprint.pprint(cur.fetchall())

[(u'burger', 18),
 (u'american', 12),
 (u'sandwich', 8),
 (u'mexican', 8),
 (u'chicken', 8),
 (u'pizza', 7),
 (u'coffee_shop', 7),
 (u'japanese', 4),
 (u'italian', 4),
 (u'barbecue', 4),
 (u'tex-mex', 3),
 (u'seafood', 3),
 (u'ice_cream', 3),
 (u'regional', 2),
 (u'chinese', 2),
 (u'asian', 2),
 (u'wings', 1),
 (u'vietnamese', 1),
 (u'thai', 1),
 (u'steak_house', 1),
 (u'southern;breakfast', 1),
 (u'pretzel', 1),
 (u'pizza;barbecue;steak;southern;breakfast;lunch', 1),
 (u'mediterranean;korean;sandwich', 1),
 (u'gastropub', 1),
 (u'donut', 1),
 (u'diner', 1),
 (u'cookies', 1),
 (u'coffee_shop;southern', 1),
 (u'coffee;tea', 1),
 (u'chinese;sushi', 1),
 (u'chinese;buffet', 1),
 (u'cake;bagel;coffee_shop', 1),
 (u'breakfast;pancake', 1),
 (u'breakfast;coffee_shop', 1),
 (u'breakfast', 1),
 (u'bar;hotdogs', 1),
 (u'arab', 1),
 (u'american;steak', 1),
 (u'african', 1),
 (u'Club_and_Southern_Food', 1),
 (u'Bar_and_Pub_food', 1),
 (u'BBQ', 1)]
```

## Types of Amenities

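When looking at the types of amenities, we can see that `place_of_worship` tops the list at 688. This number doesn't match the totals from the query I ran before on religious locations. One likely reason is that the religion query counted tags from both `nodes_tags` and `ways_tags`, while this query only looks at `nodes_tags`, so places of worship mapped as ways are left out. It is also possible that some node amenities are not labeled correctly in the data or are missing designations.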

```python
query=cur.execute("SELECT value, COUNT(*) AS num FROM nodes_tags WHERE key='amenity' GROUP BY value ORDER BY num DESC;")
pprint.pprint(cur.fetchall())

[(u'place_of_worship', 688),
 (u'school', 171),
 (u'restaurant', 87),
 (u'bicycle_rental', 70),
 (u'fast_food', 22),
 (u'bar', 21),
 (u'fuel', 20),
 (u'library', 15),
 (u'cafe', 13),
 (u'post_office', 10),
 (u'social_facility', 9),
 (u'grave_yard', 8),
 (u'parking', 7),
 (u'vending_machine', 6),
 (u'toilets', 6),
 (u'theatre', 6),
 (u'pharmacy', 6),
 (u'bench', 6),
 (u'pub', 5),
 (u'kindergarten', 5),
 (u'fountain', 4),
 (u'clinic', 4),
 (u'car_rental', 4),
 (u'bicycle_repair_station', 4),
 (u'fire_station', 3),
 (u'doctors', 3),
 (u'community_centre', 3),
 (u'bank', 3),
 (u'atm', 3),
 (u'waste_basket', 2),
 (u'university', 2),
 (u'research_institute', 2),
 (u'public_building', 2),
 (u'police', 2),
 (u'nursing_home', 2),
 (u'nightclub', 2),
 (u'ice_cream', 2),
 (u'hospital', 2),
 (u'courthouse', 2),
 (u'college', 2),
 (u'charging_station', 2),
 (u'car_wash', 2),
 (u'bicycle_parking', 2),
 (u'veterinary', 1),
 (u'shelter', 1),
 (u'prison', 1),
 (u'marketplace', 1),
 (u'dentist', 1),
 (u'clock', 1),
 (u'childcare', 1),
 (u'bus_station', 1),
 (u'bbq', 1)]
```
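
To compare like with like, the amenity count can be taken over both tag tables, the same way the religion query does. The Python 3 sketch below shows the shape of such a query against an in-memory database. The rows are made up and the three-column tables are a simplifying assumption, so the printed counts say nothing about the Memphis data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tag tables with made-up rows (assumed columns: id, key, value).
for table in ("nodes_tags", "ways_tags"):
    cur.execute("CREATE TABLE %s (id INTEGER, key TEXT, value TEXT)" % table)
cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?)",
                [(1, "amenity", "place_of_worship"), (2, "amenity", "school")])
cur.executemany("INSERT INTO ways_tags VALUES (?, ?, ?)",
                [(10, "amenity", "place_of_worship"), (11, "amenity", "parking")])

# Same pattern as the religion query: union both tag tables, then filter by key.
AMENITY_QUERY = """
SELECT value, COUNT(*) AS num
FROM (SELECT key, value FROM nodes_tags
      UNION ALL
      SELECT key, value FROM ways_tags) AS e
WHERE key = 'amenity'
GROUP BY value
ORDER BY num DESC, value;
"""

amenity_counts = cur.execute(AMENITY_QUERY).fetchall()
print(amenity_counts)
```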

# Improvements that could be made

## Religious Locations
Since Christianity is divided into many denominations, it would be interesting to see the counts for each one. I would be very interested in analyzing denomination alongside information about the surrounding area, such as average income, crime rate, and housing prices.
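
OpenStreetMap has a `denomination` key that is commonly used alongside `religion`, so a first pass at these counts could be a self-join on the tag table. The Python 3 sketch below runs against an in-memory database with made-up rows. The simplified `nodes_tags` columns and the `'unknown'` label are assumptions, and I have not checked how many Memphis entries actually carry a `denomination` tag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?)", [
    (1, "religion", "christian"), (1, "denomination", "baptist"),
    (2, "religion", "christian"), (2, "denomination", "baptist"),
    (3, "religion", "christian"), (3, "denomination", "methodist"),
    (4, "religion", "christian"),  # no denomination tag
    (5, "religion", "jewish"), (5, "denomination", "reform"),
])

# Pair each Christian location with its denomination tag, if it has one.
DENOMINATION_QUERY = """
SELECT COALESCE(d.value, 'unknown') AS denomination, COUNT(*) AS num
FROM nodes_tags AS r
LEFT JOIN nodes_tags AS d
       ON d.id = r.id AND d.key = 'denomination'
WHERE r.key = 'religion' AND r.value = 'christian'
GROUP BY denomination
ORDER BY num DESC, denomination;
"""

denominations = cur.execute(DENOMINATION_QUERY).fetchall()
print(denominations)
```

The `LEFT JOIN` keeps locations that have no denomination tag, which makes the amount of missing data visible instead of silently dropping it.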

## Restaurants
Some of the restaurants need to be split into multiple categories or labeled correctly. These designations could be stored in multiple columns for each entry.
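
The cuisine results contain semicolon-separated values such as `pizza;barbecue;steak;southern;breakfast;lunch`, as well as variants such as `BBQ`. A minimal Python 3 sketch of splitting and normalizing these values before counting is shown below. The function names and the `CUISINE_ALIASES` table are hypothetical, and the sample rows are a small subset of the query output.

```python
from collections import Counter

# Hypothetical alias table; only the BBQ -> barbecue entry is motivated by the data.
CUISINE_ALIASES = {"bbq": "barbecue"}


def split_cuisines(value):
    """Split a semicolon-separated cuisine value into normalized labels."""
    labels = []
    for part in value.split(";"):
        label = part.strip().lower()
        if label:
            labels.append(CUISINE_ALIASES.get(label, label))
    return labels


def count_cuisines(rows):
    """Re-count (value, count) rows so each label is counted on its own."""
    totals = Counter()
    for value, num in rows:
        for label in split_cuisines(value):
            totals[label] += num
    return totals


sample_rows = [
    ("barbecue", 4),
    ("pizza;barbecue;steak;southern;breakfast;lunch", 1),
    ("BBQ", 1),
    ("pizza", 7),
]
totals = count_cuisines(sample_rows)
print(totals.most_common(3))
```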

## Validating data for user submissions
Since I suspect that some data is missing from the nodes, it would be helpful if OpenStreetMap validated user submissions to verify that a restaurant type is set and that street names do not use abbreviations. This would make queries easier and less prone to error.
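
As a rough illustration of what such a check could look like, the Python 3 sketch below inspects the tags of a single element and reports problems. It is not how OpenStreetMap validates anything. The function name `validate_tags`, the abbreviation list, and the two rules are assumptions made for this example, while `amenity`, `cuisine`, and `addr:street` are standard OSM keys.

```python
# Abbreviations to flag; an illustrative subset, not an exhaustive list.
ABBREVIATIONS = {"Ave", "Blvd", "Cir", "Rd", "St"}


def validate_tags(tags):
    """Return a list of human-readable problems found in one element's tags."""
    problems = []
    # Rule 1: a restaurant should say what kind of food it serves.
    if tags.get("amenity") == "restaurant" and not tags.get("cuisine"):
        problems.append("restaurant has no cuisine tag")
    # Rule 2: the street name should not end in an abbreviated street type.
    street = tags.get("addr:street", "")
    words = street.split()
    if words and words[-1].rstrip(".") in ABBREVIATIONS:
        problems.append("street name ends in an abbreviation: %s" % street)
    return problems


print(validate_tags({"amenity": "restaurant", "addr:street": "Lamar Ave"}))
```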