# OpenStreetMap Data Wrangling

### Map

Memphis, TN

I currently live in Memphis, so I thought it would be an interesting city to investigate.

<img src = "image.png">

## Initial Investigation of the Data

Before importing the data into the database, I ran some queries against it. The first issue I found was that some street names are abbreviated, such as Ave, Blvd, and others. Several entries were missing their street types entirely.

```python

defaultdict(set,
            {'236': {'Lynnfield Road Suite 236'},
             'Ave': {'Central Ave',
              'Chelsea Ave',
              'Lamar Ave',
              'Shadyac Ave',
              'W G E Patterson Ave'},
             'Blvd': {'B.B. King Blvd', 'Ridge Lake Blvd'},
             'Cir': {'E Brookhaven Cir'},
             'Extended': {'Perkins Extended'},
             'Front': {'Front'},
             'Jackson': {'Jackson'},
             'Main': {'Main'},
             'Mississippi': {'Mississippi'},
             'Pike': {'Covington Pike'},
             'Poplar': {'Poplar'},
             'Rd': {'Avon Rd'},
             'Rd.': {'Clarke Rd.'}})

```
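
The audit code itself is not shown here. As a rough sketch, a grouping like the one above could be produced by keying each street name on its final token. The names `STREET_TYPE_RE`, `EXPECTED`, and `audit_street_types` are illustrative assumptions, not the project's actual code.

```python
import re
from collections import defaultdict

# Matches the last whitespace-delimited token of a name, e.g. "Ave" or "Rd."
STREET_TYPE_RE = re.compile(r"\S+\.?$")

# Assumed list of acceptable street types; the project's real list is not shown.
EXPECTED = {"Avenue", "Boulevard", "Circle", "Drive", "Road", "Street"}


def audit_street_types(street_names, expected=EXPECTED):
    """Group street names by their final token when that token is unexpected."""
    street_types = defaultdict(set)
    for name in street_names:
        match = STREET_TYPE_RE.search(name.strip())
        if match and match.group() not in expected:
            street_types[match.group()].add(name)
    return street_types


result = audit_street_types(["Central Ave", "Clarke Rd.", "Poplar", "Union Avenue"])
```

Names that end in an expected type, such as `Union Avenue`, are left out of the result, so only candidates for cleanup remain.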

The street types need to be corrected while parsing the XML file for import into the database. I wrote the `update_name` function, which is called during import, to expand these abbreviations.

```python
def update_name(name, mapping):
    # Replace the first abbreviation from `mapping` that appears in `name`.
    for key, value in mapping.iteritems():
        if key in name:
            return name.replace(key, value)
    return name

# `audit` and `mapping` are defined elsewhere in the project.
with open('memphismap.xml', 'r') as mapfile:
    s_types = audit()

for s_type, ways in s_types.iteritems():
    for name in ways:
        correct_name = update_name(name, mapping)
        print name, "->", correct_name

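This loop printed each audited street name alongside its updated form. Note that `W G E Patterson Ave` became `W G East Patterson Ave`: `update_name` returns after the first matching key, so `Ave` was left unexpanded.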

```python
Covington Pike -> Covington Pike
Mississippi -> Mississippi
Perkins Extended -> Perkins Extended
Jackson -> Jackson
Front -> Front
Avon Rd -> Avon Road 
E Brookhaven Cir -> E Brookhaven Circle 
Poplar -> Poplar
Ridge Lake Blvd -> Ridge Lake Boulevard
B.B. King Blvd -> B.B. King Boulevard
Lamar Ave -> Lamar Avenue 
Shadyac Ave -> Shadyac Avenue 
W G E Patterson Ave -> W G East Patterson Ave
Chelsea Ave -> Chelsea Avenue 
Central Ave -> Central Avenue 
Lynnfield Road Suite 236 -> Lynnfield Road Suite 236
Main -> Main
Clarke Rd. -> Clarke Road
```
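
One way to avoid partial corrections such as `W G East Patterson Ave` is to expand abbreviations token by token instead of returning on the first substring match. The following is a minimal sketch, not the function used in this project. `MAPPING` is a hypothetical stand-in for the real mapping, which is not shown, and `update_name_by_token` is an illustrative name.

```python
# Hypothetical mapping for illustration; the project's real mapping is not shown.
MAPPING = {
    "Ave": "Avenue",
    "Blvd": "Boulevard",
    "Cir": "Circle",
    "Rd": "Road",
    "E": "East",
}


def update_name_by_token(name, mapping):
    """Expand abbreviations one whitespace-separated token at a time."""
    tokens = []
    for token in name.split():
        # Ignore a trailing period when looking up the token ("Rd." -> "Rd").
        key = token.rstrip(".")
        tokens.append(mapping.get(key, token))
    return " ".join(tokens)


fixed = update_name_by_token("W G E Patterson Ave", MAPPING)
```

Because only whole tokens are looked up, a key such as `E` cannot match inside a longer word, and every abbreviation in a name is expanded rather than only the first one found.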

Another issue I discovered was that two ZIP codes were incomplete. Each was one digit short.

```python
defaultdict(set,
            {'38103': {'38103'},
             '38104': {'38104'},
             '38105': {'38105'},
             '38106': {'38106'},
             '38107': {'38107'},
             '38108': {'38108'},
             '38109': {'38109'},
             '38111': {'38111'},
             '38112': {'38112'},
             '38114': {'38114'},
             '38115': {'38115'},
             '38117': {'38117'},
             '38118': {'38118'},
             '38119': {'38119'},
             '38120': {'38120'},
             '38122': {'38122'},
             '38126': {'38126'},
             '38128': {'38128'},
             '3813': {'3813'},
             '38134': {'38134'},
             '38152': {'38152'},
             '38163': {'38163'},
             '3951': {'3951'}})
```

A short function, applied later during import, sets those two to `00000`, since we do not know which digit is missing from each ZIP code.

```python
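def update_zipcode(zipcode):
    # ZIP codes shorter than five characters are incomplete. Use a string
    # placeholder so the leading zeros are preserved.
    if len(str(zipcode)) < 5:
        zipcode = '00000'
    return zipcode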
```
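
A length check alone would let through values that are long enough but not numeric. A slightly stricter variant, sketched here under the assumption that only plain five-digit codes should be kept, could use a regular expression. `clean_zipcode`, `FIVE_DIGIT_ZIP`, and `PLACEHOLDER` are illustrative names.

```python
import re

FIVE_DIGIT_ZIP = re.compile(r"^\d{5}$")
PLACEHOLDER = "00000"


def clean_zipcode(zipcode):
    """Return a five-digit ZIP code string, or the placeholder if malformed."""
    text = str(zipcode).strip()
    if FIVE_DIGIT_ZIP.match(text):
        return text
    return PLACEHOLDER


cleaned = [clean_zipcode(z) for z in ["38103", "3813", "3951"]]
```

Under this assumption, anything that is not exactly five digits is replaced, including ZIP+4 values, so the pattern would need loosening if those should be preserved.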

# Data Overview

## File Sizes
memphismap.xml ..... 247 MB <br>
memphis.db ..... 122 MB <br>
nodes.csv ..... 101 MB <br>
nodes_tags.csv ..... 0 MB <br>
ways.csv ..... 6 MB <br>
ways_tags.csv ..... 6 MB <br>
ways_nodes.csv ..... 33 MB <br>
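
For reference, a listing like this can be generated with `os.path.getsize`. This is a small sketch that assumes the files sit in the working directory and that 1 MB means 10^6 bytes. `size_in_mb` and `report_sizes` are illustrative helpers.

```python
import os


def size_in_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1e6


def report_sizes(paths):
    """Build one 'name ..... N MB' line for each file that exists."""
    lines = []
    for path in paths:
        if os.path.exists(path):
            name = os.path.basename(path)
            lines.append("{} ..... {:.0f} MB".format(name, size_in_mb(path)))
    return lines


for line in report_sizes(["memphismap.xml", "memphis.db", "nodes.csv"]):
    print(line)
```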

## Number of Nodes and Ways

```python
query = cur.execute('SELECT COUNT(*) FROM nodes')
print query.fetchall()

query = cur.execute('SELECT COUNT(*) FROM ways')
print query.fetchall()
```

Nodes: 1278477 <br>
Ways: 113652

## Number of Unique Users

```python
query = cur.execute('SELECT COUNT(distinct(uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)')
pprint.pprint(query.fetchone())
```

676
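
The queries in this section assume a cursor named `cur` on `memphis.db`. The setup is not shown. The sketch below uses a tiny in-memory database with made-up rows and a simplified schema (both assumptions) to illustrate how the cursor might be created and how the unique-user query behaves.

```python
import sqlite3

# Tiny in-memory stand-in for memphis.db; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE nodes (id INTEGER, uid INTEGER, user TEXT)")
cur.execute("CREATE TABLE ways (id INTEGER, uid INTEGER, user TEXT)")
cur.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                [(1, 10, "alice"), (2, 10, "alice"), (3, 20, "bob")])
cur.executemany("INSERT INTO ways VALUES (?, ?, ?)",
                [(100, 20, "bob"), (101, 30, "carol")])

# UNION ALL keeps every row from both tables; DISTINCT then collapses repeats,
# so a user who edited both nodes and ways is counted once.
cur.execute("SELECT COUNT(DISTINCT uid) FROM "
            "(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)")
unique_users = cur.fetchone()[0]
conn.close()
```

With the real data, `sqlite3.connect('memphis.db')` would replace the in-memory connection and the `CREATE`/`INSERT` statements would be unnecessary.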

## User with the most submissions
```python
query = cur.execute('SELECT e.user, COUNT(*) AS num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) AS e GROUP BY user ORDER BY num DESC LIMIT 1;')
pprint.pprint(query.fetchall())
```
User: OSM901 <br>
Submissions: 1239829

## Number and type of religious locations
```python
query= cur.execute("SELECT value, COUNT(*) AS num FROM (SELECT key,value FROM nodes_tags UNION ALL SELECT key,value FROM ways_tags) AS e WHERE key='religion' GROUP BY value ORDER BY num DESC;")
pprint.pprint(cur.fetchall())
```

Christian: 720 <br>
Jewish: 7 <br>
Muslim: 5 <br>
Multifaith: 1 <br>
Hindu: 1 <br>

Since Memphis is within the 'Bible Belt', it's not surprising to see a large number of Christian churches.

## Types of restaurants
```python
query=cur.execute("SELECT value, COUNT(*) AS num FROM (SELECT key,value FROM nodes_tags UNION ALL SELECT key,value FROM ways_tags) AS e WHERE e.key LIKE '%cuisine%' GROUP BY value ORDER BY num desc;")
pprint.pprint(cur.fetchall())

[(u'burger', 18),
 (u'american', 12),
 (u'sandwich', 8),
 (u'mexican', 8),
 (u'chicken', 8),
 (u'pizza', 7),
 (u'coffee_shop', 7),
 (u'japanese', 4),
 (u'italian', 4),
 (u'barbecue', 4),
 (u'tex-mex', 3),
 (u'seafood', 3),
 (u'ice_cream', 3),
 (u'regional', 2),
 (u'chinese', 2),
 (u'asian', 2),
 (u'wings', 1),
 (u'vietnamese', 1),
 (u'thai', 1),
 (u'steak_house', 1),
 (u'southern;breakfast', 1),
 (u'pretzel', 1),
 (u'pizza;barbecue;steak;southern;breakfast;lunch', 1),
 (u'mediterranean;korean;sandwich', 1),
 (u'gastropub', 1),
 (u'donut', 1),
 (u'diner', 1),
 (u'cookies', 1),
 (u'coffee_shop;southern', 1),
 (u'coffee;tea', 1),
 (u'chinese;sushi', 1),
 (u'chinese;buffet', 1),
 (u'cake;bagel;coffee_shop', 1),
 (u'breakfast;pancake', 1),
 (u'breakfast;coffee_shop', 1),
 (u'breakfast', 1),
 (u'bar;hotdogs', 1),
 (u'arab', 1),
 (u'american;steak', 1),
 (u'african', 1),
 (u'Club_and_Southern_Food', 1),
 (u'Bar_and_Pub_food', 1),
 (u'BBQ', 1)]

```

This was a pretty interesting query because only 6 restaurants are tagged as barbecue (counting the `BBQ` entry and the one multi-valued entry). If you've ever been to Memphis, you know that you can't go half a block without finding a barbecue restaurant. I suspect that some restaurants are either labeled incorrectly or missing a cuisine tag altogether.

# Improvements that could be made

## Religious locations
Since Christianity is broken into many denominations, it would be interesting to see which denomination is tied to each location. I'd also be interested to see where these denominations are located in relation to the average wealth of the surrounding area.

## Restaurants

Some of the restaurants had multiple designations for cuisine type. These would be better broken out into separate columns so that each combination isn't counted as its own unique type.
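
As a rough illustration of the idea, the multi-valued entries can be split on `;` and counted individually. Note that this splits them into separate values rather than separate columns. The `ALIASES` table and the `count_cuisines` helper are assumptions for this sketch, and the sample rows are copied from the cuisine query output above.

```python
from collections import Counter

# Assumed spelling variants to fold together; extend as needed.
ALIASES = {"bbq": "barbecue"}


def count_cuisines(rows):
    """Count cuisines individually from (value, count) rows with ';'-separated values."""
    totals = Counter()
    for value, count in rows:
        for cuisine in value.split(";"):
            cuisine = cuisine.strip().lower()
            if cuisine:
                totals[ALIASES.get(cuisine, cuisine)] += count
    return totals


# A few rows copied from the cuisine query output.
sample = [
    ("barbecue", 4),
    ("pizza", 7),
    ("pizza;barbecue;steak;southern;breakfast;lunch", 1),
    ("BBQ", 1),
]
totals = count_cuisines(sample)
```

On these sample rows, the three barbecue variants are counted under a single `barbecue` key instead of three separate types.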

## Validating data for user submissions

A validation process could be put in place to prevent users from submitting street names that contain abbreviations, which would make querying the data easier and less prone to error.