# Project 3: Wrangle OpenStreetMap Data

### Map Area: Atlanta, GA (https://mapzen.com/data/metro-extracts/metro/atlanta_georgia/)

#### I picked Atlanta because it is the city I currently live in.

In [7]:
#create a dictionary with each tag name as the key and its number of occurrences as the value
import xml.etree.ElementTree as ET

def count_tags(filename):
    tags = {}
    for _, elem in ET.iterparse(filename):
        #dict.get avoids a separate membership test on tags.keys()
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags
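
The same tally can also be written with `collections.Counter`. The sketch below is only an illustration: the function name `count_tags_counter` and the tiny in-memory XML sample are made up so the example can run without the Atlanta extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature OSM document, used only so the sketch is runnable.
SAMPLE = b"""<osm>
  <node id="1" lat="0" lon="0"><tag k="name" v="a"/></node>
  <node id="2" lat="0" lon="0"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

def count_tags_counter(source):
    """Count how often each element name occurs; source is a filename or file object."""
    counts = Counter()
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
    return dict(counts)

print(count_tags_counter(io.BytesIO(SAMPLE)))
```

`ET.iterparse` accepts either a filename or a file object, so the same function should work unchanged on the real `.osm` file.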

In [None]:
#check the 'k' value of each tag for potential problem characters
#(lower, lower_colon and problemchars are compiled regexes assumed to be defined in an earlier cell)
import re

def key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k']
        if re.search(lower, k):
            keys["lower"] += 1
        elif re.search(lower_colon, k):
            keys["lower_colon"] += 1
        elif re.search(problemchars, k):
            keys["problemchars"] += 1
            #print(k) with a single argument works under both Python 2 and 3
            print(k)
        else:
            keys["other"] += 1
    return keys

#create a dictionary with 4 tag categories as the key and how many as the value
def process_map_v1(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys
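
To make the four categories concrete, here is a minimal sketch that classifies a few sample keys. The three regular expressions are not shown in this write-up, so the patterns below are assumptions, and `classify_key` is a hypothetical helper that mirrors the branch order of `key_type`.

```python
import re

# Assumed patterns; the actual definitions are not part of this write-up.
lower = re.compile(r'^([a-z]|_)*$')                      # only lowercase letters and underscores
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')     # lowercase with exactly one colon
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')  # characters that are awkward in keys

def classify_key(k):
    """Return the name of the counter that key_type would increment for key k."""
    if lower.search(k):
        return "lower"
    elif lower_colon.search(k):
        return "lower_colon"
    elif problemchars.search(k):
        return "problemchars"
    return "other"

# Made-up sample keys, one per category.
for sample in ["highway", "addr:street", "step.count", "FIXME"]:
    print("{} -> {}".format(sample, classify_key(sample)))
```

The order of the checks matters: a key is only tested for problem characters after it has failed both of the "clean" patterns.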

In [1]:
#collect the unique ids of the users that have edited the map data
def process_map_v2(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        #node, way and relation elements carry the 'uid' attribute themselves
        if 'uid' in element.attrib:
            users.add(element.attrib['uid'])
    return users

## Problems Encountered

#### Street abbreviations:

When looking through the output of the audit_1 and audit_2 functions, I saw that several street names are abbreviated, for example Ave, Blvd, Cir, Ct and Dr. There are also inconsistently cased or misspelled street types, for example COurt, blvd, circle, dr, drive, lane and place.

These can be corrected with a mapping dictionary and the update_name and test functions.
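
As a rough illustration of the mapping approach, the sketch below uses a small subset of a mapping dictionary. The regular expression `street_type_re`, the helper `apply_mapping` (a simplified stand-in for update_name) and the street names are all assumptions made for the example.

```python
import re

# Assumed pattern: the last whitespace-delimited token of the street name.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Illustrative subset; a real mapping would cover every variant found by the audit.
mapping = {
    "Ave": "Avenue",
    "Blvd": "Boulevard",
    "blvd": "Boulevard",
    "Cir": "Circle",
    "circle": "Circle",
    "Ct": "Court",
    "COurt": "Court",
    "Dr": "Drive",
    "dr": "Drive",
    "drive": "Drive",
    "lane": "Lane",
    "place": "Place",
}

def apply_mapping(name, mapping):
    """Replace the trailing street type with its full form when it is in the mapping."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name

# Made-up street names for demonstration.
for name in ["Example Ave", "Sample COurt", "Main Street"]:
    print(name + " => " + apply_mapping(name, mapping))
```

Names whose street type is not in the mapping, such as "Main Street" here, pass through unchanged.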




#### Invalid zip codes:

When looking through the zip code data I noticed 2 records that had invalid zip codes. All zip codes in the Atlanta area start with '30', so the invalid ones were easy to identify with the query below.

`SELECT id, value
FROM nodes_tags
WHERE key LIKE '%postcode%' AND value NOT LIKE '30%';`

`2352501668 | 80083              
3121340792 | Georgia`

id '2352501668' is 4997 Saxony Court, Stone Mountain, GA 80083. The correct zip code for this address is 30083. This can be fixed in the database with an UPDATE statement.

`UPDATE nodes_tags        
SET VALUE = '30083'            
WHERE id = '2352501668' AND key LIKE '%postcode%';`

id '3121340792' is 1420 Cresthaven Lane NW, Lawrenceville, Georgia, with 'Georgia' entered as the zip code. The correct zip code for this address is 30043. This can also be fixed in the database with an UPDATE statement.

`UPDATE nodes_tags        
SET VALUE = '30043'            
WHERE id = '3121340792' AND key LIKE '%postcode%';`
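
The audit query and both corrections can also be run from Python with the standard `sqlite3` module. This is a sketch under assumptions: it uses an in-memory database in place of project3.db, an assumed minimal `nodes_tags` schema, and one made-up valid row for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for project3.db
cur = conn.cursor()

# Assumed minimal schema for the nodes_tags table.
cur.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")
cur.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
    [
        (2352501668, "postcode", "80083", "addr"),
        (3121340792, "postcode", "Georgia", "addr"),
        (1, "postcode", "30303", "addr"),  # made-up valid row
    ],
)

AUDIT = ("SELECT id, value FROM nodes_tags "
         "WHERE key LIKE '%postcode%' AND value NOT LIKE '30%' ORDER BY id")
before = cur.execute(AUDIT).fetchall()

# Corrections taken from the two records discussed above.
fixes = {2352501668: "30083", 3121340792: "30043"}
for node_id, zip_code in fixes.items():
    # Placeholders (?) keep the values out of the SQL string itself.
    cur.execute(
        "UPDATE nodes_tags SET value = ? WHERE id = ? AND key LIKE '%postcode%'",
        (zip_code, node_id),
    )
conn.commit()

after = cur.execute(AUDIT).fetchall()
print(before)
print(after)
```

Re-running the audit query after the updates is a quick way to confirm that no invalid zip codes remain.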

In [2]:
#updating street names per the mapping dictionary
def update_name(name, mapping):
    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            #replace only the matched span; re.sub would treat the street type as a
            #regex (so '.' matches any character) and replace every occurrence in the name
            name = name[:m.start()] + mapping[street_type] + name[m.end():]
    return name

def test():
    st_types = audit_2(osmfile)
    #items() and single-argument print() work under both Python 2 and 3
    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name, mapping)
            print(name + " => " + better_name)

In [None]:
#shapes the iterparse element object and returns the dictionary
def shape_element_2(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""
    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  #handle secondary tags the same way for both node and way elements

    if element.tag == 'node':
        for a in node_attr_fields:
            node_attribs[a] = element.attrib[a]

    if element.tag == 'way':
        for b in way_attr_fields:
            way_attribs[b] = element.attrib[b]

    for tag in element.iter('tag'):
        key = tag.attrib['k']
        #skip tags whose key contains problematic characters
        if problem_chars.search(key):
            continue
        tag_dict = {'id': element.attrib['id'], 'value': tag.attrib['v']}

        if key == 'addr:street':
            #update street names
            tag_dict['value'] = update_name(tag.attrib['v'], mapping)

        if LOWER_COLON.search(key):
            #split on the first colon only: 'addr:street' -> type 'addr', key 'street'
            #(re.findall returned lists here, which are not valid field values)
            tag_dict['type'], tag_dict['key'] = key.split(':', 1)
        else:
            tag_dict['key'] = key
            tag_dict['type'] = default_tag_type
        tags.append(tag_dict)

    if element.tag == 'way':
        #position is the 0-based index of the node within the way
        for position, nd in enumerate(element.iter('nd')):
            way_nodes.append({'id': element.attrib['id'],
                              'node_id': nd.attrib['ref'],
                              'position': position})

    if element.tag == 'node':
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}

# Data Overview
### Statistics from the Atlanta OpenStreetMap dataset

`sample.osm:          248 MB                                
project3.db:         146 MB                                      
nodes.csv:           97 MB                            
nodes_tags.csv:      8 MB            
ways.csv:            5 MB             
ways_nodes.csv:      31 MB                  
ways_tags.csv:       17 MB`
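
File sizes like these can be collected with a few lines of Python. The helper below is a hypothetical sketch: `file_sizes_mb` is not part of the project code, it assumes the files sit in the working directory, and it takes 1 MB to be 1024 * 1024 bytes, which may differ slightly from how the figures above were measured.

```python
import os

def file_sizes_mb(paths):
    """Return {path: size in MB} for the paths that exist (1 MB = 1024 * 1024 bytes)."""
    sizes = {}
    for path in paths:
        if os.path.exists(path):
            sizes[path] = os.path.getsize(path) / (1024.0 * 1024.0)
    return sizes

# File names as listed in the statistics above.
FILES = ["sample.osm", "project3.db", "nodes.csv", "nodes_tags.csv",
         "ways.csv", "ways_nodes.csv", "ways_tags.csv"]

for path, mb in sorted(file_sizes_mb(FILES).items()):
    print("{:<16} {:.0f} MB".format(path + ":", mb))
```

Files that are missing are skipped rather than raising an error, so the loop prints nothing when run outside the project directory.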

#### Number of unique users:

``` sql
SELECT COUNT(DISTINCT(e.uid))                   
FROM (SELECT uid FROM nodes 
      UNION ALL 
      SELECT uid FROM ways) e;
```

1381

#### Number of nodes:

```sql
SELECT COUNT(*) FROM nodes;
```

1171609

#### Number of ways:

``` sql
SELECT COUNT(*) FROM ways;
```

84260

#### Top 5 contributing users:

```sql
SELECT u.user, COUNT(*) as num         
FROM (SELECT user FROM nodes 
      UNION ALL 
      SELECT user FROM ways) u           
GROUP BY u.user         
ORDER BY num DESC          
LIMIT 5;
```


`Liber                         | 532818                   
Saikrishna_FultonCountyImport | 241015                    
woodpeck_fixbot               | 148812                   
Jack the Ripper               | 35026                
afonit                        | 33632                       `

#### Top 5 types of 'natural' key:

```sql
SELECT tags.value, COUNT(*) as count               
FROM (SELECT * FROM nodes_tags               
      UNION ALL              
      SELECT * FROM ways_tags) tags            
WHERE tags.key = 'natural'                                 
GROUP BY tags.value              
ORDER BY count DESC            
LIMIT 5;      
```

`water   | 3180                
wood    | 810        
tree    | 561            
wetland | 137        
peak    | 25`


#### Most popular religions:

```sql
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='religion'
GROUP BY value
ORDER BY num DESC;
```      

`christian | 382                     
muslim    | 1`

#### Top 5 cuisines:

```sql
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='cuisine'
GROUP BY value
ORDER BY num DESC
LIMIT 5;
```


`burger   | 16
pizza    | 11
mexican  | 9
chinese  | 7
american | 6`



### Improvements

#### 'place' key in the dataset:

```sql
SELECT value, COUNT(*) as num
FROM nodes_tags
WHERE key='place'
GROUP BY value
ORDER BY num DESC;
```

`hamlet        | 193            
neighbourhood | 140           
village       | 13            
locality      | 6          
county        | 5          
suburb        | 5         
island        | 4    `

I thought it was odd that the most common place type was hamlet, that the third most common was village, and that neighbourhood uses the British spelling. I looked up where OpenStreetMap was founded and, as I guessed, it was in the UK. In the US we don't use the words hamlet or village to describe a place, at least not as widely as in the UK.

I would recommend describing places with words that are common in the region. What does a hamlet mean versus a village in the US? Regional terms would make place types easier to understand in the US, although I do understand using a single set of place constructs for continuity throughout the entire map dataset. If this were implemented, the regional words would have to be mapped back to the original meaning of hamlet/village.

Benefits:

* Better understanding of the place types in different areas of the world
* More meaningful place types for users in different areas of the world

Anticipated Problems:

* Departing from the standard place types can break continuity and create confusion
* Words common to a region would have to be mapped back to the original place types they refer to
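
The mapping mentioned in the last point could be as simple as a pair of lookup dictionaries. In the sketch below the regional labels on the left are hypothetical examples, not an agreed vocabulary; only the right-hand values (hamlet, village, neighbourhood) come from the dataset.

```python
# Hypothetical US-style labels mapped to place values seen in this dataset.
US_LABEL_TO_OSM_PLACE = {
    "rural community": "hamlet",
    "small town": "village",
    "neighborhood": "neighbourhood",
}

# Reverse lookup, built from the same table so the two stay in sync.
OSM_PLACE_TO_US_LABEL = {v: k for k, v in US_LABEL_TO_OSM_PLACE.items()}

def to_osm_place(label):
    """Translate a regional label to the standard place value; unknown labels pass through."""
    return US_LABEL_TO_OSM_PLACE.get(label, label)

def to_regional_label(place):
    """Translate a standard place value to the regional label; unknown values pass through."""
    return OSM_PLACE_TO_US_LABEL.get(place, place)

print(to_osm_place("neighborhood"))
print(to_regional_label("hamlet"))
```

Keeping the standard value as the stored form and translating only for display would limit the continuity problem described above, since the underlying data would stay unchanged.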

I also noticed that there are apparently 4 islands in the dataset. I am not aware of any islands in the Atlanta area, so I investigated the 4 nodes.

``` sql
SELECT nt.id, n.lat, n.lon
FROM nodes_tags nt
JOIN nodes n
ON nt.id = n.id
WHERE nt.key = 'place' AND nt.value = 'island';
```

`358686776  | 32.9934577 | -85.1860533  -> Hairston Island (can't visibly see the island from Google Maps)                     
358697497  | 32.8515216 | -84.4668655  -> Owens Island                
358705587  | 33.1481772 | -85.0546617  -> Swanson Island (can't visibly see the island from Google Maps)           
3473397106 | 33.5887561 | -84.20201    -> An unnamed island (verified with the Google Maps satellite view; the map view shows a body of water)`

It is strange that both OpenStreetMap and Google Maps incorrectly identify 2 islands (and both show the unnamed island as a body of water). I would assume that one is based on the other's incorrect data and that neither has been audited.


