# OpenStreetMap Data Case Study

**Author: Chaitanya Madala**

**Date: May 15, 2016**

## Map Area

[Ahmedabad, Gujarat, India](https://en.wikipedia.org/wiki/Ahmedabad)

[Dataset](https://mapzen.com/data/metro-extracts/metro/ahmedabad_india/): This dataset, extracted from OpenStreetMap, contains information about the city of Ahmedabad, India.

## Data Auditing
- As the first step of the data audit, let's find out which tag types are present in the dataset and how many of each there are, to get a feel for how much of each kind of data the map contains.

- Below are the required imports and constants, which are used throughout the project.

In [23]:
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator when available
import xml.etree.ElementTree as ET
from collections import defaultdict
import pprint
import re

INPUT_FILENAME = 'ahmedabad_india1.osm'

In [5]:
def count_tags(filename):
    '''Count how many times each tag type occurs in the given dataset.'''

    dict_tags = {}
    for event,element in ET.iterparse(filename):
        tag = element.tag
        if tag in dict_tags:
            dict_tags[tag] += 1
        else:
            dict_tags[tag] = 1
            
    return dict_tags

tags = count_tags(INPUT_FILENAME)
print(tags)

{'bounds': 1, 'tag': 98131, 'node': 546085, 'nd': 634041, 'way': 81271, 'member': 2291, 'relation': 511, 'osm': 1}
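The same count can be expressed more compactly with `collections.Counter`. The following is a minimal sketch, not the method used in this project: `count_tags_counter` and the tiny `SAMPLE_OSM` document are hypothetical names introduced only so the example runs without the real extract. It also clears each element after counting it, which keeps memory use low on large files.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" user="alice"><tag k="name" v="A"/></node>
  <node id="2" user="bob"/>
  <way id="3" user="alice"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

def count_tags_counter(source):
    """Count element tags from a filename or a file-like object."""
    counts = Counter()
    # By default iterparse yields an element when its closing tag is read.
    for _, element in ET.iterparse(source):
        counts[element.tag] += 1
        element.clear()  # drop attributes and children once counted
    return counts

sample_counts = count_tags_counter(io.BytesIO(SAMPLE_OSM))
print(dict(sample_counts))
```

Because `ET.iterparse` accepts either a filename or a file object, the same function could be called with `INPUT_FILENAME` directly.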


- Now let's find out how many different users contributed to this Ahmedabad OpenStreetMap dataset.

In [6]:
def count_users(filename):
    
    users_set = set()
    for event,element in ET.iterparse(filename):
        tag = element.tag
        if tag == 'node' or tag == 'relation' or tag == 'way':
            users_set.add(element.attrib['user'])
        element.clear()
    return users_set

users = count_users(INPUT_FILENAME)
print('Number of users contributed: ',len(users))

Number of users contributed:  354
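Indexing `element.attrib['user']` raises a `KeyError` if an element has no `user` attribute. That did not happen with this extract, but a more defensive variant could look like the sketch below. The function name `count_users_safe` and the sample data are assumptions made for illustration, and the second node deliberately omits `user`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the second node has no 'user' attribute on purpose.
SAMPLE_OSM = b"""<osm>
  <node id="1" user="alice"/>
  <node id="2"/>
  <way id="3" user="bob"/>
  <relation id="4" user="alice"/>
</osm>"""

def count_users_safe(source):
    """Collect distinct user names, skipping elements without a 'user' attribute."""
    users = set()
    for _, element in ET.iterparse(source):
        if element.tag in ('node', 'way', 'relation'):
            user = element.attrib.get('user')  # None when the attribute is absent
            if user is not None:
                users.add(user)
        element.clear()
    return users

sample_users = count_users_safe(io.BytesIO(SAMPLE_OSM))
print(sorted(sample_users))
```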


- Before we process the data and add it to our database, we should check the "k" value of each tag for potential problems.

In [7]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>();\'"?%#$@\,\. \t\r\n]')

problem_chars_set = set()
others_set = set()

def key_type(element, keys):
    if element.tag == "tag":
        tag_k_value = element.attrib['k']
        match_lower = re.search(lower,tag_k_value)
        match_lower_colon = re.search(lower_colon,tag_k_value)
        match_problemchars  = re.search(problemchars,tag_k_value)
        
        if match_lower :
            keys['lower'] += 1     
        elif match_lower_colon :
            keys['lower_colon'] += 1            
        elif match_problemchars:
            keys['problemchars'] += 1
            problem_chars_set.add(tag_k_value)
        else :
            keys['other'] += 1
            others_set.add(tag_k_value)
            
    return keys

def process_tags(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for event,element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

process_tags(INPUT_FILENAME)

{'lower': 96127, 'lower_colon': 1962, 'other': 35, 'problemchars': 7}
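To make the four categories concrete, the sketch below applies the same three regular expressions, in the same order, to a handful of keys. The helper `classify_key` is a hypothetical name introduced here; the sample keys are taken from typical OSM tags and from the keys found in this dataset.

```python
import re

# Same patterns as in the audit code.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>();\'"?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the audit category for a single tag key (checked in audit order)."""
    if lower.search(key):
        return 'lower'
    if lower_colon.search(key):
        return 'lower_colon'
    if problemchars.search(key):
        return 'problemchars'
    return 'other'

sample_keys = ['highway', 'addr:street', 'famous for', 'FIXME', 'mtb:scale:uphill']
classified = {key: classify_key(key) for key in sample_keys}
for key, category in classified.items():
    print("{0} --> {1}".format(key, category))
```

Keys with uppercase letters, digits, or more than one colon match none of the three patterns, which is why they end up in the "other" category.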

- The counts returned by `process_tags` show 35 tags in the "other" category and 7 tags whose keys contain problematic characters. Let's look at the distinct keys in both categories, since some of them might be useful later.

In [24]:
print(problem_chars_set)

{'average rate/kg', 'famous for'}

In [26]:
print(others_set)

{'source_2', 'AND_a_c', 'fuel:octane_91', 'source_1', 'fuel:octane_92', 'fuel:octane_80', 'naptan:CommonName', 'mtb:scale:uphill', 'plant:output:electricity', 'FIXME', 'mtb:scale:imba', 'Business', 'is_in:iso_3166_2', 'name_1', 'FID_1', 'currency:INR', 'AND_a_nosr_p', 'IR:zone', 'name_2'}


- Of the keys in these two sets, we can discard all except "famous for", which carries meaningful data: its value is the famous dish of that particular place or restaurant.

- Now let's find out which "k" values are present in the dataset.

In [11]:
def process_tags_k_val(filename):
    tags_k_values_dict = {}
    final_list = list(others_set) + list(problem_chars_set)
    for event,element in ET.iterparse(filename):
        if element.tag == 'tag' :
            tag_k = element.attrib['k']
            if tag_k not in final_list:
                if tag_k not in tags_k_values_dict:
                    tags_k_values_dict[tag_k] = 1
                else :
                    tags_k_values_dict[tag_k] += 1
                
    return tags_k_values_dict

tags = process_tags_k_val(INPUT_FILENAME)

In [21]:
# The with-statement closes the file even if a write fails
with open('tags.txt', 'w') as tags_file:
    for k in sorted(tags.keys()):
        tags_file.write("{0} --> {1}\n".format(k, tags[k]))
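The file above lists keys alphabetically. To see which keys dominate the dataset, the same dictionary could also be ranked by frequency. This is a minimal sketch under stated assumptions: `sample_tags` holds made-up counts that stand in for the real `tags` dictionary, and `top_keys` is a name introduced only for this example.

```python
from collections import Counter

# Hypothetical counts standing in for the dictionary returned by process_tags_k_val.
sample_tags = {'highway': 120, 'name': 45, 'building': 300, 'amenity': 30}

# most_common(n) returns the n (key, count) pairs with the highest counts.
top_keys = Counter(sample_tags).most_common(3)
for key, count in top_keys:
    print("{0} --> {1}".format(key, count))
```

Passing the real `tags` dictionary to `Counter` in place of `sample_tags` would rank the actual keys the same way.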