# OpenStreetMap Data Case Study

**Author: Chaitanya Madala**

**Date: May 15, 2016**

## Map Area

[Ahmedabad, Gujarat, India](https://en.wikipedia.org/wiki/Ahmedabad)

[Dataset](https://mapzen.com/data/metro-extracts/metro/ahmedabad_india/): This dataset, extracted from OpenStreetMap, contains map data for the city of Ahmedabad, India.

## Data Auditing
- As the first step of the audit, let's find out which tag types are present in the dataset and how many of each there are, to get a feel for how much of each kind of data to expect in the map.

- Below are the imports and constants used throughout the project.

In [2]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from collections import defaultdict
import pprint
import re

INPUT_FILENAME = 'ahmedabad_india1.osm'

In [3]:
def count_tags(filename):

    '''Count how many times each tag type
       appears in the given dataset'''
    
    dict_tags = {}
    for event,element in ET.iterparse(filename):
        tag = element.tag
        if tag in dict_tags:
            dict_tags[tag] += 1
        else:
            dict_tags[tag] = 1
            
    return dict_tags

tags = count_tags(INPUT_FILENAME)
print(tags)

{'bounds': 1, 'tag': 98131, 'node': 546085, 'nd': 634041, 'way': 81271, 'member': 2291, 'relation': 511, 'osm': 1}
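The same count can be written more compactly with `collections.Counter`. The version below is only a sketch: `count_tags_streaming` and the tiny `sample_osm` fragment are hypothetical names introduced for illustration, and calling `element.clear()` is an assumption about what helps on larger extracts, not something measured on this dataset.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_streaming(source):
    """Count element tags, clearing each element after it has been counted."""
    counts = Counter()
    # iterparse yields 'end' events by default, so children are seen before parents.
    for _event, element in ET.iterparse(source):
        counts[element.tag] += 1
        element.clear()  # drop attributes and children to reduce memory use
    return counts


# Hypothetical in-memory fragment, used only to show the call shape.
sample_osm = io.BytesIO(
    b'<osm><node id="1"><tag k="name" v="A"/></node>'
    b'<way id="2"><nd ref="1"/></way></osm>'
)
sample_counts = count_tags_streaming(sample_osm)
print(dict(sample_counts))
```

Passing `INPUT_FILENAME` instead of the in-memory fragment should give the same kind of dictionary as `count_tags`.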


- Now let's find out how many different users contributed to this Ahmedabad OpenStreetMap dataset.

In [4]:
def count_users(filename):
    
    '''This function is written to count the number of distinct 
    users who contributed to the Ahmedabad OpenStreetMap data'''
    
    users_set = set()
    for event,element in ET.iterparse(filename):
        tag = element.tag
        if tag == 'node' or tag == 'relation' or tag == 'way':
            users_set.add(element.attrib['user'])
        element.clear()        
    return users_set

users = count_users(INPUT_FILENAME)
print('Number of users contributed: ',len(users))

Number of users contributed:  354
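`count_users` assumes every node, way, and relation carries a `user` attribute. A hedged variant is sketched below: it uses `element.get`, which returns `None` when the attribute is absent, and also tallies how many elements each user touched. The function name `count_edits_per_user` and the sample fragment are hypothetical and exist only for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_edits_per_user(source):
    """Tally node/way/relation elements per user, skipping elements with no 'user'."""
    edits = Counter()
    for _event, element in ET.iterparse(source):
        if element.tag in ('node', 'way', 'relation'):
            user = element.get('user')  # None when the attribute is missing
            if user is not None:
                edits[user] += 1
        element.clear()
    return edits


# Hypothetical fragment: the last node has no 'user' attribute.
sample_users_osm = io.BytesIO(
    b'<osm><node id="1" user="alice"/><node id="2" user="bob"/>'
    b'<way id="3" user="alice"/><node id="4"/></osm>'
)
sample_edits = count_edits_per_user(sample_users_osm)
print('Number of users contributed: ', len(sample_edits))
```

The number of distinct users is then `len(sample_edits)`, and `sample_edits.most_common(5)` would list the most active contributors.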


- Before we process the data and add it to our database, we should check the "k" value of each tag and see if there are any potential problems.

In [5]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>();\'"?%#$@\,\. \t\r\n]')

problem_chars_set = set()
others_set = set()

def key_type(element, keys):
    '''This function is defined to categorize different "k" values'''
    if element.tag == "tag":
        tag_k_value = element.attrib['k']
        match_lower = re.search(lower,tag_k_value)
        match_lower_colon = re.search(lower_colon,tag_k_value)
        match_problemchars  = re.search(problemchars,tag_k_value)
        
        if match_lower :
            keys['lower'] += 1     
        elif match_lower_colon :
            keys['lower_colon'] += 1            
        elif match_problemchars:
            keys['problemchars'] += 1
            problem_chars_set.add(tag_k_value)
        else :
            keys['other'] += 1
            others_set.add(tag_k_value)
            
    return keys

def process_tags(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for event,element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

process_tags(INPUT_FILENAME)

{'lower': 96127, 'lower_colon': 1962, 'other': 35, 'problemchars': 7}

- The output above shows 35 tags in the "other" category and 7 tags with problem characters. Now let's look at the keys in these two categories to identify any that might be worth keeping for database insertion.

In [6]:
print(problem_chars_set)

{'average rate/kg', 'famous for'}
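Both keys above fall into the problemchars bucket because they contain a space or a slash. As a quick check of how the three patterns sort keys, here is a small sketch that reuses the same regular expressions on a handful of keys taken from this dataset. The helper `classify_key` is a hypothetical name introduced here; it mirrors the branching in `key_type` but returns the category instead of updating a counter.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>();\'"?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category a 'k' value falls into, checked in the same order as key_type."""
    if lower.search(key):
        return 'lower'
    if lower_colon.search(key):
        return 'lower_colon'
    if problemchars.search(key):
        return 'problemchars'
    return 'other'


sample_keys = ['name', 'addr:postcode', 'famous for', 'average rate/kg', 'name_2', 'FIXME']
classified = {key: classify_key(key) for key in sample_keys}
for key, category in classified.items():
    print("{0} --> {1}".format(key, category))
```

Keys such as `name_2` and `FIXME` end up in "other" because they contain digits or uppercase letters, which none of the three patterns accounts for.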


In [7]:
print(others_set)

{'name_2', 'plant:output:electricity', 'IR:zone', 'name_1', 'mtb:scale:imba', 'fuel:octane_91', 'naptan:CommonName', 'is_in:iso_3166_2', 'Business', 'source_1', 'source_2', 'mtb:scale:uphill', 'FID_1', 'FIXME', 'currency:INR', 'fuel:octane_80', 'fuel:octane_92', 'AND_a_nosr_p', 'AND_a_c'}


- Of the tags listed above, we can discard all except the "famous for" tag, which carries meaningful data: its value is the dish that a particular place or restaurant is famous for.

- Now let's find out which different "k" values are present in the dataset.

In [8]:
def process_tags_k_val(filename):
    '''This function is written to find out 
    different k values present in dataset'''
    tags_k_values_dict = {}
    final_list = list(others_set) + list(problem_chars_set)
    for event,element in ET.iterparse(filename):
        if element.tag == 'tag' :
            tag_k = element.attrib['k']
            if tag_k not in final_list:
                if tag_k not in tags_k_values_dict:
                    tags_k_values_dict[tag_k] = 1
                else :
                    tags_k_values_dict[tag_k] += 1
                
    return tags_k_values_dict

tags = process_tags_k_val(INPUT_FILENAME)
print("Length of k values dictionary: ",len(tags))

Length of k values dictionary:  203


- As the dictionary has 203 entries, the output would be huge, so I am writing it to an external file called **"tags.txt"**.

In [10]:
with open('tags.txt','w') as tags_file:
    for k in sorted(tags.keys()):
        tags_file.write("{0} --> {1}\n".format(k,tags[k]))   

- Now let's take a look at the postal codes present in the dataset and validate them against the correct format for Ahmedabad postal codes.
- This [website](http://www.mapsofindia.com/pincode/india/gujarat/ahmedabad/) lists all the postal codes of Ahmedabad, which have the format **(38\*\*\*\*)** and are 6 digits long.
- Looking through the "k" values in "tags.txt", we find that postal codes are stored under **"addr:postcode"** and **"postal_code"**.

In [82]:
correct_postal_code_set = set()
incorrect_postal_code_set = set()

def validate_postal_code(code):
    # 6 digits starting with 38; named so it does not shadow the function itself
    postal_code_re = re.compile(r'^38(\d{4})$')
    match = re.search(postal_code_re, code)
    return match
    
def process_postal_codes(filename):
    for event,element in ET.iterparse(filename):
        if element.tag == 'tag':
            tag_k = element.attrib['k']
            if tag_k in ['addr:postcode','postal_code']:
                tag_v = element.attrib['v'].replace(' ','')
                
                match = validate_postal_code(tag_v)
                if match :
                    correct_postal_code_set.add(tag_v)
                else:
                    incorrect_postal_code_set.add(tag_v)
                                        
process_postal_codes(INPUT_FILENAME)                        

In [139]:
print(sorted(correct_postal_code_set))

['380001', '380003', '380004', '380005', '380006', '380007', '380008', '380009', '380013', '380014', '380015', '380021', '380023', '380024', '380026', '380027', '380028', '380043', '380051', '380052', '380054', '380055', '380058', '380059', '380061', '380063', '382006', '382007', '382009', '382110', '382210', '382325', '382345', '382350', '382405', '382418', '382421', '382424', '382440', '382445', '382475', '382480', '382481']


In [140]:
incorrect_postal_code_set

{'3', '33026', '3800013'}
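The three values above fail the `38****` format. One way to handle them at cleaning time is sketched below: strip whitespace, keep codes that match the format, and return `None` for anything else so the caller can skip the tag. The helper name `clean_postal_code` and the choice to drop invalid codes rather than guess a correction (for example, treating `3800013` as a typo) are assumptions made for illustration.

```python
import re

POSTAL_CODE_RE = re.compile(r'^38\d{4}$')  # 6 digits, starting with 38


def clean_postal_code(raw):
    """Return the postal code without whitespace, or None if it is not a valid 38**** code."""
    code = re.sub(r'\s+', '', raw)
    return code if POSTAL_CODE_RE.match(code) else None


# Sample inputs: one valid code with a space, plus the three invalid values found above.
cleaned = {raw: clean_postal_code(raw) for raw in ['380 015', '3', '33026', '3800013']}
for raw, code in cleaned.items():
    print("{0!r} --> {1!r}".format(raw, code))
```

Returning `None` keeps the decision about invalid codes in one place, so the database insertion step only needs to check for it.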